A Primer for the Qube Data Warehouse Schema

The dwh schema is a fairly textbook star schema, with things like jobs and worker usage in fact tables, and the various things that you might use in a SQL WHERE clause in dimension tables.

The job_fact table, a typical "fact" table.

For example, to get reports about various jobs over time, you'll be querying the dwh.job_fact table:

job_time* are stored as timestamp types

Icon

With Qube 7.x which uses PostgreSQL as the backbone, job_time* are stored as timestamp types, so, for example, you can fetch time data for a specific job as in:

pfx=> SELECT job_id, job_timesubmit, job_timestart,job_timecomplete FROM dwh.job_fact WHERE job_id=57 ;

 job_id |     job_timesubmit     |     job_timestart      |    job_timecomplete

--------+------------------------+------------------------+------------------------

     57 | 2020-08-07 23:49:57-10 | 2020-08-07 23:49:57-10 | 2020-08-07 23:50:03-10

(1 row)

"*_sk" columns can be used to do INNER JOINs to a similarly named dimension table

Any column that is named with an _sk suffix is a Synthetic Key that points to a corresponding dimension table, named with the part of the column before the _sk; the dimension table will have a _dim suffix in the name. This way, it's easy to write the JOIN's, the column name is a clue to the dimension table, which will have a column of the same name. Almost every dimension table will consist of a *_sk PRIMARY KEY and a name column.

A typical dimension table, the "user_dim" table

For example, the user_sk column can be used to do a SQL INNER JOIN to the user_dim table.

Get a count of all jobs for a particular user:

The time dimension table

The dwh.time_dim table is provided so that you don't have to perform date/time operations on every row in a fact table (since they can run into the 100's of millions of rows), instead you do a SQL INNER JOIN to it and use the values in the time_dim table in your WHERE clause. The time_sk column in every fact table has an identical value in the time_dim table which has a single row with a primary key time_sk. The time_sk value is actually the unix epoch time in seconds:

The "job status" dimension table

The dwh.jobstatus_dim table is one of the few exceptions to the normal dimension table structure; it provides a mapping between the integer and human-readable status values.

The job_fact table, a typical "fact" table.

"*_sk" columns can be used to do INNER JOINs to a similarly named dimension table

A typical dimension table, the "user_dim" table

Get a count of all jobs for a particular user:

The time dimension table

The "job status" dimension table

Get a count of all jobs for a particular user for August, 2020:

Get a count of all jobs for each user for all of 2020:

Get a count of all jobs for each user for all of 2020, broken down by month and the job's final status:

Get the sum total of cpu_seconds used for each user for the last 7 days, broken down by user, date, and the job's final status:

Qube Qube! Documentation Developer's Guide A Primer for the Qube Data Warehouse Schema A Primer for the Qube Data Warehouse Schema

The job_fact table, a typical "fact" table.

"*_sk" columns can be used to do INNER JOINs to a similarly named dimension table

A typical dimension table, the "user_dim" table

Get a count of all jobs for a particular user:

The time dimension table

The "job status" dimension table

Get a count of all jobs for a particular user for August, 2020:

Get a count of all jobs for each user for all of 2020:

Get a count of all jobs for each user for all of 2020, broken down by month and the job's final status:

Get the sum total of cpu_seconds used for each user for the last 7 days, broken down by user, date, and the job's final status:

Qube
Qube! Documentation
Developer's Guide
A Primer for the Qube Data Warehouse Schema

A Primer for the Qube Data Warehouse Schema