The pfx_dw
schema is a fairly textbook "star schema", with things like jobs and worker usage in fact tables, and the various things that you might use in a SQL WHERE
clause in dimension tables.
The job dimension table, a typical "fact" table.
For example, to get reports about various jobs over time, you'll be querying the pfx_dw.job_fact table:
"*_sk" columns can be used to do INNER JOINs to a similarly named dimension table
Any column that is named with an _sk
suffix is a Synthetic Key that points to a corresponding dimension table, named with the part of the column before the _sk
; the dimension table will have a _dim
suffix in the name. This way, it's easy to write the JOIN
's, the column name is a clue to the dimension table, which will have a column of the same name. Almost every dimension table will consist of a *_sk
PRIMARY KEY
and a name column.
A typical dimension table, the "user_dim" table
For example, the user_sk column can be used to do a SQL INNER JOIN
to the user_dim table.
Get a count of all jobs for a particular user:
The time dimension table
The pfx_dw.time_dim table is provided so that you don't have to perform date/time operations on every row in a fact table (since they can run into the 100's of millions of rows), instead you do a SQL INNER JOIN
to it and use the values in the time_dim table in your WHERE
clause. The time_sk column in every fact table has an identical value in the time_dim
table which has a single row with a primary key time_sk. The time_sk
value is actually the unix epoch time in seconds:
The "job status" dimension table
The pfx_dw.jobstatus_dim table is one of the few exceptions to the normal dimension table structure; it provides a mapping between the integer and human-readable status values.
Get a count of all jobs for a particular user for January, 2014:
Get a count of all jobs for each user for all of 2013:
Get a count of all jobs for each user for all of 2013, broken down by month and the job's final status:
Get the sum total of cpu_seconds used for each user for the last 7 days, broken down by user, date, and the job's final status:
SELECT user . name , time . date , status.status_char , SUM (fact.cpu_seconds) as "cpu_time" FROM job_fact AS fact INNER JOIN user_dim AS user ON fact.user_sk= user .user_sk INNER JOIN time_dim AS time ON fact.time_sk= time .time_sk INNER JOIN jobstatus_dim AS status ON fact.jobstatus_sk=status.jobstatus_sk WHERE DATEDIFF(CURDATE(), time .date_time) < 7 GROUP BY user . name , time . date , status.status_int ORDER BY time . date , cpu_time DESC , status.status_char ; + --------+------------+-------------+----------+ | name | date | status_char | cpu_time | + --------+------------+-------------+----------+ << snipped >> | jburk | 2014-07-14 | complete | 351036 | | jburk | 2014-07-14 | killed | 60029 | | jburk | 2014-07-14 | failed | 139 | | coxj | 2014-07-14 | killed | 98 | | garza | 2014-07-14 | killed | 0 | | jburk | 2014-07-15 | complete | 28910 | | fubar | 2014-07-15 | complete | 18610 | | foobar | 2014-07-15 | complete | 18561 | | jburk | 2014-07-15 | killed | 16967 | | jburk | 2014-07-15 | failed | 27 | | jburk | 2014-07-16 | complete | 46797 | | jburk | 2014-07-16 | killed | 17136 | | jburk | 2014-07-16 | failed | 2 | << snipped >> + --------+------------+-------------+----------+ |