Defining Shelves from configuration¶
Shelves are defined as dictionaries containing keys and ingredient. All the examples below use YAML.
Defining Shelves¶
Shelves are defined in configuration as dictionaries with keys and values that are Ingredient configuration definitions. A simple example looks like this.
total_population:
kind: Metric
field: pop2000
state:
kind: Dimension
field: state
See examples for more Shelf examples.
Defining Ingredients¶
Ingredients are defined using fields (which may contain conditions). Those conditions may reference more fields in turn and so forth.
Metric¶
Metrics will always apply a default aggregation of ‘sum’ to any fields used.
kind: Metric
field: {field}
divide_by: {field} (optional)
divide_by
is an optional denominator that field
will be divided by safely.
Dimension¶
Metrics will always apply a default aggregation of ‘sum’ to their field.
kind: Dimension
field: {field}
{role}_field: {field} (optional)
buckets: A list of labeled conditions (optional)
buckets_default_label: string (optional)
quickselects: A list of labeled conditions (optional)
Adding id and other roles to Dimension¶
Dimensions can be defined with extra fields. The prefix before _field
is the field’s role. The role will be suffixed to each value in the
recipe rows. Let’s look at an example.
hospital:
field: hospital_name
id_field: hospital_id
latitude_field: hospital_lat
longitude_field: hospital_lng
Each result row will include
hospital
hospital_id
The field defined asid_field
hospital_latitude
The field defined aslatitude_field
hospital_longitude
The field defined aslongitude_field
Defining buckets¶
Buckets let you group continuous values (like salaries or ages). Here’s an example:
groups:
kind: Dimension
field: age
buckets:
- label: 'northeasterners'
field: state
in: ['Vermont', 'New Hampshire']
- label: 'babies'
lt: 2
- label: 'children'
lt: 13
- label: 'teens'
lt: 20
buckets_default_label: 'oldsters'
The conditions are evaluated in order. buckets_default_label is used for any values that didn’t match any condition.
For convenience, conditions defined in buckets will use the field from the Dimension
unless a different field is defined in the condition. In the example above, the first
bucket uses field: state
explicitly while all the other conditions use field: age
from the Dimension.
If you use order_by a bucket dimension, the order will be the order in which the buckets were defined.
Adding quickselects to a Dimension¶
quickselects are a way of associating conditions with a dimension.
region:
kind: Dimension
field: sales_region
total_sales:
kind: Metric
field: sales_dollars
date:
kind: Dimension
field: sales_date
quickselects:
- label: 'Last 90 days'
between:
- 90 days ago
- tomorrow
- label: 'Last 180 days'
between:
- 180 days ago
- tomorrow
These conditions can then be accessed through Ingredient.build_filter
.
The AutomaticFilters
extension is an easy way to use this.
recipe = Recipe(session=oven.Session(), extension_classes=[AutomaticFilters]). \
.dimensions('region') \
.metrics('total_sales') \
.automatic_filters({
'date__quickselect': 'Last 90 days'
})
Defining Fields¶
Fields can be defined with a short string syntax or a dictionary syntax. The string syntax always is normalized into the dictionary syntax.
field:
value: '{column reference}'
aggregation: '{aggregation (optional)}'
operators: {list of operators}
as: {optional type to coerce into}
default: {default value, optional}
or
field: '{string field definition}'
This may include field references that look like
@{ingredient name from the shelf}.
Defining Fields with Dicts¶
Dictionaries provide access to all options when defining a field.
Key |
Required |
Description |
---|---|---|
value |
required |
string What column to use. |
aggregation |
optional |
string (default is ‘sum’ for Metric and ‘none’ for Dimension) What aggregation to use, if any. Possible aggregations are:
|
condition |
optional |
A Condition will limit what rows of data are aggregated for a field. |
operators |
optional |
A list of Operators are fields combined with a math operator to the base field. |
default |
optional |
An integer, string, float, or boolean value (optional) A value to use if the column is NULL. |
Warning
The following two fields are for internal use.
Key |
Required |
Description |
---|---|---|
ref |
optional |
string Replace this field with the field defined in the specified key in the shelf. |
_use_raw_value |
optional |
boolean Don’t evaluate value as a column, treat it as a constant in the SQL expression. |
Defining Fields with Strings¶
Fields can be defined using strings. When using strings, words are treated as column references. If the words are prefixed with an ‘@’ (like @sales), the field of the ingredient named sales in the shelf will be injected.
Aggregations can be called like functions to apply that aggregation to a column.
Field string |
Description |
---|---|
revenue - expenses |
The sum of column revenue minus the sum of column expenses. field: revenue - expenses
# is the same as
field:
value: revenue
aggregation: sum # this may be omitted because 'sum'
# is the default aggregation for Metrics
operators:
- operator: '-'
field:
value: expenses
aggregation: sum
|
@sales / @student_count |
Find the field definition of the field named ‘sales’ in the shelf. Divide it by the field definition of the field named ‘student_count’. |
count_distinct(student_id) |
Count the distinct values of column student_id. field: count_distinct(student_id)
# is the same as
field:
value: student_id
aggregation: count_distinct
|
Defining Field Operators¶
Operators lets you perform math with fields.
Key |
Required |
Description |
---|---|---|
operator |
required |
string One of ‘+’, ‘-’, ‘*’, ‘/’ |
field |
required |
A field definition (either a string or a dictionary) |
For instance, operators can be used like this:
# profit - taxes - interest
field:
value: profit
operators:
- operator: '-'
field: taxes
- operator: '-'
field: interest
Defining Conditions¶
Conditions can include a field and operator or a list of conditions and-ed or or-ed together.
field: {field definition}
label: string (an optional string label)
{operator}: {value} or {list of values}
or
or: # a list of conditions
- {condition1}
- {condition2}
...
- {conditionN}
or
and: # a list of conditions
- {condition1}
- {condition2}
...
- {conditionN}
or
a condition reference @{ingredient name from the shelf}.
Conditions consist of a field and exactly one operator.
Condition |
Value is… |
Description |
---|---|---|
gt |
A string, int, or float. |
Find values that are greater than the value For example: # Sales dollars are greater than 100.
condition:
field: sales_dollars
gt: 100
|
gte (or ge) |
A string, int, or float. |
Find values that are greater than or equal to the value |
lt |
A string, int, or float. |
Find values that are less than the value |
lte (or le) |
A string, int, or float. |
Find values that are less than or equal to the value |
eq |
A string, int, or float. |
Find values that are equal to the value |
ne |
A string, int, or float. |
Find values that are not equal to the value |
like |
A string |
Find values that match the SQL LIKE expression For example: # States that start with the capital letter C
condition:
field: state
like: 'C%'
|
ilike |
A string |
Find values that match the SQL ILIKE (case insensitive like) expression. |
between |
A list of two values |
Find values that are between the two values. |
in |
A list of values |
Find values that are in the list of values |
notin |
A list of values |
Find values that are not in the list of values |
ands and ors in conditions¶
Conditions can and
and or
a list of conditions together.
Here’s an example:
# Find states that start with 'C' and end with 'a'
# Note the conditions in the list don't have to
# use the same field.
condition:
and:
- field: state
like: 'C%'
- field: state
like: '%a'
Date conditions¶
If the field
is a date or datetime, absolute and relative dates
can be defined in values using string syntax. Recipe uses the
Dateparser library.
Here’s an example.
# Find sales that occured within the last 90 days.
condition:
field: sales_date
between:
- '90 days ago'
- 'tomorrow'
Labeled conditions¶
Conditions may optionally be labeled by adding a label property.
quickselects are a feature of Dimension that are defined with a list of labeled conditions.
Examples¶
A simple shelf with conditions¶
This shelf is basic.
teens:
kind: Metric
field:
value: pop2000
condition:
field: age
between: [13,19]
state:
kind: Dimension
field: state
Using this shelf in a recipe.
recipe = Recipe(shelf=shelf, session=oven.Session())\
.dimensions('state')\
.metrics('teens')
print(recipe.to_sql())
print(recipe.dataset.csv)
The results look like:
SELECT census.state AS state,
sum(CASE
WHEN (census.age BETWEEN 13 AND 19) THEN census.pop2000
END) AS teens
FROM census
GROUP BY census.state
state,teens,state_id
Alabama,451765,Alabama
Alaska,71655,Alaska
Arizona,516270,Arizona
Arkansas,276069,Arkansas
...
Metrics referencing other metric definitions¶
The following shelf has a Metric pct_teens
that divides one previously defined Metric
teens
by another total_pop
.
teens:
kind: Metric
field:
value: pop2000
condition:
field: age
between: [13,19]
total_pop:
kind: Metric
field: pop2000
pct_teens:
field: '@teens'
divide_by: '@total_pop'
state:
kind: Dimension
field: state
Using this shelf in a recipe.
recipe = Recipe(shelf=shelf, session=oven.Session())\
.dimensions('state')\
.metrics('pct_teens')
print(recipe.to_sql())
print(recipe.dataset.csv)
Here’s the results. Note that recipe performs safe division.
SELECT census.state AS state,
CAST(sum(CASE
WHEN (census.age BETWEEN 13 AND 19) THEN census.pop2000
END) AS FLOAT) / (coalesce(CAST(sum(census.pop2000) AS FLOAT), 0.0) + 1e-09) AS pct_teens
FROM census
GROUP BY census.state
state,pct_teens,state_id
Alabama,0.10178190714599038,Alabama
Alaska,0.11773975168751254,Alaska
Arizona,0.10036487658951877,Arizona
Arkansas,0.10330245760980436,Arkansas
...
Dimensions containing buckets¶
Dimensions may be created by bucketing a field.
total_pop:
kind: Metric
field: pop2000
age_buckets:
kind: Dimension
field: age
buckets:
- label: 'babies'
lt: 2
- label: 'children'
lt: 13
- label: 'teens'
lt: 20
buckets_default_label: 'oldsters'
mixed_buckets:
kind: Dimension
field: age
buckets:
- label: 'northeasterners'
in: ['Vermont', 'New Hampshire']
field: state
- label: 'babies'
lt: 2
- label: 'children'
lt: 13
- label: 'teens'
lt: 20
buckets_default_label: 'oldsters'
Using this shelf in a recipe.
recipe = Recipe(shelf=shelf, session=oven.Session())\
.dimensions('mixed_buckets')\
.metrics('total_pop')\
.order_by('mixed_buckets')
print(recipe.to_sql())
print(recipe.dataset.csv)
Here’s the results. Note this recipe orders by mixed_buckets
. The buckets are
ordered in the order they are defined.
SELECT CASE
WHEN (census.state IN ('Vermont',
'New Hampshire')) THEN 'northeasterners'
WHEN (census.age < 2) THEN 'babies'
WHEN (census.age < 13) THEN 'children'
WHEN (census.age < 20) THEN 'teens'
ELSE 'oldsters'
END AS mixed_buckets,
sum(census.pop2000) AS total_pop
FROM census
GROUP BY CASE
WHEN (census.state IN ('Vermont',
'New Hampshire')) THEN 'northeasterners'
WHEN (census.age < 2) THEN 'babies'
WHEN (census.age < 13) THEN 'children'
WHEN (census.age < 20) THEN 'teens'
ELSE 'oldsters'
END
ORDER BY CASE
WHEN (census.state IN ('Vermont',
'New Hampshire')) THEN 0
WHEN (census.age < 2) THEN 1
WHEN (census.age < 13) THEN 2
WHEN (census.age < 20) THEN 3
ELSE 9999
END
mixed_buckets,total_pop,mixed_buckets_id
northeasterners,1848787,northeasterners
babies,7613225,babies
children,44267889,children
teens,28041679,teens
oldsters,199155741,oldsters