Defining Shelves from configuration¶

Shelves are defined as dictionaries containing keys and ingredient. All the examples below use YAML.

Defining Shelves¶

Shelves are defined in configuration as dictionaries with keys and values that are Ingredient configuration definitions. A simple example looks like this.

total_population:
  kind: Metric
  field: pop2000
state:
  kind: Dimension
  field: state

See examples for more Shelf examples.

Defining Ingredients¶

Ingredients are defined using fields (which may contain conditions). Those conditions may reference more fields in turn and so forth.

Metric¶

Metrics will always apply a default aggregation of ‘sum’ to any fields used.

kind: Metric
field: {field}
divide_by: {field} (optional)

divide_by is an optional denominator that field will be divided by safely.

Dimension¶

Metrics will always apply a default aggregation of ‘sum’ to their field.

kind: Dimension
field: {field}
{role}_field: {field} (optional)
buckets: A list of labeled conditions (optional)
buckets_default_label: string (optional)
quickselects: A list of labeled conditions (optional)

Adding id and other roles to Dimension¶

Dimensions can be defined with extra fields. The prefix before _field is the field’s role. The role will be suffixed to each value in the recipe rows. Let’s look at an example.

hospital:
  field: hospital_name
  id_field: hospital_id
  latitude_field: hospital_lat
  longitude_field: hospital_lng

Each result row will include

hospital
hospital_id The field defined as id_field
hospital_latitude The field defined as latitude_field
hospital_longitude The field defined as longitude_field

Defining buckets¶

Buckets let you group continuous values (like salaries or ages). Here’s an example:

groups:
    kind: Dimension
    field: age
    buckets:
    - label: 'northeasterners'
      field: state
      in: ['Vermont', 'New Hampshire']
    - label: 'babies'
      lt: 2
    - label: 'children'
      lt: 13
    - label: 'teens'
      lt: 20
    buckets_default_label: 'oldsters'

The conditions are evaluated in order. buckets_default_label is used for any values that didn’t match any condition.

For convenience, conditions defined in buckets will use the field from the Dimension unless a different field is defined in the condition. In the example above, the first bucket uses field: state explicitly while all the other conditions use field: age from the Dimension.

If you use order_by a bucket dimension, the order will be the order in which the buckets were defined.

Adding quickselects to a Dimension¶

quickselects are a way of associating conditions with a dimension.

region:
    kind: Dimension
    field: sales_region
total_sales:
    kind: Metric
    field: sales_dollars
date:
    kind: Dimension
    field: sales_date
    quickselects:
    - label: 'Last 90 days'
      between:
      - 90 days ago
      - tomorrow
    - label: 'Last 180 days'
      between:
      - 180 days ago
      - tomorrow

These conditions can then be accessed through Ingredient.build_filter. The AutomaticFilters extension is an easy way to use this.

recipe = Recipe(session=oven.Session(), extension_classes=[AutomaticFilters]). \
            .dimensions('region') \
            .metrics('total_sales') \
            .automatic_filters({
              'date__quickselect': 'Last 90 days'
            })

Defining Fields¶

Fields can be defined with a short string syntax or a dictionary syntax. The string syntax always is normalized into the dictionary syntax.

field:
    value: '{column reference}'
    aggregation: '{aggregation (optional)}'
    operators: {list of operators}
    as: {optional type to coerce into}
    default: {default value, optional}

or

field: '{string field definition}'
This may include field references that look like
@{ingredient name from the shelf}.

Defining Fields with Dicts¶

Dictionaries provide access to all options when defining a field.

dictionary field options¶
Key	Required	Description
value	required	string What column to use.
aggregation	optional	string (default is ‘sum’ for Metric and ‘none’ for Dimension) What aggregation to use, if any. Possible aggregations are: ‘sum’ ‘min’ ‘max’ ‘avg’ ‘count’ ‘count_distinct’ ‘month’ (round to the nearest month for dates) ‘week’ (round to the nearest week for dates) ‘year’ (round to the nearest year for dates) ‘quarter’ (round to the nearest quarter for dates) ‘age’ (calculate age based on a date and the current date) ‘none’ (perform no aggregation) ‘median’ (calculate the median value, note: this aggregation is not available on all databases). ‘percentile[1,5,10,25,50,75,90,95,99]’ (calculate the nth percentile value where higher values correspond to higher percentiles, note: this aggregation is not available on all databases).
condition	optional	A `condition` Condition will limit what rows of data are aggregated for a field.
operators	optional	A list of `operator` Operators are fields combined with a math operator to the base field.
default	optional	An integer, string, float, or boolean value (optional) A value to use if the column is NULL.

Warning

The following two fields are for internal use.

internal dictionary field options¶
Key	Required	Description
ref	optional	string Replace this field with the field defined in the specified key in the shelf.
_use_raw_value	optional	boolean Don’t evaluate value as a column, treat it as a constant in the SQL expression.

Defining Fields with Strings¶

Fields can be defined using strings. When using strings, words are treated as column references. If the words are prefixed with an ‘@’ (like @sales), the field of the ingredient named sales in the shelf will be injected.

Aggregations can be called like functions to apply that aggregation to a column.

string field examples¶
Field string	Description
revenue - expenses	The sum of column revenue minus the sum of column expenses. field: revenue - expenses # is the same as field: value: revenue aggregation: sum # this may be omitted because 'sum' # is the default aggregation for Metrics operators: - operator: '-' field: value: expenses aggregation: sum
@sales / @student_count	Find the field definition of the field named ‘sales’ in the shelf. Divide it by the field definition of the field named ‘student_count’.
count_distinct(student_id)	Count the distinct values of column student_id. field: count_distinct(student_id) # is the same as field: value: student_id aggregation: count_distinct

Defining Field Operators¶

Operators lets you perform math with fields.

operator options¶
Key	Required	Description
operator	required	string One of ‘+’, ‘-’, ‘*’, ‘/’
field	required	A field definition (either a string or a dictionary)

For instance, operators can be used like this:

# profit - taxes - interest
field:
  value: profit
  operators:
  - operator: '-'
    field: taxes
  - operator: '-'
    field: interest

Defining Conditions¶

Conditions can include a field and operator or a list of conditions and-ed or or-ed together.

field: {field definition}
label: string (an optional string label)
{operator}: {value} or {list of values}

or

or:     # a list of conditions
- {condition1}
- {condition2}
...
- {conditionN}

or

and:    # a list of conditions
- {condition1}
- {condition2}
...
- {conditionN}

or

a condition reference @{ingredient name from the shelf}.

Conditions consist of a field and exactly one operator.

condition options¶
Condition	Value is…	Description
gt	A string, int, or float.	Find values that are greater than the value For example: # Sales dollars are greater than 100. condition: field: sales_dollars gt: 100
gte (or ge)	A string, int, or float.	Find values that are greater than or equal to the value
lt	A string, int, or float.	Find values that are less than the value
lte (or le)	A string, int, or float.	Find values that are less than or equal to the value
eq	A string, int, or float.	Find values that are equal to the value
ne	A string, int, or float.	Find values that are not equal to the value
like	A string	Find values that match the SQL LIKE expression For example: # States that start with the capital letter C condition: field: state like: 'C%'
ilike	A string	Find values that match the SQL ILIKE (case insensitive like) expression.
between	A list of two values	Find values that are between the two values.
in	A list of values	Find values that are in the list of values
notin	A list of values	Find values that are not in the list of values

ands and ors in conditions¶

Conditions can and and or a list of conditions together.

Here’s an example:

# Find states that start with 'C' and end with 'a'
# Note the conditions in the list don't have to
# use the same field.
condition:
  and:
  - field: state
    like: 'C%'
  - field: state
    like: '%a'

Date conditions¶

If the field is a date or datetime, absolute and relative dates can be defined in values using string syntax. Recipe uses the Dateparser library.

Here’s an example.

# Find sales that occured within the last 90 days.
condition:
  field: sales_date
  between:
  - '90 days ago'
  - 'tomorrow'

Labeled conditions¶

Conditions may optionally be labeled by adding a label property.

quickselects are a feature of Dimension that are defined with a list of labeled conditions.

Examples¶

A simple shelf with conditions¶

This shelf is basic.

teens:
    kind: Metric
    field:
        value: pop2000
        condition:
            field: age
            between: [13,19]
state:
    kind: Dimension
    field: state

Using this shelf in a recipe.

recipe = Recipe(shelf=shelf, session=oven.Session())\
    .dimensions('state')\
    .metrics('teens')
print(recipe.to_sql())
print(recipe.dataset.csv)

The results look like:

SELECT census.state AS state,
      sum(CASE
              WHEN (census.age BETWEEN 13 AND 19) THEN census.pop2000
          END) AS teens
FROM census
GROUP BY census.state

state,teens,state_id
Alabama,451765,Alabama
Alaska,71655,Alaska
Arizona,516270,Arizona
Arkansas,276069,Arkansas
...

Metrics referencing other metric definitions¶

The following shelf has a Metric pct_teens that divides one previously defined Metric teens by another total_pop.

teens:
    kind: Metric
    field:
        value: pop2000
        condition:
            field: age
            between: [13,19]
total_pop:
    kind: Metric
    field: pop2000
pct_teens:
    field: '@teens'
    divide_by: '@total_pop'
state:
    kind: Dimension
    field: state

Using this shelf in a recipe.

recipe = Recipe(shelf=shelf, session=oven.Session())\
    .dimensions('state')\
    .metrics('pct_teens')
print(recipe.to_sql())
print(recipe.dataset.csv)

Here’s the results. Note that recipe performs safe division.

SELECT census.state AS state,
      CAST(sum(CASE
                    WHEN (census.age BETWEEN 13 AND 19) THEN census.pop2000
                END) AS FLOAT) / (coalesce(CAST(sum(census.pop2000) AS FLOAT), 0.0) + 1e-09) AS pct_teens
FROM census
GROUP BY census.state

state,pct_teens,state_id
Alabama,0.10178190714599038,Alabama
Alaska,0.11773975168751254,Alaska
Arizona,0.10036487658951877,Arizona
Arkansas,0.10330245760980436,Arkansas
...

Dimensions containing buckets¶

Dimensions may be created by bucketing a field.

total_pop:
    kind: Metric
    field: pop2000
age_buckets:
    kind: Dimension
    field: age
    buckets:
    - label: 'babies'
      lt: 2
    - label: 'children'
      lt: 13
    - label: 'teens'
      lt: 20
    buckets_default_label: 'oldsters'
mixed_buckets:
    kind: Dimension
    field: age
    buckets:
    - label: 'northeasterners'
      in: ['Vermont', 'New Hampshire']
      field: state
    - label: 'babies'
      lt: 2
    - label: 'children'
      lt: 13
    - label: 'teens'
      lt: 20
    buckets_default_label: 'oldsters'

Using this shelf in a recipe.

recipe = Recipe(shelf=shelf, session=oven.Session())\
    .dimensions('mixed_buckets')\
    .metrics('total_pop')\
    .order_by('mixed_buckets')
print(recipe.to_sql())
print(recipe.dataset.csv)

Here’s the results. Note this recipe orders by mixed_buckets. The buckets are ordered in the order they are defined.

SELECT CASE
          WHEN (census.state IN ('Vermont',
                                  'New Hampshire')) THEN 'northeasterners'
          WHEN (census.age < 2) THEN 'babies'
          WHEN (census.age < 13) THEN 'children'
          WHEN (census.age < 20) THEN 'teens'
          ELSE 'oldsters'
      END AS mixed_buckets,
      sum(census.pop2000) AS total_pop
FROM census
GROUP BY CASE
            WHEN (census.state IN ('Vermont',
                                    'New Hampshire')) THEN 'northeasterners'
            WHEN (census.age < 2) THEN 'babies'
            WHEN (census.age < 13) THEN 'children'
            WHEN (census.age < 20) THEN 'teens'
            ELSE 'oldsters'
        END
ORDER BY CASE
            WHEN (census.state IN ('Vermont',
                                    'New Hampshire')) THEN 0
            WHEN (census.age < 2) THEN 1
            WHEN (census.age < 13) THEN 2
            WHEN (census.age < 20) THEN 3
            ELSE 9999
        END

mixed_buckets,total_pop,mixed_buckets_id
northeasterners,1848787,northeasterners
babies,7613225,babies
children,44267889,children
teens,28041679,teens
oldsters,199155741,oldsters

Defining Shelves from configuration¶