Defining Shelves from configuration

Shelves are defined as dictionaries containing keys and ingredient. All the examples below use YAML.

Defining Shelves

Shelves are defined in configuration as dictionaries with keys and values that are Ingredient configuration definitions. A simple example looks like this.

total_population:
  kind: Metric
  field: pop2000
state:
  kind: Dimension
  field: state

See examples for more Shelf examples.

Defining Ingredients

Ingredients are defined using fields (which may contain conditions). Those conditions may reference more fields in turn and so forth.

Metric

Metrics will always apply a default aggregation of ‘sum’ to any fields used.

kind: Metric
field: {field}
divide_by: {field} (optional)

divide_by is an optional denominator that field will be divided by safely.

Dimension

Metrics will always apply a default aggregation of ‘sum’ to their field.

kind: Dimension
field: {field}
{role}_field: {field} (optional)
buckets: A list of labeled conditions (optional)
buckets_default_label: string (optional)
quickselects: A list of labeled conditions (optional)

Adding id and other roles to Dimension

Dimensions can be defined with extra fields. The prefix before _field is the field’s role. The role will be suffixed to each value in the recipe rows. Let’s look at an example.

hospital:
  field: hospital_name
  id_field: hospital_id
  latitude_field: hospital_lat
  longitude_field: hospital_lng

Each result row will include

  • hospital

  • hospital_id The field defined as id_field

  • hospital_latitude The field defined as latitude_field

  • hospital_longitude The field defined as longitude_field

Defining buckets

Buckets let you group continuous values (like salaries or ages). Here’s an example:

groups:
    kind: Dimension
    field: age
    buckets:
    - label: 'northeasterners'
      field: state
      in: ['Vermont', 'New Hampshire']
    - label: 'babies'
      lt: 2
    - label: 'children'
      lt: 13
    - label: 'teens'
      lt: 20
    buckets_default_label: 'oldsters'

The conditions are evaluated in order. buckets_default_label is used for any values that didn’t match any condition.

For convenience, conditions defined in buckets will use the field from the Dimension unless a different field is defined in the condition. In the example above, the first bucket uses field: state explicitly while all the other conditions use field: age from the Dimension.

If you use order_by a bucket dimension, the order will be the order in which the buckets were defined.

Adding quickselects to a Dimension

quickselects are a way of associating conditions with a dimension.

region:
    kind: Dimension
    field: sales_region
total_sales:
    kind: Metric
    field: sales_dollars
date:
    kind: Dimension
    field: sales_date
    quickselects:
    - label: 'Last 90 days'
      between:
      - 90 days ago
      - tomorrow
    - label: 'Last 180 days'
      between:
      - 180 days ago
      - tomorrow

These conditions can then be accessed through Ingredient.build_filter. The AutomaticFilters extension is an easy way to use this.

recipe = Recipe(session=oven.Session(), extension_classes=[AutomaticFilters]). \
            .dimensions('region') \
            .metrics('total_sales') \
            .automatic_filters({
              'date__quickselect': 'Last 90 days'
            })

Defining Fields

Fields can be defined with a short string syntax or a dictionary syntax. The string syntax always is normalized into the dictionary syntax.

field:
    value: '{column reference}'
    aggregation: '{aggregation (optional)}'
    operators: {list of operators}
    as: {optional type to coerce into}
    default: {default value, optional}

or

field: '{string field definition}'
This may include field references that look like
@{ingredient name from the shelf}.

Defining Fields with Dicts

Dictionaries provide access to all options when defining a field.

dictionary field options

Key

Required

Description

value

required

string

What column to use.

aggregation

optional

string

(default is ‘sum’ for Metric and ‘none’ for Dimension)

What aggregation to use, if any. Possible aggregations are:

  • ‘sum’

  • ‘min’

  • ‘max’

  • ‘avg’

  • ‘count’

  • ‘count_distinct’

  • ‘month’ (round to the nearest month for dates)

  • ‘week’ (round to the nearest week for dates)

  • ‘year’ (round to the nearest year for dates)

  • ‘quarter’ (round to the nearest quarter for dates)

  • ‘age’ (calculate age based on a date and the current date)

  • ‘none’ (perform no aggregation)

  • ‘median’ (calculate the median value, note: this aggregation is not available on all databases).

  • ‘percentile[1,5,10,25,50,75,90,95,99]’ (calculate the nth percentile value where higher values correspond to higher percentiles, note: this aggregation is not available on all databases).

condition

optional

A condition

Condition will limit what rows of data are aggregated for a field.

operators

optional

A list of operator

Operators are fields combined with a math operator to the base field.

default

optional

An integer, string, float, or boolean value (optional)

A value to use if the column is NULL.

Warning

The following two fields are for internal use.

internal dictionary field options

Key

Required

Description

ref

optional

string

Replace this field with the field defined in the specified key in the shelf.

_use_raw_value

optional

boolean

Don’t evaluate value as a column, treat it as a constant in the SQL expression.

Defining Fields with Strings

Fields can be defined using strings. When using strings, words are treated as column references. If the words are prefixed with an ‘@’ (like @sales), the field of the ingredient named sales in the shelf will be injected.

Aggregations can be called like functions to apply that aggregation to a column.

string field examples

Field string

Description

revenue - expenses

The sum of column revenue minus the sum of column expenses.

field: revenue - expenses

# is the same as

field:
  value: revenue
  aggregation: sum  # this may be omitted because 'sum'
                    # is the default aggregation for Metrics
  operators:
  - operator: '-'
    field:
      value: expenses
      aggregation: sum

@sales / @student_count

Find the field definition of the field named ‘sales’ in the shelf.

Divide it by the field definition of the field named ‘student_count’.

count_distinct(student_id)

Count the distinct values of column student_id.

field: count_distinct(student_id)

# is the same as

field:
   value: student_id
   aggregation: count_distinct

Defining Field Operators

Operators lets you perform math with fields.

operator options

Key

Required

Description

operator

required

string

One of ‘+’, ‘-’, ‘*’, ‘/’

field

required

A field definition (either a string or a dictionary)

For instance, operators can be used like this:

# profit - taxes - interest
field:
  value: profit
  operators:
  - operator: '-'
    field: taxes
  - operator: '-'
    field: interest

Defining Conditions

Conditions can include a field and operator or a list of conditions and-ed or or-ed together.

field: {field definition}
label: string (an optional string label)
{operator}: {value} or {list of values}

or

or:     # a list of conditions
- {condition1}
- {condition2}
...
- {conditionN}

or

and:    # a list of conditions
- {condition1}
- {condition2}
...
- {conditionN}

or

a condition reference @{ingredient name from the shelf}.

Conditions consist of a field and exactly one operator.

condition options

Condition

Value is…

Description

gt

A string, int, or float.

Find values that are greater than the value

For example:

# Sales dollars are greater than 100.
condition:
  field: sales_dollars
  gt: 100

gte (or ge)

A string, int, or float.

Find values that are greater than or equal to the value

lt

A string, int, or float.

Find values that are less than the value

lte (or le)

A string, int, or float.

Find values that are less than or equal to the value

eq

A string, int, or float.

Find values that are equal to the value

ne

A string, int, or float.

Find values that are not equal to the value

like

A string

Find values that match the SQL LIKE expression

For example:

# States that start with the capital letter C
condition:
  field: state
  like: 'C%'

ilike

A string

Find values that match the SQL ILIKE (case insensitive like) expression.

between

A list of two values

Find values that are between the two values.

in

A list of values

Find values that are in the list of values

notin

A list of values

Find values that are not in the list of values

ands and ors in conditions

Conditions can and and or a list of conditions together.

Here’s an example:

# Find states that start with 'C' and end with 'a'
# Note the conditions in the list don't have to
# use the same field.
condition:
  and:
  - field: state
    like: 'C%'
  - field: state
    like: '%a'

Date conditions

If the field is a date or datetime, absolute and relative dates can be defined in values using string syntax. Recipe uses the Dateparser library.

Here’s an example.

# Find sales that occured within the last 90 days.
condition:
  field: sales_date
  between:
  - '90 days ago'
  - 'tomorrow'

Labeled conditions

Conditions may optionally be labeled by adding a label property.

quickselects are a feature of Dimension that are defined with a list of labeled conditions.

Examples

A simple shelf with conditions

This shelf is basic.

teens:
    kind: Metric
    field:
        value: pop2000
        condition:
            field: age
            between: [13,19]
state:
    kind: Dimension
    field: state

Using this shelf in a recipe.

recipe = Recipe(shelf=shelf, session=oven.Session())\
    .dimensions('state')\
    .metrics('teens')
print(recipe.to_sql())
print(recipe.dataset.csv)

The results look like:

SELECT census.state AS state,
      sum(CASE
              WHEN (census.age BETWEEN 13 AND 19) THEN census.pop2000
          END) AS teens
FROM census
GROUP BY census.state

state,teens,state_id
Alabama,451765,Alabama
Alaska,71655,Alaska
Arizona,516270,Arizona
Arkansas,276069,Arkansas
...

Metrics referencing other metric definitions

The following shelf has a Metric pct_teens that divides one previously defined Metric teens by another total_pop.

teens:
    kind: Metric
    field:
        value: pop2000
        condition:
            field: age
            between: [13,19]
total_pop:
    kind: Metric
    field: pop2000
pct_teens:
    field: '@teens'
    divide_by: '@total_pop'
state:
    kind: Dimension
    field: state

Using this shelf in a recipe.

recipe = Recipe(shelf=shelf, session=oven.Session())\
    .dimensions('state')\
    .metrics('pct_teens')
print(recipe.to_sql())
print(recipe.dataset.csv)

Here’s the results. Note that recipe performs safe division.

SELECT census.state AS state,
      CAST(sum(CASE
                    WHEN (census.age BETWEEN 13 AND 19) THEN census.pop2000
                END) AS FLOAT) / (coalesce(CAST(sum(census.pop2000) AS FLOAT), 0.0) + 1e-09) AS pct_teens
FROM census
GROUP BY census.state

state,pct_teens,state_id
Alabama,0.10178190714599038,Alabama
Alaska,0.11773975168751254,Alaska
Arizona,0.10036487658951877,Arizona
Arkansas,0.10330245760980436,Arkansas
...

Dimensions containing buckets

Dimensions may be created by bucketing a field.

total_pop:
    kind: Metric
    field: pop2000
age_buckets:
    kind: Dimension
    field: age
    buckets:
    - label: 'babies'
      lt: 2
    - label: 'children'
      lt: 13
    - label: 'teens'
      lt: 20
    buckets_default_label: 'oldsters'
mixed_buckets:
    kind: Dimension
    field: age
    buckets:
    - label: 'northeasterners'
      in: ['Vermont', 'New Hampshire']
      field: state
    - label: 'babies'
      lt: 2
    - label: 'children'
      lt: 13
    - label: 'teens'
      lt: 20
    buckets_default_label: 'oldsters'

Using this shelf in a recipe.

recipe = Recipe(shelf=shelf, session=oven.Session())\
    .dimensions('mixed_buckets')\
    .metrics('total_pop')\
    .order_by('mixed_buckets')
print(recipe.to_sql())
print(recipe.dataset.csv)

Here’s the results. Note this recipe orders by mixed_buckets. The buckets are ordered in the order they are defined.

SELECT CASE
          WHEN (census.state IN ('Vermont',
                                  'New Hampshire')) THEN 'northeasterners'
          WHEN (census.age < 2) THEN 'babies'
          WHEN (census.age < 13) THEN 'children'
          WHEN (census.age < 20) THEN 'teens'
          ELSE 'oldsters'
      END AS mixed_buckets,
      sum(census.pop2000) AS total_pop
FROM census
GROUP BY CASE
            WHEN (census.state IN ('Vermont',
                                    'New Hampshire')) THEN 'northeasterners'
            WHEN (census.age < 2) THEN 'babies'
            WHEN (census.age < 13) THEN 'children'
            WHEN (census.age < 20) THEN 'teens'
            ELSE 'oldsters'
        END
ORDER BY CASE
            WHEN (census.state IN ('Vermont',
                                    'New Hampshire')) THEN 0
            WHEN (census.age < 2) THEN 1
            WHEN (census.age < 13) THEN 2
            WHEN (census.age < 20) THEN 3
            ELSE 9999
        END

mixed_buckets,total_pop,mixed_buckets_id
northeasterners,1848787,northeasterners
babies,7613225,babies
children,44267889,children
teens,28041679,teens
oldsters,199155741,oldsters