Ingredients

Ingredients are the building block of recipe.

Ingredients can contain columns that are part of the SELECT portion of a query, filters that are part of a WHERE clause of a query, group_bys that contribute to a query’s GROUP BY and havings which add HAVING limits ot a query.

Creating ingredients in python

Ingredients can be created either in python or via configuration. To created Ingredients in python, use one of the four convenience classes.

  • Metric: Create an aggregated calculation using a column. This value appears only in the SELECT part of the SQL statement.

  • Dimension: Create a non-aggregated value using a column. This value appears in the SELECT and GROUP BY parts of the SQL statement.

  • Filter: Create a boolean expression. This value appears in the WHERE part of the SQL statement. Filters can be created automatically using the AutomaticFilters extension or by using a Dimension or Metric’sales build_filter method.

  • Having: Create a boolean expression with an aggregated ColumnElement. This value appears in the HAVING part of the SQL statement.

Metrics and Dimensions are commonly reused in working Recipe code, while filters are often created temporarily based on data.

Features of ingredients

Let’s explore some capabilities.

Formatters

Formatters are a list of python callables that take a single value. This let you manipulate the results of an ingredient with python code. If you use formatters, the original, unmodified value is available as {ingredient}_raw.

shelf = Shelf({
    'state': Dimension(Census.state),
    'age': WtdAvgMetric(Census.age, Census.pop2000),
    'gender': Dimension(Census.gender),
    'population': Metric(func.sum(Census.pop2000), formatters=[
        lambda value: int(round(value, -6) / 1000000)
    ])
})

recipe = Recipe(shelf=shelf, session=oven.Session())\
    .dimensions('gender').metrics('population')

for row in recipe.all():
    print('{} has {} people'.format(row.gender, row.population))
    print('\tThe original value is: {}'.format(row.population_raw))

The results look like

F has 144 million people
    The original value is: 143534804
M has 137 million people
    The original value is: 137392517

Building filters

Ingredient.build_filter

Storing extra attributes in meta

Extra keyword arguments that get passed to ingredient initialization get stored in the meta object. This can be used to extend the capabilities of ingredients and add extra features.

d = Dimension(Census.age, icon='cog')
print(d.meta.icon)
>>> 'cog'

Types of Ingredients

List of ingredients

Dimension

Dimensions are groupings that exist in your data. Dimension objects add the column to the select statement and the group by of the SQL query.

# A simple dimension
self.shelf['state'] = Dimension(Census.state)

Adding an id

Dimensions can use separate columns for ids and values. Consider a table of employees with an employee_id and a full_name. If you had two employees with the same name you need to be able to distinguish between them.

# Support an id and a label
self.shelf['employee']: Dimension(Employee.full_name,
                                  id_expression=Employee.id)

The id is accessible as employee_id in each row and their full name is available as employee.

If you build a filter using this dimension, it will filter against the id.

Adding an ordering

If you want to order a dimension in a custom way, pass a keyword argument order_by_expression. This code adds an order_by_expression that causes the values to sort case insensitively.

from sqlalchemy import func

# Support an id and a label
self.shelf['employee']: Dimension(Employee.full_name,
                                  order_by_expression=func.lower(
                                    Employee.full_name
                                  ))

The order_by expression is accessible as employee_order_by in each row and the full name is available as employee. If the employee dimension is used in a recipe, the recipe will always be ordered by func.lower(Employee.full_name).

Adding additional groupings

Both id_expression and order_by_expression are special cases of Dimension’s ability to be passed additional columns can be used for grouping. Any keyword argument suffixed with _expression adds additional roles to this Dimension. The first required expression supplies the dimension’s value role. For instance, you could create a dimension with an id, a latitude and a longitude.

For instance, the following

Dimension(Hospitals.name,
          latitude_expression=Hospitals.lat
          longitude_expression=Hospitals.lng,
          id='hospital')

would add columns named “hospital”, “hospital_latitude”, and “hospital_longitude” to the recipes results. All three of these expressions would be used as group bys.

Using lookups

You can use a lookup table to map values in your data to descriptive names. The _id property of your dimension contains the original value.

# Convert M/F into Male/Female
self.shelf['gender']: Dimension(Census.sex, lookup={'M': 'Male',
    'F': 'Female'}, lookup_default='Unknown')

If you use the gender dimension, there will be a gender_id in each row that will be “M” or “F” and a gender in each row that will be “Male” or “Female”.

shelf = Shelf({
    'state': Dimension(Census.state),
    'gender_desc': Dimension(Census.gender, lookup={'M': 'Male',
        'F': 'Female'}, lookup_default='Unknown'),
    'age': WtdAvgMetric(Census.age, Census.pop2000),
    'population': Metric(func.sum(Census.pop2000))
})

recipe = Recipe(shelf=shelf, session=oven.Session())\
    .dimensions('gender_desc').metrics('population')
print(recipe.to_sql())
print(recipe.dataset.csv)

Lookups inject a formatter in the first position. Because a formatter is used, recipe creates a gender_desc_raw on the response that contains the unformatted value then uses the lookup to create the gender_desc property. All dimensions also generate an {ingredient}_id property.

Here is the query and the results.

SELECT census.gender AS gender_desc_raw,
    sum(census.pop2000) AS population
FROM census
GROUP BY census.gender

gender_desc_raw,population,gender_desc,gender_desc_id
F,143534804,Female,F
M,137392517,Male,M

Metric

Metrics are aggregations performed on your data. Here’s an example of a few Metrics.

shelf = Shelf({
    'total_population': Metric(func.sum(Census.pop2000)),
    'min_population': Metric(func.min(Census.pop2000)),
    'max_population': Metric(func.max(Census.pop2000))
})
recipe = Recipe(shelf=shelf, session=oven.Session())\
    .metrics('total_population', 'min_population', 'max_population')
print(recipe.to_sql())
print(recipe.dataset.csv)

The results of this recipe are:

SELECT max(census.pop2000) AS max_population,
    min(census.pop2000) AS min_population,
    sum(census.pop2000) AS total_population
FROM census

max_population,min_population,total_population
294583,217,280927321

DivideMetric

Division in SQL introduces the possibility of division by zero. DivideMetric guards against division by zero while giving you a quick way to divide one calculation by another.

shelf = Shelf({
    'state': Dimension(Census.state),
    'popgrowth': DivideMetric(func.sum(Census.pop2008-Census.pop2000), func.sum(Census.pop2000)),
})
recipe = Recipe(shelf=shelf, session=oven.Session())\
    .dimensions('state').metrics('popgrowth')

This creates results like:

SELECT census.state AS state,
    CAST(sum(census.pop2008 - census.pop2000) AS FLOAT) /
      (coalesce(CAST(sum(census.pop2000) AS FLOAT), 0.0) + 1e-09) AS popgrowth
FROM census
GROUP BY census.state

state,popgrowth,state_id
Alabama,0.04749469366071285,Alabama
Alaska,0.09194726152996757,Alaska
Arizona,0.2598860676785905,Arizona
Arkansas,0.06585681816651036,Arkansas
California,0.0821639328251409,California
Colorado,0.14231283526592364,Colorado
...

The denominator has a tiny value added to it to prevent division by zero.

WtdAvgMetric

WtdAvgMetric generates a weighted average of a number using a weighting.

Warning

WtdAvgMetric takes two ColumnElements as arguments. The first is the value and the second is the weighting. Unlike other Metrics, these are not aggregated.

Here’s an example.

shelf = Shelf({
    'state': Dimension(Census.state),
    'avgage': WtdAvgMetric(Census.age, Census.pop2000),
})
recipe = Recipe(shelf=shelf, session=oven.Session())\
    .dimensions('state').metrics('avgage')

print(recipe.to_sql())
print(recipe.dataset.csv)

This generates results that look like this:

SELECT census.state AS state,
    CAST(sum(census.age * census.pop2000) AS FLOAT) / (coalesce(CAST(sum(census.pop2000) AS FLOAT), 0.0) + 1e-09) AS avgage
FROM census
GROUP BY census.state

state,avgage,state_id
Alabama,36.27787892421841,Alabama
Alaska,31.947384766048568,Alaska
Arizona,35.37065466080318,Arizona
Arkansas,36.63745110262778,Arkansas
California,34.17872597484759,California
...

Note: WtdAvgMetric uses safe division from DivideMetric.

Filter

Filter objects add a condition to the where clause of your SQL query. Filter objects can be added to a Shelf.

shelf = Shelf({
    'state': Dimension(Census.state),
    'population': Metric(func.sum(Census.pop2000)),
    'teens': Filter(Census.age.between(13,19)),
})
recipe = Recipe(shelf=shelf, session=oven.Session())\
    .dimensions('state')\
    .metrics('population')\
    .filters('teens')
print(recipe.to_sql())
print(recipe.dataset.csv)

This results in output like:

SELECT census.state AS state,
    sum(census.pop2000) AS population
FROM census
WHERE census.age BETWEEN 13 AND 19
GROUP BY census.state

state,population,state_id
Alabama,451765,Alabama
Alaska,71655,Alaska
Arizona,516270,Arizona

Different ways of generating Filters

Recipe has several ways of filtering recipes.

  • Filter objects can be added to the shelf. They can be added to the recipe by name from a shelf. This is best when you have a filter that you want to use in many place.

    shelf = Shelf({
        'age': Dimension(Census.age),
        'state': Dimension(Census.state),
        'population': Metric(func.sum(Census.pop2000)),
        'teens': Filter(Census.age.between(13,19)),
    })
    ...
    recipe = recipe.filters('teens')
    
  • Filter objects can be created dynamically and added to the recipe. This is best if the filtering needs to change dynamically.

    recipe = recipe.filters(Filter(Census.age.between(13,19))
    
  • Ingredient.build_filter can be used to build filters that refer to the ingredient’s column.

    age_filter = shelf['age'].build_filter([13,19], 'between')
    recipe = recipe.filters(age_filter)
    

    This is best when you want to reuse a column definition defined in an ingredient.

  • AutomaticFilters: The AutomaticFilters extension adds filtering syntax directly to recipe.

    recipe = recipe.automatic_filters({
      'age__between': [13,19]
    })
    

    This is best when you want to add many filters consistently. AutomaticFilters uses Ingredient.build_filter behind the scenes.

Having

Having objects are binary expressions with an aggregated column value. One easy way to generate Having objects is to build_filter using a Metric.

shelf = Shelf({
    'age': Dimension(Census.age),
    'avgage': WtdAvgMetric(Census.age, Census.pop2000),
    'state': Dimension(Census.state),
    'population': Metric(func.sum(Census.pop2000)),
})
# Find states with a population greater than 15 million
big_states = shelf['population'].build_filter(15000000, operator='gt')
recipe = Recipe(shelf=shelf, session=oven.Session())\
    .dimensions('state')\
    .metrics('population')\
    .order_by('-population')\
    .filters(big_states)

print(recipe.to_sql())
print(recipe.dataset.csv)

This generates the following results.

SELECT census.state AS state,
    sum(census.pop2000) AS population
FROM census
GROUP BY census.state
HAVING sum(census.pop2000) > 15000000
ORDER BY sum(census.pop2000) DESC

state,population,state_id
California,33829442,California
Texas,20830810,Texas
New York,18978668,New York
Florida,15976093,Florida