7.2. Aggregating

This section introduces operations for aggregating rows in a relation. Data scientists aggregate rows together to make summaries of data. For instance, a dataset containing daily sales can be aggregated to show monthly sales instead. Specifically, we’ll introduce grouping, a common operation for aggregating data.

We’ll work with the baby names data, as introduced in the previous section:

# Set up connection to database
import sqlalchemy
db = sqlalchemy.create_engine('sqlite:///babynames.db')
query = ''' 
SELECT *
FROM baby
LIMIT 10
'''

pd.read_sql(query, db)
Name Sex Count Year
0 Liam M 19659 2020
1 Noah M 18252 2020
2 Oliver M 14147 2020
... ... ... ... ...
7 Lucas M 11281 2020
8 Henry M 10705 2020
9 Alexander M 10151 2020

10 rows × 4 columns

7.2.1. Basic Group-Aggregate

Let’s say we want to find out the total number of babies born as recorded in this data. This is simply the sum of the Count column. SQL provides functions that we use in the SELECT statement, like SUM:

query = ''' 
SELECT SUM(Count)
FROM baby
'''

pd.read_sql(query, db)
SUM(Count)
0 352554503

Summing up the name counts is one simple way to aggregate the data—it combines data from multiple rows.

But let’s say we instead want to answer a more interesting question: are U.S. births trending upwards over time? To answer this question, we can sum the Count column within each year rather than taking the sum over the entire dataset. In other words, we split the data into groups based on Year, then sum up the Count values within each group.

sql-groupby-births

Fig. 7.2 A depiction of grouping then aggregating for example data.

We call this operation grouping followed by aggregating. In SQL, we specify what column we wish to use for grouping through the GROUP BY clause, then using aggregation functions in SELECT:

query = ''' 
SELECT Year, SUM(Count)
FROM baby
GROUP BY Year
'''

pd.read_sql(query, db)
Year SUM(Count)
0 1880 194419
1 1881 185772
2 1882 213385
... ... ...
138 2018 3487193
139 2019 3437438
140 2020 3287724

141 rows × 2 columns

Notice that the code is nearly the same as the non-grouped version, except that it contains a GROUP BY clause using the Year column. We also add the Year column to the SELECT clause so that each row of the result also contains the year.

The result is a relation with the total babies born for each year in the data. Notice that the Year column contains the unique Year values—there are no duplicate Year values anymore since we grouped them together. Now we can plot the counts over time:

counts_by_year = pd.read_sql(query, db)
counts_by_year.plot('Year', 'SUM(Count)');
../../_images/sql_aggregating_12_0.png

What do we see in this plot? First, we might notice that there seem to be suspiciously few babies born before 1920. One likely explanation is that the Social Security Administration was created in 1935, so its data for prior births could be less complete.

We might also notice the dip when World War II began in 1939, and the post-war Baby Boomer era from 1946-1964.

Here’s the basic recipe for grouping in SQL:

SELECT
  col1,           -- column used for grouping
  SUM(col2)       -- aggregation of another column
FROM table_name   -- relation to use
GROUP BY col1     -- the column(s) to group by

7.2.2. Grouping on Multiple Columns

We pass multiple columns into GROUP BY to group by multiple columns at once. This is useful when we need to further subdivide our groups. For example, we can group by both year and sex to see how many male and female babies were born over time.

query = ''' 
SELECT Year, Sex, SUM(Count)
FROM baby
GROUP BY Year, Sex
'''

pd.read_sql(query, db)
Year Sex SUM(Count)
0 1880 F 83929
1 1880 M 110490
2 1881 F 85034
... ... ... ...
279 2019 M 1785527
280 2020 F 1581301
281 2020 M 1706423

282 rows × 3 columns

Notice how the code closely follows the grouping recipe.

7.2.3. Other Aggregation Functions

The SQLite database has several other built-in aggregation functions, such as COUNT, AVG, MIN, and MAX. The full list of functions is available on the SQLite website 1.

We’ve already seen the SUM function:

query = ''' 
SELECT Year, SUM(Count)
FROM baby
GROUP BY Year
'''

pd.read_sql(query, db)
Year SUM(Count)
0 1880 194419
1 1881 185772
2 1882 213385
... ... ...
138 2018 3487193
139 2019 3437438
140 2020 3287724

141 rows × 2 columns

To use another aggregation function, we call it in the SELECT clause. For instance, we can use MAX instead of SUM:

query = ''' 
SELECT Year, MAX(Count)
FROM baby
GROUP BY Year
'''

pd.read_sql(query, db)
Year MAX(Count)
0 1880 9655
1 1881 8769
2 1882 9557
... ... ...
138 2018 19924
139 2019 20555
140 2020 19659

141 rows × 2 columns

Note

The available aggregation functions are one of the first places a data scientist may encounter differences in SQL implementations. For instance, SQLite has a relatively minimal set of aggregation functions, while PostgreSQL has many more 2. Most SQL implementations provide SUM, COUNT, MIN, MAX, and AVG.

7.2.4. Example: Have People Become More Creative With Baby Names?

Have people become more creative with baby names over time? One way to measure this is to see whether the number of unique baby names per year has increased over time.

To do this aggregation in SQL, we use the COUNT function and the DISTINCT keyword. The DISTINCT keyword tells SQL to only keep the unique values within a set of columns.

# Finds the unique baby names
query = ''' 
SELECT DISTINCT Name
FROM baby
'''

pd.read_sql(query, db)
Name
0 Liam
1 Noah
2 Oliver
... ...
100361 Crete
100362 Roll
100363 Zilpah

100364 rows × 1 columns

To count the number of distinct names, we can aggregate using the COUNT function. We’ll also use the AS keyword to rename the resulting column.

# Finds the number of unique baby names
query = ''' 
SELECT COUNT(DISTINCT Name) AS n_names
FROM baby
'''

pd.read_sql(query, db)
n_names
0 100364

Finally, we group by the Year column to aggregate over each year rather than over the entire dataset:

query = ''' 
SELECT Year, COUNT(DISTINCT Name) AS n_names
FROM baby
GROUP BY Year
'''

unique_names_by_year = pd.read_sql(query, db)
unique_names_by_year
Year n_names
0 1880 1889
1 1881 1829
2 1882 2012
... ... ...
138 2018 29619
139 2019 29417
140 2020 28613

141 rows × 2 columns

Now, we can plot the number of unique names over time:

unique_names_by_year.plot('Year', 'n_names');
../../_images/sql_aggregating_31_0.png

We see that the number of unique names has generally increased over time, even though the number of babies born has mostly stabilized since the 1960s.

Note

Unlike pandas, SQLite doesn’t provide a simple way to pivot a relation. Instead, we use GROUP BY on two columns in SQL, read the result into a dataframe, and then use the unstack() dataframe method 3.


1

https://www.sqlite.org/lang_aggfunc.html

2

https://www.postgresql.org/docs/current/functions-aggregate.html

3

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.unstack.html