# 7.2. Aggregating¶

This section introduces operations for aggregating rows in a relation. Data scientists aggregate rows together to make summaries of data. For instance, a dataset containing daily sales can be aggregated to show monthly sales instead. Specifically, we’ll introduce grouping, a common operation for aggregating data.

We’ll work with the baby names data, as introduced in the previous section:

```# Set up connection to database
import sqlalchemy
db = sqlalchemy.create_engine('sqlite:///babynames.db')
```
```query = '''
SELECT *
FROM baby
LIMIT 10
'''

pd.read_sql(query, db)
```
Name Sex Count Year
0 Liam M 19659 2020
1 Noah M 18252 2020
2 Oliver M 14147 2020
... ... ... ... ...
7 Lucas M 11281 2020
8 Henry M 10705 2020
9 Alexander M 10151 2020

10 rows × 4 columns

## 7.2.1. Basic Group-Aggregate¶

Let’s say we want to find out the total number of babies born as recorded in this data. This is simply the sum of the `Count` column. SQL provides functions that we use in the `SELECT` statement, like `SUM`:

```query = '''
SELECT SUM(Count)
FROM baby
'''

pd.read_sql(query, db)
```
SUM(Count)
0 352554503

Summing up the name counts is one simple way to aggregate the data—it combines data from multiple rows.

But let’s say we instead want to answer a more interesting question: are U.S. births trending upwards over time? To answer this question, we can sum the `Count` column within each year rather than taking the sum over the entire dataset. In other words, we split the data into groups based on `Year`, then sum up the `Count` values within each group. This process is depicted in A depiction of grouping then aggregating for example data..

Fig. 7.2 A depiction of grouping then aggregating for example data.

We call this operation grouping followed by aggregating. In SQL, we specify what column we wish to use for grouping through the `GROUP BY` clause, then using aggregation functions in `SELECT`:

```query = '''
SELECT Year, SUM(Count)
FROM baby
GROUP BY Year
'''

pd.read_sql(query, db)
```
Year SUM(Count)
0 1880 194419
1 1881 185772
2 1882 213385
... ... ...
138 2018 3487193
139 2019 3437438
140 2020 3287724

141 rows × 2 columns

Notice that the code is nearly the same as the non-grouped version, except that it contains a `GROUP BY` clause using the `Year` column. We also add the `Year` column to the `SELECT` clause so that each row of the result also contains the year.

The result is a relation with the total babies born for each year in the data. Notice that the Year column contains the unique `Year` values—there are no duplicate `Year` values anymore since we grouped them together. Now we can plot the counts over time:

```counts_by_year = pd.read_sql(query, db)
px.line(counts_by_year, x='Year', y='SUM(Count)', width=350, height=250)
```

What do we see in this plot? First, we might notice that there seem to be suspiciously few babies born before 1920. One likely explanation is that the Social Security Administration was created in 1935, so its data for prior births could be less complete.

We might also notice the dip when World War II began in 1939, and the post-war Baby Boomer era from 1946-1964.

Here’s the basic recipe for grouping in `SQL`:

```SELECT
col1,           -- column used for grouping
SUM(col2)       -- aggregation of another column
FROM table_name   -- relation to use
GROUP BY col1     -- the column(s) to group by
```

## 7.2.2. Grouping on Multiple Columns¶

We pass multiple columns into `GROUP BY` to group by multiple columns at once. This is useful when we need to further subdivide our groups. For example, we can group by both year and sex to see how many male and female babies were born over time.

```query = '''
SELECT Year, Sex, SUM(Count)
FROM baby
GROUP BY Year, Sex
'''

pd.read_sql(query, db)
```
Year Sex SUM(Count)
0 1880 F 83929
1 1880 M 110490
2 1881 F 85034
... ... ... ...
279 2019 M 1785527
280 2020 F 1581301
281 2020 M 1706423

282 rows × 3 columns

Notice how the code closely follows the grouping recipe.

## 7.2.3. Other Aggregation Functions¶

The SQLite database has several other built-in aggregation functions, such as `COUNT`, `AVG`, `MIN`, and `MAX`. The full list of functions is available on the SQLite website 1.

We’ve already seen the `SUM` function:

```query = '''
SELECT Year, SUM(Count)
FROM baby
GROUP BY Year
'''

pd.read_sql(query, db)
```
Year SUM(Count)
0 1880 194419
1 1881 185772
2 1882 213385
... ... ...
138 2018 3487193
139 2019 3437438
140 2020 3287724

141 rows × 2 columns

To use another aggregation function, we call it in the `SELECT` clause. For instance, we can use `MAX` instead of `SUM`:

```query = '''
SELECT Year, MAX(Count)
FROM baby
GROUP BY Year
'''

pd.read_sql(query, db)
```
Year MAX(Count)
0 1880 9655
1 1881 8769
2 1882 9557
... ... ...
138 2018 19924
139 2019 20555
140 2020 19659

141 rows × 2 columns

Note

The available aggregation functions are one of the first places a data scientist may encounter differences in SQL implementations. For instance, SQLite has a relatively minimal set of aggregation functions, while PostgreSQL has many more 2. Most SQL implementations provide `SUM`, `COUNT`, `MIN`, `MAX`, and `AVG`.

## 7.2.4. Example: Have People Become More Creative With Baby Names?¶

Have people become more creative with baby names over time? One way to measure this is to see whether the number of unique baby names per year has increased over time.

To do this aggregation in SQL, we use the `COUNT` function and the `DISTINCT` keyword. The `DISTINCT` keyword tells SQL to only keep the unique values within a set of columns.

```# Finds the unique baby names
query = '''
SELECT DISTINCT Name
FROM baby
'''

pd.read_sql(query, db)
```
Name
0 Liam
1 Noah
2 Oliver
... ...
100361 Crete
100362 Roll
100363 Zilpah

100364 rows × 1 columns

To count the number of distinct names, we can aggregate using the `COUNT` function. We’ll also use the `AS` keyword to rename the resulting column.

```# Finds the number of unique baby names
query = '''
SELECT COUNT(DISTINCT Name) AS n_names
FROM baby
'''

pd.read_sql(query, db)
```
n_names
0 100364

Finally, we group by the `Year` column to aggregate over each year rather than over the entire dataset:

```query = '''
SELECT Year, COUNT(DISTINCT Name) AS n_names
FROM baby
GROUP BY Year
'''

unique_names_by_year = pd.read_sql(query, db)
unique_names_by_year
```
Year n_names
0 1880 1889
1 1881 1829
2 1882 2012
... ... ...
138 2018 29619
139 2019 29417
140 2020 28613

141 rows × 2 columns

Now, we can plot the number of unique names over time:

```px.line(unique_names_by_year,
x='Year', y='n_names',
labels={'n_names': '# unique names'},
width=350, height=250)
```

We see that the number of unique names has generally increased over time, even though the number of babies born has mostly stabilized since the 1960s.

Note

Unlike `pandas`, SQLite doesn’t provide a simple way to pivot a relation. Instead, we use `GROUP BY` on two columns in SQL, read the result into a dataframe, and then use the `unstack()` dataframe method 3.

## 7.2.5. Takeaways¶

This section covers common ways to aggregate data in SQL using the `GROUP BY` keyword with one or more columns. In the next section, we’ll explain how to join relations together.

1

https://www.sqlite.org/lang_aggfunc.html

2

https://www.postgresql.org/docs/current/functions-aggregate.html

3

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.unstack.html