7.2. Aggregating

This section introduces operations for aggregating rows in a dataframe. Data scientists aggregate rows together to make summaries of data. For instance, a dataset containing daily sales can be aggregated to show monthly sales instead. Specifically, we’ll introduce grouping and pivoting, two common operations for aggregating data.

We’ll work with the baby names data, as introduced in the previous section:

baby = pd.read_csv('babynames.csv')
baby
Name Sex Count Year
0 Liam M 19659 2020
1 Noah M 18252 2020
2 Oliver M 14147 2020
... ... ... ... ...
2020719 Verona F 5 1880
2020720 Vertie F 5 1880
2020721 Wilma F 5 1880

2020722 rows × 4 columns

7.2.1. Basic Group-Aggregate

Let’s say you want to find out the total number of babies born as recorded in this data. This is simply the sum of the Count column:

baby['Count'].sum()
352554503

Summing up the name counts is one simple way to aggregate the data—it combines data from multiple rows.

But let’s say you instead want to answer a more interesting question: are U.S. births trending upwards over time? To answer this question, you can sum the Count column within each year rather than taking the sum over the entire dataset. In other words, you can split the data into groups based on Year, then sum up the Count values within each group.

groupby-births

Fig. 7.3 A depiction of grouping then aggregating for example data.

We call this operation grouping followed by aggregating. In pandas, you would write:

baby.groupby('Year')['Count'].sum()
Year
1880     194419
1881     185772
1882     213385
         ...   
2018    3487193
2019    3437438
2020    3287724
Name: Count, Length: 141, dtype: int64

Notice that the code is nearly the same as the non-grouped version, except that it starts with a call to .groupby('Year').

The result is a pd.Series with the total babies born for each year in the data. Notice that the index of this series contains the unique Year values. Now you can plot the counts over time:

counts_by_year = baby.groupby('Year')['Count'].sum()
counts_by_year.plot();
../../_images/pandas_aggregating_11_0.svg

What do you see in this plot? First, you might notice that there seem to be suspiciously few babies born before 1920. One likely explanation is that the Social Security Administration was created in 1935, so its data for prior births could be less complete.

You might also notice the dip when World War II began in 1939, and the post-war Baby Boomer era from 1946-1964.

Here’s the basic recipe for grouping in pandas:

(baby                # the dataframe
 .groupby('Year')    # column(s) to group
 ['Count']           # column(s) to aggregate
 .sum()              # how to aggregate
)

7.2.2. Grouping on Multiple Columns

You can pass multiple columns into .groupby as a list to group by multiple columns at once. This is useful when you need to further subdivide your groups. For example, we can group by both year and sex to see how many male and female babies were born over time.

counts_by_year_and_sex = (baby
 .groupby(['Year', 'Sex']) # Arg to groupby is a list of column names
 ['Count']
 .sum()
)
counts_by_year_and_sex 
Year  Sex
1880  F        83929
      M       110490
1881  F        85034
              ...   
2019  M      1785527
2020  F      1581301
      M      1706423
Name: Count, Length: 282, dtype: int64

Notice how the code closely follows the grouping recipe.

The counts_by_year_and_sex series has what we call a multi-level index with two levels, one for each column that was grouped. It’s a bit easier to see if we convert the series to a dataframe:

# The result only has one column
counts_by_year_and_sex.to_frame()
Count
Year Sex
1880 F 83929
M 110490
1881 F 85034
... ... ...
2019 M 1785527
2020 F 1581301
M 1706423

282 rows × 1 columns

There are two levels to the index because we grouped by two columns. It can be a bit tricky to work with multilevel indices, so you can reset the index to go back to a dataframe with a single index.

counts_by_year_and_sex.reset_index()
Year Sex Count
0 1880 F 83929
1 1880 M 110490
2 1881 F 85034
... ... ... ...
279 2019 M 1785527
280 2020 F 1581301
281 2020 M 1706423

282 rows × 3 columns

7.2.3. Custom Aggregation Functions

After grouping, pandas gives us flexible ways to aggregate the data. So far, you’ve seen how to use .sum() after grouping:

(baby
 .groupby('Year')
 ['Count']
 .sum() # aggregate by summing
)
Year
1880     194419
1881     185772
1882     213385
         ...   
2018    3487193
2019    3437438
2020    3287724
Name: Count, Length: 141, dtype: int64

pandas also supplies other aggregation functions, like .mean(), .size(), and .first(). Here’s the same grouping using .max():

(baby
 .groupby('Year')
 ['Count']
 .max() # aggregate by taking the max within each group
)
Year
1880     9655
1881     8769
1882     9557
        ...  
2018    19924
2019    20555
2020    19659
Name: Count, Length: 141, dtype: int64

But sometimes pandas doesn’t have the exact aggregation function you want to use. In these cases, you can define and use a custom aggregation function. pandas lets you do this through .agg(fn), where fn is a function that you define.

For instance, if we want to find the difference between the largest and smallest values within each group (the range of the data), we could first define a function called data_range, then pass that function into .agg().

# The input to this function is a pd.Series object containing a single column
# of data. It gets called once for each group.
def data_range(counts):
    return counts.max() - counts.min()

(baby
 .groupby('Year')
 ['Count']
 .agg(data_range) # aggregate using custom function
)
Year
1880     9650
1881     8764
1882     9552
        ...  
2018    19919
2019    20550
2020    19654
Name: Count, Length: 141, dtype: int64

7.2.4. Example: Have People Become More Creative With Baby Names?

Have people become more creative with baby names over time? One way to measure this is to see whether the number of unique baby names per year has increased over time.

We start by defining a count_unique function that counts the number of unique values in a series. Then, we pass that function into .agg().

def count_unique(s):
    return len(s.unique())

unique_names_by_year = (baby
 .groupby('Year')
 ['Name']
 .agg(count_unique) # aggregate using the custom count_unique function
)
unique_names_by_year
Year
1880     1889
1881     1829
1882     2012
        ...  
2018    29619
2019    29417
2020    28613
Name: Name, Length: 141, dtype: int64
unique_names_by_year.plot();
../../_images/pandas_aggregating_29_0.svg

We see that the number of unique names has generally increased over time, even though the number of babies born has mostly stabilized since the 1960s.

7.2.5. Pivoting

Pivoting is essentially a convenient way to arrange the results of a group and aggregation when grouping with two columns. Earlier in this section we grouped the baby names data by year and sex:

counts_by_year_and_sex = (baby
 .groupby(['Year', 'Sex']) 
 ['Count']
 .sum()
)
counts_by_year_and_sex.to_frame()
Count
Year Sex
1880 F 83929
M 110490
1881 F 85034
... ... ...
2019 M 1785527
2020 F 1581301
M 1706423

282 rows × 1 columns

This produces a pd.Series with the counts. We can also imagine the same data with the Sex index level “pivoted” to the columns of a dataframe. It’s easier to see with an example:

mf_pivot = pd.pivot_table(
    baby,
    index='Year',   # Column to turn into new index
    columns='Sex',  # Column to turn into new columns
    values='Count', # Column to aggregate for values
    aggfunc=sum)    # Aggregation function
mf_pivot
Sex F M
Year
1880 83929 110490
1881 85034 100738
1882 99699 113686
... ... ...
2018 1676884 1810309
2019 1651911 1785527
2020 1581301 1706423

141 rows × 2 columns

Notice that the data values are identical in the pivot table and the table produced with .groupby(); the values are just arranged differently. Pivot tables are useful for quickly summarizing data using two attributes, and you’ll see them often in articles and papers.

The pd.DataFrame.plot() function also happens to work well with pivot tables, since the function draws one line for each column of data in the table:

mf_pivot.plot();
../../_images/pandas_aggregating_36_0.svg