9.5. Modifying Structure
If a data frame has an inconvenient structure, it can be difficult to do the analysis we want. The wrangling process often reshapes the data frame in some way to make the analysis easier and more natural. These changes can be as simple as taking a subset of the rows and/or columns of the table, or they can involve changing the table's granularity or reshaping it. In this section, we use the techniques from Chapter 6 to show how to modify structure.
Simplify Structure. If a data frame has many features that are not needed in our analysis, then we may want to drop these extraneous columns to make handling the data frame easier. Or, if we want to focus on a particular period of time or geographic area, we may want to take a subset of the rows [1]. In Section 8.5, we read into our data frame a small set of features from the hundreds available in the DAWN survey because we were interested in understanding the patterns of type of ER visit by demographics of the patient. Later, in Section %s, we restrict an analysis of home sale prices to one year and a few cities in an effort to limit the impact of inflation and to better study the effect of location on sale price.
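As a quick sketch of this kind of simplification, the snippet below drops an extraneous column and keeps a subset of rows. The data frame and its column names (`city`, `year`, `price`, `agent`) are made up for illustration; they are not the book's home sale dataset.

```python
import pandas as pd

# Hypothetical home-sale records; names and values are invented for illustration
sales = pd.DataFrame({
    'city':  ['Berkeley', 'Piedmont', 'Richmond', 'Berkeley'],
    'year':  [2006, 2006, 2004, 2004],
    'price': [850_000, 1_200_000, 400_000, 700_000],
    'agent': ['A', 'B', 'C', 'D'],   # not needed for our analysis
})

# Drop the extraneous column, then keep one year and a few cities
subset = (sales
          .drop(columns=['agent'])
          .loc[lambda df: (df['year'] == 2006) &
               df['city'].isin(['Berkeley', 'Piedmont'])])
```

Chaining `.drop` and a boolean `.loc` filter keeps the simplification readable as one step, without mutating the original data frame.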
Adjust Granularity. In Section 9.1, we provided an example where we aggregated monthly averages to a yearly average in order to visualize annual trends. In the following Section 9.6, we provide another example where we aggregate violation-level data to the inspection level so that it can be combined with the inspection scores. In both of these examples, we adjust the data frame to a coarser granularity by grouping records together and then aggregating values. Recall that for the CO2 measurements, we grouped the monthly values from the same year and then averaged the CO2 measurements:
co2_NA.groupby('Yr')['Avg'].mean()
Yr
1958 315.33
1959 315.97
1960 316.91
...
2017 406.55
2018 408.52
2019 412.27
Name: Avg, Length: 62, dtype: float64
Other common aggregations are: the size of the group; the sum, minimum, and maximum value in a group; and the first or last value in a group. The details of adjusting granularity of pandas data frames can be found in Section 6.2, including how to group by multiple column values.
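These other aggregations can all be computed in one pass with `.agg`. Here is a small sketch on a made-up data frame in the shape of the CO2 data (the values are invented, not the real measurements):

```python
import pandas as pd

# Toy monthly measurements; values are made up for illustration
df = pd.DataFrame({
    'Yr':  [1958, 1958, 1959, 1959, 1959],
    'Avg': [315.2, 315.4, 315.9, 316.0, 316.1],
})

# Several common aggregations computed at once: group size, sum,
# min/max, and the first and last value within each group
summary = df.groupby('Yr')['Avg'].agg(
    ['size', 'sum', 'min', 'max', 'first', 'last'])
```

Passing a list of aggregation names to `.agg` returns a data frame with one column per aggregation and one row per group.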
At times a dataset might have mixed granularity, where records are at different levels of detail. A common case is data provided by government agencies, where county-level and state-level records are included in the same file. When this happens, we usually want to split the data frame into two: one at the county level and the other at the state level. This makes county-level and state-level analyses much easier, even feasible, to perform.
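A minimal sketch of such a split follows. It assumes a hypothetical file where state-total rows are marked by a missing county field; real government files may flag the state rows differently (for example, with a FIPS code or a "Total" label).

```python
import pandas as pd

# Hypothetical mixed-granularity data: county rows plus state-total rows,
# the latter marked here by a missing county (values are invented)
pop = pd.DataFrame({
    'state':      ['CA', 'CA', 'CA', 'NV', 'NV'],
    'county':     ['Alameda', 'Yolo', None, 'Clark', None],
    'population': [1_680_000, 220_000, 39_500_000, 2_270_000, 3_100_000],
})

# Split into two frames, one per level of granularity
county_level = pop[pop['county'].notna()].reset_index(drop=True)
state_level = (pop[pop['county'].isna()]
               .drop(columns='county')
               .reset_index(drop=True))
```

After the split, each frame has a single, consistent granularity, so aggregations like a population sum per state no longer double-count the state-total rows.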
Reshape. Data, especially from government sources, are often shared as pivot tables. These “wide” tables have data values as column names and are often difficult to use in analysis. We may need to reshape them into a “long” form. Figure 9.2 depicts the same data stored in both wide and long data tables.
Fig. 9.2 An example of a wide data table (top) and a long data table (bottom). Both tables contain the same data. Each row of the wide data table maps to three rows in the long data table, as highlighted in the tables. Notice that in the wide data table, each row has three observations, one for each month. In the long data table, each row only has one observation. Long data tables are generally easier to aggregate for future analysis.
Note
Long form data is also frequently called tidy data [Wickham, 2014].
We reshaped the CO2 data into a wide data frame that is like a pivot table in shape. There is a column for each month and a row for each year.
co2_pivot = pd.pivot_table(
    co2[10:34],
    index='Yr',      # Column to turn into new index
    columns='Mo',    # Column to turn into new columns
    values='Avg',    # Column to aggregate for values
    aggfunc='sum')   # Aggregation function
co2_wide = co2_pivot.reset_index()
display_df(co2_wide, cols=13)
Mo | Yr | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1959 | 315.62 | 316.38 | 316.71 | 317.72 | 318.29 | 318.15 | 316.54 | 314.80 | 313.84 | 313.26 | 314.8 | 315.58 |
1 | 1960 | 316.43 | 316.97 | 317.58 | 319.02 | 320.03 | 319.59 | 318.18 | 315.91 | 314.16 | 313.83 | 315.0 | 316.19 |
The column headings are months, and the cell values in the grid are the monthly CO2 averages. We can turn this data frame back into a “long” data frame, where the column names become a feature, called month, and the values in the grid are reorganized into a second feature, called average.
co2_long = co2_wide.melt(id_vars=['Yr'],
                         var_name='month',
                         value_name='average')
co2_long
Yr | month | average | |
---|---|---|---|
0 | 1959 | 1 | 315.62 |
1 | 1960 | 1 | 316.43 |
2 | 1959 | 2 | 316.38 |
... | ... | ... | ... |
21 | 1960 | 11 | 315.00 |
22 | 1959 | 12 | 315.58 |
23 | 1960 | 12 | 316.19 |
24 rows × 3 columns
Notice that the data has been recaptured in its original shape (although the rows are not in their original order). Wide-form data is more common when we expect readers to look at the data table itself, as in an economics magazine or news article. But long-form data is more useful for data analysis. For instance, co2_long lets us write short pandas code to group by either year or month, while the wide-form data makes it difficult to group by year.
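To make that point concrete, here is a small sketch using a made-up long-form table in the shape of co2_long (only a few rows, with invented values): the same one-line groupby works along either dimension.

```python
import pandas as pd

# A tiny long-form table shaped like co2_long; values are illustrative only
co2_long = pd.DataFrame({
    'Yr':      [1959, 1959, 1960, 1960],
    'month':   [1, 2, 1, 2],
    'average': [315.62, 316.38, 316.43, 316.97],
})

# One short groupby handles either grouping dimension
by_year = co2_long.groupby('Yr')['average'].mean()
by_month = co2_long.groupby('month')['average'].mean()
```

With the wide form, the monthly values are spread across twelve columns, so grouping by year would first require collecting those columns back together.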
9.5.1. Takeaways
This section covers several useful transformations for modifying structure: subsetting, aggregating, and reshaping. The .melt() method is particularly useful for converting wide-form into long-form data. In the next section, we’ll combine the techniques from this chapter to wrangle the restaurant inspection data.
1. Subsetting is covered in Section 6.1.