The term data granularity refers to the level of detail captured in the data. It’s easiest to show you what we mean through an example.
The Global Monitoring Laboratory (GML) conducts research on the atmosphere. For instance, the GML has a station at Mauna Loa in Hawaii that measures carbon dioxide (CO2) levels. Here’s a picture from their website:
The CO2 data is available on the GML website. We’ve downloaded the data and loaded it into Python:
import pandas as pd

co2 = pd.read_csv('data/co2_mm_mlo.txt', header=None, skiprows=72,
                  sep=r'\s+',
                  names=['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'days'])
738 rows × 7 columns
It actually takes a bit of work to get the data into Python properly. We’ll return to this data in the Data Quality chapter. For now, take a closer look at each row in the co2 data table. What does a row of the table represent?

Notice that the Yr and Mo columns contain years and months, and there are multiple measurements for each year. You might have also guessed that there is only one measurement per month within a year, implying that each row of the table represents the readings for a month. To check these guesses, you’d look at the data description, which states:
The “average” column contains the monthly mean CO2 mole fraction determined from daily averages.
So, you see that each row of the data table represents a month. But the actual measurements from the research station happen more frequently. In fact, the GML website has datasets for daily and hourly measurements too.
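As a quick sanity check, you could confirm the one-row-per-month structure directly in pandas. The tiny table below is a hypothetical stand-in for the real co2 data, but the check itself works unchanged on the full table:

```python
import pandas as pd

# Hypothetical excerpt with the same Yr/Mo structure as the co2 table
co2_sample = pd.DataFrame({
    'Yr':  [1958, 1958, 1958, 1959],
    'Mo':  [3, 4, 5, 3],
    'Avg': [315.71, 317.45, 317.51, 315.58],
})

# If each row represents one month, no (Yr, Mo) pair should repeat
counts = co2_sample.groupby(['Yr', 'Mo']).size()
print((counts == 1).all())  # True when the table has one row per month
```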
You might already see some asymmetry here. We can go from hourly recordings to daily averages but not the other way around. We say that the hourly data has a finer granularity than the daily data, and the daily data has a coarser granularity than the hourly data. You use aggregation to go to a coarser granularity; in pandas, you would use .agg(). But you usually can’t go to a finer granularity.
So why not always just use the data with the finest granularity available? On a computational level, very fine-grained data can become very large. The Mauna Loa Observatory started recording CO2 levels in 1958. Imagine how many rows the data table would contain if they took measurements every single second! But more importantly, you want the granularity of the data to match your research question. Suppose you want to see whether CO2 levels have risen over time, consistent with global warming predictions. You don’t need a CO2 measurement every second. In fact, you would probably want yearly data, which makes this plot:
# Remove missing data
co2 = co2.query('Avg > 0')
(co2.groupby('Yr')
    ['Avg']
    .mean()
    .plot());
What happens if you decide to use monthly data instead? Here’s a plot:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 5))
sns.lineplot(x='DecDate', y='Avg', data=co2);
You can see that there is seasonality to the CO2 levels — within a single year, the CO2 rises and falls. But, the long-term increase in CO2 levels is apparent.
7.3.1. Granularity Checklist
You should have answers to the following questions after looking at the granularity of your datasets.
What does a record represent?

In the co2 table, each record represents a month of CO2 readings.
Do all records capture granularity at the same level? (Sometimes a table will contain summary rows.)

Yes, for the co2 table.
If the data were aggregated, how was the aggregation performed? Sampling and averaging are common aggregations.
According to the data description, each record is the mean of daily readings. But, the data website also has hourly readings, so we suspect that both daily and monthly readings are aggregated from the hourly readings.
What kinds of aggregations can we perform on the data?
Time series data like the co2 data allow for many useful aggregations. We’ve already aggregated the monthly data into yearly averages. If we wanted to examine the seasonality in more detail, we could aggregate the data by the day within a month (1-31).
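One such seasonal aggregation is averaging across years for each calendar month. The sketch below uses hypothetical monthly data (not the real co2 values) to show the pattern; on the real table you would group the same way on its Mo column:

```python
import pandas as pd

# Hypothetical monthly averages spanning two years
co2_sample = pd.DataFrame({
    'Yr':  [2000] * 12 + [2001] * 12,
    'Mo':  list(range(1, 13)) * 2,
    'Avg': [368 + m * 0.1 for m in range(1, 13)] +
           [370 + m * 0.1 for m in range(1, 13)],
})

# Average across years for each calendar month to expose the seasonal cycle
seasonal = co2_sample.groupby('Mo')['Avg'].mean()
print(seasonal)
```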