12. Case Study: Data Science for Accurate and Timely Air Quality Measurements

California is prone to wildfires, so much so that its residents (like the authors of this book) sometimes joke that California is “always on fire”. However, wildfires themselves are no laughing matter. In 2020, forty separate fires covered the state in smoke, forced thousands of people to evacuate, and caused more than twelve billion dollars in damage (Fig. 12.1).


Fig. 12.1 Satellite image from August 2020 showing smoke covering California. (Image from Wikipedia licensed under CC BY-SA 3.0 IGO.)

In places like California, people use air quality measurements to know what kinds of protective measures they need to take. Depending on conditions, people may wish to wear a mask, use air filters, or avoid going outside altogether. Measures of air quality should be both accurate and timely. Inaccurate or biased measurements can cause people not to take air conditions as seriously as they should. Delayed alerts can expose people to harmful air.

In the United States, one important source of air measurements is the Air Quality System (AQS), run by the US government [US EPA, 2013]. The AQS places high-quality sensors at locations across the US and makes this data available to the public. These sensors are carefully calibrated to strict standards; in fact, the AQS sensors are generally seen as the gold standard for accuracy. However, they have a few downsides. First, these sensors are expensive: typically between $15,000 and $40,000 each. This means that there are fewer sensors, and they are farther apart. A person living far away from a sensor might not be able to rely on these measurements for personal use. Second, these sensors do not provide real-time data. Since the data undergo additional calibration, the sensors only release hourly averages with a time lag of one to two hours. In essence, the AQS sensors are accurate but not timely.

In contrast to the AQS, a company called PurpleAir produces a sensor that sells for $230 to $260 and can be easily installed in a house or apartment. Because of the lower price point, people across the US have purchased sensors for personal use—the sensors can connect to a home WiFi network so people can easily monitor the air quality in their homes. These sensors can also report data back to PurpleAir. In 2020, there were thousands of PurpleAir sensors across the US making publicly available measurements of air quality. Compared to the AQS sensors, PurpleAir sensors are more timely—they make a measurement every two minutes rather than every hour. Since there are more deployed PurpleAir sensors, it’s less likely that a person lives too far away from a sensor to make use of the data. However, PurpleAir sensors are less accurate. To make the sensors affordable, PurpleAir uses a simpler method of counting particles in air that doesn’t measure particle density. This means that PurpleAir measurements can report that air quality is worse than it really is [Hug, 2020]. In essence, PurpleAir sensors are timely but not accurate.

Can we combine AQS and PurpleAir sensors to produce measurements that are both timely and accurate? In fact, we can! The idea is to find pairs of AQS and PurpleAir sensors that are collocated, or next to each other in the same location. Then, we can treat the AQS sensors as the ground truth and correct the PurpleAir measurements to match the AQS measurements. Even though there are relatively few pairs of collocated AQS and PurpleAir sensors, we can generalize the correction to other PurpleAir sensors as long as the PurpleAir sensors are biased in a consistent way. In other words, it’s fine if the PurpleAir sensors aren’t accurate as long as they are precise: we can use the AQS sensors to correct for bias, but not for variance.
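To make the idea of correcting a consistent bias concrete, here is a minimal sketch in Python. It is not the correction that Barkjohn et al. derived. The data are synthetic, and the bias (readings about twice the true value, plus an offset), the noise level, and the names `aqs`, `purpleair`, and `correct` are all assumptions made up for illustration. The sketch fits a line to collocated pairs and then uses that line to adjust the PurpleAir readings.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "ground truth" readings standing in for collocated AQS sensors.
# These values are made up for illustration; they are not real measurements.
aqs = rng.uniform(5, 60, size=200)

# Hypothetical PurpleAir readings: consistently too high (bias)
# plus a small amount of random noise (variance).
purpleair = 2.0 * aqs + 3.0 + rng.normal(0, 2.0, size=200)

# Fit a line that predicts the AQS reading from the PurpleAir reading,
# using only the collocated pairs.
slope, intercept = np.polyfit(purpleair, aqs, deg=1)


def correct(pa_reading):
    """Apply the fitted linear correction to PurpleAir readings."""
    return slope * pa_reading + intercept


# Compare the average absolute error before and after the correction.
raw_error = np.mean(np.abs(purpleair - aqs))
corrected_error = np.mean(np.abs(correct(purpleair) - aqs))

print(f"Average error before correction: {raw_error:.2f}")
print(f"Average error after correction:  {corrected_error:.2f}")
```

In this sketch, the correction removes the consistent bias, but the random noise in each reading remains. This is why the approach depends on the sensors being precise.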

This analysis and correction was first developed by Karoline Barkjohn et al. from the US Environmental Protection Agency [Barkjohn et al., 2021]. In this chapter, we’ll walk through and reproduce parts of their analysis using Python code. We included this case study for a few reasons.

First, this analysis gives us an opportunity to see how data scientists wrangle, explore, and visualize data in a real-world setting. The case study integrates the concepts we introduced in this part of the book nicely.

Second, this case study is an example of using a large, biased dataset to amplify the usefulness of a small, accurate dataset. Combining large and small datasets like this is particularly exciting to data scientists and applies broadly to other domains ranging from social science to medicine.

Finally, this analysis has real-world use: because of Barkjohn’s analysis, PurpleAir sensors can be included in official US government maps for air quality, like the AirNow Fire and Smoke map. As of this writing, that map includes both AQS and PurpleAir sensors, applying the same correction that Barkjohn developed to the PurpleAir sensors.

In the next section, we begin the analysis by finding AQS and PurpleAir sensors that are near each other.