12.1. Finding Collocated Sensors

When starting an analysis, we like to begin with a research question. In this case, our research question might be: “How do we correct PurpleAir sensor readings so that they match AQS sensor readings?” We focus specifically on readings for PM2.5 particles, which are particles smaller than 2.5 micrometers in diameter. These particles are small enough to be inhaled into the lungs, pose the greatest risk to health, and are especially common in wood smoke.

Our analysis begins by finding collocated pairs of AQS and PurpleAir sensors—sensors that are placed immediately next to each other. This step is important because it lets us reduce the effects of other variables that might cause differences in sensor readings. Consider what would happen if we compared an AQS sensor placed inside a building with a PurpleAir sensor placed outside the building. The two sensors would have different readings, but some of these differences would happen because the sensors are exposed to different environments. Ensuring that sensors are truly collocated lets us say that differences in sensor readings are caused by the different ways the sensors are built, rather than by other potential confounding variables.

Barkjohn et al.’s analysis found pairs of AQS and PurpleAir sensors that were installed within 50 meters of each other. Then, they contacted each AQS site to see whether the people maintaining the site also maintained a PurpleAir sensor. This extra effort gave them confidence that their sensor pairs were truly collocated.

In this section, we’ll explore and clean data from the AQS and PurpleAir. Then, we’ll perform a similar join to construct a list of potentially collocated sensors. We won’t contact AQS sites ourselves; instead, we’ll proceed with the analysis by reusing Barkjohn et al.’s list of truly collocated sensors.

We’ve downloaded a list of AQS and PurpleAir sensors in data/list_of_aqs_sites.csv and data/list_of_purpleair_sensors.json. Let’s begin by reading these files into pandas. First, we check file sizes to see whether they are reasonable to load into memory.

!ls -lLh data/list_of*
-rw-r--r--  1 sam  staff   4.8M Oct 27 16:54 data/list_of_aqs_sites.csv
-rw-r--r--  1 sam  staff   3.8M Oct 22 16:10 data/list_of_purpleair_sensors.json

Both files are relatively small. We’ll start with the list of AQS sites.

12.1.1. Wrangling the List of AQS Sites

Let’s load the CSV file into pandas.

import pandas as pd

aqs_sites = pd.read_csv('data/list_of_aqs_sites.csv')

# The table output is fairly large, so display 2 rows
aqs_sites.head(2)
AQS_Site_ID POC State City ... Reporting_Agency Parameter_Name Annual_URLs Daily_URLs
0 01-003-0010 1 Alabama Fairhope ... Al Dept Of Env Mgt PM2.5 <a href='https://www3.epa.gov/cgi-bin/broker?_... <a href='https://www3.epa.gov/cgi-bin/broker?_...
1 01-027-0001 1 Alabama Ashland ... Al Dept Of Env Mgt PM2.5 <a href='https://www3.epa.gov/cgi-bin/broker?_... <a href='https://www3.epa.gov/cgi-bin/broker?_...

2 rows × 28 columns

# Number of rows and columns in the table
aqs_sites.shape
(1333, 28)

This dataframe comes from the AQS map of sites (link). We filtered the map to show only the AQS sites that measure PM2.5, then downloaded the list as a CSV using the map’s web app.

There are 28 columns in the table. Jupyter hides the middle columns to avoid overfilling the screen. However, we want to see all of the columns so that we can find out which ones are most useful for us. To do this, we’ll use a trick: slice out the first row of aqs_sites, then convert it to a dataframe.

aqs_sites.iloc[0].to_frame()
0
AQS_Site_ID 01-003-0010
POC 1
State Alabama
... ...
Parameter_Name PM2.5
Annual_URLs <a href='https://www3.epa.gov/cgi-bin/broker?_...
Daily_URLs <a href='https://www3.epa.gov/cgi-bin/broker?_...

28 rows × 1 columns

In this dataframe, each row corresponds to one column of aqs_sites. Next, we’ll tell pandas to display more rows so that we can browse all the columns at once.

from IPython.display import display

rows_to_show = 28
with pd.option_context('display.max_rows', rows_to_show):
    display(aqs_sites.iloc[0].to_frame())
0
AQS_Site_ID 01-003-0010
POC 1
State Alabama
City Fairhope
CBSA Daphne-Fairhope-Foley, AL
Local_Site_Name FAIRHOPE, Alabama
Address FAIRHOPE HIGH SCHOOL, 1 PIRATE DRIVE, FAIRHOPE...
Datum NAD83
Latitude 30.5
Longitude -87.88
LatLon_Accuracy_meters 4.0
Elevation_meters_MSL 37.19
Monitor_Start_Date 1/1/2000
Last_Sample_Date 6/30/2021
Active Yes
Measurement_Scale NEIGHBORHOOD
Measurement_Scale_Definition 500 M TO 4KM
Sample_Duration 24 HOUR
Sample_Collection_Frequency EVERY 3RD DAY
Sample_Collection_Method R & P Model 2025 PM-2.5 Sequential Air Sampler...
Sample_Analysis_Method Gravimetric
Method_Reference_ID RFPS-1006-145
FRMFEM Yes
Monitor_Type SLAMS
Reporting_Agency Al Dept Of Env Mgt
Parameter_Name PM2.5
Annual_URLs <a href='https://www3.epa.gov/cgi-bin/broker?_...
Daily_URLs <a href='https://www3.epa.gov/cgi-bin/broker?_...

For a complete description of the columns, we can reference the data dictionary that the AQS provides on their website (link).

Note

We often want to fiddle with pandas display options to show/hide rows of a large dataframe. For convenience, we’ve defined a function called display_df that has the basic logic of the cell above. In the future, we’ll use this code instead:

display_df(..., rows=28)
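
Here’s a minimal sketch of how such a helper could be written (the actual display_df may differ in its details):

from IPython.display import display

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    # Temporarily widen the pandas display limits, show the dataframe,
    # then restore the original settings when the block exits
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)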

12.1.1.1. What’s the Granularity?

For now, the most important observation is that the aqs_sites dataframe doesn’t contain sensor recordings. It only has a list of AQS sites. In other words, we might expect that the granularity of aqs_sites is that each row represents a single AQS site. To check this, we can see whether the AQS_Site_ID values are unique.

aqs_sites['AQS_Site_ID'].value_counts()
19-163-0015    4
19-153-0030    4
49-005-0007    4
              ..
23-005-0015    1
23-019-0017    1
80-002-0014    1
Name: AQS_Site_ID, Length: 921, dtype: int64
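
Since value_counts() sorts the duplicated IDs to the top, we can already see the answer, but pandas can also give a direct yes/no check in one line:

# False here: there are only 921 unique IDs among the 1,333 rows
aqs_sites['AQS_Site_ID'].is_unique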

It looks like some sites appear multiple times in this dataframe. Unfortunately, this means that the granularity of aqs_sites is probably finer than the individual site level. Why are sites duplicated? We can take a closer look at one duplicated site.

dup_site = aqs_sites.query("AQS_Site_ID == '19-163-0015'")
dup_site
AQS_Site_ID POC State City ... Reporting_Agency Parameter_Name Annual_URLs Daily_URLs
458 19-163-0015 1 Iowa Davenport ... University Hygenic Laboratory (University of I... PM2.5 <a href='https://www3.epa.gov/cgi-bin/broker?_... <a href='https://www3.epa.gov/cgi-bin/broker?_...
459 19-163-0015 2 Iowa Davenport ... University Hygenic Laboratory (University of I... PM2.5 <a href='https://www3.epa.gov/cgi-bin/broker?_... <a href='https://www3.epa.gov/cgi-bin/broker?_...
460 19-163-0015 3 Iowa Davenport ... University Hygenic Laboratory (University of I... PM2.5 <a href='https://www3.epa.gov/cgi-bin/broker?_... <a href='https://www3.epa.gov/cgi-bin/broker?_...
461 19-163-0015 4 Iowa Davenport ... University Hygenic Laboratory (University of I... PM2.5 <a href='https://www3.epa.gov/cgi-bin/broker?_... <a href='https://www3.epa.gov/cgi-bin/broker?_...

4 rows × 28 columns

This display isn’t very helpful since it hides the middle columns. In practice, we’d interactively display columns until we find some that appear to distinguish between the rows. Here, we’ll simply skip ahead and show a few columns of interest.

cols_that_distinguish = ['AQS_Site_ID', 'POC', 'Monitor_Start_Date',
                         'Last_Sample_Date', 'Sample_Collection_Method']
dup_site[cols_that_distinguish]
AQS_Site_ID POC Monitor_Start_Date Last_Sample_Date Sample_Collection_Method
458 19-163-0015 1 1/27/1999 8/31/2021 R & P Model 2025 PM-2.5 Sequential Air Sampler...
459 19-163-0015 2 2/9/2013 8/26/2021 R & P Model 2025 PM-2.5 Sequential Air Sampler...
460 19-163-0015 3 1/1/2019 9/30/2021 Teledyne T640 at 5.0 LPM
461 19-163-0015 4 1/1/2019 9/30/2021 Teledyne T640 at 5.0 LPM

The data dictionary for this table states this about the POC column:

This is the “Parameter Occurrence Code” used to distinguish different instruments that measure the same parameter at the same site.

In this case, there are four instruments at this site that all measure PM2.5. So, it appears that the granularity of the aqs_sites dataframe is that each row represents a single instrument at an AQS site.
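
If that’s right, the combination of site ID and POC should uniquely identify each row. A quick check along these lines (a sketch; we haven’t inspected every site by hand):

# True if each (site, instrument) pair appears exactly once
(aqs_sites.groupby(['AQS_Site_ID', 'POC']).size() == 1).all()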

Conveniently, the AQS also reports one average measurement for each site rather than one measurement for each instrument at a site, and we’ll proceed using this per-site average. Since we’re primarily using this data to match AQS sites with PurpleAir sensors, we’ll adjust the granularity by grouping by site ID, then taking the first row within each group. This removes the duplicated site IDs, which is appropriate here because we only need the location of each site to match it with PurpleAir sensors.

def rollup_dup_sites(df):
    return (
        df.groupby('AQS_Site_ID')
        .first()
        .reset_index()
    )
aqs_sites = (pd.read_csv('data/list_of_aqs_sites.csv')
             .pipe(rollup_dup_sites))
aqs_sites
AQS_Site_ID POC State City ... Reporting_Agency Parameter_Name Annual_URLs Daily_URLs
0 01-003-0010 1 Alabama Fairhope ... Al Dept Of Env Mgt PM2.5 <a href='https://www3.epa.gov/cgi-bin/broker?_... <a href='https://www3.epa.gov/cgi-bin/broker?_...
1 01-027-0001 1 Alabama Ashland ... Al Dept Of Env Mgt PM2.5 <a href='https://www3.epa.gov/cgi-bin/broker?_... <a href='https://www3.epa.gov/cgi-bin/broker?_...
2 01-049-1003 1 Alabama Crossville ... Al Dept Of Env Mgt PM2.5 <a href='https://www3.epa.gov/cgi-bin/broker?_... <a href='https://www3.epa.gov/cgi-bin/broker?_...
... ... ... ... ... ... ... ... ... ...
918 56-039-1013 1 Wyoming None ... National Park Service PM2.5 <a href='https://www3.epa.gov/cgi-bin/broker?_... <a href='https://www3.epa.gov/cgi-bin/broker?_...
919 80-002-0012 3 Country Of Mexico Mexicali ... Tracer Technologies PM2.5 <a href='https://www3.epa.gov/cgi-bin/broker?_... <a href='https://www3.epa.gov/cgi-bin/broker?_...
920 80-002-0014 3 Country Of Mexico Mexicali ... Tracer Technologies PM2.5 <a href='https://www3.epa.gov/cgi-bin/broker?_... <a href='https://www3.epa.gov/cgi-bin/broker?_...

921 rows × 28 columns

Now, each row of aqs_sites corresponds to one AQS site, which is the granularity we want.
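
As a lightweight guard, we could also assert this invariant so that the pipeline fails loudly if the rollup ever stops holding:

# One row per site ID after the rollup
assert aqs_sites['AQS_Site_ID'].is_unique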

12.1.1.2. Subsetting

To match AQS sites with PurpleAir sensors, we only need the site ID, latitude, and longitude. So, we’ll subset and rename the columns of interest.

def subset_aqs(df):
    subset = df[['AQS_Site_ID', 'Latitude', 'Longitude']]
    subset.columns = ['site_id', 'lat', 'lon']
    return subset
aqs_sites = (pd.read_csv('data/list_of_aqs_sites.csv')
             .pipe(rollup_dup_sites)
             .pipe(subset_aqs))

aqs_sites
site_id lat lon
0 01-003-0010 30.50 -87.88
1 01-027-0001 33.28 -85.80
2 01-049-1003 34.29 -85.97
... ... ... ...
918 56-039-1013 44.37 -110.83
919 80-002-0012 32.63 -115.45
920 80-002-0014 32.63 -115.50

921 rows × 3 columns

Now the aqs_sites dataframe is ready, and we’ll move on to the PurpleAir sites.

12.1.2. Wrangling the List of PurpleAir Sites

Unlike the AQS data, the file containing the list of PurpleAir sensors comes in JSON format. This file is provided by the PurpleAir website.

!ls -lLh data/list_of_purpleair_sensors.json
-rw-r--r--  1 sam  staff   3.8M Oct 22 16:10 data/list_of_purpleair_sensors.json
# Preview the first few lines of the file. The `cut` command limits the output
# to the first 70 characters of each line since the lines are quite long.
!head data/list_of_purpleair_sensors.json | cut -c 1-70
{"version":"7.0.30",
"fields":
["ID","pm","pm_cf_1","pm_atm","age","pm_0","pm_1","pm_2","pm_3","pm_4"
"data":[
[20,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,97,0.0,0.0,0.0,0.0,0.0,0
[47,null,null,null,4951,null,null,null,null,null,null,null,96,null,nul
[53,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,1.2,5.2,6.0,97,0.0,0.5,702.3,57.5,6.
[74,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,97,0.0,0.0,0.0,0.0,0.0,0
[77,9.8,9.8,9.8,1,9.8,10.7,11.0,11.2,13.8,15.1,15.5,97,9.7,9.8,6523.5,
[81,6.5,6.5,6.5,0,6.5,6.1,6.1,6.6,8.1,8.3,9.7,97,5.9,6.8,4058.9,346.1,

From the first few lines of the JSON file, we can guess that the data rows are stored in the "data" key and the column labels are stored in the "fields" key. We can first use Python’s json library to read in the file as a Python dict:

import json

with open('data/list_of_purpleair_sensors.json') as f:
    pa_json = json.load(f)

list(pa_json.keys())
['version', 'fields', 'data', 'count']
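
Before building a dataframe, we can sanity-check that the number of rows under "data" lines up with the "count" field (assuming, from its name, that "count" records the number of sensors):

# These two numbers should match if our reading of the format is right
len(pa_json['data']), pa_json['count']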

Then, we’ll create a dataframe.

pa_sites = pd.DataFrame(pa_json['data'], columns=pa_json['fields'])
pa_sites
ID pm pm_cf_1 pm_atm ... Voc Ozone1 Adc CH
0 20 0.0 0.0 0.0 ... NaN NaN 0.01 1
1 47 NaN NaN NaN ... NaN 0.72 0.72 0
2 53 0.0 0.0 0.0 ... NaN NaN 0.00 1
... ... ... ... ... ... ... ... ... ...
23135 132237 5.3 5.3 5.3 ... NaN NaN 0.00 3
23136 132431 3.2 3.2 3.2 ... NaN NaN 0.03 3
23137 132471 0.5 0.5 0.5 ... NaN NaN 0.05 3

23138 rows × 36 columns

As with the AQS data, there are many more columns in this dataframe than pandas will display. Another quick way to look at the columns is simply to print the column labels.

pa_sites.columns
Index(['ID', 'pm', 'pm_cf_1', 'pm_atm', 'age', 'pm_0', 'pm_1', 'pm_2', 'pm_3',
       'pm_4', 'pm_5', 'pm_6', 'conf', 'pm1', 'pm_10', 'p1', 'p2', 'p3', 'p4',
       'p5', 'p6', 'Humidity', 'Temperature', 'Pressure', 'Elevation', 'Type',
       'Label', 'Lat', 'Lon', 'Icon', 'isOwner', 'Flags', 'Voc', 'Ozone1',
       'Adc', 'CH'],
      dtype='object')

In this case, we can guess that the columns we’re most interested in are the sensor IDs (ID), sensor labels (Label), latitude (Lat), and longitude (Lon). Again, we consulted the data dictionary on the PurpleAir website to double-check what the columns represent.
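
Before subsetting, it can also be worth a quick scan of the coordinate columns to catch missing values or out-of-range coordinates before we join on location (a small check, not part of the original analysis):

# Latitude should fall in [-90, 90] and longitude in [-180, 180];
# the count row also reveals how many sensors lack coordinates
pa_sites[['Lat', 'Lon']].describe()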

12.1.2.1. What’s the Granularity?

Let’s check the ID column for duplicates, as we did for the AQS data.

pa_sites['ID'].value_counts()
20        1
88775     1
88771     1
         ..
57357     1
57355     1
132471    1
Name: ID, Length: 23138, dtype: int64

The value_counts() method lists the counts in descending order, so we can see that every ID appears exactly once. This verifies that the granularity of this data is already at the individual sensor level, so we don’t need to perform additional wrangling to adjust it.

12.1.2.2. Subsetting

Next, we’ll subset the PurpleAir data.

def subset_pa(df):
    subset = df[['ID', 'Label', 'Lat', 'Lon']]
    subset.columns = ['id', 'label', 'lat', 'lon']
    return subset
pa_sites = (pd.DataFrame(pa_json['data'], columns=pa_json['fields'])
            .pipe(subset_pa))
pa_sites
id label lat lon
0 20 Oakdale 40.60 -111.84
1 47 OZONE TEST 40.48 -111.88
2 53 Lakeshore 40.25 -111.70
... ... ... ... ...
23135 132237 1965 Valleyview Drive 50.68 -120.27
23136 132431 Granite Dells 34.24 -111.31
23137 132471 Joy's House 34.01 -117.66

23138 rows × 4 columns

Now, the PurpleAir data is ready to join with the AQS data.

12.1.3. Joining AQS and PurpleAir Data

We have two dataframes:

aqs_sites.head(2)
site_id lat lon
0 01-003-0010 30.50 -87.88
1 01-027-0001 33.28 -85.80
pa_sites.head(2)
id label lat lon
0 20 Oakdale 40.60 -111.84
1 47 OZONE TEST 40.48 -111.88

Our goal is to match rows from these two dataframes where the sites are within 50 meters of each other. This is a bit more challenging than the joins we’ve seen thus far. For instance, one naive approach is to use the merge method of pandas:

# Does *not* join properly
aqs_sites.merge(pa_sites, left_on=['lat', 'lon'], right_on=['lat', 'lon'])
site_id lat lon id label
0 06-111-1004 34.45 -119.23 48393 VCAPCD OJ

However, this matches only AQS and PurpleAir sites that have the exact same latitude and longitude. We won’t use the code above because it doesn’t perform the join we want. For each AQS site, we want to find the PurpleAir sites that are “close enough,” so we need to match rows even if they have slightly different latitudes and longitudes.

But how much difference in latitude and longitude corresponds to 50 meters? Here, we’ll use a simple approximation: 111,111 meters in the north-south direction roughly corresponds to one degree of latitude, and 111,111 * cos(latitude) meters in the east-west direction roughly corresponds to one degree of longitude.[1] Thus, we can find the latitude and longitude ranges that correspond to 25 meters in each direction (making a 50 m by 50 m rectangle around each point).

magic_meters_per_lat = 111_111
offset_in_m = 25
offset_in_lat = offset_in_m / magic_meters_per_lat
offset_in_lat
0.000225000225000225
import numpy as np

# To simplify even more, we'll use the median latitude for the AQS sites
median_latitude = aqs_sites['lat'].median()
magic_meters_per_lon = 111_111 * np.cos(np.radians(median_latitude))
offset_in_lon = offset_in_m / magic_meters_per_lon
offset_in_lon
0.000291515219937587

Finally, we’ll join AQS and PurpleAir sites together where the coordinates are within the offset_in_lat and offset_in_lon of each other. Doing this in SQL is much easier than doing it in pandas, so we’ll push the tables into a temporary SQLite database, then run a query to read the tables back into a dataframe.

import sqlalchemy

# An empty database name uses an in-memory sqlite database that doesn't save
# any files to disk
db = sqlalchemy.create_engine('sqlite://')

aqs_sites.to_sql(name='aqs', con=db, index=False)
pa_sites.to_sql(name='pa', con=db, index=False)
query = f'''
SELECT
  aqs.site_id AS aqs_id,
  pa.id AS pa_id,
  pa.label AS pa_label,
  aqs.lat AS aqs_lat,
  aqs.lon AS aqs_lon,
  pa.lat AS pa_lat,
  pa.lon AS pa_lon
FROM aqs JOIN pa
  ON  pa.lat - {offset_in_lat} <= aqs.lat
  AND                             aqs.lat <= pa.lat + {offset_in_lat}
  AND pa.lon - {offset_in_lon} <= aqs.lon
  AND                             aqs.lon <= pa.lon + {offset_in_lon}
'''
matched = pd.read_sql(query, db)
matched
aqs_id pa_id pa_label aqs_lat aqs_lon pa_lat pa_lon
0 06-019-0011 6568 IMPROVE_FRES2 36.79 -119.77 36.79 -119.77
1 06-019-0011 13485 AMTS_Fresno 36.79 -119.77 36.79 -119.77
2 06-019-0011 44427 Fresno CARB CCAC 36.79 -119.77 36.79 -119.77
... ... ... ... ... ... ... ...
146 53-061-1007 3659 Marysville 7th 48.05 -122.17 48.05 -122.17
147 53-063-0021 54603 Augusta 1 SRCAA 47.67 -117.36 47.67 -117.36
148 56-021-0100 50045 WDEQ-AQD Cheyenne NCore 41.18 -104.78 41.18 -104.78

149 rows × 7 columns
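
For comparison, the same box join can be written in pure pandas with a cross join followed by a filter. Here’s a sketch (it requires pandas 1.2+ for how='cross', and it materializes all 921 × 23,138 ≈ 21 million candidate pairs before filtering, which is why the SQL route is gentler on memory):

# Pair every AQS site with every PurpleAir sensor, then keep pairs whose
# coordinates fall within the offsets. This is equivalent to the SQL
# predicate: |aqs.lat - pa.lat| <= offset_in_lat, likewise for longitude.
pairs = aqs_sites.merge(pa_sites, how='cross', suffixes=('_aqs', '_pa'))
close = ((pairs['lat_aqs'] - pairs['lat_pa']).abs() <= offset_in_lat) & \
        ((pairs['lon_aqs'] - pairs['lon_pa']).abs() <= offset_in_lon)
matched_pd = pairs.loc[close]

This should recover the same 149 matches, up to floating-point edge cases at the boundary of the box.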

We’ve achieved our goal—we matched AQS sites with PurpleAir sensors. Barkjohn et al.’s paper mentions that they found 42 candidate matches in August 2018. We ran this analysis using 2021 data and found 149 candidate matches; perhaps many more AQS or PurpleAir sensors were installed in the years between. In any case, this part of the analysis provides a taste of the kinds of wrangling we’ll use again in the rest of this chapter. In the next section, we’ll wrangle, explore, and clean sensor data from an AQS site.


[1] This estimation works by assuming that the Earth is perfectly spherical. Then, one degree of latitude spans \( \frac{\pi}{180} \cdot r \) meters, where \( r \) is the radius of the Earth in meters. Plugging in the average radius of the Earth gives roughly 111,111 meters per degree of latitude. Longitude works the same way, but the radius of each “ring” around the Earth decreases as we get closer to the poles, so we adjust by a factor of \( \cos(\text{lat}) \). It turns out that the Earth isn’t perfectly spherical, so these estimates can’t be used for precise calculations, like landing a rocket. But for our purposes they do just fine.
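
Plugging in an average Earth radius of \( r \approx 6{,}371{,}000 \) meters makes the arithmetic concrete:

\[ \frac{\pi}{180} \cdot 6{,}371{,}000 \approx 111{,}195, \]

which we round to the easier-to-remember 111,111 meters per degree of latitude used in the code above.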