12.1. Finding Collocated Sensors
When starting an analysis, we like to begin with a research question. In this case, our research question might be: “How do we correct PurpleAir sensor readings so that they match AQS sensor readings?” We focus specifically on readings for PM2.5 particles, which are particles that are smaller than 2.5 micrometers in diameter. These particles are small enough to be inhaled into the lungs, pose the greatest risk to health, and are especially common in wood smoke.
Our analysis begins by finding collocated pairs of AQS and PurpleAir sensors—sensors that are placed immediately next to each other. This step is important because it lets us reduce the effects of other variables that might cause differences in sensor readings. Consider what would happen if we compared an AQS sensor placed inside a building with a PurpleAir sensor placed outside the building. The two sensors would have different readings, but some of these differences would happen because the sensors are exposed to different environments. Ensuring that sensors are truly collocated lets us say that differences in sensor readings are caused by the different ways the sensors are built, rather than by other potential confounding variables.
Barkjohn et al.’s analysis found pairs of AQS and PurpleAir sensors that are installed within 50 meters of each other. Then, they contacted each AQS site to see whether the people maintaining the site also maintained a PurpleAir sensor. This extra effort gave them confidence that their sensor pairs were truly collocated.
In this section, we’ll explore and clean data from the AQS and PurpleAir. Then, we’ll perform a similar join to construct a list of potentially collocated sensors. We won’t contact AQS sites ourselves; instead, we’ll proceed with the analysis by reusing Barkjohn et al.’s list of truly collocated sensors.
We’ve downloaded a list of AQS and PurpleAir sensors in data/list_of_aqs_sites.csv and data/list_of_purpleair_sensors.json. Let’s begin by reading these files into pandas.
First, we check file sizes to see whether they are reasonable to load into memory.
!ls -lLh data/list_of*
-rw-r--r-- 1 sam staff 4.8M Oct 27 16:54 data/list_of_aqs_sites.csv
-rw-r--r-- 1 sam staff 3.8M Oct 22 16:10 data/list_of_purpleair_sensors.json
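If a shell isn’t handy, a small Python sketch can perform the same check (assuming the files live in the data/ folder, as above):

from pathlib import Path

# Report each file's size in megabytes
for path in ['data/list_of_aqs_sites.csv',
             'data/list_of_purpleair_sensors.json']:
    size_mb = Path(path).stat().st_size / 1024 ** 2
    print(f'{path}: {size_mb:.1f} MB')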
Both files are relatively small. We’ll start with the list of AQS sites.
12.1.1. Wrangling the List of AQS Sites
Let’s load the CSV file into pandas.
import pandas as pd

aqs_sites = pd.read_csv('data/list_of_aqs_sites.csv')
# The table output is fairly large, so display 2 rows
aqs_sites.head(2)
  | AQS_Site_ID | POC | State | City | ... | Reporting_Agency | Parameter_Name | Annual_URLs | Daily_URLs |
---|---|---|---|---|---|---|---|---|---|
0 | 01-003-0010 | 1 | Alabama | Fairhope | ... | Al Dept Of Env Mgt | PM2.5 | <a href='https://www3.epa.gov/cgi-bin/broker?_... | <a href='https://www3.epa.gov/cgi-bin/broker?_... |
1 | 01-027-0001 | 1 | Alabama | Ashland | ... | Al Dept Of Env Mgt | PM2.5 | <a href='https://www3.epa.gov/cgi-bin/broker?_... | <a href='https://www3.epa.gov/cgi-bin/broker?_... |
2 rows × 28 columns
# Number of rows and columns in the table
aqs_sites.shape
(1333, 28)
This dataframe comes from the AQS map of sites (link). We filtered the map to show only the AQS sites that measure PM2.5, then downloaded the list as a CSV using the map’s web app.
There are 28 columns in the table.
Jupyter hides the middle columns to avoid overfilling the screen.
However, we want to see what the columns are so that we can find out which columns are
most useful for us.
To do this, we’ll use a trick. We’ll slice out the first row of aqs_sites, then convert it to a dataframe.
aqs_sites.iloc[0].to_frame()
  | 0 |
---|---|
AQS_Site_ID | 01-003-0010 |
POC | 1 |
State | Alabama |
... | ... |
Parameter_Name | PM2.5 |
Annual_URLs | <a href='https://www3.epa.gov/cgi-bin/broker?_... |
Daily_URLs | <a href='https://www3.epa.gov/cgi-bin/broker?_... |
28 rows × 1 columns
In this dataframe, each row corresponds to one column of aqs_sites. Next, we’ll tell pandas to display more rows so that we can browse all the columns at once.
from IPython.display import display
rows_to_show = 28
with pd.option_context('display.max_rows', rows_to_show):
display(aqs_sites.iloc[0].to_frame())
  | 0 |
---|---|
AQS_Site_ID | 01-003-0010 |
POC | 1 |
State | Alabama |
City | Fairhope |
CBSA | Daphne-Fairhope-Foley, AL |
Local_Site_Name | FAIRHOPE, Alabama |
Address | FAIRHOPE HIGH SCHOOL, 1 PIRATE DRIVE, FAIRHOPE... |
Datum | NAD83 |
Latitude | 30.5 |
Longitude | -87.88 |
LatLon_Accuracy_meters | 4.0 |
Elevation_meters_MSL | 37.19 |
Monitor_Start_Date | 1/1/2000 |
Last_Sample_Date | 6/30/2021 |
Active | Yes |
Measurement_Scale | NEIGHBORHOOD |
Measurement_Scale_Definition | 500 M TO 4KM |
Sample_Duration | 24 HOUR |
Sample_Collection_Frequency | EVERY 3RD DAY |
Sample_Collection_Method | R & P Model 2025 PM-2.5 Sequential Air Sampler... |
Sample_Analysis_Method | Gravimetric |
Method_Reference_ID | RFPS-1006-145 |
FRMFEM | Yes |
Monitor_Type | SLAMS |
Reporting_Agency | Al Dept Of Env Mgt |
Parameter_Name | PM2.5 |
Annual_URLs | <a href='https://www3.epa.gov/cgi-bin/broker?_... |
Daily_URLs | <a href='https://www3.epa.gov/cgi-bin/broker?_... |
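If we only need the column labels and their types, DataFrame.info() offers a quicker, if less tidy, alternative:

# Prints every column label with its non-null count and dtype
aqs_sites.info()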
For a complete description of the columns, we can reference the data dictionary that the AQS provides on their website (link).
Note
We often want to fiddle with pandas display options to show or hide rows of a large dataframe. For convenience, we’ve defined a function called display_df that has the basic logic of the cell above. In the future, we’ll use this code instead:
display_df(..., rows=28)
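We haven’t shown display_df’s definition. A minimal sketch, assuming it only needs to wrap the option-setting logic from the cell above (the cols parameter is our own addition), might look like:

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    # Temporarily raise pandas's display limits, then render the dataframe
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)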
12.1.1.1. What’s the granularity?
For now, the most important observation is that the aqs_sites dataframe doesn’t contain sensor recordings. It only has a list of AQS sites. In other words, we might expect that the granularity of aqs_sites is that each row represents a single AQS site.
To check this, we can see whether the AQS_Site_IDs are unique.
aqs_sites['AQS_Site_ID'].value_counts()
19-163-0015 4
19-153-0030 4
49-005-0007 4
..
23-005-0015 1
23-019-0017 1
80-002-0014 1
Name: AQS_Site_ID, Length: 921, dtype: int64
It looks like some sites appear multiple times in this dataframe. Unfortunately, this means that the granularity of aqs_sites is finer than the individual site level.
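If we only want a yes-or-no answer rather than the full counts, the is_unique property gives it directly:

# False: at least one site ID repeats
aqs_sites['AQS_Site_ID'].is_unique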
Why are sites duplicated? We can take a closer look at some duplicated sites.
dup_site = aqs_sites.query("AQS_Site_ID == '19-163-0015'")
dup_site
  | AQS_Site_ID | POC | State | City | ... | Reporting_Agency | Parameter_Name | Annual_URLs | Daily_URLs |
---|---|---|---|---|---|---|---|---|---|
458 | 19-163-0015 | 1 | Iowa | Davenport | ... | University Hygenic Laboratory (University of I... | PM2.5 | <a href='https://www3.epa.gov/cgi-bin/broker?_... | <a href='https://www3.epa.gov/cgi-bin/broker?_... |
459 | 19-163-0015 | 2 | Iowa | Davenport | ... | University Hygenic Laboratory (University of I... | PM2.5 | <a href='https://www3.epa.gov/cgi-bin/broker?_... | <a href='https://www3.epa.gov/cgi-bin/broker?_... |
460 | 19-163-0015 | 3 | Iowa | Davenport | ... | University Hygenic Laboratory (University of I... | PM2.5 | <a href='https://www3.epa.gov/cgi-bin/broker?_... | <a href='https://www3.epa.gov/cgi-bin/broker?_... |
461 | 19-163-0015 | 4 | Iowa | Davenport | ... | University Hygenic Laboratory (University of I... | PM2.5 | <a href='https://www3.epa.gov/cgi-bin/broker?_... | <a href='https://www3.epa.gov/cgi-bin/broker?_... |
4 rows × 28 columns
This display isn’t very helpful since it hides the middle columns. In practice, we’d interactively display columns until we find some that appear to distinguish between the rows. Here, we’ll simply skip ahead and show a few columns of interest.
cols_that_distinguish = ['AQS_Site_ID', 'POC', 'Monitor_Start_Date',
'Last_Sample_Date', 'Sample_Collection_Method']
dup_site[cols_that_distinguish]
  | AQS_Site_ID | POC | Monitor_Start_Date | Last_Sample_Date | Sample_Collection_Method |
---|---|---|---|---|---|
458 | 19-163-0015 | 1 | 1/27/1999 | 8/31/2021 | R & P Model 2025 PM-2.5 Sequential Air Sampler... |
459 | 19-163-0015 | 2 | 2/9/2013 | 8/26/2021 | R & P Model 2025 PM-2.5 Sequential Air Sampler... |
460 | 19-163-0015 | 3 | 1/1/2019 | 9/30/2021 | Teledyne T640 at 5.0 LPM |
461 | 19-163-0015 | 4 | 1/1/2019 | 9/30/2021 | Teledyne T640 at 5.0 LPM |
The data dictionary for this table states this about the POC column:
This is the “Parameter Occurrence Code” used to distinguish different instruments that measure the same parameter at the same site.
In this case, there are four instruments at this site that all measure PM2.5. So, it appears that the granularity of the aqs_sites dataframe is that each row represents a single instrument at an AQS site.
The AQS also reports one average measurement for each site rather than one measurement for each instrument at a site, and we’ll proceed using this per-site average. Since we’re primarily using this data to match AQS sites with PurpleAir sensors, we’ll adjust the granularity by grouping by site ID, then taking the first row within each group. This removes duplicated site IDs, which is appropriate here because we only need the location of each site to match it with PurpleAir sensors.
def rollup_dup_sites(df):
return (
df.groupby('AQS_Site_ID')
.first()
.reset_index()
)
aqs_sites = (pd.read_csv('data/list_of_aqs_sites.csv')
.pipe(rollup_dup_sites))
aqs_sites
  | AQS_Site_ID | POC | State | City | ... | Reporting_Agency | Parameter_Name | Annual_URLs | Daily_URLs |
---|---|---|---|---|---|---|---|---|---|
0 | 01-003-0010 | 1 | Alabama | Fairhope | ... | Al Dept Of Env Mgt | PM2.5 | <a href='https://www3.epa.gov/cgi-bin/broker?_... | <a href='https://www3.epa.gov/cgi-bin/broker?_... |
1 | 01-027-0001 | 1 | Alabama | Ashland | ... | Al Dept Of Env Mgt | PM2.5 | <a href='https://www3.epa.gov/cgi-bin/broker?_... | <a href='https://www3.epa.gov/cgi-bin/broker?_... |
2 | 01-049-1003 | 1 | Alabama | Crossville | ... | Al Dept Of Env Mgt | PM2.5 | <a href='https://www3.epa.gov/cgi-bin/broker?_... | <a href='https://www3.epa.gov/cgi-bin/broker?_... |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
918 | 56-039-1013 | 1 | Wyoming | None | ... | National Park Service | PM2.5 | <a href='https://www3.epa.gov/cgi-bin/broker?_... | <a href='https://www3.epa.gov/cgi-bin/broker?_... |
919 | 80-002-0012 | 3 | Country Of Mexico | Mexicali | ... | Tracer Technologies | PM2.5 | <a href='https://www3.epa.gov/cgi-bin/broker?_... | <a href='https://www3.epa.gov/cgi-bin/broker?_... |
920 | 80-002-0014 | 3 | Country Of Mexico | Mexicali | ... | Tracer Technologies | PM2.5 | <a href='https://www3.epa.gov/cgi-bin/broker?_... | <a href='https://www3.epa.gov/cgi-bin/broker?_... |
921 rows × 28 columns
Now, each row of aqs_sites corresponds to one AQS site, which is the granularity we want.
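As a quick sanity check, the site IDs are unique after the rollup:

# True: one row per site
aqs_sites['AQS_Site_ID'].is_unique

One design note: groupby(...).first() returns the first non-null value of each column within each group, so if a site’s first row had missing values, the rolled-up row could mix values from different instruments. Assuming we only rely on the location columns, drop_duplicates(subset='AQS_Site_ID') would be an equivalent alternative that keeps each site’s first row intact.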
12.1.1.2. Subsetting
To match AQS sites with PurpleAir sensors, we only need the site ID, latitude, and longitude. So, we’ll subset and rename the columns of interest.
def subset_aqs(df):
subset = df[['AQS_Site_ID', 'Latitude', 'Longitude']]
subset.columns = ['site_id', 'lat', 'lon']
return subset
aqs_sites = (pd.read_csv('data/list_of_aqs_sites.csv')
.pipe(rollup_dup_sites)
.pipe(subset_aqs))
aqs_sites
  | site_id | lat | lon |
---|---|---|---|
0 | 01-003-0010 | 30.50 | -87.88 |
1 | 01-027-0001 | 33.28 | -85.80 |
2 | 01-049-1003 | 34.29 | -85.97 |
... | ... | ... | ... |
918 | 56-039-1013 | 44.37 | -110.83 |
919 | 80-002-0012 | 32.63 | -115.45 |
920 | 80-002-0014 | 32.63 | -115.50 |
921 rows × 3 columns
Now the aqs_sites dataframe is ready, and we’ll move on to the PurpleAir sites.
12.1.2. Wrangling the List of PurpleAir Sites
Unlike the AQS data, the file containing PurpleAir sensor data comes in JSON format. This file is provided by the PurpleAir website.
!ls -lLh data/list_of_purpleair_sensors.json
-rw-r--r-- 1 sam staff 3.8M Oct 22 16:10 data/list_of_purpleair_sensors.json
# Previews first few lines of the file. The `cut` command limits the output to
# the first 70 characters for each line since the lines are quite long.
!head data/list_of_purpleair_sensors.json | cut -c 1-70
{"version":"7.0.30",
"fields":
["ID","pm","pm_cf_1","pm_atm","age","pm_0","pm_1","pm_2","pm_3","pm_4"
"data":[
[20,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,97,0.0,0.0,0.0,0.0,0.0,0
[47,null,null,null,4951,null,null,null,null,null,null,null,96,null,nul
[53,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,1.2,5.2,6.0,97,0.0,0.5,702.3,57.5,6.
[74,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,97,0.0,0.0,0.0,0.0,0.0,0
[77,9.8,9.8,9.8,1,9.8,10.7,11.0,11.2,13.8,15.1,15.5,97,9.7,9.8,6523.5,
[81,6.5,6.5,6.5,0,6.5,6.1,6.1,6.6,8.1,8.3,9.7,97,5.9,6.8,4058.9,346.1,
From the first few lines of the JSON file, we can guess that the data rows are stored in the "data" key and the column labels are stored in the "fields" key. We can first use Python’s json library to read in the file as a Python dict:
import json
with open('data/list_of_purpleair_sensors.json') as f:
pa_json = json.load(f)
list(pa_json.keys())
['version', 'fields', 'data', 'count']
Then, we’ll create a dataframe.
pa_sites = pd.DataFrame(pa_json['data'], columns=pa_json['fields'])
pa_sites
  | ID | pm | pm_cf_1 | pm_atm | ... | Voc | Ozone1 | Adc | CH |
---|---|---|---|---|---|---|---|---|---|
0 | 20 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | 0.01 | 1 |
1 | 47 | NaN | NaN | NaN | ... | NaN | 0.72 | 0.72 | 0 |
2 | 53 | 0.0 | 0.0 | 0.0 | ... | NaN | NaN | 0.00 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
23135 | 132237 | 5.3 | 5.3 | 5.3 | ... | NaN | NaN | 0.00 | 3 |
23136 | 132431 | 3.2 | 3.2 | 3.2 | ... | NaN | NaN | 0.03 | 3 |
23137 | 132471 | 0.5 | 0.5 | 0.5 | ... | NaN | NaN | 0.05 | 3 |
23138 rows × 36 columns
Like the AQS data, this dataframe has many more columns than pandas will display. Another quick way to look at the columns is to simply print the column labels.
pa_sites.columns
Index(['ID', 'pm', 'pm_cf_1', 'pm_atm', 'age', 'pm_0', 'pm_1', 'pm_2', 'pm_3',
'pm_4', 'pm_5', 'pm_6', 'conf', 'pm1', 'pm_10', 'p1', 'p2', 'p3', 'p4',
'p5', 'p6', 'Humidity', 'Temperature', 'Pressure', 'Elevation', 'Type',
'Label', 'Lat', 'Lon', 'Icon', 'isOwner', 'Flags', 'Voc', 'Ozone1',
'Adc', 'CH'],
dtype='object')
In this case, we can guess that the columns we’re most interested in are the sensor IDs (ID), sensor labels (Label), latitude (Lat), and longitude (Lon). Again, we consulted the data dictionary on the PurpleAir website to double-check what the columns represent.
12.1.2.1. What’s the granularity?
Let’s check the ID column for duplicates, as we did for the AQS data.
pa_sites['ID'].value_counts()
20 1
88775 1
88771 1
..
57357 1
57355 1
132471 1
Name: ID, Length: 23138, dtype: int64
The value_counts() method lists the counts in descending order, so we can see that every ID appears only once. This verifies that the granularity of this data is already at the individual sensor level, so we don’t need additional wrangling to adjust the granularity.
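An equivalent spot check that skips reading the counts:

# False: no sensor ID appears more than once
pa_sites['ID'].duplicated().any()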
12.1.2.2. Subsetting
Next, we’ll subset the PurpleAir data.
def subset_pa(df):
subset = df[['ID', 'Label', 'Lat', 'Lon']]
subset.columns = ['id', 'label', 'lat', 'lon']
return subset
pa_sites = (pd.DataFrame(pa_json['data'], columns=pa_json['fields'])
.pipe(subset_pa))
pa_sites
  | id | label | lat | lon |
---|---|---|---|---|
0 | 20 | Oakdale | 40.60 | -111.84 |
1 | 47 | OZONE TEST | 40.48 | -111.88 |
2 | 53 | Lakeshore | 40.25 | -111.70 |
... | ... | ... | ... | ... |
23135 | 132237 | 1965 Valleyview Drive | 50.68 | -120.27 |
23136 | 132431 | Granite Dells | 34.24 | -111.31 |
23137 | 132471 | Joy's House | 34.01 | -117.66 |
23138 rows × 4 columns
Now, the PurpleAir data is ready to join with the AQS data.
12.1.3. Joining AQS and PurpleAir Data
We have two dataframes:
aqs_sites.head(2)
  | site_id | lat | lon |
---|---|---|---|
0 | 01-003-0010 | 30.50 | -87.88 |
1 | 01-027-0001 | 33.28 | -85.80 |
pa_sites.head(2)
  | id | label | lat | lon |
---|---|---|---|---|
0 | 20 | Oakdale | 40.60 | -111.84 |
1 | 47 | OZONE TEST | 40.48 | -111.88 |
Our goal is to match rows from these two dataframes where the sites are within 50 meters of each other. This is a bit more challenging than the joins we’ve seen thus far. For instance, one naive approach is to use the merge method of pandas:
# Does *not* join properly
aqs_sites.merge(pa_sites, left_on=['lat', 'lon'], right_on=['lat', 'lon'])
  | site_id | lat | lon | id | label |
---|---|---|---|---|---|
0 | 06-111-1004 | 34.45 | -119.23 | 48393 | VCAPCD OJ |
However, this matches AQS and PurpleAir sites only when they have exactly the same latitude and longitude. We won’t use the code above because it doesn’t perform the join we want: for each AQS site, we want to find the PurpleAir sites that are “close enough,” so we should match rows even when they have slightly different latitudes and longitudes.
But how much difference in latitude and longitude corresponds to 50 meters? Here, we’ll use a simple approximation: one degree of latitude corresponds to roughly 111,111 meters in the north-south direction, and one degree of longitude corresponds to roughly 111,111 * cos(latitude) meters in the east-west direction [1].
Thus, we can find the latitude and longitude ranges that correspond to 25 meters
in each direction (to make a 50m by 50m rectangle around each point).
magic_meters_per_lat = 111_111
offset_in_m = 25
offset_in_lat = offset_in_m / magic_meters_per_lat
offset_in_lat
0.000225000225000225
import numpy as np

# To simplify even more, we'll use the median latitude for the AQS sites
median_latitude = aqs_sites['lat'].median()
magic_meters_per_lon = 111_111 * np.cos(np.radians(median_latitude))
offset_in_lon = offset_in_m / magic_meters_per_lon
offset_in_lon
0.000291515219937587
Finally, we’ll join AQS and PurpleAir sites together where the coordinates are within offset_in_lat and offset_in_lon of each other. Doing this in SQL is much easier than doing it in pandas, so we’ll push the tables into a temporary SQLite database, then run a join query and read the result back into a dataframe.
import sqlalchemy
# An empty database name uses an in-memory sqlite database that doesn't save
# any files to disk
db = sqlalchemy.create_engine('sqlite://')
aqs_sites.to_sql(name='aqs', con=db, index=False)
pa_sites.to_sql(name='pa', con=db, index=False)
query = f'''
SELECT
aqs.site_id AS aqs_id,
pa.id AS pa_id,
pa.label AS pa_label,
aqs.lat AS aqs_lat,
aqs.lon AS aqs_lon,
pa.lat AS pa_lat,
pa.lon AS pa_lon
FROM aqs JOIN pa
ON pa.lat - {offset_in_lat} <= aqs.lat
AND aqs.lat <= pa.lat + {offset_in_lat}
AND pa.lon - {offset_in_lon} <= aqs.lon
AND aqs.lon <= pa.lon + {offset_in_lon}
'''
matched = pd.read_sql(query, db)
matched
  | aqs_id | pa_id | pa_label | aqs_lat | aqs_lon | pa_lat | pa_lon |
---|---|---|---|---|---|---|---|
0 | 06-019-0011 | 6568 | IMPROVE_FRES2 | 36.79 | -119.77 | 36.79 | -119.77 |
1 | 06-019-0011 | 13485 | AMTS_Fresno | 36.79 | -119.77 | 36.79 | -119.77 |
2 | 06-019-0011 | 44427 | Fresno CARB CCAC | 36.79 | -119.77 | 36.79 | -119.77 |
... | ... | ... | ... | ... | ... | ... | ... |
146 | 53-061-1007 | 3659 | Marysville 7th | 48.05 | -122.17 | 48.05 | -122.17 |
147 | 53-063-0021 | 54603 | Augusta 1 SRCAA | 47.67 | -117.36 | 47.67 | -117.36 |
148 | 56-021-0100 | 50045 | WDEQ-AQD Cheyenne NCore | 41.18 | -104.78 | 41.18 | -104.78 |
149 rows × 7 columns
We’ve achieved our goal—we matched AQS sites with PurpleAir sensors. Barkjohn et al.’s paper mentions that they found 42 candidate matches in August of 2018. We ran this analysis using 2021 data and found 149 candidate matches; perhaps many more AQS or PurpleAir sensors were installed between 2018 and 2021. In either case, this part of the analysis provides a taste of the types of wrangling we’ll use in the rest of this chapter. In the next section, we’ll wrangle, explore, and clean sensor data from an AQS site.
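As one last sanity check, we can reuse the flat-Earth approximation to estimate each matched pair’s separation in meters. This is a rough sketch that relies on the magic_meters_per_lat constant and the matched dataframe from the cells above:

# Approximate north-south and east-west separations in meters
lat_m = (matched['aqs_lat'] - matched['pa_lat']) * magic_meters_per_lat
lon_m = ((matched['aqs_lon'] - matched['pa_lon'])
         * magic_meters_per_lat * np.cos(np.radians(matched['aqs_lat'])))
dist_m = np.sqrt(lat_m ** 2 + lon_m ** 2)
# Every pair should be within a few tens of meters, since the corner-to-corner
# distance of a +/- 25 meter box is roughly 35 meters
dist_m.max()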
[1] This estimation works by assuming that the Earth is perfectly spherical. Then, one degree of latitude is \( \frac{\pi}{180} \cdot r \), where \( r \) is the radius of the Earth in meters. Plugging in the average radius of the Earth gives roughly 111,111 meters per degree of latitude. Longitude works the same way, but the radius of each “ring” around the Earth decreases as we get closer to the poles, so we adjust by a factor of \( \cos(\text{lat}) \). It turns out that the Earth isn’t perfectly spherical, so these estimates can’t be used for precise calculations, like landing a rocket. But for our purposes they do just fine.
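Plugging numbers into this footnote makes the rule of thumb concrete (using an average Earth radius of about 6,371 km):

import numpy as np

r = 6_371_000  # average radius of the Earth, in meters
np.pi / 180 * r  # about 111,195 meters per degree of latitude

This agrees with the 111,111 figure to within 0.1%, which is far more precision than a 50-meter matching rule needs.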