12.6. Exercises

  • In the Finding Collocated Sensors section, we used an approximation to find AQS and PurpleAir sensors within 50 meters of each other. Geospatial data appears in all kinds of domains, and data scientists have a variety of tools for working with it. One such tool is the geopandas package. Use geopandas to create a map of the US with the AQS sites marked.
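    A minimal sketch, assuming aqs is a data frame with hypothetical lon and lat columns, and using the Census Bureau's cartographic boundary files as one possible source of state outlines:

    ```python
    import geopandas as gpd
    import matplotlib.pyplot as plt

    # Assumes `aqs` has hypothetical `lon` and `lat` columns holding
    # longitude and latitude in degrees (WGS84).
    sites = gpd.GeoDataFrame(
        aqs,
        geometry=gpd.points_from_xy(aqs["lon"], aqs["lat"]),
        crs="EPSG:4326",
    )

    # One possible source of US state outlines; any boundary file works.
    states = gpd.read_file(
        "https://www2.census.gov/geo/tiger/GENZ2021/shp/cb_2021_us_state_20m.zip"
    )

    ax = states.plot(color="white", edgecolor="gray", figsize=(10, 6))
    sites.plot(ax=ax, color="tab:blue", markersize=5)
    plt.show()
    ```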

  • Use a geopandas spatial join to find the closest PurpleAir sensor to each AQS sensor.
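    A sketch using geopandas's sjoin_nearest, assuming aqs_sites and pa_sites are hypothetical GeoDataFrames of sensor locations:

    ```python
    import geopandas as gpd

    # Assumes `aqs_sites` and `pa_sites` are GeoDataFrames with point
    # geometries in EPSG:4326 (hypothetical names).
    # Reproject to a planar CRS so distances come out in meters;
    # EPSG:5070 (CONUS Albers) is one reasonable choice for the
    # contiguous US.
    aqs_m = aqs_sites.to_crs(epsg=5070)
    pa_m = pa_sites.to_crs(epsg=5070)

    # For each AQS sensor, attach the nearest PurpleAir sensor and record
    # the distance between them in a new column.
    nearest = gpd.sjoin_nearest(aqs_m, pa_m, how="left", distance_col="dist_m")
    ```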

  • Although our data cleaning process closely followed Barkjohn’s, we had to omit some steps for brevity. Read Section 3 (Quality assurance) of Barkjohn’s paper and note the additional steps the original analysis took that we did not include in this chapter. Which steps might be most important to include?

  • Barkjohn’s paper also distinguishes AQS sensors by whether they are FRM (Federal Reference Method) or FEM (Federal Equivalent Method). Do some research of your own to answer: what’s the difference between these two types of sensors? Is one type more accurate than the other? Why did Barkjohn decide to include both types of sensors in their analysis?

  • When we analyzed the PurpleAir data, we pointed out that PurpleAir sensors apply two different corrections to the raw laser readings. One correction is named CF1, and the other is named ATM. Conduct your own EDA to find out how these two corrections differ in the data.
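    One way to start, assuming the readings live in a data frame pa whose correction columns are named pm25_cf_1 and pm25_atm (hypothetical stand-ins for whatever your download uses):

    ```python
    import matplotlib.pyplot as plt

    # Hypothetical column names; substitute the ones in your data.
    pa.plot.scatter(x="pm25_atm", y="pm25_cf_1", s=2, alpha=0.3)
    plt.axline((0, 0), slope=1, color="gray", linestyle="--")  # y = x reference
    plt.show()

    # Summarize how far apart the two corrections are.
    print((pa["pm25_cf_1"] - pa["pm25_atm"]).describe())
    ```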

  • In Section 12.4, we wrote Model 2 as:

    \[ \begin{aligned} f_{\theta}(x_i) = \text{PA}_i + \theta \end{aligned} \]

    Derive that \( \hat{\theta} = \frac{1}{n} \sum_i(\text{AQS}_i - \text{PA}_i) \) is the value for \( \theta \) that minimizes the mean squared loss.
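    One way to begin: write out the mean squared loss that \( \hat{\theta} \) must minimize,

    \[ \begin{aligned} L(\theta) = \frac{1}{n} \sum_i \left( \text{AQS}_i - (\text{PA}_i + \theta) \right)^2 \end{aligned} \]

    then differentiate with respect to \( \theta \), set the derivative to zero, and solve.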

  • Consider the simple linear model without the intercept term. That is:

    \[ \begin{aligned} f_{\theta}(x_i) = \theta \cdot \text{PA}_i \end{aligned} \]

    Derive \( \hat{\theta} \), the model parameter that minimizes the mean squared loss. Then fit this model on the data and compute its test set RMSE. How does it compare against the other models?
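    Once you have your closed-form \( \hat{\theta} \), a minimal fitting sketch, assuming hypothetical arrays pa_train, aqs_train, pa_test, and aqs_test that hold the matched readings:

    ```python
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    # Hypothetical names: 1-D NumPy arrays of matched PurpleAir (pa_*) and
    # AQS (aqs_*) readings, split into training and test sets.
    model = LinearRegression(fit_intercept=False)  # no intercept term
    model.fit(pa_train.reshape(-1, 1), aqs_train)

    preds = model.predict(pa_test.reshape(-1, 1))
    rmse = np.sqrt(mean_squared_error(aqs_test, preds))
    print(f"theta_hat = {model.coef_[0]:.3f}, test RMSE = {rmse:.2f}")
    ```

    As a sanity check, the fitted coefficient should match your closed-form \( \hat{\theta} \).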

  • (Needs background from Chapter 15.) For Model 3, we fit a calibration model, then inverted it to find the prediction model. Fit a prediction model directly, without fitting a calibration model first. You might be surprised to see that this model’s RMSE is lower than Model 3’s. Why will the training set RMSE of the direct linear regression model always be lower than that of the inverted calibration model? Why might we prefer the calibration model anyway?
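    A sketch of the comparison on the training set, again assuming hypothetical pa_train and aqs_train arrays:

    ```python
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Direct prediction model: regress AQS readings on PurpleAir readings.
    direct = LinearRegression().fit(pa_train.reshape(-1, 1), aqs_train)
    direct_preds = direct.predict(pa_train.reshape(-1, 1))

    # Calibration model (as in Model 3): regress PA on AQS, then invert
    # the fitted line to predict AQS from PA.
    calib = LinearRegression().fit(aqs_train.reshape(-1, 1), pa_train)
    b, m = calib.intercept_, calib.coef_[0]
    inverted_preds = (pa_train - b) / m

    def rmse(actual, pred):
        return np.sqrt(np.mean((actual - pred) ** 2))

    # Least squares minimizes training error over all lines in PA, and the
    # inverted calibration fit is one such line, so the direct model's
    # training RMSE can never be higher.
    print(rmse(aqs_train, direct_preds), rmse(aqs_train, inverted_preds))
    ```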