{ "cells": [ { "cell_type": "code", "execution_count": 19, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "import sys\n", "import os\n", "if not any(path.endswith('textbook') for path in sys.path):\n", " sys.path.append(os.path.abspath('../../..'))\n", "from textbook_utils import *" ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "(ch:pa_cleaning_aqs)=\n", "# Wrangling and Cleaning AQS Sensor Data" ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "Now that we have located sensors that are near each other, we are ready to wrangle and clean the files that contain the measurement data for these sites. We demonstrate the tasks involved with one AQS instrument and its matching PurpleAir sensor. We picked a pair located in Sacramento, California. The AQS sensor ID is\n", "`06-067-0010`, and the PurpleAir sensor name is `AMTS_TESTINGA`." ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "The AQS provides a website and [API](https://aqs.epa.gov/aqsweb/documents/data_api.html) to download sensor data.\n", "We downloaded the daily measurements from May 20, 2018, to December 29, 2019, into the\n", "_data/aqs_06-067-0010.csv_ file. Let's begin by loading this file into a dataframe:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(2268, 31)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aqs_full = pd.read_csv('data/aqs_06-067-0010.csv')\n", "aqs_full.shape" ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "From the [data dictionary](https://aqs.epa.gov/aqsweb/airdata/FileFormats.html#_hourly_data_files), we find out that the\n", "column called `arithmetic_mean` corresponds to the actual PM2.5 measurements. Some AQS\n", "sensors take a measurement every hour. \n", "For our analysis, we downloaded the 24-hour averages (the arithmetic mean) of the\n", "hourly sensor measurements." ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "Let's carry out some quality checks and clean the data where necessary.\n", "We focus on checks related to scope and quality of values: \n", "\n", "1. Check and correct the granularity of the data.\n", "1. Remove unneeded columns.\n", "1. Check values in the `date_local` column.\n", "1. Check values in the `arithmetic_mean` column.\n", "\n", "For the sake of brevity, we've chosen a few important quality checks that\n", "specifically reinforce ideas we've covered in data wrangling, EDA, and visualization." ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "## Checking Granularity" ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "We would like each row of our data to correspond to a single date with an average\n", "PM2.5 reading for that date. 
As we saw earlier, a simple way to check is to see whether there are repeat values in the `date_local` column:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "date_local\n", "2019-01-03 12\n", "2018-12-31 12\n", "2018-12-28 12\n", " ..\n", "2018-11-28 12\n", "2018-11-25 12\n", "2018-11-22 12\n", "Name: count, Length: 189, dtype: int64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aqs_full['date_local'].value_counts()" ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "Indeed, there are 12 rows for each date, so the granularity is *not* at the individual date level. " ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "From the data dictionary, we learn that there are multiple standards for\n", "computing the final measurements from the raw sensor data.\n", "The `pollutant_standard` column contains the name of each standard.\n", "The `event_type` column marks whether data measured during \"exceptional events\"\n", "are included in the measurement.\n", "Let's check how different these average values are by calculating the range of 12 measurements:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "arithmetic_mean\n", "0.0 189\n", "Name: count, dtype: int64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(aqs_full\n", " .groupby('date_local')\n", " ['arithmetic_mean']\n", " .agg(np.ptp) # np.ptp computes max() - min()\n", " .value_counts()\n", ")" ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "For all 189 dates, the max PM2.5 - min PM2.5 is 0. This means that we can simply take the first PM2.5 measurement for each date:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "def rollup_dates(df):\n", " return (\n", " df.groupby('date_local')\n", " .first()\n", " .reset_index()\n", " )" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(189, 31)" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aqs = (aqs_full\n", " .pipe(rollup_dates))\n", "aqs.shape" ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "This data-cleaning step gives us the desired granularity:\n", "each row represents a single date, with an average PM2.5 measurement for that date.\n", "Next, we further modify the structure of the dataframe and drop unneeded columns. " ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "## Removing Unneeded Columns" ] }, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "We plan to match the PM2.5 measurements in the AQS dataframe with\n", "the PurpleAir PM2.5 measurements for each date.\n", "To simplify the structure, we can drop all but the date and PM2.5 columns.\n", "We also rename the PM2.5 column so that it's easier to understand:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "def drop_cols(df):\n", " subset = df[['date_local', 'arithmetic_mean']]\n", " return subset.rename(columns={'arithmetic_mean': 'pm25'})" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | date_local | \n", "pm25 | \n", "
---|---|---|
0 | \n", "2018-05-20 | \n", "6.5 | \n", "
1 | \n", "2018-05-23 | \n", "2.3 | \n", "
2 | \n", "2018-05-29 | \n", "11.8 | \n", "
3 | \n", "2018-06-01 | \n", "6.0 | \n", "
4 | \n", "2018-06-04 | \n", "8.0 | \n", "
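, { "cell_type": "markdown", "metadata": { "user_expressions": [] }, "source": [ "That leaves the last two checks on our list: the values in the `date_local` and `pm25` columns. Here is a minimal first look at both (a sketch, assuming `pd.to_datetime` can parse the dates as stored):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch of the remaining checks (assumed approach, not the\n", "# full analysis): parse the dates and summarize the PM2.5 values\n", "dates = pd.to_datetime(aqs['date_local'])\n", "# The earliest and latest dates should match the download window\n", "print(dates.min(), dates.max())\n", "# Summary statistics help flag impossible values, like negative PM2.5\n", "aqs['pm25'].describe()" ] }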