{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Reference: https://jupyterbook.org/interactive/hiding.html\n",
"# Use {hide, remove}-{input, output, cell} tags to hiding content\n",
"\n",
"import sys\n",
"import os\n",
"if not any(path.endswith('textbook') for path in sys.path):\n",
" sys.path.append(os.path.abspath('../../..'))\n",
"from textbook_utils import *"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"csv_file = 'data/Full24hrdataset.csv'\n",
"usecols = ['Date', 'ID', 'region', 'PM25FM', 'PM25cf1', 'RH']\n",
"full = (pd.read_csv(csv_file, usecols=usecols, parse_dates=['Date'])\n",
" .dropna())\n",
"full.columns = ['date', 'id', 'region', 'pm25aqs', 'pm25pa', 'rh']\n",
"\n",
"bad_dates = ['2019-08-21', '2019-08-22', '2019-09-24']\n",
"GA = full.loc[(full['id'] == 'GA1') & (~full['date'].isin(bad_dates)) , :]"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"\n",
"y = GA[['pm25pa']]\n",
"X = GA[['pm25aqs', 'rh']]\n",
"\n",
"model2 = LinearRegression().fit(X, y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# Bootstrapping for Inference\n",
"\n",
"In many hypothesis tests the assumptions of the null hypothesis lead to a complete specification of a hypothetical population and data design (see {numref}`Figure %s `), and we use this specification to simulate the sampling distribution of a statistic. For example, the rank test for the Wikipedia experiment, led us to sampling the integers 1, ..., 200, which we easily simulated. Unfortunately, we can't always specify the population and model completely. To remedy the situation, we substitute the data for the population. This substitution is at the heart of the notion of the bootstrap. {numref}`Figure %s ` updates {numref}`Figure %s ` to reflect this idea; here the population distribution is replaced by the empirical distribution to create what is called the *bootstrap population*. \n",
"\n",
"```{figure} bootTriptych.png\n",
"---\n",
"name: boot-triptych\n",
"---\n",
"\n",
"This diagram imitates the data generation process by bootstrapping. Comparing this figure to {numref}`Figure %s `, you can see that the sample has been placed in the urn as a substitute for the population. The far right side shows a bootstrap sample taken using the same chance process as in the original study. In the middle, the sampling distribution of the bootstrap statistic is still a probability distribution based on the same mechanism for selecting the sample as in the original model, but now uses the bootstrap population. \n",
"```\n",
"\n",
"The rationale for the bootstrap goes like this:\n",
"+ Your sample looks like the population because it is a representative sample so we substitute the population with the sample and call it the bootstrap population.\n",
"+ Use the same data generation process that produced the original sample to get a new sample, which is called a *bootstrap sample* to reflect the change in the population. Calculate the statistic on the bootstrap sample in the same manner as before and call it the *bootstrap statistic*. The *bootstrap sampling distribution* of the bootstrap statistic should be similar in shape and spread to the true sampling distribution of the statistic.\n",
"+ Simulate the data generation process many times, using the bootstrap population, to get bootstrap samples and their bootstrap statistics. The distribution of the simulated bootstrap statistics approximates the bootstrap sampling distribution of the bootstrap statistic,\n",
"which itself approximates the original sampling distribution. \n",
"\n",
"Take a close look at {numref}`Figure %s ` and compare it to {numref}`Figure %s `. Essentially, the bootstrap simulation involves two approximations: the original sample approximates the population, and the simulation approximates the sampling distribution. We have been using the second approximation in our examples so far; the approximation of the population by the sample is the core notion behind bootstrapping. \n",
" \n",
"You might be wondering how to take a simple random sample from your bootstrap population and not wind up with the exact same sample each time. After all, if your sample has 100 units in it and you use it as your bootstrap population, then 100 draws from the bootstrap population without replacement will take all of the units and give you the same bootstrap sample every time. There are two approaches to solving this problem. The first is by far the most common.\n",
"\n",
"+ If the original population is very large, then there is little difference between sampling with and without replacement so when sampling from the bootstrap population, make draws with replacement.\n",
"+ \"Blow up the sample\" to be the same size as the original population. That is, tally up the fraction of each unique value in the sample, and add units to the bootstrap population so it is the same size as the original population, while maintaining the proportions. For example, if the sample is size 30 and 1/3 of the sample values are 0, then a bootstrap population of 300 should include 100 zeroes. Once you have this bootstrap population, use the original data generation procedure to take the bootstrap samples. \n",
"\n",
"The example of vaccine efficacy, used a bootstrap-like process, called the *parameterized bootstrap*. Our null model specified 0-1 urns, but we didn't know how many 0s and 1s to put in the urn. We used the sample to determine the proportions of 0s and 1s; that is, the sample specified the parameters of the multivariate hypergeometric. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
":::{warning}\n",
"\n",
"It's a common mistake to think that the center of the bootstrap sampling distribution is the same as the center of the true sampling distribution. If the mean of the sample is not 0 then the mean of the bootstrap population is also not 0. That's why we use the spread of the bootstrap distribution, and not its center, in hypothesis testing. The next example shows how we might use the bootstrap to test a hypothesis.\n",
"\n",
":::"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Boostrapping a test for a regression coefficient \n",
"\n",
"The case study on calibrating air quality monitors (see {numref}`Chapter %s `) fitted a model to adjust the measurements from an inexpensive monitor to more accurately reflected true air quality. This adjustment included a term in the model related to humidity. The fitted coefficient was about $0.22$, so that on days of high humidity the measurement is adjusted upward more than on days of low humidity. However, this coefficient is close to 0, and we might wonder whether including humidity in the model is really needed. In other words, we want to test the hypothesis that the coefficient for humidity in the linear model is 0. Unfortunately, we can't fully specify the model because it is based on measurements taken over a particular time period from a set of air monitors (both PurpleAir and those maintained by the EPA). This is where the bootstrap can help. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our model makes the assumption that the air quality measurements taken resemble the population of measurements. Note that weather conditions, the time of year, and the location of the monitors makes this statement a bit hand-wavy; what we mean here is that the measurements are similar to others taken under the same conditions as those when the original measurement were taken. Also, since we can imagine a virtually infinite supply of air quality measurements, we think of the procedure for generating measurements as draws with replacement from the urn. (Recall that in {numref}`Chapter %s ` we modeled the urn as repeated draws with replacement from an urn of measurement errors. This situation is a bit different because we \n",
"are also including the other factors mentioned already (weather, season, location)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our model is focussed on the coefficient for humidity in the linear model:\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
"\\text{PA} = \\theta_0 + \\theta_1 \\text{AQ} + \\theta_2 \\text{RH} + ~error\n",
"\\end{aligned}\n",
"$$\n",
"\n",
"Here, $\\text{PA}$ refers to the PurpleAir PM2.5 measurement, $ \\text{RH}$ is the\n",
"relative humidity, and $\\text{AQ}$ stands for the more exact measurement of PM2.5 made by the EPA monitors.\n",
"The null hypothesis is $\\theta_2 = 0$; that is, the null model is the simpler model:\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
"\\text{PA} = \\theta_0 + \\theta_1 \\text{AQ} + ~error \n",
"\\end{aligned}\n",
"$$\n",
"\n",
"We use the linear model fitting procedure from {numref}`Chapter %s `. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our bootstrap population are the measurements from Georgia that we used in {numref}`Chapter %s `). \n",
"Now, we sample rows from the data frame (which is equivalent to our urn) with replacement using the chance mechanism `randint`. This function takes random samples with replacement from a set of integers. We use the random sample of indices to create the bootstrap sample from the data frame. Then we fit the linear model and get the coefficient for humidity (our bootstrap statistic). The `boot_stat` function below performs this simulation process."
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.2262808797043853"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"n = len(GA)\n",
"\n",
"y = GA[['pm25pa']]\n",
"X = GA[['pm25aqs', 'rh']]\n",
"\n",
"model2 = LinearRegression().fit(X, y)\n",
"\n",
"from scipy.stats import randint\n",
"r = randint.rvs(low=0, high=(n-1), size=n)"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([2.21, 0.17])"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"m.coef_[0]"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"from scipy.stats import randint\n",
"\n",
"n = len(GA)\n",
"y = GA[['pm25pa']]\n",
"X = GA[['pm25aqs', 'rh']]\n",
"\n",
"theta2_orig = LinearRegression().fit(X, y).coef_[0][1]\n",
"\n",
"def boot_stat(X, y):\n",
" r = randint.rvs(low=0, high=(n-1), size=n)\n",
" \n",
" theta2 = LinearRegression().fit(X.iloc[r, :], y.iloc[r]).coef_[0][1]\n",
" \n",
" return theta2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When we repeat this process 10,000 times, we get an approximation to the bootstrap sampling distribution of the bootstrap statistic (the fitted humidity coefficient)."
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [],
"source": [
"rng = np.random.default_rng(42)\n",
"\n",
"boot_theta_hat = [boot_stat(X, y) for _ in range(10000)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are interested in the shape and spread of this bootstrap sampling distribution. (We know that the center will be close to the original coefficient of $0.21$.)"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"tags": [
"remove-input"
]
},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5, 0, 'Humidity Coefficient $\\\\hat{\\\\theta}_B$')"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"