# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hiding content

import sys
import os
if not any(path.endswith('textbook') for path in sys.path):
    sys.path.append(os.path.abspath('../../..'))
from textbook_utils import *

16.3. Bootstrapping for Inference

In many hypothesis tests the assumptions of the null hypothesis lead to a complete specification of a hypothetical population and data design (see Figure 16.1), and we use this specification to simulate the sampling distribution of a statistic. For example, the rank test for the Wikipedia experiment, led us to sampling the integers 1, …, 200, which we easily simulated. Unfortunately, we can’t always specify the population and model completely. To remedy the situation, we substitute the data for the population. This substitution is at the heart of the notion of the bootstrap. Figure 16.2 updates Figure 16.1 to reflect this idea; here the population distribution is replaced by the empirical distribution to create what is called the bootstrap population.

../../_images/bootTriptych.png

Fig. 16.2 This diagram imitates the data generation process by bootstrapping. Comparing this figure to Figure 16.1, you can see that the sample has been placed in the urn to substitute for the population. The far right side shows a bootstrap sample taken using the same chance process as in the original study. In the middle, the sampling distribution of the bootstrap statistic is still a probability distribution based on the same mechanism for selecting the sample as in the original model, but now reflects the use of the bootstrap population.

The rationale for the bootstrap goes like this:

  • Your sample looks like the population because it is a representative sample so we substitute the population with the sample and call it the bootstrap population.

  • Use the original data generation process to produce a sample, which is now called a bootstrap sample to reflect the change in the population. Calculate the statistic on the bootstrap sample in the same manner as before and call it the bootstrap statistic. The bootstrap sampling distribution of the bootstrap statistic should be similar in shape and spread to the true sampling distribution of the statistic.

  • Simulate the data generation process many times, using the bootstrap population, to get bootstrap samples and their bootstrap statistics. The distribution of the simulated bootstrap statistics approximates the bootstrap sampling distribution of the bootstrap statistic.

Take a close look at Figure 16.2 and compare it to Figure 16.1. Essentially, the bootstrap simulation involves two approximations: the original sample approximates the population, and the simulation approximates the sampling distribution. We have been using the second approximation in our examples so far; the approximation of the population by the sample is the core notion behind bootstrapping.

You might be wondering how do you take a simple random sample from your bootstrap population and not wind up with the exact same sample each time. After all, if your sample has 100 units in it and you use it as your bootstrap population, then 100 draws from the bootstrap population without replacement will take all of the units and give you the same bootstrap sample every time. There are two approaches to solving this problem. The first is by far the most common.

  • If the original population is very large, then there is little difference between sampling with and without replacement so when sampling from the bootstrap population, make draws with replacement.

  • “Blow up the sample” to be the same size as the original population. That is, tally up the fraction of each unique value in the sample, and add units to the bootstrap population so it is the same size as the original population, while maintaining the proportions. For example, if the sample is size 30 and 1/3 of the sample values are 0s, then a bootstrap population of 300 should include 100 zeroes. Then use the original data generation procedure to take the bootstrap samples.

The example of vaccine efficacy, used a bootstrap-like process, called the parameterized bootstrap. Our null model specified 0-1 urns, but we didn’t know how many 0s and 1s to put in the urn. We used the sample to determine the proportions of 0s and 1s; that is, the sample specified the parameters of the multivariate hypergeometric.

It’s a common mistake to think that the center of the bootstrap sampling distribution is the same as the center of the true sampling distribution. If the mean of the sample is not 0 then the mean of the bootstrap population is also not 0. That’s why we use the spread of the bootstrap distribution, and not its center, in hypothesis testing. The next example shows how we might use the bootstrap to test a hypothesis.

16.3.1. Boostrapping a test for a regression coefficient

The case study on calibrating air quality monitors (see Chapter 12) fitted a model to adjust the measurements from an inexpensive monitor to more accurately reflected true air quality. This adjustment included a term in the model related to humidity. The fitted coefficient was about \(-0.086\), so that on days of high humidity the measurement is adjusted downward more than on days of low humidity. However, this coefficient is close to 0, and we might wonder whether including humidity in the model is really needed. In other words, we want to test the hypothesis that the coefficient for humidity in the linear model is 0. Unfortunately, we can’t fully specify the model because it is based on measurements taken over a particular time period from a set of air monitors (both PurpleAir and those maintained by the EPA). This is where the bootstrap can help.

Our model makes the assumption that the air quality measurements taken resemble the population of measurements. Note that weather conditions, the time of year, and the location of the monitors makes this statement a bit hand wavy; what we mean here is that the measurements are similar to others taken under the same conditions as those when the measurement were taken. Also, since we can imagine a virtually infinite supply of air quality measurements, we think of the procedure for generating measurements as draws with replacement from the urn. (Recall that in Chapter 2 we modeled the urn as repeated draws with replacement from an urn of measurement errors. This situation is a bit different because we are also including the other factors mentioned already (weather, season, location).

Our model is focussed on the coefficient for humidity in the linear model:

\[ \begin{aligned} \text{AQ} = \theta_0 + \theta_1 \text{PA} + \theta_2 \text{RH} + ~error \end{aligned} \]

Here, \(\text{PA}\) refers to the PurpleAir PM2.5 measurement, \( \text{RH}\) is the relative humidity, and \(\text{AQ}\) stands for the more exact measurement of PM2.5 made by the EPA monitors. The null hypothesis is \(\theta_2 = 0\); that is, the null model is the simpler model:

\[ \begin{aligned} \text{AQ} = \theta_0 + \theta_1 \text{PA} + ~error \end{aligned} \]

We use the linear model fitting procedure from Chapter 12 except this time, we return only the fitted coefficient.

def coeff_humidity(data):
    '''f(x) = θ₀ + θ₁PA + θ₂RH'''
    # Fit calibration model using sklearn
    X, y = data[['pm25aqs', 'rh']], data['pm25pa']
    model = LinearRegression().fit(X, y)
    [m1, m2], b = model.coef_, model.intercept_
    
    # Invert to find parameter
    theta_2 = - m2 / m1
    
    return theta_2

Our boostrap population is the training set that we created in Chapter 12). Now, we sample rows from the data frame (which is equivalent to our urn) with replacement using the chance mechansism randint. This function takes random samples with replacement from a set of integers. We use the random sample of indices to create the bootstrap sample from the data frame. Then we fit the linear model and get the coefficient for humidity (our bootstrap statistic). The boot_stat function below performs this simulation process.

from scipy.stats import randint

n = len(train)
theta_hat = coeff_humidity(train)

def boot_stat(train):
    r = randint.rvs(low=0, high=(n-1), size=n)
    theta = coeff_humidity(train.iloc[r, :])
    
    return theta

When we repeat this process 10,000 times, we get an approximation to the bootstrap sampling distribution of the bootstrap statistic (the fitted humidity coefficient).

rng = np.random.default_rng(42)

boot_theta_hat = [boot_stat(train) for _ in range(10000)]

We are interested in the shape and spread of this bootstrap sampling distribution. (We know that the center will be close to the original coefficient of \(-0.086\).)

Text(0.5, 0, 'Humidity Coefficient $\\hat{\\theta}_B$')
../../_images/inf_pred_gen_boot_12_1.svg

By design, the center of the bootstrap sampling distribution will be near \(\hat{\theta}\) because the bootstrap population consists of the observed data. The hypothesized value of \(0\) is far from the sampling distribution:

None of the 10,000 simulated regression coefficients are as large as the hypothesized coefficient. Statistical logic leads us to reject the null hypothesis that we do not need to adjust the model for humidity.

The form of the hypothesis test we performed here looks different than the earlier tests because the sampling distribution of the statistic is not centered on the null. That is because we are using the bootstrap to create the sampling distribution. We are in effect using a confidence interval for the coefficient to test the hypothesis. In the next section we introduce interval estimates more generally, including those based on the bootstrap, and we connect the concepts of hypothesis testing and confidence intervals.