16.7. Probability, Inference, and Prediction

Hypothesis testing, confidence intervals, and prediction intervals rely on probability calculations computed from the sampling distribution and the data generation process. These probability frameworks also enable us to run simulation and bootstrap studies for a hypothetical survey, an experiment, or some other chance process in order to study its random behavior. For example, we found the sampling distribution for an average of ranks under the assumption that the treatment in a Wikipedia experiment was not effective. Using simulation, we quantified the typical deviations from the expected outcome and the distribution of the possible values for the summary statistic. The triptych in Figure 16.1 provided a diagram to guide us in the process; it helped keep straight the differences between the population, probability, and sample and also showed their connections. In this section, we bring more mathematical rigor to these concepts.

We formally introduce the notions of expected value, standard deviation, and random variable. We begin with the Wikipedia example to describe these notions, before generalizing them. Along the way, we connect this formalism to the triptych that we have used as our guide throughout the chapter, but this time we create a triptych that represents the distributions for this particular example (see Figure 16.3).


Fig. 16.3 This diagram shows the population, sampling, and sample distributions and their summaries from the Wikipedia example. In this example, the population is known to consist of the integers from 1 to 200, and the sample consists of the ranks of the observed post-productivity measurements for the treatment group. In the middle, the sampling distribution of the average rank is created from a simulation study. Notice it is normal in shape with a center that matches the population average.

16.7.1. Formalizing the theory for average rank statistics

Recall that in the Wikipedia example, we pooled the post-award productivity values from the treatment and control groups and converted them into ranks, \(1, 2, 3, \ldots, 200\), so the population is simply made up of the integers from \(1\) to \(200\). This means the population distribution is flat as it ranges from \(1\) to \(200\) (see the left side of Figure 16.3). In this example, the population summary (also called population parameter) we used was the average rank:

\[\theta^* ~=~ Avg(pop) ~=~ \frac{1}{200} \Sigma_{k=1}^{200} k ~=~ 100.5. \]

Another relevant summary is the spread about \(\theta^*\), defined as:

\[ SD(pop) ~=~ \sqrt{\frac {1}{200} \Sigma_{k=1}^{200} (k - \theta^*)^2} ~=~ \sqrt{\frac {1}{200} \Sigma_{k=1}^{200} (k - 100.5)^2} ~\approx~ 57.7 \]

The SD(pop), which is short for the population standard deviation, represents the typical deviation of a rank from the population average. Calculating SD(pop) by hand for this example takes some mathematical handiwork. If you want to learn more, see Pitman.
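The two population summaries can also be checked numerically rather than by hand. The following is a minimal sketch using only the Python standard library; the variable names (`population`, `avg_pop`, `var_pop`, `sd_pop`) are ours and do not come from the text.

```python
import math

# The population: the ranks 1 through 200, as described in the text.
population = list(range(1, 201))
N = len(population)

# Population average: the sum of the values divided by the population size.
avg_pop = sum(population) / N

# Population SD: divide by N (not N - 1) because this is the entire population.
var_pop = sum((k - avg_pop) ** 2 for k in population) / N
sd_pop = math.sqrt(var_pop)

print(avg_pop)           # 100.5
print(round(sd_pop, 1))  # 57.7
```

Dividing by `N` rather than `N - 1` matches the definition of SD(pop) given above.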

The observed sample consists of the integer ranks of the treatment group; we refer to these values as \(k_1, k_2, \ldots, k_{100}\). The sample distribution appears on the right in Figure 16.3 (each of the 100 integers appears once).

A parallel to the population average is the sample average, which was our statistic of interest:

\[ Avg(sample) ~=~ \frac{1}{100} \Sigma_{i=1}^{100} k_i ~=~ \bar{k} ~=~113.7. \]

The \(Avg(sample)\) is the observed value for \(\hat{\theta}\). Similarly, the spread about \(Avg(sample)\), called the standard deviation of the sample, represents the typical deviation of a rank in the sample from the sample average:

\[ SD(sample) ~=~ \sqrt{\frac {1}{100} \Sigma_{i=1}^{100} (k_i - \bar{k})^2} ~=~ 55.3.\]

We have uncovered a direct parallel between the sample statistic and the population parameter in the case where they are averages. The parallel between the two SDs is also noteworthy.

Next we turn to the data generation process itself: draw 100 marbles from the urn (with values \(1, 2,\ldots,200\)), without replacement, to create the treatment ranks. We represent the action of drawing the first marble from the urn and the integer that we get, by the capital letter \(Z_1\). This \(Z_1\) is called a random variable. It has a probability distribution determined by the urn model. That is, we can list all of the values that \(Z_1\) might take and the probability associated with each:

\[{\mathbb{P}}(Z_1 = k) ~=~ \frac{1}{200} ~~~~\textrm{for }k=1, \ldots, 200.\]

In this example, the probability distribution of \(Z_1\) is determined by a simple formula because all of the integers are equally likely to be drawn from the urn. (Chapter %s first introduces the notion of a probability distribution).

We often summarize the distribution of a random variable by its expected value and standard deviation. As with the population and sample, these two quantities give us a sense of what to expect as an outcome and how far the actual value might be from what is expected.

For our example, the expected value of \(Z_1\) is simply,

\[\begin{split} \begin{aligned} \mathbb{E}[Z_1] &= 1 \mathbb{P}(Z_1 = 1) + 2 \mathbb{P}(Z_1 = 2) + \cdots + 200 \mathbb{P}(Z_1 = 200) \\ &= 1 \times \frac{1}{200} + 2 \times \frac{1}{200} + \cdots + 200 \times \frac{1}{200} \\ &= 100.5 \end{aligned} \end{split}\]

Notice that \(\mathbb{E}[Z_1] = \theta^*\), the population average from the urn. The average value in a population and the expected value of a random variable that represents one draw from an urn containing the population are always the same. This is more easily seen by expressing the population average as a weighted average of the unique values in the population weighted by the fraction of units that have that value. The expected value of a random variable of a draw at random from the population urn uses the exact same weights because they match the chance of selecting the particular value.


The term expected value can be a bit confusing because it need not be a possible value of the random variable. For example, \(\mathbb{E}[Z_1] = 100.5\), but only integers are possible values for \(Z_1\).

Next, the variance of \(Z_1\) is

\[\begin{split} \begin{aligned} \mathbb{V}(Z_1) &= [1 - \mathbb{E}(Z_1)]^2 \mathbb{P}(Z_1 = 1) + \cdots + [200 - \mathbb{E}(Z_1)]^2 \mathbb{P}(Z_1 = 200) \\ &= (1 - 100.5)^2 \times \frac{1}{200} + \cdots + (200 - 100.5)^2 \times \frac{1}{200} \\ &= 3333.25 \end{aligned} \end{split}\]

So \(SD(Z_1) = \sqrt{3333.25} \approx 57.7\). We again point out that the standard deviation of \(Z_1\) matches the \(SD(pop)\).
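The expected value and variance of \(Z_1\) are both probability-weighted sums, and that structure can be written down directly. This is a small sketch, not code from the text; the helper names `expected_value` and `variance` and the dictionary `pmf` are hypothetical.

```python
import math

# Probability distribution of Z_1: each rank 1..200 has chance 1/200.
pmf = {k: 1 / 200 for k in range(1, 201)}

def expected_value(pmf):
    """Average of the possible values, weighted by their probabilities."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Probability-weighted average of squared deviations from the expected value."""
    mu = expected_value(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

ev_z1 = expected_value(pmf)
var_z1 = variance(pmf)
sd_z1 = math.sqrt(var_z1)

print(round(ev_z1, 2), round(var_z1, 2), round(sd_z1, 1))  # 100.5 3333.25 57.7
```

Because the functions take any dictionary of values and probabilities, the same two helpers work for a distribution that is not uniform.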

To describe the entire data generation process in the triptych, we also define \(Z_2, Z_3, \ldots, Z_{100}\) as the results of the remaining 99 draws from the urn. By symmetry, these random variables all have the same probability distribution. That is,

\[\mathbb{P}(Z_1 = 17) ~=~ \mathbb{P}(Z_2 = 17) ~=~ \cdots ~=~ \mathbb{P}(Z_{100} = 17) ~=~ \frac{1}{200}.\]

This implies that each \(Z_i\) has the same expected value, 100.5, and standard deviation, 57.7. However, these random variables are not independent. For example, if you know that \(Z_1 = 17\), then it is not possible for \(Z_2 = 17\). More on this later.
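Both claims (identical distributions, but dependence between draws) can be checked exactly for the first two draws by listing every ordered pair of distinct ranks. The sketch below assumes only the urn model described above; the names `pair_prob`, `p_z2_is_17`, `p_both_17`, and `cov_z1_z2` are ours.

```python
from itertools import permutations

N = 200
ranks = range(1, N + 1)

# Drawing without replacement: each ordered pair of distinct ranks is equally likely.
pair_prob = 1 / (N * (N - 1))

# Chance that the second draw is 17, summed over every possible first draw.
p_z2_is_17 = sum(pair_prob for a, b in permutations(ranks, 2) if b == 17)

# Chance that both draws are 17: no such pair exists, so the sum is empty.
p_both_17 = sum(pair_prob for a, b in permutations(ranks, 2) if a == b == 17)

# Covariance of the first two draws; it is negative because the draws compete.
mu = (N + 1) / 2
cov_z1_z2 = sum((a - mu) * (b - mu) * pair_prob
                for a, b in permutations(ranks, 2))

print(round(p_z2_is_17, 6))  # 0.005, the same as for the first draw
print(p_both_17)             # 0
print(round(cov_z1_z2, 2))   # -16.75
```

If the two draws were independent, the chance that both equal 17 would be \((1/200)^2\) rather than 0, and the covariance would be 0.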

To complete the middle portion of the triptych, which involves \(\hat{\theta}\) and its sampling distribution, we describe the average rank statistic as follows:

\[\hat{\theta} = \frac{1}{100} \Sigma_{i=1}^{100} Z_i\]

Our simulation study showed us that the sampling distribution for \(\hat{\theta}\) looks normal in shape. We can use the expected value and SD of \(Z_1\) and our knowledge of the data generation process to find the expected value and SD of \(\hat{\theta}\). However, we need some more information about how combinations of random variables behave so we present the results and then circle back to explain why.

\[\begin{split} \begin{align} \mathbb{E}(\hat{\theta}) ~&=~ \mathbb{E}[\frac{1}{100} \Sigma_{i=1}^{100} Z_i]\\ ~&=~ \frac{1}{100} \Sigma_{i=1}^{100} \mathbb{E}[Z_i] \\ ~&=~ 100.5 \\ ~&=~ \theta^* \end{align} \end{split}\]

That is, the expected value of the average of draws from the population is the population average. Below we provide formulas for the variance of the average of the draws in terms of the population variance, as well as the SD.

\[\begin{split} \begin{align} \mathbb{V}(\hat{\theta}) ~&=~ \mathbb{V}[\frac{1}{100} \Sigma_{i=1}^{100} Z_i]\\ ~&=~ \frac{200-100}{200-1} \times \frac{\mathbb{V}(Z_1)}{100} \\ ~&=~ 16.75 \\ ~&~\\ SD(\hat{\theta}) ~&=~ \sqrt{\frac{100}{199}} \frac{SD(Z_1)}{10} \\ ~&\approx~ 4.1 \end{align} \end{split}\]
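As a rough check on these values, we can compare the formula against a simulation of the urn model. This is a sketch under stated assumptions: the seed and the number of repetitions (10,000) are arbitrary choices of ours, and the formula uses the factor \((N-n)/(N-1)\) with \(N = 200\) and \(n = 100\).

```python
import math
import random
import statistics

N, n = 200, 100
population = range(1, N + 1)
sd_pop = statistics.pstdev(population)  # SD of the ranks 1..200

# Formula: finite population factor times the SD of an average of n draws.
sd_theta_hat = math.sqrt((N - n) / (N - 1)) * sd_pop / math.sqrt(n)

# Simulation: repeatedly draw n ranks without replacement and average them.
rng = random.Random(42)  # arbitrary seed, for reproducibility only
sim_avgs = [statistics.fmean(rng.sample(population, n)) for _ in range(10_000)]

print(round(sd_theta_hat, 2))                 # 4.09
print(round(statistics.fmean(sim_avgs), 1))   # should be close to 100.5
print(round(statistics.pstdev(sim_avgs), 2))  # should be close to 4.09
```

The simulated center and spread only approximate the formula values; the agreement tightens as the number of repetitions grows.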

These computations relied on several properties of the expected value and variance of a random variable and of sums of random variables. We conclude this section by providing these properties for general random variables.

16.7.2. General properties of random variables

In general, a random variable represents a numeric outcome of a probabilistic event. In this book, we use capital letters like \(X\) or \(Y\) or \(Z\) to denote a random variable. The probability distribution for \(X\) is the specification, \(\mathbb{P}(X = x) = p_x\) for all values \(x\) that the random variable takes on. Although random variables can represent either discrete (e.g., the number of children in a family drawn at random from a population) or continuous (e.g., the air quality measured by an air monitor) quantities, we simplify all random variables to those with discrete outcomes. Since most measurements are made to a certain degree of precision, this simplification doesn’t limit us too much.

Then, the expected value of \(X\) is defined as:

\[\mathbb{E}[X] = \sum_{x} x p_x,\]

the variance of \(X\) is defined as:

\[\begin{split} \begin{align} \mathbb{V}(X) ~&=~ \mathbb{E}[(X - \mathbb{E}[X])^2] \\ ~&=~ \sum_{x} [x - \mathbb{E}(X)]^2 p_x, \end{align} \end{split}\]

and, the \(SD(X)\) is the square-root of \(\mathbb{V}(X)\).

Simple formulas provide the expected value, variance, and standard deviation of scale and shift changes to random variables, such as \(a + bX\) for constants \(a\) and \(b\).

\[\begin{split} \begin{aligned} \mathbb{E}(a + bX) ~&=~ a + b\mathbb{E}(X) \\ \mathbb{V}(a + bX) ~&=~ b^2\mathbb{V}(X) \\ SD(a + bX) ~&=~ |b|SD(X) \\ \end{aligned} \end{split}\]

To convince yourself that these formulas make sense, think about how a distribution might change if you added a constant \(a\) to each value: it would simply shift the distribution, which in turn would shift the expected value but not change the size of the deviations about the expected value. On the other hand, scaling the values by 2 would spread the distribution out and essentially double the deviations from the expected value.
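The shift and scale rules can also be confirmed on a small example. In this sketch the distribution `pmf_x` and the constants `a` and `b` are made up for illustration, and the helpers `ev` and `var` are hypothetical names.

```python
import math

def ev(pmf):
    """Expected value: probability-weighted average of the values."""
    return sum(x * p for x, p in pmf.items())

def var(pmf):
    """Variance: probability-weighted average of squared deviations."""
    mu = ev(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

# Hypothetical distribution for X (values and probabilities are made up).
pmf_x = {0: 0.2, 1: 0.5, 4: 0.3}
a, b = 3, -2

# Distribution of a + bX: the same probabilities attached to transformed values.
pmf_y = {a + b * x: p for x, p in pmf_x.items()}

# Each pair printed below should agree up to floating-point rounding.
print(ev(pmf_y), a + b * ev(pmf_x))
print(var(pmf_y), b ** 2 * var(pmf_x))
print(math.sqrt(var(pmf_y)), abs(b) * math.sqrt(var(pmf_x)))
```

A negative `b` is used on purpose: it flips the distribution but leaves the spread positive, which is why the SD rule involves \(|b|\).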

We are also interested in the properties of the sum of two or more random variables, say \(X\) and \(Y\). For this, we need to know how \(X\) and \(Y\) vary together. This is captured by the joint distribution of \(X\) and \(Y\), which assigns probabilities to combinations of their outcomes,

\[ \mathbb{P}(X =x, Y=y) ~=~ p_{x,y} \]

When \(X\) and \(Y\) are independent, then \(p_{x,y} = p_x p_y\). A summary of how \(X\) and \(Y\) vary together, called the covariance, is defined as:

\[\begin{split} \begin{align} Cov(X, Y) ~&=~ \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])] \\ ~&=~ \mathbb{E}[(XY) - \mathbb{E}(X)\mathbb{E}(Y)] \\ ~&=~ \sum_{x,y}[(xy) - \mathbb{E}(X)\mathbb{E}(Y)]p_{x,y} \end{align} \end{split}\]

If \(X\) and \(Y\) are independent, then \(Cov(X,Y) = 0\).

Below are useful properties of expected value and variance of a sum of two random variables:

\[\begin{split} \begin{aligned} \mathbb{E}(X + Y) ~&=~ \mathbb{E}(X) + \mathbb{E}(Y) \\ \mathbb{V}(X + Y) ~&=~ \mathbb{V}(X) + 2Cov(X,Y) + \mathbb{V}(Y) \\ \end{aligned} \end{split}\]
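To see the covariance term at work, the sketch below computes the variance of \(X + Y\) two ways for a small joint distribution: directly, and from the formula above. The joint probabilities in `joint` are made up, and the helper `expect` is a hypothetical name.

```python
# Hypothetical joint distribution p_{x,y} for two dependent random variables.
joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def expect(g):
    """Expected value of g(X, Y) under the joint distribution."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

ex = expect(lambda x, y: x)
ey = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - ex) ** 2)
var_y = expect(lambda x, y: (y - ey) ** 2)
cov_xy = expect(lambda x, y: (x - ex) * (y - ey))

# Variance of the sum, computed directly from the deviations of X + Y ...
var_sum_direct = expect(lambda x, y: (x + y - (ex + ey)) ** 2)
# ... and from the formula V(X) + 2 Cov(X, Y) + V(Y).
var_sum_formula = var_x + 2 * cov_xy + var_y

print(round(cov_xy, 4))           # 0.1
print(round(var_sum_direct, 4))   # 0.69
print(round(var_sum_formula, 4))  # 0.69
```

Here the covariance is positive, so the variance of the sum is larger than \(\mathbb{V}(X) + \mathbb{V}(Y) = 0.49\).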

These properties can be used to show that for random variables \(X_1, X_2, \ldots, X_n\) that are independent, each with expected value \(\mu\) and standard deviation \(\sigma\), the average, \(\bar{X}\), has the following expected value, variance, and standard deviation.

\[\begin{split} \begin{align} \mathbb{E}(\bar{X}) ~&=~ \mu\\ \mathbb{V}(\bar{X}) ~&=~ \sigma^2 /n\\ SD(\bar{X}) ~&=~ \sigma/\sqrt{n} \end{align} \end{split}\]

This situation arises from the urn model where \(X_1, \ldots,X_n\) are the result of random draws with replacement. In this case, \(\mu\) represents the average of the urn and \(\sigma\) the standard deviation.
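As a sketch of why the variance of the average takes this form, apply the scale rule with \(b = 1/n\) and then the rule for the variance of a sum. Independence makes every covariance term zero, so the variances simply add:

\[\begin{split} \begin{aligned} \mathbb{V}(\bar{X}) ~&=~ \mathbb{V}\left(\frac{1}{n} \sum_{i=1}^{n} X_i\right) \\ ~&=~ \frac{1}{n^2} \mathbb{V}\left(\sum_{i=1}^{n} X_i\right) \\ ~&=~ \frac{1}{n^2} \sum_{i=1}^{n} \mathbb{V}(X_i) \\ ~&=~ \frac{n\sigma^2}{n^2} ~=~ \frac{\sigma^2}{n} \end{aligned} \end{split}\]

Taking the square root gives \(SD(\bar{X}) = \sigma/\sqrt{n}\).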

And, when the random draws are made without replacement from an urn with \(N\) marbles, then \(\bar{X}\) has

\[\begin{split} \begin{align} \mathbb{E}(\bar{X}) ~&=~ \mu\\ \mathbb{V}(\bar{X}) ~&=~ \frac{N-n}{N-1} \times \frac{\sigma^2}{n}\\ \end{align} \end{split}\]

We used this formula earlier to compute the \(SD(\hat{\theta})\) in our example.
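One way to see where the factor \((N-n)/(N-1)\) comes from (a sketch, not the only route) is to note that draws made without replacement are negatively correlated, with \(Cov(X_i, X_j) = -\sigma^2/(N-1)\) for \(i \neq j\). This value follows from the observation that when all \(N\) marbles are drawn, the sum is a constant, so its variance must be 0. Extending the rule for the variance of a sum to \(n\) draws gives:

\[\begin{split} \begin{aligned} \mathbb{V}\left(\sum_{i=1}^{n} X_i\right) ~&=~ \sum_{i=1}^{n} \mathbb{V}(X_i) + \sum_{i \neq j} Cov(X_i, X_j) \\ ~&=~ n\sigma^2 - n(n-1)\frac{\sigma^2}{N-1} \\ ~&=~ n\sigma^2 \times \frac{N-n}{N-1} \end{aligned} \end{split}\]

Dividing by \(n^2\) then gives the expression for \(\mathbb{V}(\bar{X})\).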

16.7.3. Probability behind testing and intervals

As mentioned at the beginning of this chapter, probability is the underpinning behind conducting a hypothesis test, providing a confidence interval for an estimator, and providing a prediction interval for a future observation.

We now have the technical machinery to explain these concepts, which we have carefully defined in this chapter without the use of formal technicalities. This time we present the results in terms of random variables and their distributions.

As introduced, a hypothesis test relies on a null model which provides the probability distribution for the statistic, \(\hat{\theta}\). The tests we carried out were essentially computing (sometimes approximately) the following probability: given the assumptions of the null distribution,

\[ \mathbb{P}(\hat{\theta} \geq \textrm{observed statistic}) \]

Oftentimes, the random variable is normalized and the following \(p\)-value is computed. Given the assumptions of the null distribution,

\[ \mathbb{P}\left( \frac{\hat{\theta} - {\theta}^*}{SD(\hat{\theta})} \geq \frac{\textrm{observed stat}- \theta^*}{SD(\hat{\theta})}\right)\]

When \(SD(\hat{\theta})\) is not known, we approximate it via simulation or, when we have a formula for \(SD(\hat{\theta})\) in terms of \(SD(pop)\), we substitute \(SD(sample)\) for \(SD(pop)\). This normalization has been popular because it simplifies the null distribution. For example, if \(\hat{\theta}\) has an approximately normal distribution, then the normalized version has an approximately standard normal distribution, with center 0 and SD of 1. These approximations are useful when many hypothesis tests are being carried out, such as in A/B testing, because there is no need to run a simulation for every sample.
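To illustrate the normalized version, the sketch below applies a normal approximation to the Wikipedia example, using the values that appear earlier in this section (observed average rank 113.7, \(\theta^* = 100.5\), and \(\mathbb{V}(Z_1) = 3333.25\)). The helper `normal_upper_tail` is our own name, and the resulting \(p\)-value is only an approximation under the assumption that the null distribution is close to normal.

```python
import math

def normal_upper_tail(z):
    """P(Z >= z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

theta_star = 100.5     # expected average rank under the null model
observed_stat = 113.7  # observed average rank for the treatment group

# SD of the average of 100 draws without replacement from 200 ranks.
sd_theta_hat = math.sqrt((200 - 100) / (200 - 1)) * math.sqrt(3333.25) / 10

# Normalize the observed statistic, then find the chance of a value at least as large.
z = (observed_stat - theta_star) / sd_theta_hat
p_value = normal_upper_tail(z)

print(round(z, 2))  # 3.23
print(p_value)      # a small probability, well under 0.001
```

No simulation is needed here: once the statistic is normalized, the same standard normal curve serves for every test of this kind.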

The probability statement behind a confidence interval is quite similar. In particular, to create a 95% confidence interval under the normal conditions, we begin with the probability,

\[ \mathbb{P}\left( \frac{|\hat{\theta} - \theta^*|}{SD(\hat{\theta})} \leq 1.96 \right) ~=~ \mathbb{P}\left(\hat{\theta} - 1.96SD(\hat{\theta}) \leq \theta^* \leq \hat{\theta} + 1.96SD(\hat{\theta}) \right) ~\approx~ 0.95 \]

Note that \(\hat{\theta}\) is the random variable in the above probability statement. The confidence interval is created by substituting the observed statistic in for \(\hat\theta\) and calling it a 95% confidence interval:

\[ \left(\textrm{observed stat} - 1.96SD(\hat{\theta}),~ \textrm{observed stat} + 1.96SD(\hat{\theta}) \right) \]
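Because \(\hat{\theta}\) is the random quantity, the 95% refers to how often intervals built this way capture \(\theta^*\) over repeated samples. The simulation below is a sketch of that idea under assumptions of our own choosing: draws are made with replacement from the ranks 1 to 200, the sample size is 50, \(SD(\hat{\theta})\) is treated as known, and the seed and number of repetitions are arbitrary.

```python
import math
import random
import statistics

population = range(1, 201)
theta_star = statistics.fmean(population)  # population average, 100.5
n = 50

# Draws with replacement are independent, so SD(theta_hat) = SD(pop) / sqrt(n).
sd_theta_hat = statistics.pstdev(population) / math.sqrt(n)

rng = random.Random(0)  # arbitrary seed, for reproducibility only
reps = 5_000
covered = 0
for _ in range(reps):
    theta_hat = statistics.fmean(rng.choices(population, k=n))
    lower = theta_hat - 1.96 * sd_theta_hat
    upper = theta_hat + 1.96 * sd_theta_hat
    covered += lower <= theta_star <= upper  # True counts as 1

coverage = covered / reps
print(coverage)  # should be close to 0.95
```

Any single interval either contains \(\theta^*\) or it does not; the 0.95 describes the procedure, not one particular interval.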

Lastly, consider prediction intervals. The basic idea is to provide a measure of the expected variation of a future observation about the estimator. In the simple case, where the statistic is \(\bar{X}\) and we have a hypothetical new observation \(X_0\) that is independent of the \(X_i\) and has the same expected value and standard deviation as each of them, we have the standard deviation:

\[\begin{split} \begin{align} SD(X_0 - \bar{X}) ~&=~ \sqrt{ \mathbb{V}(X_0) + \mathbb{V}(\bar{X})} \\ ~&=~ \sqrt{ \sigma^2 + \sigma^2/n} \\ ~&=~ \sigma\sqrt{1 + 1/n} \end{align} \end{split}\]

Notice there are two parts to the variation: one due to the variation of \(X_0\) and the other due to the approximation of \(\mathbb{E}(X_0)\) by \(\bar{X}\).
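A short simulation can confirm that the two sources of variation add in this way. The sketch below assumes independent normal draws with made-up values for `mu`, `sigma`, and `n`; the function name `prediction_error`, the seed, and the number of repetitions are also our own choices.

```python
import math
import random
import statistics

rng = random.Random(1)         # arbitrary seed, for reproducibility only
mu, sigma, n = 10.0, 2.0, 25   # hypothetical expected value, SD, and sample size

def prediction_error():
    """Draw a sample and one new observation; return X_0 minus the sample average."""
    xs = [rng.gauss(mu, sigma) for _ in range(n)]
    x0 = rng.gauss(mu, sigma)
    return x0 - statistics.fmean(xs)

errors = [prediction_error() for _ in range(20_000)]

sd_formula = sigma * math.sqrt(1 + 1 / n)
sd_simulated = statistics.pstdev(errors)

print(round(sd_formula, 2))    # 2.04
print(round(sd_simulated, 2))  # should be close to 2.04
```

With \(n = 25\), nearly all of the spread comes from \(X_0\) itself; the estimation of the center adds only the \(1/n\) term inside the square root.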

In the case of more complex models, the variation in prediction breaks down into these two components: the inherent variation in the data about the model plus the variation in the sampling distribution due to the estimation of the model.