3.4. Probability Distribution for a Statistic

In this chapter, we used probability to find chances related to simple random samples, randomized controlled experiments, and measurement error. These probability frameworks enable us to run simulation studies for a hypothetical survey, an experiment, or some other chance process in order to study its behavior. For example, we found the expected outcome of a clinical trial for a vaccine under the assumption that the treatment was not effective, and we studied the support for Clinton and Trump in a simple random sample that used the actual votes cast in the election. The simulation studies enabled us to quantify the typical deviations from the expected outcome and the distribution of the possible values for the summary statistics.

In this section, we formalize the notions of expected value, standard deviation, and probability distribution.

Our examples reduced the data to a summary statistic, and we used simulation to approximate the random behavior of the statistic. That is, we summarized the possible values for the statistic in a table with the proportion of simulations that yielded each value, and we found the average and standard deviation of those values. In essence, the simulations were approximating probability calculations.

We formalize these calculations here. Our statistic is based on data \(x_1, x_2, \ldots, x_n\); we write it as \(T(x_1, \ldots , x_n)\), or \(T\) for short. The probability distribution of the statistic is

\[\mathbb{P}(T(x_1, \ldots, x_n) = t) = p_t,\]

for all possible values \(t\) that the statistic could take.

The expected value of \(T\) is the average value, i.e.,

\[\mathbb{E}(T(x_1, \ldots, x_n)) = \sum_t t\mathbb{P}(T(x_1, \ldots, x_n) = t) = \sum_t tp_t\]

And the standard error of the statistic is its standard deviation:

\[\mathbb{SE}(T) = \sqrt{\sum_t (t-\mathbb{E}(T))^2 p_t}\]

Except for very simple situations, we will rely on simulation studies to estimate the probability distribution of a statistic and its expected value and standard error.
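
As a sketch of what such a simulation study might look like, the snippet below uses a hypothetical population of six values and an arbitrary sample size of 4 (both made up for illustration) to approximate the distribution, expected value, and standard error of the sample average:

```python
import numpy as np

rng = np.random.default_rng(42)

# A hypothetical population of values written on the marbles in an urn
population = np.array([2, 3, 3, 5, 7, 8])
n = 4                     # number of draws (arbitrary choice)
num_simulations = 100_000

# Each simulation draws n marbles with replacement and computes
# the statistic T = sample average
simulated_T = np.array([
    rng.choice(population, size=n, replace=True).mean()
    for _ in range(num_simulations)
])

# Approximate the probability distribution of T by the proportion of
# simulations that produced each value
values, counts = np.unique(simulated_T, return_counts=True)
approx_dist = dict(zip(values, counts / num_simulations))

# Approximate the expected value and standard error of T
print("E(T)  ≈", simulated_T.mean())
print("SE(T) ≈", simulated_T.std())
```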

Recall that in the section on sampling distributions, the small example examined a statistic that was the proportion of dogs in a sample of 3 animals from a population of 4 dogs and 3 cats. We derived the probability distribution of the statistic:

| Sample Proportion | No. Occurrences | Chance |
|:-----------------:|:---------------:|:------:|
| 1                 | 4               | 4/35   |
| 2/3               | 18              | 18/35  |
| 1/3               | 12              | 12/35  |
| 0                 | 1               | 1/35   |

There are only 4 possible values for \(T\), and they are \(0\), \(1/3\), \(2/3\), and \(1\). The expected value and standard error of the statistic were:

\[{\mathbb{E}}(T) = (1 \times \frac{4}{35}) + (\frac{2}{3}\times \frac{18}{35}) + (\frac{1}{3} \times \frac{12}{35}) + (0 \times \frac{1}{35}) = \frac{4}{7}\]
\[{\mathbb{SE}}(T) = \sqrt{(1-\frac{4}{7})^2\times \frac{4}{35} + \cdots +(0-\frac{4}{7})^2\times \frac{1}{35} } \approx 0.233\]
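
To double-check these hand calculations, we can compute the expected value and standard error directly from the probability distribution in the table. One way to do this with exact fractions:

```python
from fractions import Fraction
from math import sqrt

# The probability distribution of T, taken from the table above
dist = {
    Fraction(1):    Fraction(4, 35),
    Fraction(2, 3): Fraction(18, 35),
    Fraction(1, 3): Fraction(12, 35),
    Fraction(0):    Fraction(1, 35),
}

# Expected value: sum over t of t * P(T = t)
expected_T = sum(t * p for t, p in dist.items())
print(expected_T)                   # 4/7

# Standard error: square root of the probability-weighted squared deviations
variance_T = sum((t - expected_T) ** 2 * p for t, p in dist.items())
print(round(sqrt(variance_T), 3))   # 0.233
```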

We can generalize these results for an urn model where the random sampling is either with or without replacement and the statistic is the average of the values on the marbles drawn. Suppose the population/urn consists of values \(x_1 , x_2, \ldots , x_N\), and we take \(n\) draws from the urn. For \(T\) the sample average when the draws are with replacement, we have

\[\mathbb{E}(T) = \frac{x_1 + x_2 + \cdots + x_N}{N} = \textrm{ population average}\]

and

\[\begin{split}{\mathbb{SE}}(T) = \frac{1}{\sqrt{n}}\sqrt{[(x_1-\mathbb{E}(T))^2 + \cdots +(x_N-\mathbb{E}(T))^2]/N} \\ = \frac{1}{\sqrt{n}} \times \textrm{ population standard deviation} \end{split}\]

In the case of draws without replacement, the expected value is the same, i.e., it matches the population average, and the standard error decreases by a factor of \(\sqrt{(N-n)/(N-1)}\); that is,

\[{\mathbb{SE}}(T_{SRS}) = \sqrt{\frac{N-n}{N-1}} ~\times~\frac{1}{\sqrt{n}}~\times~\textrm{ population standard deviation}. \]

When the population size is large compared to the sample size, we ignore this factor and treat the sampling as if it were with replacement.
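
A quick simulation can confirm these formulas. The sketch below uses a made-up population of the values 1 through 50 and a sample size of 10 (both arbitrary choices) and compares the simulated standard errors of the sample average, with and without replacement, to the formulas above:

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up population: the values 1 through 50 (so N = 50), sample size 10
population = np.arange(1, 51)
N, n = len(population), 10
num_simulations = 100_000

pop_sd = population.std()        # population standard deviation

def simulated_se(replace):
    """Approximate SE of the sample average by repeated sampling."""
    averages = np.array([
        rng.choice(population, size=n, replace=replace).mean()
        for _ in range(num_simulations)
    ])
    return averages.std()

print("with replacement:    simulated", simulated_se(True),
      " formula", pop_sd / np.sqrt(n))
print("without replacement: simulated", simulated_se(False),
      " formula", np.sqrt((N - n) / (N - 1)) * pop_sd / np.sqrt(n))
```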

Note that this situation covers the case of a sample proportion because a proportion is simply the average of 0s and 1s.
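
For instance, if a fraction \(p\) of the values in the urn are 1s and the rest are 0s, then the population average is \(p\), the population standard deviation is \(\sqrt{p(1-p)}\), and for draws with replacement the standard error of the sample proportion is

\[\mathbb{SE}(T) = \sqrt{\frac{p(1-p)}{n}}.\]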

In practice, we typically don’t know the population quantities, so we approximate the population average with the sample average and the population standard deviation with the sample standard deviation.