3.5. Probability and the Urn Model

So far in this chapter, we used simulation to study the behavior of the chance process of drawing marbles from an urn. The urn model is quite versatile and enabled us to study election polls, clinical trials, and air sensors; it provides a probability framework to set up our simulation studies, whether for a hypothetical survey, an experiment, or some other chance process. We found the distribution of possible outcomes of a clinical trial for a vaccine, under the assumption that the treatment was not effective. We studied the support for Clinton and Trump in a simulation based on the actual votes cast in an election.

In this section, we bring probability concepts more formally into the characterization of the urn model. We begin by revisiting the small population of seven Starship protoypes, and use probability to describe our earlier findings.

3.5.1. Simple Random Sample (SRS)

Recall that we have seven units (prototype StarShips) that we labeled \(A\) through \(G\). We listed all possible results from drawing three of these units without replacement from the seven. These are listed again below.

\[\begin{split}ABC ~~ ABD ~~ ABE ~~ ABF ~~ ABG ~~ ACD ~~ ACE \\ ACF ~~ ACG ~~ ADE ~~ ADF ~~ ADG ~~ AEF ~~ AEG \\ AFG ~~ BCD ~~ BCE ~~ BCF ~~ BCG ~~ BDE ~~ BDF \\ BDG ~~ BEF ~~ BEG ~~BFG ~~CDE ~~ CDF ~~ CDG \\ CEF ~~ CEG ~~ CFG ~~ DEF ~~ DEG ~~ DFG ~~ EFG \end{split}\]

There are \(35\) unique samples of three from our population of seven. By design, each of these \(35\) samples is equally likely to be chosen (the marbles are indistinguishable and well mixed) so the chance of any particular sample is the same as any other sample. This means the chance must be \(1/35\),

\[{\mathbb{P}}(ABC) = {\mathbb{P}}(\textrm{ABD}) = \cdots = {\mathbb{P}}(\textrm{EFG}) = \frac{1}{35}.\]

We use the special symbol \({\mathbb{P}}\) to stand for “probability” or “chance”, and we read the statement \({\mathbb{P}}(ABC)\) as the “chance the sample contains the marbles labeled A, B, and C.”

We now have a more formal definition of “representative data” that is very useful: In a Simple Random Sample, every sample has the same chance of being selected.


Many people mistakingly think that the defining property of a SRS is that every unit has an equal chance of being in the sample. However, this is not the case. A SRS of \(n\) units from a population of \(N\) means that every possible subset of \(n\) units has the same chance of being selected.

We can use the enumeration of all of the possible samples from the urn to answer additional questions about this chance process. For example, to find the chance that marble \(A\) is in the sample, we can add up the chance of all samples that contain \(A\). There are 15 of them so the chance is:

\[{\mathbb{P}}(\textrm{A is in the sample}) = \frac{15}{35} = \frac{3}{7}.\]

And by symmetry, we have:

\[{\mathbb{P}}(\textrm{A in sample}) = {\mathbb{P}}(\textrm{B in sample}) = \cdots = {\mathbb{P}}(\textrm{G in sample}) = \frac {3}{7}.\]

To give an example of the potential of the urn model (and the SRS) to develop understanding of more complex chance processes, we introduce the stratified sample in the next section. Stratified sampling has the extra complication that multiple samples are drawn from disjoint subsets of the population.

3.5.2. Stratified Sample

In stratified sampling, we divide the population into non-overlapping groups, called strata (one group is called a stratum and more than one are strata), and then take a simple random sample from each. This is like having a separate urn for each stratum and drawing marbles from each urn, independently. The strata do not have to be the same size, and we need not take the same number of units from each.

In our Starship example, \(SN1\), \(SN2\), \(SN3\) and \(SN4\) were constructed differently from \(SN5\), \(SN6\), and \(SN7\) (the later three had no flaps and no nose cone). We might consider dividing our population of seven prototypes into two strata: those with and without flaps and nose cones. Our two strata would be:

\[\begin{split}\textrm{Stratum}~1: \{A, B, C, D \} \\ \textrm{Stratum}~2: \{ E, F, G \}\end{split}\]

Suppose we want to take a sample of size two from the first stratum and a sample of one from the second group of prototypes. All together, we have a sample of size three from our population, but this sampling scheme gives us fewer possible samples than the SRS.


As before, each of these samples is equally likely.

\[{\mathbb{P}}\left(ABE\right) = {\mathbb{P}}\left(CDG\right) = \frac{1}{18}.\]

However, not all of the 35 triples identified earlier are possible in this sampling scheme. For example,

\[{\mathbb{P}}\left(\textrm{AEF}\right) = 0.\]

Our design dictates that we choose only one unit from the second stratum and this triple includes two from the second urn.

Again, we can compute the probability that marble \(A\) is in our sample by counting up all of the occurrences of \(A\) among the 18 samples:

\[{\mathbb{P}}\left(A \textrm{ in sample}\right) = \frac{9}{18} = \frac{1}{2}.\]

Notice that this is the chance that \(A\) is chosen from the first stratum (ignoring what happens with the second stratum). Since two of the four marbles are selected from the first stratum, the chance is \(2/4\) or \(1/2\). In this sampling scheme, not all marbles have the same chance of appearing in the sample. For example, the chance that marble \(F\) is chosen is not \(1/2\) but \(1/3\).

\[{\mathbb{P}}\left(F \textrm{ in sample}\right) = \frac{6}{18} = {\mathbb{P}}\left(F \textrm{ chosen from stratum 2} \right)=\frac{1}{3}.\]

Why would we want to use stratified sampling?

  • When some strata are more heterogenous than others, if we take larger samples from these more variable strata, we can improve the accuracy of the sample.

  • Stratified sampling allows us to ensure that subgroups of the population are well-represented in the sample without using human judgement to select the individuals.

Despite not all samples being possible in a stratified design, each stratum’s sample is representative of its stratum. We can use our knowledge of the sampling scheme to combine the samples from each stratum in a representative way through weighting. This topic is addressed in the exercises.


It’s crucial to keep the sampling scheme in mind when we compute summary statistics, make plots, and fit models for otherwise, our analysis could be flawed.

The simple random sample is at the core of many probability sampling schemes, not just stratified sampling. For example, most government surveys use complex sampling schemes that involve multi-stages of sampling from clusters (see the exercises) and from strata.

We saw in Section XXX, that we are often interested in the behavior of a statistic that is calculated from a sample. The chance process used to collect the sample determines the sampling distribution of the statistic. In the next section, we revisit our small SRS and use probability to describe the expectation and variation of a statistic.

3.5.3. Sampling Distribution of a Statistic

Probability sampling induces a sampling distribution on any statistic that we calculate from our sample. We saw earlier that if we are interested in whether a prototype fails a pressure test, we can change our urn to one with 0s and 1s written on the model and tally the number of 1s drawn. Each of our possible samples gives us a summary statistic, a sample proportion, and we can compute the chance of observing a particular proportion, simply by knowing that there are four 1s and three 0s in the urn. The probability distribution table below summarizes these possible values and their chances. For example, the chance the proportion in a SRS is \(1\), is \(4/35\) because it is equivalent to the chance that the sample is \(ABD\) , \(ABF\), \(ADF\), or \(BDF\).

Sample Proportion

No. Occurrences














We can calculate the expected outcome from the SRS, according to the following equation.

\[\begin{split}{\mathbb{E}}(\textrm{sample proportion}) = 1 \times \frac{4}{35} + \frac{2}{3}\times \frac{18}{35} + \frac{1}{3} \times \frac{12}{35} + 0 \times \frac{1}{35}\\ = \frac{20}{35} \\ =~ \frac{4}{7}\end{split}\]

Here, we use the symbol \({\mathbb{E}}\) to represent what we might expect for the sample proportion. One in 35 samples yields a 0 for the proportion, 12 of the 35 samples have one failure (the proportion is \(1/3\) for these), and so on. The expected value weights each possible outcome by the chance of observing it.

Further, we can find the standard error (we call it the SE for short) of the sample proportion. In some sense, this is the typical deviation of the sample proportion from the expected, \(4/7\). Another name for the standard error is the root mean square error. This alternative name tells us how to compute it (by error we mean the difference between the sample statistic and the expected value).

\[\begin{split}{\mathbb{SE}} = \sqrt{(1-\frac{4}{7})^2\times \frac{4}{35} + (\frac{2}{3}-\frac{4}{7})^2\times \frac{18}{35} +(\frac{1}{3}-\frac{4}{7})^2\times \frac{12}{35} +(0-\frac{4}{7})^2\times \frac{1}{35} } \\ \approx 0.233\end{split}\]

The SE, also called the margin of error, indicates that even though our sample proportion has a expected value that matches \(4/7\), it is likely to be around \(0.23\) away from \(4/7\).

While these calculations are relatively straight forward, we can approximate them through a simulation study.

  • A table of the simulated statistics should look like the probability distribution, with enough repetitions of the process

  • The average of the simulated statistic should be close to the expected value.

  • The standard deviation of the simulated statistics should be close to the standard error.

In our examples we carried out 100,000 to 500,000 repetitions of the chance process, which should give very accurate results.


Simulation studies repeat a random process many many times, and summarizing patterns that result from the simulation can approximate the theoretical properties of the process. However, that is not the same as proving these theoretical properties, but often the guidance we get from the simulation is adequate for our purposes.

3.5.4. Formal Notation

In this section, we formalize the notions of expected value, standard deviation, and probability distribution. Our examples reduced the data to a summary statistic, and we used simulation to approximate the random behavior of the statistic. That is, we summarized the possible values for the statistic in a table with the proportion of simulations that yielded the value, and we found the average value and standard deviation of the values. In essence, the simulations were approximating probability calculations.

Formally, our statistic is based on data \(x_1, x_2, \ldots, x_n\), which we write as \(T(x_1, \ldots , x_n)\textrm{ or }T\) for short. The probability distribution of the statistic is

\[\mathbb{P}(T(x_1, \ldots, x_n) = t) = p_t,\]

for all possible values \(t\) that the statistic could take.

The expected value of \(T\) is the average value, as in our SRS example.

\[\mathbb{E}(T(x_1, \ldots, x_n) = \sum_t t\mathbb{P}(T(x_1, \ldots, x_n) = t) = \sum_t tp_t\]

And the standard error of the statistic is the standard deviation.

\[\mathbb{SE}(T) = \sqrt{\sum_t (t-\mathbb{E}(T))^2 p_t}\]

As an example, we apply these formulae to our simple example to calculate the expected value and standard error of the sample proportion for a sample of three from a population of seven that contains four 1s and three 0s. In this case, there are only 4 possible values for \(T\), and they are \(0\), \(1/3\), \(2/3\), and \(1\).

\[{\mathbb{E}}(T) = (1 \times \frac{4}{35}) + (\frac{2}{3}\times \frac{18}{35}) + (\frac{1}{3} \times \frac{12}{35}) + (0 \times \frac{1}{35}) =~ \frac{4}{7}\]
\[{\mathbb{SE}}(T) = \sqrt{(1-\frac{4}{7})^2\times \frac{4}{35} + \cdots +(0-\frac{4}{7})^2\times \frac{1}{35} } \approx 0.233\]

We can generalize these results for an urn model where the random sampling is either with or without replacement and the statistic is the average of the values on the marbles drawn. Suppose the population/urn has marbles with the values \(x_1 , x_2, \ldots , x_N\), and we take \(n\) draws from the urn. For \(T\) the sample average when the draws are with replacement, we have

\[\mathbb{E}(T) = \frac{x_1 + x_2 + \cdots + x_N}{N} = \textrm{ average of all the marbles in the urn}\]


\[\begin{split}{\mathbb{SE}}(T) = \frac{1}{\sqrt{n}}\sqrt{[(x_1-\mathbb{E}(T))^2 + \cdots +(x_N-\mathbb{E}(T))^2]/N} \\ = \frac{1}{\sqrt{n}} \textrm{ standard deviation of all the marbles in the urn} \end{split}\]

In the case of draws without replacement, the expected value is the same. That is, it matches the urn average, and the standard error decreases by a factor of \(\sqrt{(N-n)/(N-1)}\); that is,

\[{\mathbb{SE}}(T_{SRS}) = \sqrt{\frac{N-n}{N-1}} ~\frac{1}{\sqrt{n}}~\textrm{ standard deviation of the urn}. \]

When the population size is large compared to the sample size, we ignore this factor and treat the sampling as if it were with replacement.

Note this situation covers the case of a sample proportion because a proportion is simply the average of 0s and 1s.

In practice, we typically don’t know the population quantities so we approximate the population average with the sample average and the population standard deviation with the sample standard deviation.