3.1. The Urn Model

The urn model was developed by Jacob Bernoulli in the early 1700s as a way to model the process of selecting items from a population. The urn model describes the process of randomly sampling colored marbles from an urn – a kind of vase or container.

Figure: An urn containing 3 black marbles and 2 white marbles, with a sample of marbles drawn off to the side.

To use the urn model, we first need to make a few decisions:

  • the number of marbles in the urn;

  • the color (or label) of each marble;

  • and the number of marbles drawn from the urn.

Finally, we also need to decide on the sampling process. As we select each marble for our sample, we could choose to record the color and return the marble to a random location in the urn (with replacement), or discard the marble so it cannot be sampled again (without replacement).

The answers to these questions are the parameters of our model. We can adapt the urn model to describe many real-world situations by our choice for these parameters.

To illustrate the urn model, consider a simple example of sampling 2 marbles from an urn containing 3 black marbles and 2 white marbles. We simulate the draw of two marbles from our urn without replacement between draws using numpy’s random.choice method. Notice that we set replace to False to indicate that once we sample a marble we don’t return it to the urn.

urn = ["⚫", "⚫", "⚫", "⚪", "⚪"]
print("Sample 1:", np.random.choice(urn, size=2, replace=False))
print("Sample 2:", np.random.choice(urn, size=2, replace=False))
Sample 1: ['⚫' '⚪']
Sample 2: ['⚪' '⚫']

With this basic setup, we can now ask fundamental questions about the kinds of samples we would expect to see. What is the chance that our sample contains marbles of only one color? What if we sample more marbles? What if we return each marble after selecting it? What happens if we repeat the process many times?

The answers to these questions are fundamental to our understanding of data collection. Furthermore, by developing the skills to reason about and simulate from the urn model, we will be building the skills needed to apply simulation to many real-world problems that can’t be easily described by classic statistical equations.

For example, we can use simulation to easily estimate the fraction of samples where both marbles have a matching color. In the following code we simulate 10,000 rounds of sampling two marbles from our urn. Using these simulated samples we can directly compute the proportion of the samples with matching marbles:

n_samples = 10_000
samples = [np.random.choice(urn, size=2, replace=False)
           for _ in range(n_samples)]
# a sample "matches" when both of its marbles have the same color
is_matching = [s[0] == s[1] for s in samples]
print("Proportion of samples with matching marbles:", 
      np.mean(is_matching))
Proportion of samples with matching marbles: 0.4032

What would happen if we increased the number of simulated samples? What if we changed the number of marbles in the urn? These are all questions we could study through simulation. While this might seem like a contrived example (it is…), replace the marbles with people on a dating service, replace the colors with more complex attributes, perhaps add a neural network to score each match, and you can start to see the foundation of much more sophisticated analyses.
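For instance, here is a small what-if sketch (the list of round counts is our own choice for illustration) that re-runs the matching-color simulation with more and more rounds. As the number of rounds grows, the estimates should settle near the exact chance of a match, which works out to \(2/5\) for this urn and is consistent with the \(0.4032\) estimate above.

def prop_matching(n_rounds):
    # estimate the chance that both marbles in a sample share a color
    samples = [np.random.choice(urn, size=2, replace=False)
               for _ in range(n_rounds)]
    return np.mean([s[0] == s[1] for s in samples])

# hypothetical round counts for the what-if comparison
for n_rounds in [100, 1_000, 10_000, 100_000]:
    print(n_rounds, "rounds:", prop_matching(n_rounds))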

So far we have been focused on the sample, but we are often interested in the relationship between the samples we would observe and what they can tell us about the “population” of marbles that were originally in the urn.

We can draw an analogy to data scope from Chapter 2: a set of marbles drawn from the urn is a sample, and the collection of all marbles placed in the urn is the population. This urn model prescribes a particular selection method, called the Simple Random Sample (SRS). We describe the SRS, and other sampling techniques based on it, in the next section.

3.1.1. Sampling Designs

The process of drawing marbles without replacement from an urn is equivalent to a Simple Random Sample (SRS). In a Simple Random Sample (SRS), every possible sample of a given size has the same chance of being selected. Although the name contains the word "simple," constructing a simple random sample is often anything but simple, and in many cases the SRS is also the best sampling procedure. And, if we are honest, it can also be a little confusing.

Note

Many people mistakenly think that the defining property of a Simple Random Sample (SRS) is that every unit has an equal chance of being in the sample. However, this is not the case. An SRS of \(n\) units from a population of \(N\) units means that every possible subset of \(n\) units has the same chance of being selected.
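To see the distinction concretely, consider a hypothetical design (separate from the urns in this section) on six units labeled A through F: flip a fair coin and take either \(\{A, B, C\}\) or \(\{D, E, F\}\). Every unit has a \(1/2\) chance of appearing in the sample, yet only two of the \(20\) possible three-unit subsets can ever be selected, so the design is not an SRS. A minimal simulation sketch:

import numpy as np
from collections import Counter

# hypothetical design: flip a coin, take one of two fixed subsets
subsets = [("A", "B", "C"), ("D", "E", "F")]
draws = [subsets[np.random.randint(2)] for _ in range(10_000)]

# every unit shows up in roughly half of the samples...
print(Counter(unit for draw in draws for unit in draw))
# ...yet only 2 of the 20 possible three-unit subsets ever appear
print(len(set(draws)), "distinct samples observed")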

To better understand the SRS we can return to the urn model. In this example, we will consider an urn with 7 marbles. Rather than color the marbles, we will label each with the letters A through G. Because each marble has a different label, we can more clearly see all possible samples that can be constructed.

A Simple Random Sample (SRS) implies that every possible sample is equally likely. For example, if we select 3 marbles from the urn in such a way that every combination of three marbles is equally likely, we have an SRS.

Here we use the itertools library to generate the list of all combinations.

from itertools import combinations
all_samples = ["".join(x) for x in combinations("ABCDEFG", 3)]
print(all_samples)
print("Number of Samples:", len(all_samples))
['ABC', 'ABD', 'ABE', 'ABF', 'ABG', 'ACD', 'ACE', 'ACF', 'ACG', 'ADE', 'ADF', 'ADG', 'AEF', 'AEG', 'AFG', 'BCD', 'BCE', 'BCF', 'BCG', 'BDE', 'BDF', 'BDG', 'BEF', 'BEG', 'BFG', 'CDE', 'CDF', 'CDG', 'CEF', 'CEG', 'CFG', 'DEF', 'DEG', 'DFG', 'EFG']
Number of Samples: 35

Our list shows that there are \( 35 \) unique sets of three marbles. We could sample each of these sets in 6 different orders. For example, the set \(\{A, B, C\}\) can be sampled as:

from itertools import permutations
print(["".join(x) for x in permutations("ABC")])
['ABC', 'ACB', 'BAC', 'BCA', 'CAB', 'CBA']

In this simple example, we get a complete picture of all the ways in which we can draw three marbles from the urn.

Our example is an SRS of three draws from a population of seven, so the chance of any one particular sample must be \(1/35\),

\[{\mathbb{P}}(ABC) = {\mathbb{P}}(\textrm{ABD}) = \cdots = {\mathbb{P}}(\textrm{EFG}) = \frac{1}{35}.\]

We use the special symbol \({\mathbb{P}}\) to stand for “probability” or “chance”, and we read the statement \({\mathbb{P}}(ABC)\) as the “chance the sample contains the marbles labeled A, B, and C in any order.”

We can use the enumeration of all of the possible samples from the urn to answer additional questions about this chance process. For example, to find the chance that marble \(A\) is in the sample, we can add up the chance of all samples that contain \(A\). There are 15 of them (one for each pair of marbles from the remaining six that can accompany \(A\)), so the chance is:

\[{\mathbb{P}}({A~is~in~the~sample}) = \frac{15}{35} = \frac{3}{7}.\]
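We can double-check this count against the enumeration we built earlier (regenerating all_samples here so the snippet stands on its own):

from itertools import combinations

all_samples = ["".join(x) for x in combinations("ABCDEFG", 3)]
# count the samples that contain the marble labeled A
print("Samples containing A:", sum("A" in s for s in all_samples),
      "out of", len(all_samples))

This should print 15 out of 35, matching the calculation above.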

When it’s too difficult to list and count all of the possible samples, we can use simulation to help understand this chance process.

The SRS (and its corresponding urn) is the main building block for more complex survey designs. We briefly describe two of the more widely used designs.

  • Stratified Sampling Divide the population into non-overlapping groups, called strata (one group is called a stratum and more than one are strata), and then take a simple random sample from each. This is like having a separate urn for each stratum and drawing marbles from each urn, independently. The strata do not have to be the same size, and we need not take the same number of units from each.

  • Cluster Sampling Divide the population into non-overlapping subgroups, called clusters, take a simple random sample of the clusters, and include all of the units in the chosen clusters in the sample. We can think of this as an SRS from one urn that contains large marbles that are themselves containers of small marbles. When opened, the sample of large marbles turns into the sample of small marbles. (Clusters tend to be smaller than strata.) A small simulation sketch of both designs follows this list.
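As a rough illustration of both designs (the strata, clusters, and sample sizes below are made up for the example), the following sketch draws one stratified sample and one cluster sample with numpy:

# hypothetical strata: a separate urn for each group
strata = {
    "north": ["N1", "N2", "N3", "N4", "N5", "N6"],
    "south": ["S1", "S2", "S3", "S4"],
}
# stratified sample: an SRS from each stratum (sizes need not match)
stratified = {name: np.random.choice(units, size=2, replace=False)
              for name, units in strata.items()}
print("Stratified sample:", stratified)

# hypothetical clusters: each "large marble" contains several units
clusters = {"c1": ["A", "B"], "c2": ["C", "D"],
            "c3": ["E", "F"], "c4": ["G", "H"]}
# cluster sample: an SRS of clusters, then every unit in the chosen clusters
chosen = np.random.choice(list(clusters), size=2, replace=False)
print("Cluster sample:", [unit for c in chosen for unit in clusters[c]])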

Often, we are interested in a summary of the sample; that is, some statistic. For any sample, we can calculate the statistic, and the urn model helps us find the distribution of possible values that statistic may take on. In the next section, we examine the distribution of a statistic for our example.

3.1.2. Sampling Distribution of a Statistic

Suppose we are interested in testing the failure pressure of a new fuel tank design for a rocket. It's expensive to carry out the pressure tests since we need to destroy the fuel tank, and we may need to test more than one tank to account for variation in manufacturing.

We can use the urn model to help decide how many prototypes need to be tested. We can summarize our test results by, say, the proportion of prototypes that fail the test. The urn model tells us that every sample of prototypes has the same chance of being selected, and so the pressure test results should be representative of the population.

Suppose we have 7 fuel tanks corresponding to our 7 labeled marbles from before. Furthermore, suppose we knew that the tanks \(A\), \(B\), \(D\), and \(F\) would fail the pressure test, if chosen. We will color those marbles black and the remaining marbles white.

urn =["A⚫", "B⚫", "C⚪", "D⚫", "E⚪", "F⚫", "G⚪"]

For each sample of three marbles, we can find the proportion of failures according to how many of these four defective prototypes are in the sample. We give a few examples of this calculation.

Sample        ABC    BCE    BDF    CEG
Proportion    2/3    1/3    1      0

Since we are drawing three marbles from the urn, the only possible sample proportions are \( 0 \), \(1/3\), \(2/3\), and \( 1 \), and for each triple, we can calculate its corresponding proportion. There are 4 samples that give us all failed tests (a sample proportion of 1): \(ABD\), \(ABF\), \(ADF\), \(BDF\), so the chance of observing a sample proportion of \( 1 \) is \(4/35\). We can summarize the distribution of values for the sample proportion in the following table.

Proportion of Fails    No. of Samples    Fraction of Samples
1                      4                 4/35 \(\approx 0.11\)
2/3                    18                18/35 \(\approx 0.51\)
1/3                    12                12/35 \(\approx 0.34\)
0                      1                 1/35 \(\approx 0.03\)
Total                  35                1
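As a check on these counts, here is a short sketch that enumerates all 35 samples with the standard library and tallies the number of failures in each:

from itertools import combinations
from collections import Counter

fails = set("ABDF")  # the four prototypes that would fail the test
# number of failures in each of the 35 possible samples of size 3
counts = Counter(len(set(s) & fails) for s in combinations("ABCDEFG", 3))
for k in sorted(counts, reverse=True):
    print(k, "of 3 failed:", counts[k], "samples out of 35")

The tallies should come out to 4, 18, 12, and 1, agreeing with the table.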

While these calculations are relatively straightforward, we can also approximate them through a simulation study. To do this, we take samples of three from our population over and over, say 10,000 times. For each sample, we calculate the proportion of failures. That gives us 10,000 simulated sample proportions, and the distribution of these simulated proportions should come close to our distribution table. We confirm this with a simulation study.

3.1.3. Simulating the Sampling Distribution

Simulation can be a powerful tool to understand complex random processes. In our earlier example of 7 fuel tanks, we were able to consider all possible samples from the corresponding urn model. However, in situations with large populations, samples, and more complex sampling processes, it may not be tractable to directly compute the chance of certain outcomes. In these situations, we often turn to simulation to provide accurate estimates of the quantities we can’t compute directly.

However, before we simulate complex processes, let's first examine how simulation performs on our simple example of 7 fuel tanks. We will ignore the labels and focus on the colors indicating the outcome of the test, coding the colors as 0 and 1. Because we are interested in failures, we use 1 to indicate a failure and 0 to indicate a success. Thus we have an urn that looks like:

urn = [1, 1, 0, 1, 0, 1, 0]  # 1 = fail, 0 = pass

Since we have encoded the marble colors using 1 for fail and 0 for pass, we can take the mean of a sample to get its proportion of failures:

sample = np.random.choice(urn, size=3, replace=False)
print("Sample:", sample)
print("Prop Failures:", sample.mean())
Sample: [1 0 0]
Prop Failures: 0.3333333333333333

In a simulation, we would repeat the sampling process thousands of times to estimate the distribution of outcomes. Here we can construct 10,000 samples:

samples = [np.random.choice(urn, size=3, replace=False)
           for _ in range(10_000)]
prop_failures = [s.mean() for s in samples]

We can study these 10,000 sample proportions and match our findings against what we calculated already using the complete enumeration of all 35 possible samples. We expect the simulation results to be close to our earlier calculations because we have repeated the sampling process many times. That is, we want to compare the fraction of the 10,000 sample proportions that are \( 0 \), \(1/3\), \(2/3\), and \( 1 \) to those in the table. These fractions should be, approximately, \(1/35\), \(12/35\), \(18/35\), and \(4/35\), or about \(0.03\), \(0.34\), \(0.51\), and \(0.11\).

import pandas as pd

unique_els, counts_els = np.unique(prop_failures, 
                                   return_counts=True)
pd.DataFrame({"Proportion of Failures": unique_els, 
              "Fraction of Samples": counts_els/10_000})
Proportion of Failures Fraction of Samples
0 0.00 0.03
1 0.33 0.35
2 0.67 0.51
3 1.00 0.11

The simulation results closely match the table.

Note

Simulation studies leverage random number generators to sample many outcomes from a random process. In a sense, simulation studies convert complex random processes into data that we can readily analyze using the broad set of computational tools we cover in this book. While simulation studies typically do not provide definitive proof of a particular hypothesis, they can provide important evidence. In many situations, simulation is the most accurate estimation process we have.

Drawing marbles from an urn of 0s and 1s is such a popular framework for understanding randomness that this chance process has been given a formal name: the hypergeometric distribution. Most statistical software provides functionality to rapidly carry out simulations of the hypergeometric. We can redo our simulation using the hypergeometric.

3.1.4. The Hypergeometric Distribution

Instead of using random.choice, we can use numpy’s random.hypergeometric to simulate drawing marbles from the urn and counting the number of fails. The random.hypergeometric method is optimized for the 0-1 urn and allows us to ask for 10,000 simulations in one call. For completeness, we repeat our simulation study and calculate the empirical proportions.

simulations_fast = np.random.hypergeometric(
    ngood=4, nbad=3, nsample=3, size=10_000)
print(simulations_fast)
[1 1 2 ... 1 2 2]

Note: we don’t think that a pass is “bad”; it’s just a naming convention to call the type you want to count “good” and the other “bad”.

unique_els, counts_els = np.unique(simulations_fast, return_counts=True)
pd.DataFrame({"Number of Failures": unique_els, 
              "Fraction of Samples": counts_els/10_000})
Number of Failures Fraction of Samples
0 0 0.03
1 1 0.34
2 2 0.52
3 3 0.11

You might already have asked yourself: since the hypergeometric is so popular, why not provide the exact distribution of the possible values? In fact, these exact probabilities are available, and we show how to calculate them below.

from scipy.stats import hypergeom

num_failures = [0, 1, 2, 3]
# hypergeom.pmf(k, M, n, N): chance of k failures in N=3 draws
# from M=7 marbles, n=4 of which are failures
pd.DataFrame(
    {"Number of Failures": num_failures, 
     "Fraction of Samples": hypergeom.pmf(num_failures, 7, 4, 3)})
Number of Failures Fraction of Samples
0 0 0.03
1 1 0.34
2 2 0.51
3 3 0.11

Note

Whenever possible, it's a good idea to use the functionality provided in a third-party package for simulating from a named distribution, such as the random number generators offered in numpy, rather than writing your own function. It's best to take advantage of the efficient and accurate code that others have developed.

Perhaps the two most common chance processes are those that arise from counting the number of 1s drawn from a 0-1 urn: drawing without replacement gives the hypergeometric distribution, and drawing with replacement gives the binomial distribution.
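For comparison, here is a sketch of the with-replacement counterpart for our fuel tank urn, using numpy's random.binomial (since each marble is returned before the next draw, every draw is a failure with chance \(4/7\)):

simulations_binom = np.random.binomial(n=3, p=4/7, size=10_000)

unique_els, counts_els = np.unique(simulations_binom, return_counts=True)
pd.DataFrame({"Number of Failures": unique_els, 
              "Fraction of Samples": counts_els/10_000})

The resulting fractions will differ somewhat from the hypergeometric ones above because, with replacement, the composition of the urn does not change between draws.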

While our fuel tank simulation was simple, so simple that we could have used hypergeom.pmf to directly compute the distribution, we wanted to demonstrate the intuition that a simulation study can reveal. The approach we take in this book is to develop understanding about chance processes based on simulation studies. However, we do formalize the notion of a probability distribution of a statistic (like the proportion of fails in a sample) in Chapter 16.

Now that we have simulation as a tool for understanding accuracy, we can revisit the election example from Chapter 2 and carry out a post-election study of what might have gone wrong with the voter polls. This simulation study imitates drawing more than a thousand marbles (voters who participate in the poll) from an urn of six million. We can examine potential sources of bias and the variation in the polling results, and carry out a what-if analysis, where we examine how the predictions might have gone if an even larger number of draws from the urn had been taken.