3. Simulation and Data Design

In this chapter, we develop the basic theoretical foundation needed to reason about how data are sampled and what that implies for bias and variance. We will build this foundation not on the dry equations of classic statistics but instead on the story of a vase (an urn) filled with marbles. We will use the computational tools of simulation to reason about the properties of selecting marbles from the urn and what they tell us about data collection in the real world. We will connect the simulation process to common statistical distributions (the dry equations…), but the basic tools of simulation will enable us to go beyond what can be directly modeled using equations.

We will use these new tools to study how the pollsters failed to predict the outcome of the United States Presidential Election in 2016. To do this, we use the actual votes cast in Pennsylvania and simulate the sampling variation for a poll of the six million voters. This simulation helps us uncover how response bias can skew polls. We will see that simply collecting more data using the same sampling procedure would not have helped.
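
To make the last point concrete, here is a minimal sketch of such a simulation in Python. It is not the study carried out later in the chapter: the even split of votes and the two response rates are hypothetical values chosen only for illustration, and the function name `poll_share` is our own. Only the six million voters come from the description above.

```python
import random
import statistics

N_VOTERS = 6_000_000   # size of the urn (the six million voters in the text)
N_FOR_A = 3_000_000    # hypothetical: candidate A's true share is 0.50
P_RESPOND_A = 0.45     # hypothetical response rate of A's supporters
P_RESPOND_B = 0.55     # hypothetical response rate of all other voters

TRUE_SHARE = N_FOR_A / N_VOTERS


def poll_share(rng, n_contacted):
    """Contact voters without replacement; return A's share among respondents."""
    # Marbles numbered 0 .. N_FOR_A - 1 stand for A's supporters.
    contacted = rng.sample(range(N_VOTERS), n_contacted)
    for_a = for_b = 0
    for voter in contacted:
        if voter < N_FOR_A:
            for_a += rng.random() < P_RESPOND_A   # responds with prob. P_RESPOND_A
        else:
            for_b += rng.random() < P_RESPOND_B   # responds with prob. P_RESPOND_B
    return for_a / (for_a + for_b)


rng = random.Random(2016)  # fixed seed so the sketch is repeatable
results = {}
for n_contacted in (500, 5_000):
    # Repeat the poll 100 times to see its sampling variation.
    shares = [poll_share(rng, n_contacted) for _ in range(100)]
    results[n_contacted] = (statistics.mean(shares), statistics.stdev(shares))
    mean_share, sd_share = results[n_contacted]
    print(f"contacted={n_contacted}: mean share={mean_share:.3f}, sd={sd_share:.3f}")

print(f"true share={TRUE_SHARE:.3f}")
```

Under these assumed response rates, the average poll result stays away from the true share no matter how many voters are contacted. A larger sample only narrows the spread around the wrong value.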

In the second simulation study, we examine a controlled experiment that was used to demonstrate the efficacy of a COVID-19 vaccine but also launched a heated debate on the relative efficacy of vaccines. Abstracting the experiment to an urn model gives us a tool for studying assignment variation in randomized controlled experiments. Through simulation, we find the expected outcome of the clinical trial. Our simulation, along with careful examination of the data scope, debunks claims of vaccine ineffectiveness.

However, before we tackle some of the most significant data debates of our time, we will first start small, very small, with the story of a few marbles living in an urn.