3. Simulation and Data Design

In this chapter, we develop the theory behind the chance processes introduced in Chapter 2. This theory makes the concepts of bias and variation more precise. We continue to reason about the accuracy of our data through the urn model, an abstraction first introduced in Chapter 2, and we use simulation studies to help us understand the data and make decisions based on it.

The urn model gives us a technical framework to design and run simulation studies to understand larger and more complex situations. For example, we can dive deeper into understanding how the pollsters might have gotten the 2016 Presidential Election predictions wrong (Chapter 2). To do this, we use the actual votes cast in Pennsylvania and simulate the sampling variation for a poll drawn from the state's six million voters. This simulation helps us uncover how response bias can skew polls, and it convinces us that collecting a lot more data would not have helped the situation (another example of big data hubris).
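As a rough illustration of what simulating sampling variation from an urn can look like, here is a minimal sketch using only the Python standard library. The urn contents (10,000 marbles, 48% marked), the sample size, and the function name `simulate_poll_proportions` are all hypothetical choices made for this example; they are not the Pennsylvania vote counts or the code used later in the chapter.

```python
import random

def simulate_poll_proportions(n_marked, n_unmarked, sample_size, n_reps, seed=0):
    """Repeatedly draw from an urn without replacement and return the
    proportion of marked marbles observed in each simulated sample."""
    rng = random.Random(seed)
    # 1 stands for a marked marble, 0 for an unmarked one.
    urn = [1] * n_marked + [0] * n_unmarked
    return [
        sum(rng.sample(urn, sample_size)) / sample_size
        for _ in range(n_reps)
    ]

# Hypothetical urn (an assumption for illustration): 48% of marbles are marked.
props = simulate_poll_proportions(4_800, 5_200, sample_size=1_000, n_reps=500)

# Summarize the simulated sampling distribution of the sample proportion.
mean_prop = sum(props) / len(props)
spread = (sum((p - mean_prop) ** 2 for p in props) / len(props)) ** 0.5
print(round(mean_prop, 3), round(spread, 3))
```

The spread of the simulated proportions is one way to see sampling variation: it describes how much a poll result could move from one random sample to the next, even with no bias at all.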

In a second simulation study, we examine the efficacy of a COVID-19 vaccine. A designed experiment for the vaccine was carried out on over 50,000 volunteers. Abstracting the experiment to an urn model gives us a tool for studying assignment variation in randomized controlled experiments. Through simulation, we find the expected outcome of the clinical trial. Our simulation, along with careful examination of the data scope, debunks claims of vaccine ineffectiveness.

In addition to sampling variation and assignment variation, we also cast measurement error in terms of an urn model. We use multiple measurements from different times of the day to estimate the accuracy of an air quality sensor. Later, in Chapter 12, we provide a more comprehensive treatment of measurement error and instrument calibration for air quality sensors.

We begin with an artificial example of a small population; it’s so small that we can list all the possible samples that can be drawn from the population.
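To make the idea of listing every possible sample concrete, the sketch below enumerates all samples of size 2 from a population of five units. The population labels and the sample size are hypothetical, chosen only for illustration, and the sketch assumes draws are made without replacement and that order does not matter.

```python
from itertools import combinations

# Hypothetical population of five labeled units (an assumption, not from the text).
population = ["A", "B", "C", "D", "E"]
sample_size = 2

# Every distinct sample of size 2, drawn without replacement, ignoring order.
all_samples = list(combinations(population, sample_size))

for sample in all_samples:
    print(sample)

# Under simple random sampling, each listed sample is equally likely.
chance_of_each = 1 / len(all_samples)
print(len(all_samples), chance_of_each)
```

With a population this small, the full list of samples fits on one screen, which is what makes exact chance calculations possible before turning to simulation for larger problems.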