Theory for Data Design

In this chapter, we develop the theory behind the chance processes introduced in Chapter 2 (Data Scope). This theory makes the concepts of bias and variation more precise. We continue to motivate the accuracy of our data through the abstraction of the urn model, first introduced in Section 2.5.2 of Chapter 2, and we use both basic probability and simulation studies to develop the theory.

We use the urn model in two ways. First, we consider an artificial example with a small population. Since the population is so small, we can calculate exactly the chance of drawing any particular sample from it (see Section sec:theory_samplingVariation). Next, we use the urn model as a technical framework to design and run simulation studies that help us understand larger and more complex situations. We return to some of the examples from Chapter 2 and, for example, dive deeper into how the pollsters might have gotten the 2016 Presidential Election predictions wrong (Section 3.2). We use the actual votes cast in Pennsylvania to simulate the sampling variation for a poll of 1,400 from six million voters. This simulation helps us uncover how response bias can skew polls, and it convinces us that collecting a lot more data would not have helped the situation (another example of big data hubris).
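As a rough illustration of these two uses, the sketch below first enumerates every possible sample from a tiny urn to get exact chances, and then simulates repeated polls from a large population. The urn contents, the function names, and the 2,900,000 vote count are all made up for illustration; only the poll size of 1,400 and the population of six million come from the text, and the vote split is not the actual Pennsylvania tally.

```python
import random
import statistics
from fractions import Fraction
from itertools import combinations

# Hypothetical urn: 7 marbles, 3 marked (1) and 4 unmarked (0).
urn = [1, 1, 1, 0, 0, 0, 0]
SAMPLE_SIZE = 3

# Under simple random sampling, every set of 3 marbles is equally likely.
all_samples = list(combinations(range(len(urn)), SAMPLE_SIZE))

def exact_chance(num_marked):
    """Exact chance that a sample holds exactly num_marked marked marbles."""
    hits = sum(1 for s in all_samples if sum(urn[i] for i in s) == num_marked)
    return Fraction(hits, len(all_samples))

exact = {k: exact_chance(k) for k in range(SAMPLE_SIZE + 1)}

def simulate_chance(num_marked, reps=20_000, seed=0):
    """Approximate the same chance by drawing from the urn many times."""
    rng = random.Random(seed)
    hits = sum(sum(rng.sample(urn, SAMPLE_SIZE)) == num_marked for _ in range(reps))
    return hits / reps

# Compare the exact chances with their simulated approximations.
for k, chance in exact.items():
    print(k, chance, round(simulate_chance(k), 3))

def simulate_poll(num_voters, num_for_a, poll_size, reps, seed=0):
    """Simulate the proportion for candidate A in repeated simple random samples.

    Voters are numbered 0..num_voters-1, and the first num_for_a of them
    are treated as candidate A's voters.
    """
    rng = random.Random(seed)
    return [
        sum(v < num_for_a for v in rng.sample(range(num_voters), poll_size)) / poll_size
        for _ in range(reps)
    ]

# Illustrative vote split only (NOT the actual Pennsylvania counts).
poll_props = simulate_poll(6_000_000, 2_900_000, 1_400, reps=500)
print("average simulated proportion:", round(statistics.mean(poll_props), 4))
print("spread (SD) across polls:", round(statistics.pstdev(poll_props), 4))
```

The enumeration is only feasible because the urn is tiny; for the poll, listing every possible sample of 1,400 voters is out of the question, which is why we fall back on simulation there.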

In a second simulation study (Section 3.3), we examine the efficacy of a COVID-19 vaccine. A designed experiment for the vaccine was carried out on over 50,000 volunteers. Abstracting the experiment to an urn model gives us a tool for studying assignment variation in randomized controlled experiments. Through simulation, we find the expected outcome of the clinical trial. Our simulation, along with careful examination of the data scope, debunks claims of vaccine ineffectiveness.
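To make the idea of assignment variation concrete, here is a minimal sketch of an urn model for random assignment. It assumes the treatment has no effect, so the volunteers who get sick are fixed in advance and only the group they land in is random. The trial size, number of cases, observed count, and function name below are hypothetical and far smaller than the actual trial.

```python
import random

def simulate_assignment(num_volunteers, num_treated, num_cases, reps, seed=0):
    """Count the cases that land in the treatment group under random assignment,
    assuming the treatment has no effect on who gets sick."""
    rng = random.Random(seed)
    # One marble per volunteer: 1 if they get sick regardless of group, else 0.
    urn = [1] * num_cases + [0] * (num_volunteers - num_cases)
    # Drawing num_treated marbles without replacement mimics random assignment.
    return [sum(rng.sample(urn, num_treated)) for _ in range(reps)]

# Hypothetical trial: 1,000 volunteers, 500 treated, 20 cases in total.
counts = simulate_assignment(1_000, 500, 20, reps=2_000)

# With no treatment effect, about half the cases should fall in each group.
expected = sum(counts) / len(counts)

# Hypothetical observed result: only 3 of the 20 cases were in the treatment group.
observed = 3
share_as_extreme = sum(c <= observed for c in counts) / len(counts)

print("average cases in treatment group:", expected)
print("share of simulations with at most", observed, "cases:", share_as_extreme)
```

If an outcome like the hypothetical one here almost never shows up in the simulated assignments, then chance assignment alone is a poor explanation for it.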

In addition to sampling variation and assignment variation, we also cast measurement error in terms of an urn model. In Section sec:theory_measurementError, we use multiple measurements from different times of the day to estimate the accuracy of an air quality sensor. Later, in Chapter 11, we provide a more comprehensive treatment of measurement error and instrument calibration for air quality sensors.

The urn model is at the core of the simple random sample. The simple random sample can in turn be extended to the stratified random sample and beyond, and these extensions form the basis of many complex surveys (see Section 3.5).

Simulation studies enable us to approximate the typical variation in a chance process, which in turn tells us about the accuracy of summary statistics. If you are looking for a more technical approach to the topic, in Section sec:theory_probability we use probability to work out formulas that formalize the theory of the urn model. On the other hand, you may wish to skip that section until you find the need for this formal theory.