2.5. Accuracy

In a census, the access frame matches the population, and the sample captures the entire population. In this situation, if we administer a well-designed questionnaire, then we have complete and accurate knowledge of the population and the scope is complete. Similarly in measuring air quality, if our instrument has perfect accuracy and is properly used, then we can measure the exact value of the air quality. These situations are rare, if not impossible. In most settings, we need to quantify the accuracy of our measurements in order to generalize our findings to the unobserved. For example, we often use the sample to estimate an average value for a population, infer the value of a scientific unknown from measurements, or predict the behavior of a new individual. In each of these settings, we also want a quantifiable degree of accuracy. We want to know how close our estimates, inferences, and predictions are to the truth.

The analogy of darts thrown at a dart board that was introduced earlier can be useful in understanding accuracy. We divide accuracy into two basic parts: bias and variance (also known as precision). Our goal is for the darts to hit the bullseye on the dart board and for the bullseye to line up with the unseen target. The spray of the darts on the board represents the variance in our measurements, and the gap from the bullseye to the unknown value that we are targeting represents the bias. Figure Figure 2.6 shows combinations of low and high bias and variance.


Fig. 2.6 In each of these diagrams, the dots represent the measurements taken. They form a scattershot within the access frame represented by the dart board. When the bullseye of the access frame is roughly centered on the targetted value (top row), the measurements are scattered around it and bias is low. The larger dart boards (right column) indicate a wider spread in the measurements.

Representative data puts us in the top row of the diagram, where there is low bias, meaning that the bullseye and the unseen target are in alignment. Ideally our instruments and protocols put us in the upper left part of the diagram, where the variance is also low. The pattern of points in the bottom row systematically miss the targeted value. Taking larger samples will not correct this bias.

2.5.1. Types of Bias

Bias comes in many forms. We describe some classic types here and connect them to our target-access-sample framework.

  • Coverage bias occurs when the access frame does not include everyone in the target population. For example, a survey based on cell-phone calls cannot reach those with only a landline or no phone. In this situation, those who cannot be reached may differ in important ways from those in the access frame.

  • Selection bias arises when the mechanism used to choose units for the sample tends to select certain units more often than they should. As an example, a convenience sample chooses the units most easily available. Problems can arise when those who are easy to reach differ in important ways from those harder to reach. Another example of selection bias can happen with observational studies and experiments. These studies often rely on volunteers (people who choose to participate), and this self-selection has the potential for bias, if the volunteers differ from the target population in important ways.

  • Non-response bias comes in two forms: unit and item. Unit non-response happens when someone selected for a sample is unwilling to participate, and item non-response occurs when, say, someone in the sample refuses to answer a particular survey question. Non-response can lead to bias if those who choose not to participate or to not answer a particular question are systematically different from those who respond.

  • Measurement bias happens when an instrument systematically misses the target in one direction. For example, low humidity can systematically give us incorrectly high measurements of air pollution. In addition, measurement devices can become unstable and drift over time and so produce systematic errors. In surveys, measurement bias can arise when questions are confusingly worded or leading, or when respondents may not be comfortable answering honestly.

Each of these types of bias can lead to situations where the data are not centered on the unknown targeted value. Often we cannot assess the potential magnitude of the bias, since little to no information is available on those who are outside of the access frame, less likely to be selected for the sample, or disinclined to respond. Protocols are key to reducing these sources of bias. Chance mechanisms to select a sample from the frame or to assign units to experimental conditions can eliminate selection bias. A non-response follow-up protocol to encourage participation can reduce non-response bias. A pilot survey can improve question wording and so reduce measurement bias. Procedures to calibrate instruments and protocols to take measurements in, say, random order can reduce measurement bias.

In the 2016 US Presidential Election, non-response bias and measurement bias were key factors in the inaccurate predictions of the winner. Nearly all voter polls leading up to the election predicted Clinton a winner over Trump. Clinton’s upset victory came as a surprise. After the election, many polling experts attempted to diagnose where things went wrong in the polls. The American Association for Public Opinion Research [Kennedy et al., 2017], found predictions made were flawed for two key reasons:

  • Over-representation of college-educated voters. College-educated voters are more likely to participate in surveys than those with less education, and in 2016 they were more likely to support Clinton. Non-response biased the sample and over-estimated support for Clinton [Pew Research Center, 2012].

  • Voters were undecided or changed their preferences a few days before the election. Since a poll is static and can only directly measure current beliefs, it cannot reflect a shift in attitudes.

It’s difficult to figure out whether people held back their preference or changed their preference and how large a bias this created. However, exit polls have helped polling experts understand what happened, after the fact. They indicate that in battleground states, such as Michigan, many voters made their choice in the final week of the campaign, and that group went for Trump by a wide margin.

Bias does not need to be avoided under all circumstances. If an instrument is highly precise (low variance) and has a small bias, then that instrument might be preferable to another with higher variance and no bias. As an example, biased studies are potentially useful to pilot a survey instrument or to capture useful information for the design of a larger study. Many times we can at best recruit volunteers for a study. Given this limitation, it can still be useful to enroll these volunteers in the study and use random assignment to split them into treatment groups. That’s the idea behind randomized controlled experiments.

Whether or not bias is present, data typically also exhibit variation. Variation can be introduced purposely by using a chance mechanism to select a sample, and it can occur naturally through an instrument’s precision. In the next section, we identify three common sources of variation.

2.5.2. Types of Variation

Variation that results from a chance mechanism has the advantage of being quantifiable.

  • Sampling variation results from using chance to take a sample. We can in principle compute the chance a particular sample is selected.

  • Assignment variation of units to treatment groups in a controlled experiment produces variation. If we split the units up differently, then we can get different results from the experiment. This randomness allows us to compute the chance of a particular group assignment.

  • Measurement error for instruments result from the measurement process; if the instrument has no drift and a reliable distribution of errors, then when we take multiple measurements on the same object, we get variations in measurements that are centered on the truth.

The Urn Model is a simple abstraction that can be helpful for understanding variation. This model examines a container (an urn) full of identical marbles that have been labeled, and we use the simple action of drawing balls from the urn to reason about sampling schemes, randomized controlled experiments, and measurement error. For each of these types of variation, the urn model helps us estimate the size of the variation using either probability or simulation (see Chapter 3). The example of selecting Wikipedia contributors to receive an informal award provides two examples of the urn model.

Recall that for the Wikipedia experiment, a group of 200 contributors was selected at random from 1,440 top contributors. These 200 contributors were then split, again at random, into two groups of 100 each. One group received an informal award and the other didn’t. Here’s how we use the urn model to characterize this process of selection and splitting:

  • Imagine an urn filled with 1,440 marbles that are identical in shape and size, and written on each marble is one of the 1,440 Wikipedia usernames. (This is the access frame.)

  • Mix the marbles in the urn really well, select one marble and set it aside.

  • Repeat the mixing and selecting of the marbles to obtain 200 marbles.

The marbles drawn form the sample. Then, to determine which of the 200 contributors receives the informal award, we work with another urn.

  • In a second urn, put in the 200 marbles from the above sample.

  • Mix these marbles well and select one marble and set it aside.

  • Repeat. That is, choose 100 marbles, one at a time, mixing in between, and setting the chosen marble aside.

The 100 drawn marbles are assigned to the treatment group and correspond to the contributors who receive the award. The 100 left in the urn form the control group and receive no award.

Both the selection of the sample and the choice of award recipients use a chance mechanism. If we were to repeat the first sampling activity again, returning all 1,440 the marbles to the original urn, then we would most likely get a different sample. This variation is the source of sampling variation. Likewise, if we were to repeat the random assignment process again (keeping the sample of 200 from the first step unchanged), then we would get a different treatment group. Assignment variation arises from this second chance process.

The Wikipedia experiment provided an example of both sampling and assignment variation. In both cases, the researcher imposed a chance mechanism on the data collection process. Measurement error can at times also be considered a chance process that follows an urn model. We characterize the measurement error in the air quality sensors in this way in the following example.

If we can draw an accurate analogy between variation in the data and the urn model, the urn model provides us the tools to estimate the size of the variation. (See Chapter 3). This is highly desirable because way we can give concrete values for the variation in our data. However, it’s vital to confirm that the urn model is a reasonable depiction of the source of variation. Otherwise, our claims of accuracy can be seriously flawed. Knowing as much as possible about data scope, including instruments and protocols and chance mechanism used in data collection, are needed to apply urn models.