In a census, the access frame matches the population, and the sample captures the entire population. In this situation, if we administer a well-designed questionnaire, then we have complete and accurate knowledge of the population, and the scope is complete. Similarly, in measuring air quality, if our instrument has perfect accuracy and is properly used, then we can measure the exact value of the air quality. These situations are rare, if not impossible. In most settings, we need to quantify the accuracy of our measurements in order to generalize our findings to the unobserved. For example, we often use the sample to estimate an average value for a population, infer the value of a scientific unknown from measurements, or predict the behavior of a new individual. In each of these settings, we also want a quantifiable degree of accuracy: we want to know how close our estimates, inferences, and predictions are to the truth.
The analogy of darts thrown at a dartboard, introduced earlier in Section 2.4, can be useful in understanding accuracy. We divide accuracy into two basic parts: bias and variance (also known as precision). Our goal is for the darts to hit the bullseye on the dartboard and for the bullseye to line up with the unseen target. The spray of the darts on the board represents the variance in our measurements, and the gap between the bullseye and the unknown value that we are targeting represents the bias. Fig. 2.6 shows combinations of low and high bias and variance.
Representative data puts us in the top row of the diagram, where there is low bias, meaning that the bullseye and the unseen target are in alignment. Ideally, our instruments and protocols put us in the upper left part of the diagram, where the variance is also low. The pattern of points in the bottom row systematically misses the targeted value. Taking larger samples will not correct this bias.
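A small simulation can mimic the four combinations in the diagram. The sketch below is illustrative only: the offset and spread values are made up, and darts are reduced to one dimension. Each throw lands at a systematic offset (the bias) plus random scatter (the variance), so the average landing spot stays near the offset no matter how many darts we throw.

```python
import random

random.seed(7)

def throw_darts(bias, spread, n=1000):
    """Simulate n one-dimensional darts aimed at target 0,
    with a systematic offset (bias) and random scatter (spread)."""
    return [bias + random.gauss(0, spread) for _ in range(n)]

# The four combinations in the diagram (values chosen for illustration)
settings = [(0.0, 0.1, "low bias, low variance"),
            (0.0, 1.0, "low bias, high variance"),
            (2.0, 0.1, "high bias, low variance"),
            (2.0, 1.0, "high bias, high variance")]

for bias, spread, label in settings:
    darts = throw_darts(bias, spread)
    center = sum(darts) / len(darts)
    # the average landing spot stays near the bias, however many darts we throw
    print(f"{label}: average landing spot {center:.2f}")
```

In the two high-bias settings, the average landing spot stays near 2 even as we add darts, which is exactly why a larger sample cannot correct bias.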
2.5.1. Types of Bias¶
Bias comes in many forms. We describe some classic types here and connect them to our target-access-sample framework.
Coverage bias can occur when the access frame does not include everyone in the target population. For example, a survey based on cell-phone calls cannot reach those with only a landline or no phone. In this situation, those who cannot be reached may differ in important ways from those in the access frame.
Selection bias can arise when the mechanism used to choose units for the sample from the access frame tends to select certain units more often than they should. As an example, a convenience sample chooses the units most easily available. Problems can arise when those who are easy to reach differ in important ways from those harder to reach. Another example of selection bias can happen with observational studies and experiments. These studies often rely on volunteers (people who choose to participate), and this self-selection has the potential for bias, if the volunteers differ from the target population in important ways.
Non-response bias comes in two forms: unit and item. Unit non-response happens when someone selected for a sample is unwilling to participate, and item non-response occurs when, say, someone in the sample refuses to answer a particular survey question. Non-response can lead to bias if those who choose not to participate or to not answer a particular question are systematically different from those who respond.
Measurement bias arises when an instrument systematically misses the target in one direction. For example, low humidity can systematically give us incorrectly high measurements of air pollution. In addition, measurement devices can become unstable and drift over time and so produce systematic errors. In surveys, measurement bias can arise when questions are confusingly worded or leading, or when respondents may not be comfortable answering honestly.
Each of these types of bias can lead to situations where the data are not centered on the unknown targeted value. Often we cannot assess the potential magnitude of the bias, since little to no information is available on those who are outside the access frame, less likely to be selected for the sample, or disinclined to respond. Protocols are key to reducing these sources of bias. Chance mechanisms to select a sample from the frame or to assign units to experimental conditions can eliminate selection bias. A non-response follow-up protocol to encourage participation can reduce non-response bias. A pilot survey can improve question wording and so reduce measurement bias. Procedures to calibrate instruments, and protocols to take measurements in, say, random order, can also reduce measurement bias.
In the 2016 US Presidential Election, non-response bias and measurement bias were key factors in the inaccurate predictions of the winner. Nearly all voter polls leading up to the election predicted Clinton a winner over Trump. Clinton’s upset victory came as a surprise. After the election, many polling experts attempted to diagnose where things went wrong in the polls.
EXAMPLE: 2016 US Presidential Election Upset, Ctd. According to the American Association for Public Opinion Research [Kennedy et al., 2017], predictions made before the election were flawed for two key reasons:
Over-representation of college-educated voters. College-educated voters are more likely to participate in surveys than those with less education, and in 2016 they were more likely to support Clinton. Non-response biased the sample and over-estimated support for Clinton [Pew Research Center, 2012].
Voters were undecided or changed their preferences a few days before the election. Since a poll is static and can only directly measure current beliefs, it cannot reflect a shift in attitudes.
It’s difficult to figure out whether people held back their preference or changed their preference and how large a bias this created. However, exit polls have helped polling experts understand what happened, after the fact. They indicate that in battleground states, such as Michigan, many voters made their choice in the final week of the campaign, and that group went for Trump by a wide margin. \(\blacksquare\)
Bias does not need to be avoided under all circumstances. If an instrument is highly precise (low variance) and has a small bias, then that instrument might be preferable to another with higher variance and no bias. As another example, biased studies can be useful for piloting a survey instrument or for capturing information to guide the design of a larger study. Many times, the best we can do is recruit volunteers for a study. Given this limitation, it can still be useful to enroll these volunteers in the study and use random assignment to split them into treatment groups. That’s the idea behind randomized controlled experiments.
Whether or not bias is present, data typically also exhibit variation. Variation can be introduced purposely by using a chance mechanism to select a sample, and it can occur naturally through an instrument’s precision. In the next section, we identify three common sources of variation.
2.5.2. Types of Variation¶
Variation that results from a chance mechanism has the advantage of being quantifiable.
Sampling variation is the variation that results when we use chance to take a sample. We can, in principle, compute the chance that a particular sample is selected.
Assignment variation arises when we use chance to assign units to treatment groups in a controlled experiment. If we split the units up differently, then we could get different results from the experiment. This randomness allows us to compute the chance of a particular group assignment.
Measurement error for instruments is the error that results from the measurement process. If the instrument has no drift and a reliable distribution of errors, then when we take multiple measurements on the same object, we get variation in measurements that is centered on the truth.
The urn model is a simple abstraction that can be helpful for understanding variation. This model examines a container (an urn) full of identical, labeled marbles. We can use this simplified model of drawing balls from the urn to reason about many sampling schemes, randomized controlled experiments, and some kinds of measurement error, as well as their implications for the data we collect. For each of these types of variation, the urn model helps us estimate the size of the variation using either probability or simulation (see Chapter 3). The example of selecting Wikipedia contributors to receive an informal award provides two examples of the urn model.
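As a concrete sketch, drawing marbles from an urn corresponds to sampling labels without replacement. The urn below is a small, made-up one with ten marbles; the labels are placeholders:

```python
import random

random.seed(42)  # for reproducibility

# A small, hypothetical urn: 10 identical marbles, each with a label
urn = [f"marble_{i}" for i in range(1, 11)]

# "Mix well, draw one, set it aside," repeated 3 times, is
# sampling without replacement
draws = random.sample(urn, 3)
print(draws)
```

Running the draw again with a different seed (a different mixing of the urn) typically yields a different sample, which is the variation the urn model captures.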
EXAMPLE: Informal Rewards and Peer Production, Ctd. Recall that for the Wikipedia experiment, a group of 200 contributors was selected at random from 1,440 top contributors. These 200 contributors were then split, again at random, into two groups of 100 each. One group received an informal award and the other didn’t. Here’s how we use the urn model to characterize this process of selection and splitting:
Imagine an urn filled with 1,440 marbles that are identical in shape and size, and written on each marble is one of the 1,440 Wikipedia usernames. (This is the access frame.)
Mix the marbles in the urn really well, select one marble and set it aside.
Repeat the mixing and selecting of the marbles to obtain 200 marbles.
The marbles drawn form the sample. Then, to determine which of the 200 contributors receives the informal award, we work with another urn.
In a second urn, put in the 200 marbles from the above sample.
Mix these marbles well and select one marble and set it aside.
Repeat. That is, choose 100 marbles, one at a time, mixing in between, and setting the chosen marble aside.
The 100 drawn marbles are assigned to the treatment group and correspond to the contributors who receive the award. The 100 left in the urn form the control group and receive no award.
Both the selection of the sample and the choice of award recipients use a chance mechanism. If we were to repeat the first sampling activity, returning all 1,440 marbles to the original urn, then we would most likely get a different sample. This variation is the source of sampling variation. Likewise, if we were to repeat the random assignment process (keeping the sample of 200 from the first step unchanged), then we would get a different treatment group. Assignment variation arises from this second chance process. \(\blacksquare\)
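The two-urn process above can be sketched in a few lines of Python. The usernames here are placeholders invented for the simulation, not the real contributor names:

```python
import random

random.seed(0)  # for reproducibility

# Urn 1: the access frame of 1,440 top contributors (placeholder names)
frame = [f"user_{i}" for i in range(1, 1441)]

# Stage 1: mix well and draw 200 marbles without replacement -- the sample
sample = random.sample(frame, 200)

# Stage 2: put the 200 sampled marbles in a second urn and draw 100
# for the treatment group; the marbles left behind form the control group
treatment = random.sample(sample, 100)
control = [user for user in sample if user not in set(treatment)]

print(len(sample), len(treatment), len(control))  # 200 100 100
```

Rerunning stage 1 with a fresh mix gives a different sample (sampling variation); rerunning stage 2 on the same sample gives a different split (assignment variation).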
The Wikipedia experiment provided an example of both sampling and assignment variation. In both cases, the researcher imposed a chance mechanism on the data collection process. Measurement error can at times also be considered a chance process that follows an urn model. We characterize the measurement error in the air quality sensors in this way in the following example.
EXAMPLE: Purple Air, Ctd. Imagine a well-calibrated instrument as an urn with a very large (essentially infinite) collection of marbles in it; each marble has an error written on it. This error is the difference between the value the instrument reports and the unknown value we are trying to measure. Each time we take a measurement with the instrument, a marble is selected from the urn at random, and its error gets added to the unknown value that we are trying to measure. This is a hypothetical model, because we don’t know the true value and so can’t perform the addition. We observe only the final measurement: the sum of the true value and the error, not the values on the marbles.
In this example, the urn contains the measurement errors for an instrument, and if the instrument is well calibrated, the errors contain no bias; that is, they show no trend, pattern, or systematic error. This situation resembles the low-bias row in Fig. 2.6. However, if the instrument is biased, then an additional systematic error is added to each draw before we see the measurement. Unfortunately, we can’t tell the difference between these two situations from the measurements alone; we don’t know whether we are in the low-bias or high-bias rows of Fig. 2.6. This is why instrument calibration is so important.
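We can make the error urn concrete with a simulation. The sketch below assumes the urn behaves like a normal distribution with mean zero (a well-calibrated instrument), and it fixes a "true value" that, in practice, we would never know:

```python
import random

random.seed(1)

true_value = 12.0   # unknown in practice; fixed here only for the simulation
n = 100_000

# Each measurement draws an error marble from the urn and adds it to
# the true value; a mean-zero urn models a well-calibrated instrument
measurements = [true_value + random.gauss(0, 2.0) for _ in range(n)]

average = sum(measurements) / n
print(round(average, 2))   # close to 12.0: the errors average out
```

If the instrument were biased, a systematic offset would be added to every draw; from the measurements alone, a true value of 12 with an offset of 1 is indistinguishable from a true value of 13 with no offset, which is why calibration against a trusted reference matters.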
Calibration brings an instrument into alignment in terms of both bias and variability. One way to measure bias is to compare measurements taken with our instrument to those taken with a different, highly accurate, well-maintained instrument, such as an air monitor operated by the EPA. Why not just always rely on the EPA monitors? There are trade-offs. The citizen sensors give us a plethora of information that is relevant to our localized situation, whereas the official EPA equipment provides fewer, more accurate measurements that are less specific to our setting. Both are useful, and as we show in Chapter 11, the highly precise equipment can be used to calibrate the sensors. \(\blacksquare\)
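As a rough sketch of the calibration idea (Chapter 11 develops it properly), we can place a sensor side by side with a reference monitor, fit a line to the paired readings, and use that line to adjust future sensor readings. The paired readings below are invented for illustration:

```python
# Hypothetical paired readings: our sensor vs. a reference monitor
sensor = [5.1, 9.8, 15.2, 20.5, 24.9]
reference = [4.0, 8.0, 12.0, 16.0, 20.0]

# Least-squares fit of: reference = intercept + slope * sensor
n = len(sensor)
mean_s = sum(sensor) / n
mean_r = sum(reference) / n
slope = (sum((s - mean_s) * (r - mean_r) for s, r in zip(sensor, reference))
         / sum((s - mean_s) ** 2 for s in sensor))
intercept = mean_r - slope * mean_s

def calibrate(reading):
    """Adjust a raw sensor reading using the fitted line."""
    return intercept + slope * reading

print(round(calibrate(18.0), 1))   # a raw reading of 18.0, calibrated
```

With these made-up readings, the sensor systematically reads high, and the fitted slope (below 1) pulls its readings back toward the reference scale.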
If we can draw an accurate analogy between the variation in our data and the urn model, then the urn model provides us with tools to estimate the size of the variation (see Chapter 3). This is highly desirable because we can then give concrete values for the variation in our data. However, it’s vital to confirm that the urn model is a reasonable depiction of the source of variation; otherwise, our claims of accuracy can be seriously flawed. Applying an urn model requires knowing as much as possible about the data scope, including the instruments, protocols, and chance mechanisms used in data collection.