In a census, the frame matches the population, and the sample captures the entire population. If we administer a well-designed questionnaire, then we have complete and accurate knowledge of the population and the scope is complete. Similarly in measuring air quality, if our instrument has perfect accuracy and is properly used, then we can measure the exact value of the air quality. These situations are rare, if not impossible. In most settings, we need to quantify the accuracy of our measurements in order to generalize our findings to the unobserved; e.g., we use the sample to estimate an average value for a population, infer the value of a scientific unknown from measurements, or predict the behavior of a new individual. In each of these settings, we also want a quantifiable degree of accuracy.
The analogy of shots fired at a target introduced in the Air Quality Example earlier can be useful in understanding accuracy. We divide accuracy into two basic parts: bias and variance (also known as precision). Our goal is to hit the bullseye. The spray of the shots represents the variance and the gap from the center of the spray to the bullseye represents the bias. Figure Fig. 2.7 shows combinations of low and high bias and variance.
Representative data puts us in the top row of the diagram, where there is low bias, and ideally our instruments and protocols put us in the upper left part of the diagram, where the variance is also low. The pattern of points in the bottom row systematically miss the bullseye. Taking larger samples will not correct the bias.
2.3.1. Types of Bias¶
Bias comes in many forms. We describe some classic types here and connect them to our data-collection construct.
Coverage bias can occur when not everyone in the target population belongs to the access frame; e.g., a survey based on cell-phones cannot reach those with only a landline or no phone, and those who cannot be reached may differ in important ways from those in the access frame.
Selection bias can arise when the mechanism used to choose units for the sample from the access frame tends to select certain units more often than they should; e.g., in a convenience sample the easiest units to reach are selected and these units may differ in important ways from those harder to reach. Additionally, observational studies and experiments often rely on volunteers (self-selection), which has the potential for bias if the volunteers differ from the target population in important ways.
Non-response bias comes in two forms: unit and item. Unit non-response happens when someone selected for a sample does not participate, and item non-response occurs when, say, someone in the sample refuses to answer a particular survey question. Non-response can lead to bias if those who choose not to participate or not to answer a particular question are systematically different from those who respond.
Measurement bias arises when an instrument systematically misses the bullseye in one direction, e.g., low humidity can systematically give us incorrectly high measurements of air pollution. In addition, measurement devices can become unstable and drift over time and so produce systematic errors. In surveys, measurement bias can arise when questions are confusingly worded or leading, or when respondents may not be comfortable answering honestly.
Each of these types of bias can lead to situations where the data are not centered on the bullseye. Often we cannot assess the potential magnitude of the bias, since little to no information is available on those who are outside of the access frame, less likely to be selected for the sample, or disinclined to respond. Protocols are key to reducing these sources of bias. Chance mechanisms to select a sample from the frame or to assign units to experimental conditions can eliminate selection bias; a non-response follow-up protocol to encourage participation can reduce non-response bias; a pilot survey can improve question wording and so reduce measurement bias. In other cases, procedures to calibrate instruments and protocols to take measurements in random order can reduce measurement bias.
184.108.40.206. Example: 2016 US Presidential Election Upset, Ctd.¶
After the 2016 US election, many experts re-examined the polls. According to the American Association for Public Opinion Research [Kennedy et al., 2017], predictions made before the election were flawed for two key reasons:
Over-representation of college-educated voters. College-educated voters are more likely to participate in surveys than those with less education, and in 2016 they were more likely to support Clinton. Non-response biased the sample and over-estimated support for Clinton [Pew Research Center, 2012].
Voters were undecided or changed their preferences a few days before the election. Since a poll is static and can only directly measure current beliefs, it cannot reflect a shift in attitudes.
It’s difficult to figure out whether people held back their preference or changed their preference and how large a bias this created. However, exit polls have helped experts understand what happened, after the fact. They indicate that in battleground states, such as Michigan, many voters made their choice in the final week of the campaign and that group went for Trump by a wide margin.
Bias does not need to be avoided under all circumstances. If an instrument is highly precise (low variance) and has a small bias, then that instrument might be preferable to another that has higher variance and no bias. Moreover, biased studies are potentially useful to pilot a survey instrument or to capture useful information for the design of a larger study. Many times we can at best recruit volunteers for a study; it can still be useful to engage these volunteers and use random assignment to enroll them in controlled experiments.
2.3.2. Types of Variation¶
Variation that results from a chance mechanism has the advantage of being quantifiable.
Random sampling error is the variation that results when we use chance to take a sample. We can in principle compute the chance a particular sample is selected.
Random assignment of units to treatment groups in a controlled experiment produces variation. If we split the units up differently, then we could get different results from the experiment. This randomness allows us to compute the chance of a particular group assignment.
Measurement error for instruments is the error that results in the measurement process; if the instrument has no drift and a reliable distribution of errors, then when we take multiple measurements on the same object, we would get a variation in measurements centered on the truth.
A conceptual tool that can be helpful for understanding variation is the urn model. The urn model applies to many sampling schemes, randomised controlled experiments, and some kinds of measurement error. For each of these types of variation, the urn model helps us estimate the size of the variation using either probability or simulation (see Chapter Section 3).
220.127.116.11. Example: Informal Rewards and Peer Production, Ctd.¶
Recall that for the Wikipedia experiment, 144,000 contributors were reduced to a collection of the top 1,440. From these, 200 were selected at random and then split, again at random, into two groups of 100 each. One group received an informal award and the other didn’t. Here’s how we use the urn model to characterize this process of selection and splitting:
Imagine an urn filled with 1,440 marbles that are identical in shape and size, and written on each marble is one of the 1,440 Wikipedia usernames. (This is the access frame.)
Mix the marbles in the urn really well, select one marble and set it aside. Repeat the mixing and selecting of the marbles to obtain 200 marbles. The marbles drawn form the sample.
Take another urn, and put the 200 marbles from the sample into it.
Mix these marbles well and choose 100, one at a time, mixing in between, and setting the chosen marble aside. The 100 drawn marbles are assigned to the treatment group and correspond to the contributors who receive the award. The 100 left in the urn form the control group and receive no award.
Both the selection of the sample and the choice of award recipients use a chance mechanism. If we were to repeat the sampling activity again, returning all the marbles to the original urn, then we would most likely get a different sample. This variation is the source of sampling error. Likewise, if we were to repeat the random assignment process again (keeping the sample the same), we would get a different treatment group. This variation arises from random assignment.
18.104.22.168. Example: Purple Air, Ctd.¶
The analogy of the urn model can also characterize the variation in the measurements of air quality. Imagine a well-calibrated instrument as an urn with a very large (infinite) set of marbles in it, each marble has an error written on it. This error is the difference between the true value we are trying to measure and the value the instrument reports. Each time we take a measurement with the instrument, a marble is selected from the urn at random and this error gets added to the true value of the quantity we are measuring. We observe only the final measurement; that is, we see the sum of the error plus the true value and not the values on the marbles. The urn contains measurement errors. A well-calibrated instrument shows no trend or pattern and the measurement error is similar to the low bias row in Fig. 2.7. However, if the instrument is biased, then additional systematic errors are added to each draw before we see the measurement. Unfortunately, we can’t tell the difference between these two situations, i.e., we don’t know if we are in the low bias or high bias rows in Fig. 2.7. This is why instrument calibration is so important. Calibration brings an instrument into alignment both in terms of bias and variability. One way to measure bias is to compare measurements taken from our instrument to those taken from a different highly accurate and well-maintained instrument, such as an air monitor operated by the EPA. Why not just rely on the EPA monitors? There are trade offs. The citizen sensors give us a plethora of information that is relevant to our localized situation, whereas the highly accurate equipment provides fewer, more accurate measurements that are less specific to our setting. Both are useful, and as we show in Chapter XX, the highly precise equipment can be used to calibrate the sensors.