9.3. Missing Values and Records

In Section 2.5, we considered the potential problems when the population and the access frame are not in alignment, that is, when we can’t access everyone we want to study. We also described problems when someone refuses to participate in the study. In these cases, entire records/observations are missing, and we discussed the kinds of bias that can occur due to missing records. If nonrespondents differ in critical ways from respondents or if the nonresponse rate is not negligible, then our analysis may be seriously flawed. The example of Section 3.2 showed that increasing the sample size without addressing nonresponse does not reduce nonresponse bias. Also in that section, we discussed ways to prevent nonresponse. These preventive measures include using incentives to encourage response, keeping surveys short, writing clear questions, training interviewers, and investing in extensive follow-up procedures. Unfortunately, some amount of nonresponse is unavoidable.

After nonresponse has occurred, it is sometimes possible to use models to predict the missing data. But predicting missing observations is never as good as observing them in the first place. Records are missing completely at random when the chance that a unit responds to a survey does not depend on what is being measured or on the sampling design. For example, if someone accidentally breaks the laboratory equipment at Mauna Loa and CO2 is not recorded for a day, there is no reason to think that the level of CO2 that day had something to do with the lost measurements.

At other times, we consider records missing at random given covariates when the nonresponse depends only on observed features and not on the main response. For example, an ER visit in the DAWN survey would be missing at random given covariates if, say, the nonresponse rate depended only on race or sex (and not on anything else). In these limited cases, the observed data can be weighted to adjust for nonresponse.
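As a rough illustration of weighting for nonresponse, the sketch below gives each respondent a weight equal to the inverse of the response rate in their covariate group. The data are made up, and the column names `group`, `responded`, and `y` are hypothetical; this is one simple weighting scheme under those assumptions, not the procedure used by any particular survey.

```python
import pandas as pd

# Toy sample (made-up values): `group` is an observed covariate,
# `responded` flags whether the unit answered, `y` is the response.
sample = pd.DataFrame({
    "group": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "responded": [True, True, True, True, True, True, False, False],
    "y": [1.0, 2.0, 3.0, 2.0, 10.0, 12.0, None, None],
})

# Response rate within each covariate group.
response_rate = sample.groupby("group")["responded"].mean()

# Each respondent stands in for 1 / (response rate) sampled units.
respondents = sample[sample["responded"]].copy()
respondents["weight"] = 1 / respondents["group"].map(response_rate)

unweighted_mean = respondents["y"].mean()
weighted_mean = (
    (respondents["y"] * respondents["weight"]).sum()
    / respondents["weight"].sum()
)
```

In this toy example, group `b` has a lower response rate, so its respondents count double, and the weighted mean moves toward that group's values.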

When a record is not entirely missing, but a particular field in a record is unavailable, we have nonresponse at the field-level. Some datasets use a special coding to signify that the information is missing. For example, Mauna Loa used -99.99 to indicate a missing CO2 measurement. We found only 7 of these values among 738 rows in the table. In the Mauna Loa case, we showed that these missing values have little impact on the analysis (Section 9.1).
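A special code like -99.99 needs to be converted before computing summaries, otherwise it is treated as a real measurement. The following is a minimal sketch on made-up readings; the column names `avg` and `avg_clean` are hypothetical and the values are not the actual Mauna Loa data.

```python
import numpy as np
import pandas as pd

# Toy monthly readings (made-up values); -99.99 marks a missing measurement.
co2 = pd.DataFrame({
    "month": [1, 2, 3, 4, 5],
    "avg": [315.71, -99.99, 316.19, 317.30, -99.99],
})

# Count the coded values before changing anything.
num_missing = (co2["avg"] == -99.99).sum()

# Store a cleaned copy where the code is NaN, which pandas skips in summaries.
co2["avg_clean"] = co2["avg"].replace(-99.99, np.nan)

mean_with_code = co2["avg"].mean()        # pulled down by the -99.99 codes
mean_without = co2["avg_clean"].mean()    # NaN values are ignored
```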

In some surveys, missing information is further categorized according to whether the respondent refused to answer, the respondent was unsure of the answer, or the interviewer didn’t ask the question. Each of these types of missing values is recorded using a different value. For example, many questions in the DAWN survey use a code of -7 for not applicable, -8 for not documented, and -9 for missing [1]. Codings such as these can help us further refine our study of nonresponse.
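When several codes are in use, it can help to tally each kind of missingness before collapsing them all to a single missing marker. The sketch below uses the three codes quoted above on a made-up field; the series name `answer` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Codes and labels as described in the text.
missing_codes = {-7: "not applicable", -8: "not documented", -9: "missing"}

# Toy field (made-up values) mixing real answers with missing-value codes.
answer = pd.Series([34, -8, 51, -9, -7, 27, -8], name="answer")
is_coded = answer.isin(list(missing_codes))

# Tally each kind of missingness while the distinction is still available.
missing_counts = answer[is_coded].map(missing_codes).value_counts()

# Then treat every coded value as NaN for numeric summaries.
answer_clean = answer.where(~is_coded, np.nan)
```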

9.3.1. Imputing Missing Values

At times, we substitute reasonable values for missing ones to create a “clean” data frame. This process is called imputation. Some common approaches for imputing values are deductive, mean, and hot-deck imputation.

In deductive imputation, we fill in a value through logical relations. For example, below is a row in the business data frame for San Francisco restaurant inspections. Its zip code is erroneously marked as “Ca”, and its latitude and longitude are missing. We can look up the address on the USPS website to get the correct zip code, and we can use Google Maps to find the latitude and longitude of the restaurant to fill in these missing values.

bus[bus['postal_code'] == "Ca"]

      business_id         name           address           city  ...  postal_code  latitude  longitude  phone_number
5480        88139  TACOLICIOUS  2250 CHESTNUT ST  San Francisco  ...           Ca       NaN        NaN  +14156496077

1 rows × 9 columns
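Once the correct values have been looked up, the fix can be applied in code so that the change is documented. The sketch below works on a toy stand-in for `bus` with only a few columns. The `corrections` dictionary and the `location_imputed` column are hypothetical, and all postal codes and coordinates in it are illustrative placeholders, not verified values for this restaurant.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `bus` data frame (made-up second row).
bus = pd.DataFrame({
    "business_id": [88139, 10],
    "postal_code": ["Ca", "94110"],
    "latitude": [np.nan, 37.75],
    "longitude": [np.nan, -122.42],
})

# Hypothetical hand-checked corrections keyed by business_id.
# These values are placeholders for what a manual lookup would return.
corrections = {
    88139: {"postal_code": "94123", "latitude": 37.80, "longitude": -122.44},
}

# Record which rows are being changed, then apply the corrections.
bus["location_imputed"] = bus["business_id"].isin(list(corrections))
for business_id, fixes in corrections.items():
    row = bus["business_id"] == business_id
    for column, value in fixes.items():
        bus.loc[row, column] = value
```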

Mean imputation uses an average value from rows in the dataset that have values. As a simple example, if a dataset on test scores has a missing value, mean imputation could fill in the missing value using the overall mean test score. A key issue with mean imputation is that the variability in the imputed feature will be smaller because the feature now has values that are identical to the mean. This affects later analysis if not handled properly. For instance, confidence intervals will be narrower than they should be.
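The shrinking variability is easy to see on a small example. The sketch below fills two missing test scores with the mean of the observed scores and compares standard deviations; the scores are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy test scores (made-up values) with two missing entries.
scores = pd.Series([62.0, 75.0, np.nan, 88.0, np.nan, 95.0], name="score")

# Fill each missing score with the mean of the observed scores.
observed_mean = scores.mean()            # NaN values are skipped
scores_imputed = scores.fillna(observed_mean)

# The mean is unchanged, but the spread shrinks.
sd_observed = scores.std()
sd_imputed = scores_imputed.std()
```

Here the imputed series has the same mean as the observed scores but a smaller standard deviation, which is the source of the overly narrow intervals.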

Hot-deck imputation uses a chance process to select a value at random from rows that have values. As a simple example, hot-deck imputation could fill in missing test scores by randomly choosing another test score in the dataset. A potential problem with hot-deck imputation is that the strength of a relationship might decline because we have added randomness to the values.
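A simple version of hot-deck imputation can be sketched as follows, assuming every observed score is an equally likely donor. The scores are made up, and the seed is fixed only so that the example is reproducible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Toy test scores (made-up values) with two missing entries.
scores = pd.Series([62.0, 75.0, np.nan, 88.0, np.nan, 95.0], name="score")

# Observed values act as donors for the missing ones.
donors = scores.dropna().to_numpy()
is_missing = scores.isna()

# Draw one donor value at random (with replacement) per missing entry.
scores_imputed = scores.copy()
scores_imputed[is_missing] = rng.choice(
    donors, size=int(is_missing.sum()), replace=True
)
```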

For mean and hot-deck imputation, we often impute values based on others in the dataset who are similar in other features to the nonrespondents. More sophisticated imputation techniques use nearest-neighbor methods to find similar subgroups of records and others use regression techniques to predict the missing value [Little and Rubin, 2019].
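One simple way to impute from similar records is to compute the mean within groups defined by another feature. The sketch below assumes a hypothetical `grade` column as the similarity feature and uses made-up scores; nearest-neighbor and regression methods are not shown.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values): scores differ a lot between grades.
tests = pd.DataFrame({
    "grade": [3, 3, 3, 5, 5, 5],
    "score": [60.0, np.nan, 70.0, 90.0, 94.0, np.nan],
})

# Mean of the observed scores within each grade, aligned to the rows.
group_mean = tests.groupby("grade")["score"].transform("mean")

# Fill each missing score with the mean for that row's grade.
tests["score_imputed"] = tests["score"].fillna(group_mean)
```

Filling with the within-grade mean keeps the imputed values closer to what similar records look like than the overall mean would.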

In any of these types of imputation, we should create a new feature that contains the altered data or a new feature to indicate whether or not the response in the original feature has been imputed.
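In pandas, this bookkeeping might look like the sketch below, which leaves the original feature untouched and adds both an indicator and a filled-in copy. The column names `score_was_imputed` and `score_filled` are hypothetical, the data are made up, and mean imputation is used only as a stand-in for whichever method is chosen.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values).
df = pd.DataFrame({"score": [62.0, np.nan, 88.0, np.nan]})

# Flag the rows that will be imputed before filling anything in.
df["score_was_imputed"] = df["score"].isna()

# Keep the original column and store the altered data in a new one.
df["score_filled"] = df["score"].fillna(df["score"].mean())
```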

9.3.2. Takeaways

This section discussed why data go missing and ways to handle missing data. Sometimes, missing data occur at random or at random given covariates. In these cases, the missing data are more feasible to fill in, or impute. We discussed a few ways to impute missing data, including deductive, mean, and hot-deck imputation.

Decisions to keep or drop a record, to change a value, or to remove a feature may seem small, but they are critical. One anomalous record can seriously impact your findings. Whatever you decide, be sure to check the impact of dropping or changing features and records. And be transparent and thorough in reporting any modifications you make to the data. It’s best to make these changes programmatically to reduce potential errors and enable others to confirm exactly what you have done by reviewing your code.

In the next section, we’ll discuss data transformations, with a special emphasis on timestamp data.

[1] See https://www.datafiles.samhsa.gov/sites/default/files/field-uploads-protected/studies/DAWN-2010/DAWN-2010-datasets/DAWN-2010-DS0001/DAWN-2010-DS0001-info/DAWN-2010-DS0001-info-codebook.pdf