
9.2. Quality Checks

Once your data are in a table and you understand their scope and granularity, it’s time to inspect their quality. You may have come across errors as you examined the data and wrangled them into a data frame. In this section, we continue the inspection and carry out a more comprehensive assessment of the quality of the features and their values. We consider quality from four vantage points:

  • Scope: Do the data match your understanding of the population?

  • Measurements and Values: Are the values reasonable?

  • Relationships: Are related features in agreement?

  • Analysis: Which features can be used?

Quality based on scope. In the earlier Questions and Data Scope chapter, we addressed whether or not the data that have been collected can adequately address the problem at hand. In that chapter, we analyzed the data scope by looking at the data’s target population, access frame, and sample. This framework helps us consider possible limitations in the way in which the data were collected that might impact the generalizability of our findings.

While these broader data scope considerations are important as we deliberate our final conclusions, they are also useful for checking data quality. For example, with San Francisco restaurant inspections, all restaurant zip codes should start with 941. But, we can see that there are several zip codes that don’t:

bus['postal_code'].value_counts().tail(10)
94080        1
94621        1
94544        1
            ..
64110        1
94545        1
941033148    1
Name: postal_code, Length: 10, dtype: int64

Checking the recorded zip codes against this expected prefix lets us spot potentially erroneous records in the data.
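For instance, a minimal sketch of this check, using the same postal_code column (the codes are treated as strings):

# Flag businesses whose zip code does not start with the 941 prefix
not_sf = ~bus['postal_code'].str.startswith('941', na=False)
bus.loc[not_sf, 'postal_code'].value_counts()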

As another example, a bit of background reading on atmospheric CO2 reveals that typical measurements are about 400 ppm worldwide. So we can check whether the monthly averages of CO2 at Mauna Loa lie between 300 and 400 ppm.
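A sketch of that range check, assuming the monthly readings are in a data frame called co2 with an avg column (both names are hypothetical):

# Count monthly averages that fall outside the plausible 300-400 ppm range
out_of_range = co2[~co2['avg'].between(300, 400)]
len(out_of_range)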

Quality of measurements and recorded values. We can use our knowledge of the scope of the data to check quality, as mentioned already, but we can also check the quality of measurements by considering what might be a reasonable value for a feature. For example, what seems like a reasonable range for the number of violations in a restaurant inspection? Possibly 0 to 5? Other checks can be based on common knowledge of ranges: a restaurant inspection score must be between 0 and 100; months run between 1 and 12. We can use documentation to tell us the expected values for a feature. For example, the type of emergency room visit in the DAWN survey has been coded as 1, 2, …, 8 (see Figure 9.1 below for a screenshot of the codebook explanation of the field), so we can confirm that all values for the type of visit are indeed integers between 1 and 8.


Fig. 9.1 Screenshot of the description of the CASETYPE variable in the DAWN survey. Notice that there are eight possible values for this feature. To help figure out whether we have properly read the data, we can check the counts for these eight values.
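As a quick confirmation, a sketch using the dawn data frame and type column that appear later in this section:

# Confirm that every recorded visit type is one of the eight documented codes
dawn['type'].isin(range(1, 9)).all()

# Check the counts for the eight values
dawn['type'].value_counts()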

More generic checks can confirm, say, that a proportion ranges between 0 and 1. We also want to ensure that the data type matches our expectations. For example, we expect a price to be a number, whether it’s stored as an integer, floating point, or string. Confirming that the units of measurement match what is expected can be another useful quality check to perform (for example, weight recorded in pounds, not kilograms). We can devise checks for all of these situations.
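For instance, a minimal sketch of a type and range check, assuming the inspections data frame insp has a numeric score column (the column name is an assumption):

# Check that the data type matches our expectations
insp['score'].dtype

# Count scores that fall outside the valid 0 to 100 range
(~insp['score'].between(0, 100)).sum()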

Quality of relationships. In addition to checking individual values in a column, we also want to cross-check values between features. To do this cross-checking, we use conditions that express the expected relationship between the values of two (or more) features.

For example, according to the documentation for the DAWN study, alcohol consumption is only considered a type of visit for patients under 21, so we can check that all instances that record alcohol as the type of visit also have an age recorded as under 21. Alcohol is coded as 3, and ages under 21 are coded as 1, 2, 3, and 4.

display_df(pd.crosstab(dawn['age'], dawn['type']), rows=12)
type   1     2     3      4     5    6     7      8
age
-8     2     2     0     21     5    1     1     36
 1     0     6    20   6231   313    4  2101     69
 2     8     2    15   1774   119    4   119     61
 3   914   121  2433   2595  1183   48    76   4563
 4   817   796  4953   3111  1021   95    44   6188
 5   983  1650     0   4404  1399  170    48   9614
 6  1068  1965     0   5697  1697  140    62  11408
 7   957  1748     0   5262  1527  100    60  10296
 8  1847  3411     0  10221  2845  113   115  18366
 9  1616  3770     0  12404  3407   75   150  18381
10   616  1207     0  12291  2412   31   169   7109
11   205   163     0  24085  2218   12   308   1537

The cross tabulation shows that all of the alcohol cases are for patients under 21. The data values are as expected.
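We can also express the rule directly as a logical condition; a sketch using the same codes (age codes 5 and above correspond to patients 21 and older):

# Alcohol visits (type 3) should only appear for patients under 21,
# so this count should be 0
len(dawn[(dawn['type'] == 3) & (dawn['age'] >= 5)])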

Quality for analysis. Even when data pass your quality checks, problems can arise with their usefulness. For example, if all but a handful of values for a feature are identical, then that feature adds little to the understanding of underlying patterns and relationships. Or, if there are too many missing values, especially if there is a discernible pattern in the missing values, our findings may be limited. If a feature has many bad or corrupted values, then we might question the accuracy of even those values that fall in the appropriate range.

We see below that the type of restaurant inspection in San Francisco can be either routine or from a complaint. Since only one of the 14,000+ inspections was from a complaint, we lose little if we drop this feature, and we might also want to drop that single inspection.

pd.value_counts(insp['type'])
routine      14221
complaint        1
Name: type, dtype: int64
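A minimal sketch of both steps, working on a copy so the original data frame is preserved:

# Keep only the routine inspections and drop the now-constant type column
insp_routine = insp[insp['type'] == 'routine'].drop(columns=['type'])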

Two questions that might have come to your mind at this point are: 1) How do we look for garbled, anomalous, and inconsistently coded values? And, 2) once we find them, what do we do? We address these questions generally here, and show some of the practicalities in the example in Section 9.6.

How to find bad values and features?

  • Check summary statistics, distributions, and value counts. Chapter 10 provides examples and guidance on how to go about checking the quality of your data using visualizations and summary statistics. We briefly mention a few approaches here. A table of counts of unique values in a feature can uncover unexpected encodings and lopsided distributions, where one option is a rare occurrence. Percentiles can be helpful in revealing the proportion of unusually high (or low) values.

  • Logical expressions can identify records with values out of range or relationships that are out of whack. Simply computing the number of records that do not pass the quality check can quickly reveal the size of the problem (see the sketch after this list).

  • Examine the whole record for those records with problematic values for a particular feature. At times, an entire record is garbled when, for example, a comma is misplaced in a CSV formatted file. Or, the record(s) might represent an unusual situation (such as ranches being included in data on house sales), and you will need to decide whether they should be included in your analysis or not.

  • Refer to an external source to figure out if there’s a reason for the anomaly.
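As an illustration of the second point above, a sketch of logical-expression checks that reuses the column names from the earlier examples (score is an assumed name):

# Out-of-range inspection scores
bad_score = ~insp['score'].between(0, 100)

# Zip codes outside the expected San Francisco prefix
bad_zip = ~bus['postal_code'].str.startswith('941', na=False)

# The counts of failing records reveal the size of each problem
bad_score.sum(), bad_zip.sum()

# Examine the whole record for the problematic rows
bus[bad_zip]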

What to do with your discoveries? You have essentially four options: leave the data as is; modify values; remove features; or drop records. Not every unusual aspect of the data needs to be fixed. You might have discovered a characteristic of your data that will inform your analysis but otherwise does not need any correcting. Or, you might find that the problem is relatively minor and most likely will not impact your analysis, so you can leave the data as is.

On the other hand, you might want to replace corrupted values with NaN. You might have figured out what went wrong and can correct the value. Other possibilities for modifying records are covered in the examples of the Exploratory Data Analysis chapter. If you plan to change the values of a variable, then it’s good practice to create a new feature with the modified value and preserve the original feature, or at a minimum, create a new feature that indicates which values in the original feature have been modified. These approaches give you some flexibility in checking the influence of the modified values on your analysis.
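A sketch of this practice, again assuming an inspection score column: replace out-of-range scores with NaN in a new feature and record which values were changed:

import numpy as np

# Preserve the original score; put the cleaned values and a
# modification indicator in new columns
bad_score = ~insp['score'].between(0, 100)
insp['score_clean'] = insp['score'].where(~bad_score, np.nan)
insp['score_modified'] = bad_score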

If you find yourself modifying many values for a feature, then you might consider eliminating the feature from the dataset. Either way, you will want to study the possible impact of excluding the feature from your analysis. In particular, you will want to determine whether the records with corrupted values are similar to each other and different from the rest of the data; if they are, you may be unable to capture the impact of a potentially useful feature in your analysis. Rather than exclude the feature entirely, there may be a transformation that allows you to keep the feature while reducing the level of detail recorded.
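For example, rather than dropping a postal code feature because of occasional nine-digit entries like the 941033148 we saw earlier, we might keep a coarser five-digit version (a sketch; the new column name is hypothetical):

# Reduce the level of detail: keep only the five-digit zip code
bus['postal_code_5'] = bus['postal_code'].str[:5]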

At times, you will want to eliminate the problematic records. In general, we do not want to drop a large number of observations from a dataset without good reason. We may want to scale back our investigation to a particular subgroup of the data, but that’s a different situation than dropping records because of a corrupted value in a field (see Section xx). When you discover that an unusual value is in fact correct, you still might decide to exclude the record from your analysis because it’s so different from the rest of your data and you do not want it to overly influence your analysis.

9.2.1. Takeaways

In this section, we introduced four types of quality checks: checks on scope, measurements, relationships, and analysis. These checks can reveal issues in the data that need to be addressed before proceeding with analysis. One particularly important type of check is to look for missing data. We suggested that there may be times when you want to replace corrupted data values with NaN, and hence treat them as missing. At other times, data might arrive missing. What to do with missing data is an important topic and there is a lot of research on this problem; we cover ways to address missing data in the next section.