9.6. Example: Wrangling Restaurant Safety Violations

In this section, we demonstrate data wrangling techniques from this chapter on the restaurant safety dataset. The data are stored in three tables: bus (for businesses), insp (for inspections), and viol (for safety violations). The violations table contains detailed descriptions of the violations found during each inspection. We would like to capture some of this information and connect it to the inspection score, which is in the inspection-level table.

We want to figure out: what kinds of safety violations are associated with lower restaurant safety scores?

This example covers several key ideas in data wrangling related to changing structure:

  • aggregation to modify the granularity of a table

  • filtering to focus on a narrower segment of data

  • joining to bring together information across tables

Additionally, an important part of this example demonstrates how we transform text data into numeric quantities for analysis.

To get started, let’s peek at the first few rows of each table to remind ourselves of what they contain.

print('bus:')
display_df(bus.head(2), cols=4)

# Note that we've already parsed timestamps and years for insp and viol 
print('insp:')
display_df(insp.head(2))

print('viol:')
display_df(viol.head(2))

9.6.1. Narrowing the Focus

As a first step, let’s simplify the structure by reducing the data to inspections from one year. (Recall that this dataset contains four years of inspection information.) Below we tally the number of records for each year in the inspections table.

insp['year'].value_counts()
2016    5443
2017    5166
2015    3305
2018     308
Name: year, dtype: int64

Let’s take a subset of the data, restricting our analysis to inspections that took place in 2016. Here, we’ll use the pipe function once again to apply the same transformation to both the inspections and violations data frames.

def subset_2016(df):
    return df.query('year == 2016')

vio2016 = viol.pipe(subset_2016)
ins2016 = insp.pipe(subset_2016)
ins2016
business_id score date type timestamp year
0 19 94 20160513 routine 2016-05-13 2016
3 24 98 20161005 routine 2016-10-05 2016
4 24 96 20160311 routine 2016-03-11 2016
... ... ... ... ... ... ...
13820 90096 91 20161229 routine 2016-12-29 2016
13850 90268 100 20161229 routine 2016-12-29 2016
13852 90269 100 20161229 routine 2016-12-29 2016

5443 rows × 6 columns

Recall that timestamp contains the parsed timestamps. From the previous section on granularity, we also found that business_id and timestamp together uniquely identify the inspections (with a couple of exceptions).
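
As a quick sanity check of that claim (our addition, not part of the original walkthrough), we can count how many records in ins2016 share a business_id and timestamp; a small count reflects the handful of exceptions noted earlier:

# Number of records that share a (business_id, timestamp) pair with another
ins2016.duplicated(subset=['business_id', 'timestamp']).sum()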

By looking at the first few rows of ins2016, we can see that restaurants can receive multiple inspections in a year—business #24 had two inspections in 2016, one in March and another in October.

vio2016
business_id date description timestamp year
2 19 20160513 Unapproved or unmaintained equipment or utensi... 2016-05-13 2016
3 19 20160513 Unclean or degraded floors walls or ceilings ... 2016-05-13 2016
4 19 20160513 Food safety certificate or food handler card n... 2016-05-13 2016
... ... ... ... ... ...
38147 89900 20161206 No hot water or running water [ date violatio... 2016-12-06 2016
38220 90096 20161229 No plan review or Building Permit [ date viol... 2016-12-29 2016
38221 90096 20161229 Unclean or unsanitary food contact surfaces [... 2016-12-29 2016

15624 rows × 5 columns

Notice that the first few records in the vio2016 dataframe are all for the same restaurant.

9.6.2. Aggregating Violations

Do restaurants with more violations have lower safety scores? To check this, we can count the number of violations found in each inspection. To find this count, we group the violations table by business_id and timestamp, then find the size of each group. Essentially, this grouping changes the granularity of the violations table to the inspection level.

num_vios = (vio2016
            .groupby(['business_id', 'timestamp'])
            .size()
            .reset_index()
            .rename(columns={0: 'num_vio'}));
num_vios
business_id timestamp num_vio
0 19 2016-05-13 3
1 24 2016-03-11 2
2 24 2016-10-05 1
... ... ... ...
4803 89790 2016-11-29 1
4804 89900 2016-12-06 1
4805 90096 2016-12-29 2

4806 rows × 3 columns

Now we want to merge num_vios with ins2016. What kind of merge do we want to perform? We want a left join of ins2016 with num_vios because there could be inspections that do not have any violations.
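
To see the difference concretely, here’s a quick comparison we can run (the inner variable is just for illustration): an inner join keeps only inspections with at least one matching violation record, so it returns fewer rows than ins2016, while a left join keeps them all.

# An inner join drops zero-violation inspections; a left join keeps every
# row of ins2016
inner = ins2016.merge(num_vios, on=['business_id', 'timestamp'], how='inner')
len(ins2016), len(inner)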

def left_join_vios(ins):
    return ins.merge(num_vios, on=['business_id', 'timestamp'], how='left')

ins_and_num_vios = ins2016.pipe(left_join_vios)
ins_and_num_vios
business_id score date type timestamp year num_vio
0 19 94 20160513 routine 2016-05-13 2016 3.0
1 24 98 20161005 routine 2016-10-05 2016 1.0
2 24 96 20160311 routine 2016-03-11 2016 2.0
... ... ... ... ... ... ... ...
5440 90096 91 20161229 routine 2016-12-29 2016 2.0
5441 90268 100 20161229 routine 2016-12-29 2016 NaN
5442 90269 100 20161229 routine 2016-12-29 2016 NaN

5443 rows × 7 columns

When there are no violations for an inspection, the feature num_vio will have a missing value (NaN). We can check how many values are missing:

ins_and_num_vios['num_vio'].isnull().sum()
833

So, about 15% of restaurant inspections in 2016 found no safety violations.
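
To get that percentage directly, we can take the mean of the boolean mask, which equals the missing count divided by the number of inspections:

# Proportion of 2016 inspections with no recorded violations
ins_and_num_vios['num_vio'].isnull().mean()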

We’d expect that if a restaurant received a perfect safety score of 100, the restaurant had no violations. So, we can correct these missing values by setting them to 0. This is an example of deductive imputation, since we’re using domain knowledge to fill in missing values.

def zero_vios_for_perfect_scores(df):
    df = df.copy()
    df.loc[df['score'] == 100, 'num_vio'] = 0
    return df

ins_and_num_vios = (ins2016.pipe(left_join_vios)
                    .pipe(zero_vios_for_perfect_scores))
ins_and_num_vios
business_id score date type timestamp year num_vio
0 19 94 20160513 routine 2016-05-13 2016 3.0
1 24 98 20161005 routine 2016-10-05 2016 1.0
2 24 96 20160311 routine 2016-03-11 2016 2.0
... ... ... ... ... ... ... ...
5440 90096 91 20161229 routine 2016-12-29 2016 2.0
5441 90268 100 20161229 routine 2016-12-29 2016 0.0
5442 90269 100 20161229 routine 2016-12-29 2016 0.0

5443 rows × 7 columns

Now, there are only 65 missing violation counts:

ins_and_num_vios['num_vio'].isnull().sum()
65

We have corrected a large number of missing values. With further investigation, we find that some of the businesses have inspection dates in ins2016 that differ from the dates in vio2016 by one or two days. We could change the dates in vio2016 if we wanted to fix additional missing data. We leave this as an exercise for the reader.
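
If you want to attempt it, one possible starting point (a minimal sketch of our own, using pd.merge_asof with a two-day tolerance that we chose arbitrarily) matches each still-missing inspection to the nearest violation tally for the same business:

# Sketch: match inspections that still lack a violation count to the
# nearest violation tally within two days (merge_asof requires both
# frames to be sorted on the key)
still_missing = (ins_and_num_vios[ins_and_num_vios['num_vio'].isnull()]
                 .sort_values('timestamp'))
nearby = pd.merge_asof(
    still_missing.drop(columns='num_vio'),
    num_vios.sort_values('timestamp'),
    on='timestamp',
    by='business_id',
    direction='nearest',
    tolerance=pd.Timedelta(days=2),
)

Rows that find a match within the tolerance could then have their counts filled in; the rest would stay missing.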

Now, let’s examine the relationship between the number of violations and the inspection score.

plt.figure(figsize=(10, 4))
sns.stripplot(x="num_vio", y="score", data=ins_and_num_vios);
[Figure: strip plot of inspection score (y-axis) against number of violations, num_vio (x-axis).]

As we might expect, there is a negative relationship between the inspection score and the number of violations found during the inspection. We can also see variability in score: restaurants with four violations show a spread in scores, and the variability in scores grows with the number of violations. It appears that some violations are more serious than others and have a greater impact on the score.

9.6.3. Extracting Information from Violation Descriptions

How can we use the information in the description of the violation to capture the kind of violation that has occurred? We saw earlier that the feature description in the violations data frame has a lot of text, including information in square brackets about when the violation was corrected. We can tally the descriptions and examine the most common violations.

display_df(vio2016['description'].value_counts().head(20), rows=20)
Unclean or degraded floors walls or ceilings                                 161
Unapproved or unmaintained equipment or utensils                              99
Moderate risk food holding temperature                                        95
Inadequate and inaccessible handwashing facilities                            93
Inadequately cleaned or sanitized food contact surfaces                       92
Improper food storage                                                         81
Wiping cloths not clean or properly stored or inadequate sanitizer            71
Food safety certificate or food handler card not available                    64
Moderate risk vermin infestation                                              58
Foods not protected from contamination                                        56
Unclean nonfood contact surfaces                                              54
Inadequate food safety knowledge or lack of certified food safety manager     52
Permit license or inspection report not posted                                41
Improper storage of equipment utensils or linens                              41
Low risk vermin infestation                                                   34
High risk food holding temperature                                            32
Inadequate warewashing facilities or equipment                                31
Improper or defective plumbing                                                28
Inadequate ventilation or lighting                                            26
Unclean or unsanitary food contact surfaces                                   23
Name: description, dtype: int64

These descriptions are long and wordy. Reading through them, we see that some relate to the cleanliness of facilities, others to food storage, and still others to the cleanliness of the staff.

Since there are many types of violations, we’d like to group violations together into larger categories. One way to do this is to create a simple boolean flag based on what the violation text contains. For example, we can create a feature called is_high_risk that contains True if a description contains the term “high risk”.

We’ll create eight new features for different categories of violations. Don’t worry about the particular details of the code for now—this code uses regular expressions, covered in the later Working with Text chapter. The important idea is that this code creates features containing True or False based on whether the violation description contains specific words.

def make_vio_categories(vio):
    def has(term):
        # Match case-insensitively so that descriptions like
        # "High risk food holding temperature" match the pattern "high risk"
        return vio['description'].str.contains(term, case=False)

    vio = vio.copy()
    vio['is_high_risk'] = has(r"high risk")
    vio['is_clean'] = has(r"clean|sanit")
    vio['is_food_surface'] = (has(r"surface") & has(r"\Wfood"))
    vio['is_vermin'] = has(r"vermin")
    vio['is_storage'] = has(r"thaw|cool|therm|storage")
    vio['is_display_permit'] = has(r"certificate|permit")
    vio['is_non_food_surface'] = has(r"wall|ceiling|floor|surface")
    vio['is_human'] = has(r"hand|glove|hair|nail")
    return vio
# The violations with the new categories
vio2016 = (viol.pipe(subset_2016)
           .pipe(make_vio_categories))
vio2016.head(2)
business_id date description timestamp ... is_storage is_display_permit is_non_food_surface is_human
2 19 20160513 Unapproved or unmaintained equipment or utensi... 2016-05-13 ... False False False False
3 19 20160513 Unclean or degraded floors walls or ceilings ... 2016-05-13 ... False False True False

2 rows × 13 columns

Now that we have these new features in vio2016, we again want to merge this information with the inspections. We roll the violations up to the restaurant/date level once more, tracking whether any value in each group is True; taking the max of a boolean column does exactly this, since the max is True whenever at least one value in the group is True.

feature_names = [
    'is_high_risk', 'is_clean', 'is_food_surface', 'is_vermin', 'is_storage',
    'is_display_permit', 'is_non_food_surface', 'is_human'
]

features = (vio2016
            .groupby(['business_id', 'timestamp'])
            [feature_names]
            .max())
features.head(2)
is_high_risk is_clean is_food_surface is_vermin is_storage is_display_permit is_non_food_surface is_human
business_id timestamp
19 2016-05-13 False True False False False True True True
24 2016-03-11 False True False False False False True False

We’ll once again use a left join to merge these new features into the inspection-level data frame, ins2016. And, for the special case of a score of 100, we set all of the new features to False.

def left_join_features(ins):
    return ins.merge(features, on=['business_id', 'timestamp'], how='left')

def zero_features_for_perfect_scores(ins):
    ins = ins.copy()
    ins.loc[ins['score'] == 100, feature_names] = False
    return ins
ins_and_vios = (ins2016.pipe(left_join_features)
                .pipe(zero_features_for_perfect_scores))
ins_and_vios
business_id score date type ... is_storage is_display_permit is_non_food_surface is_human
0 19 94 20160513 routine ... False True True True
1 24 98 20161005 routine ... False False True False
2 24 96 20160311 routine ... False False True False
... ... ... ... ... ... ... ... ... ...
5440 90096 91 20161229 routine ... False False True False
5441 90268 100 20161229 routine ... False False False False
5442 90269 100 20161229 routine ... False False False False

5443 rows × 14 columns

To see how each violation category relates to the score, we can make a collection of boxplots that compares the score distributions with and without each violation. First, we’ll subset the table so that we have just the scores and the violation categories.

ins_and_vios[['score', *feature_names]]
score is_high_risk is_clean is_food_surface ... is_storage is_display_permit is_non_food_surface is_human
0 94 False True False ... False True True True
1 98 False True False ... False False True False
2 96 False True False ... False False True False
... ... ... ... ... ... ... ... ... ...
5440 91 False True True ... False False True False
5441 100 False False False ... False False False False
5442 100 False False False ... False False False False

5443 rows × 9 columns

Our plotting library expects long-form data, so we first use .melt() to transform this table from wide-form to long-form:

features_long = pd.melt(ins_and_vios[['score', *feature_names]],
                        id_vars='score',
                        var_name='category',
                        value_name='violated')
features_long
score category violated
0 94 is_high_risk False
1 98 is_high_risk False
2 96 is_high_risk False
... ... ... ...
43541 91 is_human False
43542 100 is_human False
43543 100 is_human False

43544 rows × 3 columns

Now, we make our plot:

g = sns.catplot(data=features_long, x='violated', y='score',
                col='category', col_wrap=4, kind='box');
g.set(ylim=(59, 101));
[Figure: boxplots of score for violated = False/True, one panel per violation category, y-axis limited to 59–101.]

What do we observe? Restaurants that have no cleanliness-related violations have the highest scores. Having a display permit violation isn’t much of a detractor. And, high risk violations lower the distribution of scores more than any other violation. These observations are just the beginning of a possible analysis into understanding what a score means.
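
One simple way to back up these visual impressions with numbers (an extra step beyond the original analysis) is to compare the median score with and without each kind of violation:

# Median score for inspections with and without each violation category
(features_long
 .groupby(['category', 'violated'])
 ['score']
 .median()
 .unstack())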

9.6.4. Takeaways

We used several techniques from this chapter to wrangle the restaurant safety data. First, we reshaped the data to subset rows corresponding to a year’s worth of data and to aggregate violation information to inspection-level granularity. We used quality checks, looked for missing values, and filled them in using deductive imputation. We also showed one way that text can be used in a data analysis by extracting features from the violation descriptions. Finally, we modified the dataframe’s structure to make it easier to produce the final plots for this analysis. In the next section, we’ll summarize what we’ve covered in this chapter.