9.6. Example: Wrangling Restaurant Safety Violations#

We wrap up this chapter with an example that demonstrates many data wrangling techniques. Recall from Chapter 8 that the San Francisco restaurant inspection data are stored in three tables: bus (for businesses/restaurants), insp (for inspections), and viol (for safety violations). The violations dataset contains detailed descriptions of violations found during an inspection. We would like to capture some of this information and connect it to the inspection score, which is an inspection-level dataset.

Our goal is to figure out the kinds of safety violations associated with lower restaurant safety scores. This example covers several key ideas in data wrangling related to changing structure:

Filtering to focus on a narrower segment of data
Aggregation to modify the granularity of a table
Joining to bring together information across tables

Additionally, an important part of this example demonstrates how we transform text data into numeric quantities for analysis.

As a first step, let’s simplify the structure by reducing the data to inspections from one year. (Recall that this dataset contains four years of inspection information.) In the following code, we tally the number of records for each year in the inspections table:

pd.value_counts(insp['year'])

year
2016    5443
2017    5166
2015    3305
2018     308
Name: count, dtype: int64

Reducing the data to cover one year of inspections will simplify our analysis. Later, if we want, we can return to carry out an analysis with all four years of data.

9.6.1. Narrowing the Focus#

We restrict our data wrangling to inspections that took place in 2016. Here, we can use the pipe function again in order to apply the same reshaping to both the inspections and violations data frames:

def subset_2016(df):
    return df.query('year == 2016')

vio2016 = viol.pipe(subset_2016)
ins2016 = insp.pipe(subset_2016)

ins2016.head(5)

	business_id	score	date	type	timestamp	year
0	19	94	20160513	routine	2016-05-13	2016
3	24	98	20161005	routine	2016-10-05	2016
4	24	96	20160311	routine	2016-03-11	2016
6	45	78	20160104	routine	2016-01-04	2016
9	45	84	20160614	routine	2016-06-14	2016

In Chapter 8, we found that business_id and timestamp together uniquely identify the inspections (with a couple of exceptions). We also see here that restaurants can receive multiple inspections in a year—business #24 had two inspections in 2016, one in March and another in October.

Next, let’s look at a few records from the violations table:

vio2016.head(5)

	business_id	date	description	timestamp	year
2	19	20160513	Unapproved or unmaintained equipment or utensi...	2016-05-13	2016
3	19	20160513	Unclean or degraded floors walls or ceilings ...	2016-05-13	2016
4	19	20160513	Food safety certificate or food handler card n...	2016-05-13	2016
6	24	20161005	Unclean or degraded floors walls or ceilings ...	2016-10-05	2016
7	24	20160311	Unclean or degraded floors walls or ceilings ...	2016-03-11	2016

Notice that the first few records are for the same restaurant. If we want to bring violation information into the inspections table, we need to address the different granularities of these tables. One approach is to aggregate the violations in some way. We discuss this next.

9.6.2. Aggregating Violations#

One simple aggregation of the violations is to count them and add that count to the inspections data table. To find the number of violations at an inspection, we can group the violations by business_id and timestamp and then find the size of each group. Essentially, this grouping changes the granularity of violations to an inspection level:

num_vios = (vio2016
            .groupby(['business_id', 'timestamp'])
            .size()
            .reset_index()
            .rename(columns={0: 'num_vio'}));
num_vios.head(3)

	business_id	timestamp	num_vio
0	19	2016-05-13	3
1	24	2016-03-11	2
2	24	2016-10-05	1

Now we need to merge this new information with ins2016. Specifically, we want to left-join ins2016 with num_vios because there could be inspections that do not have any violations and we don’t want to lose them:

def left_join_vios(ins):
    return ins.merge(num_vios, on=['business_id', 'timestamp'], how='left')

ins_and_num_vios = ins2016.pipe(left_join_vios)
ins_and_num_vios

	business_id	score	date	type	timestamp	year	num_vio
0	19	94	20160513	routine	2016-05-13	2016	3.0
1	24	98	20161005	routine	2016-10-05	2016	1.0
2	24	96	20160311	routine	2016-03-11	2016	2.0
...	...	...	...	...	...	...	...
5440	90096	91	20161229	routine	2016-12-29	2016	2.0
5441	90268	100	20161229	routine	2016-12-29	2016	NaN
5442	90269	100	20161229	routine	2016-12-29	2016	NaN

5443 rows × 7 columns

When there are no violations at an inspection, the feature num_vio has a missing value (NaN). We can check how many values are missing:

ins_and_num_vios['num_vio'].isnull().sum()

About 15% of restaurant inspections in 2016 had no safety violations recorded. We can correct these missing values by setting them to 0 if the restaurant had a perfect safety score of 100. This is an example of deductive imputation since we’re using domain knowledge to fill in missing values:

def zero_vios_for_perfect_scores(df):
    df = df.copy()
    df.loc[df['score'] == 100, 'num_vio'] = 0
    return df

ins_and_num_vios = (ins2016.pipe(left_join_vios)
                    .pipe(zero_vios_for_perfect_scores))

We can count the number of inspections with missing violation counts again:

ins_and_num_vios['num_vio'].isnull().sum()

We have corrected a large number of missing values. With further investigation, we find that some of the businesses have inspection dates that are close but don’t quite match. We could do a fuzzy match where inspections with dates that are only one or two days apart are matched. But for now, we just leave them as NaN.

Let’s examine the relationship between the number of violations and the inspection score:

fig = px.strip(ins_and_num_vios,
                 x=jitter(ins_and_num_vios['num_vio'], amt=0.5),
                 y='score', 
                 width=400, height=250)

fig.update_traces(marker=dict(opacity=0.5, size=4))

fig.update_layout(xaxis_title="Number of violations",
                  yaxis_title="Inspection score")

fig

../../_images/wrangling_restaurants_26_0.svg

As we might expect, there is a negative relationship between the inspection score and the number of violations. We can also see variability in scores. The variability in scores grows with the number of violations. It appears that some violations are more serious than others and have a greater impact on the score. We extract information about the kinds of violations next.

9.6.3. Extracting Information from Violation Descriptions#

We saw earlier that the feature description in the violations data frame has a lot of text, including information in square brackets about when the violation was corrected. We can tally the descriptions and examine the most common violations:

display_df(vio2016['description'].value_counts().head(15).to_frame(), rows=15)

	description
Unclean or degraded floors walls or ceilings	161
Unapproved or unmaintained equipment or utensils	99
Moderate risk food holding temperature	95
Inadequate and inaccessible handwashing facilities	93
Inadequately cleaned or sanitized food contact surfaces	92
Improper food storage	81
Wiping cloths not clean or properly stored or inadequate sanitizer	71
Food safety certificate or food handler card not available	64
Moderate risk vermin infestation	58
Foods not protected from contamination	56
Unclean nonfood contact surfaces	54
Inadequate food safety knowledge or lack of certified food safety manager	52
Permit license or inspection report not posted	41
Improper storage of equipment utensils or linens	41
Low risk vermin infestation	34

Reading through these wordy descriptions, we see that some are related to the cleanliness of facilities, others to food storage, and still others to cleanliness of the staff.

Since there are many types of violations, we can try to group them together into larger categories. One way to do this is to create a simple boolean flag depending on whether the text contains a special term, like vermin, hand, or high risk.

With this approach, we create eight new features for different categories of violations. Don’t worry about the particular details of the code for now—this code uses regular expressions, covered in Chapter 13. The important idea is that this code creates features containing True or False based on whether the violation description contains specific words:

def make_vio_categories(vio):
    def has(term):
        return vio['description'].str.contains(term)

    return vio[['business_id', 'timestamp']].assign(
        high_risk        = has(r"high risk"),
        clean            = has(r"clean|sanit"),
        food_surface     = (has(r"surface") & has(r"\Wfood")),
        vermin           = has(r"vermin"),
        storage          = has(r"thaw|cool|therm|storage"),
        permit           = has(r"certif|permit"),
        non_food_surface = has(r"wall|ceiling|floor|surface"),
        human            = has(r"hand|glove|hair|nail"),
    )

vio_ctg = vio2016.pipe(make_vio_categories)
vio_ctg

	business_id	timestamp	high_risk	clean	...	storage	permit	non_food_surface	human
2	19	2016-05-13	False	False	...	False	False	False	False
3	19	2016-05-13	False	True	...	False	False	True	False
4	19	2016-05-13	False	False	...	False	True	False	True
...	...	...	...	...	...	...	...	...	...
38147	89900	2016-12-06	False	False	...	False	False	False	False
38220	90096	2016-12-29	False	False	...	False	False	False	False
38221	90096	2016-12-29	False	True	...	False	False	True	False

15624 rows × 10 columns

Now that we have these new features in vio_ctg, we can find out whether certain violation categories are more impactful than others. For example, are restaurant scores impacted more for vermin-related violations than permit-related violations?

To do this, we want to first count up the violations per business. Then we can merge this information with the inspection information. First, let’s sum the number of violations for each business:

vio_counts = vio_ctg.groupby(['business_id', 'timestamp']).sum().reset_index()
vio_counts

	business_id	timestamp	high_risk	clean	...	storage	permit	non_food_surface	human
0	19	2016-05-13	0	1	...	0	1	1	1
1	24	2016-03-11	0	2	...	0	0	2	0
2	24	2016-10-05	0	1	...	0	0	1	0
...	...	...	...	...	...	...	...	...	...
4803	89790	2016-11-29	0	0	...	0	0	0	1
4804	89900	2016-12-06	0	0	...	0	0	0	0
4805	90096	2016-12-29	0	1	...	0	0	1	0

4806 rows × 10 columns

Once again, we use a left join to merge these new features into the inspection-level data frame. And for the special case of a score of 100, we set all of the new features to 0:

feature_names = ['high_risk', 'clean', 'food_surface', 'vermin',
                 'storage', 'permit', 'non_food_surface', 'human']
def left_join_features(ins):
    return (ins[['business_id', 'timestamp', 'score']]
            .merge(vio_counts, on=['business_id', 'timestamp'], how='left'))

def zero_features_for_perfect_scores(ins):
    ins = ins.copy()
    ins.loc[ins['score'] == 100, feature_names] = 0
    return ins

ins_and_vios = (ins2016.pipe(left_join_features)
                .pipe(zero_features_for_perfect_scores))
ins_and_vios.head(3)

	business_id	timestamp	score	...	permit	non_food_surface	human
0	19	2016-05-13	94	...	1.0	1.0	1.0
1	24	2016-10-05	98	...	0.0	1.0	0.0
2	24	2016-03-11	96	...	0.0	2.0	0.0

3 rows × 11 columns

To see how each violation category relates to the score, we can make a collection of box plots that compare the score distributions with and without each violation. Since our focus here is on the data’s patterns, not the visualization code, we hide the code here:

features = ins_and_vios.melt(id_vars=['business_id', 'timestamp', 'score'],
                             var_name='violation', value_name='num_vios')
features['vio'] = features['num_vios'] > 0
any_vio = {False:"No", True:"Yes"}
features['vio'] = features['vio'].map(any_vio)

g = sns.catplot(data=features, x='vio', y='score',
            col='violation', col_wrap=4, kind='box', height=2)

g.set_axis_labels( "" , "Inspection score");

../../_images/wrangling_restaurants_43_0.svg

Learning Data Science

Example: Wrangling Restaurant Safety Violations

Contents

9.6. Example: Wrangling Restaurant Safety Violations#

9.6.1. Narrowing the Focus#

9.6.2. Aggregating Violations#

9.6.3. Extracting Information from Violation Descriptions#