8.5. Table Shape and Granularity

In this section, we introduce granularity: what each row in a table represents.

As described earlier, we refer to a dataset’s structure as a mental representation of the data, and in particular, we represent data that has a table structure by arranging data values in rows and columns. Now that we have investigated the restaurant inspection files, we load them into dataframes and examine their shapes.

import pandas as pd

bus = pd.read_csv('data/businesses.csv', encoding='ISO-8859-1')
insp = pd.read_csv('data/inspections.csv')
viol = pd.read_csv('data/violations.csv')
print(" Businesses shape:", bus.shape)
print("Inspections shape:", insp.shape)
print(" Violations shape:", viol.shape)
 Businesses shape: (6406, 9)
Inspections shape: (14222, 4)
 Violations shape: (39042, 3)

We find that the businesses table has 6406 rows and 9 columns. Now, let’s figure out the granularity of this table by understanding what each row represents. To start, we’ll look at the first two rows.

business_id name address city ... postal_code latitude longitude phone_number
0 19 NRGIZE LIFESTYLE CAFE 1200 VAN NESS AVE, 3RD FLOOR San Francisco ... 94109 37.79 -122.42 +14157763262
1 24 OMNI S.F. HOTEL - 2ND FLOOR PANTRY 500 CALIFORNIA ST, 2ND FLOOR San Francisco ... 94104 37.79 -122.40 +14156779494

2 rows × 9 columns

Simply looking at these two rows, we get the impression that each record represents a single restaurant. But we can’t tell from just two records whether this is the case. The field name business_id implies that it is the unique identifier for the business (restaurant). We can confirm this by checking whether the number of records in bus matches the number of unique values in business_id.

print("Number of records:", len(bus))
print("Number of unique business ids:", len(bus['business_id'].unique()))
Number of records: 6406
Number of unique business ids: 6406
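
Pandas also offers a more direct idiom for this kind of uniqueness check. A minimal sketch, equivalent to comparing the counts above:

# True when no value in the column repeats
bus['business_id'].is_unique
True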

The number of unique business_ids matches the number of rows in the table. In other words, business_id uniquely identifies each record in the DataFrame. We call business_id the primary key for the table. A primary key is the feature that uniquely identifies each row in a table. We use primary keys to join tables, as we did in Chapter 6. Sometimes a primary key consists of two (or more) features; this is the case for the other two restaurant files. Let’s continue examining the inspections and violations dataframes to find their granularity.

8.5.1. Granularity of Restaurant Inspections and Violations

We saw earlier in this chapter that the inspections table has many more rows than the businesses table. Let’s take a closer look at the first few rows of inspections.

business_id score date type
0 19 94 20160513 routine
1 19 94 20171211 routine
2 24 98 20171101 routine
3 24 98 20161005 routine

Like the businesses table, this table also contains a field called business_id, but we see duplicate values of the ID. The two records for business #19 have different date values, which implies that there is one record for each inspection of a restaurant. In other words, the granularity of this table is a restaurant inspection. If this is indeed the case, that would mean that the unique identifier of a row is the combination of business_id and date. That is, the primary key consists of two fields.

To confirm that the two fields form a primary key, we can group insp by the combination of business_id and date, and then find the size of each group. If business_id and date uniquely define each row of the dataframe, then each group should have size 1.

(insp
 .groupby(['business_id', 'date'])
 .size()
 .sort_values(ascending=False)
 .head(5)
)
business_id  date    
64859        20150924    2
87440        20160801    2
77427        20170706    2
19           20160513    1
71416        20171213    1
dtype: int64

The combination of ID and date uniquely identifies each record in the inspections table, with the exception of three restaurants, which have two records for their ID-date combination. For example, restaurant 64859 has two records with an inspection date of 20150924. Let’s examine these rows.

insp.query('business_id == 64859 and date == 20150924')
business_id score date type
7742 64859 96 20150924 routine
7744 64859 91 20150924 routine

Restaurant 64859 got two different inspection scores on 20150924 (Sept. 24, 2015). How could this happen? It may be that the restaurant had two inspections in one day, or it might be an error. We address these sorts of questions when we consider the data quality in Chapter 9. In this case, we’ll assume that a restaurant can have two inspections on the same date. So, the primary key for the inspections table is the combination of restaurant ID and inspection date.
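
If we’d rather compute a single number than scan the sorted output above, we can count the duplicated combinations directly. A minimal sketch reusing the same grouping (the sizes name is ours):

# count the ID-date combinations that appear more than once
sizes = insp.groupby(['business_id', 'date']).size()
(sizes > 1).sum()
3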

Note that the business_id field in the inspections table acts as a reference to the primary key in the business table. Because of this, business_id in insp is called a foreign key: it links each record in the inspections table to a record in the business table. This means that we can readily join these two tables together.
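
For instance, a left join on the foreign key attaches each restaurant’s details to its inspections. A minimal sketch (the insp_bus name is ours):

# match every inspection to its restaurant;
# a left join keeps inspections whose ID has no match in bus
insp_bus = insp.merge(bus, on='business_id', how='left')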

Next, we examine the granularity of the violations table.

business_id date description
0 19 20171211 Inadequate food safety knowledge or lack of ce...
1 19 20171211 Unapproved or unmaintained equipment or utensils
2 19 20160513 Unapproved or unmaintained equipment or utensi...
... ... ... ...
39039 94231 20171214 High risk vermin infestation [ date violation...
39040 94231 20171214 Moderate risk food holding temperature [ dat...
39041 94231 20171214 Wiping cloths not clean or properly stored or ...

39042 rows × 3 columns

Just looking at the first few records in viol, we see that each inspection has multiple entries, so the granularity is at the level of a violation found in an inspection. Reading the descriptions, we see that when a violation has been corrected, the date of the correction is listed within square brackets in the description.

viol.loc[39039, 'description']
'High risk vermin infestation  [ date violation corrected: 12/15/2017 ]'
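
If we wanted that correction date in its own column, we could pull it out with a regular expression. This is only a sketch: the corrected column name is ours, and we assume the bracketed text follows the pattern shown above, allowing for variable spacing:

# capture the date that follows 'date violation corrected:'
viol['corrected'] = viol['description'].str.extract(
    r'\[ *date violation corrected: *([\d/]+) *\]', expand=False)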

In summary, we have found that these three tables have different granularities. If we are interested in studying inspections, we can join the violations and inspections together using the business ID and inspection date. This would let us find the number of violations that happened in each inspection.
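
A minimal sketch of that computation, assuming dates are coded the same way in both tables (the num_viol and insp_viol names are ours):

# count the violations found in each inspection
num_viol = (viol
 .groupby(['business_id', 'date'])
 .size()
 .rename('num_violations')
 .reset_index())

# attach the counts to insp; inspections with no recorded
# violations get NaN, which we recode as 0
insp_viol = insp.merge(num_viol, on=['business_id', 'date'], how='left')
insp_viol['num_violations'] = insp_viol['num_violations'].fillna(0)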

We can also reduce the inspection table by selecting the most recent inspection for each restaurant. This reduced data table essentially has a granularity of restaurant, which may be useful for a restaurant-based analysis. These kinds of actions reshape the data table, transform columns, and create new columns. We’ll cover these operations in Chapter 9.
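
As a preview, here is one possible sketch of that reduction (the latest name is ours). Since date is stored as an integer in YYYYMMDD form, sorting it numerically also sorts it chronologically; ties on the same date are broken arbitrarily:

# keep only each restaurant's most recent inspection
latest = (insp
 .sort_values('date')
 .drop_duplicates(subset='business_id', keep='last'))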

We conclude this section with a look at the shape and granularity of the DAWN survey data.

8.5.2. DAWN Survey Shape and Granularity

As noted earlier in this chapter, the DAWN file has fixed-width formatting, and we need to rely on a codebook to find out where the fields are. For example, a snippet of the codebook in Figure 8.2 tells us that age appears in positions 34 and 35 in the row, and it is categorized into 11 age groups, 1 stands for 5 and under, 2 for 6 to 11, …, and 11 for 65 and older. Also, -8 represents a missing value.

Fig. 8.2 Screenshot of a portion of the DAWN coding for age.

Earlier, we determined the file contains 200 thousand lines and over 280 million characters, so on average there are about 1200 characters per line. This might be why they used a fixed-width file rather than a CSV file; the file would be much larger if there were a comma between every field!

Given the tremendous amount of information on each line, we read just a few variables into a data frame. We can use the pandas.read_fwf method to do this: we specify the exact positions of the fields to extract, names for these fields, and other information about the header and index.

# (start, end) positions of each field, 0-indexed and end-exclusive
colspecs = [(0, 6), (14, 29), (33, 35), (35, 37), (37, 39), (1213, 1214)]
varNames = ["id", "wt", "age", "sex", "race", "type"]
dawn = pd.read_fwf('data/DAWN-Data.txt', colspecs=colspecs,
                   header=None, index_col=0, names=varNames)
wt age sex race type
id
1 0.94 4 1 2 8
2 5.99 11 1 3 4
3 4.72 11 2 2 4
4 4.08 2 1 3 4
5 5.18 6 1 3 8
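
Recall from the codebook that -8 marks a missing value. Before analyzing these fields, we would likely want pandas to treat those codes as missing. A minimal sketch, assuming -8 is used consistently across the coded fields we read in:

import numpy as np

# the codebook uses -8 for missing values; recode them as NaN
dawn = dawn.replace(-8, np.nan)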

We can compare the rows in the table to the number of lines in the file.

dawn.shape
(229211, 5)

The number of rows in the data frame matches the number of lines in the file. That’s good. The granularity of the data frame is a bit complicated due to the survey design. Recall that these data are part of a large scientific study, with a complex sampling scheme. A row represents an emergency room visit, so the granularity is at the emergency room visit level. However, in order to reflect the sampling scheme and be representative of the population, weights are provided. We must apply a weight to each record when we compute summary statistics, build histograms, and fit models.

The wt field contains a value that takes into account the probability of an ER visit like this one appearing in the sample. By “like this one” we mean a visit with similar features, like the visitor age, race, visit location, and time of day. We examine the different values in wt.

dawn['wt'].value_counts()
0.94     1719
84.26    1617
1.72     1435
         ... 
3.33        1
6.20        1
3.31        1
Name: wt, Length: 3500, dtype: int64

Note

What do these weights mean? As a simplified example, suppose you ran a survey and 75% of your respondents reported their sex as female. Since you know from the Census that roughly 50% of the U.S. population is female, you can adjust your survey responses by using a small weight (less than 1) for female responses and a larger weight (greater than 1) for male responses. The DAWN survey uses the same idea, except that they split the groups much more finely.
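
To make the arithmetic in this note concrete, here is a tiny, purely hypothetical example (the numbers are invented for illustration, not taken from DAWN):

# 100 hypothetical respondents: 75 female and 25 male
female_wt = 0.50 / 0.75   # < 1, shrinks the overrepresented group
male_wt = 0.50 / 0.25     # > 1, boosts the underrepresented group

# weighted proportion of female responses
75 * female_wt / (75 * female_wt + 25 * male_wt)
0.5

After reweighting, the female share matches the 50% census figure.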

It is critical to include the survey weights in your analysis to get results that represent the population at large. For example, we can compare the proportion of females among the ER visits computed both with and without the weights.

unweighted = np.average(dawn["sex"] == 2)
weighted = np.average(dawn["sex"] == 2, weights=dawn["wt"])

print(f'Unweighted % female: {unweighted:.1%}')
print(f'  Weighted % female: {weighted:.1%}')
Unweighted % female: 48.0%
  Weighted % female: 52.3%

These figures differ by more than 4 percentage points. The weighted version is a more accurate estimate of the proportion of females among the entire population of drug-related ER visits.

In this section, we discovered the granularity of the restaurant dataset tables and the DAWN data. Sometimes the granularity can be tricky to figure out, as we saw with the inspections data. And we needed to take the sampling weights into account when working with the DAWN data. These examples show that it’s important to take your time and review the data descriptions before proceeding with analysis.