8.5. Table Shape and Granularity
In this section, we introduce granularity: what each row in the table represents.
As described earlier, we refer to a dataset’s structure as a mental representation of the data, and in particular, we represent data that has a table structure by arranging data values in rows and columns. Now that we have investigated the restaurant inspection files, we load them into dataframes and examine their shapes.
import pandas as pd
import numpy as np

bus = pd.read_csv('data/businesses.csv', encoding='ISO-8859-1')
insp = pd.read_csv('data/inspections.csv')
viol = pd.read_csv('data/violations.csv')
print(" Businesses shape:", bus.shape)
print("Inspections shape:", insp.shape)
print(" Violations shape:", viol.shape)
Businesses shape: (6406, 9)
Inspections shape: (14222, 4)
Violations shape: (39042, 3)
We find that the businesses table has 6406 rows and 9 columns. Now, let’s figure out the granularity of this table by understanding what each row represents. To start, we’ll look at the first two rows.
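The exact display call isn't shown here, but a call along these lines produces the view below:

bus.head(2)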
|   | business_id | name | address | city | ... | postal_code | latitude | longitude | phone_number |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | NRGIZE LIFESTYLE CAFE | 1200 VAN NESS AVE, 3RD FLOOR | San Francisco | ... | 94109 | 37.79 | -122.42 | +14157763262 |
| 1 | 24 | OMNI S.F. HOTEL - 2ND FLOOR PANTRY | 500 CALIFORNIA ST, 2ND FLOOR | San Francisco | ... | 94104 | 37.79 | -122.40 | +14156779494 |

2 rows × 9 columns
Simply looking at these two rows, we get the impression that each record represents a single restaurant. But we can't tell from just two records whether this is the case. The field name `business_id` implies that it is the unique identifier for the business (restaurant). We can confirm this by checking whether the number of records in `bus` matches the number of unique values in `business_id`.
print("Number of records:", len(bus))
print("Number of unique business ids:", len(bus['business_id'].unique()))
Number of records: 6406
Number of unique business ids: 6406
The number of unique `business_id`s matches the number of rows in the table. In other words, `business_id` uniquely identifies each record in the DataFrame. We call `business_id` the primary key for the table. A primary key is the feature that uniquely identifies each row in a table. We use primary keys to join tables, as in Chapter 6. Sometimes a primary key consists of two (or more) features. This is the case for the other two restaurant files.
Let’s continue the examination of the inspections and violations data frames
and find their granularity.
8.5.1. Granularity of Restaurant Inspections and Violations
We saw earlier in this chapter that the inspections table has many more rows than the businesses table. Let's take a closer look at the first few rows of inspections.
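Again, the display call isn't shown in the original, but something like this produces the rows below:

insp.head(4)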
|   | business_id | score | date | type |
|---|---|---|---|---|
| 0 | 19 | 94 | 20160513 | routine |
| 1 | 19 | 94 | 20171211 | routine |
| 2 | 24 | 98 | 20171101 | routine |
| 3 | 24 | 98 | 20161005 | routine |
Like the businesses table, this table also contains a field called `business_id`, but we see duplicate values of the ID. The two records for business #19 have different date values, which implies that there is one record for each inspection of a restaurant. In other words, the granularity of this table is a restaurant inspection. If this is indeed the case, that would mean the unique identifier of a row is the combination of `business_id` and `date`. That is, the primary key consists of two fields.
To confirm that the two fields form a primary key, we can group `insp` by the combination of `business_id` and `date` and then find the size of each group. If `business_id` and `date` uniquely define each row of the dataframe, then each group should have size 1.
(insp
.groupby(['business_id', 'date'])
.size()
.sort_values(ascending=False)
.head(5)
)
business_id date
64859 20150924 2
87440 20160801 2
77427 20170706 2
19 20160513 1
71416 20171213 1
dtype: int64
The combination of ID and date uniquely identifies each record in the inspections table, with the exception of three restaurants, which have two records for their ID-date combination. For example, restaurant `64859` has two records with an inspection date of `20150924`. Let's examine these rows.
insp.query('business_id == 64859 and date == 20150924')
|      | business_id | score | date | type |
|---|---|---|---|---|
| 7742 | 64859 | 96 | 20150924 | routine |
| 7744 | 64859 | 91 | 20150924 | routine |
Restaurant `64859` got two different inspection scores on `20150924` (September 24, 2015). How could this happen? It may be that the restaurant had two inspections in one day, or it might be an error. We address these sorts of questions when we consider data quality in Chapter 9. In this case, we'll assume that a restaurant can have two inspections on the same date. So, the primary key for the inspections table is the combination of restaurant ID and inspection date.
Note that the `business_id` field in the inspections table acts as a reference to the primary key in the businesses table. Because of this, `business_id` in `insp` is called a foreign key: it links each record in the inspections table to a record in the businesses table. This means that we can readily join these two tables together.
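For instance, a left join of `insp` with `bus` on `business_id` (a sketch; this join isn't shown in the original) attaches each inspection to its restaurant's name, address, and other business fields:

# Sketch: join inspections to businesses on the foreign key.
# how='left' keeps every inspection, even if a business record were missing.
insp_bus = insp.merge(bus, on='business_id', how='left')
insp_bus.shape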
Next, we examine the granularity of the violations table.
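Displaying the dataframe (with a call like the one below, not shown in the original) reveals its first and last few rows:

viol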
|       | business_id | date | description |
|---|---|---|---|
| 0 | 19 | 20171211 | Inadequate food safety knowledge or lack of ce... |
| 1 | 19 | 20171211 | Unapproved or unmaintained equipment or utensils |
| 2 | 19 | 20160513 | Unapproved or unmaintained equipment or utensi... |
| ... | ... | ... | ... |
| 39039 | 94231 | 20171214 | High risk vermin infestation [ date violation... |
| 39040 | 94231 | 20171214 | Moderate risk food holding temperature [ dat... |
| 39041 | 94231 | 20171214 | Wiping cloths not clean or properly stored or ... |

39042 rows × 3 columns
Just looking at the first few records in `viol`, we see that each inspection has multiple entries. The granularity is at the level of a violation found in an inspection. Reading the descriptions, we also see that when a violation has been corrected, the correction date is listed within square brackets at the end of the description.
viol.loc[39039, 'description']
'High risk vermin infestation [ date violation corrected: 12/15/2017 ]'
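As a rough sketch (the regular expression is our own guess at the format, not code from the text), these correction dates could be pulled into their own column with a string extraction:

# Sketch: extract the bracketed correction date, where present, into a new column.
viol['description'].str.extract(r'\[ date violation corrected:\s*([\d/]+)\s*\]')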
In summary, we have found that these three tables have different granularities. If we are interested in studying inspections, we can join the violations and inspections together using the business ID and inspection date. This would let us find the number of violations that happened in each inspection.
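A sketch of that computation (the column name `num_viol` and the code are ours, not from the text):

# Sketch: count violations per inspection, then attach the counts to each
# inspection. Inspections with no recorded violations get NaN counts.
num_viol = (viol
    .groupby(['business_id', 'date'])
    .size()
    .rename('num_viol')
    .reset_index()
)
insp_viol = insp.merge(num_viol, on=['business_id', 'date'], how='left')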
We can also reduce the inspection table by selecting the most recent inspection for each restaurant. This reduced data table essentially has a granularity of restaurant, which may be useful for a restaurant-based analysis. These kinds of actions reshape the data table, transform columns, and create new columns. We’ll cover these operations in Chapter 9.
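One way to carry out that reduction (a sketch, relying on the fact that `date` is stored as an integer in YYYYMMDD form, so larger values are more recent):

# Sketch: keep only the most recent inspection for each restaurant.
latest_insp = (insp
    .sort_values('date', ascending=False)
    .drop_duplicates(subset='business_id')
)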
We conclude this section with a look at the shape and granularity of the DAWN survey data.
8.5.2. DAWN Survey Shape and Granularity
As noted earlier in this chapter, the DAWN file has fixed-width formatting, and we need to rely on a codebook to find out where the fields are. For example, the snippet of the codebook in Figure 8.2 tells us that age appears in positions 34 and 35 of the row and is coded into 11 age groups: 1 stands for 5 and under, 2 for 6 to 11, and so on, up to 11 for 65 and older. Also, -8 represents a missing value.

Fig. 8.2 Screenshot of a portion of the DAWN coding for age.¶
Earlier, we determined that the file contains about 200 thousand lines and over 280 million characters, so, on average, there are about 1,200 characters per line. This might be why they used a fixed-width file rather than a CSV file; the file would be much larger if there were a comma between every field!
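Those counts can be found with a quick pass over the file (a sketch; the path matches the one used below, and reading line by line avoids holding 280 million characters in memory at once):

# Sketch: count the lines and characters in the fixed-width file.
n_lines, n_chars = 0, 0
with open('data/DAWN-Data.txt') as f:
    for line in f:
        n_lines += 1
        n_chars += len(line)
print(n_lines, n_chars)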
Given the tremendous amount of information on each line, we read just a few variables into a data frame. We can use the `pandas.read_fwf` function to do this. We specify the exact positions of the fields to extract, provide names for these fields, and supply other information about the header and index.
colspecs = [(0, 6), (14, 29), (33, 35), (35, 37), (37, 39), (1213, 1214)]
varNames = ['id', 'wt', 'age', 'sex', 'race', 'type']
dawn = pd.read_fwf('data/DAWN-Data.txt', colspecs=colspecs,
                   header=None, index_col=0, names=varNames)
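The first few rows can be displayed with a call like the following (our addition; the original display call isn't shown):

dawn.head()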
| id | wt | age | sex | race | type |
|---|---|---|---|---|---|
| 1 | 0.94 | 4 | 1 | 2 | 8 |
| 2 | 5.99 | 11 | 1 | 3 | 4 |
| 3 | 4.72 | 11 | 2 | 2 | 4 |
| 4 | 4.08 | 2 | 1 | 3 | 4 |
| 5 | 5.18 | 6 | 1 | 3 | 8 |
We can compare the rows in the table to the number of lines in the file.
dawn.shape
(229211, 5)
The number of rows in the data frame matches the number of lines in the file. That’s good. The granularity of the data frame is a bit complicated due to the survey design. Recall that these data are part of a large scientific study, with a complex sampling scheme. A row represents an emergency room visit, so the granularity is at the emergency room visit level. However, in order to reflect the sampling scheme and be representative of the population, weights are provided. We must apply a weight to each record when we compute summary statistics, build histograms, and fit models.
The `wt` field contains a value that takes into account the probability of an ER visit like this one appearing in the sample. By “like this one” we mean a visit with similar features, such as the visitor's age, race, visit location, and time of day. We examine the different values in `wt`.
dawn['wt'].value_counts()
0.94 1719
84.26 1617
1.72 1435
...
3.33 1
6.20 1
3.31 1
Name: wt, Length: 3500, dtype: int64
Note
What do these weights mean? As a simplified example, suppose you ran a survey and 75% of your respondents reported their sex as female. Since you know from the Census that roughly 50% of the U.S. population is female, you can adjust your survey responses by using a small weight (less than 1) for female responses and a larger weight (greater than 1) for male responses. The DAWN survey uses the same idea, except that they split the groups much more finely.
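To make the arithmetic concrete, here is a sketch of that simplified example (the percentages come from the note; the code is ours). Each group's weight is its population share divided by its share among respondents:

# Sketch: weights for the note's simplified two-group example.
sample_share = {'female': 0.75, 'male': 0.25}      # shares among respondents
population_share = {'female': 0.50, 'male': 0.50}  # shares in the population
weights = {g: population_share[g] / sample_share[g] for g in sample_share}
print(weights)  # females get weight ~0.67, males get weight 2.0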
It is critical to include the survey weights in your analysis to obtain estimates that represent the population at large. For example, we can compare the calculation of the proportion of females among the ER visits with and without the weights.
unweighted = np.average(dawn["sex"] == 2)
weighted = np.average(dawn["sex"] == 2, weights=dawn["wt"])
print(f'Unweighted % female: {unweighted:.1%}')
print(f' Weighted % female: {weighted:.1%}')
Unweighted % female: 48.0%
Weighted % female: 52.3%
These figures differ by more than 4 percentage points. The weighted version is a more accurate estimate of the proportion of females among the entire population of drug-related ER visits.
In this section, we discovered the granularity of the restaurant dataset tables and the DAWN data. Sometimes the granularity can be tricky to figure out, as we saw with the inspections data. And we needed to take the sampling weights into account when working with the DAWN data. These examples show it's important to take your time and review the data descriptions before proceeding with analysis.