import pandas as pd
from ipywidgets import interact

def df_interact(df):
    '''
    Outputs a slider to pan through the rows of df
    '''
    def peek(row=0):
        # Display a window of 5 rows starting at the slider position
        return df[row:row + 5]
    interact(peek, row=(0, len(df), 5))
    print('({} rows, {} columns) total'.format(df.shape[0], df.shape[1]))
11.2. Cleaning The Stops Dataset
The Stops dataset (webpage) records police stops of pedestrians and vehicles. Let’s prepare it for further analysis.
We can use the head command to display the first few lines of the file.
!head data/stops.json
{
  "meta" : {
    "view" : {
      "id" : "6e9j-pj9p",
      "name" : "Berkeley PD - Stop Data",
      "attribution" : "Berkeley Police Department",
      "averageRating" : 0,
      "category" : "Public Safety",
      "createdAt" : 1444171604,
      "description" : "This data was extracted from the Department’s Public Safety Server and covers the data beginning January 26, 2015. On January 26, 2015 the department began collecting data pursuant to General Order B-4 (issued December 31, 2014). Under that order, officers were required to provide certain data after making all vehicle detentions (including bicycles) and pedestrian detentions (up to five persons). This data set lists stops by police in the categories of traffic, suspicious vehicle, pedestrian and bicycle stops. Incident number, date and time, location and disposition codes are also listed in this data.\r\n\r\nAddress data has been changed from a specific address, where applicable, and listed as the block where the incident occurred. Disposition codes were entered by officers who made the stop. These codes included the person(s) race, gender, age (range), reason for the stop, enforcement action taken, and whether or not a search was conducted.\r\n\r\nThe officers of the Berkeley Police Department are prohibited from biased based policing, which is defined as any police-initiated action that relies on the race, ethnicity, or national origin rather than the behavior of an individual or information that leads the police to a particular individual who has been identified as being engaged in criminal activity.",
The stops.json file is clearly not a CSV file. In this case, the file contains data in the JSON (JavaScript Object Notation) format, a commonly used format that records data as nested dictionaries. Python’s json module makes reading this file in as a dictionary simple.
import json
# Note that this could cause our computer to run out of memory if the file
# is large. In this case, we've verified that the file is small enough to
# read in beforehand.
with open('data/stops.json') as f:
    stops_dict = json.load(f)
stops_dict.keys()
dict_keys(['meta', 'data'])
Note that stops_dict is a Python dictionary, so displaying it would render the entire dataset in the notebook. This could cause the browser to crash, so we only display the keys of the dictionary above. To peek at the data without potentially crashing the browser, we can format the dictionary as a string and output only its first few characters.
from pprint import pformat
def print_dict(dictionary, num_chars=1000):
    # Print the first num_chars characters of the pretty-printed dict
    print(pformat(dictionary)[:num_chars])
print_dict(stops_dict['meta'])
{'view': {'attribution': 'Berkeley Police Department',
          'averageRating': 0,
          'category': 'Public Safety',
          'columns': [{'dataTypeName': 'meta_data',
                       'fieldName': ':sid',
                       'flags': ['hidden'],
                       'format': {},
                       'id': -1,
                       'name': 'sid',
                       'position': 0,
                       'renderTypeName': 'meta_data'},
                      {'dataTypeName': 'meta_data',
                       'fieldName': ':id',
                       'flags': ['hidden'],
                       'format': {},
                       'id': -1,
                       'name': 'id',
                       'position': 0,
                       'renderTypeName': 'meta_data'},
                      {'dataTypeName': 'meta_data',
                       'fieldName': ':position',
                       'flags': ['hidden'],
                       'format': {},
print_dict(stops_dict['data'], num_chars=300)
[[1,
  '29A1B912-A0A9-4431-ADC9-FB375809C32E',
  1,
  1444146408,
  '932858',
  1444146408,
  '932858',
  None,
  '2015-00004825',
  '2015-01-26T00:10:00',
  'SAN PABLO AVE / MARIN AVE',
  'T',
  'M',
  None,
  None],
 [2,
  '1644D161-1113-4C4F-BB2E-BF780E7AE73E',
  2,
  1444146408,
  '932858',
  14
We can likely deduce that the 'meta' key in the dictionary contains a description of the data and its columns, and the 'data' key contains a list of data rows. We can use this information to initialize a DataFrame.
# Load the data from JSON and assign column titles
stops = pd.DataFrame(
    stops_dict['data'],
    columns=[c['name'] for c in stops_dict['meta']['view']['columns']])
stops
|  | sid | id | position | created_at | ... | Incident Type | Dispositions | Location - Latitude | Location - Longitude |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 29A1B912-A0A9-4431-ADC9-FB375809C32E | 1 | 1444146408 | ... | T | M | None | None |
| 1 | 2 | 1644D161-1113-4C4F-BB2E-BF780E7AE73E | 2 | 1444146408 | ... | T | M | None | None |
| 2 | 3 | 5338ABAB-1C96-488D-B55F-6A47AC505872 | 3 | 1444146408 | ... | T | M | None | None |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29205 | 31079 | C2B606ED-7872-4B0B-BC9B-4EF45149F34B | 31079 | 1496269085 | ... | T | BM2TWN; | None | None |
| 29206 | 31080 | 8FADF18D-7FE9-441D-8709-7BFEABDACA7A | 31080 | 1496269085 | ... | T | HM4TCS; | 37.8698757000001 | -122.286550846 |
| 29207 | 31081 | F60BD2A4-8C47-4BE7-B1C6-4934BE9DF838 | 31081 | 1496269085 | ... | 1194 | AR; | 37.867207539 | -122.256529377 |
29208 rows × 15 columns
# Prints column names
stops.columns
Index(['sid', 'id', 'position', 'created_at', 'created_meta', 'updated_at',
       'updated_meta', 'meta', 'Incident Number', 'Call Date/Time', 'Location',
       'Incident Type', 'Dispositions', 'Location - Latitude',
       'Location - Longitude'],
      dtype='object')
The website contains documentation about the following columns:
| Column | Description | Type |
|---|---|---|
| Incident Number | Number of incident created by Computer Aided Dispatch (CAD) program | Plain Text |
| Call Date/Time | Date and time of the incident/stop | Date & Time |
| Location | General location of the incident/stop | Plain Text |
| Incident Type | This is the occurred incident type created in the CAD program. A code signifies a traffic stop (T), suspicious vehicle stop (1196), pedestrian stop (1194) and bicycle stop (1194B). | Plain Text |
| Dispositions | Ordered in the following sequence: 1st Character = Race, as follows: A (Asian) B (Black) H (Hispanic) O (Other) W (White) 2nd Character = Gender, as follows: F (Female) M (Male) 3rd Character = Age Range, as follows: 1 (Less than 18) 2 (18-29) 3 (30-39), 4 (Greater than 40) 4th Character = Reason, as follows: I (Investigation) T (Traffic) R (Reasonable Suspicion) K (Probation/Parole) W (Wanted) 5th Character = Enforcement, as follows: A (Arrest) C (Citation) O (Other) W (Warning) 6th Character = Car Search, as follows: S (Search) N (No Search) Additional dispositions may also appear. They are: P - Primary case report M - MDT narrative only AR - Arrest report only (no case report submitted) IN - Incident report FC - Field Card CO - Collision investigation report MH - Emergency Psychiatric Evaluation TOW - Impounded vehicle 0 or 00000 – Officer made a stop of more than five persons | Plain Text |
| Location - Latitude | General latitude of the call. This data is only uploaded after January 2017. | Number |
| Location - Longitude | General longitude of the call. This data is only uploaded after January 2017. | Number |
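To make this encoding concrete, here is a minimal sketch of a decoder for the six-character disposition codes described above. The parse_disposition helper and its lookup tables are our own illustration built from the table, not part of the dataset or its tooling.

# Hypothetical helper that decodes a six-character disposition code
# (e.g. 'HM4TCS') using the character mappings documented above
RACE = {'A': 'Asian', 'B': 'Black', 'H': 'Hispanic', 'O': 'Other', 'W': 'White'}
GENDER = {'F': 'Female', 'M': 'Male'}
AGE = {'1': 'Less than 18', '2': '18-29', '3': '30-39', '4': 'Greater than 40'}
REASON = {'I': 'Investigation', 'T': 'Traffic', 'R': 'Reasonable Suspicion',
          'K': 'Probation/Parole', 'W': 'Wanted'}
ENFORCEMENT = {'A': 'Arrest', 'C': 'Citation', 'O': 'Other', 'W': 'Warning'}
SEARCH = {'S': 'Search', 'N': 'No Search'}

def parse_disposition(code):
    '''Decodes a disposition like 'HM4TCS' into its six named fields.'''
    names = ['race', 'gender', 'age', 'reason', 'enforcement', 'search']
    tables = [RACE, GENDER, AGE, REASON, ENFORCEMENT, SEARCH]
    # zip stops at the shortest sequence, so shorter codes decode partially
    return {name: table.get(char)
            for name, table, char in zip(names, tables, code)}

parse_disposition('HM4TCS')
{'race': 'Hispanic', 'gender': 'Male', 'age': 'Greater than 40', 'reason': 'Traffic', 'enforcement': 'Citation', 'search': 'Search'}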
Notice that the website doesn’t contain descriptions for the first 8 columns of the stops table. Since these columns appear to contain metadata that we’re not interested in analyzing this time, we drop them from the table.
columns_to_drop = ['sid', 'id', 'position', 'created_at', 'created_meta',
                   'updated_at', 'updated_meta', 'meta']

# This function takes in a DF and returns a DF so we can use it for .pipe
def drop_unneeded_cols(stops):
    return stops.drop(columns=columns_to_drop)
stops.pipe(drop_unneeded_cols)
|  | Incident Number | Call Date/Time | Location | Incident Type | Dispositions | Location - Latitude | Location - Longitude |
|---|---|---|---|---|---|---|---|
| 0 | 2015-00004825 | 2015-01-26T00:10:00 | SAN PABLO AVE / MARIN AVE | T | M | None | None |
| 1 | 2015-00004829 | 2015-01-26T00:50:00 | SAN PABLO AVE / CHANNING WAY | T | M | None | None |
| 2 | 2015-00004831 | 2015-01-26T01:03:00 | UNIVERSITY AVE / NINTH ST | T | M | None | None |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 29205 | 2017-00024245 | 2017-04-30T22:59:26 | UNIVERSITY AVE/6TH ST | T | BM2TWN; | None | None |
| 29206 | 2017-00024250 | 2017-04-30T23:19:27 | UNIVERSITY AVE / WEST ST | T | HM4TCS; | 37.8698757000001 | -122.286550846 |
| 29207 | 2017-00024254 | 2017-04-30T23:38:34 | CHANNING WAY / BOWDITCH ST | 1194 | AR; | 37.867207539 | -122.256529377 |
29208 rows × 7 columns
As with the Calls dataset, we will answer the following three questions about the Stops dataset:

1. Are there missing values in the dataset?
2. Are there any missing values that were filled in (e.g. a 999 for an unknown age or 12:00am for an unknown date)?
3. Which parts of the data were entered by a human?
11.2.1. Are there missing values?
We can clearly see that there are many missing latitudes and longitudes. The data description states that these two columns are only filled in after January 2017.
# True if row contains at least one null value
null_rows = stops.isnull().any(axis=1)
stops[null_rows]
|  | sid | id | position | created_at | ... | Incident Type | Dispositions | Location - Latitude | Location - Longitude |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 29A1B912-A0A9-4431-ADC9-FB375809C32E | 1 | 1444146408 | ... | T | M | None | None |
| 1 | 2 | 1644D161-1113-4C4F-BB2E-BF780E7AE73E | 2 | 1444146408 | ... | T | M | None | None |
| 2 | 3 | 5338ABAB-1C96-488D-B55F-6A47AC505872 | 3 | 1444146408 | ... | T | M | None | None |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29205 | 31079 | C2B606ED-7872-4B0B-BC9B-4EF45149F34B | 31079 | 1496269085 | ... | T | BM2TWN; | None | None |
| 29206 | 31080 | 8FADF18D-7FE9-441D-8709-7BFEABDACA7A | 31080 | 1496269085 | ... | T | HM4TCS; | 37.8698757000001 | -122.286550846 |
| 29207 | 31081 | F60BD2A4-8C47-4BE7-B1C6-4934BE9DF838 | 31081 | 1496269085 | ... | 1194 | AR; | 37.867207539 | -122.256529377 |
29208 rows × 15 columns
We can check the other columns for missing values:
# True if row contains at least one null value, ignoring
# the latitude and longitude columns (the last two)
null_rows = stops.iloc[:, :-2].isnull().any(axis=1)
df_interact(stops[null_rows])
(63 rows, 15 columns) total
By browsing through the table above, we can see that all the other missing values are in the Dispositions column. Unfortunately, the data description doesn’t explain why these Dispositions might be missing. Since there are only 63 missing values out of the 29,208 rows in the table, we can proceed with the analysis while staying mindful that these missing values could affect our results.
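As a quick check (our own sketch, not part of the original analysis), we can count the nulls in each column to confirm that the 63 missing values all fall in the Dispositions column:

# Count missing values per column, excluding the sparsely
# populated latitude and longitude columns
stops.iloc[:, :-2].isnull().sum()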
11.2.1.1. Are there any missing values that were filled in?
It doesn’t seem like any previously missing values were filled in for us. Unlike in the Calls dataset where the date and time were in separate columns, the Call Date/Time column in the Stops dataset contains both date and time.
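If we wanted to double-check, one spot-check (a sketch of our own, not from the data description) is to parse the Call Date/Time column and look at the distribution of hours; a disproportionate spike at midnight would hint at placeholder times:

# Parse the ISO-formatted Call Date/Time strings into datetimes
call_times = pd.to_datetime(stops['Call Date/Time'])

# An unusually large count at hour 0 would suggest filled-in times
call_times.dt.hour.value_counts().sort_index()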
11.2.2. Which parts of the data were entered by a human?
As with the Calls dataset, it looks like most of the columns in this dataset were recorded by a machine or were a category selected by a human (e.g. Incident Type).
However, the Location column doesn’t have consistently entered values. Sure enough, we spot some inconsistencies in the data:
stops['Location'].value_counts()
2200 BLOCK SHATTUCK AVE            229
37.8693028530001~-122.272234021    213
UNIVERSITY AVE / SAN PABLO AVE     202
                                  ...
MURRAY ST / SEVENTH ST               1
ACTON/LINCOLN                        1
HASTE ST / TELEGRAPH AVE             1
Name: Location, Length: 6393, dtype: int64
What a mess! It looks like sometimes an address was entered, sometimes a cross-street, and other times a latitude-longitude pair. Unfortunately, we don’t have very complete latitude-longitude data to use in place of this column. We may have to manually clean this column if we want to use locations for future analysis.
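To get a rough sense of how often each format appears, we can classify each Location string by its pattern. The location_kind helper and its regular expressions below are our own rough sketch; the categories are assumptions based on the values above:

import re

def location_kind(loc):
    '''Roughly classifies a Location string by its format.'''
    if re.match(r'^-?\d+\.\d+~-?\d+\.\d+$', loc):
        return 'latitude~longitude pair'
    if '/' in loc:
        return 'cross-street'
    if re.match(r'^\d+ BLOCK ', loc):
        return 'block address'
    return 'other'

stops['Location'].dropna().apply(location_kind).value_counts()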
We can also check the Dispositions column:
dispositions = stops['Dispositions'].value_counts()
# Outputs a slider to pan through the unique Dispositions in
# order of how often they appear
interact(lambda row=0: dispositions.iloc[row:row + 7],
         row=(0, len(dispositions), 7))
<function __main__.<lambda>(row=0)>
The Dispositions column also contains inconsistencies. For example, some dispositions start with a space, some end with a semicolon, and some contain multiple entries. The variety of values suggests that this field contains human-entered values and should be treated with caution.
# Strange values...
dispositions.iloc[[0, 20, 30, 266, 1027]]
M               1683
M;               238
 M               176
BM4ICN            14
WM4TWN; M;         1
Name: Dispositions, dtype: int64
In addition, the most common disposition is M, which isn’t a permitted first character according to the data description. This could mean that the format of the column changed over time or that officers are allowed to enter the disposition without matching the format in the data description. In either case, the column will be challenging to work with.
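One way to probe whether the format changed over time (again, a sketch of our own) is to see when the bare M dispositions occur:

# Count the bare 'M' dispositions by year; a concentration in early
# years would suggest the recording format changed over time
is_m = stops['Dispositions'].str.strip() == 'M'
pd.to_datetime(stops['Call Date/Time']).dt.year[is_m].value_counts()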
We can take some simple steps to clean the Dispositions column by removing leading and trailing whitespace, removing trailing semicolons, and replacing the remaining semicolons with commas.
def clean_dispositions(stops):
    # Returns a DF so we can chain this function with .pipe
    cleaned = (stops['Dispositions']
               .str.strip()
               .str.rstrip(';')
               .str.replace(';', ','))
    return stops.assign(Dispositions=cleaned)
As before, we can now pipe the stops DF through the cleaning functions we’ve defined:
stops_final = (stops
               .pipe(drop_unneeded_cols)
               .pipe(clean_dispositions))
stops_final
|  | Incident Number | Call Date/Time | Location | Incident Type | Dispositions | Location - Latitude | Location - Longitude |
|---|---|---|---|---|---|---|---|
| 0 | 2015-00004825 | 2015-01-26T00:10:00 | SAN PABLO AVE / MARIN AVE | T | M | None | None |
| 1 | 2015-00004829 | 2015-01-26T00:50:00 | SAN PABLO AVE / CHANNING WAY | T | M | None | None |
| 2 | 2015-00004831 | 2015-01-26T01:03:00 | UNIVERSITY AVE / NINTH ST | T | M | None | None |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 29205 | 2017-00024245 | 2017-04-30T22:59:26 | UNIVERSITY AVE/6TH ST | T | BM2TWN | None | None |
| 29206 | 2017-00024250 | 2017-04-30T23:19:27 | UNIVERSITY AVE / WEST ST | T | HM4TCS | 37.8698757000001 | -122.286550846 |
| 29207 | 2017-00024254 | 2017-04-30T23:38:34 | CHANNING WAY / BOWDITCH ST | 1194 | AR | 37.867207539 | -122.256529377 |
29208 rows × 7 columns
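As a final sanity check (our own addition), we can verify that no semicolons survived the cleaning:

# Expect False: trailing semicolons were stripped and the rest
# were replaced with commas
stops_final['Dispositions'].str.contains(';', na=False).any()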
11.2.3. Conclusion
As these two datasets have shown, data cleaning can often be both difficult and tedious. Cleaning 100% of the data often takes too long, but not cleaning the data at all results in faulty conclusions; we have to weigh our options and strike a balance each time we encounter a new dataset.
The decisions made during data cleaning affect all future analyses. For example, we chose not to clean the Location column of the Stops dataset, so we should treat that column with caution. Each decision made during data cleaning should be carefully documented for future reference, preferably in a notebook so that both code and explanations appear together.
# Save data to CSV for other chapters
stops_final.to_csv('data/stops.csv', index=False)