5.2. Data Wrangling#

Before we start our analysis, we check the quality of the data, simplify the structure where possible, and derive new measurements that might help us in our analysis. We cover these types of operations in Chapter 9, so don’t worry about the details of the code for now. Instead, focus on the differences between the data tables as we clean the data. We start by loading the data into Python.

bus = (pd.read_csv('data/seattle_bus_times.csv')
       .dropna(axis=0, how='any'))

The first few rows in the data table are shown here:

bus.head(3)

	OPD_DATE	VEHICLE_ID	RTE	DIR	...	STOP_ID	STOP_NAME	SCH_STOP_TM	ACT_STOP_TM
0	2016-03-26	6201	673	S	...	431	3RD AVE & PIKE ST (431)	01:11:57	01:13:19
1	2016-03-26	6201	673	S	...	431	3RD AVE & PIKE ST (431)	23:19:57	23:16:13
2	2016-03-26	6201	673	S	...	431	3RD AVE & PIKE ST (431)	21:19:57	21:18:46

3 rows × 9 columns

(The raw data are available as comma-separated values in a file, which we have loaded into this table; see Chapter 8 for details on this process.)

It looks like some of the columns in the table might be redundant, like the columns labeled STOP_ID and STOP_NAME. We can find the number of unique values and their counts to confirm this:

bus[['STOP_ID','STOP_NAME']].value_counts()

STOP_ID  STOP_NAME              
578      3RD AVE & PIKE ST (578)    19599
431      3RD AVE & PIKE ST (431)    19318
dtype: int64

There are two 3RD AVE & PIKE ST names for the stop. We wonder whether they are related to the direction of the bus, which we can check against the possible combinations of direction, stop ID and stop name:

bus[['DIR','STOP_ID','STOP_NAME']].value_counts()

DIR  STOP_ID  STOP_NAME              
N    578      3RD AVE & PIKE ST (578)    19599
S    431      3RD AVE & PIKE ST (431)    19318
dtype: int64

Indeed, the northern direction corresponds to stop ID 578 and the southern direction corresponds to stop ID 431. Since we are looking at only one stop in our analysis, we don’t really need anything more than the direction.

We can also check the number of unique route names:

bus['RTE'].value_counts()

  13228
  13179
  12510
Name: RTE, dtype: int64

These routes are numbered and don’t match the names C, D, and E from the original description of the problem. This issue involves another aspect of data wrangling: we need to dig up information that connects the route letters and numbers. We can get this info from the Seattle transit site. Yet another part of wrangling is to translate data values into ones that are easier to understand, so we replace the route numbers with their letters:

def clean_stops(bus):
    return bus.assign(
        route=bus["RTE"].replace({673: "C", 674: "D", 675: "E"}),
        direction=bus["DIR"].replace({"N": "northbound", "S": "southbound"}),
    )

We can also create new columns in the table that help us in our investigations. For example, we can use the scheduled and actual arrival times to calculate how late a bus is. To do this requires some work with date and time formats, which is covered in Chapter 9.

def compute_mins_late(bus):
    bus = bus.assign(
        scheduled=pd.to_datetime(bus["OPD_DATE"] + " " + bus["SCH_STOP_TM"]),
        actual=pd.to_datetime(bus["OPD_DATE"] + " " + bus["ACT_STOP_TM"]),
    )
    # if scheduled & actual span midnight, then the actual day needs to be
    # adjusted
    minute = pd.Timedelta("1 minute")
    hour = pd.Timedelta("1 hour")
    diff_hrs = (bus["actual"] - bus["scheduled"]) / hour
    bus.loc[diff_hrs > 20, "actual"] -= 24 * hour
    bus.loc[diff_hrs < -20, "actual"] += 24 * hour
    return bus.assign(minutes_late=(bus["actual"] - bus["scheduled"]) / minute)

bus = (
    pd.read_csv("data/seattle_bus_times.csv")
    .dropna(axis=0, how="any")
    .pipe(clean_stops)
    .pipe(compute_mins_late)
)
bus.head(3)

	OPD_DATE	VEHICLE_ID	RTE	DIR	...	direction	scheduled	actual	minutes_late
0	2016-03-26	6201	673	S	...	southbound	2016-03-26 01:11:57	2016-03-26 01:13:19	1.37
1	2016-03-26	6201	673	S	...	southbound	2016-03-26 23:19:57	2016-03-26 23:16:13	-3.73
2	2016-03-26	6201	673	S	...	southbound	2016-03-26 21:19:57	2016-03-26 21:18:46	-1.18

3 rows × 14 columns

Let’s examine the values of this new quantity to make sure that our calculations are correct:

 smallest amount late: -12.87 minutes
 greatest amount late:  150.28 minutes
 median amount late:    0.52 minutes

It’s a bit surprising that there are negative values for how late a bus is, but this just means the bus arrived earlier than scheduled. While the median lateness is only about half a minute, some of the buses are 2.5 hours late! Let’s take a look at the histogram of how many minutes late the buses are:

px.histogram(bus, x="minutes_late", nbins=120, width=450, height=300, 
             labels={'minutes_late':'Minutes late'})

We saw a similarly shaped histogram in Chapter 4. The distribution of how late the buses are is highly skewed to the right, but many arrive close to on time.

Finally, we conclude our wrangling by creating a simplified version of the data table. Since we only need to keep track of the route, direction, scheduled and actual arrival time, and how late the bus is, we create a smaller table and give the columns names that are a bit easier to read:

bus = bus[["route", "direction", "scheduled", "actual", "minutes_late"]]
bus.head()

	route	direction	scheduled	actual	minutes_late
0	C	southbound	2016-03-26 01:11:57	2016-03-26 01:13:19	1.37
1	C	southbound	2016-03-26 23:19:57	2016-03-26 23:16:13	-3.73
2	C	southbound	2016-03-26 21:19:57	2016-03-26 21:18:46	-1.18
3	C	southbound	2016-03-26 19:04:57	2016-03-26 19:01:49	-3.13
4	C	southbound	2016-03-26 16:42:57	2016-03-26 16:42:39	-0.30

These table manipulations are covered in Chapter 6.

Before we begin to model bus lateness, we want to explore and learn more about these data. We do that next.

Learning Data Science

Data Wrangling

5.2. Data Wrangling#