5.2. Data Wrangling#

Before we start our analysis, we check the quality of the data, simplify the structure where possible, and derive new measurements that might help us in our analysis. We cover these types of operations in Chapter 9, so don’t worry about the details of the code for now. Instead, focus on the differences between the data tables as we clean the data. We start by loading the data into Python.

bus = (pd.read_csv('data/seattle_bus_times.csv')
       .dropna(axis=0, how='any'))

The first few rows in the data table are shown here:

bus.head(3)
OPD_DATE VEHICLE_ID RTE DIR ... STOP_ID STOP_NAME SCH_STOP_TM ACT_STOP_TM
0 2016-03-26 6201 673 S ... 431 3RD AVE & PIKE ST (431) 01:11:57 01:13:19
1 2016-03-26 6201 673 S ... 431 3RD AVE & PIKE ST (431) 23:19:57 23:16:13
2 2016-03-26 6201 673 S ... 431 3RD AVE & PIKE ST (431) 21:19:57 21:18:46

3 rows × 9 columns

(The raw data are available as comma-separated values in a file, which we have loaded into this table; see Chapter 8 for details on this process.)

It looks like some of the columns in the table might be redundant, like the columns labeled STOP_ID and STOP_NAME. We can find the number of unique values and their counts to confirm this:

bus[['STOP_ID','STOP_NAME']].value_counts()
STOP_ID  STOP_NAME              
578      3RD AVE & PIKE ST (578)    19599
431      3RD AVE & PIKE ST (431)    19318
dtype: int64

There are two 3RD AVE & PIKE ST names for the stop. We wonder whether they are related to the direction of the bus, which we can check against the possible combinations of direction, stop ID and stop name:

bus[['DIR','STOP_ID','STOP_NAME']].value_counts()
DIR  STOP_ID  STOP_NAME              
N    578      3RD AVE & PIKE ST (578)    19599
S    431      3RD AVE & PIKE ST (431)    19318
dtype: int64

Indeed, the northern direction corresponds to stop ID 578 and the southern direction corresponds to stop ID 431. Since we are looking at only one stop in our analysis, we don’t really need anything more than the direction.

We can also check the number of unique route names:

bus['RTE'].value_counts()
673    13228
674    13179
675    12510
Name: RTE, dtype: int64

These routes are numbered and don’t match the names C, D, and E from the original description of the problem. This issue involves another aspect of data wrangling: we need to dig up information that connects the route letters and numbers. We can get this info from the Seattle transit site. Yet another part of wrangling is to translate data values into ones that are easier to understand, so we replace the route numbers with their letters:

def clean_stops(bus):
    return bus.assign(
        route=bus["RTE"].replace({673: "C", 674: "D", 675: "E"}),
        direction=bus["DIR"].replace({"N": "northbound", "S": "southbound"}),
    )

We can also create new columns in the table that help us in our investigations. For example, we can use the scheduled and actual arrival times to calculate how late a bus is. To do this requires some work with date and time formats, which is covered in Chapter 9.

def compute_mins_late(bus):
    bus = bus.assign(
        scheduled=pd.to_datetime(bus["OPD_DATE"] + " " + bus["SCH_STOP_TM"]),
        actual=pd.to_datetime(bus["OPD_DATE"] + " " + bus["ACT_STOP_TM"]),
    )
    # if scheduled & actual span midnight, then the actual day needs to be
    # adjusted
    minute = pd.Timedelta("1 minute")
    hour = pd.Timedelta("1 hour")
    diff_hrs = (bus["actual"] - bus["scheduled"]) / hour
    bus.loc[diff_hrs > 20, "actual"] -= 24 * hour
    bus.loc[diff_hrs < -20, "actual"] += 24 * hour
    return bus.assign(minutes_late=(bus["actual"] - bus["scheduled"]) / minute)

bus = (
    pd.read_csv("data/seattle_bus_times.csv")
    .dropna(axis=0, how="any")
    .pipe(clean_stops)
    .pipe(compute_mins_late)
)
bus.head(3)
OPD_DATE VEHICLE_ID RTE DIR ... direction scheduled actual minutes_late
0 2016-03-26 6201 673 S ... southbound 2016-03-26 01:11:57 2016-03-26 01:13:19 1.37
1 2016-03-26 6201 673 S ... southbound 2016-03-26 23:19:57 2016-03-26 23:16:13 -3.73
2 2016-03-26 6201 673 S ... southbound 2016-03-26 21:19:57 2016-03-26 21:18:46 -1.18

3 rows × 14 columns

Let’s examine the values of this new quantity to make sure that our calculations are correct:

 smallest amount late: -12.87 minutes
 greatest amount late:  150.28 minutes
 median amount late:    0.52 minutes

It’s a bit surprising that there are negative values for how late a bus is, but this just means the bus arrived earlier than scheduled. While the median lateness is only about half a minute, some of the buses are 2.5 hours late! Let’s take a look at the histogram of how many minutes late the buses are:

px.histogram(bus, x="minutes_late", nbins=120, width=450, height=300, 
             labels={'minutes_late':'Minutes late'})
../../_images/bus_clean_20_0.svg

We saw a similarly shaped histogram in Chapter 4. The distribution of how late the buses are is highly skewed to the right, but many arrive close to on time.

Finally, we conclude our wrangling by creating a simplified version of the data table. Since we only need to keep track of the route, direction, scheduled and actual arrival time, and how late the bus is, we create a smaller table and give the columns names that are a bit easier to read:

bus = bus[["route", "direction", "scheduled", "actual", "minutes_late"]]
bus.head()
route direction scheduled actual minutes_late
0 C southbound 2016-03-26 01:11:57 2016-03-26 01:13:19 1.37
1 C southbound 2016-03-26 23:19:57 2016-03-26 23:16:13 -3.73
2 C southbound 2016-03-26 21:19:57 2016-03-26 21:18:46 -1.18
3 C southbound 2016-03-26 19:04:57 2016-03-26 19:01:49 -3.13
4 C southbound 2016-03-26 16:42:57 2016-03-26 16:42:39 -0.30

These table manipulations are covered in Chapter 6.

Before we begin to model bus lateness, we want to explore and learn more about these data. We do that next.