Data Wrangling

5.2. Data Wrangling

Before we start out analysis, we check the quality of the data, simplify the structure where possible, and derive any new measurements that might help us in our analysis. These sorts of actions are described in more detail in Chapter 9.

The first few rows in the data table are shown here.

bus = pd.read_csv('~/Downloads/arrival_times.csv')
bus = bus.dropna(axis=0, how='any')

bus.head()
OPD_DATE VEHICLE_ID RTE DIR ... STOP_ID STOP_NAME SCH_STOP_TM ACT_STOP_TM
0 2016-03-26 6201 673 S ... 431 3RD AVE & PIKE ST (431) 01:11:57 01:13:19
1 2016-03-26 6201 673 S ... 431 3RD AVE & PIKE ST (431) 23:19:57 23:16:13
2 2016-03-26 6201 673 S ... 431 3RD AVE & PIKE ST (431) 21:19:57 21:18:46
3 2016-03-26 6201 673 S ... 431 3RD AVE & PIKE ST (431) 19:04:57 19:01:49
4 2016-03-26 6201 673 S ... 431 3RD AVE & PIKE ST (431) 16:42:57 16:42:39

5 rows × 9 columns

(The raw data are available as comma separate values in a file, which we have loaded into this table; see Chapter 8 for details on this process).

It looks like some of the columns in the table might be redundant, like the columns labeled STOP_ID and STOP_NAME. We can find the number of unique values and their counts to confirm this.

bus[['STOP_ID','STOP_NAME']].value_counts()
STOP_ID  STOP_NAME              
578      3RD AVE & PIKE ST (578)    19599
431      3RD AVE & PIKE ST (431)    19318
dtype: int64

There are two Third & Pike names for the stop. Could they be related to the direction of the bus? We can check the possible combinations of direction, stop ID and stop name.

bus[['DIR','STOP_ID','STOP_NAME']].value_counts()
DIR  STOP_ID  STOP_NAME              
N    578      3RD AVE & PIKE ST (578)    19599
S    431      3RD AVE & PIKE ST (431)    19318
dtype: int64

Indeed, the northern direction corresponds to stop number 578 and the southern direction corresponds to stop 431. Since we are looking only at one stop in our analysis, we don’t really need anything more than the direction.

We can also check the number of unique route names:

bus['RTE'].value_counts()
673    13228
674    13179
675    12510
Name: RTE, dtype: int64

These routes are numbered so they don’t match the names C, D, and E from the original description of the problem. This issue involves another aspect of data wrangling: we need to did up information that connects the route letters and numbers. We can get this info from the Seattle transit site. Yet another part of wrangling is to translate values into ones that are easier to understand so we replace the route numbers with their letters.

bus['route'] = bus['RTE'].replace({673: 'C', 674: 'D', 675: 'E'})

bus['direction'] = bus['DIR'].replace({'N': 'northbound', 'S': 'southbound'})

We can also create new columns in the table that help us in our investigations. For example, we can use the scheduled and actual arrival times to calculate how late a bus is. To do this requires some work with date and time formats, which is covered in Chapter 9.

bus['scheduled'] = pd.to_datetime(bus['OPD_DATE'] + ' ' + bus['SCH_STOP_TM'])
bus['actual'] = pd.to_datetime(bus['OPD_DATE'] + ' ' + bus['ACT_STOP_TM'])

minute = np.timedelta64(1, 'm')
hour = 60 * minute
diff_hrs = (bus['actual'] - bus['scheduled']) / hour
bus.loc[diff_hrs > 20, 'actual'] -= 24 * hour
bus.loc[diff_hrs < -20, 'actual'] += 24 * hour

bus['minutes_late'] = (bus['actual'] - bus['scheduled']) / minute

Let’s examine the values of this new quantity to make sure that our calculations are correct.

 smallest amount late:  -12.87 minutes
 greatest amount late:  150.28 minutes
 median amount late: 0.52 minutes

It’s a bit surprising that there are negative values for how late a bus is, but this just means the bus arrived earlier than scheduled. While, the median lateness is only about half a minute, some of the buses are 2.5 hours late! Let’s take a look at the histogram of how many minutes late the buses are.

fig = px.histogram(bus, x='minutes_late', 
                   nbins=120, width=350, height=250)

fig.update_xaxes(range=[-13, 40])

fig.show()
../../_images/bus_clean_19_0.svg

We saw a similar shaped histogram in Chapter 4. The distribution of how late the buses are is highly skewed to the right, but many arrive close to on time.

Finally, our wrangling concludes with creating a simplified version of the data table. Since we only need to keep track of the route, direction, scheduled and actual arrival time, and how late the bus is, we create a smaller table, and we give the columns names that are a bit easier to read.

bus = bus[['route', 'direction', 'scheduled', 
           'actual', 'minutes_late']].copy()

bus.head()
route direction scheduled actual minutes_late
0 C southbound 2016-03-26 01:11:57 2016-03-26 01:13:19 1.37
1 C southbound 2016-03-26 23:19:57 2016-03-26 23:16:13 -3.73
2 C southbound 2016-03-26 21:19:57 2016-03-26 21:18:46 -1.18
3 C southbound 2016-03-26 19:04:57 2016-03-26 19:01:49 -3.13
4 C southbound 2016-03-26 16:42:57 2016-03-26 16:42:39 -0.30

The table manipulations in this section are covered in Chapter 6.

Before we begin to model bus lateness, we want to explore and learn more about these data. We do that next.