The Constant Model

4.1. The Constant Model

A transit rider named Jake often takes the northbound C bus at the 3rd & Pike bus stop in downtown Seattle 1. The bus is supposed to arrive every 10 minutes, but Jake notices that he sometimes waits a long time for the bus. He wants to know how late the bus usually is. Washington State keeps track of both scheduled bus arrival times and the actual arrival times, which lets Jake calculate the minutes that each bus is late for his stop. We’ll use this dataset to introduce the constant model, which summarizes the data with a single number (a constant). First, we read in the data.

times = pd.read_csv('data/seattle_bus_times_NC.csv')
times
route direction scheduled actual minutes_late
0 C northbound 2016-03-26 06:30:28 2016-03-26 06:26:04 -4.40
1 C northbound 2016-03-26 01:05:25 2016-03-26 01:10:15 4.83
2 C northbound 2016-03-26 21:00:25 2016-03-26 21:05:00 4.58
... ... ... ... ... ...
1431 C northbound 2016-04-10 06:15:28 2016-04-10 06:11:37 -3.85
1432 C northbound 2016-04-10 17:00:28 2016-04-10 16:56:54 -3.57
1433 C northbound 2016-04-10 20:15:25 2016-04-10 20:18:21 2.93

1434 rows × 5 columns

In the table above, the minutes_late column records how late each bus was. Notice that some of the times are negative, which means that the bus arrived early. Let’s plot a histogram of the minutes each bus is late.

fig = px.histogram(times, x='minutes_late', width=350, height=250)
fig.update_xaxes(range=[-12, 60], title_text='Minutes Late')
../../_images/modeling_simple_4_0.svg

There are already some interesting patterns in the data. For example, many buses arrive earlier than scheduled and some are well over 20 minutes late. We also see a clear mode (high point) at 0, meaning many buses arrive roughly on time.

We want to understand how late the bus usually is. To do this, we’d like to summarize the lateness by a constant—this is a single number, like the mean, median, or mode. Let’s find each of these summary statistics for the minutes_late data.

From the histogram, we estimate that the mode of the data is 0. We’ll use Python to compute the mean and median.

print(f"mean:    {times['minutes_late'].mean():.2f} mins late")
print(f"median:  {times['minutes_late'].median():.2f} mins late")
print(f"mode:    {0:.2f} mins late")
mean:    1.92 mins late
median:  0.74 mins late
mode:    0.00 mins late

After finding these three summary statistics, we naturally want to know: which of these numbers is best? Rather than relying on rules of thumb, we’ll take a more formal approach. In this example, we’re making a model for bus lateness that consists of a constant. Let’s call this constant \( \theta \). For example, if we say that \( \theta = 5 \), our model’s best guess is that the bus will be 5 minutes late.

Now, \( \theta = 5 \) isn’t a particularly good guess. From the histogram of arrival times, we saw that there are lot more points closer to 0 than 5. But it isn’t clear that \(\theta = 0\) (the mode) is a better choice than \(\theta = 0.74\) (the median), \( \theta = 1.92 \) (the mean), or something in between. To make precise choices between different values of \( \theta \), we would like to assign each value of \(\theta\) a score that measures how well the model fits the data. Using more formal language, we’ll use a loss function to pick the best \(\theta\) for our model. A loss function takes as input a value of \( \theta \) and the points in our dataset. It outputs a single number that we can use to select the best \( \theta \). In the next section, we examine how to define and use loss functions to fit the constant model.


1

We (the authors) first learned of the bus arrival time data from an analysis by a data scientist named Jake VanderPlas. We’ve named the protagonist of this section in his honor.