Modeling Wait Times

5.4. Modeling Wait Times

We are interested in modeling the experience of someone waiting at a bus stop. We could develop a complex model that involves the intervals between scheduled arrivals, the bus line, and direction. Instead, we take a simpler approach and narrow the focus to one line, one direction, and one scheduled interval. We examine the northbound C line stops that are scheduled to arrive 12 minutes apart. Both the complex and the narrow approaches are legitimate, but we do not yet have the tools to approach the complex model (see Chapter 15 for more details on modeling).

bus_c_n_12 = bus_c_n[bus_c_n['sched_inter'] == 12].copy()

So far, we have examined the distribution of the number of minutes the bus is late. We create another histogram of this delay for the subset of data that we are analyzing (northbound C line stops that are scheduled to arrive 12 minutes after the previous bus).

fig = px.histogram(bus_c_n_12, x='minutes_late', 
                   nbins=120, width=350, height=250)

fig.update_xaxes(range=[-13, 40])

fig.show()
../../_images/bus_modeling_7_0.svg

And, we calculate the minimum, maximum, and median lateness.

print(f" smallest amount late:  {np.min(bus_c_n_12['minutes_late']):.2f} minutes\n",
      f"greatest amount late:  {np.max(bus_c_n_12['minutes_late']):.2f} minutes\n",
      f"median amount late: {np.median(bus_c_n_12['minutes_late']):.2f} minutes\n")
 smallest amount late:  -10.20 minutes
 greatest amount late:  57.00 minutes
 median amount late: -0.50 minutes

Interestingly, the northbound, 12 minute apart, buses on the C line are more often early than not.

Let’s revisit our question to confirm that we are on track for answering it. A summary of how late the buses are does not quite address the experience of the person waiting for the bus. When someone arrives at a bus stop, they need to wait for the next bus to arrive. Figure 5.1 shows an idealization of time passing as passengers and buses arrive at the bus stop. If people are arriving at random times at the bus stop, notice that they are more likely to arrive in a time interval where the bus is delayed because there’s a longer interval between buses. This arrival pattern is an example of size-biased sampling. So, to answer the question of what do people experience when waiting for a bus, we need to do more than summarize how late the bus is.

../../_images/BusDiagram.jpg

Fig. 5.1 This diagram shows an idealization of a timeline with buses arriving (rectangles above the timeline) and passengers arriving (circles). The time they have to wait for the next bus to arrive is indicated by the curly brackets.

We can design a simulation that mimics waiting for a bus over the course of one day, like in Chapter 3. To do this, we set up a string of bus arrivals that are 12 minutes apart from 6 am to midnight.

scheduled = 12 * np.arange(91)
scheduled
array([   0,   12,   24, ..., 1056, 1068, 1080])

Then, for each scheduled bus, we simulate its actual arrival time by adding on a random amount of time each bus is late. To do this, we choose the minutes late from the distribution of observed lateness of the actual buses. Notice how we have incorporated the real data in our simulation study by using the distribution of actual delays of the 12-minute apart buses.

minutes_late = bus_c_n_12['minutes_late'].to_numpy()
actual = scheduled + np.random.choice(minutes_late, 
                                     size = 91, replace=True)

We need to sort these arrival times because when a bus is super late, another may well come along before it.

actual.sort()
actual
array([   1.43,   13.43,   21.93, ..., 1060.07, 1065.65, 1110.45])

We also need to simulate the arrival of people at the bus stop at random times throughout the day. We can use another, different, urn model for the passenger arrivals. For the passengers, we put a marble in the urn with a time on it. These times run from time 0, which stands for 6 am, to the arrival of the last bus at midnight, which is 1068 minutes past 6 am, and to match the way the bus times are measured in our data, we make the times 1/100 of a minute apart.

pass_arrival_times = np.arange(100*1068)
pass_arrival_times / 100
array([   0.  ,    0.01,    0.02, ..., 1067.97, 1067.98, 1067.99])

Now, we can simulate the arrival of, say, 500 people, at the bus stop throughout the day. We draw 500 times from this urn, replacing the marbles between draws.

sim_arrival_times = np.random.choice(pass_arrival_times, 
                                     size = 500, replace=True)/100
sim_arrival_times.sort()
sim_arrival_times
array([   1.4 ,    3.46,    8.06, ..., 1058.15, 1062.85, 1067.47])

To find out how long each individual waits, we look for the soonest bus to arrive after their sampled time. The difference between these two times (the sampled time of the person and the soonest bus arrival after that) is how long the person waits.

i = np.searchsorted(actual, sim_arrival_times, side='right')
sim_wait_times = actual[i] - sim_arrival_times
sim_wait_times
array([ 0.03,  9.97,  5.37, ...,  1.92,  2.8 , 42.98])

We can set up a complete simulation where we simulate, say, 200 days of bus arrivals and for each day, we simulate 500 people arriving at the bus stop at random times throughout the day. In total, that’s 100,000 simulated wait times.

sim_wait_times = [ ]

for day in np.arange(0, 200, 1):
    bus_late = np.random.choice(minutes_late, size = 91, replace=True)
    actual = scheduled + bus_late
    actual.sort()
    sim_arrival_times = np.random.choice(pass_arrival_times, 
                                     size = 500, replace=True)/100
    sim_arrival_times.sort()
    i = np.searchsorted(actual, sim_arrival_times, side='right')
    sim_wait_times = np.append(sim_wait_times, actual[i] - sim_arrival_times)
    

We make a histogram of these simulated waiting times to examine the distribution.

fig = px.histogram(x=sim_wait_times, nbins=40,
                   histnorm='probability density',
                   width=350, height=250)


fig.update_xaxes(title="Simulated Wait Times for 100,000 Passengers")
fig.update_yaxes(title="Proportion")
fig.show()
../../_images/bus_modeling_28_0.svg

As we expect, we find a skewed distribution. We can model this by a constant where we use absolute loss to select the best constant. We saw in Chapter 4 that absolute loss gives us the median wait time.

print(f"median wait time: {np.median(sim_wait_times):.2f} minutes")
median wait time: 6.45 minutes

The median of about six and a half minutes doesn’t seem too long. While our model captures the typical wait time, we also want to provide an estimate of the variability in the process. This topic is covered in Chapter 16. We can compute the upper quartile of wait times to give us a sense of variability.

print(f"upper quartile wait time: {np.quantile(sim_wait_times, 0.75):.2f} minutes")
upper quartile wait time: 10.55 minutes

The upper quartile is quite large. It’s undoubtedly memorable when you have to wait more than 10 minutes for a bus that is supposed to arrive every 12 minutes, and this happens one in four times you take the bus!