
import sys
import os

# Make the textbook helpers importable: textbook_utils provides np, plt,
# the `times` table of bus arrival data, and helpers such as try_thetas
if not any(path.endswith('textbook') for path in sys.path):
    sys.path.append(os.path.abspath('../../..'))
from textbook_utils import *

4.2. Loss Functions

We’re modeling how late the northbound C bus is by a constant called \(\theta\), and we want to use the data of actual arrival times to figure out a good value for \(\theta\). To find \(\theta\), we use a loss function—a function that measures how far away our model is from the actual data.

A loss function is a mathematical function that takes in \(\theta\) and a data value \(y\). It outputs a single number, the loss, that measures how far away \(\theta\) is from \(y\). We write the loss function as \({\cal l}(\theta, y)\).

By convention, the loss function outputs lower values for better values of \(\theta\) and larger values for worse ones. To fit a model to our data, we select the \(\theta\) that produces the lowest average loss across all choices; in other words, we find the \(\theta\) that minimizes the average loss for our data, \(y_1, \ldots, y_n\). We write the average loss as \(L(\theta, y_1, y_2, \ldots, y_n)\), where

\[\begin{split} \begin{aligned} L(\theta, y_1, y_2, \ldots, y_n) &= \text{mean}\left\{ {\cal l}(\theta, y_1), {\cal l}(\theta, y_2), \ldots, {\cal l}(\theta, y_n) \right\} \\ &= \frac{1}{n} \sum_{i = 1}^{n} {\cal l}(\theta, y_i)\\ \end{aligned} \end{split}\]

As a shorthand, we often define the vector \( \mathbf{y} = [ y_1, y_2, \ldots, y_n ] \). Then, we can write the average loss as:

\[\begin{split} L(\theta, \mathbf{y}) = \frac{1}{n} \sum_{i = 1}^{n}{\cal l}(\theta, {y_i})\\ \end{split}\]
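To make this definition concrete, here is a minimal sketch of the average loss in code. It assumes NumPy is available as np (the setup cell imports it via textbook_utils), and the helper name average_loss is ours, purely for illustration:

def average_loss(loss_fn, theta, y_vals):
    # Average the pointwise losses l(theta, y_i) over all of the data
    return np.mean([loss_fn(theta, y) for y in y_vals])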

Note

Notice that \({\cal l}(\theta, y)\) tells us the model’s loss on a single data point while \( L(\theta, \mathbf{y}) \) gives the model’s average loss for all of our data points. The capital \(L\) helps us remember that the average loss combines multiple smaller \(\cal l\) values.

Once we define a loss function, we can find the value of \(\theta\) that produces the smallest average loss. We call this minimizing value \(\hat{\theta}\). In other words, of all possible \(\theta\) values, \(\hat{\theta}\) is the one that produces the smallest average loss for our data. We call this process fitting a model to our data, since it finds the best constant model for a given loss function.

Next, we look at two specific loss functions: absolute error and squared error. Our goal is to fit our model and find \(\hat{\theta}\) for each of these loss functions.

4.2.1. Mean Absolute Error

We start with the absolute error loss function. Here’s the idea behind absolute loss. For some value of \(\theta\) and data value \(y\),

  1. Find the error: \(y - \theta\), and

  2. Take the absolute value of the error: \(|y - \theta|\).

So, our loss function is \({\cal l}(\theta, y) = | y - \theta |\).

Taking the absolute value of the error is a simple way to convert negative errors into positive ones. For instance, the point \(y = 4\) is equally far away from \(\theta = 2\) and \(\theta = 6\), so the errors are equally “bad”.

The average loss is called the mean absolute error, or MAE for short. That is, the MAE is the average of the individual absolute errors:

\[\begin{split} L(\theta, {\mathbf y}) = \frac{1}{n} \sum_{i = 1}^{n} |y_i - \theta|\\ \end{split}\]

Notice that the name MAE tells you how to compute it: take the mean of the absolute errors.

We can write a simple Python function to compute this loss:

def mae_loss(theta, y_vals):
    # Mean of the absolute errors |y_i - theta| across all data values
    return np.mean(np.abs(y_vals - theta))

Let’s see how this loss function behaves when we have just five points \([ -1, 0, 2, 5, 10]\). We can try different values of \(\theta\) and see what the loss function outputs for each value.

try_thetas(thetas=[-2, 0, 1, 2, 3, 4],
           y_vals=[-1, 0, 2, 5, 10],
           rug_height=0.3,
           xlims=(-3, 12))
[Figure: the MAE at each trial value of \(\theta\), shown over a rug plot of the five data values]

We suggest verifying some of these loss values by hand to check that you understand how the MAE is computed.
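For instance, here is one such check for \(\theta = 2\), worked by hand (our arithmetic, shown in the comments):

# |-1 - 2| + |0 - 2| + |2 - 2| + |5 - 2| + |10 - 2| = 3 + 2 + 0 + 3 + 8 = 16,
# so the MAE is 16 / 5 = 3.2
mae_loss(2, np.array([-1, 0, 2, 5, 10]))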

Of the values of \(\theta\) that we tried, we found that \(\theta = 2\) has the lowest mean absolute error. For this simple example, \(2\) is the median of the data values. This isn’t a coincidence. Let’s now see what the loss function outputs on the original dataset of bus times. We find the MAE when we set \(\theta\) to the mode, median, and mean of the arrival times.

try_thetas(thetas=[0, 0.74, 1.92],
           y_vals=times['minutes_late'],
           xlims=(-12, 30), figsize=(7, 2))
[Figure: the MAE of the bus times when \(\theta\) is the mode, median, and mean]

We see again that the median (middle plot) gives a smaller loss than the mode and mean (left and right plots). In fact, we can prove that for absolute loss, the minimizing \(\hat{\theta}\) is the \(\text{median} \{ y_1, y_2, \ldots, y_n \}\). (The proof appears in the Exercises.)

So far, we have found the best value of \( \theta \) by simply trying out a few values and then picking the one with the smallest loss. To get a better sense of the MAE as a function of \(\theta\), we can try many more values of \(\theta\) to plot a complete curve that shows how \(L(\theta, {\mathbf{y}})\) changes as \(\theta\) changes. We draw the curve for the example from above with the five data values, \([ -1, 0, 2, 5, 10]\).

thetas = np.arange(-2, 8, 0.05)
y_vals = np.array([-1, 0, 2, 5, 10])
losses = [mae_loss(theta, y_vals) for theta in thetas]

plt.figure(figsize=(4, 2.5))
plt.plot(thetas, losses)
plt.title(r'Mean Absolute Error when $\mathbf{y} = [-1, 0, 2, 5, 10]$')
plt.xlabel(r'$\theta$ Values')
plt.ylabel('Loss');
[Figure: the MAE curve for \(\mathbf{y} = [-1, 0, 2, 5, 10]\)]

The plot above confirms that \(\theta = 2\) is the best choice for the small dataset of five values. Notice the shape of the curve: it is piecewise linear, with the line segments connecting at the data values (\(-1\), \(0\), \(2\), and \(5\)). This piecewise-linear shape is a property of the absolute value function. With a lot of data, the flat pieces are less obvious. Our bus dataset has over 1,400 points, and its MAE curve appears below.

thetas = np.arange(-2, 8, 0.05)
y_vals = times['minutes_late']
losses = [mae_loss(theta, y_vals) for theta in thetas]

plt.figure(figsize=(4, 2.5))
plt.plot(thetas, losses)
plt.axvline(np.median(y_vals), linestyle='--',
            label=r'$\theta = \mathrm{median}(y)$')
plt.title(r'Mean Absolute Error for Bus Times')
plt.xlabel(r'$\theta$ Values')
plt.ylabel('Loss')
plt.legend();
[Figure: the MAE curve for the bus times, with a dashed line at \(\theta = \text{median}(y)\)]

We can use this plot to confirm that the median of the data is the minimizing value; in other words, \(\hat{\theta} = 0.74\).
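We can also check this numerically with a finer grid search. This is a quick sketch; the grid bounds and step size are our own choices:

# Search a fine grid of theta values and compare the minimizer to the median
thetas_fine = np.arange(-2, 8, 0.01)
mae_losses = [mae_loss(theta, times['minutes_late']) for theta in thetas_fine]
print(thetas_fine[np.argmin(mae_losses)], np.median(times['minutes_late']))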

Next, let’s look at another loss function, the squared error.

4.2.2. Mean Squared Error

We have used a constant model for our data and found that with mean absolute error, the minimizer is the median. Now, we keep our model the same but switch to a different loss function: squared error. Instead of taking the absolute difference between our data value \(y\) and the constant \(\theta\), we square the error. That is, for some value of \(\theta\) and data value \(y\),

  1. Find the error: \(y - \theta\) and

  2. Square the error: \((y - \theta)^2\).

This gives the loss function \({\cal l}(\theta, y) = (y - \theta)^2\).

As before, we want to use all of our data to find the best \(\theta\) so we compute the mean squared error, or MSE for short:

\[ L(\theta, {\mathbf y}) = L(\theta, y_1, y_2, \ldots, y_n) = \frac{1}{n} \sum_{i = 1}^{n} (y_i - \theta)^2 \]

We can write a simple Python function to compute the MSE:

def mse_loss(theta, y_vals):
    # Mean of the squared errors (y_i - theta)^2 across all data values
    return np.mean((y_vals - theta) ** 2)
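As with the MAE, it's worth checking one value by hand (our arithmetic): for \(\theta = 2\) on the earlier five points, the squared errors are 9, 4, 0, 9, and 64:

# The MSE is (9 + 4 + 0 + 9 + 64) / 5 = 17.2
mse_loss(2, np.array([-1, 0, 2, 5, 10]))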

Let’s again try the mean, median, and mode as potential minimizers of the MSE.

try_thetas(thetas=[0, 0.74, 1.92],
           y_vals=times['minutes_late'],
           xlims=(-12, 30), loss_fn=mse_loss, figsize=(5, 2))
[Figure: the MSE of the bus times when \(\theta\) is the mode, median, and mean]

Now, when we fit the constant model using MSE loss, we find that the mean (right plot) has a smaller loss than the mode and the median (left and middle plots).

Let’s plot the MSE curve for different values of \(\theta\) given our data. We can see that the minimizing value \( \hat{\theta} \) is close to 2.

thetas = np.arange(-2, 8, 0.05)
y_vals = times['minutes_late']
losses = [mse_loss(theta, y_vals) for theta in thetas]

plt.figure(figsize=(4, 2.5))
plt.plot(thetas, losses)
plt.axvline(np.mean(y_vals), linestyle='--',
            label=r'$\theta = \mathrm{mean}(y)$')
plt.title(r'Mean Squared Error for Bus Times')
plt.xlabel(r'$\theta$ Values')
plt.ylabel('Loss')
plt.legend();
[Figure: the MSE curve for the bus times, with a dashed line at \(\theta = \text{mean}(y)\)]

One noticeable feature of this curve is how much more rapidly the MSE grows compared to the MAE (note the range on the vertical axis). This growth comes from squaring the errors, which places a much higher loss on data values that are farther away. If \(\theta = 10\) and \(y = 110\), the squared error is \((110 - 10)^2 = 10000\), whereas the absolute error is \(|110 - 10| = 100\). For this reason, the MSE is more sensitive to unusually large values than the MAE.
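Here is a small illustration of that sensitivity; the extra data value of 110 is our own, chosen to match the example above. One extreme point increases the MAE about sixfold but increases the MSE by a factor of more than one hundred:

y = np.array([-1, 0, 2, 5, 10])
y_out = np.append(y, 110)  # add a single extreme value

print(mae_loss(2, y), mae_loss(2, y_out))  # 3.2 -> about 20.7
print(mse_loss(2, y), mse_loss(2, y_out))  # 17.2 -> about 1958.3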

When we use MSE loss, we saw that \(\hat{\theta}\) appears to be \(\text{mean}(\mathbf{y})\). Again, this is no mere coincidence: the mean of the data always produces the \(\theta\) that minimizes the MSE. We can show how this comes about using the quadratic nature of the MSE. To begin, we add and subtract \(\bar{y}\) in the loss function and expand the square as follows:

\[\begin{split} \begin{aligned} L(\theta, \mathbf{y}) &= \frac{1}{n} \sum_{i = 1}^{n}(y_i - \theta)^2\\ &= \frac{1}{n} \sum_{i = 1}^{n} [(y_i - \bar{y}) + (\bar{y} - \theta)]^2 \\ &= \frac{1}{n} \sum_{i = 1}^{n} [(y_i - \bar{y})^2 + 2(y_i - \bar{y})(\bar{y} - \theta) +(\bar{y} - \theta)^2]\\ \end{aligned} \end{split}\]

Next, we split the MSE into the sum of these three terms and note that the middle term is 0, due to a simple property of the average: \(\sum (y_i - \bar{y}) = 0\).

\[\begin{split} \begin{aligned} \frac{1}{n} \sum_{i = 1}^{n}& (y_i - \bar{y})^2 + \frac{1}{n} \sum_{i = 1}^{n}2(y_i - \bar{y})(\bar{y} - \theta) + \frac{1}{n}\sum_{i = 1}^{n}(\bar{y} - \theta)^2 \\ &= \frac{1}{n} \sum_{i = 1}^{n} (y_i - \bar{y})^2 + 2(\bar{y} - \theta)\frac{1}{n} \sum_{i = 1}^{n}(y_i - \bar{y}) + \frac{1}{n}\sum_{i = 1}^{n}(\bar{y} - \theta)^2 \\ &= \frac{1}{n} \sum_{i = 1}^{n} (y_i - \bar{y})^2 + (\bar{y} - \theta)^2 \\ \end{aligned} \end{split}\]

Of the remaining two terms, the first does not involve \(\theta\) and the second is always non-negative:

\[\begin{split} \begin{aligned} L(\theta, \mathbf{y}) &= \frac{1}{n} \sum_{i = 1}^{n} (y_i - \bar{y})^2 + \frac{1}{n} \sum_{i = 1}^{n}(\bar{y} - \theta)^2\\ &\geq \frac{1}{n} \sum_{i = 1}^{n} (y_i - \bar{y})^2 \end{aligned} \end{split}\]

This inequality shows us that the MSE is minimized when \(\theta\) is \(\bar{y}\). That is, for any dataset, a single value of \(\theta\) gives the smallest possible MSE, and that value is \(\hat{\theta} = \bar{y}\).
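As a quick numeric sanity check of this result (a sketch; the grid is our choice), the grid minimizer of the MSE on the five-point example should match the mean, 3.2:

y = np.array([-1, 0, 2, 5, 10])
thetas_fine = np.arange(-2, 8, 0.001)
print(thetas_fine[np.argmin([mse_loss(theta, y) for theta in thetas_fine])],
      np.mean(y))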

We have seen that for absolute loss, the best constant is the median, but for squared error, it's the mean. The choice of loss function is an important aspect of model fitting.

4.2.3. Choosing Loss Functions

Now that we’ve worked with two loss functions, we can return to our original question: why would we choose the median, mean, or mode over the others? Since these statistics minimize different loss functions [1], we can equivalently ask: what is the most appropriate loss function for our problem? To answer this, we look at the context of our problem.

Compared to the MAE, the MSE gives especially large loss values when the bus is much later (or earlier) than expected. So, a bus rider who wants to understand the typical wait times would use MAE and the median (0.74 minutes late), and a rider who despises unexpected wait times would summarize the wait time using MSE and the mean (1.92 minutes late).

If we want to refine the model even more, we can use a more specialized loss function. For example, if an early bus waits at the stop until its scheduled departure time, then we might assign early arrivals a loss of 0. And if a really late bus is a bigger aggravation than a moderately late one, we might choose an asymmetric loss function that penalizes very late arrivals more heavily, as in the sketch below.
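One standard asymmetric choice, offered here as a sketch rather than something this chapter develops, is the pinball (or quantile) loss. With \(q > 0.5\), underestimating lateness costs more than overestimating it:

def pinball_loss(theta, y_vals, q=0.9):
    # Positive errors (bus later than theta) are weighted by q;
    # negative errors (bus earlier than theta) are weighted by 1 - q
    errors = y_vals - theta
    return np.mean(np.where(errors >= 0, q * errors, (q - 1) * errors))

With \(q = 0.5\), this is just half of the absolute error, so the minimizer is again the median; with larger \(q\), the minimizing \(\hat{\theta}\) moves to the \(q\)-th quantile of the data.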

In essence, context matters when choosing a loss function. By thinking carefully about how we plan to use the fitted model, we can pick a loss function that helps us make good decisions using data.


[1] The mode minimizes a loss function called 0-1 loss. Although we haven’t covered this specific loss, the procedure is identical: pick the loss function, then find the \(\hat{\theta}\) that minimizes the loss.