{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"# Reference: https://jupyterbook.org/interactive/hiding.html\n",
"# Use {hide, remove}-{input, output, cell} tags to hiding content\n",
"\n",
"import sys\n",
"import os\n",
"if not any(path.endswith('textbook') for path in sys.path):\n",
" sys.path.append(os.path.abspath('../../..'))\n",
"from textbook_utils import *"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"times = pd.read_csv('data/seattle_bus_times_NC.csv')\n",
"\n",
"def mae_loss(theta, y_vals):\n",
" return np.mean(np.abs(y_vals - theta))\n",
"\n",
"def try_thetas(thetas, y_vals, xlims, loss_fn=mae_loss, figsize=(5, 3),\n",
" rug_height=0.1, cols=3):\n",
" if not isinstance(y_vals, np.ndarray):\n",
" y_vals = np.array(y_vals)\n",
" rows = int(np.ceil(len(thetas) / cols))\n",
" plt.figure(figsize=figsize)\n",
" for i, theta in enumerate(thetas):\n",
" ax = plt.subplot(rows, cols, i + 1)\n",
" sns.rugplot(y_vals, height=rug_height, ax=ax)\n",
" plt.axvline(theta, linestyle='--',\n",
" label=rf'$ \\theta = {theta} $')\n",
" plt.title(f'Avg loss = {loss_fn(theta, y_vals):.2f}')\n",
" plt.xlim(*xlims)\n",
" plt.yticks([])\n",
" plt.legend()\n",
" plt.tight_layout()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Minimizing Loss\n",
"\n",
"We want to model how late the northbound C bus is by a constant, which we call\n",
"$\\theta$, and we want to use the data of actual minutes each bus is late to figure\n",
"out a good value for $\\theta$.\n",
"To do this, we use a *loss function*---a function that measures\n",
"how far away our constant, $ \\theta $, is from the actual data.\n",
"\n",
"A loss function is a mathematical function that takes in $\\theta$ and a\n",
"data value $y$. It outputs a single number, the *loss*, that\n",
"measures how far away $\\theta$ is from $y$. We write the loss function\n",
"as ${\\cal l}(\\theta, y)$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By convention, the loss function outputs lower values for better values of\n",
"$\\theta$ and larger values for worse $\\theta$. To fit a constant to our data, we\n",
"select the particular $\\theta$ that produces the lowest average loss across all\n",
"choices for $ \\theta $. In other words, we find the $\\theta$ that *minimizes the average\n",
"loss* for our data, $y_1, \\ldots, y_n$. More formally, we write the average loss as $L(\\theta, y_1, y_2, \\ldots, y_n)$, where:\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
"L(\\theta, y_1, y_2, \\ldots, y_n)\n",
"&= \\text{mean}\\left\\{ {\\cal l}(\\theta, y_1),\n",
" {\\cal l}(\\theta, y_2), \\ldots,\n",
" {\\cal l}(\\theta, y_n) \\right\\} \\\\\n",
"&= \\frac{1}{n} \\sum_{i = 1}^{n} {\\cal l}(\\theta, y_i)\\\\\n",
"\\end{aligned}\n",
"$$\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a shorthand, we often use the vector $ \\mathbf{y} = [ y_1, y_2, \\ldots, y_n ] $.\n",
"Then we can write the average loss as:\n",
"\n",
"$$\n",
"L(\\theta, \\mathbf{y})\n",
"= \\frac{1}{n} \\sum_{i = 1}^{n}{\\cal l}(\\theta, {y_i})\\\\\n",
"$$\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
":::{note}\n",
"\n",
"Notice that ${\\cal l}(\\theta, y)$ tells us the model's loss for a single data\n",
"point while $ L(\\theta, \\mathbf{y}) $ gives the model's average\n",
"loss for all the data points. The capital $L$ helps us remember that the\n",
"average loss combines multiple smaller $\\cal l$ values.\n",
"\n",
":::"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once we define a loss function, we can find the value of $\\theta$ that produces\n",
"the smallest average loss. We call this minimizing value $\\hat{\\theta}$. In other words, of all the possible $\\theta$ values,\n",
"$\\hat{\\theta}$ is the one that produces the smallest average loss for our data.\n",
"We call this optimization process *model fitting*; it finds the best constant model for our data.\n",
"\n",
"Next, we look at two particular loss functions: absolute error and squared\n",
"error. Our goal is to fit the model and find $\\hat{\\theta}$ for each of these\n",
"loss functions."
]
},
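{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make \"smallest average loss\" concrete, here is a rough sketch of brute-force model fitting: evaluate the average loss over a grid of candidate $\\theta$ values and keep the candidate with the smallest average loss. This sketch uses absolute error and a small hypothetical dataset purely for illustration:\n",
"\n",
"```python\n",
"y = np.array([2, 3, 7])  # hypothetical data values\n",
"thetas = np.arange(0, 10, 0.1)  # candidate constants\n",
"avg_losses = [np.mean(np.abs(y - t)) for t in thetas]\n",
"theta_hat = thetas[np.argmin(avg_losses)]  # candidate with smallest average loss\n",
"```\n",
"\n",
"A finer grid gives a better approximation of $\\hat{\\theta}$.\n"
]
},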
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mean Absolute Error\n",
"\n",
"We start with the *absolute error* loss function. Here's the idea behind\n",
"absolute loss. For some value of $\\theta$ and data value $y$:\n",
"\n",
"1. Find the error, $y - \\theta$.\n",
"1. Take the absolute value of the error, $|y - \\theta|$. \n",
"\n",
"So the loss function is ${\\cal l}(\\theta, y) = | y - \\theta |$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Taking the absolute value of the error is a simple way to convert negative\n",
"errors into positive ones. For instance, the point,\n",
"$y=4$, is equally far away from $\\theta = 2$ and $\\theta = 6$, so the errors are equally \"bad.\"\n",
"\n",
"The average of the absolute errors is called the _mean absolute error_ (MAE). The MAE is the average of each of the individual absolute errors:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$\n",
"L(\\theta, {\\mathbf y})\n",
"= \\frac{1}{n} \\sum_{i = 1}^{n} |y_i - \\theta|\n",
"$$\n",
"\n",
"Notice that the name MAE tells you how to compute it: take the Mean of the\n",
"Absolute value of the Errors, $ \\{ y_i - \\theta \\} $.\n",
"\n",
"We can write a simple Python function to compute this loss:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def mae_loss(theta, y_vals):\n",
" return np.mean(np.abs(y_vals - theta))"
]
},
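{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check of `mae_loss`, consider three hypothetical points $[2, 3, 7]$ with $\\theta = 3$: the absolute errors are $1$, $0$, and $4$, so the MAE is $(1 + 0 + 4)/3 = 5/3 \\approx 1.67$:\n",
"\n",
"```python\n",
"loss = mae_loss(3, np.array([2, 3, 7]))  # (1 + 0 + 4) / 3 = 1.67\n",
"```\n"
]
},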
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see how this loss function behaves when we have just five data points $[–1, 0, 2, 5, 10]$. We can try different values of $\\theta$ and see what the MAE outputs for each value:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"tags": [
"remove-input"
]
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
"