{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"import sys\n",
"import os\n",
"if not any(path.endswith('textbook') for path in sys.path):\n",
" sys.path.append(os.path.abspath('../../..'))\n",
"from textbook_utils import *"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(sec:linear_multi_fit)=\n",
"# Fitting the Multiple Linear Model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the previous section we considered the case of two explanatory variables; one of these was called $x$ and the other $v$. Now we want to generalize the approach to $p$ explanatory variables. Our approach of choosing new letters to represent additional variables quickly fails us. Instead, we use a more formal and general approach that represents multiple predictors as a matrix, as depicted in {numref}`Figure %s `. We call $\\textbf{X}$ the *design matrix*. Notice that $\\textbf{X}$ has shape $ n \\times (p + 1)$. Each column of $\\textbf{X}$ represents a variable, and each row represents an observation. That is, $x_{i,j}$ is the measurement taken on observation $i$ of variable $j$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{figure} figures/design-matrix.svg\n",
"---\n",
"name: fig:design-matrix\n",
"---\n",
"\n",
"Each row and column of $X$ represent an observation and a feature.\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
":::{note}\n",
"\n",
"One technicality: the design matrix is defined as a mathematical matrix,\n",
"not a dataframe, and a matrix doesn't include the column or row labels that the dataframe has.\n",
"But, we usually don't have to worry about converting `X` into a matrix\n",
"since most Python libraries for modeling treat dataframes as if they were matrices.\n",
"\n",
":::"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a given observation, say, the second row in $\\textbf{X}$, we represent the outcome $y_2$ by the linear approximation:\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
"y_2 \\approx {\\hat{y}_2} = \\theta_0 + \\theta_1 x_{2,1} + \\ldots + \\theta_p x_{2,p} \n",
"\\end{aligned}\n",
"$$\n",
"\n",
"Similar to the simple linear model, our\n",
"multiple linear model predicts $\\hat{y}_2$ as a linear combination of the values $[x_{2,1}, \\ldots,x_{2,p}]$.\n",
"\n",
"Also, we can write the model parameters as a $p+1$ column vector ${\\boldsymbol{\\theta}}$, \n",
"\n",
"$$\n",
"\\boldsymbol{\\theta} = \n",
"\\begin{bmatrix}\n",
"\\theta_0 \\\\\n",
"\\theta_1 \\\\\n",
"\\vdots \\\\\n",
"\\theta_p\n",
"\\end{bmatrix}\n",
"$$\n",
"\n",
"\n",
"\n",
"Putting these notational definitions together, we can write the vector of $\\mathbf{\\hat{y}}$ predictions for the entire dataset using matrix multiplication:\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
"\\hat{\\mathbf{y}} &= {\\textbf{X}} {\\boldsymbol{\\theta}}\n",
"\\end{aligned}\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Additionally, the errors in using $\\hat{\\mathbf{y}}$ to predict $\\mathbf{y}$ can be expressed as the $n$-dimensional column vector:\n",
"\n",
"$$ \\mathbf{e} = \\mathbf{y} - \\hat{\\mathbf{y}}.$$\n",
"\n",
"If we check the dimensions of $\\textbf{X}$ and $\\boldsymbol{\\theta}$, it confirms that $\\hat{\\mathbf{y}}$ is an $n$-dimensional column vector. \n",
"Each element in this vector is the model's prediction for one observation.\n",
"\n",
"and we also represent $\\mathbf{y}$ as a column vector of outcomes:\n",
"$$ \n",
"\\mathbf{y} = \n",
"\\begin{bmatrix}\n",
"y_1 \\\\\n",
"y_2 \\\\\n",
"\\vdots \\\\\n",
"y_n\n",
"\\end{bmatrix}\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This matrix representation of the multiple linear model can help us fit the model.\n",
"Similar to the simple linear model, we find the best model using squared loss.\n",
"That is, we want to find the model parameters $\\hat{\\boldsymbol{\\theta}}$ that minimize the average squared loss:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$\n",
"\\begin{aligned}\n",
"L({\\boldsymbol{\\theta}}, \\textbf{X}, {\\mathbf{y}})\n",
" &= \\frac{1}{n} \\sum_i [y_i - (\\theta_0 + \\theta_1 x_{i,1} + \\cdots + \\theta_p x_{i,p})]^2 \\\\\n",
" &= \\frac{1}{n} \\lVert \\mathbf{y} - {\\textbf{X}} {\\boldsymbol{\\theta}} \\rVert^2\n",
"\\end{aligned}\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here, we use the notation $ \\lVert \\mathbf{v} \\rVert^2 $ for a vector $\\mathbf{v}$ as a\n",
"shorthand for the sum of each vector element squared [^ell2]:\n",
"$\\lVert \\mathbf{v} \\rVert^2 = \\sum_i v_i^2$.\n",
"\n",
"We can fit our model using calculus as we did for the simple linear model.\n",
"However, this approach gets cumbersome, and instead we use a geometric argument that is more intuitive and easily leads to useful properties of the design matrix, errors, and predicted values.\n",
"\n",
"[^ell2]: $\\lVert \\mathbf{v} \\rVert^2$ is also called the $\\ell_2$ norm of a vector $\\mathbf{v}$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A Geometric Problem"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, our goal is the find the $\\boldsymbol{\\hat{\\theta}}$ that minimizes our loss\n",
"function---we want to make $L({\\boldsymbol{\\theta}}, \\textbf{X}, \\mathbf{y})$ as small as possible for a given $\\textbf{X}$ and $\\mathbf{y}$.\n",
"The key insight is that we can restate this goal in a geometric way.\n",
"Since the model predictions $\\mathbf{\\hat{y}}$ and the true outcomes\n",
"$\\mathbf{y}$ are both vectors, we can think of them as points in *variable space*.\n",
"\n",
"{numref}`Figure %s ` gives and example of this, where we have just one\n",
"$\\mathbf{x}$ (and no intercept term). In the figure, $\\mathbf{x}$, $\\mathbf{y}$ and $\\mathbf{\\hat{y}}$ are displayed as $n$-dimensional vectors. And minimizing squared loss, is finding $\\theta$ that minimizes:\n",
"\n",
"$$\n",
"\\frac{1}{n} \\sum_i [y_i - \\theta x_{i}]^2 \n",
" = \\frac{1}{n} \\lVert {\\mathbf{y}} - {{\\theta}{\\mathbf{x}}} \\rVert^2\n",
"$$\n",
"\n",
"Since squared loss minimizes the distance between $\\mathbf{y}$ and $\\theta \\mathbf{x}$,\n",
"the picture shows us that the projection $\\mathbf{\\hat{y}} = \\hat{\\theta} \\mathbf{x}$ is the closest point of the form $\\theta \\mathbf{x}$ to $\\mathbf{y}$.\n",
"We call $\\mathbf{\\hat{y}}$ the *projection* of $\\mathbf{y}$ onto $\\mathbf{x}$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{figure} figures/geom-2d.svg\n",
"---\n",
"name: fig:geom-2d\n",
"width: 250px\n",
"---\n",
"\n",
"A plot showing Y, y_hat, and one x ....\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's look at the case where we have two variables, $\\mathbf{x}_1$ and $\\mathbf{x}_2$.\n",
"We have to draw the picture slightly differently now, to reflect we have three vectors. \n",
"{numref}`Figure %s ` gives a diagram of this setting.\n",
"We are faced with the same problem--we need to find a vector of the form \n",
"${{\\theta_1}{\\mathbf{x}}_1} + {{\\theta_2}{\\mathbf{x}}_2}$ that minimizes the least squares loss, \n",
"\n",
"$$\n",
"\\frac{1}{n} \\sum_i [y_i - (\\theta_1 {x}_{1,i} + \\theta_2{x}_{2,i})]^2 \n",
" = \\frac{1}{n} \\lVert {\\mathbf{y}} - {\\textbf{X}}{\\boldsymbol{\\theta}} \\rVert^2\n",
"$$\n",
"\n",
"The $\\theta_1{\\mathbf{x}}_1 + \\theta_2{\\mathbf{x}}_2$ (or ${\\textbf{X}}{\\boldsymbol{\\theta}}$) is a linear combination of the vectors $\\mathbf{x}_1$\n",
"and $\\mathbf{x}_2$, and the shaded region in the figure shows all of the possible \n",
"combinations.\n",
"Hopefully, the diagram convinces you that the particular combination that minimizes the\n",
"distance between $\\mathbf{y}$ and vectors that lay in the shaded area is again the projection\n",
"of $\\mathbf{y}$ onto this gray area. The shaded area is called the *span of* $textbf{X}$\n",
"and written as $\\text{span}(\\textbf{X})$.\n",
"If you would like a more formal approach, rather than this proof-by-picture, we walk \n",
"through the steps of deriving $\\boldsymbol{\\hat{\\theta}}$ using vector spaces in the Exercises."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{figure} figures/geom-span.svg\n",
"---\n",
"name: fig:geom-span\n",
"width: 250px\n",
"---\n",
"\n",
"A plot showing all possible values of .\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can derive the representation of $\\boldsymbol{\\hat{\\theta}}$ in terms of ${\\textbf{X}}$ and $\\mathbf{y}$. To do this, we use the earlier definition: $\\mathbf{e} = \\mathbf{y} - {\\mathbf{\\hat{y}}}$ and the observation from picture that shows $\\mathbf{e}$ is perpendicular to $\\text{span}(\\textbf{X})$.\n",
"We can now solve for $\\boldsymbol{\\hat{\\theta}}$ \n",
"(we have placed the complete proof in the Exercises):"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"$$\n",
"\\begin{aligned}\n",
"\\textbf{X} \\boldsymbol{\\hat{\\theta}} + \\mathbf{e} &= \\mathbf{y} \\\\\n",
"{\\textbf{X}}^\\top \\textbf{X} \\hat{\\theta} + {\\textbf{X}}^\\top \\mathbf{e} &= {\\textbf{X}}^\\top \\mathbf{y}\n",
" & (\\text{left-multiply both sides by } {\\textbf{X}}^\\top) \\\\\n",
"{\\textbf{X}}^\\top \\textbf{X} \\hat{\\theta} &= {\\textbf{X}}^\\top \\mathbf{y}\n",
" & ({\\textbf{X}}^\\top \\mathbf{e} = 0 \\text{ since } \\mathbf{e} \\perp \\text{span}(\\textbf{X})) \\\\\n",
"\\boldsymbol{\\hat{\\theta}} &= ({\\textbf{X}}^\\top \\textbf{X})^{-1} {\\textbf{X}}^\\top \\mathbf{y}\n",
" & (\\text{left-multiply both sides by } ({\\textbf{X}}^\\top \\textbf{X})^{-1})\n",
"\\end{aligned}\n",
"$$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This general approach to derive $\\boldsymbol{\\hat{\\theta}}$ for the multiple linear model also gives\n",
"us $\\hat{\\theta}_0$ and $\\hat{\\theta}_1$ for the simple linear model. If we set\n",
"${\\textbf{X}}$ to contain the intercept column and one column of features, using this\n",
"formula for $\\boldsymbol{\\hat{\\theta}}$ and some linear algebra, we can get the intercept and slope of the least-squares-fitted simple linear model. In fact, if ${\\textbf{X}}$ is simply a single column of $1$s, then we can use this formula to show that $\\boldsymbol{\\hat{\\theta}}$ is just the mean of $\\mathbf{x}$. This ties back to the constant model that we introduced in {numref}`Chapter %s `. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
":::{note}\n",
"\n",
"While, we can write a simple function to derive the $\\boldsymbol{\\hat{\\theta}}$ based on the formula, \n",
"\n",
"$$\n",
"\\boldsymbol{\\hat{\\theta}} = ({\\textbf{X}}^\\top \\textbf{X})^{-1} {\\textbf{X}}^\\top \\mathbf{y},\n",
"$$\n",
"\n",
"we recommend leaving the calculation of $\\boldsymbol{\\hat{\\theta}}$ to the optimally tuned methods provided in the 'sci-kit learn' or 'statsmodels' libraries. They handle cases where the design matrix is sparse, highly co-linear, and not invertible. \n",
"\n",
":::"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This solution for $\\boldsymbol{\\hat{\\theta}}$ (along with the pictures) reveal some useful properties of the fitted coefficients and the predictions. We present them here, and walk through the details of the derivations in the Exercises.\n",
"\n",
"+ The residuals are orthogonal to the predicted values. \n",
"+ The average of the residuals is $0$, if the model has an intercept term.\n",
"+ The variance of the residuals is just the ASE. \n",
"\n",
"These properties explain why we examine residual plots and plot the residuals against the predicted values. \n",
"In addition to examining the SD of the errors, the ratio of the ASE for a multiple linear model to the ASE for the constant model gives a measure of the model fit. The *multiple $R^2$* is defined as: \n",
"\n",
"$$ \n",
"R^2 = 1 - \\frac {\\lVert \\mathbf{y} - {\\textbf{X}}{\\boldsymbol{\\hat{\\theta}}} \\rVert^2}\n",
" {\\lVert {\\mathbf{y}} - \\bar{y} \\rVert^2}.\n",
"$$\n",
"\n",
"As the model fits the data closer and closer, the multiple $R^2$ gets nearer to $1$. That might seem like a good thing, but there are problems with this approach because $R^2$ continues to grow even as we add meaningless features to our model, as long as the features fill out the $\\text{span}(\\textbf{X})$. Other approaches to selecting a models are covered in {numref}`Chapter %s `."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}