Fitting the Multiple Linear Model

15.4. Fitting the Multiple Linear Model

In the previous section we considered the case of two explanatory variables; one of these was called \(x\) and the other \(v\). Now we want to generalize the approach to \(p\) explanatory variables. Our approach of choosing new letters to represent additional variables quickly fails us. Instead, we use a more formal and general approach that represents multiple predictors as a matrix, as depicted in Figure 15.4. We call \(\textbf{X}\) the design matrix. Notice that \(\textbf{X}\) has shape \( n \times (p + 1)\). Each column of \(\textbf{X}\) represents a variable, and each row represents an observation. That is, \(x_{i,j}\) is the measurement taken on observation \(i\) of variable \(j\).


Fig. 15.4 Each row and column of \(X\) represent an observation and a feature.


One technicality: the design matrix is defined as a mathematical matrix, not a dataframe, and a matrix doesn’t include the column or row labels that the dataframe has. But, we usually don’t have to worry about converting X into a matrix since most Python libraries for modeling treat dataframes as if they were matrices.

For a given observation, say, the second row in \(\textbf{X}\), we represent the outcome \(y_2\) by the linear approximation:

\[ \begin{aligned} y_2 \approx {\hat{y}_2} = \theta_0 + \theta_1 x_{2,1} + \ldots + \theta_p x_{2,p} \end{aligned} \]

Similar to the simple linear model, our multiple linear model predicts \(\hat{y}_2\) as a linear combination of the values \([x_{2,1}, \ldots,x_{2,p}]\).

Also, we can write the model parameters as a \(p+1\) column vector \({\boldsymbol{\theta}}\),

\[\begin{split} \boldsymbol{\theta} = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_p \end{bmatrix} \end{split}\]

Putting these notational definitions together, we can write the vector of \(\mathbf{\hat{y}}\) predictions for the entire dataset using matrix multiplication:

\[ \begin{aligned} \hat{\mathbf{y}} &= {\textbf{X}} {\boldsymbol{\theta}} \end{aligned} \]

Additionally, the errors in using \(\hat{\mathbf{y}}\) to predict \(\mathbf{y}\) can be expressed as the \(n\)-dimensional column vector:

\[ \mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}.\]

If we check the dimensions of \(\textbf{X}\) and \(\boldsymbol{\theta}\), it confirms that \(\hat{\mathbf{y}}\) is an \(n\)-dimensional column vector. Each element in this vector is the model’s prediction for one observation.

and we also represent \(\mathbf{y}\) as a column vector of outcomes:

\[\begin{split} \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \end{split}\]

This matrix representation of the multiple linear model can help us fit the model. Similar to the simple linear model, we find the best model using squared loss. That is, we want to find the model parameters \(\hat{\boldsymbol{\theta}}\) that minimize the average squared loss:

\[\begin{split} \begin{aligned} L({\boldsymbol{\theta}}, \textbf{X}, {\mathbf{y}}) &= \frac{1}{n} \sum_i [y_i - (\theta_0 + \theta_1 x_{i,1} + \cdots + \theta_p x_{i,p})]^2 \\ &= \frac{1}{n} \lVert \mathbf{y} - {\textbf{X}} {\boldsymbol{\theta}} \rVert^2 \end{aligned} \end{split}\]

Here, we use the notation \( \lVert \mathbf{v} \rVert^2 \) for a vector \(\mathbf{v}\) as a shorthand for the sum of each vector element squared 1: \(\lVert \mathbf{v} \rVert^2 = \sum_i v_i^2\).

We can fit our model using calculus as we did for the simple linear model. However, this approach gets cumbersome, and instead we use a geometric argument that is more intuitive and easily leads to useful properties of the design matrix, errors, and predicted values.

15.4.1. A Geometric Problem

Again, our goal is the find the \(\boldsymbol{\hat{\theta}}\) that minimizes our loss function—we want to make \(L({\boldsymbol{\theta}}, \textbf{X}, \mathbf{y})\) as small as possible for a given \(\textbf{X}\) and \(\mathbf{y}\). The key insight is that we can restate this goal in a geometric way. Since the model predictions \(\mathbf{\hat{y}}\) and the true outcomes \(\mathbf{y}\) are both vectors, we can think of them as points in variable space.

Figure 15.5 gives and example of this, where we have just one \(\mathbf{x}\) (and no intercept term). In the figure, \(\mathbf{x}\), \(\mathbf{y}\) and \(\mathbf{\hat{y}}\) are displayed as \(n\)-dimensional vectors. And minimizing squared loss, is finding \(\theta\) that minimizes:

\[ \frac{1}{n} \sum_i [y_i - \theta x_{i}]^2 = \frac{1}{n} \lVert {\mathbf{y}} - {{\theta}{\mathbf{x}}} \rVert^2 \]

Since squared loss minimizes the distance between \(\mathbf{y}\) and \(\theta \mathbf{x}\), the picture shows us that the projection \(\mathbf{\hat{y}} = \hat{\theta} \mathbf{x}\) is the closest point of the form \(\theta \mathbf{x}\) to \(\mathbf{y}\). We call \(\mathbf{\hat{y}}\) the projection of \(\mathbf{y}\) onto \(\mathbf{x}\).


Fig. 15.5 A plot showing Y, y_hat, and one x …

Now, let’s look at the case where we have two variables, \(\mathbf{x}_1\) and \(\mathbf{x}_2\). We have to draw the picture slightly differently now, to reflect we have three vectors. Figure 15.6 gives a diagram of this setting. We are faced with the same problem–we need to find a vector of the form \({{\theta_1}{\mathbf{x}}_1} + {{\theta_2}{\mathbf{x}}_2}\) that minimizes the least squares loss,

\[ \frac{1}{n} \sum_i [y_i - (\theta_1 {x}_{1,i} + \theta_2{x}_{2,i})]^2 = \frac{1}{n} \lVert {\mathbf{y}} - {\textbf{X}}{\boldsymbol{\theta}} \rVert^2 \]

The \(\theta_1{\mathbf{x}}_1 + \theta_2{\mathbf{x}}_2\) (or \({\textbf{X}}{\boldsymbol{\theta}}\)) is a linear combination of the vectors \(\mathbf{x}_1\) and \(\mathbf{x}_2\), and the shaded region in the figure shows all of the possible combinations. Hopefully, the diagram convinces you that the particular combination that minimizes the distance between \(\mathbf{y}\) and vectors that lay in the shaded area is again the projection of \(\mathbf{y}\) onto this gray area. The shaded area is called the span of \(textbf{X}\) and written as \(\text{span}(\textbf{X})\). If you would like a more formal approach, rather than this proof-by-picture, we walk through the steps of deriving \(\boldsymbol{\hat{\theta}}\) using vector spaces in the Exercises.


Fig. 15.6 A plot showing all possible values of .

We can derive the representation of \(\boldsymbol{\hat{\theta}}\) in terms of \({\textbf{X}}\) and \(\mathbf{y}\). To do this, we use the earlier definition: \(\mathbf{e} = \mathbf{y} - {\mathbf{\hat{y}}}\) and the observation from picture that shows \(\mathbf{e}\) is perpendicular to \(\text{span}(\textbf{X})\). We can now solve for \(\boldsymbol{\hat{\theta}}\) (we have placed the complete proof in the Exercises):

\[\begin{split} \begin{aligned} \textbf{X} \boldsymbol{\hat{\theta}} + \mathbf{e} &= \mathbf{y} \\ {\textbf{X}}^\top \textbf{X} \hat{\theta} + {\textbf{X}}^\top \mathbf{e} &= {\textbf{X}}^\top \mathbf{y} & (\text{left-multiply by } {\textbf{X}}^\top) \\ {\textbf{X}}^\top \textbf{X} \hat{\theta} &= {\textbf{X}}^\top \mathbf{y} & (\mathbf{e} \perp \text{span}(\textbf{X})) \\ \boldsymbol{\hat{\theta}} &= ({\textbf{X}}^\top \textbf{X})^{-1} {\textbf{X}}^\top \mathbf{y} & (\text{left-multiply by } ({\textbf{X}}^\top \textbf{X})^{-1}) \end{aligned} \end{split}\]

This general approach to derive \(\boldsymbol{\hat{\theta}}\) for the multiple linear model also gives us \(\hat{\theta}_0\) and \(\hat{\theta}_1\) for the simple linear model. If we set \({\textbf{X}}\) to contain the intercept column and one column of features, using this formula for \(\boldsymbol{\hat{\theta}}\) and some linear algebra, we can get the intercept and slope of the least-squares-fitted simple linear model. In fact, if \({\textbf{X}}\) is simply a single column of \(1\)s, then we can use this formula to show that \(\boldsymbol{\hat{\theta}}\) is just the mean of \(\mathbf{x}\). This ties back to the constant model that we introduced in Chapter 4.


While, we can write a simple function to derive the \(\boldsymbol{\hat{\theta}}\) based on the formula,

\[ \boldsymbol{\hat{\theta}} = ({\textbf{X}}^\top \textbf{X})^{-1} {\textbf{X}}^\top \mathbf{y}, \]

we recommend leaving the calculation of \(\boldsymbol{\hat{\theta}}\) to the optimally tuned methods provided in the ‘sci-kit learn’ or ‘statsmodels’ libraries. They handle cases where the design matrix is sparse, highly co-linear, and not invertible.

This solution for \(\boldsymbol{\hat{\theta}}\) (along with the pictures) reveal some useful properties of the fitted coefficients and the predictions. We present them here, and walk through the details of the derivations in the Exercises.

  • The residuals are orthogonal to the predicted values.

  • The average of the residuals is \(0\), if the model has an intercept term.

  • The variance of the residuals is just the ASE.

These properties explain why we examine plots of the residuals against the predictions. When we fit a multiple linear model, we also plot the residuals against variables that we are considering adding to the model. If they show a linear pattern, then we would add them to the model.

In addition to examining the SD of the errors, the ratio of the ASE for a multiple linear model to the ASE for the constant model gives a measure of the model fit. This is called the Multiple \(R^2\) and is defined as:

\[ R^2 = 1 - \frac {\lVert \mathbf{y} - {\textbf{X}}{\boldsymbol{\hat{\theta}}} \rVert^2} {\lVert {\mathbf{y}} - \bar{y} \rVert^2}. \]

As the model fits the data closer and closer, the multiple \(R^2\) gets nearer to \(1\). That might seem like a good thing, but there are problems with this approach because \(R^2\) continues to grow even as we add meaningless features to our model, as long as the features fill out the \(\text{span}(\textbf{X})\). To account for the size of a model, we often adjust the numerator and denominator by the number of fitted coefficients in the models. That is, we would normalize the numerator by \(1/[n-(p+1)]\) and the denominator by \(1/(n-1)\). Better approaches to selecting a models are covered in Chapter 17.


\(\lVert \mathbf{v} \rVert^2\) is also called the \(\ell_2\) norm of a vector \(\mathbf{v}\).