15.3. Multiple Linear Model

So far, we’ve used a single predictor variable \( x \) to predict the outcome \( y \). Now, we’ll introduce the multiple linear model, a linear model that uses multiple predictors to predict \( y \). This is useful because additional predictors can improve the model’s fit to the data and the accuracy of its predictions. After defining the multiple linear model, we’ll use it to predict AUM using a combination of variables.

If we have multiple predictors, we say that \( x \) is a \( p \)-dimensional column vector \( x = [ x_1, x_2, \ldots, x_p ] \). Then, for a given \( x \) the outcome \( y \) depends on a linear combination of \( x_i \):

\[ \begin{aligned} y = \theta_0 + \theta_1 x_1 + \ldots + \theta_p x_p + \epsilon \end{aligned} \]

Similar to the simple linear model, our multiple linear model \( f_{\theta}(x) \) predicts \( y \) for a given \( x \):

\[ \begin{aligned} f_{\theta}(x) = \theta_0 + \theta_1 x_1 + \ldots + \theta_p x_p \end{aligned} \]

We can simplify this notation if we add an intercept term to \( x \) so that \( x = [ 1, x_1, x_2, \ldots, x_p ] \). Since we also write our model parameters as a column vector \( \theta = [ \theta_0, \theta_1, \ldots, \theta_p ] \), we can use the definition of the dot product to write our model as:

\[\begin{split} \begin{aligned} f_{\theta}(x) &= \theta_0 + \theta_1 x_1 + \ldots + \theta_p x_p \\ &= \theta \cdot x \\ \end{aligned} \end{split}\]
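To make the dot-product form concrete, here is a minimal NumPy sketch. The parameter and predictor values are made up for illustration and are not fitted to the Opportunity data.

```python
import numpy as np

# Hypothetical parameters [theta_0, theta_1, theta_2] for p = 2 predictors
theta = np.array([2.0, 0.5, -1.0])

# One observation's predictors, with a leading 1 for the intercept term
x = np.array([1.0, 4.0, 3.0])

# Written out term by term: theta_0 + theta_1 * x_1 + theta_2 * x_2
long_form = theta[0] + theta[1] * x[1] + theta[2] * x[2]

# The same prediction as a single dot product
dot_form = np.dot(theta, x)

print(long_form, dot_form)
```

Both expressions give the same number, which is why the leading 1 in \( x \) lets us fold the intercept \( \theta_0 \) into the dot product.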

As a final simplification, we’ll use matrix notation to show how our model works on our entire dataset. Before, we said that a single observation is \( (x, y) \), where \( x \) is a vector of predictor variables and \( y \) is the scalar outcome. Now, we’ll say that \( X \) is a matrix. Each row of \( X \) has the predictors for a single observation. We’ll also say that \( y \) is a vector (instead of a scalar) with the outcomes for each observation:

\[\begin{split} \begin{aligned} X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ & & \vdots & & \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \\ \end{bmatrix} & & y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \end{aligned} \end{split}\]

We call \( X \) the design matrix. It’s an \( n \times (p + 1) \) matrix (remember that we added an extra dimension for the intercept term). Now, we can write the predictions for the entire dataset using matrix multiplication:

\[ \begin{aligned} f_{\theta}(X) &= X \theta \end{aligned} \]

\( X \) is an \( n \times (p + 1) \) matrix and \( \theta \) is a \( (p + 1) \)-dimensional column vector. This means that \( X \theta \) is an \( n \)-dimensional column vector. Each item in the vector is the model’s prediction for one observation. It’s easier to understand the design matrix through an example, so let’s return to the Opportunity data.
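A small sketch can show how the shapes line up. The design matrix and parameters below are hypothetical values, chosen only to illustrate the matrix product with \( n = 3 \) observations and \( p = 2 \) predictors.

```python
import numpy as np

# Hypothetical design matrix: n = 3 rows, p + 1 = 3 columns
# (the first column of ones is the intercept term)
X = np.array([
    [1.0, 4.0, 3.0],
    [1.0, 2.0, 0.0],
    [1.0, 0.0, 1.0],
])

# Hypothetical parameter vector with p + 1 = 3 entries
theta = np.array([2.0, 0.5, -1.0])

# One matrix-vector product gives the predictions for all n observations
predictions = X @ theta

# Each entry equals the dot product of theta with the matching row of X
row_by_row = np.array([np.dot(theta, row) for row in X])

print(predictions.shape)
print(np.allclose(predictions, row_by_row))
```

The product has one entry per row of the design matrix, and each entry matches the single-observation dot product.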

15.3.1. Predicting Upward Mobility Using Multiple Variables

Before, we used the fraction of people with a ≤15 min commute time to predict the AUM for a commuting zone. Now, we’d like to use a combination of predictors. In his original analysis, Chetty created nine high-level predictor categories like segregation, income, and K-12 education. We’ll take one predictor from each of Chetty’s categories for a total of nine predictors, described in Table 15.1.

Table 15.1 The nine variables we use to predict AUM

Column name            Description
frac_traveltime_lt15   Fraction of people with a ≤15 minute commute to work.
gini                   Gini coefficient, a measure of wealth inequality. Values are between 0 and 1, where small values mean wealth is evenly distributed and large values mean more inequality.
dropout_r              High school dropout rate.
...                    Fraction of people who self-reported as religious.
cs_fam_wkidsinglemom   Fraction of children with a single mother.
taxrate                Local tax rate.
gradrate_r             College graduation rate.
frac_worked1416        Fraction of teenagers who are working.
cs_born_foreign        Fraction of people born outside the US.

Our original dataframe has around 40 predictors:

cz czname stateabbrv aum ... cs_fam_wkidsinglemom cs_divorced cs_married incgrowth0010
0 100.0 Johnson City TN 38.39 ... 0.19 0.11 0.60 -2.28e-03
1 200.0 Morristown TN 37.78 ... 0.19 0.12 0.61 -2.15e-03

2 rows × 43 columns

We’ll subset out the design matrix into a DataFrame X and the column of outcomes into a Series y.

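predictors = [
    'frac_traveltime_lt15', 'gini', 'dropout_r',
    'rel_tot',  # assumed name of the fraction-religious column; hidden by the "..." in the displayed columns
    'cs_fam_wkidsinglemom', 'taxrate', 'gradrate_r',
    'frac_worked1416', 'cs_born_foreign',
]

X = (df[predictors]
     # Some predictors are missing; we'll drop those rows for simplicity
     .dropna()
     # Add a column of ones for the intercept term
     .assign(intr=1)
     # Move intercept column to appear first
     [['intr', *predictors]])
y = df.loc[X.index, 'aum']
X = X.reset_index(drop=True)
y = y.reset_index(drop=True)
print(f'X: {X.shape} matrix')
print(f'y: {y.shape} vector')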
X: (479, 10) matrix
y: (479,) vector
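Since the steps above depend on the full dataframe, here is a self-contained sketch of the same pattern on a toy dataframe. The values, the `toy` dataframe, and the two-predictor list are made up for illustration. Only the pandas calls (`dropna`, `assign`, `loc`, `reset_index`) are the point.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the full dataframe; the values are hypothetical
toy = pd.DataFrame({
    'aum': [38.4, 37.8, 39.0, 41.2],
    'gini': [0.47, np.nan, 0.44, 0.36],
    'taxrate': [0.02, 0.02, 0.01, 0.02],
})
toy_predictors = ['gini', 'taxrate']

X_toy = (toy[toy_predictors]
         .dropna()            # drop rows with a missing predictor
         .assign(intr=1)      # add the intercept column of ones
         [['intr', *toy_predictors]])  # put the intercept column first

# Keep only the outcomes for the rows that survived dropna()
y_toy = toy.loc[X_toy.index, 'aum']

# Renumber the rows 0, 1, 2, ... in both objects
X_toy = X_toy.reset_index(drop=True)
y_toy = y_toy.reset_index(drop=True)

print(X_toy.shape, y_toy.shape)
```

Selecting the outcomes with the index of the cleaned predictors keeps the rows of the design matrix and the outcome vector aligned after rows with missing values are dropped.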

Let’s look at X, a dataframe that we’re using as our design matrix. It has 479 observations and 9 predictors plus an intercept column:

intr frac_traveltime_lt15 gini dropout_r ... taxrate gradrate_r frac_worked1416 cs_born_foreign
0 1 0.33 0.47 -1.53e-02 ... 0.02 -2.43e-03 3.75e-03 1.18e-02
1 1 0.28 0.43 -2.35e-02 ... 0.02 -1.01e-01 4.78e-03 2.31e-02
2 1 0.36 0.44 -4.63e-03 ... 0.01 1.11e-01 2.89e-03 7.08e-03
... ... ... ... ... ... ... ... ... ...
476 1 0.45 0.36 7.07e-03 ... 0.02 -4.30e-02 4.33e-03 1.12e-01
477 1 0.65 0.36 -1.53e-02 ... 0.02 -4.94e-02 4.30e-03 2.33e-02
478 1 0.47 0.44 2.61e-03 ... 0.02 -2.70e-01 5.46e-03 3.69e-02

479 rows × 10 columns

Notice that our design matrix is a subset of our original dataframe—we just selected specific variables we want to use for prediction. Each row of our design matrix corresponds to an observation in our original data, and each column corresponds to a predictor variable, as depicted in Figure 15.1.


Fig. 15.1 Each row of \(X\) represents an observation, and each column represents a feature.

One technicality: the design matrix is defined as a mathematical matrix, not a dataframe, and a matrix doesn’t include the column or row labels that the X dataframe has. But, we usually don’t have to worry about converting X into a matrix since most Python libraries for modeling treat dataframes as if they were matrices.
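As a small illustration of this point, the sketch below builds a toy dataframe and checks that pandas and NumPy give the same predictions. The values and the `theta` vector are made up, and only the column names echo the ones used above. When an explicit matrix is needed, `DataFrame.to_numpy()` returns the values without the labels.

```python
import numpy as np
import pandas as pd

# Toy design matrix stored as a dataframe; the values are hypothetical
X_toy = pd.DataFrame({
    'intr': [1, 1, 1],
    'frac_traveltime_lt15': [0.33, 0.28, 0.36],
    'gini': [0.47, 0.43, 0.44],
})

# Hypothetical parameters, one per column of X_toy
theta = np.array([40.0, 10.0, -20.0])

# Convert to a plain matrix: the row and column labels are dropped
X_matrix = X_toy.to_numpy()

# The matrix product works on the dataframe directly
pred_from_df = X_toy @ theta          # pandas Series, keeps the row index
# and on the NumPy array
pred_from_matrix = X_matrix @ theta   # NumPy array, no labels

print(X_matrix.shape)
print(np.allclose(pred_from_df, pred_from_matrix))
```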


Once again, we’ll point out that people from different backgrounds use different terminology. For example, we say that each row in the design matrix \(X\) is an observation and each column is a variable. This is more common among people with backgrounds in statistics. Others say that each column of the design matrix represents a feature. Also, we say that our overall process of fitting and interpreting models is called modeling, while others call it machine learning.

Now that we have our data prepared for modeling, in the next section we’ll fit our model by finding the \( \hat{\theta} \) that minimizes our loss.