16.4. Regularization#

We just saw how cross-validation can help find a dimension for a fitted model that balances under- and overfitting. Rather than selecting the dimension of the model, we can build a model with all of the features but restrict the size of the coefficients. We keep the model from overfitting by adding to the MSE a penalty term on the size of the coefficients. The penalty, called a regularization term, is \( \lambda \sum_{j = 1}^{p} \theta_j^2 \). We fit the model by minimizing the combination of mean squared error plus this penalty:

\[ \frac{1}{n} \sum_{i=1}^{n}(y_i - \mathbf{x}_i \boldsymbol{\theta})^2 ~+~ \lambda \sum_{j = 1}^{p} \theta_j^2 \]

When the regularization parameter, \(\lambda\), is large, large coefficients are heavily penalized. (We typically choose \(\lambda\) by cross-validation.)

Penalizing the square of the coefficients is called \( L_2 \) regularization, or ridge regression. Another popular regularization penalizes the absolute size of the coefficients:

\[ \begin{aligned} \frac{1}{n} \sum_{i=1}^{n}(y_i - \mathbf{x}_i \boldsymbol{\theta})^2 ~+~ \lambda \sum_{j = 1}^{p} |\theta_j| \end{aligned} \]

This \( L_1 \) regularized linear model is also called lasso regression (lasso stands for Least Absolute Shrinkage and Selection Operator).
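To make the two objective functions concrete, here is a minimal numpy sketch. The function names are ours, and for simplicity theta holds only the feature coefficients (no intercept):

import numpy as np

def ridge_objective(theta, X, y, lam):
    # mean squared error plus the L2 penalty on the coefficients
    return np.mean((y - X @ theta) ** 2) + lam * np.sum(theta ** 2)

def lasso_objective(theta, X, y, lam):
    # mean squared error plus the L1 penalty on the coefficients
    return np.mean((y - X @ theta) ** 2) + lam * np.sum(np.abs(theta))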

To get an idea about how regularization works, let’s think about the extreme cases: when \( \lambda \) is really large and when it’s close to 0 (\( \lambda \) is never negative). With a big regularization parameter, the coefficients are heavily penalized, so they shrink. On the other hand, when \( \lambda \) is tiny, the coefficients aren’t restricted. In fact, when \( \lambda \) is 0, we’re back in the world of ordinary least squares (the short sketch after the following list makes this concrete). A couple of issues crop up when we think about controlling the size of the coefficients through regularization:

  • We do not want to regularize the intercept term. That way, when the penalty is very large, the fit shrinks toward a constant model (the intercept alone) rather than toward 0.

  • When features have very different scales, the penalty can impact them differently, with large-valued features being penalized more than others. To avoid this, we standardize all of the features to have mean 0 and variance 1 before fitting the model.
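Here is the sketch promised above, using standardized synthetic data (X_demo, y_demo, and the other names are ours, made up for illustration): with a negligible penalty, ridge regression reproduces the ordinary least squares coefficients.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = StandardScaler().fit_transform(rng.normal(size=(100, 3)))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

ols = LinearRegression().fit(X_demo, y_demo)
near_ols = Ridge(alpha=1e-8).fit(X_demo, y_demo)

# with essentially no penalty, the two coefficient vectors should agree
np.allclose(ols.coef_, near_ols.coef_)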

Let’s look at an example with 35 features.

16.5. Example: A Market Analysis#

A market research team working for a pharmaceutical company wants to model consumer interest in purchasing a cold sore health-care product. They gather data from 1,023 consumers. Each consumer rates, on a 10-point scale, 35 factors according to whether the factor matters to them when considering purchasing a cold sore treatment. They also rate their interest in purchasing the product.1

We begin by reading in the data:

ma_df = pd.read_csv('data/market-analysis.csv')

Table 16.1 lists the 35 factors and gives each one’s correlation with the outcome, the consumer’s interest in purchasing the product.

Table 16.1 Codebook#

Factor  Corr  Description
x1      0.70  provides soothing relief
x2      0.58  moisturizes cold sore blister
x3      0.69  provides long-lasting relief
x4      0.70  provides fast-acting relief
x5      0.72  shortens duration of a cold
x6      0.68  stops the virus from spreading
x7      0.67  dries up cold sore
x8      0.72  heals fast
x9      0.72  penetrates deep
x10     0.65  relieves pain
x11     0.61  prevents cold
x12     0.73  prevents from getting worse
x13     0.57  medicated
x14     0.61  prescription strength
x15     0.63  repairs damaged skin
x16     0.67  blocks virus from spreading
x17     0.42  contains SPF
x18     0.57  non-irritating
x19     0.54  has a non-messy application
x20     0.70  good for any stage of a cold
x21     0.49  easy to apply/take
x22     0.52  package keeps from contamination
x23     0.57  easy to dispense a right amount
x24     0.63  worth the price it costs
x25     0.57  recommended most by pharmacists
x26     0.54  recommended by doctors
x27     0.54  FDA approved
x28     0.64  a brand I trust
x29     0.60  clinically proven
x30     0.68  a brand I would recommend
x31     0.74  an effective treatment
x32     0.37  portable
x33     0.37  discreet packaging
x34     0.55  helps conceal cold sores
x35     0.63  absorbs quickly

Based on their labels alone, some of these 35 features appear to measure similar aspects of desirability. We can compute the correlations between the explanatory variables to confirm this:
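One way to produce the display below (the two-decimal rounding and the column truncation with ... are just pandas’s output formatting) is to call the data frame’s corr method:

ma_df.corr()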

y x1 x2 x3 x4 x5 x6 ... x29 x30 x31 x32 x33 x34 x35
y 1.00 0.70 0.58 0.69 0.70 0.72 0.68 ... 0.60 0.68 0.74 0.37 0.37 0.55 0.63
x1 0.70 1.00 0.69 0.74 0.81 0.75 0.73 ... 0.63 0.68 0.79 0.42 0.33 0.57 0.71
x2 0.58 0.69 1.00 0.63 0.65 0.63 0.63 ... 0.60 0.62 0.66 0.43 0.36 0.55 0.63
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
x33 0.37 0.33 0.36 0.34 0.32 0.32 0.37 ... 0.39 0.41 0.35 0.34 1.00 0.35 0.36
x34 0.55 0.57 0.55 0.67 0.60 0.59 0.64 ... 0.58 0.48 0.58 0.16 0.35 1.00 0.54
x35 0.63 0.71 0.63 0.70 0.72 0.69 0.68 ... 0.65 0.67 0.72 0.46 0.36 0.54 1.00

36 rows × 36 columns

We see, for example, that the last feature, “absorbs quickly,” is highly correlated with the first three: “provides soothing relief,” “moisturizes,” and “provides long-lasting relief.”

Before we start to fit regularized models, we set up the design matrix and outcome vector and split the data into train and test sets. We place 200 observations in the test set:

y = ma_df["y"]
X = ma_df.drop(columns=["y"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=200, random_state=42
)

As mentioned earlier, we need to standardize the features. We standardize the train set and then, when we evaluate the model, apply the same standardization (using the train set’s means and SDs) to the test set. The StandardScaler transformer helps us with this process:

from sklearn.preprocessing import StandardScaler

scalerX = StandardScaler().fit(X_train) 
X_train_scaled = scalerX.transform(X_train)
X_test_scaled = scalerX.transform(X_test)

We confirm that the means of the 35 features in the train set are all 0 and the SDs are all 1:

np.allclose(X_train_scaled.mean(axis=0), 0)
True
np.allclose(X_train_scaled.std(axis=0), 1)
True

It’s important to note that this is not the case for the test set, because we use the averages and SDs from the train set to standardize the test set features:

X_test_scaled.mean(axis=0)
array([-0.17, -0.12, -0.15, ..., -0.12, -0.1 , -0.24])

To perform a lasso regression, we use the Lasso method in scikit-learn. Let’s see how the coefficients for the 35 features in the data frame change with \(\lambda\). We set up a range of values for the regularization parameter and fit the lasso for each (this parameter is referred to as alpha in Lasso):

from sklearn.linear_model import Lasso

coefs = []
mses = []
alphas = np.arange(0.01, 2, 0.01)

for a in alphas:
    model = Lasso(alpha=a)
    model.fit(X_train_scaled, y_train)
    coefs.append(model.coef_)
    mses.append(mean_squared_error(y_test, model.predict(X_test_scaled)))

For each feature, we can overlay a line plot of the coefficient against \( \lambda \):
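The plotting code is omitted in the text; a sketch along these lines, reusing coefs, alphas, and the column names from above, produces a similar figure:

# one line per feature: coefficient value as a function of lambda
px.line(
    pd.DataFrame(coefs, index=alphas, columns=X.columns),
    labels={"index": "Lambda", "value": "Coefficient", "variable": "Feature"},
    width=450, height=350,
)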

[Figure: lasso coefficient paths for the 35 features, plotted against \( \lambda \)]

As \( \lambda \) increases, the model fitting is penalized more heavily and many coefficients shrink to 0. The lefthand side of the plot shows greater model complexity and corresponds to small \( \lambda \). Notice that for \( \lambda = 0.5 \), nearly all of the coefficients are 0. The exceptions are, from largest to smallest: x12, x30, x31, x5, x7, x1, x9, x28, x20.
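We can check this list by refitting at that single value of the penalty and matching the nonzero coefficients back to the column names (a quick check that reuses the objects defined above; sort_values with a key function needs a reasonably recent version of pandas):

# refit at lambda = 0.5 and list the surviving coefficients, largest magnitude first
lasso_half = Lasso(alpha=0.5).fit(X_train_scaled, y_train)
surviving = pd.Series(lasso_half.coef_, index=X.columns)
surviving[surviving != 0].sort_values(key=np.abs, ascending=False)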

We can also plot the MSE as a function of \( \lambda \) to see how it changes with the increase in the penalty:

px.line(x=alphas, y=mses,
        labels={"x": "Lambda", "y": "MSE"},
        width=350, height=250)
[Figure: test set MSE of the lasso fits, plotted against \( \lambda \)]

When \( \lambda \) reaches about 1.25, the penalty is so large that the lasso regression is simply fitting a constant model to the data, and the MSE doesn’t change. We are again faced with a model selection question, but this time it’s in the form of deciding on \( \lambda \). We can use cross-validation to help us.
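Before turning to cross-validation, we can confirm the constant-model behavior directly. Assuming, as the MSE plot suggests, that every coefficient has hit 0 by this point, only the intercept remains, and it equals the average outcome on the train set:

# with a large penalty the lasso keeps no features, so it predicts a constant
lasso_flat = Lasso(alpha=1.5).fit(X_train_scaled, y_train)
print(np.all(lasso_flat.coef_ == 0), np.isclose(lasso_flat.intercept_, y_train.mean()))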

We use the LassoCV method to perform 5-fold cross-validation to choose \(\lambda\):

from sklearn.linear_model import LassoCV

lasso_cv_model = LassoCV(
    alphas=np.arange(0.01, 1, 0.01), cv=5, max_iter=100000
)

Notice that we have specified a maximum number of iterations: the minimization uses numerical optimization (see Chapter 20) to solve for the coefficients, and this cap limits how many iterations are run in trying to reach the specified tolerance for the optimal parameters. We’re ready to fit the model on the train set:

lasso_best = lasso_cv_model.fit(X_train_scaled, y_train)

Cross-validation has chosen the following regularization parameter:

lasso_best.alpha_
0.04

Next, let’s compute the MSE for predictions in the test set using this cross-validated model:

y_test_pred = lasso_best.predict(X_test_scaled)
mean_squared_error(y_test, y_test_pred)
1.0957444655340487

Let’s find how many coefficients are not 0:

sum(np.abs(lasso_best.coef_) > 0)
16
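To see which features those are, we can match the nonzero coefficients back to their column names:

# features retained by the cross-validated lasso
X.columns[lasso_best.coef_ != 0]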

We next demonstrate how the ridge regression coefficients change with the penalty. Ridge regression uses an \( L_2 \) penalty, but otherwise the implementation is similar to lasso regression. As before, we try a range of \( \lambda \) values and then plot the coefficients to see how they change:

from sklearn.linear_model import Ridge

coefsR = []
alphasR = np.arange(1, 1001, 25)

for a in alphasR:
    ridge = Ridge(alpha=a)
    ridge.fit(X_train_scaled, y_train)
    coefsR.append(ridge.coef_)
[Figure: ridge coefficient paths for the 35 features, plotted against \( \lambda \)]

This plot shows a very different shape compared to the lasso coefficient plot. The coefficients do not disappear entirely. The plots do have an important similarity in that as the penalty increases, the coefficients shrink. In a way, with ridge regression, we are using all of the variables a little bit. For large \( \lambda \), many coefficients, while not 0, are quite small. For example, we can count the number of coefficients larger than 0.05 in absolute value for \( \lambda \approx 600 \):

sum(np.abs(coefsR[24]) > 0.05)
13

Similar to lasso, many of the features have tiny coefficients.

Again, to select the best \( \lambda \) we can turn to cross-validation and use RidgeCV. We omit the details here because the approach is the same as LassoCV.
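For reference, a sketch of what that would look like, mirroring the LassoCV call from earlier (the variable names here are ours):

from sklearn.linear_model import RidgeCV

ridge_cv_model = RidgeCV(alphas=np.arange(1, 1001, 25), cv=5)
ridge_best = ridge_cv_model.fit(X_train_scaled, y_train)
ridge_best.alpha_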

Using \( L_1 \) or \( L_2 \) regularization allows us to avoid model overfitting by penalizing large coefficients. \( L_1 \) regularization has the advantage of zeroing out coefficients; it performs feature selection by discarding the features with a coefficient of 0. This is particularly useful when working with high-dimensional data with many features. A model that only uses a few features to make a prediction runs much faster than a model that requires many calculations. Since unneeded features tend to increase model variance without decreasing bias, we can sometimes increase the accuracy of other models by using lasso regression to select a subset of features to use.
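As a sketch of that last idea (the names keep and ols_small are ours, and whether it helps depends on the data), we can drop the columns with zero lasso coefficients and refit ordinary least squares on the remainder:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# keep only the columns the lasso selected and refit plain least squares on them
keep = lasso_best.coef_ != 0
ols_small = LinearRegression().fit(X_train_scaled[:, keep], y_train)
mean_squared_error(y_test, ols_small.predict(X_test_scaled[:, keep]))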

Note

Sometimes we prefer one type of regularization over the other because it maps more closely to the domain we are working with. For example, if we know that the phenomenon we are trying to model results from many small factors, we might prefer ridge regression because it won’t discard these factors. On the other hand, some outcomes result from a few highly influential features. We prefer lasso regression in these situations because it discards unneeded features. Both \(L_2\) and \(L_1\) serve as useful throttles to navigate between under- and overfitting.

Regularization, the train-test split, and cross-validation all have the goal of reducing overfitting. The problem with overfitting comes from using the same data both to fit a model and to estimate the model’s error in predicting new observations. In the next section, we provide further intuition for this idea.


1. Stan Lipovetsky and W. Michael Conklin (2015), “Predictor relative importance and matching regression parameters,” Journal of Applied Statistics, 42:5, 1017–1031.