17.3. Cross-Validation

We can use the train-test paradigm to help us choose a model. The idea is to further divide the train set into separate parts, fit the model on one part, and evaluate it on another. This approach is called cross-validation. We describe one version of it, called \(k\)-fold cross-validation. Figure 17.3 shows the idea behind this division of the data.

../../_images/CVDiagram.png

Fig. 17.3 Five-fold Cross-Validation. The train set is divided into 5 parts, which are used in turn to validate models built on the remainder of the data.

Cross-validation can help select the general form of a model. By this we mean the degree of the polynomial, the number of features in the model, or a cut-off for a regularization penalty (see the next section). The basic steps behind \(k\)-fold cross-validation go as follows.

  • Divide the train set into \(k\) parts of roughly the same size; each of these parts is called a fold. Use the same technique that was used to create the train and test sets to make the folds. Typically, we divide the data at random.

  • Set one fold aside to act as a validation set.

    • Fit all models on the remainder of the train data (the training data less the particular fold).

    • Use the fold you set aside to evaluate all of these models.

  • Repeat this process \(k-1\) more times, where each time you set aside one fold, use the rest of the train set to fit models, and evaluate them on the fold that was set aside.

  • Combine the error in fitting each model across the folds, and choose the model with the smallest error.

These fitted models will not have identical coefficients across folds. As an example, when we fit a polynomial of, say, degree 3, we average the MSE across the \(k\) folds to get a single error value for degree-3 polynomials. Then we do the same for polynomials of degree 4, and so on. We compare these MSEs and choose the degree with the lowest MSE. The actual coefficients for the \(x\), \(x^2\), and \(x^3\) terms in the cubic polynomial need not be the same in each fit. Once the degree is selected, we refit the model using all of the training data and evaluate it on the test set. (We haven’t used the test set in any of the earlier steps to select the model.)
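To make these steps concrete, here is a minimal sketch of choosing a polynomial degree with cross-validation. It uses scikit-learn's cross_val_score in place of the manual fold loop we write later in this section, and it assumes a feature table X_train and response y_train like the ones we create below; the fold count and candidate degrees are illustrative choices, not part of the example that follows.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Average the validation MSE across folds for each candidate degree
cv_mse = {}
for degree in range(1, 7):
    poly_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(poly_model, X_train, y_train,
                             scoring='neg_mean_squared_error', cv=kf)
    cv_mse[degree] = -scores.mean()

# Choose the degree with the smallest cross-validated MSE,
# then refit on the full train set before touching the test set
best_degree = min(cv_mse, key=cv_mse.get)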

Typically, we use between 5 and 10 folds. Another popular choice puts one observation in each fold. This special case is called leave-one-out cross-validation. Its popularity stems from the simplicity of adjusting a least squares fit to have one fewer observation.
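For reference, scikit-learn implements this special case as a splitter. A minimal sketch, assuming a feature table X with one row per observation:

from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
# One split per observation: each validation fold holds a single row
for train_idx, valid_idx in loo.split(X):
    pass  # fit on the rows in train_idx, evaluate on the single row in valid_idx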

Generally, \(k\)-fold cross-validation takes some computation time since we typically have to refit each model from scratch for each fold. The scikit-learn library provides a convenient sklearn.model_selection.KFold class to implement \(k\)-fold cross-validation. We provide an example of \(k\)-fold cross-validation to give an idea of how it works.

17.3.1. Example: Fitting a Bent Line Model with Cross-validation

We demonstrate \(k\)-fold cross-validation on the gas consumption example. However, this time we fit a different set of models than polynomials. The original scatter plot of the data looks like the points fall along two connected line segments. In cold temperatures the relationship between gas consumption and temperature looks roughly linear with a negative slope of about \(-4\) cubic ft per degree, and in warmer months the relationship appears nearly flat. So rather than fit a polynomial, we can fit a bent line to the data.

Let’s start by fitting a line with a bend at 65 degrees. To do this, we create a second feature that enables the points with temperatures above 65 to have a different slope. The model is

\[ y = \theta_0 + \theta_1x + \theta_2(x-65)^+ \]

Here the \( (~)^+ \) stands for “positive part” so when \( x \) is less than 65 it evaluates to 0, and when \( x \) is 65 or greater, it is just \( x-65 \).

y = heat_df[['ccf']]
X = heat_df[['temp']]

# Positive-part feature: 0 below 65 degrees, (temp - 65) at or above 65
X['temp65p'] = (X['temp'] - 65) * (X['temp'] >= 65)
bend = LinearRegression().fit(X, y)
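The fitted attributes map onto the model above: bend.intercept_ holds \(\theta_0\), and bend.coef_ holds \(\theta_1\) and \(\theta_2\) for the temp and temp65p features (the actual numbers depend on the data, so we don't show them here).

# theta_0 (intercept), then [theta_1, theta_2] for temp and temp65p
print(bend.intercept_, bend.coef_)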

Let’s overlay this fitted “curve” on the scatter plot to see how well it captures the shape of the data.

# Compute fitted values on sorted temperatures so the line plots smoothly
Xs = X.sort_values('temp')
y_hats = bend.predict(Xs).flatten()

fig = go.Figure()

fig.add_trace(
    go.Scatter(x=heat_df['temp'], y=heat_df['ccf'], mode='markers'))

fig.add_trace(
    go.Scatter(x=Xs['temp'], y=y_hats))

fig.update_layout(showlegend=False, width=350, height=250,
                  xaxis_title='Temperature (Fahrenheit)',
                  yaxis_title='Gas (cubic ft)')

fig.show()
../../_images/ms_cv_9_0.svg

This model appears to fit the data much better than the earlier polynomials. There are many possible bent-line models: the line might bend at 55 degrees, or 60 degrees, and so on. We can use \(k\)-fold cross-validation to choose the temperature value at which the line bends. We consider models with bends at \( 40, 41, 42, \ldots, 69 \) degrees. For each of these, we need to create the additional feature that enables the line to bend there.

# Create a positive-part feature for each candidate bend from 40 to 69 degrees
for i in np.arange(40, 70, 1):
    heat_df['temp' + str(i) + 'p'] = (heat_df['temp'] - i) * (heat_df['temp'] >= i)
 
heat_df
    temp  ccf  temp40p  temp41p  ...  temp66p  temp67p  temp68p  temp69p
0     29  166        0        0  ...        0        0        0        0
1     31  179        0        0  ...        0        0        0        0
2     15  224        0        0  ...        0        0        0        0
..   ...  ...      ...      ...  ...      ...      ...      ...      ...
96    76   11       36       35  ...       10        9        8        7
97    55   32       15       14  ...        0        0        0        0
98    39   91        0        0  ...        0        0        0        0

97 rows × 32 columns

The first step in cross-validation is to create our train and test sets. We choose 22 observations at random to be placed in the test set. That leaves 75 for the train set.

y = heat_df[['ccf']]
X = heat_df.drop(['ccf'], axis=1)

test_size = 22

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_size, random_state=0)


print(f'  Training set size: {len(X_train)}')
print(f'      Test set size: {len(X_test)}')
  Training set size: 75
      Test set size: 22

Now we can divide the train set into folds. We use three folds so that we have 25 observations in each fold. For each fold, we fit 30 models, one for each bend in the line.

from sklearn.model_selection import KFold

kf = KFold(n_splits=3, shuffle=True, random_state=66)

validation_errors = np.zeros((3, 30))

for i, (train_idx, valid_idx) in enumerate(kf.split(X_train)):

    # Split the train set into this fold's training and validation pieces
    split_X_train, split_X_valid = X_train.iloc[train_idx, :], X_train.iloc[valid_idx, :]
    split_Y_train, split_Y_valid = y_train.iloc[train_idx, :], y_train.iloc[valid_idx, :]

    # Fit one model per candidate bend (columns 1 through 30 hold the bend features)
    bent_models = [LinearRegression().fit(split_X_train.iloc[:, [0, j]], split_Y_train)
                   for j in range(1, 31)]

    # Predict on the validation fold using the matching bend feature
    bent_preds = [bent_models[j].predict(split_X_valid.iloc[:, [0, j + 1]])
                  for j in range(30)]

    # Validation MSE for each of the 30 candidate bends on this fold
    error_bend = [mean_squared_error(split_Y_valid, bent_preds[j].flatten())
                  for j in range(30)]

    validation_errors[i, :] = error_bend

Then, we combine the validation errors across the three folds and plot them against the location of the bend.

# Average validation MSE across the three folds, one value per candidate bend
totals = validation_errors.mean(axis=0)
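The plotting code itself is not shown in the text; a minimal plotly sketch that would produce a figure like the one below is:

bends = np.arange(40, 70, 1)

fig = go.Figure()
fig.add_trace(go.Scatter(x=bends, y=totals, mode='lines+markers'))
fig.update_layout(showlegend=False, width=350, height=250,
                  xaxis_title='Bend location (degrees Fahrenheit)',
                  yaxis_title='Validation MSE')
fig.show()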
../../_images/ms_cv_18_0.svg

The MSE looks quite flat for 57, 58, 59, and 60 degrees. The minimizing value occurs at 58 degrees, so we choose that particular model. To assess this model with the test set, we first fit it using the entire train set.

bend_final = LinearRegression().fit(X_train.loc[:, ["temp", "temp58p"]], y_train)

Then we use the fitted model to predict gas consumption for the test set.

y_pred_test = bend_final.predict(X_test.loc[:, ["temp", "temp58p"]])

mean_squared_error(y_test, y_pred_test.flatten())
82.68661029815688

Let’s overlay the bent-line fit on the scatter plot and examine the residuals to get a better idea as to the quality of the fit.
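The code behind the next figure is not shown; a minimal sketch of the overlay and residuals, reusing bend_final and the full feature table X and response y from above, might look like this:

# Sort by temperature so the fitted line draws smoothly
Xs = X.sort_values('temp')
line_hat = bend_final.predict(Xs.loc[:, ['temp', 'temp58p']]).flatten()

fig = go.Figure()
fig.add_trace(go.Scatter(x=X['temp'], y=y['ccf'], mode='markers'))
fig.add_trace(go.Scatter(x=Xs['temp'], y=line_hat))
fig.update_layout(showlegend=False, width=350, height=250,
                  xaxis_title='Temperature (Fahrenheit)',
                  yaxis_title='Gas (cubic ft)')
fig.show()

# Residuals from the same model
residuals = y['ccf'] - bend_final.predict(X.loc[:, ['temp', 'temp58p']]).flatten()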

../../_images/ms_cv_24_0.svg

The fitted curve looks reasonable, and the size of the residuals is much smaller than those from the polynomial fit.

Note that the scikit-learn library has a cross_val_predict function to automatically perform cross-validation, so we don’t have to break the data into training and validation sets ourselves.
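As a rough sketch of that interface, reusing the train set and the 58-degree bend feature from above, cross_val_predict returns out-of-fold predictions, from which a cross-validated MSE follows in one step:

from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions for the 58-degree bent-line model on the train set
oof_preds = cross_val_predict(LinearRegression(),
                              X_train.loc[:, ['temp', 'temp58p']],
                              y_train, cv=3)
mean_squared_error(y_train, oof_preds)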

Using cross-validation to manage model complexity has a couple of critical limitations: typically it requires the complexity to vary discretely, and there may not be a natural way to order the models. Rather than changing the dimensions of a sequence of models, we can instead fit a large model and apply constraints on the size of the coefficients. This notion is called regularization and is the topic of the next section.