15.7. Exercises

  • When we fit models to the Opportunity data, we actually removed several commuting zones: 34105, 34113, 34112, and 34106. These commuting zones were outliers in the data, since they had abnormally small AUMs for their corresponding predictor variables. All four of these CZs are in the same state. Which one?

  • Even a few extreme outliers can harm our model’s fit to the data. To show this, fit another simple linear model that uses the fraction with a ≤15-minute commute to predict AUM, but include the outlier commuting zones (34105, 34113, 34112, and 34106) in the training data. How does including the outliers change the model and residual plot?

  • Let \( f_{\hat{\theta}}(X) \) be the \( n \)-dimensional vector of a linear model’s predictions after fitting, and let \( \epsilon = y - f_{\hat{\theta}}(X) \) be the \( n \)-dimensional vector of the residuals. Prove that \( f_{\hat{\theta}}(X) \cdot \epsilon = 0 \).

  • We derived that \( \hat{\theta} = (X^\top X)^{-1} X^\top y \). Construct a design matrix \( X \) where \( \hat{\theta} \) is undefined. Hint: this is the same as finding a matrix \( X \) where \( X^\top X \) is not invertible. What does this mean about \( \hat{\theta} \)?

  • Create a design matrix that uses the nine predictor variables we discussed in Section 15.3 and the one-hot encoded US Census regions. This design matrix should have 13 columns total. Then, fit a linear model to predict AUM using this design matrix. How does this model perform on the test set compared to the model without the US Census regions?
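
For the design matrix exercise, one possible way to attach one-hot encoded categories to numeric predictors is sketched below. The toy data, the column names (`frac_short_commute`, `frac_single_parent`, `region`), and the helper `build_design_matrix` are all hypothetical stand-ins, not the Opportunity data or code from the chapter.

```python
import pandas as pd

# Hypothetical toy data: two numeric predictors and a region label.
# These names and values are made up for illustration only.
toy = pd.DataFrame({
    "frac_short_commute": [0.32, 0.45, 0.51, 0.28],
    "frac_single_parent": [0.21, 0.18, 0.25, 0.30],
    "region": ["South", "West", "Northeast", "Midwest"],
})


def build_design_matrix(df, numeric_cols, category_col):
    """Return the numeric columns followed by one 0/1 column per category."""
    one_hot = pd.get_dummies(df[category_col], dtype=float)
    return pd.concat([df[numeric_cols], one_hot], axis=1)


X = build_design_matrix(
    toy, ["frac_short_commute", "frac_single_parent"], "region"
)
print(X.shape)
print(list(X.columns))
```

With nine numeric predictors and four regions, the same pattern would give the 13 columns the exercise asks for. Note that this sketch does not add a separate intercept column.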
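
Before attempting the proof that predictions and residuals are orthogonal, it can help to check the claim numerically. The sketch below uses synthetic data (an assumption, not the Opportunity data) and fits by least squares with `np.linalg.lstsq`. It also computes the rank of \( X \), which is one way to test a candidate design matrix in the invertibility exercise: \( X^\top X \) is invertible exactly when \( X \) has full column rank. A numerical check is not a proof.

```python
import numpy as np

# Synthetic data, assumed for illustration: an intercept column plus
# two random features, and a random outcome vector.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

# Least-squares fit; lstsq returns the coefficients as its first output.
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

predictions = X @ theta_hat
residuals = y - predictions

# The dot product should be zero up to floating-point error.
dot = predictions @ residuals
print(f"predictions . residuals = {dot:.2e}")

# X has full column rank when its rank equals its number of columns.
rank = np.linalg.matrix_rank(X)
print(f"rank of X = {rank}, columns = {X.shape[1]}")
```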