15.9. Exercises

  • When we fit models to the Opportunity data, we actually removed several commuting zones: 34105, 34113, 34112, and 34106. These commuting zones were outliers in the data, since they had abnormally small AUMs given the values of their predictor variables. All four of these CZs are in the same state. Which one?

  • Even a few extreme outliers can harm our model’s fit to the data. To show this, fit another simple linear model that uses the fraction of people with a ≤15-minute commute to predict AUM, but include the outlier commuting zones (34105, 34113, 34112, and 34106) in the training data. How does including the outliers change the fitted model and its residual plot?

  • Let \( f_{\hat{\theta}}(X) \) be the \( n \)-dimensional vector of a linear model’s predictions after fitting, and let \( \epsilon = y - f_{\hat{\theta}}(X) \) be the \( n \)-dimensional vector of the residuals. Prove that \( f_{\hat{\theta}}(X) \cdot \epsilon = 0 \).

  • We derived that \( \hat{\theta} = (X^\top X)^{-1} X^\top y \). Construct a design matrix \( X \) where \( \hat{\theta} \) is undefined. Hint: this is the same as finding a matrix \( X \) where \( X^\top X \) is not invertible. What does this mean about \( \hat{\theta} \)?

  • Create a design matrix that uses the nine predictor variables we discussed in Section 15.3 and the one-hot encoded US Census regions. This design matrix should have 13 columns total. Then, fit a linear model to predict AUM using this design matrix. How does this model perform on the test set compared to the model without the US Census regions?

  • Remake the design matrix for one-hot encoding the city variable in the SF housing data, and this time drop the first column from the design matrix. To do this, simply add drop='first' in the call to OneHotEncoder. Then fit a model that includes the intercept term. Compare the coefficients from this fit to the earlier fit that did not have an intercept term and included all four city dummy variables. Show that the intercept matches the coefficient for Berkeley (the first city), and that the coefficients for the other cities, when added to the intercept, match the city coefficients fit in the non-intercept model.
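
The relationship described in the last exercise can be checked on toy data before trying it on the SF housing data. The following is a minimal sketch, not the book's solution: the prices are made up, the city names other than Berkeley are placeholders, and the dummy columns are built by hand with NumPy rather than with OneHotEncoder, so only the algebra of the two parameterizations is illustrated.

```python
import numpy as np

# Hypothetical toy data (NOT the SF housing data): a city label and a
# made-up price for each of ten sales.
cities = np.array(["Berkeley", "City B", "City C", "City D",
                   "Berkeley", "City B", "City C", "City D",
                   "Berkeley", "City D"])
price = np.array([1.2, 1.5, 2.1, 0.7, 1.0, 1.7, 1.9, 0.8, 1.1, 0.6])

# Sorted category names; "Berkeley" sorts first, so it is the dropped level.
levels = np.unique(cities)

# Full one-hot encoding: one indicator column per city, no intercept.
X_full = (cities[:, None] == levels[None, :]).astype(float)

# Drop the first dummy column and prepend a column of ones (the intercept).
X_drop = np.column_stack([np.ones(len(cities)), X_full[:, 1:]])

# Least-squares fits for both design matrices.
theta_full, *_ = np.linalg.lstsq(X_full, price, rcond=None)
theta_drop, *_ = np.linalg.lstsq(X_drop, price, rcond=None)

intercept, offsets = theta_drop[0], theta_drop[1:]

print("no-intercept coefficients:", dict(zip(levels, theta_full)))
print("intercept:", intercept)
print("intercept + offsets:", dict(zip(levels[1:], intercept + offsets)))
```

Both design matrices span the same column space, so the two fits make identical predictions; only the meaning of the coefficients changes. In the drop-first version, the intercept plays the role of the first city's coefficient and each remaining coefficient is an offset from it.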