18.6. Summary

In this case study, we demonstrated the different purposes of modeling: description, inference, and prediction. For description, we sought a simple, understandable model. We hand-crafted this model, beginning with our findings from the exploratory phase of the analysis. Every action we took to include a feature in the model, collapse categories, or transform a feature amounted to a decision we made while investigating the data.

In modeling a natural phenomenon such as the weight of a donkey, we would ideally make use of both physical and statistical models. In this case, the physical model represents a donkey as a cylinder. An inquisitive reader might have pointed out that we could have used this representation directly to estimate the weight of a donkey (cylinder) from its length and girth (since girth is the circumference \(2\pi r\)):

\[ weight \propto girth^2 \times length\]

This physical model suggests that the log-transformed weight is approximately linear in the log-transformed girth and length:

\[ \log(weight) \propto \log(girth) + \log(length)\]

Given this physical model, you might wonder why we did not use logarithmic or square transformations in our model. We leave you to investigate such a model in greater detail. Generally, though, when the range of measured values is small, the log function is roughly linear over that range. To keep our model simple, we chose not to make these transformations, given the strength of the statistical model seen in the high correlation between the girth and weight of the donkeys.
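If you want to explore the log-transformed model yourself, a minimal sketch looks like the following. The measurements and column names here are made up for illustration; they are not the actual values from the case study's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical donkey measurements (girth and length in cm, weight in kg).
donkeys = pd.DataFrame({
    "girth": [110, 115, 120, 125, 130],
    "length": [95, 98, 102, 106, 110],
    "weight": [130, 145, 160, 178, 195],
})

# The cylinder approximation suggests the model
#   log(weight) ~ log(girth) + log(length)
X_log = np.log(donkeys[["girth", "length"]])
y_log = np.log(donkeys["weight"])
model = LinearRegression().fit(X_log, y_log)

# The fitted coefficients play the role of the exponents on girth and length.
print(model.coef_, model.intercept_)
```

Comparing the test-set error of this model against the untransformed linear model is one way to check whether the extra complexity of the transformation pays off.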

Recall that we added categorical variables to our model using one-hot encoding, and that we dropped one dummy variable for each categorical feature we encoded. Actually, leaving in all the variables wouldn't change the model's predictions: the model would still have the same test set error. But that model would be overparameterized, which means that fitting the model multiple times could produce different values for \( \hat{\theta} \). This is problematic for inference, since it prevents us from having a useful and consistent interpretation of the model. When dropping a one-hot encoded variable, we chose to drop the central or most common category so that vets don't need to make adjustments to the model for common cases.
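The encoding-and-drop step can be sketched with `pandas`. The category names below are placeholders, not the actual levels used in the case study; the key idea is dropping the most common level so it becomes the baseline absorbed by the intercept.

```python
import pandas as pd

# Hypothetical body-condition categories for five donkeys; "moderate" is
# the most common level, so we drop it as the reference category.
bcs = pd.Series(["thin", "moderate", "moderate", "fat", "moderate"])

dummies = pd.get_dummies(bcs, prefix="bcs")
dummies = dummies.drop(columns="bcs_moderate")  # baseline: moderate

print(dummies.columns.tolist())  # → ['bcs_fat', 'bcs_thin']
```

With this choice, the coefficient on `bcs_fat` is read as an adjustment relative to a moderate-condition donkey, and a moderate donkey needs no adjustment at all.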

We did a lot of data dredging in this modeling exercise. We examined all possible models built from linear combinations of the numeric features, and we examined coefficients of dummy variables to decide whether to collapse categories. When we create models using an iterative approach like this, it is extremely important that we set aside data to assess the model. Evaluating the model on new data reassures us that the model we chose works well. The data that we set aside did not enter into any decision making when building the model so it gives us a good sense of how well the model works for making predictions.
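Setting aside the evaluation data before any of this dredging begins can be sketched as follows. The 80/20 split fraction and the fixed seed are our assumptions for illustration, not the book's exact choices.

```python
import numpy as np

# Randomly partition n observations into a training set and a held-out
# test set *before* any model selection happens.
rng = np.random.default_rng(42)
n = 100
indices = rng.permutation(n)

test_size = n // 5  # hold out 20% of the data
test_idx, train_idx = indices[:test_size], indices[test_size:]

# All feature selection, category collapsing, and coefficient inspection
# uses train_idx only; test_idx is touched exactly once at the end to
# report the chosen model's prediction error.
print(len(train_idx), len(test_idx))  # → 80 20
```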

Finally, this case study shows how fitting models is often a balance between simplicity and complexity, and a balance between physical and statistical models. As data scientists, we needed to make human judgment calls at each step in the analysis. In other words, modeling is both an art and a science.