18.5. Modeling a Donkey’s Weight

We want to build a simple model for predicting the weight of a donkey. The model should be easy for a vet to implement in the field with only a hand calculator. The model should also be easy to interpret.

In the real world, we’ll also want the model to depend on the vet’s situation—for example, whether they’re prescribing an antibiotic or an anesthetic. For brevity, we’ll only consider the case of prescribing an anesthetic; in the exercises, you have the chance to create a model for prescribing antibiotics.

Our first step is to choose a loss function for prescribing anesthetic.

18.5.1. A Loss Function for Prescribing Anesthetics

For anesthetics, overdosing can be much worse than underdosing. It’s not hard for a vet to see when a donkey has too little anesthetic (it’ll complain), so the vet can give the donkey a bit more. But, giving too much anesthetic can cause serious issues and could even be fatal. Because of this, we want an asymmetric loss function: it should have bigger losses for overestimating the weight compared to underestimating the weight.

We’ve created a loss function anes_loss(x) with this in mind. The plot below shows the asymmetry as a function of the relative error: \(100(y - \hat{y})/\hat{y}\), where \(y\) is the true weight and \(\hat{y}\) is the prediction. With this convention, overestimates have negative relative error, so -10 on the x-axis corresponds to overestimating the weight by 10%.

def anes_loss(x):
    w = (x >= 0) + 3 * (x < 0)
    return np.square(x) * w
xs = np.linspace(-40, 40, 500)
loss = anes_loss(xs)

fig = px.line(x=xs, y=loss, title='Modified Quadratic Loss',
              width=350, height=250)

margin(fig, t=30)
xlabel(fig, 'relative error (%)')
ylabel(fig, 'loss')
[Figure: Modified Quadratic Loss, plotting loss against relative error (%)]
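To make the asymmetry concrete, here is a small numerical check (restating anes_loss so the snippet is self-contained): a 10% overestimate costs three times as much as a 10% underestimate.

```python
import numpy as np

def anes_loss(x):
    # Weight of 1 for underestimates (positive relative error),
    # weight of 3 for overestimates (negative relative error)
    w = (x >= 0) + 3 * (x < 0)
    return np.square(x) * w

print(anes_loss(10.0))   # underestimate by 10%: loss of 100.0
print(anes_loss(-10.0))  # overestimate by 10%:  loss of 300.0
```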

Next, let’s fit a simple linear model using this loss function.

18.5.2. Fitting a Simple Linear Model

We saw that, of the donkeys’ measurements in our training set, girth has the highest correlation with weight. So, we fit a model of the form:

\[ f_{\theta}(x) = \theta_0 + \theta_1 \cdot \text{Girth} \]

To find the best fit \(\theta_0\) and \(\theta_1\), we’ll first create a design matrix \( X \) that has Girth and an intercept term. We’ll also create the \(y\) vector of observed donkey weights.

X = train_set.assign(intr=1)[['intr', 'Girth']]
y = train_set['Weight']
X
      intr  Girth
230      1    116
74       1    117
354      1    123
...    ...    ...
157      1    123
41       1    103
381      1    106

433 rows × 2 columns

Now, we want to find the \(\theta_0\) and \(\theta_1\) that minimize the average anes_loss over the training set. To do this, we could use calculus as we did in Section 15.2. To make things simpler, this time we’ll instead use the minimize function from the scipy.optimize module, which numerically searches for the \( \theta_0 \) and \( \theta_1 \) that minimize the training loss.

from scipy.optimize import minimize

def training_loss(X, y):
    def loss(theta):
        predicted = X @ theta
        return np.mean(anes_loss(100 * (y - predicted) / predicted))
    return loss

results = minimize(training_loss(X, y), np.ones(2))
theta_hat = results['x']
print('After fitting:')
print(f'θ₀ = {theta_hat[0]:>7.2f}')
print(f'θ₁ = {theta_hat[1]:>7.2f}')
After fitting:
θ₀ = -218.51
θ₁ =    3.16
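As a quick sanity check, we can evaluate the fitted line by hand for the first donkey in the design matrix shown earlier (girth of 116 cm):

```python
# Sanity check of the fitted line by hand, using the printed
# coefficients and the girth from the first row of X shown above
theta_0, theta_1 = -218.51, 3.16
pred = theta_0 + theta_1 * 116
print(pred)  # about 148 kg
```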

Let’s see how this simple model does. We’ll use the model to predict the donkey weights on our training set, then look at the residual plot. The plot below shows the model error as a percentage of the predicted value.

predicted = X @ theta_hat
resids = 100 * (y - predicted) / predicted

resid = pd.DataFrame({'predicted weight': predicted, '% error': resids})
px.scatter(resid, x='predicted weight', y='% error',
           width=350, height=250)
[Figure: scatter plot of % error against predicted weight for the training set]

With this simple model, some of the predictions are off by 20 to 30 percent. Let’s see if a slightly more complicated model improves the predictions.

18.5.3. Fitting a Multiple Linear Model

Let’s incorporate the other numeric variables into our model. We have three numeric variables: Girth, Length, and Height. There are seven nonempty subsets of these variables:

import itertools

numeric_vars = ['Girth', 'Length', 'Height']

models = [list(subset)
          for n in [1, 2, 3]
          for subset in itertools.combinations(numeric_vars, n)]
models
[['Girth'],
 ['Length'],
 ['Height'],
 ['Girth', 'Length'],
 ['Girth', 'Height'],
 ['Length', 'Height'],
 ['Girth', 'Length', 'Height']]

We’ll fit one model for each of these variable combinations. Then, we can look at how well each model does on the training set.

def training_error(model):
    # Fit a model with the given explanatory variables, then return
    # its mean training loss
    X = train_set.assign(intr=1)[['intr', *model]]
    theta_hat = minimize(training_loss(X, y), np.ones(X.shape[1]))['x']
    predicted = X @ theta_hat
    return np.mean(anes_loss(100 * (y - predicted) / predicted))

model_risks = [
    training_error(model)
    for model in models
]
                     model  mean_training_error
0                  [Girth]                94.36
1                 [Length]               200.55
2                 [Height]               268.88
3          [Girth, Length]                65.65
4          [Girth, Height]                86.18
5         [Length, Height]               151.15
6  [Girth, Length, Height]                63.44
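The loop above computes model_risks but doesn’t show how the summary table was displayed; a minimal sketch of one way to assemble it, restating the values from the table so the snippet runs on its own, is:

```python
import pandas as pd

# Assemble the model comparison table from `models` and `model_risks`;
# the values are restated here from the table above
models = [['Girth'], ['Length'], ['Height'],
          ['Girth', 'Length'], ['Girth', 'Height'],
          ['Length', 'Height'], ['Girth', 'Length', 'Height']]
model_risks = [94.36, 200.55, 268.88, 65.65, 86.18, 151.15, 63.44]
summary = pd.DataFrame({'model': models, 'mean_training_error': model_risks})
print(summary)
```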

As we stated earlier, girth is the single best predictor of weight. However, the combination of girth and length has a considerably smaller average loss than girth alone, and this two-variable model is nearly as good as the model that includes all three. Since we want a simple model, we select the two-variable model over the three-variable one.

Next, we’ll use feature engineering to incorporate categorical variables, which should further improve our model.

18.5.4. Bringing Qualitative Features into the Model

In our exploratory analysis, we found that the box plots of weight by the donkey’s body condition score and age suggested that these variables contain useful information for predicting weight. Since body condition and age are categorical variables, we’ll use a one-hot encoding transform, as we explained in Section 15.5.

Using a one-hot encoding lets the model adjust its predictions for each category. Our current model includes the numeric variables girth and length:

\[ f_{\theta}(x) = \theta_0 + \theta_1 \cdot \text{Girth} + \theta_2 \cdot \text{Length} \]

Let’s say that the Age feature consists of three categories: “under 2”, “2 to 5”, and “over 5”. After applying a one-hot encoding, we’ll create 0-1 features for each category. This gives us the model:

\[\begin{split} \begin{aligned} f_{\theta}(x) = \theta_0 & + \theta_1 \cdot \text{Girth } + \theta_2 \cdot \text{Length } \\ & + \theta_3 \cdot \text{Age<2 } + \theta_4 \cdot \text{Age2-5 } \end{aligned} \end{split}\]

In this model, Age<2 is 1 for a donkey younger than 2 and 0 otherwise. Similarly, Age2-5 is 1 for a donkey between 2 and 5 years old and 0 otherwise.
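As a small illustration of this encoding (on toy data, not the donkey table), pd.get_dummies creates the 0-1 columns, and we drop the baseline category:

```python
import pandas as pd

# Toy example of one-hot encoding an Age column with three categories,
# dropping "over 5" as the baseline category
ages = pd.DataFrame({'Age': ['under 2', '2 to 5', 'over 5', '2 to 5']})
dummies = (pd.get_dummies(ages, columns=['Age'])
           .drop(columns=['Age_over 5'])
           .astype(int))
print(dummies)
```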

We can think of this model as fitting three linear models that are identical except for the size of the constant, since the model is equivalent to:

\[\begin{split} \begin{aligned} (\theta_0 + \theta_3) + \theta_1 \cdot \text{Girth} + \theta_2 \cdot \text{Length} & ~~~~~~~~~~\text{for a donkey under 2}\\ (\theta_0 + \theta_4) + \theta_1 \cdot \text{Girth} + \theta_2 \cdot \text{Length} & ~~~~~~~~~~\text{for a donkey between 2 and 5}\\ \theta_0 + \theta_1 \cdot \text{Girth} + \theta_2 \cdot \text{Length} & ~~~~~~~~~~\text{for a donkey over 5} \end{aligned} \end{split}\]

Note

As we discussed in the previous section on one-hot encoding, the one-hot encoded variables are also called dummy variables. When we make dummy variables, we leave out one category for each variable; otherwise, the dummies for a variable would sum to the intercept column and the coefficients would not be uniquely determined. The left-out category serves as a baseline, which also keeps the model interpretable. In this case, we didn’t include a dummy variable for donkeys older than 5.

Now, let’s apply a one-hot encoding to all three of our categorical variables (BCS, Age, Sex). We can then see which categorical variables improve on our previous two-variable model.

X_one_hot = (
    train_set.assign(intr=1)
    [['intr', 'Length', 'Girth', 'BCS', 'Age', 'Sex']]
    .pipe(pd.get_dummies, columns=['BCS', 'Age', 'Sex'])
    .drop(columns=['BCS_3.0', 'Age_5-10', 'Sex_female'])
)
X_one_hot
     intr  Length  Girth  BCS_1.5  ...  Age_<2  Age_>20  Sex_gelding  Sex_stallion
230     1     101    116        0  ...       0        0            0             1
74      1      92    117        0  ...       0        0            0             1
354     1     103    123        0  ...       0        1            0             0
...   ...     ...    ...      ...  ...     ...      ...          ...           ...
157     1      93    123        0  ...       0        0            0             1
41      1      89    103        0  ...       1        0            0             0
381     1      86    106        0  ...       0        0            0             0

433 rows × 15 columns

We dropped one level for each categorical variable. BCS, Age, and Sex have six, six, and three categories, respectively, so after dropping one category from each, we added 12 dummy variables to the design matrix for a total of 15 columns, including the intercept term intr, Girth, and Length.
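The column count can be verified with quick arithmetic:

```python
# Design-matrix width: 1 intercept + 2 numeric variables, plus one dummy
# per category minus the dropped baseline for each of BCS (6 categories),
# Age (6), and Sex (3)
n_cols = 1 + 2 + (6 - 1) + (6 - 1) + (3 - 1)
print(n_cols)  # 15
```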

We fit the model that includes the dummies from all three categorical features, along with Girth and Length.

results = minimize(training_loss(X_one_hot, y), np.ones(X_one_hot.shape[1]))
theta_hat = results['x']
y_pred = X_one_hot @ theta_hat
training_error = np.mean(anes_loss(100 * (y - y_pred) / y_pred))
print(f'Training error: {training_error:.2f}')
Training error: 51.47

This model does better than the previous model with only Girth and Length. Now, let’s try to make this model simpler while keeping its accuracy. To do this, we’ll look at the coefficients for each of the dummy variables, as shown in the plot below.

bcs_vars = ['BCS_1.5', 'BCS_2.0', 'BCS_2.5', 'BCS_3.5', 'BCS_4.0']
age_vars = ['Age_<2', 'Age_2-5', 'Age_10-15', 'Age_15-20', 'Age_>20']
sex_vars = ['Sex_gelding', 'Sex_stallion']

thetas = (pd.DataFrame({'var': X_one_hot.columns, 'theta_hat': theta_hat})
          .set_index('var'))

f1 = px.scatter(thetas.loc[bcs_vars].reset_index(), x='var', y='theta_hat')
f2 = px.scatter(thetas.loc[sex_vars].reset_index(), x='var', y='theta_hat')
f3 = px.scatter(thetas.loc[age_vars].reset_index(), x='var', y='theta_hat')

fig = plots_in_row([f1, f2, f3],
                   width=700,
                   shared_yaxes=True,
                   column_widths=[0.3, 0.2, 0.5])

ylabel(fig, 'theta_hat', row=1, col=1)
fig.update_xaxes(tickangle=45)
fig.update_traces(marker_size=10)
[Figure: fitted coefficients (theta_hat) for the BCS, Sex, and Age dummy variables]

The coefficients confirm what we saw in the box plots. The coefficients for the sex of the donkey are close to zero, meaning that knowing the sex doesn’t really change the model. We also see that combining the age categories for donkeys over 5 will simplify the model without losing much. Since there are so few donkeys with a body condition score of 1.5, we are inclined to collapse that category as well.

We update the design matrix in view of these findings and refit the simpler model.

def combine_bcs(X):
    new_bcs_2 = X['BCS_2.0'] + X['BCS_1.5']
    return X.assign(**{'BCS_2.0': new_bcs_2}).drop(columns=['BCS_1.5'])

def combine_age_and_sex(X):
    return X.drop(columns=['Age_10-15', 'Age_15-20', 'Age_>20',
                           'Sex_gelding', 'Sex_stallion'])

X_one_hot_simple = (
    X_one_hot.pipe(combine_bcs)
    .pipe(combine_age_and_sex)
)
results = minimize(training_loss(X_one_hot_simple, y),
                   np.ones(X_one_hot_simple.shape[1]))
theta_hat = results['x']
y_pred = X_one_hot_simple @ theta_hat
training_error = np.mean(anes_loss(100 * (y - y_pred) / y_pred))
print(f'Training error: {training_error:.2f}')
Training error: 53.20

The average error is close enough to that of the more complex model for us to settle on this simpler one. Let’s display the coefficients and summarize the model.

       var  theta_hat
0     intr    -175.25
1   Length       1.01
2    Girth       1.97
3  BCS_2.0      -6.33
4  BCS_2.5      -5.11
5  BCS_3.5       7.36
6  BCS_4.0      20.05
7  Age_2-5      -3.47
8   Age_<2      -6.49

Our model is roughly:

\[ \text{weight} \approx -175 + \text{Length} + 2 \times \text{Girth} \]

After this initial approximation, we use the categorical features to make some corrections: subtract 6 kg for a donkey with a BCS of 2 or less; subtract 5 kg for a BCS of 2.5; add 7 kg for a BCS of 3.5; and add 20 kg for a BCS of 4. In addition, we also subtract 6.5 kg for a donkey under 2 years old, and subtract 3.5 kg for a donkey between 2 and 5 years old.
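Collected into code, these rules of thumb might look like the following sketch. The helper rough_weight is hypothetical and uses the rounded corrections above, not the exact fitted coefficients:

```python
# Hypothetical field-calculator version of the model, using the rounded
# coefficients and corrections described in the text
def rough_weight(length, girth, bcs=3.0, age='5-10'):
    """Rough estimate of a donkey's weight in kg."""
    weight = -175 + length + 2 * girth
    # Body condition score corrections
    if bcs <= 2.0:
        weight -= 6
    elif bcs == 2.5:
        weight -= 5
    elif bcs == 3.5:
        weight += 7
    elif bcs >= 4.0:
        weight += 20
    # Age corrections for young donkeys
    if age == '<2':
        weight -= 6.5
    elif age == '2-5':
        weight -= 3.5
    return weight

print(rough_weight(length=100, girth=120))  # -175 + 100 + 240 = 165
```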

This model seems quite simple to implement. Let’s see how well the model does in predicting the weights of the donkeys in the test set.

18.5.5. Model Assessment

Remember that we put aside 20% of our data before exploring and modeling with the remaining 80%. We are now ready to apply what we have learned from the training set to the test set. That is, we will use this fitted model to predict the weights of the donkeys in the test set. To do this, we need to prepare the test set. Our model uses the girth and length of the donkey, as well as dummy variables for the donkey’s age and body condition score.

y_test = test_set['Weight']

X_test = (
    test_set.assign(intr=1)
    [['intr', 'Length', 'Girth', 'BCS', 'Age', 'Sex']]
    .pipe(pd.get_dummies, columns=['BCS', 'Age', 'Sex'])
    .drop(columns=['BCS_3.0', 'Age_5-10', 'Sex_female'])
    .pipe(combine_bcs)
    .pipe(combine_age_and_sex)
)

We consolidated all of our manipulations of the design matrix to create the final version that we settled on in our modeling with the training set. Now we are ready to use the \(\theta\)s that we fitted with the training set to make weight predictions for those donkeys in the test set.
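One caveat worth noting: pd.get_dummies creates columns only for categories present in the data, so if the test set lacked a category seen in training, the two design matrices would not align. A defensive pattern, sketched here on toy frames rather than the donkey data, is to reindex the test design matrix to the training columns:

```python
import pandas as pd

# Toy frames illustrating the gotcha: the test set is missing the
# '<2' category, so get_dummies produces one fewer column
train_X = pd.get_dummies(pd.DataFrame({'Age': ['<2', '2-5', '5-10']}))
test_X = pd.get_dummies(pd.DataFrame({'Age': ['2-5', '5-10']}))

# Reindexing to the training columns restores the missing dummy,
# filling it with zeros
test_X = test_X.reindex(columns=train_X.columns, fill_value=0).astype(int)
print(test_X)
```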

y_pred_test = X_test @ theta_hat
test_set_error = 100 * (y_test - y_pred_test) / y_pred_test
fig = px.scatter(x=y_pred_test, y=test_set_error,
                 title='Test Set Predictions',
                 width=350, height=250)
fig.add_hline(10, line_width=3, line_color='green')
fig.add_hline(-10, line_width=3, line_color='green')
margin(fig, t=30)
xlabel(fig, 'Predicted Weight')
ylabel(fig, 'Relative Error (%)')
[Figure: Test Set Predictions, relative error (%) against predicted weight, with horizontal lines at ±10%]

Remember that positive relative error means underestimating weight. From the plot above, we see that all but six of the test set weights are within 10% of the predictions, and five of the errors that exceed 10% err on the side of underestimation. This makes sense given that our loss function penalized overestimation more. The scatter plot below shows the actual and predicted values.

fig = px.scatter(y=y_test, x=y_pred_test, 
                 title="Test Set Predictions with 10% Error Lines",
                 width=350, height=350)

fig.add_trace(go.Scatter(x=[60, 200], y=[60, 200], name='0% error',
                         mode='lines',
                         marker=dict(color='green')))
fig.add_trace(go.Scatter(x=[60, 200], y=[66, 220], name='10% error',
                         mode='lines',
                         marker=dict(color='red')))
fig.add_trace(go.Scatter(x=[60, 200], y=[54, 180], name='-10% error',
                         mode='lines',
                         marker=dict(color='red')))

fig.update_layout(showlegend=False)
xlabel(fig, "Predicted Weight")
ylabel(fig, "Actual Weight")
margin(fig, t=30)
[Figure: Test Set Predictions with 10% Error Lines, actual weight against predicted weight]

We’ve accomplished our goal! We have a model that uses easy-to-get measurements, is simple enough to explain on a single page, and makes predictions within 10% of the actual donkey weight. In the next section, we’ll summarize our case study and reflect on our model.