18.4. Exploring

Let’s look at the features in our dataframe for shapes and relationships that will help us make transformations and models (Chapter 10). We’ll start by looking at how the categorical features Age, Sex, and BCS relate to donkey weights.

f1 = px.box(train_set, x="Age", y="Weight", 
            category_orders = {"Age":['<2', '2-5', '5-10', '10-15', '15-20', '>20']})
f2 = px.box(train_set, x="Sex", y="Weight")

fig = left_right(f1, f2, column_widths=[0.7, 0.3])

fig.update_xaxes(title='Age', row=1, col=1)
fig.update_xaxes(title='Sex', row=1, col=2)
fig.update_yaxes(title='Weight', row=1, col=1)

[Figure: side-by-side box plots of Weight by Age (left) and by Sex (right)]

fig = px.box(train_set, x="BCS", y="Weight", points="all",
             width=700, height=250)
fig

[Figure: box plot of Weight by BCS, with the individual points overlaid]

Notice that we plotted the points as well as the boxes for the body condition score. We saw earlier that only a handful of observations have a score of 1.5, so we don’t want to read too much into a box plot of a few data points (Chapter 11). The median weight appears to increase with the body condition score, but not in a simple linear fashion. On the other hand, the weight distributions for the three sex categories appear roughly the same. As for age, once a donkey reaches five years, the distribution of weight doesn’t seem to change much, but donkeys under 2 and donkeys between 2 and 5 years old generally weigh less.
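
To check numerically what the box plots suggest, one option is to tabulate the number of donkeys and the median weight at each body condition score. The sketch below uses a small made-up dataframe (`toy`) and a hypothetical helper, `summarize_by_group`; neither comes from the study. With the real data, we would pass `train_set` instead.

```python
import pandas as pd

def summarize_by_group(df, group_col, value_col):
    """Return the count and median of value_col within each level of group_col."""
    return (df.groupby(group_col)[value_col]
              .agg(['count', 'median'])
              .sort_index())

# Made-up values for illustration only; these are NOT the donkey measurements.
toy = pd.DataFrame({
    'BCS':    [1.5, 2.5, 2.5, 3.0, 3.0, 3.0, 3.5, 3.5],
    'Weight': [100, 130, 140, 150, 155, 160, 170, 180],
})

summary = summarize_by_group(toy, 'BCS', 'Weight')
print(summary)
```

A group with a very small count, like the single row at 1.5 in this toy table, is a signal to read its median with caution.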

Next, let’s examine the quantitative variables. We plot all pairs of quantitative variables in the scatter plot matrix below.

import plotly.figure_factory as ff

train_num = train_set[['Weight', 'Length', 'Girth', 'Height']]

fig = ff.create_scatterplotmatrix(train_num,
                                  diag='histogram', 
                                  title='',
                                  height=600, width=600)
fig.update_traces(marker_color='#1F77B4')
fig.update_traces(marker_size=3, selector={'type': 'scatter'})

[Figure: scatter plot matrix of Weight, Length, Girth, and Height, with histograms on the diagonal]

The height, length, and girth of donkeys all appear linearly associated with weight, and with each other. This is not too surprising; given one of the donkey’s dimensions, we should have a good guess about the other dimensions. Girth appears most highly correlated with weight, and this is confirmed by the correlation coefficient matrix below.

train_num.corr()
        Weight  Length  Girth  Height
Weight    1.00    0.78   0.90    0.71
Length    0.78    1.00   0.66    0.58
Girth     0.90    0.66   1.00    0.70
Height    0.71    0.58   0.70    1.00
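
As a quick check on the ranking, we can pull out the Weight column of the correlation matrix and sort it. The sketch below re-enters the rounded coefficients printed above rather than recomputing them; with the dataframe in hand, `train_num.corr()` would take the place of the hand-built `corr`.

```python
import pandas as pd

features = ['Weight', 'Length', 'Girth', 'Height']

# Rounded coefficients copied from the correlation matrix above.
corr = pd.DataFrame(
    [[1.00, 0.78, 0.90, 0.71],
     [0.78, 1.00, 0.66, 0.58],
     [0.90, 0.66, 1.00, 0.70],
     [0.71, 0.58, 0.70, 1.00]],
    index=features, columns=features)

# Correlation of each measurement with Weight, strongest first.
with_weight = corr['Weight'].drop('Weight').sort_values(ascending=False)
print(with_weight)
```

Sorting this way puts Girth first, followed by Length and then Height.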

Our explorations uncovered several aspects of the data. We found that the donkey’s girth, length, and height all have linear associations with weight and with each other, and that girth has the strongest linear relationship with weight. We also observed that the body condition score has a positive association with weight, that the sex of the donkey does not appear related to weight, and that age does not appear related to weight once a donkey reaches five years. In the next section, we’ll use these findings to build our model.