# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hiding content

import sys
import os
if not any(path.endswith('textbook') for path in sys.path):
    sys.path.append(os.path.abspath('../../..'))
from textbook_utils import *

18.2. Wrangling and Transforming

We begin by taking a peek at the contents of our data file. To do this, we open the file and examine the first few rows (Chapter 8).

from pathlib import Path

# Create a Path pointing to our data file
insp_path = Path() / 'data' / 'donkeys.csv'

with insp_path.open() as f:
    # Display first five lines of file
    for _ in range(5):
        print(f.readline(), end='')
BCS,Age,Sex,Length,Girth,Height,Weight,WeightAlt
3,<2,stallion,78,90,90,77,NA
2.5,<2,stallion,91,97,94,100,NA
1.5,<2,stallion,74,93,95,74,NA
3,<2,female,87,109,96,116,NA

Since the file is CSV formatted, we can easily read it into a dataframe.

donkeys = pd.read_csv("donkeys.csv")
donkeys
BCS Age Sex Length Girth Height Weight WeightAlt
0 3.0 <2 stallion 78 90 90 77 NaN
1 2.5 <2 stallion 91 97 94 100 NaN
2 1.5 <2 stallion 74 93 95 74 NaN
... ... ... ... ... ... ... ... ...
541 2.5 10-15 stallion 103 118 103 174 NaN
542 3.0 2-5 stallion 91 112 100 139 NaN
543 3.0 5-10 stallion 104 124 110 189 NaN

544 rows × 8 columns

Over 500 donkeys participated in the survey, and eight measurements were make on each donkey. According to the documentation, the granularity is a single donkey (Chapter 9). Table %s below provides descriptions of the eight features.

Table 18.1 Donkey Study Codebook

Feature

Data Type

Feature Type

Description

BCS

float64

ordinal

Body Condition Score: from 1 (emaciated) through 3 (healthy) to 5 (obese) in increments of 0.5.

Age

string

ordinal

Age in years, under 2, 2-5, 5-10, 10-15, 15-20, and over 20 years.

Sex

string

nominal

Sex categories: stallion, gelding, female.

Length

int64

numeric

body length (cm) from front leg elbow to back of pelvis.

Girth

int64

numeric

body circumference (cm), measured just behind front legs.

Height

int64

numeric

body height (cm) up to point where neck connects to back.

Weight

int64

numeric

weight (kilogram).

WeightAlt

float64

numeric

second weight measurement taken on a small subset of donkeys.

Figure 18.1 is a stylized representation of a donkey as a cylinder with neck and legs appended. Notice the height measurement includes the legs. The girth and length are the circumference and length of the cylinder.

../../_images/donkeyDiagram.png

Fig. 18.1 The cylinder shown here represents the body of a donkey. Girth is measured around the body just behind the front legs, height is measured from the ground to where the neck connects to the top of the back, and length is measured from the front elbow to the back of the pelvis.

Our next step is to perform some quality checks on the data. In the previous section, we listed a few potential quality concerns based on scope. Next, we check the quality of the measurements and their distributions.

Let’s start with comparing the two weight measurements to check on the consistency of the scale. Below is a histogram of the difference between these two measurements for the small subset of donkeys that were weighed twice.

donkeys['difference'] = donkeys['WeightAlt'] - donkeys['Weight']

fig = px.histogram(donkeys, x='difference', nbins=3,
                   width=350, height=250)
fig.update_xaxes(title='difference between two <br> measurements of the same donkey')
fig
../../_images/donkey_clean_10_0.svg

The measurements are all within 1 kg of each other, and the majority are exactly the same (to the nearest kilogram).

Next, we look for unusual values in the body condition score.

donkeys['BCS'].value_counts()
3.0    307
2.5    135
3.5     55
      ... 
1.5      5
4.5      1
1.0      1
Name: BCS, Length: 8, dtype: int64

From this output, we see that there’s only one donkey with a body condition score of 1 (emaciated) and one donkey with a score of 4.5 (obese). Let’s look at these two donkeys.

donkeys[(donkeys['BCS'] == 1.0) | (donkeys['BCS'] == 4.5)]
BCS Age Sex Length ... Height Weight WeightAlt difference
291 4.5 10-15 female 107 ... 106 227 NaN NaN
445 1.0 >20 female 97 ... 102 115 NaN NaN

2 rows × 9 columns

Since these BCS values also have outlier weights, we’ll remove these two records. We may also decide to remove the five donkeys with a score of 1.5, if they appear anomalous in our later analysis.

def remove_bcs_outliers(donkeys):
    return donkeys[(donkeys['BCS'] >= 1.5) & (donkeys['BCS'] <= 4)] 

donkeys = (pd.read_csv('data/donkeys.csv')
           .pipe(remove_bcs_outliers))

Next, we examine the distribution of values for weight to see if there are any issues with quality.

fig = px.histogram(donkeys, x='Weight', nbins=20,
                   width=350, height=250)
fig
../../_images/donkey_clean_18_0.svg

Next, we’ll check the relationship between weight and height to assess the quality of the data for analysis.

fig = px.scatter(donkeys, x='Height', y='Weight',
                 width=350, height=250)
fig
../../_images/donkey_clean_20_0.svg

It appears that there is one very light donkey. The small donkey is far from the main concentration of donkeys and would overly influence our models. For this reason, we exclude it. Again, we keep in mind that we may also want to exclude the one or two heavy donkeys, if they appear to overly influence our future model fitting.

def remove_weight_outliers(donkeys):
    return donkeys[(donkeys['Weight'] >= 40)]

donkeys = (pd.read_csv('data/donkeys.csv')
           .pipe(remove_bcs_outliers)
           .pipe(remove_weight_outliers))

donkeys.shape
(541, 8)

In summary, based on our cleaning and quality checks, we removed three anomalous observations from the data frame. Now, we’re nearly ready to begin our exploratory analysis. Before we proceed, we’ll set aside some of our data as a test set.

18.3. Train-Test Split of the Data

We talked about why it’s important to separate out a test set from the training set in Chapter 15. When our goal is to create a model, we consider it best practice to separate out a test set early in the analysis, before we explore the data in detail. When we explore a dataset, we implicitly make decisions about what kinds of models to fit and what variables to use in the model. It’s important that our test set isn’t involved in these decisions so that it simulates completely new data to evaluate our model.

We’ll divide our data into an 80/20 split, where we use 80% of the data to explore and build a model. Then, we evaluate the model with the 20% that has been set aside.

np.random.seed(42)
n = len(donkeys)
indices = np.arange(n)
np.random.shuffle(indices)
n_train = int(np.round((0.8 * n)))

The code above takes a random shuffle of the indices for the data frame. Next, we assign the first 80% to the training data frame and the remaining 20% to the test set.

train_set = donkeys.iloc[indices[:n_train]]
test_set = donkeys.iloc[indices[n_train:]]

We confirm that the test and train sets are the expected shape.

train_set.shape
(433, 8)
test_set.shape
(108, 8)

Next, we’ll explore the training data to look for useful relationships and distributions that inform our modeling.