# 10.1. Feature Types

Before we make any exploratory plots, we examine the features (also known as
variables) of the data and decide on their type, which we call the **feature
type** or **variable type**. Although there are multiple ways of delineating
feature types, in this book we consider three basic ones:

**Nominal feature:** A feature that represents “named” categories, where the categories do not have a natural ordering, is called nominal. Examples include: political party affiliation (Democrat, Republican, Green, Other); American Kennel Club breed group (herding, hound, non-sporting, sporting, terrier, toy, working); and computer operating system (Windows, MacOS, Linux).

**Ordinal feature:** Measurements that represent ordered categories are called ordinal. Examples of ordinal features are: T-shirt size (small, medium, large); Likert-scale response (disagree, neutral, agree); and level of education (high school, college, graduate school). It is important to note that with an ordinal feature, the difference between, say, small and medium, need not be the same as the difference between medium and large. We can order the categories, but the differences between consecutive categories may not be quantifiable. Even when they can be quantified, the differences between consecutive categories may vary. We give examples later in this section.

Ordinal and nominal data are subtypes of **categorical** data. Another name for these types is **qualitative**. In contrast, we also have **quantitative** features.

**Quantitative feature:** These data represent numeric amounts or quantities and so are called quantitative. Examples include: height measured to the nearest cm, price reported in USD, and distance measured to the nearest tenth of a km. Quantitative features can be further divided into **discrete**, meaning that only a small set of values is possible, and **continuous**, meaning that the quantity could in principle be reported to arbitrary precision. For example, the number of siblings takes on a discrete set of values (such as 0, 1, 2, …, 8). On the other hand, height is measured in centimeters and can theoretically be reported to any number of decimal places, so we consider it continuous. There is no hard and fast rule to determine whether a quantity is discrete or continuous.

**Data Storage Types vs. Feature Types**

Each column in a `pandas` data frame has its own **storage type**. These types
can be integer, floating point, boolean, date-time format, category, and object
(strings of varying length are stored as objects in Python with pointers to the
strings). It is essential to understand that a feature type is not the same as
a `pandas` storage type. We use the term *feature type* to refer to the
conceptual notion of the information, and the term *storage type* to refer to the
representation of the information in the computer.

Note

Pandas calls the storage type `dtype`, which is short for data type.
We refrain from using the term *data type* here because it can be confused with
both storage type and feature type.

A feature stored as an integer can represent nominal data, strings can be quantitative (e.g., “$100.00”), and, in practice, boolean values often represent nominal features that have only two possible values.
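To make this mismatch concrete, here is a minimal sketch. The table and its column names (`zip_code`, `price`, `is_member`) are hypothetical and made up for illustration; they are not part of the AKC data.

```python
import pandas as pd

# Hypothetical toy table; names and values are invented for illustration.
df = pd.DataFrame({
    "zip_code": [94720, 10027, 60637],          # integers, but nominal (labels, not amounts)
    "price": ["$100.00", "$85.50", "$120.25"],  # strings, but quantitative
    "is_member": [True, False, True],           # boolean, nominal with two categories
})

# The storage types alone do not reveal the feature types.
print(df.dtypes)

# One possible way to recover the quantity hidden in the price strings:
# strip the dollar sign and convert to floating point.
price_usd = df["price"].str.replace("$", "", regex=False).astype(float)
print(price_usd.mean())
```

Averaging `price_usd` is meaningful, while averaging `zip_code` would not be, even though `pandas` would compute it without complaint.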

In order to determine a feature type, we often need to consult the
dataset’s **data dictionary** or **codebook**. A data dictionary is a document
included with the data that describes what each column in the data table
represents. In the following example, we take a look at the storage types and
feature types of the columns in the dogs data frame and see that the storage
type may not be a good indicator of the kind of information contained in a
field.

Next, we’ll give concrete examples of both storage and feature types.

## 10.1.1. Example: AKC Dog Breeds

Let’s take a look at the data table from the American Kennel Club. The subset of AKC data we are working with has 12 features and 172 breeds.

```
import pandas as pd

dogs = pd.read_csv('data/akc.csv')
dogs
```

| | breed | group | score | longevity | ... | size | weight | height | repetition |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Border Collie | herding | 3.64 | 12.52 | ... | medium | NaN | 51.0 | <5 |
| 1 | Border Terrier | terrier | 3.61 | 14.00 | ... | small | 6.0 | NaN | 15-25 |
| 2 | Brittany | sporting | 3.54 | 12.92 | ... | medium | 16.0 | 48.0 | 5-15 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 169 | Wire Fox Terrier | terrier | NaN | 13.17 | ... | small | 8.0 | 38.0 | 25-40 |
| 170 | Wirehaired Pointing Griffon | sporting | NaN | 8.80 | ... | medium | NaN | 56.0 | 25-40 |
| 171 | Xoloitzcuintli | non-sporting | NaN | NaN | ... | medium | NaN | 42.0 | NaN |

172 rows × 12 columns

A cursory glance at the table shows us that breed, group and size appear to be strings, and the other columns numbers. The summary of the data frame, shown below, provides the index, name, count of non-null values, and dtype for each column.

```
dogs.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 172 entries, 0 to 171
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   breed           172 non-null    object
 1   group           172 non-null    object
 2   score           87 non-null     float64
 3   longevity       135 non-null    float64
 4   ailments        148 non-null    float64
 5   purchase_price  146 non-null    float64
 6   grooming        112 non-null    float64
 7   children        112 non-null    float64
 8   size            172 non-null    object
 9   weight          86 non-null     float64
 10  height          159 non-null    float64
 11  repetition      132 non-null    object
dtypes: float64(8), object(4)
memory usage: 16.2+ KB
```

Several columns of this data frame have a numeric storage type, as
signified by `float64`, which means that the column can contain non-integers.
Also, note that `pandas` encodes string columns as the `object` dtype rather
than using a `string` dtype.
Notice that we guessed incorrectly that `repetition` is quantitative.
Looking a bit more carefully at
the data table, we see that `repetition` contains string values for ranges,
such as “<5”, “15-25”, and “25-40”, so this feature is ordinal.

Note

**Why are decimal columns stored as the float64 dtype?**
In computer architecture, a floating-point number, or “float” for short,
refers to a number that can have a decimal component. We won’t go in-depth
into computer architecture in this book, but we will point out when it
affects terminology, as in this case.
The dtype `float64` says that the column contains decimal numbers that each
take up 64 bits of space when stored in computer memory.

**Why are strings stored as the object dtype?** Essentially, `pandas` uses
optimized storage types for numeric data, like `float64` or `int64`.
However, it doesn’t have optimizations for Python objects like strings,
dictionaries, or sets, so these are all stored as the `object` dtype.
This means that the `object` dtype is ambiguous, but in most real-world cases
we know whether `object` columns contain strings or some other Python type.

Next, let’s look at an example where the storage type differs from the
feature type.
At first glance, we might guess `ailments` and `children` are quantitative
features because they are stored as `float64` dtypes.
But let’s look at the counts of their values.

```
display_df(dogs['ailments'].value_counts(), rows=8)
```

```
0.0 61
1.0 42
2.0 24
4.0 10
3.0 6
5.0 3
8.0 1
9.0 1
Name: ailments, dtype: int64
```

```
dogs['children'].value_counts()
```

```
1.0 67
2.0 35
3.0 10
Name: children, dtype: int64
```

Both `ailments` and `children` only take on a few integer values.
What does a value
of 3.0 for `children` or 9.0 for `ailments` mean? We need more information to
figure this out. The name of the column and how the information is stored in
the data frame are not enough.
Instead, take a look at the data dictionary, shown in the AKC Dog Breed Codebook
table below.

| Feature | Description |
|---|---|
| breed | dog breed, e.g., Border Collie, Dalmatian, Vizsla |
| group | American Kennel Club grouping (herding, hound, non-sporting, sporting, terrier, toy, working) |
| score | AKC score |
| longevity | typical lifetime (years) |
| ailments | number of serious genetic ailments |
| purchase_price | average purchase price from puppyfind.com |
| grooming | grooming required once every: 1 = day, 2 = week, 3 = few weeks |
| children | suitability for children: 1 = high, 2 = medium, 3 = low |
| size | size: small, medium, large |
| weight | typical weight (kg) |
| height | typical height from the shoulder (cm) |
| repetition | number of repetitions to understand a new command: <5, 5-15, 15-25, 25-40, 40-80, >80 |

Although the data dictionary does not explicitly specify the feature types, the
descriptions help us figure out that `children` represents the suitability of
the breed for children, and a value of 1.0 corresponds to “high” suitability.
We also figure out that `ailments` is a count of the number of serious genetic
ailments that dogs of this breed tend to have. Based on the codebook, we treat
`children` as a categorical feature, even though it is stored as a floating
point number, and since low < medium < high, `children` is ordinal. Since
`ailments` is a count, we treat it as a quantitative (numeric) feature type,
and for some analyses we further define it as discrete numeric because there
are only a few possible values that `ailments` can take on.

The codebook also confirms that the features `score`, `longevity`,
`purchase_price`, `weight`, and `height` are quantitative. It makes sense to
compare the longevity for one breed to that of another by looking at the
difference in their `longevity` values. For example, chihuahuas typically live
about four years longer than dachshunds (16.5 to 12.6 years). It also makes
sense to compare the weight of one breed to another as a ratio; for example, a
dachshund is usually about five times as heavy as a chihuahua (11 kg to 2 kg).
Of all the quantitative features, `ailments` is the only one that we consider to
be discrete.

The data dictionary descriptions for `breed`, `group`, `size`, and `repetition`
suggest that these features are qualitative. Each of these variables has
different, and yet commonly found, characteristics that are worth exploring a
bit more. We do this by examining the counts of each unique value for the
feature. We begin with `breed`.

```
dogs['breed'].value_counts()
```

```
Australian Cattle Dog 1
Staffordshire Bull Terrier 1
Dandie Dinmont Terrier 1
..
Komondor 1
Boykin Spaniel 1
Alaskan Malamute 1
Name: breed, Length: 172, dtype: int64
```

The `breed` feature has 172 unique values, the same as the number of
records in the data frame. We can think of `breed` as the **primary key** for the
table. By design, each dog breed has one record, and this feature determines
the dataset’s granularity. Although technically `breed` is a nominal feature,
it doesn’t really make sense to analyze it. We would want to confirm that all
values are unique and clean, and otherwise we would only use it to, say, label
unusual values in a plot.
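As a rough illustration of confirming that a key column is unique and clean, the sketch below uses a small made-up series of breed names (including a deliberately messy entry) rather than the AKC table. The cleaning rule of stripping whitespace and lowercasing is an assumption for demonstration, not a step taken from this chapter.

```python
import pandas as pd

# Hypothetical breed names; the last entry has stray whitespace and casing.
breeds = pd.Series(["Border Collie", "Border Terrier", "Brittany", " brittany"])

# The raw strings are all distinct, so this check passes.
print(breeds.is_unique)

# After normalizing whitespace and case, a hidden duplicate appears.
cleaned = breeds.str.strip().str.lower()
print(cleaned.is_unique)

# Show every record involved in a duplicated key.
dupes = cleaned[cleaned.duplicated(keep=False)]
print(dupes)
```

On the real data, the analogous check would be applied to `dogs['breed']`.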

Next we examine `group`.

```
dogs['group'].value_counts()
```

```
sporting 28
terrier 28
working 27
hound 26
herding 25
non-sporting 19
toy 19
Name: group, dtype: int64
```

The `group` feature has seven unique values, and since these groupings do not
have a natural ordering, we consider `group` a nominal feature.

We look at `size` next.

```
dogs['size'].value_counts()
```

```
medium 60
small 58
large 54
Name: size, dtype: int64
```

The `size` feature has a natural ordering: small < medium < large, so it is
ordinal. We don’t know how the category “small” is determined, but we do know
that a small breed is in some sense smaller than a medium-size breed, which is
smaller than a large one. We have an ordering, but differences and ratios
don’t make sense conceptually for this feature.

Nominal features, in comparison, do not provide meaning in even the direction
of the differences. A dog breed in the group `sporting` and a breed in `toy`
differ from each other in several ways, so `group` values are not easily reduced
to an ordering.

The `repetition` feature is an example of a quantitative variable that has been
collapsed into categories and become ordinal. The codebook tells us that
`repetition` is the number of times a new command needs to be repeated before
the dog understands it. The numeric values have been placed into categories:
<5, 5-15, 15-25, 25-40, 40-80, >80.

```
dogs['repetition'].value_counts()
```

```
25-40 39
15-25 29
40-80 22
5-15 21
80-100 11
<5 10
Name: repetition, dtype: int64
```

Notice that these categories have different widths. The first is fewer than 5 repetitions, while others are 10, 15, and 40 repetitions wide. The ordering is clear, but the gaps from one category to the next are not the same magnitude.
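Because `value_counts()` sorts by frequency, the output above does not follow the natural order of the categories. One possible way to restore that order is sketched below; the counts are copied from the output above, and the `order` list is written by hand from the labels that appear in the data.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    {"25-40": 39, "15-25": 29, "40-80": 22, "5-15": 21, "80-100": 11, "<5": 10}
)

# Hand-written ordering of the categories, from fewest to most repetitions.
order = ["<5", "5-15", "15-25", "25-40", "40-80", "80-100"]

# Reindexing rearranges the counts to follow the ordinal scale.
ordered_counts = counts.reindex(order)
print(ordered_counts)
```

The same `order` list can be reused wherever the categories need to appear in their natural sequence, such as on the axis of a bar plot.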

Now that we have double-checked the values in the variables against the descriptions in the codebook, we can augment the data dictionary to include this additional information about the feature types. Our dictionary is shown in the Revised AKC Dog Breed Codebook table below.

| Feature | Description | Feature Type | Storage Type |
|---|---|---|---|
| breed | dog breed, e.g., Border Collie, Dalmatian, Vizsla | primary key | string |
| group | AKC group (herding, hound, non-sporting, sporting, terrier, toy, working) | qualitative - nominal | string |
| score | AKC score | quantitative | floating point |
| longevity | typical lifetime (years) | quantitative | floating point |
| ailments | number of serious genetic ailments (0, 1, …, 9) | quantitative - discrete | floating point |
| purchase_price | average purchase price from puppyfind.com | quantitative | floating point |
| grooming | groom once every: 1 = day, 2 = week, 3 = few weeks | qualitative - ordinal | floating point |
| children | suitability for children: 1 = high, 2 = medium, 3 = low | qualitative - ordinal | floating point |
| size | size: small, medium, large | qualitative - ordinal | string |
| weight | typical weight (kg) | quantitative | floating point |
| height | typical height from the shoulder (cm) | quantitative | floating point |
| repetition | number of repetitions to understand a new command: <5, 5-15, 15-25, 25-40, 40-80, >80 | qualitative - ordinal | string |

## 10.1.2. Transforming Qualitative Features

We discussed transformations in the Wrangling Dataframes chapter, but there are a few additional transformations related to the categories of qualitative features that we may want to perform. We may want to:

- Relabel categories
- Collapse categories
- Convert a quantitative feature into ordinal

We’ll explain when we may want to make these transformations and give examples.

**Relabel Categories.** Summary statistics, like the mean and the median, make
sense for quantitative data, but typically not for qualitative data. For
example, the average price for toy breeds makes sense ($687), but
the “average” of children suitability doesn’t.
However, `pandas` will happily compute the mean of the values in the `children`
column if we ask it to.

```
# Don't use this value in actual data analysis!
dogs["children"].mean()
```

```
1.4910714285714286
```

Note

This is a key difference between storage types and feature types—storage
types say what operations we can write code to *compute*, while
feature types say what operations *make sense for the data*.

Instead, we want to consider the distribution of ones, twos, and threes of
`children` for toy breeds.

```
import seaborn as sns

toy_dogs = dogs.query('group == "toy"')
sns.countplot(data=toy_dogs, x='children')
```

We can transform `children` by replacing the numbers with their string
descriptions. Changing 1, 2, and 3 into high, medium, and low makes
it easier to recognize that `children` is categorical. With strings, we would
not be tempted to compute a mean, the categories would be connected to their
meaning, and labels for plots would have reasonable values by default.
Why would we not always want to have categorical data represented by strings?
Strings generally take up more computer memory to store, which can greatly
increase the size of a dataset if it contains many categorical features.
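A minimal sketch of this relabeling follows. It uses a small stand-in series rather than the full `dogs` data frame, and the name `labels` is our own; the mapping itself comes from the codebook (1 = high, 2 = medium, 3 = low).

```python
import pandas as pd

# Hypothetical stand-in for the `children` column, including a missing value.
children = pd.Series([1.0, 2.0, 3.0, 1.0, None], name="children")

# Mapping taken from the codebook: 1 = high, 2 = medium, 3 = low.
labels = {1.0: "high", 2.0: "medium", 3.0: "low"}

# Values without an entry in the mapping (here, the missing value) stay missing.
relabeled = children.map(labels)
print(relabeled.value_counts())
```

With the real data, the same mapping could be applied to `dogs['children']` and stored in a new column with `assign`.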

**Collapse Categories.** Let’s create a new column, called `play`, to represent
the groups of dogs whose “purpose” is to play (or not). (This is a fictitious
distinction used for demonstration purposes.) This group consists of the toy
and non-sporting breeds. The new feature, `play`, is a transformation of
`group` that collapses categories: toy and non-sporting are combined into one
category, and the remaining categories are placed in a second, non-play
category. The boolean (`bool`) storage type is useful to indicate the
presence or absence of this characteristic.

```
with_play = dogs.assign(
    play=(dogs["group"] == "toy") | (dogs["group"] == "non-sporting")
)
with_play
```

| | breed | group | score | longevity | ... | weight | height | repetition | play |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Border Collie | herding | 3.64 | 12.52 | ... | NaN | 51.0 | <5 | False |
| 1 | Border Terrier | terrier | 3.61 | 14.00 | ... | 6.0 | NaN | 15-25 | False |
| 2 | Brittany | sporting | 3.54 | 12.92 | ... | 16.0 | 48.0 | 5-15 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 169 | Wire Fox Terrier | terrier | NaN | 13.17 | ... | 8.0 | 38.0 | 25-40 | False |
| 170 | Wirehaired Pointing Griffon | sporting | NaN | 8.80 | ... | NaN | 56.0 | 25-40 | False |
| 171 | Xoloitzcuintli | non-sporting | NaN | NaN | ... | NaN | 42.0 | NaN | True |

172 rows × 13 columns

Representing a two-category qualitative feature as a boolean has a few
advantages. For example, the mean of `play` makes sense because it returns the
fraction of `True` values. When booleans are used for numeric calculations,
`True` becomes 1 and `False` becomes 0.

```
with_play['play'].mean()
```

```
0.22093023255813954
```

This storage type gives us a shortcut to compute counts and averages of boolean values. Later in the book, we’ll see that it’s also a handy encoding for modeling.

**Convert Quantitative to Ordinal.** Finally, another transformation that we
sometimes find useful is to convert numeric values into categories. For
example, we might collapse the values in `ailments` into categories: 0, 1, 2,
3, 4+. In other words, we turn `ailments` from a quantitative feature into an
ordinal feature with the mapping 0→0, 1→1, 2→2, 3→3, and any value 4 or larger
→ 4+. Why might we want to make this transformation? Since so few breeds have
more than three genetic ailments, we think the simplification will be clearer
and adequate for our investigation.
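One way this mapping might be written is sketched below. The helper name `collapse_ailments` and the sample values are hypothetical; the categories are stored as strings here, which is only one of several reasonable choices.

```python
import pandas as pd

# Hypothetical stand-in for the `ailments` column, including a missing value.
ailments = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 8.0, None], name="ailments")

def collapse_ailments(count):
    """Map 0-3 to their own labels and any value of 4 or more to '4+'."""
    if pd.isna(count):
        return count  # leave missing values missing
    return "4+" if count >= 4 else str(int(count))

ailments_cat = ailments.map(collapse_ailments)
print(ailments_cat.value_counts())
```

The resulting labels sort correctly as text ("0" < "1" < "2" < "3" < "4+"), which is convenient for ordering the bars in a plot.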

Note

As of this writing (late 2021), `pandas` also
implements a `category` dtype which is
designed to work with qualitative data.
However, this storage type is not yet widely
adopted by the visualization and modeling libraries, which limits its
usefulness. For that reason, we do not transform our qualitative variables into
the `category` dtype.
We expect that future readers may want to use the `category` dtype as more
libraries support it.

## 10.1.3. The Importance of Feature Types

Feature types guide us in our data analysis. They help specify the operations, visualizations, and models we can meaningfully apply to the data. The Plots for Feature Types table below gives a mapping of the various plots that are typically good options for each feature type. Whether the variable(s) are quantitative or qualitative generally determines the set of viable plots to make, although there are exceptions. Other factors that enter into the decision are the number of observations, and whether the data takes on only a few distinct values. For example, we might make a bar chart, rather than a histogram, for a discrete quantitative variable.

| Feature Type | Dimension | Plot |
|---|---|---|
| Quantitative | One Feature | Rug plot, histogram, density curve, box-and-whisker plot, violin plot |
| Qualitative | One Feature | Bar plot, dot chart, line plot, pie chart |
| Quantitative | Two Features | Scatter plot, smooth curve, contour plot, heat map, quantile-quantile plot |
| Qualitative | Two Features | Side-by-side bar plots, mosaic plot, overlaid lines |
| Mixed | Two Features | Overlaid density curves, side-by-side box-and-whisker plots, overlaid smooth curves, quantile-quantile plot |

The feature type also helps us decide the kind of summary statistics to calculate. With qualitative data, we usually don’t compute means or standard deviations, and instead compute the count, fraction, or percentage of records in each category. With a quantitative feature, we compute the mean or median as a measure of center, and, respectively, the standard deviation or interquartile range (75th percentile - 25th percentile) as a measure of spread. In addition to the quartiles, we may find other percentiles informative.
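The sketch below illustrates these two kinds of summaries on small made-up series. The values are hypothetical and are not taken from the AKC data, and the quartiles use NumPy's default interpolation rather than any particular definition of percentile.

```python
import numpy as np
import pandas as pd

# Hypothetical values for a quantitative and a qualitative feature.
longevity = pd.Series([12.5, 14.0, 12.9, 13.2, 8.8, 10.1, 11.4, 9.6])
size = pd.Series(["small", "medium", "small", "large",
                  "medium", "small", "large", "small"])

# Quantitative: a measure of center paired with a measure of spread.
mean, sd = longevity.mean(), longevity.std()
median = longevity.median()
q1, q3 = np.percentile(longevity, [25, 75])
iqr = q3 - q1  # interquartile range: 75th percentile - 25th percentile
print(mean, sd, median, iqr)

# Qualitative: counts and fractions of records in each category.
counts = size.value_counts()
fractions = size.value_counts(normalize=True)
print(counts)
print(fractions)
```

Pairing the mean with the standard deviation and the median with the interquartile range follows the "respectively" in the paragraph above.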

Note

The *n*th percentile is the value *q* such that *n*% of the data
values fall at or below it. The value *q* might not be unique, and there are
several approaches to select a unique value from the possibilities. With enough
data, there should be little difference between these definitions.

To compute percentiles in Python, we prefer using:

```
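import numpy as np
# Uses our definition of percentile; the second argument is the percentile
# to compute (here the 25th). NumPy versions before 1.22 call `method` `interpolation`.
np.percentile(data, 25, method='lower')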
```

When exploring the data, we need to know how to interpret the shapes that our plots reveal. We also need to recognize certain kinds of features and understand what they tell us about the data. The next three sections give guidance with this interpretation. We also introduce many of the types of plots listed in Table 10.3 through the examples, and those that are not introduced here are covered in the Data Visualization chapter.