{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "import sys\n", "import os\n", "if not any(path.endswith('textbook') for path in sys.path):\n", " sys.path.append(os.path.abspath('../../..'))\n", "from textbook_utils import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(sec:eda_feature_types)=\n", "# Feature Types\n", "\n", "Before making an exploratory plot, or any plot for that matter, it's a good idea to examine the feature (or features) and decide on its *feature type*. (Sometimes we refer to a feature as a _variable_ and its type as *variable type*.) Although there are multiple ways of categorizing feature types, in this book we consider three basic ones:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Nominal*\n", ": A feature that represents \"named\" categories, where the categories do not have a natural ordering, is called nominal. Examples include political party affiliation (Democrat, Republican, Green, Other); dog type (herding, hound, non-sporting, sporting, terrier, toy, working); and computer operating system (Windows, macOS, Linux).\n", "\n", "*Ordinal*\n", ": Measurements that represent ordered categories are called ordinal. Examples of ordinal features are t-shirt size (small, medium, large); Likert-scale response (disagree, neutral, agree); and level of education (high school, college, graduate school). It is important to\n", "note that with an ordinal feature, the difference between, say, small and\n", "medium need not be the same as the difference between medium and large. Also, the differences between consecutive categories may not even be quantifiable. Think of the number of stars in a restaurant review\n", "and what one star means in comparison to two stars. \n", "\n", "Ordinal and nominal data are subtypes of *categorical* data. Another name\n", "for categorical data is *qualitative*. 
In contrast, we also have\n", "*quantitative* features:\n", "\n", "\n", "*Quantitative*\n", ": Data that represent numeric measurements or quantities\n", "are called quantitative. Examples include height measured to the\n", "nearest cm, price reported in USD, and distance measured to the nearest\n", "km. Quantitative features can be further divided into\n", "*discrete*, meaning that the feature can take only a limited set of separate values, and\n", "*continuous*, meaning that the quantity could in principle be measured to\n", "arbitrary precision. The number of siblings in a family takes on a discrete\n", "set of values (such as 0, 1, 2, ..., 8). In contrast, height can theoretically be\n", "reported to any number of decimal places, so we consider it continuous.\n", "There is no hard and fast rule to determine whether a quantity is discrete or continuous. In some cases, it can be a judgment call, and in others, we may want to purposefully consider a continuous feature to be discrete. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A feature type is not the same thing as a data storage type. Each column in a `pandas` dataframe has its own *storage type*. Storage types include integer, floating point, boolean, date-time, category, and object (strings of varying length are stored as objects in Python with pointers to the strings).\n", "We use the term *feature type* to refer to the\n", "conceptual notion of the information and the term *storage type* to refer to the\n", "representation of the information in the computer."
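, "\n", "\n", "As a minimal sketch of this distinction, the following snippet builds a small, made-up dataframe (the column names and values are hypothetical and are not part of the AKC data) and compares the storage type that `pandas` reports with the feature type we would assign after reading a codebook:\n", "\n", "
```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
# Each comment gives the feature type we would assign by hand.
toy = pd.DataFrame({
    'os_code': [1, 2, 3, 1],                             # integers, but nominal
    'price': ['$100.00', '$25.50', '$7.25', '$12.00'],   # strings, but quantitative
    'has_fence': [True, False, True, True],              # boolean, nominal (two values)
    'height_cm': [51.0, 38.5, 48.0, 42.0],               # floats, quantitative
})

# Storage types as pandas reports them, one per column
storage_types = toy.dtypes
print(storage_types)

# Recover the numeric quantity hidden in the string column
price_usd = toy['price'].str.replace('$', '', regex=False).astype(float)

# Record that the integer codes are categories, not quantities
os_nominal = toy['os_code'].astype('category')
```
", "\n", "Converting a column's storage type, as in the last two statements, records our judgment about the feature type; `pandas` cannot infer that judgment from the stored values alone."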
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A feature stored as an integer can represent nominal data, strings can be\n", "quantitative (like `\"\\$100.00\"`), and, in practice, boolean values often\n", "represent nominal features that have only two possible values.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ":::{note}\n", "\n", "`pandas` calls the storage type `dtype`, which is short for data type.\n", "We refrain from using the term *data type* here because it can be confused with\n", "both storage type and feature type.\n", "\n", ":::" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to determine a feature type, we often need to consult a\n", "dataset’s *data dictionary* or *codebook*. A data dictionary is a document\n", "included with the data that describes what each column in the data table\n", "represents. In the following example, we take a look at the storage and\n", "feature types of the columns in a dataframe about various dog breeds, \n", "and we find that the storage type is often not a good indicator of the kind \n", "of information contained in a field." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example: Dog Breeds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the [American Kennel Club (AKC)](https://www.akc.org) data on registered dog breeds to introduce the various concepts related to EDA. The AKC, a nonprofit that was founded in 1884, has the stated mission to \"advance the study, breeding, exhibiting, running and maintenance of purebred dogs.\" The AKC organizes events like the National Championship, Agility Invitational, and Obedience Classic, and mixed-breed dogs are welcome to participate in most events. The [Information Is Beautiful](https://informationisbeautiful.net) website provides a dataset with information from the AKC on 172 breeds. 
Its visualization, [Best in Show](https://www.informationisbeautiful.net/visualizations/best-in-show-whats-the-top-data-dog/), incorporates many features of the breeds and is fun to look at." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The AKC dataset contains several different kinds of features, and we have extracted a handful of them that show a variety of types of information. These features include the name of the breed; its longevity, weight, and height; and other information such as its suitability for children and the number of repetitions needed to learn a new trick. Each record in the dataset is a breed of dog, and the information provided is meant to be typical of that breed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's read the data into a dataframe:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "|  | breed | group | score | longevity | ... | size | weight | height | repetition |\n",
"|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | Border Collie | herding | 3.64 | 12.52 | ... | medium | NaN | 51.0 | <5 |\n",
"| 1 | Border Terrier | terrier | 3.61 | 14.00 | ... | small | 6.0 | NaN | 15-25 |\n",
"| 2 | Brittany | sporting | 3.54 | 12.92 | ... | medium | 16.0 | 48.0 | 5-15 |\n",
"| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |\n",
"| 169 | Wire Fox Terrier | terrier | NaN | 13.17 | ... | small | 8.0 | 38.0 | 25-40 |\n",
"| 170 | Wirehaired Pointing Griffon | sporting | NaN | 8.80 | ... | medium | NaN | 56.0 | 25-40 |\n",
"| 171 | Xoloitzcuintli | non-sporting | NaN | NaN | ... | medium | NaN | 42.0 | NaN |\n", "
172 rows × 12 columns
\n", "\n", "|  | breed | group | score | longevity | ... | weight | height | repetition | kids |\n",
"|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | Border Collie | herding | 3.64 | 12.52 | ... | NaN | 51.0 | <5 | low |\n",
"| 1 | Border Terrier | terrier | 3.61 | 14.00 | ... | 6.0 | NaN | 15-25 | high |\n",
"| 2 | Brittany | sporting | 3.54 | 12.92 | ... | 16.0 | 48.0 | 5-15 | medium |\n",
"| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |\n",
"| 169 | Wire Fox Terrier | terrier | NaN | 13.17 | ... | 8.0 | 38.0 | 25-40 | NaN |\n",
"| 170 | Wirehaired Pointing Griffon | sporting | NaN | 8.80 | ... | NaN | 56.0 | 25-40 | NaN |\n",
"| 171 | Xoloitzcuintli | non-sporting | NaN | NaN | ... | NaN | 42.0 | NaN | NaN |\n", "
172 rows × 13 columns
\n", "