{
"cells": [
{
"cell_type": "code",
"execution_count": 85,
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"import sys\n",
"import os\n",
"if not any(path.endswith('textbook') for path in sys.path):\n",
" sys.path.append(os.path.abspath('../../..'))\n",
"from textbook_utils import *"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"(ch:pandas_subsetting)=\n",
"# Subsetting\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"This section introduces operations for taking subsets of dataframes. When data\n",
"scientists first read in a dataframe, they often want to subset the specific\n",
"data that they plan to use. For example, a data scientist can *slice* out the\n",
"10 relevant features from a dataframe with hundreds of columns. Or they can\n",
"*filter* a dataframe to remove rows with incomplete data. For the rest of this\n",
"chapter, we demonstrate dataframe operations using a dataframe of baby names."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Scope and Question"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"There's a [2021 _New York Times_ article](https://www.nytimes.com/2021/06/12/style/lilibet-popular-baby-names.html) that talks about Prince Harry and\n",
"Meghan Markle's unique choice for their new baby daughter's name: Lilibet. The article has an interview with Pamela Redmond,\n",
"an expert on baby names, who talks about interesting trends in how people name\n",
"their kids. For example, she says that names that start with the letter _L_\n",
"have become very popular in recent years, while names that start with the\n",
"letter _J_ were most popular in the 1970s and 1980s. Are these claims reflected\n",
"in data? We can use `pandas` to find out."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we import the package as `pd`, the canonical abbreviation:"
]
},
{
"cell_type": "code",
"execution_count": 86,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We have a dataset of baby names stored in a comma-separated values (CSV) file\n",
"called _babynames.csv_. We use the _pd.read_csv_ function to read the file as a\n",
"`pandas.DataFrame` object:"
]
},
{
"cell_type": "code",
"execution_count": 87,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Name
\n",
"
Sex
\n",
"
Count
\n",
"
Year
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
Liam
\n",
"
M
\n",
"
19659
\n",
"
2020
\n",
"
\n",
"
\n",
"
1
\n",
"
Noah
\n",
"
M
\n",
"
18252
\n",
"
2020
\n",
"
\n",
"
\n",
"
2
\n",
"
Oliver
\n",
"
M
\n",
"
14147
\n",
"
2020
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
2020719
\n",
"
Verona
\n",
"
F
\n",
"
5
\n",
"
1880
\n",
"
\n",
"
\n",
"
2020720
\n",
"
Vertie
\n",
"
F
\n",
"
5
\n",
"
1880
\n",
"
\n",
"
\n",
"
2020721
\n",
"
Wilma
\n",
"
F
\n",
"
5
\n",
"
1880
\n",
"
\n",
" \n",
"
\n",
"
2020722 rows × 4 columns
\n",
"
"
],
"text/plain": [
" Name Sex Count Year\n",
"0 Liam M 19659 2020\n",
"1 Noah M 18252 2020\n",
"2 Oliver M 14147 2020\n",
"... ... .. ... ...\n",
"2020719 Verona F 5 1880\n",
"2020720 Vertie F 5 1880\n",
"2020721 Wilma F 5 1880\n",
"\n",
"[2020722 rows x 4 columns]"
]
},
"execution_count": 87,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"baby = pd.read_csv('babynames.csv')\n",
"baby"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The data in the `baby` table comes from the [US Social Security Administration (SSA)](https://www.ssa.gov/oact/babynames/index.html),\n",
"which records the baby name and birth sex for birth certificate purposes. The SSA makes the baby names data available on its website.\n",
"We've loaded this data into the `baby` table."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The SSA website has a [page](https://www.ssa.gov/oact/babynames/background.html) that describes the data in more detail. We won't go\n",
"in depth in this chapter about the data's limitations, but we'll point out\n",
"this relevant information from the website:\n",
"\n",
"> All names are from Social Security card applications for births that occurred\n",
"> in the United States after 1879. Note that many people born before 1937 never\n",
"> applied for a Social Security card, so their names are not included in our\n",
"> data. For others who did apply, our records may not show the place of birth,\n",
"> and again their names are not included in our data.\n",
">\n",
"> All data are from a 100% sample of our records on Social Security card\n",
"> applications as of March 2021."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"It's also important to point out that at the time of this writing, the SSA dataset only provides the binary options of male and female. We hope that in the future, national datasets like this one will provide more inclusive options."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dataframes and Indices"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's examine the `baby` dataframe in more detail. A dataframe has rows and\n",
"columns. Every row and column has a label, as highlighted in\n",
"{numref}`fig:baby-labels`. \n",
"\n",
"```{figure} figures/baby-labels.svg\n",
"---\n",
"name: fig:baby-labels\n",
"alt: baby-labels\n",
"---\n",
"The `baby` dataframe has labels for both rows and columns (boxed)\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, `pandas` assigns row labels as incrementing numbers starting from\n",
"0. In this case, the data at the row labeled `0` and column labeled `Name` has\n",
"the data `'Liam'`.\n",
"\n",
"Dataframes can also have strings as row labels. {numref}`fig:dog-labels` shows\n",
"a dataframe of dog data where the row labels are strings.\n",
"\n",
"```{figure} figures/dog-labels.svg\n",
"---\n",
"name: fig:dog-labels\n",
"alt: dog-labels\n",
"---\n",
"Row labels in dataframes can also be strings, as in this example, in which each row is\n",
"labeled using the dog breed name\n",
"```"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"The row labels have a special name. We call them the _index_ of a dataframe,\n",
"and `pandas` stores the row labels in a special `pd.Index` object. We won't\n",
"discuss the `pd.Index` object since it's less common to manipulate the\n",
"index itself. For now, it's important to remember that even though the index looks\n",
"like a column of data, the index really represents row labels, not data. For\n",
"instance, the dataframe of dog breeds has four columns of data, not five, since\n",
"the index doesn't count as a column."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Slicing\n",
"\n",
"*Slicing* is an operation that creates a new dataframe by taking a subset of\n",
"rows or columns out of another dataframe. Think about slicing a tomato---slices\n",
"can go both vertically and horizontally. To take slices of a dataframe in\n",
"`pandas`, we use the `.loc` and `.iloc` properties. Let's start with `.loc`.\n",
"\n",
"Here's the full `baby` dataframe:"
]
},
{
"cell_type": "code",
"execution_count": 90,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Name
\n",
"
Sex
\n",
"
Count
\n",
"
Year
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
Liam
\n",
"
M
\n",
"
19659
\n",
"
2020
\n",
"
\n",
"
\n",
"
1
\n",
"
Noah
\n",
"
M
\n",
"
18252
\n",
"
2020
\n",
"
\n",
"
\n",
"
2
\n",
"
Oliver
\n",
"
M
\n",
"
14147
\n",
"
2020
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
2020719
\n",
"
Verona
\n",
"
F
\n",
"
5
\n",
"
1880
\n",
"
\n",
"
\n",
"
2020720
\n",
"
Vertie
\n",
"
F
\n",
"
5
\n",
"
1880
\n",
"
\n",
"
\n",
"
2020721
\n",
"
Wilma
\n",
"
F
\n",
"
5
\n",
"
1880
\n",
"
\n",
" \n",
"
\n",
"
2020722 rows × 4 columns
\n",
"
"
],
"text/plain": [
" Name Sex Count Year\n",
"0 Liam M 19659 2020\n",
"1 Noah M 18252 2020\n",
"2 Oliver M 14147 2020\n",
"... ... .. ... ...\n",
"2020719 Verona F 5 1880\n",
"2020720 Vertie F 5 1880\n",
"2020721 Wilma F 5 1880\n",
"\n",
"[2020722 rows x 4 columns]"
]
},
"execution_count": 90,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"baby"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"`.loc` lets us select rows and columns using their labels. For example, to get the data in the row labeled `1` and column labeled `Name`:"
]
},
{
"cell_type": "code",
"execution_count": 88,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Noah'"
]
},
"execution_count": 88,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The first argument is the row label\n",
"# ↓\n",
"baby.loc[1, 'Name']\n",
"# ↑\n",
"# The second argument is the column label"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
":::{warning}\n",
"Notice that `.loc` needs square brackets; running `baby.loc(1, 'Name')` will result in an error.\n",
":::"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To slice out multiple rows or columns, we can use Python slice syntax instead\n",
"of individual values:"
]
},
{
"cell_type": "code",
"execution_count": 89,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Name
\n",
"
Sex
\n",
"
Count
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
Liam
\n",
"
M
\n",
"
19659
\n",
"
\n",
"
\n",
"
1
\n",
"
Noah
\n",
"
M
\n",
"
18252
\n",
"
\n",
"
\n",
"
2
\n",
"
Oliver
\n",
"
M
\n",
"
14147
\n",
"
\n",
"
\n",
"
3
\n",
"
Elijah
\n",
"
M
\n",
"
13034
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Name Sex Count\n",
"0 Liam M 19659\n",
"1 Noah M 18252\n",
"2 Oliver M 14147\n",
"3 Elijah M 13034"
]
},
"execution_count": 89,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"baby.loc[0:3, 'Name':'Count']"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To get an entire column of data, we can pass an empty slice as the first argument:"
]
},
{
"cell_type": "code",
"execution_count": 91,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 19659\n",
"1 18252\n",
"2 14147\n",
" ... \n",
"2020719 5\n",
"2020720 5\n",
"2020721 5\n",
"Name: Count, Length: 2020722, dtype: int64"
]
},
"execution_count": 91,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"baby.loc[:, 'Count']"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the output of this doesn't look like a dataframe, and it's not.\n",
"Selecting out a single row or column of a dataframe produces a `pd.Series`\n",
"object:"
]
},
{
"cell_type": "code",
"execution_count": 92,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Series'"
]
},
"execution_count": 92,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"counts = baby.loc[:, 'Count']\n",
"counts.__class__.__name__"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"What's the difference between a `pd.Series` object and a `pd.DataFrame` object?\n",
"Essentially, a `pd.DataFrame` is two-dimensional---it has rows and columns and\n",
"represents a table of data. A `pd.Series` is one-dimensional---it represents a\n",
"list of data. `pd.Series` and `pd.DataFrame` objects have many methods in\n",
"common, but they really represent two different things. Confusing the two can\n",
"cause bugs and confusion.\n",
"\n",
"To select specific columns of a dataframe, pass a list into `.loc`:"
]
},
{
"cell_type": "code",
"execution_count": 93,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Name
\n",
"
Sex
\n",
"
Count
\n",
"
Year
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
Liam
\n",
"
M
\n",
"
19659
\n",
"
2020
\n",
"
\n",
"
\n",
"
1
\n",
"
Noah
\n",
"
M
\n",
"
18252
\n",
"
2020
\n",
"
\n",
"
\n",
"
2
\n",
"
Oliver
\n",
"
M
\n",
"
14147
\n",
"
2020
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
2020719
\n",
"
Verona
\n",
"
F
\n",
"
5
\n",
"
1880
\n",
"
\n",
"
\n",
"
2020720
\n",
"
Vertie
\n",
"
F
\n",
"
5
\n",
"
1880
\n",
"
\n",
"
\n",
"
2020721
\n",
"
Wilma
\n",
"
F
\n",
"
5
\n",
"
1880
\n",
"
\n",
" \n",
"
\n",
"
2020722 rows × 4 columns
\n",
"
"
],
"text/plain": [
" Name Sex Count Year\n",
"0 Liam M 19659 2020\n",
"1 Noah M 18252 2020\n",
"2 Oliver M 14147 2020\n",
"... ... .. ... ...\n",
"2020719 Verona F 5 1880\n",
"2020720 Vertie F 5 1880\n",
"2020721 Wilma F 5 1880\n",
"\n",
"[2020722 rows x 4 columns]"
]
},
"execution_count": 93,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Here's the original dataframe\n",
"baby"
]
},
{
"cell_type": "code",
"execution_count": 95,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Name
\n",
"
Year
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
Liam
\n",
"
2020
\n",
"
\n",
"
\n",
"
1
\n",
"
Noah
\n",
"
2020
\n",
"
\n",
"
\n",
"
2
\n",
"
Oliver
\n",
"
2020
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
2020719
\n",
"
Verona
\n",
"
1880
\n",
"
\n",
"
\n",
"
2020720
\n",
"
Vertie
\n",
"
1880
\n",
"
\n",
"
\n",
"
2020721
\n",
"
Wilma
\n",
"
1880
\n",
"
\n",
" \n",
"
\n",
"
2020722 rows × 2 columns
\n",
"
"
],
"text/plain": [
" Name Year\n",
"0 Liam 2020\n",
"1 Noah 2020\n",
"2 Oliver 2020\n",
"... ... ...\n",
"2020719 Verona 1880\n",
"2020720 Vertie 1880\n",
"2020721 Wilma 1880\n",
"\n",
"[2020722 rows x 2 columns]"
]
},
"execution_count": 95,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# And here's the dataframe with only Name and Year columns\n",
"baby.loc[:, ['Name', 'Year']]\n",
"# └──────┬───────┘\n",
"# list of column labels"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Selecting columns is very common, so there's a shorthand:"
]
},
{
"cell_type": "code",
"execution_count": 94,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Liam\n",
"1 Noah\n",
"2 Oliver\n",
" ... \n",
"2020719 Verona\n",
"2020720 Vertie\n",
"2020721 Wilma\n",
"Name: Name, Length: 2020722, dtype: object"
]
},
"execution_count": 94,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Shorthand for baby.loc[:, 'Name']\n",
"baby['Name']"
]
},
{
"cell_type": "code",
"execution_count": 96,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Name
\n",
"
Count
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
Liam
\n",
"
19659
\n",
"
\n",
"
\n",
"
1
\n",
"
Noah
\n",
"
18252
\n",
"
\n",
"
\n",
"
2
\n",
"
Oliver
\n",
"
14147
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
2020719
\n",
"
Verona
\n",
"
5
\n",
"
\n",
"
\n",
"
2020720
\n",
"
Vertie
\n",
"
5
\n",
"
\n",
"
\n",
"
2020721
\n",
"
Wilma
\n",
"
5
\n",
"
\n",
" \n",
"
\n",
"
2020722 rows × 2 columns
\n",
"
"
],
"text/plain": [
" Name Count\n",
"0 Liam 19659\n",
"1 Noah 18252\n",
"2 Oliver 14147\n",
"... ... ...\n",
"2020719 Verona 5\n",
"2020720 Vertie 5\n",
"2020721 Wilma 5\n",
"\n",
"[2020722 rows x 2 columns]"
]
},
"execution_count": 96,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Shorthand for baby.loc[:, ['Name', 'Count']]\n",
"baby[['Name', 'Count']]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Slicing using `.iloc` works similarly to `.loc`, except that `.iloc` uses the\n",
"*positions* of rows and columns rather than labels. It's easiest to show the\n",
"difference between `.iloc` and `.loc` when the dataframe index has strings, so\n",
"for demonstration purposes, let's look at a dataframe with information on dog\n",
"breeds:"
]
},
{
"cell_type": "code",
"execution_count": 97,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
grooming
\n",
"
food_cost
\n",
"
kids
\n",
"
size
\n",
"
\n",
"
\n",
"
breed
\n",
"
\n",
"
\n",
"
\n",
"
\n",
"
\n",
" \n",
" \n",
"
\n",
"
Labrador Retriever
\n",
"
weekly
\n",
"
466.0
\n",
"
high
\n",
"
medium
\n",
"
\n",
"
\n",
"
German Shepherd
\n",
"
weekly
\n",
"
466.0
\n",
"
medium
\n",
"
large
\n",
"
\n",
"
\n",
"
Beagle
\n",
"
daily
\n",
"
324.0
\n",
"
high
\n",
"
small
\n",
"
\n",
"
\n",
"
Golden Retriever
\n",
"
weekly
\n",
"
466.0
\n",
"
high
\n",
"
medium
\n",
"
\n",
"
\n",
"
Yorkshire Terrier
\n",
"
daily
\n",
"
324.0
\n",
"
low
\n",
"
small
\n",
"
\n",
"
\n",
"
Bulldog
\n",
"
weekly
\n",
"
466.0
\n",
"
medium
\n",
"
medium
\n",
"
\n",
"
\n",
"
Boxer
\n",
"
weekly
\n",
"
466.0
\n",
"
high
\n",
"
medium
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" grooming food_cost kids size\n",
"breed \n",
"Labrador Retriever weekly 466.0 high medium\n",
"German Shepherd weekly 466.0 medium large\n",
"Beagle daily 324.0 high small\n",
"Golden Retriever weekly 466.0 high medium\n",
"Yorkshire Terrier daily 324.0 low small\n",
"Bulldog weekly 466.0 medium medium\n",
"Boxer weekly 466.0 high medium"
]
},
"execution_count": 97,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dogs = pd.read_csv('dogs.csv', index_col='breed')\n",
"dogs"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"To get the first three rows and the first two columns by position, use `.iloc`:"
]
},
{
"cell_type": "code",
"execution_count": 98,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
" grooming food_cost\n",
"breed \n",
"Labrador Retriever weekly 466.0\n",
"German Shepherd weekly 466.0\n",
"Beagle daily 324.0"
]
},
"execution_count": 99,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dogs.loc['Labrador Retriever':'Beagle', 'grooming':'food_cost']"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we'll look at filtering rows."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Filtering Rows"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"So far, we've shown how to use `.loc` and `.iloc` to slice a dataframe using\n",
"labels and positions.\n",
"\n",
"However, data scientists often want to *filter* rows---they want to take\n",
"subsets of rows using some criteria. Let's say we want to find the most\n",
"popular baby names in 2020. To do this, we can filter rows to keep only the\n",
"rows where the `Year` is 2020.\n",
"\n",
"To filter, we'd like to check whether each value in the `Year`\n",
"column is equal to 1970 and then keep only those rows.\n",
"\n",
"To compare each value in `Year`, we slice out the column and make a boolean\n",
"comparison\n",
"(this is similar to what we'd do with a `numpy` array):"
]
},
{
"cell_type": "code",
"execution_count": 102,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Name
\n",
"
Sex
\n",
"
Count
\n",
"
Year
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
Liam
\n",
"
M
\n",
"
19659
\n",
"
2020
\n",
"
\n",
"
\n",
"
1
\n",
"
Noah
\n",
"
M
\n",
"
18252
\n",
"
2020
\n",
"
\n",
"
\n",
"
2
\n",
"
Oliver
\n",
"
M
\n",
"
14147
\n",
"
2020
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
2020719
\n",
"
Verona
\n",
"
F
\n",
"
5
\n",
"
1880
\n",
"
\n",
"
\n",
"
2020720
\n",
"
Vertie
\n",
"
F
\n",
"
5
\n",
"
1880
\n",
"
\n",
"
\n",
"
2020721
\n",
"
Wilma
\n",
"
F
\n",
"
5
\n",
"
1880
\n",
"
\n",
" \n",
"
\n",
"
2020722 rows × 4 columns
\n",
"
"
],
"text/plain": [
" Name Sex Count Year\n",
"0 Liam M 19659 2020\n",
"1 Noah M 18252 2020\n",
"2 Oliver M 14147 2020\n",
"... ... .. ... ...\n",
"2020719 Verona F 5 1880\n",
"2020720 Vertie F 5 1880\n",
"2020721 Wilma F 5 1880\n",
"\n",
"[2020722 rows x 4 columns]"
]
},
"execution_count": 102,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Here's the dataframe for reference\n",
"baby"
]
},
{
"cell_type": "code",
"execution_count": 100,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 2020\n",
"1 2020\n",
"2 2020\n",
" ... \n",
"2020719 1880\n",
"2020720 1880\n",
"2020721 1880\n",
"Name: Year, Length: 2020722, dtype: int64"
]
},
"execution_count": 100,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get a Series with the Year data\n",
"baby['Year']"
]
},
{
"cell_type": "code",
"execution_count": 101,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 True\n",
"1 True\n",
"2 True\n",
" ... \n",
"2020719 False\n",
"2020720 False\n",
"2020721 False\n",
"Name: Year, Length: 2020722, dtype: bool"
]
},
"execution_count": 101,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Compare with 2020\n",
"baby['Year'] == 2020"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that a boolean comparison on a `Series` gives a `Series` of booleans. This\n",
"is nearly equivalent to writing:\n",
"\n",
"```python\n",
"is_2020 = []\n",
"for value in baby['Year']:\n",
" is_2020.append(value == 2020)\n",
"```\n",
"\n",
"But the boolean comparison is easier to write and much faster to execute than a\n",
"`for` loop."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we tell `pandas` to keep only the rows where the comparison evaluated to `True`:"
]
},
{
"cell_type": "code",
"execution_count": 103,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Name
\n",
"
Sex
\n",
"
Count
\n",
"
Year
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
Liam
\n",
"
M
\n",
"
19659
\n",
"
2020
\n",
"
\n",
"
\n",
"
1
\n",
"
Noah
\n",
"
M
\n",
"
18252
\n",
"
2020
\n",
"
\n",
"
\n",
"
2
\n",
"
Oliver
\n",
"
M
\n",
"
14147
\n",
"
2020
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
31267
\n",
"
Zylynn
\n",
"
F
\n",
"
5
\n",
"
2020
\n",
"
\n",
"
\n",
"
31268
\n",
"
Zynique
\n",
"
F
\n",
"
5
\n",
"
2020
\n",
"
\n",
"
\n",
"
31269
\n",
"
Zynlee
\n",
"
F
\n",
"
5
\n",
"
2020
\n",
"
\n",
" \n",
"
\n",
"
31270 rows × 4 columns
\n",
"
"
],
"text/plain": [
" Name Sex Count Year\n",
"0 Liam M 19659 2020\n",
"1 Noah M 18252 2020\n",
"2 Oliver M 14147 2020\n",
"... ... .. ... ...\n",
"31267 Zylynn F 5 2020\n",
"31268 Zynique F 5 2020\n",
"31269 Zynlee F 5 2020\n",
"\n",
"[31270 rows x 4 columns]"
]
},
"execution_count": 103,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Passing a Series of booleans into .loc only keeps rows where the Series has\n",
"# a True value.\n",
"# ↓\n",
"baby.loc[baby['Year'] == 2020, :]"
]
},
{
"cell_type": "code",
"execution_count": 104,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Name
\n",
"
Sex
\n",
"
Count
\n",
"
Year
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
Liam
\n",
"
M
\n",
"
19659
\n",
"
2020
\n",
"
\n",
"
\n",
"
1
\n",
"
Noah
\n",
"
M
\n",
"
18252
\n",
"
2020
\n",
"
\n",
"
\n",
"
2
\n",
"
Oliver
\n",
"
M
\n",
"
14147
\n",
"
2020
\n",
"
\n",
"
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
...
\n",
"
\n",
"
\n",
"
31267
\n",
"
Zylynn
\n",
"
F
\n",
"
5
\n",
"
2020
\n",
"
\n",
"
\n",
"
31268
\n",
"
Zynique
\n",
"
F
\n",
"
5
\n",
"
2020
\n",
"
\n",
"
\n",
"
31269
\n",
"
Zynlee
\n",
"
F
\n",
"
5
\n",
"
2020
\n",
"
\n",
" \n",
"
\n",
"
31270 rows × 4 columns
\n",
"
"
],
"text/plain": [
" Name Sex Count Year\n",
"0 Liam M 19659 2020\n",
"1 Noah M 18252 2020\n",
"2 Oliver M 14147 2020\n",
"... ... .. ... ...\n",
"31267 Zylynn F 5 2020\n",
"31268 Zynique F 5 2020\n",
"31269 Zynlee F 5 2020\n",
"\n",
"[31270 rows x 4 columns]"
]
},
"execution_count": 104,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Filtering has a shorthand. This computes the same table as the snippet above\n",
"# without using .loc\n",
"baby[baby['Year'] == 2020]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, to find the most common names in 2020, sort the dataframe by `Count`\n",
"in descending order:"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Name
\n",
"
Sex
\n",
"
Count
\n",
"
Year
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
Liam
\n",
"
M
\n",
"
19659
\n",
"
2020
\n",
"
\n",
"
\n",
"
1
\n",
"
Noah
\n",
"
M
\n",
"
18252
\n",
"
2020
\n",
"
\n",
"
\n",
"
13911
\n",
"
Emma
\n",
"
F
\n",
"
15581
\n",
"
2020
\n",
"
\n",
"
\n",
"
2
\n",
"
Oliver
\n",
"
M
\n",
"
14147
\n",
"
2020
\n",
"
\n",
"
\n",
"
13912
\n",
"
Ava
\n",
"
F
\n",
"
13084
\n",
"
2020
\n",
"
\n",
"
\n",
"
3
\n",
"
Elijah
\n",
"
M
\n",
"
13034
\n",
"
2020
\n",
"
\n",
"
\n",
"
13913
\n",
"
Charlotte
\n",
"
F
\n",
"
13003
\n",
"
2020
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Name Sex Count Year\n",
"0 Liam M 19659 2020\n",
"1 Noah M 18252 2020\n",
"13911 Emma F 15581 2020\n",
"2 Oliver M 14147 2020\n",
"13912 Ava F 13084 2020\n",
"3 Elijah M 13034 2020\n",
"13913 Charlotte F 13003 2020"
]
},
"execution_count": 106,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Wrapping a long expression in parentheses lets us easily add\n",
"# line breaks to make it more readable.\n",
"(baby[baby['Year'] == 2020]\n",
" .sort_values('Count', ascending=False)\n",
" .head(7) # take the first seven rows\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that Liam, Noah, and Emma were the most popular baby names in 2020."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Example: How Recently has Luna Become a Popular Name?\n",
"\n",
"The _New York Times_ article mentions that the name Luna was almost nonexistent\n",
"before 2000 but has since grown to become a very popular name for girls.\n",
"When exactly did Luna become popular?\n",
"We can\n",
"check this using slicing and filtering. When approaching a data manipulation\n",
"task, we recommend breaking the problem down into smaller steps. For\n",
"example, we could think:\n",
"\n",
"1. Filter: keep only rows with `'Luna'` in the `Name` column.\n",
"1. Filter: keep only rows with `'F'` in the `Sex` column.\n",
"1. Slice: keep the `Count` and `Year` columns.\n",
"\n",
"Now it's a matter of translating each step into code:"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"