Open on DataHub
# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/03'))
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
pd.set_option('precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

Getting Started

In the remaining sections of this chapter we will work with the Baby Names dataset from Chapter 1. We will pose a question, break the question down into high-level steps, then translate each step into Python code using pandas DataFrames. We begin by importing pandas:

import pandas as pd

Now we can read in the data using pd.read_csv (docs).

baby = pd.read_csv('babynames.csv')
baby
Name Sex Count Year
0 Mary F 9217 1884
1 Anna F 3860 1884
2 Emma F 2587 1884
... ... ... ... ...
1891891 Verna M 5 1883
1891892 Winnie M 5 1883
1891893 Winthrop M 5 1883

1891894 rows × 4 columns

Note that for the code above to work, the babynames.csv file must be located in the same directory as this notebook. We can check what files are in the current folder by running the ls command-line tool:

!ls
babynames.csv                       pandas_indexes.ipynb
others                              pandas_intro.ipynb
pandas_apply_strings_plotting.ipynb pandas_structure.ipynb
pandas_grouping_pivoting.ipynb

When we use pandas to read in data, we get a DataFrame. A DataFrame is a tabular data structure where each column is labeled (in this case 'Name', 'Sex', 'Count', 'Year') and each row is labeled (in this case 0, 1, 2, ..., 1891893). Note that the Table object introduced in Data 8 only labels columns while DataFrames label both columns and rows.

The labels of a DataFrame are called the indexes of the DataFrame and make many data manipulations easier.

Indexes, Slicing, and Sorting

Let's use pandas to answer the following question:

What were the five most popular baby names in 2016?

Breaking the Problem Down

We can decompose this question into the following simpler table manipulations:

  1. Slice out the rows for the year 2016.
  2. Sort the rows in descending order by Count.

Now, we can express these steps in pandas.

Slicing using .loc

To select subsets of a DataFrame, we use the .loc slicing syntax. The first argument is the label of the row and the second is the label of the column:

baby
Name Sex Count Year
0 Mary F 9217 1884
1 Anna F 3860 1884
2 Emma F 2587 1884
... ... ... ... ...
1891891 Verna M 5 1883
1891892 Winnie M 5 1883
1891893 Winthrop M 5 1883

1891894 rows × 4 columns

baby.loc[1, 'Name'] # Row labeled 1, Column labeled 'Name'
'Anna'

To slice out multiple rows or columns, we can use :. Note that .loc slicing is inclusive, unlike Python's slicing.

# Get rows 1 through 5, columns Name through Count inclusive
baby.loc[1:5, 'Name':'Count']
Name Sex Count
1 Anna F 3860
2 Emma F 2587
3 Elizabeth F 2549
4 Minnie F 2243
5 Margaret F 2142

We will often want a single column from a DataFrame:

baby.loc[:, 'Year']
0          1884
1          1884
2          1884
           ... 
1891891    1883
1891892    1883
1891893    1883
Name: Year, Length: 1891894, dtype: int64

Note that when we select a single column, we get a pandas Series. A Series is like a one-dimensional NumPy array since we can perform arithmetic on all the elements at once.

baby.loc[:, 'Year'] * 2
0          3768
1          3768
2          3768
           ... 
1891891    3766
1891892    3766
1891893    3766
Name: Year, Length: 1891894, dtype: int64

To select out specific columns, we can pass a list into the .loc slice:

# This is a DataFrame again
baby.loc[:, ['Name', 'Year']]
Name Year
0 Mary 1884
1 Anna 1884
2 Emma 1884
... ... ...
1891891 Verna 1883
1891892 Winnie 1883
1891893 Winthrop 1883

1891894 rows × 2 columns

Selecting columns is common, so there's a shorthand.

# Shorthand for baby.loc[:, 'Name']
baby['Name']
0              Mary
1              Anna
2              Emma
             ...   
1891891       Verna
1891892      Winnie
1891893    Winthrop
Name: Name, Length: 1891894, dtype: object
# Shorthand for baby.loc[:, ['Name', 'Count']]
baby[['Name', 'Count']]
Name Count
0 Mary 9217
1 Anna 3860
2 Emma 2587
... ... ...
1891891 Verna 5
1891892 Winnie 5
1891893 Winthrop 5

1891894 rows × 2 columns

Slicing rows using a predicate

To slice out the rows with year 2016, we will first create a Series containing True for each row we want to keep and False for each row we want to drop. This is simple because math and boolean operators on Series are applied to each element in the Series.

# Series of years
baby['Year']
0          1884
1          1884
2          1884
           ... 
1891891    1883
1891892    1883
1891893    1883
Name: Year, Length: 1891894, dtype: int64
# Compare each year with 2016
baby['Year'] == 2016
0          False
1          False
2          False
           ...  
1891891    False
1891892    False
1891893    False
Name: Year, Length: 1891894, dtype: bool

Once we have this Series of True and False, we can pass it into .loc.

# We are slicing rows, so the boolean Series goes in the first
# argument to .loc
baby_2016 = baby.loc[baby['Year'] == 2016, :]
baby_2016
Name Sex Count Year
1850880 Emma F 19414 2016
1850881 Olivia F 19246 2016
1850882 Ava F 16237 2016
... ... ... ... ...
1883745 Zyahir M 5 2016
1883746 Zyel M 5 2016
1883747 Zylyn M 5 2016

32868 rows × 4 columns

Sorting Rows

The next step is the sort the rows in descending order by 'Count'. We can use the sort_values() function.

sorted_2016 = baby_2016.sort_values('Count', ascending=False)
sorted_2016
Name Sex Count Year
1850880 Emma F 19414 2016
1850881 Olivia F 19246 2016
1869637 Noah M 19015 2016
... ... ... ... ...
1868752 Mikaelyn F 5 2016
1868751 Miette F 5 2016
1883747 Zylyn M 5 2016

32868 rows × 4 columns

Finally, we will use .iloc to slice out the first five rows of the DataFrame. .iloc works like .loc but takes in numerical indices instead of labels. It does not include the right endpoint in its slices, like Python's list slicing.

# Get the value in the zeroth row, zeroth column
sorted_2016.iloc[0, 0]
'Emma'
# Get the first five rows
sorted_2016.iloc[0:5]
Name Sex Count Year
1850880 Emma F 19414 2016
1850881 Olivia F 19246 2016
1869637 Noah M 19015 2016
1869638 Liam M 18138 2016
1850882 Ava F 16237 2016

In Conclusion

We now have the five most popular baby names in 2016 and learned to express the following operations in pandas:

Operation pandas
Read a CSV file pd.read_csv()
Slicing using labels or indices .loc and .iloc
Slicing rows using a predicate Use a boolean-valued Series in .loc
Sorting rows .sort_values()