Exercises
6.7. Exercises¶
Use the
baby
data to create a plot of how popular your name was over time. If you used that plot to make a guess at your age, what would you guess? Is that close to your actual age? Think of a potential reason.
In this chapter we talked about how to use
.loc
and.iloc
for slicing. We’ve also shown a few shorthands. For each of these shorthand code snippets, convert them to the equivalent code that uses.loc
or.iloc
.baby['Name']
baby[0:5]
baby[['Name', 'Count']]
baby[baby['Count'] < 10]
What’s the difference between running:
baby['Name']
and:
baby[['Name']]
And, why does this code work:
baby[['Name']].iloc[0:5, 0]
but this code errors?
baby['Name'].iloc[0:5, 0]
The first code snippet below makes a dataframe with 6 rows, but the second makes a dataframe with 5 rows. Why?
baby.loc[0:5]
baby.iloc[0:5]
When plotting male and female baby names over time, you might notice that after 1950 there are generally more male babies. Is this trend reflected in the U.S. census data? Go to the Census website (https://data.census.gov/cedsci/) and check.
Find the five names with the highest standard deviation of yearly counts. What might a large standard deviation tell you about the popularity of these names over time?
Find the five names with the highest interquartile range of yearly counts. The interquartile range is the 75th percentile minus the 25th percentile of the data. You may find the
pd.Series.quantile()
function useful (link to documentation). Are these names different than the names with the highest standard deviation? Why might this happen?
We’ve shown this syntax for grouping:
baby.groupby('Year')['Count'].sum()
This code also does the same thing:
baby.groupby(baby['Year'])['Count'].sum()
The second syntax passes a
pd.Series
into.groupby()
. It’s a bit more verbose but also gives more flexibility. Why is this syntax more flexible?Hint: What does this code do?
baby.groupby(baby['Year'] // 10 * 10)['Count'].sum()
Let’s say you want to find the most popular male and female baby name each year. You might write this:
(baby .groupby([['Year', 'Sex']]) [['Count', 'Name']] .max() )
But this code doesn’t produce the right result. Why?
Now, write code to produce the most popular male and female name each year, along with its count. Hint: you can make use of the fact that within each year and birth sex, the names are sorted in descending order of popularity.
Come up with a realistic data example where a data scientist would prefer an inner join to a left join, and an example where a data scientist would prefer a left join to an inner join.
In the section on Joins, the
nyt
table doesn’t have any duplicate names. But a name could feasibly belong to multiple categories—for instance,Elizabeth
is a name from the Bible and a name for royalty. Let’s say we have a dataframe calledmulti_cat
that can list a name multiple times—once for each category it belongs to:
multi_cat = pd.DataFrame([
['Elizabeth', 'bible'],
['Elizabeth', 'royal'],
['Arjun', 'hindu'],
['Arjun', 'mythological'],
], columns=nyt_small.columns)
multi_cat
nyt_name | category | |
---|---|---|
0 | Elizabeth | bible |
1 | Elizabeth | royal |
2 | Arjun | hindu |
3 | Arjun | mythological |
What happens when we join baby
with this table? In general, what happens when
there are multiple rows that match in both left and right tables?
In a self-join, we take a table and join it with itself. For example, the
friends
table contains pairs of people who are friends with each other.
friends = pd.DataFrame([
['Jim', 'Scott'],
['Scott', 'Philip'],
['Philip', 'Tricia'],
['Philip', 'Ailie'],
], columns=['self', 'other'])
friends
self | other | |
---|---|---|
0 | Jim | Scott |
1 | Scott | Philip |
2 | Philip | Tricia |
3 | Philip | Ailie |
Why might a data scientist find the following self-join useful?
friends.merge(friends, left_on='other', right_on='self')
self_x | other_x | self_y | other_y | |
---|---|---|---|---|
0 | Jim | Scott | Scott | Philip |
1 | Scott | Philip | Philip | Tricia |
2 | Scott | Philip | Philip | Ailie |
Have names become longer on average over time? Produce a plot to answer this question.
In this chapter we found that you could make reasonable guesses at a person’s age just by knowing their name. For instance, the name “Luna” has sharply risen in popularity after 2000, so you could guess that a person named “Luna” was born around after 2000. Can you make reasonable guesses at a person’s age just from the first letter of their name? Write code to see whether this is possible, and which first letters provide the most information about a person’s age.