# 6.7. Exercises¶

• Use the `baby` data to create a plot of how popular your name was over time. If you used that plot to make a guess at your age, what would you guess? Is that close to your actual age? Think of a potential reason.

• In this chapter we talked about how to use `.loc` and `.iloc` for slicing. We’ve also shown a few shorthands. For each of these shorthand code snippets, convert them to the equivalent code that uses `.loc` or `.iloc`.

```baby['Name']
```
```baby[0:5]
```
```baby[['Name', 'Count']]
```
```baby[baby['Count'] < 10]
```
• What’s the difference between running:

```baby['Name']
```

and:

```baby[['Name']]
```

And, why does this code work:

```baby[['Name']].iloc[0:5, 0]
```

but this code errors?

```baby['Name'].iloc[0:5, 0]
```
• The first code snippet below makes a dataframe with 6 rows, but the second makes a dataframe with 5 rows. Why?

```baby.loc[0:5]
```
```baby.iloc[0:5]
```
• When plotting male and female baby names over time, you might notice that after 1950 there are generally more male babies. Is this trend reflected in the U.S. census data? Go to the Census website (https://data.census.gov/cedsci/) and check.

• Find the five names with the highest standard deviation of yearly counts. What might a large standard deviation tell you about the popularity of these names over time?

• Find the five names with the highest interquartile range of yearly counts. The interquartile range is the 75th percentile minus the 25th percentile of the data. You may find the `pd.Series.quantile()` function useful (link to documentation). Are these names different than the names with the highest standard deviation? Why might this happen?

• We’ve shown this syntax for grouping:

```baby.groupby('Year')['Count'].sum()
```

This code also does the same thing:

```baby.groupby(baby['Year'])['Count'].sum()
```

The second syntax passes a `pd.Series` into `.groupby()`. It’s a bit more verbose but also gives more flexibility. Why is this syntax more flexible?

Hint: What does this code do?

```baby.groupby(baby['Year'] // 10 * 10)['Count'].sum()
```
• Let’s say you want to find the most popular male and female baby name each year. You might write this:

```(baby
.groupby([['Year', 'Sex']])
[['Count', 'Name']]
.max()
)
```

But this code doesn’t produce the right result. Why?

Now, write code to produce the most popular male and female name each year, along with its count. Hint: you can make use of the fact that within each year and birth sex, the names are sorted in descending order of popularity.

• Come up with a realistic data example where a data scientist would prefer an inner join to a left join, and an example where a data scientist would prefer a left join to an inner join.

• In the section on Joins, the `nyt` table doesn’t have any duplicate names. But a name could feasibly belong to multiple categories—for instance, `Elizabeth` is a name from the Bible and a name for royalty. Let’s say we have a dataframe called `multi_cat` that can list a name multiple times—once for each category it belongs to:

```multi_cat = pd.DataFrame([
['Elizabeth', 'bible'],
['Elizabeth', 'royal'],
['Arjun', 'hindu'],
['Arjun', 'mythological'],
], columns=nyt_small.columns)
multi_cat
```
nyt_name category
0 Elizabeth bible
1 Elizabeth royal
2 Arjun hindu
3 Arjun mythological

What happens when we join `baby` with this table? In general, what happens when there are multiple rows that match in both left and right tables?

• In a self-join, we take a table and join it with itself. For example, the `friends` table contains pairs of people who are friends with each other.

```friends = pd.DataFrame([
['Jim', 'Scott'],
['Scott', 'Philip'],
['Philip', 'Tricia'],
['Philip', 'Ailie'],
], columns=['self', 'other'])
friends
```
self other
0 Jim Scott
1 Scott Philip
2 Philip Tricia
3 Philip Ailie

```friends.merge(friends, left_on='other', right_on='self')