7.6. Exercises

  • Write a SQL query on the baby relation, read the result into a pandas dataframe, and create a plot of how popular your name was over time. If you used that plot to make a guess at your age, what would you guess? Is that close to your actual age? Think of a potential reason.

  • When plotting male and female baby names over time, you might notice that after 1950 there are generally more male babies. Is this trend reflected in the U.S. census data? Go to the Census website (https://data.census.gov/cedsci/) and check.

  • Let’s say you want to find the most popular male and female baby name each year. You might write this query:

    SELECT Year, Sex, MAX(Count), MAX(Name)
    FROM baby
    GROUP BY Year, Sex
    

    But this code doesn’t produce the right result. Why?

    Now, write code to produce the most popular male and female name each year, along with its count. Hint: The answer is a bit simpler than you might expect because of a special SQLite rule (search for “Bare columns in an aggregate queries”).

  • Come up with a realistic data example where a data scientist would prefer an inner join to a left join, and an example where a data scientist would prefer a left join to an inner join.

  • In the section on Joins, the nyt table doesn’t have any duplicate names. But a name could feasibly belong to multiple categories—for instance, Elizabeth is a name from the Bible and a name for royalty. Let’s say we have a relation called multi_cat that can list a name multiple times—once for each category it belongs to:

query = ''' 
SELECT *
FROM multi_cat
'''

pd.read_sql(query, db)
nyt_name category
0 Elizabeth bible
1 Elizabeth royal
2 Arjun hindu
3 Arjun mythological

What happens when we join baby with this table? In general, what happens when there are multiple rows that match in both left and right tables?

  • In a self-join, we take a table and join it with itself. For example, the friends relation contains pairs of people who are friends with each other.

query = ''' 
SELECT *
FROM friends
'''

pd.read_sql(query, db)
self other
0 Jim Scott
1 Scott Philip
2 Philip Tricia
3 Philip Ailie

Why might a data scientist find the following self-join useful?

query = ''' 
SELECT *
FROM friends AS f1
  INNER JOIN friends AS f2
  ON f1.other = f2.self
'''

pd.read_sql(query, db)
self other self other
0 Jim Scott Scott Philip
1 Scott Philip Philip Ailie
2 Scott Philip Philip Tricia
  • The self-join query in the previous exercise uses the AS keyword to rename each relation. Why do we need to rename the relations in that query? In general, under what circumstances do we need to rename relations in the FROM clause?

  • Have names become longer on average over time? Write a SQL query, read the result into a pandas dataframe, then produce a plot to answer this question.

  • What does the following SQL query compute? What does the stat column contain?

query = ''' 
WITH yearly_avgs AS (
  SELECT Name, AVG(Count) AS avg
  FROM baby
  GROUP BY Name
),
sq_diffs AS (
  SELECT b.Name, POWER(Count - avg, 2) AS sq_diff
  FROM baby AS b JOIN yearly_avgs AS y
    ON b.Name = y.Name
)
SELECT Name, POWER(AVG(sq_diff), 0.5) AS stat
FROM sq_diffs
GROUP BY Name
ORDER BY stat DESC
LIMIT 10
'''

pd.read_sql(query, db)
Name stat
0 Michael 28296.71
1 Robert 26442.69
2 James 26434.03
... ... ...
7 Richard 15849.65
8 Patricia 13428.25
9 Matthew 13165.15

10 rows × 2 columns

  • In this chapter we found that you could make reasonable guesses at a person’s age just by knowing their name. For instance, the name “Luna” has sharply risen in popularity after 2000, so you could guess that a person named “Luna” was born around after 2000. Can you make reasonable guesses at a person’s age just from the first letter of their name?

    Answer this question by writing a SQL query, reading the result into a pandas dataframe, then producing plots to see whether this is possible. Then, see which first letters provide the most information about a person’s age.