Write a SQL query on the
babyrelation, read the result into a
pandasdataframe, and create a plot of how popular your name was over time. If you used that plot to make a guess at your age, what would you guess? Is that close to your actual age? Think of a potential reason.
When plotting male and female baby names over time, you might notice that after 1950 there are generally more male babies. Is this trend reflected in the U.S. census data? Go to the Census website (https://data.census.gov/cedsci/) and check.
Let’s say you want to find the most popular male and female baby name each year. You might write this query:
SELECT Year, Sex, MAX(Count), MAX(Name) FROM baby GROUP BY Year, Sex
But this code doesn’t produce the right result. Why?
Now, write code to produce the most popular male and female name each year, along with its count. Hint: The answer is a bit simpler than you might expect because of a special SQLite rule (search for “Bare columns in an aggregate queries”).
Come up with a realistic data example where a data scientist would prefer an inner join to a left join, and an example where a data scientist would prefer a left join to an inner join.
In the section on Joins, the
nyttable doesn’t have any duplicate names. But a name could feasibly belong to multiple categories—for instance,
Elizabethis a name from the Bible and a name for royalty. Let’s say we have a relation called
multi_catthat can list a name multiple times—once for each category it belongs to:
query = ''' SELECT * FROM multi_cat ''' pd.read_sql(query, db)
What happens when we join
baby with this table? In general, what happens when
there are multiple rows that match in both left and right tables?
In a self-join, we take a table and join it with itself. For example, the
friendsrelation contains pairs of people who are friends with each other.
query = ''' SELECT * FROM friends ''' pd.read_sql(query, db)
Why might a data scientist find the following self-join useful?
query = ''' SELECT * FROM friends AS f1 INNER JOIN friends AS f2 ON f1.other = f2.self ''' pd.read_sql(query, db)
The self-join query in the previous exercise uses the
ASkeyword to rename each relation. Why do we need to rename the relations in that query? In general, under what circumstances do we need to rename relations in the
Have names become longer on average over time? Write a SQL query, read the result into a
pandasdataframe, then produce a plot to answer this question.
What does the following SQL query compute? What does the
query = ''' WITH yearly_avgs AS ( SELECT Name, AVG(Count) AS avg FROM baby GROUP BY Name ), sq_diffs AS ( SELECT b.Name, POWER(Count - avg, 2) AS sq_diff FROM baby AS b JOIN yearly_avgs AS y ON b.Name = y.Name ) SELECT Name, POWER(AVG(sq_diff), 0.5) AS stat FROM sq_diffs GROUP BY Name ORDER BY stat DESC LIMIT 10 ''' pd.read_sql(query, db)
10 rows × 2 columns
In this chapter we found that you could make reasonable guesses at a person’s age just by knowing their name. For instance, the name “Luna” has sharply risen in popularity after 2000, so you could guess that a person named “Luna” was born around after 2000. Can you make reasonable guesses at a person’s age just from the first letter of their name?
Answer this question by writing a SQL query, reading the result into a
pandasdataframe, then producing plots to see whether this is possible. Then, see which first letters provide the most information about a person’s age.