Data scientists very frequently want to join two or more relations together in order to connect records between relations. For instance, an online bookstore might have one relation with the books each user has ordered and a second relation with the genres of each book. By joining the two relations together, the data scientist can see what genres each user prefers.
We’ll continue looking at the baby names data. We’ll use joins to check some trends mentioned in the New York Times article about baby names [Williams, 2021]. The article talks about how certain categories of names have become more or less popular over time. For instance, it mentions that mythological names like Julius and Cassius have become popular, while baby boomer names like Susan and Debbie have become less popular. How has the popularity of these categories changed over time?
We’ve taken the names and categories in the NYT article and put them in a small
# Set up connection to database import sqlalchemy db = sqlalchemy.create_engine('sqlite:///babynames.db')
query = ''' SELECT * FROM nyt; ''' pd.read_sql(query, db)
23 rows × 2 columns
Notice that the code above runs a query on
babynames.db, the same database
that contains the larger
baby relation from the previous sections. SQL
databases can hold more than one relation, making them very useful when we need
to work with many data tables at once. CSV files, on the other hand, typically
contain one data table each—if we perform a data analysis that uses twenty
data tables, we might need to keep track of the names, locations, and versions
of twenty CSV files. Instead, it could be simpler to store all the data tables
in a SQLite database stored in a single file.
To see how popular the categories of names are, we join the
nyt relation with
baby relation to get the name counts from
baby. We’ll start by
displaying the first few rows of the
query = ''' SELECT * FROM baby LIMIT 10; ''' pd.read_sql(query, db)
10 rows × 4 columns
Imagine going down each row in
baby and asking, is this name in the
table? If so, then add the value in the
category column to the row. That’s
the basic idea behind a join. Let’s look at a few simpler examples first.
7.3.1. Inner Joins¶
We’ve made smaller versions of the
nyt tables so it’s easier to
see what happens when we join tables together. The relations are called
query = ''' SELECT * FROM baby_small; ''' pd.read_sql(query, db)
query = ''' SELECT * FROM nyt_small; ''' pd.read_sql(query, db)
To join tables in SQL, we use the
JOIN clause to say which tables we want to
join, and the
ON clause to specify a predicate for joining the tables. Here’s
query = ''' SELECT * FROM baby_small JOIN nyt_small ON baby_small.Name = nyt_small.nyt_name ''' pd.read_sql(query, db)
Notice that the new table has the columns of both
tables. The rows with the name Noah are gone. And the remaining rows have their
The query above can be read as follows:
SELECT * FROM baby_small JOIN nyt_small -- the tables to join ON baby_small.Name = nyt_small.nyt_name -- only join rows together when the names are equal
When we join two tables together, we tell SQL the column(s) from each
table that we want to use to join using a predicate with the
SQL matches rows together when the values in the joining columns match, as
shown in Fig. 7.3.
By default, SQL does an inner join. If either table has rows that don’t
have matches in the other table, SQL drops those rows from the result. In
this case, the Noah rows in
baby_small don’t have matches in
nyt_small, so they are dropped. Also, the Freya row in
doesn’t have matches in
baby_small, so it’s dropped as well. Only the
rows with a match in both tables stay in the final result.
7.3.2. Left, Right, and Outer Joins¶
We sometimes want to keep rows without a match instead of dropping them entirely. There are other types of joins—left, right, and outer—that keep rows even when they don’t have a match.
In a left join, rows in the left table without a match are kept in the final result, as shown in Fig. 7.4.
To do a left join in
LEFT JOIN instead of
query = ''' SELECT * FROM baby_small LEFT JOIN nyt_small ON baby_small.Name = nyt_small.nyt_name ''' pd.read_sql(query, db)
Notice that the Noah rows are kept in the final table. Since those rows didn’t
have a match in the
nyt_small dataframe, SQL leaves
NULL values in the
category columns (which are then converted to
when read into a
pandas dataframe). Also, notice that the Freya row in
nyt_small is still dropped.
A right join works similarly to the left join, except that non-matching rows
in the right table are kept instead of the left table. SQLite doesn’t support
right joins directly, but we can perform the same join by reversing the order
of relations when we use
query = ''' SELECT * FROM nyt_small LEFT JOIN baby_small ON baby_small.Name = nyt_small.nyt_name ''' pd.read_sql(query, db)
Finally, an outer join keeps rows from both tables even when they don’t have
a match. SQLite doesn’t have a built-in keyword for outer joins. In cases where
an outer join is needed, we have to either use a different SQL engine or
perform an outer join via
pandas. However, in our (the author’s) experience,
outer joins are rarely used in practice compared to inner and left joins.
7.3.3. Example: Popularity of NYT Name Categories¶
Now, let’s return to the full
query = ''' SELECT * FROM baby LIMIT 10 ''' pd.read_sql(query, db)
10 rows × 4 columns
query = ''' SELECT * FROM nyt ''' pd.read_sql(query, db)
23 rows × 2 columns
We want to know how the popularity of name categories in
nyt have changed
over time. To answer this question:
nyt, matching rows where the names are equal.
Group the table by
Aggregate the counts using a sum.
query = ''' SELECT category, Year, SUM(Count) AS count --  FROM baby JOIN nyt --  ON baby.Name = nyt.nyt_name --  GROUP BY category, Year --  ''' cate_counts = pd.read_sql(query, db) cate_counts
650 rows × 3 columns
The bracketed numbers (
) in the query above show how each
step in our plan maps to the parts of the SQL query. Notice that the numbers
appear out of order. We often think of the
SELECT statement as the last
piece of the query to execute although it appears first.
Now, we can plot the popularity of individual categories:
boomers = px.line(cate_counts.query('category == "boomer"'), x='Year', y='count') myths = px.line(cate_counts.query('category == "mythology"'), x='Year', y='count') fig = left_right(boomers, myths, width=500, height=200, subplot_titles=['Boomer Names', 'Mythological Names']) margin(fig, t=30) fig
As the NYT article claims, “baby boomer” names have become less popular after 2000, while mythological names have become more popular.
We can also plot the popularities of all the categories at once. Take a look at the plots below and see whether they support the claims made in the New York Times article.
fig = px.line(cate_counts, x='Year', y='count', facet_col='category', facet_col_wrap=3, facet_row_spacing=0.15, width=600, height=400) margin(fig, t=30) fig.update_yaxes(matches=None, showticklabels=False)
When joining relations together, we match rows using the
JOIN keyword and
a boolean predicate.
SQL also allows us to specify the type of join (
when performing a join.
In the next section, we’ll explain how to transform values in a relation.