Data scientists transform columns when they need to change each value in a feature in the same way. For example, if a feature contains heights of people in feet, a data scientist might want to transform the heights to centimeters. In this section, we’ll show how to apply functions that transform columns of data using SQL.
# Set up connection to database import sqlalchemy db = sqlalchemy.create_engine('sqlite:///babynames.db')
query = ''' SELECT * FROM baby LIMIT 10; ''' pd.read_sql(query, db)
10 rows × 4 columns
In the baby names New York Times article [Williams, 2021], Pamela
mentions that names starting with the letter “L” and “K” became popular
after 2000. On the other hand, names starting with the letter “J” peaked in
popularity in the 1970s and 1980s and have dropped off in popularity since. We
can verify these claims using the
We approach this problem using the following steps:
Namecolumn into a new column that contains the first letters of each value in
Group the dataframe by the first letter and year.
Aggregate the name counts by summing.
To complete the first step, we’ll apply a function to the
SQLite provides scalar functions, or functions that transform single data
values. When called on a column of data, SQLite will apply these functions on
each value in the column. In contrast, aggregation functions like
COUNT take a column of values as input and compute a single value as output.
SQLite provides a comprehensive list of the built-in scalar functions on its
website 1. For instance, to find the lengths of each name, we use the
query = ''' SELECT Name, LENGTH(Name) FROM baby LIMIT 10; ''' pd.read_sql(query, db)
10 rows × 2 columns
Notice that the
LENGTH function is applied to each value within the
Like aggregation functions, each implementation of SQL provides a different set
of scalar functions. SQLite has a relatively minimal set of functions, while
PostgreSQL has many more 2. Most SQL implementations provide some
equivalent to SQLite’s
Calling a scalar function uses the same syntax as an aggregation function. This can result in confusing output if the two are mixed together in a single query:
query = ''' SELECT Name, LENGTH(Name), COUNT(Name) FROM baby LIMIT 10; ''' pd.read_sql(query, db)
For this reason, we must be careful when when scalar and aggregation functions
appear together within a
To extract the first letter of each name, we can use the
(short for “substring”). As described in the documentation, the
function takes three arguments. The first is the input string, the second is
the position to begin the substring (1-indexed), and the third is the
length of the substring.
query = ''' SELECT Name, SUBSTR(Name, 1, 1) FROM baby LIMIT 10; ''' pd.read_sql(query, db)
|Name||SUBSTR(Name, 1, 1)|
10 rows × 2 columns
Now, we can use the
AS keyword to rename the column:
query = ''' SELECT *, SUBSTR(Name, 1, 1) AS Firsts FROM baby LIMIT 10; ''' pd.read_sql(query, db)
10 rows × 5 columns
This completes step 1 of our analysis plan. SQL provides several options to
break queries into smaller steps, which is helpful in a more complex analysis
like this one. One option is to create an entirely new relation using the
CREATE TABLE statement. Another option is to use the
WITH keyword to
create a temporary relation just for the query at hand. We’ll demonstrate
WITH keyword for this example.
7.4.2. Multistep Queries Using a
WITH clause lets us assign a name to any
SELECT query. Then, we can
treat that query as though it exists as a relation in the database. For
instance, we can take the query above that calculates the first letter of each
name and call it
query = ''' WITH letters AS ( SELECT *, SUBSTR(Name, 1, 1) AS Firsts FROM baby ) SELECT * FROM letters LIMIT 10; ''' pd.read_sql(query, db)
10 rows × 5 columns
We can read this query as follows:
-- Create a temporary relation called letters by calculating the first -- letters for each name in baby WITH letters AS ( SELECT *, SUBSTR(Name, 1, 1) AS Firsts FROM baby ) -- Then, select the first ten rows from letters SELECT * FROM letters LIMIT 10;
WITH statements are highly useful since they can be chained together. We can
create multiple temporary relations in a
WITH statement that each perform a
bit of work on the previous result, which lets us gradually build complicated
queries a step at a time.
7.4.3. Example: Popularity of “L” Names¶
Now, we can perform the next steps in our analysis. We’ll group the relation by the
first letter and year, then aggregate the
Count column using a sum.
query = ''' WITH letters AS ( SELECT *, SUBSTR(Name, 1, 1) AS Firsts FROM baby ) SELECT Firsts, Year, SUM(Count) AS Count FROM letters GROUP BY Firsts, Year; ''' letter_counts = pd.read_sql(query, db) letter_counts
3641 rows × 3 columns
Finally, we use
pandas to plot the popularity of “L” names over time:
(letter_counts .loc[letter_counts['Firsts'] == 'L'] .plot('Year', 'Count') ) plt.title('Popularity of "L" names');
The plot shows that “L” names were popular in the 1960s, dipped in the decades after, but have indeed resurged in popularity after 2000.
What about “J” names?
(letter_counts .loc[letter_counts['Firsts'] == 'J'] .plot('Year', 'Count') ) plt.title('Popularity of "J" names');
The NYT article says that “J” names were popular in the 1970s and 80s. The plot agrees, and also shows that they have become less popular after 2000.