13.4. Text Analysis

So far, we’ve used Python methods and regular expressions to clean short text fields and strings. In this section, we’ll analyze entire documents using a technique called text mining, which transforms free-form text into a quantitative representation to uncover meaningful patterns and insights.

Text mining is a deep topic. Instead of a comprehensive treatment, we’ll introduce a few key ideas through an example. In this section, we analyze the State of the Union speeches from 1790 to 2022. Every year, the US president gives a State of the Union speech to Congress. These speeches talk about current events in the country and make recommendations for Congress to consider. The American Presidency Project makes these speeches available online [1].

We’ll start by opening the file that has all of the speeches.

from pathlib import Path

with Path('data/stateoftheunion1790-2022.txt').open(mode="r") as f:
    text = f.read()

Earlier in the chapter, we saw that each speech in the data begins with a line containing three asterisks: ***. We can use a regular expression to count the number of times the string *** appears.

import re
num_speeches = len(re.findall(r"\*\*\*", text))
print(f'There are {num_speeches} speeches total')
There are 232 speeches total

In text analysis, a document refers to a single piece of text that we want to analyze. Here, each speech is a document. We’ll split apart the text variable into its individual documents.

records = text.split("***")

The first element of records holds whatever comes before the first *** marker, so it isn’t a speech. Skipping it, we can put the speeches into a dataframe:

import pandas as pd

def extract_parts(speech):
    # Skip the first header line, then unpack the speaker's name,
    # the date, and the remaining lines of the speech body.
    speech = speech.strip().split('\n')[1:]
    [name, date, *lines] = speech
    body = '\n'.join(lines).strip()
    return [name, date, body]

def read_speeches():
    return pd.DataFrame([extract_parts(rec) for rec in records[1:]],
                        columns=["name", "date", "text"])

df = read_speeches()
df
                     name              date                                               text
0       George Washington   January 8, 1790  Fellow-Citizens of the Senate and House of Rep...
1       George Washington  December 8, 1790  Fellow-Citizens of the Senate and House of Rep...
2       George Washington  October 25, 1791  Fellow-Citizens of the Senate and House of Rep...
...                   ...               ...                                                ...
229       Donald J. Trump  February 4, 2020  Thank you very much. Thank you. Thank you very...
230  Joseph R. Biden, Jr.    April 28, 2021  Thank you. Thank you. Thank you. Good to be ba...
231  Joseph R. Biden, Jr.     March 1, 2022  Madam Speaker, Madam Vice President, our First...

232 rows × 3 columns

13.4.1. How Have the Speeches Changed Over Time?

Now that we have the speeches loaded into a dataframe, we want to write a program that can help us see how the speeches have changed over time. Our basic idea is to look at the words in the speeches: if two speeches use very different sets of words, our program should tell us that they are very different. With a measure of similarity like this, we can see how the speeches have drifted apart over time.

There are a few problems in the data that we need to take care of first:

  1. Capitalization shouldn’t matter: Citizens and citizens should be considered the same word. We can address this by lowercasing the text.

  2. There are unspoken remarks in the text: [laughter] points out where the audience laughed, but these shouldn’t count as part of the speech. We can address this by using a regex to remove text within brackets: \[[^\]]+\]. Remember that \[ and \] match the literal left and right brackets, and [^\]] matches any character that isn’t a right bracket.

  3. We should take out characters that aren’t letters or whitespace: some speeches talk about finances, but a dollar amount shouldn’t count as a word. We can use the regex [^a-z\s] to remove these characters. This regex matches any character that isn’t a lowercase letter (a-z) or a whitespace character (\s).
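Before applying these fixes to the whole dataframe, here’s a quick sanity check of the two regexes on a made-up snippet (the sentence below is our own example, not a line from the speeches):

sample = "We will prevail. [Applause] The budget is $4.8 trillion."
sample = sample.lower()
sample = re.sub(r'\[[^\]]+\]', '', sample)  # drop the bracketed remark
sample = re.sub(r'[^a-z\s]', ' ', sample)   # blank out punctuation and digits
# The bracketed remark, the periods, and the dollar amount are all gone,
# leaving only lowercase words and whitespace.

The clean_text function below applies the same three fixes to every speech in the dataframe.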

def clean_text(df):
    bracket_re = re.compile(r'\[[^\]]+\]')
    not_a_word_re = re.compile(r'[^a-z\s]')
    cleaned = (df['text'].str.lower()
               .str.replace(bracket_re, '', regex=True)
               .str.replace(not_a_word_re, ' ', regex=True))
    return df.assign(text=cleaned)

df = (read_speeches()
      .pipe(clean_text))
df
                     name              date                                               text
0       George Washington   January 8, 1790  fellow citizens of the senate and house of rep...
1       George Washington  December 8, 1790  fellow citizens of the senate and house of rep...
2       George Washington  October 25, 1791  fellow citizens of the senate and house of rep...
...                   ...               ...                                                ...
229       Donald J. Trump  February 4, 2020  thank you very much thank you thank you very...
230  Joseph R. Biden, Jr.    April 28, 2021  thank you thank you thank you good to be ba...
231  Joseph R. Biden, Jr.     March 1, 2022  madam speaker madam vice president our first...

232 rows × 3 columns

Next, we’ll look at some more complex issues:

  1. Stop words like is, and, the, and but appear so often that we should just remove them.

  2. argue and arguing should count as the same word, even though they look different in the text. To address this, we’ll use word stemming, which transforms both words into the stem argu.

To handle these issues, we’ll use built-in methods from the nltk library.
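One practical note: nltk ships its stop word lists and tokenizer models as separate downloads, so we may need to fetch them once before running the code below (the exact resource names can vary a little between nltk versions). While we’re at it, we can check that the Porter stemmer really does map argue and arguing to the same stem:

import nltk
nltk.download('stopwords')  # stop word lists
nltk.download('punkt')      # tokenizer models used by nltk.word_tokenize

from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
[stemmer.stem(word) for word in ['argue', 'arguing']]
['argu', 'argu']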

Finally, we’ll transform the speeches into word vectors. A word vector represents a document using a vector of numbers. For example, one basic type of word vector counts up how many times each word appears in the text, as depicted in Figure 13.2.


Fig. 13.2 Bag-of-words vectors for three small example documents.

This simple transform is called bag-of-words, and we’ll apply it on all of our speeches. Then, we’ll use a statistic called term frequency-inverse document frequency (tf-idf for short) to normalize the counts. This technique puts more weight on words that only appear in a few documents. The idea is that if only a few documents mention the word sanction, this word is extra useful for distinguishing documents from each other. The scikit-learn library has a complete description of the transform and an implementation we’ll use.
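To make the bag-of-words and tf-idf ideas concrete, here’s a small sketch on three made-up documents (our own toy examples, not excerpts from the speeches). Notice that sanctions appears in only one document, so tf-idf gives it a relatively large weight there, while the appears in every document and gets a small weight:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = ["the economy is growing",
            "the economy needs sanctions",
            "congress passed the budget"]

toy_tfidf = TfidfVectorizer()
toy_vectors = toy_tfidf.fit_transform(toy_docs)

# One row per document, one column per word in the vocabulary
pd.DataFrame(toy_vectors.toarray(),
             columns=toy_tfidf.get_feature_names_out()).round(2)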

After applying these transforms, we have a two-dimensional array speech_vectors (stored as a sparse matrix). Each row of this array is one speech transformed into a vector.

import nltk
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = set(nltk.corpus.stopwords.words('english'))
porter_stemmer = PorterStemmer()

def stemming_tokenizer(document):
    # Tokenize, drop stop words, and stem each remaining word.
    return [porter_stemmer.stem(word)
            for word in nltk.word_tokenize(document)
            if word not in stop_words]

tfidf = TfidfVectorizer(tokenizer=stemming_tokenizer)
speech_vectors = tfidf.fit_transform(df['text'])
speech_vectors.shape
(232, 13211)

We have 232 speeches, and each speech was transformed into a length-13211 vector. To visualize these speeches, we’ll use a technique called principal component analysis to compress the length-13211 vectors into length-2 vectors, which we can plot directly in a scatter plot. (We cover principal component analysis in detail in Chapter 26.)

In the plot below, each point is one speech, colored by the year it was given. Points that are close together represent similar speeches, and points that are far apart represent dissimilar speeches.

import numpy as np
from scipy.sparse.linalg import svds

def compute_pcs(data, k):
    # Center the columns, then use a truncated SVD to project the data
    # onto its top k principal components.
    centered = data - data.mean(axis=0)
    U, s, Vt = svds(centered, k=k)
    return U @ np.diag(s)

# Setting the random seed doesn't affect svds(), so re-running this code
# might flip the points along the x or y-axes.
pcs = compute_pcs(speech_vectors, k=2)

# So we'll use a hack: we make sure the first row's PCs are both positive to get
# the same plot each time.
if pcs[0, 0] < 0:
    pcs[:, 0] *= -1
if pcs[0, 1] < 0:
    pcs[:, 1] *= -1

import plotly.express as px

with_pcs1 = df.assign(year=df['date'].str[-4:].astype(int),
                      pc1=pcs[:, 0], pc2=pcs[:, 1])
fig = px.scatter(with_pcs1, x='pc1', y='pc2', color='year',
                 hover_data=['name'],
                 width=550, height=350)
fig.update_layout(coloraxis_colorbar_thickness=15)
fig
[Scatter plot of the speeches projected onto the first two principal components (pc1 vs. pc2), colored by year.]

We see a clear difference in speeches over time: speeches given in the 1800s used very different words than speeches after 2000. It’s also interesting that speeches from the same time period cluster tightly together, which suggests they sound relatively similar even though the speakers came from different political parties. Here are some questions we could explore next:

  1. What words seem to best differentiate speeches from different centuries? (A rough sketch of one way to start appears after this list.)

  2. If we color the points by political party, will the points of the same political party be grouped together?
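As a rough starting point for the first question (a sketch of our own, reusing speech_vectors, tfidf, and with_pcs1 from above), we could average the tf-idf weights within each century and look at the terms with the largest averages:

terms = tfidf.get_feature_names_out()
century = (with_pcs1['year'] // 100) * 100

for c in sorted(century.unique()):
    in_century = (century == c).to_numpy()
    avg_weights = np.asarray(speech_vectors[in_century].mean(axis=0)).ravel()
    top_terms = terms[np.argsort(avg_weights)[::-1][:10]]
    print(c, list(top_terms))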

This section gave a whirlwind introduction to text analysis. We used text manipulation tools from previous sections to clean up the presidential speeches. Then, we used more advanced techniques like stemming, the tf-idf transform, and principal component analysis to compare speeches. Although we don’t have enough space in this book to cover all of these techniques in detail, we hope that this section piques your interest in the exciting world of text analysis.


[1] https://www.presidency.ucsb.edu/