13.4. Text Analysis
The examples that we have seen so far in this chapter have cleaned text fields and converted strings into quantitative features for analysis. In this section, we demonstrate how to analyze documents. This analysis, called text mining, transforms unstructured text into a quantitative representation with the aim of uncovering meaningful patterns and insights.
We do not provide a comprehensive treatment of text mining here, but introduce a few key ideas through an example. We analyze the State of the Union speeches from 1790 to 2022. The State of the Union is a report given annually by the US president to the US Congress. It contains information on the situation in the country and recommendations for Congress to consider. The American Presidency Project has the text from all of these speeches.
We have collected these speeches into one file, which we open and read as a single string.
```python
from pathlib import Path

insp_path = Path() / 'data' / 'stateoftheunion1790-2022.txt'
with insp_path.open(mode="r") as f:
    text = f.read()
```
We saw earlier in the chapter that the speeches are delimited by lines of the form ***. We can use a regular expression to count the number of occurrences of three asterisks.
```python
import re

print("Count ***s", len(re.findall(r"\*\*\*", text)))
```
Count ***s 232
The file contains 232 speeches (or 232 documents), and we can use string manipulation to create separate “documents” for each.
```python
records = text.split("***")
```
Before we can apply text mining techniques, we must clean and prepare each document. For example, following the three asterisks there is a blank line and then three lines of information about the speech. We want to remove these from the document and place this information in corresponding features.
```python
import pandas as pd

def extract_parts(line):
    parts = line.split("\n")
    name = parts[3].strip()
    date = parts[4].strip()
    text = "\n".join(parts[5:]).strip()
    return [name, date, text]

df = pd.DataFrame([extract_parts(l) for l in records[1:]],
                  columns=["Name", "Date", "Text"])
df.head()
```
|   | Name              | Date             | Text                                              |
|---|-------------------|------------------|---------------------------------------------------|
| 0 | George Washington | January 8, 1790  | Fellow-Citizens of the Senate and House of Rep... |
| 1 | George Washington | December 8, 1790 | Fellow-Citizens of the Senate and House of Rep... |
| 2 | George Washington | October 25, 1791 | Fellow-Citizens of the Senate and House of Rep... |
| 3 | George Washington | November 6, 1792 | Fellow-Citizens of the Senate and House of Rep... |
| 4 | George Washington | December 3, 1793 | Fellow-Citizens of the Senate and House of Rep... |
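To see how the indexing in `extract_parts` lines up, here is the split applied to a made-up record that mirrors the layout described above (the header and body strings are hypothetical, not taken from the corpus):

```python
# A made-up record laid out like the real ones: after the "***" delimiter
# comes a blank line, then three header lines, then the speech body.
record = ("\n\nState of the Union Address\nGeorge Washington\n"
          "January 8, 1790\n\nFellow-Citizens of the Senate ...")

parts = record.split("\n")
name = parts[3].strip()   # the president's name
date = parts[4].strip()   # the date of the speech
body = "\n".join(parts[5:]).strip()  # everything after the header
```

The two leading empty strings produced by the blank line are why the name and date sit at indices 3 and 4.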
To clean the documents, we remove blank lines and extra whitespace, convert uppercase letters to lowercase, and eliminate all characters that are not letters or spaces. Additionally, we remove remarks between square brackets, such as [laughter], because these are clarifications or comments from the audience, and we remove the phrase “The President.” when it appears at the beginning of a line because it refers to the president speaking and is not part of the speech itself. Notice in the regular expressions below how we: escape the metacharacter meaning of the left and right brackets with a backslash in order to locate the bracketed remarks that are not part of the speech; use the caret inside the two character classes, one time to match any character except the right bracket and another time to match everything that isn’t a letter or whitespace (the shortcut for whitespace is \s); and locate the term “The President.” at the beginning of a new line. Also consider the sequence in which these string replacements are carried out. We want to be sure to replace “The President.” first, before we start replacing end-of-line characters with blanks.
```python
df['clean text'] = (
    df['Text']
    .str.replace(r"\n\s*The President\.", "", regex=True)
    .str.replace(r"\[[^\]]+\]", "", regex=True)
    .str.replace("\n", " ", regex=False)
    .str.lower()
    .str.replace(r"[^a-z\s]", " ", regex=True)
)
df['clean text'].head()
```
```
0    fellow citizens of the senate and house of rep...
1    fellow citizens of the senate and house of rep...
2    fellow citizens of the senate and house of rep...
3    fellow citizens of the senate and house of rep...
4    fellow citizens of the senate and house of rep...
Name: clean text, dtype: object
```
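To see the chain’s effect in isolation, here it is applied to a short made-up snippet (not from the corpus) that contains each kind of artifact we remove:

```python
import pandas as pd

# A made-up snippet with the artifacts the chain removes: a "The President."
# line prefix, a bracketed remark, punctuation, digits, and mixed case.
toy = pd.DataFrame({"Text": [
    "Fellow-Citizens:\nThe President. We gathered [applause] in 1790."
]})

clean = (toy["Text"]
         .str.replace(r"\n\s*The President\.", "", regex=True)
         .str.replace(r"\[[^\]]+\]", "", regex=True)
         .str.replace("\n", " ", regex=False)
         .str.lower()
         .str.replace(r"[^a-z\s]", " ", regex=True))
```

After the chain runs, only lowercase letters and spaces remain.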
Now we’re ready to perform a little text mining. The key idea is to analyze the speeches as collections of words. To do this, we organize the words in each document into a word vector. We break each document up into separate strings of one word each, which are referred to as terms or tokens. This process, called tokenization, often includes stemming, where a word is reduced to its “stem.” For example, “runs” and “running” are both mapped to the stem “run” (lemmatizers go further and also map irregular forms like “ran” to “run”). Stemming reduces similar words to the same core to make it easier to find similarities between documents. Another processing step removes stop words, such as “is”, “and”, “the”, and “a”, from the document. The rationale is that these words are so common that they are not helpful in comparing documents.
Below we use a stemming method in the NLTK library to tokenize (by splitting on whitespace) and stem the words in a string. We will use this tokenizer on the speeches.
```python
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()

def stemming_tokenizer(str_input):
    words = re.split(" ", str_input)
    words = [porter_stemmer.stem(word) for word in words]
    return words
```
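Applied to a short made-up string, the tokenizer splits on spaces and stems each word (the snippet repeats the definition so it runs on its own):

```python
import re
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()

def stemming_tokenizer(str_input):
    words = re.split(" ", str_input)
    return [porter_stemmer.stem(word) for word in words]

# "nations" and "nation" share the stem "nation";
# "running" and "runs" share the stem "run".
tokens = stemming_tokenizer("the nation nations running runs")
```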
To create a speech’s word vector, we tally up the number of times each word in a dictionary occurs in the speech. The dictionary used to create the word vectors might simply be the collection of all unique words across all of the documents being analyzed, which is referred to as a bag of words.
Representing a speech in this way ignores the actual order of the words. It may not be immediately obvious that this representation can be useful. The bi-gram is an extension of this notion, where pairs of words are the tokens. You can imagine that this extension can bring more nuance to a text analysis.
The count of a word in a document is called the term frequency. Ideally, the vectors of term frequencies can be used to compare documents. To make these comparisons, we typically weight a term frequency by the word’s inverse document frequency, which shrinks as the number of documents containing the word grows; a common form is the logarithm of the total number of documents divided by the number of documents containing the word. A word that appears in all of the documents will likely not be of much use in our comparisons. The product of the term frequency and the inverse document frequency is called the tf-idf.
We use the TfidfVectorizer method in scikit-learn to compute the tf-idf vectors, and provide our tokenizer to stem the words.
```python
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(tokenizer=stemming_tokenizer, stop_words='english')
tfidf = vec.fit_transform(df['clean text'])
```
Each of the 232 speeches is represented by a vector of 13,348 tf-idf values, one for each term in the vocabulary.
Now that we have converted documents into high-dimensional vectors, we can use dimension reduction techniques to look for patterns (see Chapter 26). We close this section with a plot of the tf-idf vectors of the 232 documents, reduced to two-dimensional representations in the scatter plot below. Each point is one of the speeches, and the points are color-coded by year. We note that speeches from the same era are close to each other, even when given by presidents of different political parties. Also notable are a couple of anomalous speeches, which would be something to look into.
[Scatter plot: the 232 speeches projected to two dimensions, color-coded by year.]
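One common way to carry out such a reduction on a sparse tf-idf matrix is truncated SVD. Here is a minimal sketch with made-up documents (the actual plot above was produced from the real speeches’ tf-idf matrix):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Four made-up documents reduced from their tf-idf representation
# down to two dimensions, one (x, y) point per document.
docs = ["taxes and trade", "war and peace",
        "trade and taxes", "peace and war"]
tfidf = TfidfVectorizer().fit_transform(docs)

svd = TruncatedSVD(n_components=2)
xy = svd.fit_transform(tfidf)  # shape: (n_documents, 2)
```

Each row of `xy` can then be plotted as a point, exactly as in the scatter plot of the speeches.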