13.1. Examples of Text and Tasks

For each of the types of tasks introduced in this chapter, we provide a motivating example. These examples are based on real tasks that we have carried out, but we have reduced the data to snippets that reflect the core issue with the text.

Convert text into a standard format. Suppose we want to study connections between population demographics and election results. We can, for example, get county demographics from the U.S. Census and election results from a Wikipedia table (see Chapter 14 on web scraping). We need to join the tables on county name in order to carry out our investigation. Below are snippets of a few records from two such tables; these records were chosen to highlight key differences between the county names in each of the tables.

   County                      State  Voted
0  De Witt County              IL     97.8
1  Lac qui Parle County        MN     98.8
2  Lewis and Clark County      MT     95.2
3  St John the Baptist Parish  LA     52.6

   County                State  Population
0  DeWitt                IL     16,798
1  Lac Qui Parle         MN     8,067
2  Lewis & Clark         MT     55,716
3  St. John the Baptist  LA     43,044

We would naturally like to join the election and census tables using the County column. Unfortunately, not a single county is spelled the same in the two tables. Before we can join these tables we need to clean the strings so that they have a common format. We need to change the case of characters, use common spellings and abbreviations, and remove punctuation.
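As a minimal sketch of this kind of cleaning, the function below applies one possible set of rules to the eight county names shown above. The function name `clean_county` and the specific rules (dropping the words "county" and "parish", spelling "&" as "and", and removing periods and spaces) are assumptions chosen to fit these snippets; a full table would likely need more rules.

```python
import re

def clean_county(name):
    """Return a canonical form of a county name (illustrative rules only)."""
    name = name.lower()                               # ignore case
    name = name.replace("&", "and")                   # one spelling of "and"
    name = re.sub(r"\b(county|parish)\b", "", name)   # drop the suffix word
    name = name.replace(".", "")                      # remove punctuation
    name = name.replace(" ", "")                      # ignore spacing differences
    return name

# County names copied from the two snippets above.
election = ["De Witt County", "Lac qui Parle County",
            "Lewis and Clark County", "St John the Baptist Parish"]
census = ["DeWitt", "Lac Qui Parle", "Lewis & Clark", "St. John the Baptist"]

cleaned_election = [clean_county(n) for n in election]
cleaned_census = [clean_county(n) for n in census]

print(cleaned_election)
print(cleaned_election == cleaned_census)
```

With these rules, each pair of names reduces to the same string (for instance, both "De Witt County" and "DeWitt" become "dewitt"), so the cleaned column could serve as a join key. Removing all spaces is an aggressive choice that suits these four records; it is not a general recommendation.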

Extract a piece of text to create a feature. The content of the web log entry below has a lot of structure. For example, the date always appears in square brackets. However, the various pieces of content are not consistently separated by the same delimiter, as in a CSV or TSV file, and they are not consistently placed at the same locations in the files, as in a fixed-width format.
[26/Jan/2004:10:47:58 -0800]"GET /stat141/Winter04 HTTP/1.1" 301 328
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"

Even though the file format doesn’t align with one of the simple formats we saw in Chapter 8, we can use the structure that is there to extract pieces of text from the logs and create features for analysis.
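To illustrate, here is a rough sketch that uses Python's `re` module to pull a few pieces out of the log entry shown above. The variable names are hypothetical, and reading the two numbers after the request as a status code and a response size is an assumption based on common web server log conventions.

```python
import re

# The log entry from above, joined into one string.
log = ('[26/Jan/2004:10:47:58 -0800]"GET /stat141/Winter04 HTTP/1.1" 301 328 '
       '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"')

# The timestamp is whatever sits between the square brackets.
timestamp = re.search(r"\[([^\]]+)\]", log).group(1)

# The quoted request (method, path, protocol) is followed by two numbers,
# assumed here to be the status code and the response size.
match = re.search(r'"(\w+) (\S+) [^"]*" (\d{3}) (\d+)', log)
method = match.group(1)
path = match.group(2)
status = int(match.group(3))
size = int(match.group(4))

print(timestamp)
print(method, path, status, size)
```

Each extracted value could become a feature, such as the time of the request or the page requested.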

Transform text into features. In Chapter 9, we demonstrated how to create a categorical feature out of a string. There, we examined the descriptions of restaurant violations, such as those shown below that pertain to cleanliness, and we created nominal variables for the presence of particular words. For example, one feature indicated whether a violation description contained a word such as glove, nail, hand, or hair, so that we could categorize these violations as related to the cleanliness of the restaurant staff.

unclean or degraded floors walls or ceilings
inadequate and inaccessible handwashing facilities
inadequately cleaned or sanitized food contact surfaces
wiping cloths not clean or properly stored or inadequate sanitizer
foods not protected from contamination
unclean nonfood contact surfaces
unclean or unsanitary food contact surfaces
unclean hands or improper use of gloves
inadequate washing facilities or equipment

These new features can be used in an analysis of food safety scores.
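A feature of this kind might be sketched as follows, using the nine descriptions listed above. The names `staff_words` and `staff_related` are hypothetical, and plain substring matching is an assumption made for brevity rather than the method used in Chapter 9.

```python
import re

# The violation descriptions listed above.
descriptions = [
    "unclean or degraded floors walls or ceilings",
    "inadequate and inaccessible handwashing facilities",
    "inadequately cleaned or sanitized food contact surfaces",
    "wiping cloths not clean or properly stored or inadequate sanitizer",
    "foods not protected from contamination",
    "unclean nonfood contact surfaces",
    "unclean or unsanitary food contact surfaces",
    "unclean hands or improper use of gloves",
    "inadequate washing facilities or equipment",
]

# True when a description contains any of the staff-related word stems.
staff_words = re.compile(r"glove|nail|hand|hair")
staff_related = [bool(staff_words.search(d)) for d in descriptions]

for flag, text in zip(staff_related, descriptions):
    print(flag, text)
```

Only the second and eighth descriptions are flagged. The second matches because "handwashing" contains "hand", even though it describes facilities rather than staff, which shows how crude simple word matching can be.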

Text analysis. Sometimes we want to compare entire documents. To do this, we must represent a document in some analyzable form. One approach is to view the document as a collection of word counts and compare documents by measuring how similar their counts are. For example, consider the State of the Union addresses, which have been given annually since 1790. The first few lines of the first of these appear below.


State of the Union Address
George Washington
January 8, 1790

Fellow-Citizens of the Senate and House of Representatives:
I embrace with great satisfaction the opportunity which now presents itself of congratulating you on the present favorable prospects of our public …

Altogether, as of 2022, there have been 232 of these speeches delivered. We are interested in whether the similarity between speeches depends on the era or on the president’s political party. A lot of attention has been given to former President Trump’s use of language. Is anything distinctive apparent in his State of the Union speeches? To address these questions, we can do some text mining.
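The idea of comparing documents through their word counts can be sketched in a few lines. The example below uses three of the short violation descriptions from earlier as stand-ins for documents, since full speeches are too long to show here. Cosine similarity is one common way to compare count vectors; it is used here as an assumption, not necessarily the measure applied to the speeches, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Count the lowercase alphabetic words in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

# Tiny stand-in "documents" taken from the violation descriptions above.
doc1 = word_counts("unclean or unsanitary food contact surfaces")
doc2 = word_counts("unclean nonfood contact surfaces")
doc3 = word_counts("foods not protected from contamination")

sim_12 = cosine_similarity(doc1, doc2)   # three shared words
sim_13 = cosine_similarity(doc1, doc3)   # no shared words

print(round(sim_12, 3))
print(sim_13)
```

The first pair shares three words and scores about 0.61, while the second pair shares none and scores 0, because "food" and "foods" count as different words. Real text analysis typically normalizes such word forms first.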

These examples serve to illustrate the ideas of string manipulation, regular expressions, and text analysis in the remainder of this chapter. We begin with the simpler notion of string manipulation and show how we can use these tools to canonicalize county names in order to join tables.