13.1. Examples of Text and Tasks

For each of the types of tasks introduced in this chapter, we provide a motivating example. These examples are based on real tasks that we have carried out, but we’ve reduced the data to snippets that reflect the core issue with the text.

Convert text into a standard format. Let’s say we want to study connections between population demographics and election results. To do this, we’ve taken election data from Wikipedia and population data from the US Census. But we find that the county names in these two tables don’t match:

   County                      State  Voted
0  De Witt County              IL     97.8
1  Lac qui Parle County        MN     98.8
2  Lewis and Clark County      MT     95.2
3  St John the Baptist Parish  LA     52.6

   County                State  Population
0  DeWitt                IL     16,798
1  Lac Qui Parle         MN     8,067
2  Lewis & Clark         MT     55,716
3  St. John the Baptist  LA     43,044

We can’t join these two tables together until we clean the strings to have a common format. In this case, we need to change the case of characters, use common spellings and abbreviations, and remove punctuation.
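As a minimal sketch of this kind of cleaning, the function below lowercases a name, spells out "&", drops the "County" or "Parish" suffix, and strips everything that is not a letter. The helper name `clean_county` and the choice to remove spaces as well as punctuation (so that "De Witt" and "DeWitt" agree) are assumptions made for illustration, not the only reasonable canonical form.

```python
import re

def clean_county(name):
    """Return a canonical form of a county name (hypothetical helper)."""
    name = name.lower()                               # common case
    name = name.replace('&', 'and')                   # common spelling of "and"
    name = re.sub(r'\b(county|parish)\b', '', name)   # drop the suffix
    name = re.sub(r'[^a-z]', '', name)                # drop punctuation and spaces
    return name

election = ['De Witt County', 'Lac qui Parle County',
            'Lewis and Clark County', 'St John the Baptist Parish']
census = ['DeWitt', 'Lac Qui Parle', 'Lewis & Clark', 'St. John the Baptist']

# Each pair of names should now clean to the same string.
for e, c in zip(election, census):
    print(clean_county(e), clean_county(c), clean_county(e) == clean_county(c))
```

With a shared cleaned column in both tables, the county names can serve as a join key together with the state.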

Extract a piece of text to create a feature. Text data sometimes has a lot of structure, especially when it was generated by a computer. As an example, we’ve displayed a web server’s log entry below. Notice how the entry has multiple pieces of data, but the pieces don’t have a consistent delimiter: for instance, the date appears in square brackets, but other parts of the data appear in quotes and parentheses.

- - [26/Jan/2004:10:47:58 -0800]"GET /stat141/Winter04 HTTP/1.1" 301 328
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"

Even though the file format doesn’t align with one of the simple formats we saw in Chapter 8, we can use text processing techniques to extract pieces of text from the logs and create features for analysis.
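To give a sense of what such extraction can look like, here is a rough sketch that uses a regular expression with named groups to pull the date, request, and status out of the entry above. The pattern and the group names (`day`, `month`, `path`, and so on) are our own assumptions for this one snippet; a real log file would likely need a more tolerant pattern.

```python
import re

# The log entry from the text, joined into a single string.
log = ('[26/Jan/2004:10:47:58 -0800]"GET /stat141/Winter04 HTTP/1.1" 301 328 '
       '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"')

# Hypothetical pattern: date in brackets, then the quoted request,
# then the numeric status code and response size.
pattern = re.compile(
    r'\[(?P<day>\d+)/(?P<month>\w+)/(?P<year>\d+):(?P<time>[\d:]+) (?P<tz>[+-]\d+)\]'
    r'\s*"(?P<method>\w+) (?P<path>\S+) (?P<protocol>[^"]+)"'
    r' (?P<status>\d+) (?P<size>\d+)'
)

match = pattern.search(log)
fields = match.groupdict()   # one dictionary entry per named group
print(fields)
```

Each entry of `fields` is still a string, so values such as the status code would need converting (for example with `int`) before being used as numeric features.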

Transform text into features. In Chapter 9, we created a categorical feature out of a string. There, we examined the descriptions of restaurant violations and we created nominal variables for the presence of particular words. We’ve displayed a few example violations here:

unclean or degraded floors walls or ceilings
inadequate and inaccessible handwashing facilities
inadequately cleaned or sanitized food contact surfaces
wiping cloths not clean or properly stored or inadequate sanitizer
foods not protected from contamination
unclean nonfood contact surfaces
unclean or unsanitary food contact surfaces
unclean hands or improper use of gloves
inadequate washing facilities or equipment

These new features can be used in an analysis of food safety scores. Previously, we made simple features that marked whether a description contained a word like “glove” or “hair”. In this chapter, we’ll expand on this technique to create more sophisticated features.
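The simple word-presence features can be sketched in a few lines. The word list and the helper name `word_features` below are hypothetical, and the plain substring test is an assumption chosen for brevity rather than a record of how the earlier chapter built its features.

```python
violations = [
    'unclean or degraded floors walls or ceilings',
    'inadequate and inaccessible handwashing facilities',
    'unclean hands or improper use of gloves',
]

# Hypothetical list of words to look for.
words = ['clean', 'hand', 'glove']

def word_features(description, words):
    """Map each word to True/False: does the description contain it?"""
    return {word: word in description for word in words}

features = [word_features(v, words) for v in violations]
for violation, feats in zip(violations, features):
    print(feats, violation)
```

Note that the substring test marks "clean" as present in "unclean", which flips the meaning. Pitfalls like this are one reason to want more careful pattern matching.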

Text analysis. Sometimes we want to compare entire documents. For example, the US president has given a State of the Union speech every year since 1790. Here are the first few lines of the very first speech:


State of the Union Address
George Washington
January 8, 1790

Fellow-Citizens of the Senate and House of Representatives:
I embrace with great satisfaction the opportunity which now presents itself
of congratulating you on the present favorable prospects of our public …

We might wonder: how have the State of the Union speeches changed over time? Or: do different political parties say different things in their speeches? To answer these questions, we can transform the documents into a numeric form that lets us write code to compare the speeches.
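One simple numeric form is a vector of word counts. The sketch below treats two lines of the speech above as stand-ins for two documents and builds a count vector for each over their shared vocabulary. The helper `word_counts` and the letters-only tokenization are assumptions for illustration, and word counts are only one of several possible representations.

```python
import re
from collections import Counter

def word_counts(text):
    """Count lowercase words, where a word is a run of letters."""
    return Counter(re.findall(r'[a-z]+', text.lower()))

# Two lines of the speech, used here as two tiny "documents".
doc_a = 'I embrace with great satisfaction the opportunity which now presents itself'
doc_b = 'of congratulating you on the present favorable prospects of our public'

counts_a = word_counts(doc_a)
counts_b = word_counts(doc_b)

# One position per word that appears in either document.
vocab = sorted(set(counts_a) | set(counts_b))
vec_a = [counts_a[w] for w in vocab]   # a Counter gives 0 for missing words
vec_b = [counts_b[w] for w in vocab]

# A crude comparison: the dot product grows with shared vocabulary.
overlap = sum(a * b for a, b in zip(vec_a, vec_b))
print(vocab)
print(vec_a)
print(vec_b)
print(overlap)
```

Once every document is a vector of the same length, the documents can be compared with ordinary numeric tools, such as distances or similarity scores.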

These examples serve to illustrate the ideas of string manipulation, regular expressions, and text analysis that we’ll dive into for the remainder of this chapter. We’ll start with simple string manipulation.