13.1. Examples of Text and Tasks
For each of the types of tasks introduced in this chapter, we provide a motivating example. These examples are based on real tasks that we have carried out, but we’ve reduced the data to snippets that reflect the core issue with the text.
Convert text into a standard format. Let’s say we want to study connections between population demographics and election results. To do this, we’ve taken election data from Wikipedia and population data from the US Census. But we find that the county names in these two tables don’t match:
| | County | State | |
|---|---|---|---|
| 0 | De Witt County | IL | 97.8 |
| 1 | Lac qui Parle County | MN | 98.8 |
| 2 | Lewis and Clark County | MT | 95.2 |
| 3 | St John the Baptist Parish | LA | 52.6 |

| | County | State | |
|---|---|---|---|
| 1 | Lac Qui Parle | MN | 8,067 |
| 2 | Lewis & Clark | MT | 55,716 |
| 3 | St. John the Baptist | LA | 43,044 |
We can’t join these two tables together until we clean the strings to have a common format. In this case, we need to change the case of characters, use common spellings and abbreviations, and remove punctuation.
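As a rough illustration of the cleaning steps just listed, here is a minimal sketch in plain Python. The helper name `clean_county` and the specific replacement rules are our own assumptions chosen to fit the handful of names shown above; a full dataset would likely need more rules.

```python
def clean_county(name):
    """Return a canonical form of a county name (hypothetical helper)."""
    return (
        name.lower()               # change the case of characters
        .replace("county", "")     # drop suffixes that only one table uses
        .replace("parish", "")
        .replace("&", "and")       # use a common spelling
        .replace(".", "")          # remove punctuation
        .replace(" ", "")          # ignore spacing differences
    )

election_names = [
    "De Witt County",
    "Lac qui Parle County",
    "Lewis and Clark County",
    "St John the Baptist Parish",
]
census_names = ["Lac Qui Parle", "Lewis & Clark", "St. John the Baptist"]

cleaned_election = [clean_county(name) for name in election_names]
cleaned_census = [clean_county(name) for name in census_names]
```

With these rules, each census name shown above maps to the same string as its election counterpart, so the cleaned column could serve as a join key. Note that this simple version removes "county" anywhere in a name, which would be too aggressive for a name that contains the word in another position.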
Extract a piece of text to create a feature. Text data sometimes has a lot of structure, especially when it was generated by a computer. As an example, we’ve displayed a web server’s log entry below. Notice how the entry has multiple pieces of data, but the pieces don’t have a consistent delimiter—for instance, the date appears in square brackets, but other parts of the data appear in quotes and parentheses.
22.214.171.124 - - [26/Jan/2004:10:47:58 -0800]"GET /stat141/Winter04 HTTP/1.1" 301 328 "http://anson.ucdavis.edu/courses" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"
Even though the file format doesn’t align with one of the simple formats we saw in Chapter 8, we can use text processing techniques to extract pieces of text from the logs and create features for analysis.
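To make this concrete, here is a small sketch that pulls a few fields out of the log entry shown above with Python's built-in `re` module. The patterns and variable names are assumptions tailored to this single entry, not a general-purpose log parser.

```python
import re

log_entry = (
    '22.214.171.124 - - [26/Jan/2004:10:47:58 -0800]'
    '"GET /stat141/Winter04 HTTP/1.1" 301 328 '
    '"http://anson.ucdavis.edu/courses" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"'
)

# The timestamp is the text between the first pair of square brackets.
timestamp = re.search(r'\[([^\]]+)\]', log_entry).group(1)

# The request is the first quoted string; the status code and the
# response size are the two integers that follow it.
request = re.search(r'"([A-Z]+) (\S+) [^"]*" (\d{3}) (\d+)', log_entry)
method, path, status, size = request.groups()
```

Each captured group is still a string. Converting `status` and `size` to integers, or `timestamp` to a date, would be a separate step before analysis.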
Transform text into features. In Chapter 9, we created a categorical feature out of a string. There, we examined the descriptions of restaurant violations and we created nominal variables for the presence of particular words. We’ve displayed a few example violations here:
unclean or degraded floors walls or ceilings
inadequate and inaccessible handwashing facilities
inadequately cleaned or sanitized food contact surfaces
wiping cloths not clean or properly stored or inadequate sanitizer
foods not protected from contamination
unclean nonfood contact surfaces
unclean or unsanitary food contact surfaces
unclean hands or improper use of gloves
inadequate washing facilities or equipment

These new features can be used in an analysis of food safety scores.
Previously, we made simple features that marked whether a description contained a word like “glove” or “hair”. In this chapter, we’ll expand on this technique to create more sophisticated features.
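The simple word-presence features can be sketched as follows. This is only an illustration in plain Python; the helper `has_word` and the choice of stems are hypothetical, and the three descriptions are taken from the examples above.

```python
import re

violations = [
    "unclean or degraded floors walls or ceilings",
    "unclean hands or improper use of gloves",
    "foods not protected from contamination",
]

def has_word(description, stem):
    """Return True if any word in the description starts with `stem`."""
    return re.search(r"\b" + re.escape(stem), description) is not None

# One dictionary of True/False features per violation description.
features = [
    {"glove": has_word(v, "glove"), "unclean": has_word(v, "unclean")}
    for v in violations
]
```

Matching on a stem at a word boundary means that "gloves" counts as a mention of "glove", while a stem that appears only in the middle of a longer word does not.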
Text analysis. Sometimes we want to compare entire documents. For example, the US President has given a State of the Union speech every year since 1790. Here are the first few lines of the very first speech:
***
State of the Union Address
George Washington
January 8, 1790
Fellow-Citizens of the Senate and House of Representatives:
I embrace with great satisfaction the opportunity which now presents itself of congratulating you on the present favorable prospects of our public …
We might wonder: how have the State of the Union speeches changed over time? Or: do different political parties say different things in their speeches? To answer these questions, we can transform the documents into a numeric form which lets us write code to compare the speeches.
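One possible numeric form is a vector of word counts, which we can compare with cosine similarity. The sketch below uses only the Python standard library. The function names are our own, and the three short strings are made-up stand-ins rather than excerpts from real speeches.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Map each lowercase word in `text` to the number of times it occurs."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-empty word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b)

# Made-up stand-ins for speeches, not real excerpts.
doc1 = word_counts("the state of the union is strong")
doc2 = word_counts("the union is strong and the state is prosperous")
doc3 = word_counts("taxes and tariffs")

sim_12 = cosine_similarity(doc1, doc2)
sim_13 = cosine_similarity(doc1, doc3)
```

Documents that share many words get a similarity near 1, and documents with no words in common get 0. Here `doc1` and `doc3` share no words, so `sim_13` is 0, while `sim_12` is larger.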
These examples illustrate the ideas of string manipulation, regular expressions, and text analysis that we’ll explore in the remainder of this chapter. We’ll start with simple string manipulation.