import sys
import os
if not any(path.endswith('textbook') for path in sys.path):
    sys.path.append(os.path.abspath('../../..'))
from textbook_utils import *
import re
Use re.search to find whether a string contains a phone number. The pattern that you write should detect a phone number in the following strings.
“Call me at 382-384-3840.”
“my number is (510) 849-3519. Call me!”
It should not find a match in the following strings.
“my number is 510-849-35192”
“here’s my number: 510-849.3519”
Consider making your own tests as well.
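One possible starting point for this exercise is sketched below. It is not the only valid pattern: the helper name has_phone is hypothetical, and the decision to reject a number that sits inside a longer run of digits is an assumption drawn from the examples above.

```python
import re

# Area code written as "(ddd) " or "ddd-", followed by "ddd-dddd".
# The lookarounds reject candidates embedded in a longer run of digits.
PHONE = re.compile(r'(?<!\d)(?:\(\d{3}\) |\d{3}-)\d{3}-\d{4}(?!\d)')

def has_phone(s):
    """Return True if s contains a phone number in one of the two formats."""
    return PHONE.search(s) is not None

tests = [
    "Call me at 382-384-3840.",
    "my number is (510) 849-3519. Call me!",
    "my number is 510-849-35192",
    "here's my number: 510-849.3519",
]
for s in tests:
    print(has_phone(s), s)
```

Adding your own test strings, such as a number at the very start of a string or one with no area code, is a good way to probe where this pattern is too strict or too loose.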
Use re.sub to alter the string below so that the dates have a common format that uses a dash for the day, month, and year separator.
‘03/12/2018, 03.13.18, 03/14/2018, 03:15:2018’
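A minimal sketch of one approach follows. It assumes that every date has a two-digit first and second field and that only the separators need to change; the two-digit year in the second date is left as it is, and whether to expand it is a choice for you to make.

```python
import re

dates = '03/12/2018, 03.13.18, 03/14/2018, 03:15:2018'

# Capture the three numeric fields and rewrite them with dashes.
# The character class [/.:] covers the separators seen in the string.
pattern = r'(\d{2})[/.:](\d{2})[/.:](\d{2,4})'
cleaned = re.sub(pattern, r'\1-\2-\3', dates)

print(cleaned)  # 03-12-2018, 03-13-18, 03-14-2018, 03-15-2018
```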
Use re.split to separate the chapter name from the page number in the following table of contents for a book.
toc = '''
PLAYING PILGRIMS .............. 3
A MERRY CHRISTMAS ............. 13
THE LAURENCE BOY .............. 31
BURDENS ....................... 55
BEING NEIGHBORLY .............. 76
'''
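The sketch below shows one way to approach the split. It assumes that each entry of the table of contents sits on its own line and that the chapter name and page number are separated by a run of two or more dots; the variable name entries is introduced only for illustration.

```python
import re

toc = '''
PLAYING PILGRIMS .............. 3
A MERRY CHRISTMAS ............. 13
THE LAURENCE BOY .............. 31
BURDENS ....................... 55
BEING NEIGHBORLY .............. 76
'''

entries = []
for line in toc.strip().splitlines():
    # Split on the dot leader together with the whitespace around it.
    chapter, page = re.split(r'\s*\.{2,}\s*', line)
    entries.append((chapter, int(page)))

for chapter, page in entries:
    print(page, chapter)
```

Including the surrounding whitespace in the split pattern avoids a separate call to strip on each piece.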
Consider the first five sentences of the novel “Little Women” below. Extract the spoken dialog from each sentence.
text = '''"Christmas won't be Christmas without any presents," grumbled Jo, lying on the rug. "It's so dreadful to be poor!" sighed Meg, looking down at her old dress. "I don't think it's fair for some girls to have plenty of pretty things, and other girls nothing at all," added little Amy, with an injured sniff. "We've got Father and Mother, and each other," said Beth contentedly from her corner. The four young faces on which the firelight shone brightened at the cheerful words, but darkened again as Jo said sadly, "We haven't got Father, and shall not have him for a long time."'''
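As a hedged starting point, the sketch below pulls out everything between pairs of double quotation marks. To keep the block short it uses only the first two sentences, stored in a variable named sample (a name introduced here), and it assumes the quotation marks are straight double quotes that are always balanced.

```python
import re

sample = '''"Christmas won't be Christmas without any presents," grumbled Jo, lying on the rug. "It's so dreadful to be poor!" sighed Meg, looking down at her old dress.'''

# A quotation is an opening quote, any run of non-quote characters,
# and a closing quote. The group keeps only the text in between.
dialog = re.findall(r'"([^"]*)"', sample)

for spoken in dialog:
    print(spoken)
```

Note that the trailing comma inside the first quotation is kept. Whether to strip such punctuation depends on how you plan to use the dialog.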
Return to the web log example and use regular expressions with grouping to extend the pattern that extracted the day, month, and year so that it also extracts the hour, minute, and second from the log. How would you also extract the time zone?
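The sketch below illustrates the idea of adding groups. The log line is a made-up sample in the Common Log Format, assumed here because the log itself is not reproduced in these exercises, and the pattern is one guess at how the earlier date pattern might be extended.

```python
import re

# Hypothetical log line, used only for illustration.
line = '192.0.2.1 - - [26/Jan/2004:10:47:58 -0800] "GET /index.html HTTP/1.1" 200 1024'

# One group per field: day, month, year, hour, minute, second, time zone.
pattern = r'\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+) ([+-]\d{4})\]'

match = re.search(pattern, line)
day, month, year, hour, minute, second, tz = match.groups()
print(day, month, year, hour, minute, second, tz)
```

The time zone group assumes a sign followed by four digits, which is how offsets appear in this sample. A log with a different layout would need a different final group.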
Give a regular expression for any lowercase string that has a repeated vowel, such as noon, peel, festoon, and looop.
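A backreference is one natural tool for this exercise. The sketch below is a possible answer, not the only one; the helper name has_repeated_vowel is hypothetical, and "repeated" is read here as the same vowel appearing at least twice in a row.

```python
import re

# ([aeiou]) captures one vowel and \1 requires the same vowel again.
# fullmatch ensures the whole string consists of lowercase letters.
REPEATED_VOWEL = re.compile(r'[a-z]*([aeiou])\1[a-z]*')

def has_repeated_vowel(word):
    return REPEATED_VOWEL.fullmatch(word) is not None

for word in ['noon', 'peel', 'festoon', 'looop', 'pile', 'Noon']:
    print(word, has_repeated_vowel(word))
```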
Return to the violation descriptions for the restaurant inspections (see Chapter 9) and derive some of your own features for exploring the relationship between the kinds of violations and the inspection score.
In text analysis, stop words are often removed because they are commonly used and thought to contain little information. However, the use of these words can be insightful in determining the authorship of documents. As an example, the Federalist papers were published anonymously in 1787-1788 by Alexander Hamilton, John Jay, and James Madison. Altogether, 77 papers were published. It is generally agreed that Jay wrote 5 of the essays (2, 3, 4, 5, and 64), Hamilton wrote 43, Madison wrote 14 (10, 14, and 37-48), and 3 were coauthored by Hamilton and Madison (18-20). However, the authorship of 12 papers (49-58, 62, and 63) is in dispute between Hamilton and Madison. The Federalist papers are available online from Project Gutenberg at https://www.gutenberg.org/ebooks/18. Mosteller and Wallace [1] found that the words “to”, “this”, “there”, “on”, “of”, “by”, “a”, and “also” are excellent at discriminating between the papers written by Hamilton and those written by Madison. Carry out your own text analysis to confirm these findings and predict the author of the 12 papers with disputed authorship.
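A first step in such an analysis might be to compute how often each marker word appears per 1,000 words of a paper. The sketch below is one way to do that under simple assumptions: the function name marker_rates is hypothetical, the tokenizer is crude (lowercase runs of letters only), and the sample sentence is made up rather than taken from the Federalist papers.

```python
import re
from collections import Counter

MARKER_WORDS = ['to', 'this', 'there', 'on', 'of', 'by', 'a', 'also']

def marker_rates(text, words=MARKER_WORDS, per=1000):
    """Return the rate of each marker word per `per` tokens of text."""
    tokens = re.findall(r'[a-z]+', text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in words}
    return {w: per * counts[w] / total for w in words}

# Made-up sentence, used only to show the shape of the output.
sample = 'There is also a duty on the part of the union to provide for this.'
rates = marker_rates(sample)
print(rates)
```

Applying the function to each paper with known authorship gives one row of rates per paper, which can then be compared between Hamilton and Madison before turning to the disputed papers.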
Rotten Tomatoes [2] is a recommender system for movies and TV shows. It includes a Tomatometer where negative reviews are considered rotten and positive ones fresh. Reviews include scores (rated 1 to 5), comments, and a rotten/fresh designation. Data for 17,000 movies are available on Kaggle [3], and another 2,000 one-sentence reviews from Rotten Tomatoes are also available [4] for analysis. Explore the relationship between the comments and the scores and designations. Try detecting the sentiment of a written review. Socher et al. [5] found that the bag-of-words approach to analyzing text for sentiment does not work well on short text, such as movie reviews, and recommend considering word order.
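To see why word order can matter for short reviews, consider the small illustration below. The two reviews are made up, and the helper bag_of_words is a hypothetical name for a very simple tokenizer and counter. It is not a sentiment model, only a demonstration that a bag of words discards order.

```python
import re
from collections import Counter

def bag_of_words(review):
    """Count lowercase word tokens, ignoring order and punctuation."""
    return Counter(re.findall(r"[a-z']+", review.lower()))

# Two made-up reviews with opposite sentiment.
positive = 'good, not bad'
negative = 'bad, not good'

same_bag = bag_of_words(positive) == bag_of_words(negative)
print(same_bag)  # True
```

Because both reviews map to the same word counts, any classifier that sees only those counts must give them the same label.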