8.1. Data Source Examples

In this chapter, we use two datasets as examples: a government survey about drug abuse and administrative data from the City of San Francisco about restaurant inspections. In later sections, we demonstrate how taking stock of the format, encoding, and size of the files that contain the “raw” data can prevent problems with loading the source file into a data frame. Before we get started, we want to give an overview of these datasets and their scope (Chapter 2).

8.1.1. Drug Abuse Warning Network (DAWN) Survey

DAWN is a national healthcare survey that monitors trends in drug abuse and the emergence of new substances of abuse. The survey also aims to estimate the impact of drug abuse on the country’s health care system and to improve how emergency departments monitor substance-abuse crises. DAWN is administered by the U.S. Substance Abuse and Mental Health Services Administration (SAMHSA). The survey was conducted annually from 1998 through 2011 and, due in part to the opioid epidemic, was restarted in 2018. We examine the 2011 data that have been made available through the SAMHSA Data Archive 1.

The target population of the survey is all drug-related emergency room visits in the U.S. These visits are accessed through a frame of emergency rooms in hospitals (and their records). Hospitals are selected for the survey through probability sampling (covered in Chapter 3), and all drug-related visits to a sampled hospital’s emergency room are included in the survey. All types of drug-related visits are included, such as drug misuse, abuse, accidental ingestion, suicide attempts, malicious poisonings, and adverse reactions. For each visit, as many as 16 different drugs can be recorded, including illegal drugs, prescription drugs, and over-the-counter medications.

The source file for this dataset is an example of fixed-width formatting that requires a codebook to decipher. It is also a reasonably large file, which motivates the topic of how to find a file’s size. Finally, the granularity is unusual for two reasons: an ER visit, not a person, is the subject of investigation, and the survey design is complex.
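To make the idea of a fixed-width file and its codebook concrete, here is a minimal sketch. The field names, character positions, and records are made up for illustration; the actual DAWN codebook defines its own layout. The sketch writes a tiny file, checks its size on disk, and uses the hypothetical codebook to slice each line into named fields.

```python
import os
import tempfile

# Hypothetical codebook: field name -> (start, end) character positions,
# 0-based and end-exclusive. The real DAWN codebook has its own layout.
CODEBOOK = {
    "case_id": (0, 4),
    "age_cat": (4, 6),
    "sex": (6, 7),
}


def parse_fixed_width(line, codebook):
    """Slice one fixed-width record into named, whitespace-stripped fields."""
    return {
        name: line[start:end].strip()
        for name, (start, end) in codebook.items()
    }


# Made-up records: one ER visit per line, with no delimiters between fields.
sample_lines = ["0001 41", "0002112", "0003 92"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "visits.txt")
    with open(path, "w", encoding="ascii", newline="\n") as f:
        f.write("\n".join(sample_lines) + "\n")

    # Check the size on disk before deciding how to load the file.
    size_bytes = os.path.getsize(path)

    with open(path, encoding="ascii") as f:
        visits = [parse_fixed_width(line.rstrip("\n"), CODEBOOK) for line in f]

print(f"{size_bytes} bytes, {len(visits)} visits")
print(visits[0])
```

Without the codebook, a line such as `0002112` is just a run of digits; the positions are what turn it into a case identifier, an age category, and a sex code.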

The San Francisco restaurant files have other characteristics that make them a good example for this chapter.

8.1.2. San Francisco Restaurant Food Safety

The San Francisco Department of Public Health routinely makes unannounced visits to restaurants and inspects them for food safety. The inspector calculates a score based on the violations observed and provides descriptions of the violations that were found. The target population here is all restaurants in the City of San Francisco. These restaurants are accessed through a frame of restaurant inspections that were conducted between 2013 and 2016. Some restaurants have multiple inspections in a year, and not all of the 7000+ restaurants are inspected annually.

Food safety scores are available through the city’s Open Data initiative, called DataSF. DataSF is one example of city governments around the world making their data publicly available; the DataSF mission is to “empower the use of data in decision making and service delivery” with the goal of improving the quality of life and work for residents, employers, employees and visitors 2.

The City of San Francisco requires restaurants to publicly display their scores (see Figure 8.1 below for an example placard) 3. These data offer an example of multiple files with different structures, fields, and granularity. One dataset contains summary results of inspections, another provides details about violations found during an inspection, and a third contains information about the restaurants. The violations include both serious problems related to the transmission of foodborne illnesses and minor issues such as not properly displaying the inspection placard.
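The difference in granularity among the three restaurant files can be sketched with a few made-up records. The field names and values below are assumptions for illustration, not the DataSF schema. The check asks which combination of fields identifies a single row in each table.

```python
from collections import Counter

# Made-up records; field names are hypothetical, not the DataSF schema.
businesses = [
    {"business_id": 1, "name": "Cafe A"},
    {"business_id": 2, "name": "Diner B"},
]
inspections = [
    {"business_id": 1, "date": "2015-03-01", "score": 92},
    {"business_id": 1, "date": "2016-04-10", "score": 88},
    {"business_id": 2, "date": "2015-07-22", "score": 100},
]
violations = [
    {"business_id": 1, "date": "2015-03-01", "description": "placard not displayed"},
    {"business_id": 1, "date": "2015-03-01", "description": "unclean surfaces"},
    {"business_id": 1, "date": "2016-04-10", "description": "improper food storage"},
]


def key_counts(rows, fields):
    """Count how many rows share each combination of the given fields."""
    return Counter(tuple(row[f] for f in fields) for row in rows)


def is_unique_key(rows, fields):
    """Return True if the given fields identify exactly one row each."""
    return all(n == 1 for n in key_counts(rows, fields).values())


# One row per restaurant in the business table.
print(is_unique_key(businesses, ["business_id"]))
# A restaurant can be inspected more than once, so business_id alone repeats.
print(is_unique_key(inspections, ["business_id"]))
# In this toy data, one row per (restaurant, date) in the inspection table.
print(is_unique_key(inspections, ["business_id", "date"]))
# One inspection can record several violations.
print(is_unique_key(violations, ["business_id", "date"]))
```

In this toy data, a row stands for a restaurant in the first table, an inspection in the second, and a single violation in the third, which is why the files cannot simply be stacked or compared row by row.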


Fig. 8.1 A food safety scorecard displayed in a restaurant. Scores range between 0 and 100.

Both the DAWN survey data and the San Francisco restaurant inspection data are available online as plain text files. However, their formats are different, and in the next section, we demonstrate how to figure out a file format so that we can read the data into a data frame.






3. In 2020, the city began giving restaurants color-coded placards indicating whether the restaurant passed (green), conditionally passed (yellow), or failed (red) the inspection. These new placards no longer display a numeric inspection score. However, a restaurant’s scores and violations are still available at DataSF.