8.2. File Formats

A file format describes how data are stored on the computer. Understanding the file format helps us figure out how to read the data into Python in order to work with it as a table. In this section, we introduce several popular formats used to store data tables.

Note

The file format and the structure of the data are two different things. The data’s structure is a mental representation of the data and tells us what kinds of operations we can do. For example, a table structure corresponds to data values arranged in rows and columns. But, the same table can be stored in many different types of file formats.

8.2.1. Delimited Format

Delimited formats use a specific character to separate data values. Usually, these separators are a comma (Comma-Separated Values, or CSV for short), a tab (Tab-Separated Values, or TSV), white space, or a colon. These formats are natural for storing data that have a table structure. Each line in the file represents a record, and records are delimited by newline (\n or \r\n) characters. Within a line, the record’s values are separated by the comma character (,) for CSV, the tab character (\t) for TSV, and so on. The first line of these files often contains the names of the table’s columns/features.

The San Francisco restaurant scores are stored in CSV-formatted files. Let’s display the first few lines of the inspections.csv file. In Python, the built-in pathlib library has a useful Path object for specifying paths to files and folders in a way that works across platforms. The data are stored in the file data/inspections.csv, so we use Path() to create the full path name. Path objects also have many useful methods, such as open(), which we use below.

from pathlib import Path

# Create a Path pointing to our data file
insp_path = Path() / 'data' / 'inspections.csv'

with insp_path.open() as f:
    # Display first five lines of file
    for _ in range(5):
        print(f.readline(), end='')
"business_id","score","date","type"
19,"94","20160513","routine"
19,"94","20171211","routine"
24,"98","20171101","routine"
24,"98","20161005","routine"

Note

Paths are tricky when working across different operating systems (OS). For instance, a typical path in Windows might look like C:\files\data.csv, while a path in Unix or MacOS might look like ~/files/data.csv. Because of this, code that works on one OS can fail to run on other operating systems.

The pathlib Python library was created to avoid OS-specific path issues. By using it, the code shown here is more portable—it works across Windows, MacOS, and Unix.
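
For instance, the insp_path object created above already gives us a few OS-independent conveniences. A small sketch; the exact output depends on your operating system and working directory:

# insp_path was built above with Path() / 'data' / 'inspections.csv'
print(insp_path)           # data/inspections.csv on Unix or macOS, data\inspections.csv on Windows
print(insp_path.suffix)    # .csv
print(insp_path.exists())  # True when the data folder contains the file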

Displaying the first few lines of a file is something we’ll do often, so we create a function as a shortcut:

def head(filepath, n=5):
    '''Prints the first n lines of filepath'''
    with filepath.open() as f:
        for _ in range(n):
            print(f.readline(), end='')
head(insp_path)
"business_id","score","date","type"
19,"94","20160513","routine"
19,"94","20171211","routine"
24,"98","20171101","routine"
24,"98","20161005","routine"

Notice that the field names appear in the first line of the file; they are comma-separated and in quotations. We see four fields: the business identifier, the restaurant’s score, the date of the inspection, and the type of inspection. Each line in the file corresponds to one inspection, and the ID, score, date and type values are separated by commas. In addition to identifying the file format, we also want to identify the format of the features. We see two things of note: the scores and dates both appear as strings. We will want to convert the scores to numbers so we can calculate summary statistics and create a histogram of scores. And, we will convert the date into a date-time format so that we can make time-series plots. We show how to carry out these transformations in Chapter 9.
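
As a preview of those transformations (a minimal sketch using pandas; Chapter 9 carries them out more carefully), the idea is roughly:

import pandas as pd

# Read every field as a string, then convert the ones we need
insp = pd.read_csv('data/inspections.csv', dtype=str)
insp['score'] = pd.to_numeric(insp['score'])
insp['date'] = pd.to_datetime(insp['date'], format='%Y%m%d')
print(insp.dtypes)  # score is now numeric, date is now datetime64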

All three of the restaurant source files are CSV formatted. On the other hand, the DAWN source has a fixed-width format. We describe this format next.

8.2.2. Fixed-Width Format

The fixed-width format (FWF) does not use delimiters to separate data values. Instead, the values for a specific field appear in the exact same position in each line. The DAWN source file has this format. Each line in the file is very long. For display purposes, we’ll only show the first few characters from the first 5 lines in the file.

dawn_path = Path() / 'data' / 'DAWN-Data.txt'
width = 65
with dawn_path.open() as f:
    for _ in range(5):
        print(f.readline()[:width])
     1 2251082    .9426354082   3 4 1 2201141 2 865 105 1102005 1
     2 2291292   5.9920106887   911 1 3201134 12077  81  82 283-8
     3 7 7 251   4.7231718669   611 2 2201143 12313   1  12  -7-8
     410 8 292   4.0801470012   6 2 1 3201122 1 234 358  99 215 2
     5 122 942   5.1777093467  10 6 1 3201134 3 865 105 1102005 1

Notice how the values appear to align from one row to the next. Notice also that they seem to be squished together with no separators. We need to know the exact position of each piece of information in a line in order to make sense of it. SAMHSA provides a 2,000-page codebook with all of the information needed to read the file. In the codebook we find that the age field appears in positions 34-35 and is coded in intervals from 1 to 11. For instance, the first two records shown above have age categories of 4 and 11, and the codebook tells us that a code of 4 stands for the age bracket “6 to 11”, and 11 is for “65+”.
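
To read fields out of a fixed-width file, we have to supply those positions to the reader. Below is a minimal sketch with pandas’ read_fwf that pulls out just the age field; the column specification comes from the positions 34-35 quoted above (converted to a 0-indexed, half-open interval for pandas), and the real file has many more fields described in the codebook.

import pandas as pd

# Positions 34-35 (1-indexed) hold the age category, which is
# the 0-indexed, half-open interval (33, 35)
dawn_age = pd.read_fwf('data/DAWN-Data.txt', colspecs=[(33, 35)],
                       names=['age'], header=None)
print(dawn_age['age'].value_counts().head())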

Note

A widely adopted convention is to use the filename extension, such as .csv, .tsv, and .txt, to indicate the format of the file’s contents. File names that end in .csv are expected to contain comma-separated values, those ending in .tsv tab-separated values, and .txt files are generally plain text without a particular format. However, these extension names are only suggestions. Even if a file has a .csv extension, the actual contents might not be formatted properly! It’s good practice to inspect the contents of the file before loading it into a data frame. If the file is not too large, you can open and examine it with a plain text editor. Otherwise, you can view a couple of lines using .readline() or shell commands.

Other plain text formats that are popular include hierarchical formats and loosely structured formats (in contrast to formats that directly support table structures). These are covered in greater detail in other chapters, but for completeness we briefly describe them here.

8.2.3. Hierarchical Formats

Hierarchical data formats store data with a nested structure. For instance, JavaScript Object Notation (JSON) is a common format used for communication by web servers. JSON files have a hierarchical structure with keys and values similar to a Python dictionary. Each record in a JSON file can have different fields, and records can contain other records. The eXtensible Markup Language (XML) and HyperText Markup Language (HTML) are other common formats for storing documents on the Internet. Like JSON, these files also contain data in a hierarchical, key-value format. We cover both formats (JSON and XML) in more detail in Chapter 14.
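
For a flavor of this nesting, here is a toy JSON record (made up from the restaurant values we saw earlier, not an actual source file) read with Python’s built-in json module:

import json

record = '''
{
  "name": "NRGIZE LIFESTYLE CAFE",
  "address": {"street": "1200 VAN NESS AVE", "floor": 3},
  "inspections": [{"date": "20160513", "score": 94}]
}
'''

nested = json.loads(record)               # nested dictionaries and lists
print(nested['address']['street'])        # 1200 VAN NESS AVE
print(nested['inspections'][0]['score'])  # 94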

Next we briefly describe other plain text files that don’t fall into any of the previous categories but still have some structure to them that enables us to read and extract information.

8.2.4. Loosely Structured Formats

Web logs, instrument readings, and program logs typically provide data in plain text. For example, below is one line of a Web log (we’ve split it across multiple lines for readability). It contains information such as the date and time of the request and the type of request made to the Web site.

169.237.46.168 - -
[26/Jan/2004:10:47:58 -0800]"GET /stat141/Winter04 HTTP/1.1" 301 328
"http://anson.ucdavis.edu/courses"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"

There is structure present, but not in a simple delimited file format, which is what we mean by “loosely structured”. We see that the date and time appear between square brackets and the type of request (GET in this case) follows the date-time information and appears in quotes. Later in Chapter 13, we will use these observations about the log’s structure and string manipulation tools to extract the values of interest into a data table.
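
As a small taste of those tools (a sketch; Chapter 13 develops them properly), a regular expression keyed to the brackets and quotes pulls out the timestamp and the request:

import re

log_entry = ('169.237.46.168 - - [26/Jan/2004:10:47:58 -0800]'
             '"GET /stat141/Winter04 HTTP/1.1" 301 328')

# The timestamp sits between square brackets, the request between the first pair of quotes
match = re.search(r'\[(.*?)\] *"(.*?)"', log_entry)
timestamp, request = match.groups()
print(timestamp)  # 26/Jan/2004:10:47:58 -0800
print(request)    # GET /stat141/Winter04 HTTP/1.1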

As another example, below is a single recording of measurements taken with a wireless device. The device reports the timestamp, identifier, location of the device, and the signal strengths that it picks up from other devices. This information uses a combination of formats: key=value pairs, semicolon-delimited fields, and comma-delimited values.

t=1139644637174;id=00:02:2D:21:0F:33;pos=2.0,0.0,0.0;degree=45.5;
00:14:bf:b1:97:8a=-33,2437000000,3;00:14:bf:b1:97:8a=-38,2437000000,3;

As with the Web logs, we can use string manipulation and the patterns in the recordings to extract features into a table.
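
For instance, a rough sketch that splits one such recording on its semicolons and equals signs (assuming the format stays as regular as the line above):

recording = ('t=1139644637174;id=00:02:2D:21:0F:33;pos=2.0,0.0,0.0;degree=45.5;'
             '00:14:bf:b1:97:8a=-33,2437000000,3;00:14:bf:b1:97:8a=-38,2437000000,3;')

# Split the record into key=value pieces on the semicolons
for piece in recording.strip(';').split(';'):
    key, _, value = piece.partition('=')  # split on the first '=' only
    print(key, '->', value)
# The comma-delimited values (such as -33,2437000000,3) could be split further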

Although we focus on data tables in this chapter, the point of these examples is that data come in many types of file formats!

So far, we have used the term ‘plain text’ to broadly cover formats that can be viewed with a text editor. However, plain text files can use many different encodings, and if we don’t specify the encoding correctly when reading a file, the values in the data frame might contain gibberish. We give an overview of file encoding next.

8.2.5. File Encoding

Modern computers store data as long sequences of bits: 0s and 1s. Character encodings, like ASCII, tell the computer how to translate between bits and actual text. For example, in ASCII, the bits 100 0001 stand for the letter A, and 100 0010 for B. The most basic kind of plain text supports only standard ASCII characters, which include the upper- and lowercase English letters, numbers, punctuation symbols, and spaces.
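
We can check this directly in Python, since ord gives a character’s code and bin its bits:

print(ord('A'), bin(ord('A')))  # 65 0b1000001
print(ord('B'), bin(ord('B')))  # 66 0b1000010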

ASCII encoding is not sufficient to represent many special characters and characters from other languages. Other, more modern character encodings can represent many more characters. Common encodings for documents and Web pages are Latin-1 (ISO-8859-1) and UTF-8. UTF-8 can represent over one million characters and is backward compatible with ASCII, meaning that it uses the same representation as ASCII for English letters, numbers, and punctuation.
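
A quick illustration of how the same text becomes different bytes under different encodings (the plain ASCII characters come out identically, while the accented character does not):

text = 'café'

print(text.encode('utf-8'))    # b'caf\xc3\xa9' (the é takes two bytes)
print(text.encode('latin-1'))  # b'caf\xe9'     (the é takes one byte)
print('A'.encode('utf-8') == 'A'.encode('ascii'))  # True: same bytes for ASCII characters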

When we have a text file, we usually need to figure out its encoding. If we choose the wrong encoding when reading a file, Python either reads in incorrect values or raises an error. The best way to find the encoding is to check the data’s documentation, which often states the encoding explicitly.

When we don’t know what the encoding is, we have to make a guess. The chardet package has a function called detect() that infers a file’s encoding. Since these guesses are imperfect, the function also returns a confidence between 0 and 1. We use this function to look at the files in the data folder for this chapter.

import chardet

line = '{:<25} {:<10} {}'.format

# for each file, print its name, encoding & confidence in the encoding
print(line('File', 'Encoding', 'Confidence'))

for filepath in Path('data').glob('*'):
    result = chardet.detect(filepath.read_bytes())
    print(line(str(filepath), result['encoding'], result['confidence']))
File                      Encoding   Confidence
data/inspections.csv      ascii      1.0
data/co2_mm_mlo.txt       ascii      1.0
data/violations.csv       ascii      1.0
data/DAWN-Data.txt        ascii      1.0
data/legend.csv           ascii      1.0
data/businesses.csv       ISO-8859-1 0.73

The detection function is quite certain that all but one of the files are ASCII encoded. The exception is businesses.csv, which appears to have an ISO-8859-1 encoding. We run into trouble if we ignore this encoding and try to read the businesses file into pandas without specifying it.

# naively reads file without considering encoding
pd.read_csv('data/businesses.csv')
[...stack trace omitted...]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in
position 8: invalid continuation byte

To successfully read the data, we must specify the ISO-8859-1 encoding.

bus = pd.read_csv('data/businesses.csv', encoding='ISO-8859-1')
   business_id                                name                       address  postal_code
0           19               NRGIZE LIFESTYLE CAFE  1200 VAN NESS AVE, 3RD FLOOR        94109
1           24  OMNI S.F. HOTEL - 2ND FLOOR PANTRY  500 CALIFORNIA ST, 2ND FLOOR        94104
2           31      NORMAN'S ICE CREAM AND FREEZES           2801 LEAVENWORTH ST        94133
3           45                 CHARLIE'S DELI CAFE                3202 FOLSOM ST        94110

In this section, we have introduced formats for plain text data that are widely used for storing and exchanging tables. The comma-separated-value format is the most common, but others, such as tab-separated and fixed-width, are also prevalent. Even when a file name has a .csv extension, it’s still a good idea to confirm that it is indeed a CSV file. Likewise, file encoding can be a bit mysterious to figure out, and unless there is metadata that explicitly gives us the encoding, guesswork comes into play. When an encoding is not 100% confirmed, it’s a good idea to seek additional documentation.

Note

People often confuse CSV and TSV files with spreadsheets. This is in part because most spreadsheet software (like Microsoft Excel) will automatically display a CSV file as a table in a workbook. Behind the scenes, Excel looks at the file format and encoding just like we’ve done in this section. However, Excel files have a different format than CSV and TSV files, and we need to use different pandas functions to read these formats into Python.
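
For example, a spreadsheet would be read with pandas’ read_excel rather than read_csv. A minimal sketch with a hypothetical file name (the openpyxl package is needed for .xlsx files):

import pandas as pd

# Hypothetical spreadsheet; not part of this chapter's data folder
sheet = pd.read_excel('data/businesses.xlsx', sheet_name=0)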

Another potentially important aspect of a source file is its size. If a file is huge then we might not be able to read it into a data frame. In the next section, we discuss how to figure out a source file’s size.