8.3. File Size

Computers have finite resources. You have likely encountered these limits firsthand if your computer has slowed down from having too many applications open at once. We want to make sure that we do not exceed the computer’s limits while working with data, and we might examine a file differently, depending on its size. If we know that our dataset is relatively small, then a text editor or a spreadsheet can be convenient to look at the data. On the other hand, for large datasets, a more programmatic exploration or even distributed computing tools may be needed.

In many situations, we analyze datasets downloaded from the Internet. These files reside on the computer’s disk storage. In order to use Python to explore and manipulate the data, we need to read the data into the computer’s memory, also known as random access memory (RAM). All Python code requires the use of RAM, no matter how short the code is.

A computer’s RAM is typically much smaller than a computer’s disk storage. For example, one computer model released in 2018 had 32 times more disk storage than RAM. Unfortunately, this means that data files can often be much bigger than what is feasible to read into memory. Both disk storage and RAM capacity are measured in terms of bytes. Roughly speaking, each character in a text file adds one byte to a file’s size. To succinctly describe the sizes of larger files, we use the prefixes as described in the following Table 8.1.

Table 8.1 Prefixes for common file sizes.

Multiple    Notation    Number of Bytes
Kibibyte    KiB         1024
Mebibyte    MiB         1024²
Gibibyte    GiB         1024³
Tebibyte    TiB         1024⁴
Pebibyte    PiB         1024⁵

For example, a file containing 52,428,800 characters takes up \(52428800 / 1024^2 = 50\) mebibytes, or 50 MiB, on disk.

Note

Why use multiples of 1024 instead of simple multiples of 1000 for these prefixes? This is a historical result of the fact that most computers use a binary number scheme where powers of 2 are simpler to represent (\(1024 = 2^{10}\)). You will also see the typical SI prefixes used to describe size—kilobytes, megabytes, and gigabytes, for example. Unfortunately, these prefixes are used inconsistently. Sometimes a kilobyte refers to 1000 bytes; other times, a kilobyte refers to 1024 bytes. To avoid confusion, we will stick to kibi-, mebi-, and gibibytes which clearly represent multiples of 1024.

Many computers have much more disk storage than available memory. It is not uncommon to have a data file happily stored on a computer that will overflow the computer’s memory if we attempt to manipulate it with a program, including Python programs. We often begin our data work by making sure the files we work with are of manageable size. To accomplish this, we can use the built-in os and pathlib libraries.

from pathlib import Path
import os

import numpy as np

kib = 1024
line = '{:<25} {}'.format

print(line('File', 'Size (KiB)'))
for filepath in Path('data').glob('*'):
    size = os.path.getsize(filepath)
    print(line(str(filepath), np.round(size / kib)))
File                      Size (KiB)
data/inspections.csv      455.0
data/co2_mm_mlo.txt       50.0
data/violations.csv       3639.0
data/DAWN-Data.txt        273531.0
data/legend.csv           0.0
data/businesses.csv       645.0

We see that the businesses.csv file takes up 645 KiB on disk, putting it well within the memory capacity of most systems. Although the violations.csv file takes up 3.6 MiB of disk storage, most machines can easily read it into a pandas DataFrame as well. But DAWN-Data.txt, which contains the DAWN survey data, is much larger.

The DAWN file takes up nearly 270 MiB of disk storage, and while some computers can work with this file in memory, it can slow down other systems. To make this data more manageable in Python, we can load in a subset of the columns rather than all of them.
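As a rough sketch of this idea, pandas’ read_csv accepts a usecols argument that parses and keeps only the listed columns, so the remaining columns never occupy memory. The column names below are hypothetical placeholders, and the DAWN file may in fact require a different reader (such as pd.read_fwf for fixed-width fields), so treat this as an illustration rather than the exact loading code.

import pandas as pd

# Illustrative only: the column names are placeholders, and the real DAWN
# file may need a different parser (e.g., pd.read_fwf for fixed-width data).
# usecols keeps memory usage down by parsing only the requested columns.
dawn_subset = pd.read_csv(
    'data/DAWN-Data.txt',
    usecols=['age', 'sex', 'drug_type'],  # hypothetical column names
)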

Sometimes we are interested in the total size of a folder instead of the size of individual files. For example, if we have one file of inspections for each month in a year, we might like to see whether we can combine all the data into a single data frame.

mib = 1024**2

total = 0
for filepath in Path('data').glob('*'):
    total += os.path.getsize(filepath) / mib

print(f'The data/ folder contains {total:.2f} MiB')
The data/ folder contains 271.80 MiB

Note

As a rule of thumb, reading in a file using pandas usually requires at least five times as much available memory as the file’s size on disk. For example, reading in a 1 GiB file will typically require at least 5 GiB of available memory. Memory is shared by all programs running on a computer, including the operating system, web browsers, and the Jupyter notebook itself. A computer with 4 GiB total RAM might have only 1 GiB available RAM with many applications running. With 1 GiB available RAM, it is unlikely that pandas will be able to read in a 1 GiB file.
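One way to apply this rule of thumb before loading a file is to compare its size against the memory currently available. The sketch below uses the third-party psutil package, which is not used elsewhere in this book and would need to be installed separately; it illustrates the heuristic rather than a hard limit.

import os
import psutil  # third-party package: pip install psutil

file_size = os.path.getsize('data/DAWN-Data.txt')
available = psutil.virtual_memory().available

# Apply the five-times rule of thumb before attempting to read the file
if 5 * file_size > available:
    print('This file may be too large to read comfortably into memory.')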

Next, we discuss strategies for working with data that are far larger than what is feasible to load into memory.

8.3.1. Working With Large Datasets

The popular term “big data” generally refers to the scenario where the data are large enough that even top-of-the-line computers can’t read the data directly into memory. This is a common scenario in scientific domains like astronomy, where telescopes capture many large images of space. It’s also common for companies that have lots of users.

Figuring out how to draw insights from large datasets is an important research problem that motivates the fields of database engineering and distributed computing. While we won’t cover these fields in depth in this book, we can provide a brief overview of basic approaches.

Subset the Data.

One simple approach is to subset the data. Rather than loading in the entire source file, we can either select a specific part of it (e.g. one day’s worth of data), or we can randomly sample the dataset. Because of its simplicity, we use this approach quite often in this book. The natural downside is that this approach loses many of the benefits of analyzing a large dataset, like being able to study rare events.
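As a minimal sketch of both variants, the snippet below reads only the first rows of a file and, separately, keeps a random fraction of rows while reading. The file and the 1% sampling fraction are illustrative choices, not part of any particular analysis in this book.

import numpy as np
import pandas as pd

# Option 1: read only the first 1,000 rows of the file.
first_rows = pd.read_csv('data/violations.csv', nrows=1_000)

# Option 2: keep a random ~1% of rows by skipping the rest as pandas reads.
# The callable receives each row index; returning True skips that row.
rng = np.random.default_rng(42)
sample = pd.read_csv(
    'data/violations.csv',
    skiprows=lambda i: i > 0 and rng.random() > 0.01,  # always keep the header
)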

Use a Database System.

As we discussed in Chapter 7, relational database management systems (RDBMSs) were specifically designed to store large datasets. These systems can manipulate data that are too large to fit into memory using SQL queries. Because of their advantages, RDBMSs are common in research and industry settings for data storage. One downside is that they often require a separate server for the data, which needs its own configuration. Another downside is that SQL is less flexible than Python in what it can compute, which becomes especially relevant for modeling and prediction. One useful hybrid approach is to use SQL to subset, aggregate, or sample the data into batches that are small enough to read into Python. Then we can use Python for more sophisticated analyses, as sketched below.
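The sketch below illustrates this hybrid pattern with Python’s built-in sqlite3 module: the aggregation runs inside the database engine, and only the small summary table is read into pandas. The database file, table, and column names are hypothetical.

import sqlite3
import pandas as pd

# Hypothetical database, table, and column names, for illustration only.
conn = sqlite3.connect('data/inspections.db')
query = '''
    SELECT business_id, COUNT(*) AS num_inspections
    FROM inspections
    GROUP BY business_id
'''
# The grouping happens in the database; only the small summary reaches memory.
summary = pd.read_sql(query, conn)
conn.close()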

Use a Distributed Computing System.

Another approach to handle complex computations on large datasets is to use a distributed computing system like MapReduce, Spark, or Ray. These systems work best on tasks that can be split into many smaller parts since they split up large datasets into smaller ones, then run programs on all of the smaller datasets at once. Because of this, these systems have great flexibility and can be used in a variety of scenarios like modeling and prediction. Their main downside is that they can require a lot of work to install and configure properly, since these systems are typically installed across many computers that need to coordinate with each other.
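As one hedged example of what such a system looks like in code, the sketch below uses PySpark, which would need to be installed and configured separately; the file path and column name are hypothetical. Spark splits the file into partitions, processes them in parallel, and only the small aggregated result comes back to the driver program.

from pyspark.sql import SparkSession

# Hypothetical sketch: assumes PySpark is installed and a large CSV with a
# 'category' column exists at the given path.
spark = SparkSession.builder.appName('size-sketch').getOrCreate()

# Spark reads the file in partitions and aggregates them in parallel.
df = spark.read.csv('data/large_file.csv', header=True)
counts = df.groupBy('category').count()
counts.show(5)

spark.stop()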

In sum, this section introduced common file size notation and showed how to check file sizes in Python. It is convenient to use Python to determine a file’s format, encoding, and size. Another powerful tool for working with files is the shell, which is widely used and has a more succinct syntax than Python. In the next section, we introduce a few command-line tools available in the shell for carrying out the same tasks of finding out information about a file before reading it into a data frame.