Exploratory Data Analysis

In exploratory data analysis (EDA), a major component of the data science lifecycle, we summarize, visualize, and transform data in order to understand them more deeply. John Tukey, the statistician who coined the term EDA, writes:

‘Exploratory data analysis’ is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there.

A student of data science may find this definition unsatisfying: will an attitude alone generate a data analysis? Tukey's point, however, is that we should understand the data before rushing to apply statistical tests. We would benefit from remembering his words amid today's resurgence of algorithmic decision-making.

Through exploratory data analysis we seek to deeply understand our data. Maintaining "a state of flexibility" helps us know what to look for. Fluency with our computational tools allows us to conduct our search. In this chapter, we emphasize the necessary attitude as we introduce increasingly sophisticated tools. Although EDA varies between domains of study, we almost always begin EDA by understanding:

  1. The data types of columns and the granularity of rows in the dataset.
  2. The distributions of quantitative data and measures of center and spread.
  3. Relationships between quantities in the dataset.
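
As a rough illustration of these three starting points, here is a minimal sketch using pandas. The `students` table, its column names, and its values are hypothetical, made up only for this example; a real analysis would load its own data and would likely lean on plots as well as numeric summaries.

```python
import pandas as pd

# Hypothetical toy table (not from any real dataset): one row per student.
students = pd.DataFrame({
    "name": ["Ana", "Ben", "Cal", "Dee", "Eli"],
    "hours_studied": [2.0, 4.0, 6.0, 8.0, 10.0],
    "exam_score": [55.0, 60.0, 70.0, 85.0, 90.0],
})

# 1. Data types of columns and granularity of rows.
column_types = students.dtypes
# If every name appears once, each row plausibly represents one student.
one_row_per_student = students["name"].is_unique

# 2. Distributions of quantitative columns: center (mean, median) and
#    spread (standard deviation, quartiles).
summary = students[["hours_studied", "exam_score"]].describe()

# 3. Relationship between two quantities (Pearson correlation).
correlation = students["hours_studied"].corr(students["exam_score"])

print(column_types)
print(one_row_per_student)
print(summary)
print(round(correlation, 3))
```

Each step answers one of the questions in the list: `dtypes` and the uniqueness check address types and granularity, `describe` reports center and spread, and `corr` gives a first numeric hint about how two columns move together.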