# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os

Exploratory Data Analysis

Exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there.

John Tukey

In Exploratory Data Analysis (EDA), the third step of the data science lifecycle, we summarize, visualize, and transform the data in order to understand it more deeply. In particular, through EDA we identify potential issues in the data and discover trends that inform further analyses.

We seek to understand the following properties about our data:

  1. Structure: the format of our data file.
  2. Granularity: how fine or coarse each row and column is.
  3. Scope: how (in)complete our data are.
  4. Temporality: how the data are situated in time.
  5. Faithfulness: how well the data capture "reality".
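
As a rough illustration, the five properties above can each be probed with a line or two of pandas. The sketch below uses a tiny made-up table, not the Berkeley Police Department data; the column names `case_id`, `date`, and `hour` are assumptions chosen for the example, and counting missing values is only one simple proxy for scope.

```python
import pandas as pd

# Hypothetical toy records, invented purely for illustration.
calls = pd.DataFrame({
    "case_id": [1, 2, 3, 3],
    "date": ["2024-01-05", "2024-01-06", None, "2024-02-01"],
    "hour": [14, 25, 9, 9],
})

# Structure: the shape of the table and the type of each column.
n_rows, n_cols = calls.shape
dtypes = calls.dtypes

# Granularity: does each row stand for one case? Check whether case_id is unique.
one_row_per_case = calls["case_id"].is_unique

# Scope: count the missing values in each column.
missing = calls.isna().sum()

# Temporality: the span of time the records cover.
dates = pd.to_datetime(calls["date"])
date_range = (dates.min(), dates.max())

# Faithfulness: flag values that cannot be real, such as hours outside 0-23.
bad_hours = calls[~calls["hour"].between(0, 23)]

print(n_rows, n_cols, one_row_per_case)
print(missing)
print(date_range)
print(bad_hours)
```

In this toy table the repeated `case_id` suggests that rows are finer-grained than one per case, the missing date hints at incomplete records, and the hour of 25 is a value that cannot reflect reality.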

Although we introduce data cleaning and EDA separately to help organize this book, in practice you will often switch between the two. For example, visualizing a column may reveal misformatted values that then need to be cleaned. With this in mind, we return to the Berkeley Police Department datasets for exploration.
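
To make the back-and-forth between exploration and cleaning concrete, here is a minimal sketch on a hypothetical `city` column (invented values, not taken from the police datasets). It tabulates the column rather than plotting it, but the idea is the same: looking at the values exposes a formatting problem, and a cleaning step resolves it.

```python
import pandas as pd

# Hypothetical column, invented for illustration.
reports = pd.DataFrame({"city": ["Berkeley", "berkeley ", "BERKELEY", "Oakland"]})

# Exploration: tabulating the column exposes inconsistent formatting,
# since the same city appears under several spellings.
raw_counts = reports["city"].value_counts()

# Cleaning: strip stray whitespace and normalize capitalization.
reports["city"] = reports["city"].str.strip().str.title()

# Exploration again: the counts now reflect two distinct cities.
clean_counts = reports["city"].value_counts()

print(raw_counts)
print(clean_counts)
```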