9.4. Guidelines for Exploration
So far in this chapter, we have:
introduced the notion of feature types;
seen how the feature types help you figure out what plots to make; and
described how to read distributions and relationships in a visualization.
Let’s now describe the general process of EDA. You have already seen EDA in action in Chapter X, where we developed checks for data quality and transformed features to make them more useful for analysis. Below is a set of questions to guide you when making plots.
How are the values of Feature X distributed?
How do Feature X and Feature Y relate to each other?
Is the distribution of Feature X the same for subgroups defined by Feature Z?
Are there any unusual observations in Feature X? In the combination of (X, Y)? In Feature X for a subgroup defined by Feature Z?
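As a rough illustration, the sketch below pairs each of the four questions with a numeric summary that can accompany the corresponding plot. The data frame and its columns (`size`, `price`, `city`) are hypothetical stand-ins for Features Y, X, and Z, and the 1.5 × IQR rule used to flag unusual values is one common convention, not a method prescribed here. Only the first part of the last question (unusual values in X alone) is covered.

```python
import pandas as pd

# Hypothetical toy data: X = price, Y = size, Z = city (all names are assumptions).
df = pd.DataFrame({
    "size":  [40, 55, 60, 75, 80, 95, 100, 120],
    "price": [200, 260, 280, 350, 380, 450, 470, 1500],
    "city":  ["A", "A", "A", "A", "B", "B", "B", "B"],
})

# Q1: How are the values of X distributed?
# (A histogram of df["price"] would show the same information visually.)
x_summary = df["price"].describe()

# Q2: How do X and Y relate to each other?
# (A scatter plot of size against price is the matching plot.)
corr = df["size"].corr(df["price"])

# Q3: Is the distribution of X the same for subgroups defined by Z?
# (Side-by-side box plots of price by city are the matching plot.)
medians = df.groupby("city")["price"].median()

# Q4: Are there unusual observations in X?
# Assumption: flag values more than 1.5 * IQR beyond the quartiles.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
unusual = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]

print(x_summary)
print("correlation of size and price:", round(corr, 2))
print(medians)
print(unusual)
```

The summaries are not a substitute for the plots. They are a quick way to check that what you think you see in a plot is really in the data.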
One way to build your intuition about the distributions and relationships of different kinds of features is to guess what you will see before you make the plot. That is, try to sketch or describe your best answer to the above questions first, and then make the plot. Some general patterns can inform your guess. For example, a distribution with a natural lower or upper bound on its values tends to have a long tail on the other side: income (bounded below by 0) tends to have a long right tail, and exam scores (bounded above by 100) tend to have a long left tail.
As you answer each of the above questions, tie your answer back to the feature and the dataset. It is also important to take an active, inquisitive approach to the investigation. To guide your exploration, ask “what next” and “so what” questions, such as the following.
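The bounded-value pattern can be tried out on simulated data. The sketch below is only an illustration under stated assumptions: the lognormal "incomes" and beta-distributed "exam scores" are made-up stand-ins, not real data, and the gap between mean and median is used as a rough indicator of tail direction rather than a formal measure of skewness.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical incomes: bounded below by 0, with no fixed upper limit.
incomes = [random.lognormvariate(10.5, 0.8) for _ in range(10_000)]

# Hypothetical exam scores: bounded above by 100, with most scores fairly high.
scores = [100 * random.betavariate(8, 2) for _ in range(10_000)]


def mean_minus_median(values):
    """Rough tail indicator: positive suggests a long right tail,
    negative suggests a long left tail."""
    return statistics.mean(values) - statistics.median(values)


# Make the guess first (right tail for incomes, left tail for scores),
# then check it against the data.
print("incomes:", mean_minus_median(incomes))
print("scores: ", mean_minus_median(scores))
```

A long tail pulls the mean toward it while the median stays put, so the sign of the difference gives a quick check on the guess before you look at a histogram.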
Do you have reason to expect that one group/observation might be different?
Why might your observation about the data shape matter?
What comparison might bring added value to the investigation?
Are there other potentially important features to compare with or against?
In the next section, we put these guidelines into practice with an example of the EDA process.