10.3. What to Look For in a Relationship?

When we investigate multiple variables, we examine the relationships between them in addition to their univariate distributions. In this section, we consider pairs of features and describe what to look for. As Table 10.4 shows, the combination of feature types (both quantitative, both qualitative, or mixed) guides the sort of plot we make. We consider these combinations in turn.

Table 10.4 Plots for Pairs of Feature Types

Feature Type | Dimension | Plot
Quantitative | Two Features | Scatter plot, smooth curve, contour plot, heat map, quantile-quantile plot
Qualitative | Two Features | Side-by-side bar plots, mosaic plot, overlaid lines
Mixed | Two Features | Overlaid density curves, side-by-side box-and-whisker plots, overlaid smooth curves, quantile-quantile plot

We’ll work with the dogs dataframe once again.

dogs = pd.read_csv('data/akc.csv')
dogs
breed group score longevity ... size weight height repetition
0 Border Collie herding 3.64 12.52 ... medium NaN 51.0 <5
1 Border Terrier terrier 3.61 14.00 ... small 6.0 NaN 15-25
2 Brittany sporting 3.54 12.92 ... medium 16.0 48.0 5-15
... ... ... ... ... ... ... ... ... ...
169 Wire Fox Terrier terrier NaN 13.17 ... small 8.0 38.0 25-40
170 Wirehaired Pointing Griffon sporting NaN 8.80 ... medium NaN 56.0 25-40
171 Xoloitzcuintli non-sporting NaN NaN ... medium NaN 42.0 NaN

172 rows × 12 columns

10.3.1. Two Quantitative Features

If both features are quantitative, then we often examine their relationship with a scatter plot. Each point in a scatter plot marks the position of a pair of values for an observation. So, we can think of a scatter plot as a two-dimensional rug plot.

With scatter plots, we look for linear and simple nonlinear relationships, where a transformation of one or the other or both features leads to a linear relationship, and we examine the strength of the relationship.

Figure 10.7 displays a scatter plot of weight and height of dog breeds (both are quantitative). We observe that dogs that are above average in height tend to be above average in weight. This relationship appears nonlinear: the change in weight for taller dogs grows faster than for shorter dogs. Indeed, that makes sense if we think of a dog as basically shaped like a box: for similarly proportioned boxes, the weight of the contents of the box has a cubic relationship to its length.

sns.scatterplot(data=dogs, x='height', y='weight')
sns.rugplot(data=dogs, x='height', y='weight', height=0.05);

Fig. 10.7 This scatter plot shows the nonlinear relationship between weight and height for dog breeds.
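The box argument above suggests that taking logs of both features should straighten the relationship. The following is a minimal sketch of that idea using made-up, noise-free values rather than the AKC measurements; the heights and the constant `c` are assumptions chosen only for illustration.

```python
import numpy as np

# Hypothetical, noise-free data: heights in cm and weights that follow
# an exact cubic law, weight = c * height**3 (not the AKC measurements).
height = np.array([20.0, 30.0, 40.0, 50.0, 60.0, 70.0])
c = 2e-4  # assumed constant of proportionality
weight = c * height**3

# On the raw scale the relationship is curved. After taking logs of both
# features it becomes a line: log(weight) = log(c) + 3 * log(height).
slope, intercept = np.polyfit(np.log(height), np.log(weight), deg=1)

print(f"slope on the log-log scale: {slope:.2f}")
print(f"recovered constant: {np.exp(intercept):.1e}")
```

With real measurements the fitted slope would only be approximately 3, and how close it comes is one way to judge whether the box analogy is reasonable.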

Two Univariate Plots ≠ One Bivariate Plot. The histograms for two quantitative features do not contain enough information to create their scatter plot, so we must exercise caution when we read a pair of histograms: they do not show how the two quantities vary together. We need to use one of the plots listed in the quantitative row of Table 10.4 (scatter plot, smooth curve, contour plot, heat map, quantile-quantile plot) to get a sense of the relationship between two quantitative features.
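A small simulation can make this warning concrete. The sketch below uses synthetic data (an assumption, unrelated to the dogs dataframe): shuffling one feature leaves its histogram unchanged but destroys its relationship with the other feature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: x and y are strongly related in the original pairing.
x = rng.normal(size=500)
y_related = x + rng.normal(scale=0.3, size=500)

# Shuffling y keeps exactly the same values (so the same histogram)
# but breaks the pairing with x.
y_shuffled = rng.permutation(y_related)

bins = np.linspace(y_related.min(), y_related.max(), 11)
hist_related, _ = np.histogram(y_related, bins=bins)
hist_shuffled, _ = np.histogram(y_shuffled, bins=bins)

corr_related = np.corrcoef(x, y_related)[0, 1]
corr_shuffled = np.corrcoef(x, y_shuffled)[0, 1]

print("identical histograms:", np.array_equal(hist_related, hist_shuffled))
print(f"correlation, original pairing: {corr_related:.2f}")
print(f"correlation, shuffled pairing: {corr_shuffled:.2f}")
```

Both versions of the data produce the same pair of histograms, yet their scatter plots would look entirely different.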

10.3.2. One Qualitative and One Quantitative Variable

When we examine the relationship between a quantitative and a qualitative feature, we often use the qualitative feature to divide the data into groups and compare the distribution of the quantitative feature across these groups. For example, we can compare the distribution of breed height for small, medium, and large dogs (Figure 10.8). We see that the height distributions for the small and medium breeds both appear bimodal, and the left mode is the larger mode in each group. Also, the small and medium groups have a larger spread in height than the large group of breeds.

plt.figure(figsize=(9, 4))
sns.kdeplot(data=dogs, x='height', hue='size');

Fig. 10.8 Kernel density estimates of dog breed heights, split by size.
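To show how a density curve like those in Figure 10.8 is built, here is a minimal sketch of a Gaussian kernel density estimate written with NumPy. The function name `gaussian_kde_sketch`, the toy heights, and the bandwidth of 4 are all assumptions for illustration; seaborn picks its own bandwidth, and this is not its implementation.

```python
import numpy as np

def gaussian_kde_sketch(values, grid, bandwidth):
    """Average of Gaussian bumps, one centered on each observation."""
    z = (grid[:, None] - values[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Hypothetical heights (cm) for two groups, with a gap between 40 and 45.
small = np.array([22.0, 25.0, 28.0, 35.0, 38.0, 40.0])
medium = np.array([45.0, 47.0, 50.0, 55.0, 58.0, 60.0])

grid = np.linspace(10, 75, 131)  # evaluation points, 0.5 cm apart
step = grid[1] - grid[0]
bw = 4.0  # assumed bandwidth

dens_small = gaussian_kde_sketch(small, grid, bw)
dens_medium = gaussian_kde_sketch(medium, grid, bw)

# Each curve integrates to roughly 1 over the grid.
print(f"area under small curve:  {dens_small.sum() * step:.3f}")
print(f"area under medium curve: {dens_medium.sum() * step:.3f}")

# Inside the gap, where neither group has data, both curves are positive.
gap_index = np.argmin(np.abs(grid - 42.5))
print(f"density at 42.5 cm: small {dens_small[gap_index]:.3f}, "
      f"medium {dens_medium[gap_index]:.3f}")
```

Because every observation contributes a smooth bump, the estimated curves spread beyond the observed values, so two groups with disjoint ranges can still produce overlapping density curves.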

Side-by-side box plots offer a similar comparison of distributions across groups. Box plots are simpler than density curves and give a cruder picture of each distribution. Figure 10.9 shows three box plots for height, one for each size of dog. These plots make it clear that the size categorization is based on height because there is no overlap in height ranges for the groups. (This was not evident in the density curves because of the smoothing.) What we don’t see in these box plots is the bimodality in the small and medium groups, but we can still see that the large dogs have a narrower spread in height than the other two groups.

fig, (ax1, ax2) = plt.subplots(ncols = 2, figsize=(10, 4))

sns.boxplot(data=dogs, x='size', y='height', ax=ax1)
sns.violinplot(data=dogs, x='size', y='height', ax=ax2)

plt.tight_layout()

Fig. 10.9 Box plots and violin plots for heights split by size.

Figure 10.9 also includes a violin plot of height for each size group. A violin plot sketches a density curve along an axis for each group, and a flipped copy of the curve is added to create a symmetric “violin.” The violin plot aims to bridge the gap between overlaid density curves and side-by-side box plots.
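Whether the groups' height ranges overlap can also be checked numerically. The sketch below uses a small made-up dataframe (`toy` and its values are hypothetical stand-ins for `dogs`) and compares each group's maximum with the next group's minimum.

```python
import pandas as pd

# Hypothetical stand-in for the dogs dataframe (values are made up).
toy = pd.DataFrame({
    'size': ['small', 'small', 'small', 'medium', 'medium', 'large', 'large'],
    'height': [20.0, 28.0, 35.0, 42.0, 55.0, 61.0, 70.0],
})

# Smallest and largest height within each size group, ordered by minimum.
ranges = toy.groupby('size')['height'].agg(['min', 'max']).sort_values('min')

# Groups do not overlap when each group's max is below the next group's min.
group_max = ranges['max'].iloc[:-1].to_numpy()
next_min = ranges['min'].iloc[1:].to_numpy()
no_overlap = bool((group_max < next_min).all())

print(ranges)
print("no overlap:", no_overlap)
```

The same `groupby` and `agg` calls could be applied to `dogs` directly, keeping in mind that breeds with a missing height are ignored by `min` and `max`.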

Box-and-whisker plots (also known simply as box plots) give a visual summary of a few important statistics for a distribution. Typically, they display the median, the 25th and 75th percentiles, and whiskers that extend toward the minimum and maximum, with unusually large or small values drawn as separate points. Because box plots are constructed from only a few summary statistics, they cannot reveal as much structure in a distribution as a histogram or density curve. They primarily reveal symmetry and skew, long or short tails, and unusually large or small values.

Figure 10.10 is a visual explanation of the parts of a box plot; it is constructed from the 25th, 50th, and 75th percentiles. Asymmetry is evident from a median that is not in the middle of the box, tail lengths are shown by the whiskers, and outliers by the points that appear beyond the whiskers.


Fig. 10.10 Box plot of dog longevities, with statistics labeled.
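The parts of a box plot can be computed by hand. The sketch below uses made-up longevity values and assumes the common convention that whiskers reach the most extreme observations within 1.5 times the interquartile range of the box; the text above does not state which rule Figure 10.10 uses, so treat the 1.5 factor as an assumption.

```python
import numpy as np

# Hypothetical longevity values in years (not taken from the AKC data).
longevity = np.array([6.5, 9.0, 10.5, 11.0, 11.5, 12.0,
                      12.5, 13.0, 13.5, 14.0, 19.5])

# The box: 25th, 50th, and 75th percentiles.
q1, median, q3 = np.percentile(longevity, [25, 50, 75])
iqr = q3 - q1

# Assumed convention: whiskers stop at the most extreme observations that
# lie within 1.5 * IQR of the box; anything farther out is drawn as a point.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = longevity[(longevity >= lower_fence) & (longevity <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = longevity[(longevity < lower_fence) | (longevity > upper_fence)]

print(f"box: {q1} to {q3}, median {median}")
print(f"whiskers: {whisker_low} to {whisker_high}")
print("points beyond the whiskers:", outliers)
```

In this toy example the median sits in the middle of the box, so the box itself suggests symmetry, while the two isolated points mark unusually short and long lifespans.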

10.3.3. Two Qualitative Features

With two qualitative features, we often examine the distribution of one feature across subgroups defined by the second. In effect, we hold one feature constant and plot the distribution of the second. For example, we might consider the relationship between the suitability of a breed for children and the size of the breed. Figure 10.11 offers one view into this relationship. The three vertical bands represent the three child-friendly categories. The width of each band is proportional to the number of breeds in that category. Each band is further subdivided according to the size of the breed; these are the color blocks. We see that among the low-suitability breeds, roughly 70% are small.


Fig. 10.11 A mosaic plot showing the proportions of dog sizes for each level of children suitability.

The heights of the color blocks represent the percentage of dogs of the corresponding size in the suitability category. There are three sets of proportions (one each for low, medium, and high suitability), and the proportions in each set sum to 1, or 100%. These proportions are also displayed numerically in the following table.

kids = (dogs['children'].replace({ 1.0: 'high', 2.0: 'medium', 3.0: 'low' }))

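def proportions(series):
    # Series.sum() skips missing values, so an empty size/suitability cell
    # does not turn the whole row into NaN.
    return series / series.sum()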

counts = (dogs.assign(kids=kids)
 .groupby(['kids', 'size'])
 .size()
 .rename('count')
)

prop_table = (counts
 .unstack(level=1)
 .reindex(['high', 'medium', 'low'])
 .apply(proportions, axis=1)
)

prop_table
size large medium small
kids
high 0.37 0.36 0.27
medium 0.29 0.34 0.37
low 0.10 0.20 0.70
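A table of row proportions like this one can also be produced in a single step with `pd.crosstab`. The sketch below runs on a small made-up dataframe (`toy` and its values are hypothetical) so that it is self-contained; with the real data, the two arguments would be the `kids` series and `dogs['size']`.

```python
import pandas as pd

# Hypothetical stand-in for the dogs data (values are made up).
toy = pd.DataFrame({
    'kids': ['high', 'high', 'high', 'high',
             'low', 'low', 'low', 'low', 'low'],
    'size': ['large', 'large', 'medium', 'small',
             'small', 'small', 'small', 'medium', 'large'],
})

# normalize='index' divides each row by its row total, so every
# suitability level gets its own set of proportions.
row_props = pd.crosstab(toy['kids'], toy['size'], normalize='index')

print(row_props.round(2))
print(row_props.sum(axis=1))  # each row sums to 1
```

Passing `normalize='columns'` instead would give proportions within each size, which answers a different question, so it is worth stating which margin the proportions are conditioned on.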

We can also visualize these proportions using line plots, as shown in Figure 10.12. There is one line (set of connected dots) for each suitability level. The connected dots give the breakdown of size within suitability. We see that small breeds make up most of the low-suitability group, while the high-suitability group is spread more evenly across the three sizes.

props = prop_table.reset_index().melt(id_vars=['kids'], value_name='prop')
sns.pointplot(data=props,
              x='size', y='prop', hue='kids',
              order=['small', 'medium', 'large'],
              hue_order=['low', 'medium', 'high'])
plt.ylabel('proportion within suitability level');

Fig. 10.12 This point plot shows, for each level of children suitability, the proportion of breeds of each size.

We can also present these proportions as a collection of side-by-side bar plots:

g = sns.catplot(data=props,
                x='size', y='prop',
                order=['small', 'medium', 'large'],
                col='kids',
                col_order=['low', 'medium', 'high'],
                kind='bar')

g.set_ylabels('proportion within suitability level');

10.3.4. In the Next Section

We’ve covered exploratory visualizations that incorporate one or two features. In the next section, we’ll discuss visualizations that incorporate more than two features at once.