9.3. What to Look For in a Relationship?

When we investigate multiple variables, we examine the relationship between them, in addition to their univariate distributions. In this section, we consider pairs of features and describe what to look for. According to Table 9.4, the combination of quantitative and qualitative features guides us to make different sorts of plots. We consider these combinations in turn.

Table 9.4 Plots for Pairs of Feature Types

Feature Type

Dimension

Plot

Quantitative

Two Features

Scatter plot, smooth curve, contour plot, heat map, quantile-quantile plot

Qualitative

Two Features

Side-by-side bar plots, mosaic plot, overlaid lines

Mixed

Two Features

Overlaid density curves, side-by-side box-and-whisker plots, overlaid smooth curves, quantile-quantile plot

We’ll work with the dogs dataframe once again.

dogs = pd.read_csv('data/akc.csv')
dogs
breed group score longevity ... size weight height repetition
0 Border Collie herding 3.64 12.52 ... medium NaN 51.0 <5
1 Border Terrier terrier 3.61 14.00 ... small 6.0 NaN 15-25
2 Brittany sporting 3.54 12.92 ... medium 16.0 48.0 5-15
... ... ... ... ... ... ... ... ... ...
169 Wire Fox Terrier terrier NaN 13.17 ... small 8.0 38.0 25-40
170 Wirehaired Pointing Griffon sporting NaN 8.80 ... medium NaN 56.0 25-40
171 Xoloitzcuintli non-sporting NaN NaN ... medium NaN 42.0 NaN

172 rows × 12 columns

9.3.1. Two Quantitative Features

If both features are quantitative, then we often examine their relationship with a scatter plot. In some sense a scatter plot is like a two-dimensional rug plot. Can you see why? Each point in a scatter plot marks the position of a pair of values for an observation.

With scatter plots, we look for linear and simple nonlinear relationships, where a transformation of one or the other or both features leads to a linear relationship, and we examine the strength of the relationship.

Figure 9.7 displays a scatter plot of weight and height of dog breeds (both are quantitative). We observe that dogs that are above average in height tend to be above average in weight. This relationship appears nonlinear: the change in weight for taller dogs grows faster than for shorter dogs. Indeed, that makes sense if we think of a dog as basically shaped like a box: for similarly proportioned boxes, the weight of the contents of the box has a cubic relationship to its length. When we look at a scatter plot of the logarithm of weight against height, we confirm a rough linear association (the right-hand plot in Figure 9.7). Additionally, the association between height and the logarithm of weight looks reasonably strong as the points tend to fall close to the curve (the correlation between these two variables is 0.9).

fig, axes = plt.subplots(ncols = 2, figsize=(10, 4))

sns.scatterplot(data=dogs, x='height', y='weight', ax=axes[0])
sns.rugplot(data=dogs, x='height', y='weight', height=0.05, ax=axes[0])

logged = dogs.assign(log_weight=np.log(dogs['weight']))

sns.scatterplot(data=logged, x='height', y='log_weight', ax=axes[1])
sns.rugplot(data=logged, x='height', y='log_weight', height=0.05, ax=axes[1])

plt.tight_layout()
../../_images/eda_relationships_9_0.png

Fig. 9.7 These two scatter plots examine the relationship between weight and height for dog breeds. In the plot on the right, we’ve plotted the log of the weights against heights. When we do so, we see that the points roughly follow a line, which implies a log-linear relationship between weight and height.

Two Univariate Plots ≠ One Bivariate Plot. The histograms for two quantitative features do not contain enough information to create their scatter plot so we must exercise caution when we read a pair of histograms. That is, the histograms do not show how these two quantities vary together. We need to use one of the plots listed in the appropriate row of Table 9.4 (scatter plot, smooth curve, contour plot, heat map, quantile-quantile plot) to get a sense of the relationship between two quantitative features.

Transformations. When two features do not have a linear relationship, we look for transformations of one feature or the other or both that straighten the relationship. It is easier for humans to discern a linear relationship than to differentiate between curvilinear relationships. Figure 9.7 gives a simple example of this. The left scatter plot reveals a curvilinear relationship between height and weight. The scatter plot on the right that plots the logarithm of weight against height reveals a linear relationship, this plot of the transformed weight indicates that the relationship is what we call log-linear.

The log function is considered the jackknife of transformations. It can straighten simple polynomials as well as exponentials. As another, albeit artificial, example, the leftmost plot in Figure 9.8 reveals a curvilinear relationship between x and y. The middle plot show a different curvilinear relationship between log(y) and x; this plot also appears nonlinear. A further log transformation, at the far right in the figure, displays a plot of log(y) against log(x). This plot confirms that the data have a log-log relationship because the transformed points fall along a line.

../../_images/example-transforms.png

Fig. 9.8 These scatter plots show how log transforms can “straighten” a curvilinear relationship between two variables.

9.3.2. One Qualitative and One Quantitative Variable

When we examine the relationship between a quantitative and a qualitative feature, we often use the qualitative feature to divide the data into groups and compare the distribution of the quantitative feature across these groups. For example, we can compare the distribution of breed height for small, medium and large dogs (Figure 9.9). We see that the distribution of height for the small and medium breeds both appear bimodal, and the left mode is the larger mode in each group. Also, the small and medium groups have a larger spread in height than the large group of breeds.

plt.figure(figsize=(9, 4))
sns.kdeplot(data=dogs, x='height', hue='size');
../../_images/eda_relationships_17_0.png

Fig. 9.9 Kernel density estimator plot for dog heights split by sizes.

Side-by-side box plots offer a similar comparison of distributions across groups. Boxplots offer a simpler approach that can give a crude understanding of distributions. Figure 9.10 shows three boxplots for height, one for each size of dog. These plots make it clear that the size categorization is based on height because there is no overlap in height ranges for the groups. (This was not evident in the density curves due to the smoothing). What we don’t see in these box plots is the bimodality in the small and medium groups, but we can still see that the large dogs have a more narrow spread in height compared to the other two groups.

fig, (ax1, ax2) = plt.subplots(ncols = 2, figsize=(10, 4))

sns.boxplot(data=dogs, x='size', y='height', ax=ax1)
sns.violinplot(data=dogs, x='size', y='height', ax=ax2)

plt.tight_layout()
../../_images/eda_relationships_21_0.png

Fig. 9.10 Box plots and violin plots for heights split by size.

Also in Figure 9.10 is a violin plot of height for each size group. The violin plots sketch density curves along an axis for each dog. A flipped version of the density curve is added to create a symmetric “violin”. The violin plot aims to bridge the gap between overlaid density curves and side-by-side box plots.

Box plots are constructed from a few summary statistics and so they cannot reveal as much structure in a distribution as a histogram or density curve. They primarily reveal symmetry and skew, long/short tails, and unusually large/small values. Figure AE is a visual explanation of the parts of a box plot; it is constructed from the 25th, 50th, and 75th percentiles. Asymmetry is evident from a median that is not in the middle of the box, tail lengths are shown by the whiskers, and outliers by the points that appear beyond the whiskers.

Figure AE. TBD

9.3.3. Two Qualitative Features

With two qualitative features, we often examine the distribution of one feature across subgroups defined by the second. In effect, we hold one feature constant and plot the distribution of the second. For example, we might consider the relationship between the suitability of a breed for children and the size of the breed. Figure 9.11 offers one view into this relationship. The three vertical bands represent the three child-friendly categories. The width of the band is proportional to the number of breeds in that category. Within each band we further subdivide it according to the size of the breed; these are the color blocks. We see that among the low-suitability breeds, roughly 3/4 of the dog breeds are small.

../../_images/kids-vs-size-band-plot.png

Fig. 9.11 A treemap showing the proportions of dog sizes for each level of children suitability.

The heights of the color blocks represent the percentage of dogs of the corresponding size in the suitability category. There are three sets of proportions (one each for low, medium, and high suitability) the the proportions in each set sum to 1 or 100%. These are also displayed numerically in following cell.

kids = (dogs['children'].replace({ 1.0: 'high', 2.0: 'medium', 3.0: 'low' }))

def proportions(series):
    return series / sum(series)

counts = (dogs.assign(kids=kids)
 .groupby(['kids', 'size'])
 .size()
 .rename('count')
)

prop_table = (counts
 .unstack(level=1)
 .reindex(['high', 'medium', 'low'])
 .apply(proportions, axis=1)
)

prop_table
size large medium small
kids
high 0.37 0.36 0.27
medium 0.29 0.34 0.37
low 0.10 0.20 0.70

We can also visualize these proportions using lineplots as shown in Figure 9.12. There is one line (set of connected dots) for each suitability level. The connected dots give the breakdown of size within suitability. We see that many small breeds have low suitability for kids.

props = prop_table.reset_index().melt(id_vars=['kids'], value_name='prop')
sns.pointplot(data=props,
              x='size', y='prop', hue='kids',
              order=['small', 'medium', 'large'],
              hue_order=['low', 'medium', 'high'])
plt.ylabel('proportion within size');
../../_images/eda_relationships_32_0.png

Fig. 9.12 This point plot splits the dogs by sizes and shows the proportion of dogs with each children suitability within each size.

We can also present these proportions as a collection of side-by-side bar plots:

g = sns.catplot(data=props,
                x='size', y='prop',
                order=['small', 'medium', 'large'],
                col='kids',
                col_order=['low', 'medium', 'high'],
                kind='bar')

g.set_ylabels('proportion within size');
../../_images/eda_relationships_36_0.png

9.3.4. In the Next Section

We’ve covered exploratory visualizations that incorporate one or two features. In the next section, we’ll discuss visualizations that incorporate more than two features at once.