11.2. Smoothing and Aggregating Data
When we have lots of data, we often don’t want to plot all of the individual data points. The scatter plot below shows data from the Cherry Blossom Run, an annual 10-mile race that takes place in April in Washington D.C. when the cherry trees are in bloom. These data were scraped from the Run’s Web pages and include official times and other information for all registered male runners from 1999 to 2012. We’ve put the runner’s age on the x-axis and race time on the y-axis.
fig = go.Figure(
    data=go.Scattergl(x=runners['age'], y=runners['time'], mode='markers'),
    layout=dict(width=350, height=250),
)
fig.update_xaxes(title_text='age')
fig.update_yaxes(title_text='time')
fig
This scatter plot contains over 70,000 points. With so many points, many of them overlap with each other. This is a common problem called over-plotting. In this case, over-plotting prevents us from seeing how time and age are related. About the only thing that we can see in this plot is a group of very young runners, which points to issues in the data. To address over-plotting, we use smoothing techniques that aggregate data before plotting.
11.2.1. Smoothing Techniques to Uncover Shape
The histogram is a familiar type of plot that uses smoothing. A histogram aggregates data values by putting points into bins and plotting one bar for each bin. Smoothing here means that we cannot differentiate the locations of individual points within a bin; the points are spread smoothly across it. For this reason, the area of a bin corresponds to the percentage (or count or proportion) of points in the bin. (Often the bins are equal in width, and we take a shortcut and label the height of a bin as the proportion.)
The histogram below plots the distribution of lifespans for dog breeds. Above the histogram is a rug plot that draws a single line for every data value. We can see in the tallest bin that even a small amount of data can cause over-plotting in the rug plot. By smoothing out the points in the rug plot, the histogram reveals the general shape of the distribution. In this case, we see that many breeds have a longevity of about 12 years. For more on how to read and interpret histograms, see {numref}`Section %s <sec:eda_distributions>`.
fig = px.histogram(dogs, x="longevity", marginal="rug", nbins=20,
                   labels={"longevity": "years"},
                   histnorm='percent', width=350, height=250)
fig.data[0].marker.line = dict(color='black', width=1)
fig
Another common smoothing technique is kernel density estimation (KDE). A KDE plot shows the distribution using a smooth curve rather than bars. In the plot below, we show the same histogram of dog longevities with a KDE curve overlaid on top. The KDE curve shows a distribution with a shape similar to the histogram's.
from scipy.stats import gaussian_kde

fig = px.histogram(dogs, x="longevity", marginal="rug",
                   histnorm='probability density', nbins=20,
                   labels={"longevity": "years"},
                   width=450, height=250)
fig.update_traces(marker_color='rgba(76,114,176,0.3)',
                  selector=dict(type='histogram'))
fig.data[0].marker.line = dict(color='black', width=1)

bandwidth = 0.2
xs = np.linspace(min(dogs['longevity']), max(dogs['longevity']), 100)
ys = gaussian_kde(dogs['longevity'], bandwidth)(xs)
curve = go.Scatter(x=xs, y=ys)
fig.add_trace(curve)
fig.update_traces(marker_color='rgb(76,114,176)',
                  selector=dict(type='scatter'))
fig.update_layout(showlegend=False)
It might have come as a surprise to think of a histogram as a smoothing method. The more computationally intensive smoother, the kernel density estimator, is useful for continuous numeric data. Both the KDE and the histogram aim to help you see important features in the distribution of values. What about plots for two variables, like scatter plots? When we have lots of data, there are techniques for smoothing pairs of features that are analogous to histograms and kernel density estimates. This is the topic of the next section.
11.2.2. Smoothing Techniques to Uncover Relationships and Trends
We can find high-density regions of a scatter plot by binning data, like in a histogram. The plot below remakes the earlier scatter plot of the Cherry Blossom race times against age. (Note that we have dropped the dubious young runners from this plot.) This plot uses rectangular bins to aggregate points and then shades the rectangles based on how many points fall in them.
runnersOver17 = runners[runners["age"] > 17]
plt.figure(figsize=(5, 3))
sns.histplot(runnersOver17, x='age', y='time', binwidth=[1, 250]);
Notice the high-density region in the 25 to 40 age group, signified by the dark blue region in the plot. The plot shows us that many of the runners in this age range complete the race in around 5000 seconds (about 80 minutes). We can also see upward curvature in the medium blue region for the 40 to 60 age group, indicating that these runners are slower than those in the 25 to 40 age group but that there are quite a few of them.
Kernel density estimation also works in two dimensions. When we use KDE in two dimensions, we plot the contours of the resulting two-dimensional curve. You read this plot like a topographical map.
# Takes a while to run!
plt.figure(figsize=(5, 3))
sns.kdeplot(data=runnersOver17, x='age', y='time');
The two-dimensional KDE gives similar insights as the shaded bins. In this example, we see a high concentration of runners in the 25 to 40 age group and these runners have times that appear to be roughly 5000 seconds.
Using smoothing techniques lets us get a better picture of a large amount of data. Smoothing can reveal where data values are highly concentrated and the shape of these high-concentration regions, which can otherwise be impossible to see.
Another smoothing approach, which can be even more informative, smooths the y-values for points with similar x-values. To explain, let’s group together runners with similar ages; we use five-year increments: 20-25, 25-30, 30-35, and so on. Then, for each five-year bin of runners, we average their race times, plot the average time for each group, and connect the points to form a “curve”. Such a curve appears below.
times = (
    runnersOver17.assign(age5=runnersOver17['age'] // 5 * 5)
    .groupby('age5')
    ['time']
    .mean()
    .reset_index()
)
px.line(times, x='age5', y='time', width=350, height=250)
This plot shows once again that runners in the 25 to 40 year age range have typical run times of about 5400 seconds. It also shows that older runners took longer to complete the race on average (not really a surprise, but it wasn’t nearly as evident in the earlier plots). The dip in times for runners under 20 and the flattening of the curve at 80 may simply be the result of fewer and fitter runners in these groups. There are other smoothing techniques for viewing trends. Some use a kernel smoothing approach, similar to the KDE (a rough sketch appears below), but we don’t go into the details here.
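To give a flavor of the kernel approach to trend smoothing, here is a minimal sketch of a Nadaraya-Watson style smoother applied to the runners data: for each age on a grid, it averages race times, weighting nearby runners with a Gaussian kernel. The kernel_smooth helper, the age grid, and the bandwidth of 3 years are illustrative assumptions, not the method used elsewhere in this chapter.
# Illustrative sketch (not from this chapter): a Gaussian-kernel smoother for the trend
def kernel_smooth(x, y, grid, bandwidth):
    # Weight each observation by its kernel distance from the grid point,
    # then take the weighted average of the y-values
    weights = np.exp(-0.5 * ((grid[:, None] - x[None, :]) / bandwidth) ** 2)
    return (weights * y).sum(axis=1) / weights.sum(axis=1)

ages = np.arange(18, 81)
smoothed = kernel_smooth(runnersOver17['age'].to_numpy(),
                         runnersOver17['time'].to_numpy(),
                         ages, bandwidth=3)
px.line(x=ages, y=smoothed, labels={'x': 'age', 'y': 'time'},
        width=350, height=250)
Compared to the five-year binned averages, this kind of smoother produces a curve that varies gradually with age, at the cost of having to choose a bandwidth.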
Whether they use binning or kernel smoothing, these methods rely on a tuning parameter that specifies the width of the bin or the spread of the kernel, and we often need to set this parameter when making a histogram, KDE, or smooth curve. This is the topic of the next section.
11.2.3. Smoothing Techniques Need Tuning
Now that we’ve seen how smoothing is useful for plotting, we turn to the issue of tuning. For histograms, the width of the bins (or, equivalently, for equal-width bins, the number of bins) affects the look of the histogram. For example, the left histogram of longevity below has a few wide bins, and the right histogram has many narrow bins. In both cases, it’s hard to see the shape of the distribution. With a few wide bins, we have over-smoothed the distribution, which makes it impossible to discern modes and tails. On the other hand, too many narrow bins gives a plot that’s little better than a rug plot. KDE plots have a parameter called the “bandwidth” that works similarly to the binwidth of a histogram.
f1 = px.histogram(dogs, x="longevity", nbins=3, histnorm='probability density',
                  width=350, height=250)
f1.update_traces(marker_color='rgba(76,114,176,0.3)',
                 selector=dict(type='histogram'))
f1.data[0].marker.line = dict(color='black', width=1)

# bandwidth = 0.5
xs = np.linspace(min(dogs['longevity']), max(dogs['longevity']), 100)
ys = gaussian_kde(dogs['longevity'])(xs)
curve = go.Scatter(x=xs, y=ys)
f1.add_trace(curve)

f2 = px.histogram(dogs, x="longevity", nbins=100, histnorm='probability density',
                  width=350, height=250)
f2.update_traces(marker_color='rgba(76,114,176,0.3)',
                 selector=dict(type='histogram'))
f2.data[0].marker.line = dict(color='black', width=1)

left_right(f1, f2, height=250)
Most histogram and KDE software automatically chooses the binwidth for the histogram and the bandwidth for the kernel. However, these parameters often need a bit of fiddling to create the most useful plot. When you create visualizations that rely on tuning parameters, it’s important to try a few different values before settling on one.
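As a sketch of this kind of fiddling, we can overlay the dog-longevity KDE curve for a few bandwidth values and compare; the particular values 0.1, 0.3, and 1.0 below are arbitrary choices for illustration, not recommendations.
# Sketch: compare KDE curves for a few bandwidths (values chosen only for illustration)
xs = np.linspace(min(dogs['longevity']), max(dogs['longevity']), 100)
fig = go.Figure(layout=dict(width=350, height=250))
for bw in [0.1, 0.3, 1.0]:
    ys = gaussian_kde(dogs['longevity'], bw_method=bw)(xs)
    fig.add_trace(go.Scatter(x=xs, y=ys, name=f'bandwidth {bw}'))
fig
A small bandwidth produces a wiggly curve that chases individual points, while a large one washes out the modes, mirroring the narrow- and wide-bin histograms above.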
A different approach to data reduction is to examine the quantiles. This is the topic of the next section.
11.2.4. Reducing Distributions to Quantiles
We found in {numref}`Chapter %s <ch:eda>` that while box plots aren’t as informative as histograms, they can be useful when comparing the distributions of many groups at once. A box plot reduces the data to a few essential features based on the quartiles. More generally, quantiles (the lower quartile, median, and upper quartile are the 25th, 50th, and 75th percentiles) can provide a useful reduction of the data when comparing distributions.
When two distributions are roughly similar, it can be hard to see the differences using histograms. For instance, the histograms below show the price distributions for two- and four-bedroom houses in the SF housing data. The distributions look roughly similar in shape. But a plot of their quantiles can handily compare the distributions’ center, spread, and tails.
fig = px.histogram(sfh.query('br in [2, 4]'),
                   x='price', log_x=True, facet_col='br',
                   width=700, height=250)
margin(fig, t=30)
We can compare quantiles with a quantile-quantile plot, called a q-q plot for short. To make this plot, we first compute percentiles (also called quantiles) for both distributions. Then we plot the matching percentiles in a scatter plot. We usually also show the reference line \(y = x\).
br2 = sfh.query('br == 2')
br4 = sfh.query('br == 4')
percs = np.arange(1, 100, 1)
perc2 = np.percentile(br2['price'], percs, interpolation='lower')
perc4 = np.percentile(br4['price'], percs, interpolation='lower')
perc_sfh = pd.DataFrame({'percentile': percs, 'br2': perc2, 'br4': perc4})
perc_sfh
|     | percentile | br2      | br4      |
|-----|------------|----------|----------|
| 0   | 1          | 1.50e+05 | 2.05e+05 |
| 1   | 2          | 1.82e+05 | 2.50e+05 |
| 2   | 3          | 2.03e+05 | 2.75e+05 |
| ... | ...        | ...      | ...      |
| 96  | 97         | 1.04e+06 | 1.75e+06 |
| 97  | 98         | 1.20e+06 | 1.95e+06 |
| 98  | 99         | 1.44e+06 | 2.34e+06 |

99 rows × 3 columns
fig = px.scatter(perc_sfh, x='br2', y='br4', log_x=True, log_y=True,
                 width=500, height=350,
                 labels={'br2': 'Price of Two Bedroom Houses',
                         'br4': 'Price of Four Bedroom Houses'})
fig.add_trace(
    go.Scatter(x=[1e5, 2e6], y=[1e5, 2e6],
               mode='lines', line=dict(dash='dash'),
               name='Reference (y = x)')
)
fig
When the quantile points fall along a line, the variables have similarly shaped distributions. Lines parallel to the reference indicate a difference in center; lines with slopes other than 1 indicate a difference in spread; and curvature indicates a difference in shape. From the q-q plot above, we see that the distribution of price for four-bedroom houses is similar in shape to the two-bedroom distribution, except for a shift of about \(\$100K\) in price and a slightly longer right tail (indicated by the upward bend for large values). Reading a q-q plot takes practice, but once you get the hang of it, it can be a handy way to compare distributions. Notice that the housing data have over 100,000 observations, and the q-q plot has reduced the data to 99 percentiles. However, we don’t always want to use smoothers. This is the topic of the next section.
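To see why a parallel line signals a shift in center, consider a small simulated example (not part of the housing data): if one sample is just the other plus a constant, every percentile is larger by that constant, so the q-q points fall on a line parallel to the reference.
# Simulated sketch: a pure shift in center puts the q-q points on a line parallel to y = x
rng = np.random.default_rng(42)
a = rng.normal(loc=0, scale=1, size=10_000)
b = rng.normal(loc=2, scale=1, size=10_000)  # same shape and spread, center shifted by 2
percs = np.arange(1, 100)
qq = pd.DataFrame({'a': np.percentile(a, percs), 'b': np.percentile(b, percs)})
px.scatter(qq, x='a', y='b', width=350, height=250)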
11.2.5. When Not to Smooth
Smoothing and aggregating can help us see important features and relationships, but when we have only a handful of observations, smoothing techniques can give misleading representations of the data. With just a few observations, we prefer rug plots over histograms, box plots, and density curves, and we use scatter plots rather than smooth curves and density contours. This may seem obvious, but even a large dataset can quickly dwindle when we split the data into groups for comparison. This phenomenon is an example of the “curse of dimensionality”.
One of the most common misuses of smoothing happens with box plots. As an example, below is a collection of seven box plots of longevity, one for each dog-breed group. Some of these box plots have as few as two or three observations.
px.box(dogs, x='group', y='longevity', width=500, height=250)
The strip plot below is a preferable visualization. We can still compare the groups, but we can also see the exact values in each group. We can see that there are only three breeds in the non-sporting group, so the impression of a skewed distribution given by the corresponding box plot above reads too much into the appearance of the box.
px.strip(dogs, x="group", y="longevity")
This section introduced the problem of over-plotting, where points overlap because we have a large dataset. To address this issue, we introduced smoothing techniques that aggregate data. We saw two common examples of smoothing, binning and kernel smoothing, and applied them in one- and two-dimensional settings. In one dimension, these are histograms and kernel density curves, respectively, and both help us see the shape of a distribution. In two dimensions, binning shades rectangular regions according to density (the histogram approach), and kernel smoothing draws topographical contours (the kernel approach). Additionally, we discussed how to smooth y-values while keeping x-values fixed in order to visualize trends in the data, and how to compare distributions with quantile-quantile plots. We addressed the need to tune the amount of smoothing to get more informative histograms and density curves, and we cautioned against smoothing when we have too little data.
There are many other ways to reduce over-plotting in scatter plots. For instance, we can make the dots partially transparent so overlapping points appear darker. If many observations have the same values (e.g., when measurements are rounded to the nearest inch), then we can add a small amount of random noise to the values to reduce the amount of over-plotting. This procedure is called “jittering”, and it is used in the strip plot above. Transparency and jittering are convenient for medium-sized data. However, they don’t work very well for large datasets since they still plot all the points in the data.
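As a rough sketch of these two fixes (not code used elsewhere in this chapter), we might redraw the runners’ scatter plot with partially transparent markers and a small uniform jitter added to age; the jitter of ±0.5 years and the opacity of 0.05 are arbitrary illustrative choices.
# Sketch: transparency plus jitter on the runners data (parameter values are arbitrary)
rng = np.random.default_rng(0)
jittered_age = runnersOver17['age'] + rng.uniform(-0.5, 0.5, size=len(runnersOver17))
fig = go.Figure(
    data=go.Scattergl(x=jittered_age, y=runnersOver17['time'],
                      mode='markers', marker=dict(opacity=0.05)),
    layout=dict(width=350, height=250),
)
fig.update_xaxes(title_text='age')
fig.update_yaxes(title_text='time')
fig
Even with an opacity this low, 70,000 points leave large regions fully saturated, which is why the binned and kernel-smoothed plots earlier in this section scale better.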
The quantile-quantile plot we introduced in this section offers one way to compare distributions; another is to use side-by-side box plots, and yet another is to overlay KDE curves in the same plot. We often aim to compare distributions and relationships across subsets (or groups) of data, and in the next section, we discuss several design principles that facilitate meaningful comparisons for a variety of plot types.