19.1. Dimensions of a Data Table

You have likely used the word “dimension” in a geometric sense: a square is two-dimensional, while a cube is three-dimensional. We can apply this geometric lens to data as well. Consider the two-column table below of height and weight measurements for 251 men.

df
      height  weight
0      67.75  154.25
1      72.25  173.25
2      66.25  154.00
...      ...     ...
249    66.00  186.75
250    70.50  190.75
251    70.00  207.50

251 rows × 2 columns

If we view this data in a row-oriented way, we can think of each row of data as a vector with two elements. For example, the first row of df corresponds to the vector \( [ 67.75, 154.25 ] \). If we draw a point at the endpoint of each row vector on the coordinate plane, we produce the familiar scatter plot of weight on height:

[Figure: scatter plot of weight versus height, one point per row of df]
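For reference, a plot like the one above can be drawn with a single seaborn call; a minimal sketch, assuming df is the height and weight table shown earlier:

import seaborn as sns

# One point per row of df: x is the height, y is the weight
sns.scatterplot(x=df['height'], y=df['weight']);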

What about a table with two columns: one for height in inches and one for height in feet?

heights
      inches  feet
0      67.75  5.65
1      72.25  6.02
2      66.25  5.52
...      ...   ...
249    66.00  5.50
250    70.50  5.88
251    70.00  5.83

251 rows × 2 columns

Each row vector of this data table contains two elements as before. Yet a scatter plot reveals that all the data lie along one dimension, not two:

[Figure: scatter plot of the heights data; all of the points fall along a single line]

Intuitively, we can use a one-dimensional line to represent the data rather than the two-dimensional plane of the scatter plot. The intrinsic dimension, or simply the dimension, of the heights table is 1, not 2.

19.1.1. The Dimension of a Data Table is the Rank

Observe that the feet column of the heights table is the inches column converted to feet by dividing each value by 12:

heights
      inches  feet
0      67.75  5.65
1      72.25  6.02
2      66.25  5.52
...      ...   ...
249    66.00  5.50
250    70.50  5.88
251    70.00  5.83

251 rows × 2 columns

The feet column is a linear combination of the inches column. We might view the feet column as adding no information beyond what the inches column already provides: if the heights table lost the feet column, we could reconstruct it exactly from the inches column alone. A linearly dependent column does not “add” another dimension to a data table.

In other words, the dimension of a data table is the number of linearly independent columns in the table, or equivalently, the matrix rank of the data.
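We can check this with NumPy. Below is a minimal sketch that builds a tiny, made-up version of the heights table (not the full data) and computes its rank:

import numpy as np
import pandas as pd

# A miniature, hypothetical heights table where feet is exactly inches / 12
inches = pd.Series([67.75, 72.25, 66.25])
toy_heights = pd.DataFrame({'inches': inches, 'feet': inches / 12})

# The feet column is a scalar multiple of the inches column,
# so the table has only one linearly independent column
np.linalg.matrix_rank(toy_heights)  # returns 1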

19.1.2. Intuition for Reducing Dimensions

In most real-world scenarios, though, measurement error means that data seldom contain columns that are exact linear combinations of other columns. Consider these measurements of overall density versus percent fat for individuals:

[Figure: scatter plot of density versus percent fat]

The scatter plot shows that density is not a linear combination of percent fat, yet density and percent fat appear “almost” linearly dependent. Intuitively, we can closely approximate density values using a linear combination of the percent fat values.
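To make “almost linearly dependent” concrete, here is a small sketch with made-up numbers (not the actual body measurements) in which one column is a linear function of the other plus a little noise:

import numpy as np

# Hypothetical data: density is a linear function of percent fat plus small measurement noise
rng = np.random.default_rng(42)
pct_fat = rng.uniform(5, 40, size=100)
density = 1.10 - 0.002 * pct_fat + rng.normal(0, 1e-4, size=100)

# Center both columns (PCA performs the same centering later in this chapter)
X = np.column_stack([pct_fat, density])
X_c = X - X.mean(axis=0)

np.linalg.matrix_rank(X_c)            # 2: the noise makes the columns linearly independent
np.linalg.svd(X_c, compute_uv=False)  # one large singular value and one tiny one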

We use dimensionality reduction techniques to perform these approximations automatically rather than manually examining individual pairs of data variables. Dimensionality reduction techniques are especially useful for exploratory data analysis on high-dimensional data. Consider the following data table of US House of Representatives votes in September 2019:

import numpy as np
import pandas as pd
df = pd.read_csv('vote_pivot.csv', index_col='member')
df
         515  516  517  518  ...  552  553  554  555
member
A000055    1    0    0    0  ...    0    0    1    0
A000367    0    0    0    0  ...    1    1    0    1
A000369    1    1    0    0  ...    0    0    1    0
...      ...  ...  ...  ...  ...  ...  ...  ...  ...
Y000062    1    1    1    1  ...    1    1    1    1
Y000065    0    0    0    0  ...    0    0    1    0
Z000017    1    1    0    0  ...    0    0    1    0

441 rows × 41 columns

In this dataset, every row corresponds to a single congressperson and every column corresponds to a single bill: a 1 means the person voted for the bill, and a 0 means the person voted against it.

This dataset contains 41 numeric dimensions:

np.linalg.matrix_rank(df)
41

In other words, none of the vote columns are linearly dependent: no bill’s column of votes can be written as a linear combination of the votes on the other bills. In particular, no two bills received exactly the same set of votes from the members.

But we don’t expect the votes to be completely random, either. In 2019 the US had two dominant political parties, the Democratic and the Republican parties, and we expect that in most cases a congressperson will vote similarly to other members of their political party.

That is, even though the matrix of votes has 41 dimensions, we expect to preserve useful patterns in the data even after reducing the number of dimensions.

The rest of this chapter introduces a useful technique for dimensionality reduction called principal component analysis, or PCA for short. In the figure below, we use PCA to approximate the vote data with just 2 dimensions, then create a scatter plot using the 2-dimensional approximation.

import seaborn as sns

# Center each column of the vote matrix, then take its singular value decomposition (SVD)
df_c = df - np.mean(df, axis=0)
u, s, vt = np.linalg.svd(df_c, full_matrices=False)
# The principal components are the columns of u scaled by the singular values
pcs = u * s
sns.scatterplot(x=pcs[:, 0], y=pcs[:, 1]);
[Figure: scatter plot of the 2-dimensional PCA approximation of the vote data]

In the plot above, every point corresponds to one congressperson. We can color each point according to the political party of the congressperson:
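With seaborn, one way to add the colors is through the hue argument. A sketch, assuming a hypothetical Series named party that maps each member ID in df.index to a party label:

# party is a hypothetical Series of party labels ('Democrat' / 'Republican') indexed by member ID
sns.scatterplot(x=pcs[:, 0], y=pcs[:, 1], hue=party.loc[df.index]);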

[Figure: the same scatter plot with points colored by political party]

In this example, PCA successfully finds 2 dimensions that summarize the original 41: these two dimensions capture the distinct voting patterns of the political parties within the House of Representatives.
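To quantify how well 2 dimensions summarize the 41, one common check uses the singular values from the SVD above: for centered data, the squared singular values are proportional to the amount of variance each principal component captures. A quick sketch using the s computed earlier:

# Fraction of the total variance captured by each principal component
explained = s**2 / np.sum(s**2)
explained[:2].sum()  # fraction of the variance captured by the first two components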

19.1.3. Summary

We use dimensionality reduction techniques to summarize data tables with many dimensions, which lets us explore high-dimensional data. One useful technique for dimensionality reduction is principal component analysis, which we introduce in the rest of this chapter.