22.1. Dimensions of a Data Table
You have likely used the word “dimension” in a geometric sense: a square is two-dimensional, while a cube is three-dimensional. We can apply the same geometric lens to data. Consider the two-column table below of height and weight measurements for 251 men.
251 rows × 2 columns
If we view this data in a row-oriented way, we can think of each row of data as a vector with two elements. For example, the first row of df corresponds to the vector \( [ 67.75, 154.25 ] \). If we draw a point at the endpoint of each row vector on the coordinate plane, we produce the familiar scatter plot of weight on height:
What about a table with two columns: one for height in inches and one for height in feet?
251 rows × 2 columns
Each row vector of this data table contains two elements as before. Yet a scatter plot reveals that all the data lie along one dimension, not two:
Intuitively, we can use a one-dimensional line to represent the data rather than the two-dimensional plane of the scatter plot. The intrinsic dimension, or simply the dimension, of the heights table is 1, not 2.
22.1.1. The Dimension of a Data Table is the Rank
Observe that the feet column of the heights table is the inches column converted to feet by dividing each value by 12:
251 rows × 2 columns
The feet column is therefore a linear combination of the inches column. We can view the feet column as contributing no information beyond what the inches column already provides. If the heights table lost the feet column, we could reconstruct it exactly using the inches column alone. A linearly dependent column does not “add” another dimension to a data table.
In other words, the dimension of a data table is the maximum number of linearly independent columns in the table, or equivalently the matrix rank of the data.
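As a minimal sketch of this idea, we can build a small table with an inches column and a feet column and ask NumPy for its rank. The height values below are made up for illustration and are not the 251 measurements from the table above.

```python
import numpy as np

# Hypothetical heights in inches (illustrative values only).
inches = np.array([67.75, 72.25, 66.25, 70.00, 71.25])
feet = inches / 12  # a scalar multiple of the inches column

# Stack the two columns into a 5 x 2 data matrix.
heights = np.column_stack([inches, feet])

# matrix_rank counts the singular values that exceed a small tolerance.
rank = np.linalg.matrix_rank(heights)
print(heights.shape, rank)
```

Although the matrix has two columns, the computed rank is 1, which matches the intuition that the feet column adds no new dimension.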
22.1.2. Intuition for Reducing Dimensions
In most real-world scenarios, though, data seldom contain columns that are exact linear combinations of other columns, because of measurement error. Consider these measurements of overall density versus percent fat for individuals:
The scatter plot shows that density is not a linear combination of percent fat, yet density and percent fat appear “almost” linearly dependent. Intuitively, we can closely approximate density values using a linear combination of the percent fat values.
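To make “almost” linearly dependent concrete, here is a rough sketch on synthetic data. The slope, intercept, and noise level are assumptions chosen for illustration, not values estimated from the real measurements. We standardize both columns so that neither dominates because of its units, then compare the rank with the share of variance along each singular direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two measurements (not the real data).
percent_fat = rng.uniform(5, 35, size=200)
noise = rng.normal(0, 0.004, size=200)
density = 1.10 - 0.002 * percent_fat + noise  # made-up linear trend plus noise

table = np.column_stack([percent_fat, density])

# Standardize each column: subtract its mean and divide by its standard deviation.
standardized = (table - table.mean(axis=0)) / table.std(axis=0)

# The noise keeps the columns linearly independent, so the rank is still 2.
rank = np.linalg.matrix_rank(standardized)

# Squared singular values are proportional to the variance along each direction.
singular_values = np.linalg.svd(standardized, compute_uv=False)
share = singular_values**2 / np.sum(singular_values**2)

print(rank)
print(share.round(3))
```

In this sketch the rank is 2, yet nearly all of the variance falls along the first direction. Strictly speaking the table is two-dimensional, but a single dimension approximates it closely.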
We use dimensionality reduction techniques to perform these approximations automatically rather than manually examining individual pairs of variables. Dimensionality reduction techniques are especially useful for exploratory data analysis on high-dimensional data. Consider the following data table of votes in the US House of Representatives in September 2019:
import pandas as pd

# Load the vote table, using the member column as the row index.
df = pd.read_csv('vote_pivot.csv', index_col='member')
df
441 rows × 41 columns
In this dataset, every row corresponds to a single congressperson, and every column contains that congressperson’s vote on a single bill: a 1 means the person voted for the bill and a 0 means the person voted against the bill.
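This dataset contains 41 numeric dimensions; that is, the rank of the vote matrix is 41.
In other words, the 41 vote columns are linearly independent: no column can be written exactly as a linear combination of the others. In particular, no two bills received exactly the same votes from every member. Note that this is a statement about the columns (bills). It does not tell us whether two congresspeople voted identically, since that concerns the rows.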
But we don’t expect the votes to be completely random, either. In 2019 the US had two dominant political parties, the Democratic and the Republican parties, and we expect that in most cases a congressperson will vote similarly to other members of their political party.
That is, even though the matrix of votes contains 41 dimensions, we might preserve useful patterns in the data even after performing dimensionality reduction to decrease the number of data table dimensions.
The rest of this chapter introduces a useful technique for dimensionality reduction called principal component analysis, or PCA for short. In the figure below, we use PCA to approximate the vote data with just 2 dimensions, then create a scatter plot using the 2-dimensional approximation.
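import numpy as np
import seaborn as sns

# Center each column, then take the singular value decomposition.
df_c = df - np.mean(df, axis=0)
u, s, vt = np.linalg.svd(df_c, full_matrices=False)
# Scale the left singular vectors by the singular values to get the principal components.
pcs = u * s
sns.scatterplot(x=pcs[:, 0], y=pcs[:, 1]);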
In the plot above, every point corresponds to one congressperson. We can color each point according to the political party of the congressperson:
In this example, PCA successfully finds 2 dimensions that summarize the original 41 dimensions. These two dimensions capture the distinct voting patterns of political parties within the House of Representatives.
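Since the vote file is not reproduced here, the following sketch repeats the same centering and SVD steps on synthetic votes. Everything about the data is an assumption made for illustration: two hypothetical parties of 100 members each, 41 bills, and a 90% chance that a member votes with their party. The point is only to show how a strong party pattern shows up in the first principal component.

```python
import numpy as np

rng = np.random.default_rng(42)

n_members, n_bills = 200, 41
party = np.repeat([0, 1], n_members // 2)  # hypothetical party labels

# For each bill, party 0 votes yes with probability 0.9 or 0.1;
# party 1 uses the opposite probability on the same bill.
p_yes_party0 = np.where(rng.random(n_bills) < 0.5, 0.9, 0.1)
prob = np.where(party[:, None] == 0, p_yes_party0, 1 - p_yes_party0)

# Simulate 0/1 votes: one row per member, one column per bill.
votes = (rng.random((n_members, n_bills)) < prob).astype(float)

# Same steps as in the text: center the columns, then take the SVD.
centered = votes - votes.mean(axis=0)
u, s, vt = np.linalg.svd(centered, full_matrices=False)
pcs = u * s

# Compare the first principal component across the two parties.
pc1 = pcs[:, 0]
print(pc1[party == 0].mean(), pc1[party == 1].mean())
```

The sign of a principal component is arbitrary, so which party lands on the positive side can vary. What matters is that the two groups fall on opposite sides of zero along the first component, much like the two clusters in the colored scatter plot.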
We use dimensionality reduction techniques to summarize data tables with many dimensions, which allows us to explore high-dimensional data. One useful technique for dimensionality reduction is principal component analysis, which we introduce in the rest of this chapter.