15.5. Example: Where is the Land of Opportunity?

The US is nicknamed “the land of opportunity” because people believe that in the US even those with few resources can end up wealthy—economists call this economic mobility. In one study, economist Raj Chetty and his colleagues did a large-scale data analysis on economic mobility in the US [Chetty et al., 2014]. His basic question was whether the US is a land of opportunity. To answer this question, Chetty had to find a way to measure economic mobility.

Chetty accessed government tax records to get the 2011–12 incomes of everyone born in the US between 1980 and 1982; he also obtained the incomes of their parents at the time they were born. In total, his dataset had about 10 million people. To measure economic mobility, Chetty took the subset of people born in a particular geographic region whose parents’ income was in the 25th income percentile (for that region), and he found their average income percentile in 2011. Chetty calls this statistic the absolute upward mobility (AUM). If a region’s AUM is 25, then people born into the 25th percentile generally stay in the 25th percentile—they remain where their parents were when they were born. High AUM values mean that the region has more upward mobility: people born in these regions generally wind up in a higher income bracket than their parents. For reference, the US average AUM is about 41 at the time of this writing. Chetty calculated the AUM for regions called commuting zones (CZs), which are roughly on the same scale as counties.
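
To make the definition concrete, here is a toy sketch of how AUM could be computed from individual-level records. The table, its column names, and its values are entirely hypothetical, since the real individual-level records are not public.

import pandas as pd

# Toy illustration of the AUM statistic; these column names and values are
# made up, as the real individual-level records are not public.
individuals = pd.DataFrame({
    'cz':            ['A', 'A', 'A', 'B', 'B'],
    'parent_pctile': [25, 25, 60, 25, 25],
    'child_pctile':  [30, 50, 70, 22, 28],
})

# AUM for each CZ: the average adult income percentile of people whose
# parents were at the 25th percentile.
aum = (individuals[individuals['parent_pctile'] == 25]
       .groupby('cz')['child_pctile']
       .mean())
print(aum)  # CZ A: 40.0 (upward mobility); CZ B: 25.0 (children stay where their parents were)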

While the granularity of the original data is the individual, the data Chetty analyzed is aggregated at the level of the commuting zone. Individual income records can’t be made publicly available for privacy reasons, but the AUM for a commuting zone can be. However, even at this granularity, not all commuting zones are included in the dataset, because with the 40 features available in the data it might be possible to identify individuals in small CZs. This limitation points to a potential coverage bias. Measurement bias is another potential problem. For example, children of parents who become extremely wealthy may not file income taxes.

We also point out the limitations of working with data that are regional averages rather than individual measurements. Features often appear more highly correlated at the aggregate level than at the individual level. This phenomenon is called ecological regression, and interpretations of findings from aggregated data need to be made with care.

Chetty had a hunch that some places in the US have higher economic mobility than others. His analysis found this to be true. He found that some cities, such as San Jose, Washington DC, and Seattle, have higher mobility than others, such as Charlotte, Milwaukee, and Atlanta. This means that, for example, people move from low to high income brackets in San Jose at a higher rate than in Charlotte. Chetty used linear models to find that social and economic factors like segregation, income inequality, and local school systems are related to economic mobility.

In this analysis, our outcome variable is the AUM for a commuting zone, since we are interested in finding features that correlate with AUM. There are many possible such features in Chetty’s data, but we first investigate one particular feature: the fraction of people in a CZ who have a 15-minute or shorter commute to work.

15.5.1. Explaining Upward Mobility Using Commute Time

We load the data into a DataFrame called cz_df.
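
If the commuting-zone data were stored as a CSV file, the loading step might look like the following sketch; the file name cz_data.csv is an assumption, since the actual data source isn’t shown here.

import pandas as pd

# Hypothetical file name; this section doesn't show where the
# commuting-zone data actually come from.
cz_df = pd.read_csv('cz_data.csv')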

cz_df
       aum  travel_lt15  gini  rel_tot  ...  taxrate  worked_14   foreign  region
0    38.39         0.33  0.47     0.51  ...     0.02   3.75e-03  1.18e-02   South
1    37.78         0.28  0.43     0.54  ...     0.02   4.78e-03  2.31e-02   South
2    39.05         0.36  0.44     0.67  ...     0.01   2.89e-03  7.08e-03   South
..     ...          ...   ...      ...  ...      ...        ...       ...     ...
702  44.12         0.42  0.42     0.29  ...     0.02   4.82e-03  9.85e-02    West
703  41.41         0.49  0.41     0.26  ...     0.01   4.39e-03  4.33e-02    West
704  43.20         0.24  0.42     0.32  ...     0.02   3.67e-03  1.13e-01    West

705 rows × 9 columns

Each row represents one commuting zone. The column aum has the AUM for the commuting zone. There are many columns in this data frame, but for now we focus on the fraction of people in a CZ who have a 15-minute or shorter commute, recorded in the column travel_lt15. We plot AUM against this fraction to see the relationship between the two variables.

import plotly.express as px

px.scatter(cz_df, x='travel_lt15', y='aum',
           width=350, height=250)
[Scatter plot of AUM against travel_lt15]

The scatter plot shows a rough linear association between AUM and the fraction of short commutes. Indeed, we find the correlation to be quite strong.

from sklearn.linear_model import LinearRegression

cz_df[['aum', 'travel_lt15']].corr()
              aum  travel_lt15
aum          1.00         0.68
travel_lt15  0.68         1.00

We fit a simple linear model, explaining AUM with commute time.

y = cz_df[['aum']]
X = cz_df[['travel_lt15']]

model1 = LinearRegression().fit(X, y)
print(f" Intercept: {model1.intercept_[0]:.1f}\n",
      f"Slope: {model1.coef_[0][0]:.1f}")
 Intercept: 31.3
 Slope: 28.7

Interestingly, an increase in upward mobility of a CZ is associated with an increase in the fraction of people with a short commute time.
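
To get a feel for the size of this association, we can plug a couple of values into the fitted line by hand; this is just a back-of-the-envelope illustration using the rounded coefficients printed above.

# Back-of-the-envelope predictions from the fitted line; model1.predict
# gives the same answers up to rounding.
intercept, slope = 31.3, 28.7
for frac in [0.3, 0.5]:
    print(f"travel_lt15 = {frac}: predicted AUM is about {intercept + slope * frac:.1f}")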

We can compare the SD of the AUM measurements to the SD of the residuals. This comparison gives us a sense of how useful the model is in explaining the variation in AUM.

import numpy as np

predicted1 = model1.predict(X)
errors1 = y - predicted1

print(f"SD(errors): {np.std(errors1['aum']):.2f}\n",
      f"SD(AUM): {np.std(cz_df['aum']):.2f}")
SD(errors): 4.14
 SD(AUM): 5.61

The SD of the errors about the regression line is about 25% smaller than the SD of AUM.

Additionally, we examine the residuals for lack of linear fit since it can be easier to see curvature in a residual plot than in a plot of the original data.
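
The plotting code isn’t shown in this excerpt; one way to produce such a residual plot, sketched here with matplotlib (an assumption about the plotting library), is to scatter the errors against the fitted values and add a reference line at zero.

import matplotlib.pyplot as plt

# One possible way to draw the residual plot (the original plotting code
# is not shown here): errors against fitted values, with a zero line.
fig, ax = plt.subplots(figsize=(3.5, 2.5))
ax.scatter(predicted1.flatten(), errors1['aum'], s=10)
ax.axhline(0, color='black', linewidth=1)  # reference line at zero error
ax.set_xlabel('predicted AUM')
ax.set_ylabel('error')
plt.show()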

[Residual plot for the one-variable model]

It appears that the errors grow with AUM. We might try transforming the response variable or fitting a model that is quadratic in the commute-time fraction (a quick sketch follows); we consider transformations and polynomials more carefully in the next section. First, though, we see whether including additional variables offers a more accurate prediction of AUM.
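
Before moving on, here is a minimal preview of what a quadratic fit in the commute-time fraction could look like; this is a sketch only, and the next section treats transformations and polynomial features properly.

# Preview sketch: add a squared copy of the commute-time fraction and refit.
X_quad = cz_df[['travel_lt15']].assign(travel_lt15_sq=cz_df['travel_lt15'] ** 2)
model_quad = LinearRegression().fit(X_quad, y)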

15.5.2. Relating Upward Mobility Using Multiple Variables

In his original analysis, Chetty created several high-level features related to segregation, income, K-12 education, etc. We consider eight of Chetty’s predictors as we aim to build a more informative model for explaining AUM. These are described in Table 15.1.

Table 15.1 Potential explanatory features for modeling AUM

Column name   Description
-----------   -----------
travel_lt15   Fraction of people with a ≤15-minute commute to work.
gini          Gini coefficient, a measure of wealth inequality. Values are between
              0 and 1, where small values mean wealth is evenly distributed and
              large values mean more inequality.
rel_tot       Fraction of people who self-reported as religious.
single_mom    Fraction of children with a single mother.
taxrate       Local tax rate.
worked_14     Fraction of 14- to 16-year-olds who work.
foreign       Fraction of people born outside the US.
region        Region where the commuting zone is located (Northeast, Midwest, South, West).

Let’s first examine the correlations between AUM and the explanatory features and between these features themselves. (In the next section, we show how to create a model that includes categorical features.)

predictors = [
    'travel_lt15', 'gini', 'rel_tot', 'single_mom', 
    'taxrate', 'worked_14', 'foreign'
]

cz_df[(['aum'] + predictors)].corr()
              aum  travel_lt15  gini  rel_tot  single_mom  taxrate  worked_14  foreign
aum          1.00         0.68 -0.60     0.52       -0.77     0.35       0.65    -0.03
travel_lt15  0.68         1.00 -0.56     0.40       -0.42     0.34       0.60    -0.19
gini        -0.60        -0.56  1.00    -0.29        0.57    -0.15      -0.58     0.31
...           ...          ...   ...      ...         ...      ...        ...      ...
taxrate      0.35         0.34 -0.15     0.08       -0.26     1.00       0.35     0.26
worked_14    0.65         0.60 -0.58     0.28       -0.60     0.35       1.00    -0.15
foreign     -0.03        -0.19  0.31    -0.11       -0.04     0.26      -0.15     1.00

8 rows × 8 columns

We see that the fraction of single mothers in the commuting zone has the strongest correlation with AUM, which makes it the single best feature for explaining AUM with a one-variable linear model. In addition, we see that several explanatory variables are highly correlated with one another: the Gini coefficient is highly correlated with the fraction of teenagers who work, the fraction of single mothers, and the fraction with less than a 15-minute commute. With such highly correlated features, we need to take care in interpreting the coefficients, because several different models might explain AUM equally well, with the covariates standing in for one another.

To begin, we could consider all possible two-feature models to see which one has the smallest prediction errors. However, Chetty derived 40 potential variables to use as predictors, which would have us checking \(40 \times 39 / 2 = 780\) models. Fitting models with all pairs, triples, and so on of the variables quickly grows out of control, and it can lead to finding spurious correlations (see Chapter 16).
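
For reference, such an exhaustive search over just the seven numeric predictors in our data frame (21 pairs) could be sketched as follows; this is purely for illustration, and we do not pursue it here for the reasons above.

from itertools import combinations

# Illustrative sketch only: fit every two-feature model built from our seven
# numeric predictors and record the SD of its residuals (smaller = better in-sample fit).
pair_sd = {}
for pair in combinations(predictors, 2):
    model = LinearRegression().fit(cz_df[list(pair)], y)
    resid = y - model.predict(cz_df[list(pair)])
    pair_sd[pair] = float(np.std(resid['aum']))

best_pair = min(pair_sd, key=pair_sd.get)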

Here, we keep things a bit simple and examine just one two-variable model that includes the travel time and single mother features. After that, we look at the model that has all 7 numeric explanatory features in our data frame.

X2 = cz_df[['travel_lt15', 'single_mom']]
     
y = cz_df[['aum']]

model2 = LinearRegression().fit(X2, y)
print(f" Intercept: {model2.intercept_[0]:.1f} \n",
      f"Fraction with under 15 minute commute: {model2.coef_[0][0]:.2f} \n",  
      f"Fraction of single moms: {model2.coef_[0][1]:.2f}")
 Intercept: 49.0 
 Fraction with under 15 minute commute: 18.10 
 Fraction of single moms: -63.98

Notice that the coefficient for travel time is quite different from the coefficient for the same variable in the simple linear model. That’s because the two explanatory features in our model are correlated with each other, so we need to keep the other variables in the model in mind when we interpret a coefficient.

Next we compare the errors from the two fits.

predicted2 = model2.predict(X2)
errors2 = y - predicted2

print(f" SD(errors in model 1): {np.std(errors1)}\n",
      f"SD(errors in model 2): {np.std(errors2)}")
 SD(errors in model 1): aum    4.14
dtype: float64
 SD(errors in model 2): aum    2.85
dtype: float64

The SD of the residuals has been reduced by another 30%. Adding a second variable to the model seems worth the extra complexity.

Let’s again visually examine the residuals. We use the same scale on the \(y\)-axis to make it easier to compare with the plot for the one-variable model.

[Residual plot for the two-variable model, on the same y-axis scale as the previous residual plot]

The larger variability in the errors for higher AUM is still present. The implication is that the estimates \(\hat{y}\) are unaffected, but confidence and prediction intervals will be too narrow for larger AUM. This problem can be addressed by weighted regression; see XXX if you are interested in learning more about this topic.

Note

Once again, we point out that data scientists from different backgrounds use different terminology to refer to the same concept. For example, we say that each row in the design matrix \(\textbf{X}\) is an observation and each column is a variable. This terminology is more common among people with backgrounds in statistics. Others say that each column of the design matrix represents a feature. Also, we say that our overall process of fitting and interpreting models is called modeling, while others call it machine learning.

Now, let’s fit a multiple linear model that uses all seven numeric variables to explain upward mobility. After fitting the model, we again plot the errors using the same \(y\)-axis scale as in the previous two residual plots.

X7 = cz_df[predictors]

model7 = LinearRegression().fit(X7, y)

predicted7 = model7.predict(X7)
errors7 = y - predicted7

[Residual plot for the seven-variable model, on the same y-axis scale as the previous residual plots]

The model with seven features does not appear to be much better than the two-variable model. In fact, the standard deviation of the residuals has only decreased by 8%.

print(f"SD(errors in model 7): {np.std(errors7['aum']):.2f}")
SD(errors in model 7): 2.59

We can compare the Multiple \(R^2\) for these three models. (The adjustment for the number of features in the model makes little difference for us since we have over 700 observations.)

print(f" R-square for 7-variable model: {model7.score(X7, y):.2f}\n",
     f"R-square for 2-variable model: {model2.score(X2, y):.2f}\n",
     f"R-square for 1-variable model: {model1.score(X, y):.2f}\n")
 R-square for 7-variable model: 0.79
 R-square for 2-variable model: 0.74
 R-square for 1-variable model: 0.46
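
To back up the parenthetical claim above, we can compute the adjusted \(R^2\) by hand using the standard formula \(1 - (1 - R^2)\frac{n-1}{n-p-1}\); here is a minimal sketch with \(n = 705\) commuting zones and \(p = 7\) features.

# Rough check: with n = 705 much larger than p, the adjustment barely matters.
def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(0.79, n=705, p=7))  # about 0.788, essentially unchanged from 0.79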

We confirm our earlier observations: adding a second variable greatly improves the explanatory capability of the model, while the seven-variable model offers little improvement over the two-variable model.

So far, our models have used only numeric explanatory variables, but categorical data is often useful for model fitting as well. Additionally, in Chapter 10 we transformed variables and created new variables from combinations of variables. We address how to incorporate these kinds of variables into linear models next.
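
As a preview, one common way to bring a categorical column such as region into a linear model is one-hot (dummy) encoding, for example with pd.get_dummies; whether the next section uses exactly this approach is an assumption here.

import pandas as pd

# Preview sketch: one-hot encode `region` and append it to the numeric features.
# The next section covers how to model categorical features properly.
X_with_region = pd.concat(
    [cz_df[predictors], pd.get_dummies(cz_df['region'], prefix='region')],
    axis=1,
)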