Answers: Multiple linear regression

Homework 11: Multiple linear regression

Answers to selected number problems

1.
a. Region, Division, State, and County are all categorical (nominal) variables.

names(cancer_reg)

"TARGET_deathRate" (ratio)     "Region" (nominal)           "Division" (nominal)
"State" (nominal)              "County" (nominal)           "medIncome" (ratio) 
"popEst2015" (ratio)           "povertyPercent" (ratio)     "studyPerCap" (ratio)
"MedianAge" (ratio)            "AvgHouseholdSize"           "PercentMarried" (ratio)
"BirthRate" (ratio)            "PctHS25_Over" (ratio)       "PctBachDeg25_Over" (ratio)
"PctUnemployed16_Over" (ratio) "PctPrivateCoverage" (ratio) "PctPublicCoverageAlone" (ratio)
"PctWhite" (ratio)             "PctBlack" (ratio)           "PctAsian" (ratio)
"PctOtherRace" (ratio)

b. TARGET_deathRate is the dependent variable; all other variables are independent variables.

c. Therefore, TARGET_deathRate is the response variable and all other variables are potential predictor variables.

d. Factors are the nominal variables: Counties nested in States, States nested in Divisions, Divisions nested in Regions.

3. Note: a full model is not the best model choice. Many of the predictor variables are correlated; hence, the full model violates a basic assumption of model building, i.e., that the predictor variables are independent.

a. The dependent variable is TARGET_deathRate. There are lots of possible simpler models. To illustrate a multiple regression, I selected these ratio variables:

AvgHouseholdSize
MedianAge
medIncome
PctPublicCoverageAlone
PctUnemployed16_Over
PctWhite
PercentMarried
povertyPercent

Lots of socioeconomic factors are associated with increased cancer risk and mortality. One highlight from my selection: I predict a negative linear association between cancer mortality and the percent of the population that is White (a well-known health disparity).

b. Rcmdr: Statistics → Summaries → Correlation matrix…

This returns a symmetric matrix of correlations (above and below the diagonal are identical). Pearson correlations:

                       AvgHouseholdSize MedianAge medIncome PctPublicCoverageAlone PctUnemployed16_Over PctWhite PercentMarried povertyPercent
AvgHouseholdSize                 1.0000   -0.0319    0.1121                 0.0611               0.1315  -0.1884        -0.1005         0.0743
MedianAge                       -0.0319    1.0000   -0.0133                -0.0033               0.0186   0.0350         0.0464        -0.0293
medIncome                        0.1121   -0.0133    1.0000                -0.7198              -0.4531   0.1672         0.3551        -0.7890
PctPublicCoverageAlone           0.0611   -0.0033   -0.7198                 1.0000               0.6554  -0.3610        -0.4600         0.7986
PctUnemployed16_Over             0.1315    0.0186   -0.4531                 0.6554               1.0000  -0.5018        -0.5515         0.6551
PctWhite                        -0.1884    0.0350    0.1672                -0.3610              -0.5018   1.0000         0.6774        -0.5094
PercentMarried                  -0.1005    0.0464    0.3551                -0.4600              -0.5515   0.6774         1.0000        -0.6429
povertyPercent                   0.0743   -0.0293   -0.7890                 0.7986               0.6551  -0.5094        -0.6429         1.0000

To help you visually, pick out the “large” correlation estimates (subjective, but recall that weak correlation is about magnitude 0.10, moderate about 0.3, and large or strong about 0.5 or greater). A matrix like this is hard to take in at a glance, so a heatmap can be helpful. Find the R command generated by Rcmdr, edit it to save the correlation matrix to an object, then re-run:

myCor <- cor(cancer_reg[,c("AvgHouseholdSize","MedianAge","medIncome",
"PctPublicCoverageAlone","PctUnemployed16_Over","PctWhite","PercentMarried",
"povertyPercent")], use="complete")

heatmap(myCor)

[Figure: heatmap of the correlation matrix. I trimmed the default dendrogram before saving the image.]

Note: a number of the correlations are high, and most (21/28) were statistically significant. (If you requested P-values, use the adjusted p-values because of the multiple comparisons problem; see Chapter 12.1.)
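Where the adjusted p-values come from: Rcmdr runs all pairwise correlation tests (via rcorr.adjust in the RcmdrMisc package) and then adjusts the p-values for multiple comparisons using Holm's method. A base-R sketch of the same idea, on synthetic data standing in for the cancer_reg columns:

```r
# Pairwise correlation tests with Holm-adjusted p-values (base R only).
# Synthetic columns a, b, c stand in for the selected cancer_reg variables.
set.seed(42)
dat   <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
prs   <- combn(names(dat), 2, simplify = FALSE)   # all 3 pairs here (28 in the homework)
p_raw <- sapply(prs, function(v) cor.test(dat[[v[1]]], dat[[v[2]]])$p.value)
p_adj <- p.adjust(p_raw, method = "holm")         # adjusted for multiple comparisons
data.frame(pair = sapply(prs, paste, collapse = ":"),
           raw  = round(p_raw, 4),
           holm = round(p_adj, 4))
```

Note that an adjusted p-value is never smaller than its raw p-value, which is why fewer correlations remain significant after adjustment.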

c.

lm(formula = TARGET_deathRate ~ AvgHouseholdSize + MedianAge + 
medIncome + PctPublicCoverageAlone + PctUnemployed16_Over + 
PctWhite + povertyPercent, data = cancer_reg)

Coefficients:
                            Estimate   Std. Error t value Pr(>|t|)    
(Intercept)             177.31729409   7.75447960  22.866  < 2e-16 ***
AvgHouseholdSize         -3.12075854   1.07317909  -2.908  0.00366 ** 
MedianAge                -0.00087311   0.00972271  -0.090  0.92845    
medIncome                -0.00044412   0.00007038  -6.310 3.19e-10 ***
PctPublicCoverageAlone    0.85193782   0.13097327   6.505 9.07e-11 ***
PctUnemployed16_Over      1.27018836   0.18570336   6.840 9.54e-12 ***
PctWhite                  0.01562574   0.03572544   0.437  0.66186    
povertyPercent            0.13947937   0.16319835   0.855  0.39281    
---

Residual standard error: 24.21 on 3039 degrees of freedom
Multiple R-squared: 0.2407, Adjusted R-squared: 0.2389 
F-statistic: 137.6 on 7 and 3039 DF, p-value: < 2.2e-16

Besides the intercept, four of the seven predictor variables were statistically significant. The model R² is low, about 24%; but given that the outcome is mortality from cancer, explaining 24% of the variability from income and other socioeconomic factors is a potentially important result.

Run stepwise regression, criterion = AIC

stepwise(LinearModel.2, direction='backward/forward', criterion='AIC')
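As an aside, if you fit candidate models yourself, base R's AIC() compares them directly. A sketch on synthetic data (with the homework data, pass your fitted lm objects instead); extractAIC(), which stepwise() reports, differs from AIC() by an additive constant for a fixed data set, so both rank models the same way:

```r
# Comparing candidate models by AIC: favor the lower value.
# Synthetic data: y really depends only on x1, so x2 is a useless predictor.
set.seed(7)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 2 * d$x1 + rnorm(100)
m_full  <- lm(y ~ x1 + x2, data = d)
m_small <- lm(y ~ x1, data = d)
AIC(m_full, m_small)   # table of df and AIC for the two models; compare the AIC column
```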

Look for the best model at the end of the output.

The best model has AIC = 19422.95. Recall that, in general, we favor models with lower AIC values, so comparing AIC is an acceptable way to select among similar candidate models.

Predicted TARGET_deathRate = 181.5702427 - 3.0150568(AvgHouseholdSize) - 0.0004826(medIncome) + 0.8885527(PctPublicCoverageAlone) + 1.2973794(PctUnemployed16_Over)
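Plugging values into this equation gives a predicted death rate by hand; the county values below are made up for illustration (with the real data, predict() on the fitted model does this in one step):

```r
# Predicted death rate for a hypothetical county (illustrative values only):
#   AvgHouseholdSize = 2.5, medIncome = $45,000,
#   PctPublicCoverageAlone = 20%, PctUnemployed16_Over = 8%
pred <- 181.5702427 -
        3.0150568 * 2.5 -
        0.0004826 * 45000 +
        0.8885527 * 20 +
        1.2973794 * 8
round(pred, 1)   # about 180.5 deaths per 100,000
```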

d. Check for multicollinearity. Now that we have our best model, fit the linear model again with only the selected predictors, then check the VIFs.

vif(LinearModel.3)
      AvgHouseholdSize              MedianAge PctPublicCoverageAlone   PctUnemployed16_Over
              1.020019               1.002023               1.755654               1.780817

When VIF is 1, the predictor variables are independent; if VIF is 2, multicollinearity inflates the variance of that coefficient (the squared standard error) by a factor of 2. A large VIF indicates multicollinearity; because all four VIFs here are less than 2, we conclude that the four predictors are mostly independent of one another.
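Where VIF comes from: regress each predictor on the remaining predictors, then VIF_j = 1/(1 - R²_j). The vif() call above comes from the car package (loaded with Rcmdr); a hand-rolled sketch on synthetic data:

```r
# VIF by hand: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors.  Synthetic data for illustration.
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.5)   # deliberately collinear with x1
x3 <- rnorm(n)                        # independent of the others

vif_manual <- function(X) {
  sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
    1 / (1 - r2)
  })
}

X <- cbind(x1, x2, x3)
round(vif_manual(X), 2)   # x1 and x2 well above 1; x3 close to 1
```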

/MD