Homework 11: Multiple linear regression
Answers to selected numbered problems
1.
a. Region, Division, State, and County are all categorical (nominal) variables.
names(cancer_reg), with measurement scales:

TARGET_deathRate (ratio)
Region (nominal)
Division (nominal)
State (nominal)
County (nominal)
medIncome (ratio)
popEst2015 (ratio)
povertyPercent (ratio)
studyPerCap (ratio)
MedianAge (ratio)
AvgHouseholdSize (ratio)
PercentMarried (ratio)
BirthRate (ratio)
PctHS25_Over (ratio)
PctBachDeg25_Over (ratio)
PctUnemployed16_Over (ratio)
PctPrivateCoverage (ratio)
PctPublicCoverageAlone (ratio)
PctWhite (ratio)
PctBlack (ratio)
PctAsian (ratio)
PctOtherRace (ratio)
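To confirm these scale assignments in R, you can inspect each column's class. A minimal sketch, assuming cancer_reg is already loaded: character or factor columns are the nominal variables, and numeric columns are the ratio-scale ones.

# Inspect structure: storage type and a preview of each column
str(cancer_reg)
# Compact listing of each column's class
sapply(cancer_reg, class)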
b. TARGET_deathRate is the dependent variable; all other variables are independent variables.
c. Therefore, TARGET_deathRate is the response variable, and all other variables are potential predictor variables.
d. The factors are the nominal variables: Counties are nested in States, States are nested in Divisions, and Divisions are nested in Regions.
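A quick sketch of inspecting this hierarchy (again assuming cancer_reg is loaded): count the distinct levels at each tier; the nesting implies the counts increase sharply from Region down to County.

# Number of distinct levels at each tier of the geographic hierarchy
sapply(cancer_reg[, c("Region", "Division", "State", "County")],
       function(x) length(unique(x)))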
3. Note: a full model is not the best model choice. Many of the predictor variables are correlated; hence, the full model violates a basic assumption of model building, namely that the predictor variables are independent of one another.
a. The dependent variable is TARGET_deathRate. There are many possible simpler models. To illustrate a multiple regression, I selected these ratio variables:
AvgHouseholdSize, MedianAge, medIncome, PctPublicCoverageAlone, PctUnemployed16_Over, PctWhite, PercentMarried, povertyPercent
Many socioeconomic factors are associated with increased cancer risk and mortality. One highlight from my selection: I predict a negative linear association between cancer mortality and the percent of the population that is White (a well-known health disparity).
b. Rcmdr: Statistics → Summaries → Correlation matrix…
This returns a symmetric matrix of correlations (above and below the diagonal are identical). Pearson correlations:
(Columns (1) through (8) follow the same order as the rows.)

                             (1)      (2)      (3)      (4)      (5)      (6)      (7)      (8)
(1) AvgHouseholdSize        1.0000  -0.0319   0.1121   0.0611   0.1315  -0.1884  -0.1005   0.0743
(2) MedianAge              -0.0319   1.0000  -0.0133  -0.0033   0.0186   0.0350   0.0464  -0.0293
(3) medIncome               0.1121  -0.0133   1.0000  -0.7198  -0.4531   0.1672   0.3551  -0.7890
(4) PctPublicCoverageAlone  0.0611  -0.0033  -0.7198   1.0000   0.6554  -0.3610  -0.4600   0.7986
(5) PctUnemployed16_Over    0.1315   0.0186  -0.4531   0.6554   1.0000  -0.5018  -0.5515   0.6551
(6) PctWhite               -0.1884   0.0350   0.1672  -0.3610  -0.5018   1.0000   0.6774  -0.5094
(7) PercentMarried         -0.1005   0.0464   0.3551  -0.4600  -0.5515   0.6774   1.0000  -0.6429
(8) povertyPercent          0.0743  -0.0293  -0.7890   0.7986   0.6551  -0.5094  -0.6429   1.0000
To help you visually, I added bold type to the "large" correlation estimates (subjective, but recall that a weak correlation has magnitude of about 0.1, a moderate correlation about 0.3, and a large or strong correlation about 0.5 or greater). A table like this is hard to make sense of, so a heatmap can be helpful. Find and edit the R command echoed by Rcmdr, add an assignment to save the correlation matrix to an object, then re-run:
myCor <- cor(cancer_reg[, c("AvgHouseholdSize", "MedianAge", "medIncome",
                            "PctPublicCoverageAlone", "PctUnemployed16_Over",
                            "PctWhite", "PercentMarried", "povertyPercent")],
             use = "complete")
heatmap(myCor)
(I trimmed the default dendrogram before saving the image.)
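If you prefer to suppress the dendrograms in code rather than crop the image, base R's heatmap() accepts Rowv = NA and Colv = NA; a sketch:

# NA turns off the dendrograms and the associated row/column reordering
heatmap(myCor, Rowv = NA, Colv = NA)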
Note: a number of the correlations are high; most (21 of 28) were statistically significant. (I requested P-values; use the adjusted p-values because of the multiple comparisons problem; see Chapter 12.1.)
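To reproduce the correlations with unadjusted and adjusted (Holm) p-values from the console, you can call rcorr.adjust() from the RcmdrMisc package, which (as far as I can tell) is the function the Rcmdr menu item uses behind the scenes; a sketch:

library(RcmdrMisc)  # provides rcorr.adjust()
# Pearson correlations with raw and Holm-adjusted p-values;
# use = "complete.obs" drops rows with any missing values
rcorr.adjust(cancer_reg[, c("AvgHouseholdSize", "MedianAge", "medIncome",
                            "PctPublicCoverageAlone", "PctUnemployed16_Over",
                            "PctWhite", "PercentMarried", "povertyPercent")],
             type = "pearson", use = "complete.obs")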
c.
lm(formula = TARGET_deathRate ~ AvgHouseholdSize + MedianAge + medIncome +
    PctPublicCoverageAlone + PctUnemployed16_Over + PctWhite +
    povertyPercent, data = cancer_reg)

Coefficients:
                           Estimate   Std. Error t value Pr(>|t|)
(Intercept)            177.31729409   7.75447960  22.866  < 2e-16 ***
AvgHouseholdSize        -3.12075854   1.07317909  -2.908  0.00366 **
MedianAge               -0.00087311   0.00972271  -0.090  0.92845
medIncome               -0.00044412   0.00007038  -6.310 3.19e-10 ***
PctPublicCoverageAlone   0.85193782   0.13097327   6.505 9.07e-11 ***
PctUnemployed16_Over     1.27018836   0.18570336   6.840 9.54e-12 ***
PctWhite                 0.01562574   0.03572544   0.437  0.66186
povertyPercent           0.13947937   0.16319835   0.855  0.39281
---
Residual standard error: 24.21 on 3039 degrees of freedom
Multiple R-squared: 0.2407,  Adjusted R-squared: 0.2389
F-statistic: 137.6 on 7 and 3039 DF, p-value: < 2.2e-16
Besides the intercept, four of the seven predictor variables are statistically significant. The model R2 is low, about 24%; but given that the model is about mortality from cancer, explaining 24% of the variability in outcomes from income and other socioeconomic factors is a potentially important result.
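For reference, here is a sketch of fitting the same model from the console; LinearModel.2 is the Rcmdr-style object name assumed by the stepwise() call below.

# Fit the seven-predictor model; summary() prints the coefficient
# table, R-squared, and F-statistic shown above
LinearModel.2 <- lm(TARGET_deathRate ~ AvgHouseholdSize + MedianAge +
                      medIncome + PctPublicCoverageAlone +
                      PctUnemployed16_Over + PctWhite + povertyPercent,
                    data = cancer_reg)
summary(LinearModel.2)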
Run stepwise regression, criterion = AIC
stepwise(LinearModel.2, direction='backward/forward', criterion='AIC')
Look for the best model at the end of the output. The best model has AIC = 19422.95; recall that, in general, we favor models with lower AIC values, so comparing AIC values is an acceptable way to select among similar candidate models.
Predicted TARGET_deathRate = 181.5702427 - 3.0150568(AvgHouseholdSize) - 0.0004826(medIncome) + 0.8885527(PctPublicCoverageAlone) + 1.2973794(PctUnemployed16_Over)
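Note that stepwise() comes from the RcmdrMisc package. Outside Rcmdr, base R's step() performs the same AIC-based search; a sketch, with bestModel as a placeholder name:

# AIC-based backward/forward selection starting from the full model;
# step() prints each step and returns the final (lowest-AIC) fit
bestModel <- step(LinearModel.2, direction = "both")
summary(bestModel)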
d. Check for multicollinearity. Now that we have our best model, refit the linear model with only the retained predictors, then check the VIFs.
vif(LinearModel.3)
      AvgHouseholdSize              MedianAge PctPublicCoverageAlone   PctUnemployed16_Over
              1.020019               1.002023               1.755654               1.780817
When VIF equals 1, the predictor variables are independent; a VIF of 2 means multicollinearity inflates the variance of that coefficient estimate by a factor of 2 (its standard error by a factor of about 1.4). Large VIF values indicate multicollinearity; because all of the VIFs here are less than 2, we conclude that the four predictors are mostly independent of each other.
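vif() comes from the car package (which Rcmdr loads). To see what it computes, here is a sketch of the definition VIF_j = 1/(1 - R_j^2), applied to one predictor by regressing it on the other three; the result should approximately match the vif() output above.

# Auxiliary regression of one predictor on the remaining predictors;
# 1 / (1 - R^2) of this regression is its variance inflation factor
aux <- lm(PctPublicCoverageAlone ~ AvgHouseholdSize + MedianAge +
            PctUnemployed16_Over, data = cancer_reg)
1 / (1 - summary(aux)$r.squared)  # expect a value near 1.756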
/MD