Homework 11: Multiple linear regression
Objectives
- Apply the general linear model approach to develop a predictive model for cancer mortality by county.
- This is a challenging homework, with lots of parts. The goal for you is to organize these different parts into a consistent narrative about predicting cancer mortality rates and the narrative needs to be supported by your statistics (i.e., evidence).
- For your conclusions, discuss whether a simple model or a complicated model is best for predicting mortality rates. In other words, is a simple model with just one or only a few predictor variables good enough, or must we know
Homework 11 expectations
Read through the entire homework before starting to answer a question. You are expected to have read the chapter and to have completed the preceding homework. Answers are provided to odd-numbered problems; turn in your work for even-numbered problems.
How to work this homework
You may work together, but each of you must turn in your own report. Don’t “plagiarize” from each other. Do include in your report who you worked with.
What to turn in: A pdf file containing your R code, statistical results, and your answers to the questions. Use of R Markdown is recommended; however, copy/paste into a Word document is also acceptable.
Submit your work to CANVAS. Obey proper file naming formats.
Resources for this homework
Chapter 17. Mike’s Biostatistics Book
Chapter 18. Mike’s Biostatistics Book
Mike’s Workbook for Biostatistics: A quick look at R and R Commander, Part01 – Part10 and previous homework pages presented in this workbook.
Additional R commands and/or code provided below.
Background
Unlike correlation analysis, regression statistics are used to specify causal models. Causal models are used to make predictions. While lifetime risk of cancer is about 40% for males and 39% for females (cancer.org), rates differ by geographic location. Risk of cancer is not the same across the United States of America. Sometimes we think of the United States as one country, but the USA is made up of different states, which differ for many known cancer risk factors. Rates also differ by county within a state. States in the West differ in many ways from states in the Northeast. To begin this project, spend some time at https://gis.cdc.gov/cancer/USCS/DataViz.html and investigate census and descriptive statistics involving cancer in the United States.
The purpose of this biostatistics homework is to develop a predictive model about mortality from cancer, given facts about the county. The data set is a smaller and modified sample from a “Multiple Linear Regression Challenge,” a Data Science project. You should definitely review Mike’s Biostatistics Book Chapter 18 and consider material in Mike’s Biostatistics Book Chapter 14 along with conducting the statistics requested in this homework.
Questions
- Load the data file into R and Rcmdr. Review each variable in the data set against the Data dictionary (below). For each variable, identify:
a) data type (nominal, ratio, etc.)
b) independent or dependent
c) predictor variable or response variable
d) identify factors that are crossed, nested, fixed effects, or random effects
- The central questions in the data set are:
a) Does mortality vary by geographic location in the USA?
b) If mortality does vary by location, why?
Provide graphic and statistical evidence to test the first question (2a). Include as part of your answer a justification why you chose a particular statistical test.
OPTIONAL: use facilities at http://www.heatmapper.ca/geomap/ to produce a heat map of Cancer Mortality rates in the United States. All you need is a text file with two columns: State and Mortality. Note that a heat map can be made in R, but the online site makes this pretty easy to do.
- Develop a predictive model using the general linear model function
lm()
Rcmdr: Statistics → Fit models → Linear models…
Before tackling and reporting your final predictive model, consider
a) Begin by reviewing risk factors for cancer. For example, age is a known risk factor of developing and dying of cancer. Build your predictive model by first reviewing and evaluating each variable — develop a case for or against including the variable in your model.
b) Remember that there are multiple predictor variables. Conduct a correlation analysis on the predictor variables (not the dependent variable) and address the potential for multicollinearity.
c) Identify the full model and use stepwise model selection (Rcmdr: Models → Stepwise model selection, select backward/forward, use AIC not BIC criterion for inclusion).
d) Consider model assumptions and provide evidence that you have considered the assumptions.
e) Conclude with justification of why your model is best (i.e., see Objectives 2 and 3, above). Include justification for why a simple model may be preferred against why a more complicated model may be required.
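Steps (b) and (c) can be sketched in plain R as follows. This is only a minimal illustration, not the expected answer: the data frame `toy` and all of its values are simulated placeholders that borrow a few column names from the data dictionary, and the choice of predictors is arbitrary. Base R `step()` with `direction = "both"`, started from the full model, is assumed here as a stand-in for the Rcmdr backward/forward menu option; the two may not report identical output.

```r
# Simulated stand-in data: column names follow the data dictionary,
# but every value is randomly generated, not real county data.
set.seed(1)
n <- 200
toy <- data.frame(
  MedianAge = rnorm(n, mean = 40, sd = 5),
  medIncome = rnorm(n, mean = 50000, sd = 10000),
  BirthRate = rnorm(n, mean = 5, sd = 1)
)
# poverty is constructed to be negatively correlated with medIncome
toy$poverty <- 30 - toy$medIncome / 2500 + rnorm(n, sd = 2)
toy$Mortality <- 100 + 1.5 * toy$MedianAge + 2 * toy$poverty + rnorm(n, sd = 10)

# (b) Correlations among the predictors only (Mortality is left out)
predictors <- toy[, c("MedianAge", "medIncome", "BirthRate", "poverty")]
cor_matrix <- cor(predictors)
print(round(cor_matrix, 2))

# (c) Full model, then stepwise selection; k = 2 is the AIC penalty
full_model <- lm(Mortality ~ MedianAge + medIncome + BirthRate + poverty,
                 data = toy)
step_model <- step(full_model, direction = "both", k = 2, trace = 0)
print(summary(step_model))

# Compare the full and selected models on the same criterion
print(AIC(full_model, step_model))
```

Large absolute correlations between two predictors in `cor_matrix` are a warning sign for multicollinearity. Whether to drop one of the pair is a judgment you should defend in your report.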
R data set
Reminder: this is an RData file, so after saving the file to your working directory, to load the file:
Rcmdr: Data → Load data set…
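Outside of Rcmdr, an RData file can be read with the base R function `load()`. The sketch below is self-contained, so it first writes a small made-up data frame to a temporary file. The object name `toy`, the file name, and the values are all hypothetical; substitute the name of the file you saved.

```r
# Hypothetical example: the file name, object name, and values are placeholders.
tmp <- file.path(tempdir(), "toy.RData")
toy <- data.frame(State = c("Hawaii", "Ohio"), Mortality = c(150, 180))
save(toy, file = tmp)
rm(toy)

# load() restores the saved objects into the workspace and
# returns their names as a character vector
loaded <- load(tmp)
print(loaded)
str(toy)
```

Printing the return value of `load()` is a quick way to learn what the data frame in an unfamiliar RData file is called.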
Data dictionary
(variables listed in alphabetical order, not in order presented in worksheet)
Column name | Description |
Asian | Frequency of county residents who identify as Asian |
AvgHouseholdSize | Mean household size, county |
BachDeg25_Over | Frequency of county residents ages 25 and over highest education attained: bachelor’s degree |
BirthRate | Number of live births relative to number of women in county |
Black | Frequency of county residents who identify as Black |
County | County name |
Division | Census area within Census Region |
HS25_Over | Frequency of county residents ages 25 and over highest education attained: high school diploma |
ID_unit | DrD identification |
Married | Frequency of county residents who are married |
MedianAge | Median age of county residents |
medIncome | Median income per county |
Mortality | Dependent variable. Mean per capita (100,000) cancer mortalities |
OtherRace | Frequency of county residents who identify in a category which is not White, Black, or Asian |
popEst2015 | Population of county 2015 |
poverty | Frequency of populace in poverty |
PrivateCoverage | Frequency of county residents with private health coverage |
PublicCoverageAlone | Frequency of county residents with government-provided health coverage alone |
Region | Census area |
State | State in United States of America |
studyPerCap | Per capita number of cancer-related clinical trials per county |
Unemployed16_Over | Frequency of county residents ages 16 and over unemployed |
White | Frequency of county residents who identify as White |
R or Rcmdr commands
myData <- read.table(header=TRUE, sep="\t", text = " insert your data table here ")
head(myData)
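To show what a pasted table might look like, here is a small sketch with tab-separated text, where `\t` inside the string stands for a tab. The three rows and their mortality values are invented placeholders, not values from the homework data set.

```r
# Hypothetical two-column table; the Mortality values are placeholders.
myData <- read.table(header = TRUE, sep = "\t", text = "State\tMortality
Hawaii\t150
Ohio\t180
Texas\t160")
head(myData)
```

With `header = TRUE`, the first line of the text supplies the column names.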
heatmap()
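Note that base R `heatmap()` draws a color image of a numeric matrix, not a geographic map. One possible use in this homework, offered only as a suggestion, is to display the correlation matrix of the predictor variables. The data below are simulated placeholders that borrow column names from the data dictionary.

```r
# Simulated placeholder predictors; values are random, not real county data.
set.seed(2)
n <- 100
toy <- data.frame(
  MedianAge = rnorm(n, 40, 5),
  medIncome = rnorm(n, 50000, 10000),
  poverty   = rnorm(n, 15, 5),
  BirthRate = rnorm(n, 5, 1)
)

cor_matrix <- cor(toy)

# symm = TRUE treats the matrix as symmetric; scale = "none" keeps
# the correlations as they are instead of rescaling rows
heatmap(cor_matrix, symm = TRUE, scale = "none", margins = c(8, 8))
```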
Test normality.
Rcmdr → Statistics → Summaries → Test for normality
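In plain R, one common way to check normality for a regression is to apply `shapiro.test()` to the model residuals and to look at a normal Q-Q plot. The sketch below assumes a single-predictor model fit to simulated placeholder data; your own model and variables will differ.

```r
# Simulated placeholder data; values are random, not real county data.
set.seed(3)
n <- 150
toy <- data.frame(MedianAge = rnorm(n, 40, 5))
toy$Mortality <- 100 + 1.5 * toy$MedianAge + rnorm(n, sd = 10)

fit <- lm(Mortality ~ MedianAge, data = toy)
res <- residuals(fit)

# Shapiro-Wilk test on the residuals
normality <- shapiro.test(res)
print(normality)

# Graphical checks: normal Q-Q plot, then residuals against fitted values
qqnorm(res)
qqline(res)
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

The residuals-versus-fitted plot is included because the assumptions also concern constant variance and linearity, which a normality test alone does not address.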
Other R/Rcmdr commands provided in text
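For question 2a, one possible approach (the choice of test is yours to justify) is a box plot of Mortality by Region followed by a one-way ANOVA, with the Kruskal-Wallis test as a nonparametric alternative. The sketch uses simulated placeholder data. The region labels and the size of the differences among them are assumptions made for illustration and may not match how Region is coded in the homework file.

```r
# Simulated placeholder data; region labels and effects are assumptions.
set.seed(4)
regions <- c("Midwest", "Northeast", "South", "West")
toy <- data.frame(Region = factor(rep(regions, each = 30)))
region_shift <- c(Midwest = 0, Northeast = -5, South = 10, West = -10)
toy$Mortality <- 175 +
  unname(region_shift[as.character(toy$Region)]) +
  rnorm(nrow(toy), sd = 15)

# Graphic evidence
boxplot(Mortality ~ Region, data = toy, ylab = "Mortality (per 100,000)")

# Statistical evidence: one-way ANOVA
anova_fit <- aov(Mortality ~ Region, data = toy)
print(summary(anova_fit))

# Nonparametric alternative if the ANOVA assumptions look doubtful
kw <- kruskal.test(Mortality ~ Region, data = toy)
print(kw)
```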