Homework 11: Multiple linear regression

Objectives

Apply the general linear model approach to develop a predictive model for cancer mortality by county.
This is a challenging homework with many parts. Your goal is to organize these parts into a consistent narrative about predicting cancer mortality rates, and to support that narrative with your statistics (i.e., evidence).
For your conclusions, discuss whether a simple model or a complicated model is best for predicting mortality rates. In other words, is a simple model with just one or only a few predictor variables good enough, or must we know

Homework 11 expectations

Read through the entire homework before starting to answer a question. You are expected to have read the chapter and to have completed the preceding homework. Answers are provided to odd-numbered problems; turn in your work for even-numbered problems.

How to work this homework

You may work together, but each of you must turn in your own report. Don’t “plagiarize” from each other. Do include in your report who you worked with.

What to turn in: A pdf file containing relevant R code, statistical results (edited to support your answers to the questions), and your answers to the questions (even numbered only). Use of RMarkdown is recommended because it is a simple way to include the graphs you generate; however, copy/paste into a Word document is also acceptable.

Notes. By relevant we mean provide just the R code and results from R functions necessary to support your answers to the questions. For example, do not include

the entire data set when head(dataset) will do
screenshots of R output!! R output is text — copy/paste
all statistical output from an R function.

See Part09: Making a report for an example homework file.

Submit your work to CANVAS. Obey proper file naming formats.

Resources for this homework

Chapter 17. Mike’s Biostatistics Book

Chapter 18. Mike’s Biostatistics Book

Mike’s Workbook for Biostatistics: A quick look at R and R Commander, Part01 – Part10 and previous homework pages presented in this workbook.

Additional R commands and/or code are provided below.

Answers to selected problems

Background

Unlike correlation analysis, regression statistics are used to specify causal models. Causal models are used to make predictions. While lifetime risk of cancer is about 40% for males and 39% for females (cancer.org), rates differ by geographic location: risk of cancer is not the same across the United States of America. Sometimes we think of the United States as one country, but the USA is made up of different states, which differ for many known cancer risk factors. Rates also differ by county within a state. States in the West differ in many ways from states in the Northeast. To begin this project, spend some time at https://gis.cdc.gov/cancer/USCS/DataViz.html and investigate census and descriptive statistics involving cancer in the United States.

The purpose of this biostatistics homework is to develop a predictive model about mortality from cancer, given facts about the county. The data set is a smaller and modified sample from a “Multiple Linear Regression Challenge,” a Data Science project. You should definitely review Mike’s Biostatistics Book Chapter 18 and consider material in Mike’s Biostatistics Book Chapter 14 along with conducting the statistics requested in this homework.

Questions

Load the data file into R and Rcmdr. Review each variable in the data set against the Data dictionary (below). Classify each variable by
a) data type (nominal, ratio, etc.)
b) independent or dependent
c) predictor variable or response variable
d) identify Factors that are crossed, nested, fixed effects, or random effects
The central questions in the data set are:
a) Does mortality vary by geographic location in the USA?
b) If mortality does vary by location, why?
Provide graphic and statistical evidence to test the first question (2a). Include as part of your answer a justification why you chose a particular statistical test.
OPTIONAL — use facilities at http://www.heatmapper.ca/geomap/ to produce a heat map of Cancer Mortality rates in the United States. All you need is a text file with two columns: State and Mortality. Note that a heat map can be made in R, but the online site makes this pretty easy to do.
Develop a predictive model using the general linear model function lm()
Rcmdr: Statistics → Fit models → Linear models…
Before tackling and reporting your final predictive model, consider
a) Begin by reviewing risk factors for cancer. For example, age is a known risk factor of developing and dying of cancer. Build your predictive model by first reviewing and evaluating each variable — develop a case for or against including the variable in your model.
b) Remember that there are multiple predictor variables. Conduct a correlation analysis on the predictor variables (not the dependent variable) and address potential for multicollinearity
c) Identify the full model and use stepwise model selection (Rcmdr: Models → Stepwise model selection, select backward/forward, use AIC not BIC criterion for inclusion).
d) Consider model assumptions and provide evidence that you have considered the assumptions.
e) Conclude with justification of why your model is best (i.e., see Objectives 2 and 3, above). Include justification for why a simple model may be preferred against why a more complicated model may be required.
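
The model-building steps above (correlation among predictors, full model, stepwise selection, assumption checks) can be sketched in plain R roughly as follows. This is only an illustration under stated assumptions: the data frame `toy` is simulated, borrowing a few column names from the data dictionary, and its values say nothing about the real data. `step()` from base R is used here as a stand-in for the Rcmdr stepwise menu, and the variance inflation factors are computed by hand so that no extra package is needed.

```r
# Simulated stand-in for the county data: the values are made up, and only
# the column names are borrowed from the data dictionary.
set.seed(11)
n <- 200
toy <- data.frame(
  MedianAge = rnorm(n, mean = 40, sd = 5),
  medIncome = rnorm(n, mean = 47000, sd = 12000),
  BirthRate = rnorm(n, mean = 5.5, sd = 1.5)
)
# poverty and PrivateCoverage are built from medIncome, so they are collinear
toy$poverty <- 30 - toy$medIncome / 2500 + rnorm(n, sd = 2)
toy$PrivateCoverage <- 40 + toy$medIncome / 2000 + rnorm(n, sd = 5)
toy$Mortality <- 120 + 1.2 * toy$MedianAge + 1.5 * toy$poverty +
  rnorm(n, sd = 15)

predictors <- c("MedianAge", "medIncome", "BirthRate",
                "poverty", "PrivateCoverage")

# (b) Correlations among the predictors only (response left out)
pred_cor <- cor(toy[, predictors])
print(round(pred_cor, 2))

# Variance inflation factor by hand: regress each predictor on the others,
# then VIF = 1 / (1 - R^2). Large values flag multicollinearity.
vif_by_hand <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})
print(round(vif_by_hand, 2))

# (c) Full model, then stepwise selection starting from the full model.
# k = 2 is the AIC penalty; k = log(n) would give BIC instead.
full_model <- lm(Mortality ~ MedianAge + medIncome + BirthRate +
                   poverty + PrivateCoverage, data = toy)
step_model <- step(full_model, direction = "both", k = 2, trace = 0)
print(formula(step_model))
print(AIC(full_model, step_model))

# (d) Standard diagnostic plots for the selected model
op <- par(mfrow = c(2, 2))
plot(step_model)
par(op)
```

With the real data, replace `toy` with the loaded data frame and build the full model from the variables you argued for in part (a).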

R data set

cancer_reg.RData

Reminder: this is an RData file, so after saving the file to your working directory, to load the file:

Rcmdr: Data → Load data set…
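
If you prefer to work from the R console or an RMarkdown file, a minimal sketch of the equivalent step is shown here. It assumes the file has been saved to the current working directory under the name given above. The name of the object stored inside the file is not stated in this homework, so the sketch reads it from the value returned by `load()`; the variable name `cancer` is just a placeholder.

```r
# Assumes cancer_reg.RData has been saved to the current working directory.
rdata_file <- "cancer_reg.RData"

if (file.exists(rdata_file)) {
  # load() restores the saved objects and returns their names
  loaded_names <- load(rdata_file)
  print(loaded_names)

  # Take the first restored object (assumed here to be the data frame)
  cancer <- get(loaded_names[1])
  str(cancer)          # variable names and storage types
  print(head(cancer))  # first six rows
} else {
  message("File not found in ", getwd(), "; check the working directory.")
}
```

The output of `str()` is a convenient starting point for comparing each variable against the Data dictionary.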

Data dictionary

(variables listed in alphabetical order, not in order presented in worksheet)

Column name	Description
Asian	Frequency of county residents who identify as Asian
AvgHouseholdSize	Mean household size, county
BachDeg25_Over	Frequency of county residents ages 25 and over highest education attained: bachelor’s degree
BirthRate	Number of live births relative to number of women in county
Black	Frequency of county residents who identify as Black
County	County name
Division	Census area within Census Region
HS25_Over	Frequency of county residents ages 25 and over highest education attained: high school diploma
ID_unit	DrD identification
Married	Frequency of county residents who are married
MedianAge	Median age of county residents
medIncome	Median income per county
Mortality	Dependent variable. Mean per capita (100,000) cancer mortalities
OtherRace	Frequency of county residents who identify in a category which is not White, Black, or Asian
popEst2015	Population of county 2015
poverty	Frequency of populace in poverty
PrivateCoverage	Frequency of county residents with private health coverage
PublicCoverageAlone	Frequency of county residents with government-provided health coverage alone
Region	Census area
State	State in United States of America
studyPerCap	Per capita number of cancer-related clinical trials per county
Unemployed16_Over	Frequency of county residents ages 16 and over unemployed
White	Frequency of county residents who identify as White

R or Rcmdr commands

myData <- read.table(header=TRUE, sep="\t", text = "
insert your data table here
")

head(myData)

heatmap()
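
For the optional heat map question, the two-column text file (State and Mortality) can be prepared in R. The sketch below is one possible way to do it, using a small made-up data frame in place of the real data. Two assumptions are worth flagging: the file is written tab-delimited, which may or may not be the layout the web site expects, and each state's value is the unweighted mean of its county rates, which is not the same as the state's actual mortality rate.

```r
# Made-up county rows; only the column names match the data dictionary.
set.seed(58)
toy <- data.frame(
  State = rep(c("Hawaii", "Kentucky", "Utah"), each = 4),
  Mortality = round(rnorm(12, mean = 180, sd = 20), 1)
)

# One row per state: unweighted mean of the county mortality rates
state_means <- aggregate(Mortality ~ State, data = toy, FUN = mean)
print(state_means)

# Write a two-column, tab-delimited text file (here to a temporary folder)
out_file <- file.path(tempdir(), "state_mortality.txt")
write.table(state_means, file = out_file, sep = "\t",
            quote = FALSE, row.names = FALSE)
```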

Test normality.

Rcmdr → Statistics → Summaries → Test for normality
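
From the console, the Shapiro-Wilk test in base R is one way to run a comparable check. In regression the normality assumption concerns the residuals, so the sketch below tests the residuals of a fitted model rather than the raw variable. The data are simulated for illustration only, and the model shown is not a suggested answer.

```r
# Simulated data for illustration; not the homework data set.
set.seed(117)
toy <- data.frame(MedianAge = rnorm(100, mean = 40, sd = 5))
toy$Mortality <- 120 + 1.5 * toy$MedianAge + rnorm(100, sd = 15)

fit <- lm(Mortality ~ MedianAge, data = toy)
res <- residuals(fit)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(res)
print(sw)

# A normal Q-Q plot gives a visual check alongside the test
qqnorm(res)
qqline(res)
```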

Other R/Rcmdr commands provided in text
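
For the question of whether mortality varies by geographic location, one possible approach is a box plot followed by a one-way ANOVA, with a rank-based alternative if the ANOVA assumptions look doubtful. The sketch below uses simulated data, and the four region labels are an assumption; check the actual levels of Region in the data set. Whether ANOVA or another test is appropriate is for you to justify.

```r
# Simulated data; region labels and effect sizes are invented.
set.seed(2)
regions <- c("Midwest", "Northeast", "South", "West")
toy <- data.frame(Region = factor(rep(regions, each = 25)))
region_shift <- c(Midwest = 0, Northeast = -5, South = 15, West = -15)
toy$Mortality <- 180 + unname(region_shift[as.character(toy$Region)]) +
  rnorm(100, sd = 20)

# Graphic evidence: distribution of mortality within each region
boxplot(Mortality ~ Region, data = toy,
        ylab = "Cancer mortality per 100,000")

# One-way ANOVA, which is a linear model with one categorical predictor
region_aov <- aov(Mortality ~ Region, data = toy)
print(summary(region_aov))

# Pairwise comparisons among regions after the overall test
print(TukeyHSD(region_aov))

# Rank-based alternative that does not assume normal residuals
region_kw <- kruskal.test(Mortality ~ Region, data = toy)
print(region_kw)
```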

/MD