Homework 11: Multiple linear regression

Objectives

  1. Apply general linear model approach to develop predictive model for cancer mortality by county.
  2. This is a challenging homework, with lots of parts. The goal for you is to organize these different parts into a consistent narrative about predicting cancer mortality rates and the narrative needs to be supported by your statistics (i.e., evidence).
  3. For your conclusions, discuss whether a simple model or a complicated model is best for predicting mortality rates. In other words, is a simple model with just one with only a few predictor variables good enough, or must we know

Homework 11 expectations

Read through the entire homework before starting to answer a question. You are expected to have read the chapter and to have completed preceding homework. Answers are provided to odd numbered problems — turn in your work for even numbered problems.

How to work this homework

You may work together, but each of your must turn in your own report. Don’t “plagiarize” from each other. Do include in your report who you worked with.

What to turn in: A pdf file containing relevant R code, statistical results — edited to support your answers to the questions, and your answer to the questions (even numbered only). Use of RMarkdown recommended — because it is a simple way to include graphs generated; however copy/paste into a word document is also acceptable.

Notes. By relevant we mean provide just the R code and results from R functions necessary to support your answers to the questions. For example, do not include

  1. the entire data set when head(dataset) will do
  2. screenshots of R output!! R output is text — copy/paste
  3. all statistical output from an R function.

See Part09: Making a report for an example homework file.

Submit your work to CANVAS. Obey proper file naming formats.

Resources for this homework

Chapter 17. Mike’s Biostatistics Book

Chapter 18. Mike’s Biostatistics Book

Mike’s Workbook for Biostatistics: A quick look at R and R Commander, Part01 – Part10 and previous homework pages presented in this workbook.

Additional R commands and or code provided below.


Answers to selected problems


Background

Unlike correlation analysis, regression statistics are used to specify causal models. Causal models are used to make predictions. While lifetime risk of cancer is about 40% for males and 39% for females (cancer.org), rates differ by geographic location. Risk of cancer is not the same across the United States of America. Sometimes we think of the United States as one country, but the USA is made of different states, which differ for many known cancer risk factors. Rates also differ by county within a state. States in the West are different in many ways than states in the North East.  To begin this project, spend some time at https://gis.cdc.gov/cancer/USCS/DataViz.html and investigate census and descriptive statistics involving cancer in the United States.

The purpose of this biostatistics homework is to develop a predictive model about mortality from cancer, given facts about the country. The data set is a smaller and modified sample from a “Multiple Linear Regression Challenge,” a Data Science project. You should definitely review Mike’s Biostatistics Book Chapter 18 and consider material in Mike’s Biostatistics Book Chapter 14 along with conducting the statistics requested in this homework.

Questions

  1. Load the data file into R and Rcmdr. Review each variable in the data set against the Data dictionary (below). Identify each variable to
    a) data type (nominal, ratio, etc.)
    b) independent or dependent
    c) predictor variable or response variable
    d) identify Factors that are crossed, nested, fixed effects, or random effects
  2. The central questions in the data set are:
    a) Does mortality vary by geographic location in the USA?
    b) If mortality does vary by location, why?
    Provide graphic and statistical evidence to test the first question (2a). Include as part of your answer a justification why you chose a particular statistical test.
    OPTIONAL — use facilities at http://www.heatmapper.ca/geomap/ to produce a heat map of Cancer Mortality rates in the United States. All you need is a text file with two columns: State and Mortality. Note that a heat map can be made in R, but the online site makes this pretty easy to do.
  3. Develop a predictive model using the general linear model function lm()
    Rcmdr: Statistics → Fit models → Linear models…
    Before tackling and reporting your final predictive model, consider
    a) Begin by reviewing risk factors for cancer. For example, age is a known risk factor of developing and dying of cancer. Build your predictive model by first reviewing and evaluating each variable — develop a case for or against including the variable in your model.
    b) Remember that there are multiple predictor variables. Conduct a correlation analysis on the predictor variables (not the dependent variable) and address potential for multicollinearity
    c) Identify the full model and use stepwise model selection (Rcmdr: Models → Stepwise model selection, select backward/forward, use AIC not BIC criterion for inclusion).
    d) Consider model assumptions and provide evidence that you have considered the assumptions.
    e) Conclude with justification of why your model is best (i.e., see Objectives 2 and 3, above). Include justification for why a simple model may be preferred against why a more complicated model may be required.

R data set

cancer_reg.RData

Reminder: this an RData file, so after saving the file to your working directory, to load the file:

Rcmdr: Data → Load data set…

Data dictionary

(variables listed in alphabetical order, not in order presented in worksheet)

Column name Description
Asian Frequency of county residents who identify as Asian
AvgHouseholdSize Mean household size, county
BachDeg25_Over Frequency of county residents ages 25 and over highest education attained: bachelor’s degree
BirthRate Number of live births relative to number of women in county
Black Frequency of county residents who identify as Black
County County name
Division Census area within Census Region
HS25_Over Frequency of county residents ages 25 and over highest education attained: high school diploma
ID_unit DrD identification
Married Frequency of county residents who are married
MedianAge Median age of county residents
medIncome Median income per county
Mortality Dependent variable. Mean per capita (100,000) cancer mortalities
OtherRace Frequency of county residents who identify in a category which is not White, Black, or Asian
popEst2015 Population of county 2015
poverty frequency of populace in poverty
PrivateCoverage Frequency of county residents with private health coverage
PublicCoverageAlone Frequency of county residents with government-provided health coverage alone
Region Census area
State State in United States of America
studyPerCap Per capita number of cancer-related clinical trials per county
Unemployed16_Over Frequency of county residents ages 16 and over unemployed
White Frequency of county residents who identify as White

R or Rcmdr commands

myData <- read.table(header=TRUE, sep="t", text = "
insert your data table here
")

head(myData)
heatmap()

Test normality.

Rcmdr → Statistics → Summaries → Test for normality

Other R/Rcmdr commands provided in text

/MD

Homework 8: Multiway ANOVA

Objective:

  1. Learn how to use Rcmdr to analyze 2-way data
  2. To learn how to recognize and identify the hypotheses of a 2-way ANOVA design from the stacked worksheet.
  3. To learn about transforming data to achieve assumptions
  4. To learn about “marginality” and how choice of when factor enters model if has an effect on conclusions.

Homework 8 expectations

Read through the entire homework before starting to answer a question. You are expected to have read the chapter and to have completed preceding homework. Answers are provided to odd numbered problems — turn in your work for even numbered problems.

How to work this homework

You may work together, but each of your must turn in your own report. Don’t “plagiarize” from each other. Do include in your report who you worked with.

What to turn in: A pdf file containing relevant R code, statistical results — edited to support your answers to the questions, and your answer to the questions (even numbered only). Use of RMarkdown recommended — because it is a simple way to include graphs generated; however copy/paste into a word document is also acceptable.

Notes. By relevant we mean provide just the R code and results from R functions necessary to support your answers to the questions. For example, do not include

  1. the entire data set when head(dataset) will do
  2. screenshots of R output!! R output is text — copy/paste
  3. all statistical output from an R function.

See Part09: Making a report for an example homework file.

Submit your work to CANVAS. Obey proper file naming formats.

Resources for this homework

Chapter 12. Mike’s Biostatistics Book

Chapter 14. Mike’s Biostatistics Book

Mike’s Workbook for Biostatistics: A quick look at R and R Commander, Part01 – Part10 and previous homework pages presented in this workbook.

Additional R commands and or code provided below.


Answers to selected problems


Questions

You’ll need to load the ex2way data set into R/Rcmdr. Data set published at end of this page

  1. How many variables? Which variables are the factors? How many levels in each variable/factor?
  2. Write out all null hypotheses that can be tested in this 2-way ANOVA problem.
  3. Write out the equation for the model.
  4. Is this a crossed or nested design?
  5. This is an ANOVA problem; What assumptions are being made to conduct the ANOVA?
  6. What is the name of the nonparametric alternative test to a 2-way ANOVA?
  7. Generate a Plot of Means that conveys the responses given the experimental design.
  8. Test normality.
  9. Test equal variances.
  10. If assumptions are violated, create two new variables, modified from the response variable
    (a) rank the response variable
    Rcmdr: Manage variables… –> Compute new variable
    name = rank(Response)
    (b) log10 transform the variable
    Rcmdr: Manage variables… –> Compute new variable
    name = log10(Response)
    **Recheck #8 and/or #9 (which ever was violated) on the log10-transformed response variable and check to see if assumptions now valid
    (c) create a table and compare p-values for each of the hypotheses tested.
    (d) What can be concluded about the proper analysis — which results should be reported?
  11. Conduct a proper test of the 2-way model on appropriate response variable (i.e., assumptions met).
  12. a) What effects if any does the selection of type of test (sequential, Type II, or Type III) have on your conclusions? — Note that this implies you carry out the separate tests and provide a comparison (think table like question 10c).
    b) Balanced design?
    c) Unbalanced design?
    Hint: In other words, go back to the data and deliberately remove some data to make the groups unbalanced. Repeat your analyses, but change sequential, Type II, or Type III options and report changes to P-values, if any — report and compare the results.

R or Rcmdr commands

myData <- read.table(header=TRUE, sep="t", text = "
insert your data table here
")

head(myData)

Test normality.

Rcmdr → Statistics → Summaries → Test for normality

Multiway ANOVA: 2 options

Option 1. Rcmdr: Statistics → Means → Multi-way ANOVA

Option 2. Rcmdr → Statistics → Fit models → Linear model

then, to get the summary table, Rcmdr: Statistics → Models → Hypothesis tests → ANOVA table…

 

Data

ex2way data set

Factor1 Factor2 Response
G A 8.796347
G A 11.456232
G A 10.067908
G B 9.684717
G B 12.440822
G B 11.241605
G C 8.262376
G C 8.145118
G C 7.882312
H A 14.46131
H A 14.46038
H A 13.18818
H B 10.062405
H B 9.780971
H B 9.845786
H C 6.734254
H C 6.198078
H C 5.954699
I A 23.59453
I A 21.92648
I A 21.89414
I B 20.75972
I B 20.12655
I B 20.65819
I C 17.04579
I C 17.63584
I C 16.9222

/MD

Part 07. Working with your own data

What’s on this page?

  1. Enter data with combine, c()
    • c() vs list()
  2. Enter data with scan()
  3. Enter data directly into R with R editor
  4. Enter data directly within R Commander with R editor
  5. Reshape a data frame: From unstacked to stacked worksheet
  6. Add new column (variable) to existing data frame
  7. Load someone else’s data into R
  8. Skip R Commander menus, enter directly into R
  9. Load Your Own Data into R
  10. Import multiple data files into R
  11. More practice: Getting your own data into R
  12. Quiz

What to do

Complete the exercises on this page. The goal is to find out which method for data entry you prefer. Hint: importing data from a spreadsheet is a good choice!

  • Create variables with combine
  • For small data sets, scan() may be good option
  • Enter data directly with R’s Data Editor.
  • Enter data directly within R Commander (it’s still R’s Data Editor)
    • From unstacked to stacked worksheet in R Commander
  • Load someone else’s data into R
  • Use read.table() to enter copied data directly into R
  • From Rcmdr, import data from text (.txt, .csv)
  • From Rcmdr, import data from spreadsheet files (Microsoft Excel, Google sheets, LibreOffice Calc)
  • Save the data file in R format
  • Explore simple graphics and statistics with entered data

How to do it

For these exercises, you may work

  • within the Rcmdr script window
  • at the command line and R prompt
  • from a script document and the R GUI app
  • RStudio

Choose one way and stick with it. The quickest way is to just work with R Commander.

Let’s begin.

Before you start

A reminder, always set your working directory in R first (see Part 02. Getting started with R and Rcmdr), before starting to do your analyses. Makes life a lot easier.

As you may imagine, data may come from a variety of sources and R-project has detailed instructions to cover many situations (R Data Import/Export manual).  The following discussion covers situations common to BI311 and other beginning statistics courses.

1. Enter data with combine

The function c() is useful for entering small sets of numbers or other elements, as long as they are the same type (atomic vectors).

Geckos <- c(3.186, 2.427,4.031,1.995)
Anoles <- c(5.515,5.659,6.739,3.184)

Then, create your data frame.

lizard <- data.frame(Geckos,Anoles); lizard

Output from R

 Geckos Anoles
1 3.186 5.515
2 2.427 5.659
3 4.031 6.739
4 1.995 3.184

Note this format is termed an unstacked or wide format. More often we need a stacked worksheet, like so:

     Spp  Mass
1 Geckos 3.186
2 Geckos 2.427
3 Geckos 4.031
4 Geckos 1.995
5 Anoles 5.515
6 Anoles 5.659
7 Anoles 6.739
8 Anoles 3.184

This is a simple data set so easy enough to create using a read.table() call (see below).

c() vs list()

Alternatively, we might want to create it as data.frame as follows:

lizards <- data.frame("Geckos", "Anoles")
lizards$Geckos <- c(3.186, 2.427, 4.031, 1.995)
lizards$Anoles <- c(5.515, 5.659, 6.739, 3.184)

However, when running the code you’ll find that the lizards data.frame remains empty and R reports with [8] ERROR: replacement has 4 rows, data has 1.

Instead, create a list, add the variables to the list and then convert the list to a data.frame (h/t conversation at stackoverflow.com).

myList <- list(Spp = NA, Mass = NA)
l1 <- "Geckos"; l2 <- "Anoles"
Spp <- c(rep(l1,4), rep(l2,4))
Mass <- c(3.186, 2.427, 4.031, 1.995, 5.515, 5.659, 6.739, 3.184)
myList$Spp <- Spp
myList$Mass <- Mass
lizards <- as.data.frame(myList)

Both c() and list() look very similar, the difference is that list() is more general — you can create a vector of mixed types. c()

Note the trick with rep(). Instead of repeatedly typing “Gecko” and “Anoles” four times each, we used rep() which is short for repeat.

Variants

You may need to create a variable with repeated elements.  For example, I need a variable (a factor variable, actually), that marks time points in minutes, from zero to ten, and I need it to repeat for n = 3 trials. Here are two possible methods.

myT <- c(0:10)
newT <- rep(myT, each=3)

Print newT and R returns

 [1] 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8
[26] 8 8 9 9 9 10 10 10

The second method repeats the sequence in order

repT <- rep(myT, times=3)

and R returns

 [1] 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2
[26] 3 4 5 6 7 8 9 10

A factor variable is a categorical variable, which can be either numeric or character string. Many R functions work best if categorical variables are declared as factor variables, factor() or as.factor().

I may want to print the elements of a vector one per line.

cat(repT,sep="\n")

where cat is short for concatenate will do just that. Concatenate is an operation in computer languages to put strings of characters together; "\n" instructs to start characters on new lines.

2. Enter data with scan()

Again, for small data sets, a good option may be scan() — simply type in your numbers to objects.

gecko <- scan()

After typing the code, then pressing the enter key, you will be prompted to enter your numbers one at a time. An in-progress use of scan() is displayed below.

1: 3.186
2: 

Enter the four numbers, then follow with another return to end the entry. Repeat for the next variable, anoles. Create a data.frame as above to complete the data entry.

3. Enter data directly into R

Base R has a simple editor, Data Editor, that you can call to enter data into a spreadsheet-like form. First, create the data.frame object. “Everything in R is an object.”

myData <- data.frame()

Second, call the editor

myData <- edit(myData)

You should see the Data Editor form (Fig. 1)

Screenshot R Data Editor

Figure 1. Screenshot of R’s simple Data Editor.

Begin data entry. For an example, Table 1.


Table 1. Color and weights (g) of 16 M&M’s®

ColorMass
Orange0.80
Orange0.81
Orange0.80
Orange0.85
Orange0.84
Orange0.77
Orange0.83
Orange0.87
Blue0.83
Blue0.87
Blue0.83
Blue0.91
Green0.86
Brown0.87
Red0.79
Red1.11

click on var1 in the header row, enter the variable name Color, select character, then return (Fig. 2)

Screenshot variable entry box

Figure 2. Screenshot variable editor form

Repeat entry of variable names. Select numeric for Mass. To enter data, click in the first cell and enter the value. Use keyboard arrow keys, or mouse pointer to move between cells. Figure 3 shows a few entries in the Data Editor.

Screenshot data entry

Figure 3. Screenshot data entry in R Data Editor.

To view the data, type myData (or whatever you named the object) at the R prompt. R returns

> myData
   Color Mass
1 Orange 0.80
2 Orange 0.81
>

4. Enter data directly within R Commander

If you haven’t already, start R Commander

library(Rcmdr)

To open a new worksheet in Rcmdr — it’s still R’s Data Editor — follow these steps

Data → New data set…

then enter a name for the data set, e.g., lizards, in the pop-up dialog box and click OK to continue.

Change the default variable names to “gecko” in Var1 and “anoles” in Var2. Hint: To change the variable name, click on the column and a new popup menu will appear. Change the variable name from Var1 to “gecko”

Enter the data and labels stacked into columns, with row one enter”3.186″ in “gecko” (Var1) and “5.515” in “anoles”  (Var2), without the quotes, of course. After you are done, you have an unstacked worksheet, which should look like (Fig. 1).

Screenshot of R worksheet editor view.

Figure 1. Screenshot of R worksheet editor view.

To finish the worksheet select

File → Close

5. Reshape a data frame: From unstacked to stacked worksheet

The format of lizards, the data set we just created above, is in wide format aka an unstacked worksheet (row to column orientation). In most cases I am aware of, R expects your data in long form, also known as a stacked worksheet (column to row orientation) because most statistical functions in R expect that format. It is easy to go from unstacked to stacked in Rcmdr. This process is referred to as reshaping the data frame.

Data → Active data set → Stack variables in active data set…

Screenshot of filled out Stack variables menu

Figure 2. Screenshot of filled out Stack variables menu

You can confirm that your data entry is saved correctly by clicking on the View data set button (Fig. 3).

Screenshot of Rcmdr, red arrow points to Data set button.

Figure 3. Screenshot of Rcmdr, red arrow points to Data set button.

Click on the button to bring up popup menu shown in Figure 4.

Select the new data set. 

Figure 4. Select the new data set. 

Question 1: True or False. We can use the R editor to create a stacked data set instead of an unstacked worksheet.

Answer. True, of course!

In R, we can reshape the unstacked data frame with stack().

# Stack and save to new data frame
df <- stack(lizard)

We need to rename the variables after stacking. Check

names(df)

R returns

[1] "values" "ind"

The following code replaces the variable names

names(df)[names(df) == "values"] <- "mass"
names(df)[names(df) == "ind"] <- "lizard"

Check

names(df)

now returns

[1] "mass" "lizard"

If more than one variable name needs changing, an alternative to one-at-a-time approach described above is setNames().

df <- setNames(df, c("mass","lizard")

will do the trick.

6. Add new column (variable) to existing data frame

If we need to add a variable to an existing data frame, the new variable must have the same length. For example, add IDs 1 – 4 for geckos and 1 – 4 for anoles lizards in the data frame df. Use the rep() command we introduced before.

myID <- c(1:4)
myIDs <- rep(myID, times=2)

Use the $ operator

df$ID <- myIDs

Check results, print df, R returns

   mass lizard ID
1 3.186  gecko  1
2 2.427  gecko  2
3 4.031  gecko  3
4 1.995  gecko  4
5 5.515 anoles  1
6 5.659 anoles  2
7 6.739 anoles  3
8 3.184 anoles  4

which is what we expect.

The first column is called the index column, 1 – 8 without a header, is part of the data frame structure set by R. You can hide the index row by print(df, row.names=FALSE)

Although unnecessary, I would reorder the variables. Simple way is to note that column order in df is mass is column 1, lizard is column 2, and ID is column 3. Therefore,

df2 <- df[, c(3, 2, 1)]

and check the results

 ID lizard  mass
1 1  gecko 3.186
2 2  gecko 2.427
3 3  gecko 4.031
4 4  gecko 1.995
5 1 anoles 5.515
6 2 anoles 5.659
7 3 anoles 6.739
8 4 anoles 3.184

recall that the comma in [, c(3,2,1)] tells R to collect all rows from the data frame.

7. Load someone else’s data into R

A frequent operation in data analysis is the need to access and load data from a remote source. This is called “web scraping,” among many other terms to describe this behavior (see web scraping from a static webpage (rvest package). We’ll use a simple technique called copy and paste. We’ll go over a couple of options, one is to create a text file and then use R to manipulate the data into the format we need. Another option is to use a spreadsheet program instead — since most of you will be more comfortable with the spreadsheet software, this may be the preferred method for you. You should try both! The data we want to get into R are displayed in Table 2.


Table 2. Miles per hour for a 19.7 mile morning commute by Chevy S-10 on Oahu over eight consecutive weeks (2011).

MondayTuesdayWednesdayThursdayFriday
38.32340.96639.645.69239.6
34.94131.2633642.42936
3632.10828.28637.125
32.10828.28626.428.97640.966
47.52363332.10831.263
42.42939.63638.32339.6
38.32338.32330.462
23.29431.26331.26331.26324.75

8. Skip R Commander menus, enter data directly into R

For small data sets, you can paste the data set between text="", use

myTable <- read.table(header=TRUE, sep="\t", fill=TRUE, text="")

at the R prompt, where the columns are separated by tabs use sep="\t"(if separated by a single space, use sep=""), fill=TRUE is used to fill the empty space, missing values, from our table, and you would paste the scraped table or copied table from your spreadsheet (see Load Your Own Data into R), between the double quotes following text="". The result will be an object stored in R as a data frame, ready for your work.

Note. Column separated values on webpages picked up by copy/paste can be inconsistent. The best advice is to view the page contents and confirm, but a bit of trial and error will also solve it. First, assume tabs and try sep="\". If this fails, try a single space, sep="".

For an example, I copied all data rows and the header row from Table 1, then pasted between the “” at the R prompt, although it would be easier if you pasted the text into R script, then run the command from your script (macOS: Cmd + Enter; Win11: Ctrl + R).

Here’s results of copy

myTable1 <- read.table(header=TRUE, sep=" ", fill=TRUE, text="
Monday Tuesday Wednesday Thursday Friday
38.323 40.966 39.600 45.692 39.600
34.941 31.263 36.000 42.429 36.000
36 32.108 28.286 37.125 40.966
32.108 28.286 26.4 28.976 
47.52 36 33 32.108 31.263
42.429 39.6 36 38.323 39.6
38.323 38.323 30.462 
23.294 31.263 31.263 31.263 24.75
")

Note with respect to missing values. R seems to handle missing values best if you specify the missing value as NA instead of blank in your data set. Add na.strings = "NA" in place of fill=TRUE.

Verify the import, used head() function, which grabs the first six rows (or, tail() function, which grabs last six rows).

> head(myTable1, 3)
Monday Tuesday Wednesday Thursday Friday
1 38.323 40.966 39.600 45.692 39.600
2 34.941 31.263 36.000 42.429 36.000
3 36.000 32.108 28.286 37.125 40.966

Note I entered “3” in head(), which restricts the command to return the first three rows only.

Working with text files is more robust approach than copy/paste into your R script.

Highlight and then copy the table into your clipboard.

  • You can Open your text editor (e.g., TextEdit on a macOS; Notepad on Windows) and paste the contents into the blank text page.
    For example, I saved the table to a file called mytable.txt in my working folder.
  • Or, and this is the quickest (but may fail!) — once you’ve copied the table, it is now in the Clipboard, so R Commander has an option to retrieve from your clipboard, proceed directly to R Commander/Clipboard option described below.

macOS users — if copy/paste fails with R, see Can’t copy/paste into R script, macOS to solve the problem.

Replace text="" with file="" in the read.table() function. The code now is

myTable <- read.table(header=TRUE, sep="\t", fill=TRUE, file="myTable.txt")

For larger data sets, data may be imported as a text file, commonly one where variables (columns) are separated by commas. you can paste the data set between text="", use

myTable <- read.table(file.choose(), header=TRUE, sep=",")

or, if you know the name of the file, replace file.choose() with the name of the file in quotes, e.g., "myData.csv".

Do on your own. Copy and paste the data below into your text editor, save the file as .txt or .csv, then load the data in R and save to data frame object of your choosing.

Monday,Tuesday,Wednesday,Thursday,Friday
38.323,40.966,39.600,45.692,39.600
34.941,31.263,36.000,42.429,36.000
36,32.108,28.286,37.125,40.966
32.108,28.286,26.4,28.976,NA
47.52,36,33,32.108,31.263
42.429,39.6,36,38.323,39.6
38.323,38.323,30.462,NA,NA
23.294,31.263,31.263,31.263,24.75

Note “NA” added to denote missing value.

Loading data files into R from a spreadsheet app like Microsoft Excel is a common task to accomplish. A number of packages help with the process, including readxl

library(readxl)
#check number of worksheets in the spreadsheet file
excel_sheets(file.choose())
$import data from specific worksheet
myData <- read_excel(file.choose(),sheet=2)

Note: instead of sheet number, can point to the sheet by name, e.g., sheet = "Sheet2".

Another package, rio, “simplif[ies] the process of importing data into R” (vignette). Calling the command import(), rio will attempt to identify the file type and import the data with minimum input from the user. For example, you already know the sheet number to import, then

library(rio) 
myData <- import(file.choose(),sheet=2)

will do the trick.

Do on your own. Copy and paste the same data (or load from the saved text file) into your spreadsheet, save the file as .xls or .xlsx, then load the data in R and save to data frame object of your choosing.

Web scraping from a static webpage (rvest package)

In Part 03. Explore use of R, we copied data from a table and imported it into an R data frame via read.table(). Here, we’ll introduce R’s capabilities to grab data directly from a web page. Mike’s Workbook for Biostatistics and Mike’s Biostatistics Book are presented as static web pages (Wikipedia); it’s relatively straightforward to grab data from my sites. The following code will scrape and return the table data into a data frame object, ready for use.

Download and install rvest package, which is part of the larger tidyverse package for data science produces by Hadley Wickham and friends. We’ll just install rvest, but by installing tidyverse, you get rvest and many other useful tools that make R work even better.  Because there’s only the one table on the page, only two commands are needed: we read the web page and save it to an object, then call for the table with the html_table() function.

library(rvest)
webPage <- read_html("https://mikeworkbook.letgen.org/r-work/a-quick-look-at-r-and-r-commander/part-03-explore-use-of-r/")
myWeb <- as.data.frame(html_table(webPage))

If you copy entire table from the web page, and you include column headers in first row, then html_table(webPage, header=TRUE). Alternatively, skip the header row, copy only the data, and leave as the default html_table(webPage). The columns will be assigned names like V1, V2, etc. You can then rename the header row, as we described above in Reshape a data frame: From unstacked to stacked worksheet.

#Check the import
head(myWeb)

and R returns

 Min Trial01 Trial02 Trial03 Trial04
1 0 82.8 88.7 83.2 86.2
2 1 76.7 78.5 77.6 NA
3 2 69.4 74.3 72.5 74.3
4 3 67.2 71.4 70.7 68.9
5 4 63.3 67.6 68.5 61.5
6 5 60.6 65.5 65.8 57.7

That was easy!

rvest returns tibbles because the tidyverse world uses tibbles. Tibbles are more flexible than a data frame, but to be consistent with the rest of our work, we stick to data frames. If you want to know more about “tibble,” feel free to do a query, “why is it called tibble?”

Note. By sharing web scraping code, and pointing y’all to my web site, I give my permission for readers to web scrape data from Mike’s Biostatistics Book and Mike’s Workbook for Biostatistics. However, don’t abuse the gift. First, the data are copy right protected, albeit with a Creative Commons license. Second, don’t hammer my — or especially other’s — websites with excessive web scraping activity, which can harm the performance of a web site or even violate the rules of the hosting service for the website. Third, be aware and apply web scraping etiquette, e.g., see 2020 post by Alexandra Datsenko at Webbiquity. For other sites, be advised that you would be wise to review their conditions of use or terms of service statements before proceeding to scrape their pages. LinkedIN, for example, prohibits “… scrap[ing] the Services or otherwise copy profiles and other data from [LinkedIN].”

R Commander

Text file option

  1. Return to R Commander and select Data → Import data → from text file, clipboard, or URL…
    • Enter a name for the data set, e.g., myData
    • Location of data file:
      • If you are loading from a text file, then check Local file system (Figure 5)
      • If you are loading directly from the clipboard, then check Clipboard (Figure 6)
    • Field separator: check “Tabs”
    • Leave the rest as default values
    • Click OK

Rcmdr read text file

Figure 5. Rcmdr load data from a text file from Local file system

Rcmdr read data from clipboard

Figure 6. Rcmdr load data from Clipboard.

  1. Now, the dataset should be loaded in R. Check the Message window in R to see if there were errors.
  2. We need to convert the dataset into a long, stacked format. This is easy to do in Rcmdr.
    • Data → Active data set → Stack variables in active data set…
    • Select the columns (variables), enter a new name for the stacked data set, enter name for the data variable (mph), and a name for the factor (DaysOfWeek), see Figure 7.

Rcmdr stack worksheet

Figure 7. Rcmdr stack data from an unstacked (wide) data set.

Spreadsheet option

  1. Now, please open your spreadsheet app (e.g., Google Sheets, MS Excel), then return to this page and highlight/copy the data table to your clipboard.
  2. Return to Excel and paste the data into a worksheet.
  3. Convert the stacked worksheet into a stacked worksheet (see Hint below).
    • In your Excel file, stack the data into two columns (long form).
    • Label the first column “DaysOfWeek” and the second column “mph” (without the quotes).
  4. Save the workbook and remember where on your computer you saved the file.
  5. Next, upload the data file to R from within R Commander
    • Data → Import data → From Excel file
      • Select the correct worksheet from the popup menu.

Screenshot of an unstacked or wide format data set based on the data from Table 1 is displayed in Figure 8. See Figure 9 for a stacked (long format) version of the same data set as in Figure 8.

Screenshot of an unstacked worksheet

Figure 8. Screenshot of an unstacked or wide format worksheet

Screenshot stacked worksheet

Figure 9. Stacked or long format worksheet

Save the data file in R format

Finally, save the data you imported as an R file (.RData)

  1. Data → Active data set → Save active data set…
  2. Save the file with an appropriate file name to your working folder (Figure 10)

Rcmdr save data

Figure 10. Rcmdr Save As popup dialog.

Question 2. Try each of the three ways to get Table 1 data into R: Clipboard, text file, spreadsheet file.

Question 3. Once the data set is loaded in R, find the missing value(s) for the dataset. Report the row number(s) for the missing value(s).

Hint: Use is.na() function. Just enter the name of the variable to check for missing value(s). This will return a vector with FALSE for each non missing case, and TRUE for any missing value. Quick-R website provides this set of commands which you should also try

dataset[!complete.cases(dataset),]

where “dataset” is replaced with the name of your data set loaded in R. Try both!

Question 4. What kind of graph would be suitable to display the relationship (if any) between DaysOfWeek and mph? Explore the available default graphs in Rcmdr and select one.
A) bar chart
B) histogram
C) pie chart
D) scatter plot

Note: Question 3 and 4 are part of Part 08. Data exploration. You should open and work through Part 08, then return to complete or expand on your investigation of Question 3 and 4.

Data import into R (Rcmdr)

Work with one table at a time. R can handle multiple data sets in memory, but you work with one data set a at a time. You have several options — select the one that works best for you.

  1. In your browser, highlight and copy the table (skip the table label — just the column header and numbers), then, in R go to Rcmdr: Data → Import data → from text file, clipboard, look for any error messages, then check data by “View data” button.
  2. If the copy-paste from clipboard method fails for you, then copy and paste the data into an Excel spreadsheet. Then, import the data from a saved Excel spreadsheet in R by Rcmdr → Data → Import data → From excel file, look for error messages, then check data by “View data” button (only available if you are working with Rcmdr version 2.04 or higher — all of the NSM Macbooks should meet this requirement).
  3. After entering the data into Excel, export the data from your Excel spreadsheet as a text file. Then, import the data into R by Rcmdr → Import data from text file, look for error messages, then check data by “View data” button. If you do not know how to export to text file, please check our Moodle help pages.

Regardless of the approach you used, once you get the data into R, save the data in R format before proceeding. Rcmdr: Data → Active data set → Save active data set

I want you to either work with text files you create or Excel files you create, hence this exercise. This becomes important later in the course when you work with your own datasets. However, I will provide you with an Excel file with the two datasets (Knell, RalphPearson) just in case you run into trouble with data loading so that you’ll be able to move onto the descriptive statistics and graphics. Click here to get the RWork01.

9. Load Your Own Data into R

Think about your other science classes. Do you have data? It may be as simple as data for a calibration curve or a set of data for growth rates of a cell line. At a minimum, you will generate data during our Measurement Day (shells, darts). After we complete Measurement Day, bring these data sets into R by either the text file route or the spreadsheet route. To help, I have collated available distances measured for darts tossed by your classmates, imported into R, then saved as an R-ready data file for you. Simply click on the link and download the file to your working directory, then load the data file as you would any other R data file (Rcmdr: Data → Load data set …)

Question 5. Gather your datasets from Measurement Day and repeat Question 2, 3, and 4, adjusting the question to suit the variable(s) contained in your own data set.

10. Import multiple data files into R

Working with a single smallish data file is probably the exception. For example, we follow growth rates of yeast cells in 48-well arrays recorded every 15 minutes over 24 hours. For a single trial that’s 4608 readings for processing. A typical experiment might involve dozens of trials. The instrument saves the data in a particular format which requires processing before ready for analysis. This description screams for programming solution and R is excellent at this. Thus, once the script is written, it is logical to load multiple data files of the same type at the same time for processing. Because this is beyond the needs of an introduction class, I’ll simply point to a couple of resources for the interested student.

11. More practice: Getting your own data into R

I want you to explore ways to IMPORT data into R (Rcmdr) and how to save the data into a format that R likes (e.g., *.RData). Here are two small data sets, plus a link to a page containing Measurement Day results.


Table 3. Plasma testosterone levels in male lizards and distance from nearest neighbor. Data from Knell’s “Introductory R”

Lizard noNearest
Neighbor (m)
Plasma
testosterone (ng/ml)
1422.2
2726.1
3321.5
4523.8
5723.8
6320.5
7724.6
8422.6
9319.9
10523.9
11519.8
12421.3

 


Table 4. Breeding success and morphometrics among White-crowned sparrows near Pt. Reyes California (Ralph). Table 1 Ralph & Pearson 1971

PairMale ageFemale agePercent black feathers in crown MalePercent black feathers in crown FemaleTerritory sizeSuccessful
A113025980Yes
B1148201530Yes
C12601001620Yes
D119640710No
E2110081840Yes
F2110041260Yes
G12961001760No
H1110402600No
I221001004540Yes
J41100734480Yes
K2199202820Yes
L2110055850Yes
M118851890No
N1290982160No
O1199401500No
X3NA100NA600NA
Y1250982900NA
Z1NA80NA860NA

Question 6. Take a look closer at Table 1 from Ralph & Pearson. What does “NA” mean?

Proceed to carry out basic data analysis on these data sets (descriptive statistics, graphs)

Work with one table at a time

  1. Go to Rcmdr: Statistics — what is available (not dimmed) under “Summaries?”
  2. Proceed to determine descriptive statistics for Knell’s data set and, separately, for the Ralph & Pearson data set. Neighbor and Testosterone levels. Hint: you may wish to get statistics for the groups, not just the whole data set. Explore what you get with “Active summaries” versus what is available with “Numerical summaries.”
  3. Go to Rcmdr: Graphs — What kind of graph is best for this data set? First, consider plots for one variable at a time display
    – What’s an index plot? Compare to a dot plot
    – What’s a histogram? Compare to a density estimate
    Next, consider plots to compare groups for one variable (hint: look for a “Plot by groups” option).
    – What is a box plot? Compare to Plot of means
    Note: the bar chart in R is not the same bar chart that you are used to from Excel and other software. Be sure to ask me about this and see Chapter 4 in Mike’s Biostatistics eBook
  4. Important concept — while you are working on these datasets you need to be asking yourself, what are the experimental (independent) variables and which are the outcome (dependent) variables?

12. Page quiz


your own data

Ten questions from this page

Part 06. Work with an included dataset

What’s on this page?

  1. About data sets and data files
  2. Explore a dataset included with R
  3. To do
    • View the dataset
    • Explore the dataset
      • dim()
      • head()
      • tail()
      • names()
      • length()
      • typeof()
    • Move around a data frame
  4. Quiz

What to do

Complete the exercises on this page

  • Explore a dataset included with R

How to do it

For these exercises, you may work

  • within the Rcmdr script window
  • at the command line and R prompt
  • from a script document and the R GUI app
  • RStudio

Choose one way. The quickest way is to just work with R Commander.

Let’s begin.

Before you start

A reminder, always set your working directory in R first before beginning a project (see Part 02. Getting started with R and Rcmdr), before starting to do your analyses. Makes life a lot easier.

1. About data sets and data files

Statistics is all about working with data sets. Data sets may come from many different sources:

  • entered directly during an R session and saved to a data frame object.
  • extracted from a table from the CDC, or Wikipedia, or other online sources. Like many programming languages, R can do web scraping, which permits collecting data from websites and putting the data into a form suitable for R.
  • text file (generic file extension .txt), often with columns separated by columns (e.g., .csv files) or tab-delimited (e.g., .tsv files).
  • Spreadsheet file (generic file extension .ods; Microsoft Excel extension .xls or .xlsx).
  • R data files, file extension .Rdata (or .rda).

Example data sets are commonly included for R packages. This page is about working with data sets included as part of the installation of R and R Commander.

2. Explore a dataset included with R

To begin, explore the available data sets. At the R prompt type the command

data()

Alternatively, type and submit the same command in the R Commander script window. A popup window lists available data sets by package, file name, and title of the dataset. For example, DNase, which is in the datasets package (Figure 1).

R packages typically include documentation and the package datasets is no exception. To view included information about the the data sets included in datasets in your browser type the command

help(package="datasets")

Note: help files are by default just text files. Recent versions of R includes by default displaying html help pages. The html pages are displayed by a server included during the installation. This server is also used to display RMarkdown files.

Select and load a data file from attached package into R. From within R Commander, select Data → Data in packages → Read data set from an attached package.

You can try any of the data sets, but a nice one to work with is DNase, which is in the Datasets package (Figure 1).

Figure 1. Screenshot of R Commander select data from attached packages.

Alternatively, at the R prompt type

data(DNase, package="datasets")

To simplify your work, go ahead and attach the dataset. By attaching the data, you won’t have to type the dataset and variable name each time (e.g., DNase$conc), only the variable name (conc).

attach(DNase)

R Commander keeps track of active datasets available for use. Note in the menu area the box marked Data set. Before a dataset is loaded, R Commander shows “No active dataset” (Fig. 2).

Figure 2. R Commander, no active dataset

Figure 2. R Commander, no active dataset

After dataset is loaded, R Commander updates to show available datasets (Fig. 3).

Figure 3. R Commander, one active dataset available.

Figure 3. R Commander, one active dataset available.

3. To do

Once the data set is ready to go,

  • View the data
    • Just type the name of the dataset at the R prompt, or within Rcmdr, select the View or Edit button.
      • Note that you may need to find the popup window! Our small laptops have very little display real estate, so any new windows created by R or R Commander may be hidden behind a visible window.
      • MacOS and Win11 tip: A quick way to find and select open windows on your computer is to use a simple keystroke sequence: Alt + Tab. Hold down the Alt key, then select Tab key — you’ll see available windows. Repeat tap Tab key to cycle through and select available windows.
  • Explore the data set
    • For now,  this means noting what the variable names are, noting the data types of the variables are, etc. Commands to try include
    • dim()
      • Returns the number of rows and columns in the dataset. Example (red) and output
dim(DNase)
[1] 176 3
    • head()
      • View the first six rows of the dataset. Example (red) and output
head(DNase)
Grouped Data: density ~ conc | Run
  Run       conc density
1   1 0.04882812   0.017
2   1 0.04882812   0.018
3   1 0.19531250   0.121
4   1 0.19531250   0.124
5   1 0.39062500   0.206
6   1 0.39062500   0.215
    • tail()
      • View the last six rows of the dataset. Example not shown.
    •  names()
      • Returns names of the objects in the dataset. Example command (red) and output
names(DNase)
[1] "Run" "conc" "density"
    • length()
      • Just enter the variable name into the parentheses to check the number of observations of the variable. Example and output
length(Run)

Oops, that returns an error message!

[3] ERROR: object 'Run' not found

This error results because, although “Run” variable is part of the loaded dataset, the dataset is not attached. Thus, the simplest fix is to attach the dataset (see above), or specify that “Run” is a variable in the dataset DNase.

length(DNase$Run)

now returns

[1] 176
    • typeof()
      • Returns the (internal) type or storage mode of any R object. Example and output

typeof(DNase$Run)
[1] “integer”

  • Move around a data frame
    • Syntax for selecting a particular value is row, column. R data frames are like worksheets in spreadsheet apps: data are organized by rows and columns. You get used to looking at head(DNase) print outs, but most of us find looking at spreadsheets easier. So, to introduce “moving around a data frame,” lets view output from head(DNase) as it would appear in a spreadsheet.
id Run conc density
1 1 0.04882812 0.017
2 1 0.04882812 0.018
3 1 0.1953125 0.121
4 1 0.1953125 0.124
5 1 0.390625 0.206
6 1 0.390625 0.215

Now, if we wish to select the same value in our spreadsheet — we skip the header row (row 1) — so row 6, column C, we would type “=C6” in an empty cell (without the quotes). In R, we can do the same thing by calling the elements directly by number data.frame[rows,columns]. For example, DNase[5,2] points value at row 5, column 2 (value = 0.390625, see output from names(DNase) above; Fig. 4).

Figure 4. Red box highlights selection of DNase[5,2]

R uses the square brackets to help index a data frame (or a vector or a matrix). If we wish to select values three through five conc, we write DNase[2:5,2] (values = 0.1953125 0.1953125 0.3906250; Fig. 5).

Figure 5. Red box highlights selection of DNase[2:5,2]

4. Page quiz


Included datasets

Seven questions from this page