Part 07. Working with your own data

What’s on this page?

  1. Enter data with combine, c()
  2. Enter data with scan()
  3. Enter data directly into R with R editor
  4. Enter data directly within R Commander with R editor
  5. Reshape a data frame: From unstacked to stacked worksheet
  6. Add new column (variable) to existing data frame
  7. Load someone else’s data into R
  8. Skip R Commander menus, enter directly into R
  9. Load Your Own Data into R
  10. Import multiple data files into R
  11. More practice: Getting your own data into R
  12. Quiz

What to do

Complete the exercises on this page. The goal is to find out which method for data entry you prefer. Hint: importing data from a spreadsheet is a good choice!

  • Create variables with combine
  • For small data sets, scan() may be good option
  • Enter data directly with R’s Data Editor.
  • Enter data directly within R Commander (it’s still R’s Data Editor)
    • From unstacked to stacked worksheet in R Commander
  • Load someone else’s data into R
  • Use read.table() to enter copied data directly into R
  • From Rcmdr, import data from text (.txt, .csv)
  • From Rcmdr, import data from spreadsheet files (Microsoft Excel, Google sheets, LibreOffice Calc)
  • Save the data file in R format
  • Explore simple graphics and statistics with entered data

How to do it

For these exercises, you may work

  • within the Rcmdr script window
  • at the command line and R prompt
  • from a script document and the R GUI app
  • RStudio

Choose one way and stick with it. The quickest way is to just work with R Commander.

Let’s begin.

Before you start

A reminder, always set your working directory in R first (see Part 02. Getting started with R and Rcmdr), before starting to do your analyses. Makes life a lot easier.

As you may imagine, data may come from a variety of sources and R-project has detailed instructions to cover many situations (R Data Import/Export manual).  The following discussion covers situations common to BI311 and other beginning statistics courses.

1. Enter data with combine

The function c() is useful for entering small sets of numbers or other elements.

gecko <- c(3.186, 2.427,4.031,1.995)
anoles <- c(5.515,5.659,6.739,3.184)

Then, create your data frame.

lizard <- data.frame(gecko,anoles); lizard

Output from R

  gecko anoles
1 3.186 5.515
2 2.427 5.659
3 4.031 6.739
4 1.995 3.184

Variants

You may need to create a variable with repeated elements.  For example, I need a variable (a factor variable, actually), that marks time points in minutes, from zero to ten, and I need it to repeat for n = 3 trials. Here are two possible methods.

myT <- c(0:10)
newT <- rep(myT, each=3)

Print newT and R returns

 [1] 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8
[26] 8 8 9 9 9 10 10 10

The second method repeats the sequence in order

repT <- rep(myT, times=3)

and R returns

 [1] 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2
[26] 3 4 5 6 7 8 9 10

A factor variable is a categorical variable, which can be either numeric or character string. Many R functions work best if categorical variables are declared as factor variables, factor() or as.factor().

I may want to print the elements of a vector one per line.

cat(repT,sep="\n")

where cat is short for concatenate will do just that. Concatenate is an operation in computer languages to put strings of characters together; "\n" instructs to start characters on new lines.

2. Enter data with scan()

Again, for small data sets, a good option may be scan() — simply type in your numbers to objects.

gecko <- scan()

After typing the code, then pressing the enter key, you will be prompted to enter your numbers one at a time. An in-progress use of scan() is displayed below.

1: 3.186
2: 

Enter the four numbers, then follow with another return to end the entry. Repeat for the next variable, anoles. Create a data.frame as above to complete the data entry.

3. Enter data directly into R

Base R has a simple editor, Data Editor, that you can call to enter data into a spreadsheet-like form. First, create the data.frame object. “Everything in R is an object.”

myData <- data.frame()

Second, call the editor

myData <- edit(myData)

You should see the Data Editor form (Fig. 1)

Screenshot R Data Editor

Figure 1. Screenshot of R’s simple Data Editor.

Begin data entry. For an example, Table 1.


Table 1. Color and weights (g) of 16 M&M's®

ColorMass
Orange0.80
Orange0.81
Orange0.80
Orange0.85
Orange0.84
Orange0.77
Orange0.83
Orange0.87
Blue0.83
Blue0.87
Blue0.83
Blue0.91
Green0.86
Brown0.87
Red0.79
Red1.11
Click Show down arrow, select 25 or greater to view all 16 rows.

click on var1 in the header row, enter the variable name Color, select character, then return (Fig. 2)

Screenshot variable entry box

Figure 2. Screenshot variable editor form

Repeat entry of variable names. Select numeric for Mass. To enter data, click in the first cell and enter the value. Use keyboard arrow keys, or mouse pointer to move between cells. Figure 3 shows a few entries in the Data Editor.

Screenshot data entry

Figure 3. Screenshot data entry in R Data Editor.

To view the data, type myData (or whatever you named the object) at the R prompt. R returns

> myData
   Color Mass
1 Orange 0.80
2 Orange 0.81
>

4. Enter data directly within R Commander

If you haven’t already, start R Commander

library(Rcmdr)

To open a new worksheet in Rcmdr — it’s still R’s Data Editor — follow these steps

Data → New data set…

then enter a name for the data set, e.g., lizards, in the pop-up dialog box and click OK to continue.

Change the default variable names to “gecko” in Var1 and “anoles” in Var2. Hint: To change the variable name, click on the column and a new popup menu will appear. Change the variable name from Var1 to “gecko”

Enter the data and labels stacked into columns, with row one enter”3.186″ in “gecko” (Var1) and “5.515” in “anoles”  (Var2), without the quotes, of course. After you are done, you have an unstacked worksheet, which should look like (Fig. 1).

Screenshot of R worksheet editor view.

Figure 1. Screenshot of R worksheet editor view.

To finish the worksheet select

File → Close

5. Reshape a data frame: From unstacked to stacked worksheet

The format of lizards, the data set we just created above, is in wide format aka an unstacked worksheet (row to column orientation). In most cases I am aware of, R expects your data in long form, also known as a stacked worksheet (column to row orientation) because most statistical functions in R expect that format. It is easy to go from unstacked to stacked in Rcmdr. This process is referred to as reshaping the data frame.

Data → Active data set → Stack variables in active data set…

Screenshot of filled out Stack variables menu

Figure 2. Screenshot of filled out Stack variables menu

You can confirm that your data entry is saved correctly by clicking on the View data set button (Fig. 3).

Screenshot of Rcmdr, red arrow points to Data set button.

Figure 3. Screenshot of Rcmdr, red arrow points to Data set button.

Click on the button to bring up popup menu shown in Figure 4.

Select the new data set. 

Figure 4. Select the new data set. 

Question 1: True or False. We can use the R editor to create a stacked data set instead of an unstacked worksheet.

Answer. True, of course!

In R, we can reshape the unstacked data frame with stack().

# Stack and save to new data frame
df <- stack(lizard)

We need to rename the variables after stacking. Check

names(df)

R returns

[1] "values" "ind"

The following code replaces the variable names

names(df)[names(df) == "values"] <- "mass"
names(df)[names(df) == "ind"] <- "lizard"

Check

names(df)

now returns

[1] “mass” “lizard”

6. Add new column (variable) to existing data frame

If we need to add a variable to an existing data frame, the new variable must have the same length. For example, add IDs 1 – 4 for geckos and 1 – 4 for anoles lizards in the data frame df. Use the rep() command we introduced before.

myID <- c(1:4)
myIDs <- rep(myID, times=2)

Use the $ operator

df$ID <- myIDs

Check results, print df, R returns

   mass lizard ID
1 3.186  gecko  1
2 2.427  gecko  2
3 4.031  gecko  3
4 1.995  gecko  4
5 5.515 anoles  1
6 5.659 anoles  2
7 6.739 anoles  3
8 3.184 anoles  4

which is what we expect.

The first column is called the index column, 1 – 8 without a header, is part of the data frame structure set by R. You can hide the index row by print(df, row.names=FALSE)

Although unnecessary, I would reorder the variables. Simple way is to note that column order in df is mass is column 1, lizard is column 2, and ID is column 3. Therefore,

df2 <- df[, c(3, 2, 1)]

and check the results

 ID lizard  mass
1 1  gecko 3.186
2 2  gecko 2.427
3 3  gecko 4.031
4 4  gecko 1.995
5 1 anoles 5.515
6 2 anoles 5.659
7 3 anoles 6.739
8 4 anoles 3.184

recall that the comma in [, c(3,2,1)] tells R to collect all rows from the data frame.

7. Load someone else’s data into R

A frequent operation in data analysis is the need to access and load data from a remote source. This is called “web scraping,” among many other terms to describe this behavior (see web scraping from a static webpage (rvest package). We’ll use a simple technique called copy and paste. We’ll go over a couple of options, one is to create a text file and then use R to manipulate the data into the format we need. Another option is to use a spreadsheet program instead — since most of you will be more comfortable with the spreadsheet software, this may be the preferred method for you. You should try both! The data we want to get into R are displayed in Table 2.


Table 2. Miles per hour for a 19.7 mile morning commute by Chevy S-10 on Oahu over eight consecutive weeks (2011).

MondayTuesdayWednesdayThursdayFriday
38.32340.96639.645.69239.6
34.94131.2633642.42936
3632.10828.28637.125
32.10828.28626.428.97640.966
47.52363332.10831.263
42.42939.63638.32339.6
38.32338.32330.462
23.29431.26331.26331.26324.75
Hint: Table 1 data is displayed in an "unstacked" or "wide" format.
Click Show down arrow, select 25 or greater to view all 8 rows.

8. Skip R Commander menus, enter data directly into R

For small data sets, you can paste the data set between text="", use

myTable <- read.table(header=TRUE, sep="\t", fill=TRUE, text="")

at the R prompt, where the columns are separated by tabs use sep="\t"(if separated by a single space, use sep=" "), fill=TRUE is used to fill the empty space, missing values, from our table, and you would paste the scraped table or copied table from your spreadsheet (see Load Your Own Data into R), between the double quotes following text="". The result will be an object stored in R as a data frame, ready for your work.

For an example, I copied all data rows and the header row from Table 1, then pasted between the “” at the R prompt, although it would be easier if you pasted the text into R script, then run the command from your script (macOS: Cmd + Enter; Win11: Ctrl + R).

Here’s results of copy

myTable1 <- read.table(header=TRUE, sep=" ", fill=TRUE, text="
Monday Tuesday Wednesday Thursday Friday
38.323 40.966 39.600 45.692 39.600
34.941 31.263 36.000 42.429 36.000
36 32.108 28.286 37.125 40.966
32.108 28.286 26.4 28.976 
47.52 36 33 32.108 31.263
42.429 39.6 36 38.323 39.6
38.323 38.323 30.462 
23.294 31.263 31.263 31.263 24.75
")

Note with respect to missing values. R seems to handle missing values best if you specify the missing value as NA instead of blank in your data set. Add na.strings = "NA" in place of fill=TRUE.

Verify the import, used head() function, which grabs the first six rows (or, tail() function, which grabs last six rows).

> head(myTable1, 3)
Monday Tuesday Wednesday Thursday Friday
1 38.323 40.966 39.600 45.692 39.600
2 34.941 31.263 36.000 42.429 36.000
3 36.000 32.108 28.286 37.125 40.966

Note I entered “3” in head(), which restricts the command to return the first three rows only.

Working with text files is more robust approach than copy/paste into your R script.

Highlight and then copy the table into your clipboard.

  • You can Open your text editor (e.g., TextEdit on a macOS; Notepad on Windows) and paste the contents into the blank text page.
    For example, I saved the table to a file called mytable.txt in my working folder.
  • Or, and this is the quickest (but may fail!) — once you’ve copied the table, it is now in the Clipboard, so R Commander has an option to retrieve from your clipboard, proceed directly to R Commander/Clipboard option described below.

Replace text="" with file="" in the read.table() function. The code now is

myTable <- read.table(header=TRUE, sep="\t", fill=TRUE, file="myTable.txt")

For larger data sets, data may be imported as a text file, commonly one where variables (columns) are separated by commas. you can paste the data set between text="", use

myTable <- read.table(file.choose(), header=TRUE, sep=",")

or, if you know the name of the file, replace file.choose() with the name of the file in quotes, e.g., "myData.csv".

Do on your own. Copy and paste the data below into your text editor, save the file as .txt or .csv, then load the data in R and save to data frame object of your choosing.

Monday,Tuesday,Wednesday,Thursday,Friday
38.323,40.966,39.600,45.692,39.600
34.941,31.263,36.000,42.429,36.000
36,32.108,28.286,37.125,40.966
32.108,28.286,26.4,28.976,NA
47.52,36,33,32.108,31.263
42.429,39.6,36,38.323,39.6
38.323,38.323,30.462,NA,NA
23.294,31.263,31.263,31.263,24.75

Note “NA” added to denote missing value.

Loading data files into R from a spreadsheet app like Microsoft Excel is a common task to accomplish. A number of packages help with the process, including readxl

library(readxl)
#check number of worksheets in the spreadsheet file
excel_sheets(file.choose())
$import data from specific worksheet
myData <- read_excel(file.choose(),sheet=2)

Note: instead of sheet number, can point to the sheet by name, e.g., sheet = "Sheet2".

Another package, rio, “simplif[ies] the process of importing data into R” (vignette). Calling the command import(), rio will attempt to identify the file type and import the data with minimum input from the user. For example, you already know the sheet number to import, then

library(rio) 
myData <- import(file.choose(),sheet=2)

will do the trick.

Do on your own. Copy and paste the same data (or load from the saved text file) into your spreadsheet, save the file as .xls or .xlsx, then load the data in R and save to data frame object of your choosing.

Web scraping from a static webpage (rvest package)

In Part 03. Explore use of R, we copied data from a table and imported it into an R data frame via read.table(). Here, we’ll introduce R’s capabilities to grab data directly from a web page. Mike’s Workbook for Biostatistics and Mike’s Biostatistics Book are presented as static web pages (Wikipedia); it’s relatively straightforward to grab data from my sites. The following code will scrape and return the table data into a data frame object, ready for use.

Download and install rvest package, which is part of the larger tidyverse package for data science produces by Hadley Wickham and friends. We’ll just install rvest, but by installing tidyverse, you get rvest and many other useful tools that make R work even better.  Because there’s only the one table on the page, only two commands are needed: we read the web page and save it to an object, then call for the table with the html_table() function.

library(rvest)
webPage <- read_html("https://mikeworkbook.letgen.org/r-work/a-quick-look-at-r-and-r-commander/part-03-explore-use-of-r/")
myWeb <- as.data.frame(html_table(webPage))

#Check the import
head(myWeb)

and R returns

 Min Trial01 Trial02 Trial03 Trial04
1 0 82.8 88.7 83.2 86.2
2 1 76.7 78.5 77.6 NA
3 2 69.4 74.3 72.5 74.3
4 3 67.2 71.4 70.7 68.9
5 4 63.3 67.6 68.5 61.5
6 5 60.6 65.5 65.8 57.7

That was easy!

rvest returns tibbles because the tidyverse world uses tibbles. Tibbles are more flexible than a data frame, but to be consistent with the rest of our work, we stick to data frames. If you want to know more about “tibble,” feel free to do a query, “why is it called tibble?”

Note. By sharing web scraping code, and pointing y’all to my web site, I give my permission for readers to web scrape data from Mike’s Biostatistics Book and Mike’s Workbook for Biostatistics. However, don’t abuse the gift. First, the data are copy right protected, albeit with a Creative Commons license. Second, don’t hammer my — or especially other’s — websites with excessive web scraping activity, which can harm the performance of a web site or even violate the rules of the hosting service for the website. Third, be aware and apply web scraping etiquette, e.g., see 2020 post by Alexandra Datsenko at Webbiquity.

R Commander

Text file option

  1. Return to R Commander and select Data → Import data → from text file, clipboard, or URL…
    • Enter a name for the data set, e.g., myData
    • Location of data file:
      • If you are loading from a text file, then check Local file system (Figure 5)
      • If you are loading directly from the clipboard, then check Clipboard (Figure 6)
    • Field separator: check “Tabs”
    • Leave the rest as default values
    • Click OK

Rcmdr read text file

Figure 5. Rcmdr load data from a text file from Local file system

Rcmdr read data from clipboard

Figure 6. Rcmdr load data from Clipboard.

  1. Now, the dataset should be loaded in R. Check the Message window in R to see if there were errors.
  2. We need to convert the dataset into a long, stacked format. This is easy to do in Rcmdr.
    • Data → Active data set → Stack variables in active data set…
    • Select the columns (variables), enter a new name for the stacked data set, enter name for the data variable (mph), and a name for the factor (DaysOfWeek), see Figure 7.

Rcmdr stack worksheet

Figure 7. Rcmdr stack data from an unstacked (wide) data set.

Spreadsheet option

  1. Now, please open your spreadsheet app (e.g., Google Sheets, MS Excel), then return to this page and highlight/copy the data table to your clipboard.
  2. Return to Excel and paste the data into a worksheet.
  3. Convert the stacked worksheet into a stacked worksheet (see Hint below).
    • In your Excel file, stack the data into two columns (long form).
    • Label the first column “DaysOfWeek” and the second column “mph” (without the quotes).
  4. Save the workbook and remember where on your computer you saved the file.
  5. Next, upload the data file to R from within R Commander
    • Data → Import data → From Excel file
      • Select the correct worksheet from the popup menu.

Screenshot of an unstacked or wide format data set based on the data from Table 1 is displayed in Figure 8. See Figure 9 for a stacked (long format) version of the same data set as in Figure 8.

Screenshot of an unstacked worksheet

Figure 8. Screenshot of an unstacked or wide format worksheet

Screenshot stacked worksheet

Figure 9. Stacked or long format worksheet

Save the data file in R format

Finally, save the data you imported as an R file (.RData)

  1. Data → Active data set → Save active data set…
  2. Save the file with an appropriate file name to your working folder (Figure 10)

Rcmdr save data

Figure 10. Rcmdr Save As popup dialog.

Question 2. Try each of the three ways to get Table 1 data into R: Clipboard, text file, spreadsheet file.

Question 3. Once the data set is loaded in R, find the missing value(s) for the dataset. Report the row number(s) for the missing value(s).

Hint: Use is.na() function. Just enter the name of the variable to check for missing value(s). This will return a vector with FALSE for each non missing case, and TRUE for any missing value. Quick-R website provides this set of commands which you should also try

dataset[!complete.cases(dataset),]

where “dataset” is replaced with the name of your data set loaded in R. Try both!

Question 4. What kind of graph would be suitable to display the relationship (if any) between DaysOfWeek and mph? Explore the available default graphs in Rcmdr and select one.
A) bar chart
B) histogram
C) pie chart
D) scatter plot

Note: Question 3 and 4 are part of Part 08. Data exploration. You should open and work through Part 08, then return to complete or expand on your investigation of Question 3 and 4.

Data import into R (Rcmdr)

Work with one table at a time. R can handle multiple data sets in memory, but you work with one data set a at a time. You have several options — select the one that works best for you.

  1. In your browser, highlight and copy the table (skip the table label — just the column header and numbers), then, in R go to Rcmdr: Data → Import data → from text file, clipboard, look for any error messages, then check data by “View data” button.
  2. If the copy-paste from clipboard method fails for you, then copy and paste the data into an Excel spreadsheet. Then, import the data from a saved Excel spreadsheet in R by Rcmdr → Data → Import data → From excel file, look for error messages, then check data by “View data” button (only available if you are working with Rcmdr version 2.04 or higher — all of the NSM Macbooks should meet this requirement).
  3. After entering the data into Excel, export the data from your Excel spreadsheet as a text file. Then, import the data into R by Rcmdr → Import data from text file, look for error messages, then check data by “View data” button. If you do not know how to export to text file, please check our Moodle help pages.

Regardless of the approach you used, once you get the data into R, save the data in R format before proceeding. Rcmdr: Data → Active data set → Save active data set

I want you to either work with text files you create or Excel files you create, hence this exercise. This becomes important later in the course when you work with your own datasets. However, I will provide you with an Excel file with the two datasets (Knell, RalphPearson) just in case you run into trouble with data loading so that you’ll be able to move onto the descriptive statistics and graphics. Click here to get the RWork01.

9. Load Your Own Data into R

Think about your other science classes. Do you have data? It may be as simple as data for a calibration curve or a set of data for growth rates of a cell line. At a minimum, you will generate data during our Measurement Day (shells, darts). After we complete Measurement Day, bring these data sets into R by either the text file route or the spreadsheet route. To help, I have collated available distances measured for darts tossed by your classmates, imported into R, then saved as an R-ready data file for you. Simply click on the link and download the file to your working directory, then load the data file as you would any other R data file (Rcmdr: Data → Load data set …)

Question 5. Gather your datasets from Measurement Day and repeat Question 2, 3, and 4, adjusting the question to suit the variable(s) contained in your own data set.

10. Import multiple data files into R

Working with a single smallish data file is probably the exception. For example, we follow growth rates of yeast cells in 48-well arrays recorded every 15 minutes over 24 hours. For a single trial that’s 4608 readings for processing. A typical experiment might involve dozens of trials. The instrument saves the data in a particular format which requires processing before ready for analysis. This description screams for programming solution and R is excellent at this. Thus, once the script is written, it is logical to load multiple data files of the same type at the same time for processing. Because this is beyond the needs of an introduction class, I’ll simply point to a couple of resources for the interested student.

11. More practice: Getting your own data into R

I want you to explore ways to IMPORT data into R (Rcmdr) and how to save the data into a format that R likes (e.g., *.RData). Here are two small data sets, plus a link to a page containing Measurement Day results.


Table 3. Plasma testosterone levels in male lizards and distance from nearest neighbor. Data from Knell’s “Introductory R”

Lizard noNearest
Neighbor (m)
Plasma
testosterone (ng/ml)
1422.2
2726.1
3321.5
4523.8
5723.8
6320.5
7724.6
8422.6
9319.9
10523.9
11519.8
12421.3
Click Show down arrow, select 25 or greater to view all 13 rows.


 


Table 4. Breeding success and morphometrics among White-crowned sparrows near Pt. Reyes California (Ralph). Table 1 Ralph & Pearson 1971

PairMale ageFemale agePercent black feathers in crown MalePercent black feathers in crown FemaleTerritory sizeSuccessful
A113025980Yes
B1148201530Yes
C12601001620Yes
D119640710No
E2110081840Yes
F2110041260Yes
G12961001760No
H1110402600No
I221001004540Yes
J41100734480Yes
K2199202820Yes
L2110055850Yes
M118851890No
N1290982160No
O1199401500No
X3NA100NA600NA
Y1250982900NA
Z1NA80NA860NA
Click Show down arrow, select 25 or greater to view all 19 rows.

Question 6. Take a look closer at Table 1 from Ralph & Pearson. What does “NA” mean?

Proceed to carry out basic data analysis on these data sets (descriptive statistics, graphs)

Work with one table at a time

  1. Go to Rcmdr: Statistics — what is available (not dimmed) under “Summaries?”
  2. Proceed to determine descriptive statistics for Knell’s data set and, separately, for the Ralph & Pearson data set. Neighbor and Testosterone levels. Hint: you may wish to get statistics for the groups, not just the whole data set. Explore what you get with “Active summaries” versus what is available with “Numerical summaries.”
  3. Go to Rcmdr: Graphs — What kind of graph is best for this data set? First, consider plots for one variable at a time display
    – What’s an index plot? Compare to a dot plot
    – What’s a histogram? Compare to a density estimate
    Next, consider plots to compare groups for one variable (hint: look for a “Plot by groups” option).
    – What is a box plot? Compare to Plot of means
    Note: the bar chart in R is not the same bar chart that you are used to from Excel and other software. Be sure to ask me about this and see Chapter 4 in Mike’s Biostatistics eBook
  4. Important concept — while you are working on these datasets you need to be asking yourself, what are the experimental (independent) variables and which are the outcome (dependent) variables?

12. Page quiz


your own data

Ten questions from this page