Homework 2B: Descriptive statistics

This homework is about describing the middle and the variability of data using descriptive statistics.

Homework 2B expectations

Read through the entire homework before starting to answer a question.

How to work this homework

You may work together, but each of you must turn in your own report. Don’t “plagiarize” from each other. Do include in your report who you worked with.

What to turn in: A PDF file containing your R code, statistical results, and your answers to the questions. Use of RMarkdown is recommended; however, copy/paste into a Word document is also acceptable.

Submit your work to CANVAS. Obey proper file naming formats.

Resources for this homework

Mike’s Biostatistics Book: Chapter 3

Mike’s Workbook for Biostatistics: A quick look at R and R Commander, Part01 – Part10; refer also to the code presented in Homework 2 from this workbook.

Additional R commands and/or code provided below.


Questions

Central tendency

  1. Find the help page in R for the median function. How does the function handle missing values (NA)?
  2. For a simple data set like the following, y <- c(1,1,3,6), you should now be able to calculate, by hand, the
    • mean
    • median
    • mode
  3. If the observations for a ratio scale variable are normally (symmetrically) distributed, which statistic of central tendency is best (e.g., less sensitive to outlier values)?
  4. In Chapter 3 we provided a function to calculate the mode. The function was
    temp = table(as.vector(x))
    names (temp)[temp==max(temp)]

    In the names() command, what do you think the result will be if you replace max in the command with min?

  5. If data are right skewed, what will be the order of the mean and median?
  6. Calculate the sample mean and median for the following data sets
    (a) Basal 5 hour fasting plasma glucose-to-insulin ratio of four inbred strains of mice,
    x <- c(44, 100, 105, 107) #(data from Berglund et al 2008)
    (b) Height in inches of mothers,
    mom <- c(67, 66.5, 64, 58.5, 68, 66.5) #(data from GaltonFamilies in R package HistData)
    and fathers,
    dad <- c(78.5, 75.5, 75, 75, 74, 74) #(data from GaltonFamilies in R package HistData)
    (c) Carbon dioxide (CO2) readings from Mauna Loa for the month of December for demi-decade 1960 – 2020
    years <- c(1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015, 2020) #obviously, do not calculate statistics on years; you can use them to make a plot
    co2 <- c(316.19, 319.42, 325.13, 330.62, 338.29, 346.12, 354.41, 360.82, 396.83, 380.31, 389.99, 402.06, 414.26) #data from Dr. Pieter Tans, NOAA/GML (gml.noaa.gov/ccgg/trends/) and Dr. Ralph Keeling, Scripps Institution of Oceanography (scrippsco2.ucsd.edu/)
    (d) Body mass of Rhinella marina (formerly Bufo marinus),
    bufo <- c(71.3, 71.4, 74.1, 85.4, 85.4, 86.6, 97.4, 99.6, 107, 115.7, 135.7, 156.2)
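
The two-line mode snippet in question 4 can be wrapped in a function so it is easy to reuse. The following is a minimal sketch, not the workbook's own code: the function name get_mode and the vector z are made up for illustration, and z is not one of the homework data sets.

```r
# Hypothetical wrapper; the two lines in the body are the ones shown in question 4.
get_mode <- function(x) {
  temp <- table(as.vector(x))        # count how often each value occurs
  names(temp)[temp == max(temp)]     # value(s) with the highest count, returned as character
}

z <- c(2, 2, 5, 7, 7, 7, 9)          # made-up data, for illustration only

get_mode(z)                          # "7"
as.numeric(get_mode(z))              # convert back to a number if needed
get_mode(c(1, 1, 2, 2, 3))           # ties are all returned: "1" "2"

mean(z)                              # arithmetic mean
median(z)                            # middle value of the sorted data: 7
```

Note that the result is a character vector because it comes from names(); wrap it in as.numeric() before doing arithmetic with it.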

Measures of dispersion

  1. For a sample data set like y = c(1,1,3,6), you should now be able to calculate, by hand, the
    • range
    • standard deviation
    • variance
  2. If the difference between Q1 and Q3 is called the interquartile range (IQR), what do we call Q2?
  3. For our example data set, x <- c(4,4,4,4,5,6,6,6,7,7,8,8,8,8,8) calculate
    • IQR
    • sample standard deviation, s
    • coefficient of variation
  4. Use the sample() command in R to draw samples of size 4, 8, and 12 from your example data set stored in x. Repeat the calculations from question 3. For example, x4 <- sample(x,4) will randomly select four observations from x and store them in the object x4, like so (your numbers probably will differ!)
    x4 <- sample(x,4); x4
    [1] 8 6 8 6
  5. Repeat the exercise in question 4 again using different samples of 4, 8, and 12. For example, when I repeat sample(x,4) a second time I get
    sample(x,4)
    [1] 8 4 8 6
  6. Table 1. Summary statistics, mean (± standard deviation), of height, weight, and waist circumference of 20-39 year old men, USA.
    Years         Height, inches   Weight, pounds   Waist circumference, inches
    1999 – 2000   69.4 (0.1)       185.8 (2.0)      37.1 (0.3)
    2007 – 2008   69.4 (0.2)       189.9 (2.1)      37.6 (0.3)
    2015 – 2016   69.3 (0.1)       196.9 (3.1)      38.7 (0.4)

    For Table 1, determine how many multiples of the standard deviation correspond to observations greater than the 95th percentile (e.g., determine the observation value for a person who is at the 95th percentile for Height in each of the different sets of years, etc.).

  7. Calculate the sample range, IQR, sample standard deviation, and coefficient of variation for the following data sets
    • Basal 5 hour fasting plasma glucose-to-insulin ratio of four inbred strains of mice,
    x <- c(44, 100, 105, 107) #(data from Berglund et al 2008)
    • Height in inches of mothers,
    mom <- c(67, 66.5, 64, 58.5, 68, 66.5) #(data from GaltonFamilies in R package HistData)
    and fathers,
    dad <- c(78.5, 75.5, 75, 75, 74, 74) #(data from GaltonFamilies in R package HistData)
    • Carbon dioxide (CO2) readings from Mauna Loa for the month of December for demi-decade 1960 – 2020
    years <- c(1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015, 2020) #obviously, do not calculate statistics on years; you can use them to make a plot
    co2 <- c(316.19, 319.42, 325.13, 330.62, 338.29, 346.12, 354.41, 360.82, 396.83, 380.31, 389.99, 402.06, 414.26) #data from Dr. Pieter Tans, NOAA/GML (gml.noaa.gov/ccgg/trends/) and Dr. Ralph Keeling, Scripps Institution of Oceanography (scrippsco2.ucsd.edu/)
    • Body mass of Rhinella marina (formerly Bufo marinus) (see Fig. 1),
    bufo <- c(71.3, 71.4, 74.1, 85.4, 85.4, 86.6, 97.4, 99.6, 107, 115.7, 135.7, 156.2) 
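
Base R has functions for range, IQR, variance, and standard deviation, but not for the coefficient of variation, which has to be computed from sd() and mean(). A minimal sketch, assuming the coefficient of variation is reported as a percentage (100 × s / mean); the vector z and the name cv are made up for illustration and are not homework data.

```r
z <- c(12, 15, 15, 18, 20, 22, 25, 31)   # made-up data, for illustration only

range(z)                  # returns two numbers: minimum and maximum
diff(range(z))            # range as a single number (max - min): 19
quantile(z)               # minimum, Q1, median (Q2), Q3, maximum
IQR(z)                    # Q3 - Q1 using R's default quantile method: 7.75
var(z)                    # sample variance (n - 1 in the denominator)
sd(z)                     # sample standard deviation, the square root of var(z)

cv <- 100 * sd(z) / mean(z)   # coefficient of variation as a percent (assumed convention)
cv
```

Other software, and hand calculation, may use a different quartile rule than R's default, so small differences in the IQR are expected; see ?quantile for the available types.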

R commands, copy, paste & modify to Script

To access the R help page for a command, add a question mark in front of the command name. For example, ?mean brings up the help page in your default browser.
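
A few equivalent ways to look up a command are sketched below. Where the help page opens (browser, RStudio Help pane, or console) depends on your setup; arg_names is just an illustrative object name.

```r
?mean              # shortcut form: help page for mean()
help("mean")       # same request written as a function call
args(sd)           # quick reminder of a function's arguments without opening help

arg_names <- names(formals(sd))   # argument names as a character vector
arg_names
```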

IQR()

mean()

quantile()

median()

range()

sample(x, size, replace=FALSE)

sd()

var()
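
Because sample() draws at random, results change every time it is run. A minimal sketch of how a draw can be made reproducible with set.seed(), and of what the replace argument does; the vector pool and the seed value 42 are arbitrary choices for illustration.

```r
pool <- c(3, 3, 4, 5, 5, 6, 7, 9, 9, 10)   # made-up data, for illustration only

set.seed(42)                    # arbitrary seed; fixes the random number stream
s1 <- sample(pool, size = 4)    # four observations, without replacement (the default)

set.seed(42)                    # same seed again ...
s2 <- sample(pool, size = 4)    # ... gives the same draw
identical(s1, s2)               # TRUE

# Drawing more observations than the data contain requires replace = TRUE
s3 <- sample(pool, size = 15, replace = TRUE)
length(s3)                      # 15
```

For the homework questions that ask you to repeat a draw and get different samples, leave out set.seed() or change the seed between draws.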

Rcmdr: Statistics → Summaries → Numerical Summaries…, many statistics available in Numerical Summaries

Screenshot Rcmdr Numerical summaries

You should pay attention to significant figures. This is a small set of problems with one or two numbers per response, so editing by hand is probably easiest. However, R has the format() command. For example, a mean estimate may be 65.08333, but only one digit past the decimal is significant.

format(65.08333, digits = 3)

returns

65.1
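
format() returns text, which is fine for a report but not for further calculation. A short sketch comparing it with round() and signif(), which return numbers; the object name m is just for illustration.

```r
m <- 65.08333

format(m, digits = 3)              # "65.1", a character string with 3 significant digits
round(m, 1)                        # 65.1, numeric, rounded to 1 decimal place
signif(m, 3)                       # 65.1, numeric, rounded to 3 significant digits

# nsmall forces a minimum number of decimals, so a trailing zero is kept
format(round(65.04, 1), nsmall = 1)   # "65.0"
```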