Homework 2B: Descriptive statistics

This homework is about describing the middle and the variability of data using descriptive statistics.

Homework 2B expectations

Read through the entire homework before starting to answer a question.

How to work this homework

You may work together, but each of you must turn in your own report. Don’t “plagiarize” from each other. Do include in your report who you worked with.

What to turn in: A PDF file containing the relevant R code, the statistical results (edited to support your answers to the questions), and your answers to the questions (even numbered only). Use of RMarkdown is recommended because it is a simple way to include the graphs you generate; however, copy/paste into a Word document is also acceptable.

Notes. By relevant we mean provide just the R code and results from R functions necessary to support your answers to the questions. For example, do not include

the entire data set when head(dataset) will do
screenshots of R output!! R output is text — copy/paste
all statistical output from an R function.

See Part09: Making a report for an example homework file.

Submit your work to CANVAS. Obey proper file naming formats.

Resources for this homework

Mike’s Biostatistics Book: Chapter 3

Mike’s Workbook for Biostatistics: A quick look at R and R Commander, Part01 – Part10; refer also to the code presented in Homework 2 from this workbook.

Additional R commands and/or code are provided below.

Questions

Central tendency

Find the help page in R for the median function. How does the function handle missing values (NA)?
For a simple data set like the following y <- c(1,1,3,6) you should now be able to calculate, by hand, the
• mean
• median
• mode
If the observations for a ratio scale variable are normally (symmetrically) distributed, which statistic of central tendency is best (e.g., less sensitive to outlier values)?
In Chapter 3 we provided a function to calculate the mode. The function was
```
temp = table(as.vector(x))
names(temp)[temp==max(temp)]
```
In the names() command, what do you think the result will be if you replace max in the command with min?
If data are right skewed, what will be the order of the mean and median?
Calculate the sample mean and median for the following data sets
(a) Basal 5 hour fasting plasma glucose-to-insulin ratio of four inbred strains of mice,
x <- c(44, 100, 105, 107) #(data from Berglund et al 2008)
(b) Height in inches of mothers,
mom <- c(67, 66.5, 64, 58.5, 68, 66.5) #(data from GaltonFamilies in R package HistData)
and fathers,
dad <- c(78.5, 75.5, 75, 75, 74, 74) #(data from GaltonFamilies in R package HistData)
(c) Carbon dioxide (CO₂) readings from Mauna Loa for the month of December for demi-decade 1960 – 2020
years <- c(1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015, 2020) #obviously, do not calculate statistics on years; you can use to make a plot
co2 <- c(316.19, 319.42, 325.13, 330.62, 338.29, 346.12, 354.41, 360.82, 396.83, 380.31, 389.99, 402.06, 414.26) #data from Dr. Pieter Tans, NOAA/GML (gml.noaa.gov/ccgg/trends/) and Dr. Ralph Keeling, Scripps Institution of Oceanography (scrippsco2.ucsd.edu/)
(d) Body mass of Rhinella marina (formerly Bufo marinus),
bufo <- c(71.3, 71.4, 74.1, 85.4, 85.4, 86.6, 97.4, 99.6, 107, 115.7, 135.7, 156.2)
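The two Chapter 3 lines quoted in the mode question above can be wrapped in a function so they are easy to reuse. The following is a minimal sketch, not the book's own definition: the function name get_mode and the vector v are made up for illustration and are not part of the homework data.

```r
# Hypothetical wrapper around the two Chapter 3 lines; the name get_mode is an assumption.
get_mode <- function(x) {
  temp <- table(as.vector(x))        # count how often each distinct value occurs
  names(temp)[temp == max(temp)]     # value(s) with the highest count, returned as character
}

v <- c(2, 2, 5, 7, 7, 7, 9)  # made-up illustration data, not from the homework

get_mode(v)               # the most frequent value, as a character string
as.numeric(get_mode(v))   # convert back to a number if needed
mean(v)                   # arithmetic mean
median(v)                 # middle value of the sorted data
```

Because names() returns character strings, the result prints in quotes. If two or more values tie for the highest count, all of them are returned.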

Measures of dispersion

For a sample data set like y = c(1,1,3,6), you should now be able to calculate, by hand, the
• range
• standard deviation
• variance
If the difference between Q1 and Q3 is called the interquartile range (IQR), what do we call Q2?
For our example data set, x <- c(4,4,4,4,5,6,6,6,7,7,8,8,8,8,8) calculate
• IQR
• sample standard deviation, s
• coefficient of variation
Use the sample() command in R to draw samples of size 4, 8, and 12 from your example data set stored in x. Repeat the calculations from question 3. For example, x4 <- sample(x,4) will randomly select four observations from x and store them in the object x4, like so (your numbers will probably differ!)
x4 <- sample(x,4); x4
[1] 8 6 8 6
Repeat the exercise in question 4 again using different samples of 4, 8, and 12. For example, when I repeat sample(x,4) a second time I get
sample(x,4)
[1] 8 4 8 6
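The sampling questions above can be organized with a short loop. This is only a sketch under stated assumptions: the helper cv() is a hypothetical name (base R has no built-in coefficient of variation function), it reports the coefficient of variation as a percentage, and the seed value is arbitrary.

```r
# Hypothetical helper: coefficient of variation expressed as a percentage.
cv <- function(v) 100 * sd(v) / mean(v)

x <- c(4,4,4,4,5,6,6,6,7,7,8,8,8,8,8)  # example data set from the question

set.seed(2024)  # arbitrary seed (an assumption) so a draw can be reproduced
for (n in c(4, 8, 12)) {
  xs <- sample(x, n)  # sampling without replacement, the default
  cat("n =", n,
      " IQR =", IQR(xs),
      " s =", round(sd(xs), 2),
      " CV% =", round(cv(xs), 1), "\n")
}
```

Remove or change the set.seed() line to obtain a different set of samples each time, which is what the second sampling question asks for.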

Table 1. Summary statistics, mean (± standard deviation), of height, weight, and waist circumference of 20–39 year old men in the USA.

Years	Height, inches	Weight, pounds	Waist Circumference, inches
1999 – 2000	69.4 (0.1)	185.8 (2.0)	37.1 (0.3)
2007 – 2008	69.4 (0.2)	189.9 (2.1)	37.6 (0.3)
2015 – 2016	69.3 (0.1)	196.9 (3.1)	38.7 (0.4)

For Table 1, determine how many multiples of the standard deviation above the mean an observation must be to exceed the 95th percentile (e.g., determine the observation value for a person at the 95th percentile for height in each of the time periods, etc.).
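One possible approach to the Table 1 question is sketched here. It assumes the variable is approximately normally distributed, which the question does not state, and it uses a made-up mean and standard deviation (m and s) rather than values from Table 1.

```r
# Assumption: the variable is approximately normal.
z95 <- qnorm(0.95)   # number of standard deviations above the mean at the 95th percentile
z95

m <- 100   # hypothetical mean, not from Table 1
s <- 15    # hypothetical standard deviation, not from Table 1

x95 <- m + z95 * s   # observation value at the 95th percentile
x95

# The same value, letting qnorm() do the scaling.
qnorm(0.95, mean = m, sd = s)
```

Substituting a mean and standard deviation from Table 1 for m and s gives the corresponding value for that variable and time period.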

Calculate the sample range, IQR, sample standard deviation, and coefficient of variation for the following data sets
• Basal 5 hour fasting plasma glucose-to-insulin ratio of four inbred strains of mice, x <- c(44, 100, 105, 107) #(data from Berglund et al 2008)
• Height in inches of mothers,
mom <- c(67, 66.5, 64, 58.5, 68, 66.5) #(data from GaltonFamilies in R package HistData)
and fathers,
dad <- c(78.5, 75.5, 75, 75, 74, 74) #(data from GaltonFamilies in R package HistData)
• Carbon dioxide (CO₂) readings from Mauna Loa for the month of December for demi-decade 1960 – 2020
years <- c(1960, 1965, 1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010, 2015, 2020) #obviously, do not calculate statistics on years; you can use to make a plot
co2 <- c(316.19, 319.42, 325.13, 330.62, 338.29, 346.12, 354.41, 360.82, 396.83, 380.31, 389.99, 402.06, 414.26) #data from Dr. Pieter Tans, NOAA/GML (gml.noaa.gov/ccgg/trends/) and Dr. Ralph Keeling, Scripps Institution of Oceanography (scrippsco2.ucsd.edu/)
• Body mass of Rhinella marina (formerly Bufo marinus) (see Fig. 1),
bufo <- c(71.3, 71.4, 74.1, 85.4, 85.4, 86.6, 97.4, 99.6, 107, 115.7, 135.7, 156.2)

R commands, copy, paste & modify to Script

To access the R help page for a command, add a question mark in front of the command. For example, ?mean brings up the help page in your default browser.

IQR()

mean()

quantile()

median()

range()

sample(x, size, replace=FALSE)

sd()

var()
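The commands listed above can be tried on any numeric vector. A brief sketch follows, using a made-up vector v that is not part of the homework data; note that range() returns the minimum and maximum rather than a single number, so diff() is used to get their difference.

```r
v <- c(12, 15, 11, 19, 14, 16)  # made-up illustration data

range(v)         # minimum and maximum
diff(range(v))   # the range as a single number (maximum minus minimum)
quantile(v)      # 0%, 25%, 50%, 75%, and 100% quantiles (default method)
IQR(v)           # Q3 minus Q1
var(v)           # sample variance (n - 1 in the denominator)
sd(v)            # sample standard deviation, the square root of var(v)
100 * sd(v) / mean(v)   # coefficient of variation, as a percentage
```

R offers several quantile algorithms through the type argument of quantile(); hand calculations that follow a different rule may give slightly different quartiles.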

Rcmdr: Statistics → Summaries → Numerical Summaries…, many statistics available in Numerical Summaries

Screenshot Rcmdr Numerical summaries

You should pay attention to significant figures. This is a small set of problems with one or two numbers per response, so editing by hand is probably easiest. However, R has a format() command. For example, the estimate of the mean may be 65.08333, but the significant figures extend just one place past the decimal.

format(65.08333, digits=3)

returns

65.1
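As a complement to format(), base R has other functions that control the number of digits reported. The following sketch shows a few of them on the same value; which one is most convenient depends on whether a number or a text string is needed.

```r
est <- 65.08333

round(est, 1)               # numeric, rounded to one decimal place
signif(est, 3)              # numeric, rounded to three significant figures
format(est, digits = 3)     # character string with three significant digits
sprintf("%.1f", est)        # character string with exactly one decimal place
```

round() and signif() return numbers that can be used in further calculations, whereas format() and sprintf() return text intended for a report.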