“Descriptive statistics aim to summarize a sample, rather than use the data to learn about the population that the sample of data is thought to represent. This generally means that descriptive statistics, unlike inferential statistics, are not developed on the basis of probability theory.” (Wikipedia)
We will use the package “car”. For details on this package, see https://cran.r-project.org/web/packages/car/car.pdf
# install.packages("car") - run this code if you do not have car" package installed
library(car)
Let’s explore the dataset “Davis” from the car package. It is called “Self-Reports of Height and Weight Description”. The Davis data frame has 200 rows and 5 columns. The subjects were men and women engaged in regular exercise. There are some missing data. This data frame contains the following columns:
sex - A factor with levels: F, female; M, male.
weight - Measured weight in kg.
height - Measured height in cm.
repwt - Reported weight in kg.
repht - Reported height in cm.
data <- Davis
dim(data) # first thing to do to see how many rows and columns
names(data) # columns names
class(data) # should be "data.frame""
sapply(data[1,], class) # to see types of variables in each column
head(data)
summary(data) # shows quantiles for each column and how many NA's !!!!
quantile(data$repwt, na.rm = TRUE) # to see the range of values in specific column
unique(data$sex)
unique(data$weight)
length(unique(data$weight))
table(data$sex) # build a contingency table of the counts at each combination of factor levels
table(data$sex, data$weight) # to see relationships between variables
table(data$sex, data$weight < 80) # to get 2x2 contingency table
table(cut(data$weight, quantile(data$weight)), data$sex) # contigency table of sex by weight quantiles
# contigency table of quantiles of weight versus repwt
x <- data$weight
y <- data$repwt
table(cut(x, quantile(x)), cut(y, quantile(y, na.rm = TRUE)))
Who is more inclined to report lower weight, males or femails?
d <- data[data$weight == 80,]
d <- data[data$weight < 60,]
summary(d)
summary(d$sex)
dim(d)
nrow(d)
data[data$weight < 60 & data$repwt >= 60,] # NA are included !!!
data[which(data$weight < 60 & data$repwt >= 60),] # use which() to exclude NA values
sum(is.na(data$repwt)) # the number of missing values for repwt
table(is.na(data$repht)) # how many complete data (FALSE) and how many missing (TRUE) for repht
#by default table() doesn't show missing value; e.g.
table(data$repht)
table(data$repht, useNA = "ifany") #now it shows
#How many rows contain missing values?
sum(!complete.cases(data))
data[!complete.cases(data), ] # here are these rows
What is the proportion of femails and males who didn’t report the weight?
Basic R functions: mean, sd, var, min, max, median, range, IQR, and quantile. All these function require special treatment for missing values, using parameter na.rm = TRUE. While summary(data) does not.
max_weight <- max(data$weight)
data[data$weight == max_weight, ] # full info on person with the max_weight
max_repwt <- max(data$repwt, na.rm = T)
data[data$repwt == max_repwt, ]
# if there are missing values, use which()
data[which(data$repwt == max_repwt), ]
# Are those values outliers?
Can be defined (by the Tuley’s test) as values in the sample that differ from the Q1 or Q3 by more than 1.5/IQR (where IQR = Q3 - Q1).
However, there is no formal definition of outliers; they need to be treated subjectively.
quantile(data$weight)
q <- unname(quantile(data$weight)) # the vector of quantile values
iqr <- q[4] - q[2] # or use IQR(data$weight) instead
upper_limit <- q[4] + 1.5 * iqr # values above this limit are outliers
max_weight > upper_limit # TRUE or FALSE?
#How many thus defined "formal" outliers?
data[which(data$weight > upper_limit), ]
# Let's remove them and see which of the summary statistics change
d <- data[which(!data$weight > upper_limit), ]
# Did mean change?
mean(d$weight)
mean(data$weight)
# Did median change?
median(data$weight)
median(d$weight)
# Did SD change?
sd(data$weight)
sd(d$weight)
# Did quantiles change?
quantile(data$weight)
quantile(d$weight)
boxplot(data$weight) # we can clearly see three outliers
data[which(data$weight > upper_limit), ] # here they are
#boxplots with and w/o outliers side by side
par(mfrow = c(1,2)) # two plots in one row
boxplot(data$weight)
boxplot(d$weight) # now we got another outliers !
# Let's explore them
quantile(d$weight)
q <- unname(quantile(d$weight)) # the vector of quantile values
upper_limit <- q[4] + 1.5 * IQR(d$weight) # values above this limit are
d[which(d$weight > upper_limit), ]
# Should we remove them or not?
Box-plots side-by-side on one graph
par(mfrow = c(1,2)) # two plots in one row
boxplot(data$weight ~ data$sex)
boxplot(data$weight ~ data$sex, outline = FALSE) # outliers are not shown
But how to plot together two or more vectors of different lengths?
library(reshape2) # for melt() to use below
par(mfrow = c(1,1))
male <- d[which(d$sex == "M"),] # we use data with outliers removed
female <- d[which(d$sex == "F"),]
m <- male$weight
f <- female$weight
length(m)
length(f)
# make a data frame first filling missing values in a smaller vector with NA
len = max(length(m), length(f))
Female = c(f, rep(NA, len - length(f)))
Male = c(m, rep(NA, len - length(m)))
df <- structure(data.frame(Male, Female)) #here is where order of data to display can be changed!!
meltdf <- melt(df)
boxplot(data = meltdf, value ~ variable)
Now to make the plot more informative and looking better
# Define the plot parameters
y_limits <- c(30, 130)
colors <- c("blue", "red")
ylab <- "Weight, kg"
title <- "Distribution of weight by sex"
boxplot(data = meltdf, value ~ variable, outline = FALSE,
#pars = list(boxwex = .4),
ylab = ylab,
cex.lab = 1.5, #to change (multiply) the font size of the axes legends
cex.axis = 1.2, #to change (multiply) the font size of the axes
ylim = y_limits,
border = colors, #color the boxplot borders
boxwex = 0.6,
staplewex = 0.4,
frame.plot = FALSE, #this removes upper and right borders on the plot area
outwex = 0.5,
cex.main = 1.5, #to change (multiply) the size of the title
main = title # the title of the graph
)
stripchart(data = meltdf, value ~ variable,
col = colors,
method = "jitter", jitter = .2,
pch = c(16, 15), cex = c(1.0, 1.0), #different points and of different size can be used
vertical = TRUE, add = TRUE)
# let's show the (yet unknown) significance level on the plot
text <- "Are means significantly different?"
y <- 120 # y position of the horizontal line
offset <- 5 # length of vertical segments
x <- 1
segments(x, y, x + 1, y)
segments(x, y - offset, x, y)
segments(x + 1, y - offset, x + 1, y)
text(x + 0.5, y + offset, paste(text), cex = 1) #cex defines the font size of the text
hist(data$weight)
hist(d$weight)
x <- d$weight
bin_size <- 1
start <- 30
end <- 110
bins <- seq(start, end, by = bin_size)
hist(x, breaks = bins, col = "blue")
Two histograms on one graph
# let's draw histograms for male and female weights, w/o outliers, vectors m and f
xlim = c(start, end)
title <- "Distribution of weights for males and females"
colors <- c("red", rgb(0, 0, 1, 0.7)) # the last number in rgb() is for transparency
hist(f, col = colors[1], xlim = xlim, xlab = "Weight, kg", main = title)
hist(m, add = TRUE, col = colors[2], xlim = xlim)
legend("topright", legend = c("Females", "Males"), fill = colors, bty = "n", border = NA)
Draw the same graph as above with bins of size 2 and using colors <- c(rgb(1, 0, 1, 0.7), rgb(0, 0, 1, 0.5))
plot(data$weight, data$repwt)
plot(d$weight, d$repwt, pch = 20)
plot(d$weight, d$repwt, pch = 20, col = d$sex)
legend("topleft",legend = c("Females", "Males"), col = 1:2, pch = 20)
Colors are taken from currently setup palette(). It can be changed to control for colors.
palette(c("red", "blue")) # changing palette()
plot(d$weight, d$repwt, col = d$sex, pch = 20, xlab = "Measured weight, kg", ylab = "Reported weight, kg", main = "Reported versus measured weights by sex")
legend("bottomright",legend = c("Females", "Males"), col = 1:2, pch = 20, bty = "n")
palette("default") # reset back to the default
F(x) = the proportion of observations which values are <= x
weight.ecdf = ecdf(d$weight) # obtain empirical CDF values
plot(weight.ecdf, xlab = "Quantiles of weight, kg", main = "Empirical cumulative distribution of weights")
# let's draw vertical lines depicting median, Q1 and Q2
summary(d$weight)
abline(v = 63.0) # vertical line for median
abline(v = 55.0, col = "blue") # vertical line for Q1
abline(v = 73.0, col = "red") # vertical line for Q3
# let's draw horizontal lines for quantiles
abline(h = 0.5) # median
abline(h = 0.25, col = "blue")
abline(h = 0.75, col = "red")
Plot eCDFs for male and female weights in different colors on the same graph
See https://www.r-bloggers.com/normal-distribution-functions/
Let’s look at three normal distributed random variables X1, X2, and X3 with means and standard deviations of (0,1), (-2.5,3), and (8,0.5), respectively. And apply Z-transformation.
dnorm() - Probability Density Function (PDF)
rnorm() - generates random sample from a normal distribution
xs <- seq(-15, 15, length.out = 200)
ys.1 <- dnorm(xs, mean = 0, sd = 1) # dnorm() !!!
ys.2 <- dnorm(xs, mean = -2.5, sd = 3)
ys.3 <- dnorm(xs, mean = 8, sd = 0.5)
plot(x = xs, y = ys.1, type = "l", lwd = 3, col = "black", ylim = c(0, 1), xlab = "X", ylab = "Density", yaxs = "i")
lines(x = xs, y = ys.2, lwd = 3, col = "blue")
lines(x = xs, y = ys.3, lwd = 3, col = "red")
# Z-transformation
mean <- 8
sd <- 0.5
x.sample <- rnorm(n = 200, mean = mean, sd = sd) # rnorm() !!!
z.sample <- (x.sample - mean) / sd
hist(x.sample, freq = F, add = T)
hist(z.sample, freq = F, add = T)
pnorm(z, mu, sd) - gives an area under thr standard normal curve to the left of z
xs <- seq(-4, 4, 0.01)
cdf <- pnorm(xs, 0, 1)
plot(xs, cdf, col = "blue", xlab = "", ylab = "Cumulative Probability", type = "l", lwd = 3, cex = 2, main = "CDF of Standard Normal", cex.axis = .8, yaxs = "i")
For the Normal distribution with mean mu and standard deviation sd, ~68% of the observations fall within the range [mu - sd; mu + sd.
xs <- seq(-4.5, 4.5, length.out = 200)
dens <- dnorm(xs)
plot(x = xs, y = dens, type = "l", ylim = c(0,0.45), lwd = 3, xlab = "X", ylab = "Standard Normal density", yaxs = "i")
sd <- 1
xs <- c(seq(-sd, sd, length.out = 200))
ys <- dnorm(xs)
polygon(c(-sd, xs, sd), c(0, ys, 0), col = "blue")
# What is the area of the blue area?
pnorm(sd) - pnorm(-sd)
## [1] 0.6826895
~95.5% of the observations fall within the range [mu - 2sd; mu + 2sd].
xs <- seq(-4.5, 4.5, length.out = 200)
dens <- dnorm(xs)
plot(x = xs, y = dens, type = "l", ylim = c(0,0.45), lwd = 3, xlab = "X", ylab = "Standard Normal density", yaxs = "i")
sd <- 2
xs <- c(seq(-sd, sd, length.out = 200))
ys <- dnorm(xs)
polygon(c(-sd, xs, sd), c(0, ys, 0), col = "green")
# What is the area of the blue area?
pnorm(sd) - pnorm(-sd)
## [1] 0.9544997
~99.7% of the observations fall within the range [mu - 3sd; mu + 3sd].
xs <- seq(-4.5, 4.5, length.out = 200)
dens <- dnorm(xs)
plot(x = xs, y = dens, type = "l", ylim = c(0,0.45), lwd = 3, xlab = "X", ylab = "Standard Normal density", yaxs = "i")
sd <- 3
xs <- c(seq(-sd, sd, length.out = 200))
ys <- dnorm(xs)
polygon(c(-sd, xs, sd), c(0, ys, 0), col = "red")
# What is the area of the blue area?
pnorm(sd) - pnorm(-sd)
## [1] 0.9973002
qnorm(p, mu, sd) - gives the value of z at which CDF (area on the left of z) is equal p
#Find the critical values -z and z for the area under the standard normal curve of 99.99%
p <- 0.9999
qnorm(p) #gives a value of z such that P(X < z) = p
## [1] 3.719016
qnorm(p + (1 - p) / 2) #gives a value of z such that P(-z < X < z) = p
## [1] 3.890592
ks.test(rnorm(10000), rnorm(50))
par(mfrow = c(2,1))
hist(rnorm(1000))
hist(rgamma(1000,2,2))
ks.test(rnorm(1000), rgamma(1000,2,2))
hist(rgamma(1000,1,1))
hist(rgamma(1000,2,2))
ks.test(rgamma(1000,1,1), rgamma(1000,2,2))
par(mfrow = c(1,1))
# let's generate a sample from the normal distribution of size length(d$weight) and mu = mean(d$weight) and sd = sd(d$weight)
len <- length(d$weight)
mu <- mean(d$weight)
sd <- sd(d$weight)
set.seed(42)
x <- rnorm(len, mean = mu, sd = sd)
hist(x) # frequency of observations
# let's plot eCDF for sample x together with eCDF for d$weight
plot(weight.ecdf, xlab = "Quantiles of weight, kg", main = "Empirical cumulative distribution of weights")
# let's draw vertical lines depicting median, Q1 and Q2
summary(d$weight)
abline(v = 63.0) # vertical line for median
abline(v = 55.0, col = "blue") # vertical line for Q1
abline(v = 73.0, col = "red") # vertical line for Q3
# let's draw horizontal lines for quantiles
abline(h = 0.5) # median
abline(h = 0.25, col = "blue")
abline(h = 0.75, col = "red")
x.ecdf = ecdf(x) # obtain empirical CDF values
plot(x.ecdf, add = T, col = "blue")
qqPlot(d$weight) # package "car", plot with confidence interval
For more detail, see http://stats.stackexchange.com/questions/101274/how-to-interpret-a-qq-plot?rq=1
The Shapiro–Wilk test tests the null hypothesis that a sample came from a normally distributed population. Interpretation: “If the p-value is less than the chosen alpha level, then the null hypothesis is rejected and there is evidence that the data tested are not from a normally distributed population. On the contrary, if the p-value is greater than the chosen alpha level, then the null hypothesis that the data came from a normally distributed population cannot be rejected (e.g., for an alpha level of 0.05, a data set with a p-value of 0.02 rejects the null hypothesis that the data are from a normally distributed population).”
shapiro.test(d$weight)
shapiro.test(x)
# let's test weights for females and males indpendently
shapiro.test(f)
shapiro.test(m)
qqPlot(f)
qqPlot(m)
Shapiro test cannot be run for sample size above 5000 because, for large amounts of data, it can detect even very small deviations from normality, thus leading to rejection of the null hypothesis event though for practical purposes the data is more than normal enough.
set.seed(400)
x <- rnorm(5001)
#shapiro.test(x)
ks.test(x,"pnorm",mean(x),sqrt(var(x))) #compare sample x with the normal distirbution that has mean(x) and var(x)
ks.test(d$weight,"pnorm",mean(d$weight),sqrt(var(d$weight)))
ks.test(f,"pnorm",mean(f),sqrt(var(f)))
ks.test(m,"pnorm",mean(m),sqrt(var(m)))
test <- ks.test(d$weight,"pnorm",mean(d$weight),sqrt(var(d$weight)))
test$p.value
round(test$p.value, digits = 3)
We will use a gamma distribution. Examples of gamma distribution: The gamma distribution has been used to model the size of insurance claims and rainfalls. In bacterial gene expression, the copy number of a constitutively expressed protein often follows the gamma distribution. In genomics, the gamma distribution is applied in peak calling step (i.e. in recognition of signal) in ChIP-seq data analysis.
shape <- 3
scale <- 2
x <- rgamma(500, shape, scale)
shapiro.test(x)
ks.test(x,"pnorm",mean(x),sqrt(var(x)))
y <- log(x + 1) # log transformation
shapiro.test(y)
ks.test(y,"pnorm",mean(y),sqrt(var(y)))
par(mfrow = c(2,2))
hist(x)
hist(y)
qqPlot(x) #library(car)
qqPlot(y) #library(car)
#install.packages("e1071")
library(e1071)
shape <- 3
scale <- 2
x <- rgamma(500, shape, scale)
par(mfrow = c(3,2))
hist(x)
qqPlot(x) #library(car)
y <- log(x + 1)
hist(y)
qqPlot(y) #library(car)
skew.score <- function(c, x) (skewness(log(x + c)))^2
coef <- optimise(skew.score, c(0, 20), x = x)$minimum
y2 <- log(x + coef)
hist(y2)
qqPlot(y2) #library(car)
x <- c(8.11, 8.11, 8.09, 8.08, 8.06, 8.02, 7.99, 7.99, 7.97, 7.95, 7.92, 7.92, 7.92, 7.89, 7.87, 7.84, 7.79, 7.79, 7.77, 7.76, 7.72, 7.71, 7.66, 7.62, 7.61, 7.59, 7.55, 7.53, 7.50, 7.50, 7.42, 7.38, 7.38, 7.26, 7.25, 7.08, 6.96, 6.84, 6.55) # London Olympic 2012, men jumps
shapiro.test(x)
par(mfrow = c(3,2))
hist(x)
qqPlot(x)
y1 <- x^2
shapiro.test(y1)
hist(y1)
qqPlot(y1)
y2 <- x^10
shapiro.test(y2)
hist(y2)
qqPlot(y2)
Do not transform! Use suitable models, e.g., Poisson regression.
lambda <- 4
x <- rpois(500, lambda)
par(mfrow = c(1,1))
#hist(x)
qqPlot(x) #library(car)