Hands-on for module 2 of BIST bio-statistics course 2016, CRG, Barcelona (May 9th, 2016)

Contents

  1. Probability
  2. Distributions
    1. Discrete Distributions
    2. Continuous Distributions
  3. Sampling Distributions
  4. Literature

1. Probability

A phenomenon is called random if individual outcomes are uncertain but there is nonetheless a regular distribution of outcomes in a large number of repetitions. The probability of any outcome of a random phenomenon is the proportion of times the outcome would occur in a very long series of repetitions. For example, when tossing a fair coin the possible outcomes are head H and tail T (the sample space); before tossing the coin, its outcome is unknown; nevertheless, if we repeat the toss many times we expect approx. half of the outcomes to be H and the other half T, meaning the proportion of H among all outcomes is 0.5.

Exercise 1.1: Each student tosses a 50-cent coin 20 times and reports the number of heads and tails. All students collect the results from all other students and visualize the proportion of heads after 20, 40, 60, ..., 200 tosses.

# 20 numbers of heads between 0 and 20
n.hs<-c(16,14,8,3,18,8,7,3,10,11,15,17,19,16,7,9,13,7,3,7)
cumsum(n.hs)
##  [1]  16  30  38  41  59  67  74  77  87  98 113 130 149 165 172 181 194
## [18] 201 204 211
n.tosses.each<-20
n.tosses<-rep(n.tosses.each,20)
prop.h <- cumsum(n.hs) / cumsum(n.tosses)
plot(x=cumsum(n.tosses),y=prop.h,xlab="Number of tosses"
     ,ylab="Proportion of head",main=""
     ,lwd=3,type="b",ylim=c(0,1))
abline(h=0.5,col="grey")

Heroes of tossing a coin:

  1. the French naturalist Count Buffon (1707-1788) tossed a coin 4040 times => 2048 heads => proportion of heads = 2048/4040 = 0.5069
  2. around 1900, the English statistician Karl Pearson tossed a coin 24,000 times => 12,012 heads => proportion of heads = 0.5005
  3. while imprisoned by the Germans during World War II, the South African statistician John Kerrich tossed a coin 10,000 times => 5067 heads => proportion of heads = 0.5067

4 probability rules:

  1. for any event A of S: 0 <= P(A) <= 1
  2. for sample space S: P(S) = 1
  3. Addition rule: If A and B are disjoint, P(A or B) = P(A) + P(B)
  4. Complement rule: for any event A, P(not A) = 1 - P(A)

Exercise 1.2: Imagine the random experiment of tossing a coin. Define the sample space and re-state the four probability rules in context of tossing a coin.

Exercise 1.3: Each student simulates the random process of tossing a coin many times and visualizes how the proportion of heads changes along this simulation. In total 5 experiments with 5000 tosses should be simulated.

# probability of head (here a biased coin)
p.h<-0.3
N.tosses<-5000
N.series<-5

p.t<-1-p.h
props.h<-matrix(NA,ncol=N.tosses,nrow=N.series)
n.tosses<-1:N.tosses
for(r in 1:N.series){
  # 0=T and 1=H
  tosses<-sample(x=c(0,1),size=N.tosses,prob=c(p.t,p.h)
                ,replace=TRUE)
  props.h[r,]<- cumsum(tosses) / n.tosses
}

cols<-c("red","green","blue","orange","pink")

plot(x=n.tosses,y=props.h[1,]
     ,main="",xlab="Number of tosses",ylab="Proportion of H"
     ,lwd=2,type="l",log="x",ylim=c(0,1)
     ,col=cols[1])

for(r in 2:N.series){
  lines(x=n.tosses,y=props.h[r,],type="l",lwd=2,col=cols[r])
}
abline(h=p.h,lwd=2)

Two events A and B are independent if knowing that one occurs does not change the probability that the other occurs. Example: tossing a coin twice. Because a fair coin has no memory, we assume that H still occurs on the second toss with probability 0.5, regardless of the outcome of the first toss.

Exercise 1.4: Describe an example of two dependent random events.

p<-c(3,8,13,5,2,11)
names(p)<-c("white","green","orange","red","blue","yellow")
barplot(p,ylab="Frequencies",xlab="Color"
        ,main="Haribo Capsulas")

4+1 probability rules:

  1. for any event A of S: 0 <= P(A) <= 1
  2. for sample space S: P(S) = 1
  3. Addition rule: If A and B are disjoint, P(A or B) = P(A) + P(B)
  4. Complement rule: for any event A, P(not A) = 1 - P(A)
  5. Multiplication rule: If A and B are independent P(A and B)=P(A)*P(B)

Exercise 1.5: Tossing a fair coin twice: what is the probability of getting two tails?

Exercise 1.6: Internet sites often disappear, hence, references to them cannot be followed anymore. In fact, 13% of Internet sites referenced in papers in major scientific journals are lost within two years after publication. If a paper contains seven Internet references, what is the probability that all seven are still working after two years? What specific assumptions did you make in order to calculate this probability?
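The two exercises above can be checked numerically with the multiplication rule; note that the independence of the seven references is an assumption we make here.

```r
# Exercise 1.5: two independent tosses of a fair coin,
# P(T and T) = P(T) * P(T)
p.two.tails <- 0.5 * 0.5
p.two.tails
## [1] 0.25

# Exercise 1.6: each reference survives two years with probability
# 1 - 0.13 = 0.87; assuming the seven references are lost
# independently of each other, P(all seven work) = 0.87^7
p.all.work <- 0.87^7
round(p.all.work, 4)
## [1] 0.3773
```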

2. Distributions

2.1 Discrete Distributions

Distributions for discrete random variables. For example, the first digit of numbers in legitimate financial records is discrete. Its sample space is S={1,2,3,4,5,6,7,8,9} and its distribution, known as Benford's Law, looks as follows.

p<-c(0.301,0.176,0.125,0.097,0.079,0.067,0.058,0.051,0.046)
names(p)<-c("1","2","3","4","5","6","7","8","9")
barplot(p,ylab="Probability",xlab="First digit"
        ,main="Benford's Law")

Exercise 2.1: Toss a fair coin 4 times and record the number of heads. The number of heads after 4 tosses is a discrete random variable, which we denote by X. What is the sample space of X? What is the probability of each of its possible outcomes? Plot the distribution of X.

p<-c(1/16,4/16,6/16,4/16,1/16)
names(p)<-c("0","1","2","3","4")
barplot(p,ylab="Probability",xlab="X",main="Distribution")
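
The hand-computed probabilities above can be cross-checked with R's built-in dbinom(), since the number of heads in 4 tosses of a fair coin follows a Binomial(4, 0.5) distribution:

```r
# Binomial(4, 0.5) probabilities for 0, 1, 2, 3, 4 heads
p.check <- dbinom(x=0:4, size=4, prob=0.5)
p.check
## [1] 0.0625 0.2500 0.3750 0.2500 0.0625
```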

Exercise 2.2: What is the probability of X being 3 or 4? Which probability rule did you apply to answer this question? Emphasize the part of the distribution corresponding to this probability.

barplot(p,ylab="Probability",xlab="X",main="Distribution"
        ,col=c("grey","grey","grey","blue","blue"))
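
As a sketch of the numerical answer: the events {X=3} and {X=4} are disjoint, so the addition rule applies.

```r
# exact probabilities for X = 0, 1, 2, 3, 4 heads
p <- c(1/16, 4/16, 6/16, 4/16, 1/16)
names(p) <- c("0","1","2","3","4")
# addition rule for the disjoint events {X=3} and {X=4}
sum(p[c("3","4")])
## [1] 0.3125
```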

What is the probability of X being smaller than or equal to 1, or equal to 4? Which probability rule did you apply to answer this question? Emphasize the part of the distribution corresponding to this probability.

barplot(p,ylab="Probability",xlab="X",main="Distribution"
        ,col=c("blue","blue","grey","grey","blue"))

What is the probability of X being smaller than or equal to 0, 1, 2, 3, and 4, respectively? Plot these probabilities.

cum.p<-cumsum(p)
names(cum.p)<-c("0","1","2","3","4")
barplot(cum.p,ylab="Probability",xlab="X",
        main="Cumulative Distribution Function")
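
The cumulative probabilities can also be obtained directly with R's pbinom(), the cumulative distribution function of the Binomial distribution:

```r
# P(X <= q) for q = 0, 1, 2, 3, 4 heads in 4 tosses of a fair coin
pbinom(q=0:4, size=4, prob=0.5)
## [1] 0.0625 0.3125 0.6875 0.9375 1.0000
```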

Exercise 2.3 Let the number of heads after N tosses of a coin be a random variable X. What is the sample space of X? The discrete distribution of X is the Binomial distribution. Do a simulation experiment in which you repeatedly toss a coin 4 times and record the number of heads. Repeat this process 2500 times and determine the proportions of the outcomes after 50, 100, 500, and 2500 repeats. Plot these empirical distributions and compare them to the exact probabilities.

# number of tosses
N.tosses<-4
# sample space of X
possible.outcomes<-0:N.tosses
# probability of head
p.h<-0.5
# repeats of tossing a coin 4 times
reps<-2500
# probability of tail
p.t<-1-p.h

# tossing a coin 4 times, record the number of heads 
# and repeat this procedure 2500 times
n.hs<-rep(NA,reps)
for(i in 1:reps){
  n.hs[i]<-sum( sample(x=c(0,1),size=N.tosses
                      ,prob=c(p.t,p.h),replace=TRUE))
}

# exact probabilities
p<-c(1/16,4/16,6/16,4/16,1/16)
names(p)<-c("0","1","2","3","4")

par(mfrow=c(2,2))
red.tr<-adjustcolor("red", alpha.f = 0.3)
blue.tr<-adjustcolor("blue", alpha.f = 0.3)

M.repeats<-c(50,100,500,2500)
for(n.reps in M.repeats){
  barplot(p,ylab="Probability",xlab="X"
           ,main=paste("After",n.reps,"repeats")
           ,col=blue.tr)
  n.hs.tmp<-n.hs[1:n.reps]
  barplot(table(n.hs.tmp)/length(n.hs.tmp)
            ,add=TRUE,col=red.tr)
}

Exercise 2.4 Let X be the number of heads seen after tossing a fair coin 20 times. Plot the Binomial distribution of X and its cumulative distribution function.

N.tosses<-20
# sample space of X
possible.outcomes<-seq(0,N.tosses)
# probability of head
p.h<-0.5
p<-dbinom(x=possible.outcomes,size=N.tosses,prob=p.h)
names(p)<-seq(0,20)
barplot(p,ylab="Probability",xlab="X"
            ,main=paste("Binomial distribution, N="
                         ,N.tosses," p=",p.h,sep="")
        )
barplot(cumsum(p),ylab="Probability",xlab="X"
        ,main=paste("Cumulative Binomial distribution function, N="
                    ,N.tosses," p=",p.h,sep=""))

There are many other discrete probability distributions.

2.2 Continuous Distributions

Normal distributions comprise an important class of continuous distributions. All Normal distributions have the same overall shape: they are symmetric, uni-modal, and bell-shaped. Each particular Normal distribution is defined by two parameters, the mean and the standard deviation.

Exercise 2.5 Given are three Normally distributed random variables X1, X2, and X3 with means and standard deviations (0,1), (-2.5,3), and (8,0.7), respectively. What is the sample space of X1, X2, and X3? Plot the probability densities of X1, X2, and X3 in the range -15 to 15.

xs<-seq(-15,15,length.out=200)
ys.1<-dnorm(xs,mean=0,sd=1)
ys.2<-dnorm(xs,mean=-2.5,sd=3)
ys.3<-dnorm(xs,mean=8,sd=0.7)

y.lab<-"Density"

y.max<-max(c(ys.1,ys.2,ys.3))
plot(x=xs,y=ys.1,type="l",lwd=3,col=cols[1]
     ,ylim=c(0,y.max),xlab="X",ylab=y.lab)

lines(x=xs,y=ys.2,lwd=3,col=cols[2])

lines(x=xs,y=ys.3,lwd=3,col=cols[3])

legend(x="topleft",legend=c("X1","X2","X3")
      ,col=c(cols[1],cols[2],cols[3]),lty=1,bty="n")

Interpretation of the density df(x) of a continuous random variable X: probabilities correspond to areas under the density curve, i.e., the probability that X falls into an interval [a,b] is the area under df(x) between a and b.

The cumulative distribution function of a continuous random variable X is the function cdf(x) = P(X <= x).
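
In R the cdf of the Normal distribution is pnorm(). As a sketch of the relation between density and cdf, the cdf at x is the area under the density curve up to x:

```r
# cdf(0) for X1 ~ N(0,1)
pnorm(0)
## [1] 0.5
# the same value as the area under the density up to 0
integrate(dnorm, lower=-Inf, upper=0)$value
## [1] 0.5
```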

Exercise 2.6 Plot the cumulative distribution functions for X1, X2, X3 into the same plot in the range -15 to 15.

ys.1<-pnorm(xs,mean=0,sd=1)
ys.2<-pnorm(xs,mean=-2.5,sd=3)
ys.3<-pnorm(xs,mean=8,sd=0.7)

y.lab<-"Cumulative Density"

y.max<-max(c(ys.1,ys.2,ys.3))
plot(x=xs,y=ys.1,type="l",lwd=3,col=cols[1]
     ,ylim=c(0,y.max),xlab="X",ylab=y.lab)

lines(x=xs,y=ys.2,lwd=3,col=cols[2])

lines(x=xs,y=ys.3,lwd=3,col=cols[3])

legend(x="topleft",legend=c("X1","X2","X3")
       ,col=c(cols[1],cols[2],cols[3]),lty=1,bty="n")

The standard Normal distribution is the Normal distribution N(0,1) with mean 0 and standard deviation 1. If a random variable X has any Normal distribution N(mu,sd) with mean mu and standard deviation sd, then the standardized variable Z = (X-mu)/sd has a standard Normal distribution. For a given Normally distributed sample, this standardization is also called the Z-transformation.
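
A quick numerical check of the standardization, using X2 ~ N(-2.5, 3) from above: probabilities computed on the original scale and on the standardized scale agree.

```r
# P(X2 <= 2) on the original scale
pnorm(2, mean=-2.5, sd=3)
## [1] 0.9331928
# P(Z <= (2 - mu)/sd) on the standardized scale
pnorm((2 - (-2.5)) / 3)
## [1] 0.9331928
```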

Exercise 2.7 Take a sample of 1000 values from X2. Apply the Z-transformation to this sample, which gives you the sample Z2. Plot histograms for X2 and Z2 into the same diagram. Add the density of the standard Normal distribution. Do the same for X3.

green.tr<-adjustcolor(cols[2], alpha.f = 0.3)
x2.sample<-rnorm(n=1000,mean=-2.5,sd=3)
z2.sample<-(x2.sample-(-2.5)) / 3

hist(x2.sample,col=green.tr,freq=FALSE
      ,xlim=c(min(xs),max(xs)),ylim=c(0,0.65),main="")
hist(z2.sample,col=red.tr,freq=FALSE,add=TRUE)

lines(x=xs,y=dnorm(xs,mean=0,sd=1),lwd=2)
legend(x="topleft",legend=c("X2","Z2")
      ,col=c(cols[2],cols[1]),pch=15,bty="n",pt.cex=1.5)
x3.sample<-rnorm(n=1000,mean=8,sd=0.7)
z3.sample<-(x3.sample-(8)) / 0.7

hist(x3.sample,col=blue.tr,freq=FALSE,xlim=c(min(xs)
      ,max(xs)),ylim=c(0,0.65),main="")
hist(z3.sample,col=red.tr,freq=FALSE,add=TRUE)

lines(x=xs,y=dnorm(xs,mean=0,sd=1),lwd=2)
legend(x="topleft",legend=c("X3","Z3"),col=c(cols[3],cols[1])
       ,pch=15,bty="n",pt.cex=1.5)

The 68-95-99.7 rule: for the Normal distribution with mean mu and standard deviation sd, approx. 68% of observations fall within 1 sd of mu, approx. 95% fall within 2 sd of mu, and approx. 99.7% fall within 3 sd of mu:

sd<-1
xs<-seq(-4.5,4.5,length.out=200)
dens<-dnorm(xs)
plot(x=xs,y=dens,type="l",lwd=3,xlab="X"
     ,ylab="Standard Normal density"
     ,main=paste(sd," sd = 68% of observations"))
xs<-c(seq(-sd*1,+sd*1,length.out=200))
ys<-dnorm(xs)
polygon(c(-sd*1,xs,sd*1),c(0,ys,0),col=blue.tr)
sd<-2
xs<-seq(-4.5,4.5,length.out=200)
dens<-dnorm(xs)
plot(x=xs,y=dens,type="l",lwd=3,xlab="X"
    ,ylab="Standard Normal density"
    ,main=paste(sd," sd = 95% of observations"))
xs<-c(seq(-sd*1,+sd*1,length.out=200))
ys<-dnorm(xs)
polygon(c(-sd*1,xs,sd*1),c(0,ys,0),col=blue.tr)
sd<-3
xs<-seq(-4.5,4.5,length.out=200)
dens<-dnorm(xs)
plot(x=xs,y=dens,type="l",lwd=3,xlab="X"
    ,ylab="Standard Normal density"
    ,main=paste(sd," sd = 99.7% of observations"))
xs<-c(seq(-sd*1,+sd*1,length.out=200))
ys<-dnorm(xs)
polygon(c(-sd*1,xs,sd*1),c(0,ys,0),col=blue.tr)
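
The exact probabilities behind the 68-95-99.7 rule can be computed from the standard Normal cdf:

```r
# P(mu - k*sd <= X <= mu + k*sd) for k = 1, 2, 3
round(pnorm(1) - pnorm(-1), 4)
## [1] 0.6827
round(pnorm(2) - pnorm(-2), 4)
## [1] 0.9545
round(pnorm(3) - pnorm(-3), 4)
## [1] 0.9973
```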

Exercise 2.8 Let's assume that 76,531 students took a statistics course at the UPF last year. The mean score was 572 and the standard deviation was 51. Assume that the scores are approx. Normally distributed. Use the 68-95-99.7 rule to give a range of scores that includes 95% of the students.

mu<-572
sd<-51
lower.b<-mu-2*sd
lower.b
## [1] 470
upper.b<-mu+2*sd
upper.b
## [1] 674
The range of scores that includes 95% of the students is 470 to 674.
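
The 2-sd rule is an approximation; as a sketch, the exact central 95% range can be obtained from the Normal quantile function qnorm():

```r
# exact 2.5% and 97.5% quantiles of N(572, 51)
round(qnorm(c(0.025, 0.975), mean=572, sd=51), 1)
## [1] 472 672
```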

3. Sampling Distributions