A phenomenon is called random if individual outcomes are uncertain but there is nonetheless a regular distribution of outcomes in a large number of repetitions. The probability of any outcome of a random phenomenon is the proportion of times the outcome would occur in a very long series of repetitions. E.g., tossing a fair coin: the possible outcomes are head H and tail T (the sample space); before tossing the coin, the outcome is unknown; nevertheless, if we toss the coin repeatedly we expect that approx. half of the outcomes are H and the other half T, meaning the proportion of H among all outcomes is 0.5.
Exercise 1.1: Each student tosses a 50-cent coin 20 times and reports the number of heads and tails. All students collect the results from all other students and visualize the proportion of heads after 20, 40, 60, ..., 200 tosses.
# 20 numbers of heads between 0 and 20
n.hs<-c(16,14,8,3,18,8,7,3,10,11,15,17,19,16,7,9,13,7,3,7)
cumsum(n.hs)
##  [1]  16  30  38  41  59  67  74  77  87  98 113 130 149 165 172 181 194
## [18] 201 204 211
n.tosses.each<-20
n.tosses<-rep(20,20)
prop.h <- cumsum(n.hs) / cumsum(n.tosses)
plot(x=cumsum(n.tosses),y=prop.h,xlab="Number of tosses"
     ,ylab="Proportion of head",main=""
     ,lwd=3,type="b",ylim=c(0,1))
abline(h=0.5,col="grey")
Heroes of tossing a coin:
4 probability rules:

1. The probability P(A) of any event A satisfies 0 <= P(A) <= 1.
2. The probability of the sample space S is P(S) = 1.
3. Complement rule: P(A does not occur) = 1 - P(A).
4. Addition rule: if A and B are disjoint events, then P(A or B) = P(A) + P(B).
Exercise 1.2: Imagine the random experiment of tossing a coin. Define the sample space and re-state the four probability rules in the context of tossing a coin.
Exercise 1.3: Each student simulates the random process of tossing a coin many times and visualizes how the proportion of heads changes along this simulation. In total, 5 experiments with 5000 tosses each should be simulated.
p.h<-0.3
N.tosses<-5000
N.series<-5
p.t<-1-p.h
props.h<-matrix(NA,ncol=N.tosses,nrow=N.series)
n.tosses<-1:N.tosses
for(r in 1:N.series){
  # 0=T and 1=H
  tosses<-sample(x=c(0,1),size=N.tosses,prob=c(p.t,p.h)
                 ,replace=TRUE)
  props.h[r,]<- cumsum(tosses) / n.tosses
}
cols<-c("red","green","blue","orange","pink")
plot(x=n.tosses,y=props.h[1,]
     ,main="",xlab="Number of tosses",ylab="Proportion of H"
     ,lwd=2,type="l",log="x",ylim=c(0,1)
     ,col=cols[1])
for(r in 2:N.series){
  lines(x=n.tosses,y=props.h[r,],type="l",lwd=2,col=cols[r])
}
abline(h=p.h,lwd=2)
Two events A and B are independent if knowing that one occurs does not change the probability that the other occurs. Example: tossing a coin twice. Because a fair coin doesn't have memory, we assume that H occurs after the second toss still with probability 0.5 regardless of the outcome of the first toss.
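This independence can be checked by simulation; a minimal sketch (fair coin assumed, seed chosen arbitrarily):

```r
# Simulate many pairs of tosses; the proportion of H on the second toss
# is about 0.5 regardless of the first toss (the coin has no memory).
set.seed(1)
n<-100000
first<-sample(c("H","T"),size=n,replace=TRUE)
second<-sample(c("H","T"),size=n,replace=TRUE)
mean(second=="H")                # unconditional proportion of H
mean(second[first=="H"]=="H")    # proportion of H given the first toss was H
```

Both proportions are close to 0.5; conditioning on the first toss does not change the probability of H on the second toss.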
Exercise 1.4: Describe an example of two dependent random events.
p<-c(3,8,13,5,2,11)
names(p)<-c("white","green","orange","red","blue","yellow")
barplot(p,ylab="Frequencies",xlab="Color"
        ,main="Haribo Capsulas")
4+1 probability rules: in addition to the four rules above, the multiplication rule: if A and B are independent events, then P(A and B) = P(A) * P(B).
Exercise 1.5: Tossing a fair coin twice: what is the probability of getting two tails?
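A possible calculation in R, applying the multiplication rule for independent tosses:

```r
p.t<-0.5     # probability of T in a single toss of a fair coin
p.t*p.t      # P(T and T) for two independent tosses: 0.25
```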
Exercise 1.6: Internet sites often disappear, hence, references to them cannot be followed anymore. In fact, 13% of Internet sites referenced in papers in major scientific journals are lost within two years after publication. If a paper contains seven Internet references, what is the probability that all seven are still working after two years? What specific assumptions did you make in order to calculate this probability?
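One way to compute this in R, under the assumption that the seven references are lost independently of each other:

```r
p.working<-1-0.13    # probability that a single reference still works
p.working^7          # P(all seven still work), assuming independence
```

The key assumption is independence of the seven references; the resulting probability is roughly 0.38.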
Distributions for discrete random variables. For example, the first digit of numbers in legitimate financial records is discrete. Its sample space is S={1,2,3,4,5,6,7,8,9} and its distribution, known as Benford's Law, looks as follows.
p<-c(0.301,0.176,0.125,0.097,0.079,0.067,0.058,0.051,0.046)
names(p)<-c("1","2","3","4","5","6","7","8","9")
barplot(p,ylab="Probability",xlab="First digit"
        ,main="Benford's Law")
Exercise 2.1: Toss a fair coin 4 times and record the number of heads. The number of heads after 4 tosses is a discrete random variable. We denote this random variable by X. What is the sample space of X? What is the probability of each of its possible events? Plot the distribution of X.
p<-c(1/16,4/16,6/16,4/16,1/16)
names(p)<-c("0","1","2","3","4")
barplot(p,ylab="Probability",xlab="X",main="Distribution")
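As a cross-check, these exact probabilities are those of the Binomial distribution with size 4 and success probability 0.5, which dbinom reproduces:

```r
# dbinom gives P(X = k) for the Binomial distribution
p.exact<-dbinom(x=0:4,size=4,prob=0.5)
all.equal(p.exact,c(1,4,6,4,1)/16)
```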
Exercise 2.2: What is the probability of X being 3 or 4? Which probability rule did you apply to answer this question? Emphasize the part of the distribution corresponding to this probability.
barplot(p,ylab="Probability",xlab="X",main="Distribution"
        ,col=c("grey","grey","grey","blue","blue"))
What is the probability of X being smaller than or equal to 1, or equal to 4? Which probability rule did you apply to answer this question? Emphasize the part of the distribution corresponding to this probability.
barplot(p,ylab="Probability",xlab="X",main="Distribution"
        ,col=c("blue","blue","grey","grey","blue"))
What is the probability of X being smaller than or equal to 0, 1, 2, 3, 4? Plot these probabilities.
cum.p<-cumsum(p)
names(cum.p)<-c("0","1","2","3","4")
barplot(cum.p,ylab="Probability",xlab="X"
        ,main="Cumulative Distribution Function")
Exercise 2.3 Let the number of heads after N tosses of a coin be a random variable X. What is the sample space of X? The discrete distribution of X is the Binomial distribution. Do a simulation experiment where you repeatedly toss a coin 4 times and record the number of heads. Repeat this process 2500 times and determine the empirical distribution of the number of heads after 50, 100, 500, and 2500 repeats. Plot these empirical distributions and compare them to the exact probabilities.
# number of tosses
N.tosses<-4
# sample space of X
possible.outcomes<-0:N.tosses
# probability of head
p.h<-0.5
# repeats of tossing a coin 4 times
reps<-2500
# probability of tail
p.t<-1-p.h
# tossing a coin 4 times, record the number of heads
# and repeat this procedure 2500 times
n.hs<-rep(NA,reps)
for(i in 1:reps){
  n.hs[i]<-sum( sample(x=c(0,1),size=N.tosses
                       ,prob=c(p.t,p.h),replace=TRUE))
}
# exact probabilities
p<-c(1/16,4/16,6/16,4/16,1/16)
names(p)<-c("0","1","2","3","4")
par(mfrow=c(2,2))
red.tr<-adjustcolor("red", alpha.f = 0.3)
blue.tr<-adjustcolor("blue", alpha.f = 0.3)
M.repeats<-c(50,100,500,2500)
for(reps in M.repeats){
  barplot(p,ylab="Probability",xlab="X"
          ,main=paste("After",reps,"repeats")
          ,col=blue.tr)
  n.hs.tmp<-n.hs[1:reps]
  barplot(table(n.hs.tmp)/length(n.hs.tmp)
          ,add=TRUE,col=red.tr)
}
Exercise 2.4 Let X be the number of heads seen after tossing a coin 20 times. Plot the Binomial distribution of X and its cumulative distribution function.
N.tosses<-20
# sample space of X
possible.outcomes<-seq(0,N.tosses)
# probability of head
p.h<-0.5
p<-dbinom(x=possible.outcomes,size=N.tosses,prob=p.h)
names(p)<-seq(0,20)
barplot(p,ylab="Probability",xlab="X"
        ,main=paste("Binomial distribution, N="
                    ,N.tosses," p=",p.h,sep=""))
barplot(cumsum(p),ylab="Probability",xlab="X"
        ,main=paste("Cumulative Binomial distribution function, N="
                    ,N.tosses," p=",p.h,sep=""))
There are many other discrete probability distributions.
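Two further examples, with parameters chosen only for illustration, are the Poisson and the Geometric distribution:

```r
par(mfrow=c(1,2))
# Poisson distribution with rate lambda=3 (e.g. counts per time unit)
barplot(dpois(x=0:10,lambda=3),names.arg=0:10
        ,ylab="Probability",xlab="X",main="Poisson(3)")
# Geometric distribution with success probability 0.3
# (number of failures before the first success)
barplot(dgeom(x=0:10,prob=0.3),names.arg=0:10
        ,ylab="Probability",xlab="X",main="Geometric(0.3)")
```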
Normal distributions comprise an important class of continuous distributions. All Normal distributions have the same overall shape: they are symmetric, uni-modal, and bell-shaped. Each particular Normal distribution is defined by two parameters, the mean and the standard deviation.
Exercise 2.5 Given are three Normally distributed random variables X1, X2, and X3 with means and standard deviations (0,1), (-2.5,3), and (8,0.7), respectively. What is the sample space of X1, X2, X3? Plot the probability densities of X1, X2, X3 over the range -15 to 15.
xs<-seq(-15,15,length.out=200)
ys.1<-dnorm(xs,mean=0,sd=1)
ys.2<-dnorm(xs,mean=-2.5,sd=3)
ys.3<-dnorm(xs,mean=8,sd=0.7)
y.lab<-"Density"
y.max<-max(c(ys.1,ys.2,ys.3))
plot(x=xs,y=ys.1,type="l",lwd=3,col=cols[1]
     ,ylim=c(0,y.max),xlab="X",ylab=y.lab)
lines(x=xs,y=ys.2,lwd=3,col=cols[2])
lines(x=xs,y=ys.3,lwd=3,col=cols[3])
legend(x="topleft",legend=c("X1","X2","X3")
       ,col=c(cols[1],cols[2],cols[3]),lty=1,bty="n")
Interpretation of density df(x) of continuous random variables:
The cumulative distribution function of a continuous random variable X is the function cdf(x) = P(X <= x).
Exercise 2.6 Plot the cumulative distribution functions for X1, X2, X3 into the same plot over the range -15 to 15.
ys.1<-pnorm(xs,mean=0,sd=1)
ys.2<-pnorm(xs,mean=-2.5,sd=3)
ys.3<-pnorm(xs,mean=8,sd=0.7)
y.lab<-"Cumulative Probability"
y.max<-max(c(ys.1,ys.2,ys.3))
plot(x=xs,y=ys.1,type="l",lwd=3,col=cols[1]
     ,ylim=c(0,y.max),xlab="X",ylab=y.lab)
lines(x=xs,y=ys.2,lwd=3,col=cols[2])
lines(x=xs,y=ys.3,lwd=3,col=cols[3])
legend(x="topleft",legend=c("X1","X2","X3")
       ,col=c(cols[1],cols[2],cols[3]),lty=1,bty="n")
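The cdf gives interval probabilities directly: for a continuous random variable, P(a < X <= b) = cdf(b) - cdf(a). For example, for X1 ~ N(0,1):

```r
# probability that X1 falls within one standard deviation of its mean
pnorm(1,mean=0,sd=1)-pnorm(-1,mean=0,sd=1)   # about 0.683
```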
The standard Normal distribution is the Normal distribution N(0,1) with mean 0 and standard deviation 1. If a random variable X has any Normal distribution N(mu,sd) with mean mu and standard deviation sd, then the standardized variable Z = (X-mu)/sd has a standard Normal distribution. For a given Normally distributed sample, this standardization is also called the Z-transformation.
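A quick numeric check of this standardization, using X2 ~ N(-2.5, 3) from Exercise 2.5: probabilities computed on the original scale agree with those computed for the standardized variable on the standard Normal scale.

```r
pnorm(0,mean=-2.5,sd=3)            # P(X2 <= 0)
pnorm((0-(-2.5))/3,mean=0,sd=1)    # P(Z2 <= (0-mu)/sd), same value
```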
Exercise 2.7 Take a sample of 1000 values from X2. Apply the Z-transformation to X2, which gives you the sample Z2. Plot histograms of X2 and Z2 in the same diagram. Add the density of the standard Normal distribution. Do the same for X3.
green.tr<-adjustcolor(cols[2], alpha.f = 0.3)
x2.sample<-rnorm(n=1000,mean=-2.5,sd=3)
z2.sample<-(x2.sample-(-2.5)) / 3
hist(x2.sample,col=green.tr,freq=FALSE
     ,xlim=c(min(xs),max(xs)),ylim=c(0,0.65),main="")
hist(z2.sample,col=red.tr,freq=FALSE,add=TRUE)
lines(x=xs,y=dnorm(xs,mean=0,sd=1),lwd=2)
legend(x="topleft",legend=c("X2","Z2")
       ,col=c(cols[2],cols[1]),pch=15,bty="n",pt.cex=1.5)
x3.sample<-rnorm(n=1000,mean=8,sd=0.7)
z3.sample<-(x3.sample-(8)) / 0.7
hist(x3.sample,col=blue.tr,freq=FALSE
     ,xlim=c(min(xs),max(xs)),ylim=c(0,0.65),main="")
hist(z3.sample,col=red.tr,freq=FALSE,add=TRUE)
lines(x=xs,y=dnorm(xs,mean=0,sd=1),lwd=2)
legend(x="topleft",legend=c("X3","Z3")
       ,col=c(cols[3],cols[1]),pch=15,bty="n",pt.cex=1.5)
The 68-95-99.7 rule: For the Normal distribution with mean mu and standard deviation sd:

- approx. 68% of the observations fall within sd of mu,
- approx. 95% of the observations fall within 2*sd of mu,
- approx. 99.7% of the observations fall within 3*sd of mu.
sd<-1
xs<-seq(-4.5,4.5,length.out=200)
dens<-dnorm(xs)
plot(x=xs,y=dens,type="l",lwd=3,xlab="X"
     ,ylab="Standard Normal density"
     ,main=paste(sd," sd = 68% of observations"))
xs<-c(seq(-sd*1,+sd*1,length.out=200))
ys<-dnorm(xs)
polygon(c(-sd*1,xs,sd*1),c(0,ys,0),col=blue.tr)
sd<-2
xs<-seq(-4.5,4.5,length.out=200)
dens<-dnorm(xs)
plot(x=xs,y=dens,type="l",lwd=3,xlab="X"
     ,ylab="Standard Normal density"
     ,main=paste(sd," sd = 95% of observations"))
xs<-c(seq(-sd*1,+sd*1,length.out=200))
ys<-dnorm(xs)
polygon(c(-sd*1,xs,sd*1),c(0,ys,0),col=blue.tr)
sd<-3
xs<-seq(-4.5,4.5,length.out=200)
dens<-dnorm(xs)
plot(x=xs,y=dens,type="l",lwd=3,xlab="X"
     ,ylab="Standard Normal density"
     ,main=paste(sd," sd = 99.7% of observations"))
xs<-c(seq(-sd*1,+sd*1,length.out=200))
ys<-dnorm(xs)
polygon(c(-sd*1,xs,sd*1),c(0,ys,0),col=blue.tr)
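The three percentages of the rule can also be verified exactly with pnorm:

```r
# exact probability mass within 1, 2, and 3 standard deviations
pnorm(1)-pnorm(-1)   # about 0.683
pnorm(2)-pnorm(-2)   # about 0.954
pnorm(3)-pnorm(-3)   # about 0.997
```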
Exercise 2.8 Let's assume that 76,531 students took a statistics course at the UPF last year. The mean score was 572 and the standard deviation was 51. Assume that the scores are approx. Normally distributed. Use the 68-95-99.7 rule to give a range of scores that includes 95% of the students.
mu<-572
sd<-51
lower.b<-mu-2*sd
lower.b
## [1] 470
upper.b<-mu+2*sd
upper.b
## [1] 674
Hence, approximately 95% of the students scored between 470 and 674.
A very prominent statistic is the sample mean.
Its sampling distribution depends on the distribution of the underlying population and on the sample size.
Exercise 2.9 Assume the time periods between two arriving text messages on your cell phone are exponentially distributed. Draw a sample of 5000 time periods with parameter rate 1/20 from the exponential distribution. Plot a histogram of these time periods.
rate<-1/20
hist(rexp(5000, rate=rate)
     ,xlab="Time periods between two text messages [s]"
     ,ylab="Frequency"
     ,main="Exponentially distributed time periods"
     ,freq=TRUE,col="grey")
N.sample<-100
M.repeats<-2000
sample.means<-rep(NA,M.repeats)
for(i in 1:M.repeats){
  sample.means[i]<-mean(rexp(N.sample,rate))
}
hist(sample.means
     ,xlab=paste("Sample means of",N.sample,"time periods")
     ,ylab="Frequency",main="",freq=TRUE,col="grey")
N<-144
mu.x<-240
sd.x<-18/sqrt(N)
xs<-seq(230,250,length.out=500)
ys<-dnorm(xs,mean=mu.x,sd=sd.x)
plot(x=xs,y=ys,type="l"
     ,main="Sampling distribution of sample mean"
     ,xlab="Sample mean",ylab="Density",lwd=3)
xs<-seq(mu.x-2*sd.x,mu.x+2*sd.x,length.out=200)
ys<-dnorm(xs,mean=mu.x,sd=sd.x)
polygon(x=c(xs[1],xs,xs[length(xs)])
        ,y=c(0,ys,0),col=blue.tr)
mu.x-2*sd.x
## [1] 237
mu.x+2*sd.x
## [1] 243
Hence, approximately 95% of the sample means lie between 237 and 243.
mu<-175
sd<-5
N<-5000
heights<-rnorm(N,mean=mu,sd=sd)
mean.heights<-rep(NA,N)
for(i in 1:N){
  mean.heights[i]<-mean(heights[1:i])
}
plot(mean.heights,ylab="Sample mean"
     ,xlab="Sample size",main=""
     ,lwd=3,type="l")
abline(h=mu,lwd=2,col="grey")