Difference between revisions of "BIST Introduction to Statistics 2016"

From Bioinformatics Core Wiki
(Created page with "__TOC__ == BIST "Introduction to Biostatistics" Course == ==== Online Resources ==== * Nature Web-collection "Statistics for Biologists": http://www.nature.com/collections/qg...")
 
(Course Instructors)
 
(81 intermediate revisions by 4 users not shown)
Line 1: Line 1:
 
__TOC__
 
__TOC__
== BIST "Introduction to Biostatistics" Course ==
 
  
==== Online Resources ====
 
* Nature Web-collection "Statistics for Biologists": http://www.nature.com/collections/qghhqm
 
* Self-paced online UC Berkeley courses
 
** https://www.edx.org/course/introduction-statistics-descriptive-uc-berkeleyx-stat2-1x
 
** https://www.edx.org/course/introduction-statistics-probability-uc-berkeleyx-stat2-2x
 
** https://www.edx.org/course/introduction-statistics-inference-uc-berkeleyx-stat2-3x
 
* Online book recommended by the above courses http://www.stat.berkeley.edu/~stark/SticiGui/
 
* Upcoming (Feb 2, 2016) MIT course Introduction to Probability - The Science of Uncertainty. https://www.edx.org/course/introduction-probability-science-mitx-6-041x-1
 
* Self-paced online course "Explore Statistics with R" https://www.edx.org/course/explore-statistics-r-kix-kiexplorx-0
 
* "An Introduction to Statistical Learning with Applications in R" from Stanford http://www-bcf.usc.edu/~gareth/ISL/
 
* 100 Statistical Tests.pdf - ResearchGate - just search Google to get a link
 
* VIB "Basic statistics in R" course. Tutorial and links.    https://www.bits.vib.be/index.php/training/180#download
 
  
==== Comparison of two samples ====
+
[[File:BIST-Portada-Cursos-BioStatistics-2.jpg|500px]]
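* The t-test, paired or unpaired. In R, >t.test(x,y) for independent samples and >t.test(x,y, paired=TRUE) for paired samples. With var.equal=TRUE the unpaired test is Student's t-test, an exact test for the equality of the means of two normal populations with unknown, but equal, variances; by default R runs Welch's test, which does not assume equal variances. Equality of variances can be checked with the F-test, in R >var.test(x,y). https://en.wikipedia.org/wiki/Student's_t-test#Paired_samples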
+
  
* Non-parametric tests. No assumption about variances and normality.
+
=== Course Description ===
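** Independent samples. The Wilcoxon rank-sum test, aka Mann-Whitney test. https://en.wikipedia.org/wiki/Mann–Whitney_U_test. In R, >wilcox.test(x,y). H0: the two samples come from the same distribution, so a value drawn from one sample is equally likely to be larger or smaller than a value drawn from the other.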
+
This introductory course in statistics and probability theory is modeled after the traditional university course Statistics 101 and will be given by CRG staff and PhD students. The material is offered in 5 consecutive modules (please see Course Syllabus below), each consisting of a morning lecture and an afternoon practicum in a computer classroom. For practical exercises we will use the R programming language and [https://www.rstudio.com RStudio]. However, this course focuses on statistics rather than R; each practicum is therefore designed to demonstrate and reinforce the concepts introduced in the lecture rather than to provide training in R.
** Paired samples. The Wilcoxon signed-rank test. In R, >wilcox.test(x,y, paired=TRUE). See http://vassarstats.net/textbook/ch12a.html.
+
** The Kolmogorov-Smirnov test. In R, >ks.test(x,y). https://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test. The test compares whole distributions, so it can detect two samples that have the same mean but differ in variance and/or shape. It is more sensitive to such differences than the Wilcoxon test, although it is generally less powerful for detecting a pure shift in location. The statistic is the maximum absolute difference between the two sample (empirical) cumulative distribution functions. See http://www.physics.csbsju.edu/stats/KS-test.html.
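The calls named in this list can be tried together on simulated data. The following is a minimal sketch, not course material: the sample sizes, means, standard deviations, and seed are assumptions chosen only for illustration. It also recomputes the Kolmogorov-Smirnov statistic by hand as the largest gap between the two empirical CDFs.

```r
# Simulated data (an assumption for illustration, not course data)
set.seed(1)
x <- rnorm(30, mean = 10, sd = 2)
y <- rnorm(30, mean = 11, sd = 2)

# F-test for equality of the two variances
f_res <- var.test(x, y)

# Student's t-test with pooled variance; R's default is Welch's test
t_pooled <- t.test(x, y, var.equal = TRUE)
t_welch  <- t.test(x, y)

# Paired t-test: x[i] and y[i] are treated as measurements on the same unit
t_paired <- t.test(x, y, paired = TRUE)

# Wilcoxon rank-sum (Mann-Whitney) and Wilcoxon signed-rank tests
w_indep  <- wilcox.test(x, y)
w_paired <- wilcox.test(x, y, paired = TRUE)

# Two-sample Kolmogorov-Smirnov test
ks_res <- ks.test(x, y)

# KS statistic by hand: maximum absolute difference between the
# empirical CDFs, evaluated at every observed value
pooled   <- sort(c(x, y))
d_manual <- max(abs(ecdf(x)(pooled) - ecdf(y)(pooled)))

p_values <- c(F = f_res$p.value,
              t_pooled = t_pooled$p.value,
              t_welch = t_welch$p.value,
              t_paired = t_paired$p.value,
              wilcoxon = w_indep$p.value,
              signed_rank = w_paired$p.value,
              ks = ks_res$p.value)
print(round(p_values, 4))
```

Pairing the simulated x and y is artificial here, since the two vectors were generated independently; with real data the paired variants apply only when each x[i] and y[i] come from the same subject or unit.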
+
  
==== Comparison of two microbiome samples ====
+
=== Course Objectives ===
* New (2012) biostatistical methods for the analysis of microbiome data based on a fully parametric approach using all the data. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3527355/#pone.0052078-LaRosa1
+
To introduce the basic concepts of statistics and probability and to demonstrate how they can be applied to real-life biological problems using R. No prior knowledge of statistics or R is required. However, participants who do not take the modules in sequence should be familiar with the material of the preceding modules.
** A fully parametric model for these data has an advantage over non-parametric alternatives such as bootstrapping and permutation testing: it retains more of the information contained in the data.
+
** R package "HMP" is available. http://cran.r-project.org/web/packages/HMP/HMP.pdf. To install it from CRAN: > install.packages("HMP")
+
  
== Online courses ==
+
=== Course Instructors ===
* Ten rules for online learning. http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002631
+
* Dmitri Pervouchine (lectures) pervouchine@gmail.com
 +
* German Demidov (practicums III, V) german.demidov@crg.eu
 +
* Andre Gohr (practicum II) Andre.Gohr@crg.eu
 +
* [https://biocore.crg.eu/wiki/User:Sbonnin Sarah Bonnin] (lecture on R, practicum I) sarah.bonnin@crg.eu
 +
* Julia Ponomarenko (organizer, practicum IV) julia.ponomarenko@crg.eu
  
* Self-paced online course "Intro to Linux" https://www.edx.org/course/introduction-linux-linuxfoundationx-lfs101x-2
+
=== Time and Location ===
* Self-paced online course from Microsoft "Intro to R programming"  https://www.edx.org/course/introduction-r-programming-microsoft-dat204x-0
+
* LECTURES: 9:30 - 13:30. PRBB. AULA Auditorium. 4th floor. The hotel wing.
* Self-paced online course "Introduction to Cloud Computing" https://www.edx.org/course/introduction-cloud-computing-ieeex-cloudintro-x-0
+
* PRACTICUMS: 14:30 - 17:00. PRBB. Bioinformatics classroom. 468. 4th floor. The hotel wing.
* Self-paced online course from Microsoft "Data Science and Machine Learning Essentials" https://www.edx.org/course/data-science-machine-learning-essentials-microsoft-dat203x-0
+
* PICA-PICA (generously sponsored by BIST): May 18, 17:00. Terrace of the 5th floor. PRBB.
  
* Self-paced online courses in the series "Data Analysis for Life Sciences" from Harvard:
 
** 1: Statistics and R https://www.edx.org/course/data-analysis-life-sciences-1-statistics-harvardx-ph525-1x
 
** 2: Introduction to Linear Models and Matrix Algebra https://www.edx.org/course/data-analysis-life-sciences-2-harvardx-ph525-2x
 
** 3: Statistical Inference and Modeling for High-throughput Experiments https://www.edx.org/course/data-analysis-life-sciences-3-harvardx-ph525-3x
 
** 4: High-Dimensional Data Analysis https://www.edx.org/course/data-analysis-life-sciences-4-high-harvardx-ph525-4x
 
** 5: Introduction to Bioconductor: Annotation and Analysis of Genomes and Genomic Assays https://www.edx.org/course/data-analysis-life-sciences-5-harvardx-ph525-5x
 
** 6: High-performance Computing for Reproducible Genomics https://www.edx.org/course/data-analysis-life-sciences-6-high-harvardx-ph525-6x
 
** 7: Case Studies in Functional Genomics https://www.edx.org/course/data-analysis-life-sciences-7-case-harvardx-ph525-7x
 
  
* Excellent self-paced course from MIT "Introduction to Biology - The Secret of Life" https://www.edx.org/course/introduction-biology-secret-life-mitx-7-00x-2
+
=== Course Syllabus, Schedule, and Materials ===
 +
 
 +
 
 +
==== MODULE 0. Workshop "Introduction to R". May 2, 2016. ICFO. ====
 +
[[Media:ICFO_R.zip|Download the workshop materials.]] The workshop was given by Dr. Alejandro Caceres, CREAL, and organized by the ICFO's Training and Development Program.
 +
 
 +
 
 +
==== MODULE I. Descriptive statistics. May 6, 2016. CRG. ====
 +
* LECTURE I. [[Media:Module1.pdf|View slides in this browser window.]] Exploratory data analysis: bar-plot, histogram, CDF, box-plot, scatter-plot, pie charts etc. Samples, measures of center and spread, percentiles, odds ratio. Outliers and robustness. Experiment versus observational study, confounding factors, simple random sample, other types of sampling, biases in sampling techniques.
 +
* LECTURE II. [[Media:Introduction to R Module1.pdf|View slides in this browser window.]] Introduction to the R programming language and RStudio: data types, variables, packages, functions, handling files/scripts/projects.
 +
* PRACTICUM. [[Media:Practicum1 ggplot2.pdf|View pdf-file in this browser window.]] Basic plots in R using the ggplot2 package.
 +
 
 +
 
 +
==== MODULE II. Introduction to Probability. May 9, 2016. CRG. ====
 +
* LECTURE. [[Media:Module2.pdf|View slides in this browser window.]] Independence, conditional probability, Bayes formula. Distributions, population mean and population variance, Binomial, Poisson, and Normal distribution. Central Limit theorem and the Law of large numbers. Continuity correction. Sampling with and without replacement. Correction for finite population size.
 +
* PRACTICUM. [[Media:Practicum2.zip|Download the zip-file.]] Elementary probability problems in R, pdf and cdf functions, simulation illustrating the law of large numbers.
 +
* [[Media:Tables corrected.pdf|STATISTICAL TABLES]]
 +
* [[Media:QUIZ2.pdf|QUIZ 2]]
 +
 
 +
 
 +
==== MODULE III. Statistical Inference, part I. May 13, 2016. CRG. ====
 +
* LECTURE. [[Media:Module3.pdf|View slides in this browser window.]] Statistical Inference, part I. The concept of hypothesis testing, type I and type II error, false discovery rate. Significance and confidence level, p-value. Confidence intervals. One-sided and two-sided tests and confidence intervals. Sampling distribution, estimators, standard error. Normal probabilities in application to p-value. One-sample and two-sample tests for independent and matched samples with known variance.  The case of unknown variance and Student t-distribution, assumption of normality. Pooled variance and equal variances assumption.
 +
* PRACTICUM. [[Media:BIST_Module3_practicum.zip|Download the zip-file.]] One- and two-sample tests with known and unknown variance, test for proportions, simulation involving confidence intervals and t-distribution.
 +
* [[Media:QUIZ3.pdf|QUIZ 3]]
 +
 
 +
 
 +
==== MODULE IV. Statistical Inference, part II. May 18, 2016. CRG. ====
 +
* LECTURE. [[Media:Module4-2.pdf|View slides in this browser window.]] Statistical Inference, part II. Estimation of variance. Fisher test for variance equality. Non-parametric tests. Sign test, Wilcoxon sum of ranks test (Mann-Whitney U-test), Wilcoxon signed rank test. Chi-square test for goodness of fit, chi-square test for independence. Kolmogorov-Smirnov (KS) test. Shapiro test for normality. Sample size estimation. Correction for multiple testing, family-wise error rate.
 +
* PRACTICUM. [[Media:Module4.zip|Download the zip-file.]] Tests with unknown variance, non-parametric tests, simulations illustrating non-parametric tests, FDR.
 +
* [[Media:QUIZ4.pdf|QUIZ 4]]
 +
 
 +
==== MODULE V. Statistical modeling, Regression. May 20, 2016. CRG. ====
 +
* LECTURE. [[Media:Module5-2.pdf|View slides in this browser window.]] Simple linear regression model, residuals, degrees of freedom, least squares method, correlation coefficient, variance decomposition, determination coefficient. Interpretation of the slope, correlation, and determination coefficients. Standard error and statistical inference in simple linear regression model. Analysis of variance (ANOVA). One-way and two-way ANOVA.
 +
* PRACTICUM. [[Media:BIST_Module5_hands_on.zip|Download the zip-file.]] Problems on linear regression, ANOVA, data transformation.
 +
* [[Media:QUIZ5.pdf|QUIZ 5]]
 +
 
 +
 
 +
=== External Resources ===
 +
* [http://www.nature.com/collections/qghhqm Nature Web-collection "Statistics for Biologists"]
 +
* 100 Statistical Tests.pdf - ResearchGate - just search Google to get a link
 +
* [http://data.bits.vib.be/pub/trainingen/StatTheory/Jarko_Isotalo_Concepts.pdf Book "Basics of Statistics" by Jarko Isotalo]
 +
* [https://cran.r-project.org/web/packages/IPSUR/vignettes/IPSUR.pdf "Introduction to Probability and Statistics using R" by G. Jay Kerns]
 +
* [http://ww2.coastal.edu/kingw/statistics/R-tutorials/ R Tutorials by William B. King]
 +
* [http://www.gardenersown.co.uk/Education/Lectures/R/basics.htm Tutorials "R for basic statistics"]
 +
* [http://www.r-bloggers.com Blog "R-bloggers"]
 +
* [http://www.statsblogs.com StatsBlogs]
 +
* [https://learnr.wordpress.com Blog "Learning R"]
 +
* [https://ryouready.wordpress.com Blog "R you ready?"]
 +
* [http://www.r-statistics.com "R-statistics blog"]
 +
* Self-paced online courses from UC Berkeley: [https://www.edx.org/course/introduction-statistics-descriptive-uc-berkeleyx-stat2-1x Descriptive Statistics.] [https://www.edx.org/course/introduction-statistics-probability-uc-berkeleyx-stat2-2x Probability.] [https://www.edx.org/course/introduction-statistics-inference-uc-berkeleyx-stat2-3x Inference.]
 +
* [http://www.stat.berkeley.edu/~stark/SticiGui/ Online book recommended for the UC Berkeley courses]
 +
* [https://www.edx.org/course/explore-statistics-r-kix-kiexplorx-0 Self-paced online course "Explore Statistics with R"]
 +
* [http://www-bcf.usc.edu/~gareth/ISL/ Online course from Stanford "An Introduction to Statistical Learning with Applications in R"]
 +
* [https://www.edx.org/course/introduction-r-programming-microsoft-dat204x-0 Self-paced online course from Microsoft "Intro to R programming"]
 +
* [https://www.edx.org/course/data-analysis-life-sciences-1-statistics-harvardx-ph525-1x Self-paced online course from Harvard "Statistics and R"]
 +
* [https://www.edx.org/course/data-analysis-life-sciences-3-harvardx-ph525-3x Self-paced online course from Harvard "Statistical Inference and Modeling for High-throughput Experiments" ]
 +
* [http://data.bits.vib.be/pub/trainingen/StatTheory/SlidesFullDay.pdf VIB "Basic statistics theory" course slides.]
 +
* [https://www.bits.vib.be/index.php/training/180#download VIB "Basic statistics in R" course. Tutorial, exercises, cheat sheets.]

Latest revision as of 13:32, 12 September 2018



Course Description

This introductory course in statistics and probability theory is modeled after the traditional university course Statistics 101 and will be given by CRG staff and PhD students. The material is offered in 5 consecutive modules (please see Course Syllabus below), each consisting of a morning lecture and an afternoon practicum in a computer classroom. For practical exercises we will use the R programming language and RStudio. However, this course focuses on statistics rather than R; each practicum is therefore designed to demonstrate and reinforce the concepts introduced in the lecture rather than to provide training in R.

Course Objectives

To introduce the basic concepts of statistics and probability and to demonstrate how they can be applied to real-life biological problems using R. No prior knowledge of statistics or R is required. However, participants who do not take the modules in sequence should be familiar with the material of the preceding modules.

Course Instructors

  • Dmitri Pervouchine (lectures) pervouchine@gmail.com
  • German Demidov (practicums III, V) german.demidov@crg.eu
  • Andre Gohr (practicum II) Andre.Gohr@crg.eu
  • Sarah Bonnin (lecture on R, practicum I) sarah.bonnin@crg.eu
  • Julia Ponomarenko (organizer, practicum IV) julia.ponomarenko@crg.eu

Time and Location

  • LECTURES: 9:30 - 13:30. PRBB. AULA Auditorium. 4th floor. The hotel wing.
  • PRACTICUMS: 14:30 - 17:00. PRBB. Bioinformatics classroom. 468. 4th floor. The hotel wing.
  • PICA-PICA (generously sponsored by BIST): May 18, 17:00. Terrace of the 5th floor. PRBB.


Course Syllabus, Schedule, and Materials

MODULE 0. Workshop "Introduction to R". May 2, 2016. ICFO.

Download the workshop materials. The workshop was given by Dr. Alejandro Caceres, CREAL, and organized by the ICFO's Training and Development Program.


MODULE I. Descriptive statistics. May 6, 2016. CRG.

  • LECTURE I. View slides in this browser window. Exploratory data analysis: bar-plot, histogram, CDF, box-plot, scatter-plot, pie charts etc. Samples, measures of center and spread, percentiles, odds ratio. Outliers and robustness. Experiment versus observational study, confounding factors, simple random sample, other types of sampling, biases in sampling techniques.
  • LECTURE II. View slides in this browser window. Introduction to the R programming language and RStudio: data types, variables, packages, functions, handling files/scripts/projects.
  • PRACTICUM. View pdf-file in this browser window. Basic plots in R using the ggplot2 package.
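
The measures of center and spread listed for Module I, and their robustness to outliers, can be illustrated in base R. This is a small sketch on made-up numbers (the vector x and the outlier value 30 are assumptions, not course data); it compares how much one outlier moves the mean and standard deviation versus the median and the median absolute deviation.

```r
# Made-up measurements (illustrative assumption)
x     <- c(2.1, 2.4, 2.5, 2.7, 3.0, 3.1, 3.3)
x_out <- c(x, 30)                 # same data plus one outlier

# Measures of center
center <- rbind(clean   = c(mean = mean(x),     median = median(x)),
                outlier = c(mean = mean(x_out), median = median(x_out)))

# Measures of spread: sd is sensitive to the outlier, IQR and MAD are robust
spread <- rbind(clean   = c(sd = sd(x),     IQR = IQR(x),     MAD = mad(x)),
                outlier = c(sd = sd(x_out), IQR = IQR(x_out), MAD = mad(x_out)))

# Percentiles (quartiles) of the clean sample
quartiles <- quantile(x, probs = c(0.25, 0.5, 0.75))

print(center)
print(spread)
print(quartiles)
```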


MODULE II. Introduction to Probability. May 9, 2016. CRG.

  • LECTURE. View slides in this browser window. Independence, conditional probability, Bayes formula. Distributions, population mean and population variance, Binomial, Poisson, and Normal distribution. Central Limit theorem and the Law of large numbers. Continuity correction. Sampling with and without replacement. Correction for finite population size.
  • PRACTICUM. Download the zip-file. Elementary probability problems in R, pdf and cdf functions, simulation illustrating the law of large numbers.
  • STATISTICAL TABLES
  • QUIZ 2
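
A simulation of the law of large numbers, together with R's pdf and cdf functions, might look like the sketch below. It is not taken from the practicum files; the success probability, the number of trials, and the seed are assumptions made for illustration.

```r
set.seed(42)
p <- 0.3        # success probability (illustrative assumption)
n <- 10000      # number of Bernoulli trials

# Law of large numbers: the running proportion of successes approaches p
flips        <- rbinom(n, size = 1, prob = p)
running_mean <- cumsum(flips) / seq_len(n)

plot(running_mean, type = "l", log = "x",
     xlab = "number of trials", ylab = "running proportion of successes")
abline(h = p, lty = 2)

# pdf (pmf) and cdf functions for a Binomial(10, p) variable
pmf_3 <- dbinom(3, size = 10, prob = p)   # P(X = 3)
cdf_3 <- pbinom(3, size = 10, prob = p)   # P(X <= 3)

# The cdf is the cumulative sum of the pmf
cdf_3_by_sum <- sum(dbinom(0:3, size = 10, prob = p))

# Normal cdf and its inverse (quantile function)
z_975 <- qnorm(0.975)
```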


MODULE III. Statistical Inference, part I. May 13, 2016. CRG.

  • LECTURE. View slides in this browser window. Statistical Inference, part I. The concept of hypothesis testing, type I and type II error, false discovery rate. Significance and confidence level, p-value. Confidence intervals. One-sided and two-sided tests and confidence intervals. Sampling distribution, estimators, standard error. Normal probabilities in application to p-value. One-sample and two-sample tests for independent and matched samples with known variance. The case of unknown variance and Student t-distribution, assumption of normality. Pooled variance and equal variances assumption.
  • PRACTICUM. Download the zip-file. One- and two-sample tests with known and unknown variance, test for proportions, simulation involving confidence intervals and t-distribution.
  • QUIZ 3
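
The meaning of a confidence level can be checked by simulation: over many repeated samples, about 95% of the 95% confidence intervals should contain the true mean. The sketch below is one possible version of such a simulation, not the practicum's own code; the population parameters, sample size, number of repetitions, and the helper name covers_mu are assumptions.

```r
set.seed(123)
mu    <- 5      # true population mean (assumption)
sigma <- 2      # true population sd (assumption)
n     <- 15     # sample size
n_sim <- 2000   # number of simulated samples

# Hypothetical helper: does the t-based 95% CI of one sample contain mu?
covers_mu <- function() {
  s  <- rnorm(n, mean = mu, sd = sigma)
  ci <- t.test(s, conf.level = 0.95)$conf.int
  ci[1] <= mu && mu <= ci[2]
}

coverage <- mean(replicate(n_sim, covers_mu()))
print(coverage)

# The same interval computed by hand for a single sample:
# mean +/- t quantile * standard error, with n - 1 degrees of freedom
s         <- rnorm(n, mean = mu, sd = sigma)
ci_manual <- mean(s) + c(-1, 1) * qt(0.975, df = n - 1) * sd(s) / sqrt(n)
ci_ttest  <- as.numeric(t.test(s, conf.level = 0.95)$conf.int)
```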


MODULE IV. Statistical Inference, part II. May 18, 2016. CRG.

  • LECTURE. View slides in this browser window. Statistical Inference, part II. Estimation of variance. Fisher test for variance equality. Non-parametric tests. Sign test, Wilcoxon sum of ranks test (Mann-Whitney U-test), Wilcoxon signed rank test. Chi-square test for goodness of fit, chi-square test for independence. Kolmogorov-Smirnov (KS) test. Shapiro test for normality. Sample size estimation. Correction for multiple testing, family-wise error rate.
  • PRACTICUM. Download the zip-file. Tests with unknown variance, non-parametric tests, simulations illustrating non-parametric tests, FDR.
  • QUIZ 4
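
Correction for multiple testing is available in base R through p.adjust. The sketch below uses a made-up vector of p-values (an assumption, not course data) to contrast the Bonferroni correction, which controls the family-wise error rate, with the Benjamini-Hochberg procedure, which controls the false discovery rate, and recomputes both by hand.

```r
# Made-up p-values, sorted in increasing order (illustrative assumption)
p_values <- c(0.001, 0.008, 0.039, 0.041, 0.042,
              0.060, 0.074, 0.205, 0.212, 0.216)
m <- length(p_values)

# Family-wise error rate control: Bonferroni
p_bonf <- p.adjust(p_values, method = "bonferroni")

# False discovery rate control: Benjamini-Hochberg
p_bh <- p.adjust(p_values, method = "BH")

# By hand (valid as written because p_values is already sorted):
# Bonferroni multiplies each p-value by m and caps the result at 1
bonf_manual <- pmin(1, p_values * m)
# BH scales the i-th smallest p-value by m / i, then enforces monotonicity
bh_manual <- pmin(1, rev(cummin(rev(p_values * m / seq_len(m)))))

# Number of rejections at level 0.05 under each approach
rejections <- c(unadjusted = sum(p_values < 0.05),
                bonferroni = sum(p_bonf < 0.05),
                BH         = sum(p_bh < 0.05))
print(rejections)
```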

MODULE V. Statistical modeling, Regression. May 20, 2016. CRG.

  • LECTURE. View slides in this browser window. Simple linear regression model, residuals, degrees of freedom, least squares method, correlation coefficient, variance decomposition, determination coefficient. Interpretation of the slope, correlation, and determination coefficients. Standard error and statistical inference in simple linear regression model. Analysis of variance (ANOVA). One-way and two-way ANOVA.
  • PRACTICUM. Download the zip-file. Problems on linear regression, ANOVA, data transformation.
  • QUIZ 5
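
Simple linear regression, the variance decomposition, and one-way ANOVA can be sketched in base R as follows. The data are simulated; the true intercept, slope, noise level, group means, and seed are assumptions for illustration and do not come from the practicum files.

```r
set.seed(7)

# Simulated data for simple linear regression (illustrative assumption)
x <- seq(1, 20, length.out = 40)
y <- 2 + 0.5 * x + rnorm(40, sd = 1)

# Least squares fit
fit <- lm(y ~ x)
print(summary(fit)$coefficients)   # estimates, standard errors, t-tests

# Variance decomposition: total = explained + residual sum of squares
sst <- sum((y - mean(y))^2)
ssr <- sum((fitted(fit) - mean(y))^2)
sse <- sum(residuals(fit)^2)

# The determination coefficient equals the squared correlation coefficient
r  <- cor(x, y)
r2 <- summary(fit)$r.squared

# One-way ANOVA on three simulated groups
group  <- factor(rep(c("A", "B", "C"), each = 10))
values <- rnorm(30, mean = rep(c(5, 6, 7), each = 10), sd = 1)
aov_fit   <- aov(values ~ group)
aov_table <- summary(aov_fit)[[1]]
print(aov_table)
```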


External Resources
