CRG Introduction to Statistics and R 2017

From Bioinformatics Core Wiki

Latest revision as of 16:00, 9 June 2017


Description

This is an introductory course to statistics and R programming. For the previous edition of this course, please refer to the 2016 page (https://biocore.crg.eu/wiki/BIST_Introduction_to_Statistics_2016).
The R part is offered in 4 slow-paced practicums for absolute beginners, followed by 3 fast-paced practicums of statistical modules.
For practical exercises we will use the R programming language and RStudio (https://www.rstudio.com).

The statistics material is offered in 3 consecutive modules (please see Course Syllabus below), each containing a morning lecture and an afternoon practicum in a computer classroom. These practicums focus on using statistics in R, with the aim of demonstrating and reinforcing the concepts introduced in the lectures rather than teaching R programming.


Course Instructors


Dates, Time and Location

  • Module 0. Introduction to R. May 25, 26, 29, 30, 2017.
    • 10:00 - 13:00.
    • PRBB. Bioinformatics classroom (468). 4th floor. The hotel/North wing.


  • Modules I, II, III. Introduction to Statistics. June 6, 8, 9, 2017.
    • LECTURES.
      • 10:00 - 13:00.
      • PRBB. Ramon y Cajal.
    • PRACTICUMS.
      • 14:00 - 17:00.
      • PRBB. Bioinformatics classroom (468). 4th floor. The hotel/North wing.


Course Syllabus, Schedule, and Materials


MODULE 0. Introduction to R. May 25, 26, 29, 30.

  • PRACTICUM I. Intro to R and RStudio. May 25. 10:00 - 13:00.
    • Introduction to RStudio: explore environment variables, navigate the command history, navigate the directory and file structure, workspace and files.
    • Simple arithmetic in R console.
    • Create and delete an object.
    • Introduction to data types and the "vector" data structure.
    • Create and run a short script.
    • Read and write a file.
    • OUTCOME: Write a script that creates (and enters) a directory, performs a simple manipulation, and writes the result into a file.

Slides day1
Exercises day1
Solutions for exercise 1, exercise 2, exercise 3
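
The Practicum I outcome could be sketched as follows. This is only an illustration in base R, not the course solution: the directory name `practicum1`, the file name `result.csv`, and the use of a temporary directory are assumptions made so the script can be run anywhere without leaving files behind.

```r
# Work inside a temporary location so the sketch leaves no files behind.
base_dir <- tempdir()
work_dir <- file.path(base_dir, "practicum1")   # hypothetical directory name

# Create the directory (no warning if it already exists) and enter it.
dir.create(work_dir, showWarnings = FALSE)
old_wd <- setwd(work_dir)                       # setwd() returns the previous directory

# A simple manipulation on a numeric vector.
x <- c(2, 4, 6, 8)
result <- data.frame(value = x, squared = x^2)

# Write the result to a file, then read it back to check the round trip.
write.csv(result, "result.csv", row.names = FALSE)   # hypothetical file name
check <- read.csv("result.csv")
print(check)

# Return to the original working directory.
setwd(old_wd)
```

Saving the value returned by `setwd()` makes it easy to go back to the starting directory at the end of the script.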


  • PRACTICUM II. Data structures in R. May 26. 10:00 - 13:00.
    • More on vectors and factors.
    • Matrices and data frames: create, access/extract/subset, modify, arithmetic, conversions, check and name dimensions.
    • OUTCOME: Produce a script that reads matrices and data frames, manipulates them, and reads and writes files.

Slides day2
Exercises day2
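
The operations listed for Practicum II (create, subset, modify, convert, name dimensions) might look like this in base R. The object names and values are made up for illustration and are not taken from the course exercises.

```r
# A matrix with named dimensions (filled column by column).
m <- matrix(1:6, nrow = 2, ncol = 3,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))
dim(m)                     # check the dimensions
elem <- m["r2", "b"]       # extract one element by row and column name
doubled <- m * 2           # element-wise arithmetic

# Convert the matrix to a data frame and add a factor column.
df <- as.data.frame(m)
df$group <- factor(c("control", "treated"))

# Subset rows by a condition, then modify a single value.
treated <- df[df$group == "treated", ]
df[1, "a"] <- 10L

# Convert the numeric columns back to a matrix.
m2 <- as.matrix(df[, c("a", "b", "c")])
print(m2)
```

Only the numeric columns are converted back, because a matrix holds a single data type and including the factor column would turn every value into a character string.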


  • PRACTICUM III. Lists & Packages. May 29. 10:00 - 13:00.
    • More on data structures. Lists: create, access/extract/subset, modify.
    • Packages: find, install, load, explore/find functions and documentation, get help on functions.
    • OUTCOME: Install the packages "ggplot2" (which provides the diamonds data frame) and "WriteXLS". Use them in a script that manipulates the diamonds data frame and writes it into an Excel file.

Slides day3
Exercises day3
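
A minimal sketch of the list operations named in Practicum III, using base R only. The list contents are invented for illustration. The package commands are left as comments because installation needs network access; `install.packages()`, `library()`, and `help(package = ...)` are standard R functions, and WriteXLS is the package named in the outcome above.

```r
# Create a list whose elements have different types.
sample_info <- list(id = "S1", counts = c(10, 20, 30), paired = TRUE)

# Access elements: $ and [[ ]] return the element, [ ] returns a sub-list.
sample_id <- sample_info$id
second_count <- sample_info[["counts"]][2]
sub_list <- sample_info["paired"]

# Modify, add, and remove elements.
sample_info$counts <- sample_info$counts * 2
sample_info$tissue <- "liver"
sample_info$paired <- NULL

print(names(sample_info))

# Packages (commented out; run once interactively):
# install.packages("WriteXLS")   # download and install from CRAN
# library(WriteXLS)              # load the package into the session
# help(package = "WriteXLS")     # list its documented functions
```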

  • PRACTICUM IV. Plots & Graphics in R. May 30. 10:00 - 13:00.
    • Basic plotting: scatter plots, box plots, histograms, density plots. Changing colors, point shapes, titles, labels, legend, axes, etc.
    • Introduction to ggplot2 package: structure of ggplot2 commands, scatter plots.
    • OUTCOME: Write a script that produces, customizes, and saves plots in files.

Slides day4
Exercises day4
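
As an illustration of the Practicum IV outcome, the sketch below produces, customizes, and saves a scatter plot with base R graphics. The data are simulated and the output file name is hypothetical. A PDF device is used here simply because it is available in every R installation.

```r
set.seed(1)                                   # reproducible simulated data (not course data)
x <- rnorm(50)
y <- 2 * x + rnorm(50)

out_file <- file.path(tempdir(), "scatter.pdf")   # hypothetical file name

pdf(out_file, width = 6, height = 4)          # open a PDF graphics device
plot(x, y,
     main = "Simulated scatter plot",         # title
     xlab = "x", ylab = "y",                  # axis labels
     col = "steelblue", pch = 19)             # colour and point shape
abline(lm(y ~ x), col = "red", lty = 2)       # add the least-squares line
legend("topleft", legend = "least-squares fit", col = "red", lty = 2)
invisible(dev.off())                          # close the device so the file is written
```

With the ggplot2 package installed, a comparable plot can be built from `ggplot()` and `geom_point()` and saved with `ggsave()`.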

MODULE I. Descriptive Statistics & Intro to Probability. June 6.

  • LECTURE. 10:00 - 13:00.
    • Exploratory data analysis and graphical displays.
    • Samples, measures of center and spread, percentiles, odds ratio.
    • Outliers and robustness.
    • Independence, conditional probability, Bayes formula.
    • Distributions, population mean and population variance, Binomial, Poisson, and Normal distribution.
    • Central Limit theorem and the Law of large numbers.
    • Continuity correction.
    • Sampling with and without replacement.
    • Correction for finite population size.
  • STATISTICAL TABLES

Lecture 1 slides.

  • PRACTICUM. 14:00 - 17:00.
    • Descriptive statistics.
    • Plots: Bar-plot, histogram, CDF, box-plot, scatter-plot, pie charts etc.
    • Distributions, population mean and population variance.

Download the zipped html-file for the practicum.
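
The kind of computation covered in this practicum can be sketched with simulated data. The exponential distribution, the sample sizes, and the seed below are arbitrary choices for illustration and do not come from the course materials. The second part simulates the Central Limit theorem from the lecture: means of repeated samples from a skewed distribution cluster around the population mean with spread close to sigma / sqrt(n).

```r
set.seed(42)
sample_data <- rexp(200, rate = 1)        # skewed sample; population mean and sd are both 1

# Measures of center and spread.
center <- c(mean = mean(sample_data), median = median(sample_data))
spread <- c(sd = sd(sample_data), iqr = IQR(sample_data))
quartiles <- quantile(sample_data, probs = c(0.25, 0.5, 0.75))
print(center); print(spread); print(quartiles)

# Central Limit theorem: distribution of the mean of n observations.
n <- 30
sample_means <- replicate(2000, mean(rexp(n, rate = 1)))
clt_summary <- c(mean_of_means = mean(sample_means),
                 sd_of_means   = sd(sample_means),
                 theoretical   = 1 / sqrt(n))
print(clt_summary)
```

For a right-skewed sample like this one, the mean is expected to lie above the median, which is one reason the lecture discusses robustness.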


MODULE II. Statistical Inference. June 8.

  • LECTURE. 10:00 - 13:00.
    • The concept of hypothesis testing, type I and type II error, false discovery rate.
    • Significance and confidence level, p-value.
    • One-sided and two-sided tests and confidence intervals.
    • Sampling distribution, estimators, standard error.
    • Normal probabilities in application to p-value.
    • One-sample and two-sample tests for independent and matched samples with known variance.
    • The case of unknown variance and Student t-distribution, assumption of normality.
    • Pooled variance and equal variances assumption.
    • Estimation of variance.
    • Fisher test for variance equality.
    • Non-parametric tests: Sign test, Wilcoxon rank-sum test (Mann-Whitney U test), Wilcoxon signed-rank test.
    • Chi-square test for goodness of fit, chi-square test for independence.
    • Sample size estimation.

Lecture 2 slides.

  • PRACTICUM. 14:00 - 17:00.
    • One- and two-sample tests with known and unknown variance.
    • Test for proportions.
    • Confidence intervals and t-distribution.
    • Fisher test.
    • Sample size estimation.

Download the zipped html-file for the practicum Part 1.
Download the zipped html-file for the practicum Part 2.
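
A rough sketch of how the Module II practicum topics map onto functions from R's built-in stats package. The data are simulated and all numbers (means, counts, effect size) are arbitrary. Reading "Fisher test" as the F test for equality of variances from the lecture is an assumption; it could also refer to Fisher's exact test (`fisher.test()`).

```r
set.seed(123)
control <- rnorm(20, mean = 10, sd = 2)     # simulated measurements
treated <- rnorm(20, mean = 12, sd = 2)

# One-sample t-test against a hypothesized mean (unknown variance).
one_sample <- t.test(control, mu = 10)

# Two-sample t-test with pooled variance (equal variances assumed).
two_sample <- t.test(treated, control, var.equal = TRUE)

# F test for equality of two variances (assumed meaning of "Fisher test").
var_check <- var.test(treated, control)

# Test for a proportion: 30 successes out of 100 against p = 0.25.
prop <- prop.test(x = 30, n = 100, p = 0.25)

# Confidence interval for the difference in means.
ci <- two_sample$conf.int

# Sample size per group to detect a difference of 2 (sd = 2) with 80% power.
size <- power.t.test(delta = 2, sd = 2, sig.level = 0.05, power = 0.8)

print(two_sample)
print(ceiling(size$n))
```

Each test returns an object of class `htest`, so the p-value, the estimate, and the confidence interval can be extracted by name (for example `two_sample$p.value`).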


MODULE III. Statistical modeling & Regression. June 9.

  • LECTURE. 10:00 - 13:00.
    • Simple linear regression model, residuals, degrees of freedom, least squares method, correlation coefficient, variance decomposition, coefficient of determination.
    • Interpretation of the slope, the correlation coefficient, and the coefficient of determination.
    • Standard error and statistical inference in simple linear regression model.
    • Analysis of variance (ANOVA). One-way and two-way ANOVA.
    • Beyond simple regression models: multiple regression, logistic regression.
    • Correction for multiple testing, family-wise error rate.

Lecture 3 slides.

  • PRACTICUM. 14:00 - 17:00.
    • QQ-plot.
    • Tests for normality.
    • Data transformation.
    • Non-parametric tests.
    • Problems on linear regression.
    • ANOVA.

Download the zipped html-file for the practicum Part 1.
Download the zipped html-file for the practicum Part 2.
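
The Module III practicum topics could be illustrated along these lines with base R. The data are simulated, and the choice of the Shapiro-Wilk test for normality, the Kruskal-Wallis test as the non-parametric example, and the Bonferroni correction are assumptions made for this sketch. The p-values passed to `p.adjust()` are made-up numbers.

```r
set.seed(7)
dose <- rep(c(1, 2, 3), each = 10)                 # three dose levels, 10 replicates each
response <- 5 + 1.5 * dose + rnorm(30, sd = 1)     # simulated linear relationship

# Simple linear regression.
fit <- lm(response ~ dose)
slope <- unname(coef(fit)["dose"])
r_squared <- summary(fit)$r.squared

# QQ-plot and normality test on the residuals.
res <- residuals(fit)
pdf(file = NULL)                                   # draw without creating a file
qqnorm(res); qqline(res)
invisible(dev.off())
normality <- shapiro.test(res)

# One-way ANOVA with dose as a factor, and a non-parametric alternative.
dose_f <- factor(dose)
anova_table <- summary(aov(response ~ dose_f))
kw <- kruskal.test(x = response, g = dose_f)

# Family-wise error rate control for several p-values (Bonferroni).
adjusted <- p.adjust(c(0.01, 0.02, 0.03, 0.5), method = "bonferroni")

print(coef(fit)); print(anova_table); print(adjusted)
```

Bonferroni multiplies each p-value by the number of tests and caps the result at 1, which is the simplest way to control the family-wise error rate mentioned in the lecture.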



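External Resources

  • Nature Web-collection "Statistics for Biologists": http://www.nature.com/collections/qghhqm
  • Self-paced online course from Harvard "Statistics and R": https://www.edx.org/course/data-analysis-life-sciences-1-statistics-harvardx-ph525-1x
  • Self-paced online course from Harvard "Statistical Inference and Modeling for High-throughput Experiments": https://www.edx.org/course/data-analysis-life-sciences-3-harvardx-ph525-3x
  • Seeing Theory, a website that visualizes the fundamental concepts covered in an introductory college statistics course, using D3.js: http://students.brown.edu/seeing-theory/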

Bioinformatics Core Facility @ CRG — 2011-2024