BIST Introduction to Statistics 2017

From Bioinformatics Core Wiki
Revision as of 10:14, 13 April 2017 by Jponomarenko


Description

This is an introductory course in statistics and R programming.
The R part is offered as 4 slow-paced practicums for absolute beginners, followed by 3 fast-paced practicums that accompany the statistics modules.
For practical exercises we will use the R programming language and RStudio.

The statistics material is offered in 3 consecutive modules (please see the Course Syllabus below), each consisting of a morning lecture and an afternoon practicum in a computer classroom. These practicums focus on using statistics in R; their purpose is to demonstrate and reinforce the concepts introduced in the lectures rather than to teach R programming.


Course Instructors


Dates, Time and Location

  • Module 0. Introduction to R. May 25, 26, 29, 30, 2017.
    • 10:00 - 13:00.
    • PRBB. Bioinformatics classroom (468). 4th floor. The hotel/North wing.
  • Modules I, II, III. Introduction to Statistics. June 6, 8, 9, 2017.
    • LECTURES.
      • 10:00 - 13:00.
      • PRBB. Ramon y Cajal.
    • PRACTICUMS.
      • 14:00 - 17:00.
      • PRBB. Bioinformatics classroom (468). 4th floor. The hotel/North wing.


Course Syllabus, Schedule, and Materials


MODULE 0. Introduction to R. May 25, 26, 29, 30.

  • PRACTICUM I. Intro to R and RStudio. May 25. 10:00 - 13:00.
    • Introduction to RStudio: explore environment variables; navigate the command history, the directory and file structure, the workspace, and files.
    • Simple arithmetic in R console.
    • Create and delete an object.
    • Introduction to data types and the "vector" data structure.
    • Create and run a short script.
    • Read and write a file.
    • OUTCOME: Write a script that creates (and enters) a directory, and writes a simple calculation into a file.
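
A minimal sketch of the kind of script this OUTCOME describes. The directory name `practicum1`, the file name `result.txt`, and the use of a temporary parent directory are assumptions made for illustration; they do not come from the course materials.

```r
# Hypothetical names chosen for illustration only.
parent   <- tempdir()                        # safe scratch location
work_dir <- file.path(parent, "practicum1")

# Create the directory (no warning if it already exists) and enter it.
dir.create(work_dir, showWarnings = FALSE)
old_wd <- setwd(work_dir)

# A simple calculation: the mean of a small numeric vector.
x      <- c(2, 4, 6, 8)
result <- mean(x)

# Write the result into a file, then read it back to check.
writeLines(paste("mean:", result), "result.txt")
saved <- readLines("result.txt")
print(saved)

# Return to the original working directory.
setwd(old_wd)
```

`setwd()` returns the previous working directory, which is why its value is kept in `old_wd` and restored at the end.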


  • PRACTICUM II. Data structures in R. May 26. 10:00 - 13:00.
    • More on vectors.
    • Matrices and data frames: create, access/extract/subset, modify, arithmetic, conversions, check and name dimensions.
    • OUTCOME: Produce a script that reads matrices and data frames, converts one into another, and makes calculations.
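
The matrix and data frame operations listed above might look roughly like the following sketch. It builds a small matrix in memory instead of reading one from a file, and all object names are illustrative assumptions.

```r
# A 2 x 3 matrix filled column by column, with named dimensions.
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))
dim(m)                  # check dimensions: 2 rows, 3 columns

# Convert the matrix into a data frame and add a computed column.
df <- as.data.frame(m)
df$total <- df$a + df$b + df$c

# Subset rows by a condition, then convert back to a matrix.
big <- df[df$total > 9, ]
m2  <- as.matrix(df)

# Column means of the numeric matrix.
col_means <- colMeans(m2)
print(col_means)
```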


  • PRACTICUM III. Lists & Packages. May 29. 10:00 - 13:00.
    • More on data structures. Lists: create, access/extract/subset, modify.
    • Packages: find, install, load, explore/find functions and documentation, get help on functions.
    • OUTCOME: Install the packages "ggplot2" (which provides the diamonds data frame) and "WriteXLS". Use them in a script that manipulates the diamonds data frame and writes it into an Excel file.
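
As a rough illustration of the list operations named above (create, access, modify), here is a short sketch in base R. The list contents are invented for the example, and the package part of the practicum is not covered here.

```r
# Create a list whose elements have different types.
person <- list(name = "Ada", scores = c(90, 85), passed = TRUE)

# Access elements: `$` and `[[ ]]` return the element itself,
# while `[ ]` returns a sub-list.
person$name
second_score <- person[["scores"]][2]
sub <- person["passed"]

# Modify: append a value to an element, then remove another element.
person$scores <- c(person$scores, 77)
person$passed <- NULL

str(person)             # compact overview of the resulting structure
```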


  • PRACTICUM IV. Plots & Graphics in R. May 30. 10:00 - 13:00.
    • Basic plotting: scatter plots, box plots, histograms, density plots. Changing colors, point shapes, titles, labels, legend, axes, etc.
    • Introduction to ggplot2 package: structure of ggplot2 commands, scatter plots.
    • OUTCOME: Write a script that produces, customizes, and saves plots in files.
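
A minimal sketch of producing, customizing, and saving a plot with base R graphics. It assumes the built-in `cars` data set as a stand-in for course data and writes to a temporary file; the file name is hypothetical.

```r
# Assumption: the built-in `cars` data set stands in for course data.
out_file <- file.path(tempdir(), "cars_scatter.pdf")

pdf(out_file, width = 6, height = 5)    # open a PDF graphics device
plot(cars$speed, cars$dist,
     main = "Stopping distance vs. speed",
     xlab = "Speed (mph)", ylab = "Distance (ft)",
     col = "steelblue", pch = 19)       # custom color and point shape
legend("topleft", legend = "cars", col = "steelblue", pch = 19)
dev.off()                               # close the device to write the file
```

The plot only reaches the file once `dev.off()` closes the device, so forgetting that call is a common reason for empty or unreadable output files.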


MODULE I. Descriptive Statistics & Intro to Probability. June 6.

  • LECTURE. 10:00 - 13:00.
    • Exploratory data analysis and graphical displays.
    • Samples, measures of center and spread, percentiles, odds ratio.
    • Outliers and robustness.
    • Independence, conditional probability, Bayes formula.
    • Distributions, population mean and population variance, Binomial, Poisson, and Normal distribution.
    • Central Limit theorem and the Law of large numbers.
    • Continuity correction.
    • Sampling with and without replacement.
    • Correction for finite population size.
  • STATISTICAL TABLES


  • PRACTICUM. 14:00 - 17:00.
    • Descriptive statistics.
    • Plots: bar plot, histogram, CDF, box plot, scatter plot, pie chart, etc.
    • Independence, conditional probability, Bayes formula.
    • Distributions, population mean and population variance.
    • Central Limit theorem and the Law of large numbers.
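
The law of large numbers and the central limit theorem from the list above can be explored by simulation. The sketch below is one possible approach, not the course's own exercise; the choice of an Exponential(1) population, the sample size, and the number of replications are assumptions.

```r
set.seed(1)  # fixed seed so the simulation is reproducible

# Law of large numbers: the running mean of draws from a skewed
# Exponential(rate = 1) population (mean 1, variance 1) settles near 1.
x <- rexp(10000, rate = 1)
running_mean <- cumsum(x) / seq_along(x)
running_mean[c(10, 100, 1000, 10000)]

# Central limit theorem: means of many samples of size n are roughly
# normal, centered at 1, with standard deviation close to 1 / sqrt(n).
n <- 30
sample_means <- replicate(5000, mean(rexp(n, rate = 1)))
mean(sample_means)
sd(sample_means)
1 / sqrt(n)
```

Calling `hist(sample_means)` afterwards gives a visual check that the distribution of the means is close to bell-shaped even though the population is strongly skewed.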


MODULE II. Statistical Inference. June 8.

  • LECTURE. 10:00 - 13:00.
    • The concept of hypothesis testing, type I and type II error, false discovery rate.
    • Significance and confidence level, p-value.
    • One-sided and two-sided tests and confidence intervals.
    • Sampling distribution, estimators, standard error.
    • Normal probabilities in application to p-value.
    • One-sample and two-sample tests for independent and matched samples with known variance.
    • The case of unknown variance and Student t-distribution, assumption of normality.
    • Pooled variance and equal variances assumption.
    • Estimation of variance.
    • Fisher's F-test for equality of variances.
    • Non-parametric tests: sign test, Wilcoxon rank-sum test (Mann-Whitney U-test), Wilcoxon signed-rank test.
    • Chi-square test for goodness of fit, chi-square test for independence.
    • Sample size estimation.


  • PRACTICUM. 14:00 - 17:00.
    • One- and two-sample tests with known and unknown variance, test for proportions, simulation involving confidence intervals and t-distribution.
    • Non-parametric tests.
    • Kolmogorov-Smirnov (KS) test.
    • Shapiro-Wilk test for normality.
    • QQ-plot.
    • Data transformation.
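
Several of the practicum topics above map onto functions in base R's `stats` package. The following sketch uses simulated data, which is an assumption for illustration; the sample sizes, means, and number of replications are arbitrary choices.

```r
set.seed(42)

# Simulated data (an assumption for illustration): two independent
# normal samples whose true means differ by 1.
a <- rnorm(25, mean = 10, sd = 2)
b <- rnorm(25, mean = 11, sd = 2)

# Two-sample t-test with unknown variance (Welch's test is R's default).
tt <- t.test(a, b)
tt$p.value
tt$conf.int             # 95% confidence interval for the mean difference

# Non-parametric alternative: Wilcoxon rank-sum (Mann-Whitney) test.
wt <- wilcox.test(a, b)

# Shapiro-Wilk normality check for one sample.
sw <- shapiro.test(a)

# Coverage simulation: how often does a 95% t-interval contain the
# true mean of 10?
covered <- replicate(2000, {
  s  <- rnorm(15, mean = 10, sd = 2)
  ci <- t.test(s)$conf.int
  ci[1] <= 10 && 10 <= ci[2]
})
mean(covered)           # should be near 0.95
```

Passing `var.equal = TRUE` to `t.test()` switches from Welch's test to the pooled-variance version discussed in the lecture.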


MODULE III. Statistical modeling & Regression. June 9.

  • LECTURE. 10:00 - 13:00.
    • Simple linear regression model, residuals, degrees of freedom, least squares method, correlation coefficient, variance decomposition, coefficient of determination.
    • Interpretation of the slope, the correlation coefficient, and the coefficient of determination.
    • Standard error and statistical inference in the simple linear regression model.
    • Analysis of variance (ANOVA). One-way and two-way ANOVA.
    • Beyond simple regression models: multiple regression, logistic regression.
    • Correction for multiple testing, family-wise error rate.


  • PRACTICUM. 14:00 - 17:00.
    • Problems on linear regression.
    • ANOVA.
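
A short sketch of a linear regression and a one-way ANOVA in base R. It assumes the built-in `cars` and `PlantGrowth` data sets as stand-ins for the course's own problem sets.

```r
# Simple linear regression on the built-in `cars` data set.
fit <- lm(dist ~ speed, data = cars)
coef(fit)                        # intercept and slope
r2 <- summary(fit)$r.squared     # coefficient of determination

# In simple regression, R^2 equals the squared correlation coefficient.
all.equal(r2, cor(cars$speed, cars$dist)^2)

# One-way ANOVA on the built-in `PlantGrowth` data set:
# does mean plant weight differ between treatment groups?
aov_fit <- aov(weight ~ group, data = PlantGrowth)
summary(aov_fit)
```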



External Resources
