Difference between revisions of "Useful Links"

From Bioinformatics Core Wiki
List of useful links
+
__TOC__
 +
 
 +
== NGS data analysis Protocols, Methods & Tools ==
 +
 
 +
==== General ====
 +
* [https://www.illumina.com/content/dam/illumina-marketing/documents/applications/ngs-library-prep/ForAllYouSeqMethods.pdf Illumina poster of NGS methods, 2015]
 +
* [http://www.nature.com/nrg/journal/v18/n8/full/nrg.2017.44.html Reference standards for next-generation sequencing, Nature Reviews Genetics, 2017]
 +
* [http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html Bioinformatics for Next Generation Sequencing virtual (constantly updated) issue of ''Bioinformatics''].
 +
* [https://liorpachter.wordpress.com/seq/ Current list of *seq assays from liorpachter.wordpress.com]
 +
* [http://www.sciencedirect.com/science/article/pii/S2212066116300230 Standardization and quality management in next-generation sequencing, Applied & Translational Genomics, 2016]
 +
* [http://www.illumina.com/content/dam/illumina-marketing/documents/products/research_reviews/sequencing-methods-review.pdf Review of NGS methods by Illumina]
 +
* [https://genohub.com/next-generation-sequencing-guide/ NGS Guide from Genohub]
 +
* [https://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/technote_coverage_calculation.pdf Illumina's guide on how to estimate sequencing coverage]
 +
* [http://support.illumina.com/downloads/sequencing_coverage_calculator.html Illumina's online sequencing coverage calculator]
 +
* [https://www.encodeproject.org/data-standards/ ENCODE experimental guidelines and data analysis standards]
 +
* [https://genohub.com/recommended-sequencing-coverage-by-application/ Coverage and Read Depth Recommendations by Sequencing Application from Genohub]
 +
 
 +
 
 +
==== QC ====
 +
* [https://www.illumina.com/documents/products/technotes/technote_Q-Scores.pdf Illumina's Quality Scores for Next-Generation Sequencing explained]
 +
* [http://multiqc.info/ '''MultiQC''' - Python package that aggregates results from bioinformatics analyses across many samples into a single report]
 +
* [https://github.com/bartongroup/AlmostSignificant '''AlmostSignificant''' - open-source software designed to simplify the aggregation of quality statistics for sequencing runs from Illumina MiSeq, NextSeq and HiSeq machines.] It includes data produced as part of the bcl2fastq Illumina pipeline (e.g. cluster density and lane de-multiplexing) and sample meta-data rather than being limited to read data only. [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5167069/ Link to the paper]
 +
* [http://qualimap.bioinfo.cipf.es '''QualiMap''' - QC tool (GUI and command line) for SAM/BAM files for WES, WGS, RNA-seq, ChIP-seq. Supports multi-sample comparison of alignment and counts]
 +
* [http://rseqc.sourceforge.net '''RSeQC''' - RNA-seq Quality Control Package] for inspecting sequence quality, nucleotide composition bias, PCR bias and GC bias, while RNA-seq specific modules evaluate sequencing saturation, mapped reads distribution, coverage uniformity, strand specificity, transcript level RNA integrity etc.
 +
 
 +
 
 +
==== RNA-seq ====
 +
===== How many replicates? =====
 +
* [https://www.ncbi.nlm.nih.gov/pubmed/27022035 How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA, 2016.]
 +
* [https://www.ncbi.nlm.nih.gov/pubmed/22985019 Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing. BMC Genomics, 2012.]
 +
* [https://www.ncbi.nlm.nih.gov/pubmed/?term=26206307 Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment. Bioinformatics, 2015.]
 +
* [https://www.ncbi.nlm.nih.gov/pubmed/25246651 Power analysis and sample size estimation for RNA-Seq differential expression. RNA, 2014.]
 +
* [http://scotty.genetics.utah.edu/scotty.php '''Scotty''' - Power Analysis for RNA-seq Experiments]. It answers the question, "How many reads do we need to sequence?"
 +
 
 +
===== Approaches and benchmarks =====
 +
* [https://www.ncbi.nlm.nih.gov/pubmed/26732976 Comparison of normalization and differential expression analyses using RNA-Seq data from 726 individual Drosophila melanogaster. BMC Genomics, 2016.]
 +
* [https://www.ncbi.nlm.nih.gov/pubmed/26511205 Comparing the normalization methods for the differential analysis of Illumina high-throughput RNA-Seq data. BMC Bioinformatics, 2015.]
 +
* [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4728800/ A survey of best practices for RNA-seq data analysis. Genome Biology, 2016.]
 +
* [https://www.ncbi.nlm.nih.gov/pubmed/28484260 Benchmarking of RNA-sequencing analysis workflows using whole-transcriptome RT-qPCR expression data. Scientific Reports, 2017]
 +
* [https://www.ncbi.nlm.nih.gov/pubmed/24185836 Systematic evaluation of spliced alignment programs for RNA-seq data, Nature Methods, 2013]
 +
* [https://www.ncbi.nlm.nih.gov/pubmed/27043002 '''kallisto''' - Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology, 2016]
 +
* [https://www.ncbi.nlm.nih.gov/pubmed/28263959 '''Salmon''' provides fast and bias-aware quantification of transcript expression. Nature Methods, 2017]
 +
* On batch effect in RNA-seq: http://f1000research.com/articles/4-121/v1 (This is a critique of the original article http://www.ncbi.nlm.nih.gov/pubmed/25413365)
 +
** The authors used the ‘ComBat’ function from the sva package v3.12.0, with a model that includes effects for batch, species and tissue. The R code is provided.
 +
** In the comments on f1000 site it was noted that it is incorrect to use correlation as a measure of association between the logged gene expression levels of different tissues; proportionality is suggested: Lovell, D., Pawlowsky-Glahn, V., Egozcue, J. J., Marguerat, S., & Bähler, J. (2015). Proportionality: A Valid Alternative to Correlation for Relative Data. PLoS Comput Biol, 11(3), e1004075. http://doi.org/10.1371/journal.pcbi.1004075
 +
** A similar case was reported previously: http://simplystatistics.org/2015/05/20/is-it-species-or-is-it-batch-they-are-confounded-so-we-cant-know/
 +
 
 +
 
 +
==== Single cell RNA-seq ====
 +
* [https://hemberg-lab.github.io/scRNA.seq.course scRNA-seq analysis course from Hemberg's group, Sanger.]
 +
 
 +
 
 +
==== ChIP-seq ====
 +
* [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3431496/ ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res, 2012.]
 +
* [https://deeptools.readthedocs.io/en/develop/content/list_of_tools.html deepTools, including plotFingerprint] that addresses the question "Did my ChIP-seq work?" by sampling indexed BAM files and plotting a profile of cumulative read coverages for each file.
 +
* [https://academic.oup.com/bib/article-lookup/doi/10.1093/bib/bbw035 Features that define the best ChIP-seq peak calling algorithms.] ''Briefings in Bioinformatics'', 2016. - Benchmarking of peak-calling algorithms.
 +
* [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003326 Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data. ''PLOS Comp Biol'', 2013]:
 +
** "For mammalian transcription factors (TFs) and chromatin modifications such as enhancer-associated histone marks, which are typically localized at specific, narrow sites and have on the order of thousands of binding sites, '''20 million reads''' may be adequate (4 million reads for worm and fly TFs)."
 +
** "Proteins with more binding sites (e.g., RNA Pol II) or broader factors, including most histone marks, will require more reads, up to '''60 million for mammalian ChIP-seq'''."
 +
** "Importantly, control samples should be sequenced significantly deeper than the ChIP ones in a TF experiment and in experiments involving diffused broad-domain chromatin data. This is to ensure sufficient coverage of a substantial portion of the genome and non-repetitive autosomal DNA regions."
 +
** "To ensure that the chosen '''sequencing depth''' was adequate, a '''saturation analysis''' is recommended—the peaks called should be consistent when the next two steps (read mapping and peak calling) are performed on increasing numbers of reads chosen at random from the actual reads. Saturation analysis is built into some peak callers (e.g., [https://github.com/hms-dbmi/spp '''SPP''', an R package for analysis of ChIP-seq and other functional sequencing data] ). If this shows that the number of reads is not adequate, reads from technical replicate experiments can be combined."
 +
** "To avoid over-sequencing and estimate an optimal sequencing depth, it is important to take into account '''library complexity'''. Several tools are available for this purpose: the [http://smithlabresearch.org/software/preseq/ '''Preseq package'''] allows users to predict the number of redundant reads from a given sequencing depth and how many will be expected from additional sequencing."
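
The saturation analysis quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: <code>toy_call_peaks</code> is a hypothetical stand-in for a real peak caller (it simply flags bins with many read starts), the reads are simulated, and the function names are not taken from SPP or any other tool. The idea is to call peaks on increasing random subsets of the reads and check whether the peak set stabilises before the full depth is reached.

```python
import random
from collections import Counter


def toy_call_peaks(reads, bin_size=100, min_reads=5):
    """Hypothetical stand-in for a peak caller: a bin counts as a 'peak'
    if at least min_reads read start positions fall into it."""
    counts = Counter(pos // bin_size for pos in reads)
    return {b for b, n in counts.items() if n >= min_reads}


def saturation_curve(reads, call_peaks, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Call peaks on increasing random subsets of the reads and report, for each
    fraction, the number of peaks and the share of full-depth peaks recovered."""
    rng = random.Random(seed)
    shuffled = list(reads)
    rng.shuffle(shuffled)  # prefixes of a shuffled list are nested random subsets
    full_peaks = call_peaks(shuffled)
    curve = []
    for frac in fractions:
        n = int(len(shuffled) * frac)
        peaks = call_peaks(shuffled[:n])
        recovered = len(peaks & full_peaks) / len(full_peaks) if full_peaks else 0.0
        curve.append((frac, len(peaks), recovered))
    return curve


# Simulated read start positions: three enriched sites on a uniform background.
sim = random.Random(1)
reads = [sim.randint(0, 99_999) for _ in range(2000)]
for site in (10_050, 40_050, 70_050):
    reads += [site + sim.randint(-40, 40) for _ in range(200)]

for frac, n_peaks, recovered in saturation_curve(reads, toy_call_peaks):
    print(f"{frac:.2f} of reads: {n_peaks} peaks, {recovered:.0%} of full-depth peaks")
```

If the recovered share is still climbing steeply at the largest fraction, the library is probably not saturated and more reads (for example, pooled technical replicates) may be needed. With a real peak caller the same loop applies, but read mapping and peak calling would be rerun on each subset.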
 +
 
 +
* [http://ccg.vital-it.ch/chipseq/ '''ChIP-Seq Web Server''' from the Swiss Institute of Bioinformatics]. The publication in BMC Genomics, 2016, is [https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-3288-8 available here].
 +
* [http://www.ngs-qc.org/ '''NGS-QC''' - Quality control tool for comparative analysis of ChIP-seq and other enrichment-related assays (tool is in Galaxy) and Database of QC for published datasets and Certified Antibodies]
 +
* [https://github.com/songlab/chance '''CHANCE''' - comprehensive software for quality control and validation of ChIP-seq data.] The paper is available [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053734/ here]. See also [http://deeptools.readthedocs.io/en/latest/ '''deepTools'''], which can also be used for mapping QC of RNA-seq and provides an API; the paper, published in NAR, 2016, is [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4987876/ available here].
 +
 
 +
* [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5142015/ A comprehensive comparison of tools for differential ChIP-seq analysis. Briefings in Bioinformatics, 2016.]
 +
** [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5142015/figure/bbv110-F7/ Decision tree indicating the proper choice of tool depending on the data set]
 +
 
 +
* [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5416852/ HMCan-diff - a method for the analysis of ChIP-seq data to detect changes in histone modifications between two cancer samples of different genetic backgrounds, or between a cancer sample and a normal control. NAR, 2017.]
 +
* [https://www.ncbi.nlm.nih.gov/pubmed/24021381 HMCan - a method for analysis of ChIP-seq and ATAC-seq data of cancer samples. Bioinformatics, 2013.] HMCan corrects for the GC-content and copy number bias and then applies Hidden Markov Models to detect the signal from the corrected data.
 +
 
 +
==== ChIRP-seq ====
 +
* [https://en.wikipedia.org/wiki/ChiRP-Seq ChiRP-Seq Wikipedia]
 +
* [https://www.nature.com/articles/nature12210 Functional roles of enhancer RNAs for oestrogen-dependent transcriptional activation. Nature. 2013]
 +
 
 +
==== Genome/Transcriptome Assembly ====
 +
* [http://www.nesc.ac.uk/talks/1104/OPTIMALITY%20CRITERIA%20for%20Transcriptome%20de%20novo%20Assembly2.pdf Optimality Criteria for ''de novo'' Transcriptome Assembly, 2010].
 +
* [http://onlinelibrary.wiley.com/doi/10.1111/eva.12178/full A field guide to whole-genome sequencing, assembly and annotation, 2014]
 +
* [http://www.nature.com/nmeth/journal/v9/n4/full/nmeth.1935.html De novo genome assembly: what every biologist should know. Nature Methods, 2012.]
 +
* [https://github.com/Kingsford-Group/scallop Scallop - a reference-based transcript assembler that improves reconstruction of multi-exon and lowly expressed transcripts. Nature Biotech, 2018]. A parameter advisor for Scallop is [https://github.com/Kingsford-Group/scallopadvising available on GitHub]; it automatically chooses input-specific parameter values for reference-based transcript assembly.
 +
 
 +
==== Gene Set Enrichment Analysis (GSEA) and other post-processing analysis ====
 +
* [http://amp.pharm.mssm.edu/Enrichr/ Enrichr - a comprehensive gene set enrichment analysis web server].
 +
* [https://david-d.ncifcrf.gov DAVID - Integrated biological knowledgebase and analytic tools for extracting biological meaning from large gene/protein lists].
 +
* [http://homer.salk.edu/homer/ '''HOMER''' (Hypergeometric Optimization of Motif EnRichment) is a suite of tools for Motif Discovery and next-gen sequencing analysis.]  It is a collection of command line programs for unix-style operating systems written in Perl and C++. HOMER was primarily written as a de novo motif discovery algorithm and is well suited for finding 8-20 bp motifs in large scale genomics data.  HOMER contains many useful tools for analyzing ChIP-Seq, GRO-Seq, RNA-Seq, DNase-Seq, Hi-C and numerous other types of functional genomics sequencing data sets.
 +
* [http://www2.heatmapper.ca/expression/ Heatmapper - a free online application for visualizing various types of data as heat maps.]
 +
* [http://bioinformatics.psb.ugent.be/cgi-bin/liste/Venn/calculate_venn.htpl Online tool to draw Venn diagrams with up to 5 sets.]
 +
 
 +
 
 +
==== NGS other ====
 +
* Nanopore (MinION) de novo bacterial genome sequencing [http://www.nature.com/articles/nmeth.3444.epdf?shared_access_token=mg6EbQi1EKQo5XSC_SapCdRgN0jAjWel9jnR3ZoTv0OLFjUNGC4TzsY6VhV1jmGnEXm6K-NwgIyYpW4CJg08E4ETejeOEIUdCJnalRlsBbpo_aYkBpwz7npVIG-KMZbqHPkJrclUz_ymxzYrvB14TCbXOQ6d6FRJN3dAEQ9QJyzJ555FEWiHeVFHRRzWHmHy5D7FcI-9hK1g5H79K3eykg%3D%3D]
 +
** "Many bacterial genomes can be assembled into single contigs if reads longer than 7 kb are available, as these reads span the conserved rRNA operon, which is typically the longest repeat sequence in a bacterial genome."
 +
** "Recent versions of nanopore chemistry (R7.3) coupled with the latest base caller (Metrichor versions 1.9 and later) permit read-level accuracies of 78–85% (refs. 1,8). Although this is slightly lower than accuracies achieved by the latest version of Pacific Biosciences chemistry."
 +
** Two-dimensional reads from four separate MinION runs using R7.3 chemistry were combined. In total, 22,270 2D reads were used comprising 133.6 Mb of read data, representing ~29× theoretical coverage of the 4.6-Mb E. coli K-12 MG1655 reference genome.
 +
** Potential overlaps between the reads were detected using the DALIGNER software. Each read and its overlapped reads were used as input to the partial-order alignment (POA) software, which iteratively computes the consensus sequence. The read error-correction software, Nanocorrect, is available at https://github.com/jts/nanocorrect/.
 +
** The reads resulting from two rounds of correction were used as input to version 8.2 of the Celera Assembler. This resulted in a highly contiguous assembly with three contigs, the largest being 4.6 Mb long and covering the entire E. coli chromosome.
 +
** The authors implemented an algorithm that uses the electric current signal to compute an improved consensus sequence for the assembly. This improved the base-level accuracy to 99.5%; the remaining errors comprised 1,202 mismatches (26 per 100 kb) and 17,241 indels of ≥1 base (371 errors per 100 kb). The signal-level consensus software, Nanopolish, is available at https://github.com/jts/nanopolish/.
 +
** The complete pipeline used to generate the assembly, including downloading the input data and required software, is provided as a Makefile on GitHub at https://github.com/jts/nanopore-paper-analysis/blob/master/full-pipeline.make. Additional scripts used to analyze the assembly are provided in the same repository. An IPython notebook documenting the analysis workflow is also provided.
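
The coverage figure quoted above follows from simple arithmetic: theoretical coverage is the total number of sequenced bases divided by the genome size. The short Python sketch below reproduces it from the numbers in the text; the function names are illustrative only, and the genome size is the rounded 4.6 Mb given above, so per-100-kb rates computed this way can differ slightly from the published ones.

```python
def fold_coverage(total_bases, genome_size):
    """Theoretical coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


def per_100kb(count, genome_size):
    """Express an event count (e.g. mismatches) as a rate per 100 kb of genome."""
    return count / (genome_size / 100_000)


# Figures quoted in the notes above (genome size rounded to 4.6 Mb).
total_bases = 133.6e6   # 133.6 Mb of 2D read data
genome_size = 4.6e6     # E. coli K-12 MG1655, rounded
n_reads = 22_270
n_mismatches = 1_202

print(f"theoretical coverage: {fold_coverage(total_bases, genome_size):.1f}x")
print(f"mean read length: {total_bases / n_reads:.0f} bp")
print(f"mismatches per 100 kb: {per_100kb(n_mismatches, genome_size):.1f}")
```

The same two functions can be used to plan an experiment in the other direction: multiplying a target coverage by the genome size gives the number of bases that must be sequenced.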
 +
 
 +
== Biology ==
 +
* [http://www.cell.com/cell/abstract/S0092-8674(14)00338-9?_returnURL=http%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS0092867414003389%3Fshowall%3Dtrue The Noncoding RNA Revolution—Trashing Old Rules to Forge New Ones, Review, Cell, 2014]
 +
 
 +
 
 +
== Data Science ==
 +
* [http://europepmc.org/articles/PMC4619002;jsessionid=B0B361BFFC625C32FCD7E6BD6C0E1C1D Ten Simple Rules for Experiments’ Provenance.] PLoS Computational Biology, 2015.
 +
* [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285 Ten simple rules for reproducible computational research.] PLoS Computational Biology, 2013.
 +
* [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003542 10 simple rules for the care and feeding of scientific data.] PLoS Computational Biology, 2014.
 +
* [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005399 Ten simple rules for responsible big data research.] PLoS Computational Biology, 2017.
 +
* [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005128 Ten Simple Rules for Developing Public Biological Databases.] PLoS Computational Biology, 2016.
 +
* [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005097 Ten Simple Rules for Digital Data Storage.] PLoS Computational Biology, 2016.
 +
* [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004947 Ten Simple Rules for Taking Advantage of Git and GitHub.] PLoS Computational Biology, 2016.
 +
* [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004743 Ten Simple Rules for Selecting a Bio-ontology.] PLoS Computational Biology, 2016.
 +
* [http://tools.medialab.sciences-po.fr/iwanthue/ Online tool for making color palettes]
 +
 
 +
== Statistics ==
 +
 
 +
* Nature Web-collection "Statistics for Biologists": http://www.nature.com/collections/qghhqm
 +
* 100 Statistical Tests.pdf - ResearchGate - just search Google to get a link
 +
* http://students.brown.edu/seeing-theory/ The Seeing Theory website visualizes the fundamental concepts covered in an introductory college statistics course, using D3.js.
 +
 
 +
 
 +
==== Experimental Design ====
 +
* [https://rawgit.com/bioinformatics-core-shared-training/experimental-design/master/ExperimentalDesignManual.pdf Experimental design manual from U. of Cambridge, 2014]
 +
* [https://eda.nc3rs.org.uk/experimental-design Guide and tool for design and analysis of biological experiments from the UK's National Centre for the Replacement, Refinement and Reduction of Animals in Research (NC3Rs)], covering control for confounding variables, sample size, effect size, standardised effect size, power of statistical tests, and multiple testing.
 +
* [http://www.sample-size.net/ Sample/effect size online calculators for designing biomedical experiments from UC San Francisco]
 +
 
 +
 
 +
==== Statistical Rituals & Statistical Power ====
 +
* [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5367316/ Low statistical power in biomedical science: a review of three human research domains. R Soc Open Sci. 2017]
 +
* [http://muscle.ucsd.edu/More_HTML/papers/pdf/Lieber_JOR_1990.pdf Statistical Significance and Statistical Power in Hypothesis Testing by R. Lieber, 1990]
 +
* [https://www.statisticsdonewrong.com/power.html Statistical power and underpowered statistics by Alex Reinhart]
 +
* [https://emj.bmj.com/content/20/5/453 An introduction to power and sample size estimation. Emergency Medicine Journal, 2004]
 +
* [http://library.mpib-berlin.mpg.de/ft/gg/GG_Mindless_2004.pdf Mindless statistics by G. Gigerenzer. Journal of Socio-Economics, 2004]
 +
 
 +
 
 +
==== Online Resources & Courses ====
 +
* Self-paced online UC Berkeley courses
 +
** https://www.edx.org/course/introduction-statistics-descriptive-uc-berkeleyx-stat2-1x
 +
** https://www.edx.org/course/introduction-statistics-probability-uc-berkeleyx-stat2-2x
 +
** https://www.edx.org/course/introduction-statistics-inference-uc-berkeleyx-stat2-3x
 +
* Online book recommended by the above courses http://www.stat.berkeley.edu/~stark/SticiGui/
 +
* [https://www.edx.org/course/explore-statistics-r-kix-kiexplorx-0 Self-paced online course "Explore Statistics with R" from EdX.org]
 +
* [https://www.edx.org/course/introduction-r-data-science-microsoft-dat204x-3 Self-paced online course "Introduction to R for Data Science" from Microsoft]
 +
* [https://www.edx.org/course/programming-r-data-science-microsoft-dat209x-2 Self-paced online course "Programming with R for Data Science" from Microsoft]
 +
* [https://www.edx.org/course/introduction-data-analysis-using-excel-microsoft-dat205x-0 Self-paced online course "Introduction to Data Analysis using Excel" from Microsoft]
 +
* [https://www.edx.org/course/statistics-r-harvardx-ph525-1x Self-paced online course "Statistics and R" from Harvard]
 +
* [https://www.edx.org/course/high-dimensional-data-analysis-harvardx-ph525-4x Self-paced online course "High-Dimensional Data Analysis" from Harvard], covering dimensionality reduction, factor analysis, batch effect, clustering and focused on genomics applications
 +
* [https://www.edx.org/course/statistical-inference-modeling-high-harvardx-ph525-3x Self-paced online course "Statistical Inference and Modeling for High-throughput Experiments" from Harvard, covering multiple testing problem, error rates, error rate controlling procedures, false discovery rates, q-values and exploratory analysis of genomics data]
 +
* [https://www.edx.org/course/biostatistics-big-data-applications-utmbx-stat101x Self-paced online course "Biostatistics for Big Data Applications" from EdX.org]
 +
* "An Introduction to Statistical Learning with Applications in R" from Stanford http://www-bcf.usc.edu/~gareth/ISL/
 +
 
 +
==== Comparison of two samples ====
 +
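* The t-test, paired or unpaired. In R, >t.test(x,y, paired=TRUE) for paired samples, or >t.test(x,y, var.equal=TRUE) for unpaired samples. Student's t-test provides an exact test for the equality of the means of two normal populations with unknown, but equal, variances; without var.equal=TRUE, R's t.test runs Welch's test, which does not assume equal variances. Equality of variances can be checked with the F-test, in R >var.test(x,y). https://en.wikipedia.org/wiki/Student's_t-test#Paired_samples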
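
To make the quantities behind these R calls concrete, here is a minimal Python sketch using only the standard library. It computes the pooled-variance (Student) t statistic, the paired t statistic and the variance ratio used by the F-test. The function names are illustrative, the data are made up, and no p-values are computed, since those need the t and F distributions that R's t.test and var.test provide.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(x, y):
    """Student's two-sample t statistic (equal variances assumed) and its
    degrees of freedom."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    t = (mean(x) - mean(y)) / sqrt(pooled_var * (1 / nx + 1 / ny))
    return t, nx + ny - 2


def paired_t_statistic(x, y):
    """Paired t statistic: a one-sample t on the pairwise differences.
    Undefined (division by zero) if all differences are identical."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    return mean(diffs) / sqrt(variance(diffs) / n), n - 1


def variance_ratio(x, y):
    """Ratio of sample variances, the statistic behind the F-test."""
    return variance(x) / variance(y)


# Made-up example measurements for two conditions.
before = [1.0, 2.0, 3.0, 4.0, 5.0]
after = [1.0, 1.0, 2.0, 2.0, 3.0]

print("pooled t, df:", pooled_t_statistic(before, after))
print("paired t, df:", paired_t_statistic(before, after))
print("variance ratio:", variance_ratio(before, after))
```

A variance ratio far from 1 suggests that the equal-variance assumption of the pooled test is doubtful, in which case Welch's test is the safer choice.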
 +
 
 +
* Non-parametric tests. No assumption about variances and normality.
 +
** Independent samples. The Wilcoxon rank-sum test, aka Mann-Whitney test. https://en.wikipedia.org/wiki/Mann–Whitney_U_test. In R, >wilcox.test(x,y). H0: the two samples come from the same distribution, so that values from one sample are no more likely to be larger than values from the other.
 +
** Paired samples. The Wilcoxon signed-rank test. In R, >wilcox.test(x,y, paired=TRUE). See http://vassarstats.net/textbook/ch12a.html.
 +
** The Kolmogorov-Smirnov test. In R, >ks.test(x,y). https://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test. If two samples have the same mean but different variance and/or shape of distribution, this test can spot it. It is sensitive to such differences, which the Wilcoxon test can miss, but it is generally less powerful for a pure shift in location. The statistic is the maximum absolute difference between the two sample cumulative distribution functions. See http://www.physics.csbsju.edu/stats/KS-test.html.
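
The two statistics described above are easy to compute by hand, which helps to see what each test responds to. The Python sketch below uses only the standard library; the function names and the example data are illustrative, and p-values are left to R's ks.test and wilcox.test. The example uses two samples with the same centre but different spread.

```python
from bisect import bisect_right


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    between the two empirical cumulative distribution functions."""
    xs, ys = sorted(x), sorted(y)
    gap = 0.0
    for v in xs + ys:  # the gap can only change at observed values
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(fx - fy))
    return gap


def mann_whitney_u(x, y):
    """U statistic for x: the number of (x, y) pairs with x > y,
    counting ties as one half."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


# Same centre, different spread (made-up values).
narrow = [-1.0, -0.5, 0.0, 0.5, 1.0]
wide = [-10.0, -5.0, 0.0, 5.0, 10.0]

print("KS statistic:", ks_statistic(narrow, wide))
print("Mann-Whitney U:", mann_whitney_u(narrow, wide),
      "out of", len(narrow) * len(wide), "pairs")
```

In this example U sits at half the number of pairs, which is its expected value when neither sample tends to be larger, while the KS statistic is well above zero because the two cumulative distributions differ in spread.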
 +
 
 +
==== Comparison of two microbiome samples ====
 +
* New (2012) biostatistical methods for the analysis of microbiome data based on a fully parametric approach using all the data. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3527355/#pone.0052078-LaRosa1
 +
** The use of a fully parametric model for these data has the benefit over alternative non-parametric approaches such as bootstrapping and permutation testing, in that this model is able to retain more information contained in the data.
 +
** R package "HMP" is available. http://cran.r-project.org/web/packages/HMP/HMP.pdf. To install it from CRAN: > install.packages("HMP")
 +
 
 +
==== Other topics ====
 +
* [https://academic.oup.com/ije/article/34/1/215/638499/Regression-to-the-mean-what-it-is-and-how-to-deal Regression to the mean: what it is and how to deal with it. Int J Epidemiol, 2004]
 +
 
 +
 
 +
== Linux ==
 +
* [https://github.com/crazyhottommy/scripts-general-use/blob/master/Shell/bioinformatics_one_liner.md One liners for Bioinformatics]
 +
* [http://www.grymoire.com/Unix/Awk.html Awk and other Linux stuff by Bruce Barnett]
 +
* [https://www.tutorialspoint.com/awk/index.htm Awk Tutorial from Tutorialspoint]
 +
 
 +
== Online Courses & Materials ==
 +
* Ten rules for online learning: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002631
 +
 
 +
* [https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf "Official" R introduction]
 +
 
 +
* [https://www.coursera.org Coursera - Thousands of online courses and certified specializations]
 +
* [https://www.udacity.com Udacity - Computer Science oriented online courses and nano-degrees]
 +
* [https://www.edx.org EdX - Online courses from Microsoft, MIT, Harvard, and other well-established institutions]
 +
 
 +
 
 +
* Self-paced online courses from [https://www.edx.org EdX.org]:
 +
** [https://www.edx.org/course/introduction-linux-linuxfoundationx-lfs101x-2 Intro to Linux]
 +
** [https://www.edx.org/course/introduction-r-programming-microsoft-dat204x-0 Intro to R programming]
 +
** [https://www.edx.org/course/programming-r-data-science-microsoft-dat209x-1 Programming with R for Data Science]
 +
** [https://www.edx.org/course/introduction-cloud-computing-ieeex-cloudintro-x-0 Introduction to Cloud Computing]
 +
** [https://www.edx.org/course/data-science-machine-learning-essentials-microsoft-dat203x-0 Data Science and Machine Learning Essentials]
 +
** [https://www.edx.org/course/introduction-biology-secret-life-mitx-7-00x-2 Introduction to Biology - The Secret of Life]
 +
 
 +
 
 +
* Self-paced online courses in the series "Data Analysis for Life Sciences" from Harvard at EdX.org:
 +
** [https://www.edx.org/course/data-analysis-life-sciences-1-statistics-harvardx-ph525-1x 1. Statistics and R]
 +
** [https://www.edx.org/course/data-analysis-life-sciences-2-harvardx-ph525-2x 2. Introduction to Linear Models and Matrix Algebra]
 +
** [https://www.edx.org/course/data-analysis-life-sciences-3-harvardx-ph525-3x 3. Statistical Inference and Modeling for High-throughput Experiments]
 +
** [https://www.edx.org/course/data-analysis-life-sciences-4-high-harvardx-ph525-4x 4. High-Dimensional Data Analysis ]
 +
** [https://www.edx.org/course/data-analysis-life-sciences-5-harvardx-ph525-5x 5. Introduction to Bioconductor: Annotation and Analysis of Genomes and Genomic Assays ]
 +
** [https://www.edx.org/course/data-analysis-life-sciences-6-high-harvardx-ph525-6x 6. High-performance Computing for Reproducible Genomics ]
 +
** [https://www.edx.org/course/data-analysis-life-sciences-7-case-harvardx-ph525-7x 7. Case Studies in Functional Genomics ]
 +
 
 +
 
 +
* Linux:
 +
** [http://sheet.shiar.nl/less Cheat sheet on less command to navigate files]
 +
** [http://www.thegeekstuff.com/linux-101-hacks-ebook/ Free e-book on mastering Linux commands. The website provides a lot of posts on using Linux]
 +
** [http://codular.com/regex Intro to regular expressions. The website provides intro tutorials on PHP, SQL, HTML5, JSON, etc.]
 +
 
 +
 
 +
* [https://software-carpentry.org/lessons/ Online lessons from Software Carpentry]
 +
* [https://pythonforbiologists.com/ Python for Biologists]
 +
* [http://work.caltech.edu/telecourse.html Learning from Data - self-paced course from Caltech, USA]
 +
* [http://huttenhower.sph.harvard.edu/moodle/ Huttenhower Lab (Harvard) Courses]
 +
* [http://www.bioinformatics.babraham.ac.uk/training.html Babraham (UK) Bioinformatics training courses]
 +
* [http://jura.wi.mit.edu/bio/education/ Whitehead (USA) Bioinformatics Courses]
 +
* [https://cbsu.tc.cornell.edu/workshops.aspx Cornell University (USA), Institute of Biotechnology, Bioinformatics courses]
 +
* [https://wiki.hpcc.msu.edu/display/Bioinfo/Bioinformatics+Support+at+MSU Bioinformatics tutorials from Michigan State University]
 +
* [http://edu.isb-sib.ch/course/index.php?categoryid=2 Swiss Institute of Bioinformatics Training portal]
 +
 
 +
== Bioinformatics training providers ==
 +
* [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1002245 Ten Simple Rules for Developing a Short Bioinformatics Training Course. PLoS Comput Biol, 2011.]
 +
* https://www.embl.de/training/events/index.php
 +
* https://tess.elixir-europe.org
 +
* https://meetings.cshl.edu/courseshome.aspx
 +
* http://gtpb.igc.gulbenkian.pt/bicourses/
 +
* http://mygoblet.org/training-portal
 +
* https://www.elixir-europe.org/events
 +
* https://www.ecseq.com/workshops/ngs-data-analysis-courses
 +
* https://www.seqme.eu/en/courses/
 +
* https://www.scilifelab.se/education/courses/
 +
* http://www.transmittingscience.org
 +
* http://evomics.org/
 +
* http://www.sib.swiss/training/upcoming-training-events
 +
* https://bio-it.embl.de/course-materials/
 +
 
 +
== Communities & Blogs ==
 +
* http://bioinfo-core.org
 +
 
 +
* https://liorpachter.wordpress.com - blog of Lior Pachter, the developer of Cufflinks, TopHat, eXpress, kallisto and other algorithms.
 +
** Post on kallisto https://liorpachter.wordpress.com/2015/05/10/near-optimal-rna-seq-quantification-with-kallisto/
 +
* http://blog.genohub.com
 +
* http://www.rna-seqblog.com
 +
* http://www.lncrnablog.com
 +
* http://onetipperday.sterding.com - One Tip Per Day: Learning notes for Unix, Perl, R, HTML, Javascript, Google API and mostly Bioinformatics

Latest revision as of 11:11, 27 February 2023

Bioinformatics Core Facility @ CRG — 2011-2024