Difference between revisions of "Useful Links"

From Bioinformatics Core Wiki
Line 24: Line 24:
  
 
==== RNA-seq ====
 
* [http://blog.genohub.com/how-many-replicates-are-sufficient-for-differential-gene-expression/ How Many Replicates are Sufficient for Differential Gene Expression? ]
+
===== How many replicates? =====
 +
* [https://www.ncbi.nlm.nih.gov/pubmed/27022035 How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA, 2016.]
 +
* [https://www.ncbi.nlm.nih.gov/pubmed/22985019 Efficient experimental design and analysis strategies for the detection of differential expression using RNA-Sequencing. BMC Genomics, 2012.]
 +
* [https://www.ncbi.nlm.nih.gov/pubmed/?term=26206307 Statistical models for RNA-seq data derived from a two-condition 48-replicate experiment. Bioinformatics, 2015.]
 +
* [https://www.ncbi.nlm.nih.gov/pubmed/25246651 Power analysis and sample size estimation for RNA-Seq differential expression. RNA, 2014.]
 +
* [http://scotty.genetics.utah.edu/scotty.php '''Scotty''' - Power Analysis for RNA-seq Experiments]. It answers the question, "How many reads do we need to sequence?"
 +
 
 +
===== Approaches and benchmarks =====
 +
* [https://www.ncbi.nlm.nih.gov/pubmed/26732976 Comparison of normalization and differential expression analyses using RNA-Seq data from 726 individual Drosophila melanogaster. BMC Genomics, 2016.]
 +
* [https://www.ncbi.nlm.nih.gov/pubmed/26511205 Comparing the normalization methods for the differential analysis of Illumina high-throughput RNA-Seq data. BMC Bioinformatics, 2015.]
 
* [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4728800/ A survey of best practices for RNA-seq data analysis. Genome Biology, 2016.]
 
 
* [https://www.ncbi.nlm.nih.gov/pubmed/28484260 Benchmarking of RNA-sequencing analysis workflows using whole-transcriptome RT-qPCR expression data. Scientific Reports, 2017]
Line 30: Line 39:
 
* [https://www.ncbi.nlm.nih.gov/pubmed/27043002  '''kallisto''' - Near-optimal probabilistic RNA-seq quantification. Nature Biotechnol, 2016]
 
 
* [https://www.ncbi.nlm.nih.gov/pubmed/28263959 '''Salmon''' provides fast and bias-aware quantification of transcript expression. Nature Methods, 2017]
 
* [http://scotty.genetics.utah.edu/scotty.php '''Scotty''' - Power Analysis for RNA-seq Experiments]. It answers the question, "How many reads do we need to sequence?"
 
 
* On batch effect in RNA-seq: http://f1000research.com/articles/4-121/v1 (This is a critique of the original article http://www.ncbi.nlm.nih.gov/pubmed/25413365)
 
 
** The authors used the ‘ComBat’ function from the sva package v3.12.0, with a model that includes effects for batch, species and tissue. The R code is provided.
Line 36: Line 44:
 
** A similar case was reported previously: http://simplystatistics.org/2015/05/20/is-it-species-or-is-it-batch-they-are-confounded-so-we-cant-know/
  
===== single cell RNA-seq =====
+
==== Single cell RNA-seq ====
* [https://hemberg-lab.github.io/scRNA.seq.course scRNA-seq analysis course from Hemberg's group, Sanger]
+
* [https://hemberg-lab.github.io/scRNA.seq.course scRNA-seq analysis course from Hemberg's group, Sanger.]
 +
 
  
 
==== ChIP-seq ====
 

Revision as of 12:26, 20 February 2018

NGS data analysis Protocols, Methods & Tools

General


QC


RNA-seq

How many replicates?
Approaches and benchmarks

Single cell RNA-seq


ChIP-seq

  • ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res, 2012.
  • Features that define the best ChIP-seq peak calling algorithms. Briefings in Bioinformatics, 2016. - Benchmarking of peak-calling algorithms.
  • Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data. PLOS Comp Biol, 2013:
    • "For mammalian transcription factors (TFs) and chromatin modifications such as enhancer-associated histone marks, which are typically localized at specific, narrow sites and have on the order of thousands of binding sites, 20 million reads may be adequate (4 million reads for worm and fly TFs)."
    • "Proteins with more binding sites (e.g., RNA Pol II) or broader factors, including most histone marks, will require more reads, up to 60 million for mammalian ChIP-seq."
    • "Importantly, control samples should be sequenced significantly deeper than the ChIP ones in a TF experiment and in experiments involving diffused broad-domain chromatin data. This is to ensure sufficient coverage of a substantial portion of the genome and non-repetitive autosomal DNA regions."
    • "To ensure that the chosen sequencing depth was adequate, a saturation analysis is recommended—the peaks called should be consistent when the next two steps (read mapping and peak calling) are performed on increasing numbers of reads chosen at random from the actual reads. Saturation analysis is built into some peak callers (e.g., SPP, an R package for analysis of ChIP-seq and other functional sequencing data). If this shows that the number of reads is not adequate, reads from technical replicate experiments can be combined."
    • "To avoid over-sequencing and estimate an optimal sequencing depth, it is important to take into account library complexity. Several tools are available for this purpose: the Preseq package allows users to predict the number of redundant reads from a given sequencing depth and how many will be expected from additional sequencing."
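
The saturation and library-complexity ideas quoted above can be illustrated with a small subsampling sketch. This is not how SPP or Preseq work internally; it only shows the underlying idea of counting distinct reads at increasing depths. The function name `saturation_curve`, the toy data, and the representation of each read as the identifier of the fragment it came from (for example, a mapped start position) are assumptions made for illustration.

```python
import random

def saturation_curve(reads, fractions, seed=0):
    """Count distinct reads in nested random subsamples of increasing size.

    `reads` is a list of hashable read identifiers (hypothetical: the fragment
    or mapped start position each read came from). Returns a list of
    (n_sampled, n_distinct) pairs, one per requested fraction.
    """
    rng = random.Random(seed)
    shuffled = list(reads)
    rng.shuffle(shuffled)  # a random order, so every prefix is a random subsample
    curve = []
    for frac in sorted(fractions):
        n = int(len(shuffled) * frac)
        curve.append((n, len(set(shuffled[:n]))))
    return curve

# Toy low-complexity library (made-up data): 10,000 reads from 2,000 fragments.
toy_rng = random.Random(42)
reads = [toy_rng.randrange(2000) for _ in range(10_000)]

curve = saturation_curve(reads, [0.1, 0.25, 0.5, 1.0])
for n, distinct in curve:
    print(f"{n:>6} reads -> {distinct:>5} distinct ({distinct / n:.0%} non-redundant)")
```

When the number of distinct reads stops growing as more reads are added, further sequencing of the same library mostly yields duplicates. For real data, a dedicated tool such as Preseq should be used rather than this sketch.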


Assembly


Gene Set Enrichment Analysis (GSEA) and other post-processing analysis


NGS other

  • Nanopore (MinION) de novo bacterial genome sequencing [1]
    • "Many bacterial genomes can be assembled into single contigs if reads longer than 7 kb are available, as these reads span the conserved rRNA operon, which is typically the longest repeat sequence in a bacterial genome."
    • "Recent versions of nanopore chemistry (R7.3) coupled with the latest base caller (Metrichor versions 1.9 and later) permit read-level accuracies of 78–85% (refs. 1,8)." This is slightly lower than the accuracies achieved by the latest version of Pacific Biosciences chemistry.
    • Two-dimensional (2D) reads from four separate MinION runs using R7.3 chemistry were combined. In total, 22,270 2D reads were used comprising 133.6 Mb of read data, representing ~29× theoretical coverage of the 4.6-Mb E. coli K-12 MG1655 reference genome.
    • Potential overlaps between the reads were detected using the DALIGNER software. Each read and its overlapped reads were used as input to the partial-order alignment (POA) software, which iteratively computes the consensus sequence. The read error-correction software, Nanocorrect, is available at https://github.com/jts/nanocorrect/.
    • The reads resulting from two rounds of correction were used as input to version 8.2 of the Celera Assembler. This resulted in a highly contiguous assembly with three contigs, the largest being 4.6 Mb long and covering the entire E. coli chromosome.
    • The authors implemented an algorithm that uses the electric current signal to compute an improved consensus sequence for the assembly. This improved the base-level accuracy to 99.5%, with 1,202 mismatches (26 per 100 kb) and 17,241 indels of ≥1 base (371 errors per 100 kb). The signal-level consensus software, Nanopolish, is available at https://github.com/jts/nanopolish/.
    • The complete pipeline used to generate the assembly, including downloading the input data and required software, is provided as a Makefile on GitHub at https://github.com/jts/nanopore-paper-analysis/blob/master/full-pipeline.make. Additional scripts used to analyze the assembly are provided in the same repository. An IPython notebook documenting the analysis workflow is also provided.
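
The coverage and error-rate figures quoted in the notes above follow from simple arithmetic, sketched below. All input numbers are taken from the text; the rounded 4.6 Mb genome size is used as given, so the per-100-kb rates come out close to, but not exactly equal to, the quoted values (the indel rate in particular lands slightly above 371). The helper name `per_100kb` is made up for this example.

```python
# Figures quoted in the notes above.
GENOME_BP = 4.6e6        # E. coli K-12 MG1655 reference, rounded as in the text
TOTAL_READ_BP = 133.6e6  # total 2D read data
N_READS = 22_270         # number of 2D reads
MISMATCHES = 1_202       # mismatches in the polished assembly
INDELS = 17_241          # indels of >= 1 base in the polished assembly

# Theoretical coverage = total sequenced bases / genome size.
coverage = TOTAL_READ_BP / GENOME_BP

# Mean read length = total sequenced bases / number of reads.
mean_read_len = TOTAL_READ_BP / N_READS

def per_100kb(n_errors, genome_bp=GENOME_BP):
    """Convert an absolute error count into errors per 100 kb of genome."""
    return n_errors / genome_bp * 100_000

print(f"coverage: ~{coverage:.0f}x")
print(f"mean 2D read length: ~{mean_read_len / 1000:.1f} kb")
print(f"mismatches per 100 kb: ~{per_100kb(MISMATCHES):.0f}")
print(f"indels per 100 kb: ~{per_100kb(INDELS):.0f}")
```

Note that the mean read length works out to roughly 6 kb, so only the longer part of the read-length distribution spans the ~7 kb rRNA operon mentioned above.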


Biology


Data Science

Statistics

Online Resources & Courses

Comparison of two samples

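  • The t-test, paired or unpaired. In R: t.test(x, y, paired=TRUE) for the paired version. Student's t-test provides an exact test for the equality of the means of two normal populations with unknown, but equal, variances. The equal-variance assumption can be checked with an F-test, in R: var.test(x, y). Note that t.test defaults to var.equal=FALSE (Welch's test); pass var.equal=TRUE for the classical equal-variance test. See https://en.wikipedia.org/wiki/Student's_t-test#Paired_samples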
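
To make the quantities behind these tests concrete, here is a minimal sketch that computes the paired t statistic, the pooled (equal-variance) t statistic, and the F ratio of the two sample variances using only the Python standard library. The function names and the example measurements are made up for illustration. The sketch returns test statistics and degrees of freedom only; p-values require the t and F distributions, which R's t.test and var.test (or scipy.stats.ttest_rel and scipy.stats.ttest_ind in Python) provide.

```python
import math
import statistics

def paired_t(x, y):
    """Paired t statistic: mean of the pairwise differences over its standard error."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1  # statistic, degrees of freedom

def pooled_t(x, y):
    """Unpaired Student's t statistic, assuming equal population variances."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return t, nx + ny - 2

def variance_ratio(x, y):
    """F statistic for comparing two variances: ratio of the sample variances."""
    return statistics.variance(x) / statistics.variance(y), len(x) - 1, len(y) - 1

# Made-up example: the same five subjects measured before and after a treatment.
before = [5.1, 4.9, 6.2, 5.8, 6.0]
after = [4.8, 4.7, 5.9, 5.6, 5.5]

t_paired, df_paired = paired_t(before, after)
t_pooled, df_pooled = pooled_t(before, after)
f_ratio, df_num, df_den = variance_ratio(before, after)
print(f"paired t = {t_paired:.2f} (df = {df_paired})")
print(f"pooled t = {t_pooled:.2f} (df = {df_pooled})")
print(f"F = {f_ratio:.2f} (df = {df_num}, {df_den})")
```

In this toy data the paired statistic is much larger than the unpaired one, because pairing removes the subject-to-subject variation that otherwise inflates the standard error.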

Comparison of two microbiome samples

Other topics


Linux

Online Courses & Materials






Bioinformatics training providers

Blogs

Bioinformatics Core Facility @ CRG — 2011-2024