HumMeth27QCReport

HumMeth27QCReport workflow


HumMeth27QCReport is an R package that provides a quick overview of the quality of Illumina’s Infinium BeadChip methylation assays. The project was developed as a collaboration between the CRG Genotyping Unit and the CRG Bioinformatics Core.


The HumMeth27QCReport R package can be downloaded from the CRAN repository.


To make the package easier for wet-lab researchers to use, ad hoc scripts were developed to run it in the Galaxy workbench. They can be downloaded from the Galaxy Tool Shed.


Background

DNA methylation is an epigenetic mechanism that, in vertebrates, occurs most frequently at cytosines followed by guanines (CpG). This modification regulates gene expression and can be inherited through cell division, making it essential for preserving tissue identities and guiding normal cellular development [1]. Hypermethylation of CpG islands located in the promoter regions of tumor suppressor genes has been firmly established as one of the most common mechanisms for gene regulation in cancer [2,3].

As interest in the human DNA methylome has grown, several methods have been developed to detect cytosine methylation on a genomic scale. Among these, Illumina’s Infinium Methylation Assay is a hybridization-based technique that offers quantitative methylation measurements at the single-CpG-site level, providing results as accurate as those of sequencing-based methylation assays (e.g. MethylCap-seq, MeDIP-seq, RRBS) [4]. The microarray-based Illumina Infinium methylation assay has recently been used in epigenomic studies [5-7] because of its high throughput, good accuracy, small sample requirement and relatively low cost. To date, the available Illumina Infinium platforms for methylation analysis are the HumanMethylation27 BeadChip, with 27,578 CpG sites covering >14,000 genes, and the newer HumanMethylation450 BeadChip, comprising >450,000 methylation sites.

To estimate the methylation status, the Illumina Infinium assay uses a pair of probes (a methylated probe and an unmethylated probe) to measure the intensities of the methylated and unmethylated alleles at the interrogated CpG site [8]. The methylation level is then estimated from the measured intensities of this pair of probes.
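
The estimate from a probe pair can be sketched in R as follows. This is a minimal illustration, not the package's own code: the helper names beta_value and m_value are hypothetical, the offset of 100 in the Beta value is the constant commonly recommended by Illumina, and the offset of 1 in the M-value follows the Beta-value/M-value comparison listed in the Package references.

```r
# Hypothetical helpers, not part of HumMeth27QCReport.
# meth / unmeth: intensities of the methylated and unmethylated probes.
beta_value <- function(meth, unmeth, offset = 100) {
  meth <- pmax(meth, 0)      # negative intensities are clipped to zero
  unmeth <- pmax(unmeth, 0)
  meth / (meth + unmeth + offset)
}

m_value <- function(meth, unmeth, offset = 1) {
  log2((pmax(meth, 0) + offset) / (pmax(unmeth, 0) + offset))
}

# Toy intensities for three CpG sites (assumed values).
meth   <- c(9000, 500, 4000)
unmeth <- c(900, 9500, 4000)

beta <- beta_value(meth, unmeth)
mval <- m_value(meth, unmeth)
print(round(beta, 3))
print(round(mval, 3))
```

Beta values stay between 0 and 1, while M-values are unbounded and symmetric around 0, which is why the package exports normalized M-values for downstream analysis.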


Requirements

The tool requires working installations of R and Perl on your machine.


HOW TO INSTALL

From R command line type:
R> install.packages("HumMeth27QCReport", repos="http://cran.r-project.org", dependencies=TRUE, type="source")

HOW TO RUN

To run the example inside the package, type from R command line:
R> Dir <- system.file("extdata/",package="HumMeth27QCReport")
R> ImportDataR <- ImportData(Dir)
R> normMvalues <- NormCheck(ImportDataR, platform="Hum27", pval=0.05, ChrX=FALSE, ClustMethod="euclidean")

where:
* Dir is a character string giving the directory that contains the input files. All output files are also written there.
* platform is the type of Illumina Infinium BeadChip methylation assay. It must be either "Hum27" (Infinium HumanMethylation27 BeadChip) or "Hum450" (Infinium HumanMethylation450 BeadChip).
* pval is the p-value threshold used to decide which samples are kept for normalization and the subsequent analysis.
* ChrX is a logical value indicating whether CpGs on the X chromosome should be removed before normalization and the subsequent steps. The default is FALSE.
* ClustMethod is the distance measure used for clustering. It must be one of "euclidean", "maximum", "manhattan", "canberra", "binary", "pearson", "correlation", "spearman" or "kendall".


To run the QC analysis on your own data, set the Dir variable to the directory where your data are stored (e.g. Dir <- "C:/Analysis/data").

If you are interested in only one of the three available functions, type one of the following.
To export only the internal controls, as suggested by Illumina's guidelines:
R> ControlResults <- getAssayControls(ImportDataR, platform="Hum27")
To run the other QC analyses, such as the distribution of Beta values or the average p-value:
R> QCresults <- QCCheck(ImportDataR, pval=0.05)
To export the normalized M-values and generate the PCA and hierarchical clustering plots:
R> normMvalues <- NormCheck(ImportDataR, platform="Hum27", pval=0.05, ChrX=FALSE, ClustMethod="euclidean")


Input

HumMeth27QCReport takes as input three files exported from BeadStudio, plus an optional text file listing the chip control samples to exclude from the normalization step.


NOTE: all data were obtained in the CRG Genotyping Unit (now CRG Genomics Unit).


Sample table

Required columns from BeadStudio:

Index, Sample ID, Sample Group, Sentrix Barcode, Sample Section, Detected Genes (0.01), Detected Genes (0.05), Signal Average GRN, Signal Average RED, Signal P05 GRN, Signal P05 RED, Signal P25 GRN, Signal P25 RED, Signal P50 GRN, Signal P50 RED, Signal P75 GRN, Signal P75 RED, Signal P95 GRN, Signal P95 RED, Sample_Well, Sample_Plate


Control table

Required columns from BeadStudio (<Sn> = Sample Name):

Index, TargetID, ProbeID, <Sn>.Signal_Grn, <Sn>.Signal_Red, <Sn>.Detection Pval, ... ... ...

Required controls (rows):

  • BISULFITE CONVERSION (4 rows)
  • EXTENSION (4 rows)
  • HYBRIDIZATION (3 rows)
  • NEGATIVE (16 rows)
  • NON-POLYMORPHIC (4 rows)
  • SPECIFICITY (4 rows)
  • STAINING (4 rows)
  • TARGET REMOVAL
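
Before running the package, the row counts listed above can be checked with a few lines of R. This is a sketch under assumptions: it supposes that the control type is stored in the TargetID column of the Control table, and it uses a toy vector instead of a real BeadStudio export. No count is given above for TARGET REMOVAL, so only its presence is tested.

```r
# Expected number of rows per control type, taken from the list above.
expected <- c("BISULFITE CONVERSION" = 4, "EXTENSION" = 4, "HYBRIDIZATION" = 3,
              "NEGATIVE" = 16, "NON-POLYMORPHIC" = 4, "SPECIFICITY" = 4,
              "STAINING" = 4)

# Toy TargetID column (assumed layout); one EXTENSION row is removed on purpose.
target_id <- c(rep(names(expected), times = expected), "TARGET REMOVAL")
target_id <- target_id[-which(target_id == "EXTENSION")[1]]

# Count rows per control type and report the types with an unexpected count.
observed <- table(factor(target_id, levels = names(expected)))
problems <- names(expected)[as.integer(observed) != expected]
has_target_removal <- "TARGET REMOVAL" %in% target_id

print(problems)
print(has_target_removal)
```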


Average Beta table

Required columns from BeadStudio (<Sn> = Sample Name):

Index, TargetID, <Sn>.AVG_Beta, <Sn>.Intensity, <Sn>.Signal_A, <Sn>.Signal_B, <Sn>.BEAD_STDERR_A, <Sn>.BEAD_STDERR_B, <Sn>.Avg_NBEADS_A, <Sn>.Avg_NBEADS_B, <Sn>.Detection Pval, ... ... .., SYMBOL


Discard.txt

Text file listing the names of the samples you want to exclude from normalization, one sample per row, using the same names as in the Sample table. Typical entries are control samples included to check that the chip worked properly, such as unmethylated samples.
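
A Discard.txt file can be prepared and checked with base R. The sketch below is only an illustration: the sample names are invented, the file is written to a temporary directory rather than to the data directory, and the way the package itself reads the file is not shown here.

```r
# Assumed sample names; in practice they come from the Sample table.
samples <- c("Sample_01", "Sample_02", "Unmeth_Ctrl", "Sample_03")

# Write a Discard.txt with one sample name per row (temporary directory here).
discard_file <- file.path(tempdir(), "Discard.txt")
writeLines("Unmeth_Ctrl", discard_file)

# Read it back, dropping surrounding whitespace and empty rows.
discard <- trimws(readLines(discard_file))
discard <- discard[nzchar(discard)]

# Names that do not match the Sample table are worth flagging.
unknown <- setdiff(discard, samples)
if (length(unknown) > 0) {
  warning("Not in Sample table: ", paste(unknown, collapse = ", "))
}

# Samples that would remain for normalization.
kept <- setdiff(samples, discard)
print(kept)
```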

Output

HumMeth27QCReport produces several plots (saved as PDF files) to assess the quality of the samples:

  • a histogram for each internal control.
  • an intensity plot for each sample, based on the "plotSampleIntensities" function of the methylumi package.
  • a histogram of the percentage of non-detected CpGs (that is, CpGs with a detection p-value greater than 0.05 or 0.01).
  • a histogram of the average p-value for each sample.
  • a PCA of normalized Beta values.
  • a hierarchical clustering of normalized Beta values.

In addition, a text file with the normalized M-values and an Excel file are provided. The Excel file contains a summary of the internal controls and of gene detection, along with several lists of non-detected CpGs.

Figure Details

This section describes the controls used in the Illumina Infinium Methylation Assay, illustrated with 27k example data, together with their expected outcomes and how to view them. Diagrams are included with descriptions for sample-independent and sample-dependent controls, as well as for controls that are specific to the green channel or the red channel. The sample-independent controls let you evaluate the quality of specific steps in the process flow, and include:

  • Staining controls
  • Extension controls
  • Target removal controls
  • Hybridization controls

The sample-dependent controls let you evaluate performance across samples, and include:

  • Bisulfite conversion controls
  • Specificity controls
  • Negative controls
  • Non-polymorphic (NP) controls


Figure 1: Barplot of DNP staining control

This figure represents the ratio (%) between background and signal for Staining control in the red channel (DNP). Staining controls are used to examine the efficiency of the staining step in both the red and green channels. Staining controls have dinitrophenyl (DNP) or biotin attached to the beads. The ratios should result in low signal, indicating that the staining step was efficient.
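
The plotted quantity can be reproduced by hand as a percentage ratio of background to signal. The following is a minimal sketch with invented intensities. The 10% cut-off is purely illustrative and is not defined by the package or by Illumina.

```r
# Hypothetical red-channel intensities for the DNP staining control,
# one value per sample.
signal     <- c(S1 = 21000, S2 = 18500, S3 = 2300)
background <- c(S1 = 450,   S2 = 520,   S3 = 1900)

# Ratio (%) between background and signal; low values are expected.
ratio_pct <- 100 * background / signal
print(round(ratio_pct, 1))

# Illustrative cut-off only (assumption): flag samples with a high ratio.
flagged <- names(ratio_pct)[ratio_pct > 10]
print(flagged)
```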

DNP staining control


Figure 2: Barplot of Biotin staining control

This figure represents the ratio (%) between background and signal for Staining control in the green channel (Biotin). These controls are independent of the hybridization and extension step. The ratios should result in low signal, indicating that the staining step was efficient.

Biotin staining control


Figure 3: Barplot of hybridization control

This figure represents the ratio (%) between background and signal for Hybridization controls in the green channel for three concentrations. The hybridization controls test the overall performance of the entire assay using synthetic targets instead of amplified DNA. These synthetic targets complement the sequence on the array perfectly, allowing the probe to extend on the synthetic target as template. The synthetic targets are present in the hybridization buffer at three levels, monitoring the response from high-concentration (5 pM), medium-concentration (1 pM), and low-concentration (0.2 pM) targets. All bead type IDs should result in signal with various intensities, corresponding to the concentrations of the initial synthetic targets.
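
The expectation that signal follows target concentration can be checked per sample. The sketch below assumes a small matrix of green-channel intensities with invented values and column names. It only tests the ordering high > medium > low, not any absolute level.

```r
# Hypothetical green-channel intensities of the hybridization controls;
# rows = samples, columns = concentration of the synthetic target.
hyb <- rbind(
  S1 = c(high = 15000, medium = 6000, low = 1500),
  S2 = c(high = 14000, medium = 7000, low = 2000),
  S3 = c(high = 5000,  medium = 6500, low = 1800)
)

# A sample passes when intensity decreases with concentration.
ordered_ok <- hyb[, "high"] > hyb[, "medium"] & hyb[, "medium"] > hyb[, "low"]
print(ordered_ok)
```

In this toy example S3 fails, because its medium-concentration control is brighter than its high-concentration control.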

Hybridization control


Figure 4: Barplot of target removal control

This figure represents the intensity value for Target removal controls in the green channel. Target removal controls test the efficiency of the stripping step after the extension reaction. The control oligos are extended using the probe sequence as template. This process generates labeled targets. The probe sequences are designed such that extension from the probe does not occur. All target removal controls should result in low signal, indicating that the targets were removed efficiently after extension. Values < 3400 have been detected (108 samples). Illumina does not specify a range; this value is based on previous experiments run in our facility.

Target removal control


Figure 5: Barplot of extension control: green channel

This figure represents the ratio (%) between background and signal for Extension control in the green channel (C,G). Extension controls test the extension efficiency of A, T, C, and G nucleotides from a hairpin probe, and are therefore sample-independent. The ratios should result in low signal, indicating that the extension was efficient.

Extension control on green channel


Figure 6: Barplot of extension control: red channel

This figure represents the ratio (%) between background and signal for Extension control in the red channel (A,T). The ratios should result in low signal, indicating that the extension was efficient.

Extension control on red channel


Figure 7: Barplot of bisulfite control

This figure represents the ratio (%) between background and signal for the Bisulfite conversion control. The Bisulfite conversion control assesses the efficiency of bisulfite conversion of the genomic DNA. The Infinium Methylation probes query a [C/T] polymorphism created by bisulfite conversion of two different Hind III sites [AAGCTT] in the genome. If the bisulfite conversion reaction was successful, the "C" (Converted) probes will match the converted sequence and get extended. If the sample has unconverted DNA, the "U" (Unconverted) probes will get extended. There are no underlying C bases in the primer landing sites, except for the query site itself. Performance of bisulfite conversion controls should only be monitored in the Green channel. The ratios should result in low signal, indicating that the Bisulfite conversion was efficient.

Bisulfite control


Figure 8: Barplot of specificity control (mismatch 1) in red channel

This figure represents the ratio (%) between background (MM) and signal (PM) for Specificity controls in the red channel. In the Infinium Methylation assay, the methylation status of a particular cytosine is determined following bisulfite treatment of DNA by using query probes for the unmethylated and methylated state of each CpG locus. In assay oligo design, the A/T match corresponds to the unmethylated status of the interrogated C, and the G/C match corresponds to the methylated status of C. G/T mismatch controls check for non-specific detection of methylation signal over unmethylated background. Specificity controls are designed against non-polymorphic T sites. PM controls correspond to A/T perfect match and should give high signal. MM controls correspond to G/T mismatch and should give low signal. The ratios should result in low signal, indicating that the performance of the assay was efficient.

Specificity control (mismatch 1) in red channel


Figure 9: Barplot of specificity control (mismatch 2) in green channel

This figure represents the ratio (%) between background (MM) and signal (PM) for Specificity controls in the green channel. PM controls correspond to A/T perfect match and should give high signal. MM controls correspond to G/T mismatch and should give low signal. The ratios should result in low signal, indicating that the performance of the assay was efficient.

Specificity control (mismatch 2) in green channel


Figure 10: Barplot of negative control

This figure represents the intensity value for the Negative control. Negative control probes are randomly permuted sequences that should not hybridize to the DNA template. Negative controls are particularly important for methylation studies because of a decrease in sequence complexity after bisulfite conversion. The mean signal of these probes defines the system background. This is a comprehensive measurement of background, including signal resulting from cross-hybridization, as well as non-specific extension and imaging system background. All negative controls should result in low signal. Values < 2500 have been detected (108 samples). Illumina does not specify a range; this value is based on previous experiments run in our facility.
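
The system background described here is the mean signal of the negative control probes, and it can be compared with the facility-derived value of 2500. The sketch below uses simulated intensities for the 16 negative control rows. The matrix layout and the sample names are assumptions.

```r
set.seed(1)

# Simulated intensities for 16 negative-control probes (rows) in 3 samples (columns).
neg <- matrix(round(runif(16 * 3, min = 200, max = 1200)), nrow = 16,
              dimnames = list(paste0("NEG_", 1:16), c("S1", "S2", "S3")))
neg[, "S3"] <- neg[, "S3"] + 3000   # make one sample noisy on purpose

# System background = mean signal of the negative control probes, per sample.
background <- colMeans(neg)
print(round(background))

# Compare against the value observed in the facility's previous experiments.
limit <- 2500
too_high <- names(background)[background >= limit]
print(too_high)
```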

Negative control


Figure 11: Barplot for green channel of non-polymorphic control

This figure represents the ratio (%) between background and signal for Non-Polymorphic control in the green channel. Non-polymorphic controls test the overall performance of the assay, from amplification to detection, by querying a particular base in a non-polymorphic region of the bisulfite genome. They let you compare assay performance across different samples. One non-polymorphic control has been designed to query each of the four nucleotides (A, T, C and G). The target with the C base results from querying the opposite whole genome amplified strand generated from the converted strand. The ratios should result in low signal, indicating that the performance of the assay was efficient.

Green channel of non-polymorphic control


Figure 12: Barplot for red channel of non-polymorphic control

This figure represents the ratio (%) between background and signal for Non-Polymorphic control in the red channel. The ratios should result in low signal, indicating that the performance of the assay was efficient.

Red channel of non-polymorphic control


Figure 13: Intensity at high and low betas

For each sample, the intensity at high and low betas is shown (depending on the number of samples, there may be more than one figure). The intensities output by the GenomeStudio software often show a considerable amount of dye bias, and this plot illustrates it graphically. In short, for each of the Cy3 and Cy5 channels, a cutoff in beta is used to decide which Cy3 and Cy5 values should be plotted at high-methylation and low-methylation status. Any offset between Cy3 and Cy5 when plotted in this way likely represents dye bias and will lead to biases in the estimate of beta.
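
The idea behind this plot can be sketched on simulated data. This is not the methylumi implementation: the beta cut-offs (below 0.2 for low methylation, above 0.8 for high methylation), the 50% inflation of Cy5, and all column names are assumptions made for illustration.

```r
set.seed(42)
n <- 2000

# Simulated probes: a beta value and the dye channel used to read each probe.
probes <- data.frame(
  beta    = runif(n),
  channel = rep(c("Cy3", "Cy5"), each = n / 2)
)

# Simulated intensities; Cy5 is inflated by 50% to mimic dye bias.
probes$intensity <- rlnorm(n, meanlog = log(5000), sdlog = 0.2) *
  ifelse(probes$channel == "Cy5", 1.5, 1)

# Assumed cut-offs: beta < 0.2 = low methylation, beta > 0.8 = high methylation.
probes$status <- ifelse(probes$beta < 0.2, "low",
                 ifelse(probes$beta > 0.8, "high", NA))
extremes <- probes[!is.na(probes$status), ]

# Median intensity per methylation status and channel.
med <- tapply(extremes$intensity,
              list(extremes$status, extremes$channel), median)
print(round(med))

# An offset between channels at the same status suggests dye bias.
offset <- med[, "Cy5"] / med[, "Cy3"]
print(round(offset, 2))
```

A ratio close to 1 in both rows would indicate little dye bias. Here the simulated inflation of Cy5 shows up as a ratio well above 1.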


Intensity at high and low betas


Figure 14: Barplot of percentages of non detected genes

This figure represents the percentage (%) of non-detected genes at p-value cut-offs of 0.05 and 0.01. Non-detected genes are CpGs whose detection p-value exceeds the cut-off, that is, CpGs with no significant AverageBeta.
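
The percentages can be derived directly from the detection p-values in the Average Beta table. The sketch below uses a small invented matrix with CpGs in rows and samples in columns.

```r
# Hypothetical detection p-values; rows = CpGs, columns = samples.
pvals <- cbind(
  S1 = c(0.001, 0.020, 0.300, 0.004, 0.000),
  S2 = c(0.060, 0.008, 0.700, 0.030, 0.002)
)

# Percentage of CpGs above each cut-off, per sample.
non_detected <- rbind(
  "p > 0.05" = colMeans(pvals > 0.05) * 100,
  "p > 0.01" = colMeans(pvals > 0.01) * 100
)
print(non_detected)
```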


Percentages of non detected genes


Figure 15: Barplot of average detection p-values

The boxplots show the average p-value for each sample; the red dotted line is the threshold defined by the user to select the samples for the following analysis.
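
Sample selection by average detection p-value can be sketched as follows. The values are invented, and keeping samples whose average is strictly below the threshold is an assumption about how the pval argument is applied.

```r
# Hypothetical detection p-values; rows = CpGs, columns = samples.
pvals <- cbind(
  S1 = c(0.001, 0.020, 0.010, 0.004),
  S2 = c(0.300, 0.150, 0.020, 0.090),
  S3 = c(0.002, 0.001, 0.030, 0.005)
)

pval_threshold <- 0.05          # plays the role of the pval argument
avg_p <- colMeans(pvals)        # average detection p-value per sample
print(round(avg_p, 4))

# Assumed rule: keep samples whose average p-value is below the threshold.
keep <- names(avg_p)[avg_p < pval_threshold]
print(keep)
```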

Average detection p-values


Figure 16: Principal Component Analysis

PCA is made on filtered and normalized data.
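
A comparable analysis can be run with base R on any matrix of normalized values. The sketch below uses simulated data with two groups of samples. The use of prcomp with centering and without scaling is an assumption, not necessarily what the package does internally.

```r
set.seed(7)

# Simulated normalized values; rows = CpGs, columns = samples.
mvals <- matrix(rnorm(200 * 6), nrow = 200,
                dimnames = list(NULL, paste0("S", 1:6)))
mvals[1:50, 4:6] <- mvals[1:50, 4:6] + 3   # samples S4-S6 differ in 50 CpGs

# Samples must be in rows for prcomp, hence the transpose.
pca <- prcomp(t(mvals), center = TRUE, scale. = FALSE)

# Share of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained[1:2], 2))

# Coordinates of the samples on the first two components.
scores <- pca$x[, 1:2]
print(round(scores, 1))
```

In this simulation the first component separates S1-S3 from S4-S6, which is the kind of grouping the PCA plot is meant to reveal.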

Principal Component Analysis


Figure 17: Hierarchical Clustering

The clustering is made on filtered and normalized data. The distance method is defined by the user.
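
The effect of the distance method can be explored with base R. The helper sample_dist below is hypothetical: it maps the geometric methods to stats::dist and the "pearson", "spearman" and "kendall" options to 1 minus the correlation, which is one common convention and may differ from the package's own definition. The "correlation" option is left out because its definition is not given here, and average linkage is likewise an assumption.

```r
# Hypothetical helper: distance between samples (columns of x).
sample_dist <- function(x, method = "euclidean") {
  if (method %in% c("euclidean", "maximum", "manhattan", "canberra", "binary")) {
    dist(t(x), method = method)
  } else if (method %in% c("pearson", "spearman", "kendall")) {
    as.dist(1 - cor(x, method = method))   # assumed convention
  } else {
    stop("method not covered by this sketch: ", method)
  }
}

set.seed(7)
# Simulated normalized values; rows = CpGs, columns = samples.
mvals <- matrix(rnorm(200 * 6), nrow = 200,
                dimnames = list(NULL, paste0("S", 1:6)))
mvals[1:50, 4:6] <- mvals[1:50, 4:6] + 3   # samples S4-S6 differ in 50 CpGs

# Hierarchical clustering with average linkage (assumption), cut into 2 groups.
hc <- hclust(sample_dist(mvals, "euclidean"), method = "average")
groups <- cutree(hc, k = 2)
print(groups)
```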

Hierarchical Clustering

References


Background:
[1] - Ladd-Acosta C, Pevsner J, Sabunciyan S, Yolken RH, Webster MJ, Dinkins T, Callinan PA, Fan JB, Potash JB, Feinberg AP: DNA methylation signatures within the human brain. Am J Hum Genet 2007, 81:1304-1315.
[2] - Esteller M: CpG island hypermethylation and tumor suppressor genes: a booming present, a brighter future. Oncogene 2002, 21(35):5427-5440.
[3] - Herman JG, Baylin SB: Gene silencing in cancer in association with promoter hypermethylation. N Engl J Med 2003, 349(21):2042-2054.
[4] - Bock C, Tomazou EM, Brinkman AB, Muller F, Simmer F, Gu H, Jager N, Gnirke A, Stunnenberg HG, Meissner A: Quantitative comparison of genome-wide DNA methylation mapping technologies. Nat Biotechnol 2010, 28:1106-1114.
[5] - Bell CG, Teschendorff AE, Rakyan VK, Maxwell AP, Beck S, Savage DA: Genome-wide DNA methylation analysis for diabetic nephropathy in type 1 diabetes mellitus. BMC Med Genomics 2010, 3:33.
[6] - Thirlwell C, Eymard M, Feber A, Teschendorff A, Pearce K, Lechner M, Widschwendter M, Beck S: Genome-wide DNA methylation analysis of archival formalin-fixed paraffin-embedded tissue using the Illumina Infinium HumanMethylation27 BeadChip. Methods 2010, 52(3):248-254.
[7] - Grafodatskaya D, Choufani S, Ferreira JC, Butcher DT, Lou Y, Zhao C, Scherer SW, Weksberg R: EBV transformation and cell culturing destabilizes DNA methylation in human lymphoblastoid cell lines. Genomics 2010, 95(2):73-83.
[8] - Weisenberger DJ, Berg DVD, Pan F, Berman BP, Laird PW: Comprehensive DNA Methylation Analysis on the Illumina Infinium Assay Platform. Illumina Application Note 2008.


Package:
[1] - Illumina: Chapter 4 System Controls. In Infinium HD Assay Methylation Protocol Guide. 2010: 231-244.
[2] - Du P, Kibbe WA, Lin SM: lumi: a pipeline for processing Illumina microarray. Bioinformatics 2008, 24:1547-1548.
[3] - Ihaka R, Gentleman R: R: a language for data analysis and graphics. J Comput Graph Stat 1996, 5:299-314.
[4] - Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19:185-193.
[5] - Du P, Zhang X, Huang CC, Jafari N, Kibbe WA, Hou L, Lin SM: Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics 2010, 11:587.



This software is distributed for non-commercial, academic use only. For any questions, please contact the author (mail).

Bioinformatics Core Facility @ CRG — 2011-2019