BMC Genomics. 2014 Feb 24;15(1):154.
doi: 10.1186/1471-2164-15-154.

Improving reliability and absolute quantification of human brain microarray data by filtering and scaling probes using RNA-Seq

Jeremy A Miller et al. BMC Genomics.

Abstract

Background: High-throughput sequencing is gradually replacing microarrays as the preferred method for studying mRNA expression levels, providing nucleotide resolution and accurately measuring absolute expression levels of almost any transcript, known or novel. However, existing microarray data from clinical, pharmaceutical, and academic settings represent valuable and often underappreciated resources, and methods for assessing and improving the quality of these data are lacking.

Results: To quantitatively assess the quality of microarray probes, we directly compare RNA-Seq to Agilent microarrays by processing 231 unique samples from the Allen Human Brain Atlas using RNA-Seq. Both techniques provide highly consistent, highly reproducible gene expression measurements in adult human brain, with RNA-Seq slightly outperforming microarray results overall. We show that RNA-Seq can be used as ground truth to assess the reliability of most microarray probes, remove probes with off-target effects, and scale probe intensities to match the expression levels identified by RNA-Seq. These sequencing scaled microarray intensities (SSMIs) provide more reliable, quantitative estimates of absolute expression levels for many genes when compared with unscaled intensities. Finally, we validate this result in two human cell lines, showing that linear scaling factors can be applied across experiments using the same microarray platform.

Conclusions: Microarrays provide consistent, reproducible gene expression measurements, which are improved using RNA-Seq as ground truth. We expect that our strategy could be used to improve probe quality for many data sets from major existing repositories.

Figures

Figure 1
Experimental design. RNA from 240 samples spanning 29 neocortical (c) and non-neocortical (s) regions was run using microarray and RNA-Seq in two brains. Gene expression levels were then calculated using comparable strategies. Microarray results were assessed, filtered, and improved using RNA-Seq as ground truth. Details on region selection and preprocessing are available in the Methods and Additional file 3. The source of the microarray image is Guillaume Paumier.
Figure 2
Microarray and RNA-Seq show highly consistent gene expression metrics. A) Pearson correlations of absolute expression levels between 115 replicate sample pairs using both methods. B) Average log2 expression levels between RNA-Seq (TPM) and microarray (intensity) are strongly correlated. A subset of bright probes (red) shows particularly high intensity in microarray. C) Histograms showing the distribution of gene expression measures across all samples with microarray (top) and RNA-Seq (bottom). Note the extended leftward tail of the RNA-Seq distribution, indicating greater sensitivity at the low end of the range. D) Number of genes called present in microarray (light grey) and RNA-Seq (dark grey) for at least 5%, 50%, and 95% of samples. Horizontal black bars indicate the percentage of overlapping genes called as present using both methods. E-F) Correlation of differential expression between brains based on microarray intensity (E) and RNA-Seq TPM values (F). Each of 100,000 points shows the log2 fold change of a random gene between two random non-neocortical regions as measured in brain 1 (x-axis) and brain 2 (y-axis). G) Correlation of differential expression between methods in the training set (brain 2). Labeling as in E, except fold changes correspond to RNA-Seq (x-axis) and microarray (y-axis). H) Number of genes differentially expressed between non-neocortical regions based on an ANOVA, for various p-value thresholds. Colors and lines as in (D).
Figure 3
Gene expression reproducibility is dependent on expression level. A) Example genes showing good (CBNL2, left) and poor (TCF15, right) reproducibility using microarray. Reproducibility is defined here as the between-brain correlation of a gene's average log2(intensity) values across the 29 brain regions. B) There is a strong relationship between expression level and reproducibility for genes with low expression. Genes were sorted from lowest to highest expression and divided into 20 bins, each representing 5% of array genes (x-axis). Each point shows the average between-brain correlation (as in A) for all genes in that bin (y-axis), as measured by microarray (blue) and RNA-Seq (green). Arrows indicate approximate average TPM and intensity values below which RNA-Seq (TPM = 1) and microarray (log2(intensity) = 5) become progressively less reliable. Approximately 25% and 33% of genes have expression levels below these thresholds in RNA-Seq and microarray, respectively. The standard error of the mean (SEM) for each bin is smaller than the dot size.
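The binning in panel B can be illustrated with a minimal sketch. It assumes each brain is given as a dictionary mapping a gene to its per-region average log2 expression values, with the same genes and the same region order in both brains. The function names, the input layout, and the use of the mean across both brains as the sorting key are illustrative assumptions, not the authors' code.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def binned_reproducibility(brain1, brain2, n_bins=20):
    """Average between-brain correlation per expression bin (lowest bin first).

    brain1, brain2: dict gene -> list of per-region average log2 values
    (hypothetical layout; both dicts share genes and region order).
    """
    def mean_expr(g):
        vals = brain1[g] + brain2[g]
        return sum(vals) / len(vals)

    genes = sorted(brain1, key=mean_expr)  # lowest to highest expression
    result = []
    for i in range(n_bins):
        # Equal-sized slices of the sorted gene list (5% each when n_bins=20).
        chunk = genes[i * len(genes) // n_bins:(i + 1) * len(genes) // n_bins]
        rs = [pearson(brain1[g], brain2[g]) for g in chunk]
        rs = [r for r in rs if not math.isnan(r)]  # skip constant genes
        result.append(sum(rs) / len(rs) if rs else float("nan"))
    return result
```

Plotting the returned list against the bin index, once for microarray and once for RNA-Seq, would give a curve of the kind the caption describes.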
Figure 4
Probes chosen by RNA-Seq show improved reproducibility metrics. A) Example gene (ZFR2) with different probes showing the "worst" between-method correlation (left), the "highest" average expression (center) and the "best" between-method correlation (right). Each plot shows the expression level of a microarray probe (y-axis) and the corresponding gene TPM value as measured by RNA-Seq (x-axis). Each dot represents a single sample in our training set (brain 2). Two of these probes would be filtered out as "low quality" using our metric. B) Between-method (left) and between-brain (right) measures of differential expression correlation when defining microarray genes based on the worst, highest, and best probes (left three bars). Note that correlations in the "highest probes" bars come directly from Figure 2G (*) and Figure 2E (^). The other two bars correspond to the subsets of best probes that pass (green) and fail (red) quality control based on our filtering strategy, respectively. Note that the best passing probes have the highest reproducibility. C) Genes with low expression are more likely to fail than genes with moderate to high expression. Genes were binned based on expression levels (x-axis) and the number of passing and failing probes is shown for each bin (y-axis). 91% of genes with log2(intensity) > 3 passed, compared to only 47% with lower expression.
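The three probe categories in panel A can be sketched as follows, assuming per-sample intensities for every probe of one gene and per-sample RNA-Seq values for that gene in the same sample order. The function and probe names are hypothetical. The pass/fail quality metric is not specified in this caption, so the sketch only selects probes and does not filter them.

```python
import math

def pearson(x, y):
    """Pearson correlation; constant inputs are treated as zero correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def pick_probes(probe_values, gene_tpm):
    """Pick the worst, highest, and best probe for a single gene.

    probe_values: dict probe_id -> per-sample intensities (hypothetical layout)
    gene_tpm:     per-sample RNA-Seq values for the gene, same sample order
    """
    r = {p: pearson(v, gene_tpm) for p, v in probe_values.items()}
    mean = {p: sum(v) / len(v) for p, v in probe_values.items()}
    return {
        "worst": min(r, key=r.get),          # lowest between-method correlation
        "highest": max(mean, key=mean.get),  # highest average intensity
        "best": max(r, key=r.get),           # highest between-method correlation
        "r": r,
    }
```

A filtering step would then apply a quality cutoff to the "best" probe; any such cutoff would have to come from the paper's Methods rather than from this sketch.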
Figure 5
Scaling of microarray probes by RNA-Seq leads to improved biological reproducibility. A) Strategy to convert intensity levels of all probes to sequencing scaled microarray intensities (SSMIs) using samples from brain 2. SATB2 is shown as an example. 5th and 95th quantiles (red dots) are compared between methods, and microarray intensities are scaled linearly such that these quantiles align. Grey and black dots show expression of a sample in brain 2 for both methods before and after scaling, respectively. Inset shows the range of slope (m) and intercept (b) parameters across all probes (25%, 50%, and 75% quantiles shown in bold; 5% and 95% quantiles shown in light lines or enumerated if off the plot). B-C) After scaling (black dots), all samples in brain 1 show markedly improved between-method correlation of absolute expression levels compared with before (grey dots). This result holds for all 115 samples in brain 1 (B). A single example is shown in C (corresponding to the arrow in B; labeling as in Figure 2B). Diagonal dotted line indicates perfect agreement of absolute expression levels (y = x). D) SSMIs show improved reproducibility between methods based on between-method (left; * = compare with Figure 2G) and between-brain (right; ^ = compare with Figure 2E) differential expression measures (compare with Figure 4C). The blue line indicates the between-brain correlation as measured by RNA-Seq (Figure 2F), which is now only slightly higher (ΔR=0.02) than in microarray.
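The quantile-alignment step in panel A amounts to fitting a line y = m·x + b per probe so that the probe's 5th and 95th quantiles land on the corresponding RNA-Seq quantiles. The sketch below assumes both inputs are per-sample values on a common (for example log2) scale from the training brain. The function names and the linear-interpolation quantile rule are illustrative assumptions, not the authors' implementation.

```python
import math

def quantile(values, q):
    """Quantile of a non-empty list by linear interpolation, q in [0, 1]."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def fit_scaling(array_vals, rnaseq_vals, low=0.05, high=0.95):
    """Slope m and intercept b that map the probe's low/high quantiles
    onto the RNA-Seq low/high quantiles."""
    a_lo, a_hi = quantile(array_vals, low), quantile(array_vals, high)
    r_lo, r_hi = quantile(rnaseq_vals, low), quantile(rnaseq_vals, high)
    if a_hi == a_lo:
        raise ValueError("probe has no range between the chosen quantiles")
    m = (r_hi - r_lo) / (a_hi - a_lo)
    b = r_lo - m * a_lo
    return m, b

def apply_scaling(array_vals, m, b):
    """Convert probe intensities to scaled values using the fitted line."""
    return [m * x + b for x in array_vals]
```

Because m and b are fitted once per probe on the training samples, the same pair can be applied to intensities from other samples measured on the same platform, which is the use the caption describes for brain 1.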
Figure 6
Scaling parameters generated in human brain also improve measurements of absolute expression levels in hESC lines. Improved between-method correlation of absolute expression levels is found in the H1 (A) and H9 (B) hESC lines after scaling using parameters identified in brain. Each point shows expression levels for a single gene in microarray (y-axis) compared with RNA-Seq (x-axis). Labeling as in Figure 5C.
