Nucleic Acids Res. 2010 Sep;38(17):e170.
doi: 10.1093/nar/gkq670. Epub 2010 Jul 29.

A two-parameter generalized Poisson model to improve the analysis of RNA-seq data

Sudeep Srivastava et al. Nucleic Acids Res. 2010 Sep.

Abstract

Deep sequencing of RNAs (RNA-seq) has been a useful tool to characterize and quantify transcriptomes. However, there are significant challenges in the analysis of RNA-seq data, such as how to separate signals from sequencing bias and how to perform reasonable normalization. Here, we focus on a fundamental question in RNA-seq analysis: the distribution of the position-level read counts. Specifically, we propose a two-parameter generalized Poisson (GP) model for the position-level read counts. We show that the GP model fits the data much better than the traditional Poisson model. Based on the GP model, we can better estimate gene or exon expression, perform a more reasonable normalization across different samples, and improve the identification of differentially expressed genes and the identification of differentially spliced exons. The usefulness of the GP model is demonstrated by applications to multiple RNA-seq data sets.
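
The abstract does not state the form of the GP distribution. As a minimal sketch, assuming the standard two-parameter generalized Poisson distribution of Consul, P(X = x) = θ(θ + xλ)^(x−1) e^(−θ−xλ) / x!, the probability mass function and its moments can be written as below. The function names and the restriction to 0 ≤ λ < 1 (the overdispersed case) are illustrative assumptions; this is not the authors' estimation procedure.

```python
import math


def gp_log_pmf(x, theta, lam):
    """Log-probability of count x under an assumed generalized Poisson model.

    theta > 0 plays the role of the Poisson rate; lam in [0, 1) is a
    dispersion parameter (lam = 0 gives the ordinary Poisson distribution).
    """
    if x < 0 or theta <= 0 or not (0 <= lam < 1):
        raise ValueError("require x >= 0, theta > 0 and 0 <= lam < 1")
    return (
        math.log(theta)
        + (x - 1) * math.log(theta + x * lam)
        - theta
        - x * lam
        - math.lgamma(x + 1)  # log(x!)
    )


def gp_pmf(x, theta, lam):
    """Probability of count x; computed via logs to avoid overflow."""
    return math.exp(gp_log_pmf(x, theta, lam))


def gp_mean(theta, lam):
    """Mean of the assumed GP distribution: theta / (1 - lam)."""
    return theta / (1 - lam)


def gp_variance(theta, lam):
    """Variance of the assumed GP distribution: theta / (1 - lam)^3."""
    return theta / (1 - lam) ** 3
```

Under this form, the variance exceeds the mean whenever λ > 0, which is how a second parameter can absorb position-level overdispersion that a one-parameter Poisson model cannot.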

Figures

Figure 1.
The fraction of the total mRNA amount derived from highly expressed genes for the human tissue samples. Genes were ranked by the product of their estimated expression level and gene length: [formula image] for the Poisson model (A) or [formula image] for the GP model (B). The percentage of the mRNA amount contributed by the top 1, 10, …, 10 000 genes was then calculated and plotted.
Figure 2.
The observed and the expected frequencies of the read count equal to k. (A–D) show the top four highly expressed genes in the liver tissue. Blue bars are the observed frequencies, brown bars are the expected frequencies under the Poisson model, and magenta bars are the expected frequencies under the GP model. Only values of k for which at least one of the three frequencies is ≥5 are plotted. The total numbers of reads mapped to these four genes ([formula image]) were about 677 476, 329 818, 277 529 and 272 551, and the total numbers of estimated reads from the GP model ([formula image]) were about 6340.2, 11297.0, 4589.7 and 9138.9.
Figure 3.
ROC curves for the GP, the Poisson and the GLM (Poisson, negative binomial and quasi-Poisson links) in the identification of differentially expressed genes. Only genes with estimates in all six models and a reliable log-ratio value in the qRT-PCR experiments were considered. We further limited the analysis to the standard positives (qRT-PCR absolute log-ratio >2.0, n = 218) and the standard negatives (qRT-PCR absolute log-ratio <0.2, n = 74). A true positive was required to be differentially expressed in the same direction according to both RNA-seq and qRT-PCR.
Figure 4.
Number of tissue-specific genes. For each human tissue, we counted the number of genes differentially expressed between this tissue and all the other eight tissues. Grey bars are for the GP model and the white bars are for the Poisson model. The shaded regions represent the shared genes between the GP and Poisson models. Only genes with existing MLEs of the parameters for both the GP and the Poisson models in the two compared samples were considered.
Figure 5.
Identification of differentially spliced exons. (A–C) The number of differentially spliced exons for each pair of human tissues. Only middle exons with existing MLEs of the parameters for both the GP and the Poisson models in each pair of tissues were considered. (A) is for the GP model with FDR 0.01 for each two-tissue comparison, (B) is for the Poisson model with FDR 0.01, and (C) is for the Poisson model with FDR 1.0 × 10^−14. The size of each black box indicates the number of differentially spliced exons; however, box sizes are not comparable between panels (A–C). The total discoveries for (A–C) are 5099, 94849 and 5719, respectively. Ly, lymph node; Te, testes; Ad, adipose; Co, colon; Mu, muscle; He, heart; Li, liver; Bre, breast; Bra, brain. (D) The difference of the inclusion rates calculated from the junction reads for the differentially spliced exons identified by the GP model (GP) or the Poisson model (P). The differentially spliced exons specific to the GP (GP_specific) or the Poisson (P_specific) models are also shown. The FDR was controlled at 0.01.
