Biostatistics. 2018 Oct 1;19(4):562-578.
doi: 10.1093/biostatistics/kxx053.

Missing data and technical variability in single-cell RNA-sequencing experiments

Stephanie C Hicks et al. Biostatistics. 2018.

Abstract

Until recently, high-throughput gene expression technology, such as RNA-sequencing (RNA-seq), required hundreds of thousands of cells to produce reliable measurements. Recent technical advances permit genome-wide gene expression measurement at the single-cell level. Single-cell RNA-seq (scRNA-seq) is the most widely used such technology, and numerous publications are based on data produced with it. However, RNA-seq and scRNA-seq data are markedly different. In particular, unlike RNA-seq, the majority of reported expression levels in scRNA-seq are zeros, which could be either biologically driven (genes not expressing RNA at the time of measurement) or technically driven (genes expressing RNA, but not at a level sufficient to be detected by the sequencing technology). Another difference is that the proportion of genes reporting an expression level of zero varies substantially more across single cells than across RNA-seq samples. However, it remains unclear to what extent this cell-to-cell variation is driven by technical rather than biological variation. Furthermore, while systematic errors, including batch effects, have been widely reported as a major challenge in high-throughput technologies, these issues have received minimal attention in published studies based on scRNA-seq technology. Here, we use an assessment experiment to examine data from published studies and demonstrate that systematic errors can explain a substantial percentage of observed cell-to-cell expression variability. Specifically, we present evidence that some of these reported zeros are driven by technical variation by demonstrating that scRNA-seq produces more zeros than expected and that this bias is greater for lower expressed genes. In addition, this missing data problem is exacerbated by the fact that this technical variation varies from cell to cell. Then, we show how this technical cell-to-cell variability can be confused with novel biological results. Finally, we demonstrate and discuss how batch effects and confounded experiments can intensify the problem.

Figures

Fig. 1.
Boxplots of the detection rate, or the proportion of genes in a cell reporting expression values greater than [formula image], calculated for each cell across 15 publicly available scRNA-seq studies (the number of genes (G) included in each study). Boxplots are ordered by median detection rate across cells within a study. The detection rate across cells and studies ranges from less than 1% to 65%. For Patel and others (2014) [formula image], the data submitted to GEO were de-trended so that measurements for each gene averaged to zero across cells, and the authors applied heavy gene filtering, resulting in only [formula image] genes. Therefore, the detection rate for this study was calculated by downloading raw sequencing files and quantifying gene-level expression for [formula image] genes (see Section 2.1 for complete details).
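The detection rate described in this caption can be illustrated with a minimal sketch. The threshold value did not survive in the caption text, so the default of zero used here is an assumption, and the function name `detection_rate` and the toy values are hypothetical, not taken from the paper.

```python
from statistics import median


def detection_rate(cell_values, threshold=0.0):
    """Fraction of genes in one cell whose value exceeds `threshold`.

    The default threshold of zero is an assumption for illustration.
    """
    if not cell_values:
        raise ValueError("cell has no genes")
    detected = sum(1 for value in cell_values if value > threshold)
    return detected / len(cell_values)


# Toy matrix (made-up values): rows are cells, columns are genes.
cells = [
    [0, 3, 0, 0, 12, 0, 0, 1],
    [5, 0, 0, 2, 7, 0, 1, 1],
    [0, 0, 0, 0, 4, 0, 0, 0],
]

# One detection rate per cell, then the per-study median used to order boxplots.
rates = [detection_rate(cell) for cell in cells]
study_median = median(rates)
print(rates, study_median)
```

Computing one rate per cell and then summarizing within a study mirrors how the boxplots are described as being ordered by median detection rate.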
Fig. 2.
RNA-seq profiles compared to averaged scRNA-seq profile. (A) Scatter plot comparing a bulk RNA-seq profile and an averaged scRNA-seq profile, which we reproduced from Figure 1C in Shalek and others (2013). (B) The MA plot demonstrates there is a bias between the bulk profile and the single-cell profile averaged across cells as the single-cell profile averaged across cells is smaller than the bulk profile for low expressed genes. The dotted line is calculated by binning the values along the x-axis (the average between bulk and single-cell profile averaged across cells on log scale) and calculating the mean of values on the y-axis (difference between the bulk and single-cell profile averaged across cells on log scale) within each bin.
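The binned trend line described for panel (B) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the log base (2), the pseudocount of 1, the sign convention (bulk minus single-cell), the equal-width bins, and the names `ma_values` and `binned_means` are all hypothetical choices, and the data are made up.

```python
import math


def ma_values(bulk, sc_mean, pseudocount=1.0):
    """Return (A, M): A is the log2 average, M is log2 bulk minus log2 single-cell mean."""
    a_vals, m_vals = [], []
    for b, s in zip(bulk, sc_mean):
        log_bulk = math.log2(b + pseudocount)
        log_sc = math.log2(s + pseudocount)
        a_vals.append((log_bulk + log_sc) / 2)
        m_vals.append(log_bulk - log_sc)
    return a_vals, m_vals


def binned_means(a_vals, m_vals, n_bins):
    """Split the A range into equal-width bins; return (bin center, mean M) per non-empty bin."""
    lo, hi = min(a_vals), max(a_vals)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all A values coincide
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for a, m in zip(a_vals, m_vals):
        idx = min(int((a - lo) / width), n_bins - 1)  # the maximum goes in the last bin
        sums[idx] += m
        counts[idx] += 1
    return [
        (lo + (i + 0.5) * width, sums[i] / counts[i])
        for i in range(n_bins)
        if counts[i]
    ]


# Made-up profiles: the single-cell average falls short of bulk only for low values.
bulk = [1, 3, 7, 15, 31, 63]
sc_mean = [0, 1, 3, 15, 31, 63]

a_vals, m_vals = ma_values(bulk, sc_mean)
trend = binned_means(a_vals, m_vals, n_bins=2)
print(trend)
```

In this toy example the mean difference is positive in the low-expression bin and zero in the high-expression bin, which is the kind of pattern the caption describes as a bias for low expressed genes.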
Fig. 3.
Plots comparing bulk and averaged scRNA-seq profiles that demonstrate evidence of more zeros in scRNA-seq data for low expressed genes than expected. Data were obtained from three publicly available scRNA-seq studies that included a matched bulk RNA-seq sample measured on the same population of cells (Shalek and others, 2013; Wu and others, 2014; Trapnell and others, 2014). The red triangles are averages of the single-cell profiles computed in strata defined by the bulk RNA-seq values. The black solid line is what we expect if there is no bias.
Fig. 4.
The distribution of gene expression changes with the detection rate using processed scRNA-seq data available on GEO. Failure to account for differences of the proportion of detected genes between cells over-inflates the gene expression estimates of cells with a low detection rate. The curves were obtained by fitting a locally weighted scatter plot smoothing (loess) with a degree of 1. Because the range of detection rate varied from study to study, the range of the x-axis differs across plots.
Fig. 5.
Illustration of how technical variation can lead to differences in detection rates, which in turn can lead to false differences. Data from Patel and others (2014). (A) Using principal component (PC) analysis, scRNA-seq samples cluster by sequencing instrument. (B) The first PC is strongly associated with the detection rate. (C) Boxplots of detection rates stratified by the sequencing instrument used to sequence cells.
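The association in panel (B) between the first PC and the detection rate can be checked with a small sketch. It is a toy illustration under stated assumptions, not the analysis from the paper: the first PC is approximated by power iteration on column-centered data, the detection threshold is assumed to be zero, the expression values are made up, and all function names are hypothetical. Because the sign of a PC is arbitrary, only the magnitude of the correlation is meaningful.

```python
import math


def center_columns(matrix):
    """Subtract each gene's (column's) mean across cells."""
    n_cells = len(matrix)
    means = [sum(col) / n_cells for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]


def first_pc_scores(matrix, n_iter=100):
    """Score of each cell (row) on the first PC, approximated by power iteration."""
    x = center_columns(matrix)
    n_genes = len(x[0])
    # Uniform starting direction; assumed not orthogonal to the first PC.
    v = [1.0 / math.sqrt(n_genes)] * n_genes
    for _ in range(n_iter):
        scores = [sum(a * b for a, b in zip(row, v)) for row in x]  # X v
        w = [sum(s * row[j] for s, row in zip(scores, x)) for j in range(n_genes)]  # X^T X v
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0:
            break
        v = [c / norm for c in w]
    return [sum(a * b for a, b in zip(row, v)) for row in x]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / (sx * sy)


# Made-up log-scale expression: rows are cells, columns are genes.
cells = [
    [5, 4, 3, 2],
    [5, 4, 3, 0],
    [5, 4, 0, 0],
    [5, 0, 0, 0],
]

detection = [sum(1 for v in row if v > 0) / len(row) for row in cells]
pc1 = first_pc_scores(cells)
r = pearson(pc1, detection)
print(detection, round(abs(r), 3))
```

In this toy matrix the cells differ only in how many genes are detected, so the leading PC tracks the detection rate closely, which is the confounding pattern the figure illustrates.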

References

    1. Achim, K., Pettit, J.-B., Saraiva, L. R., Gavriouchkina, D., Larsson, T., Arendt, D. and Marioni, J. C. (2015). High-throughput spatial mapping of single-cell RNA-seq data to tissue of origin. Nature Biotechnology 33, 503–509.
    2. Bacher, R. and Kendziorski, C. (2016). Design and computational analysis of single-cell RNA-sequencing experiments. Genome Biology 17, 63.
    3. Borel, C., Ferreira, P. G., Santoni, F., Delaneau, O., Fort, A., Popadin, K. Y., Garieri, M., Falconnet, E., Ribaux, P., Guipponi, M., Padioleau, I., Carninci, P., Dermitzakis, E. T. and others (2015). Biased allelic expression in human primary fibroblast single cells. American Journal of Human Genetics 96, 70–80.
    4. Bray, N. L., Pimentel, H., Melsted, P. and Pachter, L. (2016). Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 34, 525–527.
    5. Brennecke, P., Anders, S., Kim, J. K., Kołodziejczyk, A. A., Zhang, X., Proserpio, V., Baying, B., Benes, V., Teichmann, S. A., Marioni, J. C. and others (2013). Accounting for technical noise in single-cell RNA-seq experiments. Nature Methods 10, 1093–1095.
