Nat Commun. 2018 Mar 8;9(1):997.
doi: 10.1038/s41467-018-03405-7.

An accurate and robust imputation method scImpute for single-cell RNA-seq data

Wei Vivian Li et al. Nat Commun. 2018.

Abstract

The emerging single-cell RNA sequencing (scRNA-seq) technologies enable the investigation of transcriptomic landscapes at single-cell resolution. ScRNA-seq data analysis is complicated by excess zero counts, the so-called dropouts, which are caused by the low amounts of mRNA sequenced within individual cells. We introduce scImpute, a statistical method to accurately and robustly impute the dropouts in scRNA-seq data. scImpute automatically identifies likely dropouts and performs imputation only on these values, without introducing new biases to the rest of the data. scImpute also detects outlier cells and excludes them from imputation. Evaluation based on both simulated and real human and mouse scRNA-seq data suggests that scImpute is an effective tool to recover transcriptome dynamics masked by dropouts. scImpute is shown to identify likely dropouts, enhance the clustering of cell subpopulations, improve the accuracy of differential expression analysis, and aid the study of gene expression dynamics.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
A toy example illustrating the workflow in the imputation step of the scImpute method. scImpute first learns each gene’s dropout probability in each cell by fitting a mixture model. Next, scImpute imputes the (highly probable) dropout values in cell j (gene set Aj) by borrowing information from the same gene in other similar cells, which are selected based on gene set Bj (the genes not severely affected by dropout events). The details are described in Eqs. (2) and (3)
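The workflow in this caption can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: the dropout probabilities are assumed to be given (the mixture-model fit is not shown), and the estimator of Eqs. (2) and (3), which this page does not reproduce, is replaced here by a plain inverse-distance weighted average over the nearest cells. The function name `impute_cell` and the parameters `threshold` and `n_neighbors` are hypothetical.

```python
import math

def impute_cell(expr, dropout_prob, j, threshold=0.5, n_neighbors=2):
    """Impute likely dropouts in cell j from similar cells (illustrative only).

    expr[i][c]         : log-scale expression of gene i in cell c
    dropout_prob[i][c] : probability that entry (i, c) is a dropout,
                         assumed to come from a previously fitted mixture model
    """
    n_genes = len(expr)
    n_cells = len(expr[0])

    # Gene set A_j: likely dropouts in cell j. Gene set B_j: trusted values.
    A = [i for i in range(n_genes) if dropout_prob[i][j] >= threshold]
    B = [i for i in range(n_genes) if dropout_prob[i][j] < threshold]

    # Distance between cell j and cell c, measured on the B_j genes only.
    def dist(c):
        return math.sqrt(sum((expr[i][j] - expr[i][c]) ** 2 for i in B))

    # Keep the n_neighbors cells most similar to cell j.
    neighbors = sorted((c for c in range(n_cells) if c != j), key=dist)[:n_neighbors]

    # Inverse-distance weights; the small epsilon avoids division by zero.
    weights = [1.0 / (dist(c) + 1e-8) for c in neighbors]
    total = sum(weights)

    # Values in B_j are left untouched; only values in A_j are replaced.
    imputed = [expr[i][j] for i in range(n_genes)]
    for i in A:
        imputed[i] = sum(w * expr[i][c] for w, c in zip(weights, neighbors)) / total
    return imputed
```

For example, with three cells where cell 0 and cell 1 agree on every trusted gene, calling `impute_cell(expr, dropout_prob, j=0, n_neighbors=1)` fills the flagged entry of cell 0 with the value observed in cell 1 and leaves the other entries unchanged.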
Fig. 2
scImpute corrects the dropouts in the ERCC RNA transcripts. The y-axis and x-axis give the ERCC spike-ins’ log10(count+1) and log10(concentration), respectively, in four randomly selected mouse cortex cells. The imputed data present stronger linear relationships between the true concentrations and the observed counts
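One way to quantify the "stronger linear relationship" described in this caption is the Pearson correlation between log10(concentration) and log10(count+1). The sketch below is an assumption about how such a check could be done, not the authors' evaluation code; the function name `log_log_pearson` is hypothetical and any numbers passed to it are the caller's own.

```python
import math

def log_log_pearson(concentrations, counts):
    """Pearson correlation between log10(concentration) and log10(count + 1)."""
    x = [math.log10(c) for c in concentrations]
    y = [math.log10(k + 1) for k in counts]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

A spike-in that drops to a zero count pulls its point down to log10(0+1) = 0 on the y-axis, which lowers the correlation; a successful imputation of that entry should raise it again.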
Fig. 3
Violin plots showing the log10(count+1) of nine cell cycle genes. The expression levels of these genes belong to three phases (G1, G2M, and S). scImpute has corrected the dropout values of cell cycle genes
Fig. 4
scImpute corrects dropout values and helps define cellular identity in the simulated data. a The first two PCs calculated from the complete data, the raw data, and the imputed data by scImpute, MAGIC, and SAVER. Numbers in the parentheses are the within-cluster sum of squares calculated based on the first two PCs. The within-cluster sum of squares is defined as $\sum_{k=1}^{3}\sum_{j=1}^{50}\lVert y_{kj}-\bar{y}_{k}\rVert^{2}$, where $\bar{y}_{k}=\frac{1}{50}\sum_{j=1}^{50}y_{kj}$ and $y_{kj}$ is a vector of length 2, denoting the first two PCs of cell $j$ in cell type $c_k$. b The expression profiles of the 810 true DE genes in the complete, raw, and imputed datasets
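The within-cluster sum of squares in this caption is the sum, over all cell types, of the squared Euclidean distances between each cell's first two PCs and the mean of its cell type. A minimal Python sketch follows; the function name `within_cluster_ss` and the input layout are assumptions made for illustration, and the function accepts any number of clusters and cells rather than only the 3 cell types of 50 cells each used in the caption.

```python
def within_cluster_ss(clusters):
    """Within-cluster sum of squares on the first two PCs.

    clusters: list of cell types; each cell type is a list of (pc1, pc2)
              pairs, one pair per cell.
    """
    total = 0.0
    for cells in clusters:
        n = len(cells)
        # Centroid of the cell type: mean of each principal component.
        centroid = [sum(cell[d] for cell in cells) / n for d in range(2)]
        # Add the squared Euclidean distance of every cell to the centroid.
        total += sum(
            (cell[d] - centroid[d]) ** 2 for cell in cells for d in range(2)
        )
    return total
```

Smaller values mean that cells of the same type sit closer together in the PC plane, which is why the caption reports this number for the complete, raw, and imputed data.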
Fig. 5
scImpute improves cell subpopulation clustering in the mouse embryonic cells. The scatter plots show the first two PCs obtained from the raw and imputed data of mouse embryonic cells. The black dots mark the outlier cells detected by scImpute
Fig. 6
scImpute helps identify cell subpopulations in the PBMC dataset. The scatter plots give the first two dimensions of the t-SNE results calculated from the raw and imputed PBMC datasets. Numbers marked on the imputed data are cluster labels. Cell type information is marked for major clusters. We note that for the raw data, we did not mask zero expression values as missing values in the dimension reduction and the clustering steps
Fig. 7
scImpute improves differential gene expression analysis and reveals expression dynamics in time-course experiments. a Raw and imputed expression levels of two marker genes of DEC. b Time-course expression patterns of the gene GDF3, which is annotated with the GO term “endoderm development.” Black triangles mark the gene’s expression in bulk data. c Selected GO terms enriched in the DEC-upregulated genes that can only be detected (by DESeq2 or MAST) in the data imputed by scImpute, but not in the raw data
