JMLR Workshop Conf Proc. 2016;48:1070-1079.

Dirichlet Process Mixture Model for Correcting Technical Variation in Single-Cell Gene Expression Data

Sandhya Prabhakaran et al. JMLR Workshop Conf Proc. 2016.

Abstract

We introduce an iterative normalization and clustering method for single-cell gene expression data. The emerging technology of single-cell RNA-seq gives access to gene expression measurements for thousands of cells, allowing discovery and characterization of cell types. However, the data is confounded by technical variation emanating from experimental errors and cell type-specific biases. Current approaches perform a global normalization prior to analyzing biological signals, which does not resolve missing data or variation dependent on latent cell types. Our model is formulated as a hierarchical Bayesian mixture model with cell-specific scalings that aid the iterative normalization and clustering of cells, teasing apart technical variation from biological signals. We demonstrate that this approach is superior to global normalization followed by clustering. We show identifiability and weak convergence guarantees of our method and present a scalable Gibbs inference algorithm. This method improves cluster inference in both synthetic and real single-cell data compared with previous methods, and allows easy interpretation and recovery of the underlying structure and cell types.
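
To make the idea of cell-specific scalings concrete, here is a minimal sketch of a toy generative process. It is not the authors' implementation. It assumes that each cell j is assigned to a cluster by a Chinese restaurant process (a standard constructive view of a Dirichlet process mixture), that a cell-specific factor alpha_j scales the cluster mean, and that beta_j scales the cluster variance. Genes are treated as independent for simplicity, and all function names, priors, and parameter values are hypothetical.

```python
import random


def crp_assignments(n_cells, concentration, rng):
    """Assign cells to clusters with a Chinese restaurant process."""
    counts = []  # number of cells currently in each cluster
    labels = []
    for _ in range(n_cells):
        # An existing cluster k has weight counts[k];
        # opening a new cluster has weight `concentration`.
        weights = counts + [concentration]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(0)
        counts[k] += 1
        labels.append(k)
    return labels


def simulate_cells(n_cells=200, n_genes=5, concentration=1.0, seed=0):
    """Draw toy expression vectors with cell-specific scaling (assumed form)."""
    rng = random.Random(seed)
    labels = crp_assignments(n_cells, concentration, rng)
    n_clusters = max(labels) + 1

    # Cluster-specific moments: one mean and one standard deviation per gene.
    means = [[rng.gauss(0, 3) for _ in range(n_genes)] for _ in range(n_clusters)]
    sds = [[rng.uniform(0.5, 1.5) for _ in range(n_genes)] for _ in range(n_clusters)]

    cells = []
    for k in labels:
        # Hypothetical log-normal priors on the cell-specific scalings.
        alpha = rng.lognormvariate(0, 0.5)  # scales the cluster mean
        beta = rng.lognormvariate(0, 0.5)   # scales the cluster variance
        x = [rng.gauss(alpha * m, (beta ** 0.5) * s)
             for m, s in zip(means[k], sds[k])]
        cells.append({"label": k, "alpha": alpha, "beta": beta, "x": x})
    return cells


cells = simulate_cells()
print(len(cells), "cells in", max(c["label"] for c in cells) + 1, "clusters")
```

Because two cells from the same cluster can carry very different scalings, clustering the raw vectors can split one cell type or merge two, which is the motivation for inferring the scalings and the cluster assignments jointly.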

Figures

Figure 1
Distribution of log library size in an example scRNA-seq dataset (Zeisel et al., 2015). The heavy tail is indicative of over-dispersion in the data. Two windows of cells with low and high library size are selected to motivate cell-specific scaling in Section 2.
Figure 2
Top: Means and variances per gene across a window of cells with high library size vs a window of cells with low library size (each data point is one gene). Bottom: Same for a particular cluster (cell type): interneurons.
Figure 3
Toy example showing stochastic data generation. Top: Ideal case without technical variation: observations per cell are drawn from a DPMM (a). An example of 10 cells in (b) with block covariance (c). Bottom: When cell-specific variations are present, observations are drawn from a DPMM (a) with scaled cluster-specific moments (b), where block structures can be partially lost (c).
Figure 4
Plate model for BISCUIT. xj is the observed gene expression of cell j, white circles denote latent variables of interest, rectangles indicate replications with the replicative factor at the bottom right corner, diamonds are hyperparameters and double diamonds are hyperpriors calculated empirically.
Figure 5
Left to right: Confusion matrices showing true labels and those from MCMC-based methods.
Figure 6
Boxplots of F-scores obtained in 15 experiments with randomly generated X for various methods.
Figure 7
Boxplots of F-scores obtained in 15 experiments with randomly generated X from a negative binomial distribution.
Figure 8
Actual cell types (top left) compared to the mode of inferred classes using BISCUIT (top center) versus other approaches for 3005 cells in the Zeisel et al. (2015) dataset. Cells are projected to 2D using t-SNE (Van der Maaten & Hinton, 2008).
Figure 9
Inferred αj vs library size per cell j. Error bars show 1 s.d. across Gibbs sweeps.
Figure 10
Inferred αj, βj vs log of down-sampling rates rj per cell; shaded areas show 70% confidence intervals.
Figure 11
Mode of inferred classes for an imputed dataset generated by down-sampling (DS) 500 cells from a real dataset (middle), compared to actual cell types (left). F-measure vs center of a sliding window on DS rates (r) for 10 different down-sampled datasets generated with different rates, averaged across Gibbs sweeps; shaded area shows 70% confidence interval.
Figure 12
Density plot of imputed values with BISCUIT (left) and normalization to library size (right) on a down-sampled dataset vs original values prior to down-sampling.
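
Several captions above refer to library size, normalization to library size, and down-sampling at a per-cell rate. The sketch below illustrates one plausible reading of those operations and is not taken from the paper. It assumes that library size is the total count per cell, that global normalization rescales each cell to the median library size, and that down-sampling keeps each count independently with probability r (binomial thinning). All function names are hypothetical.

```python
import math
import random
import statistics


def library_sizes(counts):
    """Total count per cell (one row per cell, one column per gene)."""
    return [sum(cell) for cell in counts]


def log_library_sizes(counts):
    """Natural log of library size, skipping empty cells."""
    return [math.log(s) for s in library_sizes(counts) if s > 0]


def normalize_to_library_size(counts):
    """Rescale every cell so its total equals the median library size."""
    sizes = library_sizes(counts)
    target = statistics.median(sizes)
    return [
        [c * target / s if s > 0 else 0.0 for c in cell]
        for cell, s in zip(counts, sizes)
    ]


def down_sample(counts, rates, rng):
    """Binomial thinning: keep each unit count with cell-specific probability r."""
    return [
        [sum(rng.random() < r for _ in range(c)) for c in cell]
        for cell, r in zip(counts, rates)
    ]


counts = [[10, 0, 30], [1, 2, 1], [100, 50, 50]]
rng = random.Random(0)
thinned = down_sample(counts, [0.5, 0.9, 0.1], rng)
print(library_sizes(counts), library_sizes(thinned))
```

Note that global normalization applies a single deterministic rescaling per cell and leaves zero counts at zero, so it cannot on its own recover values lost to down-sampling.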

References

    1. Amir El-ad David, Davis Kara L, Tadmor Michelle D, Simonds Erin F, Levine Jacob H, Bendall Sean C, Shenfeld Daniel K, Krishnaswamy Smita, Nolan Garry P, Pe’er Dana. viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. Nature Biotechnology. 2013;31(6):545–552.
    2. Anders Simon, Huber Wolfgang. Differential expression analysis for sequence count data. 2010.
    3. Antoniak Charles E. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics. 1974:1152–1174.
    4. Bengio Yoshua. Deep learning of representations: Looking forward. In: Statistical Language and Speech Processing. Springer; 2013. pp. 1–37.
    5. Blei David M, Jordan Michael I. Variational methods for the Dirichlet process. In: Brodley Carla E., editor. Proceedings of the International Conference on Machine Learning (ICML 2004). Vol. 69. 2004. (ACM International Conference Proceeding Series).
