Dirichlet Process Mixture Model for Correcting Technical Variation in Single-Cell Gene Expression Data

JMLR Workshop Conf Proc. 2016:48:1070-1079.

Sandhya Prabhakaran¹, Elham Azizi¹, Ambrose Carr¹, Dana Pe'er¹

PMID: 29928470
PMCID: PMC6004614

Sandhya Prabhakaran et al. JMLR Workshop Conf Proc. 2016.

Affiliation

¹ Departments of Biological Sciences, Systems Biology and Computer Science, Columbia University, New York, NY, USA.

Abstract

We introduce an iterative normalization and clustering method for single-cell gene expression data. The emerging technology of single-cell RNA-seq gives access to gene expression measurements for thousands of cells, allowing discovery and characterization of cell types. However, the data is confounded by technical variation emanating from experimental errors and cell type-specific biases. Current approaches perform a global normalization prior to analyzing biological signals, which does not resolve missing data or variation dependent on latent cell types. Our model is formulated as a hierarchical Bayesian mixture model with cell-specific scalings that aid the iterative normalization and clustering of cells, teasing apart technical variation from biological signals. We demonstrate that this approach is superior to global normalization followed by clustering. We show identifiability and weak convergence guarantees of our method and present a scalable Gibbs inference algorithm. This method improves cluster inference in both synthetic and real single-cell data compared with previous methods, and allows easy interpretation and recovery of the underlying structure and cell types.
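As a rough illustration of the kind of generative process the abstract describes, the sketch below draws cluster labels from a Chinese restaurant process (one standard construction of a Dirichlet process mixture) and then scales each cluster's moments per cell. The specific form used here, where a cell-specific factor `alpha_j` multiplies the cluster mean and `beta_j` multiplies the cluster covariance, is an assumption based on the "scaled cluster-specific moments" wording in the Figure 3 caption; the priors, parameter ranges, and function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def sample_crp(n_cells, concentration, rng):
    """Draw cluster labels from a Chinese restaurant process."""
    labels = np.zeros(n_cells, dtype=int)
    counts = [1]  # the first cell opens the first cluster
    for j in range(1, n_cells):
        # Existing clusters are weighted by size, a new one by the concentration.
        weights = np.array(counts + [concentration], dtype=float)
        k = rng.choice(len(weights), p=weights / weights.sum())
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        labels[j] = k
    return labels

def simulate_cells(n_cells=200, n_genes=5, concentration=1.0, seed=0):
    """Toy data: cluster-specific moments distorted by cell-specific scalings."""
    rng = np.random.default_rng(seed)
    labels = sample_crp(n_cells, concentration, rng)
    n_clusters = int(labels.max()) + 1
    # Cluster-specific means and diagonal covariances (illustrative priors).
    means = rng.normal(5.0, 2.0, size=(n_clusters, n_genes))
    covs = np.stack(
        [np.diag(rng.uniform(0.1, 0.5, n_genes)) for _ in range(n_clusters)]
    )
    # Assumed scaling form: alpha_j scales the mean, beta_j the covariance.
    alpha = rng.uniform(0.5, 1.5, n_cells)
    beta = rng.uniform(0.5, 1.5, n_cells)
    x = np.stack([
        rng.multivariate_normal(alpha[j] * means[labels[j]],
                                beta[j] * covs[labels[j]])
        for j in range(n_cells)
    ])
    return x, labels, alpha, beta

x, labels, alpha, beta = simulate_cells()
```

Setting every `alpha_j` and `beta_j` to 1 recovers a plain mixture without technical variation, which makes it easy to compare how much cluster structure the scalings obscure.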

Figures

**Figure 1**
Distribution of log library size in an example scRNA-seq dataset (Zeisel et al., 2015). The heavy tail is indicative of over-dispersion in data. Two windows of cells with low and high library size are selected for motivating cell-specific scaling in Section 2.

**Figure 2**
**Top**: Means and variances per gene across a window of cells with high library size vs a window of cells with low library size (each data point is one gene). **Bottom**: Same for a particular cluster (cell type): interneurons.

**Figure 3**
Toy example showing stochastic data generation. **Top**: Ideal case without technical variation: observations per cell are drawn from a DPMM (a). An example of 10 cells in (b) with block covariance (c). **Bottom**: When cell-specific variations are present, observations are drawn from a DPMM (a) with scaled cluster-specific moments (b), where block structures can be partially lost (c).

**Figure 4**
Plate model for BISCUIT. x_j is the observed gene expression of cell j, white circles denote latent variables of interest, rectangles indicate replications with the replicative factor at the bottom right corner, diamonds are hyperparameters and double diamonds are hyperpriors calculated empirically.

**Figure 5**
Left to right: Confusion matrices showing true labels and those from MCMC-based methods.

**Figure 6**
Boxplots of F-scores obtained in 15 experiments with randomly-generated X for various methods.

**Figure 7**
Boxplots of F-scores obtained in 15 experiments with randomly-generated X from a negative binomial distribution.

**Figure 8**
Actual cell types (**top** left) compared to mode of inferred classes using BISCUIT (**top** center) versus other comparative approaches for 3005 cells in the Zeisel et al. (2015) dataset. Cells are projected to 2D using t-SNE (Van der Maaten & Hinton, 2008).

**Figure 9**
Inferred *α_j* vs library size per cell j. Error bars show 1 s.d. across Gibbs sweeps.

**Figure 10**
Inferred *α_j*, *β_j* vs log of down-sampling rates *r_j* per cell; shaded areas show 70% confidence intervals.

**Figure 11**
Mode of inferred classes for an imputed dataset generated by down-sampling (DS) 500 cells from a real dataset (**middle**), compared to actual cell types (**left**). F-measure vs center of a sliding window on DS rates (r) for 10 down-sampled datasets obtained with different rates, averaged across Gibbs sweeps; shaded area shows 70% confidence interval.

**Figure 12**
Density plot of imputed values with BISCUIT (left) and normalization to library size (right) on a down-sampled dataset vs original values prior to down-sampling.
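The down-sampling experiments in Figures 10 to 12 and the "normalization to library size" baseline can be sketched as follows. This is a minimal sketch under stated assumptions: the captions do not say how down-sampling was performed, so binomial thinning with a per-cell rate `r_j` is assumed here, and rescaling each cell to the median library size is one common form of global normalization rather than necessarily the one used in the paper. The function names are hypothetical.

```python
import numpy as np

def downsample_counts(counts, rates, rng):
    """Binomial thinning (assumed scheme): keep each counted molecule
    of cell j independently with probability rates[j]."""
    return rng.binomial(counts, rates[:, None])

def normalize_to_library_size(counts):
    """Global normalization: rescale each cell (row) so that its total
    count equals the median library size across cells."""
    library_size = counts.sum(axis=1, keepdims=True)
    target = np.median(library_size)
    # Guard against division by zero for cells with no counts.
    return counts / np.maximum(library_size, 1) * target

rng = np.random.default_rng(0)
counts = rng.poisson(20, size=(6, 4))   # toy cells-by-genes count matrix
rates = rng.uniform(0.2, 0.9, size=6)   # per-cell down-sampling rates r_j
downsampled = downsample_counts(counts, rates, rng)
normalized = normalize_to_library_size(downsampled)
```

Because this normalization applies one scalar per cell to all genes, it can equalize library sizes but cannot restore entries that down-sampling drove to zero, which is the limitation the abstract attributes to global normalization.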

Cited by

scINRB: single-cell gene expression imputation with network regularization and bulk RNA-seq data.
Kang Y, Zhang H, Guan J. Brief Bioinform. 2024 Mar 27;25(3):bbae148. doi: 10.1093/bib/bbae148. PMID: 38600665.
scCURE identifies cell types responding to immunotherapy and enables outcome prediction.
Zou X, Liu Y, Wang M, Zou J, Shi Y, Su X, Xu J, Tong HHY, Ji Y, Gui L, Hao J. Cell Rep Methods. 2023 Nov 20;3(11):100643. doi: 10.1016/j.crmeth.2023.100643. PMID: 37989083.
A new and effective two-step clustering approach for single cell RNA sequencing data.
Li R, Guan J, Wang Z, Zhou S. BMC Genomics. 2023 Nov 9;23(Suppl 6):864. doi: 10.1186/s12864-023-09577-x. PMID: 37946133.
Essential procedures of single-cell RNA sequencing in multiple myeloma and its translational value.
Du J, Gu XR, Yu XX, Cao YJ, Hou J. Blood Sci. 2023 Nov 2;5(4):221-236. doi: 10.1097/BS9.0000000000000172. eCollection 2023 Oct. PMID: 37941914. Review.
scKINETICS: inference of regulatory velocity with single-cell transcriptomics data.
Burdziak C, Zhao CJ, Haviv D, Alonso-Curbelo D, Lowe SW, Pe'er D. Bioinformatics. 2023 Jun 30;39(39 Suppl 1):i394-i403. doi: 10.1093/bioinformatics/btad267. PMID: 37387147.
