Review
Mol Syst Biol. 2024 Jul;20(7):744-766.
doi: 10.1038/s44320-024-00045-6. Epub 2024 May 29.

Building and analyzing metacells in single-cell genomics data

Mariia Bilous et al. Mol Syst Biol. 2024 Jul.

Abstract

The advent of high-throughput single-cell genomics technologies has fundamentally transformed biological sciences. Currently, millions of cells from complex biological tissues can be phenotypically profiled across multiple modalities. The scaling of computational methods to analyze and visualize such data is a constant challenge, and tools need to be regularly updated, if not redesigned, to cope with ever-growing numbers of cells. Over the last few years, metacells have been introduced to reduce the size and complexity of single-cell genomics data while preserving biologically relevant information and improving interpretability. Here, we review recent studies that capitalize on the concept of metacells, and the many variants in nomenclature that have been used. We further outline how and when metacells should (or should not) be used to analyze single-cell genomics data and what should be considered when analyzing such data at the metacell level. To facilitate the exploration of metacells, we provide a comprehensive tutorial on the construction and analysis of metacells from single-cell RNA-seq data (https://github.com/GfellerLab/MetacellAnalysisTutorial) as well as a fully integrated pipeline to rapidly build, visualize and evaluate metacells with different methods (https://github.com/GfellerLab/MetacellAnalysisToolkit).

Keywords: Coarse-graining; Metacells; Single-cell Data Analysis; Single-cell Genomics; Tutorial.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Figure 1. Main conceptual steps in the metacell construction workflow.
Starting from a single-cell profile matrix, space and metrics are first defined for identifying cells displaying high similarity in their profiles (e.g., high transcriptomic similarity in scRNA-seq data). Second, highly similar cells are grouped into metacells. Third, single-cell profiles within each metacell are aggregated to create a metacell profile matrix. Dots represent single cells colored by cell type.
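The three steps in this caption can be illustrated with a minimal sketch. It is not the algorithm of MC2, SuperCell, SEACells or any other tool: the grouping step below uses plain k-means iterations with a deterministic start, chosen only to keep the example short. The function name `build_metacells`, the toy embedding and the toy counts are all hypothetical.

```python
from math import dist

def build_metacells(embedding, counts, n_metacells, n_iter=10):
    """Toy three-step metacell construction (illustrative only).

    embedding: one low-dimensional coordinate tuple per cell
    counts:    one list of gene counts per cell
    Returns (membership, metacell_profiles).
    """
    # Step 1: define the space and metric. Assumption: Euclidean distance
    # in the embedding supplied by the caller.
    step = len(embedding) // n_metacells
    centers = [list(embedding[i * step]) for i in range(n_metacells)]

    # Step 2: group highly similar cells. Assumption: plain k-means
    # iterations stand in for a real metacell partitioning method.
    membership = [0] * len(embedding)
    for _ in range(n_iter):
        membership = [
            min(range(n_metacells), key=lambda m: dist(cell, centers[m]))
            for cell in embedding
        ]
        for m in range(n_metacells):
            members = [embedding[i] for i, lab in enumerate(membership) if lab == m]
            if members:
                centers[m] = [sum(col) / len(members) for col in zip(*members)]

    # Step 3: aggregate single-cell profiles into one profile per metacell.
    # Assumption: aggregation by summing raw counts.
    n_genes = len(counts[0])
    profiles = [[0] * n_genes for _ in range(n_metacells)]
    for lab, cell_counts in zip(membership, counts):
        for g, c in enumerate(cell_counts):
            profiles[lab][g] += c
    return membership, profiles

# Hypothetical toy data: six cells in a 2-D embedding, two genes.
embedding = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
counts = [[1, 0], [2, 0], [0, 1], [0, 3], [1, 4], [0, 5]]
membership, profiles = build_metacells(embedding, counts, n_metacells=2)
print(membership)  # [0, 0, 0, 1, 1, 1]
print(profiles)    # [[3, 1], [1, 12]]
```

Summing counts is one possible aggregation; averaging is another, and the choice affects how metacell sizes should be handled downstream.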
Figure 2
Figure 2. Graining level of metacell partition.
(A) tSNE representation of a peripheral blood mononuclear cell (PBMC) scRNA-seq dataset (see Appendix) at different graining levels. Each dot represents a single cell, a metacell or a cluster, depending on the graining level. Colors represent cell types. (B) Distribution of graining levels in different studies using metacells (see Dataset EV2). Colors represent different metacell construction tools. (C) Graining levels used for datasets of different sizes. Colors represent different metacell construction tools. (D) Examples of single-cell RNA-seq datasets with different levels of complexity (T cells, cord blood mononuclear cells (CBMCs) and bone marrow datasets). (E) Number of cell types recovered at different graining levels in the three examples of panel (D). (F) Examples of single-cell RNA-seq datasets with different sizes. (G) Number of cell types recovered at different graining levels in the three examples of panel (F).
Figure 3
Figure 3. Metacell quality metrics.
(A) Purity is defined as the proportion of cells from the most abundant cell type in a metacell. Higher purity corresponds to a higher proportion of cells of the same cell type within a metacell. Purity can also be defined using annotations or categories other than cell types. (B) Compactness is defined as the average variance of the latent space components within a metacell. Better compactness corresponds to lower variance in the latent space components of the cells grouped into a metacell. (C) Separation is defined as the distance to the closest metacell. Better separation corresponds to more distant metacells in the latent space. (D) Inner normalized variance is defined as the mean normalized gene variance within a metacell. Better inner normalized variance corresponds to lower variance of the single-cell profiles within a metacell. (E) Metacell size distribution is defined as the distribution of the number of cells in each metacell. Better metacell size distribution corresponds to more homogeneous metacell sizes. (F) Representativeness corresponds to the ability of metacells to faithfully represent the global structure of the single-cell dataset. Better representativeness corresponds to more uniform coverage of the dataset (black stars represent the centroid of each metacell). (G) Conservation of the downstream analyses at the metacell level is defined as the ability of metacells to preserve the results of the single-cell analysis.
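The first three metrics in this caption translate directly into short functions. The sketch below is a minimal reading of the caption wording, not the implementation used by any metacell tool. The function names, the use of population variance, Euclidean distance between metacell centroids, and the toy inputs are assumptions made for illustration.

```python
from collections import Counter
from math import dist
from statistics import pvariance

def purity(labels):
    """Proportion of cells from the most abundant annotation in one metacell."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

def compactness(latent_cells):
    """Average per-component variance of one metacell's cells in latent space.

    Lower values mean a more compact metacell. Assumption: population variance.
    """
    n_components = len(latent_cells[0])
    return sum(pvariance(component) for component in zip(*latent_cells)) / n_components

def separation(centroids):
    """Distance from each metacell centroid to its closest other centroid.

    Higher values mean better-separated metacells. Assumption: Euclidean distance.
    """
    return [
        min(dist(c, other) for j, other in enumerate(centroids) if j != i)
        for i, c in enumerate(centroids)
    ]

# Hypothetical toy inputs.
purity_example = purity(["T cell", "T cell", "T cell", "B cell"])
compactness_example = compactness([(0.0, 0.0), (2.0, 0.0)])
separation_example = separation([(0, 0), (3, 4), (10, 0)])
print(purity_example)       # 0.75
print(compactness_example)  # 0.5
```

Purity needs a per-cell annotation, whereas compactness and separation only need latent coordinates, so the latter two can be computed on unannotated data.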
Figure 4
Figure 4. Relationships between metacells and sketching or imputation.
Metacells combine the reduction in size of sketching approaches and the reduction in sparsity of imputation strategies.
Figure 5
Figure 5. Limitations of metacells.
(A) Example of limitations in metacells when aggregating cells of different cell types (i.e., impure metacell_3 in the example). Such impure metacells can lead to mixed profiles and artifacts in gene co-expression analyses. (B) Correlation between the size of metacells and the number of detected genes. (C) Computational cost of metacell construction (using MC2, SuperCell, and SEACells at a graining level of 75). Time (CPU time) is represented in minutes and memory (max RSS) in GB as a function of the number of cells in the dataset being analyzed. Colors and shapes highlight the tool used for metacell construction. The y-axis is displayed on a log10 scale. All tasks were run on a machine with 500 GB and a time limit of 20 h with 1 CPU except for the run of MC2 with multithreading (10 CPUs). (D) Schematic representation of the integration strategy recommended to analyze large datasets with multiple samples using metacells: (i) constructing the metacells for each sample, (ii) integrating the samples at the metacell level, (iii) performing downstream analyses on the integrated metacell atlas. (E) Computational cost of metacell construction (using MC2, SuperCell, and SEACells at a graining level of 75), metacell construction + downstream analysis, and single-cell analysis (with and without using BPCells in Seurat). Time (CPU time) is represented in minutes and memory (max RSS) in GB as a function of the number of cells in the dataset being analyzed. Following the approach described in panel (D), metacells were built on a per-embryo basis and in parallel using 15 CPUs. After sample integration, downstream analyses included dimensionality reduction, clustering, and differential analysis. Colors and shapes highlight the tool used for metacell construction. The y-axis is displayed on a log10 scale.
Figure 6
Figure 6. Concepts that share similarities with metacells.
(A) Example of nested communities. (B) Example of graph abstraction. (C) Example of neighborhoods. (D) Example of sample-specific pseudobulks. (E) Example of cell-type-/sample-specific pseudobulks. (F) Example of pseudocells. (G) Example of pseudobulks of pseudoreplicates.
Figure 7
Figure 7. Impact of metacell sizes on the results of the downstream analyses.
(A) Comparison of the results of weighted versus non-weighted differential abundance analysis at the metacell level. Each dot is a metacell colored by cell type. Bars correspond to the estimated proportions of each cell type in a condition with and without considering the size of each metacell. (B) Comparison of the results of weighted versus non-weighted differential expression analysis. Each dot is a metacell colored by cell type. Solid and dashed lines correspond to weighted and non-weighted estimation of mean expression. (C) Results of weighted and non-weighted principal component analysis for the same dataset. Each dot is a metacell colored by cell type. Better separation of cell types is observed in the weighted PCA. (D) Results of weighted and non-weighted Louvain clustering, with dots representing metacells colored by cluster annotation. Dot sizes correspond to the sizes of metacells.
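The effect described in panels (A) and (B) can be reproduced with a small sketch in which each metacell is weighted by the number of cells it contains. The function names and toy values are hypothetical. The weighted mean also assumes that each metacell value is a per-cell average rather than a summed count.

```python
from collections import defaultdict

def cell_type_proportions(cell_types, sizes=None):
    """Cell-type proportions across metacells, optionally weighted by metacell size."""
    if sizes is None:
        sizes = [1] * len(cell_types)  # non-weighted: every metacell counts once
    totals = defaultdict(float)
    for cell_type, size in zip(cell_types, sizes):
        totals[cell_type] += size
    grand_total = sum(totals.values())
    return {cell_type: total / grand_total for cell_type, total in totals.items()}

def mean_expression(values, sizes=None):
    """Mean expression across metacells, optionally weighted by metacell size."""
    if sizes is None:
        sizes = [1] * len(values)
    return sum(v * s for v, s in zip(values, sizes)) / sum(sizes)

# Hypothetical toy data: three metacells, one of them much larger than the others.
cell_types = ["T cell", "T cell", "B cell"]
sizes = [10, 10, 80]
expression = [1.0, 3.0, 5.0]

unweighted_props = cell_type_proportions(cell_types)
weighted_props = cell_type_proportions(cell_types, sizes)
unweighted_mean = mean_expression(expression)
weighted_mean = mean_expression(expression, sizes)
print(weighted_props)                  # {'T cell': 0.2, 'B cell': 0.8}
print(unweighted_mean, weighted_mean)  # 3.0 4.4
```

In this toy case, ignoring sizes makes T cells look like two thirds of the population, while the weighted estimate recovers the 20% implied by the underlying cell numbers.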
Figure 8
Metacells increase profile coverage and save computational resources, while preserving biologically relevant heterogeneity in single-cell genomics data.
