Learn more: PMC Disclaimer | PMC Copyright Notice
Single-cell mRNA quantification and differential analysis with Census
Abstract
Single-cell gene expression studies promise to unveil rare cell types and cryptic states in development and disease through a stunningly high-resolution view of gene regulation. However, measurements from single-cell RNA-Seq are highly variable, frustrating efforts to assay how expression differs between cells. We introduce Census, an algorithm available through our single-cell analysis toolkit Monocle 2, which converts relative RNA-Seq expression levels into relative transcript counts without the need for experimental spike-in controls. We show that analyzing changes in relative transcript counts leads to dramatic improvements in accuracy compared to normalized read counts and enables new statistical tests for identifying developmentally regulated genes. We explore the power of Census through reanalysis of single-cell studies in several developmental and disease contexts. Census counts can be analyzed with widely used regression techniques to reveal changes in cell fate-dependent gene expression, splicing patterns, and allelic imbalances, demonstrating that Census enables robust single-cell analysis at multiple layers of gene regulation.
Introduction
Differential gene expression analysis, typically powered by statistical regression, is central to nearly all single-cell transcriptomic studies. As experiments now capture tens of thousands of cells1,2, such regressions could in principle be used to detect gene regulatory changes across individual cells as a function of developmental progression, position in an embryo, or genetic sequence. However, they report measurements with high variability, frustrating efforts to build models that can detect such changes3,4. Numerous studies have reported high rates of “drop-out”, wherein some cells of a nominally homogeneous population express high levels of a gene and others none at all. Drop-outs have spurred the deployment of hurdle models5 that overcome limitations over simpler regression approaches, typically at a cost in speed, numerical stability, or design flexibility for the user.
Single-cell protocols that use exogenous RNA “spike-in” standards6 or unique molecular identifiers7,8 (UMIs) enable analysis to be performed at the level of transcript counts rather than read counts. Previous work by Grun et al. suggested that comparing UMIs, rather than read counts, between cells would improve regression analysis. However, because UMI protocols work by counting 3’ end tags, they are limited to measuring gene expression and do not report expression at allele- or isoform-resolution. Spike-in-based protocols, which convert a cell’s relative abundances to transcript counts through a linear regression between the spikes’ normalized read counts and their known molecular concentrations, can report measurements at this resolution. However, exogenous standards must be carefully calibrated for single-cell experiments lest they dominate the libraries, and may be subject to different rates of degradation or reverse transcription than endogenous RNA. Many published studies have chosen to forgo the use of spike-in controls, restricting subsequent reanalysis.
Here, we introduce Census, an algorithm that converts conventional measures of relative expression such as transcript per million (TPM) in single cells to relative transcript counts without the need for spike-in standards or UMIs. “Census counts” eliminate much of the apparent technical variability in single-cell experiments and are thus easier to model with standard regression techniques than normalized read counts. We demonstrate the power of transcript count analysis with a new regression model, BEAM (Branch Expression Analysis Modeling), for detecting genes that change following fate decisions in development. We also analyze Census counts at the splice isoform and allele level, demonstrating that our approach robustly detects developmental regulation at those resolutions. Census and BEAM are implemented in Monocle 2, the second major release of our open-source single-cell analysis toolkit.
Results
Estimating relative transcript counts in spike-in-free experiments
Census exploits two properties of single-cell RNA-Seq datasets produced with current protocols (Figure 1a). First, mRNA degradation following cell lysis and inefficiencies in the reverse transcription reaction result in the capture of as few as 10% of the transcripts in a cell as cDNA. Second, most protocols rely on template-switching reverse transcriptases primed at the polyA tail of mRNAs and thus generate full-length cDNAs9. Such protocols typically generate libraries in which genes are detected most frequently as a single cDNA molecule (Figure 1b, Supplementary Figure 1). Thus, all detectably expressed genes measured at or below the mode of the (log-transformed) relative abundance distribution in each cell should be present at around 1 cDNA copy (see Methods).
We assessed Census’ accuracy by re-analyzing several experiments that included spike-in controls4,10,11,12,13,14,15. Reanalysis of developing lung epithelial cells with Census recovered estimates of total per-cell transcript counts that were correlated with but not equal to those derived by linear regression against spike-in controls (Figure 1c), likely because of Census’ inability to control for non-linear cDNA amplification during library construction. However, changes in Census counts between groups of cells collected at the same time points were highly similar to changes measured via spike-in controls (Figure 1d,e). Census produced accurate changes in relative transcript counts for seven additional datasets, including two based on UMIs4,10,13, demonstrating that the algorithm can work well with different single-cell RNA-Seq protocols. (Supplementary Figure 1,2). Downsampling and simulation experiments determined that Census counts faithfully capture changes in expression between groups of cells with as few as 100,000 reads per cell and over a wide range of mRNA capture rates (Supplementary Figure 3, 4). Taken together, these benchmarking experiments show that Census recovers an accurate measure of changes in relative transcript counts between single cells without the need for spike-in controls.
Census counts improve differential analysis accuracy
We next assessed whether using Census counts improved downstream differential analysis. We tested several popular tools16,17 for differential expression with both read counts and relative transcript counts, including two tools specifically developed for single-cell data, Monocle18, and SCDE19 (Figure 2a, Supplementary Figure 5). When provided with read counts as a measure of expression, consensus between the tools was poor, with only 1,971 of 5,805 (34%) differentially expressed genes (DEGs) reported by all tools (except SCDE, which has very high precision but low recall), and few agreed with those reported by a nonparametric, permutation-based test between spike-in derived expression levels (Figure 2b, Methods). Tools designed for bulk RNA-Seq analysis, such as DESeq217, produce false discovery rates as high as 61%. SCDE, which includes explicit modeling of drop-outs returned few false positives but also captured a smaller fraction of the true positive set.
Repeating these tests using Census counts showed marked improvements in differential expression accuracy compared to read counts and TPM (Figure 2a). We attribute the improvements to the fact that the negative binomial distribution, which underlies most commonly used RNA-Seq analysis software16,18,19, fits relative transcript count data much better than read count data, as noted by Grun et al.4 (Supplementary Figure 5). For example, when targeting a false discovery rate of 10%, DESeq2’s empirical false discovery rate dropped dramatically from 61% to 22% with little to no drop in sensitivity, which remains as high as 82%. Monocle’s false discovery rate dropped from 53% to 11%. Importantly, using Census counts dramatically improved agreement between the tools, which all agreed on 2,437 DEGs among a total of 4,220 (70%), similar to the 62% (2,367 / 3,793) consensus genes obtained with spike-in derived levels (Figure 2b). Census also improved DE accuracy relative to gold standards derived from bulk RNA-Seq18 measurements (Supplementary Figure 6). Taken together, our benchmarks demonstrate that single-cell relative transcript counts produced by Census can be more accurately compared with commonly used differential analysis methods than normalized read counts, and are thus preferable when spike-in standards or UMIs are unavailable.
Differential analysis of branch points in developmental trajectories reveals regulators of cell fate
Many single-cell gene expression studies aim to identify gene regulatory circuits that control cell-fate decisions made during development20,21. We recently developed Monocle, an algorithm that organizes single cells along trajectories and can describe the gene expression changes executed during cell differentiation. Monocle introduced the concept of “pseudotime”, which quantifies each cell’s progress through development. Pseudotime resolves cascades of gene regulatory changes that accompany differentiation and other dynamic cellular programs18. Monocle produces more reliable tests for differential expression along a trajectory when provided with Census counts than with relative expression values (Supplementary Figure 7).
Single-cell trajectories can have multiple outcomes, such as during the generation of alternative developmental lineages22. Analyzing cells at branch points where cells are diverted along two or more mutually exclusive paths could reveal the mechanisms by which such decisions are made. For example, scrutinizing genes upregulated in common myeloid progenitors but downregulated in common lymphoid progenitors has shed light on the molecular regulation of cell fate in hematopoiesis23,24.
To explore a developmental fate decision at single-cell resolution, we reanalyzed RNA-seq data from a recent study investigating the specification of the distal lung epithelium25. Treutlein et al. sequenced developing epithelial cells to define the cellular intermediates giving rise to type I (AT1) and type II (AT2) pneumocytes. Monocle reconstructed a trajectory with a single branch point leading from progenitors to two outcomes corresponding to the AT1 and AT2 fates. The beginning of the trajectory contained cells with high levels of markers of active proliferation26 (Ccnb2, Cdk1), whereas these genes were expressed at much lower levels after the branch point (Figure 3a). High expression of a known marker of AT1 cells27 (Pdpn) was restricted to cells on one branch of the tree, whereas cells expressing an AT2 marker28 (Sftpb) at high levels were located on the other branch. Cells classified as AT1 and AT2 according to known markers by Treutlein et al. fell exclusively along the branches, with what the authors termed “bipotent progenitors (BP)” at or near the branch point. (Supplementary Figure 8).
To detect cell fate-dependent genes in a statistically robust manner, we developed BEAM, a generalized linear modeling (GLM) 29 strategy for analyzing branched single-cell trajectories (Figure 3b; Supplementary Figure 9, see Methods). BEAM identified 1,219 genes (FDR < 5%) as either AT1- or AT2- fate dependent, including canonical markers27 such as as Pdpn and Sftpb (Figure 3c). AT1-restricted genes were strongly enriched for ontological terms related to tube development, cytoskeletal remodeling, and cell morphogenesis (Supplementary Figure 10, Supplementary Table 1), while AT2-restricted genes were enriched for terms related to lipid processing, consistent with the production of lipid-rich surfactant by AT2 cells in the mature lung. Regulatory DNA elements proximal to these genes were enriched for binding sites of 74 transcription factors, eleven of which exhibited significant branch-dependent expression. Supplementary Figure 11) These factors included several such as Tcf7l2 that are well known to regulate lung development30-35.
Disruption of interferon signaling induces a branch in the dendritic cell LPS stimulation trajectory
Branch points in single-cell trajectories represent steps in a program of transcriptional change in which cells must choose between one of several mutually exclusive gene expression programs. Branches could arise not only during development, but also in response to mutations, treatment with drugs, or other cellular perturbations. We reanalyzed a recent study36 from Shalek and colleagues, which dissected the transcriptional response of murine bone marrow-derived dendritic cells (BMDCs) to lipopolysaccharide (LPS) (Figure 4a). In BMDCs, LPS triggers a paracrine feedback loop of type I interferon signaling mediated in part by Stat137-39. The authors compared BMDCs from wild-type (WT) mice to those from mice that lack the receptor for Interferon alpha (Ifnar1-/-) or Stat1 (Stat1-/-). Monocle recovered a trajectory with a single branch point, with cells from Infar-/- or Stat1-/- mice distributed on an alternative trajectory in response to LPS stimulation compared with those from WT mice (Figure 4b).
BEAM identified 870 genes (FDR < 5%), many associated with interferon signaling, dependent on this branch (Figure 4c, Supplementary Figure 12). Peaks corresponding to open chromatin collected by Lavin et al40 proximal to branch-dependent genes are enriched for Stat1/2 and Irf1/2 binding motifs (Supplementary Figure 13). These factors were themselves significantly branch-dependent, with branching pseudotimes substantially earlier than their putative targets, confirming that BEAM can distinguish the regulatory factors that drive branching in single-cell trajectories from genes downstream (Figure 4e, f). Monocle 2 and BEAM demonstrated that loss of a key paracrine loop generates an “alternative trajectory”, suggesting that single-cell trajectory analysis can be useful for defining how a signaling pathway regulates a larger process.
Census counts enable single-cell differential splicing analysis
Methods for detecting splicing changes in single-cell RNA-Seq experiments are beginning to appear, but have grappled with isoform-level measurement variability. For example, Welch et al. described SingleSplice41, which uses a hurdle model to compare observed variation in isoform frequencies against expected technical variation, but its contrasts are limited to tests for excess variability within groups of cells, rather than as a function of arbitrary variables in a regression, and it requires calibration with spike-in standards.
We used Census to estimate isoform-level transcript counts in differentiating myoblasts, a classic model system for vertebrate splicing. Modeling isoform counts from each gene as a Dirichlet-multinomial distribution captured pseudotime-dependent shifts in splicing in 74 genes (FDR < 0.1), including well-characterized components of the molecular machinery required for muscle contraction such as tropomyosin TPM1, which has been intensely studied in myoblasts as a model of alternative splicing42,43 (Figure 5). TPM1 has three well-characterized sets of alternatively spliced exons, with exons 6b and 9b excluded in myoblasts but included in myotubes44. These exons became progressively more frequent in TPM1 mRNAs, with inclusion of exon 6b preceding inclusion of exon 9b. Each isoform of the 74 differentially spliced genes showed one of seven distinct pseudotemporal expression patterns, (Supplementary Figure 14a, b) coinciding with shifts in the actin family from widely expressed members (ACTB, ACGT) partly replaced with muscle specific ones (ACTA1, ACTA2). (Supplementary Figure 14c). Our analysis supports the view that cytoskeletal reorganization during myoblast differentiation is globally coordinated not only at the level of genes but across individual splice variants.
Census counts enable allelic balance analysis in single cells
Single-cell analysis could in principle shed light on the degree to which the two alleles of each gene are regulated in a coordinated manner. Recently, Deng et al. tracked gene expression genome wide in single-cells from pre-implantation mouse embryos of mixed genetic background (CAST/EiJ × C57BL/6J) 45. Coupling allele-level relative abundances from Kallisto46 with Census produced relative allele transcript counts which, when modeled similarly to isoform counts, recapitulated many of the key observations made in the initial study. As expected, nearly all RNAs matched the maternal allele in zygotes and early 2-cell embyros, consistent with little to no transcription from the embryonic genome (Figure 6a). Allelic balance for most genes equilibrated to 50% as transcription from the embryonic genome began in mid- to late- 2-cell embryos, with the X chromosome notably excepted (Figure 6a). Inactivation of the paternal X chromosome in female embryos was manifest by the 16-cell stage, with progressively fewer genes exhibiting contributions from the paternal X (Figure 6b, c), although genes known to escape inactivation were notably excepted (Supplementary Figure 15).
In addition to pre-implantation allelic dynamics, Deng et al. reported widespread stochastic monoallelic gene expression in individual cells. This claim has been challenged by Kim et al. 47, who analyzed allele-specific expression in embryonic stem cells using a statistical model that attributed much of the apparent stochastic monoallelic expression to technical sources. We tested whether using Census to estimate allelic transcript counts instead of allelic read counts would reduce the apparent stochastic monoallelic expression to expected levels. Consistent with the generative model used by Kim et al., the expected rate of monoallelic expression was near 100% for genes expressed at a single copy, and decreased with increasing expression (Figure 6e). Of 6,608 “allele-informative” genes in the genome, 95.0% produced observed monoallelic transcript counts within the expected range. In contrast, only 77% of genes fell within the range obtained by fitting similar models to normalized read counts for each allele. We interpret this to mean that a substantial portion of apparent monoallelic expression arose because the sequenced libraries correspond to a small proportion of the true RNA molecules in each cell (due to dropout), a technical artifact that is accounted for when allelic gene expression is modeled using Census-estimated relative transcript counts but not when it is modeled using normalized read counts.
Discussion
Efforts to detect changes in gene regulation in development have grappled with high technical and biological variability, demanding specialized statistical methods that explicitly model drop-outs and other nuisance variation. Here, we show that analyzing changes in relative transcript counts leads to dramatic reductions in apparent technical variability compared to normalized read counts, making single-cell RNA-Seq compatible with widely used regression techniques. We have developed Census, a normalization algorithm that can convert relative expression levels from read counts into per-cell transcript counts without the need for spike-in standards or UMIs. The algorithm requires only that genes are most frequently present at 1 cDNA molecule in each cell’s library. We show through reanalysis of several datasets that this is the case with most current protocols, owing to mRNA capture rates lower than 50% and their generation of full-length cDNAs during reverse transcription. Census cannot control for amplification biases, and thus does not produce estimates of lysate mRNA abundances that perfectly match those derived with spike-ins or UMIs. When spike-ins or UMIs are available, transcript counts should be recovered using them rather than Census. However, we show through extensive benchmarking that differential analysis results with Census counts are highly concordant with those from spike-ins. Importantly, tools widely used for bulk RNA-Seq analysis that perform poorly when provided with read counts work vastly better with Census counts, alleviating the need for software tailored for single-cell RNA-Seq.
To illustrate their power, we have developed three regression-based methods for detecting changes in Census counts. The first, BEAM, builds on our previous work tracking gene expression changes in single-cell trajectories, helping pinpoint the moment at which cell-fate decisions occur in a complex biological process. BEAM identified hundreds of genes differentially regulated during specification of the type I and type II pneumocytes in the alveolar epithelium. Surprisingly, branched cell trajectories arise not only in development, but also in response to genetic perturbations, suggesting that branch analysis may be useful in many biological contexts. The second method uses Census counts to find genes undergoing pseudotime-dependent changes in splicing. Reanalysis of differentiating myoblasts showed widespread alteration in isoform ratios in genes involved in muscle contraction and cytoskeletal structure, with some genes such as TPM1 showing a sequence of pseudotime-dependent shifts. The third method captures changes in allelic transcript counts derived with Census. By reanalyzing data from pre-implantation embryos, we confirmed the authors’ timing of transcriptional activation of the embryonic genome and X chromosome inactivation. In contrast to the original study, we do not see substantial evidence of random, monoallelic expression on the autosomes, and attribute this observation to inadequate modeling of dropouts in normalized read counts. Monoallelic expression at the transcript count level was in line with expectations under a simple overdispersed binomial regression model.
Together, our analyses show that single-cell differential expression analyses conducted at the level of normalized transcript counts are more robust and accurate than analyses of normalized read counts. We provide a new algorithm, Census, that makes relative transcript count analysis widely accessible, as well as examples of regression models, in particular BEAM, that leverage them for high-resolution dissection of gene regulation. We expect that such techniques will continue to unveil new mechanisms of gene regulation, including at the allele and isoform level, in development and disease.
Data Availability Statement
Code availability
A version of monocle 2 (version: 1.99) used to produce all the figures, supplementary data is provided in Supplementary Software. The newest Monocle 2 is available through Bioconductor as well as GitHub (https://github.com/cole-trapnell-lab/monocle-release). Supplementary Software also includes a helper package including helper functions as well as all analysis code which can reproduce all figures in this study.
Data availability
Eleven public sc RNA-seq datasets are used in this study, of which 8 datasets used ERCC spike-in. Here is a summary list of all the data:
Datasets with spike-in:
Lung: GSE5258325
Noise model: GSE54695 4
Neuron reprogramming: GSE67310 15
Human Preimplantation Embryos: E-MTAB-3929 10
Pancreas: E-MTAB-506111
Cortex: http://linnarssonlab.org/cortex/ 12
Marker-free: GSE54006 13
Quantitative assessment data: GSE51254 14
Datasets without spike-in:
HSMM: GSE52529 18
Dendritic cell knockout: GSE41265 36
Allele-specific gene expression: GSE45719 45
Online Methods
A generative model for single-cell RNA-seq experiments with a spike-in ladder
Census is motivated by a generative model of single-cell (sc) RNA-Seq similar to the one developed by Kim et al.47. When performing sc-RNA-seq, each individual cell is lysed to recover its endogenous RNA molecules, some fraction of which may be degraded or lost. Lysis thus involves an RNA recovery rate α. Spike-in transcripts are then added into the cell lysate. Note that spike-in transcripts are added to the lysate as naked RNA, and thus may be degraded at different rates from the endogenous RNA. We denote the ladder recovery rate as β. The RNA counts in the lysate can be written:
where Yl, Sl, S, are the transcript counts of endogenous RNA in cell lysate, spike-in transcript counts in cell lysate and the spike-in transcript counts added into the cell lysate. The first subscript in all variables (here and below) corresponds to cell while the second subscript corresponds to gene index. Note that we are not able to directly observe , the true transcript counts for gene j in cell i and thus α is an unknown variable.
The RNA molecules and spike-in transcripts will then be subjected to reverse transcription and amplified to make a cDNA library. The expected number of cDNA molecules generated from each RNA molecules is denoted by θ. The cDNA counts can be written:
where Yd, Sd, are the cDNA counts of endogenous RNA, spike-in cDNA counts successfully converted from the corresponding transcript counts Yl, Sl in cell lysate under a uniform capture rate θ, which for current protocols is less than 1.
Our model generates sequencing reads from the cDNA. The relative cDNA abundances are calculated as for endogenous RNA, or for spike-in RNA.
The model then generates γ reads per cDNA molecule on average; with sufficient sequencing, γ will be larger than 1; we expect each cDNA molecule to generate at least one sequencing read. This process can be regarded as a multinomial sampling of R reads from the distribution of relative cDNA abundances mentioned above which can be represented as:
where , denotes the reads sampled for cDNA from the endogenous RNA or spike- in RNA in cell i, denotes the reads sampled for cDNA from the endogenous RNA j or spike-in RNA j in cell i.
The model described here is essentially a special case of the model in Kim et al., and differs mainly in that their model describes transcript-level capture rates and sequencing rates with beta and gamma distributions, respectively. In contrast, we simply use global constants for these rates. As Census does not make use of variance estimates from the generative model, this simpler model is sufficient for calculating key statistics (e.g. mode of the transcript counts) needed to convert relative to absolute abundances.
A simulator for the sc-RNA-seq process
To generate an in silico library for a single cell, we built a simulator that first selects G genes at random from a relative expression profile (Pbulk) derived from a bulk RNA-Seq experiment to represent the hypothetical relative abundance of a single-cell in cell lysate. These values are rescaled to proportions (i.e. summing to 1), or ρscaled.
These proportions are then used to parameterize a multinomial distribution from which T transcripts are drawn to obtain the transcripts in the library space where we also consider there is αi percentage of the RNA is degraded. Therefore, we have:
To this pool of transcripts, a fixed number of spike-in transcripts are added, forming a mixture of simulated “endogenous” and “spike-in” mRNAs where the degradation of spike-in transcripts is represented by βi. Of these, θi percent are selected uniformly at random to simulate incomplete mRNA capture by the reverse transcription process. Finally, the abundances of these cDNAs relative to one another were used to parameterize another multinomial, from which Ri reads are sampled. The read counts are then used to calculate the relative abundance for the spike-in and the endogenous RNA.
In this study, we systematically simulated the sc RNA-seq process obtained from bulk RNA-Seq measurements made in Trapnell and Cacchiarelli et al18 by varying the gene number G, capture rate θ, endogenous RNA degradation α, spike-in degradation β, total endogenous transcript count T and total number of reads R. Results based on simulation are shown in Supplemental Figure 4.
Estimating the capture rate based on spike-in ladder
Similar to Kim et al.47, spike-in transcripts can be used to infer the rate at which lysate RNAs are converted to cDNA. The probability of observing a particular spike-in transcript in the sequenced read counts can be used to estimate the capture rate θ. For a given spike-in transcript i with transcript counts s calculated using the above procedure, the probability to observe at least one copy of this transcript is p = 1 – (1 – θ)s. We assume the capture rate, θ, is the same for all spike transcripts and thus can use the following objective function to estimate the capture rate using all spike transcripts:
where is the probability for all transcripts with s copies have non-zero TPM values. In order to robustly estimate θ, we assume a constant capture rate for cells collected in each time point (lung or neuron experiment) or the whole dataset (other experiments) and pool them for estimating θ.
Census
Census aims to convert relative abundances Xij into lysate transcript counts Yij. Without loss of generality, we consider relative abundances is on the TPM scale, and assume that a gene’s TPM value is proportional to the relative frequencies of its mRNA within the total pool of mRNA in a given cell’s lysate, i.e., . The generative model discussed above predicts that when only a minority of the transcripts in a cell is captured in the library, signal from most detectably expressed genes will originate from a single mRNA. Because the number of sequencing reads per transcript is proportionate to molecular frequency after normalizing for length (i.e. TPM or FPKM), all such genes in a given cell should have similar TPM values.
Census works by first identifying the (log-transformed) TPM value in each cell i, written as , that corresponds to genes from which signal originates from a single transcript. Because our generative model predicts that these most detectable genes should fall into this category, we simply estimate as the mode of the log-transformed TPM distribution for cell i. This mode is obtained by log-transforming the TPM values, performing a Gaussian kernel density estimation and then identifying the peak of the distribution. Given the TPM value for a single transcript in cell i, it is straightforward to convert all relative abundances to their lysate transcript counts. We estimate the total number of mRNAs captured for cell i:
where Fx represents the cumulative distribution function for the TPM values for cell i, ε is a TPM value below which no mRNA is believed to be present (by default, ε = 0.1), and ni is the number of genes with TPM values in the interval . That is, we simply calculate the total number of single-mRNA genes and divide this number by the fraction of the library contributed by them to estimate the total number of captured mRNAs in the cell. This number is scaled by to yield an estimate for the number of mRNAs that were in the cell’s lysate, including those that were not actually captured. This scaling step is performed mainly to facilitate comparison with spike-in derived estimates. While we do not know the capture rate θ a priori, it is a highly protocol-dependent quantity that appears to have little dependence on cell type or state. Throughout our analysis, we assume a value of 0.25, which is close to the lung and neuron experiments of Truetlein et al.
With an estimate of the total lysate mRNAs Mi in cell i, we simply rescale its TPM values into mRNA counts for each gene:
Limitations of Census
Census and our generative model of single-cell RNA-Seq assume that TPM is proportional to the true relative abundance in the cell lysis, i.e., . However, non-linear amplification at any stage of the library construction protocol could distort this relationship. We can see this distortion when fitting the linear regression model, log(TPMij) = k ∗ log(Yij) + b, to the spike-in data recovers a value of k that deviates from 1, which indicates that TPMij ∝ (Yij)k. In practice, we find that k ranges from around 0.5 to near 1, depending on the protocol and the laboratory. We have not observed k much larger than 1.
The inability to estimate k without making strong assumptions surrounding the expected number of total RNAs in a given cell means that Census and indeed any measure of relative abundance not normalized by spike-in standards will be limited in its ability to recapitulate the transcript counts derive from spike-based conversion. We argue here that this limitation is not onerous in differential analysis because its impact on fold changes between cells is small.
Testing for branch-dependent expression
Monocle assigns each cell a pseudotime value and a “State” encoding the segment of the trajectory it resides upon based on the PQ-tree algorithm (see the supplemental material for Trapnell and Cacchiarelli et al for further information18). Transcript counts values were variance-stabilized49 via the technique described by Anders and Huber prior to tree construction.
In Monocle 2, we extended the capability to test for branch-dependent gene expression by formulating the problem as a contrast between two negative binomial GLMs.
The null model
for the test assumes the gene being tested is not a branch specific gene, whereas the alternative model:
assumes that the gene is a branch specific gene where : represents an interaction term between branch and transformed pseudotime, NB means negative binomial distribution. Each model includes a natural spline (here with three degrees of freedom) describing smooth changes in mean expression as a function of pseudotime. The null model fits only a single curve, whereas the alternative will fit a distinct curve for each branch. Our current implementation of Monocle 2 relies on VGAM’s “smart” spline fitting functionality, hence the use of the sm.ns() function instead of the more widely used ns() function from the splines package in R50. Likelihood ratio testing was performed with the VGAM lrtest() function, similar to Monocle’s other differential expression tests50. A significant branch-dependent genes means that the gene has distinct expression dynamics along each branch, with smoothed curves that have different shapes.
To fit the full model, each cell must be assigned to the appropriate branch, which is coded through the factor “Branch” in the above model formula. Monocle’s function for testing branch dependence accepts an argument specifying which branches are to be compared. These arguments are specified using the ‘State’ attribute assigned by Monocle during trajectory reconstructions. For example, in our analysis of the Truetlein et al data 25, Monocle reconstructed a trajectory with two branches (LAT1, LAT2 for AT1 and AT2 lineages, respectively), and three states (SBP, SAT1, SAT2 for progenitor, AT1, or AT2 cells). The user specifies that he or she wants to compare LAT1 and LAT2 by providing SAT1 and SAT2 as arguments to the function. Monocle then assigns all the cells with state SAT1 to branch LAT1 and similarly for the AT2 cells. However, the cells with SBP must be members of both branches, because they are on the path from each branch back to the root of the tree. In order to ensure the independence of data points required for the LRT as well as the robustness and stability of our algorithm, we implemented a strategy to partition the progenitor cells into two groups, with each branch receiving a group. The groups are computed by simply ranking the progenitor cells by pseudotime and assigning the odd-numbered cells to one group and the even numbered cells to the other. We assign the first progenitor to both branches to ensure they start at the same time which is required for downstream spline fitting and clustering. The branch plots in Figure 3d visualize the branch specific spline curves fit by this method.
Branch time point detection
The branching time point for each gene can be quantified by fitting a separate spline curves for each branch from all the progenitor to each cell fate. To robustly detect the pseudotime point when a gene i with a branching expression pattern starts to diverge between two cell fates L1, L2, we developed the branch time point detection algorithm. The algorithm starts from the end of stretched pseudotime (pseudotime t = 100, see below) to calculate the divergence (Di (t = 100) = xL1(t = 100) – xL2 (t = 100)) of gene i’s expression (xL1(t = 100), xL2(t = 100)) between two cell fates, L1 L2, (for a branching gene, the divergence at this moment should be large if not the largest across pseudotime). It then moves backwards to find the latest intersection point between two fitted spline curves, which corresponds to the time when the gene starts to diverge between two branches. To add further flexibility, the algorithm moves forward to find the time point when the gene expression diverges up to a user controllable threshold (ε), or Di (t) ≥ ε(t), and defines this time point as the branch time point, , for that particular gene i.
Analysis of human skeletal muscle myoblasts
We used the HSMM data from our previous publication18 to benchmark the performance of developmental tree reconstruction and pseudotime DEG test between relative abundance or census counts. Relative abundances are converted into transcript counts using Census with default parameters with parameter t* estimated from the relative abundance data for each cell. Potential contaminating fibroblast cells with transcript counts of Mef2c less than 5 and Myf5 less than 1 were removed which yields 142 cells for downstream analysis.
The union of genes which are differentially expressed between the four time points in relative abundance or recovered transcript counts scale are used to reduce dimension and order the cells. Transcript counts were variance stabilized. The ordering of developmental trajectories between these two approaches is compared using spearman correlation. Pseudotime tests are performed on both the relative abundance and transcript counts scale where the pseudotime dependent genes are collected as those with q values less than 0.05 (Benjamini-Hochberg correction). The benchmark set is obtained from the permutation test based on a modified algorithm from the glm.perm package as previously described (see section Benchmarking differential expression analysis in supplementary notes).
Differential splicing analysis was conducted by first converting isoform-level TPM values from Cufflinks to transcript counts using Census with default parameters. Each gene’s isoform-level transcript counts Z1, … ,Zk were then modeled using a generalized linear model with a Dirichlet-multinomial response using the VGAM package (version 1.0-1). The Dirichlet-multinomial distribution is a compound distribution, where the probabilities that parameterize a multinomial are themselves drawn from a Dirichlet distribution with an additional over dispersion parameter ϕ. That is, the Dirichlet encodes the frequencies of the isoforms π and the variation in this frequency vector, while the multinomial captures the sampling of actual transcripts according to these frequencies. The Dirichlet has proven effective in previous analyses of splicing changes in bulk RNA-Seq studies 51.
To test for pseudotime-dependent shifts in the frequencies of the isoforms produced by each gene, we fit the following model to the observed isoform-level Census RNA counts:
Only isoforms with at least one copy detected in at least 15 cells were included in the model for each gene, in order to ensure numerical stability within VGAM. We then compared this full model to the null
by likelihood ratio test. Note that each gene’s ϕ was estimated by maximum likelihood separately, as we did not wish to assume that these dispersion parameters are a smooth function of expression level, as is commonly done in RNA-Seq.
Analysis of pre-implantation embryos
Allele-specific relative gene expression values (transcripts per million) were estimated by applying Kallisto46 to the raw reads of Deng et al.45 using an allele-specific transcriptome index. This index consisted of cDNA sequences from GENCODE vM9, corresponding to the paternal (C57BL/6J) alleles, plus the same sequences with maternal (CAST/EiJ) SNP alleles overlaid (CAST genotypes from Keane et al. 52; only homozygous variants relative to the C57BL/6J reference were used).
The TPM values for the two alleles for each gene were converted to allelic RNA counts using Census with default parameters. The number of RNA molecules from each allele of each gene were modeled using a quasibinomial GLM. The quasibinomial is a binomial that allows for over (or under) dispersion with respect to the binomial through a parameter ϕ. Its probability mass function is:
where p encodes the probability that an RNA originated from the maternal allele (without loss of generality).
Quasibinomial GLMs were fit to each gene using VGAM, using the option “dispersion=0” to direct VGAM to estimate the dispersion parameter for each model from each gene’s maternal and paternal RNA counts Zm and Zp, respectively. To test for embryo stage-dependent allelic balance shifts in each gene, we fit a full model
And a null
to these data, and compared them using an F-test 29. As for isoform-level modeling, the dispersion parameter was fit separately for each gene. We note that the quasibinomial is similar to the beta-binomial, the two category case of the Dirichlet-Multinomial. We explored the use of the beta-binomial for this analysis, and while we reached qualitatively similar conclusions regarding escape from X inactivation and monoallelic expression, we felt that the quasibinomial provided a better fit for the data.
Analysis of X chromosome inactivation was performed on female embryos at the 4-, 16- and early blastocyst stages. Embryos were sexed by hierarchically clustering cells on the basis of variance stabilized transcript counts for genes on the Y chromosome. Cells fell into two clearly defined clusters, only one of which expressed “informative” Y genes. Embryos comprised of these cells were annotated as male.
To quantify the number of genes escaping X inactivation at each stage, we used the quasibinomial GLMs to assess the probability that less than 10% of the RNA from a gene originated from the inactive chromosome. (10% is a widely accepted threshold for escape from X inactivation53,54). To do so, we constructed a 95% prediction interval on the allelic ratio for each gene by simulating random variates from its GLM via the VGAM package’s simulate.vlm(). That is, we calculated the number of simulated observations that were less than 10% percent maternal or paternal. Using this statistic, we calculated a significance score for contribution from the maternal and paternal alleles for each gene on the X chromosome, corrected these for multiple testing (via Benjamini-Hochberg), and reported the number of genes with significant maternal and paternal contributions.
We used a similar simulation-based procedure to construct prediction intervals for expected monoallelic expression. After fitting a quasibinomial GLM for each (autosomal) gene’s allele RNA counts, we simulated 100 random variates from each gene’s model and counted the number of times the model reported RNAs from only one of the two alleles. We then collected these counts into quantiles based on the gene’s expression level to generate 95% prediction intervals for monoallelic expression as a function of expression level. The exact same fitting, simulation, and prediction interval estimation procedure was used for both RNA counts and estimated allelic read counts from Kallisto.
Supplementary Material
1
Supplementary table 1:
Sheet 1. Gene ontology (biological process) gene sets significantly enriched in each cluster of branch-dependent gene for lung epithelial analysis (qval < 0.1, Benjamini-Hochberg correction; see Methods)
Sheet 2. Gene ontology (biological process) gene sets significantly enriched in each cluster of branch-dependent genes for the Stat1-/- and Ifnar1-/- analysis (qval < 0.1, Benjamini-Hochberg correction)
Supplementary table 2:
Variables used in Census, BEAM, isoform and allele switch analysis. (a) A table describing all variables used in the Census algorithm. (b) The same for BEAM. (c) The same for isoform switch analysis. (d) The same for allele switch analysis.
Supplementary file 1:
Text file storing the result (p-value) from the permutation test used in benchmarking differential gene expression based on spike-in transcript counts. Each row corresponds to a gene.
Supplementary software 1:
A tarball includes a version of monocle 2 (version: 1.99), a helper package as well as all analysis code which can reproduce all figures in this study.
2
Supplementary figure 1. Assessment of spike-in free relative transcript count recovery in 7 additional experiments (a) Fitting the probability of observing a spike-in transcript at a given copy number based on a binomial model for the data from Grun et al 4. Each point indicates a different spike-in transcript. The Y axis shows the fraction of cells in the experiment in which the spike was detected (non-zero relative expression). The curve shows the probability of observation, learned as described in the methods. (b) Distribution of the corresponding transcript counts in cell lysate as well as cDNA space for the most frequent log10(TPM). (c) MA plot for expressed genes based on contrasts of cells from 2i and serum medium. (d) Fold changes for expressed genes based on contrasts of cells from 2i and serum medium. (e, f) The same plots as in (a, b) for ESC dataset 10. (g, h) the same plots as in (c, d) for the contrast of cells from embryonic day 5 and day 6 in ESC dataset. (i) Estimated capture rate of the neuron dataset by different days 15. (j) the same plot as in b for the neuron dataset. (k, l) The same plots as in (c, d) for the contrast of cells from day 22 and day 0 in the neuron dataset. (m, n) The same plots as in (a, b) for pancreas dataset 11. (o, p) the same plots as in (c, d) for the contrast of cells from the annotated alpha cells and beta cells in pancreas dataset. (q, r) The same plots as in (a, b) for cortex dataset 12. (s, t) the same plots as in (c, d) for the contrast of cells from the annotated ependymal astrocyte cells and mural endothelial cells in pancreas dataset. (u, v) The same plots as in (a, b) for marker-free dataset 13. (w, x) the same plots as in (c, d) for the contrast of cells from the annotated CD11c+ cells and CD11c+ cells after two hours LPS stimulation in marker-free dataset. (y, z) The same plots as in (a, b) for quantitative assessment dataset 14. (aa, ab) the same plots as in (c, d) for the contrast of cells from a random sample set A cells and a different random sample set B in quantitative assessment dataset. In (b, f, j, n, r, v, z) the cDNA counts were calculated by multiplying the modes of the transcript count in cell lysate by the estimated capture rate.
Supplementary figure 2. Library sequencing depths for each group of co-captured cells. (a) Raw read counts per cell for the lung epithelial data. (b) Number of detected ERCC spike-in transcripts per cell for lung epithelial data. (c) Fitting the probability of observing a spike-in transcript at a given copy number based on a binomial model. Different spike-in transcripts across all cells in each time point with the same copy number is pooled. (d) Capture rate estimated for each time point. Different spike-in transcripts across all cells in each time point with the same copy number is pooled. (e) Raw read counts for the dendritic cell data from Shalek et al. used in this analysis. For details on the binomial model, see method for more details.
Supplementary figure 3. Robustness of Census to sequencing depth. (a) MA plot of expressed genes for the contrast of E14.5 and E18.5 cells under downsampled read depths. (b) Fold-change of expressed genes for the contrast of E14.5 and E18.5 cells under downsampled read depths.
Supplemental figure 4. Census accuracy compared to spike-in based conversion over various experimental designs under the generative model. (a) Transcript counts corresponding to the mode of the relative abundance as a function of the fraction of endogenous mRNA and ladder transcripts successfully converted to cDNA (capture rate). (b-e) Transcript counts corresponding to the mode of the relative abundance as a function of ladder degradation rate, mRNA degradation rate, number of genes expressed in the cell and total mRNAs, respectively. For all panels, blue lines correspond to the cDNA space while the red lines correspond to the cell lysate space.
Supplementary figure 5. Gene expression as measured by read counts and transcript counts. (a) Distribution of expression values on the read count and transcript count scale for three representative cells. (b) Genes passing a chi-squared goodness of fit test for the negative binomial (NB) and its zero-inflated variant (ZINB). (c) The number of genes that can be fitted by the fitdistrplus package without throwing a numerical exception. (d) Differential expression (DE) analysis accuracy from various tools provided with TPM, normalized read counts, and transcript counts estimated with spike-ins, Census, TPM scaled to 100, 000 total transcripts, TPM using negative binomial distribution and TPM scaled to the true total calculated from the spike-in regression. Cells from E14.5 and E18.5 from Treutlein et al. were provided to each tool. A permutation-based test was applied to the spike-in-based expression levels to determine a ground truth set of DE genes. (e) Differential expression (DE) analysis accuracy from various tools provided with TPM, normalized read counts, and transcript counts estimated with spike-ins, Census, TPM scaled to 100, 000 total transcripts, TPM using negative binomial distribution and TPM scaled to the true total calculated from the spike-in regression. Cells from E16.5 and E18.5 from Treutlein et al. were provided to each tool. A permutation-based test was applied to the spike-in-based expression levels to determine a ground truth set of DE genes. (f) Receiver-operating characteristic (ROC) curves showing differential expression (DE) analysis accuracy from various tools provided TPM rounded to the nearest integer number as well as transcript counts generated by multiplying the relative abundances in each cell times 100,000 total transcripts. Similar to Figure 2, cells from E14.5 and E18.5 from Treutlein et al. were provided to each tool. A permutation-based test was applied to the spike-in-based expression levels to determine a ground truth set of DE genes. (g) Consensus in differential analysis results between Monocle, DESeq2, edgeR, and permutation tests using different measures of expression. The total height of each bar reflects the size of the union of DE genes reported by any of the four tests. The smaller bar reports the number of DE genes identified by all tests. (h) ROC curves showing differential expression (DE) analysis accuracy from various tools provided TPM, normalized read counts, and transcript counts estimated with spike-ins or Census, TPM scaled to 100, 000 total transcripts, TPM using negative binomial distribution and TPM scaled to the true total calculated from the spike-in regression. Cells from E16.5 and E18.5 from Treutlein et al. were provided to each tool. A permutation-based test was applied to the spike-in-based expression levels to determine a ground truth set of DE genes. (i) Same as in panel g. All the above analysis is based on lung epithelial dataset.
Supplementary figure 6. Concordance of different approaches for differential expression analysis in single-cell RNA-Seq using bulk RNA-seq as the ground truth. (a) Accuracy for several popular tools for differential expression analysis of cells from Trapnell and Cacchiarelli et al. using a matched bulk RNA-Seq collected in that study as a gold standard. (b) ROC curves and the area under them for each method from the experiment shown in (a). (c) Differential expression (DE) analysis accuracy from various tools provided with TPM, transcript counts estimated with spike-ins, UMI or Census. (d) Receiver-operating characteristic (ROC) curves showing differential expression (DE) analysis accuracy from various tools provided with TPM, transcript counts estimated with spike-ins, UMI or Census. Panel c, d is based on the noise model dataset with UMI counts.
Supplementary figure 7. Single-cell trajectory reconstruction using TPM values and transcript counts. (a) HSMM trajectory reconstructed with TPMs. (b) HSMM trajectory reconstructed with variance-stabilized transcript counts. Note that the pseudotime ordering is not affected by the rotation of tree. (c) Cell-to-cell map between the two orderings. (d) Overlap between genes that are significantly differentially expressed as a function of pseudotime as calculated with TPM and transcript counts. Only the genes for which a permutation-based p-value could be computed with glmperm are shown here. (e) Benchmarking results for Monocle 2 and DESeq for a pseudotime-dependence test relative to a permutation control (see Methods for details). (f) ROC curves and the area under them for each method from the experiment shown in (e). (g) Overlap between significantly enriched GO terms for the pseudotime dependent genes calculated from TPM or transcript counts (h) Relative significance scores for gene sets under a hypergeometric test for enrichment within pseudotime dependent genes. Positive values indicate that a gene set is more significantly enriched when using TPM values to order cells, while negative values indicate a gene set is more significant when using transcript counts. “Muscle related” gene sets are those from REACTOME with “muscle” or “myo” in the name. (e.g. “Muscle development”)
Supplementary figure 8. Monocle’s Single-cell trajectory of developing murine pneumocytes. Cells are colored according to type assignments made by Truetlein et al. Only cells collected at E18.5 were assigned in the original study. Bounding boxes show Monocle’s assignment of cells to types based on the topology of the trajectory.
Supplementary figure 9. Clustering and motif enrichment analysis of branch-dependent genes according to bilineage expression kinetics. Branch-dependent genes identified by BEAM can be further assessed for common functions and potential upstream regulators. The expression for each gene along the two trajectories is used to cluster genes into groups that share branch-dependent expression kinetics. Regulatory elements (e.g. defined by DNaseI-Seq) can be collected for each group and tested for enrichment with specific transcription factor binding site motifs.
Supplementary figure 10. BEAM analysis reveals branch-dependent gene expression during lung epithelial specification. (a) Hierarchical clustering of normalized expression in terms of transcript counts for all the significant branch genes identified by BEAM (FDR < 5%). Each column represents a cell ordered along the trajectory. The center of the heatmap corresponds to the beginning of the trajectory. Moving left proceeds down the AT1 branch, whereas moving right proceeds down the AT2 branch. Each row represents the transcript counts for a gene on each branch. Rows are hierarchically clustered using Pearson’s correlation with Ward’s method. (b) TF motif enrichment within the hypersensitive sites proximal to the branching genes in each of the kinetic clusters. Motifs corresponding to transcription factors that are themselves significantly branch dependent are labeled in red. (c) Hierarchical clustering of normalized expression in terms of Census counts for markers of pneumocyte specification as defined by Treutlein et al. and cell cycle genes. Each column represents a cell ordered along the trajectory. The center of the heatmap corresponds to the beginning of the trajectory. Moving left proceeds down the AT1 branch, whereas moving right proceeds down the AT2 branch. Each row represents the smoothed BEAM expression curve for a gene on each branch. Rows are transformed to Z-scores prior to hierarchically clustering using Pearson’s correlation with Ward’s method. (d) Pseudotime distribution of branch points for markers of early and late pneumocyte specification as defined by Truetlein et al.
Supplementary figure 11. Branching expression kinetics for predicted and known regulators of lung epithelial cell fate and function. (a) Significant branching TFs which also have binding motifs enriched in the upstream of significant branching genes shown in supplementary figure 10a. (b) Known regulators of lung epithelial cell fate specification.
Supplementary figure 12. Robustness of BEAM analysis to number of cells collected. The original set of cells from the KO experiment from Shalek et al. were downsampled randomly at progressively lower proportions of the original dataset. For each subset, cells were re-ordered with Monocle 2, and BEAM tests were then performed. Three different subsets were performed per proportion. (a) Example trajectories for each proportion. (b) Fraction of cells assigned to the same branch as in the full dataset. (c) Pearson correlation of pseudotime for cells included in each sample compared to the relative ordering of these cells in the full dataset. (d) Precision and (e) Recall of genes reported as significantly branch-dependent by BEAM for each sample of the cells, using the branch-dependent genes from the full dataset as the “ground truth”. (f) Spearman rank correlation of p-values returned by BEAM for all genes compared to their p-values when BEAM is given the full dataset. Red (blue) dots in panels d, e, f correspond to cell downsampling with cell ordering fixed using transcript count data (TPM expression data), green dots cell downsampling with cell re-ordering as in panels a, b. For TPM data, the same ordering as for transcript counts is used.
Supplementary figure 13. Significant enriched motifs. Transcription factor binding motifs significantly enriched in regulatory elements upstream of genes in cluster 5 (hypergeometric test, FDR < 10%. See methods).
Supplementary figure 14. Differential splicing during cell differentiation with Census. (a) A Dirichlet-multinomial regression of detected changes in relative transcript counts for the isoforms of 74 genes during skeletal myoblast differentiation. All detectably expressed isoforms of genes with pseudotime-dependent splicing were clustered, allowing isoforms of a gene to fall into different clusters. Each row represents the pseudotime-smoothed expression curve for an isoform. Rows are transformed to Z-scores prior to hierarchically clustering using Pearson’s correlation with Ward’s method. (b) Pseudotime clusters spanned by the isoforms of individual genes. (c) Transcript counts for detectably expressed isoforms of the actin family of proteins. Black lines indicate the expected transcript count from a negative binomial regression of expression as a function of pseudotime.
Supplementary figure 15. Example of allelic specific expression with Census. Allelic transcript counts from Census for genes known to be silenced in late blastocysts or post-implantation embryos (Atrx, Huwe1), or before the 8-cell stage (Kif4, Rlim). Lines indicate the average count the maternal and paternal allele at each stage.
3
4
Acknowledgments
We thank Jianghong Shi, Song Xu for technical discussions and Martin Kircher for cluster computation support. We are grateful to Jay Shendure, Ron Hause, Darren Cusanovich, Bruce Trapnell, Jeff Whitsett and members of the Trapnell laboratory for comments on manuscript. This work was supported by NIH Grant DP2 HD088158. C.T. is partly supported by a Dale. F. Frey Award for Breakthrough Scientists and an Alfred P. Sloan Foundation Research Fellowship. A.H. is supported by an NSF Graduate Research Fellowship.
Footnotes
Author Contributions
X.Q. and C.T. designed Census and the regression methods. X.Q. implemented the methods. X.Q. and A.H performed the analysis. J.P, D.L, and Y.M contributed to technical design. C.T. conceived the project. All authors wrote the manuscript.Competing Financial Interest Statement
The authors declare no relevant financial interests.