Droplet scRNA-seq is not zero-inflated
Potential users of single-cell RNA-sequencing (scRNA-seq) often encounter a choice between high-throughput droplet-based methods and high-sensitivity plate-based methods. There is a widespread belief that scRNA-seq will often fail to generate measurements for some genes from some cells owing to technical molecular inefficiencies. It is believed that this causes data to have an overabundance of zero values compared to what is expected from random sampling and that this effect is particularly pronounced in droplet-based methods. Here I present an investigation of published data for technical controls in droplet-based scRNA-seq experiments that demonstrates that the number of zero values in the data is consistent with common distributional models of molecule sampling counts. Thus, any additional zero values in biological data likely result from biological variation or may reflect variation in gene abundance among cell types or cell states.
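As a rough illustration of what "consistent with common distributional models" can mean in practice, the sketch below compares an observed zero fraction with the zero probability expected under a Poisson or negative binomial count model. The function names, the dispersion parameterisation (variance = mean + dispersion × mean²), and the toy counts are assumptions made for illustration; they are not taken from the paper or its data.

```python
import math


def poisson_zero_fraction(mean):
    """P(count == 0) for a Poisson count with the given mean."""
    return math.exp(-mean)


def negbinom_zero_fraction(mean, dispersion):
    """P(count == 0) for a negative binomial count.

    Assumes the parameterisation variance = mean + dispersion * mean**2,
    which reduces to the Poisson case as dispersion goes to zero.
    """
    if dispersion == 0:
        return poisson_zero_fraction(mean)
    return (1.0 + dispersion * mean) ** (-1.0 / dispersion)


def observed_zero_fraction(counts):
    """Fraction of entries in `counts` that are exactly zero."""
    return sum(1 for c in counts if c == 0) / len(counts)


# Hypothetical toy counts for one gene across ten cells (not real data).
toy_counts = [0, 0, 1, 0, 2, 0, 0, 1, 0, 3]
toy_mean = sum(toy_counts) / len(toy_counts)

print("observed zeros:", observed_zero_fraction(toy_counts))
print("Poisson expectation:", poisson_zero_fraction(toy_mean))
print("NB expectation (dispersion=1):", negbinom_zero_fraction(toy_mean, 1.0))
```

If the observed zero fraction of a gene sits close to the model curve at its mean expression, no extra zero-inflation term is needed to explain that gene's zeros.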
RNA velocity and protein acceleration from single-cell multiomics experiments
The simultaneous quantification of protein and RNA makes possible the inference of past, present, and future cell states from single experimental snapshots. To enable such temporal analysis from multimodal single-cell experiments, we introduce an extension of the RNA velocity method that leverages estimates of unprocessed transcript and protein abundances to extrapolate cell states. We apply the model to six datasets and demonstrate consistency among cell landscapes and phase portraits. The analysis software is available as the protaccel Python package.
Probabilistic modelling of cellular development from single-cell gene expression
The recent technology of single-cell RNA sequencing can be used to investigate molecular, transcriptional changes in cells as they develop. I reviewed the literature on the technology and made a large-scale quantitative comparison of the different implementations of single-cell RNA sequencing to identify their technical limitations.
I investigate how to model transcriptional changes during cellular development. The general forms of expression changes with respect to development lead to nonparametric regression models in the form of Gaussian processes. I used Gaussian process models to investigate expression patterns in early embryonic development, and compared the development of mice and humans.
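To make the idea of nonparametric regression with Gaussian processes concrete, here is a minimal sketch of a Gaussian process posterior mean for one gene's expression over developmental time. It is a generic textbook construction, not the thesis's own model: the squared-exponential kernel, the noise level, the helper names, and the toy data are all assumptions chosen for illustration.

```python
import math


def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between two time points."""
    return variance * math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def solve_cholesky(lower, rhs):
    """Solve (L L^T) x = rhs by forward then backward substitution."""
    n = len(rhs)
    z = [0.0] * n
    for i in range(n):
        z[i] = (rhs[i] - sum(lower[i][k] * z[k] for k in range(i))) / lower[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(lower[k][i] * x[k] for k in range(i + 1, n))) / lower[i][i]
    return x


def gp_posterior_mean(train_t, train_y, test_t, noise=0.1, lengthscale=1.0):
    """Posterior mean of a zero-mean GP with Gaussian observation noise."""
    n = len(train_t)
    gram = [
        [rbf_kernel(train_t[i], train_t[j], lengthscale) + (noise ** 2 if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    alpha = solve_cholesky(cholesky(gram), train_y)
    return [
        sum(rbf_kernel(t, train_t[i], lengthscale) * alpha[i] for i in range(n))
        for t in test_t
    ]


# Hypothetical toy data: expression of one gene at five time points.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
expression = [0.1, 0.8, 1.0, 0.3, -0.2]
print(gp_posterior_mean(times, expression, [0.5, 1.5, 2.5]))
```

The smooth curve returned for unseen time points is what allows expression trends to be described without committing to a fixed parametric shape.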
When using in vivo systems, ground truth time for each cell cannot be known. Only a snapshot of cells, all at different stages of development, can be obtained. In an experiment measuring the transcriptome of zebrafish blood precursor cells developing from hematopoietic stem cells into thrombocytes, I used a Gaussian process latent variable model (GPLVM) to align the cells according to the developmental trajectory. This way I could investigate which genes were driving the development, and characterise the different patterns of expression.
With the latent variable strategy in mind, I designed an experiment to study a rare event of murine embryonic stem cells entering a state similar to very early embryos. The GPLVM can take advantage of the nonlinear expression patterns involved in this process. The results showed multiple activation events of genes as cells progress towards the rare state.
An essential feature of cellular biology is that precursor cells can give rise to multiple types of progenitor cells through differentiation. In the immune system, naive T-helper cells differentiate into different sub-types depending on the infection. In an experiment where mice were infected with malaria, the T-helper cells develop into two cell types, Th1 and Tfh. I model this branching development using an Overlapping Mixture of Gaussian Processes, which lets me identify which cells belong to which branch and learn which genes are involved in the different branches.
Researchers have now started performing high-throughput experiments where the spatial context of gene expression is recorded. Similar to how I identify temporal expression patterns, spatial expression patterns can be identified nonparametrically. To enable researchers to make use of this technique, I developed a very fast method to perform a statistical test for spatial dependence, and illustrated the results on multiple data sets.
EMBL International Phd Progra
SpatialDE: identification of spatially variable genes.
Technological advances have made it possible to measure spatially resolved gene expression at high throughput. However, methods to analyze these data are not established. Here we describe SpatialDE, a statistical test to identify genes with spatial patterns of expression variation from multiplexed imaging or spatial RNA-sequencing data. SpatialDE also implements 'automatic expression histology', a spatial gene-clustering approach that enables expression-based tissue histology.
Quantifying the tradeoff between sequencing depth and cell number in single-cell RNA-seq
The allocation of a sequencing budget when designing single-cell RNA-seq experiments requires consideration of the tradeoff between the number of cells sequenced and the read depth per cell. One approach to the problem is to perform a power analysis for a univariate objective such as differential expression. However, many of the goals of single-cell analysis require consideration of the multivariate structure of gene expression, such as clustering. We introduce an approach to quantifying the impact of sequencing depth and cell number on the estimation of a multivariate generative model for gene expression that is based on error analysis in the framework of a variational autoencoder. We find that at shallow depths, the marginal benefit of deeper sequencing per cell significantly outweighs the benefit of increased cell numbers. Above about 15,000 reads per cell the benefit of increased sequencing depth is minor. Code for the workflow reproducing the results of the paper is available at https://github.com/pachterlab/SBP_2019/
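The tradeoff described here rests on simple arithmetic: for a fixed budget of reads, every increase in depth per cell reduces the number of cells that can be sequenced. The sketch below only tabulates that constraint and does not reproduce the paper's variational autoencoder error analysis. The total budget of 300 million reads, the candidate depths, and the function name are hypothetical values chosen for illustration; only the 15,000 reads per cell figure comes from the abstract.

```python
def cells_for_depth(total_reads, reads_per_cell):
    """Number of cells affordable at a given depth under a fixed read budget."""
    if reads_per_cell <= 0:
        raise ValueError("reads_per_cell must be positive")
    return total_reads // reads_per_cell


# Hypothetical sequencing budget (an assumption, not a value from the paper).
TOTAL_READS = 300_000_000

# Candidate depths; 15,000 is the approximate threshold quoted in the abstract.
candidate_depths = [1_000, 5_000, 15_000, 50_000]

allocations = {depth: cells_for_depth(TOTAL_READS, depth) for depth in candidate_depths}
for depth, cells in allocations.items():
    print(f"{depth:>6} reads/cell -> {cells:>7} cells")
```

Read alongside the abstract's finding, a table like this suggests a starting point for design: spend reads on depth up to roughly 15,000 per cell, then put the remainder into additional cells.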
Interpretable factor models of single-cell RNA-seq via variational autoencoders
Motivation: Single-cell RNA-seq makes possible the investigation of variability in gene expression among cells, and dependence of variation on cell type. Statistical inference methods for such analyses must be scalable, and ideally interpretable.
Results: We present an approach based on a modification of a recently published highly scalable variational autoencoder framework that provides interpretability without sacrificing much accuracy. We demonstrate that our approach enables identification of gene programs in massive datasets. Our strategy, namely the learning of factor models with the auto-encoding variational Bayes framework, is not domain specific and may be useful for other applications.
Availability and implementation: The factor model is available in the scVI package hosted at https://github.com/YosefLab/scVI/
A curated database reveals trends in single cell transcriptomics
The more than 1000 single-cell transcriptomics studies that have been published to date constitute a valuable and vast resource for biological discovery. While various ‘atlas’ projects have collated some of the associated datasets, most questions related to specific tissue types, species or other attributes of studies require identifying papers through challenging manual literature searches. To facilitate discovery with published single-cell transcriptomics data, we have assembled a near-exhaustive, manually curated database of single-cell transcriptomics studies with key information: descriptions of the type of data and technologies used, along with descriptors of the biological systems studied. Additionally, the database contains summarized information about analysis in the papers, allowing for analysis of trends in the field. As an example, we show that the number of cell types identified in scRNA-seq studies is proportional to the number of cells analysed.