A fast and efficient count-based matrix factorization method for detecting cell types from single-cell RNAseq data
Abstract
Background
Single-cell RNA sequencing (scRNAseq) data always involve various unwanted variables, which can mask the true signal needed to identify cell types. A more efficient way of dealing with this issue is to extract low-dimensional information from high-dimensional gene expression data to represent cell-type structure. In the past two years, several powerful matrix factorization tools have been developed for scRNAseq data, such as NMF, ZIFA, pCMF and ZINB-WaVE. However, the existing approaches either cannot directly model the raw counts of scRNAseq data or are very time-consuming when handling a large number of cells (e.g. n>500).
Results
In this paper, we developed a fast and efficient count-based matrix factorization method (single-cell negative binomial matrix factorization, scNBMF) based on the TensorFlow framework to infer the low-dimensional structure of cell types. To assess the method's scalability, we conducted a series of experiments on three public scRNAseq data sets: brain, embryonic stem, and pancreatic islet. The experimental results show that scNBMF is more powerful at detecting cell types and 10 to 100 times faster than the bespoke scRNAseq tools.
Conclusions
In this paper, we proposed a fast and efficient count-based matrix factorization method, scNBMF, which is more powerful for detecting cell types. A series of experiments were performed on three public scRNAseq data sets. The results show that scNBMF is a more powerful tool for large-scale scRNAseq data analysis. scNBMF was implemented in R and Python, and the source code is freely available at https://github.com/sqsun.
https://deepblue.lib.umich.edu/bitstream/2027.42/148526/1/12918_2019_Article_699.pd
Comprehensive mapping of tissue cell architecture via integrated single cell and spatial transcriptomics
elocation-id: 2020.11.15.378125
The spatial organization of cell types in tissues fundamentally shapes cellular interactions and function, but the high-throughput spatial mapping of complex tissues remains a challenge. We present cell2location, a principled and versatile Bayesian model that integrates single-cell and spatial transcriptomics to map cell types in situ in a comprehensive manner. We show that cell2location outperforms existing tools in accuracy and comprehensiveness and we demonstrate its utility by mapping two complex tissues. In the mouse brain, we use a new paired single nucleus and spatial RNA-sequencing dataset to map dozens of cell types and identify tissue regions in an automated manner. We discover novel regional astrocyte subtypes including fine subpopulations in the thalamus and hypothalamus. In the human lymph node, we resolve spatially interlaced immune cell states and identify co-located groups of cells underlying tissue organisation. We spatially map a rare pre-germinal centre B-cell population and predict putative cellular interactions relevant to the interferon response. Collectively our results demonstrate how cell2location can serve as a versatile first-line analysis tool to map tissue architectures in a high-throughput manner.
Competing Interest Statement: The authors have declared no competing interest.
GoM DE: interpreting structure in sequence count data with differential expression analysis allowing for grades of membership
Parts-based representations, such as non-negative matrix factorization and topic modeling, have been used to identify structure in single-cell sequencing data sets, in particular structure that is not as well captured by clustering or other dimensionality reduction methods. However, interpreting the individual parts remains a challenge. To address this challenge, we extend methods for differential expression analysis by allowing cells to have partial membership in multiple groups. We call this grade of membership differential expression (GoM DE). We illustrate the benefits of GoM DE for annotating topics identified in several single-cell RNA-seq and ATAC-seq data sets.
Extracting information from high-throughput gene expression data with pathway analysis and deconvolution
Modern technologies allow for the collection of large biological datasets that can be utilised for diverse health-related applications. However, to extract useful information from such data, computational methods are needed. The field that develops and explores methods to analyse biological data is called bioinformatics. In this thesis I evaluate different bioinformatic methods and introduce novel ones related to processing gene expression data. Gene expression data reflects how active different genes are in a set of measured biological samples. These samples can be for example blood from human individuals, tissue samples from tumours and the corresponding healthy tissue, or brain samples from mice with different neural diseases. This thesis covers two topics, pathway analysis and deconvolution, related to downstream analysis of gene expression data. Notably, this summary does not repeat in detail the same points made in the original publications, but aims to provide a comprehensive overview of the current knowledge of the two wider topics. The original publications focus on comparing and evaluating the available methods as well as presenting new ones that cover some previously untouched features.
While the terms ‘pathway analysis’ and ‘deconvolution’ have been used with alternative definitions in other fields, in the context of this thesis, pathway analysis refers to estimating the activity of pathways, i.e. the interaction networks that the body uses to react to different signals, based on given gene expression data and structural information about the relevant pathways. I focus on different types of analysis methods and their varying goals, requirements, and underlying statistical approaches. In addition, the strengths and weaknesses of the concept of pathway analysis are briefly discussed. The first two original publications, I and II, empirically compare different types of pathway methods and introduce a novel one. In publication I, the tested methods are evaluated from different perspectives, and in publication II, a novel method is introduced and its performance demonstrated against alternative tools.
Many biological samples contain a variety of cell types, and here deconvolution means computationally extracting cell type composition or cell type specific expression from bulk samples. The deconvolution sections of this thesis also focus on a general overview of the topic and the available computational methodology. As deconvolution is challenging, I discuss the factors affecting its accuracy as well as alternative wet lab approaches to obtain cell type specific information. The first original publication about deconvolution (publication III) introduces a novel method and evaluates it against the other available tools. The second (publication IV) focuses on identifying cell type specific differences between sample groups, which is a particularly difficult task.
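To make the idea of deconvolution concrete, the following is a minimal sketch of one common formulation, not the method of the thesis: a bulk expression vector is modelled as a non-negative mixture of reference cell-type signatures, and the mixing weights are estimated by non-negative least squares. The function name `deconvolve`, the toy signature matrix, and the projected-gradient solver are all illustrative assumptions.

```python
import numpy as np

def deconvolve(bulk, signatures, n_iter=5000):
    """Estimate cell-type proportions by non-negative least squares.

    bulk:       (genes,) bulk expression vector
    signatures: (genes, cell_types) reference expression per cell type
                (assumed to be known in advance)
    """
    S = np.asarray(signatures, dtype=float)
    b = np.asarray(bulk, dtype=float)
    # Step size 1/L, where L is the Lipschitz constant of the gradient
    # of 0.5 * ||S w - b||^2 (the squared spectral norm of S).
    step = 1.0 / np.linalg.norm(S, 2) ** 2
    w = np.zeros(S.shape[1])
    for _ in range(n_iter):
        grad = S.T @ (S @ w - b)
        w = np.maximum(w - step * grad, 0.0)  # project onto w >= 0
    total = w.sum()
    # Rescale the weights so that they can be read as proportions.
    return w / total if total > 0 else w

# Toy data (hypothetical): five genes, three cell types.
signatures = np.array([
    [10.0, 1.0, 0.0],
    [0.0, 8.0, 1.0],
    [1.0, 0.0, 9.0],
    [5.0, 5.0, 5.0],
    [2.0, 0.0, 3.0],
])
true_props = np.array([0.6, 0.3, 0.1])
bulk = signatures @ true_props  # noise-free simulated bulk sample
estimated = deconvolve(bulk, signatures)
```

In this noise-free toy case the estimate should match the simulated proportions closely; real bulk data add noise, platform effects, and imperfect signatures, which is part of why the task is described as challenging.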
McImpute: Matrix Completion Based Imputation for Single Cell RNA-seq Data
Motivation: Single-cell RNA sequencing has proved to be revolutionary for its potential to zoom into complex biological systems. Genome-wide expression analysis at single-cell resolution provides a window into the dynamics of cellular phenotypes. This facilitates the characterization of transcriptional heterogeneity in normal and diseased tissues under various conditions. It also sheds light on the development or emergence of specific cell populations and phenotypes. However, owing to the paucity of input RNA, a typical single-cell RNA sequencing dataset features a high number of dropout events, where transcripts fail to get amplified.
Results: We introduce mcImpute, a low-rank matrix completion based technique to impute dropouts in single-cell expression data. On a number of real datasets, application of mcImpute yields significant improvements in the separation of true zeros from dropouts, cell clustering, differential expression analysis, cell type separability, the performance of dimensionality reduction techniques for cell visualization, and gene distribution.
Availability and Implementation: https://github.com/aanchalMongia/McImpute_scRNAse
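As a rough illustration of what low-rank matrix completion means, the sketch below implements a generic soft-thresholded SVD iteration (often called soft-impute). It is not the mcImpute algorithm or its API; the function name `soft_impute`, the parameters `lam` and `n_iter`, and the toy rank-1 matrix are assumptions made for the example. The mask of observed entries is supplied directly here, whereas in expression data deciding which zeros are dropouts is itself part of the problem.

```python
import numpy as np

def soft_impute(X, observed, lam=1.0, n_iter=200):
    """Fill unobserved entries of X using a low-rank approximation.

    X:        2-D array; values at unobserved positions are ignored
    observed: boolean mask of the same shape (True = trusted entry)
    lam:      amount subtracted from every singular value per iteration
    """
    X = np.asarray(X, dtype=float)
    Z = np.where(observed, X, 0.0)  # start with zeros in the gaps
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)      # soft-threshold singular values
        low_rank = (U * s) @ Vt           # shrunken low-rank reconstruction
        Z = np.where(observed, X, low_rank)  # keep observed entries fixed
    return Z

# Toy data (hypothetical): a rank-1 matrix with three entries hidden.
truth = np.outer(np.arange(1.0, 7.0), np.arange(1.0, 6.0))
observed = np.ones(truth.shape, dtype=bool)
for i, j in [(0, 0), (2, 3), (5, 1)]:
    observed[i, j] = False
data = np.where(observed, truth, 0.0)
imputed = soft_impute(data, observed)

# Compare the error on the hidden entries before and after imputation.
err_zero = np.linalg.norm((data - truth)[~observed])
err_imputed = np.linalg.norm((imputed - truth)[~observed])
```

Because soft-thresholding shrinks the singular values, the filled-in entries are slightly biased toward zero; a smaller `lam` reduces that bias at the cost of slower convergence.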
Computational solutions for spatial transcriptomics
Transcriptome-level expression data connected to the spatial organization of cells and molecules would allow a comprehensive understanding of how gene expression is connected to structure and function in biological systems. Spatial transcriptomics (ST) platforms may soon provide such information. However, the current platforms still lack spatial resolution, capture only a fraction of the transcriptome heterogeneity, or lack the throughput for large-scale studies. The strengths and weaknesses of current ST platforms and computational solutions need to be taken into account when planning spatial transcriptomics studies. Computational ST analysis builds on the solutions developed for single-cell RNA-sequencing data, with advancements that take into account the spatial connectedness of the transcriptomes. The scRNA-seq tools are modified for spatial transcriptomics, or new solutions, such as deep learning-based joint analysis of expression, spatial, and image data, are developed to extract biological information from the spatially resolved transcriptomes. Computational ST analysis can reveal remarkable biological insights into spatial patterns of gene expression, cell signaling, and cell type variations in connection with cell type-specific signaling and organization in complex tissues. This review covers the topics that help in choosing the platform and computational solutions for spatial transcriptomics research. We focus on the currently available ST methods and platforms and their strengths and limitations. Of the computational solutions, we provide an overview of the analysis steps and tools used in ST data analysis. The compatibility with the data types and the tools provided by the current ST analysis frameworks are summarized.
Integrating barcoded neuroanatomy with spatial transcriptional profiling reveals cadherin correlates of projections shared across the cortex
Functional circuits consist of neurons with diverse axonal projections and gene expression. Understanding the molecular signature of projections requires high-throughput interrogation of both gene expression and projections to multiple targets in the same cells at cellular resolution, which is difficult to achieve using current technology. Here, we introduce BARseq2, a technique that simultaneously maps projections and detects multiplexed gene expression by in situ sequencing. We determined the expression of cadherins and cell-type markers in 29,933 cells, and the projections of 3,164 cells in both the mouse motor cortex and auditory cortex. Associating gene expression and projections in 1,349 neurons revealed shared cadherin signatures of homologous projections across the two cortical areas. These cadherins were enriched across multiple branches of the transcriptomic taxonomy. By correlating multi-gene expression and projections to many targets in single neurons with high throughput, BARseq2 provides a path to uncovering the molecular logic underlying neuronal circuits.
Pan-cancer analysis of post-translational modifications reveals shared patterns of protein regulation
Post-translational modifications (PTMs) play key roles in regulating cell signaling and physiology in both normal and cancer cells. Advances in mass spectrometry enable high-throughput, accurate, and sensitive measurement of PTM levels to better understand their role, prevalence, and crosstalk. Here, we analyze the largest collection of proteogenomics data from 1,110 patients with PTM profiles across 11 cancer types (10 from the National Cancer Institute's Clinical Proteomic Tumor Analysis Consortium [CPTAC]). Our study reveals pan-cancer patterns of changes in protein acetylation and phosphorylation involved in hallmark cancer processes. These patterns revealed subsets of tumors, from different cancer types, including those with dysregulated DNA repair driven by phosphorylation, altered metabolic regulation associated with immune response driven by acetylation, affected kinase specificity by crosstalk between acetylation and phosphorylation, and modified histone regulation. Overall, this resource highlights the rich biology governed by PTMs and exposes potential new therapeutic avenues.
Deep Learning in Single-Cell Analysis
Single-cell technologies are revolutionizing the entire field of biology. The
large volumes of data generated by single-cell technologies are
high-dimensional, sparse, heterogeneous, and have complicated dependency
structures, making analyses using conventional machine learning approaches
challenging and impractical. In tackling these challenges, deep learning often
demonstrates superior performance compared to traditional machine learning
methods. In this work, we give a comprehensive survey on deep learning in
single-cell analysis. We first introduce background on single-cell technologies
and their development, as well as fundamental concepts of deep learning
including the most popular deep architectures. We present an overview of the
single-cell analytic pipeline pursued in research applications while noting
divergences due to data sources or specific applications. We then review seven
popular tasks spanning through different stages of the single-cell analysis
pipeline, including multimodal integration, imputation, clustering, spatial
domain identification, cell-type deconvolution, cell segmentation, and
cell-type annotation. Under each task, we describe the most recent developments
in classical and deep learning methods and discuss their advantages and
disadvantages. Deep learning tools and benchmark datasets are also summarized
for each task. Finally, we discuss the future directions and the most recent
challenges. This survey will serve as a reference for biologists and computer
scientists, encouraging collaborations.
Comment: 77 pages, 11 figures, 15 tables, deep learning, single-cell analysis
Generative Models of Biological Variations in Bulk and Single-cell RNA-seq
The explosive growth of next-generation sequencing data enhances our ability to understand biological processes at an unprecedented resolution. Meanwhile, organizing and utilizing this tremendous amount of data has become a big challenge. High-throughput technology provides us with a snapshot of all underlying biological activities, but this kind of extremely high-dimensional data is hard to interpret. Due to the curse of dimensionality, the measurements are sparse and far from sufficient to shape the actual manifold in the high-dimensional space. On the other hand, the measurements may contain structured noise, such as technical or nuisance biological variation, which can interfere with downstream interpretation. Generative modeling is a powerful tool to make sense of the data and generate compact representations summarizing the embedded biological information. This thesis introduces three generative models that help amplify biological signals buried in noisy bulk and single-cell RNA-seq data.
In Chapter 2, we propose a semi-supervised deconvolution framework called PLIER which can identify regulations in cell-type proportions and specific pathways that control gene expression. PLIER has inspired the development of MultiPLIER and has been used to infer context-specific genotype effects in the brain.
In Chapter 3, we construct a supervised transformation named DataRemix to normalize bulk gene expression profiles in order to maximize the biological findings with respect to a variety of downstream tasks. By reweighing the contribution of hidden factors, we are able to reveal the hidden biological signals without any external dataset-specific knowledge. We apply DataRemix to the ROSMAP dataset and report the first replicable trans-eQTL effect in human brain.
In Chapter 4, we focus on scRNA-seq and introduce NIFA which is an unsupervised decomposition framework that combines the desired properties of PCA, ICA and NMF. It simultaneously models uni- and multi-modal factors isolating discrete cell-type identity and continuous pathway-level variations into separate components.
The work presented in Chapter 2 has been published as a journal article. The work in Chapters 3 and 4 is under submission and is available as preprints on bioRxiv.