Deep Learning in Single-Cell Analysis
Single-cell technologies are revolutionizing the entire field of biology. The
large volumes of data generated by single-cell technologies are
high-dimensional, sparse, heterogeneous, and have complicated dependency
structures, making analyses using conventional machine learning approaches
challenging and impractical. In tackling these challenges, deep learning often
demonstrates superior performance compared to traditional machine learning
methods. In this work, we give a comprehensive survey on deep learning in
single-cell analysis. We first introduce background on single-cell technologies
and their development, as well as fundamental concepts of deep learning
including the most popular deep architectures. We present an overview of the
single-cell analytic pipeline pursued in research applications while noting
divergences due to data sources or specific applications. We then review seven
popular tasks spanning different stages of the single-cell analysis
pipeline, including multimodal integration, imputation, clustering, spatial
domain identification, cell-type deconvolution, cell segmentation, and
cell-type annotation. Under each task, we describe the most recent developments
in classical and deep learning methods and discuss their advantages and
disadvantages. Deep learning tools and benchmark datasets are also summarized
for each task. Finally, we discuss the future directions and the most recent
challenges. This survey will serve as a reference for biologists and computer
scientists, encouraging collaborations. Comment: 77 pages, 11 figures, 15 tables, deep learning, single-cell analysi
Data-driven comparison of multiple high-dimensional single-cell expression profiles
Comparing multiple single-cell expression datasets such as cytometry and scRNA-seq data between case and control donors provides information to elucidate the mechanisms of disease. We propose a completely data-driven computational biological method for this task. This overcomes the challenges of conventional cellular subset-based comparisons and facilitates further analyses such as machine learning and gene set analysis of single-cell expression datasets
An Elastic-net Logistic Regression Approach to Generate Classifiers and Gene Signatures for Types of Immune Cells and T Helper Cell Subsets
Background: Host immune response is coordinated by a variety of different specialized cell types that vary in time and location. While host immune response can be studied using conventional low-dimensional approaches, advances in transcriptomics analysis may provide a less biased view. Yet, leveraging transcriptomics data to identify immune cell subtypes presents challenges for extracting informative gene signatures hidden within a high dimensional transcriptomics space characterized by low sample numbers with noisy and missing values. To address these challenges, we explore using machine learning methods to select gene subsets and estimate gene coefficients simultaneously. Results: Elastic-net logistic regression, a type of machine learning, was used to construct separate classifiers for ten different types of immune cell and for five T helper cell subsets. The resulting classifiers were then used to develop gene signatures that best discriminate among immune cell types and T helper cell subsets using RNA-seq datasets. We validated the approach using single-cell RNA-seq (scRNA-seq) datasets, which gave consistent results. In addition, we classified cell types that were previously unannotated. Finally, we benchmarked the proposed gene signatures against other existing gene signatures. Conclusions: Developed classifiers can be used as priors in predicting the extent and functional orientation of the host immune response in diseases, such as cancer, where transcriptomic profiling of bulk tissue samples and single cells is routinely employed, information that can provide insight into the mechanistic basis of disease and therapeutic response. The so
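The idea of selecting a gene subset and estimating gene coefficients in one step can be illustrated with a minimal sketch. This is not the authors' implementation: it is a plain NumPy proximal-gradient fit of an elastic-net-penalized logistic regression on synthetic data, and the function names, the penalty settings (`alpha`, `l1_ratio`), and the toy "expression" matrix are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink toward zero, clip at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def fit_elastic_net_logistic(X, y, alpha=0.1, l1_ratio=0.5, lr=0.5, n_iter=1000):
    """Minimize mean logistic loss + alpha * (l1_ratio * |w|_1
    + 0.5 * (1 - l1_ratio) * |w|_2^2) by proximal gradient descent."""
    n_samples, n_genes = X.shape
    w = np.zeros(n_genes)
    b = 0.0
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        residual = prob - y
        # Gradient of the smooth part (logistic loss plus the L2 term).
        grad_w = X.T @ residual / n_samples + alpha * (1.0 - l1_ratio) * w
        grad_b = residual.mean()
        # Gradient step, then soft-thresholding for the L1 term.
        w = soft_threshold(w - lr * grad_w, lr * alpha * l1_ratio)
        b -= lr * grad_b
    return w, b

# Synthetic data (hypothetical): 200 samples, 20 "genes"; only genes 0 and 1
# determine the class label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] - X[:, 1] > 0).astype(float)

w, b = fit_elastic_net_logistic(X, y)
signature = np.flatnonzero(w)  # indices of genes with non-zero coefficients
```

Genes whose coefficients are shrunk exactly to zero drop out of the signature, while the remaining coefficients serve as the classifier weights; the balance between the two penalties is controlled here by the assumed `l1_ratio` parameter.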
Causal machine learning for single-cell genomics
Advances in single-cell omics allow for unprecedented insights into the
transcription profiles of individual cells. When combined with large-scale
perturbation screens, through which specific biological mechanisms can be
targeted, these technologies allow for measuring the effect of targeted
perturbations on the whole transcriptome. These advances provide an opportunity
to better understand the causative role of genes in complex biological
processes such as gene regulation, disease progression or cellular development.
However, the high-dimensional nature of the data, coupled with the intricate
complexity of biological systems renders this task nontrivial. Within the
machine learning community, there has been a recent increase of interest in
causality, with a focus on adapting established causal techniques and
algorithms to handle high-dimensional data. In this perspective, we delineate
the application of these methodologies within the realm of single-cell genomics
and their challenges. We first present the model that underlies most current
causal approaches to single-cell biology and discuss and challenge the
assumptions it entails from the biological point of view. We then identify open
problems in the application of causal approaches to single-cell data:
generalising to unseen environments, learning interpretable models, and
learning causal models of dynamics. For each problem, we discuss how various
research directions - including the development of computational approaches and
the adaptation of experimental protocols - may offer ways forward, or on the
contrary pose some difficulties. With the advent of single cell atlases and
increasing perturbation data, we expect causal models to become a crucial tool
for informed experimental design. Comment: 35 pages, 7 figures, 3 tables, 1 bo
Unsupervised Machine Learning Algorithms to Characterize Single-Cell Heterogeneity and Perturbation Response
Recent advances in microfluidic technologies facilitate the measurement of gene expression, DNA accessibility, protein content, or genomic mutations at unprecedented scale. The challenges imposed by the scale of these datasets are further exacerbated by non-linearity in molecular effects, complex interdependencies between features, and a lack of understanding of both data generating processes and sources of technical and biological noise. As a result, analysis of modern single-cell data requires the development of specialized computational tools. One solution to these problems is the use of manifold learning, a sub-field of unsupervised machine learning that seeks to model data geometry using a simplifying assumption that the underlying system is continuous and locally Euclidean. In this dissertation, I show how manifold learning is naturally suited for single-cell analysis and introduce three related algorithms for characterization of single-cell heterogeneity and perturbation response. I first describe Vertex Frequency Clustering, an algorithm that identifies groups of cells with similar responses to an experimental perturbation by analyzing the spectral representation of condition labels expressed as signals over a cell similarity graph. Next, I introduce MELD, an algorithm that expands on these ideas to estimate the density of each experimental sample over the graph to quantify the effect of an experimental perturbation at single cell resolution. Finally, I describe a neural network for archetypal analysis that represents the data as continuously distributed between a set of extrema. Each of these algorithms is demonstrated on a combination of real and synthetic datasets and is benchmarked against state-of-the-art algorithms
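Treating condition labels as a signal over a cell similarity graph can be sketched in a few lines. The example below is not MELD or Vertex Frequency Clustering: a generic Tikhonov low-pass graph filter stands in for the dissertation's filters, and the k-nearest-neighbour construction, the `beta` smoothing strength, and the synthetic two-cluster data are assumptions chosen only to show the general idea.

```python
import numpy as np

def knn_adjacency(X, k):
    """Symmetric, unweighted k-nearest-neighbour graph over the rows of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)            # a cell is not its own neighbour
    nearest = np.argsort(d2, axis=1)[:, :k]
    A = np.zeros((X.shape[0], X.shape[0]))
    rows = np.repeat(np.arange(X.shape[0]), k)
    A[rows, nearest.ravel()] = 1.0
    return np.maximum(A, A.T)               # symmetrize

def smooth_labels(A, labels, beta=1.0):
    """Low-pass filter a label signal by solving (I + beta * L) x = labels,
    where L is the combinatorial graph Laplacian."""
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.solve(np.eye(len(labels)) + beta * L, labels)

# Synthetic data (hypothetical): two well-separated groups of 30 cells each.
rng = np.random.default_rng(0)
cells = np.vstack([rng.normal(0.0, 0.3, size=(30, 2)),
                   rng.normal(5.0, 0.3, size=(30, 2))])

# Condition labels: 0 = control, 1 = treated. The second group is all treated;
# three treated cells also fall in the first group.
labels = np.concatenate([np.zeros(30), np.ones(30)])
labels[:3] = 1.0

A = knn_adjacency(cells, k=5)
smoothed = smooth_labels(A, labels)
```

Each smoothed value is a weighted average of the labels of nearby cells on the graph, so it can be read as a per-cell estimate of how strongly that neighbourhood is associated with the treated condition.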
Integrative computational methodologies on single cell datasets
High throughput single cell sequencing has seen exciting developments in recent years. With its high resolution characterization of genetics, genomics, proteomics, and epigenomics features, single cell data offer more insights on the underlying biological processes than those from bulk sequencing data. The most well developed single cell technologies are single cell RNA-seq (scRNA-seq) on transcriptomics and flow cytometry on proteomics. Many multi-omics single cell sequencing platforms have also emerged recently, such as CITE-seq, which profiles both epitope and transcriptome simultaneously. But some well known limitations of single cell data, such as batch variations, shallow sequencing depth, and sparsity also present many challenges. Many computational approaches built on machine learning and deep learning methods have been proposed to address these challenges. In this dissertation, I present three computational methods for joint analysis of single cell sequencing data either by multi-omics integration or joint analysis of multiple datasets. In the first chapter, we focus on single cell proteomics data, specifically, the antibody profiling of CITE-seq and cytometry by time of flight (CyTOF) applied to single cells to measure surface marker abundance. Although CyTOF has high accuracy and was introduced earlier than scRNA-seq, there is a lack of computational methods on cell type classification and annotations for these data. We propose a novel automated cell type annotation tool by incorporating CITE-seq data from the same tissue, publicly available annotated scRNA-seq data, and prior knowledge of surface markers in the literature. Our new method, called automated single cell proteomics data annotation approach (ProtAnno), is based on non-negative matrix factorization. We demonstrate the annotation accuracy and robustness of ProtAnno through extensive applications, especially for peripheral blood mononuclear cells (PBMC). 
The second chapter introduces an integrative method improving bulk sequencing data decomposition into cell type proportions by harmonizing scRNA-seq data across multiple tissues or multiple studies. As a Bayesian model, our method, called tranSig, is able to construct a more reliable signature matrix for decomposition by borrowing information from other tissues and/or studies. Our method can be considered an add-on step in cell type decomposition. Our method can better derive the signature gene matrix and better characterize the biological heterogeneity from bulk sequencing datasets. Finally, in the last chapter, we propose a method to jointly analyze scRNA-seq data with summary statistics from genome wide association studies (GWAS). Our method generates a set of SNP (single nucleotide polymorphism)-level weight scores for each cell type or tissue type using an scRNA-seq atlas. These scores are combined with risk allele effect sizes to decompose the polygenic risk score (PRS) into cell types or tissue types. We show through enrichment analysis and phenome-wide association study (PheWAS) that the decomposed PRSs can better explain the biological mechanisms of genetic effects on complex traits mediated through transcription regulation and the differences across cell types and tissues
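The arithmetic of splitting a polygenic risk score across cell types can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's method: the genotypes, effect sizes, and weights are made-up numbers, and the weights are assumed to sum to one across cell types for each SNP so that the cell-type components add back up to the ordinary PRS.

```python
import numpy as np

# Hypothetical inputs (not from the dissertation).
genotypes = np.array([[0.0, 1.0, 2.0],     # individuals x SNPs (risk allele counts)
                      [1.0, 1.0, 0.0]])
effect_sizes = np.array([0.2, -0.1, 0.05])  # per-SNP GWAS effect sizes
weights = np.array([[0.7, 0.3],             # SNPs x cell types; each row sums to 1
                    [0.5, 0.5],
                    [0.1, 0.9]])

def decompose_prs(G, beta, W):
    """Cell-type PRS: sum over SNPs of genotype * effect size * cell-type weight."""
    return (G * beta) @ W

total_prs = genotypes @ effect_sizes
cell_type_prs = decompose_prs(genotypes, effect_sizes, weights)
```

Under the row-normalization assumption, summing `cell_type_prs` across its columns recovers `total_prs` for every individual, which is a convenient sanity check on any such decomposition.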
On the robustness of machine learning algorithms toward microfluidic distortions for cell classification via on-chip fluorescence microscopy
Single-cell imaging and sorting are critical technologies in biology and clinical applications. The power of these technologies is increased when combined with microfluidics, fluorescence markers, and machine learning. However, this quest faces several challenges. One of these is the effect of the sample flow velocity on the classification performances. Indeed, cell flow speed affects the quality of image acquisition by increasing motion blur and decreasing the number of acquired frames per sample. We investigate how these visual distortions impact the final classification task in a real-world use-case of cancer cell screening, using a microfluidic platform in combination with light sheet fluorescence microscopy. We demonstrate, by analyzing both simulated and experimental data, that it is possible to achieve high flow speed and high accuracy in single-cell classification. We prove that it is possible to overcome the 3D slice variability of the acquired 3D volumes, by relying on their 2D sum z-projection transformation, to reach an efficient real time classification with an accuracy of 99.4% using a convolutional neural network with transfer learning from simulated data. Beyond this specific use-case, we provide a web platform to generate a synthetic dataset and to investigate the effect of flow speed on cell classification for any biological samples and a large variety of fluorescence microscopes (https://www.creatis.insa-lyon.fr/site7/en/MicroVIP)
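The 2D sum z-projection mentioned in this abstract is a simple operation to sketch. The snippet below assumes the acquired volume is stored as a NumPy array ordered (z, y, x); the array layout, the function name, and the toy stack are assumptions for illustration and are not taken from the paper or the MicroVIP platform.

```python
import numpy as np

def sum_z_projection(volume):
    """Collapse a 3D stack shaped (z, y, x) into a 2D image by summing over z."""
    volume = np.asarray(volume, dtype=np.float64)
    if volume.ndim != 3:
        raise ValueError("expected a 3D array shaped (z, y, x)")
    return volume.sum(axis=0)

# Toy stack (hypothetical): 4 slices of 3 x 5 pixels.
stack = np.arange(4 * 3 * 5).reshape(4, 3, 5)
projection = sum_z_projection(stack)
```

Because every slice contributes to the same 2D image, the projection does not depend on which slice a given structure happened to land in, which is one way to read the abstract's point about overcoming slice variability before classification.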
Methods towards precision bioinformatics in single cell era
Single-cell technology offers unprecedented insight into the molecular landscape of individual cells and is transforming precision medicine. Key to the effective use of single-cell data for disease understanding is the analysis of such information through bioinformatics methods. In this thesis, we examine and address several challenges in single-cell bioinformatics methods for precision medicine.
While most current single-cell analytical tools employ statistical and machine learning methods, deep learning technology has gained tremendous success in computer science. Combined with ensemble learning, this further improves model performance. Through a review article (Cao et al., 2020), we share recent key developments in this area and their contribution to bioinformatics research.
Bioinformatics tools often use simulation data to assess proposed methodologies, but evaluation of the quality of single-cell RNA-sequencing (scRNA-seq) data simulation tools is lacking. We develop a comprehensive framework, SimBench (Cao et al., 2021), that examines a range of aspects from data properties to the ability to maintain biological signals, scalability, and applicability.
While individual patient understanding is the key to precision medicine, there is little consensus on the best ways to compress complex single-cell data into summary statistics that represent each individual. We present scFeatures (Cao et al., 2022b), an approach that creates interpretable molecular representations for individuals.
Finally, in a case study using multiple COVID-19 scRNA-seq datasets, we utilise scFeatures to generate molecular characterisations of individuals and illustrate the impact of ensemble learning and deep learning on improving disease outcome prediction.
Overall, this thesis addresses several gaps in precision bioinformatics in the single-cell field by highlighting research advances, developing methodologies, and illustrating practical uses through experimental datasets and case studies
Single-cell microfluidic impedance cytometry: From raw signals to cell phenotypes using data analytics
The biophysical analysis of single-cells by microfluidic impedance cytometry is emerging as a label-free and high-throughput means to stratify the heterogeneity of cellular systems based on their electrophysiology. Emerging applications range from fundamental life-science and drug assessment research to point-of-care diagnostics and precision medicine. Recently, novel chip designs and data analytic strategies are laying the foundation for multiparametric cell characterization and subpopulation distinction, which are essential to understand biological function, follow disease progression and monitor cell behaviour in microsystems. In this tutorial review, we present a comparative survey of the approaches to elucidate cellular and subcellular features from impedance cytometry data, covering the related subjects of device design, data analytics (i.e., signal processing, dielectric modelling, population clustering), and phenotyping applications. We give special emphasis to the exciting recent developments of the technique (timeframe 2017-2020) and provide our perspective on future challenges and directions. Its synergistic application with microfluidic separation, sensor science and machine learning can form an essential tool-kit for label-free quantification and isolation of subpopulations to stratify heterogeneous biosystems