172,200 research outputs found

    Deep Learning in Single-Cell Analysis

    Single-cell technologies are revolutionizing the entire field of biology. The large volumes of data generated by single-cell technologies are high-dimensional, sparse, heterogeneous, and have complicated dependency structures, making analyses using conventional machine learning approaches challenging and impractical. In tackling these challenges, deep learning often demonstrates superior performance compared to traditional machine learning methods. In this work, we give a comprehensive survey on deep learning in single-cell analysis. We first introduce background on single-cell technologies and their development, as well as fundamental concepts of deep learning including the most popular deep architectures. We present an overview of the single-cell analytic pipeline pursued in research applications while noting divergences due to data sources or specific applications. We then review seven popular tasks spanning different stages of the single-cell analysis pipeline, including multimodal integration, imputation, clustering, spatial domain identification, cell-type deconvolution, cell segmentation, and cell-type annotation. Under each task, we describe the most recent developments in classical and deep learning methods and discuss their advantages and disadvantages. Deep learning tools and benchmark datasets are also summarized for each task. Finally, we discuss the future directions and the most recent challenges. This survey will serve as a reference for biologists and computer scientists, encouraging collaborations. Comment: 77 pages, 11 figures, 15 tables, deep learning, single-cell analysis

    Data-driven comparison of multiple high-dimensional single-cell expression profiles

    Comparing multiple single-cell expression datasets such as cytometry and scRNA-seq data between case and control donors provides information to elucidate the mechanisms of disease. We propose a completely data-driven computational biological method for this task. This overcomes the challenges of conventional cellular subset-based comparisons and facilitates further analyses such as machine learning and gene set analysis of single-cell expression datasets

    An Elastic-net Logistic Regression Approach to Generate Classifiers and Gene Signatures for Types of Immune Cells and T Helper Cell Subsets

    Background: Host immune response is coordinated by a variety of different specialized cell types that vary in time and location. While host immune response can be studied using conventional low-dimensional approaches, advances in transcriptomics analysis may provide a less biased view. Yet, leveraging transcriptomics data to identify immune cell subtypes presents challenges for extracting informative gene signatures hidden within a high-dimensional transcriptomics space characterized by low sample numbers with noisy and missing values. To address these challenges, we explore using machine learning methods to select gene subsets and estimate gene coefficients simultaneously. Results: Elastic-net logistic regression, a type of machine learning, was used to construct separate classifiers for ten different types of immune cells and for five T helper cell subsets. The resulting classifiers were then used to develop gene signatures that best discriminate among immune cell types and T helper cell subsets using RNA-seq datasets. We validated the approach using single-cell RNA-seq (scRNA-seq) datasets, which gave consistent results. In addition, we classified cell types that were previously unannotated. Finally, we benchmarked the proposed gene signatures against other existing gene signatures. Conclusions: Developed classifiers can be used as priors in predicting the extent and functional orientation of the host immune response in diseases, such as cancer, where transcriptomic profiling of bulk tissue samples and single cells is routinely employed, information that can provide insight into the mechanistic basis of disease and therapeutic response. The so
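The abstract above describes selecting gene subsets and estimating gene coefficients simultaneously with elastic-net logistic regression. The following is a minimal sketch of that idea, not the authors' implementation: it fits the model by proximal gradient descent in pure Python on a tiny made-up dataset. The function names, the hyperparameters (`alpha`, `l1_ratio`, `lr`, `n_iter`), and the two-"gene" data are all hypothetical and chosen only for illustration.

```python
import math

def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip small values to 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_elastic_net_logistic(X, y, alpha=0.1, l1_ratio=0.5, lr=0.1, n_iter=500):
    """Fit logistic regression with an elastic-net penalty (hypothetical settings).

    The L2 part is handled in the gradient step and the L1 part by
    soft-thresholding, so uninformative coefficients become exactly zero.
    """
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    l1 = alpha * l1_ratio
    l2 = alpha * (1.0 - l1_ratio)
    for _ in range(n_iter):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            z = bias + sum(w * x for w, x in zip(weights, row))
            error = 1.0 / (1.0 + math.exp(-z)) - label
            grad_b += error / n_samples
            for j, x in enumerate(row):
                grad_w[j] += error * x / n_samples
        bias -= lr * grad_b  # the intercept is not penalized
        for j in range(n_features):
            step = weights[j] - lr * (grad_w[j] + l2 * weights[j])
            weights[j] = soft_threshold(step, lr * l1)
    return weights, bias

# Toy data (assumed): "gene 0" separates the two classes, "gene 1" is noise.
X = [[2.0, 0.1], [1.8, -0.1], [2.2, 0.0],
     [-2.0, 0.1], [-1.9, -0.1], [-2.1, 0.0]]
y = [1, 1, 1, 0, 0, 0]

weights, bias = fit_elastic_net_logistic(X, y)
signature = [j for j, w in enumerate(weights) if w != 0.0]
print("coefficients:", weights)
print("selected genes:", signature)
```

The genes with nonzero coefficients play the role of the "gene signature" here. For real transcriptomics data, a tested library implementation (for example, scikit-learn's `LogisticRegression` with `penalty="elasticnet"` and `solver="saga"`) would be a more sensible choice than this hand-written loop.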

    Causal machine learning for single-cell genomics

    Advances in single-cell omics allow for unprecedented insights into the transcription profiles of individual cells. When combined with large-scale perturbation screens, through which specific biological mechanisms can be targeted, these technologies allow for measuring the effect of targeted perturbations on the whole transcriptome. These advances provide an opportunity to better understand the causative role of genes in complex biological processes such as gene regulation, disease progression or cellular development. However, the high-dimensional nature of the data, coupled with the intricate complexity of biological systems, renders this task nontrivial. Within the machine learning community, there has been a recent increase of interest in causality, with a focus on adapting established causal techniques and algorithms to handle high-dimensional data. In this perspective, we delineate the application of these methodologies within the realm of single-cell genomics and their challenges. We first present the model that underlies most current causal approaches to single-cell biology and discuss and challenge the assumptions it entails from the biological point of view. We then identify open problems in the application of causal approaches to single-cell data: generalising to unseen environments, learning interpretable models, and learning causal models of dynamics. For each problem, we discuss how various research directions, including the development of computational approaches and the adaptation of experimental protocols, may offer ways forward or, on the contrary, pose some difficulties. With the advent of single-cell atlases and increasing perturbation data, we expect causal models to become a crucial tool for informed experimental design. Comment: 35 pages, 7 figures, 3 tables, 1 bo

    Unsupervised Machine Learning Algorithms to Characterize Single-Cell Heterogeneity and Perturbation Response

    Recent advances in microfluidic technologies facilitate the measurement of gene expression, DNA accessibility, protein content, or genomic mutations at unprecedented scale. The challenges imposed by the scale of these datasets are further exacerbated by non-linearity in molecular effects, complex interdependencies between features, and a lack of understanding of both data generating processes and sources of technical and biological noise. As a result, analysis of modern single-cell data requires the development of specialized computational tools. One solution to these problems is the use of manifold learning, a sub-field of unsupervised machine learning that seeks to model data geometry using a simplifying assumption that the underlying system is continuous and locally Euclidean. In this dissertation, I show how manifold learning is naturally suited for single-cell analysis and introduce three related algorithms for characterization of single-cell heterogeneity and perturbation response. I first describe Vertex Frequency Clustering, an algorithm that identifies groups of cells with similar responses to an experimental perturbation by analyzing the spectral representation of condition labels expressed as signals over a cell similarity graph. Next, I introduce MELD, an algorithm that expands on these ideas to estimate the density of each experimental sample over the graph to quantify the effect of an experimental perturbation at single-cell resolution. Finally, I describe a neural network for archetypal analysis that represents the data as continuously distributed between a set of extrema. Each of these algorithms is demonstrated on a combination of real and synthetic datasets and is benchmarked against state-of-the-art algorithms.

    Integrative computational methodologies on single cell datasets

    High throughput single cell sequencing has seen exciting developments in recent years. With its high resolution characterization of genetics, genomics, proteomics, and epigenomics features, single cell data offer more insights on the underlying biological processes than those from bulk sequencing data. The most well developed single cell technologies are single cell RNA-seq (scRNA-seq) on transcriptomics and flow cytometry on proteomics. Many multi-omics single cell sequencing platforms have also emerged recently, such as CITE-seq, which profiles both epitope and transcriptome simultaneously. But some well known limitations of single cell data, such as batch variations, shallow sequencing depth, and sparsity also present many challenges. Many computational approaches built on machine learning and deep learning methods have been proposed to address these challenges. In this dissertation, I present three computational methods for joint analysis of single cell sequencing data either by multi-omics integration or joint analysis of multiple datasets. In the first chapter, we focus on single cell proteomics data, specifically, the antibody profiling of CITE-seq and cytometry by time of flight (CyTOF) applied to single cells to measure surface marker abundance. Although CyTOF has high accuracy and was introduced earlier than scRNA-seq, there is a lack of computational methods on cell type classification and annotations for these data. We propose a novel automated cell type annotation tool by incorporating CITE-seq data from the same tissue, publicly available annotated scRNA-seq data, and prior knowledge of surface markers in the literature. Our new method, called automated single cell proteomics data annotation approach (ProtAnno), is based on non-negative matrix factorization. We demonstrate the annotation accuracy and robustness of ProtAnno through extensive applications, especially for peripheral blood mononuclear cells (PBMC). 
    The second chapter introduces an integrative method that improves the decomposition of bulk sequencing data into cell type proportions by harmonizing scRNA-seq data across multiple tissues or multiple studies. As a Bayesian model, our method, called tranSig, is able to construct a more reliable signature matrix for decomposition by borrowing information from other tissues and/or studies. Our method can be considered an add-on step in cell type decomposition. It can better derive a signature gene matrix and better characterize the biological heterogeneity from bulk sequencing datasets. Finally, in the last chapter, we propose a method to jointly analyze scRNA-seq data with summary statistics from genome-wide association studies (GWAS). Our method generates a set of SNP (single nucleotide polymorphism)-level weight scores for each cell type or tissue type using an scRNA-seq atlas. These scores are combined with risk allele effect sizes to decompose the polygenic risk score (PRS) into cell types or tissue types. We show through enrichment analysis and phenome-wide association study (PheWAS) that the decomposed PRSs can better explain the biological mechanisms of genetic effects on complex traits mediated through transcription regulation and the differences across cell types and tissues.
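The last chapter of the abstract above combines SNP-level weight scores per cell type with risk allele effect sizes to split a polygenic risk score into cell-type components. The abstract does not give the formula, so the sketch below assumes the simplest reading: a standard additive PRS (sum of effect size times allele dosage) in which each SNP's contribution is multiplied by a per-cell-type weight. The function name, the weights, the effect sizes, and the dosages are all hypothetical.

```python
def decompose_prs(dosages, effect_sizes, weights_by_cell_type):
    """Split an additive PRS into per-cell-type components.

    Assumed form (not taken from the source):
        PRS_c = sum_j  w[c][j] * beta[j] * g[j]
    where g[j] is the risk-allele dosage (0, 1 or 2), beta[j] the GWAS
    effect size, and w[c][j] the weight of SNP j for cell type c.
    """
    return {
        cell_type: sum(w * b * g for w, b, g in zip(snp_weights, effect_sizes, dosages))
        for cell_type, snp_weights in weights_by_cell_type.items()
    }

# Hypothetical individual with three SNPs.
dosages = [0, 1, 2]
effect_sizes = [0.2, -0.1, 0.05]

# Hypothetical weights; here each SNP's weights sum to 1 across cell types,
# so the components add back up to the ordinary PRS.
weights_by_cell_type = {
    "T cell": [0.5, 1.0, 0.0],
    "B cell": [0.5, 0.0, 1.0],
}

components = decompose_prs(dosages, effect_sizes, weights_by_cell_type)
total_prs = sum(b * g for b, g in zip(effect_sizes, dosages))
print(components)
print("sum of components:", sum(components.values()), "total PRS:", total_prs)
```

Normalizing the weights to sum to 1 per SNP is a design choice made for this sketch so that the decomposition is exact. Whether the method described in the dissertation normalizes its weights this way is not stated in the abstract.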

    On the robustness of machine learning algorithms toward microfluidic distortions for cell classification via on-chip fluorescence microscopy

    Single-cell imaging and sorting are critical technologies in biology and clinical applications. The power of these technologies is increased when combined with microfluidics, fluorescence markers, and machine learning. However, this quest faces several challenges. One of these is the effect of the sample flow velocity on classification performance. Indeed, cell flow speed affects the quality of image acquisition by increasing motion blur and decreasing the number of acquired frames per sample. We investigate how these visual distortions impact the final classification task in a real-world use case of cancer cell screening, using a microfluidic platform in combination with light sheet fluorescence microscopy. We demonstrate, by analyzing both simulated and experimental data, that it is possible to achieve high flow speed and high accuracy in single-cell classification. We show that it is possible to overcome the 3D slice variability of the acquired 3D volumes by relying on their 2D sum z-projection transformation, to reach an efficient real-time classification with an accuracy of 99.4% using a convolutional neural network with transfer learning from simulated data. Beyond this specific use case, we provide a web platform to generate a synthetic dataset and to investigate the effect of flow speed on cell classification for any biological samples and a large variety of fluorescence microscopes (https://www.creatis.insa-lyon.fr/site7/en/MicroVIP).
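The abstract above relies on a 2D sum z-projection to reduce each acquired 3D volume to a single image before classification. As a minimal sketch of that one transformation (not the authors' pipeline), the code below sums a volume over its z axis in pure Python. The `[z][y][x]` index order and the tiny example volume are assumptions made for illustration.

```python
def sum_z_projection(volume):
    """Collapse a 3D volume indexed as volume[z][y][x] into a 2D image.

    Each output pixel is the sum of that (y, x) position over all z slices,
    so the result does not depend on the order of the slices, although a
    different number of slices still changes the overall intensity scale.
    """
    height = len(volume[0])
    width = len(volume[0][0])
    projection = [[0.0] * width for _ in range(height)]
    for z_slice in volume:
        for y in range(height):
            row = z_slice[y]
            for x in range(width):
                projection[y][x] += row[x]
    return projection

# Hypothetical 3-slice, 2x2 volume.
volume = [
    [[1, 0], [0, 2]],
    [[0, 3], [1, 0]],
    [[2, 1], [0, 0]],
]

image = sum_z_projection(volume)
print(image)  # [[3.0, 4.0], [1.0, 2.0]]
```

With an array library, the same operation is a single reduction along the z axis. The resulting 2D image is what would then be passed to a classifier such as the convolutional network mentioned in the abstract.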

    Methods towards precision bioinformatics in single cell era

    Single-cell technology offers unprecedented insight into the molecular landscape of individual cells and is transforming precision medicine. Key to the effective use of single-cell data for disease understanding is the analysis of such information through bioinformatics methods. In this thesis, we examine and address several challenges in single-cell bioinformatics methods for precision medicine. While most current single-cell analytical tools employ statistical and machine learning methods, deep learning technology has gained tremendous success in computer science. Combining it with ensemble learning further improves model performance. Through a review article (Cao et al., 2020), we share recent key developments in this area and their contribution to bioinformatics research. Bioinformatics tools often use simulation data to assess proposed methodologies, but evaluation of the quality of single-cell RNA-sequencing (scRNA-seq) data simulation tools is lacking. We develop a comprehensive framework, SimBench (Cao et al., 2021), that examines a range of aspects from data properties to the ability to maintain biological signals, scalability, and applicability. While individual patient understanding is the key to precision medicine, there is little consensus on the best ways to compress complex single-cell data into summary statistics that represent each individual. We present scFeatures (Cao et al., 2022b), an approach that creates interpretable molecular representations for individuals. Finally, in a case study using multiple COVID-19 scRNA-seq datasets, we utilise scFeatures to generate molecular characterisations of individuals and illustrate the impact of ensemble learning and deep learning on improving disease outcome prediction. Overall, this thesis addresses several gaps in precision bioinformatics in the single-cell field by highlighting research advances, developing methodologies, and illustrating practical uses through experimental datasets and case studies.

    Single-cell microfluidic impedance cytometry: From raw signals to cell phenotypes using data analytics

    The biophysical analysis of single cells by microfluidic impedance cytometry is emerging as a label-free and high-throughput means to stratify the heterogeneity of cellular systems based on their electrophysiology. Emerging applications range from fundamental life-science and drug assessment research to point-of-care diagnostics and precision medicine. Recently, novel chip designs and data analytic strategies have been laying the foundation for multiparametric cell characterization and subpopulation distinction, which are essential to understand biological function, follow disease progression and monitor cell behaviour in microsystems. In this tutorial review, we present a comparative survey of the approaches to elucidate cellular and subcellular features from impedance cytometry data, covering the related subjects of device design, data analytics (i.e., signal processing, dielectric modelling, population clustering), and phenotyping applications. We give special emphasis to the exciting recent developments of the technique (timeframe 2017-2020) and provide our perspective on future challenges and directions. Its synergistic application with microfluidic separation, sensor science and machine learning can form an essential tool-kit for label-free quantification and isolation of subpopulations to stratify heterogeneous biosystems.