256 research outputs found

    Deep Learning in Single-Cell Analysis

    Single-cell technologies are revolutionizing the entire field of biology. The large volumes of data generated by single-cell technologies are high-dimensional, sparse, heterogeneous, and have complicated dependency structures, making analyses using conventional machine learning approaches challenging and impractical. In tackling these challenges, deep learning often demonstrates superior performance compared to traditional machine learning methods. In this work, we give a comprehensive survey on deep learning in single-cell analysis. We first introduce background on single-cell technologies and their development, as well as fundamental concepts of deep learning including the most popular deep architectures. We present an overview of the single-cell analytic pipeline pursued in research applications while noting divergences due to data sources or specific applications. We then review seven popular tasks spanning different stages of the single-cell analysis pipeline, including multimodal integration, imputation, clustering, spatial domain identification, cell-type deconvolution, cell segmentation, and cell-type annotation. Under each task, we describe the most recent developments in classical and deep learning methods and discuss their advantages and disadvantages. Deep learning tools and benchmark datasets are also summarized for each task. Finally, we discuss the future directions and the most recent challenges. This survey will serve as a reference for biologists and computer scientists, encouraging collaborations. Comment: 77 pages, 11 figures, 15 tables, deep learning, single-cell analysis

    What I talk about when I talk about integration of single-cell data

    Over the past decade, single-cell technologies evolved from profiling hundreds of cells to millions of cells, and expanded from a single modality of data to cover multiple views at single-cell resolution, including genome, epigenome, transcriptome, and so on. With the advance of these technologies, the boom in multimodal single-cell data creates a valuable resource for understanding cellular heterogeneity and molecular mechanisms at a comprehensive level. However, large-scale multimodal single-cell data also presents a huge computational challenge for insightful integrative analysis. Here, I will lay out the data integration problems that the single-cell research community is interested in and introduce computational principles for solving them. In the following chapters, I will present four computational methods for data integration under different scenarios. Finally, I will discuss some future directions and potential applications of single-cell data integration

    Artificial neural network system for cell classification using single cell RNA expression

    We implemented an automated system for single-cell classification using artificial neural networks (ANN). Our system takes single-cell gene expression sparse matrices and trains ANN to classify cell types and subtypes. The assemblies of ANNs predict cell classes by voting. We tested the system in a case study where we trained ANNs with a dataset containing approximately 120,000 single cells and tested the resulting model using an independent data set of 13,000 single cells. The overall accuracy of the 5-class classification was 95%. We trained and tested a total of 100 ANNs in 10 cycles. The prediction system demonstrated excellent reproducibility. The analysis of misclassifications indicated that 2% were likely classification errors, while the remaining 3% were likely due to mislabeled types and subtypes in the test set
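The voting step mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not say how votes are counted or how ties are broken, so the plain majority rule, the alphabetical tie-break, and the function name `majority_vote` below are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label predictions into one label per cell.

    predictions: list of lists; predictions[m][i] is the label that
    model m assigns to cell i. All inner lists must have equal length.
    Ties are broken alphabetically (an assumption, for determinism).
    """
    if not predictions:
        raise ValueError("need at least one model's predictions")
    n_cells = len(predictions[0])
    if any(len(p) != n_cells for p in predictions):
        raise ValueError("all models must predict the same number of cells")

    voted = []
    for cell_votes in zip(*predictions):  # votes of every model for one cell
        counts = Counter(cell_votes)
        top = max(counts.values())
        # Among the labels with the highest count, pick the smallest name.
        voted.append(min(label for label, c in counts.items() if c == top))
    return voted


# Hypothetical predictions from three models for three cells.
example = [
    ["T", "B", "NK"],
    ["T", "B", "B"],
    ["B", "B", "NK"],
]
consensus = majority_vote(example)
```

With many models (the abstract reports 100 ANNs), the same function applies unchanged; only the number of inner lists grows.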

    Model-based deep autoencoders for characterizing discrete data with application to genomic data analysis

    Deep learning techniques have achieved tremendous successes in a wide range of real applications in recent years. For dimension reduction, deep neural networks (DNNs) provide a natural choice to parameterize a non-linear transforming function that maps the original high dimensional data to a lower dimensional latent space. An autoencoder is a kind of DNN used to learn efficient feature representations in an unsupervised manner. Deep autoencoders have been widely explored and applied to the analysis of continuous data, but remain understudied for characterizing discrete data. This dissertation focuses on developing model-based deep autoencoders for modeling discrete data. A motivating example of discrete data is the count data matrix generated by single-cell RNA sequencing (scRNA-seq) technology, which is widely used in biological and medical fields. scRNA-seq promises to provide higher resolution of cellular differences than bulk RNA sequencing and has helped researchers to better understand complex biological questions. The recent advances in sequencing technology have enabled a dramatic increase in the throughput to thousands of cells for scRNA-seq. However, analysis of scRNA-seq data remains a statistical and computational challenge. A major problem is the pervasive dropout events obscuring the discrete matrix with prevailing 'false' zero count observations, which are caused by the shallow sequencing depth per cell. To make downstream analysis more effective, imputation, which recovers the missing values, is often conducted as the first step in preprocessing scRNA-seq data. Several imputation methods have been proposed. Of note is a deep autoencoder model, which proposes to explicitly characterize the count distribution, over-dispersion, and sparsity of scRNA-seq data using a zero-inflated negative binomial (ZINB) model. This dissertation introduces a model-based deep learning clustering model, scDeepCluster, for clustering analysis of scRNA-seq data. scDeepCluster is a deep autoencoder which simultaneously learns feature representation and clustering via explicit modeling of scRNA-seq data generation using the ZINB model. Based on testing extensive simulated datasets and real datasets from different representative single-cell sequencing platforms, scDeepCluster outperformed several state-of-the-art methods under various clustering performance metrics and exhibited improved scalability, with running time increasing linearly with the sample size. Although this model-based deep autoencoder approach has demonstrated superior performance, it is over-permissive in defining the ZINB model space, which can lead to an unidentifiable model and make results unstable. Next, this dissertation proposes to impose a regularization that takes dropout events into account. The regularization uses a differentiable categorical distribution, Gumbel-Softmax, to explicitly model the dropout events, and minimizes the Maximum Mean Discrepancy (MMD) between the reconstructed randomly masked matrix and the raw count matrix. Imputation analyses showed that the proposed regularized model-based autoencoder significantly outperformed the vanilla model-based deep autoencoder
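For readers unfamiliar with the ZINB model named in this abstract, the sketch below evaluates its log-probability for a single count. It uses the common mean/dispersion parameterization (mean `mu`, dispersion `theta`, dropout probability `pi`); the abstract does not state which parameterization the dissertation uses, so this choice and all function names here are assumptions. In an autoencoder setting the three parameters would be network outputs; here they are plain numbers.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with
    mean mu > 0 and dispersion theta > 0."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_log_pmf(x, mu, theta, pi):
    """Log-probability of count x under a zero-inflated negative binomial.

    With probability pi (0 <= pi < 1) the count is a dropout zero;
    otherwise it is drawn from the negative binomial.
    """
    nb = nb_log_pmf(x, mu, theta)
    if x == 0:
        # A zero can come from dropout or from the NB component itself.
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    return math.log(1.0 - pi) + nb


def zinb_nll(counts, mu, theta, pi):
    """Negative log-likelihood of a list of counts sharing one set of
    parameters (a simplification; real models use per-entry parameters)."""
    return -sum(zinb_log_pmf(x, mu, theta, pi) for x in counts)


# Illustrative values only: a sparse count vector with many zeros.
loss = zinb_nll([0, 0, 3, 0, 1, 7], mu=2.0, theta=1.5, pi=0.3)
```

Setting `pi` to 0 recovers the ordinary negative binomial, which makes explicit that the zero-inflation term only adds probability mass at zero.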

    Classification of single cell types during leukemia therapy using artificial neural networks

    We trained artificial neural network (ANN) models to classify peripheral blood mononuclear cells (PBMC) in chronic lymphoid leukemia (CLL) patients. The classification task was to determine differences in gene expression profiles in PBMC pre-treatment (with ibrutinib) and on days 30, 120, 150, and 280 after the start of treatment. Twelve datasets representing clinical samples, containing a total of 48,016 single-cell profiles, were used to train and test ANN models to classify the progress of therapy by gene expression changes. The accuracy of ANN classification was >92% in internal cross-validation. External cross-validation, using independent data sets for training and testing, showed the accuracy of classification of post-treatment PBMCs to be more than 80%. To the best of our knowledge, this is the first study that has demonstrated the potential of ANNs with 10x single cell gene expression data for detecting the changes during treatment of CLL

    scGCN is a graph convolutional networks algorithm for knowledge transfer in single cell omics

    Single-cell omics is the fastest-growing type of genomics data in the literature and public genomics repositories. Leveraging the growing repository of labeled datasets and transferring labels from existing datasets to newly generated datasets will empower the exploration of single-cell omics data. However, the current label transfer methods have limited performance, largely due to the intrinsic heterogeneity among cell populations and extrinsic differences between datasets. Here, we present a robust graph artificial intelligence model, single-cell Graph Convolutional Network (scGCN), to achieve effective knowledge transfer across disparate datasets. Through benchmarking with other label transfer methods on a total of 30 single cell omics datasets, scGCN consistently demonstrates superior accuracy in leveraging cells from different tissues, platforms, and species, as well as cells profiled at different molecular layers. scGCN is implemented as an integrated workflow in Python, which is available at https://github.com/QSong-github/scGCN

    DSTG: deconvoluting spatial transcriptomics data through graph-based artificial intelligence

    Recent developments in spatial transcriptomics (ST) make it possible to associate spatial information at different spots in the tissue section with RNA abundance of cells within each spot, which is particularly important to understand tissue cytoarchitectures and functions. However, for such ST data, since a spot is usually larger than an individual cell, gene expressions measured at each spot are from a mixture of cells with heterogeneous cell types. Therefore, ST data at each spot needs to be disentangled so as to reveal the cell compositions at that spatial spot. In this study, we propose a novel method, named deconvoluting spatial transcriptomics data through graph-based convolutional networks (DSTG), to accurately deconvolute the observed gene expressions at each spot and recover its cell constitutions, thus achieving high-level segmentation and revealing spatial architecture of cellular heterogeneity within tissues. DSTG not only demonstrates superior performance on synthetic spatial data generated from different protocols, but also effectively identifies spatial compositions of cells in mouse cortex layer, hippocampus slice and pancreatic tumor tissues. In conclusion, DSTG accurately uncovers the cell states and subpopulations based on spatial localization. DSTG is available as ready-to-use open source software (https://github.com/Su-informatics-lab/DSTG) for precise interrogation of spatial organizations and functions in tissues

    Sparsely-connected autoencoder (SCA) for single cell RNAseq data mining

    Single-cell RNA sequencing (scRNAseq) is an essential tool to investigate cellular heterogeneity. Thus, it would be of great interest to be able to disclose biological information belonging to cell subpopulations, which can be defined by clustering analysis of scRNAseq data. In this manuscript, we report a tool that we developed for the functional mining of single cell clusters based on Sparsely-Connected Autoencoder (SCA). This tool allows uncovering hidden features associated with scRNAseq data. We implemented two new metrics, QCC (Quality Control of Cluster) and QCM (Quality Control of Model), which allow quantifying the ability of SCA to reconstruct valuable cell clusters and to evaluate the quality of the neural network achievements, respectively. Our data indicate that SCA encoded space, derived by different experimentally validated data (TF targets, miRNA targets, Kinase targets, and cancer-related immune signatures), can be used to grasp single cell cluster-specific functional features. In our implementation, SCA efficacy comes from its ability to reconstruct only specific clusters, thus indicating only those clusters where the SCA encoding space is a key element for cell aggregation. SCA analysis is implemented as a module in the rCASC framework and is supported by a GUI to simplify its usage for biologists and medical personnel

    Swarm Learning for decentralized and confidential clinical machine learning

    Fast and reliable detection of patients with severe and heterogeneous illnesses is a major goal of precision medicine [1,2]. Patients with leukaemia can be identified using machine learning on the basis of their blood transcriptomes [3]. However, there is an increasing divide between what is technically possible and what is allowed, because of privacy legislation [4,5]. Here, to facilitate the integration of any medical data from any data owner worldwide without violating privacy laws, we introduce Swarm Learning, a decentralized machine-learning approach that unites edge computing, blockchain-based peer-to-peer networking and coordination while maintaining confidentiality without the need for a central coordinator, thereby going beyond federated learning. To illustrate the feasibility of using Swarm Learning to develop disease classifiers using distributed data, we chose four use cases of heterogeneous diseases (COVID-19, tuberculosis, leukaemia and lung pathologies). With more than 16,400 blood transcriptomes derived from 127 clinical studies with non-uniform distributions of cases and controls and substantial study biases, as well as more than 95,000 chest X-ray images, we show that Swarm Learning classifiers outperform those developed at individual sites. In addition, Swarm Learning completely fulfils local confidentiality regulations by design. We believe that this approach will notably accelerate the introduction of precision medicine

    Methods towards precision bioinformatics in single cell era

    Single-cell technology offers unprecedented insight into the molecular landscape of individual cells and is transforming precision medicine. Key to the effective use of single-cell data for disease understanding is the analysis of such information through bioinformatics methods. In this thesis, we examine and address several challenges in single-cell bioinformatics methods for precision medicine. While most current single-cell analytical tools employ statistical and machine learning methods, deep learning technology has gained tremendous success in computer science. Combining it with ensemble learning further improves model performance. Through a review article (Cao et al., 2020), we share recent key developments in this area and their contribution to bioinformatics research. Bioinformatics tools often use simulation data to assess proposed methodologies, but evaluation of the quality of single-cell RNA-sequencing (scRNA-seq) data simulation tools is lacking. We develop a comprehensive framework, SimBench (Cao et al., 2021), that examines a range of aspects from data properties to the ability to maintain biological signals, scalability, and applicability. While individual patient understanding is the key to precision medicine, there is little consensus on the best ways to compress complex single-cell data into summary statistics that represent each individual. We present scFeatures (Cao et al., 2022b), an approach that creates interpretable molecular representations for individuals. Finally, in a case study using multiple COVID-19 scRNA-seq datasets, we utilise scFeatures to generate molecular characterisations of individuals and illustrate the impact of ensemble learning and deep learning on improving disease outcome prediction. Overall, this thesis addresses several gaps in precision bioinformatics in the single-cell field by highlighting research advances, developing methodologies, and illustrating practical uses through experimental datasets and case studies