Cell Type Classification Via Deep Learning On Single-Cell Gene Expression Data
Single-cell sequencing is a recently developed, revolutionary technology that enables researchers to obtain genomic, transcriptomic, or multi-omics information through gene expression analysis. Compared to traditional bulk sequencing methods, it offers the advantage of analyzing highly heterogeneous cell populations, which has made it increasingly popular in the biomedical field. Such analysis can also support early diagnosis and drug development for tumors and cancer cell types. Within the gene expression profiling workflow, identification of cell types is an important task, but it faces many challenges, including the curse of dimensionality, sparsity, batch effects, and overfitting. These challenges can be mitigated by feature selection, which reduces the feature dimension by retaining only the most relevant features. In this research, a recurrent neural network-based feature selection model is proposed to extract relevant features from high-dimensional, low-sample-size data. In addition, a deep learning-based gene embedding model is proposed to reduce the sparsity of single-cell data for cell type identification. The proposed frameworks were implemented with different recurrent neural network architectures and evaluated on real-world microarray datasets and single-cell RNA-seq data, where the proposed models outperformed other feature selection models. Because labeling data is cumbersome, time-consuming, and requires manual effort and domain expertise, a semi-supervised model was also implemented using the same gene embedding workflow, and different ratios of labeled data were used in the experiments to validate the concept. Experimental results show that the proposed semi-supervised approach achieves very encouraging performance even when only a limited amount of labeled data is available. In addition, a graph attention-based autoencoder model was studied to learn latent features by incorporating prior knowledge with gene expression data for cell type classification.
Index Terms — Single-Cell Gene Expression Data, Gene Embedding, Semi-Supervised Model, Prior Knowledge Incorporation, Gene-Gene Interaction Network, Deep Learning, Graph Autoencoder
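To make the gene-embedding idea concrete, below is a minimal sketch of a deep autoencoder that maps sparse expression profiles to dense, low-dimensional cell embeddings. It illustrates the general concept only: the layer sizes, embedding dimension, and plain MSE reconstruction loss are assumptions, not the dissertation's actual architecture.

```python
import torch
import torch.nn as nn

class GeneEmbeddingAE(nn.Module):
    """Autoencoder mapping sparse expression profiles to dense embeddings."""
    def __init__(self, n_genes, embed_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_genes, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Linear(256, n_genes),
        )

    def forward(self, x):
        z = self.encoder(x)          # dense, low-dimensional cell embedding
        return self.decoder(z), z    # reconstruction drives training

# x: cells-by-genes matrix (e.g. log-normalized counts); sizes are illustrative
model = GeneEmbeddingAE(n_genes=2000)
x = torch.rand(128, 2000)
recon, z = model(x)
loss = nn.functional.mse_loss(recon, x)  # minimize reconstruction error
```

Once trained, the embeddings z (rather than the raw sparse counts) would feed the downstream cell type classifier.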
Single Cell Analysis of Chromatin Accessibility
The identity of each cell in the human body is established and maintained through a distinct gene expression program, which is regulated in part by chromatin accessibility. Until recently, our understanding of chromatin accessibility depended largely on bulk measurements in populations of cells. Recent advances in sequencing techniques have allowed the identification of open chromatin regions in single cells. During my Ph.D., I developed and used single-cell sequencing techniques to study the diverse gene regulatory programs underlying the different cell types of complex mammalian tissues. In chapter 1, colleagues and I developed the single-nucleus Assay for Transposase-Accessible Chromatin using sequencing (snATAC-seq), a combinatorial-barcoding-assisted single-cell assay for probing accessible chromatin in single cells. We then used snATAC-seq to generate an epigenomic atlas of the early developing mouse brain. The high level of noise in each single-cell chromatin accessibility profile and the large volume of the datasets pose unique computational challenges. In chapter 2, I developed a comprehensive bioinformatics software package called SnapATAC for analyzing large-scale single-cell ATAC-seq datasets. SnapATAC resolves the heterogeneity in complex tissues and maps the trajectories of cellular states. As a demonstration of its utility, SnapATAC was applied to 55,592 single-nucleus ATAC-seq profiles from the mouse secondary motor cortex. To further determine the target genes of the distal regulatory elements identified by snATAC-seq in different cell types, in chapter 3 colleagues and I developed PLAC-seq, a cost-efficient method that identifies long-range chromatin interactions at kilobase resolution. PLAC-seq improves the efficiency of detecting chromatin conformation by over 10-fold and reduces the input requirement by nearly 100-fold compared to prior techniques. Finally, to probe the in vivo function of regulatory sequences, I present a high-throughput CRISPR screening method (CREST-seq) for the unbiased discovery and functional assessment of enhancer sequences in the human genome. We used it to interrogate the 2-Mb POU5F1 locus in human embryonic stem cells and discovered that sequences previously annotated as promoters of functionally unrelated genes can regulate the expression of POU5F1 from a long distance. We anticipate that these studies will help us understand gene regulatory programs across diverse biological systems, from human disease to the evolution of species.
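As an illustration of the kind of computation such pipelines perform, the sketch below builds a pairwise Jaccard similarity matrix from a binary cell-by-bin accessibility matrix, a common first step before dimensionality reduction of sparse single-cell chromatin profiles. The binning scheme, matrix sizes, and downstream use are illustrative assumptions, not the SnapATAC package's exact implementation.

```python
import numpy as np

def jaccard_similarity(B):
    """Pairwise Jaccard index between binary cell-by-bin accessibility profiles.

    B: (n_cells, n_bins) 0/1 matrix, e.g. from binning snATAC-seq fragments
    into fixed-size genomic windows (binning scheme assumed here).
    """
    B = B.astype(float)
    inter = B @ B.T                      # shared accessible bins per cell pair
    row_sums = B.sum(axis=1)
    union = row_sums[:, None] + row_sums[None, :] - inter
    return inter / np.maximum(union, 1)  # guard against division by zero

B = (np.random.rand(100, 5000) < 0.05).astype(int)  # toy sparse profiles
J = jaccard_similarity(B)  # input to downstream dimensionality reduction
```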
Statistical methods for the integrative analysis of single-cell multi-omics data
Single-cell profiling techniques have provided an unprecedented opportunity to study cellular heterogeneity at the molecular level. This represents a remarkable advance over traditional bulk sequencing methods, particularly for studying lineage diversification and cell fate commitment events in heterogeneous biological processes. While the large majority of single-cell studies focus on quantifying RNA expression, transcriptomic readouts provide only a single dimension of cellular heterogeneity. Recently, technological advances have enabled multiple biological layers to be probed in parallel, one cell at a time, unveiling a powerful approach for investigating multiple dimensions of cellular heterogeneity. However, the increasing availability of multi-modal data sets needs to be accompanied by the development of suitable integrative strategies to fully exploit the data generated. In this thesis, I worked in collaboration with different research groups to introduce innovative experimental and computational strategies for the integrative study of multi-omics data at single-cell resolution.
The first contribution is the development of scNMT-seq, a protocol for the simultaneous profiling of RNA expression, DNA methylation and chromatin accessibility in single cells. I demonstrate how this assay provides a powerful approach for investigating regulatory relationships between the epigenome and the transcriptome within individual cells.
The second contribution is Multi-Omics Factor Analysis (MOFA), a statistical framework for the unsupervised integration of multi-omics data sets. MOFA is a Bayesian latent variable model that can be viewed as a statistically rigorous generalization of Principal Component Analysis to multi-omics data. The method provides a principled approach to retrieve, in an unsupervised manner, the underlying sources of sample heterogeneity while at the same time disentangling which axes of heterogeneity are shared across multiple modalities and which are specific to individual data modalities.
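In generative terms, the factor-analysis structure underlying MOFA can be written as Y_m ≈ Z W_m^T + ε_m for each omic view m, with a shared factor matrix Z and view-specific loadings W_m. The toy forward simulation below illustrates that structure; the view names, dimensions, and Gaussian noise are illustrative, and the sparsity priors and Bayesian inference that make MOFA statistically rigorous are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_factors = 200, 10
views = {"rna": 5000, "methylation": 3000, "accessibility": 3000}

Z = rng.normal(size=(n_samples, n_factors))        # shared latent factors
W = {m: rng.normal(size=(d, n_factors)) for m, d in views.items()}

# Each omic view is modelled as its own loading of the same factors:
#   Y_m ~ Z @ W_m.T + noise
Y = {m: Z @ W[m].T + rng.normal(scale=0.1, size=(n_samples, d))
     for m, d in views.items()}
# In MOFA, sparsity priors on W decide which factors are active in which
# view, disentangling shared from view-specific axes of heterogeneity.
```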
The third contribution is the generation of a comprehensive molecular roadmap of mouse gastrulation at single-cell resolution. We employed scNMT-seq to simultaneously profile RNA expression, DNA methylation and chromatin accessibility for hundreds of cells, spanning multiple time points from the exit from pluripotency to primary germ layer specification. Using MOFA and other tools, I performed an integrative analysis of the multi-modal measurements, revealing novel insights into the role of the epigenome in regulating this key developmental process.
The fourth contribution is an extended formulation of the MOFA model tailored to the analysis of large-scale single-cell data with complex experimental designs. I extended the model to incorporate a flexible regularisation that enables the joint analysis of multiple omics as well as multiple sample groups (batches and/or experimental conditions). In addition, I implemented a GPU-accelerated stochastic variational inference framework, thus enabling the scalable analysis of potentially millions of samples.
Statistical methods for multi-omic data integration
This thesis focuses on the development of new ways to integrate multiple ’omic datasets in the context of precision medicine. These types of analyses have the potential to help researchers deepen their understanding of the biological mechanisms underlying disease. However, integrative studies pose several challenges, owing to the typically widely differing characteristics of the ’omic layers in terms of the number of predictors, the type of data, and the level of noise.
In this work, we first tackle the problem of performing variable selection and building supervised models while integrating multiple ’omic datasets of different types. It has recently been shown that applying classical logistic regression with an elastic-net penalty to these datasets can lead to poor results. We therefore suggest a two-step approach to multi-omic logistic regression, in which variable selection is performed on each layer separately and a predictive model is subsequently built on the ensemble of the selected variables.
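A minimal sketch of such a two-step scheme is given below, using scikit-learn and an L1-penalised logistic regression as a stand-in for the per-layer selection procedure; the layer dimensions, the top-50 cutoff, and the choice of selector are illustrative assumptions, not the thesis's exact method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: two 'omic layers with very different dimensionality.
rng = np.random.default_rng(1)
n = 100
layers = [rng.normal(size=(n, 5000)), rng.normal(size=(n, 300))]
y = rng.integers(0, 2, size=n)

# Step 1: select variables within each layer separately.
selected = []
for X in layers:
    sel = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
    keep = np.argsort(np.abs(sel.coef_).ravel())[-50:]  # 50 strongest per layer
    selected.append(X[:, keep])

# Step 2: fit the predictive model on the ensemble of selected variables.
X_joint = np.hstack(selected)
final = LogisticRegression(max_iter=1000).fit(X_joint, y)
```

Selecting per layer prevents a high-dimensional layer from drowning out a smaller one, which is the failure mode of naive joint penalised regression.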
In the unsupervised setting, we first examine cluster-of-clusters analysis (COCA), an integrative clustering approach that combines information from multiple data sources. COCA has been widely applied in the context of tumour subtyping, but its properties had never been systematically explored, and its robustness to the inclusion of noisy datasets was unclear. We then propose a new statistical method for the unsupervised integration of multi-omic data, called kernel learning integrative clustering (KLIC). This approach is based on the idea of framing the challenge of combining clustering structures as a multiple kernel learning problem, in which each dataset provides a weighted contribution to the final clustering.
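The multiple-kernel idea can be sketched as follows: each dataset contributes a kernel matrix, the kernels are combined with weights (fixed here for illustration; learned in KLIC), and the combined kernel is clustered. The RBF kernels, the fixed weights, and the use of spectral clustering as the kernel clustering step are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import rbf_kernel

def combine_kernels(kernels, weights):
    """Weighted sum of per-dataset kernel matrices."""
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()               # normalise contributions
    return sum(w * K for w, K in zip(weights, kernels))

rng = np.random.default_rng(2)
X1 = rng.normal(size=(60, 100))            # e.g. expression layer
X2 = rng.normal(size=(60, 20))             # e.g. methylation layer

K = combine_kernels([rbf_kernel(X1), rbf_kernel(X2)], weights=[0.7, 0.3])
labels = SpectralClustering(n_clusters=3, affinity="precomputed").fit_predict(K)
```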
Finally, we build upon the notion of the posterior similarity matrix (PSM) to suggest new approaches for summarising the output of MCMC algorithms for Bayesian mixture models. A key contribution of our work is the observation that PSMs can be used to define probabilistically motivated kernel matrices that capture the clustering structure present in the data. This observation enables us to employ a range of kernel methods to obtain summary clusterings and, when multiple PSMs are available, to use standard methods for combining kernels in order to perform integrative clustering. We also show that PSMs can be embedded within predictive kernel models in order to perform outcome-guided clustering.
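A PSM is straightforward to compute from MCMC output, as the short sketch below shows; the toy allocations stand in for draws from a real Bayesian mixture sampler. Because each per-sample co-clustering indicator matrix is positive semi-definite, so is their average, which is what licenses treating the PSM as a kernel.

```python
import numpy as np

def posterior_similarity_matrix(label_samples):
    """PSM[i, j] = fraction of MCMC draws in which items i and j co-cluster.

    label_samples: (n_draws, n_items) array of cluster allocations sampled
    from the posterior of a Bayesian mixture model.
    """
    S, n = label_samples.shape
    psm = np.zeros((n, n))
    for z in label_samples:
        psm += (z[:, None] == z[None, :])  # co-clustering indicator matrix
    return psm / S

# Toy allocations from a hypothetical sampler:
draws = np.random.default_rng(3).integers(0, 3, size=(500, 40))
psm = posterior_similarity_matrix(draws)
# psm can now serve directly as a kernel matrix for summary clustering,
# kernel combination across datasets, or predictive kernel models.
```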
On Using Inductive Biases for Designing Deep Learning Architectures
Recent advancements in the field of Artificial Intelligence, especially in Deep Learning (DL), have paved the way for new and improved solutions to complex problems in almost all domains. We often have prior knowledge and beliefs about the underlying system of the problem at hand that we want to capture in the corresponding deep learning architectures. Sometimes it is not clear how to include these prior beliefs in the traditionally recommended deep architectures such as recurrent neural networks, convolutional neural networks, and variational autoencoders. Moreover, post-hoc techniques for modifying these architectures are often not straightforward and provide little performance gain.
There have been efforts to develop domain-specific architectures, but those techniques are generally not transferable to other domains. We therefore ask: can we come up with generic and intuitive techniques for designing deep learning architectures that take our prior knowledge of the system as an inductive bias?
In this dissertation, we develop two novel approaches towards this end. The first, called `Cooperative Neural Networks', can incorporate the inductive bias from the underlying probabilistic graphical model representation of the domain. The second, problem-dependent `Unrolled Algorithms', parameterizes the recurrent structure obtained by unrolling the iterations of an optimization algorithm for the objective function defining the problem. We found that the neural network architectures obtained from our approaches typically end up with far fewer learnable parameters and provide considerable improvements in run-time compared to other deep learning methods. We have successfully applied our techniques to natural language processing tasks, sparse graph recovery, and computational biology problems such as gene regulatory network inference.
Firstly, we introduce the Cooperative Neural Networks approach, a new theoretical approach for implementing learning systems that can exploit both prior insights about the independence structure of the problem domain and the universal approximation capability of deep neural networks. Specifically, we develop the CoNN-sLDA model for the document classification task, using the popular Latent Dirichlet Allocation graphical model as the inductive bias. We demonstrate a 23% reduction in error on the challenging MultiSent data set compared to the state of the art, and we also derive ways to make the learned representations more interpretable.
Secondly, we elucidate the idea of using problem-dependent `Unrolled Algorithms' for the sparse graph recovery task. We propose a deep learning architecture, GLAD, which uses an alternating minimization algorithm as its inductive bias and learns the model parameters via supervised learning. We show that GLAD learns a very compact and effective model for recovering sparse graphs from data. We also provide an extensive theoretical analysis that strengthens the case for using similar approaches for other problems.
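To illustrate the unrolling idea in generic terms (this sketch is learned ISTA for sparse linear regression, not GLAD's alternating minimization for precision matrices), the code below turns the step size and soft-threshold of each of T fixed iterations into learnable parameters that supervised training can tune.

```python
import torch
import torch.nn as nn

class UnrolledISTA(nn.Module):
    """Unrolls T iterations of ISTA for sparse recovery; the per-iteration
    step sizes and soft-thresholds become learnable parameters.
    (A generic illustration of unrolling, not GLAD itself.)"""
    def __init__(self, A, T=10):
        super().__init__()
        self.A = A                                    # (m, n) design matrix
        self.step = nn.Parameter(torch.full((T,), 0.1))
        self.thresh = nn.Parameter(torch.full((T,), 0.05))
        self.T = T

    def forward(self, y):
        x = torch.zeros(self.A.shape[1])
        for t in range(self.T):
            grad = self.A.T @ (self.A @ x - y)        # grad of 0.5*||Ax - y||^2
            x = x - self.step[t] * grad               # gradient step
            x = torch.sign(x) * torch.relu(x.abs() - self.thresh[t])  # shrink
        return x

A = torch.randn(30, 100)
x_true = torch.randn(100) * (torch.rand(100) < 0.1)   # sparse ground truth
x_hat = UnrolledISTA(A)(A @ x_true)
# Training would minimise ||x_hat - x_true|| over pairs from a simulator,
# which is how the supervised learning of model parameters proceeds.
```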
Finally, we build further upon the proposed `Unrolled Algorithm' technique for a challenging real-world computational biology problem. To this end, we design GRNUlar, a novel deep learning framework for supervised learning of gene regulatory networks (GRNs) from single-cell RNA-sequencing data. Our framework incorporates two intertwined models. We first leverage the expressive ability of neural networks to capture complex dependencies between transcription factors and the genes they regulate by developing a multi-task learning framework. Then, to capture the sparsity of GRNs observed in the real world, we design an unrolled algorithm technique for our framework. Our deep architecture requires supervision for training, for which we repurpose existing synthetic data simulators that generate scRNA-seq data guided by an underlying GRN. Experimental results demonstrate that GRNUlar outperforms state-of-the-art methods on both synthetic and real datasets. Our work also demonstrates the novel and successful use of expression data simulators for supervised learning of GRN inference.
New approaches for unsupervised transcriptomic data analysis based on Dictionary learning
The era of high-throughput data generation enables new access to biomolecular profiles and their exploitation. However, the analysis of such biomolecular data, for example transcriptomic data, suffers from the so-called "curse of dimensionality", which arises in datasets with a significantly larger number of variables than data points. As a consequence, overfitting and the unintentional learning of process-independent patterns can occur, leading to results that are not meaningful in application. A common way of counteracting this problem is to apply dimension reduction methods and subsequently analyse the resulting low-dimensional representation, which has a smaller number of variables.
In this thesis, two new methods for the analysis of transcriptomic datasets are introduced and evaluated. Our methods are based on the concept of Dictionary learning, an unsupervised dimension reduction approach. Unlike many dimension reduction approaches widely applied in transcriptomic data analysis, Dictionary learning does not impose constraints on the components to be derived, which allows great flexibility when adjusting the representation to the data. Further, Dictionary learning belongs to the class of sparse methods. The result of a sparse method is a model with few non-zero coefficients, which is often preferred for its simplicity and ease of interpretation. Sparse methods exploit the fact that the analysed datasets are highly structured. Indeed, transcriptomic data are particularly structured, owing, for example, to the connections between genes and pathways. Nonetheless, the application of Dictionary learning in medical data analysis has so far been largely restricted to image analysis. A further advantage of Dictionary learning is its interpretability, which is a necessity in biomolecular data analysis to gain a holistic understanding of the investigated processes.
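For readers unfamiliar with the technique, a minimal Dictionary-learning example on a toy samples-by-genes matrix is shown below, using scikit-learn's implementation; the matrix sizes, number of atoms, and sparsity penalty are illustrative choices, not the settings used in the thesis.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Toy stand-in for a transcriptomic matrix: samples x genes.
rng = np.random.default_rng(4)
X = rng.normal(size=(50, 500))

dl = DictionaryLearning(
    n_components=10,            # number of dictionary atoms ("metagenes")
    alpha=1.0,                  # sparsity penalty on the codes
    transform_algorithm="lasso_lars",
    random_state=0,
)
codes = dl.fit_transform(X)     # sparse low-dimensional representation, 50 x 10
atoms = dl.components_          # dictionary atoms, 10 x 500: interpretable
                                # gene patterns, unconstrained in sign or shape
```

The sparse codes then serve as the low-dimensional representation on which subgroup identification or pseudotime estimation operates.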
Our two new transcriptomic data analysis methods are each designed for one main task: (1) identification of subgroups of samples from mixed populations, and (2) temporal ordering of samples from dynamic datasets, also referred to as "pseudotime estimation". Both methods are evaluated on simulated and real-world data and compared to other methods widely applied in transcriptomic data analysis. Our methods achieve high performance and overall outperform the comparison methods.
Machine Learning Applications for Drug Repurposing
The cost of bringing a drug to market is astounding and the failure rate is intimidating. Drug discovery has had limited success under the conventional reductionist one-drug-one-gene-one-disease paradigm, in which a single disease-associated gene is identified and a molecular binder to the specific target is subsequently designed. Under this simplistic paradigm, a drug molecule is assumed to interact only with its intended target. However, small-molecule drugs often interact with multiple targets, and those off-target interactions are not considered under the conventional paradigm. As a result, drug-induced side effects and adverse reactions are often neglected until a very late stage of drug discovery, where the discovery of drug-induced side effects and potential drug resistance can decrease the drug's value or even completely invalidate its use. Thus, a new paradigm in drug discovery is needed.
Structural systems pharmacology is a new paradigm in drug discovery in which drug activities are studied through data-driven, large-scale models that account for the structures of both targets and drugs. Structural systems pharmacology aims to model, on a genome scale, the energetic and dynamic modifications of protein targets by drug molecules, as well as the subsequent collective effects of drug-target interactions on phenotypic drug responses. To date, however, few experimental or computational methods can determine genome-wide protein-ligand interaction networks and the clinical outcomes mediated by them. As a result, the majority of proteins have not been charted for their small-molecule ligands, and we have a limited understanding of drug actions. To address this challenge, this dissertation seeks to develop and experimentally validate innovative computational methods to infer genome-wide protein-ligand interactions and multi-scale drug-phenotype associations, including drug-induced side effects. The hypothesis is that the integration of data-driven bioinformatics tools with structure-and-mechanism-based molecular modeling methods will lead to an optimal tool for accurately predicting drug actions and drug-associated phenotypic responses, such as side effects.
This dissertation starts by reviewing the current status of computational drug discovery for complex diseases in Chapter 1. In Chapter 2, we present REMAP, a one-class collaborative filtering method to predict off-target interactions from a protein-ligand interaction network. In later work, REMAP was integrated with structural genomics and statistical machine learning methods to design a dual-indication polypharmacological anticancer therapy. In Chapter 3, we extend REMAP, the core method of Chapter 2, into a multi-ranked collaborative filtering algorithm, WINTF, and present the relevant mathematical justifications. Chapter 4 is an application of WINTF to repurpose the FDA-approved drug diazoxide as a potential treatment for triple-negative breast cancer, a deadly subtype of breast cancer. In Chapter 5, we present a multilayer extension of REMAP, applied to predict drug-induced side effects and the associated biological pathways. In Chapter 6, we close this dissertation by presenting a deep learning application that learns biochemical features from protein sequence representations using a natural language processing method.
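As a rough illustration of the one-class collaborative filtering setting REMAP addresses, the sketch below factorises a binary protein-ligand matrix, treating observed pairs as positives and unobserved pairs as weakly weighted negatives. The gradient-descent solver, the weighting scheme, and all hyperparameters are generic assumptions rather than REMAP's actual formulation.

```python
import numpy as np

def one_class_mf(R, rank=10, w_neg=0.1, lr=0.01, epochs=200, lam=0.1, seed=0):
    """One-class matrix factorisation sketch: observed protein-ligand pairs
    count as positives; unobserved pairs enter as low-confidence negatives."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    U = 0.1 * rng.normal(size=(n, rank))   # protein factors
    V = 0.1 * rng.normal(size=(m, rank))   # ligand factors
    W = np.where(R > 0, 1.0, w_neg)        # confidence weights
    for _ in range(epochs):
        E = W * (U @ V.T - R)              # weighted residual
        U -= lr * (E @ V + lam * U)        # gradient steps with L2 penalty
        V -= lr * (E.T @ U + lam * V)
    return U @ V.T                         # scores rank unseen pairs

R = (np.random.default_rng(5).random((80, 120)) < 0.02).astype(float)
scores = one_class_mf(R)   # high scores on zero entries nominate off-targets
```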
Systems Analytics and Integration of Big Omics Data
A “genotype” is essentially an organism's full hereditary information, which is obtained from its parents. A “phenotype” is an organism's actual observed physical and behavioral properties. These may include traits such as morphology, size, height, eye color, and metabolism. One of the pressing challenges in computational and systems biology is genotype-to-phenotype prediction. This is challenging given the amount of data generated by modern Omics technologies. This “Big Data” is so large and complex that traditional data processing applications are not up to the task. Challenges arise in the collection, analysis, mining, sharing, transfer, visualization, archiving, and integration of these data. This Special Issue focuses on the systems-level analysis of Omics data, recent developments in gene ontology annotation, and advances in biological pathways and network biology. The integration of Omics data with clinical and biomedical data using machine learning is explored. This Special Issue covers new methodologies in the context of gene–environment interactions, tissue-specific gene expression, and how external factors or host genetics impact the microbiome.