93 research outputs found

    Deep transfer learning for drug response prediction

    The goal of precision oncology is to make accurate predictions for cancer patients using the omics data of each individual patient. Major challenges for computational methods of drug response prediction are that labeled clinical data are very limited, not publicly available, or cover drug response for only one or two drugs. These challenges have been addressed by generating large-scale pre-clinical datasets such as cancer cell lines or patient-derived xenografts (PDX). These pre-clinical datasets have multi-omics characterization of samples and are often screened with hundreds of drugs, which makes them viable resources for precision oncology. However, they raise new questions: How can we integrate different data types? How can we handle the data discrepancy between pre-clinical and clinical datasets that exists due to basic biological differences? And how can we make the best use of unlabeled samples in drug response prediction, where labeling is extra challenging? In this thesis, we propose methods based on deep neural networks to answer these questions. First, we propose a method of multi-omics integration. Second, we propose a transfer learning method to address data discrepancy between cell lines, patients, and PDX models in the input and output space. Finally, we propose a semi-supervised method of out-of-distribution generalization to predict drug response using labeled and unlabeled samples. The proposed methods have promising performance when compared to the state-of-the-art and may guide precision oncology more accurately.

    Deep Learning in Single-Cell Analysis

    Single-cell technologies are revolutionizing the entire field of biology. The large volumes of data generated by single-cell technologies are high-dimensional, sparse, heterogeneous, and have complicated dependency structures, making analyses using conventional machine learning approaches challenging and impractical. In tackling these challenges, deep learning often demonstrates superior performance compared to traditional machine learning methods. In this work, we give a comprehensive survey on deep learning in single-cell analysis. We first introduce background on single-cell technologies and their development, as well as fundamental concepts of deep learning including the most popular deep architectures. We present an overview of the single-cell analytic pipeline pursued in research applications while noting divergences due to data sources or specific applications. We then review seven popular tasks spanning different stages of the single-cell analysis pipeline, including multimodal integration, imputation, clustering, spatial domain identification, cell-type deconvolution, cell segmentation, and cell-type annotation. Under each task, we describe the most recent developments in classical and deep learning methods and discuss their advantages and disadvantages. Deep learning tools and benchmark datasets are also summarized for each task. Finally, we discuss the future directions and the most recent challenges. This survey will serve as a reference for biologists and computer scientists, encouraging collaborations. Comment: 77 pages, 11 figures, 15 tables, deep learning, single-cell analysis

    Definition and Independent Validation of a Proteomic-Classifier in Ovarian Cancer

    Simple Summary: The heterogeneity of epithelial ovarian cancer and its associated molecular biological characteristics are continuously integrated in the development of therapy guidelines. In a next step, future therapy recommendations might also be able to focus on the patient's systemic status, not only the tumor's molecular pattern. Therefore, new methods to identify and validate host-related biomarkers need to be established. Using mass spectrometry, we developed and independently validated a blood-based proteomic classifier, stratifying epithelial ovarian cancer patients into good and poor survival groups. We also determined an age dependence of the prognostic performance of this classifier and its association with important biological processes. This work highlights that, just like molecular markers of the tumor itself, the systemic condition of a patient (partly reflected in proteomic patterns) also influences survival and therapy response and could therefore be integrated into future processes of therapy planning.
    Abstract: Mass-spectrometry-based analyses have identified a variety of candidate protein biomarkers that might be crucial for epithelial ovarian cancer (EOC) development and therapy response. Comprehensive validation studies of the biological and clinical implications of proteomics are needed to advance them toward clinical use. Using the Deep MALDI method of mass spectrometry, we developed and independently validated (development cohort: n = 199, validation cohort: n = 135) a blood-based proteomic classifier, stratifying EOC patients into good and poor survival groups. We also determined an age dependency of the prognostic performance of this classifier, and our protein set enrichment analysis showed that the good and poor proteomic phenotypes were associated with, respectively, lower and higher levels of complement activation, inflammatory response, and acute phase reactants. This work highlights that, just like molecular markers of the tumor itself, the systemic condition of a patient (partly reflected in proteomic patterns) also influences survival and therapy response in a subset of ovarian cancer patients and could therefore be integrated into future processes of therapy planning.

    A primer on machine learning techniques for genomic applications

    High-throughput sequencing technologies have enabled the study of complex biological aspects at single-nucleotide resolution, opening the big data era. The analysis of large volumes of heterogeneous “omic” data, however, requires novel and efficient computational algorithms based on the paradigm of Artificial Intelligence. In the present review, we introduce and describe the most common machine learning methodologies and, more recently, deep learning methodologies applied to a variety of genomics tasks, emphasizing their capabilities, strengths, and limitations in simple and intuitive language. We highlight the power of the machine learning approach in handling big data by means of a real-life example, and underline how the described methods could be relevant in all cases in which large amounts of multimodal genomic data are available.

    Using machine learning to predict treatment outcome in depression – hype or hope?


    Resolving Biological Trajectories in Single-cell Data using Feature Selection and Multi-modal Integration

    Single-cell technologies can readily measure the expression of thousands of molecular features from individual cells undergoing dynamic biological processes, such as cellular differentiation, immune response, and disease progression. While computational trajectory inference methods and RNA velocity approaches have been developed to study how subtle changes in gene or protein expression impact cell fate decision-making, identifying characteristic features that drive continuous biological processes remains difficult due to the inherent biological or technical challenges associated with single-cell data. Here, we developed two data representation-based approaches for improving inference of cellular dynamics. First, we present DELVE, an unsupervised feature selection method for identifying a representative subset of dynamically-expressed molecular features that resolve cellular trajectories in noisy data. In contrast to previous work, DELVE uses a bottom-up approach to mitigate the effect of unwanted sources of variation confounding inference and models cell states from dynamic feature modules that constitute core regulatory complexes. Using simulations, single-cell RNA sequencing data, and iterative immunofluorescence imaging data in the context of cell cycle and cellular differentiation, we demonstrate that DELVE selects genes or proteins that more accurately characterize cell populations and improve the recovery of cell type transitions. Next, we present the first task-oriented benchmarking study that investigates integration of temporal gene expression modalities for dynamic cell state prediction. We benchmark ten multi-modal integration approaches on ten datasets spanning different biological contexts, sequencing technologies, and species. This study illustrates how temporal gene expression modalities can be optimally combined to improve inference of cellular trajectories and more accurately predict sample-associated perturbation and disease phenotypes. Lastly, we illustrate an application of these approaches and perform an integrative analysis of gene expression and RNA velocity data to study the crosstalk between signaling pathways that govern the mesendoderm fate decision during directed definitive endoderm differentiation. Results of this study suggest that lineage-specific, temporally expressed genes within the primitive streak may serve as a potential target for increasing definitive endoderm efficiency. Collectively, this work uses scalable data-driven approaches to effectively manage the inherent biological or technical challenges associated with single-cell data in order to improve inference of cellular dynamics. Doctor of Philosophy

    Machine Learning Models for Deciphering Regulatory Mechanisms and Morphological Variations in Cancer

    The exponential growth of multi-omics biological datasets is resulting in an emerging paradigm shift in fundamental biological research. In recent years, imaging and transcriptomics datasets have been increasingly incorporated into biological studies, pushing biology further into the domain of data-intensive sciences. New approaches and tools from statistics, computer science, and data engineering are profoundly influencing biological research. Harnessing this ever-growing deluge of multi-omics biological data requires the development of novel and creative computational approaches. In parallel, fundamental research in data sciences and Artificial Intelligence (AI) has advanced tremendously, allowing the scientific community to generate a massive amount of knowledge from data. Advances in Deep Learning (DL), in particular, are transforming many branches of engineering, science, and technology. Several of these methodologies have already been adapted for harnessing biological datasets; however, there is still a need to further adapt and tailor these techniques to new and emerging technologies. In this dissertation, we present computational algorithms and tools that we have developed to study gene regulation and cellular morphology in cancer. The models and platforms that we have developed are general and widely applicable to several problems relating to dysregulation of gene expression in diseases. Our pipelines and software packages are disseminated in public repositories for larger scientific community use. This dissertation is organized in three main projects. In the first project, we present Causal Inference Engine (CIE), an integrated platform for the identification and interpretation of active regulators of transcriptional response. The platform offers visualization tools and pathway enrichment analysis to map predicted regulators to Reactome pathways. We provide a parallelized R-package for fast and flexible directional enrichment analysis to run the inference on custom regulatory networks. Next, we designed and developed MODEX, a fully automated text-mining system to extract and annotate causal regulatory interactions between Transcription Factors (TFs) and genes from the biomedical literature. MODEX uses putative TF-gene interactions derived from high-throughput ChIP-Seq or other experiments and seeks to collect evidence and meta-data in the biomedical literature to validate and annotate the interactions. MODEX is a complementary platform to CIE that provides auxiliary information on CIE inferred interactions by mining the literature. In the second project, we present a Convolutional Neural Network (CNN) classifier to perform a pan-cancer analysis of tumor morphology, and predict mutations in key genes. The main challenges were to determine morphological features underlying a genetic status and assess whether these features were common in other cancer types. We trained an Inception-v3 based model to predict TP53 mutation in five cancer types with the highest rate of TP53 mutations. We also performed a cross-classification analysis to assess shared morphological features across multiple cancer types. Further, we applied a similar methodology to classify HER2 status in breast cancer and predict response to treatment in HER2 positive samples. For this study, our training slides were manually annotated by expert pathologists to highlight Regions of Interest (ROIs) associated with HER2+/- tumor microenvironment. Our results indicated that there are strong morphological features associated with each tumor type. Moreover, our predictions agree closely with manual annotations in the test set, indicating the feasibility of our approach in devising an image-based diagnostic tool for HER2 status and treatment response prediction. We have validated our model using samples from an independent cohort, which demonstrates the generalizability of our approach. Finally, in the third project, we present an approach to use spatial transcriptomics data to predict spatially-resolved active gene regulatory mechanisms in tissues. Using spatial transcriptomics, we identified tissue regions with differentially expressed genes and applied our CIE methodology to predict active TFs that can potentially regulate the marker genes in the region. This project bridged the gap between inference of active regulators using molecular data and morphological studies using images. The results demonstrate a significant local pattern in TF activity across the tissue, indicating differential spatial regulation in tissues. The results suggest that the integrative analysis of spatial transcriptomics data with CIE can capture discriminant features and identify localized TF-target links in the tissue.

    Pathway-Based Multi-Omics Data Integration for Breast Cancer Diagnosis and Prognosis.

    Ph.D. Thesis. University of Hawaiʻi at Mānoa 2017