2,962 research outputs found

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.

    A Novel Unsupervised Method to Identify Genes Important in the Anti-viral Response: Application to Interferon/Ribavirin in Hepatitis C Patients

    Background: Treating hepatitis C with interferon/ribavirin results in a varied response in terms of decrease in viral titer and ultimate outcome. Marked responders have a sharp decline in viral titer within a few days of treatment initiation, whereas in other patients there is no effect on the virus (poor responders). Previous studies have shown that combination therapy modifies expression of hundreds of genes in vitro and in vivo. However, identifying which, if any, of these genes have a role in viral clearance remains challenging. Aims: The goal of this paper is to link viral levels with gene expression and thereby identify genes that may be responsible for early decrease in viral titer. Methods: Microarrays were performed on RNA isolated from PBMC of patients undergoing interferon/ribavirin therapy. Samples were collected at pre-treatment (day 0), and 1, 2, 7, 14 and 28 days after initiating treatment. A novel method was applied to identify genes that are linked to a decrease in viral titer during interferon/ribavirin treatment. The method uses the relationship between inter-patient proximities based on gene expression and inter-patient proximities based on viral titer to define the association between the microarray expression measurements of each gene and the viral-titer measurements. Results: We detected 36 unique genes whose expression provides a clustering of patients that resembles the viral-titer-based clustering of patients. These genes include IRF7, MX1, OASL and OAS2, viperin and many ISGs of unknown function. Conclusion: The genes identified by this method appear to play a major role in the reduction of hepatitis C virus during the early phase of treatment. The method has broad utility and can be used to analyze response to any group of factors influencing biological outcome, such as antiviral drugs or anti-cancer agents, where microarray data are available. © 2007 Brodsky et al.
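
    The proximity-based idea in the Methods above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the proximity measure or the association statistic, so Euclidean distance and a Pearson correlation between the two sets of pairwise distances are assumptions made here, and the function names, gene labels and numbers are hypothetical toy values.

```python
import math
from itertools import combinations


def pairwise_distances(profiles):
    """Euclidean distance for every pair of patients, in a fixed pair order."""
    return [math.dist(a, b) for a, b in combinations(profiles, 2)]


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def proximity_association(expression, titer):
    """Agreement between expression-based and titer-based patient distances."""
    return pearson(pairwise_distances(expression), pairwise_distances(titer))


def rank_genes(expression_by_gene, titer):
    """Sort genes by how closely their patient distances track titer distances."""
    scores = {
        gene: proximity_association(expression, titer)
        for gene, expression in expression_by_gene.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (assumed): one row per patient, one column per time point.
titer = [
    [6.0, 5.0, 3.0],
    [6.1, 5.9, 5.8],
    [5.8, 4.0, 2.0],
    [6.2, 6.0, 6.1],
]
expression_by_gene = {
    # Expression that moves with titer, so patient distances match closely.
    "gene_a": [[2 * value + 1 for value in row] for row in titer],
    # Flat expression: every patient looks identical, so no association.
    "gene_b": [[1.0, 1.0, 1.0] for _ in titer],
}

ranking = rank_genes(expression_by_gene, titer)
for gene, score in ranking:
    print(gene, round(score, 3))
```

    Genes near the top of such a ranking are those whose expression groups patients in roughly the way viral titer does. A real analysis would also need a significance assessment, for example by permuting patient labels.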

    Approaches to Integrating Metabolomics and Multi-Omics Data: A Primer

    Metabolomics deals with multiple and complex chemical reactions within living organisms and how these are influenced by external or internal perturbations. It lies at the heart of omics profiling technologies not only as the underlying biochemical layer that reflects information expressed by the genome, the transcriptome and the proteome, but also as the closest layer to the phenome. The combination of metabolomics data with the information available from genomics, transcriptomics, and proteomics offers unprecedented possibilities to enhance current understanding of biological functions, elucidate their underlying mechanisms and uncover hidden associations between omics variables. As a result, a vast array of computational tools has been developed to assist with integrative analysis of metabolomics data with other omics data. Here, we review and propose five criteria (hypothesis, data types, strategies, study design and study focus) to classify statistical multi-omics data integration approaches into state-of-the-art classes under which all existing statistical methods fall. The purpose of this review is to examine the aspects that guide the choice of a statistical integrative analysis pipeline in terms of these classes. We draw particular attention to metabolomics and genomics data to assist those new to this field in choosing an integrative analysis pipeline.

    Integrative analysis identifies candidate tumor microenvironment and intracellular signaling pathways that define tumor heterogeneity in NF1

    Neurofibromatosis type 1 (NF1) is a monogenic syndrome that gives rise to numerous symptoms including cognitive impairment, skeletal abnormalities, and growth of benign nerve sheath tumors. Nearly all NF1 patients develop cutaneous neurofibromas (cNFs), which occur on the skin surface, whereas 40-60% of patients develop plexiform neurofibromas (pNFs), which are deeply embedded in the peripheral nerves. Patients with pNFs have a ~10% lifetime chance of these tumors becoming malignant peripheral nerve sheath tumors (MPNSTs). MPNSTs have a severe prognosis and few treatment options other than surgery. Given the lack of therapeutic options available to patients with these tumors, identification of druggable pathways or other key molecular features could aid ongoing therapeutic discovery studies. In this work, we used statistical and machine learning methods to analyze 77 NF1 tumors with genomic data to characterize key signaling pathways that distinguish these tumors and identify candidates for drug development. We identified subsets of latent gene expression variables that may be important in the identification and etiology of cNFs, pNFs, other neurofibromas, and MPNSTs. Furthermore, we characterized the association between these latent variables and genetic variants, immune deconvolution predictions, and protein activity predictions.

    Pathway activity analysis of bulk and single-cell RNA-Seq data

    Gene expression profiling can produce effective biomarkers that provide information beyond other approaches for characterizing disease. While these approaches are typically applied to standard bulk RNA sequencing data, new methods for RNA sequencing of individual cells have allowed them to be applied at the resolution of a single cell. As these methods enter the mainstream, there is an increased need for user-friendly software that allows researchers without experience in bioinformatics to apply these techniques. In this thesis, I have developed new, user-friendly data resources and software tools to allow researchers to use gene expression signatures in their own datasets. Specifically, I created the Single Cell Toolkit, a user-friendly and interactive toolkit for analyzing single-cell RNA sequencing data, and used this toolkit to analyze the pathway activity levels in breast cancer cells before and after cancer therapy. Next, I created and validated a set of activated oncogenic growth factor receptor signatures in breast cancer, which revealed additional heterogeneity within public breast cancer cell line and patient sample RNA sequencing datasets. Finally, I created an R package for rapidly profiling tuberculosis (TB) samples using a set of 30 existing tuberculosis gene signatures. I applied this tool to examine pathway differences in a dataset of tuberculosis treatment failure samples. Taken together, the results of these studies serve as a set of user-friendly software tools and data sets that allow researchers to rapidly and consistently apply pathway activity methods across RNA sequencing samples.

    Machine Learning Methods To Identify Hidden Phenotypes In The Electronic Health Record

    The widespread adoption of Electronic Health Records (EHRs) means an unprecedented amount of patient treatment and outcome data is available to researchers. Research is a tertiary priority in the EHR, where the primary priorities are patient care and billing. Because of this, the data are not standardized or formatted in a manner easily adapted to machine learning approaches. Data may be missing for a wide variety of reasons, ranging from individual input styles to differences in clinical decision making (for example, which lab tests to order). Few patients are annotated at research quality, limiting sample size and presenting a moving gold standard. Patient progression over time is key to understanding many diseases, but many machine learning algorithms require a snapshot at a single time point to create a usable vector form. In this dissertation, we develop new machine learning methods and computational workflows to extract hidden phenotypes from the EHR. In Part 1, we use a semi-supervised deep learning approach to compensate for the low number of research-quality labels present in the EHR. In Part 2, we examine and provide recommendations for characterizing and managing the large amount of missing data inherent to EHR data. In Part 3, we present an adversarial approach to generate synthetic data that closely resembles the original data while protecting subject privacy. We also introduce a workflow to enable reproducible research even when data cannot be shared. In Part 4, we introduce a novel strategy to first extract sequential data from the EHR and then demonstrate the ability to model these sequences with deep learning.