14 research outputs found

    A Framework for Implementing Machine Learning on Omics Data

    The potential benefits of applying machine learning methods to -omics data are becoming increasingly apparent, especially in clinical settings. However, the unique characteristics of these data are not always well suited to machine learning techniques. These data are often generated across different technologies in different labs, and frequently with high dimensionality. In this paper we present a framework for combining -omics data sets, and for handling high dimensional data, making -omics research more accessible to machine learning applications. We demonstrate the success of this framework through integration and analysis of multi-analyte data for a set of 3,533 breast cancers. We then use this data-set to predict breast cancer patient survival for individuals at risk of an impending event, with higher accuracy and lower variance than methods trained on individual data-sets. We hope that our pipelines for data-set generation and transformation will open up -omics data to machine learning researchers. We have made these freely available for noncommercial use at www.ccg.ai.
    Comment: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018, arXiv:1811.0721

    Profiling lung adenocarcinoma by liquid biopsy: can one size fit all?

    BACKGROUND: Cancer is first and foremost a disease of the genome. Specific genetic signatures within a tumour are prognostic of disease outcome, reflect subclonal architecture and intratumour heterogeneity, inform treatment choices and predict the emergence of resistance to targeted therapies. Minimally invasive liquid biopsies can give temporal resolution to a tumour's genetic profile and allow the monitoring of treatment response through levels of circulating tumour DNA (ctDNA). However, the detection of ctDNA in repeated liquid biopsies is currently limited by economic and time constraints associated with targeted sequencing. METHODS: Here we bioinformatically profile the mutational and copy number spectrum of The Cancer Genome Network's lung adenocarcinoma dataset to uncover recurrently mutated genomic loci. RESULTS: We build a panel of 400 hotspot mutations and show that the coverage extends to more than 80% of the dataset at a median depth of 8 mutations per patient. Additionally, we uncover several novel single-nucleotide variants present in more than 5% of patients, often in genes not commonly associated with lung adenocarcinoma. CONCLUSION: With further optimisation, this hotspot panel could allow molecular diagnostics laboratories to build curated primer banks for 'off-the-shelf' monitoring of ctDNA by droplet-based digital PCR or similar techniques, in a time- and cost-effective manner.
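
    The panel statistics quoted in this abstract (share of patients covered, median number of panel mutations per patient, variants recurring above a frequency threshold) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the mutation labels and the toy cohort are hypothetical, and the sketch assumes that "median depth" is taken over covered patients only, which the abstract does not specify.

```python
from collections import Counter
from statistics import median


def recurrent_mutations(patient_mutations, min_fraction):
    """Return mutations seen in at least `min_fraction` of patients (sorted)."""
    n_patients = len(patient_mutations)
    counts = Counter()
    for muts in patient_mutations.values():
        counts.update(set(muts))  # count each mutation once per patient
    return sorted(m for m, c in counts.items() if c / n_patients >= min_fraction)


def panel_coverage(patient_mutations, panel):
    """Return (fraction of patients with >= 1 panel mutation,
    median number of panel mutations among those covered patients)."""
    panel = set(panel)
    hits = [len(set(muts) & panel) for muts in patient_mutations.values()]
    covered = [h for h in hits if h > 0]
    fraction = len(covered) / len(hits) if hits else 0.0
    depth = median(covered) if covered else 0
    return fraction, depth


# Hypothetical toy cohort: patient ID -> set of "GENE:change" labels.
patients = {
    "P1": {"KRAS:G12C", "TP53:R273H", "EGFR:L858R"},
    "P2": {"KRAS:G12C"},
    "P3": {"STK11:Q37*"},
    "P4": {"TP53:R273H", "EGFR:L858R"},
}

# Build a toy "panel" from mutations present in at least half of the cohort.
panel = recurrent_mutations(patients, min_fraction=0.5)
fraction, depth = panel_coverage(patients, panel)
print(f"panel size={len(panel)}, coverage={fraction:.0%}, median depth={depth}")
```

    On real data the threshold, the definition of a hotspot and the treatment of copy number events would all need to follow the paper's methods rather than this simplification.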

    Identifying Cancer Drivers Using DRIVE: A Feature-Based Machine Learning Model for a Pan-Cancer Assessment of Somatic Missense Mutations.

    Sporadic cancer develops from the accrual of somatic mutations. Out of all small-scale somatic aberrations in coding regions, 95% are base substitutions, with 90% being missense mutations. While multiple studies have focused on the importance of this mutation type, a machine learning method based on the number of protein-protein interactions (PPIs) has not been fully explored. This study aims to develop an improved computational method for driver identification, validation and evaluation (DRIVE), and compares it with other methods to assess its performance. DRIVE aims to distinguish between driver and passenger mutations using a feature-based learning approach comprising two levels of biological classification for a pan-cancer assessment of somatic mutations. Gene-level features include the maximum number of protein-protein interactions, the biological process and the type of post-translational modifications (PTMs), while mutation-level features are based on pathogenicity scores. Multiple supervised classification algorithms were trained on Genomics Evidence Neoplasia Information Exchange (GENIE) project data and then tested on an independent dataset from The Cancer Genome Atlas (TCGA) study. Finally, the most powerful classifier using DRIVE was evaluated on a benchmark dataset, where it showed better overall performance than other state-of-the-art methodologies; however, considerable care must be taken in interpreting this result because of the reduced size of the dataset. DRIVE highlights the role that multiple levels of a feature-based learning model could play in the future of oncology-based precision medicine.