The potential benefits of applying machine learning methods to -omics data
are becoming increasingly apparent, especially in clinical settings. However,
the unique characteristics of these data are not always well suited to machine
learning techniques. These data are often generated across different
technologies in different labs, and frequently with high dimensionality. In
this paper we present a framework for combining -omics data sets, and for
handling high dimensional data, making -omics research more accessible to
machine learning applications. We demonstrate the success of this framework
through integration and analysis of multi-analyte data for a set of 3,533
breast cancers. We then use this data-set to predict breast cancer patient
survival for individuals at risk of an impending event, with higher accuracy
and lower variance than methods trained on individual data-sets. We hope that
our pipelines for data-set generation and transformation will open up -omics
data to machine learning researchers. We have made these freely available for
noncommercial use at www.ccg.ai.
Comment: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018
  arXiv:1811.0721

Cannings, Timothy

Cassidy, John W

Clifford, Harry W

Cotter, Fergal

Dubourg-Felonneau, Geoffroy

Patel, Nirmesh

Thompson, Hannah

English

arXiv

Edinburgh Research Explorer

Citation for published version: Cannings, T, Dubourg-Felonneau, G, Cotter, F, Thompson, H, Patel, N, Cassidy, JW & Clifford, HW 2018, 'A Framework for Implementing Machine Learning on Omics Data', ML4H: Machine Learning for Health, Montreal, Canada, 8/12/18 - 8/12/18. <https://arxiv.org/abs/1811.10455>

Document Version: Peer reviewed version

A Framework for Implementing Machine Learning on Omics Data

Geoffroy Dubourg-Felonneau1, Timothy Cannings1,2, Fergal Cotter1,3, Hannah Thompson1, Nirmesh Patel1, John W Cassidy1,3, Harry W Clifford1
1 Cambridge Cancer Genomics, 2 University of Edinburgh, 3 University of Cambridge

Abstract

The potential benefits of applying machine learning methods to -omics data are becoming increasingly apparent, especially in clinical settings. However, the unique characteristics of these data are not always well suited to machine learning techniques. These data are often generated across different technologies in different labs, and frequently with high dimensionality.
In this paper we present a framework for combining -omics data sets, and for handling high dimensional data, making -omics research more accessible to machine learning applications. We demonstrate the success of this framework through integration and analysis of multi-analyte data for a set of 3,533 breast cancers. We then use this data-set to predict breast cancer patient survival for individuals at risk of an impending event, with higher accuracy and lower variance than methods trained on individual data-sets. We hope that our pipelines for data-set generation and transformation will open up -omics data to machine learning researchers. We have made these freely available for noncommercial use at www.ccg.ai.

1 Introduction

Cancer research has been revolutionized by the advent of high-throughput sequencing and the ability to generate data at an "-omics" level (genomics, transcriptomics, epigenomics, proteomics, etc.). Applying machine learning techniques to -omics data is complex; however, when applied successfully, these techniques have been useful in obtaining meaningful biological insights. For example, although cancer is a highly heterogeneous disease with a diverse range of subtypes and clonal compositions [1], unsupervised learning has been successfully utilized to classify and interpret the unique genomic signatures found from one tumor to the next [2]. This in turn has enabled stratification of patients into subgroups with distinct clinical outcomes. Another example is the prediction of drug-target interactions, for which machine learning is being used to narrow the search space for candidate drugs by application of predictive methods [3].

These examples clearly demonstrate how useful machine learning can be in -omics, but they have relied on the generation of -omics data specific to these purposes, rather than utilizing the vast amount of data that has already been generated.
More commonly, this data is disparate, split across labs into small data sets, and generated with different technologies (e.g. RNA-Seq, microarrays). Additionally, the richness of -omics data enables extraction of a large number of features, which often outstrips the number of available patients and results in high dimensional data. Only large research labs capable of producing population-scale consistent -omics data can overcome these problems.

In this paper, we provide a pipeline for combining -omics data sets and methods for handling high dimensional data, making -omics research more accessible to machine learning applications (section 2). To demonstrate the success of this, we use the combined data to predict breast cancer patient survival for individuals at risk of an impending event, with higher accuracy and lower variance than methods trained on individual data-sets (section 3). In the hope that this work will enable greater opportunity for implementing machine learning algorithms on -omics data produced in clinical and research settings, we have made these pipelines freely available for noncommercial use at www.ccg.ai.

Machine Learning for Health (ML4H) Workshop at NeurIPS 2018. arXiv:1811.10455v1 [cs.LG] 26 Nov 2018

2 Data

2.1 Combining sources of RNA (gene expression) and CNA (copy number aberration)

RNA is a proxy for gene expression. RNA data is usually found in the form of an $N \times M$ matrix, where $N$ is the number of patients and $M$ is the number of observed genes. Each value is a positive real number representing the level of expression of a given gene in a given patient, but these values differ considerably across technologies, each with its own biases and signal-noise distributions. Microarrays give an intensity value from RNA binding to probes, which roughly follows a gamma distribution (see Figure 1.1); RNA-Seq gives a count value from sequenced fragments of RNA, which follows a negative binomial distribution.
We successfully combined 3 different datasets: METABRIC microarray, TCGA microarray, and TCGA RNA-seq. The union of patients gives $N = 3533$, and the intersection of genes gives $M = 15233$, whilst retaining key characteristics, such as distinct disease-free survival in the Integrative Cluster classification [2].

Figure 1: (1) Combined RNA distribution. (2) Survival plots show Integrative Subtype retention.

The CNA data has the same dimensionality, with each value corresponding to the number of copies of a given region in the genome of a given patient. Both datasets were generated with GISTIC 2.0 [4] in the following categorical way: -2 = homozygous deletion; -1 = hemizygous deletion; 0 = neutral / no change; 1 = gain; 2 = high level amplification.

2.2 Connecting clinical data and defining target variables

We added patient age to the gene expression/copy number data to make our input $X_i$. We define the lifetime of patient $i$ as a random variable $T_i$ and wish to estimate the conditional survival function:

$$S_T(t) = P(T > t \mid X)$$

where $t$ is time. Figure 1.2 shows survival functions for some patient groups in the dataset (grouping done by the unsupervised technique in [2]). Due to the nature of clinical trials, not every patient is tracked until death. Many leave the study and their status becomes unknown. When this happens, we say the patient is lost, and the time they were last seen is $L_i$. If a patient remains in the study until death, then we observe the true variable $T_i$. In our combined dataset, 55.4% of the patients were tracked until death, with the remaining 44.6% of patients lost. We define the clinical data for each patient as $C_i$, with

$$C_i = \min(T_i, L_i)$$

A common task is to estimate the survival probability at certain times, e.g. $S_T(24)$ or $S_T(60)$ (2 and 5 year survival, with $t$ in months). We can do this by building an estimator for $S_T$ directly and evaluating it at the given times, or by converting the clinical data $C_i$ to classification labels $y_i$ and building a classifier.
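As a minimal sketch of that conversion, assume each patient record carries the observed time $C_i$ in months and a flag saying whether death was observed (that is, $T_i \le L_i$). The function name `survival_labels` and the toy values are illustrative assumptions, not part of the released pipeline.

```python
from typing import List, Optional, Sequence


def survival_labels(observed_months: Sequence[float],
                    died: Sequence[bool],
                    horizon: float) -> List[Optional[int]]:
    """Turn censored follow-up times into binary labels at a fixed horizon.

    observed_months[i] plays the role of C_i = min(T_i, L_i);
    died[i] is True when the patient was followed until death (T_i <= L_i).
    Returns 1 (survived past the horizon), 0 (died by the horizon), or
    None (lost before the horizon, outcome unknown, sample to be dropped).
    """
    labels: List[Optional[int]] = []
    for c, event in zip(observed_months, died):
        if c > horizon:
            labels.append(1)      # still alive after `horizon` months
        elif event:
            labels.append(0)      # death observed at or before the horizon
        else:
            labels.append(None)   # lost to follow-up before the horizon
    return labels


# Toy cohort with made-up values: months of follow-up, and whether death was seen.
months = [72.0, 30.0, 18.0, 60.0, 12.0]
died = [False, True, False, True, True]

five_year = survival_labels(months, died, horizon=60)
two_year = survival_labels(months, died, horizon=24)
```

Patients labelled `None` carry no usable label at that horizon, which is why the same cohort yields a different number of training samples for the 2 year and 5 year tasks.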
Converting to labels simplifies the problem at the cost of coarseness in our prediction. We built $y_i$ with the following logic:

    Data: set t = 60 (5 years) or t = 24 (2 years)
    if C_i > t then
        y_i = 1    // patient survived at least t months
    else if T_i <= L_i then
        y_i = 0    // patient died before t
    else
        drop       // patient lost before t
    end

For both 2 and 5 year survival, there was significant class imbalance, with 92% ($p(y_i) = [0.08, 0.92]^T$) and 82% ($p(y_i) = [0.18, 0.82]^T$) of patients surviving respectively. Accuracy results are misleadingly high under such skewed prior distributions, so we used receiver operating characteristic (ROC) curves to evaluate our models by calculating the area under the ROC curve (AUC). Values closer to 1 indicate a good classifier, and a random classifier will achieve an AUC of 0.5.

2.3 Related work

It is difficult to benchmark ourselves against other papers, as many use much smaller sample sizes or different prediction targets, and few state the prior class split. However, we do note that in many recent papers, an AUC of 0.75 for predicting 5 year survival is a good target [5, 6].

3 Methods

To demonstrate increased performance on our combined data compared to baseline methods, we applied both supervised and semi-supervised techniques to the same data for comparison. For the supervised techniques, we used our optimization pipeline to train supervised models on the raw labelled data. For the semi-supervised techniques, we applied manifold learning to all the data (with and without labels) to learn a projection of the data into a smaller representation space, and then applied classification methods to the labelled data projected into this space.

3.1 Combining multiple data sources

For combining disparate expression data-sets, we applied Feature Specific Quantile Normalization (FSQN), proposed for biological data by Franks et al. [7]. We observed a significant increase in performance when training on the combined data rather than on each dataset independently.
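The idea behind feature-specific quantile normalization can be sketched as follows: for each gene separately, every value from the source platform is replaced by the value found at the same quantile of that gene's distribution on the target platform. This is a simplified illustration in plain Python, not the published FSQN implementation; the function names, the linear interpolation, the tie handling, and the choice of which platform serves as target are all assumptions made for the example.

```python
from typing import List, Sequence


def quantile_map(source: Sequence[float], target: Sequence[float]) -> List[float]:
    """Map each source value onto the target distribution at the same quantile."""
    ref = sorted(target)
    n = len(source)
    # Indices of the source samples ordered from lowest to highest value.
    order = sorted(range(n), key=lambda i: source[i])
    out = [0.0] * n
    for rank, idx in enumerate(order):
        # Quantile of this sample within the source cohort, in [0, 1].
        q = rank / (n - 1) if n > 1 else 0.5
        # Linear interpolation into the sorted target values.
        pos = q * (len(ref) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ref) - 1)
        out[idx] = ref[lo] + (pos - lo) * (ref[hi] - ref[lo])
    return out


def fsqn_sketch(source_matrix: Sequence[Sequence[float]],
                target_matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Apply quantile_map gene by gene to patients-by-genes matrices."""
    n_genes = len(source_matrix[0])
    columns = [
        quantile_map([row[g] for row in source_matrix],
                     [row[g] for row in target_matrix])
        for g in range(n_genes)
    ]
    # Transpose the per-gene columns back to a patients-by-genes layout.
    return [list(row) for row in zip(*columns)]


# Made-up values: 3 patients on a count-like scale, 4 on an intensity-like scale.
counts = [[10.0, 500.0], [1000.0, 5.0], [100.0, 50.0]]
intensities = [[5.0, 8.0], [6.0, 9.0], [7.0, 10.0], [8.0, 11.0]]
normalized = fsqn_sketch(counts, intensities)
```

After the mapping, each gene in `normalized` spans the same range as the corresponding gene in `intensities` while preserving the ranking of patients within the source cohort, which is what allows rows from both platforms to be stacked into one training matrix.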
As an example of this gain, an MLP classifier reaches 0.522 AUC when trained on METABRIC alone, but 0.754 AUC when trained on the combined dataset. Similarly, an SVC with RBF kernel reaches 0.657 AUC on METABRIC alone, but 0.81 on the combined dataset.

3.2 Projections

To address the issue of high dimensionality, we applied the t-SNE [8] technique to project the raw data into smaller spaces. Once again, we observed an increase in performance. For example, a Gaussian Process Classifier reaches 0.5 AUC on raw data, but 0.75 AUC on a 3D t-SNE projection. For this reason, we integrated a projection module into the pipeline for iterations at multiple dimensions.

3.3 Classifiers and Regressors

For classification tasks, we use several standard classifiers such as Support Vector Machines, Naive Bayes, Lasso Regression and Random Forests. These can be applied either to the raw data (all $M$ genes) or to a projection of these genes. Support Vector Classification with the radial basis function kernel gave the best performance on the projected data (see Table 1).

We applied a neural network regressor to build an estimator for $S_T(t)$. As mentioned in 2.2, many of the patients were lost from the study before death, so for these patients the variable $C_i$ is a lower bound on the true lifetime. We can account for this in the cost function by weighting down or even removing these samples, although we found it made little difference to the prediction accuracy. Instead, we simply minimize the mean squared error between the neural network output $f(x_i)$ and $C_i$:

$$L = \frac{1}{N} \sum_{i=0}^{N-1} \left(C_i - f(x_i)\right)^2$$

After fitting the network, we evaluated $S_T(60)$ and $S_T(24)$ and calculated the AUC (see Table 1). This achieved a lower overall AUC than the best classifiers, but it gives more complete information (e.g. we can generate a Kaplan-Meier plot per patient with $S_T$).

Finally, we also applied the random-projection ensemble classifier proposed by Cannings and Samworth [9]. This method can be seen as a way to extend simple classifiers to high-dimensional data.
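To illustrate the general idea of a random-projection ensemble, the sketch below draws random low-dimensional projections in blocks, keeps the projection with the lowest error in each block, and aggregates the votes of the retained base classifiers. It is a deliberately simplified illustration and not the method of [9] as published: the Gaussian projections, the nearest-centroid base classifier, the use of training error for selection, the fixed voting threshold `alpha`, and all names and toy data are assumptions made for the example.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]


def project(x: Sequence[float], proj: Sequence[Sequence[float]]) -> Vector:
    """Multiply a d-by-p projection matrix with a p-dimensional sample."""
    return [sum(a * b for a, b in zip(row, x)) for row in proj]


def centroid_fit(X: Sequence[Vector], y: Sequence[int]) -> List[Vector]:
    """Nearest-centroid base classifier: return the mean of each class."""
    means = []
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        means.append([sum(col) / len(rows) for col in zip(*rows)])
    return means


def centroid_predict(means: Sequence[Vector], x: Sequence[float]) -> int:
    """Assign the class whose centroid is closest in squared distance."""
    dist = [sum((a - b) ** 2 for a, b in zip(m, x)) for m in means]
    return 1 if dist[1] < dist[0] else 0


def rp_ensemble(X: Sequence[Vector], y: Sequence[int], d: int = 2,
                b1: int = 20, b2: int = 5, alpha: float = 0.5,
                seed: int = 0) -> Callable[[Sequence[float]], int]:
    """Fit b1 base classifiers, each on the best of b2 random projections."""
    rng = random.Random(seed)
    p = len(X[0])
    chosen = []
    for _ in range(b1):
        best = None
        for _ in range(b2):
            proj = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(d)]
            Z = [project(x, proj) for x in X]
            means = centroid_fit(Z, y)
            # Training error is used as a simple stand-in for a test-error estimate.
            err = sum(centroid_predict(means, z) != t for z, t in zip(Z, y)) / len(y)
            if best is None or err < best[0]:
                best = (err, proj, means)
        chosen.append((best[1], best[2]))

    def predict(x: Sequence[float]) -> int:
        votes = [centroid_predict(means, project(x, proj)) for proj, means in chosen]
        # Predict class 1 when the fraction of votes for it reaches alpha.
        return 1 if sum(votes) / len(votes) >= alpha else 0

    return predict


# Toy data: two well-separated classes in 10 dimensions (made-up values).
data_rng = random.Random(1)
p = 10
X = [[data_rng.gauss(0, 0.5) for _ in range(p)] for _ in range(10)]
X += [[10 + data_rng.gauss(0, 0.5) for _ in range(p)] for _ in range(10)]
y = [0] * 10 + [1] * 10
predict = rp_ensemble(X, y)
```

Because each base classifier only ever sees `d` projected coordinates, the base learner can stay simple even when the number of genes far exceeds the number of patients, which is the situation described in this section.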
The random-projection ensemble also allowed us to assess the relative importance of the features used in the prediction. The classifier gave an AUC on 250 held-out test observations of 0.76 and 0.79 for 2 year and 5 year survival prediction, respectively. Tuning parameters were chosen using 10-fold cross validation.

3.4 Pipeline

We present a pipeline tool for easier experimentation and reproducibility on other data-sets across diseases. This pipeline supports the following: model wrapping (handling custom models that respect a simple interface); cross validation (automatic cross validation for evaluation); hyperparameter optimization (scanning a wide parameter space across multiple computational platforms using hyperopt [10]); and data distribution (seamless data distribution across compute clusters).

4 Experiments and Results

Table 1: Results across different classifiers and data-sets

Model                        Data                                 Validation AUC
SVC (RBF)                    TCGA Metabric RNA raw                0.815
SVC (RBF)                    TCGA Metabric RNA TSNE 15 age        0.774
GaussianProcessClassifier    TCGA Metabric RNA TSNE 3 age         0.755
RectangleMLPClassifier       TCGA Metabric RNA TSNE 40 age        0.754
Lasso                        TCGA Metabric RNA raw                0.750
GaussianNB                   TCGA Metabric RNA TSNE 70 age        0.742
SVC (RBF)                    TCGA Metabric RNA TSNE 5 age         0.736
GaussianProcessClassifier    TCGA Metabric RNA TSNE 5 age         0.725
Neural Network Regressor     TCGA Metabric RNA raw                0.720
GaussianNB                   TCGA Metabric RNA TSNE 10 age        0.692
Random Forest                TCGA Metabric RNA raw                0.670
SVC (RBF)                    Metabric RNA raw                     0.662
GaussianNB                   Metabric RNA raw                     0.657
GaussianProcessClassifier    TCGA Metabric RNA TSNE 70 age        0.654
GaussianNB                   TCGA Metabric RNA+CNA raw            0.649
GaussianNB                   TCGA Metabric RNA raw                0.645
GaussianNB                   TCGA Metabric CNA raw                0.639
RectangleMLPClassifier       TCGA Metabric RNA raw                0.607
RectangleMLPClassifier       TCGA Metabric RNA+CNA raw age        0.551
RectangleMLPClassifier       Metabric RNA raw                     0.522
GaussianProcessClassifier    TCGA Metabric RNA raw                0.500

5 Conclusion

Our deep learning pipeline enables the use of high dimensional -omics data from disparate sources to predict
clinical outcomes. We demonstrate this through prediction of short term survival in breast cancer patients, with the hope of greater monitoring and care for those patients at high risk. We believe this will be especially beneficial in opening up -omics data to machine learning researchers.

References

[1] Vanessa Almendro, Andriy Marusyk, and Kornélia Polyák. Cellular heterogeneity and molecular evolution in cancer. Annual Review of Pathology, 8:277–302, 2013.
[2] Sarah-Jane Dawson, Oscar M. Rueda, S. Aparicio, and Carlos Caldas. A new genome-driven integrated classification of breast cancer and its implications. The EMBO Journal, 32(5):617–28, 2013.
[3] Katie Gao, Dayong Wang, and Yi Huang. Cross-cancer prediction: A novel machine learning approach to discover molecular targets for development of treatments for multiple cancers. Cancer Informatics, 17:1176–9351, 2018.
[4] Craig H. Mermel, Steven E. Schumacher, Barbara Hill, Matthew L. Meyerson, Rameen Beroukhim, and Gad Getz. GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biology, 12(4):R41, Apr 2011. ISSN 1474-760X. doi: 10.1186/gb-2011-12-4-r41. URL https://doi.org/10.1186/gb-2011-12-4-r41.
[5] Eliseos J. Mucaki, Katherina Baranova, Huy Q. Pham, Iman Rezaeian, Dimo Angelov, Alioune Ngom, Luis Rueda, and Peter K. Rogan. Predicting Outcomes of Hormone and Chemotherapy in the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) Study by Biochemically-inspired Machine Learning. F1000Research, 5, May 2017. ISSN 2046-1402. doi: 10.12688/f1000research.9417.3.
[6] Daniel Urda, J Montes-Torres, F Moreno, Leonardo Franco, and José Jerez. Deep Learning to Analyze RNA-Seq Gene Expression Data. pages 50–59, May 2017. ISBN 978-3-319-59146-9. doi: 10.1007/978-3-319-59147-6_5.
[7] Jennifer M. Franks, Guoshuai Cai, and Michael L. Whitfield. Feature specific quantile normalization enables cross-platform classification of molecular subtypes using gene expression data. Bioinformatics, 34(11):1868–1874, 2018.
[8] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008. URL http://www.jmlr.org/papers/v9/vandermaaten08a.html.
[9] Timothy I. Cannings and Richard J. Samworth. Random-projection ensemble classification. J. Roy. Statist. Soc., Ser. B (with discussion), 79:959–1035, 2017.
[10] J. Bergstra, D. Yamins, and D. D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. ICML 2013, 2013.

A Framework for Implementing Machine Learning on Omics Data

https://www.research.ed.ac.uk/files/80085879/1811.10455.pd.pdf
