In this brief paper, we address the medical problem of human obesity prediction from genomic data. Genomic datasets may contain a huge number of features and they often have to be analyzed within the realm of Big Data technologies. As a medical problem, obesity prediction would welcome interpretable outcomes. Therefore, the analyst would benefit from approaches in which the problem of very high data dimensionality could be eased as much as possible. Feature selection can be an essential part of such approaches. In this context, though, traditional machine learning methods may struggle. Here, we propose a pipeline to address this problem using partitioning strategies: both vertical, by dividing the data based on gender, and horizontal, by splitting each of the analyzed chromosomes into 5,000-instance subsets. For each subset, Minimum Redundancy and Maximum Relevance feature selection is used to find rankings of the single nucleotide polymorphisms most relevant for classification in the medical dataset.

Bilal, Ahsan

Ribas Ripoll, Vicent

Vellido Alcacena, Alfredo

English

UPCommons. Portal del coneixement obert de la UPC

Enabling interpretation of the outcome of a human obesity prediction machine learning analysis from genomic data

Ahsan Bilal (1,2), Alfredo Vellido (1,3), Vicent Ribas (2)

(1) Universitat Politècnica de Catalunya (UPC BarcelonaTech), Barcelona 08034, Spain, avellido@cs.upc.edu
(2) EURECAT: Centre Tecnològic de Catalunya, Barcelona 08005, Spain, vicent.ribas@eurecat.org
(3) Intelligent Data Engineering and Artificial Intelligence (IDEAI) Research Center, Barcelona 08034, Spain

Keywords: Machine Learning, Feature Selection, Minimum Redundancy and Maximum Relevance, SNP, Big Data, Apache Spark, Obesity.

Abstract. In this brief paper, we address the medical problem of human obesity prediction from genomic data. Genomic datasets may contain a huge number of features and they often have to be analyzed within the realm of Big Data technologies. As a medical problem, obesity prediction would welcome interpretable outcomes. Therefore, the analyst would benefit from approaches in which the problem of very high data dimensionality could be eased as much as possible. Feature selection can be an essential part of such approaches. In this context, though, traditional machine learning methods may struggle. Here, we propose a pipeline to address this problem using partitioning strategies: both vertical, by dividing the data based on gender, and horizontal, by splitting each of the analyzed chromosomes into 5,000-instance subsets. For each subset, Minimum Redundancy and Maximum Relevance feature selection is used to find rankings of the single nucleotide polymorphisms most relevant for classification in the medical dataset.

1 Introduction

The pervasive use of networked computer systems in medical and clinical environments has made medical research an increasingly data-dependent discipline. This brings to the fore many challenges related to operational data management and knowledge extraction from data [?].

This paper addresses the medical problem of human obesity prediction from genomic data.
Genomic datasets (in general and in the particular case of this study) may contain a huge number (even millions) of features. Moreover, they show, more often than not, very low ratios of instances to features. This has two immediate consequences: first, that the data require Big Data technologies for their management and analysis and, second, that traditional machine learning (ML) methods for data analysis and knowledge extraction may struggle in a scenario of low instances-to-features ratios [?].

As a medical problem, obesity prediction would welcome interpretable outcomes that can be acted upon in an operational manner, even if for purely research-related purposes. Therefore, the analyst would benefit from approaches in which the problem of large data dimensionality could be eased as much as possible. Feature selection (FS) for dimensionality reduction (DR) can be an essential part of such approaches and it is the strategy that we propose in our study.

Apache Spark is a distributed in-memory Big Data system with the potential to overcome these bottlenecks. Our analyses, though, show that Apache Spark is unable to cope with our dataset containing ≈ 0.74 million features. Here, as an alternative, we propose a pipeline to address this problem using partitioning strategies: both vertical, by dividing the data based on gender, and horizontal, by splitting each of the analyzed chromosomes into 5,000-instance subsets. For each subset, Minimum Redundancy and Maximum Relevance (mRMR) FS is then used to find rankings of the most relevant single nucleotide polymorphisms (SNPs) in a medical dataset.

The remainder of the paper is structured as follows: first we describe the FS approach followed in the study. We then briefly describe the analyzed data and the pipeline used for their tractable analysis.
This is followed by the reporting of the experimental results and some summary conclusions.

2 Methods: Feature Selection

FS can be described as a process of automatic tagging of subsets of features as relevant for model construction. FS is by itself useful, but it mostly acts as a filter, muting out features that are not useful for the purpose of analysis. As commented in [?], FS "has shown its effectiveness in many applications, but the unique characteristics of big data present challenges". Usually, real-world datasets come with a sizeable amount of irrelevant and redundant features. FS also helps data analysis by decreasing memory storage requirements and computational cost, while avoiding information loss as much as possible [?].

FS methods can be used for identifying and removing from data those unwanted, irrelevant and redundant variants that do not contribute to the accuracy of a model, or may in fact decrease it. Redundant features are ascribed to the category of irrelevant ones. Each such feature is to some extent relevant and cannot be discarded manually, but redundancy implies the co-presence of another feature with similar performance, so that the model's learning performance will not be compromised by removing one of them [?].

According to Guyon and Elisseeff [4], the most important objectives of FS are:

• to reduce overfitting and improve the model performance in the sense of prediction,
• to provide faster and more cost-effective models,
• to achieve an easy interpretation of the model by domain users using only a small subset of data.

Although FS techniques are very handy in large-scale datasets and are widely used, there are also a few aspects of the process that require care. The advantages of FS techniques come at a certain price, because the search for a subset of relevant features introduces an additional layer of complexity in the modeling task.
Instead of just optimizing the parameters of the model for the full feature set, we now need to find the optimal model parameters for the optimal feature subset, as there is no guarantee that the optimal parameters for the full feature set are equally optimal for the optimal feature subset [?, ?, ?, ?].

From a ML point of view, the selection of biomarkers in our medical context can be stated as a FS problem for a classification task, where we have the objective of finding a reasonably small set of features (biomarkers) that is capable of best explaining the difference between the disease and the control samples [?].

From a biological point of view, Haury et al. explain that applying FS to biological case/control datasets makes it possible to investigate the genes selected in the signature and to evaluate their relationship to the biological processes involved in the disease [8].

2.1 Minimum-Redundancy-Maximum-Relevance (mRMR)

mRMR was first developed by Peng et al. [9] and it is considered one of the most powerful filter methods. It is based on mutual information and selects features according to their maximum statistical dependency on the class label. Selecting a small but meaningful subset out of several thousands or millions of biomarkers is a most relevant task, not only for achieving the most accurate classification of biomedical data, but also for enabling biomedical interpretability. The mRMR algorithm was developed to deal with the classification of DNA microarray data, which is a challenging task when faced with a huge number of features (SNPs in this case) paired with a limited number of observations. In their study, Peng et al. stated that the genes selected via mRMR provide a more balanced coverage of the space and capture broader characteristics of phenotypes [?].

The mRMR algorithm ranks the importance of the features based on their relevance to the class.
As the name suggests, the main goal is to achieve the maximum relevance between the features X and the class C, using mutual information (MI):

$$I(A;B) = \sum_{b \in B} \sum_{a \in A} P(a,b) \log\left(\frac{P(a,b)}{P(a)\,P(b)}\right) \tag{1}$$

In Eq. (1), I represents the mutual information between the features A and B, which can be easily derived by calculating the marginal probabilities P(a) and P(b) and the joint probability P(a,b) of the values of both features [?].

The maximum relevance can be determined by Eq. (2) [?]:

$$\max D(X,C), \quad D = \frac{1}{|X|} \sum_{X_i \in X} I(X_i;C) \tag{2}$$

Since redundancy is a major issue in this feature selection task, especially when targeting the maximum relevance criterion for large datasets, we can minimize the redundancy according to Eq. (3), as suggested in [?]:

$$\min R(X), \quad R = \frac{1}{|X|^2} \sum_{X_i, X_j \in X} I(X_i;X_j) \tag{3}$$

Finally, the combination of Eq. (2) and Eq. (3) yields the desired mRMR criterion in Eq. (4), where S is the set of features selected so far [?]:

$$\max_{X_i \notin S} \left[ I(X_i;C) - \frac{1}{|S|} \sum_{X_j \in S} I(X_j;X_i) \right] \tag{4}$$

3 Materials: Experimental Dataset

The analyzed dataset comprises genomic data from a series of patients. The base dataset consists of 22 chromosomes, whereas chromosome 23 is related to sex and is not considered. A total of 4,988 patients and 736,990 SNPs were available.

4 Proposed Data Analysis Pipeline

Our proposed data analysis pipeline is based on a complex data pre-processing stage that includes data partitioning, data transposition, feature selection, data merging and building the classifier. First, the data are partitioned horizontally (by rows) and vertically (by columns) into subsets of 5,000 features and based on gender, respectively, and a data preparation strategy is applied to each partition. Second, we merge the results from all 22 chromosomes and obtain a final model based on the top relevant features which influence obesity in males and females. A high-level view of the proposed pipeline architecture can be seen in Fig. 1.

Figure 1: Data pipeline.

The first stage of data preparation consists of dividing the data into horizontal and vertical partitions. This partitioning enabled us to run the job in the Apache Spark cluster available for data handling. The partitioning solution that was initially implemented in Apache Spark involved writing the partitioned data in HDFS (a Java-based file system for data storage). Apache Spark version 2.0 generated exceptions, which were handled by turning instead to PLINK (a widely used application for analyzing genotypic data that can be considered the de facto standard in the field) and Linux commands for partitioning the data first into gender-specific subsets and, second, into subsets containing 5,000 features. This solution was found to be fast and efficient.

Subsequently, the data of each partition were transposed for each chromosome CHi, for males and females separately. This stage was necessary because the FS procedure requires the data to be structured with SNPs as variables and patients as samples. Note that in the original structure of the provided data, patients were described in columns and SNPs in rows.

Finally, FS was applied to each partition of the chromosome CHi for males and females separately; the selected features found to be the most relevant as obesity predictors were merged; and the classifiers were built by splitting the data into training (70%) and test (30%) sets. Through the mRMR filter method [?, ?], the top 20 features were selected according to their ranking, for each partition of the data. In summary, from all 22 chromosomes, both for males and females, only 3,040 SNP variants were selected; that is, a mere 0.41% of the original total amount of SNPs available for analysis.

Approximately 140 features were selected from each chromosome and learning models were built individually, evaluating their accuracy.
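
As a rough illustration of the per-partition feature selection stage described above, the following sketch splits a genotype matrix into fixed-size feature chunks and applies a greedy ranking based on the incremental criterion of Eq. (4) within each chunk. It is a minimal single-machine NumPy sketch under stated assumptions, not the Spark-based implementation used in the study: genotypes are assumed to be coded as small non-negative integers (for instance 0/1/2 minor-allele counts), mutual information is estimated from empirical counts, the data are synthetic, and the names `mutual_information`, `mrmr_rank` and `select_per_partition` are hypothetical.

```python
import numpy as np


def mutual_information(a, b):
    """Empirical mutual information (in nats) between two discrete integer vectors."""
    a = np.asarray(a)
    b = np.asarray(b)
    # Joint contingency table of counts, normalised to joint probabilities P(a, b).
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1.0)
    joint /= joint.sum()
    # Marginals P(a) and P(b) as column and row vectors.
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0  # empty cells contribute zero to the sum
    return float(np.sum(joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])))


def mrmr_rank(X, y, k):
    """Greedy mRMR: return the indices of k columns of X, in selection order."""
    n_features = X.shape[1]
    k = min(k, n_features)
    relevance = np.array([mutual_information(X[:, j], y) for j in range(n_features)])
    selected = [int(np.argmax(relevance))]  # first pick: maximum relevance only
    redundancy_sum = np.zeros(n_features)   # sum of I(X_j; X_i) over selected X_j
    while len(selected) < k:
        last = selected[-1]
        redundancy_sum += np.array(
            [mutual_information(X[:, last], X[:, j]) for j in range(n_features)]
        )
        # Relevance minus mean redundancy with the already selected set S.
        score = relevance - redundancy_sum / len(selected)
        score[selected] = -np.inf  # never re-select a feature
        selected.append(int(np.argmax(score)))
    return selected


def select_per_partition(X, y, chunk_size=5000, top_k=20):
    """Split the columns of X into chunks and keep the top_k mRMR features of each."""
    kept = []
    for start in range(0, X.shape[1], chunk_size):
        chunk = X[:, start:start + chunk_size]
        kept.extend(start + j for j in mrmr_rank(chunk, y, top_k))
    return kept


if __name__ == "__main__":
    # Synthetic stand-in data: 200 samples, 30 SNP-like features coded 0/1/2.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 3, size=(200, 30))
    y = (X[:, 4] > 0).astype(int)  # class label driven by feature 4
    X[:, 17] = X[:, 4]             # feature 17 is an exact, redundant copy
    print(select_per_partition(X, y, chunk_size=10, top_k=2))
```

With `chunk_size=5000` and `top_k=20` the function mirrors the partition size and per-partition cut-off reported above. Note that, within a single partition, the redundancy term keeps an exact copy of an already selected feature from being picked next, but features in different partitions are never compared with each other, so redundant SNPs can still survive the merge.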
The final step of the proposed data pipeline involves finding common features that are available in both male and female datasets, ranking them according to the mRMR FS score.

5 Results

The experiments were performed on a YARN machine with 3 executors, 27GB RAM and 7 CPUs. mRMR ran extremely slowly on 5,000 features, even using the maximum resources. It took several days to process all the partitions from all 22 chromosomes for both genders.

A Random Forest (RF) classifier was first used, with a 5-fold cross-validation (CV) procedure to find the best parameters of the model and increase efficiency. Unfortunately, the preliminary accuracy results were poor. After analyzing its implementation in Spark ML, we found that it could not properly handle imbalanced binary class distributions. To overcome this problem, a down-sampling technique was used to reduce the number of cases in the majority class. Alternatively, Spark can handle imbalanced binary classification through weights in a Logistic Regression (LR) model, which was also used as a classifier.

Finally, we merged the top selected 0.41% of SNPs from all chromosomes, combining them in a single dataset. During the evaluation of each chromosome, we found that LR performed quite well in the binary classification problem. We also observed that not using sampling or weighting in the LR method did not have any significant negative impact on performance as measured by the Area Under the ROC Curve (AUC), although the weighting in the LR-Weights model slightly increased the AUC. The results for all 22 chromosomes combined are shown in Tab. 1 and the common SNPs found in both males and females are listed in Tab. 2.

Table 1: Combined performance, as measured by AUC, with all 22 chromosomes.

| Sampling (Classifier) | Gender | Test AUC | CV AUC |
|---|---|---|---|
| Weight (LR) | Male | 0.971 | 0.962 |
| Weight (LR) | Female | 0.965 | 0.948 |
| No Sampling (LR) | Male | 0.963 | 0.941 |
| No Sampling (LR) | Female | 0.925 | 0.923 |
| Down-sampling (RF) | Male | 0.782 | 0.784 |
| Down-sampling (RF) | Female | 0.632 | 0.678 |
| No Sampling (RF) | Male | 0.500 | 0.501 |
| No Sampling (RF) | Female | 0.500 | 0.500 |

6 Conclusions

Interpretability is paramount in medical applications of ML in general, but it is particularly difficult when the medical problem, obesity prediction in the case of this study, is defined according to genomic data. This setting requires the use of Big Data tools and technologies, as we need to extract knowledge from thousands of individuals described through millions of features (SNPs in this study). Existing systems still lack flexibility for handling, in a distributed manner, bioinformatics data described through millions of features and not just millions of records.

We have proposed a data analysis pipeline design using data partitioning for Big Data, which has solved feasibility issues in an Apache Spark 2.0 framework, allowing us to run jobs using the available resources. We reckon that running these tasks with maximized resources according to the proposed pipeline would lead to good computational performance.

Through feature engineering and FS-based DR, we have managed to reduce the original bulk of 736,990 SNPs to an extremely lean selection of 3,040 SNPs, while providing a quite accurate obesity prediction (0.965 AUC for females and 0.971 AUC for males). This result, with specific SNP selections related to specific chromosomes, is the first and necessary step for guaranteeing the interpretability of any biomedical research oriented towards explaining human obesity from this type of genomic data.

Table 2: Common SNPs from Males and Females.

| SR No. | SNP | Chromosome |
|---|---|---|
| 1 | 2:4259627:C:T | 2 |
| 2 | 2:224060700:G:A | 2 |
| 3 | 2:233158545:C:T | 2 |
| 4 | 3:125050868:T:C | 3 |
| 5 | 4:130008848:T:C | 4 |
| 6 | 6:30127079:T:C | 6 |
| 7 | 6:32975283:G:T | 6 |
| 8 | 6:30233192:T:C | 6 |
| 9 | 6:28865417:T:C | 6 |
| 10 | 7:1932780:G:C | 7 |
| 11 | 8:30430742:T:C | 8 |
| 12 | 8:133210054:T:C | 8 |
| 13 | 8:143486205:G:A | 8 |
| 14 | 9:119600196:T:C | 9 |
| 15 | 11:91411734:A:G | 11 |
| 16 | 13:67193281:C:T | 13 |
| 17 | 13:101857816:G:A | 13 |
| 18 | 14:57747325:C:T | 14 |
| 19 | 14:92758540:G:C | 14 |
| 20 | 15:96771641:T:G | 15 |
| 21 | 16:7403274:T:G | 16 |
| 22 | 19:11096293:G:A | 19 |

Acknowledgements

This research was partially funded by the Spanish MINECO TIN2016-79576-R.

References

[1] J. Li and H. Liu, "Challenges of feature selection for big data analytics." IEEE Intelligent Systems, vol. 32, no. 2, pp. 9-15, 2017.
[2] J. Li, K. Cheng, S. Wang, F. Morstatter, R.P. Trevino, J. Tang, and H. Liu, "Feature selection: A data perspective." ACM Computing Surveys (CSUR), vol. 50, no. 6, p. 94, 2017.
[3] H. Liu and H. Motoda, "Computational methods of feature selection." CRC Press, 2007.
[4] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection." Journal of Machine Learning Research, vol. 3, pp. 1157-1182, 2003.
[5] Y. Saeys, I. Inza, and P. Larrañaga, "A review of feature selection techniques in bioinformatics." Bioinformatics, vol. 23, no. 19, pp. 2507-2517, 2007.
[6] W. Daelemans, V. Hoste, F. De Meulder, and B. Naudts, "Combined optimization of feature selection and algorithm parameters in machine learning of language." European Conference on Machine Learning. Springer, Berlin, Heidelberg, 2003.
[7] T. Abeel, T. Helleputte, Y. Van de Peer, P. Dupont, and Y. Saeys, "Robust biomarker identification for cancer diagnosis with ensemble feature selection methods." Bioinformatics, vol. 26, no. 3, pp. 392-398, 2009.
[8] A.-C. Haury, P. Gestraud, and J.-P. Vert, "The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures." PLoS ONE, vol. 6, no. 12, p. e28210, 2011.
[9] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy." IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.
[10] C. Ding and H. Peng, "Minimum redundancy feature selection from microarray gene expression data." Journal of Bioinformatics and Computational Biology, vol. 3, no. 2, pp. 185-205, 2005.
[11] S. Ramírez-Gallego, I. Lastra, D. Martínez-Rego, V. Bolón-Canedo, J.M. Benítez, F. Herrera, and A. Alonso-Betanzos, "Fast-mRMR: Fast Minimum Redundancy Maximum Relevance algorithm for high-dimensional Big Data." International Journal of Intelligent Systems, vol. 32, pp. 134-152, 2017.


https://upcommons.upc.edu/bitstream/2117/179493/1/Vellido.pdf
