Enabling interpretation of the outcome of a human obesity prediction machine learning analysis from genomic data

Abstract

In this brief paper, we address the medical problem of human obesity prediction from genomic data. Genomic datasets may contain a huge number of features and they often have to be analyzed within the realm of Big Data technologies. As a medical problem, obesity prediction would welcome interpretables outcomes. Therefore, the analyst would benefit from appraches in which the problem of very high data dimensionality could be eased as much as possible. Feature selection can be an essential part of such approaches. In this context, though, traditional machine learning methods may struggle. Here, we propose a pipeline to address this problem using partitioning strategies: both vertical, by dividing the data based on gender, and horizontal, by splitting each of the analyzed chromosomes into 5,000-instances subsets. For each, Minimum Redundancy and Maximum Relevance feature selection is used to find rankings of the single nucleotide polymorphisms most relevant for classification in the medical dataset.Preprin

    Similar works