8 research outputs found

    A New Hierarchical Redundancy Eliminated Tree Augmented Naive Bayes Classifier for Coping with Gene Ontology-based Features

    The Tree Augmented Naive Bayes classifier is a type of probabilistic graphical model that can represent some feature dependencies. In this work, we propose a Hierarchical Redundancy Eliminated Tree Augmented Naive Bayes (HRE-TAN) algorithm, which removes hierarchical redundancy during classifier learning when coping with data containing hierarchically structured features. The experiments showed that HRE-TAN obtains significantly better predictive performance than the conventional Tree Augmented Naive Bayes classifier, and is more robust against imbalanced class distributions, in aging-related gene datasets with Gene Ontology terms used as features. Comment: International Conference on Machine Learning (ICML 2016) Computational Biology Workshop

    Proposal of Genetic Operators for Feature Selection in Model Organism Databases

    Ageing is a natural biological process common to all organisms and has been widely studied in recent years. However, the mechanisms that influence this process, in terms of both longevity and anti-longevity, are not yet fully known. Even so, research is increasingly uncovering biological processes that are linked to ageing. One example is caloric restriction, which has been shown to extend the longevity of many species.

    Evolution of Scikit-Learn Pipelines with Dynamic Structured Grammatical Evolution

    The deployment of Machine Learning (ML) models is a difficult and time-consuming job that comprises a series of sequential and correlated tasks, from data pre-processing and the design and extraction of features to the choice of the ML algorithm and its parameterisation. The task is even more challenging because the design of features is in many cases problem-specific and thus requires domain expertise. To overcome these limitations, Automated Machine Learning (AutoML) methods seek to automate, with little or no human intervention, the design of pipelines, i.e., the selection of the sequence of methods that have to be applied to the raw data. These methods have the potential to enable non-expert users to use ML, and to provide expert users with solutions that they would be unlikely to consider. In particular, this paper describes AutoML-DSGE, a novel grammar-based framework that adapts Dynamic Structured Grammatical Evolution (DSGE) to the evolution of Scikit-Learn classification pipelines. The experimental results include a comparison of AutoML-DSGE with another grammar-based AutoML framework, Resilient Classification Pipeline Evolution (RECIPE), and show that the average performance of the classification pipelines generated by AutoML-DSGE is always superior to the average performance of RECIPE; the differences are statistically significant in 3 out of the 10 datasets used. Comment: EvoApps 202

    Investigating the Role of Simpson’s Paradox in the Analysis of Top-Ranked Features in High-Dimensional Bioinformatics Datasets

    An important problem in bioinformatics consists of identifying the most important features (or predictors) among a large number of features in a given classification dataset. This problem is often addressed by using a machine learning-based feature ranking method to identify a small set of top-ranked predictors (i.e. the most relevant features for classification). The many studies in this area have, however, an important limitation: they ignore the possibility that the top-ranked predictors occur in an instance of Simpson’s paradox, where the positive or negative association between a predictor and a class variable reverses sign upon conditioning on each of the values of a third (confounder) variable. In this work, we review and investigate the role of Simpson’s paradox in the analysis of top-ranked predictors in high-dimensional bioinformatics datasets, in order to avoid the potential danger of misinterpreting an association between a predictor and the class variable. We perform computational experiments using four well-known feature ranking methods from the machine learning field and five high-dimensional datasets of ageing-related genes, where the predictors are Gene Ontology terms. The results show that occurrences of Simpson’s paradox involving top-ranked predictors are much more common for one of the feature ranking methods.
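
    The sign reversal that defines Simpson’s paradox in the abstract above can be checked mechanically. The following is a minimal sketch, not the authors' procedure: it assumes a binary predictor, a binary class and a categorical confounder, and measures association as a difference of class proportions. The function names and the toy counts are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def association(pairs):
    """Return P(class=1 | predictor=1) - P(class=1 | predictor=0), or None if undefined."""
    with_term = [cls for pred, cls in pairs if pred == 1]
    without_term = [cls for pred, cls in pairs if pred == 0]
    if not with_term or not without_term:
        return None
    return sum(with_term) / len(with_term) - sum(without_term) / len(without_term)

def is_simpsons_paradox(rows):
    """rows: iterable of (predictor, confounder, cls) with binary predictor and cls.

    True when the overall association is non-zero and every confounder
    stratum shows a non-zero association of the opposite sign.
    """
    overall = association([(pred, cls) for pred, _, cls in rows])
    strata = defaultdict(list)
    for pred, conf, cls in rows:
        strata[conf].append((pred, cls))
    within = [association(pairs) for pairs in strata.values()]
    if not overall or any(not w for w in within):
        return False
    return all((w > 0) != (overall > 0) for w in within)

def expand(counts):
    """Turn {(predictor, confounder): (n_instances, n_class_1)} into rows."""
    rows = []
    for (pred, conf), (n, n_pos) in counts.items():
        rows += [(pred, conf, 1)] * n_pos + [(pred, conf, 0)] * (n - n_pos)
    return rows

# Hypothetical toy counts (not taken from any dataset in the paper).
rows = expand({
    (1, 0): (10, 9),   # stratum 0, predictor present: 90% class 1
    (0, 0): (40, 32),  # stratum 0, predictor absent:  80% class 1
    (1, 1): (40, 12),  # stratum 1, predictor present: 30% class 1
    (0, 1): (10, 2),   # stratum 1, predictor absent:  20% class 1
})
print(is_simpsons_paradox(rows))  # True
```

    In the toy counts the predictor is positively associated with the class inside each stratum, yet the pooled association is negative (42% versus 68%), because the predictor is concentrated in the stratum with the lower base rate.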

    Predicting the pro-longevity or anti-longevity effect of model organism genes with new hierarchical feature selection methods

    Ageing is a highly complex biological process that is still poorly understood. With the growing amount of ageing-related data available on the web, in particular concerning the genetics of ageing, it is timely to apply data mining methods to that data, in order to try to discover novel patterns that may assist ageing research. In this work, we introduce new hierarchical feature selection methods for the classification task of data mining and apply them to ageing-related data from four model organisms: Caenorhabditis elegans (worm), Saccharomyces cerevisiae (yeast), Drosophila melanogaster (fly), and Mus musculus (mouse). The main novel aspect of the proposed feature selection methods is that they exploit hierarchical relationships in the set of features (Gene Ontology terms) in order to improve the predictive accuracy of the Naïve Bayes and 1-Nearest Neighbour (1-NN) classifiers, which are used to classify model organisms’ genes as pro-longevity or anti-longevity genes. The results show that our hierarchical feature selection methods, when used together with Naïve Bayes and 1-NN classifiers, obtain higher predictive accuracy than the standard (without feature selection) Naïve Bayes and 1-NN classifiers, respectively. We also discuss the biological relevance of a number of Gene Ontology terms very frequently selected by our algorithms in our datasets.

    Prediction and characterization of human ageing-related proteins by using machine learning

    Ageing has a huge impact on human health and the economy, but its molecular basis (regulation and mechanism) is still poorly understood. To date, more than three hundred genes (almost all of them protein-coding) have been related to human ageing. Although individual ageing-related genes or some small subsets of these genes have been intensively studied, their analysis as a whole has been highly limited. To fill this gap, for each human protein we extracted 21000 protein features from various databases and, using these data as input to state-of-the-art machine learning methods, classified human proteins as ageing-related or non-ageing-related. We found a simple classification model based on only 36 protein features, such as the “number of ageing-related interaction partners”, “response to oxidative stress”, “damaged DNA binding”, “rhythmic process” and “extracellular region”. Predicted values of the model quantify the relevance of a given protein in the regulation or mechanisms of the human ageing process. Furthermore, we identified new candidate proteins having strong computational evidence of their important role in ageing. Some of them, like Cytochrome b-245 light chain (CY24A) and Endoribonuclease ZC3H12A (ZC12A), have no previous ageing-associated annotations.

    In silico identification of genetic and pharmacological interventions to modulate ageing

    As life expectancy increases and fertility rates decrease, the growing ageing population poses a significant challenge to the healthcare systems of developed countries. Ageing, as the major risk factor for chronic diseases, constitutes the primary target to reduce the burden of diseases and improve human health. However, ageing is a complex process, and predicting potential interventions requires system-level approaches. In this thesis, I present the development of two computational methods using biological data to predict novel genetic and pharmacological interventions to ameliorate ageing. My first study focused on identifying repurposable drugs to delay human ageing. Several computational drug-repurposing studies have been developed, but most of them focus on predicting geroprotectors using animal model data, even though certain aspects of ageing may be human-specific. Using drug-protein interaction information, I searched for drugs targeting a significant proportion of human ageing-related genes and pathways. The top-ranked drugs included a significant number of known geroprotectors, validating the capability of the method to discover drugs to modulate ageing. At the top of the list was tanespimycin, a heat shock protein inhibitor, whose geroprotective properties we validated experimentally. My second study centres on determining the molecular mechanisms associated with healthy lifespan, and how to use this information to find new genetic interventions to delay ageing. In recent years, the number of transcriptomic studies of mouse models of ageing has increased dramatically, providing the opportunity to compare gene expression changes of long- and short-lived strains. I showed that differences in healthy lifespan are associated with expression changes in genes regulating mitochondrial metabolism.
    Using these gene sets as biomarkers of lifespan, I compared the mouse models of ageing against 51 genetically engineered mice and predicted candidate genetic and pharmacological interventions with the potential to delay ageing. Through computational studies I predicted a narrowed-down list of candidate genetic and pharmacological interventions to delay mouse and human ageing and validated several predictions made by other researchers using different methods, confirming the robustness of computational methods to identify new anti-ageing interventions. With the discovery of tanespimycin as a new geroprotector, I revealed that a little proteostatic stress is good for longevity and that we can trigger this hormetic response pharmacologically. I exposed the complexity of ageing as I found multiple mechanisms to delay ageing, most of which were tissue-specific, and found evidence for new candidate hallmarks of ageing and novel biomarkers of lifespan.
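
    The first study in the abstract above ranks drugs by whether they target "a significant proportion" of ageing-related genes. The abstract does not name the statistic used, so the sketch below substitutes a standard hypergeometric over-representation test as one plausible way to score such overlap. The drug names, target counts and background size are hypothetical placeholders, not values from the thesis.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k): chance of at least k marked items when drawing n
    without replacement from N items, K of which are marked."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Assumed background: 1000 proteins, 50 of them ageing-related (hypothetical).
N_PROTEINS, N_AGEING = 1000, 50

# Hypothetical drugs: (number of targets, number of ageing-related targets).
drugs = {
    "drug_A": (10, 4),
    "drug_B": (10, 0),
    "drug_C": (40, 3),
}

# Rank drugs by p-value: smaller means stronger over-representation.
ranked = sorted(
    ((name, hypergeom_sf(hits, N_PROTEINS, N_AGEING, n_targets))
     for name, (n_targets, hits) in drugs.items()),
    key=lambda item: item[1],
)
for name, p_value in ranked:
    print(f"{name}: p = {p_value:.3g}")
```

    A real analysis over many drugs would also need a multiple-testing correction, which this sketch omits.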