68 research outputs found

    HIERARCHICAL ENSEMBLE METHODS FOR ONTOLOGY-BASED PREDICTIONS IN COMPUTATIONAL BIOLOGY

    The standardized annotation of biological entities, such as genes and proteins, has strongly promoted the organization of biological concepts into controlled vocabularies, that is, ontologies that make it possible to index consistently the relationships between the different functional classes, organized according to a predefined hierarchy. Examples of biological ontologies whose functional terms are structured as a directed acyclic graph (DAG) are the Gene Ontology (GO) and the Human Phenotype Ontology (HPO). The scientific community uses these hierarchical taxonomies, respectively, to systematize the protein functions of all living organisms from Archaea to Metazoa and to categorize the phenotypic abnormalities associated with human diseases. By offering a well-defined classification space, these bio-ontologies have fostered the development of learning methods for the automated prediction of protein function and of gene-disease phenotype associations in humans. The goal of these methodologies is to “steer” “in-vitro” research so as to reduce costs and use research funds more effectively. From a machine learning standpoint, predicting protein function or gene-disease phenotype associations in humans can be modeled as a structured multi-label classification problem, in which the predictions associated with each example (i.e., gene or protein) are sub-graphs organized according to a given structure (tree or DAG). Because of the complexity of the classification problem, the most commonly used prediction approach to date is the “flat” one, which trains a separate classifier for each ontology term without considering the hierarchical relationships among the functional classes. 
The use of this approach is justified not only because it reduces the computational complexity of the learning problem, but also by the “unstable” nature of the terms that make up the ontology itself. These terms are updated monthly through an expert-curated process based both on the biomedical scientific literature and on experimental data obtained from “in-vitro” or “in-silico” experiments. In this context, two general classes of classifiers have been proposed in the literature. On one side are the machine learning methods that predict the functional classes in a “flat” way, that is, without exploring the intrinsic structure of the annotation space. On the other side are the hierarchical approaches which, by explicitly considering the hierarchical relationships among the functional terms of the ontology, guarantee that the predicted annotations respect the “true-path-rule”, the biological rule that governs the ontologies. Among hierarchical methods, two different categories of approaches have been proposed in the literature. The first is based on kernelized methods for structured-output prediction, the second on hierarchical ensemble methods. Both have drawbacks. The former are computationally heavy and do not scale well when applied to biological ontologies. The latter were mostly designed for tree-structured taxonomies, and the few approaches specifically designed for DAG-structured ontologies are in most cases unable to improve on the prediction performance of “flat” methods. To overcome these limitations, this thesis proposes new hierarchical ensemble methods capable of providing predictions consistent with the hierarchical structure of the ontology. 
On the one hand, these approaches extend to DAG-organized ontologies earlier methods originally developed for tree-structured ontologies; on the other, they significantly improve predictions over the “flat” approach regardless of the type of classifier used. In their most general form, hierarchical ensemble approaches are highly modular, in the sense that they adopt a two-step learning strategy. In the first step, the functional classes of the ontology are learned independently of one another; in the second step, the “flat” predictions are suitably combined taking into account the hierarchy among the ontology classes. The main contributions of this thesis are both methodological and experimental. From a methodological standpoint, the following new hierarchical ensemble methods are proposed: a) HTD-DAG (Hierarchical Top-Down for DAG-structured taxonomies); b) TPR-DAG (True-Path-Rule for DAG) with several algorithmic variants; c) ISO-TPR (True-Path-Rule with Isotonic Regression), a new hierarchical algorithm that combines the True-Path-Rule with isotonic regression methods. For all the hierarchical ensemble methods, the consistency of the predictions is formally proven, that is, the proposed approaches are shown to provide predictions that respect the hierarchical relationships among the classes. 
From an experimental standpoint, results at the level of the whole genome of model organisms and of humans, and of the totality of the classes included in the biological ontologies, show that the proposed methodological approaches: a) are competitive with state-of-the-art structured-output prediction algorithms; b) are able to improve “flat” classifiers, provided that the predictions supplied by the classifier are not random; c) are able to predict new associations between human genes and disease phenotypes, a crucial step toward the discovery of new genes associated with human genetic diseases and cancer; d) scale well on datasets consisting of tens of thousands of examples (i.e., proteins or genes) and on taxonomies consisting of thousands of functional classes. Finally, the methods proposed in this thesis have been implemented in a software library written in R, HEMDAG (Hierarchical Ensemble Methods for DAG), which is public, freely downloadable, and available for the Linux, Windows, and Macintosh operating systems.
The standardized annotation of biomedical objects, often organized in dedicated catalogues, has strongly promoted the organization of biological concepts into controlled vocabularies, i.e. ontologies in which related terms of the underlying biological domain are structured according to a predefined hierarchy. Indeed, the scientific community has developed large ontologies to structure and organize the gene and protein taxonomy of all living organisms from Archaea to Metazoa, i.e. the Gene Ontology, as well as human-specific ontologies such as the Human Phenotype Ontology, which provides a structured taxonomy of the abnormal human phenotypes associated with diseases. 
These ontologies, offering a coded and well-defined classification space for biological entities such as genes and proteins, favor the development of machine learning methods able to predict features of biological objects, such as the association between a human gene and a disease, with the aim of driving wet-lab research, reducing costs, and making more effective use of the available research funds. Despite the soundness of these objectives, the resulting multi-label classification problems raise machine learning issues so complex that until recently by far the most common approach was “flat” prediction, i.e. simply training a classifier for each term in the controlled vocabulary and ignoring the relationships between terms. This approach was justified not only by the need to reduce the computational complexity of the learning task, but also by the somewhat “unstable” nature of the terms composing the controlled vocabularies, which were (and are) updated on a monthly basis in a process performed by expert curators and based on the biomedical literature and on wet-lab and in-silico experiments. In this context, two main general classes of classifiers have been proposed in the literature. On the one hand, “hierarchy-unaware” learning methods predict labels in a “flat” way without exploiting the inherent structure of the annotation space. On the other hand, “hierarchy-aware” learning methods can improve the accuracy and the precision of the predictions by considering the hierarchical relationships between ontology terms. Moreover, these methods can guarantee the consistency of the predicted labels according to the “true path rule”, the biological and logical rule that governs the internal coherence of biological ontologies. 
To properly handle the hierarchical relationships linking the ontology terms, two main classes of structured output methods have been proposed in the literature: the first is based on kernelized methods for structured output spaces, the second on hierarchical ensemble methods for ontology-based predictions. However, both approaches suffer from significant drawbacks. The kernel-based methods for structured output spaces are computationally intensive and do not scale well when applied to complex multi-label bio-ontologies. Most hierarchical ensemble methods have been conceived for tree-structured taxonomies, and the few specifically developed for prediction in DAG-structured output spaces are, in most cases, unable to improve prediction performance over flat methods. To overcome these limitations, this thesis develops novel “ontology-aware” ensemble methods able to handle DAG-structured ontologies, leveraging previous results obtained with “true-path-rule”-based hierarchical learning algorithms. These methods are highly modular in the sense that they adopt a “two-step” learning strategy: in the first step they learn each term of the ontology separately using flat methods, and in the second they combine the flat predictions according to the hierarchy of the classes. The main contributions of this thesis are both methodological and experimental. From a methodological standpoint, novel hierarchical ensemble methods are proposed, including: a) HTD (Hierarchical Top-Down algorithm for DAG-structured ontologies); b) TPR-DAG (True Path Rule ensemble for DAG) with several variants; c) ISO-TPR, a novel ensemble method that combines the True Path Rule approach with Isotonic Regression. For all these methods a formal proof of consistency is provided, i.e. a guarantee that their predictions “respect” the hierarchical relationships between classes. 
From an experimental standpoint, extensive genome- and ontology-wide results show that the proposed methods: a) are competitive with state-of-the-art prediction algorithms; b) are able to improve flat machine learning classifiers, provided the base learners give non-random predictions; c) are able to predict new associations between genes and human abnormal phenotypes, a crucial step in discovering novel genes associated with human diseases ranging from genetic disorders to cancer; d) scale well to large datasets and bio-ontologies. Finally, HEMDAG, a novel R library implementing the proposed hierarchical ensemble methods, has been developed and publicly released.
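
The “two-step” strategy described in this abstract (flat per-term scores first, then a hierarchy-aware combination) can be illustrated with a minimal Python sketch. It is not the HEMDAG implementation, and the abstract does not spell out the combination rule: the rule assumed here is a simple top-down pass in which a term's score is capped by the lowest corrected score among its parents. The function name `top_down_correct`, the toy DAG, and the scores are all hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def top_down_correct(flat_scores, parents):
    """Cap each term's flat score by the corrected scores of its parents.

    flat_scores: dict mapping term -> score from an independent ("flat") classifier.
    parents:     dict mapping term -> iterable of parent terms in the DAG.
    Assumption: every parent term also has an entry in flat_scores.
    """
    # TopologicalSorter expects node -> predecessors, so parents come out first.
    graph = {term: set(parents.get(term, ())) for term in flat_scores}
    corrected = {}
    for term in TopologicalSorter(graph).static_order():
        parent_scores = [corrected[p] for p in graph[term]]
        # A term can never score higher than any of its ancestors.
        corrected[term] = min([flat_scores[term], *parent_scores])
    return corrected


# Hypothetical toy ontology: C has two parents, so it is a DAG and not a tree.
flat = {"root": 0.9, "A": 0.6, "B": 0.8, "C": 0.7}
parents = {"A": ["root"], "B": ["root"], "C": ["A", "B"]}
corrected = top_down_correct(flat, parents)
print(corrected)
```

With this rule the corrected scores never increase along any path from the root, which is the kind of consistency the “true path rule” asks for; the score of `C` is lowered to that of its weaker parent `A`.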

    Data based system design and network analysis tools for chemical and biological processes

    Ph.D., Doctor of Philosophy

    Classification of Alzheimer's Disease and Mild Cognitive Impairment Using Longitudinal FDG-PET Images

    SUMMARY Alzheimer’s disease (AD) is the leading cause of degenerative disease and is characterized by an insidious onset, early memory loss, verbal and visuospatial deficits (associated with the destruction of the temporal and parietal lobes), a progressive course, and an absence of neurological signs early in the disease. No treatment is currently available to cure AD. Current treatments can often significantly slow the progression of the disease. The ability to diagnose AD at its initial stage has a major impact on clinical intervention and treatment planning, and thus reduces the costs associated with long-term care. Distinguishing between the different stages of dementia is essential for slowing the progression of AD. Differentiating between patients with AD, early mild cognitive impairment (EMCI), late mild cognitive impairment (LMCI), or a normal cognitive state (NC) is a research area that has attracted a great deal of interest over the past decade. Images obtained by positron emission tomography (PET) are among the best accessible methods for facilitating the distinction between these classes. From a neuroimaging point of view, fluorodeoxyglucose (FDG) PET images of cerebral glucose metabolism and PET images of amyloid plaques (AV45) are considered biomarkers with high diagnostic power. However, only a few approaches have studied the effectiveness of considering only the active areas localized by PET for classification purposes. The main research question of this work is to demonstrate the ability of PET images to classify accurately and to compare the results of two PET imaging methods (FDG and AV45). 
To determine the best way to classify subjects into the AD, EMCI, LMCI, or NC categories using PET images exclusively, we propose a pipeline that uses features learned from semantically labelled PET images. Support vector machines (SVMs) are already used for many classification tasks and are among the most widely used techniques for neuroimaging-based classification, such as for AD. Linear SVMs and radial basis function (RBF) SVMs, two popular kernels, are used in our classification. Principal component analysis (PCA) is used to reduce the dimensionality of the data, followed by linear SVMs, as another classification method. Random forests (RF) are also run to provide a point of comparison for the SVM results. The general objective of this work is to put together a set of existing tools for classifying AD and the different stages of MCI. Following the normalization and pre-processing steps, a deformable multimodal PET-MRI registration method is proposed to fuse the MNI atlas to each patient’s PET scan and to develop a simple atlas-based brain segmentation method that generates a labelled volume with 10 common regions of interest. The pipeline has two approaches: the first uses the voxel intensities of the regions of interest, and the second the voxel intensities of the whole brain. The method was tested on 660 subjects from the Alzheimer’s Disease Neuroimaging Initiative database and compared with an approach that included the whole brain. The classification accuracy for AD versus NC was measured at 91.7% and 91.2% using RBF-SVM and RF, respectively, on data combining multi-region FDG-PET features from the cross-sectional and follow-up exams. 
A considerable improvement was observed in the classification accuracy between EMCI and LMCI, with a rate of 72.5%. The classification accuracy for AD versus NC using AV45-PET with the combined data was measured at 90.8% and 87.9% for RBF-SVM and RF, respectively. This pipeline demonstrates the potential of multi-region PET features to improve cognitive assessment. The observed results confirm that it is possible to rely solely on PET images, without adding other biomarkers, to obtain high classification accuracy.
ABSTRACT Alzheimer’s disease (AD) is the most common cause of degenerative dementia, characterized by insidious onset, early memory loss, language and visuospatial deficits (associated with the destruction of the temporal and parietal lobes), a progressive course, and a lack of neurological signs early in the course of the disease. There is currently no cure for AD, but some treatments can slow the progression of the disease in its early stages. The ability to diagnose AD at an early stage has a great impact on clinical intervention and treatment planning, and hence reduces the costs associated with long-term care. In addition, discriminating between the different stages of dementia is crucial for slowing the progression of AD. Distinguishing patients with AD, early mild cognitive impairment (EMCI), late mild cognitive impairment (LMCI), and normal controls (NC) is an extremely active research area that has garnered significant attention in the past decade. Positron emission tomography (PET) images are among the best accessible means of discriminating between these classes. 
From a neuroimaging point of view, PET images of fluorodeoxyglucose (FDG) for cerebral glucose metabolism and amyloid plaque images (AV45) are considered highly powerful diagnostic biomarkers, but few approaches have investigated the efficacy of focusing on localized PET-active areas for classification purposes. The main research question of this work is to show that PET images can achieve accurate classification results and to compare the results of two PET imaging methods (FDG and AV45). To find the best scenario for classifying our subjects into AD, EMCI, LMCI, and NC using PET images exclusively, we propose a pipeline that uses features learned from semantically labelled PET images to perform group classification with four classifiers. Support vector machines (SVMs) are already applied in a wide variety of classification tasks and are among the most popular techniques in neuroimaging-based classification, such as for AD. SVMs with linear and radial basis function (RBF) kernels, two common choices, are used in our classification. Principal component analysis (PCA) is used to reduce the dimension of our data, followed by linear SVMs, as another classification method. Random forest (RF) is also applied to provide a comparison for our SVM results. The general objective of this work is to design a set of existing tools for classifying AD and the different stages of MCI. Following normalization and pre-processing steps, a multi-modal PET-MRI registration method is proposed to fuse the Montreal Neurological Institute (MNI) atlas to the PET images of each patient, which are registered to the corresponding MRI scan, developing a simple method of segmentation based on a brain atlas generated from a fully labelled volume with 10 common regions of interest (ROIs). This pipeline can be used in two ways: (1) using voxel intensities from specific regions of interest (multi-region approach), and (2) using voxel intensities from the entire brain (whole brain approach). 
The method was tested on 660 subjects from the Alzheimer’s Disease Neuroimaging Initiative database and compared to a whole-brain approach. The classification accuracy of AD vs NC was measured at 91.7% and 91.2% when using RBF-SVM and RF, respectively, when combining multi-region FDG-PET features from both cross-sectional and follow-up exams. A considerable improvement over similar works was achieved in EMCI vs LMCI classification, with an accuracy of 72.5%. The classification accuracy of AD versus NC using AV45-PET on the combined data was measured at 90.8% and 87.9% using RBF-SVM and RF, respectively. The pipeline demonstrates the potential of exploiting longitudinal multi-region PET features to improve cognitive assessment. We can achieve high accuracy using only PET images. This suggests that PET images are a rich source of discriminative information for this task. We note that other methods rely on the combination of multiple sources

    Reconstrução e classificação de sequências de ADN desconhecidas

    The continuous advances in DNA sequencing technologies and metagenomics techniques require reliable reconstruction and accurate classification methodologies to increase the diversity of the natural repository while contributing to the description and organization of organisms. However, after sequencing and de-novo assembly, one of the most complex challenges comes from the DNA sequences that do not match or resemble any biological sequence from the literature. Three main reasons contribute to this exception: the organism's sequence diverges strongly from the known organisms in the literature, an irregularity has been created in the reconstruction process, or a new organism has been sequenced. The inability to efficiently classify these unknown sequences increases the uncertainty about the sample's constitution and wastes an opportunity to discover new species, since such sequences are often discarded. In this context, the main objective of this thesis is the development and validation of a tool that provides an efficient computational solution to these three challenges based on an ensemble of experts, namely compression-based predictors, the distribution of sequence content, and normalized sequence lengths. The method uses both DNA and amino acid sequences and provides efficient classification beyond standard referential comparisons. Unusually, it classifies DNA sequences without resorting directly to the reference genomes, relying instead on features that the biological sequences of a species share. Specifically, it only makes use of features extracted individually from each genome, without using sequence comparisons. RFSC was then created as a machine learning classification pipeline that relies on an ensemble of experts to provide efficient classification in metagenomic contexts. 
This pipeline was tested on synthetic and real data, achieving in both cases precise and accurate results that, at the time this thesis was developed, had not been reported in the state of the art. Specifically, it achieved an accuracy of approximately 97% in domain/type classification.
The continuous advances in DNA sequencing technologies and metagenomics techniques require reliable reconstruction and accurate classification methodologies to increase the diversity of the natural repository, while contributing to the description and organization of organisms. However, after sequencing and de-novo assembly, one of the most complex challenges comes from DNA sequences that do not match or resemble any biological sequence in the literature. Three main reasons contribute to this exception: an irregularity emerged in the reconstruction process, the organism's sequence is highly dissimilar from the organisms in the literature, or a new and different organism was reconstructed. The inability to classify these unknown sequences efficiently increases the uncertainty about the sample's constitution and wastes the opportunity to discover new species, since they are often discarded. In this context, the main objective of this thesis is to provide an efficient computational solution to this challenge based on an ensemble of experts, namely compression-based predictors, the distribution of sequence content, and normalized sequence lengths. The method uses DNA and amino acid sequences and provides efficient classification beyond standard referential comparisons. Exceptionally, it classifies DNA sequences without resorting directly to reference genomes, relying instead on the features that the biological sequences of a species share. Specifically, it uses only features extracted individually from each genome, without using sequence comparisons. 
In addition, the pipeline is fully automatic and allows reference-free reconstruction of genomes from FASTQ reads, with the additional guarantee of secure storage of sensitive information. RFSC is thus a machine learning classification pipeline that relies on an ensemble of experts to provide efficient classification in metagenomic contexts. This pipeline was applied to synthetic and real data, achieving in both cases precise and accurate results that, at the time this dissertation was developed, had not been reported in the literature. Specifically, the tool achieved an accuracy of approximately 97% in domain/type classification.
Master's in Computer and Telematics Engineering
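
The three kinds of per-sequence features named in this abstract (a compression-based measure, the distribution of sequence content, and a normalized sequence length) can be sketched without any sequence-to-sequence comparison. The Python sketch below is only an illustration under stated assumptions: `zlib` is a general-purpose stand-in and not the compressors used by RFSC, GC content stands in for “sequence content”, and the function name `sequence_features` and the `reference_length` parameter are hypothetical.

```python
import random
import zlib


def sequence_features(seq, reference_length):
    """Compute reference-free features from a single DNA sequence.

    reference_length is a hypothetical scaling constant (for example, a
    typical genome length for the group of interest).
    """
    seq = seq.upper()
    raw = seq.encode("ascii")
    return {
        # Lower ratio means more repetitive, more compressible content.
        "compression_ratio": len(zlib.compress(raw, 9)) / len(raw),
        # Fraction of G and C bases, one simple summary of sequence content.
        "gc_content": (seq.count("G") + seq.count("C")) / len(seq),
        # Length relative to the chosen reference scale.
        "normalized_length": len(seq) / reference_length,
    }


rng = random.Random(0)  # fixed seed so the toy data is reproducible
repetitive = sequence_features("ACGT" * 250, reference_length=10_000)
varied = sequence_features(
    "".join(rng.choice("ACGT") for _ in range(1000)), reference_length=10_000
)
print(repetitive)
print(varied)
```

Feature dictionaries like these could then be fed to any standard classifier; which classifiers and compressors the thesis actually combines is not stated in the abstract.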

    A Machine Learning Approach for Detecting Selective Sweeps Using Ancient DNA

    Biological adaptation leads to specific patterns in population genetic data called selective sweeps. Although researchers have applied machine learning to sweep detection, which specific methods are appropriate for any given scenario is not well understood. We conducted a systematic review of a suite of machine learning (ML) classifiers for sweep detection. We found that accurate models can be built using simple, fast classifiers supported by preprocessing. We produced an ML workflow that is applicable to general population genetic problems. Our methods were extended to ancient DNA, showing that a sweep signal can be retrieved even at high missing rates. Thesis (MPhil) -- University of Adelaide, School of Mathematics, 202

    Ecohydrology of wetlands : monitoring and modelling interactions between groundwater, soil and vegetation
