1,524 research outputs found

    Prediction of Human Phenotype Ontology terms by means of hierarchical ensemble methods

    Background: The prediction of human gene–abnormal phenotype associations is a fundamental step toward the discovery of novel genes associated with human disorders, especially when no genes are known to be associated with a specific disease. In this context the Human Phenotype Ontology (HPO) provides a standard categorization of the abnormalities associated with human diseases. While the prediction of gene–disease associations has been widely investigated, the related problem of gene–phenotypic feature (i.e., HPO term) associations has been largely overlooked, even though no HPO term associations are known for most human genes and despite the increasing application of the HPO to relevant medical problems. Moreover, most of the methods proposed in the literature cannot capture the hierarchical relationships between HPO terms, which results in inconsistent and relatively inaccurate predictions. Results: We present two hierarchical ensemble methods that we formally prove to provide biologically consistent predictions according to the hierarchical structure of the HPO. The modular structure of the proposed methods, which consists of a “flat” learning first step and a hierarchical combination of the predictions in the second step, allows the predictions of virtually any flat learning method to be enhanced. The experimental results show that hierarchical ensemble methods are able to predict novel associations between genes and abnormal phenotypes with results that are competitive with state-of-the-art algorithms and with a significant reduction of the computational complexity. Conclusions: Hierarchical ensembles are efficient computational methods that guarantee biologically meaningful predictions that obey the true path rule, and can be used as a tool to improve and make consistent the HPO term predictions starting from virtually any flat learning method. 
The implementation of the proposed methods is available as an R package from the CRAN repository.
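The two-step scheme described in this abstract (flat scores first, hierarchical combination second) can be illustrated with a minimal Python sketch. This is not the code of the R package and not necessarily either of the two methods the authors propose: the function name `top_down_correction`, the toy DAG, and the scores are all assumptions made for illustration. The sketch caps each term's flat score by the smallest corrected score among its parents, which is one simple way to make predictions obey the true path rule.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def top_down_correction(parents, flat_scores):
    """Return scores in which no term scores higher than any of its parents.

    parents: dict mapping each term to a list of its parent terms (hypothetical toy DAG).
    flat_scores: dict mapping each term to a score in [0, 1] from any flat classifier.
    """
    # TopologicalSorter expects node -> predecessors, so parents come before children.
    order = TopologicalSorter(parents).static_order()
    corrected = {}
    for term in order:
        score = flat_scores[term]
        # Cap the flat score by every parent's already-corrected score.
        for parent in parents.get(term, []):
            score = min(score, corrected[parent])
        corrected[term] = score
    return corrected

# Toy ontology (assumed): term C has two parents, A and B.
toy_parents = {"root": [], "A": ["root"], "B": ["root"], "C": ["A", "B"]}
toy_flat = {"root": 0.9, "A": 0.4, "B": 0.7, "C": 0.8}
corrected = top_down_correction(toy_parents, toy_flat)
```

In the toy example the flat score of C (0.8) exceeds that of its parent A (0.4), which would violate the true path rule; after the correction C is capped at 0.4 while the other scores are unchanged.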

    Exploring and Exploiting Disease Interactions from Multi-Relational Gene and Phenotype Networks

    The availability of electronic health care records is unlocking the potential for novel studies on understanding and modeling disease co-morbidities based on both phenotypic and genetic data. Moreover, the emergence of increasingly reliable phenotypic data can aid further studies investigating the potential genetic links among diseases. The goal is to create a feedback loop where computational tools guide and facilitate research, leading to improved biological knowledge and clinical standards, which in turn should generate better data. We build and analyze disease interaction networks based on data collected from previous genetic association studies and patient medical histories, spanning over 12 years, acquired from a regional hospital. By exploring both individual and combined interactions among these two levels of disease data, we provide novel insight into the interplay between genetics and clinical realities. Our results show a marked difference between the well-defined structure of genetic relationships and the chaotic co-morbidity network, but also highlight clear interdependencies. We demonstrate the power of these dependencies by proposing a novel multi-relational link prediction method, showing that disease co-morbidity can enhance our currently limited knowledge of genetic association. Furthermore, our methods for integrated networks of diverse data are widely applicable and can provide novel advances for many problems in systems biology and personalized medicine.

    Large-scale automated protein function prediction

    Includes bibliographical references. 2016 Summer. Proteins are the workhorses of life, and identifying their functions is a very important biological problem. The function of a protein can be loosely defined as everything it performs or that happens to it. The Gene Ontology (GO) is a structured vocabulary which captures protein function in a hierarchical manner and contains thousands of terms. Through various wet-lab experiments over the years, scientists have been able to annotate a large number of proteins with GO categories which reflect their functionality. However, experimentally determining protein functions is a highly resource-intensive task, and a large fraction of proteins remain unannotated. Recently a plethora of automated methods have emerged, and their reasonable success in computationally determining the functions of proteins using a variety of data sources – by sequence/structure similarity or using various biological network data – has led to establishing automated function prediction (AFP) as an important problem in bioinformatics. In a typical machine learning problem, cross-validation is the protocol of choice for evaluating the accuracy of a classifier. But, because annotations accumulate over time, we identify AFP as a combination of two sub-tasks: making predictions on annotated proteins and making predictions on previously unannotated proteins. In our first project, we analyze the performance of several protein function prediction methods in these two scenarios. Our results show that GOstruct, an AFP method that our lab has previously developed, and two other popular methods, binary SVMs and guilt by association, find it hard to achieve on these two tasks the level of accuracy measured through cross-validation, and that predicting novel annotations for previously annotated proteins is a harder problem than predicting annotations for uncharacterized proteins. 
We develop GOstruct 2.0 by proposing improvements which allow the model to use information about a protein's current annotations to better handle the task of predicting novel annotations for previously annotated proteins. Experimental results on yeast and human data show that GOstruct 2.0 outperforms the original GOstruct, demonstrating the effectiveness of the proposed improvements. Although the biomedical literature is a very informative resource for identifying protein function, most AFP methods do not take advantage of the large amount of information contained in it. In our second project, we conduct the first ever comprehensive evaluation of the effectiveness of literature data for AFP. Specifically, we extract co-mentions of protein-GO term pairs and bag-of-words features from the literature and explore their effectiveness in predicting protein function. Our results show that literature features are very informative of protein function, but with room for further improvement. In order to improve the quality of automatically extracted co-mentions, we formulate the classification of co-mentions as a supervised learning problem and propose a novel method based on graph kernels. Experimental results indicate the feasibility of using this co-mention classifier as a complementary method that aids the bio-curators who are responsible for maintaining databases such as Gene Ontology. This is the first study of the problem of protein-function relation extraction from biomedical text. The recently developed Human Phenotype Ontology (HPO), which is very similar to GO, is a standardized vocabulary for describing the phenotype abnormalities associated with human diseases. At present, only a small fraction of human protein-coding genes have HPO annotations. But researchers believe that a large portion of currently unannotated genes are related to disease phenotypes. Therefore, it is important to predict gene-HPO term associations using accurate computational methods. 
In our third project, we introduce PHENOstruct, a computational method that directly predicts the set of HPO terms for a given gene. We compare PHENOstruct with several baseline methods and show that it outperforms them in every respect. Furthermore, we highlight a collection of informative data sources suitable for the problem of predicting gene-HPO associations, including large-scale literature mining data.

    Gene2DisCo : gene to disease using disease commonalities

    OBJECTIVE: Finding the human genes co-causing complex diseases, also known as "disease-genes", is one of the emerging and challenging tasks in biomedicine. This process, termed gene prioritization (GP), is characterized by a scarcity of known disease-genes for most diseases, and by a vast amount of heterogeneous data, usually encoded into networks describing different types of functional relationships between genes. In addition, different diseases may share common profiles (e.g. genetic or therapeutic profiles), and exploiting disease commonalities may significantly enhance the performance of GP methods. This work aims to provide a systematic comparison of several disease similarity measures, and to embed disease similarities and heterogeneous data into a flexible framework for gene prioritization which specifically handles the lack of known disease-genes. METHODS: We present a novel network-based method, Gene2DisCo, based on generalized linear models (GLMs) to effectively prioritize genes by exploiting data regarding disease-genes, gene interaction networks and disease similarities. The scarcity of disease-genes is addressed by applying an efficient negative selection procedure, together with imbalance-aware GLMs. Gene2DisCo is a flexible framework, in the sense that it does not depend on specific types of data or on specific disease ontologies. RESULTS: On a benchmark dataset composed of nine human networks and 708 medical subject headings (MeSH) diseases, Gene2DisCo largely outperformed the best benchmark algorithm, kernelized score functions, in terms of both area under the ROC curve (0.94 against 0.86) and precision at given recall levels (for recall levels from 0.1 to 1 with steps of 0.1). Furthermore, we enriched and extended the benchmark data to the whole human genome and provided the top-ranked unannotated candidate genes even for MeSH disease terms without known annotations.
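
The "precision at given recall levels" metric mentioned in the results can be sketched in a few lines of Python. The abstract does not state the exact definition the authors used, so this sketch assumes one common convention: for each recall level, report the best precision reached at any ranking cutoff whose recall is at least that level. The function name `precision_at_recall_levels` and the toy labels and scores are hypothetical, and the code assumes at least one positive label.

```python
def precision_at_recall_levels(labels, scores, levels):
    """Map each recall level to the best precision reached at recall >= that level.

    labels: list of 0/1 values (1 = known disease-gene), assumed to contain a 1.
    scores: list of prioritization scores, higher meaning more likely.
    levels: iterable of recall levels in (0, 1].
    """
    # Rank genes from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    points = []  # (recall, precision) after each ranked gene
    hits = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        hits += label
        points.append((hits / total_pos, hits / rank))
    # Small tolerance guards against floating-point noise in the recall levels.
    return {
        level: max(p for r, p in points if r >= level - 1e-12)
        for level in levels
    }

# Toy data (assumed): five genes, two of them true disease-genes.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
recall_levels = [k / 10 for k in range(1, 11)]  # 0.1, 0.2, ..., 1.0
result = precision_at_recall_levels(toy_labels, toy_scores, recall_levels)
```

With the toy data, recall levels up to 0.5 are met by the top-ranked gene alone (precision 1.0), while higher levels need both positives, which are recovered within the top three genes (precision 2/3).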

    HIERARCHICAL ENSEMBLE METHODS FOR ONTOLOGY-BASED PREDICTIONS IN COMPUTATIONAL BIOLOGY

    The standardized annotation of biological entities, such as genes and proteins, has strongly promoted the organization of biological concepts into controlled vocabularies, that is, ontologies that make it possible to index consistently the relationships between the different functional classes organized according to a predefined hierarchy. Examples of biological ontologies in which the functional terms are structured as a directed acyclic graph (DAG) are the Gene Ontology (GO) and the Human Phenotype Ontology (HPO). The scientific community uses these hierarchical taxonomies, respectively, to systematize the protein functions of all living organisms from Archaea to Metazoa and to categorize the phenotypic abnormalities associated with human diseases. By offering a well-defined classification space, these bio-ontologies have fostered the development of learning methods for the automated prediction of protein function and of gene–pathological phenotype associations in humans. The aim of these methodologies is to “steer” “in vitro” research so as to reduce costs and make more effective use of research funds. From a machine learning standpoint, the prediction of protein function or of gene–pathological phenotype associations in humans can be modeled as a structured multi-label classification problem, in which the predictions associated with each example (i.e., gene or protein) are sub-graphs organized according to a given structure (tree or DAG). Because of the complexity of the classification problem, the most commonly used prediction approach to date is the “flat” one, which consists of training a separate classifier for each term of the ontology without considering the hierarchical relationships among the functional classes. 
This approach is justified not only by the reduction in the computational complexity of the learning problem, but also by the “unstable” nature of the terms that make up the ontology itself. These terms are in fact updated monthly through an expert-curated process based both on the biomedical scientific literature and on experimental data obtained from “in vitro” or “in silico” experiments. In this context, two general classes of classifiers have been proposed in the literature. On the one hand, there are machine learning methods that predict the functional classes in a “flat” way, that is, without exploring the intrinsic structure of the annotation space. On the other hand, there are hierarchical approaches which, by explicitly considering the hierarchical relationships among the functional terms of the ontology, guarantee that the predicted annotations respect the “true path rule”, the biological rule that governs ontologies. Among hierarchical methods, two different categories of approaches have been proposed in the literature. The first is based on kernelized methods for structured-output prediction, the second on hierarchical ensemble methods. Both have drawbacks. The former are computationally heavy and do not scale well when applied to biological ontologies. The latter were mostly conceived for tree-structured taxonomies, and the few approaches specifically designed for DAG-structured ontologies are in most cases unable to improve the prediction performance of “flat” methods. To overcome these limitations, this thesis proposes new hierarchical ensemble methods capable of providing predictions consistent with the hierarchical structure of the ontology. 
On the one hand, these approaches extend methods originally developed for tree-structured ontologies to DAG-structured ontologies; on the other, they significantly improve predictions over the “flat” approach regardless of the type of classifier used. In their most general form, hierarchical ensemble approaches are highly modular, in that they adopt a two-step learning strategy. In the first step, the functional classes of the ontology are learned independently of one another; in the second step, the “flat” predictions are suitably combined, taking into account the hierarchy among the ontology classes. The main contributions of this thesis are both methodological and experimental. From a methodological standpoint, the following new hierarchical ensemble methods are proposed: a) HTD-DAG (Hierarchical Top-Down for DAG-structured taxonomies); b) TPR-DAG (True Path Rule for DAG) with several algorithmic variants; c) ISO-TPR (True Path Rule with Isotonic Regression), a new hierarchical algorithm that combines the True Path Rule with isotonic regression methods. The consistency of the predictions has been formally proven for all the hierarchical ensemble methods; that is, the proposed approaches have been shown to provide predictions that respect the hierarchical relationships among the classes. 
From an experimental standpoint, results covering the whole genome of model organisms and of humans, and all the classes included in the biological ontologies, show that the proposed approaches: a) are competitive with state-of-the-art structured-output prediction algorithms; b) can improve “flat” classifiers, provided that the predictions supplied by the classifier are not random; c) can predict new associations between human genes and pathological phenotypes, a crucial step toward the discovery of new genes associated with human genetic diseases and cancer; d) scale well to datasets made up of tens of thousands of examples (i.e., proteins or genes) and to taxonomies made up of thousands of functional classes. Finally, the methods proposed in this thesis have been implemented in a software library written in R, HEMDAG (Hierarchical Ensemble Methods for DAG), which is public, freely downloadable, and available for the Linux, Windows and Macintosh operating systems.
The standardized annotation of biomedical objects, often organized in dedicated catalogues, has strongly promoted the organization of biological concepts into controlled vocabularies, i.e. ontologies in which related terms of the underlying biological domain are structured according to a predefined hierarchy. Indeed, large ontologies have been developed by the scientific community to structure and organize the gene and protein taxonomy of all living organisms from Archaea to Metazoa, i.e. the Gene Ontology, as well as human-specific ontologies such as the Human Phenotype Ontology, which provides a structured taxonomy of the abnormal human phenotypes associated with diseases. 
These ontologies, offering a coded and well-defined classification space for biological entities such as genes and proteins, favor the development of machine learning methods able to predict features of biological objects, such as the association between a human gene and a disease, with the aim of guiding wet-lab research, reducing costs and making more effective use of the available research funds. Despite the soundness of these objectives, the resulting multi-label classification problems raise such complex machine learning issues that until recently by far the most common approach was “flat” prediction, i.e. simply training a classifier for each term in the controlled vocabulary and ignoring the relationships between terms. This approach was justified not only by the need to reduce the computational complexity of the learning task, but also by the somewhat “unstable” nature of the terms composing the controlled vocabularies, because they were (and are) updated on a monthly basis in a process performed by expert curators and based on the biomedical literature and on wet-lab and in-silico experiments. In this context, two main general classes of classifiers have been proposed in the literature. On the one hand, “hierarchy-unaware” learning methods predict labels in a “flat” way without exploiting the inherent structure of the annotation space. On the other hand, “hierarchy-aware” learning methods can improve the accuracy and the precision of the predictions by considering the hierarchical relationships between ontology terms. Moreover, these methods can guarantee the consistency of the predicted labels according to the “true path rule”, that is, the biological and logical rule that governs the internal coherence of biological ontologies. 
To properly handle the hierarchical relationships linking the ontology terms, two main classes of structured output methods have been proposed in the literature: the first is based on kernelized methods for structured output spaces, the second on hierarchical ensemble methods for ontology-based predictions. However, both approaches suffer from significant drawbacks. The kernel-based methods for structured output spaces are computationally intensive and do not scale well when applied to complex multi-label bio-ontologies. Most hierarchical ensemble methods have been conceived for tree-structured taxonomies, and the few specifically developed for prediction in DAG-structured output spaces are, in most cases, unable to improve prediction performance over flat methods. To overcome these limitations, in this thesis novel “ontology-aware” ensemble methods have been developed, able to handle DAG-structured ontologies, leveraging previous results obtained with “true-path-rule”-based hierarchical learning algorithms. These methods are highly modular in the sense that they adopt a “two-step” learning strategy: in the first step they learn each term of the ontology separately using flat methods, and in the second they properly combine the flat predictions according to the hierarchy of the classes. The main contributions of this thesis are both methodological and experimental. From a methodological standpoint, novel hierarchical ensemble methods are proposed, including: a) HTD (Hierarchical Top-Down algorithm for DAG-structured ontologies); b) TPR-DAG (True Path Rule ensemble for DAG) with several variants; c) ISO-TPR, a novel ensemble method that combines the True Path Rule approach with Isotonic Regression. For all these methods a formal proof of their consistency, i.e. the guarantee of providing predictions that “respect” the hierarchical relationships between classes, is provided. 
From an experimental standpoint, extensive genome- and ontology-wide results show that the proposed methods: a) are competitive with state-of-the-art prediction algorithms; b) are able to improve flat machine learning classifiers, if the base learners can provide non-random predictions; c) are able to predict new associations between genes and human abnormal phenotypes, a crucial step to discover novel genes associated with human diseases ranging from genetic disorders to cancer; d) scale nicely with large datasets and bio-ontologies. Finally, HEMDAG, a novel R library implementing the proposed hierarchical ensemble methods, has been developed and publicly released.

    Knowledge Discovery in Biological Databases for Revealing Candidate Genes Linked to Complex Phenotypes

    Genetics and “omics” studies designed to uncover genotype to phenotype relationships often identify large numbers of potential candidate genes, among which the causal genes are hidden. Scientists generally lack the time and technical expertise to review all relevant information available from the literature, from key model species and from a potentially wide range of related biological databases in a variety of data formats with variable quality and coverage. Computational tools are needed for the integration and evaluation of heterogeneous information in order to prioritise candidate genes and components of interaction networks that, if perturbed through potential interventions, have a positive impact on the biological outcome in the whole organism without producing negative side effects. Here we review several bioinformatics tools and databases that play an important role in biological knowledge discovery and candidate gene prioritization. We conclude with several key challenges that need to be addressed in order to facilitate biological knowledge discovery in the future.

    PREDICT: a method for inferring novel drug indications with application to personalized medicine

    The authors present a new method, PREDICT, for the large-scale prediction of drug indications, and demonstrate its use on both approved drugs and novel molecules. They also provide a proof-of-concept for its potential utility in predicting patient-specific medications.

    Statistical Methods in Integrative Genomics

    Statistical methods in integrative genomics aim to answer important biological questions by jointly analyzing multiple types of genomic data (vertical integration) or aggregating the same type of data across multiple studies (horizontal integration). In this article, we introduce different types of genomic data and data resources, and then review statistical methods of integrative genomics, with emphasis on the motivation and rationale of these methods. We conclude with some summary points and future research directions.

    A framework for identifying genotypic information from clinical records: exploiting integrated ontology structures to transfer annotations between ICD codes and Gene Ontologies

    Although some methods have been proposed for automatic ontology generation, none of them addresses the issue of integrating large-scale heterogeneous biomedical ontologies. We propose a novel approach for integrating various types of ontologies efficiently and apply it to integrate the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD9CM) and Gene Ontologies (GO). This approach is one of the early attempts to quantify the associations among clinical terms (e.g. ICD9 codes) based on their corresponding genomic relationships. We reconstructed a merged tree for a partial set of GO and ICD9 codes and measured the performance of this tree in terms of the associations’ relevance by comparing them with two well-known disease-gene datasets (i.e. MalaCards and Disease Ontology). Furthermore, we compared the genomic-based ICD9 associations to the temporal relationships between them in electronic health records. Our analysis shows promising associations supported by both comparisons, suggesting high reliability. We also manually analyzed several significant associations and found promising support in the literature.