
    Automated gene function prediction through gene multifunctionality in biological networks

    As the number of sequenced genomes rapidly grows, Automated Prediction of gene Function (AFP) has become a challenging problem. Despite significant progress in the last several years, the accuracy of gene function prediction still needs to improve before it can be used effectively in practice. Two of the main issues in the AFP problem are the imbalance of gene functional annotations and the 'multifunctional properties' of genes. While the former is a well-studied problem in machine learning, the latter has only recently emerged in bioinformatics and few studies have addressed it. Here we propose a method for AFP which appropriately handles the label imbalance characterizing biological taxonomies, and embeds in the model the property of some genes of being 'multifunctional'. We tested the method in predicting the functions of the Gene Ontology functional hierarchy for genes of the yeast and fly model organisms, in a genome-wide approach. The results show that cost-sensitive strategies and 'gene multifunctionality' can be combined to achieve significantly better results than the compared state-of-the-art algorithms for AFP.

    Progress and challenges in the computational prediction of gene function using networks: 2012-2013 update

    In an opinion published in 2012, we reviewed and discussed our studies of how gene network-based guilt-by-association (GBA) is impacted by confounds related to gene multifunctionality. We found that such confounds account for a significant part of the GBA signal, and as a result meaningfully evaluating and applying computationally-guided GBA is more challenging than generally appreciated. We proposed that effort currently spent on incrementally improving algorithms would be better spent on identifying the features of data that do yield novel functional insights. We also suggested that part of the problem is the reliance by computational biologists on gold standard annotations such as the Gene Ontology. In the year since, there has been continued heavy activity in GBA-based research, including work that contributes to our understanding of the issues we raised. Here we review recent work that is most relevant to these issues or that points to new areas of progress and challenges.

    Bioinformatics and Moonlighting Proteins

    Multitasking or moonlighting is the capability of some proteins to execute two or more biochemical functions. Usually, moonlighting proteins are revealed experimentally by serendipity. For this reason, it would be helpful if bioinformatics could predict this multifunctionality, especially given the large number of sequences from genome projects. In the present work, we analyse and describe several approaches that use sequences, structures, interactomics and current bioinformatics algorithms and programs to try to overcome this problem. Among these approaches are: a) remote homology searches using Psi-Blast, b) detection of functional motifs and domains, c) analysis of data from protein-protein interaction (PPI) databases, d) matching the query protein sequence to 3D databases (e.g., with algorithms such as PISITE), e) mutation correlation analysis between amino acids with algorithms such as MISTIC. Programs designed to identify functional motifs and domains mainly detect the canonical function but usually fail to detect the moonlighting one, with Pfam and ProDom being the best methods. Remote homology search by Psi-Blast combined with data from interactomics (PPI) databases has the best performance. Structural information and mutation correlation analysis can help to map the functional sites. Mutation correlation analysis can only be used in very specific situations (it requires multiply aligned sequences of the protein family) but can suggest how the evolutionary process of second function acquisition took place. The multitasking protein database MultitaskProtDB (http://wallace.uab.es/multitask/), previously published by our group, has been used as a benchmark for all of the analyses.

    Multitask Protein Function Prediction Through Task Dissimilarity

    Automated protein function prediction is a challenging problem with distinctive features, such as the hierarchical organization of protein functions and the scarcity of annotated proteins for most biological functions. We propose a multitask learning algorithm addressing both issues. Unlike standard multitask algorithms, which use task (protein function) similarity information as a bias to speed up learning, we show that dissimilarity information enforces separation of rare class labels from frequent class labels, and for this reason is better suited to solving unbalanced protein function prediction problems. We support our claim by showing that a multitask extension of the label propagation algorithm empirically works best when the task relatedness information is represented using a dissimilarity matrix as opposed to a similarity matrix. Moreover, the experimental comparison carried out on three model organisms shows that our method has a more stable performance in both "protein-centric" and "function-centric" evaluation settings.

    COSNet: An R package for label prediction in unbalanced biological networks

    Several problems in computational biology and medicine are modeled as learning problems in graphs, where nodes represent the biological entities to be studied, e.g. proteins, and connections represent different kinds of relationships among them, e.g. protein-protein interactions. In this context, classes are usually characterized by a high imbalance, i.e. positive examples for a class are far fewer than negative ones. Although several works have studied this problem, no graph-based software designed to explicitly take into account the label imbalance in biological networks is available. We propose COSNet, an R package to serve this purpose. COSNet deals with the label imbalance problem by implementing a novel parametric model of Hopfield Network (HN), whose output levels and neuron activation thresholds are parameters to be learnt automatically. Due to its quasi-linear time complexity, COSNet scales well when the number of instances is large, and application examples to challenging problems in biomedicine show the efficiency and the accuracy of the proposed library.

    Progress and challenges in the computational prediction of gene function using networks


    Computational design and designability of gene regulatory networks

    Our knowledge of molecular interactions has led us toward an engineering perspective, in which designs and implementations of artificial regulatory systems aim to provide fundamental instructions for cellular reprogramming. Here we address the design of gene networks as a way to deepen our understanding of natural regulation. We also address the problem of designability given a library of compatible elements. To this end, we apply heuristic optimization methods that implement routines for solving inverse problems, as well as mathematical analysis tools to study the dynamics of gene expression. Because the engineering of transcription networks has relied mainly on assembling a few regulatory elements using rational design principles, we develop a computational design framework to exploit this approach. Models associated with libraries were examined to uncover the genotypic space associated with a given phenotype. In addition, we developed a fully automated procedure to design non-coding RNA molecules with regulatory capability, based on a physicochemical model and exploiting allosteric regulation. The resulting RNA circuits implemented a post-transcriptional control mechanism for protein expression that could be combined with transcriptional elements. We also applied the heuristic methods to analyze the designability of metabolic pathways. Indeed, computational design methods can at the same time learn from natural mechanisms in order to exploit their fundamental principles. Thus, studying these systems allows us to deepen our understanding of genetic engineering. Of note, integral control and incoherent regulation are general strategies that organisms employ and that we analyze here.
    Rodrigo Tarrega, G. (2011). Computational design and designability of gene regulatory networks [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/1417

    Learning node labels with multi-category Hopfield networks

    In several real-world node label prediction problems on graphs, in fields ranging from computational biology to World Wide Web analysis, nodes can be partitioned into categories different from the classes to be predicted, on the basis of their characteristics or their common properties. Such partitions may provide further information about node classification that classical machine learning algorithms do not take into account. We introduce a novel family of parametric Hopfield networks (m-category Hopfield networks) and a novel algorithm (Hopfield multi-category, or HoMCat), designed to appropriately exploit the presence of property-based partitions of nodes into multiple categories. Moreover, the proposed model adopts a cost-sensitive learning strategy to prevent the marked decay in performance usually observed when instance labels are unbalanced, that is, when one class of labels is heavily underrepresented relative to the other. We validate the proposed model on both synthetic and real-world data, in the context of multi-species function prediction, where the classes to be predicted are the Gene Ontology terms and the categories are the different species in the multi-species protein network. We carried out an intensive experimental validation, which on the one hand compares HoMCat with several state-of-the-art graph-based algorithms, and on the other hand reveals that exploiting meaningful prior partitions of input data can substantially improve classification performance.

    Gene2DisCo: gene to disease using disease commonalities

    OBJECTIVE: Finding the human genes co-causing complex diseases, also known as "disease-genes", is one of the emerging and challenging tasks in biomedicine. This process, termed gene prioritization (GP), is characterized by a scarcity of known disease-genes for most diseases, and by a vast amount of heterogeneous data, usually encoded into networks describing different types of functional relationships between genes. In addition, different diseases may share common profiles (e.g. genetic or therapeutic profiles), and exploiting disease commonalities may significantly enhance the performance of GP methods. This work aims to provide a systematic comparison of several disease similarity measures, and to embed disease similarities and heterogeneous data into a flexible framework for gene prioritization which specifically handles the lack of known disease-genes. METHODS: We present a novel network-based method, Gene2DisCo, based on generalized linear models (GLMs) to effectively prioritize genes by exploiting data regarding disease-genes, gene interaction networks and disease similarities. The scarcity of disease-genes is addressed by applying an efficient negative selection procedure, together with imbalance-aware GLMs. Gene2DisCo is a flexible framework, in the sense it is not dependent upon specific types of data, and/or upon specific disease ontologies. RESULTS: On a benchmark dataset composed of nine human networks and 708 medical subject headings (MeSH) diseases, Gene2DisCo largely outperformed the best benchmark algorithm, kernelized score functions, in terms of both area under the ROC curve (0.94 against 0.86) and precision at given recall levels (for recall levels from 0.1 to 1 with steps 0.1). Furthermore, we enriched and extended the benchmark data to the whole human genome and provided the top-ranked unannotated candidate genes even for MeSH disease terms without known annotations