7 research outputs found

    FunSimMat: a comprehensive functional similarity database

    Functional similarity based on Gene Ontology (GO) annotation is used in diverse applications like gene clustering, gene expression data analysis, protein interaction prediction and evaluation. However, there exists no comprehensive resource of functional similarity values although such a database would facilitate the use of functional similarity measures in different applications. Here, we describe FunSimMat (Functional Similarity Matrix, http://funsimmat.bioinf.mpi-inf.mpg.de/), a large new database that provides several different semantic similarity measures for GO terms. It offers various precomputed functional similarity values for proteins contained in UniProtKB and for protein families in Pfam and SMART. The web interface allows users to efficiently perform both semantic similarity searches with GO terms and functional similarity searches with proteins or protein families. All results can be downloaded in tab-delimited files for use with other tools. An additional XML-RPC interface gives automatic online access to FunSimMat for programs and remote services.

    FunSimMat update: new features for exploring functional similarity

    Quantifying the functional similarity of genes and their products based on Gene Ontology annotation is an important tool for diverse applications like the analysis of gene expression data, the prediction and validation of protein functions and interactions, and the prioritization of disease genes. The Functional Similarity Matrix (FunSimMat, http://www.funsimmat.de) is a comprehensive database providing various precomputed functional similarity values for proteins in UniProtKB and for protein families in Pfam and SMART. With this update, we significantly increase the coverage of FunSimMat by adding data from the Gene Ontology Annotation project as well as new functional similarity measures. The applicability of the database is greatly extended by the implementation of a new Gene Ontology-based method for disease gene prioritization. Two new visualization tools allow an interactive analysis of the functional relationships between proteins or protein families. This is enhanced further by the introduction of an automatically derived hierarchy of annotation classes. Additional changes include a revised user front-end and a new RESTlike interface for improving the user-friendliness and online accessibility of FunSimMat.

    Growing functional modules from a seed protein via integration of protein interaction and gene expression data

    Background: Modern biology aims at unravelling the strands of complex biological structures such as protein-protein interaction (PPI) networks. A key concept in the organization of PPI networks is the existence of dense subnetworks (functional modules) within them. In recent approaches, clustering algorithms were applied to these networks and the resulting subnetworks were evaluated by estimating their coverage of well-established protein complexes. However, most of these algorithms operate on an unweighted graph structure, which fails to emphasize the interactions that would contribute to biologically more valid and coherent functional modules. Results: In the current study, we present a method that integrates protein interaction and microarray data to discover biologically valid functional modules. Initially, the gene expression information is overlaid as weights onto the PPI network; the enriched PPI graph allows us to exploit its topological aspects while simultaneously highlighting enhanced functional association in specific pairs of proteins. We then present an algorithm that unveils the functional modules of the weighted graph by expanding a kernel protein set, which originates from a given 'seed' protein used as the starting point. Conclusion: The integrated data and the concept of our approach provide reliable functional modules. We give evidence based on yeast data that our method produces accurate results in terms of both structural coherency and functional consistency.

    Hierarchical multi-label classification for protein function prediction going beyond traditional approaches

    Hierarchical multi-label classification is a variant of traditional classification in which instances can belong to several labels that are in turn organized in a hierarchy. Functional classification of genes is a challenging problem in functional genomics for several reasons. First, each gene participates in multiple biological activities. Hence, prediction models should support multi-label classification. Second, the genes are organized and classified according to a hierarchical classification scheme that represents the relationships between the functions of the genes. These relationships should be maintained by the prediction models. In addition, various biomolecular data sources, such as gene expression data and protein-protein interaction data, can be used to assign biological functions to genes. Therefore, the integration of multiple data sources is required to acquire a precise picture of the roles of the genes in living organisms by uncovering novel biology in the form of previously unknown functional annotations. To address these issues, the presented work deals with hierarchical multi-label classification. The purpose of this thesis is threefold. First, a hierarchical multi-label classification algorithm using boosting classifiers, HML-Boosting, is proposed for the hierarchical multi-label classification problem in the context of gene function prediction. HML-Boosting exploits the predefined hierarchical dependencies among the classes. We demonstrate, through HML-Boosting and using two approaches for class-membership inconsistency correction during the testing phase, the top-down approach and the bottom-up approach, that the HML-Boosting algorithm outperforms the flat classifier approach.
Moreover, the author proposes the HiBLADE algorithm (Hierarchical multi-label Boosting with LAbel DEpendency), a novel algorithm that not only takes advantage of the pre-established hierarchical taxonomy of the classes, but also exploits the hidden correlation among the classes that is not shown through the class hierarchy, thereby improving the quality of the predictions. According to the proposed approach, first, the pre-defined hierarchical taxonomy of the labels is used to decide upon the training set for each classifier. Second, the dependencies of the children for each label in the hierarchy are captured and analyzed using the Bayes method and instance-based similarity. The primary objective of the proposed algorithm is to find and share a number of base models across the correlated labels. HiBLADE differs from conventional algorithms in two ways. First, it allows the prediction of multiple functions for genes at the same time while maintaining the hierarchy constraint. Second, the classifiers are built based on the label under study and its most similar sibling. Experimental results on several real-world biomolecular datasets show that the proposed method can improve the performance of hierarchical multi-label classification. More important, however, is the third part, which focuses on the integration of multiple heterogeneous data sources for improving hierarchical multi-label classification. Unlike most previous works, which consider a single data source for gene function prediction, the author explores the integration of heterogeneous data sources for genome-wide gene function prediction. The integration of multiple heterogeneous data sources is addressed with a novel Hierarchical Bayesian iNtegration algorithm, HiBiN, a general framework that uses Bayesian reasoning to integrate heterogeneous data sources for accurate gene function prediction.
The system formally uses posterior probabilities to assign class memberships to samples using multiple data sources while maintaining the hierarchical constraint that governs the annotation of the genes. The author demonstrates, through HiBiN, that the integration of the diverse datasets significantly improves the classification quality for hierarchical gene function prediction in terms of several measures, compared to the baselines: single-source prediction models and a fused-flat model. Moreover, the system has been extended to include a weighting scheme to control the contributions from each data source according to its relevance to the label under study. The results show that the new weighting scheme compares favorably with the other approach along various performance criteria.

    Ontology-based similarity measures and their application in bioinformatics

    Genome-wide sequencing projects of many different organisms produce large numbers of sequences that are functionally characterized using experimental and bioinformatics methods. Following the development of the first bio-ontologies, knowledge of the functions of genes and proteins is increasingly made available in a standardized format. This allows for devising approaches that directly exploit functional information using semantic and functional similarity measures. This thesis addresses different aspects of the development and application of such similarity measures. First, we analyze semantic and functional similarity measures and apply them for investigating the functional space in different taxa. Second, a new software program and a new database are described, which overcome limitations of existing tools and simplify the utilization of similarity measures for different applications. Third, we delineate two applications of our functional similarity measures. We utilize them for analyzing domain and protein interaction datasets and derive thresholds for grouping predicted domain interactions into low- and high-confidence subsets. We also present the new MedSim method for prioritization of candidate disease genes, which is based on the observation that genes and proteins contributing to similar diseases are functionally related. We demonstrate that the MedSim method performs at least as well as more complex state-of-the-art methods and significantly outperforms current methods that also utilize functional annotation.

    Understanding patient experience from online medium

    Improving patient experience at hospitals leads to better health outcomes. To improve it, we must first understand and interpret patients' written feedback. Patient-generated texts, such as patient reviews on RateMD or posts in online health forums on WebMD, are where patients write about their experiences. Given the massive amount of patient-generated text that exists online, an automated approach to identifying topics from a patient experience taxonomy is the only realistic option for analyzing these texts. However, not only is there a lack of annotated taxonomy on these media, but word usage is also colloquial, making it challenging to apply standardized NLP techniques to identify the topics present in patient-generated texts. Furthermore, patients may describe multiple topics in a single text, which drastically increases the complexity of the task. In this thesis, we address the challenges in comprehensively and automatically understanding the patient experience from patient-generated texts. We first built a set of rich semantic features to represent the corpus, which helps capture meanings that may not typically be captured by the bag-of-words (BOW) model. Unlike the BOW model, semantic feature representation captures the context and in-depth meaning behind each word in the corpus. To the best of our knowledge, no existing work on understanding patient experience from patient-generated texts examines which semantic features help capture the characteristics of the corpus. Furthermore, patients generally discuss multiple topics when they write, and these topics are frequently interdependent. There are two types of topic interdependencies: those between topics that are semantically similar, and those between topics that are not.
We built a constraint-based deep neural network classifier to capture the two types of topic interdependencies and empirically show the classification performance improvement over the baseline approaches. Past research has also indicated that patient experiences differ depending on patient segments [1-4]. The segments can be based on demographics, for instance, race, gender, or geographical location. Similarly, the segments can be based on health status, for example, whether or not the patient is taking medication, has a particular disease, or is readmitted to the hospital. To better understand patient experiences, we built an automated approach to identify patient segments, with a focus on whether or not the person has stopped taking their medication. The technique used to identify the patient segment is general enough that we expect the approach to be applicable to other types of patient segments. With a comprehensive understanding of patient experiences, we envision an application system where clinicians can directly read the patient-generated texts most relevant to their interests. The system can capture the topics from the patient experience taxonomy that are of interest to each clinician or designated expert, and we believe the system is one of many approaches that can ultimately help improve the patient experience.

    Alzheimer’s Dementia Recognition Through Spontaneous Speech
