15 research outputs found

    Hierarchical ensemble methods for protein function prediction

    Protein function prediction is a complex multiclass, multilabel classification problem, characterized by multiple issues such as the incompleteness of the available annotations, the integration of multiple sources of high-dimensional biomolecular data, the imbalance of several functional classes, and the difficulty of univocally determining negative examples. Moreover, the hierarchical relationships between functional classes that characterize both the Gene Ontology and FunCat taxonomies motivate the development of hierarchy-aware prediction methods, which have shown significantly better performance than hierarchy-unaware "flat" prediction methods. In this paper, we provide a comprehensive review of hierarchical methods for protein function prediction based on ensembles of learning machines. According to this general approach, a separate learning machine is trained to learn a specific functional term, and the resulting predictions are then assembled into a "consensus" ensemble decision that takes into account the hierarchical relationships between classes. The main hierarchical ensemble methods proposed in the literature are discussed in the context of existing computational methods for protein function prediction, highlighting their characteristics, advantages, and limitations. Finally, open problems in this exciting research area of computational biology are considered, outlining novel perspectives for future research.
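To make the idea of a hierarchy-aware consensus concrete, here is a minimal sketch of one simple top-down rule: a term's score may never exceed the corrected score of any of its parents. This is only an illustration under stated assumptions, not a method taken from the review; the function name `top_down_consistency`, the toy term names, and the scores are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def top_down_consistency(scores, parents):
    """Clip each term's score so it never exceeds any parent's corrected score.

    scores:  {term: flat score in [0, 1] from that term's own classifier}
    parents: {term: list of parent terms}; roots may be omitted.
    Assumes every parent listed also has an entry in `scores`.
    """
    # TopologicalSorter takes node -> predecessors, so parents are visited first.
    graph = {term: parents.get(term, ()) for term in scores}
    corrected = {}
    for term in TopologicalSorter(graph).static_order():
        limit = min((corrected[p] for p in parents.get(term, ())), default=1.0)
        corrected[term] = min(scores[term], limit)
    return corrected


# Hypothetical toy hierarchy: root -> A -> A.1, root -> B
flat_scores = {"root": 0.9, "A": 0.4, "A.1": 0.7, "B": 0.8}
hierarchy = {"A": ["root"], "B": ["root"], "A.1": ["A"]}
consistent = top_down_consistency(flat_scores, hierarchy)
print(consistent)
```

In this toy case the flat score of `A.1` (0.7) is higher than that of its parent `A` (0.4), so the sketch lowers it to 0.4. Other consensus rules, such as bottom-up propagation of positive predictions, are equally possible.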

    Hierarchical multi-label classification for protein function prediction going beyond traditional approaches

    Hierarchical multi-label classification is a variant of traditional classification in which instances can belong to several labels, which are in turn organized in a hierarchy. Functional classification of genes is a challenging problem in functional genomics for several reasons. First, each gene participates in multiple biological activities. Hence, prediction models should support multi-label classification. Second, genes are organized and classified according to a hierarchical classification scheme that represents the relationships between the functions of the genes. These relationships should be maintained by the prediction models. In addition, various biomolecular data sources, such as gene expression data and protein-protein interaction data, can be used to assign biological functions to genes. Therefore, the integration of multiple data sources is required to acquire a precise picture of the roles of genes in living organisms by uncovering novel biology in the form of previously unknown functional annotations. To address these issues, the presented work deals with hierarchical multi-label classification. The purpose of this thesis is threefold. First, a Hierarchical Multi-Label classification algorithm using Boosting classifiers, HML-Boosting, is proposed for the hierarchical multi-label classification problem in the context of gene function prediction. HML-Boosting exploits the predefined hierarchical dependencies among the classes. We demonstrate, using two approaches for class-membership inconsistency correction during the testing phase (the top-down approach and the bottom-up approach), that the HML-Boosting algorithm outperforms the flat classifier approach. 
Moreover, the author proposed the HiBLADE algorithm (Hierarchical multi-label Boosting with LAbel DEpendency), a novel algorithm that not only takes advantage of the pre-established hierarchical taxonomy of the classes, but also effectively exploits the hidden correlation among the classes that is not shown through the class hierarchy, thereby improving the quality of the predictions. According to the proposed approach, first, the pre-defined hierarchical taxonomy of the labels is used to decide upon the training set for each classifier. Second, the dependencies of the children for each label in the hierarchy are captured and analyzed using a Bayes method and instance-based similarity. The primary objective of the proposed algorithm is to find and share a number of base models across the correlated labels. HiBLADE differs from conventional algorithms in two ways. First, it allows the prediction of multiple functions for genes at the same time while maintaining the hierarchy constraint. Second, the classifiers are built based on the label under study and its most similar sibling. Experimental results on several real-world biomolecular datasets show that the proposed method can improve the performance of hierarchical multi-label classification. More important, however, is the third part, which focuses on the integration of multiple heterogeneous data sources for improving hierarchical multi-label classification. Unlike most previous works, which mostly consider a single data source for gene function prediction, the author explores the integration of heterogeneous data sources for genome-wide gene function prediction. The integration of multiple heterogeneous data sources is addressed with a novel Hierarchical Bayesian iNtegration algorithm, HiBiN, a general framework that uses Bayesian reasoning to integrate heterogeneous data sources for accurate gene function prediction. 
The system formally uses posterior probabilities to assign class memberships to samples using multiple data sources while maintaining the hierarchical constraint that governs the annotation of the genes. The author demonstrates, through HiBiN, that the integration of the diverse datasets significantly improves the classification quality for hierarchical gene function prediction in terms of several measures, compared to the two baselines: single-source prediction models and a fused-flat model. Moreover, the system has been extended to include a weighting scheme that controls the contribution of each data source according to its relevance to the label under study. The results show that the new weighting scheme compares favorably with the other approach along various performance criteria.

    Graph-Based Methods for Protein Function Prediction

    Ph.D. (Doctor of Philosophy)

    New Probabilistic Graphical Models and Meta-Learning Approaches for Hierarchical Classification, with Applications in Bioinformatics and Ageing

    This interdisciplinary work proposes new hierarchical classification algorithms and evaluates them on biological datasets, and specifically on ageing-related datasets. Hierarchical classification is a type of classification task where the classes to be predicted are organized into a hierarchical structure. The focus on ageing is justified by the increasing impact that ageing-related diseases have on the human population and by the increasing amount of freely available ageing-related data. The main contributions of this thesis are as follows. First, we improve the running time of a previously proposed hierarchical classification algorithm based on an extension of the well-known Naive Bayes classification algorithm. We show that our modification greatly improves the runtime of the hierarchical classification algorithm, maintaining its predictive performance. We also propose four new hierarchical classification algorithms. The focus on hierarchical classification algorithms and their evaluation on biological data is justified as the class labels of biological data are commonly organized into class hierarchies. Two of our four new hierarchical classification algorithms - the "Hierarchical Dependence Network" (HDN) and the "Hierarchical Dependence Network algorithm based on finding non-Hierarchically related Predictive Classes" (HDN-nHPC) - are based on Dependence Networks, a relatively new type of probabilistic graphical model that has not yet received a lot of attention from the classification community. The other two hierarchical classification algorithms we proposed are hybrid algorithms that use the hierarchical classification models produced by the Predictive Clustering Tree (PCT) algorithm. One of the hybrids combines the models produced by the PCT algorithm and a Local Hierarchical Classification (LHC) algorithm (which basically induces a local model for each class in the hierarchy). 
The other hybrid combines the models produced by the PCT and HDN algorithms. We have tested our four proposed algorithms and four other commonly used hierarchical classification algorithms on 42 hierarchical classification datasets. 20 of these datasets were created by us and are freely available for researchers. We have concluded that, for one out of the three hierarchical predictive accuracy measures used in our experiments, one of our four new algorithms (the HDN-nHPC algorithm) outperforms all other seven algorithms in terms of average rank across the 42 hierarchical classification datasets. We have also proposed the first meta-learning approach for hierarchical classification problems. In meta-learning, each meta-instance represents a dataset, meta-features represent dataset properties, and meta-classes represent the best classification algorithm for the corresponding dataset (meta-instance). Hence, meta-learning techniques for classification use the predictive performance of some candidate classification algorithms in previously tested datasets, and dataset descriptors (the meta-features), to infer the performance of those candidate classification algorithms in new datasets, given the meta-features of those new datasets. The predictions of our meta-learning system can be used as a guide to choose which hierarchical classification algorithm (out of a set of candidate ones) to use on a new dataset, without the need for time-consuming trial and error experiments with those candidate algorithms. This is particularly important for hierarchical classification problems, as the training time of hierarchical classification algorithms tends to be much greater than the training time of 'flat' classification algorithms. This increased training time is mainly due to the typically much greater number of class labels that annotate the instances of hierarchical classification problems. 
We have tested the predictive power of our meta-learning system and interpreted some of the generated meta-models. We have concluded that our meta-learning system had good predictive performance when compared to other baseline meta-learning approaches. We have also concluded that the meta-rules generated by our meta-learning system were useful for identifying dataset characteristics that assist the choice of hierarchical classification algorithm. Finally, we have reviewed the current practice of applying supervised machine learning (classification and regression) algorithms to study the biology of ageing. This review discusses the main findings of such algorithms in the context of the ageing biology literature. We have also interpreted some of the hierarchical classification models generated in our experiments. Both the above literature review and the interpretation of some models were performed in collaboration with an ageing expert, in order to extract relevant information for ageing research.
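The abstract compares algorithms by their average rank across datasets. As a rough illustration of how such a figure can be computed, the sketch below ranks algorithms within each dataset (rank 1 is best, ties share the mean of their positions) and averages the ranks over datasets. The function name `average_ranks`, the dataset and algorithm names, and the scores are hypothetical and are not taken from the thesis; the sketch also assumes every algorithm was run on every dataset.

```python
def average_ranks(results):
    """Average per-dataset rank of each algorithm.

    results: {dataset: {algorithm: score}}, where a higher score is better.
    Tied algorithms share the mean of the rank positions they occupy.
    """
    totals = {}
    for scores in results.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        i = 0
        while i < len(ordered):
            j = i
            # Extend j over the group of algorithms tied with ordered[i].
            while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
                j += 1
            tied_rank = ((i + 1) + (j + 1)) / 2  # mean of 1-based positions i+1..j+1
            for name in ordered[i:j + 1]:
                totals[name] = totals.get(name, 0.0) + tied_rank
            i = j + 1
    return {name: total / len(results) for name, total in totals.items()}


# Hypothetical toy scores for three algorithms on two datasets
toy = {
    "d1": {"X": 0.8, "Y": 0.6, "Z": 0.6},
    "d2": {"X": 0.5, "Y": 0.7, "Z": 0.4},
}
ranks = average_ranks(toy)
print(ranks)
```

A lower average rank indicates an algorithm that tends to place nearer the top across datasets, independent of the absolute scale of the accuracy measure.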

    Computational functional annotation of crop genomics using hierarchical orthologous groups

    Improving agronomically important traits, such as yield, is important in order to meet the ever-growing demand for increased crop production. Knowledge of the genes that have an effect on a given trait can be used to enhance genomic selection by predicting biologically interesting loci. Candidate genes that are strongly linked to a desired trait can then be targeted by transformation or genome editing. Prioritising genetic material in this way can accelerate crop improvement. However, its application is currently limited by the lack of accurate annotations and of methods to integrate experimental data with evolutionary relationships. Hierarchical orthologous groups (HOGs) provide nested groups of genes that enable the comparison of both highly diverged and similar species in a consistent manner. Over 2,250 species are included in the OMA project, resulting in over 600,000 HOGs. This thesis provides the required methodology and a tool to exploit this rich source of information, in the HOGPROP algorithm. Its potential is then demonstrated by mining crop genome data from metabolic QTL studies, utilising Gene Ontology (GO) annotations as well as ChEBI (Chemical Entities of Biological Interest) terms in order to prioritise candidate causal genes. Gauging the performance of the tool is also important. When considering GO annotations, the CAFA series of community experiments has provided the most extensive benchmarking to date. However, this has not fully taken into account the incomplete knowledge of protein function – the open world assumption (OWA). Doing so requires extra negative annotations, for which one source has been identified based on expertly curated gene phylogenies. These negative annotations are then utilised in the proposed OWA-compliant, improved framework for benchmarking. 
The results show that current benchmarks tend to focus on the general terms, which means that conclusions are not merely uninformative, but misleading.

    Bioinformatics protocols for analysis of functional genomics data applied to neuropathy microarray datasets

    Microarray technology allows the simultaneous measurement of the abundance of thousands of transcripts in living cells. The high-throughput nature of microarray technology means that automatic analytical procedures are required to handle the sheer amount of data typically generated in a single microarray experiment. Along these lines, this work contributes to the automatic analysis of microarray data by attempting to construct protocols for the validation of publicly available methods for microarray analysis. At the experimental level, an evaluation of the amplification of RNA targets prior to hybridisation with the physical array was undertaken. This had the important consequence of revealing the extent to which the significance of intensity ratios between varying biological conditions may be compromised following amplification, as well as identifying the underlying cause of this effect. On the basis of these findings, recommendations regarding the usability of RNA amplification protocols with microarray screening were drawn in the context of varying microarray experimental conditions. On the data analysis side, this work has had the important outcome of developing an automatic framework for the validation of functional analysis methods for microarray data. This is based on using a GO semantic similarity scoring metric to assess the similarity between functional terms found enriched by functional analysis of a model dataset and those anticipated from prior knowledge of the biological phenomenon under study. Using this validation system, this work has shown, for the first time, that ‘Catmap’, an early functional analysis method, performs better than the more recent and most popular methods of its kind. Crucially, the effectiveness of this validation system implies that it may be reliably adopted for the validation of newly developed functional analysis methods for microarray data.

    Unraveling the replication process of Toxoplasma gondii through the MOB1 protein

    Doctoral thesis in Veterinary Sciences, specialty of Biological and Biomedical Sciences, scientific area: Animal Health. ABSTRACT - MOB1 is a conserved protein that regulates cellular proliferation versus apoptosis, centrosome duplication and cellular differentiation in multicellular eukaryotes, and also cytokinesis and division axis orientation in unicellular and multicellular eukaryotes. Toxoplasma gondii, an obligate intracellular parasite of veterinary and medical importance, has a single MOB1 protein. T. gondii interconverts between several cellular stages during its life cycle, namely between the fast-replicating tachyzoite and slow-replicating bradyzoite stages during its asexual cycle, a key ability for its success as a parasite. Bradyzoites produce tissue cysts, establishing a chronic infection that enables recrudescence. Conversion is dependent on cell cycle regulation and involves cell differentiation and regulation of replication. This led us to select MOB1 as a strong candidate to be involved in the Toxoplasma replication process. We employed reverse genetics to assess Mob1 function in T. gondii. In contrast to what was observed in other unicellular eukaryotes, such as Tetrahymena and Trypanosoma, Mob1 knockout in T. gondii showed no cytokinesis impairment in its asexual cycle. Instead, we observed an increase in replication, a decrease in parasitophorous vacuole regularity and a significant loss in tachyzoite-to-bradyzoite conversion. Additionally, recombinant MOB1 accumulates in a midline between the daughter nuclei at the end of mitosis, suggesting that MOB1 may be involved in this process. To elucidate how MOB1 acts in T. gondii, we employed a proximity biotinylation method and identified the MOB1 interactome. This analysis detected proteins related to several functional categories, indicating a multivalent role for MOB1 regulated by the ubiquitin proteasome system. 
We also verified that the Mob1 locus is transcribed from both genomic strands and gives rise to alternatively spliced variants. Our results indicate that MOB1 is tightly regulated along the cell cycle and along the life cycle of T. gondii, contributing to the control of replication and tachyzoite-bradyzoite differentiation. SUMMARY - Unraveling the replication process of Toxoplasma gondii through the MOB1 protein - Proteins of the monospindle one binder (MOB) family are kinase signal adaptors, highly conserved in eukaryotes, that are often essential for the survival of cells and organisms. Historically, MOB proteins were described as kinase activators that participate in the mitotic exit network/septation initiation network (MEN/SIN) signaling pathways, which have central roles in regulating cytokinesis, cell polarity, cell proliferation and cell fate to control organ growth and regeneration. In metazoans, MOB proteins act as central signal adaptors of the core kinase module mammalian sterile 20 (MST) 1/2, large tumor suppressor (LATS) 1/2 and nuclear dbf2-related (NDR) 1/2, which phosphorylate the transcriptional co-activators Yes associated protein (YAP)/WW domain containing transcription regulator 1 (TAZ), effectors of the Hippo signaling pathway. More recently, MOB proteins have also been shown to have non-kinase partners and to be involved in cilia biology, indicating that their activity and regulation are more diverse than initially assumed. In particular, MOB1 proteins regulate the balance between cell proliferation and apoptosis, centrosome duplication and cell differentiation in multicellular eukaryotes, as well as cytokinesis and division axis orientation in unicellular and multicellular eukaryotes. 
Toxoplasma gondii is an obligate intracellular parasite of great veterinary and medical importance, affecting a wide variety of animal species and present throughout the world. Its distribution and the fact that it can potentially parasitize all warm-blooded animals have contributed to its being considered the most successful parasite. Although its hosts generally do not develop clinical forms, T. gondii can cause severe disease in immunocompromised hosts and lead to abortion and congenital malformations if the infection is acquired during gestation. T. gondii has a complex life cycle, with an asexual phase and a sexual phase. Throughout its life cycle, it is able to interconvert between several cellular stages, namely between the fast-replicating stages (tachyzoites) and the slow-replicating stages (bradyzoites) during the asexual phase, a key ability for its success as a parasite. ...

    Program and abstracts from the 24th Fungal Genetics Conference

    Abstracts of the plenary and poster sessions from the 24th Fungal Genetics Conference, March 20-25, 2007, Pacific Grove, CA

    FunCat Functional Inference with Belief Propagation and Feature Integration.

    Pairwise comparison of sequence data is intensively used for automated functional protein annotation, while graphical models emerge as promising candidates for the integration of various heterogeneous features. We designed a model, termed hRMN, that integrates different genomic features, and implemented a variant of belief propagation for functional annotation transfer. hRMN allows the assignment of multiple functional categories while avoiding common problems in annotation transfer from heterogeneous datasets, such as the independence of the investigated datasets. We benchmarked this system with large-scale annotation transfer (based on the MIPS FunCat ontology) to proteins of the prokaryotes Bacillus subtilis, Helicobacter pylori, Listeria monocytogenes, and Listeria innocua. hRMN consistently outperformed two competitors in the annotation of these four bacterial genomes. The developed code is available for download at http://mips.gsf.de/proj/bfab/bfab/hRMN.html