104 research outputs found

    Exploring the function and evolution of proteins using domain families

    Proteins are frequently composed of multiple domains which fold independently. These are often evolutionarily distinct units which can be adapted and reused in other proteins. The classification of protein domains into evolutionary families facilitates the study of their evolution and function. In this thesis such classifications are used firstly to examine methods for identifying evolutionary relationships (homology) between protein domains. Secondly a specific approach for predicting their function is developed. Lastly they are used in studying the evolution of protein complexes. Tools for identifying evolutionary relationships between proteins are central to computational biology. They aid in classifying families of proteins, giving clues about the function of proteins and the study of molecular evolution. The first chapter of this thesis concerns the effectiveness of cutting edge methods in identifying evolutionary relationships between protein domains. The identification of evolutionary relationships between proteins can give clues as to their function. The second chapter of this thesis concerns the development of a method to identify proteins involved in the same biological process. This method is based on the concept of domain fusion whereby pairs of proteins from one organism with a concerted function are sometimes found fused into single proteins in a different organism. Using protein domain classifications it is possible to identify these relationships. Most proteins do not act in isolation but carry out their function by binding to other proteins in complexes; little is understood about the evolution of such complexes. In the third chapter of this thesis the evolution of complexes is examined in two representative model organisms using protein domain families. In this work, protein domain superfamilies allow distantly related parts of complexes to be identified in order to determine how homologous units are reused
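The domain fusion idea described in this abstract can be illustrated with a minimal sketch. It assumes each protein is already annotated with a list of domain family identifiers; the function name `find_fusion_links` and the toy identifiers are hypothetical and are not taken from the thesis.

```python
from itertools import combinations


def find_fusion_links(query_proteins, reference_proteins):
    """Link pairs of query proteins whose domain families co-occur
    in a single protein of a reference organism.

    Both arguments map a protein name to a list of domain family IDs
    (a hypothetical input format chosen for this sketch).
    """
    # Index: domain family -> query proteins that contain it.
    family_to_proteins = {}
    for protein, families in query_proteins.items():
        for family in families:
            family_to_proteins.setdefault(family, set()).add(protein)

    links = {}
    for fused_protein, families in reference_proteins.items():
        # Every pair of distinct families in the reference protein is a
        # candidate fusion of two separate query proteins.
        for fam_a, fam_b in combinations(sorted(set(families)), 2):
            for p in family_to_proteins.get(fam_a, ()):
                for q in family_to_proteins.get(fam_b, ()):
                    if p == q:
                        continue
                    # Skip cases where one query protein already carries
                    # both families, since the two are then not separate.
                    if fam_b in query_proteins[p] or fam_a in query_proteins[q]:
                        continue
                    links.setdefault(tuple(sorted((p, q))), set()).add(fused_protein)
    return links


# Toy example: geneA and geneB are separate in the query organism,
# but their families appear together in fusedX in the reference organism.
query = {"geneA": ["fam1"], "geneB": ["fam2"], "geneC": ["fam3"]}
reference = {"fusedX": ["fam1", "fam2"]}
print(find_fusion_links(query, reference))
```

A real analysis would also need to handle promiscuous domains and domain order, which this sketch deliberately ignores.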

    Modular analysis of the probabilistic genetic interaction network

    Motivation: Epistatic Miniarray Profiles (EMAP) has enabled the mapping of large-scale genetic interaction networks; however, the quantitative information gained from EMAP cannot be fully exploited since the data are usually interpreted as a discrete network based on an arbitrary hard threshold. To address such limitations, we adopted a mixture modeling procedure to construct a probabilistic genetic interaction network and then implemented a Bayesian approach to identify densely interacting modules in the probabilistic network
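To illustrate how mixture modeling can replace a hard threshold, the sketch below fits a two-component Gaussian mixture to interaction scores with expectation-maximization and returns, for each score, the posterior probability of belonging to the low-score component. The two-component Gaussian form, the initialization, and the function name `interaction_probabilities` are assumptions made for this example; the abstract does not specify the authors' model.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def interaction_probabilities(scores, n_iter=100, min_sd=1e-3):
    """Posterior probability that each score belongs to the component
    initialized at the lowest score (assumed here to model strongly
    negative interactions)."""
    n = len(scores)
    overall_mean = sum(scores) / n
    variance = sum((s - overall_mean) ** 2 for s in scores) / n
    overall_sd = max(math.sqrt(variance), min_sd)

    # Component 0 starts at the minimum score, component 1 at the maximum.
    means = [min(scores), max(scores)]
    sds = [overall_sd, overall_sd]
    weights = [0.5, 0.5]
    resp = [0.5] * n

    for _ in range(n_iter):
        # E-step: responsibility of component 0 for every score.
        resp = []
        for s in scores:
            p0 = weights[0] * normal_pdf(s, means[0], sds[0])
            p1 = weights[1] * normal_pdf(s, means[1], sds[1])
            total = p0 + p1
            resp.append(p0 / total if total > 0 else 0.5)

        # M-step: re-estimate weight, mean and spread of each component.
        for k, r in enumerate((resp, [1.0 - x for x in resp])):
            mass = sum(r)
            if mass == 0:
                continue
            weights[k] = mass / n
            means[k] = sum(ri * s for ri, s in zip(r, scores)) / mass
            var_k = sum(ri * (s - means[k]) ** 2 for ri, s in zip(r, scores)) / mass
            sds[k] = max(math.sqrt(var_k), min_sd)  # floor avoids collapse

    return resp


# Toy scores: four strongly negative values and six near zero.
toy_scores = [-3.2, -2.9, -3.1, -2.8, 0.1, -0.2, 0.0, 0.3, -0.1, 0.2]
for score, prob in zip(toy_scores, interaction_probabilities(toy_scores)):
    print(f"{score:5.1f} -> {prob:.3f}")
```

The resulting probabilities can serve as edge weights in a probabilistic network, so no single cutoff has to be chosen in advance.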

    Development of computational tools and resources for systems biology of bacterial pathogens

    Bacterial pathogens are a major cause of diseases in humans, agricultural plants and farm animals. Even after decades of research they remain a challenge to health care as they are known to rapidly evolve and develop resistance to the existing drugs. Systems biology is an emerging area of research where all of the components of the system, their interactions, and the dynamics can be studied in a comprehensive, quantitative, and integrative fashion to generate predictive models. When applied to bacterial pathogenesis, systems biology approaches will help identify potential novel molecular targets for drug discovery. A pre-requisite for conducting systems analysis is the identification of the building blocks of the system, i.e. individual components of the system (structural annotation), identification of their functions (functional annotation) and identification of the interactions among the individual components (interaction prediction). In the context of bacterial pathogenesis, it is necessary to identify the host-pathogen interactions. This dissertation work describes computational resources that enable comprehensive systems-level study of host-pathogen systems to enhance our understanding of bacterial pathogenesis. It specifically focuses on improving the structural and functional annotation of pathogen genomes as well as identifying host-pathogen interactions at a genome scale. The novel contributions of this dissertation towards systems biology of bacterial pathogens include three computational tools/resources. “TAAPP” (Tiling array analysis pipeline for prokaryotes) is a web based tool for the analysis of whole genome tiling array data for bacterial pathogens. TAAPP helps improve the structural annotation of bacterial genomes. “ISO-IEA” (Inferred from sequence orthology - Inferred from electronic annotation) is a tool that can be used for the functional annotation of any sequenced genome. 
“HPIDB” (Host pathogen interaction database) is developed with a data mining capability that includes host-pathogen interaction prediction. The new knowledge gained from implementing these tools is the description of the non-coding RNA as well as a computationally predicted host-pathogen interaction network for the human respiratory pathogen Streptococcus pneumoniae. In summary, the computational tools and resources developed in this dissertation study will enable building systems biology models of bacterial pathogens

    Graph - Based Methods for Protein Function Prediction

    Ph.D. (Doctor of Philosophy)

    Scalable phylogenetic profiling using MinHash uncovers likely eukaryotic sexual reproduction genes

    Phylogenetic profiling is a computational method to predict genes involved in the same biological process by identifying protein families which tend to be jointly lost or retained across the tree of life. Phylogenetic profiling has customarily been more widely used with prokaryotes than eukaryotes, because the method is thought to require many diverse genomes. There are now many eukaryotic genomes available, but these are considerably larger, and typical phylogenetic profiling methods require at least quadratic time as a function of the number of genes. We introduce a fast, scalable phylogenetic profiling approach entitled HogProf, which leverages hierarchical orthologous groups for the construction of large profiles and locality-sensitive hashing for efficient retrieval of similar profiles. We show that the approach outperforms Enhanced Phylogenetic Tree, a phylogeny-based method, and use the tool to reconstruct networks and query for interactors of the kinetochore complex as well as conserved proteins involved in sexual reproduction: Hap2, Spo11 and Gex1. HogProf enables large-scale phylogenetic profiling across the three domains of life, and will be useful to predict biological pathways among the hundreds of thousands of eukaryotic species that will become available in the coming few years. HogProf is available at https://github.com/DessimozLab/HogProf
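The MinHash idea mentioned in this abstract can be sketched in a few lines. The example treats a phylogenetic profile as the set of taxa in which a gene family is present and estimates the Jaccard similarity of two profiles from fixed-length signatures. It uses only the standard library and is not HogProf's implementation; the function names, the seeded BLAKE2b hashing scheme, and the toy taxon labels are assumptions for illustration.

```python
import hashlib


def minhash_signature(items, num_hashes=256):
    """Return a MinHash signature for a non-empty collection of strings.

    Each of the num_hashes positions holds the minimum hash value over
    all items under a differently seeded hash function.
    """
    items = list(items)
    if not items:
        raise ValueError("cannot build a signature for an empty profile")
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, an estimate of the
    Jaccard similarity of the underlying sets."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Toy profiles: taxa in which two hypothetical gene families are present.
profile_a = {f"taxon{i}" for i in range(0, 30)}
profile_b = {f"taxon{i}" for i in range(10, 40)}

true_jaccard = len(profile_a & profile_b) / len(profile_a | profile_b)
estimate = estimated_jaccard(minhash_signature(profile_a), minhash_signature(profile_b))
print(f"true Jaccard = {true_jaccard:.2f}, MinHash estimate = {estimate:.2f}")
```

Because signatures have a fixed length regardless of profile size, they can be bucketed with locality-sensitive hashing so that similar profiles are retrieved without comparing every pair of families.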

    Human protein function prediction: application of machine learning for integration of heterogeneous data sources

    Experimental characterisation of protein cellular function can be prohibitively expensive and take years to complete. To address this problem, this thesis focuses on the development of computational approaches to predict function from sequence. For sequences with well characterised close relatives, annotation is trivial; orphans or distant homologues present a greater challenge. The use of a feature based method employing ensemble support vector machines to predict individual Gene Ontology classes is investigated. It is found that different combinations of feature inputs are required to recognise different functions. Although the approach is applicable to any human protein sequence, it is restricted to broadly descriptive functions. The method is well suited to prioritisation of candidate functions for novel proteins rather than to making highly accurate class assignments. Signatures of common function can be derived from different biological characteristics: interactions and binding events as well as expression behaviour. To investigate the hypothesis that common function can be derived from expression information, public domain human microarray datasets are assembled. The questions of how best to integrate these datasets and derive features that are useful in function prediction are addressed. Both co-expression and abundance information is represented between and within experiments and investigated for correlation with function. It is found that features derived from expression data serve as a weak but significant signal for recognising functions. This signal is stronger for biological processes than molecular function categories and independent of homology information. The protein domain has historically been regarded as a modular evolutionary unit of protein function. The occurrence of domains that can be linked by ancestral fusion events serves as a signal for domain-domain interactions. 
To exploit this information for function prediction, novel domain architecture and fused architecture scores are developed. Architecture scores rather than single domain scores correlate more strongly with function, and both architecture and fusion scores correlate more strongly with molecular functions than biological processes. The final study details the development of a novel heterogeneous function prediction approach designed to target the annotation of both homologous and non-homologous proteins. Support vector regression is used to combine pair-wise sequence features with expression scores and domain architecture scores to rank protein pairs in terms of their functional similarities. The target of the regression models represents the continuum of protein function space empirically derived from the Gene Ontology molecular function and biological process graphs. The merit and performance of the approach is demonstrated using homologous and non-homologous test datasets and significantly improves upon classical nearest neighbour annotation transfer by sequence methods. The final model represents a method that achieves a compromise between high specificity and sensitivity for all human proteins regardless of their homology status. It is expected that this strategy will allow for more comprehensive and accurate annotations of the human proteome

    Uncovering the Molecular Machinery of the Human Spindle—An Integration of Wet and Dry Systems Biology

    The mitotic spindle is an essential molecular machine involved in cell division, whose composition has been studied extensively by detailed cellular biology, high-throughput proteomics, and RNA interference experiments. However, because of its dynamic organization and complex regulation it is difficult to obtain a complete description of its molecular composition. We have implemented an integrated computational approach to characterize novel human spindle components and have analysed in detail the individual candidates predicted to be spindle proteins, as well as the network of predicted relations connecting known and putative spindle proteins. The subsequent experimental validation of a number of predicted novel proteins confirmed not only their association with the spindle apparatus but also their role in mitosis. We found that 75% of our tested proteins localize to the spindle apparatus compared to a success rate of 35% when expert knowledge alone was used. We compare our results to the previously published MitoCheck study and see that our approach does validate some findings by this consortium. Further, we predict so-called “hidden spindle hubs”, proteins whose network of interactions is still poorly characterised by experimental means and which are thought to influence the functionality of the mitotic spindle on a large scale. Our analyses suggest that we are still far from knowing the complete repertoire of functionally important components of the human spindle network. Combining integrated bio-computational approaches and single gene experimental follow-ups could be key to exploring the still hidden regions of the human spindle system

    A structural classification of protein-protein interactions for detection of convergently evolved motifs and for prediction of protein binding sites on sequence level

    BACKGROUND: A long-standing challenge in the post-genomic era of Bioinformatics is the prediction of protein-protein interactions, and ultimately the prediction of protein functions. The problem is intrinsically harder when only amino acid sequences are available, but a solution is more universally applicable. So far, the problem of uncovering protein-protein interactions has been addressed in a variety of ways, both experimentally and computationally. MOTIVATION: The central problem is: How can protein complexes with solved three-dimensional structure be utilized to identify and classify protein binding sites and how can knowledge be inferred from this classification such that protein interactions can be predicted for proteins without solved structure? The underlying hypothesis is that protein binding sites are often restricted to a small number of residues, which additionally often are well-conserved in order to maintain an interaction. Therefore, the signal-to-noise ratio in binding sites is expected to be higher than in other parts of the surface. This enables binding site detection in unknown proteins when homology based annotation transfer fails. APPROACH: The problem is addressed by first investigating how geometrical aspects of domain-domain associations can lead to a rigorous structural classification of the multitude of protein interface types. The interface types are explored with respect to two aspects: First, how do interface types with one-sided homology reveal convergently evolved motifs? Second, how can sequential descriptors for local structural features be derived from the interface type classification? Then, the use of sequential representations for binding sites in order to predict protein interactions is investigated. The underlying algorithms are based on machine learning techniques, in particular Hidden Markov Models. RESULTS: This work includes a novel approach to a comprehensive geometrical classification of domain interfaces. 
Alternative structural domain associations are found for 40% of all family-family interactions. Evaluation of the classification algorithm on a hand-curated set of interfaces yielded a precision of 83% and a recall of 95%. For the first time, a systematic screen of convergently evolved motifs in 102,000 protein-protein interactions with structural information is derived. With respect to this dataset, all cases related to viral mimicry of human interface bindings are identified. Finally, a library of 740 motif descriptors for binding site recognition, encoded as Hidden Markov Models, is generated and cross-validated. Tests for the significance of motifs are provided. The usefulness of descriptors for protein-ligand binding sites is demonstrated for the case of "ATP-binding", where a precision of 89% is achieved, thus outperforming comparable motifs from PROSITE. In particular, a novel descriptor for a P-loop variant has been used to identify ATP-binding sites in 60 protein sequences that have not been annotated before by existing motif databases