1,865 research outputs found

    On the detection of functionally coherent groups of protein domains with an extension to protein annotation

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Protein domains coordinate to perform multifaceted cellular functions, and domain combinations serve as the functional building blocks of the cell. The available methods to identify functional domain combinations are limited in their scope, e.g. to the identification of combinations falling within individual proteins or within specific regions in a translated genome. Further effort is needed to identify groups of domains that span across two or more proteins and are linked by a cooperative function. Such functional domain combinations can be useful for protein annotation.</p> <p>Results</p> <p>Using a new computational method, we have identified 114 groups of domains, referred to as domain assembly units (DASSEM units), in the proteome of budding yeast <it>Saccharomyces cerevisiae</it>. The units participate in many important cellular processes such as transcription regulation, translation initiation, and mRNA splicing. Within the units the domains were found to function in a cooperative manner; and each domain contributed to a different aspect of the unit's overall function. The member domains of DASSEM units were found to be significantly enriched among proteins contained in transcription modules, defined as genes sharing similar expression profiles and presumably similar functions. The observation further confirmed the functional coherence of DASSEM units. The functional linkages of units were found in both functionally characterized and uncharacterized proteins, which enabled the assessment of protein function based on domain composition.</p> <p>Conclusion</p> <p>A new computational method was developed to identify groups of domains that are linked by a common function in the proteome of <it>Saccharomyces cerevisiae</it>. These groups can either lie within individual proteins or span across different proteins. We propose that the functional linkages among the domains within the DASSEM units can be used as a non-homology based tool to annotate uncharacterized proteins.</p

    Exploring the function and evolution of proteins using domain families

    Get PDF
    Proteins are frequently composed of multiple domains which fold independently. These are often evolutionarily distinct units which can be adapted and reused in other proteins. The classification of protein domains into evolutionary families facilitates the study of their evolution and function. In this thesis such classifications are used firstly to examine methods for identifying evolutionary relationships (homology) between protein domains. Secondly a specific approach for predicting their function is developed. Lastly they are used in studying the evolution of protein complexes. Tools for identifying evolutionary relationships between proteins are central to computational biology. They aid in classifying families of proteins, giving clues about the function of proteins and the study of molecular evolution. The first chapter of this thesis concerns the effectiveness of cutting edge methods in identifying evolutionary relationships between protein domains. The identification of evolutionary relationships between proteins can give clues as to their function. The second chapter of this thesis concerns the development of a method to identify proteins involved in the same biological process. This method is based on the concept of domain fusion whereby pairs of proteins from one organism with a concerted function are sometimes found fused into single proteins in a different organism. Using protein domain classifications it is possible to identify these relationships. Most proteins do not act in isolation but carry out their function by binding to other proteins in complexes; little is understood about the evolution of such complexes. In the third chapter of this thesis the evolution of complexes is examined in two representative model organisms using protein domain families. In this work, protein domain superfamilies allow distantly related parts of complexes to be identified in order to determine how homologous units are reused

    Graph Theory and Networks in Biology

    Get PDF
    In this paper, we present a survey of the use of graph theoretical techniques in Biology. In particular, we discuss recent work on identifying and modelling the structure of bio-molecular networks, as well as the application of centrality measures to interaction networks and research on the hierarchical structure of such networks and network motifs. Work on the link between structural network properties and dynamics is also described, with emphasis on synchronization and disease propagation.Comment: 52 pages, 5 figures, Survey Pape

    Assessing Performance of Orthology Detection Strategies Applied to Eukaryotic Genomes

    Get PDF
    Orthology detection is critically important for accurate functional annotation, and has been widely used to facilitate studies on comparative and evolutionary genomics. Although various methods are now available, there has been no comprehensive analysis of performance, due to the lack of a genomic-scale ‘gold standard’ orthology dataset. Even in the absence of such datasets, the comparison of results from alternative methodologies contains useful information, as agreement enhances confidence and disagreement indicates possible errors. Latent Class Analysis (LCA) is a statistical technique that can exploit this information to reasonably infer sensitivities and specificities, and is applied here to evaluate the performance of various orthology detection methods on a eukaryotic dataset. Overall, we observe a trade-off between sensitivity and specificity in orthology detection, with BLAST-based methods characterized by high sensitivity, and tree-based methods by high specificity. Two algorithms exhibit the best overall balance, with both sensitivity and specificity>80%: INPARANOID identifies orthologs across two species while OrthoMCL clusters orthologs from multiple species. Among methods that permit clustering of ortholog groups spanning multiple genomes, the (automated) OrthoMCL algorithm exhibits better within-group consistency with respect to protein function and domain architecture than the (manually curated) KOG database, and the homolog clustering algorithm TribeMCL as well. By way of using LCA, we are also able to comprehensively assess similarities and statistical dependence between various strategies, and evaluate the effects of parameter settings on performance. In summary, we present a comprehensive evaluation of orthology detection on a divergent set of eukaryotic genomes, thus providing insights and guides for method selection, tuning and development for different applications. Many biological questions have been addressed by multiple tests yielding binary (yes/no) outcomes but no clear definition of truth, making LCA an attractive approach for computational biology

    What is hidden in the darkness? Deep-learning assisted large-scale protein family curation uncovers novel protein families and folds

    Get PDF
    Driven by the development and upscaling of fast genome sequencing and assembly pipelines, the number of protein-coding sequences deposited in public protein sequence databases is increasing exponentially. Recently, the dramatic success of deep learning-based approaches applied to protein structure prediction has done the same for protein structures. We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database. These models cover most of the catalogued natural proteins, including those difficult to annotate for function or putative biological role based on standard, homology-based approaches. In this work, we quantified how much of such "dark matter" of the natural protein universe was structurally illuminated by AlphaFold2 and modelled this diversity as an interactive sequence similarity network that can be navigated at https://uniprot3d.org/atlas/AFDB90v4 . In the process, we discovered multiple novel protein families by searching for novelties from sequence, structure, and semantic perspectives. We added a number of them to Pfam, and experimentally demonstrate that one of these belongs to a novel superfamily of toxin-antitoxin systems, TumE-TumA. This work highlights the role of large-scale, evolution-driven protein comparison efforts in combination with structural similarities, genomic context conservation, and deep-learning based function prediction tools for the identification of novel protein families, aiding not only annotation and classification efforts but also the curation and prioritisation of target proteins for experimental characterisation

    Pairwise gene GO-based measures for biclustering of high-dimensional expression data

    Get PDF
    Background: Biclustering algorithms search for groups of genes that share the same behavior under a subset of samples in gene expression data. Nowadays, the biological knowledge available in public repositories can be used to drive these algorithms to find biclusters composed of groups of genes functionally coherent. On the other hand, a distance among genes can be defined according to their information stored in Gene Ontology (GO). Gene pairwise GO semantic similarity measures report a value for each pair of genes which establishes their functional similarity. A scatter search-based algorithm that optimizes a merit function that integrates GO information is studied in this paper. This merit function uses a term that addresses the information through a GO measure. Results: The effect of two possible different gene pairwise GO measures on the performance of the algorithm is analyzed. Firstly, three well known yeast datasets with approximately one thousand of genes are studied. Secondly, a group of human datasets related to clinical data of cancer is also explored by the algorithm. Most of these data are high-dimensional datasets composed of a huge number of genes. The resultant biclusters reveal groups of genes linked by a same functionality when the search procedure is driven by one of the proposed GO measures. Furthermore, a qualitative biological study of a group of biclusters show their relevance from a cancer disease perspective. Conclusions: It can be concluded that the integration of biological information improves the performance of the biclustering process. The two different GO measures studied show an improvement in the results obtained for the yeast dataset. However, if datasets are composed of a huge number of genes, only one of them really improves the algorithm performance. This second case constitutes a clear option to explore interesting datasets from a clinical point of view.Ministerio de Economía y Competitividad TIN2014-55894-C2-

    Functional coherence and annotation agreement metrics for enzyme families

    Get PDF
    Tese de doutoramento, Informática (Bioinformática), Universidade de Lisboa, Faculdade de Ciências, 2015A range of methodologies is used to create sequence annotations, from manual curation by specialized curators to several automatic procedures. The multitude of existing annotation methods consequently generates an annotation heterogeneity in terms of coverage and specificity across the biological sequence space. When comparing groups of similar sequences (such as protein families) this heterogeneity can introduce issues regarding the interpretation of the actual functional similarity and the overall functional coherence. A direct path to mitigate these issues is the annotation extension within the protein families under analysis. This thesis postulates that the protein families can be used as knowledgebases for their own annotation extension with the assistance of a proper functional coherence analysis. Therefore, a modular framework for functional coherence analysis and annotation extension in protein families was proposed. The framework includes a proposed module for functional coherence analysis that relies on graph visualization, term enrichment and other statistics. In this work it was implemented and made available as a publicly accessible web application, GRYFUN which can be accessed at http://xldb.di.fc.ul.pt/gryfun/. In addition, four metrics were developed to assess distinct aspects of the coherence and completeness in protein families in conjunction with additional existing metrics. Therefore the use of the complete proposed framework by curators can be regarded as a semi-automatic approach to annotation able to assist with protein annotation extension.Diversas metodologias são usadas para criar anotações em sequências, desde a curação manual por curadores especializados até vários procedimentos automáticos. A multitude de métodos de anotação existentes consequentemente gera heterogeneidade nas anotações em termos de cobertura e especificidade em espaços de sequências biológicas. Ao comparar grupos de sequências semelhantes (tais como famílias proteícas) esta heterogeneidade pode introduzir dificuldades quanto à interpretação da semelhança e coerência funcional nesses grupos. Uma maneira de mitigar essas dificuldades é a extensão da anotação dentro das famílias proteícas em análise. Esta tese postula que famílias proteícas podem ser usadas como bases de conhecimento para a sua própria extensão de anotação através do uso de análises de coerência funcional apropriadas. Portanto, uma framework modular para a análise de coerência funcional e extensão de anotação em famílias proteícas foi proposta. A framework incluí um módulo proposto para a análise de coerência funcional baseado em visualização de grafos, enriquecimento de termos e outras estatísticas. Neste trabalho o módulo foi implementado e disponibilizado como uma aplicação web, GRYFUN que pode ser acedida em http://xldb.di.fc.ul.pt/gryfun/. Adicionalmente, quatro métricas foram desenvolvidas para aferir aspectos distinctos da coerência e completude de anotação em famílias proteícas em conjunção com métricas já existentes. Portanto, o uso da framework completa por curadores, como uma estratégia de anotação semi-automática, é capaz de potenciar a extensão de anotação.Fundação para a Ciência e a Tecnologia (FCT), SFRH/BD/48035/200

    Detection of new protein domains using co-occurrence: application to Plasmodium falciparum

    Get PDF
    International audienceMotivation: Hidden Markov Models (HMMs) have proved to be a powerful tool for protein domain identification in newly sequenced organisms. However, numerous domains may be missed in highly divergent proteins. This is the case for Plasmodium falciparum proteins, the main causal agent of human malaria. Results: We propose a method to improve the sensitivity of HMM domain detection by exploiting the tendency of the domains to appear preferentially with a few other favorite domains in a protein. When sequence information alone is not sufficient to warrant the presence of a particular domain, our method enables its detection on the basis of the presence of other Pfam or InterPro domains. Moreover, a shuffling procedure allows us to estimate the false discovery rate associated with the results. Applied to P. falciparum, our method identifies 585 new Pfam domains (versus the 3683 already known domains in the Pfam database) with an estimated error rate below 20%. These new domains provide 387 new Gene Ontology annotations to the P. falciparum proteome. Analogous and congruent results are obtained when applying the method to related Plasmodium species, P. vivax and P. yoelii. Availability: Supplementary Material and a database of the new domains and GO predictions achieved on Plasmodium proteins are available at http://www.lirmm.fr/~terrapon/codd

    Protocols to capture the functional plasticity of protein domain superfamilies

    Get PDF
    Most proteins comprise several domains, segments that are clearly discernable in protein structure and sequence. Over the last two decades, it has become increasingly clear that domains are often also functional modules that can be duplicated and recombined in the course of evolution. This gives rise to novel protein functions. Traditionally, protein domains are grouped into homologous domain superfamilies in resources such as SCOP and CATH. This is done primarily on the basis of similarities in their three-dimensional structures. A biologically sound subdivision of the domain superfamilies into families of sequences with conserved function has so far been missing. Such families form the ideal framework to study the evolutionary and functional plasticity of individual superfamilies. In the few existing resources that aim to classify domain families, a considerable amount of manual curation is involved. Whilst immensely valuable, the latter is inherently slow and expensive. It can thus impede large-scale application. This work describes the development and application of a fully-automatic pipeline for identifying functional families within superfamilies of protein domains. This pipeline is built around a method for clustering large-scale sequence datasets in distributed computing environments. In addition, it implements two different protocols for identifying families on the basis of the clustering results: a supervised and an unsupervised protocol. These are used depending on whether or not high-quality protein function annotation data are associated with a given superfamily. The results attained for more than 1,500 domain superfamilies are discussed in both a qualitative and quantitative manner. The use of domain sequence data in conjunction with Gene Ontology protein function annotations and a set of rules and concepts to derive families is a novel approach to large-scale domain sequence classification. Importantly, the focus lies on domain, not whole-protein function
    corecore