1,865 research outputs found
On the detection of functionally coherent groups of protein domains with an extension to protein annotation
Background: Protein domains coordinate to perform multifaceted cellular functions, and domain combinations serve as the functional building blocks of the cell. The available methods to identify functional domain combinations are limited in their scope, e.g. to the identification of combinations falling within individual proteins or within specific regions in a translated genome. Further effort is needed to identify groups of domains that span across two or more proteins and are linked by a cooperative function. Such functional domain combinations can be useful for protein annotation.
Results: Using a new computational method, we have identified 114 groups of domains, referred to as domain assembly units (DASSEM units), in the proteome of budding yeast Saccharomyces cerevisiae. The units participate in many important cellular processes such as transcription regulation, translation initiation, and mRNA splicing. Within the units the domains were found to function in a cooperative manner; and each domain contributed to a different aspect of the unit's overall function. The member domains of DASSEM units were found to be significantly enriched among proteins contained in transcription modules, defined as genes sharing similar expression profiles and presumably similar functions. The observation further confirmed the functional coherence of DASSEM units. The functional linkages of units were found in both functionally characterized and uncharacterized proteins, which enabled the assessment of protein function based on domain composition.
Conclusion: A new computational method was developed to identify groups of domains that are linked by a common function in the proteome of Saccharomyces cerevisiae. These groups can either lie within individual proteins or span across different proteins. We propose that the functional linkages among the domains within the DASSEM units can be used as a non-homology based tool to annotate uncharacterized proteins.
Exploring the function and evolution of proteins using domain families
Proteins are frequently composed of multiple domains which fold
independently. These are often evolutionarily distinct units which can be
adapted and reused in other proteins. The classification of protein domains
into evolutionary families facilitates the study of their evolution and function.
In this thesis such classifications are used firstly to examine methods for
identifying evolutionary relationships (homology) between protein domains.
Secondly a specific approach for predicting their function is developed.
Lastly they are used in studying the evolution of protein complexes.
Tools for identifying evolutionary relationships between proteins are
central to computational biology. They aid in classifying families of proteins,
in giving clues about protein function, and in the study of molecular
evolution. The first chapter of this thesis concerns the effectiveness of
cutting-edge methods in identifying evolutionary relationships between protein
domains.
The identification of evolutionary relationships between proteins can
give clues as to their function. The second chapter of this thesis concerns the
development of a method to identify proteins involved in the same biological
process. This method is based on the concept of domain fusion whereby
pairs of proteins from one organism with a concerted function are sometimes
found fused into single proteins in a different organism. Using protein
domain classifications it is possible to identify these relationships.
Most proteins do not act in isolation but carry out their function by
binding to other proteins in complexes; little is understood about the
evolution of such complexes. In the third chapter of this thesis the evolution
of complexes is examined in two representative model organisms using
protein domain families. In this work, protein domain superfamilies allow
distantly related parts of complexes to be identified in order to determine
how homologous units are reused.
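The domain fusion idea in the second chapter summary above can be illustrated with a minimal sketch. It assumes each proteome is given as a mapping from protein name to a set of domain family identifiers; the function name, protein names, and family labels are hypothetical and are not taken from the thesis.

```python
from itertools import combinations


def find_fusion_links(split_proteome, fused_proteome):
    """Return (protein_a, protein_b, fused_protein) triples.

    A pair of proteins from split_proteome is linked when all of their
    domain families are found together in one protein of fused_proteome.
    Both arguments map protein names to sets of family identifiers.
    """
    links = set()
    for (name_a, doms_a), (name_b, doms_b) in combinations(
        sorted(split_proteome.items()), 2
    ):
        # Proteins without any assigned family carry no evidence.
        if not doms_a or not doms_b:
            continue
        # Pairs sharing a family are more likely homologs than fusion partners.
        if doms_a & doms_b:
            continue
        for fused_name, fused_doms in fused_proteome.items():
            if doms_a <= fused_doms and doms_b <= fused_doms:
                links.add((name_a, name_b, fused_name))
    return links


# Hypothetical toy data: two separate proteins in one organism whose
# families appear fused in a single protein of another organism.
organism_x = {"protA": {"fam1"}, "protB": {"fam2"}, "protC": {"fam3"}}
organism_y = {"protF": {"fam1", "fam2"}, "protG": {"fam3"}}
fusion_links = find_fusion_links(organism_x, organism_y)
```

Under these assumptions only protA and protB are linked, through protF. A real analysis would also need to handle domain order, repeated domains, and promiscuous families, none of which are modelled here.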
Graph Theory and Networks in Biology
In this paper, we present a survey of the use of graph theoretical techniques
in Biology. In particular, we discuss recent work on identifying and modelling
the structure of bio-molecular networks, as well as the application of
centrality measures to interaction networks and research on the hierarchical
structure of such networks and network motifs. Work on the link between
structural network properties and dynamics is also described, with emphasis on
synchronization and disease propagation.
Comment: 52 pages, 5 figures, Survey Paper
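As a small illustration of the centrality measures mentioned in this survey, the sketch below computes normalised degree centrality for an undirected interaction network. The edge list and protein labels are hypothetical, and degree centrality is only one of several measures such a survey might cover.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Normalised degree centrality for an undirected graph.

    edges is an iterable of (node, node) pairs. Each node's score is its
    number of distinct neighbours divided by (number of nodes - 1).
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-interactions
        neighbours[u].add(v)
        neighbours[v].add(u)
    n_nodes = len(neighbours)
    if n_nodes < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(nb) / (n_nodes - 1) for node, nb in neighbours.items()}


# Hypothetical interaction network: P1 interacts with every other protein.
interactions = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
centrality = degree_centrality(interactions)
```

Here P1 scores 1.0 because it touches all three other proteins, which is the kind of hub a centrality analysis is meant to surface.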
Assessing Performance of Orthology Detection Strategies Applied to Eukaryotic Genomes
Orthology detection is critically important for accurate functional annotation, and has been widely used to facilitate studies on comparative and evolutionary genomics. Although various methods are now available, there has been no comprehensive analysis of performance, due to the lack of a genome-scale ‘gold standard’ orthology dataset. Even in the absence of such datasets, the comparison of results from alternative methodologies contains useful information, as agreement enhances confidence and disagreement indicates possible errors. Latent Class Analysis (LCA) is a statistical technique that can exploit this information to infer sensitivities and specificities, and is applied here to evaluate the performance of various orthology detection methods on a eukaryotic dataset. Overall, we observe a trade-off between sensitivity and specificity in orthology detection, with BLAST-based methods characterized by high sensitivity, and tree-based methods by high specificity. Two algorithms exhibit the best overall balance, with both sensitivity and specificity above 80%: INPARANOID identifies orthologs across two species while OrthoMCL clusters orthologs from multiple species. Among methods that permit clustering of ortholog groups spanning multiple genomes, the (automated) OrthoMCL algorithm exhibits better within-group consistency with respect to protein function and domain architecture than both the (manually curated) KOG database and the homolog clustering algorithm TribeMCL. Using LCA, we are also able to comprehensively assess similarities and statistical dependence between various strategies, and evaluate the effects of parameter settings on performance. In summary, we present a comprehensive evaluation of orthology detection on a divergent set of eukaryotic genomes, thus providing insights and guides for method selection, tuning and development for different applications.
Many biological questions have been addressed by multiple tests yielding binary (yes/no) outcomes but no clear definition of truth, making LCA an attractive approach for computational biology.
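To make the LCA idea concrete, here is a minimal sketch of a two-class latent class model fitted by expectation-maximisation. It assumes the methods are conditionally independent given the hidden true status, which is a simplification; the function name, starting values, and toy calls are hypothetical and do not reproduce the paper's model or data.

```python
def fit_two_class_lca(calls, n_iter=200):
    """Estimate prevalence, sensitivities and specificities without a gold standard.

    calls is a list of rows; each row holds one 0/1 call per method for the
    same candidate pair. Assumes conditional independence between methods.
    """
    n_rows, n_methods = len(calls), len(calls[0])
    eps = 1e-9
    prevalence = 0.5
    sens = [0.8] * n_methods       # P(call = 1 | truly positive), initial guess
    false_pos = [0.2] * n_methods  # P(call = 1 | truly negative), initial guess
    for _ in range(n_iter):
        # E-step: posterior probability that each row is a true positive.
        weights = []
        for row in calls:
            p_true, p_false = prevalence, 1.0 - prevalence
            for y, se, fp in zip(row, sens, false_pos):
                p_true *= se if y else 1.0 - se
                p_false *= fp if y else 1.0 - fp
            weights.append(p_true / (p_true + p_false))
        # M-step: re-estimate parameters from the weighted calls.
        w_sum = sum(weights)
        prevalence = min(max(w_sum / n_rows, eps), 1.0 - eps)
        for j in range(n_methods):
            hits_true = sum(w * row[j] for w, row in zip(weights, calls))
            hits_false = sum((1.0 - w) * row[j] for w, row in zip(weights, calls))
            sens[j] = min(max(hits_true / w_sum, eps), 1.0 - eps)
            false_pos[j] = min(max(hits_false / (n_rows - w_sum), eps), 1.0 - eps)
    spec = [1.0 - fp for fp in false_pos]
    return prevalence, sens, spec


# Hypothetical calls from three methods: mostly in agreement, some dissent.
toy_calls = (
    [[1, 1, 1]] * 40 + [[0, 0, 0]] * 40
    + [[1, 1, 0]] * 5 + [[0, 0, 1]] * 5
    + [[1, 0, 1]] * 5 + [[0, 1, 0]] * 5
)
prevalence, sens, spec = fit_two_class_lca(toy_calls)
```

The asymmetric starting values (sensitivity 0.8, false-positive rate 0.2) are a design choice that keeps the "positive" class tied to a call of 1; without it the two latent classes could swap labels.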
What is hidden in the darkness? Deep-learning assisted large-scale protein family curation uncovers novel protein families and folds
Driven by the development and upscaling of fast genome sequencing and assembly pipelines, the number of protein-coding sequences deposited in public protein sequence databases is increasing exponentially. Recently, the dramatic success of deep learning-based approaches applied to protein structure prediction has done the same for protein structures. We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database. These models cover most of the catalogued natural proteins, including those difficult to annotate for function or putative biological role based on standard, homology-based approaches. In this work, we quantified how much of such "dark matter" of the natural protein universe was structurally illuminated by AlphaFold2 and modelled this diversity as an interactive sequence similarity network that can be navigated at https://uniprot3d.org/atlas/AFDB90v4 . In the process, we discovered multiple novel protein families by searching for novelties from sequence, structure, and semantic perspectives. We added a number of them to Pfam, and experimentally demonstrated that one of these belongs to a novel superfamily of toxin-antitoxin systems, TumE-TumA. This work highlights the role of large-scale, evolution-driven protein comparison efforts in combination with structural similarities, genomic context conservation, and deep-learning based function prediction tools for the identification of novel protein families, aiding not only annotation and classification efforts but also the curation and prioritisation of target proteins for experimental characterisation.
Pairwise gene GO-based measures for biclustering of high-dimensional expression data
Background: Biclustering algorithms search gene expression data for groups of genes
that behave similarly under a subset of samples. The biological
knowledge now available in public repositories can be used to drive these algorithms
towards biclusters composed of functionally coherent groups of genes. In addition,
a distance between genes can be defined from their Gene
Ontology (GO) annotations: pairwise GO semantic similarity measures assign each
pair of genes a value that quantifies their functional similarity. This paper studies a
scatter search-based algorithm that optimizes a merit function integrating GO
information through a term based on such a GO
measure.
Results: The effect of two different pairwise GO measures on the
performance of the algorithm is analyzed. First, three well-known yeast datasets with
approximately one thousand genes are studied. Second, a group of human
datasets related to clinical cancer data is explored. Most of
these are high-dimensional datasets composed of a very large number of genes. The
resulting biclusters reveal groups of genes linked by the same functionality when the
search procedure is driven by one of the proposed GO measures. Furthermore, a
qualitative biological study of a group of biclusters shows their relevance from a cancer
disease perspective.
Conclusions: The integration of biological information
improves the performance of the biclustering process. Both GO measures
studied improve the results obtained for the yeast datasets. However, for
datasets composed of a very large number of genes, only one of them really improves
the algorithm's performance. This second case constitutes a clear option for exploring
interesting datasets from a clinical point of view.
Funding: Ministerio de Economía y Competitividad TIN2014-55894-C2-
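A pairwise GO-based term of the kind described in this abstract can be sketched as follows. The abstract does not name the two measures studied, so this example swaps in a plain Jaccard index over annotation sets as a stand-in; it is not a true semantic similarity measure, and the gene names and term labels are hypothetical.

```python
from itertools import combinations


def jaccard(terms_a, terms_b):
    """Jaccard index of two annotation sets (a stand-in for a GO measure)."""
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def bicluster_coherence(genes, annotations):
    """Mean pairwise annotation similarity over the genes of one bicluster.

    annotations maps each gene to a set of term identifiers; genes without
    an entry are treated as unannotated.
    """
    pairs = list(combinations(genes, 2))
    if not pairs:
        return 0.0
    total = sum(
        jaccard(annotations.get(a, set()), annotations.get(b, set()))
        for a, b in pairs
    )
    return total / len(pairs)


# Hypothetical annotations: g1 and g2 share one term, g3 shares none.
toy_annotations = {"g1": {"t1", "t2"}, "g2": {"t2", "t3"}, "g3": {"t9"}}
tight_score = bicluster_coherence(["g1", "g2"], toy_annotations)
loose_score = bicluster_coherence(["g1", "g2", "g3"], toy_annotations)
```

A score like this could be added to an expression-based merit function so that the search favours biclusters whose genes are also functionally related. How the paper weights that term is not stated in the abstract.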
Functional coherence and annotation agreement metrics for enzyme families
Doctoral thesis, Informatics (Bioinformatics), Universidade de Lisboa, Faculdade de Ciências, 2015.
A range of methodologies is used to create sequence annotations, from manual curation by specialized curators to several automatic procedures. The multitude of existing annotation methods consequently generates annotation heterogeneity in terms of coverage and specificity across the biological sequence space. When comparing groups of similar sequences (such as protein families), this heterogeneity can introduce issues regarding the interpretation of the actual functional similarity and the overall functional coherence. A direct path to mitigate these issues is annotation extension within the protein families under analysis. This thesis postulates that protein families can be used as knowledge bases for their own annotation extension with the assistance of a proper functional coherence analysis. Therefore, a modular framework for functional coherence analysis and annotation extension in protein families was proposed. The framework includes a proposed module for functional coherence analysis that relies on graph visualization, term enrichment and other statistics. In this work, the module was implemented and made available as a publicly accessible web application, GRYFUN, which can be accessed at http://xldb.di.fc.ul.pt/gryfun/. In addition, four metrics were developed to assess distinct aspects of the coherence and completeness of annotation in protein families, in conjunction with additional existing metrics. Therefore, the use of the complete proposed framework by curators can be regarded as a semi-automatic approach to annotation able to assist with protein annotation extension.
Funding: Fundação para a Ciência e a Tecnologia (FCT), SFRH/BD/48035/200
Detection of new protein domains using co-occurrence: application to Plasmodium falciparum
Motivation: Hidden Markov Models (HMMs) have proved to be a powerful tool for protein domain identification in newly sequenced organisms. However, numerous domains may be missed in highly divergent proteins. This is the case for the proteins of Plasmodium falciparum, the main causal agent of human malaria.
Results: We propose a method to improve the sensitivity of HMM domain detection by exploiting the tendency of the domains to appear preferentially with a few other favorite domains in a protein. When sequence information alone is not sufficient to warrant the presence of a particular domain, our method enables its detection on the basis of the presence of other Pfam or InterPro domains. Moreover, a shuffling procedure allows us to estimate the false discovery rate associated with the results. Applied to P. falciparum, our method identifies 585 new Pfam domains (versus the 3683 already known domains in the Pfam database) with an estimated error rate below 20%. These new domains provide 387 new Gene Ontology annotations to the P. falciparum proteome. Analogous and congruent results are obtained when applying the method to related Plasmodium species, P. vivax and P. yoelii.
Availability: Supplementary Material and a database of the new domains and GO predictions achieved on Plasmodium proteins are available at http://www.lirmm.fr/~terrapon/codd
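The co-occurrence signal this abstract relies on can be illustrated with a short sketch. It scores how unexpectedly often two domains appear in the same protein using a hypergeometric tail probability; that test is an assumption chosen for illustration, since the abstract does not describe the paper's actual scoring or its shuffling procedure. Function, protein, and domain names are hypothetical.

```python
from math import comb


def cooccurrence_pvalue(proteins, domain_a, domain_b):
    """P-value for observing at least this many proteins with both domains.

    proteins maps protein names to sets of domain identifiers. The null
    model assumes domain_b is distributed at random among the proteins.
    """
    total = len(proteins)
    with_a = sum(1 for doms in proteins.values() if domain_a in doms)
    with_b = sum(1 for doms in proteins.values() if domain_b in doms)
    both = sum(
        1 for doms in proteins.values() if domain_a in doms and domain_b in doms
    )
    # Hypergeometric upper tail: P(X >= both).
    tail = sum(
        comb(with_a, i) * comb(total - with_a, with_b - i)
        for i in range(both, min(with_a, with_b) + 1)
    )
    return tail / comb(total, with_b)


# Hypothetical annotated proteome: domA and domB always occur together.
toy_proteome = {f"p{i}": {"domA", "domB"} for i in range(4)}
toy_proteome.update({f"q{i}": {"domC"} for i in range(6)})
p_value = cooccurrence_pvalue(toy_proteome, "domA", "domB")
```

A weak HMM hit for domB in a protein that already carries domA could then be accepted when the pair's p-value falls below a chosen threshold, which is the general spirit of certifying a domain through its context.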
Protocols to capture the functional plasticity of protein domain superfamilies
Most proteins comprise several domains, segments that are clearly discernable
in protein structure and sequence. Over the last two decades, it has become
increasingly clear that domains are often also functional modules that can be
duplicated and recombined in the course of evolution. This gives rise to novel
protein functions. Traditionally, protein domains are grouped into
homologous domain superfamilies in resources such as SCOP and CATH.
This is done primarily on the basis of similarities in their three-dimensional
structures. A biologically sound subdivision of the domain superfamilies into
families of sequences with conserved function has so far been missing. Such
families form the ideal framework to study the evolutionary and functional
plasticity of individual superfamilies. In the few existing resources that aim to
classify domain families, a considerable amount of manual curation is
involved. Whilst immensely valuable, the latter is inherently slow and
expensive. It can thus impede large-scale application.
This work describes the development and application of a fully-automatic
pipeline for identifying functional families within superfamilies of protein
domains. This pipeline is built around a method for clustering large-scale
sequence datasets in distributed computing environments. In addition, it
implements two different protocols for identifying families on the basis of the
clustering results: a supervised and an unsupervised protocol. These are used
depending on whether or not high-quality protein function annotation data
are associated with a given superfamily. The results attained for more than
1,500 domain superfamilies are discussed in both a qualitative and quantitative
manner. The use of domain sequence data in conjunction with Gene
Ontology protein function annotations and a set of rules and concepts to
derive families is a novel approach to large-scale domain sequence
classification. Importantly, the focus lies on domain, not whole-protein
function.