    Disentangling evolutionary signals: conservation, specificity determining positions and coevolution. Implication for catalytic residue prediction

    Background: A large panel of methods exists that aim to identify residues with critical impact on protein function based on evolutionary signals, sequence and structure information. However, it is not clear to what extent these different methods overlap, or whether any of them has higher predictive potential than the others, in particular for the identification of catalytic residues (CR) in proteins. Using a large set of enzymatic protein families and measures based on different evolutionary signals, we sought to break up the different components of the information content within a multiple sequence alignment to investigate their predictive potential and degree of overlap.
    Results: Our results demonstrate that the methods included in the benchmark can, in general, be divided into three groups with limited mutual overlap. One group contains the real-value Evolutionary Trace (rvET) methods and conservation, another contains the mutual information (MI) methods, and the last contains methods designed explicitly for the identification of specificity determining positions (SDPs): integer-value Evolutionary Trace (ivET), SDPfox, and XDET. In terms of prediction of CR, we find, using a proximity score integrating structural information (the sum of the scores of residues located within a given distance of the residue in question), that only the methods from the first two groups displayed a reliable performance. Next, we investigated to what degree proximity scores for conservation, rvET and cumulative MI (cMI) provide complementary information capable of improving the performance for CR identification. We found that integrating conservation with proximity scores for rvET and cMI achieved the highest performance. The proximity conservation score contained no complementary information when integrated with proximity rvET. Moreover, the signal from rvET provided only a limited gain in predictive performance when integrated with mutual information and conservation proximity scores. Combined, these observations demonstrate that the rvET and cMI scores add complementary information to the prediction system.
    Conclusions: This work contributes to the understanding of the different signals of evolution and also shows that it is possible to improve the detection of catalytic residues by integrating structural and higher order sequence evolutionary information with sequence conservation.
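The proximity score mentioned in the abstract above (the sum of the scores of residues located within a given distance of the residue in question) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `proximity_scores`, the 8.0 distance cutoff, the use of one 3D coordinate per residue, and the choice to include the residue itself in its own sum are all hypothetical and are not taken from the paper.

```python
import math


def proximity_scores(coords, scores, cutoff=8.0):
    """Sum, for each residue, the scores of all residues within `cutoff`.

    coords: list of (x, y, z) tuples, one representative point per residue
            (assumption: the abstract does not say which atom is used).
    scores: per-residue scores (e.g. conservation, rvET or cMI values).
    cutoff: distance threshold (hypothetical default, not from the paper).
    The residue itself is included in its own sum (also an assumption).
    """
    if len(coords) != len(scores):
        raise ValueError("coords and scores must have the same length")
    result = []
    for center in coords:
        total = 0.0
        for point, score in zip(coords, scores):
            if math.dist(center, point) <= cutoff:
                total += score
        result.append(total)
    return result


# Toy example: two residues close together, one far away.
toy_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_scores = [1.0, 2.0, 4.0]
toy_result = proximity_scores(toy_coords, toy_scores, cutoff=5.0)
```

With these toy inputs the first two residues each collect both of their scores, while the distant third residue keeps only its own.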

    Maintaining protein localization, structure, and functional interactions via codon usage and coevolution of gene expression: Combining evolutionary bioinformatics with omics-scale data to test hypotheses related to protein function

    A major challenge of the omics era is identifying how a protein functions, both in terms of its specific function and within the context of the various biological processes necessary for the cell's survival. Key elements necessary for a protein to perform its function are efficient and accurate protein localization, protein folding, and interactions with other proteins. Previous work implicated codon usage as a means to modulate protein localization and folding. Using a mechanistic model rooted in population genetics, I examine potential selective differences in codon usage in signal peptides (localization) and protein secondary structures. Although previous work argued signal peptides were under selection for increased translation inefficiency, I find selection is generally consistent with the 5'-regions of non-secreted proteins. I also find that previous work was likely confounded by biases in signal peptide amino acid usage and gene expression. Although the direction of selection on codon usage is mostly consistent between protein secondary structures, the strength of this selection does vary for certain codons. After successful folding and localization of a protein, it must be able to function within the context of other proteins in the cell, often through protein-protein interactions of metabolic pathways. Previous work suggests proteins which are part of the same functional processes within a cell are co-expressed across time and environmental conditions. Using the concept of guilt-by-association, I combine empirical protein abundances (measured via mass spectrometry) with sequence homology based function prediction tools to identify potential functions of proteins of unknown function in C. thermocellum. Building upon the concept that functionally-related genes are co-expressed within a species, I demonstrate how phylogenetic comparative methods can be used to detect signals of gene expression coevolution across species while accounting for the shared ancestry of the species in question.

    Evolutionary Analysis and Expression Profiling of Zebra Finch Immune Genes

    Genes of the immune system are generally considered to evolve rapidly due to host–parasite coevolution. They are therefore of great interest in evolutionary biology and molecular ecology. In this study, we manually annotated 144 avian immune genes from the zebra finch (Taeniopygia guttata) genome and conducted evolutionary analyses of these by comparing them with their orthologs in the chicken (Gallus gallus). Genes classified as immune receptors showed elevated dN/dS ratios compared with other classes of immune genes. Immune genes in general also appear to be evolving more rapidly than other genes, as inferred from a higher dN/dS ratio compared with the rest of the genome. Furthermore, ten genes (of 27) for which sequence data were available from at least three bird species showed evidence of positive selection acting on specific codons. From transcriptome data of eight different tissues, we found evidence for expression of 106 of the studied immune genes, with primary expression of most of these in bursa, blood, and spleen. These immune-related genes showed a more tissue-specific expression pattern than other genes in the zebra finch genome. Several of the avian immune genes investigated here provide strong candidates for in-depth studies of molecular adaptation in birds.

    Networks of High Mutual Information Define the Structural Proximity of Catalytic Sites: Implications for Catalytic Residue Identification

    Identification of catalytic residues (CR) is essential for the characterization of enzyme function. CR are, in general, conserved and located in the functional site of a protein in order to attain their function. However, many non-catalytic residues are highly conserved, and not all CR are conserved throughout a given protein family, making identification of CR a challenging task. Here, we put forward the hypothesis that CR carry a particular signature defined by networks of close proximity residues with high mutual information (MI), and that this signature can be applied to distinguish functional from non-functional conserved residues. Using a data set of 434 Pfam families included in the Catalytic Site Atlas (CSA) database, we tested this hypothesis and demonstrated that MI can complement amino acid conservation scores to detect CR. The Kullback-Leibler (KL) conservation measurement was shown to significantly outperform both the Shannon entropy and maximal frequency measurements. Residues in the proximity of catalytic sites were shown to be rich in shared MI. A structural proximity MI average score (termed pMI) was demonstrated to be a strong predictor for CR, thus confirming the proposed hypothesis. A structural proximity conservation average score (termed pC) was also calculated and demonstrated to carry distinct information from pMI. A catalytic likeliness score (Cls), combining the KL, pC and pMI measures, was shown to lead to significantly improved prediction accuracy. At a specificity of 0.90, the Cls method was found to have a sensitivity of 0.816. In summary, we demonstrate that networks of residues with high MI provide a distinct signature on CR and propose that such a signature should be present in other classes of functional residues where the requirement to maintain a particular function places limitations on the diversification of the structural environment along the course of evolution.
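For readers unfamiliar with the quantities named in this abstract, the textbook definitions of Kullback-Leibler conservation for one alignment column and of mutual information between two columns can be sketched as below. This is a minimal illustration, not the authors' pipeline: the uniform background distribution, the absence of gap handling, sequence weighting, pseudocounts or MI corrections, and the function names `kl_conservation` and `mutual_information` are all assumptions made here.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: uniform background; real analyses typically use observed frequencies.
UNIFORM_BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def kl_conservation(column, background=UNIFORM_BACKGROUND):
    """KL divergence (in bits) of a column's residue frequencies from a background."""
    counts = Counter(column)
    n = len(column)
    return sum(
        (c / n) * math.log2((c / n) / background[aa]) for aa, c in counts.items()
    )


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two equally long alignment columns."""
    if len(col_i) != len(col_j):
        raise ValueError("columns must have the same number of sequences")
    n = len(col_i)
    freq_i = Counter(col_i)
    freq_j = Counter(col_j)
    freq_pairs = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in freq_pairs.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((freq_i[a] / n) * (freq_j[b] / n)))
    return mi


# Toy columns: one fully conserved, two perfectly covarying, two independent.
kl_conserved = kl_conservation("AAAA")
mi_covarying = mutual_information("AACC", "GGTT")
mi_independent = mutual_information("AACC", "GTGT")
```

A fully conserved column gets the maximum KL value against a uniform background, perfectly covarying columns share one bit of MI, and independent columns share none. A proximity-averaged score such as pMI would then average such per-residue values over structural neighbours, using whatever neighbourhood definition the study adopted.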

    Computational Methods for Accelerated Discovery and Characterization of Genes in Emerging Model Organisms

    Cilia are evolutionarily conserved, complex, microtubule-based structures that protrude from many eukaryotic cells. In humans, cilia can be found on almost all cell types. The effect of abnormal or absent cilia has been established as the common underlying cause of a recently emerging class of genetic diseases collectively referred to as ciliopathies. The function and structure of cilia are conserved across all organisms with cilia. One of the most influential model systems used to study ciliopathies has been the ciliated green alga Chlamydomonas reinhardtii, an organism for which there is a sequenced genome with relatively few experimentally validated whole-gene annotations but in which the ciliogenesis process can be reliably induced. Experimental methods have been successful in identifying a handful of highly specific cilia disease genes in the alga, but high-throughput, automated computational analyses harbor the greatest potential to reveal a more comprehensive ciliopathy disease gene list. However, in order for a genome to be informative for downstream computational analyses, it must first be accurately annotated. This dissertation focuses on accelerating the accurate annotation of the Chlamydomonas genome using whole-genome and whole-transcriptome methodologies to identify human ciliopathy genes. Towards this end, we first develop a genefinder training method for Chlamydomonas that does not require whole-gene annotations and demonstrate that this training method results in a more accurate genefinder than any other genefinder for this alga. Next, we develop a new automated protein characterization method that facilitates the transfer of information across different protein families by extending simple homology categorization to identify new cilia gene candidates. Finally, we perform and analyze high-throughput whole-transcriptome sequencing of Chlamydomonas at various timepoints during ciliogenesis to identify ~300 novel human ciliopathy gene candidates. Together, these three methodologies complement each other and the existing literature to better elucidate a more complete and informative cilia gene catalog.