
    Protein and DNA sequence determinants of thermophilic adaptation

    Prokaryotes living at extreme environmental temperatures exhibit pronounced signatures in the amino acid composition of their proteins and the nucleotide composition of their genomes, reflecting adaptation to their thermal environments. However, despite significant efforts, a definitive answer to the question of which genomic and proteomic compositional features determine the Optimal Growth Temperature of prokaryotic organisms has remained elusive. Here the authors performed a comprehensive analysis of amino acid and nucleotide compositional signatures of thermophilic adaptation by exhaustively evaluating all combinations of amino acids and nucleotides as possible determinants of Optimal Growth Temperature for all prokaryotic organisms with fully sequenced genomes. The authors discovered that the total concentration of seven amino acids in proteomes, IVYWREL, serves as a universal proteomic predictor of Optimal Growth Temperature in prokaryotes. Resolving a long-standing controversy, the authors determined that the variation in nucleotide composition (increase of purine load, or A+G content, with temperature) is largely a consequence of thermal adaptation of proteins. However, the frequency with which A and G nucleotides appear as nearest neighbors in genome sequences is strongly and independently correlated with Optimal Growth Temperature, as a result of codon bias in the corresponding genomes. Together these results provide a complete picture of proteomic and genomic determinants of thermophilic adaptation. Comment: in press PLoS Computational Biology; revised versio
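The IVYWREL predictor described in this abstract is a simple compositional quantity: the share of all residues in a proteome that are I, V, Y, W, R, E, or L. A minimal Python sketch of that calculation follows. The function name `ivywrel_fraction` and the input format (an iterable of one-letter protein strings) are assumptions made for illustration, and the abstract does not give the coefficients needed to turn the fraction into a temperature estimate, so none are included.

```python
# Hypothetical helper: not taken from the paper, only an illustration
# of the "total concentration of IVYWREL" quantity named in the abstract.
IVYWREL = frozenset("IVYWREL")


def ivywrel_fraction(protein_sequences):
    """Return the fraction of residues that are I, V, Y, W, R, E, or L.

    protein_sequences: iterable of one-letter amino acid strings.
    Non-letter characters (gaps, stop symbols) are ignored.
    """
    total = 0
    hits = 0
    for seq in protein_sequences:
        for residue in seq.upper():
            if residue.isalpha():
                total += 1
                if residue in IVYWREL:
                    hits += 1
    if total == 0:
        raise ValueError("no residues supplied")
    return hits / total
```

Computing the fraction over pooled residues, rather than averaging per-protein fractions, weights long proteins more heavily; which convention the authors used is not stated in the abstract.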

    PRED-CLASS: cascading neural networks for generalized protein classification and genome-wide applications

    A cascading system of hierarchical, artificial neural networks (named PRED-CLASS) is presented for the generalized classification of proteins into four distinct classes (transmembrane, fibrous, globular, and mixed) from information solely encoded in their amino acid sequences. The architecture of the individual component networks is kept very simple, reducing the number of free parameters (network synaptic weights) for faster training, improved generalization, and the avoidance of data overfitting. Capturing information from as few as 50 protein sequences spread among the four target classes (6 transmembrane, 10 fibrous, 13 globular, and 17 mixed), PRED-CLASS was able to obtain 371 correct predictions out of a set of 387 proteins (success rate approximately 96%) unambiguously assigned into one of the target classes. The application of PRED-CLASS to several test sets and complete proteomes of several organisms demonstrates that such a method could serve as a valuable tool in the annotation of genomic open reading frames with no functional assignment or as a preliminary step in fold recognition and ab initio structure prediction methods. Detailed results obtained for various data sets and completed genomes, along with a web server running the PRED-CLASS algorithm, can be accessed over the World Wide Web at http://o2.biol.uoa.gr/PRED-CLAS

    The Plant Short-Chain Dehydrogenase (SDR) superfamily: genome-wide inventory and diversification patterns

    Background: Short-chain dehydrogenases/reductases (SDRs) form one of the largest and oldest NAD(P)(H)-dependent oxidoreductase families. Despite a conserved 'Rossmann-fold' structure, members of the SDR superfamily exhibit low sequence similarities, which has been a bottleneck for their identification. Recent classification methods, relying on hidden Markov models (HMMs), improved identification and enabled the construction of a nomenclature. However, functional annotations of plant SDRs remain scarce. Results: Wide-scale analyses were performed on ten plant genomes. The combination of HMM-based analyses and similarity searches led to the construction of an exhaustive inventory of plant SDRs. With 68 to 315 members found in each analysed genome, the inventory confirmed the over-representation of SDRs in plants compared to animals, fungi and prokaryotes. The plant SDRs were first classified into three major types ('classical', 'extended' and 'divergent'), but a minority (10 % of the predicted SDRs) could not be classified into these general types ('unknown' or 'atypical' types). In a second step, we could categorize the vast majority of land plant SDRs into a set of 49 families. Out of these 49 families, 35 appeared early during evolution, since they are commonly found throughout the Green Lineage. Yet some SDR families, namely tropinone reductase-like proteins (SDR65C), 'ABA2-like'-NAD dehydrogenase (SDR110C), 'salutaridine/menthone-reductase-like' proteins (SDR114C), 'dihydroflavonol 4-reductase'-like proteins (SDR108E) and 'isoflavone-reductase-like' (SDR460A) proteins, have undergone significant functional diversification within vascular plants since they diverged from Bryophytes. Interestingly, these diversified families are either involved in secondary metabolism routes (terpenoids, alkaloids, phenolics) or participate in developmental processes (hormone biosynthesis or catabolism, flower development), in contrast to SDR families involved in primary metabolism, which are poorly diversified. Conclusion: The application of HMMs to plant genomes enabled us to identify 49 families that encompass all Angiosperm ('higher plant') SDRs, each family being sufficiently conserved to enable simpler analyses based only on overall sequence similarity. The multiplicity of SDRs in the plant kingdom is mainly explained by the diversification of large families involved in different secondary metabolism pathways, suggesting that the chemical diversification that accompanied the emergence of vascular plants acted as a driving force for SDR evolution

    Genomic and proteomic biases inform metabolic engineering strategies for anaerobic fungi

    Anaerobic fungi (Neocallimastigomycota) are emerging non-model hosts for biotechnology due to their wealth of biomass-degrading enzymes, yet tools to engineer these fungi have not yet been established. Here, we show that the anaerobic gut fungi have the most GC-depleted genomes among 443 sequenced organisms in the fungal kingdom, which has ramifications for heterologous expression of genes as well as for emerging CRISPR-based genome engineering approaches. Comparative genomic analyses suggest that anaerobic fungi may contain cellular machinery to aid in sexual reproduction, yet a complete mating pathway was not identified. Predicted proteomes of the anaerobic fungi also contain an unusually large fraction of proteins with homopolymeric amino acid runs consisting of five or more identical consecutive amino acids. In particular, threonine runs are especially enriched in anaerobic fungal carbohydrate-active enzymes (CAZymes), and this, together with a high abundance of predicted N-glycosylation motifs, suggests that gut fungal CAZymes are heavily glycosylated, which may impact heterologous production of these biotechnologically useful enzymes. Finally, we present a codon optimization strategy to aid in the development of genetic engineering tools tailored to these early-branching anaerobic fungi
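Two of the measurements named in this abstract are easy to state precisely: genomic GC content, and the fraction of proteins containing a homopolymeric run of five or more identical consecutive amino acids. The Python sketch below shows one way to compute both. All function names are hypothetical, and the handling of ambiguous bases (they are skipped) is an assumption rather than something the abstract specifies.

```python
import re


def gc_fraction(dna):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    bases = [b for b in dna.upper() if b in "ACGT"]
    if not bases:
        raise ValueError("no unambiguous bases supplied")
    return sum(b in "GC" for b in bases) / len(bases)


def homopolymer_runs(protein, min_len=5):
    """List (residue, start_index, run_length) for runs of >= min_len."""
    if min_len < 2:
        raise ValueError("min_len must be at least 2")
    # One captured residue followed by at least (min_len - 1) repeats of it.
    pattern = re.compile(r"([A-Z])\1{%d,}" % (min_len - 1))
    return [
        (m.group(1), m.start(), len(m.group(0)))
        for m in pattern.finditer(protein.upper())
    ]


def fraction_with_runs(proteins, min_len=5):
    """Fraction of proteins that contain at least one qualifying run."""
    proteins = list(proteins)
    if not proteins:
        raise ValueError("no proteins supplied")
    with_run = sum(bool(homopolymer_runs(p, min_len)) for p in proteins)
    return with_run / len(proteins)
```

Filtering the output of `homopolymer_runs` on residue `"T"` would isolate the threonine runs the abstract highlights in CAZymes.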

    Phylogenetic differences in content and intensity of periodic proteins

    Many proteins exhibit sequence periodicity, often correlated with a visible structural periodicity. The statistical significance of such periodicity can be assessed by means of a chi-square-based test, with significance thresholds being calculated from shuffled sequences. Comparison of the complete proteomes of 45 species reveals striking differences in the proportion of periodic proteins and the intensity of the most significant periodicities. Eukaryotes tend to have a higher proportion of periodic proteins than eubacteria, which in turn tend to have more than archaea. The intensity of periodicity in the most periodic proteins is also greatest in eukaryotes. By contrast, the relatively small group of periodic proteins in archaea tends to be weakly periodic compared to those of eukaryotes and eubacteria. Exceptions to this general rule are found in those prokaryotes with multicellular life-cycle phases, e.g. Methanosarcina spp. or Anabaena spp., which have more periodicities than prokaryotes in general, and in unicellular eukaryotes, which have fewer than multicellular eukaryotes. Significantly periodic proteins in eukaryotes are distributed over a wide range of period lengths, whereas prokaryotic proteins typically have a more limited set of period lengths. This is further investigated by repeating the analysis on the NRL-3D database of proteins of solved structure. Some short-range periodicities are explicable in terms of basic secondary structure, e.g. alpha helices, while middle-range periodicities are frequently found to consist of known short Pfam domains, e.g. leucine-rich repeats, tetratricopeptides or armadillo domains. However, not all can be explained in this way
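The abstract mentions a chi-square-based periodicity test with thresholds derived from shuffled sequences, but does not give the statistic itself. The Python sketch below is therefore one plausible formulation and not the authors' method: for a candidate period, it builds a contingency table of residue type against position modulo the period, computes the Pearson chi-square statistic, and estimates a threshold as an empirical quantile over shuffled copies of the same sequence. The function names, the number of shuffles, and the 0.95 quantile are all assumptions.

```python
import random
from collections import Counter


def periodicity_chi2(seq, period):
    """Pearson chi-square for residue type vs. position modulo period."""
    n = len(seq)
    residue_counts = Counter(seq)
    phase_counts = Counter(i % period for i in range(n))
    observed = Counter((res, i % period) for i, res in enumerate(seq))
    chi2 = 0.0
    for res, r_count in residue_counts.items():
        for phase, p_count in phase_counts.items():
            # Expected count if residue type were independent of phase.
            expected = r_count * p_count / n
            diff = observed[(res, phase)] - expected
            chi2 += diff * diff / expected
    return chi2


def shuffled_threshold(seq, period, n_shuffles=200, quantile=0.95, seed=0):
    """Empirical quantile of the statistic over shuffled copies of seq."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    residues = list(seq)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        scores.append(periodicity_chi2("".join(residues), period))
    scores.sort()
    index = min(len(scores) - 1, int(quantile * len(scores)))
    return scores[index]
```

Under this formulation, a sequence would be called periodic at a given period when `periodicity_chi2` exceeds `shuffled_threshold` for that period. Shuffling preserves amino acid composition, so the threshold adapts to each protein's length and residue bias.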

    Predicting protein function by machine learning on amino acid sequences – a critical evaluation

    Copyright © 2007 Al-Shahib et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Background: Predicting the function of newly discovered proteins by simply inspecting their amino acid sequence is one of the major challenges of post-genomic computational biology, especially when done without recourse to experimentation or homology information. Machine learning classifiers are able to discriminate between proteins belonging to different functional classes. Until now, however, it has been unclear whether this ability would be transferable to proteins of unknown function, which may show distinct biases compared to experimentally more tractable proteins. Results: Here we show that proteins with known and unknown function do indeed differ significantly. We then show that proteins from different bacterial species differ to an even larger and very surprising extent, but that functional classifiers nonetheless generalize successfully across species boundaries. We also show that in the case of highly specialized proteomes, classifiers from a different, but more conventional, species may in fact outperform the endogenous species-specific classifier. Conclusion: We conclude that there is a very good prospect of successfully predicting the function of as yet uncharacterized proteins using machine learning classifiers trained on proteins of known function