4,006 research outputs found
Protein and DNA sequence determinants of thermophilic adaptation
Prokaryotes living at extreme environmental temperatures exhibit pronounced
signatures in the amino acid composition of their proteins and nucleotide
compositions of their genomes reflective of adaptation to their thermal
environments. However, despite significant efforts, the definitive answer of
what are the genomic and proteomic compositional determinants of Optimal Growth
Temperature of prokaryotic organisms remained elusive. Here the authors
performed a comprehensive analysis of amino acid and nucleotide compositional
signatures of thermophylic adaptation by exhaustively evaluating all
combinations of amino acids and nucleotides as possible determinants of Optimal
Growth Temperature for all prokaryotic organisms with fully sequences genomes..
The authors discovered that total concentration of seven amino acids in
proteomes, IVYWREL, serves as a universal proteomic predictor of Optimal Growth
Temperature in prokaryotes. Resolving the old-standing controversy the authors
determined that the variation in nucleotide composition (increase of purine
load, or A+G content with temperature) is largely a consequence of thermal
adaptation of proteins. However, the frequency with which A and G nucleotides
appear as nearest neighbors in genome sequences is strongly and independently
correlated with Optimal Growth Temperature. as a result of codon bias in
corresponding genomes. Together these results provide a complete picture of
proteomic and genomic determinants of thermophilic adaptation.Comment: in press PLoS Computational Biology; revised versio
PRED-CLASS: cascading neural networks for generalized protein classification and genome-wide applications
A cascading system of hierarchical, artificial neural networks (named
PRED-CLASS) is presented for the generalized classification of proteins into
four distinct classes-transmembrane, fibrous, globular, and mixed-from
information solely encoded in their amino acid sequences. The architecture of
the individual component networks is kept very simple, reducing the number of
free parameters (network synaptic weights) for faster training, improved
generalization, and the avoidance of data overfitting. Capturing information
from as few as 50 protein sequences spread among the four target classes (6
transmembrane, 10 fibrous, 13 globular, and 17 mixed), PRED-CLASS was able to
obtain 371 correct predictions out of a set of 387 proteins (success rate
approximately 96%) unambiguously assigned into one of the target classes. The
application of PRED-CLASS to several test sets and complete proteomes of
several organisms demonstrates that such a method could serve as a valuable
tool in the annotation of genomic open reading frames with no functional
assignment or as a preliminary step in fold recognition and ab initio structure
prediction methods. Detailed results obtained for various data sets and
completed genomes, along with a web sever running the PRED-CLASS algorithm, can
be accessed over the World Wide Web at http://o2.biol.uoa.gr/PRED-CLAS
The Plant Short-Chain Dehydrogenase (SDR) superfamily:genome-wide inventory and diversification patterns
Background Short-chain dehydrogenases/reductases (SDRs) form one of the largest and oldest NAD(P)(H) dependent oxidoreductase families. Despite a conserved 'Rossmann-fold' structure, members of the SDR superfamily exhibit low sequence similarities, which constituted a bottleneck in terms of identification. Recent classification methods, relying on hidden-Markov models (HMMs), improved identification and enabled the construction of a nomenclature. However, functional annotations of plant SDRs remain scarce. Results Wide-scale analyses were performed on ten plant genomes. The combination of hidden Markov model (HMM) based analyses and similarity searches led to the construction of an exhaustive inventory of plant SDR. With 68 to 315 members found in each analysed genome, the inventory confirmed the over-representation of SDRs in plants compared to animals, fungi and prokaryotes. The plant SDRs were first classified into three major types --- 'classical', 'extended' and 'divergent' --- but a minority (10 % of the predicted SDRs) could not be classified into these general types ('unknown' or 'atypical' types). In a second step, we could categorize the vast majority of land plant SDRs into a set of 49 families. Out of these 49 families, 35 appeared early during evolution since they are commonly found through all the Green Lineage. Yet, some SDR families --- tropinone reductase-like proteins (SDR65C), 'ABA2-like'-NAD dehydrogenase (SDR110C), 'salutaridine/menthone-reductase-like' proteins (SDR114C), 'dihydroflavonol 4-reductase'-like proteins (SDR108E) and 'isoflavone-reductase-like' (SDR460A) proteins --- have undergone significant functional diversification within vascular plants since they diverged from Bryophytes. Interestingly, these diversified families are either involved in the secondary metabolism routes (terpenoids, alkaloids, phenolics) or participate in developmental processes (hormone biosynthesis or catabolism, flower development), in opposition to SDR families involved in primary metabolism which are poorly diversified. Conclusion The application of HMMs to plant genomes enabled us to identify 49 families that encompass all Angiosperms ('higher plants') SDRs, each family being sufficiently conserved to enable simpler analyses based only on overall sequence similarity. The multiplicity of SDRs in plant kingdom is mainly explained by the diversification of large families involved in different secondary metabolism pathways, suggesting that the chemical diversification that accompanied the emergence of vascular plants acted as a driving force for SDR evolution
Genomic and proteomic biases inform metabolic engineering strategies for anaerobic fungi.
Anaerobic fungi (Neocallimastigomycota) are emerging non-model hosts for biotechnology due to their wealth of biomass-degrading enzymes, yet tools to engineer these fungi have not yet been established. Here, we show that the anaerobic gut fungi have the most GC depleted genomes among 443 sequenced organisms in the fungal kingdom, which has ramifications for heterologous expression of genes as well as for emerging CRISPR-based genome engineering approaches. Comparative genomic analyses suggest that anaerobic fungi may contain cellular machinery to aid in sexual reproduction, yet a complete mating pathway was not identified. Predicted proteomes of the anaerobic fungi also contain an unusually large fraction of proteins with homopolymeric amino acid runs consisting of five or more identical consecutive amino acids. In particular, threonine runs are especially enriched in anaerobic fungal carbohydrate active enzymes (CAZymes) and this, together with a high abundance of predicted N-glycosylation motifs, suggests that gut fungal CAZymes are heavily glycosylated, which may impact heterologous production of these biotechnologically useful enzymes. Finally, we present a codon optimization strategy to aid in the development of genetic engineering tools tailored to these early-branching anaerobic fungi
Phylogenetic differences in content and intensity of periodic proteins
Many proteins exhibit sequence periodicity, often correlated with a visible structural periodicity. The statistical significance of such periodicity can be assessed by means of a chi-square-based test, with significance thresholds being calculated from shuffled sequences. Comparison of the complete proteomes of 45 species reveals striking differences in the proportion of periodic proteins and the intensity of the most significant periodicities. Eukaryotes tend to have a higher proportion of periodic proteins than eubacteria, which in turn tend to have more than archaea. The intensity of periodicity in the most periodic proteins is also greatest in eukaryotes. By contrast, the relatively small group of periodic proteins in archaea also tend to be weakly periodic compared to those of eukaryotes and eubacteria. Exceptions to this general rule are found in those prokaryotes with multicellular life-cycle phases, e.g. Methanosarcina sps. or Anabaena sps., which have more periodicities than prokaryotes in general, and in unicellular eukaryotes, which have fewer than multicellular eukaryotes. The distribution of significantly periodic proteins in eukaryotes is over a wide range of period lengths, whereas prokaryotic proteins typically have a more limited set of period lengths. This is further investigated by repeating the analysis on the NRL-3D database of proteins of solved structure. Some short range periodicities are explicable in terms of basic secondary structure, e.g. alpha helices, while middle range periodicities are frequently found to consist of known short Pfam domains, e.g. leucine-rich repeats, tetratricopeptides or armadillo domains. However, not all can be explained in this way
Predicting protein function by machine learning on amino acid sequences – a critical evaluation
Copyright @ 2007 Al-Shahib et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.Background: Predicting the function of newly discovered proteins by simply inspecting their amino acid sequence is one of the major challenges of post-genomic computational biology, especially when done without recourse to experimentation or homology information. Machine learning classifiers are able to discriminate between proteins belonging to different functional classes. Until now, however, it has been unclear if this ability would be transferable to proteins of unknown function, which may show distinct biases compared to experimentally more tractable proteins. Results: Here we show that proteins with known and unknown function do indeed differ significantly. We then show that proteins from different bacterial species also differ to an even larger and very surprising extent, but that functional classifiers nonetheless generalize successfully across species boundaries. We also show that in the case of highly specialized proteomes classifiers from a different, but more conventional, species may in fact outperform the endogenous species-specific classifier. Conclusion: We conclude that there is very good prospect of successfully predicting the function of yet uncharacterized proteins using machine learning classifiers trained on proteins of known function
- …