297 research outputs found
Improving the accuracy of protein secondary structure prediction using structural alignment
BACKGROUND: The accuracy of protein secondary structure prediction has steadily improved over the past 30 years. Now many secondary structure prediction methods routinely achieve an accuracy (Q3) of about 75%. We believe this accuracy could be further improved by including structure (as opposed to sequence) database comparisons as part of the prediction process. Indeed, given the large size of the Protein Data Bank (>35,000 sequences), the probability of a newly identified sequence having a structural homologue is actually quite high. RESULTS: We have developed a method that performs structure-based sequence alignments as part of the secondary structure prediction process. By mapping the structure of a known homologue (sequence ID >25%) onto the query protein's sequence, it is possible to predict at least a portion of that query protein's secondary structure. By integrating this structural alignment approach with conventional (sequence-based) secondary structure methods and then combining it with a "jury-of-experts" system to generate a consensus result, it is possible to attain very high prediction accuracy. Using a sequence-unique test set of 1644 proteins from EVA, this new method achieves an average Q3 score of 81.3%. Extensive testing indicates this is approximately 4–5% better than any other method currently available. Assessments using non sequence-unique test sets (typical of those used in proteome annotation or structural genomics) indicate that this new method can achieve a Q3 score approaching 88%. CONCLUSION: By using both sequence and structure databases and by exploiting the latest techniques in machine learning it is possible to routinely predict protein secondary structure with an accuracy well above 80%. A program and web server, called PROTEUS, that performs these secondary structure predictions is accessible at . For high throughput or batch sequence analyses, the PROTEUS programs, databases (and server) can be downloaded and run locally
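The two quantities this abstract relies on, the Q3 score and a consensus over several predictors, can be illustrated with a minimal sketch. It assumes three-state labels (H, E, C) and swaps in a plain per-residue majority vote for the "jury-of-experts" step; the abstract does not describe how PROTEUS actually combines its predictors, and the function names here are hypothetical.

```python
from collections import Counter


def q3_score(predicted: str, observed: str) -> float:
    """Percentage of residues whose three-state label (H, E or C) is correct."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have equal length")
    if not observed:
        raise ValueError("sequences must be non-empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


def jury_consensus(predictions: list[str]) -> str:
    """Per-residue majority vote (an assumption, not the PROTEUS jury).

    Ties are resolved in favour of the earliest-listed predictor.
    """
    consensus = []
    for column in zip(*predictions):
        counts = Counter(column)
        best = max(counts.values())
        consensus.append(next(label for label in column if counts[label] == best))
    return "".join(consensus)


# Toy data, invented for illustration only.
observed = "HHHHCCEEEE"
predictions = ["HHHHCCEEEC", "HHHCCCEEEE", "CHHHCCEEEE"]
print(jury_consensus(predictions), q3_score(jury_consensus(predictions), observed))
```

In this toy case each predictor makes one error at a different position, so the vote recovers every residue; real predictors tend to make correlated errors, which is one reason a learned combiner may be preferred over a simple vote.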
Comparative modelling of protein structure and its impact on microbial cell factories
Comparative modeling is becoming an increasingly helpful technique in microbial cell factories as the knowledge of the three-dimensional structure of a protein would be an invaluable aid to solve problems on protein production. For this reason, an introduction to comparative modeling is presented, with special emphasis on the basic concepts, opportunities and challenges of protein structure prediction. This review is intended to serve as a guide for the biologist who has no special expertise and who is not involved in the determination of protein structure. Selected applications of comparative modeling in microbial cell factories are outlined, and the role of microbial cell factories in the structural genomics initiative is discussed
Clustering protein sequences with a novel metric transformed from sequence similarity scores and sequence alignments with neural networks
BACKGROUND: The sequencing of the human genome has enabled us to access a comprehensive list of genes (both experimental and predicted) for further analysis. While a majority of the approximately 30000 known and predicted human coding genes are characterized and have been assigned at least one function, there remains a fair number of genes (about 12000) for which no annotation has been made. The recent sequencing of other genomes has provided us with a huge amount of auxiliary sequence data which could help in the characterization of the human genes. Clustering these sequences into families is one of the first steps to perform comparative studies across several genomes. RESULTS: Here we report a novel clustering algorithm (CLUGEN) that has been used to cluster sequences of experimentally verified and predicted proteins from all sequenced genomes using a novel distance metric which is a neural network score between a pair of protein sequences. This distance metric is based on the pairwise sequence similarity score and the similarity between their domain structures. The distance metric is the probability that a pair of protein sequences are of the same Interpro family/domain, which facilitates the modelling of transitive homology closure to detect remote homologues. The hierarchical average clustering method is applied with the new distance metric. CONCLUSION: Benchmarking studies of our algorithm versus those reported in the literature show that our algorithm provides clustering results with lower false positive and false negative rates. The clustering algorithm is applied to cluster several eukaryotic genomes and several dozen prokaryotic genomes
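Hierarchical average (average-linkage) clustering over a pairwise distance can be sketched as follows. This is a hedged illustration, not CLUGEN itself: it assumes the distance is taken as one minus a same-family probability, that merging stops at a fixed threshold, and the probabilities below are invented.

```python
from itertools import combinations


def average_linkage(distance, n_items, threshold):
    """Merge the closest pair of clusters until none is within `threshold`.

    distance: dict mapping frozenset({i, j}) to the distance between items i and j.
    Returns the clusters as sorted lists of item indices.
    """
    clusters = [[i] for i in range(n_items)]

    def linkage(a, b):
        # Average of all pairwise distances between members of a and b.
        total = sum(distance[frozenset((i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        if linkage(clusters[x], clusters[y]) > threshold:
            break
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return sorted(sorted(c) for c in clusters)


# Hypothetical same-family probabilities for four sequences.
same_family = {(0, 1): 0.9, (2, 3): 0.8, (0, 2): 0.1, (0, 3): 0.1, (1, 2): 0.1, (1, 3): 0.1}
dist = {frozenset(pair): 1.0 - p for pair, p in same_family.items()}
print(average_linkage(dist, 4, threshold=0.5))
```

The naive search recomputes every linkage on each pass, which is acceptable for a handful of sequences; clustering whole genomes would need a cached distance matrix or a library implementation.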
Protein Fold Recognition Using Neural Networks
Accurately predicting the three-dimensional (3D) structures of proteins from their amino acid sequences alone remains a challenging problem. However, using protein fold recognition tools, it is often possible to achieve good models or at least to gain some more information, to aid scientists in their research. This thesis describes the development of TUNE (Threading Using Neural Networks), a fold recognition program using artificial neural network (ANN) models. A new method to generate amino acid substitution matrices is described in chapter two. It uses an ANN to generalise amino acid substitutions observed in protein structure alignments. Matrices for alignment scoring from this approach were compared with classic alignment scoring schemes. From these neural network models, a series of encoding schemes were constructed. These schemes describe the amino acid types with a few numbers. They were generated to replace the orthogonal encoding scheme, so that smaller, faster and more accurate neural network models can be applied to bioinformatics problems. The TUNE model was introduced in chapter four to measure protein sequence-structure compatibility. Given the integrated residue structural environment descriptions, the model predicts probabilities of observing amino acid types in such environments. Using this model, a scoring function to measure the fitness of a residue in a protein structure model can be made for protein threading programs. The model in chapter two was extended by including the residue structural environment descriptions for predictions. A simple protein fold recognition program with a dynamic programming algorithm was developed using this model. The program was then tested in the fourth round of the Critical Assessment of protein Structure Prediction methods (CASP4) and produced reasonably good results
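The orthogonal encoding that the thesis sets out to replace can be sketched as below, assuming the usual 20-letter amino acid alphabet. The compact alternative is shown only as a lookup into a hypothetical table of a few numbers per residue type; the actual values learned in the thesis are not given in the abstract.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes


def orthogonal_encode(sequence):
    """Return one 20-element 0/1 vector per residue (one-hot encoding)."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for residue in sequence:
        vector = [0] * len(AMINO_ACIDS)
        vector[index[residue]] = 1
        encoded.append(vector)
    return encoded


def compact_encode(sequence, table):
    """Look up a short vector per residue; `table` is a hypothetical mapping."""
    return [table[residue] for residue in sequence]


# A 15-residue window needs 15 * 20 = 300 network inputs with orthogonal
# encoding, but only 15 * k inputs with a k-number scheme.
window = "ACDEFGHIKLMNPQR"
print(sum(len(v) for v in orthogonal_encode(window)))
```

The saving in input size is what makes the smaller and faster networks mentioned in the abstract possible; whether accuracy also improves depends on how well the few numbers capture residue similarity.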
Improving model construction of profile HMMs for remote homology detection through structural alignment
BACKGROUND: Remote homology detection is a challenging problem in Bioinformatics. Arguably, profile Hidden Markov Models (pHMMs) are one of the most successful approaches in addressing this important problem. pHMM packages present a relatively small computational cost, and perform particularly well at recognizing remote homologies. This raises the question of whether structural alignments could impact the performance of pHMMs trained from proteins in the Twilight Zone, as structural alignments are often more accurate than sequence alignments at identifying motifs and functional residues. Next, we assess the impact of using structural alignments in pHMM performance. RESULTS: We used the SCOP database to perform our experiments. Structural alignments were obtained using the 3DCOFFEE and MAMMOTH-mult tools; sequence alignments were obtained using CLUSTALW, TCOFFEE, MAFFT and PROBCONS. We performed leave-one-family-out cross-validation over super-families. Performance was evaluated through ROC curves and paired two tailed t-test. CONCLUSION: We observed that pHMMs derived from structural alignments performed significantly better than pHMMs derived from sequence alignment in low-identity regions, mainly below 20%. We believe this is because structural alignment tools are better at focusing on the important patterns that are more often conserved through evolution, resulting in higher quality pHMMs. On the other hand, sensitivity of these tools is still quite low for these low-identity regions. Our results suggest a number of possible directions for improvements in this area.
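The evaluation protocol named in this abstract, leave-one-family-out cross-validation scored with ROC curves, can be sketched roughly as follows. The sketch assumes each super-family is given as a mapping from family name to member sequence identifiers and summarises the ROC curve by its area; the function names and toy inputs are hypothetical, and the pHMM training step itself is omitted.

```python
def leave_one_family_out(families):
    """Yield (held_out_family, training_ids, test_ids) for each family.

    families: dict mapping a family name to its sequence ids, all drawn
    from one super-family (an assumed input layout).
    """
    for held_out, test_ids in families.items():
        train_ids = [
            seq
            for name, members in families.items()
            if name != held_out
            for seq in members
        ]
        yield held_out, train_ids, list(test_ids)


def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison; ties count one half."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy super-family with invented identifiers.
toy = {"fam_a": ["s1", "s2"], "fam_b": ["s3"]}
for held_out, train_ids, test_ids in leave_one_family_out(toy):
    print(held_out, train_ids, test_ids)
```

Holding out a whole family, rather than random sequences, is what makes the test one of remote homology: the model never sees a close relative of the sequences it is asked to recognise.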
Using structure to explore the sequence alignment space of remote homologs
The success of protein structure modeling by homology requires an accurate sequence alignment between the query sequence and its structural template. However, sequence alignment methods based on dynamic programming (DP) are typically unable to generate accurate alignments for remote sequence homologs, thus limiting the applicability of modeling methods. A central problem is that the alignment that would produce the best structural model is generally not optimal, in the sense of having the highest DP score. Suboptimal alignment methods can be used to generate alternative alignments, but encounter difficulties given the enormous number of alignments that need to be considered. We present here a new suboptimal alignment method that relies heavily on the structure of the template. By initially aligning the query sequence to individual fragments in secondary structure elements (SSEs) and combining high-scoring fragments that pass basic tests for 'modelability', we can generate accurate alignments within a set of limited size. Chapter 1 introduces the field of protein structure prediction in general and the technique of homology modeling in particular. One subproblem of homology modeling -- the sequence to structure alignment of proteins -- is discussed in Chapter 2. Particular attention is given to descriptions of the size, density and redundancy of alignment space as well as an explanation of the dynamic programming technique and its strengths and weaknesses. The rationale for developing alternative alignment techniques and the unique difficulties of these methods are also discussed. Chapter 3 explains the methodologies of S4 -- the alternative alignment program we developed that is the main focus of this thesis. The process of finding alternative alignments with S4 involves several steps, but can be roughly divided into two main parts. First, the program looks for combinations of high-similarity fragments that pass basic rules for modelability. 
These 'fragment alignments' define regions of alignment space that can be searched more thoroughly with a statistical potential for a single representative for that region. The ensemble of alignments that is thus created needs to be evaluated for accuracy against the correct alignment. Current methods for doing so, as well as adjustments to those methods to better suit the realm of remote homology alignments, are discussed in Chapter 4. A novel measure for determining similarity between alignments, termed the inter-alignment distance (IAD), is also developed. This measure can be used to assess quality, but is also well-suited to finding redundant alignments within an ensemble. In Chapter 5, the results of testing S4 on a large set of targets from previous CASP experiments are analyzed. Comparisons to the optimal alignment as well as two standard alternative alignment methods, all of which use the same similarity score as S4, demonstrate that S4's improvement in accuracy is due to better sampling and filtering rather than more sophisticated scoring. Models made from S4 alignments are also shown to significantly improve upon those made from optimal alignments, especially for remote homologs. Finally, an example of a sequence to structure alignment offers an in-depth explanation of how S4 finds correct alignments where the other methods do not. Chapter 6 describes a set of three experiments that paired S4 with the model evaluation tool ProsaII in a homology modeling pipeline. There were two primary objectives in this project. First, we wanted to test different methods for finding remote homologs that could serve as input to S4. And second, we evaluated the use of ProsaII as a method for discriminating between good and bad models, and thus also between homologous and non-homologous templates. The first two experiments are essentially blind searches for homologous sequences and structures. 
The third experiment takes remote templates returned by PSI-BLAST and uses S4 and ProsaII to find alignments and determine whether the template is a structural homolog. While S4 was able to find homologs in the blind searches, the alignment/model quality and level of discrimination was found to be higher when the input to the pipeline came from a set of structures produced by a template selection method. Finally, Chapter 7 discusses the consequences of this research and suggests future directions for its application
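The "optimal" alignment that this abstract contrasts with suboptimal alternatives is the one with the highest dynamic programming score. A minimal sketch of that baseline is given below, using the textbook Needleman-Wunsch recurrence with a linear gap penalty; the match, mismatch and gap values are placeholder assumptions, and this is not the similarity score or sampling procedure used by S4.

```python
def global_alignment_score(query, template, match=1, mismatch=-1, gap=-2):
    """Highest global alignment score under a linear gap penalty."""
    rows, cols = len(query) + 1, len(template) + 1
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            substitution = match if query[i - 1] == template[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + substitution,  # align the two residues
                score[i - 1][j] + gap,               # gap in the template
                score[i][j - 1] + gap,               # gap in the query
            )
    return score[-1][-1]


print(global_alignment_score("ACD", "AD"))
```

The recurrence keeps only the best-scoring path into each cell, which is why it returns a single alignment; generating alternatives requires either tracing back near-optimal paths or, as the abstract describes, sampling alignment space by other means.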