605 research outputs found

    Toward optimal fragment generations for ab initio protein structure assembly

    Full text link
    Fragment assembly using structural motifs excised from other solved proteins has shown to be an efficient method for ab initio proteinā€structure prediction. However, how to construct accurate fragments, how to derive optimal restraints from fragments, and what the best fragment length is are the basic issues yet to be systematically examined. In this work, we developed a gaplessā€threading method to generate positionā€specific structure fragments. Distance profiles and torsion angle pairs are then derived from the fragments by statistical consistency analysis, which achieved comparable accuracy with the machineā€learningā€based methods although the fragments were taken from unrelated proteins. When measured by both accuracies of the derived distance profiles and torsion angle pairs, we come to a consistent conclusion that the optimal fragment length for structural assembly is around 10, and at least 100 fragments at each location are needed to achieve optimal structure assembly. The distant profiles and torsion angle pairs as derived by the fragments have been successfully used in QUARK for ab initio protein structure assembly and are provided by the QUARK online server at http://zhanglab.ccmb. med.umich.edu/QUARK/ . Proteins 2013. Ā© 2012 Wiley Periodicals, Inc.Peer Reviewedhttp://deepblue.lib.umich.edu/bitstream/2027.42/96355/1/24179_ftp.pdfhttp://deepblue.lib.umich.edu/bitstream/2027.42/96355/2/PROT_24179_sm_SuppInfo.pd

    Learning biophysically-motivated parameters for alpha helix prediction

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Our goal is to develop a state-of-the-art protein secondary structure predictor, with an intuitive and biophysically-motivated energy model. We treat structure prediction as an optimization problem, using parameterizable cost functions representing biological "pseudo-energies". Machine learning methods are applied to estimate the values of the parameters to correctly predict known protein structures.</p> <p>Results</p> <p>Focusing on the prediction of alpha helices in proteins, we show that a model with 302 parameters can achieve a Q<sub><it>Ī± </it></sub>value of 77.6% and an SOV<sub><it>Ī± </it></sub>value of 73.4%. Such performance numbers are among the best for techniques that do not rely on external databases (such as multiple sequence alignments). Further, it is easier to extract biological significance from a model with so few parameters.</p> <p>Conclusion</p> <p>The method presented shows promise for the prediction of protein secondary structure. Biophysically-motivated elementary free-energies can be learned using SVM techniques to construct an energy cost function whose predictive performance rivals state-of-the-art. This method is general and can be extended beyond the all-alpha case described here.</p

    Learning biophysically-motivated parameters for alpha helix prediction

    Get PDF
    Background: Our goal is to develop a state-of-the-art protein secondary structure predictor, with an intuitive and biophysically-motivated energy model. We treat structure prediction as an optimization problem, using parameterizable cost functions representing biological ā€œpseudo-energies. ā€ Machine learning methods are applied to estimate the values of the parameters to correctly predict known protein structures. Results: Focusing on the prediction of alpha helices in proteins, we show that a model with 302 parameters can achieve a QĪ± value of 77.6 % and an SOVĪ± value of 73.4%. Such performance numbers are among the best for techniques that do not rely on external databases (such as multiple sequence alignments). Further, it is easier to extract biological significance from a model with so few parameters. Conclusions: The method presented shows promise for the prediction of protein secondary structure. Biophysically-motivated elementary free-energies can be learned using SVM techniques to construct an energy cost function whose predictive performance rivals state-of-the-art. This method is general and can be extended beyond the all-alpha case described here. 1 Backgroun

    Automated protein structure modeling in CASP9 by Iā€TASSER pipeline combined with QUARKā€based ab initio folding and FGā€MDā€based structure refinement

    Full text link
    Iā€TASSER is an automated pipeline for protein tertiary structure prediction using multiple threading alignments and iterative structure assembly simulations. In CASP9 experiments, two new algorithms, QUARK and fragmentā€guided molecular dynamics (FGā€MD), were added to the Iā€TASSER pipeline for improving the structural modeling accuracy. QUARK is a de novo structure prediction algorithm used for structure modeling of proteins that lack detectable template structures. For distantly homologous targets, QUARK models are found useful as a reference structure for selecting good threading alignments and guiding the Iā€TASSER structure assembly simulations. FGā€MD is an atomicā€level structural refinement program that uses structural fragments collected from the PDB structures to guide molecular dynamics simulation and improve the local structure of predicted model, including hydrogenā€bonding networks, torsion angles, and steric clashes. Despite considerable progress in both the templateā€based and templateā€free structure modeling, significant improvements on protein target classification, domain parsing, model selection, and ab initio folding of Ī²ā€proteins are still needed to further improve the Iā€TASSER pipeline. Proteins 2011; Ā© 2011 Wileyā€Liss, Inc.Peer Reviewedhttp://deepblue.lib.umich.edu/bitstream/2027.42/88077/1/23111_ftp.pd

    Learning sparse models for a dynamic Bayesian network classifier of protein secondary structure

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Protein secondary structure prediction provides insight into protein function and is a valuable preliminary step for predicting the 3D structure of a protein. Dynamic Bayesian networks (DBNs) and support vector machines (SVMs) have been shown to provide state-of-the-art performance in secondary structure prediction. As the size of the protein database grows, it becomes feasible to use a richer model in an effort to capture subtle correlations among the amino acids and the predicted labels. In this context, it is beneficial to derive sparse models that discourage over-fitting and provide biological insight.</p> <p>Results</p> <p>In this paper, we first show that we are able to obtain accurate secondary structure predictions. Our per-residue accuracy on a well established and difficult benchmark (CB513) is 80.3%, which is comparable to the state-of-the-art evaluated on this dataset. We then introduce an algorithm for sparsifying the parameters of a DBN. Using this algorithm, we can automatically remove up to 70-95% of the parameters of a DBN while maintaining the same level of predictive accuracy on the SD576 set. At 90% sparsity, we are able to compute predictions three times faster than a fully dense model evaluated on the SD576 set. We also demonstrate, using simulated data, that the algorithm is able to recover true sparse structures with high accuracy, and using real data, that the sparse model identifies known correlation structure (local and non-local) related to different classes of secondary structure elements.</p> <p>Conclusions</p> <p>We present a secondary structure prediction method that employs dynamic Bayesian networks and support vector machines. We also introduce an algorithm for sparsifying the parameters of the dynamic Bayesian network. The sparsification approach yields a significant speed-up in generating predictions, and we demonstrate that the amino acid correlations identified by the algorithm correspond to several known features of protein secondary structure. Datasets and source code used in this study are available at <url>http://noble.gs.washington.edu/proj/pssp</url>.</p

    Interplay of Iā€TASSER and QUARK for templateā€based and ab initio protein structure prediction in CASP10

    Full text link
    We develop and test a new pipeline in CASP10 to predict protein structures based on an interplay of Iā€TASSER and QUARK for both freeā€modeling (FM) and templateā€based modeling (TBM) targets. The most noteworthy observation is that sorting through the threading template pool using the QUARKā€based ab initio models as probes allows the detection of distantā€homology templates which might be ignored by the traditional sequence profileā€based threading alignment algorithms. Further template assembly refinement by Iā€TASSER resulted in successful folding of two mediumā€sized FM targets with >150 residues. For TBM, the multiple threading alignments from LOMETS are, for the first time, incorporated into the ab initio QUARK simulations, which were further refined by Iā€TASSER assembly refinement. Compared with the traditional threading assembly refinement procedures, the inclusion of the threadingā€constrained ab initio folding models can consistently improve the quality of the fullā€length models as assessed by the GDTā€HA and hydrogenā€bonding scores. Despite the success, significant challenges still exist in domain boundary prediction and consistent folding of mediumā€size proteins (especially betaā€proteins) for nonhomologous targets. Further developments of sensitive foldā€recognition and ab initio folding methods are critical for solving these problems. Proteins 2014; 82(Suppl 2):175ā€“187. Ā© 2013 Wiley Periodicals, Inc.Peer Reviewedhttp://deepblue.lib.umich.edu/bitstream/2027.42/102666/1/prot24341-sup-0001-suppinfo.pdfhttp://deepblue.lib.umich.edu/bitstream/2027.42/102666/2/prot24341.pd

    Annotation of Alternatively Spliced Proteins and Transcripts with Protein-Folding Algorithms and Isoform-Level Functional Networks.

    Get PDF
    Tens of thousands of splice isoforms of proteins have been catalogued as predicted sequences from transcripts in humans and other species. Relatively few have been characterized biochemically or structurally. With the extensive development of protein bioinformatics, the characterization and modeling of isoform features, isoform functions, and isoform-level networks have advanced notably. Here we present applications of the I-TASSER family of algorithms for folding and functional predictions and the IsoFunc, MIsoMine, and Hisonet data resources for isoform-level analyses of network and pathway-based functional predictions and protein-protein interactions. Hopefully, predictions and insights from protein bioinformatics will stimulate many experimental validation studies

    Algorithms for pre-microrna classification and a GPU program for whole genome comparison

    Get PDF
    MicroRNAs (miRNAs) are non-coding RNAs with approximately 22 nucleotides that are derived from precursor molecules. These precursor molecules or pre-miRNAs often fold into stem-loop hairpin structures. However, a large number of sequences with pre-miRNA-like hairpin can be found in genomes. It is a challenge to distinguish the real pre-miRNAs from other hairpin sequences with similar stem-loops (referred to as pseudo pre-miRNAs). The first part of this dissertation presents a new method, called MirID, for identifying and classifying microRNA precursors. MirID is comprised of three steps. Initially, a combinatorial feature mining algorithm is developed to identify suitable feature sets. Then, the feature sets are used to train support vector machines to obtain classification models, based on which classifier ensemble is constructed. Finally, an AdaBoost algorithm is adopted to further enhance the accuracy of the classifier ensemble. Experimental results on a variety of species demonstrate the good performance of the proposed approach, and its superiority over existing methods. In the second part of this dissertation, A GPU (Graphics Processing Unit) program is developed for whole genome comparison. The goal for the research is to identify the commonalities and differences of two genomes from closely related organisms, via multiple sequencing alignments by using a seed and extend technique to choose reliable subsets of exact or near exact matches, which are called anchors. A rigorous method named Smith-Waterman search is applied for the anchor seeking, but takes days and months to map millions of bases for mammalian genome sequences. With GPU programming, which is designed to run in parallel hundreds of short functions called threads, up to 100X speed up is achieved over similar CPU executions

    Fast learning optimized prediction methodology for protein secondary structure prediction, relative solvent accessibility prediction and phosphorylation prediction

    Get PDF
    Computational methods are rapidly gaining importance in the field of structural biology, mostly due to the explosive progress in genome sequencing projects and the large disparity between the number of sequences and the number of structures. There has been an exponential growth in the number of available protein sequences and a slower growth in the number of structures. There is therefore an urgent need to develop computed structures and identify the functions of these sequences. Developing methods that will satisfy these needs both efficiently and accurately is of paramount importance for advances in many biomedical fields, for a better basic understanding of aberrant states of stress and disease, including drug discovery and discovery of biomarkers. Several aspects of secondary structure predictions and other protein structure-related predictions are investigated using different types of information such as data obtained from knowledge-based potentials derived from amino acids in protein sequences, physicochemical properties of amino acids and propensities of amino acids to appear at the ends of secondary structures. Investigating the performance of these secondary structure predictions by type of amino acid highlights some interesting aspects relating to the influences of the individual amino acid types on formation of secondary structures and points toward ways to make further gains. Other research areas include Relative Solvent Accessibility (RSA) predictions and predictions of phosphorylation sites, which is one of the Post-Translational Modification (PTM) sites in proteins. Protein secondary structures and other features of proteins are predicted efficiently, reliably, less expensively and more accurately. A novel method called Fast Learning Optimized PREDiction (FLOPRED) Methodology is proposed for predicting protein secondary structures and other features, using knowledge-based potentials, a Neural Network based Extreme Learning Machine (ELM) and advanced Particle Swarm Optimization (PSO) techniques that yield better and faster convergence to produce more accurate results. These techniques yield superior classification of secondary structures, with a training accuracy of 93.33% and a testing accuracy of 92.24% with a standard deviation of 0.48% obtained for a small group of 84 proteins. We have a Matthew\u27s correlation-coefficient ranging between 80.58% and 84.30% for these secondary structures. Accuracies for individual amino acids range between 83% and 92% with an average standard deviation between 0.3% and 2.9% for the 20 amino acids. On a larger set of 415 proteins, we obtain a testing accuracy of 86.5% with a standard deviation of 1.38%. These results are significantly higher than those found in the literature. Prediction of protein secondary structure based on amino acid sequence is a common technique used to predict its 3-D structure. Additional information such as the biophysical properties of the amino acids can help improve the results of secondary structure prediction. A database of protein physicochemical properties is used as features to encode protein sequences and this data is used for secondary structure prediction using FLOPRED. Preliminary studies using a Genetic Algorithm (GA) for feature selection, Principal Component Analysis (PCA) for feature reduction and FLOPRED for classification give promising results. Some amino acids appear more often at the ends of secondary structures than others. A preliminary study has indicated that secondary structure accuracy can be improved as much as 6% by including these effects for those residues present at the ends of alpha-helix, beta-strand and coil. A study on RSA prediction using ELM shows large gains in processing speed compared to using support vector machines for classification. This indicates that ELM yields a distinct advantage in terms of processing speed and performance for RSA. Additional gains in accuracies are possible when the more advanced FLOPRED algorithm and PSO optimization are implemented. Phosphorylation is a post-translational modification on proteins often controls and regulates their activities. It is an important mechanism for regulation. Phosphorylated sites are known to be present often in intrinsically disordered regions of proteins lacking unique tertiary structures, and thus less information is available about the structures of phosphorylated sites. It is important to be able to computationally predict phosphorylation sites in protein sequences obtained from mass-scale sequencing of genomes. Phosphorylation sites may aid in the determination of the functions of a protein and to better understanding the mechanisms of protein functions in healthy and diseased states. FLOPRED is used to model and predict experimentally determined phosphorylation sites in protein sequences. Our new PSO optimization included in FLOPRED enable the prediction of phosphorylation sites with higher accuracy and with better generalization. Our preliminary studies on 984 sequences demonstrate that this model can predict phosphorylation sites with a training accuracy of 92.53% , a testing accuracy 91.42% and Matthew\u27s correlation coefficient of 83.9%. In summary, secondary structure prediction, Relative Solvent Accessibility and phosphorylation site prediction have been carried out on multiple sets of data, encoded with a variety of information drawn from proteins and the physicochemical properties of their constituent amino acids. Improved and efficient algorithms called S-ELM and FLOPRED, which are based on Neural Networks and Particle Swarm Optimization are used for classifying and predicting protein sequences. Analysis of the results of these studies provide new and interesting insights into the influence of amino acids on secondary structure prediction. S-ELM and FLOPRED have also proven to be robust and efficient for predicting relative solvent accessibility of proteins and phosphorylation sites. These studies show that our method is robust and resilient and can be applied for a variety of purposes. It can be expected to yield higher classification accuracy and better generalization performance compared to previous methods
    • ā€¦
    corecore