90 research outputs found

    CONFOLD2: Improved contact-driven ab initio protein structure modeling

    Get PDF
    Background: Contact-guided protein structure prediction methods are becoming more and more successful because of the latest advances in residue-residue contact prediction. To support contact-driven structure prediction, effective tools that can quickly build tertiary structural models of good quality from predicted contacts need to be developed. Results: We develop an improved contact-driven protein modelling method, CONFOLD2, and study how it may be effectively used for ab initio protein structure prediction with predicted contacts as input. It builds models using various subsets of input contacts to explore the fold space under the guidance of a soft square energy function, and then clusters the models to obtain the top five models. CONFOLD2 obtains an average reconstruction accuracy of 0.57 TM-score for the 150 proteins in the PSICOV contact prediction dataset. When benchmarked on the CASP11 contacts predicted using CONSIP2 and CASP12 contacts predicted using Raptor-X, CONFOLD2 achieves a mean TM-score of 0.41 on both datasets. Conclusion: CONFOLD2 allows to quickly generate top five structural models for a protein sequence when its secondary structures and contacts predictions at hand. The source code of CONFOLD2 is publicly available at https://github.com/multicom-toolbox/CONFOLD2/

    Analysis of several key factors influencing deep learning-based inter-residue contact prediction

    Get PDF
    Motivation: Deep learning has become the dominant technology for protein contact prediction. However, the factors that affect the performance of deep learning in contact prediction have not been systematically investigated. Results: We analyzed the results of our three deep learning-based contact prediction methods (MULTICOMCLUSTER, MULTICOM-CONSTRUCT and MULTICOM-NOVEL) in the CASP13 experiment and identified several key factors [i.e. deep learning technique, multiple sequence alignment (MSA), distance distribution prediction and domain-based contact integration] that influenced the contact prediction accuracy. We compared our convolutional neural network (CNN)-based contact prediction methods with three coevolution-based methods on 75 CASP13 targets consisting of 108 domains. We demonstrated that the CNN-based multi-distance approach was able to leverage global coevolutionary coupling patterns comprised of multiple correlated contacts for more accurate contact prediction than the local coevolution-based methods, leading to a substantial increase of precision by 19.2 percentage points. We also tested different alignment methods and domain-based contact prediction with the deep learning contact predictors. The comparison of the three methods showed deeper sequence alignments and the integration of domain-based contact prediction with the full-length contact prediction improved the performance of contact prediction. Moreover, we demonstrated that the domain-based contact prediction based on a novel ab initio approach of parsing domains from MSAs alone without using known protein structures was a simple, fast approach to improve contact prediction. Finally, we showed that predicting the distribution of inter-residue distances in multiple distance intervals could capture more structural information and improve binary contact prediction. Availability and implementation: https://github.com/multicom-toolbox/DNCON2/

    Resdiue-residue contact driven protein structure prediction using optimization and machine learning

    Get PDF
    Significant improvements in the prediction of protein residue-residue contacts are observed in the recent years. These contacts, predicted using a variety of coevolution-based and machine learning methods, are the key contributors to the recent progress in ab initio protein structure prediction, as demonstrated in the recent CASP experiments. Continuing the development of new methods to reliably predict contact maps, tools to assess the utility of predicted contacts, and methods to construct protein tertiary structures from predicted contacts, are essential to further improve ab initio structure prediction. In this dissertation, three contributions are described -- (a) DNCON2, a two-level convolutional neural network-based method for protein contact prediction, (b) ConEVA, a toolkit for contact assessment and evaluation, and (c) CONFOLD, a method of building protein 3D structures from predicted contacts and secondary structures. Additional related contributions on protein contact prediction and structure reconstruction are also described. DNCON2 and CONFOLD demonstrate state-of-the-art performance on contact prediction and structure reconstruction from scratch. All three protein structure methods are available as software or web server which are freely available to the scientific community.Includes biblographical reference

    Identification and localization of Tospovirus genus-wide conserved residues in 3D models of the nucleocapsid and the silencing suppressor proteins

    Get PDF
    Background: Tospoviruses (genus Tospovirus, family Peribunyaviridae, order Bunyavirales) cause significant losses to a wide range of agronomic and horticultural crops worldwide. Identification and characterization of specific sequences and motifs that are critical for virus infection and pathogenicity could provide useful insights and targets for engineering virus resistance that is potentially both broad spectrum and durable. Tomato spotted wilt virus (TSWV), the most prolific member of the group, was used to better understand the structure-function relationships of the nucleocapsid gene (N), and the silencing suppressor gene (NSs), coded by the TSWV small RNA. Methods: Using a global collection of orthotospoviral sequences, several amino acids that were conserved across the genus and the potential location of these conserved amino acid motifs in these proteins was determined. We used state of the art 3D modeling algorithms, MULTICOM-CLUSTER, MULTICOM-CONSTRUCT, MULTICOM-NOVEL, I-TASSER, ROSETTA and CONFOLD to predict the secondary and tertiary structures of the N and the NSs proteins. Results: We identified nine amino acid residues in the N protein among 31 known tospoviral species, and ten amino acid residues in NSs protein among 27 tospoviral species that were conserved across the genus. For the N protein, all three algorithms gave nearly identical tertiary models. While the conserved residues were distributed throughout the protein on a linear scale, at the tertiary level, three residues were consistently located in the coil in all the models. For NSs protein models, there was no agreement among the three algorithms. However, with respect to the localization of the conserved motifs, G was consistently located in coil, while H was localized in the coil in three models. Conclusions: This is the first report of predicting the 3D structure of any tospoviral NSs protein and revealed a consistent location for two of the ten conserved residues. The modelers used gave accurate prediction for N protein allowing the localization of the conserved residues. Results form the basis for further work on the structure-function relationships of tospoviral proteins and could be useful in developing novel virus control strategies targeting the conserved residues. 18 11

    ConEVA: A toolbox for comprehensive assessment of protein contacts

    Get PDF
    Background: In recent years, successful contact prediction methods and contact-guided ab initio protein structure prediction methods have highlighted the importance of incorporating contact information into protein structure prediction methods. It is also observed that for almost all globular proteins, the quality of contact prediction dictates the accuracy of structure prediction. Hence, like many existing evaluation measures for evaluating 3D protein models, various measures are currently used to evaluate predicted contacts, with the most popular ones being precision, coverage and distance distribution score (X ). Results: We have built a web application and a downloadable tool, ConEVA, for comprehensive assessment and detailed comparison of predicted contacts. Besides implementing existing measures for contact evaluation we have implemented new and useful methods of contact visualization using chord diagrams and comparison using Jaccard similarity computations. For a set (or sets) of predicted contacts, the web application runs even when a native structure is not available, visualizing the contact coverage and similarity between predicted contacts. We applied the tool on various contact prediction data sets and present our findings and insights we obtained from the evaluation of effective contact assessments. ConEVA is publicly available at http://cactus.rnet.missouri.edu/coneva/. Conclusion: ConEVA is useful for a range of contact related analysis and evaluations including predicted contact comparison, investigation of individual protein folding using predicted contacts, and analysis of contacts in a structure of interest.

    Improved computational methods of protein sequence alignment, model selection and tertiary structure prediction

    Get PDF
    Protein sequence and profile alignment has been used essentially in most bioinformatics tasks such as protein structure modeling, function prediction, and phylogenetic analysis. We designed a new algorithm MSACompro to incorporate predicted secondary structure, relative solvent accessibility, and residue-residue contact information into multiple protein sequence alignment. Our experiments showed that it improved multiple sequence alignment accuracy over most existing methods without using the structural information and performed comparably to the method using structural features and additional homologous sequences by slightly lower scores. We also developed HHpacom, a new profile-profile pairwise alignment by integrating secondary structure, solvent accessibility, torsion angle and inferred residue pair coupling information. The evaluation showed that the secondary structure, relative solvent accessibility and torsion angle information significantly improved the alignment accuracy in comparison with the state of the art methods HHsearch and HHsuite. The evolutionary constraint information did help in some cases, especially the alignments of the proteins which are of short lengths, typically 100 to 500 residues. Protein Model selection is also a key step in protein tertiary structure prediction. We developed two SVM model quality assessment methods taking query-template alignment as input. The assessment results illustrated that this could help improve the model selection, protein structure prediction and many other bioinformatics problems. Moreover, we also developed a protein tertiary structure prediction pipeline, of which many components were built in our group's MULTICOM system. The MULTICOM performed well in the CASP10 (Critical Assessment of Techniques for Protein Structure Prediction) competition

    SoyDB: a knowledge database of soybean transcription factors

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Transcription factors play the crucial rule of regulating gene expression and influence almost all biological processes. Systematically identifying and annotating transcription factors can greatly aid further understanding their functions and mechanisms. In this article, we present SoyDB, a user friendly database containing comprehensive knowledge of soybean transcription factors.</p> <p>Description</p> <p>The soybean genome was recently sequenced by the Department of Energy-Joint Genome Institute (DOE-JGI) and is publicly available. Mining of this sequence identified 5,671 soybean genes as putative transcription factors. These genes were comprehensively annotated as an aid to the soybean research community. We developed SoyDB - a knowledge database for all the transcription factors in the soybean genome. The database contains protein sequences, predicted tertiary structures, putative DNA binding sites, domains, homologous templates in the Protein Data Bank (PDB), protein family classifications, multiple sequence alignments, consensus protein sequence motifs, web logo of each family, and web links to the soybean transcription factor database PlantTFDB, known EST sequences, and other general protein databases including Swiss-Prot, Gene Ontology, KEGG, EMBL, TAIR, InterPro, SMART, PROSITE, NCBI, and Pfam. The database can be accessed via an interactive and convenient web server, which supports full-text search, PSI-BLAST sequence search, database browsing by protein family, and automatic classification of a new protein sequence into one of 64 annotated transcription factor families by hidden Markov models.</p> <p>Conclusions</p> <p>A comprehensive soybean transcription factor database was constructed and made publicly accessible at <url>http://casp.rnet.missouri.edu/soydb/</url>.</p

    Improving protein structure prediction by deep learning and computational optimization

    Get PDF
    Includes vitaProtein structure prediction is one of the most important scientific problems in the field of bioinformatics and computational biology. The availability of protein three-dimensional (3D) structure is crucial for studying biological and cellular functions of proteins. The importance of four major sub-problems in protein structure prediction have been clearly recognized. Those include, first, protein secondary structure prediction, second, protein fold recognition, third, protein quality assessment, and fourth, multi-domain assembly. In recent years, deep learning techniques have proved to be a highly effective machine learning method, which has brought revolutionary advances in computer vision, speech recognition and bioinformatics. In this dissertation, five contributions are described. First, DNSS2, a method for protein secondary structure prediction using one-dimensional deep convolution network. Second, DeepSF, a method of applying deep convolutional network to classify protein sequence into one of thousands known folds. Third, CNNQA & DeepRank, two deep neural network approaches to systematically evaluate the quality of predicted protein structures and select the most accurate model as the final protein structure prediction. Fourth, MULTICOM, a protein structure prediction system empowered by deep learning and protein contact prediction. Finally, SAXSDOM, a data-assisted method for protein domain assembly using small-angle X-ray scattering data. All the methods are available as software tools or web servers which are freely available to the scientific community.Includes bibliographical reference

    Computational methods for protein structure prediction and next-generation sequencing data analysis

    Get PDF
    With the wide application of next-generation sequencing technologies, the number of protein sequences is increasing exponentially. However, only a tiny portion of proteins have experimentally verified structures. The huge protein sequence-structure gap could be reduced by computational methods including template-based modeling and template-free modeling. Chapter 2 describes a stochastic point cloud sampling method for multi-template protein model generation. The stochastic sampling and simulated annealing protocol in the method has the capability to improve the global quality and reduce atom clashes in protein models. Two popular approaches for improving protein structure prediction include enlarging the sampling space of template-based modeling and integrating template-based modeling with template-free modeling when no good templates or only partial templates can be found for a target protein. Chapters 3 and 4 introduce a large-scale conformation sampling and evaluation system for protein structure prediction which integrates the two methods. Next-generation sequencing of RNAs (RNA-Seq) generates hundreds of millions of short reads. Analyzing these reads is increasingly being used to foster novel discovery in biomedical research. Chapter 5 describes a bioinformatics pipeline for RNA-Seq data analysis, which converts gigabytes of raw RNA-Seq data into kilobytes of valuable biological knowledge through a five-step data mining and knowledge discovery process

    Mass & secondary structure propensity of amino acids explain their mutability and evolutionary replacements

    Get PDF
    Why is an amino acid replacement in a protein accepted during evolution? The answer given by bioinformatics relies on the frequency of change of each amino acid by another one and the propensity of each to remain unchanged. We propose that these replacement rules are recoverable from the secondary structural trends of amino acids. A distance measure between high-resolution Ramachandran distributions reveals that structurally similar residues coincide with those found in substitution matrices such as BLOSUM: Asn Asp, Phe Tyr, Lys Arg, Gln Glu, Ile Val, Met → Leu; with Ala, Cys, His, Gly, Ser, Pro, and Thr, as structurally idiosyncratic residues. We also found a high average correlation (\overline{R} R = 0.85) between thirty amino acid mutability scales and the mutational inertia (I X ), which measures the energetic cost weighted by the number of observations at the most probable amino acid conformation. These results indicate that amino acid substitutions follow two optimally-efficient principles: (a) amino acids interchangeability privileges their secondary structural similarity, and (b) the amino acid mutability depends directly on its biosynthetic energy cost, and inversely with its frequency. These two principles are the underlying rules governing the observed amino acid substitutions. © 2017 The Author(s)
    corecore