424 research outputs found

    PROMALS3D: a tool for multiple protein sequence and structure alignments

    Although multiple sequence alignments (MSAs) are essential for a wide range of applications from structure modeling to prediction of functional sites, construction of accurate MSAs for distantly related proteins remains a largely unsolved problem. The rapidly increasing database of spatial structures is a valuable source to improve alignment quality. We explore the use of 3D structural information to guide sequence alignments constructed by our MSA program PROMALS. The resulting tool, PROMALS3D, automatically identifies homologs with known 3D structures for the input sequences, derives structural constraints through structure-based alignments and combines them with sequence constraints to construct consistency-based multiple sequence alignments. The output is a consensus alignment that brings together sequence and structural information about input proteins and their homologs. PROMALS3D can also align sequences of multiple input structures, with the output representing a multiple structure-based alignment refined in combination with sequence constraints. The advantage of PROMALS3D is that it gives researchers an easy way to produce high-quality alignments consistent with both sequences and structures of proteins. PROMALS3D outperforms a number of existing methods for constructing multiple sequence or structural alignments using both reference-dependent and reference-independent evaluation methods.

    Modular prediction of protein structural classes from sequences of twilight-zone identity with predicting sequences

    Background: Knowledge of structural class is used by numerous methods for identification of structural/functional characteristics of proteins and could be used for the detection of remote homologues, particularly for chains that share twilight-zone similarity. In contrast to existing sequence-based structural class predictors, which target four major classes and which are designed for high identity sequences, we predict seven classes from sequences that share twilight-zone identity with the training sequences. Results: The proposed MODular Approach to Structural class prediction (MODAS) method is unique as it allows for selection of any subset of the classes. MODAS is also the first to utilize a novel, custom-built feature-based sequence representation that combines evolutionary profiles and predicted secondary structure. The features quantify information relevant to the definition of the classes including conservation of residues and arrangement and number of helix/strand segments. Our comprehensive design considers 8 feature selection methods and 4 classifiers to develop Support Vector Machine-based classifiers that are tailored for each of the seven classes. Tests on 5 twilight-zone and 1 high-similarity benchmark datasets and comparison with over two dozen modern competing predictors show that MODAS provides the best overall accuracy, which ranges between 80% and 96.7% (83.5% for the twilight-zone datasets), depending on the dataset. This translates into 19% and 8% error rate reduction when compared against the best performing competing method on the two largest datasets. The proposed predictor achieves 58% accuracy for the membrane protein class, which is not considered by the majority of existing methods, even though this class accounts for only 2% of the data. Our predictive model is analyzed to demonstrate how and why the input features are associated with the corresponding classes. Conclusions: The improved predictions stem from the novel features that express collocation of the secondary structure segments in the protein sequence and that combine evolutionary and secondary structure information. Our work demonstrates that conservation and arrangement of the secondary structure segments predicted along the protein chain can successfully predict structural classes which are defined based on the spatial arrangement of the secondary structures. A web server is available at http://biomine.ece.ualberta.ca/MODAS/.

    Contrastive learning on protein embeddings enlightens midnight zone

    Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), facilitating the transfer of information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce using single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by hierarchical classifications of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, is better able to recognize distant homologous relationships than more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the 'midnight zone' of protein similarity, i.e. the region in which distantly related sequences have a seemingly random pairwise sequence similarity. The novelty of this work is in the particular combination of tools and sampling techniques that achieved performance comparable to or better than existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments, it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT
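
    The two ideas named in this abstract, embedding-based annotation transfer and contrastive learning, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the ProtTucker code: the function names, the use of Euclidean distance, the triplet form of the contrastive loss, and the toy two-dimensional embeddings (real pLM embeddings have many more dimensions).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_annotation(query, lookup):
    """Return (label, distance) of the lookup embedding nearest to the query.

    lookup is a list of (embedding, label) pairs; the label of the closest
    entry is transferred to the query (hypothetical EAT-style lookup).
    """
    best_label, best_dist = None, math.inf
    for embedding, label in lookup:
        dist = euclidean(query, embedding)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist


def triplet_loss(anchor, positive, negative, margin=1.0):
    """One common contrastive objective (an assumed choice here).

    The loss is zero once the anchor is closer to the positive than to the
    negative by at least the margin.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy lookup set: two labelled embeddings (labels are placeholder strings).
lookup = [([0.0, 0.0], "class_A"), ([1.0, 1.0], "class_B")]
label, dist = transfer_annotation([0.9, 1.2], lookup)
```

    In this toy case the query lies nearest to the second entry, so its label is the one transferred. No alignment is computed at any point, which is the property the abstract credits for the speed advantage.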

    Homology-extended sequence alignment

    We present a profile–profile multiple alignment strategy that uses database searching to collect homologues for each sequence in a given set, in order to enrich their available evolutionary information for the alignment. For each of the alignment sequences, the putative homologous sequences that score above a pre-defined threshold are incorporated into a position-specific pre-alignment profile. The enriched position-specific profile is used for standard progressive alignment, thereby more accurately describing the characteristic features of the given sequence set. We show that owing to the incorporation of the pre-alignment information into a standard progressive multiple alignment routine, the alignment quality between distant sequences increases significantly and outperforms state-of-the-art methods, such as T-COFFEE and MUSCLE. We also show that although entirely sequence-based, our novel strategy is better at aligning distant sequences when compared with a recent contact-based alignment method. Therefore, our pre-alignment profile strategy should be advantageous for applications that rely on high alignment accuracy such as local structure prediction, comparative modelling and threading.
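
    The pre-alignment profile step described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `build_profile`, the input format (hits already aligned to the query, with '-' for gaps), and the use of plain residue frequencies without weighting or pseudocounts are all hypothetical.

```python
from collections import Counter


def build_profile(query, hits, threshold):
    """Build a position-specific frequency profile for one query sequence.

    query     -- ungapped query sequence
    hits      -- list of (aligned_sequence, score); each aligned sequence has
                 the same length as the query, with '-' marking gaps
    threshold -- only hits scoring above this value are incorporated
    """
    rows = [query] + [seq for seq, score in hits if score > threshold]
    profile = []
    for i in range(len(query)):
        # Count residues in column i, ignoring gaps; the query itself
        # guarantees at least one residue per column.
        counts = Counter(row[i] for row in rows if row[i] != "-")
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()})
    return profile


# Toy example: the low-scoring hit "G-D" falls below the threshold.
profile = build_profile("ACD", [("ACE", 50), ("G-D", 10), ("AC-", 40)], threshold=30)
```

    Each element of `profile` is a per-position residue distribution; in the strategy the abstract describes, such enriched profiles rather than the raw sequences are what the progressive alignment step compares.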

    Sequence homology based protein-protein interacting residue predictions and the applications in ranking docked conformations

    Protein-protein interactions play a central role in the formation of protein complexes and the biological pathways that orchestrate virtually all cellular processes. Three dimensional structures of a complex formed by a protein with one or more of its interaction partners provide useful information regarding the specific amino acid residues that make up the interface between proteins. The emergence of high throughput techniques such as Yeast 2 Hybrid (Y2H) assays has made it possible to identify putative interactions between thousands of proteins (but not the interfaces that form the structural basis of interactions or the structures of protein complexes that result from such interactions). Reliable identification of the specific amino acid residues that form the interface of a protein with one or more other proteins is critical for understanding the structural and physico-chemical basis of protein interactions and their role in key cellular processes, for predicting protein complexes, for validating protein interactions predicted by high throughput methods, for ranking conformations of protein complexes generated by docking, and for identifying and prioritizing drug targets in computational drug design. However, given the high cost of experimental determination of the structures of protein complexes, there is an urgent need for reliable and fast computational methods for identifying interface residues and/or predicting the structure of a complex formed by a protein of interest with its interaction partners. Given the large and growing gap between the number of known protein sequences and the number of experimentally determined structures, sequence-based methods for predicting protein-protein interfaces are of particular interest. Against this background, we develop HomPPI (http://homppi.cs.iastate.edu/), a class of sequence homology based approaches to protein interface prediction. 
We present two variants of HomPPI: (i) NPS-HomPPI (non-partner-specific HomPPI), which can be used to predict interface residues of a query protein in the absence of knowledge of the interaction partner. NPS-HomPPI is based on the results of a systematic analysis of the conditions under which interface residues of a query protein are conserved among its sequence homologs (and hence can be inferred from the known interface residues in proteins that are sequence homologs of the query protein). Our experiments suggest that when sequence homologs of the query protein can be reliably identified, NPS-HomPPI is competitive with several state-of-the-art interface prediction servers including those that exploit the structure of the query proteins. (ii) PS-HomPPI (partner-specific HomPPI), which can be used to predict the interface residues of a query protein with a specific target protein. PS-HomPPI is based on a systematic analysis of the conditions under which the interface residues that make up the interface between a query protein and its interaction partner are preserved among their homo-interologs, i.e., complexes formed by their respective sequence homologs. To the best of our knowledge, with the exception of protein-protein docking (which is computationally much more expensive than PS-HomPPI), PS-HomPPI is one of the first partner-specific protein-protein interface predictors. Our experiments with PS-HomPPI show that when homo-interologs of a query protein and its putative interaction partner can be reliably identified, the interface predictions generated by PS-HomPPI are significantly more reliable than those generated by NPS-HomPPI. Protein-Protein Docking offers a powerful approach to computational determination of the 3-dimensional conformation of protein complexes and protein-protein interfaces. 
However, the reliability of conformations produced by docking is limited by the efficacy of the scoring functions used to select a few near-native conformations from among tens of thousands of possible conformations generated by docking programs. Against this background, we introduce DockRank, a novel approach to rank docked conformations based on the degree to which the interface residues inferred from the docked conformation match the interface residues predicted by a partner-specific sequence homology based interface predictor, PS-HomPPI. We compare, on a data set of 69 docked cases with 54,000 decoys per case, the ranking of conformations produced using DockRank's interface similarity scoring function applied to predicted interface residues obtained from four protein interface predictors (PS-HomPPI and three NPS interface predictors: NPS-HomPPI, PRISE, and meta-PPISP) with the rankings produced by two state-of-the-art energy-based scoring functions, ZRank and IRAD. Our results show that DockRank significantly outperforms these ranking methods. Our finding that NPS interface predictors (homology based and machine learning-based methods) failed to select near-native conformations superior to those selected by DockRank (partner-specific interface prediction based) highlights the importance of the knowledge of the binding partners in using predicted interfaces to rank docked models. The application of DockRank, as a third-party scoring function without access to all the original docked models, for improving ClusPro results on two benchmark data sets of 32 and 56 test cases shows the viability of combining our scoring function with existing docking software. An online implementation of DockRank is available at http://einstein.cs.iastate.edu/DockRank/
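
    The ranking idea behind DockRank, scoring each docked conformation by how well its interface residues agree with a predicted interface, can be sketched as below. The abstract does not state which similarity measure is used, so the Jaccard index here is a stand-in chosen for illustration; the function names and the residue-number sets are likewise hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two residue sets (an assumed stand-in measure)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def rank_conformations(predicted_interface, docked_interfaces):
    """Order docked conformations by agreement with the predicted interface.

    predicted_interface -- set of residue numbers predicted to be at the interface
    docked_interfaces   -- dict mapping a conformation name to the set of
                           interface residues inferred from that conformation
    Returns conformation names, best match first (ties broken by name).
    """
    scored = [
        (jaccard(predicted_interface, residues), name)
        for name, residues in docked_interfaces.items()
    ]
    return [name for score, name in sorted(scored, key=lambda t: (-t[0], t[1]))]


# Toy example with three decoys and made-up residue numbers.
predicted = {10, 11, 12, 13}
decoys = {
    "decoy_1": {10, 11, 12, 14},
    "decoy_2": {1, 2, 3},
    "decoy_3": {10, 11, 12, 13},
}
ranking = rank_conformations(predicted, decoys)
```

    A scoring function of this shape needs only the interface residues of each model rather than an energy calculation, which is consistent with the abstract's description of DockRank being applied as a third-party scoring function on top of existing docking output.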

    iWRAP: An Interface Threading Approach with Application to Prediction of Cancer-Related Protein–Protein Interactions

    Current homology modeling methods for predicting protein–protein interactions (PPIs) have difficulty in the “twilight zone” (< 40%) of sequence identities. Threading methods extend coverage further into the twilight zone by aligning primary sequences for a pair of proteins to a best-fit template complex to predict an entire three-dimensional structure. We introduce a threading approach, iWRAP, which focuses only on the protein interface. Our approach combines a novel linear programming formulation for interface alignment with a boosting classifier for interaction prediction. We demonstrate its efficacy on SCOPPI, a classification of PPIs in the Protein Databank, and on the entire yeast genome. iWRAP provides significantly improved prediction of PPIs and their interfaces in stringent cross-validation on SCOPPI. Furthermore, by combining our predictions with a full-complex threader, we achieve a coverage of 13% for the yeast PPIs, which is close to a 50% increase over previous methods at a higher sensitivity. As an application, we effectively combine iWRAP with genomic data to identify novel cancer-related genes involved in chromatin remodeling, nucleosome organization, and ribonuclear complex assembly. iWRAP is available at http://iwrap.csail.mit.edu. National Institutes of Health (U.S.) (Grant 1R01GM081871

    SeqStruct : A New Amino Acid Similarity Matrix Based on Sequence Correlations and Structural Contacts Yields Sequence-Structure Congruence

    Protein sequence matching does not properly account for some well-known features of protein structures, such as surface residues being more variable than core residues and the high packing densities in globular proteins, and it does not yield good matches for the sequences of many proteins known to be close structural relatives. There are now abundant protein sequences and structures to enable major improvements to sequence matching. Here, we utilize structural frameworks to mount the observed correlated sequences to identify the most important correlated parts. The rationale is that protein structures provide the important physical framework for improving sequence matching. Combining the sequence and structure data in this way leads to a simple amino acid substitution matrix that can be readily incorporated into any sequence matching procedure. This enables the incorporation of allosteric information into sequence matching and transforms it effectively from a 1-D to a 3-D procedure. The results from testing on over 3,000 sequence matches demonstrate a 37% gain in sequence similarity and a loss of 26% of the gaps when compared with the use of BLOSUM62. Importantly, there are major gains in the specificity of sequence matching across diverse proteins. Specifically, all known cases where protein structures match but sequences do not match well are resolved.