12 research outputs found

    Topology independent protein structural alignment

    Abstract. Protein structural alignment is an indispensable tool used for many different studies in bioinformatics. Most structural alignment algorithms assume that the structural units of two similar proteins will align sequentially. This assumption may not be true for all similar proteins and as a result, proteins with similar structure but with permuted sequence arrangement are often missed. We present a solution to the problem based on an approximation algorithm that finds a sequence-order independent structural alignment that is close to optimal. We first exhaustively fragment two proteins and calculate a novel similarity score between all possible aligned fragment pairs. We treat each aligned fragment pair as a vertex on a graph. Vertices are connected by an edge if there are intra residue sequence conflicts. We regard the realignment of the fragment pairs as a special case of the maximum-weight independent set problem and solve this computationally intensive problem approximately by iteratively solving relaxations of an appropriate integer programming formulation. The resulting structural alignment is sequence-order independent. Our method is insensitive to gaps, insertions/deletions, and circular permutations.
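
The graph formulation in this abstract can be illustrated with a small sketch. The abstract solves the maximum-weight independent set problem by iterated relaxations of an integer program; the code below swaps in a much simpler greedy heuristic, so it shows only the conflict-graph idea, not the authors' method. The `FragmentPair` type, the conflict rule (two pairs clash when they reuse a residue of either protein), and all scores are assumptions made for illustration.

```python
from typing import List, NamedTuple


class FragmentPair(NamedTuple):
    """Hypothetical aligned fragment pair: equal-length windows in proteins A and B."""
    start_a: int
    start_b: int
    length: int
    score: float


def overlaps(s1: int, s2: int, length1: int, length2: int) -> bool:
    """True if half-open intervals [s1, s1+length1) and [s2, s2+length2) share a residue."""
    return s1 < s2 + length2 and s2 < s1 + length1


def conflict(p: FragmentPair, q: FragmentPair) -> bool:
    """Assumed conflict rule: two pairs clash if they reuse a residue of A or of B."""
    return (overlaps(p.start_a, q.start_a, p.length, q.length)
            or overlaps(p.start_b, q.start_b, p.length, q.length))


def greedy_independent_set(pairs: List[FragmentPair]) -> List[FragmentPair]:
    """Greedy heuristic (not the LP relaxation): take pairs by descending score, skip conflicts."""
    chosen: List[FragmentPair] = []
    for pair in sorted(pairs, key=lambda p: p.score, reverse=True):
        if not any(conflict(pair, kept) for kept in chosen):
            chosen.append(pair)
    return chosen


# Made-up example data.
pairs = [
    FragmentPair(start_a=0, start_b=20, length=8, score=5.0),   # A[0:8] with B[20:28]
    FragmentPair(start_a=10, start_b=0, length=8, score=4.0),   # A[10:18] with B[0:8], out of sequence order
    FragmentPair(start_a=4, start_b=40, length=8, score=3.0),   # reuses A[4:8], so it conflicts with the first pair
]
selected = greedy_independent_set(pairs)
print(selected)
```

Because the conflict rule never compares the order of fragments along the two chains, the first two pairs can be kept together even though they align in permuted order, which is the property the abstract calls sequence-order independence.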

    Optimization of minimum set of protein–DNA interactions: a quasi exact solution with minimum over-fitting

    Motivation: A major limitation in modeling protein interactions is the difficulty of assessing the over-fitting of the training set. Recently, an experimentally based approach that integrates crystallographic information of C2H2 zinc finger–DNA complexes with binding data from 11 mutants, 7 from EGR finger I, was used to define an improved interaction code (no optimization). Here, we present a novel mixed integer programming (MIP)-based method that transforms this type of data into an optimized code, demonstrating both the advantages of the mathematical formulation to minimize over- and under-fitting and the robustness of the underlying physical parameters mapped by the code

    A Mathematical Framework for Protein Structure Comparison

    Comparison of protein structures is important for revealing the evolutionary relationship among proteins, predicting protein functions and predicting protein structures. Many methods have been developed in the past to align two or multiple protein structures. Despite the importance of this problem, rigorous mathematical or statistical frameworks have seldom been pursued for general protein structure comparison. One notable issue in this field is that with many different distances used to measure the similarity between protein structures, none of them are proper distances when protein structures of different sequences are compared. Statistical approaches based on those non-proper distances or similarity scores as random variables are thus not mathematically rigorous. In this work, we develop a mathematical framework for protein structure comparison by treating protein structures as three-dimensional curves. Using an elastic Riemannian metric on spaces of curves, geodesic distance, a proper distance on spaces of curves, can be computed for any two protein structures. In this framework, protein structures can be treated as random variables on the shape manifold, and means and covariance can be computed for populations of protein structures. Furthermore, these moments can be used to build Gaussian-type probability distributions of protein structures for use in hypothesis testing. The covariance of a population of protein structures can reveal the population-specific variations and be helpful in improving structure classification. With curves representing protein structures, the matching is performed using elastic shape analysis of curves, which can effectively model conformational changes and insertions/deletions. We show that our method performs comparably with commonly used methods in protein structure classification on a large manually annotated data set

    Network analysis of circular permutations in multidomain proteins reveals functional linkages for uncharacterized proteins.

    Various studies have implicated different multidomain proteins in cancer. However, there has been little or no detailed study on the role of circular multidomain proteins in the general problem of cancer or on specific cancer types. This work represents an initial attempt at investigating the potential for predicting linkages between known cancer-associated proteins with uncharacterized or hypothetical multidomain proteins, based primarily on circular permutation (CP) relationships. First, we propose an efficient algorithm for rapid identification of both exact and approximate CPs in multidomain proteins. Using the circular relations identified, we construct networks between multidomain proteins, based on which we perform functional annotation of multidomain proteins. We then extend the method to construct subnetworks for selected cancer subtypes, and perform prediction of potential linkages between uncharacterized multidomain proteins and the selected cancer types. We include practical results showing the performance of the proposed methods.
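
To make the notion of an exact circular permutation concrete, here is a minimal sketch that treats each protein as an ordered list of domain names and tests whether one list is a rotation of the other. The abstract does not describe the authors' algorithm, so this is the textbook "doubled sequence" check rather than their method, and it does not cover approximate CPs. The function name and the domain architectures are hypothetical.

```python
from typing import Sequence


def is_circular_permutation(a: Sequence[str], b: Sequence[str]) -> bool:
    """True if b is a non-trivial rotation of a (same domains, different starting point)."""
    if len(a) != len(b) or len(a) < 2:
        return False
    a_list, b_list = list(a), list(b)
    if a_list == b_list:
        return False  # identical domain order is not a permutation event
    doubled = a_list + a_list  # every rotation of a appears as a window of a + a
    n = len(a_list)
    return any(doubled[i:i + n] == b_list for i in range(1, n))


# Hypothetical domain architectures, for illustration only.
protein_x = ["SH3", "SH2", "Kinase"]
protein_y = ["SH2", "Kinase", "SH3"]   # rotation of protein_x
protein_z = ["SH2", "SH3", "Kinase"]   # same domains, but a swap rather than a rotation

print(is_circular_permutation(protein_x, protein_y))
print(is_circular_permutation(protein_x, protein_z))
```

Pairs for which the check succeeds could then serve as edges of a protein network of the kind the abstract describes.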

    MICAN : a protein structure alignment algorithm that can handle Multiple-chains, Inverse alignments, Cα only models, Alternative alignments, and Non-sequential alignments

    BACKGROUND: Protein pairs that have the same secondary structure packing arrangement but have different topologies have attracted much attention in terms of both evolution and physical chemistry of protein structures. Further investigation of such protein relationships would give us a hint as to how proteins can change their fold in the course of evolution, as well as an insight into physico-chemical properties of secondary structure packing. For this purpose, highly accurate sequence-order independent structure comparison methods are needed. RESULTS: We have developed a novel protein structure alignment algorithm, MICAN (a structure alignment algorithm that can handle Multiple-chain complexes, Inverse direction of secondary structures, Cα only models, Alternative alignments, and Non-sequential alignments). The algorithm was designed so as to identify the best structural alignment between protein pairs by disregarding the connectivity between secondary structure elements (SSEs). One of the key features of the algorithm is utilizing the multiple vector representation for each SSE, which enables us to correctly treat the bent or twisted nature of long SSEs. We compared MICAN with nine other publicly available structure alignment programs, using both reference-dependent and reference-independent evaluation methods on a variety of benchmark test sets which include both sequential and non-sequential alignments. We show that MICAN outperforms the other existing methods for reproducing reference alignments of non-sequential test sets. Further, although MICAN does not specialize in sequential structure alignment, it showed top-level performance on the sequential test sets. We also show that the MICAN program is the fastest non-sequential structure alignment program among all the programs we examined here. CONCLUSIONS: MICAN is the fastest and the most accurate program among the non-sequential alignment programs we examined here. 
These results suggest that MICAN is a highly effective tool for automatically detecting non-trivial structural relationships of proteins, such as circular permutations and segment-swapping, many of which have been identified manually by human experts so far. The source code of MICAN is freely downloadable at http://www.tbp.cse.nagoya-u.ac.jp/MICAN

    Detection and Alignment of 3D Domain Swapping Proteins Using Angle-Distance Image-Based Secondary Structural Matching Techniques

    This work presents a novel detection method for three-dimensional domain swapping (DS), a mechanism for forming protein quaternary structures that can be visualized as if monomers had “opened” their “closed” structures and exchanged the opened portion to form intertwined oligomers. Since the first report of DS in the mid 1990s, an increasing number of identified cases has led to the postulation that DS might occur in a protein with an unconstrained terminus under appropriate conditions. DS may play important roles in the molecular evolution and functional regulation of proteins and the formation of depositions in Alzheimer's and prion diseases. Moreover, it is promising for designing auto-assembling biomaterials. Despite the increasing interest in DS, related bioinformatics methods are rarely available. Owing to a dramatic conformational difference between the monomeric/closed and oligomeric/open forms, conventional structural comparison methods are inadequate for detecting DS. Hence, there is also a lack of comprehensive datasets for studying DS. Based on angle-distance (A-D) image transformations of secondary structural elements (SSEs), specific patterns within A-D images can be recognized and classified for structural similarities. In this work, a matching algorithm to extract corresponding SSE pairs from A-D images and a novel DS score have been designed and demonstrated to be applicable to the detection of DS relationships. The Matthews correlation coefficient (MCC) and sensitivity of the proposed DS-detecting method were higher than 0.81 even when the sequence identities of the proteins examined were lower than 10%. On average, the alignment percentage and root-mean-square distance (RMSD) computed by the proposed method were 90% and 1.8Å for a set of 1,211 DS-related pairs of proteins. The performances of structural alignments remain high and stable for DS-related homologs with less than 10% sequence identities. 
In addition, the quality of its hinge loop determination is comparable to that of manual inspection. This method has been implemented as a web-based tool, which requires two protein structures as the input and then the type and/or existence of DS relationships between the input structures are determined according to the A-D image-based structural alignments and the DS score. The proposed method is expected to trigger large-scale studies of this interesting structural phenomenon and facilitate related applications
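
The abstract above reports a Matthews correlation coefficient (MCC) and a sensitivity for the detection method. As a reminder of what those two numbers measure, the sketch below computes both from the counts of a binary confusion matrix using their standard definitions. The counts in the example are invented for illustration and are not taken from the paper.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)); 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true positives (here: true DS pairs) that the method recovers."""
    total = tp + fn
    return tp / total if total else 0.0


# Invented counts: 100 true DS pairs and 100 unrelated pairs.
tp, fn = 90, 10
tn, fp = 85, 15
print(round(matthews_corrcoef(tp, tn, fp, fn), 3), sensitivity(tp, fn))
```

Unlike sensitivity, MCC uses all four cells of the confusion matrix, so it stays informative when positive and negative pairs are unevenly represented.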

    Sequence and structural analysis of antibodies

    The work presented in this thesis focusses on the sequence and structural analysis of antibodies and has fallen into three main areas. First I developed a method to assess how typical an antibody sequence is of the expressed human antibody repertoire. My hypothesis was that the more "humanlike" an antibody sequence is (in other words how typical it is of the expressed human repertoire), the less likely it is to elicit an immune response when used in vivo in humans. In practice, I found that, while the most and least-human sequences generated the lowest and highest anti-antibody responses in the small available dataset, there was little correlation in between these extremes. Second, I examined the distribution of the packing angles between VH and VL domains of antibodies and whether residues in the interface influence the packing angle. This is an important factor which has essentially been ignored in modelling antibody structures since the packing angle can have a significant effect on the topography of the combining site. Finding out which interface residues have the greatest influence is also important in protocols for 'humanizing' mouse antibodies to make them more suitable for use in therapy in humans. Third, I developed a method to apply standard Kabat or Chothia numbering schemes to an antibody sequence automatically. In brief, the method uses profiles to identify the ends of the framework regions and then fills in the numbers for each section. Benchmarking the performance of this algorithm against annotations in the Kabat database highlighted several errors in the manual annotations in the Kabat database. Based on structural analysis of insertions and deletions in the framework regions of antibodies, I have extended the Chothia numbering scheme to identify the structurally correct positions of insertions and deletions in the framework regions

    Development of computational approaches for structural classification, analysis and prediction of molecular recognition regions in proteins

    The vast and growing volume of 3D protein structural data stored in the PDB contains abundant information about macromolecular complexes, and hence, data about protein interfaces. Non-covalent contacts between amino acids are the basis of protein interactions, and they are responsible for binding affinity and specificity in biological processes. In addition, water networks in protein interfaces can also complement direct interactions contributing significantly to molecular recognition, although their exact role is still not well understood. It is estimated that protein complexes in the PDB are substantially underrepresented due to their crystallization difficulties. Methods for automatic classification and description of the protein complexes are essential to study protein interfaces, and to propose putative binding regions. Due to this strong need, several protein-protein interaction databases have been developed. However, most of them do not take into account either protein-peptide complexes, solvent information or a proper classification of the binding regions, which are fundamental components to provide an accurate description of protein interfaces. In the first stage of my thesis, I developed the SCOWLP platform, a database and web application that structurally classifies protein binding regions at family level and accurately defines protein interfaces at atomic detail. The analysis of the results showed that protein-peptide complexes are substantially represented in the PDB, and are the only source of interacting information for several families. By clustering the family binding regions, I could identify 9,334 binding regions and 79,803 protein interfaces in the PDB. Interestingly, I observed that 65% of protein families interact with other molecules through more than one region and in 22% of the cases the same region recognizes different protein families. 
The database and web application are open to the research community (www.scowlp.org) and can tremendously facilitate high-throughput comparative analysis of protein binding regions, as well as individual analysis of protein interfaces. SCOWLP and the other databases collect and classify the protein binding regions at family level, where sequence and structure homology exist. Interestingly, it has been observed that many protein families also present structural resemblances within each other, mostly across folds. Likewise, structurally similar interacting motifs (binding regions) have been identified among proteins with different folds and functions. For these reasons, I decided to explore the possibility to infer protein binding regions independently of their fold classification. Thus, I performed the first systematic analysis of binding region conservation within all protein families that are structurally similar, calculated using non-sequential structural alignment methods. My results indicate there is substantial molecular recognition information that could be potentially inferred among proteins beyond family level. I obtained a 6 to 8 fold enrichment of binding regions, and identified putative binding regions for 728 protein families that lack binding information. Within the results, I found protein complexes from different folds that present similar interfaces, confirming the predictive usage of the methodology. The data obtained with my approach may complement the SCOWLP family binding regions suggesting alternative binding regions, and can be used to assist protein-protein docking experiments and facilitate rational ligand design. In the last part of my thesis, I used the interacting information contained in the SCOWLP database to help understand the role that water plays in protein interactions in terms of affinity and specificity. 
I carried out one of the first high-throughput analyses of solvent in protein interfaces for a curated dataset of transient and obligate protein complexes. Surprisingly, the results highlight the abundance of water-bridged residues in protein interfaces (40.1% of the interfacial residues) that reinforces the importance of including solvent in protein interaction studies (14.5% extra residues interacting only water-mediated). Interestingly, I also observed that obligate and transient interfaces present a comparable amount of solvent, which contrasts with the earlier view that obligate protein complexes are expected to exhibit similarities to protein cores, with dry and hydrophobic interfaces. I characterized novel features of water-bridged residues in terms of secondary structure, temperature factors, residue composition, and pairing preferences that differed from direct residue-residue interactions. The results also showed relevant aspects in the mobility and energetics of water-bridged interfacial residues. Collectively, my doctoral thesis work can be summarized in the following points: 1. I developed SCOWLP, an improved framework that identifies protein interfaces and classifies protein binding regions at family level. 2. I developed a novel methodology to predict alternative binding regions among structurally similar protein families independently of the fold they belong to. 3. I performed a high-throughput analysis of water-bridged interactions contained in SCOWLP to study the role of solvent in protein interfaces. These three components of my thesis represent novel methods for exploiting existing structural information to gain insights into protein-protein interactions, key mechanisms to understand biological processes

    Finding Similar Protein Structures Efficiently and Effectively

    To assess the similarities and the differences among protein structures, a variety of structure alignment algorithms and programs have been designed and implemented. We introduce a low-resolution approach and a high-resolution approach to evaluate the similarities among protein structures. Our results show that both the low-resolution approach and the high-resolution approach outperform state-of-the-art methods. For the low-resolution approach, we eliminate false positives through the comparison of both local similarity and remote similarity with little compromise in speed. Two kinds of contact libraries (ContactLib) are introduced to fingerprint protein structures effectively and efficiently. Each contact group from the contact library consists of one local or two remote fragments and is represented by a concise vector. These vectors are then indexed and used to calculate a new combined hit-rate score to identify similar protein structures effectively and efficiently. We tested our ContactLibs on the high-quality protein structure subset of SCOP30, which contains 3,297 protein structures. For each protein structure of the subset, we retrieved its neighbor protein structures from the rest of the subset. The best area under the ROC curve, achieved by a ContactLib, is as high as 0.960. This is a significant improvement over 0.747, the best result achieved by the state-of-the-art method, FragBag. For the high-resolution approach, our PROtein STructure Alignment method (PROSTA) relies on and verifies the fact that the optimal protein structure alignment always contains a small subset of aligned residue pairs, called a seed, such that the rotation and translation (ROTRAN), which minimizes the RMSD of the seed, yields both the optimal ROTRAN and the optimal alignment score. Thus, ROTRANs minimizing the RMSDs of small subsets of residues are sampled, and global alignments are calculated directly from the sampled ROTRANs. 
Moreover, our method incorporates remote information and filters similar ROTRANs (or alignments) by clustering, rather than by an exhaustive method, to overcome the computational inefficiency. Our high-resolution protein structure alignment method, when applied to optimizing the TM-score and the GDT-TS score, produces a significantly better result than state-of-the-art protein structure alignment methods. Specifically, if the highest TM-score found by TM-align is lower than 0.6 and the highest TM-score found by one of the tested methods is higher than 0.5, our alignment method tends to discover better protein structure alignments with (up to 0.21) higher TM-scores. In such cases, TM-align fails to find TM-scores higher than 0.5 with a probability of 42%; however, our alignment method fails the same task with a probability of only 2%. In addition, existing protein structure alignment scoring functions focus on atom coordinate similarity alone and simply ignore other important similarities, such as sequence similarity. Our scoring function has the capacity for incorporating multiple similarities into the scoring function. Our result shows that sequence similarity aids in finding high quality protein structure alignments that are more consistent with HOMSTRAD alignments, which are protein structure alignments examined by human experts. When atom coordinate similarity itself fails to find alignments with any consistency to HOMSTRAD alignments, our scoring function remains capable of finding alignments highly similar to, or even identical to, HOMSTRAD alignments
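
Since the abstract above compares methods by TM-score, a short sketch of how that score is evaluated for one fixed superposition may help. It uses the standard definition, TM = (1/L) Σ 1/(1 + (d_i/d0)²) with d0 = 1.24·(L − 15)^(1/3) − 1.8, where L is the length of the target protein and d_i is the distance between the i-th pair of aligned residues after superposition. The function name, the lower bound of 0.5 Å on d0 for short chains, and the example distances are assumptions made here; the search over rotations and translations that PROSTA or TM-align performs is not shown.

```python
from typing import Iterable


def tm_score(distances: Iterable[float], target_length: int) -> float:
    """TM-score of one fixed superposition, normalised by the target length.

    distances: distances (in angstroms) between aligned residue pairs.
    target_length: number of residues in the target (reference) protein.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    if target_length > 15:
        d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    else:
        d0 = 0.0
    d0 = max(d0, 0.5)  # assumed floor so that very short chains stay well-defined
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


# Made-up example: 100-residue target, 80 aligned pairs at 1 A and 20 unaligned residues.
print(tm_score([1.0] * 80, target_length=100))
```

Unaligned residues contribute nothing to the sum but still count in the normalisation, which is why an alignment covering only part of the target cannot reach a score of 1.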