533 research outputs found

    Application of coevolution-based methods and deep learning for structure prediction of protein complexes

    The three-dimensional structures of proteins play a critical role in determining their biological functions and interactions. Experimental determination of protein and protein complex structures can be expensive and difficult. Computational prediction of protein and protein complex structures has therefore been an open challenge for decades. Recent advances in computational structure prediction techniques have resulted in increasingly accurate protein structure predictions. These techniques include methods that leverage information about coevolving residues to predict residue interactions and that apply deep learning techniques to enable better prediction of residue contacts and protein structures. Prior to the work outlined in this thesis, coevolution-based methods and deep learning had been shown to improve the prediction of single protein domains or single protein chains. Most proteins in living organisms do not function on their own but interact with other proteins either through transient interactions or by forming stable protein complexes. Knowledge of protein complex structures can be useful for biological and disease research, drug discovery and protein engineering. Unfortunately, a large number of protein complexes do not have experimental structures or close homolog structures that can be used as templates. In this thesis, methods previously developed and applied to the de novo prediction of single protein domains or protein monomer chains were modified and leveraged for the prediction of protein heterodimer and homodimer complexes. A number of coevolution-based tools and deep learning methods are explored for the purpose of predicting inter-chain and intra-chain residue contacts in protein dimers. These contacts are combined with existing protein docking methods to explore the prediction of homodimers and heterodimers. 
Overall, the work in this thesis demonstrates the promise of leveraging coevolution and deep learning for the prediction of protein complexes, shows improvements in protein complex prediction tasks achieved using coevolution-based and deep learning methods, and highlights the remaining challenges in protein complex prediction.
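
The abstract above does not name the coevolution tools it uses. As a rough illustration of the underlying idea only, the sketch below scores pairs of alignment columns by mutual information, one of the simplest coevolution signals. The function names and the toy paired alignment are hypothetical, and practical methods add corrections for phylogenetic bias and indirect couplings that this sketch omits.

```python
import math
from collections import Counter


def column_mutual_information(col_a, col_b):
    """Mutual information (in nats) between two equal-length alignment columns."""
    n = len(col_a)
    if n == 0 or n != len(col_b):
        raise ValueError("columns must be non-empty and of equal length")
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts to avoid extra divisions
        mi += p_ab * math.log(p_ab * n * n / (freq_a[a] * freq_b[b]))
    return mi


def rank_interchain_pairs(msa_a, msa_b):
    """Score every (column of chain A, column of chain B) pair, highest first.

    msa_a and msa_b are lists of aligned sequences; row k of each is assumed
    to come from the same organism (a "paired" alignment).
    """
    cols_a = list(zip(*msa_a))
    cols_b = list(zip(*msa_b))
    scores = [
        (column_mutual_information(ca, cb), i, j)
        for i, ca in enumerate(cols_a)
        for j, cb in enumerate(cols_b)
    ]
    return sorted(scores, reverse=True)


# Toy paired alignment (made-up sequences, for illustration only).
toy_a = ["AK", "AE", "AK", "AE"]
toy_b = ["DL", "RL", "DL", "RL"]
top_score, top_i, top_j = rank_interchain_pairs(toy_a, toy_b)[0]
```

In the toy data, column 1 of chain A and column 0 of chain B change together, so that pair ranks first. High-scoring pairs like this are the kind of candidate inter-chain contacts that could be passed to a docking step as restraints.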

    Mass & secondary structure propensity of amino acids explain their mutability and evolutionary replacements

    Why is an amino acid replacement in a protein accepted during evolution? The answer given by bioinformatics relies on the frequency of change of each amino acid by another one and the propensity of each to remain unchanged. We propose that these replacement rules are recoverable from the secondary structural trends of amino acids. A distance measure between high-resolution Ramachandran distributions reveals that structurally similar residues coincide with those found in substitution matrices such as BLOSUM: Asn ↔ Asp, Phe ↔ Tyr, Lys ↔ Arg, Gln ↔ Glu, Ile ↔ Val, Met → Leu; with Ala, Cys, His, Gly, Ser, Pro, and Thr as structurally idiosyncratic residues. We also found a high average correlation ($\overline{R} = 0.85$) between thirty amino acid mutability scales and the mutational inertia ($I_X$), which measures the energetic cost weighted by the number of observations at the most probable amino acid conformation. These results indicate that amino acid substitutions follow two optimally efficient principles: (a) amino acid interchangeability favors secondary structural similarity, and (b) amino acid mutability depends directly on biosynthetic energy cost and inversely on frequency. These two principles are the underlying rules governing the observed amino acid substitutions. © 2017 The Author(s)

    Pre-Training on Large-Scale Generated Docking Conformations with HelixDock to Unlock the Potential of Protein-ligand Structure Prediction Models

    Protein-ligand structure prediction is an essential task in drug discovery, predicting the binding interactions between small molecules (ligands) and target proteins (receptors). Although conventional physics-based docking tools are widely utilized, their accuracy is compromised by limited conformational sampling and imprecise scoring functions. Recent advances have incorporated deep learning techniques to improve the accuracy of structure prediction. Nevertheless, the experimental validation of docking conformations remains costly, which limits the available training data and raises concerns about the generalizability of these deep learning-based methods. In this work, we show that by pre-training a geometry-aware SE(3)-equivariant neural network on large-scale docking conformations generated by traditional physics-based docking tools and then fine-tuning it on a limited set of experimentally validated receptor-ligand complexes, we can achieve outstanding performance. This process involved the generation of 100 million docking conformations, consuming roughly 1 million CPU core days. The proposed model, HelixDock, aims to acquire the physical knowledge encapsulated by the physics-based docking tools during the pre-training phase. HelixDock has been benchmarked against both physics-based and deep learning-based baselines, showing that it outperforms its closest competitor by over 40% in terms of RMSD. HelixDock also exhibits enhanced performance on a more challenging dataset, highlighting its robustness. Moreover, our investigation reveals the scaling laws governing pre-trained structure prediction models, indicating a consistent improvement in performance as model parameters and pre-training data increase. This study illustrates the strategic advantage of leveraging a vast and varied repository of generated data to advance AI-driven drug discovery.
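
The abstract above reports its benchmark in terms of RMSD without giving the formula. A minimal sketch of the standard heavy-atom RMSD between a predicted and a reference ligand pose follows. It assumes both poses share the receptor's coordinate frame and list atoms in the same order; the function name and the sample coordinates are hypothetical, and symmetry-equivalent atom orderings are not handled.

```python
import math


def pose_rmsd(predicted, reference):
    """Root-mean-square deviation between two poses, in the input units.

    Each pose is a sequence of (x, y, z) tuples with matching atom order.
    No superposition is performed: in docking, both poses are normally
    expressed in the fixed receptor frame.
    """
    if not predicted or len(predicted) != len(reference):
        raise ValueError("poses must be non-empty and have the same atom count")
    squared = sum(
        (p - r) ** 2
        for atom_p, atom_r in zip(predicted, reference)
        for p, r in zip(atom_p, atom_r)
    )
    return math.sqrt(squared / len(predicted))


# Made-up coordinates: the predicted pose is the reference shifted 1 unit in x.
reference_pose = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
predicted_pose = [(x + 1.0, y, z) for x, y, z in reference_pose]
shift_rmsd = pose_rmsd(predicted_pose, reference_pose)
```

Because every atom is displaced by the same 1-unit vector, the RMSD in this example equals the displacement length.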

    Protein Complex Structure Prediction Using Template-Based Docking and Ab Initio Docking

    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Natural Sciences, Department of Chemistry, August 2021. Chaok Seok.
    Protein-protein interactions play crucial roles in diverse biological processes, including the progression of various diseases. Atomistic structural details of protein-protein interactions, obtainable from protein complex structures, may provide vital information for the design of therapeutic agents. However, a large portion of protein complex structures are hard to capture experimentally because the underlying protein-protein interactions are weak and transient. Indeed, only a limited fraction of the protein-protein interactions occurring in the human body has been experimentally determined. Computational protein complex structure prediction methods have attracted attention for their ability to provide insights into protein-protein interactions when complete experimental structural information is unavailable. In this dissertation, three protein complex structure prediction methods are explained: GalaxyTongDock, GalaxyHeteromer, and GalaxyHomomer2. GalaxyTongDock performs ab initio docking for structure prediction of hetero- and homo-oligomers. GalaxyHeteromer and GalaxyHomomer2 predict heterodimer and homo-oligomer structures, respectively, by template-based docking or ab initio docking depending on template availability. Lastly, examples of how these methods were used to predict protein complex structures in CASP and CAPRI, community-wide prediction experiments, are presented.
    Korean abstract (translated): Interactions between proteins play key roles in many biological processes, such as cell division, maintenance of homeostasis, immune responses, and the onset of disease. A structural understanding of protein interactions, obtainable from protein complex structures, is essential for designing drugs such as effective antibody therapeutics and protein-protein interaction inhibitors. However, protein complexes are generally formed transiently through weak interactions and are therefore difficult to determine experimentally. Indeed, complex structures are known for only a tiny fraction of the many protein interactions occurring in the human body. Computational protein complex structure prediction methods have played an important role in providing information about protein interactions when no experimentally determined complex structure is available. This dissertation introduces the protein complex structure prediction methods GalaxyTongDock, GalaxyHomomer2, and GalaxyHeteromer. GalaxyTongDock predicts the structures of homo-oligomeric and hetero-oligomeric proteins by ab initio docking. GalaxyHomomer2 and GalaxyHeteromer predict the structures of homo-oligomeric and hetero-oligomeric proteins, respectively, using both template-based docking and ab initio docking. Finally, several examples illustrate how these methods were used to predict protein complex structures in CASP and CAPRI, the international protein structure and complex structure prediction competitions.
    Contents: 1. Introduction. 2. GalaxyTongDock (2.1. Methods; 2.2. Performance of GalaxyTongDock). 3. GalaxyHeteromer (3.1. Methods; 3.2. Performance of GalaxyHeteromer). 4. GalaxyHomomer2 (4.1. Methods; 4.2. Performance of GalaxyHomomer2). 5. CASP and CAPRI (5.1. CASP13; 5.2. CASP14; 5.3. CAPRI). 6. Conclusion. 7. References. Korean abstract. Acknowledgments.

    Computational prediction and analysis of macromolecular interactions

    Protein interactions regulate gene expression, cell signaling, catalysis, and many other functions across all of molecular biology. We must understand them quantitatively, and experimental methods have provided the data that form the basis of our current understanding. They remain our most accurate tools. However, their low efficiency and high cost leave room for predictive, computational approaches that can provide faster and more detailed answers to biological problems. A rigid-body simulation can quickly and effectively calculate the predicted interaction energy between two molecular structures in proximity. The fast Fourier transform-based mapping algorithm FTMap predicts small-molecule binding 'hot spots' on a protein's surface and can provide likely orientations of specific ligands of interest that may occupy those hot spots. This process now allows unique ligands to be used by this algorithm while permitting additional small-molecule cofactors to remain in their bound conformation. By keeping the cofactors bound, FTMap can reduce false positives in which the algorithm identifies a true, but incorrect, ligand pocket where the known cofactor already binds. A related algorithm, ClusPro, can evaluate interaction energies for billions of docked conformations of macromolecular structures. The work reported in this thesis can predict protein-polysaccharide interactions, and the software now contains a publicly available feature for predicting protein-heparin interactions. In addition, a new approach for determining regions of predicted activity on a protein's surface allows prediction of a protein-protein interface. This new tool can also identify the interface in encounter complexes formed by the process of protein association, which more closely resembles the biological nature of the interaction than the former calculated, binary bound and unbound states.
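
The abstract above mentions fast Fourier transform-based rigid-body scoring but not how it works. The core trick can be sketched in one dimension: by the correlation theorem, the overlap score of two grids at every translation is obtained from a few FFTs instead of one sum per translation. This is a toy illustration only, not FTMap's or ClusPro's code or scoring function; the function names and grid values are hypothetical, and real docking uses 3D grids with several energy terms.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); the length must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    sign = 1 if inverse else -1
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out


def correlation_scores(receptor, ligand):
    """Return score[t] = sum_i receptor[(i + t) % n] * ligand[i] for all shifts t.

    Both grids are real-valued lists of the same power-of-two length, and
    shifts wrap around (circular correlation).
    """
    n = len(receptor)
    spectrum_r = fft(receptor)
    spectrum_l = fft(ligand)
    product = [r * l.conjugate() for r, l in zip(spectrum_r, spectrum_l)]
    return [value.real / n for value in fft(product, inverse=True)]


# Toy grids: the receptor has a two-cell "site", the ligand is two cells wide.
receptor_grid = [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
ligand_grid = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
scores = correlation_scores(receptor_grid, ligand_grid)
best_shift = max(range(len(scores)), key=lambda t: scores[t])  # shift 3 here
```

For grids of length n, the direct approach costs on the order of n² operations while the FFT route costs on the order of n log n. That saving is what makes exhaustive translational scans over very large numbers of poses feasible.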

    Enhancing protein interaction prediction using deep learning and protein language models

    Proteins are large macromolecules that play critical roles in many cellular activities in living organisms. These include catalyzing metabolic reactions, mediating signal transduction, replicating DNA, responding to stimuli, and transporting molecules, to name a few. Proteins perform their functions by interacting with other proteins and molecules. As a result, determining the nature of such interactions is critically important in many areas of biology and medicine. The primary structure of a protein refers to its specific sequence of amino acids, the tertiary structure refers to its unique 3D shape, and the quaternary structure refers to the interaction of multiple protein subunits to form a larger, more complex structure. While the number of experimentally determined tertiary and quaternary structures is limited, databases of protein sequences continue to grow at an unprecedented rate, providing a wealth of information for training and improving sequence-based models. Recent developments in sequence-based models using machine learning and deep learning have shown significant progress toward solving protein-related problems. Specifically, attention-based transformer models, a recent breakthrough in Natural Language Processing (NLP), have shown that large models trained on unlabeled data are able to learn powerful representations of protein sequences and can lead to significant improvements in understanding protein folding, function, and interactions, as well as in drug discovery and protein engineering. The research in this thesis has pursued two objectives using sequence-based modeling. The first is to use deep learning techniques based on NLP to address an important problem in cellular immune system studies, namely, predicting Major Histocompatibility Complex (MHC)-peptide binding. 
The second is to improve the performance of the ClusPro docking server, a well-known protein-protein docking tool, in three ways: (i) integrating ClusPro with AlphaFold2, a well-known and accurate protein structure predictor, for enhanced protein model docking, (ii) predicting distance maps to improve docking accuracy, and (iii) using regression techniques to rank protein clusters for better results.

    Improving protein docking with binding site prediction

    Protein-protein and protein-ligand interactions are fundamental, as many proteins mediate their biological function through these interactions. Many important applications follow directly from the identification of residues in the interfaces between protein-protein and protein-ligand interactions, such as drug design, protein mimetic engineering, elucidation of molecular pathways, and understanding of disease mechanisms. The identification of interface residues can also guide the docking process to build the structural model of protein-protein complexes. This dissertation focuses on developing computational approaches for protein-ligand and protein-protein binding site prediction and applying these predictions to improve protein-protein docking. First, we develop an automated approach, LIGSITEcs, to predict protein-ligand binding sites, based on the notion of surface-solvent-surface events and the degree of conservation of the involved surface residues. We compare our algorithm to four other approaches, LIGSITE, CAST, PASS, and SURFNET, and evaluate all of them on a dataset of 48 unbound/bound structures and 210 bound structures. LIGSITEcs performs slightly better than the other tools and achieves success rates of 71% and 75%, respectively. Second, for protein-protein binding sites, we develop metaPPI, a meta server for interface prediction. MetaPPI combines results from a number of tools, such as PPI_Pred, PPISP, PINUP, Promate, and SPPIDER, which predict enzyme-inhibitor interfaces with success rates of 23% to 55% and other interfaces with success rates of 10% to 28% on a benchmark dataset of 62 complexes. After refinement, metaPPI significantly improves prediction success rates to 70% for enzyme-inhibitor and 44% for other interfaces. Third, for protein-protein docking, we develop an FFT-based docking algorithm and system, BDOCK, which includes specific scoring functions for specific types of complexes. 
BDOCK uses family-based residue interface propensities as a scoring function and obtains improvement factors of 4-30 for enzyme-inhibitor and 4-11 for antibody-antigen complexes in two specific SCOP families. Furthermore, the degrees of buriedness of surface residues are integrated into BDOCK, which improves the shape discriminator for enzyme-inhibitor complexes. The predicted interfaces from metaPPI are integrated as well, either during docking or after docking. The evaluation results show that reliable interface predictions improve the discrimination between near-native solutions and false positives. Finally, we propose an implicit method to deal with the flexibility of proteins by softening the surface, to improve docking for non-enzyme-inhibitor complexes.

    Template Based Modeling and Structural Refinement of Protein-Protein Interactions.

    Determining protein structures from sequence is a fundamental problem in molecular biology, as protein structure is essential to understanding protein function. In this study, I developed one of the first fully automated pipelines for template-based quaternary structure prediction starting from sequence. Two critical steps for template-based modeling are identifying the correct homologous structures by threading, which generates sequence-to-structure alignments, and refining the initial threading template coordinates closer to the native conformation. I developed SPRING (single-chain-based prediction of interactions and geometries), a monomer-threading-to-dimer-template mapping program, which was compared to the dimer co-threading program COTH using 1838 non-homologous target complex structures. SPRING's similarity score outperformed COTH in the first-place ranking of templates, correctly identifying 798 and 527 interfaces, respectively. More importantly, the results were found to be complementary, and the programs could be combined in a consensus-based threading program showing a 5.1% improvement compared to SPRING. Template-based modeling requires a structural analog to be present in the PDB. A full search of the PDB, using threading and structural alignment, revealed that only 48.7% of the PDB has a suitable template, whereas only 39.4% of the PDB has templates that can be identified by threading. To circumvent this, I included intramolecular domain-domain interfaces in the PDB library to boost template recognition of protein dimers; merging the two classes of interfaces improved recognition of heterodimers by 40% using benchmark settings. Next, the template-based assembly of protein complexes pipeline, TACOS, was created. The pipeline combines threading templates and domain knowledge from the PDB into a knowledge-based energy score. 
The energy score is integrated into a Monte Carlo sampling simulation that drives the initial template closer to the native topology. The full pipeline was benchmarked using 350 non-homologous structures and compared to two state-of-the-art programs for dimeric structure prediction: ZDOCK and MODELLER. On average, the global and interface structures of TACOS models are of better quality than those of the models generated by MODELLER and ZDOCK. PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/135847/1/bgovi_1.pd
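
The abstract above says that an energy score drives a Monte Carlo sampling simulation but does not state the acceptance rule. A minimal sketch follows, assuming the common Metropolis criterion; this is an assumption and not necessarily what TACOS uses. The function names, the toy one-dimensional energy, and the move set are all hypothetical stand-ins for a knowledge-based energy over structural models.

```python
import math
import random


def metropolis_accept(delta_energy, temperature, rng):
    """Accept downhill moves always, uphill moves with probability exp(-dE/T)."""
    if delta_energy <= 0:
        return True
    return rng.random() < math.exp(-delta_energy / temperature)


def monte_carlo_minimize(energy, propose, start, steps, temperature, seed=0):
    """Run a Metropolis walk and return the lowest-energy state visited."""
    rng = random.Random(seed)
    current, current_energy = start, energy(start)
    best, best_energy = current, current_energy
    for _ in range(steps):
        candidate = propose(current, rng)
        candidate_energy = energy(candidate)
        if metropolis_accept(candidate_energy - current_energy, temperature, rng):
            current, current_energy = candidate, candidate_energy
            if current_energy < best_energy:
                best, best_energy = current, current_energy
    return best, best_energy


def toy_energy(x):
    """Toy stand-in for an energy score: a 1D well with its minimum at x = 3."""
    return (x - 3.0) ** 2


def toy_move(x, rng):
    """Toy stand-in for a structural perturbation: a small random step."""
    return x + rng.uniform(-0.5, 0.5)


best_state, best_value = monte_carlo_minimize(
    toy_energy, toy_move, start=0.0, steps=2000, temperature=0.1
)
```

Accepting occasional uphill moves lets the walk escape local minima, which is why such schemes are often preferred over pure greedy descent when refining an initial template.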