533 research outputs found
Application of coevolution-based methods and deep learning for structure prediction of protein complexes
The three-dimensional structures of proteins play a critical role in determining their biological functions and interactions. Experimental determination of protein and protein complex structures can be expensive and difficult. Computational prediction of protein and protein complex structures has therefore been an open challenge for decades. Recent advances in computational structure prediction techniques have resulted in increasingly accurate protein structure predictions. These techniques include methods that leverage information about coevolving residues to predict residue interactions and that apply deep learning techniques to enable better prediction of residue contacts and protein structures. Prior to the work outlined in this thesis, coevolution-based methods and deep learning had been shown to improve the prediction of single protein domains or single protein chains.
Most proteins in living organisms do not function on their own but interact with other proteins either through transient interactions or by forming stable protein complexes. Knowledge of protein complex structures can be useful for biological and disease research, drug discovery and protein engineering. Unfortunately, a large number of protein complexes do not have experimental structures or close homolog structures that can be used as templates. In this thesis, methods previously developed and applied to the de novo prediction of single protein domains or protein monomer chains were modified and leveraged for the prediction of protein heterodimer and homodimer complexes. A number of coevolution-based tools and deep learning methods are explored for the purpose of predicting inter-chain and intra-chain residue contacts in protein dimers. These contacts are combined with existing protein docking methods to explore the prediction of homodimers and heterodimers.
Overall, the work in this thesis demonstrates the promise of leveraging coevolution and deep learning for the prediction of protein complexes, shows improvements in protein complex prediction tasks achieved using coevolution-based and deep learning methods, and highlights remaining challenges in protein complex prediction.
Mass & secondary structure propensity of amino acids explain their mutability and evolutionary replacements
Why is an amino acid replacement in a protein accepted during evolution? The answer given by bioinformatics relies on the frequency with which each amino acid is replaced by another and the propensity of each to remain unchanged. We propose that these replacement rules are recoverable from the secondary structural trends of amino acids. A distance measure between high-resolution Ramachandran distributions reveals that structurally similar residues coincide with those found in substitution matrices such as BLOSUM: Asn ↔ Asp, Phe ↔ Tyr, Lys ↔ Arg, Gln ↔ Glu, Ile ↔ Val, Met → Leu; with Ala, Cys, His, Gly, Ser, Pro, and Thr as structurally idiosyncratic residues. We also found a high average correlation ($\overline{R} = 0.85$) between thirty amino acid mutability scales and the mutational inertia ($I_X$), which measures the energetic cost weighted by the number of observations at the most probable amino acid conformation. These results indicate that amino acid substitutions follow two optimally efficient principles: (a) amino acid interchangeability privileges secondary structural similarity, and (b) amino acid mutability depends directly on biosynthetic energy cost and inversely on frequency. These two principles are the underlying rules governing the observed amino acid substitutions. © 2017 The Author(s)
Pre-Training on Large-Scale Generated Docking Conformations with HelixDock to Unlock the Potential of Protein-ligand Structure Prediction Models
Protein-ligand structure prediction is an essential task in drug discovery,
predicting the binding interactions between small molecules (ligands) and
target proteins (receptors). Although conventional physics-based docking tools
are widely utilized, their accuracy is compromised by limited conformational
sampling and imprecise scoring functions. Recent advances have incorporated
deep learning techniques to improve the accuracy of structure prediction.
Nevertheless, because experimental validation of docking conformations remains
costly, training data are limited, which raises concerns about the
generalizability of these deep learning-based methods. In this work, we show
that by pre-training a geometry-aware SE(3)-equivariant neural network on
large-scale docking conformations generated by traditional physics-based docking
tools and then fine-tuning on a limited set of experimentally validated
receptor-ligand complexes, we can achieve outstanding performance. This process
involved the generation of 100 million docking conformations, consuming roughly
1 million CPU core days. The proposed model, HelixDock, aims to acquire the
physical knowledge encapsulated by the physics-based docking tools during the
pre-training phase. HelixDock has been benchmarked against both physics-based
and deep learning-based baselines, showing that it outperforms its closest
competitor by over 40% for RMSD. HelixDock also exhibits enhanced performance
on a dataset that poses a greater challenge, thereby highlighting its
robustness. Moreover, our investigation reveals the scaling laws governing
pre-trained structure prediction models, indicating a consistent enhancement in
performance with increases in model parameters and pre-training data. This
study illuminates the strategic advantage of leveraging a vast and varied
repository of generated data to advance the frontiers of AI-driven drug
discovery.
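The abstract above reports its headline comparison in terms of RMSD. As a minimal sketch of that metric only (not HelixDock's evaluation code), the snippet below computes the RMSD between a predicted and a reference ligand pose given as equally ordered coordinate lists in the same receptor frame. The function names, the toy coordinates, and the 2 Å success cutoff are illustrative assumptions, not values taken from the abstract.

```python
import math

def rmsd(pred, ref):
    """Root-mean-square deviation between two equally ordered coordinate lists.

    No superposition is performed: both poses are assumed to share the
    receptor's coordinate frame, as is usual when scoring docked ligands.
    """
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((p - r) ** 2 for a, b in zip(pred, ref) for p, r in zip(a, b))
    return math.sqrt(squared / len(pred))

def success_rate(rmsds, cutoff=2.0):
    """Fraction of poses with RMSD at or below the cutoff (assumed 2 Å here)."""
    return sum(r <= cutoff for r in rmsds) / len(rmsds)

# Toy poses (hypothetical): every atom is displaced by 1 Å along z.
ref = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
pred = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (1.5, 1.5, 1.0)]
print(rmsd(pred, ref))
```

A real evaluation would also need to handle symmetric atoms and atom ordering, which this sketch ignores.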
Protein Complex Structure Prediction Using Template-Based Docking and Ab Initio Docking
Doctoral dissertation -- Seoul National University Graduate School: College of Natural Sciences, Department of Chemistry, August 2021. 석차옥 (Chaok Seok).
Protein-protein interactions play crucial roles in diverse biological processes, including the progression of various diseases. Atomistic structural details of protein-protein interactions, which can be obtained from protein complex structures, may provide vital information for the design of therapeutic agents. However, a large portion of protein complex structures is hard to capture experimentally because the underlying protein-protein interactions are weak and transient. Indeed, only a limited fraction of the protein-protein interactions occurring in the human body has been experimentally determined. Computational protein complex structure prediction methods have drawn attention for their role in providing insights into protein-protein interactions when complete experimental structural information is absent. In this dissertation, three protein complex structure prediction methods are explained: GalaxyTongDock, GalaxyHeteromer, and GalaxyHomomer2. GalaxyTongDock performs ab initio docking for structure prediction of hetero- and homo-oligomers. GalaxyHeteromer and GalaxyHomomer2 predict heterodimer and homo-oligomer structures, respectively, by template-based docking or ab initio docking, depending on template availability. Lastly, examples of how these methods were used to predict protein complex structures in CASP and CAPRI, community-wide prediction experiments, are presented.
Interactions between proteins play key roles in many biological processes, including cell division, homeostasis, immune responses, and the onset of disease. A structural understanding of protein interactions, obtainable from protein complex structures, is essential for designing drugs such as effective antibody therapeutics and protein-protein interaction inhibitors. However, protein complexes generally form transiently through weak interactions and are therefore difficult to determine experimentally. Indeed, complex structures are known for only a tiny fraction of the many protein interactions occurring in the human body. Computational protein complex structure prediction methods have played an important role in providing information about protein interactions when no experimentally determined complex structure is available. This dissertation introduces the protein complex structure prediction methods GalaxyTongDock, GalaxyHomomer2, and GalaxyHeteromer. GalaxyTongDock predicts the structures of homo-oligomeric and hetero-oligomeric proteins through ab initio docking. GalaxyHomomer2 and GalaxyHeteromer predict the structures of homo-oligomeric and hetero-oligomeric proteins, respectively, using both template-based docking and ab initio docking. Finally, several examples show how these methods were used to predict protein complex structures in CASP and CAPRI, the international protein structure and complex structure prediction competitions.
1. Introduction
2. GalaxyTongDock
2.1. Methods
2.2. Performance of GalaxyTongDock
3. GalaxyHeteromer
3.1. Methods
3.2. Performance of GalaxyHeteromer
4. GalaxyHomomer2
4.1. Methods
4.2. Performance of GalaxyHomomer2
5. CASP and CAPRI
5.1. CASP13
5.2. CASP14
5.3. CAPRI
6. Conclusion
7. References
Abstract in Korean
Acknowledgments
Scoring functions for protein docking and drug design
Predicting the structure of complexes formed by two interacting proteins is an important problem in computational structural biology. Proteins perform many of their functions by binding to other proteins. The structure of protein-protein complexes provides atomic details about protein function and biochemical pathways, and can help in designing drugs that inhibit binding. Docking computationally models the structure of protein-protein complexes, given three-dimensional structures of the individual chains. Protein docking methods have two phases. In the first phase, a comprehensive, coarse search is performed for optimally docked models. In the second phase, refinement and reranking, the models from the first phase are refined and reranked, with the expectation of extracting a small set of accurate models from the pool of thousands of models obtained from the first phase. In this thesis, new algorithms are developed for the refinement and reranking phase of docking. New scoring functions, or potentials, that rank models are developed. These potentials are learnt using large-scale machine learning methods based on mathematical programming. The procedure for learning these potentials involves examining hundreds of thousands of correct and incorrect models. In this thesis, hierarchical constraints were introduced into the learning algorithm. First, an atomic potential was developed using this learning procedure. A refinement procedure involving side-chain remodeling and conjugate gradient-based minimization was introduced. The refinement procedure combined with the atomic potential was shown to improve docking accuracy significantly. Second, a hydrogen bond potential was developed. Molecular dynamics-based sampling combined with the hydrogen bond potential improved docking predictions. Third, mathematical programming compared favorably to SVMs and neural networks in terms of accuracy, training time, and test time for the task of designing potentials to rank docking models.
The methods described in this thesis are implemented in the docking package DOCK/PIERR. DOCK/PIERR was shown to be among the best automated docking methods in community-wide assessments. Finally, DOCK/PIERR was extended to predict membrane protein complexes. A membrane-based score was added to the reranking phase and shown to improve the accuracy of docking. This docking algorithm for membrane proteins was used to study the dimers of amyloid precursor protein, implicated in Alzheimer's disease.
Computer Science
Computational prediction and analysis of macromolecular interactions
Protein interactions regulate gene expression, cell signaling, catalysis, and many other functions across all of molecular biology. We must understand them quantitatively, and experimental methods have provided the data that form the basis of our current understanding. They remain our most accurate tools. However, their low efficiency and high cost leave room for predictive, computational approaches that can provide faster and more detailed answers to biological problems. A rigid-body simulation can quickly and effectively calculate the predicted interaction energy between two molecular structures in proximity. The fast Fourier-transform-based mapping algorithm FTMap predicts small molecule binding 'hot spots' on a protein's surface and can provide likely orientations of specific ligands of interest that may occupy those hot spots. This process now allows unique ligands to be used by this algorithm while permitting additional small molecular cofactors to remain in their bound conformation. By keeping the cofactors bound, FTMap can reduce false positives in which the algorithm identifies a genuine pocket that is nonetheless the wrong one, because the known cofactor already binds there. A related algorithm, ClusPro, can evaluate interaction energies for billions of docked conformations of macromolecular structures. The work reported in this thesis can predict protein-polysaccharide interactions, and the software now contains a publicly available feature for predicting protein-heparin interactions. In addition, a new approach for determining regions of predicted activity on a protein's surface allows prediction of a protein-protein interface. This new tool can also identify the interface in encounter complexes formed by the process of protein association, more closely resembling the biological nature of the interaction than the former, calculated, binary, bound and unbound states.
Enhancing protein interaction prediction using deep learning and protein language models
Proteins are large macromolecules that play critical roles in many cellular activities in living organisms. These include catalyzing metabolic reactions, mediating signal transduction, DNA replication, responding to stimuli, and transporting molecules, to name a few. Proteins perform their functions by interacting with other proteins and molecules. As a result, determining the nature of such interactions is critically important in many areas of biology and medicine. The primary structure of a protein refers to its specific sequence of amino acids, while the tertiary structure refers to its unique 3D shape, and the quaternary structure refers to the interaction of multiple protein subunits to form a larger, more complex structure. While the number of experimentally determined tertiary and quaternary structures is limited, databases of protein sequences continue to grow at an unprecedented rate, providing a wealth of information for training and improving sequence-based models.
Recent developments in sequence-based models using machine learning and deep learning have shown significant progress toward solving protein-related problems. Specifically, attention-based transformer models, a recent breakthrough in Natural Language Processing (NLP), have shown that large models trained on unlabeled data are able to learn powerful representations of protein sequences and can lead to significant improvements in understanding protein folding, function, and interactions, as well as in drug discovery and protein engineering.
The research in this thesis has pursued two objectives using sequence-based modeling. The first is to use deep learning techniques based on NLP to address an important problem in cellular immune system studies, namely, predicting Major Histocompatibility Complex (MHC)-peptide binding. The second is to improve the performance of the ClusPro docking server, a well-known protein-protein docking tool, in three ways: (i) integrating ClusPro with AlphaFold2, a well-known accurate protein structure predictor, for enhanced protein model docking, (ii) predicting distance maps to improve docking accuracy, and (iii) using regression techniques to rank protein clusters for better results.
Improving protein docking with binding site prediction
Protein-protein and protein-ligand interactions are fundamental, as many proteins mediate their biological function through these interactions. Many important applications follow directly from the identification of residues in the interfaces between protein-protein and protein-ligand interactions, such as drug design, protein mimetic engineering, elucidation of molecular pathways, and understanding of disease mechanisms. The identification of interface residues can also guide the docking process to build the structural model of protein-protein complexes. This dissertation focuses on developing computational approaches for protein-ligand and protein-protein binding site prediction and applying these predictions to improve protein-protein docking. First, we develop an automated approach, LIGSITEcs, to predict protein-ligand binding sites, based on the notion of surface-solvent-surface events and the degree of conservation of the involved surface residues. We compare our algorithm to four other approaches, LIGSITE, CAST, PASS, and SURFNET, and evaluate all of them on a dataset of 48 unbound/bound structures and 210 bound structures. LIGSITEcs performs slightly better than the other tools and achieves success rates of 71% and 75% on the two datasets, respectively. Second, for protein-protein binding sites, we develop metaPPI, a meta server for interface prediction. MetaPPI combines results from a number of tools, such as PPI_Pred, PPISP, PINUP, Promate, and SPPIDER, which predict enzyme-inhibitor interfaces with success rates of 23% to 55% and other interfaces with success rates of 10% to 28% on a benchmark dataset of 62 complexes. After refinement, metaPPI significantly improves prediction success rates to 70% for enzyme-inhibitor and 44% for other interfaces. Third, for protein-protein docking, we develop an FFT-based docking algorithm and system, BDOCK, which includes specific scoring functions for specific types of complexes.
BDOCK uses family-based residue interface propensities as a scoring function and obtains improvement factors of 4-30 for enzyme-inhibitor and 4-11 for antibody-antigen complexes in two specific SCOP families. Furthermore, the degrees of buriedness of surface residues are integrated into BDOCK, which improves the shape discriminator for enzyme-inhibitor complexes. The predicted interfaces from metaPPI are integrated as well, either during docking or after docking. The evaluation results show that reliable interface predictions improve the discrimination between near-native solutions and false positives. Finally, we propose an implicit method to deal with the flexibility of proteins by softening the surface, to improve docking for non-enzyme-inhibitor complexes.
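The abstract above describes an FFT-based docking algorithm. As a rough illustration of the general idea behind FFT docking (not BDOCK's scoring function or grid setup), the sketch below scores every cyclic translation of a toy one-dimensional ligand grid against a receptor grid with one forward/inverse FFT pass, using the correlation theorem. The grid values (+1 for surface, -9 for core) and all function names are assumptions chosen for illustration; real docking codes work on 3D grids and also scan rotations.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (length must be a power of two); unnormalized."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def correlation_scores(receptor, ligand):
    """score[t] = sum_y receptor[(y + t) % n] * ligand[y], for all shifts t."""
    n = len(receptor)
    fr, fl = fft(receptor), fft(ligand)
    prod = [a * b.conjugate() for a, b in zip(fr, fl)]
    return [v.real / n for v in fft(prod, inverse=True)]

# Hypothetical 1D grids: +1 rewards surface contact, -9 penalizes core overlap.
receptor = [0, 1, 1, -9, -9, 1, 0, 0]
ligand = [1, 1, 0, 0, 0, 0, 0, 0]  # ligand occupies two adjacent cells

scores = correlation_scores(receptor, ligand)
best_shift = max(range(len(scores)), key=lambda t: scores[t])
print(best_shift, round(scores[best_shift], 6))
```

The FFT route costs O(n log n) per orientation instead of O(n^2) for the direct sum, which is what makes exhaustive translational scans affordable on 3D grids.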
Template Based Modeling and Structural Refinement of Protein-Protein Interactions.
Determining protein structures from sequence is a fundamental problem in molecular biology, as protein structure is essential to understanding protein function. In this study, I developed one of the first fully automated pipelines for template-based quaternary structure prediction starting from sequence. Two critical steps for template-based modeling are identifying the correct homologous structures by threading, which generates sequence-to-structure alignments, and refining the initial threading template coordinates closer to the native conformation. I developed SPRING (single-chain-based prediction of interactions and geometries), a monomer threading to dimer template mapping program, which was compared to the dimer co-threading program, COTH, using 1838 non-homologous target complex structures. SPRING's similarity score outperformed COTH in the first-place ranking of templates, correctly identifying 798 and 527 interfaces, respectively. More importantly, the results were found to be complementary, and the programs could be combined in a consensus-based threading program showing a 5.1% improvement compared to SPRING. Template-based modeling requires a structural analog to be present in the PDB. A full search of the PDB, using threading and structural alignment, revealed that only 48.7% of the PDB has a suitable template, whereas only 39.4% of the PDB has templates that can be identified by threading. To circumvent this, I included intramolecular domain-domain interfaces in the PDB library to boost template recognition of protein dimers; merging the two classes of interfaces improved recognition of heterodimers by 40% using benchmark settings. Next, the template-based assembly of protein complexes pipeline, TACOS, was created. The pipeline combines threading templates and domain knowledge from the PDB into a knowledge-based energy score.
The energy score is integrated into a Monte Carlo sampling simulation that drives the initial template closer to the native topology. The full pipeline was benchmarked using 350 non-homologous structures and compared to two state-of-the-art programs for dimeric structure prediction: ZDOCK and MODELLER. On average, the global and interface structures of TACOS models are of better quality than those of the models generated by MODELLER and ZDOCK.
PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/135847/1/bgovi_1.pd
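The abstract above mentions an energy score driving a Monte Carlo sampling simulation. The following is a generic Metropolis Monte Carlo sketch under stated assumptions, not the TACOS pipeline: the quadratic `toy_energy`, the Gaussian move set in `propose`, the "native" offset, and the temperature are all hypothetical stand-ins for a knowledge-based energy and real structural moves.

```python
import math
import random

def metropolis(energy, propose, start, steps=2000, temperature=0.1, seed=0):
    """Generic Metropolis sampler that tracks the lowest-energy state visited."""
    rng = random.Random(seed)
    current, e_cur = start, energy(start)
    best, e_best = current, e_cur
    for _ in range(steps):
        cand = propose(current, rng)
        e_cand = energy(cand)
        delta = e_cand - e_cur
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, e_cur = cand, e_cand
            if e_cur < e_best:
                best, e_best = current, e_cur
    return best, e_best

NATIVE = (3.0, -1.0)  # hypothetical "native" rigid-body offset

def toy_energy(pose):
    """Toy energy: squared distance of the pose from the assumed native offset."""
    return (pose[0] - NATIVE[0]) ** 2 + (pose[1] - NATIVE[1]) ** 2

def propose(pose, rng, sigma=0.5):
    """Small random perturbation of the current pose."""
    return (pose[0] + rng.gauss(0.0, sigma), pose[1] + rng.gauss(0.0, sigma))

start = (0.0, 0.0)
best, e_best = metropolis(toy_energy, propose, start)
print(best, e_best)
```

Occasionally accepting uphill moves is what lets the sampler escape local minima that a purely greedy refinement would get stuck in.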