359 research outputs found
Multi-dimensional scaling and MODELLER based evolutionary algorithms for protein model refinement
"December 2013.""A Thesis Presented to the Faculty of the Graduate School at the University of Missouri--Columbia In Partial Fulfillment of the Requirements for the Degree Master of Science."Thesis supervisor: Dr. Yi Shang.To computationally obtain an accurate prediction of the three-dimensional structure of a protein from its primary sequence is one of the most important problems in bioinformatics and has been actively researched for many years. Although a number of software packages have been developed and they sometimes perform well on template-based modeling, further improvement is needed for practical use. Model refinement is a step in the prediction process, in which improved structures are constructed based on a pool of initially generated models. Since the refinement category being added to the Critical Assessment of Structure Prediction (CASP) competition in 2008, CASP results show that it is a challenge for existing model refinement methods to improve model quality consistently. This project focuses on evolutionary algorithms for protein model refinement. Three new algorithms have been developed, in which multidimensional scaling (MDS), MODELLER, and a hybrid of both are used as crossover operators, respectively. The MDS-based method takes a purely geometrical approach and generates a child model by combining the contact maps of multiple parents. The MODELLER-based method takes a statistical and energy minimization approach and uses the remodeling module in MODELLER program to generate new models from multiple parents. The hybrid method first generates models using the MDS-based method and then run them through the MODELLER-based method, aiming at combining the strength of both. Promising IX results have been obtained in experiments using CASP datasets. The MDS-based method improved the best of a pool of predicted models in terms of the global distance test score (GDT-TS) in 9 out of 16 test targets. For instance, for target T0680, the GDT-TS of a refined model is 0.833, much better than 0.763, the value of the best model in the initial pool.Includes bibliographical references (pages 43-48)
Mass & secondary structure propensity of amino acids explain their mutability and evolutionary replacements
Why is an amino acid replacement in a protein accepted during evolution? The answer given by bioinformatics relies on the frequency of change of each amino acid by another one and the propensity of each to remain unchanged. We propose that these replacement rules are recoverable from the secondary structural trends of amino acids. A distance measure between high-resolution Ramachandran distributions reveals that structurally similar residues coincide with those found in substitution matrices such as BLOSUM: Asn Asp, Phe Tyr, Lys Arg, Gln Glu, Ile Val, Met → Leu; with Ala, Cys, His, Gly, Ser, Pro, and Thr, as structurally idiosyncratic residues. We also found a high average correlation (\overline{R} R = 0.85) between thirty amino acid mutability scales and the mutational inertia (I X ), which measures the energetic cost weighted by the number of observations at the most probable amino acid conformation. These results indicate that amino acid substitutions follow two optimally-efficient principles: (a) amino acids interchangeability privileges their secondary structural similarity, and (b) the amino acid mutability depends directly on its biosynthetic energy cost, and inversely with its frequency. These two principles are the underlying rules governing the observed amino acid substitutions. © 2017 The Author(s)
Molecular dynamics recipes for genome research
Molecular dynamics (MD) simulation allows one to predict the time evolution of a system of interacting particles. It is widely used in physics, chemistry and biology to address specific questions about the structural properties and dynamical mechanisms of model systems. MD earned a great success in genome research, as it proved to be beneficial in sorting pathogenic from neutral genomic mutations. Considering their computational requirements, simulations are commonly performed on HPC computing devices, which are generally expensive and hard to administer. However, variables like the software tool used for modeling and simulation or the size of the molecule under investigation might make one hardware type or configuration more advantageous than another or even make the commodity hardware definitely suitable for MD studies. This work aims to shed lights on this aspect
Molecular models for the core components of the flagellar type-III secretion complex
We show that by using a combination of computational methods, consistent three-dimensional molecular models can be proposed for the core proteins of the type-III secretion system. We employed a variety of approaches to reconcile disparate, and sometimes inconsistent, data sources into a coherent picture that for most of the proteins indicated a unique solution to the constraints. The range of difficulty spanned from the trivial (FliQ) to the difficult (FlhA and FliP). The uncertainties encountered with FlhA were largely the result of the greater number of helix packing possibilities allowed in a large protein, however, for FliP, there remains an uncertainty in how to reconcile the large displacement predicted between its two main helical hairpins and their ability to sit together happily across the bacterial membrane. As there is still no high resolution structural information on any of these proteins, we hope our predicted models may be of some use in aiding the interpretation of electron microscope images and in rationalising mutation data and experiments
Recommended from our members
Scoring functions for protein docking and drug design
textPredicting the structure of complexes formed by two interacting proteins is an important problem in computation structural biology. Proteins perform many of their functions by binding to other proteins. The structure of protein-protein complexes provides atomic details about protein function and biochemical pathways, and can help in designing drugs that inhibit binding. Docking computationally models the structure of protein-protein complexes, given three-dimensional structures of the individual chains. Protein docking methods have two phases. In the first phase, a comprehensive, coarse search is performed for optimally docked models. In the second refinement and reranking phase, the models from the first phase are refined and reranked, with the expectation of extracting a small set of accurate models from the pool of thousands of models obtained from the first phase. In this thesis, new algorithms are developed for the refinement and reranking phase of docking. New scoring functions, or potentials, that rank models are developed. These potentials are learnt using large-scale machine learning methods based on mathematical programming. The procedure for learning these potentials involves examining hundreds of thousands of correct and incorrect models. In this thesis, hierarchical constraints were introduced into the learning algorithm. First, an atomic potential was developed using this learning procedure. A refinement procedure involving side-chain remodeling and conjugate gradient-based minimization was introduced. The refinement procedure combined with the atomic potential was shown to improve docking accuracy significantly. Second, a hydrogen bond potential, was developed. Molecular dynamics-based sampling combined with the hydrogen bond potential improved docking predictions. Third, mathematical programming compared favorably to SVMs and neural networks in terms of accuracy, training and test time for the task of designing potentials to rank docking models. The methods described in this thesis are implemented in the docking package DOCK/PIERR. DOCK/PIERR was shown to be among the best automated docking methods in community wide assessments. Finally, DOCK/PIERR was extended to predict membrane protein complexes. A membrane-based score was added to the reranking phase, and shown to improve the accuracy of docking. This docking algorithm for membrane proteins was used to study the dimers of amyloid precursor protein, implicated in Alzheimer's disease.R. DOCK/PIERR was shown to be among the best automated docking methods in community wide assessments. Finally, DOCK/PIERR was extended to predict membrane protein complexes. A membrane-based score was added to the reranking phase, and shown to improve the accuracy of docking. This docking algorithm for membrane proteins was used to study the dimers of amyloid precursor protein, implicated in Alzheimer’s disease.Computer Science
Analysis of host-pathogen interactions via clustering, statistical analysis, and data visualization
Infectious diseases are caused by a variety of agents: viruses, bacteria, parasites, or even proteins. Using existing state-of-the-art methods and tools I developed myself, I studied aspects of infectious agents. To find the most conserved and diverse regions of influenza A proteins, I found clusters of extremely conserved or diverse residues. Because traditional methods of clustering proved ineffective for diverse regions, I developed a Metropolis Criterion Monte Carlo (MMC) clustering algorithm to discover clusters of extremely diverse regions. In addition to viruses, I studied pathogenic bacterial proteins known as effectors. Using an in-house prediction method, Preffector, I generated predicted effectors for 14 bacteria and created a database and webserver to hold relevant information: BacPaC. BacPaC uses intuitive visualizations and script-generated profile pages to display relevant data about the predicted effectors. Finally, I applied structural modeling and docking techniques to soybean proteins that are known to incur resistance to nematodes. For each of these studies, I used clustering, data analysis, and data visualization to better understand infectious agents
Protein structure prediction and structure-based protein function annotation
Nature tends to modify rather than invent function of protein molecules, and the log of the modifications is encrypted in the gene sequence. Analysis of these modification events in evolutionarily related genes is important for assigning function to hypothetical genes and their products surging in databases, and to improve our understanding of the bioverse. However, random mutations occurring during evolution chisel the sequence to an extent that both decrypting these codes and identifying evolutionary relatives from sequence alone becomes difficult. Thankfully, even after many changes at the sequence level, the protein three-dimensional structures are often conserved and hence protein structural similarity usually provide more clues on evolution of functionally related proteins. In this dissertation, I study the design of three bioinformatics modules that form a new hierarchical approach for structure prediction and function annotation of proteins based on sequence-to-structure-to-function paradigm. First, we design an online platform for structure prediction of protein molecules using multiple threading alignments and iterative structural assembly simulations (I-TASSER). I review the components of this module and have added features that provide function annotation to the protein sequences and help to combine experimental and biological data for improving the structure modeling accuracy. The online service of the system has been supporting more than 20,000 biologists from over 100 countries. Next, we design a new comparative approach (COFACTOR) to identify the location of ligand binding sites on these modeled protein structures and spot the functional residue constellations using an innovative global-to-local structural alignment procedure and functional sites in known protein structures. Based on both large-scale benchmarking and blind tests (CASP), the method demonstrates significant advantages over the state-of-the- art methods of the field in recognizing ligand-binding residues for both metal and non- metal ligands. The major advantage of the method is the optimal combination of the local and global protein structural alignments, which helps to recognize functionally conserved structural motifs among proteins that have taken different evolutionary paths. We further extend the COFACTOR global-to-local approach to annotate the gene- ontology and enzyme classifications of protein molecules. Here, we added two new components to COFACTOR. First, we developed a new global structural match algorithm that allows performing better structural search. Second, a sensitive technique was proposed for constructing local 3D-signature motifs of template proteins that lack known functional sites, which allows us to perform query-template local structural similarity comparisons with all template proteins. A scoring scheme that combines the confidence score of structure prediction with global-local similarity score is used for assigning a confidence score to each of the predicted function. Large scale benchmarking shows that the predicted functions have remarkably improved precision and recall rates and also higher prediction coverage than the state-of-art sequence based methods. To explore the applicability of the method for real-world cases, we applied the method to a subset of ORFs from Chlamydia trachomatis and the functional annotations provided new testable hypothesis for improving the understanding of this phylogenetically distinct bacterium
Protein Threading for Genome-Scale Structural Analysis
Protein structure prediction is a necessary tool in the field of bioinformatic analysis. It is a non-trivial process that can add a great deal of information to a genome annotation. This dissertation deals with protein structure prediction through the technique of protein fold recognition and outlines several strategies for the improvement of protein threading techniques. In order to improve protein threading performance, this dissertation begins with an outline of sequence/structure alignment energy functions. A technique called Violated Inequality Minimization is used to quickly adapt to the changing energy landscape as new energy functions are added. To continue the improvement of alignment accuracy and fold recognition, new formulations of energy functions are used for the creation of the sequence/structure alignment. These energies include a formulation of a gap penalty which is dependent on sequence characteristics different from the traditional constant penalty. Another proposed energy is dependent on conserved structural patterns found during threading. These structural patterns have been employed to refine the sequence/structure alignment in my research. The section on Linear Programming Algorithm for protein structure alignment deals with the optimization of an alignment using additional residue-pair energy functions. In the original version of the model, all cores had to be aligned to the target sequence. Our research outlines an expansion of the original threading model which allows for a more flexible alignment by allowing core deletions. Aside from improvements in fold recognition and alignment accuracy, there is also a need to ensure that these techniques can scale for the computational demands of genome level structure prediction. A heuristic decision making processes has been designed to automate the classification and preparation of proteins for prediction. A graph analysis has been applied to the integration of different tools involved in the pipeline. Analysis of the data dependency graph allows for automatic parallelization of genome structure prediction. These different contributions help to improve the overall performance of protein threading and help distribute computations across a large set of computers to help make genome scale protein structure prediction practically feasible
Data-driven modelling of biological multi-scale processes
Biological processes involve a variety of spatial and temporal scales. A
holistic understanding of many biological processes therefore requires
multi-scale models which capture the relevant properties on all these scales.
In this manuscript we review mathematical modelling approaches used to describe
the individual spatial scales and how they are integrated into holistic models.
We discuss the relation between spatial and temporal scales and the implication
of that on multi-scale modelling. Based upon this overview over
state-of-the-art modelling approaches, we formulate key challenges in
mathematical and computational modelling of biological multi-scale and
multi-physics processes. In particular, we considered the availability of
analysis tools for multi-scale models and model-based multi-scale data
integration. We provide a compact review of methods for model-based data
integration and model-based hypothesis testing. Furthermore, novel approaches
and recent trends are discussed, including computation time reduction using
reduced order and surrogate models, which contribute to the solution of
inference problems. We conclude the manuscript by providing a few ideas for the
development of tailored multi-scale inference methods.Comment: This manuscript will appear in the Journal of Coupled Systems and
Multiscale Dynamics (American Scientific Publishers
- …