Full cyclic coordinate descent: solving the protein loop closure problem in Cα space
BACKGROUND: Various forms of the so-called loop closure problem are crucial to protein structure prediction methods. Given an N- and a C-terminal end, the problem consists of finding a suitable segment of a certain length that bridges the ends seamlessly. In homology modelling, the problem arises in predicting loop regions. In de novo protein structure prediction, the problem is encountered when implementing local moves for Markov Chain Monte Carlo simulations. Most loop closure algorithms keep the bond angles fixed or semi-fixed, and only vary the dihedral angles. This is appropriate for a full-atom protein backbone, since the bond angles can be considered as fixed, while the (φ, ψ) dihedral angles are variable. However, many de novo structure prediction methods use protein models that only consist of Cα atoms, or otherwise do not make use of all backbone atoms. These models require an approach that alters both bond and dihedral angles, since the pseudo bond angle between three consecutive Cα atoms also varies considerably. RESULTS: Here we present a method that solves the loop closure problem for Cα-only protein models. We developed a variant of Cyclic Coordinate Descent (CCD), an inverse kinematics method from the field of robotics, which was recently applied to the loop closure problem. Since the method alters both bond and dihedral angles, which is equivalent to applying a full rotation matrix, we call our method Full CCD (FCCD). FCCD replaces CCD's vector-based optimization of a rotation around an axis with a singular value decomposition-based optimization of a general rotation matrix. The method is easy to implement and numerically stable. CONCLUSION: We tested the method's performance on sets of random protein Cα segments between 5 and 30 amino acids long, and a number of loops of length 4, 8 and 12. FCCD is fast, has a high success rate and readily generates conformations close to those of real loops. The presence of constraints on the angles has only a small effect on the performance. A reference implementation of FCCD in Python is available as supplementary information.
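The idea of replacing CCD's single-axis rotation with an SVD-fitted general rotation can be illustrated with a minimal sketch. This is not the reference implementation mentioned in the abstract. It assumes that the first three Cα atoms stay fixed, that the last three atoms are matched to three target positions, and that the angle constraints are ignored. The names `random_chain`, `optimal_rotation`, `end_rmsd` and `close_loop`, as well as the 3.8 Å pseudo-bond length, are illustrative choices.

```python
import numpy as np

BOND = 3.8  # assumed Cα-Cα pseudo-bond length in Å (illustrative value)


def random_chain(n, seed):
    """Hypothetical helper: random Cα trace with constant pseudo-bond length."""
    rng = np.random.default_rng(seed)
    steps = rng.normal(size=(n - 1, 3))
    steps *= BOND / np.linalg.norm(steps, axis=1, keepdims=True)
    return np.vstack([np.zeros(3), np.cumsum(steps, axis=0)])


def optimal_rotation(moving, fixed):
    """Proper rotation R minimising sum_i |R m_i - f_i|^2, obtained via SVD."""
    u, _, vt = np.linalg.svd(moving.T @ fixed)
    d = np.sign(np.linalg.det(vt.T @ u.T))  # flip one axis to avoid a reflection
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T


def end_rmsd(chain, target):
    """RMSD between the last three chain atoms and the three target atoms."""
    return float(np.sqrt(np.mean(np.sum((chain[-3:] - target) ** 2, axis=1))))


def close_loop(chain, target, max_iter=1000, tol=0.1):
    """Cycle over pivot atoms, rotating everything downstream of each pivot."""
    chain = np.array(chain, dtype=float)
    for _ in range(max_iter):
        for i in range(2, len(chain) - 3):  # atoms 0..2 are never moved
            pivot = chain[i].copy()
            # Fit a full rotation about the pivot that moves the end atoms
            # as close as possible to the target positions.
            rot = optimal_rotation(chain[-3:] - pivot, target - pivot)
            chain[i + 1:] = (chain[i + 1:] - pivot) @ rot.T + pivot
        if end_rmsd(chain, target) < tol:
            break
    return chain, end_rmsd(chain, target)
```

Because every step is a rigid rotation about a pivot atom, pseudo-bond lengths are preserved while both pseudo-bond and dihedral angles change. Each step can only lower or maintain the end RMSD, since the identity rotation is always among the candidates.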
Accelerating delayed-acceptance Markov chain Monte Carlo algorithms
Delayed-acceptance Markov chain Monte Carlo (DA-MCMC) samples from a
probability distribution via a two-stage version of the Metropolis-Hastings
algorithm, by combining the target distribution with a "surrogate" (i.e. an
approximate and computationally cheaper version) of said distribution. DA-MCMC
accelerates MCMC sampling in complex applications, while still targeting the
exact distribution. We design a computationally faster, albeit approximate,
DA-MCMC algorithm. We consider parameter inference in a Bayesian setting where
a surrogate likelihood function is introduced in the delayed-acceptance scheme.
When the evaluation of the likelihood function is computationally intensive,
our scheme produces a 2-4 times speed-up, compared to standard DA-MCMC.
However, the acceleration is highly problem dependent. Inference results for
the standard delayed-acceptance algorithm and our approximated version are
similar, indicating that our algorithm can return reliable Bayesian inference.
As a computationally intensive case study, we introduce a novel stochastic
differential equation model for protein folding data.
Comment: 40 pages, 21 figures, 10 tables
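For orientation, the standard two-stage delayed-acceptance scheme that the abstract takes as its baseline can be sketched as follows. This shows only the exact DA-MCMC step, not the accelerated approximate variant the authors propose. It assumes a one-dimensional state and a symmetric Gaussian random-walk proposal. The function name `da_mcmc` and its parameters are hypothetical.

```python
import numpy as np


def da_mcmc(log_target, log_surrogate, x0, n_steps, step=1.0, seed=0):
    """Delayed-acceptance Metropolis-Hastings with a symmetric random walk.

    Returns the chain and the number of expensive target evaluations made
    inside the loop.
    """
    rng = np.random.default_rng(seed)
    x = float(x0)
    lt_x, ls_x = log_target(x), log_surrogate(x)
    samples = np.empty(n_steps)
    full_evals = 0
    for k in range(n_steps):
        y = x + step * rng.normal()
        ls_y = log_surrogate(y)
        # Stage 1: screen the proposal using only the cheap surrogate.
        if np.log(rng.uniform()) < ls_y - ls_x:
            full_evals += 1
            lt_y = log_target(y)
            # Stage 2: correct for the surrogate so that the chain still
            # targets the exact distribution.
            if np.log(rng.uniform()) < (lt_y - lt_x) - (ls_y - ls_x):
                x, lt_x, ls_x = y, lt_y, ls_y
        samples[k] = x
    return samples, full_evals
```

Proposals rejected in stage 1 never trigger an evaluation of the expensive target, which is where the saving comes from. How large the saving is depends on how well the surrogate tracks the target.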
What is a meaningful representation of protein sequences?
How we choose to represent our data has a fundamental impact on our ability
to subsequently extract information from them. Machine learning promises to
automatically determine efficient representations from large unstructured
datasets, such as those arising in biology. However, empirical evidence
suggests that seemingly minor changes to these machine learning models yield
drastically different data representations that result in different biological
interpretations of data. This raises the question of what even constitutes the
most meaningful representation. Here, we approach this question for
representations of protein sequences, which have received considerable
attention in the recent literature. We explore two key contexts in which
representations naturally arise: transfer learning and interpretable learning.
In the first context, we demonstrate that several contemporary practices yield
suboptimal performance, and in the latter we demonstrate that taking
representation geometry into account significantly improves interpretability
and lets the models reveal biological information that is otherwise obscured.
Comment: 17 pages, 8 figures, 2 tables
Protein structure validation and refinement using amide proton chemical shifts derived from quantum mechanics
We present the ProCS method for the rapid and accurate prediction of protein
backbone amide proton chemical shifts - sensitive probes of the geometry of key
hydrogen bonds that determine protein structure. ProCS is parameterized against
quantum mechanical (QM) calculations and reproduces high level QM results
obtained for a small protein with an RMSD of 0.25 ppm (r = 0.94). ProCS is
interfaced with the PHAISTOS protein simulation program and is used to infer
statistical protein ensembles that reflect experimentally measured amide proton
chemical shift values. Such chemical shift-based structural refinements,
starting from high-resolution X-ray structures of Protein G, ubiquitin, and SMN
Tudor Domain, result in average chemical shifts, hydrogen bond geometries, and
trans-hydrogen bond (h3JNC') spin-spin coupling constants that are in excellent
agreement with experiment. We show that the structural sensitivity of the
QM-based amide proton chemical shift predictions is needed to refine protein
structures to this agreement. The ProCS method thus offers a powerful new tool
for refining the structures of hydrogen bonding networks to high accuracy with
many potential applications such as protein flexibility in ligand binding.
Comment: PLOS ONE accepted, Nov 201
Bayesian inference of protein ensembles from SAXS data
The inherent flexibility of intrinsically disordered proteins (IDPs) and multi-domain proteins with intrinsically disordered regions (IDRs) presents challenges to structural analysis. These macromolecules need to be represented by an ensemble of conformations, rather than a single structure. Small-angle X-ray scattering (SAXS) experiments capture ensemble-averaged data for the set of conformations. We present a Bayesian approach to ensemble inference from SAXS data, called Bayesian ensemble SAXS (BE-SAXS). We address two issues with existing methods: the use of a finite ensemble of structures to represent the underlying distribution, and the selection of that ensemble as a subset of an initial pool of structures. This is achieved through the formulation of a Bayesian posterior of the conformational space. BE-SAXS modifies a structural prior distribution in accordance with the experimental data. It uses multi-step expectation maximization, with alternating rounds of Markov-chain Monte Carlo simulation and empirical Bayes optimization. We demonstrate the method by employing it to obtain a conformational ensemble of the antitoxin PaaA2 and comparing the results to a published ensemble.
ISSN: 1463-9084; ISSN: 1463-907
ENCORE: Software for Quantitative Ensemble Comparison
There is increasing evidence that protein dynamics and conformational changes can play an important role in modulating biological function. As a result, experimental and computational methods are being developed, often synergistically, to study the dynamical heterogeneity of a protein or other macromolecules in solution. Thus, methods such as molecular dynamics simulations or ensemble refinement approaches have provided conformational ensembles that can be used to understand protein function and biophysics. These developments have in turn created a need for algorithms and software that can be used to compare structural ensembles in the same way as the root-mean-square-deviation is often used to compare static structures. Although a few such approaches have been proposed, these can be difficult to implement efficiently, hindering broader application and further development. Here, we present an easily accessible software toolkit, called ENCORE, which can be used to compare conformational ensembles generated either from simulations alone or synergistically with experiments. ENCORE implements three previously described methods for ensemble comparison, each of which can be used to quantify the similarity between conformational ensembles by estimating the overlap between the probability distributions that underlie them. We demonstrate the kinds of insights that can be obtained by providing examples of three typical use-cases: comparing ensembles generated with different molecular force fields, assessing convergence in molecular simulations, and calculating differences and similarities in structural ensembles refined with various sources of experimental data. We also demonstrate efficient computational scaling for typical analyses, and robustness against both the size and sampling of the ensembles. ENCORE is freely available and extendable, integrates with the established MDAnalysis software package, reads ensemble data in many common formats, and can work with large trajectory files.
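One generic way to quantify the overlap between two probability distributions is the Jensen-Shannon divergence. The sketch below is not ENCORE's code or API. It assumes that each ensemble has already been reduced to a discrete distribution over shared bins or clusters, and the name `js_divergence` is illustrative.

```python
import numpy as np


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions.

    p and q are non-negative weights over the same bins, for example the
    populations of shared clusters in two conformational ensembles.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        # Bins with a == 0 contribute nothing to the sum.
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

The value is 0 for identical distributions and ln 2 for distributions with no overlap, so it gives a bounded, symmetric score for comparing two ensembles.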
Beyond rotamers: a generative, probabilistic model of side chains in proteins.
RIGHTS: This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license which is similar to the 'Creative Commons Attribution Licence'. In brief you may: copy, distribute, and display the work; make derivative works; or make commercial use of the work - under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are.
BACKGROUND: Accurately covering the conformational space of amino acid side chains is essential for important applications such as protein design, docking and high resolution structure prediction. Today, the most common way to capture this conformational space is through rotamer libraries - discrete collections of side chain conformations derived from experimentally determined protein structures. The discretization can be exploited to efficiently search the conformational space. However, discretizing this naturally continuous space comes at the cost of losing detailed information that is crucial for certain applications. For example, rigorously combining rotamers with physical force fields is associated with numerous problems. RESULTS: In this work we present BASILISK: a generative, probabilistic model of the conformational space of side chains that makes it possible to sample in continuous space. In addition, sampling can be conditional upon the protein's detailed backbone conformation, again in continuous space - without involving discretization. CONCLUSIONS: A careful analysis of the model and a comparison with various rotamer libraries indicate that the model forms an excellent, fully continuous model of side chain conformational space. We also illustrate how the model can be used for rigorous, unbiased sampling with a physical force field, and how it improves side chain prediction when used as a pseudo-energy term. In conclusion, BASILISK is an important step forward on the way to a rigorous probabilistic description of protein structure in continuous space and in atomic detail.