15 research outputs found

    A Model of Substitution Trajectories in Sequence Space and Long-Term Protein Evolution

    The nature of factors governing the tempo and mode of protein evolution is a fundamental issue in evolutionary biology. Specifically, whether or not interactions between different sites, or epistasis, are important in directing the course of evolution became one of the central questions. Several recent reports have scrutinized patterns of long-term protein evolution claiming them to be compatible only with an epistatic fitness landscape. However, these claims have not yet been substantiated with a formal model of protein evolution. Here, we formulate a simple covarion-like model of protein evolution focusing on the rate at which the fitness impact of amino acids at a site changes with time. We then apply the model to the data on convergent and divergent protein evolution to test whether or not the incorporation of epistatic interactions is necessary to explain the data. We find that convergent evolution cannot be explained without the incorporation of epistasis and the rate at which an amino acid state switches from being acceptable at a site to being deleterious is faster than the rate of amino acid substitution. Specifically, for proteins that have persisted in modern prokaryotic organisms since the last universal common ancestor for one amino acid substitution approximately ten amino acid states switch from being accessible to being deleterious, or vice versa. Thus, molecular evolution can only be perceived in the context of rapid turnover of which amino acids are available for evolution.

    Machine Learning: How Much Does It Tell about Protein Folding Rates?

    The prediction of protein folding rates is a necessary step towards understanding the principles of protein folding. Due to the increasing amount of experimental data, numerous protein folding models and predictors of protein folding rates have been developed in the last decade. The problem has also attracted the attention of scientists from computational fields, which led to the publication of several machine learning-based models to predict the rate of protein folding. Some of them claim to predict the logarithm of protein folding rate with an accuracy greater than 90%. However, there are reasons to believe that such claims are exaggerated due to large fluctuations and overfitting of the estimates. When we confronted three selected published models with new data, we found a much lower predictive power than reported in the original publications. Overly optimistic predictive powers arise from violations of the basic principles of machine learning. We highlight common misconceptions in the studies claiming excessive predictive power and propose to use learning curves as a safeguard against those mistakes. As an example, we show that the current amount of experimental data is insufficient to build a linear predictor of logarithms of folding rates based on protein amino acid composition.

    Learning curves of the linear regression model.

    The mean (n = 1000) correlation coefficient of the training and test sets between the predicted and observed log folding rates (blue and red lines, respectively) is plotted as a function of the dataset size, together with the standard deviations of both sets (blue and red regions, respectively). Sixty percent of the examples are assigned to the training set and 40% to the test set. (a) Log folding rates were fitted with 20 features corresponding to the absolute amino acid frequency of each protein. A clear overfit can be seen as a gap between the two correlation lines. (b) Log folding rates were fitted using a single feature corresponding to the amino acid length of each protein to the power of 2/3, ln(k_f) ~ -L^(2/3) [13]. There exists a nearly perfect correspondence between training and test sets, and a slightly higher correlation on the test set than in Fig 4A.
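
A minimal sketch of such a learning curve, in plain Python. Everything here is an assumption made for illustration: the data are synthetic (20 random features, of which only the first carries signal), the tiny ridge term exists only to keep the normal equations solvable, and the helper names (`fit_linear`, `learning_curve`) are hypothetical. It follows the caption's 60/40 split but uses far fewer repeats than n = 1000.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def fit_linear(X, y, ridge=1e-8):
    """Least squares via the normal equations, solved by Gaussian elimination.
    The small ridge term is an assumption of this sketch, not part of the source."""
    p = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) + (ridge if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(r[i] * t for r, t in zip(X, y)) for i in range(p)]
    for col in range(p):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        piv = max(range(col, p), key=lambda k: abs(A[k][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * p
    for i in reversed(range(p)):  # back substitution
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, p))) / A[i][i]
    return w

def predict(X, w):
    return [sum(a * c for a, c in zip(row, w)) for row in X]

def learning_curve(X, y, sizes, repeats=20, train_frac=0.6, seed=0):
    """Mean train/test correlation for random subsets of each size."""
    rng = random.Random(seed)
    rows = [[1.0] + list(r) for r in X]  # prepend an intercept column
    curve = []
    for m in sizes:
        train_r, test_r = [], []
        for _ in range(repeats):
            sub = rng.sample(range(len(y)), m)
            k = int(train_frac * m)
            tr, te = sub[:k], sub[k:]
            w = fit_linear([rows[i] for i in tr], [y[i] for i in tr])
            train_r.append(pearson(predict([rows[i] for i in tr], w), [y[i] for i in tr]))
            test_r.append(pearson(predict([rows[i] for i in te], w), [y[i] for i in te]))
        curve.append((m, mean(train_r), mean(test_r)))
    return curve

# Synthetic stand-in for the folding-rate data: only feature 0 matters.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(20)] for _ in range(300)]
y = [2.0 * row[0] + data_rng.gauss(0.0, 1.0) for row in X]

curve = learning_curve(X, y, sizes=[40, 100, 300])
for m, r_train, r_test in curve:
    print(f"n={m:3d}  train r={r_train:.2f}  test r={r_test:.2f}")
```

With many features and few examples, the training correlation sits well above the test correlation; the gap narrowing as the data set grows is the signature the caption describes for panel (a).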

    Correlation coefficient of Huang and Tian’s model for different samples.

    Forty data points were randomly sampled from a meta data set and the model described by Huang and Tian [26] was fitted again 10,000 times. The meta data set consists of two-state proteins from 30 to 200 residues combined from [26] and data set 113, without duplicates. The histogram of the obtained correlation coefficients was then plotted. The correlation coefficient ranges from 0.5 to 0.8 approximately, which shows that robust estimation of the correlation cannot be achieved with 40 proteins.
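
The resampling idea can be sketched as follows. This is not Huang and Tian's model: as a stand-in, it computes a plain Pearson correlation on a hypothetical synthetic population of 200 pairs with a true correlation near 0.7, and it uses 2,000 repeats rather than 10,000. The point is only to show how widely the estimate scatters when each sample holds 40 points.

```python
import random
from statistics import mean, stdev

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

rng = random.Random(0)

# Hypothetical population (an assumption of this sketch, not the paper's data).
RHO = 0.7
population = []
for _ in range(200):
    x = rng.gauss(0.0, 1.0)
    noise = rng.gauss(0.0, 1.0)
    population.append((x, RHO * x + (1.0 - RHO ** 2) ** 0.5 * noise))

def resampled_correlations(pairs, sample_size=40, repeats=2000):
    """Correlation coefficient of each random subsample drawn without replacement."""
    values = []
    for _ in range(repeats):
        sub = rng.sample(pairs, sample_size)
        values.append(pearson([p[0] for p in sub], [p[1] for p in sub]))
    return values

rs = resampled_correlations(population)
print(f"min={min(rs):.2f}  max={max(rs):.2f}  sd={stdev(rs):.2f}")
```

A histogram of `rs` would play the role of the figure; the width of that distribution, not any single value, is what indicates whether 40 points are enough.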

    Learning curves of the contact order models.

    (a) Relative contact order model with fixed parameters d and ΔL (atoms are in contact in the three-dimensional protein structure if they are closer than d = 6 Å and belong to residues separated along the chain by ΔL ≥ 1). (b) Absolute contact order model with fixed parameters d and ΔL. Relative (c) and absolute (d) contact order models with varying parameters d and ΔL. For the relative contact order model we restrict the data set to two-state proteins having fewer than 150 residues.
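
For orientation, a minimal sketch of how the two quantities are commonly computed: absolute contact order as the mean sequence separation over all contacting atom pairs, and relative contact order as that value divided by chain length. The caption's models may differ in detail; the function name, the input layout, and the toy coordinates are assumptions of this sketch.

```python
import math
from itertools import combinations

def contact_orders(atoms, d=6.0, min_sep=1):
    """Return (relative, absolute) contact order.

    atoms: list of (residue_index, (x, y, z)) tuples (hypothetical layout).
    Two atoms are in contact if they are closer than d (in angstroms) and their
    residues are at least min_sep apart along the chain (the caption's d and dL).
    """
    n_residues = len({res for res, _ in atoms})
    separations = [
        abs(ri - rj)
        for (ri, pi), (rj, pj) in combinations(atoms, 2)
        if abs(ri - rj) >= min_sep and math.dist(pi, pj) < d
    ]
    if not separations:
        return 0.0, 0.0
    absolute = sum(separations) / len(separations)
    return absolute / n_residues, absolute

# Toy example: one atom per residue, coordinates invented for illustration.
toy_atoms = [
    (0, (0.0, 0.0, 0.0)),
    (1, (4.0, 0.0, 0.0)),
    (2, (8.0, 0.0, 0.0)),
    (3, (4.0, 3.0, 0.0)),
]
rco, aco = contact_orders(toy_atoms)
print(f"relative CO = {rco:.2f}, absolute CO = {aco:.2f}")
```

Varying `d` and `min_sep` over a grid, as in panels (c) and (d), adds two fitted parameters, which is exactly the situation where a learning curve helps to detect overfitting.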

    Cross-validation results for two independent Gaussian samples.

    In this toy model, we try to predict a variable from an uncorrelated predictor. The predictive power is null, but the model can be overtrained and give the illusion that the variables are correlated. We repeatedly performed 5-fold cross-validation 1,000,000 times on the same data set (n = 100). The plot shows the distribution of the obtained coefficient of correlation. The highest value is 0.202, and the lowest is -0.472 (associated p-values without multiple-hypothesis correction equal to 0.044 and 7·10^−7, respectively).
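
A small sketch of this toy experiment, assuming a one-variable linear regression as the predictor and pooled out-of-fold predictions as the basis for the correlation (the caption does not specify either). It uses 2,000 random 5-fold partitions instead of 1,000,000, so the extremes will be less pronounced than the values quoted above.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx

def cv_correlation(x, y, k, rng):
    """Correlation between pooled out-of-fold predictions and observed y."""
    idx = list(range(len(x)))
    rng.shuffle(idx)  # a fresh random partition on every call
    pred = [0.0] * len(x)
    for fold in range(k):
        test = idx[fold::k]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in test:
            pred[i] = intercept + slope * x[i]
    return pearson(pred, y)

rng = random.Random(0)
n = 100
x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # predictor
y = [rng.gauss(0.0, 1.0) for _ in range(n)]  # independent target: nothing to learn

cv_rs = [cv_correlation(x, y, k=5, rng=rng) for _ in range(2000)]
print(f"min={min(cv_rs):.3f}  max={max(cv_rs):.3f}  mean={mean(cv_rs):.3f}")
```

Reporting only the best partition out of many is a form of selection on the test result, which is how a predictor with no real power can appear to correlate with the target.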

    Self-consistency test reveals systematic bias in programs for prediction change of stability upon mutation

    Motivation: Computational prediction of the effect of mutations on protein stability is used by researchers in many fields. The utility of the prediction methods is affected by their accuracy and bias. Bias, a systematic shift of the predicted change of stability, has been noted as an issue for several methods, but has not been investigated systematically. Presence of the bias may lead to misleading results, especially when exploring the effects of combinations of different mutations. Results: Here we use a protocol to measure the bias as a function of the number of introduced mutations. It is based on a self-consistency test of the reciprocity of the effect of a mutation. An advantage of this approach is that it relies solely on crystal structures without experimentally measured stability values. We applied the protocol to four popular algorithms predicting change of protein stability upon mutation, FoldX, Eris, Rosetta and I-Mutant, and found an inherent bias. For one program, FoldX, we managed to substantially reduce the bias using additional relaxation by Modeller. Authors using algorithms for predicting effects of mutations should be aware of the bias described here. Availability and implementation: All calculations were implemented with in-house Perl scripts. Supplementary information: Supplementary data are available at Bioinformatics online.
    This work was supported by the HHMI International Early Career Scientist Program [55007424], the MINECO [BFU2015-68723-P], Spanish Ministry of Economy and Competitiveness Centro de Excelencia Severo Ochoa 2013-2017 [grant SEV-2012-0208], Secretaria d'Universitats i Recerca del Departament d'Economia i Coneixement de la Generalitat’s AGAUR [program 2014 SGR 0974], the European Research Council under the European Union's Seventh Framework Programme [FP7/2007-2013, ERC grant agreement 335980_EinME] and Russian Scientific Foundation (RSF #14-24-00157, the part about I-Mutant calculations). The work was started at the School of Molecular and Theoretical Biology supported by the Dynasty Foundation.
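
One way to read the reciprocity test, sketched in Python rather than the authors' Perl: if a predictor is self-consistent, the predicted stability change for A→B (computed on the structure of A) and for B→A (computed on the structure of B) should cancel, so half their sum estimates the bias. The record layout, the function name, and the numbers below are hypothetical, and the paper's actual protocol may differ.

```python
from collections import defaultdict
from statistics import mean

def bias_by_mutation_count(records):
    """Mean bias per number of introduced mutations.

    records: iterable of (n_mutations, ddg_forward, ddg_reverse) tuples, where
    ddg_forward is the predicted change for A->B and ddg_reverse for B->A.
    Under perfect reciprocity ddg_forward + ddg_reverse == 0.
    """
    grouped = defaultdict(list)
    for n_mut, ddg_forward, ddg_reverse in records:
        grouped[n_mut].append((ddg_forward + ddg_reverse) / 2.0)
    return {n: mean(values) for n, values in sorted(grouped.items())}

# Invented predictions (kcal/mol), purely for illustration.
records = [
    (1, 1.0, -0.4),
    (1, 0.8, -0.2),
    (2, 2.0, -0.6),
]
bias = bias_by_mutation_count(records)
for n_mut, value in bias.items():
    print(f"{n_mut} mutation(s): mean bias = {value:.2f}")
```

Because the estimate needs only paired predictions in both directions, it requires two structures per pair and no measured stability values, which matches the advantage stated in the abstract.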