Evolutionary rate determinants and functional optimization of proteins
A fundamental phenomenon in molecular evolution is the accumulation of mutations in proteins at an approximately constant rate, known as the molecular clock. Remarkably, although this rate remains constant across lineages, it varies by several orders of magnitude across different proteins. The nature of the molecular clock and its variability across proteins is a foundational question in molecular evolution. In addition, understanding the essence of evolutionary constraints provides insights into the principles of biological systems optimization. The primary determinants of the molecular clock have been actively investigated and debated for several decades. It has been established that the strongest predictor of the rate of protein evolution is protein expression. However, the underlying basis for the widely observed anti-correlation between Expression and evolutionary Rate (ER) remains poorly understood.
The main goal of this study is to unravel the basic mechanisms of the molecular clock variability across proteins and in particular the nature of ER anti-correlation. We begin by investigating the molecular basis of the ER phenomenon. In this regard, we first addressed the validity of the misfolding avoidance hypothesis, which has dominated related evolutionary discussions for more than a decade. We analyzed multiple recent genome-wide datasets describing protein stability and aggregation propensities – properties predicted to constrain the evolution of highly expressed proteins to avoid toxicity caused by misfolding. We rigorously tested the predictions of the hypothesis, and our results suggest that misfolding avoidance is unlikely to play any substantial role in explaining the variability of the molecular clock across proteins. Thus, other mechanisms should be explored.
We focused on the functional hypothesis, which proposes that variability in evolutionary constraints is due to different levels of functional optimization across proteins. We collected data on catalytic efficiency across multiple enzymes in several species to serve as a proxy for protein functional optimality. Notably, we demonstrated that the optimization of protein molecular function substantially constrains the rate of protein evolution. Moreover, up to half of the correlation between protein expression and evolutionary rate can be explained by the level of protein functional efficiency. These findings support the functional theory of protein evolution.
We further investigated the cellular mechanisms behind the ER correlation. To this end, we analyzed how protein expression levels in different tissues of multicellular species or in strains of unicellular species living in different environmental conditions jointly affect protein optimization and evolution. Using tissue- and condition-specific expression data from various animal, plant, and bacterial species, we demonstrated that the protein clock rate and the degree of protein functional optimality are primarily affected by expression in several distinct cell types. Furthermore, the strength of the association between protein expression and evolutionary rate is correlated with the upregulation of specific cellular processes, namely functions related to synaptic transmission in animals and active cellular growth in plants and bacteria. We hypothesize that these cellular properties result in particularly high cost of protein expression, leading to a more pronounced optimization of highly abundant proteins and consequently slowing down the molecular clock.
Overall, the study reveals how various constraints from the molecular, cellular, and species levels of biological organization jointly affect protein evolution and the level of protein optimization and adaptation.
A Model of Substitution Trajectories in Sequence Space and Long-Term Protein Evolution
The nature of factors governing the tempo and mode of protein evolution is a fundamental issue in evolutionary biology. Specifically, whether or not interactions between different sites, or epistasis, are important in directing the course of evolution has become one of the central questions. Several recent reports have scrutinized patterns of long-term protein evolution, claiming them to be compatible only with an epistatic fitness landscape. However, these claims have not yet been substantiated with a formal model of protein evolution. Here, we formulate a simple covarion-like model of protein evolution focusing on the rate at which the fitness impact of amino acids at a site changes with time. We then apply the model to the data on convergent and divergent protein evolution to test whether or not the incorporation of epistatic interactions is necessary to explain the data. We find that convergent evolution cannot be explained without the incorporation of epistasis and that the rate at which an amino acid state switches from being acceptable at a site to being deleterious is faster than the rate of amino acid substitution. Specifically, for proteins that have persisted in modern prokaryotic organisms since the last universal common ancestor, for every amino acid substitution approximately ten amino acid states switch from being accessible to being deleterious, or vice versa. Thus, molecular evolution can only be perceived in the context of rapid turnover of which amino acids are available for evolution.
Machine Learning: How Much Does It Tell about Protein Folding Rates?
The prediction of protein folding rates is a necessary step towards understanding the principles of protein folding. Due to the increasing amount of experimental data, numerous protein folding models and predictors of protein folding rates have been developed in the last decade. The problem has also attracted the attention of scientists from computational fields, which led to the publication of several machine learning-based models to predict the rate of protein folding. Some of them claim to predict the logarithm of protein folding rate with an accuracy greater than 90%. However, there are reasons to believe that such claims are exaggerated due to large fluctuations and overfitting of the estimates. When we confronted three selected published models with new data, we found a much lower predictive power than reported in the original publications. Overly optimistic predictive powers arise from violations of the basic principles of machine learning. We highlight common misconceptions in the studies claiming excessive predictive power and propose to use learning curves as a safeguard against those mistakes. As an example, we show that the current amount of experimental data is insufficient to build a linear predictor of logarithms of folding rates based on protein amino acid composition.
Learning curves of the linear regression model.
The mean (n = 1000) correlation coefficient of the training and test sets between the predicted and observed log folding rates (blue and red lines, respectively) is plotted as a function of the dataset size, together with the standard deviations of both sets (blue and red regions, respectively). Sixty percent of the examples are assigned to the training set and 40% to the test set. (a) Log folding rates were fitted with 20 features corresponding to the absolute amino acid frequency of each protein. A clear overfit can be seen as a gap between the two correlation lines. (b) Log folding rates were fitted using a single feature corresponding to the amino acid length of each protein to the power of 2/3, ln(k_f) ~ -L^(2/3) [13]. There exists a nearly perfect correspondence between training and test sets, and a slightly higher correlation on the test set than in Fig 4A.
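The learning-curve procedure in this caption can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: the Gaussian features, the weak signal on the first feature, the tiny ridge term, the dataset sizes, and the helper names (`pearson`, `fit_linear`, `predict`, `learning_curve`) are all assumptions made for the example. Only the 60/40 split and the use of 20 features follow the caption.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def fit_linear(X, y, ridge=1e-8):
    """Least squares with intercept, solved by Gauss-Jordan elimination.

    The small ridge term is an assumption that keeps the system solvable.
    """
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented normal equations [A^T A + ridge*I | A^T y].
    A = [[sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
          for j in range(p)] + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(p)]
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def predict(w, X):
    return [w[0] + sum(wi * xi for wi, xi in zip(w[1:], r)) for r in X]


def learning_curve(sizes, n_features=20, n_repeats=30, train_frac=0.6, seed=0):
    """Mean train/test correlation for each dataset size (synthetic data)."""
    rng = random.Random(seed)
    curve = []
    for n in sizes:
        r_train, r_test = [], []
        for _ in range(n_repeats):
            X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n)]
            # Hypothetical target: weak signal in one feature plus noise.
            y = [0.5 * row[0] + rng.gauss(0, 1) for row in X]
            k = int(train_frac * n)
            w = fit_linear(X[:k], y[:k])
            r_train.append(pearson(predict(w, X[:k]), y[:k]))
            r_test.append(pearson(predict(w, X[k:]), y[k:]))
        curve.append((n, sum(r_train) / n_repeats, sum(r_test) / n_repeats))
    return curve


curve = learning_curve([40, 80, 160])
for n, tr, te in curve:
    print(f"n={n:4d}  train r={tr:.2f}  test r={te:.2f}  gap={tr - te:.2f}")
```

A persistent gap between the training and test correlations signals overfitting; in this sketch the gap narrows as the dataset grows, which is the behaviour a learning curve is meant to expose.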
Correlation coefficient of Huang and Tian’s model for different samples.
Forty data points were randomly sampled from a meta data set and the model described by Huang and Tian [26] was fitted again 10,000 times. The meta data set consists of two-state proteins from 30 to 200 residues combined from [26] and data set 113, without duplicates. The histogram of the obtained correlation coefficients was then plotted. The correlation coefficient ranges from 0.5 to 0.8 approximately, which shows that robust estimation of the correlation cannot be achieved with 40 proteins.
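The resampling experiment in this caption can be approximated with a short sketch. It is not Huang and Tian's model: a single synthetic predictor stands in for the fitted model, and the data set size (150), the underlying correlation (about 0.65), the number of draws, and the function names are assumptions chosen for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def sampled_correlations(x, y, sample_size=40, n_draws=2000, seed=0):
    """Correlation coefficient on repeated random subsamples of the data."""
    rng = random.Random(seed)
    indices = range(len(x))
    results = []
    for _ in range(n_draws):
        pick = rng.sample(indices, sample_size)  # without replacement
        results.append(pearson([x[i] for i in pick], [y[i] for i in pick]))
    return results


# Synthetic stand-in for the meta data set (hypothetical values).
data_rng = random.Random(1)
x = [data_rng.gauss(0, 1) for _ in range(150)]
y = [0.85 * xi + data_rng.gauss(0, 1) for xi in x]

rs = sampled_correlations(x, y)
print(f"full-data r = {pearson(x, y):.2f}")
print(f"subsample r ranges from {min(rs):.2f} to {max(rs):.2f}")
```

The width of the resulting range gives a direct impression of how unstable a correlation estimated from 40 points can be.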
Learning curves of the contact order models.
(a) Relative contact order model with fixed parameters d and ΔL (atoms are considered to be in contact in the three-dimensional protein structure if they are closer than d = 6 Å and belong to residues separated along the chain by ΔL ≥ 1). (b) Absolute contact order model with fixed parameters d and ΔL. Relative (c) and absolute (d) contact order models with varying parameters d and ΔL. For the relative contact order model we restrict the data set to two-state proteins having fewer than 150 residues.
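For readers unfamiliar with contact order, the following sketch computes both variants using the standard definitions: absolute contact order is the mean sequence separation over all contacts, and relative contact order is that value divided by the chain length. As a simplifying assumption, each residue is represented by a single coordinate (for example its C-alpha atom) rather than by all of its atoms as in the caption; the function name and the toy coordinates are hypothetical.

```python
import math


def contact_orders(coords, d=6.0, min_sep=1):
    """Return (relative, absolute) contact order for one chain.

    coords  : list of (x, y, z) tuples, one representative point per residue
              (a simplification; the caption defines contacts between atoms)
    d       : distance cutoff in angstroms
    min_sep : minimum separation along the chain (the caption's delta-L)
    """
    length = len(coords)
    total_separation = 0
    n_contacts = 0
    for i in range(length):
        for j in range(i + min_sep, length):
            if math.dist(coords[i], coords[j]) < d:
                total_separation += j - i
                n_contacts += 1
    if n_contacts == 0:
        return 0.0, 0.0
    absolute = total_separation / n_contacts
    return absolute / length, absolute


# Toy example: a fully extended chain with 3.8 angstrom spacing, so only
# sequence neighbours are within the 6 angstrom cutoff.
extended = [(3.8 * i, 0.0, 0.0) for i in range(10)]
rco, aco = contact_orders(extended)
print(f"relative CO = {rco:.2f}, absolute CO = {aco:.2f}")
```

Varying `d` and `min_sep`, as in panels (c) and (d), adds free parameters to the model, which is why learning curves are useful for checking whether the extra flexibility overfits.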
Cross-validation results for two independent Gaussian samples.
In this toy model, we try to predict a variable from an uncorrelated predictor. The predictive power is null, but the model can be overtrained and give the illusion that the variables are correlated. We repeatedly performed 5-fold cross validation 1,000,000 times on the same data set (n = 100). The plot shows the distribution of the obtained coefficient of correlation. The highest value is 0.202, and the lowest is -0.472 (associated p-values without multiple-hypothesis correction equal to 0.044 and 7·10^−7, respectively).
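The toy experiment can be reproduced in outline with the sketch below. It assumes a one-variable linear regression as the predictor and pools the held-out predictions before computing the correlation; the caption does not specify these details, and the repeat count is reduced from 1,000,000 to 2,000 to keep the run short. All function names are illustrative.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def cv_correlation(x, y, rng, k=5):
    """One round of k-fold CV; returns r between pooled predictions and y."""
    idx = list(range(len(x)))
    rng.shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    pred = [0.0] * len(x)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        mx = sum(x[i] for i in train) / len(train)
        my = sum(y[i] for i in train) / len(train)
        sxx = sum((x[i] - mx) ** 2 for i in train)
        sxy = sum((x[i] - mx) * (y[i] - my) for i in train)
        slope = sxy / sxx
        intercept = my - slope * mx
        for i in fold:
            pred[i] = intercept + slope * x[i]
    return pearson(pred, y)


rng = random.Random(0)
# Two independent Gaussian samples: there is nothing to learn here.
x = [rng.gauss(0, 1) for _ in range(100)]
y = [rng.gauss(0, 1) for _ in range(100)]

rs = [cv_correlation(x, y, rng) for _ in range(2000)]
print(f"min r = {min(rs):.3f}, max r = {max(rs):.3f}")
```

Because the data set is fixed and only the fold assignment changes, the spread of `rs` reflects nothing but the randomness of the split. Reporting the best value from many such runs would overstate the predictive power.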
Self-consistency test reveals systematic bias in programs for prediction change of stability upon mutation
Motivation: Computational prediction of the effect of mutations on protein stability is used by researchers in many fields. The utility of the prediction methods is affected by their accuracy and bias. Bias, a systematic shift of the predicted change of stability, has been noted as an issue for several methods, but has not been investigated systematically. The presence of bias may lead to misleading results, especially when exploring the effects of combinations of different mutations. Results: Here we use a protocol to measure the bias as a function of the number of introduced mutations. It is based on a self-consistency test of the reciprocity of the effect of a mutation. An advantage of this approach is that it relies solely on crystal structures without experimentally measured stability values. We applied the protocol to four popular algorithms predicting change of protein stability upon mutation, FoldX, Eris, Rosetta and I-Mutant, and found an inherent bias. For one program, FoldX, we managed to substantially reduce the bias using additional relaxation by Modeller. Authors using algorithms for predicting effects of mutations should be aware of the bias described here. Availability and implementation: All calculations were implemented using in-house Perl scripts. Supplementary information: Supplementary data are available at Bioinformatics online.
This work was supported by the HHMI International Early Career Scientist Program [55007424], the MINECO [BFU2015-68723-P], Spanish Ministry of Economy and Competitiveness Centro de Excelencia Severo Ochoa 2013-2017 [grant SEV-2012-0208], Secretaria d'Universitats i Recerca del Departament d'Economia i Coneixement de la Generalitat’s AGAUR [program 2014 SGR 0974], the European Research Council under the European Union's Seventh Framework Programme [FP7/2007-2013, ERC grant agreement 335980_EinME] and Russian Scientific Foundation (RSF #14-24-00157, the part about I-Mutant calculations).
The work was started at the School of Molecular and Theoretical Biology supported by the Dynasty Foundation.
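The reciprocity idea behind the self-consistency test in the abstract above can be illustrated with a small sketch. For an unbiased predictor, the stability change of a mutation A→B and of the reverse mutation B→A should cancel, so half of their sum estimates the systematic shift. The exact bias formula, the record layout, and the numbers below are assumptions made for illustration; they are not taken from the paper or from any of the named programs.

```python
from collections import defaultdict


def bias_by_mutation_count(records):
    """Mean reciprocity violation grouped by number of mutations.

    records: iterable of (n_mutations, ddg_forward, ddg_reverse), where
             ddg_forward is the predicted stability change for A -> B and
             ddg_reverse is the prediction for B -> A.
    Assumed bias estimate per pair: (ddg_forward + ddg_reverse) / 2,
    which is zero when the predictions are perfectly antisymmetric.
    """
    grouped = defaultdict(list)
    for n_mut, forward, reverse in records:
        grouped[n_mut].append((forward + reverse) / 2.0)
    return {n: sum(v) / len(v) for n, v in sorted(grouped.items())}


# Hypothetical predictions with a constant shift per prediction:
# +0.5 for single mutants and +1.0 for double mutants.
example = [
    (1, 1.0 + 0.5, -1.0 + 0.5),
    (1, 2.0 + 0.5, -2.0 + 0.5),
    (2, 3.0 + 1.0, -3.0 + 1.0),
]
bias = bias_by_mutation_count(example)
for n_mut, value in bias.items():
    print(f"{n_mut} mutation(s): estimated bias = {value:.2f}")
```

Because the forward and reverse predictions only need two structures that differ by the mutation, a check of this kind does not require experimentally measured stability values.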