36 research outputs found

    Machine learning-guided directed evolution for protein engineering

    Machine learning (ML)-guided directed evolution is a new paradigm for biological design that enables optimization of complex functions. ML methods use data to predict how sequence maps to function without requiring a detailed model of the underlying physics or biological pathways. To demonstrate ML-guided directed evolution, we introduce the steps required to build ML sequence-function models and use them to guide engineering, making recommendations at each stage. This review covers basic concepts relevant to using ML for protein engineering as well as the current literature and applications of this new engineering paradigm. ML methods accelerate directed evolution by learning from information contained in all measured variants and using that information to select sequences that are likely to be improved. We then provide two case studies that demonstrate the ML-guided directed evolution process. We also look to future opportunities where ML will enable discovery of new protein functions and uncover the relationship between protein sequence and function.
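
The loop described in this abstract (learn from all measured variants, then pick sequences predicted to be improved) can be sketched roughly as below. This is only an illustration, not the review's method: the additive per-position model stands in for a real sequence-function model, and the function names (`fit_additive_model`, `predict`, `select_next_round`) and toy sequences are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def fit_additive_model(measured: Dict[str, float]) -> Dict[Tuple[int, str], float]:
    """Toy stand-in for an ML model: for each (position, residue) pair,
    store the mean measured fitness of the variants that carry it."""
    sums: Dict[Tuple[int, str], float] = defaultdict(float)
    counts: Dict[Tuple[int, str], int] = defaultdict(int)
    for seq, fitness in measured.items():
        for i, aa in enumerate(seq):
            sums[(i, aa)] += fitness
            counts[(i, aa)] += 1
    return {key: sums[key] / counts[key] for key in sums}


def predict(model: Dict[Tuple[int, str], float], seq: str, default: float) -> float:
    """Average the per-position scores; unseen (position, residue) pairs
    fall back to `default` (here, the mean measured fitness)."""
    return sum(model.get((i, aa), default) for i, aa in enumerate(seq)) / len(seq)


def select_next_round(measured: Dict[str, float],
                      candidates: List[str],
                      batch_size: int) -> List[str]:
    """Rank unmeasured candidates by predicted fitness and return the top batch."""
    model = fit_additive_model(measured)
    default = sum(measured.values()) / len(measured)
    unmeasured = [c for c in candidates if c not in measured]
    ranked = sorted(unmeasured, key=lambda s: predict(model, s, default), reverse=True)
    return ranked[:batch_size]


# Hypothetical measurements from a first round of screening.
measured = {"AAA": 1.0, "GAA": 2.0, "AAG": 3.0}
next_batch = select_next_round(measured, ["GAG", "AGA", "AAA"], batch_size=1)
```

In a real campaign the selected batch would be measured, added to `measured`, and the model refit before the next round.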

    Computational modeling of protein mutant stability: analysis and optimization of statistical potentials and structural features reveal insights into prediction model development

    Background: Understanding and predicting changes in protein stability upon point mutation is of widespread importance in molecular biology. Several prediction models based on various algorithms have been developed in the past. Statistical potentials are among the most widely used algorithms for predicting changes in stability upon point mutations. Although these methods offer the flexibility and capability to develop an accurate and reliable prediction model, this can be achieved only through the right selection of structural factors and optimization of their parameters for the statistical potentials. In this work, we selected five atom classification systems and compared their efficiency for the development of amino acid atom potentials. Additionally, torsion angle potentials were optimized to include the orientation of amino acids so that altered backbone conformation in different secondary structural regions can be included in the prediction model. This study also elaborates on the importance of classifying mutations according to their solvent accessibility and secondary structure specificity. The prediction efficiency was calculated individually for mutations in different secondary structural regions and compared. Results: Results show that, in addition to using an advanced atom description, stepwise regression and selection of atoms are necessary to avoid redundancy in the atom distribution and improve the reliability of prediction model validation. Compared with other atom classification models, the Melo-Feytmans model shows better prediction efficiency, giving a high correlation of 0.85 between experimental and theoretical ΔΔG, with 84.06% of the 1538 mutations correctly predicted. The theoretical ΔΔG values for the mutations in partially buried β-strands generated by the structural training dataset from PISCES gave a correlation of 0.84 without Gaussian apodization of the torsion angle distribution. After Gaussian apodization, the correlation increased to 0.92 and the prediction accuracy increased from 80% to 88.89%. Conclusion: These findings were useful for optimizing the Melo-Feytmans atom classification system and implementing it to develop the statistical potentials. It was also significant that the prediction efficiency for mutations in the partially buried β-strands improves with Gaussian apodization of the torsion angle distribution. All these comparisons and optimization techniques demonstrate their advantages as well as their restrictions for the development of the prediction model. These findings will be helpful not only for protein stability prediction, but also for various structure solutions in future.

    MuD: an interactive web server for the prediction of non-neutral substitutions using protein structural data

    The discrimination between functionally neutral amino acid substitutions and non-neutral mutations, which affect protein function, is very important for our understanding of diseases. The rapidly growing amounts of experimental data enable the development of computational tools to facilitate the annotation of these substitutions. Here, we describe a Random Forests-based classifier, named Mutation Detector (MuD), that utilizes structural and sequence-derived features to assess the impact of a given substitution on protein function. In its automatic mode, MuD is comparable to alternative tools in performance. However, the uniqueness of MuD is that user-reported protein-specific structural and functional information can be added at run-time, thereby enhancing the prediction accuracy further. The MuD server, available at http://mud.tau.ac.il, assigns a reliability score to every prediction, thus offering a useful tool for the prioritization of substitutions in proteins with an available 3D structure.

    A three-state prediction of single point mutations on protein stability changes

    Background: A basic question in protein structural studies is to what extent mutations affect stability. This question may be addressed starting from sequence and/or from structure. In proteomics and genomics studies, prediction of the protein stability free energy change (ΔΔG) upon single point mutation may also help the annotation process. The experimental ΔΔG values are affected by uncertainty, as measured by standard deviations. Most of the ΔΔG values are nearly zero (about 32% of the ΔΔG data set ranges from −0.5 to 0.5 kcal/mol), and both the value and sign of ΔΔG may be either positive or negative for the same mutation, blurring the relationship between mutations and the expected ΔΔG value. To overcome this problem, we describe a new predictor that discriminates between three mutation classes: destabilizing mutations (ΔΔG < −1.0 kcal/mol), stabilizing mutations (ΔΔG > 1.0 kcal/mol) and neutral mutations (−1.0 ≤ ΔΔG ≤ 1.0 kcal/mol). Results: In this paper, a support vector machine starting from the protein sequence or structure discriminates between stabilizing, destabilizing and neutral mutations. We rank all the possible substitutions according to a three-state classification system and show that the overall accuracy of our predictor is as high as 56% when starting from sequence information and 61% when the protein structure is available, with mean correlation coefficients of 0.27 and 0.35, respectively. These values are about 20 percentage points higher than those of a random predictor. Conclusions: Our method improves the quality of the prediction of the free energy change due to single point protein mutations by adopting a hypothesis of thermodynamic reversibility of the existing experimental data. By this we both recast the thermodynamic symmetry of the problem and balance the distribution of the available experimental measurements of free energy changes. This eliminates possible overestimations of the previously described methods trained on an unbalanced data set comprising a number of destabilizing mutations higher than stabilizing ones.
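
The three-class thresholds quoted in this abstract, and the balancing effect of the reversibility hypothesis, can be illustrated with a short sketch. The thresholds and sign convention come from the abstract; reading reversibility as "the reverse mutation B→A carries −ΔΔG of A→B" is an assumption here, and the function names and toy data are hypothetical.

```python
from typing import List, Tuple

# (wild-type residue, position, mutant residue, ddG in kcal/mol)
Mutation = Tuple[str, int, str, float]


def classify_ddg(ddg: float, threshold: float = 1.0) -> str:
    """Three-state label using the abstract's convention:
    ddG < -1.0 destabilizing, ddG > 1.0 stabilizing, otherwise neutral."""
    if ddg < -threshold:
        return "destabilizing"
    if ddg > threshold:
        return "stabilizing"
    return "neutral"


def add_reverse_mutations(data: List[Mutation]) -> List[Mutation]:
    """Assumed form of the reversibility hypothesis: for every A->B with
    value ddG, also include B->A with value -ddG."""
    reverse = [(mut, pos, wt, -ddg) for wt, pos, mut, ddg in data]
    return data + reverse


# Hypothetical measurements, skewed toward destabilizing mutations.
toy = [("A", 10, "G", -2.1), ("L", 22, "P", -3.4), ("V", 5, "I", 0.3)]
augmented = add_reverse_mutations(toy)
labels = [classify_ddg(ddg) for _, _, _, ddg in augmented]
```

After augmentation the toy set holds as many stabilizing as destabilizing examples, which is the balancing effect the abstract refers to.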

    Predicting changes in protein thermostability brought about by single- or multi-site mutations

    Background: An important aspect of protein design is the ability to predict changes in protein thermostability arising from single- or multi-site mutations. Protein thermostability is reflected in the change in free energy (ΔΔG) of thermal denaturation. Results: We have developed predictive software, Prethermut, based on machine learning methods, to predict the effect of single- or multi-site mutations on protein thermostability. The input vector of Prethermut is based on known structural changes and empirical measurements of changes in potential energy due to protein mutations. Using a 10-fold cross validation test on the M-dataset, consisting of 3366 mutant proteins from ProTherm, the classification accuracy of random forests and the regression accuracy of random forest regression were slightly better than those of support vector machines and support vector regression, whereas the overall accuracy of classification and the Pearson correlation coefficient of regression were 79.2% and 0.72, respectively. Prethermut performs better on proteins containing multi-site mutations than on those with single mutations. Conclusions: The performance of Prethermut indicates that it is a useful tool for predicting changes in protein thermostability brought about by single- or multi-site mutations and will be valuable in the rational design of proteins.
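
For readers unfamiliar with the evaluation terms in this abstract, the sketch below shows one way to build k-fold cross-validation splits and to compute a Pearson correlation coefficient. It is a generic illustration, not Prethermut's code: the helper names are hypothetical, and the folds are assigned by simple interleaving, whereas in practice the samples would usually be shuffled first.

```python
import math
from typing import List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs; every sample appears
    in exactly one test fold."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold]
        splits.append((train, test))
    return splits


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equal-length sequences, for example
    predicted and experimental ddG values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


splits = k_fold_indices(25, k=10)
```

Each model is trained on the `train` indices of a split and scored on its `test` indices; the reported figure is then aggregated over the ten folds.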

    Predicting the phenotypic effects of non-synonymous single nucleotide polymorphisms based on support vector machines

    Background: Human genetic variations primarily result from single nucleotide polymorphisms (SNPs) that occur approximately every 1000 bases in the overall human population. The non-synonymous SNPs (nsSNPs) that lead to amino acid changes in the protein product may account for nearly half of the known genetic variations linked to inherited human diseases. One of the key problems of medical genetics today is to identify nsSNPs that underlie disease-related phenotypes in humans. As such, the development of computational tools that can identify such nsSNPs would enhance our understanding of genetic diseases and help predict the disease. Results: We propose a method, named Parepro (Predicting the amino acid replacement probability), to identify nsSNPs having either deleterious or neutral effects on the resulting protein function. Two independent datasets, HumVar and NewHumVar, taken from the PhD-SNP server, were used to train the model and test the robustness of Parepro. Using a 20-fold cross validation test on the HumVar dataset, Parepro achieved a Matthews correlation coefficient (MCC) of 50% and an overall accuracy (Q2) of 76%, both higher than those obtained by methods such as PolyPhen, SIFT, and HybridMeth. Further analysis on an additional dataset (NewHumVar) using Parepro yielded similar results. Conclusion: The performance of Parepro indicates that it is a powerful tool for predicting the effect of nsSNPs on protein function and would be useful for large-scale analysis of genomic nsSNP data.
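
The two scores reported in this abstract, overall accuracy (Q2) and the Matthews correlation coefficient, both follow from a binary confusion matrix. The sketch below uses the standard definitions; the function name and the example counts are invented for illustration and are not Parepro's results. MCC ranges from −1 to 1, so an MCC quoted as 50% corresponds to 0.50.

```python
import math
from typing import Tuple


def q2_and_mcc(tp: int, tn: int, fp: int, fn: int) -> Tuple[float, float]:
    """Overall two-class accuracy (Q2) and Matthews correlation coefficient
    from confusion-matrix counts."""
    total = tp + tn + fp + fn
    q2 = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return q2, mcc


# Hypothetical counts: 30 deleterious and 50 neutral variants classified
# correctly, 10 false positives and 10 false negatives.
q2, mcc = q2_and_mcc(tp=30, tn=50, fp=10, fn=10)
```

Unlike Q2, MCC stays informative when the two classes differ greatly in size, which is one reason both are commonly reported together.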

    Ontology-guided data preparation for discovering genotype-phenotype relationships

    Complexity of post-genomic data and multiplicity of mining strategies are two limits to Knowledge Discovery in Databases (KDD) in life sciences. Because they provide a semantic frame to data and because they benefit from the progress of semantic web technologies, bio-ontologies should be considered for playing a key role in the KDD process. In the frame of a case study relative to the search of genotype-phenotype relationships, we demonstrate the capability of bio-ontologies to guide data selection during the preparation step of the KDD process. We propose three scenarios to illustrate how domain knowledge can be taken into account in order to select or aggregate data to mine, and consequently how it can facilitate result interpretation at the end of the process.

    Improving the prediction of disease-related variants using protein three-dimensional structure

    Background: Single Nucleotide Polymorphisms (SNPs) are an important source of human genome variability. Non-synonymous SNPs occurring in coding regions result in single amino acid polymorphisms (SAPs) that may affect protein function and lead to pathology. Several methods attempt to estimate the impact of SAPs using different sources of information. Although sequence-based predictors have shown good performance, the quality of these predictions can be further improved by introducing new features derived from three-dimensional protein structures. Results: In this paper, we present a structure-based machine learning approach for predicting disease-related SAPs. We have trained a Support Vector Machine (SVM) on a set of 3,342 disease-related mutations and 1,644 neutral polymorphisms from 784 protein chains. We use SVM input features derived from the protein's sequence, structure, and function. After dataset balancing, the structure-based method (SVM-3D) reaches an overall accuracy of 85%, a correlation coefficient of 0.70, and an area under the receiver operating characteristic curve (AUC) of 0.92. When compared with a similar sequence-based predictor, SVM-3D results in an increase of the overall accuracy and AUC by 3%, and of the correlation coefficient by 0.06. The robustness of this improvement has been tested on different datasets, and in all cases SVM-3D performs better than previously developed methods, even when compared with PolyPhen2, which explicitly takes protein structure information as input. Conclusion: This work demonstrates that structural information can increase the accuracy of disease-related SAP identification. Our results also quantify the magnitude of improvement on a large dataset. This improvement is in agreement with previously observed results, where structure information enhanced the prediction of protein stability changes upon mutation. Although the limited structural information contained in the Protein Data Bank restricts the application and the performance of our structure-based method, we expect that SVM-3D will result in higher accuracy when more structural data become available. © 2011 Capriotti; licensee BioMed Central Ltd.
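
The AUC figure quoted in this abstract can be read as the probability that a randomly chosen disease-related variant receives a higher score than a randomly chosen neutral one. The sketch below computes it directly from that pairwise definition; the function name and the scores are hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
from typing import Sequence


def roc_auc(scores_pos: Sequence[float], scores_neg: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison: the fraction of
    (positive, negative) pairs ranked correctly, with ties counted as half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical classifier outputs for disease-related and neutral variants.
auc = roc_auc(scores_pos=[0.9, 0.4], scores_neg=[0.5, 0.1])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.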

    SDM—a server for predicting effects of mutations on protein stability and malfunction

    The sheer volume of non-synonymous single nucleotide polymorphisms that have been generated in recent years from projects such as the Human Genome Project, the HapMap Project and Genome-Wide Association Studies means that it is not possible to characterize all mutations experimentally on the gene products, i.e. elucidate the effects of mutations on protein structure and function. However, automatic methods that can predict the effects of mutations will allow a reduced set of mutations to be studied. Site Directed Mutator (SDM) is a statistical potential energy function that uses environment-specific amino-acid substitution frequencies within homologous protein families to calculate a stability score, which is analogous to the free energy difference between the wild-type and mutant protein. Here, we present a web server for SDM (http://www-cryst.bioc.cam.ac.uk/~sdm/sdm.php), which has received more than 10 000 submissions since going online in April 2008. To run SDM, users must upload a wild-type structure and specify the position and amino acid type of the mutation. The results returned include information about the local structural environment of the wild-type and mutant residues, a stability score prediction and a prediction of disease association. Additionally, the wild-type and mutant structures are displayed in a Jmol applet with the relevant residues highlighted.