1,185 research outputs found

    Machine learning-guided directed evolution for protein engineering

    Get PDF
    Machine learning (ML)-guided directed evolution is a new paradigm for biological design that enables optimization of complex functions. ML methods use data to predict how sequence maps to function without requiring a detailed model of the underlying physics or biological pathways. To demonstrate ML-guided directed evolution, we introduce the steps required to build ML sequence-function models and use them to guide engineering, making recommendations at each stage. This review covers basic concepts relevant to using ML for protein engineering as well as the current literature and applications of this new engineering paradigm. ML methods accelerate directed evolution by learning from information contained in all measured variants and using that information to select sequences that are likely to be improved. We then provide two case studies that demonstrate the ML-guided directed evolution process. We also look to future opportunities where ML will enable discovery of new protein functions and uncover the relationship between protein sequence and function.Comment: Made significant revisions to focus on aspects most relevant to applying machine learning to speed up directed evolutio

    Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening

    Full text link
    This work introduces a number of algebraic topology approaches, such as multicomponent persistent homology, multi-level persistent homology and electrostatic persistence for the representation, characterization, and description of small molecules and biomolecular complexes. Multicomponent persistent homology retains critical chemical and biological information during the topological simplification of biomolecular geometric complexity. Multi-level persistent homology enables a tailored topological description of inter- and/or intra-molecular interactions of interest. Electrostatic persistence incorporates partial charge information into topological invariants. These topological methods are paired with Wasserstein distance to characterize similarities between molecules and are further integrated with a variety of machine learning algorithms, including k-nearest neighbors, ensemble of trees, and deep convolutional neural networks, to manifest their descriptive and predictive powers for chemical and biological problems. Extensive numerical experiments involving more than 4,000 protein-ligand complexes from the PDBBind database and near 100,000 ligands and decoys in the DUD database are performed to test respectively the scoring power and the virtual screening power of the proposed topological approaches. It is demonstrated that the present approaches outperform the modern machine learning based methods in protein-ligand binding affinity predictions and ligand-decoy discrimination

    Integration of persistent Laplacian and pre-trained transformer for protein solubility changes upon mutation

    Full text link
    Protein mutations can significantly influence protein solubility, which results in altered protein functions and leads to various diseases. Despite of tremendous effort, machine learning prediction of protein solubility changes upon mutation remains a challenging task as indicated by the poor scores of normalized Correct Prediction Ratio (CPR). Part of the challenge stems from the fact that there is no three-dimensional (3D) structures for the wild-type and mutant proteins. This work integrates persistent Laplacians and pre-trained Transformer for the task. The Transformer, pretrained with hunderds of millions of protein sequences, embeds wild-type and mutant sequences, while persistent Laplacians track the topological invariant change and homotopic shape evolution induced by mutations in 3D protein structures, which are rendered from AlphaFold2. The resulting machine learning model was trained on an extensive data set labeled with three solubility types. Our model outperforms all existing predictive methods and improves the state-of-the-art up to 15%

    Machine Learning in Enzyme Engineering

    Get PDF
    Enzyme engineering plays a central role in developing efficient biocatalysts for biotechnology, biomedicine, and life sciences. Apart from classical rational design and directed evolution approaches, machine learning methods have been increasingly applied to find patterns in data that help predict protein structures, improve enzyme stability, solubility, and function, predict substrate specificity, and guide rational protein design. In this Perspective, we analyze the state of the art in databases and methods used for training and validating predictors in enzyme engineering. We discuss current limitations and challenges which the community is facing and recent advancements in experimental and theoretical methods that have the potential to address those challenges. We also present our view on possible future directions for developing the applications to the design of efficient biocatalysts

    Computational Design of Stable and Soluble Biocatalysts

    Get PDF
    Natural enzymes are delicate biomolecules possessing only marginal thermodynamic stability. Poorly stable, misfolded, and aggregated proteins lead to huge economic losses in the biotechnology and biopharmaceutical industries. Consequently, there is a need to design optimized protein sequences that maximize stability, solubility, and activity over a wide range of temperatures and pH values in buffers of different composition and in the presence of organic cosolvents. This has created great interest in using computational methods to enhance biocatalysts' robustness and solubility. Suitable methods include (i) energy calculations, (ii) machine learning, (iii) phylogenetic analyses, and (iv) combinations of these approaches. We have witnessed impressive progress in the design of stable enzymes over the last two decades, but predictions of protein solubility and expressibility are scarce. Stabilizing mutations can be predicted accurately using available force fields, and the number of sequences available for phylogenetic analyses is growing. In addition, complex computational workflows are being implemented in intuitive web tools, enhancing the quality of protein stability predictions. Conversely, solubility predictors are limited by the lack of robust and balanced experimental data, an inadequate understanding of fundamental principles of protein aggregation, and a dearth of structural information on folding intermediates. Here we summarize recent progress in the development of computational tools for predicting protein stability and solubility, critically assess their strengths and weaknesses, and identify apparent gaps in data and knowledge. We also present perspectives on the computational design of stable and soluble biocatalysts

    Machine Learning Small Molecule Properties in Drug Discovery

    Full text link
    Machine learning (ML) is a promising approach for predicting small molecule properties in drug discovery. Here, we provide a comprehensive overview of various ML methods introduced for this purpose in recent years. We review a wide range of properties, including binding affinities, solubility, and ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity). We discuss existing popular datasets and molecular descriptors and embeddings, such as chemical fingerprints and graph-based neural networks. We highlight also challenges of predicting and optimizing multiple properties during hit-to-lead and lead optimization stages of drug discovery and explore briefly possible multi-objective optimization techniques that can be used to balance diverse properties while optimizing lead candidates. Finally, techniques to provide an understanding of model predictions, especially for critical decision-making in drug discovery are assessed. Overall, this review provides insights into the landscape of ML models for small molecule property predictions in drug discovery. So far, there are multiple diverse approaches, but their performances are often comparable. Neural networks, while more flexible, do not always outperform simpler models. This shows that the availability of high-quality training data remains crucial for training accurate models and there is a need for standardized benchmarks, additional performance metrics, and best practices to enable richer comparisons between the different techniques and models that can shed a better light on the differences between the many techniques.Comment: 46 pages, 1 figur

    The interplay of descriptor-based computational analysis with pharmacophore modeling builds the basis for a novel classification scheme for feruloyl esterases

    Get PDF
    One of the most intriguing groups of enzymes, the feruloyl esterases (FAEs), is ubiquitous in both simple and complex organisms. FAEs have gained importance in biofuel, medicine and food industries due to their capability of acting on a large range of substrates for cleaving ester bonds and synthesizing high-added value molecules through esterification and transesterification reactions. During the past two decades extensive studies have been carried out on the production and partial characterization of FAEs from fungi, while much less is known about FAEs of bacterial or plant origin. Initial classification studies on FAEs were restricted on sequence similarity and substrate specificity on just four model substrates and considered only a handful of FAEs belonging to the fungal kingdom. This study centers on the descriptor-based classification and structural analysis of experimentally verified and putative FAEs; nevertheless, the framework presented here is applicable to every poorly characterized enzyme family. 365 FAE-related sequences of fungal, bacterial and plantae origin were collected and they were clustered using Self Organizing Maps followed by k-means clustering into distinct groups based on amino acid composition and physico-chemical composition descriptors derived from the respective amino acid sequence. A Support Vector Machine model was subsequently constructed for the classification of new FAEs into the pre-assigned clusters. The model successfully recognized 98.2% of the training sequences and all the sequences of the blind test. The underlying functionality of the 12 proposed FAE families was validated against a combination of prediction tools and published experimental data. Another important aspect of the present work involves the development of pharmacophore models for the new FAE families, for which sufficient information on known substrates existed. Knowing the pharmacophoric features of a small molecule that are essential for binding to the members of a certain family opens a window of opportunities for tailored applications of FAEs
    • …
    corecore