Search CORE

1,185 research outputs found

Machine learning-guided directed evolution for protein engineering

Author: Arnold Frances H.
Wu Zachary
Yang Kevin K.
Publication venue
Publication date: 19/04/2019
Field of study

Machine learning (ML)-guided directed evolution is a new paradigm for biological design that enables optimization of complex functions. ML methods use data to predict how sequence maps to function without requiring a detailed model of the underlying physics or biological pathways. To demonstrate ML-guided directed evolution, we introduce the steps required to build ML sequence-function models and use them to guide engineering, making recommendations at each stage. This review covers basic concepts relevant to using ML for protein engineering as well as the current literature and applications of this new engineering paradigm. ML methods accelerate directed evolution by learning from information contained in all measured variants and using that information to select sequences that are likely to be improved. We then provide two case studies that demonstrate the ML-guided directed evolution process. We also look to future opportunities where ML will enable discovery of new protein functions and uncover the relationship between protein sequence and function.Comment: Made significant revisions to focus on aspects most relevant to applying machine learning to speed up directed evolutio

arXiv.org e-Print Archive

Caltech Authors

Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening

Author: Cang Zixuan
Mu Lin
Wei Guowei
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 27/08/2017
Field of study

This work introduces a number of algebraic topology approaches, such as multicomponent persistent homology, multi-level persistent homology and electrostatic persistence for the representation, characterization, and description of small molecules and biomolecular complexes. Multicomponent persistent homology retains critical chemical and biological information during the topological simplification of biomolecular geometric complexity. Multi-level persistent homology enables a tailored topological description of inter- and/or intra-molecular interactions of interest. Electrostatic persistence incorporates partial charge information into topological invariants. These topological methods are paired with Wasserstein distance to characterize similarities between molecules and are further integrated with a variety of machine learning algorithms, including k-nearest neighbors, ensemble of trees, and deep convolutional neural networks, to manifest their descriptive and predictive powers for chemical and biological problems. Extensive numerical experiments involving more than 4,000 protein-ligand complexes from the PDBBind database and near 100,000 ligands and decoys in the DUD database are performed to test respectively the scoring power and the virtual screening power of the proposed topological approaches. It is demonstrated that the present approaches outperform the modern machine learning based methods in protein-ligand binding affinity predictions and ligand-decoy discrimination

arXiv.org e-Print Archive

Directory of Open Access Journals

FigShare

Integration of persistent Laplacian and pre-trained transformer for protein solubility changes upon mutation

Author: Chen Jiahui
Wee JunJie
Wei Guo-Wei
Xia Kelin
Publication venue
Publication date: 02/11/2023
Field of study

Protein mutations can significantly influence protein solubility, which results in altered protein functions and leads to various diseases. Despite of tremendous effort, machine learning prediction of protein solubility changes upon mutation remains a challenging task as indicated by the poor scores of normalized Correct Prediction Ratio (CPR). Part of the challenge stems from the fact that there is no three-dimensional (3D) structures for the wild-type and mutant proteins. This work integrates persistent Laplacians and pre-trained Transformer for the task. The Transformer, pretrained with hunderds of millions of protein sequences, embeds wild-type and mutant sequences, while persistent Laplacians track the topological invariant change and homotopic shape evolution induced by mutations in 3D protein structures, which are rendered from AlphaFold2. The resulting machine learning model was trained on an extensive data set labeled with three solubility types. Our model outperforms all existing predictive methods and improves the state-of-the-art up to 15%

arXiv.org e-Print Archive

Machine Learning in Enzyme Engineering

Author: Damborský Jiří
Mazurenko Stanislav
Prokop Zbyněk
Publication venue: 'American Chemical Society (ACS)'
Publication date: 01/01/2020
Field of study

Enzyme engineering plays a central role in developing efficient biocatalysts for biotechnology, biomedicine, and life sciences. Apart from classical rational design and directed evolution approaches, machine learning methods have been increasingly applied to find patterns in data that help predict protein structures, improve enzyme stability, solubility, and function, predict substrate specificity, and guide rational protein design. In this Perspective, we analyze the state of the art in databases and methods used for training and validating predictors in enzyme engineering. We discuss current limitations and challenges which the community is facing and recent advancements in experimental and theoretical methods that have the potential to address those challenges. We also present our view on possible future directions for developing the applications to the design of efficient biocatalysts

Univerzitní repozitář Masarykovy univerzity

Computational Design of Stable and Soluble Biocatalysts

Author: Bednář David
Damborský Jiří
Hon Jiří
Konegger Hannes
Musil Miloš
Publication venue: 'American Chemical Society (ACS)'
Publication date: 01/01/2019
Field of study

Natural enzymes are delicate biomolecules possessing only marginal thermodynamic stability. Poorly stable, misfolded, and aggregated proteins lead to huge economic losses in the biotechnology and biopharmaceutical industries. Consequently, there is a need to design optimized protein sequences that maximize stability, solubility, and activity over a wide range of temperatures and pH values in buffers of different composition and in the presence of organic cosolvents. This has created great interest in using computational methods to enhance biocatalysts' robustness and solubility. Suitable methods include (i) energy calculations, (ii) machine learning, (iii) phylogenetic analyses, and (iv) combinations of these approaches. We have witnessed impressive progress in the design of stable enzymes over the last two decades, but predictions of protein solubility and expressibility are scarce. Stabilizing mutations can be predicted accurately using available force fields, and the number of sequences available for phylogenetic analyses is growing. In addition, complex computational workflows are being implemented in intuitive web tools, enhancing the quality of protein stability predictions. Conversely, solubility predictors are limited by the lack of robust and balanced experimental data, an inadequate understanding of fundamental principles of protein aggregation, and a dearth of structural information on folding intermediates. Here we summarize recent progress in the development of computational tools for predicting protein stability and solubility, critically assess their strengths and weaknesses, and identify apparent gaps in data and knowledge. We also present perspectives on the computational design of stable and soluble biocatalysts

Univerzitní repozitář Masarykovy univerzity

Machine Learning Small Molecule Properties in Drug Discovery

Author: Arroniz Carlos
De Fabritiis Gianni
Majewski Maciej
Schapin Nikolai
Varela Alejandro
Publication venue
Publication date: 02/08/2023
Field of study

Machine learning (ML) is a promising approach for predicting small molecule properties in drug discovery. Here, we provide a comprehensive overview of various ML methods introduced for this purpose in recent years. We review a wide range of properties, including binding affinities, solubility, and ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity). We discuss existing popular datasets and molecular descriptors and embeddings, such as chemical fingerprints and graph-based neural networks. We highlight also challenges of predicting and optimizing multiple properties during hit-to-lead and lead optimization stages of drug discovery and explore briefly possible multi-objective optimization techniques that can be used to balance diverse properties while optimizing lead candidates. Finally, techniques to provide an understanding of model predictions, especially for critical decision-making in drug discovery are assessed. Overall, this review provides insights into the landscape of ML models for small molecule property predictions in drug discovery. So far, there are multiple diverse approaches, but their performances are often comparable. Neural networks, while more flexible, do not always outperform simpler models. This shows that the availability of high-quality training data remains crucial for training accurate models and there is a need for standardized benchmarks, additional performance metrics, and best practices to enable richer comparisons between the different techniques and models that can shed a better light on the differences between the many techniques.Comment: 46 pages, 1 figur

arXiv.org e-Print Archive

Computational Developability Assessment of Antibody Therapeutics

Author: Khetan Rahul
Publication venue
Publication date: 31/12/2023
Field of study

The University of Manchester - Institutional Repository

The interplay of descriptor-based computational analysis with pharmacophore modeling builds the basis for a novel classification scheme for feruloyl esterases

Author: Akin
Altschul
Andersen
Andreasen
Aurilia
Barnum
Bartolomé
Bendtsen
Benner
Benoit
Benoit
Bhasin
Bhasin
Blum
Cai
Cai
Castanares
Chang
Choi
Crepin
D.B.R.K. Gupta Udatha
Dodd
Donaghy
Donaghy
Dudoit
Dysvik
Ewing
Faulds
Ferguson
Fillingham
Finn
Garcia-Conesa
García-Conesa
Garrigues
Gasteiger
Gasteiger
Gianni Panagiotou
Giuliani
Goldstone
Hall
Han
Hatzakis
Henikoff
Hermoso
Hsu
Humberstone
Huson
Irene Kouskoumvekaki
Kaiser
Karchin
Keerthi
Kheder
Kikuzaki
Kim
Kohavi
Kohonen
Koseki
Koseki
Kroon
Kroon
Kumar
Lao
Larkin
Laszlo
Latha
Lee
Lesage-Meessen
Levasseur
Levasseur
Li
Lima
Lisbeth Olsson
MacKay
Marcotte
McAuley
Meinicke
Morris
Mukherjee
Nielsen
Noble
Nsereko
Oili
Ong
Platt
Prates
Pérez-Bercoff
Rashamuse
Record
Rost
Sancho
Sankararaman
Sankararaman
Schrödinger Suite 2009
Schubot
Slavin
Tarbouriech
Teodoro
Thompson
Tomoko
Topakas
Topakas
Topakas
Topakas
Topakas
Tsuchiyama
Tsuchiyama
Uestuen
Vafiadi
Vafiadi
Vafiadi
Vafiadi
Vafiadi
Vafiadi
Wang
Wang
Wang
Wilkinson
Publication venue
Publication date: 11/08/2010
Field of study

One of the most intriguing groups of enzymes, the feruloyl esterases (FAEs), is ubiquitous in both simple and complex organisms. FAEs have gained importance in biofuel, medicine and food industries due to their capability of acting on a large range of substrates for cleaving ester bonds and synthesizing high-added value molecules through esterification and transesterification reactions. During the past two decades extensive studies have been carried out on the production and partial characterization of FAEs from fungi, while much less is known about FAEs of bacterial or plant origin. Initial classification studies on FAEs were restricted on sequence similarity and substrate specificity on just four model substrates and considered only a handful of FAEs belonging to the fungal kingdom. This study centers on the descriptor-based classification and structural analysis of experimentally verified and putative FAEs; nevertheless, the framework presented here is applicable to every poorly characterized enzyme family. 365 FAE-related sequences of fungal, bacterial and plantae origin were collected and they were clustered using Self Organizing Maps followed by k-means clustering into distinct groups based on amino acid composition and physico-chemical composition descriptors derived from the respective amino acid sequence. A Support Vector Machine model was subsequently constructed for the classification of new FAEs into the pre-assigned clusters. The model successfully recognized 98.2% of the training sequences and all the sequences of the blind test. The underlying functionality of the 12 proposed FAE families was validated against a combination of prediction tools and published experimental data. Another important aspect of the present work involves the development of pharmacophore models for the new FAE families, for which sufficient information on known substrates existed. Knowing the pharmacophoric features of a small molecule that are essential for binding to the members of a certain family opens a window of opportunities for tailored applications of FAEs

Crossref

Chalmers Research

Nature Precedings

Online Research Database In Technology

Chalmers Publication Library

HKU Scholars Hub

Characterization of absorption enhancers for orally administered therapeutic peptides in tablet formulations - Applying statistical learning

Author: Welling Søren Havelund
Publication venue: Technical University of Denmark
Publication date: 01/01/2017
Field of study

Online Research Database In Technology