29 research outputs found

Machine learning-assisted directed protein evolution with combinatorial libraries

Author: Arnold Frances H.
Kan S. B. Jennifer
Lewis Russell D.
Wittmann Bruce J.
Wu Zachary
Publication venue: 'Proceedings of the National Academy of Sciences'
Publication date: 30/04/2019

To reduce experimental effort associated with directed protein evolution and to explore the sequence space encoded by mutating multiple positions simultaneously, we incorporate machine learning in the directed evolution workflow. Combinatorial sequence space can be quite expensive to sample experimentally, but machine learning models trained on tested variants provide a fast method for testing sequence space computationally. We validate this approach on a large published empirical fitness landscape for human GB1 binding protein, demonstrating that machine learning-guided directed evolution finds variants with higher fitness than those found by other directed evolution approaches. We then provide an example application in evolving an enzyme to produce each of the two possible product enantiomers (stereodivergence) of a new-to-nature carbene Si-H insertion reaction. The approach predicted libraries enriched in functional enzymes and fixed seven mutations in two rounds of evolution to identify variants for selective catalysis with 93% and 79% ee. By greatly increasing throughput with in silico modeling, machine learning enhances the quality and diversity of sequence solutions for a protein engineering problem.
Comment: Corrected best S-selective variant sequence in Figure 4. Corrected less R-selective variant sequences from Round II Input library in Table 2 and Supp Table 4. Corrections may also be found on PNAS version https://www.pnas.org/content/early/2019/12/26/192177011
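The workflow this abstract describes (train a model on tested variants, then rank the untested combinatorial space in silico) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the four-letter alphabet, the three positions, the `toy_assay` stand-in for a wet-lab measurement, and the simple additive model are hypothetical and are not the authors' models or data.

```python
import itertools
import random

ALPHABET = "ACDE"   # hypothetical toy alphabet; real libraries use 20 amino acids
N_POSITIONS = 3     # hypothetical number of simultaneously mutated positions


def toy_assay(variant):
    """Hypothetical stand-in for an experimental fitness measurement."""
    site_scores = {"A": 0.0, "C": 1.0, "D": 2.0, "E": 3.0}
    return sum(site_scores[aa] * (pos + 1) for pos, aa in enumerate(variant))


def fit_additive_model(measured):
    """Average the fitness of tested variants for each (position, residue) pair."""
    sums, counts = {}, {}
    for variant, fitness in measured.items():
        for pos, aa in enumerate(variant):
            sums[(pos, aa)] = sums.get((pos, aa), 0.0) + fitness
            counts[(pos, aa)] = counts.get((pos, aa), 0) + 1
    overall = sum(measured.values()) / len(measured)
    table = {key: sums[key] / counts[key] for key in sums}
    return table, overall


def predict(model, variant):
    """Score a variant; unseen (position, residue) pairs fall back to the overall mean."""
    table, overall = model
    return sum(table.get((pos, aa), overall) for pos, aa in enumerate(variant))


def propose_next_round(measured, library, batch_size):
    """Rank all untested variants by predicted fitness and return the top batch."""
    model = fit_additive_model(measured)
    untested = [v for v in library if v not in measured]
    untested.sort(key=lambda v: predict(model, v), reverse=True)
    return untested[:batch_size]


# Enumerate the full combinatorial library (4^3 = 64 variants in this toy case).
library = ["".join(p) for p in itertools.product(ALPHABET, repeat=N_POSITIONS)]

# "Measure" a small random sample, then ask the model which variants to test next.
rng = random.Random(0)
measured = {v: toy_assay(v) for v in rng.sample(library, 16)}
next_batch = propose_next_round(measured, library, batch_size=8)
```

The additive model is used only because it fits in a few lines; any regressor that maps an encoded sequence to a fitness value could take its place in `propose_next_round`.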

arXiv.org e-Print Archive

Caltech Authors

Machine learning-guided directed evolution for protein engineering

Author: Arnold Frances H.
Wu Zachary
Yang Kevin K.
Publication date: 19/04/2019

Machine learning (ML)-guided directed evolution is a new paradigm for biological design that enables optimization of complex functions. ML methods use data to predict how sequence maps to function without requiring a detailed model of the underlying physics or biological pathways. To demonstrate ML-guided directed evolution, we introduce the steps required to build ML sequence-function models and use them to guide engineering, making recommendations at each stage. This review covers basic concepts relevant to using ML for protein engineering as well as the current literature and applications of this new engineering paradigm. ML methods accelerate directed evolution by learning from information contained in all measured variants and using that information to select sequences that are likely to be improved. We then provide two case studies that demonstrate the ML-guided directed evolution process. We also look to future opportunities where ML will enable discovery of new protein functions and uncover the relationship between protein sequence and function.
Comment: Made significant revisions to focus on aspects most relevant to applying machine learning to speed up directed evolution

arXiv.org e-Print Archive

Caltech Authors

Machine learning-assisted directed protein evolution with combinatorial libraries

Author: Arnold Frances H.
Kan S. B. Jennifer
Lewis Russell D.
Wittmann Bruce J.
Wu Zachary
Publication venue: 'Proceedings of the National Academy of Sciences'
Publication date: 30/04/2019

To reduce experimental effort associated with directed protein evolution and to explore the sequence space encoded by mutating multiple positions simultaneously, we incorporate machine learning into the directed evolution workflow. Combinatorial sequence space can be quite expensive to sample experimentally, but machine-learning models trained on tested variants provide a fast method for testing sequence space computationally. We validated this approach on a large published empirical fitness landscape for human GB1 binding protein, demonstrating that machine learning-guided directed evolution finds variants with higher fitness than those found by other directed evolution approaches. We then provide an example application in evolving an enzyme to produce each of the two possible product enantiomers (i.e., stereodivergence) of a new-to-nature carbene Si–H insertion reaction. The approach predicted libraries enriched in functional enzymes and fixed seven mutations in two rounds of evolution to identify variants for selective catalysis with 93% and 79% ee (enantiomeric excess). By greatly increasing throughput with in silico modeling, machine learning enhances the quality and diversity of sequence solutions for a protein engineering problem.
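For readers unfamiliar with the "ee" figures quoted in this abstract, enantiomeric excess is the absolute difference between the two enantiomer amounts divided by their sum, expressed as a percentage. The short sketch below shows the arithmetic; the function name and the example ratios are illustrative assumptions chosen to reproduce the quoted percentages, not measurements from the paper.

```python
def enantiomeric_excess(major, minor):
    """Return enantiomeric excess in percent for two enantiomer amounts."""
    total = major + minor
    if total <= 0:
        raise ValueError("enantiomer amounts must sum to a positive value")
    return 100 * abs(major - minor) / total


# Illustrative ratios only: a 96.5:3.5 mixture corresponds to 93% ee,
# and an 89.5:10.5 mixture corresponds to 79% ee.
ee_first = enantiomeric_excess(96.5, 3.5)
ee_second = enantiomeric_excess(89.5, 10.5)
```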

Predicting a Protein's Stability under a Million Mutations

Author: Diaz Daniel J.
Klivans Adam R.
Krähenbühl Philipp
Ouyang-Zhang Jeffrey
Publication date: 30/10/2023

Stabilizing proteins is a foundational step in protein engineering. However, the evolutionary pressure of all extant proteins makes identifying the scarce number of mutations that will improve thermodynamic stability challenging. Deep learning has recently emerged as a powerful tool for identifying promising mutations. Existing approaches, however, are computationally expensive, as the number of model inferences scales with the number of mutations queried. Our main contribution is a simple, parallel decoding algorithm. Our Mutate Everything is capable of predicting the effect of all single and double mutations in one forward pass. It is even versatile enough to predict higher-order mutations with minimal computational overhead. We build Mutate Everything on top of ESM2 and AlphaFold, neither of which were trained to predict thermodynamic stability. We trained on the Mega-Scale cDNA proteolysis dataset and achieved state-of-the-art performance on single and higher-order mutations on S669, ProTherm, and ProteinGym datasets. Code is available at https://github.com/jozhang97/MutateEverything
Comment: NeurIPS 2023. Code available at https://github.com/jozhang97/MutateEverythin

arXiv.org e-Print Archive

Towards higher predictability in enzyme engineering: investigation of protein epistasis in dynamic ß-lactamases and Cal-A lipase

Author: Alejaldre Ripalda Lorea
Publication date: 01/12/2020

Enzyme engineering is a tool with great utility in the biotechnological industry. It allows enzymes to be tailored to a specific activity or reaction condition. In addition, it can help decipher the key elements that facilitated their modification. While enzyme engineering is extensively practised, it still entails several bottlenecks. Some of these bottlenecks are technical, such as the development of methodologies for creating targeted mutational libraries or performing high-throughput screening, and some are conceptual, such as deciphering the key relevant features in a target protein for a successful engineering project. Among these challenges, intragenic epistasis, or the non-additivity of the phenotypic effects of mutations, is a feature that greatly hinders predictability. Improving enzyme engineering needs a multidisciplinary approach that includes gaining a better understanding of structure-function-evolution relations. This thesis seeks to contribute to the advancement of enzyme engineering by investigating two model systems. First, dynamic variants of TEM-1 ß-lactamase were chosen to investigate the link between protein dynamics and evolution. TEM-1 ß-lactamase has been extensively characterized in the literature, which has translated into extensive knowledge of its reaction mechanism, structural features and evolution. The variants of TEM-1 ß-lactamase used as a model system in this thesis had been extensively characterized, showing increased dynamics at the timescale relevant to catalysis (µs to ms) but maintaining substrate recognition. In this thesis, in vitro evolution of these dynamic variants was done by iterative rounds of random mutagenesis and selection to allow an unbiased exploration of the fitness landscape. We demonstrate that the presence of these particular motions at the outset of evolution allowed access to known mutational pathways. In addition, known epistatic interactions were introduced in the dynamic variants. Their in silico and kinetic characterization revealed that the additional motions on the timescale of catalysis allowed access to conformations leading to enhanced function, as in native TEM-1. Overall, we demonstrate that the evolution of TEM-1 ß-lactamase toward new function is compatible with diverse motions at the µs to ms timescale. Whether this can be translated to other enzymes with biotechnological potential remains to be explored. Secondly, the industrially relevant Cal-A lipase was chosen to identify features that could facilitate its engineering. Cal-A lipase presents characteristics such as substrate versatility and high thermal stability and reactivity that make it attractive for modification of triglycerides or synthesis of relevant molecules in the food and pharmaceutical industries. Contrary to TEM-1, most in vitro evolution studies of Cal-A lipase have been done towards an industrially specified goal, with limited exploration of mutational space. As a result, features that define function in Cal-A lipase remain elusive. In this thesis, we report on focused mutagenesis of Cal-A lipase, confirming the existence of a key region for substrate recognition. This was done by combining a novel library creation methodology based on Golden-gate assembly with script-based structural visualization to identify and map the selected mutations into the 3D structure. The characterization and deconvolution of two of the fittest variants revealed the existence of epistasis in the evolution of Cal-A lipase towards new function. Overall, we demonstrate that mapping a variety of properties following mutagenesis targeted to specific regions can greatly improve knowledge of an enzyme, which can be applied to improve the efficiency of directed engineering.
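The abstract defines intragenic epistasis as the non-additivity of the phenotypic effects of mutations. A minimal sketch of that idea for two mutations, A and B, is given below. It assumes fitness values on a scale where effects are expected to add (for example, log-transformed activity); the function name and the numbers are hypothetical and are not taken from the thesis.

```python
def pairwise_epistasis(f_wt, f_a, f_b, f_ab):
    """Deviation of the double mutant from the sum of the single-mutant effects.

    Zero means the two mutations combine additively; a non-zero value
    indicates epistasis under the assumed additive scale.
    """
    effect_a = f_a - f_wt
    effect_b = f_b - f_wt
    effect_ab = f_ab - f_wt
    return effect_ab - (effect_a + effect_b)


# Hypothetical values: the first case is purely additive, the second is not.
additive_case = pairwise_epistasis(1.0, 2.0, 3.0, 4.0)
epistatic_case = pairwise_epistasis(0.0, 1.0, 1.0, 5.0)
```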

Dépôt Institutionnel Numérique

Data-Driven Protein Engineering

Author: Wu Zachary
Publication date: 01/01/2021

Directed evolution has enabled the adaptation of natural protein sequences for an endless variety of human applications. Given a starting point - a sequence with measurable activity - directed evolution is able to improve protein sequences by iteratively accumulating beneficial mutations. However, directed evolution requires investing large experimental effort, which continues to be the major bottleneck in efficient protein optimization. To this end, we describe a framework for incorporating machine learning in the directed evolution process to maximize the utility of generated experimental data in Chapter 2. In Chapter 3, we then show that this framework outperforms traditional directed evolution methods on an empirical fitness landscape. However, directed evolution is fundamentally limited by its need for a starting point, or a sequence with measurable activity. To tackle this issue, we test the ability of nascent deep learning techniques for generating short, functional amino acid sequences in Chapter 4. Encouraged by this success, we attempted to generate full length enzymatic sequences for desired substrates without success. However, we were able to apply this deep learning approach to model other aspects of enzymatic protein sequences in Chapter 5. Finally, the field of data-driven protein sequence generation is enjoying a recent surge in interest, and we provide an updated review of protein engineering with machine learning, focusing on recent work in deep generative modeling in Chapter 1.

Caltech Theses and Dissertations

Studies in bacterial phosphotriesterase evolution, dynamics, and engineering

Author: Campbell Eleanor Claire
Publication date: 01/01/2017

the author deposited 12/06/201

The Australian National University

Probabilistic Protein Engineering

Author: Yang Kevin Kaichuang
Publication date: 01/01/2019

Machine learning-guided protein engineering is a new paradigm that enables the optimization of complex protein functions. Machine-learning methods use data to predict protein function without requiring a detailed model of the underlying physics or biological pathways. They accelerate protein engineering by learning from information contained in all measured variants and using it to select variants that are likely to be improved. We begin with a review of the basics of machine learning with a focus on applications to protein engineering and protein sequence-function datasets (Chapter 1). We used the entire machine-learning guided engineering paradigm to engineer the algal-derived light-gated channel channelrhodopsin (ChR), which can be used to modulate neuronal activity with light. We build models that discover ChRs with strong plasma membrane localization in mammalian cells (Chapter 2) and unprecedented light sensitivity and photocurrents for optogenetic applications (Chapter 3). Machine learning-guided evolution requires a machine-learning model that learns the relationship between sequence and function. For machine-learning models to learn about protein sequences, protein sequences must be represented as vectors or matrices of numbers. How each protein sequence is represented determines what can be learned. We learn continuous vector encodings of sequences from patterns in unlabeled sequences (Chapter 4). Learned encodings are low-dimensional, do not require alignments, and may improve performance by transferring information in unlabeled sequences to specific prediction tasks. Alternately, we demonstrate an interpretable Gaussian process kernel tailored to biological sequences (Chapter 6). In addition to a model to predict function from sequence, engineering requires a method to use the model to choose sequences for the next round of evolution. Most machine-learning guided engineering strategies assume that selected sequences can be queried directly. 
However, in directed evolution it is common to design a library of sequences and then sample stochastic batches from that library. We propose a batched stochastic Bayesian optimization algorithm for iteratively designing and screening site-saturation mutagenesis libraries (Chapter 5).

Caltech Theses and Dissertations

Exploring protein fitness landscapes with new high-throughput technologies

Author: Zurek Paul Jannis
Publication venue: University of Cambridge
Publication date: 17/03/2021

The concept of a protein’s fitness landscape – an abstract space in which related sequences are close together and matched with their fitness – is a useful tool to visualize core principles of protein evolution. Acquiring a new function, for example the laboratory evolution of an enzyme to convert an industrially relevant substrate, can be understood as a stepwise climb through a fitness landscape, reaching higher fitness (or activity) with each step (or mutation). The valleys of such a space relate to the starting points of protein engineering campaigns. Understanding this area could enlighten principles of how proteins quickly adapt in nature and help to identify starting points with a high potential for evolution, a high ‘evolvability’, speeding up protein engineering. In this study, high-throughput technologies will be developed that enable the read-out of directed evolution on a large scale, tracking the exploration of the valley of a fitness landscape: the conversion of an amino acid- to amine dehydrogenase will be investigated as a model of enzyme evolvability with a drastic change of substrate specificity. A sensitive high-throughput screening assay as well as a comprehensive sequencing read-out will be required to establish the identity of selected variants during evolution. I will first generate and characterize three different but related starting points and test their initial evolvability. Stabilizing the starting point results in increased mutational robustness, broadening the range of accepted mutations. However, increased initial stability does not necessarily correlate to higher functional improvement, hinting at a nuanced view of evolvability. A sensitive high-throughput assay is necessary to verify the full potential of the starting points and study the early steps of evolution comprehensively. 
Broadly applicable ultrahigh-throughput assays of enzyme function, such as absorbance-activated droplet sorting, currently lack the sensitivity of more specific fluorescence-based or low-throughput counterparts. A universal approach to increase detectability in single cell-lysate microfluidic enzyme assays is established by amplifying the enzyme content per droplet more than 10-fold via homogeneous clonal cell growth. Clonal amplification enables the sensitive and precise detection of newly introduced amine dehydrogenase activities, a feat restricted in conventional assays by low initial activity and stability. To generate a truly complete view of directed evolution in a fitness landscape, however, an equally powerful sequencing read-out is necessary to identify all selected variants. Here, unique molecular identifiers are used to increase the accuracy of nanopore sequencing to levels that can reliably distinguish point mutations. I establish an inexpensive and straightforward long read amplicon sequencing workflow which is then applied to map the trajectories of two comparative long-term directed evolution campaigns. In the parallel evolution campaigns, initial beneficial mutations are exclusive to each starting point and lead to incompatible trajectories. Beneficial mutations are scarce and large improvements are unavailable until recombination occurs and a jump through the fitness landscape is realized. The recombined variant holds high evolvability and quickly evolves to take over the population and form the most successful lineages, indicating the power of recombination as a means to innovation in protein evolution. The tools established in this thesis can help protein engineers explore fitness landscapes more economically and comprehensively. 
Their application to mapping full trajectories of early adaptation uncovers differences in the evolvability of homologs, potentially aiding the identification of evolvable starting points as well as strategies to increase evolvability for efficient protein engineering in the future.

Apollo (Cambridge)