29 research outputs found
Machine learning-assisted directed protein evolution with combinatorial libraries
To reduce experimental effort associated with directed protein evolution and
to explore the sequence space encoded by mutating multiple positions
simultaneously, we incorporate machine learning in the directed evolution
workflow. Combinatorial sequence space can be quite expensive to sample
experimentally, but machine learning models trained on tested variants provide
a fast method for testing sequence space computationally. We validate this
approach on a large published empirical fitness landscape for human GB1 binding
protein, demonstrating that machine learning-guided directed evolution finds
variants with higher fitness than those found by other directed evolution
approaches. We then provide an example application in evolving an enzyme to
produce each of the two possible product enantiomers (stereodivergence) of a
new-to-nature carbene Si-H insertion reaction. The approach predicted libraries
enriched in functional enzymes and fixed seven mutations in two rounds of
evolution to identify variants for selective catalysis with 93% and 79% ee. By
greatly increasing throughput with in silico modeling, machine learning
enhances the quality and diversity of sequence solutions for a protein
engineering problem.
Comment: Corrected best S-selective variant sequence in Figure 4. Corrected
less R-selective variant sequences from Round II Input library in Table 2 and
Supp Table 4. Corrections may also be found on PNAS version
https://www.pnas.org/content/early/2019/12/26/192177011
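The workflow summarized in this abstract (train a model on tested variants, score the full combinatorial space in silico, then test the top predictions) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the reduced alphabet, the `toy_fitness` function standing in for a wet-lab assay, and the simple site-wise mean model are hypothetical and are not the models or data used in the paper.

```python
import itertools
import random
from collections import defaultdict

ALPHABET = "ACDE"   # hypothetical reduced alphabet to keep the example small
N_SITES = 4         # four positions mutated simultaneously


def toy_fitness(variant):
    """Hypothetical stand-in for a wet-lab assay (not real data)."""
    site_scores = {"A": 0.0, "C": 0.5, "D": 1.0, "E": 1.5}
    additive = sum(site_scores[aa] for aa in variant)
    # One pairwise interaction so the landscape is not purely additive.
    bonus = 2.0 if variant[0] == "E" and variant[3] == "D" else 0.0
    return additive + bonus


def fit_sitewise_model(measured):
    """Mean measured fitness for every (position, residue) seen in training."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for variant, fitness in measured.items():
        for pos, aa in enumerate(variant):
            sums[(pos, aa)] += fitness
            counts[(pos, aa)] += 1
    overall = sum(measured.values()) / len(measured)
    means = {key: sums[key] / counts[key] for key in sums}
    return means, overall


def predict(model, variant):
    """Sum the site-wise means; unseen pairs fall back to the overall mean."""
    means, overall = model
    return sum(means.get((pos, aa), overall) for pos, aa in enumerate(variant))


def propose_next_round(measured, library, batch_size):
    """Rank untested variants by predicted fitness and return the top batch."""
    model = fit_sitewise_model(measured)
    untested = [v for v in library if v not in measured]
    untested.sort(key=lambda v: predict(model, v), reverse=True)
    return untested[:batch_size]


# Full combinatorial library: 4 residues at 4 sites = 256 variants.
library = ["".join(p) for p in itertools.product(ALPHABET, repeat=N_SITES)]
rng = random.Random(0)
# "Tested variants": a random sample measured with the toy assay.
measured = {v: toy_fitness(v) for v in rng.sample(library, 40)}
next_batch = propose_next_round(measured, library, batch_size=10)
```

In a real campaign the proposed batch would be synthesized and assayed, the new measurements added to `measured`, and the loop repeated for another round.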
Machine learning-guided directed evolution for protein engineering
Machine learning (ML)-guided directed evolution is a new paradigm for
biological design that enables optimization of complex functions. ML methods
use data to predict how sequence maps to function without requiring a detailed
model of the underlying physics or biological pathways. To demonstrate
ML-guided directed evolution, we introduce the steps required to build ML
sequence-function models and use them to guide engineering, making
recommendations at each stage. This review covers basic concepts relevant to
using ML for protein engineering as well as the current literature and
applications of this new engineering paradigm. ML methods accelerate directed
evolution by learning from information contained in all measured variants and
using that information to select sequences that are likely to be improved. We
then provide two case studies that demonstrate the ML-guided directed evolution
process. We also look to future opportunities where ML will enable discovery of
new protein functions and uncover the relationship between protein sequence and
function.
Comment: Made significant revisions to focus on aspects most relevant to
applying machine learning to speed up directed evolution
Machine learning-assisted directed protein evolution with combinatorial libraries
To reduce experimental effort associated with directed protein evolution and to explore the sequence space encoded by mutating multiple positions simultaneously, we incorporate machine learning into the directed evolution workflow. Combinatorial sequence space can be quite expensive to sample experimentally, but machine-learning models trained on tested variants provide a fast method for testing sequence space computationally. We validated this approach on a large published empirical fitness landscape for human GB1 binding protein, demonstrating that machine learning-guided directed evolution finds variants with higher fitness than those found by other directed evolution approaches. We then provide an example application in evolving an enzyme to produce each of the two possible product enantiomers (i.e., stereodivergence) of a new-to-nature carbene Si-H insertion reaction. The approach predicted libraries enriched in functional enzymes and fixed seven mutations in two rounds of evolution to identify variants for selective catalysis with 93% and 79% ee (enantiomeric excess). By greatly increasing throughput with in silico modeling, machine learning enhances the quality and diversity of sequence solutions for a protein engineering problem.
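For readers unfamiliar with the metric, enantiomeric excess is conventionally defined from the amounts of the two enantiomers. The conversion to enantiomer ratios below is a worked illustration of that standard definition applied to the two reported ee values; the ratios themselves are not figures stated in the abstract.

```latex
\[
\mathrm{ee} = \frac{\bigl|[R] - [S]\bigr|}{[R] + [S]} \times 100\%,
\qquad
\text{major enantiomer fraction} = \frac{100\% + \mathrm{ee}}{2}
\]
\[
93\%\ \mathrm{ee} \;\Rightarrow\; 96.5 : 3.5,
\qquad
79\%\ \mathrm{ee} \;\Rightarrow\; 89.5 : 10.5
\]
```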
Predicting a Protein's Stability under a Million Mutations
Stabilizing proteins is a foundational step in protein engineering. However,
the evolutionary pressure of all extant proteins makes identifying the scarce
number of mutations that will improve thermodynamic stability challenging. Deep
learning has recently emerged as a powerful tool for identifying promising
mutations. Existing approaches, however, are computationally expensive, as the
number of model inferences scales with the number of mutations queried. Our
main contribution is a simple, parallel decoding algorithm. Our method, Mutate
Everything, can predict the effect of all single and double
mutations in one forward pass. It is even versatile enough to predict
higher-order mutations with minimal computational overhead. We build Mutate
Everything on top of ESM2 and AlphaFold, neither of which was trained to
predict thermodynamic stability. We trained on the Mega-Scale cDNA proteolysis
dataset and achieved state-of-the-art performance on single and higher-order
mutations on S669, ProTherm, and ProteinGym datasets. Code is available at
https://github.com/jozhang97/MutateEverything
Comment: NeurIPS 2023. Code available at
https://github.com/jozhang97/MutateEverythin
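To make the scaling argument concrete, the short sketch below counts how many single and double substitution mutants exist for a protein of a given length, which is the number of queries a one-inference-per-mutation approach would need. It is simple combinatorics, not the Mutate Everything algorithm; the length of 100 residues is a hypothetical example.

```python
from math import comb

N_SUBSTITUTIONS = 19  # each residue can change to 19 other standard amino acids


def count_single_mutants(length):
    """Number of distinct single-substitution variants."""
    return N_SUBSTITUTIONS * length


def count_double_mutants(length):
    """Number of distinct double-substitution variants (two different sites)."""
    return comb(length, 2) * N_SUBSTITUTIONS ** 2


length = 100  # hypothetical protein length
singles = count_single_mutants(length)   # 1900
doubles = count_double_mutants(length)   # 1786950
```

Even for this modest length the double mutants alone exceed a million, which is why per-mutation inference becomes expensive.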
Towards higher predictability in enzyme engineering: investigation of protein epistasis in dynamic β-lactamases and Cal-A lipase
Enzyme engineering is a tool of great utility in the biotechnology industry. It allows enzymes to be tailored to a specific activity or reaction condition. In addition, it can help decipher the key elements that facilitated their modification. While enzyme engineering is extensively practised, it still entails several bottlenecks.
Some of these bottlenecks are technical, such as the development of methodologies for creating targeted mutational libraries or performing high-throughput screening, and some are conceptual, such as deciphering the key features of a target protein that are relevant to a successful engineering project. Among these challenges, intragenic epistasis, or the non-additivity of the phenotypic effects of mutations, is a feature that greatly hinders predictability. Improving enzyme engineering requires a multidisciplinary approach that includes gaining a better understanding of structure-function-evolution relations.
This thesis seeks to contribute to the advancement of enzyme engineering by investigating two model systems. First, dynamic variants of TEM-1 β-lactamase were chosen to investigate the link between protein dynamics and evolution. TEM-1 β-lactamase has been extensively characterized in the literature, which has translated into extensive knowledge of its reaction mechanism, structural features and evolution. The variants of TEM-1 β-lactamase used as a model system in this thesis had been extensively characterized, showing increased dynamics at the timescale relevant to catalysis (µs to ms) while maintaining substrate recognition. In this thesis, in vitro evolution of these dynamic variants was performed by iterative rounds of random mutagenesis and selection to allow an unbiased exploration of the fitness landscape. We demonstrate that the presence of these particular motions at the outset of evolution allowed access to known mutational pathways. In addition, known epistatic interactions were introduced in the dynamic variants. Their in silico and kinetic characterization revealed that the additional motions on the timescale of catalysis allowed access to conformations leading to enhanced function, as in native TEM-1. Overall, we demonstrate that the evolution of TEM-1 β-lactamase toward new function is compatible with diverse motions at the µs to ms timescale. Whether this can be translated to other enzymes with biotechnological potential remains to be explored.
Secondly, the industrially relevant Cal-A lipase was chosen to identify features that could facilitate its engineering. Cal-A lipase presents characteristics such as substrate versatility and high thermal stability and reactivity that make it attractive for modification of triglycerides or synthesis of relevant molecules in the food and pharmaceutical industries. Contrary to TEM-1, most in vitro evolution studies of Cal-A lipase have been done towards an industrially-specified goal, with limited exploration of mutational space. As a result, features that define function in Cal-A lipase remain elusive. In this thesis, we report on focused mutagenesis of Cal-A lipase, confirming the existence of a key region for substrate recognition. This was done by combining a novel library creation methodology based on Golden-gate assembly with script-based structural visualization to identify and map the selected mutations into the 3D structure. The characterization and deconvolution of two of the fittest variants revealed the existence of epistasis in the evolution of Cal-A lipase towards new function. Overall, we demonstrate that mapping a variety of properties following mutagenesis targeted to specific regions can greatly improve knowledge of an enzyme, which can in turn be applied to improve the efficiency of directed engineering.
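The abstract defines intragenic epistasis as the non-additivity of the phenotypic effects of mutations. A minimal sketch of that definition for a pair of mutations is given below. It assumes the phenotype is expressed on an additive scale (for example log activity or free energy); the function name and the numbers are hypothetical and are not data from the thesis.

```python
def epistasis(f_wt, f_a, f_b, f_ab):
    """Deviation of the double mutant from the additive expectation.

    All inputs must be on an additive scale (for example log fitness or
    free energy). Zero means the two mutations combine additively.
    """
    expected_ab = f_wt + (f_a - f_wt) + (f_b - f_wt)
    return f_ab - expected_ab


# Hypothetical log-scale activities, not data from the thesis.
additive_case = epistasis(f_wt=0.0, f_a=1.0, f_b=0.5, f_ab=1.5)   # 0.0
synergy_case = epistasis(f_wt=0.0, f_a=1.0, f_b=0.5, f_ab=2.5)    # 1.0
```

A non-zero value means the effect of one mutation depends on whether the other is present, which is what makes the outcome of combining mutations hard to predict from single-mutant data.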
Data-Driven Protein Engineering
Directed evolution has enabled the adaptation of natural protein sequences for an endless variety of human applications. Given a starting point - a sequence with measurable activity - directed evolution is able to improve protein sequences by iteratively accumulating beneficial mutations. However, directed evolution requires investing large experimental effort, which continues to be the major bottleneck in efficient protein optimization. To this end, we describe a framework for incorporating machine learning in the directed evolution process to maximize the utility of generated experimental data in Chapter 2. In Chapter 3, we then show that this framework outperforms traditional directed evolution methods on an empirical fitness landscape. However, directed evolution is fundamentally limited by its need for a starting point, or a sequence with measurable activity. To tackle this issue, we test the ability of nascent deep learning techniques for generating short, functional amino acid sequences in Chapter 4. Encouraged by this success, we attempted to generate full length enzymatic sequences for desired substrates without success. However, we were able to apply this deep learning approach to model other aspects of enzymatic protein sequences in Chapter 5. Finally, the field of data-driven protein sequence generation is enjoying a recent surge in interest, and we provide an updated review of protein engineering with machine learning, focusing on recent work in deep generative modeling in Chapter 1.
Studies in bacterial phosphotriesterase evolution, dynamics, and engineering
the author deposited 12/06/201
Probabilistic Protein Engineering
Machine learning-guided protein engineering is a new paradigm that enables the optimization of complex protein functions. Machine-learning methods use data to predict protein function without requiring a detailed model of the underlying physics or biological pathways. They accelerate protein engineering by learning from information contained in all measured variants and using it to select variants that are likely to be improved. We begin with a review of the basics of machine learning with a focus on applications to protein engineering and protein sequence-function datasets (Chapter 1). We used the entire machine-learning guided engineering paradigm to engineer the algal-derived light-gated channel channelrhodopsin (ChR), which can be used to modulate neuronal activity with light. We build models that discover ChRs with strong plasma membrane localization in mammalian cells (Chapter 2) and unprecedented light sensitivity and photocurrents for optogenetic applications (Chapter 3). Machine learning-guided evolution requires a machine-learning model that learns the relationship between sequence and function. For machine-learning models to learn about protein sequences, protein sequences must be represented as vectors or matrices of numbers. How each protein sequence is represented determines what can be learned. We learn continuous vector encodings of sequences from patterns in unlabeled sequences (Chapter 4). Learned encodings are low-dimensional, do not require alignments, and may improve performance by transferring information in unlabeled sequences to specific prediction tasks. Alternately, we demonstrate an interpretable Gaussian process kernel tailored to biological sequences (Chapter 6). In addition to a model to predict function from sequence, engineering requires a method to use the model to choose sequences for the next round of evolution. Most machine-learning guided engineering strategies assume that selected sequences can be queried directly. 
However, in directed evolution it is common to design a library of sequences and then sample stochastic batches from that library. We propose a batched stochastic Bayesian optimization algorithm for iteratively designing and screening site-saturation mutagenesis libraries (Chapter 5).
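The abstract notes that protein sequences must be represented as vectors or matrices of numbers before a model can learn from them. The simplest such representation is one-hot encoding, sketched below as a baseline illustration. This is an assumption chosen for clarity: it is not the learned continuous encoding or the Gaussian process kernel developed in the thesis, and the three-residue sequence is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Return a flat list of length 20 * len(sequence) with one 1 per residue."""
    vector = [0] * (len(AMINO_ACIDS) * len(sequence))
    for pos, aa in enumerate(sequence):
        vector[pos * len(AMINO_ACIDS) + INDEX[aa]] = 1
    return vector


encoded = one_hot("MKV")  # hypothetical three-residue sequence
```

One-hot vectors grow with sequence length and treat all residues as equally dissimilar, which is one motivation for the low-dimensional learned encodings the abstract describes.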
Exploring protein fitness landscapes with new high-throughput technologies
The concept of a protein's fitness landscape (an abstract space in which related sequences are close together and matched with their fitness) is a useful tool to visualize core principles of protein evolution. Acquiring a new function, for example the laboratory evolution of an enzyme to convert an industrially relevant substrate, can be understood as a stepwise climb through a fitness landscape, reaching higher fitness (or activity) with each step (or mutation). The valleys of such a space relate to the starting points of protein engineering campaigns. Understanding this area could enlighten principles of how proteins quickly adapt in nature and help to identify starting points with a high potential for evolution, a high "evolvability", speeding up protein engineering. In this study, high-throughput technologies will be developed that enable the read-out of directed evolution on a large scale, tracking the exploration of the valley of a fitness landscape: the conversion of an amino acid- to amine dehydrogenase will be investigated as a model of enzyme evolvability with a drastic change of substrate specificity. A sensitive high-throughput screening assay as well as a comprehensive sequencing read-out will be required to establish the identity of selected variants during evolution. I will first generate and characterize three different but related starting points and test their initial evolvability. Stabilizing the starting point results in increased mutational robustness, broadening the range of accepted mutations. However, increased initial stability does not necessarily correlate to higher functional improvement, hinting at a nuanced view of evolvability. A sensitive high-throughput assay is necessary to verify the full potential of the starting points and study the early steps of evolution comprehensively.
Broadly applicable ultrahigh-throughput assays of enzyme function, such as absorbance-activated droplet sorting, currently lack the sensitivity of more specific fluorescence-based or low-throughput counterparts. A universal approach to increase detectability in single cell-lysate microfluidic enzyme assays is established by amplifying the enzyme content per droplet more than 10-fold via homogeneous clonal cell growth. Clonal amplification enables the sensitive and precise detection of newly introduced amine dehydrogenase activities, a feat restricted in conventional assays by low initial activity and stability. To generate a truly complete view of directed evolution in a fitness landscape, however, an equally powerful sequencing read-out is necessary to identify all selected variants. Here, unique molecular identifiers are used to increase the accuracy of nanopore sequencing to levels that can reliably distinguish point mutations. I establish an inexpensive and straightforward long read amplicon sequencing workflow which is then applied to map the trajectories of two comparative long-term directed evolution campaigns. In the parallel evolution campaigns, initial beneficial mutations are exclusive to each starting point and lead to incompatible trajectories. Beneficial mutations are scarce and large improvements are unavailable until recombination occurs and a jump through the fitness landscape is realized. The recombined variant holds high evolvability and quickly evolves to take over the population and form the most successful lineages, indicating the power of recombination as a means to innovation in protein evolution. The tools established in this thesis can help protein engineers explore fitness landscapes more economically and comprehensively. 
Their application to mapping full trajectories of early adaptation uncovers differences in the evolvability of homologs, potentially aiding the identification of evolvable starting points as well as strategies to increase evolvability for efficient protein engineering in the future.
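The abstract describes using unique molecular identifiers (UMIs) to raise the accuracy of nanopore sequencing enough to distinguish point mutations. The general idea can be sketched as grouping reads by UMI and taking a per-position majority vote. The sketch below assumes reads sharing a UMI are already aligned and of equal length; the function name and the example reads are hypothetical and do not reproduce the workflow established in the thesis.

```python
from collections import Counter, defaultdict


def umi_consensus(reads):
    """Collapse (umi, sequence) reads into one majority-vote sequence per UMI.

    Assumes reads sharing a UMI are already aligned and of equal length.
    """
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)
    consensus = {}
    for umi, seqs in groups.items():
        # zip(*seqs) yields one tuple of bases per aligned position.
        columns = zip(*seqs)
        consensus[umi] = "".join(
            Counter(column).most_common(1)[0][0] for column in columns
        )
    return consensus


# Hypothetical reads: each UMI has one read carrying a random sequencing error.
reads = [
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACCTAC"),
    ("CCTA", "ACGAAC"),
    ("CCTA", "ACGAAC"),
    ("CCTA", "TCGAAC"),
]
result = umi_consensus(reads)
```

Because random sequencing errors rarely recur at the same position in reads of the same molecule, the vote suppresses them, while a true point mutation (here the T to A difference between the two molecules) is preserved.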