29 research outputs found

    Machine learning-assisted directed protein evolution with combinatorial libraries

    To reduce experimental effort associated with directed protein evolution and to explore the sequence space encoded by mutating multiple positions simultaneously, we incorporate machine learning in the directed evolution workflow. Combinatorial sequence space can be quite expensive to sample experimentally, but machine learning models trained on tested variants provide a fast method for testing sequence space computationally. We validate this approach on a large published empirical fitness landscape for human GB1 binding protein, demonstrating that machine learning-guided directed evolution finds variants with higher fitness than those found by other directed evolution approaches. We then provide an example application in evolving an enzyme to produce each of the two possible product enantiomers (stereodivergence) of a new-to-nature carbene Si-H insertion reaction. The approach predicted libraries enriched in functional enzymes and fixed seven mutations in two rounds of evolution to identify variants for selective catalysis with 93% and 79% ee. By greatly increasing throughput with in silico modeling, machine learning enhances the quality and diversity of sequence solutions for a protein engineering problem. Comment: Corrected best S-selective variant sequence in Figure 4. Corrected less R-selective variant sequences from Round II Input library in Table 2 and Supp Table 4. Corrections may also be found on PNAS version https://www.pnas.org/content/early/2019/12/26/192177011

    Machine learning-guided directed evolution for protein engineering

    Machine learning (ML)-guided directed evolution is a new paradigm for biological design that enables optimization of complex functions. ML methods use data to predict how sequence maps to function without requiring a detailed model of the underlying physics or biological pathways. To demonstrate ML-guided directed evolution, we introduce the steps required to build ML sequence-function models and use them to guide engineering, making recommendations at each stage. This review covers basic concepts relevant to using ML for protein engineering as well as the current literature and applications of this new engineering paradigm. ML methods accelerate directed evolution by learning from information contained in all measured variants and using that information to select sequences that are likely to be improved. We then provide two case studies that demonstrate the ML-guided directed evolution process. We also look to future opportunities where ML will enable discovery of new protein functions and uncover the relationship between protein sequence and function. Comment: Made significant revisions to focus on aspects most relevant to applying machine learning to speed up directed evolution.
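The core idea in the abstract above (learn from all measured variants, then select sequences likely to be improved) can be illustrated with a deliberately simple sketch. It assumes a purely additive model, which is a stand-in chosen for brevity and not the model used in the review; the function names, the two-letter alphabet, and the fitness values are all hypothetical.

```python
from collections import defaultdict
from itertools import product


def fit_additive_model(measured):
    """Estimate one effect per (position, residue) pair.

    The effect is the mean fitness of measured variants carrying that
    residue at that position, minus the overall mean fitness.
    """
    overall = sum(measured.values()) / len(measured)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for seq, fitness in measured.items():
        for pos, aa in enumerate(seq):
            totals[(pos, aa)] += fitness
            counts[(pos, aa)] += 1
    effects = {key: totals[key] / counts[key] - overall for key in totals}
    return overall, effects


def predict(seq, overall, effects):
    """Predicted fitness; unseen (position, residue) pairs contribute 0."""
    return overall + sum(effects.get((pos, aa), 0.0) for pos, aa in enumerate(seq))


def rank_untested(measured, alphabet, n_positions, top_k):
    """Rank every untested combinatorial variant by predicted fitness."""
    overall, effects = fit_additive_model(measured)
    candidates = ("".join(c) for c in product(alphabet, repeat=n_positions))
    untested = [s for s in candidates if s not in measured]
    untested.sort(key=lambda s: predict(s, overall, effects), reverse=True)
    return untested[:top_k]


# Hypothetical measurements for a two-position library over a toy alphabet.
measured = {"AA": 1.0, "AV": 2.0, "VA": 3.0}
print(rank_untested(measured, "AV", 2, 1))
```

An additive model cannot capture interactions between positions, so in practice a more expressive model would usually replace `fit_additive_model`; the rank-then-test loop stays the same.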

    Machine learning-assisted directed protein evolution with combinatorial libraries

    To reduce experimental effort associated with directed protein evolution and to explore the sequence space encoded by mutating multiple positions simultaneously, we incorporate machine learning into the directed evolution workflow. Combinatorial sequence space can be quite expensive to sample experimentally, but machine-learning models trained on tested variants provide a fast method for testing sequence space computationally. We validated this approach on a large published empirical fitness landscape for human GB1 binding protein, demonstrating that machine learning-guided directed evolution finds variants with higher fitness than those found by other directed evolution approaches. We then provide an example application in evolving an enzyme to produce each of the two possible product enantiomers (i.e., stereodivergence) of a new-to-nature carbene Si–H insertion reaction. The approach predicted libraries enriched in functional enzymes and fixed seven mutations in two rounds of evolution to identify variants for selective catalysis with 93% and 79% ee (enantiomeric excess). By greatly increasing throughput with in silico modeling, machine learning enhances the quality and diversity of sequence solutions for a protein engineering problem
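For readers unfamiliar with the ee figures quoted above, the following sketch converts between enantiomeric excess and the proportions of the two enantiomers, using the standard definition ee = |major - minor| / (major + minor) × 100. The function names are illustrative and do not come from the paper.

```python
def ee_percent(major, minor):
    """Enantiomeric excess (%) from the amounts of the two enantiomers."""
    total = major + minor
    if total <= 0:
        raise ValueError("total amount must be positive")
    return 100.0 * abs(major - minor) / total


def enantiomer_ratio(ee):
    """Percent of major and minor enantiomer for a given ee (%)."""
    if not 0 <= ee <= 100:
        raise ValueError("ee must be between 0 and 100")
    return (100.0 + ee) / 2.0, (100.0 - ee) / 2.0


# The two selectivities reported in the abstract.
print(enantiomer_ratio(93))
print(enantiomer_ratio(79))
```

Under this definition, 93% ee corresponds to a 96.5 : 3.5 mixture and 79% ee to an 89.5 : 10.5 mixture.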

    Predicting a Protein's Stability under a Million Mutations

    Stabilizing proteins is a foundational step in protein engineering. However, the evolutionary pressure of all extant proteins makes it challenging to identify the few mutations that will improve thermodynamic stability. Deep learning has recently emerged as a powerful tool for identifying promising mutations. Existing approaches, however, are computationally expensive, as the number of model inferences scales with the number of mutations queried. Our main contribution is a simple, parallel decoding algorithm. Our method, Mutate Everything, can predict the effect of all single and double mutations in one forward pass. It is even versatile enough to predict higher-order mutations with minimal computational overhead. We build Mutate Everything on top of ESM2 and AlphaFold, neither of which was trained to predict thermodynamic stability. We trained on the Mega-Scale cDNA proteolysis dataset and achieved state-of-the-art performance on single and higher-order mutations on the S669, ProTherm, and ProteinGym datasets. Code is available at https://github.com/jozhang97/MutateEverything Comment: NeurIPS 2023. Code available at https://github.com/jozhang97/MutateEverythin

    Towards higher predictability in enzyme engineering: investigation of protein epistasis in dynamic β-lactamases and Cal-A lipase

    Enzyme engineering is a tool of great utility in the biotechnology industry. It allows enzymes to be tailored to a specific activity or reaction condition. In addition, it can help decipher the key elements that facilitated their modification. While enzyme engineering is extensively practised, it still entails several bottlenecks. Some of these bottlenecks are technical, such as the development of methodologies for creating targeted mutational libraries or performing high-throughput screening, and some are conceptual, such as deciphering the key relevant features in a target protein for a successful engineering project. Among these challenges, intragenic epistasis, or the non-additivity of the phenotypic effects of mutations, is a feature that greatly hinders predictability. Improving enzyme engineering requires a multidisciplinary approach that includes gaining a better understanding of structure-function-evolution relations. This thesis seeks to contribute to the advancement of enzyme engineering by investigating two model systems. First, dynamic variants of TEM-1 β-lactamase were chosen to investigate the link between protein dynamics and evolution. TEM-1 β-lactamase has been extensively characterized in the literature, which has translated into extensive knowledge of its reaction mechanism, structural features and evolution. The variants of TEM-1 β-lactamase used as a model system in this thesis had been extensively characterized, showing increased dynamics at the timescale relevant to catalysis (µs to ms) while maintaining substrate recognition. In this thesis, in vitro evolution of these dynamic variants was done by iterative rounds of random mutagenesis and selection to allow an unbiased exploration of the fitness landscape. We demonstrate that the presence of these particular motions at the outset of evolution allowed access to known mutational pathways. In addition, known epistatic interactions were introduced in the dynamic variants. Their in silico and kinetic characterization revealed that the additional motions on the timescale of catalysis allowed access to conformations leading to enhanced function, as in native TEM-1. Overall, we demonstrate that the evolution of TEM-1 β-lactamase toward new function is compatible with diverse motions at the µs to ms timescale. Whether this can be translated to other enzymes with biotechnological potential remains to be explored. Second, the industrially relevant Cal-A lipase was chosen to identify features that could facilitate its engineering. Cal-A lipase presents characteristics such as substrate versatility and high thermal stability and reactivity that make it attractive for modification of triglycerides or synthesis of relevant molecules in the food and pharmaceutical industries. Contrary to TEM-1, most in vitro evolution studies of Cal-A lipase have been done towards an industrially specified goal, with limited exploration of mutational space. As a result, features that define function in Cal-A lipase remain elusive. In this thesis, we report on focused mutagenesis of Cal-A lipase, confirming the existence of a key region for substrate recognition. This was done by combining a novel library creation methodology based on Golden Gate assembly with script-based structural visualization to identify and map the selected mutations into the 3D structure. The characterization and deconvolution of two of the fittest variants revealed the existence of epistasis in the evolution of Cal-A lipase towards new function. Overall, we demonstrate that mapping a variety of properties following mutagenesis targeted to specific regions can greatly improve knowledge of an enzyme, and that this knowledge can be applied to improve the efficiency of directed engineering.
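The abstract defines epistasis as the non-additivity of the phenotypic effects of mutations. A minimal sketch of how that deviation is commonly quantified for a pair of mutations follows; it assumes fitness is measured on a scale where independent effects would add (for example, a log scale), and the numeric values are hypothetical rather than taken from the thesis.

```python
def pairwise_epistasis(f_wt, f_a, f_b, f_ab):
    """Deviation of the double mutant from the additive expectation.

    epsilon = (f_ab - f_wt) - [(f_a - f_wt) + (f_b - f_wt)]
    epsilon == 0 means the two mutations combine additively.
    """
    return (f_ab - f_wt) - ((f_a - f_wt) + (f_b - f_wt))


# Hypothetical fitness values on an additive (e.g. log) scale.
print(pairwise_epistasis(0.0, 1.0, 2.0, 3.0))  # additive case
print(pairwise_epistasis(0.0, 1.0, 1.0, 3.0))  # double mutant exceeds the sum
```

A positive value indicates that the double mutant performs better than the sum of the single-mutant effects would predict; a negative value indicates the opposite.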

    Data-Driven Protein Engineering

    Directed evolution has enabled the adaptation of natural protein sequences for an endless variety of human applications. Given a starting point - a sequence with measurable activity - directed evolution is able to improve protein sequences by iteratively accumulating beneficial mutations. However, directed evolution requires investing large experimental effort, which continues to be the major bottleneck in efficient protein optimization. To this end, we describe a framework for incorporating machine learning in the directed evolution process to maximize the utility of generated experimental data in Chapter 2. In Chapter 3, we then show that this framework outperforms traditional directed evolution methods on an empirical fitness landscape. However, directed evolution is fundamentally limited by its need for a starting point, or a sequence with measurable activity. To tackle this issue, we test the ability of nascent deep learning techniques for generating short, functional amino acid sequences in Chapter 4. Encouraged by this success, we attempted to generate full-length enzymatic sequences for desired substrates without success. However, we were able to apply this deep learning approach to model other aspects of enzymatic protein sequences in Chapter 5. Finally, the field of data-driven protein sequence generation is enjoying a recent surge in interest, and we provide an updated review of protein engineering with machine learning, focusing on recent work in deep generative modeling in Chapter 1.

    Probabilistic Protein Engineering

    Machine learning-guided protein engineering is a new paradigm that enables the optimization of complex protein functions. Machine-learning methods use data to predict protein function without requiring a detailed model of the underlying physics or biological pathways. They accelerate protein engineering by learning from information contained in all measured variants and using it to select variants that are likely to be improved. We begin with a review of the basics of machine learning with a focus on applications to protein engineering and protein sequence-function datasets (Chapter 1). We used the entire machine-learning guided engineering paradigm to engineer the algal-derived light-gated channel channelrhodopsin (ChR), which can be used to modulate neuronal activity with light. We build models that discover ChRs with strong plasma membrane localization in mammalian cells (Chapter 2) and unprecedented light sensitivity and photocurrents for optogenetic applications (Chapter 3). Machine learning-guided evolution requires a machine-learning model that learns the relationship between sequence and function. For machine-learning models to learn about protein sequences, protein sequences must be represented as vectors or matrices of numbers. How each protein sequence is represented determines what can be learned. We learn continuous vector encodings of sequences from patterns in unlabeled sequences (Chapter 4). Learned encodings are low-dimensional, do not require alignments, and may improve performance by transferring information in unlabeled sequences to specific prediction tasks. Alternately, we demonstrate an interpretable Gaussian process kernel tailored to biological sequences (Chapter 6). In addition to a model to predict function from sequence, engineering requires a method to use the model to choose sequences for the next round of evolution. Most machine-learning guided engineering strategies assume that selected sequences can be queried directly. 
However, in directed evolution it is common to design a library of sequences and then sample stochastic batches from that library. We propose a batched stochastic Bayesian optimization algorithm for iteratively designing and screening site-saturation mutagenesis libraries (Chapter 5).
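The abstract notes that protein sequences must be represented as vectors or matrices of numbers before a model can learn from them. The simplest such representation is a one-hot encoding, sketched below as a baseline. It is not the learned continuous encoding the thesis describes, and the names `AMINO_ACIDS` and `one_hot` are illustrative.

```python
# The 20 canonical amino acids, in a fixed (alphabetical one-letter) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq):
    """Flatten a sequence into a 0/1 vector of length 20 * len(seq).

    Each position gets a block of 20 entries with a single 1 marking
    the residue present at that position.
    """
    n = len(AMINO_ACIDS)
    vec = [0] * (len(seq) * n)
    for pos, aa in enumerate(seq):
        vec[pos * n + INDEX[aa]] = 1  # raises KeyError for non-canonical residues
    return vec


encoded = one_hot("ACD")
print(len(encoded), sum(encoded))
```

One-hot vectors grow linearly with sequence length and treat all residues as equally dissimilar, which is one motivation for the lower-dimensional learned encodings mentioned in the abstract.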