Search CORE

773 research outputs found

Machine learning-guided directed evolution for protein engineering

Author: Arnold Frances H.
Wu Zachary
Yang Kevin K.
Publication venue
Publication date: 19/04/2019
Field of study

Machine learning (ML)-guided directed evolution is a new paradigm for biological design that enables optimization of complex functions. ML methods use data to predict how sequence maps to function without requiring a detailed model of the underlying physics or biological pathways. To demonstrate ML-guided directed evolution, we introduce the steps required to build ML sequence-function models and use them to guide engineering, making recommendations at each stage. This review covers basic concepts relevant to using ML for protein engineering as well as the current literature and applications of this new engineering paradigm. ML methods accelerate directed evolution by learning from information contained in all measured variants and using that information to select sequences that are likely to be improved. We then provide two case studies that demonstrate the ML-guided directed evolution process. We also look to future opportunities where ML will enable discovery of new protein functions and uncover the relationship between protein sequence and function.Comment: Made significant revisions to focus on aspects most relevant to applying machine learning to speed up directed evolutio

arXiv.org e-Print Archive

Caltech Authors

Deep Learning for Genomics: A Concise Overview

Author: Wang Haohan
Yue Tianwei
Publication venue
Publication date: 08/05/2018
Field of study

Advancements in genomic research such as high-throughput sequencing techniques have driven modern genomic studies into "big data" disciplines. This data explosion is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in a variety of fields such as vision, speech, and text processing. Yet genomics entails unique challenges to deep learning since we are expecting from deep learning a superhuman intelligence that explores beyond our knowledge to interpret the genome. A powerful deep learning model should rely on insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with a proper deep architecture, and remark on practical considerations of developing modern deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research, as well as pointing out potential opportunities and obstacles for future genomics applications.Comment: Invited chapter for Springer Book: Handbook of Deep Learning Application

arXiv.org e-Print Archive

Generative Pretrained Autoregressive Transformer Graph Neural Network applied to the Analysis and Discovery of Novel Proteins

Author: Buehler Markus J.
Publication venue
Publication date: 11/07/2023
Field of study

We report a flexible language-model based deep learning strategy, applied here to solve complex forward and inverse problems in protein modeling, based on an attention neural network that integrates transformer and graph convolutional architectures in a causal multi-headed graph mechanism, to realize a generative pretrained model. The model is applied to predict secondary structure content (per-residue level and overall content), protein solubility, and sequencing tasks. Further trained on inverse tasks, the model is rendered capable of designing proteins with these properties as target features. The model is formulated as a general framework, completely prompt-based, and can be adapted for a variety of downstream tasks. We find that adding additional tasks yields emergent synergies that the model exploits in improving overall performance, beyond what would be possible by training a model on each dataset alone. Case studies are presented to validate the method, yielding protein designs specifically focused on structural proteins, but also exploring the applicability in the design of soluble, antimicrobial biomaterials. While our model is trained to ultimately perform 8 distinct tasks, with available datasets it can be extended to solve additional problems. In a broader sense, this work illustrates a form of multiscale modeling that relates a set of ultimate building blocks (here, byte-level utf8 characters that define the nature of the physical system at hand) to complex output. This materiomic scheme captures complex emergent relationships between universal building block and resulting properties via a synergizing learning capacity to express a set of potentialities embedded in the knowledge used in training, via the interplay of universality and diversity

arXiv.org e-Print Archive

Applications of deep neural networks to protein structure prediction

Author: Fang Chao (Computer scientist)
Publication venue: University of Missouri--Columbia
Publication date
Field of study

Professor Yi Shang, Dissertation Advisor; Professor Dong Xu, Dissertation Co-advisor.Includes vita.Field of Study: Computer science."July 2018."Protein secondary structure, backbone torsion angle and other secondary structure features can provide useful information for protein 3D structure prediction and protein functions. Deep learning offers a new opportunity to significantly improve prediction accuracy. In this dissertation, several new deep neural network architectures are proposed for protein secondary structure prediction: deep inception-inside-inception (Deep3I) networks and deep neighbor residual (DeepNRN) networks for secondary structure prediction; deep residual inception networks (DeepRIN) for backbone torsion angle prediction; deep dense inception networks (DeepDIN) for beta turn prediction; deep inception capsule networks (DeepICN) for gamma turn prediction. Every tool was then implemented as a standalone tool integrated into MUFold package and freely available to research community. A webserver called MUFold-SS-Angle is also developed for protein property prediction. The input feature to those deep neural networks is a carefully designed feature matrix corresponding to the primary amino acid sequence of a protein, which consists of a rich set of information derived from individual amino acid, as well as the context of the protein sequence. Specifically, the feature matrix is a composition of physio-chemical properties of amino acids, PSI-BLAST profile, HHBlits profile and/or predicted shape string. The deep architecture enables effective processing of local and global interactions between amino acids in making accurate prediction. In extensive experiments on multiple datasets, the proposed deep neural architectures outperformed the best existing methods and other deep neural networks significantly: The proposed DeepNRN achieved highest Q8 75.33, 72.9, 70.8 on CASP 10, 11, 12 higher than previous state-of-the-art DeepCNF-SS with 71.8, 72.3, and 69.76. The proposed MUFold-SS (Deep3I) achieved highest Q8 76.47, 74.51, 72.1 on CASP 10, 11, 12. Compared to the recently released state-of-the-art tool, SPIDER3, DeepRIN reduced the Psi angle prediction error by more than 5 degrees and the Phi angle prediction error by more than 2 degrees on average. DeepDIN outperformed significantly BetaTPred3 in both two-class and nine-class beta turn prediction on benchmark BT426 and BT6376. DeepICN is the first application of using capsule network to biological sequence analysis and outperformed all previous gamma-turn predictors on benchmark GT320.Includes bibliographical references (pages 114-131)

University of Missouri: MOspace

The promises of large language models for protein design and modeling.

Author: Cabri Alberto
Casiraghi Elena
Gliozzo Jessica
Malchiodi Dario
Mesiti Marco
Reese Justin
Robinson Peter N
Soto-Gomez Mauricio
Valentini Giorgio
Publication venue: The Mouseion at the JAXlibrary
Publication date: 01/01/2023
Field of study

The recent breakthroughs of Large Language Models (LLMs) in the context of natural language processing have opened the way to significant advances in protein research. Indeed, the relationships between human natural language and the language of proteins invite the application and adaptation of LLMs to protein modelling and design. Considering the impressive results of GPT-4 and other recently developed LLMs in processing, generating and translating human languages, we anticipate analogous results with the language of proteins. Indeed, protein language models have been already trained to accurately predict protein properties, generate novel functionally characterized proteins, achieving state-of-the-art results. In this paper we discuss the promises and the open challenges raised by this novel and exciting research area, and we propose our perspective on how LLMs will affect protein modeling and design

The Jackson Laboratory: The Mouseion at the JAXlibrary