Deep generative models for biology: represent, predict, design
Deep generative models have revolutionized the field of artificial intelligence, fundamentally changing how we generate novel objects that imitate or extrapolate from training data, and transforming how we access and consume various types of information such as text, images, speech, and computer programs. They have the potential to radically transform other scientific disciplines as well, from mathematical problem solving to fast and accurate simulation in high-energy physics and rapid weather forecasting. In computational biology, generative models hold immense promise for improving our understanding of complex biological processes, designing new drugs and therapies, and forecasting viral evolution during pandemics, among many other applications. However, biological objects pose unique challenges due to their inherent complexity: massive search spaces, multiple complementary data modalities, and a distinctive interplay between highly structured and relatively unstructured components.
In this thesis, we develop several deep generative modeling frameworks motivated by key questions in computational biology. Given the interdisciplinary nature of this endeavor, we first provide a comprehensive background in generative modeling, uncertainty quantification, and sequential decision making, as well as the key concepts in biology and chemistry needed to follow our work. We then turn to the core of our contributions, which are organized around three chapters. The first chapter introduces methods for learning representations of biological sequences, laying the foundation for subsequent analyses. The second chapter illustrates how these representations can be leveraged to predict complex properties of biomolecules, focusing on three applications: protein fitness prediction, the effects of genetic variants on human disease risk, and viral immune escape. Finally, the third chapter is dedicated to methods for designing novel biomolecules, including drug target identification, de novo molecular optimization, and protein engineering.
This thesis also makes several methodological contributions to broader machine learning challenges, such as uncertainty quantification in high-dimensional spaces and efficient transformer architectures, which hold potential value in other application domains. We conclude by summarizing our key findings, highlighting the shortcomings of current approaches, proposing avenues for future research, and discussing emerging trends within the field.
RITA: a Study on Scaling Up Generative Protein Sequence Models
In this work we introduce RITA: a suite of autoregressive generative models for protein sequences, with up to 1.2 billion parameters, trained on over 280 million protein sequences from the UniRef-100 database. Such generative models hold the promise of greatly accelerating protein design. We conduct the first systematic study of how capabilities evolve with model size for autoregressive transformers in the protein domain: we evaluate RITA models on next amino acid prediction, zero-shot fitness prediction, and enzyme function prediction, showing consistent benefits from increased scale. We release the RITA models openly, for the benefit of the research community.
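To make the evaluation concrete, below is a minimal, hypothetical sketch of zero-shot fitness scoring with an autoregressive protein language model: a variant is scored by the difference in sequence log-likelihood between the mutant and the wild type. The Hugging Face model identifier and tokenizer behaviour are assumptions for illustration, not confirmed details of the RITA release.

```python
# Hypothetical sketch: zero-shot fitness scoring with an autoregressive protein LM.
# The model identifier below is an assumption; the scoring rule (log-likelihood
# difference between mutant and wild type) is the standard zero-shot approach.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "lightonai/RITA_s"  # assumed identifier for the smallest RITA model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

@torch.no_grad()
def log_likelihood(sequence: str) -> float:
    """Sum of per-token log-probabilities under the autoregressive model."""
    ids = tokenizer(sequence, return_tensors="pt").input_ids
    out = model(ids, labels=ids)        # loss = mean negative log-likelihood
    n_predicted = ids.shape[1] - 1      # the first token has no left context
    return -out.loss.item() * n_predicted

def zero_shot_fitness(wild_type: str, mutant: str) -> float:
    """Positive values mean the mutant is more likely than the wild type."""
    return log_likelihood(mutant) - log_likelihood(wild_type)

wt = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
mut = wt[:10] + "A" + wt[11:]           # single substitution at position 11
print(zero_shot_fitness(wt, mut))
```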
DiscoBAX: Discovery of Optimal Intervention Sets in Genomic Experiment Design
The discovery of therapeutics to treat genetically-driven pathologies relies on identifying genes involved in the underlying disease mechanisms. Existing approaches search over the billions of potential interventions to maximize the expected influence on the target phenotype. However, to reduce the risk of failure in later stages of trials, practical experiment design aims to find a set of interventions that maximally change a target phenotype via diverse mechanisms. We propose DiscoBAX, a sample-efficient method for maximizing the rate of significant discoveries per experiment while simultaneously probing a wide range of diverse mechanisms during a genomic experiment campaign. We provide theoretical guarantees of approximate optimality under standard assumptions, and conduct a comprehensive experimental evaluation covering both synthetic and real-world experimental design tasks. DiscoBAX outperforms existing state-of-the-art methods for experimental design, selecting effective and diverse perturbations in biological systems.
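As a rough illustration of this kind of set-selection objective, the sketch below greedily picks interventions that maximize the expected best outcome across posterior samples of intervention effects, which rewards both large effects and diversity. It is a simplified stand-in under assumed inputs, not the published DiscoBAX algorithm, and all quantities are simulated.

```python
# Simplified sketch of greedy intervention-set selection (not the published
# DiscoBAX algorithm): maximise E_samples[ max over selected interventions of effect ].
import numpy as np

rng = np.random.default_rng(0)
n_interventions, n_samples = 1000, 64

# Assumed input: posterior samples of each intervention's phenotype effect,
# e.g. drawn from a model fitted to earlier rounds of the experiment campaign.
effect_samples = rng.normal(size=(n_samples, n_interventions))

def greedy_select(effect_samples: np.ndarray, budget: int) -> list[int]:
    """Greedily maximise the expected best effect of the selected set."""
    selected: list[int] = []
    best_so_far = np.full(effect_samples.shape[0], -np.inf)
    for _ in range(budget):
        # value of the set after adding each candidate, averaged over samples
        values = np.maximum(effect_samples, best_so_far[:, None]).mean(axis=0)
        values[selected] = -np.inf          # never pick the same intervention twice
        choice = int(np.argmax(values))
        selected.append(choice)
        best_so_far = np.maximum(best_so_far, effect_samples[:, choice])
    return selected

print(greedy_select(effect_samples, budget=10))
```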
Mixtures of large-scale dynamic functional brain network modes
Accurate temporal modelling of functional brain networks is essential in the quest to understand how such networks facilitate cognition. Researchers are beginning to adopt time-varying analyses for electrophysiological data that capture highly dynamic processes on the order of milliseconds. Typically, these approaches, such as clustering of functional connectivity profiles and Hidden Markov Modelling (HMM), assume mutual exclusivity of networks over time. Whilst a powerful constraint, this assumption may be compromising the ability of these approaches to describe the data effectively. Here, we propose a new generative model for functional connectivity as a time-varying linear mixture of spatially distributed statistical "modes". The temporal evolution of this mixture is governed by a recurrent neural network, which enables the model to generate data with a rich temporal structure. We use a Bayesian framework known as amortised variational inference to learn model parameters from observed data. We call the approach DyNeMo (for Dynamic Network Modes), and show using simulations that it outperforms the HMM when the assumption of mutual exclusivity is violated. In resting-state MEG, DyNeMo reveals a mixture of modes that activate on fast time scales of 100–150 ms, which is similar to the state lifetimes found using an HMM. In task MEG data, DyNeMo finds modes with plausible, task-dependent evoked responses without any knowledge of the task timings. Overall, DyNeMo provides decompositions that are an approximate remapping of the HMM's while showing improvements in overall explanatory power. However, the magnitude of these improvements suggests that the HMM's assumption of mutual exclusivity can be reasonable in practice. Nonetheless, DyNeMo provides a flexible framework for implementing and assessing future modelling developments.
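The toy simulation below illustrates the core generative assumption: the observed covariance at each time point is a linear mixture of fixed mode covariances, with mixing coefficients that evolve smoothly over time. A simple AR(1) recurrence on the mixing logits stands in for the learned recurrent network; this is an illustrative sketch, not the published DyNeMo model or its amortised variational inference procedure.

```python
# Toy generative sketch of the "mixture of modes" idea (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_channels, n_modes, n_timepoints = 8, 3, 500

# Each mode contributes a fixed spatial covariance pattern.
mode_covs = np.stack(
    [np.cov(rng.normal(size=(n_channels, 200))) for _ in range(n_modes)]
)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# AR(1) recurrence on logits stands in for the recurrent network that governs
# the temporal evolution of the mixing coefficients.
logits = np.zeros(n_modes)
alphas = np.empty((n_timepoints, n_modes))
data = np.empty((n_timepoints, n_channels))
for t in range(n_timepoints):
    logits = 0.95 * logits + 0.3 * rng.normal(size=n_modes)
    alphas[t] = softmax(logits)                          # non-exclusive activations
    cov_t = np.tensordot(alphas[t], mode_covs, axes=1)   # time-varying linear mixture
    data[t] = rng.multivariate_normal(np.zeros(n_channels), cov_t)
```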
Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval
The ability to accurately model the fitness landscape of protein sequences is critical to a wide range of applications, from quantifying the effects of human variants on disease likelihood, to predicting immune-escape mutations in viruses and designing novel biotherapeutic proteins. Deep generative models of protein sequences trained on multiple sequence alignments have so far been the most successful approaches to these tasks. The performance of these methods is, however, contingent on the availability of sufficiently deep and diverse alignments for reliable training. Their potential scope is thus limited by the fact that many protein families are hard, if not impossible, to align. Large language models trained on massive quantities of non-aligned protein sequences from diverse families address these problems and show potential to eventually bridge the performance gap. We introduce Tranception, a novel transformer architecture leveraging autoregressive predictions and retrieval of homologous sequences at inference time to achieve state-of-the-art fitness prediction performance. Given its markedly higher performance on multiple mutants, robustness to shallow alignments, and ability to score indels, our approach offers a significant gain in scope over existing approaches. To enable more rigorous model testing across a broader range of protein families, we develop ProteinGym, an extensive set of multiplexed assays of variant effects, substantially increasing both the number and diversity of assays compared to existing benchmarks.
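The sketch below illustrates one way inference-time retrieval can be combined with an autoregressive model: per-position log-probabilities from the model are blended with smoothed amino-acid frequencies computed from retrieved, aligned homologs. The blending weight, pseudocounts, and input shapes are illustrative assumptions, not the exact scheme used by Tranception.

```python
# Simplified sketch of inference-time retrieval: blend model log-probabilities
# with empirical frequencies from retrieved homologs (illustrative weights).
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def retrieval_log_probs(homologs: list[str], pseudocount: float = 1.0) -> np.ndarray:
    """Per-position log frequencies of amino acids in retrieved (aligned) homologs."""
    length = len(homologs[0])
    counts = np.full((length, len(AMINO_ACIDS)), pseudocount)
    for seq in homologs:
        for pos, aa in enumerate(seq):
            if aa in AA_INDEX:              # skip gaps / unknown characters
                counts[pos, AA_INDEX[aa]] += 1
    return np.log(counts / counts.sum(axis=1, keepdims=True))

def blended_score(sequence: str, model_log_probs: np.ndarray,
                  homologs: list[str], alpha: float = 0.5) -> float:
    """Weighted average of model and retrieval log-probabilities, summed over positions.

    model_log_probs is assumed to have shape (sequence length, 20), holding the
    autoregressive model's per-position amino-acid log-probabilities.
    """
    retr = retrieval_log_probs(homologs)
    idx = np.array([AA_INDEX[aa] for aa in sequence])
    pos = np.arange(len(sequence))
    return float(((1 - alpha) * model_log_probs[pos, idx]
                  + alpha * retr[pos, idx]).sum())
```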