Deep generative models for biology: represent, predict, design
Deep generative models have revolutionized the field of artificial intelligence, fundamentally changing how we generate novel objects that imitate or extrapolate from training data, and transforming how we access and consume various types of information such as text, images, speech, and computer programs. They have the potential to radically transform other scientific disciplines as well, from mathematical problem solving to fast and accurate simulation in high-energy physics and rapid weather forecasting. In computational biology, generative models hold immense promise for improving our understanding of complex biological processes, designing new drugs and therapies, and forecasting viral evolution during pandemics, among many other applications. However, biological objects pose unique challenges due to their inherent complexity: massive search spaces, multiple complementary data modalities, and a distinctive interplay between highly structured and relatively unstructured components.
In this thesis, we develop several deep generative modeling frameworks motivated by key questions in computational biology. Given the interdisciplinary nature of this endeavor, we first provide a comprehensive background in generative modeling, uncertainty quantification, and sequential decision making, as well as the key concepts in biology and chemistry needed to follow our work. We then turn to the core of our contributions, which are organized around three chapters. The first chapter introduces methods for learning representations of biological sequences, laying the foundation for subsequent analyses. The second chapter illustrates how these representations can be leveraged to predict complex properties of biomolecules, focusing on three applications: protein fitness prediction, the effects of genetic variants on human disease risk, and viral immune escape. Finally, the third chapter is dedicated to methods for designing novel biomolecules, including drug target identification, de novo molecular optimization, and protein engineering.
This thesis also makes several methodological contributions to broader machine learning challenges, such as uncertainty quantification in high-dimensional spaces and efficient transformer architectures, which hold potential value in other application domains. We conclude by summarizing our key findings, highlighting the shortcomings of current approaches, proposing avenues for future research, and discussing emerging trends within the field.
RITA: a Study on Scaling Up Generative Protein Sequence Models
In this work we introduce RITA: a suite of autoregressive generative models for protein sequences, with up to 1.2 billion parameters, trained on over 280 million protein sequences from the UniRef-100 database. Such generative models hold the promise of greatly accelerating protein design. We conduct the first systematic study of how capabilities evolve with model size for autoregressive transformers in the protein domain: we evaluate RITA models on next amino acid prediction, zero-shot fitness prediction, and enzyme function prediction, showing consistent benefits from increased scale. We release the RITA models openly, for the benefit of the research community.
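To make the evaluation concrete, below is a minimal, hypothetical sketch of zero-shot fitness scoring with an autoregressive protein language model: a variant is scored by the difference in sequence log-likelihood between the mutant and the wild type. The Hugging Face model identifier and tokenizer behaviour are assumptions for illustration, not confirmed details of the RITA release.

```python
# Hypothetical sketch: zero-shot fitness scoring with an autoregressive protein LM.
# The model identifier below is an assumption; the scoring rule (log-likelihood
# difference between mutant and wild type) is the standard zero-shot approach.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "lightonai/RITA_s"  # assumed identifier for the smallest RITA model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)
model.eval()

@torch.no_grad()
def log_likelihood(sequence: str) -> float:
    """Sum of per-token log-probabilities under the autoregressive model."""
    ids = tokenizer(sequence, return_tensors="pt").input_ids
    out = model(ids, labels=ids)        # loss = mean negative log-likelihood
    n_predicted = ids.shape[1] - 1      # the first token has no left context
    return -out.loss.item() * n_predicted

def zero_shot_fitness(wild_type: str, mutant: str) -> float:
    """Positive values mean the mutant is more likely than the wild type."""
    return log_likelihood(mutant) - log_likelihood(wild_type)

wt = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
mut = wt[:10] + "A" + wt[11:]           # single substitution at position 11
print(zero_shot_fitness(wt, mut))
```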
DiscoBAX: Discovery of Optimal Intervention Sets in Genomic Experiment Design
The discovery of therapeutics to treat genetically-driven pathologies relies on identifying genes involved in the underlying disease mechanisms. Existing approaches search over the billions of potential interventions to maximize the expected influence on the target phenotype. However, to reduce the risk of failure in later stages of trials, practical experiment design aims to find a set of interventions that maximally change a target phenotype via diverse mechanisms. We propose DiscoBAX, a sample-efficient method for maximizing the rate of significant discoveries per experiment while simultaneously probing a wide range of diverse mechanisms during a genomic experiment campaign. We provide theoretical guarantees of approximate optimality under standard assumptions, and conduct a comprehensive experimental evaluation covering both synthetic and real-world experimental design tasks. DiscoBAX outperforms existing state-of-the-art methods for experimental design, selecting effective and diverse perturbations in biological systems.
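As a rough illustration of this kind of set-selection objective, the sketch below greedily picks interventions that maximize the expected best outcome across posterior samples of intervention effects, which rewards both large effects and diversity. It is a simplified stand-in under assumed inputs, not the published DiscoBAX algorithm, and all quantities are simulated.

```python
# Simplified sketch of greedy intervention-set selection (not the published
# DiscoBAX algorithm): maximise E_samples[ max over selected interventions of effect ].
import numpy as np

rng = np.random.default_rng(0)
n_interventions, n_samples = 1000, 64

# Assumed input: posterior samples of each intervention's phenotype effect,
# e.g. drawn from a model fitted to earlier rounds of the experiment campaign.
effect_samples = rng.normal(size=(n_samples, n_interventions))

def greedy_select(effect_samples: np.ndarray, budget: int) -> list[int]:
    """Greedily maximise the expected best effect of the selected set."""
    selected: list[int] = []
    best_so_far = np.full(effect_samples.shape[0], -np.inf)
    for _ in range(budget):
        # value of the set after adding each candidate, averaged over samples
        values = np.maximum(effect_samples, best_so_far[:, None]).mean(axis=0)
        values[selected] = -np.inf          # never pick the same intervention twice
        choice = int(np.argmax(values))
        selected.append(choice)
        best_so_far = np.maximum(best_so_far, effect_samples[:, choice])
    return selected

print(greedy_select(effect_samples, budget=10))
```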
Mixtures of large-scale dynamic functional brain network modes
Accurate temporal modelling of functional brain networks is essential in the quest to understand how such networks facilitate cognition. Researchers are beginning to adopt time-varying analyses for electrophysiological data that capture highly dynamic processes on the order of milliseconds. Typically, these approaches, such as clustering of functional connectivity profiles and Hidden Markov Modelling (HMM), assume mutual exclusivity of networks over time. Whilst a powerful constraint, this assumption may be compromising the ability of these approaches to describe the data effectively. Here, we propose a new generative model for functional connectivity as a time-varying linear mixture of spatially distributed statistical "modes". The temporal evolution of this mixture is governed by a recurrent neural network, which enables the model to generate data with a rich temporal structure. We use a Bayesian framework known as amortised variational inference to learn model parameters from observed data. We call the approach DyNeMo (for Dynamic Network Modes), and show using simulations that it outperforms the HMM when the assumption of mutual exclusivity is violated. In resting-state MEG, DyNeMo reveals a mixture of modes that activate on fast time scales of 100–150 ms, which is similar to the state lifetimes found using an HMM. In task MEG data, DyNeMo finds modes with plausible, task-dependent evoked responses without any knowledge of the task timings. Overall, DyNeMo provides decompositions that are an approximate remapping of the HMM's while showing improvements in overall explanatory power. However, the magnitude of these improvements suggests that the HMM's assumption of mutual exclusivity can be reasonable in practice. Nonetheless, DyNeMo provides a flexible framework for implementing and assessing future modelling developments.
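The toy simulation below illustrates the core generative assumption: the observed covariance at each time point is a linear mixture of fixed mode covariances, with mixing coefficients that evolve smoothly over time. A simple AR(1) recurrence on the mixing logits stands in for the learned recurrent network; this is an illustrative sketch, not the published DyNeMo model or its amortised variational inference procedure.

```python
# Toy generative sketch of the "mixture of modes" idea (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n_channels, n_modes, n_timepoints = 8, 3, 500

# Each mode contributes a fixed spatial covariance pattern.
mode_covs = np.stack(
    [np.cov(rng.normal(size=(n_channels, 200))) for _ in range(n_modes)]
)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# AR(1) recurrence on logits stands in for the recurrent network that governs
# the temporal evolution of the mixing coefficients.
logits = np.zeros(n_modes)
alphas = np.empty((n_timepoints, n_modes))
data = np.empty((n_timepoints, n_channels))
for t in range(n_timepoints):
    logits = 0.95 * logits + 0.3 * rng.normal(size=n_modes)
    alphas[t] = softmax(logits)                          # non-exclusive activations
    cov_t = np.tensordot(alphas[t], mode_covs, axes=1)   # time-varying linear mixture
    data[t] = rng.multivariate_normal(np.zeros(n_channels), cov_t)
```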
Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval
The ability to accurately model the fitness landscape of protein sequences is critical to a wide range of applications, from quantifying the effects of human variants on disease likelihood, to predicting immune-escape mutations in viruses and designing novel biotherapeutic proteins. Deep generative models of protein sequences trained on multiple sequence alignments have so far been the most successful approaches to these tasks. The performance of these methods is, however, contingent on the availability of sufficiently deep and diverse alignments for reliable training. Their potential scope is thus limited by the fact that many protein families are hard, if not impossible, to align. Large language models trained on massive quantities of non-aligned protein sequences from diverse families address these problems and show potential to eventually bridge the performance gap. We introduce Tranception, a novel transformer architecture leveraging autoregressive predictions and retrieval of homologous sequences at inference time to achieve state-of-the-art fitness prediction performance. Given its markedly higher performance on multiple mutants, robustness to shallow alignments, and ability to score indels, our approach offers a significant gain in scope over existing approaches. To enable more rigorous model testing across a broader range of protein families, we develop ProteinGym, an extensive set of multiplexed assays of variant effects, substantially increasing both the number and diversity of assays compared to existing benchmarks.
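The sketch below illustrates one way inference-time retrieval can be combined with an autoregressive model: per-position log-probabilities from the model are blended with smoothed amino-acid frequencies computed from retrieved, aligned homologs. The blending weight, pseudocounts, and input shapes are illustrative assumptions, not the exact scheme used by Tranception.

```python
# Simplified sketch of inference-time retrieval: blend model log-probabilities
# with empirical frequencies from retrieved homologs (illustrative weights).
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def retrieval_log_probs(homologs: list[str], pseudocount: float = 1.0) -> np.ndarray:
    """Per-position log frequencies of amino acids in retrieved (aligned) homologs."""
    length = len(homologs[0])
    counts = np.full((length, len(AMINO_ACIDS)), pseudocount)
    for seq in homologs:
        for pos, aa in enumerate(seq):
            if aa in AA_INDEX:              # skip gaps / unknown characters
                counts[pos, AA_INDEX[aa]] += 1
    return np.log(counts / counts.sum(axis=1, keepdims=True))

def blended_score(sequence: str, model_log_probs: np.ndarray,
                  homologs: list[str], alpha: float = 0.5) -> float:
    """Weighted average of model and retrieval log-probabilities, summed over positions.

    model_log_probs is assumed to have shape (sequence length, 20), holding the
    autoregressive model's per-position amino-acid log-probabilities.
    """
    retr = retrieval_log_probs(homologs)
    idx = np.array([AA_INDEX[aa] for aa in sequence])
    pos = np.arange(len(sequence))
    return float(((1 - alpha) * model_log_probs[pos, idx]
                  + alpha * retr[pos, idx]).sum())
```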