RITA: a Study on Scaling Up Generative Protein Sequence Models
In this work we introduce RITA: a suite of autoregressive generative models
for protein sequences, with up to 1.2 billion parameters, trained on over 280
million protein sequences belonging to the UniRef-100 database. Such generative
models hold the promise of greatly accelerating protein design. We conduct the
first systematic study of how capabilities evolve with model size for
autoregressive transformers in the protein domain: we evaluate RITA models in
next amino acid prediction, zero-shot fitness, and enzyme function prediction,
showing benefits from increased scale. We release the RITA models openly, for
the benefit of the research community.
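The zero-shot fitness evaluation mentioned above can be illustrated with a minimal sketch: an autoregressive model assigns a log-likelihood to a sequence by factorizing over positions, and the mutant-vs-wild-type log-likelihood ratio serves as a fitness proxy. The function `next_aa_logprob` below is a uniform placeholder standing in for a trained model such as RITA; the function names are illustrative, not from the paper.

```python
import math

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def next_aa_logprob(prefix: str, aa: str) -> float:
    # Placeholder conditional distribution p(aa | prefix).
    # A real autoregressive model would condition on `prefix`;
    # here we return a uniform probability for illustration.
    return math.log(1.0 / len(AMINO_ACIDS))

def sequence_logprob(seq: str) -> float:
    # Autoregressive factorization: log p(x) = sum_i log p(x_i | x_<i).
    return sum(next_aa_logprob(seq[:i], seq[i]) for i in range(len(seq)))

def zero_shot_fitness(wild_type: str, mutant: str) -> float:
    # Zero-shot fitness proxy: log-likelihood ratio of mutant vs. wild type.
    # No task-specific training is involved, hence "zero-shot".
    return sequence_logprob(mutant) - sequence_logprob(wild_type)
```

With a real model, a positive score suggests the mutant is more plausible under the learned sequence distribution than the wild type; under the uniform placeholder, equal-length sequences score identically.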
Tranception: protein fitness prediction with autoregressive transformers and inference-time retrieval
The ability to accurately model the fitness landscape of protein sequences is
critical to a wide range of applications, from quantifying the effects of human
variants on disease likelihood, to predicting immune-escape mutations in
viruses and designing novel biotherapeutic proteins. Deep generative models of
protein sequences trained on multiple sequence alignments have been the most
successful approaches so far to address these tasks. The performance of these
methods is however contingent on the availability of sufficiently deep and
diverse alignments for reliable training. Their potential scope is thus limited
by the fact that many protein families are hard, if not impossible, to align.
Large
language models trained on massive quantities of non-aligned protein sequences
from diverse families address these problems and show potential to eventually
bridge the performance gap. We introduce Tranception, a novel transformer
architecture leveraging autoregressive predictions and retrieval of homologous
sequences at inference to achieve state-of-the-art fitness prediction
performance. Given its markedly higher performance on multiple mutants,
robustness to shallow alignments and ability to score indels, our approach
offers significant gain of scope over existing approaches. To enable more
rigorous model testing across a broader range of protein families, we develop
ProteinGym -- an extensive set of multiplexed assays of variant effects,
substantially increasing both the number and diversity of assays compared to
existing benchmarks.

Comment: ICML 202
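The inference-time retrieval described in the Tranception abstract can be sketched as follows: position-specific amino-acid frequencies are estimated from retrieved homologous sequences, then mixed with the autoregressive model's prediction. The mixture weight `alpha` and the helper names are illustrative assumptions, not the paper's exact formulation.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def retrieval_logprob(homologs: list[str], position: int, aa: str,
                      pseudocount: float = 1.0) -> float:
    # Position-specific amino-acid frequency from retrieved homologous
    # sequences, smoothed with a pseudocount so unseen residues keep
    # nonzero probability.
    counts = Counter(h[position] for h in homologs if position < len(h))
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return math.log((counts[aa] + pseudocount) / total)

def fused_logprob(model_logprob: float, retr_logprob: float,
                  alpha: float = 0.6) -> float:
    # Illustrative fusion: a probability-space mixture of the model's
    # prediction and the retrieval-based distribution.
    return math.log(alpha * math.exp(model_logprob)
                    + (1.0 - alpha) * math.exp(retr_logprob))
```

Because the retrieval step runs at inference, the same trained model can benefit from whatever homologs are available for a given family, without retraining.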