To Transformers and Beyond: Large Language Models for the Genome
In the rapidly evolving landscape of genomics, deep learning has emerged as a
useful tool for tackling complex computational challenges. This review focuses
on the transformative role of Large Language Models (LLMs), which are mostly
based on the transformer architecture, in genomics. Building on the foundation
of traditional convolutional neural networks and recurrent neural networks, we
explore both the strengths and limitations of transformers and other LLMs for
genomics. Additionally, we contemplate the future of genomic modeling beyond
the transformer architecture based on current trends in research. The paper
aims to serve as a guide for computational biologists and computer scientists
interested in LLMs for genomic data. We hope the paper can also serve as an
educational introduction and discussion for biologists on a fundamental shift
in how we will analyze genomic data in the future.
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
Autoregressive Transformers adopted in Large Language Models (LLMs) are hard
to scale to long sequences. Despite several works trying to reduce their
computational cost, most LLMs still compute attention between all pairs of
tokens in the sequence, incurring a quadratic cost. In this study, we
present a novel approach that dynamically prunes contextual information while
preserving the model's expressiveness, resulting in reduced memory and
computational requirements during inference. Our method employs a learnable
mechanism that determines which uninformative tokens can be dropped from the
context at any point across the generation process. By doing so, our approach
not only addresses performance concerns but also enhances interpretability,
providing valuable insight into the model's decision-making process. Our
technique can be applied to existing pre-trained models through a
straightforward fine-tuning process, and the pruning strength can be specified
by a sparsity parameter. Notably, our empirical findings demonstrate that we
can effectively prune up to 80\% of the context without significant performance
degradation on downstream tasks, offering a valuable tool for mitigating
inference costs. Our reference implementation yields increased inference
throughput and even greater memory savings.
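The abstract above describes a learnable mechanism that drops uninformative tokens from the context during generation, with the pruning strength set by a sparsity parameter. The following minimal Python sketch illustrates the general idea under our own assumptions; the ContextPruner module, its gating scheme, and the top-k selection are illustrative stand-ins, not the paper's actual implementation.

import torch
import torch.nn as nn

class ContextPruner(nn.Module):
    # Scores each cached token with a small learnable gate and keeps only the
    # highest-scoring fraction, so attention runs over a shorter context.
    # Hypothetical sketch, not the paper's implementation.
    def __init__(self, d_model: int, sparsity: float = 0.8):
        super().__init__()
        self.gate = nn.Linear(d_model, 1)  # learnable per-token importance score
        self.sparsity = sparsity           # fraction of context tokens to drop

    def forward(self, cache: torch.Tensor) -> torch.Tensor:
        # cache: (batch, seq_len, d_model) hidden states of past tokens
        scores = self.gate(cache).squeeze(-1)                        # (batch, seq_len)
        n_keep = max(1, int(cache.size(1) * (1.0 - self.sparsity)))
        keep = scores.topk(n_keep, dim=-1).indices.sort(dim=-1).values
        batch_idx = torch.arange(cache.size(0)).unsqueeze(-1)
        return cache[batch_idx, keep]                                # pruned context

pruner = ContextPruner(d_model=64, sparsity=0.8)
print(pruner(torch.randn(2, 100, 64)).shape)  # torch.Size([2, 20, 64])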
Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator
The transformer model is known to be computationally demanding, and
prohibitively costly for long sequences, as the self-attention module has
quadratic time and space complexity with respect to sequence length. Many
researchers have focused on designing new forms of self-attention or
introducing new parameters to overcome this limitation; however, many of these
approaches prevent the model from inheriting weights from large pretrained
models. In this work, we address the transformer's inefficiency from a
different perspective. We propose the Fourier Transformer, a simple yet
effective approach that progressively removes redundancies in the hidden
sequence using the ready-made Fast Fourier Transform (FFT) operator to perform
the Discrete Cosine Transform (DCT). The Fourier Transformer significantly
reduces computational costs while retaining the ability to inherit weights
from various large pretrained models.
Experiments show that our model achieves state-of-the-art performance among
all transformer-based models on the long-range modeling benchmark LRA, with
significant improvements in both speed and memory usage. For generative
sequence-to-sequence tasks, including CNN/DailyMail and ELI5, our model
outperforms the standard BART and other efficient models by inheriting the
BART weights. \footnote{Our code is publicly available at
\url{https://github.com/LUMIA-Group/FourierTransformer}.}
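To make the mechanism concrete, here is a minimal sketch, under our own assumptions, of shortening a hidden-state sequence with a DCT and keeping only the low-frequency components. The function name and the keep_ratio parameter are hypothetical, and scipy's FFT-based DCT stands in for whatever operator the released code actually uses.

import numpy as np
from scipy.fft import dct, idct  # DCT/IDCT computed with ready-made FFT machinery

def downsample_hidden_states(hidden: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    # hidden: (seq_len, d_model). Returns roughly (seq_len * keep_ratio, d_model).
    # Illustrative sketch, not the Fourier Transformer's actual code.
    seq_len = hidden.shape[0]
    n_keep = max(1, int(seq_len * keep_ratio))
    # DCT over the sequence axis; redundancy concentrates in low frequencies.
    freq = dct(hidden, type=2, norm="ortho", axis=0)
    truncated = freq[:n_keep]
    # Inverse DCT maps the truncated spectrum back to a shorter sequence,
    # which downstream transformer layers can consume as usual.
    return idct(truncated, type=2, norm="ortho", axis=0)

shortened = downsample_hidden_states(np.random.randn(128, 16), keep_ratio=0.25)
print(shortened.shape)  # (32, 16)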
Improving Molecular Pretraining with Complementary Featurizations
Molecular pretraining, which learns molecular representations over massive
unlabeled data, has become a prominent paradigm to solve a variety of tasks in
computational chemistry and drug discovery. Recently, considerable progress has
been made in molecular pretraining with different molecular featurizations,
including 1D SMILES strings, 2D graphs, and 3D geometries. However, the role of
molecular featurizations with their corresponding neural architectures in
molecular pretraining remains largely unexamined. In this paper, through two
case studies -- chirality classification and aromatic ring counting -- we first
demonstrate that different featurization techniques convey chemical information
differently. In light of this observation, we propose a simple and effective
MOlecular pretraining framework with COmplementary featurizations (MOCO). MOCO
comprehensively leverages multiple featurizations that complement each other
and outperforms existing state-of-the-art models that rely on only one or two
featurizations on a wide range of molecular property prediction tasks.
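As an illustration of how complementary featurizations might be combined, the sketch below encodes several views of a molecule and fuses them with learned weights. The encoder stubs, the softmax fusion, and all names are our own assumptions rather than MOCO's actual architecture.

import torch
import torch.nn as nn

class MultiViewFusion(nn.Module):
    # One placeholder encoder per featurization (e.g. 1D SMILES, 2D graph,
    # 3D geometry), fused with learned mixing weights. Hypothetical sketch.
    def __init__(self, d_model: int, n_views: int = 3):
        super().__init__()
        self.encoders = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_views))
        self.view_logits = nn.Parameter(torch.zeros(n_views))  # learned view weights

    def forward(self, views: list[torch.Tensor]) -> torch.Tensor:
        # views[i]: (batch, d_model) features from featurization i
        encoded = torch.stack([enc(v) for enc, v in zip(self.encoders, views)], dim=0)
        weights = torch.softmax(self.view_logits, dim=0).view(-1, 1, 1)
        return (weights * encoded).sum(dim=0)  # (batch, d_model) fused representation

fusion = MultiViewFusion(d_model=32)
molecule_views = [torch.randn(4, 32) for _ in range(3)]
print(fusion(molecule_views).shape)  # torch.Size([4, 32])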
HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution
Genomic (DNA) sequences encode an enormous amount of information for gene
regulation and protein synthesis. Similar to natural language models,
researchers have proposed foundation models in genomics to learn generalizable
features from unlabeled genome data that can then be fine-tuned for downstream
tasks such as identifying regulatory elements. Due to the quadratic scaling of
attention, previous Transformer-based genomic models have used 512 to 4k tokens
as context (<0.001% of the human genome), significantly limiting the modeling
of long-range interactions in DNA. In addition, these methods rely on
tokenizers to aggregate meaningful DNA units, losing single nucleotide
resolution where subtle genetic variations can completely alter protein
function via single nucleotide polymorphisms (SNPs). Recently, Hyena, a large
language model based on implicit convolutions, was shown to match attention in
quality while allowing longer context lengths and lower time complexity.
Leveraging Hyena's new long-range capabilities, we present HyenaDNA, a genomic
foundation model pretrained on the human reference genome with context lengths
of up to 1 million tokens at the single-nucleotide level, an up to 500x
increase over previous dense attention-based models. HyenaDNA scales
sub-quadratically in sequence length (training up to 160x faster than
Transformer), uses single nucleotide tokens, and has full global context at
each layer. We explore what longer context enables - including the first use of
in-context learning in genomics for simple adaptation to novel tasks without
updating pretrained model weights. On fine-tuned benchmarks from the Nucleotide
Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 17 datasets
using a model with orders of magnitude fewer parameters and less pretraining
data. On the GenomicBenchmarks, HyenaDNA surpasses SotA on all 8 datasets,
improving accuracy by an average of +9 points.
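One concrete ingredient in the abstract is single-nucleotide tokenization: every base becomes its own token, so single-nucleotide changes remain visible to the model. The small sketch below shows what such a character-level tokenizer could look like; the vocabulary and function are illustrative, not HyenaDNA's actual tokenizer.

# Hypothetical single-nucleotide tokenizer: one token id per base, no k-mer
# aggregation, so a SNP changes exactly one token.
NUCLEOTIDE_VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}  # N = unknown base

def tokenize_dna(sequence: str) -> list[int]:
    # Map a DNA string to one token id per nucleotide.
    return [NUCLEOTIDE_VOCAB.get(base, NUCLEOTIDE_VOCAB["N"]) for base in sequence.upper()]

reference = "ACGTTACG"
variant   = "ACGTCACG"  # single-nucleotide change at position 4
print(tokenize_dna(reference))  # [0, 1, 2, 3, 3, 0, 1, 2]
print(tokenize_dna(variant))    # [0, 1, 2, 3, 1, 0, 1, 2]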
Reducing Sequence Length by Predicting Edit Operations with Large Language Models
Large Language Models (LLMs) have demonstrated remarkable performance in
various tasks and gained significant attention. LLMs are also used for local
sequence transduction tasks, including grammatical error correction (GEC) and
formality style transfer, where most tokens in a source text are kept
unchanged. However, it is inefficient to generate all target tokens, because a
prediction error in one target token can derail the prediction of subsequent
tokens and because the computational cost grows quadratically with the target
sequence length. This paper proposes predicting a set of edit operations on the
source text for local sequence transduction tasks.
Representing an edit operation with a span of the source text and changed
tokens, we can reduce the length of the target sequence and thus the
computational cost for inference. We apply instruction tuning for LLMs on the
supervision data of edit operations. Experiments show that the proposed method
achieves performance comparable to the baseline on four tasks (paraphrasing,
formality style transfer, GEC, and text simplification), despite reducing the
length of the target text by as little as 21\%. Furthermore, instruction tuning
with the proposed method achieves state-of-the-art performance on these four
tasks.
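The core idea, representing an edit as a source-text span plus the changed tokens and generating only those edits, can be illustrated with the following sketch; the EditOp dataclass and apply_edits helper are hypothetical names, not the paper's output format.

from dataclasses import dataclass

@dataclass
class EditOp:
    # Hypothetical edit representation: a source span plus its replacement.
    start: int        # start of the source span (character offset, inclusive)
    end: int          # end of the source span (exclusive)
    replacement: str  # changed tokens; empty string means deletion

def apply_edits(source: str, edits: list[EditOp]) -> str:
    # Apply non-overlapping edits; text outside the spans is kept unchanged.
    result, cursor = [], 0
    for op in sorted(edits, key=lambda e: e.start):
        result.append(source[cursor:op.start])  # untouched prefix
        result.append(op.replacement)           # changed tokens
        cursor = op.end
    result.append(source[cursor:])
    return "".join(result)

source = "He go to school yesterday"
edits = [EditOp(3, 5, "went")]  # only the changed span needs to be generated
print(apply_edits(source, edits))  # "He went to school yesterday"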