FNet: Mixing Tokens with Fourier Transforms
We show that Transformer encoder architectures can be massively sped up, with
limited accuracy costs, by replacing the self-attention sublayers with simple
linear transformations that "mix" input tokens. These linear transformations,
along with standard nonlinearities in feed-forward layers, prove competent at
modeling semantic relationships in several text classification tasks. Most
surprisingly, we find that replacing the self-attention sublayer in a
Transformer encoder with a standard, unparameterized Fourier Transform achieves
92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains
nearly seven times faster on GPUs and twice as fast on TPUs. The resulting
model, FNet, also scales very efficiently to long inputs. Specifically, when
compared to the "efficient" Transformers on the Long Range Arena benchmark,
FNet matches the accuracy of the most accurate models, but is faster than the
fastest models across all sequence lengths on GPUs (and across relatively
shorter lengths on TPUs). Finally, FNet has a light memory footprint and is
particularly efficient at smaller model sizes: for a fixed speed and accuracy
budget, small FNet models outperform Transformer counterparts.
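As a concrete illustration of the mixing step described above, here is a minimal NumPy sketch of a parameter-free Fourier token-mixing sublayer: a 2D discrete Fourier transform over the sequence and hidden dimensions, keeping only the real part. The function name and shapes are illustrative and not taken from the FNet codebase.

```python
import numpy as np

def fourier_mix(x: np.ndarray) -> np.ndarray:
    """Parameter-free token mixing: 2D DFT over sequence and hidden dims, real part only.

    x: array of shape (seq_len, d_model) holding token representations.
    This stands in for the self-attention sublayer; the feed-forward
    sublayer of the Transformer block is left unchanged.
    """
    return np.fft.fft2(x).real  # FFT along both axes, discard the imaginary part

# Toy usage: 8 tokens, 16-dimensional embeddings.
tokens = np.random.randn(8, 16)
mixed = fourier_mix(tokens)
assert mixed.shape == tokens.shape
```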
GLIMMER: generalized late-interaction memory reranker
Memory-augmentation is a powerful approach for efficiently incorporating
external information into language models, but leads to reduced performance
relative to retrieving text. Recent work introduced LUMEN, a memory-retrieval
hybrid that partially pre-computes memory and updates memory representations on
the fly with a smaller live encoder.
We propose GLIMMER, which improves on this approach through 1) exploiting
free access to the powerful memory representations by applying a shallow
reranker on top of memory to drastically improve retrieval quality at low cost,
and 2) incorporating multi-task training to learn a general and higher quality
memory and live encoder. GLIMMER achieves strong gains in performance at faster
speeds compared to LUMEN and FiD on the KILT benchmark of knowledge-intensive
tasks.
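A rough sketch of the reranking-over-memory idea, under the assumption of pooled vector representations: score each passage's pre-computed memory representation against a question vector and keep only the top-k passages for the more expensive live encoding. The dot-product scorer below is an illustrative stand-in for the shallow learned reranker the abstract mentions, not GLIMMER's actual module.

```python
import numpy as np

def rerank_memory(question_vec: np.ndarray,
                  memory_reps: np.ndarray,
                  k: int) -> np.ndarray:
    """Shallow reranking over pre-computed memory (illustrative sketch).

    question_vec: (d,) pooled question representation.
    memory_reps:  (n_passages, d) pooled pre-computed passage memories.
    Returns indices of the k highest-scoring passages, which are then
    passed on to the live encoder.
    """
    scores = memory_reps @ question_vec   # cheap scores over cached memory
    return np.argsort(-scores)[:k]        # keep only the top-k passages

# Toy usage: rerank 100 cached passages down to 10.
q = np.random.randn(128)
mem = np.random.randn(100, 128)
top = rerank_memory(q, mem, k=10)
```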
The impact of hypoxaemia on vascular function in lowlanders and high altitude indigenous populations
Pre-computed memory or on-the-fly encoding? A hybrid approach to retrieval augmentation makes the most of your compute
Retrieval-augmented language models such as Fusion-in-Decoder are powerful,
setting the state of the art on a variety of knowledge-intensive tasks.
However, they are also expensive, due to the need to encode a large number of
retrieved passages. Some work avoids this cost by pre-encoding a text corpus
into a memory and retrieving dense representations directly. However,
pre-encoding memory incurs a severe quality penalty as the memory
representations are not conditioned on the current input. We propose LUMEN, a
hybrid between these two extremes, pre-computing the majority of the retrieval
representation and completing the encoding on the fly using a live encoder that
is conditioned on the question and fine-tuned for the task. We show that LUMEN
significantly outperforms pure memory on multiple question-answering tasks
while being much cheaper than FiD, and outperforms both for any given compute
budget. Moreover, the advantage of LUMEN over FiD increases with model size.
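A schematic sketch of the hybrid described above: most of the encoding is pre-computed offline with a large memory encoder, and a smaller live encoder finishes the job at query time conditioned on the question. The random projections below are placeholders for LUMEN's Transformer encoders, and the additive question conditioning is an illustrative simplification.

```python
import numpy as np

# Placeholder "encoders": in LUMEN these are a large (memory) and a small
# (live) Transformer stack; random projections stand in for them here.
rng = np.random.default_rng(0)
W_memory = rng.standard_normal((128, 128))  # large memory encoder (offline, frozen)
W_live = rng.standard_normal((128, 128))    # small live encoder (online, fine-tuned)

def precompute_memory(passages: np.ndarray) -> np.ndarray:
    """Run once over the corpus and store the result (most of the compute)."""
    return passages @ W_memory

def live_encode(question: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """At query time, cheaply condition the cached memory on the question."""
    conditioned = memory + question        # inject question information (simplified)
    return conditioned @ W_live

passages = rng.standard_normal((100, 128))  # 100 retrieved-passage vectors
memory = precompute_memory(passages)        # offline, amortized over all queries
question = rng.standard_normal(128)
final_reps = live_encode(question, memory)  # cheap online step
```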
FiDO: Fusion-in-Decoder optimized for stronger performance and faster inference
Fusion-in-Decoder (FiD) is a powerful retrieval-augmented language model that
sets the state-of-the-art on many knowledge-intensive NLP tasks. However, the
architecture used for FiD was chosen by making minimal modifications to a
standard T5 model, which our analysis shows to be highly suboptimal for a
retrieval-augmented model. In particular, FiD allocates the bulk of FLOPs to
the encoder, while the majority of inference time results from memory bandwidth
constraints in the decoder. We propose two simple changes to the FiD
architecture to alleviate memory bandwidth constraints, and speed up inference
by 7x. This allows us to use a much larger decoder at modest cost. We denote
FiD with the above modifications as FiDO, and show that it strongly improves
performance over existing FiD models for a wide range of inference budgets. For
example, FiDO-Large-XXL performs faster inference than FiD-Base and achieves
better performance than FiD-Large. (Comment: ACL Findings 2023)
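The abstract does not spell out the two architectural changes. One technique commonly used to relieve decoder memory-bandwidth pressure, and to our understanding one of the modifications adopted in FiDO, is multi-query attention, in which all heads share a single key/value projection so far less key/value data is read per decoding step. The sketch below is a simplified, mask-free illustration of that idea, not FiDO's implementation.

```python
import numpy as np

def multi_query_attention(x, Wq, Wk, Wv, n_heads):
    """Multi-query attention sketch: per-head queries, one shared K/V.

    x:  (seq, d_model) decoder inputs.
    Wq: (d_model, n_heads * d_head); Wk, Wv: (d_model, d_head), shared by all heads.
    Sharing K/V shrinks the key/value cache that must be read at every
    decoding step. Causal masking is omitted for brevity.
    """
    seq, _ = x.shape
    d_head = Wk.shape[1]
    q = (x @ Wq).reshape(seq, n_heads, d_head)    # (seq, heads, d_head)
    k = x @ Wk                                    # (seq, d_head), shared
    v = x @ Wv                                    # (seq, d_head), shared
    scores = np.einsum("shd,td->hst", q, k) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    out = np.einsum("hst,td->shd", weights, v)    # (seq, heads, d_head)
    return out.reshape(seq, n_heads * d_head)

# Toy usage.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))
Wq = rng.standard_normal((64, 8 * 16))
Wk = rng.standard_normal((64, 16))
Wv = rng.standard_normal((64, 16))
y = multi_query_attention(x, Wq, Wk, Wv, n_heads=8)
```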
LongT5: Efficient Text-To-Text Transformer for Long Sequences
Recent work has shown that either (1) increasing the input length or (2)
increasing model size can improve the performance of Transformer-based neural
models. In this paper, we present a new model, called LongT5, with which we
explore the effects of scaling both the input length and model size at the same
time. Specifically, we integrated attention ideas from long-input transformers
(ETC), and adopted pre-training strategies from summarization pre-training
(PEGASUS) into the scalable T5 architecture. The result is a new attention
mechanism we call {\em Transient Global} (TGlobal), which mimics ETC's
local/global attention mechanism, but without requiring additional side-inputs.
We are able to achieve state-of-the-art results on several summarization tasks
and outperform the original T5 models on question answering tasks. (Comment: Accepted in NAACL 2022)
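A simplified sketch of the Transient Global idea: at each layer, blocks of input tokens are summarized into transient global tokens (a plain mean here; the paper's aggregation and normalization details differ), and every token attends to its local window plus these global tokens rather than the full sequence. Block size, window radius, and function names below are illustrative.

```python
import numpy as np

def transient_global_tokens(x: np.ndarray, block_size: int) -> np.ndarray:
    """Build side-input-free global tokens from the current layer's inputs.

    x: (seq_len, d_model) token representations.
    Each block of `block_size` tokens is summarized into one transient
    global token (by a mean in this sketch).
    """
    seq_len, d_model = x.shape
    n_blocks = -(-seq_len // block_size)            # ceiling division
    pad = n_blocks * block_size - seq_len
    padded = np.pad(x, ((0, pad), (0, 0)))
    blocks = padded.reshape(n_blocks, block_size, d_model)
    return blocks.mean(axis=1)                      # (n_blocks, d_model)

def attention_keys_for_token(i, x, globals_, window):
    """Keys visible to token i: its local neighbourhood plus all global tokens."""
    lo, hi = max(0, i - window), min(len(x), i + window + 1)
    return np.concatenate([x[lo:hi], globals_], axis=0)

# Toy usage: 1,000 tokens, blocks of 16, local radius 64.
x = np.random.randn(1000, 64)
g = transient_global_tokens(x, block_size=16)
keys_for_token_500 = attention_keys_for_token(500, x, g, window=64)
```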
MEMORY-VQ: Compression for Tractable Internet-Scale Memory
Retrieval augmentation is a powerful but expensive method to make language
models more knowledgeable about the world. Memory-based methods like LUMEN
pre-compute token representations for retrieved passages to drastically speed
up inference. However, memory also leads to much greater storage requirements
from storing pre-computed representations.
We propose MEMORY-VQ, a new method to reduce storage requirements of
memory-augmented models without sacrificing performance. Our method uses a
vector quantization variational autoencoder (VQ-VAE) to compress token
representations. We apply MEMORY-VQ to the LUMEN model to obtain LUMEN-VQ, a
memory model that achieves a 16x compression rate with comparable performance
on the KILT benchmark. LUMEN-VQ enables practical retrieval augmentation even
for extremely large retrieval corpora.
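A minimal sketch of the storage-saving step: each pre-computed token representation is vector-quantized against a learned codebook, and only the integer codes are stored; the vectors are looked back up at inference time. MEMORY-VQ trains the codebook with a VQ-VAE and splits vectors into sub-vectors; the single flat codebook below, and its size, are simplifying assumptions.

```python
import numpy as np

def quantize(reps: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each token representation to its nearest codebook entry (VQ step).

    reps:     (n_tokens, d) pre-computed memory token representations.
    codebook: (n_codes, d) learned codebook.
    Storing the returned small integers instead of float vectors is where
    the compression comes from.
    """
    d2 = ((reps[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1).astype(np.uint16)

def dequantize(codes: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Reconstruct approximate memory representations at inference time."""
    return codebook[codes]

# Toy usage: 1,024 cached token vectors, a 256-entry codebook.
rng = np.random.default_rng(0)
reps = rng.standard_normal((1024, 64)).astype(np.float32)
codebook = rng.standard_normal((256, 64)).astype(np.float32)
codes = quantize(reps, codebook)      # stored: one small integer per token
approx = dequantize(codes, codebook)  # used in place of the original reps
```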