73 research outputs found

Elemental Landscapes

Author: Murray Ainslie
Stack Cathe
Zeunert Joshua
Publication date: 20/04/2023

FNet: Mixing Tokens with Fourier Transforms

Author: Ainslie Joshua
Eckstein Ilya
Lee-Thorp James
Ontanon Santiago
Publication date: 18/06/2021

We show that Transformer encoder architectures can be massively sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that "mix" input tokens. These linear transformations, along with standard nonlinearities in feed-forward layers, prove competent at modeling semantic relationships in several text classification tasks. Most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains nearly seven times faster on GPUs and twice as fast on TPUs. The resulting model, FNet, also scales very efficiently to long inputs. Specifically, when compared to the "efficient" Transformers on the Long Range Arena benchmark, FNet matches the accuracy of the most accurate models, but is faster than the fastest models across all sequence lengths on GPUs (and across relatively shorter lengths on TPUs). Finally, FNet has a light memory footprint and is particularly efficient at smaller model sizes: for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.

arXiv.org e-Print Archive
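
As a rough illustration of the token mixing described in the FNet abstract above, the sketch below applies a discrete Fourier transform along the hidden dimension and then along the sequence dimension, and keeps only the real part. The two-dimensional transform and the real-part step are assumptions here, since the abstract only says the transform is standard and unparameterized. The names `dft` and `fourier_mixing` are hypothetical, and the naive O(n^2) DFT stands in for an FFT purely for readability.

```python
import cmath


def dft(values):
    """Naive discrete Fourier transform of a 1-D sequence (no learned weights)."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]


def fourier_mixing(x):
    """Mix a (seq_len x hidden) list of lists with a 2-D DFT, keeping the real part.

    Assumption: transform along the hidden dimension first, then along the
    sequence dimension. The layer has no trainable parameters.
    """
    # DFT over the hidden dimension: one transform per token (row).
    rows = [dft(row) for row in x]
    # DFT over the sequence dimension: one transform per hidden unit (column).
    cols = [dft(list(col)) for col in zip(*rows)]
    # Transpose back to (seq_len x hidden) and drop the imaginary part.
    return [[cols[h][s].real for h in range(len(cols))] for s in range(len(x))]


if __name__ == "__main__":
    tokens = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 2.0]]  # toy input, 2 tokens x 4 dims
    for row in fourier_mixing(tokens):
        print([round(v, 3) for v in row])
```

The output has the same shape as the input, so under these assumptions the function could take the place of a self-attention sublayer in front of the usual feed-forward layers.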

GLIMMER: generalized late-interaction memory reranker

Author: Ainslie Joshua
Cohen William W.
de Jong Michiel
FitzGerald Nicholas
Sanghai Sumit
Zemlyanskiy Yury
Publication date: 16/06/2023

Memory-augmentation is a powerful approach for efficiently incorporating external information into language models, but leads to reduced performance relative to retrieving text. Recent work introduced LUMEN, a memory-retrieval hybrid that partially pre-computes memory and updates memory representations on the fly with a smaller live encoder. We propose GLIMMER, which improves on this approach through 1) exploiting free access to the powerful memory representations by applying a shallow reranker on top of memory to drastically improve retrieval quality at low cost, and 2) incorporating multi-task training to learn a general and higher quality memory and live encoder. GLIMMER achieves strong gains in performance at faster speeds compared to LUMEN and FiD on the KILT benchmark of knowledge-intensive tasks.

arXiv.org e-Print Archive

The impact of hypoxaemia on vascular function in lowlanders and high altitude indigenous populations

Author: Ainslie Philip N.
Bailey Damian M.
Green Daniel J.
Tremblay Joshua C.
Tymko Michael M.
Publication venue: Wiley
Publication date: 02/11/2019
University of South Wales Research Explorer

Pre-computed memory or on-the-fly encoding? A hybrid approach to retrieval augmentation makes the most of your compute

Author: Ainslie Joshua
Cohen William
de Jong Michiel
FitzGerald Nicholas
Sanghai Sumit
Sha Fei
Zemlyanskiy Yury
Publication date: 25/01/2023

Retrieval-augmented language models such as Fusion-in-Decoder are powerful, setting the state of the art on a variety of knowledge-intensive tasks. However, they are also expensive, due to the need to encode a large number of retrieved passages. Some work avoids this cost by pre-encoding a text corpus into a memory and retrieving dense representations directly. However, pre-encoding memory incurs a severe quality penalty as the memory representations are not conditioned on the current input. We propose LUMEN, a hybrid between these two extremes, pre-computing the majority of the retrieval representation and completing the encoding on the fly using a live encoder that is conditioned on the question and fine-tuned for the task. We show that LUMEN significantly outperforms pure memory on multiple question-answering tasks while being much cheaper than FiD, and outperforms both for any given compute budget. Moreover, the advantage of LUMEN over FiD increases with model size.

arXiv.org e-Print Archive

FiDO: Fusion-in-Decoder optimized for stronger performance and faster inference

Author: Ainslie Joshua
Cohen William
de Jong Michiel
FitzGerald Nicholas
Sanghai Sumit
Sha Fei
Zemlyanskiy Yury
Publication date: 02/06/2023

Fusion-in-Decoder (FiD) is a powerful retrieval-augmented language model that sets the state-of-the-art on many knowledge-intensive NLP tasks. However, the architecture used for FiD was chosen by making minimal modifications to a standard T5 model, which our analysis shows to be highly suboptimal for a retrieval-augmented model. In particular, FiD allocates the bulk of FLOPs to the encoder, while the majority of inference time results from memory bandwidth constraints in the decoder. We propose two simple changes to the FiD architecture to alleviate memory bandwidth constraints, and speed up inference by 7x. This allows us to use a much larger decoder at modest cost. We denote FiD with the above modifications as FiDO, and show that it strongly improves performance over existing FiD models for a wide range of inference budgets. For example, FiDO-Large-XXL performs faster inference than FiD-Base and achieves better performance than FiD-Large. Comment: ACL Findings 202

arXiv.org e-Print Archive

LongT5: Efficient Text-To-Text Transformer for Long Sequences

Author: Ainslie Joshua
Guo Mandy
Ni Jianmo
Ontanon Santiago
Sung Yun-Hsuan
Uthus David
Yang Yinfei
Publication date: 03/05/2022

Recent work has shown that either (1) increasing the input length or (2) increasing model size can improve the performance of Transformer-based neural models. In this paper, we present a new model, called LongT5, with which we explore the effects of scaling both the input length and model size at the same time. Specifically, we integrated attention ideas from long-input transformers (ETC), and adopted pre-training strategies from summarization pre-training (PEGASUS) into the scalable T5 architecture. The result is a new attention mechanism we call Transient Global (TGlobal), which mimics ETC's local/global attention mechanism, but without requiring additional side-inputs. We are able to achieve state-of-the-art results on several summarization tasks and outperform the original T5 models on question answering tasks. Comment: Accepted in NAACL 202

arXiv.org e-Print Archive

MEMORY-VQ: Compression for Tractable Internet-Scale Memory

Author: Ainslie Joshua
Cohen William W.
de Jong Michiel
Ontañón Santiago
Sanghai Sumit
Vilnis Luke
Zemlyanskiy Yury
Publication date: 28/08/2023

Retrieval augmentation is a powerful but expensive method to make language models more knowledgeable about the world. Memory-based methods like LUMEN pre-compute token representations for retrieved passages to drastically speed up inference. However, memory also leads to much greater storage requirements from storing pre-computed representations. We propose MEMORY-VQ, a new method to reduce storage requirements of memory-augmented models without sacrificing performance. Our method uses a vector quantization variational autoencoder (VQ-VAE) to compress token representations. We apply MEMORY-VQ to the LUMEN model to obtain LUMEN-VQ, a memory model that achieves a 16x compression rate with comparable performance on the KILT benchmark. LUMEN-VQ enables practical retrieval augmentation even for extremely large retrieval corpora.

arXiv.org e-Print Archive
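
To make the vector quantization idea in the MEMORY-VQ abstract above more concrete, the sketch below shows only the generic lookup step: each token vector is stored as the index of its nearest codebook entry and later reconstructed from that index. This is not the authors' VQ-VAE. The hand-written codebook, the squared Euclidean distance, and the function names `quantize` and `reconstruct` are illustrative assumptions, and a real system would learn the codebook during training.

```python
def quantize(vectors, codebook):
    """Return, for each vector, the index of the closest codebook entry.

    Assumption: closeness is measured by squared Euclidean distance.
    """
    codes = []
    for v in vectors:
        dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        codes.append(dists.index(min(dists)))
    return codes


def reconstruct(codes, codebook):
    """Map stored integer codes back to their (approximate) vectors."""
    return [codebook[i] for i in codes]


if __name__ == "__main__":
    # Hypothetical toy codebook with two entries; real codebooks are learned.
    codebook = [[0.0, 0.0], [1.0, 1.0]]
    token_vectors = [[0.1, -0.1], [0.9, 1.2], [0.4, 0.3]]
    codes = quantize(token_vectors, codebook)
    print(codes)
    print(reconstruct(codes, codebook))
```

Storing one small integer per vector instead of the full floating-point vector is where the storage saving comes from; the compression rate reported in the abstract depends on details the abstract does not give.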