444 research outputs found
Sparsifying Transformer Models with Trainable Representation Pooling
We propose a novel method to sparsify attention in the Transformer model by
learning to select the most-informative token representations during the
training process, thus focusing on task-specific parts of the input. A
reduction of quadratic time and memory complexity to sublinear was achieved due
to a robust trainable top-k operator. For example, our experiments on a
challenging summarization task of long documents show that our method is over 3
times faster and up to 16 times more memory efficient while significantly
outperforming both dense and state-of-the-art sparse transformer models. The
method can be effortlessly applied to many models used in NLP and CV,
simultaneously with other improvements.Comment: Provided formal overview. Reevaluated with Google Research scrip
Blockwise Parallel Transformer for Long Context Large Models
Transformers have emerged as the cornerstone of state-of-the-art natural
language processing models, showcasing exceptional performance across a wide
range of AI applications. However, the memory demands posed by the
self-attention mechanism and the large feedforward network in Transformers
limit their ability to handle long sequences, thereby creating challenges for
tasks involving multiple long sequences or long-term dependencies. We present a
distinct approach, Blockwise Parallel Transformer (BPT), that leverages
blockwise computation of self-attention and feedforward network fusion to
minimize memory costs. By processing longer input sequences while maintaining
memory efficiency, BPT enables training sequences up to 32 times longer than
vanilla Transformers and 2 to 4 times longer than previous memory-efficient
methods. Extensive experiments on language modeling and reinforcement learning
tasks demonstrate the effectiveness of BPT in reducing memory requirements and
improving performance
Long-Form End-to-End Speech Translation via Latent Alignment Segmentation
Current simultaneous speech translation models can process audio only up to a
few seconds long. Contemporary datasets provide an oracle segmentation into
sentences based on human-annotated transcripts and translations. However, the
segmentation into sentences is not available in the real world. Current speech
segmentation approaches either offer poor segmentation quality or have to trade
latency for quality. In this paper, we propose a novel segmentation approach
for a low-latency end-to-end speech translation. We leverage the existing
speech translation encoder-decoder architecture with ST CTC and show that it
can perform the segmentation task without supervision or additional parameters.
To the best of our knowledge, our method is the first that allows an actual
end-to-end simultaneous speech translation, as the same model is used for
translation and segmentation at the same time. On a diverse set of language
pairs and in- and out-of-domain data, we show that the proposed approach
achieves state-of-the-art quality at no additional computational cost.Comment: This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessibl
- …