Multi-Resolution and Asymmetric Implementation of Attention in Transformers

Abstract

Transformers are the state of the art for machine translation and grammatical error correction. Among their most important components are the attention layers, which are also computationally expensive. We propose a new way of looking at the token "mixing" mechanism: a multi-resolution implementation of attention that preserves inference quality while improving both training and inference speed, thus getting the best of both worlds. This approximation can be applied symmetrically or asymmetrically, both within and across attention layers. We also suggest an interesting alternative to the softmax layer in attention, and we analyze several other hyperparameters in detail. For example, our experiments indicate that attention layers can be asymmetric with respect to the number of heads while still achieving similar results; in many cases, reducing the number of heads improves inference results. Finally, we explore the role of the weighting matrices for the query, key, and value vectors, and show that in self-attention, the absence of these matrices causes the attention layers to collapse to an identity matrix.
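The abstract does not spell out the multi-resolution mechanism, so the following is only a minimal sketch of one plausible reading: attention computed against keys and values that are average-pooled along the sequence axis, so the attention matrix shrinks from n x n to n x n/p. The function name `multi_resolution_attention` and the `pool_size` hyperparameter are assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def multi_resolution_attention(q, k, v, pool_size=2):
    """Hypothetical sketch: full-resolution queries attend over
    average-pooled keys/values, reducing attention cost roughly
    by a factor of pool_size. Not the paper's exact method.
    """
    # q, k, v: (batch, seq_len, d_model)
    # Pool K and V along the sequence axis (channels-first for avg_pool1d).
    k_low = F.avg_pool1d(k.transpose(1, 2), pool_size).transpose(1, 2)
    v_low = F.avg_pool1d(v.transpose(1, 2), pool_size).transpose(1, 2)
    # Scaled dot-product attention against the coarse keys/values.
    scores = q @ k_low.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v_low

# Usage: mixing a coarse layer with standard attention elsewhere would
# correspond to the asymmetric application across layers described above.
q = k = v = torch.randn(2, 16, 64)
out = multi_resolution_attention(q, k, v, pool_size=4)  # (2, 16, 64)
```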
