The quadratic computational complexity of self-attention has been a persistent
challenge in applying Transformer models to vision tasks. Linear attention
offers a far more efficient alternative: it reduces the complexity to linear by
approximating the Softmax operation with carefully designed mapping functions.
However, existing linear attention approaches either suffer
Linear Attention module to achieve both high efficiency and expressiveness.
Specifically, we first analyze the factors contributing to the performance
degradation of linear attention from two perspectives: focus ability and
feature diversity. To overcome these limitations, we introduce a simple yet
effective mapping function and an efficient rank restoration module to enhance
the expressiveness of self-attention while maintaining low computational
complexity. Extensive experiments show that our linear attention module is
applicable to a variety of advanced vision Transformers and achieves
consistently improved performance on multiple benchmarks. Code is available at
https://github.com/LeapLabTHU/FLatten-Transformer.

Comment: ICCV 2023
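To make the complexity argument concrete, the sketch below illustrates the standard linear attention factorization: instead of materializing the N x N matrix softmax(QK^T), a feature map phi is applied to queries and keys so the d x d matrix phi(K)^T V can be computed first. The feature map here is a simplified focusing-style mapping written in the spirit of the paper (ReLU followed by a norm-preserving element-wise power); the names `phi` and `linear_attention`, the power parameter `p`, and the omission of the paper's rank restoration module are assumptions for illustration, not the authors' released implementation (see the repository above for that).

```python
import torch
import torch.nn.functional as F

def phi(x, p=3, eps=1e-6):
    # Hypothetical focusing-style feature map (a sketch, not the official
    # one): ReLU ensures non-negativity, then an element-wise power,
    # rescaled to preserve the original feature norm, sharpens the
    # resulting attention weights toward dominant entries.
    x = F.relu(x) + eps
    xp = x ** p
    return xp * (x.norm(dim=-1, keepdim=True) / xp.norm(dim=-1, keepdim=True))

def linear_attention(q, k, v, p=3, eps=1e-6):
    """Softmax-free attention in O(N * d^2) instead of O(N^2 * d).

    q, k, v: (batch, N, d). The matmuls are reordered so the d x d
    matrix phi(k)^T @ v is computed first; no N x N matrix is formed.
    """
    q, k = phi(q, p, eps), phi(k, p, eps)
    kv = torch.einsum('bnd,bne->bde', k, v)                # sum_j phi(k_j) v_j^T
    z = 1.0 / (torch.einsum('bnd,bd->bn', q, k.sum(dim=1)) + eps)  # row normalizer
    return torch.einsum('bnd,bde,bn->bne', q, kv, z)

# Quick shape check: 4096 tokens without a 4096 x 4096 attention matrix.
q = k = v = torch.randn(2, 4096, 64)
print(linear_attention(q, k, v).shape)  # torch.Size([2, 4096, 64])
```

In the full model, the paper additionally applies its rank restoration module to counteract the reduced feature diversity of the factorized form; that component is omitted from this sketch for brevity.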