Recently, the Transformer architecture has been introduced into image restoration
to replace convolutional neural networks (CNNs), with surprising results.
Because global attention in the Transformer has high computational complexity,
some methods restrict self-attention to local square windows. However, these
methods lack direct interaction among different windows, which limits the
establishment of long-range dependencies. To address
the above issue, we propose a new image restoration model, Cross Aggregation
Transformer (CAT). The core of our CAT is the Rectangle-Window Self-Attention
(Rwin-SA), which applies horizontal and vertical rectangle-window attention in
different heads in parallel to expand the attention area and aggregate
features across different windows. We also introduce the Axial-Shift operation
to let different windows interact. Furthermore, we propose the Locality
Complementary Module to complement the self-attention mechanism, which
incorporates the inductive biases of CNNs (e.g., translation invariance and
locality) into the Transformer, enabling global-local coupling. Extensive
experiments demonstrate that our CAT outperforms recent state-of-the-art
methods on several image restoration applications. The code and models are
available at https://github.com/zhengchen1999/CAT.

Comment: Accepted to NeurIPS 2022.
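
The effect of combining horizontal and vertical rectangle windows can be illustrated with a small sketch. This is not the authors' implementation (see the repository for that); it only counts positions on a toy grid. The function names `rect_windows` and `attention_area`, the 8x8 grid, and the window sizes are all assumptions chosen for illustration. It compares the area reachable from one position through a square window with the area reachable when one group of heads uses a wide window and another uses a tall window with the same number of tokens.

```python
def rect_windows(height, width, win_h, win_w):
    """Split a height x width grid into non-overlapping win_h x win_w windows.

    Returns a list of windows; each window is a list of (row, col) positions.
    Assumes the grid size is divisible by the window size.
    """
    if height % win_h or width % win_w:
        raise ValueError("grid size must be divisible by window size")
    windows = []
    for top in range(0, height, win_h):
        for left in range(0, width, win_w):
            windows.append([(r, c)
                            for r in range(top, top + win_h)
                            for c in range(left, left + win_w)])
    return windows


def attention_area(pos, windows):
    """Return the set of positions that share a window with pos."""
    for win in windows:
        if pos in win:
            return set(win)
    raise ValueError("position outside the grid")


# Hypothetical sizes, chosen for illustration only.
H = W = 8
horizontal = rect_windows(H, W, win_h=2, win_w=8)  # wide windows (one head group)
vertical = rect_windows(H, W, win_h=8, win_w=2)    # tall windows (other head group)
square = rect_windows(H, W, win_h=4, win_w=4)      # same token count per window

pixel = (3, 5)
square_area = attention_area(pixel, square)
cross_area = attention_area(pixel, horizontal) | attention_area(pixel, vertical)
print(len(square_area), len(cross_area))  # prints: 16 28
```

In this toy setting each window holds 16 positions, so the per-window attention cost is the same, yet the union of the wide and tall windows covers 28 positions instead of 16. The sketch covers only window partitioning; it does not model the attention computation, the Axial-Shift operation, or the Locality Complementary Module.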