Recent studies have shown the importance of modeling long-range interactions
in the inpainting problem. To this end, existing approaches employ
either standalone attention techniques or transformers, but usually at low
resolution because of the computational cost. In this paper, we present a
novel transformer-based model for large hole inpainting, which unifies the
merits of transformers and convolutions to efficiently process high-resolution
images. We carefully design each component of our framework to guarantee the
high fidelity and diversity of recovered images. Specifically, we customize an
inpainting-oriented transformer block, where the attention module aggregates
non-local information only from the valid tokens, as indicated by a dynamic
mask. Extensive experiments demonstrate the state-of-the-art performance of the
new model on multiple benchmark datasets. Code is released at
https://github.com/fenglinglwb/MAT.
Comment: Accepted to CVPR2022 Ora
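
The mask-restricted attention mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the released implementation: the function name `masked_attention`, the single-head global attention, and the boolean `valid` vector are assumptions made for illustration only. The sketch shows one way to let every query aggregate information only from tokens marked valid, and it leaves out how the mask is updated between blocks, which the abstract does not specify.

```python
import numpy as np

def masked_attention(q, k, v, valid):
    """Single-head attention that ignores invalid key/value tokens.

    q, k, v : float arrays of shape (n, d), one row per token.
    valid   : boolean array of shape (n,), True where a token is valid
              (hypothetical stand-in for the dynamic mask).
    Returns (output, weights) with shapes (n, d) and (n, n).
    """
    if not valid.any():
        raise ValueError("at least one token must be valid")
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                       # (n, n) similarities
    scores = np.where(valid[None, :], scores, -np.inf)  # hide invalid keys
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                            # invalid keys become 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Example: six tokens, three of which lie inside the hole (invalid).
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
valid = np.array([True, True, False, True, False, False])
out, weights = masked_attention(q, k, v, valid)
```

In this sketch the columns of `weights` that correspond to invalid tokens are exactly zero, so the content of hole tokens never influences the output; a dynamic mask would simply pass a different `valid` vector to each successive block.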