Capturing the long-range dependencies has empirically proven to be effective
on a wide range of computer vision tasks. The progressive advances on this
topic have been made through the employment of the transformer framework with
the help of the multi-head attention mechanism. However, the attention-based
image patch interaction potentially suffers from problems of redundant
interactions of intra-class patches and unoriented interactions of inter-class
patches. In this paper, we propose a novel Graph Reasoning Transformer (GReaT)
for image parsing to enable image patches to interact following a relation
reasoning pattern. Specifically, the linearly embedded image patches are first
projected into the graph space, where each node represents the implicit visual
center for a cluster of image patches and each edge reflects the relation
weight between two adjacent nodes. After that, global relation reasoning is
performed on this graph accordingly. Finally, all nodes including the relation
information are mapped back into the original space for subsequent processes.
Compared to the conventional transformer, GReaT has higher interaction
efficiency and a more purposeful interaction pattern. Experiments are carried
out on the challenging Cityscapes and ADE20K datasets. Results show that GReaT
achieves consistent performance gains with slight computational overheads on
the state-of-the-art transformer baselines.Comment: Accepted in ACM MM202