Referring image segmentation aims to segment the target object described by a
given natural language expression. Typically, referring expressions contain
complex relationships between the target and its surrounding objects. The main
challenge of this task is to understand the visual and linguistic content
simultaneously and to find the referred object accurately among all instances
in the image. Currently, the most effective way to solve the above problem is
to obtain aligned multi-modal features by computing the correlation between
visual and linguistic features under the supervision of the ground-truth mask.
However, existing paradigms struggle to fully understand the visual and
linguistic content because they cannot directly perceive information about the
surrounding objects related to the target. As a result, they fail to learn
well-aligned multi-modal features, which leads to inaccurate segmentation. To
address this issue, we present a position-aware
contrastive alignment network (PCAN) to enhance the alignment of multi-modal
features by guiding the interaction between vision and language through prior
position information. Our PCAN consists of two modules: 1) Position Aware
Module (PAM), which provides position information of all objects related to
the natural language description, and 2) Contrastive Language Understanding Module
(CLUM), which enhances multi-modal alignment by comparing the features of the
referred object with those of related objects. Extensive experiments on three
benchmarks demonstrate that our PCAN performs favorably against state-of-the-art
methods. Our code will be made publicly available.
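
To make the contrastive idea behind CLUM concrete, below is a minimal, hypothetical PyTorch sketch of an InfoNCE-style alignment objective: the visual feature of the referred object acts as the positive for the pooled linguistic feature, while the features of other language-related objects act as negatives. The function name, tensor shapes, and temperature value are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a contrastive alignment objective (illustrative only;
# not the authors' code). Shapes and names are assumptions.
import torch
import torch.nn.functional as F


def contrastive_alignment_loss(referred_feat, related_feats, lang_feat,
                               temperature=0.07):
    """InfoNCE-style loss with one referred object and K related negatives.

    referred_feat: (B, C)    visual feature of the referred object
    related_feats: (B, K, C) visual features of K related objects
    lang_feat:     (B, C)    pooled linguistic feature of the expression
    """
    # Normalize so dot products become cosine similarities.
    referred = F.normalize(referred_feat, dim=-1)
    related = F.normalize(related_feats, dim=-1)
    lang = F.normalize(lang_feat, dim=-1)

    # Positive similarity: language vs. referred object, shape (B, 1).
    pos = (lang * referred).sum(dim=-1, keepdim=True)
    # Negative similarities: language vs. each related object, shape (B, K).
    neg = torch.einsum("bc,bkc->bk", lang, related)

    # The positive pair sits at index 0 of the concatenated logits.
    logits = torch.cat([pos, neg], dim=1) / temperature
    targets = torch.zeros(logits.size(0), dtype=torch.long,
                          device=logits.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    B, K, C = 4, 5, 256  # batch size, number of related objects, channels
    loss = contrastive_alignment_loss(
        torch.randn(B, C), torch.randn(B, K, C), torch.randn(B, C)
    )
    print(loss.item())
```

In this sketch, minimizing the loss pulls the referred object's feature toward the linguistic feature while pushing the related objects' features away, which is one plausible way to realize the "comparing the referred object with related objects" behavior described above.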