Referring Image Segmentation via Cross-Modal Progressive Comprehension
Referring image segmentation aims at segmenting the foreground masks of the
entities that can well match the description given in the natural language
expression. Previous approaches tackle this problem using implicit feature
interaction and fusion between visual and linguistic modalities, but usually
fail to explore informative words of the expression to well align features from
the two modalities for accurately identifying the referred entity. In this
paper, we propose a Cross-Modal Progressive Comprehension (CMPC) module and a
Text-Guided Feature Exchange (TGFE) module to effectively address the
challenging task. Concretely, the CMPC module first employs entity and
attribute words to perceive all the related entities that might be considered
by the expression. Then, the relational words are adopted to highlight the
correct entity as well as suppress other irrelevant ones by multimodal graph
reasoning. In addition to the CMPC module, we further leverage a simple yet
effective TGFE module to integrate the reasoned multimodal features from
different levels with the guidance of textual information. In this way,
features from multi-levels could communicate with each other and be refined
based on the textual context. We conduct extensive experiments on four popular
referring segmentation benchmarks and achieve new state-of-the-art
performances.
Comment: Accepted by CVPR 2020. Code is available at
https://github.com/spyflying/CMPC-Refse
PolyFormer: Referring Image Segmentation as Sequential Polygon Generation
In this work, instead of directly predicting the pixel-level segmentation
masks, the problem of referring image segmentation is formulated as sequential
polygon generation, and the predicted polygons can be later converted into
segmentation masks. This is enabled by a new sequence-to-sequence framework,
Polygon Transformer (PolyFormer), which takes a sequence of image patches and
text query tokens as input, and outputs a sequence of polygon vertices
autoregressively. For more accurate geometric localization, we propose a
regression-based decoder, which predicts the precise floating-point coordinates
directly, without any coordinate quantization error. In the experiments,
PolyFormer outperforms the prior art by a clear margin, e.g., 5.40% and 4.52%
absolute improvements on the challenging RefCOCO+ and RefCOCOg datasets. It
also shows strong generalization ability when evaluated on the referring video
segmentation task without fine-tuning, e.g., achieving competitive 61.5% J&F on
the Ref-DAVIS17 dataset.
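The abstract above says the predicted polygons can later be converted into segmentation masks, but does not say how. A minimal sketch of one way to do that conversion, assuming a single polygon given as floating-point (x, y) vertices in pixel units and an even-odd fill rule tested at pixel centers; the function name `polygon_to_mask` is hypothetical and is not taken from the PolyFormer code:

```python
import numpy as np

def polygon_to_mask(vertices, height, width):
    """Rasterize one polygon into a boolean mask using the even-odd rule.

    vertices: sequence of (x, y) floating-point coordinates in pixel units.
    A pixel is foreground when its center (col + 0.5, row + 0.5) lies
    inside the polygon. This is an illustrative assumption, not the
    conversion used by the paper.
    """
    verts = np.asarray(vertices, dtype=np.float64)
    ys, xs = np.mgrid[0:height, 0:width]
    px = xs + 0.5  # x coordinate of every pixel center
    py = ys + 0.5  # y coordinate of every pixel center
    inside = np.zeros((height, width), dtype=bool)
    n = len(verts)
    for i in range(n):
        x0, y0 = verts[i]
        x1, y1 = verts[(i + 1) % n]
        if y0 == y1:
            continue  # a horizontal edge never crosses a horizontal ray
        # True where the edge straddles the horizontal line through the center
        straddles = (y0 > py) != (y1 > py)
        # x position where the edge meets that horizontal line
        x_cross = x0 + (py - y0) * (x1 - x0) / (y1 - y0)
        # Toggle parity for every crossing to the right of the pixel center
        inside ^= straddles & (px < x_cross)
    return inside

# Usage: an axis-aligned rectangle on a 6 x 8 canvas
mask = polygon_to_mask([(1.0, 1.0), (5.0, 1.0), (5.0, 4.0), (1.0, 4.0)], 6, 8)
```

Because the vertices stay in floating point until this final step, no coordinate quantization is introduced before rasterization, which matches the motivation given for the regression-based decoder.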
Referring Multi-Object Tracking
Existing referring understanding tasks tend to involve the detection of a
single text-referred object. In this paper, we propose a new and general
referring understanding task, termed referring multi-object tracking (RMOT).
Its core idea is to employ a language expression as a semantic cue to guide the
prediction of multi-object tracking. To the best of our knowledge, it is the
first work to achieve an arbitrary number of referent object predictions in
videos. To push forward RMOT, we construct one benchmark with scalable
expressions based on KITTI, named Refer-KITTI. Specifically, it provides 18
videos with 818 expressions, and each expression in a video is annotated with
an average of 10.7 objects. Further, we develop a transformer-based
architecture TransRMOT to tackle the new task in an online manner, which
achieves impressive detection performance and outperforms other counterparts.
The dataset and code will be available at https://github.com/wudongming97/RMOT.
Comment: Accepted by CVPR 2023.
Position-Aware Contrastive Alignment for Referring Image Segmentation
Referring image segmentation aims to segment the target object described by a
given natural language expression. Typically, referring expressions contain
complex relationships between the target and its surrounding objects. The main
challenge of this task is to understand the visual and linguistic content
simultaneously and to find the referred object accurately among all instances
in the image. Currently, the most effective way to solve the above problem is
to obtain aligned multi-modal features by computing the correlation between
visual and linguistic feature modalities under the supervision of the
ground-truth mask. However, existing paradigms have difficulty in thoroughly
understanding visual and linguistic content due to the inability to perceive
information directly about surrounding objects that refer to the target. This
prevents them from learning aligned multi-modal features, which leads to
inaccurate segmentation. To address this issue, we present a position-aware
contrastive alignment network (PCAN) to enhance the alignment of multi-modal
features by guiding the interaction between vision and language through prior
position information. Our PCAN consists of two modules: 1) Position Aware
Module (PAM), which provides position information of all objects related to
natural language descriptions, and 2) Contrastive Language Understanding Module
(CLUM), which enhances multi-modal alignment by comparing the features of the
referred object with those of related objects. Extensive experiments on three
benchmarks demonstrate our PCAN performs favorably against the state-of-the-art
methods. Our code will be made publicly available.
Comment: 12 pages, 6 figures
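The abstract above describes enhancing alignment by comparing the features of the referred object with those of related objects, without giving the loss. A rough sketch of one common way to express such a comparison, assuming an InfoNCE-style objective with cosine similarity, the text feature as anchor, the referred object as the positive, and related objects as negatives; the function name, arguments, and temperature are hypothetical and this is not claimed to be the PCAN formulation:

```python
import numpy as np

def contrastive_loss(text_feat, referred_feat, related_feats, temperature=0.1):
    """InfoNCE-style loss: pull the referred object toward the text feature
    and push related (non-referred) objects away. Illustrative assumption only.
    """
    def normalize(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)

    t = normalize(np.asarray(text_feat, dtype=np.float64))
    # Row 0 is the positive (referred object); remaining rows are negatives
    candidates = normalize(
        np.vstack([referred_feat, related_feats]).astype(np.float64)
    )
    logits = candidates @ t / temperature
    logits = logits - logits.max()  # shift for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum())
    return float(-log_prob[0])

# Usage: the loss is small when the referred object matches the text feature
loss = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
```

The loss grows when a related object is closer to the text feature than the referred one, which is the behavior a contrastive comparison of this kind is meant to penalize.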
Towards Omni-supervised Referring Expression Segmentation
Referring Expression Segmentation (RES) is an emerging task in computer
vision, which segments the target instances in images based on text
descriptions. However, its development is plagued by the expensive segmentation
labels. To address this issue, we propose a new learning task for RES called
Omni-supervised Referring Expression Segmentation (Omni-RES), which aims to
make full use of unlabeled, fully labeled and weakly labeled data, e.g.,
referring points or grounding boxes, for efficient RES training. To accomplish
this task, we also propose a novel yet strong baseline method for Omni-RES
based on the recently popular teacher-student learning, where the weak labels
are not directly transformed into supervision signals but used as a yardstick
to select and refine high-quality pseudo-masks for teacher-student learning. To
validate the proposed Omni-RES method, we apply it to a set of state-of-the-art
RES models and conduct extensive experiments on multiple RES datasets. The
experimental results show clear advantages of Omni-RES over the
fully-supervised and semi-supervised training schemes. For instance, with only
10% fully labeled data, Omni-RES can help the base model achieve 100% fully
supervised performance, and it also outperforms the semi-supervised alternative
by a large margin, e.g., +14.93% on RefCOCO and +14.95% on RefCOCO+.
More importantly, Omni-RES also enables the use of large-scale
vision-language datasets like Visual Genome to facilitate low-cost RES training, and
achieves new SOTA performance on RES, e.g., 80.66 on RefCOCO.
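The abstract above says weak labels act as a yardstick for selecting high-quality pseudo-masks, but does not state the selection criterion. A minimal sketch of one plausible criterion, assuming the weak label is a grounding box in (x0, y0, x1, y1) pixel coordinates and a pseudo-mask is kept when the IoU between its tight bounding box and that grounding box reaches a threshold; the helper names and the 0.5 threshold are hypothetical, not taken from Omni-RES:

```python
import numpy as np

def mask_to_box(mask):
    """Tight (x0, y0, x1, y1) box around a boolean mask, or None if it is empty."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        return None
    return (int(cols[0]), int(rows[0]), int(cols[-1]) + 1, int(rows[-1]) + 1)

def box_iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def keep_pseudo_mask(pseudo_mask, weak_box, min_iou=0.5):
    """Keep a teacher pseudo-mask only if it agrees with the weak box label.

    The agreement test (box IoU >= min_iou) is an assumption for illustration.
    """
    box = mask_to_box(pseudo_mask)
    return box is not None and box_iou(box, weak_box) >= min_iou

# Usage: a pseudo-mask whose extent matches the grounding box is kept
pseudo = np.zeros((10, 10), dtype=bool)
pseudo[2:6, 3:8] = True
kept = keep_pseudo_mask(pseudo, (3, 2, 8, 6))
```

In this reading the weak label never becomes a training target itself; it only filters which pseudo-masks reach the student.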
Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation
We address the problem of referring image segmentation that aims to generate
a mask for the object specified by a natural language expression. Many recent
works utilize Transformer to extract features for the target object by
aggregating the attended visual regions. However, the generic attention
mechanism in Transformer only uses the language input for attention weight
calculation, which does not explicitly fuse language features in its output.
Thus, its output feature is dominated by vision information, which limits the
model's ability to comprehensively understand the multi-modal information and brings
uncertainty for the subsequent mask decoder to extract the output mask. To
address this issue, we propose Multi-Modal Mutual Attention
and a Multi-Modal Mutual Decoder that better fuse information
from the two input modalities. Based on these, we further propose
Iterative Multi-modal Interaction to allow continuous and
in-depth interactions between language and vision features. Furthermore, we
introduce Language Feature Reconstruction to prevent the
language information from being lost or distorted in the extracted feature.
Extensive experiments show that our proposed approach significantly improves
the baseline and outperforms state-of-the-art referring image segmentation
methods on RefCOCO series datasets consistently.
Comment: IEEE TI
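The limitation of generic attention described in the abstract above can be illustrated with plain scaled dot-product attention. This sketch assumes the language features serve as queries and the visual features as keys and values (the abstract does not spell out the roles), and omits learned projections; the function name is hypothetical. Under that assumption the language input only shapes the weights, and every output row is a weighted average of visual value vectors:

```python
import numpy as np

def cross_attention(lang_queries, vis_keys, vis_values):
    """Scaled dot-product attention with language queries over visual keys/values.

    Returns (output, weights). Each output row is a convex combination of the
    rows of vis_values, so no language feature is mixed into the output.
    """
    d = lang_queries.shape[-1]
    scores = lang_queries @ vis_keys.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ vis_values, weights

# Usage with random features: 3 word tokens, 5 visual positions
rng = np.random.default_rng(0)
out, w = cross_attention(
    rng.normal(size=(3, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 6))
)
```

Since the attention weights are non-negative and sum to one, each output coordinate stays within the range spanned by the visual values, which is one way to see why the output is dominated by vision information.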
Towards Robust Referring Image Segmentation
Referring Image Segmentation (RIS) aims to connect image and language via
outputting the corresponding object masks given a text description, which is a
fundamental vision-language task. Although many works have made
considerable progress on RIS, in this work we explore an essential question:
"what if the text description is wrong or misleading?" We
term such a sentence a negative sentence. We find that existing
works cannot handle such settings. To this end, we propose a novel formulation
of RIS, named Robust Referring Image Segmentation (R-RIS). It considers the
negative sentence inputs besides the regularly given text inputs. We present
three different datasets via augmenting the input negative sentences and a new
metric to unify both input types. Furthermore, we design a new
transformer-based model named RefSegformer, where we introduce a token-based
vision and language fusion module. Such a module can be easily extended to our
R-RIS setting by adding extra blank tokens. Our proposed RefSegformer achieves
the new state-of-the-art results on three regular RIS datasets and three R-RIS
datasets, which serves as a new solid baseline for further research. The
project page is at https://lxtgh.github.io/project/robust_ref_seg/.
Comment: technical report
Contrastive Grouping with Transformer for Referring Image Segmentation
Referring image segmentation aims to segment the target referent in an image
conditioning on a natural language expression. Existing one-stage methods
employ per-pixel classification frameworks, which attempt straightforwardly to
align vision and language at the pixel level, thus failing to capture critical
object-level information. In this paper, we propose a mask classification
framework, Contrastive Grouping with Transformer network (CGFormer), which
explicitly captures object-level information via token-based querying and
grouping strategy. Specifically, CGFormer first introduces learnable query
tokens to represent objects and then alternately queries linguistic features
and groups visual features into the query tokens for object-aware cross-modal
reasoning. In addition, CGFormer achieves cross-level interaction by jointly
updating the query tokens and decoding masks in every two consecutive layers.
Finally, CGFormer incorporates contrastive learning into the grouping strategy to
identify the token and its mask corresponding to the referent. Experimental
results demonstrate that CGFormer outperforms state-of-the-art methods in both
segmentation and generalization settings consistently and significantly.
Comment: Accepted by CVPR 202