Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation
We address the problem of referring image segmentation that aims to generate
a mask for the object specified by a natural language expression. Many recent
works utilize Transformer to extract features for the target object by
aggregating the attended visual regions. However, the generic attention
mechanism in Transformer only uses the language input for attention weight
calculation, which does not explicitly fuse language features in its output.
Thus, its output feature is dominated by vision information, which limits the
model to comprehensively understand the multi-modal information, and brings
uncertainty for the subsequent mask decoder to extract the output mask. To
address this issue, we propose Multi-Modal Mutual Attention (M3Att)
and Multi-Modal Mutual Decoder (M3Dec) that better fuse information
from the two input modalities. Based on M3Dec, we further propose
Iterative Multi-modal Interaction (IMI) to allow continuous and
in-depth interactions between language and vision features. Furthermore, we
introduce Language Feature Reconstruction (LFR) to prevent the
language information from being lost or distorted in the extracted feature.
Extensive experiments show that our proposed approach significantly improves
the baseline and outperforms state-of-the-art referring image segmentation
methods consistently on the RefCOCO series of datasets.
Comment: IEEE TI
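The abstract does not spell out the mutual attention mechanism, but a minimal sketch of the general idea, in which each modality attends to the other and the output concatenates both attention results so language features stay explicitly present, might look as follows (all module and variable names are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

class MutualAttentionSketch(nn.Module):
    """Two-way cross-attention whose fused output explicitly carries both
    modalities, unlike plain cross-attention whose output is a weighted
    sum of values from a single modality."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.pix_to_word = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.word_to_pix = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # vis: (B, N, C) flattened image features; lang: (B, W, C) word features
        lang_ctx, _ = self.pix_to_word(vis, lang, lang)  # language context per pixel
        vis_ctx, _ = self.word_to_pix(lang, vis, vis)    # visual context per word
        pooled = vis_ctx.mean(dim=1, keepdim=True).expand(-1, vis.size(1), -1)
        # Concatenating both attention outputs keeps language information
        # explicitly present in the fused representation.
        return self.fuse(torch.cat([lang_ctx, pooled], dim=-1))
```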
Linguistic Structure Guided Context Modeling for Referring Image Segmentation
Referring image segmentation aims to predict the foreground mask of the
object referred by a natural language sentence. Multimodal context of the
sentence is crucial to distinguish the referent from the background. Existing
methods either insufficiently or redundantly model the multimodal context. To
tackle this problem, we propose a "gather-propagate-distribute" scheme to model
multimodal context by cross-modal interaction and implement this scheme as a
novel Linguistic Structure guided Context Modeling (LSCM) module. Our LSCM
module builds a Dependency Parsing Tree suppressed Word Graph (DPT-WG) which
guides all the words to include valid multimodal context of the sentence while
excluding disturbing ones through three steps over the multimodal feature,
i.e., gathering, constrained propagation and distributing. Extensive
experiments on four benchmarks demonstrate that our method outperforms all the
previous state-of-the-art methods.
Comment: Accepted by ECCV 2020. Code is available at
https://github.com/spyflying/LSCM-Refse
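As a rough illustration of a gather-propagate-distribute step under assumed tensor shapes (the function name and the uniform feature dimension are ours, not the paper's):

```python
import torch
import torch.nn.functional as F

def gather_propagate_distribute(pixel_feats, word_feats, adjacency):
    """Illustrative gather-propagate-distribute step.
    pixel_feats: (B, N, C) multimodal features at N spatial positions
    word_feats:  (B, W, C) word embeddings
    adjacency:   (B, W, W) word-graph weights (e.g. from a dependency parse)
    """
    # Gather: each word collects the pixel features it attends to.
    attn = F.softmax(torch.einsum('bwc,bnc->bwn', word_feats, pixel_feats), dim=-1)
    word_ctx = torch.einsum('bwn,bnc->bwc', attn, pixel_feats)
    # Propagate: exchange context along the (suppressed) word graph.
    word_ctx = torch.einsum('bvw,bwc->bvc', adjacency, word_ctx)
    # Distribute: scatter the refined word context back to the pixels.
    return torch.einsum('bwn,bwc->bnc', attn, word_ctx)
```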
MMNet: Multi-Mask Network for Referring Image Segmentation
Referring image segmentation aims to segment the object referred to by a
natural language expression from an image. However, this task is challenging
due to the distinct data properties of text and image, and the randomness
introduced by diverse objects and unrestricted language expressions. Most
previous works focus on improving cross-modal feature fusion while not fully
addressing the inherent uncertainty caused by diverse objects and unrestricted
language. To tackle these problems, we propose an end-to-end Multi-Mask Network
for referring image segmentation (MMNet). We first combine image and language features and
then employ an attention mechanism to generate multiple queries that represent
different aspects of the language expression. We then utilize these queries to
produce a series of corresponding segmentation masks, assigning a score to each
mask that reflects its importance. The final result is obtained through the
weighted sum of all masks, which greatly reduces the randomness of the language
expression. Our proposed framework demonstrates superior performance compared
to state-of-the-art approaches on the three most commonly used datasets, RefCOCO,
RefCOCO+ and G-Ref, without the need for any post-processing. This further
validates the efficacy of our proposed framework.
Comment: 10 pages, 5 figure
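The score-weighted mask fusion the abstract describes can be sketched in a few lines; shapes and names below are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def fuse_multi_masks(masks: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    """Combine per-query masks into one prediction by score-weighted sum.
    masks:  (B, Q, H, W) one mask logit map per query
    scores: (B, Q) importance score per query
    """
    weights = F.softmax(scores, dim=1)                  # normalize the scores
    return (weights[:, :, None, None] * masks).sum(1)   # (B, H, W) fused mask
```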
PolyFormer: Referring Image Segmentation as Sequential Polygon Generation
In this work, instead of directly predicting the pixel-level segmentation
masks, the problem of referring image segmentation is formulated as sequential
polygon generation, and the predicted polygons can be later converted into
segmentation masks. This is enabled by a new sequence-to-sequence framework,
Polygon Transformer (PolyFormer), which takes a sequence of image patches and
text query tokens as input, and outputs a sequence of polygon vertices
autoregressively. For more accurate geometric localization, we propose a
regression-based decoder, which predicts the precise floating-point coordinates
directly, without any coordinate quantization error. In the experiments,
PolyFormer outperforms the prior art by a clear margin, e.g., 5.40% and 4.52%
absolute improvements on the challenging RefCOCO+ and RefCOCOg datasets. It
also shows strong generalization ability when evaluated on the referring video
segmentation task without fine-tuning, e.g., achieving competitive 61.5% J&F on
the Ref-DAVIS17 dataset.
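A minimal sketch of a regression head that emits continuous vertex coordinates, as opposed to classifying over a quantized coordinate vocabulary (the class name and the sigmoid normalization to [0, 1] image coordinates are our assumptions):

```python
import torch
import torch.nn as nn

class VertexRegressionHead(nn.Module):
    """Illustrative regression head: maps a decoder state directly to a
    continuous (x, y) vertex, avoiding the rounding error introduced by a
    discrete coordinate vocabulary."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_xy = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, decoder_state: torch.Tensor) -> torch.Tensor:
        # decoder_state: (B, C) hidden state at the current decoding step
        return self.to_xy(decoder_state).sigmoid()  # (B, 2), normalized coords
```

In an autoregressive loop, each predicted vertex would be fed back as input for the next decoding step until an end-of-sequence token is produced.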
Shatter and Gather: Learning Referring Image Segmentation with Text Supervision
Referring image segmentation, the task of segmenting any arbitrary entities
described in free-form texts, opens up a variety of vision applications.
However, manual labeling of training data for this task is prohibitively
costly, leading to lack of labeled data for training. We address this issue by
a weakly supervised learning approach using text descriptions of training
images as the only source of supervision. To this end, we first present a new
model that discovers semantic entities in input image and then combines such
entities relevant to text query to predict the mask of the referent. We also
present a new loss function that allows the model to be trained without any
further supervision. Our method was evaluated on four public benchmarks for
referring image segmentation, where it clearly outperformed existing methods
for the same task and recent open-vocabulary segmentation models on all the
benchmarks.
Comment: Accepted to ICCV 2023, Project page:
https://southflame.github.io/sag
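A hedged sketch of the "gather" idea, combining discovered entity masks by their relevance to the text query (the temperature value and all names are illustrative, not the paper's):

```python
import torch
import torch.nn.functional as F

def gather_entities(entity_masks, entity_feats, text_feat):
    """Weight discovered entity masks by their similarity to the text
    query and sum them into a referent mask.
    entity_masks: (B, K, H, W) soft masks of K discovered entities
    entity_feats: (B, K, C)   one embedding per entity
    text_feat:    (B, C)      sentence embedding
    """
    sim = torch.einsum('bkc,bc->bk', F.normalize(entity_feats, dim=-1),
                       F.normalize(text_feat, dim=-1))
    w = F.softmax(sim / 0.1, dim=1)                      # temperature-scaled relevance
    return (w[:, :, None, None] * entity_masks).sum(1)   # (B, H, W) referent mask
```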
Position-Aware Contrastive Alignment for Referring Image Segmentation
Referring image segmentation aims to segment the target object described by a
given natural language expression. Typically, referring expressions contain
complex relationships between the target and its surrounding objects. The main
challenge of this task is to understand the visual and linguistic content
simultaneously and to find the referred object accurately among all instances
in the image. Currently, the most effective way to solve the above problem is
to obtain aligned multi-modal features by computing the correlation between
visual and linguistic feature modalities under the supervision of the
ground-truth mask. However, existing paradigms have difficulty in thoroughly
understanding visual and linguistic content due to their inability to directly
perceive information about the surrounding objects related to the target. This
prevents them from learning aligned multi-modal features, which leads to
inaccurate segmentation. To address this issue, we present a position-aware
contrastive alignment network (PCAN) to enhance the alignment of multi-modal
features by guiding the interaction between vision and language through prior
position information. Our PCAN consists of two modules: 1) Position Aware
Module (PAM), which provides position information of all objects related to
natural language descriptions, and 2) Contrastive Language Understanding Module
(CLUM), which enhances multi-modal alignment by comparing the features of the
referred object with those of related objects. Extensive experiments on three
benchmarks demonstrate our PCAN performs favorably against the state-of-the-art
methods. Our code will be made publicly available.
Comment: 12 pages, 6 figure
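The contrastive comparison between the referred object and related objects could take an InfoNCE-like form; the following is a sketch under assumed shapes, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(lang_feat, target_feat, related_feats, tau=0.07):
    """The language feature should match the referred object's feature
    more closely than those of the related (surrounding) objects.
    lang_feat:     (B, C)    expression embedding
    target_feat:   (B, C)    referred-object feature (positive)
    related_feats: (B, K, C) surrounding-object features (negatives)
    """
    q = F.normalize(lang_feat, dim=-1)
    pos = torch.einsum('bc,bc->b', q, F.normalize(target_feat, dim=-1)) / tau
    neg = torch.einsum('bc,bkc->bk', q, F.normalize(related_feats, dim=-1)) / tau
    logits = torch.cat([pos[:, None], neg], dim=1)   # positive sits in column 0
    labels = torch.zeros(len(q), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)
```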
EPCFormer: Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation
Audio-guided Video Object Segmentation (A-VOS) and Referring Video Object
Segmentation (R-VOS) are two highly-related tasks, which both aim to segment
specific objects from video sequences according to user-provided expression
prompts. However, due to the challenges in modeling representations for
different modalities, contemporary methods struggle to strike a balance between
interaction flexibility and high-precision localization and segmentation. In
this paper, we address this problem from two perspectives: the alignment
representation of audio and text and the deep interaction among audio, text,
and visual features. First, we propose a universal architecture, the Expression
Prompt Collaboration Transformer, herein EPCFormer. Next, we propose an
Expression Alignment (EA) mechanism for audio and text expressions. By
introducing contrastive learning for audio and text expressions, the proposed
EPCFormer realizes comprehension of the semantic equivalence between audio and
text expressions denoting the same objects. Then, to facilitate deep
interactions among audio, text, and video features, we introduce an
Expression-Visual Attention (EVA) mechanism. The knowledge of video object
segmentation in terms of the expression prompts can seamlessly transfer between
the two tasks by deeply exploring complementary cues between text and audio.
Experiments on well-recognized benchmarks demonstrate that our universal
EPCFormer attains state-of-the-art results on both tasks. The source code of
EPCFormer will be made publicly available at
https://github.com/lab206/EPCFormer.
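A CLIP-style symmetric contrastive loss is one plausible reading of the Expression Alignment mechanism; this sketch assumes paired audio/text embeddings of the same expression and is not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def expression_alignment_loss(audio_emb, text_emb, tau=0.07):
    """Pull the audio and text embeddings of the same expression together
    across a batch, so the model treats both prompts denoting the same
    object as semantically equivalent.
    audio_emb, text_emb: (B, C)
    """
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / tau                       # (B, B) pairwise similarities
    labels = torch.arange(len(a), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```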
Bidirectional Correlation-Driven Inter-Frame Interaction Transformer for Referring Video Object Segmentation
Referring video object segmentation (RVOS) aims to segment the target object
in a video sequence described by a language expression. Typical multimodal
Transformer based RVOS approaches process video sequence in a frame-independent
manner to reduce the high computational cost, which however restricts the
performance due to the lack of inter-frame interaction for temporal coherence
modeling and spatio-temporal representation learning of the referred object.
Besides, the absence of sufficient cross-modal interactions results in weak
correlation between the visual and linguistic features, which increases the
difficulty of decoding the target information and limits the performance of the
model. In this paper, we propose a bidirectional correlation-driven inter-frame
interaction Transformer, dubbed BIFIT, to address these issues in RVOS.
Specifically, we design a lightweight and plug-and-play inter-frame interaction
module in the Transformer decoder to efficiently learn the spatio-temporal
features of the referred object, so as to decode the object information in the
video sequence more precisely and generate more accurate segmentation results.
Moreover, a bidirectional vision-language interaction module is implemented
before the multimodal Transformer to enhance the correlation between the visual
and linguistic features, thus facilitating the language queries to decode more
precise object information from visual features and ultimately improving the
segmentation performance. Extensive experimental results on four benchmarks
validate the superiority of our BIFIT over state-of-the-art methods and the
effectiveness of our proposed modules.
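One plausible shape for such a plug-and-play inter-frame module is attention along the time axis over per-frame object queries; the sketch below is illustrative, with all names and shapes assumed:

```python
import torch
import torch.nn as nn

class InterFrameInteraction(nn.Module):
    """Object queries from all frames attend to each other along the time
    axis, giving the decoder a temporally coherent view of the referred
    object."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor) -> torch.Tensor:
        # queries: (B, T, Q, C) object queries for T frames
        B, T, Q, C = queries.shape
        x = queries.permute(0, 2, 1, 3).reshape(B * Q, T, C)  # attend over time
        x = self.norm(x + self.temporal_attn(x, x, x)[0])     # residual + norm
        return x.reshape(B, Q, T, C).permute(0, 2, 1, 3)      # back to (B, T, Q, C)
```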