CLIPUNetr: Assisting Human-robot Interface for Uncalibrated Visual Servoing Control with CLIP-driven Referring Expression Segmentation
The classical human-robot interface in uncalibrated image-based visual
servoing (UIBVS) relies on either human annotations or semantic segmentation
with categorical labels. Neither method matches natural human communication or
conveys rich semantics in manipulation tasks as effectively as natural
language expressions. In this paper, we tackle this problem by using referring
expression segmentation, which is a prompt-based approach, to provide more
in-depth information for robot perception. To generate high-quality
segmentation predictions from referring expressions, we propose CLIPUNetr - a
new CLIP-driven referring expression segmentation network. CLIPUNetr leverages
CLIP's strong vision-language representations to segment regions from referring
expressions, while utilizing its "U-shaped" encoder-decoder architecture to
generate predictions with sharper boundaries and finer structures. Furthermore,
we propose a new pipeline to integrate CLIPUNetr into UIBVS and apply it to
control robots in real-world environments. In experiments, our method improves
boundary and structure measurements by an average of 120% and can successfully
assist real-world UIBVS control in an unstructured manipulation environment.
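The abstract only sketches the architecture at a high level. As a rough illustration of conditioning a U-shaped encoder-decoder on CLIP language features, here is a minimal, hypothetical PyTorch decoder block; the module names, dimensions, and the FiLM-style conditioning are our assumptions, not the CLIPUNetr implementation:

```python
# Minimal, hypothetical sketch of a CLIP-conditioned U-shaped decoder block.
# Module names and sizes are illustrative; this is not the CLIPUNetr code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextConditionedDecoderBlock(nn.Module):
    """Upsample, merge a skip connection, and modulate with a text embedding."""
    def __init__(self, in_ch, skip_ch, out_ch, text_dim):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
        # FiLM-style modulation from the sentence embedding (an assumption,
        # one simple way to inject language into every decoder stage).
        self.film = nn.Linear(text_dim, 2 * out_ch)

    def forward(self, x, skip, text_emb):
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        x = self.conv(torch.cat([x, skip], dim=1))
        gamma, beta = self.film(text_emb).chunk(2, dim=-1)
        return x * gamma[..., None, None] + beta[..., None, None]

if __name__ == "__main__":
    blk = TextConditionedDecoderBlock(in_ch=256, skip_ch=128, out_ch=128, text_dim=512)
    x = torch.randn(1, 256, 14, 14)      # coarse decoder feature
    skip = torch.randn(1, 128, 28, 28)   # encoder skip feature
    t = torch.randn(1, 512)              # CLIP-style sentence embedding
    print(blk(x, skip, t).shape)         # torch.Size([1, 128, 28, 28])
```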
Contrastive Grouping with Transformer for Referring Image Segmentation
Referring image segmentation aims to segment the target referent in an image
conditioning on a natural language expression. Existing one-stage methods
employ per-pixel classification frameworks, which attempt straightforwardly to
align vision and language at the pixel level, thus failing to capture critical
object-level information. In this paper, we propose a mask classification
framework, Contrastive Grouping with Transformer network (CGFormer), which
explicitly captures object-level information via token-based querying and
grouping strategy. Specifically, CGFormer first introduces learnable query
tokens to represent objects and then alternately queries linguistic features
and groups visual features into the query tokens for object-aware cross-modal
reasoning. In addition, CGFormer achieves cross-level interaction by jointly
updating the query tokens and decoding masks in every two consecutive layers.
Finally, CGFormer couples contrastive learning with the grouping strategy to
identify the token and its mask corresponding to the referent. Experimental
results demonstrate that CGFormer outperforms state-of-the-art methods in both
segmentation and generalization settings consistently and significantly.
Comment: Accepted by CVPR 2023
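To make the alternating querying-and-grouping idea concrete, the following is a hedged, single-layer PyTorch sketch; the layer names, head counts, and the dot-product mask decoding are illustrative assumptions rather than the released CGFormer code:

```python
# Hypothetical single layer illustrating token-based querying and grouping.
# Shapes and modules are assumptions for exposition, not the CGFormer code.
import torch
import torch.nn as nn

class QueryGroupLayer(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.query_lang = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.group_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, queries, lang_feats, vis_feats):
        # 1) query tokens gather linguistic context
        q, _ = self.query_lang(queries, lang_feats, lang_feats)
        # 2) query tokens group (pool) visual features for object-aware reasoning
        q, _ = self.group_vis(q, vis_feats, vis_feats)
        # decode one mask per query by dot product with per-pixel features
        masks = torch.einsum("bqc,bnc->bqn", q, vis_feats)  # (B, Q, H*W) logits
        return q, masks

if __name__ == "__main__":
    layer = QueryGroupLayer()
    queries = torch.randn(2, 4, 256)        # 4 learnable object tokens
    lang = torch.randn(2, 12, 256)          # 12 word features
    vis = torch.randn(2, 28 * 28, 256)      # flattened visual features
    q, masks = layer(queries, lang, vis)
    print(q.shape, masks.shape)             # (2, 4, 256) (2, 4, 784)
```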
EAVL: Explicitly Align Vision and Language for Referring Image Segmentation
Referring image segmentation aims to segment an object mentioned in natural
language from an image. A main challenge is language-related localization,
which means locating the object with the relevant language. Previous approaches
mainly focus on the fusion of vision and language features without fully
addressing language-related localization. In previous approaches, fused
vision-language features are directly fed into a decoder and pass through a
convolution with a fixed kernel to obtain the result, which follows a similar
pattern as traditional image segmentation. This approach does not explicitly
align language and vision features in the segmentation stage, resulting in
suboptimal language-related localization. Different from previous methods, we
propose Explicitly Align the Vision and Language for Referring Image
Segmentation (EAVL). Instead of using a fixed convolution kernel, we propose an
Aligner which explicitly aligns the vision and language features in the
segmentation stage. Specifically, a series of unfixed convolution kernels are
generated from the input language expression and then used to explicitly align
the vision and language features. To achieve this, we generate multiple queries that
represent different emphases of the language expression. These queries are
transformed into a series of query-based convolution kernels. Then, we utilize
these kernels to do convolutions in the segmentation stage and obtain a series
of segmentation masks. The final result is obtained through the aggregation of
all masks. Our method can not only fuse vision and language features
effectively but also exploit their potential in the segmentation stage. Most
importantly, we explicitly align language features of different emphases
with the image features to achieve language-related localization. Our method
surpasses previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by
large margins.
Comment: 10 pages, 4 figures. arXiv admin note: text overlap with arXiv:2305.1496
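A minimal sketch of the language-generated ("unfixed") kernel idea follows; the 1x1 query-based kernels, the softmax aggregation weights, and all dimensions are our assumptions for exposition, not the EAVL implementation:

```python
# Illustrative sketch of language-generated convolution kernels applied in the
# segmentation stage; names and sizes are assumptions, not the EAVL code.
import torch
import torch.nn as nn

class QueryKernelHead(nn.Module):
    def __init__(self, dim=256, num_queries=8):
        super().__init__()
        self.num_queries = num_queries
        # turn each language query into a 1x1 conv kernel over the feature map
        self.to_kernel = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)  # per-query aggregation weight

    def forward(self, queries, vis_feats):
        # queries: (B, Q, C); vis_feats: (B, C, H, W)
        kernels = self.to_kernel(queries)                      # (B, Q, C)
        masks = torch.einsum("bqc,bchw->bqhw", kernels, vis_feats)
        weights = self.score(queries).softmax(dim=1)           # (B, Q, 1)
        final = (masks * weights.unsqueeze(-1)).sum(dim=1)     # (B, H, W)
        return final, masks

if __name__ == "__main__":
    head = QueryKernelHead()
    q = torch.randn(2, 8, 256)              # queries with different emphases
    v = torch.randn(2, 256, 28, 28)
    final, per_query = head(q, v)
    print(final.shape, per_query.shape)     # (2, 28, 28) (2, 8, 28, 28)
```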
MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions
This paper strives for motion expressions guided video segmentation, which
focuses on segmenting objects in video content based on a sentence describing
the motion of the objects. Existing referring video object datasets typically
focus on salient objects and use language expressions that contain excessive
static attributes that could potentially enable the target object to be
identified in a single frame. These datasets downplay the importance of motion
in video content for language-guided video object segmentation. To investigate
the feasibility of using motion expressions to ground and segment objects in
videos, we propose a large-scale dataset called MeViS, which contains numerous
motion expressions to indicate target objects in complex environments. We
benchmarked 5 existing referring video object segmentation (RVOS) methods and
conducted a comprehensive comparison on the MeViS dataset. The results show
that current RVOS methods cannot effectively address motion expression-guided
video segmentation. We further analyze the challenges and propose a baseline
approach for the proposed MeViS dataset. The goal of our benchmark is to
provide a platform that enables the development of effective language-guided
video segmentation algorithms that leverage motion expressions as a primary cue
for object segmentation in complex video scenes. The proposed MeViS dataset has
been released at https://henghuiding.github.io/MeViS.
Comment: ICCV 2023, Project Page: https://henghuiding.github.io/MeViS
MMNet: Multi-Mask Network for Referring Image Segmentation
Referring image segmentation aims to segment an object referred to by natural
language expression from an image. However, this task is challenging due to the
distinct data properties between text and image, and the randomness introduced
by diverse objects and unrestricted language expressions. Most previous work
focuses on improving cross-modal feature fusion while not fully addressing the
inherent uncertainty caused by diverse objects and unrestricted language. To
tackle these problems, we propose an end-to-end Multi-Mask Network for
referring image segmentation (MMNet). We first fuse image and language features and
then employ an attention mechanism to generate multiple queries that represent
different aspects of the language expression. We then utilize these queries to
produce a series of corresponding segmentation masks, assigning a score to each
mask that reflects its importance. The final result is obtained through the
weighted sum of all masks, which greatly reduces the randomness of the language
expression. Our proposed framework demonstrates superior performance compared
to state-of-the-art approaches on the three most commonly used datasets, RefCOCO,
RefCOCO+ and G-Ref, without the need for any post-processing. This further
validates the efficacy of our proposed framework.
Comment: 10 pages, 5 figures
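The multi-query, score-weighted aggregation described above could look roughly like the following hypothetical PyTorch head; the embedding-based queries, attention layer, and scoring layer are our assumptions, not the MMNet code:

```python
# Hedged sketch of generating several language-emphasis queries by attending
# over fused image-text features, then taking a score-weighted sum of the
# per-query masks. Module names and dims are assumptions, not the MMNet code.
import torch
import torch.nn as nn

class MultiQueryMaskHead(nn.Module):
    def __init__(self, dim=256, num_queries=6, heads=8):
        super().__init__()
        self.query_embed = nn.Embedding(num_queries, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, fused_feats):
        # fused_feats: (B, H*W, C) combined image + language features
        B = fused_feats.size(0)
        q = self.query_embed.weight.unsqueeze(0).repeat(B, 1, 1)
        q, _ = self.attn(q, fused_feats, fused_feats)          # (B, Q, C)
        masks = torch.einsum("bqc,bnc->bqn", q, fused_feats)   # per-query masks
        weights = self.score(q).softmax(dim=1)                 # (B, Q, 1)
        return (masks * weights).sum(dim=1)                    # weighted sum

if __name__ == "__main__":
    head = MultiQueryMaskHead()
    fused = torch.randn(2, 28 * 28, 256)
    print(head(fused).shape)   # torch.Size([2, 784])
```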
Text Augmented Spatial-aware Zero-shot Referring Image Segmentation
In this paper, we study a challenging task of zero-shot referring image
segmentation. This task aims to identify the instance mask that is most related
to a referring expression without training on pixel-level annotations. Previous
research takes advantage of pre-trained cross-modal models, e.g., CLIP, to
align instance-level masks with referring expressions. Yet, CLIP only considers the global-level
alignment of image-text pairs, neglecting fine-grained matching between the
referring sentence and local image regions. To address this challenge, we
introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image
segmentation framework that is training-free and robust to various visual
encoders. TAS incorporates a mask proposal network for instance-level mask
extraction, a text-augmented visual-text matching score for mining the
image-text correlation, and a spatial rectifier for mask post-processing.
Notably, the text-augmented visual-text matching score leverages a P-score
and an N-score in addition to the typical visual-text matching score. The
P-score is utilized to close the visual-text domain gap through a surrogate
captioning model, where the score is computed between the surrogate
model-generated texts and the referring expression. The N-score considers the
fine-grained alignment of region-text pairs via negative phrase mining,
encouraging the masked image to be repelled from the mined distracting phrases.
Extensive experiments are conducted on various datasets, including RefCOCO,
RefCOCO+, and RefCOCOg. The proposed method clearly outperforms
state-of-the-art zero-shot referring image segmentation methods.
Comment: Findings of EMNLP 2023
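Since TAS is training-free, its core is score combination over mask proposals. A hedged sketch of how the visual-text matching score, the caption-based P-score, and the negative-phrase N-score might be combined to rank proposals is shown below; the weighting scheme and function names are assumptions, not the authors' implementation:

```python
# Hedged sketch of ranking mask proposals by a combined score; names and the
# linear weighting are assumptions, not the TAS code.
import torch
import torch.nn.functional as F

def rank_proposals(mask_img_emb, text_emb, caption_embs, neg_phrase_embs,
                   alpha=1.0, beta=1.0):
    """All inputs are L2-normalized CLIP-style embeddings.
    mask_img_emb:    (P, D) one embedding per masked-image proposal
    text_emb:        (D,)   referring expression embedding
    caption_embs:    (P, D) embeddings of captions generated for each proposal
    neg_phrase_embs: (K, D) embeddings of mined distracting phrases
    """
    v_score = mask_img_emb @ text_emb                  # visual-text matching
    p_score = caption_embs @ text_emb                  # caption-text (P-score)
    n_score = (mask_img_emb @ neg_phrase_embs.T).mean(-1)  # distractors (N-score)
    total = v_score + alpha * p_score - beta * n_score
    return total.argmax().item()

if __name__ == "__main__":
    P, K, D = 5, 3, 512
    pick = rank_proposals(F.normalize(torch.randn(P, D), dim=-1),
                          F.normalize(torch.randn(D), dim=0),
                          F.normalize(torch.randn(P, D), dim=-1),
                          F.normalize(torch.randn(K, D), dim=-1))
    print("selected proposal:", pick)
```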
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
Modeling textual or visual information with vector representations trained
from large language or visual datasets has been successfully explored in recent
years. However, tasks such as visual question answering require combining these
vector representations with each other. Approaches to multimodal pooling
include element-wise product or sum, as well as concatenation of the visual and
textual representations. We hypothesize that these methods are not as
expressive as an outer product of the visual and textual vectors. As the outer
product is typically infeasible due to its high dimensionality, we instead
propose utilizing Multimodal Compact Bilinear pooling (MCB) to efficiently and
expressively combine multimodal features. We extensively evaluate MCB on the
visual question answering and grounding tasks. We consistently show the benefit
of MCB over ablations without MCB. For visual question answering, we present an
architecture which uses MCB twice, once for predicting attention over spatial
features and again to combine the attended representation with the question
representation. This model outperforms the state-of-the-art on the Visual7W
dataset and the VQA challenge.
Comment: Accepted to EMNLP 2016
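Compact bilinear pooling follows the standard Tensor Sketch construction: each input vector is projected with a Count Sketch, and the two sketches are convolved via an element-wise product in the FFT domain. A minimal PyTorch sketch (the output dimension and hashing details are illustrative) is:

```python
# Minimal sketch of compact bilinear pooling via Count Sketch + FFT, following
# the standard Tensor Sketch construction; hyperparameters are illustrative.
import torch

def count_sketch(x, h, s, d):
    """Project x (B, n) to d dims using a fixed hash h (n,) and signs s (n,)."""
    B, n = x.shape
    out = x.new_zeros(B, d)
    out.index_add_(1, h, x * s)          # scatter-add signed features
    return out

def mcb(v, t, d=16000, seed=0):
    """Approximate the outer product of v and t with a d-dim vector."""
    g = torch.Generator().manual_seed(seed)   # hashes must stay fixed in practice
    n_v, n_t = v.shape[1], t.shape[1]
    h_v = torch.randint(0, d, (n_v,), generator=g)
    h_t = torch.randint(0, d, (n_t,), generator=g)
    s_v = torch.randint(0, 2, (n_v,), generator=g).float() * 2 - 1
    s_t = torch.randint(0, 2, (n_t,), generator=g).float() * 2 - 1
    # element-wise product in the frequency domain = circular convolution
    fv = torch.fft.rfft(count_sketch(v, h_v, s_v, d))
    ft = torch.fft.rfft(count_sketch(t, h_t, s_t, d))
    return torch.fft.irfft(fv * ft, n=d)

if __name__ == "__main__":
    vis = torch.randn(2, 2048)   # e.g. pooled CNN feature
    txt = torch.randn(2, 1024)   # e.g. question/phrase embedding
    print(mcb(vis, txt).shape)   # torch.Size([2, 16000])
```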
Shatter and Gather: Learning Referring Image Segmentation with Text Supervision
Referring image segmentation, the task of segmenting any arbitrary entities
described in free-form texts, opens up a variety of vision applications.
However, manual labeling of training data for this task is prohibitively
costly, leading to a lack of labeled data for training. We address this issue with
a weakly supervised learning approach using text descriptions of training
images as the only source of supervision. To this end, we first present a new
model that discovers semantic entities in an input image and then combines such
entities relevant to text query to predict the mask of the referent. We also
present a new loss function that allows the model to be trained without any
further supervision. Our method was evaluated on four public benchmarks for
referring image segmentation, where it clearly outperformed the existing method
for the same task and recent open-vocabulary segmentation models on all the
benchmarks.
Comment: Accepted to ICCV 2023, Project page: https://southflame.github.io/sag
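As a rough illustration of the "gather" step, entity masks discovered by the model could be combined according to each entity's relevance to the text query; the snippet below is a hypothetical sketch under that reading, not the authors' code:

```python
# Hypothetical sketch of combining discovered entities by their relevance to
# the text query. Names, shapes, and the softmax weighting are assumptions.
import torch

def gather_referent_mask(entity_feats, entity_masks, text_emb, tau=0.07):
    """entity_feats: (K, D), entity_masks: (K, H, W), text_emb: (D,)."""
    sim = entity_feats @ text_emb                    # relevance of each entity
    rel = torch.softmax(sim / tau, dim=0)            # (K,)
    return torch.einsum("k,khw->hw", rel, entity_masks)

if __name__ == "__main__":
    K, D, H, W = 6, 256, 28, 28
    mask = gather_referent_mask(torch.randn(K, D),
                                torch.rand(K, H, W),
                                torch.randn(D))
    print(mask.shape)   # torch.Size([28, 28])
```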
Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation
Parameter Efficient Tuning (PET) has gained attention for reducing the number
of parameters while maintaining performance and providing better hardware
resource savings, but few studies investigate dense prediction tasks and
interaction between modalities. In this paper, we investigate efficient tuning
for referring image segmentation. We propose a novel
adapter called Bridger to facilitate cross-modal information exchange and
inject task-specific information into the pre-trained model. We also design a
lightweight decoder for image segmentation. Our approach achieves comparable or
superior performance with only 1.61% to 3.38% backbone parameter updates,
evaluated on challenging benchmarks. The code is available at
https://github.com/kkakkkka/ETRIS.
Comment: Computer Vision and Natural Language Processing. 14 pages, 8 figures. ICCV 2023
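A generic cross-modal adapter in this spirit, inserting a small trainable bottleneck between otherwise frozen vision and language streams, might look like the following; the dimensions and bidirectional attention design are our assumptions, not the released Bridger module:

```python
# Illustrative parameter-efficient adapter between a frozen vision stream and
# a frozen language stream; a generic sketch, not the Bridger/ETRIS module.
import torch
import torch.nn as nn

class CrossModalAdapter(nn.Module):
    def __init__(self, vis_dim=768, lang_dim=512, bottleneck=64, heads=4):
        super().__init__()
        self.v_down = nn.Linear(vis_dim, bottleneck)
        self.l_down = nn.Linear(lang_dim, bottleneck)
        self.v2l = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.l2v = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.v_up = nn.Linear(bottleneck, vis_dim)
        self.l_up = nn.Linear(bottleneck, lang_dim)

    def forward(self, vis_tokens, lang_tokens):
        v, l = self.v_down(vis_tokens), self.l_down(lang_tokens)
        v_new, _ = self.l2v(v, l, l)     # vision attends to language
        l_new, _ = self.v2l(l, v, v)     # language attends to vision
        # residual injection back into the (frozen) backbone streams
        return vis_tokens + self.v_up(v_new), lang_tokens + self.l_up(l_new)

if __name__ == "__main__":
    adapter = CrossModalAdapter()
    vis = torch.randn(2, 197, 768)   # e.g. ViT patch tokens
    lang = torch.randn(2, 20, 512)   # e.g. text tokens
    v, l = adapter(vis, lang)
    print(v.shape, l.shape)          # only adapter parameters would be trained
```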