Referring Expression Comprehension: A Survey of Methods and Datasets
Referring expression comprehension (REC) aims to localize a target object in
an image described by a referring expression phrased in natural language.
Unlike object detection, where the queried object labels are pre-defined, REC
can only observe its queries at test time, which makes it more challenging
than a conventional computer vision problem. The task has attracted
considerable attention from both the computer vision and natural language
processing communities, and several lines of work have been proposed, from
CNN-RNN models and modular networks to complex graph-based models. In this
survey, we first examine the state of the art by comparing modern approaches
to the problem. We classify methods by the mechanism they use to encode the
visual and textual modalities. In particular, we examine the common approach
of jointly embedding images and expressions into a shared feature space. We
also discuss modular architectures and graph-based models that interface with
structured graph representations. In the second part of this survey, we review
the datasets available for training and evaluating REC systems, and group
results by dataset, backbone model, and setting so that they can be fairly
compared. Finally, we discuss promising future directions for the field, in
particular compositional referring expression comprehension, which requires
longer reasoning chains to address.
Comment: Accepted to IEEE TM
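The joint-embedding approach the survey highlights can be illustrated with a minimal numpy sketch: region features and a pooled expression embedding are projected into a shared space and compared by cosine similarity. All shapes, projection matrices, and names here are hypothetical illustrations, not any specific surveyed model; real systems learn the projections end to end.

```python
import numpy as np

def l2norm(x):
    """L2-normalize along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def score_regions(region_feats, expr_feat, W_v, W_t):
    """Project both modalities into a shared space, score by cosine similarity."""
    v = l2norm(region_feats @ W_v)   # (num_regions, d) visual embeddings
    t = l2norm(expr_feat @ W_t)      # (d,) textual embedding
    return v @ t                     # (num_regions,) one similarity per region

rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 2048))  # CNN features for 5 candidate regions
expr = rng.normal(size=(300,))        # pooled expression embedding
W_v = rng.normal(size=(2048, 256))    # visual projection (random stand-in)
W_t = rng.normal(size=(300, 256))     # textual projection (random stand-in)
scores = score_regions(regions, expr, W_v, W_t)
best = int(np.argmax(scores))         # index of the predicted referent
```

In a trained model the highest-scoring region is returned as the localization result; here the weights are random, so only the mechanics are shown.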
One for All: One-stage Referring Expression Comprehension with Dynamic Reasoning
Referring Expression Comprehension (REC) is one of the most important tasks
in visual reasoning: it requires a model to detect the target object referred
to by a natural language expression. Among the proposed pipelines, one-stage
Referring Expression Comprehension (OSREC) has become the dominant trend since
it merges the region proposal and selection stages. Many state-of-the-art
OSREC models adopt a multi-hop reasoning strategy, because a single expression
frequently mentions a sequence of objects whose semantic relations require
multi-hop reasoning to analyze. However, one unresolved issue with these
models is that the number of reasoning steps must be pre-defined and fixed
before inference, ignoring the varying complexity of expressions. In this
paper, we propose a Dynamic Multi-step Reasoning Network, which allows the
number of reasoning steps to be adjusted dynamically based on the reasoning
state and the expression's complexity. Specifically, we adopt a Transformer
module to memorize and process the reasoning state, and a reinforcement
learning strategy to dynamically infer the number of reasoning steps. The work
achieves state-of-the-art performance or significant improvements on several
REC datasets, ranging from RefCOCO (+, g), with short expressions, to
Ref-Reasoning, a dataset with long and complex compositional expressions.
Comment: 27 pages, 6 figure
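The core idea of adapting the number of reasoning steps can be sketched as an early-halting loop. In the paper a Transformer maintains the reasoning state and an RL policy decides when to stop; the toy sharpening step and the threshold below are stand-ins of my own, not the authors' mechanism.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dynamic_reasoning(logits, halt_thresh=0.9, max_steps=5):
    """Run reasoning hops until the object-score distribution is confident.

    A simple expression may halt after one hop; a complex compositional one
    uses more hops, up to max_steps.
    """
    steps = 0
    for _ in range(max_steps):
        probs = softmax(logits)
        if probs.max() >= halt_thresh:
            break                 # confident enough: halt early
        logits = logits * 1.5     # toy stand-in for one reasoning hop
        steps += 1
    return int(np.argmax(logits)), steps

target, n_steps = dynamic_reasoning(np.array([0.2, 1.0, 0.4]))
```

The point of the sketch is only that the halting decision depends on the current reasoning state rather than a fixed hyperparameter.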
Referring Image Segmentation via Cross-Modal Progressive Comprehension
Referring image segmentation aims to segment the foreground masks of the
entities that match the description given in a natural language expression.
Previous approaches tackle this problem using implicit feature interaction and
fusion between the visual and linguistic modalities, but usually fail to
exploit the informative words of the expression to align features from the
two modalities for accurately identifying the referred entity. In this paper,
we propose a Cross-Modal Progressive Comprehension (CMPC) module and a
Text-Guided Feature Exchange (TGFE) module to effectively address this
challenging task. Concretely, the CMPC module first employs entity and
attribute words to perceive all the related entities that might be mentioned
by the expression. The relational words are then adopted to highlight the
correct entity, and to suppress the irrelevant ones, via multimodal graph
reasoning. In addition to the CMPC module, we further leverage a simple yet
effective TGFE module to integrate the reasoned multimodal features from
different levels under the guidance of textual information. In this way,
features from multiple levels can communicate with each other and be refined
based on the textual context. We conduct extensive experiments on four popular
referring segmentation benchmarks and achieve new state-of-the-art
performance.
Comment: Accepted by CVPR 2020. Code is available at
https://github.com/spyflying/CMPC-Refse
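The progressive two-stage scoring idea (entity/attribute words first, relational words second) can be caricatured in a few lines. The scores, graph, and weighting below are invented for illustration and greatly simplify the paper's multimodal graph reasoning.

```python
import numpy as np

def progressive_comprehension(entity_scores, adjacency, rel_weight):
    """Stage 1: entity/attribute words give an initial score per candidate.
    Stage 2: one propagation step along a (hypothetical) spatial graph lets a
    relational word like "next to" boost candidates with matching neighbours.
    """
    relational = rel_weight * (adjacency @ entity_scores)
    return entity_scores + relational

# "the dog next to the car": two dogs and a car as candidate entities
entity_scores = np.array([0.9, 0.9, 0.2])  # both dogs match "dog" equally
adjacency = np.array([[0, 0, 0],           # adjacency[i][j] = 1 if j is next to i
                      [0, 0, 1],           # only the second dog is next to the car
                      [0, 1, 0]])
scores = progressive_comprehension(entity_scores, adjacency, rel_weight=0.5)
best = int(np.argmax(scores))              # the second dog is highlighted
```

Entity words alone cannot separate the two dogs; the relational propagation step breaks the tie, which is exactly the role the CMPC module assigns to relational words.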
What Goes beyond Multi-modal Fusion in One-stage Referring Expression Comprehension: An Empirical Study
Most existing work on one-stage referring expression comprehension (REC)
focuses on multi-modal fusion and reasoning, while the influence of other
factors in this task lacks in-depth exploration. To fill this gap, we conduct
an empirical study in this paper. Concretely, we first build a very simple REC
network called SimREC and ablate 42 candidate designs/settings, covering the
entire one-stage REC process from network design to model training.
Afterwards, we conduct over 100 experimental trials on three REC benchmark
datasets. The extensive experimental results not only reveal the key factors
that affect REC performance beyond multi-modal fusion, e.g., multi-scale
features and data augmentation, but also yield findings that run counter to
conventional understanding. For example, despite being a vision-and-language
(V&L) task, REC is less impacted by language priors. In addition, with a
proper combination of these findings, we can improve the performance of
SimREC by a large margin, e.g., +27.12% on RefCOCO+, which outperforms all
existing REC methods. But the most encouraging finding is that with much lower
training overhead and far fewer parameters, SimREC can still achieve better
performance than a set of large-scale pre-trained models, e.g., UNITER and
VILLA, highlighting the special role of REC in existing V&L research
Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding
The prevailing framework for solving referring expression grounding is based
on a two-stage process: 1) detecting proposals with an object detector, and
2) grounding the referent to one of the proposals. Existing two-stage
solutions mostly focus on the grounding step, which aims to align the
expressions with the proposals. In this paper, we argue that these methods
overlook an obvious mismatch between the roles of proposals in the two
stages: proposals are generated solely based on detection confidence (i.e.,
expression-agnostic), in the hope that they contain all the right instances
mentioned in the expression (i.e., expression-aware). Due to this mismatch,
current two-stage methods suffer a severe performance drop between detected
and ground-truth proposals. To this end, we propose Ref-NMS, the first method
to yield expression-aware proposals at the first stage. Ref-NMS regards all
nouns in the expression as critical objects and introduces a lightweight
module to predict a score for aligning each box with a critical object. These
scores guide the NMS operation to filter out boxes irrelevant to the
expression, increasing the recall of critical objects and resulting in
significantly improved grounding performance. Since Ref-NMS is agnostic to
the grounding step, it can be easily integrated into any state-of-the-art
two-stage method. Extensive ablation studies on several backbones,
benchmarks, and tasks consistently demonstrate the superiority of Ref-NMS.
Codes are available at: https://github.com/ChopinSharp/ref-nms
Comment: Appears in AAAI 2021
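The expression-aware suppression idea can be sketched as NMS ranked by a fused score instead of detection confidence alone. The product fusion, boxes, and scores below are illustrative assumptions; the actual Ref-NMS module learns its relevance scores from noun-box alignment.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda c: (c[2] - c[0]) * (c[3] - c[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def expression_aware_nms(boxes, det_conf, rel_score, iou_thresh=0.5):
    """Rank boxes by a fused detection/relevance score before suppression."""
    fused = det_conf * rel_score   # simple product fusion (an assumption)
    order = np.argsort(-fused)
    keep = []
    for i in order:
        # keep a box only if it does not heavily overlap a kept box
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(int(i))
    return keep

boxes = np.array([[0, 0, 10, 10],     # high confidence, irrelevant to the expression
                  [1, 1, 11, 11],     # overlaps box 0, highly relevant
                  [20, 20, 30, 30]])  # disjoint box
det_conf = np.array([0.9, 0.6, 0.8])
rel_score = np.array([0.1, 0.9, 0.5])
keep = expression_aware_nms(boxes, det_conf, rel_score)
```

Under confidence-only NMS, box 0 would suppress the relevant box 1; with the fused score the expression-relevant box survives, which is the recall gain the paper targets.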