Knowledge-guided Pairwise Reconstruction Network for Weakly Supervised Referring Expression Grounding
Weakly supervised referring expression grounding (REG) aims at localizing the
referential entity in an image according to a linguistic query, where the mapping
between the image region (proposal) and the query is unknown during the training
stage. In referring expressions, people usually describe a target entity in
terms of its relationship with other contextual entities as well as visual
attributes. However, previous weakly supervised REG methods rarely pay
attention to the relationship between the entities. In this paper, we propose a
knowledge-guided pairwise reconstruction network (KPRN), which models the
relationship between the target entity (subject) and contextual entity (object)
as well as grounds these two entities. Specifically, we first design a
knowledge extraction module to guide the proposal selection of subject and
object. The prior knowledge is obtained in a specific form of semantic
similarities between each proposal and the subject/object. Second, guided by
such knowledge, we design the subject and object attention module to construct
the subject-object proposal pairs. The subject attention excludes the unrelated
proposals from the candidate proposals. The object attention selects the most
suitable proposal as the contextual proposal. Third, we introduce a pairwise
attention and an adaptive weighting scheme to learn the correspondence between
these proposal pairs and the query. Finally, a pairwise reconstruction module
is used to measure the grounding for weakly supervised learning. Extensive
experiments on four large-scale datasets show our method outperforms existing
state-of-the-art methods by a large margin.
Comment: Accepted by ACMMM 2019. arXiv admin note: text overlap with arXiv:1908.1056
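The subject/object attention pipeline described above can be illustrated with a toy sketch. Everything here is an illustrative assumption, not the paper's actual model: embeddings stand in for proposal and word features, cosine similarity stands in for the learned semantic similarity, and the proposal names are made up.

```python
# Toy sketch of knowledge-guided subject-object proposal pairing.
# All vectors, names, and thresholds below are illustrative assumptions.

def cosine(u, v):
    # Cosine similarity between two plain-Python vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def pair_proposals(proposals, subject_vec, object_vec, keep=2):
    # Step 1 (knowledge extraction): semantic similarity between each
    # proposal embedding and the subject/object word embeddings.
    subj_scores = [(name, cosine(vec, subject_vec)) for name, vec in proposals]
    obj_scores = [(name, cosine(vec, object_vec)) for name, vec in proposals]
    # Step 2 (subject attention): keep only the top-`keep` subject candidates,
    # excluding unrelated proposals from the candidate set.
    subj_candidates = sorted(subj_scores, key=lambda s: -s[1])[:keep]
    # Step 2 (object attention): pick the single best contextual proposal.
    best_object = max(obj_scores, key=lambda s: s[1])
    # Step 3: form subject-object pairs for pairwise matching/reconstruction.
    return [(s[0], best_object[0]) for s in subj_candidates]

proposals = [
    ("p0", [0.9, 0.1, 0.0]),   # looks like the subject (e.g. "man")
    ("p1", [0.8, 0.2, 0.1]),   # also subject-like
    ("p2", [0.0, 0.1, 0.9]),   # looks like the context object (e.g. "chair")
]
pairs = pair_proposals(proposals, subject_vec=[1.0, 0.0, 0.0],
                       object_vec=[0.0, 0.0, 1.0])
print(pairs)  # -> [('p0', 'p2'), ('p1', 'p2')]
```

In the full method these pairs would then be scored against the query and used for pairwise reconstruction; here they simply demonstrate how the prior knowledge narrows the candidate set before pairing.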
Referring Expression Comprehension: A Survey of Methods and Datasets
Referring expression comprehension (REC) aims to localize a target object in
an image described by a referring expression phrased in natural language.
Unlike the object detection task, where the queried object labels are
pre-defined, the REC problem can only observe its queries at test time, making
it more challenging than a conventional computer vision problem. The task has
attracted a lot of attention from both the computer vision and natural language
processing communities, and several lines of work have been proposed, ranging
from CNN-RNN models and modular networks to complex graph-based models. In this
survey, we
first examine the state of the art by comparing modern approaches to the
problem. We classify methods by their mechanism to encode the visual and
textual modalities. In particular, we examine the common approach of joint
embedding images and expressions to a common feature space. We also discuss
modular architectures and graph-based models that interface with structured
graph representation. In the second part of this survey, we review the datasets
available for training and evaluating REC systems. We then group results
according to the datasets, backbone models, and settings so that they can be fairly
compared. Finally, we discuss promising future directions for the field, in
particular compositional referring expression comprehension, which requires
longer reasoning chains to address.
Comment: Accepted to IEEE TM
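The joint-embedding family of methods surveyed above can be sketched in a few lines: both the region feature and the expression feature are projected into a shared space, and the region with the highest similarity is returned. The projection matrices, region names, and feature values here are toy assumptions standing in for learned networks.

```python
# Minimal sketch of joint-embedding scoring for REC.
# Matrices and features are illustrative assumptions, not learned weights.

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def score(region_feat, expr_feat, W_vis, W_txt):
    # Project each modality into the common embedding space, then compare.
    v = matvec(W_vis, region_feat)
    t = matvec(W_txt, expr_feat)
    return dot(v, t)

def ground(regions, expr_feat, W_vis, W_txt):
    # REC inference: return the region whose joint embedding best matches
    # the expression embedding.
    return max(regions, key=lambda r: score(r[1], expr_feat, W_vis, W_txt))[0]

W_vis = [[1.0, 0.0], [0.0, 1.0]]   # identity projections for illustration
W_txt = [[1.0, 0.0], [0.0, 1.0]]
regions = [("dog", [0.9, 0.1]), ("frisbee", [0.1, 0.9])]
best_region = ground(regions, [1.0, 0.0], W_vis, W_txt)
print(best_region)  # -> dog
```

Modular and graph-based models replace this single similarity with structured, per-component scores, but the same project-then-compare principle underlies them.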
Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding
3D visual grounding involves finding a target object in a 3D scene that
corresponds to a given sentence query. Although many approaches have been
proposed and achieved impressive performance, they all require dense
object-sentence pair annotations in 3D point clouds, which are both
time-consuming and expensive. To address the problem that fine-grained
annotated data is difficult to obtain, we propose to leverage weakly supervised
annotations to learn the 3D visual grounding model, i.e., only coarse
scene-sentence correspondences are used to learn object-sentence links. To
accomplish this, we design a novel semantic matching model that analyzes the
semantic similarity between object proposals and sentences in a coarse-to-fine
manner. Specifically, we first extract object proposals and coarsely select the
top-K candidates based on feature and class similarity matrices. Next, we
reconstruct the masked keywords of the sentence using each candidate one by
one, and the reconstructed accuracy finely reflects the semantic similarity of
each candidate to the query. Additionally, we distill the coarse-to-fine
semantic matching knowledge into a typical two-stage 3D visual grounding model,
which reduces inference costs and improves performance by taking full advantage
of the well-studied structure of the existing architectures. We conduct
extensive experiments on ScanRefer, Nr3D, and Sr3D, which demonstrate the
effectiveness of our proposed method.
Comment: ICCV202
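The coarse-to-fine matching described above can be sketched as a two-stage filter. The class-overlap scoring, attribute sets, and "reconstruction accuracy" rule below are deliberately simplified assumptions: in the paper both stages use learned features, not exact string matches.

```python
# Toy sketch of coarse-to-fine semantic matching for weakly supervised
# 3D visual grounding. Proposals and the reconstruction rule are assumptions.

def coarse_select(proposals, query_class, k=2):
    # Coarse stage: score each proposal by class agreement with the query
    # (standing in for the feature/class similarity matrices).
    scored = [(p, 1.0 if p["class"] == query_class else 0.0) for p in proposals]
    return [p for p, _ in sorted(scored, key=lambda s: -s[1])[:k]]

def fine_rank(candidates, masked_keywords):
    # Fine stage: "reconstruction accuracy" = fraction of masked query
    # keywords that a candidate's attributes can recover.
    def accuracy(p):
        hits = sum(kw in p["attributes"] for kw in masked_keywords)
        return hits / len(masked_keywords)
    return max(candidates, key=accuracy)

proposals = [
    {"id": 0, "class": "chair", "attributes": {"red", "wooden"}},
    {"id": 1, "class": "chair", "attributes": {"blue", "plastic"}},
    {"id": 2, "class": "table", "attributes": {"red"}},
]
# Query: "the red wooden chair", with "red" and "wooden" masked.
top_k = coarse_select(proposals, query_class="chair", k=2)
best = fine_rank(top_k, masked_keywords=["red", "wooden"])
print(best["id"])  # -> 0
```

The distillation step then transfers these stage-wise scores into a standard two-stage grounding model, so the matching pipeline is only needed at training time.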
Who are you referring to?: Coreference resolution in image narrations
Coreference resolution aims to identify words and phrases which refer to the same entity in a text, a core task in natural language processing. In this paper, we extend this task to resolving coreferences in long-form narrations of visual scenes. First, we introduce a new dataset with annotated coreference chains and their bounding boxes, as most existing image-text datasets only contain short sentences without coreferring expressions or labeled chains. We propose a new technique that learns to identify coreference chains using weak supervision, only from image-text pairs and a regularization using prior linguistic knowledge. Our model yields large performance gains over several strong baselines in resolving coreferences. We also show that coreference resolution helps improve the grounding of narratives in images.
Discriminative Triad Matching and Reconstruction for Weakly Referring Expression Grounding
In this paper, we tackle the weakly supervised referring expression grounding
task, which localizes a referent object in an image according to a query
sentence when the mapping between image regions and queries is not available
during the training stage. In traditional methods, the object region that best
matches the referring expression is picked out, and the query sentence is then
reconstructed from the selected region, with the reconstruction difference
serving as the loss for back-propagation. Existing methods, however, conduct
both the matching and the reconstruction only approximately, as they ignore the
fact that the matching correctness is unknown.
To overcome this limitation, a discriminative triad is designed here as the
basis of the solution, through which a query can be converted into one or
more discriminative triads in a very scalable way. Based on the
discriminative triad, we further propose triad-level matching and
reconstruction modules that are lightweight yet effective for
weakly supervised training, making our model three times lighter and faster
than the previous state-of-the-art methods. One important merit of our work is
its superior performance despite the simple and neat design. Specifically, the
proposed method achieves a new state-of-the-art accuracy when evaluated on the
RefCOCO (39.21%), RefCOCO+ (39.18%), and RefCOCOg (43.24%) datasets, which is
4.17%, 4.08%, and 7.8% higher than the previous best, respectively.
Comment: TPAM
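The triad idea above, a query decomposed into (target, relationship, context) units that are matched against proposal pairs, can be sketched as follows. The tiny rule-based parser and the scene facts are assumptions for the demo; the paper derives triads and matching scores from learned language and visual features.

```python
# Illustrative sketch of discriminative triads: parse a query into
# (target, relationship, context) and match it against proposal pairs.
# The parser rules and scene facts below are hypothetical.

def parse_triads(query):
    # Hypothetical parser: split on a small set of relation words.
    relations = ["on", "under", "next to", "left of"]
    for rel in relations:
        sep = f" {rel} "
        if sep in query:
            target, context = query.split(sep, 1)
            return [(target.replace("the ", ""), rel,
                     context.replace("the ", ""))]
    # No relation found: the triad degenerates to the target alone.
    return [(query.replace("the ", ""), None, None)]

def match_triad(triad, scene_facts):
    # Triad-level matching: return every proposal pair consistent with
    # the (target, relationship, context) constraint.
    target, rel, context = triad
    return [(a, b) for (a, r, b) in scene_facts
            if a.startswith(target) and r == rel and b.startswith(context)]

scene_facts = [
    ("cat_1", "on", "mat_1"),
    ("cat_2", "under", "table_1"),
    ("dog_1", "on", "mat_1"),
]
triads = parse_triads("the cat on the mat")
print(triads)                               # -> [('cat', 'on', 'mat')]
print(match_triad(triads[0], scene_facts))  # -> [('cat_1', 'mat_1')]
```

Because each triad constrains a pair of proposals rather than a whole sentence, matching and reconstruction can operate per triad, which is what keeps the modules lightweight.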