Consensus Graph Representation Learning for Better Grounded Image Captioning
Contemporary visual captioning models frequently hallucinate objects that
are not actually in the scene, owing to visual misclassification or
over-reliance on priors, which results in semantic inconsistency between
the visual information and the target lexical words. The most common remedy is to
encourage the captioning model to dynamically link generated object words or
phrases to appropriate regions of the image, i.e., grounded image
captioning (GIC). However, GIC relies on an auxiliary task (grounding objects)
that does not solve the key issue behind object hallucination, i.e., semantic
inconsistency. In this paper, we take a novel perspective on this issue:
exploiting the semantic coherency between the visual and language modalities.
Specifically, we propose the Consensus Graph Representation Learning framework
(CGRL) for GIC, which incorporates a consensus representation into the grounded
captioning pipeline. The consensus is learned by aligning the visual graph
(e.g., scene graph) to the language graph, considering both the nodes and the
edges of each graph. With the aligned consensus, the captioning model can capture
both the correct linguistic characteristics and visual relevance, and can then
ground the appropriate image regions. We validate the effectiveness of
our model, with a significant decline in object hallucination (-9% CHAIRi) on
the Flickr30k Entities dataset. In addition, CGRL is evaluated with several
automatic metrics and human evaluation; the results indicate that the proposed
approach can simultaneously improve the performance of image captioning (+2.9
Cider) and grounding (+2.3 F1LOC).
Comment: 9 pages, 5 figures, AAAI 202
Referring Expression Comprehension: A Survey of Methods and Datasets
Referring expression comprehension (REC) aims to localize a target object in
an image described by a referring expression phrased in natural language.
Unlike object detection, where the queried object labels are pre-defined,
REC can only observe the queries at test time. It
is thus more challenging than a conventional computer vision problem. This task
has attracted a lot of attention from both the computer vision and natural language
processing communities, and several lines of work have been proposed, from
CNN-RNN models and modular networks to complex graph-based models. In this survey, we
first examine the state of the art by comparing modern approaches to the
problem. We classify methods by their mechanism for encoding the visual and
textual modalities. In particular, we examine the common approach of jointly
embedding images and expressions into a common feature space. We also discuss
modular architectures and graph-based models that interface with structured
graph representations. In the second part of this survey, we review the datasets
available for training and evaluating REC systems. We then group results
according to datasets, backbone models, and settings so that they can be fairly
compared. Finally, we discuss promising future directions for the field, in
particular compositional referring expression comprehension, which requires
longer reasoning chains to address.
Comment: Accepted to IEEE TM