Relational Reasoning using Prior Knowledge for Visual Captioning
Exploiting relationships among objects has led to remarkable progress in describing images and videos with natural language. Most existing methods first detect objects and their relationships and then generate textual descriptions, a pipeline that depends heavily on pre-trained detectors and degrades when object detection suffers from heavy occlusion, tiny objects, or long-tailed categories. In addition, separating detection from captioning causes semantic inconsistency between the pre-defined object/relation categories and the target lexical words. We instead exploit prior human commonsense knowledge to reason about relationships between objects without any pre-trained detectors, while maintaining semantic coherency within an image or video during captioning. The prior knowledge (e.g., in the form of a knowledge graph) provides commonsense semantic correlations and constraints between objects that are not explicit in the image or video, serving as useful guidance for building a semantic graph for sentence generation. In particular, we present a joint reasoning method that incorporates 1) commonsense reasoning, which embeds image or video regions into a semantic space to build a semantic graph, and 2) relational reasoning, which encodes the semantic graph to generate sentences. Extensive experiments on the MS-COCO image captioning benchmark and the MSVD video captioning benchmark validate the superiority of our method in leveraging prior commonsense knowledge to enhance relational reasoning for visual captioning.
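To make the described two-stage reasoning concrete, the following is a minimal sketch, not the authors' code: region features are embedded into a semantic space, a semantic graph is built from visual similarity modulated by a knowledge-graph prior, and a graph-convolution step feeds a simple sentence decoder. All module names, layer sizes, and the `kg_prior` placeholder are illustrative assumptions.

```python
# Hedged sketch of commonsense-guided semantic graph construction plus
# relational reasoning for captioning; sizes and names are assumptions.
import torch
import torch.nn as nn


class SemanticGraphCaptioner(nn.Module):
    def __init__(self, region_dim=2048, sem_dim=512, vocab_size=10000, kg_prior=None):
        super().__init__()
        # Commonsense reasoning: embed visual regions into a semantic space.
        self.to_semantic = nn.Linear(region_dim, sem_dim)
        # Prior knowledge: relation-strength matrix between semantic nodes,
        # e.g. derived from a knowledge graph (None here for simplicity).
        self.kg_prior = kg_prior  # optional (N, N) tensor
        # Relational reasoning: one graph-convolution step over the graph.
        self.gcn = nn.Linear(sem_dim, sem_dim)
        # Sentence generation: a plain GRU decoder over pooled graph features.
        self.decoder = nn.GRU(sem_dim, sem_dim, batch_first=True)
        self.word_head = nn.Linear(sem_dim, vocab_size)

    def forward(self, regions, max_len=16):
        # regions: (B, N, region_dim) detector-free region features
        nodes = torch.relu(self.to_semantic(regions))           # (B, N, sem_dim)
        B, N, D = nodes.shape
        # Build the semantic graph: visual similarity modulated by the prior.
        sim = torch.softmax(nodes @ nodes.transpose(1, 2) / D ** 0.5, dim=-1)
        adj = sim if self.kg_prior is None else sim * self.kg_prior
        # Relational reasoning: propagate information along graph edges.
        nodes = torch.relu(self.gcn(adj @ nodes))
        # Decode a sentence from the graph-pooled context (greedy, illustrative).
        ctx = nodes.mean(dim=1, keepdim=True).repeat(1, max_len, 1)
        hidden, _ = self.decoder(ctx)
        return self.word_head(hidden)                            # (B, max_len, vocab)


# Usage with random tensors standing in for CNN region features.
model = SemanticGraphCaptioner()
logits = model(torch.randn(2, 36, 2048))
print(logits.shape)  # torch.Size([2, 16, 10000])
```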
Visual Relationship Forecasting in Videos
Real-world scenarios often require anticipating object interactions in an unknown future, which would assist the decision-making of both humans and agents. To meet this challenge, we present a new task named Visual Relationship Forecasting (VRF) in videos, which explores the prediction of visual relationships in a reasoning manner. Specifically, given a subject-object pair with H observed frames, VRF aims to predict their interactions over the next T frames without visual evidence. To evaluate the VRF task, we introduce two video datasets, VRF-AG and VRF-VidOR, which provide spatio-temporally localized visual relation annotations in each video. The two datasets densely annotate 13 and 35 visual relationships in 1923 and 13447 video clips, respectively. In addition, we present a novel Graph Convolutional Transformer (GCT) framework that captures both object-level and frame-level dependencies with a spatio-temporal Graph Convolutional Network and a Transformer. Experimental results on both VRF-AG and VRF-VidOR demonstrate that GCT outperforms state-of-the-art sequence modelling methods on visual relationship forecasting.
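The abstract does not give the GCT architecture in detail, so the following is a hedged sketch under stated assumptions: per-frame message passing between the subject and object nodes (object-level dependencies), a Transformer encoder over the H observed frames (frame-level dependencies), and a head emitting relationship logits for the next T frames. Feature sizes, layer counts, and the two-node graph are guesses, not the released implementation.

```python
# Hedged sketch of a graph-convolution-plus-Transformer forecaster for
# subject-object relationship prediction; all hyperparameters are assumptions.
import torch
import torch.nn as nn


class GCTForecaster(nn.Module):
    def __init__(self, feat_dim=1024, hid=256, num_relations=13, horizon=5):
        super().__init__()
        self.horizon = horizon
        self.num_relations = num_relations
        self.embed = nn.Linear(feat_dim, hid)
        # Object-level dependency: exchange messages between subject and object.
        self.gcn = nn.Linear(hid, hid)
        # Frame-level dependency: Transformer encoder over the observed frames.
        layer = nn.TransformerEncoderLayer(d_model=hid, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        # One relationship distribution per future frame.
        self.head = nn.Linear(hid, horizon * num_relations)

    def forward(self, subj, obj):
        # subj, obj: (B, H, feat_dim) features of the pair over H observed frames
        nodes = torch.stack([self.embed(subj), self.embed(obj)], dim=2)  # (B, H, 2, hid)
        # Fully connected two-node graph per frame: average neighbors, transform.
        messages = nodes.mean(dim=2, keepdim=True).expand_as(nodes)
        nodes = torch.relu(self.gcn(nodes + messages))
        frame_tokens = nodes.mean(dim=2)                   # (B, H, hid)
        encoded = self.temporal(frame_tokens)              # (B, H, hid)
        pooled = encoded[:, -1]                            # summary at the last observed frame
        logits = self.head(pooled)                         # (B, T * num_relations)
        return logits.view(-1, self.horizon, self.num_relations)


# Usage: H = 8 observed frames, forecasting T = 5 future frames.
model = GCTForecaster()
out = model(torch.randn(2, 8, 1024), torch.randn(2, 8, 1024))
print(out.shape)  # torch.Size([2, 5, 13])
```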
Learning Visual Relation Priors for Image-Text Matching and Image Captioning with Neural Scene Graph Generators
Grounding language to visual relations is critical to various language-and-vision applications. In this work, we tackle two fundamental language-and-vision tasks, image-text matching and image captioning, and demonstrate that neural scene graph generators can learn effective visual relation features that facilitate grounding language to visual relations and consequently improve both end applications. By combining relation features with state-of-the-art models, our experiments show significant improvements on the standard Flickr30K and MSCOCO benchmarks. Our experimental results and analysis show that relation features improve downstream models' ability to capture visual relations in end vision-and-language applications. We also demonstrate that training scene graph generators on visually relevant relations is important for the effectiveness of the relation features.
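As a rough illustration of how relation features might be combined with a downstream model, here is a minimal sketch for image-text matching only; the scene graph generator is stubbed out as a source of per-triple features, and the fusion scheme, feature sizes, and class names are assumptions rather than the paper's method.

```python
# Hedged sketch: pool relation features from a scene graph generator, fuse them
# with a global image feature, and score against a sentence embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationAugmentedMatcher(nn.Module):
    def __init__(self, img_dim=2048, rel_dim=512, txt_dim=768, joint_dim=1024):
        super().__init__()
        # Fuse the global image feature with pooled visual-relation features.
        self.visual_proj = nn.Linear(img_dim + rel_dim, joint_dim)
        self.text_proj = nn.Linear(txt_dim, joint_dim)

    def forward(self, img_feat, rel_feats, txt_feat):
        # img_feat:  (B, img_dim)      global image feature
        # rel_feats: (B, R, rel_dim)   per-triple features from a scene graph generator
        # txt_feat:  (B, txt_dim)      sentence embedding
        rel_pooled = rel_feats.mean(dim=1)
        v = F.normalize(self.visual_proj(torch.cat([img_feat, rel_pooled], dim=-1)), dim=-1)
        t = F.normalize(self.text_proj(txt_feat), dim=-1)
        return (v * t).sum(dim=-1)  # cosine matching score per image-text pair


# Usage with random tensors standing in for real features.
matcher = RelationAugmentedMatcher()
score = matcher(torch.randn(4, 2048), torch.randn(4, 36, 512), torch.randn(4, 768))
print(score.shape)  # torch.Size([4])
```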