Visual Question Answering: A Survey of Methods and Datasets
Visual Question Answering (VQA) is a challenging task that has received
increasing attention from both the computer vision and the natural language
processing communities. Given an image and a question in natural language, it
requires reasoning over visual elements of the image and general knowledge to
infer the correct answer. In the first part of this survey, we examine the
state of the art by comparing modern approaches to the problem. We classify
methods by the mechanism they use to connect the visual and textual modalities. In
particular, we examine the common approach of combining convolutional and
recurrent neural networks to map images and questions to a common feature
space. We also discuss memory-augmented and modular architectures that
interface with structured knowledge bases. In the second part of this survey,
we review the datasets available for training and evaluating VQA systems. The
various datasets contain questions at different levels of complexity, which
require different capabilities and types of reasoning. We examine in depth the
question/answer pairs from the Visual Genome project, and evaluate the
relevance of the structured annotations of images with scene graphs for VQA.
Finally, we discuss promising future directions for the field, in particular
the connection to structured knowledge bases and the use of natural language
processing models.
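To make the commonly surveyed pipeline concrete, the following is a minimal PyTorch sketch of the CNN + RNN approach described above: a convolutional image encoder and a recurrent question encoder projected into a common feature space, fused element-wise, and classified over a fixed answer vocabulary. The layer sizes, the tiny stand-in CNN, and the answer-vocabulary size are illustrative assumptions, not the survey's reference implementation.

```python
# Minimal sketch of the CNN + RNN fusion baseline for VQA (illustrative only).
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512,
                 common_dim=1024, num_answers=1000):
        super().__init__()
        # Image branch: a small CNN standing in for a pretrained backbone.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.img_proj = nn.Linear(128, common_dim)
        # Question branch: word embeddings fed to an LSTM.
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.q_proj = nn.Linear(hidden_dim, common_dim)
        # Classification over a fixed answer vocabulary.
        self.classifier = nn.Linear(common_dim, num_answers)

    def forward(self, image, question_tokens):
        img = torch.tanh(self.img_proj(self.cnn(image)))      # (B, common_dim)
        _, (h_n, _) = self.lstm(self.embed(question_tokens))
        q = torch.tanh(self.q_proj(h_n[-1]))                   # (B, common_dim)
        return self.classifier(img * q)                        # element-wise fusion

# Dummy usage:
model = SimpleVQA(vocab_size=10000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(1, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 1000])
```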
Variational Cross-Graph Reasoning and Adaptive Structured Semantics Learning for Compositional Temporal Grounding
Temporal grounding is the task of locating a specific segment from an
untrimmed video according to a query sentence. This task has gained
significant momentum in the computer vision community as it enables activity
grounding beyond pre-defined activity classes by utilizing the semantic
diversity of natural language descriptions. The semantic diversity is rooted in
the principle of compositionality in linguistics, where novel semantics can be
systematically described by combining known words in novel ways (compositional
generalization). However, existing temporal grounding datasets are not
carefully designed to evaluate the compositional generalizability. To
systematically benchmark the compositional generalizability of temporal
grounding models, we introduce a new Compositional Temporal Grounding task and
construct two new dataset splits, i.e., Charades-CG and ActivityNet-CG. When
evaluating the state-of-the-art methods on our new dataset splits, we
empirically find that they fail to generalize to queries with novel
combinations of seen words. We argue that the inherent structured semantics
inside the videos and language is the crucial factor to achieve compositional
generalization. Based on this insight, we propose a variational cross-graph
reasoning framework that explicitly decomposes video and language into
hierarchical semantic graphs, respectively, and learns fine-grained semantic
correspondence between the two graphs. Furthermore, we introduce a novel
adaptive structured semantics learning approach to derive
structure-informed and domain-generalizable graph representations, which
facilitate the fine-grained semantic correspondence reasoning between the two
graphs. Extensive experiments validate the superior compositional
generalizability of our approach.
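The core idea of learning fine-grained correspondence between a video graph and a language graph can be illustrated with a simple soft node-alignment score based on dot-product similarity. This is an illustrative simplification, not the paper's variational cross-graph reasoning framework; the node features are assumed to be precomputed embeddings of video-graph and language-graph nodes.

```python
# Sketch of fine-grained cross-graph correspondence via soft node alignment.
import torch
import torch.nn.functional as F

def cross_graph_alignment(video_nodes, lang_nodes):
    """video_nodes: (Nv, d), lang_nodes: (Nl, d) -> scalar alignment score."""
    v = F.normalize(video_nodes, dim=-1)
    l = F.normalize(lang_nodes, dim=-1)
    sim = v @ l.t()                        # (Nv, Nl) node-to-node similarities
    # Each language node is matched to its best video node ...
    l_to_v = sim.max(dim=0).values.mean()
    # ... and each video node to its best language node.
    v_to_l = sim.max(dim=1).values.mean()
    return 0.5 * (l_to_v + v_to_l)

score = cross_graph_alignment(torch.randn(8, 256), torch.randn(5, 256))
print(score.item())
```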
Text-based Localization of Moments in a Video Corpus
Prior works on text-based video moment localization focus on temporally
grounding the textual query in an untrimmed video. These works assume that the
relevant video is already known and attempt to localize the moment on that
relevant video only. Different from such works, we relax this assumption and
address the task of localizing moments in a corpus of videos for a given
sentence query. This task poses a unique challenge as the system is required to
perform: (i) retrieval of the relevant video where only a segment of the video
corresponds to the queried sentence, and (ii) temporal localization of the moment
in the relevant video based on the sentence query. To overcome this
challenge, we propose the Hierarchical Moment Alignment Network (HMAN), which learns
an effective joint embedding space for moments and sentences. In addition to
learning subtle differences between intra-video moments, HMAN focuses on
distinguishing inter-video global semantic concepts based on sentence queries.
Qualitative and quantitative results on three benchmark text-based video moment
retrieval datasets - Charades-STA, DiDeMo, and ActivityNet Captions -
demonstrate that our method achieves promising performance on the proposed task
of temporal localization of moments in a corpus of videos.
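The joint embedding idea described above, where matching moments are distinguished from both intra-video and inter-video negatives, can be sketched with a triplet ranking loss over cosine similarities. The encoders, feature dimensions, and margin below are illustrative assumptions, not HMAN's actual architecture or training objective.

```python
# Sketch of a joint moment-sentence embedding with a triplet ranking loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, moment_dim, sent_dim, common_dim=512):
        super().__init__()
        self.moment_fc = nn.Linear(moment_dim, common_dim)
        self.sent_fc = nn.Linear(sent_dim, common_dim)

    def forward(self, moment_feat, sent_feat):
        # Project both modalities into a shared, L2-normalized space.
        m = F.normalize(self.moment_fc(moment_feat), dim=-1)
        s = F.normalize(self.sent_fc(sent_feat), dim=-1)
        return m, s

def ranking_loss(sent, pos_moment, neg_intra, neg_inter, margin=0.2):
    """Triplet losses against intra-video and inter-video negative moments."""
    pos = (sent * pos_moment).sum(-1)
    loss_intra = F.relu(margin + (sent * neg_intra).sum(-1) - pos).mean()
    loss_inter = F.relu(margin + (sent * neg_inter).sum(-1) - pos).mean()
    return loss_intra + loss_inter

model = JointEmbedding(moment_dim=1024, sent_dim=300)
m_pos, s = model(torch.randn(4, 1024), torch.randn(4, 300))
m_neg_intra, _ = model(torch.randn(4, 1024), torch.randn(4, 300))
m_neg_inter, _ = model(torch.randn(4, 1024), torch.randn(4, 300))
print(ranking_loss(s, m_pos, m_neg_intra, m_neg_inter).item())
```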