VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval
Video Moment Retrieval (VMR) is a task to localize the temporal moment in an
untrimmed video specified by a natural language query. For VMR, several methods
that require full supervision for training have been proposed. Unfortunately,
acquiring a large number of training videos with labeled temporal boundaries
for each query is a labor-intensive process. This paper explores methods for
performing VMR in a weakly-supervised manner (wVMR): training is performed
without temporal moment labels but only with the text query that describes a
segment of the video. Existing methods on wVMR generate multi-scale proposals
and apply query-guided attention mechanisms to highlight the most relevant
proposal. To leverage the weak supervision, contrastive learning is used which
predicts higher scores for the correct video-query pairs than for the incorrect
pairs. It has been observed that a large number of candidate proposals, coarse
query representation, and one-way attention mechanism lead to blurry attention
maps which limit the localization performance. To handle this issue,
Video-Language Alignment Network (VLANet) is proposed that learns sharper
attention by pruning out spurious candidate proposals and applying a
multi-directional attention mechanism with fine-grained query representation.
The Surrogate Proposal Selection module selects a proposal based on the
proximity to the query in the joint embedding space, and thus substantially
reduces candidate proposals which leads to lower computation load and sharper
attention. Next, the Cascaded Cross-modal Attention module considers dense
feature interactions and multi-directional attention flow to learn the
multi-modal alignment. VLANet is trained end-to-end using a contrastive loss
that pulls semantically similar videos and queries closer in the embedding
space. The experiments show that the method achieves state-of-the-art
performance on the Charades-STA and DiDeMo datasets.

Comment: 16 pages, 6 figures, European Conference on Computer Vision, 202
Generation-Guided Multi-Level Unified Network for Video Grounding
Video grounding aims to locate the timestamps best matching the query
description within an untrimmed video. Prevalent methods can be divided into
moment-level and clip-level frameworks. Moment-level approaches directly
predict the probability of each transient moment to be the boundary in a global
perspective, and they usually perform better in coarse grounding. On the other
hand, clip-level ones aggregate the moments in different time windows into
proposals and then deduce the most similar one, leading to its advantage in
fine-grained grounding. In this paper, we propose a multi-level unified
framework to enhance performance by leveraging the merits of both moment-level
and clip-level methods. Moreover, a novel generation-guided paradigm in both
levels is adopted. It introduces a multi-modal generator to produce the
implicit boundary feature and clip feature, later regarded as queries to
calculate the boundary scores by a discriminator. The generation-guided
solution elevates video grounding from a matching task between two distinct
modalities to a cross-modal attention task, which steps outside the previous
frameworks and obtains notable gains. The proposed Generation-guided
Multi-level Unified network (GMU) surpasses previous methods and reaches
state-of-the-art performance on various benchmarks with disparate features,
e.g., Charades-STA and ActivityNet Captions.
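The generation-guided idea, where a generated boundary feature acts as a query to score each transient moment, can be sketched with simple dot-product attention. The feature dimensions, the attention scoring function, and the toy generator output are all assumptions for illustration; the paper's actual generator and discriminator are learned modules.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def boundary_scores(moment_feats, gen_query):
    """Score each transient moment against a generated boundary query
    via scaled dot-product attention (an assumed scoring scheme)."""
    logits = moment_feats @ gen_query / np.sqrt(gen_query.size)
    return softmax(logits)

rng = np.random.default_rng(1)
T, d = 8, 32                          # 8 moments, 32-dim features (assumed)
moments = rng.normal(size=(T, d))
# Stand-in for the generator: its output lies near the true boundary moment.
gen_q = moments[3] + 0.05 * rng.normal(size=d)

scores = boundary_scores(moments, gen_q)
best = int(scores.argmax())           # peaks at the moment nearest the query
```

The discriminator's job is then to sharpen this score distribution so the highest-scoring moment coincides with the true boundary, combining the global view of moment-level methods with the proposal comparison of clip-level ones.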
Temporal Sentence Grounding in Videos: A Survey and Future Directions
Temporal sentence grounding in videos (TSGV), a.k.a. natural language video
localization (NLVL) or video moment retrieval (VMR), aims to retrieve a
temporal moment that semantically corresponds to a language query from an
untrimmed video. Connecting computer vision and natural language, TSGV has
drawn significant attention from researchers in both communities. This survey
attempts to provide a summary of fundamental concepts in TSGV and current
research status, as well as future research directions. As the background, we
present a common structure of functional components in TSGV, in a tutorial
style: from feature extraction from raw video and language query, to answer
prediction of the target moment. Then we review the techniques for multimodal
understanding and interaction, which is the key focus of TSGV for effective
alignment between the two modalities. We construct a taxonomy of TSGV
techniques and elaborate the methods in different categories with their
strengths and weaknesses. Lastly, we discuss issues with the current TSGV
research and share our insights about promising research directions.

Comment: 29 pages, 32 figures, 9 tables