Temporal Sentence Grounding in Videos: A Survey and Future Directions
Temporal sentence grounding in videos (TSGV), also known as natural language video
localization (NLVL) or video moment retrieval (VMR), aims to retrieve a
temporal moment that semantically corresponds to a language query from an
untrimmed video. Connecting computer vision and natural language, TSGV has
drawn significant attention from researchers in both communities. This survey
attempts to provide a summary of fundamental concepts in TSGV and current
research status, as well as future research directions. As background, we
present a common structure of functional components in TSGV, in a tutorial
style: from feature extraction from raw video and language query, to answer
prediction of the target moment. Then we review the techniques for multimodal
understanding and interaction, which is the key focus of TSGV for effective
alignment between the two modalities. We construct a taxonomy of TSGV
techniques and elaborate the methods in different categories with their
strengths and weaknesses. Lastly, we discuss issues with the current TSGV
research and share our insights about promising research directions.
Comment: 29 pages, 32 figures, 9 table
Frame-wise Cross-modal Matching for Video Moment Retrieval
Video moment retrieval aims to retrieve a moment in a video for a given
language query. The challenges of this task include 1) the requirement of
localizing the relevant moment in an untrimmed video, and 2) bridging the
semantic gap between textual query and video contents. To tackle these
problems, early approaches adopt sliding windows or uniform sampling to
collect video clips first and then match each clip with the query. These
strategies are time-consuming and often yield unsatisfactory localization
accuracy because the length of the ground-truth moment is unpredictable. To
avoid these limitations, recent work attempts to directly predict the relevant
moment boundaries without first generating video clips. One
mainstream approach is to generate a multimodal feature vector for the target
query and video frames (e.g., concatenation) and then use a regression approach
upon the multimodal feature vector for boundary detection. Although this
approach has achieved some progress, we argue that these methods do not
adequately capture the cross-modal interactions between the query and video
frames.
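The early sliding-window strategy described above can be illustrated with a minimal sketch. It is not taken from any of the listed papers: the function names, the mean-pooling of frame features, and the cosine-similarity matching are all assumptions chosen for illustration, and the toy feature vectors stand in for real video and query encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def mean_pool(frames):
    """Average a list of frame feature vectors into one clip feature."""
    return [sum(col) / len(frames) for col in zip(*frames)]


def sliding_window_grounding(frame_feats, query_feat, window_sizes, stride=1):
    """Score every candidate window against the query; return the best one.

    Returns (start, end, score) with `end` exclusive, or None when no
    window fits. All names here are hypothetical, for illustration only.
    """
    best = None
    for width in window_sizes:
        for start in range(0, len(frame_feats) - width + 1, stride):
            clip_feat = mean_pool(frame_feats[start:start + width])
            score = cosine(clip_feat, query_feat)
            if best is None or score > best[2]:
                best = (start, start + width, score)
    return best


# Toy example: 2-d features, the query matches frames 2 and 3.
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
query = [0.0, 1.0]
start, end, score = sliding_window_grounding(frames, query, window_sizes=(2,))
```

The nested loops make the cost visible: every window width multiplies the number of clips to score, and a moment whose length is not among the chosen widths can only be localized approximately.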
In this paper, we propose an Attentive Cross-modal Relevance Matching (ACRM)
model, which predicts the temporal boundaries based on interaction modeling.
In addition, an attention module is introduced to assign higher weights to
query words with richer semantic cues, which are considered to be more
important for finding relevant video contents. Another contribution is an
additional predictor that uses the internal frames during model training to
improve the localization accuracy. Extensive experiments on two datasets,
TACoS and Charades-STA, demonstrate the superiority of our method over
several state-of-the-art methods. Ablation studies have also been conducted to
examine the effectiveness of different modules in our ACRM model.
Comment: 12 pages; accepted by IEEE TM
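The idea of assigning higher weights to more informative query words, mentioned in the ACRM abstract above, can be sketched as a generic softmax attention over word features. The abstract does not specify how ACRM computes its weights, so this is only an assumed, simplified form: `scoring_vector` is a hypothetical stand-in for learned parameters, and the word features are toy values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_query(word_feats, scoring_vector):
    """Weight query words by a scalar relevance score and pool them.

    `scoring_vector` is a hypothetical learned parameter; each word's
    score is its dot product with that vector.
    Returns (weights, pooled_query_feature).
    """
    scores = [sum(w * s for w, s in zip(feat, scoring_vector))
              for feat in word_feats]
    weights = softmax(scores)
    dim = len(word_feats[0])
    pooled = [sum(weights[i] * word_feats[i][d] for i in range(len(word_feats)))
              for d in range(dim)]
    return weights, pooled


# Toy query of three words; the second word carries the strongest cue.
word_feats = [[0.1, 0.0], [2.0, 0.0], [0.0, 0.1]]
scoring_vector = [1.0, 0.0]
weights, pooled = attend_query(word_feats, scoring_vector)
```

The weights sum to one, so the pooled query feature is a convex combination of the word features, dominated by the words that score highest.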
Temporal Sentence Grounding in Streaming Videos
This paper aims to tackle a novel task: Temporal Sentence Grounding in
Streaming Videos (TSGSV). The goal of TSGSV is to evaluate the relevance
between a video stream and a given sentence query. Unlike regular videos,
streaming videos are acquired continuously from a particular source and often
need to be processed on the fly in applications such as surveillance and
live-stream analysis. TSGSV is therefore challenging: the model must infer
without future frames and process long historical frames effectively, a setting
that earlier methods have not addressed. To specifically
address the above challenges, we propose two novel methods: (1) a TwinNet
structure that enables the model to learn about upcoming events; and (2) a
language-guided feature compressor that eliminates redundant visual frames and
reinforces the frames that are relevant to the query. We conduct extensive
experiments using ActivityNet Captions, TACoS, and MAD datasets. The results
demonstrate the superiority of our proposed methods. A systematic ablation
study also confirms their effectiveness.
Comment: Accepted by ACM MM 202
Generation-Guided Multi-Level Unified Network for Video Grounding
Video grounding aims to locate the timestamps best matching the query
description within an untrimmed video. Prevalent methods can be divided into
moment-level and clip-level frameworks. Moment-level approaches directly
predict the probability of each transient moment being a boundary from a global
perspective, and they usually perform better in coarse grounding. Clip-level
approaches, on the other hand, aggregate the moments in different time windows
into proposals and then identify the most similar one, which gives them an
advantage in
fine-grained grounding. In this paper, we propose a multi-level unified
framework to enhance performance by leveraging the merits of both moment-level
and clip-level methods. Moreover, a novel generation-guided paradigm in both
levels is adopted. It introduces a multi-modal generator to produce the
implicit boundary feature and clip feature, later regarded as queries to
calculate the boundary scores by a discriminator. The generation-guided
solution recasts video grounding from a matching task between two separate
modalities as a cross-modal attention task, which departs from the previous
framework and obtains notable gains. The proposed Generation-guided Multi-level
Unified network (GMU) surpasses previous methods and reaches state-of-the-art
performance on various benchmarks with disparate features, e.g., Charades-STA,
ActivityNet Captions
- …