Temporal Sentence Grounding in Videos: A Survey and Future Directions
Temporal sentence grounding in videos (TSGV), a.k.a. natural language video
localization (NLVL) or video moment retrieval (VMR), aims to retrieve a
temporal moment that semantically corresponds to a language query from an
untrimmed video. Connecting computer vision and natural language, TSGV has
drawn significant attention from researchers in both communities. This survey
attempts to provide a summary of fundamental concepts in TSGV and current
research status, as well as future research directions. As the background, we
present a common structure of functional components in TSGV, in a tutorial
style: from feature extraction from raw video and language query, to answer
prediction of the target moment. Then we review the techniques for multimodal
understanding and interaction, which is the key focus of TSGV for effective
alignment between the two modalities. We construct a taxonomy of TSGV
techniques and elaborate the methods in different categories with their
strengths and weaknesses. Lastly, we discuss issues with the current TSGV
research and share our insights about promising research directions.
Comment: 29 pages, 32 figures, 9 tables
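As a rough illustration of the functional pipeline the survey outlines (feature extraction from the raw video and query, multimodal interaction, then answer prediction), the following PyTorch sketch wires minimal stand-ins for each component. Every design choice here (the GRU text encoder, the cross-attention fusion, the per-clip span head) is an assumption for illustration, not an architecture prescribed by the survey.

```python
# Minimal sketch of a generic TSGV pipeline:
# feature projection -> multimodal interaction -> moment prediction.
import torch
import torch.nn as nn

class TSGVPipeline(nn.Module):
    def __init__(self, vis_dim=1024, txt_dim=300, hid=256):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid)    # pre-extracted clip features -> shared space
        self.txt_enc = nn.GRU(txt_dim, hid, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hid, num_heads=4, batch_first=True)
        self.span_head = nn.Linear(hid, 2)         # per-clip start/end logits

    def forward(self, vis_feats, query_embs):
        # vis_feats: (B, T, vis_dim) clip features; query_embs: (B, L, txt_dim) word embeddings
        v = self.vis_proj(vis_feats)
        q, _ = self.txt_enc(query_embs)
        # Multimodal interaction: video clips attend to query words.
        fused, _ = self.cross_attn(v, q, q)
        start, end = self.span_head(fused).unbind(dim=-1)
        return start, end                          # argmax over T yields the predicted moment

model = TSGVPipeline()
s, e = model(torch.randn(2, 64, 1024), torch.randn(2, 12, 300))
```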
Commonsense for Zero-Shot Natural Language Video Localization
Zero-shot Natural Language-Video Localization (NLVL) methods have exhibited
promising results in training NLVL models exclusively with raw video data by
dynamically generating video segments and pseudo-query annotations. However,
existing pseudo-queries often lack grounding in the source video, resulting in
unstructured and disjointed content. In this paper, we investigate the
effectiveness of commonsense reasoning in zero-shot NLVL. Specifically, we
present CORONET, a zero-shot NLVL framework that leverages commonsense to
bridge the gap between videos and generated pseudo-queries via a commonsense
enhancement module. CORONET employs Graph Convolutional Networks (GCNs) to
encode commonsense information extracted from a knowledge graph, conditioned
on the video, and cross-attention mechanisms to enhance the encoded video and
pseudo-query representations prior to localization. Through empirical
evaluations on two benchmark datasets, we demonstrate that CORONET surpasses
both zero-shot and weakly supervised baselines, achieving improvements of up
to 32.13% across various recall thresholds and up to 6.33% in mIoU. These results
underscore the significance of leveraging commonsense reasoning for zero-shot
NLVL.
Comment: Accepted to AAAI 202
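The abstract's core mechanism (a GCN over knowledge-graph nodes plus cross-attention into the video stream) can be sketched as below. The single-layer GCN, the residual enhancement, and all names are illustrative assumptions; CORONET's published design may differ.

```python
# Hedged sketch of commonsense enhancement: a small GCN encodes
# knowledge-graph nodes; cross-attention lets video features attend to them.
import torch
import torch.nn as nn

class CommonsenseEnhancer(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.gcn_w = nn.Linear(dim, dim)           # one graph-convolution layer
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video, kg_nodes, adj):
        # video:    (B, T, dim) encoded clip features
        # kg_nodes: (B, N, dim) commonsense node embeddings from a knowledge graph
        # adj:      (B, N, N)  row-normalized adjacency (with self-loops)
        cs = torch.relu(self.gcn_w(adj @ kg_nodes))  # GCN: aggregate neighbors, transform
        enhanced, _ = self.attn(video, cs, cs)       # video queries the commonsense memory
        return video + enhanced                      # residual enhancement

B, T, N, D = 2, 64, 20, 256
adj = torch.softmax(torch.randn(B, N, N), dim=-1)    # stand-in normalized adjacency
out = CommonsenseEnhancer()(torch.randn(B, T, D), torch.randn(B, N, D), adj)
```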
Frame-wise Cross-modal Matching for Video Moment Retrieval
Video moment retrieval aims to retrieve a moment in a video for a given
language query. The challenges of this task include 1) the requirement of
localizing the relevant moment in an untrimmed video, and 2) bridging the
semantic gap between the textual query and video content. To tackle these
problems, early approaches adopt sliding windows or uniform sampling to
collect video clips first and then match each clip against the query. These
strategies are time-consuming and often yield unsatisfactory localization
accuracy because the length of the ground-truth moment is unpredictable. To
avoid these limitations, researchers have recently attempted to predict the
relevant moment boundaries directly, without generating video clips first.
One mainstream approach generates a multimodal feature vector from the target
query and video frames (e.g., by concatenation) and then applies regression
to this vector for boundary detection. Although this approach has achieved
some progress, we argue that such methods do not fully capture the
cross-modal interactions between the query and video frames.
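The concatenate-then-regress scheme criticized in this paragraph admits a minimal sketch: pool a sentence-level query vector, tile it across the frame features, concatenate, and regress normalized boundaries. Dimensions and names are assumptions for illustration.

```python
# Hedged sketch of the early-fusion regression baseline for boundary detection.
import torch
import torch.nn as nn

class FusionRegressor(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=512, hid=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, hid), nn.ReLU(),
            nn.Linear(hid, 2))                     # normalized (start, end)

    def forward(self, frames, query_vec):
        # frames: (B, T, vis_dim); query_vec: (B, txt_dim) sentence-level embedding
        q = query_vec.unsqueeze(1).expand(-1, frames.size(1), -1)
        fused = torch.cat([frames, q], dim=-1)     # simple concatenation fusion
        span = self.mlp(fused).mean(dim=1)         # pool per-frame predictions
        return torch.sigmoid(span)                 # (B, 2) boundaries in [0, 1]

pred = FusionRegressor()(torch.randn(2, 64, 512), torch.randn(2, 512))
```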
In this paper, we propose an Attentive Cross-modal Relevance Matching (ACRM)
model that predicts temporal boundaries based on interaction modeling. In
addition, an attention module is introduced to assign higher weights to query
words with richer semantic cues, which are considered more important for
finding relevant video content. Another contribution is an additional
predictor that utilizes internal frames during model training to improve
localization accuracy. Extensive experiments on two datasets, TACoS and
Charades-STA, demonstrate the superiority of our method over several
state-of-the-art methods. Ablation studies have also been conducted to
examine the effectiveness of the different modules in our ACRM model.
Comment: 12 pages; accepted by IEEE TM
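One plausible reading of the word-level attention described above: score each query word against each frame, softmax over words so that cue-rich words dominate, and pool a per-frame query summary for relevance matching. This is a hedged sketch of the idea, not ACRM's published implementation.

```python
# Hedged sketch of frame-wise matching with attention over query words.
import torch
import torch.nn as nn

class WordAttentionMatcher(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, dim, bias=False)

    def forward(self, frames, words):
        # frames: (B, T, dim); words: (B, L, dim)
        logits = self.score(frames) @ words.transpose(1, 2)  # (B, T, L) frame-word relevance
        weights = torch.softmax(logits, dim=-1)              # emphasize cue-rich words
        per_frame_query = weights @ words                    # (B, T, dim) query summary per frame
        return torch.cosine_similarity(frames, per_frame_query, dim=-1)  # (B, T) scores

scores = WordAttentionMatcher()(torch.randn(2, 64, 256), torch.randn(2, 12, 256))
```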
Temporal Sentence Grounding in Streaming Videos
This paper aims to tackle a novel task: Temporal Sentence Grounding in
Streaming Videos (TSGSV). The goal of TSGSV is to evaluate the relevance
between a video stream and a given sentence query. Unlike regular videos,
streaming videos are acquired continuously from a particular source and must
often be processed on the fly in applications such as surveillance and
live-stream analysis. TSGSV is thus challenging: the model must infer without
access to future frames and process long frame histories effectively,
requirements untouched by earlier methods. To specifically
address the above challenges, we propose two novel methods: (1) a TwinNet
structure that enables the model to learn about upcoming events; and (2) a
language-guided feature compressor that eliminates redundant visual frames and
reinforces the frames that are relevant to the query. We conduct extensive
experiments using ActivityNet Captions, TACoS, and MAD datasets. The results
demonstrate the superiority of our proposed methods. A systematic ablation
study also confirms their effectiveness.
Comment: Accepted by ACM MM 202
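A hedged sketch of what a language-guided feature compressor could look like: score each historical frame against the query and keep only the k most relevant frames, preserving temporal order. The top-k selection rule and all names are assumptions; the paper's compressor may use a different mechanism.

```python
# Hedged sketch of language-guided compression of a streaming frame history.
import torch
import torch.nn as nn

class LanguageGuidedCompressor(nn.Module):
    def __init__(self, dim=256, keep=32):
        super().__init__()
        self.keep = keep
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, history, query_vec):
        # history: (B, T, dim) accumulated stream features; query_vec: (B, dim)
        scores = (self.proj(history) * query_vec.unsqueeze(1)).sum(-1)  # (B, T) relevance
        k = min(self.keep, history.size(1))
        idx = scores.topk(k, dim=1).indices.sort(dim=1).values          # keep temporal order
        return history.gather(1, idx.unsqueeze(-1).expand(-1, -1, history.size(-1)))

kept = LanguageGuidedCompressor()(torch.randn(2, 500, 256), torch.randn(2, 256))
```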