Temporal Sentence Grounding in Videos: A Survey and Future Directions
Temporal sentence grounding in videos (TSGV), also known as natural language video
localization (NLVL) or video moment retrieval (VMR), aims to retrieve a
temporal moment that semantically corresponds to a language query from an
untrimmed video. Connecting computer vision and natural language, TSGV has
drawn significant attention from researchers in both communities. This survey
attempts to provide a summary of fundamental concepts in TSGV and current
research status, as well as future research directions. As background, we
present a common structure of functional components in TSGV, in a tutorial
style: from feature extraction from raw video and language query, to answer
prediction of the target moment. Then we review the techniques for multimodal
understanding and interaction, which is the key focus of TSGV for effective
alignment between the two modalities. We construct a taxonomy of TSGV
techniques and elaborate the methods in different categories with their
strengths and weaknesses. Lastly, we discuss issues with the current TSGV
research and share our insights about promising research directions. Comment: 29 pages, 32 figures, 9 tables
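The common component structure the survey describes (feature extraction from the raw video and the language query, cross-modal interaction, and answer prediction of the target moment) can be sketched as follows. This is a minimal illustration only, not a method from the survey; the module names, dimensions, and the span-style prediction head are assumptions made for the example.

```python
# Hypothetical sketch of the generic TSGV pipeline: encode video clips and query
# words, fuse them via cross-modal attention, then predict moment boundaries.
# All names and shapes are illustrative assumptions, not the survey's API.
import torch
import torch.nn as nn

class TSGVModel(nn.Module):
    def __init__(self, video_dim=1024, text_dim=300, hidden=256):
        super().__init__()
        # Feature extraction / encoding: project pre-extracted clip features
        # and word embeddings into a shared hidden space.
        self.video_proj = nn.Linear(video_dim, hidden)
        self.query_proj = nn.Linear(text_dim, hidden)
        # Multimodal interaction: each video clip attends to the query words.
        self.interaction = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
        # Answer prediction: per-clip start/end logits (one possible head;
        # proposal-ranking or regression heads are common alternatives).
        self.boundary_head = nn.Linear(hidden, 2)

    def forward(self, video_feats, query_feats):
        # video_feats: (B, T, video_dim) clip features; query_feats: (B, L, text_dim)
        v = self.video_proj(video_feats)
        q = self.query_proj(query_feats)
        fused, _ = self.interaction(v, q, q)
        # Start/end logits over the T clips define the predicted moment.
        return self.boundary_head(fused)  # (B, T, 2)

scores = TSGVModel()(torch.randn(2, 64, 1024), torch.randn(2, 12, 300))
print(scores.shape)  # torch.Size([2, 64, 2])
```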
DemaFormer: Damped Exponential Moving Average Transformer with Energy-Based Modeling for Temporal Language Grounding
Temporal Language Grounding seeks to localize video moments that semantically
correspond to a natural language query. Recent advances employ the attention
mechanism to learn the relations between video moments and the text query.
However, naive attention might not be able to appropriately capture such
relations, resulting in ineffective distributions where target video moments
are difficult to separate from the remaining ones. To resolve the issue, we
propose an energy-based model framework to explicitly learn moment-query
distributions. Moreover, we propose DemaFormer, a novel Transformer-based
architecture that utilizes exponential moving average with a learnable damping
factor to effectively encode moment-query inputs. Comprehensive experiments on
four public temporal language grounding datasets showcase the superiority of
our methods over the state-of-the-art baselines. Comment: Accepted at EMNLP 2023 (Findings)
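The abstract only states that DemaFormer applies an exponential moving average with a learnable damping factor to the moment-query inputs; the sketch below shows one simplified way such a damped EMA recurrence could look. It is an assumption for illustration, not the paper's actual formulation, which integrates the mechanism into a Transformer encoder.

```python
# Hedged sketch of a damped exponential moving average over a feature sequence.
# Simplified assumption: y_t = alpha * x_t + (1 - alpha) * delta * y_{t-1},
# with per-channel learnable smoothing (alpha) and damping (delta) factors.
import torch
import torch.nn as nn

class DampedEMA(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Unconstrained parameters, squashed to (0, 1) to keep the recurrence stable.
        self.alpha = nn.Parameter(torch.zeros(dim))  # smoothing weight per channel
        self.delta = nn.Parameter(torch.zeros(dim))  # learnable damping per channel

    def forward(self, x):
        # x: (B, T, dim) sequence of moment-query features
        alpha = torch.sigmoid(self.alpha)
        delta = torch.sigmoid(self.delta)
        state = torch.zeros_like(x[:, 0])
        outputs = []
        for t in range(x.size(1)):
            # Damped EMA recurrence over time steps.
            state = alpha * x[:, t] + (1.0 - alpha) * delta * state
            outputs.append(state)
        return torch.stack(outputs, dim=1)  # (B, T, dim)

out = DampedEMA(256)(torch.randn(2, 64, 256))
print(out.shape)  # torch.Size([2, 64, 256])
```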
UCF-Crime Annotation: A Benchmark for Surveillance Video-and-Language Understanding
Surveillance videos are an essential component of daily life with various
critical applications, particularly in public security. However, current
surveillance video tasks mainly focus on classifying and localizing anomalous
events. Although existing methods achieve considerable performance, they are
limited to detecting and classifying predefined events, with weak generalization
ability and limited semantic understanding. To address
this issue, we propose constructing the first multimodal surveillance video
dataset by manually annotating the real-world surveillance dataset UCF-Crime
with fine-grained event content and timing. Our newly annotated dataset, UCA
(UCF-Crime Annotation), provides a novel benchmark for multimodal surveillance
video analysis. It not only describes events with detailed captions but also
provides precise temporal grounding of the events at 0.1-second granularity. UCA
contains 20,822 sentences, with an average length of 23 words, and its
annotated videos are as long as 102 hours. Furthermore, we benchmark the
state-of-the-art models of multiple multimodal tasks on this newly created
dataset, including temporal sentence grounding in videos, video captioning, and
dense video captioning. Our experiments show that mainstream models developed on
previously available datasets perform poorly in multimodal surveillance video
scenarios, which highlights the necessity of constructing this dataset. Our
dataset and code are available at:
https://github.com/Xuange923/UCA-dataset
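Since UCA pairs each query sentence with a grounded segment timed at 0.1-second granularity, a typical temporal grounding evaluation computes temporal IoU between a predicted and an annotated segment. The snippet below is a hedged sketch of that computation; the JSON field names, the video identifier, and the example sentence are hypothetical and do not reflect the released annotation format.

```python
# Hedged sketch: temporal-IoU evaluation against a sentence-level annotation
# with sub-second timestamps. The annotation schema shown is an assumption
# for illustration, not the UCA release format.
import json

def temporal_iou(pred, gt):
    """IoU between two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# Hypothetical annotation entry: one query sentence with its grounded segment.
annotation = json.loads(
    '{"video": "Abuse001_x264", "sentence": "A man kicks a dog lying on the floor.",'
    ' "start": 12.3, "end": 18.7}'
)
prediction = (11.9, 19.2)  # model output: predicted (start, end) in seconds

iou = temporal_iou(prediction, (annotation["start"], annotation["end"]))
print(f"R@1 hit at IoU>=0.5: {iou >= 0.5} (IoU={iou:.2f})")
```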