Text-based Localization of Moments in a Video Corpus
Prior works on text-based video moment localization focus on temporally
grounding the textual query in an untrimmed video. These works assume that the
relevant video is already known and attempt to localize the moment on that
relevant video only. Different from such works, we relax this assumption and
address the task of localizing moments in a corpus of videos for a given
sentence query. This task poses a unique challenge as the system is required to
perform: (i) retrieval of the relevant video where only a segment of the video
corresponds with the queried sentence, and (ii) temporal localization of the moment
in the relevant video based on the sentence query. To overcome this
challenge, we propose the Hierarchical Moment Alignment Network (HMAN), which learns
an effective joint embedding space for moments and sentences. In addition to
learning subtle differences between intra-video moments, HMAN focuses on
distinguishing inter-video global semantic concepts based on sentence queries.
Qualitative and quantitative results on three benchmark text-based video moment
retrieval datasets - Charades-STA, DiDeMo, and ActivityNet Captions -
demonstrate that our method achieves promising performance on the proposed task
of temporal localization of moments in a corpus of videos.
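The two-part task described in this abstract (pick the right video, then the right moment inside it) can be illustrated with a toy nearest-neighbour search in a shared embedding space. This is a minimal sketch and not the HMAN model: the function names, the two-dimensional vectors, and the video identifiers are all hypothetical, and the embeddings are hand-written rather than learned.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def localize_in_corpus(query_vec, corpus):
    """Return (score, video_id, start_sec, end_sec) of the best-matching moment.

    `corpus` maps a video id to a list of (start_sec, end_sec, moment_vec)
    candidates. Scoring every candidate moment of every video against the
    query performs video retrieval and temporal localization in one pass.
    """
    best = None
    for video_id, moments in corpus.items():
        for start, end, vec in moments:
            score = cosine(query_vec, vec)
            if best is None or score > best[0]:
                best = (score, video_id, start, end)
    return best


# Hypothetical, hand-written embeddings purely for illustration.
corpus = {
    "video_a": [(0.0, 5.0, [1.0, 0.0]), (5.0, 10.0, [0.6, 0.8])],
    "video_b": [(0.0, 4.0, [0.0, 1.0])],
}
query = [0.0, 2.0]

score, video_id, start, end = localize_in_corpus(query, corpus)
print(video_id, start, end, score)
```

In this toy corpus the query points in the same direction as the single moment of `video_b`, so that moment is returned. A real system would obtain the vectors from trained video and sentence encoders and would likely use an approximate index rather than an exhaustive scan.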
MRTNet: Multi-Resolution Temporal Network for Video Sentence Grounding
Given an untrimmed video and natural language query, video sentence grounding
aims to localize the target temporal moment in the video. Existing methods
mainly tackle this task by matching and aligning semantics of the descriptive
sentence and video segments on a single temporal resolution, while neglecting
the temporal consistency of video content in different resolutions. In this
work, we propose a novel multi-resolution temporal video sentence grounding
network: MRTNet, which consists of a multi-modal feature encoder, a
Multi-Resolution Temporal (MRT) module, and a predictor module. The MRT module is
an encoder-decoder network, and the output features of its decoder are used in
conjunction with Transformers to predict the final start and end timestamps.
Particularly, our MRT module is hot-pluggable, which means it can be seamlessly
incorporated into any anchor-free models. Besides, we utilize a hybrid loss to
supervise cross-modal features in MRT module for more accurate grounding in
three scales: frame-level, clip-level and sequence-level. Extensive experiments
on three prevalent datasets have shown the effectiveness of MRTNet.
Comment: work in progress
UCF-Crime Annotation: A Benchmark for Surveillance Video-and-Language Understanding
Surveillance videos are an essential component of daily life with various
critical applications, particularly in public security. However, current
surveillance video tasks mainly focus on classifying and localizing anomalous
events. Existing methods are limited to detecting and classifying the
predefined events with unsatisfactory generalization ability and semantic
understanding, although they have obtained considerable performance. To address
this issue, we propose constructing the first multimodal surveillance video
dataset by manually annotating the real-world surveillance dataset UCF-Crime
with fine-grained event content and timing. Our newly annotated dataset, UCA
(UCF-Crime Annotation), provides a novel benchmark for multimodal surveillance
video analysis. It not only describes events in detail but also
provides precise temporal grounding of the events in 0.1-second intervals. UCA
contains 20,822 sentences, with an average length of 23 words, and its
annotated videos total 102 hours. Furthermore, we benchmark the
state-of-the-art models of multiple multimodal tasks on this newly created
dataset, including temporal sentence grounding in videos, video captioning, and
dense video captioning. Through our experiments, we found that mainstream
models used in previously publicly available datasets perform poorly on
multimodal surveillance video scenarios, which highlights the necessity of
constructing this dataset. The link to our dataset and code is provided at:
https://github.com/Xuange923/UCA-dataset
Temporal Sentence Grounding in Videos: A Survey and Future Directions
Temporal sentence grounding in videos (TSGV), also known as natural language video
localization (NLVL) or video moment retrieval (VMR), aims to retrieve a
temporal moment that semantically corresponds to a language query from an
untrimmed video. Connecting computer vision and natural language, TSGV has
drawn significant attention from researchers in both communities. This survey
attempts to provide a summary of fundamental concepts in TSGV and current
research status, as well as future research directions. As the background, we
present a common structure of functional components in TSGV, in a tutorial
style: from feature extraction from raw video and language query, to answer
prediction of the target moment. Then we review the techniques for multimodal
understanding and interaction, which is the key focus of TSGV for effective
alignment between the two modalities. We construct a taxonomy of TSGV
techniques and elaborate the methods in different categories with their
strengths and weaknesses. Lastly, we discuss issues with the current TSGV
research and share our insights about promising research directions.
Comment: 29 pages, 32 figures, 9 tables