Temporal Sentence Grounding in Videos: A Survey and Future Directions
Temporal sentence grounding in videos (TSGV), a.k.a. natural language video
localization (NLVL) or video moment retrieval (VMR), aims to retrieve a
temporal moment that semantically corresponds to a language query from an
untrimmed video. Connecting computer vision and natural language, TSGV has
drawn significant attention from researchers in both communities. This survey
summarizes fundamental concepts in TSGV, the current research status, and
future research directions. As background, we
present a common structure of functional components in TSGV, in a tutorial
style: from feature extraction from raw video and language query, to answer
prediction of the target moment. Then we review the techniques for multimodal
understanding and interaction, which is the key focus of TSGV for effective
alignment between the two modalities. We construct a taxonomy of TSGV
techniques and elaborate the methods in different categories with their
strengths and weaknesses. Lastly, we discuss issues with the current TSGV
research and share our insights about promising research directions.
Comment: 29 pages, 32 figures, 9 tables
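To make the pipeline above concrete, here is a minimal PyTorch sketch of the component structure the survey describes: feature extraction from video and query, multimodal interaction, and answer prediction. All module names, feature dimensions (the 4096-d clip features, 300-d word embeddings) and the span-based prediction head are illustrative assumptions, not taken from the survey:

```python
import torch
import torch.nn as nn

class TSGVPipeline(nn.Module):
    """Illustrative TSGV skeleton: feature extraction -> multimodal
    interaction -> moment (start/end) prediction. Hypothetical design."""

    def __init__(self, d_model=256):
        super().__init__()
        # Project pre-extracted features (e.g., C3D clips, GloVe tokens)
        # into a shared space; real systems use stronger backbones.
        self.video_proj = nn.Linear(4096, d_model)   # assumed C3D dim
        self.query_proj = nn.Linear(300, d_model)    # assumed GloVe dim
        # Cross-modal interaction: each video clip attends to query tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4,
                                                batch_first=True)
        # Span-based answer head: per-clip start/end logits.
        self.span_head = nn.Linear(d_model, 2)

    def forward(self, video_feats, query_feats):
        v = self.video_proj(video_feats)   # (B, T, d)
        q = self.query_proj(query_feats)   # (B, L, d)
        fused, _ = self.cross_attn(v, q, q)
        start_logits, end_logits = self.span_head(fused).unbind(dim=-1)
        return start_logits, end_logits   # argmax over T gives the moment

video = torch.randn(2, 128, 4096)  # 2 videos, 128 clips each
query = torch.randn(2, 12, 300)    # 12 query tokens
start, end = TSGVPipeline()(video, query)
```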
MINOTAUR: Multi-task Video Grounding From Multimodal Queries
Video understanding tasks take many forms, from action detection to visual
query localization and spatio-temporal grounding of sentences. These tasks
differ in their inputs (video only, or a video-query pair where the query is an
image region or a sentence) and outputs (temporal segments or spatio-temporal
tubes). However, at their core they require the same fundamental understanding
of the video, i.e., the actors and objects in it, their actions and
interactions. So far these tasks have been tackled in isolation with
individual, highly specialized architectures, which do not exploit the
interplay between tasks. In contrast, in this paper, we present a single,
unified model for tackling query-based video understanding in long-form videos.
In particular, our model can address all three tasks of the Ego4D Episodic
Memory benchmark, which entail queries of three different forms: given an
egocentric video and a visual, textual or activity query, the goal is to
determine when and where the answer can be seen within the video. Our model
design is inspired by recent query-based approaches to spatio-temporal
grounding, and contains modality-specific query encoders and task-specific
sliding window inference that allow multi-task training with diverse input
modalities and different structured outputs. We exhaustively analyze
relationships among the tasks and illustrate that cross-task learning leads to
improved performance on each individual task, as well as the ability to
generalize to unseen tasks, such as zero-shot spatial localization of language
queries.
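A rough sketch of the unified-model idea from this abstract: one shared video encoder, modality-specific query encoders, and sliding-window inference over long videos. Everything here (class names, feature dimensions, the bilinear scoring head) is a hypothetical stand-in, not the MINOTAUR implementation:

```python
import torch
import torch.nn as nn

class MultiQueryGrounder(nn.Module):
    """Sketch of a unified grounding model: one shared video encoder,
    modality-specific query encoders (visual / textual / activity).
    Hypothetical; not the authors' architecture."""

    def __init__(self, d=256):
        super().__init__()
        self.video_encoder = nn.GRU(512, d, batch_first=True)  # assumed dims
        self.query_encoders = nn.ModuleDict({
            "visual": nn.Linear(2048, d),      # image-region feature
            "text": nn.Linear(768, d),         # sentence embedding
            "activity": nn.Embedding(100, d),  # activity-class id
        })
        self.score_head = nn.Bilinear(d, d, 1)

    def forward(self, video_feats, query, modality):
        frames, _ = self.video_encoder(video_feats)   # (B, T, d)
        q = self.query_encoders[modality](query)      # (B, d)
        # Score every frame against the query embedding.
        scores = self.score_head(frames, q.unsqueeze(1).expand_as(frames))
        return scores.squeeze(-1)                     # (B, T)

def sliding_window_inference(model, video_feats, query, modality,
                             win=64, stride=32):
    """Sliding-window inference over a long video, keeping the best
    score each frame receives from any window."""
    T = video_feats.size(1)
    starts = list(range(0, max(T - win, 0) + 1, stride))
    if T > win and starts[-1] != T - win:
        starts.append(T - win)        # cover the tail of the video
    scores = torch.full(video_feats.shape[:2], float("-inf"))
    for s in starts:
        w = video_feats[:, s:s + win]
        scores[:, s:s + win] = torch.maximum(
            scores[:, s:s + win], model(w, query, modality))
    return scores

grounder = MultiQueryGrounder()
video = torch.randn(1, 300, 512)   # a long, 300-frame video
text_q = torch.randn(1, 768)       # a sentence query embedding
frame_scores = sliding_window_inference(grounder, video, text_q, "text")
```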
TubeDETR: Spatio-Temporal Video Grounding with Transformers
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query. This is a challenging task that requires the joint and efficient modeling of temporal, spatial and multi-modal interactions. To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection. Our model notably includes: (i) an efficient video and text encoder that models spatial multi-modal interactions over sparsely sampled frames and (ii) a space-time decoder that jointly performs spatio-temporal localization. We demonstrate the advantage of our proposed components through an extensive ablation study. We also evaluate our full approach on the spatio-temporal video grounding task and demonstrate improvements over the state of the art on the challenging VidSTG and HC-STVG benchmarks.
Comment: Updated vIoU results compared to the CVPR'22 camera-ready version. Code and trained models are publicly available at https://antoyang.github.io/tubedetr.html
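The decoder idea can be sketched as follows: learned per-frame time queries attend over visual-text memory computed from sparsely sampled frames, then predict a box per frame plus temporal start/end logits. This is an illustrative toy with assumed shapes and heads, not the TubeDETR code (which is available at the URL above):

```python
import torch
import torch.nn as nn

class SpaceTimeDecoderSketch(nn.Module):
    """Toy analogue of a text-conditioned space-time decoder: learned
    time queries attend over per-frame visual-text memory, then predict
    a box per frame plus start/end probabilities. Illustrative only."""

    def __init__(self, d=256, max_frames=200):
        super().__init__()
        self.time_queries = nn.Embedding(max_frames, d)
        layer = nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.box_head = nn.Linear(d, 4)    # (cx, cy, w, h) per frame
        self.time_head = nn.Linear(d, 2)   # start/end logits per frame

    def forward(self, memory, num_frames):
        # memory: (B, T*S, d) visual-text features from the encoder,
        # computed over sparsely sampled frames for efficiency.
        B = memory.size(0)
        q = self.time_queries.weight[:num_frames].expand(B, -1, -1)
        h = self.decoder(q, memory)                # (B, T, d)
        boxes = self.box_head(h).sigmoid()         # per-frame boxes
        start_logits, end_logits = self.time_head(h).unbind(-1)
        return boxes, start_logits, end_logits

memory = torch.randn(2, 10 * 49, 256)  # 10 sampled frames of 7x7 tokens
boxes, s, e = SpaceTimeDecoderSketch()(memory, num_frames=10)
```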