4,607 research outputs found
Prompting Large Language Models to Reformulate Queries for Moment Localization
The task of moment localization is to localize a temporal moment in an
untrimmed video for a given natural language query. Since untrimmed video
contains highly redundant contents, the quality of the query is crucial for
accurately localizing moments, i.e., the query should provide precise
information about the target moment so that the localization model can
understand what to look for in the videos. However, the natural language
queries in current datasets may not be easy to understand for existing models.
For example, the Ego4D dataset uses question sentences as the query to describe
relatively complex moments. While being natural and straightforward for humans,
understanding such question sentences are challenging for mainstream moment
localization models like 2D-TAN. Inspired by the recent success of large
language models, especially their ability of understanding and generating
complex natural language contents, in this extended abstract, we make early
attempts at reformulating the moment queries into a set of instructions using
large language models and making them more friendly to the localization models.Comment: 4 pages, 2 figure
Spatio-temporal Video Re-localization by Warp LSTM
The need for efficiently finding the video content a user wants is increasing
because of the erupting of user-generated videos on the Web. Existing
keyword-based or content-based video retrieval methods usually determine what
occurs in a video but not when and where. In this paper, we make an answer to
the question of when and where by formulating a new task, namely
spatio-temporal video re-localization. Specifically, given a query video and a
reference video, spatio-temporal video re-localization aims to localize
tubelets in the reference video such that the tubelets semantically correspond
to the query. To accurately localize the desired tubelets in the reference
video, we propose a novel warp LSTM network, which propagates the
spatio-temporal information for a long period and thereby captures the
corresponding long-term dependencies. Another issue for spatio-temporal video
re-localization is the lack of properly labeled video datasets. Therefore, we
reorganize the videos in the AVA dataset to form a new dataset for
spatio-temporal video re-localization research. Extensive experimental results
show that the proposed model achieves superior performances over the designed
baselines on the spatio-temporal video re-localization task
Unsupervised Learning from Narrated Instruction Videos
We address the problem of automatically learning the main steps to complete a
certain task, such as changing a car tire, from a set of narrated instruction
videos. The contributions of this paper are three-fold. First, we develop a new
unsupervised learning approach that takes advantage of the complementary nature
of the input video and the associated narration. The method solves two
clustering problems, one in text and one in video, applied one after each other
and linked by joint constraints to obtain a single coherent sequence of steps
in both modalities. Second, we collect and annotate a new challenging dataset
of real-world instruction videos from the Internet. The dataset contains about
800,000 frames for five different tasks that include complex interactions
between people and objects, and are captured in a variety of indoor and outdoor
settings. Third, we experimentally demonstrate that the proposed method can
automatically discover, in an unsupervised manner, the main steps to achieve
the task and locate the steps in the input videos.Comment: Appears in: 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR 2016). 21 page
- …