238 research outputs found
Unsupervised Learning from Narrated Instruction Videos
We address the problem of automatically learning the main steps to complete a
certain task, such as changing a car tire, from a set of narrated instruction
videos. The contributions of this paper are three-fold. First, we develop a new
unsupervised learning approach that takes advantage of the complementary nature
of the input video and the associated narration. The method solves two
clustering problems, one in text and one in video, applied one after each other
and linked by joint constraints to obtain a single coherent sequence of steps
in both modalities. Second, we collect and annotate a new challenging dataset
of real-world instruction videos from the Internet. The dataset contains about
800,000 frames for five different tasks that include complex interactions
between people and objects, and are captured in a variety of indoor and outdoor
settings. Third, we experimentally demonstrate that the proposed method can
automatically discover, in an unsupervised manner, the main steps to achieve
the task and locate the steps in the input videos.Comment: Appears in: 2016 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR 2016). 21 page
TagBook: A Semantic Video Representation without Supervision for Event Detection
We consider the problem of event detection in video for scenarios where only
few, or even zero examples are available for training. For this challenging
setting, the prevailing solutions in the literature rely on a semantic video
representation obtained from thousands of pre-trained concept detectors.
Different from existing work, we propose a new semantic video representation
that is based on freely available social tagged videos only, without the need
for training any intermediate concept detectors. We introduce a simple
algorithm that propagates tags from a video's nearest neighbors, similar in
spirit to the ones used for image retrieval, but redesign it for video event
detection by including video source set refinement and varying the video tag
assignment. We call our approach TagBook and study its construction,
descriptiveness and detection performance on the TRECVID 2013 and 2014
multimedia event detection datasets and the Columbia Consumer Video dataset.
Despite its simple nature, the proposed TagBook video representation is
remarkably effective for few-example and zero-example event detection, even
outperforming very recent state-of-the-art alternatives building on supervised
representations.Comment: accepted for publication as a regular paper in the IEEE Transactions
on Multimedi
VidChapters-7M: Video Chapters at Scale
Segmenting long videos into chapters enables users to quickly navigate to the
information of their interest. This important topic has been understudied due
to the lack of publicly released datasets. To address this issue, we present
VidChapters-7M, a dataset of 817K user-chaptered videos including 7M chapters
in total. VidChapters-7M is automatically created from videos online in a
scalable manner by scraping user-annotated chapters and hence without any
additional manual annotation. We introduce the following three tasks based on
this data. First, the video chapter generation task consists of temporally
segmenting the video and generating a chapter title for each segment. To
further dissect the problem, we also define two variants of this task: video
chapter generation given ground-truth boundaries, which requires generating a
chapter title given an annotated video segment, and video chapter grounding,
which requires temporally localizing a chapter given its annotated title. We
benchmark both simple baselines and state-of-the-art video-language models for
these three tasks. We also show that pretraining on VidChapters-7M transfers
well to dense video captioning tasks in both zero-shot and finetuning settings,
largely improving the state of the art on the YouCook2 and ViTT benchmarks.
Finally, our experiments reveal that downstream performance scales well with
the size of the pretraining dataset. Our dataset, code, and models are publicly
available at https://antoyang.github.io/vidchapters.html.Comment: Accepted at NeurIPS 2023 Track on Datasets and Benchmarks; Project
Webpage: https://antoyang.github.io/vidchapters.html ; 31 pages; 8 figure
- …