Unsupervised Learning from Narrated Instruction Videos
We address the problem of automatically learning the main steps to complete a
certain task, such as changing a car tire, from a set of narrated instruction
videos. The contributions of this paper are three-fold. First, we develop a new
unsupervised learning approach that takes advantage of the complementary nature
of the input video and the associated narration. The method solves two
clustering problems, one in text and one in video, applied one after the other
and linked by joint constraints to obtain a single coherent sequence of steps
in both modalities. Second, we collect and annotate a new challenging dataset
of real-world instruction videos from the Internet. The dataset contains about
800,000 frames for five different tasks that include complex interactions
between people and objects, and are captured in a variety of indoor and outdoor
settings. Third, we experimentally demonstrate that the proposed method can
automatically discover, in an unsupervised manner, the main steps to achieve
the task and locate the steps in the input videos.
Comment: Appears in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016). 21 pages
Learning from narrated instruction videos
Automatic assistants could guide a person or a robot in performing new tasks, such as changing a car tire or repotting a plant. Creating such assistants, however, is non-trivial and requires understanding of the visual and verbal content of a video. Towards this goal, we here address the problem of automatically learning the main steps of a task from a set of narrated instruction videos. We develop a new unsupervised learning approach that takes advantage of the complementary nature of the input video and the associated narration. The method sequentially clusters textual and visual representations of a task, where the two clustering problems are linked by joint constraints to obtain a single coherent sequence of steps in both modalities. To evaluate our method, we collect and annotate a new challenging dataset of real-world instruction videos from the Internet. The dataset contains videos for five different tasks with complex interactions between people and objects, captured in a variety of indoor and outdoor settings. We experimentally demonstrate that the proposed method can automatically discover, learn and localize the main steps of a task in input videos.
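
As a concrete illustration of the linked-clustering idea described above, the sketch below first clusters narration features into K ordered steps and then assigns video frames to those steps under a monotonic-ordering constraint solved by dynamic programming. This is a minimal sketch under simplifying assumptions, not the authors' formulation: narration_vecs, frame_scores and K are placeholder names, and K-means plus an alignment pass merely stands in for the joint constrained clustering.

    import numpy as np
    from sklearn.cluster import KMeans

    def discover_text_steps(narration_vecs, K):
        """Cluster narration features into K steps, ordered by mean position in the transcript."""
        labels = KMeans(n_clusters=K, n_init=10).fit_predict(narration_vecs)
        order = np.argsort([np.mean(np.where(labels == k)[0]) for k in range(K)])
        return [np.where(labels == k)[0] for k in order]   # step -> sentence indices, in narration order

    def align_video_to_steps(frame_scores):
        """Monotonically align T frames to K ordered steps; frame_scores is a T x K affinity matrix."""
        T, K = frame_scores.shape
        dp = np.full((T, K), -np.inf)
        dp[0, 0] = frame_scores[0, 0]                # alignments must start at the first step
        for t in range(1, T):
            for k in range(K):
                stay = dp[t - 1, k]
                advance = dp[t - 1, k - 1] if k > 0 else -np.inf
                dp[t, k] = frame_scores[t, k] + max(stay, advance)
        path, k = [], K - 1                          # backtrack from the last step
        for t in range(T - 1, 0, -1):
            path.append(k)
            if k > 0 and dp[t - 1, k - 1] >= dp[t - 1, k]:
                k -= 1
        path.append(k)
        return path[::-1]                            # step index for every frame

The dynamic program enforces the "single coherent sequence of steps" constraint: frame assignments can only stay at the current step or advance to the next one, never jump backwards.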
Convergence Rate of Frank-Wolfe for Non-Convex Objectives
We give a simple proof that the Frank-Wolfe algorithm obtains a stationary
point at a rate of $O(1/\sqrt{t})$ on non-convex objectives with a Lipschitz
continuous gradient. Our analysis is affine invariant and is the first, to the
best of our knowledge, giving a similar rate to what was already proven for
projected gradient methods (though on slightly different measures of
stationarity).
Comment: 6 pages
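
For orientation, the stationarity measure behind this kind of result is the Frank-Wolfe gap; a hedged restatement of the guarantee, with the exact constant deliberately left abstract, is:

    \[
      g_k \;=\; \max_{s \in \mathcal{D}} \big\langle s - x_k,\; -\nabla f(x_k) \big\rangle ,
      \qquad
      \min_{0 \le k \le t} g_k \;\le\; \frac{C}{\sqrt{t+1}} \;=\; O\!\big(1/\sqrt{t}\big),
    \]

where $\mathcal{D}$ is the compact constraint set, $x_k$ the k-th Frank-Wolfe iterate, and $C$ a constant depending on the initial suboptimality and on the curvature constant of $f$ over $\mathcal{D}$ (the paper gives the precise constant). The gap $g_k$ vanishes exactly at stationary points and, like the algorithm itself, is affine invariant, which is why it serves as the natural stationarity measure here.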
Cross-task weakly supervised learning from instructional videos
In this paper we investigate learning visual models for the steps of ordinary
tasks using weak supervision via instructional narrations and an ordered list
of steps instead of strong supervision via temporal annotations. At the heart
of our approach is the observation that weakly supervised learning may be
easier if a model shares components while learning different steps: `pour egg'
should be trained jointly with other tasks involving `pour' and `egg'. We
formalize this in a component model for recognizing steps and a weakly
supervised learning framework that can learn this model under temporal
constraints from narration and the list of steps. Existing datasets do not permit
a systematic study of sharing, so we also gather a new dataset, CrossTask,
aimed at assessing cross-task sharing. Our experiments demonstrate that sharing
across tasks improves performance, especially when done at the component level
and that our component model can parse previously unseen tasks by virtue of its
compositionality.
Comment: 18 pages, 17 figures, to be published in the proceedings of CVPR 2019
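
To make the component-sharing idea concrete, here is a minimal illustrative sketch (not the released CrossTask code): a step such as `pour egg' is scored by summing shared component classifiers for `pour' and `egg', so those parameters are reused by every task whose steps mention the same components. The class name, feature dimension and component list below are assumptions for illustration only.

    import numpy as np

    class ComponentModel:
        """One linear classifier per shared component (e.g. verbs and objects)."""
        def __init__(self, components, dim, rng=None):
            rng = rng or np.random.default_rng(0)
            self.w = {c: rng.normal(scale=0.01, size=dim) for c in components}

        def step_score(self, feat, step_components):
            # a step's score is the sum of its components' scores on the same feature
            return sum(float(feat @ self.w[c]) for c in step_components)

    # usage: "pour egg" shares parameters with any other step that uses "pour" or "egg"
    model = ComponentModel(["pour", "whisk", "egg", "milk"], dim=512)
    segment_feat = np.zeros(512)   # placeholder visual feature for one video segment
    print(model.step_score(segment_feat, ["pour", "egg"]))

Because components are trained jointly across tasks, a previously unseen step can still be scored as long as its components were observed elsewhere, which is the compositionality the abstract refers to.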
Action Recognition from Single Timestamp Supervision in Untrimmed Videos
Recognising actions in videos relies on labelled supervision during training,
typically the start and end times of each action instance. This supervision is
not only subjective, but also expensive to acquire. Weak video-level
supervision has been successfully exploited for recognition in untrimmed
videos; however, it is challenged when the number of different actions in
training videos increases. We propose a method that is supervised by single
timestamps located around each action instance, in untrimmed videos. We replace
expensive action bounds with sampling distributions initialised from these
timestamps. We then use the classifier's response to iteratively update the
sampling distributions. We demonstrate that these distributions converge to the
location and extent of discriminative action segments. We evaluate our method
on three datasets for fine-grained recognition, with increasing number of
different actions per video, and show that single timestamps offer a reasonable
compromise between recognition performance and labelling effort, performing
comparably to full temporal supervision. Our update method improves top-1 test
accuracy by up to 5.4% across the evaluated datasets.
Comment: CVPR 2019
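
A rough sketch of the single-timestamp mechanism described above, under simplifying assumptions: each annotated timestamp gets a Gaussian sampling distribution over frame indices, training frames are drawn from it, and the distribution is refitted to the frames the current classifier scores highest. The Gaussian parameterisation and this particular refit are stand-ins, not the paper's exact update rule.

    import numpy as np

    def init_distribution(timestamp, sigma=30.0):
        # start the sampling distribution centred on the single annotated timestamp
        return {"mu": float(timestamp), "sigma": sigma}

    def sample_frames(dist, n_frames, video_len, rng):
        # draw training frames around the current centre of the distribution
        idx = rng.normal(dist["mu"], dist["sigma"], size=n_frames)
        return np.clip(idx, 0, video_len - 1).astype(int)

    def update_distribution(dist, frame_scores, top_frac=0.25):
        """Refit the distribution to the frames the classifier currently scores highest."""
        scores = np.asarray(frame_scores)          # one classifier score per frame for this action
        k = max(1, int(top_frac * len(scores)))
        top = np.argsort(scores)[-k:]              # indices of the most confident frames
        dist["mu"] = float(top.mean())
        dist["sigma"] = float(top.std() + 1e-3)
        return dist

Iterating sampling, classifier training and this update lets the distributions drift towards and tighten around the discriminative action segments, which is the convergence behaviour the abstract reports.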
- …