Cross-task weakly supervised learning from instructional videos
In this paper we investigate learning visual models for the steps of ordinary tasks using weak supervision via instructional narrations and an ordered list of steps, instead of strong supervision via temporal annotations. At the heart of our approach is the observation that weakly supervised learning may be easier if a model shares components while learning different steps: "pour egg" should be trained jointly with other tasks involving "pour" and "egg". We formalize this in a component model for recognizing steps and a weakly supervised learning framework that can learn this model under temporal constraints from narration and the list of steps. Existing data does not permit a systematic study of sharing, so we also gather a new dataset, CrossTask, aimed at assessing cross-task sharing. Our experiments demonstrate that sharing across tasks improves performance, especially when done at the component level, and that our component model can parse previously unseen tasks by virtue of its compositionality.
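To make the sharing idea concrete, below is a minimal sketch of a component model in which each step classifier is assembled from shared per-component scorers, so that a component such as "pour" is trained jointly across every step that mentions it. The linear scorers, the 512-dimensional features, and the small vocabulary are illustrative assumptions, not the paper's actual architecture.

```python
# A minimal sketch of component-level sharing: a step label such as
# "pour egg" decomposes into component words ("pour", "egg") whose
# scorers are shared across all tasks. Sizes here are assumptions.
import torch
import torch.nn as nn

class ComponentStepModel(nn.Module):
    def __init__(self, components, feat_dim=512):
        super().__init__()
        # one linear scorer per shared component (verb or object)
        self.scorers = nn.ModuleDict({c: nn.Linear(feat_dim, 1) for c in components})

    def step_score(self, feats, step_words):
        # a step's score is the sum of its components' scores, so the
        # "pour" scorer is trained jointly on "pour egg", "pour milk", ...
        return sum(self.scorers[w](feats) for w in step_words if w in self.scorers)

components = ["pour", "egg", "milk", "whisk"]
model = ComponentStepModel(components)
frame_feats = torch.randn(100, 512)                       # 100 frames of visual features
scores = model.step_score(frame_feats, ["pour", "egg"])   # per-frame scores for "pour egg"
```

Because the step score is a sum over shared components, an unseen step like "pour milk" can be scored at test time from components learned on other tasks, which is the compositionality the abstract refers to.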
Weakly Supervised Action Localization by Sparse Temporal Pooling Network
We propose a weakly supervised temporal action localization algorithm for untrimmed videos using convolutional neural networks. Our algorithm learns from video-level class labels and predicts temporal intervals of human actions without requiring temporal localization annotations. We design our network to identify a sparse subset of key segments associated with target actions in a video using an attention module, and fuse the key segments through adaptive temporal pooling. Our loss function comprises two terms that minimize the video-level action classification error and enforce sparsity of the segment selection. At inference time, we extract and score temporal proposals using temporal class activations and class-agnostic attentions to estimate the time intervals that correspond to target actions. The proposed algorithm attains state-of-the-art results on the THUMOS14 dataset and outstanding performance on ActivityNet 1.3 despite its weak supervision.

Comment: Accepted to CVPR 2018
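The two-term loss described above is straightforward to sketch: an attention module weights segment features, attention-weighted pooling yields a video-level feature for classification, and an L1 penalty on the attention weights enforces sparse segment selection. This is a minimal illustration; the feature dimensions, layer sizes, and the 0.1 sparsity weight are assumptions rather than the paper's settings.

```python
# A minimal sketch of the two-term loss: video-level classification on
# attention-pooled segment features plus an L1 sparsity penalty on the
# class-agnostic attention. Dimensions and weights are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparsePoolingNet(nn.Module):
    def __init__(self, feat_dim=1024, num_classes=20):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid())    # per-segment attention in [0, 1]
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, segments):                # segments: (T, feat_dim)
        attn = self.attention(segments)         # (T, 1) class-agnostic attention
        video_feat = (attn * segments).sum(0) / (attn.sum() + 1e-8)  # adaptive pooling
        return self.classifier(video_feat), attn

net = SparsePoolingNet()
segments = torch.randn(400, 1024)               # segment-level video features
logits, attn = net(segments)
video_label = torch.zeros(20); video_label[3] = 1.0
loss = F.binary_cross_entropy_with_logits(logits, video_label) \
       + 0.1 * attn.abs().mean()                # sparsity term selects few key segments
```

Only the video-level label enters the loss; the sparsity term pushes attention mass onto a few key segments, whose high-attention intervals can then be read off at inference time.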
Action Modifiers: Learning from Adverbs in Instructional Videos
We present a method to learn a representation for adverbs from instructional videos using weak supervision from the accompanying narrations. Key to our method is the fact that the visual representation of an adverb is highly dependent on the action to which it applies, although the same adverb will modify multiple actions in a similar way. For instance, while 'spread quickly' and 'mix quickly' will look dissimilar, we can learn a common representation that allows us to recognize both, among other actions. We formulate this as an embedding problem and use scaled dot-product attention to learn from weakly supervised video narrations. We jointly learn adverbs as invertible transformations operating on the embedding space, so as to add or remove the effect of the adverb. As there is no prior work on weakly supervised learning from adverbs, we gather paired action-adverb annotations from a subset of the HowTo100M dataset for six adverbs: quickly/slowly, finely/coarsely, and partially/completely. Our method outperforms all baselines for video-to-adverb retrieval with a performance of 0.719 mAP. We also demonstrate our model's ability to attend to the relevant parts of a video in order to determine the adverb for a given action.

Comment: Accepted to CVPR 2020
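As a rough illustration of the embedding formulation, the sketch below models each adverb as a learned translation vector in the embedding space, so applying it is addition and removing it is subtraction, a trivially invertible transformation, and uses scaled dot-product attention to pool the video segments relevant to a given action. The translation parameterization, the 300-dimensional space, and all names are simplifying assumptions, not the paper's exact model.

```python
# A minimal sketch of adverbs as invertible transformations in a joint
# embedding space, with scaled dot-product attention over video segments.
# Modelling each adverb as a learned translation vector is an
# illustrative simplification, not the paper's parameterization.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdverbModel(nn.Module):
    def __init__(self, adverbs, actions, dim=300):
        super().__init__()
        self.adverb_vecs = nn.ParameterDict(
            {a: nn.Parameter(torch.randn(dim)) for a in adverbs})
        self.action_embed = nn.ParameterDict(
            {a: nn.Parameter(torch.randn(dim)) for a in actions})

    def attend(self, action, segments):
        # scaled dot-product attention: the action query picks out the
        # video parts where the action (and hence the adverb) is visible
        q = self.action_embed[action]                                # (dim,)
        w = F.softmax(segments @ q / math.sqrt(q.numel()), dim=0)    # (T,)
        return w @ segments                                          # attended video embedding

    def apply_adverb(self, action, adverb):
        return self.action_embed[action] + self.adverb_vecs[adverb]

    def remove_adverb(self, x, adverb):
        return x - self.adverb_vecs[adverb]      # inverse transformation

model = AdverbModel(["quickly", "slowly"], ["spread", "mix"])
segments = torch.randn(50, 300)                  # segment features for one video
video_emb = model.attend("mix", segments)
target = model.apply_adverb("mix", "quickly")    # embedding for "mix quickly"
```

Training would then pull the attended video embedding toward the adverb-modified action embedding, so that 'spread quickly' and 'mix quickly' share one adverb transformation even though the two actions look different.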
- …