End-to-end learning of visual representations from uncurated instructional videos
Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent in narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
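MIL-NCE pairs multiple instance learning with noise contrastive estimation: because the narration that describes a clip is often temporally misaligned with it, several neighbouring narrations are treated as candidate positives, and the objective only requires that at least one of them match the clip. Below is a minimal PyTorch sketch of such an objective, assuming precomputed clip and narration embeddings; the tensor names and the positive-candidate mask are illustrative assumptions, not the authors' released implementation.

    import torch
    import torch.nn.functional as F

    def mil_nce_loss(video_emb, text_emb, pos_mask, temperature=0.07):
        # video_emb: (B, D) clip embeddings; text_emb: (N, D) narration embeddings.
        # pos_mask: (B, N) boolean, True where a narration is a candidate positive
        # for a clip (e.g. temporally close); every clip needs at least one candidate.
        video_emb = F.normalize(video_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        sim = video_emb @ text_emb.t() / temperature          # (B, N) similarity scores
        # Sum the candidate-positive scores inside the log, so any one of the
        # neighbouring narrations can "explain" the clip.
        pos = torch.logsumexp(sim.masked_fill(~pos_mask, float('-inf')), dim=1)
        all_pairs = torch.logsumexp(sim, dim=1)
        return -(pos - all_pairs).mean()

Summing over a bag of candidate positives, rather than insisting on a single aligned pair, is what makes the loss tolerant to narrations that do not exactly match their clip.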
QuerYD: A video dataset with high-quality text and audio narrations
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video. A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description of the visual content. The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos. This ever-growing collection of videos contains highly detailed, temporally aligned audio and text annotations. The content descriptions are more relevant than dialogue, and more detailed than previous description attempts, which can be observed to contain many superficial or uninformative descriptions. To demonstrate the utility of the QuerYD dataset, we show that it can be used to train and benchmark strong models for retrieval and event localisation. Data, code and models are made publicly available, and we hope that QuerYD inspires further research on video understanding with written and spoken natural language.
Comment: 5 pages, 4 figures, accepted at ICASSP 2021
Support-set bottlenecks for video-text representation learning
The dominant paradigm for learning video-text representations -- noise contrastive learning -- increases the similarity of the representations of pairs of samples that are known to be related, such as text and video from the same sample, and pushes away the representations of all other pairs. We posit that this last behaviour is too strict, enforcing dissimilar representations even for samples that are semantically related -- for example, visually similar videos or ones that share the same depicted action. In this paper, we propose a novel method that alleviates this by leveraging a generative model to naturally push these related samples together: each sample's caption must be reconstructed as a weighted combination of other support samples' visual representations. This simple idea ensures that representations are not overly specialized to individual samples, are reusable across the dataset, and explicitly encode semantics shared between samples, unlike noise contrastive learning. Our proposed method outperforms others by a large margin on MSR-VTT, VATEX, ActivityNet, and MSVD for video-to-text and text-to-video retrieval.
Comment: Accepted as a spotlight paper at the International Conference on Learning Representations (ICLR) 2021
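Concretely, the method adds a generative term to the usual cross-modal contrastive loss: each caption has to be explained by an attention-weighted combination of the other samples' (the support set's) video representations, which encourages semantically related samples to share representation. The sketch below, in PyTorch, illustrates only this reconstruction term at the embedding level; the paper reconstructs the caption text itself with a decoder, so the attention scheme, the names, and the cosine reconstruction loss used here are simplifying assumptions rather than the published implementation.

    import torch
    import torch.nn.functional as F

    def support_set_reconstruction(text_emb, video_emb, temperature=0.1):
        # text_emb, video_emb: (B, D) paired caption and video embeddings.
        t = F.normalize(text_emb, dim=-1)
        v = F.normalize(video_emb, dim=-1)
        attn = t @ v.t() / temperature                        # (B, B) caption-to-video scores
        # Mask out each sample's own video: the caption must be reconstructed
        # from the support set, not memorised from its paired clip.
        eye = torch.eye(attn.size(0), dtype=torch.bool, device=attn.device)
        weights = F.softmax(attn.masked_fill(eye, float('-inf')), dim=1)
        recon = weights @ v                                   # (B, D) weighted support combination
        # Penalise captions that cannot be rebuilt from related videos.
        return (1.0 - F.cosine_similarity(recon, t, dim=-1)).mean()

In training, this term would be used alongside the standard contrastive objective, so that representations stay discriminative while still encoding the semantics shared across the support set.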