End-to-end learning of visual representations from uncurated instructional videos
Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent in narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
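MIL-NCE pairs multiple instance learning with noise contrastive estimation: because the narration that describes a clip is often temporally misaligned with it, several neighbouring narrations are treated as candidate positives, and the objective only requires that at least one of them match the clip. Below is a minimal PyTorch sketch of such an objective, assuming precomputed clip and narration embeddings; the tensor names and the positive-candidate mask are illustrative assumptions, not the authors' released implementation.

    import torch
    import torch.nn.functional as F

    def mil_nce_loss(video_emb, text_emb, pos_mask, temperature=0.07):
        # video_emb: (B, D) clip embeddings; text_emb: (N, D) narration embeddings.
        # pos_mask: (B, N) boolean, True where a narration is a candidate positive
        # for a clip (e.g. temporally close); every clip needs at least one candidate.
        video_emb = F.normalize(video_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        sim = video_emb @ text_emb.t() / temperature          # (B, N) similarity scores
        # Sum the candidate-positive scores inside the log, so any one of the
        # neighbouring narrations can "explain" the clip.
        pos = torch.logsumexp(sim.masked_fill(~pos_mask, float('-inf')), dim=1)
        all_pairs = torch.logsumexp(sim, dim=1)
        return -(pos - all_pairs).mean()

Summing over a bag of candidate positives, rather than insisting on a single aligned pair, is what makes the loss tolerant to narrations that do not exactly match their clip.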
QuerYD: A video dataset with high-quality text and audio narrations
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video. A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description of the visual content. The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos. This ever-growing collection of videos contains highly detailed, temporally aligned audio and text annotations. The content descriptions are more relevant than dialogue, and more detailed than previous description attempts, which can be observed to contain many superficial or uninformative descriptions. To demonstrate the utility of the QuerYD dataset, we show that it can be used to train and benchmark strong models for retrieval and event localisation. Data, code and models are made publicly available, and we hope that QuerYD inspires further research on video understanding with written and spoken natural language.
Comment: 5 pages, 4 figures, accepted at ICASSP 2021
Support-set bottlenecks for video-text representation learning
The dominant paradigm for learning video-text representations -- noise contrastive learning -- increases the similarity of the representations of pairs of samples that are known to be related, such as text and video from the same sample, and pushes away the representations of all other pairs. We posit that this last behaviour is too strict, enforcing dissimilar representations even for samples that are semantically related -- for example, visually similar videos or ones that share the same depicted action. In this paper, we propose a novel method that alleviates this by leveraging a generative model to naturally push these related samples together: each sample's caption must be reconstructed as a weighted combination of other support samples' visual representations. This simple idea ensures that representations are not overly specialized to individual samples, are reusable across the dataset, and explicitly encode semantics shared between samples, unlike noise contrastive learning. Our proposed method outperforms others by a large margin on MSR-VTT, VATEX, ActivityNet, and MSVD for video-to-text and text-to-video retrieval.
Comment: Accepted as a spotlight paper at the International Conference on Learning Representations (ICLR) 2021
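Concretely, the method adds a generative term to the usual cross-modal contrastive loss: each caption has to be explained by an attention-weighted combination of the other samples' (the support set's) video representations, which encourages semantically related samples to share representation. The sketch below, in PyTorch, illustrates only this reconstruction term at the embedding level; the paper reconstructs the caption text itself with a decoder, so the attention scheme, the names, and the cosine reconstruction loss used here are simplifying assumptions rather than the published implementation.

    import torch
    import torch.nn.functional as F

    def support_set_reconstruction(text_emb, video_emb, temperature=0.1):
        # text_emb, video_emb: (B, D) paired caption and video embeddings.
        t = F.normalize(text_emb, dim=-1)
        v = F.normalize(video_emb, dim=-1)
        attn = t @ v.t() / temperature                        # (B, B) caption-to-video scores
        # Mask out each sample's own video: the caption must be reconstructed
        # from the support set, not memorised from its paired clip.
        eye = torch.eye(attn.size(0), dtype=torch.bool, device=attn.device)
        weights = F.softmax(attn.masked_fill(eye, float('-inf')), dim=1)
        recon = weights @ v                                   # (B, D) weighted support combination
        # Penalise captions that cannot be rebuilt from related videos.
        return (1.0 - F.cosine_similarity(recon, t, dim=-1)).mean()

In training, this term would be used alongside the standard contrastive objective, so that representations stay discriminative while still encoding the semantics shared across the support set.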