End-to-End Learning of Visual Representations from Uncurated Instructional Videos
Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent to narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.
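The core idea of MIL-NCE is a multiple-instance variant of the NCE loss: instead of assuming a single correct narration per clip, each clip is paired with a small bag of temporally close captions, any of which may be the true match. Below is a minimal PyTorch-style sketch of such a loss; tensor shapes, the bag size `K` and the `temperature` parameter are assumptions for illustration, and the symmetric text-to-video term and the paper's exact candidate-selection strategy are omitted.

```python
import torch
import torch.nn.functional as F

def mil_nce_loss(video_emb, text_emb, temperature=0.07):
    """Sketch of a MIL-NCE-style loss.

    video_emb: (B, D) clip embeddings.
    text_emb:  (B, K, D) a "bag" of K candidate narrations per clip
               (e.g. the K temporally closest ASR sentences); which
               caption truly matches the clip is unknown.
    Positives for clip i are all K captions in its own bag; captions
    from the other clips in the batch act as negatives.
    """
    B, K, D = text_emb.shape
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity of every clip to every caption in the batch: (B, B, K)
    sim = video_emb @ text_emb.reshape(B * K, D).t() / temperature
    sim = sim.reshape(B, B, K)

    # -log( sum_pos exp / sum_all exp ): a softmax over all captions where
    # the "positive" mass is the whole bag belonging to clip i.
    all_lse = torch.logsumexp(sim.reshape(B, B * K), dim=1)                   # all captions
    pos_lse = torch.logsumexp(sim[torch.arange(B), torch.arange(B)], dim=1)   # own bag only
    return (all_lse - pos_lse).mean()
```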
Multi-Task Learning of Object State Changes from Uncurated Videos
We aim to learn to temporally localize object state changes and the
corresponding state-modifying actions by observing people interacting with
objects in long uncurated web videos. We introduce three principal
contributions. First, we explore alternative multi-task network architectures
and identify a model that enables efficient joint learning of multiple object
states and actions such as pouring water and pouring coffee. Second, we design
a multi-task self-supervised learning procedure that exploits different types
of constraints between objects and state-modifying actions enabling end-to-end
training of a model for temporal localization of object states and actions in
videos from only noisy video-level supervision. Third, we report results on the
large-scale ChangeIt and COIN datasets containing tens of thousands of long
(un)curated web videos depicting various interactions such as hole drilling,
cream whisking, or paper plane folding. We show that our multi-task model
achieves a relative improvement of 40% over the prior single-task methods and
significantly outperforms both image-based and video-based zero-shot models for
this problem. We also test our method on long egocentric videos of the
EPIC-KITCHENS and Ego4D datasets in a zero-shot setup, demonstrating the
robustness of our learned model.
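The video-level supervision becomes usable through ordering constraints between states and actions: in a video labelled with a given state change, some frame showing the initial state should precede the state-modifying action, which should in turn precede the end state. The sketch below is only a rough illustration of this state1 → action → state2 constraint for mining pseudo-labels from per-frame scores; it is not the paper's exact procedure, which jointly handles many objects and actions and adds noise-adaptive weighting.

```python
import torch

def best_ordered_triplet(state1_scores, action_scores, state2_scores):
    """Mine a pseudo-labelled (initial state, action, end state) frame triplet.

    Each argument is a 1-D tensor of per-frame scores of length T (T >= 3).
    We pick frame indices i < j < k maximising the summed score; the selected
    frames can then serve as (noisy) targets for the per-frame classifiers.
    """
    T = state1_scores.shape[0]
    # Best initial-state score seen up to each frame (prefix maximum).
    prefix = torch.cummax(state1_scores, dim=0).values
    # Best end-state score seen from each frame onward (suffix maximum).
    suffix = torch.flip(torch.cummax(torch.flip(state2_scores, [0]), dim=0).values, [0])

    best, best_j = -float("inf"), None
    for j in range(1, T - 1):  # action frame, strictly inside the video
        score = prefix[j - 1] + action_scores[j] + suffix[j + 1]
        if score > best:
            best, best_j = score, j

    i = int(torch.argmax(state1_scores[:best_j]))
    k = int(best_j + 1 + torch.argmax(state2_scores[best_j + 1:]))
    return i, int(best_j), k
```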
In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval
Large-scale noisy web image-text datasets have been proven to be efficient
for learning robust vision-language models. However, when transferring them to
the task of video retrieval, models still need to be fine-tuned on hand-curated
paired text-video data to adapt to the diverse styles of video descriptions. To
address this problem without the need for hand-annotated pairs, we propose a
new setting, text-video retrieval with uncurated & unpaired data, that during
training utilizes only text queries together with uncurated web videos without
any paired text-video data. To this end, we propose an approach, In-Style, that
learns the style of the text queries and transfers it to uncurated web videos.
Moreover, to improve generalization, we show that one model can be trained with
multiple text styles. To this end, we introduce a multi-style contrastive
training procedure that improves the generalizability over several datasets
simultaneously. We evaluate our model on retrieval performance over multiple
datasets to demonstrate the advantages of our style transfer framework on the
new task of uncurated & unpaired text-video retrieval and improve
state-of-the-art performance on zero-shot text-video retrieval.
Comment: Published at ICCV 2023, code: https://github.com/ninatu/in_styl
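Once captions for the uncurated web videos have been generated and rewritten into the style of the target queries, training reduces to standard contrastive learning over the resulting pseudo-pairs. The sketch below shows only that final step; `video_encoder`, `text_encoder` and `pseudo_captions` are hypothetical names, and the captioning and style-transfer stages that produce the pseudo-captions are assumed to exist upstream.

```python
import torch
import torch.nn.functional as F

def pseudo_pair_contrastive_step(video_encoder, text_encoder,
                                 videos, pseudo_captions, temperature=0.05):
    """Symmetric InfoNCE step over pseudo-paired (styled caption, video) data.

    `pseudo_captions` are captions generated for the uncurated web videos and
    then rewritten ("style-transferred") to imitate the style of the target
    text queries, so no hand-annotated text-video pairs are needed.
    """
    v = F.normalize(video_encoder(videos), dim=-1)           # (B, D)
    t = F.normalize(text_encoder(pseudo_captions), dim=-1)   # (B, D)
    logits = v @ t.t() / temperature                          # (B, B)
    targets = torch.arange(len(videos), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```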
Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization
Self-supervised pre-training recently demonstrates success on large-scale
multimodal data, and state-of-the-art contrastive learning methods often
enforce the feature consistency from cross-modality inputs, such as video/audio
or video/text pairs. Despite being convenient to formulate and leverage in
practice, such cross-modality alignment (CMA) provides only weak and noisy
supervision, since two modalities can be semantically misaligned even when they are
temporally aligned. For example, even in the commonly adopted instructional
videos, a speaker can sometimes refer to something that is not visually present
in the current frame; and the semantic misalignment would only be more
unpredictable for raw videos from the internet. We conjecture that this might
cause conflicts and biases among modalities, and may hence prevent CMA from
scaling up to training with larger and more heterogeneous data. This paper
first verifies our conjecture by observing that, even in the latest VATT
pre-training using only instructional videos, there exist strong gradient
conflicts between different CMA losses within the same (video, audio, text)
triplet, indicating that they are a noisy source of supervision. We then propose to
harmonize such gradients, via two techniques: (i) cross-modality gradient
realignment: modifying different CMA loss gradients for each sample triplet, so
that their gradient directions are more aligned; and (ii) gradient-based
curriculum learning: leveraging the gradient conflict information as an
indicator of sample noisiness, to develop a curriculum learning strategy to
prioritize training on less noisy sample triplets. Applying those techniques to
pre-training VATT on the HowTo100M dataset, we consistently improve its
performance on different downstream tasks. Moreover, we are able to scale VATT
pre-training to the more complicated, non-narrative YouTube-8M dataset to further
improve the state of the art.
Comment: Accepted at NeurIPS 202
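Concretely, both techniques operate on the per-triplet gradients of the individual CMA losses. The sketch below illustrates the idea for two flattened gradient vectors using a PCGrad-style projection; this projection is an assumption about the exact form of the realignment described in the abstract, not the paper's definitive implementation.

```python
import torch

def harmonize(grad_a, grad_b):
    """Pairwise gradient realignment between two CMA losses (sketch).

    grad_a, grad_b: flattened 1-D gradient tensors of, e.g., the video-text
    and video-audio contrastive losses for the same sample triplet.  If they
    point in conflicting directions (negative cosine similarity), remove from
    each the component along the other so the surviving directions agree.
    Returns the adjusted gradients plus the cosine similarity, which can also
    serve as a per-sample noisiness score for curriculum weighting.
    """
    cos = torch.dot(grad_a, grad_b) / (grad_a.norm() * grad_b.norm() + 1e-8)
    if cos < 0:  # conflict: project each gradient off the other (original) one
        a, b = grad_a.clone(), grad_b.clone()
        grad_a = a - torch.dot(a, b) / (b.norm() ** 2 + 1e-8) * b
        grad_b = b - torch.dot(b, a) / (a.norm() ** 2 + 1e-8) * a
    return grad_a, grad_b, cos
```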
Large-Scale Learning from Video and Natural Language
The goal of this thesis is to build and train machine learning models capable of understanding the content of videos. Current video understanding approaches mainly rely on large-scale manually annotated video datasets for training. However, collecting and annotating such datasets is cumbersome, expensive and time-consuming. To address this issue, this thesis focuses on leveraging large amounts of readily available but noisy annotations in the form of natural language. In particular, we exploit a diverse corpus of textual metadata such as movie scripts, web video titles and descriptions, or automatically transcribed speech obtained from narrated videos. Training video models on such readily available textual data is challenging, as the annotation is often imprecise or wrong. In this thesis, we introduce learning approaches to deal with weak annotation and design specialized training objectives and neural network architectures.