Rethinking Zero-shot Video Classification: End-to-end Training for Realistic Applications
Trained on large datasets, deep learning (DL) can accurately classify videos into hundreds of diverse classes. However, video data is expensive to annotate. Zero-shot learning (ZSL) offers one solution to this problem: ZSL trains a model once and generalizes to new tasks whose classes are not present in the training dataset. We propose the first end-to-end algorithm for ZSL in video classification. Our training procedure builds on insights from recent video classification literature and uses a trainable 3D CNN to learn the visual features. This is in contrast to previous video ZSL methods, which use pretrained feature extractors. We also extend the current benchmarking paradigm: previous techniques aim to make the test task unknown at training time but fall short of this goal. We encourage domain shift across training and test data and disallow tailoring a ZSL model to a specific test dataset. We outperform the state of the art by a wide margin. Our code, evaluation procedure, and model weights are available at this http URL.
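To make the idea concrete, here is a minimal PyTorch sketch of end-to-end zero-shot video classification: a trainable 3D CNN (torchvision's r3d_18, used here as an assumed stand-in for the paper's backbone) maps clips into a word-embedding space, and unseen classes are predicted by nearest class-name embedding. The loss, dimensions, and random class embeddings are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch: a 3D CNN trained end-to-end to regress clips onto
# class-name word embeddings; unseen classes are matched by nearest
# neighbour at test time. Class embeddings below are random stand-ins
# for e.g. Word2Vec vectors of class names.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class ZeroShotVideoClassifier(nn.Module):
    def __init__(self, embed_dim: int = 300):
        super().__init__()
        self.backbone = r3d_18(weights=None)  # trained end-to-end, not frozen
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, embed_dim)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, 3, frames, height, width)
        return self.backbone(clips)

model = ZeroShotVideoClassifier()
clips = torch.randn(2, 3, 16, 112, 112)   # dummy batch of two clips
class_embeds = torch.randn(5, 300)        # stand-in class-name embeddings

video_embeds = model(clips)
# Training: pull each clip toward its class's word embedding ...
loss = nn.functional.mse_loss(video_embeds, class_embeds[torch.tensor([0, 3])])
# ... testing: classify unseen classes by nearest class embedding.
pred = torch.cdist(video_embeds, class_embeds).argmin(dim=1)
```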
QuerYD: A video dataset with high-quality text and audio narrations
We introduce QuerYD, a new large-scale dataset for retrieval and event
localisation in video. A unique feature of our dataset is the availability of
two audio tracks for each video: the original audio, and a high-quality spoken
description of the visual content. The dataset is based on YouDescribe, a
volunteer project that assists visually-impaired people by attaching voiced
narrations to existing YouTube videos. This ever-growing collection of videos
contains highly detailed, temporally aligned audio and text annotations. The
content descriptions are more relevant than dialogue, and more detailed than
previous description attempts, which often contain superficial or
uninformative descriptions. To demonstrate the utility of the
QuerYD dataset, we show that it can be used to train and benchmark strong
models for retrieval and event localisation. Data, code and models are made
publicly available, and we hope that QuerYD inspires further research on video
understanding with written and spoken natural language. (5 pages, 4 figures; accepted at ICASSP 2021.)
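To illustrate what "temporally aligned audio and text annotations" might look like in practice, here is a hypothetical record layout for one narration segment; the field names are assumptions for illustration, not the released QuerYD schema.

```python
# Hypothetical layout for one QuerYD narration segment, illustrating the
# dataset's key property: spoken descriptions temporally aligned to the
# source YouTube video. Field names are assumed, not the official schema.
from dataclasses import dataclass

@dataclass
class NarrationSegment:
    video_id: str    # YouTube identifier of the described video
    start_s: float   # segment start within the video, in seconds
    end_s: float     # segment end, in seconds
    text: str        # transcript of the spoken description
    audio_path: str  # path to the narrator's recorded audio track

# Event localisation then amounts to ranking segments against a query,
# e.g. scoring each (query, NarrationSegment) pair with a trained model.
```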
Support-set bottlenecks for video-text representation learning
The dominant paradigm for learning video-text representations -- noise
contrastive learning -- increases the similarity of the representations of
pairs of samples that are known to be related, such as text and video from the
same sample, and pushes away the representations of all other pairs. We posit
that this last behaviour is too strict, enforcing dissimilar representations
even for samples that are semantically related -- for example, visually similar
videos or ones that share the same depicted action. In this paper, we propose a
novel method that alleviates this by leveraging a generative model to naturally
push these related samples together: each sample's caption must be
reconstructed as a weighted combination of other support samples' visual
representations. This simple idea ensures that representations are not
overly-specialized to individual samples, are reusable across the dataset, and
results in representations that explicitly encode semantics shared between
samples, unlike noise contrastive learning. Our proposed method outperforms
others by a large margin on MSR-VTT, VATEX, ActivityNet, and MSVD for
video-to-text and text-to-video retrieval. (Accepted as a spotlight paper at
the International Conference on Learning Representations, ICLR 2021.)
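The support-set idea can be sketched in embedding space: each caption is reconstructed as an attention-weighted combination of the *other* samples' visual embeddings, so semantically related videos are pulled together instead of being treated as negatives. This is a simplified stand-in for the paper's generative (captioning) objective; the dimensions, temperature, and squared-error reconstruction loss are assumptions.

```python
# Simplified support-set reconstruction: each caption embedding is
# rebuilt from attention over the other samples' visual embeddings,
# excluding the sample itself. An embedding-space stand-in for the
# paper's generative cross-captioning objective.
import torch

def support_set_reconstruction(text_emb, video_emb, temperature=0.07):
    # text_emb, video_emb: (batch, dim), L2-normalised
    scores = text_emb @ video_emb.t() / temperature          # (batch, batch)
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    scores = scores.masked_fill(mask, float("-inf"))         # drop self-match
    weights = scores.softmax(dim=1)                          # attention over support set
    recon = weights @ video_emb                              # reconstructed caption embedding
    return ((recon - text_emb) ** 2).sum(dim=1).mean()       # reconstruction loss

text_emb = torch.nn.functional.normalize(torch.randn(8, 256), dim=1)
video_emb = torch.nn.functional.normalize(torch.randn(8, 256), dim=1)
loss = support_set_reconstruction(text_emb, video_emb)
```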
DASZL: Dynamic Action Signatures for Zero-shot Learning
There are many realistic applications of activity recognition where the set
of potential activity descriptions is combinatorially large. This makes
end-to-end supervised training of a recognition system impractical as no
training set is practically able to encompass the entire label set. In this
paper, we present an approach to fine-grained recognition that models
activities as compositions of dynamic action signatures. This compositional
approach allows us to reframe fine-grained recognition as zero-shot activity
recognition, where a detector is composed "on the fly" from simple
first-principles state machines supported by deep-learned components. We
evaluate our method on the Olympic Sports and UCF101 datasets, where our model
establishes a new state of the art under multiple experimental paradigms. We
also extend this method to form a unique framework for zero-shot joint
segmentation and classification of activities in video and demonstrate the
first results in zero-shot decoding of complex action sequences on a
widely-used surgical dataset. Lastly, we show that we can use off-the-shelf
object detectors to recognize activities in completely de novo settings with no
additional training. (10 pages, 4 figures, 3 tables; AAAI 2021 submission.)
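A toy sketch of composing a detector "on the fly" from a first-principles state machine over per-frame attribute detections is shown below; the attributes and transition rules are invented for illustration and are far simpler than the paper's dynamic action signatures.

```python
# Toy composition of an activity detector from a hand-written state
# machine over per-frame attributes (e.g. outputs of off-the-shelf
# detectors). Attributes and transitions are invented for illustration.
def detect_high_jump(frame_attrs):
    """frame_attrs: iterable of per-frame attribute sets,
    e.g. [{"running"}, {"jumping"}, {"lying"}]."""
    state = "start"
    transitions = {
        ("start", "running"): "approach",
        ("approach", "jumping"): "takeoff",
        ("takeoff", "lying"): "landed",   # cleared the bar onto the mat
    }
    for attrs in frame_attrs:
        for attr in attrs:
            state = transitions.get((state, attr), state)
    return state == "landed"

print(detect_high_jump([{"running"}, {"running"}, {"jumping"}, {"lying"}]))  # True
```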
Multi-modal Transformer for Video Retrieval
The task of retrieving video content relevant to natural language queries
plays a critical role in effectively handling internet-scale datasets. Most of
the existing methods for this caption-to-video retrieval problem do not fully
exploit cross-modal cues present in video. Furthermore, they aggregate
per-frame visual features with limited or no temporal information. In this
paper, we present a multi-modal transformer to jointly encode the different
modalities in video, which allows each of them to attend to the others. The
transformer architecture is also leveraged to encode and model the temporal
information. On the natural language side, we investigate the best practices to
jointly optimize the language embedding together with the multi-modal
transformer. This novel framework allows us to establish state-of-the-art
results for video retrieval on three datasets. More details are available at
http://thoth.inrialpes.fr/research/MMT. (ECCV 2020 spotlight paper.)
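A minimal sketch of the core architectural idea, assuming generic per-modality "expert" feature sequences: each sequence receives a learned modality embedding and a temporal embedding, and all tokens then attend to one another in a shared transformer encoder. Sizes, pooling, and layer counts are illustrative, not the paper's configuration.

```python
# Minimal multi-modal transformer sketch: per-modality feature sequences
# get modality + temporal embeddings, are concatenated, and jointly
# encoded so every modality can attend to the others.
import torch
import torch.nn as nn

class MultiModalTransformer(nn.Module):
    def __init__(self, dim=512, n_modalities=3, max_len=32, n_layers=4):
        super().__init__()
        self.mod_embed = nn.Embedding(n_modalities, dim)  # which expert produced the feature
        self.time_embed = nn.Embedding(max_len, dim)      # when in the video it occurred
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, feats):
        # feats: list of (batch, seq, dim) tensors, one per modality
        tokens = []
        for m, x in enumerate(feats):
            t = torch.arange(x.size(1), device=x.device)
            tokens.append(x + self.mod_embed.weight[m] + self.time_embed(t))
        out = self.encoder(torch.cat(tokens, dim=1))  # cross-modal attention
        return out.mean(dim=1)                        # pooled video embedding

mmt = MultiModalTransformer()
video_emb = mmt([torch.randn(2, 8, 512) for _ in range(3)])
```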
Use What You Have: Video Retrieval Using Representations From Collaborative Experts
The rapid growth of video on the internet has made searching for video
content using natural language queries a significant challenge. Human-generated
queries for video datasets 'in the wild' vary considerably in their degree of
specificity, with some queries describing specific details such as the names of
famous identities, content from speech, or text available on the screen. Our
goal is to condense the multi-modal, extremely high dimensional information
from videos into a single, compact video representation for the task of video
retrieval using free-form text queries, where the degree of specificity is
open-ended.
For this we exploit existing knowledge in the form of pre-trained semantic
embeddings which include 'general' features such as motion, appearance, and
scene features from visual content. We also explore the use of more 'specific'
cues from ASR and OCR which are intermittently available for videos and find
that these signals remain challenging to use effectively for retrieval. We
propose a collaborative experts model to aggregate information from these
different pre-trained experts and assess our approach empirically on five
retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet. Code and
data can be found at www.robots.ox.ac.uk/~vgg/research/collaborative-experts/.
This paper contains a correction to results reported in the previous version.
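As a rough sketch of the aggregation idea, the snippet below projects several pre-trained expert features into a shared space and combines them with learned gates; this simplified gating is a stand-in for the paper's collaborative gating mechanism, and all dimensions are assumptions.

```python
# Simplified expert aggregation: project each pre-trained expert feature
# (motion, appearance, scene, ASR, OCR, ...) into a shared space, then
# mix them with learned gates into one compact video embedding.
import torch
import torch.nn as nn

class ExpertAggregator(nn.Module):
    def __init__(self, expert_dims, out_dim=256):
        super().__init__()
        self.project = nn.ModuleList([nn.Linear(d, out_dim) for d in expert_dims])
        self.gate = nn.Linear(out_dim * len(expert_dims), len(expert_dims))

    def forward(self, experts):
        # experts: list of (batch, dim_i) tensors; intermittently missing
        # experts (e.g. absent ASR/OCR) could be zero-filled beforehand.
        z = [p(e) for p, e in zip(self.project, experts)]      # shared space
        gates = self.gate(torch.cat(z, dim=1)).softmax(dim=1)  # per-expert weights
        return sum(g.unsqueeze(1) * e for g, e in zip(gates.t(), z))

agg = ExpertAggregator(expert_dims=[1024, 2048, 512])
video_emb = agg([torch.randn(4, 1024), torch.randn(4, 2048), torch.randn(4, 512)])
```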