Rethinking Zero-shot Video Classification: End-to-end Training for Realistic Applications
Trained on large datasets, deep learning (DL) can accurately classify videos into hundreds of diverse classes. However, video data is expensive to annotate. Zero-shot learning (ZSL) proposes one solution to this problem. ZSL trains a model once, and generalizes to new tasks whose classes are not present in the training dataset. We propose the first end-to-end algorithm for ZSL in video classification. Our training procedure builds on insights from recent video classification literature and uses a trainable 3D CNN to learn the visual features. This is in contrast to previous video ZSL methods, which use pretrained feature extractors. We also extend the current benchmarking paradigm: Previous techniques aim to make the test task unknown at training time but fall short of this goal. We encourage domain shift across training and test data and disallow tailoring a ZSL model to a specific test dataset. We outperform the state-of-the-art by a wide margin. Our code, evaluation procedure and model weights are available at this http URL
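To make the recipe concrete, here is a minimal sketch of end-to-end video ZSL in the spirit this abstract describes: a trainable 3D CNN regresses clips onto the word-embedding space of class names, and unseen classes are predicted by nearest-neighbor search in that space. The tiny backbone, the cosine loss, and all names (E2EZeroShotVideoNet, zsl_step, zsl_predict) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class E2EZeroShotVideoNet(nn.Module):
    """Illustrative end-to-end ZSL video classifier: a trainable 3D CNN
    regresses clip features onto the word-embedding space of class names."""
    def __init__(self, embed_dim=300):
        super().__init__()
        # Tiny stand-in for the trainable 3D CNN backbone of the paper.
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, embed_dim)  # map to the semantic space

    def forward(self, clips):                 # clips: (B, 3, T, H, W)
        return self.head(self.backbone(clips))

def zsl_step(model, clips, class_vecs, labels):
    """One training step: pull clip embeddings toward their class word vectors."""
    pred = F.normalize(model(clips), dim=-1)
    target = F.normalize(class_vecs[labels], dim=-1)
    return (1 - (pred * target).sum(-1)).mean()  # cosine regression loss

def zsl_predict(model, clips, unseen_class_vecs):
    """At test time: nearest unseen-class embedding by cosine similarity."""
    pred = F.normalize(model(clips), dim=-1)
    sims = pred @ F.normalize(unseen_class_vecs, dim=-1).T
    return sims.argmax(-1)
```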
Query Twice: Dual Mixture Attention Meta Learning for Video Summarization
Video summarization aims to select representative frames that retain high-level
information; this is usually done by predicting segment-wise importance scores
via a softmax function. However, the softmax function struggles to retain
high-rank representations of complex visual or sequential information, a
limitation known as the Softmax Bottleneck problem. In this paper, we propose a
novel framework, the Dual Mixture Attention (DMASum) model with Meta Learning,
for video summarization that tackles the softmax bottleneck problem: its
Mixture of Attention (MoA) layer effectively increases model capacity by
applying self-query attention twice, capturing second-order changes in
addition to the initial query-key attention, and a novel Single Frame Meta
Learning rule is then introduced to generalize better to small datasets with
limited training sources. Furthermore, DMASum exploits both visual and
sequential attention, connecting local key-frame and global attention in an
accumulative way. We adopt the new evaluation protocol on two public datasets,
SumMe and TVSum. Both qualitative and quantitative experiments demonstrate
significant improvements over the state-of-the-art
methods.
Comment: This manuscript has been accepted at ACM MM 2020.
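The Softmax Bottleneck refers to a single softmax limiting the rank of the distributions a model can express; mixtures of softmaxes lift that ceiling. Below is a minimal PyTorch sketch of a generic mixture-of-softmaxes attention layer in that spirit; it is not the paper's twice-self-query MoA design, and the dimensions and mixing scheme are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureOfSoftmaxAttention(nn.Module):
    """Illustrative mixture-of-softmaxes attention: K softmax heads whose
    outputs are mixed by input-dependent weights, raising the rank ceiling
    that a single softmax imposes (the 'softmax bottleneck')."""
    def __init__(self, dim, n_mix=4):
        super().__init__()
        self.queries = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_mix)])
        self.key = nn.Linear(dim, dim)
        self.mix = nn.Linear(dim, n_mix)  # prior over mixture components

    def forward(self, x):                 # x: (B, T, D) frame features
        k = self.key(x)
        pi = F.softmax(self.mix(x.mean(1)), dim=-1)         # (B, K)
        attn = 0
        for i, q_proj in enumerate(self.queries):
            q = q_proj(x)
            scores = F.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
            attn = attn + pi[:, i, None, None] * scores     # weighted mixture
        return attn @ x                    # attended frame representations
```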
Zero-Shot Sign Language Recognition: Can Textual Data Uncover Sign Languages?
We introduce the problem of zero-shot sign language recognition (ZSSLR),
where the goal is to leverage models learned over the seen sign class examples
to recognize the instances of unseen signs. To this end, we propose to utilize
the readily available descriptions in sign language dictionaries as an
intermediate-level semantic representation for knowledge transfer. We introduce
a new benchmark dataset called ASL-Text that consists of 250 sign language
classes and their accompanying textual descriptions. Compared to the ZSL
datasets in other domains (such as object recognition), our dataset contains a
limited number of training examples for a large number of classes, which
imposes a significant challenge. We propose a framework that operates over the
body and hand regions by means of 3D-CNNs, and models longer temporal
relationships via bidirectional LSTMs. By leveraging the descriptive text
embeddings along with these spatio-temporal representations within a zero-shot
learning framework, we show that textual data can indeed be useful in
uncovering sign languages. We anticipate that the introduced approach and the
accompanying dataset will provide a basis for further exploration of this new
zero-shot learning problem.
Comment: To appear in the British Machine Vision Conference (BMVC) 2019.
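A minimal sketch of the described pipeline, assuming body- and hand-region features are precomputed by 3D-CNNs: bidirectional LSTMs summarize each stream, and the fused video representation is scored against text embeddings of dictionary descriptions. All module names and dimensions are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZSSLRNet(nn.Module):
    """Illustrative ZSSLR-style model: body- and hand-stream features (assumed
    precomputed by 3D-CNNs) are summarized by bidirectional LSTMs and scored
    against text embeddings of sign-language dictionary descriptions."""
    def __init__(self, feat_dim=512, hidden=256, text_dim=300):
        super().__init__()
        self.body_lstm = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.hand_lstm = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(4 * hidden, text_dim)  # body + hand, both directions

    def forward(self, body_feats, hand_feats):       # (B, T, feat_dim) each
        b, _ = self.body_lstm(body_feats)
        h, _ = self.hand_lstm(hand_feats)
        video = torch.cat([b[:, -1], h[:, -1]], dim=-1)  # last-step summaries
        return F.normalize(self.proj(video), dim=-1)

def score_unseen(model, body, hand, text_embs):
    """Rank unseen sign classes by similarity to their description embeddings."""
    v = model(body, hand)
    return v @ F.normalize(text_embs, dim=-1).T      # (B, n_unseen_classes)
```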
ZSTAD: Zero-Shot Temporal Activity Detection
An integral part of video analysis and surveillance is temporal activity
detection, i.e., simultaneously recognizing and localizing activities in long
untrimmed videos. Currently, the most effective methods for temporal activity
detection are based on deep learning, and they typically perform very well when
trained on large-scale annotated videos. However, these methods are limited in
real applications because videos of certain activity classes may be unavailable
and data annotation is time-consuming. To solve this
challenging problem, we propose a novel task setting called zero-shot temporal
activity detection (ZSTAD), where activities that have never been seen in
training can still be detected. We design an end-to-end deep network based on
R-C3D as the architecture for this solution. The proposed network is optimized
with an innovative loss function that considers the embeddings of activity
labels and their super-classes while learning the common semantics of seen and
unseen activities. Experiments on both the THUMOS14 and the Charades datasets
show promising performance in detecting unseen activities.
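A minimal sketch of such a loss, assuming each activity label has an embedding and a known super-class: detected-segment embeddings are aligned with both the label embedding and its super-class embedding, so semantics shared across seen and unseen activities are learned. The names and weighting here are assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def zstad_style_loss(seg_emb, labels, class_embs, super_of, super_embs,
                     alpha=0.5):
    """Illustrative loss in the spirit of ZSTAD: align segment embeddings
    with their activity-label embedding and with the embedding of that
    label's super-class. `super_of` maps each class index to its super-class
    index; `alpha` weights the super-class term (both are assumptions)."""
    seg = F.normalize(seg_emb, dim=-1)
    cls_logits = seg @ F.normalize(class_embs, dim=-1).T      # (B, C)
    sup_logits = seg @ F.normalize(super_embs, dim=-1).T      # (B, S)
    loss_cls = F.cross_entropy(cls_logits, labels)
    loss_sup = F.cross_entropy(sup_logits, super_of[labels])  # super-class targets
    return loss_cls + alpha * loss_sup
```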
CLASTER: Clustering with Reinforcement Learning for Zero-Shot Action Recognition
Zero-shot action recognition is the task of recognizing action classes
without visual examples, using only a semantic embedding that relates unseen
to seen classes. The problem can be seen as learning a function that generalizes
well to instances of unseen classes without losing discrimination between
classes. Neural networks can model the complex boundaries between visual
classes, which explains their success as supervised models. However, in
zero-shot learning, these highly specialized class boundaries may not transfer
well from seen to unseen classes. In this paper, we propose a clustering-based
model, which considers all training samples at once, instead of optimizing for
each instance individually. We optimize the clustering using reinforcement
learning, which we show is critical for our approach to work. We call the
proposed method CLASTER and observe that it consistently improves over the
state-of-the-art on all standard datasets (UCF101, HMDB51, and Olympic Sports),
in both the standard zero-shot evaluation and the generalized zero-shot
learning setting.
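A minimal sketch of optimizing a clustering with REINFORCE in that spirit: cluster assignment acts as a stochastic policy, classification correctness as the reward, and the centroids are updated by the policy gradient. This is a heavy simplification under assumed details, not the CLASTER implementation.

```python
import torch
import torch.nn.functional as F

def claster_style_step(feats, labels, centroids, class_embs, lr=0.1):
    """Illustrative REINFORCE step: sample a cluster per sample from a
    distance-based policy, represent the sample by its centroid, score
    against class embeddings, and reward correct predictions."""
    centroids = centroids.clone().requires_grad_(True)
    logp = F.log_softmax(-torch.cdist(feats, centroids), dim=-1)  # policy over clusters
    dist = torch.distributions.Categorical(logits=logp)
    k = dist.sample()                                             # sampled cluster per sample
    rep = centroids[k]                                            # cluster-based representation
    logits = F.normalize(rep, dim=-1) @ F.normalize(class_embs, dim=-1).T
    reward = (logits.argmax(-1) == labels).float() * 2 - 1        # +1 correct, -1 wrong
    loss = -(reward.detach() * dist.log_prob(k)).mean()           # REINFORCE objective
    loss.backward()
    return (centroids - lr * centroids.grad).detach()             # updated centroids
```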