A Generative Approach to Zero-Shot and Few-Shot Action Recognition
We present a generative framework for zero-shot action recognition where some
of the possible action classes do not occur in the training data. Our approach
is based on modeling each action class using a probability distribution whose
parameters are functions of the attribute vector representing that action
class. In particular, we assume that the distribution parameters for any action
class in the visual space can be expressed as a linear combination of a set of
basis vectors where the combination weights are given by the attributes of the
action class. These basis vectors can be learned solely using labeled data from
the known (i.e., previously seen) action classes, and can then be used to
predict the parameters of the probability distributions of unseen action
classes. We consider two settings: (1) Inductive setting, where we use only the
labeled examples of the seen action classes to predict the unseen action class
parameters; and (2) Transductive setting, which further leverages unlabeled data
from the unseen action classes. Our framework also naturally extends to
few-shot action recognition where a few labeled examples from unseen classes
are available. Our experiments on benchmark datasets (UCF101, HMDB51 and
Olympic) show significant performance improvements as compared to various
baselines, in both standard zero-shot (disjoint seen and unseen classes) and
generalized zero-shot learning settings. Comment: Accepted in WACV 201
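To make the modeling assumption above concrete, the following is a minimal illustrative sketch (not the authors' released code) of the inductive setting, assuming each class is summarized by a Gaussian mean in visual-feature space; array shapes and function names are hypothetical.

import numpy as np

def fit_basis(seen_means, seen_attrs):
    """seen_means: (n_seen, d) per-class means in visual space; seen_attrs: (n_seen, k) attribute vectors.
    Solve seen_means ~= seen_attrs @ B.T for the basis matrix B (d, k) by least squares."""
    B_T, *_ = np.linalg.lstsq(seen_attrs, seen_means, rcond=None)
    return B_T.T

def predict_unseen_means(B, unseen_attrs):
    """Predicted visual-space means for unseen classes: a linear combination of the
    learned basis vectors, weighted by each class's attributes."""
    return unseen_attrs @ B.T

def classify(x, unseen_means):
    """Assign a test feature x (d,) to the unseen class with the closest predicted mean."""
    return int(np.argmin(np.linalg.norm(unseen_means - x, axis=1)))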
Cross-modal Hallucination for Few-shot Fine-grained Recognition
State-of-the-art deep learning algorithms generally require large amounts of
data for model training. Lack thereof can severely deteriorate the performance,
particularly in scenarios with fine-grained boundaries between categories. To
this end, we propose a multimodal approach that facilitates bridging the
information gap by means of meaningful joint embeddings. Specifically, we
present a benchmark that is multimodal during training (i.e. images and texts)
and single-modal at testing time (i.e. images), with the associated task of
utilizing multimodal data from base classes (with many samples) to learn
explicit visual classifiers for novel classes (with few samples). Next, we propose a
framework built upon the idea of cross-modal data hallucination. In this
regard, we introduce a discriminative text-conditional GAN for sample
generation with a simple self-paced strategy for sample selection. We show the
results of our proposed discriminative hallucination method for 1-, 2-, and
5-shot learning on the CUB dataset, where the accuracy is improved by employing
multimodal data. Comment: CVPR 2018 Workshop on Fine-Grained Visual Categorization
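As a rough illustration of the two ingredients named above, a text-conditional generator and self-paced sample selection, here is a hedged PyTorch-style sketch; the dimensions, module names, and the confidence-based selection rule are assumptions, not the paper's exact formulation.

import torch
import torch.nn as nn

class TextConditionalGenerator(nn.Module):
    """Maps (noise, text embedding) to a hallucinated visual feature."""
    def __init__(self, noise_dim=100, text_dim=300, feat_dim=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + text_dim, 1024), nn.ReLU(),
            nn.Linear(1024, feat_dim))

    def forward(self, z, t):
        return self.net(torch.cat([z, t], dim=1))

def self_paced_select(features, labels, classifier, keep_ratio=0.5):
    """Keep only the hallucinated samples whose true-class probability under the
    current classifier is highest. labels is a LongTensor of class indices."""
    with torch.no_grad():
        probs = classifier(features).softmax(dim=1)
        conf = probs.gather(1, labels.unsqueeze(1)).squeeze(1)
    k = max(1, int(keep_ratio * len(conf)))
    idx = conf.topk(k).indices
    return features[idx], labels[idx]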
Unified Generator-Classifier for Efficient Zero-Shot Learning
Generative models have achieved state-of-the-art performance for the
zero-shot learning problem, but they require re-training the classifier every
time a new object category is encountered. The traditional semantic embedding
approaches, though very elegant, usually do not perform on par with their
generative counterparts. In this work, we propose a unified framework termed
GenClass, which integrates the generator with the classifier for efficient
zero-shot learning, thus combining the representative power of the generative
approaches and the elegance of the embedding approaches. End-to-end training of
the unified framework not only eliminates the requirement of an additional
classifier for new object categories, as in the generative approaches, but also
facilitates the generation of more discriminative and useful features.
Extensive evaluation on three standard zero-shot object classification
datasets, namely AWA, CUB and SUN, shows the effectiveness of the proposed
approach. The approach, without any modification, also gives state-of-the-art
performance for zero-shot action classification, thus showing its
generalizability to other domains. Comment: 4 pages
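One way to picture what "no classifier retraining" buys is sketched below: if class scores are computed directly against attribute-derived class representatives produced by a jointly trained network, adding a new category only requires its attribute vector. This is an illustrative reading of the abstract, not the GenClass architecture itself; all names and dimensions are made up.

import torch
import torch.nn as nn

class AttributeToPrototype(nn.Module):
    """Maps a class attribute vector to a representative point in feature space."""
    def __init__(self, attr_dim=85, feat_dim=2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(attr_dim, 1024), nn.ReLU(),
                                 nn.Linear(1024, feat_dim))

    def forward(self, attrs):          # attrs: (num_classes, attr_dim)
        return self.net(attrs)         # (num_classes, feat_dim)

def classify(features, class_attrs, proto_net):
    """Score each feature against every class's generated representative;
    a new class needs only a new row in class_attrs, no classifier retraining."""
    protos = proto_net(class_attrs)    # (C, d)
    logits = features @ protos.t()     # (N, C) dot-product scores
    return logits.argmax(dim=1)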
Generative Model for Zero-Shot Sketch-Based Image Retrieval
We present a probabilistic model for Sketch-Based Image Retrieval (SBIR)
where, at retrieval time, we are given sketches from novel classes that were
not present at training time. Existing SBIR methods, most of which rely on
learning class-wise correspondences between sketches and images, typically work
well only for previously seen sketch classes, and result in poor retrieval
performance on novel classes. To address this, we propose a generative model
that learns to generate images, conditioned on a given novel class sketch. This
enables us to reduce the SBIR problem to a standard image-to-image search
problem. Our model is based on an inverse auto-regressive flow based
variational autoencoder, with a feedback mechanism to ensure robust image
generation. We evaluate our model on two very challenging datasets, Sketchy and
TU Berlin, with a novel train-test split. The proposed approach significantly
outperforms various baselines on both datasets. Comment: Accepted at CVPR-Workshop 201
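The reduction described above, generate an image-like feature from a novel-class sketch and then run a standard nearest-neighbour image search, can be sketched as follows; the generator is treated as a black box and all names are illustrative.

import numpy as np

def retrieve(sketch_feat, generator, gallery_feats, top_k=5):
    """sketch_feat: feature of a query sketch from an unseen class.
    generator: callable mapping a sketch feature to a synthetic image feature
    (e.g. the conditional decoder described above), treated here as given.
    gallery_feats: (N, d) features of the real images to be searched."""
    query = generator(sketch_feat)                        # synthetic image feature (d,)
    dists = np.linalg.norm(gallery_feats - query, axis=1)
    return np.argsort(dists)[:top_k]                      # indices of the closest gallery images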
Skeleton based Zero Shot Action Recognition in Joint Pose-Language Semantic Space
How does one represent an action? How does one describe an action that we
have never seen before? Such questions are addressed by the Zero Shot Learning
paradigm, where a model is trained on only a subset of classes and is evaluated
on its ability to correctly classify an example from a class it has never seen
before. In this work, we present a body pose based zero shot action recognition
network and demonstrate its performance on the NTU RGB-D dataset. Our model
learns to jointly encapsulate visual similarities based on pose features of the
action performer as well as similarities in the natural language descriptions
of the unseen action class names. We demonstrate how this pose-language
semantic space encodes knowledge which allows our model to correctly predict
actions not seen during training.
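A minimal sketch of zero-shot prediction in a joint pose-language space, assuming a pose encoder and a language encoder that have already been trained to agree on seen classes; the cosine-similarity decision rule and all names are assumptions for illustration.

import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def zero_shot_predict(pose_embedding, class_name_embeddings):
    """pose_embedding: embedding of a skeleton sequence (d,).
    class_name_embeddings: dict mapping unseen class name -> language embedding (d,).
    Returns the class whose description lies closest in the shared space."""
    return max(class_name_embeddings,
               key=lambda name: cosine(pose_embedding, class_name_embeddings[name]))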
Unifying Few- and Zero-Shot Egocentric Action Recognition
Although there has been significant research in egocentric action
recognition, most methods and tasks, including EPIC-KITCHENS, assume a fixed
set of action classes. Fixed-set classification is useful for benchmarking
methods, but is often unrealistic in practical settings due to the
compositionality of actions, resulting in a functionally infinite-cardinality
label set. In this work, we explore generalization with an open set of classes
by unifying two popular approaches: few- and zero-shot generalization (the
latter of which we reframe as cross-modal few-shot generalization). We propose a
new set of splits derived from the EPIC-KITCHENS dataset that allow evaluation
of open-set classification, and use these splits to show that adding a
metric-learning loss to the conventional direct-alignment baseline can improve
zero-shot classification by as much as 10%, while not sacrificing few-shot
performance. Comment: Accepted for presentation at the EPIC@CVPR2020 workshop
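The reported gain comes from adding a metric-learning term to a direct-alignment objective. A hedged PyTorch-style sketch of such a combined loss follows; the specific triplet loss, the weighting, and the MSE alignment term are assumptions rather than the paper's exact recipe.

import torch
import torch.nn.functional as F

def combined_loss(video_emb, text_emb, anchor, positive, negative, alpha=0.5):
    """Direct alignment pulls each clip embedding towards the text embedding of its
    class; the triplet term additionally shapes the embedding space metrically."""
    align = F.mse_loss(video_emb, text_emb)
    metric = F.triplet_margin_loss(anchor, positive, negative, margin=0.2)
    return align + alpha * metric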
ProtoGAN: Towards Few Shot Learning for Action Recognition
Few-shot learning (FSL) for action recognition is a challenging task of
recognizing novel action categories which are represented by few instances in
the training data. In a more generalized FSL setting (G-FSL), both seen and
novel action categories need to be recognized. Conventional classifiers suffer
due to inadequate data in the FSL setting and an inherent bias towards seen
action categories in the G-FSL setting. In this paper, we address this problem by
proposing a novel ProtoGAN framework which synthesizes additional examples for
novel categories by conditioning a conditional generative adversarial network
with class prototype vectors. These class prototype vectors are learnt using a
Class Prototype Transfer Network (CPTN) from examples of seen categories. Our
synthesized examples for a novel class are semantically similar to real
examples belonging to that class and are used to train a model exhibiting better
generalization towards novel classes. We support our claim by performing
extensive experiments on three datasets: UCF101, HMDB51 and Olympic-Sports. To
the best of our knowledge, we are the first to report the results for G-FSL and
provide a strong benchmark for future research. We also outperform the
state-of-the-art method in FSL for all the aforementioned datasets. Comment: 9 pages, 5 tables, 2 figures. To appear in the proceedings of ICCV Workshop 201
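A rough sketch of the conditioning step described above: a prototype is derived from the few available examples of a novel class and fed, together with noise, to a conditional generator that synthesizes additional features. The mean-based prototype below is a stand-in for the learned Class Prototype Transfer Network, and all module names and dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, noise_dim=64, proto_dim=2048, feat_dim=2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(noise_dim + proto_dim, 1024), nn.ReLU(),
                                 nn.Linear(1024, feat_dim))

    def forward(self, z, prototype):
        cond = prototype.expand(z.size(0), -1)            # repeat prototype per sample
        return self.net(torch.cat([z, cond], dim=1))

def synthesize_for_novel_class(support_feats, generator, num_samples=100):
    """support_feats: (k, proto_dim) features of the k labelled examples of a novel class.
    Here the prototype is simply their mean (a simplification of the learned CPTN)."""
    prototype = support_feats.mean(dim=0, keepdim=True)   # (1, proto_dim)
    z = torch.randn(num_samples, 64)
    return generator(z, prototype)                        # (num_samples, feat_dim)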
SL-DML: Signal Level Deep Metric Learning for Multimodal One-Shot Action Recognition
Recognizing an activity with a single reference sample using metric learning
approaches is a promising research field. The majority of few-shot methods
focus on object recognition or face-identification. We propose a metric
learning approach to reduce the action recognition problem to a nearest
neighbor search in embedding space. We encode signals into images and extract
features using a deep residual CNN. Using triplet loss, we learn a feature
embedding. The resulting encoder transforms features into an embedding space in
which smaller distances correspond to similar actions and larger distances to
different actions. Our approach is based on a signal-level formulation and
remains flexible across a variety of modalities. It further outperforms the
baseline on the large scale NTU RGB+D 120 dataset for the One-Shot action
recognition protocol by 5.6%. With just 60% of the training data, our approach
still outperforms the baseline approach by 3.7%. With 40% of the training data,
our approach performs comparably to the second-best approach. Further, we show
that our approach generalizes well in experiments on the UTD-MHAD dataset for
inertial, skeleton and fused data and the Simitate dataset for motion capturing
data. Furthermore, our inter-joint and inter-sensor experiments suggest good
capabilities on previously unseen setups. Comment: 8 pages, 6 figures, 7 tables
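A compact sketch of the pipeline described above: encode a multimodal signal as an image-like tensor, embed it with a CNN trained under a triplet loss, and classify a query by nearest-neighbour search against one reference embedding per class. The naive min-max signal-to-image encoding and all names are simplified assumptions, not the paper's exact encoding.

import numpy as np

def signal_to_image(signal, height=64, width=64):
    """signal: (T, C) multimodal time series (e.g. flattened skeleton joints).
    Min-max scale to [0, 1] and resample to a fixed-size single-channel 'image'."""
    rng = signal.max() - signal.min()
    sig = (signal - signal.min()) / (rng + 1e-8)
    rows = np.linspace(0, sig.shape[0] - 1, height).astype(int)
    cols = np.linspace(0, sig.shape[1] - 1, width).astype(int)
    return sig[np.ix_(rows, cols)]                        # (height, width)

def one_shot_classify(query_emb, reference_embs):
    """reference_embs: dict class_name -> embedding of the single reference sample,
    produced by a CNN encoder trained with a triplet loss (not shown here)."""
    return min(reference_embs,
               key=lambda c: np.linalg.norm(query_emb - reference_embs[c]))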
Similarity R-C3D for Few-shot Temporal Activity Detection
Many activities of interest are rare events, with only a few labeled examples
available. Therefore, models for temporal activity detection that are able to
learn from a few examples are desirable. In this paper, we present a
conceptually simple and general yet novel framework for few-shot temporal
activity detection which detects the start and end time of the few-shot input
activities in an untrimmed video. Our model is end-to-end trainable and can
benefit from more few-shot examples. At test time, each proposal is assigned
the label of the few-shot activity class corresponding to the maximum
similarity score. Our Similarity R-C3D method outperforms previous work on
three large-scale benchmarks for temporal activity detection (THUMOS14,
ActivityNet1.2, and ActivityNet1.3 datasets) in the few-shot setting. Our code
will be made available.
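The test-time assignment rule mentioned above can be pictured as follows: each temporal proposal's feature is compared with the features of the few-shot examples, and the proposal takes the label of the most similar class. The cosine similarity and the averaging over support examples are illustrative assumptions.

import numpy as np

def label_proposal(proposal_feat, support_sets):
    """proposal_feat: (d,) feature of one temporal proposal from the untrimmed video.
    support_sets: dict class_name -> (k, d) features of that class's few-shot examples.
    Returns (best_class, best_score) using the maximum mean cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    scores = {c: float(np.mean([cos(proposal_feat, s) for s in feats]))
              for c, feats in support_sets.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]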
Revisiting Few-shot Activity Detection with Class Similarity Control
Many interesting events in the real world are rare, and pre-annotated,
machine-learning-ready videos are consequently also rare. Thus, temporal activity
detection models that are able to learn from a few examples are desirable. In
this paper, we present a conceptually simple and general yet novel framework
for few-shot temporal activity detection based on proposal regression which
detects the start and end time of the activities in untrimmed videos. Our model
is end-to-end trainable, takes into account the frame rate differences between
few-shot activities and untrimmed test videos, and can benefit from additional
few-shot examples. We experiment on three large scale benchmarks for temporal
activity detection (ActivityNet1.2, ActivityNet1.3 and THUMOS14 datasets) in a
few-shot setting. We also study how performance is affected by different
amounts of overlap between the few-shot activities and those used to pretrain
the video classification backbone, and propose corrective measures for future
work in this domain. Our code will be made available.
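One concrete way to account for the frame-rate differences mentioned above is to resample the per-frame features of the few-shot activities onto the frame rate of the untrimmed test videos before computing similarities; the linear-interpolation resampling below is an illustrative assumption, not the paper's exact mechanism.

import numpy as np

def resample_features(feats, src_fps, dst_fps):
    """feats: (T, d) per-frame features extracted at src_fps.
    Returns the features linearly interpolated onto a dst_fps time grid."""
    duration = feats.shape[0] / src_fps                   # clip length in seconds
    t_src = np.arange(feats.shape[0]) / src_fps
    t_dst = np.arange(int(round(duration * dst_fps))) / dst_fps
    return np.stack([np.interp(t_dst, t_src, feats[:, j])
                     for j in range(feats.shape[1])], axis=1)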