Action2Vec: A Crossmodal Embedding Approach to Action Learning
We describe a novel cross-modal embedding space for actions, named
Action2Vec, which combines linguistic cues from class labels with
spatio-temporal features derived from video clips. Our approach uses a
hierarchical recurrent network to capture the temporal structure of video
features. We train our embedding using a joint loss that combines
classification accuracy with similarity to Word2Vec semantics. We evaluate
Action2Vec by performing zero-shot action recognition and obtain state-of-the-art
results on three standard datasets. In addition, we present two novel
analogy tests which quantify the extent to which our joint embedding captures
distributional semantics. This is the first joint embedding space to combine
verbs and action videos, and the first to be thoroughly evaluated with respect
to its distributional semantics.
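To make the joint objective concrete, here is a minimal PyTorch sketch of a loss of this shape. It is an illustration, not the authors' implementation: the weight alpha and the cosine form of the semantic term are assumptions.

```python
# Minimal sketch of a joint loss combining classification with
# similarity to Word2Vec label vectors (illustrative, not the
# paper's exact formulation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingLoss(nn.Module):
    def __init__(self, alpha: float = 0.5):
        super().__init__()
        self.alpha = alpha  # assumed trade-off weight between terms

    def forward(self, video_emb, logits, labels, word2vec_emb):
        # Classification term over the action classes.
        cls_loss = F.cross_entropy(logits, labels)
        # Semantic term: pull each clip embedding toward the
        # Word2Vec vector of its ground-truth label.
        sem_loss = 1.0 - F.cosine_similarity(video_emb, word2vec_emb).mean()
        return cls_loss + self.alpha * sem_loss
```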
Recent Advances in Transfer Learning for Cross-Dataset Visual Recognition: A Problem-Oriented Perspective
This paper takes a problem-oriented perspective and presents a comprehensive
review of transfer learning methods, both shallow and deep, for cross-dataset
visual recognition. Specifically, it categorises cross-dataset recognition
into seventeen problems based on a set of carefully chosen data and label
attributes. Such a problem-oriented taxonomy has allowed us to examine how
different transfer learning approaches tackle each problem and how well each
problem has been researched to date. This comprehensive problem-oriented review
of advances in transfer learning has revealed not only the challenges in
transfer learning for visual recognition, but also the problems (eight of the
seventeen) that have been scarcely studied. This survey not only presents an
up-to-date technical review for researchers, but also offers a systematic
approach and a reference for machine learning practitioners to categorise a
real problem and look up a possible solution accordingly.
Zero-Shot Visual Recognition using Semantics-Preserving Adversarial Embedding Networks
We propose a novel framework called Semantics-Preserving Adversarial
Embedding Network (SP-AEN) for zero-shot visual recognition (ZSL), where test
images and their classes are both unseen during training. SP-AEN aims to tackle
the inherent problem of semantic loss in the prevailing family of
embedding-based ZSL, where some semantics would be discarded during training if
they are non-discriminative for training classes, but could become critical for
recognizing test classes. Specifically, SP-AEN prevents the semantic loss by
introducing an independent visual-to-semantic space embedder which disentangles
the semantic space into two subspaces for the two arguably conflicting
objectives: classification and reconstruction. Through adversarial learning of
the two subspaces, SP-AEN can transfer semantics from the reconstructive
subspace to the discriminative one, improving zero-shot recognition of unseen
classes. Compared with prior work, SP-AEN not only improves classification but
also generates photo-realistic images, demonstrating the effectiveness of
semantic preservation. On four popular benchmarks: CUB, AWA, SUN and aPY,
SP-AEN considerably outperforms other state-of-the-art methods by absolute
differences of 12.2%, 9.3%, 4.0%, and 3.6% in harmonic mean value.
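As a rough illustration of the two-subspace idea, the following sketch pairs a discriminative embedder with an independent reconstructive one and a single adversarial alignment term. The module shapes, prototype-based classifier, and loss weighting are simplifying assumptions, not the SP-AEN architecture.

```python
# Minimal sketch of disentangling a semantic space into a
# discriminative and a reconstructive subspace with an adversarial
# alignment term (simplified; not the actual SP-AEN network).
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, sem_dim = 2048, 300        # assumed CNN-feature / semantic dims

enc_cls = nn.Linear(feat_dim, sem_dim)   # discriminative embedder
enc_rec = nn.Linear(feat_dim, sem_dim)   # independent reconstructive embedder
decoder = nn.Linear(sem_dim, feat_dim)   # stand-in for the image decoder
disc = nn.Linear(sem_dim, 1)             # subspace discriminator

def sp_aen_losses(x, class_protos, labels):
    z_cls, z_rec = enc_cls(x), enc_rec(x)
    # Classification: compatibility with class semantic prototypes.
    cls_loss = F.cross_entropy(z_cls @ class_protos.t(), labels)
    # Reconstruction keeps semantics that classification would discard.
    rec_loss = F.mse_loss(decoder(z_rec), x)
    # Embedder-side adversarial term: make the discriminative
    # embedding look like a reconstructive one (the discriminator
    # itself would be trained with the opposite targets).
    adv_loss = F.binary_cross_entropy_with_logits(
        disc(z_cls), torch.ones(x.size(0), 1))
    return cls_loss + rec_loss + adv_loss
```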
Deep Multiple Instance Learning for Zero-shot Image Tagging
In line with the success of deep learning on traditional recognition problems,
several end-to-end deep models for zero-shot recognition have been proposed in
the literature. These models successfully predict a single unseen label
given an input image, but do not scale to cases where multiple unseen objects
are present. In this paper, we model this problem within the framework of
Multiple Instance Learning (MIL). To the best of our knowledge, we propose the
first end-to-end trainable deep MIL framework for the multi-label zero-shot
tagging problem. Due to its novel design, the proposed framework has several
interesting features: (1) Unlike previous deep MIL models, it does not use any
off-line procedure (e.g., Selective Search or EdgeBoxes) for bag generation.
(2) During test time, it can process any number of unseen labels given their
semantic embedding vectors. (3) Using only seen labels per image as weak
annotation, it can produce a bounding box for each predicted label. We
experiment with the NUS-WIDE dataset and achieve superior performance across
conventional, zero-shot and generalized zero-shot tagging tasks.
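The MIL scoring pattern described above can be sketched briefly: instances in a bag are projected into the word-vector space, scored against label embeddings, and max-pooled into image-level tag scores. The linear projection and the dimensions are assumptions; bag generation and box localization are omitted.

```python
# Minimal sketch of multi-label zero-shot tagging with MIL pooling:
# each image is a bag of instance features; every instance is scored
# against label embeddings, and a max over instances gives the
# image-level score (shapes are illustrative assumptions).
import torch
import torch.nn as nn

class MILZeroShotTagger(nn.Module):
    def __init__(self, feat_dim=2048, sem_dim=300):
        super().__init__()
        self.proj = nn.Linear(feat_dim, sem_dim)  # visual -> semantic space

    def forward(self, bags, label_embs):
        # bags: (B, num_instances, feat_dim); label_embs: (L, sem_dim).
        inst = self.proj(bags)                    # (B, N, sem_dim)
        scores = inst @ label_embs.t()            # (B, N, L) instance scores
        return scores.max(dim=1).values           # (B, L) image-level tags

# Unseen labels only need their embedding vectors at test time:
# tagger(bags, unseen_label_embs) works without retraining.
```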
Detecting Human-Object Interactions via Functional Generalization
We present an approach for detecting human-object interactions (HOIs) in
images, based on the idea that humans interact with functionally similar
objects in a similar manner. The proposed model is simple and efficiently uses
the data, visual features of the human, relative spatial orientation of the
human and the object, and the knowledge that functionally similar objects take
part in similar interactions with humans. We provide extensive experimental
validation for our approach and demonstrate state-of-the-art results for HOI
detection. On the HICO-Det dataset our method achieves a gain of over 2.5%
absolute points in mean average precision (mAP) over the state of the art. We
also show that our approach leads to significant performance gains for
zero-shot HOI detection in the seen-object setting. We further demonstrate
that, using a generic object detector, our model can generalize to
interactions involving previously unseen objects.
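To illustrate functional generalization, the sketch below duplicates an HOI training example with word vectors of functionally similar objects; cosine k-nearest neighbours in word-vector space is an assumed stand-in for the paper's similarity notion.

```python
# Minimal sketch of the functional-generalization idea: an HOI
# example's object word vector is swapped with vectors of similar
# objects, so the model sees "ride horse" evidence when learning
# "ride zebra" (plain cosine k-NN is an assumption here).
import torch
import torch.nn.functional as F

def functional_neighbours(obj_vec, vocab_vecs, k=5):
    sims = F.cosine_similarity(obj_vec.unsqueeze(0), vocab_vecs)
    return vocab_vecs[sims.topk(k).indices]      # (k, dim) similar objects

def augmented_examples(human_feat, spatial_feat, obj_vec, vocab_vecs, k=5):
    # Yield the original example plus k functionally generalized copies.
    for vec in [obj_vec, *functional_neighbours(obj_vec, vocab_vecs, k)]:
        yield torch.cat([human_feat, spatial_feat, vec])
```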
Unified Generator-Classifier for Efficient Zero-Shot Learning
Generative models have achieved state-of-the-art performance for the
zero-shot learning problem, but they require re-training the classifier every
time a new object category is encountered. The traditional semantic embedding
approaches, though very elegant, usually do not perform on par with their
generative counterparts. In this work, we propose a unified framework termed
GenClass, which integrates the generator with the classifier for efficient
zero-shot learning, thus combining the representative power of the generative
approaches and the elegance of the embedding approaches. End-to-end training of
the unified framework not only eliminates the need to train an additional
classifier for new object categories, as in the generative approaches, but also
facilitates the generation of more discriminative and useful features.
Extensive evaluation on three standard zero-shot object classification
datasets, namely AWA, CUB and SUN, shows the effectiveness of the proposed
approach. Without any modification, the approach also gives state-of-the-art
performance for zero-shot action classification, thus showing its
generalizability to other domains.
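The generator-classifier coupling might look roughly like this sketch, in which one compatibility scorer classifies both real and generated features against class embeddings; the bilinear scorer and all dimensions are assumptions, not the GenClass design.

```python
# Minimal sketch of a unified generator-classifier: a conditional
# generator synthesizes features from class embeddings, and one
# compatibility scorer classifies both real and generated features
# against class embeddings, so no new per-class classifier is
# trained when a category arrives (shapes are assumptions).
import torch
import torch.nn as nn

class GenClass(nn.Module):
    def __init__(self, sem_dim=300, noise_dim=64, feat_dim=2048):
        super().__init__()
        self.noise_dim = noise_dim
        self.gen = nn.Sequential(
            nn.Linear(sem_dim + noise_dim, feat_dim), nn.ReLU())
        self.compat = nn.Bilinear(feat_dim, sem_dim, 1)  # joint scorer

    def generate(self, class_emb):
        z = torch.randn(class_emb.size(0), self.noise_dim)
        return self.gen(torch.cat([class_emb, z], dim=1))

    def classify(self, feats, class_embs):
        # Score every feature against every class embedding.
        B, C = feats.size(0), class_embs.size(0)
        f = feats.repeat_interleave(C, dim=0)
        e = class_embs.repeat(B, 1)
        return self.compat(f, e).view(B, C)
```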
Unsupervised Meta-Learning For Few-Shot Image Classification
Few-shot or one-shot learning of classifiers requires a significant inductive
bias towards the type of task to be learned. One way to acquire this is by
meta-learning on tasks similar to the target task. In this paper, we propose
UMTRA, an algorithm that performs unsupervised, model-agnostic meta-learning
for classification tasks. The meta-learning step of UMTRA is performed on a
flat collection of unlabeled images. While we assume that these images can be
grouped into a diverse set of classes and are relevant to the target task, no
explicit class information or labels are needed. UMTRA uses
random sampling and augmentation to create synthetic training tasks for the
meta-learning phase. Labels are only needed at the final target-task learning
step, and as little as one sample per class can suffice. On the Omniglot and
Mini-Imagenet few-shot learning benchmarks, UMTRA outperforms every tested
approach based on unsupervised learning of representations, alternating with
the recent CACTUs algorithm for the best performance. Compared to supervised
model-agnostic meta-learning approaches, UMTRA trades some classification
accuracy for a reduction of several orders of magnitude in the number of
required labels.
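The synthetic-task construction is simple enough to sketch directly. The following is an illustrative reading, with one support sample per synthetic class, an augmented copy as the query, and the augmentation function left as a placeholder.

```python
# Minimal sketch of UMTRA-style synthetic task creation from a flat
# collection of unlabeled images: each sampled image becomes its own
# synthetic class; the raw image is the support example and an
# augmented copy is the query (augment() is a placeholder).
import random
import torch

def make_synthetic_task(unlabeled_images, n_way, augment):
    # Sample N distinct images; with a diverse pool they likely
    # belong to N different true classes.
    support = random.sample(unlabeled_images, n_way)
    labels = torch.arange(n_way)
    query = [augment(img) for img in support]  # same synthetic labels
    return (support, labels), (query, labels)

# A model-agnostic meta-learner such as MAML can then meta-train on
# a stream of these tasks exactly as on supervised ones.
```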
Sherlock: Scalable Fact Learning in Images
We study scalable and uniform understanding of facts in images. Existing
visual recognition systems are typically modeled differently for each fact type
such as objects, actions, and interactions. We propose a setting where all
these facts can be modeled simultaneously with a capacity to understand
an unbounded number of facts in a structured way. The training data comes as
structured facts in images, including (1) objects (e.g., <boy>), (2)
attributes (e.g., <boy, tall>), (3) actions (e.g., <boy, playing>), and (4)
interactions (e.g., <boy, riding, a horse>). Each fact has a semantic
language view (e.g., <boy, playing>) and a visual view (an image with this
fact). We show that learning visual facts in a structured way enables not only
a uniform but also generalizable visual understanding. We propose and
investigate recent and strong approaches from the multiview learning literature
and also introduce two representation learning models as potential baselines.
We applied the investigated methods to several datasets that we augmented with
structured facts, as well as a large-scale dataset of more than 202,000 facts
and 814,000 images. Our experiments show the advantage of relating facts
through their structure: the proposed models outperform the designed baselines
on bidirectional fact retrieval.
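Once both views live in a shared embedding space, bidirectional fact retrieval reduces to nearest-neighbour search; the sketch below assumes such a space exists and is not tied to any particular multiview model from the paper.

```python
# Minimal sketch of bidirectional fact retrieval: language and
# visual views embedded in one shared space are matched by cosine
# similarity, in either direction (image->fact or fact->image).
import torch
import torch.nn.functional as F

def retrieve(query_emb, candidate_embs, top_k=5):
    sims = F.cosine_similarity(query_emb.unsqueeze(0), candidate_embs)
    return sims.topk(top_k).indices  # indices of best-matching facts
```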
Learning Spatiotemporal Features via Video and Text Pair Discrimination
Current video representations heavily rely on learning from manually
annotated video datasets, which are time-consuming and expensive to acquire. We
observe that videos are naturally accompanied by abundant text information such as
YouTube titles and Instagram captions. In this paper, we leverage this
visual-textual connection to learn spatiotemporal features in an efficient
weakly-supervised manner. We present a general cross-modal pair discrimination
(CPD) framework to capture this correlation between a video and its associated
text. Specifically, we adopt noise-contrastive estimation to tackle the
computational issue imposed by the huge number of pair-instance classes, and
design a practical curriculum learning strategy. We train our CPD models on
both a standard video dataset (Kinetics-210k) and an uncurated web video
dataset (Instagram-300k) to demonstrate its effectiveness. Without further
fine-tuning,
the learnt models obtain competitive results for action classification on
Kinetics under the linear classification protocol. Moreover, our visual model
provides an effective initialization to fine-tune on downstream tasks, which
yields a remarkable performance gain for action recognition on UCF101 and
HMDB51, compared with the existing state-of-the-art self-supervised training
methods. In addition, our CPD model yields a new state of the art for zero-shot
action recognition on UCF101 by directly utilizing the learnt visual-textual
embeddings. The code will be made available at
https://github.com/MCG-NJU/CPD-Video.
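The pair-discrimination objective is close in spirit to now-standard contrastive losses. The sketch below uses an in-batch InfoNCE form as an approximation of the paper's noise-contrastive estimation; the temperature value is an assumption.

```python
# Minimal sketch of cross-modal pair discrimination: matched
# video-text pairs in a batch are positives, all other pairings are
# negatives (in-batch InfoNCE approximation of the paper's NCE).
import torch
import torch.nn.functional as F

def cpd_loss(video_emb, text_emb, temperature=0.07):
    v = F.normalize(video_emb, dim=1)            # (B, D)
    t = F.normalize(text_emb, dim=1)             # (B, D)
    logits = v @ t.t() / temperature             # (B, B) pair scores
    targets = torch.arange(v.size(0))            # diagonal = true pairs
    # Symmetric contrastive loss over both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```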
Integrating Local Material Recognition with Large-Scale Perceptual Attribute Discovery
Material attributes have been shown to provide a discriminative intermediate
representation for recognizing materials, especially for the challenging task
of recognition from local material appearance (i.e., regardless of object and
scene context). In the past, however, material attributes have been recognized
separately preceding category recognition. In contrast, neuroscience studies on
material perception and computer vision research on object and place
recognition have shown that attributes are produced as a by-product during the
category recognition process. Does the same hold true for material attribute
and category recognition? In this paper, we introduce a novel material category
recognition network architecture to show that perceptual attributes can, in
fact, be automatically discovered inside a local material recognition
framework. The novel material-attribute-category convolutional neural network
(MAC-CNN) produces perceptual material attributes from the intermediate pooling
layers of an end-to-end trained category recognition network using an auxiliary
loss function that encodes human material perception. To train this model, we
introduce a novel large-scale database of local material appearance, organized
under a canonical material category taxonomy, with careful image patch
extraction that avoids unwanted object and scene context. We show that the
discovered
attributes correspond well with semantically-meaningful visual material traits
via Boolean algebra, and enable recognition of previously unseen material
categories given only a few examples. These results have strong implications
for how perceptually meaningful attributes can be learned in other recognition
tasks.
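The auxiliary-loss construction could be sketched as follows; the head design, the tap points, and the weight beta are assumptions rather than MAC-CNN specifics.

```python
# Minimal sketch of the MAC-CNN idea: small auxiliary heads attached
# to intermediate pooled features predict material attributes, and
# their loss is added to the category loss so attributes emerge as a
# by-product of category training (all details are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeHead(nn.Module):
    def __init__(self, in_dim, n_attributes):
        super().__init__()
        self.fc = nn.Linear(in_dim, n_attributes)

    def forward(self, pooled):                   # (B, in_dim)
        return torch.sigmoid(self.fc(pooled))    # per-attribute scores

def joint_loss(cat_logits, labels, attr_preds, attr_targets, beta=0.3):
    cat_loss = F.cross_entropy(cat_logits, labels)
    # Auxiliary term encoding human-derived attribute supervision,
    # summed over the heads on each tapped pooling layer.
    aux_loss = sum(F.binary_cross_entropy(p, attr_targets)
                   for p in attr_preds)
    return cat_loss + beta * aux_loss
```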