Semantic-Guided Multi-Attention Localization for Zero-Shot Learning
Zero-shot learning extends conventional object classification to unseen-class
recognition by introducing semantic representations of classes.
Existing approaches predominantly focus on learning the proper mapping function
for visual-semantic embedding, while neglecting the effect of learning
discriminative visual features. In this paper, we study the significance of
discriminative region localization. We propose a semantic-guided
multi-attention localization model, which automatically discovers the most
discriminative parts of objects for zero-shot learning without any human
annotations. Our model jointly learns cooperative global and local features
from the whole object as well as the detected parts to categorize objects based
on semantic descriptions. Moreover, with the joint supervision of embedding
softmax loss and class-center triplet loss, the model is encouraged to learn
features with high inter-class dispersion and intra-class compactness. Through
comprehensive experiments on three widely used zero-shot learning benchmarks,
we show the efficacy of the multi-attention localization, and our proposed
approach improves on the state-of-the-art results by a considerable margin.
Comment: accepted to NeurIPS'1
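The joint supervision described above can be illustrated with a minimal PyTorch sketch, assuming generic feature shapes and a learnable class-center matrix (an illustrative reconstruction, not the authors' code): an embedding-softmax term scores visual features against class semantic vectors, and a class-center triplet term encourages intra-class compactness and inter-class dispersion.

```python
# Minimal sketch of the joint loss described in the abstract (assumed shapes,
# not the authors' implementation).
import torch
import torch.nn.functional as F

def embedding_softmax_loss(visual_feats, class_semantics, labels, proj):
    # visual_feats: (B, D_v), class_semantics: (C, D_s), proj: (D_v, D_s)
    logits = visual_feats @ proj @ class_semantics.t()   # (B, C) compatibility scores
    return F.cross_entropy(logits, labels)

def class_center_triplet_loss(visual_feats, labels, centers, margin=0.5):
    # centers: (C, D_v) learnable per-class centers in the visual space
    dists = torch.cdist(visual_feats, centers)                     # (B, C)
    pos = dists.gather(1, labels.unsqueeze(1)).squeeze(1)          # distance to own center
    neg = dists.scatter(1, labels.unsqueeze(1), float('inf')).min(dim=1).values
    return F.relu(pos - neg + margin).mean()                       # compact within, dispersed between

# Example with random tensors (hypothetical sizes).
B, C, D_v, D_s = 8, 10, 512, 300
feats, sem = torch.randn(B, D_v), torch.randn(C, D_s)
labels = torch.randint(0, C, (B,))
proj = torch.randn(D_v, D_s, requires_grad=True)
centers = torch.randn(C, D_v, requires_grad=True)
loss = embedding_softmax_loss(feats, sem, labels, proj) + class_center_triplet_loss(feats, labels, centers)
```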
Stacked Semantic-Guided Attention Model for Fine-Grained Zero-Shot Learning
Zero-Shot Learning (ZSL) is achieved by aligning the semantic relationships
between the global image feature vector and the corresponding class semantic
descriptions. However, using global features to represent fine-grained images
may lead to sub-optimal results, since they neglect the discriminative
differences among local regions. Moreover, different regions carry distinct
discriminative information, so the more important regions should contribute
more to the prediction. To this end, we propose a novel stacked
semantics-guided attention (S2GA) model that obtains semantically relevant
features by using individual class semantic features to progressively guide
the visual features and generate an attention map that weights the importance
of different local regions. By feeding both the integrated visual features and
the class semantic features into a multi-class classification architecture,
the proposed framework can be trained end-to-end. Extensive experimental
results on the CUB and NABird datasets show that the proposed approach yields
consistent improvements on both fine-grained zero-shot classification and
retrieval tasks.
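A minimal sketch of the core attention step, assuming generic projection matrices and a single (non-stacked) attention stage rather than the full S2GA model: the class semantic vector guides the weighting of local-region features.

```python
# Minimal, single-stage sketch of semantics-guided attention over local regions
# (assumed shapes; not the full stacked model).
import torch
import torch.nn.functional as F

def semantic_guided_attention(region_feats, class_semantic, W_v, W_s):
    # region_feats: (B, R, D_v) local-region features; class_semantic: (D_s,)
    # W_v: (D_v, H), W_s: (D_s, H) projections into a shared hidden space
    guided = torch.tanh(region_feats @ W_v + class_semantic @ W_s)  # (B, R, H)
    attn = F.softmax(guided.sum(dim=-1), dim=-1)                    # (B, R) region weights
    return (attn.unsqueeze(-1) * region_feats).sum(dim=1)           # (B, D_v) attended feature
```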
Action2Vec: A Crossmodal Embedding Approach to Action Learning
We describe a novel cross-modal embedding space for actions, named
Action2Vec, which combines linguistic cues from class labels with
spatio-temporal features derived from video clips. Our approach uses a
hierarchical recurrent network to capture the temporal structure of video
features. We train our embedding using a joint loss that combines
classification accuracy with similarity to Word2Vec semantics. We evaluate
Action2Vec by performing zero-shot action recognition and obtain
state-of-the-art results on three standard datasets. In addition, we present
two novel analogy tests which quantify the extent to which our joint embedding
captures distributional semantics. This is the first joint embedding space to
combine verbs and action videos, and the first to be thoroughly evaluated with
respect to its distributional semantics.
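A minimal sketch of the joint loss described above, assuming the video embedding has already been projected into the word-vector space; the hierarchical recurrent encoder is omitted.

```python
# Minimal sketch of a joint loss combining classification accuracy with
# similarity to Word2Vec semantics (assumed shapes; not the authors' model).
import torch
import torch.nn.functional as F

def joint_loss(video_embed, logits, labels, word_vecs, alpha=0.5):
    # video_embed: (B, D) video encoding projected into the word-vector space
    # word_vecs: (C, D) Word2Vec vectors for the class labels (e.g., verbs)
    cls = F.cross_entropy(logits, labels)
    sim = 1.0 - F.cosine_similarity(video_embed, word_vecs[labels], dim=-1).mean()
    return alpha * cls + (1 - alpha) * sim
```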
Few-Shot Adaptation for Multimedia Semantic Indexing
We propose a few-shot adaptation framework, which bridges zero-shot learning
and supervised many-shot learning, for semantic indexing of image and video
data. Few-shot adaptation provides robust parameter estimation with few
training examples, by optimizing the parameters of zero-shot learning and
supervised many-shot learning simultaneously. In this method, first we build a
zero-shot detector, and then update it by using the few examples. Our
experiments show the effectiveness of the proposed framework on three datasets:
TRECVID Semantic Indexing 2010, 2014, and ImageNet. On the ImageNet dataset, we
show that our method outperforms recent few-shot learning methods. On the
TRECVID 2014 dataset, we achieve 15.19% and 35.98% in Mean Average Precision
under the zero-shot condition and the supervised condition, respectively. To
the best of our knowledge, these are the best results on this dataset.
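One way to read "build a zero-shot detector, then update it using the few examples" is a detector whose parameters are regularized toward the zero-shot solution; the sketch below is an assumed formulation of that idea, not necessarily the paper's exact objective.

```python
# Minimal sketch of adapting a zero-shot detector with a few labeled examples
# by keeping the updated weights close to the zero-shot weights (an assumed
# formulation).
import numpy as np

def few_shot_adapt(w_zero, X, y, lam=1.0, lr=0.1, steps=200):
    # X: (N, D) features of the few examples, y: (N,) binary labels,
    # w_zero: (D,) weights of the zero-shot detector used as a prior.
    w = w_zero.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))                     # predicted probabilities
        grad = X.T @ (p - y) / len(y) + lam * (w - w_zero)   # regularize toward w_zero
        w -= lr * grad
    return w
```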
Towards Context-aware Interaction Recognition
Recognizing how objects interact with each other is a crucial task in visual
recognition. If we define the context of the interaction to be the objects
involved, then most current methods can be categorized as either: (i) training
a single classifier on the combination of the interaction and its context; or
(ii) aiming to recognize the interaction independently of its explicit context.
Both methods suffer limitations: the former scales poorly with the number of
combinations and fails to generalize to unseen combinations, while the latter
often leads to poor interaction recognition performance due to the difficulty
of designing a context-independent interaction classifier. To mitigate those
drawbacks, this paper proposes an alternative, context-aware interaction
recognition framework. The key to our method is to explicitly construct an
interaction classifier which combines the context and the interaction. The
context is encoded via word2vec into a semantic space, and is used to derive a
classification result for the interaction.
The proposed method still builds one classifier for one interaction (as per
type (ii) above), but the classifier built is adaptive to context via
context-dependent weights. The benefit of using the semantic space is that it
naturally leads to zero-shot generalization, in which semantically similar
contexts (subject-object pairs) can be recognized as suitable contexts for an
interaction, even if they were not observed in the training set.
Comment: Fixed typo
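A minimal sketch of a classifier that is adaptive to context: the word2vec encoding of the subject-object pair is mapped to classifier weights, which then score the visual feature. The shapes and the linear weight generator are assumptions, not the paper's exact parameterization.

```python
# Minimal sketch of an interaction classifier whose weights are generated from
# a word2vec encoding of the context (assumed shapes).
import torch

def context_aware_score(visual_feat, context_vec, W_gen, b_gen):
    # visual_feat: (B, D_v); context_vec: (D_c,) word2vec encoding of (subject, object)
    # W_gen: (D_c, D_v), b_gen: (D_v,) map the context to classifier weights
    w_context = context_vec @ W_gen + b_gen   # context-dependent classifier weights (D_v,)
    return visual_feat @ w_context            # interaction score for each example
```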
Learning Visually Consistent Label Embeddings for Zero-Shot Learning
In this work, we propose a zero-shot learning method to effectively model
knowledge transfer between classes via jointly learning visually consistent
word vectors and a label embedding model in an end-to-end manner. The main idea
is to project the word vectors of attributes and classes into the visual space
such that the word representations of semantically related classes become
closer, and to use the projected vectors in the proposed embedding
model to identify unseen classes. We evaluate the proposed approach on two
benchmark datasets and the experimental results show that our method yields
significant improvements in recognition accuracy.
Comment: To appear at IEEE Int. Conference on Image Processing (ICIP) 201
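A minimal sketch of the inference step implied above, assuming a learned projection from the word-vector space into the visual space; the end-to-end training of the projection and word vectors is omitted.

```python
# Minimal sketch of zero-shot prediction with label embeddings projected into
# the visual space (assumed shapes).
import torch
import torch.nn.functional as F

def predict_unseen(visual_feats, class_word_vecs, proj):
    # visual_feats: (B, D_v); class_word_vecs: (C, D_s); proj: (D_s, D_v)
    class_protos = class_word_vecs @ proj                     # class prototypes in visual space
    sims = F.normalize(visual_feats, dim=-1) @ F.normalize(class_protos, dim=-1).t()
    return sims.argmax(dim=-1)                                # index of the nearest projected class
```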
Generative Model for Zero-Shot Sketch-Based Image Retrieval
We present a probabilistic model for Sketch-Based Image Retrieval (SBIR)
where, at retrieval time, we are given sketches from novel classes that were
not present at training time. Existing SBIR methods, most of which rely on
learning class-wise correspondences between sketches and images, typically work
well only for previously seen sketch classes, and result in poor retrieval
performance on novel classes. To address this, we propose a generative model
that learns to generate images, conditioned on a given novel class sketch. This
enables us to reduce the SBIR problem to a standard image-to-image search
problem. Our model is based on an inverse auto-regressive flow based
variational autoencoder, with a feedback mechanism to ensure robust image
generation. We evaluate our model on two very challenging datasets, Sketchy
and TU Berlin, with a novel train-test split. The proposed approach
significantly outperforms various baselines on both datasets.
Comment: Accepted at CVPR-Workshop 201
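A minimal sketch of reducing SBIR to an image-to-image search, with the generator and image encoder left as placeholder callables; the paper's IAF-based variational autoencoder and feedback mechanism are not reproduced here.

```python
# Minimal sketch of sketch-based retrieval via generation then nearest-neighbor
# search (generator and encoder are hypothetical placeholders).
import torch

def retrieve(sketch, generator, image_encoder, gallery_feats, k=5):
    fake_image = generator(sketch)              # map the sketch into the image domain
    query = image_encoder(fake_image)           # (1, D) feature of the generated image
    dists = torch.cdist(query, gallery_feats)   # (1, N) distances to gallery features
    return dists.topk(k, largest=False).indices.squeeze(0)  # top-k gallery indices
```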
Concept Mask: Large-Scale Segmentation from Semantic Concepts
Existing works on semantic segmentation typically consider a small number of
labels, ranging from tens to a few hundred. With a large number of labels,
training and evaluation of such a task become extremely challenging due to
correlations between labels and the lack of datasets with complete annotations. We
formulate semantic segmentation as a problem of image segmentation given a
semantic concept, and propose a novel system which can potentially handle an
unlimited number of concepts, including objects, parts, stuff, and attributes.
We achieve this using a weakly and semi-supervised framework leveraging
multiple datasets with different levels of supervision. We first train a deep
neural network on a 6M stock image dataset with only image-level labels to
learn visual-semantic embedding on 18K concepts. Then, we refine and extend the
embedding network to predict an attention map, using a curated dataset with
bounding box annotations on 750 concepts. Finally, we train an attention-driven
class-agnostic segmentation network using an 80-category fully annotated
dataset. We perform extensive experiments to validate that the proposed system
performs competitively to the state of the art on fully supervised concepts,
and is capable of producing accurate segmentations for weakly learned and
unseen concepts.
Comment: Accepted to ECCV1
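A minimal sketch of concept-conditioned attention over a visual feature map, assuming a single projection from the concept embedding; the multi-dataset, multi-stage training pipeline described above is omitted.

```python
# Minimal sketch of producing an attention map conditioned on a semantic
# concept embedding (assumed shapes; a simplification of the full system).
import torch

def concept_attention(feature_map, concept_embedding, proj):
    # feature_map: (B, D_v, H, W); concept_embedding: (D_s,); proj: (D_s, D_v)
    query = concept_embedding @ proj                          # concept query in visual space (D_v,)
    scores = torch.einsum('bdhw,d->bhw', feature_map, query)  # per-pixel similarity
    return torch.sigmoid(scores)                              # attention map in [0, 1]
```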
SR-GAN: Semantic Rectifying Generative Adversarial Network for Zero-shot Learning
Existing zero-shot learning (ZSL) methods may suffer from vague class
attributes that are highly overlapped across different classes. Unlike these
methods, which ignore the discrimination among classes, in this paper we
propose to classify unseen images by rectifying the semantic space under the
guidance of the visual space. First, we pre-train a Semantic Rectifying
Network (SRN) to rectify the semantic space with a semantic loss and a
rectifying loss. Then, a Semantic Rectifying Generative Adversarial Network
(SR-GAN) is built to generate plausible visual features of unseen classes from
both the semantic features and the rectified semantic features. To guarantee
the effectiveness of the rectified semantic features and the synthetic visual
features, pre-reconstruction and post-reconstruction networks are proposed to
keep the consistency between visual and semantic features. Experimental
results demonstrate that our approach significantly outperforms the state of
the art on four benchmark datasets.
Comment: ICME 2019 Ora
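A minimal sketch of the generative ZSL step implied above: visual features for unseen classes are synthesized from their (rectified) semantic vectors so that an ordinary classifier can be trained on them. The generator is a placeholder; the SRN and the reconstruction networks are omitted.

```python
# Minimal sketch of synthesizing unseen-class visual features from semantic
# vectors (the conditional generator is a hypothetical placeholder).
import torch

def synthesize_unseen_features(generator, unseen_semantics, n_per_class, noise_dim):
    # unseen_semantics: (C_u, D_s) (rectified) semantic vectors of unseen classes
    feats, labels = [], []
    for c, sem in enumerate(unseen_semantics):
        z = torch.randn(n_per_class, noise_dim)               # noise samples
        cond = sem.unsqueeze(0).expand(n_per_class, -1)       # repeat the class semantics
        feats.append(generator(torch.cat([z, cond], dim=1)))  # synthetic visual features
        labels.append(torch.full((n_per_class,), c, dtype=torch.long))
    return torch.cat(feats), torch.cat(labels)
```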
A Novel Perspective to Zero-shot Learning: Towards an Alignment of Manifold Structures via Semantic Feature Expansion
Zero-shot learning aims at recognizing unseen classes (no training example)
with knowledge transferred from seen classes. This is typically achieved by
exploiting a semantic feature space shared by both seen and unseen classes,
i.e., attribute or word vector, as the bridge. One common practice in zero-shot
learning is to train a projection between the visual and semantic feature
spaces with labeled examples of seen classes. At inference time, this learned
projection is applied to unseen classes, and class labels are recognized by
some distance metric. However, the visual and semantic feature spaces are
mutually independent and have quite different manifold structures. Under such
a paradigm, most existing methods easily suffer from the domain shift problem,
which weakens the performance of zero-shot recognition. To address this issue, we
propose a novel model called AMS-SFE. It considers the alignment of manifold
structures by semantic feature expansion. Specifically, we build upon an
autoencoder-based model to expand the semantic features from the visual inputs.
Additionally, the expansion is jointly guided by an embedded manifold extracted
from the visual feature space of the data. Our model is the first attempt to
align both feature spaces by expanding semantic features, and it derives two
benefits: first, we expand some auxiliary features that enhance the semantic
feature space; second and more importantly, we implicitly align the manifold
structures between the visual and semantic feature spaces, so the projection
can be better trained and the domain shift problem is mitigated. Extensive
experiments show significant performance improvements, which verifies the
effectiveness of our model.
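A minimal sketch of expanding auxiliary semantic features from visual inputs with an autoencoder, using assumed layer sizes; the manifold-alignment guidance described in the abstract is reduced here to a plain reconstruction objective.

```python
# Minimal sketch of autoencoder-based semantic feature expansion (assumed
# layer sizes; the manifold-alignment term is omitted).
import torch
import torch.nn as nn

class SemanticExpander(nn.Module):
    def __init__(self, d_visual=2048, d_expand=85):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_visual, 512), nn.ReLU(),
                                     nn.Linear(512, d_expand))   # expanded semantic features
        self.decoder = nn.Sequential(nn.Linear(d_expand, 512), nn.ReLU(),
                                     nn.Linear(512, d_visual))   # reconstruct the visual input

    def forward(self, visual_feats):
        expanded = self.encoder(visual_feats)
        recon = self.decoder(expanded)
        return expanded, recon

# Training would minimize a reconstruction loss such as
# nn.functional.mse_loss(recon, visual_feats); the expanded features can then
# be concatenated with the original attribute vectors.
```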