4,402 research outputs found
Semantic Embedding Space for Zero-Shot Action Recognition
The number of categories for action recognition is growing rapidly. It is
thus becoming increasingly hard to collect sufficient training data to learn
conventional models for each category. This issue may be ameliorated by the
increasingly popular 'zero-shot learning' (ZSL) paradigm. In this framework a
mapping is constructed between visual features and a human interpretable
semantic description of each category, allowing categories to be recognised in
the absence of any training data. Existing ZSL studies focus primarily on image
data, and attribute-based semantic representations. In this paper, we address
zero-shot recognition in contemporary video action recognition tasks, using
semantic word vector space as the common space to embed videos and category
labels. This is more challenging because the mapping between the semantic space
and space-time features of videos containing complex actions is more complex
and harder to learn. We demonstrate that a simple self-training and data
augmentation strategy can significantly improve the efficacy of this mapping.
Experiments on human action datasets including HMDB51 and UCF101 demonstrate
that our approach achieves the state-of-the-art zero-shot action recognition
performance.Comment: 5 page
Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions
We aim for zero-shot localization and classification of human actions in
video. Where traditional approaches rely on global attribute or object
classification scores for their zero-shot knowledge transfer, our main
contribution is a spatial-aware object embedding. To arrive at spatial
awareness, we build our embedding on top of freely available actor and object
detectors. Relevance of objects is determined in a word embedding space and
further enforced with estimated spatial preferences. Besides local object
awareness, we also embed global object awareness into our embedding to maximize
actor and object interaction. Finally, we exploit the object positions and
sizes in the spatial-aware embedding to demonstrate a new spatio-temporal
action retrieval scenario with composite queries. Action localization and
classification experiments on four contemporary action video datasets support
our proposal. Apart from state-of-the-art results in the zero-shot localization
and classification settings, our spatial-aware embedding is even competitive
with recent supervised action localization alternatives.Comment: ICC
Multi-Label Zero-Shot Human Action Recognition via Joint Latent Ranking Embedding
Human action recognition refers to automatic recognizing human actions from a
video clip. In reality, there often exist multiple human actions in a video
stream. Such a video stream is often weakly-annotated with a set of relevant
human action labels at a global level rather than assigning each label to a
specific video episode corresponding to a single action, which leads to a
multi-label learning problem. Furthermore, there are many meaningful human
actions in reality but it would be extremely difficult to collect/annotate
video clips regarding all of various human actions, which leads to a zero-shot
learning scenario. To the best of our knowledge, there is no work that has
addressed all the above issues together in human action recognition. In this
paper, we formulate a real-world human action recognition task as a multi-label
zero-shot learning problem and propose a framework to tackle this problem in a
holistic way. Our framework holistically tackles the issue of unknown temporal
boundaries between different actions for multi-label learning and exploits the
side information regarding the semantic relationship between different human
actions for knowledge transfer. Consequently, our framework leads to a joint
latent ranking embedding for multi-label zero-shot human action recognition. A
novel neural architecture of two component models and an alternate learning
algorithm are proposed to carry out the joint latent ranking embedding
learning. Thus, multi-label zero-shot recognition is done by measuring
relatedness scores of action labels to a test video clip in the joint latent
visual and semantic embedding spaces. We evaluate our framework with different
settings, including a novel data split scheme designed especially for
evaluating multi-label zero-shot learning, on two datasets: Breakfast and
Charades. The experimental results demonstrate the effectiveness of our
framework.Comment: 27 pages, 10 figures and 7 tables. Technical report submitted to a
journal. More experimental results/references were added and typos were
correcte
Sherlock: Scalable Fact Learning in Images
We study scalable and uniform understanding of facts in images. Existing
visual recognition systems are typically modeled differently for each fact type
such as objects, actions, and interactions. We propose a setting where all
these facts can be modeled simultaneously with a capacity to understand
unbounded number of facts in a structured way. The training data comes as
structured facts in images, including (1) objects (e.g., ), (3) actions (e.g., ). Each fact has a semantic
language view (e.g., ) and a visual view (an image with this
fact). We show that learning visual facts in a structured way enables not only
a uniform but also generalizable visual understanding. We propose and
investigate recent and strong approaches from the multiview learning literature
and also introduce two learning representation models as potential baselines.
We applied the investigated methods on several datasets that we augmented with
structured facts and a large scale dataset of more than 202,000 facts and
814,000 images. Our experiments show the advantage of relating facts by the
structure by the proposed models compared to the designed baselines on
bidirectional fact retrieval.Comment: Jan 7 Updat
Recent Advances in Transfer Learning for Cross-Dataset Visual Recognition: A Problem-Oriented Perspective
This paper takes a problem-oriented perspective and presents a comprehensive
review of transfer learning methods, both shallow and deep, for cross-dataset
visual recognition. Specifically, it categorises the cross-dataset recognition
into seventeen problems based on a set of carefully chosen data and label
attributes. Such a problem-oriented taxonomy has allowed us to examine how
different transfer learning approaches tackle each problem and how well each
problem has been researched to date. The comprehensive problem-oriented review
of the advances in transfer learning with respect to the problem has not only
revealed the challenges in transfer learning for visual recognition, but also
the problems (e.g. eight of the seventeen problems) that have been scarcely
studied. This survey not only presents an up-to-date technical review for
researchers, but also a systematic approach and a reference for a machine
learning practitioner to categorise a real problem and to look up for a
possible solution accordingly
Multi-Label Zero-Shot Learning with Structured Knowledge Graphs
In this paper, we propose a novel deep learning architecture for multi-label
zero-shot learning (ML-ZSL), which is able to predict multiple unseen class
labels for each input instance. Inspired by the way humans utilize semantic
knowledge between objects of interests, we propose a framework that
incorporates knowledge graphs for describing the relationships between multiple
labels. Our model learns an information propagation mechanism from the semantic
label space, which can be applied to model the interdependencies between seen
and unseen class labels. With such investigation of structured knowledge graphs
for visual reasoning, we show that our model can be applied for solving
multi-label classification and ML-ZSL tasks. Compared to state-of-the-art
approaches, comparable or improved performances can be achieved by our method.Comment: CVPR 201
Universal Prototype Transport for Zero-Shot Action Recognition and Localization
This work addresses the problem of recognizing action categories in videos when no training examples are available. The current state-of-the-art enables such a zero-shot recognition by learning universal mappings from videos to a semantic space, either trained on large-scale seen actions or on objects. While effective, we find that universal action and object mappings are biased to specific regions in the semantic space. These biases lead to a fundamental problem: many unseen action categories are simply never inferred during testing. For example on UCF-101, a quarter of the unseen actions are out of reach with a state-of-the-art universal action model. To that end, this paper introduces universal prototype transport for zero-shot action recognition. The main idea is to re-position the semantic prototypes of unseen actions by matching them to the distribution of all test videos. For universal action models, we propose to match distributions through a hyperspherical optimal transport from unseen action prototypes to the set of all projected test videos. The resulting transport couplings in turn determine the target prototype for each unseen action. Rather than directly using the target prototype as final result, we re-position unseen action prototypes along the geodesic spanned by the original and target prototypes as a form of semantic regularization. For universal object models, we outline a variant that defines target prototypes based on an optimal transport between unseen action prototypes and object prototypes. Empirically, we show that universal prototype transport diminishes the biased selection of unseen action prototypes and boosts both universal action and object models for zero-shot classification and spatio-temporal localization
Robotic Pick-and-Place of Novel Objects in Clutter with Multi-Affordance Grasping and Cross-Domain Image Matching
This paper presents a robotic pick-and-place system that is capable of
grasping and recognizing both known and novel objects in cluttered
environments. The key new feature of the system is that it handles a wide range
of object categories without needing any task-specific training data for novel
objects. To achieve this, it first uses a category-agnostic affordance
prediction algorithm to select and execute among four different grasping
primitive behaviors. It then recognizes picked objects with a cross-domain
image classification framework that matches observed images to product images.
Since product images are readily available for a wide range of objects (e.g.,
from the web), the system works out-of-the-box for novel objects without
requiring any additional training data. Exhaustive experimental results
demonstrate that our multi-affordance grasping achieves high success rates for
a wide variety of objects in clutter, and our recognition algorithm achieves
high accuracy for both known and novel grasped objects. The approach was part
of the MIT-Princeton Team system that took 1st place in the stowing task at the
2017 Amazon Robotics Challenge. All code, datasets, and pre-trained models are
available online at http://arc.cs.princeton.eduComment: Project webpage: http://arc.cs.princeton.edu Summary video:
https://youtu.be/6fG7zwGfIk
- …