No Spare Parts: Sharing Part Detectors for Image Categorization
This work aims for image categorization using a representation of distinctive
parts. Different from existing part-based work, we argue that parts are
naturally shared between image categories and should be modeled as such. We
motivate our approach with a quantitative and qualitative analysis by
backtracking where selected parts come from. Our analysis shows that in
addition to the category parts defining the class, the parts coming from the
background context and parts from other image categories improve categorization
performance. Part selection should not be done separately for each category,
but instead be shared and optimized over all categories. To incorporate part
sharing between categories, we present an algorithm based on AdaBoost to
jointly optimize part sharing and selection, as well as fusion with the global
image representation. We achieve results competitive with the state of the art
on object, scene, and action categories, further improving over deep
convolutional neural networks.
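
To make the shared-selection idea concrete, here is a minimal sketch, not the
authors' algorithm: multi-class AdaBoost over a matrix of part-detector
responses, where each boosting round picks a single part (a depth-1 stump)
whose contribution is shared across all categories. The response matrix S and
labels y are synthetic placeholders.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier   # scikit-learn >= 1.2
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_images, n_parts, n_classes = 500, 200, 10
S = rng.normal(size=(n_images, n_parts))       # part-detector scores per image
y = rng.integers(0, n_classes, size=n_images)  # category labels

# Depth-1 stumps: each boosting round selects exactly one part detector,
# and that same part then contributes to the decision for every category.
booster = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1), n_estimators=100)
booster.fit(S, y)

# Parts chosen in at least one round: the shared, jointly selected subset.
shared = {t.tree_.feature[0] for t in booster.estimators_
          if t.tree_.feature[0] >= 0}
print(f"{len(shared)} parts selected out of {n_parts}")

Because every stump is fit against the weighted multi-class error, a part is
kept only if it helps the ensemble as a whole, which is one simple way to
realize selection that is shared and optimized over all categories.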
Compositional Structure Learning for Action Understanding
The focus of the action understanding literature has predominantly been
classification; however, there are many applications demanding richer action
understanding, such as mobile robotics and video search, with solutions to
classification, localization, and detection. In this paper, we propose a
compositional model that leverages a new mid-level representation called
compositional trajectories and a locally articulated spatiotemporal deformable
parts model (LASTDPM) for full action understanding. Our method is
advantageous in capturing the variable structure of dynamic human activity over
a long range. First, the compositional trajectories capture long-ranging,
frequently co-occurring groups of trajectories in space time and represent them
in discriminative hierarchies, where human motion is largely separated from
camera motion; second, LASTDPM learns a structured model with multi-layer
deformable parts to capture multiple levels of articulated motion. We implement
our method and demonstrate state-of-the-art performance on all three problems:
action detection, localization, and recognition.
Unsupervised Discovery of Object Landmarks as Structural Representations
Deep neural networks can model images with rich latent representations, but
they cannot naturally conceptualize structures of object categories in a
human-perceptible way. This paper addresses the problem of learning object
structures in an image modeling process without supervision. We propose an
autoencoding formulation to discover landmarks as explicit structural
representations. The encoding module outputs landmark coordinates, whose
validity is ensured by constraints that reflect the necessary properties for
landmarks. The decoding module takes the landmarks as a part of the learnable
input representations in an end-to-end differentiable framework. Our discovered
landmarks are semantically meaningful and more predictive of manually annotated
landmarks than those discovered by previous methods. The coordinates of our
landmarks are also complementary features to pretrained deep-neural-network
representations in recognizing visual attributes. In addition, the proposed
method naturally creates an unsupervised, perceptible interface to manipulate
object shapes and decode images with controllable structures. The project
webpage is at http://ytzhang.net/projects/lmdis-rep
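
One ingredient the abstract relies on, landmark coordinates whose extraction
stays differentiable so the autoencoder can train end to end, is commonly
implemented with a soft-argmax readout over heatmaps. The sketch below
(PyTorch) shows that readout under this assumption; the paper's exact
formulation and constraints may differ.

import torch

def soft_argmax(heatmaps):
    # heatmaps: (batch, n_landmarks, H, W) raw scores
    # returns:  (batch, n_landmarks, 2) coordinates in [0, 1], differentiable
    b, k, h, w = heatmaps.shape
    probs = torch.softmax(heatmaps.reshape(b, k, h * w), dim=-1)
    probs = probs.reshape(b, k, h, w)
    ys = torch.linspace(0, 1, h, device=heatmaps.device)
    xs = torch.linspace(0, 1, w, device=heatmaps.device)
    y = (probs.sum(dim=3) * ys).sum(dim=2)  # expected row under the marginal
    x = (probs.sum(dim=2) * xs).sum(dim=2)  # expected column under the marginal
    return torch.stack([x, y], dim=-1)

coords = soft_argmax(torch.randn(4, 10, 16, 16))  # e.g. 10 landmarks
print(coords.shape)  # torch.Size([4, 10, 2])

Because the coordinates are expectations rather than hard argmaxes, gradients
flow from the decoder's reconstruction loss back into the encoder.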
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
The paper presents futuristic challenges discussed in the cvpaper.challenge.
In 2015 and 2016, we thoroughly studied 1,600+ papers from several
conferences and journals, such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV.
Mid-level Representation for Visual Recognition
Visual Recognition is one of the fundamental challenges in AI, where the goal
is to understand the semantics of visual data. Employing mid-level
representation, in particular, shifted the paradigm in visual recognition. The
mid-level image/video representation involves discovering and training a set of
mid-level visual patterns (e.g., parts and attributes) and representing a given
image/video using them. The mid-level patterns can be extracted from images
and videos using the motion and appearance information of visual phenomena.
This thesis targets employing mid-level representations for different
high-level visual recognition tasks, namely (i) image understanding and
(ii) video understanding.
In the case of image understanding, we focus on the object detection/recognition
task. We investigate discovering and learning a set of mid-level patches to
be used for representing the images of an object category. We specifically
employ the discriminative patches in a subcategory-aware, webly-supervised
fashion. We additionally study the outcomes of employing the
subcategory-based models for undoing dataset bias.
Unsupervised Representation Learning by Sorting Sequences
We present an unsupervised representation learning approach using videos
without semantic labels. We leverage the temporal coherence as a supervisory
signal by formulating representation learning as a sequence sorting task. We
take temporally shuffled frames (i.e., in non-chronological order) as inputs
and train a convolutional neural network to sort the shuffled sequences.
Similar to comparison-based sorting algorithms, we propose to extract features
from all frame pairs and aggregate them to predict the correct order. As
sorting a shuffled image sequence requires an understanding of the statistical
temporal structure of images, training with such a proxy task allows us to
learn rich and generalizable visual representations. We validate the
effectiveness of the learned representation using our method as pre-training on
high-level recognition problems. The experimental results show that our method
compares favorably against state-of-the-art methods on action recognition,
image classification, and object detection tasks. ICCV 2017. Project page:
http://vllab1.ucmerced.edu/~hylee/OPN
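
The proxy task is easy to state in code. Below is a minimal sketch under toy
assumptions: the per-frame backbone is a placeholder linear layer rather than
a real CNN, pairwise features are concatenated and pooled, and the head
classifies which of the n! permutations produced the shuffled input.

import itertools
import torch
import torch.nn as nn

n_frames, feat_dim = 4, 128
perms = list(itertools.permutations(range(n_frames)))  # 24 classes for 4 frames

class OrderSorter(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(3 * 32 * 32, feat_dim)  # stand-in per-frame CNN
        self.pair = nn.Linear(2 * feat_dim, feat_dim)     # pairwise feature
        self.head = nn.Linear(feat_dim, len(perms))       # permutation classifier

    def forward(self, frames):               # frames: (batch, n_frames, 3*32*32)
        f = self.backbone(frames)            # (batch, n_frames, feat_dim)
        pairs = [torch.relu(self.pair(torch.cat([f[:, i], f[:, j]], dim=-1)))
                 for i, j in itertools.combinations(range(n_frames), 2)]
        return self.head(torch.stack(pairs, dim=1).mean(dim=1))

model = OrderSorter()
x = torch.randn(8, n_frames, 3 * 32 * 32)  # shuffled frames (toy data)
y = torch.randint(0, len(perms), (8,))     # index of the applied permutation
nn.functional.cross_entropy(model(x), y).backward()

No labels beyond the permutation index are needed; the supervisory signal
comes entirely from the shuffling itself.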
Im2Flow: Motion Hallucination from Static Images for Action Recognition
Existing methods to recognize actions in static images take the images at
their face value, learning the appearances---objects, scenes, and body
poses---that distinguish each action class. However, such models are deprived
of the rich dynamic structure and motions that also define human activity. We
propose an approach that hallucinates the unobserved future motion implied by a
single snapshot to help static-image action recognition. The key idea is to
learn a prior over short-term dynamics from thousands of unlabeled videos,
infer the anticipated optical flow on novel static images, and then train
discriminative models that exploit both streams of information. Our main
contributions are twofold. First, we devise an encoder-decoder convolutional
neural network and a novel optical flow encoding that can translate a static
image into an accurate flow map. Second, we show the power of hallucinated flow
for recognition, successfully transferring the learned motion into a standard
two-stream network for activity recognition. On seven datasets, we demonstrate
the power of the approach. It not only achieves state-of-the-art accuracy for
dense optical flow prediction, but also consistently enhances recognition of
actions and dynamic scenes. Published in CVPR 2018; project page:
http://vision.cs.utexas.edu/projects/im2flow
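
A minimal sketch of the image-to-flow regression at the heart of the method: a
small encoder-decoder that maps an RGB frame to a 2-channel (dx, dy) flow map,
trained against flow precomputed from unlabeled video. The architecture and
plain MSE loss are illustrative stand-ins, not the paper's network or its
novel flow encoding.

import torch
import torch.nn as nn

class TinyIm2Flow(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU())
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 2, 4, stride=2, padding=1))  # (dx, dy)

    def forward(self, rgb):
        return self.dec(self.enc(rgb))

model = TinyIm2Flow()
frame = torch.randn(4, 3, 64, 64)        # static input frames (toy data)
target = torch.randn(4, 2, 64, 64)       # flow precomputed from unlabeled video
nn.functional.mse_loss(model(frame), target).backward()

At test time the predicted flow stands in for the real motion stream of a
two-stream recognition network.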
Features in Concert: Discriminative Feature Selection meets Unsupervised Clustering
Feature selection is an essential problem in computer vision, important for
category learning and recognition. Along with the rapid development of a wide
variety of visual features and classifiers, there is a growing need for
efficient feature selection and combination methods, to construct powerful
classifiers for more complex and higher-level recognition tasks. We propose an
algorithm that efficiently discovers sparse, compact representations of input
features or classifiers, from a vast sea of candidates, with important
optimality properties, low computational cost and excellent accuracy in
practice. Different from boosting, we start with a discriminant linear
classification formulation that encourages sparse solutions. Then we obtain an
equivalent unsupervised clustering problem that jointly discovers ensembles of
diverse features. They are independently valuable but even more powerful when
united in a cluster of classifiers. We evaluate our method on the task of
large-scale recognition in video and show that it significantly outperforms
classical selection approaches, such as AdaBoost and greedy forward-backward
selection, and powerful classifiers such as SVMs, in both training speed and
performance, especially in the case of limited training data.
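
As one hedged illustration of the starting point, a discriminant linear
formulation that encourages sparse solutions, the snippet below uses
L1-regularized logistic regression as a generic stand-in; the authors'
formulation and its clustering equivalence are not reproduced here, and the
data is synthetic.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_candidates = 400, 300
X = rng.normal(size=(n_samples, n_candidates))      # candidate feature responses
w_true = np.zeros(n_candidates)
w_true[:10] = 2.0                                   # only 10 truly informative
y = (X @ w_true + rng.normal(size=n_samples) > 0).astype(int)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(clf.coef_[0])             # the sparse, compact subset
print(f"kept {selected.size} of {n_candidates} candidates")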
A Bag-of-Words Equivalent Recurrent Neural Network for Action Recognition
The traditional bag-of-words approach has found a wide range of applications
in computer vision. The standard pipeline consists of a generation of a visual
vocabulary, a quantization of the features into histograms of visual words, and
a classification step for which usually a support vector machine in combination
with a non-linear kernel is used. Given large amounts of data, however, the
model suffers from a lack of discriminative power. This applies particularly
to action recognition, where the vast number of video features must be
subsampled for unsupervised visual vocabulary generation. Moreover, the kernel
computation can be very expensive on large datasets. In this work, we propose a
recurrent neural network that is equivalent to the traditional bag-of-words
approach but enables discriminative training. The model
further allows the kernel computation to be incorporated into the neural
network directly, solving the complexity issue and allowing the complete
classification system to be represented within a single network. We evaluate
our method on four recent action recognition benchmarks and show that it
outperforms both the conventional model and sparse coding methods.
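
The correspondence between the bag-of-words pipeline and a recurrent network
can be sketched directly: soft-assign each frame feature to visual words,
accumulate the assignments over time (the recurrence), and classify the
normalized histogram. Dimensions and the softmax assignment below are
illustrative choices, not the paper's exact model.

import torch
import torch.nn as nn

feat_dim, n_words, n_classes = 64, 100, 10

class BoWRNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.assign = nn.Linear(feat_dim, n_words)    # plays the codebook's role
        self.classify = nn.Linear(n_words, n_classes)

    def forward(self, feats):             # feats: (batch, time, feat_dim)
        hist = feats.new_zeros(feats.size(0), n_words)
        for t in range(feats.size(1)):    # recurrent accumulation over frames
            hist = hist + torch.softmax(self.assign(feats[:, t]), dim=-1)
        return self.classify(hist / feats.size(1))  # histogram -> class scores

model = BoWRNN()
logits = model(torch.randn(8, 30, feat_dim))  # 30 frames of features per video
print(logits.shape)  # torch.Size([8, 10])

Training this network end to end is what replaces the unsupervised codebook
with discriminatively learned visual words.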