A Study on Unsupervised Dictionary Learning and Feature Encoding for Action Classification
Many efforts have been devoted to developing alternative methods to traditional
vector quantization in the image domain, such as sparse coding and soft-assignment.
These approaches can be split into a dictionary learning phase and a feature
encoding phase which are often closely connected. In this paper, we investigate
the effects of these phases by separating them for video-based action
classification. We compare several dictionary learning methods and feature
encoding schemes through extensive experiments on KTH and HMDB51 datasets.
Experimental results indicate that sparse coding performs consistently better
than the other encoding methods on the large, complex dataset (i.e., HMDB51), and it
is robust to different dictionaries. On the small, simple dataset (i.e., KTH) with
less variation, however, all the encoding strategies perform competitively. In
addition, we note that the strength of sophisticated encoding approaches comes
not from their corresponding dictionaries but from the encoding mechanisms, and we
can simply use randomly selected exemplars as dictionaries for video-based action
classification.
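The contrast between encoding schemes described above can be sketched minimally. This is a toy illustration on synthetic descriptors, not the paper's implementation; the helper names `vq_encode` and `soft_assign_encode` are hypothetical, and the dictionary is simply a set of randomly selected exemplars, as the abstract suggests often suffices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy local descriptors (standing in for e.g. HOG/HOF vectors) and a
# dictionary of randomly selected exemplars drawn from the data itself.
X = rng.normal(size=(100, 32))                   # 100 descriptors, 32-D
D = X[rng.choice(100, size=16, replace=False)]   # 16 random exemplar atoms

def vq_encode(x, D):
    """Hard vector quantization: one-hot code of the nearest atom."""
    dists = np.linalg.norm(D - x, axis=1)
    code = np.zeros(len(D))
    code[np.argmin(dists)] = 1.0
    return code

def soft_assign_encode(x, D, beta=1.0):
    """Soft-assignment: Gaussian-kernel weights over all atoms, summing to 1."""
    dists = np.linalg.norm(D - x, axis=1)
    w = np.exp(-beta * dists ** 2)
    return w / w.sum()

# Encode all descriptors and sum-pool into a single video-level histogram.
codes_vq = np.array([vq_encode(x, D) for x in X])
codes_soft = np.array([soft_assign_encode(x, D) for x in X])
hist_vq = codes_vq.sum(axis=0)
hist_soft = codes_soft.sum(axis=0)
```

Separating the two phases this way makes it easy to swap in a different dictionary (k-means centers, learned atoms) while holding the encoder fixed, which is the kind of comparison the study performs.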
Kernel Coding: General Formulation and Special Cases
Representing images by compact codes has proven beneficial for many visual
recognition tasks. Most existing techniques, however, perform this coding step
directly in image feature space, where the distributions of the different
classes are typically entangled. In contrast, here, we study the problem of
performing coding in a high-dimensional Hilbert space, where the classes are
expected to be more easily separable. To this end, we introduce a general
coding formulation that encompasses the most popular techniques, such as bag of
words, sparse coding and locality-based coding, and show how this formulation
and its special cases can be kernelized. Importantly, we address several
aspects of learning in our general formulation, such as kernel learning,
dictionary learning and supervised kernel coding. Our experimental evaluation
on several visual recognition tasks demonstrates the benefits of performing
coding in Hilbert space, and in particular of jointly learning the kernel, the
dictionary and the classifier.
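A minimal sketch of coding in a Hilbert space, under simplifying assumptions: an RBF kernel and a ridge penalty on the code (the paper's general formulation covers other penalties such as sparsity and locality). With a ridge penalty the code has a closed form that uses only kernel evaluations:

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf_kernel(A, B, gamma=0.5):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Dictionary atoms and a query point in the original feature space.
D = rng.normal(size=(8, 5))   # 8 atoms, 5-D features
x = rng.normal(size=(1, 5))

# Ridge-regularized coding in the RKHS,
#   min_c ||phi(x) - Phi(D) c||_H^2 + lam ||c||^2,
# has the closed form c = (K_DD + lam I)^{-1} k_D(x): kernels only,
# never the (possibly infinite-dimensional) feature map itself.
lam = 0.1
K_DD = rbf_kernel(D, D)
k_Dx = rbf_kernel(D, x)
c = np.linalg.solve(K_DD + lam * np.eye(len(D)), k_Dx)

# Reconstruction error in the Hilbert space, again via the kernel trick:
#   ||phi(x) - Phi(D)c||^2 = k(x,x) - 2 c^T k_D(x) + c^T K_DD c
err = (rbf_kernel(x, x) - 2 * c.T @ k_Dx + c.T @ K_DD @ c).item()
```

The same kernel-trick expansion is what makes bag of words, sparse coding, and locality-based coding kernelizable as special cases of one objective.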
Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice
Video-based action recognition is one of the important and challenging
problems in computer vision research. Bag of Visual Words model (BoVW) with
local features has become the most popular method and obtained the
state-of-the-art performance on several realistic datasets, such as the HMDB51,
UCF50, and UCF101. BoVW is a general pipeline to construct a global
representation from a set of local features, which is mainly composed of five
steps: (i) feature extraction, (ii) feature pre-processing, (iii) codebook
generation, (iv) feature encoding, and (v) pooling and normalization. Many
efforts have been made on each step independently in different scenarios, and
their effects on action recognition remain unclear. Meanwhile, video data
exhibits different views of visual patterns, such as static appearance and
motion dynamics. Multiple descriptors are usually extracted to represent these
different views. Many feature fusion methods have been developed in other areas
and their influence on action recognition has never been investigated before.
This paper aims to provide a comprehensive study of all steps in BoVW and
different fusion methods, and uncover some good practice to produce a
state-of-the-art action recognition system. Specifically, we explore two kinds
of local features, ten kinds of encoding methods, eight kinds of pooling and
normalization strategies, and three kinds of fusion methods. We conclude that
every step is crucial for contributing to the final recognition rate.
Furthermore, based on our comprehensive study, we propose a simple yet
effective representation, called hybrid representation, by exploring the
complementarity of different BoVW frameworks and local descriptors. Using this
representation, we obtain the state-of-the-art on the three challenging
datasets: HMDB51 (61.1%), UCF50 (92.3%), and UCF101 (87.9%).
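The five-step BoVW pipeline and representation-level fusion can be sketched end to end on toy data. This is a minimal illustration (plain k-means codebook, hard assignment, sum pooling, L2 normalization, concatenation fusion), not the paper's tuned system; the descriptor channels here are random stand-ins for e.g. HOG and trajectory features:

```python
import numpy as np

rng = np.random.default_rng(2)

def kmeans(X, k, iters=20):
    """(iii) Codebook generation with plain k-means."""
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(assign == j):
                C[j] = X[assign == j].mean(axis=0)
    return C

def bovw(feats, codebook):
    """(iv) Hard-assignment encoding, (v) sum pooling + L2 normalization."""
    assign = np.argmin(((feats[:, None] - codebook[None]) ** 2).sum(-1), axis=1)
    hist = np.bincount(assign, minlength=len(codebook)).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-12)

# (i)-(ii) Toy "appearance" and "motion" descriptors for one video.
appearance = rng.normal(size=(200, 16))
motion = rng.normal(size=(200, 16))

cb_app = kmeans(appearance, 32)
cb_mot = kmeans(motion, 32)

# Representation-level fusion: concatenate the per-channel histograms.
video_rep = np.concatenate([bovw(appearance, cb_app), bovw(motion, cb_mot)])
```

Each of the five steps is an isolated function or line here, which is exactly what lets a study like this swap one component at a time while holding the rest fixed.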
Online Unsupervised Feature Learning for Visual Tracking
Feature encoding with respect to an over-complete dictionary learned by
unsupervised methods, followed by spatial pyramid pooling, and linear
classification, has exhibited powerful strength in various vision applications.
Here we propose to use the feature learning pipeline for visual tracking.
Tracking is implemented using tracking-by-detection and the resulting framework
is very simple yet effective. First, online dictionary learning is used to
build a dictionary, which captures the appearance changes of the tracking
target as well as the background changes. Given a test image window, we extract
local image patches from it and each local patch is encoded with respect to the
dictionary. The encoded features are then pooled over a spatial pyramid to form
an aggregated feature vector. Finally, a simple linear classifier is trained on
these features.
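The window-level feature extraction described above (local patches, encoding against a dictionary, spatial pyramid pooling) can be sketched as follows. This is a toy version with hypothetical helpers (`encode`, `spatial_pyramid`), soft-assignment encoding and max pooling standing in for whichever encoder the tracker actually uses:

```python
import numpy as np

rng = np.random.default_rng(3)

def encode(patch, D):
    """Soft-assignment code of a flattened patch against dictionary D."""
    d = np.linalg.norm(D - patch, axis=1)
    w = np.exp(-d)
    return w / w.sum()

def spatial_pyramid(window, D, patch=8, levels=(1, 2)):
    """Encode local patches on a grid, then max-pool over pyramid cells."""
    H, W = window.shape
    ys = np.arange(0, H - patch + 1, patch)
    xs = np.arange(0, W - patch + 1, patch)
    codes = np.array([[encode(window[y:y+patch, x:x+patch].ravel(), D)
                       for x in xs] for y in ys])        # (gy, gx, k)
    feats = []
    for n in levels:                                      # n x n cells per level
        for i in range(n):
            for j in range(n):
                cell = codes[i*len(ys)//n:(i+1)*len(ys)//n,
                             j*len(xs)//n:(j+1)*len(xs)//n]
                feats.append(cell.max(axis=(0, 1)))       # max pooling per cell
    return np.concatenate(feats)

D = rng.normal(size=(16, 64))          # dictionary over 8x8 patches
window = rng.normal(size=(32, 32))     # candidate tracking window
vec = spatial_pyramid(window, D)       # (1 + 4) cells x 16 atoms = 80-D
```

The resulting fixed-length vector is what a linear classifier scores per candidate window in tracking-by-detection.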
Our experiments show that the proposed powerful---albeit simple---tracker
outperforms all the state-of-the-art tracking methods that we have tested.
Moreover, we evaluate the performance of different dictionary learning and
feature encoding methods in the proposed tracking framework, and analyse the
impact of each component in the tracking scenario. We also demonstrate the
flexibility of feature learning by plugging it into Hare et al.'s tracking
method. The outcome is, to our knowledge, the best tracker reported to date, which
combines the advantages of both feature learning and structured output
prediction.
Bag of Attributes for Video Event Retrieval
In this paper, we present the Bag-of-Attributes (BoA) model for video
representation aiming at video event retrieval. The BoA model is based on a
semantic feature space for representing videos, resulting in high-level video
feature vectors. For creating a semantic space, i.e., the attribute space, we
can train a classifier using a labeled image dataset, obtaining a
classification model that can be understood as a high-level codebook. This
model is used to map low-level frame vectors into high-level vectors (e.g.,
classifier probability scores). Then, we apply pooling operations on the frame
vectors to create the final bag of attributes for the video. In the BoA
representation, each dimension corresponds to one category (or attribute) of
the semantic space. Other interesting properties are: compactness, flexibility
regarding the classifier, and ability to encode multiple semantic concepts in a
single video representation. Our experiments considered the semantic space
created by a deep convolutional neural network (OverFeat) pre-trained on 1000
object categories of ImageNet. OverFeat was then used to classify each video
frame and max pooling combined the frame vectors in the BoA representation for
the video. Results using BoA outperformed the baselines with statistical
significance in the task of video event retrieval using the EVVE dataset.
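The BoA construction above reduces to two steps: score every frame with a pre-trained classifier, then max-pool the scores over time. A minimal sketch with a random linear model playing the role of the OverFeat classifier (the real attribute space has 1000 ImageNet categories; 10 here):

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(z):
    """Row-wise softmax: per-frame class probability scores."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Stand-in for per-frame CNN outputs: a random linear model over toy
# frame features plays the role of the pre-trained classifier.
n_frames, feat_dim, n_attributes = 30, 64, 10
frames = rng.normal(size=(n_frames, feat_dim))
W = rng.normal(size=(feat_dim, n_attributes))

frame_scores = softmax(frames @ W)    # (frames, attributes): high-level codes
boa = frame_scores.max(axis=0)        # max pooling -> Bag of Attributes
```

Each dimension of `boa` is the strongest evidence, across the whole video, for one semantic category, which is what lets a single compact vector encode multiple concepts.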
Are visual dictionaries generalizable?
Mid-level features based on visual dictionaries are today a cornerstone of
systems for classification and retrieval of images. Those state-of-the-art
representations depend crucially on the choice of a codebook (visual
dictionary), which is usually derived from the dataset. In general-purpose,
dynamic image collections (e.g., the Web), one cannot have the entire
collection in order to extract a representative dictionary. However, based on
the hypothesis that the dictionary reflects only the diversity of low-level
appearances and does not capture semantics, we argue that a dictionary based on
a small subset of the data, or even on an entirely different dataset, is able
to produce a good representation, provided that the chosen images span a
diverse enough portion of the low-level feature space. Our experiments confirm
that hypothesis, opening the opportunity to greatly alleviate the burden in
generating the codebook, and confirming the feasibility of employing visual
dictionaries in large-scale dynamic environments.
Local Similarities, Global Coding: An Algorithm for Feature Coding and its Applications
Data coding, as a building block of several image processing algorithms, has
received great attention recently. Indeed, the importance of the locality
assumption in coding approaches is studied in numerous works and several
methods are proposed based on this concept. We probe this assumption and claim
that taking the similarity between a data point and a more global set of anchor
points does not necessarily weaken the coding method as long as the underlying
structure of the anchor points is taken into account. Based on this fact, we
propose to capture this underlying structure by assuming a random walker over
the anchor points. We show that our method is a fast approximate learning
algorithm based on the diffusion map kernel. The experiments on various
datasets show that making different state-of-the-art coding algorithms aware of
this structure boosts their performance on different learning tasks.
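One way to read the random-walker idea is: compute local similarities to the anchors, then propagate them a few steps along a random-walk transition matrix over the anchor graph, so that globally related anchors also receive weight. The sketch below is this interpretation only (RBF affinities, fixed diffusion time `t`), not the paper's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(5)

def rbf(A, B, gamma=0.5):
    """RBF affinities between the rows of A and the rows of B."""
    return np.exp(-gamma * ((A[:, None] - B[None]) ** 2).sum(-1))

anchors = rng.normal(size=(20, 8))   # anchor (dictionary) points
x = rng.normal(size=(1, 8))          # data point to encode

# Random-walk transition matrix over the anchors: row-normalized affinities.
W = rbf(anchors, anchors)
P = W / W.sum(axis=1, keepdims=True)

# Local similarities to the anchors, diffused t steps along the anchor
# graph; this spreads weight from nearby anchors to globally related ones.
s = rbf(anchors, x).ravel()
t = 3
code = s @ np.linalg.matrix_power(P, t)
code = code / code.sum()
```

With `t = 0` this degenerates to purely local similarity coding; larger `t` mixes in more of the anchor set's global structure, which is the trade-off the abstract argues need not weaken the code.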
Crowd Counting via Weighted VLAD on Dense Attribute Feature Maps
Crowd counting is an important task in computer vision, which has many
applications in video surveillance. Although the regression-based framework has
achieved great improvements for crowd counting, how to improve the
discriminative power of image representation is still an open problem.
Conventional holistic features used in crowd counting often fail to capture
semantic attributes and spatial cues of the image. In this paper, we propose
integrating semantic information into learning locality-aware feature sets for
accurate crowd counting. First, with the help of convolutional neural network
(CNN), the original pixel space is mapped onto a dense attribute feature map,
where each dimension of the pixel-wise feature indicates the probabilistic
strength of a certain semantic class. Then, locality-aware features (LAF) built
on the idea of spatial pyramids on neighboring patches are proposed to explore
more spatial context and local information. Finally, the traditional VLAD
encoding method is extended to a more generalized form in which diverse
coefficient weights are taken into consideration. Experimental results validate
the effectiveness of our method.
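The generalized VLAD step can be sketched as follows: instead of hard 0/1 assignments, each descriptor contributes its residual to every center with a soft coefficient weight. This is a minimal, hypothetical weighting (Gaussian soft-assignment); the paper's weighting scheme may differ:

```python
import numpy as np

rng = np.random.default_rng(6)

def weighted_vlad(X, C, beta=1.0):
    """VLAD with soft coefficient weights instead of hard 0/1 assignment."""
    k, d = C.shape
    # Soft weights of each descriptor to each center (rows sum to 1).
    dist = ((X[:, None] - C[None]) ** 2).sum(-1)
    w = np.exp(-beta * dist)
    w = w / w.sum(axis=1, keepdims=True)
    # Weighted residual aggregation per center.
    V = np.zeros((k, d))
    for j in range(k):
        V[j] = (w[:, j:j+1] * (X - C[j])).sum(axis=0)
    V = np.sign(V) * np.sqrt(np.abs(V))          # power normalization
    return (V / (np.linalg.norm(V) + 1e-12)).ravel()

X = rng.normal(size=(50, 8))                     # pixel-wise attribute features
C = X[rng.choice(50, 4, replace=False)]          # 4 cluster centers
v = weighted_vlad(X, C)
```

Setting `beta` very large recovers standard VLAD (weights collapse to the nearest center), so the weighted form is strictly more general.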
Generic Image Classification Approaches Excel on Face Recognition
The main finding of this work is that the standard image classification
pipeline, which consists of dictionary learning, feature encoding, spatial
pyramid pooling and linear classification, outperforms all state-of-the-art
face recognition methods on the tested benchmark datasets (we have tested on
AR, Extended Yale B, the challenging FERET, and LFW-a datasets). This
surprising and prominent result suggests that those advances in generic image
classification can be directly applied to improve face recognition systems. In
other words, face recognition may not need to be viewed as a separate object
classification problem.
While recently a large body of residual based face recognition methods focus
on developing complex dictionary learning algorithms, in this work we show that
a dictionary of randomly extracted patches (even from non-face images) can
achieve very promising results using the image classification pipeline. That
means, the choice of dictionary learning methods may not be important. Instead,
we find that learning multiple dictionaries using different low-level image
features often improves the final classification accuracy. Our proposed face
recognition approach offers the best reported results on the widely-used face
recognition benchmark datasets. In particular, on the challenging FERET and
LFW-a datasets, we improve the best reported accuracies in the literature by
about 20% and 30%, respectively.
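The key ingredient above (a dictionary of randomly extracted patches, possibly from non-face images) is simple to sketch. The helper name `random_patch_dictionary` and the normalization choices (DC removal, unit norm) are illustrative assumptions, not necessarily the paper's exact preprocessing:

```python
import numpy as np

rng = np.random.default_rng(7)

def random_patch_dictionary(images, n_atoms=64, patch=8):
    """Dictionary of randomly extracted, normalized image patches."""
    atoms = []
    for _ in range(n_atoms):
        img = images[rng.integers(len(images))]
        y = rng.integers(img.shape[0] - patch + 1)
        x = rng.integers(img.shape[1] - patch + 1)
        p = img[y:y+patch, x:x+patch].ravel().astype(float)
        p -= p.mean()                          # remove DC component
        p /= np.linalg.norm(p) + 1e-12         # unit-norm atoms
        atoms.append(p)
    return np.array(atoms)

# Any images will do -- the abstract notes even non-face images work.
images = [rng.integers(0, 256, size=(32, 32)) for _ in range(5)]
D = random_patch_dictionary(images)
```

Such a dictionary then slots into the same encode/pool/classify pipeline as a learned one, which is the point of the comparison.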
A Bag-of-Words Equivalent Recurrent Neural Network for Action Recognition
The traditional bag-of-words approach has found a wide range of applications
in computer vision. The standard pipeline consists of a generation of a visual
vocabulary, a quantization of the features into histograms of visual words, and
a classification step for which usually a support vector machine in combination
with a non-linear kernel is used. Given large amounts of data, however, the
model suffers from a lack of discriminative power. This applies particularly
for action recognition, where the vast amount of video features needs to be
subsampled for unsupervised visual vocabulary generation. Moreover, the kernel
computation can be very expensive on large datasets. In this work, we propose a
recurrent neural network that is equivalent to the traditional bag-of-words
approach but enables discriminative training. The model further allows the
kernel computation to be incorporated into the network directly, solving the
complexity issue and allowing the complete classification system to be
represented within a single network. We evaluate our method on four
recent action recognition benchmarks and show that the conventional model as
well as sparse coding methods are outperformed.
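One way to see the claimed equivalence: a bag-of-words histogram is a recurrence that adds each frame's visual-word assignment to a running state. The sketch below is this interpretation only, with a differentiable soft assignment in place of hard quantization (making the centers trainable); it is not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(8)

def soft_bow_step(h, x, C, tau=1.0):
    """One recurrent step: add the soft visual-word assignment of frame x."""
    d = ((C - x) ** 2).sum(axis=1)
    a = np.exp(-d / tau)
    return h + a / a.sum()

C = rng.normal(size=(16, 32))        # "visual word" centers (trainable)
frames = rng.normal(size=(40, 32))   # per-frame features of one video

h = np.zeros(16)
for x in frames:                     # the recurrence accumulates a histogram
    h = soft_bow_step(h, x, C)
h = h / len(frames)                  # normalized soft histogram
```

Because every step is differentiable, the centers (and any classifier stacked on `h`) can be trained discriminatively end to end, which hard quantization forbids.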