Learning Mid-Level Representations for Visual Recognition
The objective of this thesis is to enhance visual recognition for objects and scenes through the development of novel mid-level representations and accompanying learning algorithms. In particular, this work focuses on category-level recognition, which remains a very challenging and largely unsolved task. One crucial component in visual recognition systems is the representation of objects and scenes. Moreover, depending on the representation, suitable learning strategies need to be developed that make it possible to learn new categories automatically from training data. The aim of this thesis is therefore to extend low-level representations with mid-level representations and to develop suitable learning mechanisms.
A popular kind of mid-level representation is higher-order statistics such as self-similarity and co-occurrence statistics. While these descriptors satisfy the demand for higher-level object representations, they also exhibit a very large and ever-increasing dimensionality. In this thesis a new object representation, based on curvature self-similarity, is suggested that goes beyond the currently popular approximation of objects using straight lines. However, like all descriptors using second-order statistics, it also exhibits a high dimensionality. Although this improves discriminability, the high dimensionality becomes a critical issue due to a lack of generalization ability and the curse of dimensionality. Given only a limited amount of training data, even sophisticated learning algorithms such as the popular kernel methods are not able to suppress noisy or superfluous dimensions of such high-dimensional data. Consequently, there is a natural need for feature selection when using present-day informative features and, particularly, curvature self-similarity. We therefore suggest an embedded feature selection method for support vector machines that reduces complexity and improves the generalization capability of object models. The proposed curvature self-similarity representation is successfully integrated, together with the embedded feature selection, into a widely used state-of-the-art object detection framework.
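As a concrete illustration of embedded feature selection (not the thesis's specific algorithm), the sketch below uses an L1-regularized linear SVM from scikit-learn: the sparsity penalty drives the weights of noisy or superfluous dimensions to exactly zero during training itself, so selection is embedded in the learning. The feature matrix is synthetic and stands in for high-dimensional descriptors such as curvature self-similarity.

```python
# A minimal sketch of embedded feature selection with a linear SVM.
# This is a generic stand-in, not the thesis's specific method; the
# descriptors X and labels y are synthetic placeholders.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))   # high-dimensional descriptors
y = rng.integers(0, 2, size=200)   # binary object/background labels

# L1-regularized SVM: feature selection happens inside training itself,
# as superfluous dimensions receive exactly-zero weights.
svm = LinearSVC(C=0.05, penalty="l1", dual=False, max_iter=5000).fit(X, y)
selector = SelectFromModel(svm, prefit=True)
X_reduced = selector.transform(X)
print(f"kept {X_reduced.shape[1]} of {X.shape[1]} dimensions")
```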
The influence of higher-order statistics on category-level object recognition is further investigated by learning co-occurrences between foreground and background to reduce the number of false detections. While the suggested curvature self-similarity descriptor improves the model by describing the foreground in more detail, higher-order statistics are now shown to be suitable for explicitly modeling the background as well.
This is of particular use for the popular chamfer matching technique, since it is prone
to accidental matches in dense clutter. As clutter only interferes with the foreground
model contour, we learn where to place the background contours with respect to the
foreground object boundary. The co-occurrence of background contours is integrated
into a max-margin framework. Thus the suggested approach combines the advantages of
accurately detecting object parts via chamfer matching and the robustness of max-margin
learning.
While chamfer matching is a very efficient technique for object detection, parts are only detected based on a simple distance measure. In contrast, mid-level parts and patches are explicitly trained to distinguish true positives in the foreground from false positives in the background. Because mid-level patches and parts are independent of each other, it is possible to train a large number of instance-specific part classifiers. This is contrary to the currently most powerful discriminative approaches, which are typically only feasible for a small number of parts, as they model the spatial dependencies between them. Due to their number, we cannot directly train a powerful classifier to combine all parts. Instead, parts are randomly grouped into fewer, overlapping compositions that are trained using a maximum-margin approach. In contrast to the common rationale of compositional approaches, we do not aim for semantically meaningful ensembles. Rather, we seek randomized compositions that are discriminative and generalize over all instances of a category. All compositions are finally combined by a non-linear decision function, which completes the hierarchy of discriminative classifiers, as sketched below.
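A minimal sketch of the randomized-composition idea, with synthetic data: part scores are randomly grouped into overlapping compositions, each composition is trained as a max-margin (linear SVM) model, and a non-linear classifier combines the composition responses. The group sizes and model choices are illustrative assumptions, not the thesis's exact configuration.

```python
# Randomized compositions over part-classifier scores; all data synthetic.
import numpy as np
from sklearn.svm import LinearSVC, SVC

rng = np.random.default_rng(0)
n_parts, n_compositions, parts_per_comp = 500, 50, 25
S = rng.normal(size=(300, n_parts))   # part scores per training image
y = rng.integers(0, 2, size=300)      # category labels

comp_scores = []
for _ in range(n_compositions):
    # Random subset of parts; subsets may overlap across compositions.
    idx = rng.choice(n_parts, size=parts_per_comp, replace=False)
    clf = LinearSVC(dual=False, max_iter=5000).fit(S[:, idx], y)  # max-margin composition
    comp_scores.append(clf.decision_function(S[:, idx]))

# Non-linear decision function on top of the composition responses.
F = np.stack(comp_scores, axis=1)
final = SVC(kernel="rbf").fit(F, y)
print("training accuracy:", final.score(F, y))
```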
In summary, this thesis improves visual recognition of objects and scenes by developing novel mid-level representations on top of different kinds of low-level representations. Furthermore, it investigates the development of suitable learning algorithms to deal with the new challenges that arise from the novel object representations presented in this work.
Mid-level Representation for Visual Recognition
Visual recognition is one of the fundamental challenges in AI, where the goal is to understand the semantics of visual data. Employing mid-level representations, in particular, has shifted the paradigm in visual recognition. Mid-level image/video representation involves discovering and training a set of mid-level visual patterns (e.g., parts and attributes) and representing a given image/video using them. The mid-level patterns can be extracted from images and videos using the motion and appearance information of visual phenomena. This thesis targets employing mid-level representations for different high-level visual recognition tasks, namely (i) image understanding and (ii) video understanding.
In the case of image understanding, we focus on the object detection/recognition task. We investigate discovering and learning a set of mid-level patches to be used for representing the images of an object category. We specifically employ the discriminative patches in a subcategory-aware, webly-supervised fashion. We additionally study the outcomes of employing the subcategory-based models for undoing dataset bias.
Fusion of Learned Multi-Modal Representations and Dense Trajectories for Emotional Analysis in Videos
When designing a video affective content analysis algorithm, one of the most important steps is the selection of discriminative features for the effective representation of video segments. The majority of existing affective content analysis methods either use low-level audio-visual features or generate handcrafted higher-level representations based on these low-level features. We propose in this work to use deep learning methods, in particular convolutional neural networks (CNNs), in order to automatically learn and extract mid-level representations from raw data. To this end, we exploit the audio and visual modality of videos by employing Mel-Frequency Cepstral Coefficients (MFCC) and color values in the HSV color space. We also incorporate dense trajectory based motion features in order to further enhance the performance of the analysis. By means of multi-class support vector machines (SVMs) and fusion mechanisms, music video clips are classified into one of four affective categories representing the four quadrants of the Valence-Arousal (VA) space. Results obtained on a subset of the DEAP dataset show (1) that higher-level representations perform better than low-level features, and (2) that incorporating motion information leads to a notable performance gain, independently of the chosen representation.
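The following is a minimal sketch of the decision-level fusion step only, assuming per-clip MFCC statistics (via librosa) and stand-in visual features: one multi-class SVM per modality is trained and their decision scores are averaged. The synthetic signals, clip count, and equal fusion weights are illustrative assumptions; the CNN mid-level learning and dense trajectories are abstracted away.

```python
# Late fusion of per-modality SVM scores; all inputs synthetic.
import numpy as np
import librosa
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_clips, sr = 40, 22050
y = rng.integers(0, 4, size=n_clips)       # four Valence-Arousal quadrants

# Audio modality: mean MFCC vector per clip (1 s of synthetic audio each).
audio_feats = np.stack([
    librosa.feature.mfcc(y=rng.normal(size=sr).astype(np.float32),
                         sr=sr, n_mfcc=13).mean(axis=1)
    for _ in range(n_clips)
])
# Visual modality: stand-in for per-clip HSV color statistics.
visual_feats = rng.normal(size=(n_clips, 48))

# One SVM per modality; decision-level fusion by averaging scores.
svm_a = SVC(decision_function_shape="ovr").fit(audio_feats, y)
svm_v = SVC(decision_function_shape="ovr").fit(visual_feats, y)
fused = 0.5 * svm_a.decision_function(audio_feats) + \
        0.5 * svm_v.decision_function(visual_feats)
pred = fused.argmax(axis=1)                # predicted VA quadrant per clip
```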
Supervised mid-level features for word image representation
This paper addresses the problem of learning word image representations:
given the cropped image of a word, we are interested in finding a descriptive,
robust, and compact fixed-length representation. Machine learning techniques
can then be supplied with these representations to produce models useful for
word retrieval or recognition tasks. Although many works have focused on the
machine learning aspect once a global representation has been produced, little
work has been devoted to the construction of those base image representations:
most works use standard coding and aggregation techniques directly on top of
standard computer vision features such as SIFT or HOG.
We propose to learn local mid-level features suitable for building word image
representations. These features are learnt by leveraging character bounding box
annotations on a small set of training images. However, contrary to other
approaches that use character bounding box information, our approach does not
rely on detecting the individual characters explicitly at testing time. Our
local mid-level features can then be aggregated to produce a global word image
signature. When pairing these features with the recent word attributes framework of Almazán et al., we obtain results comparable with or better than the state-of-the-art on matching and recognition tasks using global descriptors of only 96 dimensions.
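A minimal sketch of one way such variable-length local responses can be pooled into a fixed-length word signature, namely mean-pooling over a few horizontal regions. The region count and feature dimension here are hypothetical; the paper's actual aggregation and attribute pairing are more involved.

```python
# Fixed-length word signature from variable-length local features
# via region-wise mean pooling; sizes are illustrative assumptions.
import numpy as np

def word_signature(local_feats: np.ndarray, n_regions: int = 4) -> np.ndarray:
    """local_feats: (n_positions, d) mid-level responses along the word."""
    splits = np.array_split(local_feats, n_regions, axis=0)
    return np.concatenate([s.mean(axis=0) for s in splits])

feats = np.random.default_rng(0).normal(size=(37, 24))  # 37 positions, 24-d features
sig = word_signature(feats)
print(sig.shape)  # fixed length regardless of word width: (96,)
```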
Multi-Level Recurrent Residual Networks for Action Recognition
Most existing Convolutional Neural Networks (CNNs) used for action recognition are either difficult to optimize or underuse crucial temporal information. Inspired by the fact that recurrent models consistently make breakthroughs in sequence-related tasks, we propose novel Multi-Level Recurrent Residual Networks (MRRN) which incorporate three recognition streams. Each stream consists of a Residual Network (ResNet) and a recurrent model. The proposed model captures spatiotemporal information by employing both alternative ResNets to learn spatial representations from static frames and stacked Simple Recurrent Units (SRUs) to model temporal dynamics. The three streams, which learn low-, mid-, and high-level representations independently, are fused by computing a weighted average of their softmax scores to obtain complementary representations of the video. Unlike previous models which boost performance at the cost of time and space complexity, our models have a lower complexity by employing shortcut connections and are trained end-to-end with greater efficiency. MRRN displays significant performance improvements compared to CNN-RNN framework baselines and obtains performance comparable with the state-of-the-art, achieving 51.3% on the HMDB-51 dataset and 81.9% on the UCF-101 dataset without using additional data.
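The stream-fusion step can be sketched in a few lines, assuming precomputed logits from the three streams; the fusion weights, batch size, and class count are illustrative assumptions.

```python
# Weighted average of per-stream softmax scores; streams abstracted
# away as precomputed logits, weights assumed for illustration.
import torch
import torch.nn.functional as F

n_classes = 51                                          # e.g., HMDB-51
logits = [torch.randn(8, n_classes) for _ in range(3)]  # three streams, batch of 8
weights = torch.tensor([0.5, 0.3, 0.2])                 # assumed stream weights

probs = torch.stack([F.softmax(l, dim=1) for l in logits])  # (3, 8, n_classes)
fused = (weights.view(3, 1, 1) * probs).sum(dim=0)          # weighted average
pred = fused.argmax(dim=1)                                  # fused class prediction
```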
Satellite Image-based Localization via Learned Embeddings
We propose a vision-based method that localizes a ground vehicle using
publicly available satellite imagery as the only prior knowledge of the
environment. Our approach takes as input a sequence of ground-level images
acquired by the vehicle as it navigates, and outputs an estimate of the
vehicle's pose relative to a georeferenced satellite image. We overcome the
significant viewpoint and appearance variations between the images through a
neural multi-view model that learns location-discriminative embeddings in which
ground-level images are matched with their corresponding satellite view of the
scene. We use this learned function as an observation model in a filtering
framework to maintain a distribution over the vehicle's pose. We evaluate our
method on different benchmark datasets and demonstrate its ability to localize ground-level images in environments novel relative to training, despite the challenges of significant viewpoint and appearance variations.
Comment: To be published in IEEE International Conference on Robotics and Automation (ICRA), 201
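A minimal sketch of using such a learned embedding as the observation model of a particle filter, with synthetic embeddings: each particle's pose indexes a satellite-view embedding, and particle weights are updated by similarity to the current ground-image embedding. The Gaussian likelihood form and sigma are assumptions; the paper's filtering details may differ.

```python
# Embedding distance as a particle-filter observation likelihood.
import numpy as np

rng = np.random.default_rng(0)
n_particles, d = 1000, 128
weights = np.full(n_particles, 1.0 / n_particles)

ground_emb = rng.normal(size=d)                    # embedding of current ground image
particle_embs = rng.normal(size=(n_particles, d))  # satellite embeddings at particle poses

# Observation update: closer embeddings -> higher likelihood.
sigma = 5.0                                        # assumed likelihood bandwidth
dists = np.linalg.norm(particle_embs - ground_emb, axis=1)
weights *= np.exp(-0.5 * (dists / sigma) ** 2)
weights /= weights.sum()                           # renormalize the posterior
```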
Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Visuomotor Policies
How much does having visual priors about the world (e.g. the fact that the
world is 3D) assist in learning to perform downstream motor tasks (e.g.
delivering a package)? We study this question by integrating a generic
perceptual skill set (e.g. a distance estimator, an edge detector, etc.) within
a reinforcement learning framework--see Figure 1. This skill set (hereafter
mid-level perception) provides the policy with a more processed state of the
world compared to raw images.
We find that using mid-level perception confers significant advantages over
training end-to-end from scratch (i.e. not leveraging priors) in
navigation-oriented tasks. Agents are able to generalize to situations where
the from-scratch approach fails and training becomes significantly more sample
efficient. However, we show that realizing these gains requires careful
selection of the mid-level perceptual skills. Therefore, we refine our findings
into an efficient max-coverage feature set that can be adopted in lieu of raw
images. We perform our study in completely separate buildings for training and
testing and compare against visually blind baseline policies and
state-of-the-art feature learning methods.
Comment: See project website, demos, and code at http://perceptual.acto
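One way to picture the setup is an observation wrapper that replaces raw pixels with mid-level features before the policy sees them. The sketch below uses the classic gym API with a placeholder encoder; the paper employs a specific set of pretrained mid-level vision networks (depth, edges, etc.), which are abstracted away here.

```python
# Wrapping an environment so the policy receives mid-level features
# instead of raw images; the encoder is a hypothetical placeholder.
import gym
import numpy as np

class MidLevelWrapper(gym.ObservationWrapper):
    def __init__(self, env, encoder):
        super().__init__(env)
        self.encoder = encoder  # frozen, pretrained mid-level network (assumed)

    def observation(self, obs):
        # Replace the raw image with its mid-level representation.
        return self.encoder(obs)

# Hypothetical usage with a stand-in "encoder" (mean-pooled channels):
fake_encoder = lambda img: np.asarray(img, dtype=np.float32).mean(axis=(0, 1))
# env = MidLevelWrapper(gym.make("SomeNavigationEnv-v0"), fake_encoder)
```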
Hybrid CNN and Dictionary-Based Models for Scene Recognition and Domain Adaptation
Convolutional neural network (CNN) has achieved state-of-the-art performance
in many different visual tasks. Learned from a large-scale training dataset,
CNN features are much more discriminative and accurate than the hand-crafted
features. Moreover, CNN features are also transferable among different domains.
On the other hand, traditional dictionary-based features (such as BoW and SPM) contain much more local discriminative and structural information, which is implicitly embedded in the images. To further improve the performance, in this paper, we propose to combine CNN with dictionary-based models for scene
recognition and visual domain adaptation. Specifically, based on the well-tuned
CNN models (e.g., AlexNet and VGG Net), two dictionary-based representations
are further constructed, namely mid-level local representation (MLR) and
convolutional Fisher vector representation (CFV). In MLR, an efficient
two-stage clustering method, i.e., weighted spatial and feature space spectral
clustering on the parts of a single image followed by clustering all
representative parts of all images, is used to generate a class-mixture or a
class-specific part dictionary. After that, the part dictionary is used to operate on the multi-scale image inputs for generating mid-level
representation. In CFV, a multi-scale and scale-proportional GMM training
strategy is utilized to generate Fisher vectors based on the last convolutional
layer of CNN. By integrating the complementary information of MLR, CFV and the
CNN features of the fully connected layer, the state-of-the-art performance can
be achieved on scene recognition and domain adaptation problems. An interesting finding is that our proposed hybrid representation (from the VGG net trained on ImageNet) is also highly complementary with GoogLeNet and/or VGG-11 (trained on Places205).
Comment: Accepted by TCSVT on Sep.201
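A simplified sketch of a convolutional Fisher vector: local activations from the last conv layer are soft-assigned to a GMM, and first-order gradient statistics per component are concatenated. This keeps only first-order terms and uses a plain scikit-learn GMM in place of the paper's multi-scale, scale-proportional training; the activation shape is illustrative.

```python
# First-order Fisher vector over conv-layer activations (synthetic).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(14 * 14, 256))   # conv activations: 196 locations x 256 channels

K = 8
gmm = GaussianMixture(n_components=K, covariance_type="diag").fit(X)
gamma = gmm.predict_proba(X)          # (196, K) soft assignments

fv = []
for k in range(K):
    sigma = np.sqrt(gmm.covariances_[k])
    diff = (X - gmm.means_[k]) / sigma    # whitened residuals per component
    u_k = (gamma[:, k:k+1] * diff).sum(axis=0) / (X.shape[0] * np.sqrt(gmm.weights_[k]))
    fv.append(u_k)
fv = np.concatenate(fv)               # (K * 256,) Fisher vector for the image
```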
Scenarios: A New Representation for Complex Scene Understanding
The ability for computational agents to reason about the high-level content
of real world scene images is important for many applications. Existing
attempts at addressing the problem of complex scene understanding lack
representational power, efficiency, and the ability to create robust
meta-knowledge about scenes. In this paper, we introduce scenarios as a new way
of representing scenes. The scenario is a simple, low-dimensional, data-driven
representation consisting of sets of frequently co-occurring objects and is
useful for a wide range of scene understanding tasks. We learn scenarios from
data using a novel matrix factorization method which we integrate into a new
neural network architecture, the ScenarioNet. Using ScenarioNet, we can recover
semantic information about real world scene images at three levels of
granularity: 1) scene categories, 2) scenarios, and 3) objects. Training a
single ScenarioNet model enables us to perform scene classification, scenario
recognition, multi-object recognition, content-based scene image retrieval, and
content-based image comparison. In addition to solving many tasks in a single,
unified framework, ScenarioNet is more computationally efficient than other
CNNs because it requires significantly fewer parameters while achieving similar
performance on benchmark tasks and is more interpretable because it produces
explanations when making decisions. We validate the utility of scenarios and
ScenarioNet on a diverse set of scene understanding tasks on several benchmark
datasets.
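As a rough analogue of scenario discovery, one can factorize a binary scene-by-object occurrence matrix so that each latent component groups frequently co-occurring objects. Plain NMF below stands in for the paper's novel factorization method and its ScenarioNet integration; the object list and data are synthetic.

```python
# Scenario-like components via NMF on a scene-by-object matrix.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
objects = ["car", "road", "tree", "sofa", "tv", "table"]
# 100 scenes x 6 objects: two planted co-occurrence patterns plus noise.
A = (rng.random((100, 6)) < np.where(rng.random(100)[:, None] < 0.5,
                                     [.9, .9, .8, .1, .1, .1],
                                     [.1, .1, .1, .9, .9, .8])).astype(float)

model = NMF(n_components=2, init="nndsvda", random_state=0)
scene_scores = model.fit_transform(A)     # scenario activation per scene
for k, comp in enumerate(model.components_):
    top = [objects[i] for i in comp.argsort()[::-1][:3]]
    print(f"scenario {k}: {top}")         # e.g., street vs. living-room objects
```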
Mid-level Elements for Object Detection
Building on the success of recent discriminative mid-level elements, we
propose a surprisingly simple approach for object detection which performs comparably to the current state-of-the-art approaches on PASCAL VOC comp-3
detection challenge (no external data). Through extensive experiments and
ablation analysis, we show how our approach effectively improves upon the
HOG-based pipelines by adding an intermediate mid-level representation for the
task of object detection. This representation is easily interpretable and
allows us to visualize what our object detector "sees". We also discuss the
insights our approach shares with CNN-based methods, such as the finding that sharing representations between categories helps.