A Generative Model for Parts-based Object Segmentation
The Shape Boltzmann Machine (SBM) [1] has recently been introduced as a state-of-the-art model of foreground/background object shape. We extend the SBM to account for the foreground object's parts. Our new model, the Multinomial SBM (MSBM), can capture both local and global statistics of part shapes accurately. We combine the MSBM with an appearance model to form a fully generative model of images of objects. Parts-based object segmentations are obtained simply by performing probabilistic inference in the model. We apply the model to two challenging datasets which exhibit significant shape and appearance variability, and find that it obtains results that are comparable to the state of the art. There has been significant focus in computer vision on object recognition and detection (e.g. [2]), but a strong desire remains to obtain richer descriptions of objects than just their bounding boxes. One such description is a parts-based object segmentation, in which an image is partitioned into multiple sets of pixels, each belonging to either a part of the object of interest, or its background. The significance of parts in computer vision has been recognized since the earliest days of th
Improving Deep Representation Learning with Complex and Multimodal Data.
Representation learning has emerged as a way to learn meaningful representations from data and has led to breakthroughs in many applications, including visual object recognition, speech recognition, and text understanding. However, learning representations from complex high-dimensional sensory data is challenging since there exist many irrelevant factors of variation (e.g., data transformation, random noise). On the other hand, to build an end-to-end prediction system for structured output variables, one needs to incorporate probabilistic inference to properly model a mapping from a single input to possible configurations of output variables. This thesis addresses limitations of current representation learning in two parts.
The first part discusses efficient learning algorithms for invariant representations based on restricted Boltzmann machines (RBMs). Pointing out the difficulty of learning, we develop an efficient initialization method for sparse and convolutional RBMs. On top of that, we develop variants of the RBM that learn representations invariant to data transformations such as translation, rotation, or scale variation by pooling the filter responses of input data after a transformation, or to irrelevant patterns such as random or structured noise, by jointly performing feature selection and feature learning. We demonstrate improved performance on visual object recognition and weakly supervised foreground object segmentation.
The second part discusses conditional graphical models and learning frameworks for structured output variables using deep generative models as priors. For example, we combine the best properties of the CRF and the RBM to enforce both local and global (e.g., object shape) consistencies for visual object segmentation. Furthermore, we develop a deep conditional generative model of structured output variables, which is an end-to-end system trainable by backpropagation. We demonstrate the importance of global prior and probabilistic inference for visual object segmentation. Second, we develop a novel multimodal learning framework by casting the problem into structured output representation learning problems, where the output is one data modality to be predicted from the other modalities, and vice versa. We explain how our method could be more effective than maximum likelihood learning and demonstrate state-of-the-art performance on visual-text and visual-only recognition tasks.
PhD, Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/113549/1/kihyuks_1.pd
Learning generative models of mid-level structure in natural images
Natural images arise from complicated processes involving many factors of variation.
They reflect the wealth of shapes and appearances of objects in our three-dimensional
world, but they are also affected by factors such as distortions due to perspective, occlusions,
and illumination, giving rise to structure with regularities at many different
levels. Prior knowledge about these regularities and suitable representations that allow
efficient reasoning about the properties of a visual scene are important for many image
processing and computer vision tasks. This thesis focuses on models of image structure
at intermediate levels of complexity as required, for instance, for image inpainting
or segmentation. It aims at developing generative, probabilistic models of this kind of
structure, and, in particular, at devising strategies for learning such models in a largely
unsupervised manner from data.
One hallmark of natural images is that they can often be decomposed into regions
with very different visual characteristics. The main approach of this thesis is therefore
to represent images in terms of regions that are characterized by their shapes and
appearances, and an image is then composed from many such regions. We explore
approaches to learn about the appearance of regions, to learn about region shapes, and
ways to combine several regions to form a full image. To achieve this goal, we make
use of some ideas for unsupervised learning developed in the literature on models of
low-level image structure and in the "deep learning" literature. These models are used
as building blocks of more structured model formulations that incorporate additional
prior knowledge of how images are formed.
The thesis makes the following contributions: Firstly, we investigate a popular,
MRF-based prior of natural image structure, the Field-of-Experts, with respect to its
ability to model image textures, and propose an extended formulation that is considerably
more successful at this task. This formulation gives rise to a fully parametric,
translation-invariant probabilistic generative model of image textures. We illustrate
how this model can be used as a component of a more comprehensive model of images
comprising multiple textured regions. Secondly, we develop a model of region shape.
This work is an extension of the "Masked Restricted Boltzmann Machine" proposed by
Le Roux et al. (2011) and it allows explicit reasoning about the independent shapes and
relative depths of occluding objects. We develop an inference and unsupervised learning
scheme and demonstrate how this shape model, in combination with the masked
RBM, gives rise to a good model of natural image patches. Finally, we demonstrate how
this model of region shape can be extended to model shapes in large images. The
result is a generative model of large images which are formed by composition from
many small, partially overlapping and occluding objects
Complex-Valued Autoencoders for Object Discovery
Object-centric representations form the basis of human perception and enable
us to reason about the world and to systematically generalize to new settings.
Currently, most machine learning work on unsupervised object discovery focuses
on slot-based approaches, which explicitly separate the latent representations
of individual objects. While the result is easily interpretable, it usually
requires the design of involved architectures. In contrast to this, we propose
a distributed approach to object-centric representations: the Complex
AutoEncoder. Following a coding scheme theorized to underlie object
representations in biological neurons, its complex-valued activations represent
two messages: their magnitudes express the presence of a feature, while the
relative phase differences between neurons express which features should be
bound together to create joint object representations. We show that this simple
and efficient approach achieves better reconstruction performance than an
equivalent real-valued autoencoder on simple multi-object datasets.
Additionally, we show that it achieves competitive unsupervised object
discovery performance to a SlotAttention model on two datasets, and manages to
disentangle objects in a third dataset where SlotAttention fails - all while
being 7-70 times faster to train
Holistic interpretation of visual data based on topology: semantic segmentation of architectural facades
The work presented in this dissertation is a step towards effectively incorporating contextual knowledge in the task of semantic segmentation. To date, the use of context has been confined to the genre of the scene with a few exceptions in the field. Research has been directed towards enhancing appearance descriptors. While this is unarguably important, recent studies show that computer vision has reached a near-human level of performance in relying on these descriptors when objects have stable distinctive surface properties and in proper imaging conditions. When these conditions are not met, humans exploit their knowledge about the intrinsic geometric layout of the scene to make local decisions. Computer vision lags behind when it comes to this asset. For this reason, we aim to bridge the gap by presenting algorithms for semantic segmentation of building facades making use of scene topological aspects. We provide a classification scheme to carry out segmentation and recognition simultaneously. The algorithm is able to solve a single optimization function and yield a semantic interpretation of facades, relying on the modeling power of probabilistic graphs and efficient discrete combinatorial optimization tools. We tackle the same problem of semantic facade segmentation with the neural network approach. We attain accuracy figures that are on par with the state of the art in a fully automated pipeline. The pipeline starts from pixelwise classifications obtained via Convolutional Neural Networks (CNN), which are then structurally validated through a cascade of Restricted Boltzmann Machines (RBM) and Multi-Layer Perceptron (MLP) that regenerates the most likely layout. In the domain of architectural modeling, there is geometric multi-model fitting. We introduce a novel guided sampling algorithm based on Minimum Spanning Trees (MST), which surpasses other propagation techniques in terms of robustness to noise.
We make a number of additional contributions, such as a measure of model deviation which captures variations among fitted models
Top-Down Selection in Convolutional Neural Networks
Feedforward information processing fills the role of hierarchical feature encoding, transformation, reduction, and abstraction in a bottom-up manner. This paradigm of information processing is sufficient for task requirements that are satisfied in the one-shot rapid traversal of sensory information through the visual hierarchy. However, some tasks demand higher-order information processing using short-term recurrent, long-range feedback, or other processes. The predictive, corrective, and modulatory information processing in top-down fashion complements the feedforward pass to fulfill many complex task requirements. Convolutional neural networks have recently been successful in addressing some aspects of the feedforward processing. However, the role of top-down processing in such models has not yet been fully understood. We propose a top-down selection framework for convolutional neural networks to address the selective and modulatory nature of top-down processing in vision systems. We examine various aspects of the proposed model in different experimental settings such as object localization, object segmentation, task priming, compact neural representation, and contextual interference reduction. We test the hypothesis that the proposed approach is capable of accomplishing hierarchical feature localization according to task cuing. Additionally, feature modulation using the proposed approach is tested for demanding tasks such as segmentation and iterative parameter fine-tuning. Moreover, the top-down attentional traces are harnessed to enable a more compact neural representation. The experimental achievements support the practical complementary role of the top-down selection mechanisms to the bottom-up feature encoding routines
Generative probabilistic models for object segmentation
One of the long-standing open problems in machine vision has been the task of "object segmentation", in which an image is partitioned into two sets of pixels: those that belong to the object of interest, and those that do not. A closely related task is that of "parts-based object segmentation", where additionally each of the object's pixels is labelled as belonging to one of several predetermined parts. There is broad agreement that segmentation is coupled to the task of object recognition. Knowledge of the object's class can lead to more accurate segmentations, and in turn accurate segmentations can be used to obtain higher recognition rates. In this thesis we focus on one side of this relationship: given the object's class and its bounding box, how accurately can we segment it? Segmentation is challenging primarily due to the huge amount of variability one sees in images of natural scenes. A large number of factors combine in complex ways to generate the pixel intensities that make up any given image. In this work we approach the problem by developing generative probabilistic models of the objects in question. Not only does this allow us to express notions of variability and uncertainty in a principled way, but also to separate the problems of model design and inference. The thesis makes the following contributions: First, we demonstrate an explicit probabilistic model of images of objects based on a latent Gaussian model of shape. This can be learned from images in an unsupervised fashion. Through experiments on a variety of datasets we demonstrate the advantages of explicitly modelling shape variability. We then focus on the task of constructing more accurate models of shape. We present a type of layered probabilistic model that we call a Shape Boltzmann Machine (SBM) for the task of modelling foreground/background (binary) and parts-based (categorical) shapes.
We demonstrate that it constitutes the state of the art and characterises a "strong" model of shape, in that samples from the model look realistic and that it generalises to generate samples that differ from training examples. Finally, we demonstrate how the SBM can be used in conjunction with an appearance model to form a fully generative model of images of objects. We show how parts-based object segmentations can be obtained simply by performing probabilistic inference in this joint model. We apply the model to several challenging datasets and find that its performance is comparable to the state of the art
Visual Feature Learning
Categorization is a fundamental problem of many computer vision applications, e.g., image
classification, pedestrian detection and face recognition. The robustness of a categorization
system heavily relies on the quality of features, by which data are represented. Prior
work on feature extraction can be grouped into different levels, which, in bottom-up order,
are low-level features (e.g., pixels and gradients) and middle/high-level features (e.g., the
BoW model and sparse coding). Low-level features can be directly extracted from images
or videos, while middle/high-level features are constructed upon low-level features, and are
designed to enhance the capability of categorization systems based on different considerations
(e.g., guaranteeing the domain-invariance and improving the discriminative power).
This thesis focuses on the study of visual feature learning. Challenges that remain in designing
visual features lie in intra-class variation, occlusions, illumination and view-point
changes and insufficient prior knowledge. To address these challenges, I present several
visual feature learning methods, where these methods cover the following sub-topics: (i)
I start by introducing a segmentation-based object recognition system. (ii) When training
data are insufficient, I seek data from other resources, which include images or videos in a
different domain, actions captured from a different viewpoint and information in a different
media form. In order to appropriately transfer such resources into the target categorization
system, four transfer-learning-based feature learning methods are presented in this section,
covering cross-view, cross-domain and cross-modality scenarios.
(iii) Finally, I present a random-forest-based feature fusion method for multi-view
action recognition
ATTENTION-BASED CONVOLUTIONAL NEURAL NETWORK MODEL AND ITS COMBINATION WITH FEW-SHOT LEARNING FOR AUDIO CLASSIFICATION
Environmental sound and acoustic scene classification are crucial tasks in audio signal
processing and audio pattern recognition. In recent years, deep learning methods such as
convolutional neural networks (CNN), recurrent neural networks (RNN), and their
combinations have achieved great success in such tasks. However, there are still numerous
challenges left to be addressed in this domain. For example, in most cases, the sound
events of interest are present in only a portion of the audio clip, and the clip
can also suffer from background noise. Furthermore, in many application scenarios
the amount of labelled training data is very limited, and few-shot learning methods,
especially prototypical networks, have achieved great success there. But
metric learning methods such as prototypical networks often suffer from poor feature
embeddings of support samples or outliers, and may not perform well on noisy data. Therefore,
the proposed work seeks to overcome the above limitations by introducing a multi-channel
temporal attention-based CNN model and then introducing a hybrid attention module into the
framework of prototypical networks. Additionally, a Π-model is integrated into our model
to improve performance on noisy data, and a new time-frequency feature is explored.
Various experiments have shown that our proposed framework is capable of dealing with the
above-mentioned issues and providing promising results.
Ph.D