Attentive Adversarial Learning for Domain-Invariant Training
Adversarial domain-invariant training (ADIT) proves to be effective in
suppressing the effects of domain variability in acoustic modeling and has led
to improved performance in automatic speech recognition (ASR). In ADIT, an
auxiliary domain classifier takes in equally-weighted deep features from a deep
neural network (DNN) acoustic model and is trained to improve their
domain-invariance by optimizing an adversarial loss function. In this work, we
propose an attentive ADIT (AADIT) in which we advance the domain classifier
with an attention mechanism to automatically weight the input deep features
according to their importance in domain classification. With this attentive
re-weighting, AADIT can focus on the domain normalization of phonetic
components that are more susceptible to domain variability and generates deep
features with improved domain-invariance and senone-discriminativity over ADIT.
Most importantly, the attention block serves only as an external component to
the DNN acoustic model and is not involved in ASR, so AADIT can be used to
improve the acoustic modeling with any DNN architecture. More generally, the
same methodology can improve any adversarial learning system with an auxiliary
discriminator. Evaluated on the CHiME-3 dataset, AADIT achieves 13.6% and 9.3%
relative WER improvements, respectively, over a multi-conditional model and a
strong ADIT baseline. Comment: 5 pages, 1 figure, ICASSP 201
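The attentive re-weighting described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the scaled dot-product scoring, the `query` vector, and the function name `attentive_pool` are assumptions made for illustration only. The sketch contrasts an attention-weighted summary of deep features with the equal weighting that the abstract attributes to plain ADIT.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attentive_pool(features, query):
    """Weight T deep features of dimension D by attention scores.

    features: (T, D) array of deep features from the acoustic model.
    query:    (D,) hypothetical learned vector (an assumption, not from the paper).
    Returns the (T,) attention weights and the (D,) weighted feature summary.
    """
    scores = features @ query / np.sqrt(features.shape[1])
    weights = softmax(scores)
    context = weights @ features
    return weights, context

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 4))
query = rng.normal(size=4)

weights, context = attentive_pool(features, query)
uniform_context = features.mean(axis=0)  # equal weighting, as in plain ADIT
```

In this sketch the weighted summary would be the input to the domain classifier. With an all-zero query the weights become uniform and the summary reduces to the equally weighted mean.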
A Supervised Neural Autoregressive Topic Model for Simultaneous Image Classification and Annotation
Topic modeling based on latent Dirichlet allocation (LDA) has been a
framework of choice to perform scene recognition and annotation. Recently, a
new type of topic model called the Document Neural Autoregressive Distribution
Estimator (DocNADE) was proposed and demonstrated state-of-the-art performance
for document modeling. In this work, we show how to successfully apply and
extend this model to the context of visual scene modeling. Specifically, we
propose SupDocNADE, a supervised extension of DocNADE, that increases the
discriminative power of the hidden topic features by incorporating label
information into the training objective of the model. We also describe how to
leverage information about the spatial position of the visual words and how to
embed additional image annotations, so as to simultaneously perform image
classification and annotation. We test our model on the Scene15, LabelMe and
UIUC-Sports datasets and show that it compares favorably to other topic models
such as the supervised variant of LDA. Comment: 13 pages, 5 figures
Single and multiple object tracking using a multi-feature joint sparse representation
In this paper, we propose a tracking algorithm based on a multi-feature joint sparse representation. The templates for the sparse representation can include pixel values, textures, and edges. In the multi-feature joint optimization, noise or occlusion is dealt with using a set of trivial templates. A sparse weight constraint is introduced to dynamically select the relevant templates from the full set of templates. A variance ratio measure is adopted to adaptively adjust the weights of different features. The multi-feature template set is updated adaptively. We further propose an algorithm for tracking multiple objects with occlusion handling based on the multi-feature joint sparse reconstruction. The observation model based on sparse reconstruction automatically focuses on the visible parts of an occluded object by using the information in the trivial templates. The multi-object tracking is simplified into a joint Bayesian inference. The experimental results show the superiority of our algorithm over several state-of-the-art tracking algorithms.
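The role of trivial templates in a sparse representation can be sketched for a single feature as follows. This is a simplified illustration, not the paper's multi-feature joint optimization: the L1-regularized least-squares objective, the ISTA solver, and all names and parameter values are assumptions. The dictionary stacks target templates next to an identity matrix, so occluded pixels can be absorbed by the trivial (identity) columns.

```python
import numpy as np

def soft_threshold(x, t):
    # Proximal operator of the L1 norm.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def objective(A, y, c, lam):
    # 0.5 * ||A c - y||^2 + lam * ||c||_1
    return 0.5 * np.sum((A @ c - y) ** 2) + lam * np.sum(np.abs(c))

def ista(A, y, lam, n_iter=500):
    # Iterative shrinkage-thresholding with step size 1/L, where L is the
    # Lipschitz constant of the gradient of the least-squares term.
    L = np.linalg.norm(A, 2) ** 2
    c = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ c - y)
        c = soft_threshold(c - grad / L, lam / L)
    return c

rng = np.random.default_rng(0)
d, k, lam = 20, 5, 0.05                 # pixels, target templates, L1 weight
T = rng.normal(size=(d, k))
T /= np.linalg.norm(T, axis=0)          # unit-norm target templates
A = np.hstack([T, np.eye(d)])           # target templates + trivial templates

y = T[:, 2].copy()                      # observation generated by template 2
y[:3] += 1.0                            # simulated occlusion on three pixels

c = ista(A, y, lam)
target_coeffs, trivial_coeffs = c[:k], c[k:]
```

Here `target_coeffs` indicates which target templates explain the observation, while `trivial_coeffs` carries the part attributed to noise or occlusion.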
Discriminative conditional restricted Boltzmann machine for discrete choice and latent variable modelling
Conventional methods of estimating latent behaviour generally use attitudinal
questions which are subjective and these survey questions may not always be
available. We hypothesize that an alternative approach can be used for latent
variable estimation through undirected graphical models, for instance
non-parametric artificial neural networks. In this study, we explore the use of
generative non-parametric modelling methods to estimate latent variables from
prior choice distribution without the conventional use of measurement
indicators. A restricted Boltzmann machine is used to represent latent
behaviour factors by analyzing the relationship information between the
observed choices and explanatory variables. The algorithm is adapted for latent
behaviour analysis in a discrete choice scenario and we use a graphical approach
to evaluate and understand the semantic meaning of the estimated parameter
vector values. We illustrate our methodology on a financial instrument choice
dataset and perform statistical analysis on parameter sensitivity and
stability. Our findings, supported by non-parametric statistical tests, show
that machine learning methods can extract useful information on the behaviour
of latent constructs, and that these constructs have a strong and significant
influence on the choice process. Furthermore, our modelling framework is
robust to input variability, as shown through sampling and validation.
Spatiotemporal Stacked Sequential Learning for Pedestrian Detection
Pedestrian classifiers decide which image windows contain a pedestrian. In
practice, such classifiers provide a relatively high response at neighbor
windows overlapping a pedestrian, while the responses around potential false
positives are expected to be lower. Analogous reasoning applies to image
sequences. If there is a pedestrian located within a frame, the same pedestrian
is expected to appear close to the same location in neighbor frames. Therefore,
such a location is likely to receive high classification scores over
several frames, while false positives are expected to be more spurious. In this
paper we propose to exploit such correlations for improving the accuracy of
base pedestrian classifiers. In particular, we propose to use two-stage
classifiers which not only rely on the image descriptors required by the base
classifiers but also on the response of such base classifiers in a given
spatiotemporal neighborhood. More specifically, we train pedestrian classifiers
using a stacked sequential learning (SSL) paradigm. We use a new pedestrian
dataset we have acquired from a car to evaluate our proposal at different frame
rates. We also test on a well-known dataset, Caltech. The obtained results show
that our SSL proposal boosts detection accuracy significantly with a minimal
impact on the computational cost. Interestingly, SSL improves accuracy most
in the most dangerous situations, i.e. when a pedestrian is close to the
camera. Comment: 8 pages, 5 figures, 1 table
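The two-stage idea in this abstract, where a second classifier sees both the image descriptor and the base classifier's responses in a spatiotemporal neighborhood, can be sketched as a feature-extension step. The grid layout of windows, the cubic neighborhood, the edge padding, and the name `extend_features` are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

def extend_features(descriptors, scores, radius=1):
    """Append neighboring base-classifier scores to each window descriptor.

    descriptors: (F, H, W, D) descriptors for windows on an H x W grid over F frames.
    scores:      (F, H, W) base-classifier responses for the same windows.
    Returns an (F, H, W, D + (2 * radius + 1) ** 3) array.
    """
    r = radius
    n_frames, height, width = scores.shape
    padded = np.pad(scores, r, mode="edge")  # replicate border scores
    neighbors = []
    for df in range(2 * r + 1):              # temporal offset
        for dy in range(2 * r + 1):          # vertical offset
            for dx in range(2 * r + 1):      # horizontal offset
                neighbors.append(
                    padded[df:df + n_frames, dy:dy + height, dx:dx + width]
                )
    neighbors = np.stack(neighbors, axis=-1)
    return np.concatenate([descriptors, neighbors], axis=-1)

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(4, 5, 6, 8))
scores = rng.normal(size=(4, 5, 6))
extended = extend_features(descriptors, scores, radius=1)
```

A second-stage classifier would then be trained on `extended` instead of `descriptors`. Note that this sketch uses both past and future frames in the neighborhood, which is a simplification.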
Compound Models for Vision-Based Pedestrian Recognition
This thesis addresses the problem of recognizing pedestrians in video images acquired from a moving camera in real-world cluttered environments. Instead of focusing on the development of novel feature primitives or pattern classifiers, we follow an orthogonal direction and develop feature- and classifier-independent compound techniques which integrate complementary information from multiple image-based sources with the objective of improved pedestrian classification performance. After establishing a performance baseline in terms of a thorough experimental study on monocular pedestrian recognition, we investigate the use of multiple cues at the module level. A motion-based focus of attention stage is proposed based on a learned probabilistic pedestrian-specific model of motion features. The model is used to generate pedestrian localization hypotheses for subsequent shape- and texture-based classification modules. In the remainder of this work, we focus on the integration of complementary information directly into the pattern classification step. We present a combination of shape and texture information by means of pose-specific generative shape and texture models. The generative models are integrated with discriminative classification models by utilizing synthesized virtual pedestrian training samples from the former to enhance the classification performance of the latter. Both models are linked using Active Learning to guide the training process towards informative samples. A multi-level mixture-of-experts classification framework is proposed which involves local pose-specific expert classifiers operating on multiple image modalities and features. In terms of image modalities, we consider gray-level intensity, depth cues derived from dense stereo vision and motion cues arising from dense optical flow. We furthermore employ shape-based, gradient-based and texture-based features. The mixture-of-experts formulation compares favorably to joint space approaches in terms of performance and practical feasibility. Finally, we extend this mixture-of-experts framework in terms of multi-cue partial occlusion handling and the estimation of pedestrian body orientation. Our occlusion model involves examining occlusion boundaries which manifest in discontinuities in depth and motion space. Occlusion-dependent weights which relate to the visibility of certain body parts focus the decision on unoccluded body components. We further apply the pose-specific nature of our mixture-of-experts framework towards estimating the density of pedestrian body orientation from single images, again integrating shape and texture information. Throughout this work, particular emphasis is laid on thorough performance evaluation both regarding methodology and competitive real-world datasets. Several datasets used in this thesis are made publicly available for benchmarking purposes. Our results indicate significant performance boosts over the state of the art for all aspects considered in this thesis, i.e. pedestrian recognition, partial occlusion handling and body orientation estimation. The pedestrian recognition performance in particular is considerably advanced; false detections at constant detection rates are reduced by significantly more than an order of magnitude.