873 research outputs found

    Attentive Adversarial Learning for Domain-Invariant Training

    Adversarial domain-invariant training (ADIT) proves to be effective in suppressing the effects of domain variability in acoustic modeling and has led to improved performance in automatic speech recognition (ASR). In ADIT, an auxiliary domain classifier takes in equally-weighted deep features from a deep neural network (DNN) acoustic model and is trained to improve their domain-invariance by optimizing an adversarial loss function. In this work, we propose an attentive ADIT (AADIT) in which we advance the domain classifier with an attention mechanism to automatically weight the input deep features according to their importance in domain classification. With this attentive re-weighting, AADIT can focus on the domain normalization of phonetic components that are more susceptible to domain variability and generates deep features with improved domain-invariance and senone-discriminativity over ADIT. Most importantly, the attention block serves only as an external component to the DNN acoustic model and is not involved in ASR, so AADIT can be used to improve the acoustic modeling of any DNN architecture. More generally, the same methodology can improve any adversarial learning system with an auxiliary discriminator. Evaluated on the CHiME-3 dataset, AADIT achieves 13.6% and 9.3% relative WER improvements over a multi-conditional model and a strong ADIT baseline, respectively. Comment: 5 pages, 1 figure, ICASSP 201
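
    As a minimal sketch of the mechanism described above (not the authors' implementation), the snippet below combines a gradient-reversal layer with an attention block that re-weights the deep features fed to the auxiliary domain classifier. The choice of PyTorch, the layer sizes, and the softmax-over-feature-dimensions attention are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients on the way back."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class AttentiveDomainClassifier(nn.Module):
    """Auxiliary domain classifier with attention over the incoming deep features."""
    def __init__(self, feat_dim, n_domains, lam=1.0):
        super().__init__()
        self.lam = lam
        self.attn = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.Softmax(dim=-1))
        self.clf = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_domains))

    def forward(self, deep_feats):
        # Gradient reversal pushes the acoustic model toward domain-invariant features.
        h = GradReverse.apply(deep_feats, self.lam)
        # Attention weights emphasise components that matter for domain classification.
        w = self.attn(h)
        return self.clf(w * h)

# Toy usage: the adversarial domain loss would be added to the senone classification loss.
feats = torch.randn(32, 512)                          # deep features from the acoustic model
logits = AttentiveDomainClassifier(512, n_domains=4)(feats)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 4, (32,)))
loss.backward()
```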

    A Supervised Neural Autoregressive Topic Model for Simultaneous Image Classification and Annotation

    Topic modeling based on latent Dirichlet allocation (LDA) has been a framework of choice to perform scene recognition and annotation. Recently, a new type of topic model called the Document Neural Autoregressive Distribution Estimator (DocNADE) was proposed and demonstrated state-of-the-art performance for document modeling. In this work, we show how to successfully apply and extend this model to the context of visual scene modeling. Specifically, we propose SupDocNADE, a supervised extension of DocNADE that increases the discriminative power of the hidden topic features by incorporating label information into the training objective of the model. We also describe how to leverage information about the spatial position of the visual words and how to embed additional image annotations, so as to simultaneously perform image classification and annotation. We test our model on the Scene15, LabelMe and UIUC-Sports datasets and show that it compares favorably to other topic models such as the supervised variant of LDA. Comment: 13 pages, 5 figures
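
    The sketch below is a rough, simplified illustration of a supervised autoregressive topic objective in the spirit of SupDocNADE, not the authors' model: it sums an autoregressive visual-word reconstruction loss with a label cross-entropy computed on the full-document hidden state. PyTorch, sigmoid hidden units, the single fixed word ordering, and the omission of spatial-position information are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SupDocNADESketch(nn.Module):
    """Autoregressive word model plus a label predictor on the document hidden state."""
    def __init__(self, vocab_size, hid_dim, n_classes):
        super().__init__()
        self.W = nn.Embedding(vocab_size, hid_dim)   # visual-word embeddings
        self.c = nn.Parameter(torch.zeros(hid_dim))  # hidden bias
        self.V = nn.Linear(hid_dim, vocab_size)      # word reconstruction weights
        self.U = nn.Linear(hid_dim, n_classes)       # class label predictor

    def forward(self, words, label):
        # words: (doc_len,) visual-word indices in some fixed ordering.
        emb = self.W(words)
        # Prefix sums give hidden states conditioned on the preceding words only.
        prefix = torch.cumsum(emb, dim=0) - emb
        h = torch.sigmoid(self.c + prefix)
        nll = F.cross_entropy(self.V(h), words)                    # autoregressive word loss
        h_doc = torch.sigmoid(self.c + emb.sum(dim=0))             # full-document hidden state
        sup = F.cross_entropy(self.U(h_doc).unsqueeze(0), label)   # supervised label loss
        return nll + sup

model = SupDocNADESketch(vocab_size=1000, hid_dim=64, n_classes=15)
loss = model(torch.randint(0, 1000, (50,)), torch.tensor([3]))
loss.backward()
```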

    Single and multiple object tracking using a multi-feature joint sparse representation

    In this paper, we propose a tracking algorithm based on a multi-feature joint sparse representation. The templates for the sparse representation can include pixel values, textures, and edges. In the multi-feature joint optimization, noise or occlusion is dealt with using a set of trivial templates. A sparse weight constraint is introduced to dynamically select the relevant templates from the full set of templates. A variance ratio measure is adopted to adaptively adjust the weights of different features. The multi-feature template set is updated adaptively. We further propose an algorithm for tracking multiple objects with occlusion handling based on the multi-feature joint sparse reconstruction. The observation model based on sparse reconstruction automatically focuses on the visible parts of an occluded object by using the information in the trivial templates. Multi-object tracking is then simplified into a joint Bayesian inference problem. The experimental results show the superiority of our algorithm over several state-of-the-art tracking algorithms.
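
    As a minimal sketch of the core idea (one feature channel only, and not the paper's joint multi-feature solver), the snippet below scores a tracking candidate by its sparse reconstruction from object templates plus trivial identity templates. Using scikit-learn's Lasso as a stand-in L1 solver and an exp(-residual) observation likelihood are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

def candidate_score(y, templates, alpha=0.01, sigma=0.1):
    """y: (d,) candidate feature vector; templates: (d, n) object template matrix."""
    d, n = templates.shape
    # Trivial (identity) templates absorb pixels corrupted by noise or occlusion.
    dictionary = np.hstack([templates, np.eye(d)])
    coef = Lasso(alpha=alpha, positive=True, max_iter=5000).fit(dictionary, y).coef_
    obj_coef = coef[:n]
    # Score from the object-template reconstruction only, so occluded pixels
    # explained by trivial templates do not penalise the candidate.
    residual = np.linalg.norm(y - templates @ obj_coef) ** 2
    return np.exp(-residual / sigma)

templates = np.random.rand(100, 10)                     # 10 object templates for one feature
candidate = templates[:, 0] + 0.05 * np.random.rand(100)
print(candidate_score(candidate, templates))
```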

    Discriminative conditional restricted Boltzmann machine for discrete choice and latent variable modelling

    Conventional methods of estimating latent behaviour generally rely on attitudinal questions, which are subjective, and such survey questions may not always be available. We hypothesize that latent variables can instead be estimated through undirected graphical models, for instance non-parametric artificial neural networks. In this study, we explore the use of generative non-parametric modelling methods to estimate latent variables from the prior choice distribution without the conventional use of measurement indicators. A restricted Boltzmann machine is used to represent latent behaviour factors by analyzing the relationship information between the observed choices and explanatory variables. The algorithm is adapted for latent behaviour analysis in a discrete choice scenario, and we use a graphical approach to evaluate and understand the semantic meaning of the estimated parameter vector values. We illustrate our methodology on a financial instrument choice dataset and perform statistical analysis on parameter sensitivity and stability. Our findings, supported by non-parametric statistical tests, show that machine learning methods can extract useful latent information on the behaviour of latent constructs, which exert a strong and significant influence on the choice process. Furthermore, our modelling framework shows robustness to input variability through sampling and validation.
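
    To make the model concrete, the sketch below implements one contrastive-divergence (CD-1) update for a conditional RBM in which the observed choice (one-hot visible units) and the explanatory variables both feed binary hidden units that play the role of latent behaviour factors. NumPy, the illustrative layer sizes, and the single-sample CD-1 training step are assumptions, not the paper's exact estimation procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_choices, n_x, n_hidden = 4, 6, 8
W = 0.01 * rng.standard_normal((n_choices, n_hidden))   # choice -> hidden weights
U = 0.01 * rng.standard_normal((n_x, n_hidden))         # covariates -> hidden weights
b_v, b_h = np.zeros(n_choices), np.zeros(n_hidden)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cd1_update(v, x, lr=0.05):
    """One contrastive-divergence step for a single (choice, covariates) observation."""
    global W, U, b_v, b_h
    h_prob = sigmoid(v @ W + x @ U + b_h)                 # positive phase
    h_samp = (rng.random(n_hidden) < h_prob).astype(float)
    v_recon = softmax(h_samp @ W.T + b_v)                 # reconstructed choice probabilities
    h_recon = sigmoid(v_recon @ W + x @ U + b_h)          # negative phase
    W += lr * (np.outer(v, h_prob) - np.outer(v_recon, h_recon))
    U += lr * np.outer(x, h_prob - h_recon)
    b_v += lr * (v - v_recon)
    b_h += lr * (h_prob - h_recon)

choice = np.eye(n_choices)[2]             # observed alternative, one-hot encoded
covars = rng.standard_normal(n_x)         # explanatory variables for this respondent
cd1_update(choice, covars)
```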

    Spatiotemporal Stacked Sequential Learning for Pedestrian Detection

    Pedestrian classifiers decide which image windows contain a pedestrian. In practice, such classifiers provide a relatively high response at neighbor windows overlapping a pedestrian, while the responses around potential false positives are expected to be lower. An analogous reasoning applies to image sequences: if a pedestrian is located within a frame, the same pedestrian is expected to appear close to the same location in neighbor frames. Therefore, such a location is likely to receive high classification scores over several frames, while false positives are expected to be more spurious. In this paper we propose to exploit such correlations to improve the accuracy of base pedestrian classifiers. In particular, we propose two-stage classifiers which rely not only on the image descriptors required by the base classifiers but also on the response of such base classifiers in a given spatiotemporal neighborhood. More specifically, we train pedestrian classifiers using a stacked sequential learning (SSL) paradigm. We use a new pedestrian dataset, acquired from a car, to evaluate our proposal at different frame rates, and we also test on the well-known Caltech dataset. The obtained results show that our SSL proposal boosts detection accuracy significantly with a minimal impact on the computational cost. Interestingly, SSL improves accuracy most in the most dangerous situations, i.e. when a pedestrian is close to the camera. Comment: 8 pages, 5 figures, 1 table
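
    The snippet below is a minimal sketch of the two-stage SSL idea: the second-stage classifier sees each window's descriptor plus the base classifier's scores over a spatiotemporal neighbourhood. Scikit-learn linear SVMs, random placeholder descriptors, the simple +/- k offset neighbourhood, and training both stages on the same data are assumptions for illustration; in practice the stages would be trained on separate folds.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))           # placeholder window descriptors (e.g. HOG)
y = rng.integers(0, 2, 500)                  # pedestrian / background labels

base = LinearSVC(dual=False).fit(X, y)       # stage 1: base pedestrian classifier
scores = base.decision_function(X)           # base response at every window

def neighbourhood_scores(scores, k=2):
    """Stack each window's score with the scores of its k neighbours per side."""
    cols = [np.roll(scores, s) for s in range(-k, k + 1)]
    return np.stack(cols, axis=1)

# Stage 2 sees the original descriptor plus the spatiotemporal score context.
X_stacked = np.hstack([X, neighbourhood_scores(scores)])
stage2 = LinearSVC(dual=False).fit(X_stacked, y)
print(stage2.score(X_stacked, y))
```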

    Compound Models for Vision-Based Pedestrian Recognition

    This thesis addresses the problem of recognizing pedestrians in video images acquired from a moving camera in real-world cluttered environments. Instead of focusing on the development of novel feature primitives or pattern classifiers, we follow an orthogonal direction and develop feature- and classifier-independent compound techniques which integrate complementary information from multiple image-based sources, with the objective of improved pedestrian classification performance. After establishing a performance baseline in terms of a thorough experimental study on monocular pedestrian recognition, we investigate the use of multiple cues at module level. A motion-based focus-of-attention stage is proposed, based on a learned probabilistic pedestrian-specific model of motion features. The model is used to generate pedestrian localization hypotheses for subsequent shape- and texture-based classification modules.

    In the remainder of this work, we focus on the integration of complementary information directly into the pattern classification step. We present a combination of shape and texture information by means of pose-specific generative shape and texture models. The generative models are integrated with discriminative classification models by utilizing synthesized virtual pedestrian training samples from the former to enhance the classification performance of the latter. Both models are linked using Active Learning to guide the training process towards informative samples.

    A multi-level mixture-of-experts classification framework is proposed which involves local pose-specific expert classifiers operating on multiple image modalities and features. In terms of image modalities, we consider gray-level intensity, depth cues derived from dense stereo vision and motion cues arising from dense optical flow; we furthermore employ shape-based, gradient-based and texture-based features. The mixture-of-experts formulation compares favorably to joint space approaches in terms of performance and practical feasibility. Finally, we extend this mixture-of-experts framework towards multi-cue partial occlusion handling and the estimation of pedestrian body orientation. Our occlusion model examines occlusion boundaries which manifest as discontinuities in depth and motion space; occlusion-dependent weights, which relate to the visibility of certain body parts, focus the decision on unoccluded body components. We further exploit the pose-specific nature of our mixture-of-experts framework to estimate the density of pedestrian body orientation from single images, again integrating shape and texture information.

    Throughout this work, particular emphasis is placed on thorough performance evaluation, both in terms of methodology and of competitive real-world datasets. Several datasets used in this thesis are made publicly available for benchmarking purposes. Our results indicate significant performance boosts over the state of the art for all aspects considered in this thesis, i.e. pedestrian recognition, partial occlusion handling and body orientation estimation. The pedestrian recognition performance in particular is considerably advanced; false detections at constant detection rates are reduced by significantly more than an order of magnitude.
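
    As a minimal sketch of the fusion idea (not the thesis' actual classifiers), the snippet below combines scores from pose-specific local experts with occlusion-dependent visibility weights, so that the decision concentrates on unoccluded body parts. NumPy, the precomputed expert scores, and the externally supplied visibility weights are assumptions for illustration.

```python
import numpy as np

def fuse_experts(expert_scores, pose_posterior, visibility):
    """
    expert_scores: (n_poses, n_parts) pedestrian scores from local expert classifiers
    pose_posterior: (n_poses,) weights for the pose-specific expert sets
    visibility: (n_parts,) occlusion-dependent weights in [0, 1]
    """
    vis = visibility / (visibility.sum() + 1e-8)   # focus the decision on visible parts
    per_pose = expert_scores @ vis                 # weighted fusion over body-part experts
    return float(pose_posterior @ per_pose)        # mixture over pose-specific expert sets

scores = np.array([[0.9, 0.2, 0.8],     # head / torso / legs experts, pose set 1
                   [0.7, 0.1, 0.6]])    # the same experts for pose set 2
pose_post = np.array([0.6, 0.4])
visibility = np.array([1.0, 0.1, 1.0])  # torso judged occluded from depth/flow discontinuities
print(fuse_experts(scores, pose_post, visibility))
```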