
    A New Scene Classification Method Based on Local Gabor Features

    A new scene classification method is proposed that combines local Gabor features with a spatial pyramid matching model. First, new local Gabor feature descriptors are extracted from densely sampled patches of scene images. These local feature descriptors are embedded into a bag-of-visual-words (BOVW) model, which is combined with a spatial pyramid matching framework. The new local Gabor feature descriptors are sufficiently discriminative for dense regions of scene images. Feature vectors for scene images are then obtained by K-means clustering and visual word statistics. Second, to decrease classification time and improve accuracy, an improved kernel principal component analysis (KPCA) method is applied to reduce the dimensionality of the pyramid histogram of visual words (PHOW). The principal components with greater interclass separability are retained in the feature vectors, which are used for scene classification by a linear support vector machine (SVM). The proposed method is evaluated on three commonly used scene datasets. Experimental results demonstrate the effectiveness of the method.
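The pipeline in this abstract ends with a pyramid histogram of visual words. A minimal sketch of that step is given here, assuming each patch has already been assigned a visual-word index (for example, its nearest K-means centroid). The function name, the normalised coordinates, and the unweighted concatenation are illustrative assumptions; the Gabor descriptors, pyramid match weights, KPCA reduction, and SVM of the paper are not reproduced.

```python
def spatial_pyramid_histogram(points, words, vocab_size, levels=2):
    """Concatenate visual-word histograms over a pyramid of grids.

    points: list of (x, y) patch centres, normalised to [0, 1).
    words:  visual-word index of each patch (assumed precomputed).
    Hypothetical helper: level weights are omitted for brevity.
    """
    feature = []
    for level in range(levels + 1):
        cells = 2 ** level  # the grid is cells x cells at this level
        hist = [0] * (cells * cells * vocab_size)
        for (x, y), w in zip(points, words):
            # Clamp so that a coordinate of exactly 1.0 stays in the last cell.
            col = min(int(x * cells), cells - 1)
            row = min(int(y * cells), cells - 1)
            hist[(row * cells + col) * vocab_size + w] += 1
        feature.extend(hist)
    return feature


# Two patches in opposite corners, vocabulary of two words, levels 0 and 1.
example = spatial_pyramid_histogram([(0.1, 0.1), (0.9, 0.9)], [0, 1],
                                    vocab_size=2, levels=1)
```

With `levels=2` the feature has `21 * vocab_size` entries (1 + 4 + 16 cells), which is the kind of high-dimensional vector that a later dimensionality-reduction step would shrink.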

    Ask the locals: multi-way local pooling for image recognition

    Invariant representations in object recognition systems are generally obtained by pooling feature vectors over spatially local neighborhoods. But pooling is not local in the feature vector space, so that widely dissimilar features may be pooled together if they are in nearby locations. Recent approaches rely on sophisticated encoding methods and more specialized codebooks (or dictionaries), e.g., learned on subsets of descriptors which are close in feature space, to circumvent this problem. In this work, we argue that a common trait found in much recent work in image recognition or retrieval is that it leverages locality in feature space on top of purely spatial locality. We propose to apply this idea in its simplest form to an object recognition system based on the spatial pyramid framework, to increase the performance of small dictionaries with very little added engineering. State-of-the-art results on several object recognition benchmarks show the promise of this approach.
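The idea of pooling locally in feature space as well as in image space can be sketched as follows, under the assumption that every descriptor in one spatial cell already carries a feature-space bin index (for instance, from a coarse clustering of descriptors). The function name and the use of max pooling over non-negative codes are illustrative choices, not details taken from the abstract.

```python
def multiway_max_pool(codes, feature_bins, n_bins):
    """Max-pool codes separately per feature-space bin, then concatenate.

    codes:        list of non-negative code vectors for one spatial cell.
    feature_bins: bin index of each descriptor in feature space (assumed given).
    Hypothetical helper; empty bins contribute an all-zero vector.
    """
    dim = len(codes[0])
    pooled = [[0.0] * dim for _ in range(n_bins)]
    for code, b in zip(codes, feature_bins):
        pooled[b] = [max(p, c) for p, c in zip(pooled[b], code)]
    # One pooled vector per bin, so dissimilar descriptors are never merged.
    return [value for bin_vector in pooled for value in bin_vector]


pooled_example = multiway_max_pool([[1, 0], [0, 2], [3, 1]], [0, 1, 0], n_bins=2)
```

Pooling all three codes together would give `[3, 2]`; pooling per bin keeps the second descriptor, which lies in a different region of feature space, separate from the other two.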

    ROBUST SPEAKER RECOGNITION BASED ON LATENT VARIABLE MODELS

    Automatic speaker recognition in uncontrolled environments is a very challenging task due to channel distortions, additive noise and reverberation. To address these issues, this thesis studies probabilistic latent variable models of short-term spectral information that leverage large amounts of data to achieve robustness in challenging conditions. Current speaker recognition systems represent an entire speech utterance as a single point in a high-dimensional space. This representation is known as "supervector". This thesis starts by analyzing the properties of this representation. A novel visualization procedure of supervectors is presented by which qualitative insight about the information being captured is obtained. We then propose the use of an overcomplete dictionary to explicitly decompose a supervector into a speaker-specific component and an undesired variability component. An algorithm to learn the dictionary from a large collection of data is discussed and analyzed. A subset of the entries of the dictionary is learned to represent speaker-specific information and another subset to represent distortions. After encoding the supervector as a linear combination of the dictionary entries, the undesired variability is removed by discarding the contribution of the distortion components. This paradigm is closely related to the previously proposed paradigm of Joint Factor Analysis modeling of supervectors. We establish a connection between the two approaches and show how our proposed method provides improvements in terms of computation and recognition accuracy. An alternative way to handle undesired variability in supervector representations is to first project them into a lower dimensional space and then to model them in the reduced subspace. This low-dimensional projection is known as "i-vector". Unfortunately, i-vectors exhibit non-Gaussian behavior, and direct statistical modeling requires the use of heavy-tailed distributions for optimal performance. 
    These approaches lack closed-form solutions, and therefore are hard to analyze. Moreover, they do not scale well to large datasets. Instead of directly modeling i-vectors, we propose to first apply a non-linear transformation and then use a linear-Gaussian model. We present two alternative transformations and show experimentally that the transformed i-vectors can be optimally modeled by a simple linear-Gaussian model (factor analysis). We evaluate our method on a benchmark dataset with a large amount of channel variability and show that the results compare favorably against the competitors. Also, our approach has closed-form solutions and scales gracefully to large datasets. Finally, a multi-classifier architecture trained in a multicondition fashion is proposed to address the problem of speaker recognition in the presence of additive noise. A large number of experiments are conducted to analyze the proposed architecture and to obtain guidelines for optimal performance in noisy environments. Overall, it is shown that multicondition training of multi-classifier architectures not only produces great robustness in the anticipated conditions, but also generalizes well to unseen conditions.

    Functional and structural neural contributions to skilled word reading

    Reading is an essential skill in our everyday lives and individuals are required to process, understand, and respond to textual information at an increasingly rapid rate in order to be active participants in society. The role of spatial attention in reading has recently been emphasized, whereby better spatial attentional skills are associated with stronger reading skills, and spatial attentional training has a large impact on improving reading ability. However, the neuroanatomical correlates of reading and attention have primarily been studied in isolation. Further, there has recently been a shift to understanding how underlying white matter connectivity networks contribute to cognitive processes. However, much of the research focusing on the intersection of reading and spatial attention, as well as underlying white matter connectivity, has focused primarily on individuals with reading impairments. This thesis will focus on unraveling the neural relationship between spatial attention and reading, and how structural connectivity accounts for functional activation in reading tasks. In Chapter 2, we examine the neural relationship between lexical and sublexical reading with voluntary and reflexive spatial attention. In Experiments 1 and 2, participants performed overt reading of both lexical exception word (EW; words with inconsistent spelling-to-sound correspondences, e.g., ‘pint’) and sublexical pseudohomophone (PH; non-words that when decoded phonetically sound like real words, e.g., ‘pynt’) reading tasks, as well as tasks involving either voluntary attention (Experiment 1) or reflexive attention (Experiment 2) during functional magnetic resonance imaging (fMRI). Experiment 3 used hybrid combined reading attention tasks during fMRI, whereby the spatial attentional cue preceded presentation of the EW or PH stimulus. 
    Overall, the results from these experiments showed that sublexical reading was more strongly associated with brain regions involved in voluntary attention, whereas lexical reading was more strongly associated with brain regions involved in reflexive attention. Thus, Experiments 1, 2 and 3 lend support to the idea that lexical and sublexical reading strategies are differentially associated with these two types of attention. In Chapter 3, we examined the extent to which fine-grained underlying white matter connectivity is able to predict fMRI activation during both lexical reading and phonetic decoding in skilled readers. Experiment 4 employed EW and PH reading and a computational modeling technique to model the relationship between whole-brain structural DTI connectivity and task-based fMRI activation during lexical and sublexical reading. Results from this study showed that brain activation during both lexical and sublexical reading in skilled readers can be accurately predicted using DTI connectivity, specifically in known reading and language areas, as well as important spatial attentional areas. Thus, this research suggests that there is a fine-grained relationship between skilled reading and extrinsic brain connectivity, showing that functional organization of reading and language can be determined (at least in part) by structural connectivity patterns. Together, the studies presented in this thesis provide valuable insight into functional and structural contributions to word reading that may serve as biomarkers of skilled reading, which in turn may have important implications for understanding and remediating reading impairments.

    Human Action Recognition Using Deep Probabilistic Graphical Models

    Building intelligent systems that are capable of representing or extracting high-level representations from high-dimensional sensory data lies at the core of solving many AI-related tasks. Human action recognition is an important topic in computer vision that lies in high-dimensional space. Its applications include robotics, video surveillance, human-computer interaction, user interface design, and multimedia video retrieval, amongst others. A number of approaches have been proposed to extract representative features from high-dimensional temporal data, most commonly hard-wired geometric or bio-inspired shape context features. This thesis first demonstrates some ad hoc hand-crafted rules for effectively encoding motion features, and later elicits a more generic approach for incorporating structured feature learning and reasoning, i.e., deep probabilistic graphical models. The hierarchical dynamic framework first extracts high-level features and then uses the learned representation for estimating emission probability to infer action sequences. We show that better action recognition can be achieved by replacing Gaussian mixture models with Deep Neural Networks that contain many layers of features to predict probability distributions over states of Markov Models. The framework can be easily extended to include an ergodic state to segment and recognise actions simultaneously. The first part of the thesis focuses on analysis and applications of hand-crafted features for human action representation and classification. We show that the "hard-coded" concept of correlogram can incorporate correlations between time domain sequences and we further investigate multi-modal inputs, e.g., depth sensor input and its unique traits for action recognition. The second part of this thesis focuses on marrying probabilistic graphical models with Deep Neural Networks (both Deep Belief Networks and Deep 3D Convolutional Neural Networks) for structured sequence prediction. 
    The proposed Deep Dynamic Neural Network provides a general framework for structured 2D data representation and classification. This motivates further investigation of applying various graphical models to time-variant video sequences.
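The abstract describes networks that predict distributions over Markov-model states, which are then used as emission scores to infer action sequences. A minimal sketch of that decoding step follows, assuming the common hybrid recipe of dividing state posteriors by state priors and running Viterbi decoding. The function names, the two-state toy model, and all probabilities are hypothetical, and all probabilities are assumed strictly positive so that logarithms are defined.

```python
import math


def scaled_log_likelihoods(posteriors, state_priors):
    """Convert per-frame posteriors p(s|x) into log p(s|x) - log p(s)."""
    return [[math.log(p) - math.log(q) for p, q in zip(frame, state_priors)]
            for frame in posteriors]


def viterbi(log_emission, transition, initial):
    """Most likely state path given per-frame log emission scores.

    transition[i][j] is P(next state j | current state i); initial[s] is P(s).
    """
    n_states = len(initial)
    log_a = [[math.log(p) for p in row] for row in transition]
    score = [math.log(initial[s]) + log_emission[0][s] for s in range(n_states)]
    backpointers = []
    for frame in log_emission[1:]:
        new_score, pointers = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: score[i] + log_a[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + log_a[best_i][j] + frame[j])
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical network outputs for four frames and two action states.
posteriors = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
emissions = scaled_log_likelihoods(posteriors, [0.5, 0.5])
path = viterbi(emissions, [[0.8, 0.2], [0.2, 0.8]], [0.5, 0.5])
```

In this toy run the decoded path switches from state 0 to state 1 after the second frame, which is how a segmentation of the sequence falls out of the decoding.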

    Complex internal representations in sensorimotor decision making: a Bayesian investigation

    The past twenty years have seen a successful formalization of the idea that perception is a form of probabilistic inference. Bayesian Decision Theory (BDT) provides a neat mathematical framework for describing how an ideal observer and actor should interpret incoming sensory stimuli and act in the face of uncertainty. The predictions of BDT, however, crucially depend on the observer’s internal models, represented in the Bayesian framework by priors, likelihoods, and the loss function. Arguably, only in the simplest scenarios (e.g., with a few Gaussian variables) we can expect a real observer’s internal representations to perfectly match the true statistics of the task at hand, and to conform to exact Bayesian computations, but how humans systematically deviate from BDT in more complex cases is yet to be understood. In this thesis we theoretically and experimentally investigate how people represent and perform probabilistic inference with complex (beyond Gaussian) one-dimensional distributions of stimuli in the context of sensorimotor decision making. The goal is to reconstruct the observers’ internal representations and details of their decision-making process from the behavioural data – by employing Bayesian inference to uncover properties of a system, the ideal observer, that is believed to perform Bayesian inference itself. This “inverse problem” is not unique: in principle, distinct Bayesian observer models can produce very similar behaviours. We circumvented this issue by means of experimental constraints and independent validation of the results. To understand how people represent complex distributions of stimuli in the specific domain of time perception, we conducted a series of psychophysical experiments where participants were asked to reproduce the time interval between a mouse click and a flash, drawn from a session-dependent distribution of intervals. 
    We found that participants could learn smooth approximations of the non-Gaussian experimental distributions, but seemed to have trouble with learning some complex statistical features such as bimodality. To investigate whether this difficulty arose from learning complex distributions or computing with them, we conducted a target estimation experiment in which “priors” were explicitly displayed on screen and therefore did not need to be learnt. Lack of difference in performance between the Gaussian and bimodal conditions in this task suggests that acquiring a bimodal prior, rather than computing with it, is the major difficulty. Model comparison on a large number of Bayesian observer models, representing different assumptions about the noise sources and details of the decision process, revealed a further source of variability in decision making that was modelled as a “stochastic posterior”. Finally, prompted by a secondary finding of the previous experiment, we tested the effect of decision uncertainty on the capacity of the participants to correct for added perturbations in the visual feedback in a centre of mass estimation task. Participants almost completely compensated for the injected error in low uncertainty trials, but only partially so in the high uncertainty ones, even when allowed sufficient time to adjust their response. Surprisingly, though, their overall performance was not significantly affected. This finding is consistent with the behaviour of a Bayesian observer with an additional term in the loss function that represents “effort” – a component of optimal control usually thought to be negligible in sensorimotor estimation tasks. Together, these studies provide new insight into the capacity and limitations people have in learning and performing probabilistic inference with distributions beyond Gaussian. This work also introduces several tools and techniques that can help in the systematic exploration of suboptimal behaviour. 
    Developing a language to describe suboptimality, mismatching representations and approximate inference, as opposed to optimality and exact inference, is a fundamental step to link behavioural studies to actual neural computations.
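To make the notion of an ideal observer with a non-Gaussian prior concrete, here is a minimal sketch of a Bayesian estimate on a discrete grid of time intervals. It assumes a bimodal prior, Gaussian sensory noise, and a squared-error loss (so the optimal estimate is the posterior mean). All names and numbers are hypothetical and are not taken from the thesis, whose observer models also include further noise sources, a stochastic posterior, and other loss functions.

```python
import math


def gaussian(x, mean, sd):
    """Gaussian density evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def bimodal_prior(grid, modes, sd):
    """Equal mixture of Gaussians at the given modes, normalised on the grid."""
    raw = [sum(gaussian(s, m, sd) for m in modes) for s in grid]
    total = sum(raw)
    return [r / total for r in raw]


def posterior_mean(measurement, grid, prior, sensory_sd):
    """Posterior-mean estimate of the stimulus under squared-error loss."""
    weights = [p * gaussian(measurement, s, sensory_sd)
               for s, p in zip(grid, prior)]
    total = sum(weights)
    return sum(s * w for s, w in zip(grid, weights)) / total


# Hypothetical intervals from 0.4 s to 1.2 s, with prior modes at 0.6 s and 1.0 s.
grid = [0.4 + 0.01 * i for i in range(81)]
prior = bimodal_prior(grid, (0.6, 1.0), 0.05)
ambiguous = posterior_mean(0.8, grid, prior, 0.08)  # measurement between the modes
near_mode = posterior_mean(0.6, grid, prior, 0.08)  # measurement at the first mode
```

A measurement halfway between the modes yields an estimate at the midpoint, while a measurement near one mode is pulled toward that mode. An observer who had learned only a smooth unimodal approximation of the prior would respond differently, which is the kind of behavioural signature such models are fitted to.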