7 research outputs found
A New Scene Classification Method Based on Local Gabor Features
A new scene classification method is proposed based on the combination of local Gabor features with a spatial pyramid matching model. First, new local Gabor feature descriptors are extracted from densely sampled patches of scene images. These local descriptors are embedded into a bag-of-visual-words (BOVW) model, which is combined with a spatial pyramid matching framework. The new local Gabor feature descriptors have sufficient discriminative ability for dense regions of scene images, and efficient feature vectors for scene images can then be obtained by K-means clustering and visual-word statistics. Second, to decrease classification time and improve accuracy, an improved kernel principal component analysis (KPCA) method is applied to reduce the dimensionality of the pyramid histogram of visual words (PHOW). The principal components with greater interclass separability are retained in the feature vectors, which are used for scene classification with a linear support vector machine (SVM). The proposed method is evaluated on three commonly used scene datasets, and experimental results demonstrate its effectiveness.
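The bag-of-visual-words step described above can be sketched in a few lines: each patch descriptor is assigned to its nearest codebook centre (the centres would come from K-means) and the image is summarised as a normalised word histogram. This is a toy illustration with made-up data; the Gabor descriptors, spatial pyramid, and KPCA stages are omitted.

```python
import math

def assign_word(descriptor, codebook):
    """Index of the nearest codebook centre (Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: math.dist(descriptor, codebook[k]))

def bovw_histogram(descriptors, codebook):
    """Normalised visual-word histogram for one image's patch descriptors."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[assign_word(d, codebook)] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]
```

The resulting per-image histograms, concatenated over pyramid levels, are what the linear SVM would then be trained on.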
ROBUST SPEAKER RECOGNITION BASED ON LATENT VARIABLE MODELS
Automatic speaker recognition in uncontrolled environments is a very challenging task due to channel distortions, additive noise and reverberation. To address these issues, this thesis studies probabilistic latent variable models of short-term spectral information that leverage large amounts of data to achieve robustness in challenging conditions.
Current speaker recognition systems represent an entire speech utterance as a single point in a high-dimensional space, a representation known as a "supervector". This thesis starts by analyzing the properties of this representation. A novel visualization procedure for supervectors is presented that provides qualitative insight into the information they capture. We then propose the use of an overcomplete dictionary to explicitly decompose a supervector into a speaker-specific component and an undesired-variability component. An algorithm to learn the dictionary from a large collection of data is discussed and analyzed. One subset of the dictionary entries is learned to represent speaker-specific information and another to represent distortions. After encoding the supervector as a linear combination of the dictionary entries, the undesired variability is removed by discarding the contribution of the distortion components. This paradigm is closely related to the previously proposed Joint Factor Analysis modeling of supervectors. We establish a connection between the two approaches and show how our proposed method provides improvements in both computation and recognition accuracy.
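The decompose-and-discard idea can be illustrated with a deliberately simplified sketch. Here we assume the dictionary atoms are orthonormal, so encoding reduces to inner products; the thesis learns an overcomplete dictionary, where encoding is a harder optimisation problem. The supervector is then rebuilt from the speaker atoms only.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def remove_nuisance(supervector, speaker_atoms, noise_atoms):
    """Encode against the dictionary, then reconstruct the supervector from
    the speaker-specific atoms only; the noise atoms' contribution is
    discarded (hence noise_atoms is intentionally unused here)."""
    clean = [0.0] * len(supervector)
    for atom in speaker_atoms:
        coef = dot(supervector, atom)  # encoding step (orthonormal assumption)
        clean = [c + coef * a for c, a in zip(clean, atom)]
    return clean
```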
An alternative way to handle undesired variability in supervector representations is to first project them into a lower-dimensional space and then model them in the reduced subspace. This low-dimensional projection is known as an "i-vector". Unfortunately, i-vectors exhibit non-Gaussian behavior, and direct statistical modeling requires heavy-tailed distributions for optimal performance. These approaches lack closed-form solutions and are therefore hard to analyze; moreover, they do not scale well to large datasets. Instead of modeling i-vectors directly, we propose to first apply a non-linear transformation and then use a linear-Gaussian model. We present two alternative transformations and show experimentally that the transformed i-vectors can be optimally modeled by a simple linear-Gaussian model (factor analysis). We evaluate our method on a benchmark dataset with a large amount of channel variability and show that the results compare favorably against competing approaches. In addition, our approach has closed-form solutions and scales gracefully to large datasets.
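One commonly used non-linear transformation of this kind is length normalization, which projects each i-vector onto the unit sphere so that a simple Gaussian model becomes a much better fit. The abstract does not name the two transformations it studies, so treat this as an illustrative example rather than the thesis method:

```python
import math

def length_normalize(ivector):
    """Project an i-vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in ivector)) or 1.0
    return [x / norm for x in ivector]
```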
Finally, a multi-classifier architecture trained in a multicondition fashion is proposed to address the problem of speaker recognition in the presence of additive noise. A large number of experiments are conducted to analyze the proposed architecture and to derive guidelines for optimal performance in noisy environments. Overall, it is shown that multicondition training of multi-classifier architectures not only provides strong robustness in the anticipated conditions, but also generalizes well to unseen conditions.
Content-Style Decomposition: Representation Discovery and Applications
Content-style decompositions, or CSDs, decompose entities into content, defined by the entity's class, and style, defined as the remaining within-class variation. Content is typically defined in terms of some task. For example, in a face recognition task, identity is the content; in an emotion recognition task, expression is the content. CSDs have many applications: they can provide insight into domains where we have little prior knowledge of the sources of within- and between-class variation, and content-style recombinations are interesting as a creative exercise or for data set augmentation. Our approach is to decompose CSD discovery into two sub-problems: (1) to find useful representations of content that capture the class structure of the domain, and (2) to use those content-representations to discover CSDs. We make contributions to both sub-problems. First, we propose the F-statistic loss, a new method for discovering content representations that uses statistics of class separation on isolated embedding dimensions within a minibatch to determine when to terminate training. In addition to state-of-the-art performance on few-shot learning, we find that the method leads to factorial (also known as disentangled) representations of content when applied with a novel form of weak supervision. Previous work on disentangling is either unsupervised or uses a factor-aware oracle, which provides similar/dissimilar judgments with respect to a named attribute/factor. We explore an intermediate form of supervision, an unnamed-factor oracle, which provides similarity judgments with respect to a random unnamed factor. We demonstrate that the F-statistic loss leads to better disentangling when compared with other deep-embeddings losses and β-VAE, a state-of-the-art unsupervised disentangling method. Second, we introduce a new loss for discovering CSDs that can arbitrarily recombine content and style, called leakage filtering. 
In contrast to previous research, which attempts to separate content and style into two different representation vectors, leakage filtering allows for imperfectly disentangled representations but ensures that residual content information does not leak out of the style representation and vice versa. Leakage filtering is also distinguished by its ability to operate on novel content classes and by its lack of dependence on style labels for training. The recombined images produced are of high quality and can be used to augment datasets for few-shot learning tasks, yielding significant generalization improvements.
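The class-separation statistic behind the F-statistic loss is the classical one-way ANOVA F ratio, computed per embedding dimension: between-class variance divided by within-class variance. A toy sketch for scalar samples (the actual loss adds minibatch sampling and a differentiable surrogate):

```python
def f_statistic(groups):
    """One-way ANOVA F statistic for scalar samples grouped by class."""
    all_vals = [x for g in groups for x in g]
    grand = sum(all_vals) / len(all_vals)
    k, n = len(groups), len(all_vals)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F on a dimension means its values separate the classes well relative to within-class spread, which is the property the loss encourages.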
Human Action Recognition Using Deep Probabilistic Graphical Models
Building intelligent systems that are capable of representing or extracting high-level representations from high-dimensional sensory data lies at the core of solving many AI-related tasks. Human action recognition is an important topic in computer vision whose data lie in a high-dimensional space. Its applications include robotics, video surveillance, human-computer interaction, user interface design, and multimedia video retrieval, amongst others.
A number of approaches have been proposed to extract representative features from high-dimensional temporal data, most commonly hard-wired geometric or bio-inspired shape-context features.
This thesis first demonstrates some ad hoc hand-crafted rules for effectively encoding motion features, and later develops a more generic approach that incorporates structured feature learning and reasoning, i.e., deep probabilistic graphical models.
The hierarchical dynamic framework first extracts high-level features and then uses the learned representation to estimate emission probabilities for inferring action sequences.
We show that better action recognition can be achieved by replacing Gaussian mixture models with Deep Neural Networks that contain many layers of features to predict probability distributions over the states of Markov models. The framework can be easily extended to include an ergodic state to segment and recognise actions simultaneously.
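In such hybrid systems the DNN's state posteriors are turned into scaled likelihoods, log p(x|s) proportional to log p(s|x) - log p(s), and plugged into standard Viterbi decoding over the Markov model. A minimal sketch with made-up scores:

```python
def viterbi(scaled_loglik, log_trans, log_init):
    """Viterbi decoding of an HMM whose emission scores come from a DNN:
    scaled_loglik[t][s] = log p(s | x_t) - log p(s), a scaled likelihood."""
    n = len(log_init)
    delta = [log_init[s] + scaled_loglik[0][s] for s in range(n)]
    back = []
    for t in range(1, len(scaled_loglik)):
        new, ptr = [], []
        for s in range(n):
            prev = max(range(n), key=lambda p: delta[p] + log_trans[p][s])
            ptr.append(prev)
            new.append(delta[prev] + log_trans[prev][s] + scaled_loglik[t][s])
        delta = new
        back.append(ptr)
    state = max(range(n), key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(back):  # trace the best state sequence backwards
        state = ptr[state]
        path.append(state)
    return path[::-1]
```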
The first part of the thesis focuses on the analysis and application of hand-crafted features for human action representation and classification. We show that the "hard-coded" concept of the correlogram can incorporate correlations between time-domain sequences, and we further investigate multi-modal inputs, e.g., depth sensor input and its unique traits for action recognition.
The second part of this thesis focuses on marrying probabilistic graphical models with Deep Neural Networks (both Deep Belief Networks and Deep 3D Convolutional Neural Networks) for structured sequence prediction. The proposed Deep Dynamic Neural Network provides a general framework for structured 2D data representation and classification, and inspires us to further investigate the application of various graphical models to time-variant video sequences.
Functional and structural neural contributions to skilled word reading
Reading is an essential skill in our everyday lives and individuals are required to process, understand, and respond to textual information at an increasingly rapid rate in order to be active participants in society. The role of spatial attention in reading has recently been emphasized, whereby better spatial attentional skills are associated with stronger reading skills, and spatial attentional training has a large impact on improving reading ability. However, the neuroanatomical correlates of reading and attention have primarily been studied in isolation. Further, there has recently been a shift to understanding how underlying white matter connectivity networks contribute to cognitive processes. However, much of the research focusing on the intersection of reading and spatial attention, as well as underlying white matter connectivity, has focused primarily on individuals with reading impairments. This thesis will focus on unraveling the neural relationship between spatial attention and reading, and how structural connectivity accounts for functional activation in reading tasks. In Chapter 2, we examine the neural relationship between lexical and sublexical reading with voluntary and reflexive spatial attention. In Experiments 1 and 2, participants performed overt reading of both lexical exception word (EW; words with inconsistent spelling-to-sound correspondences, e.g., "pint") and sublexical pseudohomophone (PH; non-words that when decoded phonetically sound like real words, e.g., "pynt") reading tasks, as well as tasks involving either voluntary attention (Experiment 1) or reflexive attention (Experiment 2) during functional magnetic resonance imaging (fMRI). Experiment 3 used hybrid combined reading-attention tasks during fMRI, whereby the spatial attentional cue preceded presentation of the EW or PH stimulus.
Overall, the results from these experiments showed that sublexical reading was more strongly associated with brain regions involved in voluntary attention, whereas lexical reading was more strongly associated with brain regions involved in reflexive attention. Thus, Experiments 1, 2 and 3 lend support to the idea that lexical and sublexical reading strategies are differentially associated with these two types of attention. In Chapter 3, we examined the extent to which fine-grained underlying white matter connectivity is able to predict fMRI activation during both lexical reading and phonetic decoding in skilled readers. Experiment 4 employed EW and PH reading and a computational modeling technique to model the relationship between whole-brain structural DTI connectivity and task-based fMRI activation during lexical and sublexical reading. Results from this study showed that brain activation during both lexical and sublexical reading in skilled readers can be accurately predicted using DTI connectivity, specifically in known reading and language areas, as well as important spatial attentional areas. Thus, this research suggests that there is a fine-grained relationship between skilled reading and extrinsic brain connectivity, showing that functional organization of reading and language can be determined (at least in part) by structural connectivity patterns. Together, the studies presented in this thesis provide valuable insight into functional and structural contributions to word reading that may serve as biomarkers of skilled reading, which in turn may have important implications for understanding and remediating reading impairments.
Complex internal representations in sensorimotor decision making: a Bayesian investigation
The past twenty years have seen a successful formalization of the idea that perception
is a form of probabilistic inference. Bayesian Decision Theory (BDT) provides a
neat mathematical framework for describing how an ideal observer and actor should
interpret incoming sensory stimuli and act in the face of uncertainty. The predictions
of BDT, however, crucially depend on the observer's internal models, represented in
the Bayesian framework by priors, likelihoods, and the loss function. Arguably, only
in the simplest scenarios (e.g., with a few Gaussian variables) can we expect a real
observer's internal representations to perfectly match the true statistics of the task at
hand, and to conform to exact Bayesian computations, but how humans systematically
deviate from BDT in more complex cases is yet to be understood.
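Concretely, an ideal observer in this framework combines a prior over stimuli with a likelihood (here Gaussian) and, under a squared-error loss, reports the posterior mean. A discretised toy sketch; the grid, prior, and noise level below are invented for illustration:

```python
import math

def posterior_mean_estimate(measurement, prior, sigma, grid):
    """Bayesian least-squares estimate on a discretised stimulus grid:
    posterior(s) is proportional to prior(s) * N(measurement; s, sigma^2)."""
    weights = [p * math.exp(-0.5 * ((measurement - s) / sigma) ** 2)
               for s, p in zip(grid, prior)]
    z = sum(weights)  # normalising constant of the discretised posterior
    return sum(w * s for w, s in zip(weights, grid)) / z
```

Deviations from this ideal estimate, such as the "stochastic posterior" behaviour discussed later, are what the thesis sets out to characterise.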
In this thesis we theoretically and experimentally investigate how people represent
and perform probabilistic inference with complex (beyond Gaussian) one-dimensional
distributions of stimuli in the context of sensorimotor decision making. The goal is
to reconstruct the observers' internal representations and details of their decision-making
process from the behavioural data, by employing Bayesian inference to uncover
properties of a system, the ideal observer, that is believed to perform Bayesian
inference itself. This "inverse problem" is not unique: in principle, distinct Bayesian
observer models can produce very similar behaviours. We circumvented this issue by
means of experimental constraints and independent validation of the results.
To understand how people represent complex distributions of stimuli in the specific
domain of time perception, we conducted a series of psychophysical experiments
where participants were asked to reproduce the time interval between a mouse click
and a flash, drawn from a session-dependent distribution of intervals. We found that
participants could learn smooth approximations of the non-Gaussian experimental
distributions, but seemed to have trouble with learning some complex statistical features
such as bimodality.
To investigate whether this difficulty arose from learning complex distributions
or computing with them, we conducted a target estimation experiment in which
"priors" were explicitly displayed on screen and therefore did not need to be learnt.
Lack of difference in performance between the Gaussian and bimodal conditions in
this task suggests that acquiring a bimodal prior, rather than computing with it, is the
major difficulty. Model comparison on a large number of Bayesian observer models,
representing different assumptions about the noise sources and details of the decision
process, revealed a further source of variability in decision making that was modelled
as a "stochastic posterior".
Finally, prompted by a secondary finding of the previous experiment, we tested the
effect of decision uncertainty on the capacity of the participants to correct for added
perturbations in the visual feedback in a centre of mass estimation task. Participants
almost completely compensated for the injected error in low uncertainty trials, but
only partially so in the high uncertainty ones, even when allowed sufficient time to
adjust their response. Surprisingly, though, their overall performance was not significantly
affected. This finding is consistent with the behaviour of a Bayesian observer
with an additional term in the loss function that represents "effort", a component of
optimal control usually thought to be negligible in sensorimotor estimation tasks.
Together, these studies provide new insight into the capacity and limitations people
have in learning and performing probabilistic inference with distributions beyond
Gaussian. This work also introduces several tools and techniques that can help in the
systematic exploration of suboptimal behaviour. Developing a language to describe
suboptimality, mismatching representations and approximate inference, as opposed
to optimality and exact inference, is a fundamental step to link behavioural studies
to actual neural computations.