Fast vocabulary acquisition in an NMF-based self-learning vocal user interface
In command-and-control applications, a vocal user interface (VUI) is useful for handsfree control of various devices, especially for people with a physical disability. The spoken utterances are usually restricted to a predefined list of phrases or to a restricted grammar, and the acoustic models work well for normal speech. While some state-of-the-art methods allow for user adaptation of the predefined acoustic models and lexicons, we pursue a fully adaptive VUI by learning both vocabulary and acoustics directly from interaction examples. A learning curve usually has a steep rise in the beginning and an asymptotic ceiling at the end. To limit tutoring time and to guarantee good performance in the long run, the word learning rate of the VUI should be fast and the learning curve should level off at a high accuracy. In order to deal with these performance indicators, we propose a multi-level VUI architecture and we investigate the effectiveness of alternative processing schemes. In the low-level layer, we explore the use of MIDA features (Mutual Information Discrimination Analysis) against conventional MFCC features. In the mid-level layer, we enhance the acoustic representation by means of phone posteriorgrams and clustering procedures. In the high-level layer, we use the NMF (Non-negative Matrix Factorization) procedure which has been demonstrated to be an effective approach for word learning. We evaluate and discuss the performance and the feasibility of our approach in a realistic experimental setting of the VUI-user learning context.
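As a rough illustration of the high-level NMF layer described above (not the authors' actual pipeline), the sketch below factorizes a non-negative utterance-by-feature matrix into word-like parts with scikit-learn; the feature construction, matrix sizes, and parameter values are all hypothetical stand-ins.

```python
# Minimal sketch of NMF-based word learning (illustration only, not the
# authors' pipeline). Each utterance is summarized as a non-negative feature
# vector (e.g. a histogram of acoustic co-occurrences); NMF then factorizes
# the utterance-by-feature matrix into word-like parts.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_utterances, n_features, n_words = 200, 300, 10   # hypothetical sizes

# V: utterances x acoustic features, non-negative counts (random stand-in here)
V = rng.poisson(lam=1.0, size=(n_utterances, n_features)).astype(float)

model = NMF(n_components=n_words, init="nndsvda", max_iter=500)
W = model.fit_transform(V)   # utterance-level activations of word-like parts
H = model.components_        # word-like parts over acoustic features

# A new utterance would be decoded by its activation pattern over the learned parts.
activations = model.transform(V[:1])
print(activations.round(2))
```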
Composing Deep Learning and Bayesian Nonparametric Methods
Recent progress in Bayesian methods largely focuses on non-conjugate models that make extensive use of black-box functions: continuous functions implemented with neural networks. Using deep neural networks, Bayesian models can reasonably fit big data while at the same time capturing model uncertainty. This thesis targets a more challenging problem: how do we model general random objects, including discrete ones, using random functions? Our conclusion is: many (discrete) random objects are in nature compositions of Poisson processes and random functions. Thus, all discreteness is handled by the Poisson process, while random functions capture the remaining complexity of the object. Hence the title: composing deep learning and Bayesian nonparametric methods.
This conclusion is not a conjecture. In special cases such as latent feature models, we can prove this claim by working in infinite-dimensional spaces, and that is where Bayesian nonparametrics comes in. Moreover, we impose regularity assumptions on the random objects, such as exchangeability. The representations then emerge from representation theorems. We will see this twice throughout this thesis.
One may ask: when a random object is too simple, such as the non-negative random vector in latent feature models, how can we exploit exchangeability? The answer is to aggregate infinitely many random objects, map them all onto an infinite-dimensional space, and then assume exchangeability on that space. We demonstrate this with two latent feature model examples: (1) concatenating the objects into an infinite sequence (Sections 2 and 3) and (2) stacking them into a 2D array (Section 4).
We also show that Bayesian nonparametric methods are useful for modelling discrete patterns in time-series data. We showcase two examples: (1) using variance-Gamma processes to model change points (Section 5), and (2) using Chinese restaurant processes to model speech with switching speakers (Section 6).
We are also aware that inference can be non-trivial in popular Bayesian nonparametric models. In Section 7, we present a novel online inference method for the popular HDP-HMM model.
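To make the Chinese restaurant process ingredient mentioned above concrete, here is a minimal sketch of sequential CRP sampling; the function name, concentration parameter, and sizes are illustrative and not taken from the thesis.

```python
# Minimal sketch of sampling table assignments from a Chinese restaurant
# process (CRP), the kind of prior used for switching speakers (Section 6).
# All names and sizes here are illustrative.
import numpy as np

def crp_sample(n_customers, alpha, rng):
    """Sequentially seat customers; returns a table index per customer."""
    assignments, table_counts = [], []
    for _ in range(n_customers):
        probs = np.array(table_counts + [alpha], dtype=float)
        probs /= probs.sum()                 # seat prop. to occupancy, new table prop. to alpha
        table = rng.choice(len(probs), p=probs)
        if table == len(table_counts):
            table_counts.append(1)           # open a new table (new speaker)
        else:
            table_counts[table] += 1
        assignments.append(int(table))
    return assignments

rng = np.random.default_rng(0)
print(crp_sample(20, alpha=1.0, rng=rng))
```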
Symbol Emergence in Robotics: A Survey
Humans can learn the use of language through physical interaction with their
environment and semiotic communication with other people. It is very important
to obtain a computational understanding of how humans can form a symbol system
and obtain semiotic skills through their autonomous mental development.
Recently, many studies have been conducted on the construction of robotic
systems and machine-learning methods that can learn the use of language through
embodied multimodal interaction with their environment and other systems.
Understanding the dynamics of symbol systems is crucially important both for
understanding human social interactions and for developing a robot that can
smoothly communicate with human users in the long term. The
embodied cognition and social interaction of participants gradually change a
symbol system in a constructive manner. In this paper, we introduce a field of
research called symbol emergence in robotics (SER). SER is a constructive
approach towards an emergent symbol system. The emergent symbol system is
socially self-organized through both semiotic communications and physical
interactions with autonomous cognitive developmental agents, i.e., humans and
developmental robots. Specifically, we describe some state-of-the-art research
topics concerning SER, e.g., multimodal categorization, word discovery, and
double articulation analysis, that enable a robot to obtain words and their
embodied meanings from raw sensory-motor information, including visual
information, haptic information, auditory information, and acoustic speech
signals, in a totally unsupervised manner. Finally, we suggest future
directions of research in SER.
Comment: submitted to Advanced Robotics
Unsupervised Word Segmentation and Lexicon Discovery Using Acoustic Word Embeddings
In settings where only unlabelled speech data is available, speech technology
needs to be developed without transcriptions, pronunciation dictionaries, or
language modelling text. A similar problem is faced when modelling infant
language acquisition. In these cases, categorical linguistic structure needs to
be discovered directly from speech audio. We present a novel unsupervised
Bayesian model that segments unlabelled speech and clusters the segments into
hypothesized word groupings. The result is a complete unsupervised tokenization
of the input speech in terms of discovered word types. In our approach, a
potential word segment (of arbitrary length) is embedded in a fixed-dimensional
acoustic vector space. The model, implemented as a Gibbs sampler, then builds a
whole-word acoustic model in this space while jointly performing segmentation.
We report word error rates in a small-vocabulary connected digit recognition
task by mapping the unsupervised decoded output to ground truth transcriptions.
The model achieves around 20% error rate, outperforming a previous HMM-based
system by about 10% absolute. Moreover, in contrast to the baseline, our model
does not require a pre-specified vocabulary size.
Comment: 11 pages, 8 figures; accepted to the IEEE/ACM Transactions on Audio, Speech, and Language Processing
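As a minimal sketch of one simple way to obtain such a fixed-dimensional embedding, the snippet below uniformly downsamples the frames of a variable-length segment and flattens the result; the function name, segment sizes, and number of samples are illustrative and not necessarily the paper's exact embedding.

```python
# Minimal sketch of a fixed-dimensional acoustic word embedding obtained by
# uniformly downsampling a variable-length segment of feature frames.
# Illustrative only; the exact embedding used in the paper may differ.
import numpy as np

def downsample_embedding(frames, n_samples=10):
    """frames: (n_frames, n_dims) array; returns a flat (n_samples * n_dims,) vector."""
    n_frames, _ = frames.shape
    # pick n_samples frame indices spread evenly over the segment
    idx = np.linspace(0, n_frames - 1, n_samples).round().astype(int)
    return frames[idx].reshape(-1)

segment = np.random.randn(37, 13)          # e.g. 37 MFCC frames of dimension 13
embedding = downsample_embedding(segment)  # fixed 130-dimensional vector
print(embedding.shape)                     # (130,)
```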
Unsupervised neural and Bayesian models for zero-resource speech processing
Zero-resource speech processing is a growing research area which aims to develop methods
that can discover linguistic structure and representations directly from unlabelled speech
audio. Such unsupervised methods would allow speech technology to be developed
in settings where transcriptions, pronunciation dictionaries, and text for language
modelling are not available. Similar methods are required for cognitive models of
language acquisition in human infants, and for developing robotic applications that are
able to automatically learn language in a novel linguistic environment.
There are two central problems in zero-resource speech processing: (i) finding frame-level feature representations which make it easier to discriminate between linguistic units
(phones or words), and (ii) segmenting and clustering unlabelled speech into meaningful
units. The claim of this thesis is that both top-down modelling (using knowledge of
higher-level units to learn, discover and gain insight into their lower-level constituents)
as well as bottom-up modelling (piecing together lower-level features to give rise to
more complex higher-level structures) are advantageous in tackling these two problems.
The thesis is divided into three parts. The first part introduces a new autoencoder-like
deep neural network for unsupervised frame-level representation learning. This
correspondence autoencoder (cAE) uses weak top-down supervision from an unsupervised
term discovery system that identifies noisy word-like terms in unlabelled speech data.
In an intrinsic evaluation of frame-level representations, the cAE outperforms several
state-of-the-art bottom-up and top-down approaches, achieving a relative improvement
of more than 60% over the previous best system. This shows that the cAE is particularly
effective in using top-down knowledge of longer-spanning patterns in the data; at the
same time, we find that the cAE is only able to learn useful representations when it is
initialized using bottom-up pretraining on a large set of unlabelled speech.
The second part of the thesis presents a novel unsupervised segmental Bayesian
model that segments unlabelled speech data and clusters the segments into hypothesized
word groupings. The result is a complete unsupervised tokenization of the input speech
in terms of discovered word types---the system essentially performs unsupervised speech
recognition. In this approach, a potential word segment (of arbitrary length) is embedded
in a fixed-dimensional vector space. The model, implemented as a Gibbs sampler, then
builds a whole-word acoustic model in this embedding space while jointly performing
segmentation. We first evaluate the approach in a small-vocabulary multi-speaker
connected digit recognition task, where we report unsupervised word error rates (WER)
by mapping the unsupervised decoded output to ground truth transcriptions. The model
achieves around 20% WER, outperforming a previous HMM-based system by about 10% absolute. To achieve this performance, the acoustic word embedding function (which
maps variable-duration segments to single vectors) is refined in a top-down manner by
using terms discovered by the model in an outer loop of segmentation.
The third and final part of the study extends the small-vocabulary system in order to handle larger vocabularies in conversational speech data. To our knowledge, this is the
first full-coverage segmentation and clustering system that is applied to large-vocabulary
multi-speaker data. To improve efficiency, the system incorporates a bottom-up syllable
boundary detection method to eliminate unlikely word boundaries. We compare the
system on English and Xitsonga datasets to several state-of-the-art baselines. We
show that by imposing a consistent top-down segmentation while also using bottom-up
knowledge from detected syllable boundaries, both single-speaker and multi-speaker
versions of our system outperform a purely bottom-up single-speaker syllable-based
approach. We also show that the discovered clusters can be made less speaker- and
gender-specific by using features from the cAE (which incorporates both top-down and
bottom-up learning). The system's discovered clusters are still less pure than those of
two multi-speaker unsupervised term discovery systems, but provide far greater coverage.
In summary, the different models and systems presented in this thesis show that both
top-down and bottom-up modelling can improve representation learning, segmentation
and clustering of unlabelled speech data.
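As an illustration of how weak top-down supervision of the kind used by the correspondence autoencoder can be turned into frame-level training pairs, the sketch below DTW-aligns two instances of the same discovered term and takes aligned frames as input/target pairs; the alignment code and data are hypothetical, not the thesis implementation.

```python
# Minimal sketch: build cAE training pairs by dynamic time warping (DTW)
# alignment of two discovered instances of the same word-like term.
# Illustrative only; not the thesis code.
import numpy as np

def dtw_path(X, Y):
    """Return the DTW alignment path between frame sequences X and Y."""
    n, m = len(X), len(Y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(X[i - 1] - Y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    path, i, j = [], n, m                    # backtrack from the end
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Two (hypothetical) instances of the same discovered term, as MFCC frames.
term_a = np.random.randn(30, 13)
term_b = np.random.randn(34, 13)
pairs = [(term_a[i], term_b[j]) for i, j in dtw_path(term_a, term_b)]
print(len(pairs), "aligned frame pairs for cAE training")
```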
Discovering a Domain Knowledge Representation for Image Grouping: Multimodal Data Modeling, Fusion, and Interactive Learning
In visually-oriented specialized medical domains such as dermatology and radiology, physicians explore interesting image cases from medical image repositories for comparative case studies to aid clinical diagnoses, educate medical trainees, and support medical research. However, general image classification and retrieval approaches fail in grouping medical images from the physicians' viewpoint. This is because fully-automated learning techniques cannot yet bridge the gap between image features and domain-specific content in the absence of expert knowledge. Understanding how experts get information from medical images is therefore an important research topic.
As a prior study, we conducted data elicitation experiments, where physicians were instructed to inspect each medical image towards a diagnosis while describing image content to a student seated nearby. Experts' eye movements and their verbal descriptions of the image content were recorded to capture various aspects of expert image understanding. This dissertation aims at an intuitive approach to extracting expert knowledge, which is to find patterns in expert data elicited from image-based diagnoses. These patterns are useful to understand both the characteristics of the medical images and the experts' cognitive reasoning processes.
The transformation from the viewed raw image features to interpretation as domain-specific concepts requires experts\u27 domain knowledge and cognitive reasoning. This dissertation also approximates this transformation using a matrix factorization-based framework, which helps project multiple expert-derived data modalities to high-level abstractions.
To combine additional expert interventions with computational processing capabilities, an interactive machine learning paradigm is developed to treat experts as an integral part of the learning process. Specifically, experts locally refine medical image groups presented by the learned model, in order to incrementally re-learn the model globally. This paradigm avoids onerous expert annotations for model training, while aligning the learned model with experts' sense-making.
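As a loose illustration of matrix-factorization-based fusion of multiple modalities (assuming, purely for illustration, gaze features and verbal-description term counts per image), the sketch below concatenates the modality matrices so that one shared coefficient matrix links images to latent abstractions; this is not the dissertation's framework, and all names and sizes are hypothetical.

```python
# Minimal sketch of joint matrix factorization over two modalities: per-image
# feature matrices are concatenated so a single shared coefficient matrix (W)
# relates images to latent abstractions. Illustrative only.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_images, k = 50, 5
gaze = rng.random((n_images, 40))                          # hypothetical gaze features
terms = rng.poisson(1.0, (n_images, 120)).astype(float)    # hypothetical term counts

V = np.hstack([gaze, terms])           # images x (gaze + term) features
model = NMF(n_components=k, init="nndsvda", max_iter=400)
W = model.fit_transform(V)             # shared image-to-abstraction coefficients
H = model.components_                  # abstraction-to-feature loadings (both modalities)

# Group images by their dominant latent abstraction.
groups = W.argmax(axis=1)
print(groups[:10])
```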
A segmental framework for fully-unsupervised large-vocabulary speech recognition
Zero-resource speech technology is a growing research area that aims to
develop methods for speech processing in the absence of transcriptions,
lexicons, or language modelling text. Early term discovery systems focused on
identifying isolated recurring patterns in a corpus, while more recent
full-coverage systems attempt to completely segment and cluster the audio into
word-like units---effectively performing unsupervised speech recognition. This
article presents the first attempt we are aware of to apply such a system to
large-vocabulary multi-speaker data. Our system uses a Bayesian modelling
framework with segmental word representations: each word segment is represented
as a fixed-dimensional acoustic embedding obtained by mapping the sequence of
feature frames to a single embedding vector. We compare our system on English
and Xitsonga datasets to state-of-the-art baselines, using a variety of
measures including word error rate (obtained by mapping the unsupervised output
to ground truth transcriptions). Very high word error rates are reported---in
the order of 70--80% for speaker-dependent and 80--95% for speaker-independent
systems---highlighting the difficulty of this task. Nevertheless, in terms of
cluster quality and word segmentation metrics, we show that by imposing a
consistent top-down segmentation while also using bottom-up knowledge from
detected syllable boundaries, both single-speaker and multi-speaker versions of
our system outperform a purely bottom-up single-speaker syllable-based
approach. We also show that the discovered clusters can be made less speaker-
and gender-specific by using an unsupervised autoencoder-like feature extractor
to learn better frame-level features (prior to embedding). Our system's
discovered clusters are still less pure than those of unsupervised term
discovery systems, but provide far greater coverage.
Comment: 15 pages, 6 figures, 8 tables
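As a minimal sketch of the kind of scoring both of these systems report, the snippet below maps discovered cluster IDs many-to-one to ground-truth words and computes WER by edit distance; the token-level alignment assumed here is a simplification, and the papers' actual evaluation procedure may differ.

```python
# Minimal sketch: map each discovered cluster to the ground-truth word it most
# often co-occurs with, then score the mapped output with word error rate.
# Illustrative only; not the papers' exact evaluation.
from collections import Counter, defaultdict

def map_clusters(hyp_clusters, ref_words):
    """hyp_clusters / ref_words: token-aligned lists of cluster IDs and words."""
    votes = defaultdict(Counter)
    for c, w in zip(hyp_clusters, ref_words):
        votes[c][w] += 1
    return {c: counter.most_common(1)[0][0] for c, counter in votes.items()}

def wer(ref, hyp):
    """Word error rate via edit distance."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)]
         for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[-1][-1] / len(ref)

# Hypothetical decoded cluster IDs, token-aligned with reference words.
hyp_clusters = [3, 7, 7, 1, 3, 2]
ref_words = ["one", "two", "two", "five", "one", "oh"]
mapping = map_clusters(hyp_clusters, ref_words)
hyp_words = [mapping[c] for c in hyp_clusters]
print(mapping, round(wer(ref_words, hyp_words), 3))
```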
Modeling and inference of changing dependence among multiple time-series
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Cataloged from PDF version of thesis. Includes bibliographical references (p. 183-190).

In this dissertation we investigate the problem of reasoning over evolving structures which describe the dependence among multiple, possibly vector-valued, time-series. Such problems arise naturally in a variety of settings. Consider the problem of object interaction analysis. Given tracks of multiple moving objects, one may wish to describe if and how these objects are interacting over time. Alternatively, consider a scenario in which one observes multiple video streams representing participants in a conversation. Given a single audio stream, one may wish to determine with which video stream the audio stream is associated as a means of indicating who is speaking at any point in time. Both of these problems can be cast as inference over dependence structures. In the absence of training data, such reasoning is challenging for several reasons. If one is solely interested in the structure of dependence as described by a graphical model, there is the question of how to account for unknown parameters. Additionally, the set of possible structures is generally super-exponential in the number of time series. Furthermore, if one wishes to reason about structure which varies over time, the number of structural sequences grows exponentially with the length of time being analyzed.

We present tractable methods for reasoning in such scenarios. We consider two approaches for reasoning over structure while treating the unknown parameters as nuisance variables. First, we develop a generalized likelihood approach in which point estimates of parameters are used in place of the unknown quantities. We explore this approach in scenarios in which one considers a small enumerated set of specified structures. Second, we develop a Bayesian approach and present a conjugate prior on the parameters and structure of a model describing the dependence among time-series. This allows for Bayesian reasoning over structure while integrating over parameters. The modular nature of the prior we define allows one to reason over a super-exponential number of structures in exponential time in general. Furthermore, by imposing simple local or global structural constraints, we show that one can reduce the exponential-time complexity to polynomial-time complexity while still reasoning over a super-exponential number of candidate structures.

We cast the problem of reasoning over temporally evolving structures as inference over a latent state sequence which indexes structure over time in a dynamic Bayesian network. This model allows one to utilize standard algorithms such as Expectation Maximization, Viterbi decoding, forward-backward messaging and Gibbs sampling in order to efficiently reason over an exponential number of structural sequences. We demonstrate the utility of our methodology on two tasks: audio-visual association and moving object interaction analysis. We achieve state-of-the-art performance on a standard audio-visual dataset and show how our model allows one to tractably make exact probabilistic statements about interactions among multiple moving objects.

by Michael Richard Siracusa. Ph.D.
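As an illustration of the latent-state formulation described above, the sketch below runs Viterbi decoding over a state sequence in which each state indexes a candidate dependence structure and emissions are per-frame log-likelihoods under that structure; the scores, transition matrix, and sizes are made-up stand-ins, not the thesis's model.

```python
# Minimal sketch of Viterbi decoding over a latent state sequence, where each
# state indexes a candidate dependence structure and emissions are per-frame
# log-likelihoods of the data under that structure. Illustrative numbers only.
import numpy as np

def viterbi(log_lik, log_trans, log_init):
    """log_lik: (T, S) per-frame log-likelihoods; log_trans: (S, S); log_init: (S,)."""
    T, S = log_lik.shape
    delta = np.empty((T, S))
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_init + log_lik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans      # scores[from, to]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_lik[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
T, S = 12, 3                                             # frames, candidate structures
log_lik = np.log(rng.dirichlet(np.ones(S), size=T))      # stand-in emission scores
log_trans = np.log(np.full((S, S), 0.05) + 0.85 * np.eye(S))  # sticky transitions
log_init = np.log(np.full(S, 1.0 / S))
print(viterbi(log_lik, log_trans, log_init))
```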