Unsupervised neural and Bayesian models for zero-resource speech processing
Zero-resource speech processing is a growing research area which aims to develop methods
that can discover linguistic structure and representations directly from unlabelled speech
audio. Such unsupervised methods would allow speech technology to be developed
in settings where transcriptions, pronunciation dictionaries, and text for language
modelling are not available. Similar methods are required for cognitive models of
language acquisition in human infants, and for developing robotic applications that are
able to automatically learn language in a novel linguistic environment.
There are two central problems in zero-resource speech processing: (i) finding frame-level feature representations which make it easier to discriminate between linguistic units
(phones or words), and (ii) segmenting and clustering unlabelled speech into meaningful
units. The claim of this thesis is that both top-down modelling (using knowledge of
higher-level units to learn, discover and gain insight into their lower-level constituents)
as well as bottom-up modelling (piecing together lower-level features to give rise to
more complex higher-level structures) are advantageous in tackling these two problems.
The thesis is divided into three parts. The first part introduces a new autoencoder-like
deep neural network for unsupervised frame-level representation learning. This
correspondence autoencoder (cAE) uses weak top-down supervision from an unsupervised
term discovery system that identifies noisy word-like terms in unlabelled speech data.
In an intrinsic evaluation of frame-level representations, the cAE outperforms several
state-of-the-art bottom-up and top-down approaches, achieving a relative improvement
of more than 60% over the previous best system. This shows that the cAE is particularly
effective in using top-down knowledge of longer-spanning patterns in the data; at the
same time, we find that the cAE is only able to learn useful representations when it is
initialized using bottom-up pretraining on a large set of unlabelled speech.
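As a concrete illustration of the cAE training idea, here is a minimal sketch in Python, assuming frame pairs that have already been aligned (e.g. by dynamic time warping) within the discovered word-like terms; the architecture and hyperparameters are illustrative, not those used in the thesis.

# Minimal sketch of correspondence autoencoder (cAE) training.
# `x` and `y` are frames aligned within word-like terms found by an
# unsupervised term discovery system; dimensions are illustrative.
import torch
import torch.nn as nn

D = 39  # e.g. MFCCs with deltas per frame

encoder = nn.Sequential(nn.Linear(D, 100), nn.Tanh(),
                        nn.Linear(100, 39))            # bottleneck features
decoder = nn.Sequential(nn.Tanh(), nn.Linear(39, 100), nn.Tanh(),
                        nn.Linear(100, D))

opt = torch.optim.Adam(list(encoder.parameters()) +
                       list(decoder.parameters()), lr=1e-3)

def cae_step(x, y):
    """One update: reconstruct frame y from its aligned partner x."""
    opt.zero_grad()
    loss = nn.functional.mse_loss(decoder(encoder(x)), y)
    loss.backward()
    opt.step()
    return loss.item()

# In line with the finding above, the network would first be pretrained
# bottom-up as an ordinary autoencoder (x -> x) on a large set of
# unlabelled speech, then fine-tuned on the weak top-down pairs (x -> y);
# after training, encoder(x) gives the new frame-level representation.

The second part of the thesis presents a novel unsupervised segmental Bayesian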
model that segments unlabelled speech data and clusters the segments into hypothesized
word groupings. The result is a complete unsupervised tokenization of the input speech
in terms of discovered word types: the system essentially performs unsupervised speech
recognition. In this approach, a potential word segment (of arbitrary length) is embedded
in a fixed-dimensional vector space. The model, implemented as a Gibbs sampler, then
builds a whole-word acoustic model in this embedding space while jointly performing
segmentation. We first evaluate the approach in a small-vocabulary multi-speaker
connected digit recognition task, where we report unsupervised word error rates (WER)
by mapping the unsupervised decoded output to ground truth transcriptions. The model
achieves around 20% WER, outperforming a previous HMM-based system by about 10% absolute. To achieve this performance, the acoustic word embedding function (which
maps variable-duration segments to single vectors) is refined in a top-down manner by
using terms discovered by the model in an outer loop of segmentation.
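The abstract leaves the embedding function unspecified; one simple form such a mapping can take (an assumption for illustration, e.g. uniform downsampling as used in earlier acoustic word embedding work) is sketched below.

# Sketch of a fixed-dimensional acoustic word embedding by uniform
# downsampling; one simple instance of such a mapping, not necessarily
# the refined embedding function used in the thesis.
import numpy as np

def embed(frames: np.ndarray, k: int = 10) -> np.ndarray:
    """Map a (T, D) variable-length segment to a fixed (k*D,) vector
    by keeping k uniformly spaced frames and flattening."""
    T = frames.shape[0]
    idx = np.linspace(0, T - 1, k).round().astype(int)
    return frames[idx].ravel()

# Segments of any duration now live in one vector space, so a whole-word
# acoustic model (e.g. a mixture component per word type) can be built
# over them while segmentation is sampled jointly.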
The third and final part of the study extends the small-vocabulary system in order to handle larger vocabularies in conversational speech data. To our knowledge, this is the
first full-coverage segmentation and clustering system that is applied to large-vocabulary
multi-speaker data. To improve efficiency, the system incorporates a bottom-up syllable
boundary detection method to eliminate unlikely word boundaries. We compare the
system on English and Xitsonga datasets to several state-of-the-art baselines. We
show that by imposing a consistent top-down segmentation while also using bottom-up
knowledge from detected syllable boundaries, both single-speaker and multi-speaker
versions of our system outperform a purely bottom-up single-speaker syllable-based
approach. We also show that the discovered clusters can be made less speaker- and
gender-specific by using features from the cAE (which incorporates both top-down and
bottom-up learning). The system's discovered clusters are still less pure than those of
two multi-speaker unsupervised term discovery systems, but provide far greater coverage.
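The syllable-based pruning mentioned above can be sketched as follows; the boundary detector itself is assumed given, and the maximum span of six syllables is an illustrative choice, not a figure from the thesis.

# Sketch of bottom-up pruning: candidate word segments may only start
# and end at detected syllable boundaries (frame indices assumed given
# by a separate syllable boundary detector).
def candidate_segments(syllable_bounds, max_syllables=6):
    """Yield (start, end) frame spans covering 1..max_syllables syllables."""
    b = sorted(syllable_bounds)
    for i in range(len(b) - 1):
        for j in range(i + 1, min(i + 1 + max_syllables, len(b))):
            yield b[i], b[j]

# e.g. with boundaries at frames [0, 12, 30, 47, 60], word hypotheses are
# spans like (0, 12), (0, 30), (12, 47), ... rather than all O(T^2) frame
# pairs, which keeps large-vocabulary search tractable.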
In summary, the different models and systems presented in this thesis show that both
top-down and bottom-up modelling can improve representation learning, segmentation
and clustering of unlabelled speech data.
A segmental framework for fully-unsupervised large-vocabulary speech recognition
Zero-resource speech technology is a growing research area that aims to
develop methods for speech processing in the absence of transcriptions,
lexicons, or language modelling text. Early term discovery systems focused on
identifying isolated recurring patterns in a corpus, while more recent
full-coverage systems attempt to completely segment and cluster the audio into
word-like units---effectively performing unsupervised speech recognition. This
article presents the first attempt we are aware of to apply such a system to
large-vocabulary multi-speaker data. Our system uses a Bayesian modelling
framework with segmental word representations: each word segment is represented
as a fixed-dimensional acoustic embedding obtained by mapping the sequence of
feature frames to a single embedding vector. We compare our system on English
and Xitsonga datasets to state-of-the-art baselines, using a variety of
measures including word error rate (obtained by mapping the unsupervised output
to ground truth transcriptions). Very high word error rates are reported---in
the order of 70--80% for speaker-dependent and 80--95% for speaker-independent
systems---highlighting the difficulty of this task. Nevertheless, in terms of
cluster quality and word segmentation metrics, we show that by imposing a
consistent top-down segmentation while also using bottom-up knowledge from
detected syllable boundaries, both single-speaker and multi-speaker versions of
our system outperform a purely bottom-up single-speaker syllable-based
approach. We also show that the discovered clusters can be made less speaker-
and gender-specific by using an unsupervised autoencoder-like feature extractor
to learn better frame-level features (prior to embedding). Our system's
discovered clusters are still less pure than those of unsupervised term
discovery systems, but provide far greater coverage.
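One way to realise the mapping step in the word error rate evaluation described above is a majority-overlap assignment of discovered clusters to ground truth words; this particular rule is an assumption for illustration, and the article may use a different assignment.

# Sketch of the evaluation mapping: each discovered cluster ID is mapped
# to the ground truth word it most often overlaps, after which standard
# WER can be computed on the relabelled output.
from collections import Counter, defaultdict

def map_clusters(tokens):
    """tokens: iterable of (cluster_id, true_word) overlap pairs."""
    votes = defaultdict(Counter)
    for cid, word in tokens:
        votes[cid][word] += 1
    return {cid: c.most_common(1)[0][0] for cid, c in votes.items()}

mapping = map_clusters([(3, "water"), (3, "water"), (3, "walker"), (7, "the")])
assert mapping[3] == "water"   # cluster 3 decodes as "water"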
Measuring context dependency in birdsong using artificial neural networks
Context dependency is a key feature of the sequential structure of human language, requiring reference between words far apart in the produced sequence. Assessing how far back the context affects the current output provides crucial information for understanding the mechanisms behind complex sequential behaviors. Birdsong serves as a representative model for studying context dependency in sequential signals produced by non-human animals, but previous estimates were upper-bounded by methodological limitations. Here, we estimated the context dependency in birdsong in a more scalable way using a modern neural-network-based language model whose accessible context length is sufficiently long. The detected context dependency was beyond the order of traditional Markovian models of birdsong, but was consistent with previous experimental investigations. We also studied the relation between the assumed or auto-detected vocabulary size of birdsong (i.e., fine- vs. coarse-grained syllable classifications) and the context dependency, finding that the larger the assumed vocabulary (i.e., the more fine-grained the classification), the shorter the detected context dependency.
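The abstract describes the estimation only at a high level; one way to operationalise it is to evaluate a trained model under truncated contexts, as in the sketch below, where log_prob is a hypothetical stand-in for the trained language model's scoring function.

# Sketch of measuring context dependency: score the same syllable
# sequences while truncating the visible context to L preceding tokens,
# and find the smallest L beyond which perplexity stops improving.
# `log_prob(context, token)` is a stand-in for any trained
# autoregressive model (hypothetical interface).
import math

def perplexity_at(L, sequences, log_prob):
    total, n = 0.0, 0
    for seq in sequences:
        for t, tok in enumerate(seq):
            total += log_prob(seq[max(0, t - L):t], tok)
            n += 1
    return math.exp(-total / n)

# Context dependency = smallest L where perplexity_at(L, ...) is
# indistinguishable from the full-context perplexity.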
Unsupervised Lexicon Discovery from Acoustic Input
We present a model of unsupervised phonological lexicon discovery -- the problem of simultaneously learning phoneme-like and word-like units from acoustic input. Our model builds on earlier models of unsupervised phone-like unit discovery from acoustic data (Lee and Glass, 2012) and unsupervised symbolic lexicon discovery using the Adaptor Grammar framework (Johnson et al., 2006), integrating these earlier approaches using a probabilistic model of phonological variation. We show that the model is competitive with state-of-the-art spoken term discovery systems, and present analyses exploring the model's behavior and the kinds of linguistic structures it learns.
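As a toy illustration of the noisy-channel idea that links the word-like and phone-like levels, the following sketch scores a surface phone string as a lexicon prior times a crude phonological variation model; all probabilities and forms here are invented for illustration, and the paper's actual variation model is richer.

# Toy noisy-channel scoring: P(word) * P(surface phones | word).
def edit_likelihood(underlying, surface, p_match=0.9, p_edit=0.05):
    """Crude position-wise variation model (a real model would also
    handle insertions and deletions, e.g. via weighted edit distance)."""
    score = 1.0
    for u, s in zip(underlying, surface):
        score *= p_match if u == s else p_edit
    return score

lexicon = {("w", "a", "t", "er"): 0.01, ("w", "o", "t", "er"): 0.0001}
surface = ("w", "a", "d", "er")  # flapped /t/ on the surface
best = max(lexicon, key=lambda w: lexicon[w] * edit_likelihood(w, surface))
# best recovers ("w", "a", "t", "er") despite the surface variation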
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation of speech signals, methods for speech feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to applications that use the information from automatic speech recognition: speaker identification and tracking, prosody modeling in emotion-detection systems, and other speech processing applications able to operate in real-world environments, such as mobile communication services and smart homes.
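As a concrete example of the feature-extraction step mentioned above, here is a minimal MFCC pipeline using librosa; the filename and parameter values are placeholders.

# Sketch of a standard speech-feature pipeline: MFCC extraction.
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)     # waveform, 16 kHz
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape (13, n_frames)
delta = librosa.feature.delta(mfcc)                 # first derivatives
# Stacking mfcc + delta (+ delta-delta) gives the frame-level features
# typically fed to the acoustic model.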
Hidden Markov Models
Hidden Markov Models (HMMs), although known for decades, have enjoyed a remarkable resurgence in recent years and are still under active development. This book presents theoretical issues and a variety of HMM applications in speech recognition and synthesis, medicine, neurosciences, computational biology, bioinformatics, seismology, environment protection and engineering. I hope that readers will find this book useful and helpful for their own research.
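For reference, the core inference routine underlying such applications is the forward algorithm; below is a minimal NumPy sketch with an invented two-state, two-symbol model.

# Forward algorithm for a discrete-observation HMM.
import numpy as np

def forward(pi, A, B, obs):
    """pi: (S,) initial probs; A: (S,S) transitions; B: (S,V) emissions;
    obs: sequence of observation indices. Returns P(obs | model)."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

# Illustrative two-state model (all numbers invented)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(forward(pi, A, B, [0, 1, 0]))  # likelihood of the sequence 0, 1, 0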