5 research outputs found
Double Articulation Analyzer with Prosody for Unsupervised Word and Phoneme Discovery
Infants acquire words and phonemes from unsegmented speech signals using
segmentation cues, such as distributional, prosodic, and co-occurrence cues.
Most existing computational models of this process focus on either distributional or prosodic cues alone. This paper proposes a nonparametric
Bayesian probabilistic generative model called the prosodic hierarchical
Dirichlet process-hidden language model (Prosodic HDP-HLM). Prosodic HDP-HLM,
an extension of HDP-HLM, considers both prosodic and distributional cues within
a single integrative generative model. We conducted three experiments on different types of datasets and demonstrated the validity of the proposed method. The results show that the prosodic double articulation analyzer (Prosodic DAA), an unsupervised learning method based on the Prosodic HDP-HLM, successfully uses prosodic cues and outperforms a method that relies on distributional cues alone. The main
contributions of this study are as follows: 1) We develop a probabilistic
generative model for time series data including prosody that potentially has a
double articulation structure; 2) We propose the Prosodic DAA by deriving the
inference procedure for Prosodic HDP-HLM and show that Prosodic DAA can
discover words directly from continuous human speech signals using statistical
information and prosodic information in an unsupervised manner; 3) We show that
prosodic cues contribute more to word segmentation when word frequencies follow a natural distribution, i.e., Zipf's law.
Comment: 11 pages. Submitted to IEEE Transactions on Cognitive and Developmental Systems.
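The nonparametric Bayesian machinery behind HDP-based models such as the one above rests on the Dirichlet process, whose stick-breaking (GEM) construction fits in a few lines. This is a generic illustration of the prior, not code from the paper; the function name and truncation level are our own:

```python
import numpy as np

def stick_breaking(alpha, num_sticks, rng):
    """Draw truncated stick-breaking weights beta ~ GEM(alpha)."""
    v = rng.beta(1.0, alpha, size=num_sticks)            # break proportions
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    return v * remaining                                 # weight of each stick

rng = np.random.default_rng(0)
weights = stick_breaking(alpha=1.0, num_sticks=100, rng=rng)
# The truncated weights sum to just under 1; a smaller alpha concentrates
# mass on the first few sticks, i.e., on fewer latent units.
```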
Unsupervised Word Segmentation and Lexicon Discovery Using Acoustic Word Embeddings
In settings where only unlabelled speech data is available, speech technology
needs to be developed without transcriptions, pronunciation dictionaries, or
language modelling text. A similar problem is faced when modelling infant
language acquisition. In these cases, categorical linguistic structure needs to
be discovered directly from speech audio. We present a novel unsupervised
Bayesian model that segments unlabelled speech and clusters the segments into
hypothesized word groupings. The result is a complete unsupervised tokenization
of the input speech in terms of discovered word types. In our approach, a
potential word segment (of arbitrary length) is embedded in a fixed-dimensional
acoustic vector space. The model, implemented as a Gibbs sampler, then builds a
whole-word acoustic model in this space while jointly performing segmentation.
We report word error rates in a small-vocabulary connected digit recognition
task by mapping the unsupervised decoded output to ground truth transcriptions.
The model achieves around 20% error rate, outperforming a previous HMM-based
system by about 10% absolute. Moreover, in contrast to the baseline, our model
does not require a pre-specified vocabulary size.
Comment: 11 pages, 8 figures. Accepted to the IEEE/ACM Transactions on Audio, Speech, and Language Processing.
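The key device in the abstract above is mapping variable-length segments into a fixed-dimensional space. One common baseline embedding, shown here purely as an illustration (not necessarily the paper's embedding function), samples a fixed number of frames at uniform intervals and flattens them:

```python
import numpy as np

def downsample_embedding(frames, k=10):
    """Map a (T, d) sequence of frame features to a fixed (k*d,) vector
    by picking k frames at uniform intervals and flattening them."""
    T, d = frames.shape
    idx = np.linspace(0, T - 1, k).round().astype(int)
    return frames[idx].reshape(-1)

# Two segments of different durations, e.g. 13-dimensional MFCC frames.
short = np.random.default_rng(1).normal(size=(23, 13))
long_ = np.random.default_rng(2).normal(size=(87, 13))
e1, e2 = downsample_embedding(short), downsample_embedding(long_)
# Both segments land in the same 130-dimensional space, so they can be
# compared directly or modelled with a whole-word acoustic model.
```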
Unsupervised neural and Bayesian models for zero-resource speech processing
Zero-resource speech processing is a growing research area which aims to develop methods
that can discover linguistic structure and representations directly from unlabelled speech
audio. Such unsupervised methods would allow speech technology to be developed
in settings where transcriptions, pronunciation dictionaries, and text for language
modelling are not available. Similar methods are required for cognitive models of
language acquisition in human infants, and for developing robotic applications that are
able to automatically learn language in a novel linguistic environment.
There are two central problems in zero-resource speech processing: (i) finding frame-level feature representations which make it easier to discriminate between linguistic units
(phones or words), and (ii) segmenting and clustering unlabelled speech into meaningful
units. The claim of this thesis is that both top-down modelling (using knowledge of
higher-level units to learn, discover and gain insight into their lower-level constituents)
as well as bottom-up modelling (piecing together lower-level features to give rise to
more complex higher-level structures) are advantageous in tackling these two problems.
The thesis is divided into three parts. The first part introduces a new autoencoder-like
deep neural network for unsupervised frame-level representation learning. This
correspondence autoencoder (cAE) uses weak top-down supervision from an unsupervised
term discovery system that identifies noisy word-like terms in unlabelled speech data.
In an intrinsic evaluation of frame-level representations, the cAE outperforms several
state-of-the-art bottom-up and top-down approaches, achieving a relative improvement
of more than 60% over the previous best system. This shows that the cAE is particularly
effective in using top-down knowledge of longer-spanning patterns in the data; at the
same time, we find that the cAE is only able to learn useful representations when it is
initialized using bottom-up pretraining on a large set of unlabelled speech.

The second part of the thesis presents a novel unsupervised segmental Bayesian
model that segments unlabelled speech data and clusters the segments into hypothesized
word groupings. The result is a complete unsupervised tokenization of the input speech
in terms of discovered word types; the system essentially performs unsupervised speech
recognition. In this approach, a potential word segment (of arbitrary length) is embedded
in a fixed-dimensional vector space. The model, implemented as a Gibbs sampler, then
builds a whole-word acoustic model in this embedding space while jointly performing
segmentation. We first evaluate the approach in a small-vocabulary multi-speaker
connected digit recognition task, where we report unsupervised word error rates (WER)
by mapping the unsupervised decoded output to ground truth transcriptions. The model
achieves around 20% WER, outperforming a previous HMM-based system by about 10% absolute. To achieve this performance, the acoustic word embedding function (which
maps variable-duration segments to single vectors) is refined in a top-down manner by
using terms discovered by the model in an outer loop of segmentation.
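The joint segmentation-and-clustering step can be pictured with a toy Gibbs sampler over fixed "embeddings". The sketch below is a drastic simplification of the model described above (two clusters, fixed spherical variance, a size-based prior), intended only to show the resampling pattern, not the thesis model itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy acoustic word embeddings: two well-separated word types in 2-D.
X = np.vstack([rng.normal(-3.0, 0.5, size=(20, 2)),
               rng.normal(+3.0, 0.5, size=(20, 2))])
z = rng.integers(0, 2, size=len(X))   # random initial cluster assignments
K, sigma2 = 2, 0.25                   # fixed component variance (assumption)

def gibbs_sweep(X, z, K, sigma2, rng):
    """One Gibbs sweep: resample each assignment given all the others,
    with component means estimated from the remaining points."""
    n = len(X)
    for i in range(n):
        logp = np.empty(K)
        for k in range(K):
            others = (z == k) & (np.arange(n) != i)
            n_k = others.sum()
            mu = X[others].mean(axis=0) if n_k else np.zeros(X.shape[1])
            # size-based prior term + spherical Gaussian log-likelihood
            logp[k] = np.log(n_k + 1.0) - ((X[i] - mu) ** 2).sum() / (2 * sigma2)
        p = np.exp(logp - logp.max())
        z[i] = rng.choice(K, p=p / p.sum())
    return z

for _ in range(5):
    z = gibbs_sweep(X, z, K, sigma2, rng)
# After a few sweeps the two ground-truth word types separate into
# distinct hypothesized clusters.
```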
The third and final part of the thesis extends the small-vocabulary system to handle larger vocabularies in conversational speech data. To our knowledge, this is the
first full-coverage segmentation and clustering system that is applied to large-vocabulary
multi-speaker data. To improve efficiency, the system incorporates a bottom-up syllable
boundary detection method to eliminate unlikely word boundaries. We compare the
system on English and Xitsonga datasets to several state-of-the-art baselines. We
show that by imposing a consistent top-down segmentation while also using bottom-up
knowledge from detected syllable boundaries, both single-speaker and multi-speaker
versions of our system outperform a purely bottom-up single-speaker syllable-based
approach. We also show that the discovered clusters can be made less speaker- and
gender-specific by using features from the cAE (which incorporates both top-down and
bottom-up learning). The system's discovered clusters are still less pure than those of
two multi-speaker unsupervised term discovery systems, but provide far greater coverage.
In summary, the different models and systems presented in this thesis show that both
top-down and bottom-up modelling can improve representation learning, segmentation
and clustering of unlabelled speech data.
Sequence-to-sequence learning for machine translation and automatic differentiation for machine learning software tools
This thesis consists of a series of articles that contribute to the field of machine learning. In particular, it covers two distinct and loosely related fields.
The first three articles consider the use of neural network models for problems in natural language processing (NLP). The first article introduces an encoder-decoder structure built from recurrent neural networks (RNNs) for translating phrases and sentences of variable length. The second article presents a quantitative and qualitative analysis of the performance of these 'neural machine translation' models, laying bare the difficulties posed by long sentences and rare words. The third article deals with handling rare and out-of-vocabulary words in neural network models by combining dictionary coder compression algorithms with multi-scale RNN models.
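The encoder-decoder structure mentioned above can be illustrated with a toy untrained RNN: the encoder folds an arbitrary-length token sequence into one fixed vector, and the decoder emits tokens conditioned on it. All weights are random and the vocabulary is invented, so this shows only the architecture, not the papers' trained models:

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 12, 16                      # toy vocabulary size and hidden size
E = rng.normal(0, 0.1, (V, H))     # token embeddings (shared for brevity)
W_enc, U_enc = rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (H, H))
W_dec, U_dec = rng.normal(0, 0.1, (H, H)), rng.normal(0, 0.1, (H, H))
W_out = rng.normal(0, 0.1, (H, V))

def encode(tokens):
    """Fold a variable-length source sentence into one fixed H-dim vector."""
    h = np.zeros(H)
    for t in tokens:
        h = np.tanh(E[t] @ W_enc + h @ U_enc)
    return h

def decode(h, max_len=5, bos=0):
    """Greedy decoding conditioned on the encoder's summary vector."""
    out, tok, s = [], bos, h
    for _ in range(max_len):
        s = np.tanh(E[tok] @ W_dec + s @ U_dec)
        tok = int((s @ W_out).argmax())
        out.append(tok)
    return out

summary = encode([3, 1, 4, 1, 5])  # any input length, fixed-size summary
translation = decode(summary)      # token ids from the toy target vocabulary
```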
The second half of this thesis does not deal with specific neural network models, but with the software tools and frameworks that can be used to define and train them. Modern deep learning frameworks need to be able to efficiently execute programs involving linear algebra and array programming, while also being able to employ automatic differentiation (AD) in order to calculate a variety of derivatives. The first article provides an overview of the difficulties posed in reconciling these two objectives, and introduces a graph-based intermediate representation that aims to tackle these difficulties. The second article considers a different approach to the same problem, implementing a tape-based source-code transformation approach to AD on a dynamically typed array programming language (Python and NumPy).
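The tape idea behind reverse-mode AD can be sketched with operator overloading: every operation records itself and its local derivatives on a tape, which a backward pass then walks in reverse. Note that the article above applies the tape via source-code transformation rather than the overloading shown here; the class and method names are our own illustration:

```python
class Var:
    """Scalar value that records operations on a global tape so that
    reverse-mode AD can replay them backwards."""
    _tape = []

    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0
        Var._tape.append(self)

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def backward(out):
    """Walk the tape in reverse, accumulating d(out)/d(node) by the chain rule."""
    out.grad = 1.0
    for node in reversed(Var._tape):
        for parent, local in node.parents:
            parent.grad += local * node.grad

x, y = Var(2.0), Var(3.0)
z = x * y + x          # z = x*y + x, so dz/dx = y + 1 and dz/dy = x
backward(z)            # x.grad -> 4.0, y.grad -> 2.0
```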