Unsupervised lexical clustering of speech segments using fixed dimensional acoustic embeddings
Unsupervised speech processing methods are essential for applications ranging from zero-resource speech technology to modelling child language acquisition. One challenging problem is discovering the word inventory of the language: the lexicon. Lexical clustering is the task of grouping unlabelled acoustic word tokens according to type. We propose a novel lexical clustering model: variable-length word segments are embedded in a fixed-dimensional acoustic space in which clustering is then performed. We evaluate several clustering algorithms and find that the best methods produce clusters with wide variation in sizes, as observed in natural language. The best probabilistic approach is an infinite Gaussian mixture model (IGMM), which automatically chooses the number of clusters. Performance is comparable to that of non-probabilistic Chinese Whispers and average-linkage hierarchical clustering. We conclude that IGMM clustering of fixed-dimensional embeddings holds promise as the lexical clustering component in unsupervised speech processing systems. Index Terms — Lexical clustering, unsupervised learning, fixed-dimensional embeddings, lexical discovery
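The IGMM clustering step described above can be sketched with scikit-learn's BayesianGaussianMixture, whose Dirichlet-process prior similarly prunes unused components so the effective number of clusters is chosen automatically. The synthetic embeddings, dimensionalities, and hyperparameters below are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Synthetic "acoustic word embeddings": 3 word types, 50 tokens each,
# in a 10-dimensional fixed-dimensional space (illustrative data only).
centres = rng.normal(scale=5.0, size=(3, 10))
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 10)) for c in centres])

# Dirichlet-process GMM: we give an upper bound on the number of
# components and the model down-weights the unused ones, approximating
# the IGMM's automatic choice of cluster count.
dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)

labels = dpgmm.predict(X)
n_used = len(np.unique(labels))
print(n_used)  # effective number of clusters actually used
```

With well-separated synthetic clusters the model should settle on far fewer components than the upper bound, mirroring the size variation the abstract reports.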
Integrating Form and Meaning: A Multi-Task Learning Model for Acoustic Word Embeddings
Models of acoustic word embeddings (AWEs) learn to map variable-length spoken
word segments onto fixed-dimensionality vector representations such that
different acoustic exemplars of the same word are projected nearby in the
embedding space. In addition to their speech technology applications, AWE
models have been shown to predict human performance on a variety of auditory
lexical processing tasks. Current AWE models are based on neural networks and
trained in a bottom-up approach that integrates acoustic cues to build up a
word representation given an acoustic or symbolic supervision signal.
Therefore, these models do not leverage or capture high-level lexical knowledge
during the learning process. In this paper, we propose a multi-task learning
model that incorporates top-down lexical knowledge into the training procedure
of AWEs. Our model learns a mapping between the acoustic input and a lexical
representation that encodes high-level information such as word semantics in
addition to bottom-up form-based supervision. We experiment with three
languages and demonstrate that incorporating lexical knowledge improves the
embedding space discriminability and encourages the model to better separate
lexical categories.
Comment: Accepted in INTERSPEECH 202
Improved acoustic word embeddings for zero-resource languages using multilingual transfer
Acoustic word embeddings are fixed-dimensional representations of
variable-length speech segments. Such embeddings can form the basis for speech
search, indexing and discovery systems when conventional speech recognition is
not possible. In zero-resource settings where unlabelled speech is the only
available resource, we need a method that gives robust embeddings on an
arbitrary language. Here we explore multilingual transfer: we train a single
supervised embedding model on labelled data from multiple well-resourced
languages and then apply it to unseen zero-resource languages. We consider
three multilingual recurrent neural network (RNN) models: a classifier trained
on the joint vocabularies of all training languages; a Siamese RNN trained to
discriminate between same and different words from multiple languages; and a
correspondence autoencoder (CAE) RNN trained to reconstruct word pairs. In a
word discrimination task on six target languages, all of these models
outperform state-of-the-art unsupervised models trained on the zero-resource
languages themselves, giving relative improvements of more than 30% in average
precision. When using only a few training languages, the multilingual CAE
performs better, but with more training languages the other multilingual models
perform similarly. Using more training languages is generally beneficial, but
improvements are marginal on some languages. We present probing experiments
which show that the CAE encodes more phonetic, word duration, language identity
and speaker information than the other multilingual models.
Comment: 11 pages, 7 figures, 8 tables. arXiv admin note: text overlap with arXiv:2002.02109. Submitted to the IEEE Transactions on Audio, Speech and Language Processing
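The word discrimination task mentioned in this abstract is typically run as a same-different evaluation: every pair of test embeddings is scored by a similarity measure, and average precision summarises how well same-word pairs rank above different-word pairs. A minimal sketch with toy embeddings (labels, dimensionality, and noise levels are all illustrative assumptions):

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(1)

# Toy embeddings labelled by word type (illustrative, not real AWEs).
labels = ["cat", "cat", "dog", "dog", "sun", "sun"]
centres = {w: rng.normal(size=8) for w in set(labels)}
emb = np.stack([centres[w] + 0.05 * rng.normal(size=8) for w in labels])

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Same-different task: score all pairs, mark same-word pairs as positive.
y_true, y_score = [], []
for i, j in combinations(range(len(labels)), 2):
    y_true.append(labels[i] == labels[j])
    y_score.append(cosine(emb[i], emb[j]))

ap = average_precision_score(y_true, y_score)
print(round(ap, 3))  # average precision over all embedding pairs
```

The "relative improvements of more than 30% in average precision" in the abstract refer to exactly this kind of score, computed over many word pairs in each target language.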
Feature Trajectory Dynamic Time Warping for Clustering of Speech Segments
Dynamic time warping (DTW) can be used to compute the similarity between two
sequences of generally differing length. We propose a modification to DTW that
performs individual and independent pairwise alignment of feature trajectories.
The modified technique, termed feature trajectory dynamic time warping (FTDTW),
is applied as a similarity measure in the agglomerative hierarchical clustering
of speech segments. Experiments using MFCC and PLP parametrisations extracted
from TIMIT and from the Spoken Arabic Digit Dataset (SADD) show consistent and
statistically significant improvements in the quality of the resulting clusters
in terms of F-measure and normalised mutual information (NMI).
Comment: 10 pages
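The idea can be sketched as follows: classic DTW aligns whole feature vectors jointly, while FTDTW aligns each feature dimension's trajectory independently and then combines the per-dimension distances. The combination by a simple mean below is an illustrative assumption, not necessarily the paper's exact formulation.

```python
import numpy as np

def dtw_distance(x, y):
    """Classic DTW between two 1-D trajectories x and y."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def ftdtw_distance(A, B):
    """Feature-trajectory DTW: align each feature dimension's trajectory
    separately, then average the per-dimension DTW distances.
    A and B are [frames x features] arrays."""
    assert A.shape[1] == B.shape[1]
    return float(np.mean([dtw_distance(A[:, d], B[:, d])
                          for d in range(A.shape[1])]))

# B is a time-warped copy of A, so the distance is zero.
A = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
B = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [2.0, 2.0]])
print(ftdtw_distance(A, B))  # 0.0
```

In the paper this pairwise distance replaces standard DTW as the similarity measure fed to agglomerative hierarchical clustering.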
Fully Unsupervised Small-Vocabulary Speech Recognition Using a Segmental Bayesian Model
Current supervised speech technology relies heavily on transcribed speech and pronunciation dictionaries. In settings where unlabelled speech data alone is available, unsupervised methods are required to discover categorical linguistic structure directly from the audio. We present a novel Bayesian model which segments unlabelled input speech into word-like units, resulting in a complete unsupervised transcription of the speech in terms of discovered word types. In our approach, a potential word segment (of arbitrary length) is embedded in a fixed-dimensional space; the model (implemented as a Gibbs sampler) then builds a whole-word acoustic model in this space while jointly doing segmentation. We report word error rates in a connected digit recognition task by mapping the unsupervised output to ground truth transcriptions. Our model outperforms a previously developed HMM-based system, even when the model is not constrained to discover only the 11 word types present in the data. Index Terms: unsupervised speech processing, word discovery, speech segmentation, unsupervised learning, segmental models
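The embedding step this abstract relies on can be sketched with uniform frame downsampling, one simple way to map an arbitrary-length segment to a fixed-dimensional vector. This is an illustrative choice; the paper's exact embedding scheme is not specified here.

```python
import numpy as np

def downsample_embed(features, n_keep=10):
    """Embed a variable-length [frames x dims] segment into a fixed
    n_keep*dims vector by keeping n_keep frames at uniform intervals
    (a simple, illustrative embedding scheme)."""
    idx = np.linspace(0, len(features) - 1, n_keep).round().astype(int)
    return features[idx].reshape(-1)

rng = np.random.default_rng(2)
short = rng.normal(size=(23, 13))  # 23 frames of 13-d features
long_ = rng.normal(size=(71, 13))  # 71 frames of 13-d features
print(downsample_embed(short).shape, downsample_embed(long_).shape)
```

Because every candidate segment maps to a vector of the same length, the whole-word acoustic model (here, the Gibbs-sampled mixture) can operate in one fixed-dimensional space regardless of segment duration.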