Acoustic Word Embeddings for Zero-Resource Languages Using Self-Supervised Contrastive Learning and Multilingual Adaptation
Acoustic word embeddings (AWEs) are fixed-dimensional representations of
variable-length speech segments. For zero-resource languages where labelled
data is not available, one AWE approach is to use unsupervised
autoencoder-based recurrent models. Another recent approach is to use
multilingual transfer: a supervised AWE model is trained on several
well-resourced languages and then applied to an unseen zero-resource language.
We consider how a recent contrastive learning loss can be used in both the
purely unsupervised and multilingual transfer settings. Firstly, we show that
terms from an unsupervised term discovery system can be used for contrastive
self-supervision, resulting in improvements over previous unsupervised
monolingual AWE models. Secondly, we consider how multilingual AWE models can
be adapted to a specific zero-resource language using discovered terms. We find
that self-supervised contrastive adaptation outperforms adapted multilingual
correspondence autoencoder and Siamese AWE models, giving the best overall
results in a word discrimination task on six zero-resource languages.

Comment: Accepted to SLT 202
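To make the contrastive self-supervision step concrete, below is a minimal sketch of an NT-Xent-style loss over pairs of segments matched by an unsupervised term discovery system. The function name, batching scheme, and temperature value are illustrative assumptions rather than the paper's exact formulation.

import torch
import torch.nn.functional as F

def contrastive_awe_loss(anchor_emb, positive_emb, temperature=0.1):
    # anchor_emb, positive_emb: (B, D) acoustic word embeddings where
    # row i of each tensor embeds a different spoken instance of the
    # same discovered term (a positive pair).
    a = F.normalize(anchor_emb, dim=1)
    p = F.normalize(positive_emb, dim=1)
    logits = a @ p.t() / temperature                    # cosine similarities
    targets = torch.arange(a.size(0), device=a.device)  # positives on the diagonal
    # Each anchor must identify its own positive against all other
    # segments in the batch, which serve as negatives.
    return F.cross_entropy(logits, targets)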
Additional Shared Decoder on Siamese Multi-view Encoders for Learning Acoustic Word Embeddings
Acoustic word embeddings, fixed-dimensional vector representations of
arbitrary-length words, have attracted increasing interest in
query-by-example spoken term detection. Recently, based on the observation
that the orthography of text labels partly reflects the phonetic similarity
between words' pronunciations, a multi-view approach has been introduced that
jointly learns acoustic and text embeddings. It showed that discriminative
embeddings can be learned by designing an objective that takes both text
labels and word segments as input. In this paper, we propose a network
architecture that
expands the multi-view approach by combining the Siamese multi-view encoders
with a shared decoder network to maximize the effect of the relationship
between acoustic and text embeddings in the embedding space. Trained
discriminatively with a multi-view triplet loss and a decoding loss, our
proposed approach achieves better performance on the acoustic word
discrimination task with the WSJ dataset, yielding an 11.1% relative
improvement in average precision. We also
present experimental results on cross-view word discrimination and word-level
speech recognition tasks.

Comment: Accepted at the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019)
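A rough sketch of the cross-view triplet objective is given below; the margin value, batch construction, and the way the decoding loss would be weighted are assumptions for illustration, not the exact recipe from the paper.

import torch
import torch.nn.functional as F

def multiview_triplet_loss(acoustic_emb, text_emb, margin=0.4):
    # acoustic_emb: (B, D) embeddings of spoken word segments
    # text_emb:     (B, D) embeddings of the matching text labels
    a = F.normalize(acoustic_emb, dim=1)
    t = F.normalize(text_emb, dim=1)
    sim = a @ t.t()                   # (B, B) cross-view cosine similarities
    pos = sim.diag()                  # matched acoustic/text pairs
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # Hinge: every mismatched pair should be at least `margin` less
    # similar than the matched pair, in both view directions.
    loss_a2t = F.relu(margin - pos.unsqueeze(1) + sim)[off_diag].mean()
    loss_t2a = F.relu(margin - pos.unsqueeze(0) + sim)[off_diag].mean()
    return loss_a2t + loss_t2a

# The full objective would add a reconstruction term from the shared
# decoder, e.g. loss = multiview_triplet_loss(a_emb, t_emb)
#                      + lam * F.mse_loss(decoded_features, input_features)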
Evaluating the reliability of acoustic speech embeddings
Speech embeddings are fixed-size acoustic representations of variable-length speech sequences. They are increasingly used for a variety of tasks ranging from information retrieval to unsupervised term discovery and speech segmentation. However, there is currently no clear methodology to compare or optimize the quality of these embeddings in a task-neutral way. Here, we systematically compare two popular metrics, ABX discrimination and Mean Average Precision (MAP), on 5 languages across 17 embedding methods, ranging from supervised to fully unsupervised, and using different loss functions (autoencoders, correspondence autoencoders, Siamese). Then we use ABX and MAP to predict performance on a new downstream task: the unsupervised estimation of the frequencies of speech segments in a given corpus. We find that, overall, ABX and MAP correlate with one another and with frequency estimation. However, substantial discrepancies appear in the fine-grained distinctions across languages and/or embedding methods. This makes it unrealistic at present to propose a task-independent silver-bullet method for computing the intrinsic quality of speech embeddings. There is a need for more detailed analysis of the metrics currently used to evaluate such embeddings.
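The same-different word discrimination measure behind MAP can be sketched in a few lines. The helper below is an illustrative assumption (function name, cosine distance, scoring with scikit-learn); the actual ABX and MAP protocols compared in the paper are more involved.

import numpy as np
from scipy.spatial.distance import pdist
from sklearn.metrics import average_precision_score

def same_different_ap(embeddings, labels):
    # embeddings: (N, D) array of fixed-size speech embeddings
    # labels:     length-N array of word labels, one per segment
    labels = np.asarray(labels)
    i, j = np.triu_indices(len(labels), k=1)
    same_word = labels[i] == labels[j]          # ground truth for each pair
    dists = pdist(embeddings, metric="cosine")  # same pair ordering as triu
    # Smaller distance should mean "same word", so negate for ranking.
    return average_precision_score(same_word, -dists)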