102 research outputs found
Additional Shared Decoder on Siamese Multi-view Encoders for Learning Acoustic Word Embeddings
Acoustic word embeddings --- fixed-dimensional vector representations of
arbitrary-length words --- have attracted increasing interest in
query-by-example spoken term detection. Recently, on the fact that the
orthography of text labels partly reflects the phonetic similarity between the
words' pronunciation, a multi-view approach has been introduced that jointly
learns acoustic and text embeddings. It showed that it is possible to learn
discriminative embeddings by designing the objective which takes text labels as
well as word segments. In this paper, we propose a network architecture that
expands the multi-view approach by combining the Siamese multi-view encoders
with a shared decoder network to maximize the effect of the relationship
between acoustic and text embeddings in embedding space. Discriminatively
trained with multi-view triplet loss and decoding loss, our proposed approach
achieves better performance on acoustic word discrimination task with the WSJ
dataset, resulting in 11.1% relative improvement in average precision. We also
present experimental results on cross-view word discrimination and word level
speech recognition tasks.Comment: Accepted at 2019 IEEE Automatic Speech Recognition and Understanding
Workshop (ASRU 2019
Improved acoustic word embeddings for zero-resource languages using multilingual transfer
Acoustic word embeddings are fixed-dimensional representations of
variable-length speech segments. Such embeddings can form the basis for speech
search, indexing and discovery systems when conventional speech recognition is
not possible. In zero-resource settings where unlabelled speech is the only
available resource, we need a method that gives robust embeddings on an
arbitrary language. Here we explore multilingual transfer: we train a single
supervised embedding model on labelled data from multiple well-resourced
languages and then apply it to unseen zero-resource languages. We consider
three multilingual recurrent neural network (RNN) models: a classifier trained
on the joint vocabularies of all training languages; a Siamese RNN trained to
discriminate between same and different words from multiple languages; and a
correspondence autoencoder (CAE) RNN trained to reconstruct word pairs. In a
word discrimination task on six target languages, all of these models
outperform state-of-the-art unsupervised models trained on the zero-resource
languages themselves, giving relative improvements of more than 30% in average
precision. When using only a few training languages, the multilingual CAE
performs better, but with more training languages the other multilingual models
perform similarly. Using more training languages is generally beneficial, but
improvements are marginal on some languages. We present probing experiments
which show that the CAE encodes more phonetic, word duration, language identity
and speaker information than the other multilingual models.Comment: 11 pages, 7 figures, 8 tables. arXiv admin note: text overlap with
arXiv:2002.02109. Submitted to the IEEE Transactions on Audio, Speech and
Language Processin
Audio self-supervised learning: a survey
Inspired by the humans' cognitive ability to generalise knowledge and skills,
Self-Supervised Learning (SSL) targets at discovering general representations
from large-scale data without requiring human annotations, which is an
expensive and time consuming task. Its success in the fields of computer vision
and natural language processing have prompted its recent adoption into the
field of audio and speech processing. Comprehensive reviews summarising the
knowledge in audio SSL are currently missing. To fill this gap, in the present
work, we provide an overview of the SSL methods used for audio and speech
processing applications. Herein, we also summarise the empirical works that
exploit the audio modality in multi-modal SSL frameworks, and the existing
suitable benchmarks to evaluate the power of SSL in the computer audition
domain. Finally, we discuss some open problems and point out the future
directions on the development of audio SSL
Neural approaches to spoken content embedding
Comparing spoken segments is a central operation to speech processing.
Traditional approaches in this area have favored frame-level dynamic
programming algorithms, such as dynamic time warping, because they require no
supervision, but they are limited in performance and efficiency. As an
alternative, acoustic word embeddings -- fixed-dimensional vector
representations of variable-length spoken word segments -- have begun to be
considered for such tasks as well. However, the current space of such
discriminative embedding models, training approaches, and their application to
real-world downstream tasks is limited. We start by considering ``single-view"
training losses where the goal is to learn an acoustic word embedding model
that separates same-word and different-word spoken segment pairs. Then, we
consider ``multi-view" contrastive losses. In this setting, acoustic word
embeddings are learned jointly with embeddings of character sequences to
generate acoustically grounded embeddings of written words, or acoustically
grounded word embeddings.
In this thesis, we contribute new discriminative acoustic word embedding
(AWE) and acoustically grounded word embedding (AGWE) approaches based on
recurrent neural networks (RNNs). We improve model training in terms of both
efficiency and performance. We take these developments beyond English to
several low-resource languages and show that multilingual training improves
performance when labeled data is limited. We apply our embedding models, both
monolingual and multilingual, to the downstream tasks of query-by-example
speech search and automatic speech recognition. Finally, we show how our
embedding approaches compare with and complement more recent self-supervised
speech models.Comment: PhD thesi
Pre-training for Speech Translation: CTC Meets Optimal Transport
The gap between speech and text modalities is a major challenge in
speech-to-text translation (ST). Different methods have been proposed to reduce
this gap, but most of them require architectural changes in ST training. In
this work, we propose to mitigate this issue at the pre-training stage,
requiring no change in the ST model. First, we show that the connectionist
temporal classification (CTC) loss can reduce the modality gap by design. We
provide a quantitative comparison with the more common cross-entropy loss,
showing that pre-training with CTC consistently achieves better final ST
accuracy. Nevertheless, CTC is only a partial solution and thus, in our second
contribution, we propose a novel pre-training method combining CTC and optimal
transport to further reduce this gap. Our method pre-trains a Siamese-like
model composed of two encoders, one for acoustic inputs and the other for
textual inputs, such that they produce representations that are close to each
other in the Wasserstein space. Extensive experiments on the standard CoVoST-2
and MuST-C datasets show that our pre-training method applied to the vanilla
encoder-decoder Transformer achieves state-of-the-art performance under the
no-external-data setting, and performs on par with recent strong multi-task
learning systems trained with external data. Finally, our method can also be
applied on top of these multi-task systems, leading to further improvements for
these models. Code and pre-trained models are available at
https://github.com/formiel/fairseq.Comment: ICML 2023 (oral presentation). This version fixed URLs, updated
affiliations & acknowledgements, and improved formattin
- …