25 research outputs found
Multilingual Acoustic Word Embedding Models for Processing Zero-Resource Languages
Acoustic word embeddings are fixed-dimensional representations of
variable-length speech segments. In settings where unlabelled speech is the
only available resource, such embeddings can be used in "zero-resource" speech
search, indexing and discovery systems. Here we propose to train a single
supervised embedding model on labelled data from multiple well-resourced
languages and then apply it to unseen zero-resource languages. For this
transfer learning approach, we consider two multilingual recurrent neural
network models: a discriminative classifier trained on the joint vocabularies
of all training languages, and a correspondence autoencoder trained to
reconstruct word pairs. We test these using a word discrimination task on six
target zero-resource languages. When trained on seven well-resourced languages,
both models perform similarly and outperform unsupervised models trained on the
zero-resource languages. With just a single training language, the second model
works better, but performance depends more on the particular training-testing
language pair.

Comment: 5 pages, 4 figures, 1 table; accepted to ICASSP 2020. arXiv admin note: text overlap with arXiv:1811.0040
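The word discrimination task used to evaluate such embeddings can be sketched as a same-different test: rank all pairs of segment embeddings by cosine similarity and compute average precision, where a pair counts as positive when both segments are tokens of the same word type. A minimal illustration (function name and the toy embeddings are ours, not from the paper):

```python
import numpy as np

def word_discrimination_ap(embeddings, labels):
    """Same-different word discrimination: rank all segment pairs by cosine
    similarity and compute average precision, where a pair is positive when
    both segments are instances of the same word type."""
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-length rows
    n = len(labels)
    sims, pos = [], []
    for i in range(n):
        for j in range(i + 1, n):
            sims.append(float(E[i] @ E[j]))           # cosine similarity
            pos.append(labels[i] == labels[j])        # same word type?
    order = np.argsort(sims)[::-1]                    # most similar pairs first
    pos = np.asarray(pos)[order]
    hits = np.cumsum(pos)
    precision = hits / (np.arange(len(pos)) + 1)
    return float(np.sum(precision * pos) / max(pos.sum(), 1))
```

With embeddings that perfectly separate word types, the average precision is 1.0; noisier embeddings rank some different-word pairs above same-word pairs and score lower.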
Multilingual and Unsupervised Subword Modeling for Zero-Resource Languages
Subword modeling for zero-resource languages aims to learn low-level
representations of speech audio without using transcriptions or other resources
from the target language (such as text corpora or pronunciation dictionaries).
A good representation should capture phonetic content and abstract away from
other types of variability, such as speaker differences and channel noise.
Previous work in this area has primarily focused on unsupervised learning from
target language data only, and has been evaluated only intrinsically. Here we
directly compare multiple methods, including some that use only target language
speech data and some that use transcribed speech from other (non-target)
languages, and we evaluate using two intrinsic measures as well as on a
downstream unsupervised word segmentation and clustering task. We find that
combining two existing target-language-only methods yields better features than
either method alone. Nevertheless, even better results are obtained by
extracting target language bottleneck features using a model trained on other
languages. Cross-lingual training using just one other language is enough to
provide this benefit, but multilingual training helps even more. In addition to
these results, which hold across both intrinsic measures and the extrinsic
task, we discuss the qualitative differences between the different types of
learned features.

Comment: 17 pages, 6 figures, 7 tables. Accepted for publication in Computer Speech and Language. arXiv admin note: text overlap with arXiv:1803.0886
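The bottleneck-feature idea referred to above can be illustrated in a few lines: train a classifier on transcribed speech from other languages, then discard its output layer and keep the narrow hidden layer as the representation for target-language audio. This toy sketch uses random weights and illustrative dimensions (120-dim input frames, a 39-dim bottleneck, 500 pooled phone classes); it shows only the extraction step, not the training:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative multilingual classifier: input frame -> narrow bottleneck
# -> softmax over the pooled phone labels of the training languages.
# Dimensions and weights here are placeholders, not the paper's.
W_in = rng.standard_normal((120, 39)) * 0.1    # input -> 39-dim bottleneck
W_out = rng.standard_normal((39, 500)) * 0.1   # bottleneck -> phone posteriors

def bottleneck_features(frames):
    """Drop the output layer and keep the bottleneck activations as the
    (hopefully language-independent) representation of each input frame."""
    return np.tanh(frames @ W_in)

frames = rng.standard_normal((10, 120))        # 10 fake 120-dim speech frames
feats = bottleneck_features(frames)            # -> 10 x 39 feature matrix
```

The target language never contributes labels; only its audio is pushed through the network trained on the other languages.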
Cross-Lingual Topic Prediction for Speech Using Translations
Given a large amount of unannotated speech in a low-resource language, can we
classify the speech utterances by topic? We consider this question in the
setting where a small amount of speech in the low-resource language is paired
with text translations in a high-resource language. We develop an effective
cross-lingual topic classifier by training on just 20 hours of translated
speech, using a recent model for direct speech-to-text translation. While the
translations are poor, they are still good enough to correctly classify the
topic of 1-minute speech segments over 70% of the time, a 20% improvement over
a majority-class baseline. Such a system could be useful for humanitarian
applications like crisis response, where incoming speech in a foreign
low-resource language must be quickly assessed for further action.

Comment: Accepted to ICASSP 202
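The downstream step — classifying the topic of a noisy automatic translation — can be sketched with a simple keyword-scoring classifier. The topic labels and keyword lists below are illustrative inventions, not the paper's; the point is only that even error-ridden translations often retain enough topical vocabulary to classify:

```python
# Toy topic classifier over noisy translations: score each candidate topic
# by how many of its keywords appear in the translated text.
# Topic names and keyword sets are illustrative, not from the paper.
TOPIC_KEYWORDS = {
    "health": {"doctor", "medicine", "sick", "hospital"},
    "weather": {"rain", "storm", "flood", "wind"},
}

def classify_topic(translation):
    """Return the topic whose keyword set best overlaps the translation."""
    words = translation.lower().split()
    scores = {t: sum(w in kw for w in words)
              for t, kw in TOPIC_KEYWORDS.items()}
    return max(scores, key=scores.get)
```

A garbled translation like "the storm bring rain flood to village" still hits three weather keywords, which is why topic classification can succeed even when translation quality is poor.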
Unsupervised Pre-Training for Voice Activation
The problem of voice activation is to find a pre-defined word in an audio stream. Solutions such as the “Ok, Google” keyword spotter for Android devices or the “Alexa” keyword spotter for Amazon devices use tens of thousands to millions of keyword examples in training. In this paper, we explore the possibility of using pre-trained audio features to build voice activation with a small number of keyword examples. The contribution of this article consists of two parts. First, we investigate how the quality of the voice activation system depends on the number of training examples for English and Russian, and show that using pre-trained audio features such as wav2vec increases the accuracy of the system by up to 10% when only seven examples are available per keyword during training. At the same time, the benefit of such features diminishes and eventually disappears as the dataset size increases. Second, we prepare and release for general use a dataset for training and testing voice activation for the Lithuanian language, and provide training results on this dataset.

This article belongs to the Section Computing and Artificial Intelligence.
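One common way to build a spotter from only a handful of examples per keyword, once a pre-trained encoder such as wav2vec supplies fixed-dimensional embeddings, is nearest-prototype matching: average the few example embeddings per keyword and fire when an incoming embedding is cosine-similar enough to a prototype. A minimal sketch under that assumption (the prototype approach and all names here are ours; the paper does not specify this classifier):

```python
import numpy as np

def keyword_prototypes(examples):
    """Average the pre-trained (e.g. wav2vec-style) embeddings of the few
    examples available for each keyword into one prototype per keyword."""
    return {kw: np.mean(vecs, axis=0) for kw, vecs in examples.items()}

def detect(prototypes, embedding, threshold=0.8):
    """Return the best-matching keyword if cosine similarity clears the
    threshold, else None (no keyword present)."""
    best, best_sim = None, threshold
    for kw, proto in prototypes.items():
        sim = float(embedding @ proto /
                    (np.linalg.norm(embedding) * np.linalg.norm(proto)))
        if sim > best_sim:
            best, best_sim = kw, sim
    return best
```

With seven examples per keyword, the prototypes are noisy, which is where stronger pre-trained features pay off; with thousands of examples, a directly trained classifier catches up, matching the diminishing benefit the abstract reports.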
A Study of Low-Resource Speech Commands Recognition based on Adversarial Reprogramming
In this study, we propose a novel adversarial reprogramming (AR) approach for
low-resource spoken command recognition (SCR), and build an AR-SCR system. The
AR procedure aims to modify the acoustic signals (from the target domain) to
repurpose a pretrained SCR model (from the source domain). To solve the label
mismatches between source and target domains, and further improve the stability
of AR, we propose a novel similarity-based label mapping technique to align
classes. In addition, the transfer learning (TL) technique is combined with the
original AR process to improve the model adaptation capability. We evaluate the
proposed AR-SCR system on three low-resource SCR datasets, including Arabic,
Lithuanian, and dysarthric Mandarin speech. Experimental results show that,
with an acoustic model (AM) pretrained on a large-scale English dataset, the
proposed AR-SCR system outperforms the current state-of-the-art results on the
Arabic and Lithuanian speech commands datasets with only a limited amount of
training data.

Comment: Submitted to ICASSP 202
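The similarity-based label mapping can be illustrated concretely: represent each source and target class by an embedding (for instance, the class-mean acoustic embedding) and assign every target class the most cosine-similar source class, so the pretrained model's output units can be reused for the new labels. The function below is a hedged sketch of that alignment step, not the paper's implementation:

```python
import numpy as np

def similarity_label_mapping(source_embs, target_embs):
    """For each target class, pick the index of the source class whose
    embedding is most cosine-similar, so the pretrained model's output
    units can be repurposed for the target label set."""
    S = source_embs / np.linalg.norm(source_embs, axis=1, keepdims=True)
    T = target_embs / np.linalg.norm(target_embs, axis=1, keepdims=True)
    return np.argmax(T @ S.T, axis=1)   # best source class per target class
```

After this alignment, the adversarial reprogramming step only has to learn an additive perturbation of the input signal; the output side is handled by the fixed mapping.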