Language Transfer of Audio Word2Vec: Learning Audio Segment Representations without Target Language Data
Audio Word2Vec offers vector representations of fixed dimensionality for
variable-length audio segments using Sequence-to-sequence Autoencoder (SA).
These vector representations are shown to describe the sequential phonetic
structures of the audio segments to a good degree, with real world applications
such as query-by-example Spoken Term Detection (STD). This paper examines the
capability of Audio Word2Vec for language transfer. We train an SA on one
language (the source language) and use it to extract vector representations of
audio segments from another language (the target language). We find that the SA
can still capture the phonetic structure of the target-language audio segments
if the source and target languages are similar. In query-by-example STD, we
obtain vector representations from an SA learned from a large amount of source
language data, and find that they surpass representations from a naive encoder
and from an SA learned directly from a small amount of target language data.
The results show that it is possible to learn an Audio Word2Vec model from
high-resource languages and use it on low-resource languages. This further
expands the usability of Audio Word2Vec.
Comment: arXiv admin note: text overlap with arXiv:1603.0098
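To make the core idea concrete, here is a minimal numpy sketch of the encoder half of such a model: a recurrent network consumes a variable-length sequence of acoustic frames and its final hidden state serves as a fixed-dimensional segment embedding. This is only illustrative; the actual SA is a trained sequence-to-sequence autoencoder, whereas the weights and dimensions below are random, hypothetical choices.

```python
import numpy as np

def rnn_encode(frames, W_x, W_h, b):
    """Run a simple tanh RNN over a (T, d) frame sequence and return the
    final hidden state as a fixed-dimensional segment embedding."""
    h = np.zeros(W_h.shape[0])
    for x in frames:
        h = np.tanh(W_x @ x + W_h @ h + b)
    return h

# Toy setup: 13-dim MFCC-like frames, 8-dim embedding, untrained random weights.
rng = np.random.default_rng(0)
d, k = 13, 8
W_x = rng.normal(size=(k, d)) * 0.1
W_h = rng.normal(size=(k, k)) * 0.1
b = np.zeros(k)

short_seg = rng.normal(size=(20, d))  # a 20-frame audio segment
long_seg = rng.normal(size=(55, d))   # a 55-frame audio segment

e1 = rnn_encode(short_seg, W_x, W_h, b)
e2 = rnn_encode(long_seg, W_x, W_h, b)
print(e1.shape, e2.shape)  # both (8,): fixed dimensionality regardless of length
```

The point of the sketch is the interface, not the model: segments of any duration map to vectors of the same dimensionality, which is what makes nearest-neighbour search for query-by-example STD possible.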
Multilingual Acoustic Word Embedding Models for Processing Zero-Resource Languages
Acoustic word embeddings are fixed-dimensional representations of
variable-length speech segments. In settings where unlabelled speech is the
only available resource, such embeddings can be used in "zero-resource" speech
search, indexing and discovery systems. Here we propose to train a single
supervised embedding model on labelled data from multiple well-resourced
languages and then apply it to unseen zero-resource languages. For this
transfer learning approach, we consider two multilingual recurrent neural
network models: a discriminative classifier trained on the joint vocabularies
of all training languages, and a correspondence autoencoder trained to
reconstruct word pairs. We test these using a word discrimination task on six
target zero-resource languages. When trained on seven well-resourced languages,
both models perform similarly and outperform unsupervised models trained on the
zero-resource languages. With just a single training language, the second model
works better, but performance depends more on the particular training--testing
language pair.
Comment: 5 pages, 4 figures, 1 table; accepted to ICASSP 2020. arXiv admin note: text overlap with arXiv:1811.0040
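The word discrimination task used for evaluation can be sketched in a few lines: score every pair of segment embeddings by cosine similarity, then compute the average precision of retrieving same-word pairs from the ranked list. The embeddings below are tiny hand-made toy vectors, not outputs of the models in the paper.

```python
import numpy as np
from itertools import combinations

def average_precision(scores, labels):
    """Rank pairs by similarity (descending) and average the precision at
    each rank where a same-word pair (label 1) is retrieved."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    precisions = np.cumsum(labels) / np.arange(1, len(labels) + 1)
    return float(np.sum(precisions * labels) / np.sum(labels))

# Toy embeddings: two instances each of two word types.
emb = {
    ("water", 0): np.array([1.0, 0.1]),
    ("water", 1): np.array([0.9, 0.2]),
    ("fire", 0): np.array([0.1, 1.0]),
    ("fire", 1): np.array([0.2, 0.8]),
}
scores, labels = [], []
for (k1, e1), (k2, e2) in combinations(emb.items(), 2):
    cos = e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2))
    scores.append(cos)
    labels.append(int(k1[0] == k2[0]))  # 1 iff same word type

ap = average_precision(scores, labels)
print(ap)  # 1.0: both same-word pairs rank above all different-word pairs
```

A good embedding space is one in which instances of the same word land close together regardless of speaker, which is exactly what this metric rewards.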
Improved acoustic word embeddings for zero-resource languages using multilingual transfer
Acoustic word embeddings are fixed-dimensional representations of
variable-length speech segments. Such embeddings can form the basis for speech
search, indexing and discovery systems when conventional speech recognition is
not possible. In zero-resource settings where unlabelled speech is the only
available resource, we need a method that gives robust embeddings on an
arbitrary language. Here we explore multilingual transfer: we train a single
supervised embedding model on labelled data from multiple well-resourced
languages and then apply it to unseen zero-resource languages. We consider
three multilingual recurrent neural network (RNN) models: a classifier trained
on the joint vocabularies of all training languages; a Siamese RNN trained to
discriminate between same and different words from multiple languages; and a
correspondence autoencoder (CAE) RNN trained to reconstruct word pairs. In a
word discrimination task on six target languages, all of these models
outperform state-of-the-art unsupervised models trained on the zero-resource
languages themselves, giving relative improvements of more than 30% in average
precision. When using only a few training languages, the multilingual CAE
performs better, but with more training languages the other multilingual models
perform similarly. Using more training languages is generally beneficial, but
improvements are marginal on some languages. We present probing experiments
which show that the CAE encodes more phonetic, word duration, language identity
and speaker information than the other multilingual models.
Comment: 11 pages, 7 figures, 8 tables. arXiv admin note: text overlap with arXiv:2002.02109. Submitted to the IEEE Transactions on Audio, Speech and Language Processing
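The correspondence autoencoder's training data can be illustrated with a small pair-construction sketch: group the labelled segments by word type and emit every ordered pair of distinct instances, so the model sees one instance as input and must reconstruct the other. Grouping here is per language and per word; the feature placeholders and function name are hypothetical.

```python
from collections import defaultdict
from itertools import permutations

def cae_pairs(segments):
    """Given (language, word, features) records, emit every ordered pair of
    distinct instances of the same word within a language: the CAE takes one
    instance as input and is trained to reconstruct the other."""
    by_type = defaultdict(list)
    for lang, word, feats in segments:
        by_type[(lang, word)].append(feats)
    pairs = []
    for instances in by_type.values():
        pairs.extend(permutations(instances, 2))
    return pairs

segments = [
    ("es", "agua", "agua_feats_1"), ("es", "agua", "agua_feats_2"),
    ("es", "fuego", "fuego_feats_1"),
    ("pt", "fogo", "fogo_feats_1"),
]
pairs = cae_pairs(segments)
print(pairs)  # two ordered pairs, both from the Spanish "agua" instances
```

Words with a single instance contribute no pairs, which is why this objective benefits from well-resourced training languages with many repeated word tokens.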
Seeing wake words: Audio-visual Keyword Spotting
The goal of this work is to automatically determine whether and when a word
of interest is spoken by a talking face, with or without the audio. We propose
a zero-shot method suitable for in-the-wild videos. Our key contributions are:
(1) a novel convolutional architecture, KWS-Net, that uses a similarity map
intermediate representation to separate the task into (i) sequence matching,
and (ii) pattern detection, to decide whether the word is there and when; (2)
we demonstrate that if audio is available, visual keyword spotting improves the
performance both for a clean and noisy audio signal. Finally, (3) we show that
our method generalises to other languages, specifically French and German, and
achieves a comparable performance to English with less language specific data,
by fine-tuning the network pre-trained on English. The method exceeds the
performance of the previous state-of-the-art visual keyword spotting
architecture when trained and tested on the same benchmark, and also that of a
state-of-the-art lip reading method.
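The similarity-map intermediate representation of contribution (1) can be sketched as follows: embed each unit of the keyword and each video frame, then compute a cosine-similarity matrix; a high-valued roughly diagonal band suggests the word is present, and its column position indicates when. This toy uses random, untrained embeddings; KWS-Net itself learns both the embeddings and the convolutional pattern detector.

```python
import numpy as np

def similarity_map(keyword_emb, frame_emb):
    """Cosine similarity between each keyword unit embedding (K, d) and each
    video frame embedding (T, d), giving a (K, T) similarity map."""
    kn = keyword_emb / np.linalg.norm(keyword_emb, axis=1, keepdims=True)
    fn = frame_emb / np.linalg.norm(frame_emb, axis=1, keepdims=True)
    return kn @ fn.T

rng = np.random.default_rng(1)
kw = rng.normal(size=(5, 16))      # 5 keyword units, 16-dim embeddings
frames = rng.normal(size=(40, 16)) # 40 video frames, 16-dim embeddings
m = similarity_map(kw, frames)
print(m.shape)  # (5, 40): one row per keyword unit, one column per frame
```

Factoring the task this way separates sequence matching (building the map) from pattern detection (scanning the map), which is what lets the method generalise zero-shot to keywords unseen in training.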
A segmental framework for fully-unsupervised large-vocabulary speech recognition
Zero-resource speech technology is a growing research area that aims to
develop methods for speech processing in the absence of transcriptions,
lexicons, or language modelling text. Early term discovery systems focused on
identifying isolated recurring patterns in a corpus, while more recent
full-coverage systems attempt to completely segment and cluster the audio into
word-like units---effectively performing unsupervised speech recognition. This
article presents the first attempt we are aware of to apply such a system to
large-vocabulary multi-speaker data. Our system uses a Bayesian modelling
framework with segmental word representations: each word segment is represented
as a fixed-dimensional acoustic embedding obtained by mapping the sequence of
feature frames to a single embedding vector. We compare our system on English
and Xitsonga datasets to state-of-the-art baselines, using a variety of
measures including word error rate (obtained by mapping the unsupervised output
to ground truth transcriptions). Very high word error rates are reported---on
the order of 70--80% for speaker-dependent and 80--95% for speaker-independent
systems---highlighting the difficulty of this task. Nevertheless, in terms of
cluster quality and word segmentation metrics, we show that by imposing a
consistent top-down segmentation while also using bottom-up knowledge from
detected syllable boundaries, both single-speaker and multi-speaker versions of
our system outperform a purely bottom-up single-speaker syllable-based
approach. We also show that the discovered clusters can be made less speaker-
and gender-specific by using an unsupervised autoencoder-like feature extractor
to learn better frame-level features (prior to embedding). Our system's
discovered clusters are still less pure than those of unsupervised term
discovery systems, but provide far greater coverage.
Comment: 15 pages, 6 figures, 8 tables
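One simple instance of the segmental word representation described above is a downsampling embedding: keep a fixed number of evenly spaced frames from the segment and flatten them into a single vector. This is a sketch of a basic embedding function only, not the full Bayesian segmentation and clustering system; the parameter names are illustrative.

```python
import numpy as np

def downsample_embed(frames, n=10):
    """Map a variable-length (T, d) frame sequence to a fixed n*d vector by
    keeping n uniformly spaced frames and flattening them."""
    frames = np.asarray(frames)
    idx = np.linspace(0, len(frames) - 1, n).round().astype(int)
    return frames[idx].ravel()

d = 13  # e.g. MFCC feature dimensionality
e1 = downsample_embed(np.random.default_rng(2).normal(size=(37, d)))
e2 = downsample_embed(np.random.default_rng(3).normal(size=(120, d)))
print(e1.shape, e2.shape)  # both (130,): fixed size for any segment length
```

Because every hypothesised word segment maps to a vector of the same size, the Bayesian model can cluster segments directly in this embedding space instead of aligning variable-length frame sequences.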
Improving the translation environment for professional translators
When using computer-aided translation systems in a typical professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view and from a purely technological side.
This paper describes the SCATE research on improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.
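Fuzzy matching, the first topic listed, can be illustrated with a minimal sketch: score a new source sentence against a translation memory entry as one minus the word-level edit distance normalised by the longer sentence's length. This is an assumed, simplified scoring scheme for illustration; production CAT tools and the SCATE work use more elaborate matching metrics.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_match(query, tm_entry):
    """Word-level fuzzy match score in [0, 1]; 1.0 is an exact match."""
    q, t = query.split(), tm_entry.split()
    return 1.0 - edit_distance(q, t) / max(len(q), len(t))

score = fuzzy_match("open the file menu", "open the edit menu")
print(score)  # 0.75: one of four words differs
```

A translator would only be offered memory entries scoring above some threshold, since below that, editing the fuzzy match costs more effort than translating from scratch.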