3 research outputs found
Visually grounded cross-lingual keyword spotting in speech
Recent work has considered how images paired with speech can be used as
supervision for building speech systems when transcriptions are not available.
We ask whether visual grounding can be used for cross-lingual keyword spotting:
given a text keyword in one language, the task is to retrieve spoken utterances
containing that keyword in another language. This could enable searching
through speech in a low-resource language using text queries in a high-resource
language. As a proof-of-concept, we use English speech with German queries: we
use a German visual tagger to add keyword labels to each training image, and
then train a neural network to map English speech to German keywords. Without
seeing any parallel speech-transcription pairs or translations, the model achieves a
precision at 10 (P@10) of 58%. We show that most erroneous retrievals contain
equivalent or semantically relevant keywords; excluding these would improve
P@10 to 91%.
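To make the reported retrieval metric concrete, below is a minimal sketch of the precision-at-10 computation; the `precision_at_k` helper, utterance IDs, and scores are hypothetical stand-ins for the model's per-utterance keyword scores and the reference labels, not the paper's evaluation code.

```python
# Minimal sketch of precision-at-k for cross-lingual keyword spotting.
# `scores` maps each spoken utterance ID to the model's relevance score for a
# given query keyword; `relevant` is the set of utterances that truly contain
# (a translation of) that keyword. Both are illustrative assumptions.

def precision_at_k(scores: dict[str, float], relevant: set[str], k: int = 10) -> float:
    """Rank utterances by score and report the fraction of the top k that are relevant."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(utt in relevant for utt in ranked) / k

# Toy usage: one German query keyword scored against five English utterances.
scores = {"utt1": 0.92, "utt2": 0.10, "utt3": 0.75, "utt4": 0.33, "utt5": 0.61}
relevant = {"utt1", "utt3"}
print(precision_at_k(scores, relevant, k=3))  # -> 0.666...
```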
Semantic query-by-example speech search using visual grounding
A number of recent studies have started to investigate how speech systems can
be trained on untranscribed speech by leveraging accompanying images at
training time. Examples of tasks include keyword prediction and within- and
across-mode retrieval. Here we consider how such models can be used for
query-by-example (QbE) search, the task of retrieving utterances relevant to a
given spoken query. We are particularly interested in semantic QbE, where the
task is not only to retrieve utterances containing exact instances of the
query, but also utterances whose meaning is relevant to the query. We follow a
segmental QbE approach where variable-duration speech segments (queries, search
utterances) are mapped to fixed-dimensional embedding vectors. We show that a
QbE system using an embedding function trained on visually grounded speech data
outperforms a purely acoustic QbE system in terms of both exact and semantic
retrieval performance.
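The segmental QbE setup can be illustrated with a short sketch: search utterances are ranked by similarity between fixed-dimensional embeddings of the query and of each utterance. The `embed` function below is a simple mean-pooling placeholder for the acoustic or visually grounded embedding network, and all names and feature shapes are illustrative assumptions.

```python
# Minimal sketch of segmental query-by-example (QbE) search.
import numpy as np

def embed(frames: np.ndarray) -> np.ndarray:
    # Placeholder: mean-pool variable-duration frames into one fixed-dimensional vector.
    # A trained acoustic or visually grounded embedding network would go here.
    return frames.mean(axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def qbe_search(query_frames: np.ndarray, utterances: dict[str, np.ndarray], top_n: int = 5):
    """Rank search utterances by embedding similarity to the spoken query."""
    q = embed(query_frames)
    scored = {utt_id: cosine(q, embed(frames)) for utt_id, frames in utterances.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Toy usage with random 13-dimensional frame sequences of varying lengths.
rng = np.random.default_rng(0)
utterances = {f"utt{i}": rng.normal(size=(rng.integers(50, 200), 13)) for i in range(10)}
query = rng.normal(size=(80, 13))
print(qbe_search(query, utterances))
```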
Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech
In this paper, we present a method for learning discrete linguistic units by
incorporating vector quantization layers into neural models of visually
grounded speech. We show that our method is capable of capturing both
word-level and sub-word units, depending on how it is configured. What
differentiates this paper from prior work on speech unit learning is the choice
of training objective. Rather than using a reconstruction-based loss, we use a
discriminative, multimodal grounding objective which forces the learned units
to be useful for semantic image retrieval. We evaluate the sub-word units on
the ZeroSpeech 2019 challenge, achieving a 27.3% reduction in ABX error rate
over the top-performing submission, while keeping the bitrate approximately the
same. We also present experiments demonstrating the noise robustness of these
units. Finally, we show that a model with multiple quantizers can
simultaneously learn phone-like detectors at a lower layer and word-like
detectors at a higher layer. These detectors are highly accurate, discovering
279 words with an F1 score greater than 0.5.