60 research outputs found
Visually grounded learning of keyword prediction from untranscribed speech
During language acquisition, infants have the benefit of visual cues to
ground spoken language. Robots similarly have access to audio and visual
sensors. Recent work has shown that images and spoken captions can be mapped
into a meaningful common space, allowing images to be retrieved using speech
and vice versa. In this setting of images paired with untranscribed spoken
captions, we consider whether computer vision systems can be used to obtain
textual labels for the speech. Concretely, we use an image-to-words multi-label
visual classifier to tag images with soft textual labels, and then train a
neural network to map from the speech to these soft targets. We show that the
resulting speech system is able to predict which words occur in an
utterance---acting as a spoken bag-of-words classifier---without seeing any
parallel speech and text. We find that the model often confuses semantically
related words, e.g. "man" and "person", making it even more effective as a
semantic keyword spotter.Comment: 5 pages, 3 figures, 5 tables; small updates, added link to code;
accepted to Interspeech 201
Symbolic inductive bias for visually grounded learning of spoken language
A widespread approach to processing spoken language is to first automatically
transcribe it into text. An alternative is to use an end-to-end approach:
recent works have proposed to learn semantic embeddings of spoken language from
images with spoken captions, without an intermediate transcription step. We
propose to use multitask learning to exploit existing transcribed speech within
the end-to-end setting. We describe a three-task architecture which combines
the objectives of matching spoken captions with corresponding images, speech
with text, and text with images. We show that the addition of the speech/text
task leads to substantial performance improvements on image retrieval when
compared to training the speech/image task in isolation. We conjecture that
this is due to a strong inductive bias transcribed speech provides to the
model, and offer supporting evidence for this.Comment: ACL 201
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Mode
In this paper, we show that representations capturing syllabic units emerge
when training a self-supervised speech model with a visually-grounded training
objective. We demonstrate that a nearly identical model architecture (HuBERT)
trained with a masked language modeling loss does not exhibit this same
ability, suggesting that the visual grounding objective is responsible for the
emergence of this phenomenon. We propose the use of a minimum cut algorithm to
automatically predict syllable boundaries in speech, followed by a 2-stage
clustering method to group identical syllables together. We show that our model
not only outperforms a state-of-the-art syllabic segmentation method on the
language it was trained on (English), but also generalizes in a zero-shot
fashion to Estonian. Finally, we show that the same model is capable of
zero-shot generalization for a word segmentation task on 4 other languages from
the Zerospeech Challenge, in some cases beating the previous state-of-the-art.Comment: Interspeech 2023. Code & Model:
https://github.com/jasonppy/syllable-discover
- …