Visually grounded learning of keyword prediction from untranscribed speech
During language acquisition, infants have the benefit of visual cues to
ground spoken language. Robots similarly have access to audio and visual
sensors. Recent work has shown that images and spoken captions can be mapped
into a meaningful common space, allowing images to be retrieved using speech
and vice versa. In this setting of images paired with untranscribed spoken
captions, we consider whether computer vision systems can be used to obtain
textual labels for the speech. Concretely, we use an image-to-words multi-label
visual classifier to tag images with soft textual labels, and then train a
neural network to map from the speech to these soft targets. We show that the
resulting speech system is able to predict which words occur in an
utterance---acting as a spoken bag-of-words classifier---without seeing any
parallel speech and text. We find that the model often confuses semantically
related words, e.g. "man" and "person", making it even more effective as a
semantic keyword spotter.
Comment: 5 pages, 3 figures, 5 tables; small updates, added link to code; accepted to Interspeech 2017
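The tag-then-train setup described above can be sketched in a few lines. The loss below is an assumption for illustration (a per-word sigmoid cross-entropy against the visual tagger's soft scores, a common choice for multi-label soft targets), not necessarily the paper's exact objective, and the tag values are hypothetical:

```python
import numpy as np

def soft_label_loss(speech_logits, soft_targets):
    """Multi-label cross-entropy: the speech network's per-word sigmoid
    outputs are pushed towards the visual tagger's soft scores, so no
    parallel speech and text is ever seen."""
    probs = 1.0 / (1.0 + np.exp(-speech_logits))  # per-word sigmoid
    eps = 1e-9  # guard against log(0)
    return float(-np.mean(soft_targets * np.log(probs + eps)
                          + (1.0 - soft_targets) * np.log(1.0 - probs + eps)))

# Hypothetical soft tags from the visual classifier for two vocabulary
# words, e.g. P("man" | image) = 0.9 and P("dog" | image) = 0.1.
visual_tags = np.array([0.9, 0.1])

# Logits that agree with the tags give a lower loss than logits that don't.
loss_matching = soft_label_loss(np.array([3.0, -3.0]), visual_tags)
loss_mismatch = soft_label_loss(np.array([-3.0, 3.0]), visual_tags)
```

At test time the trained speech network's per-word outputs can be thresholded to act as the spoken bag-of-words classifier the abstract describes.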
Semantic sentence similarity: size does not always matter
This study addresses the question whether visually grounded speech
recognition (VGS) models learn to capture sentence semantics without access to
any prior linguistic knowledge. We produce synthetic and natural spoken
versions of a well-known semantic textual similarity database and show that our
VGS model produces embeddings that correlate well with human semantic
similarity judgements. Our results show that a model trained on a small
image-caption database outperforms two models trained on much larger databases,
indicating that database size is not all that matters. We also investigate the
importance of having multiple captions per image and find that this is indeed
helpful even if the total number of images is lower, suggesting that
paraphrasing is a valuable learning signal. While the general trend in the
field is to create ever larger datasets to train models on, our findings
indicate that other characteristics of the database can be just as important.
Comment: This paper has been accepted at Interspeech 2021, where it will be presented and appear in the conference proceedings in September 2021
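The evaluation described above, correlating model similarities with human judgements, can be sketched as follows. This is a minimal illustration using cosine similarity of sentence embeddings and Pearson correlation (STS evaluations often report Spearman instead); the embeddings and human scores below are hypothetical:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_correlation(embedding_pairs, human_scores):
    """Correlate the model's cosine similarities for sentence pairs with
    human semantic similarity judgements (Pearson, for simplicity)."""
    model_sims = [cosine(a, b) for a, b in embedding_pairs]
    return float(np.corrcoef(model_sims, human_scores)[0, 1])

# Hypothetical 2-D sentence embeddings for three sentence pairs, ordered
# from near-identical to unrelated, with matching human scores (0-5 scale).
pairs = [(np.array([1.0, 0.0]), np.array([1.0, 0.0])),
         (np.array([1.0, 0.0]), np.array([1.0, 1.0])),
         (np.array([1.0, 0.0]), np.array([0.0, 1.0]))]
human = [5.0, 3.0, 0.0]
r = similarity_correlation(pairs, human)
```

A high correlation here is what the abstract means by embeddings that "correlate well with human semantic similarity judgements".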