Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "Speaking Rosetta" JSALT 2017 Workshop
We summarize the accomplishments of a multi-disciplinary workshop exploring
the computational and scientific issues surrounding the discovery of linguistic
units (subwords and words) in a language without orthography. We study the
replacement of orthographic transcriptions by images and/or translated text in
a well-resourced language to help unsupervised discovery from raw speech.
Comment: Accepted to ICASSP 201
Word Discovery in Visually Grounded, Self-Supervised Speech Models
We present a method for visually-grounded spoken term discovery. After
training either a HuBERT or wav2vec2.0 model to associate spoken captions with
natural images, we show that powerful word segmentation and clustering
capability emerges within the model's self-attention heads. Our experiments
reveal that this ability is not present to nearly the same extent in the base
HuBERT and wav2vec2.0 models, suggesting that the visual grounding task is a
crucial component of the word discovery capability we observe. We also evaluate
our method on the Buckeye word segmentation and ZeroSpeech spoken term
discovery tasks, where we outperform all currently published methods on several
metrics.
Comment: submitted to Interspeech 202
Keyword localisation in untranscribed speech using visually grounded speech models
Keyword localisation is the task of finding where in a speech utterance a
given query keyword occurs. We investigate to what extent keyword localisation
is possible using a visually grounded speech (VGS) model. VGS models are
trained on unlabelled images paired with spoken captions. These models are
therefore self-supervised -- trained without any explicit textual label or
location information. To obtain training targets, we first tag training images
with soft text labels using a pretrained visual classifier with a fixed
vocabulary. This enables a VGS model to predict the presence of a written
keyword in an utterance, but not its location. We consider four ways to equip
VGS models with localisation capabilities. Two of these -- a saliency approach
and input masking -- can be applied to an arbitrary prediction model after
training, while the other two -- attention and a score aggregation approach --
are incorporated directly into the structure of the model. Masking-based
localisation gives some of the best reported localisation scores from a VGS
model, with an accuracy of 57% when the system knows that a keyword occurs in
an utterance and needs to predict its location. In a setting where localisation
is performed after detection, an F1 of 25% is achieved, and in a setting
where a keyword spotting ranking pass is first performed, we get a localisation
P@10 of 32%. While these scores are modest compared to the idealised setting
with unordered bag-of-words supervision (from transcriptions), these models do
not receive any textual or location supervision. Further analyses show that
these models are limited by the first detection or ranking pass. Moreover,
individual keyword localisation performance is correlated with the tagging
performance from the visual classifier. We also show qualitatively how and
where semantic mistakes occur, e.g. that the model locates surfer when queried
with ocean.
Comment: 10 figures, 5 tables
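The input-masking idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `localise_by_masking`, the `window` and `hop` parameters, the zero-valued mask, and the toy detector are all assumptions made for illustration. The sketch assumes only that some trained detector maps a sequence of feature frames to a keyword score; it masks one window of frames at a time and reports the window whose removal lowers the score the most.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]


def localise_by_masking(
    frames: List[Frame],
    detector: Callable[[List[Frame]], float],
    window: int = 3,
    hop: int = 1,
) -> int:
    """Return the centre index of the window whose masking hurts the score most.

    `detector` is a hypothetical stand-in for a trained keyword detector that
    returns a score for one fixed keyword given the full frame sequence.
    """
    if not frames:
        raise ValueError("frames must be non-empty")
    window = min(window, len(frames))
    base_score = detector(frames)
    zero_frame = [0.0] * len(frames[0])  # assumption: masking means zeroing frames

    best_start, best_drop = 0, float("-inf")
    for start in range(0, len(frames) - window + 1, hop):
        masked = list(frames)
        masked[start:start + window] = [zero_frame] * window
        drop = base_score - detector(masked)  # large drop = window mattered
        if drop > best_drop:
            best_start, best_drop = start, drop
    return best_start + window // 2


def toy_detector(frames: List[Frame]) -> float:
    """Toy detector (illustrative only): score is the peak first-dimension value."""
    return max(frame[0] for frame in frames)


toy_frames = [[0.1], [0.2], [0.9], [0.1], [0.0]]
predicted_frame = localise_by_masking(toy_frames, toy_detector, window=1)
```

Because the procedure only queries the detector's output, it can be applied to an arbitrary prediction model after training, which matches how the abstract characterises the masking and saliency approaches.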