An embedded segmental K-means model for unsupervised segmentation and clustering of speech
Unsupervised segmentation and clustering of unlabelled speech are core
problems in zero-resource speech processing. Most approaches lie at
methodological extremes: some use probabilistic Bayesian models with
convergence guarantees, while others opt for more efficient heuristic
techniques. Despite competitive performance in previous work, the full Bayesian
approach is difficult to scale to large speech corpora. We introduce an
approximation to a recent Bayesian model that still has a clear objective
function but improves efficiency by using hard clustering and segmentation
rather than full Bayesian inference. Like its Bayesian counterpart, this
embedded segmental K-means model (ES-KMeans) represents arbitrary-length word
segments as fixed-dimensional acoustic word embeddings. We first compare
ES-KMeans to previous approaches on common English and Xitsonga data sets (5
and 2.5 hours of speech): ES-KMeans outperforms a leading heuristic method in
word segmentation, giving similar scores to the Bayesian model while being 5
times faster with fewer hyperparameters. However, its clusters are less pure
than those of the other models. We then show that ES-KMeans scales to larger
corpora by applying it to the 5 languages of the Zero Resource Speech Challenge
2017 (up to 45 hours), where it performs competitively compared to the
challenge baseline.
Comment: 8 pages, 3 figures, 3 tables; accepted to ASRU 2017
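For intuition, here is a minimal sketch (not the authors' code) of the two ingredients the abstract names: a fixed-dimensional acoustic word embedding for arbitrary-length segments, here simple uniform downsampling, and a hard dynamic-programming segmentation step that scores candidate segments by the distance of their embeddings to the nearest K-means centroid. The function names, the embedding choice, and the length limits are all illustrative assumptions.

```python
# Illustrative sketch of the embedded segmental K-means idea; the
# downsampling embedding and all hyperparameters are assumptions.
import numpy as np

def embed(segment, n_samples=10):
    """Fixed-dimensional embedding of a variable-length segment
    (frames x dims) by uniform downsampling and flattening."""
    idx = np.linspace(0, len(segment) - 1, n_samples).astype(int)
    return segment[idx].ravel()

def segment_utterance(frames, centroids, min_len=3, max_len=30):
    """Hard segmentation: pick boundaries minimising the summed
    distance of each segment's embedding to its nearest centroid."""
    n = len(frames)
    best = np.full(n + 1, np.inf)
    back = np.zeros(n + 1, dtype=int)
    best[0] = 0.0
    for t in range(1, n + 1):
        for length in range(min_len, min(max_len, t) + 1):
            e = embed(frames[t - length:t])
            cost = np.min(np.linalg.norm(centroids - e, axis=1))
            if best[t - length] + cost < best[t]:
                best[t], back[t] = best[t - length] + cost, t - length
    bounds, t = [], n
    while t > 0:
        bounds.append((back[t], t))
        t = back[t]
    return bounds[::-1]  # (start, end) frame indices of segments

# A full ES-KMeans iteration would alternate this segmentation step
# with ordinary K-means over the embeddings of the chosen segments.
```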
Bayesian Models for Unit Discovery on a Very Low Resource Language
Developing speech technologies for low-resource languages has become a very
active research field over the last decade. Among others, Bayesian models have
shown promising results on artificial examples but still lack in situ
experiments. Our work applies state-of-the-art Bayesian models to unsupervised
Acoustic Unit Discovery (AUD) in a real low-resource language scenario. We also
show that Bayesian models can naturally integrate information from other
resource-rich languages by means of an informative prior, leading to more consistent
discovered units. Finally, discovered acoustic units are used, either as the
1-best sequence or as a lattice, to perform word segmentation. Word
segmentation results show that this Bayesian approach clearly outperforms a
Segmental-DTW baseline on the same corpus.
Comment: Accepted to ICASSP 2018
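The paper's model is a nonparametric Bayesian phone-loop, which is beyond a few lines of code; purely to illustrate the informative-prior idea, the toy sketch below fits scikit-learn's BayesianGaussianMixture to features of a low-resource language while centring the prior on statistics from a resource-rich one. The data, component count, and prior strength are all placeholder assumptions.

```python
# Toy illustration only: an informative prior estimated from a
# resource-rich language steering unit discovery on a low-resource one.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
rich_feats = rng.normal(0.0, 1.0, size=(5000, 13))  # stand-in MFCC frames
low_feats = rng.normal(0.2, 1.0, size=(300, 13))    # low-resource target

aud = BayesianGaussianMixture(
    n_components=50,                     # upper bound on discovered units
    weight_concentration_prior_type="dirichlet_process",
    mean_prior=rich_feats.mean(axis=0),  # informative prior mean
    mean_precision_prior=1.0,            # how strongly to trust the prior
    max_iter=200,
    random_state=0,
)
units = aud.fit_predict(low_feats)       # frame-level unit labels
print("units actually used:", len(np.unique(units)))
```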
The Zero Resource Speech Challenge 2017
We describe a new challenge aimed at discovering subword and word units from
raw speech. This challenge is the follow-up to the Zero Resource Speech
Challenge 2015. It aims at constructing systems that generalize across
languages and adapt to new speakers. The design features and evaluation metrics
of the challenge are presented and the results of seventeen models are
discussed.
Comment: IEEE ASRU (Automatic Speech Recognition and Understanding) 2017, Okinawa, Japan.
Symbol Emergence in Robotics: A Survey
Humans can learn the use of language through physical interaction with their
environment and semiotic communication with other people. It is very important
to obtain a computational understanding of how humans can form a symbol system
and obtain semiotic skills through their autonomous mental development.
Recently, many studies have been conducted on the construction of robotic
systems and machine-learning methods that can learn the use of language through
embodied multimodal interaction with their environment and other systems.
Understanding the dynamics of symbol systems is crucially important, both for
understanding human social interactions and for developing a robot that can
smoothly communicate with human users over the long term. The
embodied cognition and social interaction of participants gradually change a
symbol system in a constructive manner. In this paper, we introduce a field of
research called symbol emergence in robotics (SER). SER is a constructive
approach towards an emergent symbol system. The emergent symbol system is
socially self-organized through both semiotic communications and physical
interactions with autonomous cognitive developmental agents, i.e., humans and
developmental robots. Specifically, we describe state-of-the-art research
topics concerning SER, e.g., multimodal categorization, word discovery, and
double articulation analysis, which enable a robot to obtain words and their
embodied meanings from raw sensorimotor information, including visual, haptic,
and auditory information and acoustic speech signals, in a totally
unsupervised manner. Finally, we suggest future directions of research in SER.
Comment: submitted to Advanced Robotics
Word Discovery in Visually Grounded, Self-Supervised Speech Models
We present a method for visually-grounded spoken term discovery. After
training either a HuBERT or wav2vec2.0 model to associate spoken captions with
natural images, we show that a powerful word segmentation and clustering
capability emerges within the model's self-attention heads. Our experiments
reveal that this ability is not present to nearly the same extent in the base
HuBERT and wav2vec2.0 models, suggesting that the visual grounding task is a
crucial component of the word discovery capability we observe. We also evaluate
our method on the Buckeye word segmentation and ZeroSpeech spoken term
discovery tasks, where we outperform all currently published methods on several
metrics.
Comment: submitted to Interspeech 2022
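The abstract does not spell out how segments are read off the attention heads; the sketch below assumes we already have one head's attention weights from an utterance-level token to each frame, takes contiguous above-threshold runs as hypothesised words, and clusters their mean-pooled features. The threshold heuristic and all names are assumptions, not the authors' recipe.

```python
# Assumed recipe: threshold one attention head's frame weights, take
# contiguous runs as word candidates, then cluster pooled features.
import numpy as np
from sklearn.cluster import KMeans

def attention_segments(attn, threshold=None):
    """Contiguous runs of frames whose attention exceeds a threshold.
    `attn` is a 1-D array of attention weights, one per frame."""
    if threshold is None:
        threshold = attn.mean() + attn.std()  # assumed heuristic
    mask = attn > threshold
    segments, start = [], None
    for i, on in enumerate(mask):
        if on and start is None:
            start = i
        elif not on and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(mask)))
    return segments

def cluster_segments(feats, segments, n_words=20):
    """Mean-pool per-frame features over each hypothesised word span,
    then K-means the pooled vectors (assumes len(segments) >= n_words)."""
    pooled = np.stack([feats[s:e].mean(axis=0) for s, e in segments])
    return KMeans(n_clusters=n_words, n_init=10).fit_predict(pooled)
```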
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
In this paper, we show that representations capturing syllabic units emerge
when training a self-supervised speech model with a visually-grounded training
objective. We demonstrate that a nearly identical model architecture (HuBERT)
trained with a masked language modeling loss does not exhibit this same
ability, suggesting that the visual grounding objective is responsible for the
emergence of this phenomenon. We propose the use of a minimum cut algorithm to
automatically predict syllable boundaries in speech, followed by a 2-stage
clustering method to group identical syllables together. We show that our model
not only outperforms a state-of-the-art syllabic segmentation method on the
language it was trained on (English), but also generalizes in a zero-shot
fashion to Estonian. Finally, we show that the same model is capable of
zero-shot generalization for a word segmentation task on 4 other languages from
the ZeroSpeech Challenge, in some cases beating the previous state of the art.
Comment: Interspeech 2023. Code & Model: https://github.com/jasonppy/syllable-discovery
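As a rough picture of the boundary-detection step (the paper's exact objective may differ), the sketch below frames minimum-cut segmentation as a dynamic program over a cosine self-similarity matrix of per-frame features, splitting the utterance into a fixed number of spans that minimise a normalised-cut-style cost. The fixed span count k and the cost itself are assumptions.

```python
# Assumed formulation: normalised-cut-style DP segmentation over a
# frame self-similarity matrix; k (number of syllables) is fixed here.
import numpy as np

def sim_matrix(feats):
    """Cosine self-similarity of per-frame features (n x d)."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return f @ f.T

def mincut_boundaries(sim, k):
    """Split frames into k contiguous spans minimising the sum over
    spans of cut(span, rest) / volume(span)."""
    n = len(sim)
    vol = sim.sum(axis=1)

    def span_cost(i, j):  # span covers frames [i, j)
        inside = sim[i:j, i:j].sum()
        volume = vol[i:j].sum()
        return (volume - inside) / max(volume, 1e-8)

    best = np.full((k + 1, n + 1), np.inf)
    back = np.zeros((k + 1, n + 1), dtype=int)
    best[0, 0] = 0.0
    for s in range(1, k + 1):
        for j in range(s, n + 1):
            for i in range(s - 1, j):
                c = best[s - 1, i] + span_cost(i, j)
                if c < best[s, j]:
                    best[s, j], back[s, j] = c, i
    bounds, j = [], n
    for s in range(k, 0, -1):
        bounds.append(j)
        j = back[s, j]
    return sorted(bounds)  # frame indices where each span ends
```

A second stage, corresponding to the paper's 2-stage clustering, would then pool features within each predicted span and cluster the pooled vectors across the corpus to group recurring syllables; that step is omitted here.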