The Zero Resource Speech Challenge 2017
We describe a new challenge aimed at discovering subword and word units from
raw speech. This challenge is the followup to the Zero Resource Speech
Challenge 2015. It aims at constructing systems that generalize across
languages and adapt to new speakers. The design features and evaluation metrics
of the challenge are presented and the results of seventeen models are
discussed.
Comment: IEEE ASRU (Automatic Speech Recognition and Understanding) 2017,
Okinawa, Japan.
Scalable Estimation of Dirichlet Process Mixture Models on Distributed Data
We consider the estimation of Dirichlet Process Mixture Models (DPMMs) in
distributed environments, where data are distributed across multiple computing
nodes. A key advantage of Bayesian nonparametric models such as DPMMs is that
they allow new components to be introduced on the fly as needed. This, however,
poses an important challenge to distributed estimation -- how to handle new
components efficiently and consistently. To tackle this problem, we propose a
new estimation method, which allows new components to be created locally in
individual computing nodes. Components corresponding to the same cluster will
be identified and merged via a probabilistic consolidation scheme. In this way,
we can maintain the consistency of estimation with very low communication cost.
Experiments on large real-world data sets show that the proposed method can
achieve high scalability in distributed and asynchronous environments without
compromising the mixing performance.
Comment: This paper was published at IJCAI 2017.
https://www.ijcai.org/proceedings/2017/64
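The consolidation idea can be sketched in a few lines: each node creates components locally, and components from different nodes whose parameters land close together are merged into one global component. The sketch below is a minimal greedy version with Gaussian components summarized by their means and a hypothetical `merge_threshold`; the paper's actual scheme is probabilistic, not a hard threshold.

```python
import numpy as np

def consolidate(local_components, merge_threshold=1.0):
    """Greedily merge components (given as (mean, count) pairs) whose
    means lie within merge_threshold of an existing global component."""
    merged = []  # list of [weighted mean, total count]
    for mean, count in local_components:
        mean = np.asarray(mean, dtype=float)
        for entry in merged:
            if np.linalg.norm(entry[0] - mean) < merge_threshold:
                total = entry[1] + count
                entry[0] = (entry[0] * entry[1] + mean * count) / total
                entry[1] = total
                break
        else:
            merged.append([mean, count])
    return merged

# Two nodes discover nearly identical clusters; they collapse into one
# global component, while the distant cluster stays separate.
components = [([0.0, 0.0], 10), ([0.05, -0.02], 12), ([5.0, 5.0], 8)]
global_components = consolidate(components)
# len(global_components) → 2
```

The merge step keeps the count-weighted mean, so the consolidated component summarizes both local components' data.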
Unsupervised Acoustic Unit Representation Learning for Voice Conversion using WaveNet Auto-encoders
Unsupervised representation learning of speech has attracted keen interest in
recent years, as is evident, for example, in the wide participation in the
ZeroSpeech challenges. This work presents a new method for learning frame-level
representations based on WaveNet auto-encoders. Of particular interest in the
ZeroSpeech Challenge 2019 were models with discrete latent variables, such as
the Vector Quantized Variational Auto-Encoder (VQ-VAE). However, these models
generate speech of relatively poor quality. In this work we aim to address this
with two approaches: first, WaveNet is used as the decoder to generate waveform
data directly from the latent representation; second, the low
complexity of latent representations is improved with two alternative
disentanglement learning methods, namely instance normalization and sliced
vector quantization. The method was developed and tested in the context of the
recent ZeroSpeech challenge 2020. The system output submitted to the challenge
obtained the top position for naturalness (Mean Opinion Score 4.06), top
position for intelligibility (Character Error Rate 0.15), and third position
for the quality of the representation (ABX test score 12.5). These results and
the further analysis in this paper illustrate that the quality of the converted
speech and of the acoustic unit representation can be well balanced.
Comment: To be presented at Interspeech 202
Comparing unsupervised speech learning directly to human performance in speech perception
We compare the performance of humans (English and French listeners) versus an unsupervised speech model in a perception experiment (ABX discrimination task). Although the ABX task has been used for acoustic model evaluation in previous research, the results have not, until now, been compared directly with human behaviour in an experiment. We show that a standard, well-performing model (DPGMM) has better accuracy at predicting human responses than the acoustic baseline. The model also shows a native language effect, better resembling native listeners of the language on which it was trained. However, the native language effect shown by the models is different from the one shown by the human listeners, and, notably, the models do not show the same overall patterns of vowel confusions.
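The ABX task itself reduces to a simple comparison: X belongs to the same category as A, and a representation scores a trial correctly when it places X closer to A than to B. A minimal sketch, with hypothetical feature vectors and plain Euclidean distance standing in for whatever distance the evaluation uses:

```python
import numpy as np

def abx_correct(a, b, x, dist=lambda u, v: np.linalg.norm(u - v)):
    """True if X is closer to A (same category as X) than to B."""
    return dist(x, a) < dist(x, b)

def abx_accuracy(triples):
    """Fraction of (A, B, X) triples the representation discriminates."""
    return sum(abx_correct(*t) for t in triples) / len(triples)

a = np.array([1.0, 0.0])              # token of category A
b = np.array([0.0, 1.0])              # token of category B
x = np.array([0.9, 0.1])              # another token of A's category
acc = abx_accuracy([(a, b, x)])
# acc → 1.0
```

Averaging this score over many triples (and over speakers and phone contexts, in the real evaluation) yields the ABX discrimination score reported in these papers.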
DAMM: Directionality-Aware Mixture Model Parallel Sampling for Efficient Dynamical System Learning
The Linear Parameter Varying Dynamical System (LPV-DS) is a promising
framework for learning stable time-invariant motion policies in robot control.
By employing statistical modeling and semi-definite optimization, LPV-DS
encodes complex motions via non-linear DS, ensuring the robustness and
stability of the system. However, the current LPV-DS scheme faces challenges in
accurately interpreting trajectory data while maintaining model and
computational efficiency. To address these limitations, we propose the
Directionality-aware Mixture Model (DAMM), a new statistical model that
leverages the Riemannian metric on the d-dimensional unit sphere and
efficiently incorporates non-Euclidean directional information alongside position.
Additionally, we introduce a hybrid Markov chain Monte Carlo method that
combines Gibbs sampling with split/merge proposals, facilitating parallel
computation and enabling faster inference for near real-time learning
performance. Through extensive empirical validation, we demonstrate that the
improved LPV-DS framework with DAMM produces physically meaningful
representations of the trajectory data and improves the performance of the
generated DS, while learning significantly faster than its previous
iterations.
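The non-Euclidean ingredient here is the geodesic (great-circle) distance between unit direction vectors on the sphere, which a directionality-aware mixture can use where a Euclidean mixture would use straight-line distance. A sketch of that metric only, not of the paper's full model (which couples direction with position):

```python
import numpy as np

def unit(v):
    """Normalize a vector onto the unit sphere."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def geodesic_distance(u, v):
    """Great-circle distance between unit vectors: arccos of their dot
    product, clipped to guard against floating-point rounding."""
    c = np.clip(np.dot(u, v), -1.0, 1.0)
    return np.arccos(c)

# Velocity directions 90 degrees apart are pi/2 apart on the sphere,
# even though their Euclidean distance is only sqrt(2).
d = geodesic_distance(unit([1.0, 0.0]), unit([0.0, 1.0]))
# d → pi/2 ≈ 1.5708
```

Clustering directions with this metric keeps nearly-opposite headings far apart, which Euclidean distance on raw velocity vectors can fail to do once magnitudes vary.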
Scaling-up Split-Merge MCMC with Locality Sensitive Sampling (LSS)
Split-Merge MCMC (Markov chain Monte Carlo) is one of the essential and
popular variants of MCMC for problems when an MCMC state consists of an unknown
number of components. It is well known that state-of-the-art methods for
split-merge MCMC do not scale well. Strategies for rapid mixing require smart
and informative proposals to reduce the rejection rate. However, all known
smart proposals involve expensive operations to suggest informative
transitions. As a result, the cost of each iteration is prohibitive for massive
scale datasets. It is further known that uninformative but computationally
efficient proposals, such as random split-merge, lead to extremely slow
convergence. This tradeoff between mixing time and per-update cost seems hard
to get around.
In this paper, we show a sweet spot. We leverage some unique properties of
weighted MinHash, which is a popular LSH, to design a novel class of
split-merge proposals which are significantly more informative than random
sampling but at the same time efficient to compute. Overall, we obtain a
superior tradeoff between convergence and per-update cost. As a direct
consequence, our proposals are around 6x faster than the state-of-the-art
sampling methods on two large real datasets, KDDCUP and PubMed, with several
million entities and thousands of clusters.
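The LSH idea can be sketched with ordinary (unweighted) MinHash standing in for the weighted MinHash used in the paper: clusters whose member sets produce the same MinHash signature fall into the same hash bucket, so pairs drawn from within a bucket already overlap heavily and make informative merge candidates, at hashing cost rather than all-pairs comparison cost. The bucketing below is a simplified illustration, not the paper's exact proposal distribution.

```python
import random

def minhash_signature(items, hashes):
    """One minimum per hash function: a compact signature of the set.
    Similar sets collide in a coordinate with probability equal to
    their Jaccard similarity."""
    return tuple(min(h(x) for x in items) for h in hashes)

def propose_merge_candidates(clusters, num_hashes=4, seed=0):
    """Bucket clusters by MinHash signature; clusters sharing a bucket
    are likely-overlapping and thus good merge candidates."""
    rng = random.Random(seed)
    masks = [rng.getrandbits(32) for _ in range(num_hashes)]
    # XOR with a random mask gives a cheap family of hash functions.
    hashes = [lambda x, m=m: hash(x) ^ m for m in masks]
    buckets = {}
    for name, members in clusters.items():
        sig = minhash_signature(members, hashes)
        buckets.setdefault(sig, []).append(name)
    return [b for b in buckets.values() if len(b) > 1]

clusters = {
    "c1": {1, 2, 3, 4},
    "c2": {1, 2, 3, 4},   # identical membership: certain collision
    "c3": {7, 8, 9},
}
candidates = propose_merge_candidates(clusters)
# candidates → [['c1', 'c2']]
```

A random split-merge proposal would instead pick cluster pairs uniformly, almost always suggesting a merge of unrelated clusters; restricting proposals to bucket-mates is what buys the informativeness at near-constant cost.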