18 research outputs found

    The Zero Resource Speech Challenge 2017

    We describe a new challenge aimed at discovering subword and word units from raw speech. This challenge is the follow-up to the Zero Resource Speech Challenge 2015. It aims at constructing systems that generalize across languages and adapt to new speakers. The design features and evaluation metrics of the challenge are presented and the results of seventeen models are discussed.
    Comment: IEEE ASRU (Automatic Speech Recognition and Understanding) 2017. Okinawa, Japan

    Scalable Estimation of Dirichlet Process Mixture Models on Distributed Data

    We consider the estimation of Dirichlet Process Mixture Models (DPMMs) in distributed environments, where data are distributed across multiple computing nodes. A key advantage of Bayesian nonparametric models such as DPMMs is that they allow new components to be introduced on the fly as needed. This, however, poses an important challenge to distributed estimation: how to handle new components efficiently and consistently. To tackle this problem, we propose a new estimation method, which allows new components to be created locally in individual computing nodes. Components corresponding to the same cluster are then identified and merged via a probabilistic consolidation scheme. In this way, we can maintain the consistency of estimation with very low communication cost. Experiments on large real-world datasets show that the proposed method can achieve high scalability in distributed and asynchronous environments without compromising the mixing performance.
    Comment: This paper is published at IJCAI 2017. https://www.ijcai.org/proceedings/2017/64
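    The local-creation-then-consolidation idea can be sketched in a few lines. The snippet below is a hypothetical, simplified stand-in: each "node" spawns components greedily (a caricature of per-node DPMM inference, not the paper's sampler), and consolidation deterministically merges components with nearby means, whereas the paper uses a probabilistic consolidation scheme. All function names and thresholds are illustrative assumptions.

    ```python
    import numpy as np

    def local_components(data, threshold=2.0):
        """Greedy stand-in for per-node DPMM inference: assign each point to the
        nearest existing component mean, or spawn a new component on the fly."""
        comps = []  # list of (running sum, count)
        for x in data:
            if comps:
                means = np.array([s / n for s, n in comps])
                d = np.linalg.norm(means - x, axis=1)
                j = int(np.argmin(d))
                if d[j] < threshold:
                    s, n = comps[j]
                    comps[j] = (s + x, n + 1)
                    continue
            comps.append((x.copy(), 1))
        return [(s / n, n) for s, n in comps]

    def consolidate(all_comps, threshold=2.0):
        """Merge components from different nodes whose means are close, a
        deterministic stand-in for the paper's probabilistic consolidation."""
        merged = []
        for mean, n in sorted(all_comps, key=lambda c: -c[1]):
            for i, (m, cnt) in enumerate(merged):
                if np.linalg.norm(m - mean) < threshold:
                    total = cnt + n
                    merged[i] = ((m * cnt + mean * n) / total, total)
                    break
            else:
                merged.append((mean, n))
        return merged

    rng = np.random.default_rng(0)
    # two true clusters, with the data split across two "nodes"
    cluster_a = rng.normal([0, 0], 0.3, size=(100, 2))
    cluster_b = rng.normal([5, 5], 0.3, size=(100, 2))
    node1 = np.vstack([cluster_a[:50], cluster_b[:50]])
    node2 = np.vstack([cluster_a[50:], cluster_b[50:]])

    comps = local_components(node1) + local_components(node2)
    global_comps = consolidate(comps)
    print(len(global_comps))  # the per-node copies of each cluster merge into 2
    ```

    The point of the sketch is only the communication pattern: nodes never coordinate while creating components, and only the small per-component summaries are exchanged for merging.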

    Unsupervised Acoustic Unit Representation Learning for Voice Conversion using WaveNet Auto-encoders

    Unsupervised representation learning of speech has been of keen interest in recent years, as is evident, for example, in the wide interest in the ZeroSpeech challenges. This work presents a new method for learning frame-level representations based on WaveNet auto-encoders. Of particular interest in the ZeroSpeech Challenge 2019 were models with discrete latent variables, such as the Vector Quantized Variational Auto-Encoder (VQ-VAE). However, these models generate speech with relatively poor quality. In this work we aim to address this with two approaches: first, WaveNet is used as the decoder to generate waveform data directly from the latent representation; second, the low-complexity latent representations are improved with two alternative disentanglement learning methods, namely instance normalization and sliced vector quantization. The method was developed and tested in the context of the recent ZeroSpeech Challenge 2020. The system output submitted to the challenge obtained the top position for naturalness (Mean Opinion Score 4.06), the top position for intelligibility (Character Error Rate 0.15), and third position for the quality of the representation (ABX test score 12.5). These results and the further analysis in this paper illustrate that the quality of the converted speech and of the acoustic unit representation can be well balanced.
    Comment: To be presented at Interspeech 2020
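    The two disentanglement ingredients named in the abstract can be illustrated in isolation. The sketch below applies per-utterance instance normalization (removing utterance-level statistics such as speaker characteristics) followed by nearest-neighbour vector quantization against a codebook. It is a toy numpy sketch with random features and a random codebook, not the paper's WaveNet auto-encoder or its sliced quantizer; all names and shapes are hypothetical.

    ```python
    import numpy as np

    def instance_norm(frames, eps=1e-5):
        """Normalize one utterance's features per dimension (zero mean, unit
        variance), stripping utterance-level (e.g. speaker) statistics."""
        mu = frames.mean(axis=0, keepdims=True)
        sigma = frames.std(axis=0, keepdims=True)
        return (frames - mu) / (sigma + eps)

    def vector_quantize(frames, codebook):
        """Map each frame to its nearest codebook entry, the discrete
        bottleneck a VQ-VAE-style encoder would expose."""
        d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
        codes = d.argmin(axis=1)          # (T,) discrete unit indices
        return codes, codebook[codes]     # indices and quantized frames

    rng = np.random.default_rng(1)
    utterance = rng.normal(3.0, 2.0, size=(50, 8))  # 50 frames, 8-dim features
    codebook = rng.normal(0.0, 1.0, size=(16, 8))   # 16 codes (random here)

    normed = instance_norm(utterance)
    codes, quantized = vector_quantize(normed, codebook)
    print(codes.shape, quantized.shape)  # (50,) (50, 8)
    ```

    In a trained system the codebook is learned jointly with the encoder and the `codes` sequence is what the WaveNet decoder conditions on; here both are random and serve only to show the data flow.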

    Comparing unsupervised speech learning directly to human performance in speech perception

    We compare the performance of humans (English and French listeners) with that of an unsupervised speech model in a perception experiment (an ABX discrimination task). Although the ABX task has been used for acoustic model evaluation in previous research, the results have not, until now, been compared directly with human behaviour in an experiment. We show that a standard, well-performing model (DPGMM) has better accuracy at predicting human responses than the acoustic baseline. The model also shows a native-language effect, better resembling native listeners of the language on which it was trained. However, the native-language effect shown by the models is different from the one shown by the human listeners, and, notably, the models do not show the same overall patterns of vowel confusions.
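    The ABX discrimination task is simple to state: given A and X drawn from one category and B from another, a model (or listener) is correct when X is closer to A than to B. Below is a minimal, assumption-laden sketch of machine-ABX scoring over toy 2-D embeddings, with plain Euclidean distance standing in for whatever representation distance a real evaluation (e.g. DTW over frame posteriors) would use.

    ```python
    import numpy as np

    def abx_accuracy(cat1, cat2, dist):
        """Fraction of (A, B, X) triples, with A and X from cat1 and B from
        cat2, in which X is closer to A than to B."""
        correct = total = 0
        for x_i, x in enumerate(cat1):
            for a_i, a in enumerate(cat1):
                if a_i == x_i:          # A and X must be distinct tokens
                    continue
                for b in cat2:
                    correct += dist(a, x) < dist(b, x)
                    total += 1
        return correct / total

    rng = np.random.default_rng(2)
    euclid = lambda u, v: np.linalg.norm(u - v)
    # two well-separated toy "phoneme" categories in an embedding space
    cat_pa = rng.normal([0, 0], 0.5, size=(10, 2))
    cat_ba = rng.normal([3, 0], 0.5, size=(10, 2))
    acc = abx_accuracy(cat_pa, cat_ba, euclid)
    print(acc)  # close to 1.0 when categories are well separated
    ```

    Reported ABX numbers (such as the 12.5 above) are error rates rather than accuracies, and are aggregated over many phoneme contrasts, speakers, and contexts; the sketch shows only the core triple-counting logic.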

    DAMM: Directionality-Aware Mixture Model Parallel Sampling for Efficient Dynamical System Learning

    The Linear Parameter Varying Dynamical System (LPV-DS) is a promising framework for learning stable, time-invariant motion policies in robot control. By employing statistical modeling and semi-definite optimization, LPV-DS encodes complex motions via non-linear DS, ensuring the robustness and stability of the system. However, the current LPV-DS scheme faces challenges in accurately interpreting trajectory data while maintaining model efficiency and computational efficiency. To address these limitations, we propose the Directionality-Aware Mixture Model (DAMM), a new statistical model that leverages the Riemannian metric on the d-dimensional sphere S^d and efficiently incorporates non-Euclidean directional information alongside position. Additionally, we introduce a hybrid Markov chain Monte Carlo method that combines Gibbs sampling with split/merge proposals, facilitating parallel computation and enabling faster inference for near real-time learning performance. Through extensive empirical validation, we demonstrate that the improved LPV-DS framework with DAMM produces physically meaningful representations of the trajectory data and improved performance of the generated DS, while showing significantly enhanced learning speed compared to its previous iterations.
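    The core intuition behind directionality-awareness can be shown without the full Riemannian machinery: augment each trajectory point with its unit velocity direction (a point on the sphere), so that segments that overlap in position but travel in opposite directions become separable by a mixture model. The sketch below is a deliberate simplification; DAMM's actual model uses a Riemannian metric on S^d, not the naive Euclidean concatenation used here, and all names are hypothetical.

    ```python
    import numpy as np

    def directional_features(trajectory, eps=1e-8):
        """Augment each trajectory point with its unit velocity direction, a
        point on the sphere, so clustering can separate segments that overlap
        in position but move differently."""
        velocity = np.diff(trajectory, axis=0)
        norms = np.linalg.norm(velocity, axis=1, keepdims=True)
        direction = velocity / (norms + eps)
        # drop the last point so positions and directions align
        return np.hstack([trajectory[:-1], direction])

    t = np.linspace(0, 1, 100)[:, None]
    forward = np.hstack([t, np.zeros_like(t)])  # traverses the segment in +x
    backward = forward[::-1]                    # same positions, -x motion

    f_feat = directional_features(forward)
    b_feat = directional_features(backward)
    # identical paths, opposite headings in the directional coordinates
    print(np.allclose(f_feat[:, 2:], -b_feat[:, 2:]))  # True
    ```

    A position-only mixture would collapse the two passes above into one cluster; with the extra directional coordinates they are trivially separable, which is the property the LPV-DS pipeline needs for physically meaningful components.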

    Scaling-up Split-Merge MCMC with Locality Sensitive Sampling (LSS)

    Split-merge MCMC (Markov chain Monte Carlo) is one of the essential and popular variants of MCMC for problems in which an MCMC state consists of an unknown number of components. It is well known that state-of-the-art methods for split-merge MCMC do not scale well. Strategies for rapid mixing require smart and informative proposals to reduce the rejection rate. However, all known smart proposals involve expensive operations to suggest informative transitions. As a result, the cost of each iteration is prohibitive for massive-scale datasets. It is further known that uninformative but computationally efficient proposals, such as random split-merge, lead to extremely slow convergence. This tradeoff between mixing time and per-update cost seems hard to get around. In this paper, we show a sweet spot. We leverage some unique properties of weighted MinHash, a popular locality sensitive hashing (LSH) scheme, to design a novel class of split-merge proposals that are significantly more informative than random sampling yet remain efficient to compute. Overall, we obtain a superior tradeoff between convergence and per-update cost. As a direct consequence, our proposals are around 6x faster than state-of-the-art sampling methods on two large real datasets, KDDCUP and PubMed, with several millions of entities and thousands of clusters.
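    The LSH idea, proposing merges only between clusters that hash to the same bucket so that candidate generation is cheap yet far more informative than random pairing, can be sketched as follows. To keep the sketch short it uses sign random projections (SimHash) rather than the weighted MinHash the paper is actually built on, and it hashes cluster centroids rather than the paper's cluster statistics; all names are hypothetical.

    ```python
    import numpy as np

    def lsh_merge_candidates(centroids, n_bits=8, seed=0):
        """Bucket cluster centroids with sign-random-projection LSH and
        propose merge pairs only within each bucket."""
        rng = np.random.default_rng(seed)
        planes = rng.normal(size=(n_bits, centroids.shape[1]))
        signs = (centroids @ planes.T) > 0      # (n_clusters, n_bits) bit codes
        buckets = {}
        for i, bits in enumerate(signs):
            buckets.setdefault(bits.tobytes(), []).append(i)
        pairs = []
        for members in buckets.values():
            for a in range(len(members)):
                for b in range(a + 1, len(members)):
                    pairs.append((members[a], members[b]))
        return pairs

    # clusters 0 and 1 are redundant copies; cluster 3 points elsewhere
    centroids = np.array([[1.0, 1.0], [1.0, 1.0], [0.99, 1.02], [-5.0, 4.0]])
    pairs = lsh_merge_candidates(centroids)
    print(pairs)  # identical centroids 0 and 1 always share a bucket
    ```

    The cost profile is the point: hashing is linear in the number of clusters, while the informative proposals it replaces typically require comparing cluster statistics pairwise before every split/merge move.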
