164 research outputs found

Personalized Acoustic Modeling by Weakly Supervised Multi-Task Deep Learning using Acoustic Tokens Discovered from Unlabeled Data

Author: Chung Cheng-Tao
Lee Hung-Yi
Lee Lin-Shan
Wei Cheng-Kuan
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 23/06/2017

It is well known that recognizers personalized to each user are much more effective than user-independent recognizers. With the popularity of smartphones today, although it is not difficult to collect a large set of audio data for each user, it is difficult to transcribe it. However, it is now possible to automatically discover acoustic tokens from unlabeled personal data in an unsupervised way. We therefore propose a multi-task deep learning framework called a phoneme-token deep neural network (PTDNN), jointly trained from unsupervised acoustic tokens discovered from unlabeled data and very limited transcribed data for personalized acoustic modeling. We term this scenario "weakly supervised". The underlying intuition is that the high degree of similarity between the HMM states of acoustic token models and phoneme models may help them learn from each other in this multi-task learning framework. Initial experiments performed over a personalized audio data set recorded from Facebook posts demonstrated that very good improvements can be achieved in both frame accuracy and word accuracy over popularly-considered baselines such as fDLR, speaker code and lightly supervised adaptation. This approach complements existing speaker adaptation approaches and can be used jointly with such techniques to yield improved results.

Comment: 5 pages, 5 figures, published in IEEE ICASSP 201

arXiv.org e-Print Archive

Crossref

Unsupervised Spoken Term Detection with Spoken Queries by Multi-level Acoustic Patterns with Varying Model Granularity

Author: Chan Chun-an
Chung Cheng-Tao
Lee Lin-shan
Publication date: 07/09/2015

This paper presents a new approach for unsupervised Spoken Term Detection with spoken queries using multiple sets of acoustic patterns automatically discovered from the target corpus. The different pattern HMM configurations (number of states per model, number of distinct models, number of Gaussians per state) form a three-dimensional model granularity space. Different sets of acoustic patterns automatically discovered on different points properly distributed over this three-dimensional space are complementary to one another, and thus can jointly capture the characteristics of the spoken terms. By representing the spoken content and spoken query as sequences of acoustic patterns, a series of approaches for matching the pattern index sequences while considering the signal variations are developed. In this way, not only the on-line computation load can be reduced, but the signal distributions caused by different speakers and acoustic conditions can be reasonably taken care of. The results indicate that this approach significantly outperformed the unsupervised feature-based DTW baseline by 16.16% in mean average precision on the TIMIT corpus.

Comment: Accepted by ICASSP 201

arXiv.org e-Print Archive

Crossref

Dictionary Learning-Based Speech Enhancement

Author: Bui Manh-Quan
Duong Viet-Hang
Wang Jia-Ching
Publication venue: IntechOpen
Publication date: 06/05/2019

IntechOpen

Crossref

A segmental framework for fully-unsupervised large-vocabulary speech recognition

Author: Aren Jansen
Herman Kamper
Sharon Goldwater
Publication venue: Elsevier BV
Publication date: 16/09/2017

Zero-resource speech technology is a growing research area that aims to develop methods for speech processing in the absence of transcriptions, lexicons, or language modelling text. Early term discovery systems focused on identifying isolated recurring patterns in a corpus, while more recent full-coverage systems attempt to completely segment and cluster the audio into word-like units, effectively performing unsupervised speech recognition. This article presents the first attempt we are aware of to apply such a system to large-vocabulary multi-speaker data. Our system uses a Bayesian modelling framework with segmental word representations: each word segment is represented as a fixed-dimensional acoustic embedding obtained by mapping the sequence of feature frames to a single embedding vector. We compare our system on English and Xitsonga datasets to state-of-the-art baselines, using a variety of measures including word error rate (obtained by mapping the unsupervised output to ground truth transcriptions). Very high word error rates are reported (in the order of 70-80% for speaker-dependent and 80-95% for speaker-independent systems), highlighting the difficulty of this task. Nevertheless, in terms of cluster quality and word segmentation metrics, we show that by imposing a consistent top-down segmentation while also using bottom-up knowledge from detected syllable boundaries, both single-speaker and multi-speaker versions of our system outperform a purely bottom-up single-speaker syllable-based approach. We also show that the discovered clusters can be made less speaker- and gender-specific by using an unsupervised autoencoder-like feature extractor to learn better frame-level features (prior to embedding). Our system's discovered clusters are still less pure than those of unsupervised term discovery systems, but provide far greater coverage.

Comment: 15 pages, 6 figures, 8 tables

arXiv.org e-Print Archive

Crossref

Edinburgh Research Explorer