Personalized Acoustic Modeling by Weakly Supervised Multi-Task Deep Learning using Acoustic Tokens Discovered from Unlabeled Data
It is well known that recognizers personalized to each user are much more
effective than user-independent recognizers. With the popularity of smartphones
today, although it is not difficult to collect a large set of audio data for
each user, it is difficult to transcribe it. However, it is now possible to
automatically discover acoustic tokens from unlabeled personal data in an
unsupervised way. We therefore propose a multi-task deep learning framework
called a phoneme-token deep neural network (PTDNN), jointly trained from
unsupervised acoustic tokens discovered from unlabeled data and very limited
transcribed data for personalized acoustic modeling. We term this scenario
"weakly supervised". The underlying intuition is that the high degree of
similarity between the HMM states of acoustic token models and phoneme models
may help them learn from each other in this multi-task learning framework.
Initial experiments performed over a personalized audio data set recorded from
Facebook posts demonstrated that very good improvements can be achieved in both
frame accuracy and word accuracy over popularly-considered baselines such as
fDLR, speaker code and lightly supervised adaptation. This approach complements
existing speaker adaptation approaches and can be used jointly with such
techniques to yield improved results.
Comment: 5 pages, 5 figures, published in IEEE ICASSP 201
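The joint training idea in this abstract can be illustrated with a minimal sketch. The abstract does not specify the network layout, so everything here is an assumption made for illustration: a single shared hidden layer, two softmax output heads (one over phoneme HMM states, one over acoustic-token HMM states), a summed cross-entropy loss, and the hypothetical names `TwoHeadNet`, `joint_loss`, and `phone_weight`. Frames from unlabeled audio contribute only the token term; the few transcribed frames contribute both terms.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def affine(W, b, x):
    """Compute W x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


class TwoHeadNet:
    """Hypothetical two-task network: shared hidden layer, two output heads."""

    def __init__(self, n_feat, n_hidden, n_phone_states, n_token_states, seed=0):
        rng = random.Random(seed)
        self.shared = init_layer(n_hidden, n_feat, rng)
        self.phone_head = init_layer(n_phone_states, n_hidden, rng)
        self.token_head = init_layer(n_token_states, n_hidden, rng)

    def forward(self, x):
        # The hidden representation is shared by both tasks.
        h = [math.tanh(v) for v in affine(*self.shared, x)]
        p_phone = softmax(affine(*self.phone_head, h))
        p_token = softmax(affine(*self.token_head, h))
        return p_phone, p_token


def joint_loss(net, x, token_label, phone_label=None, phone_weight=1.0):
    """Cross-entropy on the token head, plus the phoneme head when labeled."""
    p_phone, p_token = net.forward(x)
    loss = -math.log(p_token[token_label])
    if phone_label is not None:  # only the limited transcribed data has this
        loss += phone_weight * -math.log(p_phone[phone_label])
    return loss
```

In such a layout, gradients from both heads would update the shared layer, which is one way the two sets of HMM-state targets could "learn from each other"; the published model may differ.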
Unsupervised Spoken Term Detection with Spoken Queries by Multi-level Acoustic Patterns with Varying Model Granularity
This paper presents a new approach for unsupervised Spoken Term Detection
with spoken queries using multiple sets of acoustic patterns automatically
discovered from the target corpus. The different pattern HMM
configurations (number of states per model, number of distinct models, number of
Gaussians per state) form a three-dimensional model granularity space. Different
sets of acoustic patterns automatically discovered on different points properly
distributed over this three-dimensional space are complementary to one another
and can thus jointly capture the characteristics of the spoken terms. By
representing the spoken content and spoken query as sequences of acoustic
patterns, a series of approaches for matching the pattern index sequences while
considering the signal variations are developed. In this way, the on-line
computation load is reduced, and the signal distributions caused by
different speakers and acoustic conditions are reasonably taken care of. The
results indicate that this approach significantly outperformed the unsupervised
feature-based DTW baseline by 16.16% in mean average precision on the TIMIT
corpus.
Comment: Accepted by ICASSP 201
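Matching a spoken query against spoken content once both are pattern index sequences can be sketched as approximate substring matching. The abstract does not describe the actual matching methods, so this is only an illustrative stand-in: a standard edit-distance recurrence in which the query may start anywhere in the document sequence. The function name `best_match_cost`, the unit insertion/deletion costs, and the optional `sub_cost` hook (where a softer pattern-to-pattern cost could be plugged in to reflect signal variation) are all assumptions.

```python
def best_match_cost(query, document, sub_cost=None):
    """Lowest edit cost of aligning `query` to any contiguous part of `document`.

    Both arguments are sequences of acoustic-pattern indices. Lower cost means
    a better candidate detection. Costs of 1.0 for insertions and deletions
    are an illustrative choice.
    """
    if sub_cost is None:
        # Hard matching: identical pattern indices cost nothing.
        sub_cost = lambda a, b: 0.0 if a == b else 1.0

    # Row for the empty query: a match may begin at any document position.
    prev = [0.0] * (len(document) + 1)
    for i, q in enumerate(query, start=1):
        curr = [float(i)]  # i query patterns left unmatched before the document
        for j, d in enumerate(document, start=1):
            curr.append(min(
                prev[j - 1] + sub_cost(q, d),  # align q with d
                prev[j] + 1.0,                 # query pattern with no counterpart
                curr[j - 1] + 1.0,             # extra pattern in the document
            ))
        prev = curr
    # A match may also end at any document position.
    return min(prev)
```

Because this operates on short index sequences rather than frame-level feature vectors, it is cheaper than frame-level DTW, which is consistent with the reduced on-line computation the abstract mentions.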
A segmental framework for fully-unsupervised large-vocabulary speech recognition
Zero-resource speech technology is a growing research area that aims to
develop methods for speech processing in the absence of transcriptions,
lexicons, or language modelling text. Early term discovery systems focused on
identifying isolated recurring patterns in a corpus, while more recent
full-coverage systems attempt to completely segment and cluster the audio into
word-like units, effectively performing unsupervised speech recognition. This
article presents the first attempt we are aware of to apply such a system to
large-vocabulary multi-speaker data. Our system uses a Bayesian modelling
framework with segmental word representations: each word segment is represented
as a fixed-dimensional acoustic embedding obtained by mapping the sequence of
feature frames to a single embedding vector. We compare our system on English
and Xitsonga datasets to state-of-the-art baselines, using a variety of
measures including word error rate (obtained by mapping the unsupervised output
to ground truth transcriptions). Very high word error rates are reported (on
the order of 70-80% for speaker-dependent and 80-95% for speaker-independent
systems), highlighting the difficulty of this task. Nevertheless, in terms of
cluster quality and word segmentation metrics, we show that by imposing a
consistent top-down segmentation while also using bottom-up knowledge from
detected syllable boundaries, both single-speaker and multi-speaker versions of
our system outperform a purely bottom-up single-speaker syllable-based
approach. We also show that the discovered clusters can be made less speaker-
and gender-specific by using an unsupervised autoencoder-like feature extractor
to learn better frame-level features (prior to embedding). Our system's
discovered clusters are still less pure than those of unsupervised term
discovery systems, but provide far greater coverage.
Comment: 15 pages, 6 figures, 8 table
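The idea of mapping a variable-length segment to a fixed-dimensional embedding can be shown with a small sketch. The abstract only says that the frame sequence is mapped to a single vector; the specific mapping below (keeping a fixed number of evenly spaced frames and concatenating them) is an assumption chosen for simplicity, as are the names `downsample_embedding` and `n_keep`.

```python
def downsample_embedding(frames, n_keep=10):
    """Map a variable-length list of feature frames to one fixed-length vector.

    `frames` is a list of equal-length feature vectors (for example, one
    vector per 10 ms of audio). The result always has n_keep * dim entries,
    regardless of how many frames the segment contains.
    """
    if not frames:
        raise ValueError("segment has no frames")
    if n_keep < 1:
        raise ValueError("n_keep must be at least 1")

    n = len(frames)
    if n_keep == 1:
        indices = [(n - 1) // 2]  # take the middle frame
    else:
        # Evenly spaced positions from the first frame to the last one;
        # short segments simply repeat frames.
        indices = [round(k * (n - 1) / (n_keep - 1)) for k in range(n_keep)]

    # Concatenate the selected frames into a single flat vector.
    return [value for i in indices for value in frames[i]]
```

Since every segment ends up in the same vector space, hypothesised word segments of different durations can be compared and clustered directly, which is what a segmental model of this kind requires.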