10,698 research outputs found
A summary of the 2012 JHU CLSP Workshop on Zero Resource Speech Technologies and Models of Early Language Acquisition
We summarize the accomplishments of a multi-disciplinary workshop exploring the computational and scientific issues surrounding zero resource (unsupervised) speech technologies and related models of early language acquisition. Centered around the tasks of phonetic and lexical discovery, we consider unified evaluation metrics, present two new approaches for improving speaker independence in the absence of supervision, and evaluate the application of Bayesian word segmentation algorithms to automatic subword unit tokenizations. Finally, we present two strategies for integrating zero resource techniques into supervised settings, demonstrating the potential of unsupervised methods to improve mainstream technologies.5 page(s
Latent Class Model with Application to Speaker Diarization
In this paper, we apply a latent class model (LCM) to the task of speaker
diarization. LCM is similar to Patrick Kenny's variational Bayes (VB) method in
that it uses soft information and avoids premature hard decisions in its
iterations. In contrast to the VB method, which is based on a generative model,
LCM provides a framework allowing both generative and discriminative models.
The discriminative property is realized through the use of i-vector (Ivec),
probabilistic linear discriminative analysis (PLDA), and a support vector
machine (SVM) in this work. Systems denoted as LCM-Ivec-PLDA, LCM-Ivec-SVM, and
LCM-Ivec-Hybrid are introduced. In addition, three further improvements are
applied to enhance its performance. 1) Adding neighbor windows to extract more
speaker information for each short segment. 2) Using a hidden Markov model to
avoid frequent speaker change points. 3) Using an agglomerative hierarchical
cluster to do initialization and present hard and soft priors, in order to
overcome the problem of initial sensitivity. Experiments on the National
Institute of Standards and Technology Rich Transcription 2009 speaker
diarization database, under the condition of a single distant microphone, show
that the diarization error rate (DER) of the proposed methods has substantial
relative improvements compared with mainstream systems. Compared to the VB
method, the relative improvements of LCM-Ivec-PLDA, LCM-Ivec-SVM, and
LCM-Ivec-Hybrid systems are 23.5%, 27.1%, and 43.0%, respectively. Experiments
on our collected database, CALLHOME97, CALLHOME00 and SRE08 short2-summed trial
conditions also show that the proposed LCM-Ivec-Hybrid system has the best
overall performance
Deep clustering: Discriminative embeddings for segmentation and separation
We address the problem of acoustic source separation in a deep learning
framework we call "deep clustering." Rather than directly estimating signals or
masking functions, we train a deep network to produce spectrogram embeddings
that are discriminative for partition labels given in training data. Previous
deep network approaches provide great advantages in terms of learning power and
speed, but previously it has been unclear how to use them to separate signals
in a class-independent way. In contrast, spectral clustering approaches are
flexible with respect to the classes and number of items to be segmented, but
it has been unclear how to leverage the learning power and speed of deep
networks. To obtain the best of both worlds, we use an objective function that
to train embeddings that yield a low-rank approximation to an ideal pairwise
affinity matrix, in a class-independent way. This avoids the high cost of
spectral factorization and instead produces compact clusters that are amenable
to simple clustering methods. The segmentations are therefore implicitly
encoded in the embeddings, and can be "decoded" by clustering. Preliminary
experiments show that the proposed method can separate speech: when trained on
spectrogram features containing mixtures of two speakers, and tested on
mixtures of a held-out set of speakers, it can infer masking functions that
improve signal quality by around 6dB. We show that the model can generalize to
three-speaker mixtures despite training only on two-speaker mixtures. The
framework can be used without class labels, and therefore has the potential to
be trained on a diverse set of sound types, and to generalize to novel sources.
We hope that future work will lead to segmentation of arbitrary sounds, with
extensions to microphone array methods as well as image segmentation and other
domains.Comment: Originally submitted on June 5, 201
- …