171 research outputs found
The Blame Game: Performance Analysis of Speaker Diarization System Components
In this paper we discuss the performance analysis of a speaker diarization system similar to the system that was submitted by ICSI at the NIST RT06s evaluation benchmark. The analysis, which is based on a series of oracle experiments, provides a good understanding of the performance of each system component on a test set of twelve conference meetings used in previous NIST benchmarks. Our analysis shows that the speech activity detection component contributes most to the total diarization error rate (23%). The inability to model overlapping speech is also a large source of errors (22%), followed by the component that creates the initial system models (15%).
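The component percentages above are fractions of the total diarization error rate (DER). As a reminder of how DER is conventionally computed, here is a minimal sketch; the numeric values in the example are illustrative only, not from the paper.

```python
def diarization_error_rate(missed, false_alarm, speaker_error, total_speech):
    """DER = (missed speech + false-alarm speech + speaker confusion)
    divided by total scored speech time. All arguments in seconds."""
    return (missed + false_alarm + speaker_error) / total_speech

# Hypothetical durations for illustration: 60 s of error over 600 s of speech.
der = diarization_error_rate(missed=30.0, false_alarm=12.0,
                             speaker_error=18.0, total_speech=600.0)
# der == 0.10, i.e. a 10% DER
```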
Latent Class Model with Application to Speaker Diarization
In this paper, we apply a latent class model (LCM) to the task of speaker
diarization. LCM is similar to Patrick Kenny's variational Bayes (VB) method in
that it uses soft information and avoids premature hard decisions in its
iterations. In contrast to the VB method, which is based on a generative model,
LCM provides a framework allowing both generative and discriminative models.
The discriminative property is realized through the use of i-vector (Ivec),
probabilistic linear discriminative analysis (PLDA), and a support vector
machine (SVM) in this work. Systems denoted as LCM-Ivec-PLDA, LCM-Ivec-SVM, and
LCM-Ivec-Hybrid are introduced. In addition, three further improvements are
applied to enhance performance: 1) adding neighbor windows to extract more
speaker information for each short segment; 2) using a hidden Markov model to
avoid frequent speaker change points; and 3) using agglomerative hierarchical
clustering for initialization, with hard and soft priors, to overcome the
problem of initial sensitivity. Experiments on the National
Institute of Standards and Technology Rich Transcription 2009 speaker
diarization database, under the condition of a single distant microphone, show
that the diarization error rate (DER) of the proposed methods has substantial
relative improvements compared with mainstream systems. Compared to the VB
method, the relative improvements of LCM-Ivec-PLDA, LCM-Ivec-SVM, and
LCM-Ivec-Hybrid systems are 23.5%, 27.1%, and 43.0%, respectively. Experiments
on our collected database, CALLHOME97, CALLHOME00 and SRE08 short2-summed trial
conditions also show that the proposed LCM-Ivec-Hybrid system has the best
overall performance.
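The second improvement above, smoothing speaker decisions with an HMM to discourage frequent change points, can be sketched as Viterbi decoding over per-segment speaker posteriors with a self-loop-heavy transition matrix. This is a generic illustration of that idea, not the paper's implementation; the function name and `stay_prob` value are our own.

```python
import numpy as np

def smooth_speaker_path(log_post, stay_prob=0.95):
    """Viterbi decoding over per-segment speaker log-posteriors (T x K)
    with a transition matrix that strongly favors self-loops, suppressing
    spurious rapid speaker changes."""
    T, K = log_post.shape
    log_trans = np.full((K, K), np.log((1.0 - stay_prob) / (K - 1)))
    np.fill_diagonal(log_trans, np.log(stay_prob))
    delta = log_post[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans      # (prev state, current state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_post[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With a high `stay_prob`, a single segment whose posterior weakly prefers another speaker is overruled by its context, while `stay_prob=0.5` reduces to per-segment argmax.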
A sticky HDP-HMM with application to speaker diarization
We consider the problem of speaker diarization: segmenting an
audio recording of a meeting into temporal segments corresponding to individual
speakers. The problem is rendered particularly difficult by the fact that we
are not allowed to assume knowledge of the number of people participating in
the meeting. To address this problem, we take a Bayesian nonparametric approach
to speaker diarization that builds on the hierarchical Dirichlet process hidden
Markov model (HDP-HMM) of Teh et al. [J. Amer. Statist. Assoc. 101 (2006)
1566--1581]. Although the basic HDP-HMM tends to over-segment the audio
data---creating redundant states and rapidly switching among them---we describe
an augmented HDP-HMM that provides effective control over the switching rate.
We also show that this augmentation makes it possible to treat emission
distributions nonparametrically. To scale the resulting architecture to
realistic diarization problems, we develop a sampling algorithm that employs a
truncated approximation of the Dirichlet process to jointly resample the full
state sequence, greatly improving mixing rates. Working with a benchmark NIST
data set, we show that our Bayesian nonparametric architecture yields
state-of-the-art speaker diarization results.
Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org), DOI: http://dx.doi.org/10.1214/10-AOAS395.
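The "sticky" augmentation described in this abstract adds extra prior mass to self-transitions. In a truncated approximation with K states, each transition row can be drawn as pi_j ~ Dirichlet(alpha * beta + kappa * e_j), where kappa is the sticky bias. A small sketch under those assumptions (parameter values are illustrative):

```python
import numpy as np

def sample_sticky_transitions(beta, alpha=1.0, kappa=10.0, rng=None):
    """Sample HMM transition rows from a truncated sticky-HDP-HMM prior:
    pi_j ~ Dirichlet(alpha * beta + kappa * e_j).
    The extra mass kappa on the self-transition discourages the rapid
    state switching of the vanilla HDP-HMM."""
    rng = np.random.default_rng(rng)
    beta = np.asarray(beta, dtype=float)
    K = len(beta)
    pi = np.empty((K, K))
    for j in range(K):
        conc = alpha * beta.copy()
        conc[j] += kappa          # sticky bias on the self-transition
        pi[j] = rng.dirichlet(conc)
    return pi
```

With kappa much larger than alpha, the expected self-transition probability (alpha * beta_j + kappa) / (alpha + kappa) approaches 1, which is what tames the over-segmentation.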
Meta-learning with Latent Space Clustering in Generative Adversarial Network for Speaker Diarization
The performance of most speaker diarization systems with x-vector embeddings
is vulnerable to noisy environments and lacks domain robustness. Earlier
work on speaker diarization using generative adversarial network (GAN) with an
encoder network (ClusterGAN) to project input x-vectors into a latent space has
shown promising performance on meeting data. In this paper, we extend the
ClusterGAN network to improve diarization robustness and enable rapid
generalization across various challenging domains. To this end, we fetch the
pre-trained encoder from the ClusterGAN and fine-tune it by using prototypical
loss (meta-ClusterGAN or MCGAN) under the meta-learning paradigm. Experiments
are conducted on CALLHOME telephonic conversations, AMI meeting data, the
DIHARD II dev set, which comprises a challenging multi-domain corpus, and two
child-clinician interaction corpora (ADOS, BOSCC) related to the autism
spectrum disorder domain. Extensive analyses of the experimental data are done
to investigate the effectiveness of the proposed ClusterGAN and MCGAN
embeddings over x-vectors. The results show that the proposed embeddings with
normalized maximum eigengap spectral clustering (NME-SC) back-end consistently
outperform the Kaldi state-of-the-art x-vector diarization system. Finally, we
employ embedding fusion with x-vectors to provide further improvement in
diarization performance. We achieve a relative diarization error rate (DER)
improvement of 6.67% to 53.93% on the aforementioned datasets using the
proposed fused embeddings over x-vectors. In addition, the MCGAN embeddings provide
better performance in number-of-speaker estimation and short speech
segment diarization as compared to x-vectors and ClusterGAN in telephonic data.
Comment: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing.
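The prototypical loss used to fine-tune the ClusterGAN encoder can be illustrated with a small numpy sketch: each speaker's prototype is the mean of its support embeddings, and queries are softmax-classified by negative squared Euclidean distance. This is a generic prototypical-network loss, not the paper's exact training code; the function and variable names are our own.

```python
import numpy as np

def prototypical_loss(support, support_labels, query, query_labels):
    """Mean cross-entropy of queries classified against class prototypes
    (class means of support embeddings) via negative squared distance."""
    classes = np.unique(support_labels)
    protos = np.stack([support[support_labels == c].mean(axis=0) for c in classes])
    d2 = ((query[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)  # (Q, C)
    logits = -d2
    logits = logits - logits.max(axis=1, keepdims=True)                # stable softmax
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.searchsorted(classes, query_labels)
    return -log_prob[np.arange(len(query)), idx].mean()
```

Minimizing this loss pulls embeddings of the same speaker toward a shared prototype, which is the property the meta-learning stage exploits for rapid cross-domain generalization.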
Self-supervised Speaker Diarization
Over the last few years, deep learning has grown in popularity for speaker
verification, identification, and diarization. Inarguably, a significant part
of this success is due to the demonstrated effectiveness of their speaker
representations. These, however, are heavily dependent on large amounts of
annotated data and can be sensitive to new domains. This study proposes an
entirely unsupervised deep-learning model for speaker diarization.
Specifically, the study focuses on generating high-quality neural speaker
representations without any annotated data, as well as on estimating secondary
hyperparameters of the model without annotations.
The speaker embeddings are represented by an encoder trained in a
self-supervised fashion using pairs of adjacent segments assumed to be of the
same speaker. The trained encoder model is then used to self-generate
pseudo-labels to subsequently train a similarity score between different
segments of the same call using probabilistic linear discriminant analysis
(PLDA) and further to learn a clustering stopping threshold. We compared our
model to state-of-the-art unsupervised as well as supervised baselines on the
CallHome benchmarks. According to empirical results, our approach outperforms
unsupervised methods when only two speakers are present in the call, and is
only slightly worse than recent supervised models.
Comment: Submitted to Interspeech 202
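The self-supervision described above rests on the assumption that temporally adjacent segments of a call usually belong to the same speaker. A minimal sketch of how such positive training pairs might be mined (the gap threshold and function name are illustrative assumptions, not from the paper):

```python
def adjacent_positive_pairs(segments, max_gap=0.5):
    """Form positive training pairs from temporally adjacent speech
    segments, on the assumption that neighbors share a speaker.
    `segments` is a time-sorted list of (start, end) tuples; pairs whose
    silence gap exceeds max_gap seconds are skipped, since a long pause
    makes a speaker change more likely."""
    pairs = []
    for (s1, e1), (s2, e2) in zip(segments, segments[1:]):
        if s2 - e1 <= max_gap:
            pairs.append(((s1, e1), (s2, e2)))
    return pairs
```

Pairs mined this way are noisy labels, which is why the abstract describes them only as segments "assumed to be of the same speaker" before PLDA refinement.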