3 research outputs found
Robust Feature Clustering for Unsupervised Speech Activity Detection
In certain applications such as zero-resource speech processing or very-low
resource speech-language systems, it might not be feasible to collect speech
activity detection (SAD) annotations. However, the state-of-the-art supervised
SAD techniques based on neural networks or other machine learning methods
require annotated training data matched to the target domain. This paper
establishes a clustering approach for fully unsupervised SAD, useful for cases
where SAD annotations are not available. The proposed approach leverages the
Hartigan dip test in a recursive strategy for segmenting the feature space into
prominent modes. The statistical dip is invariant to distortions, which lends
robustness to the proposed method. We evaluate the method on the NIST OpenSAD 2015
and NIST OpenSAT 2017 public safety communications data. The results show the
superiority of the proposed approach over the two-component GMM baseline. Index
Terms: Clustering, Hartigan dip test, NIST OpenSAD, NIST OpenSAT, speech
activity detection, zero-resource speech processing, unsupervised learning.
Comment: 5 Pages, 4 Tables, 1 Figure
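As a concrete point of reference, the two-component GMM baseline that the paper compares against can be sketched in pure NumPy: fit a two-component 1-D Gaussian mixture to per-frame log-energies with EM and call the higher-mean component "speech". The function name, EM details, and synthetic energies below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def gmm2_sad(energies, n_iter=50):
    """Unsupervised SAD baseline: fit a two-component 1-D Gaussian mixture
    to frame log-energies with EM; frames from the higher-mean component
    are labeled as speech."""
    x = np.asarray(energies, dtype=float)
    # Initialize the component means at the 25th/75th percentiles.
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    var = np.full(2, x.var() + 1e-6)
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: per-frame responsibilities under each Gaussian.
        p = w * np.exp(-((x[:, None] - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = p / p.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances.
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return r.argmax(axis=1) == int(np.argmax(mu))  # True = speech frame

# Synthetic log-energies: quiet noise frames vs. louder speech frames.
rng = np.random.default_rng(0)
energies = np.concatenate([rng.normal(-8.0, 1.0, 200),  # noise
                           rng.normal(0.0, 1.0, 100)])  # speech
labels = gmm2_sad(energies)
```

Because such a baseline models only the marginal energy distribution, it is sensitive to the kinds of channel distortions that the dip-test approach is designed to be robust against.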
Robust Speaker Clustering using Mixtures of von Mises-Fisher Distributions for Naturalistic Audio Streams
Speaker Diarization (i.e. determining who spoke and when?) for multi-speaker
naturalistic interactions such as Peer-Led Team Learning (PLTL) sessions is a
challenging task. In this study, we propose robust speaker clustering based on
a mixture of multivariate von Mises-Fisher distributions. Our diarization
pipeline has two stages: (i) ground-truth segmentation; (ii) proposed speaker
clustering. The ground-truth speech activity information is used for extracting
i-Vectors from each speech segment. We post-process the i-Vectors with principal
component analysis for dimension reduction, followed by length normalization.
Normalized i-Vectors are high-dimensional unit vectors possessing
discriminative directional characteristics. We model the normalized i-Vectors
with a mixture model consisting of multivariate von Mises-Fisher distributions.
K-means clustering with cosine distance is chosen as the baseline approach. The
evaluation data is derived from: (i) the CRSS-PLTL corpus; and (ii) a
three-meeting subset of the AMI corpus. The CRSS-PLTL data contain audio
recordings of PLTL sessions, a student-led STEM education paradigm. The
proposed approach is consistently better than the baseline, leading to up to
44.48% and 53.68% relative improvements for the PLTL and AMI corpora,
respectively. Index Terms: Speaker
clustering, von Mises-Fisher distribution, Peer-led team learning, i-Vector,
Naturalistic Audio.
Comment: 5 pages, 2 figures
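The cosine K-means baseline in this abstract can be sketched as spherical K-means in NumPy: length-normalize the vectors, assign each to the centroid with the highest cosine similarity, and re-normalize the centroid means. The farthest-point initialization and the synthetic directional clusters standing in for i-Vectors are illustrative assumptions:

```python
import numpy as np

def cosine_kmeans(X, k, n_iter=30):
    """Spherical (cosine-distance) K-means over length-normalized rows of X:
    assign each row to the most similar centroid, then re-normalize the
    centroid means back onto the unit sphere."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Deterministic farthest-point initialization on the unit sphere.
    C = [Xn[0]]
    for _ in range(1, k):
        sims = np.max(Xn @ np.array(C).T, axis=1)
        C.append(Xn[np.argmin(sims)])
    C = np.array(C)
    for _ in range(n_iter):
        labels = (Xn @ C.T).argmax(axis=1)  # max cosine similarity
        for j in range(k):
            members = Xn[labels == j]
            if len(members):
                m = members.sum(axis=0)
                C[j] = m / np.linalg.norm(m)  # re-normalized mean direction
    return labels, C

# Two well-separated directional clusters standing in for i-Vectors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([5.0, 0.0, 0.0], 0.3, (40, 3)),
               rng.normal([0.0, 5.0, 0.0], 0.3, (40, 3))])
labels, _ = cosine_kmeans(X, 2)
```

A von Mises-Fisher mixture generalizes this baseline: the re-normalized mean direction plays the role of the vMF mean parameter, while the mixture additionally models a per-component concentration.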
Toeplitz Inverse Covariance based Robust Speaker Clustering for Naturalistic Audio Streams
Speaker diarization determines "who spoke and when?" in an audio stream. In
this study, we propose a model-based approach for robust speaker clustering
using i-vectors. The i-vectors extracted from different segments of the same speaker
are correlated. We model this correlation with a Markov Random Field (MRF)
network. Leveraging the advancements in MRF modeling, we use a Toeplitz Inverse
Covariance (TIC) matrix to represent the MRF correlation network for each
speaker. This approach captures the sequential structure of i-vectors (or,
equivalently, speaker turns) belonging to the same speaker in an audio stream. A
variant of the standard Expectation Maximization (EM) algorithm is adopted to
derive a closed-form solution using dynamic programming (DP) and the
alternating direction method of multipliers (ADMM). Our diarization system has
four steps: (1) ground-truth segmentation; (2) i-vector extraction; (3)
post-processing (mean subtraction, principal component analysis, and
length-normalization); and (4) the proposed speaker clustering. We employ cosine
K-means and movMF speaker clustering as baseline approaches. Our evaluation
data is derived from: (i) the CRSS-PLTL corpus, and (ii) a two-meeting subset
of the AMI corpus. The relative reduction in diarization error rate (DER) for
the CRSS-PLTL corpus is 43.22% using the proposed advancements compared to the
baseline. For AMI meetings IS1000a and IS1003b, the relative DER reductions are
29.37% and 9.21%,
respectively.
Comment: 6 Pages, 3 Figures, 5 Equations
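Step (3) of the pipeline above — mean subtraction, PCA for dimension reduction, and length normalization — can be sketched in NumPy via the SVD of the centered i-vector matrix. The function name, target dimension, and random stand-in data are illustrative assumptions:

```python
import numpy as np

def postprocess_ivectors(X, dim):
    """Post-processing step (3): subtract the global mean, project onto the
    top-`dim` principal components (via SVD of the centered matrix), then
    length-normalize each row onto the unit hypersphere."""
    Xc = X - X.mean(axis=0)                       # mean subtraction
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Y = Xc @ Vt[:dim].T                           # PCA projection
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

# Random vectors standing in for raw i-vectors.
rng = np.random.default_rng(2)
Y = postprocess_ivectors(rng.normal(size=(50, 20)), dim=5)
```

After this step every i-vector is a unit vector, which is what makes both the cosine K-means/movMF baselines and the TIC-based correlation modeling over speaker turns well-posed.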