I4U Submission to NIST SRE 2018: Leveraging from a Decade of Shared Experiences
The I4U consortium was established to facilitate joint entries to the NIST
speaker recognition evaluations (SRE). The latest such joint submission was to
SRE 2018, in which the I4U entry was among the best-performing systems. SRE'18
also marks the 10-year anniversary of the I4U consortium's participation in the
NIST SRE series of evaluations. The primary objective of this paper is to
summarize the results and lessons learned from the twelve sub-systems and their
fusion submitted to SRE'18. We also present a shared view of the advancements,
progress, and major paradigm shifts that we have witnessed as SRE participants
over the past decade, from SRE'08 to SRE'18. In this regard, we have seen,
among others, a paradigm shift from supervector representations to deep speaker
embeddings, and a switch of research challenge from channel compensation to
domain adaptation.
Comment: 5 page
LSTM based Similarity Measurement with Spectral Clustering for Speaker Diarization
A growing number of neural network approaches have achieved considerable
improvements in submodules of speaker diarization systems, including speaker
change detection and segment-wise speaker embedding extraction. In the
clustering stage, however, traditional algorithms such as probabilistic linear
discriminant analysis (PLDA) are still widely used to score the similarity
between two speech segments. In this paper, we propose a supervised method
that measures the similarity matrix over all segments of an audio recording
with bidirectional long short-term memory networks (Bi-LSTM). Spectral
clustering is then applied on top of the similarity matrix to further improve
performance. Experimental results show that our system significantly
outperforms state-of-the-art methods, achieving a diarization error rate of
6.63% on the NIST SRE 2000 CALLHOME database.
Comment: Accepted for INTERSPEECH 201
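The clustering stage described above can be illustrated with a minimal sketch. This is a toy two-way spectral clustering over a precomputed segment-similarity matrix (as the similarity scores from the Bi-LSTM would be), using the sign of the Fiedler vector of the normalized graph Laplacian; it is an assumption-laden illustration of the spectral-clustering step, not the paper's Bi-LSTM scoring model, and the similarity values are made up.

```python
import numpy as np

def spectral_labels_2way(similarity):
    """Two-way spectral clustering of a symmetric similarity matrix:
    split segments by the sign of the Fiedler vector (eigenvector of the
    second-smallest eigenvalue) of the normalized graph Laplacian
    L = I - D^{-1/2} S D^{-1/2}."""
    d = similarity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    lap = (np.eye(len(similarity))
           - d_inv_sqrt[:, None] * similarity * d_inv_sqrt[None, :])
    _, eigvecs = np.linalg.eigh(lap)  # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]
    return (fiedler > 0).astype(int)

# Toy similarity matrix for 4 segments: segments 0-1 and 2-3 are
# mutually similar, suggesting two speakers (values are illustrative).
sim = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])
labels = spectral_labels_2way(sim)
```

For more than two speakers, one would instead take the first k eigenvectors and run k-means on the rows, as standard spectral clustering does.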
Deep Self-Supervised Hierarchical Clustering for Speaker Diarization
State-of-the-art speaker diarization systems use agglomerative hierarchical
clustering (AHC), which clusters previously learned neural embeddings. While
this clustering approach attempts to identify speaker clusters, the AHC
algorithm itself involves no further learning. In this paper, we propose a
novel hierarchical clustering algorithm that combines speaker clustering with
a representation learning framework. The proposed approach is based on
principles of self-supervised learning, where the self-supervision is derived
from the clustering algorithm. The representation learning network is trained
with a regularized triplet loss using the clustering solution at the current
step, while the clustering algorithm uses the deep embeddings from the
representation learning step. By combining self-supervised representation
learning with the clustering algorithm, we show that the proposed method
improves significantly (29% relative improvement) over the AHC algorithm with
cosine similarity on a speaker diarization task on the CALLHOME dataset. In
addition, the proposed approach improves over the state-of-the-art system with
a PLDA affinity matrix, with a 10% relative improvement in DER.
Comment: 5 pages, Accepted in Interspeech 202
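The triplet loss at the core of the representation learning step can be sketched as follows. In the paper's setting, the anchor and positive would come from the same cluster of the current clustering solution and the negative from a different cluster; this minimal numpy version shows only the margin-based loss itself, with the regularization term and the clustering-derived triplet mining omitted, and all embeddings and the margin value invented for illustration.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss on embeddings: penalize triplets where
    the anchor-positive distance is not smaller than the anchor-negative
    distance by at least `margin`:
        max(0, ||a - p||^2 - ||a - n||^2 + margin)."""
    d_ap = np.sum((anchor - positive) ** 2, axis=-1)
    d_an = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, d_ap - d_an + margin)

# Illustrative embeddings: a satisfied triplet yields zero loss, while a
# violating triplet (positive and negative swapped) yields a positive loss.
a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
n = np.array([0.0, 1.0])
easy = triplet_loss(a, p, n)  # already satisfies the margin
hard = triplet_loss(a, n, p)  # violates the margin
```

Minimizing this loss pulls same-cluster embeddings together and pushes different-cluster embeddings apart, which in turn sharpens the affinities used by the next clustering step.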
Attention-based Encoder-Decoder Network for End-to-End Neural Speaker Diarization with Target Speaker Attractor
This paper proposes a novel Attention-based Encoder-Decoder network for
End-to-End Neural speaker Diarization (AED-EEND). In the AED-EEND system, we
incorporate the target-speaker enrollment information used in target-speaker
voice activity detection (TS-VAD) to calculate the attractor, which mitigates
the speaker permutation problem and facilitates easier model convergence.
During training, we propose a teacher-forcing strategy that obtains the
enrollment information from the ground-truth labels. Furthermore, we propose
three heuristic decoding methods to identify the enrollment area for each
speaker during evaluation. Additionally, we enhance the LSTM-based attractor
calculation network used in the end-to-end encoder-decoder based attractor
calculation (EEND-EDA) system by replacing it with an attention-based model.
With this attention-based attractor decoder, our proposed AED-EEND system
outperforms both the EEND-EDA and TS-VAD systems with only 0.5 s of
enrollment data.
Comment: Accepted by InterSpeech 202
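The attention idea behind such an attractor decoder can be sketched minimally: an enrollment-derived query attends over frame embeddings, and the attention-weighted sum serves as the speaker attractor. This is a single-query scaled dot-product attention in numpy under invented dimensions and values, not the paper's actual AED-EEND architecture (which uses a full encoder-decoder network).

```python
import numpy as np

def attention_attractor(query, frames):
    """Scaled dot-product attention with a single query: the enrollment
    embedding attends over frame embeddings, and the softmax-weighted
    sum of the frames is returned as the speaker attractor."""
    scores = frames @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ frames

# Illustrative frame embeddings: frames 0 and 2 resemble the enrollment
# query, so the attractor should lean toward their direction.
frames = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
])
query = np.array([1.0, 0.0, 0.0])
attractor = attention_attractor(query, frames)
```

Because the attractor is a convex combination of frame embeddings weighted by their similarity to the enrollment query, frames matching the enrolled speaker dominate the result.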