Attention-based Encoder-Decoder End-to-End Neural Diarization with Embedding Enhancer
Deep neural network-based systems have significantly improved the performance
of speaker diarization tasks. However, end-to-end neural diarization (EEND)
systems often struggle to generalize to scenarios with an unseen number of
speakers, while target speaker voice activity detection (TS-VAD) systems tend
to be overly complex. In this paper, we propose a simple attention-based
encoder-decoder network for end-to-end neural diarization (AED-EEND). In our
training process, we introduce a teacher-forcing strategy to address the
speaker permutation problem, leading to faster model convergence. For
evaluation, we propose an iterative decoding method that outputs diarization
results for each speaker sequentially. Additionally, we propose an Enhancer
module to enhance the frame-level speaker embeddings, enabling the model to
handle scenarios with an unseen number of speakers. We also explore replacing
the transformer encoder with a Conformer architecture, which better models
local information. Furthermore, we discovered that commonly used simulation
datasets for speaker diarization have a much higher overlap ratio compared to
real data. We found that using simulated training data that is more consistent
with real data can achieve an improvement in consistency. Extensive
experimental validation demonstrates the effectiveness of our proposed
methodologies. Our best system achieved a new state-of-the-art diarization
error rate (DER) performance on all the CALLHOME (10.08%), DIHARD II (24.64%),
and AMI (13.00%) evaluation benchmarks, when no oracle voice activity detection
(VAD) is used. Beyond speaker diarization, our AED-EEND system also shows
remarkable competitiveness as a speech type detection model.
Comment: IEEE/ACM Transactions on Audio Speech and Language Processing, under review
Improving speaker turn embedding by crossmodal transfer learning from face embedding
Learning speaker turn embeddings has shown considerable improvement in
situations where conventional speaker modeling approaches fail. However, this
improvement is relatively limited when compared to the gain observed in face
embedding learning, which has been proven very successful for face verification
and clustering tasks. Assuming that faces and voices of the same identity
share some latent properties (like age, gender, ethnicity), we propose three
transfer learning approaches to leverage the knowledge from the face domain
(learned from thousands of images and identities) for tasks in the speaker
domain. These approaches, namely target embedding transfer, relative distance
transfer, and clustering structure transfer, utilize the structure of the
source face embedding space at different granularities to regularize the target
speaker turn embedding space as optimizing terms. Our methods are evaluated on
two public broadcast corpora and yield promising advances over competitive
baselines in verification and audio clustering tasks, especially when dealing
with short speaker utterances. The analysis of the results also gives insight
into characteristics of the embedding spaces and shows their potential
applications.