Supervised online diarization with sample mean loss for multi-domain data
Recently, a fully supervised speaker diarization approach was proposed
(UIS-RNN) which models speakers using multiple instances of a parameter-sharing
recurrent neural network. In this paper we propose qualitative modifications to
the model that significantly improve the learning efficiency and the overall
diarization performance. In particular, we introduce a novel loss function,
which we call Sample Mean Loss, and we present a better model of speaker turn
behaviour by devising an analytical expression to compute the probability of a
new speaker joining the conversation. In addition, we demonstrate that our
model can be trained on fixed-length speech segments, removing the need for
speaker change information in inference. Using x-vectors as input features, we
evaluate our proposed approach on the multi-domain dataset employed in the
DIHARD II challenge: our online method improves on the original
UIS-RNN and achieves performance similar to an offline agglomerative clustering
baseline using PLDA scoring.
Online End-to-End Neural Diarization with Speaker-Tracing Buffer
This paper proposes a novel online speaker diarization algorithm based on a
fully supervised self-attention mechanism (SA-EEND). Online diarization
inherently presents a speaker permutation problem due to the possibility of
assigning speaker regions incorrectly across the recording. To circumvent this
inconsistency, we propose a speaker-tracing buffer mechanism that selects
several input frames representing the speaker permutation information from
previous chunks and stores them in a buffer. These buffered frames are stacked
with the input frames in the current chunk and fed into a self-attention
network. Our method ensures consistent diarization outputs across the buffer
and the current chunk by checking the correlation between their corresponding
outputs. Additionally, we trained SA-EEND with variable chunk-sizes to mitigate
the mismatch between training and inference introduced by the speaker-tracing
buffer mechanism. In experiments, online SA-EEND trained with variable
chunk sizes achieved DERs of 12.54% for CALLHOME and 20.77% for CSJ with 1.4s
actual latency.
Comment: Accepted to SLT 202
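The consistency check described in this abstract (correlating outputs on the buffered frames to resolve the speaker permutation) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names `best_permutation` and `align_chunk`, the use of Pearson correlation summed over speakers, and the exhaustive search over permutations are all hypothetical choices made for this example.

```python
from itertools import permutations

import numpy as np


def _corr(a, b):
    """Pearson correlation of two 1-D arrays; 0.0 if either is constant."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def best_permutation(prev_buf, new_buf):
    """Find the column order of new_buf that best matches prev_buf.

    prev_buf: (frames, speakers) outputs stored for the buffered frames.
    new_buf:  (frames, speakers) outputs the network now gives for the
              same buffered frames, possibly with speakers permuted.
    Returns a tuple perm where new_buf[:, perm[i]] matches prev_buf[:, i].
    Assumption: exhaustive search, which is only practical for few speakers.
    """
    n_spk = prev_buf.shape[1]

    def score(perm):
        return sum(_corr(prev_buf[:, i], new_buf[:, j]) for i, j in enumerate(perm))

    return max(permutations(range(n_spk)), key=score)


def align_chunk(prev_buf, new_buf, new_chunk):
    """Reorder the current chunk's speaker columns to match the buffer."""
    perm = list(best_permutation(prev_buf, new_buf))
    return new_chunk[:, perm]
```

In this sketch the buffered frames act as an anchor: whichever column order makes the new outputs on those frames agree with the stored ones is applied to the current chunk as well.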
INTERSPEECH 2010 GMM-UBM based open-set online speaker diarization
In this paper, we present an open-set online speaker diarization system. The system is based on Gaussian mixture models (GMMs), which are used as speaker models. The system starts with just 3 such models (one for each gender and one for non-speech) and creates a model for an individual speaker only when that speaker first occurs. As more and more speakers appear, more models are created. Our system implicitly performs audio segmentation, speech/non-speech classification, gender recognition and speaker identification. The system is tested with the HUB4-1996 radio broadcast news database.
Index Terms: Speaker diarization, Gaussian mixture models, open-set speaker recognition
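The open-set behaviour described in this abstract (three initial models, with speaker models added as speakers appear) could look roughly like the sketch below. Several things here are assumptions not taken from the abstract: each model is a single diagonal Gaussian standing in for a full GMM, a new speaker is declared whenever a generic gender model scores best, and the new model is estimated directly from the segment rather than adapted from a background model. The names `DiagGaussian` and `OpenSetDiarizer` are hypothetical.

```python
import numpy as np


class DiagGaussian:
    """Single diagonal Gaussian: a simplified stand-in for a GMM."""

    def __init__(self, mean, var):
        self.mean = np.asarray(mean, dtype=float)
        self.var = np.asarray(var, dtype=float)

    def avg_loglik(self, frames):
        """Average per-frame log-likelihood of a (frames, dims) array."""
        diff = frames - self.mean
        ll = -0.5 * (np.log(2 * np.pi * self.var) + diff ** 2 / self.var).sum(axis=1)
        return float(ll.mean())


class OpenSetDiarizer:
    """Starts with male, female and non-speech models; grows over time."""

    def __init__(self, male, female, nonspeech):
        self.models = {"male": male, "female": female, "nonspeech": nonspeech}
        self.n_speakers = 0

    def process(self, frames):
        """Label one segment, creating a speaker model if needed."""
        frames = np.asarray(frames, dtype=float)
        best = max(self.models, key=lambda name: self.models[name].avg_loglik(frames))
        if best in ("male", "female"):
            # Assumed rule: a generic gender model winning means the
            # speaker is unseen, so a dedicated model is created.
            self.n_speakers += 1
            label = f"spk{self.n_speakers}"
            self.models[label] = DiagGaussian(
                frames.mean(axis=0), frames.var(axis=0) + 1e-3
            )
            return label
        # Either an existing speaker model or the non-speech model won.
        return best
```

Because every segment is scored against all current models, one pass yields a speech/non-speech decision, an implicit gender decision for unseen speakers, and a speaker label, which mirrors the "implicit" tasks the abstract lists.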