Online speaker diarization of meetings guided by speech separation
Overlapped speech is notoriously problematic for speaker diarization systems.
Consequently, the use of speech separation has recently been proposed to
improve their performance. Although promising, speech separation models
struggle with realistic data because they are trained on simulated mixtures
with a fixed number of speakers. In this work, we introduce a new speech
separation-guided diarization scheme suitable for the online speaker
diarization of long meeting recordings with a variable number of speakers, as
present in the AMI corpus. We consider ConvTasNet and DPRNN as candidate
separation networks, each with two or three output sources. To obtain the
speaker diarization result, voice activity detection is applied on each
estimated source. The final model is fine-tuned end-to-end, after first
adapting the separation to real data using AMI. The system operates on short
segments, and inference is performed by stitching the local predictions using
speaker embeddings and incremental clustering. The results show that our system
improves the state-of-the-art on the AMI headset mix, using no oracle
information and under full evaluation (no collar and including overlapped
speech). Finally, we show that our system is particularly strong on overlapped
speech sections.
Comment: Accepted at ICASSP 202
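The inference procedure described above, local predictions stitched together with speaker embeddings and incremental clustering, can be sketched roughly as follows. The cosine threshold, the running-mean centroid update, and the toy 2-D embeddings are illustrative assumptions, not the paper's exact settings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def incremental_cluster(embeddings, threshold=0.7):
    """Online stitching: assign each local segment embedding to the closest
    existing speaker centroid, or open a new speaker when no centroid is
    similar enough. Centroids are updated as running means."""
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        if centroids:
            sims = [cosine(emb, c) for c in centroids]
            best = max(range(len(sims)), key=sims.__getitem__)
        if not centroids or sims[best] < threshold:
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            n = counts[best]
            centroids[best] = [(c * n + e) / (n + 1)
                               for c, e in zip(centroids[best], emb)]
            counts[best] += 1
            labels.append(best)
    return labels

# Toy 2-D embeddings standing in for real speaker embeddings.
labels = incremental_cluster([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [1, 0.05]])
# → [0, 0, 1, 1, 0]
```

Because assignment is greedy and centroids only accumulate, this runs online over a stream of segments, matching the long-recording setting the abstract targets.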
LSTM based Similarity Measurement with Spectral Clustering for Speaker Diarization
A growing number of neural network approaches have achieved considerable
improvements on submodules of the speaker diarization system, including speaker
change detection and segment-wise speaker embedding extraction. Still, in the
clustering stage, traditional algorithms like probabilistic linear discriminant
analysis (PLDA) are widely used for scoring the similarity between two speech
segments. In this paper, we propose a supervised method to measure the
similarity matrix between all segments of an audio recording with sequential
bidirectional long short-term memory networks (Bi-LSTM). Spectral clustering is
applied on top of the similarity matrix to further improve the performance.
Experimental results show that our system significantly outperforms the
state-of-the-art methods and achieves a diarization error rate of 6.63% on the
NIST SRE 2000 CALLHOME database.
Comment: Accepted for INTERSPEECH 201
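The pipeline in this abstract, a learned similarity matrix followed by spectral clustering, can be illustrated with a minimal two-speaker sketch. Here a hand-written matrix stands in for the Bi-LSTM similarity scores, and the clustering uses simple spectral bisection via the Fiedler vector; the paper's spectral clustering details may differ:

```python
import numpy as np

def spectral_bisect(S):
    """Two-speaker spectral clustering: split segments by the sign of the
    Fiedler vector (eigenvector of the second-smallest eigenvalue) of the
    graph Laplacian L = D - S."""
    D = np.diag(S.sum(axis=1))
    L = D - S
    _, vecs = np.linalg.eigh(L)   # eigenpairs in ascending eigenvalue order
    fiedler = vecs[:, 1]
    return (fiedler > 0).astype(int)

# Stand-in similarity matrix for four segments; in the actual system a
# trained Bi-LSTM would produce these pairwise scores.
S = np.array([
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
])
labels = spectral_bisect(S)   # groups segments {0, 1} against {2, 3}
```

For more than two speakers one would instead embed the segments with the first k eigenvectors and run k-means on the rows, the standard generalisation of this bisection.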
Language modelling for speaker diarization in telephonic interviews
The aim of this paper is to investigate the benefit of combining language and acoustic modelling for speaker diarization. Although conventional systems use only acoustic features, in some scenarios linguistic data carry highly discriminative speaker information, at times even more reliable than the acoustic cues. In this study we analyze how an appropriate fusion of both kinds of features can obtain good results in these cases. The proposed system is based on an iterative algorithm in which an LSTM network is used as a speaker classifier. The network is fed with character-level word embeddings and a GMM-based acoustic score created with the output labels from previous iterations. The presented algorithm has been evaluated on a call-center database composed of telephone interview audios. The combination of acoustic features and linguistic content shows an 84.29% improvement in terms of word-level DER compared to an HMM/VB baseline system. The results of this study confirm that linguistic content can be used efficiently for some speaker recognition tasks.
This work was partially supported by the Spanish Project DeepVoice (TEC2015-69266-P) and by the project PID2019-107579RBI00/ AEI /10.13039/501100011033.
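A minimal sketch of two ingredients this abstract relies on: score-level fusion of linguistic and acoustic evidence, and a word-level error rate. The fusion weight, the dictionary-based log-scores, and both helper names are illustrative assumptions, not the paper's formulation:

```python
def fuse_scores(ling_scores, acou_scores, alpha=0.5):
    """Pick the speaker with the best weighted combination of per-speaker
    linguistic and acoustic log-scores (alpha weights the linguistic stream)."""
    fused = {spk: alpha * ling_scores[spk] + (1 - alpha) * acou_scores[spk]
             for spk in ling_scores}
    return max(fused, key=fused.get)

def word_level_der(ref, hyp):
    """Word-level diarization error rate: fraction of words whose
    hypothesised speaker differs from the reference."""
    return sum(r != h for r, h in zip(ref, hyp)) / len(ref)

# Linguistic evidence favours "agent"; the acoustics disagree, but with
# alpha = 0.7 the fused score still picks "agent":
# 0.7*(-1.0) + 0.3*(-4.0) = -1.9  vs.  0.7*(-3.0) + 0.3*(-2.0) = -2.7
speaker = fuse_scores({"agent": -1.0, "caller": -3.0},
                      {"agent": -4.0, "caller": -2.0}, alpha=0.7)
der = word_level_der(["a", "a", "b", "b"], ["a", "b", "b", "b"])   # → 0.25
```

In the paper's iterative scheme the acoustic scores themselves are re-estimated from the previous iteration's labels, so fusion and relabelling alternate until the assignment stabilises.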
Speaker diarisation and longitudinal linking in multi-genre broadcast data
This paper presents a multi-stage speaker diarisation system with longitudinal linking, developed on BBC multi-genre data for the 2015 Multi-Genre Broadcast (MGB) challenge. The basic speaker diarisation system draws on techniques from the Cambridge March 2005 system with a new deep neural network (DNN)-based speech/non-speech segmenter. A newly developed linking stage is then added to the basic diarisation output, aiming to identify speakers across multiple episodes of the same series. The longitudinal constraint imposes incremental processing of the episodes, where speaker labels for each episode can be obtained using only material from the episode in question and those broadcast earlier in time. The nature of the data, as well as the longitudinal linking constraint, positions this diarisation task as a new and particularly challenging open research topic. Different linking clustering metrics are compared, and the lowest within-episode and cross-episode DER scores are achieved on the MGB challenge evaluation set.
This work is in part supported by EPSRC Programme Grant EP/I031022/1 (Natural Speech Technology). C. Zhang is also supported by a Cambridge International Scholarship from the Cambridge Commonwealth, European & International Trust.
This is the author accepted manuscript. The final version is available from IEEE via http://dx.doi.org/10.1109/ASRU.2015.740485
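The incremental cross-episode linking constraint described above can be sketched as follows. The single-embedding speaker model, the cosine threshold, and the latest-embedding update are simplifying assumptions for illustration, not the paper's linking clustering metrics:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def link_episodes(episodes, threshold=0.8):
    """Longitudinal linking under the incremental constraint: episodes are
    processed in broadcast order, and each within-episode speaker is linked
    to the most similar previously seen global speaker, or assigned a fresh
    global label when no earlier speaker is similar enough."""
    global_models = {}            # global label -> latest embedding
    next_label = 0
    linked = []
    for episode in episodes:      # episode: list of (local_speaker, embedding)
        mapping = {}
        for local, emb in episode:
            best, best_sim = None, threshold
            for g, model in global_models.items():
                sim = cosine(emb, model)
                if sim > best_sim:
                    best, best_sim = g, sim
            if best is None:      # unseen speaker: open a new global identity
                best = next_label
                next_label += 1
            global_models[best] = emb
            mapping[local] = best
        linked.append(mapping)
    return linked

# Two toy episodes: speaker "A" recurs, "C" appears only in the second.
linked = link_episodes([
    [("A", [1, 0]), ("B", [0, 1])],
    [("A", [0.95, 0.05]), ("C", [0.5, 0.5])],
])
# → [{'A': 0, 'B': 1}, {'A': 0, 'C': 2}]
```

Note that only earlier episodes ever influence a label, which is exactly the causality constraint the task imposes: relabelling a past episode in light of a later one is not allowed.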