Language modelling for speaker diarization in telephonic interviews
The aim of this paper is to investigate the benefit of combining language and acoustic modelling for speaker diarization. Although conventional systems use only acoustic features, in some scenarios linguistic data contain highly discriminative speaker information, sometimes more reliable than the acoustic features. In this study we analyze how an appropriate fusion of both kinds of features can obtain good results in these cases. The proposed system is based on an iterative algorithm in which an LSTM network is used as a speaker classifier. The network is fed with character-level word embeddings and a GMM-based acoustic score created with the output labels from previous iterations. The presented algorithm has been evaluated on a Call-Center database composed of telephone interview audio. The combination of acoustic features and linguistic content shows an 84.29% improvement in terms of word-level DER compared to an HMM/VB baseline system. The results of this study confirm that linguistic content can be efficiently used for some speaker recognition tasks.

This work was partially supported by the Spanish project DeepVoice (TEC2015-69266-P) and by the project PID2019-107579RB-I00/AEI/10.13039/501100011033.

Peer Reviewed. Postprint (published version).
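The core idea of the fusion described above can be sketched as a per-word weighted combination of a linguistic speaker score and an acoustic score, picking the speaker that maximizes the combined score. This is a minimal illustrative sketch, not the paper's implementation; the function name, score representation, and the interpolation weight `alpha` are all assumptions.

```python
# Hedged sketch of score fusion for speaker labeling: combine a linguistic
# log-score with an acoustic (e.g. GMM-style) log-score per candidate
# speaker and pick the argmax. All names and the weight are illustrative.

def fuse_scores(linguistic_scores, acoustic_scores, alpha=0.5):
    """Return the speaker label maximizing a weighted fusion of two scores.

    linguistic_scores / acoustic_scores: dicts mapping speaker -> log-score.
    alpha: interpolation weight (assumed; a real system would tune this).
    """
    speakers = linguistic_scores.keys() & acoustic_scores.keys()
    return max(
        speakers,
        key=lambda s: alpha * linguistic_scores[s]
        + (1 - alpha) * acoustic_scores[s],
    )

# Example: linguistic evidence strongly favors "B", acoustics mildly favor "A";
# with equal weighting the fused decision is "B".
label = fuse_scores({"A": -2.0, "B": -0.5}, {"A": -0.8, "B": -1.0})
print(label)  # B
```

In an iterative scheme like the one described, such fused labels from one pass would be used to re-estimate the acoustic models for the next pass.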
Lexical Speaker Error Correction: Leveraging Language Models for Speaker Diarization Error Correction
Speaker diarization (SD) is typically used with an automatic speech
recognition (ASR) system to ascribe speaker labels to recognized words. The
conventional approach reconciles outputs from independently optimized ASR and
SD systems, where the SD system typically uses only acoustic information to
identify the speakers in the audio stream. This approach can lead to speaker
errors especially around speaker turns and regions of speaker overlap. In this
paper, we propose a novel second-pass speaker error correction system using
lexical information, leveraging the power of modern language models (LMs). Our
experiments across multiple telephony datasets show that our approach is both
effective and robust. Training and tuning only on the Fisher dataset, this
error correction approach leads to relative word-level diarization error rate
(WDER) reductions of 15-30% on three telephony datasets: RT03-CTS, Callhome
American English, and held-out portions of Fisher.

Comment: Accepted at INTERSPEECH 202
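The word-level diarization error rate (WDER) reported by both papers above counts the fraction of recognized words assigned the wrong speaker label. A minimal sketch, assuming the reference and hypothesis word sequences are already aligned one-to-one (real scoring also handles insertions and deletions from the ASR alignment):

```python
# Simplified WDER: fraction of aligned words whose hypothesis speaker
# label disagrees with the reference. Assumes a one-to-one alignment;
# function and argument names are illustrative, not from either paper.

def wder(ref_speakers, hyp_speakers):
    """Fraction of aligned words carrying the wrong speaker label."""
    assert len(ref_speakers) == len(hyp_speakers), "sequences must be aligned"
    errors = sum(r != h for r, h in zip(ref_speakers, hyp_speakers))
    return errors / len(ref_speakers)

# Example: one of four words has the wrong speaker label -> WDER = 0.25.
print(wder(["A", "A", "B", "B"], ["A", "A", "A", "B"]))  # 0.25
```

A relative reduction of 15-30%, as reported, would mean e.g. a WDER of 0.10 dropping to between 0.07 and 0.085 after the second-pass correction.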