Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach
Large language models (LLMs) have shown great promise for capturing
contextual information in natural language processing tasks. We propose a novel
approach to speaker diarization that incorporates the prowess of LLMs to
exploit contextual cues in human dialogues. Our method builds upon an
acoustic-based speaker diarization system by adding lexical information from an
LLM in the inference stage. We model the multi-modal decoding process
probabilistically and perform joint acoustic and lexical beam search to
incorporate cues from both modalities: audio and text. Our experiments
demonstrate that infusing lexical knowledge from the LLM into an acoustics-only
diarization system improves overall speaker-attributed word error rate
(SA-WER). The experimental results show that LLMs can provide complementary
information to acoustic models for the speaker diarization task via the proposed
beam search decoding approach, achieving up to a 39.8% relative delta-SA-WER
improvement over the baseline system. Thus, we substantiate that the proposed
technique can exploit contextual information that is inaccessible to
acoustics-only systems, whose speaker representation is limited to speaker
embeddings. In addition,
these findings point to the potential of using LLMs to improve speaker
diarization and other speech processing tasks by capturing semantic and
contextual cues.
Comment: 4 pages, 1 reference page, ICASSP format
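The joint acoustic-lexical decoding idea can be sketched as a beam search whose per-word score interpolates an acoustic speaker posterior with a lexical score. The sketch below is a minimal illustration, not the paper's system: the `lex_score` callback, the interpolation weight `lam`, and any probabilities fed to it are assumed stand-ins (in the paper, the lexical cues come from an LLM).

```python
import math

def joint_beam_search(words, acoustic_post, lex_score,
                      num_speakers=2, beam_width=4, lam=0.7):
    """Jointly decode speaker labels from acoustic and lexical cues.

    words         : list of recognized words
    acoustic_post : per-word speaker posteriors, acoustic_post[t][spk]
    lex_score     : callback returning a log-score for a label prefix
                    given the word prefix (stand-in for an LLM)
    """
    # Each hypothesis: (tuple of per-word speaker labels, cumulative log score)
    beams = [((), 0.0)]
    for t in range(len(words)):
        candidates = []
        for labels, score in beams:
            for spk in range(num_speakers):
                new = labels + (spk,)
                s = (score
                     + lam * math.log(acoustic_post[t][spk])       # acoustic cue
                     + (1 - lam) * lex_score(words[:t + 1], new))  # lexical cue
                candidates.append((new, s))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # prune to the beam width
    return beams[0][0]
```

A toy `lex_score` might, for instance, penalize speaker changes that do not follow a plausible turn boundary; combined with the acoustic posteriors, the search then settles on label sequences consistent with both modalities.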
Speaker Diarization with Lexical Information
This work presents a novel approach for speaker diarization to leverage
lexical information provided by automatic speech recognition. We propose a
speaker diarization system that can incorporate word-level speaker turn
probabilities with speaker embeddings into a speaker clustering process to
improve the overall diarization accuracy. To integrate lexical and acoustic
information in a comprehensive way during clustering, we introduce an adjacency
matrix integration for spectral clustering. Since words and word boundary
information for word-level speaker turn probability estimation are provided by
a speech recognition system, our proposed method works without any human
intervention for manual transcriptions. We show that the proposed method
improves diarization performance on various evaluation datasets compared to a
baseline diarization system that uses only the acoustic information carried by
speaker embeddings.
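The adjacency-matrix integration described above can be sketched as a convex combination of an acoustic affinity matrix and a lexically derived one, followed by spectral clustering. This is a minimal two-speaker sketch with an assumed mixing weight `alpha`, not the paper's exact formulation; for two clusters, the sign of the Fiedler vector of the graph Laplacian is enough to bipartition the segments.

```python
import numpy as np

def combine_affinities(acoustic_aff, lexical_aff, alpha=0.5):
    # Element-wise convex combination of the two adjacency matrices;
    # alpha is an assumed tuning weight between modalities
    return alpha * acoustic_aff + (1 - alpha) * lexical_aff

def fiedler_bipartition(affinity):
    # Two-way spectral clustering: split segments by the sign of the
    # eigenvector associated with the second-smallest Laplacian eigenvalue
    degree = affinity.sum(axis=1)
    laplacian = np.diag(degree) - affinity
    _, vecs = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    return (fiedler > 0).astype(int)
```

Here the lexical affinity would be derived from the word-level speaker turn probabilities (low affinity across a likely turn), so that a turn cue sharpens a block structure the acoustic affinities only weakly suggest.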
Linguistically Aided Speaker Diarization Using Speaker Role Information
Speaker diarization relies on the assumption that speech segments
corresponding to a particular speaker are concentrated in a specific region of
the speaker space; a region which represents that speaker's identity. These
identities are not known a priori, so a clustering algorithm is typically
employed, which is traditionally based solely on audio. Under noisy conditions,
however, such an approach poses the risk of generating unreliable speaker
clusters. In this work we aim to utilize linguistic information as a
supplemental modality to identify the various speakers in a more robust way. We
are focused on conversational scenarios where the speakers assume distinct
roles and are expected to follow different linguistic patterns. This distinct
linguistic variability can be exploited to help us construct the speaker
identities. That way, we are able to boost the diarization performance by
converting the clustering task to a classification one. The proposed method is
applied to real-world dyadic psychotherapy interactions between a provider and
a patient and is demonstrated to improve results.
Comment: from v1: restructured Introduction and Background, added experimental
results with ASR text and a language-only baseline
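Converting the clustering task to a classification one, as proposed above, can be illustrated with a deliberately simple role classifier: score each utterance under a per-role unigram language model and assign the role with the higher likelihood. The lexicons below are invented placeholders standing in for models actually trained on provider and patient speech.

```python
import math
from collections import Counter

# Hypothetical role lexicons: stand-ins for per-role language models
ROLE_UNIGRAMS = {
    "provider": Counter({"how": 5, "feel": 4, "describe": 3, "tell": 3, "me": 2}),
    "patient":  Counter({"i": 6, "anxious": 3, "week": 2, "felt": 3, "sleep": 2}),
}

def classify_role(utterance, smoothing=1.0):
    # Assign the role whose smoothed unigram model gives the
    # utterance the higher log-likelihood
    words = utterance.lower().split()
    best_role, best_ll = None, -math.inf
    for role, counts in ROLE_UNIGRAMS.items():
        total = sum(counts.values()) + smoothing * len(counts)
        ll = sum(math.log((counts[w] + smoothing) / total) for w in words)
        if ll > best_ll:
            best_role, best_ll = role, ll
    return best_role
```

Because the role label directly determines the speaker identity in a dyadic session, this classification replaces the unreliable audio-only clustering step under noisy conditions.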
Lexical Speaker Error Correction: Leveraging Language Models for Speaker Diarization Error Correction
Speaker diarization (SD) is typically used with an automatic speech
recognition (ASR) system to ascribe speaker labels to recognized words. The
conventional approach reconciles outputs from independently optimized ASR and
SD systems, where the SD system typically uses only acoustic information to
identify the speakers in the audio stream. This approach can lead to speaker
errors especially around speaker turns and regions of speaker overlap. In this
paper, we propose a novel second-pass speaker error correction system using
lexical information, leveraging the power of modern language models (LMs). Our
experiments across multiple telephony datasets show that our approach is both
effective and robust. Training and tuning only on the Fisher dataset, this
error correction approach leads to relative word-level diarization error rate
(WDER) reductions of 15-30% on three telephony datasets: RT03-CTS, Callhome
American English, and held-out portions of Fisher.
Comment: Accepted at INTERSPEECH 202
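A second-pass correction of the kind described can be sketched as a search over alternative speaker-change points near a hypothesized turn, keeping the word-to-speaker assignment a language model scores highest. The `lm_score` below is a hand-crafted stand-in for the modern LMs used in the paper, and the fixed-window search is an assumed simplification.

```python
def lm_score(words, labels):
    # Toy stand-in for a neural LM: reward a speaker change right after
    # sentence-final punctuation, mildly penalize changes elsewhere
    changes = [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]
    return sum(1.0 if words[i - 1].endswith(("?", ".")) else -0.2
               for i in changes)

def correct_turn(words, labels, score=lm_score, window=2):
    # Second pass: slide the hypothesized speaker-change point within a
    # small window and keep the assignment the LM prefers
    turn = next(i for i in range(1, len(labels)) if labels[i] != labels[i - 1])
    best, best_s = list(labels), score(words, labels)
    for pos in range(max(1, turn - window), min(len(labels), turn + window + 1)):
        cand = [labels[0]] * pos + [labels[-1]] * (len(labels) - pos)
        s = score(words, cand)
        if s > best_s:
            best, best_s = cand, s
    return best
```

This targets exactly the failure mode named in the abstract: speaker errors concentrated around turns, where acoustic evidence is weakest but lexical evidence (a completed question, a sentence boundary) is strongest.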
Language modelling for speaker diarization in telephonic interviews
The aim of this paper is to investigate the benefit of combining both language and acoustic modelling for speaker diarization. Although conventional systems only use acoustic features, in some scenarios linguistic data contain highly discriminative speaker information, sometimes even more reliable than the acoustic features. In this study we analyze how an appropriate fusion of both kinds of features can obtain good results in these cases. The proposed system is based on an iterative algorithm in which an LSTM network is used as a speaker classifier. The network is fed with character-level word embeddings and a GMM-based acoustic score created with the output labels from previous iterations. The presented algorithm has been evaluated on a call-center database composed of telephone interview audios. The combination of acoustic features and linguistic content shows an 84.29% improvement in terms of word-level DER compared to an HMM/VB baseline system. The results of this study confirm that linguistic content can be efficiently used for some speaker recognition tasks.
This work was partially supported by the Spanish Project DeepVoice (TEC2015-69266-P) and by the project PID2019-107579RBI00/ AEI /10.13039/501100011033.
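The iterative structure of the algorithm can be sketched with stand-ins for its two components: a keyword scorer in place of the character-level LSTM classifier, and per-speaker feature means refitted from the previous iteration's labels in place of the GMM acoustic score. All names, vocabularies, and features below are illustrative assumptions, not the paper's models.

```python
AGENT_WORDS = {"thank", "help", "calling"}  # hypothetical agent-side vocabulary

def text_score(text, spk):
    # Stand-in for the LSTM text classifier: keyword evidence for speaker 0
    hits = sum(w in AGENT_WORDS for w in text.lower().split())
    return hits if spk == 0 else -hits

def iterative_diarization(texts, feats, init_labels, iters=3):
    """Alternate between refitting acoustic models from the current labels
    and relabeling each segment with fused text + acoustic scores."""
    labels = list(init_labels)
    for _ in range(iters):
        # Refit a per-speaker acoustic mean from the current labels
        # (stand-in for the GMM-based acoustic score in the paper)
        means = {spk: sum(f for f, l in zip(feats, labels) if l == spk)
                      / max(1, sum(l == spk for l in labels))
                 for spk in set(labels)}
        labels = [max(means,
                      key=lambda s: text_score(t, s) - (f - means[s]) ** 2)
                  for t, f in zip(texts, feats)]
    return labels
```

The key property illustrated is the feedback loop: labels from one iteration shape the acoustic model used in the next, so a strong lexical cue can pull a mislabeled segment into the right cluster and then improve the acoustic fit as well.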
Predicting continuous conflict perception with Bayesian Gaussian processes
Conflict is one of the most important phenomena of social life, but it is still largely neglected by the computing community. This work proposes an approach
that detects common conversational social signals (loudness, overlapping speech,
etc.) and predicts the conflict level perceived by human observers in continuous,
non-categorical terms. The proposed regression approach is fully Bayesian and
adopts Automatic Relevance Determination to identify the social signals that most influence the outcome of the prediction. The experiments are performed over the SSPNet Conflict Corpus, a publicly available collection of 1430 clips extracted from televised political debates (roughly 12 hours of material for 138 subjects in total). The results show that it is possible to achieve a correlation close to 0.8 between actual and predicted conflict perception.
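Automatic Relevance Determination in a Gaussian process regressor amounts to giving each input feature its own kernel length-scale: a feature with a large learned length-scale barely moves the kernel and is effectively irrelevant. Below is a minimal sketch with fixed hyperparameters (a full Bayesian treatment would learn the length-scales by optimizing the marginal likelihood); the data and parameter values are assumptions for illustration.

```python
import numpy as np

def ard_kernel(X1, X2, lengthscales, variance=1.0):
    # Squared-exponential kernel with one length-scale per feature (ARD):
    # a large length-scale means that feature barely affects similarity
    d = (X1[:, None, :] - X2[None, :, :]) / lengthscales
    return variance * np.exp(-0.5 * (d ** 2).sum(-1))

def gp_posterior_mean(X, y, X_new, lengthscales, noise=1e-2):
    # Standard GP regression posterior mean with observation noise
    K = ard_kernel(X, X, lengthscales) + noise * np.eye(len(X))
    K_star = ard_kernel(X_new, X, lengthscales)
    return K_star @ np.linalg.solve(K, y)
```

With, say, length-scales `[1.0, 100.0]`, predictions track the first feature and effectively ignore the second, which is how ARD exposes which social signals drive the perceived conflict level.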