67 research outputs found
Speaker diarization assisted ASR for multi-speaker conversations
In this paper, we propose a novel approach for the transcription of speech
conversations with natural speaker overlap, from single channel recordings. We
propose a combination of a speaker diarization system and a hybrid automatic
speech recognition (ASR) system with speaker activity assisted acoustic model
(AM). An end-to-end neural network system is used for speaker diarization. Two
architectures, (i) input conditioned AM, and (ii) gated features AM, are
explored to incorporate the speaker activity information. The models output
speaker specific senones. The experiments on Switchboard telephone
conversations show the advantage of incorporating speaker activity information
in the ASR system for recordings with overlapped speech. In particular, an
absolute improvement in word error rate (WER) is seen for the
proposed approach on natural conversation speech with automatic diarization.
Comment: Manuscript submitted to INTERSPEECH 202
Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach
Large language models (LLMs) have shown great promise for capturing
contextual information in natural language processing tasks. We propose a novel
approach to speaker diarization that incorporates the prowess of LLMs to
exploit contextual cues in human dialogues. Our method builds upon an
acoustic-based speaker diarization system by adding lexical information from an
LLM in the inference stage. We model the multi-modal decoding process
probabilistically and perform joint acoustic and lexical beam search to
incorporate cues from both modalities: audio and text. Our experiments
demonstrate that infusing lexical knowledge from the LLM into an acoustics-only
diarization system improves overall speaker-attributed word error rate
(SA-WER). The experimental results show that LLMs can provide complementary
information to acoustic models for the speaker diarization task via the
proposed beam search decoding approach, which yields up to a 39.8% relative
delta-SA-WER improvement over the baseline system. Thus, we substantiate that
the proposed technique is able to exploit contextual information that is
inaccessible to acoustics-only systems, which are represented by speaker
embeddings. In addition,
these findings point to the potential of using LLMs to improve speaker
diarization and other speech processing tasks by capturing semantic and
contextual cues.
Comment: 4 pages 1 reference page, ICASSP forma
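The joint acoustic and lexical beam search described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `joint_beam_search`, the per-word acoustic log-probability dictionaries, the `lexical_logp` callback (standing in for an LLM-based speaker scorer), and the `lexical_weight` interpolation factor are all assumptions made for illustration.

```python
import math

def joint_beam_search(acoustic_logp, lexical_logp, beam_width=2, lexical_weight=0.5):
    """Pick a speaker label per word by combining two log-probability sources.

    acoustic_logp: list (one entry per word) of {speaker: log-probability}.
    lexical_logp:  hypothetical callable (prefix, word_index, speaker) -> log-prob,
                   standing in for a language-model-based scorer.
    Returns (best speaker sequence as a tuple, its combined score).
    """
    beams = [((), 0.0)]  # each beam is (speaker prefix, accumulated score)
    for i, scores in enumerate(acoustic_logp):
        candidates = []
        for prefix, total in beams:
            for speaker, a_score in scores.items():
                l_score = lexical_logp(prefix, i, speaker)
                combined = total + a_score + lexical_weight * l_score
                candidates.append((prefix + (speaker,), combined))
        # Keep only the highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

# Toy lexical scorer (an assumption): it favors a speaker change after every word.
def toy_lexical(prefix, index, speaker):
    if not prefix:
        return 0.0
    return math.log(0.8) if speaker != prefix[-1] else math.log(0.2)

acoustic = [
    {"A": math.log(0.9), "B": math.log(0.1)},
    {"A": math.log(0.5), "B": math.log(0.5)},  # acoustically ambiguous word
]
best_sequence, best_score = joint_beam_search(acoustic, toy_lexical)
```

In this toy input the second word is acoustically ambiguous, so the lexical score alone decides its speaker label, which is the kind of complementary cue the abstract refers to.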
End-to-end speech recognition modeling from de-identified data
De-identification of data used for automatic speech recognition modeling is a
critical component in protecting privacy, especially in the medical domain.
However, simply removing all personally identifiable information (PII) from
end-to-end model training data leads to a significant performance degradation,
in particular for the recognition of names, dates, locations, and words from
similar categories. We propose and evaluate a two-step method for partially
recovering this loss. First, PII is identified, and each occurrence is replaced
with a random word sequence of the same category. Then, corresponding audio is
produced via text-to-speech or by splicing together matching audio fragments
extracted from the corpus. These artificial audio/label pairs, together with
speaker turns from the original data without PII, are used to train models. We
evaluate the performance of this method on in-house data of medical
conversations and observe a recovery of almost the entire performance
degradation in the general word error rate while still maintaining a strong
diarization performance. Our main focus is the improvement of recall and
precision in the recognition of PII-related words. Depending on the PII
category, a varying share of the performance degradation can be recovered
using our proposed method.
Comment: Accepted to INTERSPEECH 202
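The first step of the two-step method in this abstract (replacing each PII occurrence with a random word sequence of the same category) might look roughly like the sketch below. It is only an illustration under stated assumptions: the segment format, the `SURROGATES` table, and the function name `replace_pii` are hypothetical, and the second step (producing matching audio via text-to-speech or splicing) is omitted.

```python
import random

# Hypothetical surrogate phrases per PII category (illustrative values only).
SURROGATES = {
    "NAME": ["alex morgan", "sam lee", "jordan smith"],
    "DATE": ["march third", "the tenth of june"],
    "LOCATION": ["springfield", "lake view"],
}

def replace_pii(segments, rng):
    """Replace every PII segment with a random surrogate of the same category.

    segments: list of (text, category) pairs; category is None for non-PII text.
    rng:      a random.Random instance, passed in so results are reproducible.
    Returns a new list of (text, category) pairs.
    """
    replaced = []
    for text, category in segments:
        if category is None:
            replaced.append((text, None))  # non-PII text is kept unchanged
        else:
            replaced.append((rng.choice(SURROGATES[category]), category))
    return replaced

segments = [("the patient", None), ("john doe", "NAME"),
            ("was seen on", None), ("may first", "DATE")]
deidentified = replace_pii(segments, random.Random(0))
```

Keeping the category label on each replaced segment makes it possible to score recall and precision on PII-related words separately, as the abstract describes.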
Lexical Speaker Error Correction: Leveraging Language Models for Speaker Diarization Error Correction
Speaker diarization (SD) is typically used with an automatic speech
recognition (ASR) system to ascribe speaker labels to recognized words. The
conventional approach reconciles outputs from independently optimized ASR and
SD systems, where the SD system typically uses only acoustic information to
identify the speakers in the audio stream. This approach can lead to speaker
errors, especially around speaker turns and regions of speaker overlap. In this
paper, we propose a novel second-pass speaker error correction system using
lexical information, leveraging the power of modern language models (LMs). Our
experiments across multiple telephony datasets show that our approach is both
effective and robust. Training and tuning only on the Fisher dataset, this
error correction approach leads to relative word-level diarization error rate
(WDER) reductions of 15-30% on three telephony datasets: RT03-CTS, Callhome
American English and held-out portions of Fisher.
Comment: Accepted at INTERSPEECH 202
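To make the reported metric concrete, the sketch below computes a simplified word-level diarization error rate and a relative reduction between two systems. It assumes the reference and hypothesis words are already aligned one-to-one and that speaker labels have already been mapped between the two; the function names `wder` and `relative_reduction` are illustrative, not taken from the paper.

```python
def wder(ref_speakers, hyp_speakers):
    """Fraction of aligned words whose hypothesized speaker label is wrong.

    Both arguments are equal-length lists of speaker labels, one per aligned word.
    """
    if len(ref_speakers) != len(hyp_speakers):
        raise ValueError("speaker label sequences must be aligned word for word")
    if not ref_speakers:
        return 0.0
    wrong = sum(r != h for r, h in zip(ref_speakers, hyp_speakers))
    return wrong / len(ref_speakers)

def relative_reduction(baseline, improved):
    """Relative error reduction, e.g. 0.2 means a 20% relative reduction."""
    return (baseline - improved) / baseline

reference = ["A", "A", "B", "B"]
first_pass = ["A", "B", "A", "B"]   # two of four words carry the wrong speaker
corrected = ["A", "A", "A", "B"]    # one of four words carries the wrong speaker
gain = relative_reduction(wder(reference, first_pass), wder(reference, corrected))
```

A relative reduction of 15-30%, as reported above, corresponds to `gain` values between 0.15 and 0.30 under this definition.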
Language modelling for speaker diarization in telephonic interviews
The aim of this paper is to investigate the benefit of combining language and acoustic modelling for speaker diarization. Although conventional systems use only acoustic features, in some scenarios linguistic data contain highly discriminative speaker information, even more reliable than the acoustic data. In this study we analyze how an appropriate fusion of both kinds of features is able to obtain good results in these cases. The proposed system is based on an iterative algorithm in which an LSTM network is used as a speaker classifier. The network is fed with character-level word embeddings and a GMM-based acoustic score created with the output labels from previous iterations. The presented algorithm has been evaluated on a call-center database composed of telephone interview recordings. The combination of acoustic features and linguistic content shows an 84.29% improvement in terms of word-level DER compared to an HMM/VB baseline system. The results of this study confirm that linguistic content can be used efficiently for some speaker recognition tasks.
This work was partially supported by the Spanish Project DeepVoice (TEC2015-69266-P) and by the project PID2019-107579RBI00/ AEI /10.13039/501100011033.
Peer Reviewed
Postprint (published version
Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings
This paper presents a streaming speaker-attributed automatic speech
recognition (SA-ASR) model that can recognize "who spoke what" with low latency
even when multiple people are speaking simultaneously. Our model is based on
token-level serialized output training (t-SOT) which was recently proposed to
transcribe multi-talker speech in a streaming fashion. To further recognize
speaker identities, we propose an encoder-decoder based speaker embedding
extractor that can estimate a speaker representation for each recognized token
not only from non-overlapping speech but also from overlapping speech. The
proposed speaker embedding, named t-vector, is extracted synchronously with the
t-SOT ASR model, enabling joint execution of speaker identification (SID) or
speaker diarization (SD) with the multi-talker transcription with low latency.
We evaluate the proposed model for a joint task of ASR and SID/SD by using
LibriSpeechMix and LibriCSS corpora. The proposed model achieves substantially
better accuracy than a prior streaming model and shows comparable or sometimes
even superior results to the state-of-the-art offline SA-ASR model.
Comment: Submitted to Interspeech 202
- …