Cascaded encoders for fine-tuning ASR models on overlapped speech
Multi-talker speech recognition (MT-ASR) has been shown to improve ASR
performance on speech containing overlapping utterances from more than one
speaker. Multi-talker models have typically been trained from scratch using
simulated or actual overlapping speech datasets. On the other hand, the trend
in ASR has been to train foundation models using massive datasets collected
from a wide variety of task domains. Given the scale of these models and their
ability to generalize well across a variety of domains, it makes sense to
consider scenarios where a foundation model is augmented with multi-talker
capability. This paper presents an MT-ASR model formed by combining a
well-trained foundation model with a multi-talker mask model in a cascaded
RNN-T encoder configuration. Experimental results show that the cascade
configuration improves WER on overlapping speech utterances relative to a
baseline multi-talker model without sacrificing the performance the
foundation model achieves on non-overlapping utterances.
End-to-End Joint Target and Non-Target Speakers ASR
This paper proposes a novel automatic speech recognition (ASR) system that
can transcribe each individual speaker's speech in multi-talker overlapped
speech while identifying whether that speaker is a target or non-target speaker.
Target-speaker ASR systems are a promising way to transcribe only a target
speaker's speech by enrolling the target speaker's information. However,
conversational ASR applications often require transcribing both the target
speaker's speech and that of non-target speakers to understand interactive
information. To naturally consider both target and non-target speakers in a
single ASR model, our idea is to extend autoregressive modeling-based
multi-talker ASR systems to utilize the enrollment speech of the target
speaker. The proposed system recursively generates both textual tokens and
tokens that indicate whether the speaker is a target or non-target speaker. Our
experiments demonstrate the effectiveness of the proposed method.
Comment: Accepted at Interspeech 202
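
To make the idea of a single output stream that mixes textual tokens with speaker-role tokens more concrete, here is a minimal sketch of how such a decoded sequence could be split back into per-speaker transcripts. It is an illustration only, not the authors' implementation: the token names `<target>` and `<non-target>`, the assumption that a role token precedes the words it labels, and the helper `split_by_role` are all hypothetical and do not come from the abstract.

```python
# Hypothetical speaker-role tokens; the abstract does not name the actual symbols.
TARGET = "<target>"
NON_TARGET = "<non-target>"
ROLE_TOKENS = {TARGET, NON_TARGET}


def split_by_role(tokens):
    """Group text tokens into utterances labeled by the role token before them.

    Assumption: each utterance in the decoded stream starts with a role token.
    Text tokens that appear before any role token are ignored.
    """
    utterances = []
    for tok in tokens:
        if tok in ROLE_TOKENS:
            # A role token opens a new utterance for that speaker role.
            utterances.append((tok, []))
        elif utterances:
            # A text token is appended to the most recently opened utterance.
            utterances[-1][1].append(tok)
    return [(role, " ".join(words)) for role, words in utterances]


# Example decoded stream: one target utterance followed by one non-target utterance.
decoded = [TARGET, "how", "are", "you", NON_TARGET, "fine", "thanks"]
for role, text in split_by_role(decoded):
    print(role, text)
```

In this sketch the role labels travel in the same sequence as the words, so a single autoregressive decoder could in principle emit both. How the actual system orders and represents these tokens is not specified in the abstract.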