64 research outputs found
VarArray Meets t-SOT: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition
This paper presents a novel streaming automatic speech recognition (ASR)
framework for multi-talker overlapping speech captured by a distant microphone
array with an arbitrary geometry. Our framework, named t-SOT-VA, capitalizes on
independently developed two recent technologies; array-geometry-agnostic
continuous speech separation, or VarArray, and streaming multi-talker ASR based
on token-level serialized output training (t-SOT). To combine the best of both
technologies, we newly design a t-SOT-based ASR model that generates a
serialized multi-talker transcription based on two separated speech signals
from VarArray. We also propose a pre-training scheme for such an ASR model
where we simulate VarArray's output signals based on monaural single-talker ASR
training data. Conversation transcription experiments using the AMI meeting
corpus show that the system based on the proposed framework significantly
outperforms conventional ones. Our system achieves the state-of-the-art word
error rates of 13.7% and 15.5% for the AMI development and evaluation sets,
respectively, in the multiple-distant-microphone setting while retaining the
streaming inference capability.Comment: 6 pages, 2 figure, 3 tables, v2: Appendix A has been adde
End-to-End Joint Target and Non-Target Speakers ASR
This paper proposes a novel automatic speech recognition (ASR) system that
can transcribe individual speaker's speech while identifying whether they are
target or non-target speakers from multi-talker overlapped speech.
Target-speaker ASR systems are a promising way to only transcribe a target
speaker's speech by enrolling the target speaker's information. However, in
conversational ASR applications, transcribing both the target speaker's speech
and non-target speakers' ones is often required to understand interactive
information. To naturally consider both target and non-target speakers in a
single ASR model, our idea is to extend autoregressive modeling-based
multi-talker ASR systems to utilize the enrollment speech of the target
speaker. Our proposed ASR is performed by recursively generating both textual
tokens and tokens that represent target or non-target speakers. Our experiments
demonstrate the effectiveness of our proposed method.Comment: Accepted at Interspeech 202
- …