This paper proposes a novel automatic speech recognition (ASR) system that
transcribes each speaker's speech in multi-talker overlapped speech while
identifying whether that speaker is the target or a non-target speaker.
Target-speaker ASR systems are a promising way to transcribe only a target
speaker's speech by enrolling that speaker's information. However,
conversational ASR applications often need transcripts of both the target
speaker's speech and the non-target speakers' speech to understand the
interaction between speakers. To handle both target and non-target speakers in a
single ASR model, we extend autoregressive modeling-based
multi-talker ASR systems to utilize the enrollment speech of the target
speaker. The proposed system performs recognition by recursively generating both
textual tokens and tokens that mark a speaker as target or non-target. Our experiments
demonstrate the effectiveness of our proposed method.
Comment: Accepted at Interspeech 202
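
The generation scheme summarized above, in which textual tokens and speaker-role tokens share one output sequence, can be illustrated with a minimal post-processing sketch. It is not the authors' implementation: the token names `<target>` and `<non-target>`, the convention that a role token precedes the words it labels, and the helper `split_by_speaker_role` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical role tokens; the abstract does not name the actual symbols.
TARGET = "<target>"
NON_TARGET = "<non-target>"
ROLE_TOKENS = {TARGET: "target", NON_TARGET: "non-target"}


def split_by_speaker_role(tokens: List[str]) -> List[Tuple[str, str]]:
    """Group a serialized token stream into (role, text) utterances.

    Assumes each role token opens a new utterance and the following
    textual tokens belong to it until the next role token appears.
    """
    utterances: List[Tuple[str, List[str]]] = []
    for token in tokens:
        if token in ROLE_TOKENS:
            # A role token starts a new utterance for that speaker role.
            utterances.append((ROLE_TOKENS[token], []))
        elif utterances:
            # A textual token is appended to the most recent utterance.
            utterances[-1][1].append(token)
        # Textual tokens seen before any role token are skipped,
        # because their speaker role is unknown.
    return [(role, " ".join(words)) for role, words in utterances]


# Example stream, as an autoregressive decoder might emit it token by token.
decoded = ["<target>", "how", "are", "you", "<non-target>", "fine", "thanks"]
result = split_by_speaker_role(decoded)
```

Under these assumptions, `result` holds one entry per utterance, so a downstream application can keep the target speaker's transcript and still read the non-target speakers' replies.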