173 research outputs found
VarArray Meets t-SOT: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition
This paper presents a novel streaming automatic speech recognition (ASR)
framework for multi-talker overlapping speech captured by a distant microphone
array with an arbitrary geometry. Our framework, named t-SOT-VA, capitalizes on
two recent, independently developed technologies: array-geometry-agnostic
continuous speech separation, or VarArray, and streaming multi-talker ASR based
on token-level serialized output training (t-SOT). To combine the best of both
technologies, we newly design a t-SOT-based ASR model that generates a
serialized multi-talker transcription based on two separated speech signals
from VarArray. We also propose a pre-training scheme for such an ASR model
where we simulate VarArray's output signals based on monaural single-talker ASR
training data. Conversation transcription experiments using the AMI meeting
corpus show that the system based on the proposed framework significantly
outperforms conventional ones. Our system achieves the state-of-the-art word
error rates of 13.7% and 15.5% for the AMI development and evaluation sets,
respectively, in the multiple-distant-microphone setting while retaining the
streaming inference capability.
Comment: 6 pages, 2 figures, 3 tables, v2: Appendix A has been added
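The word error rates quoted in the abstract above are ratios of word-level edit operations (substitutions, deletions, insertions) to the number of reference words. The following is a minimal sketch of that basic metric for a single reference/hypothesis pair. It is not the scoring procedure used in the paper, which the abstract does not describe; multi-talker evaluation typically needs additional steps for matching speakers or overlapping segments. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenizes on whitespace and applies no text
    normalization, which real scoring tools usually do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER = {wer:.1%}")
```

Because insertions count as errors, this ratio can exceed 100% when the hypothesis is much longer than the reference.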
Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings
This paper presents a streaming speaker-attributed automatic speech
recognition (SA-ASR) model that can recognize "who spoke what" with low latency
even when multiple people are speaking simultaneously. Our model is based on
token-level serialized output training (t-SOT) which was recently proposed to
transcribe multi-talker speech in a streaming fashion. To further recognize
speaker identities, we propose an encoder-decoder based speaker embedding
extractor that can estimate a speaker representation for each recognized token
not only from non-overlapping speech but also from overlapping speech. The
proposed speaker embedding, named t-vector, is extracted synchronously with the
t-SOT ASR model, enabling joint execution of speaker identification (SID) or
speaker diarization (SD) with the multi-talker transcription with low latency.
We evaluate the proposed model for a joint task of ASR and SID/SD by using
LibriSpeechMix and LibriCSS corpora. The proposed model achieves substantially
better accuracy than a prior streaming model and shows comparable or sometimes
even superior results to the state-of-the-art offline SA-ASR model.
Comment: Submitted to Interspeech 202
Exploiting Single-Channel Speech for Multi-Channel End-to-End Speech Recognition: A Comparative Study
Recently, the end-to-end training approach for multi-channel ASR, which usually
consists of a beamforming front-end and a recognition back-end, has shown its
effectiveness. However, the end-to-end training becomes more difficult
due to the integration of multiple modules, particularly considering that
multi-channel speech data recorded in real environments are limited in size.
This raises the demand to exploit the single-channel data for multi-channel
end-to-end ASR. In this paper, we systematically compare the performance of
three schemes to exploit external single-channel data for multi-channel
end-to-end ASR, namely back-end pre-training, data scheduling, and data
simulation, under different settings such as the sizes of the single-channel
data and the choices of the front-end. Extensive experiments on CHiME-4 and
AISHELL-4 datasets demonstrate that while all three methods improve the
multi-channel end-to-end speech recognition performance, data simulation
outperforms the other two, at the cost of longer training time. Data scheduling
outperforms back-end pre-training marginally but nearly consistently,
presumably because, in the pre-training stage, the back-end tends to
overfit on the single-channel data, especially when the single-channel data
size is small.
Comment: submitted to INTERSPEECH 2022. arXiv admin note: substantial text
overlap with arXiv:2107.0267