Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings
This paper presents a streaming speaker-attributed automatic speech
recognition (SA-ASR) model that can recognize "who spoke what" with low latency
even when multiple people are speaking simultaneously. Our model is based on
token-level serialized output training (t-SOT), which was recently proposed to
transcribe multi-talker speech in a streaming fashion. To further recognize
speaker identities, we propose an encoder-decoder based speaker embedding
extractor that can estimate a speaker representation for each recognized token
not only from non-overlapping speech but also from overlapping speech. The
proposed speaker embedding, named t-vector, is extracted synchronously with the
t-SOT ASR model, enabling speaker identification (SID) or speaker diarization
(SD) to run jointly with the multi-talker transcription at low latency.
We evaluate the proposed model for a joint task of ASR and SID/SD by using
LibriSpeechMix and LibriCSS corpora. The proposed model achieves substantially
better accuracy than a prior streaming model and shows comparable or sometimes
even superior results to the state-of-the-art offline SA-ASR model.
Comment: Submitted to Interspeech 202
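To make the token-level attribution step concrete, here is a minimal sketch of how per-token t-vectors could be matched against enrolled speaker profiles by cosine similarity; the "<cc>" channel-change token, the profile dictionary, and all function names are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def attribute_speakers(tokens, t_vectors, profiles):
    """Assign each recognized token to the enrolled speaker whose profile
    embedding is closest (by cosine similarity) to the token's t-vector.

    tokens    : list[str]              -- serialized ASR output; may contain "<cc>"
    t_vectors : list[np.ndarray]       -- one embedding per token
    profiles  : dict[str, np.ndarray]  -- enrolled speaker -> mean embedding
    """
    attributed = []
    for tok, vec in zip(tokens, t_vectors):
        if tok == "<cc>":  # hypothetical channel-change marker: talker switches
            attributed.append((tok, None))
            continue
        speaker = max(profiles, key=lambda s: cosine(vec, profiles[s]))
        attributed.append((tok, speaker))
    return attributed
```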
VarArray Meets t-SOT: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition
This paper presents a novel streaming automatic speech recognition (ASR)
framework for multi-talker overlapping speech captured by a distant microphone
array with an arbitrary geometry. Our framework, named t-SOT-VA, capitalizes on
two independently developed recent technologies: array-geometry-agnostic
continuous speech separation, or VarArray, and streaming multi-talker ASR based
on token-level serialized output training (t-SOT). To combine the best of both
technologies, we design a new t-SOT-based ASR model that generates a
serialized multi-talker transcription from the two separated speech signals
produced by VarArray. We also propose a pre-training scheme for such an ASR model
where we simulate VarArray's output signals based on monaural single-talker ASR
training data. Conversation transcription experiments using the AMI meeting
corpus show that the system based on the proposed framework significantly
outperforms conventional ones. Our system achieves the state-of-the-art word
error rates of 13.7% and 15.5% for the AMI development and evaluation sets,
respectively, in the multiple-distant-microphone setting while retaining the
streaming inference capability.
Comment: 6 pages, 2 figures, 3 tables, v2: Appendix A has been added
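The pre-training idea, simulating the separator's two output signals from monaural single-talker data, might look roughly like the following sketch; the channel-assignment rule, offset range, and names are assumptions for illustration only.

```python
import numpy as np

def simulate_two_channel(utts, sr=16000, max_offset_s=2.0, rng=None):
    """Place monaural single-talker utterances onto two 'separated' channels
    with random start offsets, mimicking the separator's outputs.

    utts: list of 1-D float waveform arrays. Returns the two channels and
    the utterance order by start time (the order used for serialization).
    """
    rng = rng or np.random.default_rng()
    length = int(max(len(u) for u in utts) + max_offset_s * sr)
    channels = [np.zeros(length), np.zeros(length)]
    starts = []
    for i, utt in enumerate(utts):
        ch = i % 2                                  # alternate channels (an assumption)
        start = int(rng.uniform(0.0, max_offset_s) * sr)
        channels[ch][start:start + len(utt)] += utt
        starts.append((start, i))
    starts.sort()                                   # start-time order for serialized targets
    return channels, [i for _, i in starts]
```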
BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR
The recently proposed serialized output training (SOT) simplifies
multi-talker automatic speech recognition (ASR) by generating speaker
transcriptions separated by a special token. However, frequent speaker changes
can make speaker change prediction difficult. To address this, we propose
boundary-aware serialized output training (BA-SOT), which explicitly
incorporates boundary knowledge into the decoder via a speaker change detection
task and boundary constraint loss. We also introduce a two-stage connectionist
temporal classification (CTC) strategy that incorporates token-level SOT CTC to
restore temporal context information. Besides typical character error rate
(CER), we introduce utterance-dependent character error rate (UD-CER) to
further measure the precision of speaker change prediction. Compared to
original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and leveraging a
pre-trained ASR model for BA-SOT model initialization further reduces
CER/UD-CER by 8.4%/19.9%.
Comment: Accepted by INTERSPEECH 202
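As a hedged sketch of what a multi-task objective of this general shape could look like, the function below combines an attention cross-entropy loss with a speaker change detection term and a CTC branch; the weights and the exact form of each term are assumptions, and the paper's boundary constraint loss is omitted here for brevity.

```python
import torch.nn.functional as F

def ba_sot_style_loss(att_logits, targets, scd_logits, scd_labels,
                      ctc_log_probs, ctc_targets, in_lens, tgt_lens,
                      w_scd=0.3, w_ctc=0.3):
    """att_logits: (B, T, V) decoder outputs; targets: (B, T) token ids.
    scd_logits/scd_labels: (B, T) per-token speaker-change indicators
    (labels as floats in {0, 1}).
    ctc_log_probs: (T', B, V) log-softmax outputs for the CTC branch."""
    l_att = F.cross_entropy(att_logits.transpose(1, 2), targets)
    l_scd = F.binary_cross_entropy_with_logits(scd_logits, scd_labels)
    l_ctc = F.ctc_loss(ctc_log_probs, ctc_targets, in_lens, tgt_lens)
    # The paper's boundary constraint loss would add a further term here.
    return l_att + w_scd * l_scd + w_ctc * l_ctc
```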
Speaker diarization assisted ASR for multi-speaker conversations
In this paper, we propose a novel approach for the transcription of speech
conversations with natural speaker overlap, from single channel recordings. We
propose a combination of a speaker diarization system and a hybrid automatic
speech recognition (ASR) system with a speaker-activity-assisted acoustic model
(AM). An end-to-end neural network system is used for speaker diarization. Two
architectures, (i) input conditioned AM, and (ii) gated features AM, are
explored to incorporate the speaker activity information. The models output
speaker-specific senones. The experiments on Switchboard telephone
conversations show the advantage of incorporating speaker activity information
in the ASR system for recordings with overlapped speech. In particular, an
absolute improvement of in word error rate (WER) is seen for the
proposed approach on natural conversation speech with automatic diarization.Comment: Manuscript submitted to INTERSPEECH 202
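One way to picture the "gated features" variant is the sketch below, which scales acoustic features with a gate computed from frame-level speaker activity before a network that outputs speaker-specific senones; the layer sizes and gating form are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GatedFeatureAM(nn.Module):
    """Acoustic model whose input features are gated by speaker activity."""

    def __init__(self, feat_dim=80, n_speakers=2, hidden=512, n_senones=6000):
        super().__init__()
        self.gate = nn.Linear(n_speakers, feat_dim)     # activity -> per-dim gate
        self.body = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_senones * n_speakers),  # speaker-specific senones
        )
        self.n_speakers, self.n_senones = n_speakers, n_senones

    def forward(self, feats, activity):
        # feats: (B, T, feat_dim); activity: (B, T, n_speakers) in [0, 1]
        gated = feats * torch.sigmoid(self.gate(activity))
        out = self.body(gated)
        return out.view(*out.shape[:2], self.n_speakers, self.n_senones)
```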
Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization
We propose a modular pipeline for the single-channel separation, recognition,
and diarization of meeting-style recordings and evaluate it on the Libri-CSS
dataset. Using a Continuous Speech Separation (CSS) system with a TF-GridNet
separation architecture, followed by a speaker-agnostic speech recognizer, we
achieve state-of-the-art recognition performance in terms of Optimal Reference
Combination Word Error Rate (ORC WER). Then, a d-vector-based diarization
module is employed to extract speaker embeddings from the enhanced signals and
to assign the CSS outputs to the correct speaker. Here, we propose a
syntactically informed diarization using sentence- and word-level boundaries of
the ASR module to support speaker turn detection. This results in a
state-of-the-art Concatenated minimum-Permutation Word Error Rate (cpWER) for
the full meeting recognition pipeline.
Comment: Submitted to ICASSP 202
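The modular structure could be orchestrated roughly as follows; every component here (separate, transcribe, embed, cluster) is a hypothetical stand-in for the CSS, ASR, and d-vector modules named in the abstract.

```python
def transcribe_meeting(wav, separate, transcribe, embed, cluster):
    """Run a CSS -> ASR -> diarization pipeline over one meeting recording."""
    streams = separate(wav)                      # CSS: overlap-free streams
    segments = []
    for ch, stream in enumerate(streams):
        for seg in transcribe(stream):           # dicts with 'start', 'end', 'text'
            # d-vector per ASR-delimited segment; the sentence/word boundaries
            # from the recognizer are the "transcription-supported" part.
            seg["embedding"] = embed(stream, seg["start"], seg["end"])
            seg["channel"] = ch
            segments.append(seg)
    labels = cluster([s["embedding"] for s in segments])  # speaker clustering
    for seg, spk in zip(segments, labels):
        seg["speaker"] = spk
    return sorted(segments, key=lambda s: s["start"])
```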
UniX-Encoder: A Universal X-Channel Speech Encoder for Ad-Hoc Microphone Array Speech Processing
The speech field is evolving to solve more challenging scenarios, such as
multi-channel recordings with multiple simultaneous talkers. Given the many
types of microphone setups out there, we present the UniX-Encoder: a
universal encoder designed for multiple tasks that works with any microphone
array, in both single- and multi-talker environments. Our research enhances
previous multi-channel speech processing efforts in four key areas: 1)
Adaptability: Contrasting traditional models constrained to certain microphone
array configurations, our encoder is universally compatible. 2) Multi-Task
Capability: Beyond the single-task focus of previous systems, UniX-Encoder acts
as a robust upstream model, adeptly extracting features for diverse tasks
including ASR and speaker recognition. 3) Self-Supervised Training: The encoder
is trained without requiring labeled multi-channel data. 4) End-to-End
Integration: In contrast to models that first beamform and then process the
resulting single channel, our encoder offers an end-to-end solution, bypassing explicit
beamforming or separation. To validate its effectiveness, we tested the
UniX-Encoder on a synthetic multi-channel dataset from the LibriSpeech corpus.
Across tasks like speech recognition and speaker diarization, our encoder
consistently outperformed combinations like the WavLM model with the BeamformIt
frontend.
Comment: Submitted to ICASSP 202
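The upstream/downstream pattern described here, a frozen multi-channel encoder feeding lightweight task heads with no explicit beamforming, might be wired up like the sketch below; the encoder call and all dimensions are assumptions, since the actual interface is not given in this listing.

```python
import torch
import torch.nn as nn

class MultiChannelUpstream(nn.Module):
    """Frozen self-supervised encoder shared by per-task heads."""

    def __init__(self, encoder, feat_dim=768, n_tokens=5000, emb_dim=256):
        super().__init__()
        self.encoder = encoder.eval()                 # frozen upstream model
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.asr_head = nn.Linear(feat_dim, n_tokens)  # e.g. a CTC vocabulary
        self.spk_head = nn.Linear(feat_dim, emb_dim)   # speaker embedding

    def forward(self, wavs):
        # wavs: (B, C, T) raw audio with an arbitrary channel count C
        with torch.no_grad():
            feats = self.encoder(wavs)                 # assumed (B, T', feat_dim)
        return self.asr_head(feats), self.spk_head(feats.mean(dim=1))
```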