Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings
This paper presents a streaming speaker-attributed automatic speech
recognition (SA-ASR) model that can recognize "who spoke what" with low latency
even when multiple people are speaking simultaneously. Our model is based on
token-level serialized output training (t-SOT), which was recently proposed to
transcribe multi-talker speech in a streaming fashion. To further recognize
speaker identities, we propose an encoder-decoder based speaker embedding
extractor that can estimate a speaker representation for each recognized token
not only from non-overlapping speech but also from overlapping speech. The
proposed speaker embedding, named t-vector, is extracted synchronously with the
t-SOT ASR model, enabling speaker identification (SID) or speaker diarization
(SD) to run jointly with the multi-talker transcription at low latency.
We evaluate the proposed model for a joint task of ASR and SID/SD by using
LibriSpeechMix and LibriCSS corpora. The proposed model achieves substantially
better accuracy than a prior streaming model and shows comparable or sometimes
even superior results to the state-of-the-art offline SA-ASR model.
Comment: Submitted to Interspeech 202
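To make the token-level attribution step concrete, here is a minimal sketch of how per-token t-vectors could be matched against enrolled speaker profiles by cosine similarity; the "<cc>" channel-change token, the profile dictionary, and all function names are illustrative assumptions, not the paper's actual interface.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def attribute_speakers(tokens, t_vectors, profiles):
    """Assign each recognized token to the enrolled speaker whose profile
    embedding is closest (by cosine similarity) to the token's t-vector.

    tokens    : list[str]              -- serialized ASR output; may contain "<cc>"
    t_vectors : list[np.ndarray]       -- one embedding per token
    profiles  : dict[str, np.ndarray]  -- enrolled speaker -> mean embedding
    """
    attributed = []
    for tok, vec in zip(tokens, t_vectors):
        if tok == "<cc>":  # hypothetical channel-change marker: talker switches
            attributed.append((tok, None))
            continue
        speaker = max(profiles, key=lambda s: cosine(vec, profiles[s]))
        attributed.append((tok, speaker))
    return attributed
```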
VarArray Meets t-SOT: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition
This paper presents a novel streaming automatic speech recognition (ASR)
framework for multi-talker overlapping speech captured by a distant microphone
array with an arbitrary geometry. Our framework, named t-SOT-VA, capitalizes on
two independently developed recent technologies: array-geometry-agnostic
continuous speech separation, or VarArray, and streaming multi-talker ASR based
on token-level serialized output training (t-SOT). To combine the best of both
technologies, we design a new t-SOT-based ASR model that generates a
serialized multi-talker transcription from the two separated speech signals
produced by VarArray. We also propose a pre-training scheme for such an ASR model
where we simulate VarArray's output signals based on monaural single-talker ASR
training data. Conversation transcription experiments using the AMI meeting
corpus show that the system based on the proposed framework significantly
outperforms conventional ones. Our system achieves the state-of-the-art word
error rates of 13.7% and 15.5% for the AMI development and evaluation sets,
respectively, in the multiple-distant-microphone setting while retaining the
streaming inference capability.
Comment: 6 pages, 2 figures, 3 tables, v2: Appendix A has been added
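The pre-training idea, simulating the separator's two output signals from monaural single-talker data, might look roughly like the following sketch; the channel-assignment rule, offset range, and names are assumptions for illustration only.

```python
import numpy as np

def simulate_two_channel(utts, sr=16000, max_offset_s=2.0, rng=None):
    """Place monaural single-talker utterances onto two 'separated' channels
    with random start offsets, mimicking the separator's outputs.

    utts: list of 1-D float waveform arrays. Returns the two channels and
    the utterance order by start time (the order used for serialization).
    """
    rng = rng or np.random.default_rng()
    length = int(max(len(u) for u in utts) + max_offset_s * sr)
    channels = [np.zeros(length), np.zeros(length)]
    starts = []
    for i, utt in enumerate(utts):
        ch = i % 2                                  # alternate channels (an assumption)
        start = int(rng.uniform(0.0, max_offset_s) * sr)
        channels[ch][start:start + len(utt)] += utt
        starts.append((start, i))
    starts.sort()                                   # start-time order for serialized targets
    return channels, [i for _, i in starts]
```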
BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR
The recently proposed serialized output training (SOT) simplifies
multi-talker automatic speech recognition (ASR) by generating speaker
transcriptions separated by a special token. However, frequent speaker changes
can make speaker change prediction difficult. To address this, we propose
boundary-aware serialized output training (BA-SOT), which explicitly
incorporates boundary knowledge into the decoder via a speaker change detection
task and boundary constraint loss. We also introduce a two-stage connectionist
temporal classification (CTC) strategy that incorporates token-level SOT CTC to
restore temporal context information. Besides typical character error rate
(CER), we introduce utterance-dependent character error rate (UD-CER) to
further measure the precision of speaker change prediction. Compared to
original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and leveraging a
pre-trained ASR model for BA-SOT model initialization further reduces
CER/UD-CER by 8.4%/19.9%.
Comment: Accepted by INTERSPEECH 202
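As a hedged sketch of what a multi-task objective of this general shape could look like, the function below combines an attention cross-entropy loss with a speaker change detection term and a CTC branch; the weights and the exact form of each term are assumptions, and the paper's boundary constraint loss is omitted here for brevity.

```python
import torch.nn.functional as F

def ba_sot_style_loss(att_logits, targets, scd_logits, scd_labels,
                      ctc_log_probs, ctc_targets, in_lens, tgt_lens,
                      w_scd=0.3, w_ctc=0.3):
    """att_logits: (B, T, V) decoder outputs; targets: (B, T) token ids.
    scd_logits/scd_labels: (B, T) per-token speaker-change indicators
    (labels as floats in {0, 1}).
    ctc_log_probs: (T', B, V) log-softmax outputs for the CTC branch."""
    l_att = F.cross_entropy(att_logits.transpose(1, 2), targets)
    l_scd = F.binary_cross_entropy_with_logits(scd_logits, scd_labels)
    l_ctc = F.ctc_loss(ctc_log_probs, ctc_targets, in_lens, tgt_lens)
    # The paper's boundary constraint loss would add a further term here.
    return l_att + w_scd * l_scd + w_ctc * l_ctc
```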
Speaker diarization assisted ASR for multi-speaker conversations
In this paper, we propose a novel approach for the transcription of speech
conversations with natural speaker overlap, from single channel recordings. We
propose a combination of a speaker diarization system and a hybrid automatic
speech recognition (ASR) system with a speaker-activity-assisted acoustic model
(AM). An end-to-end neural network system is used for speaker diarization. Two
architectures, (i) input conditioned AM, and (ii) gated features AM, are
explored to incorporate the speaker activity information. The models output
speaker-specific senones. The experiments on Switchboard telephone
conversations show the advantage of incorporating speaker activity information
in the ASR system for recordings with overlapped speech. In particular, an
absolute improvement of in word error rate (WER) is seen for the
proposed approach on natural conversation speech with automatic diarization.Comment: Manuscript submitted to INTERSPEECH 202
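One way to picture the "gated features" variant is the sketch below, which scales acoustic features with a gate computed from frame-level speaker activity before a network that outputs speaker-specific senones; the layer sizes and gating form are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GatedFeatureAM(nn.Module):
    """Acoustic model whose input features are gated by speaker activity."""

    def __init__(self, feat_dim=80, n_speakers=2, hidden=512, n_senones=6000):
        super().__init__()
        self.gate = nn.Linear(n_speakers, feat_dim)     # activity -> per-dim gate
        self.body = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_senones * n_speakers),  # speaker-specific senones
        )
        self.n_speakers, self.n_senones = n_speakers, n_senones

    def forward(self, feats, activity):
        # feats: (B, T, feat_dim); activity: (B, T, n_speakers) in [0, 1]
        gated = feats * torch.sigmoid(self.gate(activity))
        out = self.body(gated)
        return out.view(*out.shape[:2], self.n_speakers, self.n_senones)
```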
Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization
We propose a modular pipeline for the single-channel separation, recognition,
and diarization of meeting-style recordings and evaluate it on the Libri-CSS
dataset. Using a Continuous Speech Separation (CSS) system with a TF-GridNet
separation architecture, followed by a speaker-agnostic speech recognizer, we
achieve state-of-the-art recognition performance in terms of Optimal Reference
Combination Word Error Rate (ORC WER). Then, a d-vector-based diarization
module is employed to extract speaker embeddings from the enhanced signals and
to assign the CSS outputs to the correct speaker. Here, we propose a
syntactically informed diarization using sentence- and word-level boundaries of
the ASR module to support speaker turn detection. This results in a
state-of-the-art Concatenated minimum-Permutation Word Error Rate (cpWER) for
the full meeting recognition pipeline.
Comment: Submitted to ICASSP 202
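The modular structure could be orchestrated roughly as follows; every component here (separate, transcribe, embed, cluster) is a hypothetical stand-in for the CSS, ASR, and d-vector modules named in the abstract.

```python
def transcribe_meeting(wav, separate, transcribe, embed, cluster):
    """Run a CSS -> ASR -> diarization pipeline over one meeting recording."""
    streams = separate(wav)                      # CSS: overlap-free streams
    segments = []
    for ch, stream in enumerate(streams):
        for seg in transcribe(stream):           # dicts with 'start', 'end', 'text'
            # d-vector per ASR-delimited segment; the sentence/word boundaries
            # from the recognizer are the "transcription-supported" part.
            seg["embedding"] = embed(stream, seg["start"], seg["end"])
            seg["channel"] = ch
            segments.append(seg)
    labels = cluster([s["embedding"] for s in segments])  # speaker clustering
    for seg, spk in zip(segments, labels):
        seg["speaker"] = spk
    return sorted(segments, key=lambda s: s["start"])
```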
UniX-Encoder: A Universal X-Channel Speech Encoder for Ad-Hoc Microphone Array Speech Processing
The speech field is evolving to solve more challenging scenarios, such as
multi-channel recordings with multiple simultaneous talkers. Given the many
types of microphone setups out there, we present the UniX-Encoder: a
universal encoder designed for multiple tasks that works with any microphone
array, in both single- and multi-talker environments. Our research enhances
previous multi-channel speech processing efforts in four key areas: 1)
Adaptability: Contrasting traditional models constrained to certain microphone
array configurations, our encoder is universally compatible. 2) Multi-Task
Capability: Beyond the single-task focus of previous systems, UniX-Encoder acts
as a robust upstream model, adeptly extracting features for diverse tasks
including ASR and speaker recognition. 3) Self-Supervised Training: The encoder
is trained without requiring labeled multi-channel data. 4) End-to-End
Integration: In contrast to models that first beamform and then process the
resulting single channel, our encoder offers an end-to-end solution, bypassing explicit
beamforming or separation. To validate its effectiveness, we tested the
UniX-Encoder on a synthetic multi-channel dataset from the LibriSpeech corpus.
Across tasks like speech recognition and speaker diarization, our encoder
consistently outperformed combinations like the WavLM model with the BeamformIt
frontend.
Comment: Submitted to ICASSP 202
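The upstream/downstream pattern described here, a frozen multi-channel encoder feeding lightweight task heads with no explicit beamforming, might be wired up like the sketch below; the encoder call and all dimensions are assumptions, since the actual interface is not given in this listing.

```python
import torch
import torch.nn as nn

class MultiChannelUpstream(nn.Module):
    """Frozen self-supervised encoder shared by per-task heads."""

    def __init__(self, encoder, feat_dim=768, n_tokens=5000, emb_dim=256):
        super().__init__()
        self.encoder = encoder.eval()                 # frozen upstream model
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.asr_head = nn.Linear(feat_dim, n_tokens)  # e.g. a CTC vocabulary
        self.spk_head = nn.Linear(feat_dim, emb_dim)   # speaker embedding

    def forward(self, wavs):
        # wavs: (B, C, T) raw audio with an arbitrary channel count C
        with torch.no_grad():
            feats = self.encoder(wavs)                 # assumed (B, T', feat_dim)
        return self.asr_head(feats), self.spk_head(feats.mean(dim=1))
```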