Diffusion Conditional Expectation Model for Efficient and Robust Target Speech Extraction

Authors: Liu Shujie, Qian Yanmin, Qian Yao, Wang Heming, Wang Xinkai, Yang Hemin, Yu Linfeng, Zeng Michael, Zhang Leying, Zhou Long
Publication date: 25/09/2023

Target Speech Extraction (TSE) is a crucial task in speech processing that focuses on isolating the clean speech of a specific speaker from complex mixtures. While discriminative methods are commonly used for TSE, they can introduce distortion in terms of speech perception quality. On the other hand, generative approaches, particularly diffusion-based methods, can enhance speech quality perceptually but suffer from slower inference speed. We propose an efficient generative approach named Diffusion Conditional Expectation Model (DCEM) for TSE. It can handle multi- and single-speaker scenarios in both noisy and clean conditions. Additionally, we introduce Regenerate-DCEM (R-DCEM) that can regenerate and optimize speech quality based on pre-processed speech from a discriminative model. Our method outperforms conventional methods in terms of both intrusive and non-intrusive metrics and demonstrates notable strengths in inference efficiency and robustness to unseen tasks. Audio examples are available online (https://vivian556123.github.io/dcem).

Comment: Submitted to ICASSP 202

arXiv.org e-Print Archive

BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR

Authors: Chen Qian, Guo Pengcheng, Li Yangze, Liang Yuhao, Xie Lei, Yu Fan, Zhang Shiliang
Publication date: 30/05/2023

The recently proposed serialized output training (SOT) simplifies multi-talker automatic speech recognition (ASR) by generating speaker transcriptions separated by a special token. However, frequent speaker changes can make speaker change prediction difficult. To address this, we propose boundary-aware serialized output training (BA-SOT), which explicitly incorporates boundary knowledge into the decoder via a speaker change detection task and boundary constraint loss. We also introduce a two-stage connectionist temporal classification (CTC) strategy that incorporates token-level SOT CTC to restore temporal context information. Besides the typical character error rate (CER), we introduce utterance-dependent character error rate (UD-CER) to further measure the precision of speaker change prediction. Compared to the original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and leveraging a pre-trained ASR model for BA-SOT model initialization further reduces CER/UD-CER by 8.4%/19.9%.

Comment: Accepted by INTERSPEECH 202

arXiv.org e-Print Archive
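
As a rough illustration of two ideas the BA-SOT abstract mentions, the sketch below serializes per-speaker transcriptions into a single target string separated by a special token, and computes a plain character error rate. It is not the authors' code. The token name `<sc>`, the ordering of utterances by start time, and the function names `serialize_sot` and `cer` are assumptions made for this example; UD-CER is not shown because the abstract does not define it.

```python
from typing import List, Sequence, Tuple

# Hypothetical name for the special speaker-change token (an assumption).
SC_TOKEN = "<sc>"


def serialize_sot(utterances: List[Tuple[float, str]]) -> str:
    """Join (start_time, text) utterances into one serialized target.

    Assumption: utterances are ordered by start time before joining.
    """
    ordered = sorted(utterances, key=lambda u: u[0])
    return f" {SC_TOKEN} ".join(text for _, text in ordered)


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over reference length.

    Spaces are ignored, which is one common convention (an assumption here).
    """
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    target = serialize_sot([(1.2, "hello"), (0.0, "hi there")])
    print(target)
    print(cer("abcd", "abed"))
```

Keeping the speaker-change token as an ordinary symbol in the target string is what lets a single decoder emit transcriptions for several speakers in one pass; a real system would add the token to the model vocabulary rather than treat it as text.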