Diffusion Conditional Expectation Model for Efficient and Robust Target Speech Extraction
Target Speech Extraction (TSE) is a crucial task in speech processing that
focuses on isolating the clean speech of a specific speaker from complex
mixtures. While discriminative methods are commonly used for TSE, they can
introduce distortion that degrades perceived speech quality. Generative
approaches, particularly diffusion-based methods, can improve perceptual
quality but suffer from slower inference. We propose an
efficient generative approach named Diffusion Conditional Expectation Model
(DCEM) for TSE. It can handle multi- and single-speaker scenarios in both noisy
and clean conditions. Additionally, we introduce Regenerate-DCEM (R-DCEM) that
can regenerate and optimize speech quality based on pre-processed speech from a
discriminative model. Our method outperforms conventional methods in terms of
both intrusive and non-intrusive metrics and demonstrates notable strengths in
inference efficiency and robustness to unseen tasks. Audio examples are
available online (https://vivian556123.github.io/dcem).
Comment: Submitted to ICASSP 202
BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR
The recently proposed serialized output training (SOT) simplifies
multi-talker automatic speech recognition (ASR) by generating speaker
transcriptions separated by a special token. However, frequent speaker changes
can make speaker change prediction difficult. To address this, we propose
boundary-aware serialized output training (BA-SOT), which explicitly
incorporates boundary knowledge into the decoder via a speaker change detection
task and boundary constraint loss. We also introduce a two-stage connectionist
temporal classification (CTC) strategy that incorporates token-level SOT CTC to
restore temporal context information. Besides the typical character error rate
(CER), we introduce utterance-dependent character error rate (UD-CER) to
further measure the precision of speaker change prediction. Compared to the
original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and leveraging a
pre-trained ASR model for BA-SOT model initialization further reduces
CER/UD-CER by 8.4%/19.9%.
Comment: Accepted by INTERSPEECH 202
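
To make the serialized-output idea above concrete, the following is a minimal Python sketch, not the authors' implementation. It builds an SOT-style target by joining per-speaker transcriptions with a special speaker change token, and computes a plain character error rate. The token string `<sc>`, the start-time ordering, and the function names `serialize_sot`, `edit_distance`, and `cer` are all illustrative assumptions. UD-CER is not attempted here because the abstract does not define it.

```python
# Hypothetical speaker change token; the abstract only says "a special token".
SC_TOKEN = "<sc>"


def serialize_sot(utterances):
    """Build a character-level SOT-style target sequence.

    `utterances` is a list of (start_time, text) pairs, one per speaker turn.
    Ordering turns by start time is an assumption of this sketch.
    """
    ordered = sorted(utterances, key=lambda u: u[0])
    tokens = []
    for i, (_, text) in enumerate(ordered):
        if i > 0:
            tokens.append(SC_TOKEN)  # mark the boundary between speakers
        tokens.extend(list(text))    # one token per character
    return tokens


def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(ref_tokens, hyp_tokens):
    """Character error rate, ignoring speaker change tokens."""
    ref = [t for t in ref_tokens if t != SC_TOKEN]
    hyp = [t for t in hyp_tokens if t != SC_TOKEN]
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    target = serialize_sot([(1.2, "cd"), (0.0, "ab")])
    print(target)
    print(cer(target, ["a", "b", "c", SC_TOKEN, "x"]))
```

Because `cer` strips the speaker change token before scoring, a hypothesis can place the boundary in the wrong position without any penalty. That blind spot is the kind of gap a boundary-sensitive metric such as UD-CER is meant to address.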