Complexity Scaling for Speech Denoising
Computational complexity is critical when deploying deep learning-based
speech denoising models for on-device applications. Most prior research has
focused on optimizing model architectures to meet specific computational cost
constraints, often designing a distinct neural network for each complexity
budget. This study investigates complexity scaling for speech denoising,
aiming to consolidate models of various complexities into a unified
unified architecture. We present a Multi-Path Transform-based (MPT)
architecture to handle both low- and high-complexity scenarios. A series of
MPT networks achieves strong performance across a wide range of computational
complexities on the DNS challenge dataset. Moreover, inspired by the scaling
experiments in natural language processing, we explore the empirical
relationship between model performance and computational cost on the denoising
task. As the number of multiply-accumulate operations (MACs) is scaled from
50M/s to 15G/s on MPT networks, we observe that PESQ-WB and SI-SNR increase
linearly with the logarithm of MACs, a trend that may aid the understanding
and application of complexity scaling in speech denoising tasks.
Comment: Submitted to ICASSP202
Ultra Dual-Path Compression For Joint Echo Cancellation And Noise Suppression
Echo cancellation and noise reduction are essential for full-duplex
communication, yet most existing neural networks have high computational costs
and are inflexible in tuning model complexity. In this paper, we introduce
time-frequency dual-path compression to achieve a wide range of compression
ratios on computational cost. Specifically, for frequency compression,
trainable filters are used to replace manually designed filters for dimension
reduction. For time compression, using frame-skipped prediction alone causes
large performance degradation, which can be alleviated by a post-processing
network with full sequence modeling. We find that, at a fixed compression
ratio, dual-path compression combining the time and frequency methods yields
further performance improvement, covering compression ratios from 4x to 32x
with little change in model size. Moreover, the proposed models show
competitive performance compared with fast FullSubNet and DeepFilterNet. A demo
page can be found at
hangtingchen.github.io/ultra_dual_path_compression.github.io/.
Comment: Accepted by Interspeech 202
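The frequency branch replaces manually designed filters with trainable ones for dimension reduction. A minimal sketch, assuming the filterbank can be represented as a single projection matrix; here the weights are fixed to adjacent-bin averaging as a stand-in for learned values:

```python
def compress_frequency(frame, weights):
    """Project an F-bin spectral frame to F/c bins via a filter matrix.

    In the paper these filters are trainable; a fixed averaging matrix
    is used here only to show the dimension change.
    """
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]

F, c = 16, 4  # 16 frequency bins, 4x compression
# Each output bin pools c adjacent input bins with equal weight
weights = [[1.0 / c if j // c == i else 0.0 for j in range(F)]
           for i in range(F // c)]

frame = [float(j) for j in range(F)]       # toy magnitude spectrum
compressed = compress_frequency(frame, weights)  # 16 bins -> 4 bins
```

A learned version would simply make `weights` a trainable parameter updated by backpropagation, which is the substitution the abstract describes.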
Bayes Risk Transducer: Transducer with Controllable Alignment Prediction
Automatic speech recognition (ASR) based on transducers is widely used. In
training, a transducer maximizes the summed posteriors of all paths. The path
with the highest posterior is commonly defined as the predicted alignment
between the speech and the transcription. While the vanilla transducer has no
prior preference among the valid paths, this work enforces preferred paths to
achieve controllable alignment prediction.
Specifically, this work proposes the Bayes Risk Transducer (BRT), which uses a
Bayes risk function to assign lower risk values to preferred paths so that the
predicted alignment is more likely to satisfy specific desired properties. We
further demonstrate that these predicted alignments with intentionally designed
properties can provide practical advantages over the vanilla transducer.
Experimentally, the proposed BRT saves inference cost by up to 46% for
non-streaming ASR and reduces overall system latency by 41% for streaming ASR.
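The core idea, as described, is to re-weight valid alignment paths with a risk term so that low-risk paths are preferred. A toy decision-rule sketch, assuming risk is simply proportional to the token's emission frame; the actual BRT modifies the training objective, not just path selection:

```python
import math

def pick_alignment(paths, risk_fn=None):
    """Select the predicted alignment.

    paths: list of (alignment, posterior). With risk_fn=None this is
    the vanilla rule (highest posterior); with a risk function, each
    posterior is scaled by exp(-risk) so low-risk paths win. A toy
    illustration, not the paper's loss.
    """
    def score(item):
        alignment, posterior = item
        if risk_fn is None:
            return posterior
        return posterior * math.exp(-risk_fn(alignment))
    return max(paths, key=score)[0]

# Toy alignments: the frame index at which a single token is emitted
paths = [((3,), 0.45), ((1,), 0.40), ((5,), 0.15)]

vanilla = pick_alignment(paths)  # highest raw posterior
# Hypothetical risk favoring early emission (useful for streaming)
early = pick_alignment(paths, risk_fn=lambda a: 0.2 * a[0])
```

With the early-emission risk, the path emitting at frame 1 overtakes the higher-posterior path emitting at frame 3, which is the kind of controllable preference the abstract describes.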
AutoPrep: An Automatic Preprocessing Framework for In-the-Wild Speech Data
Recently, the utilization of extensive open-sourced text data has
significantly advanced the performance of text-based large language models
(LLMs). However, the use of in-the-wild large-scale speech data in the speech
technology community remains constrained. One reason for this limitation is
that a considerable amount of publicly available speech data is compromised
by background noise, overlapping speech, missing segmentation information,
missing speaker labels, and incomplete transcriptions, all of which limit its
usefulness. On the other hand, human annotation of speech
data is both time-consuming and costly. To address this issue, we introduce an
automatic in-the-wild speech data preprocessing framework (AutoPrep) in this
paper, which is designed to enhance speech quality, generate speaker labels,
and produce transcriptions automatically. The proposed AutoPrep framework
comprises six components: speech enhancement, speech segmentation, speaker
clustering, target speech extraction, quality filtering, and automatic speech
recognition. Experiments conducted on the open-sourced WenetSpeech and our
self-collected AutoPrepWild corpora demonstrate that the proposed AutoPrep
framework can generate preprocessed data with DNSMOS and PDNSMOS scores
comparable to those of several open-sourced TTS datasets. The corresponding
TTS system achieves up to 0.68 in-domain speaker similarity.
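The six-component pipeline described above can be sketched as a simple chain of stage functions; the stages below are trivial placeholders standing in for the actual neural components:

```python
def run_autoprep(utterance, stages):
    """Run raw speech metadata through the AutoPrep stages in order.

    Placeholder sketch: each stage is a function taking and returning
    an utterance dict; real stages are models (enhancement, ASR, ...).
    """
    for name, stage_fn in stages:
        utterance = stage_fn(utterance)
        utterance.setdefault("history", []).append(name)
    return utterance

# Hypothetical stand-ins, in the order the abstract lists the components
stages = [
    ("speech_enhancement",       lambda u: {**u, "denoised": True}),
    ("speech_segmentation",      lambda u: {**u, "segments": [(0.0, 2.5)]}),
    ("speaker_clustering",       lambda u: {**u, "speaker": "spk0"}),
    ("target_speech_extraction", lambda u: {**u, "extracted": True}),
    ("quality_filtering",        lambda u: {**u, "kept": True}),
    ("asr",                      lambda u: {**u, "transcript": "hello"}),
]

result = run_autoprep({"audio": "clip.wav"}, stages)
```

The ordering matters: enhancement and segmentation must precede clustering and extraction, which in turn condition what the quality filter and ASR stage operate on.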