DCCRN-KWS: an audio bias based model for noise robust small-footprint keyword spotting
Real-world complex acoustic environments especially the ones with a low
signal-to-noise ratio (SNR) will bring tremendous challenges to a keyword
spotting (KWS) system. Inspired by recent advances in neural speech
enhancement and context bias in speech recognition, we propose a robust audio
context bias based DCCRN-KWS model to address this challenge. We form the whole
architecture as a multi-task learning framework for both denoising and keyword
spotting, where the DCCRN encoder is connected with the KWS model. Aided by
the denoising task, we further introduce an audio context bias module to
leverage the real keyword samples and bias the network to better discriminate
keywords in noisy conditions. Feature merge and complex context linear modules
are also introduced to strengthen such discrimination and to effectively leverage
contextual information respectively. Experiments on the internal challenging
dataset and the HIMIYA public dataset show that our DCCRN-KWS system is
superior in performance, while ablation study demonstrates the good design of
the whole model.
Comment: Accepted by INTERSPEECH202
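The multi-task framework described above combines a denoising objective with a keyword-spotting objective. A minimal sketch of such a joint loss, assuming a simple MSE denoising term, a softmax cross-entropy KWS term, and a hypothetical trade-off weight `alpha` (not a value from the paper):

```python
import numpy as np

def multitask_loss(enhanced, clean, kws_logits, keyword_id, alpha=0.5):
    """Weighted sum of a denoising loss and a keyword-spotting loss.

    `alpha` is an illustrative trade-off weight, not taken from the paper.
    """
    # Denoising branch: mean squared error between enhanced and clean spectra.
    denoise_loss = np.mean((enhanced - clean) ** 2)
    # KWS branch: softmax cross-entropy over keyword classes.
    z = kws_logits - np.max(kws_logits)        # shift for numerical stability
    log_probs = z - np.log(np.sum(np.exp(z)))  # log-softmax
    kws_loss = -log_probs[keyword_id]
    return alpha * denoise_loss + (1.0 - alpha) * kws_loss
```

With a perfectly enhanced spectrum and confident correct keyword logits, both terms approach zero, so the combined loss does as well.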
Uformer: A Unet based dilated complex & real dual-path conformer network for simultaneous speech enhancement and dereverberation
Complex spectrum and magnitude are considered as two major features of speech
enhancement and dereverberation. Traditional approaches always treat these two
features separately, ignoring their underlying relationship. In this paper, we
propose Uformer, a Unet based dilated complex & real dual-path conformer
network in both complex and magnitude domain for simultaneous speech
enhancement and dereverberation. We exploit time attention (TA) and dilated
convolution (DC) to leverage local and global contextual information and
frequency attention (FA) to model dimensional information. These three
sub-modules contained in the proposed dilated complex & real dual-path
conformer module effectively improve the speech enhancement and dereverberation
performance. Furthermore, hybrid encoder and decoder are adopted to
simultaneously model the complex spectrum and magnitude and promote the
information interaction between two domains. Encoder decoder attention is also
applied to enhance the interaction between the encoder and decoder. In our
experiments, Uformer outperforms all SOTA time-domain and complex-domain
models objectively and subjectively. Specifically, it reaches 3.6032 DNSMOS on
the blind test set of Interspeech 2021 DNS Challenge, which outperforms all
top-performed models. We also carry out ablation experiments to assess the
contribution of each proposed sub-module.
Comment: Accepted by ICASSP 202
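The hybrid encoder described above models the complex spectrum and the magnitude jointly. One illustrative way to view such a hybrid input, assuming the two domains are simply stacked as channels before a learned encoder (a simplification of what the network actually does):

```python
import numpy as np

def hybrid_input(stft):
    """Stack real, imaginary and magnitude planes of a complex spectrogram
    into channels, so a single encoder can see both domains at once.

    An illustrative sketch only; the actual Uformer hybrid encoder/decoder
    is a learned network with cross-domain interaction.
    """
    real, imag = stft.real, stft.imag
    mag = np.abs(stft)                  # magnitude domain
    return np.stack([real, imag, mag])  # shape (3, freq, time)
```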
MBTFNet: Multi-Band Temporal-Frequency Neural Network For Singing Voice Enhancement
A typical neural speech enhancement (SE) approach mainly handles speech and
noise mixtures, which is not optimal for singing voice enhancement scenarios.
Music source separation (MSS) models treat vocals and various accompaniment
components equally, which may reduce performance compared to a model that
considers only vocal enhancement. In this paper, we propose a novel multi-band
temporal-frequency neural network (MBTFNet) for singing voice enhancement,
which particularly removes background music, noise and even backing vocals from
singing recordings. MBTFNet combines inter and intra-band modeling for better
processing of full-band signals. Dual-path modeling is introduced to expand
the receptive field of the model. We propose an implicit personalized
enhancement (IPE) stage based on signal-to-noise ratio (SNR) estimation, which
further improves the performance of MBTFNet. Experiments show that our proposed
model significantly outperforms several state-of-the-art SE and MSS models.
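The inter- and intra-band modeling above starts from splitting the full-band spectrogram into sub-bands. A minimal sketch of such an even band split, assuming the frequency axis divides evenly (MBTFNet's actual band layout may differ):

```python
import numpy as np

def split_bands(spec, n_bands):
    """Split a (freq, time) spectrogram into equal sub-bands, enabling
    intra-band modeling (within each band) and inter-band modeling
    (across bands). A simplified sketch of the multi-band front end.
    """
    n_freq, n_time = spec.shape
    assert n_freq % n_bands == 0, "frequency bins must divide evenly"
    return spec.reshape(n_bands, n_freq // n_bands, n_time)
```

Each sub-band can then be processed by its own temporal model before an inter-band module exchanges information across bands.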
Two-stage Neural Network for ICASSP 2023 Speech Signal Improvement Challenge
In ICASSP 2023 speech signal improvement challenge, we developed a dual-stage
neural model which improves speech signal quality induced by different
distortions in a stage-wise divide-and-conquer fashion. Specifically, in the
first stage, the speech improvement network focuses on recovering the missing
components of the spectrum, while in the second stage, our model aims to
further suppress noise, reverberation, and artifacts introduced by the
first-stage model. Achieving 0.446 in the final score and 0.517 in the P.835
score, our system ranks 4th in the non-real-time track.
Comment: Accepted by ICASSP 202
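The stage-wise divide-and-conquer design above is simply a cascade of two trained networks. A minimal sketch, with `restore_fn` and `suppress_fn` standing in for the two stage models (hypothetical names, not from the paper):

```python
import numpy as np

def two_stage_enhance(noisy_spec, restore_fn, suppress_fn):
    """Stage-wise divide and conquer: stage one recovers missing spectral
    components, stage two suppresses residual noise, reverberation, and
    artifacts introduced by stage one.
    """
    restored = restore_fn(noisy_spec)  # stage 1: fill in lost components
    return suppress_fn(restored)       # stage 2: clean up the result
```

The benefit of the split is that each stage optimizes a narrower objective than a single end-to-end model would.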
Preparation and Properties of sc-PLA/PMMA Transparent Nanofiber Air Filter
Particulate matter (PM) pollution is a serious concern for the environment and public health. To protect indoor air quality, nanofiber filters have been used to coat window screens owing to their high PM removal efficiency, transparency, and low air resistance. However, these materials have poor mechanical properties. In this study, electrostatic induction-assisted solution blowing was used to fabricate polylactide stereocomplex (sc-PLA), which served as a reinforcement that enhances physical cross-linking points, significantly restricting poly(methyl methacrylate) (PMMA) molecular chain motion and improving the mechanical properties of sc-PLA/PMMA nanofibers. Moreover, the introduction of sc-PLA led to the formation of a thick/thin composite nanofiber structure, which is beneficial for the mechanical properties. Thus, sc-PLA/PMMA air filters with ~83% transparency, 99.5% PM2.5 removal, and a 140% increase in mechanical properties were achieved when 5 wt % sc-PLA was added to PMMA. Hence, the addition of sc-PLA to transparent filters can effectively improve their performance.