Parallel Gated Neural Network With Attention Mechanism For Speech Enhancement
Deep learning algorithms are increasingly used for speech enhancement (SE). In
supervised methods, both global and local information is required for accurate
spectral mapping, yet a key restriction is often poor capture of key contextual
information. To leverage long-term contextual information for target speakers and
compensate for distortions in the enhanced speech, this paper adopts a sequence-to-sequence (S2S)
mapping structure and proposes a novel monaural speech enhancement system,
consisting of a Feature Extraction Block (FEB), a Compensation Enhancement
Block (ComEB) and a Mask Block (MB). In the FEB, a U-Net block extracts
abstract features from complex-valued spectra, with one path suppressing
background noise in the magnitude domain via masking methods; the MB takes
magnitude features from the FEB and compensates them with the complex-domain
features produced by the ComEB to restore the final enhanced speech.
Experiments are conducted on the Librispeech dataset, and the results show
that the proposed model outperforms recent models in terms of ESTOI and PESQ scores.
Comment: 5 pages, 6 figures, references added
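Since magnitude-domain masking is the core operation such a Mask Block performs, a minimal sketch may clarify it. The module below is a hypothetical stand-in (its GRU size and layout are assumptions), not the paper's FEB/MB architecture.

import torch
import torch.nn as nn

class MagnitudeMask(nn.Module):
    # Hypothetical masking module: predicts a bounded time-frequency
    # mask and applies it to the noisy magnitude spectrogram.
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_freq, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_freq)

    def forward(self, noisy_mag):            # (batch, frames, freq)
        h, _ = self.rnn(noisy_mag)
        mask = torch.sigmoid(self.proj(h))   # mask values in [0, 1]
        return mask * noisy_mag              # enhanced magnitude

# Usage: mask the noisy magnitude; a phase estimate (here it would be the
# noisy or compensated phase) is then needed before the inverse STFT.
wav = torch.randn(1, 16000)
spec = torch.stft(wav, n_fft=512, hop_length=128,
                  window=torch.hann_window(512), return_complex=True)
mag = spec.abs().transpose(1, 2)             # (1, frames, 257)
enhanced_mag = MagnitudeMask()(mag)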
Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement
Phase information has a significant impact on speech perceptual quality and
intelligibility. However, existing speech enhancement methods encounter
limitations in explicit phase estimation due to the non-structural nature and
wrapping characteristics of the phase, leading to a bottleneck in enhanced
speech quality. To overcome this issue, in this paper we propose
MP-SENet, a novel Speech Enhancement Network which explicitly enhances
Magnitude and Phase spectra in parallel. The proposed MP-SENet adopts a codec
architecture in which the encoder and decoder are bridged by time-frequency
Transformers along both time and frequency dimensions. The encoder aims to
encode time-frequency representations derived from the input distorted
magnitude and phase spectra. The decoder comprises dual-stream magnitude and
phase decoders, directly enhancing magnitude and wrapped phase spectra by
incorporating a magnitude estimation architecture and a phase parallel
estimation architecture, respectively. To train the MP-SENet model effectively,
we define multi-level loss functions, including mean square error and
perceptual metric loss of magnitude spectra, anti-wrapping loss of phase
spectra, as well as mean square error and consistency loss of short-time
complex spectra. Experimental results demonstrate that our proposed MP-SENet
excels in high-quality speech enhancement across multiple tasks, including
speech denoising, dereverberation, and bandwidth extension. Compared to
existing phase-aware speech enhancement methods, it successfully avoids the
bidirectional compensation effect between the magnitude and phase, leading to a
better harmonic restoration. Notably, for the speech denoising task, the
MP-SENet yields a state-of-the-art performance with a PESQ of 3.60 on the
public VoiceBank+DEMAND dataset.
Comment: Submitted to IEEE Transactions on Audio, Speech and Language Processing
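As a concrete illustration of the anti-wrapping idea, here is a minimal sketch of a phase loss that is invariant to 2π jumps. It follows the general principle stated in the abstract; the exact loss used in the paper may differ.

import torch

def anti_wrapping_loss(phase_pred, phase_true):
    # Map the raw phase error into the principal interval (-pi, pi]
    # so that a 2*pi wrap between prediction and target costs nothing.
    diff = phase_pred - phase_true
    wrapped = diff - 2 * torch.pi * torch.round(diff / (2 * torch.pi))
    return wrapped.abs().mean()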
A General Unfolding Speech Enhancement Method Motivated by Taylor's Theorem
While deep neural networks have facilitated significant advancements in the
field of speech enhancement, most existing methods are developed following
either empirical or relatively blind criteria, lacking adequate guidelines in
pipeline design. Inspired by Taylor's theorem, we propose a general unfolding
framework for both single- and multi-channel speech enhancement tasks.
Concretely, we formulate the complex spectrum recovery into the spectral
magnitude mapping in the neighborhood space of the noisy mixture, in which an
unknown sparse term is introduced and applied for phase modification in
advance. Based on that, the mapping function is decomposed into the
superimposition of the 0th-order and high-order polynomials in Taylor's series,
where the former coarsely removes the interference in the magnitude domain and
the latter progressively complements the remaining spectral detail in the
complex spectrum domain. In addition, we study the relation between adjacent
order terms and reveal that each high-order term can be recursively estimated
from its lower-order term; we then propose to evaluate each high-order term
with a surrogate function with trainable weights, so that the whole
system can be trained in an end-to-end manner. Given that the proposed
framework is devised based on Taylor's theorem, it possesses improved internal
flexibility. Extensive experiments are conducted on WSJ0-SI84, DNS-Challenge,
Voicebank+Demand, spatialized Librispeech, and L3DAS22 multi-channel speech
enhancement challenge datasets. Quantitative results show that the proposed
approach yields performance competitive with existing top-performing approaches
in terms of multiple objective metrics.
Comment: Submitted to TASLP, revised version, 17 pages
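The recursion between adjacent order terms can be sketched as follows. `term_net_factory` and the module structure are hypothetical placeholders for the paper's surrogate networks, intended only to show the superposition of recursively estimated terms.

import torch.nn as nn

class TaylorUnfoldSE(nn.Module):
    # Hypothetical unfolding skeleton: a 0th-order stage gives a coarse
    # magnitude-domain estimate, and each high-order term is derived
    # recursively from the previous term by a trainable surrogate.
    def __init__(self, order, term_net_factory):
        super().__init__()
        self.zeroth = term_net_factory()
        self.surrogates = nn.ModuleList(term_net_factory() for _ in range(order))

    def forward(self, noisy_spec):        # complex spectrum as RI channels
        term = self.zeroth(noisy_spec)    # 0th order: coarse interference removal
        out = term
        for surrogate in self.surrogates:
            term = surrogate(term)        # next term from its lower-order term
            out = out + term              # superposition of the series terms
        return out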
CMGAN: Conformer-Based Metric-GAN for Monaural Speech Enhancement
Convolution-augmented transformers (Conformers) have recently been proposed for
various speech-domain applications, such as automatic speech recognition (ASR)
and speech separation, as they can capture both local and global dependencies.
In this paper, we propose a conformer-based metric generative adversarial
network (CMGAN) for speech enhancement (SE) in the time-frequency (TF) domain.
The generator encodes the magnitude and complex spectrogram information using
two-stage conformer blocks to model both time and frequency dependencies. The
decoder then decouples the estimation into a magnitude mask decoder branch to
filter out unwanted distortions and a complex refinement branch to further
improve the magnitude estimation and implicitly enhance the phase information.
Additionally, we include a metric discriminator to alleviate metric mismatch by
optimizing the generator with respect to a corresponding evaluation score.
Objective and subjective evaluations illustrate that CMGAN shows superior
performance compared to state-of-the-art methods in three speech enhancement
tasks (denoising, dereverberation, and super-resolution). For instance,
quantitative denoising analysis on the Voice Bank+DEMAND dataset indicates
that CMGAN outperforms various previous models by a clear margin, i.e., a
PESQ of 3.41 and an SSNR of 11.10 dB.
Comment: 16 pages, 10 figures and 5 tables. arXiv admin note: text overlap with arXiv:2203.1514
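A hedged sketch of the metric-discriminator objective may help: a discriminator is trained to predict a normalized evaluation score (e.g., PESQ mapped to [0, 1]) from pairs of magnitude spectrograms, and the generator is then optimized to maximize that predicted score. `D` and the exact targets below are assumptions in the spirit of MetricGAN, not CMGAN's precise formulation.

import torch
import torch.nn.functional as F

def discriminator_loss(D, clean_mag, enhanced_mag, pesq_norm):
    # D learns to output 1 for a (clean, clean) pair and the true
    # normalized PESQ for a (clean, enhanced) pair.
    ones = torch.ones_like(pesq_norm)
    loss_clean = F.mse_loss(D(clean_mag, clean_mag), ones)
    loss_enh = F.mse_loss(D(clean_mag, enhanced_mag.detach()), pesq_norm)
    return loss_clean + loss_enh

def generator_metric_loss(D, clean_mag, enhanced_mag):
    # Push D's predicted score for the enhanced output toward the maximum,
    # so the generator is optimized w.r.t. the evaluation metric.
    score = D(clean_mag, enhanced_mag)
    return F.mse_loss(score, torch.ones_like(score))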
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a
long-standing research topic for automatic speech recognition that still
remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and, with sufficient
training, can alleviate the shortcomings of unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks.
TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation
We propose TF-GridNet for speech separation. The model is a novel multi-path
deep neural network (DNN) integrating full- and sub-band modeling in the
time-frequency (T-F) domain. It stacks several multi-path blocks, each
consisting of an intra-frame full-band module, a sub-band temporal module, and
a cross-frame self-attention module. It is trained to perform complex spectral
mapping, where the real and imaginary (RI) components of input signals are
stacked as features to predict target RI components. We first evaluate it on
monaural anechoic speaker separation. Without using data augmentation and
dynamic mixing, it obtains a state-of-the-art 23.5 dB improvement in
scale-invariant signal-to-distortion ratio (SI-SDR) on WSJ0-2mix, a standard
dataset for two-speaker separation. To show its robustness to noise and
reverberation, we evaluate it on monaural reverberant speaker separation using
the SMS-WSJ dataset and on noisy-reverberant speaker separation using WHAMR!,
and obtain state-of-the-art performance on both datasets. We then extend
TF-GridNet to multi-microphone conditions through multi-microphone complex
spectral mapping, and integrate it into a two-DNN system with a beamformer in
between (named MISO-BF-MISO in earlier studies), where the beamformer
proposed in this paper is a novel multi-frame Wiener filter computed based on
the outputs of the first DNN. State-of-the-art performance is obtained on the
multi-channel tasks of SMS-WSJ and WHAMR!. Besides speaker separation, we apply
the proposed algorithms to speech dereverberation and noisy-reverberant speech
enhancement. State-of-the-art performance is obtained on a dereverberation
dataset and on the dataset of the recent L3DAS22 multi-channel speech
enhancement challenge.
Comment: In submission. A sound demo is available at
https://zqwang7.github.io/demos/TF-GridNet-demo/index.htm
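Complex spectral mapping itself is compact to express. The sketch below uses a generic `dnn` as a stand-in for TF-GridNet and shows only the RI stacking and resynthesis; the STFT settings are assumptions.

import torch

def complex_spectral_mapping(dnn, noisy_wav, n_fft=512, hop=128):
    # Stack real/imaginary (RI) parts of the noisy STFT as input features,
    # predict the target RI components, and resynthesize the waveform.
    win = torch.hann_window(n_fft)
    spec = torch.stft(noisy_wav, n_fft, hop, window=win, return_complex=True)
    feats = torch.stack([spec.real, spec.imag], dim=1)  # (batch, 2, freq, frames)
    ri = dnn(feats)                                     # predicted RI, same shape
    est = torch.complex(ri[:, 0], ri[:, 1])
    return torch.istft(est, n_fft, hop, window=win)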