Convolutional Neural Networks to Enhance Coded Speech
Enhancing coded speech suffering from far-end acoustic background noise,
quantization noise, and potentially transmission errors, is a challenging task.
In this work we propose two postprocessing approaches applying convolutional
neural networks (CNNs) either in the time domain or the cepstral domain to
enhance the coded speech without any modification of the codecs. The time
domain approach follows an end-to-end fashion, while the cepstral domain
approach uses analysis-synthesis with cepstral domain features. The proposed
postprocessors in both domains are evaluated for various narrowband and
wideband speech codecs in a wide range of conditions. The proposed
postprocessor improves speech quality (PESQ) by up to 0.25 MOS-LQO points for
G.711, 0.30 points for G.726, 0.82 points for G.722, and 0.26 points for
adaptive multirate wideband codec (AMR-WB). In a subjective CCR listening test,
the proposed postprocessor on G.711-coded speech exceeds the speech quality of
an ITU-T-standardized postfilter by 0.36 CMOS points, and obtains a clear
preference of 1.77 CMOS points compared to legacy G.711, even better than
uncoded speech with statistical significance. The source code for the cepstral
domain approach to enhance G.711-coded speech is made available.
Comment: More analysis are added for version
Learning with Learned Loss Function: Speech Enhancement with Quality-Net to Improve Perceptual Evaluation of Speech Quality
Utilizing a human-perception-related objective function to train a speech
enhancement model has become a popular topic recently. The main reason is that
the conventional mean squared error (MSE) loss cannot represent auditory
perception well. One of the typical human-perception-related metrics, which is
the perceptual evaluation of speech quality (PESQ), has been proven to provide
a high correlation to the quality scores rated by humans. Owing to its complex
and non-differentiable properties, however, the PESQ function may not be used
to optimize speech enhancement models directly. In this study, we propose
optimizing the enhancement model with an approximated PESQ function, which is
differentiable and learned from the training data. The experimental results
show that the learned surrogate function can guide the enhancement model to
further boost the PESQ score (increase of 0.18 points compared to the results
trained with MSE loss) and maintain the speech intelligibility.
Comment: Accepted by IEEE Signal Processing Letters (SPL
MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement
Adversarial loss in a conditional generative adversarial network (GAN) is not
designed to directly optimize evaluation metrics of a target task, and thus,
may not always guide the generator in a GAN to generate data with improved
metric scores. To overcome this issue, we propose a novel MetricGAN approach
with an aim to optimize the generator with respect to one or multiple
evaluation metrics. Moreover, based on MetricGAN, the metric scores of the
generated data can also be arbitrarily specified by users. We tested the
proposed MetricGAN on a speech enhancement task, which is particularly suitable
to verify the proposed approach because there are multiple metrics measuring
different aspects of speech signals. Moreover, these metrics are generally
complex and could not be fully optimized by Lp or conventional adversarial
losses.
Comment: Accepted by Thirty-sixth International Conference on Machine Learning
(ICML) 201
AMRConvNet: AMR-Coded Speech Enhancement Using Convolutional Neural Networks
Speech is converted to digital signals using speech coding for efficient
transmission. However, this often lowers the quality and bandwidth of speech.
This paper explores the application of convolutional neural networks for
Artificial Bandwidth Expansion (ABE) and speech enhancement on coded speech,
particularly Adaptive Multi-Rate (AMR) used in 2G cellular phone calls. In this
paper, we introduce AMRConvNet: a convolutional neural network that performs
ABE and speech enhancement on speech encoded with AMR. The model operates
directly on the time-domain for both input and output speech but optimizes
using combined time-domain reconstruction loss and frequency-domain perceptual
loss. AMRConvNet resulted in an average improvement of 0.425 Mean Opinion Score
- Listening Quality Objective (MOS-LQO) points for AMR bitrate of 4.75k, and
0.073 MOS-LQO points for AMR bitrate of 12.2k. AMRConvNet also showed
robustness in AMR bitrate inputs. Finally, an ablation test showed that our
combined time-domain and frequency-domain loss leads to slightly higher MOS-LQO
and faster training convergence than using either loss alone.
Comment: IEEE SMC 202
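As a rough illustration of the combined objective mentioned in the AMRConvNet abstract, the sketch below adds a time-domain reconstruction term to a frequency-domain term. The concrete choices here (mean absolute error on the waveform, mean absolute difference of DFT magnitudes, and the weight `alpha`) are assumptions made for illustration only; the abstract does not state the paper's actual loss definitions.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Return the magnitude of each DFT bin of a real-valued sequence."""
    n = len(signal)
    mags = []
    for k in range(n):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(signal))
        mags.append(abs(acc))
    return mags


def time_loss(estimate, target):
    """Mean absolute error between waveforms (assumed time-domain term)."""
    return sum(abs(e - t) for e, t in zip(estimate, target)) / len(target)


def frequency_loss(estimate, target):
    """Mean absolute difference of DFT magnitudes (assumed frequency term)."""
    est_mag = dft_magnitudes(estimate)
    tgt_mag = dft_magnitudes(target)
    return sum(abs(e - t) for e, t in zip(est_mag, tgt_mag)) / len(tgt_mag)


def combined_loss(estimate, target, alpha=0.5):
    """Weighted sum of both terms; alpha is a hypothetical weight."""
    return (alpha * time_loss(estimate, target)
            + (1.0 - alpha) * frequency_loss(estimate, target))


# Toy signals: a short sine wave, an attenuated copy, and a sign-flipped copy.
target = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
attenuated = [0.8 * x for x in target]
flipped = [-x for x in target]

print(combined_loss(target, target))
print(combined_loss(attenuated, target))
# The flipped signal has the same magnitude spectrum as the target, so only
# the time-domain term penalises it.
print(frequency_loss(flipped, target), time_loss(flipped, target))
```

The sign-flipped example hints at why the two terms can complement each other: a magnitude-only frequency term ignores the sign error, while the time-domain term catches it.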
Speech Enhancement with Zero-Shot Model Selection
Recent research on speech enhancement (SE) has seen the emergence of deep
learning-based methods. It is still a challenging task to determine effective
ways to increase the generalizability of SE under diverse test conditions. In
this paper, we combine zero-shot learning and ensemble learning to propose a
zero-shot model selection (ZMOS) approach to increase the generalization of SE
performance. The proposed approach is realized in two phases, namely offline
and online phases. The offline phase clusters the entire set of training data
into multiple subsets, and trains a specialized SE model (termed component SE
model) with each subset. The online phase selects the most suitable component
SE model to carry out enhancement. Two selection strategies are developed:
selection based on quality score (QS) and selection based on quality embedding
(QE). Both QS and QE are obtained by a Quality-Net, a non-intrusive quality
assessment network. In the offline phase, the QS or QE of a training utterance
is used to group the training data into clusters. In the online phase, the QS
or QE of the test utterance is used to identify the appropriate component SE
model to perform enhancement on the test utterance. Experimental results have
confirmed that the proposed ZMOS approach can achieve better performance in
both seen and unseen noise types compared to the baseline systems, which
indicates the effectiveness of the proposed approach to provide robust SE
performance.
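The two-phase selection described in the ZMOS abstract can be sketched as follows, using the quality-score (QS) variant. This is a minimal sketch under stated assumptions: the scores are made-up numbers standing in for Quality-Net outputs, the clustering is a plain 1-D k-means (the abstract does not name the clustering method), and the component SE models are represented by placeholder strings.

```python
def kmeans_1d(scores, k, iterations=20):
    """Offline phase: cluster scalar quality scores with a plain 1-D k-means."""
    ordered = sorted(scores)
    # Initialise centroids spread across the sorted scores.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for s in scores:
            nearest = min(range(k), key=lambda i: abs(s - centroids[i]))
            groups[nearest].append(s)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(g) / len(g) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids


def select_model(test_score, centroids):
    """Online phase: index of the component model whose cluster is closest."""
    return min(range(len(centroids)),
               key=lambda i: abs(test_score - centroids[i]))


# Hypothetical quality scores of training utterances (made-up values).
train_scores = [1.1, 1.3, 1.2, 2.9, 3.1, 3.0]
centroids = kmeans_1d(train_scores, k=2)

# Stand-ins for the specialised component SE models, one per cluster.
component_models = ["model_low_quality", "model_high_quality"]

print(component_models[select_model(1.0, centroids)])
print(component_models[select_model(3.4, centroids)])
```

In a real system each cluster would own a trained enhancement network, and the test utterance's score would come from the same quality-assessment network used during clustering.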
Audio Codec Enhancement with Generative Adversarial Networks
Audio codecs are typically transform-domain based and efficiently code
stationary audio signals, but they struggle with speech and signals containing
dense transient events such as applause. Specifically, with these two classes
of signals as examples, we demonstrate a technique for restoring audio from
coding noise based on generative adversarial networks (GAN). A primary
advantage of the proposed GAN-based coded audio enhancer is that the method
operates end-to-end directly on decoded audio samples, eliminating the need to
design any manually-crafted frontend. Furthermore, the enhancement approach
described in this paper can improve the sound quality of low-bit rate coded
audio without any modifications to the existent standard-compliant encoders.
Subjective tests illustrate that the proposed enhancer improves the quality of
speech and difficult-to-code applause excerpts significantly.
Comment: Accepted to 45th IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), Barcelona, Spain, 04-08 May 202
Multichannel Speech Enhancement by Raw Waveform-mapping using Fully Convolutional Networks
In recent years, waveform-mapping-based speech enhancement (SE) methods have
garnered significant attention. These methods generally use a deep learning
model to directly process and reconstruct speech waveforms. Because both the
input and output are in waveform format, the waveform-mapping-based SE methods
can overcome the distortion caused by imperfect phase estimation, which may be
encountered in spectral-mapping-based SE systems. So far, most
waveform-mapping-based SE methods have focused on single-channel tasks. In this
paper, we propose a novel fully convolutional network (FCN) with Sinc and
dilated convolutional layers (termed SDFCN) for multichannel SE that operates
in the time domain. We also propose an extended version of SDFCN, called the
residual SDFCN (termed rSDFCN). The proposed methods are evaluated on two
multichannel SE tasks, namely the dual-channel inner-ear microphones SE task
and the distributed microphones SE task. The experimental results confirm the
outstanding denoising capability of the proposed SE systems on both tasks and
the benefits of using the residual architecture on the overall SE performance.
Comment: Accepted to IEEE/ACM Transactions on Audio, Speech and Language
Processin
Components Loss for Neural Networks in Mask-Based Speech Enhancement
Estimating time-frequency domain masks for single-channel speech enhancement
using deep learning methods has recently become a popular research field with
promising results. In this paper, we propose a novel components loss (CL) for
the training of neural networks for mask-based speech enhancement. During the
training process, the proposed CL offers separate control over preservation of
the speech component quality, suppression of the residual noise component, and
preservation of a naturally sounding residual noise component. We illustrate
the potential of the proposed CL by evaluating a standard convolutional neural
network (CNN) for mask-based speech enhancement. The new CL obtains a better
and more balanced performance in almost all employed instrumental quality
metrics over the baseline losses, the latter comprising the conventional mean
squared error (MSE) loss and also auditory-related loss functions, such as the
perceptual evaluation of speech quality (PESQ) loss and the recently proposed
perceptual weighting filter loss. Particularly, applying the CL offers better
speech component quality, better overall enhanced speech perceptual quality, as
well as a more naturally sounding residual noise. On average, for seen noise
types, the enhanced speech obtains a PESQ score that is at least 0.1 points
higher, together with an SNR improvement that is more than 0.5 dB higher. The
improvement is stronger for unseen noise types, where the PESQ score on the
enhanced speech is about 0.2 points higher and the output SNR is also ahead by
more than 0.5 dB. The newly proposed CL is easy to implement, and code is
provided at https://github.com/ifnspaml/Components-Loss
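The idea of controlling the speech and noise components separately can be illustrated with a simplified two-term sketch. It is not the loss from the paper or the linked repository: the names, the squared-error terms, and the `weight` parameter are hypothetical, and the third aspect named in the abstract (a naturally sounding residual noise) is omitted. The sketch assumes that, during training, the clean speech and noise magnitudes are available separately, so the estimated mask can be applied to each on its own.

```python
def apply_mask(mask, magnitudes):
    """Apply a time-frequency mask bin by bin."""
    return [m * x for m, x in zip(mask, magnitudes)]


def mean_square(values):
    """Average of the squared values."""
    return sum(v * v for v in values) / len(values)


def two_component_loss(mask, speech, noise, weight=0.5):
    """Penalise speech distortion and residual noise as separate terms.

    `weight` is a hypothetical trade-off parameter between the two terms.
    """
    filtered_speech = apply_mask(mask, speech)
    filtered_noise = apply_mask(mask, noise)
    # Term 1: how much the mask distorts the speech component.
    speech_distortion = mean_square(
        [f - s for f, s in zip(filtered_speech, speech)])
    # Term 2: how much noise power survives the mask.
    residual_noise = mean_square(filtered_noise)
    return weight * speech_distortion + (1.0 - weight) * residual_noise


# Toy magnitudes for four time-frequency bins (made-up values).
speech = [1.0, 0.0, 2.0, 0.0]
noise = [0.0, 1.0, 0.0, 1.0]

ideal_mask = [1.0, 0.0, 1.0, 0.0]   # keeps speech bins, removes noise bins
all_pass_mask = [1.0, 1.0, 1.0, 1.0]  # leaves all noise in
all_stop_mask = [0.0, 0.0, 0.0, 0.0]  # removes noise but also all speech

print(two_component_loss(ideal_mask, speech, noise))
print(two_component_loss(all_pass_mask, speech, noise))
print(two_component_loss(all_stop_mask, speech, noise))
```

Changing `weight` shifts the balance between preserving speech and suppressing noise, which is the kind of separate control that the abstract attributes to the components loss.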