A Deep Learning Loss Function Based on the Perceptual Evaluation of the Speech Quality
This letter proposes a perceptual metric for speech quality evaluation which is suitable, as a loss function, for training deep learning methods. The metric, derived from the perceptual evaluation of speech quality (PESQ) algorithm, is computed on a per-frame basis from the power spectra of the reference and processed speech signals. Two disturbance terms, which account for distortion once auditory masking and threshold effects are factored in, amend the mean square error (MSE) loss function by introducing perceptual criteria based on human psychoacoustics. The proposed loss function is evaluated for noisy speech enhancement with deep neural networks. Experimental results show that our metric achieves significant gains in speech quality (evaluated using an objective metric and a listening test) compared to MSE and other perceptual loss functions from the literature.
Funding: Spanish MINECO/FEDER (Grant Number: TEC2016-80141-P); Spanish Ministry of Education through the National Program FPU (Grant Number: FPU15/04161); NVIDIA Corporation with the donation of a Titan X GPU.
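As an illustration of the idea in this abstract, a toy per-frame disturbance term can be sketched as follows. This is a hypothetical simplification, not the actual PESQ-derived algorithm: the masking model here is just a fixed offset below the reference power, so spectral differences that stay under that threshold contribute nothing to the loss.

```python
import numpy as np

def perceptual_frame_loss(ref_power, proc_power, masking_offset_db=10.0):
    """Toy per-frame perceptual disturbance, loosely inspired by the
    PESQ-style processing described above (a simplified illustration,
    not the published algorithm)."""
    # Symmetric disturbance: absolute power-spectrum difference per bin.
    diff = np.abs(proc_power - ref_power)
    # Crude masking threshold: a fixed offset (in dB) below the reference power.
    mask = ref_power * 10 ** (-masking_offset_db / 10)
    # Only differences exceeding the masking threshold count as audible.
    audible = np.maximum(diff - mask, 0.0)
    return float(np.mean(audible))
```

A small perturbation of the processed spectrum that stays below the masking offset yields zero loss, which is the qualitative behaviour the disturbance terms are meant to capture.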
SEGAN: Speech Enhancement Generative Adversarial Network
Current speech enhancement techniques operate on the spectral domain and/or
exploit some higher-level feature. The majority of them tackle a limited number
of noise conditions and rely on first-order statistics. To circumvent these
issues, deep networks are being increasingly used, thanks to their ability to
learn complex functions from large example sets. In this work, we propose the
use of generative adversarial networks for speech enhancement. In contrast to
current techniques, we operate at the waveform level, training the model
end-to-end, and incorporate 28 speakers and 40 different noise conditions into
the same model, such that model parameters are shared across them. We evaluate
the proposed model using an independent, unseen test set with two speakers and
20 alternative noise conditions. The enhanced samples confirm the viability of
the proposed model, and both objective and subjective evaluations confirm its
effectiveness. With this, we open the exploration of generative
architectures for speech enhancement, which may progressively incorporate
further speech-centric design choices to improve their performance.
Comment: 5 pages, 4 figures, accepted at INTERSPEECH 2017
HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech Deep Features in Adversarial Networks
Real-world audio recordings are often degraded by factors such as noise,
reverberation, and equalization distortion. This paper introduces HiFi-GAN, a
deep learning method to transform recorded speech to sound as though it had
been recorded in a studio. We use an end-to-end feed-forward WaveNet
architecture, trained with multi-scale adversarial discriminators in both the
time domain and the time-frequency domain. It relies on the deep feature
matching losses of the discriminators to improve the perceptual quality of
enhanced speech. The proposed model generalizes well to new speakers, new
speech content, and new environments. It significantly outperforms
state-of-the-art baseline methods in both objective and subjective experiments.
Comment: Accepted by INTERSPEECH 2020
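The deep feature matching loss mentioned in this abstract is, in general form, a distance between discriminator activations for clean and enhanced speech. The sketch below uses a mean L1 distance averaged over layers; the layer selection and weighting are assumptions for illustration, not the exact HiFi-GAN recipe:

```python
import numpy as np

def feature_matching_loss(real_feats, fake_feats):
    """Deep feature matching: mean L1 distance between discriminator
    activations for real (clean) and generated (enhanced) speech,
    averaged over layers. `real_feats` / `fake_feats` are lists of
    same-shaped activation arrays, one per discriminator layer."""
    per_layer = [np.mean(np.abs(r - f)) for r, f in zip(real_feats, fake_feats)]
    return float(np.mean(per_layer))
```

Matching intermediate features, rather than only the final real/fake score, gives the generator a denser training signal that correlates better with perceptual quality.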
A Comparison of Perceptually Motivated Loss Functions for Binary Mask Estimation in Speech Separation
This work proposes and compares perceptually motivated loss functions for deep-learning-based binary mask estimation for speech separation. Previous loss functions have focused on maximising the classification accuracy of mask estimation, but we now propose loss functions that aim to maximise the hit minus false-alarm (HIT-FA) rate, which is known to correlate more closely with speech intelligibility. The baseline loss function is binary cross-entropy (CE), a standard loss function used in binary mask estimation, which maximises classification accuracy. We first propose a loss function that maximises the HIT-FA rate instead of classification accuracy. We then propose a second loss function that is a hybrid between CE and HIT-FA, providing a balance between classification accuracy and HIT-FA rate. Evaluations of the perceptually motivated loss functions on the GRID database show improvements to HIT-FA rate and ESTOI across babble and factory noises. Further tests then explore application of the perceptually motivated loss functions to a larger-vocabulary dataset.
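The HIT-FA criterion above can be sketched directly: given the ideal binary mask (IBM) and an estimated binary mask, HIT is the fraction of target-dominated time-frequency units correctly retained and FA is the fraction of noise-dominated units wrongly retained. This is the standard metric definition; how it is relaxed into a differentiable loss is specific to the paper and not shown here.

```python
import numpy as np

def hit_fa_rate(ibm, est_mask):
    """HIT-FA for binary mask estimation. HIT: fraction of
    target-dominated units (IBM == 1) the estimate keeps; FA: fraction
    of noise-dominated units (IBM == 0) the estimate wrongly keeps.
    Both inputs are 0/1 arrays of the same shape."""
    ibm = ibm.astype(bool)
    est = est_mask.astype(bool)
    hit = np.mean(est[ibm]) if ibm.any() else 0.0
    fa = np.mean(est[~ibm]) if (~ibm).any() else 0.0
    return float(hit - fa)
```

A perfect mask scores 1.0, while a trivial all-ones (or all-zeros) mask scores 0.0, which is why HIT-FA is harder to game than raw classification accuracy.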