Exploring Tradeoffs in Models for Low-latency Speech Enhancement
We explore a variety of neural network configurations for one- and
two-channel spectrogram-mask-based speech enhancement. Our best model improves
on previous state-of-the-art performance on the CHiME2 speech enhancement task
by 0.4 decibels in signal-to-distortion ratio (SDR). We examine trade-offs such
as non-causal look-ahead, computation, and parameter count versus enhancement
performance and find that zero-look-ahead models can achieve, on average,
within 0.03 dB SDR of our best bidirectional model. Further, we find that 200
milliseconds of look-ahead is sufficient to achieve equivalent performance to
our best bidirectional model.
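To make the spectrogram-mask setup concrete, the sketch below applies a real-valued time-frequency mask to the STFT of a noisy signal and inverts the result. It is a minimal numpy illustration, not the paper's system: the mask would come from a neural network (not shown), and the Hann window, 512-point FFT, and 128-sample hop are illustrative choices.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Short-time Fourier transform with a Hann analysis window."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.fft.rfft(np.stack(frames), axis=-1)

def istft(spec, n_fft=512, hop=128):
    """Weighted overlap-add inverse of the STFT above."""
    win = np.hanning(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=-1) * win
    out = np.zeros((spec.shape[0] - 1) * hop + n_fft)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + n_fft] += f
        norm[i * hop:i * hop + n_fft] += win ** 2
    # Normalize by the accumulated squared window to undo the overlap.
    return out / np.maximum(norm, 1e-8)

def enhance(noisy, mask):
    """Apply a real-valued T-F mask (here assumed given; in the paper it
    is predicted by a neural network) to the noisy spectrogram."""
    return istft(mask * stft(noisy))
```

With an all-ones mask the pipeline reduces to analysis followed by synthesis, which is a convenient sanity check that the STFT/ISTFT pair reconstructs the interior of the signal.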
Data augmentation and loss normalization for deep noise suppression
Speech enhancement using neural networks has recently received considerable
attention in research and is being integrated into commercial devices and
applications. In this work, we investigate data augmentation techniques for
supervised deep learning-based speech enhancement. We show that training is
regularized not only by augmenting SNR values over a broader range with a
continuous distribution, but also by augmenting the spectral and dynamic-level
diversity. However, so that level augmentation does not degrade training, we
propose modifying signal-based loss functions by applying sequence-level
normalization. Our experiments show that this normalization overcomes the
degradation caused by training on sequences with imbalanced signal levels
when a level-dependent loss function is used.

Comment: to appear in Proc. 22nd International Conference on Speech and
Computer (SPECOM), 202
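The two ideas above can be sketched as follows: mix clean speech and noise at an SNR drawn from a continuous distribution, apply a random level gain, and normalize both signals by the target's level before a level-dependent loss. This is a minimal numpy illustration under assumed parameter ranges; the paper's exact normalization and distributions may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_at_snr(clean, noise, snr_db):
    """Scale the noise so the mixture has the requested SNR in dB."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise

def augment(clean, noise, snr_range=(-5.0, 25.0), level_range_db=(-20.0, 0.0)):
    """Draw SNR and signal level from continuous uniform distributions
    (illustrative ranges, not the paper's)."""
    snr_db = rng.uniform(*snr_range)
    level_db = rng.uniform(*level_range_db)
    noisy = mix_at_snr(clean, noise, snr_db)
    g = 10 ** (level_db / 20)
    # Apply the same level gain to input and target so they stay aligned.
    return g * noisy, g * clean

def normalized_mae(est, target, eps=1e-8):
    """Sequence-level normalization: scale both signals by the target's
    RMS level before evaluating a level-dependent loss (here MAE)."""
    scale = np.sqrt(np.mean(target ** 2)) + eps
    return np.mean(np.abs(est / scale - target / scale))
```

The normalization makes the loss (approximately) invariant to the level gain introduced by augmentation, which is the property that prevents loud sequences from dominating training.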
A consolidated view of loss functions for supervised deep learning-based speech enhancement
Deep learning-based speech enhancement for real-time applications has
recently made large advances. Due to the lack of a tractable perceptual
optimization target, many myths around training losses have emerged, while
the contribution of the loss function to success has in many cases not been
investigated in isolation from other factors such as network architecture,
features, or training procedures. In this work, we investigate a wide variety
of spectral loss functions for a recurrent neural network architecture suited
to online, frame-by-frame processing. We relate magnitude-only and phase-aware
losses, ratios, correlation metrics, and compressed metrics. Our results reveal
that combining magnitude-only with phase-aware objectives always leads to
improvements, even when the phase is not enhanced. Furthermore, using
compressed spectral values also yields a significant improvement. On the other
hand, phase-sensitive improvement is best achieved by linear domain losses such
as mean absolute error.
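To make the loss families concrete, the sketch below combines a power-law-compressed magnitude-only term with a compressed phase-aware (complex) term, in the spirit of the combined objectives the abstract describes. The compression exponent `c` and mixing weight `alpha` are illustrative assumptions, not the paper's values.

```python
import numpy as np

def compressed_spectral_loss(est_spec, ref_spec, c=0.3, alpha=0.5):
    """Blend a compressed magnitude-only loss with a compressed
    phase-aware (complex) loss on STFT coefficients."""
    est_mag, ref_mag = np.abs(est_spec), np.abs(ref_spec)
    # Magnitude-only term on power-law-compressed magnitudes.
    mag_loss = np.mean((est_mag ** c - ref_mag ** c) ** 2)
    # Phase-aware term: complex spectra with compressed magnitudes
    # but the original phases.
    est_c = est_mag ** c * np.exp(1j * np.angle(est_spec))
    ref_c = ref_mag ** c * np.exp(1j * np.angle(ref_spec))
    phase_loss = np.mean(np.abs(est_c - ref_c) ** 2)
    return alpha * mag_loss + (1 - alpha) * phase_loss
```

Setting `alpha=1` recovers a pure magnitude objective, so the single function covers both ends of the magnitude-only vs. phase-aware trade-off discussed above.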