Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a
long-standing research topic for automatic speech recognition that still
remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and with sufficient
training, can alleviate the shortcomings of the unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks.
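The additive and convolutional degradation models the overview distinguishes can be written as y[t] = x[t] + n[t] (additive noise) and y[t] = Σ_k h[k]·x[t−k] (channel or reverberation filtering). A minimal pure-Python sketch, with illustrative names (`clean`, `noise`, `room_ir`) and toy numbers, not taken from the paper:

```python
def add_noise(clean, noise):
    """Additive degradation: sample-wise sum of speech and noise."""
    return [s + n for s, n in zip(clean, noise)]

def convolve(clean, room_ir):
    """Convolutional degradation: filter speech with a room impulse response."""
    out = [0.0] * (len(clean) + len(room_ir) - 1)
    for t, s in enumerate(clean):
        for k, h in enumerate(room_ir):
            out[t + k] += s * h
    return out

clean = [1.0, 0.5, -0.5]
noisy = add_noise(clean, [0.1, -0.1, 0.2])
reverbed = convolve(clean, [1.0, 0.3])  # direct path plus one echo tap
```

Real front-ends operate on much longer sampled waveforms or their spectrograms, but the two degradation types reduce to exactly these two operations.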
Efficient Monaural Speech Enhancement using Spectrum Attention Fusion
Speech enhancement is a demanding task in automated speech processing
pipelines, focusing on separating clean speech from noisy channels.
Transformer-based models have recently outperformed RNN and CNN models in
speech enhancement, but they are much more computationally expensive and
require much more high-quality training data, which is often hard to come by.
In this paper, we present an improvement for speech enhancement models that
maintains the expressiveness of self-attention while significantly reducing
model complexity, which we have termed Spectrum Attention Fusion. We carefully
construct a convolutional module to replace several self-attention layers in a
speech Transformer, allowing the model to more efficiently fuse spectral
features. Our proposed model achieves results comparable to or better than
SOTA models with significantly fewer parameters (0.58M) on the Voice Bank +
DEMAND dataset.
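The core idea, replacing some self-attention layers with a convolution that fuses neighboring spectral frames, can be sketched as a 1-D convolution over the time axis of a spectrogram. This is a generic illustration of such a fusion module, not the paper's actual architecture; the function name, kernel, and toy spectrogram are all made up:

```python
def fuse_spectra(frames, kernel):
    """Fuse each spectral frame with its neighbors via a 1-D convolution
    over the time axis (zero padding at the edges). `frames` is a list of
    equal-length spectral vectors; `kernel` holds one weight per
    neighboring frame."""
    half = len(kernel) // 2
    n_bins = len(frames[0])
    fused = []
    for t in range(len(frames)):
        out = [0.0] * n_bins
        for k, w in enumerate(kernel):
            src = t + k - half
            if 0 <= src < len(frames):
                for b in range(n_bins):
                    out[b] += w * frames[src][b]
        fused.append(out)
    return fused

# Toy 4-frame, 2-bin "spectrogram"; a [0.25, 0.5, 0.25] kernel mixes each
# frame with its immediate neighbors, a cheap local stand-in for attention.
spec = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
smoothed = fuse_spectra(spec, [0.25, 0.5, 0.25])
```

Unlike self-attention, whose cost grows quadratically with the number of frames, this fixed-width fusion is linear in sequence length, which is the source of the efficiency gain such replacements aim for.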