Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a
long-standing research topic for automatic speech recognition that still
remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and, with sufficient
training, can alleviate the shortcomings of the unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks.
Towards Unified All-Neural Beamforming for Time and Frequency Domain Speech Separation
Recently, frequency domain all-neural beamforming methods have achieved
remarkable progress for multichannel speech separation. In parallel, the
integration of time domain network structures and beamforming has also gained
significant attention. This study proposes a novel all-neural beamforming
method in time domain and makes an attempt to unify the all-neural beamforming
pipelines for time domain and frequency domain multichannel speech separation.
The proposed model consists of two modules: separation and beamforming. Both
modules perform temporal-spectral-spatial modeling and are trained
end-to-end using a joint loss function. The novelty of this study is twofold.
Firstly, a time domain directional feature conditioned on the direction
of the target speaker is proposed, which can be jointly optimized within the
time domain architecture to enhance target signal estimation. Secondly, an
all-neural beamforming network in time domain is designed to refine the
pre-separated results. This module features parametric time-variant
beamforming coefficient estimation, without explicitly following the derivation
of optimal filters that may lead to an upper bound. The proposed method is
evaluated on simulated reverberant overlapped speech data derived from the
AISHELL-1 corpus. Experimental results demonstrate significant performance
improvements over frequency domain state-of-the-art methods, ideal magnitude
masks, and existing time domain neural beamforming methods.
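
To make the idea of time-variant beamforming coefficients concrete, the sketch below applies a separate set of FIR filter taps to every frame of every microphone channel and sums the results (frame-wise filter-and-sum). It is only an illustration under stated assumptions, not the method from the abstract: the function name, the array shapes, and the use of NumPy are hypothetical, and the coefficients are taken as given, whereas the abstract describes a network that estimates them.

```python
import numpy as np


def apply_time_variant_beamformer(frames, weights):
    """Frame-wise filter-and-sum beamforming (illustrative sketch).

    frames:  array of shape (channels, num_frames, frame_len),
             the framed multichannel signal.
    weights: array of shape (channels, num_frames, taps),
             one FIR filter per channel and per frame. In a neural
             beamformer these would be predicted by a network; here
             they are simply an input (assumption).
    Returns an array of shape (num_frames, frame_len).
    """
    channels, num_frames, frame_len = frames.shape
    out = np.zeros((num_frames, frame_len))
    for t in range(num_frames):
        for c in range(channels):
            # Filter channel c of frame t with that frame's own taps,
            # then truncate to the frame length and accumulate.
            filtered = np.convolve(frames[c, t], weights[c, t], mode="full")
            out[t] += filtered[:frame_len]
    return out


# Usage: two identical channels, each weighted by a 0.5 unit impulse,
# so the beamformed output reproduces the single-channel signal.
rng = np.random.default_rng(0)
single = rng.standard_normal((1, 4, 16))
frames = np.tile(single, (2, 1, 1))
weights = np.zeros((2, 4, 3))
weights[:, :, 0] = 0.5
output = apply_time_variant_beamformer(frames, weights)
```

Because the taps are indexed by frame, the spatial filter can change over time, which is the property the term "time-variant" refers to; a time-invariant beamformer would reuse one filter per channel for all frames.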