271 research outputs found
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
Eliminating the negative effect of non-stationary environmental noise is a
long-standing research topic for automatic speech recognition that stills
remains an important challenge. Data-driven supervised approaches, including
ones based on deep neural networks, have recently emerged as potential
alternatives to traditional unsupervised approaches and with sufficient
training, can alleviate the shortcomings of the unsupervised methods in various
real-life acoustic environments. In this light, we review recently developed,
representative deep learning approaches for tackling non-stationary additive
and convolutional degradation of speech with the aim of providing guidelines
for those involved in the development of environmentally robust speech
recognition systems. We separately discuss single- and multi-channel techniques
developed for the front-end and back-end of speech recognition systems, as well
as joint front-end and back-end training frameworks
MICROPHONE ARRAY OPTIMIZATION IN IMMERSIVE ENVIRONMENTS
The complex relationship between array gain patterns and microphone distributions limits the application of traditional optimization algorithms on irregular arrays, which show enhanced beamforming performance for human speech capture in immersive environments. This work analyzes the relationship between irregular microphone geometries and spatial filtering performance with statistical methods. Novel geometry descriptors are developed to capture the properties of irregular microphone distributions showing their impact on array performance. General guidelines and optimization methods for regular and irregular array design are proposed in immersive (near-field) environments to obtain superior beamforming ability for speech applications. Optimization times are greatly reduced through the objective function rules using performance-based geometric descriptions of microphone distributions that circumvent direct array gain computations over the space of interest. In addition, probabilistic descriptions of acoustic scenes are introduced to incorporate various levels of prior knowledge for the source distribution. To verify the effectiveness of the proposed optimization methods, simulated gain patterns and real SNR results of the optimized arrays are compared to corresponding traditional regular arrays and arrays obtained from direct exhaustive searching methods. Results show large SNR enhancements for the optimized arrays over arbitrary randomly generated arrays and regular arrays, especially at low microphone densities. The rapid convergence and acceptable processing times observed during the experiments establish the feasibility of proposed optimization methods for array geometry design in immersive environments where rapid deployment is required with limited knowledge of the acoustic scene, such as in mobile platforms and audio surveillance applications
Adaptive Algorithms for Intelligent Acoustic Interfaces
Modern speech communications are evolving towards a new direction which involves users in a more perceptive way. That is the immersive experience, which may be considered as the “last-mile” problem of telecommunications.
One of the main feature of immersive communications is the distant-talking,
i.e. the hands-free (in the broad sense) speech communications without bodyworn
or tethered microphones that takes place in a multisource environment where interfering signals may degrade the communication quality and the intelligibility of the desired speech source. In order to preserve speech quality intelligent acoustic interfaces may be used. An intelligent acoustic interface may comprise multiple microphones and loudspeakers and its peculiarity is to model the acoustic channel in order to adapt to user requirements and to environment conditions. This is the reason why intelligent acoustic interfaces are based on adaptive filtering algorithms.
The acoustic path modelling entails a set of problems which have to be taken into account in designing an adaptive filtering algorithm. Such problems may be basically generated by a linear or a nonlinear process and can be tackled respectively by linear or nonlinear adaptive algorithms.
In this work we consider such modelling problems and we propose novel effective adaptive algorithms that allow acoustic interfaces to be robust against any interfering signals, thus preserving the perceived quality of desired speech signals.
As regards linear adaptive algorithms, a class of adaptive filters based on the
sparse nature of the acoustic impulse response has been recently proposed.
We adopt such class of adaptive filters, named proportionate adaptive filters, and derive a general framework from which it is possible to derive any linear adaptive algorithm. Using such framework we also propose some efficient proportionate adaptive algorithms, expressly designed to tackle problems of a linear nature.
On the other side, in order to address problems deriving from a nonlinear process, we propose a novel filtering model which performs a nonlinear transformations by means of functional links. Using such nonlinear model, we propose functional link adaptive filters which provide an efficient solution to the modelling of a nonlinear acoustic channel.
Finally, we introduce robust filtering architectures based on adaptive combinations of filters that allow acoustic interfaces to more effectively adapt to environment conditions, thus providing a powerful mean to immersive speech communications
Adaptive Algorithms for Intelligent Acoustic Interfaces
Modern speech communications are evolving towards a new direction which involves users in a more perceptive way. That is the immersive experience, which may be considered as the “last-mile” problem of telecommunications.
One of the main feature of immersive communications is the distant-talking,
i.e. the hands-free (in the broad sense) speech communications without bodyworn
or tethered microphones that takes place in a multisource environment where interfering signals may degrade the communication quality and the intelligibility of the desired speech source. In order to preserve speech quality intelligent acoustic interfaces may be used. An intelligent acoustic interface may comprise multiple microphones and loudspeakers and its peculiarity is to model the acoustic channel in order to adapt to user requirements and to environment conditions. This is the reason why intelligent acoustic interfaces are based on adaptive filtering algorithms.
The acoustic path modelling entails a set of problems which have to be taken into account in designing an adaptive filtering algorithm. Such problems may be basically generated by a linear or a nonlinear process and can be tackled respectively by linear or nonlinear adaptive algorithms.
In this work we consider such modelling problems and we propose novel effective adaptive algorithms that allow acoustic interfaces to be robust against any interfering signals, thus preserving the perceived quality of desired speech signals.
As regards linear adaptive algorithms, a class of adaptive filters based on the
sparse nature of the acoustic impulse response has been recently proposed.
We adopt such class of adaptive filters, named proportionate adaptive filters, and derive a general framework from which it is possible to derive any linear adaptive algorithm. Using such framework we also propose some efficient proportionate adaptive algorithms, expressly designed to tackle problems of a linear nature.
On the other side, in order to address problems deriving from a nonlinear process, we propose a novel filtering model which performs a nonlinear transformations by means of functional links. Using such nonlinear model, we propose functional link adaptive filters which provide an efficient solution to the modelling of a nonlinear acoustic channel.
Finally, we introduce robust filtering architectures based on adaptive combinations of filters that allow acoustic interfaces to more effectively adapt to environment conditions, thus providing a powerful mean to immersive speech communications
PSD Estimation and Source Separation in a Noisy Reverberant Environment using a Spherical Microphone Array
In this paper, we propose an efficient technique for estimating individual
power spectral density (PSD) components, i.e., PSD of each desired sound source
as well as of noise and reverberation, in a multi-source reverberant sound
scene with coherent background noise. We formulate the problem in the spherical
harmonics domain to take the advantage of the inherent orthogonality of the
spherical harmonics basis functions and extract the PSD components from the
cross-correlation between the different sound field modes. We also investigate
an implementation issue that occurs at the nulls of the Bessel functions and
offer an engineering solution. The performance evaluation takes place in a
practical environment with a commercial microphone array in order to measure
the robustness of the proposed algorithm against all the deviations incurred in
practice. We also exhibit an application of the proposed PSD estimator through
a source septation algorithm and compare the performance with a contemporary
method in terms of different objective measures
Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation
This paper investigates deep neural networks (DNN) based on nonlinear feature mapping and statistical linear feature adaptation approaches for reducing reverberation in speech signals. In the nonlinear feature mapping approach, DNN is trained from parallel clean/distorted speech corpus to map reverberant and noisy speech coefficients (such as log magnitude spectrum) to the underlying clean speech coefficients. The constraint imposed by dynamic features (i.e., the time derivatives of the speech coefficients) are used to enhance the smoothness of predicted coefficient trajectories in two ways. One is to obtain the enhanced speech coefficients with a least square estimation from the coefficients and dynamic features predicted by DNN. The other is to incorporate the constraint of dynamic features directly into the DNN training process using a sequential cost function.
In the linear feature adaptation approach, a sparse linear transform, called cross transform, is used to transform multiple frames of speech coefficients to a new feature space. The transform is estimated to maximize the likelihood of the transformed coefficients given a model of clean speech coefficients. Unlike the DNN approach, no parallel corpus is used and no assumption on distortion types is made.
The two approaches are evaluated on the REVERB Challenge 2014 tasks. Both speech enhancement and automatic speech recognition (ASR) results show that the DNN-based mappings significantly reduce the reverberation in speech and improve both speech quality and ASR performance. For the speech enhancement task, the proposed dynamic feature constraint help to improve cepstral distance, frequency-weighted segmental signal-to-noise ratio (SNR), and log likelihood ratio metrics while moderately degrades the speech-to-reverberation modulation energy ratio. In addition, the cross transform feature adaptation improves the ASR performance significantly for clean-condition trained acoustic models.Published versio
- …