
    Spatial Diffuseness Features for DNN-Based Speech Recognition in Noisy and Reverberant Environments

    We propose a spatial diffuseness feature for deep neural network (DNN)-based automatic speech recognition to improve recognition accuracy in reverberant and noisy environments. The feature is computed in real-time from multiple microphone signals without requiring knowledge or estimation of the direction of arrival, and represents the relative amount of diffuse noise in each time and frequency bin. It is shown that using the diffuseness feature as an additional input to a DNN-based acoustic model leads to a reduced word error rate for the REVERB challenge corpus, both compared to logmelspec features extracted from noisy signals and to features enhanced by spectral subtraction. Comment: accepted for ICASSP201
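The paper's estimator is more sophisticated, but the idea can be illustrated with a common coherence-based proxy: in a diffuse field the magnitude-squared coherence (MSC) between two microphones is low, while a directional source yields MSC near one. The sketch below is not the paper's method; the function names and the recursive-averaging constant `alpha` are illustrative assumptions. It returns `1 - MSC` per time-frequency bin:

```python
import numpy as np

def stft_frames(x, win=256, hop=128):
    # Hann-windowed STFT: returns an array of shape (frames, win // 2 + 1).
    w = np.hanning(win)
    n = (len(x) - win) // hop + 1
    return np.array([np.fft.rfft(w * x[i * hop:i * hop + win]) for i in range(n)])

def diffuseness(x1, x2, win=256, hop=128, alpha=0.9):
    # Coherence-based diffuseness proxy: 1 - magnitude-squared coherence,
    # close to 1 for diffuse noise and close to 0 for a coherent source.
    X1, X2 = stft_frames(x1, win, hop), stft_frames(x2, win, hop)
    bins = X1.shape[1]
    p11 = np.zeros(bins)
    p22 = np.zeros(bins)
    p12 = np.zeros(bins, dtype=complex)
    out = np.empty(X1.shape)
    for t in range(X1.shape[0]):
        # Recursive (exponential) averaging of auto- and cross-spectra.
        p11 = alpha * p11 + (1 - alpha) * np.abs(X1[t]) ** 2
        p22 = alpha * p22 + (1 - alpha) * np.abs(X2[t]) ** 2
        p12 = alpha * p12 + (1 - alpha) * X1[t] * np.conj(X2[t])
        msc = np.abs(p12) ** 2 / (p11 * p22 + 1e-12)
        out[t] = 1.0 - msc
    return out
```

Feeding such a per-bin diffuseness map to the acoustic model alongside log-mel features is the spirit of the approach described in the abstract.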

    Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

    Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks.

    Environmentally robust ASR front-end for deep neural network acoustic models

    This paper examines the individual and combined impacts of various front-end approaches on the performance of deep neural network (DNN) based speech recognition systems in distant talking situations, where acoustic environmental distortion degrades the recognition performance. Training of a DNN-based acoustic model consists of generation of state alignments followed by learning the network parameters. This paper first shows that the network parameters are more sensitive to the speech quality than the alignments and thus this stage requires improvement. Then, various front-end robustness approaches to addressing this problem are categorised based on functionality. The degree to which each class of approaches impacts the performance of DNN-based acoustic models is examined experimentally. Based on the results, a front-end processing pipeline is proposed for efficiently combining different classes of approaches. Using this front-end, the combined effects of different classes of approaches are further evaluated in a single distant microphone-based meeting transcription task with both speaker independent (SI) and speaker adaptive training (SAT) set-ups. By combining multiple speech enhancement results, multiple types of features, and feature transformation, the front-end shows relative performance gains of 7.24% and 9.83% in the SI and SAT scenarios, respectively, over competitive DNN-based systems using log mel-filter bank features. This is the final version of the article. It first appeared from Elsevier via http://dx.doi.org/10.1016/j.csl.2014.11.00

    Morphological processing of a dynamic compressive gammachirp filterbank for automatic speech recognition

    Proceedings of: VII Jornadas en Tecnología del Habla and III Iberian SLTECH Workshop (IberSPEECH 2012). Madrid, 21-23 November 2012. The Dynamic Compressive Gammachirp filterbank is presented for auditory-inspired feature extraction in Automatic Speech Recognition. The proposed acoustic features combine spectral subtraction with a two-dimensional non-linear filtering technique most commonly employed in image processing: morphological filtering. These features have been proven to be more robust to noisy speech than those based on simpler auditory filterbanks, such as the classical mel-scaled triangular filterbank, the Gammatone filterbank, and the passive Gammachirp, on a noisy Isolet database. This work has been partially supported by the Spanish Ministry of Science and Innovation CICYT Projects No. TEC2008-06382/TEC and No. TEC2011-26807.
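The spectral subtraction step mentioned in this and the following abstracts can be sketched in a few lines. This is the textbook magnitude-domain form, not the exact configuration used in the papers; the `oversub` and `floor` parameter values are illustrative assumptions:

```python
import numpy as np

def spectral_subtract(mag, noise_mag, oversub=2.0, floor=0.01):
    # Magnitude-domain spectral subtraction with an over-subtraction factor
    # and a spectral floor that limits "musical noise" artefacts.
    clean = mag - oversub * noise_mag
    return np.maximum(clean, floor * mag)

def enhance(spec, n_noise_frames=10, oversub=2.0, floor=0.01):
    # Estimate the noise magnitude from the first frames (assumed speech-free),
    # then apply the subtraction as a real-valued gain so the noisy phase is kept.
    mag = np.abs(spec)
    noise_mag = mag[:n_noise_frames].mean(axis=0)
    gain = spectral_subtract(mag, noise_mag, oversub, floor) / np.maximum(mag, 1e-12)
    return gain * spec
```

In the papers above, a stage like this precedes the auditory filterbank and morphological filtering rather than serving as the whole enhancement chain.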

    Auditory-inspired morphological processing of speech spectrograms: applications in automatic speech recognition and speech enhancement

    New auditory-inspired speech processing methods are presented in this paper, combining spectral subtraction and two-dimensional non-linear filtering techniques originally conceived for image processing purposes. In particular, mathematical morphology operations, such as erosion and dilation, are applied to noisy speech spectrograms using specifically designed structuring elements inspired by the masking properties of the human auditory system. This is effectively complemented with a pre-processing stage including the conventional spectral subtraction procedure and auditory filterbanks. These methods were tested in both speech enhancement and automatic speech recognition tasks. For the former, time-frequency anisotropic structuring elements over grey-scale spectrograms were found to provide better perceptual quality than isotropic ones, proving more appropriate, under a number of perceptual quality estimation measures and several signal-to-noise ratios on the Aurora database, for retaining the structure of speech while removing background noise. For the latter, the combination of spectral subtraction and auditory-inspired morphological filtering was found to improve recognition rates in a noise-contaminated version of the Isolet database. This work has been partially supported by the Spanish Ministry of Science and Innovation CICYT Project No. TEC2008-06382/TEC.
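Grey-scale erosion and dilation over a spectrogram can be sketched directly in NumPy. This minimal version uses a flat structuring element, whereas the paper designs auditory-inspired, anisotropic ones; the function names and shapes below are illustrative assumptions:

```python
import numpy as np

def grey_erode(S, se):
    # Flat grey-scale erosion: minimum over the neighbourhood marked True
    # in the boolean structuring element se, with edge padding.
    ph, pw = se.shape[0] // 2, se.shape[1] // 2
    P = np.pad(S, ((ph, ph), (pw, pw)), mode='edge')
    out = np.empty_like(S)
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            win = P[i:i + se.shape[0], j:j + se.shape[1]]
            out[i, j] = win[se].min()
    return out

def grey_dilate(S, se):
    # Dilation is erosion of the negated image (duality for symmetric SEs).
    return -grey_erode(-S, se)

def opening(S, se):
    # Erosion followed by dilation: removes peaks narrower than the SE.
    return grey_dilate(grey_erode(S, se), se)
```

An opening with a small structuring element suppresses isolated time-frequency peaks narrower than the element, which is one way such filtering removes noise specks while keeping the broader structure of speech.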

    Morphologically filtered power-normalized cochleograms as robust, biologically inspired features for ASR

    In this paper, we present advances in the modeling of the masking behavior of the human auditory system (HAS) to enhance the robustness of the feature extraction stage in automatic speech recognition (ASR). The solution adopted is based on a nonlinear filtering of a spectro-temporal representation, applied simultaneously to both the frequency and time domains (as if it were an image) using mathematical morphology operations. A particularly important component of this architecture is the so-called structuring element (SE), which in the present contribution is designed as a single three-dimensional pattern using physiological facts, in such a way that it closely resembles the masking phenomena taking place in the cochlea. A proper choice of spectro-temporal representation lends validity to the model throughout the whole frequency spectrum and intensity span, accommodating the variability of the masking properties of the HAS in these two domains. The best results were achieved with the representation introduced as part of the power-normalized cepstral coefficients (PNCC) together with a spectral subtraction step. This method has been tested on the Aurora 2, Wall Street Journal and ISOLET databases, including both classical hidden Markov model (HMM) and hybrid artificial neural network (ANN)-HMM back-ends. In these, the proposed front-end analysis provides substantial and significant improvements compared to baseline techniques: up to 39.5% relative improvement compared to MFCC, and 18.7% compared to PNCC, on the Aurora 2 database. This contribution has been supported by an Airbus Defense and Space Grant (Open Innovation - SAVIER) and Spanish Government-CICYT projects TEC2014-53390-P and TEC2014-61729-EX.
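One ingredient of the PNCC representation mentioned above is easy to illustrate: replacing the log compression of MFCCs with a power-law nonlinearity (exponent 1/15 in the original PNCC design), which stays bounded as noise-dominated bins approach zero power. The helper below is a minimal sketch of that single step, not the full PNCC pipeline; the small floor value is an illustrative assumption:

```python
import numpy as np

def power_law_compress(power_spec, exponent=1.0 / 15.0):
    # PNCC-style power-law nonlinearity; unlike log compression it does not
    # diverge to -inf for near-zero, noise-dominated power values.
    return np.power(np.maximum(power_spec, 1e-12), exponent)
```

In a full front-end this step would sit between the (power-normalized) filterbank energies and the cepstral transform.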