
    Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

    Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks.
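    The additive and convolutional degradations surveyed above are commonly modeled as y[n] = (x * h)[n] + v[n], where h is a room impulse response and v additive noise. A minimal simulation sketch; the signal, impulse response, and noise level below are made-up placeholders, not values from the survey:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1000)           # stand-in for a clean speech frame
h = np.array([1.0, 0.6, 0.3, 0.1])      # short, assumed room impulse response
v = 0.1 * rng.standard_normal(1000 + len(h) - 1)   # additive noise

# Convolutional degradation (reverberation) plus additive noise:
# y[n] = (x * h)[n] + v[n]
y = np.convolve(x, h) + v
```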

    Morphologically filtered power-normalized cochleograms as robust, biologically inspired features for ASR

    In this paper, we present advances in the modeling of the masking behavior of the human auditory system (HAS) to enhance the robustness of the feature extraction stage in automatic speech recognition (ASR). The solution adopted is based on a nonlinear filtering of a spectro-temporal representation, applied simultaneously to both the frequency and time domains, as if it were an image, using mathematical morphology operations. A particularly important component of this architecture is the so-called structuring element (SE), which in the present contribution is designed as a single three-dimensional pattern based on physiological facts, in such a way that it closely resembles the masking phenomena taking place in the cochlea. A proper choice of spectro-temporal representation lends validity to the model throughout the whole frequency spectrum and intensity spans, accounting for the variability of the masking properties of the HAS in these two domains. The best results were achieved with the representation introduced as part of the power-normalized cepstral coefficients (PNCC), together with a spectral subtraction step. This method has been tested on the Aurora 2, Wall Street Journal and ISOLET databases, including both classical hidden Markov model (HMM) and hybrid artificial neural network (ANN)-HMM back-ends. In these, the proposed front-end analysis provides substantial and significant improvements compared to baseline techniques: up to 39.5% relative improvement compared to MFCC, and 18.7% compared to PNCC, on the Aurora 2 database. This contribution has been supported by an Airbus Defense and Space grant (Open Innovation - SAVIER) and Spanish Government-CICYT projects TEC2014-53390-P and TEC2014-61729-EX.
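    The nonlinear spectro-temporal filtering described above can be sketched as a grey-scale morphological opening of a cochleogram treated as an image. The toy cochleogram and flat 3×5 structuring element below are illustrative placeholders, not the physiologically designed three-dimensional SE of the paper:

```python
import numpy as np
from scipy import ndimage

# Hypothetical cochleogram: 40 frequency channels x 200 time frames,
# containing a steady tonal band plus sparse impulsive noise specks.
rng = np.random.default_rng(0)
cochleogram = np.zeros((40, 200))
cochleogram[18:22, :] = 1.0                    # steady tonal component
specks = rng.random(cochleogram.shape) > 0.98  # ~2% impulsive noise
cochleogram[specks] = 1.0

# Grey-scale opening (erosion then dilation) with a flat 3x5 structuring
# element spanning frequency (rows) and time (cols): components smaller
# than the element are suppressed, while the broader tone band survives.
filtered = ndimage.grey_opening(cochleogram, size=(3, 5))
```

Opening is anti-extensive (the output never exceeds the input), which is what lets it remove small-scale "unmasked" energy while leaving the dominant spectro-temporal structure intact.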

    ASR Feature Extraction with Morphologically-Filtered Power-Normalized Cochleograms

    Proceedings of: 15th Annual Conference of the International Speech Communication Association. Singapore, September 14-18, 2014. In this paper we present advances in the modeling of the masking behavior of the Human Auditory System to enhance the robustness of the feature extraction stage in Automatic Speech Recognition. The solution adopted is based on a non-linear filtering of a spectro-temporal representation, applied simultaneously to both the frequency and time domains by processing it with mathematical morphology operations, as if it were an image. A particularly important component of this architecture is the so-called structuring element: biologically based considerations are addressed in the present contribution to design an element that closely resembles the masking phenomena taking place in the cochlea. The second feature of this contribution is the choice of the underlying spectro-temporal representation. The best results were achieved by the representation introduced as part of the Power Normalized Cepstral Coefficients, together with a spectral subtraction step. On the Aurora 2 noisy continuous digits task, we report relative error reductions of 18.7% compared to PNCC and 39.5% compared to MFCC. This contribution has been supported by an Airbus Defense and Space grant (Open Innovation - SAVIER) and Spanish Government-CICYT project 2011-26807/TEC.
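    The figures quoted above are relative error reductions, i.e. the fraction of the baseline word error rate that the new front-end removes. The WER values in the sketch below are made up purely to illustrate the arithmetic, not results from the paper:

```python
def relative_error_reduction(baseline_wer: float, new_wer: float) -> float:
    """Percent of the baseline word error rate eliminated by the new system."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Illustrative (invented) WERs: a drop from 20.0% to 12.1% WER corresponds
# to the 39.5% relative reduction reported against the MFCC baseline.
print(round(relative_error_reduction(20.0, 12.1), 1))  # prints 39.5
```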

    Acoustic Simulations of Cochlear Implants in Human and Machine Hearing Research

    Determination and evaluation of clinically efficient stopping criteria for the multiple auditory steady-state response technique

    Background: Although the auditory steady-state response (ASSR) technique utilizes objective statistical detection algorithms to estimate behavioural hearing thresholds, the audiologist still has to decide when to terminate ASSR recordings, which once more introduces a certain degree of subjectivity. Aims: The present study aimed at establishing clinically efficient stopping criteria for a multiple 80-Hz ASSR system. Methods: In Experiment 1, data of 31 normal-hearing subjects were analyzed off-line to propose stopping rules. Consequently, ASSR recordings are stopped when (1) all 8 responses reach significance and significance is maintained for 8 consecutive sweeps; (2) the mean noise levels are ≤ 4 nV (if at this “≤ 4-nV” criterion, p-values are between 0.05 and 0.1, measurements are extended only once by 8 sweeps); or (3) a maximum of 48 sweeps is attained. In Experiment 2, these stopping criteria were applied to 10 normal-hearing and 10 hearing-impaired adults to assess their efficiency. Results: The application of these stopping rules resulted in ASSR threshold values that were comparable to other multiple-ASSR research with normal-hearing and hearing-impaired adults. Furthermore, in 80% of the cases, ASSR thresholds could be obtained within a time frame of 1 hour. Investigating the significant response amplitudes of the hearing-impaired adults through cumulative curves indicated that a noise-stop criterion higher than “≤ 4 nV” can probably be used. Conclusions: The proposed stopping rules can be used in adults to determine accurate ASSR thresholds within an acceptable time frame of about 1 hour. However, additional research with infants and adults with varying degrees and configurations of hearing loss is needed to optimize these criteria.
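    The three stopping rules described in the abstract can be sketched as a simple decision function. The function name, argument layout, and sweep-block bookkeeping below are assumptions for illustration, not the authors' implementation:

```python
MAX_SWEEPS = 48              # rule 3: hard cap on recording length
NOISE_STOP_NV = 4.0          # rule 2: mean-noise criterion, in nanovolts
CONSEC_SIGNIFICANT = 8       # rule 1: sweeps all 8 responses must stay significant

def should_stop(sweep, consec_all_significant, mean_noise_nv,
                p_values, extended_once):
    """Return (stop, extend): stop recording, or extend once by 8 sweeps."""
    # Rule 1: all eight responses significant for 8 consecutive sweeps.
    if consec_all_significant >= CONSEC_SIGNIFICANT:
        return True, False
    # Rule 2: mean noise has dropped to 4 nV or below ...
    if mean_noise_nv <= NOISE_STOP_NV:
        # ... but if any p-value sits between 0.05 and 0.1, extend the
        # measurement once by 8 sweeps before stopping.
        borderline = any(0.05 < p <= 0.1 for p in p_values)
        if borderline and not extended_once:
            return False, True
        return True, False
    # Rule 3: maximum of 48 sweeps attained.
    if sweep >= MAX_SWEEPS:
        return True, False
    return False, False
```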

    The neural representation and behavioral detection of frequency modulation

    Get PDF
    Understanding a speech signal relies on the ability of the auditory system to accurately encode rapidly changing spectral and temporal cues over time. Evidence from behavioral studies in humans suggests that relatively poor temporal fine structure (TFS) encoding ability is correlated with poorer performance on speech understanding tasks in quiet and in noise. Electroencephalography, including measurement of the frequency-following response (FFR), has been used to assess the human central auditory nervous system’s ability to encode temporal patterns in steady-state and dynamic tonal stimuli and short syllables. To date, the FFR has been used to investigate the accuracy of phase-locked auditory encoding of various stimuli; however, no study has demonstrated an FFR evoked by dynamic TFS contained in the modulating frequency content of a carrier tone. Furthermore, the relationship between a physiological representation of TFS encoding and either behavioral perception or speech-in-noise understanding has not been studied. The present study investigated the feasibility of eliciting FFRs in young, normal-hearing listeners using frequency-modulated (FM) tones, which contain TFS. Brainstem responses were compared to the behavioral detection of frequency modulation as well as to speech-in-noise understanding. FFRs in response to FM tones were obtained from all listeners, indicating a reliable measurement of TFS encoding within the brainstem. FFRs were more accurate at lower carrier frequencies and at shallower FM depths. FM detection ability was consistent with previously reported findings in normal-hearing listeners. In the present study, however, FFR accuracy was not predictive of behavioral performance. Additionally, FFR accuracy was not predictive of speech-in-noise understanding. Further investigation of brainstem encoding of TFS may reveal a stronger brain-behavior relationship across an age continuum.
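    An FM tone of the kind used above can be synthesized by integrating the instantaneous frequency to obtain the phase. The sample rate, carrier, modulation rate, and depth below are illustrative placeholders, not the study's stimulus parameters:

```python
import numpy as np

fs = 16000     # sample rate (Hz); all values here are illustrative
dur = 0.5      # duration (s)
fc = 500.0     # carrier frequency (Hz)
fm = 2.0       # modulation rate (Hz)
depth = 25.0   # FM depth: peak frequency deviation (Hz)

t = np.arange(int(fs * dur)) / fs
# Instantaneous frequency f(t) = fc + depth * sin(2*pi*fm*t); integrating
# 2*pi*f(t) gives the phase of the frequency-modulated tone.
phase = 2 * np.pi * fc * t - (depth / fm) * np.cos(2 * np.pi * fm * t)
tone = np.sin(phase)
```

Differencing the phase recovers the instantaneous frequency, which sweeps between fc - depth and fc + depth at the modulation rate; shallower depths and lower carriers correspond to smaller, slower frequency excursions.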