In noisy environments, speech recognition accuracy degrades significantly. Speech enhancement algorithms have been designed to overcome this; however, solutions to date have not been optimal for speech recognition, especially for non-stationary noise like that in a car. Recently, a likelihood-maximising (LIMA) criterion has been applied to speech enhancement techniques. This paper analyses the suitability of spectral subtraction for potential use under a modified version of this framework where direct access to and manipulation of speech recognition models is not available. Analysis shows spectral subtraction is suited to this holistic LIMA approach by confirming the cost surface is appropriate for gradient descent methods. It is also observed that there are regions on the cost surface where performance exceeds that achieved by parameter values traditionally selected for spectral subtraction.

Kleinschmidt, Tristan

Mason, Michael

Sridharan, Sridha

English


A Modified LIMA Framework for Spectral Subtraction Applied to In-Car Speech Recognition


Queensland University of Technology ePrints Archive

This is the author-manuscript version of this work, accessed from http://eprints.qut.edu.au

Kleinschmidt, Tristan F. and Sridharan, Sridha and Mason, Michael W. (2007) A Modified LIMA Framework for Spectral Subtraction Applied to In-Car Speech Recognition. In Proceedings International Conference On Signal Processing and Communication Systems, Gold Coast, Australia.

Copyright 2007 (please consult author)

A Modified LIMA Framework for Spectral Subtraction Applied to In-Car Speech Recognition

Tristan Kleinschmidt, Sridha Sridharan, Michael Mason
Speech and Audio Research Laboratory
Queensland University of Technology, GPO Box 2434, Brisbane, Australia, 4001
{t.kleinschmidt, s.sridharan, m.mason}@qut.edu.au

Abstract: In noisy environments, speech recognition accuracy degrades significantly. Speech enhancement algorithms have been designed to overcome this; however, solutions to date have not been optimal for speech recognition, especially for non-stationary noise like that in a car. Recently, a likelihood-maximising (LIMA) criterion has been applied to speech enhancement techniques. This paper analyses the suitability of spectral subtraction for potential use under a modified version of this framework where direct access to and manipulation of speech recognition models is not available. Analysis shows spectral subtraction is suited to this holistic LIMA approach by confirming the cost surface is appropriate for gradient descent methods. It is also observed that there are regions on the cost surface where performance exceeds that achieved by parameter values traditionally selected for spectral subtraction.

I. INTRODUCTION

A key challenge of deploying speech recognition in real-world environments is the requirement to perform well in the presence of high levels of noise.
Since most speech recognition systems are trained for use in controlled environments, they fail to produce satisfactory performance under more adverse conditions.

Methods for robust speech recognition include model compensation, use of robust features and recognition algorithms, as well as speech enhancement. Enhancement is a popular approach as little or no prior knowledge of the operating environment is required for improvements in recognition accuracy.

Popular speech enhancement algorithms (e.g. filter-and-sum beamforming or spectral subtraction) have been primarily designed to improve intelligibility and/or quality of the speech signal without consideration of what effect that may have on other speech processing systems [1]. Optimisation in these algorithms is focussed on signal-based measures, including maximising the signal-to-noise ratio or minimising the mean-squared signal error. Some of these techniques produce improvements in word accuracy, but these improvements are by-products, rather than the goals, of the enhancement techniques.

Parts of the work presented here were funded through the Cooperative Research Centre for Advanced Automotive Technology (AutoCRC).

One possible solution to the problem is to use speech recognition likelihoods as the optimisation criterion in the enhancement algorithms. Promising results have been shown in recent studies using this approach [1], [2], [3], [4]. In their current form these techniques require access to the underlying state models and attempt to jointly optimise both state sequences and enhancement parameters. This paper proposes a modified approach in which the speech recogniser can be regarded as a ‘black box’. This approach removes the need for access to the recogniser’s acoustic models and a fully decoded state sequence. The details of this approach and its applicability to spectral subtraction are presented.
Spectral subtraction is chosen for its simplicity and common use in single-channel speech enhancement applications.

The rest of this paper is organised as follows. Section II provides background on spectral subtraction speech enhancement. Section III looks at the likelihood-maximising (LIMA) framework and its application to spectral subtraction. Preliminary experimental results and a discussion of their importance are presented in Section IV.

II. SPECTRAL SUBTRACTION

In a noisy environment, speech s(n) is assumed to be corrupted by additive background noise d(n) to produce corrupted speech y(n) as follows:

$$y(n) = s(n) + d(n) \tag{1}$$

Equation (1) can be represented in the frequency domain as:

$$Y(\omega) = S(\omega) + D(\omega) \tag{2}$$

Generally, an estimate of the magnitude (or power) spectrum of the noise signal $\hat{D}(\omega)$ is subtracted from the corresponding spectrum of the noisy signal $Y(\omega)$ to give an estimate of the clean speech signal $\hat{S}(\omega)$:

$$|\hat{S}(\omega)|^{\gamma} = |Y(\omega)|^{\gamma} - |\hat{D}(\omega)|^{\gamma} \tag{3}$$

where γ is the power exponent, which equals 1 for magnitude spectral subtraction or 2 for power spectral subtraction [5]. The phase component of the noisy speech signal is left unaltered and is kept for reconstruction into the time domain.

Should the subtraction in (3) give negative values (i.e. the noise estimate $|\hat{D}(\omega)|^{\gamma}$ is greater than the signal $|Y(\omega)|^{\gamma}$), a flooring factor is introduced. This leads to the following formulation of spectral subtraction, where k denotes the discrete frequency index:

$$|\hat{S}(k)|^{\gamma} = \begin{cases} |Y(k)|^{\gamma} - |\hat{D}(k)|^{\gamma}, & |Y(k)|^{\gamma} > |\hat{D}(k)|^{\gamma} \\ \beta\,|\hat{D}(k)|^{\gamma}, & \text{otherwise} \end{cases} \tag{4}$$

where β is the noise floor factor, and 0 < β ≪ 1 [5]. Common values for this parameter range between 0.005 and 0.1 [5], [6].

Although common values for γ and β are those noted above, there is actually no limitation on the values that these parameters can take. These values are typically used for their conceptual meanings as opposed to performance.
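
The rule in (4) can be illustrated with a minimal sketch that processes one frame of magnitude spectra. The function name `spectral_subtract`, the list-based interface and the default values of `gamma` and `beta` are assumptions made for illustration only; they are not taken from the experimental system described in this paper.

```python
from typing import List


def spectral_subtract(noisy_mag: List[float], noise_mag: List[float],
                      gamma: float = 2.0, beta: float = 0.01) -> List[float]:
    """Apply the floored subtraction rule (4) to one frame of magnitudes.

    noisy_mag holds |Y(k)| and noise_mag holds the noise estimate |D_hat(k)|.
    Returns the estimated clean-speech magnitude |S_hat(k)| for each bin k.
    """
    if len(noisy_mag) != len(noise_mag):
        raise ValueError("spectra must have the same number of bins")
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    enhanced = []
    for y, d in zip(noisy_mag, noise_mag):
        y_pow = y ** gamma
        d_pow = d ** gamma
        if y_pow > d_pow:
            s_pow = y_pow - d_pow   # ordinary subtraction, as in (3)
        else:
            s_pow = beta * d_pow    # noise floor replaces a negative result
        # Undo the exponent to return to the magnitude domain.
        enhanced.append(s_pow ** (1.0 / gamma))
    return enhanced
```

To return to the time domain, the enhanced magnitudes would be recombined with the unaltered phase of the noisy signal, as noted above; that step is omitted from the sketch.
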
Altering these two parameters can make a considerable difference to the speech recognition performance of spectral subtraction, as will be demonstrated in Section IV.

It should also be noted here that in order to derive the two common rules denoted in (3), two conflicting assumptions are made. If the clean speech and noise signals are assumed to be uncorrelated, the power spectral subtraction rule (i.e. γ = 2) results. Alternatively, if the two signals are assumed to be co-linear, the equation reduces to the magnitude spectral subtraction rule (i.e. γ = 1). In practice, neither of these assumptions is valid all the time. This leads to the possibility of optimising these parameters to best fit the instantaneous relationship between clean speech and noise signals.

III. LIMA SPECTRAL SUBTRACTION

As mentioned in Section I, in recent studies a likelihood-based criterion has been used to replace traditional signal-level criteria in speech enhancement algorithms with the aim of improving speech recognition accuracy. This was seen to minimise distortion in the effective auditory signal for recognition purposes instead of the distortion of the speech waveform [3]. Techniques which maximise the likelihood in the speech recogniser are referred to as LIMA enhancement techniques.

The LIMA framework first generates an initial state sequence using the speech recogniser. This sequence is used to optimise the parameters using a gradient-descent algorithm, ensuring an optimal set of parameters for the proposed state sequence. The utterance is decoded again using the new parameters to generate a new state sequence. This joint optimisation of both the enhancement parameters and the state sequence continues until the recognition likelihood converges.

Formulated in this manner, LIMA techniques require both frame-by-frame state sequences and access to the model set in order to perform optimisation.
This paper proposes a modification to the LIMA framework aimed at removing the need for access to state models and state sequence information, which is rarely available when endeavouring to integrate third-party recognition engines in practical applications. Here, we assume that only access to full utterance likelihoods and word sequences is available. Using these two pieces of information, a "blind" gradient-descent approach can be applied whereby a new set of enhancement parameters is tried and the resulting likelihood compared to that of the previous iteration. The comparison indicates the direction in which to move.

This method may be seen to be more restrictive than the original framework as it may require a series of enhancement and recognition steps in order to determine a valid direction of optimisation. This is not the case in the existing work as the optimisation takes place directly on the state sequence. The modified framework does, however, ensure no internal information about the recogniser is required. It may also remove some of the reliance on the initial state sequence, which is a weakness of the existing framework.

To apply the modified (or original) framework to spectral subtraction, the two parameters introduced in Section II constitute the full parameter set. We denote this set by:

$$\xi_{ss} = [\gamma, \beta] \tag{5}$$

An investigation into the effect of altering this parameter set is presented in Section IV.

IV. EXPERIMENTAL RESULTS

To evaluate the suitability of the modified LIMA framework, two experiments were designed using spectral subtraction as the enhancement method of interest. The first experiment investigated the existence of enhancement parameters which provided superior performance to traditionally selected values of γ and β in spectral subtraction.
The second experiment extends the initial experiment by examining whether a gradient-descent method would still be appropriate for optimising the enhancement parameter set when only full utterance scores are available.

Both experiments use speaker-independent, context-dependent 3-state triphone Hidden Markov Models (HMMs) trained using the Wall Street Journal 1 corpus. The models were trained using 39-dimensional Mel-Frequency Cepstral Coefficient (MFCC) vectors: 13 MFCCs (including C0) plus delta and acceleration coefficients. Each HMM state was represented using a 16-component Gaussian Mixture Model.

Experimental data came from the phone numbers task of the AVICAR database collected by the University of Illinois [7]. This database contains real speech recordings under 5 different driving conditions: idle (IDL), 35 mph with windows up (35U) and down (35D), and 55 mph with windows up (55U) and down (55D). In this way, performance under specific noise conditions is of interest as opposed to different signal-to-noise ratios. Microphone number 4 of the 8-channel recordings was utilised. The first experiment included utterances from 61 distinct speakers (30 male, 31 female) and the second used utterances from 20 speakers (14 male, 6 female).

A. Experiment 1

In order to demonstrate that varying the values of the two parameters in (5) alters speech recognition accuracy, a number of recognition experiments were conducted. A selection of 3140 phone number utterances from the test database was used.

TABLE I
WORD RECOGNITION ACCURACIES (%) FOR VARYING VALUES OF SPECTRAL SUBTRACTION PARAMETERS.

Parameters        IDL     35U     35D     55U     55D
Baseline          75.24   49.95   36.18   41.00   22.35
γ=1.0, β=0.1      80.70   47.34   37.42   39.62   27.82
γ=1.5, β=0.1      81.37   51.77   41.16   44.65   29.12
γ=2.0, β=0.1      81.15   53.92   42.01   46.24   28.66
γ=1.5, β=0.1      81.37   51.77   41.16   44.65   29.12
γ=1.5, β=0.3      81.94   57.21   43.84   50.11   29.63
γ=1.5, β=0.5      80.37   56.86   42.62   48.93   27.59

Values for γ and β were varied in linear increments through the ranges [1.0, 2.0] and [0.1, 0.5] respectively.
Word recognition accuracies for increments of γ by 0.5 and β by 0.2 are shown in Table I.

It can be seen from the table that altering the spectral subtraction parameter values leads to changes in speech recognition performance, and that it is possible to locate values of β and γ which provide better word recognition performance than those commonly proposed in the literature: the accuracy at β = 0.3 and γ = 1.5 exceeds that achieved when β ≤ 0.1 and γ = 1 or 2. These findings show the potential for spectral subtraction parameter optimisation under a LIMA framework.

B. Experiment 2

Evaluation of the potential for the modified LIMA framework to find optimal spectral subtraction parameters using gradient-descent methods was performed using a selection of 250 of the phone number utterances used in Experiment 1. The values for γ and β were varied in linear increments through the ranges [1.0, 5.0] and [0.1, 3.0] respectively.

Fig. 1 shows a typical surface of recognition likelihood scores versus variations in γ and β. The general shape of the surface was observed to be common to all utterances tested, suggesting that it is utterance-, speaker- and noise-independent. We observe that increases in either γ or β within the ranges specified above lead to an increase in the likelihood score of the utterance. It can also be seen that whilst the likelihood surface flattens out considerably, it is still marginally increasing. From this figure, it is believed that the likelihood surface may be monotonically increasing, which is very problematic for gradient-descent optimisation, as an unconstrained search would have no interior maximum on which to converge.

To avoid this problematic feature of the cost surface, it is important to identify which likelihood scores are associated with correct transcriptions. Region 2 in Fig. 1 depicts the typical location and shape of the region associated with correct transcriptions. This region was observed to vary in size depending on speaker, utterance and noise level. Examples of the variations associated with noise level are depicted in Fig. 2.
As the noise level increases (a-d), the size of the region of correct transcriptions diminishes considerably, and its shape changes slightly. This is expected as the increased levels of noise hamper the speech recogniser.

The results presented indicate that in order for the proposed modified LIMA framework to perform optimisation, it is important that the region of correct transcriptions can be identified. Whilst this may appear to be hidden information, we are aware of several voice control applications where utterance confirmation is a well-established mechanism. Therefore, by collecting user confirmations and associating them with the utterances of interest, there is sufficient information to perform the likelihood maximisation for the benefit of future utterances.

V. CONCLUSION

From the results of the second experiment, we believe that the modified LIMA framework, which attempts to optimise enhancement parameters based on whole utterance scores and without access to state sequences or models, is capable of being applied to a system using spectral subtraction. This conclusion is supported by the observation of a smooth cost surface suitable for gradient-based optimisation.

The first experiment demonstrated that, in addition to the proposed framework being able to blindly optimise spectral subtraction parameters using only utterance-level scores, there is also the potential to achieve better performance when the values of β and γ are not constrained to their traditionally used values.

REFERENCES

[1] M. Seltzer, B. Raj, and R. Stern, “Likelihood-maximizing beamforming for robust hands-free speech recognition,” IEEE Transactions on Speech and Audio Processing, vol. 12, no. 5, pp. 489–498, 2004.
[2] A. Sankar and C.-H. Lee, “A maximum-likelihood approach to stochastic matching for robust speech recognition,” IEEE Transactions on Speech and Audio Processing, vol. 4, no. 3, pp. 190–202, 1996.
[3] M. Seltzer and R. Stern, “Subband likelihood-maximizing beamforming for speech recognition in reverberant environments,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 6, pp. 2109–2121, 2006.
[4] G. Shi, P. Aarabi, and H. Jiang, “Phase-based dual-microphone speech enhancement using a prior speech model,” IEEE Transactions on Audio, Speech and Language Processing, vol. 15, no. 1, pp. 109–118, 2007.
[5] M. Berouti, R. Schwartz, and J. Makhoul, “Enhancement of speech corrupted by acoustic noise,” in IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 1979, pp. 208–211.
[6] R. Martin, “Spectral subtraction based on minimum statistics,” in EUSIPCO, Edinburgh, 1994, pp. 1182–1185.
[7] B. Lee, M. Hasegawa-Johnson, C. Goudeseune, S. Kamdar, S. Borys, M. Liu, and T. Huang, “AVICAR: Audio-visual speech corpus in a car environment,” in INTERSPEECH, Jeju Island, Korea, 2004, pp. 2489–2492.

Fig. 1. Visualisation of the likelihood surface for varying β and γ. Region 1 is a region of high distortion; Region 2 is the region of 100% accuracy; Region 3 exhibits insufficient speech enhancement to recover in speech recognition.

[Figure: four surface plots, panels (a) to (d); axes: Gamma (1 to 5), Beta (0 to 3), Log Likelihood (−70 to −30)]

Fig. 2. An example of the correct utterance region decreasing as noise levels increase. Noise levels in the figures are (a) -40.8 dB, (b) -35.0 dB, (c) -33.1 dB, and (d) -23.6 dB.
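
The blind search outlined in Section III, combined with the confirmation mechanism discussed in Section IV-B, can be sketched as follows. This is only an illustrative coordinate hill-climb standing in for the "blind" gradient approach; the paper does not specify a search algorithm. The callable `recognise`, the function `blind_search`, the step sizes and the acceptance rule are all hypothetical, and `toy_recogniser` is an invented stand-in for a black-box recogniser, not a model of the measured surfaces.

```python
from typing import Callable, Tuple

Params = Tuple[float, float]   # (gamma, beta)
Result = Tuple[float, bool]    # (utterance log-likelihood, confirmed correct?)


def blind_search(recognise: Callable[[float, float], Result],
                 start: Params,
                 step: Params = (0.25, 0.125),
                 max_iters: int = 50) -> Params:
    """Hill-climb over (gamma, beta) using only whole-utterance scores."""
    best = start
    best_ll, best_ok = recognise(*best)
    for _ in range(max_iters):
        improved = False
        # Try one step up and one step down along each parameter in turn.
        for d_gamma, d_beta in ((step[0], 0.0), (-step[0], 0.0),
                                (0.0, step[1]), (0.0, -step[1])):
            cand = (best[0] + d_gamma, best[1] + d_beta)
            if cand[0] <= 0.0 or cand[1] <= 0.0:
                continue  # keep both parameters positive
            ll, ok = recognise(*cand)
            # Accept a move only if the transcription is confirmed correct
            # and the likelihood improves on the previous iteration.
            if ok and (not best_ok or ll > best_ll):
                best, best_ll, best_ok = cand, ll, ok
                improved = True
                break
        if not improved:
            break  # no neighbouring setting is both correct and better
    return best


def toy_recogniser(gamma: float, beta: float) -> Result:
    """Invented black box: a monotonically rising score with a bounded
    region of correct transcriptions (illustration only)."""
    log_likelihood = -70.0 + 8.0 * gamma + 5.0 * beta
    confirmed = gamma <= 2.0 and beta <= 0.5
    return log_likelihood, confirmed
```

Restricting accepted moves to confirmed-correct transcriptions is one way to stop the search from drifting along a monotonically increasing surface; in this toy setting, a search started inside the correct region halts at its boundary.
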


http://eprints.qut.edu.au/11334/1/11334.pdf
