This work focuses on efficient use of the training material by selecting the optimal set of model topologies. We do this by training multiple word models of each word class, based on a subclassification according to a priori knowledge of the training material. We will examine classification criteria with respect to duration of the word, gender of the speaker, position of the word in the utterance, pauses in the vicinity of the word, and combinations of these. Comparative experiments were carried out on a corpus consisting of Dutch spoken connected digit strings and isolated digits, which are recorded in a wide variety of acoustic conditions. The results show that classification based on gender of the speaker, position of the digit in the string, pauses in the vicinity of the training tokens, and models based on a combination of these criteria perform significantly better than the set with single models per digit.

Bouwman, G.

Boves, L.

Scharenborg, O.

English

MPG.PuRe

CONNECTED DIGIT RECOGNITION WITH CLASS SPECIFIC WORD MODELS

Odette Scharenborg, Gies Bouwman, Lou Boves
A2RT, Dept. Language & Speech, University of Nijmegen
P.O. Box 9103, 6500 HD Nijmegen, The Netherlands
{odettes, bouwman, boves}@lands.let.kun.nl

ABSTRACT

This work focuses on efficient use of the training material by selecting the optimal set of model topologies. We do this by training multiple word models of each word class, based on a subclassification according to a priori knowledge of the training material. We will examine classification criteria with respect to duration of the word, gender of the speaker, position of the word in the utterance, pauses in the vicinity of the word, and combinations of these.

Comparative experiments were carried out on a corpus consisting of Dutch spoken connected digit strings and isolated digits, which are recorded in a wide variety of acoustic conditions. The results show that classification based on gender of the speaker, position of the digit in the string, pauses in the vicinity of the training tokens, and models based on a combination of these criteria perform significantly better than the set with single models per digit.

keywords: connected digit recognition, acoustic modelling, language modelling

1. INTRODUCTION

Speaker-independent connected digit recognition (CDR) over the telephone is a particularly interesting challenge for automatic speech recognition. On the one hand, the size of the vocabulary is small, which should make the task tractable. On the other hand, a digit string is incorrect when only one digit is recognised incorrectly. Therefore, string lengths of ten or more require a per-digit recognition accuracy close to 100% in order to keep the string recognition accuracy higher than, say, 98%.
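As a rough check of this requirement, the sketch below assumes that digit errors are independent, so that a string of n digits is recognised correctly with probability p^n for a per-digit accuracy p. The independence assumption and the function name are illustrative only and are not part of the experiments reported here.

```python
def required_digit_accuracy(string_accuracy: float, n_digits: int) -> float:
    """Per-digit accuracy needed to reach a target string accuracy.

    Assumes independent digit errors (an illustrative simplification):
    string_accuracy = digit_accuracy ** n_digits.
    """
    if not 0.0 < string_accuracy <= 1.0:
        raise ValueError("string_accuracy must be in (0, 1]")
    if n_digits < 1:
        raise ValueError("n_digits must be at least 1")
    return string_accuracy ** (1.0 / n_digits)


if __name__ == "__main__":
    p = required_digit_accuracy(0.98, 10)
    print(f"per-digit accuracy for 98% on 10-digit strings: {p:.4%}")
```

Under this assumption a ten-digit string needs a per-digit accuracy of roughly 99.8% to reach 98% string accuracy.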
Optimal use of the available training material and training techniques are of crucial importance to reach this ‘near perfect’ recognition accuracy.

The focus of the work presented here is efficient use of the training material by selecting the optimal set of models and their topologies. Efficient use of the material means finding the number of models, states and densities that maximises performance.

It is known that training just one model per phone or word is not always optimal. Many digit recognisers use separate model sets for male and female speakers. In addition, the authors in [1] proposed to train models for fast, average, and slow realisations of the words. In [2],[3] realisation speed and speaker gender were combined in order to train gender dependent word models, for fast and slow realisations of the training tokens separately. In all cases, significant recognition improvements were reported.

These studies suggest that prior knowledge of the training material can be used to improve recognition performance. In [4] it was shown that a Classification Tree approach to the problem proves that linguistic features can be used to advantage. In this paper we investigate whether comparable improvements can be obtained with a rule based or ‘common sense’ approach. In doing so, we investigate two features (viz. the position of a digit in a string and the presence of a pause before or after a digit) that have not been used before for the purpose. In summary, we will examine classification criteria with respect to
• duration of the digit,
• gender of the speaker,
• position of the digit in the string,
• pauses in the vicinity of the digit, and
• combinations of these.

Different criteria will result in different numbers of models per digit, different numbers of states, and eventually different numbers of Gaussian densities. In order to allow a fair comparison we will keep the total number of densities in all model sets roughly equal.
A system with just 10 models, but with a high number of densities per state, will serve as the reference.

This paper is organised as follows. In Section 2, we explain the different selection criteria on which the class specific models are based. Section 3 presents the results of the experiments. In Section 4, we give an interpretation of these results. Finally, in Section 5 we summarise our method, briefly draw the most remarkable conclusions and outline some of our plans for follow-up research.

2. METHOD

We measure the influence of each classification by comparing the performance of a speech recognition system using class specific models to a baseline system with only 10 models. In the remainder of this paper we will refer to this model set as BASE. All model sets investigated in this paper represent whole word models. All models have the same left-to-right HMM topology, but the number of states for each model is one of the optimisation parameters.

The general procedure for training class specific word models is as follows:
1. add a label to each word in the baseline transcription of the training corpus according to the subclass imposed by the current classification criterion;
2. determine the duration distribution of each subclass in order to choose the number of states for each model, using a forced alignment with the BASE models;
3. generate a uni- and bigram language model based on the labels in the transcription.

First, we explain the five selection criteria in more detail. Sections 2.6 and 2.7 then elaborate on the second and third step.

2.1 Digit duration

To account for different speaking rates, between speakers and within speakers, we trained duration based models.

The median of the duration distribution of the digit was taken as threshold value to divide the set of digit tokens into short and long realisations, so that both sets have an equal number of training tokens.
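The median split can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the function name and the input format (a list of token durations in frames for one digit type) are assumptions. Tokens whose duration equals the median are placed in the long set.

```python
from statistics import median


def label_by_duration(frame_counts):
    """Label each token of one digit type as 'short' or 'long'.

    frame_counts: durations in frames of all training tokens of a single
    digit type (hypothetical input format). Tokens with fewer frames than
    the median are 'short'; all others are 'long'.
    """
    if not frame_counts:
        return []
    threshold = median(frame_counts)
    return ["short" if n < threshold else "long" for n in frame_counts]


if __name__ == "__main__":
    # Four made-up token durations for one digit type.
    print(label_by_duration([22, 35, 28, 41]))
```

Because the threshold is computed per digit type, the function would be called once for each of the ten digits.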
To this aim the following labels were added to the digit tokens in the transcription:
  short   for digit tokens comprising fewer frames than the median number of frames of that digit type, and
  long    for digit tokens comprising at least as many frames as the median number of frames of that digit type.
We will use shorthand notation DUR to refer to this model set.

2.2 Speaker gender

The training databases used in this study contain only utterances labelled for speaker gender. This allows us to add gender labels to the words in the transcription:
  male    for words uttered by male speakers, and
  female  for words uttered by female speakers.
This model set will be referred to as GENDER.

2.3 Word position

Many phonetic and ASR studies have shown that the acoustic realisation of words is strongly affected by the position of the word in an utterance. For example, string final digits tend to have a falling pitch contour, lower intensity and longer duration. This motivates a distinction between three subclasses per digit, indicated by the following labels:
  initial for the first digit in an utterance,
  middle  for digits from the second up to the penultimate digit, and
  final   for the last digit in an utterance.
A consequence of this definition is that in case the average string length of the training corpus is greater than three, the middle set contains more tokens than the initial and final sets. Single digit utterances obtain the final label, because their acoustic properties resemble those of final digits most. In the remainder of this paper this set will be denoted as POS.

2.4 Pause context

The final criterion for distinguishing between models is the presence of a pause in the vicinity of the digit. Most speakers tend to cluster long digit strings into groups of two, three or four digits, separated by short pauses. It is not unlikely that this clustering of strings into small groups affects the acoustics and duration as well, as already pointed out in [5].
Therefore, each digit is given one of three labels:
  head    for a digit preceded, but not followed by a pause,
  between for a digit neither preceded, nor followed by a pause, and
  tail    for any digit followed by a pause.
In our experiments we consider a pause as a silence of at least 250 ms. Each utterance is considered to be preceded and followed by a pause. Digits surrounded by pauses are labelled with a tail tag, for the same reason that isolated digits were labelled final in POS. We will use PAUSE as shorthand notation for this model set.

2.5 Combination of criteria

In addition to the criteria presented in the previous paragraphs, it is also possible to combine two or three of them. The order in which to apply the criteria may become important if the criteria are somehow correlated. We examined the following combinations:
• Classification with respect to digit duration, followed by classification for speaker gender. (notation: DUR-GEN)
• Classification with respect to speaker gender, followed by digit duration. (notation: GEN-DUR)
• A combined classification of speaker gender and presence of pauses in the digit context. (notation: PAUSE-GEN)

The first two combinations are examined to investigate whether there is a correlation between the speaker gender and the digit duration. In [2] and [3] the second combination has been investigated for Italian and English digit strings. The difference between the two combinations lies in the number of states defined for each word model. The last combination of criteria was chosen because GENDER models and PAUSE models ranked among the best criteria tested.

2.6 Model topology

Choosing an appropriate number of states for a word HMM is especially important for the experiments with the DUR models. On the one hand, a model with too small a number of states is not capable of modelling the dynamic acoustics accurately, because too many frames are allocated to the same state.
On the other hand, models with a number of states much larger than the observed number of frames in the shortest tokens may result in poor modelling during training, because some frames in the vicinity of these tokens will be assigned to the head and/or tail states of these models.

The number of HMM states was set equal to the minimum observed duration, i.e. number of frames, of each subclass in the training material. The duration was determined by a forced alignment of signal and transcription, using model set BASE. The number of states of these baseline models was determined on the basis of a forced alignment with the best phone models available at the start of the research.

2.7 Language Model

For the experiments described in Section 3 a combined uni- and bigram language model was used. The language models were trained on the corresponding transcriptions of the training corpus.

The classification strategies for acoustic modelling, as proposed in the previous subsections, do not necessarily benefit equally from N-gram language models. For the POS models it is unlikely that the bigram language model will add much value. After all, the assumed distinction is purely of an acoustic nature and the language model may put too much restriction on the choice of the best acoustic model. However, for the GENDER models the bigram language model can be expected to add the extra knowledge that during one utterance the models of only one gender must be used. The different contributions of the language model make it an interesting topic to explore. Therefore, we performed tests with and without the language model.

3. RESULTS

Experiments were carried out on a corpus created from three Dutch spoken connected digit databases: Polyphone, SESP and Casimir. All these corpora contain telephone speech recorded in a wide variety of acoustic conditions. The acoustic features were 14 Mel-scale Frequency Cepstrum Coefficients (c0 … c13), and their first order derivatives, i.e. 28 features.
These vectors were based on 16 ms frames and a 10 ms frame shift. Next, HMMs were trained. Each state comprised a mixture of maximally 128 Gaussian densities. The training set consisted of 9753 utterances with an average of 6.3 digits per utterance. The unseen test corpus contained 76,682 digits in 10,000 digit strings. Additional information can be found in [6].

The distribution of training material of each criterion is displayed in Table 1.

  Model set   Percentage training tokens per subclass
  DUR         short: 50%, long: 50%
  GENDER      male: 53%, female: 47%
  POS         initial: 16%, middle: 68%, final: 16%
  PAUSE       head: 28%, between: 35%, tail: 37%

Table 1 Distribution of the training tokens for each subclass per model set.

Table 2 shows the word and sentence error rates obtained in the tests we performed with the system with just one word model per digit class (BASE) for 32, 64 and 128 Gaussians per state.

Table 3 displays the word and sentence error rates obtained in the tests we performed with the class specific models. For ease of reference the performance of the BASE models is repeated.

  Tot. Gaussians      WER (%)   SER (%)
  3744 (5 splits)     4.65      21.78
  7481 (6 splits)     4.36      20.56
  14920 (7 splits)    4.17      19.63

Table 2 The performance of the BASE models at word and sentence level as a function of the total number of Gaussians per set of models.

  Criterion   Tot. Gaussians   WER (%)   SER (%)
  BASE        14920            4.17      19.63
  DUR         28316            4.20      19.95
  GENDER      18877            3.27      15.59
  POS         29100            4.52      20.81
  PAUSE       29818            3.37      16.54

Table 3 The performance of the class specific models (max. 64 Gauss. / state) as a function of the type of classification criterion.

Table 4 presents the performance of the class specific models, without any kind of language modelling. Again, for ease of reference, the performance of the BASE models is shown in the 2nd row of this table.

  Criterion   WER (%)   SER (%)
  BASE        4.17      19.63
  GENDER      3.36      16.23
  POS         3.41      16.30
  PAUSE       3.13      14.97

Table 4 The performance of the class specific models (64 Gauss. / state) without a language model as a function of the type of classification criterion.

As can be seen in Tables 3 and 4, the performance of GENDER deteriorated in the tests without a language model, while the performance of both POS and PAUSE improved significantly (at a 95% confidence level).

Table 5 displays the word and sentence error rates obtained in the tests with the class specific models for combined criteria, with a language model. There are six models per digit in PAUSE-GEN. Although the individual model sets PAUSE and GENDER have the lowest error rates (cf. Tables 3 and 4), the performance of the combination is much worse.

  Criterion   Tot. Gaussians   WER (%)   SER (%)
  BASE        14920            4.17      19.63
  DUR-GEN     15171            3.33      16.18
  GEN-DUR     15664            3.32      15.75
  PAUSE-GEN   15495            4.08      20.38

Table 5 The performance of the class specific models (max. 16 Gauss. / state) with a language model as a function of the type of combined classification criterion.

Finally, Figure 1 shows all Word Error Rates as a function of the total number of densities per model set. The dotted line connects the results of the BASE models with 32, 64 and 128 Gaussians per state.

4. DISCUSSION

A fair comparison of the word and sentence error rates can only be made if the acoustic resolution of the complete set of models is taken into account. This capacity depends on the number of acoustic parameters that have been trained. Therefore, the most efficient model set is the set that uses as few parameters as possible to get the lowest possible error rate.

Keeping this in mind, the class specific model sets can be compared to the set of BASE models in Figure 1.
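To make such a comparison concrete, the sketch below relates each model set in Table 3 to BASE by its relative WER reduction and by its size in Gaussians. The figures are copied from Table 3; the helper name and the choice of relative reduction as a summary are assumptions made for illustration, not a measure used in this paper.

```python
# (total Gaussians, WER %) per model set, copied from Table 3.
TABLE_3 = {
    "BASE": (14920, 4.17),
    "DUR": (28316, 4.20),
    "GENDER": (18877, 3.27),
    "POS": (29100, 4.52),
    "PAUSE": (29818, 3.37),
}


def relative_wer_reduction(wer: float, baseline_wer: float) -> float:
    """Relative WER reduction with respect to a baseline, in percent."""
    return 100.0 * (baseline_wer - wer) / baseline_wer


if __name__ == "__main__":
    base_gauss, base_wer = TABLE_3["BASE"]
    for name, (gauss, wer) in TABLE_3.items():
        reduction = relative_wer_reduction(wer, base_wer)
        size_ratio = gauss / base_gauss
        print(f"{name:7s} {size_ratio:4.2f}x Gaussians  {reduction:+6.1f}% WER reduction")
```

A negative value means the model set performs worse than BASE despite any extra parameters it uses.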
Although we did not test systems with single models per digit for exactly the same number of acoustic parameters as the class specific models, extrapolating the BASE performance suggests that it will not drop far below 4.0% WER for a higher number of acoustic parameters.

The results show that all class specific models, except model set DUR, provide better acoustic modelling compared to BASE models. It is remarkable to see that the PAUSE models perform as well as the well-known GENDER models. However, the performance of the GENDER, POS and PAUSE model sets is strongly dependent on the relative contribution of the language model, as we already predicted in Section 2.7. The word error rates for the model sets POS and PAUSE drop significantly when the language model influence is reduced. These results suggest that the language model may have been too restrictive.

It is remarkable that the performance of our model set DUR is far below the performance of the duration based models in [1,2,3]. One explanation could be that our algorithm to define the number of states for each subclass model is sub-optimal. This is a subject for further study.

Concerning the combined selection criteria GEN-DUR and DUR-GEN, the small difference in the number of Gaussian densities is caused by the order in which the selection criteria were applied. This can be explained by the fact that the median of the number of frames for digits spoken by male and female speakers is not always the same. Since our model topology algorithm takes the minimum duration in ms divided by 10 as the number of HMM states, this will result in different model topologies for long duration digit models for male and female speech. However, the error rates are still very much alike, indicating that the order for classification does not matter significantly.

The results obtained with the model set PAUSE-GEN show a clear deterioration in comparison with the individual model sets PAUSE and GENDER.
In order to understand this deterioration, we performed an analysis on an independent development corpus. It appeared that the overlap between the set of incorrectly recognised words of PAUSE and that of GENDER is very high. Therefore, it is less likely that combining the classification criteria of PAUSE and GENDER would add much value to either one of the individual model sets. On the other hand, the intention to keep the total number of densities approximately fixed resulted in models with only 16 densities per state. This may not be enough to properly represent all variation within the subclasses.

5. CONCLUSIONS

We compared several classification criteria to select a set of model topologies to make efficient use of the available training material. The classification criteria were word duration, gender of the speaker, word position in the string, and presence of pauses in the vicinity of the digit.

One of the best experimental results presented in this work was obtained with the well-known gender classification criterion. The proposed criterion, for pauses in the vicinity of the training tokens, performed equally well. All class specific model sets, except for the one based on duration, give significant efficiency improvement when compared to the set with single models per digit.

Currently we are experimenting with new ways of defining the number of states per subclass model. The first results are very promising.

6. REFERENCES

[1] Pfau T., Ruske G., “Creating Hidden Markov Models for Fast Speech”, Proc. of ICSLP ’98, Sydney, paper 255, pp. 205-208.
[2] Chesta C., Laface P., Ravera F., “HMM Topology Selection for Accurate Acoustic and Duration Modelling”, Proc. of ICSLP ’98, Sydney, vol. 7, pp. 2951-2954.
[3] Chesta C., Laface P., Ravera F., “Connected Digit Recognition Using Short and Long Duration Models”, Proc. of ICASSP ’99, Phoenix, vol. 3, pp. 775-778.
[4] Reichl W., Chou W., “Decision Tree State Tying based on Segmental Clustering for Acoustic Modeling”, Proc. of ICASSP ’98, Seattle, vol. 2, pp. 801-804.
[5] Godfrey J., Ganapathiraju A., Ramalingam C., Picone J., “Microsegment-Based Connected Digit Recognition”, Proc. of ICASSP ’97, Munich, vol. 3, pp. 1755-1758.
[6] http://lands.let.kun.nl/A2RT/cdr/webref.html

Figure 1 Word Error Rate plotted as a function of the number of Gaussian densities for all tested model sets.
[Plot not reproduced. Vertical axis: Word Error Rate (%). Horizontal axis: Total number of Gaussian densities per set of models. Legend: Baseline; Digit duration; Speaker gender; Word position; Pause context; Gender-duration; Duration-gender; Pause-gender; Speaker gender without LM; Word position without LM; Pause context without LM.]

Connected digit recognition with class specific word models

https://pure.mpg.de/pubman/item/item_561607_4/component/file_561606/4DBB6523d01.pdf
