In this paper we present our own real-time speaker-independent continuous phone recognition system (Spirit), which uses Context-Independent Continuous Density HMMs (CI-CDHMMs) modeled by Gaussian Mixture Models (GMMs). All the parameters of our system are estimated directly from data by an improved Viterbi alignment process instead of the classical Baum-Welch estimation procedure. In the literature, the Viterbi training algorithm is generally used as a preliminary step to initialize HMM models that are then most often re-estimated with complex re-estimation formulas. To evaluate our system and compare it with previous work, we use the TIMIT database. The recognition time for each sentence ranges from 2 seconds (for short sentences) to 12 seconds (for long sentences). By combining the 64 possible phones into 39 phonetic classes, we obtain a phone recognition correct rate of 71.06% and an accuracy rate of 65.25%. These results compare favorably with previously published works.

Di Martino, Joseph

Hammouch, Ahmed

Ibn Elhaj, El Hassan

Lachhab, Othman

INRIA a CCSD electronic archive server

HAL Id: hal-00761816
https://hal.inria.fr/hal-00761816
Submitted on 10 Dec 2012

To cite this version: Othman Lachhab, Joseph Di Martino, El Hassan Ibn Elhaj, Ahmed Hammouch. Real Time Context-Independent Phone Recognition Using a Simplified Statistical Training Algorithm. 3rd International Conference on Multimedia Computing and Systems - ICMCS'12, May 2012, Tangier, Morocco. hal-00761816

REAL TIME CONTEXT-INDEPENDENT PHONE RECOGNITION USING A SIMPLIFIED STATISTICAL TRAINING ALGORITHM

Othman LACHHAB,∗ *INPT / ENSIAS, Madinat Al Irfane, Rabat, MOROCCO, othmanlachhab@yahoo.fr
Joseph Di MARTINO, *INRIA / LORIA, Vandoeuvre-lès-Nancy, FRANCE, jdm@loria.fr
El Hassane Ibn ELHAJ, *INPT, Madinat Al Irfane, Rabat, MOROCCO, ibnelhaj@inpt.ac.ma
Ahmed HAMMOUCH, ENSET, Madinat Al Irfane, Rabat, MOROCCO, hammouch a@yahoo.com

Abstract: In this paper we present our own real-time speaker-independent continuous phone recognition system (Spirit), which uses Context-Independent Continuous Density HMMs (CI-CDHMMs) modeled by Gaussian Mixture Models (GMMs). All the parameters of our system are estimated directly from data by an improved Viterbi alignment process instead of the classical Baum-Welch estimation procedure.
Generally, in the literature, the Viterbi training algorithm is used as a preliminary step to initialize HMM models that are then most often re-estimated with complex re-estimation formulas. To evaluate our system and compare it with previous work, we use the TIMIT database. The recognition time for each sentence ranges from 2 seconds (for short sentences) to 12 seconds (for long sentences). By combining the 64 possible phones into 39 phonetic classes, we obtain a phone recognition correct rate of 71.06% and an accuracy rate of 65.25%. These results compare favorably with previously published works.

Keywords: Real Time Automatic Speech Recognition (ASR) System, Continuous Speech Recognition, Continuous Density Hidden Markov Models (CDHMMs), Viterbi, Simplified Statistical Training Algorithm, Gaussian Mixture Models (GMMs).

∗This study has been carried out in the framework of the INRIA Euro-Mediterranean 3+3 M09/02 OESOVOX project with the help of the European COADVISE - IRSES (FP7) program.

I. INTRODUCTION

Implementing a continuous speech recognition system is difficult because of the large amount of variability in the speech signal. Many acoustic units can be used to represent speech; the most interesting is probably the phone, which can be considered the smallest acoustic unit. Several techniques have been proposed to model these units: the connectionist approach with Neural Networks (NN), support vector machines (SVM) and, most popular in the field of Automatic Speech Recognition, the statistical approach based on Hidden Markov Models (HMMs) [1][2].
The most recent works model the acoustic space with Gaussian Mixture Models (GMMs). Many researchers have introduced this formalism into their Automatic Speech Recognition (ASR) systems [3][4][5][6][7] and have shown that Continuous Density Hidden Markov Models (CDHMMs) achieve better results than discrete HMMs. In this work, we describe our own speaker-independent continuous speech recognition system, which we call Spirit.

The purpose of this paper is not to provide the best phone recognition rates on the TIMIT database [8], but to demonstrate that a simple statistical training algorithm can reach context-independent phone recognition rates similar to or better than those reported in the literature.

This paper is organised as follows: in Section II, we explain our HMM training and recognition procedure; in Section III, we present experiments and results; finally, Section IV compares our system with other context-independent phone recognition systems and Section V gives conclusions and future work.

II. THE PHONE RECOGNITION

A. Speech processing

Before training the models, the acoustic data must be prepared by computing the MFCC feature vectors. The signal is sampled at 16 kHz and pre-emphasized with a factor of 0.96. The static Mel-cepstral vectors are computed from windowed time sections of 32 ms duration, shifted every 10 ms. Each frame consists of the first 11 static Mel-cepstrum coefficients and the log energy (E); the c0 cepstrum coefficient is discarded. We also include the first and second order derivatives, called dynamic coefficients (∆ and ∆∆), in the same high-dimensional feature vector. We therefore work with vectors of dimension d = 36 (11 MFCC, E; 11 ∆MFCC, ∆E; 11 ∆∆MFCC, ∆∆E).

B. Context-independent HMM training

Each phone of the system is represented by a left-to-right HMM composed of five states, of which only three are emitting. Fig. 1 illustrates the topology and the type of HMM used.
Training the models is the starting point of any ASR system and certainly the most crucial one. It consists in determining the optimal parameters $\tilde{\Theta} = \{A, \pi_i, B\}$:

• $\pi_i$: the initial state probability.
• $A = \{a_{ij}\}$: the state transition probability matrix.
• $B = \{b_i(\vec{o}_t)\}$: the emission probability distribution of the observation $\vec{o}_t$ in state $i$.

Fig. 1. Topology of the context-independent phonetic HMM.

In a CDHMM, the output distribution $b_i(\vec{o}_t)$ for observation $\vec{o}_t$ in state $i$ is generated by a Gaussian Mixture Model (GMM), i.e. a mixture of multivariate Gaussian distributions $\mathcal{N}(\vec{o}_t, \vec{\mu}_{ik}, \Sigma_{ik})$ with mean vector $\vec{\mu}_{ik}$ and covariance matrix $\Sigma_{ik}$:

$$b_i(\vec{o}_t) = \sum_{k=1}^{n_i} \frac{c_{ik}}{\sqrt{(2\pi)^d \, |\Sigma_{ik}|}} \exp\left(-\frac{1}{2} (\vec{o}_t - \vec{\mu}_{ik})^T \, \Sigma_{ik}^{-1} \, (\vec{o}_t - \vec{\mu}_{ik})\right) \tag{1}$$

where $n_i$ is the number of Gaussian components in state $i$ and $\vec{o}_t$ is the observation at time $t$, of dimension $d = 36$. The centroids $\vec{\mu}_{ik}$ are statistically computed in state $i$ by applying the LBG vector quantization algorithm [9] to the vectors associated with state $i$. Each centroid $k$ in state $i$ ($\vec{\mu}_{ik}$) is the average of its associated cepstral vectors $\vec{o}_{ik}^{(n)}$ (Eq. 2):

$$\vec{\mu}_{ik} = \frac{1}{N_{ik}} \sum_{n=1}^{N_{ik}} \vec{o}_{ik}^{(n)} \tag{2}$$

where $N_{ik}$ is the number of vectors associated with centroid $k$ in state $i$. In Eq. 3, $c_{ik}$ is the mixture weight of centroid $k$ in state $i$, estimated as follows:

$$c_{ik} = \frac{N_{ik}}{N_i} \tag{3}$$

where $N_i$ is the total number of vectors associated with state $i$. $\Sigma_{ik}$ is the covariance matrix associated with Gaussian $k$ of state $i$, computed directly from the data using the classical estimation formula (4):

$$\Sigma_{ik} = E\big((X - E[X])(Y - E[Y])\big) = \frac{1}{N_{ik}} \sum_{n=1}^{N_{ik}} (\vec{o}_{ik}^{(n)} - \vec{\mu}_{ik})(\vec{o}_{ik}^{(n)} - \vec{\mu}_{ik})^T \tag{4}$$

Note that the number of Gaussian components associated with each state must be chosen as a compromise between accurate modeling of the phonetic HMMs and the limited amount of training data.
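As an illustration of Eqs. 2 to 4, the per-component statistics can be computed from hard assignments in a few lines of Python. This is only a minimal sketch, not the Spirit implementation: the function name `component_statistics` and the `labels` input (the index of the centroid each vector is assigned to, which the paper obtains from LBG clustering) are assumptions made for the example.

```python
from collections import defaultdict


def component_statistics(vectors, labels):
    """Estimate (weight, mean, covariance) per Gaussian from hard assignments.

    vectors: list of equal-length lists of floats (all vectors of one HMM state).
    labels:  labels[n] is the component index k assigned to vectors[n]
             (hypothetical input; the paper derives it from LBG clustering).
    """
    groups = defaultdict(list)
    for vec, k in zip(vectors, labels):
        groups[k].append(vec)

    n_state = len(vectors)   # N_i: vectors associated with the state
    dim = len(vectors[0])    # d: feature dimension
    stats = {}
    for k, members in groups.items():
        n_k = len(members)   # N_ik: vectors associated with centroid k
        # Eq. 2: the centroid is the average of its associated vectors.
        mean = [sum(v[a] for v in members) / n_k for a in range(dim)]
        # Eq. 3: the mixture weight is the share of the state's vectors.
        weight = n_k / n_state
        # Eq. 4: full covariance matrix, normalised by N_ik.
        cov = [[sum((v[a] - mean[a]) * (v[b] - mean[b]) for v in members) / n_k
                for b in range(dim)]
               for a in range(dim)]
        stats[k] = (weight, mean, cov)
    return stats
```

Because the weights are ratios of counts over the same state, they sum to one by construction; no separate normalisation step is needed in this sketch.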
Too many Gaussian components relative to the amount of available data lead to poor training, because the training database has a limited number of samples for each phone. For this reason we optimize the number of Gaussian components in each HMM state. We begin by setting the number of Gaussian components of each state to 16. The actual optimum number of Gaussian components is related to the number of MFCC vectors associated with each centroid: if the latter is less than the dimension d, the associated Gaussian component is removed because its covariance matrix would be non-invertible. The vectors associated with the removed Gaussian component are then redistributed to the nearest remaining centroids. The state transition probabilities are evaluated by Eq. 7. The same principle has been successfully applied to various specialized tasks, such as speaker-independent alphabet letter recognition [7] and voice conversion [10].

Let $X$ be a random variable giving the number of times an HMM state is visited. Consider the events $S_j$, "staying in the same state at the j-th time", and $M_j$, "moving to the next state at the j-th time". The event $[X = l]$ can then be expressed as:

$$[X = l] = \underbrace{\overbrace{S_1 \cap S_2 \cap \cdots \cap S_{l-1}}^{S_j} \cap \overbrace{M_l}^{M_j}}_{\text{intersection of independent events}}$$

The probability distribution of $X$ is therefore given by:

$$P(X = l) = p_s^{\,l-1} \, p_m \tag{5}$$

where $p_s$ is the probability of staying in the same state and $p_m = 1 - p_s$ is the probability of moving to the next state. By definition, the expectation of $X$ is:

$$E[X] = \sum_{l=1}^{+\infty} l \, p_s^{\,l-1} (1 - p_s) = \frac{1}{1 - p_s} \tag{6}$$

Consequently:

$$p_s = \frac{E[X] - 1}{E[X]} \tag{7}$$

The expectation $E[X]$ is calculated directly from the data by the following formula:

$$E[X] = \frac{N_{ip}}{R_p} \tag{8}$$

where $N_{ip}$ is the total number of vectors related to state $i$ of phone $p$ and $R_p$ is the total number of samples of phone $p$ in the training data.

The Viterbi algorithm is applied to the acoustic vectors of each sentence to determine the optimal sequence of states, i.e. the one that best accounts for the sequence of observations.
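One alignment pass, together with the transition estimate of Eqs. 7 and 8, can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: `self_loop_prob` and `align` are hypothetical names, only emitting states are modeled, the emission log-likelihoods are taken as given, and the chain is assumed strictly left-to-right with no skips, self-loop probabilities strictly between 0 and 1, and at least as many frames as states.

```python
import math


def self_loop_prob(n_vectors, n_occurrences):
    """Eqs. 7-8: p_s = (E[X] - 1) / E[X], with E[X] = N_ip / R_p."""
    expected_visits = n_vectors / n_occurrences
    return (expected_visits - 1.0) / expected_visits


def align(log_emission, p_stay):
    """Viterbi alignment of T frames to a left-to-right chain of S states.

    log_emission[t][s] is log b_s(o_t); p_stay[s] is the self-loop
    probability of state s (assumed to satisfy 0 < p_stay[s] < 1).
    The path starts in state 0 and ends in state S - 1 (assumes T >= S).
    Returns the best state index for every frame.
    """
    n_frames, n_states = len(log_emission), len(p_stay)
    log_stay = [math.log(p) for p in p_stay]
    log_move = [math.log(1.0 - p) for p in p_stay]
    neg_inf = float("-inf")

    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_emission[0][0]

    for t in range(1, n_frames):
        for s in range(n_states):
            # Either stay in s, or arrive from the previous state s - 1.
            stay = score[t - 1][s] + log_stay[s]
            move = score[t - 1][s - 1] + log_move[s - 1] if s > 0 else neg_inf
            if stay >= move:
                best, back[t][s] = stay, s
            else:
                best, back[t][s] = move, s - 1
            score[t][s] = best + log_emission[t][s]

    # Backtrack from the last state at the last frame.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

Counting how many frames the returned path assigns to each state gives the $N_{ip}$ values that feed `self_loop_prob`, and the same assignment selects the vectors from which each state's Gaussians are estimated.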
This process is iterated several times until a stability criterion, computed from the paths returned by the Viterbi process, is reached. The maximum number of iterations was 20.

C. Monophone HMM recognition

Continuous speech recognition is a difficult process because the boundaries of the phones making up a sentence are unknown. Furthermore, monophone HMMs assume that speech is produced as a concatenation of phones that are not affected by their neighbouring phonetic context. To perform recognition, it is useful to infer the sequence of states that generated the given observations, since the phone string can easily be recovered from the sequence of states. This task is performed by the Viterbi decoding algorithm, applied to each test sentence using the optimal parameters (A, πi, B). To better carry out this task and find the adequate path, we built a bigram language model and a duration model on the phone durations, which we assume to follow a normal distribution $\mathcal{N}(\mu, \sigma^2)$.

III. EXPERIMENTS AND RESULTS

The Spirit system has been implemented and tested on a Linux machine with an Intel Pentium Dual CPU at 1.86 GHz and 2 GB of RAM. We chose to evaluate our ASR system on the TIMIT database [8]. In this database, a total of 64 phonetic labels, generally considered too detailed for training HMM models, has been reduced to 39 classes by K.F. Lee and H.W. Hon [11]. We use the same labeling in our system. 39 phonetic HMMs with the topology described in Section II.B (see Fig. 1) are used in training and testing, for a total of 3 x 39 = 117 emitting states. These HMMs, the bigram model and the duration model are trained on the 8 "si" and "sx" sentences of each of the 462 speakers of the TIMIT training part, corresponding to 3696 sentences. The test set contains 1344 sentences pronounced by 168 speakers, corresponding to a total of 50754 phones. The "sa" calibration sentences are excluded from both training and testing.
In continuous speech recognition, the most common phone recognition evaluation measures are the phone error rate (PER) and the related performance metric, phone accuracy. These measures, calculated by Eq. 9, are used in this paper to compare the different phone recognition systems.

$$\text{Accuracy} = \frac{N - (S + D + I)}{N}, \qquad \text{Correct} = \frac{N - (S + D)}{N} \tag{9}$$

where $N$ is the total number of labels in the reference utterances and $S$, $I$ and $D$ are respectively the numbers of substitution, insertion and deletion errors, computed by a DTW (Dynamic Time Warping) alignment between the correct phone strings (reference) and the recognized phone strings (test).

Table 1 presents the accuracy obtained by our system on the complete TIMIT test set. The phone recognition correct rate is 71.06% and the accuracy rate is 65.25%.

| 39 Monophone | Bigram | Bigram+Duration |
|---|---|---|
| Substitution | 17.61% (8938) | 17.25% (8756) |
| Deletion | 10.46% (5310) | 11.69% (5932) |
| Insertion | 7.11% (3607) | 5.81% (2951) |
| Correct | 71.93% (36506) | 71.06% (36066) |
| Accuracy | 64.82% (32899) | 65.25% (33115) |

Table 1. Phone recognition results with our context-independent phone HMM system on the whole TIMIT test set.

Two other measures were chosen to evaluate the speed of our recognition system. The first is the average recognition time. For this measure, the test sentences are classified into six categories (see Table 2) according to their total number of phones Nph. The recognition time clearly depends on the duration of the sentence to be recognized. The second measure is the Real Time Factor (RTF), defined as the total computation time for recognition divided by the total duration of the recorded speech processed. We obtain an acceptable speed with an RTF of 2.5.

| Total number of phones Nph | Average time (s) |
|---|---|
| 10 ≤ Nph < 20 | 2 |
| 20 ≤ Nph < 30 | 5 |
| 30 ≤ Nph < 40 | 7 |
| 40 ≤ Nph < 50 | 9 |
| 50 ≤ Nph < 60 | 11 |
| 60 ≤ Nph < 75 | 12 |

Table 2. Average recognition times for the six categories of TIMIT test sentences.

Fig. 2 shows the evolution of the phone accuracy versus the number of iterations of the proposed training algorithm, for frame shifts of 8 ms and 10 ms. We note that our ASR system is more efficient with a shift of 10 ms. This behaviour can be explained by the fact that decreasing the shift value increases the number of insertion errors.

Fig. 2. Evolution of phone accuracy versus the number of iterations of the training algorithm. (Plot: x-axis "Iteration number", 0 to 20; y-axis "Phone Accuracy (%)", 61.5 to 65.5; one curve for shift = 8 ms and one for shift = 10 ms.)

IV. COMPARISONS

Table 3 provides an accuracy comparison between our Spirit system and previously published results on the TIMIT database for the phone recognition task using CI-CDHMMs. These systems differ in their approach to training the phonetic models, their level of complexity, their computation time, etc., which makes this comparison very difficult. Nevertheless, we have demonstrated that with a simple-minded system we can reach, in real time, an accuracy that is competitive with those obtained by other researchers.

| System | Correct | Accuracy |
|---|---|---|
| discrete HMM (monophone) [11] | 64.07% | 53.28% |
| Tandem (monophone) [12] [13] | 63.50% | 61.48% |
| HTK (monophone) [6] | 71.9% | 62.8% |
| CDHMM (monophone) [5] | 69.33% | 63.05% |
| CDHMM (monophone) [4] | | 64.1% |
| CRF (monophone) [13] | 66.74% | 65.23% |
| CDHMM (monophone) this paper | 71.06% | 65.25% |

Table 3. Phone accuracy comparisons using TIMIT.

V. CONCLUSION AND FUTURE WORKS

In this paper, we built a reference system for continuous speech recognition using context-independent phonetic HMMs. We showed that the results obtained compare favorably with already published HMM technology. In the future, we plan to test our system with context-dependent phone models and to implement a new technique that locates insertion errors in order to remove them.

VI. REFERENCES

[1] J. Baker, "The Dragon system - an overview," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 23, pp. 24–29, 1975.
[2] F. Jelinek, "Continuous speech recognition by statistical methods," IEEE Proceedings, vol. 64, pp. 532–556, April 1976.
[3] J.L. Gauvain and L.F. Lamel, "High performance speaker-independent phone recognition using CDHMM," Proc. Eurospeech, Berlin, pp. 121–124, September 1993.
[4] J. Chang, J. Glass and M. McCandless, "A probabilistic framework for feature-based speech recognition," in Proc. of the ICSLP, pp. 2277–2280, October 2000.
[5] Z. Ben Hamida and A. Ben Messaoud, "CDHMM parameters selection for speaker-independent phone recognition in continuous speech system," IEEE Mediterranean Electrotechnical Conference, pp. 253–258, 2010.
[6] S.J. Young, "The general use of tying in phoneme-based HMM speech recognisers," ICASSP, vol. 1, pp. 569–572, 1992.
[7] J. Di Martino, "On the use of high order derivatives for high performance alphabet recognition," ICASSP, Orlando, USA, May 2002.
[8] L. Lamel, W. Fisher, J. Fiscus, D. Pallet, J. Garofolo and N. Dahlgren, "The DARPA TIMIT acoustic-phonetic continuous speech corpus CDROM," NTIS order number PB91-505065, October 1990.
[9] A. Buzo, Y. Linde and R. M. Gray, "An algorithm for vector quantizer design," IEEE Transactions on Communications, vol. COM-28, no. 1, January 1980.
[10] J. Di Martino, S. Ben Jebara and A. Werghi, "On the use of an iterative estimation of continuous probabilistic transforms for voice conversion," ISIVC, 2010.
[11] K.F. Lee and H.W. Hon, "Speaker-independent phone recognition using hidden Markov models," IEEE Trans. ASSP, vol. 37(11), pp. 1641–1648, November 1989.
[12] D. Ellis, H. Hermansky and S. Sharma, "Tandem connectionist feature stream extraction for conventional HMM systems," in Proc. of the ICASSP, 2000.
[13] J. Morris and E. Fosler-Lussier, "Combining phonetic attributes using conditional random fields," in Proc. of the InterSpeech, pp. 597–600, 2006.

Real Time Context-Independent Phone Recognition Using a Simplified Statistical Training Algorithm

https://hal.inria.fr/hal-00761816/document
