This paper presents trainable methods for generating letter to sound rules
from a given lexicon for use in pronouncing out-of-vocabulary words and as a
method for lexicon compression.
  As the relationship between a string of letters and a string of phonemes
representing its pronunciation is not trivial for many languages, we discuss
two alignment procedures, one fully automatic and one hand-seeded, which
produce reasonable alignments of letters to phones.
  Top Down Induction Tree models are trained on the aligned entries. We show
how combined phoneme/stress prediction is better than separate prediction
processes, and still better when including in the model the last phonemes
transcribed and part of speech information. For the lexicons we have tested,
our models have a word accuracy (including stress) of 78% for OALD, 62% for CMU
and 94% for BRULEX. The extremely high scores on the training sets allow
substantial size reductions (a ratio of more than 20 to 1).
  WWW site: http://tcts.fpms.ac.be/synthesis/mbrdico
  Comment: 4 pages, 1 figure

Black, A.

Lenzo, K.

Pagel, V.

English

arXiv

Edinburgh Research Archive

LETTER TO SOUND RULES FOR ACCENTED LEXICON COMPRESSION

Vincent Pagel (1), Kevin Lenzo (2) and Alan W. Black (3)

(1) Faculté Polytechnique de Mons, Dolez 31, 7000 Mons, Belgium
(2) Carnegie Mellon University, 5000 Forbes Av, Pittsburgh PA 15213, USA
(3) Centre for Speech Technology Research, University of Edinburgh, UK

ABSTRACT

This paper presents trainable methods for generating letter to sound rules from a given lexicon for use in pronouncing out-of-vocabulary words and as a method for lexicon compression. As the relationship between a string of letters and a string of phonemes representing its pronunciation is not trivial for many languages, we discuss two alignment procedures, one fully automatic and one hand-seeded, which produce reasonable alignments of letters to phones (or epsilon). Top Down Induction Tree models are trained on the aligned entries. We show how combined phoneme/stress prediction is better than separate prediction processes, and still better when including in the model the last phonemes transcribed and part of speech information. For the lexicons we have tested, our models have a word accuracy (including stress) of 78% for OALD, 62% for CMU and 94% for BRULEX, allowing substantial reduction in the size of these lexicons.

1. MOTIVATION

In a text-to-speech (TTS) system, a major reason for building rule-based grapheme-to-phoneme transcription systems is to handle out-of-vocabulary (OOV) words. A secondary effect of storing rules is to reduce the amount of memory required by the lexicon, which is of interest for hand-held devices such as talking dictionaries.
The rule set can be viewed as a sort of compression algorithm that captures language regularities. Those regularities are often disrupted by complex word morphology and by accentuation patterns in stress-timed languages such as English or Dutch, which makes this field attractive for machine learning techniques.

Given a dictionary of words with stressed phonemic transcriptions, such as CMU [1], OALD [2] or BRULEX [3], one notices that when two word chunks have similar spelling they have a similar pronunciation. Two broad categories of algorithm can be used to learn those similarities:

• Grapheme to Phoneme (G2P) transducers treating variable graphemic chunk sizes. Yvon gives a summary of available methods in [4], among them HMMs in which phonemes correspond to states emitting zero or more letters. Yvon also proposes an original chunk recombination method.

• G2P with fixed size learning windows. A fixed set of attributes is convenient for many learning techniques; this approach was initiated in the NetTalk system by Rosenberg et al.

Lazy learning techniques contribute to the second category with their remarkable back-off abilities: if a test vector is not present in the training set, it is classified according to its most significant attributes (cf. the many papers by Daelemans and Van den Bosch describing IGTREE [5]). The drawback of fixed size windows is that the word and its phonemic transcription do not have the same length, hence one has to introduce empty symbols (noted epsilon) in the alphabet to align the graphemic and phonemic representations and get a one to one correspondence.

In the rest of this paper we deal with the second family of algorithms, and we propose solutions for both the grapheme/phoneme alignment and the grapheme-to-phoneme transcription.

2. GRAPHEME/PHONEME ALIGNMENT

The first problem to solve with alignment is that one letter can correspond to more than one phoneme, and that one phoneme can correspond to more than one letter.
Since the learning technique we use requires a fixed size learning vector, one should introduce epsilons both in the graphemic and phonemic strings, as in the example given in table 1.

Graph.  E   X   -   E    M   P   L   A   R   Y
Phon.   IH  G   Z   EH*  M   P   L   ER  -   IY

Table 1: alignment of graphemes and stressed phonemes (- stands for epsilon, and * is a primary accent)

For the languages we study in this paper (French and English) we can avoid introducing epsilons in the graphemic string since only a few letters generate more than one phoneme. We define a short list of pseudo-phonemes such as K_S or W_A (as found in the English word ‘fax’ and the French word ‘royal’) so that all our corpora can have a one letter to one phoneme alignment.

Thus the alignment task becomes “introduce epsilons in the phonemic representation so that it matches the length of the graphemic representation”. We propose two solutions to this problem.

2.1 Automatic Epsilon Scattering Method

The idea is to estimate the probabilities for one grapheme G to match with one phoneme P, and to use DTW to introduce epsilons at positions maximizing the probability of the word’s alignment path. Once the dictionary is aligned, the association probabilities can be computed again, and so on until convergence. Five such iterations have been found to be necessary on the CMU corpus.

Algorithm:

    /* initialize prob(G, P), the probability of G matching P */
    1. foreach word_i in training_set
           count with DTW all possible G/P associations for all
           possible epsilon positions in the phonetic transcription
    /* EM loop */
    2. foreach word_i in training_set
           compute new_p(G, P) on alignment_path
    3. if (prob != new_p) goto 2

2.2 Hand-Seeded Method

The best alignment method we experimented with requires seeding with the set of feasible letter-phone (or pseudo-phone) pairs regardless of context.

Thus for each letter a table is written of the possible phones it may match. E.g. "c" may go to /ch/, /s/, /k/, /sh/ or epsilon.
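
As an illustration only, the seed table and the "introduce epsilons so that the lengths match" constraint can be sketched in Python as follows. This is a minimal sketch, not the authors' program: the names `ALLOWED` and `feasible_alignments` are hypothetical, only the "c" row of the table comes from the text above, and the other rows are invented for the toy word.

```python
from itertools import combinations

EPS = "-"  # epsilon: the letter produces no phone

# Hypothetical seed table. Only the "c" row is taken from the text;
# the remaining rows are illustrative assumptions for this toy example.
ALLOWED = {
    "c": {"ch", "s", "k", "sh", EPS},
    "h": {"hh", EPS},
    "e": {"eh", "iy", EPS},
    "k": {"k", EPS},
}

def feasible_alignments(letters, phones, allowed=ALLOWED):
    """Yield every padding of `phones` with epsilons up to len(letters)
    in which each (letter, phone) pair is licensed by the seed table."""
    n, m = len(letters), len(phones)
    if m > n:
        return  # would require pseudo-phonemes such as K_S; not handled here
    # Choose which letter positions receive an epsilon.
    for eps_positions in combinations(range(n), n - m):
        eps_set = set(eps_positions)
        remaining = iter(phones)
        padded = [EPS if i in eps_set else next(remaining) for i in range(n)]
        if all(p in allowed.get(l, ()) for l, p in zip(letters, padded)):
            yield padded

# Two candidates survive for this toy entry, so the table alone
# does not always single out one alignment.
for alignment in feasible_alignments("check", ["ch", "eh", "k"]):
    print(" ".join(alignment))
```

An entry for which the generator yields nothing corresponds to a lexicon entry that cannot be aligned with the current table.
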
For a given lexicon this is an easy incremental process to add to this table until most of the entries can be aligned. Once the table is built, all possible alignments for each entry are found.

The occurrences of each letter/phone pair in these alignments are summed, and a table of probabilities of phone given letter is estimated. Then all possible alignments are found again, but this time they are scored with respect to the probabilities of the letter/phone pairs. The most probable alignment is then selected. In almost all cases this alignment appears to be that which a human labeler would select. Note that although this does require a human to seed the table, no real expertise is needed, so this can be done even with only a little knowledge of the language the lexicon covers. As the table is built, entries that have no alignment are displayed, the number of which eventually reduces to a few per thousand. These are typically acronyms, abbreviations, foreign words etc., where there is in fact no obvious alignment and definitely not one that would be productive to learn.

2.3 The test corpora

In the rest of the paper we evaluate our algorithm on three dictionaries, each split into 90%/10% train and test partitions by selecting every tenth entry of the whole set for testing:

1. The Oxford Advanced Learner's Dictionary (OALD), which contains 63399 British English entries including morphological variations, primary accents and Part of Speech.

2. The CMU release 0.6, which contains 127070 American English entries; we use a 111726-entry subset of CMU containing only entries that can be aligned. Notably this lexicon contains a substantial number of acronyms and proper names, many of which have a non-English origin.

3. The Brulex corpus contains 35743 French entries.
   Note that there is no inflection, which oversimplifies the task since it removes ambiguities between conjugated verbs and nouns, and makes the pronunciation of final ‘s’ certain (many Latin words such as ‘le bus’).

2.4 Results

Table 2 shows the difference in the accuracy of the models generated by the epsilon scattering method versus the hand-seeded method.

Method              Word accuracy   Phone accuracy
Epsilon scattering  63.97%          90.69%
Hand-seeded         78.13%          93.97%

Table 2: Performance on OALD vs. alignment method (using the learning technique described below)

Ideally we would like to extract the alignments fully automatically, and we see the hand-seeded method as the target for our fully automatic method. We are still working on improving the epsilon scattering method.

3. LETTER TO PHONE TRANSDUCTION

Given a fixed size learning vector, we use Top Down Induction Trees to predict the corresponding phonemic output, epsilon being considered as a phoneme.

3.1 Learning technique

With ID3 [6], information gain is used to recursively determine which attribute in the learning vectors allows the best entropy gain between the full set and the partition of the set according to that attribute’s values. The resulting structure is a decision tree containing questions, with return values on terminal nodes. The difference with IGTREE [5] is that the information gain is not computed once and for all for each attribute, but is computed again on each recursively split subset. This consumes a small amount of extra memory, since the tree has to store which attribute is being tested, but allows different branches to test attributes in different orders.
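
The recursive use of information gain described above can be sketched as follows. This is a generic ID3-style toy in Python, written under our own assumptions and not the implementation used for the experiments: the function names, the dictionary-based tree layout and the stored `default` value (the most frequent label at a node, returned for unseen attribute values) are illustrative choices.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def information_gain(vectors, labels, attr):
    """Entropy of the full set minus the weighted entropy of its partition by `attr`."""
    groups = defaultdict(list)
    for v, y in zip(vectors, labels):
        groups[v[attr]].append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def build_tree(vectors, labels, attrs):
    """Pick the best attribute again on every subset, so branches may differ in test order."""
    default = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attrs:
        return default  # terminal node: a plain return value
    best = max(attrs, key=lambda a: information_gain(vectors, labels, a))
    rest = [a for a in attrs if a != best]
    parts = defaultdict(lambda: ([], []))
    for v, y in zip(vectors, labels):
        parts[v[best]][0].append(v)
        parts[v[best]][1].append(y)
    branches = {val: build_tree(vs, ys, rest) for val, (vs, ys) in parts.items()}
    return {"attr": best, "default": default, "branches": branches}

def classify(tree, vector):
    """Follow the tests; fall back to the node default on an unseen value."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(vector[tree["attr"]], tree["default"])
    return tree

# Toy vectors (current letter, next letter) with invented phone labels.
X = [("c", "a"), ("c", "e"), ("c", "o"), ("c", "i"), ("k", "e")]
y = ["k", "s", "k", "s", "k"]
tree = build_tree(X, y, attrs=[0, 1])
print(classify(tree, ("c", "e")), classify(tree, ("c", "u")))
```

On this toy data the root tests the next letter rather than the current one, because that attribute yields the larger gain; a vector with an unseen next letter falls back to the root default.
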
To rate the compression ratio, we give tree sizes in the rest of this paper according to this formula:

    size(tree) = if terminal_node(tree) then 1
                 else 1 +        /* for the default return */
                      sum foreach t (subtree(tree)) 1 + size(t)

To summarize, the size is the number of “if” tests in the tree plus the number of “return” statements, which is directly proportional to the memory requirements (whether the tree is compiled as source code or downloaded in memory and interpreted).

Equation for the alignment path maximized by the DTW step of section 2.1:

$$\text{alignment\_path} = \arg\max \prod_{i,j} \text{prob}(G_i, P_j)$$

Most of ID3’s smoothing power lies in the default case statement, which returns the most probable value for a partial tree path according to the learning set. We have implemented the possibility of not developing branches when the information gain drops under a given threshold (to limit over-training on the data).

We tried another implementation of decision trees (Wagon, a CART [7] implementation) and the results we found were very similar. It appears that the alignment algorithm and vector content contribute much more to the accuracy of the models than the actual decision tree learning technique.

The classical learning vector is a graphemic sliding window that takes N letters on the left and N letters on the right of the letter being transcribed.

Grapheme vector    Phoneme + Stress
---  e  xam        IH
--e  x  amp        G_Z
-ex  a  mpl        AE *
exa  m  ple        M
xam  p  le-        P

Table 3: input vectors and corresponding P+S output, transcribing the beginning of the word “example” (note the use of the pseudo-phoneme G_Z)

4.1 Models For Stress Assignment

Some models previously treated the assignment of stress as a parallel task.
In agreement with recent findings of Van den Bosch et al. [5], we measured on the OALD corpus a drastic improvement when merging phoneme and stress prediction in the same tree.

          Phoneme       Phoneme    Word          Word       Tree size
          (no stress)   + Stress   (no stress)   + Stress
2 Trees   95.6%         -          73.1%         54.6%      24552+103
Merged    95.4%         94.8%      69.4%         69.3%      30368

Table 4: tree for G2P + tree for stress vs. single tree (G2PS)

Table 4 shows the results on the OALD corpus comparing separate phone prediction followed by stress prediction with predicting both phones and stress with a single model. Although the word accuracy of the model excluding stress and the stress model’s accuracy (94.6%) are individually high, their combined result (54.68%) is significantly lower than that of predicting the two together (69.36%). The Phoneme+Stress value is not available for the separate models, as the stress prediction model does not preserve phone alignment. Neither model currently has any explicit morphology, which is obviously relevant, as some stress cannot be assigned with just local context.

In the following we always include stress in our learning parameters (accented and unaccented vowels are considered as different phonemes).

4.2 Phonemic feedback

The transcription of a letter, with an N-sized context, is independent of the transcription that has been carried out on the rest of the word before the current position. However, hand-derived rule systems for French used to include phonetic context in their rules, in order to write more compact systems. The reason is that defining an open/closed syllable, for example, is straightforward with the right phonemic context, and tedious with graphemes. The phonetic feedback can also help balance accents in the word.

Left to Right or Right to Left.
Of course one can only introduce the phonemes that have already been transcribed, thus if one needs the phonemes on the right one must transcribe the letters from right to left, and vice versa.

Corpus type                                  Word+Stress (%)   Phoneme+Stress (%)   Tree size
OALD 3 letters                               73.41             92.74                57395
OALD 3 letters + 3 last phonemes, L to R     75.46             92.62                59667
OALD 3 letters + 3 last phonemes, R to L     76.66             93.60                56299
BRULEX 3 letters                             93.74             97.76                9917
BRULEX 3 letters + 3 last phonemes, L to R   94.05             98.23                8743
BRULEX 3 letters + 3 last phonemes, R to L   94.34             98.84                9059
CMU 3 letters                                59.71             86.95                127393
CMU 3 letters + 3 last phonemes, L to R      62.79             87.84                123301
CMU 3 letters + 3 last phonemes, R to L      61.40             87.90                118767

Table 5: tree performance and size when including in the context vectors the 3 last phonemes transcribed (depends on the transcription direction, Left to Right or Right to Left)

Evaluation. The improvement shown in table 5 for French is marginal, which means that the tree was already embedding syllable information derived from the letter sequence (what is tedious for a human being need not be for a decision tree). However the phonemic feedback clearly simplifies the decision tree (12% smaller).

English corpora benefit from both tree simplification and performance enhancement. The advantage of the right to left transcription direction can be explained by the fact that most of the time the end of the word gives an indication of its morphology (hence of its accentuation pattern).

For example, stress shifts as in ‘strategy’ (S T R AE* T AH JH IY) / ‘strategic’ (S T R AH T IY* JH IH K) cannot be handled by a system provided with a 3 letter context and left to right transcription. Indeed, when transcribing the ‘a’ with the information ‘str a teg’ one cannot decide between AE* and AH. On the other hand, with a right to left transcription, the information vector is either ‘str a teg’ + T IY* JH or ‘str a teg’ + T AH JH.
Two successive syllables can’t both be accented; the system thus has enough information to decide correctly between the two options.

4.3 Including Part Of Speech

Heterophonic homographs are quite common in English and French, and can be disambiguated when their part of speech is known (many verb-noun or verb-adjective pairs).

Corpus type                                                 Word+Stress   Phoneme+Stress   Tree size
OALD 3 letters                                              73.41%        92.74%           57395
OALD 3 letters + Part of Speech                             75.73%        93.19%           61671
OALD 3 letters + Part of Speech + 3 phones left to right    78.13%        93.97%           59135

Table 6: Influence of POS alone, and POS + 3 last phonemes.

Including a POS tag in each learning vector to indicate the nature of the word to which it belongs is easy, and this enhances the word accuracy by 2.3% as shown in table 6. The synergy of POS with the phonemic feedback in a left to right transcription is excellent, as the resulting gain is nearly the sum of their independent gains.

4. APPLICATION TO COMPRESSION

All the results given so far concern generalization performance. Are those methods applicable to the compression of dictionaries? To evaluate the compression, note that the training and testing sets of the G2P model are the same.

Depending on the amount of memory occupied by a node in the tree, and on the size of the exception lexicon, the developer can choose a memory trade-off by reducing the depth of the decision tree, as shown in figure 1. On OALD for example, the best compression result is 61831 nodes to represent 99.02% of the corpus (that is, an exception lexicon of 621 entries, and a compression ratio of 1 to 22 for the text version of OALD).

5. CONCLUSION

This paper has presented a method for building letter to sound rules for a given lexicon in a general, language-independent way. We have tested it on English and French and feel it is suitable for many other languages.

From our results it seems that over-training is generally not a problem: more data for the rules is always useful.
However this may partly be due to the way we selected our test sets (one entry out of ten). As the lexicon is in alphabetical order it is likely that the words that were immediately next to test entries are very similar. The question of accuracy with respect to genuinely unknown words is discussed more fully in [8].

The automatic learning programs described in this paper as well as speaking dictionaries are available from the MBROLA project [9] home page http://tcts.fpms.ac.be/synthesis/mbrdico

Figure 1: OALD percentage of correct words (train/test sets are the same) as a function of the tree size for Grapheme only (G), Grapheme+Phoneme (G+P), and Grapheme+Phoneme+Part of Speech (G+P+POS). Axes: Word + Stress accuracy (%) and Tree size (nodes).

6. REFERENCES

1. Weide R. L., “Carnegie Mellon Pronouncing Dictionary”, release 0.6, www.cs.cmu.edu, 1998.
2. Mitten R., “Computer-usable version of Oxford Advanced Learner's Dictionary of Current English”, Oxford Text Archive, 1992.
3. Content A., Mousty P. and Radeau M., “BRULEX: Une base de données lexicales informatisée pour le français écrit et parlé”, L'Année Psychologique, 1990, p. 551-566.
4. Yvon F., “Prononcer par analogie: motivation, formalisation et évaluation”, PhD thesis, ENST, 1996.
5. Van den Bosch A., Weijters T. and Daelemans W., “Modularity in inductive-learned word pronunciation systems”, in Proc. NeMLaP3/CoNLL98 / Powers D.M.W., Sydney, 1998, p. 185-194.
6. Quinlan J. R., “C4.5: Programs for Machine Learning”, Morgan Kaufmann, San Mateo, CA, 1993.
7. Breiman L., Friedman J., Olshen R. and Stone C., “Classification and Regression Trees”, Wadsworth & Brooks, Pacific Grove, CA, 1984.
8. Black A.W., Lenzo K. and Pagel V., “Issues in Building General Letter to Sound Rules”, ESCA Synthesis Workshop, Australia, 1998.
9. Dutoit T., Pagel V., Pierret N., Bataille F. and Van der Vrecken O., “The MBROLA Project: Towards a Set of High-Quality Speech Synthesizers Free of Use for Non-Commercial Purpose”, Proc. ICSLP'96, vol. 3, p. 1393-1396.
