The problem of vocalization, or diacritization, is essential to many tasks in Arabic NLP. Arabic is generally written without the short vowels, which leads to one written form having several pronunciations with each pronunciation carrying its own meaning(s). In the experiments reported here, we define vocalization as a classification problem in which we decide for each character in the unvocalized word whether it is followed by a short vowel. We investigate the importance of different types of context. Our results show that the combination of using memory-based learning with only a word internal context leads to a word error rate of 6.64%. If a lexical context is added, the results deteriorate slowly

Kübler, Sandra

Mohamed, Emad

English

Hochschulschriftenserver - Universität Frankfurt am Main

Memory-Based Vocalization of ArabicSandra K¨ ubler, Emad MohamedIndiana UniversityDepartment of Linguistics1021 E 3rd St.Bloomington, IN-47405USA{skuebler,emohamed}@indiana.eduAbstractThe problem of vocalization, or diacritization, is essential to many tasks in Arabic NLP. Arabic is generally written without the shortvowels, which leads to one written form having several pronunciations with each pronunciation carrying its own meaning(s). In theexperiments reported here, we deﬁne vocalization as a classiﬁcation problem in which we decide for each character in the unvocalizedword whether it is followed by a short vowel. We investigate the importance of different types of context. Our results show that thecombination of using memory-based learning with only a word internal context leads to a word error rate of 6.64%. If a lexical contextis added, the results deteriorate slowly.1. IntroductionThe problem of vocalization, or diacritization, is essentialto many tasks in Arabic NLP. Arabic is generally writtenwithout the short vowels, which leads to one written formhaving several pronunciations with each pronunciation car-rying its own meaning(s). The word form ’mskn’ is anexample for a highly ambiguous word. Its possible pro-nunciations include ’maskan’ (home), ’musakkin’ (anal-gesic), ’masakn’ (they-fem. have held), or ’musikn’ (they-fem. have been held). The importance of vocalization be-come clear when we look at how Google Translate renders’A$tryt Almskn mn AlSydlyp’ (I bought a pain killer fromthe pharmacy): as ’I bought the home from the pharmacy’.This error would not occur if the input to the translationsystem were vocalized in a ﬁrst step before the actual trans-lation process. However, vocalization is far from trivial:the example above shows that the vocalized words of a sin-gle unvocalized form differ in their parts-of-speech as wellas in their meaning. This shows that vocalization performsimplicit POS tagging and word sense disambiguation. It isalso obvious that word forms cannot be vocalized in iso-lation, the task is heavily dependent on the context of theword.In the experiments reported here, we investigate the impor-tance of different types of context. We follow (Zitouni etal., 2006) in deﬁning vocalization as a classiﬁcation prob-lem in which we decide for each character in the unvocal-ized word whether it is followed by a short vowel. We in-vestigatehowwellthetaskcanbeperformedifonlycontextfrom the same word is available as compared to having ac-cess to a lexical context of 5 words on each side. We alsoinvestigate which types of features are the most importantones. Lastly, we investigate the learning curve to determinehow much training data we need for reliable results.2. Previous ResearchThe ﬁrst approaches to the vocalization of Arabic deﬁnedthe problem word-based, i.e. the task was to determinefor each word the complete vocalized form. (Gal, 2002)uses a bigram HMM model for vocalizing the Qur’an andachieves a word error rate (WER) of 14%. His error anal-ysis showed that the errors resulted mostly from unknownwords. (Kirchhoff et al., 2002) extend a unigram modelby a heuristic for unknown words, which retrieves the mostsimilar unlexicalized word and then applies edit distanceoperations to turn it into the unknown word. They reacha WER of 16.5% on conversational Arabic. (Nelken andShieber, 2005)tackletheproblemwithweightedﬁnite-statetransducers. For known words, morphological units areused for retrieving the vocalization while unknown wordsare vocalized based on the sequence of characters. Theyreach a WER of 12.8%. (Zitouni et al., 2006) use a maxi-mum entropy model in combination with a character basedclassiﬁcation. Their features are based on single charac-ters of the focus word, morphological segments, and POStags. They reach a WER of 7.9%. A comparison of the dif-ferent approaches shows that the deﬁnition of vocalizationas inserting vowels between characters results in the low-est WER. However, this study leaves the lexical context ofwords completely unexplored. In the present study, we willinvestigate this area of research.3. Experimental Setup3.1. DataWe used the Penn Arabic Treebank (Bies and Maamouri,2003) as the data source. The treebank is encoded in Buck-walter transliteration (Buckwalter, 2002) and is available ina vocalized and an unvocalized version. From the treebank,we extracted 170 000 words from the AFP section (part 1v 2.0) and approximately 160 000 words from the Ummahsection (part 2 v2.0).As mentioned previously, we deﬁned vocalization as a clas-siﬁcation problem: For each character in the focus word,the learner needs to decide whether the character is fol-lowed by a short vowel and what the short vowel is. We willcall this character the focus charcter. The task also involvesthe restoration of the shadda (double consonant, long con-sonant, gemination) but, at present, it does not include caseendings.w−5 w−4 w−3 w−2 w−1 c−5 c−4 c−3 c−2 c−1 cc 1 c2 c3 c4 c5 w1 w2 w3 w4 w5 vkl ”$y” tgyr fy HyAp A l m t $ r styfn knt EndmA Evrt Elykl ”$y” tgyr fy HyAp A l m t $ r d styfn knt EndmA Evrt Elykl ”$y” tgyr fy HyAp Al m t$ r d styfn knt EndmA Evrt Ely ukl ”$y” tgyr fy HyAp Al mt $ r d styfn knt EndmA Evrt Ely akl ”$y” tgyr fy HyAp Al mt $ r d styfn knt EndmA Evrt Ely akl ”$y” tgyr fy HyAp A l m t $ r d styfn knt EndmA Evrt Ely ˜ iTable 1: The word ’Almt$rd’ represented with one instance per word; the class represents the vowel to be inserted after thecharacter.Thefeaturesusedfordeterminingtheshortvowelfollowingthe focus character consist of the focus character itself (c),its local context in terms of neighboring characters withinthe focus word, and a more global context of neighboringwords. For the local context, 5 characters to the left (c−5...c −1), 5 characters to the right (c1 ...c 5) are used; forthe lexical context, 5 words to the left (w−5 ...w −1), and 5words to the right (w1 ...w 5). The last value in the vector(v) provides the correct classiﬁcation, i.e. the short vowelto be inserted after c, or - in cases where no vowel is in-serted in that position. The instance for the Arabic word’Almt$rd’, for example, is shown in Table 1.For most of the experiments, we used 10-fold cross valida-tion, the only exception is the experiment concerning thesize of the training set. In order to simulate a real-life situa-tion, we did not build the folds randomly but rather sequen-tially, thus ensuring that a single fold contains consecutivearticles, which may cover different topics from the otherfolds. However, we made sure that all instances of a wordwere put in the same fold.3.2. MethodsFor classiﬁcation, we used a memory-based learner,TiMBL (Daelemans et al., 2007). Memory-based learningis a lazy-learning paradigm, which assumes that learningdoes not consist of abstraction of the training instances intorules or probabilities. Instead, the learner uses the train-ing instance directly. As a consequence, training consistsin storing the instances in an instance base, and classiﬁca-tion ﬁnds the k nearest neighbor in the instance base andchooses their most frequent class as the class for the newinstance. Memory-based learning has been proven to havea suitable bias for many NLP problems (Daelemans et al.,1999). One of the reasons for this success is that naturallanguage exhibits a high percentage of subregularities or ir-regularities, which cannot be distinguished from noise. Ea-ger learning paradigms smooth over all these cases whilememory-based learning still has access to the original in-stance. Thus, if a new instance is similar enough to oneof these irregular instances, it can be correctly classiﬁed assuch.Memory-based learning was chosen for two reasons: First,this approach weights features based on information gainor gain ratio (Daelemans et al., 2007), thus giving someindication of the most and the least important features. Ad-ditionally, it is a paradigm that is capable of handling sym-bolic features with a high number of different feature val-ues. This allows us to use complete context words as fea-tures.CER WERbaseline – 47.2character context 2.22 6.64left word context 2.26 7.06word context 2.35 6.86Table 2: The results of the vocalization experiments withTiMBL.Parameter settings for TiMBL were determined ﬁrst. Thebest results were obtained for all experiments with the IB1algorithm with similarity computed as weighted overlap,relevance weights computed with gain ratio, and the num-ber of k nearest neighbors (or in TiMBL’s case, nearest dis-tances) equal to 1.4. ResultsThe results of our experiments with regard to different con-texts as well the baseline are shown in Table 2. We evalu-ate the error rate based on characters (CER) and based onwords (WER). The baseline experiment was set up so thatthe classiﬁer was presented with 11 words: the focus word,5 context words to its left, and 5 context words to its right.The results for the baseline show that vocalization is a dif-ﬁcult task even in our data set where a word on average hasonly 1.67 vocalizations. This ﬁgure is considerably lowerthan the average on normal texts. (Debili et al., 2002) foundthat on average, each unvocalized word type has 2.9 vo-calized versions, and there is an average of 11.6 vocalizedversions per word token in a text.Relevant features. The next three lines in Table 2 reportsthe results for the experiments in which we deﬁne the taskas deciding for each character whether it is followed by avowel. The experiment in line 2 uses only a character con-text of 5 characters to each side of the focus character butignores the context words, i.e. the features from c−5 to c5in Figure 1 are used. The next experiment uses the lexicalcontext to the left of the focus word in addition to the char-acter context but ignores the context words on the right, i.e.the features from w−5 to c5 are used. Finally, the last ex-periment uses all features shown in Figure 1, i.e. it uses thecharacter context as well as the lexical feature to the leftand to the right of the focus word. When going from classi-fying complete words to classifying characters separately,the results improve dramatically. This method results in aWER of 6.64%; to our knowledge, the highest reported re-sult (but notice that (Zitouni et al., 2006) use a differenttraining set). Surprisingly, adding the context words doesFigure 1: The learning curve.not improve the classiﬁcation results. On the contrary, itresults in a lower WER. This is unexpected, we would haveexpected that at least in cases where the vocalizations havedifferent parts of speech, the lexical context would provideimportant information. One possible explanation for thesenegative results may be data sparseness. However, if weuse the lexical context on both sides of the focus word, theCER is lower but the WER is higher than in the experimentwith the left context only. This shows that the individualdecisions concerning single vowels become more difﬁcultbut the recognition of complete words becomes more sta-ble. Thus, in some cases, the lexical context does improveclassiﬁcation. This also becomes evident when we comparethe results of single folds in the 10-fold setting. Some of thefolds have better results in the left context setting, and somein the full context setting.Next, we look at the weights that TiMBL assigns to the dif-ferent features in the character based experiments. Here,the results are very stable. If we look at the gain ratioweights, in all experiments over all folds, we get the sameordering of features. The feature with the highest weightis the character following the focus character, c1. The nextmost important feature is the focus character, c. The thirdmost important character is the next character to the left,c−1, followed by all its preceding characters c−5 to c−2,followed by all the characters to the right of c1: c2, c5, c3,and c4.Size of the training set. The next question to investigateconcerns the importance of the training set size. In order toinvestigate how much training data we need for the task, weconducted an experiment in which we started with a smalltraining set containing 1000 character instance, and thencontinually increased the training setsize to the fulltrainingset size of 1230723 character instances. The test set waskept stable, we used one of the folds for testing. In orderto ensure reliable results, we chose a fold that resulted inaverage results in the ten-fold experiments reported above.All the experiments were performed with the best featureset determined in the previous experiments, i.e. with char-acters from the focus word as the only features. The learn-ing curve is shown in Figure 1. When training on a set ofonly 1000 characters, the WER is 47%, but raising the sizeof the training set reduces the WER to 27%. The satura-tion point is reached at approximately 700000 characters(which corresponds to 5 folds), with a WER of 6.9%. Af-ter this point, there are only minor improvements, and theWER reaches 6.64% for the whole training set1.5. Conclusion and Future WorkIn the experiments reported here, we have investigated thevocalization of Arabic. The results show that the word in-ternal context provides enough information for vocalizinga high percentage correctly. The best parameter and fea-ture setting results in an error rate of 6.64%, which is morethan one percent point lower than the results presented by(Zitouni et al., 2006) even though our system did not haveaccess to either word segments or POS tags. Adding lexi-cal context as additional features did not increase the per-formance of the memory-based classiﬁer TiMBL. Interest-ingly, the most informative feature is the character follow-ing the focus character although in general, the left charac-ter context within the focus word is more informative thantherightcharactercontext. Thelearningcurveshowsthatatleast in the experiments with features only from within thefocus word, a training set of 700000 characters is sufﬁcientfor reliable results. For the future, we are planning to use astemmer for Arabic to reduce the lexical features to stemsin order to alleviate the sparse data problem concerning thelexical features. Additionally, we will follow (Zitouni etal., 2006) and include part of speech information for all thewords as well. Since the tagset of the Penn Arabic treebankis rather ﬁne grained, we expect to reach the best results byreducing the tagset to a manageable level, following (Diab,2007). A further line of investigation concerns the use ofprevious classiﬁcation within a word for the classiﬁcationof the next character.6. ReferencesAnn Bies and Mohamed Maamouri. 2003. Penn ArabicTreebank guidelines. Technical report, LDC, Universityof Pennsylvania.Tim Buckwalter. 2002. Arabic morphological analyzerversion 1.0. Linguistic Data Consortium.Walter Daelemans, Antal van den Bosch, and Jakub Za-vrel. 1999. Forgetting exceptions is harmful in languagelearning. Machine Learning, 34:11–43. Special Issue onNatural Language Learning.Walter Daelemans, Jakub Zavrel, Ko van der Sloot, andAntal van den Bosch. 2007. TiMBL: Tilburg memorybased learner – version 6.1 – reference guide. TechnicalReport ILK 07-07, Induction of Linguistic Knowledge,Computational Linguistics, Tilburg University.Fahti Debili, Hadhebi Achour, and Emna Souissi. 2002.De l’etiquetage grammatical a la voyellation automa-tique de l’arabe. Technical report, Correspondances del’Institut de Recherche sur le Maghreb Contemporain.MonaDiab. 2007. TowardsanoptimalPOStagsetforAra-bic processing. In Proceedings of the International Con-1We will present results for the experiments with lexical fea-tures in the ﬁnal version of the paper.ference on Recent Advances in Natural Language Pro-cessing, RANLP 2007, pages 157–161, Borovets, Bul-garia.Ya’akov Gal. 2002. An HMM approach to vowel restora-tion in Arabic and Hebrew. In Proceedings of the ACL-02 Workshop on Computational Approaches to SemiticLanguages, Philadelphia, PA.Katrin Kirchhoff, Jeff Bilmes, John Henderson, RichardSchwartz, Mohamed Noamany, Pat Schone, Gang Ji,Sourin Das, Melissa Egan, Feng He, Dimitra Vergyri,Daben Liu, and Nicolae Duta. 2002. Novel speechrecognition models for Arabic - ﬁnal report of the JHUsummer workshop. Technical report, Johns HopkinsUniversity.Rani Nelken and Stuart Shieber. 2005. Arabic diacritiza-tion using weighed ﬁnite-state transducers. In Proceed-ings of the ACL Workshop on Computational Approachesto Semitic Language, Ann Arbor, MI.Imed Zitouni, Jeffrey Sorensen, and Ruhi Sarikaya. 2006.Maximum entropy based restoration of arabic diacrit-ics. In Proceedings of the 21st International Conferenceon Computational Linguistics and 44th Annual Meet-ing of the Association for Computational Linguistics,COLING-ACL-2006, Sydney, Australia.

An HMM approach to vowel restoration in Arabic and Hebrew.

Arabic diacritization using weighed ﬁnite-state transducers.

Arabic morphological analyzer version 1.0. Linguistic Data Consortium.

De l’etiquetage grammatical a la voyellation automatique de l’arabe. Technical report, Correspondances de l’Institut de Recherche sur le Maghreb Contemporain.

Forgetting exceptions is harmful in language learning.

Maximum entropy based restoration of arabic diacritics.

Penn Arabic Treebank guidelines.

Sourin Das, Melissa Egan, Feng He, Dimitra Vergyri, Daben Liu, and Nicolae Duta.

TiMBL: Tilburg memory based learner – version 6.1 – reference guide.

TowardsanoptimalPOStagsetforArabic processing.

Memory-based vocalization of Arabic

Memory-based vocalization of Arabic

Abstract

Similar works

Full text

Available Versions

Hochschulschriftenserver - Universität Frankfurt am Main