We present in this paper a comparison between three segmentation systems for the Vietnamese language. The majority of Vietnamese words are built by semantic composition from about 7,000 syllables, which also have a meaning as isolated words. The identification of word boundaries in a text is therefore not a simple task, and ambiguities often appear. Beyond the presentation of the tested systems, we also propose a standard definition for word segmentation in Vietnamese, and introduce a reference corpus developed for the purpose of evaluating such a task. The results observed confirm that it can be relatively well treated by automatic means, although a solution needs to be found to take into account out-of-vocabulary words.

Dinh, Quang Thang

Le, Hong Phuong

Nguyen, Cam Tu

Nguyen, Thi Minh Huyen

Rossignol, Mathias

Vu, Xuan Luong

Hal - Université Grenoble Alpes

Word segmentation of Vietnamese texts: a comparison of approaches

INRIA a CCSD electronic archive server

HAL Id: inria-00334760
https://hal.inria.fr/inria-00334760
Submitted on 28 Oct 2008

To cite this version: Quang Thang Dinh, Hong Phuong Le, Thi Minh Huyen Nguyen, Cam Tu Nguyen, Mathias Rossignol, et al. Word segmentation of Vietnamese texts: a comparison of approaches. 6th International Conference on Language Resources and Evaluation - LREC 2008, ELRA - European Language Resources Association, May 2008, Marrakech, Morocco. inria-00334760

Word segmentation of Vietnamese texts: a comparison of approaches

ĐINH Quang Thắng∗, LÊ Hồng Phương∗†, NGUYỄN Thị Minh Huyền∗, NGUYỄN Cẩm Tú∗, Mathias ROSSIGNOL‡, VŨ Xuân Lương⋆

∗ Vietnam National University of Hanoi, Vietnam
† LORIA, France
‡ MICA, Hanoi, Vietnam
⋆ Vietlex, Hanoi, Vietnam

dqthang@vnu.edu.vn, lehong@loria.fr, huyenntm@vnu.edu.vn, ncamtu@vnu.edu.vn, mathias.rossignol@mica.edu.vn, vuluong@vietlex.com

Abstract

We present in this paper a comparison between three segmentation systems for the Vietnamese language. The majority of Vietnamese words are built by semantic composition from about 7,000 syllables, which also have a meaning as isolated words. The identification of word boundaries in a text is therefore not a simple task, and ambiguities often appear.
Beyond the presentation of the tested systems, we also propose a standard definition for word segmentation in Vietnamese, and introduce a reference corpus developed for the purpose of evaluating such a task. The results observed confirm that it can be relatively well treated by automatic means, although a solution needs to be found to take into account out-of-vocabulary words.

1. Introduction

Despite the fact that, for historical and practical reasons, a variant of the Latin alphabet is now used to represent Vietnamese, its linguistic mechanisms remain close to those of languages using syllabic alphabets, like Chinese. In particular, the Vietnamese language creates words of complex meaning by combining syllables that most of the time also possess a meaning when considered individually. That creates problems for all NLP tasks, due to the difficulty in identifying what constitutes a word in an input text.

We present in this article three systems developed by separate research teams to address that issue, and compare their performance on a corpus of about 1,500,000 words manually segmented for the purpose of this experiment.

The first two systems are based on the principle of maximum matching, that is, the search for the combination of words that produces the segmentation having the smallest number of words. The first one, vnTokenizer, completes this principle by relying on statistical textual data (word and bigram frequencies) to deal with possible ambiguities (Lê et al., 2008). The second, PVnSeg, does not modify the maximum matching algorithm, but performs heavy pre- and post-processing of segmented files using pattern matching techniques.

The third system, JVnSegmenter, adopts for its part a radically different approach, employing statistical machine learning techniques to identify word boundaries from local contextual characteristics of the text.

We first present in Section 2 an overview of word segmentation in various languages. Section 3
is then dedicated to the description of our corpus and the specification of the type of segmentation we wish to achieve. Sections 4 to 6 present the three systems in greater detail, before proceeding to Section 7, which contains the description of the experimental setup and the results of the tests. We conclude in Section 8 with a few lessons for future research in that field.

2. Existing works on word segmentation

Inflected languages (typically, western languages) also have the problem of compound words, but it lies in the identification of stabilized syntactic constructs that refer to a very precise meaning. Those words are often not present in dictionaries, and their relevance may be limited to a specific domain, which is why such research is mostly found in the field of terminology extraction (Kageura et al., 2004). By contrast, in isolating languages compound words belong to the core of the language; they are present in dictionaries and extremely frequent (in Vietnamese, 28,000 compound words in a 35,000-word dictionary). Therefore, we believe the problems to be quite distinct, and shall focus in this section on Asian languages.

The task of segmentation can be made more or less difficult by the writing system: in Thai, for example, each syllable is transcribed using several characters, and there is no space in the text between syllables (Kawtrakul et al., 2002). The problem of word segmentation is thus twofold: first, syllable segmentation, then word segmentation itself. For Chinese or Vietnamese, the situation is easier, since basic lexical units are easily identifiable: Chinese hanzi (Sproat et al., 1996) are each represented by one character, and Vietnamese tiếng are separated by spaces.

In (Ha, 2003), L. A.
Ha separates the task of text segmentation into two sub-tasks:

• Disambiguation between possible word sequences using a lexicon and statistical methods (Wong and Chan, 1996).
• Identification of unknown words using collocation detection measures such as mutual information and t-score: that is the approach of (Sun et al., 1998) for Chinese and (Sornlertlamvanich et al., 2000) for Thai.

It can also happen that morphosyntactic analysis tools integrate their own segmentation rules based on syntactic evidence (Feng et al., 2004).

The tools presented in this paper are mostly concerned with the task of disambiguating between possible word sequences. Although some attempts are made to extend those results to unknown sequences presenting salient features (proper nouns, numbers, etc.), no work yet presents the ability to discover fully unknown compound words from a corpus. Before delving further into the characteristics of those tools, we detail in the next section the experimental data used.

3. Experimental data

In order to perform a thorough evaluation and provide a reference corpus usable for further research, great care has been taken to properly specify the segmentation task. We therefore present in this section, first the specification of the segmentation task, then the contents and characteristics of our corpus.

3.1. Segmentation specification

We have developed a set of segmentation rules based on the principles discussed in the document of the ISO/TC 37/SC 4 workgroup on word segmentation (2006).

Notably, the segmentation of the test corpus obeys the following rules:

Compounds: word compounds are considered as words if their meaning cannot be composed from that of their subparts (e.g. xe/vehicle, đạp/pedal - xe đạp/bicycle), or if their usage frequency justifies it.
Derivation: when a bound morpheme is attached to a word, the result is considered as a word (học/study - tâm lí học/psychology). The reduplication of a word (a common phenomenon in Vietnamese) also gives a lexical unit (e.g.
tháng/month – tháng tháng/month after month).
Multi-word expressions: expressions such as “bởi vì/because of” are considered as lexical units.
Proper names: names of people and locations are considered as lexical units.
Fixed structured locutions: numbers, times, and dates, which can be written in letters or numbers or using a mix of both, are recognized as lexical units (e.g. 30 – ba mươi/thirty).
Foreign language words: foreign language words are ignored in the process of segmentation.

3.2. Corpus constitution

Our test corpus gathers a selection of 1,264 articles from the “Politics – Society” section of the newspaper Tuổi Trẻ, for a total of 507,358 words that have been manually spell-checked and segmented by linguists from the Vietnam Lexicography Center (Vietlex).

The following sections provide detailed descriptions of the compared tools.

4. vnTokenizer

vnTokenizer implements a hybrid approach to automatically tokenize Vietnamese text. The approach combines finite-state automata, regular expression parsing, and a maximal-matching strategy augmented by statistical methods to resolve segmentation ambiguities. The Vietnamese lexicon in use is compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using pre-defined regular expressions. The automaton is then deployed to build linear graphs corresponding to the phrases to be segmented. The application of a maximal matching strategy on a graph results in all candidate segmentations of a phrase. It is the responsibility of an ambiguity resolver, which uses a smoothed bigram language model, to choose the most probable segmentation for the phrase.

vnTokenizer is written in Java and bundled as an Eclipse plugin. It is distributed under the GPL and freely downloadable from http://www.loria.fr/~lehong/projects.php.

5. PVnSeg

PVnSeg is a command-line tool for the segmentation of Vietnamese texts combining several simple programs written in Perl. Its basic operating principle is, once again, maximum matching, using a backtracking algorithm for increased efficiency. The specificity of PVnSeg is that it exploits the power of Perl for text analysis and pattern matching to implement a series of heuristics for the detection of compound formulas such as proper nouns, common abbreviations, dates, numbers, URLs, e-mail addresses, etc.

Work is underway to include the detection of other categories of standardized formulations, such as street addresses, and the automatic extraction from corpora of lists of common abbreviations. Emphasis is also put on intelligent punctuation segmentation using evidence such as capitalization and the presence of numbers or special characters.

6. JVnSegmenter

JVnSegmenter departs from the traditional maximum matching approach and uses statistical machine learning techniques to identify word boundaries in Vietnamese text. JVnSegmenter casts the word segmentation task as the problem of tagging sentences with three predefined labels: BW (beginning of a word), IW (inside a word) and O (others). Each sequence of tagged syllables in which the first one is tagged as BW and the others are tagged as IW forms a word. Two methods are presented: (1) linear Conditional Random Fields with first-order Markov dependency and (2) Support Vector Machines with a second-degree polynomial kernel.

Two kinds of feature functions are used in linear CRFs: edge features, which obey the first-order Markov property, and per-state features, which are generated by combining information concerning the context of the current position in the observation sequence (context predicate) with the current label. Based on the same idea, JVnSegmenter integrates two kinds of features into the SVM model, static features and dynamic features.
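The BW/IW/O labelling scheme described above can be illustrated with a short Python sketch that turns a labelled syllable sequence back into words. This is not JVnSegmenter's code: the function name `tags_to_words`, the example sentence and its hand-written labels are assumptions made for illustration, and the handling of a stray IW label (treated as the start of a new word) is our own choice.

```python
def tags_to_words(syllables, tags):
    """Rebuild tokens from syllables labelled BW, IW or O (hypothetical helper)."""
    if len(syllables) != len(tags):
        raise ValueError("syllables and tags must have the same length")
    tokens = []
    inside_word = False  # True while the last token is an open BW/IW word
    for syllable, tag in zip(syllables, tags):
        if tag == "IW" and inside_word:
            # IW extends the word opened by the preceding BW (or IW) syllable.
            tokens[-1] += " " + syllable
        else:
            # BW opens a new word; O stands alone; a stray IW (assumption)
            # is treated as the beginning of a new word.
            tokens.append(syllable)
            inside_word = tag in ("BW", "IW")
    return tokens


# Hand-labelled example: "xe đạp" (bicycle) is a two-syllable word.
syllables = ["Tôi", "đi", "xe", "đạp", "."]
tags = ["BW", "BW", "BW", "IW", "O"]
print(tags_to_words(syllables, tags))
```

With these labels the two syllables xe and đạp are merged into the single word xe đạp, while the final period, labelled O, remains a separate token.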
While SVM models decide upon dynamic features in the tagging process by considering the two previous labels, static features are very similar to vertex features in the CRF model, in that they also take into account context predicates at the current observation.

Experiments presented in detail in (Nguyen et al., 2006) suggest that the best results are to be obtained by using the full set of defined features, both techniques (CRF and SVM) exhibiting comparable performance. In the tests presented in this paper, we have therefore exploited the same features and present results for the CRF approach only.

Now that we have described all considered systems, we present in the next section the devised experimental setup and obtained results.

7. Experiment

We present in this section the experimental setup used to compare the presented tools, as well as the segmentation comparison algorithm, in order to permit result comparison with other similar studies, and finally the obtained figures in Section 7.3.

7.1. Experimental setup

Some of the tools we wish to compare require a training phase. We have chosen to provide all systems with the opportunity to use training data if they need it, by performing a 10-fold cross validation.

In the case of JVnSegmenter, since it is distributed with pre-trained parameter files, we have computed performance both with those parameters and with parameters acquired from the training corpus.

7.2. Evaluation method

For each test run, the resulting segmented file is aligned with the hand-segmented reference by counting all non-blank characters; we then count all identical parallel tokens towards the global score.

Precision is computed as the count of common tokens over tokens of the automatically segmented files, recall as the count of common tokens over tokens of the manually segmented files, and F-measure is computed as usual from these two values.

7.3. Results

Table 1
presents the values of precision, recall and f-measure computed for all the considered systems.

System                      Precision   Recall    F-measure
vnTokenizer                 93.68 %     94.42 %   94.05 %
PVnSeg                      96.89 %     96.21 %   96.55 %
JVnSegmenter (original)     85.22 %     81.40 %   83.27 %
JVnSegmenter (re-trained)   95.03 %     93.82 %   94.42 %

Table 1: Precision, recall and f-measure of the three systems for word segmentation.

The first interpretable result is that JVnSegmenter really needs to be trained for the considered task, which is not surprising since we cannot know whether the original model files were trained with the same segmentation rules.

From the relatively good results of PVnSeg, we can conclude that efforts at integrating lexical and linguistic knowledge in the tool, in the form of pattern-matching rules, are more fruitful than efforts to solve segmentation ambiguities. Indeed, that phenomenon seems, after closer study of the data, relatively rare. The majority of errors, for all systems, are due to the presence in the texts of compounds absent from the dictionary.

Finally, it should be noted that vnTokenizer is, of the three systems, the one with the most consistent results, i.e. the lowest standard deviation of performance between articles.

8. Conclusion

We have presented three systems for the segmentation of Vietnamese texts into words, and evaluated them on a reference corpus segmented by Vietnamese linguists. All three offer performance within a 2 % range around 95 %, with varying strengths and weaknesses. An important lesson of this experiment is that unknown compounds are a much greater source of segmentation errors than segmentation ambiguities, which are, after all, relatively rare. Future efforts should therefore be geared in priority towards the automatic detection of new compounds, which can be performed by statistical means (given a large enough corpus), rule-based means (using linguistic knowledge about word composition), or a hybrid of the two.

9. Acknowledgements

This work was carried out within the framework, and with the support, of the National Vietnamese project KC.01.01/06-10 on the development of essential tools and resources for Vietnamese language and speech processing.

10. References

J. Feng, L. Hui, C. Yuquan, and L. Ruzhan. 2004. An enhanced model for Chinese word segmentation and part-of-speech tagging. In SIGHAN Workshop, Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona, SP.

L. A. Ha. 2003. A method for word segmentation in Vietnamese. In Proceedings of the International Conference on Corpus Linguistics, Lancaster, UK.

K. Kageura, B. Daille, H. Nakagawa, and L.F. Chien. 2004. Recent trends in computational terminology. Terminology, 10(2):1–21.

A. Kawtrakul, M. Suktarachan, P. Varasai, and H. Chanlekha. 2002. A state of the art of Thai language resources and Thai language behavior analysis and modeling. In Proceedings of the ACL-02 - Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, University of Pennsylvania, USA.

H. P. Lê, T. M. H. Nguyen, A. Roussanaly, and T. V. Ho. 2008. A hybrid approach to word segmentation of Vietnamese texts. In 2nd International Conference on Language and Automata Theory and Applications, Tarragona, Spain.

ISO/TC 37/SC 4 AWI N309. 2006. Language resource management - word segmentation of written texts for mono-lingual and multi-lingual information processing - part 1: General principles and methods. Technical report, ISO.

C. T. Nguyen, T. K. Nguyen, X. H. Phan, L. M. Nguyen, and Q. T. Ha. 2006. Vietnamese word segmentation with CRFs and SVMs: An investigation. In Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation (PACLIC 2006), Wuhan, CH.

V. Sornlertlamvanich, T. Potipiti, and T. Charoenporn. 2000. Automatic corpus-based Thai word extraction with the C4.5 learning algorithm. In Proceedings of the International Conference on Computational Linguistics (COLING 2000), Saarbrücken, DE.

R. Sproat, C. Shi, W. Gale, and N. Chang. 1996. A stochastic finite-state word-segmentation algorithm for Chinese. Computational Linguistics, 22(3):377–404.

M. Sun, D. Shen, and B. K. Tsou. 1998. Chinese word segmentation without using lexicon and hand-crafted training data. In Proceedings of COLING-ACL 98, Montreal, Quebec, CA.

P. Wong and C. Chan. 1996. Chinese word segmentation based on maximum matching and word binding force. In Proceedings of the 16th Conference on Computational Linguistics, Copenhagen, DK.
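As an illustration of the evaluation method of Section 7.2, the following Python sketch aligns two segmentations of the same text by counting non-blank characters and derives precision, recall and F-measure from the tokens they share. It is a minimal reading of that description, not the script used in the experiment: the function names `token_spans` and `evaluate`, the representation of a segmented text as a list of token strings, and the example sentence are all assumptions.

```python
def token_spans(tokens):
    """Map each token to its (start, end) offsets counted in non-blank characters."""
    spans = set()
    offset = 0
    for token in tokens:
        length = sum(1 for ch in token if not ch.isspace())
        spans.add((offset, offset + length))
        offset += length
    return spans


def evaluate(auto_tokens, ref_tokens):
    """Return (precision, recall, f_measure) of an automatic segmentation."""
    auto_spans = token_spans(auto_tokens)
    ref_spans = token_spans(ref_tokens)
    if not auto_spans or not ref_spans:
        raise ValueError("both segmentations must contain at least one token")
    # A token is common when both segmentations place it on the same span.
    common = len(auto_spans & ref_spans)
    precision = common / len(auto_spans)
    recall = common / len(ref_spans)
    f_measure = 2 * precision * recall / (precision + recall) if common else 0.0
    return precision, recall, f_measure


# Hypothetical example: the automatic segmentation splits "xe đạp" in two.
reference = ["Tôi", "đi", "xe đạp"]
automatic = ["Tôi", "đi", "xe", "đạp"]
precision, recall, f_measure = evaluate(automatic, reference)
print(f"P={precision:.4f} R={recall:.4f} F={f_measure:.4f}")
```

In this example two of the four automatic tokens match reference tokens, so precision is 2/4 and recall is 2/3. Because the offsets are counted in characters, both files are assumed to use the same Unicode normalization.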
