The Automatic Extraction of Open Compounds from Text Corpora

Authors: Hozumi Tanaka, Virach Sornlertlamvanich
Publication date: 01/01/1996

This paper describes a new method for extracting open compounds (uninterrupted sequences of words) from text corpora of languages, such as Thai, Japanese and Korean, that exhibit unexplicit word segmentation. Without applying word segmentation techniques to the inputted plain text, we generate n-gram data from it. We then count the occurrence of each string and sort them in alphabetical order. It is significant that the frequency of occurrence of strings decreases when the window size of observation is extended. From the statistical point of view, a word is a string with a fixed pattern that is used repeatedly, meaning that it should occur with a higher frequency than a string that is not a word. We observe the variation of frequency of the sorted n-gram data and extract the strings that experience a significant change in frequency of occurrence when their length is extended. We apply this occurrence test to both the right and left hand sides of all strings to ensure the accurate detection of both boundaries of the string. The method returned satisfying results regardless of the size of the input file.
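
The procedure outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how a "significant change" in frequency is measured, so the `drop_ratio` threshold, the `min_freq` cutoff, the `max_n` window limit and both function names are assumptions made for illustration. The sketch also looks up extensions in a hash map rather than scanning alphabetically sorted n-gram data; the sorted order in the paper serves the same purpose of bringing a string and its extensions together.

```python
from collections import Counter


def ngram_counts(text, max_n):
    """Count every character n-gram of length 1..max_n in the raw text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def extract_candidates(text, max_n=6, min_freq=2, drop_ratio=0.5):
    """Return (string, frequency) pairs whose frequency drops sharply
    when the string is extended by one character on either side.

    drop_ratio is a hypothetical threshold: an extension counts as a
    significant drop when its frequency is at most drop_ratio times the
    frequency of the unextended string.
    """
    # Count one length beyond max_n so the longest candidates can be tested.
    counts = ngram_counts(text, max_n + 1)

    # Highest frequency among one-character extensions on each side.
    right_best = Counter()
    left_best = Counter()
    for gram, freq in counts.items():
        if len(gram) < 2:
            continue
        prefix, suffix = gram[:-1], gram[1:]
        right_best[prefix] = max(right_best[prefix], freq)
        left_best[suffix] = max(left_best[suffix], freq)

    candidates = []
    for gram, freq in counts.items():
        if not (2 <= len(gram) <= max_n) or freq < min_freq:
            continue
        limit = drop_ratio * freq
        # Test both boundaries: the right-hand and the left-hand extension.
        if right_best[gram] <= limit and left_best[gram] <= limit:
            candidates.append((gram, freq))
    return sorted(candidates)


if __name__ == "__main__":
    # Toy input: "abc" recurs in varying contexts, so its boundaries are stable.
    print(extract_candidates("abcxabcyabczabc"))
```

On the toy input, "ab" is rejected because extending it to "abc" does not lower its frequency, and "bc" is rejected by the left-hand test for the same reason; only "abc" passes on both sides. Real corpora would need thresholds tuned to corpus size.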
