1 research outputs found

    Exploiting Unlabeled Text to Extract New Words of Different Semantic Transparency for Chinese Word Segmentation Richard Tzong-Han Tsai

    No full text
    corresponding author This paper exploits unlabeled text data to improve new word identification and Chinese word segmentation performance. Our contributions are twofold. First, for new words that lack semantic transparency, such as person, location, or transliteration names, we calculate association metrics of adjacent character segments on unlabeled data and encode this information as features. Second, we construct an internal dictionary by using an initial model to extract words from both the unlabeled training and test set to maintain balanced coverage on the training and test set. In comparison to the baseline model which only uses n-gram features, our approach increases new word recall up to 6.0%. Additionally, our approaches reduce segmentation errors up to 32.3%. Our system achieves state-of-the-art performance for both the closed and open tasks of the 2006 SIGHAN bakeoff.
    corecore