
Unsupervised Statistical Segmentation of Japanese Kanji Strings

Abstract

Word segmentation is an important issue in Japanese language processing because Japanese is written without space delimiters between words. We propose a simple dictionary-less method to segment Japanese kanji sequences into words based solely on character n-gram counts from an unannotated corpus. The performance was often better than that of rule-based morphological analyzers over a variety of both standard and novel error metrics.
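The abstract does not spell out the decision rule, so the following is only a minimal sketch of how a segmenter driven purely by character n-gram counts might look. It assumes that a word boundary is likely at a position when the n-grams ending or starting there are more frequent than the n-grams straddling it. The function names (`count_ngrams`, `boundary_score`, `segment`), the single n-gram order, the threshold, and the toy corpus are all hypothetical choices made for illustration, not details taken from the paper.

```python
from collections import Counter


def count_ngrams(corpus, n):
    """Count every character n-gram in an iterable of unsegmented strings."""
    counts = Counter()
    for line in corpus:
        for i in range(len(line) - n + 1):
            counts[line[i:i + n]] += 1
    return counts


def boundary_score(text, k, counts, n):
    """Score a candidate boundary between text[k-1] and text[k].

    Returns the fraction of comparisons in which an n-gram adjacent to
    position k (ending or starting there) is strictly more frequent than
    an n-gram that straddles position k. This rule is an assumption.
    """
    sides = []
    if k - n >= 0:
        sides.append(text[k - n:k])      # n-gram ending at the boundary
    if k + n <= len(text):
        sides.append(text[k:k + n])      # n-gram starting at the boundary
    # n-grams that contain both text[k-1] and text[k]
    straddling = [text[j:j + n]
                  for j in range(max(0, k - n + 1), k)
                  if j + n <= len(text)]
    total = len(sides) * len(straddling)
    if total == 0:
        return 0.0
    votes = sum(counts[s] > counts[t] for s in sides for t in straddling)
    return votes / total


def segment(text, counts, n=2, threshold=0.5):
    """Split text at every position whose score reaches the threshold."""
    cuts = [k for k in range(1, len(text))
            if boundary_score(text, k, counts, n) >= threshold]
    pieces, start = [], 0
    for k in cuts + [len(text)]:
        pieces.append(text[start:k])
        start = k
    return pieces


# Toy unannotated corpus (hypothetical data, for illustration only).
toy_corpus = ["東京", "東京", "大学", "大学", "東京大学"]
bigram_counts = count_ngrams(toy_corpus, 2)
print(segment("東京大学", bigram_counts))  # ['東京', '大学']
```

In the toy run, the bigrams 東京 and 大学 each occur three times while the straddling bigram 京大 occurs once, so the only position that clears the threshold is the one between 京 and 大. A realistic setup would presumably need a much larger corpus and some way of combining several n-gram orders.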
