
    A Comparison of Chinese Word Segmentation on News and Microblog Corpora with a Lexicon Based Method

    Microblogs are a new and important form of social media. Can traditional methods handle Chinese microblog word segmentation well? We adopt the forward maximum matching (FMM) method and design rules to recognize words that contain non-Chinese characters. We focus on comparing results between news text and microblog text. The lexicon-based method lets us examine new words emerging in microblogs by comparing them against the lexicon words. Experimental results show that, under the same setup, performance on microblog text is higher than on news text, which may be a signal that microblog word segmentation is not as hard as expected.