
    Y.W.: Maximum entropy word segmentation of Chinese text

We extended the work of Low, Ng, and Guo (2005) to create a Chinese word segmentation system based upon a maximum entropy statistical model. This system was entered into the Third International Chinese Language Processing Bakeoff and evaluated on all four corpora in their respective open tracks. Our system achieved the highest F-score for the UPUC corpus, and the second, third, and seventh highest for CKIP, CITYU, and MSRA, respectively. Later testing with the gold-standard data revealed that while the additions we made to Low et al.'s system helped our results for the 2005 data with which we experimented during development, a number of them actually hurt our scores for this year's corpora.

1 Segmenter

Our Chinese word segmenter is a modification of the system described by Low et al. (2005), which they entered in the 2005 Second International Chinese Word Segmentation Bakeoff. It uses a maximum entropy model (Ratnaparkhi, 1998) trained on the training corpora provided for this year's bakeoff. The maximum entropy framework used is the Python interface of Zhang Le's maximum entropy modeling toolkit (Zhang, 2004).

1.1 Properties in common with Low et al.

As with the system of Low et al., our system treats the word segmentation problem as a tagging problem. When segmenting a string of Chinese text, each character can be assigned one of four boundary tags: S for a character that stands alone as a single-character word, B for the first character of a multi-character word, M for a character in the middle of a multi-character word, and E for the last character of a multi-character word.
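To make the tagging formulation concrete, the following is a minimal sketch (not the authors' code) of how a predicted tag sequence over the characters of a sentence is decoded back into a word segmentation. The function name and the example sentence are illustrative assumptions; the S/B/M/E tag meanings follow the scheme described above.

```python
def tags_to_words(chars, tags):
    """Decode a per-character S/B/M/E tag sequence into words.

    S = single-character word, B = word-initial character,
    M = word-internal character, E = word-final character.
    """
    words, current = [], []
    for ch, tag in zip(chars, tags):
        if tag == "S":
            # A character that stands alone as a word.
            words.append(ch)
        elif tag == "B":
            # Start accumulating a new multi-character word.
            current = [ch]
        elif tag == "M":
            current.append(ch)
        else:  # "E": close off the current word.
            current.append(ch)
            words.append("".join(current))
            current = []
    return words


# Hypothetical example: "我喜欢北京" segmented as 我 / 喜欢 / 北京.
print(tags_to_words(list("我喜欢北京"), ["S", "B", "E", "B", "E"]))
```

In a maximum entropy tagger, the tag for each character is chosen from the model's conditional distribution given contextual features; the decoding step above then turns the best tag sequence into word boundaries.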