Neural Chinese Word Segmentation with Lexicon and Unlabeled Data via
  Posterior Regularization

Cai Deng; Chen Xinchi; Dauphin Yann; Ganchev Kuzman; Lafferty John D.; Levow Gina-Anne; Liu Junxin; Liu Yijia; Low Jin Kiat; Luo Wencan; Pei Wenzhe; Sun Weiwei; Xu Jingjing; Xue Nianwen; Yang Jie; Zhang Jiacheng; Zhang Meishan; Zhang Qi; Zhang Yanna; Zhao Hai; Zhao Lujun; Zheng Xiaoqing

Publication date: 26 April 2019
Publisher: Association for Computing Machinery (ACM)

Abstract

Existing methods for Chinese word segmentation (CWS) usually rely on a large number of labeled sentences to train segmentation models, and such sentences are expensive and time-consuming to annotate. Fortunately, unlabeled data is usually easy to collect and many high-quality Chinese lexicons are available off the shelf, both of which can provide useful information for CWS. In this paper, we propose a neural approach for CWS that can exploit both lexicon and unlabeled data. Our approach is based on a variant of the posterior regularization algorithm, in which the unlabeled data and lexicon are incorporated into model training as indirect supervision by regularizing the prediction space of CWS models. Extensive experiments on multiple benchmark datasets in both in-domain and cross-domain scenarios validate the effectiveness of our approach.

Comment: 7 pages, 11 figures, accepted by the 2019 World Wide Web Conference (WWW '19)
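
The abstract does not spell out how a lexicon constrains predictions on unlabeled text, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes a BMES tagging scheme, uses forward maximum matching to turn lexicon hits into per-character tag constraints, and scores a model's per-character marginals by the expected number of constraint violations, which is the kind of quantity a posterior-regularization term could penalize. The names `lexicon_tags`, `expected_violations`, and `max_len` are hypothetical.

```python
from typing import Dict, List, Optional, Set


def lexicon_tags(sentence: str, lexicon: Set[str], max_len: int = 4) -> List[Optional[str]]:
    """Forward maximum matching over multi-character lexicon words.

    Returns one entry per character: "B", "M", or "E" where a lexicon word
    covers the character, and None where no constraint applies.
    (Illustrative assumption: single characters are left unconstrained.)
    """
    tags: List[Optional[str]] = [None] * len(sentence)
    i = 0
    while i < len(sentence):
        matched = False
        # Try the longest candidate first, down to length 2.
        for n in range(min(max_len, len(sentence) - i), 1, -1):
            if sentence[i:i + n] in lexicon:
                tags[i] = "B"
                for j in range(i + 1, i + n - 1):
                    tags[j] = "M"
                tags[i + n - 1] = "E"
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


def expected_violations(marginals: List[Dict[str, float]],
                        tags: List[Optional[str]]) -> float:
    """Expected number of constrained positions where the model disagrees
    with the lexicon-derived tag, given per-character tag marginals."""
    total = 0.0
    for probs, tag in zip(marginals, tags):
        if tag is not None:
            total += 1.0 - probs.get(tag, 0.0)
    return total


if __name__ == "__main__":
    toy_lexicon = {"北京", "大学", "北京大学"}  # hypothetical toy lexicon
    print(lexicon_tags("我在北京大学", toy_lexicon))
```

In a training loop, a quantity like `expected_violations` computed on unlabeled sentences could be added to the supervised loss with a weight, so that the model is nudged toward lexicon-consistent segmentations without treating the lexicon matches as gold labels.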

Last time updated on 10/08/2021