399 research outputs found
Neural Chinese Word Segmentation with Lexicon and Unlabeled Data via Posterior Regularization
Existing methods for CWS usually rely on a large number of labeled sentences
to train word segmentation models, and such sentences are expensive and
time-consuming to annotate. Fortunately, unlabeled data is usually easy to
collect, and many high-quality Chinese lexicons are available off the shelf;
both can provide useful information for CWS. In this paper, we propose a
neural approach for Chinese word segmentation that exploits both a lexicon
and unlabeled data.
Our approach is based on a variant of the posterior regularization algorithm, and
the unlabeled data and lexicon are incorporated into model training as indirect
supervision by regularizing the prediction space of CWS models. Extensive
experiments on multiple benchmark datasets in both in-domain and cross-domain
scenarios validate the effectiveness of our approach.
Comment: 7 pages, 11 figures, accepted by the 2019 World Wide Web Conference
(WWW '19).
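The abstract above describes using a lexicon as indirect supervision by constraining the prediction space of a BMES tagger. A minimal sketch of one ingredient of such a scheme, assuming a standard BMES tag set (the function name and the greedy matching strategy are illustrative, not the paper's actual algorithm):

```python
def lexicon_constraints(sentence, lexicon, max_len=4):
    """For each lexicon word found in the sentence, record the BMES tags
    its span implies: B at the first character, M for interior characters,
    E at the last, or S for a single-character word. A posterior
    regularization objective could then penalize tag posteriors that put
    probability mass outside these preferred tags."""
    constraints = {}  # character position -> set of preferred tags
    for i in range(len(sentence)):
        for j in range(i + 1, min(i + max_len, len(sentence)) + 1):
            if sentence[i:j] in lexicon:
                if j - i == 1:
                    constraints.setdefault(i, set()).add("S")
                else:
                    constraints.setdefault(i, set()).add("B")
                    for k in range(i + 1, j - 1):
                        constraints.setdefault(k, set()).add("M")
                    constraints.setdefault(j - 1, set()).add("E")
    return constraints
```

For example, with the lexicon {"北京", "大学"}, the sentence "北京大学" yields preferred tags B, E, B, E at positions 0 through 3. Overlapping lexicon matches simply accumulate multiple preferred tags at a position, which is why the values are sets.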
Neural Word Segmentation with Rich Pretraining
Neural word segmentation research has benefited from large-scale raw texts by
leveraging them for pretraining character and word embeddings. On the other
hand, statistical segmentation research has exploited richer sources of
external information, such as punctuation, automatic segmentation and POS. We
investigate the effectiveness of a range of external training sources for
neural word segmentation by building a modular segmentation model, pretraining
the most important submodule using rich external sources. Results show that
such pretraining significantly improves the model, leading to accuracies
competitive with the best methods on six benchmarks.
Comment: Accepted by ACL 2017.
Mining Word Boundaries in Speech as Naturally Annotated Word Segmentation Data
Inspired by early research on exploring naturally annotated data for Chinese
word segmentation (CWS), and also by recent research on integration of speech
and text processing, this work for the first time proposes to mine word
boundaries from parallel speech/text data. First we collect parallel
speech/text data from two Internet sources that are related to the CWS data used
in our experiments. Then, we obtain character-level alignments and design
simple heuristic rules for determining word boundaries according to pause
duration between adjacent characters. Finally, we present an effective
complete-then-train strategy that can better utilize extra naturally annotated
data for model training. Experiments demonstrate that our approach can
significantly boost CWS performance in both cross-domain and low-resource
scenarios.
Comment: latest version.
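The pause-duration heuristic described above can be sketched as follows, assuming character-level alignments of the form (character, start time, end time) from a forced aligner; the threshold value and function name are illustrative assumptions, not taken from the paper:

```python
def boundaries_from_pauses(char_alignments, pause_threshold=0.2):
    """char_alignments: list of (char, start_sec, end_sec) tuples from a
    character-level speech/text alignment. Propose a word boundary before
    character i+1 whenever the silence between character i's end and
    character i+1's start meets the threshold (in seconds)."""
    boundaries = set()
    for idx in range(len(char_alignments) - 1):
        _, _, end = char_alignments[idx]
        _, next_start, _ = char_alignments[idx + 1]
        if next_start - end >= pause_threshold:
            boundaries.add(idx + 1)  # boundary index before char idx+1
    return boundaries
```

Boundaries mined this way are naturally partial (most word boundaries have no audible pause), which is consistent with the paper's complete-then-train strategy: the sparse, reliable boundaries serve as extra supervision rather than a full segmentation.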
Improving Cross-Domain Chinese Word Segmentation with Word Embeddings
Cross-domain Chinese Word Segmentation (CWS) remains a challenge despite
recent progress in neural-based CWS. The limited amount of annotated data in
the target domain has been the key obstacle to a satisfactory performance. In
this paper, we propose a semi-supervised word-based approach to improving
cross-domain CWS given a baseline segmenter. In particular, our model deploys
only word embeddings trained on raw text in the target domain, discarding
complex hand-crafted features and domain-specific dictionaries. Innovative
subsampling and negative sampling methods are proposed to derive word
embeddings optimized for CWS. We conduct experiments on five datasets from
specialized domains, covering novels, medicine, and patents. Results show
that our model clearly improves cross-domain CWS, especially in the
segmentation of domain-specific noun entities. The word F-measure increases by
over 3.0% on four datasets, outperforming state-of-the-art semi-supervised and
unsupervised cross-domain CWS approaches by a large margin. We make our code
and data available on GitHub.