399 research outputs found
Neural Chinese Word Segmentation with Lexicon and Unlabeled Data via Posterior Regularization
Existing methods for CWS usually rely on a large number of labeled sentences
to train word segmentation models, and such sentences are expensive and
time-consuming to annotate. Fortunately, unlabeled data is usually easy to
collect, and many high-quality Chinese lexicons are available off the shelf;
both can provide useful information for CWS. In this paper, we propose a
neural approach for Chinese word segmentation that exploits both a lexicon
and unlabeled data.
Our approach is based on a variant of the posterior regularization algorithm, and
the unlabeled data and lexicon are incorporated into model training as indirect
supervision by regularizing the prediction space of CWS models. Extensive
experiments on multiple benchmark datasets in both in-domain and cross-domain
scenarios validate the effectiveness of our approach.
Comment: 7 pages, 11 figures, accepted by the 2019 World Wide Web Conference
(WWW '19).
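The abstract above describes using a lexicon as indirect supervision by constraining the prediction space of a BMES tagger. A minimal sketch of one ingredient of such a scheme, assuming a standard BMES tag set (the function name and the greedy matching strategy are illustrative, not the paper's actual algorithm):

```python
def lexicon_constraints(sentence, lexicon, max_len=4):
    """For each lexicon word found in the sentence, record the BMES tags
    its span implies: B at the first character, M for interior characters,
    E at the last, or S for a single-character word. A posterior
    regularization objective could then penalize tag posteriors that put
    probability mass outside these preferred tags."""
    constraints = {}  # character position -> set of preferred tags
    for i in range(len(sentence)):
        for j in range(i + 1, min(i + max_len, len(sentence)) + 1):
            if sentence[i:j] in lexicon:
                if j - i == 1:
                    constraints.setdefault(i, set()).add("S")
                else:
                    constraints.setdefault(i, set()).add("B")
                    for k in range(i + 1, j - 1):
                        constraints.setdefault(k, set()).add("M")
                    constraints.setdefault(j - 1, set()).add("E")
    return constraints
```

For example, with the lexicon {"北京", "大学"}, the sentence "北京大学" yields preferred tags B, E, B, E at positions 0 through 3. Overlapping lexicon matches simply accumulate multiple preferred tags at a position, which is why the values are sets.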
Neural Word Segmentation with Rich Pretraining
Neural word segmentation research has benefited from large-scale raw texts by
leveraging them for pretraining character and word embeddings. On the other
hand, statistical segmentation research has exploited richer sources of
external information, such as punctuation, automatic segmentation and POS. We
investigate the effectiveness of a range of external training sources for
neural word segmentation by building a modular segmentation model, pretraining
the most important submodule using rich external sources. Results show that
such pretraining significantly improves the model, leading to accuracies
competitive with the best methods on six benchmarks.
Comment: Accepted by ACL 2017.
Mining Word Boundaries in Speech as Naturally Annotated Word Segmentation Data
Inspired by early research on exploring naturally annotated data for Chinese
word segmentation (CWS), and also by recent research on integration of speech
and text processing, this work for the first time proposes to mine word
boundaries from parallel speech/text data. First we collect parallel
speech/text data from two Internet sources that are related to the CWS data used
in our experiments. Then, we obtain character-level alignments and design
simple heuristic rules for determining word boundaries according to pause
duration between adjacent characters. Finally, we present an effective
complete-then-train strategy that can better utilize extra naturally annotated
data for model training. Experiments demonstrate that our approach can
significantly boost CWS performance in both cross-domain and low-resource
scenarios.
Comment: latest version.
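The pause-duration heuristic described above can be sketched as follows, assuming character-level alignments of the form (character, start time, end time) from a forced aligner; the threshold value and function name are illustrative assumptions, not taken from the paper:

```python
def boundaries_from_pauses(char_alignments, pause_threshold=0.2):
    """char_alignments: list of (char, start_sec, end_sec) tuples from a
    character-level speech/text alignment. Propose a word boundary before
    character i+1 whenever the silence between character i's end and
    character i+1's start meets the threshold (in seconds)."""
    boundaries = set()
    for idx in range(len(char_alignments) - 1):
        _, _, end = char_alignments[idx]
        _, next_start, _ = char_alignments[idx + 1]
        if next_start - end >= pause_threshold:
            boundaries.add(idx + 1)  # boundary index before char idx+1
    return boundaries
```

Boundaries mined this way are naturally partial (most word boundaries have no audible pause), which is consistent with the paper's complete-then-train strategy: the sparse, reliable boundaries serve as extra supervision rather than a full segmentation.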
Improving Cross-Domain Chinese Word Segmentation with Word Embeddings
Cross-domain Chinese Word Segmentation (CWS) remains a challenge despite
recent progress in neural-based CWS. The limited amount of annotated data in
the target domain has been the key obstacle to a satisfactory performance. In
this paper, we propose a semi-supervised word-based approach to improving
cross-domain CWS given a baseline segmenter. In particular, our model deploys
only word embeddings trained on raw text in the target domain, discarding
complex hand-crafted features and domain-specific dictionaries. Innovative
subsampling and negative sampling methods are proposed to derive word
embeddings optimized for CWS. We conduct experiments on five datasets from
specialized domains, covering novels, medicine, and patents. Results show
that our model clearly improves cross-domain CWS, especially in the
segmentation of domain-specific noun entities. The word F-measure increases by
over 3.0% on four datasets, outperforming state-of-the-art semi-supervised and
unsupervised cross-domain CWS approaches by a large margin. We make our code
and data available on GitHub.