399 research outputs found

    Neural Chinese Word Segmentation with Lexicon and Unlabeled Data via Posterior Regularization

    Full text link
    Existing methods for CWS usually rely on a large number of labeled sentences to train word segmentation models, which are expensive and time-consuming to annotate. Luckily, the unlabeled data is usually easy to collect and many high-quality Chinese lexicons are off-the-shelf, both of which can provide useful information for CWS. In this paper, we propose a neural approach for Chinese word segmentation which can exploit both lexicon and unlabeled data. Our approach is based on a variant of posterior regularization algorithm, and the unlabeled data and lexicon are incorporated into model training as indirect supervision by regularizing the prediction space of CWS models. Extensive experiments on multiple benchmark datasets in both in-domain and cross-domain scenarios validate the effectiveness of our approach.Comment: 7 pages, 11 figures, accepted by the 2019 World Wide Web Conference (WWW '19

    Neural Word Segmentation with Rich Pretraining

    Full text link
    Neural word segmentation research has benefited from large-scale raw texts by leveraging them for pretraining character and word embeddings. On the other hand, statistical segmentation research has exploited richer sources of external information, such as punctuation, automatic segmentation and POS. We investigate the effectiveness of a range of external training sources for neural word segmentation by building a modular segmentation model, pretraining the most important submodule using rich external sources. Results show that such pretraining significantly improves the model, leading to accuracies competitive to the best methods on six benchmarks.Comment: Accepted by ACL 201

    Mining Word Boundaries in Speech as Naturally Annotated Word Segmentation Data

    Full text link
    Inspired by early research on exploring naturally annotated data for Chinese word segmentation (CWS), and also by recent research on integration of speech and text processing, this work for the first time proposes to mine word boundaries from parallel speech/text data. First we collect parallel speech/text data from two Internet sources that are related with CWS data used in our experiments. Then, we obtain character-level alignments and design simple heuristic rules for determining word boundaries according to pause duration between adjacent characters. Finally, we present an effective complete-then-train strategy that can better utilize extra naturally annotated data for model training. Experiments demonstrate our approach can significantly boost CWS performance in both cross-domain and low-resource scenarios.Comment: latest versio

    Improving Cross-Domain Chinese Word Segmentation with Word Embeddings

    Full text link
    Cross-domain Chinese Word Segmentation (CWS) remains a challenge despite recent progress in neural-based CWS. The limited amount of annotated data in the target domain has been the key obstacle to a satisfactory performance. In this paper, we propose a semi-supervised word-based approach to improving cross-domain CWS given a baseline segmenter. Particularly, our model only deploys word embeddings trained on raw text in the target domain, discarding complex hand-crafted features and domain-specific dictionaries. Innovative subsampling and negative sampling methods are proposed to derive word embeddings optimized for CWS. We conduct experiments on five datasets in special domains, covering domains in novels, medicine, and patent. Results show that our model can obviously improve cross-domain CWS, especially in the segmentation of domain-specific noun entities. The word F-measure increases by over 3.0% on four datasets, outperforming state-of-the-art semi-supervised and unsupervised cross-domain CWS approaches with a large margin. We make our code and data available on Github
    • …
    corecore