8,651 research outputs found

    A Graph-based Model for Joint Chinese Word Segmentation and Dependency Parsing

    Chinese word segmentation and dependency parsing are two fundamental tasks for Chinese natural language processing. Dependency parsing is defined at the word level, so word segmentation is a precondition of dependency parsing; as a result, dependency parsing suffers from error propagation and cannot directly make use of character-level pre-trained language models (such as BERT). In this paper, we propose a graph-based model to integrate Chinese word segmentation and dependency parsing. Unlike previous transition-based joint models, our proposed model is more concise, which results in less feature-engineering effort. Our graph-based joint model achieves better performance than previous joint models and state-of-the-art results in both Chinese word segmentation and dependency parsing. Moreover, when combined with BERT, our model substantially reduces the performance gap in dependency parsing between joint models and gold-segmented word-based models. Our code is publicly available at https://github.com/fastnlp/JointCwsParser. Comment: Accepted at Transactions of the Association for Computational Linguistics (TACL)

    Combining Discrete and Neural Features for Sequence Labeling

    Neural network models have recently received intense research attention in the natural language processing community. Compared with traditional models with discrete features, neural models have two main advantages. First, they take low-dimensional, real-valued embedding vectors as inputs, which can be trained over large raw data, thereby addressing the issue of feature sparsity in discrete models. Second, deep neural networks can be used to automatically combine input features, including non-local features that capture semantic patterns that cannot be expressed using discrete indicator features. As a result, neural network models have achieved competitive accuracies compared with the best discrete models for a range of NLP tasks. On the other hand, manual feature templates have been carefully investigated for most NLP tasks over decades and typically cover the most useful indicator patterns for solving the problems. Such information can be complementary to the features automatically induced from neural networks, and therefore combining discrete and neural features can potentially lead to better accuracy compared with models that leverage discrete or neural features only. In this paper, we systematically investigate the effect of discrete and neural feature combination for a range of fundamental NLP tasks based on sequence labeling, including word segmentation, POS tagging and named entity recognition for Chinese and English. Our results on standard benchmarks show that state-of-the-art neural models can give accuracies comparable to the best discrete models in the literature for most tasks, and that combining discrete and neural features consistently yields better results. Comment: Accepted by International Conference on Computational Linguistics and Intelligent Text Processing (CICLing) 2016, April

    Improving Named Entity Recognition for Chinese Social Media with Word Segmentation Representation Learning

    Named entity recognition, and other information extraction tasks, frequently use linguistic features such as part-of-speech tags or chunkings. For languages where word boundaries are not readily identified in text, word segmentation is a key first step to generating features for an NER system. While using word boundary tags as features is helpful, the signals that aid in identifying these boundaries may provide richer information for an NER system. New state-of-the-art word segmentation systems use neural models to learn representations for predicting word boundaries. We show that these same representations, jointly trained with an NER system, yield significant improvements in NER for Chinese social media. In our experiments, jointly training NER and word segmentation with an LSTM-CRF model yields nearly 5% absolute improvement over previously published results. Comment: This is the camera-ready version of our ACL'16 paper. We also added supplementary material containing the results of our systems on a cleaner dataset (much higher F1 scores). For more information, please refer to the repo https://github.com/hltcoe/golden-hors

    Attention Is All You Need for Chinese Word Segmentation

    Taking the greedy decoding algorithm as given, this work focuses on further strengthening the model itself for Chinese word segmentation (CWS), which results in an even faster and more accurate CWS model. Our model consists of an attention-only stacked encoder and a sufficiently light decoder for greedy segmentation, plus two highway connections for smoother training. The encoder is composed of a newly proposed Transformer variant, the Gaussian-masked Directional (GD) Transformer, and a biaffine attention scorer. With this effective encoder design, our model only needs unigram features for scoring. Our model is evaluated on the SIGHAN Bakeoff benchmark datasets. The experimental results show that, with the highest segmentation speed, the proposed model achieves new state-of-the-art or comparable performance against strong baselines under the strict closed test setting. Comment: 11 pages, to appear in EMNLP 2020 as a long paper

    Fast and Accurate Neural Word Segmentation for Chinese

    Neural models with minimal feature engineering have achieved competitive performance against traditional methods for the task of Chinese word segmentation. However, both the training and the working procedures of current neural models are computationally inefficient. This paper presents a greedy neural word segmenter with balanced word and character embedding inputs to alleviate these drawbacks. Our segmenter is truly end-to-end, capable of performing segmentation much faster and even more accurately than state-of-the-art neural models on Chinese benchmark datasets. Comment: To appear in ACL201

    Tracing a Loose Wordhood for Chinese Input Method Engine

    Chinese input methods are used to convert pinyin sequences, or sequences in other Latin encoding systems, into Chinese character sentences. For more effective pinyin-to-character conversion, typical Input Method Engines (IMEs) rely on a predefined vocabulary that demands manual maintenance on a schedule. To remove this inconvenient vocabulary setting, this work focuses on automatic wordhood acquisition, taking full account of the fact that Chinese inputting is a free human-computer interaction procedure. Instead of strictly defining words, a loose word likelihood is introduced to measure how likely a character sequence is to be a user-recognized word when using an IME. An online algorithm is then proposed to adjust the word likelihood or generate new words by comparing the user's true input choice with the algorithm's prediction. The experimental results show that the proposed solution can adapt quickly to diverse typing behavior and achieves performance approaching that of a highly optimized IME with a fixed vocabulary.

    Open Vocabulary Learning for Neural Chinese Pinyin IME

    Pinyin-to-character (P2C) conversion is the core component of a pinyin-based Chinese input method engine (IME). However, the conversion is seriously compromised by the ambiguity of Chinese characters corresponding to pinyin, as well as by predefined fixed vocabularies. To alleviate these inconveniences, we propose a neural P2C conversion model augmented by an online-updated vocabulary with a sampling mechanism to support open vocabulary learning while the IME is in use. Our experiments show that the proposed method outperforms commercial IMEs and state-of-the-art traditional models on a standard corpus and a real input history dataset in terms of multiple metrics, and thus the online-updated vocabulary indeed helps our IME effectively follow user inputting behavior. Comment: Accepted by ACL 201

    Neural Word Segmentation with Rich Pretraining

    Neural word segmentation research has benefited from large-scale raw texts by leveraging them for pretraining character and word embeddings. On the other hand, statistical segmentation research has exploited richer sources of external information, such as punctuation, automatic segmentation and POS. We investigate the effectiveness of a range of external training sources for neural word segmentation by building a modular segmentation model and pretraining its most important submodule using rich external sources. Results show that such pretraining significantly improves the model, leading to accuracies competitive with the best methods on six benchmarks. Comment: Accepted by ACL 201

    82 Treebanks, 34 Models: Universal Dependency Parsing with Multi-Treebank Models

    We present the Uppsala system for the CoNLL 2018 Shared Task on universal dependency parsing. Our system is a pipeline consisting of three components: the first performs joint word and sentence segmentation; the second predicts part-of-speech tags and morphological features; the third predicts dependency trees from words and tags. Instead of training a single parsing model for each treebank, we trained models with multiple treebanks for one language or closely related languages, greatly reducing the number of models. On the official test run, we ranked 7th of 27 teams for the LAS and MLAS metrics. Our system obtained the best scores overall for word segmentation, universal POS tagging, and morphological features. Comment: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

    2kenize: Tying Subword Sequences for Chinese Script Conversion

    Simplified Chinese to Traditional Chinese character conversion is a common preprocessing step in Chinese NLP. Despite this, current approaches perform poorly because they do not take into account that a simplified Chinese character can correspond to multiple traditional characters. Here, we propose a model that can disambiguate between mappings and convert between the two scripts. The model is based on subword segmentation, two language models, and a method for mapping between subword sequences. We further construct benchmark datasets for topic classification and script conversion. Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert the pretraining dataset for topic classification. An error analysis reveals that our method's particular strengths are in dealing with code-mixing and named entities. Comment: Accepted to ACL 202