A Graph-based Model for Joint Chinese Word Segmentation and Dependency Parsing
Chinese word segmentation and dependency parsing are two fundamental tasks in Chinese natural language processing. Dependency parsing is defined at the word level, so word segmentation is a precondition for it; as a result, dependency parsing suffers from error propagation and cannot directly use character-level pre-trained language models such as BERT. In this paper, we propose a graph-based model that integrates Chinese word segmentation and dependency parsing. Unlike previous transition-based joint models, our model is more concise and requires less feature engineering. It outperforms previous joint models and achieves state-of-the-art results in both Chinese word segmentation and dependency parsing. Moreover, when combined with BERT, our model substantially reduces the dependency-parsing performance gap between joint models and gold-segmented word-based models. Our code is publicly available at https://github.com/fastnlp/JointCwsParser.
Comment: Accepted at Transactions of the Association for Computational Linguistics (TACL)
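The abstract does not spell out the architecture, but the graph-based formulation suggests scoring head-dependent pairs directly at the character level. Below is a minimal, hypothetical sketch of a character-level biaffine arc scorer in PyTorch; every dimension, layer choice, and name here is an illustrative assumption, not the authors' exact model.

```python
# Minimal sketch of character-level biaffine arc scoring for joint
# CWS and dependency parsing. All dimensions and layers are
# illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class CharBiaffineParser(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden=256, arc_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.dep_mlp = nn.Linear(2 * hidden, arc_dim)
        self.head_mlp = nn.Linear(2 * hidden, arc_dim)
        # Biaffine weight scores every (dependent, head) character pair.
        self.W = nn.Parameter(torch.randn(arc_dim + 1, arc_dim) * 0.01)

    def forward(self, chars):                      # chars: (B, T)
        h, _ = self.encoder(self.embed(chars))     # (B, T, 2H)
        dep = torch.relu(self.dep_mlp(h))          # (B, T, A)
        head = torch.relu(self.head_mlp(h))        # (B, T, A)
        dep = torch.cat([dep, torch.ones_like(dep[..., :1])], -1)
        # (B, T, A+1) @ (A+1, A) @ (B, A, T) -> (B, T, T) arc scores.
        return dep @ self.W @ head.transpose(1, 2)
```

In a formulation along these lines, intra-word character-to-character arcs can carry a dedicated label, so both the segmentation and the word-level dependency tree can be read off a single character-level parse.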
Combining Discrete and Neural Features for Sequence Labeling
Neural network models have recently received intense research attention in the
natural language processing community. Compared with traditional models with
discrete features, neural models have two main advantages. First, they take
low-dimensional, real-valued embedding vectors as inputs, which can be trained
over large raw data, thereby addressing the issue of feature sparsity in
discrete models. Second, deep neural networks can be used to automatically
combine input features, including non-local features that capture semantic
patterns that cannot be expressed using discrete indicator features. As a
result, neural network models have achieved competitive accuracies compared
with the best discrete models for a range of NLP tasks.
On the other hand, manual feature templates have been carefully investigated
for most NLP tasks over decades and typically cover the most useful indicator patterns for solving the problems. Such information can be complementary to the features automatically induced by neural networks, and therefore combining
discrete and neural features can potentially lead to better accuracy compared
with models that leverage discrete or neural features only.
In this paper, we systematically investigate the effect of discrete and
neural feature combination for a range of fundamental NLP tasks based on
sequence labeling, including word segmentation, POS tagging and named entity
recognition for Chinese and English. Our results on standard benchmarks show that state-of-the-art neural models give accuracies comparable to the best discrete models in the literature for most tasks, and that combining discrete and neural features consistently yields better results.
Comment: Accepted by International Conference on Computational Linguistics and Intelligent Text Processing (CICLing) 2016, April
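As a concrete illustration of score-level combination, the sketch below adds a linear model over sparse indicator features to a neural emission score before any CRF layer; the feature inventory, dimensions, and names are hypothetical, not the paper's templates.

```python
# Illustrative sketch: per-token emission scores are the sum of a
# sparse indicator-feature score (a linear model) and a neural score.
# Feature ids and dimensions are invented for illustration.
import torch
import torch.nn as nn

class CombinedEmissions(nn.Module):
    def __init__(self, n_discrete_feats, hidden_dim, n_tags):
        super().__init__()
        # One weight per (indicator feature, tag), as in a linear model.
        self.discrete = nn.EmbeddingBag(n_discrete_feats, n_tags, mode="sum")
        self.neural = nn.Linear(hidden_dim, n_tags)

    def forward(self, feat_ids, encoder_state):
        # feat_ids: (B, F) active indicator-feature ids for each token,
        # encoder_state: (B, H) hidden state from a neural encoder.
        # The two views are simply added before a CRF or softmax layer.
        return self.discrete(feat_ids) + self.neural(encoder_state)
```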
Improving Named Entity Recognition for Chinese Social Media with Word Segmentation Representation Learning
Named entity recognition, and other information extraction tasks, frequently
use linguistic features such as part-of-speech tags or chunking. For languages
where word boundaries are not readily identified in text, word segmentation is
a key first step to generating features for an NER system. While using word
boundary tags as features is helpful, the signals that aid in identifying
these boundaries may provide richer information for an NER system. New
state-of-the-art word segmentation systems use neural models to learn
representations for predicting word boundaries. We show that these same
representations, jointly trained with an NER system, yield significant
improvements in NER for Chinese social media. In our experiments, jointly
training NER and word segmentation with an LSTM-CRF model yields nearly 5%
absolute improvement over previously published results.
Comment: This is the camera-ready version of our ACL'16 paper. We also added supplementary material containing the results of our systems on a cleaner dataset (much higher F1 scores). For more information, please refer to the repo https://github.com/hltcoe/golden-horse
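A rough sketch of the joint setup, assuming a shared character encoder feeding two task heads; for brevity the CRF layers are replaced by per-token softmax heads, and all names and dimensions are invented.

```python
# Hedged sketch of joint NER + word segmentation with a shared
# character encoder. Softmax heads stand in for the paper's CRFs;
# dimensions and tag-set sizes are assumptions.
import torch
import torch.nn as nn

class JointSegNER(nn.Module):
    def __init__(self, vocab, emb=100, hidden=150, n_seg=4, n_ner=9):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True,
                            bidirectional=True)   # shared representation
        self.seg_head = nn.Linear(2 * hidden, n_seg)  # BMES boundaries
        self.ner_head = nn.Linear(2 * hidden, n_ner)  # BIO entity tags

    def forward(self, chars):
        h, _ = self.lstm(self.embed(chars))
        return self.seg_head(h), self.ner_head(h)

def joint_loss(model, chars, seg_gold, ner_gold, alpha=0.5):
    # Weighted sum of the two task losses; gradients flow into the
    # shared encoder from both, which is the point of joint training.
    seg_logits, ner_logits = model(chars)
    ce = nn.functional.cross_entropy
    return alpha * ce(seg_logits.transpose(1, 2), seg_gold) + \
           (1 - alpha) * ce(ner_logits.transpose(1, 2), ner_gold)
```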
Attention Is All You Need for Chinese Word Segmentation
Taking the greedy decoding algorithm as given, this work focuses on further strengthening the model itself for Chinese word segmentation (CWS), resulting in an even faster and more accurate CWS model. Our model consists of an attention-only stacked encoder and a lightweight decoder for greedy segmentation, plus two highway connections for smoother training; the encoder is composed of a newly proposed Transformer variant, the Gaussian-masked Directional (GD) Transformer, and a biaffine attention scorer. With this effective encoder design, our model needs only unigram features for scoring. Evaluated on the SIGHAN Bakeoff benchmark datasets, the proposed model achieves the highest segmentation speed together with new state-of-the-art or comparable performance against strong baselines under the strict closed-test setting.
Comment: 11 pages, to appear in EMNLP 2020 as a long paper
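The core of the GD-Transformer, as best the abstract conveys it, is attention biased toward nearby characters. A loose sketch follows: a Gaussian decay over character distance is added to the attention logits (equivalent to multiplying the weights by a Gaussian mask); sigma and the omitted directional masking are guesses, not the paper's exact formulation.

```python
# Rough sketch of Gaussian-masked attention: logits get a Gaussian
# penalty that grows with character distance, so nearby characters
# dominate. sigma is an arbitrary illustrative constant.
import torch

def gaussian_masked_attention(q, k, v, sigma=2.0):
    # q, k, v: (B, T, D)
    d = q.size(-1)
    logits = q @ k.transpose(1, 2) / d ** 0.5        # (B, T, T)
    pos = torch.arange(q.size(1), dtype=torch.float)
    dist = (pos[None, :] - pos[:, None]) ** 2        # squared distance
    logits = logits - dist / (2 * sigma ** 2)        # Gaussian bias
    return torch.softmax(logits, dim=-1) @ v
```

A directional variant would additionally mask out positions on one side of each query before the softmax.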
Fast and Accurate Neural Word Segmentation for Chinese
Neural models with minimal feature engineering have achieved competitive
performance against traditional methods for the task of Chinese word
segmentation. However, both training and inference in current neural models are computationally inefficient. This paper presents a greedy neural word segmenter with balanced word and character embedding inputs to alleviate these drawbacks. Our segmenter is truly end-to-end, capable of performing segmentation much faster, and even more accurately, than state-of-the-art neural models on Chinese benchmark datasets.
Comment: To appear in ACL 2017
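To make the greedy decoding concrete, here is a toy left-to-right segmenter that commits to the best-scoring candidate word at each position; score_word is a stand-in for the paper's balanced word/character scorer and is assumed, not taken from the paper.

```python
# Toy sketch of greedy left-to-right segmentation: at each position,
# score every candidate word up to max_len and commit to the best one.
def greedy_segment(sentence, score_word, max_len=4):
    words, i = [], 0
    while i < len(sentence):
        # Candidate words starting at position i.
        candidates = [sentence[i:i + n]
                      for n in range(1, max_len + 1)
                      if i + n <= len(sentence)]
        best = max(candidates, key=lambda w: score_word(w, words))
        words.append(best)
        i += len(best)
    return words

# Example with a trivial scorer that prefers dictionary words; the
# real model would score with word and character embeddings instead.
lexicon = {"研究", "生命", "起源"}
print(greedy_segment("研究生命起源",
                     lambda w, hist: len(w) if w in lexicon else 0))
# -> ['研究', '生命', '起源']
```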
Tracing a Loose Wordhood for Chinese Input Method Engine
Chinese input methods convert pinyin sequences or other Latin encodings into Chinese character sentences. For effective pinyin-to-character conversion, typical Input Method Engines (IMEs) rely on a predefined vocabulary that demands regular manual maintenance. To remove this inconvenient vocabulary setting, this work focuses on automatic wordhood acquisition, fully exploiting the fact that Chinese inputting is a free human-computer interaction procedure. Instead of strictly defining words, a loose word likelihood is introduced to measure how likely a character sequence is to be a user-recognized word in the context of IME use. An online algorithm is then proposed to adjust the word likelihoods or generate new words by comparing the user's actual input choices against the algorithm's predictions. Experimental results show that the proposed solution adapts agilely to diverse typing behavior and approaches the performance of a highly optimized IME with a fixed vocabulary.
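A schematic version of such an online adjustment might look like the following, assuming a simple additive reward/penalty scheme with arbitrary constants; the paper's actual update rule is not given in the abstract.

```python
# Schematic online update of loose word likelihoods: reward the
# character sequence the user actually picked, demote a wrong engine
# guess. All constants are arbitrary illustrative choices.
def update_word_likelihood(likelihood, predicted, chosen,
                           reward=1.0, penalty=0.5, init=1.0):
    # Reward the user's actual choice (creating it if unseen, which
    # is how new "words" enter the vocabulary).
    likelihood[chosen] = likelihood.get(chosen, init) + reward
    # If the engine guessed differently, demote its guess.
    if predicted != chosen and predicted in likelihood:
        likelihood[predicted] = max(0.0, likelihood[predicted] - penalty)
    return likelihood
```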
Open Vocabulary Learning for Neural Chinese Pinyin IME
Pinyin-to-character (P2C) conversion is the core component of a pinyin-based Chinese input method engine (IME). However, the conversion is seriously compromised by the ambiguity of Chinese characters corresponding to a given pinyin as well as by the predefined fixed vocabulary. To alleviate these inconveniences, we propose a neural P2C conversion model augmented by an online-updated vocabulary with a sampling mechanism that supports open vocabulary learning while the IME is in use. Our experiments show that the proposed method outperforms commercial IMEs and state-of-the-art traditional models on a standard corpus and a real input-history dataset across multiple metrics, so the online-updated vocabulary indeed helps our IME follow user input behavior effectively.
Comment: Accepted by ACL 2019
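The sampling mechanism is not detailed in the abstract; the sketch below is a generic sampled-softmax-style illustration of how a growing online vocabulary could stay tractable, with all names and constants invented.

```python
# Loose sketch: the decoder normalizes over only the gold word plus a
# random sample of the ever-growing vocabulary, keeping the softmax
# tractable as new user words arrive. Generic illustration, not the
# paper's exact mechanism.
import math
import random

def sampled_word_probs(score, vocab, gold, k=64):
    # score: callable mapping a word to a real-valued logit.
    sampled = random.sample(sorted(vocab - {gold}), min(k, len(vocab) - 1))
    candidates = [gold] + sampled
    logits = [score(w) for w in candidates]
    z = max(logits)                       # stabilize the softmax
    exps = [math.exp(l - z) for l in logits]
    total = sum(exps)
    return {w: e / total for w, e in zip(candidates, exps)}

online_vocab = {"你好", "世界", "输入法", "神经网络"}
print(sampled_word_probs(lambda w: len(w), online_vocab, "你好", k=2))
```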
Neural Word Segmentation with Rich Pretraining
Neural word segmentation research has benefited from large-scale raw texts by
leveraging them for pretraining character and word embeddings. On the other
hand, statistical segmentation research has exploited richer sources of
external information, such as punctuation, automatic segmentation and POS. We
investigate the effectiveness of a range of external training sources for
neural word segmentation by building a modular segmentation model, pretraining
the most important submodule using rich external sources. Results show that such pretraining significantly improves the model, leading to accuracies competitive with the best methods on six benchmarks.
Comment: Accepted by ACL 2017
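A hedged sketch of the pretrain-then-reuse recipe: train a shared character-context encoder on an auxiliary signal that raw text provides for free (punctuation as a proxy boundary label is one plausible choice), then reuse the encoder for segmentation. Task choice, shapes, and names are assumptions.

```python
# Two-stage sketch: a shared character-context encoder (the "most
# important submodule") is pretrained on an auxiliary signal, then
# reused for segmentation. All shapes are invented.
import torch
import torch.nn as nn

encoder = nn.GRU(64, 128, batch_first=True, bidirectional=True)

aux_head = nn.Linear(256, 2)   # stage 1: proxy boundary signal
seg_head = nn.Linear(256, 4)   # stage 2: BMES segmentation tags

def emissions(char_embs, head):
    # char_embs: (batch, seq_len, 64) pre-embedded characters.
    out, _ = encoder(char_embs)           # (batch, seq_len, 256)
    return head(out)

# Stage 1 trains encoder + aux_head on external sources; stage 2 keeps
# the pretrained encoder and trains seg_head (optionally fine-tuning).
x = torch.randn(2, 10, 64)
pre = emissions(x, aux_head)   # pretraining pass
seg = emissions(x, seg_head)   # segmentation pass
```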
82 Treebanks, 34 Models: Universal Dependency Parsing with Multi-Treebank Models
We present the Uppsala system for the CoNLL 2018 Shared Task on universal
dependency parsing. Our system is a pipeline consisting of three components:
the first performs joint word and sentence segmentation; the second predicts
part-of-speech tags and morphological features; the third predicts dependency
trees from words and tags. Instead of training a single parsing model for each
treebank, we trained models with multiple treebanks for one language or closely
related languages, greatly reducing the number of models. On the official test
run, we ranked 7th of 27 teams for the LAS and MLAS metrics. Our system
obtained the best scores overall for word segmentation, universal POS tagging,
and morphological features.
Comment: Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
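The pipeline shape, sketched loosely below with stand-in callables; threading a treebank id through the stages is one plausible way a single multi-treebank model could condition on the source treebank, and is an assumption rather than the system's documented wiring.

```python
# Minimal sketch of the three-stage pipeline described above. Each
# stage is a stand-in callable; the treebank id lets one shared,
# multi-treebank model condition on its source (an assumption).
def parse_pipeline(raw_text, segmenter, tagger, parser, treebank_id):
    sentences = segmenter(raw_text)                        # words + sentence splits
    tagged = [tagger(s, treebank_id) for s in sentences]   # UPOS + morph features
    return [parser(s, treebank_id) for s in tagged]        # dependency trees
```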
2kenize: Tying Subword Sequences for Chinese Script Conversion
Simplified Chinese to Traditional Chinese character conversion is a common
preprocessing step in Chinese NLP. Despite this, current approaches have poor
performance because they do not take into account that a simplified Chinese
character can correspond to multiple traditional characters. Here, we propose a
model that can disambiguate between mappings and convert between the two
scripts. The model is based on subword segmentation, two language models, as
well as a method for mapping between subword sequences. We further construct
benchmark datasets for topic classification and script conversion. Our proposed
method outperforms previous Chinese character conversion approaches by 6 points in accuracy. These results are further confirmed in a downstream application, where 2kenize is used to convert a pretraining dataset for topic classification. An error analysis reveals that our method's particular strengths lie in dealing with code-mixing and named entities.
Comment: Accepted to ACL 2020
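A toy illustration of the underlying disambiguation problem: one simplified character maps to several traditional characters, and a language model over the traditional side picks the best whole-sequence mapping. The mapping table and scorer below are illustrative, not 2kenize's actual subword models.

```python
# Toy sketch: enumerate all traditional-character mappings of a
# simplified string and let a language-model score pick the best one.
from itertools import product

S2T = {"发": ["發", "髮"], "头": ["頭"], "理": ["理"]}  # tiny demo table

def convert(simplified, lm_score):
    options = [S2T.get(ch, [ch]) for ch in simplified]
    candidates = ["".join(c) for c in product(*options)]
    return max(candidates, key=lm_score)

# E.g. "理发" (haircut) should pick 髮 (hair), not 發 (emit/develop);
# the stand-in scorer encodes that preference directly.
print(convert("理发", lambda s: 1.0 if "髮" in s else 0.0))  # -> 理髮
```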