Overview of the NLPCC 2015 Shared Task: Chinese Word Segmentation and POS Tagging for Micro-blog Texts
In this paper, we give an overview of the shared task at the 4th CCF
Conference on Natural Language Processing & Chinese Computing (NLPCC 2015):
Chinese word segmentation and part-of-speech (POS) tagging for micro-blog
texts. Unlike the widely used newswire datasets, the dataset of this
shared task consists of relatively informal micro-texts. The shared task
has two sub-tasks: (1) individual Chinese word segmentation and (2) joint
Chinese word segmentation and POS Tagging. Each subtask has three tracks to
distinguish the systems with different resources. We first introduce the
dataset and task, then we characterize the different approaches of the
participating systems, report the test results, and provide an overview analysis
of these results. An online system is available for open registration and
evaluation at http://nlp.fudan.edu.cn/nlpcc2015
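Segmentation shared tasks of this kind are conventionally scored with word-level precision, recall and F1. The abstract does not name its metric, so the Python sketch below is an assumption about the usual scoring scheme rather than the task's official scorer, and the function names are hypothetical.

```python
def word_spans(words):
    """Map a segmented sentence to the set of (start, end) character spans of its words."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def segmentation_prf(gold_words, pred_words):
    """Word-level precision, recall and F1.

    A predicted word counts as correct only if both of its boundaries match the gold standard.
    """
    gold, pred = word_spans(gold_words), word_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Toy example: the prediction wrongly splits the last word in two.
gold = ["我", "爱", "北京"]
pred = ["我", "爱", "北", "京"]
print(segmentation_prf(gold, pred))
```

Comparing character spans rather than word strings keeps repeated words at different positions from being confused with one another.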
A Seq-to-Seq Transformer Premised Temporal Convolutional Network for Chinese Word Segmentation
The prevalent approaches to the Chinese word segmentation task almost all rely
on the Bi-LSTM neural network. However, methods based on the Bi-LSTM have some
inherent drawbacks: they are hard to parallelize, Dropout is not very effective
at inhibiting overfitting in them, and they are poor at capturing character
information from distant positions in a long sentence. In this work, we propose
a sequence-to-sequence transformer model for Chinese word segmentation that is
built on a type of convolutional neural network named the temporal
convolutional network. The model uses the temporal convolutional network to
construct an encoder and one fully-connected layer to build a decoder. It
applies Dropout to inhibit overfitting, captures character information at
distant positions in a sentence by adding encoder layers, trains its parameters
jointly with a Conditional Random Fields model, and uses the Viterbi algorithm
to infer the final segmentation. Experiments on traditional Chinese and
simplified Chinese corpora show that the model's segmentation performance is
equivalent to that of Bi-LSTM-based methods, and that the model parallelizes
far better than Bi-LSTM-based models.
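The final decoding step named in this abstract can be illustrated with a small Viterbi decoder. This is a simplified sketch, not the authors' implementation: it assumes the common B/M/E/S tagging scheme, takes per-character tag scores as given (in the paper they would come from the encoder and decoder), and replaces learned CRF transition scores with hard allowed/forbidden transitions.

```python
TAGS = ["B", "M", "E", "S"]

# Tag transitions permitted under the B/M/E/S scheme (an assumption of this sketch).
ALLOWED = {
    "B": {"M", "E"},
    "M": {"M", "E"},
    "E": {"B", "S"},
    "S": {"B", "S"},
}

NEG_INF = float("-inf")


def viterbi(emissions):
    """Return the best valid tag sequence.

    emissions: one dict per character mapping each tag to a log-space score.
    """
    if not emissions:
        return []
    # A sentence can only start with B or S.
    score = {t: (emissions[0][t] if t in ("B", "S") else NEG_INF) for t in TAGS}
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = {}, {}
        for t in TAGS:
            best_prev, best = None, NEG_INF
            for p in TAGS:
                if t in ALLOWED[p] and score[p] > best:
                    best_prev, best = p, score[p]
            new_score[t] = best + emit[t] if best_prev is not None else NEG_INF
            pointers[t] = best_prev
        score = new_score
        backpointers.append(pointers)
    # A sentence can only end with E or S.
    last = max(("E", "S"), key=lambda t: score[t])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Picking the top tag per character would give the invalid sequence B, B, S.
emissions = [
    {"B": 0.0, "M": -5.0, "E": -5.0, "S": -1.0},
    {"B": 0.0, "M": -5.0, "E": -1.0, "S": -2.0},
    {"B": -5.0, "M": -5.0, "E": -5.0, "S": 0.0},
]
print(viterbi(emissions))
```

The toy scores are chosen so that per-character greedy tagging would produce an invalid sequence, which is the case where sequence-level decoding matters.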
A Graph-based Model for Joint Chinese Word Segmentation and Dependency Parsing
Chinese word segmentation and dependency parsing are two fundamental tasks
for Chinese natural language processing. Dependency parsing is defined at the
word level. Word segmentation is therefore a precondition for dependency
parsing, which makes dependency parsing suffer from error propagation and
leaves it unable to make direct use of character-level pre-trained language models
(such as BERT). In this paper, we propose a graph-based model to integrate
Chinese word segmentation and dependency parsing. Unlike previous
transition-based joint models, our proposed model is more concise, which
results in less feature-engineering effort. Our graph-based joint model
achieves better performance than previous joint models and state-of-the-art
results in both Chinese word segmentation and dependency parsing. Besides, when
BERT is combined, our model can substantially reduce the performance gap of
dependency parsing between joint models and gold-segmented word-based models.
Our code is publicly available at https://github.com/fastnlp/JointCwsParser.
Comment: Accepted at Transactions of the Association for Computational Linguistics (TACL)
Word Segmentation as Graph Partition
We propose a new approach to the Chinese word segmentation problem that
considers the sentence as an undirected graph, whose nodes are the characters.
One can use various techniques to compute the edge weights that measure the
connection strength between characters. Spectral graph partition algorithms are
used to group the characters and achieve word segmentation. We follow the graph
partition approach and design several unsupervised algorithms, and we show
their encouraging segmentation results on two corpora: (1) electronic health
records in Chinese, and (2) benchmark data from the Second International
Chinese Word Segmentation Bakeoff.
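A minimal Python sketch of the spectral idea follows, under stated assumptions: the edge weights are set by hand (the abstract leaves their computation open), the graph is connected, and a single bisection by the sign of the Fiedler vector stands in for the paper's algorithms, which the abstract does not detail. The function names are hypothetical, and the eigenvector is approximated by power iteration to keep the example dependency-free.

```python
import math


def fiedler_vector(weights, iters=500):
    """Approximate the Laplacian eigenvector of the second-smallest eigenvalue."""
    n = len(weights)
    degree = [sum(row) for row in weights]
    shift = 2 * max(degree)  # upper bound on the largest Laplacian eigenvalue
    # Deterministic start vector, orthogonal to the constant vector.
    v = [i - (n - 1) / 2 for i in range(n)]
    for _ in range(iters):
        # Multiply by (shift * I - L), where L = D - W is the graph Laplacian.
        w = [
            (shift - degree[i]) * v[i] + sum(weights[i][j] * v[j] for j in range(n))
            for i in range(n)
        ]
        mean = sum(w) / n
        w = [x - mean for x in w]  # project out the constant eigenvector
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def bisect_sentence(chars, weights):
    """Split the character sequence wherever neighbours fall on different sides."""
    v = fiedler_vector(weights)
    side = [x >= 0 for x in v]
    segments, current = [], chars[0]
    for i in range(1, len(chars)):
        if side[i] != side[i - 1]:
            segments.append(current)
            current = ""
        current += chars[i]
    segments.append(current)
    return segments


# Hand-set symmetric weights: strong links inside each word, a weak link between them.
W = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
print(bisect_sentence("我们学习", W))
```

One bisection yields only two groups; applying it recursively to each part would be one way to obtain finer segmentations.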
Adversarial Multi-Criteria Learning for Chinese Word Segmentation
Different linguistic perspectives give rise to many diverse segmentation criteria
for Chinese word segmentation (CWS). Most existing methods focus on improving the
performance for each single criterion. However, it is interesting to exploit
these different criteria and mine their common underlying knowledge. In this
paper, we propose adversarial multi-criteria learning for CWS by integrating
shared knowledge from multiple heterogeneous segmentation criteria. Experiments
on eight corpora with heterogeneous segmentation criteria show that the
performance on each corpus improves significantly compared to
single-criterion learning. The source code for this paper is available on GitHub.
Subword Encoding in Lattice LSTM for Chinese Word Segmentation
We investigate a lattice LSTM network for Chinese word segmentation (CWS) to
utilize words or subwords. It integrates character sequence features with
information from all subsequences matched against a lexicon. The matched subsequences
serve as information shortcut tunnels which link their start and end characters
directly. Gated units are used to control the contribution of multiple input
links. Through formula derivation and comparison, we show that the lattice LSTM
is an extension of the standard LSTM with the ability to take multiple inputs.
Previous lattice LSTM models take word embeddings as the lexicon input; we
show that subword encoding can give comparable performance and has the
benefit of not relying on any external segmentor. The contribution of lattice
LSTM comes from both lexicon and pretrained embedding information; through
controlled experiments, we find that the lexicon information contributes more
than the pretrained embedding information. Our experiments show that the
lattice structure with subword encoding gives results competitive with or
better than previous state-of-the-art methods on four segmentation benchmarks.
Detailed analyses are conducted to compare the performance of word encoding and
subword encoding in lattice LSTM. We also investigate the performance of
lattice LSTM structure under different circumstances and when this model works
or fails.
Comment: 8 pages
Attention Is All You Need for Chinese Word Segmentation
Taking the greedy decoding algorithm as a given, this work focuses on
further strengthening the model itself for Chinese word segmentation (CWS),
which results in an even faster and more accurate CWS model. Our model
consists of an attention-only stacked encoder and a sufficiently light decoder
for greedy segmentation, plus two highway connections for smoother training, in
which the encoder is composed of a newly proposed Transformer variant,
Gaussian-masked Directional (GD) Transformer, and a biaffine attention scorer.
With the effective encoder design, our model only needs to take unigram
features for scoring. Our model is evaluated on SIGHAN Bakeoff benchmark
datasets. The experimental results show that with the highest segmentation
speed, the proposed model achieves new state-of-the-art or comparable
performance against strong baselines under the strict closed test setting.
Comment: 11 pages, to appear in EMNLP 2020 as a long paper
A realistic and robust model for Chinese word segmentation
A realistic Chinese word segmentation tool must adapt to textual variations
with minimal training input and yet be robust enough to yield reliable
segmentation results for all variants. Various lexicon-driven approaches to
Chinese segmentation, e.g. [1,16], achieve high f-scores yet require massive
training for any variation. Text-driven approaches, e.g. [12], can be easily
adapted for domain and genre changes yet have difficulty matching the high
f-scores of the lexicon-driven approaches. In this paper, we refine and
implement an innovative text-driven word boundary decision (WBD) segmentation
model proposed in [15]. The WBD model treats word segmentation simply and
efficiently as a binary decision on whether to realize the natural textual
break between two adjacent characters as a word boundary. The WBD model allows
simple and quick training data preparation by converting characters into
contextual vectors for learning the word boundary decision. Machine learning
experiments with four different classifiers show that training with 1,000
vectors and training with 1 million vectors achieve comparable and reliable
results. In addition, when
applied to SigHAN Bakeoff 3 competition data, the WBD model produces OOV recall
rates that are higher than all published results. Unlike all previous work, our
OOV recall rate is comparable to our own F-score. Both experiments support the
claim that the WBD model is a realistic model for Chinese word segmentation as
it can be easily adapted for new variants with robust results. In
conclusion, we will discuss linguistic ramifications as well as future
implications for the WBD approach.
Comment: Proceedings of the 20th Conference on Computational Linguistics and Speech Processing
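The binary-decision framing in this abstract can be sketched as a data-preparation step: every gap between two adjacent characters becomes one training example labeled 1 (word boundary) or 0 (no boundary). The abstract does not specify how the contextual vectors are built, so the character-window context below and the name `boundary_examples` are hypothetical illustrations in Python.

```python
def boundary_examples(words, window=2):
    """Turn a segmented sentence into (context, label) pairs, one per character gap.

    context: (characters to the left of the gap, characters to the right), up to `window` each.
    label:   1 if the gap is a word boundary in the gold segmentation, else 0.
    """
    chars = "".join(words)
    # Character offsets at which a gold word boundary occurs.
    boundaries, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        boundaries.add(pos)
    examples = []
    for i in range(1, len(chars)):
        left = chars[max(0, i - window):i]
        right = chars[i:i + window]
        examples.append(((left, right), 1 if i in boundaries else 0))
    return examples


for context, label in boundary_examples(["我", "爱", "北京"]):
    print(context, label)
```

Any off-the-shelf binary classifier could then be trained on such pairs once the contexts are encoded as numeric vectors.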
BERT Meets Chinese Word Segmentation
Chinese word segmentation (CWS) is a fundamental task for Chinese language
understanding. Recently, neural network-based models have attained superior
performance in solving the in-domain CWS task. Last year, Bidirectional Encoder
Representations from Transformers (BERT), a new language representation model,
was proposed as a backbone model for many natural language tasks and
redefined the corresponding performance. The excellent performance of BERT
motivates us to apply it to solve the CWS task. By conducting intensive
experiments on the benchmark datasets from the second International Chinese
Word Segmentation Bake-off, we obtain several keen observations. BERT can
slightly improve the performance even when the datasets suffer from
labeling inconsistency. When applying sufficiently learned features, Softmax, a
simpler classifier, can attain the same performance as that of a more
complicated classifier, e.g., Conditional Random Field (CRF). The performance
of BERT usually increases as the model size increases. The features extracted
by BERT can also be applied as good candidates for other neural network models.
Comment: 13 pages; 3 figures
Convolutional Neural Network with Word Embeddings for Chinese Word Segmentation
The character-based sequence labeling framework is flexible and efficient for
Chinese word segmentation (CWS). Recently, many character-based neural models
have been applied to CWS. While they obtain good performance, they have two
obvious weaknesses. The first is that they rely heavily on manually designed
bigram features, i.e. they are not good at capturing n-gram features
automatically. The second is that they make no use of full word information.
For the first weakness, we propose a convolutional neural model, which is able
to capture rich n-gram features without any feature engineering. For the second
one, we propose an effective approach to integrate the proposed model with word
embeddings. We evaluate the model on two benchmark datasets: PKU and MSR.
Without any feature engineering, the model obtains competitive performance --
95.7% on PKU and 97.3% on MSR. Armed with word embeddings, the model achieves
state-of-the-art performance on both datasets -- 96.5% on PKU and 98.0% on MSR,
without using any external labeled resource.
Comment: will be published by IJCNLP201
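In the character-based sequence labeling framework this abstract refers to, each character receives a tag marking its position within a word. A common choice, assumed here because the abstract does not name its tag set, is B (begin), M (middle), E (end) and S (single-character word). The helper name `words_to_bmes` in this Python sketch is hypothetical.

```python
def words_to_bmes(words):
    """Convert a segmented sentence into one B/M/E/S tag per character."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


print(words_to_bmes(["我", "爱", "北京大学"]))
```

A labeling model is then trained to predict these tags from the raw character sequence, and the word boundaries are read back off the predicted tags.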