Onto Word Segmentation of the Complete Tang Poems
We aim to segment words in the Complete Tang Poems (CTP). Although some research on CTP is possible without full-scale word segmentation, word-level analysis of CTP is necessary for advanced research topics. In November 2018, when we submitted the manuscript for DH 2019 (ADHO), we had collected only 2433 poems segmented by trained experts, and we used these segmented poems to evaluate a segmenter that considered domain knowledge of Chinese poetry.
domain knowledge of Chinese poetry. We trained pointwise mutual information
(PMI) between Chinese characters based on the CTP poems (excluding the 2433
poems, which were used exclusively only for testing) and the domain knowledge.
The segmenter relied on the PMI information to recover 85.7% of the words in the test poems. However, it segmented a poem completely correctly only 17.8% of the time. By the time we presented our work at DH 2019, we had annotated more than 20000 poems. With this much larger amount of data, we were able to apply biLSTM models to the word segmentation task, and we segmented a poem completely correctly more than 20% of the time. For comparison, human annotators completely agreed on their annotations about 40% of the time.

Comment: 5 pages, 2 tables; presented at the 2019 International Conference on Digital Humanities (ADHO)
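The PMI-based approach described above can be illustrated with a minimal sketch: estimate PMI over adjacent-character pairs in a corpus, then place a word boundary wherever the PMI of a pair falls below a threshold. This is an illustrative reconstruction, not the authors' implementation; the function names, the toy corpus, and the threshold are assumptions.

```python
import math
from collections import Counter

def char_pmi(corpus):
    """Estimate PMI between adjacent characters over a list of lines.

    Toy sketch: unigram and bigram counts come from the corpus alone,
    with no smoothing or domain knowledge.
    """
    uni, bi = Counter(), Counter()
    for line in corpus:
        uni.update(line)
        bi.update(zip(line, line[1:]))
    n_uni, n_bi = sum(uni.values()), sum(bi.values())
    pmi = {}
    for (a, b), c in bi.items():
        p_ab = c / n_bi
        p_a, p_b = uni[a] / n_uni, uni[b] / n_uni
        pmi[(a, b)] = math.log(p_ab / (p_a * p_b))
    return pmi

def segment(line, pmi, threshold=0.0):
    """Insert a boundary wherever adjacent-character PMI is below threshold."""
    words, cur = [], line[0]
    for a, b in zip(line, line[1:]):
        if pmi.get((a, b), float("-inf")) < threshold:
            words.append(cur)
            cur = b
        else:
            cur += b
    words.append(cur)
    return words
```

With a handful of lines of verse, frequently co-occurring pairs such as 明/月 receive positive PMI and stay inside one word, while unseen pairs default to a boundary.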
BERT Meets Chinese Word Segmentation
Chinese word segmentation (CWS) is a fundamental task for Chinese language
understanding. Recently, neural network-based models have attained superior
performance on the in-domain CWS task. Last year, Bidirectional Encoder Representations from Transformers (BERT), a new language representation model, was proposed as a backbone model for many natural language tasks and redefined the corresponding state of the art. The excellent performance of BERT motivates us to apply it to the CWS task. By conducting intensive experiments on the benchmark datasets from the second International Chinese Word Segmentation Bakeoff, we make several key observations. BERT can slightly improve performance even when the datasets suffer from labeling inconsistency. Given sufficiently well-learned features, softmax, a simpler classifier, can attain the same performance as a more complicated classifier, e.g., a Conditional Random Field (CRF). The performance of BERT usually increases as the model size increases. The features extracted by BERT can also serve as good inputs for other neural network models.

Comment: 13 pages; 3 figures
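The softmax-versus-CRF observation can be sketched as follows: with the common BMES tag set, a softmax classifier labels each character independently from its features (e.g., encoder outputs), with no transition scores as a CRF would use. The `decode` and `tags_to_words` helpers and the random linear layer are illustrative assumptions, not the paper's code.

```python
import numpy as np

# BMES tag set commonly used in CWS: Begin, Middle, End of a word, or Single.
TAGS = ["B", "M", "E", "S"]

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def decode(features, W, b):
    """Independent per-character softmax decoding (no CRF transitions).

    features: (seq_len, dim) character representations from some encoder.
    W, b: a linear classification layer mapping features to tag logits.
    """
    probs = softmax(features @ W + b)
    return [TAGS[i] for i in probs.argmax(axis=-1)]

def tags_to_words(chars, tags):
    """Recover a word list from characters and their BMES tags."""
    words, cur = [], ""
    for ch, t in zip(chars, tags):
        cur += ch
        if t in ("E", "S"):
            words.append(cur)
            cur = ""
    if cur:
        words.append(cur)
    return words
```

A CRF would add a learned transition matrix over tag pairs and Viterbi decoding; the abstract's point is that with strong enough features this extra machinery yields no gain.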
Fast Neural Chinese Word Segmentation for Long Sentences
Rapidly developed neural models have achieved performance competitive with their traditional counterparts in Chinese word segmentation (CWS). However, most methods are computationally inefficient, especially for long sentences, because of increasing model complexity and slow decoders. This paper presents a simple neural segmenter that directly labels whether a gap exists between adjacent characters, alleviating this drawback. Our segmenter is fully end-to-end and capable of performing segmentation very fast. We also show how performance differs across tag sets. The experiments show that our segmenter provides performance comparable with the state-of-the-art.
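The gap-labeling formulation can be illustrated as a reversible encoding: a segmentation of n characters becomes n-1 binary labels, one per gap, where 1 marks a word boundary. A model then only has to predict these independent binary labels. The `words_to_gaps`/`gaps_to_words` names are hypothetical, used here only for this sketch.

```python
def words_to_gaps(words):
    """Encode a segmentation as binary gap labels.

    Returns the flattened character string and a list of len(chars) - 1
    labels: 1 = word boundary in that gap, 0 = no boundary.
    """
    chars = "".join(words)
    boundaries, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        boundaries.add(pos)
    gaps = [1 if i + 1 in boundaries else 0 for i in range(len(chars) - 1)]
    return chars, gaps

def gaps_to_words(chars, gaps):
    """Decode predicted gap labels back into a word list."""
    words, cur = [], chars[0]
    for ch, g in zip(chars[1:], gaps):
        if g:
            words.append(cur)
            cur = ch
        else:
            cur += ch
    words.append(cur)
    return words
```

Because each gap is a single binary decision, decoding is a linear scan with no label-transition search, which is one plausible reading of why such a segmenter stays fast on long sentences.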