A Graph-based Model for Joint Chinese Word Segmentation and Dependency Parsing
Chinese word segmentation and dependency parsing are two fundamental tasks
for Chinese natural language processing. Dependency parsing is defined at the
word level, so word segmentation is a precondition for dependency parsing.
As a result, dependency parsing suffers from error propagation and cannot
directly make use of character-level pre-trained language models
(such as BERT). In this paper, we propose a graph-based model to integrate
Chinese word segmentation and dependency parsing. Unlike previous
transition-based joint models, our proposed model is more concise and
requires less feature engineering. Our graph-based joint model
achieves better performance than previous joint models and state-of-the-art
results in both Chinese word segmentation and dependency parsing. Moreover, when
combined with BERT, our model substantially reduces the dependency parsing
performance gap between joint models and gold-segmented word-based models.
Our code is publicly available at https://github.com/fastnlp/JointCwsParser.
Comment: Accepted at Transactions of the Association for Computational
Linguistics (TACL)
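One way to picture how segmentation and dependency parsing can share a single character-level graph is to mark arcs inside a word with a dedicated label and read word boundaries off those arcs. The sketch below assumes a hypothetical intra-word label `app` and assumes that every non-final character of a word attaches to the character that follows it; neither detail is stated in the abstract, and the function name is illustrative only.

```python
from typing import List

# Hypothetical label for arcs that stay inside one word (an assumption,
# not taken from the abstract).
INTRA_WORD = "app"


def words_from_char_arcs(chars: List[str], heads: List[int], labels: List[str]) -> List[str]:
    """Recover words from character-level dependency arcs.

    heads[i] is the 0-based index of the head of character i (-1 for the root).
    A character continues the current word only if it attaches to the next
    character with the intra-word label.
    """
    words: List[str] = []
    current = ""
    for i, ch in enumerate(chars):
        current += ch
        continues = labels[i] == INTRA_WORD and heads[i] == i + 1
        if not continues:
            words.append(current)
            current = ""
    if current:  # trailing characters with no closing arc
        words.append(current)
    return words


# Toy example: three words, with "喜欢" and "北京" each joined by an intra-word arc.
example_chars = list("我喜欢北京")
example_heads = [2, 2, -1, 4, 2]
example_labels = ["nsubj", INTRA_WORD, "root", INTRA_WORD, "dobj"]
example_words = words_from_char_arcs(example_chars, example_heads, example_labels)
```

Under these assumptions, the arcs that do not carry the intra-word label connect the final characters of words, so they can be read as ordinary word-level dependencies.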
Augmenting Part-of-speech Tagging with Syntactic Information for Vietnamese and Chinese
Word segmentation and part-of-speech tagging are two critical preliminary
steps for downstream tasks in Vietnamese natural language processing. In
practice, people tend to also consider phrase boundaries when performing word
segmentation and part-of-speech tagging, rather than processing the text solely
word by word from left to right. In this paper, we implement this idea to improve
word segmentation and part-of-speech tagging for the Vietnamese language by employing a
simplified constituency parser. Our neural model for joint word segmentation
and part-of-speech tagging has the architecture of a syllable-based CRF
constituency parser. To reduce the complexity of parsing, we replace all
constituent labels with a single label that indicates a phrase. This model can
be augmented with word boundaries and part-of-speech tags predicted by other
tools. Because Vietnamese and Chinese share some similar linguistic phenomena,
we evaluated the proposed model and its augmented versions on three Vietnamese
benchmark datasets and six Chinese benchmark datasets. Our experimental results
show that the proposed model achieves higher performance than previous work
for both languages.
Comment: The comparison with existing methods in this paper is unfair because
the hyper-parameters of the Bi-LSTM differ from those used in previous
research. Importantly, there is a data leakage issue w.r.t. this paper's
experimental setu
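
The label simplification described in the second abstract (replacing all constituent labels with a single phrase label) can be illustrated with a small tree transformation. This is a minimal sketch under stated assumptions: trees are nested `(label, children)` tuples with syllables as string leaves, the placeholder label `XP` is hypothetical, and nodes that directly dominate syllables are treated as words that keep their part-of-speech label. None of these choices is confirmed by the abstract.

```python
from typing import List, Tuple, Union

# A tree is either a syllable (str) or a (label, children) pair.
Tree = Union[str, Tuple[str, list]]

# Hypothetical single label standing in for every phrase-level constituent.
PHRASE = "XP"


def collapse_labels(tree: Tree) -> Tree:
    """Replace every phrase-level label with PHRASE.

    Assumption: a node whose children are all syllables is a word node,
    and its label (a part-of-speech tag) is kept unchanged.
    """
    if isinstance(tree, str):
        return tree
    label, children = tree
    if all(isinstance(child, str) for child in children):
        return (label, list(children))
    return (PHRASE, [collapse_labels(child) for child in children])


# Toy syllable-based tree for "sinh viên học tiếng Việt".
example_tree = (
    "S",
    [
        ("NP", [("N", ["sinh", "viên"])]),
        ("VP", [("V", ["học"]), ("NP", [("N", ["tiếng", "Việt"])])]),
    ],
)
collapsed_tree = collapse_labels(example_tree)
```

With only one phrase-level label left, a parser mainly has to decide where spans begin and end, which fits the stated goal of reducing parsing complexity.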