Unified Multi-Criteria Chinese Word Segmentation with BERT
Multi-Criteria Chinese Word Segmentation (MCCWS) aims to find word
boundaries in a Chinese sentence composed of continuous characters when
multiple segmentation criteria exist. The unified framework has been widely
used in MCCWS and has shown its effectiveness. In addition, the pre-trained
BERT language model has also been introduced into the MCCWS task in a
multi-task learning framework. In this paper, we combine the strengths of the
unified framework and the pre-trained language model, and propose a unified
MCCWS model based on BERT. Moreover, we augment the unified BERT-based MCCWS
model with bigram features and an auxiliary criterion classification task.
Experiments on eight datasets with diverse criteria demonstrate that our
method achieves new state-of-the-art results for MCCWS.
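A common way to realize such a unified model is to mark the desired criterion in the input and decode per-character tags into words. The sketch below illustrates that idea in plain Python; the criterion-token convention, the tag set (BMES), and the example tagging are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch: unified multi-criteria CWS input construction and
# BMES-tag decoding. The criterion-token format and tag names here
# are assumptions for illustration.

def build_input(chars, criterion):
    """Prepend a criterion token so one unified model can serve
    multiple segmentation standards (e.g. PKU vs. MSR)."""
    return ["[CLS]", f"<{criterion}>"] + list(chars) + ["[SEP]"]

def decode_bmes(chars, tags):
    """Turn per-character BMES tags (Begin/Middle/End/Single)
    into segmented words."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a truncated final word
        words.append(current)
    return words

sentence = "他来到北京"
print(build_input(sentence, "pku"))
# One plausible tagging under one criterion:
print(decode_bmes(sentence, ["S", "B", "E", "B", "E"]))
# → ['他', '来到', '北京']
```

At inference time, changing only the criterion token would let the same model emit segmentations that follow different annotation standards.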
Pre-training with Meta Learning for Chinese Word Segmentation
Recent research shows that pre-trained models (PTMs) are beneficial to
Chinese Word Segmentation (CWS). However, the PTMs used in previous works
usually adopt language modeling as the pre-training task, lacking
task-specific prior segmentation knowledge and ignoring the discrepancy
between the pre-training task and downstream CWS tasks. In this paper, we
propose a CWS-specific pre-trained model, METASEG, which employs a unified
architecture and incorporates a meta-learning algorithm into a
multi-criteria pre-training task. Empirical results show that METASEG can
utilize common prior segmentation knowledge from different existing criteria
and alleviate the discrepancy between pre-trained models and downstream CWS
tasks. Moreover, METASEG achieves new state-of-the-art performance on twelve
widely used CWS datasets and significantly improves model performance in
low-resource settings.
Comment: Accepted by NAACL 202
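The abstract does not name the meta-learning algorithm, so the toy sketch below uses a Reptile-style update over several criteria treated as tasks, with a one-dimensional quadratic loss standing in for each criterion's training objective. The algorithm choice, the loss, and all constants are illustrative assumptions.

```python
# Hedged sketch: a Reptile-style meta update across segmentation
# criteria treated as tasks. Everything numeric here is a toy stand-in.

def inner_sgd(theta, target, lr=0.1, steps=5):
    """A few SGD steps on the toy loss (theta - target)^2,
    standing in for adapting to one criterion's data."""
    for _ in range(steps):
        grad = 2.0 * (theta - target)
        theta -= lr * grad
    return theta

def reptile_meta_step(theta, task_targets, meta_lr=0.5):
    """Move the shared initialization toward each task's adapted
    parameters, accumulating knowledge common to all criteria."""
    for target in task_targets:
        adapted = inner_sgd(theta, target)
        theta += meta_lr * (adapted - theta)
    return theta

theta = 0.0
criteria_targets = [1.0, 2.0, 3.0]  # stand-ins for per-criterion optima
for _ in range(20):
    theta = reptile_meta_step(theta, criteria_targets)
print(round(theta, 2))  # settles between the task optima
```

The intent mirrors the abstract's claim: the shared initialization ends up close to all criteria at once, so it encodes segmentation knowledge common to the different standards rather than overfitting to any single one.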