21 research outputs found
RSpell: Retrieval-augmented Framework for Domain Adaptive Chinese Spelling Check
Chinese Spelling Check (CSC) refers to the detection and correction of
spelling errors in Chinese texts. In practical application scenarios, it is
important to make CSC models have the ability to correct errors across
different domains. In this paper, we propose a retrieval-augmented spelling
check framework called RSpell, which searches corresponding domain terms and
incorporates them into CSC models. Specifically, we employ pinyin fuzzy
matching to search for terms, which are combined with the input and fed into
the CSC model. Then, we introduce an adaptive process control mechanism to
dynamically adjust the impact of external knowledge on the model. Additionally,
we develop an iterative strategy for the RSpell framework to enhance reasoning
capabilities. We conducted experiments on CSC datasets in three domains: law,
medicine, and official document writing. The results demonstrate that RSpell
achieves state-of-the-art performance in both zero-shot and fine-tuning
scenarios, demonstrating the effectiveness of the retrieval-augmented CSC
framework. Our code is available at https://github.com/47777777/Rspell
Contextual Similarity is More Valuable than Character Similarity: Curriculum Learning for Chinese Spell Checking
Chinese Spell Checking (CSC) task aims to detect and correct Chinese spelling
errors. In recent years, related researches focus on introducing the character
similarity from confusion set to enhance the CSC models, ignoring the context
of characters that contain richer information. To make better use of contextual
similarity, we propose a simple yet effective curriculum learning framework for
the CSC task. With the help of our designed model-agnostic framework, existing
CSC models will be trained from easy to difficult as humans learn Chinese
characters and achieve further performance improvements. Extensive experiments
and detailed analyses on widely used SIGHAN datasets show that our method
outperforms previous state-of-the-art methods
SDCL: Self-Distillation Contrastive Learning for Chinese Spell Checking
Due to the ambiguity of homophones, Chinese Spell Checking (CSC) has
widespread applications. Existing systems typically utilize BERT for text
encoding. However, CSC requires the model to account for both phonetic and
graphemic information. To adapt BERT to the CSC task, we propose a token-level
self-distillation contrastive learning method. We employ BERT to encode both
the corrupted and corresponding correct sentence. Then, we use contrastive
learning loss to regularize corrupted tokens' hidden states to be closer to
counterparts in the correct sentence. On three CSC datasets, we confirmed our
method provides a significant improvement above baselines
Eval-GCSC: A New Metric for Evaluating ChatGPT's Performance in Chinese Spelling Correction
ChatGPT has demonstrated impressive performance in various downstream tasks.
However, in the Chinese Spelling Correction (CSC) task, we observe a
discrepancy: while ChatGPT performs well under human evaluation, it scores
poorly according to traditional metrics. We believe this inconsistency arises
because the traditional metrics are not well-suited for evaluating generative
models. Their overly strict length and phonics constraints may lead to
underestimating ChatGPT's correction capabilities. To better evaluate
generative models in the CSC task, this paper proposes a new evaluation metric:
Eval-GCSC. By incorporating word-level and semantic similarity judgments, it
relaxes the stringent length and phonics constraints. Experimental results show
that Eval-GCSC closely aligns with human evaluations. Under this metric,
ChatGPT's performance is comparable to traditional token-level classification
models (TCM), demonstrating its potential as a CSC tool. The source code and
scripts can be accessed at https://github.com/ktlKTL/Eval-GCSC
Integrating Dictionary and Web N-grams for Chinese Spell Checking
Abstract Chinese spell checking is an important component of many NLP applications, including word processors, search engines, and automatic essay rating. Nevertheless, compared to spell checkers for alphabetical languages (e.g., English or French), Chinese spell checkers are more difficult to develop because there are no word boundaries in the Chinese writing system and errors may be caused by various Chinese input methods. In this paper, we propose a novel method for detecting and correcting Chinese typographical errors. Our approach involves word segmentation, detection rules, and phrase-based machine translation. The error detection module detects errors by segmenting words and checking word and phrase frequency based on compiled and Web corpora. The phonological or morphological typographical errors found then are corrected by running a decoder based on the statistical machine translation model (SMT). The results show that the proposed system achieves significantly better accuracy in error detection and more satisfactory performance in error correction than the state-of-the-art systems
Disentangled Phonetic Representation for Chinese Spelling Correction
Chinese Spelling Correction (CSC) aims to detect and correct erroneous
characters in Chinese texts. Although efforts have been made to introduce
phonetic information (Hanyu Pinyin) in this task, they typically merge phonetic
representations with character representations, which tends to weaken the
representation effect of normal texts. In this work, we propose to disentangle
the two types of features to allow for direct interaction between textual and
phonetic information. To learn useful phonetic representations, we introduce a
pinyin-to-character objective to ask the model to predict the correct
characters based solely on phonetic information, where a separation mask is
imposed to disable attention from phonetic input to text. To avoid overfitting
the phonetics, we further design a self-distillation module to ensure that
semantic information plays a major role in the prediction. Extensive
experiments on three CSC benchmarks demonstrate the superiority of our method
in using phonetic information.Comment: Accepted to ACL 2023 Main Conferenc
From Spelling to Grammar: A New Framework for Chinese Grammatical Error Correction
Chinese Grammatical Error Correction (CGEC) aims to generate a correct
sentence from an erroneous sequence, where different kinds of errors are mixed.
This paper divides the CGEC task into two steps, namely spelling error
correction and grammatical error correction. Specifically, we propose a novel
zero-shot approach for spelling error correction, which is simple but
effective, obtaining a high precision to avoid error accumulation of the
pipeline structure. To handle grammatical error correction, we design
part-of-speech (POS) features and semantic class features to enhance the neural
network model, and propose an auxiliary task to predict the POS sequence of the
target sentence. Our proposed framework achieves a 42.11 F0.5 score on CGEC
dataset without using any synthetic data or data augmentation methods, which
outperforms the previous state-of-the-art by a wide margin of 1.30 points.
Moreover, our model produces meaningful POS representations that capture
different POS words and convey reasonable POS transition rules