Rethinking Masked Language Modeling for Chinese Spelling Correction
In this paper, we study Chinese Spelling Correction (CSC) as a joint decision
made by two separate models: a language model and an error model. Through
empirical analysis, we find that fine-tuning BERT tends to over-fit the error
model while under-fitting the language model, resulting in poor generalization to
out-of-distribution error patterns. Given that BERT is the backbone of most CSC
models, this phenomenon has a significant negative impact. To address this
issue, we are releasing a multi-domain benchmark LEMON, with higher quality and
diversity than existing benchmarks, to allow a comprehensive assessment of the
open-domain generalization of CSC models. Then, we demonstrate that a very
simple strategy, randomly masking 20% of the non-error tokens in the input
sequence during fine-tuning, is sufficient for learning a much better language
model without sacrificing the error model. This technique can be applied to any
model architecture and achieves new state-of-the-art results on SIGHAN,
ECSpell, and LEMON.
Comment: Accepted by ACL'202
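The masking strategy described in this abstract can be illustrated with a minimal sketch. It assumes the source and target sentences are aligned token by token, which the abstract does not state explicitly, and the function name, the `[MASK]` string, and the `rng` parameter are illustrative choices rather than the authors' code.

```python
import random

MASK = "[MASK]"  # assumed mask symbol; a real tokenizer defines its own


def mask_non_error_tokens(source, target, mask_rate=0.2, rng=None):
    """Return a copy of `source` with a fraction of non-error tokens masked.

    A position counts as an error when source and target differ there.
    Error positions are always left untouched.
    """
    if len(source) != len(target):
        raise ValueError("source and target must be aligned token by token")
    if rng is None:
        rng = random.Random()
    masked = list(source)
    for i, (src_tok, tgt_tok) in enumerate(zip(source, target)):
        # Only correct (non-error) tokens are candidates for masking.
        if src_tok == tgt_tok and rng.random() < mask_rate:
            masked[i] = MASK
    return masked


if __name__ == "__main__":
    src = list("我喜欢吃平果")  # contains one spelling error
    tgt = list("我喜欢吃苹果")
    print(mask_non_error_tokens(src, tgt, rng=random.Random(0)))
```

Each non-error token is masked independently with probability `mask_rate`, so the realized fraction varies from sentence to sentence; sampling an exact 20% of positions would be an equally reasonable reading of the abstract.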
Contextual Similarity is More Valuable than Character Similarity: Curriculum Learning for Chinese Spell Checking
The Chinese Spell Checking (CSC) task aims to detect and correct Chinese
spelling errors. In recent years, related research has focused on introducing
character similarity from confusion sets to enhance CSC models, ignoring the context
of characters that contain richer information. To make better use of contextual
similarity, we propose a simple yet effective curriculum learning framework for
the CSC task. With the help of our model-agnostic framework, existing
CSC models are trained from easy to difficult, as humans learn Chinese
characters, and achieve further performance improvements. Extensive experiments
and detailed analyses on the widely used SIGHAN datasets show that our method
outperforms previous state-of-the-art methods.
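The easy-to-difficult ordering can be sketched generically as follows. The abstract does not say how difficulty is measured from contextual similarity, so `difficulty` is a caller-supplied placeholder, and the staged cumulative pools are an assumed scheduling scheme rather than the paper's framework.

```python
def curriculum_stages(samples, difficulty, num_stages=3):
    """Yield cumulative training pools ordered from easy to difficult.

    `difficulty` is a hypothetical scoring function (lower means easier).
    Each stage adds the next-hardest slice to the pool seen so far.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ordered = sorted(samples, key=difficulty)
    for stage in range(1, num_stages + 1):
        # Integer arithmetic guarantees the last stage is the full set.
        cutoff = len(ordered) * stage // num_stages
        yield ordered[:cutoff]


if __name__ == "__main__":
    # Toy data: sentence length stands in for a real difficulty score.
    data = ["aaaa", "a", "aaa", "aa", "aaaaa", "aaaaaa"]
    for pool in curriculum_stages(data, difficulty=len):
        print(len(pool), pool)
```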
An Empirical Investigation of Domain Adaptation Ability for Chinese Spelling Check Models
Chinese Spelling Check (CSC) is a meaningful task in the area of Natural
Language Processing (NLP) which aims at detecting spelling errors in Chinese
texts and then correcting these errors. However, CSC models are based on
pretrained language models, which are trained on a general corpus.
Consequently, their performance may drop when confronted with downstream tasks
involving domain-specific terms. In this paper, we conduct a thorough
evaluation of the domain adaptation ability of various typical CSC models by
building three new datasets encompassing rich domain-specific terms from the
financial, medical, and legal domains. Then we conduct empirical investigations
in the corresponding domain-specific test datasets to ascertain the
cross-domain adaptation ability of several typical CSC models. We also test the
performance of the popular large language model ChatGPT. As shown in our
experiments, the performance of the CSC models drops significantly in the new
domains.
Comment: ICASSP202
Chinese Spelling Correction as Rephrasing Language Model
This paper studies Chinese Spelling Correction (CSC), which aims to detect
and correct potential spelling errors in a given sentence. Current
state-of-the-art methods regard CSC as a sequence tagging task and fine-tune
BERT-based models on sentence pairs. However, we note a critical flaw in the
process of tagging one character to another: the correction is excessively
conditioned on the error. This is the opposite of the human mindset, where
individuals rephrase the complete sentence based on its semantics, rather than
solely on the error patterns memorized before. Such a counter-intuitive
learning process creates a bottleneck in the generalizability and
transferability of machine spelling correction. To address this, we propose the
Rephrasing Language Model (ReLM), where the model is trained to rephrase
the entire sentence by infilling additional slots, instead of
character-to-character tagging. This novel training paradigm achieves the new
state-of-the-art results across fine-tuned and zero-shot CSC benchmarks,
outperforming previous counterparts by a large margin. Our method also learns
transferable language representation when CSC is jointly trained with other
tasks.
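One way to picture "rephrasing by infilling additional slots" is to append one mask slot per target character after the source sentence and supervise only those slots. The sketch below builds such a training example. The exact layout, the special-token strings, and the function name are assumptions for illustration and are not taken from the abstract.

```python
CLS, SEP, MASK = "[CLS]", "[SEP]", "[MASK]"  # assumed special tokens


def build_rephrasing_example(source, target):
    """Build (input_tokens, labels) for slot-infilling style training.

    The input is the source sentence followed by one mask slot per target
    character. Labels are None wherever no prediction is required, so the
    loss would cover only the appended slots.
    """
    source, target = list(source), list(target)
    input_tokens = [CLS] + source + [SEP] + [MASK] * len(target) + [SEP]
    labels = [None] * (len(source) + 2) + target + [None]
    return input_tokens, labels


if __name__ == "__main__":
    tokens, labels = build_rephrasing_example("平果", "苹果")
    print(tokens)
    print(labels)
```

In contrast with character-to-character tagging, the model here never predicts at the source positions; it regenerates the whole sentence in the slots while attending to the source as context.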
BSpell: A CNN-Blended BERT Based Bangla Spell Checker
Bangla typing is mostly performed using an English keyboard and can be highly
erroneous due to the presence of compound and similarly pronounced letters.
Spelling correction of a misspelled word requires understanding of the word
typing pattern as well as the context of the word usage. A specialized BERT
model named BSpell is proposed in this paper, targeted at word-for-word
correction at the sentence level. BSpell contains an end-to-end trainable CNN
sub-model named SemanticNet along with a specialized auxiliary loss. This allows
BSpell to specialize in highly inflected Bangla vocabulary in the presence of
spelling errors. Furthermore, a hybrid pretraining scheme has been proposed for
BSpell that combines word-level and character-level masking. Comparison on two
Bangla and one Hindi spelling correction datasets shows the superiority of our
proposed approach. BSpell is available as a Bangla spell checking tool via
GitHub: https://github.com/Hasiburshanto/Bangla-Spell-Checke
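A hybrid of word-level and character-level masking could look like the sketch below. The abstract gives no masking rates or mask symbols, so `word_rate`, `char_rate`, `mask_word`, and `mask_char` are all assumed values, and the function is a generic illustration rather than BSpell's pretraining code.

```python
import random


def hybrid_mask(words, word_rate=0.15, char_rate=0.15,
                mask_word="[MASK]", mask_char="#", rng=None):
    """Apply word-level masking, and character-level masking otherwise.

    Each word is replaced entirely with probability `word_rate`. Words that
    survive have each character replaced with probability `char_rate`.
    """
    if rng is None:
        rng = random.Random()
    masked = []
    for word in words:
        if rng.random() < word_rate:
            masked.append(mask_word)  # word-level mask
        else:
            # Character-level mask applied independently per character.
            masked.append("".join(
                mask_char if rng.random() < char_rate else ch for ch in word
            ))
    return masked


if __name__ == "__main__":
    print(hybrid_mask(["spell", "checking", "example"], rng=random.Random(1)))
```

Note that iterating over a Python string visits Unicode code points, so for Bangla text a combining vowel sign may be masked separately from its base consonant; a real pipeline might segment into grapheme clusters first.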
An Adversarial Multi-Task Learning Method for Chinese Text Correction with Semantic Detection
Text correction, especially semantic correction in widely used scenarios, is
in strong demand for improving the fluency and writing efficiency of text. An
adversarial multi-task learning method is proposed to enhance
the modeling and detection ability of character polysemy in Chinese sentence
context. Two models, the masked language model and the scoring language
model, are introduced as a pair of coupled yet adversarial
learning tasks. Moreover, the Monte Carlo tree search strategy and a policy
network are introduced to accomplish the efficient Chinese text correction task
with semantic detection. Experiments are conducted on three datasets with
five comparable methods, and the results show that our method achieves good
performance on the Chinese text correction task, with better semantic
rationality.
Comment: Published on 31st International Conference on Artificial Neural Networ
A Frustratingly Easy Plug-and-Play Detection-and-Reasoning Module for Chinese Spelling Check
In recent years, Chinese Spelling Check (CSC) has been greatly improved by
designing task-specific pre-training methods or introducing auxiliary tasks,
which mostly solve this task in an end-to-end fashion. In this paper, we
propose to decompose the CSC workflow into detection, reasoning, and searching
subtasks so that the rich external knowledge about the Chinese language can be
leveraged more directly and efficiently. Specifically, we design a
plug-and-play detection-and-reasoning module that is compatible with existing
SOTA non-autoregressive CSC models to further boost their performance. We find
that the detection-and-reasoning module trained for one model can also benefit
other models. We also study the primary interpretability provided by the task
decomposition. Extensive experiments and detailed analyses demonstrate the
effectiveness and competitiveness of the proposed module.
Comment: Accepted for publication in Findings of EMNLP 202
CSCD-IME: Correcting Spelling Errors Generated by Pinyin IME
Chinese Spelling Correction (CSC) is a task to detect and correct spelling
mistakes in texts. In fact, most Chinese input is based on the pinyin input
method, so the study of spelling errors in this process is more practical and
valuable. However, there is still no research dedicated to this essential
scenario. In this paper, we first present a Chinese Spelling Correction Dataset
for errors generated by pinyin IME (CSCD-IME), including 40,000 annotated
sentences from real posts of official media on Sina Weibo. Furthermore, we
propose a novel method to automatically construct large-scale and high-quality
pseudo data by simulating the input through pinyin IME. A series of analyses
and experiments on CSCD-IME show that spelling errors produced by pinyin IME
hold a particular distribution at the pinyin level and semantic level and are
challenging enough. Meanwhile, our proposed pseudo-data construction method can
better fit this error distribution and improve the performance of CSC systems.
Finally, we provide a useful guide to using pseudo data, including the data
scale, the data source, and the training strategy.
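The idea of building pseudo data by simulating pinyin IME input can be sketched with a toy candidate table: a word is swapped for a different word that shares its pinyin, as if the wrong IME candidate had been selected. The table `IME_CANDIDATES`, the function name, and the `error_rate` parameter are hypothetical, and a real pipeline would need a full pinyin lexicon and a model of IME candidate ranking, which the abstract does not detail.

```python
import random

# Hypothetical toy table: toneless pinyin -> words an IME might offer.
IME_CANDIDATES = {
    "pingguo": ["苹果", "平果"],
    "yijian": ["意见", "一件", "已见"],
}
WORD_TO_PINYIN = {
    word: pinyin
    for pinyin, words in IME_CANDIDATES.items()
    for word in words
}


def simulate_ime_errors(words, error_rate=0.1, rng=None):
    """Return (noisy_words, clean_words) as a pseudo training pair.

    With probability `error_rate`, a word is replaced by a different
    candidate that shares its pinyin. Words missing from the table are
    copied unchanged.
    """
    if rng is None:
        rng = random.Random()
    noisy = []
    for word in words:
        pinyin = WORD_TO_PINYIN.get(word)
        alternatives = [c for c in IME_CANDIDATES.get(pinyin, []) if c != word]
        if alternatives and rng.random() < error_rate:
            noisy.append(rng.choice(alternatives))
        else:
            noisy.append(word)
    return noisy, list(words)


if __name__ == "__main__":
    print(simulate_ime_errors(["我", "吃", "苹果"], error_rate=1.0))
```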