Text error correction aims to correct the errors in text sequences such as
those typed by humans or generated by speech recognition models. Previous error
correction methods usually take the source (incorrect) sentence as encoder
input and generate the target (correct) sentence through the decoder. Since the
error rate of the incorrect sentence is usually low (e.g., 10\%), the
correction model learns to correct only a limited number of error tokens and
trivially copies most tokens (the correct ones), which hinders effective
training of error correction. In this paper, we argue that the correct tokens
should be better utilized to facilitate effective training and then propose a
simple yet effective masking strategy to achieve this goal. Specifically, we
randomly mask out a part of the correct tokens in the source sentence and let
the model learn to not only correct the original error tokens but also predict
the masked tokens based on their context information. Our method enjoys several
advantages: 1) it alleviates trivial copying; 2) it leverages effective training
signals from correct tokens; 3) it is a plug-and-play module and can be applied
to different models and tasks. Experiments on spelling error correction and
speech recognition error correction on Mandarin datasets and grammatical error
correction on English datasets with both autoregressive and non-autoregressive
generation models show that our method improves the correction accuracy
consistently.

Comment: main track of EMNLP 202
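
The masking strategy described above can be illustrated with a minimal sketch. It assumes the source and target sentences are token-aligned (equal length, as is typical for spelling correction), which the abstract does not state for all tasks. The function name `mask_correct_tokens`, the `[MASK]` symbol, and the default `mask_rate` are hypothetical choices for illustration, not values taken from the paper.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not specified in the abstract


def mask_correct_tokens(source, target, mask_rate=0.15, rng=None):
    """Randomly mask source tokens that already match the target.

    Error tokens (source != target) are left untouched so the model still
    learns to correct them; masked correct tokens must be predicted from
    context instead of being trivially copied.
    Assumes source and target are aligned token by token.
    """
    if len(source) != len(target):
        raise ValueError("this sketch assumes token-aligned sentences")
    if rng is None:
        rng = random.Random()

    masked = []
    for src_tok, tgt_tok in zip(source, target):
        is_correct = src_tok == tgt_tok
        if is_correct and rng.random() < mask_rate:
            masked.append(MASK)      # model must predict this token
        else:
            masked.append(src_tok)   # error token or unmasked correct token
    return masked


if __name__ == "__main__":
    src = ["I", "hav", "a", "cat"]
    tgt = ["I", "have", "a", "cat"]
    # The training pair becomes (masked source, original target).
    print(mask_correct_tokens(src, tgt, mask_rate=0.5, rng=random.Random(0)))
```

The target sentence is unchanged, so the same training loss applies to both the original error positions and the masked positions; this is what would make such a step usable as a preprocessing module in front of different correction models.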