Chinese Spelling Correction as Rephrasing Language Model
This paper studies Chinese Spelling Correction (CSC), which aims to detect
and correct potential spelling errors in a given sentence. Current
state-of-the-art methods regard CSC as a sequence tagging task and fine-tune
BERT-based models on sentence pairs. However, we note a critical flaw in tagging
one character to another: the correction is excessively conditioned on the
error. This runs counter to the human mindset, where individuals rephrase the
complete sentence based on its semantics rather than relying solely on
previously memorized error patterns. Such a counter-intuitive learning process
bottlenecks the generalizability and transferability of machine spelling
correction. To address this, we propose Rephrasing Language Model (ReLM), in
which the model is trained to rephrase the entire sentence by infilling
additional slots instead of performing character-to-character tagging. This
novel training paradigm achieves new
state-of-the-art results across fine-tuned and zero-shot CSC benchmarks,
outperforming previous counterparts by a large margin. Our method also learns
transferable language representations when CSC is jointly trained with other
tasks.
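To make the rephrasing-as-infilling idea concrete, here is a minimal sketch using Hugging Face transformers: the corrupted sentence is followed by one mask slot per target character, and the loss is computed only over the infilled slots. The template, helper name, and loss details are illustrative assumptions, not the paper's exact recipe:

    import torch
    from transformers import BertForMaskedLM, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertForMaskedLM.from_pretrained("bert-base-chinese")

    def build_rephrasing_example(corrupted, correct):
        # One mask slot per target character follows the corrupted sentence,
        # so the prediction is conditioned on the whole sentence rather than
        # on a single character-to-character mapping.
        src = tokenizer(corrupted, add_special_tokens=False)["input_ids"]
        tgt = tokenizer(correct, add_special_tokens=False)["input_ids"]
        input_ids = ([tokenizer.cls_token_id] + src + [tokenizer.sep_token_id]
                     + [tokenizer.mask_token_id] * len(tgt)
                     + [tokenizer.sep_token_id])
        # Supervise only the infilled slots; -100 is ignored by the MLM loss.
        labels = [-100] * (len(src) + 2) + tgt + [-100]
        return torch.tensor([input_ids]), torch.tensor([labels])

    input_ids, labels = build_rephrasing_example("他明天要去北经。", "他明天要去北京。")
    loss = model(input_ids=input_ids, labels=labels).loss  # train by minimizing this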
SDCL: Self-Distillation Contrastive Learning for Chinese Spell Checking
Due to the ambiguity of homophones, Chinese Spell Checking (CSC) has
widespread applications. Existing systems typically utilize BERT for text
encoding. However, CSC requires the model to account for both phonetic and
graphemic information. To adapt BERT to the CSC task, we propose a token-level
self-distillation contrastive learning method. We employ BERT to encode both
the corrupted sentence and its corresponding correct sentence. Then, we use a
contrastive learning loss to regularize the corrupted tokens' hidden states to
be closer to their counterparts in the correct sentence. On three CSC datasets,
we confirm that our method provides a significant improvement over the
baselines.
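As a rough illustration of the token-level objective described above, the sketch below implements an InfoNCE-style loss that pulls each corrupted token's hidden state toward the same position in the correct sentence, with a stop-gradient on the correct side mimicking a common self-distillation choice. The function name, temperature, and use of in-sentence negatives are assumptions rather than the paper's exact formulation:

    import torch
    import torch.nn.functional as F

    def token_contrastive_loss(h_corrupt, h_correct, tau=0.1):
        # h_corrupt, h_correct: (seq_len, hidden) BERT hidden states of the
        # corrupted sentence and the aligned correct sentence. Each corrupted
        # token's positive is the same position in the correct sentence; the
        # other positions act as negatives (InfoNCE-style).
        z1 = F.normalize(h_corrupt, dim=-1)
        z2 = F.normalize(h_correct, dim=-1).detach()  # stop-gradient: teacher side
        sim = (z1 @ z2.t()) / tau                     # (seq_len, seq_len) similarities
        targets = torch.arange(z1.size(0))            # positive pairs on the diagonal
        return F.cross_entropy(sim, targets)

    # e.g. h1 = bert(corrupt_ids).last_hidden_state[0], h2 likewise for the correct sentence
    loss = token_contrastive_loss(torch.randn(16, 768), torch.randn(16, 768))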
Rethinking Masked Language Modeling for Chinese Spelling Correction
In this paper, we study Chinese Spelling Correction (CSC) as a joint decision
made by two separate models: a language model and an error model. Through
empirical analysis, we find that fine-tuning BERT tends to over-fit the error
model while under-fitting the language model, resulting in poor generalization
to out-of-distribution error patterns.
out-of-distribution error patterns. Given that BERT is the backbone of most CSC
models, this phenomenon has a significant negative impact. To address this
issue, we release LEMON, a multi-domain benchmark of higher quality and
diversity than existing benchmarks, allowing a comprehensive assessment of the
open-domain generalization of CSC models. Then, we demonstrate that a very
simple strategy, randomly masking 20% of the non-error tokens in the input
sequence during fine-tuning, is sufficient for learning a much better language
model
without sacrificing the error model. This technique can be applied to any model
architecture and achieves new state-of-the-art results on SIGHAN, ECSpell, and
LEMON.
Comment: Accepted by ACL'2023.
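The masking strategy itself is simple enough to sketch in a few lines. Assuming BERT-style token ids and known error positions, a hypothetical helper might look like the following (real code would also exclude special tokens such as [CLS] and [SEP] from masking):

    import torch

    def mask_non_error_tokens(input_ids, error_positions, mask_token_id, rate=0.2):
        # Randomly replace `rate` of the *non-error* tokens with [MASK], forcing
        # the model to rely on its language model instead of memorized error
        # patterns; error tokens are left intact so the error model still learns.
        out = input_ids.clone()
        candidates = torch.ones_like(out, dtype=torch.bool)
        candidates[error_positions] = False                # never mask error tokens
        chosen = candidates & (torch.rand(out.shape) < rate)
        out[chosen] = mask_token_id
        return out

    ids = torch.tensor([101, 2769, 679, 4761, 6887, 102])  # toy BERT-style ids
    masked = mask_non_error_tokens(ids, error_positions=[3], mask_token_id=103)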
Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text
Understanding the sentiment of a comment from a video or an image is an
essential task in many applications. Sentiment analysis of a text can be useful
for various decision-making processes. One such application is to analyse the
popular sentiments of videos on social media based on viewer comments. However,
comments from social media do not follow strict rules of grammar, and they
contain mixing of more than one language, often written in non-native scripts.
The non-availability of annotated code-mixed data for a low-resource language
like Tamil adds further difficulty to this problem. To overcome this, we
created a gold
standard Tamil-English code-switched, sentiment-annotated corpus containing
15,744 comment posts from YouTube. In this paper, we describe the process of
creating the corpus and assigning polarities. We present inter-annotator
agreement and show the results of sentiment analysis trained on this corpus as
a benchmark.
How Do Multilingual Encoders Learn Cross-lingual Representation?
NLP systems typically require support for more than one language. As different languages have different amounts of supervision, cross-lingual transfer benefits languages with little to no training data by transferring from other languages. From an engineering perspective, multilingual NLP benefits development and maintenance by serving multiple languages with a single system. Both cross-lingual transfer and multilingual NLP rely on cross-lingual representations as their foundation. As BERT revolutionized representation learning and NLP, it also revolutionized cross-lingual representations and cross-lingual transfer. Multilingual BERT, trained on Wikipedia data in 104 languages, was released as a replacement for single-language BERT.
Surprisingly, without any explicit cross-lingual signal, multilingual BERT learns cross-lingual representations in addition to representations for individual languages. This thesis first demonstrates this surprising cross-lingual effectiveness against prior art on various tasks. Naturally, this raises a set of questions, most notably: how do these multilingual encoders learn cross-lingual representations? In exploring these questions, this thesis analyzes the behavior of multilingual models in a variety of settings on high- and low-resource languages. We also look at how to inject different cross-lingual signals into multilingual encoders, and at the optimization behavior of cross-lingual transfer with these models. Together, these analyses provide a better understanding of multilingual encoders for cross-lingual transfer. Our findings lead to suggested improvements to multilingual encoders and cross-lingual transfer.