SDCL: Self-Distillation Contrastive Learning for Chinese Spell Checking
Because homophones make Chinese characters easy to confuse, Chinese Spell
Checking (CSC) has widespread applications. Existing systems typically use
BERT for text encoding. However, CSC requires the model to account for both
phonetic and graphemic information. To adapt BERT to the CSC task, we propose
a token-level self-distillation contrastive learning method. We employ BERT to
encode both the corrupted sentence and the corresponding correct sentence,
then use a contrastive learning loss to regularize each corrupted token's
hidden state to be closer to its counterpart in the correct sentence. On three
CSC datasets, we confirm that our method provides a significant improvement
over baselines.
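To make the idea concrete, here is a minimal sketch of such a token-level contrastive objective, assuming a PyTorch/Hugging Face setup; the InfoNCE-style loss, the temperature `tau`, and the no-gradient "teacher" view of the correct sentence are illustrative assumptions, not the authors' released code.

```python
# A minimal sketch of token-level contrastive self-distillation for CSC.
# Everything beyond the abstract's description is an assumption.
import torch
import torch.nn.functional as F
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

def token_contrastive_loss(corrupt_ids, correct_ids, attention_mask, tau=0.07):
    """Pull each corrupted token's hidden state toward its counterpart in the
    correct sentence; all other token positions serve as negatives."""
    h_bad = encoder(corrupt_ids, attention_mask=attention_mask).last_hidden_state
    with torch.no_grad():  # self-distillation: no gradient through the clean view
        h_good = encoder(correct_ids, attention_mask=attention_mask).last_hidden_state

    # CSC errors are substitutions, so both sentences share one length and
    # mask; keep only the non-padding positions.
    keep = attention_mask.bool().view(-1)
    z_bad = F.normalize(h_bad.view(-1, h_bad.size(-1))[keep], dim=-1)
    z_good = F.normalize(h_good.view(-1, h_good.size(-1))[keep], dim=-1)

    # InfoNCE: the aligned position is the positive class on each row.
    logits = (z_bad @ z_good.t()) / tau
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Toy usage: "平果" is a homophone error for "苹果" (apple).
bad = tokenizer(["他喜欢吃平果"], return_tensors="pt")
good = tokenizer(["他喜欢吃苹果"], return_tensors="pt")
loss = token_contrastive_loss(bad.input_ids, good.input_ids, bad.attention_mask)
```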
A Frustratingly Easy Plug-and-Play Detection-and-Reasoning Module for Chinese Spelling Check
In recent years, Chinese Spelling Check (CSC) has been greatly improved by
designing task-specific pre-training methods or introducing auxiliary tasks,
which mostly solve this task in an end-to-end fashion. In this paper, we
propose to decompose the CSC workflow into detection, reasoning, and searching
subtasks so that the rich external knowledge about the Chinese language can be
leveraged more directly and efficiently. Specifically, we design a
plug-and-play detection-and-reasoning module that is compatible with existing
SOTA non-autoregressive CSC models to further boost their performance. We find
that the detection-and-reasoning module trained for one model can also benefit
other models. We also examine the interpretability afforded by the task
decomposition. Extensive experiments and detailed analyses demonstrate the
effectiveness and competitiveness of the proposed module.
Comment: Accepted for publication in Findings of EMNLP 2023
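As an illustration of the detection, reasoning, and searching decomposition described above, here is a hypothetical sketch in Python; the thresholded detector, the confusion-set lookup, and all function names are assumptions made for exposition, not the paper's actual module.

```python
# A hypothetical sketch of a detection -> reasoning -> searching pipeline
# wrapped around a base CSC model that exposes per-position character
# probabilities. Names and heuristics are illustrative assumptions.
from typing import Dict, List, Set

def detect(sentence: str, probs: List[Dict[str, float]],
           threshold: float = 0.5) -> List[int]:
    """Detection: flag positions where the base CSC model assigns the
    observed character a low probability."""
    return [i for i, ch in enumerate(sentence) if probs[i].get(ch, 0.0) < threshold]

def reason(ch: str, confusion: Dict[str, Set[str]]) -> Set[str]:
    """Reasoning: narrow the candidates with external knowledge, e.g. a
    confusion set of phonetically or graphemically similar characters."""
    return confusion.get(ch, set()) | {ch}

def search(pos: int, candidates: Set[str],
           probs: List[Dict[str, float]]) -> str:
    """Searching: pick the candidate the base model scores highest here."""
    return max(candidates, key=lambda c: probs[pos].get(c, 0.0))

def correct(sentence: str, probs: List[Dict[str, float]],
            confusion: Dict[str, Set[str]]) -> str:
    """Plug-and-play wrapper: any non-autoregressive base model supplying
    `probs` can be corrected without retraining the pipeline itself."""
    chars = list(sentence)
    for i in detect(sentence, probs):
        chars[i] = search(i, reason(chars[i], confusion), probs)
    return "".join(chars)
```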
Datasets for Large Language Models: A Comprehensive Survey
This paper explores Large Language Model (LLM) datasets, which play a crucial
role in the remarkable advancement of LLMs. These datasets serve as
foundational infrastructure, analogous to a root system that sustains and
nurtures the development of LLMs, so examining them is a critical research
topic. To address the current lack of a comprehensive overview and thorough
analysis of LLM
datasets, and to gain insights into their current status and future trends,
this survey consolidates and categorizes the fundamental aspects of LLM
datasets from five perspectives: (1) Pre-training Corpora; (2) Instruction
Fine-tuning Datasets; (3) Preference Datasets; (4) Evaluation Datasets; (5)
Traditional Natural Language Processing (NLP) Datasets. The survey sheds light
on the prevailing challenges and points out potential avenues for future
investigation. Additionally, we provide a comprehensive review of existing
dataset resources, including statistics from 444 datasets,
covering 8 language categories and spanning 32 domains. Information from 20
dimensions is incorporated into the dataset statistics. The total data size
surveyed surpasses 774.5 TB for pre-training corpora and 700M instances for
other datasets. We aim to present the entire landscape of LLM text datasets,
serving as a comprehensive reference for researchers in this field and
contributing to future studies. Related resources are available at:
https://github.com/lmmlzn/Awesome-LLMs-Datasets.
Comment: 181 pages, 21 figures