The asymptotic distribution and Berry--Esseen bound of a new test for independence in high dimension with an application to stochastic optimization
Let $X_1, \ldots, X_n$ be a random sample from a $p$-dimensional
population distribution. Assume that $c_1 n^{\alpha} \le p \le c_2 n^{\alpha}$
for some positive constants $c_1$, $c_2$ and $\alpha$. In this paper we introduce
a new statistic for testing independence of the -variates of the population
and prove that the limiting distribution is the extreme distribution of type I
with a rate of convergence $O((\log n)^{5/2}/\sqrt{n})$. This is much faster
than $O(1/\log n)$, a typical convergence rate for this type of extreme
distribution. A simulation study and application to stochastic optimization are
discussed. Comment: Published at http://dx.doi.org/10.1214/08-AAP527 in the Annals of
Applied Probability (http://www.imstat.org/aap/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
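As a rough numerical illustration (a sketch only: the paper's actual test statistic is not spelled out above, so the max-correlation form below is an assumption), one can compute the maximum absolute off-diagonal sample correlation of an i.i.d. sample; under independence, suitably normalized maxima of this kind converge to the type I extreme-value (Gumbel) distribution.

```python
import numpy as np

def max_abs_correlation(X):
    """Maximum absolute off-diagonal sample correlation of an n x p sample.

    Illustrative only; the paper's actual statistic may differ in detail.
    """
    R = np.corrcoef(X, rowvar=False)   # p x p sample correlation matrix
    np.fill_diagonal(R, 0.0)           # ignore the trivial diagonal entries
    return np.abs(R).max()

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.standard_normal((n, p))        # independent coordinates
L = max_abs_correlation(X)
print(round(L, 3))                     # small when the coordinates are independent
```

Repeating this over many samples and applying the appropriate normalization would trace out the Gumbel limit studied in the paper.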
CPET: Effective Parameter-Efficient Tuning for Compressed Large Language Models
Parameter-efficient tuning (PET) has been widely explored in recent years
because it tunes much fewer parameters (PET modules) than full-parameter
fine-tuning (FT) while still stimulating sufficient knowledge from large
language models (LLMs) for downstream tasks. Moreover, when PET is employed to
serve multiple tasks, different task-specific PET modules can be built on a
frozen LLM, avoiding redundant LLM deployments. Although PET significantly
reduces the cost of tuning and deploying LLMs, its inference still suffers from
the computational bottleneck of LLMs. To address the above issue, we propose an
effective PET framework based on compressed LLMs, named "CPET". In CPET, we
evaluate the impact of mainstream LLM compression techniques on PET performance
and then introduce knowledge inheritance and recovery strategies to restore the
knowledge loss caused by these compression techniques. Our experimental results
demonstrate that, owing to the restoring strategies of CPET, collaborating
task-specific PET modules with a compressed LLM can achieve comparable
performance to collaborating PET modules with the original version of the
compressed LLM and outperform directly applying vanilla PET methods to the
compressed LLM.
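The frozen-backbone PET setup can be sketched with a LoRA-style low-rank module (a common PET method; that CPET uses exactly this module is an assumption here): the base weight never changes and only two small matrices are trained, so many task-specific modules can share one backbone.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (LoRA-style PET).

    Illustrative sketch only; CPET's exact PET modules may differ.
    """

    def __init__(self, W, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W = W                                               # frozen base weight (out x in)
        self.A = rng.standard_normal((rank, W.shape[1])) * 0.01  # trainable down-projection
        self.B = np.zeros((W.shape[0], rank))                    # trainable up-projection, zero-init

    def forward(self, x):
        # Base output plus low-rank correction; only A and B would be tuned.
        return x @ self.W.T + x @ self.A.T @ self.B.T

W = np.random.default_rng(1).standard_normal((8, 16))
layer = LoRALinear(W)
x = np.ones((2, 16))
y = layer.forward(x)
print(y.shape)   # (2, 8)
```

Because `B` is zero-initialized, the module starts out exactly equal to the frozen layer, which is the usual LoRA design choice for stable tuning.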
READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input Noises
For many real-world applications, user-generated inputs usually contain
various noises, due to speech recognition errors caused by linguistic
variation or to typographical errors (typos). Thus, it is crucial to test model
performance on data with realistic input noises to ensure robustness and
fairness. However, little work has been done to construct such benchmarks for
Chinese, where various language-specific input noises happen in the real world.
In order to fill this important gap, we construct READIN: a Chinese multi-task
benchmark with REalistic And Diverse Input Noises. READIN contains four diverse
tasks and requests annotators to re-enter the original test data with two
commonly used Chinese input methods: Pinyin input and speech input. We designed
our annotation pipeline to maximize diversity, for example by instructing the
annotators to use diverse input method editors (IMEs) for keyboard noises and
recruiting speakers from diverse dialectical groups for speech noises. We
experiment with a series of strong pretrained language models as well as robust
training methods, and find that these models often suffer significant
performance drops on READIN even with robustness methods like data
augmentation. As the first large-scale attempt in creating a benchmark with
noises geared towards user-generated inputs, we believe that READIN serves as
an important complement to existing Chinese NLP benchmarks. The source code and
dataset can be obtained from https://github.com/thunlp/READIN. Comment: Preprint
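As a toy stand-in for such noise (READIN itself collects human re-entered Pinyin and speech inputs, not rule-based corruption, so this synthetic generator is purely illustrative), one can perturb test strings at the character level:

```python
import random

def inject_typos(text, rate=0.1, seed=0):
    """Randomly swap adjacent characters to simulate keyboard typos.

    A crude synthetic stand-in; READIN uses human re-entered inputs
    rather than rule-based noise like this.
    """
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2          # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)

print(inject_typos("robustness evaluation", rate=0.3))
```

Evaluating a model on both clean and perturbed copies of a test set gives a first, much weaker approximation of the robustness gap READIN measures with realistic human noise.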
Effective Few-Shot Named Entity Linking by Meta-Learning
Entity linking aims to link ambiguous mentions to their corresponding
entities in a knowledge base, which is significant and fundamental for various
downstream applications, e.g., knowledge base completion, question answering,
and information extraction. While great efforts have been devoted to this task,
most of these studies follow the assumption that large-scale labeled data is
available. However, when the labeled data is insufficient for specific domains
due to labor-intensive annotation work, the performance of existing algorithms
will suffer an intolerable decline. In this paper, we endeavor to solve the
problem of few-shot entity linking, which only requires a minimal amount of
in-domain labeled data and is more practical in real situations. Specifically,
we first propose a novel weak supervision strategy to generate non-trivial
synthetic entity-mention pairs based on mention rewriting. Since the quality of
the synthetic data has a critical impact on effective model training, we
further design a meta-learning mechanism to assign different weights to each
synthetic entity-mention pair automatically. In this way, we can
fully exploit rich semantic information to derive a
well-trained entity linking model under the few-shot setting. The experiments
on real-world datasets show that the proposed method can substantially improve
the state-of-the-art few-shot entity linking model and achieve impressive
performance when only a small amount of labeled data is available. Moreover, we
also demonstrate the model's outstanding transferability. Comment: 14 pages, 4 figures. Accepted at IEEE ICDE 2022.
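The reweighting idea can be sketched in a learning-to-reweight style (an assumption; the paper's meta-learning mechanism is more elaborate): each synthetic entity-mention pair is weighted by how well its gradient agrees with the average gradient on a small trusted set.

```python
import numpy as np

def example_weights(X_syn, y_syn, X_val, y_val, w):
    """Weight synthetic examples by gradient agreement with a clean set.

    Learning-to-reweight-style sketch on a logistic model; the paper's
    actual meta-learning mechanism is more elaborate.
    """
    def grad(x, y):
        p = 1.0 / (1.0 + np.exp(-x @ w))   # logistic prediction
        return (p - y) * x                 # per-example gradient

    g_val = np.mean([grad(x, y) for x, y in zip(X_val, y_val)], axis=0)
    scores = np.array([max(0.0, grad(x, y) @ g_val)     # keep aligned gradients
                       for x, y in zip(X_syn, y_syn)])
    total = scores.sum()
    return scores / total if total > 0 else np.full(len(X_syn), 1.0 / len(X_syn))

rng = np.random.default_rng(0)
w = rng.standard_normal(5)
X_syn, y_syn = rng.standard_normal((6, 5)), rng.integers(0, 2, 6)
X_val, y_val = rng.standard_normal((4, 5)), rng.integers(0, 2, 4)
weights = example_weights(X_syn, y_syn, X_val, y_val, w)
print(weights.round(3))
```

Noisy synthetic pairs whose gradients conflict with the trusted set receive weight zero, which is the intuition behind down-weighting low-quality synthetic data.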
Automatic Label Sequence Generation for Prompting Sequence-to-sequence Models
Prompting, which casts downstream applications as language modeling tasks,
has been shown to be sample-efficient compared to standard fine-tuning with
pre-trained models. However, one pitfall of prompting is the need for
manually designed patterns, whose outcomes can be unintuitive and require large
validation sets to tune. To tackle the challenge, we propose AutoSeq, a fully
automatic prompting method: (1) We adopt natural language prompts on
sequence-to-sequence models, enabling free-form generation and larger label
search space; (2) We propose label sequences -- phrases with indefinite lengths
to verbalize the labels -- which eliminate the need for manual templates and are
more expressive than single label words; (3) We use beam search to
automatically generate a large amount of label sequence candidates and propose
contrastive re-ranking to get the best combinations. AutoSeq significantly
outperforms other no-manual-design methods, such as soft prompt tuning, adapter
tuning, and automatic search on single label words; the generated label
sequences are even better than curated manual ones on a variety of tasks. Our
method reveals the potential of sequence-to-sequence models in few-shot
learning and sheds light on a path to generic and automatic prompting. The
source code of this paper can be obtained from
https://github.com/thunlp/Seq2Seq-Prompt. Comment: Accepted to COLING 2022.
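The candidate-generation step can be sketched with a generic beam search over a toy additive scorer (the vocabulary and scoring function below are hypothetical; AutoSeq scores label-sequence candidates with a sequence-to-sequence model instead).

```python
def beam_search(score_token, vocab, length, beam_size):
    """Keep the beam_size highest-scoring token sequences at each step.

    Toy sketch with an additive per-token scorer; AutoSeq scores
    candidates with a seq2seq language model.
    """
    beams = [([], 0.0)]                          # (sequence, cumulative score)
    for _ in range(length):
        candidates = [(seq + [tok], s + score_token(seq, tok))
                      for seq, s in beams for tok in vocab]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]           # prune to the best beam_size
    return beams

# Hypothetical scorer: prefer longer tokens, with a mild length penalty.
vocab = ["good", "great", "terrible"]
score = lambda seq, tok: len(tok) * 0.1 - 0.05 * len(seq)
best = beam_search(score, vocab, length=2, beam_size=2)
print(best[0][0])
```

In AutoSeq the resulting candidate pool is then re-ranked contrastively; here the top beam is simply the highest-scoring sequence under the toy scorer.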
- …