K-Theory Of Root Stacks And Its Application To Equivariant K-Theory
We give a definition of a root stack and describe its most basic properties. We then recall the necessary background (Abhyankar's lemma, the Chevalley-Shephard-Todd theorem, Luna's étale slice theorem) and prove that, under certain conditions, a quotient stack is a root stack. We then compute the G-theory and K-theory of a root stack. These results are used to formulate a theorem on the equivariant algebraic K-theory of schemes.
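For orientation, a standard formulation of the r-th root stack found in the literature (following Cadman's construction; not necessarily the exact definition adopted in the paper) writes it as a fibre product over the quotient stack [A^1/G_m]:

\[
  \sqrt[r]{(L,s)/X} \;:=\; X \times_{[\mathbb{A}^1/\mathbb{G}_m]} [\mathbb{A}^1/\mathbb{G}_m],
\]

where the map $X \to [\mathbb{A}^1/\mathbb{G}_m]$ classifies the pair $(L,s)$ of a line bundle with section, and the map $[\mathbb{A}^1/\mathbb{G}_m] \to [\mathbb{A}^1/\mathbb{G}_m]$ is induced by raising both the coordinate and the $\mathbb{G}_m$-action to the $r$-th power.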
Hyperparameter Optimization for Large Language Model Instruction-Tuning
The fine-tuning of Large Language Models (LLMs) has recently enabled them to achieve milestones in natural language processing applications. The emergence of ever larger LLMs has paved the way for more efficient fine-tuning methods. Among these, the Low-Rank Adaptation (LoRA) method keeps most of the weights of the pre-trained LLM frozen while introducing a low-rank decomposition of the weight matrix, enabling the tuning of only a very small proportion of the network. The performance on downstream tasks of models fine-tuned with LoRA heavily relies on a set of hyperparameters, including the rank of the decomposition. In this work, we investigate the choice of these hyperparameters through two main blackbox optimization (BBO) techniques. We treat the whole pipeline of fine-tuning and validating a pre-trained LLM as a blackbox and efficiently explore the space of hyperparameters with the NOMAD algorithm, achieving a boost in performance and human alignment of the tuned model.
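As a rough illustration of the setup described in this abstract (not the authors' code), the fine-tune-and-validate pipeline can be wrapped as a single blackbox objective over LoRA hyperparameters such as the rank; the helper train_and_validate below is hypothetical, and a real run would hand `objective` to NOMAD or another BBO solver.

# Sketch only: treating LoRA fine-tuning + validation as a blackbox objective.
# `train_and_validate` is a hypothetical stand-in for the expensive pipeline.

def train_and_validate(rank: int, alpha: float, dropout: float, lr: float) -> float:
    """Fine-tune the pre-trained LLM with LoRA(rank, alpha, dropout) at
    learning rate `lr` and return a validation loss (lower is better)."""
    raise NotImplementedError  # stands in for the full fine-tuning pipeline

def objective(x):
    # x = [rank, alpha, dropout, lr]: the hyperparameters exposed to the BBO solver
    rank, alpha, dropout, lr = int(x[0]), float(x[1]), float(x[2]), float(x[3])
    return train_and_validate(rank, alpha, dropout, lr)

The solver only ever sees `objective(x) -> float`; no gradients of the fine-tuning pipeline are required, which is what makes the blackbox framing possible.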
DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation
With the ever-growing size of pretrained models (PMs), fine-tuning them has become more expensive and resource-hungry. As a remedy, low-rank adapters (LoRA) keep the main pretrained weights of the model frozen and just introduce some learnable truncated SVD modules (so-called LoRA blocks) into the model. While LoRA blocks are parameter-efficient, they suffer from two major problems: first, the size of these blocks is fixed and cannot be modified after training (for example, if we need to change the rank of LoRA blocks, we need to re-train them from scratch); second, optimizing their rank requires an exhaustive search and effort. In this work, we introduce a dynamic low-rank adaptation (DyLoRA) technique to address these two problems together. Our DyLoRA method trains LoRA blocks for a range of ranks instead of a single rank by sorting the representation learned by the adapter module at different ranks during training. We evaluate our solution on different natural language understanding (GLUE benchmark) and language generation tasks (E2E, DART, and WebNLG) using different pretrained models such as RoBERTa and GPT with different sizes. Our results show that we can train dynamic search-free models with DyLoRA at least 4 to 7 times (depending on the task) faster than LoRA without significantly compromising performance. Moreover, our models perform consistently well over a much larger range of ranks compared to LoRA.

Comment: Accepted to EACL 2023
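A minimal sketch of the idea described in the abstract (not the authors' implementation): a single LoRA block is trained so that truncating its factors to the first b components still yields a valid adapter, with b sampled at each training step.

import torch
import torch.nn as nn

class DynamicLoRALinear(nn.Module):
    """Sketch of a DyLoRA-style block: a frozen base layer plus a low-rank
    update whose rank b is sampled each forward pass, so that the leading b
    components of A and B can be used on their own after training."""

    def __init__(self, base: nn.Linear, max_rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.max_rank = max_rank
        self.alpha = alpha
        self.A = nn.Parameter(torch.randn(max_rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, max_rank))

    def forward(self, x, rank=None):
        if rank is None:                      # no fixed rank given: sample one, as during training
            rank = int(torch.randint(1, self.max_rank + 1, (1,)))
        A_b = self.A[:rank, :]                # keep only the first b rows of A
        B_b = self.B[:, :rank]                # ...and the first b columns of B
        update = (x @ A_b.T) @ B_b.T * (self.alpha / rank)
        return self.base(x) + update

At inference, any rank b <= max_rank can be selected without retraining, which is the search-free behaviour the abstract refers to.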
Attribute Controlled Dialogue Prompting
Prompt-tuning has become an increasingly popular parameter-efficient method for adapting large pretrained language models to downstream tasks. However, both discrete prompting and continuous prompting assume fixed prompts for all data samples within a task, neglecting the fact that inputs vary greatly in some tasks such as open-domain dialogue generation. In this paper, we present a novel, instance-specific prompt-tuning algorithm for dialogue generation. Specifically, we generate prompts based on instance-level control codes, rather than the conversation history, to explore their impact on controlled dialogue generation. Experiments on popular open-domain dialogue datasets, evaluated with both automatic metrics and human evaluation, demonstrate that our method is superior to prompting baselines and comparable to fine-tuning with only 5%-6% of the total parameters.

Comment: Accepted at ACL 2023 in Findings
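As a rough sketch of the mechanism described above (names and architecture are illustrative, not the paper's): a small trainable network maps an instance-level control code to a sequence of soft prompt embeddings that are prepended to the input embeddings of a frozen language model.

import torch
import torch.nn as nn

class ControlCodePrompter(nn.Module):
    """Illustrative instance-specific prompt generator: an attribute /
    control-code id is mapped to `prompt_len` soft prompt vectors."""

    def __init__(self, num_codes: int, prompt_len: int, hidden: int, d_model: int):
        super().__init__()
        self.prompt_len = prompt_len
        self.d_model = d_model
        self.code_emb = nn.Embedding(num_codes, hidden)
        self.proj = nn.Sequential(
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, prompt_len * d_model),
        )

    def forward(self, code_ids, input_embeds):
        # code_ids: (batch,) one control code per dialogue instance
        # input_embeds: (batch, seq_len, d_model) from the frozen LM's embedding layer
        prompts = self.proj(self.code_emb(code_ids))
        prompts = prompts.view(-1, self.prompt_len, self.d_model)
        # Prepend instance-specific soft prompts; only the prompter is trained.
        return torch.cat([prompts, input_embeds], dim=1)

Because only the prompter's parameters are updated while the language model stays frozen, the trainable fraction stays small, in the spirit of the 5%-6% figure quoted above.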
Continuation KD: Improved Knowledge Distillation through the Lens of Continuation Optimization
Knowledge Distillation (KD) has been extensively used for natural language understanding (NLU) tasks to improve a small model's (a student's) generalization by transferring knowledge from a larger model (a teacher). Although KD methods achieve state-of-the-art performance in numerous settings, they suffer from several problems that limit their performance. It has been shown in the literature that the capacity gap between the teacher and the student networks can make KD ineffective. Additionally, existing KD techniques do not mitigate the noise in the teacher's output: modeling the noisy behaviour of the teacher can distract the student from learning more useful features. We propose a new KD method that addresses these problems and facilitates training compared to previous techniques. Inspired by continuation optimization, we design a training procedure that optimizes the highly non-convex KD objective by starting with a smoothed version of this objective and making it more complex as training proceeds. Our method (Continuation-KD) achieves state-of-the-art performance across various compact architectures on NLU (GLUE benchmark) and computer vision tasks (CIFAR-10 and CIFAR-100).

Comment: Published at EMNLP 2022 (Findings)
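To illustrate the continuation idea in the abstract (a generic homotopy schedule, not necessarily the exact smoothing used in Continuation-KD): training starts from a heavily smoothed surrogate of the KD objective and is gradually annealed toward the full, harder objective.

import math

def continuation_weight(step: int, total_steps: int) -> float:
    """Annealing coefficient lambda in [0, 1]: 0 at the start (fully smoothed
    objective), 1 at the end (full objective)."""
    return 0.5 * (1.0 - math.cos(math.pi * step / max(1, total_steps)))

def continuation_kd_loss(smooth_loss, hard_loss, step, total_steps):
    """Convex combination implementing a simple continuation schedule:
    L(step) = (1 - lambda) * smoothed objective + lambda * full objective."""
    lam = continuation_weight(step, total_steps)
    return (1.0 - lam) * smooth_loss + lam * hard_loss

Here `smooth_loss` could be, for example, a high-temperature or MSE-style match to the teacher and `hard_loss` the full KD-plus-cross-entropy objective; the particular smoothing is a design choice of the paper, and this snippet only shows the schedule.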