TAIL: Task-specific Adapters for Imitation Learning with Large Pretrained Models
The full potential of large pretrained models remains largely untapped in
control domains like robotics. This is mainly because of the scarcity of data
and the computational challenges associated with training or fine-tuning these
large models for such applications. Prior work mainly emphasizes effective
pretraining of large models for decision-making, with little exploration into
how to perform data-efficient continual adaptation of these models for new
tasks. Recognizing these constraints, we introduce TAIL (Task-specific Adapters
for Imitation Learning), a framework for efficient adaptation to new control
tasks. Inspired by recent advancements in parameter-efficient fine-tuning in
language domains, we explore efficient fine-tuning techniques -- e.g.,
Bottleneck Adapters, P-Tuning, and Low-Rank Adaptation (LoRA) -- in TAIL to
adapt large pretrained models for new tasks with limited demonstration data.
Our extensive experiments in large-scale language-conditioned manipulation
tasks comparing prevalent parameter-efficient fine-tuning techniques and
adaptation baselines suggest that TAIL with LoRA can achieve the best
post-adaptation performance with only 1% of the trainable parameters of full
fine-tuning, while avoiding catastrophic forgetting and preserving adaptation
plasticity in continual learning settings.
Comment: 21 pages, 8 figures, 8 tables
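Of the techniques named above, LoRA is the one TAIL ends up favoring. As a reference point, here is a minimal sketch of a low-rank adapter wrapped around a frozen pretrained linear layer; the class name, rank, scaling, and initialization are common defaults, not TAIL's actual configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Low-Rank Adaptation: freeze the pretrained weight W and learn a
    low-rank update B @ A, so the effective weight is W + (alpha/r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        # A starts small and random, B starts at zero, so the initial
        # update is zero and training begins from the pretrained model.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # frozen path + trainable low-rank path
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```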
Fine-Tuning Language Models with Just Forward Passes
Fine-tuning language models (LMs) has yielded success on diverse downstream
tasks, but as LMs grow in size, backpropagation requires a prohibitively large
amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients
using only two forward passes but are theorized to be catastrophically slow for
optimizing large models. In this work, we propose a memory-efficient
zeroth-order optimizer (MeZO), adapting the classical ZO-SGD method to operate
in-place, thereby fine-tuning LMs with the same memory footprint as inference.
For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter
model, whereas fine-tuning with backpropagation can train only a 2.7B LM with
the same budget. We conduct comprehensive experiments across model types
(masked and autoregressive LMs), model scales (up to 66B), and downstream tasks
(classification, multiple-choice, and generation). Our results demonstrate that
(1) MeZO significantly outperforms in-context learning and linear probing; (2)
MeZO achieves comparable performance to fine-tuning with backpropagation across
multiple tasks, with up to 12x memory reduction; (3) MeZO is compatible with
both full-parameter and parameter-efficient tuning techniques such as LoRA and
prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives
(e.g., maximizing accuracy or F1). We support our empirical findings with
theoretical insights, highlighting how adequate pre-training and task prompts
enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting
otherwise.
Comment: Code available at https://github.com/princeton-nlp/MeZ
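The two-forward-pass estimate can be sketched as follows: a shared random seed lets the perturbation z be regenerated rather than stored, which is what keeps the memory footprint at inference level. This is a simplified rendering, not the released implementation; `loss_fn`, `batch`, `lr`, and `eps` are placeholder assumptions.

```python
import torch

def mezo_step(model, loss_fn, batch, lr=1e-6, eps=1e-3):
    params = [p for p in model.parameters() if p.requires_grad]
    seed = torch.randint(0, 2**31 - 1, (1,)).item()

    def perturb(scale):
        # Re-seeding regenerates the same z on every call, so no copy of
        # the perturbation (or of the weights) ever needs to be stored.
        torch.manual_seed(seed)
        for p in params:
            p.data.add_(scale * eps * torch.randn_like(p))

    with torch.no_grad():
        perturb(+1)                          # theta + eps*z
        loss_plus = loss_fn(model, batch)    # forward pass 1
        perturb(-2)                          # theta - eps*z
        loss_minus = loss_fn(model, batch)   # forward pass 2
        perturb(+1)                          # restore theta
        grad_scale = (loss_plus - loss_minus) / (2 * eps)
        torch.manual_seed(seed)              # regenerate z for the update
        for p in params:
            p.data.add_(-lr * grad_scale * torch.randn_like(p))
```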
When Do Prompting and Prefix-Tuning Work? A Theory of Capabilities and Limitations
Context-based fine-tuning methods, including prompting, in-context learning,
soft prompting (also known as prompt tuning), and prefix-tuning, have gained
popularity due to their ability to often match the performance of full
fine-tuning with a fraction of the parameters. Despite their empirical
successes, there is little theoretical understanding of how these techniques
influence the internal computation of the model and their expressiveness
limitations. We show that despite the continuous embedding space being more
expressive than the discrete token space, soft-prompting and prefix-tuning are
strictly less expressive than full fine-tuning, even with the same number of
learnable parameters. Concretely, context-based fine-tuning cannot change the
relative attention pattern over the content and can only bias the outputs of an
attention layer in a fixed direction. This suggests that while techniques like
prompting, in-context learning, soft prompting, and prefix-tuning can
effectively elicit skills present in the pretrained model, they cannot learn
novel tasks that require new attention patterns.
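To make the attention-level claim concrete, here is a minimal single-head sketch of prefix-tuning, assuming queries, keys, and values are already projected; the prefix length and init scale are illustrative. The learnable prefix adds rows to the key/value matrices, so it can re-weight and bias the layer's output, but the relative attention scores among content tokens themselves are untouched.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrefixAttention(nn.Module):
    def __init__(self, d_model: int, prefix_len: int = 10):
        super().__init__()
        # Only these prefix parameters are trained; the model stays frozen.
        self.prefix_k = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)

    def forward(self, q, k, v):
        # q, k, v: (seq_len, d_model). Prepending prefix keys/values changes
        # how much total attention mass the content receives, but the ratio
        # of attention between any two content positions is unchanged.
        k = torch.cat([self.prefix_k, k], dim=0)
        v = torch.cat([self.prefix_v, v], dim=0)
        attn = F.softmax(q @ k.T * k.shape[-1] ** -0.5, dim=-1)
        return attn @ v
```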
Effectiveness of Data Augmentation for Parameter Efficient Tuning with Limited Data
Recent work has demonstrated that using parameter efficient tuning techniques
such as prefix tuning (or P-tuning) on pretrained language models can yield
performance that is comparable or superior to fine-tuning while dramatically
reducing trainable parameters. Nevertheless, the effectiveness of such methods
in the context of data augmentation, a common strategy to improve learning in
low-data regimes, has not been fully explored. In this paper, we examine
the effectiveness of several popular task-agnostic data augmentation
techniques, i.e., EDA, Back Translation, and Mixup, when using two general
parameter efficient tuning methods, P-tuning v2 and LoRA, under data scarcity.
We show that data augmentation can be used to boost the performance of P-tuning
and LoRA models, but the effectiveness of each technique varies and certain
methods can lead to a notable degradation in performance, particularly when
using larger models and on harder tasks. We further analyze the sentence
representations of P-tuning compared to fine-tuning to help understand the
above behaviour, and reveal how P-tuning generally presents a more limited
ability to separate the sentence embeddings from different classes of augmented
data. In addition, it displays poorer performance on heavily altered data.
However, we demonstrate that adding a simple contrastive loss function can
mitigate such issues for prefix tuning, resulting in sizable improvements on
augmented data.
Comment: Published at the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023) at ACL 2023
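The abstract does not spell out the contrastive loss. One standard formulation that matches the stated goal, separating sentence embeddings of different classes across original and augmented views, is a supervised contrastive objective, sketched below with an assumed temperature.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    # Normalize sentence embeddings and compute pairwise similarities.
    z = F.normalize(embeddings, dim=-1)                    # (B, D)
    sim = z @ z.T / temperature                            # (B, B)
    # Positives: other examples (original or augmented) of the same class.
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    pos.fill_diagonal_(0)
    # Exclude self-similarity from the softmax denominator.
    sim = sim - torch.eye(len(z), device=z.device) * 1e9
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-probability over each example's positives.
    loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss.mean()
```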
WikiTiDe: A Wikipedia-Based Timestamped Definition Pairs Dataset
A fundamental challenge in the current NLP context, dominated by language
models, comes from the inflexibility of current architectures to 'learn' new
information. While model-centric solutions like continual learning or
parameter-efficient fine-tuning are available, the question still remains of
how to reliably identify changes in language or in the world. In this paper, we
propose WikiTiDe, a dataset derived from pairs of timestamped definitions
extracted from Wikipedia. We argue that such a resource can be helpful for
accelerating diachronic NLP, specifically for training models able to scan
knowledge resources for core updates concerning a concept, an event, or a named
entity. Our proposed end-to-end method is fully automatic, and leverages a
bootstrapping algorithm for gradually creating a high-quality dataset. Our
results suggest that bootstrapping the seed version of WikiTiDe leads to better
fine-tuned models. We also leverage fine-tuned models in a number of downstream
tasks, showing promising results with respect to competitive baselines.
Comment: Accepted by the RANLP 2023 main conference
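The bootstrapping algorithm is described only at a high level. A generic sketch of the loop it implies, train on a seed set, score new candidate pairs, keep confident ones, might look like the following; all names and the threshold are illustrative, not the paper's algorithm.

```python
def bootstrap_dataset(seed_pairs, train, mine_candidates, score,
                      rounds=3, threshold=0.9):
    # Start from the manually seeded pairs and grow the set round by round.
    dataset = list(seed_pairs)
    for _ in range(rounds):
        model = train(dataset)              # fine-tune on the current set
        for cand in mine_candidates():      # new timestamped definition pairs
            if score(model, cand) >= threshold:
                dataset.append(cand)        # keep only confident additions
    return dataset
```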
Comparison between parameter-efficient techniques and full fine-tuning: A case study on multilingual news article classification
Adapters and Low-Rank Adaptation (LoRA) are parameter-efficient fine-tuning
techniques designed to make the training of language models more efficient.
Previous results demonstrated that these methods can even improve performance
on some classification tasks. This paper complements the existing research by
investigating how these techniques influence the classification performance and
computation costs compared to full fine-tuning when applied to multilingual
text classification tasks (genre, framing, and persuasion techniques detection,
with different input lengths, numbers of predicted classes, and classification
difficulty), some of which have limited training data. In addition, we conduct
in-depth analyses of their efficacy across different training scenarios
(training on the original multilingual data; on the translations into English;
and on a subset of English-only data) and different languages. Our findings
provide valuable insights into the applicability of parameter-efficient
fine-tuning techniques, particularly to complex multilingual and multilabel
classification tasks.
AdaFilter: Adaptive Filter Fine-tuning for Deep Transfer Learning
There is an increasing number of pre-trained deep neural network models.
However, it is still unclear how to effectively use these models for a new
task. Transfer learning, which aims to transfer knowledge from source tasks to
a target task, is an effective solution to this problem. Fine-tuning is a
popular transfer learning technique for deep neural networks where a few rounds
of training are applied to the parameters of a pre-trained model to adapt them
to a new task. Despite its popularity, in this paper, we show that fine-tuning
suffers from several drawbacks. We propose an adaptive fine-tuning approach,
called AdaFilter, which selects only a part of the convolutional filters in the
pre-trained model to optimize on a per-example basis. We use a recurrent gated
network to selectively fine-tune convolutional filters based on the activations
of the previous layer. We experiment with 7 public image classification
datasets, and the results show that AdaFilter reduces the average
classification error of standard fine-tuning by 2.54%.
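A heavily simplified sketch of the per-example gating idea follows. The paper uses a recurrent gated network over the previous layer's activations; this sketch substitutes a plain linear gate over pooled input channels just to keep the mechanism visible.

```python
import copy
import torch
import torch.nn as nn

class GatedFilterLayer(nn.Module):
    def __init__(self, conv_pretrained: nn.Conv2d):
        super().__init__()
        self.frozen = conv_pretrained
        for p in self.frozen.parameters():
            p.requires_grad = False              # pretrained filters fixed
        self.tuned = copy.deepcopy(conv_pretrained)  # trainable copy
        # Gate predicts a per-output-channel mixing weight from the input.
        self.gate = nn.Linear(conv_pretrained.in_channels,
                              conv_pretrained.out_channels)

    def forward(self, x):
        # Pool input activations to decide, per example, which output
        # channels rely on fine-tuned vs. frozen pretrained filters.
        g = torch.sigmoid(self.gate(x.mean(dim=(2, 3))))   # (B, C_out)
        g = g[:, :, None, None]
        return g * self.tuned(x) + (1 - g) * self.frozen(x)
```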
Test-Time Training for Speech
In this paper, we study the application of Test-Time Training (TTT) as a
solution to handling distribution shifts in speech applications. In particular,
we introduce distribution-shifts to the test datasets of standard
speech-classification tasks -- for example, speaker identification and
emotion detection -- and explore how Test-Time Training (TTT) can help adjust
to the distribution-shift. In our experiments that include distribution shifts
due to background noise and natural variations in speech such as gender and
age, we identify some key challenges with TTT, including sensitivity to
optimization hyperparameters (e.g., number of optimization steps and subset of
parameters chosen for TTT) and scalability (e.g., as each example gets its own
set of parameters, TTT is not scalable). Finally, we propose using BitFit -- a
parameter-efficient fine-tuning algorithm proposed for text applications that
only considers the bias parameters for fine-tuning -- as a solution to the
aforementioned challenges and demonstrate that it is consistently more stable
than fine-tuning all the parameters of the model.
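BitFit itself is simple enough to sketch directly: freeze everything except bias terms. A minimal helper, assuming a standard PyTorch module with conventional parameter names:

```python
import torch.nn as nn

def apply_bitfit(model: nn.Module):
    # Freeze all weights; only bias terms remain trainable, so test-time
    # adaptation touches a tiny, stable subset of parameters.
    for name, param in model.named_parameters():
        param.requires_grad = "bias" in name
    n_train = sum(p.numel() for p in model.parameters() if p.requires_grad)
    n_total = sum(p.numel() for p in model.parameters())
    print(f"BitFit: {n_train:,} of {n_total:,} parameters trainable")
```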