CLR: Channel-wise Lightweight Reprogramming for Continual Learning
Continual learning aims to emulate the human ability to continually
accumulate knowledge over sequential tasks. The main challenge is to maintain
performance on previously learned tasks after learning new tasks, i.e., to
avoid catastrophic forgetting. We propose a Channel-wise Lightweight
Reprogramming (CLR) approach that helps convolutional neural networks (CNNs)
overcome catastrophic forgetting during continual learning. We show that a CNN
model trained on an old task (or self-supervised proxy task) could be
"reprogrammed" to solve a new task by using our proposed lightweight (very
cheap) reprogramming parameter. With the help of CLR, we have a better
stability-plasticity trade-off to solve continual learning problems: To
maintain stability and retain previous task ability, we use a common
task-agnostic immutable part as the shared "anchor" parameter set. We then add
task-specific lightweight reprogramming parameters to reinterpret the outputs
of the immutable parts, to enable plasticity and integrate new knowledge. To
learn sequential tasks, we only train the lightweight reprogramming parameters
to learn each new task. Reprogramming parameters are task-specific and
exclusive to each task, which makes our method immune to catastrophic
forgetting. To minimize the parameter requirement of reprogramming to learn new
tasks, we make reprogramming lightweight by only adjusting essential kernels
and learning channel-wise linear mappings from anchor parameters to
task-specific domain knowledge. We show that, for general CNNs, the CLR
parameter increase is less than 0.6% for any new task. Our method outperforms
13 state-of-the-art continual learning baselines on a new challenging sequence
of 53 image classification datasets. Code and data are available at
https://github.com/gyhandy/Channel-wise-Lightweight-Reprogramming
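The channel-wise reprogramming idea can be illustrated with a small PyTorch sketch: a frozen "anchor" convolution followed by a per-channel linear mapping (here a 1x1 depthwise convolution) that is the only part trained for a new task. The module and layer sizes below are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class ChannelwiseReprogram(nn.Module):
    """Task-specific channel-wise linear mapping applied to the output of a
    frozen ("anchor") convolution. Only this module is trained per task."""
    def __init__(self, channels: int):
        super().__init__()
        # Depthwise 1x1 conv == independent scale + bias per channel.
        self.remap = nn.Conv2d(channels, channels, kernel_size=1,
                               groups=channels, bias=True)
        nn.init.ones_(self.remap.weight)   # start as the identity mapping
        nn.init.zeros_(self.remap.bias)

    def forward(self, x):
        return self.remap(x)

class ReprogrammedBlock(nn.Module):
    """Frozen anchor conv followed by a trainable channel-wise reprogramming layer."""
    def __init__(self, anchor_conv: nn.Conv2d):
        super().__init__()
        self.anchor = anchor_conv
        for p in self.anchor.parameters():
            p.requires_grad = False        # shared, immutable anchor parameters
        self.reprogram = ChannelwiseReprogram(anchor_conv.out_channels)

    def forward(self, x):
        return self.reprogram(self.anchor(x))

# Toy usage: only the reprogramming parameters receive gradients.
block = ReprogrammedBlock(nn.Conv2d(3, 16, kernel_size=3, padding=1))
out = block(torch.randn(2, 3, 32, 32))
trainable = [p for p in block.parameters() if p.requires_grad]
print(out.shape, sum(p.numel() for p in trainable), "trainable params")
```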
Emerging Paradigms of Neural Network Pruning
Over-parameterization of neural networks benefits optimization and
generalization yet brings costs in practice. Pruning is adopted as a
post-processing solution to this problem, which aims to remove unnecessary
parameters in a neural network with little performance compromised. It has been broadly believed that the resulting sparse neural network cannot be trained from scratch to comparable accuracy. However, several recent works (e.g., [Frankle and Carbin, 2019a]) challenge this belief by discovering random sparse networks that can be trained to match the performance of their dense counterparts.
This new pruning paradigm later inspires more new methods of pruning at
initialization. In spite of the encouraging progress, how to coordinate these
new pruning fashions with traditional pruning has not yet been explored.
This survey seeks to bridge the gap by proposing a general pruning framework so
that the emerging pruning paradigms can be accommodated well with the
traditional one. With it, we systematically reflect on the major differences and new insights brought by these new pruning fashions, with representative works discussed at length. Finally, we summarize the open questions as worthy future directions.
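For readers unfamiliar with the traditional paradigm the survey starts from, the sketch below shows plain post-training magnitude pruning: the smallest-magnitude weights are zeroed globally and binary masks are returned for later fine-tuning. It is a generic illustration, not a method taken from the survey.

```python
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, sparsity: float):
    """Classic post-training pruning: zero out the smallest-magnitude weights
    globally so that roughly `sparsity` fraction of weights is removed, and
    return binary masks that can be re-applied during fine-tuning."""
    weights = [m.weight for m in model.modules()
               if isinstance(m, (nn.Linear, nn.Conv2d))]
    all_scores = torch.cat([w.detach().abs().flatten() for w in weights])
    k = int(sparsity * all_scores.numel())
    threshold = torch.kthvalue(all_scores, max(k, 1)).values
    masks = []
    with torch.no_grad():
        for w in weights:
            mask = (w.abs() > threshold).float()
            w.mul_(mask)                      # zero the pruned weights in place
            masks.append(mask)
    return masks

# Toy usage
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
masks = magnitude_prune(model, sparsity=0.9)
kept = sum(int(m.sum()) for m in masks)
total = sum(m.numel() for m in masks)
print(f"kept {kept}/{total} weights")
```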
Engineering flexible machine learning systems by traversing functionally-invariant paths
Transformers have emerged as the state-of-the-art neural network architecture
for natural language processing and computer vision. In the foundation model
paradigm, large transformer models (BERT, GPT3/4, Bloom, ViT) are pre-trained
on self-supervised tasks such as word or image masking, and then, adapted
through fine-tuning for downstream user applications including instruction
following and question answering. While many approaches have been developed for model fine-tuning, including low-rank weight update strategies (e.g., LoRA),
underlying mathematical principles that enable network adaptation without
knowledge loss remain poorly understood. Here, we introduce a differential
geometry framework, functionally invariant paths (FIP), that provides flexible
and continuous adaptation of neural networks for a range of machine learning
goals and network sparsification objectives. We conceptualize the weight space
of a neural network as a curved Riemannian manifold equipped with a metric
tensor whose spectrum defines low rank subspaces in weight space that
accommodate network adaptation without loss of prior knowledge. We formalize
adaptation as movement along a geodesic path in weight space while searching
for networks that accommodate secondary objectives. With modest computational
resources, the FIP algorithm achieves performance comparable to the state of the art on continual learning and sparsification tasks for language models (BERT), vision transformers (ViT, DeiT), and CNNs. Broadly, we
conceptualize a neural network as a mathematical object that can be iteratively
transformed into distinct configurations by the path-sampling algorithm to
define a sub-manifold of weight space that can be harnessed to achieve user
goals.
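The idea of moving through weight space without changing network function can be caricatured with a simple surrogate: penalize any change of the outputs on a batch of anchor inputs while descending a secondary objective. The actual FIP algorithm defines a metric tensor and samples geodesic paths; the sketch below, which uses an L1 sparsification objective as the secondary goal, is only a crude stand-in under those assumptions.

```python
import copy
import torch
import torch.nn as nn

def fip_step(model, ref_model, anchor_x, secondary_loss_fn,
             lr=1e-2, invariance_weight=10.0):
    """One simplified path-sampling step: reduce a secondary objective while
    penalizing any change of the network's outputs on anchor data (a cheap
    surrogate for staying on a functionally invariant path)."""
    with torch.no_grad():
        ref_out = ref_model(anchor_x)            # outputs to be preserved
    out = model(anchor_x)
    invariance = ((out - ref_out) ** 2).mean()   # functional-change penalty
    loss = secondary_loss_fn(model) + invariance_weight * invariance
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad                     # plain gradient step
    return loss.item()

# Toy usage: shrink weights (a sparsification-style secondary goal) while
# keeping predictions on anchor inputs approximately unchanged.
net = nn.Sequential(nn.Linear(20, 64), nn.Tanh(), nn.Linear(64, 5))
reference = copy.deepcopy(net).eval()
anchor = torch.randn(128, 20)
l1_objective = lambda m: sum(p.abs().mean() for p in m.parameters())
for _ in range(50):
    fip_step(net, reference, anchor, l1_objective)
```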
Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs
The ever-growing size of large language models (LLMs), while opening a potential path toward artificial general intelligence, poses a daunting obstacle to their on-device deployment. As one of the most well-established pre-LLM approaches to reducing model complexity, network pruning appears to lag behind in the era of LLMs, due mostly to the costly fine-tuning (or re-training) it requires given the massive volumes of model parameters and training data. To close this industry-academia gap, we introduce
Dynamic Sparse No Training (DSnoT), a training-free fine-tuning approach that
slightly updates sparse LLMs without expensive backpropagation or any weight updates. Inspired by Dynamic Sparse Training, DSnoT minimizes the
reconstruction error between the dense and sparse LLMs, in the fashion of
performing iterative weight pruning-and-growing on top of sparse LLMs. To
accomplish this purpose, DSnoT particularly takes into account the anticipated
reduction in reconstruction error for pruning and growing, as well as the
variance w.r.t. different input data for growing each weight. This practice can
be executed efficiently in linear time since it obviates the need for
backpropagation for fine-tuning LLMs. Extensive experiments on LLaMA-V1/V2,
Vicuna, and OPT across various benchmarks demonstrate the effectiveness of
DSnoT in enhancing the performance of sparse LLMs, especially at high sparsity
levels. For instance, DSnoT is able to outperform the state-of-the-art Wanda by
26.79 perplexity at 70% sparsity with LLaMA-7B. Our paper offers fresh insights
into how to fine-tune sparse LLMs in an efficient, training-free manner and opens new avenues for scaling the great potential of sparsity to LLMs. Code is
available at https://github.com/zyxxmu/DSnoT.
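A training-free prune-and-grow update of this kind can be sketched for a single output neuron: grow the zeroed weight whose expected contribution best cancels the current reconstruction error, and prune one kept weight so the sparsity level is preserved, using forward statistics only. This is a simplified illustration; DSnoT's actual criteria also account for the variance of the input activations and operate layer by layer.

```python
import torch

def prune_and_grow_row(w_dense, w_sparse, mask, X):
    """One simplified prune-and-grow step for a single output neuron, in the
    spirit of training-free reconstruction-error reduction. X has shape
    (n_samples, n_in); w_dense, w_sparse, mask have shape (n_in,)."""
    err = (X @ (w_dense - w_sparse)).mean()      # signed reconstruction error
    x_mean = X.mean(dim=0)
    # Expected output change if a pruned weight is restored to its dense value.
    delta_grow = w_dense * x_mean
    grow_scores = torch.where(mask == 0, torch.sign(err) * delta_grow,
                              torch.full_like(w_dense, float("-inf")))
    grow_idx = torch.argmax(grow_scores)         # best error-cancelling weight
    # Prune the kept weight whose removal perturbs the output least.
    prune_scores = torch.where(mask == 1, (w_sparse * x_mean).abs(),
                               torch.full_like(w_dense, float("inf")))
    prune_idx = torch.argmin(prune_scores)
    mask[grow_idx], mask[prune_idx] = 1.0, 0.0
    return w_dense * mask, mask

# Toy usage on random data
torch.manual_seed(0)
n_in, n_samples = 32, 256
w_dense = torch.randn(n_in)
mask = (torch.rand(n_in) > 0.7).float()          # roughly 70% sparsity
w_sparse = w_dense * mask
X = torch.randn(n_samples, n_in)
w_sparse, mask = prune_and_grow_row(w_dense, w_sparse, mask, X)
print("sparsity:", 1 - mask.mean().item())
```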
Continual Learning with Dynamic Sparse Training: Exploring Algorithms for Effective Model Updates
Continual learning (CL) refers to the ability of an intelligent system to
sequentially acquire and retain knowledge from a stream of data with as little
computational overhead as possible. To this end, regularization, replay, architecture, and parameter-isolation approaches have been introduced in the literature. Parameter isolation uses a sparse network, which makes it possible to allocate distinct parts of the neural network to different tasks and to share parameters between tasks if they are similar. Dynamic Sparse
Training (DST) is a prominent way to find these sparse networks and isolate
them for each task. This paper is the first empirical study investigating the
effect of different DST components under the CL paradigm to fill a critical
research gap and shed light on the optimal configuration of DST for CL if it
exists. Therefore, we perform a comprehensive study in which we investigate
various DST components to find the best topology per task on well-known
CIFAR100 and miniImageNet benchmarks in a task-incremental CL setup since our
primary focus is to evaluate the performance of various DST criteria, rather
than the process of mask selection. We found that, at a low sparsity level,
Erdos-Renyi Kernel (ERK) initialization utilizes the backbone more efficiently and enables effective learning of task increments. At a high sparsity level, however, uniform initialization demonstrates more reliable and robust performance. In terms of growth strategy, performance depends on the initialization strategy and the extent of sparsity. Finally,
adaptivity within DST components is a promising way to obtain better continual learners.
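The ERK and uniform initializations compared above differ only in how per-layer density is allocated. The snippet below sketches the ERK rule, where a layer's density is proportional to (n_out + n_in + k_h + k_w) / (n_out * n_in * k_h * k_w) and then rescaled to hit a global target; it is a simplified version that does not redistribute mass when a layer saturates at full density.

```python
def erdos_renyi_kernel_densities(layer_shapes, target_density):
    """Allocate per-layer densities with the Erdos-Renyi-Kernel (ERK) rule.
    `layer_shapes` is a list of (n_out, n_in, k_h, k_w) tuples; fully
    connected layers use k_h = k_w = 1."""
    raw = [sum(s) / (s[0] * s[1] * s[2] * s[3]) for s in layer_shapes]
    n_params = [s[0] * s[1] * s[2] * s[3] for s in layer_shapes]
    total = sum(n_params)
    # Scale so that sum(density_l * n_params_l) == target_density * total.
    scale = target_density * total / sum(r * n for r, n in zip(raw, n_params))
    return [min(1.0, scale * r) for r in raw]

# Toy usage: a small CNN at 90% sparsity (10% global density).
shapes = [(16, 3, 3, 3), (32, 16, 3, 3), (10, 32 * 8 * 8, 1, 1)]
for shape, d in zip(shapes, erdos_renyi_kernel_densities(shapes, 0.1)):
    print(shape, f"density = {d:.3f}")
```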
Continual Learning with Invertible Generative Models
Catastrophic forgetting (CF) happens whenever a neural network overwrites
past knowledge while being trained on new tasks. Common techniques to handle CF
include regularization of the weights (using, e.g., their importance on past
tasks), and rehearsal strategies, where the network is constantly re-trained on
past data. Generative models have also been applied for the latter, in order to
have endless sources of data. In this paper, we propose a novel method that
combines the strengths of regularization and generative-based rehearsal
approaches. Our generative model consists of a normalizing flow (NF), a
probabilistic and invertible neural network, trained on the internal embeddings
of the network. By keeping a single NF throughout the training process, we show
that our memory overhead remains constant. In addition, exploiting the
invertibility of the NF, we propose a simple approach to regularize the
network's embeddings with respect to past tasks. We show that our method
performs favorably with respect to state-of-the-art approaches in the
literature, with bounded computational power and memory overheads.
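A minimal version of such a flow over embeddings can be written with a single RealNVP-style affine coupling layer: fit it to the current task's embeddings by maximum likelihood, then invert samples from the base Gaussian to obtain pseudo-embeddings for rehearsal. The toy data and single coupling layer below are simplifying assumptions; a practical flow stacks several coupling layers with permutations.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """A single RealNVP-style coupling layer: a tiny invertible model of an
    embedding distribution. One layer is kept here for brevity."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * (dim - self.half)))

    def forward(self, x):                      # embedding -> latent, log|det J|
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=1)
        s = torch.tanh(s)                      # keep scales stable
        z2 = x2 * torch.exp(s) + t
        return torch.cat([x1, z2], dim=1), s.sum(dim=1)

    def inverse(self, z):                      # latent -> pseudo-embedding
        z1, z2 = z[:, :self.half], z[:, self.half:]
        s, t = self.net(z1).chunk(2, dim=1)
        s = torch.tanh(s)
        x2 = (z2 - t) * torch.exp(-s)
        return torch.cat([z1, x2], dim=1)

# Fit the flow on current-task embeddings by maximum likelihood, then draw
# pseudo-embeddings for rehearsal on later tasks.
dim = 32
flow = AffineCoupling(dim)
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
embeddings = torch.randn(512, dim) * 2 + 1     # stand-in for backbone features
for _ in range(200):
    z, log_det = flow(embeddings)
    nll = (0.5 * z.pow(2).sum(dim=1) - log_det).mean()   # standard-normal base
    opt.zero_grad(); nll.backward(); opt.step()
with torch.no_grad():
    rehearsal_embeddings = flow.inverse(torch.randn(64, dim))
```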
Incremental Task Learning with Incremental Rank Updates
Incremental Task Learning (ITL) is a category of continual learning that
seeks to train a single network for multiple tasks (one after another), where
training data for each task is only available during the training of that task.
Neural networks tend to forget older tasks when they are trained for the newer
tasks; this property is often known as catastrophic forgetting. To address this
issue, ITL methods use episodic memory, parameter regularization, masking and
pruning, or extensible network structures. In this paper, we propose a new
incremental task learning framework based on low-rank factorization. In
particular, we represent the network weights for each layer as a linear
combination of several rank-1 matrices. To update the network for a new task,
we learn a rank-1 (or low-rank) matrix and add that to the weights of every
layer. We also introduce an additional selector vector that assigns different
weights to the low-rank matrices learned for the previous tasks. We show that
our approach performs better than the current state-of-the-art methods in terms
of accuracy and forgetting. Our method also offers better memory efficiency
compared to episodic memory- and mask-based approaches. Our code will be
available at https://github.com/CSIPlab/task-increment-rank-update.git
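The low-rank update scheme can be sketched as a linear layer whose weight is a selector-weighted sum of rank-1 factors: each new task freezes the existing factors, appends one trainable rank-1 factor, and learns a fresh selector vector over all factors. The class below is an illustrative reconstruction of that idea, not the authors' released code.

```python
import torch
import torch.nn as nn

class IncrementalRankLinear(nn.Module):
    """Linear layer whose weight is a selector-weighted sum of rank-1 factors.
    Each new task adds one trainable rank-1 factor and a fresh selector over
    all factors; factors from earlier tasks stay frozen."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.in_features, self.out_features = in_features, out_features
        self.us = nn.ParameterList()           # rank-1 left factors
        self.vs = nn.ParameterList()           # rank-1 right factors
        self.selectors = nn.ParameterList()    # one selector vector per task

    def add_task(self):
        for u, v in zip(self.us, self.vs):     # freeze factors from old tasks
            u.requires_grad_(False); v.requires_grad_(False)
        self.us.append(nn.Parameter(torch.randn(self.out_features, 1) * 0.1))
        self.vs.append(nn.Parameter(torch.randn(1, self.in_features) * 0.1))
        self.selectors.append(nn.Parameter(torch.ones(len(self.us))))

    def weight(self, task_id):
        alpha = self.selectors[task_id]        # selector learned for this task
        return sum(alpha[i] * self.us[i] @ self.vs[i]
                   for i in range(task_id + 1))

    def forward(self, x, task_id):
        return x @ self.weight(task_id).t()

# Toy usage: two tasks, each adding one rank-1 factor.
layer = IncrementalRankLinear(in_features=8, out_features=4)
layer.add_task()                               # task 0
y0 = layer(torch.randn(5, 8), task_id=0)
layer.add_task()                               # task 1
y1 = layer(torch.randn(5, 8), task_id=1)
print(y0.shape, y1.shape)
```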