PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning
This paper presents a method for adding multiple tasks to a single deep
neural network while avoiding catastrophic forgetting. Inspired by network
pruning techniques, we exploit redundancies in large deep networks to free up
parameters that can then be employed to learn new tasks. By performing
iterative pruning and network re-training, we are able to sequentially "pack"
multiple tasks into a single network while ensuring minimal drop in performance
and minimal storage overhead. Unlike prior work that uses proxy losses to
maintain accuracy on older tasks, we always optimize for the task at hand. We
perform extensive experiments on a variety of network architectures and
large-scale datasets, and observe much better robustness against catastrophic
forgetting than prior work. In particular, we are able to add three
fine-grained classification tasks to a single ImageNet-trained VGG-16 network
and achieve accuracies close to those of separately trained networks for each
task. Code available at https://github.com/arunmallya/packnet
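To make the prune-and-retrain loop concrete, here is a minimal PyTorch-style sketch: after a task is trained, the lowest-magnitude weights that are still free are released for later tasks, while the surviving weights are frozen for the task that owns them. The toy layer, the 50% pruning ratio, and the helper names are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def prune_free_weights(weight, free_mask, prune_ratio=0.5):
    """Among weights not yet owned by earlier tasks, drop the smallest-magnitude
    fraction and return the mask of weights kept for the current task."""
    free_vals = weight[free_mask].abs()
    k = int(free_vals.numel() * prune_ratio)
    threshold = free_vals.sort().values[k]
    return free_mask & (weight.abs() >= threshold)

layer = nn.Linear(64, 64)
free = torch.ones_like(layer.weight, dtype=torch.bool)  # initially every weight is free
task_masks = []

for task in range(3):
    # ... train on the current task, updating only weights where `free` is True ...
    keep = prune_free_weights(layer.weight.data, free)   # weights this task keeps
    task_masks.append(keep)
    layer.weight.data[free & ~keep] = 0.0                # pruned weights are released
    free = free & ~keep                                   # and stay available to later tasks
    # ... briefly re-train the current task with only the `keep` weights trainable ...

# At inference for task t, only weights owned by tasks 1..t are active, e.g. task 2:
active = torch.stack(task_masks[:2]).any(dim=0)
```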
Powerpropagation: A sparsity inducing weight reparameterisation
The training of sparse neural networks is becoming an increasingly important tool
for reducing the computational footprint of models at training and evaluation, as
well as enabling the effective scaling up of models. Whereas much work over the
years has been dedicated to specialised pruning techniques, little attention has
been paid to the inherent effect of gradient based training on model sparsity. In
this work, we introduce Powerpropagation, a new weight-parameterisation for
neural networks that leads to inherently sparse models. Exploiting the behaviour
of gradient descent, our method gives rise to weight updates exhibiting a “rich get
richer” dynamic, leaving low-magnitude parameters largely unaffected by learning.
Models trained in this manner exhibit similar performance, but have a distribution
with markedly higher density at zero, allowing more parameters to be pruned safely.
Powerpropagation is general, intuitive, cheap and straightforward to implement
and can readily be combined with various other techniques. To highlight its versatility, we explore it in two very different settings: Firstly, following a recent
line of work, we investigate its effect on sparse training for resource-constrained
settings. Here, we combine Powerpropagation with a traditional weight-pruning
technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing
superior performance on the ImageNet benchmark. Secondly, we advocate the use
of sparsity in overcoming catastrophic forgetting, where compressed representations allow accommodating a large number of tasks at fixed model capacity. In all
cases our reparameterisation considerably increases the efficacy of the off-the-shelf
methods.
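As a concrete illustration of the reparameterisation, the sketch below writes the effective weight as w = theta * |theta|**(alpha - 1) with alpha > 1, so the gradient with respect to theta picks up an extra alpha * |theta|**(alpha - 1) factor and low-magnitude parameters are barely moved by training (the "rich get richer" dynamic). The layer, the alpha value, and the initialisation are illustrative choices rather than the paper's full recipe.

```python
import torch
import torch.nn as nn

class PowerpropLinear(nn.Module):
    def __init__(self, in_features, out_features, alpha=2.0):
        super().__init__()
        self.theta = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.theta)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.alpha = alpha

    def forward(self, x):
        # Effective weight; for alpha = 1 this reduces to a standard linear layer.
        w = self.theta * self.theta.abs() ** (self.alpha - 1)
        return nn.functional.linear(x, w, self.bias)

layer = PowerpropLinear(8, 4, alpha=2.0)
x = torch.randn(16, 8)
layer(x).sum().backward()
# d w / d theta = alpha * |theta|**(alpha - 1), so already-large parameters
# receive proportionally larger updates while near-zero ones stay near zero.
```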
AdaptCL: Adaptive Continual Learning for Tackling Heterogeneity in Sequential Datasets
Managing heterogeneous datasets that vary in complexity, size, and similarity
in continual learning presents a significant challenge. Task-agnostic continual
learning is necessary to address this challenge, as datasets with varying
similarity pose difficulties in distinguishing task boundaries. Conventional
task-agnostic continual learning practices typically rely on rehearsal or
regularization techniques. However, rehearsal methods may struggle with varying
dataset sizes and regulating the importance of old and new data due to rigid
buffer sizes. Meanwhile, regularization methods apply generic constraints to
promote generalization but can hinder performance when dealing with dissimilar
datasets lacking shared features, necessitating a more adaptive approach. In
this paper, we propose AdaptCL, a novel adaptive continual learning method to
tackle heterogeneity in sequential datasets. AdaptCL employs fine-grained
data-driven pruning to adapt to variations in data complexity and dataset size.
It also utilizes task-agnostic parameter isolation to mitigate the impact of
varying degrees of catastrophic forgetting caused by differences in data
similarity. Through a two-pronged case-study approach, we evaluate AdaptCL both on
the MNIST Variants and DomainNet datasets and on datasets from different domains;
the latter include large-scale, diverse binary-class datasets as well as few-shot,
multi-class datasets. Across all these scenarios,
AdaptCL consistently exhibits robust performance, demonstrating its flexibility
and general applicability in handling heterogeneous datasets.
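The abstract does not spell out the pruning criterion, so the sketch below is only a hypothetical illustration of the two ingredients it names: a per-task pruning ratio driven by a simple data statistic (here, dataset size) and parameter isolation by keeping and freezing a task-specific subset of weights. All constants, formulas, and helper names are assumptions made for illustration.

```python
import torch

def task_keep_ratio(num_samples, base=0.1, cap=0.5, ref=10_000):
    # Hypothetical heuristic: larger datasets are granted a larger share of free weights.
    return min(cap, base + cap * min(1.0, num_samples / ref))

def isolate_task(weight, free_mask, num_samples):
    ratio = task_keep_ratio(num_samples)
    free_vals = weight[free_mask].abs()
    k = max(1, int(free_vals.numel() * ratio))
    threshold = free_vals.topk(k).values.min()
    keep = free_mask & (weight.abs() >= threshold)   # weights frozen for this task
    return keep, free_mask & ~keep                   # plus the remaining free capacity

w = torch.randn(128, 128)
free = torch.ones_like(w, dtype=torch.bool)
keep_a, free = isolate_task(w, free, num_samples=500)     # small dataset, small share
keep_b, free = isolate_task(w, free, num_samples=50_000)  # large dataset, larger share
```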
Towards Redundancy-Free Sub-networks in Continual Learning
Catastrophic Forgetting (CF) is a prominent issue in continual learning.
Parameter isolation addresses this challenge by masking a sub-network for each
task to mitigate interference with old tasks. However, these sub-networks are
constructed relying on weight magnitude, which does not necessarily correspond
to the importance of weights, resulting in maintaining unimportant weights and
constructing redundant sub-networks. To overcome this limitation, inspired by the
information bottleneck principle, which removes redundancy between adjacent network
layers, we propose Information Bottleneck Masked sub-network (IBM) to eliminate
redundancy within
sub-networks. Specifically, IBM accumulates valuable information into essential
weights to construct redundancy-free sub-networks, not only effectively
mitigating CF by freezing the sub-networks but also facilitating new tasks
training through the transfer of valuable knowledge. Additionally, IBM
decomposes hidden representations to automate the construction process and make
it flexible. Extensive experiments demonstrate that IBM consistently
outperforms state-of-the-art methods. Notably, IBM surpasses the
state-of-the-art parameter isolation method with a 70% reduction in the number
of parameters within sub-networks and an 80% decrease in training time.
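Because the abstract contrasts importance-based with magnitude-based sub-network construction, the sketch below builds a task mask from a generic gradient-times-weight importance score and then shields it from later tasks by zeroing their gradients on it. The score is a simple stand-in, not the paper's information-bottleneck criterion, and the layer and ratios are illustrative.

```python
import torch
import torch.nn as nn

def importance_mask(layer, loss, keep_ratio=0.3):
    """Score each weight by |weight * grad| on the current loss and keep the top fraction."""
    grad, = torch.autograd.grad(loss, layer.weight, retain_graph=True)
    score = (layer.weight * grad).abs()
    k = max(1, int(score.numel() * keep_ratio))
    threshold = score.flatten().topk(k).values.min()
    return score >= threshold

layer = nn.Linear(32, 10)
x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))
loss = nn.functional.cross_entropy(layer(x), y)
frozen = importance_mask(layer, loss)   # sub-network kept (and later frozen) for this task

# On a later task, gradients on the frozen sub-network are zeroed so the old
# task's weights cannot be overwritten; only the remaining weights keep learning.
later_loss = nn.functional.cross_entropy(layer(x), y)
later_loss.backward()
layer.weight.grad *= ~frozen
```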
On the Soft-Subnetwork for Few-shot Class Incremental Learning
Inspired by the Regularized Lottery Ticket Hypothesis (RLTH), which hypothesizes
that there exist smooth (non-binary) subnetworks within a dense network that
achieve the competitive performance of the dense network, we propose a few-shot
class incremental learning (FSCIL) method referred to as Soft-SubNetworks
(SoftNet). Our objective is to learn a sequence of sessions incrementally,
where each session only includes a few training instances per class while
preserving the knowledge of the previously learned ones. SoftNet jointly learns
the model weights and adaptive non-binary soft masks at a base training session
in which each mask consists of a major and a minor subnetwork; the former aims
to minimize catastrophic forgetting during training, and the latter aims to
avoid overfitting to a few samples in each new training session. We provide
comprehensive empirical validations demonstrating that our SoftNet effectively
tackles the few-shot incremental learning problem by surpassing the performance
of state-of-the-art baselines over benchmark datasets.
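As a rough illustration of a soft mask with a major and a minor part, the sketch below gives top-magnitude weights a hard 1 (the major subnetwork) and the remaining weights a small learnable value in (0, 1) (the minor subnetwork) instead of zeroing them. The top-k split and the sigmoid scoring are illustrative choices, not the paper's exact construction.

```python
import torch
import torch.nn as nn

class SoftMaskedLinear(nn.Module):
    def __init__(self, in_features, out_features, major_ratio=0.5):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.scores = nn.Parameter(torch.zeros_like(self.linear.weight))  # learned mask scores
        self.major_ratio = major_ratio

    def soft_mask(self):
        w = self.linear.weight
        k = max(1, int(w.numel() * self.major_ratio))
        threshold = w.abs().flatten().topk(k).values.min()
        major = (w.abs() >= threshold).float()            # hard 1s on top-magnitude weights
        minor = torch.sigmoid(self.scores) * (1 - major)  # soft values in (0, 1) on the rest
        return major + minor

    def forward(self, x):
        return nn.functional.linear(x, self.linear.weight * self.soft_mask(), self.linear.bias)

layer = SoftMaskedLinear(8, 4)
out = layer(torch.randn(2, 8))
```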
Forget-free Continual Learning with Soft-Winning SubNetworks
Inspired by the Regularized Lottery Ticket Hypothesis (RLTH), which states that
competitive smooth (non-binary) subnetworks exist within a dense network in
continual learning tasks, we investigate two proposed architecture-based
continual learning methods that sequentially learn and select adaptive binary
subnetworks (WSN) and non-binary Soft-Subnetworks (SoftNet) for each task. WSN and SoftNet
jointly learn the regularized model weights and task-adaptive non-binary masks
of subnetworks associated with each task whilst attempting to select a small
set of weights to be activated (winning ticket) by reusing weights of the prior
subnetworks. Our proposed WSN and SoftNet are inherently immune to catastrophic
forgetting as each selected subnetwork model does not infringe upon other
subnetworks in Task Incremental Learning (TIL). In TIL, binary masks spawned
per winning ticket are encoded into one N-bit binary digit mask, then
compressed using Huffman coding for a sub-linear increase in network capacity
with respect to the number of tasks. Surprisingly, at inference, the SoftNet generated
by injecting small noise into the background weights of an acquired WSN (while holding
the foreground weights of the WSN fixed) provides excellent forward transfer to future
tasks in TIL. SoftNet also shows its effectiveness over WSN in regularizing parameters
to tackle overfitting to a few examples in Few-shot Class Incremental Learning (FSCIL).
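The mask-compression idea can be illustrated with a short sketch: each weight's per-task mask bits are packed into one symbol, and the stream of symbols is Huffman-coded, so storage grows slowly when the same mask patterns repeat across many weights. The toy masks and the helper below are assumptions for illustration, not the authors' encoder.

```python
import heapq
from collections import Counter
import torch

def huffman_code(symbols):
    """Build a Huffman code (symbol -> bitstring) from a list of symbols."""
    heap = [[freq, i, {sym: ""}] for i, (sym, freq) in enumerate(Counter(symbols).items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        codes = {s: "0" + c for s, c in lo[2].items()}
        codes.update({s: "1" + c for s, c in hi[2].items()})
        heapq.heappush(heap, [lo[0] + hi[0], lo[1], codes])
    return heap[0][2]

# Three tasks' sparse binary masks over the same 1000 weights; each weight's
# three bits are packed into a single symbol before entropy coding.
masks = (torch.rand(3, 1000) < 0.2).int()
symbols = [int("".join(str(b) for b in col), 2) for col in masks.t().tolist()]
code = huffman_code(symbols)
compressed_bits = sum(len(code[s]) for s in symbols)
print(compressed_bits, "Huffman-coded bits vs", masks.numel(), "raw mask bits")
```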