SWAMP: Sparse Weight Averaging with Multiple Particles for Iterative Magnitude Pruning
Given the ever-increasing size of modern neural networks, the significance of
sparse architectures has surged due to their accelerated inference speeds and
minimal memory demands. When it comes to global pruning techniques, Iterative
Magnitude Pruning (IMP) still stands as a state-of-the-art algorithm despite
its simple nature, particularly in extremely sparse regimes. In light of the
recent finding that two successive matching IMP solutions are linearly
connected without a loss barrier, we propose Sparse Weight Averaging with
Multiple Particles (SWAMP), a straightforward modification of IMP that achieves
performance comparable to an ensemble of two IMP solutions. For every
iteration, we concurrently train multiple sparse models, referred to as
particles, using different batch orders yet the same matching ticket, and then
weight-average these models to produce a single mask. We demonstrate that our
method consistently outperforms existing baselines across different sparsities
through extensive experiments on various datasets and neural network architectures.
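To make the procedure concrete, below is a minimal, hypothetical Python/PyTorch sketch of one SWAMP-style iteration: several particles are trained from the same matching ticket with different batch orders, their weights are averaged, and a magnitude mask for the next round is derived from the average. The function names, hyperparameters, and toy data are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of one SWAMP-style iteration (illustrative, not the authors' code):
# train several "particles" from the same matching ticket with different batch orders,
# average their weights, and derive the next magnitude mask from the average.
import copy
import torch
import torch.nn as nn

def train_particle(model, data, targets, seed, epochs=3):
    """Train one particle; the seed only changes the batch order."""
    torch.manual_seed(seed)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        perm = torch.randperm(len(data))        # different batch order per particle
        for i in range(0, len(data), 32):
            idx = perm[i:i + 32]
            opt.zero_grad()
            loss_fn(model(data[idx]), targets[idx]).backward()
            opt.step()
    return model

def average_particles(particles):
    """Uniformly average the weights of all particles into one model."""
    avg = copy.deepcopy(particles[0])
    with torch.no_grad():
        for params in zip(avg.parameters(), *(p.parameters() for p in particles)):
            params[0].copy_(torch.stack([q.detach() for q in params[1:]]).mean(dim=0))
    return avg

def magnitude_mask(model, sparsity):
    """Global magnitude mask: keep the (1 - sparsity) fraction of largest weights."""
    flat = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    threshold = torch.quantile(flat, sparsity)
    return [(p.detach().abs() > threshold).float() for p in model.parameters()]

# Toy usage with random stand-in data.
data, targets = torch.randn(256, 20), torch.randint(0, 2, (256,))
ticket = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
particles = [train_particle(copy.deepcopy(ticket), data, targets, seed=s) for s in range(4)]
averaged = average_particles(particles)
mask = magnitude_mask(averaged, sparsity=0.8)   # mask for the next, sparser round
```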
Linear Mode Connectivity in Sparse Neural Networks
With the rise in interest of sparse neural networks, we study how neural
network pruning with synthetic data leads to sparse networks with unique
training properties. We find that distilled data, a synthetic summarization of
the real data, paired with Iterative Magnitude Pruning (IMP) unveils a new
class of sparse networks that are more stable to SGD noise on the real data
than either the dense model or subnetworks found with real data in IMP. That
is, synthetically chosen subnetworks often train to the same minima or exhibit
linear mode connectivity. We study this through linear interpolation, loss
landscape visualizations, and measurements of the diagonal of the Hessian. While
dataset distillation as a field is still young, we find that these properties
lead to synthetic subnetworks matching the performance of traditional IMP with
up to 150x fewer training points in settings where distilled data applies.
Comment: Published in NeurIPS 2023 UniReps Workshop
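As a rough illustration of the linear-interpolation check referred to above, the following hypothetical PyTorch sketch evaluates the loss along the straight line between two solutions and reports the resulting barrier; the actual experiments in the paper are more extensive.

```python
# Hypothetical sketch of a linear mode connectivity check: evaluate the loss along
# the straight line between two trained models and measure the barrier above the
# linear average of the endpoint losses.
import copy
import torch
import torch.nn as nn

def interpolate(model_a, model_b, alpha):
    """Return a model whose weights are (1 - alpha) * A + alpha * B."""
    mixed = copy.deepcopy(model_a)
    with torch.no_grad():
        for p_m, p_a, p_b in zip(mixed.parameters(), model_a.parameters(), model_b.parameters()):
            p_m.copy_((1 - alpha) * p_a + alpha * p_b)
    return mixed

def loss_barrier(model_a, model_b, data, targets, steps=11):
    """Largest increase of the interpolated loss over the endpoint baseline."""
    loss_fn = nn.CrossEntropyLoss()
    alphas = torch.linspace(0, 1, steps).tolist()
    with torch.no_grad():
        losses = [loss_fn(interpolate(model_a, model_b, a)(data), targets).item() for a in alphas]
    baseline = [(1 - a) * losses[0] + a * losses[-1] for a in alphas]
    return max(l - b for l, b in zip(losses, baseline))

# Toy usage: two independently trained models would normally go here.
data, targets = torch.randn(128, 16), torch.randint(0, 2, (128,))
model_a = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
model_b = copy.deepcopy(model_a)
print(loss_barrier(model_a, model_b, data, targets))   # ~0 for identical models
```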
Lottery Tickets in Evolutionary Optimization: On Sparse Backpropagation-Free Trainability
Is the lottery ticket phenomenon an idiosyncrasy of gradient-based training
or does it generalize to evolutionary optimization? In this paper we establish
the existence of highly sparse trainable initializations for evolution
strategies (ES) and characterize qualitative differences compared to gradient
descent (GD)-based sparse training. We introduce a novel signal-to-noise
iterative pruning procedure, which incorporates loss curvature information into
the network pruning step. This can enable the discovery of even sparser
trainable network initializations when using black-box evolution as compared to
GD-based optimization. Furthermore, we find that these initializations encode
an inductive bias, which transfers across different ES, related tasks and even
to GD-based training. Finally, we compare the local optima resulting from the
different optimization paradigms and sparsity levels. In contrast to GD, ES
explore diverse and flat local optima and do not preserve linear mode
connectivity across sparsity levels and independent runs. The results highlight
qualitative differences between evolution and gradient-based learning dynamics,
which can be uncovered by the study of iterative pruning procedures.
Comment: 13 pages, 11 figures, International Conference on Machine Learning (ICML) 2023
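The signal-to-noise pruning criterion could, for instance, look like the following hypothetical NumPy sketch, which scores each weight by its mean magnitude relative to its variation across an ES population; the exact criterion and the way loss curvature information enters in the paper may differ.

```python
# Hypothetical sketch of a signal-to-noise pruning score in an evolution-strategies
# setting: weights whose mean magnitude across the perturbed population is small
# relative to their variation get pruned first. Illustrative only; the paper's
# criterion may be defined differently.
import numpy as np

def snr_prune_mask(population_weights, prune_fraction):
    """population_weights: (pop_size, n_weights) samples from one ES generation."""
    mean = population_weights.mean(axis=0)
    std = population_weights.std(axis=0) + 1e-8   # avoid division by zero
    snr = np.abs(mean) / std                      # high SNR = consistent, strong weight
    threshold = np.quantile(snr, prune_fraction)
    return snr > threshold                        # True = keep the weight

# Toy usage: 64 population members perturbing 1000 weights.
rng = np.random.default_rng(0)
base = rng.normal(size=1000)
population = base + 0.1 * rng.normal(size=(64, 1000))
mask = snr_prune_mask(population, prune_fraction=0.5)
```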
Random Teachers are Good Teachers
In this work, we investigate the implicit regularization induced by
teacher-student learning dynamics in self-distillation. To isolate its effect,
we describe a simple experiment where we consider teachers at random
initialization instead of trained teachers. Surprisingly, when distilling a
student into such a random teacher, we observe that the resulting model and its
representations already possess very interesting characteristics: (1) we
observe a strong improvement of the distilled student over its teacher in terms
of probing accuracy. (2) The learned representations are data-dependent and
transferable between different tasks but deteriorate strongly if trained on
random inputs. (3) The student checkpoint contains sparse subnetworks,
so-called lottery tickets, and lies on the border of linear basins in the
supervised loss landscape. These observations have interesting consequences for
several important areas in machine learning: (1) Self-distillation can work
solely based on the implicit regularization present in the gradient dynamics
without relying on any dark knowledge, (2) self-supervised learning can learn
features even in the absence of data augmentation, and (3) training dynamics
during the early phase of supervised training do not necessarily require label
information. Finally, we shed light on an intriguing local property of the loss
landscape: the process of feature learning is strongly amplified if the student
is initialized close to the teacher. These results raise interesting
questions about the nature of the landscape that have remained unexplored so
far. Code is available at https://github.com/safelix/dinopl
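A minimal, hypothetical sketch of the random-teacher setup, assuming a PyTorch encoder and a simple representation-matching loss (the paper's DINO-style setup with projection heads and augmentations is more involved):

```python
# Hypothetical sketch: distill a student into a frozen, randomly initialized teacher
# by matching representations. The student starts close to the teacher, mirroring
# the locality effect described above; all details are simplified assumptions.
import copy
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 64))
student = copy.deepcopy(teacher)
with torch.no_grad():
    for p in student.parameters():
        p.add_(0.01 * torch.randn_like(p))  # student initialized close to the teacher
for p in teacher.parameters():
    p.requires_grad_(False)                 # the teacher stays at its random init

opt = torch.optim.SGD(student.parameters(), lr=0.01)
data = torch.randn(1024, 32)                # stand-in for real inputs
for step in range(100):
    batch = data[torch.randint(0, len(data), (64,))]
    loss = nn.functional.mse_loss(student(batch), teacher(batch))
    opt.zero_grad()
    loss.backward()
    opt.step()
```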
Random initialisations performing above chance and how to find them
Neural networks trained with stochastic gradient descent (SGD) starting from
different random initialisations typically find functionally very similar
solutions, raising the question of whether there are meaningful differences
between different SGD solutions. Entezari et al. recently conjectured that
despite different initialisations, the solutions found by SGD lie in the same
loss valley after taking into account the permutation invariance of neural
networks. Concretely, they hypothesise that any two solutions found by SGD can
be permuted such that the linear interpolation between their parameters forms a
path without significant increases in loss. Here, we use a simple but powerful
algorithm to find such permutations that allows us to obtain direct empirical
evidence that the hypothesis is true in fully connected networks. Strikingly,
we find that two networks already live in the same loss valley at the time of
initialisation, and averaging their random but suitably permuted initialisations
performs significantly above chance. In contrast, for convolutional
architectures, our evidence suggests that the hypothesis does not hold.
Especially in a large learning rate regime, SGD seems to discover diverse
modes.
Comment: NeurIPS 2022, 14th Annual Workshop on Optimization for Machine Learning (OPT2022)
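The permutation-matching step could be sketched roughly as follows for a single hidden layer, assuming NumPy and SciPy's linear_sum_assignment; the algorithm actually used in the paper may differ, and deeper networks require layer-wise matching.

```python
# Hypothetical sketch of aligning two one-hidden-layer MLPs by permuting hidden
# units so that their weights match as closely as possible, then averaging.
# Real networks need layer-wise (and iterative) matching; this shows the core step.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_hidden_units(W1_a, W1_b, W2_a, W2_b):
    """Find the permutation of model B's hidden units that best matches model A."""
    # Similarity of hidden unit i in A and hidden unit j in B, using both layers.
    cost = -(W1_a @ W1_b.T + W2_a.T @ W2_b)        # negative similarity = cost
    _, perm = linear_sum_assignment(cost)
    return perm

# Toy usage with random weights: W1 maps inputs -> hidden, W2 maps hidden -> outputs.
rng = np.random.default_rng(0)
W1_a, W1_b = rng.normal(size=(64, 10)), rng.normal(size=(64, 10))
W2_a, W2_b = rng.normal(size=(5, 64)), rng.normal(size=(5, 64))
perm = match_hidden_units(W1_a, W1_b, W2_a, W2_b)
W1_avg = 0.5 * (W1_a + W1_b[perm])                 # average after permuting B's rows
W2_avg = 0.5 * (W2_a + W2_b[:, perm])              # and its output columns
```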
Emerging Paradigms of Neural Network Pruning
Over-parameterization of neural networks benefits optimization and
generalization yet incurs costs in practice. Pruning is adopted as a
post-processing solution to this problem, which aims to remove unnecessary
parameters in a neural network with little performance compromise. It has been
broadly believed that the resulting sparse neural network cannot be trained from
scratch to comparable accuracy. However, several recent works (e.g., [Frankle
and Carbin, 2019a]) challenge this belief by discovering random sparse networks
which can be trained to match the performance of their dense counterparts.
This new pruning paradigm later inspires more new methods of pruning at
initialization. In spite of this encouraging progress, how to coordinate these
new pruning fashions with traditional pruning has not yet been explored. This
survey seeks to bridge the gap by proposing a general pruning framework so that
the emerging pruning paradigms can be accommodated alongside the traditional one.
With it, we systematically reflect on the major differences and new insights
brought by these new pruning fashions, with representative works discussed at
length. Finally, we summarize the open questions as worthy future directions.