Some Fundamental Aspects about Lipschitz Continuity of Neural Network Functions
Lipschitz continuity is a simple yet crucial functional property of any
predictive model, as it lies at the core of the model's robustness,
generalisation, and adversarial vulnerability. Our aim is to thoroughly
investigate and characterise the Lipschitz behaviour of the functions realised
by neural networks. Thus, we carry out an empirical investigation in a range of
different settings (namely, architectures, losses, optimisers, label noise, and
more) by exhausting the limits of the simplest and the most general lower and
upper bounds. Although motivated primarily by computational hardness results,
this choice nevertheless turns out to be rather resourceful and sheds light on
several fundamental and intriguing traits of the Lipschitz continuity of neural
network functions, which we also supplement with suitable theoretical
arguments. As a highlight of this investigation, we identify a striking double
descent trend in both upper and lower bounds to the Lipschitz constant with
increasing network width -- which tightly aligns with the typical double
descent trend in the test loss. Lastly, we touch upon the seemingly
counter-intuitive decline of the Lipschitz constant in the presence of label
noise.
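As a rough illustration of the kind of bounds meant here, the following is a
minimal PyTorch sketch (ours, not the paper's code) of the simplest estimates
for a feed-forward network with 1-Lipschitz activations such as ReLU: a lower
bound from input gradients on a batch of samples, and an upper bound from the
product of layer-wise spectral norms. The function names and the
random-direction trick for vector-valued outputs are our own choices.

```python
# Minimal sketch (ours, not the paper's code): crude lower and upper bounds
# on the L2 Lipschitz constant of a feed-forward network with 1-Lipschitz
# activations (e.g. ReLU).
import torch

def lipschitz_lower_bound(model, x):
    """Norm of a random Jacobian-vector product, maximised over the samples:
    a valid (crude) lower bound; equals the gradient norm for scalar outputs."""
    x = x.clone().requires_grad_(True)
    out = model(x)
    v = torch.randn_like(out)
    v = v / v.norm()                      # random unit direction in output space
    (g,) = torch.autograd.grad(out, x, grad_outputs=v)
    return g.flatten(1).norm(dim=1).max().item()

def lipschitz_upper_bound(model):
    """Product of layer-wise spectral norms: a valid (often loose) upper bound."""
    bound = 1.0
    for m in model.modules():
        if isinstance(m, torch.nn.Linear):
            bound *= torch.linalg.matrix_norm(m.weight, ord=2).item()
    return bound
```

The sketch only covers fully connected layers; convolutional, residual, and
normalisation layers need their corresponding operator norms.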
WoodFisher: Efficient Second-Order Approximation for Neural Network Compression
Second-order information, in the form of Hessian- or Inverse-Hessian-vector
products, is a fundamental tool for solving optimization problems. Recently,
there has been significant interest in utilizing this information in the
context of deep neural networks; however, relatively little is known about the
quality of existing approximations in this context. Our work examines this
question, identifies issues with existing approaches, and proposes a method
called WoodFisher to compute a faithful and efficient estimate of the inverse
Hessian.
Our main application is to neural network compression, where we build on the
classic Optimal Brain Damage/Surgeon framework. We demonstrate that WoodFisher
significantly outperforms popular state-of-the-art methods for one-shot
pruning. Further, even when iterative, gradual pruning is considered, our
method results in a gain in test accuracy over the state-of-the-art approaches,
for pruning popular neural networks (like ResNet-50, MobileNetV1) trained on
standard image classification datasets such as ImageNet ILSVRC. We examine how
our method can be extended to take into account first-order information, as
well as illustrate its ability to automatically set layer-wise pruning
thresholds and perform compression in the limited-data regime. The code is
available at the following link: https://github.com/IST-DASLab/WoodFisher.
Comment: NeurIPS 2020
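For readers wanting the core computation in concrete form, here is a minimal
NumPy sketch (our reading of the abstract, not the released code) of an
inverse estimate of a damped empirical Fisher built with rank-one
Woodbury/Sherman-Morrison updates; the damping parameter `lam` and the
function name are our own.

```python
# Minimal sketch (ours, not the released code): inverse of the damped
# empirical Fisher, F = lam*I + (1/N) * sum_n g_n g_n^T, maintained with
# rank-one Sherman-Morrison (Woodbury) updates, one per-sample gradient at a time.
import numpy as np

def inverse_empirical_fisher(grads, lam=1e-3):
    """grads: array of shape (N, d), one per-sample gradient per row."""
    N, d = grads.shape
    F_inv = np.eye(d) / lam            # inverse of the damping term alone
    for g in grads:
        Fg = F_inv @ g
        F_inv -= np.outer(Fg, Fg) / (N + g @ Fg)
    return F_inv
```

The diagonal of `F_inv` is what an Optimal Brain Damage/Surgeon-style pruning
saliency would then consume.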
Model Fusion via Optimal Transport
Combining different models is a widely used paradigm in machine learning
applications. While the most common approach is to form an ensemble of models
and average their individual predictions, this approach is often rendered
infeasible by resource constraints on memory and computation, which grow
linearly with the number of models. We present a layer-wise model
fusion algorithm for neural networks that utilizes optimal transport to (soft-)
align neurons across the models before averaging their associated parameters.
We show that this can successfully yield "one-shot" knowledge transfer (i.e.,
without requiring any retraining) between neural networks trained on
heterogeneous non-i.i.d. data. In both i.i.d. and non-i.i.d. settings, we
illustrate that our approach significantly outperforms vanilla averaging, as
well as how it can serve as an efficient replacement for the ensemble with
moderate fine-tuning, for standard convolutional networks (like VGG11),
residual networks (like ResNet18), and multi-layer perceptrons on CIFAR10,
CIFAR100, and MNIST. Finally, our approach also provides a principled way to
combine the parameters of neural networks with different widths, and we explore
its application for model compression. The code is available at the following
link: https://github.com/sidak/otfusion.
Comment: NeurIPS 2020 conference proceedings (an early version was featured in
the Optimal Transport & Machine Learning workshop, NeurIPS 2019).
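To make the layer-wise idea concrete, here is a toy NumPy sketch (ours, not
the authors' otfusion code) that soft-aligns the neurons of one fully
connected layer to another with an entropic OT plan and then averages. The
Euclidean ground cost over incoming weights, the uniform neuron masses, and
the tiny Sinkhorn solver are all simplifying assumptions.

```python
# Toy sketch (ours, not the otfusion code): fuse one fully connected layer of
# two models by (soft-)aligning model B's neurons to model A's with an
# entropic OT plan over incoming weights, then averaging in A's coordinates.
import numpy as np

def sinkhorn_plan(cost, reg=0.05, n_iters=300):
    """Entropic OT plan between uniform marginals for the given cost matrix."""
    n, m = cost.shape
    a, b = np.ones(n) / n, np.ones(m) / m
    K = np.exp(-cost / (reg * cost.max() + 1e-12))   # scale cost for stability
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]               # (n, m), rows sum to ~1/n

def fuse_layer(W_a, W_b):
    """W_a, W_b: (out_dim, in_dim) weights of the same layer in two models."""
    cost = ((W_a[:, None, :] - W_b[None, :, :]) ** 2).sum(-1)  # neuron-to-neuron
    T = sinkhorn_plan(cost)
    # Barycentric projection: rewrite B's neurons in A's neuron ordering.
    W_b_aligned = (T @ W_b) / T.sum(axis=1, keepdims=True)
    return 0.5 * (W_a + W_b_aligned)
```

The sketch deliberately shows a single layer; fusing a whole network also
requires carrying the alignment found at one layer into the incoming weights
of the next.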
Context Mover's Distance & Barycenters: Optimal Transport of Contexts for Building Representations
We present a framework for building unsupervised representations of entities
and their compositions, where each entity is viewed as a probability
distribution rather than a vector embedding. In particular, this distribution
is supported over the contexts which co-occur with the entity and are embedded
in a suitable low-dimensional space. This enables us to consider representation
learning from the perspective of Optimal Transport and take advantage of its
tools such as the Wasserstein distance and barycenters. We elaborate on how
the method can be applied to obtain unsupervised representations of text and
illustrate the performance (quantitatively as well as qualitatively) on tasks
such as measuring sentence similarity, word entailment and similarity, where we
empirically observe significant gains (e.g., 4.1% relative improvement over
Sent2vec, GenSen).
The key benefits of the proposed approach include: (a) capturing uncertainty
and polysemy via modeling the entities as distributions, (b) utilizing the
underlying geometry of the particular task (with the ground cost), (c)
simultaneously providing interpretability with the notion of optimal transport
between contexts and (d) easy applicability on top of existing point embedding
methods. The code, as well as prebuilt histograms, are available at
https://github.com/context-mover/.
Comment: AISTATS 2020. Also accepted previously at the ICLR 2019
DeepGenStruct Workshop.
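As a toy illustration of the representation itself (ours, not the
context-mover code): each word becomes a histogram over the embeddings of its
co-occurring context words, and two words are compared with an entropic
Wasserstein distance. The raw co-occurrence counting, the cosine ground cost,
and the Sinkhorn details below are simplifications; the paper's context
weighting is not reproduced here.

```python
# Toy sketch (ours): a word as a distribution over context-word embeddings,
# compared to another word via an entropic Wasserstein distance.
import numpy as np

def context_histogram(word, corpus, vocab_emb, window=2):
    """Histogram over context-word embeddings, built from raw co-occurrences.
    corpus: list of tokenised sentences; vocab_emb: dict word -> embedding."""
    counts = {}
    for sent in corpus:
        for i, w in enumerate(sent):
            if w != word:
                continue
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i and sent[j] in vocab_emb:
                    counts[sent[j]] = counts.get(sent[j], 0) + 1
    ctx = sorted(counts)
    support = np.stack([vocab_emb[c] for c in ctx])          # (k, d) points
    weights = np.array([counts[c] for c in ctx], dtype=float)
    return support, weights / weights.sum()

def sinkhorn_distance(xs, a, ys, b, reg=0.1, n_iters=200):
    """Entropic OT cost between weighted point clouds (xs, a) and (ys, b)."""
    cost = 1.0 - (xs @ ys.T) / (
        np.linalg.norm(xs, axis=1)[:, None] * np.linalg.norm(ys, axis=1)[None, :]
    )                                                         # cosine ground cost
    K = np.exp(-cost / reg)
    v = np.ones(len(b))
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    T = u[:, None] * K * v[None, :]
    return float((T * cost).sum())
```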
Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
This work presents an analysis of the effectiveness of using standard shallow
feed-forward networks to mimic the behavior of the attention mechanism in the
original Transformer model, a state-of-the-art architecture for
sequence-to-sequence tasks. We substitute key elements of the attention
mechanism in the Transformer with simple feed-forward networks, trained using
the original components via knowledge distillation. Our experiments, conducted
on the IWSLT2017 dataset, reveal the capacity of these "attentionless
Transformers" to rival the performance of the original architecture. Through
rigorous ablation studies, and experimenting with various replacement network
types and sizes, we offer insights that support the viability of our approach.
This not only sheds light on the adaptability of shallow feed-forward networks
in emulating attention mechanisms but also underscores their potential to
streamline complex architectures for sequence-to-sequence tasks.
Comment: Accepted at AAAI 2024 (https://aaai.org/aaai-conference/).
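A minimal PyTorch sketch of the distillation setup as we read it from the
abstract: a shallow feed-forward block operating on the flattened sequence is
trained to match the output of a frozen self-attention layer under an MSE
loss. The fixed maximum sequence length, the hidden width, and the use of
random activations in place of real encoder inputs are all our assumptions.

```python
# Minimal sketch (assumptions: fixed max sequence length, MSE distillation
# loss, illustrative widths): train a shallow feed-forward block to imitate
# the output of a frozen self-attention layer.
import torch
import torch.nn as nn

D_MODEL, MAX_LEN = 64, 16              # illustrative sizes, not the paper's

class FeedForwardAttentionReplacement(nn.Module):
    """Maps the flattened sequence to per-token outputs with one hidden layer."""
    def __init__(self, d_model=D_MODEL, max_len=MAX_LEN, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(max_len * d_model, hidden),
            nn.ReLU(),
            nn.Linear(hidden, max_len * d_model),
        )
        self.max_len, self.d_model = max_len, d_model

    def forward(self, x):                  # x: (batch, max_len, d_model)
        b = x.shape[0]
        return self.net(x.reshape(b, -1)).reshape(b, self.max_len, self.d_model)

teacher = nn.MultiheadAttention(D_MODEL, num_heads=4, batch_first=True).eval()
student = FeedForwardAttentionReplacement()
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for _ in range(100):                       # distillation on random activations
    x = torch.randn(32, MAX_LEN, D_MODEL)
    with torch.no_grad():
        target, _ = teacher(x, x, x)       # frozen self-attention output
    loss = nn.functional.mse_loss(student(x), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```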
Efficient second-order methods for model compression
Second-order information, in the form of Hessian- or Inverse-Hessian-vector products, is a fundamental tool for solving optimization problems. Recently, there has been a tremendous amount of work on utilizing this information for the current compute- and memory-intensive deep neural networks, usually via coarse-grained approximations (such as diagonal, blockwise, or Kronecker factorization). However, not much is known about the quality of these approximations. Our work addresses this question, and in particular, we propose a method called "WoodFisher" that leverages the structure of the empirical Fisher information matrix, along with the Woodbury matrix identity, to compute a faithful and efficient estimate of the inverse Hessian.
Our main application is to the task of compressing neural networks, where we build on the classical Optimal Brain Damage/Surgeon framework (LeCun et al., 1990; Hassibi and Stork, 1993). We demonstrate that WoodFisher significantly outperforms magnitude pruning (isotropic Hessian), as well as methods that maintain other diagonal estimates. Further, even when gradual pruning is considered, our method results in a gain in test accuracy over the state-of-the-art approaches for standard image classification datasets such as CIFAR-10 and ImageNet. We also propose a variant called "WoodTaylor", which takes into account the first-order gradient term and can lead to additional improvements. An important advantage of our methods is that they allow us to automatically set the layer-wise pruning thresholds, avoiding the need for any manual tuning or sensitivity analysis.
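To connect this back to the Optimal Brain Damage/Surgeon framework it builds
on, here is a small NumPy sketch (ours) of the classical OBS saliency,
w_q^2 / (2 [H^{-1}]_{qq}), and of how a single global cut-off on saliencies
implicitly yields different per-layer thresholds. Whether this matches the
paper's exact procedure for layer-wise thresholds is our assumption.

```python
# Small sketch (ours) of the classical OBS pruning statistic that an
# inverse-Hessian estimate feeds into. One global saliency cut-off implicitly
# sets a different threshold per layer.
import numpy as np

def obs_saliencies(weights, H_inv_diag):
    """Saliency of removing weight q: w_q**2 / (2 * [H^-1]_qq)."""
    return weights ** 2 / (2.0 * H_inv_diag)

def global_pruning_mask(layer_weights, layer_Hinv_diags, sparsity=0.9):
    """layer_weights / layer_Hinv_diags: lists of flat arrays, one per layer.
    Returns one boolean mask per layer (True = keep the weight)."""
    sal = np.concatenate([obs_saliencies(w, d)
                          for w, d in zip(layer_weights, layer_Hinv_diags)])
    cutoff = np.quantile(sal, sparsity)        # one global threshold ...
    masks, start = [], 0
    for w in layer_weights:
        s = sal[start:start + w.size]
        masks.append(s > cutoff)               # ... different per-layer sparsity
        start += w.size
    return masks
```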
The Hessian perspective into the Nature of Convolutional Neural Networks
While Convolutional Neural Networks (CNNs) have long been investigated,
applied, and theorized, we aim to provide a slightly different perspective
on their nature -- through the lens of their Hessian maps. The reason is that
the loss Hessian captures the pairwise interaction of parameters and
therefore forms a natural ground to probe how the architectural aspects of a
CNN are manifested in its structure and properties. We develop a
framework relying on Toeplitz representation of CNNs, and then utilize it to
reveal the Hessian structure and, in particular, its rank. We prove tight upper
bounds (with linear activations), which closely follow the empirical trend of
the Hessian rank and hold in practice in more general settings. Overall, our
work generalizes and establishes the key insight that, even in CNNs, the
Hessian rank grows as the square root of the number of parameters.
Comment: ICML 2023 conference proceedings.
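A toy NumPy sketch of the Toeplitz view the framework starts from (1-D,
single channel, stride 1, 'valid' padding, all chosen by us for
illustration): a convolutional layer is multiplication by a matrix whose rows
are shifted copies of the kernel, and it is this structured matrix that
enters the rank arguments.

```python
# Toy sketch (ours): a 1-D convolution (cross-correlation, as in CNNs) is
# multiplication by a Toeplitz matrix whose rows are shifted copies of the
# kernel; this structured matrix is the building block of the Toeplitz
# representation of CNNs.
import numpy as np

def conv_toeplitz(kernel, input_len):
    """Toeplitz matrix T such that T @ x == np.correlate(x, kernel, 'valid')."""
    k, out_len = len(kernel), input_len - len(kernel) + 1
    T = np.zeros((out_len, input_len))
    for i in range(out_len):
        T[i, i:i + k] = kernel
    return T

kernel = np.array([1.0, -2.0, 0.5])
x = np.random.randn(10)
T = conv_toeplitz(kernel, len(x))
assert np.allclose(T @ x, np.correlate(x, kernel, mode="valid"))
print("Toeplitz matrix shape:", T.shape, "rank:", np.linalg.matrix_rank(T))
```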