51 research outputs found
Hessian-based Analysis of Large Batch Training and Robustness to Adversaries
Training neural networks with large batch sizes has been shown to incur
accuracy loss with current methods. The exact underlying reasons for this are
still not completely understood. Here, we study large batch size
training through the lens of the Hessian operator and robust optimization. In
particular, we perform a Hessian-based study to analyze exactly how the
landscape of the loss function changes when training with large batch sizes. We
compute the true Hessian spectrum, without approximation, by back-propagating
the second derivative. Extensive experiments on multiple networks show that
saddle points are not the cause of the generalization gap in large batch size
training, and the results consistently show that large batch training converges
to points with a noticeably larger Hessian spectrum. Furthermore, we show that
robust training allows one to favor flat areas, as points with large Hessian
spectrum show poor robustness to adversarial perturbation. We further study
this relationship, and provide empirical and theoretical proof that the inner
loop for robust training is a saddle-free optimization problem \textit{almost
everywhere}. We present detailed experiments with five different network
architectures, including a residual network, tested on MNIST, CIFAR-10, and
CIFAR-100 datasets. We have open sourced our method, which can be accessed at
[1].
Comment: Presented at the NeurIPS 2018 conference.
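The core primitive here, working with the exact Hessian by back-propagating
the second derivative, can be illustrated with a Hessian-vector product plus
power iteration. A minimal sketch, assuming a tiny PyTorch model and synthetic
data rather than the paper's actual networks:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 3))
x, y = torch.randn(64, 10), torch.randint(0, 3, (64,))
loss = torch.nn.functional.cross_entropy(model(x), y)

params = [p for p in model.parameters() if p.requires_grad]
# Keep the graph so the second derivative can be back-propagated.
grads = torch.autograd.grad(loss, params, create_graph=True)

def hvp(vec):
    """Exact Hessian-vector product via a second backward pass."""
    dot = sum((g * v).sum() for g, v in zip(grads, vec))
    return torch.autograd.grad(dot, params, retain_graph=True)

# Power iteration on the Hessian operator to estimate its top eigenvalue.
v = [torch.randn_like(p) for p in params]
for _ in range(50):
    hv = hvp(v)
    norm = torch.sqrt(sum((h * h).sum() for h in hv))
    v = [h / norm for h in hv]
eig = sum((h * u).sum() for h, u in zip(hvp(v), v))  # Rayleigh quotient
print(f"estimated top Hessian eigenvalue: {eig.item():.4f}")
```

Because the Hessian is only accessed through products, this avoids ever
forming the full matrix, which is what makes the exact spectrum tractable.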
On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent
Increasing the mini-batch size for stochastic gradient descent offers
significant opportunities to reduce wall-clock training time, but there are a
variety of theoretical and systems challenges that impede the widespread
success of this technique. We investigate these issues, with an emphasis on
time to convergence and total computational cost, through an extensive
empirical analysis of network training across several architectures and problem
domains, including image classification, image segmentation, and language
modeling. Although it is common practice to increase the batch size in order to
fully exploit available computational resources, we find a substantially more
nuanced picture. Our main finding is that across a wide range of network
architectures and problem domains, increasing the batch size beyond a certain
point yields no decrease in wall-clock time to convergence for \emph{either}
train or test loss. This batch size is usually substantially below the capacity
of current systems. We show that popular training strategies for large batch
size optimization begin to fail before we can populate all available compute
resources, and we show that the point at which these methods break down depends
more on attributes like model architecture and data complexity than it does
directly on the size of the dataset.
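The measurement underlying this finding can be sketched as follows: train the
same model at several batch sizes to a fixed loss target and record steps and
wall-clock time. The toy regression model, the linear learning-rate scaling
rule, and the thresholds below are illustrative assumptions, not the paper's
protocol:

```python
import time
import torch

def time_to_target(batch_size, target_loss=0.05, max_steps=20000):
    torch.manual_seed(0)
    X = torch.randn(4096, 20)
    y = X @ torch.randn(20, 1) + 0.01 * torch.randn(4096, 1)
    model = torch.nn.Linear(20, 1)
    # Linear learning-rate scaling heuristic (an assumption, not the paper's rule).
    opt = torch.optim.SGD(model.parameters(), lr=0.05 * batch_size / 256)
    start = time.perf_counter()
    for step in range(max_steps):
        idx = torch.randint(0, len(X), (batch_size,))
        loss = torch.nn.functional.mse_loss(model(X[idx]), y[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
        if loss.item() < target_loss:
            return step + 1, time.perf_counter() - start
    return max_steps, time.perf_counter() - start

for bs in (32, 128, 512, 2048):
    steps, secs = time_to_target(bs)
    print(f"batch={bs:5d}  steps={steps:6d}  wall-clock={secs:6.2f}s")
```

Past some batch size, the reduction in steps per epoch stops translating into
fewer total steps to the target, which is the diminishing return the paper
documents at scale.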
Trust Region Based Adversarial Attack on Neural Networks
Deep Neural Networks are quite vulnerable to adversarial perturbations.
Current state-of-the-art adversarial attack methods typically require
time-consuming hyper-parameter tuning, or require many iterations to solve an
optimization-based adversarial attack. To address this problem, we present a
new family of trust region based adversarial attacks, with the goal of
computing adversarial perturbations efficiently. We propose several attacks
based on variants of the trust region optimization method. We test the
proposed methods on CIFAR-10 and ImageNet using several different models,
including AlexNet, ResNet-50, VGG-16, and DenseNet-121. Our methods achieve
results comparable to the Carlini-Wagner (CW) attack, but with a significant
speed-up for the VGG-16 model on a Titan Xp GPU. For ResNet-50 on ImageNet, we
can bring the classification accuracy down to less than 0.1\% with only a
small relative perturbation, in a fraction of the time required by the CW
attack. We have open sourced our method, which can be accessed at [1].
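The trust-region idea can be sketched with a simple first-order variant:
linearize the classification margin at the current perturbation, step within
an adaptive L2 trust radius, and grow or shrink the radius depending on how
well the linear model predicted the actual change. This is an illustrative
sketch, not the paper's exact algorithm:

```python
import torch

def tr_attack(model, x, label, radius=0.01, max_iter=100):
    """Find a misclassifying perturbation; `x` has a leading batch dim of 1."""
    def margin(inp):
        logits = model(inp)[0]
        rival = torch.cat([logits[:label], logits[label + 1:]]).max()
        return logits[label] - rival          # < 0 means misclassified

    delta = torch.zeros_like(x)
    for _ in range(max_iter):
        d = delta.clone().requires_grad_(True)
        m = margin(x + d)
        if m.item() < 0:
            break                             # attack succeeded
        grad, = torch.autograd.grad(m, d)
        step = -radius * grad / (grad.norm() + 1e-12)
        predicted = (grad * step).sum()       # first-order model of the change
        with torch.no_grad():
            actual = margin(x + delta + step) - m
            if actual / (predicted - 1e-12) > 0.5:  # linear model is good:
                delta = delta + step                # accept step, grow region
                radius = min(2.0 * radius, 1.0)
            else:                                   # poor agreement: shrink
                radius = 0.5 * radius
    return delta

# Illustrative usage on a toy classifier:
torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(20, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 10))
x0 = torch.randn(1, 20)
adv = tr_attack(net, x0, label=net(x0).argmax().item())
print("perturbation L2 norm:", adv.norm().item())
```

The adaptive radius is what removes the per-input step-size tuning that
fixed-step attacks require.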
The Full Spectrum of Deepnet Hessians at Scale: Dynamics with SGD Training and Sample Size
We apply state-of-the-art tools from modern high-dimensional numerical linear
algebra to efficiently approximate the spectrum of the Hessian of modern
deepnets, with tens of millions of parameters, trained on real data. Our
results corroborate previous findings, based on small-scale networks, that the
Hessian exhibits "spiked" behavior, with several outliers isolated from a
continuous bulk. We decompose the Hessian into different components and study
how each term evolves with training and with sample size.
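The workhorse behind such spectrum estimates is the matrix-free Lanczos
iteration: the Hessian is touched only through matrix-vector products, and the
eigenvalues of a small tridiagonal matrix approximate the extreme Hessian
eigenvalues (with random probe vectors this extends to stochastic Lanczos
quadrature for the full spectral density). A minimal sketch, where a synthetic
"spiked" matrix stands in for a real Hessian-vector product routine:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 30
# Stand-in "Hessian": a continuous bulk plus a few isolated spikes.
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
spectrum = np.concatenate([rng.uniform(-1, 1, n - 3), [25.0, 40.0, 60.0]])
hvp = lambda v: Q @ (spectrum * (Q.T @ v))  # matrix-vector access only

# Lanczos tridiagonalization with full reorthogonalization.
V = np.zeros((n, k))
alphas, betas = np.zeros(k), np.zeros(k - 1)
v = rng.standard_normal(n)
v /= np.linalg.norm(v)
for j in range(k):
    V[:, j] = v
    w = hvp(v)
    alphas[j] = v @ w
    w -= V[:, :j + 1] @ (V[:, :j + 1].T @ w)  # reorthogonalize against basis
    if j < k - 1:
        betas[j] = np.linalg.norm(w)
        v = w / betas[j]
T = np.diag(alphas) + np.diag(betas, 1) + np.diag(betas, -1)
ritz = np.linalg.eigvalsh(T)
print("largest Ritz values (the isolated outliers):", np.round(ritz[-3:], 2))
```

Thirty matrix-vector products recover the three outliers almost exactly,
which is why the approach scales to tens of millions of parameters.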
Numerically Recovering the Critical Points of a Deep Linear Autoencoder
Numerically locating the critical points of non-convex surfaces is a
long-standing problem central to many fields. Recently, the loss surfaces of
deep neural networks have been explored to gain insight into outstanding
questions in optimization, generalization, and network architecture design.
However, the degree to which recently-proposed methods for numerically
recovering critical points actually do so has not been thoroughly evaluated. In
this paper, we examine this issue in a case for which the ground truth is
known: the deep linear autoencoder. We investigate two sub-problems associated
with numerical critical point identification: first, because of large parameter
counts, it is infeasible to find all of the critical points for contemporary
neural networks, necessitating sampling approaches whose characteristics are
poorly understood; second, the numerical tolerance for accurately identifying a
critical point is unknown, and conservative tolerances are difficult to
satisfy. We first identify connections between recently-proposed methods and
well-understood methods in other fields, including chemical physics, economics,
and algebraic geometry. We find that several methods work well at recovering
certain information about loss surfaces, but fail to take an unbiased sample of
critical points. Furthermore, numerical tolerance must be very strict to ensure
that numerically-identified critical points have similar properties to true
analytical critical points. We also identify a recently-published Newton method
for optimization that outperforms previous methods as a critical point-finding
algorithm. We expect our results will guide future attempts to numerically
study critical points in large nonlinear neural networks.
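The Newton-type critical-point finding referenced above can be sketched on
exactly this ground-truth case. Rather than descending the loss, each step
solves H s = -g, which converges to saddles as readily as to minima; the tiny
deep linear autoencoder and the damping constant here are illustrative
assumptions:

```python
import torch

torch.manual_seed(0)
d, h, n = 6, 3, 100
X = torch.randn(d, n)
shapes = [(h, d), (d, h)]                 # encoder W1 and decoder W2
sizes = [r * c for r, c in shapes]

def loss_fn(theta):
    W1, W2 = [w.reshape(s) for w, s in zip(theta.split(sizes), shapes)]
    return ((W2 @ (W1 @ X) - X) ** 2).mean()

theta = 0.1 * torch.randn(sum(sizes))
for _ in range(50):
    g = torch.autograd.functional.jacobian(loss_fn, theta)
    if g.norm() < 1e-9:
        break
    H = torch.autograd.functional.hessian(loss_fn, theta)
    # Damped Newton step toward grad = 0; the damping is a simple way to
    # handle the singular Hessians that arise at these critical points.
    theta = theta + torch.linalg.solve(H + 1e-4 * torch.eye(len(theta)), -g)

evals = torch.linalg.eigvalsh(torch.autograd.functional.hessian(loss_fn, theta))
print(f"|grad| = {torch.autograd.functional.jacobian(loss_fn, theta).norm():.2e}, "
      f"negative Hessian eigenvalues: {(evals < -1e-6).sum().item()}")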
Normalized Flat Minima: Exploring Scale Invariant Definition of Flat Minima for Neural Networks using PAC-Bayesian Analysis
The notion of flat minima has played a key role in the generalization studies
of deep learning models. However, existing definitions of the flatness are
known to be sensitive to the rescaling of parameters. The issue suggests that
the previous definitions of the flatness might not be a good measure of
generalization, because generalization is invariant to such rescalings. In this
paper, from the PAC-Bayesian perspective, we scrutinize the discussion
concerning the flat minima and introduce the notion of normalized flat minima,
which is free from the known scale-dependence issues. Additionally, we
highlight that existing matrix-norm based generalization error bounds suffer
from a scale dependence similar to that of the existing flat minima
definitions. Our modified notion of flatness does not suffer from this
insufficiency either, suggesting it might provide a better hierarchy over the
hypothesis class.
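The rescaling problem being addressed is easy to exhibit numerically: scaling
a ReLU network's layers by (alpha, 1/alpha) leaves the function, and hence
generalization, unchanged, while a naive Hessian-trace flatness measure
changes arbitrarily. A minimal sketch, assuming a tiny two-layer network:

```python
import torch

torch.manual_seed(0)
X, y = torch.randn(32, 5), torch.randn(32, 1)
W1, W2 = torch.randn(8, 5), torch.randn(1, 8)

def loss_and_trace(W1, W2):
    params = (W1.clone().requires_grad_(True), W2.clone().requires_grad_(True))
    out = torch.relu(X @ params[0].T) @ params[1].T
    loss = ((out - y) ** 2).mean()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    trace = 0.0
    # Exact Hessian trace: sum of d^2L/dp_i^2 (feasible only for tiny models).
    for p, g in zip(params, grads):
        for i in range(g.numel()):
            trace += torch.autograd.grad(g.reshape(-1)[i], p,
                                         retain_graph=True)[0].reshape(-1)[i]
    return loss.item(), trace.item()

# (alpha * W1, W2 / alpha) computes the identical ReLU network function,
# yet the naive Hessian-trace "flatness" changes with alpha.
for alpha in (0.1, 1.0, 10.0):
    loss, tr = loss_and_trace(alpha * W1, W2 / alpha)
    print(f"alpha={alpha:5.1f}  loss={loss:.6f}  Hessian trace={tr:12.2f}")
```

The loss column is constant across alphas while the trace column is not,
which is precisely why an unnormalized flatness measure cannot predict
generalization.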
HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision
Model size and inference speed/power have become a major challenge in the
deployment of Neural Networks for many applications. A promising approach to
address these problems is quantization. However, uniformly quantizing a model
to ultra-low precision leads to significant accuracy degradation. A promising
solution is mixed-precision quantization, since some layers of the
network may tolerate lower precision than others. However, there
is no systematic way to determine the precision of different layers. A brute
force approach is not feasible for deep networks, as the search space for
mixed-precision is exponential in the number of layers. Another challenge is a
similar factorial complexity for determining block-wise fine-tuning order when
quantizing the model to a target precision. Here, we introduce Hessian AWare
Quantization (HAWQ), a novel second-order quantization method to address these
problems. HAWQ allows for the automatic selection of the relative quantization
precision of each layer, based on the layer's Hessian spectrum. Moreover, HAWQ
provides a deterministic fine-tuning order for quantizing layers, based on
second-order information. We show the results of our method on CIFAR-10 using
ResNet20, and on ImageNet using Inception-V3, ResNet50, and SqueezeNext.
Comparing HAWQ with the state of the art shows that we can achieve similar or
better accuracy with higher activation compression ratios on ResNet20, as
compared to DNAS~\cite{wu2018mixed}, and higher accuracy with smaller models
on ResNet50 and Inception-V3, compared to the recently proposed
RVQuant~\cite{park2018value} and HAQ~\cite{wang2018haq} methods. Furthermore,
we show that we can quantize SqueezeNext to just a 1MB model size while
maintaining high top-1 accuracy on ImageNet.
Comment: ICCV 2019.
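The Hessian-aware selection idea can be sketched as follows: estimate each
layer's top Hessian eigenvalue with layer-wise power iteration, then give
higher-curvature layers more bits and fine-tune them first. The tiny model and
the (8, 4, 2)-bit tiers below are illustrative assumptions, not HAWQ's exact
assignment rule:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 10))
x, y = torch.randn(128, 16), torch.randint(0, 10, (128,))
loss = torch.nn.functional.cross_entropy(model(x), y)

def top_eig(param, iters=30):
    """Power iteration on one layer's (block-diagonal) Hessian."""
    g, = torch.autograd.grad(loss, param, create_graph=True)
    v = torch.randn_like(param)
    for _ in range(iters):
        hv, = torch.autograd.grad((g * v).sum(), param, retain_graph=True)
        v = hv / (hv.norm() + 1e-12)
    hv, = torch.autograd.grad((g * v).sum(), param, retain_graph=True)
    return (hv * v).sum().item()

# Rank layers by curvature: more sensitive layers keep more bits and are
# fine-tuned first; the bit tiers here are an arbitrary illustration.
layers = [(n, top_eig(p)) for n, p in model.named_parameters() if p.dim() == 2]
for rank, (name, eig) in enumerate(sorted(layers, key=lambda t: -t[1])):
    bits = (8, 4, 2)[min(rank, 2)]
    print(f"fine-tune order {rank}: {name:10s} eigenvalue {eig:8.3f} -> {bits}-bit")
```

Because only per-layer eigenvalues are needed, the exponential search over
per-layer precisions collapses to a single curvature-based ranking.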
Minimum sharpness: Scale-invariant parameter-robustness of neural networks
Toward achieving robust and defensive neural networks, robustness against
perturbations of the weight parameters, i.e., sharpness, has attracted
attention in recent years (Sun et al., 2020). However, sharpness is known to
suffer from a critical issue: scale sensitivity. In this paper, we propose a
novel sharpness measure, Minimum Sharpness. It is known that NNs admit
specific scale transformations that form equivalence classes in which the
functional properties are completely identical, while the sharpness can change
without bound. We define our sharpness through a minimization problem over the
NNs equivalent under these scale transformations, which makes it invariant to
them. We also develop an efficient and exact technique to make the sharpness
tractable, which reduces the heavy computational cost involved with the
Hessian. In our experiments, we observed that our sharpness correlates well
with the generalization of NNs and runs at a lower computational cost than
existing sharpness measures.
Comment: 9 pages; accepted to the ICML 2021 Workshop on Theoretic Foundation,
Criticism, and Application Trend of Explainable AI.
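The minimization idea admits a closed form in the simplest case. For a
two-layer ReLU network under the rescaling (alpha*W1, W2/alpha), a trace-based
sharpness takes the form a/alpha^2 + b*alpha^2, whose minimum 2*sqrt(a*b) is
scale-invariant and attained at alpha* = (a/b)**0.25. A minimal worked
example, assuming this two-block structure rather than the paper's general
construction:

```python
import math

# Hessian-trace contributions of the two layer blocks at alpha = 1
# (illustrative values, not measured from a real network).
a, b = 120.0, 3.0
alpha_star = (a / b) ** 0.25          # minimizer of a/alpha^2 + b*alpha^2
min_sharpness = 2 * math.sqrt(a * b)  # scale-invariant minimum value
for alpha in (0.5, 1.0, alpha_star, 5.0):
    print(f"alpha={alpha:6.3f}  sharpness={a / alpha**2 + b * alpha**2:9.3f}")
print(f"minimum {min_sharpness:.3f} at alpha*={alpha_star:.3f}; rescaling the "
      f"network maps (a, b) to (a/c**2, b*c**2), leaving a*b unchanged")
```

Taking the minimum over the equivalence class is what removes the dependence
on the arbitrary scale c while keeping the measure tied to the network's
actual curvature.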
Understanding Impacts of High-Order Loss Approximations and Features in Deep Learning Interpretation
Current methods to interpret deep learning models by generating saliency maps
generally rely on two key assumptions. First, they use first-order
approximations of the loss function neglecting higher-order terms such as the
loss curvatures. Second, they evaluate each feature's importance in isolation,
ignoring their inter-dependencies. In this work, we study the effect of
relaxing these two assumptions. First, by characterizing a closed-form formula
for the Hessian matrix of a deep ReLU network, we prove that, for a
classification problem with a large number of classes, if an input has a
high-confidence classification score, the inclusion of the Hessian term has a
small impact on the final solution. We prove this result by showing that in this
case the Hessian matrix is approximately of rank one and its leading
eigenvector is almost parallel to the gradient of the loss function. Our
empirical experiments on ImageNet samples are consistent with our theory. This
result can have implications in other related problems such as adversarial
examples as well. Second, we compute the importance of group-features in deep
learning interpretation by introducing a sparsity regularization term. We use
a relaxation technique along with proximal gradient descent to compute group
feature importance scores efficiently. Our empirical results indicate that
considering group features can improve deep learning interpretation
significantly.
Comment: Proceedings of the 36th International Conference on Machine Learning,
2019.
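The rank-one claim is easy to probe numerically: for a confidently classified
input, the Hessian of the loss with respect to the input should have one
dominant eigenvalue whose eigenvector is nearly parallel to the gradient. A
minimal check, assuming a small untrained ReLU net with artificially sharpened
logits rather than the paper's ImageNet setting:

```python
import torch

torch.manual_seed(0)
num_classes = 100                      # the "large number of classes" regime
model = torch.nn.Sequential(torch.nn.Linear(50, 128), torch.nn.ReLU(),
                            torch.nn.Linear(128, num_classes))
x = torch.randn(50)
label = model(x).argmax()
# Scale the logits so the input is classified with high confidence.
loss_fn = lambda inp: torch.nn.functional.cross_entropy(
    (10.0 * model(inp)).unsqueeze(0), label.unsqueeze(0))

g = torch.autograd.functional.jacobian(loss_fn, x)   # input gradient
H = torch.autograd.functional.hessian(loss_fn, x)    # input Hessian
evals, evecs = torch.linalg.eigh(H)
cos = torch.abs(torch.dot(evecs[:, -1], g / g.norm()))
print(f"top eigenvalue {evals[-1].item():.3e} vs. "
      f"runner-up {evals[-2].item():.3e}")
print(f"|cos(top eigenvector, gradient)| = {cos.item():.4f}")
```

A large gap between the top two eigenvalues together with a cosine near 1 is
the signature of the approximately rank-one structure the paper proves.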
How do SGD hyperparameters in natural training affect adversarial robustness?
Learning rate, batch size and momentum are three important hyperparameters in
the SGD algorithm. It is known from the work of Jastrzebski et al.
arXiv:1711.04623 that large batch size training of neural networks yields
models which do not generalize well. Yao et al. arXiv:1802.08241 observe that
large batch training yields models that have poor adversarial robustness. In
the same paper, the authors train models with different batch sizes and compute
the eigenvalues of the Hessian of loss function. They observe that as the batch
size increases, the dominant eigenvalues of the Hessian become larger. They
also show that both adversarial training and small-batch training lead to a
drop in the dominant eigenvalues of the Hessian, i.e., a lowering of its
spectrum. They
combine adversarial training and second order information to come up with a new
large-batch training algorithm and obtain robust models with good
generalization. In this paper, we empirically observe the effect of the SGD
hyperparameters on the accuracy and adversarial robustness of networks trained
with unperturbed samples. Jastrzebski et al. considered training models with a
fixed learning rate to batch size ratio. They observed that the higher the
ratio, the better the generalization. We observe that networks trained with a
constant
learning rate to batch size ratio, as proposed in Jastrzebski et al., yield
models which generalize well and also have almost constant adversarial
robustness, independent of the batch size. We observe that momentum is more
effective with varying batch sizes and a fixed learning rate than with constant
learning rate to batch size ratio based SGD training.
Comment: Preliminary version presented in the ICML 2019 Workshop on
"Understanding and Improving Generalization in Deep Learning" as "On
Adversarial Robustness of Small vs Large Batch Training".
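The experimental protocol described above reduces to a simple loop: train the
same model at several batch sizes while holding the learning-rate-to-batch-size
ratio fixed, then compare the resulting accuracies (and, in the paper,
adversarial robustness) across runs. A minimal sketch, where the toy
classification task and the ratio value are illustrative assumptions:

```python
import torch

def train(batch_size, ratio=5e-4, epochs=20):
    torch.manual_seed(0)
    X = torch.randn(2048, 20)
    y = (X @ torch.randn(20) > 0).long()      # linearly separable labels
    model = torch.nn.Linear(20, 2)
    opt = torch.optim.SGD(model.parameters(),
                          lr=ratio * batch_size,  # fixed lr / batch-size ratio
                          momentum=0.9)
    for _ in range(epochs):
        perm = torch.randperm(len(X))
        for i in range(0, len(X), batch_size):
            idx = perm[i:i + batch_size]
            loss = torch.nn.functional.cross_entropy(model(X[idx]), y[idx])
            opt.zero_grad()
            loss.backward()
            opt.step()
    return (model(X).argmax(1) == y).float().mean().item()

for bs in (32, 128, 512):
    print(f"batch={bs:4d}  lr={5e-4 * bs:.3f}  train acc={train(bs):.3f}")
```

Holding the ratio fixed keeps the scale of the SGD gradient noise roughly
constant across batch sizes, which is the mechanism behind the near-constant
generalization and robustness reported above.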