Mirror Descent View for Neural Network Quantization
Quantizing large Neural Networks (NN) while maintaining performance is
highly desirable for resource-limited devices due to reduced memory and time
complexity. It is usually formulated as a constrained optimization problem and
optimized via a modified version of gradient descent. In this work, by
interpreting the continuous parameters (unconstrained) as the dual of the
quantized ones, we introduce a Mirror Descent (MD) framework for NN
quantization. Specifically, we provide conditions on the projections (i.e.,
mapping from continuous to quantized ones) which would enable us to derive
valid mirror maps and in turn the respective MD updates. Furthermore, we
present a numerically stable implementation of MD that requires storing an
additional set of auxiliary variables (unconstrained), and show that it is
strikingly analogous to the Straight Through Estimator (STE) based method which
is typically viewed as a "trick" to avoid the vanishing-gradient issue. Our
experiments on CIFAR-10/100, TinyImageNet, and ImageNet classification datasets
with VGG-16, ResNet-18, and MobileNetV2 architectures show that our MD variants
obtain quantized networks with state-of-the-art performance. Code is available
at https://github.com/kartikgupta-at-anu/md-bnn.
Comment: This paper was accepted at AISTATS 202
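The duality described above can be sketched in a few lines: gradients are applied to unconstrained auxiliary (dual) variables, and a smooth mirror map recovers the primal weights in the quantized range. A minimal NumPy sketch for the binary case, assuming a tanh mirror map and a hypothetical sharpness parameter `beta` (illustrative choices, not taken from the paper):

```python
import numpy as np

def md_binary_step(aux, grad, lr=0.1, beta=5.0):
    """One mirror-descent step for binary quantization (illustrative).

    aux  : unconstrained auxiliary (dual) variables
    grad : gradient of the loss w.r.t. the primal weights
    beta : hypothetical sharpness of the tanh mirror map
    """
    aux = aux - lr * grad           # gradient step in the dual space
    weights = np.tanh(beta * aux)   # mirror map back into (-1, 1)
    return aux, weights

aux = np.zeros(4)
aux, w = md_binary_step(aux, np.array([1.0, -1.0, 0.5, -0.5]))
```

Storing `aux` alongside `weights` is what makes the update analogous to STE-based training, where a latent real-valued copy of the weights is kept.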
Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach
Modern neural networks are highly overparameterized, with capacity to
substantially overfit to training data. Nevertheless, these networks often
generalize well in practice. It has also been observed that trained networks
can often be "compressed" to much smaller representations. The purpose of this
paper is to connect these two empirical observations. Our main technical result
is a generalization bound for compressed networks based on the compressed size.
Combined with off-the-shelf compression algorithms, the bound leads to
state-of-the-art generalization guarantees; in particular, we provide the first
non-vacuous generalization guarantees for realistic architectures applied to
the ImageNet classification problem. As additional evidence connecting
compression and generalization, we show that compressibility of models that
tend to overfit is limited: We establish an absolute limit on expected
compressibility as a function of expected generalization error, where the
expectations are over the random choice of training examples. The bounds are
complemented by empirical results that show an increase in overfitting implies
an increase in the number of bits required to describe a trained network.
Comment: 16 pages, 1 figure. Accepted at ICLR 201
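The link between description length and generalization can be illustrated with a crude Occam-style bound over a finite hypothesis class (far simpler than the paper's PAC-Bayesian result, and not its actual formula): a model describable in k bits has, with probability 1 − δ, a generalization gap of roughly sqrt((k ln 2 + ln(1/δ)) / 2n).

```python
import math

def occam_bound(compressed_bits, n, delta=0.05):
    """Crude finite-hypothesis-class bound on the generalization gap of a
    model describable in `compressed_bits` bits from `n` training examples.
    Illustrative only; the paper's PAC-Bayesian bound is much tighter and
    is what makes ImageNet-scale guarantees non-vacuous."""
    return math.sqrt((compressed_bits * math.log(2)
                      + math.log(1.0 / delta)) / (2.0 * n))

# Smaller compressed size => smaller bound, at fixed dataset size.
small = occam_bound(compressed_bits=10_000, n=1_200_000)
large = occam_bound(compressed_bits=1_000_000, n=1_200_000)
```

Even this toy bound captures the qualitative message: compressibility controls the guarantee, so models that cannot be compressed cannot get a small bound this way.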
AskewSGD: An Annealed interval-constrained Optimisation method to train Quantized Neural Networks
In this paper, we develop a new algorithm, Annealed Skewed SGD - AskewSGD -
for training deep neural networks (DNNs) with quantized weights. First, we
formulate the training of quantized neural networks (QNNs) as a smoothed
sequence of interval-constrained optimization problems. Then, we propose a new
first-order stochastic method, AskewSGD, to solve each constrained optimization
subproblem. Unlike algorithms with active sets and feasible directions,
AskewSGD avoids projections or optimization under the entire feasible set and
allows iterates that are infeasible. The numerical complexity of AskewSGD is
comparable to existing approaches for training QNNs, such as the
straight-through gradient estimator used in BinaryConnect, or other
state-of-the-art methods (ProxQuant, LUQ). We establish convergence guarantees for
AskewSGD (under general assumptions for the objective function). Experimental
results show that the AskewSGD algorithm performs better than or on par with
state-of-the-art methods on classical benchmarks.
Towards Efficient and Reliable Deep Neural Networks
Deep neural networks have achieved state-of-the-art performance for various machine learning tasks in different domains such as computer vision, natural language processing, bioinformatics, speech processing, etc. Despite this success, their excessive computational and memory requirements limit their practical usability for real-time applications or in resource-limited devices. Neural network quantization has become increasingly popular due to efficient memory consumption and faster computation resulting from bit-wise operations on the quantized networks, where the objective is to learn a network while restricting the parameters (and activations) to take values from a small discrete set. Another important aspect of modern neural networks is the adversarial vulnerability and reliability of their predictions. In addition to obtaining accurate predictions, it is also critical to accurately quantify the predictive uncertainty of deep neural networks in many real-world decision-making applications. Calibrating neural networks is of utmost importance when employing them in safety-critical applications where the downstream decision-making depends on the predicted probabilities. Further to this, modern machine vision algorithms have also been shown to be extremely susceptible to small and almost imperceptible perturbations of their inputs. To this end, we tackle these fundamental challenges in modern neural networks, focusing on the efficiency and reliability of neural networks.
Neural network quantization is usually formulated as a constrained optimization problem and optimized via a modified version of gradient descent. First, by interpreting the continuous (unconstrained) parameters as the dual of the quantized ones, we introduce a Mirror Descent (MD) framework for NN quantization. Specifically, we provide conditions on the projections (i.e., mappings from continuous to quantized values) which enable us to derive valid mirror maps and in turn the respective MD updates. Furthermore, we present a numerically stable implementation of MD that requires storing an additional set of unconstrained auxiliary variables, and show that it is strikingly analogous to the STE-based method, which is typically viewed as a "trick" to avoid the vanishing-gradient issue. Our experiments on multiple computer vision classification datasets with multiple network architectures demonstrate that our MD variants yield state-of-the-art performance.
Even though quantized networks exhibit excellent generalization capabilities, their robustness properties are not well understood. Next, we therefore systematically study the robustness of quantized networks against gradient-based adversarial attacks and demonstrate that these quantized models suffer from vanishing gradients and exhibit a false sense of robustness. Attributing the vanishing gradients to poor forward-backward signal propagation in the trained network, we introduce a simple temperature-scaling approach to mitigate the issue while preserving the decision boundary. Experiments on multiple image classification datasets with multiple network architectures demonstrate that our temperature-scaled attacks obtain a near-perfect success rate on quantized networks.
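The vanishing-gradient effect described above is easy to reproduce at the level of a single saturated logit vector. A hedged sketch (not the exact attack procedure from the thesis): the cross-entropy gradient with respect to temperature-scaled logits is (softmax(z/T) − onehot)/T, which is numerically zero for a saturated network at T = 1 but recovers usable attack signal at a larger T.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())     # shift for numerical stability
    return e / e.sum()

def ce_grad_wrt_logits(logits, label, temp=1.0):
    """Gradient of cross-entropy w.r.t. temperature-scaled logits."""
    p = softmax(logits / temp)
    onehot = np.zeros_like(p)
    onehot[label] = 1.0
    return (p - onehot) / temp

saturated = np.array([60.0, -60.0])   # overconfident quantized network
g_plain = ce_grad_wrt_logits(saturated, label=0, temp=1.0)
g_scaled = ce_grad_wrt_logits(saturated, label=0, temp=60.0)
```

With `temp=1.0` the softmax output is numerically one-hot and the gradient underflows to essentially zero; dividing the logits by a suitable temperature flattens the softmax and restores a non-trivial gradient for the attack.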
Finally, we introduce a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test, in which the main idea is to compare the respective cumulative probability distributions. By approximating the empirical cumulative distribution with a differentiable spline, we obtain a recalibration function that maps the network outputs to actual (calibrated) class-assignment probabilities. We tested our method against existing calibration approaches on various image classification datasets, and our spline-based recalibration consistently outperforms existing methods on KS error as well as other commonly used calibration measures.
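The binning-free measure can be sketched as follows (a simplified top-label version with hypothetical inputs; the thesis additionally fits splines to obtain the recalibration map): sort samples by confidence and take the maximum gap between the cumulative predicted probability and the cumulative empirical accuracy.

```python
import numpy as np

def ks_calibration_error(confidences, correct):
    """Binning-free KS-style calibration error: the maximum gap between
    cumulative predicted probability and cumulative accuracy, with
    samples sorted by confidence. No histogram bins are needed."""
    order = np.argsort(confidences)
    conf = np.asarray(confidences, dtype=float)[order]
    acc = np.asarray(correct, dtype=float)[order]
    return float(np.max(np.abs(np.cumsum(conf) - np.cumsum(acc))) / conf.size)

# Overconfident and always wrong: large error.
overconfident = ks_calibration_error([0.9, 0.9], [0, 0])
# Fully confident and always right: zero error.
well_matched = ks_calibration_error([1.0, 1.0], [1, 1])
```

Because the comparison is between cumulative distributions rather than binned averages, the measure avoids the bin-sensitivity of the usual expected calibration error.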
Why and When Can Deep -- but Not Shallow -- Networks Avoid the Curse of Dimensionality: a Review
The paper characterizes classes of functions for which deep learning can be
exponentially better than shallow learning. Deep convolutional networks are a
special case of these conditions, though weight sharing is not the main reason
for their exponential advantage.