
    Mirror Descent View for Neural Network Quantization

    Quantizing large Neural Networks (NNs) while maintaining performance is highly desirable for resource-limited devices due to reduced memory and time complexity. Quantization is usually formulated as a constrained optimization problem and optimized via a modified version of gradient descent. In this work, by interpreting the continuous (unconstrained) parameters as the dual of the quantized ones, we introduce a Mirror Descent (MD) framework for NN quantization. Specifically, we provide conditions on the projections (i.e., mappings from continuous to quantized parameters) that enable us to derive valid mirror maps and, in turn, the respective MD updates. Furthermore, we present a numerically stable implementation of MD that requires storing an additional set of auxiliary (unconstrained) variables, and show that it is strikingly analogous to the Straight Through Estimator (STE) based method, which is typically viewed as a "trick" to avoid the vanishing-gradient issue. Our experiments on the CIFAR-10/100, TinyImageNet, and ImageNet classification datasets with VGG-16, ResNet-18, and MobileNetV2 architectures show that our MD variants obtain quantized networks with state-of-the-art performance. Code is available at https://github.com/kartikgupta-at-anu/md-bnn.
    Comment: This paper was accepted at AISTATS 202
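    As a rough illustration of the STE-like implementation described above (the tanh projection, the toy quadratic loss, and all variable names are assumptions made for this sketch, not the paper's exact construction), the update below keeps an unconstrained auxiliary variable, maps it through a projection in the forward pass, and applies the gradient taken with respect to the projected weights directly to the auxiliary variable:

import numpy as np

def project(aux):
    # Map unconstrained auxiliary variables into (-1, 1); in the MD view this
    # plays the role of the inverse mirror map (an assumed choice here).
    return np.tanh(aux)

def md_step(aux, grad_w, lr=0.1):
    # Numerically stable variant: update the auxiliary (dual) variables with
    # the gradient taken w.r.t. the projected weights, which is exactly the
    # latent-weight update used in STE-based training.
    return aux - lr * grad_w

# Toy problem: drive the projected weights towards a binary target.
aux = np.zeros(4)                         # auxiliary (unconstrained) variables
target = np.array([1.0, -1.0, 1.0, -1.0])
for _ in range(200):
    w = project(aux)                      # forward pass uses constrained weights
    grad_w = w - target                   # dL/dw for L = 0.5 * ||w - target||^2
    aux = md_step(aux, grad_w)
print(np.sign(project(aux)))              # hard quantization after training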

    Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach

    Modern neural networks are highly overparameterized, with capacity to substantially overfit to training data. Nevertheless, these networks often generalize well in practice. It has also been observed that trained networks can often be "compressed" to much smaller representations. The purpose of this paper is to connect these two empirical observations. Our main technical result is a generalization bound for compressed networks based on the compressed size. Combined with off-the-shelf compression algorithms, the bound leads to state of the art generalization guarantees; in particular, we provide the first non-vacuous generalization guarantees for realistic architectures applied to the ImageNet classification problem. As additional evidence connecting compression and generalization, we show that compressibility of models that tend to overfit is limited: We establish an absolute limit on expected compressibility as a function of expected generalization error, where the expectations are over the random choice of training examples. The bounds are complemented by empirical results that show an increase in overfitting implies an increase in the number of bits required to describe a trained network.Comment: 16 pages, 1 figure. Accepted at ICLR 201
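    As background on why a small compressed size yields a generalization guarantee, the classical Occam-style compression bound (a Hoeffding inequality combined with a union bound over all descriptions of a given bit length) is a useful reference point; it is weaker than the PAC-Bayesian bound developed in the paper, but has the same qualitative dependence on the description length. For a hypothesis $h$ describable with $|h|$ bits and $m$ i.i.d. training examples, with probability at least $1-\delta$,

\[
  R(h) \;\le\; \widehat{R}(h) \,+\, \sqrt{\frac{|h|\ln 2 + \ln(1/\delta)}{2m}},
\]

    where $R$ is the true risk and $\widehat{R}$ the empirical risk; when the $\ln(1/\delta)$ term is negligible, halving the compressed size shrinks the slack roughly by a factor of $\sqrt{2}$.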

    AskewSGD: An Annealed interval-constrained Optimisation method to train Quantized Neural Networks

    In this paper, we develop a new algorithm, Annealed Skewed SGD (AskewSGD), for training deep neural networks (DNNs) with quantized weights. First, we formulate the training of quantized neural networks (QNNs) as a smoothed sequence of interval-constrained optimization problems. Then, we propose a new first-order stochastic method, AskewSGD, to solve each constrained optimization subproblem. Unlike algorithms with active sets and feasible directions, AskewSGD avoids projections or optimization over the entire feasible set and allows iterates that are infeasible. The numerical complexity of AskewSGD is comparable to existing approaches for training QNNs, such as the straight-through gradient estimator used in BinaryConnect, or other state-of-the-art methods (ProxQuant, LUQ). We establish convergence guarantees for AskewSGD (under general assumptions on the objective function). Experimental results show that the AskewSGD algorithm performs better than or on par with state-of-the-art methods on classical benchmarks.
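    To make the idea of a "smoothed sequence of interval-constrained problems" concrete, the toy construction below (my own illustration; the interval placement, the schedule, and all names are assumptions rather than the AskewSGD formulation itself) anneals interval constraints so that the feasible set shrinks onto a binary quantization grid, with intermediate iterates allowed to become infeasible:

import numpy as np

LEVELS = np.array([-1.0, 1.0])            # target quantization grid Q

def feasible(w, eps):
    # Stage-t interval constraint: every weight must lie within eps of some
    # level in Q, i.e. in the union of intervals [q - eps, q + eps], q in Q.
    dist = np.min(np.abs(w[:, None] - LEVELS[None, :]), axis=1)
    return bool(np.all(dist <= eps))

w = np.array([0.8, -0.3, 1.05])           # an intermediate iterate
for t, eps in enumerate([1.0, 0.5, 0.1, 0.01]):   # annealed interval widths
    print(f"stage {t}: eps={eps:<4} feasible={feasible(w, eps)}")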

    Towards Efficient and Reliable Deep Neural Networks

    Deep neural networks have achieved state-of-the-art performance on various machine learning tasks in domains such as computer vision, natural language processing, bioinformatics, and speech processing. Despite this success, their excessive computational and memory requirements limit their practical usability for real-time applications or on resource-limited devices. Neural network quantization has become increasingly popular due to the efficient memory consumption and faster computation that result from bit-wise operations on the quantized networks; the objective is to learn a network while restricting the parameters (and activations) to take values from a small discrete set. Another important aspect of modern neural networks is their adversarial vulnerability and the reliability of their predictions. In addition to obtaining accurate predictions, it is critical to accurately quantify the predictive uncertainty of deep neural networks in many real-world decision-making applications. Calibrating neural networks is of utmost importance when employing them in safety-critical applications where the downstream decision-making depends on the predicted probabilities. Furthermore, modern machine vision algorithms have been shown to be extremely susceptible to small and almost imperceptible perturbations of their inputs. To this end, we tackle these fundamental challenges in modern neural networks, focussing on their efficiency and reliability.

    Neural network quantization is usually formulated as a constrained optimization problem and optimized via a modified version of gradient descent. First, by interpreting the continuous (unconstrained) parameters as the dual of the quantized ones, we introduce a Mirror Descent (MD) framework for NN quantization. Specifically, we provide conditions on the projections (i.e., mappings from continuous to quantized parameters) that enable us to derive valid mirror maps and, in turn, the respective MD updates. Furthermore, we present a numerically stable implementation of MD that requires storing an additional set of auxiliary (unconstrained) variables, and show that it is strikingly analogous to the STE-based method, which is typically viewed as a "trick" to avoid the vanishing-gradient issue. Our experiments on multiple computer vision classification datasets with multiple network architectures demonstrate that our MD variants yield state-of-the-art performance.

    Even though quantized networks exhibit excellent generalization capabilities, their robustness properties are not well understood. Next, we therefore systematically study the robustness of quantized networks against gradient-based adversarial attacks and demonstrate that these quantized models suffer from vanishing gradients and exhibit a false sense of robustness. Attributing the vanishing gradients to poor forward-backward signal propagation in the trained network, we introduce a simple temperature scaling approach to mitigate this issue while preserving the decision boundary. Experiments on multiple image classification datasets with multiple network architectures demonstrate that our temperature-scaled attacks obtain a near-perfect success rate on quantized networks.

    Finally, we introduce a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test, in which the main idea is to compare the respective cumulative probability distributions. By approximating the empirical cumulative distribution with a differentiable spline function, we obtain a recalibration function that maps the network outputs to actual (calibrated) class-assignment probabilities. We tested our method against existing calibration approaches on various image classification datasets; our spline-based recalibration approach consistently outperforms existing methods on the KS error as well as other commonly used calibration measures.
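    A minimal sketch of a binning-free, KS-style calibration error for the top-label case (the function and variable names are mine, and the spline-based recalibration map itself is not reproduced here): sort samples by predicted confidence, form the two empirical cumulative distributions, and take the maximum absolute gap, in the spirit of the KS statistic.

import numpy as np

def ks_calibration_error(confidences, correct):
    # Sort by predicted confidence, then compare the empirical cumulative
    # distributions of predicted probability and of actual correctness.
    order = np.argsort(confidences)
    z = np.asarray(confidences, dtype=float)[order]
    y = np.asarray(correct, dtype=float)[order]
    n = len(z)
    return np.max(np.abs(np.cumsum(z - y) / n))

# Toy usage: an overconfident classifier yields a large KS error.
conf = np.array([0.9, 0.95, 0.99, 0.85])
hit  = np.array([1,   0,    1,    0  ])
print(ks_calibration_error(conf, hit))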

    Why and When Can Deep -- but Not Shallow -- Networks Avoid the Curse of Dimensionality: a Review

    The paper characterizes classes of functions for which deep learning can be exponentially better than shallow learning. Deep convolutional networks are a special case satisfying these conditions, although weight sharing is not the main reason for their exponential advantage.
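    A standard example of the function class in question (the notation here is illustrative) is a hierarchically compositional function on a binary tree, e.g.

\[
  f(x_1,\dots,x_8) \;=\; h_3\bigl(h_{21}\bigl(h_{11}(x_1,x_2),\, h_{12}(x_3,x_4)\bigr),\; h_{22}\bigl(h_{13}(x_5,x_6),\, h_{14}(x_7,x_8)\bigr)\bigr),
\]

    where every constituent function depends on only two variables; a deep network whose graph matches this hierarchy can approximate $f$ with complexity governed by the constituent dimensionality (here 2) rather than the full input dimension, which is the source of the exponential gap over shallow networks.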