11 research outputs found
"Lossless" Compression of Deep Neural Networks: A High-dimensional Neural Tangent Kernel Approach
Modern deep neural networks (DNNs) are extremely powerful; however, this
comes at the price of increased depth and having more parameters per layer,
making their training and inference more computationally challenging. In an
attempt to address this key limitation, efforts have been devoted to the
compression (e.g., sparsification and/or quantization) of these large-scale
machine learning models, so that they can be deployed on low-power IoT devices.
In this paper, building upon recent advances in neural tangent kernel (NTK) and
random matrix theory (RMT), we provide a novel compression approach to wide and
fully-connected \emph{deep} neural nets. Specifically, we demonstrate that in
the high-dimensional regime where the number of data points and their
dimension are both large, and under a Gaussian mixture model for the data,
there exists \emph{asymptotic spectral equivalence} between the NTK matrices
for a large family of DNN models. This theoretical result enables "lossless"
compression of a given DNN to be performed, in the sense that the compressed
network yields asymptotically the same NTK as the original (dense and
unquantized) network, with its weights and activations taking values
\emph{only} in $\{0, \pm 1\}$ up to a scaling. Experiments on both synthetic
and real-world data are conducted to support the advantages of the proposed
compression scheme, with code available at
\url{https://github.com/Model-Compression/Lossless_Compression}.
Comment: 32 pages, 4 figures, and 2 tables. Fixes typos in Theorems 1 and 2 of the NeurIPS 2022 proceedings version
(https://proceedings.neurips.cc/paper_files/paper/2022/hash/185087ea328b4f03ea8fd0c8aa96f747-Abstract-Conference.html)
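As a loose illustration of the quantization side of such a scheme (not the authors' code; the threshold heuristic and the scaling rule below are illustrative assumptions), weights can be mapped to values in {-1, 0, +1} up to a single scalar per layer:

```python
import numpy as np

def ternarize(W, delta_factor=0.7):
    """Quantize a weight matrix to values in {-1, 0, +1} times a scalar.

    `delta_factor` is an illustrative sparsification threshold, not taken
    from the paper; the scale `alpha` minimizes the squared error over
    the surviving entries.
    """
    delta = delta_factor * np.mean(np.abs(W))   # entries below delta are zeroed
    T = np.sign(W) * (np.abs(W) > delta)        # ternary pattern in {-1, 0, 1}
    mask = T != 0
    alpha = np.abs(W[mask]).mean() if mask.any() else 0.0
    return alpha, T

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
alpha, T = ternarize(W)
print(sorted(set(T.ravel().tolist())))  # distinct entries lie in {-1.0, 0.0, 1.0}
```

The paper's point is stronger than this sketch: under its high-dimensional assumptions, such a ternarized network provably has asymptotically the same NTK as the dense one.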
Identifying Critical Neurons in ANN Architectures using Mixed Integer Programming
We introduce a mixed integer program (MIP) for assigning importance scores to
each neuron in deep neural network architectures which is guided by the impact
of their simultaneous pruning on the main learning task of the network. By
carefully devising the objective function of the MIP, we drive the solver to
minimize the number of critical neurons (i.e., with high importance score) that
need to be kept for maintaining the overall accuracy of the trained neural
network. Further, the proposed formulation generalizes the recently considered
lottery ticket optimization by identifying multiple "lucky" sub-networks
resulting in an optimized architecture that not only performs well on a single
dataset, but also generalizes across multiple ones upon retraining of network
weights. Finally, we present a scalable implementation of our method by
decoupling the importance scores across layers using auxiliary networks. We
demonstrate the ability of our formulation to prune neural networks with
marginal loss in accuracy and generalizability on popular datasets and
architectures.
Comment: 16 pages, 3 figures, 5 tables, under review
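The selection problem behind this formulation can be illustrated with a toy brute-force stand-in for the MIP (the numbers, the tiny "network", and the exhaustive search are all illustrative assumptions; the paper uses a real MIP solver on actual architectures):

```python
from itertools import product

# Toy setup: 3 "neurons" vote on 4 examples; a neuron's contribution to
# each example's score is listed below (made-up numbers). The prediction
# is positive iff the summed score is > 0.
contrib = [
    [2.0, -1.0, 1.5, -0.5],   # neuron 0
    [0.1,  0.1, 0.1,  0.1],   # neuron 1 (nearly redundant)
    [-0.5, 1.5, 0.2, -0.3],   # neuron 2
]
labels = [1, 1, 1, -1]

def accuracy(keep):
    correct = 0
    for j in range(4):
        score = sum(contrib[i][j] for i in range(3) if keep[i])
        correct += int((score > 0) == (labels[j] == 1))
    return correct / 4

full_acc = accuracy((1, 1, 1))
# Brute-force analogue of the MIP objective: minimize the number of kept
# ("critical") neurons subject to matching the full network's accuracy.
best = min(
    (k for k in product([0, 1], repeat=3) if accuracy(k) >= full_acc),
    key=sum,
)
print(best, accuracy(best))  # neuron 1 is pruned without losing accuracy
```

A MIP solver replaces this exponential enumeration with branch-and-bound over the same binary keep/prune variables, which is what makes the approach scale beyond toy sizes.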
Optimization of Sparsity-Constrained Neural Networks as a Mixed Integer Linear Program: NN2MILP
The literature has shown how to optimize and analyze the parameters of different types of neural networks using mixed integer linear programs (MILP). Building on these developments, this work presents an approach to do so for McCulloch-Pitts and Rosenblatt neurons. Because the original formulation involves a step function, it is not differentiable; nevertheless, the parameters of such neurons, and of their concatenation into a shallow neural network, can be optimized with a mixed integer linear program. The main contribution of this paper is to additionally enforce sparsity constraints on the weights and activations, as well as on the number of neurons used. Several experiments demonstrate that such constraints effectively prevent overfitting in neural networks and ensure resource-optimized models.
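A minimal sketch of the discrete problem being solved (a brute-force enumeration standing in for the MILP; the target function, weight range, and tie-breaking are illustrative assumptions):

```python
from itertools import product

# Target: the Boolean AND of two inputs, as a McCulloch-Pitts neuron
#   y = step(w1*x1 + w2*x2 - b), where step(t) = 1 if t >= 0 else 0.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def fits(w, b):
    return all(int(w[0]*x1 + w[1]*x2 - b >= 0) == y for (x1, x2), y in data)

# Brute-force stand-in for the MILP: search small integer weights and
# biases, and among the exact fits keep the sparsest weight vector
# (the L0-style constraint the paper adds to the formulation).
candidates = [
    (w, b)
    for w in product(range(-2, 3), repeat=2)
    for b in range(-3, 4)
    if fits(w, b)
]
w, b = min(candidates, key=lambda wb: sum(v != 0 for v in wb[0]))
print(w, b)
```

The step function rules out gradient descent here, but because weights, bias, and the sparsity count are all integer quantities, the search is naturally expressible as a MILP.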
Recall Distortion in Neural Network Pruning and the Undecayed Pruning Algorithm
Pruning techniques have been successfully used in neural networks to trade accuracy for sparsity. However, the impact of network pruning is not uniform: prior work has shown that the recall for underrepresented classes in a dataset may be more negatively affected. In this work, we study such relative distortions in recall by hypothesizing an intensification effect that is inherent to the model: namely, that pruning makes recall relatively worse for a class with recall below accuracy and, conversely, relatively better for a class with recall above accuracy. In addition, we propose a new pruning algorithm aimed at attenuating such an effect. Through statistical analysis, we observe that intensification is less severe with our algorithm but nevertheless more pronounced with relatively more difficult tasks, less complex models, and higher pruning ratios. More surprisingly, we conversely observe a de-intensification effect with lower pruning ratios, which indicates that moderate pruning may have a corrective effect on such distortions.
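The recall-versus-accuracy comparison at the heart of the intensification hypothesis can be sketched as follows (all predictions are made-up toy numbers, not the paper's experimental results):

```python
def per_class_recall(y_true, y_pred):
    classes = sorted(set(y_true))
    rec = {}
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        rec[c] = sum(y_pred[i] == c for i in idx) / len(idx)
    return rec

def accuracy(y_true, y_pred):
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)

# Illustrative predictions before and after pruning.
y_true = [0]*6 + [1]*4
dense  = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]   # class 1 is the weaker class
pruned = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

for name, pred in [("dense", dense), ("pruned", pruned)]:
    print(name, accuracy(y_true, pred), per_class_recall(y_true, pred))
```

In this toy example overall accuracy is unchanged at 0.8, yet class 0 (recall above accuracy in the dense model) improves to recall 1.0 after pruning while class 1 (recall below accuracy) drops to 0.5: exactly the intensification pattern the paper hypothesizes.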
Combining Reinforcement Learning and Constraint Programming for Combinatorial Optimization
Combinatorial optimization has found applications in numerous fields, from
aerospace to transportation planning and economics. The goal is to find an
optimal solution among a finite set of possibilities. The well-known challenge
one faces with combinatorial optimization is the state-space explosion problem:
the number of possibilities grows exponentially with the problem size, which
makes solving intractable for large problems. In the last years, deep
reinforcement learning (DRL) has shown its promise for designing good
heuristics dedicated to solving NP-hard combinatorial optimization problems.
However, current approaches have two shortcomings: (1) they mainly focus on the
standard travelling salesman problem and they cannot be easily extended to
other problems, and (2) they only provide an approximate solution with no
systematic ways to improve it or to prove optimality. In another context,
constraint programming (CP) is a generic tool to solve combinatorial
optimization problems. Based on a complete search procedure, it will always
find the optimal solution if given enough execution time. A critical design
choice that makes CP non-trivial to use in practice is the branching decision,
which directs how the search space is explored. In this work,
we propose a general and hybrid approach, based on DRL and CP, for solving
combinatorial optimization problems. The core of our approach is a dynamic
programming formulation that acts as a bridge between both techniques.
We experimentally show that our solver efficiently solves two challenging
problems: the traveling salesman problem with time windows, and the 4-moments
portfolio optimization problem. The results show that the proposed framework
outperforms the stand-alone RL and CP solutions, while being competitive with
industrial solvers.
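The dynamic programming formulation that bridges the two techniques can be illustrated on plain TSP with the classic Held-Karp recursion (distances are made-up; the paper's solver handles richer variants such as time windows): the DP state (visited set, current city) is what a learned value function can score and what a CP search can branch on.

```python
from itertools import combinations

# Illustrative 4-city distance matrix.
dist = [
    [0, 2, 9, 10],
    [2, 0, 6, 4],
    [9, 6, 0, 8],
    [10, 4, 8, 0],
]
n = len(dist)

# cost[(mask, last)] = cheapest path starting at city 0, visiting
# exactly the cities in `mask`, and ending at `last`.
cost = {}
for size in range(2, n + 1):
    for subset in combinations(range(1, n), size - 1):
        mask = 1 | sum(1 << c for c in subset)
        for last in subset:
            prev_mask = mask ^ (1 << last)
            if prev_mask == 1:                      # came straight from city 0
                cost[(mask, last)] = dist[0][last]
            else:                                   # Bellman recursion over states
                cost[(mask, last)] = min(
                    cost[(prev_mask, k)] + dist[k][last]
                    for k in subset if k != last
                )

full = (1 << n) - 1
tour = min(cost[(full, last)] + dist[last][0] for last in range(1, n))
print(tour)
```

Exact DP like this blows up exponentially in the number of cities; the hybrid approach uses DRL to guide which DP states to explore and CP to retain completeness and optimality proofs.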
Scaling Up Exact Neural Network Compression by ReLU Stability
We can compress a rectifier network while exactly preserving its underlying functionality with respect to a given input domain if some of its neurons are stable. However, current approaches to determine the stability of neurons with Rectified Linear Unit (ReLU) activations require solving or finding a good approximation to multiple discrete optimization problems. In this work, we introduce an algorithm based on solving a single optimization problem to identify all stable neurons. Our approach is, on median, 183 times faster than the state-of-the-art method on CIFAR-10, which allows us to explore exact compression on deeper (5 x 100) and wider (2 x 800) networks within minutes. For classifiers trained under an amount of L1 regularization that does not worsen accuracy, we can remove up to 56% of the connections on the CIFAR-10 dataset. The code is available at https://github.com/yuxwind/ExactCompression.
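What "stable" means here can be sketched with interval bound propagation over the input domain (a cheaper and looser certificate than the paper's single-problem method, but sound: any neuron it flags really is stable on the box; weights and box are illustrative):

```python
import numpy as np

def interval_bounds(W, b, lo, hi):
    """Propagate an input box [lo, hi] through a linear layer.

    Returns elementwise lower/upper bounds on the pre-activations:
    positive entries of W pull from one end of the box, negative
    entries from the other.
    """
    W_pos, W_neg = np.maximum(W, 0), np.minimum(W, 0)
    low = W_pos @ lo + W_neg @ hi + b
    high = W_pos @ hi + W_neg @ lo + b
    return low, high

W = np.array([[1.0, -1.0],
              [2.0,  0.5],
              [-1.0, -1.0]])
b = np.array([3.0, -10.0, 0.0])
lo, hi = np.array([0.0, 0.0]), np.array([1.0, 1.0])

low, high = interval_bounds(W, b, lo, hi)
stably_active = low >= 0     # ReLU acts as the identity: mergeable into the next layer
stably_inactive = high <= 0  # ReLU always outputs 0: the neuron can be removed
print(stably_active, stably_inactive)
```

Removing stably inactive neurons and folding stably active ones into the next layer changes the network's size but not its function on the domain, which is the "exact" in exact compression.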
Towards Lower Bounds on the Depth of ReLU Neural Networks
We contribute to a better understanding of the class of functions that is
represented by a neural network with ReLU activations and a given architecture.
Using techniques from mixed-integer optimization, polyhedral theory, and
tropical geometry, we provide a mathematical counterbalance to the universal
approximation theorems which suggest that a single hidden layer is sufficient
for learning tasks. In particular, we investigate whether the class of exactly
representable functions strictly increases by adding more layers (with no
restrictions on size). This problem has potential impact on algorithmic and
statistical aspects because of the insight it provides into the class of
functions represented by neural hypothesis classes. However, to the best of our
knowledge, this question has not been investigated in the neural network
literature. We also present upper bounds on the sizes of neural networks
required to represent functions in these neural hypothesis classes.
Comment: Camera-ready version for the NeurIPS 2021 conference
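For intuition on exact representability, the maximum of two numbers is computable by a ReLU network with a single hidden layer, and composing such gadgets in a tree handles more inputs with more depth; whether the maximum of n numbers genuinely requires depth growing with n is the kind of lower-bound question studied here (the tree construction below is a standard folklore sketch, not the paper's):

```python
def relu(t):
    return max(t, 0.0)

# max(x, y) = x + relu(y - x): exact with one hidden ReLU layer,
# since relu(y - x) adds the gap y - x exactly when y > x.
def max2(x, y):
    return x + relu(y - x)

# Composing max2 in a binary tree gives the max of 4 numbers with
# two levels of ReLU, i.e. depth grows logarithmically in n.
def max4(a, b, c, d):
    return max2(max2(a, b), max2(c, d))

print(max2(3.0, 5.0), max4(1.0, 7.0, -2.0, 4.0))
```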
Expected Complexity and Gradients of Deep Maxout Neural Networks and Implications to Parameter Initialization
Learning with neural networks depends on the particular parametrization of the functions represented by the network, that is, the assignment of parameters to functions. It also depends on the identity of the functions, which get assigned typical parameters at initialization, and, later, the parameters that arise during training. The choice of the activation function is a critical aspect of the network design that influences these function properties and requires investigation. This thesis focuses on analyzing the expected behavior of networks with maxout (multi-argument) activation functions. On top of enhancing the practical applicability of maxout networks, these findings add to the theoretical exploration of activation functions beyond the common choices. We believe this work can advance the study of activation functions and complicated neural network architectures.
We begin by taking the number of activation regions as a complexity measure and showing that the practical complexity of deep networks with maxout activation functions is often far from the theoretical maximum. This analysis extends the previous results that were valid for deep neural networks with single-argument activation functions such as ReLU. Additionally, we demonstrate that a similar phenomenon occurs when considering the decision boundaries in classification tasks. We also show that the parameter space has a multitude of full-dimensional regions with widely different complexity and obtain nontrivial lower bounds on the expected complexity. Finally, we investigate different parameter initialization procedures and show that they can increase the speed of the gradient descent convergence in training.
Further, continuing the investigation of the expected behavior, we study the gradients of a maxout network with respect to inputs and parameters and obtain bounds for the moments depending on the architecture and the parameter distribution. We observe that the distribution of the input-output Jacobian depends on the input, which complicates stable parameter initialization. Based on the moments of the gradients, we formulate parameter initialization strategies that avoid vanishing and exploding gradients in wide networks. Experiments with deep fully-connected and convolutional networks show that this strategy improves SGD and Adam training of deep maxout networks. In addition, we obtain refined bounds on the expected number of linear regions, results on the expected curve length distortion, and results on the NTK. As a result of the research in this thesis, we develop multiple experiments and helpful components and make their code publicly available.
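A maxout unit, the multi-argument activation analyzed throughout the thesis, takes the maximum over several affine pre-activations. A minimal forward-pass sketch (the fan-in-scaled initialization constant below is a generic placeholder, not the maxout-specific scaling the thesis derives from gradient moments):

```python
import numpy as np

def maxout_layer(x, W, b):
    """Maxout layer: W has shape (units, rank, fan_in); each unit
    outputs the maximum over `rank` affine functions of x."""
    pre = np.einsum('urk,k->ur', W, x) + b   # (units, rank) pre-activations
    return pre.max(axis=1)                   # elementwise max per unit

rng = np.random.default_rng(0)
fan_in, units, rank = 8, 5, 3
# Placeholder fan-in scaling; the thesis argues maxout needs its own
# correction factor to keep gradient moments stable with depth.
W = rng.normal(scale=1.0 / np.sqrt(fan_in), size=(units, rank, fan_in))
b = np.zeros((units, rank))
x = rng.normal(size=fan_in)
y = maxout_layer(x, W, b)
print(y.shape)  # (5,)
```

With rank = 1 this degenerates to a plain linear layer, and relu(t) = max(t, 0) is the special case of a rank-2 unit with one branch fixed at zero, which is why maxout results generalize ReLU ones.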