Theoretical insights into the optimization landscape of over-parameterized shallow neural networks
In this paper we study the problem of learning a shallow artificial neural
network that best fits a training data set. We study this problem in the
over-parameterized regime where the number of observations is smaller than the
number of parameters in the model. We show that with quadratic activations the
optimization landscape of training such shallow neural networks has certain
favorable characteristics that allow globally optimal models to be found
efficiently using a variety of local search heuristics. This result holds for
an arbitrary training data of input/output pairs. For differentiable activation
functions we also show that gradient descent, when suitably initialized,
converges at a linear rate to a globally optimal model. This result focuses on
a realizable model where the inputs are chosen i.i.d. from a Gaussian
distribution and the labels are generated according to planted weight
coefficients.
Comment: Section 3 on numerical experiments is added. Theorems 2.1 and 2.2 are improved to apply to almost all input data (not just Gaussian inputs). Related work section is expanded. The paper is accepted for publication in IEEE Transactions on Information Theory (2018).
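As a concrete illustration of the setting this abstract analyzes, the following sketch trains a one-hidden-layer network with quadratic activations by plain gradient descent on a squared loss. The fixed +/-1 output weights, initialization scale, learning rate, and step count are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np

# Illustrative sketch (not the paper's code): a shallow network with
# quadratic activations, f(x) = sum_j v_j * (w_j . x)^2, trained by
# plain gradient descent on the squared loss in the over-parameterized
# regime (more hidden units than training points).

rng = np.random.default_rng(0)
n, d, k = 50, 20, 100                        # samples, input dim, hidden units (k > n)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)                   # arbitrary real labels

W = 0.01 * rng.standard_normal((k, d))       # trainable hidden weights
v = np.repeat([1.0, -1.0], k // 2)           # fixed output weights (an assumption)

def predict(W):
    return ((X @ W.T) ** 2) @ v              # quadratic activation, linear readout

lr, steps = 0.02, 5000
for t in range(steps):
    A = X @ W.T                              # pre-activations, shape (n, k)
    r = (A ** 2) @ v - y                     # residuals
    grad = (2.0 / n) * ((r[:, None] * A) * v).T @ X
    W -= lr * grad

print("final training loss:", 0.5 * np.mean((predict(W) - y) ** 2))
```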
Learning Neural Networks with Two Nonlinear Layers in Polynomial Time
We give a polynomial-time algorithm for learning neural networks with one
layer of sigmoids feeding into any Lipschitz, monotone activation function
(e.g., sigmoid or ReLU). We make no assumptions on the structure of the
network, and the algorithm succeeds with respect to {\em any} distribution on
the unit ball in $n$ dimensions (hidden weight vectors also have unit norm).
This is the first assumption-free, provably efficient algorithm for learning
neural networks with two nonlinear layers.
Our algorithm-- {\em Alphatron}-- is a simple, iterative update rule that
combines isotonic regression with kernel methods. It outputs a hypothesis that
yields efficient oracle access to interpretable features. It also suggests a
new approach to Boolean learning problems via real-valued conditional-mean
functions, sidestepping traditional hardness results from computational
learning theory.
Along these lines, we subsume and improve many longstanding results for PAC
learning Boolean functions to the more general, real-valued setting of {\em
probabilistic concepts}, a model that (unlike PAC learning) requires non-i.i.d.
noise-tolerance.
Comment: Changed title, included new result.
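A minimal sketch of the kind of kernelized, GLMtron-style update the Alphatron description suggests: maintain one coefficient per training point and apply an additive correction through the kernel. The RBF kernel, single-sigmoid target, learning rate, and iteration count are stand-in assumptions, not the paper's exact kernel or analysis.

```python
import numpy as np

# Alphatron-style sketch: hypothesis h(x) = u(sum_i alpha_i K(x_i, x)),
# updated with an isotonic/GLMtron-flavored additive correction.

rng = np.random.default_rng(0)
m, d = 200, 10
X = rng.standard_normal((m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # data on the unit sphere

u = lambda z: 1.0 / (1.0 + np.exp(-z))                # known monotone outer activation
w_star = rng.standard_normal(d); w_star /= np.linalg.norm(w_star)
y = u(X @ w_star) + 0.05 * rng.standard_normal(m)     # noisy real-valued labels

def kernel(A, B):
    # RBF kernel as a stand-in for the paper's composed kernel
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq)

K = kernel(X, X)
alpha = np.zeros(m)
lr = 1.0
for t in range(500):
    h = u(K @ alpha)                  # current hypothesis on the training set
    alpha += (lr / m) * (y - h)       # additive correction toward the labels

print("train MSE:", np.mean((u(K @ alpha) - y) ** 2))
```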
Lipschitz regularity of deep neural networks: analysis and efficient estimation
Deep neural networks are notorious for being sensitive to small well-chosen
perturbations, and estimating the regularity of such architectures is of utmost
importance for safe and robust practical applications. In this paper, we
investigate one of the key characteristics to assess the regularity of such
methods: the Lipschitz constant of deep learning architectures. First, we show
that, even for two-layer neural networks, the exact computation of this
quantity is NP-hard and state-of-the-art methods may significantly overestimate it.
Then, we both extend and improve previous estimation methods by providing
AutoLip, the first generic algorithm for upper bounding the Lipschitz constant
of any automatically differentiable function. We provide a power method
algorithm working with automatic differentiation, allowing efficient
computations even on large convolutions. Second, for sequential neural
networks, we propose an improved algorithm named SeqLip that takes advantage of
the linear computation graph to split the computation per pair of consecutive
layers. Third, we propose heuristics on SeqLip in order to tackle very large
networks. Our experiments show that SeqLip can significantly improve on the
existing upper bounds. Finally, we provide an implementation of AutoLip in the
PyTorch environment that may be used to better estimate the robustness of a
given neural network to small perturbations or regularize it using more precise
Lipschitz estimates.
Comment: 12 pages, 3 figures.
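The power-method idea mentioned in this abstract can be sketched with automatic differentiation as follows: autograd supplies the adjoint of a layer, so the largest singular value of a large convolution can be estimated without materializing its matrix. The convolutional layer, input shape, and iteration count are illustrative, and this estimates a single layer's spectral norm rather than reproducing AutoLip or SeqLip.

```python
import torch

# Power-method spectral-norm estimate for a linear layer (here a
# convolution), using autograd to apply the operator's adjoint.

torch.manual_seed(0)
conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False)

def operator_norm(layer, in_shape, iters=50):
    v = torch.randn(in_shape)
    for _ in range(iters):
        v = v / v.norm()
        v.requires_grad_(True)
        u = layer(v)                                        # forward pass: u = W v
        (g,) = torch.autograd.grad(u, v, grad_outputs=u)    # adjoint via autograd: g = W^T W v
        v = g.detach()
    v = v / v.norm()
    return layer(v).norm().item()                           # ||W v|| ~ largest singular value

print("estimated spectral norm:", operator_norm(conv, (1, 3, 32, 32)))
```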
Analytical bounds on the local Lipschitz constants of affine-ReLU functions
In this paper, we determine analytical bounds on the local Lipschitz
constants of affine functions composed with rectified linear units (ReLUs).
Affine-ReLU functions represent a widely used layer in deep neural networks,
due to the fact that convolution, fully-connected, and normalization functions
are all affine, and are often followed by a ReLU activation function. Using an
analytical approach, we mathematically determine upper bounds on the local
Lipschitz constant of an affine-ReLU function, show how these bounds can be
combined to determine a bound on an entire network, and discuss how the bounds
can be efficiently computed, even for larger layers and networks. We show
several examples by applying our results to AlexNet, as well as several smaller
networks based on the MNIST and CIFAR-10 datasets. The results show that our
method produces tighter bounds than the standard conservative bound (i.e. the
product of the spectral norms of the layers' linear matrices), especially for
small perturbations.
Comment: 14 pages, 5 figures.
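A small numerical illustration (not the paper's analytical bound) of why the local behavior of an affine-ReLU layer can sit well below the conservative spectral-norm bound: only some ReLU units are active near a given point, so local slopes over a small ball tend to be smaller than the global operator norm. The layer sizes, base point, and sampling radius are assumptions.

```python
import numpy as np

# Compare the conservative global bound ||W||_2 for g(x) = relu(W x + b)
# with an empirical local slope estimate over a small perturbation ball.

rng = np.random.default_rng(0)
n_out, n_in = 64, 32
W = rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
b = rng.standard_normal(n_out)
g = lambda x: np.maximum(W @ x + b, 0.0)

global_bound = np.linalg.norm(W, 2)                  # spectral norm: conservative bound

x0, eps = rng.standard_normal(n_in), 0.1
ratios = []
for _ in range(2000):
    delta = rng.standard_normal(n_in)
    delta *= eps * rng.uniform() / np.linalg.norm(delta)   # random point in the eps-ball
    ratios.append(np.linalg.norm(g(x0 + delta) - g(x0)) / np.linalg.norm(delta))

print("spectral-norm bound  :", global_bound)
print("empirical local slope:", max(ratios))         # typically noticeably smaller
```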
Data-dependent Sample Complexity of Deep Neural Networks via Lipschitz Augmentation
Existing Rademacher complexity bounds for neural networks rely only on norm
control of the weight matrices and depend exponentially on depth via a product
of the matrix norms. Lower bounds show that this exponential dependence on
depth is unavoidable when no additional properties of the training data are
considered. We suspect that this conundrum comes from the fact that these
bounds depend on the training data only through the margin. In practice, many
data-dependent techniques such as Batchnorm improve the generalization
performance. For feedforward neural nets as well as RNNs, we obtain tighter
Rademacher complexity bounds by considering additional data-dependent
properties of the network: the norms of the hidden layers of the network, and
the norms of the Jacobians of each layer with respect to all previous layers.
Our bounds scale polynomially in depth when these empirical quantities are
small, as is usually the case in practice. To obtain these bounds, we develop
general tools for augmenting a sequence of functions to make their composition
Lipschitz and then covering the augmented functions. Inspired by our theory, we
directly regularize the network's Jacobians during training and empirically
demonstrate that this improves test performance.
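A minimal sketch of one way to penalize Jacobian norms during training, in the spirit of the last sentence of this abstract. The architecture, the single-random-projection estimator of the squared Frobenius norm, and the penalty weight are illustrative assumptions, not necessarily the paper's exact regularizer.

```python
import torch
import torch.nn as nn

# Jacobian regularization sketch: add a penalty on the norm of the
# network's input-output Jacobian, estimated with one random projection.

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(net.parameters(), lr=1e-2)
lam = 0.1                                        # regularization strength

x = torch.randn(128, 20)
y = torch.randint(0, 10, (128,))

for step in range(100):
    x_in = x.clone().requires_grad_(True)
    logits = net(x_in)
    loss = nn.functional.cross_entropy(logits, y)

    # random-projection estimate: E_v ||J^T v||^2 = ||J||_F^2 for v ~ N(0, I)
    v = torch.randn_like(logits)
    (vjp,) = torch.autograd.grad(logits, x_in, grad_outputs=v, create_graph=True)
    loss = loss + lam * (vjp ** 2).sum() / x.shape[0]

    opt.zero_grad()
    loss.backward()
    opt.step()

print("final regularized loss:", loss.item())
```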
Principled Deep Neural Network Training through Linear Programming
Deep Learning has received significant attention due to its impressive
performance in many state-of-the-art learning tasks. Unfortunately, while very
powerful, Deep Learning is not well understood theoretically and in particular
only recently results for the complexity of training deep neural networks have
been obtained. In this work we show that large classes of deep neural networks
with various architectures (e.g., DNNs, CNNs, Binary Neural Networks, and
ResNets), activation functions (e.g., ReLUs and leaky ReLUs), and loss
functions (e.g., Hinge loss, Euclidean loss, etc.) can be trained to near
optimality with desired target accuracy using linear programming in time that
is exponential in the input data and parameter space dimension and polynomial
in the size of the data set; improvements of the dependence in the input
dimension are known to be unlikely under standard complexity-theoretic assumptions, and improving the
dependence on the parameter space dimension remains open. In particular, we
obtain polynomial time algorithms for training for a given fixed network
architecture. Our work applies more broadly to empirical risk minimization
problems which allows us to generalize various previous results and obtain new
complexity results for previously unstudied architectures in the proper
learning setting.
Are ResNets Provably Better than Linear Predictors?
A residual network (or ResNet) is a standard deep neural net architecture,
with state-of-the-art performance across numerous applications. The main
premise of ResNets is that they allow the training of each layer to focus on
fitting just the residual of the previous layer's output and the target output.
Thus, we should expect that the trained network is no worse than what we can
obtain if we remove the residual layers and train a shallower network instead.
However, due to the non-convexity of the optimization problem, it is not at all
clear that ResNets indeed achieve this behavior, rather than getting stuck at
some arbitrarily poor local minimum. In this paper, we rigorously prove that
arbitrarily deep, nonlinear residual units indeed exhibit this behavior, in the
sense that the optimization landscape contains no local minima with value above
what can be obtained with a linear predictor (namely a 1-layer network).
Notably, we show this under minimal or no assumptions on the precise network
architecture, data distribution, or loss function used. We also provide a
quantitative analysis of approximate stationary points for this problem.
Finally, we show that with a certain tweak to the architecture, training the
network with standard stochastic gradient descent achieves an objective value
close to or better than that of any linear predictor.
Comment: Comparison to previous arXiv version: minor changes to incorporate comments of NIPS 2018 reviewers (main results are unaffected).
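The structural premise behind this abstract, namely that zeroing the residual branches reduces a residual network to a plain linear (1-layer) predictor, can be checked directly. The sketch below illustrates only this representational point, not the paper's landscape analysis, and the layer sizes and data are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# A residual network x -> x + g(x), ..., followed by a linear head, can
# always represent the plain linear predictor by zeroing the residual branches.

torch.manual_seed(0)

class ResidualUnit(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(d, d), nn.Linear(d, d)
    def forward(self, x):
        return x + self.fc2(torch.relu(self.fc1(x)))    # identity + residual branch

d = 16
blocks = nn.Sequential(*[ResidualUnit(d) for _ in range(4)])
head = nn.Linear(d, 1)
resnet = nn.Sequential(blocks, head)

# Zero the last layer of every residual branch: each unit becomes the
# identity, so the whole network computes exactly head(x).
with torch.no_grad():
    for unit in blocks:
        unit.fc2.weight.zero_(); unit.fc2.bias.zero_()

x = torch.randn(8, d)
print(torch.allclose(resnet(x), head(x)))    # True: reduces to the 1-layer predictor
```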
Lipschitz constant estimation of Neural Networks via sparse polynomial optimization
We introduce LiPopt, a polynomial optimization framework for computing
increasingly tighter upper bounds on the Lipschitz constant of neural networks.
The underlying optimization problems boil down to either linear (LP) or
semidefinite (SDP) programming. We show how to use the sparse connectivity of a
network to significantly reduce the complexity of computation. This is
especially useful for convolutional as well as pruned neural networks. We
conduct experiments on networks with random weights as well as networks trained
on MNIST, showing that in the particular case of the $\ell_\infty$-Lipschitz
constant, our approach yields superior estimates, compared to baselines
available in the literature.
Comment: Published as a conference paper at ICLR 2020, originally submitted on September 25, 2019, and available at https://openreview.net/forum?id=rJe4_xSFD
Generalization bounds for deep convolutional neural networks
We prove bounds on the generalization error of convolutional networks. The
bounds are in terms of the training loss, the number of parameters, the
Lipschitz constant of the loss and the distance from the weights to the initial
weights. They are independent of the number of pixels in the input, and the
height and width of hidden feature maps. We present experiments using CIFAR-10
with varying hyperparameters of a deep convolutional network, comparing our
bounds with practical generalization gaps.
Comment: Published as a conference paper at ICLR 2020.
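A short sketch of measuring the distance-from-initialization quantity that these bounds depend on, per parameter tensor and in total. The small convolutional network, random data, and training loop are illustrative assumptions, not the experimental setup of the paper.

```python
import copy
import torch
import torch.nn as nn

# Track how far the trained weights move from their initialization.

torch.manual_seed(0)
net = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 32 * 32, 10),
)
init = copy.deepcopy(net.state_dict())        # snapshot of the initial weights

opt = torch.optim.SGD(net.parameters(), lr=1e-2)
x, y = torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,))
for _ in range(50):
    loss = nn.functional.cross_entropy(net(x), y)
    opt.zero_grad(); loss.backward(); opt.step()

dist = {k: (v - init[k]).norm().item() for k, v in net.state_dict().items()}
print("per-parameter distance from init:", dist)
print("total distance from init:", sum(d ** 2 for d in dist.values()) ** 0.5)
```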
Eigenvalue Decay Implies Polynomial-Time Learnability for Neural Networks
We consider the problem of learning function classes computed by neural
networks with various activations (e.g. ReLU or Sigmoid), a task believed to be
computationally intractable in the worst-case. A major open problem is to
understand the minimal assumptions under which these classes admit provably
efficient algorithms. In this work we show that a natural distributional
assumption corresponding to {\em eigenvalue decay} of the Gram matrix yields
polynomial-time algorithms in the non-realizable setting for expressive classes
of networks (e.g. feed-forward networks of ReLUs). We make no assumptions on
the structure of the network or the labels. Given sufficiently-strong
polynomial eigenvalue decay, we obtain {\em fully}-polynomial time algorithms
in {\em all} the relevant parameters with respect to square-loss. Milder decay
assumptions also lead to improved algorithms. This is the first purely
distributional assumption that leads to polynomial-time algorithms for networks
of ReLUs, even with one hidden layer. Further, unlike prior distributional
assumptions (e.g., the marginal distribution is Gaussian), eigenvalue decay has
been observed in practice on common data sets.
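A quick sketch of the distributional quantity this abstract refers to: the eigenvalue decay of a kernel Gram matrix computed on a data sample. The RBF kernel and synthetic data below are stand-ins for the kernels and data sets the paper considers.

```python
import numpy as np

# Eigenvalue decay of a kernel Gram matrix on a sample of m points.

rng = np.random.default_rng(0)
m, d = 500, 20
X = rng.standard_normal((m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq)                                   # RBF Gram matrix

eig = np.sort(np.linalg.eigvalsh(K))[::-1]        # eigenvalues, largest first
for i in (0, 9, 49, 199, 499):
    print(f"eigenvalue {i + 1}: {eig[i]:.3e}")
```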