The Computational Complexity of Training ReLU(s)
We consider the computational complexity of training depth-2 neural networks
composed of rectified linear units (ReLUs). We show that, even for the case of
a single ReLU, finding a set of weights that minimizes the squared error (even
approximately) for a given training set is NP-hard. We also show that for a
simple network consisting of two ReLUs, the error minimization problem is
NP-hard, even in the realizable case. We complement these hardness results by
showing that, when the weights and samples belong to the unit ball, one can
(agnostically) properly and reliably learn depth-2 ReLUs with $k$ units and
error at most $\epsilon$ in time $2^{(k/\epsilon)^{O(1)}} \cdot n^{O(1)}$; this
extends upon a previous work of Goel, Kanade, Klivans and Thaler (2017), which
provided efficient improper learning algorithms for ReLUs.
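As a concrete reference point for the objective these hardness results concern, here is a minimal sketch of squared-error fitting of a single ReLU by subgradient descent on a toy realizable instance; the function names, data, and step size are illustrative and not taken from the paper.

```python
import numpy as np

def relu_sq_loss(w, X, y):
    """Squared error of a single ReLU x -> max(0, <w, x>) over a training set."""
    return np.mean((np.maximum(0.0, X @ w) - y) ** 2)

def subgrad_step(w, X, y, lr=0.2):
    """One subgradient step; the objective is non-convex in w, which is what the
    worst-case NP-hardness result is about."""
    z = X @ w
    grad = 2.0 * (((np.maximum(0.0, z) - y) * (z > 0)) @ X) / X.shape[0]
    return w - lr * grad

# Toy realizable instance: labels generated by a planted ReLU.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_star = rng.normal(size=5)
y = np.maximum(0.0, X @ w_star)
w = 0.01 * rng.normal(size=5)
for _ in range(1000):
    w = subgrad_step(w, X, y)
print(relu_sq_loss(w, X, y))   # should be near zero on this easy instance
```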
Dense Associative Memory for Pattern Recognition
A model of associative memory is studied, which stores and reliably retrieves
many more patterns than the number of neurons in the network. We propose a
simple duality between this dense associative memory and neural networks
commonly used in deep learning. On the associative memory side of this duality,
a family of models that smoothly interpolates between two limiting cases can be
constructed. One limit is referred to as the feature-matching mode of pattern
recognition, and the other one as the prototype regime. On the deep learning
side of the duality, this family corresponds to feedforward neural networks
with one hidden layer and various activation functions, which transmit the
activities of the visible neurons to the hidden layer. This family of
activation functions includes the logistic function, rectified linear units,
and rectified polynomials of higher degrees. The proposed duality makes it
possible to apply energy-based intuition from associative memory to analyze
computational properties of neural networks with unusual activation functions,
namely the higher rectified polynomials, which until now have not been used in
deep learning. The utility of the dense memories is illustrated for two test
cases: the logical gate XOR and the recognition of handwritten digits from the
MNIST data set.
Comment: Accepted for publication at NIPS 2016
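As a rough illustration of the energy-based view described above, the sketch below implements a toy dense associative memory with rectified-polynomial interaction functions $F_n(x)=\max(0,x)^n$; all names and parameter choices are mine, not the paper's.

```python
import numpy as np

def F(x, n=3):
    """Rectified polynomial interaction: F_n(x) = max(0, x)^n.
    Small n behaves like feature matching; large n pushes toward the prototype regime."""
    return np.maximum(0.0, x) ** n

def energy(sigma, patterns, n=3):
    """Dense associative memory energy: E(sigma) = -sum_mu F_n(<xi_mu, sigma>)."""
    return -np.sum(F(patterns @ sigma, n))

def retrieve(sigma, patterns, n=3, sweeps=5):
    """Asynchronous dynamics: set each spin to whichever sign gives lower energy."""
    sigma = sigma.copy()
    for _ in range(sweeps):
        for i in range(len(sigma)):
            candidates = []
            for s in (-1.0, 1.0):
                trial = sigma.copy()
                trial[i] = s
                candidates.append((energy(trial, patterns, n), s))
            sigma[i] = min(candidates)[1]
    return sigma

# Store random binary patterns and recover one from a corrupted probe.
rng = np.random.default_rng(0)
patterns = rng.choice([-1.0, 1.0], size=(10, 64))   # 10 patterns, 64 neurons
probe = patterns[0].copy()
probe[:16] *= -1                                     # flip a quarter of the bits
recovered = retrieve(probe, patterns, n=3)
print(np.mean(recovered == patterns[0]))             # fraction of bits recalled correctly
```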
Static Activation Function Normalization
Recent seminal work at the intersection of deep neural networks practice and
random matrix theory has linked the convergence speed and robustness of these
networks with the combination of random weight initialization and nonlinear
activation function in use. Building on those principles, we introduce a
process to transform an existing activation function into another one with
better properties. We term such transform \emph{static activation
normalization}. More specifically we focus on this normalization applied to the
ReLU unit, and show empirically that it significantly promotes convergence
robustness, maximum training depth, and anytime performance. We verify these
claims by examining empirical eigenvalue distributions of networks trained with
those activations. Our static activation normalization provides a first step
towards giving benefits similar in spirit to schemes like batch normalization,
but without the associated computational cost.
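The abstract does not spell out the transform itself; one plausible reading is that a "static" normalization shifts and rescales an activation so its output is standardized when pre-activations are standard normal. A minimal sketch under that assumption (the constants are the exact Gaussian moments of the ReLU, not values from the paper):

```python
import numpy as np

# Moments of max(0, z) for z ~ N(0, 1): mean = 1/sqrt(2*pi), second moment = 1/2.
_RELU_MEAN = 1.0 / np.sqrt(2.0 * np.pi)
_RELU_STD = np.sqrt(0.5 - 1.0 / (2.0 * np.pi))

def normalized_relu(x):
    """ReLU shifted and scaled so that, for standard-normal pre-activations,
    the output has zero mean and unit variance (one plausible 'static' normalization)."""
    return (np.maximum(0.0, x) - _RELU_MEAN) / _RELU_STD

# Quick check of the claimed moments on Gaussian input.
z = np.random.default_rng(0).normal(size=1_000_000)
out = normalized_relu(z)
print(out.mean(), out.std())   # both should be close to 0 and 1
```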
Eigenvalue Decay Implies Polynomial-Time Learnability for Neural Networks
We consider the problem of learning function classes computed by neural
networks with various activations (e.g. ReLU or Sigmoid), a task believed to be
computationally intractable in the worst-case. A major open problem is to
understand the minimal assumptions under which these classes admit provably
efficient algorithms. In this work we show that a natural distributional
assumption corresponding to {\em eigenvalue decay} of the Gram matrix yields
polynomial-time algorithms in the non-realizable setting for expressive classes
of networks (e.g. feed-forward networks of ReLUs). We make no assumptions on
the structure of the network or the labels. Given sufficiently-strong
polynomial eigenvalue decay, we obtain {\em fully}-polynomial time algorithms
in {\em all} the relevant parameters with respect to square-loss. Milder decay
assumptions also lead to improved algorithms. This is the first purely
distributional assumption that leads to polynomial-time algorithms for networks
of ReLUs, even with one hidden layer. Further, unlike prior distributional
assumptions (e.g., the marginal distribution is Gaussian), eigenvalue decay has
been observed in practice on common data sets.
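A quick way to get a feel for the kind of assumption the paper makes is to inspect the spectrum of a kernel Gram matrix on data; the sketch below does this with a stand-in linear kernel on synthetic anisotropic data, using hypothetical helper names of my own.

```python
import numpy as np

def gram_eigenvalue_decay(X, kernel=None):
    """Eigenvalues (descending) of the kernel Gram matrix, to inspect how fast they decay.
    Polynomial decay lambda_i ~ i^{-p} is the flavor of distributional assumption at issue."""
    if kernel is None:
        kernel = lambda A, B: A @ B.T          # linear kernel as a stand-in
    K = kernel(X, X)
    eigvals = np.linalg.eigvalsh(K)[::-1]      # eigvalsh returns ascending order
    return np.clip(eigvals, 0.0, None)

# Illustrative check on synthetic data whose coordinates shrink like 1/i.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50)) * (1.0 / np.arange(1, 51))
lam = gram_eigenvalue_decay(X)
print(lam[:10] / lam[0])   # normalized leading eigenvalues; rapid decay expected here
```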
Neural networks and rational functions
Neural networks and rational functions efficiently approximate each other. In
more detail, it is shown here that for any ReLU network, there exists a
rational function of degree $O(\mathrm{polylog}(1/\epsilon))$ which is
$\epsilon$-close, and similarly for any rational function there exists a ReLU
network of size $O(\mathrm{polylog}(1/\epsilon))$ which is $\epsilon$-close. By
contrast, polynomials need degree $\Omega(\mathrm{poly}(1/\epsilon))$ to
approximate even a single ReLU. When converting a ReLU network to a rational
function as above, the hidden constants depend exponentially on the number of
layers, which is shown to be tight; in other words, a compositional
representation can be beneficial even for rational functions.
Comment: To appear, ICML 2017
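A small numerical illustration of the polynomial side of this comparison (not from the paper): least-squares Chebyshev fits of a single ReLU on $[-1,1]$ show the slow decay of error with degree that the $\Omega(\mathrm{poly}(1/\epsilon))$ bound formalizes. Least squares is only a proxy for the best uniform approximation, so the numbers track the rate loosely.

```python
import numpy as np

def poly_fit_error(degree, grid_size=2001):
    """Max error of a degree-'degree' Chebyshev least-squares fit to ReLU on [-1, 1]."""
    x = np.linspace(-1.0, 1.0, grid_size)
    y = np.maximum(0.0, x)
    coeffs = np.polynomial.chebyshev.chebfit(x, y, degree)
    return np.max(np.abs(np.polynomial.chebyshev.chebval(x, coeffs) - y))

for d in (2, 4, 8, 16, 32):
    print(d, poly_fit_error(d))   # error shrinks only polynomially in the degree
```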
On the Learnability of Deep Random Networks
In this paper we study the learnability of deep random networks from both
theoretical and practical points of view. On the theoretical front, we show
that the learnability of random deep networks with sign activation drops
exponentially with its depth. On the practical front, we find that the
learnability drops sharply with depth even with the state-of-the-art training
methods, suggesting that our stylized theoretical results are closer to
reality.
Learning Neural Networks with Two Nonlinear Layers in Polynomial Time
We give a polynomial-time algorithm for learning neural networks with one
layer of sigmoids feeding into any Lipschitz, monotone activation function
(e.g., sigmoid or ReLU). We make no assumptions on the structure of the
network, and the algorithm succeeds with respect to {\em any} distribution on
the unit ball in $n$ dimensions (hidden weight vectors also have unit norm).
This is the first assumption-free, provably efficient algorithm for learning
neural networks with two nonlinear layers.
Our algorithm-- {\em Alphatron}-- is a simple, iterative update rule that
combines isotonic regression with kernel methods. It outputs a hypothesis that
yields efficient oracle access to interpretable features. It also suggests a
new approach to Boolean learning problems via real-valued conditional-mean
functions, sidestepping traditional hardness results from computational
learning theory.
Along these lines, we subsume and improve many longstanding results for PAC
learning Boolean functions to the more general, real-valued setting of {\em
probabilistic concepts}, a model that (unlike PAC learning) requires non-i.i.d.
noise-tolerance.
Comment: Changed title, included new results
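The sketch below gives a toy, Alphatron-style update in this spirit: kernelized predictions passed through a known monotone link, with coefficients nudged toward the residuals. The kernel, link, data, and step size are illustrative assumptions of mine, not the paper's actual algorithm or guarantees.

```python
import numpy as np

def alphatron_like(X, y, kernel, link, steps=200, lr=0.5):
    """Iterative kernelized update in the spirit of Alphatron:
    predict link(sum_j alpha_j K(x_j, x)) and move alpha toward the residuals."""
    m = len(y)
    K = kernel(X, X)
    alpha = np.zeros(m)
    for _ in range(steps):
        preds = link(K @ alpha)
        alpha += (lr / m) * (y - preds)
    return alpha

# Toy run: RBF kernel and a sigmoid link on synthetic probabilistic-concept data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
w = rng.normal(size=10)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
y = rng.binomial(1, sigmoid(X @ w)).astype(float)     # labels drawn from a conditional mean

rbf = lambda A, B: np.exp(-0.5 * np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1))
alpha = alphatron_like(X, y, rbf, sigmoid)
preds = sigmoid(rbf(X, X) @ alpha)
print(np.mean((preds - sigmoid(X @ w)) ** 2))          # squared error against the true conditional mean
```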
Fitting ReLUs via SGD and Quantized SGD
In this paper we focus on the problem of finding the optimal weights of the
shallowest of neural networks consisting of a single Rectified Linear Unit
(ReLU). These functions are of the form $x \mapsto \max(0, \langle w, x \rangle)$, with $w$
denoting the weight vector. We focus on a planted model where the inputs are
chosen i.i.d. from a Gaussian distribution and the labels are generated
according to a planted weight vector. We first show that mini-batch stochastic
gradient descent when suitably initialized, converges at a geometric rate to
the planted model with a number of samples that is optimal up to numerical
constants. Next we focus on a parallel implementation where in each iteration
the mini-batch gradient is calculated in a distributed manner across multiple
processors and then broadcast to a master or all other processors. To reduce
the communication cost in this setting we utilize a Quantized Stochastic
Gradient Descent (QSGD) scheme where the partial gradients are quantized. Perhaps
unexpectedly, we show that QSGD maintains the fast convergence of SGD to a
globally optimal model while significantly reducing the communication cost. We
further corroborate our numerical findings via various experiments including
distributed implementations over Amazon EC2.
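A minimal sketch of the planted setup with and without a simple stochastic quantizer of the mini-batch gradient; the quantizer, learning rate, and problem sizes are illustrative choices of mine and only loosely follow the QSGD scheme.

```python
import numpy as np

def quantize(g, levels=4):
    """Stochastic quantizer in the spirit of QSGD: keep the gradient norm,
    randomly round each coordinate's magnitude to one of a few levels."""
    norm = np.linalg.norm(g)
    if norm == 0:
        return g
    scaled = np.abs(g) / norm * levels
    lower = np.floor(scaled)
    q = lower + (np.random.rand(*g.shape) < (scaled - lower))
    return np.sign(g) * q * norm / levels

def fit_planted_relu(n=2000, d=20, lr=0.5, steps=300, batch=64, use_quantized=True):
    """Mini-batch SGD for a single ReLU under the planted Gaussian model."""
    rng = np.random.default_rng(0)
    w_star = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = np.maximum(0.0, X @ w_star)                  # labels from the planted weight vector
    w = 0.01 * rng.normal(size=d)
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch)
        Xb, yb = X[idx], y[idx]
        z = Xb @ w
        grad = 2.0 * (((np.maximum(0.0, z) - yb) * (z > 0)) @ Xb) / batch
        if use_quantized:
            grad = quantize(grad)
        w -= lr * grad
    return np.linalg.norm(w - w_star) / np.linalg.norm(w_star)

print(fit_planted_relu(use_quantized=False), fit_planted_relu(use_quantized=True))
```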
When is a Convolutional Filter Easy To Learn?
We analyze the convergence of (stochastic) gradient descent algorithm for
learning a convolutional filter with Rectified Linear Unit (ReLU) activation
function. Our analysis does not rely on any specific form of the input
distribution and our proofs only use the definition of ReLU, in contrast with
previous works that are restricted to standard Gaussian input. We show that
(stochastic) gradient descent with random initialization can learn the
convolutional filter in polynomial time and the convergence rate depends on the
smoothness of the input distribution and the closeness of patches. To the best
of our knowledge, this is the first recovery guarantee of gradient-based
algorithms for convolutional filter on non-Gaussian input distributions. Our
theory also justifies the two-stage learning rate strategy in deep neural
networks. While our focus is theoretical, we also present experiments that
illustrate our theoretical findings.
Comment: Published as a conference paper at ICLR 2018
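The toy sketch below runs gradient descent on a planted convolutional filter with ReLU activation and average pooling over patches; it uses Gaussian patches purely for convenience, whereas the paper's point is that the analysis does not require Gaussian inputs. All names and sizes are mine.

```python
import numpy as np

def conv_relu(w, patches):
    """Average ReLU response of a single filter w over the patches of each input:
    f(x) = (1/P) * sum_p max(0, <w, patch_p>)."""
    return np.mean(np.maximum(0.0, patches @ w), axis=-1)

def learn_filter(num_samples=2000, patch_dim=9, num_patches=8, lr=0.5, steps=500):
    """Gradient descent on the squared loss for a planted convolutional filter."""
    rng = np.random.default_rng(0)
    w_star = rng.normal(size=patch_dim)
    patches = rng.normal(size=(num_samples, num_patches, patch_dim))
    y = conv_relu(w_star, patches)                        # labels from the planted filter
    w = 0.01 * rng.normal(size=patch_dim)
    for _ in range(steps):
        z = patches @ w                                   # pre-activations, (samples, patches)
        resid = np.mean(np.maximum(0.0, z), axis=-1) - y
        grad = 2.0 * np.mean(resid[:, None, None] * (z > 0)[..., None] * patches, axis=(0, 1))
        w -= lr * grad
    return np.linalg.norm(w - w_star) / np.linalg.norm(w_star)

print(learn_filter())   # relative error; should shrink toward 0 on this toy instance
```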
Principled Deep Neural Network Training through Linear Programming
Deep Learning has received significant attention due to its impressive
performance in many state-of-the-art learning tasks. Unfortunately, while very
powerful, Deep Learning is not well understood theoretically and in particular
only recently results for the complexity of training deep neural networks have
been obtained. In this work we show that large classes of deep neural networks
with various architectures (e.g., DNNs, CNNs, Binary Neural Networks, and
ResNets), activation functions (e.g., ReLUs and leaky ReLUs), and loss
functions (e.g., hinge loss, Euclidean loss, etc.) can be trained to near
optimality with desired target accuracy using linear programming in time that
is exponential in the input data and parameter space dimension and polynomial
in the size of the data set; improvements of the dependence on the input
dimension are known to be unlikely assuming $P \neq NP$, and improving the
dependence on the parameter space dimension remains open. In particular, we
obtain polynomial time algorithms for training for a given fixed network
architecture. Our work applies more broadly to empirical risk minimization
problems which allows us to generalize various previous results and obtain new
complexity results for previously unstudied architectures in the proper
learning setting.