Learning Halfspaces and Neural Networks with Random Initialization
We study non-convex empirical risk minimization for learning halfspaces and neural networks. For loss functions that are L-Lipschitz continuous, we present algorithms to learn halfspaces and multi-layer neural networks that achieve arbitrarily small excess risk ε > 0. The time complexity is polynomial in the input dimension d and the sample size n, but exponential in a quantity that depends on the Lipschitz constant L and on 1/ε. These algorithms run multiple rounds of random initialization followed by arbitrary optimization steps. We further show that if the data is separable by some neural network with constant margin γ > 0, then there is a polynomial-time algorithm for learning a neural network that separates the training data with margin Ω(γ). As a consequence, the algorithm achieves arbitrary generalization error ε > 0 with polynomial sample and time complexity. We establish the same learnability result when the labels are randomly flipped with probability η < 1/2. Comment: 31 pages
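The restart scheme described in this abstract is simple to state in code. Below is a minimal, illustrative sketch (the function names and the choice of optimizer are assumptions, not the paper's implementation): run several rounds of random initialization, optimize from each, and keep the parameters with the smallest empirical risk.

```python
import numpy as np

def multi_restart_erm(risk, optimize, dim, num_rounds, seed=0):
    """Multiple rounds of random initialization followed by arbitrary optimization
    steps; return the parameters achieving the smallest empirical risk seen."""
    rng = np.random.default_rng(seed)
    best_w, best_risk = None, np.inf
    for _ in range(num_rounds):
        w0 = rng.standard_normal(dim)   # fresh random initialization
        w = optimize(risk, w0)          # any optimization procedure (e.g. gradient descent)
        r = risk(w)
        if r < best_risk:
            best_w, best_risk = w, r
    return best_w, best_risk
```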
On the Computational Efficiency of Training Neural Networks
It is well-known that neural networks are computationally hard to train. On
the other hand, in practice, modern day neural networks are trained efficiently
using SGD and a variety of tricks that include different activation functions
(e.g. ReLU), over-specification (i.e., train networks which are larger than
needed), and regularization. In this paper we revisit the computational
complexity of training neural networks from a modern perspective. We provide
both positive and negative results, some of which yield new provably efficient and practical algorithms for training certain types of neural networks. Comment: Section 2 is revised due to a mistake
Provable Generalization of SGD-trained Neural Networks of Any Width in the Presence of Adversarial Label Noise
We consider a one-hidden-layer leaky ReLU network of arbitrary width trained
by stochastic gradient descent (SGD) following an arbitrary initialization. We
prove that SGD produces neural networks that have classification accuracy
competitive with that of the best halfspace over the distribution for a broad
class of distributions that includes log-concave isotropic and hard margin
distributions. Equivalently, such networks can generalize when the data
distribution is linearly separable but corrupted with adversarial label noise,
despite the capacity to overfit. To the best of our knowledge, this is the
first work to show that overparameterized neural networks trained by SGD can
generalize when the data is corrupted with adversarial label noise. Comment: 30 pages, 10 figures
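As an illustration of the setting (not the authors' exact training procedure), the sketch below trains a one-hidden-layer leaky ReLU network of arbitrary width with plain SGD from a random initialization; the logistic loss, step size, and the choice to update only the hidden layer are assumptions made for brevity.

```python
import numpy as np

def leaky_relu(z, alpha=0.1):
    return np.where(z > 0, z, alpha * z)

def sgd_one_hidden_layer(X, y, width, steps, lr=0.01, seed=0):
    """Train a one-hidden-layer leaky-ReLU network f(x) = a . leaky_relu(W x)
    on the logistic loss with plain SGD, starting from a random initialization."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((width, d)) / np.sqrt(d)       # random initialization
    a = rng.choice([-1.0, 1.0], size=width) / np.sqrt(width)
    for _ in range(steps):
        i = rng.integers(n)                                 # one sample per SGD step
        x, label = X[i], y[i]                               # label in {-1, +1}
        z = W @ x
        f = a @ leaky_relu(z)
        g = -label / (1.0 + np.exp(label * f))              # d/df of log(1 + exp(-label * f))
        dz = np.where(z > 0, 1.0, 0.1)                      # leaky-ReLU derivative (alpha = 0.1)
        W -= lr * g * (a * dz)[:, None] * x[None, :]        # update the hidden layer only
    return W, a
```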
Agnostic Learning of a Single Neuron with Gradient Descent
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss over some unknown joint distribution D of the features and labels, by using gradient descent to minimize the empirical risk induced by a set of i.i.d. samples drawn from D. The activation function σ is an arbitrary Lipschitz and non-decreasing function, making the optimization problem nonconvex and nonsmooth in general, and covers typical neural network activation functions and inverse link functions in the generalized linear model setting. In the agnostic PAC learning setting, where no assumption on the relationship between the labels and the input is made, if the optimal population risk is OPT, we show that gradient descent achieves population risk O(OPT) + ε in polynomial time and sample complexity when σ is strictly increasing. For the ReLU activation, our population risk guarantee is O(OPT^1/2) + ε. When labels take the form y = σ(v·x) + ξ for zero-mean sub-Gaussian noise ξ, we show that the population risk guarantees for gradient descent improve to OPT + ε. Our sample complexity and runtime guarantees are (almost) dimension independent, and when σ is strictly increasing, require no distributional assumptions beyond boundedness. For ReLU, we show the same results under a nondegeneracy assumption for the marginal distribution of the input. Comment: 31 pages, 3 tables. This version improves the risk bound from O(OPT^1/2) to O(OPT) for strictly increasing activation functions
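The procedure analyzed here is ordinary gradient descent on the empirical square loss of a single neuron. A minimal sketch follows; the initialization, step size, and stopping rule are illustrative rather than the paper's.

```python
import numpy as np

def gd_single_neuron(X, y, sigma, sigma_prime, steps=1000, lr=0.05, seed=0):
    """Gradient descent on the empirical square loss
    (1/n) * sum_i (sigma(w . x_i) - y_i)^2 for a single neuron."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.standard_normal(d) / np.sqrt(d)        # small random starting point
    for _ in range(steps):
        z = X @ w
        residual = sigma(z) - y
        grad = (2.0 / n) * X.T @ (residual * sigma_prime(z))
        w -= lr * grad
    return w

# Example activation: ReLU with subgradient sigma'(z) = 1[z > 0].
relu = lambda z: np.maximum(z, 0.0)
relu_grad = lambda z: (z > 0).astype(float)
```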
Agnostic Learning of Halfspaces with Gradient Descent via Soft Margins
We analyze the properties of gradient descent on convex surrogates for the
zero-one loss for the agnostic learning of linear halfspaces. If OPT is the best classification error achieved by a halfspace, by appealing to the notion of soft margins we are able to show that gradient descent finds halfspaces with classification error Õ(OPT^1/2) + ε in poly(d, 1/ε) time and sample complexity for
a broad class of distributions that includes log-concave isotropic
distributions as a subclass. Along the way we answer a question recently posed
by Ji et al. (2020) on how the tail behavior of a loss function can affect
sample complexity and runtime guarantees for gradient descent. Comment: 25 pages, 1 table
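A hedged sketch of the procedure being analyzed: gradient descent on a convex surrogate of the zero-one loss (here the logistic loss, one member of the class of surrogates considered), with the learned weight vector defining the halfspace sign(w · x).

```python
import numpy as np

def gd_halfspace_surrogate(X, y, steps=1000, lr=0.1):
    """Gradient descent on the logistic surrogate
    (1/n) * sum_i log(1 + exp(-y_i * w . x_i)) of the zero-one loss."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)
        grad = -(1.0 / n) * X.T @ (y / (1.0 + np.exp(margins)))
        w -= lr * grad
    return w

def zero_one_error(w, X, y):
    """Classification error of the halfspace sign(w . x) on labels in {-1, +1}."""
    return np.mean(np.sign(X @ w) != y)
```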
Empirical Studies on the Properties of Linear Regions in Deep Neural Networks
A deep neural network (DNN) with piecewise linear activations can partition
the input space into numerous small linear regions, where different linear
functions are fitted. It is believed that the number of these regions
represents the expressivity of the DNN. This paper provides a novel and meticulous perspective on DNNs: instead of just counting the number of linear regions, we study their local properties, such as the inspheres,
the directions of the corresponding hyperplanes, the decision boundaries, and
the relevance of the surrounding regions. We empirically observed that
different optimization techniques lead to completely different linear regions,
even though they result in similar classification accuracies. We hope our study
can inspire the design of novel optimization techniques, and help discover and
analyze the behaviors of DNNs. Comment: Int'l. Conf. on Learning Representations (ICLR), Addis Ababa, Ethiopia, April 2020
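The local quantities studied here are all properties of a single linear region. The sketch below (with an assumed list-of-(W, b) layer format) recovers the activation pattern and the affine map a ReLU network computes on the region containing an input, from which the region's hyperplane directions and decision boundaries can be read off.

```python
import numpy as np

def region_affine_map(layers, x):
    """For a ReLU network given as a list of (W, b) pairs (the last pair is the
    linear output layer), return the layer-wise activation patterns at x and the
    pair (A, c) such that the network computes A @ x + c on the whole linear
    region containing x."""
    A, c, h = np.eye(len(x)), np.zeros(len(x)), x
    patterns = []
    for W, b in layers[:-1]:
        z = W @ h + b
        mask = (z > 0).astype(float)      # which ReLUs are active at x
        patterns.append(mask)
        A = mask[:, None] * (W @ A)       # compose the affine map, zeroing inactive units
        c = mask * (W @ c + b)
        h = mask * z
    W, b = layers[-1]
    return patterns, W @ A, W @ c + b
```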
On the Quality of the Initial Basin in Overspecified Neural Networks
Deep learning, in the form of artificial neural networks, has achieved
remarkable practical success in recent years, for a variety of difficult
machine learning applications. However, a theoretical explanation for this
remains a major open problem, since training neural networks involves
optimizing a highly non-convex objective function, and is known to be
computationally hard in the worst case. In this work, we study the
\emph{geometric} structure of the associated non-convex objective function, in
the context of ReLU networks and starting from a random initialization of the
network parameters. We identify some conditions under which it becomes more
favorable to optimization, in the sense of (i) High probability of initializing
at a point from which there is a monotonically decreasing path to a global
minimum; and (ii) High probability of initializing at a basin (suitably
defined) with a small minimal objective value. A common theme in our results is
that such properties are more likely to hold for larger ("overspecified")
networks, which accords with some recent empirical and theoretical
observations.
Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity
We develop a general duality between neural networks and compositional
kernels, striving towards a better understanding of deep learning. We show that
initial representations generated by common random initializations are
sufficiently rich to express all functions in the dual kernel space. Hence,
though the training objective is hard to optimize in the worst case, the
initial weights form a good starting point for optimization. Our dual view also
reveals a pragmatic and aesthetic perspective of neural networks and
underscores their expressive power.
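To make the duality concrete, here is a small illustrative sketch (function names and scalings are assumptions): with random Gaussian weights, inner products of the ReLU representations of two inputs concentrate around a fixed compositional kernel as the width grows, so training only the output layer over the random initial representation already reaches functions in the dual kernel space.

```python
import numpy as np

def random_relu_features(X, width, seed=0):
    """Random representation produced by a width-`width` ReLU layer with
    standard Gaussian initialization (scaled by 1/sqrt(input dim))."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((width, X.shape[1])) / np.sqrt(X.shape[1])
    return np.maximum(X @ W.T, 0.0) / np.sqrt(width)

def empirical_dual_kernel(X, width=4096):
    """Inner products of the random representations; as the width grows these
    concentrate around the (arc-cosine-type) kernel dual to the ReLU layer."""
    Phi = random_relu_features(X, width)
    return Phi @ Phi.T
```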
Provable Certificates for Adversarial Examples: Fitting a Ball in the Union of Polytopes
We propose a novel method for computing exact pointwise robustness of deep neural networks for all convex ℓ_p norms. Our algorithm, GeoCert, finds the largest ℓ_p ball centered at an input point x₀, within which the output class of a given neural network with ReLU nonlinearities remains
unchanged. We relate the problem of computing pointwise robustness of these
networks to that of computing the maximum norm ball with a fixed center that
can be contained in a non-convex polytope. This is a challenging problem in
general; however, we show that there exists an efficient algorithm to compute this for polyhedral complexes. Further, we show that piecewise linear neural
networks partition the input space into a polyhedral complex. Our algorithm has
the ability to almost immediately output a nontrivial lower bound to the
pointwise robustness which is iteratively improved until it ultimately becomes
tight. We empirically show that our approach generates distance lower bounds that are tighter than those of prior work, under moderate time constraints. Comment: Code can be found here:
https://github.com/revbucket/geometric-certificate
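GeoCert itself searches over the full polyhedral complex; the sketch below only illustrates the kind of cheap initial lower bound mentioned above, restricted to the ℓ2 norm and to the single linear region containing the input (inside that region the network is affine, so the label cannot change before crossing a region facet or a decision-boundary hyperplane of that affine map). The layer format and function name are assumptions, not the authors' implementation.

```python
import numpy as np

def robustness_lower_bound(layers, x):
    """Quick l2 lower bound on pointwise robustness at x for a ReLU network given
    as a list of (W, b) pairs, the last pair being the linear output layer."""
    A, c, h = np.eye(len(x)), np.zeros(len(x)), x
    best = np.inf
    for W, b in layers[:-1]:
        Z_A, z_c = W @ A, W @ c + b                  # pre-activations as affine maps of the input
        z = W @ h + b
        dists = np.abs(z) / (np.linalg.norm(Z_A, axis=1) + 1e-12)
        best = min(best, dists.min())                # distance to the nearest region facet so far
        mask = (z > 0).astype(float)
        A, c, h = mask[:, None] * Z_A, mask * z_c, mask * z
    W, b = layers[-1]
    Y_A, logits = W @ A, W @ h + b
    k = int(np.argmax(logits))
    for j in range(len(logits)):
        if j == k:
            continue
        gap = logits[k] - logits[j]                  # margin to class j inside this region
        best = min(best, gap / (np.linalg.norm(Y_A[k] - Y_A[j]) + 1e-12))
    return best
```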
Eigenvalue Decay Implies Polynomial-Time Learnability for Neural Networks
We consider the problem of learning function classes computed by neural
networks with various activations (e.g. ReLU or Sigmoid), a task believed to be
computationally intractable in the worst-case. A major open problem is to
understand the minimal assumptions under which these classes admit provably
efficient algorithms. In this work we show that a natural distributional
assumption corresponding to {\em eigenvalue decay} of the Gram matrix yields
polynomial-time algorithms in the non-realizable setting for expressive classes
of networks (e.g. feed-forward networks of ReLUs). We make no assumptions on
the structure of the network or the labels. Given sufficiently-strong
polynomial eigenvalue decay, we obtain {\em fully}-polynomial time algorithms
in {\em all} the relevant parameters with respect to square-loss. Milder decay
assumptions also lead to improved algorithms. This is the first purely
distributional assumption that leads to polynomial-time algorithms for networks
of ReLUs, even with one hidden layer. Further, unlike prior distributional
assumptions (e.g., the marginal distribution is Gaussian), eigenvalue decay has
been observed in practice on common data sets.
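The assumption can be checked empirically: compute the Gram matrix of a kernel on a sample from the marginal distribution and inspect how fast its eigenvalues decay. The sketch below is illustrative; the RBF kernel is a stand-in, not one of the compositional kernels dual to the networks considered in the paper.

```python
import numpy as np

def gram_eigenvalue_decay(X, kernel):
    """Eigenvalues of the kernel Gram matrix K_ij = kernel(x_i, x_j), sorted in
    decreasing order; plotting lambda_i against i shows how fast they decay."""
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    return np.sort(np.linalg.eigvalsh(K))[::-1]

# Example: inspect decay on a random sample with an RBF kernel as a stand-in.
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
decay = gram_eigenvalue_decay(np.random.default_rng(0).standard_normal((200, 10)), rbf)
```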