
    Learning Halfspaces and Neural Networks with Random Initialization

    We study non-convex empirical risk minimization for learning halfspaces and neural networks. For loss functions that are $L$-Lipschitz continuous, we present algorithms to learn halfspaces and multi-layer neural networks that achieve arbitrarily small excess risk $\epsilon > 0$. The time complexity is polynomial in the input dimension $d$ and the sample size $n$, but exponential in the quantity $(L/\epsilon^2)\log(L/\epsilon)$. These algorithms run multiple rounds of random initialization followed by arbitrary optimization steps. We further show that if the data is separable by some neural network with constant margin $\gamma > 0$, then there is a polynomial-time algorithm for learning a neural network that separates the training data with margin $\Omega(\gamma)$. As a consequence, the algorithm achieves arbitrarily small generalization error $\epsilon > 0$ with $\mathrm{poly}(d, 1/\epsilon)$ sample and time complexity. We establish the same learnability result when the labels are randomly flipped with probability $\eta < 1/2$. Comment: 31 pages
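
    To make the meta-algorithm concrete for the simplest case, here is a minimal sketch, for halfspaces only, of repeated random initialization followed by ordinary (sub)gradient descent on a Lipschitz surrogate loss; the hinge loss, the number of rounds, and the step size are illustrative assumptions, not details taken from the paper.

        import numpy as np

        def train_halfspace_random_restarts(X, y, rounds=10, steps=200, lr=0.1, seed=0):
            # X: (n, d) inputs, y: (n,) labels in {-1, +1}
            rng = np.random.default_rng(seed)
            n, d = X.shape
            best_w, best_risk = None, np.inf
            for _ in range(rounds):
                w = rng.normal(size=d) / np.sqrt(d)           # fresh random initialization
                for _ in range(steps):                        # arbitrary optimization steps
                    margins = y * (X @ w)
                    # subgradient of the mean hinge loss max(0, 1 - y <w, x>)
                    grad = -(X * y[:, None])[margins < 1].sum(axis=0) / n
                    w -= lr * grad
                risk = np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))
                if risk < best_risk:                          # keep the best round
                    best_w, best_risk = w, risk
            return best_w, best_risk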

    On the Computational Efficiency of Training Neural Networks

    It is well known that neural networks are computationally hard to train. On the other hand, in practice, modern-day neural networks are trained efficiently using SGD and a variety of tricks that include different activation functions (e.g. ReLU), over-specification (i.e., training networks which are larger than needed), and regularization. In this paper we revisit the computational complexity of training neural networks from a modern perspective. We provide both positive and negative results, some of which yield new provably efficient and practical algorithms for training certain types of neural networks. Comment: Section 2 is revised due to a mistake

    Provable Generalization of SGD-trained Neural Networks of Any Width in the Presence of Adversarial Label Noise

    We consider a one-hidden-layer leaky ReLU network of arbitrary width trained by stochastic gradient descent (SGD) following an arbitrary initialization. We prove that SGD produces neural networks that have classification accuracy competitive with that of the best halfspace over the distribution for a broad class of distributions that includes log-concave isotropic and hard margin distributions. Equivalently, such networks can generalize when the data distribution is linearly separable but corrupted with adversarial label noise, despite the capacity to overfit. To the best of our knowledge, this is the first work to show that overparameterized neural networks trained by SGD can generalize when the data is corrupted with adversarial label noise. Comment: 30 pages, 10 figures
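
    As a rough illustration of the setting (not the paper's proof or exact setup), the following sketch trains a one-hidden-layer leaky ReLU network of a chosen width with plain SGD from a Gaussian initialization; the logistic loss, the fixed outer layer, and all hyperparameters are assumptions made for the example.

        import numpy as np

        def leaky_relu(z, alpha=0.1):
            return np.where(z > 0, z, alpha * z)

        def sgd_one_hidden_layer(X, y, width=512, epochs=5, lr=0.01, alpha=0.1, seed=0):
            # y[i] in {-1, +1}, possibly corrupted by adversarial label noise
            rng = np.random.default_rng(seed)
            n, d = X.shape
            W = rng.normal(size=(width, d)) / np.sqrt(d)               # hidden-layer weights
            a = rng.choice([-1.0, 1.0], size=width) / np.sqrt(width)   # outer layer, kept fixed
            for _ in range(epochs):
                for i in rng.permutation(n):
                    x, label = X[i], y[i]
                    pre = W @ x
                    out = a @ leaky_relu(pre, alpha)
                    # gradient of the logistic loss log(1 + exp(-label * out)) w.r.t. W
                    g_out = -label / (1.0 + np.exp(label * out))
                    g_pre = (g_out * a) * np.where(pre > 0, 1.0, alpha)
                    W -= lr * np.outer(g_pre, x)
            return W, a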

    Agnostic Learning of a Single Neuron with Gradient Descent

    We consider the problem of learning the best-fitting single neuron as measured by the expected square loss $\mathbb{E}_{(x,y)\sim \mathcal{D}}[(\sigma(w^\top x)-y)^2]$ over some unknown joint distribution $\mathcal{D}$ by using gradient descent to minimize the empirical risk induced by a set of i.i.d. samples $S \sim \mathcal{D}^n$. The activation function $\sigma$ is an arbitrary Lipschitz and non-decreasing function, making the optimization problem nonconvex and nonsmooth in general, and covers typical neural network activation functions and inverse link functions in the generalized linear model setting. In the agnostic PAC learning setting, where no assumption on the relationship between the labels $y$ and the input $x$ is made, if the optimal population risk is $\mathsf{OPT}$, we show that gradient descent achieves population risk $O(\mathsf{OPT}) + \epsilon$ in polynomial time and sample complexity when $\sigma$ is strictly increasing. For the ReLU activation, our population risk guarantee is $O(\mathsf{OPT}^{1/2}) + \epsilon$. When labels take the form $y = \sigma(v^\top x) + \xi$ for zero-mean sub-Gaussian noise $\xi$, we show that the population risk guarantees for gradient descent improve to $\mathsf{OPT} + \epsilon$. Our sample complexity and runtime guarantees are (almost) dimension independent, and when $\sigma$ is strictly increasing, require no distributional assumptions beyond boundedness. For ReLU, we show the same results under a nondegeneracy assumption for the marginal distribution of the input. Comment: 31 pages, 3 tables. This version improves the risk bound from $O(\mathsf{OPT}^{1/2})$ to $O(\mathsf{OPT})$ for strictly increasing activation functions
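
    Since the object of study here is simply gradient descent on the empirical square loss of a single neuron, a minimal sketch is easy to give; the ReLU activation, step size, and iteration count below are illustrative choices, and the subgradient at zero is taken to be zero.

        import numpy as np

        def gd_single_neuron(X, y, steps=500, lr=0.1, seed=0):
            # Full-batch gradient descent on (1/n) * sum_i (sigma(w.x_i) - y_i)^2 with sigma = ReLU
            rng = np.random.default_rng(seed)
            n, d = X.shape
            w = rng.normal(size=d) / np.sqrt(d)
            for _ in range(steps):
                pre = X @ w
                pred = np.maximum(pre, 0.0)
                # chain rule: sigma'(pre) is 1 where pre > 0 and 0 elsewhere
                grad = (2.0 / n) * (X.T @ ((pred - y) * (pre > 0)))
                w -= lr * grad
            return w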

    Agnostic Learning of Halfspaces with Gradient Descent via Soft Margins

    We analyze the properties of gradient descent on convex surrogates for the zero-one loss for the agnostic learning of linear halfspaces. If $\mathsf{OPT}$ is the best classification error achieved by a halfspace, then by appealing to the notion of soft margins we are able to show that gradient descent finds halfspaces with classification error $\tilde O(\mathsf{OPT}^{1/2}) + \varepsilon$ in $\mathrm{poly}(d, 1/\varepsilon)$ time and sample complexity for a broad class of distributions that includes log-concave isotropic distributions as a subclass. Along the way we answer a question recently posed by Ji et al. (2020) on how the tail behavior of a loss function can affect the sample complexity and runtime guarantees for gradient descent. Comment: 25 pages, 1 table
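
    For concreteness, here is a minimal sketch of gradient descent on one particular convex surrogate, the logistic loss; the choice of surrogate, step size, and iteration budget are assumptions made for illustration, whereas the paper's point is precisely that the surrogate's tail behavior affects the resulting guarantees.

        import numpy as np

        def gd_convex_surrogate(X, y, steps=1000, lr=0.05):
            # Gradient descent on (1/n) * sum_i log(1 + exp(-y_i <w, x_i>)), then report the 0-1 error
            n, d = X.shape
            w = np.zeros(d)
            for _ in range(steps):
                margins = y * (X @ w)
                grad = -(X.T @ (y / (1.0 + np.exp(margins)))) / n
                w -= lr * grad
            zero_one_error = np.mean(np.sign(X @ w) != y)
            return w, zero_one_error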

    Empirical Studies on the Properties of Linear Regions in Deep Neural Networks

    A deep neural network (DNN) with piecewise linear activations can partition the input space into numerous small linear regions, where different linear functions are fitted. It is believed that the number of these regions represents the expressivity of the DNN. This paper provides a novel and meticulous perspective on DNNs: instead of just counting the number of linear regions, we study their local properties, such as the inspheres, the directions of the corresponding hyperplanes, the decision boundaries, and the relevance of the surrounding regions. We empirically observe that different optimization techniques lead to completely different linear regions, even though they result in similar classification accuracies. We hope our study can inspire the design of novel optimization techniques, and help discover and analyze the behaviors of DNNs. Comment: Int'l Conf. on Learning Representations (ICLR), Addis Ababa, Ethiopia, April 2020
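
    A linear region of a piecewise linear network is determined by the on/off pattern of its activations, so one way to probe such regions empirically is to compare activation patterns; the sketch below assumes a plain fully connected ReLU network given as lists of weight matrices and bias vectors, which is an illustrative simplification rather than the paper's experimental setup.

        import numpy as np

        def activation_pattern(weights, biases, x):
            # The tuple of ReLU signs at x identifies the linear region containing x;
            # two inputs with identical patterns are fitted by the same linear function.
            pattern = []
            h = x
            for W, b in zip(weights[:-1], biases[:-1]):   # hidden layers only
                pre = W @ h + b
                pattern.append(tuple(bool(v) for v in (pre > 0)))
                h = np.maximum(pre, 0.0)
            return tuple(pattern)

        # Same-region test: activation_pattern(Ws, bs, x1) == activation_pattern(Ws, bs, x2)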

    On the Quality of the Initial Basin in Overspecified Neural Networks

    Deep learning, in the form of artificial neural networks, has achieved remarkable practical success in recent years, for a variety of difficult machine learning applications. However, a theoretical explanation for this remains a major open problem, since training neural networks involves optimizing a highly non-convex objective function, and is known to be computationally hard in the worst case. In this work, we study the \emph{geometric} structure of the associated non-convex objective function, in the context of ReLU networks and starting from a random initialization of the network parameters. We identify some conditions under which it becomes more favorable to optimization, in the sense of (i) a high probability of initializing at a point from which there is a monotonically decreasing path to a global minimum; and (ii) a high probability of initializing in a basin (suitably defined) with a small minimal objective value. A common theme in our results is that such properties are more likely to hold for larger ("overspecified") networks, which accords with some recent empirical and theoretical observations.

    Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity

    We develop a general duality between neural networks and compositional kernels, striving towards a better understanding of deep learning. We show that initial representations generated by common random initializations are sufficiently rich to express all functions in the dual kernel space. Hence, though the training objective is hard to optimize in the worst case, the initial weights form a good starting point for optimization. Our dual view also reveals a pragmatic and aesthetic perspective on neural networks and underscores their expressive power.
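
    One way to see the dual view concretely is that the representation produced by a random first layer already induces a kernel; the sketch below, which is an interpretation rather than the paper's construction, estimates such a kernel with random Gaussian weights and a ReLU, with the width and scaling chosen only for illustration.

        import numpy as np

        def random_init_kernel(X1, X2, width=4096, seed=0):
            # phi(x) = relu(W x) / sqrt(width) with random Gaussian W; the Gram matrix of
            # these random representations approximates a fixed compositional kernel.
            rng = np.random.default_rng(seed)
            d = X1.shape[1]
            W = rng.normal(size=(width, d))
            phi1 = np.maximum(X1 @ W.T, 0.0) / np.sqrt(width)
            phi2 = np.maximum(X2 @ W.T, 0.0) / np.sqrt(width)
            return phi1 @ phi2.T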

    Provable Certificates for Adversarial Examples: Fitting a Ball in the Union of Polytopes

    We propose a novel method for computing the exact pointwise robustness of deep neural networks for all convex $\ell_p$ norms. Our algorithm, GeoCert, finds the largest $\ell_p$ ball centered at an input point $x_0$ within which the output class of a given neural network with ReLU nonlinearities remains unchanged. We relate the problem of computing the pointwise robustness of these networks to that of computing the maximum norm ball with a fixed center that can be contained in a non-convex polytope. This is a challenging problem in general; however, we show that there exists an efficient algorithm to compute this for polyhedral complexes. Further, we show that piecewise linear neural networks partition the input space into a polyhedral complex. Our algorithm almost immediately outputs a nontrivial lower bound on the pointwise robustness, which is iteratively improved until it ultimately becomes tight. We empirically show that our approach generates distance lower bounds that are tighter than those of prior work, under moderate time constraints. Comment: Code can be found here: https://github.com/revbucket/geometric-certificate
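
    The "almost immediate" lower bound has a simple geometric reading: inside the linear region containing $x_0$ the network is affine, so the distance from $x_0$ to the nearest region facet or decision hyperplane certifies a robustness radius. The sketch below computes that one-region $\ell_2$ bound for a dense ReLU network; it is not the GeoCert algorithm itself, which iteratively expands beyond the initial region, and the layer format is an assumption made for the example.

        import numpy as np

        def certified_radius_in_region(weights, biases, x0):
            # Track the pre-activations of each layer as affine functions A x + b of the input,
            # restricted to the linear region that contains x0.
            A, b = np.eye(x0.shape[0]), np.zeros(x0.shape[0])
            dists = []
            for W, c in zip(weights[:-1], biases[:-1]):
                A_pre, b_pre = W @ A, W @ b + c
                vals = A_pre @ x0 + b_pre
                # l2 distance from x0 to each neuron's switching hyperplane (a region facet)
                dists.extend(np.abs(vals) / (np.linalg.norm(A_pre, axis=1) + 1e-12))
                mask = (vals > 0).astype(float)               # freeze the activation pattern at x0
                A, b = mask[:, None] * A_pre, mask * b_pre
            A_out, b_out = weights[-1] @ A, weights[-1] @ b + biases[-1]
            logits = A_out @ x0 + b_out
            top = int(np.argmax(logits))
            for j in range(len(logits)):
                if j != top:
                    # distance to the hyperplane where class j overtakes the predicted class
                    diff_a, diff_b = A_out[top] - A_out[j], b_out[top] - b_out[j]
                    dists.append((diff_a @ x0 + diff_b) / (np.linalg.norm(diff_a) + 1e-12))
            return float(min(dists))                          # certified l2 radius (lower bound)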

    Eigenvalue Decay Implies Polynomial-Time Learnability for Neural Networks

    We consider the problem of learning function classes computed by neural networks with various activations (e.g. ReLU or Sigmoid), a task believed to be computationally intractable in the worst case. A major open problem is to understand the minimal assumptions under which these classes admit provably efficient algorithms. In this work we show that a natural distributional assumption corresponding to \emph{eigenvalue decay} of the Gram matrix yields polynomial-time algorithms in the non-realizable setting for expressive classes of networks (e.g. feed-forward networks of ReLUs). We make no assumptions on the structure of the network or the labels. Given sufficiently strong polynomial eigenvalue decay, we obtain \emph{fully}-polynomial time algorithms in \emph{all} the relevant parameters with respect to the square loss. Milder decay assumptions also lead to improved algorithms. This is the first purely distributional assumption that leads to polynomial-time algorithms for networks of ReLUs, even with one hidden layer. Further, unlike prior distributional assumptions (e.g., that the marginal distribution is Gaussian), eigenvalue decay has been observed in practice on common data sets.
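
    Checking such an assumption on real data is straightforward: estimate the spectrum of the (normalized) Gram matrix on a sample and inspect how fast it decays. The snippet below is a minimal sketch with a linear kernel as a stand-in; any kernel induced by the network's activations could be substituted.

        import numpy as np

        def gram_eigenvalue_decay(X, kernel=lambda A, B: A @ B.T):
            # Eigenvalues of the normalized Gram matrix, sorted from largest to smallest
            K = kernel(X, X) / X.shape[0]
            return np.linalg.eigvalsh(K)[::-1]

        # Polynomial decay would show up as a roughly linear plot of
        # np.log(eigvals) against np.log(np.arange(1, len(eigvals) + 1)).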