
    Hardness of Learning Neural Networks with Natural Weights

    Neural networks are nowadays highly successful despite strong hardness results. The existing hardness results focus on the network architecture, and assume that the network's weights are arbitrary. A natural approach to settle the discrepancy is to assume that the network's weights are "well-behaved" and possess some generic properties that may allow efficient learning. This approach is supported by the intuition that the weights in real-world networks are not arbitrary, but exhibit some "random-like" properties with respect to some "natural" distributions. We prove negative results in this regard, and show that for depth-2 networks, and many "natural" weight distributions such as the normal and the uniform distribution, most networks are hard to learn. Namely, there is no efficient learning algorithm that is provably successful for most weights and every input distribution. It follows that there is no generic property that holds with high probability in such random networks and allows efficient learning.
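
    The random-weights model in question can be made concrete with a short sketch; this is an illustration of the setting only, not of the paper's proof, and the width k, dimension d, and the choice of ReLU activation are arbitrary assumptions.

    ```python
    import numpy as np

    # Illustrative sketch of the setting (not the paper's proof): a depth-2 ReLU
    # network whose weights are drawn from a "natural" distribution, here N(0, 1).
    # The hardness result says that, for most such random draws, no efficient
    # algorithm provably learns the resulting function under every input distribution.
    rng = np.random.default_rng(0)
    d, k = 100, 50                        # input dimension and hidden width (arbitrary)
    W = rng.standard_normal((k, d))       # hidden-layer weights ~ N(0, 1)
    v = rng.standard_normal(k)            # output-layer weights ~ N(0, 1)

    def depth2_net(x):
        """Depth-2 network: x -> v^T ReLU(W x)."""
        return v @ np.maximum(W @ x, 0.0)

    print(depth2_net(rng.standard_normal(d)))
    ```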

    $\ell_1$-regularized Neural Networks are Improperly Learnable in Polynomial Time

    We study the improper learning of multi-layer neural networks. Suppose that the neural network to be learned has $k$ hidden layers and that the $\ell_1$-norm of the incoming weights of any neuron is bounded by $L$. We present a kernel-based method, such that with probability at least $1-\delta$, it learns a predictor whose generalization error is at most $\epsilon$ worse than that of the neural network. The sample complexity and the time complexity of the presented method are polynomial in the input dimension and in $(1/\epsilon, \log(1/\delta), F(k,L))$, where $F(k,L)$ is a function depending on $(k,L)$ and on the activation function, independent of the number of neurons. The algorithm applies to both sigmoid-like activation functions and ReLU-like activation functions. It implies that any sufficiently sparse neural network is learnable in polynomial time. Comment: 16 pages
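
    The structural condition can be read concretely as follows; this is a hedged sketch, not the paper's algorithm, and the helper l1_bounded, the layer shapes, and the bound L = 2.0 are illustrative assumptions.

    ```python
    import numpy as np

    # Hedged sketch of the structural condition (not the paper's algorithm): verify
    # that the l1-norm of the incoming weights of every neuron is at most L. The
    # kernel-based learner's complexity depends on (1/eps, log(1/delta), F(k, L)),
    # but not on the number of neurons, once this condition holds.
    def l1_bounded(weight_matrices, L):
        """Each matrix has one row per neuron, holding that neuron's incoming weights."""
        return all(np.abs(W).sum(axis=1).max() <= L for W in weight_matrices)

    rng = np.random.default_rng(0)
    hypothetical_net = [rng.uniform(-0.01, 0.01, size=(64, 128)),   # hidden layer 1
                        rng.uniform(-0.01, 0.01, size=(32, 64))]    # hidden layer 2
    print(l1_bounded(hypothetical_net, L=2.0))
    ```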

    Angular Visual Hardness

    Recent convolutional neural networks (CNNs) have led to impressive performance but often suffer from poor calibration. They tend to be overconfident, with the model confidence not always reflecting the underlying true ambiguity and hardness. In this paper, we propose angular visual hardness (AVH), a score given by the normalized angular distance between the sample feature embedding and the target classifier, to measure sample hardness. We validate this score with an in-depth and extensive scientific study, and observe that CNN models with the highest accuracy also have the best AVH scores. This agrees with an earlier finding that state-of-the-art models improve on the classification of harder examples. We observe that the training dynamics of AVH are vastly different from those of the training loss. Specifically, AVH quickly reaches a plateau for all samples even though the training loss keeps improving. This suggests the need for designing better loss functions that can target harder examples more effectively. We also find that AVH has a statistically significant correlation with human visual hardness. Finally, we demonstrate the benefit of AVH to a variety of applications such as self-training for domain adaptation and domain generalization.
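
    The score can be sketched as below; this reflects our reading of the abstract's definition rather than the authors' implementation, and the feature dimension, class count, and normalization over all class weights are assumptions.

    ```python
    import numpy as np

    # Minimal sketch of an AVH-style score (our reading of the abstract, not the
    # authors' code): angular distance between a sample's feature embedding and the
    # weight vector of its target class, normalized over all class weights.
    def angular_distance(u, v):
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.arccos(np.clip(cos, -1.0, 1.0))

    def avh(feature, class_weights, target):
        """feature: (d,) embedding; class_weights: (C, d); target: true class index."""
        dists = np.array([angular_distance(feature, w) for w in class_weights])
        return dists[target] / dists.sum()    # larger value = harder sample, under this reading

    rng = np.random.default_rng(0)
    f = rng.standard_normal(512)              # hypothetical feature embedding
    W = rng.standard_normal((10, 512))        # hypothetical 10-class classifier weights
    print(avh(f, W, target=3))
    ```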

    Complexity of Training ReLU Neural Network

    In this paper, we explore some basic questions on the complexity of training neural networks with the ReLU activation function. We show that it is NP-hard to train a two-hidden-layer feedforward ReLU neural network. If the dimension d of the data is fixed, then we show that there exists a polynomial-time algorithm for the same training problem. We also show that if sufficient over-parameterization is provided in the first hidden layer of a ReLU neural network, then there is a polynomial-time algorithm which finds weights such that the output of the over-parameterized ReLU neural network matches the output of the given data.

    Mixing Complexity and its Applications to Neural Networks

    We suggest analyzing neural networks through the prism of space constraints. We observe that most training algorithms applied in practice use bounded memory, which enables us to use a new notion introduced in the study of space-time tradeoffs that we call mixing complexity. This notion was devised in order to measure the (in)ability to learn using a bounded-memory algorithm. In this paper we describe how we use mixing complexity to obtain new results on what can and cannot be learned using neural networks.

    Learning Halfspaces and Neural Networks with Random Initialization

    We study non-convex empirical risk minimization for learning halfspaces and neural networks. For loss functions that are $L$-Lipschitz continuous, we present algorithms to learn halfspaces and multi-layer neural networks that achieve arbitrarily small excess risk $\epsilon>0$. The time complexity is polynomial in the input dimension $d$ and the sample size $n$, but exponential in the quantity $(L/\epsilon^2)\log(L/\epsilon)$. These algorithms run multiple rounds of random initialization followed by arbitrary optimization steps. We further show that if the data is separable by some neural network with constant margin $\gamma>0$, then there is a polynomial-time algorithm for learning a neural network that separates the training data with margin $\Omega(\gamma)$. As a consequence, the algorithm achieves arbitrary generalization error $\epsilon>0$ with ${\rm poly}(d,1/\epsilon)$ sample and time complexity. We establish the same learnability result when the labels are randomly flipped with probability $\eta<1/2$. Comment: 31 pages
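
    The algorithmic template, multiple rounds of random initialization followed by optimization, can be sketched as below; this is a hedged stand-in in which a logistic halfspace learner and plain gradient steps replace the paper's "arbitrary optimization steps", and all sizes are arbitrary.

    ```python
    import numpy as np

    # Schematic sketch of the template described above (a hedged stand-in: a
    # logistic halfspace learner and plain gradient steps replace the paper's
    # "arbitrary optimization steps"; sizes and step counts are arbitrary).
    def logistic_loss(w, X, y):
        return np.mean(np.logaddexp(0.0, -y * (X @ w)))

    def train_round(X, y, w0, steps=200, lr=0.1):
        w = w0.copy()
        for _ in range(steps):
            margins = np.minimum(y * (X @ w), 50.0)               # clip to avoid overflow in exp
            grad = X.T @ (-y / (1.0 + np.exp(margins))) / len(y)  # gradient of the logistic loss
            w -= lr * grad
        return w

    def random_restart_learner(X, y, rounds=10, seed=0):
        """Multiple rounds of random initialization; keep the lowest empirical risk."""
        rng = np.random.default_rng(seed)
        candidates = [train_round(X, y, rng.standard_normal(X.shape[1])) for _ in range(rounds)]
        return min(candidates, key=lambda w: logistic_loss(w, X, y))

    rng = np.random.default_rng(1)
    X = rng.standard_normal((500, 20))
    y = np.sign(X @ rng.standard_normal(20))     # toy, linearly separable labels
    print(logistic_loss(random_restart_learner(X, y), X, y))
    ```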

    On the Connection Between Learning Two-Layers Neural Networks and Tensor Decomposition

    We establish connections between the problem of learning a two-layer neural network and tensor decomposition. We consider a model with feature vectors $\boldsymbol x \in \mathbb R^d$, $r$ hidden units with weights $\{\boldsymbol w_i\}_{1\le i \le r}$ and output $y\in \mathbb R$, i.e., $y=\sum_{i=1}^r \sigma(\boldsymbol w_i^{\mathsf T}\boldsymbol x)$, with activation functions given by low-degree polynomials. In particular, if $\sigma(x) = a_0+a_1x+a_3x^3$, we prove that no polynomial-time learning algorithm can outperform the trivial predictor that assigns to each example the response variable $\mathbb E(y)$, when $d^{3/2}\ll r\ll d^2$. Our conclusion holds for a `natural data distribution', namely standard Gaussian feature vectors $\boldsymbol x$, and output distributed according to a two-layer neural network with random isotropic weights, and under a certain complexity-theoretic assumption on tensor decomposition. Roughly speaking, we assume that no polynomial-time algorithm can substantially outperform current methods for tensor decomposition based on the sum-of-squares hierarchy. We also prove generalizations of this statement for higher degree polynomial activations, and non-random weight vectors. Remarkably, several existing algorithms for learning two-layer networks with rigorous guarantees are based on tensor decomposition. Our results support the idea that this is indeed the core computational difficulty in learning such networks, under the stated generative model for the data. As a side result, we show that under this model learning the network requires accurate learning of its weights, a property that does not hold in a more general setting. Comment: 41 pages, 1 figure
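
    The generative model stated in the abstract is easy to simulate; the sketch below (with arbitrary values for $a_0, a_1, a_3$, $d$, $r$, and $n$) draws standard Gaussian features, random isotropic weights, and responses $y=\sum_{i=1}^r \sigma(\boldsymbol w_i^{\mathsf T}\boldsymbol x)$ with the degree-3 polynomial activation.

    ```python
    import numpy as np

    # Sketch of the generative model stated above (values of a0, a1, a3 and the
    # sizes d, r, n are arbitrary illustration choices): standard Gaussian features,
    # random isotropic weights, and y = sum_i sigma(w_i^T x) with the degree-3
    # polynomial activation sigma(t) = a0 + a1*t + a3*t**3.
    rng = np.random.default_rng(0)
    d, r, n = 50, 400, 1000                  # note d**1.5 ~ 354 < r = 400 < d**2 = 2500
    a0, a1, a3 = 0.1, 0.5, 0.2

    W = rng.standard_normal((r, d)) / np.sqrt(d)   # random isotropic weight vectors w_i
    X = rng.standard_normal((n, d))                # standard Gaussian feature vectors x

    def sigma(t):
        return a0 + a1 * t + a3 * t**3

    Y = sigma(X @ W.T).sum(axis=1)           # y = sum_{i=1}^r sigma(w_i^T x)

    # In the regime d^{3/2} << r << d^2, the result says no polynomial-time learner
    # substantially beats the trivial predictor E[y] on data generated this way.
    print(Y.mean())
    ```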

    Robust Optimization for Non-Convex Objectives

    We consider robust optimization problems, where the goal is to optimize in the worst case over a class of objective functions. We develop a reduction from robust improper optimization to Bayesian optimization: given an oracle that returns $\alpha$-approximate solutions for distributions over objectives, we compute a distribution over solutions that is $\alpha$-approximate in the worst case. We show that de-randomizing this solution is NP-hard in general, but can be done for a broad class of statistical learning tasks. We apply our results to robust neural network training and submodular optimization. We evaluate our approach experimentally on corrupted character classification, and robust influence maximization in networks.
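
    The reduction can be sketched schematically; in this hedged illustration the multiplicative-weights adversary over objectives, the exact oracle, and the toy objectives are our stand-ins, not the paper's construction.

    ```python
    import numpy as np

    # Hedged sketch of the reduction (the multiplicative-weights adversary, the
    # exact oracle, and the toy objectives below are illustrative stand-ins, not
    # the paper's construction): ask the oracle for a solution against the current
    # distribution over objectives, re-weight toward the objectives where that
    # solution does worst, and return the collected solutions as a uniform mixture.
    def robust_via_oracle(objectives, oracle, rounds=100, eta=0.5):
        """objectives: functions f(x) to maximize; oracle(weights) returns a solution
        that is (approximately) good for the weighted mixture of objectives."""
        weights = np.ones(len(objectives)) / len(objectives)
        solutions = []
        for _ in range(rounds):
            x = oracle(weights)
            solutions.append(x)
            values = np.array([f(x) for f in objectives])
            weights *= np.exp(-eta * values)          # up-weight poorly served objectives
            weights /= weights.sum()
        return solutions                              # play one uniformly at random

    # Toy usage: three objectives over a grid of candidates; the oracle exactly
    # maximizes the weighted mixture (alpha = 1 in this toy case).
    candidates = np.linspace(-1.0, 1.0, 21)
    objectives = [lambda x, c=c: -(x - c)**2 for c in (-0.8, 0.0, 0.9)]
    def oracle(weights):
        vals = [sum(w * f(x) for w, f in zip(weights, objectives)) for x in candidates]
        return candidates[int(np.argmax(vals))]
    print(np.mean(robust_via_oracle(objectives, oracle, rounds=50)))
    ```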

    The Limitations of Deep Learning in Adversarial Settings

    Deep learning takes advantage of large datasets and computationally efficient training algorithms to outperform other approaches at various machine learning tasks. However, imperfections in the training phase of deep neural networks make them vulnerable to adversarial samples: inputs crafted by adversaries with the intent of causing deep neural networks to misclassify. In this work, we formalize the space of adversaries against deep neural networks (DNNs) and introduce a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs. In an application to computer vision, we show that our algorithms can reliably produce samples correctly classified by human subjects but misclassified into specific targets by a DNN with a 97% adversarial success rate while modifying on average only 4.02% of the input features per sample. We then evaluate the vulnerability of different sample classes to adversarial perturbations by defining a hardness measure. Finally, we describe preliminary work outlining defenses against adversarial samples by defining a predictive measure of distance between a benign input and a target classification. Comment: Accepted to the 1st IEEE European Symposium on Security & Privacy, IEEE 2016. Saarbrucken, Germany
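
    A much-simplified illustration of the core idea, using the input-output mapping to choose a few features to perturb, appears below; a toy linear "network" replaces a DNN so the Jacobian is explicit, and this is not the authors' algorithm or their evaluation setup.

    ```python
    import numpy as np

    # Much-simplified illustration (a toy linear "network" replaces a DNN so the
    # input-output Jacobian is explicit; this is not the authors' algorithm or
    # their evaluation setup): use that mapping to rank input features by how
    # strongly they push the target class past the current prediction, then
    # perturb only a small fraction of them.
    def craft_adversarial(x, W, target, budget=0.04, step=1.0):
        """Perturb roughly `budget` of the features toward class `target`."""
        current = int(np.argmax(W @ x))
        saliency = W[target] - W[current]            # per-feature effect on target vs. current class
        k = max(1, int(budget * x.size))             # e.g. ~4% of the input features
        idx = np.argsort(-np.abs(saliency))[:k]      # most influential features
        x_adv = x.copy()
        x_adv[idx] += step * np.sign(saliency[idx])
        return x_adv

    rng = np.random.default_rng(0)
    W = rng.standard_normal((10, 784))               # hypothetical 10-class linear model
    x = rng.standard_normal(784)
    x_adv = craft_adversarial(x, W, target=7)
    print(np.argmax(W @ x), np.argmax(W @ x_adv))    # prediction before and after
    ```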

    Function Norms and Regularization in Deep Networks

    Deep neural networks (DNNs) have become increasingly important due to their excellent empirical performance on a wide range of problems. However, regularization is generally achieved by indirect means, largely due to the complex set of functions defined by a network and the difficulty of measuring function complexity. There exists no method in the literature for additive regularization based on a norm of the function, as is classically considered in statistical learning theory. In this work, we propose sampling-based approximations to weighted function norms as regularizers for deep neural networks. We provide, to the best of our knowledge, the first proof in the literature of the NP-hardness of computing function norms of DNNs, motivating the necessity of an approximate approach. We then derive a generalization bound for functions trained with weighted norms and prove that a natural stochastic optimization strategy minimizes the bound. Finally, we empirically validate the proposed regularization strategies on both convex function sets and DNNs for real-world classification and image segmentation tasks, demonstrating improved performance over weight decay, dropout, and batch normalization. Source code will be released at the time of publication. Comment: 17 pages, 8 figures
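
    A minimal sketch of a sampling-based estimate of a weighted function norm used as a regularizer follows; this is our reading of the abstract, and net, the weighting distribution sample_mu, and the Monte Carlo sample size are placeholder assumptions.

    ```python
    import numpy as np

    # Sketch of a sampling-based estimate of a weighted function norm used as a
    # regularizer (our minimal reading of the abstract; net, sample_mu, and the
    # Monte Carlo sample size are placeholders): draw points from the weighting
    # distribution mu and average the squared network output, approximating
    # the squared L2(mu) norm of the function.
    def sampled_l2_norm_sq(net, sample_mu, n_samples=256):
        xs = sample_mu(n_samples)
        outputs = np.array([net(x) for x in xs])
        return np.mean(np.sum(outputs**2, axis=-1))

    # Hypothetical use inside a training objective:
    #   loss = data_loss + lam * sampled_l2_norm_sq(net, sample_mu)
    rng = np.random.default_rng(0)
    W = 0.1 * rng.standard_normal((3, 8))
    net = lambda x: np.tanh(W @ x)                      # toy stand-in for a DNN
    sample_mu = lambda n: rng.standard_normal((n, 8))   # weighting distribution mu
    print(sampled_l2_norm_sq(net, sample_mu))
    ```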