302 research outputs found

    EagleEye: Attack-Agnostic Defense against Adversarial Inputs (Technical Report)

    Deep neural networks (DNNs) are inherently vulnerable to adversarial inputs: such maliciously crafted samples trigger DNNs to misbehave, leading to detrimental consequences for DNN-powered systems. The fundamental challenges of mitigating adversarial inputs stem from their adaptive and variable nature. Existing solutions attempt to improve DNN resilience against specific attacks; yet, such static defenses can often be circumvented by adaptively engineered inputs or by new attack variants. Here, we present EagleEye, an attack-agnostic adversarial tampering analysis engine for DNN-powered systems. Our design exploits the minimality principle underlying many attacks: to maximize the attack's evasiveness, the adversary often seeks the minimum possible distortion to convert genuine inputs to adversarial ones. We show that this practice entails distinct distributional properties of adversarial inputs in the input space. By leveraging such properties in a principled manner, EagleEye effectively discriminates adversarial inputs and even uncovers their correct classification outputs. Through extensive empirical evaluation using a range of benchmark datasets and DNN models, we validate EagleEye's efficacy. We further investigate the adversary's possible countermeasures, which imply a difficult dilemma for her: to evade EagleEye's detection, excessive distortion is necessary, thereby significantly reducing the attack's evasiveness regarding other detection mechanisms.
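
The minimality principle described in this abstract can be illustrated on a toy model. The sketch below is not EagleEye and not any attack from the report: it assumes a linear classifier `sign(w·x + b)` as a stand-in for a DNN, and the function name `min_l2_flip` and the `overshoot` parameter are hypothetical. For such a classifier, the smallest L2 perturbation that changes the prediction moves the input straight onto the decision hyperplane, plus a tiny overshoot to cross it.

```python
import math


def min_l2_flip(w, b, x, overshoot=1e-6):
    """Smallest L2 perturbation that flips sign(w.x + b) for a linear classifier.

    Assumption: a linear model stands in for a DNN; this is an illustration
    of "minimum possible distortion", not the method of the report.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm_sq = sum(wi * wi for wi in w)
    # Project x onto the hyperplane w.x + b = 0, then step slightly past it.
    scale = -(1.0 + overshoot) * score / norm_sq
    return [scale * wi for wi in w]


w, b, x = [3.0, 4.0], 0.0, [1.0, 1.0]
delta = min_l2_flip(w, b, x)
x_adv = [xi + di for xi, di in zip(x, delta)]
score_before = sum(wi * xi for wi, xi in zip(w, x)) + b
score_after = sum(wi * xi for wi, xi in zip(w, x_adv)) + b
distortion = math.sqrt(sum(di * di for di in delta))
```

Because the perturbed input lands just past the boundary, it sits very close to the decision surface, which gives a rough intuition for why minimally distorted inputs might be distributed differently from genuine ones.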

    Gradient Descent for One-Hidden-Layer Neural Networks: Polynomial Convergence and SQ Lower Bounds

    We study the complexity of training neural network models with one hidden nonlinear activation layer and an output weighted sum layer. We analyze Gradient Descent applied to learning a bounded target function on $n$ real-valued inputs. We give an agnostic learning guarantee for GD: starting from a randomly initialized network, it converges in mean squared loss to the minimum error (in $2$-norm) of the best approximation of the target function using a polynomial of degree at most $k$. Moreover, for any $k$, the size of the network and number of iterations needed are both bounded by $n^{O(k)}\log(1/\epsilon)$. In particular, this applies to training networks of unbiased sigmoids and ReLUs. We also rigorously explain the empirical finding that gradient descent discovers lower frequency Fourier components before higher frequency components. We complement this result with nearly matching lower bounds in the Statistical Query model. GD fits well in the SQ framework since each training step is determined by an expectation over the input distribution. We show that any SQ algorithm that achieves significant improvement over a constant function with queries of tolerance some inverse polynomial in the input dimensionality $n$ must use $n^{\Omega(k)}$ queries even when the target functions are restricted to a set of $n^{O(k)}$ degree-$k$ polynomials, and the input distribution is uniform over the unit sphere; for this class the information-theoretic lower bound is only $\Theta(k \log n)$. Our approach for both parts is based on spherical harmonics. We view gradient descent as an operator on the space of functions, and study its dynamics. An essential tool is the Funk-Hecke theorem, which explains the eigenfunctions of this operator in the case of the mean squared loss. Comment: Revised version now includes matching lower bound.

    From Stochastic Mixability to Fast Rates

    Empirical risk minimization (ERM) is a fundamental learning rule for statistical learning problems where the data is generated according to some unknown distribution $\mathsf{P}$ and returns a hypothesis $f$ chosen from a fixed class $\mathcal{F}$ with small loss $\ell$. In the parametric setting, depending upon $(\ell, \mathcal{F}, \mathsf{P})$, ERM can have slow $(1/\sqrt{n})$ or fast $(1/n)$ rates of convergence of the excess risk as a function of the sample size $n$. There exist several results that give sufficient conditions for fast rates in terms of joint properties of $\ell$, $\mathcal{F}$, and $\mathsf{P}$, such as the margin condition and the Bernstein condition. In the non-statistical prediction with expert advice setting, there is an analogous slow and fast rate phenomenon, and it is entirely characterized in terms of the mixability of the loss $\ell$ (there being no role there for $\mathcal{F}$ or $\mathsf{P}$). The notion of stochastic mixability builds a bridge between these two models of learning, reducing to classical mixability in a special case. The present paper presents a direct proof of fast rates for ERM in terms of stochastic mixability of $(\ell, \mathcal{F}, \mathsf{P})$, and in so doing provides new insight into the fast-rates phenomenon. The proof exploits an old result of Kemperman on the solution to the general moment problem. We also show a partial converse that suggests a characterization of fast rates for ERM in terms of stochastic mixability is possible. Comment: 21 pages, accepted to NIPS 201

    Fault Tolerance in Iterative-Convergent Machine Learning

    Machine learning (ML) training algorithms often possess an inherent self-correcting behavior due to their iterative-convergent nature. Recent systems exploit this property to achieve adaptability and efficiency in unreliable computing environments by relaxing the consistency of execution and allowing calculation errors to be self-corrected during training. However, the behavior of such systems is only well understood for specific types of calculation errors, such as those caused by staleness, reduced precision, or asynchronicity, and for specific types of training algorithms, such as stochastic gradient descent. In this paper, we develop a general framework to quantify the effects of calculation errors on iterative-convergent algorithms and use this framework to design new strategies for checkpoint-based fault tolerance. Our framework yields a worst-case upper bound on the iteration cost of arbitrary perturbations to model parameters during training. Our system, SCAR, employs strategies which reduce the iteration cost upper bound due to perturbations incurred when recovering from checkpoints. We show that SCAR can reduce the iteration cost of partial failures by 78% to 95% when compared with traditional checkpoint-based fault tolerance across a variety of ML models and training algorithms.

    On the Global Optimality of Model-Agnostic Meta-Learning

    Model-agnostic meta-learning (MAML) formulates meta-learning as a bilevel optimization problem, where the inner level solves each subtask based on a shared prior, while the outer level searches for the optimal shared prior by optimizing its aggregated performance over all the subtasks. Despite its empirical success, MAML remains less understood in theory, especially in terms of its global optimality, due to the nonconvexity of the meta-objective (the outer-level objective). To bridge such a gap between theory and practice, we characterize the optimality gap of the stationary points attained by MAML for both reinforcement learning and supervised learning, where the inner-level and outer-level problems are solved via first-order optimization methods. In particular, our characterization connects the optimality gap of such stationary points with (i) the functional geometry of inner-level objectives and (ii) the representation power of function approximators, including linear models and neural networks. To the best of our knowledge, our analysis establishes the global optimality of MAML with nonconvex meta-objectives for the first time. Comment: 41 pages; accepted to ICML; initial draft submitted in Feb, 202

    Catalyst Acceleration for Gradient-Based Non-Convex Optimization

    We introduce a generic scheme to solve non-convex optimization problems using gradient-based algorithms originally designed for minimizing convex functions. Even though these methods may originally require convexity to operate, the proposed approach allows one to use them on weakly convex objectives, which covers a large class of non-convex functions typically appearing in machine learning and signal processing. In general, the scheme is guaranteed to produce a stationary point with a worst-case efficiency typical of first-order methods, and when the objective turns out to be convex, it automatically accelerates in the sense of Nesterov and achieves a near-optimal convergence rate in function values. These properties are achieved without assuming any knowledge about the convexity of the objective, by automatically adapting to the unknown weak convexity constant. We conclude the paper by showing promising experimental results obtained by applying our approach to incremental algorithms such as SVRG and SAGA for sparse matrix factorization and for learning neural networks.

    Better Agnostic Clustering Via Relaxed Tensor Norms

    We develop a new family of convex relaxations for $k$-means clustering based on sum-of-squares norms, a relaxation of the injective tensor norm that is efficiently computable using the Sum-of-Squares algorithm. We give an algorithm based on this relaxation that recovers a faithful approximation to the true means in the given data whenever the low-degree moments of the points in each cluster have bounded sum-of-squares norms. We then prove a sharp upper bound on the sum-of-squares norms for moment tensors of any distribution that satisfies the Poincaré inequality. The Poincaré inequality is a central inequality in probability theory, and a large class of distributions satisfy it including Gaussians, product distributions, strongly log-concave distributions, and any sum or uniformly continuous transformation of such distributions. As an immediate corollary, for any $\gamma > 0$, we obtain an efficient algorithm for learning the means of a mixture of $k$ arbitrary Poincaré distributions in $\mathbb{R}^d$ in time $d^{O(1/\gamma)}$ so long as the means have separation $\Omega(k^{\gamma})$. This in particular yields an algorithm for learning Gaussian mixtures with separation $\Omega(k^{\gamma})$, thus partially resolving an open problem of Regev and Vijayaraghavan (2017). Our algorithm works even in the outlier-robust setting where an $\epsilon$ fraction of arbitrary outliers are added to the data, as long as the fraction of outliers is smaller than the smallest cluster. We, therefore, obtain results in the strong agnostic setting where, in addition to not knowing the distribution family, the data itself may be arbitrarily corrupted.

    Covariance Eigenvector Sparsity for Compression and Denoising

    Sparsity in the eigenvectors of signal covariance matrices is exploited in this paper for compression and denoising. Dimensionality reduction (DR) and quantization modules, present in many practical compression schemes such as transform codecs, are designed to capitalize on this form of sparsity and achieve improved reconstruction performance compared to existing sparsity-agnostic codecs. Using training data that may be noisy, a novel sparsity-aware linear DR scheme is developed to fully exploit sparsity in the covariance eigenvectors and form noise-resilient estimates of the principal covariance eigenbasis. Sparsity is effected via norm-one regularization, and the associated minimization problems are solved using computationally efficient coordinate descent iterations. The resulting eigenspace estimator is shown capable of identifying a subset of the unknown support of the eigenspace basis vectors even when the observation noise covariance matrix is unknown, as long as the noise power is sufficiently low. It is proved that the sparsity-aware estimator is asymptotically normal, and the probability of correctly identifying the signal subspace basis support approaches one as the number of training data grows large. Simulations using synthetic data and images corroborate that the proposed algorithms achieve improved reconstruction quality relative to alternatives. Comment: IEEE Transactions on Signal Processing, 2012 (to appear)
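
The abstract above pairs norm-one regularization with coordinate descent. As a rough illustration of that pairing only, the sketch below solves a generic L1-regularized least-squares problem, minimizing `0.5 * ||y - Xw||^2 + lam * ||w||_1`, by cyclic coordinate descent with soft thresholding. It is an assumption-laden stand-in, not the paper's eigenspace estimator; the names `soft_threshold` and `lasso_coordinate_descent` are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Cyclic coordinate descent for 0.5 * ||y - Xw||^2 + lam * ||w||_1.

    Assumption: a generic lasso problem, used only to show how the L1
    penalty produces exact zeros; X is a list of rows, y a list of targets.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(d)]
    for _ in range(n_iters):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue  # an all-zero column cannot reduce the loss
            # Correlation of column j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(d) if k != j))
                for i in range(n)
            )
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w


# Orthonormal design, so each coordinate is solved in a single pass.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.5]
w = lasso_coordinate_descent(X, y, lam=1.0)
```

In this toy run the weak second coefficient is set exactly to zero while the strong first one is only shrunk, which is the support-selection behavior that makes the L1 penalty attractive for sparse estimates.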

    FedADMM: A Federated Primal-Dual Algorithm Allowing Partial Participation

    Federated learning is a framework for distributed optimization that places emphasis on communication efficiency. In particular, it follows a client-server broadcast model and is particularly appealing because of its ability to accommodate heterogeneity in client compute and storage resources, non-i.i.d. data assumptions, and data privacy. Our contribution is to offer a new federated learning algorithm, FedADMM, for solving non-convex composite optimization problems with non-smooth regularizers. We prove convergence of FedADMM for the case when not all clients are able to participate in a given communication round under a very general sampling model.

    Deep Relaxation: partial differential equations for optimizing deep neural networks

    In this paper we establish a connection between non-convex optimization methods for training deep neural networks and nonlinear partial differential equations (PDEs). Relaxation techniques arising in statistical physics which have already been used successfully in this context are reinterpreted as solutions of a viscous Hamilton-Jacobi PDE. A stochastic control interpretation allows us to prove that the modified algorithm performs better in expectation than stochastic gradient descent. Well-known PDE regularity results allow us to analyze the geometry of the relaxed energy landscape, confirming empirical evidence. The PDE is derived from a stochastic homogenization problem, which arises in the implementation of the algorithm. The algorithms scale well in practice and can effectively tackle the high dimensionality of modern neural networks.