302 research outputs found
EagleEye: Attack-Agnostic Defense against Adversarial Inputs (Technical Report)
Deep neural networks (DNNs) are inherently vulnerable to adversarial inputs:
such maliciously crafted samples trigger DNNs to misbehave, leading to
detrimental consequences for DNN-powered systems. The fundamental challenges of
mitigating adversarial inputs stem from their adaptive and variable nature.
Existing solutions attempt to improve DNN resilience against specific attacks;
yet, such static defenses can often be circumvented by adaptively engineered
inputs or by new attack variants.
Here, we present EagleEye, an attack-agnostic adversarial tampering analysis
engine for DNN-powered systems. Our design exploits the {\em minimality
principle} underlying many attacks: to maximize the attack's evasiveness, the
adversary often seeks the minimum possible distortion to convert genuine inputs
to adversarial ones. We show that this practice entails distinct
distributional properties of adversarial inputs in the input space. By
leveraging such properties in a principled manner, EagleEye effectively
discriminates adversarial inputs and even uncovers their correct classification
outputs. Through extensive empirical evaluation using a range of benchmark
datasets and DNN models, we validate EagleEye's efficacy. We further
investigate the adversary's possible countermeasures, which imply a difficult
dilemma for her: to evade EagleEye's detection, excessive distortion is
necessary, thereby significantly reducing the attack's evasiveness with respect
to other detection mechanisms.
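The minimality principle above lends itself to a small illustration. The following is a hypothetical sketch on a linear toy model (not EagleEye's actual analysis): minimally distorted adversarial inputs sit just across the decision boundary, so small random perturbations flip their predicted labels far more often than those of genuine inputs.

```python
import numpy as np

# Toy illustration of the minimality principle (hypothetical detector,
# not EagleEye itself). Genuine inputs lie well inside the decision
# region; a minimal-distortion attack moves them just past the boundary.
rng = np.random.default_rng(0)
w = np.array([1.0, 1.0])                 # a linear "model": sign(w @ x)

def predict(x):
    return np.sign(w @ x)

genuine = rng.normal(loc=2.0, scale=0.5, size=(50, 2))   # class +1 inputs
# Minimal attack on a linear model: step barely across the hyperplane.
adv = genuine - np.outer((genuine @ w) / (w @ w) + 0.01, w)

def flip_rate(x, trials=200, sigma=0.3):
    # Fraction of small random perturbations that change the prediction.
    noise = rng.normal(scale=sigma, size=(trials, 2))
    return np.mean(predict((x + noise).T) != predict(x))

g_rates = np.array([flip_rate(x) for x in genuine])
a_rates = np.array([flip_rate(x) for x in adv])
# Adversarial inputs are far less stable under random perturbation,
# which is one distributional property a defense can exploit.
```

Pushing the attack further from the boundary would evade this check, but only at the cost of larger, more conspicuous distortion, which is the dilemma the abstract describes.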
Gradient Descent for One-Hidden-Layer Neural Networks: Polynomial Convergence and SQ Lower Bounds
We study the complexity of training neural network models with one hidden
nonlinear activation layer and an output weighted sum layer. We analyze
Gradient Descent applied to learning a bounded target function on $n$
real-valued inputs. We give an agnostic learning guarantee for GD: starting
from a randomly initialized network, it converges in mean squared loss to the
minimum error (in $L_2$-norm) of the best approximation of the target function
using a polynomial of degree at most $k$. Moreover, for any $k$, the size of
the network and number of iterations needed are both bounded by
$n^{O(k)}$. In particular, this applies to training networks of
unbiased sigmoids and ReLUs. We also rigorously explain the empirical finding
that gradient descent discovers lower frequency Fourier components before
higher frequency components.
We complement this result with nearly matching lower bounds in the
Statistical Query model. GD fits well in the SQ framework since each training
step is determined by an expectation over the input distribution. We show that
any SQ algorithm that achieves significant improvement over a constant function
with queries of tolerance some inverse polynomial in the input dimensionality
$n$ must use $n^{\Omega(k)}$ queries even when the target functions are
restricted to a set of $n^{O(k)}$ degree-$k$ polynomials, and the input
distribution is uniform over the unit sphere; for this class the
information-theoretic lower bound is only $O(k \log n)$.
Our approach for both parts is based on spherical harmonics. We view gradient
descent as an operator on the space of functions, and study its dynamics. An
essential tool is the Funk-Hecke theorem, which explains the eigenfunctions of
this operator in the case of the mean squared loss.
Comment: Revised version now includes matching lower bound
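A toy numerical sketch in the spirit of the setting above (not the paper's analysis): fix randomly initialized hidden units, run gradient descent on the output weighted-sum layer for a bounded low-degree target on the unit sphere, and check that the mean squared loss decreases from a random start.

```python
import numpy as np

# One-hidden-layer network: random ReLU hidden units held fixed,
# gradient descent on the output weighted-sum layer (a toy sketch).
rng = np.random.default_rng(0)
n, d, m = 200, 5, 100            # samples, input dimension, hidden units

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # inputs on the unit sphere
y = X[:, 0] * X[:, 1]            # a bounded (degree-2 polynomial) target

W = rng.normal(size=(d, m))      # random hidden weights, held fixed
H = np.maximum(X @ W, 0.0)       # ReLU activations
a = np.zeros(m)                  # output weighted-sum layer

def mse(a):
    return np.mean((H @ a - y) ** 2)

loss0 = mse(a)
lam_max = 2.0 * np.linalg.norm(H, ord=2) ** 2 / n   # Hessian spectral norm
lr = 0.9 / lam_max                                  # stable step size
for _ in range(2000):
    a -= lr * (2.0 / n) * H.T @ (H @ a - y)         # gradient step on MSE
```

Since the problem is a convex quadratic in the output weights, GD with this step size decreases the loss monotonically; the slower convergence along small-eigenvalue directions mirrors the frequency-ordering phenomenon the abstract mentions.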
From Stochastic Mixability to Fast Rates
Empirical risk minimization (ERM) is a fundamental learning rule for
statistical learning problems where the data is generated according to some
unknown distribution $\mathsf{P}$ and returns a hypothesis $f$ chosen from a
fixed class $\mathcal{F}$ with small loss $\ell$. In the parametric setting,
depending upon $(\ell, \mathcal{F}, \mathsf{P})$ ERM can have slow
$O(1/\sqrt{n})$ or fast $O(1/n)$ rates of convergence of the excess risk as a
function of the sample size $n$. There exist several results that give
sufficient conditions for fast rates in terms of joint properties of $\ell$,
$\mathcal{F}$, and $\mathsf{P}$, such as the margin condition and the Bernstein
condition. In the non-statistical prediction with expert advice setting, there
is an analogous slow and fast rate phenomenon, and it is entirely characterized
in terms of the mixability of the loss (there being no role there for
$\mathcal{F}$ or $\mathsf{P}$). The notion of stochastic mixability builds a
bridge between these two models of learning, reducing to classical mixability
in a special case. The present paper presents a direct proof of fast rates for
ERM in terms of stochastic mixability of $(\ell, \mathcal{F}, \mathsf{P})$, and
in so doing provides new insight into the fast-rates phenomenon. The proof
exploits an old result of Kemperman on the solution to the general moment
problem. We also show a partial converse that suggests a characterization of
fast rates for ERM in terms of stochastic mixability is possible.
Comment: 21 pages, accepted to NIPS 201
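For the reader's reference, the stochastic mixability condition the abstract above relies on can be stated as follows (our reconstruction of the standard definition, with $\ell$ the loss, $\mathcal{F}$ the hypothesis class, $\mathsf{P}$ the data distribution, and $\eta > 0$ the mixability parameter):

```latex
% (l, F, P) is eta-stochastically mixable if some f* in F dominates
% every f in F in the following exponential-moment sense:
\exists f^* \in \mathcal{F} \;\; \forall f \in \mathcal{F}: \quad
\mathbb{E}_{Z \sim \mathsf{P}}
  \left[ e^{\eta \left( \ell(f^*, Z) - \ell(f, Z) \right)} \right] \le 1 .
```

Requiring the condition to hold for every distribution $\mathsf{P}$ recovers classical mixability of the loss, which is the sense in which the notion bridges the statistical and expert-advice settings.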
Fault Tolerance in Iterative-Convergent Machine Learning
Machine learning (ML) training algorithms often possess an inherent
self-correcting behavior due to their iterative-convergent nature. Recent
systems exploit this property to achieve adaptability and efficiency in
unreliable computing environments by relaxing the consistency of execution and
allowing calculation errors to be self-corrected during training. However, the
behavior of such systems is only well understood for specific types of
calculation errors, such as those caused by staleness, reduced precision, or
asynchronicity, and for specific types of training algorithms, such as
stochastic gradient descent. In this paper, we develop a general framework to
quantify the effects of calculation errors on iterative-convergent algorithms
and use this framework to design new strategies for checkpoint-based fault
tolerance. Our framework yields a worst-case upper bound on the iteration cost
of arbitrary perturbations to model parameters during training. Our system,
SCAR, employs strategies which reduce the iteration cost upper bound due to
perturbations incurred when recovering from checkpoints. We show that SCAR can
reduce the iteration cost of partial failures by 78% - 95% when compared with
traditional checkpoint-based fault tolerance across a variety of ML models and
training algorithms.
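The self-correcting behavior described above can be seen in a toy model (illustrative only; SCAR's actual strategies are more sophisticated): gradient descent on a quadratic recovers on its own after a partial failure rolls part of the model back to the last checkpoint.

```python
import numpy as np

# Toy iterative-convergent algorithm: gradient descent on a quadratic.
A = np.diag([1.0, 0.5, 0.1])
b = np.array([1.0, -2.0, 0.5])
x_star = np.linalg.solve(A, b)          # true optimum

def loss(x):
    return 0.5 * x @ A @ x - b @ x

def run(iters=300, fail_at=110, ckpt_every=25):
    x = np.zeros(3)
    ckpt = x.copy()
    for t in range(iters):
        if t % ckpt_every == 0:
            ckpt = x.copy()             # take a checkpoint
        if t == fail_at:
            x[0] = ckpt[0]              # partial failure: one parameter
                                        # slice restored from checkpoint
        x = x - 0.5 * (A @ x - b)       # gradient step (self-correcting)
    return x

x_final = run()
# Despite the mid-training perturbation, iterates still converge;
# the perturbation only adds iteration cost, which is what a bound on
# the iteration cost of parameter perturbations quantifies.
```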
On the Global Optimality of Model-Agnostic Meta-Learning
Model-agnostic meta-learning (MAML) formulates meta-learning as a bilevel
optimization problem, where the inner level solves each subtask based on a
shared prior, while the outer level searches for the optimal shared prior by
optimizing its aggregated performance over all the subtasks. Despite its
empirical success, MAML remains less understood in theory, especially in terms
of its global optimality, due to the nonconvexity of the meta-objective (the
outer-level objective). To bridge such a gap between theory and practice, we
characterize the optimality gap of the stationary points attained by MAML for
both reinforcement learning and supervised learning, where the inner-level and
outer-level problems are solved via first-order optimization methods. In
particular, our characterization connects the optimality gap of such stationary
points with (i) the functional geometry of inner-level objectives and (ii) the
representation power of function approximators, including linear models and
neural networks. To the best of our knowledge, our analysis establishes the
global optimality of MAML with nonconvex meta-objectives for the first time.
Comment: 41 pages; accepted to ICML; initial draft submitted in Feb, 202
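The bilevel structure described above can be made concrete with a toy example (an illustration, not the paper's analysis): each subtask is a scalar quadratic, the inner level takes one gradient step from a shared prior, and the outer level runs gradient descent on the aggregated post-adaptation loss. For quadratic subtasks the meta-objective happens to be convex, so the stationary point reached is global; the paper's contribution is characterizing the optimality gap when it is not.

```python
import numpy as np

# Toy MAML: subtask i is min_w 0.5*(w - c_i)^2 with hypothetical
# subtask optima c_i; w0 is the shared prior.
centers = np.array([-1.0, 0.0, 3.0])
alpha, beta = 0.4, 0.1                  # inner / outer step sizes

def inner_adapt(w0, c):
    return w0 - alpha * (w0 - c)        # one inner gradient step

def meta_objective(w0):                 # aggregated post-adaptation loss
    return np.mean([0.5 * (inner_adapt(w0, c) - c) ** 2 for c in centers])

w0 = 10.0
for _ in range(2000):
    # Exact meta-gradient here: d/dw0 of 0.5*((1-alpha)*(w0 - c))^2
    meta_grad = np.mean([(1 - alpha) ** 2 * (w0 - c) for c in centers])
    w0 -= beta * meta_grad
# w0 converges to the mean of the subtask optima, the global minimizer
# of this (convex) meta-objective.
```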
Catalyst Acceleration for Gradient-Based Non-Convex Optimization
We introduce a generic scheme to solve nonconvex optimization problems using
gradient-based algorithms originally designed for minimizing convex functions.
Even though these methods may originally require convexity to operate, the
proposed approach allows one to use them on weakly convex objectives, which
covers a large class of non-convex functions typically appearing in machine
learning and signal processing. In general, the scheme is guaranteed to produce
a stationary point with a worst-case efficiency typical of first-order methods,
and when the objective turns out to be convex, it automatically accelerates in
the sense of Nesterov and achieves near-optimal convergence rate in function
values. These properties are achieved without assuming any knowledge about the
convexity of the objective, by automatically adapting to the unknown weak
convexity constant. We conclude the paper by showing promising experimental
results obtained by applying our approach to incremental algorithms such as
SVRG and SAGA for sparse matrix factorization and for learning neural networks.
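The scheme above can be sketched in a few lines (a simplified illustration under assumed parameters, not the paper's algorithm): an outer loop repeatedly asks a base first-order method to approximately minimize the regularized surrogate $f(x) + \kappa/2\,\|x - y\|^2$, then extrapolates the anchor point in a Nesterov-like fashion.

```python
import numpy as np

def f(x):                          # a simple smooth objective for illustration
    return 0.5 * x @ x + np.sin(x[0])

def grad_f(x):
    g = x.copy()
    g[0] += np.cos(x[0])
    return g

def base_method(y, kappa, steps=100, lr=0.1):
    """Base method M: gradient descent on f(x) + kappa/2 * ||x - y||^2."""
    x = y.copy()
    for _ in range(steps):
        x -= lr * (grad_f(x) + kappa * (x - y))
    return x

def catalyst(x0, kappa=1.0, outer=30, momentum=0.5):
    # Outer loop: approximate proximal point + extrapolation.
    x_prev = x0.copy()
    y = x0.copy()
    for _ in range(outer):
        x = base_method(y, kappa)
        y = x + momentum * (x - x_prev)   # Nesterov-style extrapolation
        x_prev = x
    return x

x = catalyst(np.array([3.0, -2.0]))
# x approaches a stationary point of f without the caller ever
# supplying a convexity constant.
```

The added quadratic makes each subproblem well conditioned for the base method even when $f$ itself is only weakly convex, which is why convex solvers can be reused.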
Better Agnostic Clustering Via Relaxed Tensor Norms
We develop a new family of convex relaxations for $k$-means clustering based
on sum-of-squares norms, a relaxation of the injective tensor norm that is
efficiently computable using the Sum-of-Squares algorithm. We give an algorithm
based on this relaxation that recovers a faithful approximation to the true
means in the given data whenever the low-degree moments of the points in each
cluster have bounded sum-of-squares norms.
We then prove a sharp upper bound on the sum-of-squares norms for moment
tensors of any distribution that satisfies the \emph{Poincare inequality}. The
Poincare inequality is a central inequality in probability theory, and a large
class of distributions satisfy it including Gaussians, product distributions,
strongly log-concave distributions, and any sum or uniformly continuous
transformation of such distributions.
As an immediate corollary, for any $\epsilon > 0$, we obtain an efficient
algorithm for learning the means of a mixture of $k$ arbitrary Poincare
distributions in $\mathbb{R}^d$ in time $d^{O(1/\epsilon)}$ so long as the means
have separation $\Omega(k^{\epsilon})$. This in particular yields an algorithm
for learning Gaussian mixtures with separation $\Omega(k^{\epsilon})$, thus
partially resolving an open problem of Regev and Vijayaraghavan
\citet{regev2017learning}.
Our algorithm works even in the outlier-robust setting where an $\epsilon$
fraction of arbitrary outliers are added to the data, as long as the fraction
of outliers is smaller than the smallest cluster. We, therefore, obtain results
in the strong agnostic setting where, in addition to not knowing the
distribution family, the data itself may be arbitrarily corrupted.
Covariance Eigenvector Sparsity for Compression and Denoising
Sparsity in the eigenvectors of signal covariance matrices is exploited in
this paper for compression and denoising. Dimensionality reduction (DR) and
quantization modules, present in many practical compression schemes such as
transform codecs, are designed to capitalize on this form of sparsity and
achieve improved reconstruction performance compared to existing
sparsity-agnostic codecs. Using training data that may be noisy, a novel
sparsity-aware linear DR scheme is developed to fully exploit sparsity in the
covariance eigenvectors and form noise-resilient estimates of the principal
covariance eigenbasis. Sparsity is effected via norm-one regularization, and
the associated minimization problems are solved using computationally efficient
coordinate descent iterations. The resulting eigenspace estimator is shown
capable of identifying a subset of the unknown support of the eigenspace basis
vectors even when the observation noise covariance matrix is unknown, as long
as the noise power is sufficiently low. It is proved that the sparsity-aware
estimator is asymptotically normal, and the probability to correctly identify
the signal subspace basis support approaches one, as the number of training
data grows large. Simulations using synthetic data and images, corroborate that
the proposed algorithms achieve improved reconstruction quality relative to
alternatives.
Comment: IEEE Transactions on Signal Processing, 2012 (to appear)
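The idea of norm-one regularized eigenbasis estimation can be illustrated with a soft-thresholded power iteration on a planted sparse eigenvector (a sketch of the general approach, not the paper's coordinate-descent solver):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 20
v = np.zeros(d)
v[:3] = [0.8, -0.5, 0.33]               # sparse planted eigenvector
v /= np.linalg.norm(v)
C = 5.0 * np.outer(v, v) + np.eye(d)    # true signal-plus-noise covariance

# Noisy sample covariance estimated from training data.
X = rng.multivariate_normal(np.zeros(d), C, size=2000)
S = X.T @ X / len(X)

def soft(x, t):                          # soft-thresholding (l1 shrinkage)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

w = rng.normal(size=d)
for _ in range(20):                      # plain power steps to align
    w = S @ w
    w /= np.linalg.norm(w)
for _ in range(50):                      # power steps with l1 shrinkage
    w = soft(S @ w, 0.5)
    w /= np.linalg.norm(w)

support = np.flatnonzero(np.abs(w) > 0.05)   # recovered sparse support
```

The shrinkage step zeroes out coordinates driven only by observation noise, so the estimated eigenvector's support matches the planted one, mirroring the support-identification guarantee stated in the abstract.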
FedADMM: A Federated Primal-Dual Algorithm Allowing Partial Participation
Federated learning is a framework for distributed optimization that places
emphasis on communication efficiency. In particular, it follows a client-server
broadcast model and is particularly appealing because of its ability to
accommodate heterogeneity in client compute and storage resources, non-i.i.d.
data assumptions, and data privacy. Our contribution is to offer a new
federated learning algorithm, FedADMM, for solving non-convex composite
optimization problems with non-smooth regularizers. We prove convergence of
FedADMM for the case when not all clients are able to participate in a given
communication round, under a very general sampling model.
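A highly simplified consensus-ADMM sketch conveys the partial-participation idea (hypothetical smooth local objectives and server-side caching; the actual FedADMM updates, which also handle a non-smooth regularizer, differ):

```python
import numpy as np

# Client i holds f_i(x) = 0.5*(x - c_i)^2; only a random subset of
# clients participates in each communication round. The server caches
# each client's last (x_i + u_i) and averages the caches to form z.
rng = np.random.default_rng(1)
c = np.array([-2.0, 0.0, 1.0, 4.0, 7.0])   # local optima (client data)
n, rho = len(c), 1.0
x = np.zeros(n)             # local primal variables
u = np.zeros(n)             # scaled dual variables
cache = x + u               # server-side cache of client states
z = cache.mean()

for _ in range(600):
    active = rng.choice(n, size=2, replace=False)   # partial participation
    for i in active:
        # Local prox step: argmin_x f_i(x) + rho/2 * (x - z + u_i)^2
        x[i] = (c[i] + rho * (z - u[i])) / (1.0 + rho)
        u[i] += x[i] - z                             # dual update
        cache[i] = x[i] + u[i]
    z = cache.mean()        # server aggregation from cached states

# z approaches the global minimizer mean(c) of sum_i f_i even though
# most clients sit out any given round.
```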
Deep Relaxation: partial differential equations for optimizing deep neural networks
In this paper we establish a connection between non-convex optimization
methods for training deep neural networks and nonlinear partial differential
equations (PDEs). Relaxation techniques arising in statistical physics which
have already been used successfully in this context are reinterpreted as
solutions of a viscous Hamilton-Jacobi PDE. A stochastic control
interpretation allows us to prove that the modified algorithm performs better
in expectation than stochastic gradient descent. Well-known PDE regularity
results
allow us to analyze the geometry of the relaxed energy landscape, confirming
empirical evidence. The PDE is derived from a stochastic homogenization
problem, which arises in the implementation of the algorithm. The algorithms
scale well in practice and can effectively tackle the high dimensionality of
modern neural networks.