302 research outputs found
EagleEye: Attack-Agnostic Defense against Adversarial Inputs (Technical Report)
Deep neural networks (DNNs) are inherently vulnerable to adversarial inputs:
such maliciously crafted samples trigger DNNs to misbehave, leading to
detrimental consequences for DNN-powered systems. The fundamental challenges of
mitigating adversarial inputs stem from their adaptive and variable nature.
Existing solutions attempt to improve DNN resilience against specific attacks;
yet, such static defenses can often be circumvented by adaptively engineered
inputs or by new attack variants.
Here, we present EagleEye, an attack-agnostic adversarial tampering analysis
engine for DNN-powered systems. Our design exploits the {\em minimality
principle} underlying many attacks: to maximize the attack's evasiveness, the
adversary often seeks the minimum possible distortion to convert genuine inputs
to adversarial ones. We show that this practice entails distinct
distributional properties of adversarial inputs in the input space. By
leveraging such properties in a principled manner, EagleEye effectively
discriminates adversarial inputs and even uncovers their correct classification
outputs. Through extensive empirical evaluation using a range of benchmark
datasets and DNN models, we validate EagleEye's efficacy. We further
investigate the adversary's possible countermeasures, which imply a difficult
dilemma for her: to evade EagleEye's detection, excessive distortion is
necessary, thereby significantly reducing the attack's evasiveness with respect
to other detection mechanisms.
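The minimality principle above lends itself to a small illustration. The following is a hypothetical sketch on a linear toy model (not EagleEye's actual analysis): minimally distorted adversarial inputs sit just across the decision boundary, so small random perturbations flip their predicted labels far more often than those of genuine inputs.

```python
import numpy as np

# Toy illustration of the minimality principle (hypothetical detector,
# not EagleEye itself). Genuine inputs lie well inside the decision
# region; a minimal-distortion attack moves them just past the boundary.
rng = np.random.default_rng(0)
w = np.array([1.0, 1.0])                 # a linear "model": sign(w @ x)

def predict(x):
    return np.sign(w @ x)

genuine = rng.normal(loc=2.0, scale=0.5, size=(50, 2))   # class +1 inputs
# Minimal attack on a linear model: step barely across the hyperplane.
adv = genuine - np.outer((genuine @ w) / (w @ w) + 0.01, w)

def flip_rate(x, trials=200, sigma=0.3):
    # Fraction of small random perturbations that change the prediction.
    noise = rng.normal(scale=sigma, size=(trials, 2))
    return np.mean(predict((x + noise).T) != predict(x))

g_rates = np.array([flip_rate(x) for x in genuine])
a_rates = np.array([flip_rate(x) for x in adv])
# Adversarial inputs are far less stable under random perturbation,
# which is one distributional property a defense can exploit.
```

Pushing the attack further from the boundary would evade this check, but only at the cost of larger, more conspicuous distortion, which is the dilemma the abstract describes.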
Gradient Descent for One-Hidden-Layer Neural Networks: Polynomial Convergence and SQ Lower Bounds
We study the complexity of training neural network models with one hidden
nonlinear activation layer and an output weighted sum layer. We analyze
Gradient Descent applied to learning a bounded target function on $n$
real-valued inputs. We give an agnostic learning guarantee for GD: starting
from a randomly initialized network, it converges in mean squared loss to the
minimum error (in $L_2$-norm) of the best approximation of the target function
using a polynomial of degree at most $k$. Moreover, for any $k$, the size of
the network and number of iterations needed are both bounded by
$n^{O(k)}$. In particular, this applies to training networks of
unbiased sigmoids and ReLUs. We also rigorously explain the empirical finding
that gradient descent discovers lower frequency Fourier components before
higher frequency components.
We complement this result with nearly matching lower bounds in the
Statistical Query model. GD fits well in the SQ framework since each training
step is determined by an expectation over the input distribution. We show that
any SQ algorithm that achieves significant improvement over a constant function
with queries of tolerance some inverse polynomial in the input dimensionality
$n$ must use $n^{\Omega(k)}$ queries even when the target functions are
restricted to a set of $n^{O(k)}$ degree-$k$ polynomials, and the input
distribution is uniform over the unit sphere; for this class the
information-theoretic lower bound is only $O(k \log n)$.
Our approach for both parts is based on spherical harmonics. We view gradient
descent as an operator on the space of functions, and study its dynamics. An
essential tool is the Funk-Hecke theorem, which explains the eigenfunctions of
this operator in the case of the mean squared loss.
Comment: Revised version now includes matching lower bound
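A toy numerical sketch in the spirit of the setting above (not the paper's analysis): fix randomly initialized hidden units, run gradient descent on the output weighted-sum layer for a bounded low-degree target on the unit sphere, and check that the mean squared loss decreases from a random start.

```python
import numpy as np

# One-hidden-layer network: random ReLU hidden units held fixed,
# gradient descent on the output weighted-sum layer (a toy sketch).
rng = np.random.default_rng(0)
n, d, m = 200, 5, 100            # samples, input dimension, hidden units

X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # inputs on the unit sphere
y = X[:, 0] * X[:, 1]            # a bounded (degree-2 polynomial) target

W = rng.normal(size=(d, m))      # random hidden weights, held fixed
H = np.maximum(X @ W, 0.0)       # ReLU activations
a = np.zeros(m)                  # output weighted-sum layer

def mse(a):
    return np.mean((H @ a - y) ** 2)

loss0 = mse(a)
lam_max = 2.0 * np.linalg.norm(H, ord=2) ** 2 / n   # Hessian spectral norm
lr = 0.9 / lam_max                                  # stable step size
for _ in range(2000):
    a -= lr * (2.0 / n) * H.T @ (H @ a - y)         # gradient step on MSE
```

Since the problem is a convex quadratic in the output weights, GD with this step size decreases the loss monotonically; the slower convergence along small-eigenvalue directions mirrors the frequency-ordering phenomenon the abstract mentions.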
From Stochastic Mixability to Fast Rates
Empirical risk minimization (ERM) is a fundamental learning rule for
statistical learning problems where the data is generated according to some
unknown distribution $\mathsf{P}$ and returns a hypothesis $f$ chosen from a
fixed class $\mathcal{F}$ with small loss $\ell$. In the parametric setting,
depending upon $(\ell, \mathcal{F}, \mathsf{P})$ ERM can have slow
$O(1/\sqrt{n})$ or fast $O(1/n)$ rates of convergence of the excess risk as a
function of the sample size $n$. There exist several results that give
sufficient conditions for fast rates in terms of joint properties of $\ell$,
$\mathcal{F}$, and $\mathsf{P}$, such as the margin condition and the Bernstein
condition. In the non-statistical prediction with expert advice setting, there
is an analogous slow and fast rate phenomenon, and it is entirely characterized
in terms of the mixability of the loss (there being no role there for
$\mathcal{F}$ or $\mathsf{P}$). The notion of stochastic mixability builds a
bridge between these two models of learning, reducing to classical mixability
in a special case. The present paper presents a direct proof of fast rates for
ERM in terms of stochastic mixability of $(\ell, \mathcal{F}, \mathsf{P})$, and
in so doing provides new insight into the fast-rates phenomenon. The proof
exploits an old result of Kemperman on the solution to the general moment
problem. We also show a partial converse that suggests a characterization of
fast rates for ERM in terms of stochastic mixability is possible.
Comment: 21 pages, accepted to NIPS 201
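For the reader's reference, the stochastic mixability condition the abstract above relies on can be stated as follows (our reconstruction of the standard definition, with $\ell$ the loss, $\mathcal{F}$ the hypothesis class, $\mathsf{P}$ the data distribution, and $\eta > 0$ the mixability parameter):

```latex
% (l, F, P) is eta-stochastically mixable if some f* in F dominates
% every f in F in the following exponential-moment sense:
\exists f^* \in \mathcal{F} \;\; \forall f \in \mathcal{F}: \quad
\mathbb{E}_{Z \sim \mathsf{P}}
  \left[ e^{\eta \left( \ell(f^*, Z) - \ell(f, Z) \right)} \right] \le 1 .
```

Requiring the condition to hold for every distribution $\mathsf{P}$ recovers classical mixability of the loss, which is the sense in which the notion bridges the statistical and expert-advice settings.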
Fault Tolerance in Iterative-Convergent Machine Learning
Machine learning (ML) training algorithms often possess an inherent
self-correcting behavior due to their iterative-convergent nature. Recent
systems exploit this property to achieve adaptability and efficiency in
unreliable computing environments by relaxing the consistency of execution and
allowing calculation errors to be self-corrected during training. However, the
behavior of such systems is only well understood for specific types of
calculation errors, such as those caused by staleness, reduced precision, or
asynchronicity, and for specific types of training algorithms, such as
stochastic gradient descent. In this paper, we develop a general framework to
quantify the effects of calculation errors on iterative-convergent algorithms
and use this framework to design new strategies for checkpoint-based fault
tolerance. Our framework yields a worst-case upper bound on the iteration cost
of arbitrary perturbations to model parameters during training. Our system,
SCAR, employs strategies which reduce the iteration cost upper bound due to
perturbations incurred when recovering from checkpoints. We show that SCAR can
reduce the iteration cost of partial failures by 78% - 95% when compared with
traditional checkpoint-based fault tolerance across a variety of ML models and
training algorithms.
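The self-correcting behavior described above can be seen in a toy model (illustrative only; SCAR's actual strategies are more sophisticated): gradient descent on a quadratic recovers on its own after a partial failure rolls part of the model back to the last checkpoint.

```python
import numpy as np

# Toy iterative-convergent algorithm: gradient descent on a quadratic.
A = np.diag([1.0, 0.5, 0.1])
b = np.array([1.0, -2.0, 0.5])
x_star = np.linalg.solve(A, b)          # true optimum

def loss(x):
    return 0.5 * x @ A @ x - b @ x

def run(iters=300, fail_at=110, ckpt_every=25):
    x = np.zeros(3)
    ckpt = x.copy()
    for t in range(iters):
        if t % ckpt_every == 0:
            ckpt = x.copy()             # take a checkpoint
        if t == fail_at:
            x[0] = ckpt[0]              # partial failure: one parameter
                                        # slice restored from checkpoint
        x = x - 0.5 * (A @ x - b)       # gradient step (self-correcting)
    return x

x_final = run()
# Despite the mid-training perturbation, iterates still converge;
# the perturbation only adds iteration cost, which is what a bound on
# the iteration cost of parameter perturbations quantifies.
```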
On the Global Optimality of Model-Agnostic Meta-Learning
Model-agnostic meta-learning (MAML) formulates meta-learning as a bilevel
optimization problem, where the inner level solves each subtask based on a
shared prior, while the outer level searches for the optimal shared prior by
optimizing its aggregated performance over all the subtasks. Despite its
empirical success, MAML remains less understood in theory, especially in terms
of its global optimality, due to the nonconvexity of the meta-objective (the
outer-level objective). To bridge such a gap between theory and practice, we
characterize the optimality gap of the stationary points attained by MAML for
both reinforcement learning and supervised learning, where the inner-level and
outer-level problems are solved via first-order optimization methods. In
particular, our characterization connects the optimality gap of such stationary
points with (i) the functional geometry of inner-level objectives and (ii) the
representation power of function approximators, including linear models and
neural networks. To the best of our knowledge, our analysis establishes the
global optimality of MAML with nonconvex meta-objectives for the first time.
Comment: 41 pages; accepted to ICML; initial draft submitted in Feb, 202
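The bilevel structure described above can be made concrete with a toy example (an illustration, not the paper's analysis): each subtask is a scalar quadratic, the inner level takes one gradient step from a shared prior, and the outer level runs gradient descent on the aggregated post-adaptation loss. For quadratic subtasks the meta-objective happens to be convex, so the stationary point reached is global; the paper's contribution is characterizing the optimality gap when it is not.

```python
import numpy as np

# Toy MAML: subtask i is min_w 0.5*(w - c_i)^2 with hypothetical
# subtask optima c_i; w0 is the shared prior.
centers = np.array([-1.0, 0.0, 3.0])
alpha, beta = 0.4, 0.1                  # inner / outer step sizes

def inner_adapt(w0, c):
    return w0 - alpha * (w0 - c)        # one inner gradient step

def meta_objective(w0):                 # aggregated post-adaptation loss
    return np.mean([0.5 * (inner_adapt(w0, c) - c) ** 2 for c in centers])

w0 = 10.0
for _ in range(2000):
    # Exact meta-gradient here: d/dw0 of 0.5*((1-alpha)*(w0 - c))^2
    meta_grad = np.mean([(1 - alpha) ** 2 * (w0 - c) for c in centers])
    w0 -= beta * meta_grad
# w0 converges to the mean of the subtask optima, the global minimizer
# of this (convex) meta-objective.
```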
Catalyst Acceleration for Gradient-Based Non-Convex Optimization
We introduce a generic scheme to solve nonconvex optimization problems using
gradient-based algorithms originally designed for minimizing convex functions.
Even though these methods may originally require convexity to operate, the
proposed approach allows one to use them on weakly convex objectives, which
covers a large class of non-convex functions typically appearing in machine
learning and signal processing. In general, the scheme is guaranteed to produce
a stationary point with a worst-case efficiency typical of first-order methods,
and when the objective turns out to be convex, it automatically accelerates in
the sense of Nesterov and achieves near-optimal convergence rate in function
values. These properties are achieved without assuming any knowledge about the
convexity of the objective, by automatically adapting to the unknown weak
convexity constant. We conclude the paper by showing promising experimental
results obtained by applying our approach to incremental algorithms such as
SVRG and SAGA for sparse matrix factorization and for learning neural networks.
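The scheme above can be sketched in a few lines (a simplified illustration under assumed parameters, not the paper's algorithm): an outer loop repeatedly asks a base first-order method to approximately minimize the regularized surrogate $f(x) + \kappa/2\,\|x - y\|^2$, then extrapolates the anchor point in a Nesterov-like fashion.

```python
import numpy as np

def f(x):                          # a simple smooth objective for illustration
    return 0.5 * x @ x + np.sin(x[0])

def grad_f(x):
    g = x.copy()
    g[0] += np.cos(x[0])
    return g

def base_method(y, kappa, steps=100, lr=0.1):
    """Base method M: gradient descent on f(x) + kappa/2 * ||x - y||^2."""
    x = y.copy()
    for _ in range(steps):
        x -= lr * (grad_f(x) + kappa * (x - y))
    return x

def catalyst(x0, kappa=1.0, outer=30, momentum=0.5):
    # Outer loop: approximate proximal point + extrapolation.
    x_prev = x0.copy()
    y = x0.copy()
    for _ in range(outer):
        x = base_method(y, kappa)
        y = x + momentum * (x - x_prev)   # Nesterov-style extrapolation
        x_prev = x
    return x

x = catalyst(np.array([3.0, -2.0]))
# x approaches a stationary point of f without the caller ever
# supplying a convexity constant.
```

The added quadratic makes each subproblem well conditioned for the base method even when $f$ itself is only weakly convex, which is why convex solvers can be reused.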
Better Agnostic Clustering Via Relaxed Tensor Norms
We develop a new family of convex relaxations for $k$-means clustering based
on sum-of-squares norms, a relaxation of the injective tensor norm that is
efficiently computable using the Sum-of-Squares algorithm. We give an algorithm
based on this relaxation that recovers a faithful approximation to the true
means in the given data whenever the low-degree moments of the points in each
cluster have bounded sum-of-squares norms.
We then prove a sharp upper bound on the sum-of-squares norms for moment
tensors of any distribution that satisfies the \emph{Poincare inequality}. The
Poincare inequality is a central inequality in probability theory, and a large
class of distributions satisfy it including Gaussians, product distributions,
strongly log-concave distributions, and any sum or uniformly continuous
transformation of such distributions.
As an immediate corollary, for any $\epsilon > 0$, we obtain an efficient
algorithm for learning the means of a mixture of $k$ arbitrary Poincare
distributions in $\mathbb{R}^d$ in time $d^{O(1/\epsilon)}$ so long as the means
have separation $\Omega(k^{\epsilon})$. This in particular yields an algorithm
for learning Gaussian mixtures with separation $\Omega(k^{\epsilon})$, thus
partially resolving an open problem of Regev and Vijayaraghavan
\citet{regev2017learning}.
Our algorithm works even in the outlier-robust setting where an $\epsilon$
fraction of arbitrary outliers are added to the data, as long as the fraction
of outliers is smaller than the smallest cluster. We, therefore, obtain results
in the strong agnostic setting where, in addition to not knowing the
distribution family, the data itself may be arbitrarily corrupted.
Covariance Eigenvector Sparsity for Compression and Denoising
Sparsity in the eigenvectors of signal covariance matrices is exploited in
this paper for compression and denoising. Dimensionality reduction (DR) and
quantization modules, present in many practical compression schemes such as
transform codecs, are designed to capitalize on this form of sparsity and
achieve improved reconstruction performance compared to existing
sparsity-agnostic codecs. Using training data that may be noisy, a novel
sparsity-aware linear DR scheme is developed to fully exploit sparsity in the
covariance eigenvectors and form noise-resilient estimates of the principal
covariance eigenbasis. Sparsity is effected via norm-one regularization, and
the associated minimization problems are solved using computationally efficient
coordinate descent iterations. The resulting eigenspace estimator is shown
capable of identifying a subset of the unknown support of the eigenspace basis
vectors even when the observation noise covariance matrix is unknown, as long
as the noise power is sufficiently low. It is proved that the sparsity-aware
estimator is asymptotically normal, and the probability to correctly identify
the signal subspace basis support approaches one, as the number of training
data grows large. Simulations using synthetic data and images, corroborate that
the proposed algorithms achieve improved reconstruction quality relative to
alternatives.
Comment: IEEE Transactions on Signal Processing, 2012 (to appear)
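The idea of norm-one regularized eigenbasis estimation can be illustrated with a soft-thresholded power iteration on a planted sparse eigenvector (a sketch of the general approach, not the paper's coordinate-descent solver):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 20
v = np.zeros(d)
v[:3] = [0.8, -0.5, 0.33]               # sparse planted eigenvector
v /= np.linalg.norm(v)
C = 5.0 * np.outer(v, v) + np.eye(d)    # true signal-plus-noise covariance

# Noisy sample covariance estimated from training data.
X = rng.multivariate_normal(np.zeros(d), C, size=2000)
S = X.T @ X / len(X)

def soft(x, t):                          # soft-thresholding (l1 shrinkage)
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

w = rng.normal(size=d)
for _ in range(20):                      # plain power steps to align
    w = S @ w
    w /= np.linalg.norm(w)
for _ in range(50):                      # power steps with l1 shrinkage
    w = soft(S @ w, 0.5)
    w /= np.linalg.norm(w)

support = np.flatnonzero(np.abs(w) > 0.05)   # recovered sparse support
```

The shrinkage step zeroes out coordinates driven only by observation noise, so the estimated eigenvector's support matches the planted one, mirroring the support-identification guarantee stated in the abstract.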
FedADMM: A Federated Primal-Dual Algorithm Allowing Partial Participation
Federated learning is a framework for distributed optimization that places
emphasis on communication efficiency. In particular, it follows a client-server
broadcast model and is particularly appealing because of its ability to
accommodate heterogeneity in client compute and storage resources, non-i.i.d.
data assumptions, and data privacy. Our contribution is to offer a new
federated learning algorithm, FedADMM, for solving non-convex composite
optimization problems with non-smooth regularizers. We prove convergence of
FedADMM for the case when not all clients are able to participate in a given
communication round, under a very general sampling model.
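A highly simplified consensus-ADMM sketch conveys the partial-participation idea (hypothetical smooth local objectives and server-side caching; the actual FedADMM updates, which also handle a non-smooth regularizer, differ):

```python
import numpy as np

# Client i holds f_i(x) = 0.5*(x - c_i)^2; only a random subset of
# clients participates in each communication round. The server caches
# each client's last (x_i + u_i) and averages the caches to form z.
rng = np.random.default_rng(1)
c = np.array([-2.0, 0.0, 1.0, 4.0, 7.0])   # local optima (client data)
n, rho = len(c), 1.0
x = np.zeros(n)             # local primal variables
u = np.zeros(n)             # scaled dual variables
cache = x + u               # server-side cache of client states
z = cache.mean()

for _ in range(600):
    active = rng.choice(n, size=2, replace=False)   # partial participation
    for i in active:
        # Local prox step: argmin_x f_i(x) + rho/2 * (x - z + u_i)^2
        x[i] = (c[i] + rho * (z - u[i])) / (1.0 + rho)
        u[i] += x[i] - z                             # dual update
        cache[i] = x[i] + u[i]
    z = cache.mean()        # server aggregation from cached states

# z approaches the global minimizer mean(c) of sum_i f_i even though
# most clients sit out any given round.
```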
Deep Relaxation: partial differential equations for optimizing deep neural networks
In this paper we establish a connection between non-convex optimization
methods for training deep neural networks and nonlinear partial differential
equations (PDEs). Relaxation techniques arising in statistical physics which
have already been used successfully in this context are reinterpreted as
solutions of a viscous Hamilton-Jacobi PDE. A stochastic control
interpretation allows us to prove that the modified algorithm performs better
in expectation than stochastic gradient descent. Well-known PDE regularity
results
allow us to analyze the geometry of the relaxed energy landscape, confirming
empirical evidence. The PDE is derived from a stochastic homogenization
problem, which arises in the implementation of the algorithm. The algorithms
scale well in practice and can effectively tackle the high dimensionality of
modern neural networks.