Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization
We study online learning with bandit feedback (i.e. the learner has access only to a zeroth-order oracle) where the cost/reward functions $f_t$ admit a "pseudo-1d" structure, i.e. $f_t(w) = \ell_t(\hat{y}_t(w))$, where the output of $\hat{y}_t$ is one-dimensional. At each round, the learner observes a context $x_t$, plays a prediction $\hat{y}_t(w_t; x_t)$ (e.g. $\hat{y}_t(\cdot) = \langle x_t, \cdot \rangle$) for some $w_t \in \mathbb{R}^d$, and observes the loss $\ell_t(\hat{y}_t(w_t))$, where $\ell_t$ is a convex Lipschitz-continuous function. The goal is to minimize the standard regret metric. This pseudo-1d bandit convex optimization problem arises frequently in domains such as online decision-making and parameter tuning in large systems. For this problem, we first establish a lower bound on the regret of any algorithm, in terms of the dimension $d$ and the number of rounds $T$. We then propose a new algorithm that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively, guaranteeing a regret that matches this lower bound, up to additional logarithmic factors. In contrast, applying state-of-the-art online convex optimization methods leads to regret that is significantly suboptimal.
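As a rough illustration of the pseudo-1d setting described above, the sketch below simulates the bandit protocol with a naive one-point zeroth-order update that perturbs only the scalar prediction. This is not the paper's algorithm (which combines randomized online gradient descent with kernelized exponential weights); the toy losses, step size, and perturbation size are assumptions.

```python
import numpy as np

# Minimal sketch of the pseudo-1d bandit protocol with a naive one-point
# zeroth-order update; NOT the paper's algorithm, only an illustration of the
# structure f_t(w) = loss_t(<x_t, w>).  The losses, step size eta, and
# perturbation size delta are placeholder assumptions.

rng = np.random.default_rng(0)
d, T = 5, 5000
w = np.zeros(d)
eta, delta = 0.02, 0.1
w_star = rng.normal(size=d)                      # hidden parameter defining the losses

for t in range(T):
    x_t = rng.normal(size=d) / np.sqrt(d)        # observed context
    loss_t = lambda y: abs(y - x_t @ w_star)     # convex, Lipschitz in the scalar prediction

    u = rng.choice([-1.0, 1.0])                  # 1-d perturbation: exploits the pseudo-1d structure
    y_played = x_t @ w + delta * u               # perturbed scalar prediction
    observed = loss_t(y_played)                  # bandit (zeroth-order) feedback only

    g_scalar = (observed / delta) * u            # one-point estimate of d(loss_t)/dy
    w -= eta * g_scalar * x_t                    # chain rule: grad_w = d(loss_t)/dy * x_t

print("final parameter error:", np.linalg.norm(w - w_star))
```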
A Primer on Zeroth-Order Optimization in Signal Processing and Machine Learning
Zeroth-order (ZO) optimization is a subset of gradient-free optimization that arises in many signal processing and machine learning applications. It solves optimization problems in much the same way as gradient-based methods, but it requires only function evaluations rather than gradients.
Specifically, ZO optimization iteratively performs three major steps: gradient
estimation, descent direction computation, and solution update. In this paper,
we provide a comprehensive review of ZO optimization, with an emphasis on
showing the underlying intuition, optimization principles and recent advances
in convergence analysis. Moreover, we demonstrate promising applications of ZO
optimization, such as evaluating robustness and generating explanations from
black-box deep learning models, and efficient online sensor management.
Comment: IEEE Signal Processing Magazine
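The three steps named above can be made concrete with the standard two-point random-direction gradient estimator. The sketch below is a generic illustration, not tied to any particular method from the review; the toy objective, smoothing radius, and step size are assumptions.

```python
import numpy as np

# A minimal sketch of the three ZO steps named in the abstract (gradient
# estimation, descent direction computation, solution update), using the
# standard two-point Gaussian-smoothing gradient estimator.  The toy
# objective, smoothing radius mu, and step size eta are illustrative choices.

rng = np.random.default_rng(0)
d = 10
c = rng.normal(size=d)
f = lambda x: 0.5 * np.sum((x - c) ** 2)         # black box: only values are available

x = np.zeros(d)
mu, eta = 1e-3, 0.1
for _ in range(1000):
    u = rng.normal(size=d)                                  # random Gaussian direction
    g = (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u      # 1) gradient estimation
    direction = -g                                          # 2) descent direction
    x = x + eta * direction                                 # 3) solution update

print("distance to minimizer:", np.linalg.norm(x - c))
```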
One Sample Stochastic Frank-Wolfe
One of the beauties of the projected gradient descent method lies in its rather simple mechanism and yet stable behavior with inexact, stochastic gradients, which has led to its widespread use in many machine learning applications. However, once we replace the projection operator with a simpler linear program, as is done in the Frank-Wolfe method, both simplicity and stability take a serious hit. The aim of this paper is to bring them back without sacrificing efficiency. We propose the first one-sample stochastic Frank-Wolfe algorithm, called 1-SFW, which avoids the need to carefully tune the batch size, step size, learning rate, and other complicated hyperparameters. In particular, 1-SFW achieves the optimal convergence rate of $\mathcal{O}(1/\epsilon^2)$ for reaching an $\epsilon$-suboptimal solution in the stochastic convex setting, and a $(1-1/e)-\epsilon$ approximate solution for a stochastic monotone DR-submodular maximization problem. Moreover, in a general non-convex setting, 1-SFW finds an $\epsilon$-first-order stationary point after at most $\mathcal{O}(1/\epsilon^3)$ iterations, achieving the current best known convergence rate. All of this is possible by designing a novel unbiased momentum estimator that governs the stability of the optimization process while using only a single sample at each iteration.
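A simplified sketch of the single-sample idea follows: a Frank-Wolfe update driven by a momentum-averaged one-sample gradient estimate. This is not the paper's exact unbiased 1-SFW estimator; the toy problem, the simplex constraint, and the step-size schedules are assumptions.

```python
import numpy as np

# A simplified sketch of a one-sample stochastic Frank-Wolfe step with a
# momentum-averaged gradient estimator.  This is NOT the paper's exact
# unbiased 1-SFW estimator; it only illustrates stabilizing single-sample
# gradients with momentum.  Problem data and the simplex constraint are
# illustrative assumptions.

rng = np.random.default_rng(0)
d, T = 20, 2000
x = np.ones(d) / d                      # start inside the probability simplex
d_t = np.zeros(d)                       # momentum gradient estimate
b = rng.normal(size=d)

def stoch_grad(x):
    """One-sample stochastic gradient of f(x) = 0.5*E||x - (b + noise)||^2."""
    z = b + rng.normal(scale=0.5, size=d)
    return x - z

for t in range(1, T + 1):
    rho = 1.0 / t ** (2.0 / 3.0)                 # decaying momentum weight
    d_t = (1 - rho) * d_t + rho * stoch_grad(x)  # a single sample per iteration
    v = np.zeros(d)
    v[np.argmin(d_t)] = 1.0                      # LMO over the simplex: best vertex
    gamma = 2.0 / (t + 2)                        # standard Frank-Wolfe step size
    x = (1 - gamma) * x + gamma * v              # convex combination stays feasible

print("objective estimate:", 0.5 * np.sum((x - b) ** 2))
```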
A Hybrid-Order Distributed SGD Method for Non-Convex Optimization to Balance Communication Overhead, Computational Complexity, and Convergence Rate
In this paper, we propose a distributed stochastic gradient descent (SGD) method with low communication load, low computational complexity, and still fast convergence. To reduce the communication load, at each iteration of the algorithm the worker nodes compute and communicate a few scalars, namely the directional derivatives of the sample functions along some \emph{pre-shared directions}. However, to maintain accuracy, after every fixed number of iterations they communicate the full vectors of stochastic gradients. To reduce the
computational complexity in each iteration, the worker nodes approximate the
directional derivatives with zeroth-order stochastic gradient estimation, by
performing just two function evaluations rather than computing a first-order
gradient vector. The proposed method substantially improves the convergence rate of existing zeroth-order methods, guaranteeing order-wise faster convergence. Moreover, compared to the well-known communication-efficient methods of model averaging (which
perform local model updates and periodic communication of the gradients to
synchronize the local models), we prove that for the general class of
non-convex stochastic problems and with reasonable choice of parameters, the
proposed method guarantees the same orders of communication load and
convergence rate, while having order-wise less computational complexity.
Experimental results on various learning problems in neural network applications demonstrate the effectiveness of the proposed approach compared to several state-of-the-art distributed SGD methods.
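The sketch below simulates the hybrid-order mechanism on a toy problem: most iterations exchange only a few scalars (two-evaluation zeroth-order estimates of directional derivatives along pre-shared directions), with occasional full stochastic-gradient synchronization. The worker objectives, sync period, and all constants are assumptions, not the paper's setup.

```python
import numpy as np

# Minimal single-process sketch of the hybrid-order idea: in most iterations
# each worker communicates only a few scalars (zeroth-order estimates of
# directional derivatives along pre-shared directions); every `sync_period`
# iterations the workers send full gradients.  Toy objectives and constants
# are assumptions.

rng = np.random.default_rng(0)
d, workers, T = 50, 4, 400
sync_period, k_dirs = 20, 5                   # full-gradient sync interval, scalars per iteration
mu, eta = 1e-4, 0.05
targets = rng.normal(size=(workers, d))       # worker i holds f_i(x) = 0.5*||x - targets[i]||^2

def f_i(i, x):                                # zeroth-order oracle of worker i
    return 0.5 * np.sum((x - targets[i]) ** 2)

def grad_i(i, x):                             # first-order oracle (used only at sync steps)
    return x - targets[i]

x = np.zeros(d)
for t in range(T):
    if t % sync_period == 0:
        # occasional exact sync: each worker communicates a full d-dimensional gradient
        g = np.mean([grad_i(i, x) for i in range(workers)], axis=0)
    else:
        # cheap step: pre-shared random directions, two function evaluations per direction
        dirs = rng.normal(size=(k_dirs, d)) / np.sqrt(d)
        g = np.zeros(d)
        for u in dirs:
            deriv = np.mean([(f_i(i, x + mu * u) - f_i(i, x - mu * u)) / (2 * mu)
                             for i in range(workers)])   # each worker sends one scalar
            g += deriv * u
        g *= d / k_dirs                       # rescale the random-subspace estimate
    x -= eta * g

print("distance to optimum:", np.linalg.norm(x - targets.mean(axis=0)))
```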
ZO-AdaMM: Zeroth-Order Adaptive Momentum Method for Black-Box Optimization
The adaptive momentum method (AdaMM), which uses past gradients to update
descent directions and learning rates simultaneously, has become one of the
most popular first-order optimization methods for solving machine learning
problems. However, AdaMM is not suited for solving black-box optimization
problems, where explicit gradient forms are difficult or infeasible to obtain.
In this paper, we propose a zeroth-order AdaMM (ZO-AdaMM) algorithm that generalizes AdaMM to the gradient-free regime. We show that the convergence rate of ZO-AdaMM for both convex and nonconvex optimization is roughly a factor of $O(\sqrt{d})$ worse than that of the first-order AdaMM algorithm, where $d$ is the problem size. In particular, we provide a deep understanding of why the Mahalanobis distance matters in the convergence of ZO-AdaMM and other AdaMM-type methods. As a byproduct, our analysis makes a first step toward understanding adaptive learning rate methods for nonconvex constrained optimization. Furthermore, we demonstrate two applications: designing per-image and universal adversarial attacks on black-box neural networks. We perform extensive experiments on ImageNet and empirically show that ZO-AdaMM converges much faster to a solution of high accuracy compared with state-of-the-art ZO optimization methods.
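A minimal sketch of the ZO-AdaMM idea, i.e. an Adam-style update driven by a two-point zeroth-order gradient estimate, is given below. It omits the constrained (Mahalanobis-projection) part of the method, and the toy objective and hyperparameters are assumptions.

```python
import numpy as np

# Minimal sketch of a zeroth-order Adam-style (AdaMM-type) update: the true
# gradient is replaced by a two-point random-direction estimate, and the usual
# first/second moment accumulators drive the step.  This is an illustration,
# not the paper's ZO-AdaMM (which also handles constraints via a
# Mahalanobis-distance projection); all constants are assumptions.

rng = np.random.default_rng(0)
d = 20
c = rng.normal(size=d)
f = lambda x: np.sum(np.log(1.0 + (x - c) ** 2))   # black-box nonconvex toy objective

x = np.zeros(d)
m, v = np.zeros(d), np.zeros(d)
mu, lr, b1, b2, eps = 1e-3, 0.05, 0.9, 0.999, 1e-8

for t in range(1, 2001):
    u = rng.normal(size=d)
    g = (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u   # ZO gradient estimate
    m = b1 * m + (1 - b1) * g                            # first moment (momentum)
    v = b2 * v + (1 - b2) * g ** 2                       # second moment (adaptive scaling)
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    x -= lr * m_hat / (np.sqrt(v_hat) + eps)             # AdaMM-style update

print("final objective:", f(x))
```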
Statistical Inference for Polyak-Ruppert Averaged Zeroth-order Stochastic Gradient Algorithm
Statistical machine learning models trained with stochastic gradient
algorithms are increasingly being deployed in critical scientific applications.
However, computing the stochastic gradient in several such applications is
highly expensive or even impossible at times. In such cases, derivative-free or
zeroth-order algorithms are used. An important question which has thus far not
been addressed sufficiently in the statistical machine learning literature is
that of equipping stochastic zeroth-order algorithms with practical yet
rigorous inferential capabilities so that we not only have point estimates or
predictions but also quantify the associated uncertainty via confidence
intervals or sets. To this end, in this work we first establish a central limit theorem for the Polyak-Ruppert averaged stochastic zeroth-order gradient
algorithm. We then provide online estimators of the asymptotic covariance
matrix appearing in the central limit theorem, thereby providing a practical
procedure for constructing asymptotically valid confidence sets (or intervals)
for parameter estimation (or prediction) in the zeroth-order setting.
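The sketch below illustrates Polyak-Ruppert averaging on top of a two-point zeroth-order SGD iterate, together with a naive batch-means uncertainty estimate for one coordinate. The paper's online covariance estimators and confidence sets are more refined; the toy oracle and schedules here are assumptions.

```python
import numpy as np

# Sketch of Polyak-Ruppert averaging of a two-point zeroth-order SGD iterate,
# with a naive batch-means estimate of the uncertainty of one coordinate.
# This is only an illustration of averaging plus an attached uncertainty
# estimate, not the paper's online covariance estimators.

rng = np.random.default_rng(0)
d, T = 5, 20000
theta_star = rng.normal(size=d)

def noisy_f(x):                                   # stochastic zeroth-order oracle
    return 0.5 * np.sum((x - theta_star) ** 2) + rng.normal(scale=0.1)

x = np.zeros(d)
avg = np.zeros(d)
mu = 0.1
batch, batch_size, batch_means = [], 1000, []

for t in range(1, T + 1):
    u = rng.normal(size=d)
    g = (noisy_f(x + mu * u) - noisy_f(x - mu * u)) / (2 * mu) * u
    x -= 0.2 * t ** (-0.75) * g                   # slowly decaying step size
    avg += (x - avg) / t                          # Polyak-Ruppert running average
    batch.append(x[0])
    if len(batch) == batch_size:
        batch_means.append(np.mean(batch))
        batch = []

se = np.std(batch_means, ddof=1) / np.sqrt(len(batch_means))
print(f"coordinate 0: estimate {avg[0]:.3f}, true {theta_star[0]:.3f}, "
      f"naive 95% CI half-width {1.96 * se:.3f}")
```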
Accelerated Stochastic Gradient-free and Projection-free Methods
In this paper, we propose a class of accelerated stochastic gradient-free and projection-free (a.k.a. zeroth-order Frank-Wolfe) methods to solve constrained stochastic and finite-sum nonconvex optimization problems. Specifically, we propose an accelerated stochastic zeroth-order Frank-Wolfe (Acc-SZOFW) method based on the variance-reduction technique of SPIDER/SpiderBoost and a novel momentum acceleration technique. Moreover, under some mild conditions, we prove that Acc-SZOFW improves the function query complexity of the existing best results for finding an $\epsilon$-stationary point, in both the finite-sum and the stochastic problem. To relax the large batches required by Acc-SZOFW, we further propose a novel accelerated stochastic zeroth-order Frank-Wolfe method (Acc-SZOFW*) based on the variance-reduction technique of STORM, which reaches the same function query complexity in the stochastic problem without relying on any large batches. In particular, we present an accelerated framework for Frank-Wolfe methods based on the proposed momentum acceleration technique. Extensive experimental results on black-box adversarial attacks and robust black-box classification demonstrate the efficiency of our algorithms.
Comment: Accepted to ICML 2020, 34 pages
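The sketch below illustrates two ingredients named in the abstract, a two-point zeroth-order gradient estimate and a STORM-style recursive momentum estimator, inside a plain Frank-Wolfe loop. It is not Acc-SZOFW or Acc-SZOFW*; the toy objective, l1-ball constraint, and schedules are assumptions.

```python
import numpy as np

# Sketch of a zeroth-order Frank-Wolfe loop combining a two-point ZO gradient
# estimate with a STORM-style recursive momentum estimator.  NOT Acc-SZOFW /
# Acc-SZOFW*, only an illustration of the ingredients; the toy objective,
# l1-ball constraint, and schedules are assumptions.

rng = np.random.default_rng(0)
dim, T, radius = 30, 1500, 5.0
c = rng.normal(size=dim)

def sample_f(x, z):
    """One realization of the stochastic objective f(x; z) = 0.5*||x - (c + z)||^2."""
    return 0.5 * np.sum((x - c - z) ** 2)

def zo_grad(x, z, u, mu=1e-3):
    """Two-point zeroth-order gradient estimate along direction u."""
    return (sample_f(x + mu * u, z) - sample_f(x - mu * u, z)) / (2 * mu) * u

def lmo_l1(g, r):
    """Linear minimization oracle over the l1-ball of radius r."""
    v = np.zeros(dim)
    i = np.argmax(np.abs(g))
    v[i] = -r * np.sign(g[i])
    return v

x = np.zeros(dim)
z = rng.normal(scale=0.3, size=dim)
d_t = zo_grad(x, z, rng.normal(size=dim))
for t in range(1, T + 1):
    gamma = 2.0 / (t + 2)
    x_new = (1 - gamma) * x + gamma * lmo_l1(d_t, radius)
    z = rng.normal(scale=0.3, size=dim)          # a single fresh sample per iteration
    u = rng.normal(size=dim)
    a = min(1.0, 1.0 / t ** (2.0 / 3.0))
    # STORM-style recursion: the same sample and direction evaluated at both iterates.
    d_t = zo_grad(x_new, z, u) + (1 - a) * (d_t - zo_grad(x, z, u))
    x = x_new

print("objective at final iterate:", 0.5 * np.sum((x - c) ** 2))
```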
Projection Efficient Subgradient Method and Optimal Nonsmooth Frank-Wolfe Method
We consider the classical setting of optimizing a nonsmooth Lipschitz
continuous convex function over a convex constraint set, when having access to
a (stochastic) first-order oracle (FO) for the function and a projection oracle
(PO) for the constraint set. It is well known that, to achieve $\epsilon$-suboptimality in high dimensions, $\Theta(\epsilon^{-2})$ FO calls are necessary. This is achieved by the projected subgradient method (PGD). However, PGD also entails $\mathcal{O}(\epsilon^{-2})$ PO calls, which may be computationally costlier than FO calls (e.g. for nuclear norm constraints). Improving this PO call complexity of PGD is largely unexplored, despite the fundamental nature of the problem and extensive literature. We present the first such improvement. It requires only a mild assumption: that the objective function, when extended to a slightly larger neighborhood of the constraint set, still remains Lipschitz and accessible via the FO. In particular, we introduce the MOPES method, which carefully combines Moreau-Yosida smoothing and accelerated first-order schemes. It is guaranteed to find a feasible $\epsilon$-suboptimal solution using only $\mathcal{O}(\epsilon^{-1})$ PO calls and an optimal $\mathcal{O}(\epsilon^{-2})$ number of FO calls. Further, if instead of a PO we only have a linear minimization oracle (LMO, a la Frank-Wolfe) for the constraint set, an extension of our method, MOLES, finds a feasible $\epsilon$-suboptimal solution using $\mathcal{O}(\epsilon^{-2})$ LMO calls and FO calls---both of which match known lower bounds, resolving a question left open since White (1993). Our experiments confirm that these methods achieve significant speedups over the state-of-the-art for problems with costly PO and LMO calls.
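The snippet below checks the Moreau-Yosida smoothing building block the abstract refers to: the Moreau envelope of the constraint indicator is smooth and its gradient costs exactly one PO call. The full MOPES scheme (acceleration, inexact proximal steps) is not reproduced here, and the l2-ball constraint and test point are assumptions.

```python
import numpy as np

# Numerical check of the Moreau-Yosida smoothing building block: the Moreau
# envelope of the constraint indicator, e_lam(x) = dist(x, C)^2 / (2*lam),
# is differentiable with gradient (x - proj_C(x)) / lam, so each gradient of
# the smoothed constraint term costs exactly one projection-oracle (PO) call.
# MOPES couples this idea with accelerated first-order schemes; that full
# algorithm is not reproduced here.  The l2-ball constraint is an assumption.

rng = np.random.default_rng(0)
d, lam = 5, 0.5

def proj_ball(x, r=1.0):                    # PO oracle for C = {x : ||x|| <= r}
    n = np.linalg.norm(x)
    return x if n <= r else r * x / n

def envelope(x):                            # Moreau envelope of the indicator of C
    return np.sum((x - proj_ball(x)) ** 2) / (2 * lam)

def envelope_grad(x):                       # its gradient: one PO call
    return (x - proj_ball(x)) / lam

x = rng.normal(size=d) * 3.0                # a point well outside the ball
g = envelope_grad(x)

# Finite-difference check that the closed-form gradient is correct.
eps = 1e-6
fd = np.array([(envelope(x + eps * e) - envelope(x - eps * e)) / (2 * eps)
               for e in np.eye(d)])
print("max gradient error:", np.max(np.abs(g - fd)))
```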
Zeroth-Order Algorithms for Stochastic Distributed Nonconvex Optimization
In this paper, we consider a stochastic distributed nonconvex optimization problem in which the cost function is distributed over $n$ agents having access only to zeroth-order (ZO) information of the cost. This problem has various
machine learning applications. As a solution, we propose two distributed ZO
algorithms, in which at each iteration each agent samples the local stochastic
ZO oracle at two points with an adaptive smoothing parameter. We show that the
proposed algorithms achieve a linear speedup convergence rate of $\mathcal{O}(\sqrt{p/(nT)})$ for smooth cost functions and an $\mathcal{O}(p/(nT))$ convergence rate when the global cost function additionally satisfies the Polyak-Lojasiewicz (P-L) condition, where $p$ and $T$ are the dimension of the decision variable and the total number of iterations, respectively. To the best of our knowledge, this is the first
linear speedup result for distributed ZO algorithms; it enables systematic performance improvements as more agents are added. We also show that the proposed algorithms converge linearly for deterministic centralized optimization problems under the P-L condition. We demonstrate through numerical experiments the efficiency of our algorithms at generating adversarial examples from deep neural networks, in comparison with baseline and recently proposed centralized and distributed ZO algorithms.
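A single-process simulation of the sampling mechanism described above is sketched below: each agent queries one realization of its local stochastic ZO oracle at two points with a decaying smoothing parameter, and the estimates are averaged. It is an illustration rather than the paper's algorithms (which also cover decentralized communication); the toy costs and constants are assumptions.

```python
import numpy as np

# Single-process simulation of the sampling mechanism in the abstract: at each
# iteration every agent evaluates one realization of its local stochastic cost
# at two symmetric points (with a decaying, i.e. adaptive, smoothing
# parameter), and the resulting estimates are averaged.  This is an
# illustration, not the paper's algorithms; toy costs and constants are
# assumptions.

rng = np.random.default_rng(0)
p, n, T = 20, 8, 3000                       # dimension, number of agents, iterations
targets = rng.normal(size=(n, p))

def sample_cost(i, x, z):
    """One realization of agent i's stochastic cost 0.5*||x - targets[i] - z||^2."""
    return 0.5 * np.sum((x - targets[i] - z) ** 2)

x = np.zeros(p)
eta = 0.1
for t in range(1, T + 1):
    mu = 1.0 / np.sqrt(t + 1)               # adaptive (decaying) smoothing parameter
    grads = []
    for i in range(n):
        z = rng.normal(scale=0.2, size=p)   # one stochastic sample, evaluated at two points
        u = rng.normal(size=p)
        g = (sample_cost(i, x + mu * u, z) - sample_cost(i, x - mu * u, z)) / (2 * mu) * u
        grads.append(g)
    x -= eta * np.mean(grads, axis=0)       # average the n agents' two-point estimates

print("distance to the global minimizer:", np.linalg.norm(x - targets.mean(axis=0)))
```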