9 research outputs found

    Optimal Regret Algorithm for Pseudo-1d Bandit Convex Optimization

    We study online learning with bandit feedback (i.e., the learner has access only to a zeroth-order oracle) where the cost/reward functions $f_t$ admit a "pseudo-1d" structure, i.e., $f_t(w) = \ell_t(y_t(w))$, where the output of $y_t$ is one-dimensional. At each round, the learner observes context $x_t$, plays prediction $y_t(w_t; x_t)$ (e.g., $y_t(\cdot) = \langle x_t, \cdot\rangle$) for some $w_t \in \mathbb{R}^d$, and observes loss $\ell_t(y_t(w_t))$, where $\ell_t$ is a convex Lipschitz-continuous function. The goal is to minimize the standard regret metric. This pseudo-1d bandit convex optimization problem arises frequently in domains such as online decision-making and parameter tuning in large systems. For this problem, we first show a lower bound of $\min(\sqrt{dT}, T^{3/4})$ on the regret of any algorithm, where $T$ is the number of rounds. We propose a new algorithm that combines randomized online gradient descent with a kernelized exponential weights method to exploit the pseudo-1d structure effectively, guaranteeing the optimal regret bound mentioned above, up to additional logarithmic factors. In contrast, applying state-of-the-art online convex optimization methods leads to $\tilde{O}(\min(d^{9.5}\sqrt{T}, \sqrt{d}\,T^{3/4}))$ regret, which is significantly suboptimal in $d$.
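
    As a rough illustration of why the pseudo-1d structure helps (a hypothetical sketch under a linear-prediction assumption, not the paper's algorithm): since $f_t(w) = \ell_t(\langle x_t, w\rangle)$ and $x_t$ is observed, only the scalar derivative of $\ell_t$ has to be estimated from the single bandit query, so the estimation problem is one-dimensional regardless of $d$.

        import numpy as np

        rng = np.random.default_rng(0)
        d, T, delta, eta = 10, 5000, 0.1, 0.01
        w = np.zeros(d)
        w_star = rng.normal(size=d)              # hypothetical target, used only to simulate losses

        for t in range(T):
            x = rng.normal(size=d)               # context revealed before playing
            loss = lambda y: abs(y - x @ w_star) # convex Lipschitz loss, unknown to the learner
            u = rng.choice([-1.0, 1.0])          # Rademacher perturbation of the scalar prediction
            y = x @ w + delta * u                # the single bandit query this round
            g_scalar = (loss(y) / delta) * u     # one-point estimate of the derivative of loss
            w -= eta * g_scalar * x              # chain rule: the gradient direction x_t is known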

    A Primer on Zeroth-Order Optimization in Signal Processing and Machine Learning

    Zeroth-order (ZO) optimization is a subset of gradient-free optimization that emerges in many signal processing and machine learning applications. It is used to solve optimization problems in much the same way as gradient-based methods, but it requires only function evaluations rather than gradients. Specifically, ZO optimization iteratively performs three major steps: gradient estimation, descent direction computation, and solution update. In this paper, we provide a comprehensive review of ZO optimization, with an emphasis on the underlying intuition, optimization principles, and recent advances in convergence analysis. Moreover, we demonstrate promising applications of ZO optimization, such as evaluating the robustness of black-box deep learning models, generating explanations from them, and efficient online sensor management.
    Comment: IEEE Signal Processing Magazine
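
    The three steps above can be made concrete with a minimal sketch of a two-point gradient estimator under Gaussian smoothing, driving a plain ZO descent loop (a generic illustration; the objective quadratic is a stand-in):

        import numpy as np

        def zo_gradient(f, x, mu=1e-3, n_dirs=20, rng=np.random.default_rng(1)):
            """Two-point Gaussian-smoothing estimate of the gradient of f at x."""
            g = np.zeros_like(x)
            for _ in range(n_dirs):
                u = rng.normal(size=x.shape)            # random search direction
                g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
            return g / n_dirs

        quadratic = lambda x: 0.5 * x @ x               # stand-in black-box objective
        x = np.ones(5)
        for _ in range(200):
            g = zo_gradient(quadratic, x)               # step 1: gradient estimation
            direction = -g                              # step 2: descent direction computation
            x += 0.1 * direction                        # step 3: solution update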

    One Sample Stochastic Frank-Wolfe

    One of the beauties of the projected gradient descent method lies in its rather simple mechanism and yet stable behavior with inexact, stochastic gradients, which has led to its widespread use in many machine learning applications. However, once we replace the projection operator with a simpler linear program, as is done in the Frank-Wolfe method, both simplicity and stability take a serious hit. The aim of this paper is to bring them back without sacrificing efficiency. We propose the first one-sample stochastic Frank-Wolfe algorithm, called 1-SFW, that avoids the need to carefully tune the batch size, step size, learning rate, and other complicated hyperparameters. In particular, 1-SFW achieves the optimal convergence rate of $\mathcal{O}(1/\epsilon^2)$ for reaching an $\epsilon$-suboptimal solution in the stochastic convex setting, and a $((1-1/e)-\epsilon)$-approximate solution for stochastic monotone DR-submodular maximization problems. Moreover, in the general nonconvex setting, 1-SFW finds an $\epsilon$-first-order stationary point after at most $\mathcal{O}(1/\epsilon^3)$ iterations, achieving the best currently known convergence rate. All of this is made possible by a novel unbiased momentum estimator that governs the stability of the optimization process while using only a single sample at each iteration.
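
    A minimal sketch of the momentum-averaged stochastic Frank-Wolfe template that 1-SFW builds on (the full 1-SFW estimator adds a correction term, estimated via a stochastic Hessian-vector product, to keep the average unbiased; this sketch omits it, and the least-squares objective and $\ell_1$-ball constraint are stand-ins):

        import numpy as np

        rng = np.random.default_rng(2)
        dim, T, radius = 20, 500, 1.0
        A = rng.normal(size=(100, dim))
        b = rng.normal(size=100)

        def stoch_grad(x):
            i = rng.integers(len(b))                    # a single sample per iteration
            return (A[i] @ x - b[i]) * A[i]

        def lmo_l1(g, r=radius):
            """Linear minimization oracle over the l1 ball of radius r."""
            v = np.zeros_like(g)
            i = np.argmax(np.abs(g))
            v[i] = -r * np.sign(g[i])
            return v

        x, d = np.zeros(dim), np.zeros(dim)
        for t in range(1, T + 1):
            rho = 2.0 / (t + 1)                         # momentum averaging weight
            d = (1 - rho) * d + rho * stoch_grad(x)     # momentum gradient estimate
            v = lmo_l1(d)                               # Frank-Wolfe vertex
            x += (2.0 / (t + 2)) * (v - x)              # convex combination keeps x feasible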

    A Hybrid-Order Distributed SGD Method for Non-Convex Optimization to Balance Communication Overhead, Computational Complexity, and Convergence Rate

    In this paper, we propose a distributed stochastic gradient descent (SGD) method with low communication load, low computational complexity, and still fast convergence. To reduce the communication load, at each iteration of the algorithm the worker nodes compute and communicate a few scalars: the directional derivatives of the sample functions along some pre-shared directions. However, to maintain accuracy, after every specific number of iterations they communicate the full stochastic gradient vectors. To reduce the computational complexity of each iteration, the worker nodes approximate the directional derivatives with zeroth-order stochastic gradient estimates, performing just two function evaluations rather than computing a first-order gradient vector. The proposed method substantially improves on the convergence rate of zeroth-order methods, guaranteeing order-wise faster convergence. Moreover, compared to the well-known communication-efficient methods based on model averaging (which perform local model updates and periodically communicate gradients to synchronize the local models), we prove that for the general class of nonconvex stochastic problems, and with a reasonable choice of parameters, the proposed method guarantees the same orders of communication load and convergence rate while having order-wise lower computational complexity. Experimental results on various learning problems in neural network applications demonstrate the effectiveness of the proposed approach compared to various state-of-the-art distributed SGD methods.
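
    The communication pattern can be sketched as follows (hypothetical function names; in the actual method the pre-shared direction would be generated from a seed known to all nodes, so only one scalar per direction crosses the network):

        import numpy as np

        def directional_scalar(f, w, u, mu=1e-4):
            """Worker side: estimate the directional derivative of f at w along u
            with two function evaluations; no d-dimensional gradient is formed."""
            return (f(w + mu * u) - f(w)) / mu          # one float to communicate

        def apply_update(w, scalar, u, lr=0.05):
            """Server side: rebuild a d-dimensional step from the scalar and u."""
            return w - lr * scalar * u

        rng = np.random.default_rng(3)
        w = rng.normal(size=8)
        f = lambda w: np.sum((w - 1.0) ** 2)            # stand-in local sample loss
        u = rng.normal(size=8)
        u /= np.linalg.norm(u)                          # pre-shared direction (shared seed)
        s = directional_scalar(f, w, u)                 # communicate 1 scalar instead of d
        w = apply_update(w, s, u)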

    ZO-AdaMM: Zeroth-Order Adaptive Momentum Method for Black-Box Optimization

    The adaptive momentum method (AdaMM), which uses past gradients to update descent directions and learning rates simultaneously, has become one of the most popular first-order optimization methods for solving machine learning problems. However, AdaMM is not suited for black-box optimization problems, where explicit gradient forms are difficult or infeasible to obtain. In this paper, we propose a zeroth-order AdaMM (ZO-AdaMM) algorithm that generalizes AdaMM to the gradient-free regime. We show that the convergence rate of ZO-AdaMM for both convex and nonconvex optimization is roughly a factor of $O(\sqrt{d})$ worse than that of the first-order AdaMM algorithm, where $d$ is the problem size. In particular, we provide a deep understanding of why the Mahalanobis distance matters in the convergence of ZO-AdaMM and other AdaMM-type methods. As a byproduct, our analysis makes a first step toward understanding adaptive learning rate methods for nonconvex constrained optimization. Furthermore, we demonstrate two applications: designing per-image and universal adversarial attacks on black-box neural networks. We perform extensive experiments on ImageNet and empirically show that ZO-AdaMM converges much faster to a solution of high accuracy than 6 state-of-the-art ZO optimization methods.
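
    A minimal sketch of the ZO-AdaMM idea: the moment updates are the familiar Adam-style recursions, and only the gradient is replaced by a two-point zeroth-order estimate (the objective is a stand-in, and the paper's analysis additionally requires a max-normalization of the second moment that this sketch omits):

        import numpy as np

        rng = np.random.default_rng(4)
        d, mu, lr, b1, b2, eps = 10, 1e-3, 0.05, 0.9, 0.999, 1e-8
        f = lambda x: np.sum(x ** 2) + np.sum(np.sin(x))    # stand-in black-box objective
        x, m, v = rng.normal(size=d), np.zeros(d), np.zeros(d)

        for t in range(1, 301):
            u = rng.normal(size=d)
            g = (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u  # ZO gradient estimate
            m = b1 * m + (1 - b1) * g                  # first moment (descent direction)
            v = b2 * v + (1 - b2) * g ** 2             # second moment (per-coordinate rate)
            m_hat = m / (1 - b1 ** t)                  # bias corrections
            v_hat = v / (1 - b2 ** t)
            x -= lr * m_hat / (np.sqrt(v_hat) + eps)   # AdaMM-style update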

    Statistical Inference for Polyak-Ruppert Averaged Zeroth-order Stochastic Gradient Algorithm

    Statistical machine learning models trained with stochastic gradient algorithms are increasingly being deployed in critical scientific applications. However, computing the stochastic gradient in several such applications is highly expensive or even impossible at times. In such cases, derivative-free or zeroth-order algorithms are used. An important question which has thus far not been addressed sufficiently in the statistical machine learning literature is that of equipping stochastic zeroth-order algorithms with practical yet rigorous inferential capabilities, so that we not only have point estimates or predictions but can also quantify the associated uncertainty via confidence intervals or sets. To this end, we first establish a central limit theorem for the Polyak-Ruppert averaged stochastic zeroth-order gradient algorithm. We then provide online estimators of the asymptotic covariance matrix appearing in the central limit theorem, thereby providing a practical procedure for constructing asymptotically valid confidence sets (or intervals) for parameter estimation (or prediction) in the zeroth-order setting.
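
    The averaging itself is a simple running mean of the iterates; a sketch of the averaged ZO algorithm follows (a stand-in quadratic loss with common random numbers at the two query points; the paper's actual contribution, the CLT and the online covariance estimators that yield confidence sets, is not reproduced here):

        import numpy as np

        rng = np.random.default_rng(5)
        d, mu = 5, 1e-3
        theta_star = np.arange(1.0, d + 1.0)

        def loss(theta, z):
            """Stand-in stochastic loss; z is the random sample."""
            return np.sum((theta - theta_star - 0.1 * z) ** 2)

        theta, theta_bar = np.zeros(d), np.zeros(d)
        for t in range(1, 2001):
            z = rng.normal(size=d)                  # one sample, reused at both query points
            u = rng.normal(size=d)
            g = (loss(theta + mu * u, z) - loss(theta - mu * u, z)) / (2 * mu) * u
            theta -= (0.5 / t ** 0.75) * g          # Robbins-Monro step size
            theta_bar += (theta - theta_bar) / t    # Polyak-Ruppert running average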

    Accelerated Stochastic Gradient-free and Projection-free Methods

    In this paper, we propose a class of accelerated stochastic gradient-free and projection-free (i.e., zeroth-order Frank-Wolfe) methods for constrained stochastic and finite-sum nonconvex optimization. Specifically, we propose an accelerated stochastic zeroth-order Frank-Wolfe (Acc-SZOFW) method based on the variance-reduction technique of SPIDER/SpiderBoost and a novel momentum acceleration technique. Under some mild conditions, we prove that Acc-SZOFW has a function query complexity of $O(d\sqrt{n}\epsilon^{-2})$ for finding an $\epsilon$-stationary point in the finite-sum problem, improving the existing best result by a factor of $O(\sqrt{n}\epsilon^{-2})$, and a function query complexity of $O(d\epsilon^{-3})$ in the stochastic problem, improving the existing best result by a factor of $O(\epsilon^{-1})$. To relax the large batches required by Acc-SZOFW, we further propose a novel accelerated stochastic zeroth-order Frank-Wolfe method (Acc-SZOFW*) based on the new variance-reduction technique STORM, which still attains the function query complexity of $O(d\epsilon^{-3})$ in the stochastic problem without relying on any large batches. In particular, we present an accelerated framework for Frank-Wolfe methods based on the proposed momentum acceleration technique. Extensive experimental results on black-box adversarial attacks and robust black-box classification demonstrate the efficiency of our algorithms.
    Comment: Accepted to ICML 2020, 34 pages
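
    A sketch of the STORM-style recursion at the heart of the batch-free variant, paired with a zeroth-order gradient surrogate and a Frank-Wolfe step (coordinate-wise finite differences and an $\ell_2$-ball constraint are illustrative choices, not the paper's exact construction):

        import numpy as np

        rng = np.random.default_rng(6)
        dim, T, mu, radius = 10, 300, 1e-4, 1.0
        A = rng.normal(size=(200, dim))
        b = rng.normal(size=200)

        def zo_grad(x, i):
            """Coordinate-wise finite-difference gradient of one sample loss."""
            fi = lambda y: (A[i] @ y - b[i]) ** 2
            g = np.zeros(dim)
            for j in range(dim):
                e = np.zeros(dim)
                e[j] = mu
                g[j] = (fi(x + e) - fi(x - e)) / (2 * mu)
            return g

        def lmo_l2(g, r=radius):
            return -r * g / (np.linalg.norm(g) + 1e-12)   # LMO over the l2 ball

        x = np.zeros(dim)
        d = zo_grad(x, rng.integers(len(b)))              # initial estimate, batch size one
        for t in range(1, T + 1):
            v = lmo_l2(d)
            x_new = x + (2.0 / (t + 2)) * (v - x)         # Frank-Wolfe step
            i = rng.integers(len(b))                      # one fresh sample
            rho = min(1.0, 1.0 / t ** (2 / 3))            # STORM mixing weight
            d = zo_grad(x_new, i) + (1 - rho) * (d - zo_grad(x, i))  # recursive correction
            x = x_new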

    Projection Efficient Subgradient Method and Optimal Nonsmooth Frank-Wolfe Method

    We consider the classical setting of optimizing a nonsmooth Lipschitz-continuous convex function over a convex constraint set, when one has access to a (stochastic) first-order oracle (FO) for the function and a projection oracle (PO) for the constraint set. It is well known that achieving $\epsilon$-suboptimality in high dimensions requires $\Theta(\epsilon^{-2})$ FO calls, and this is achieved by the projected subgradient method (PGD). However, PGD also entails $O(\epsilon^{-2})$ PO calls, which may be computationally costlier than FO calls (e.g., under nuclear norm constraints). Improving the PO-call complexity of PGD is largely unexplored, despite the fundamental nature of the problem and an extensive literature. We present the first such improvement. It requires only the mild assumption that the objective function, when extended to a slightly larger neighborhood of the constraint set, remains Lipschitz and accessible via the FO. In particular, we introduce the MOPES method, which carefully combines Moreau-Yosida smoothing with accelerated first-order schemes. It is guaranteed to find a feasible $\epsilon$-suboptimal solution using only $O(\epsilon^{-1})$ PO calls and an optimal $O(\epsilon^{-2})$ FO calls. Further, if instead of a PO we only have a linear minimization oracle (LMO, a la Frank-Wolfe) for the constraint set, an extension of our method, MOLES, finds a feasible $\epsilon$-suboptimal solution using $O(\epsilon^{-2})$ LMO calls and FO calls, both of which match known lower bounds, resolving a question left open since White (1993). Our experiments confirm that these methods achieve significant speedups over the state of the art for problems with costly PO and LMO calls.
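
    The smoothing at the heart of MOPES can be stated compactly; these are standard facts about the Moreau envelope, not the paper's full scheme. For $\lambda > 0$,

        f_\lambda(x) = \min_{y} \Big\{ f(y) + \frac{1}{2\lambda}\,\lVert x - y\rVert^2 \Big\},
        \qquad
        \nabla f_\lambda(x) = \frac{1}{\lambda}\,\bigl(x - \mathrm{prox}_{\lambda f}(x)\bigr).

    Since $f_\lambda$ is convex and $(1/\lambda)$-smooth with $\min f_\lambda = \min f$, an accelerated first-order scheme reaches $\epsilon$-accuracy on $f_\lambda$ in roughly $O(\sqrt{1/(\lambda\epsilon)})$ iterations; taking $\lambda \propto \epsilon$ then gives an $O(\epsilon^{-1})$ outer iteration count, which is, roughly speaking, where the $O(\epsilon^{-1})$ PO-call bound can come from, with the FO calls spent inside the prox subproblems.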

    Zeroth-Order Algorithms for Stochastic Distributed Nonconvex Optimization

    In this paper, we consider a stochastic distributed nonconvex optimization problem in which the cost function is distributed over $n$ agents that have access only to zeroth-order (ZO) information about the cost. This problem has various machine learning applications. As a solution, we propose two distributed ZO algorithms, in which at each iteration each agent samples the local stochastic ZO oracle at two points with an adaptive smoothing parameter. We show that the proposed algorithms achieve a linear-speedup convergence rate of $\mathcal{O}(\sqrt{p/(nT)})$ for smooth cost functions, and an $\mathcal{O}(p/(nT))$ convergence rate when the global cost function additionally satisfies the Polyak-Łojasiewicz (P-Ł) condition, where $p$ and $T$ are the dimension of the decision variable and the total number of iterations, respectively. To the best of our knowledge, this is the first linear-speedup result for distributed ZO algorithms; it enables systematic performance improvements by adding more agents. We also show that the proposed algorithms converge linearly for deterministic centralized optimization problems under the P-Ł condition. Through numerical experiments on generating adversarial examples from deep neural networks, we demonstrate the efficiency of our algorithms in comparison with baseline and recently proposed centralized and distributed ZO algorithms.
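
    A minimal, centralized caricature of the two-point sampling step and the across-agent averaging that drives the linear speedup (the actual algorithms are distributed and use an adaptive smoothing parameter; here the smoothing is constant and the quadratic local losses are stand-ins):

        import numpy as np

        rng = np.random.default_rng(7)
        n, p, mu, lr = 8, 6, 1e-3, 0.05
        targets = rng.normal(size=(n, p))               # each agent's local data

        def local_loss(i, x, z):
            return np.sum((x - targets[i] - 0.1 * z) ** 2)

        x = np.zeros(p)
        for t in range(500):
            g_avg = np.zeros(p)
            for i in range(n):                          # in practice, done in parallel
                u = rng.normal(size=p)
                z = rng.normal(size=p)                  # local stochastic sample
                g_avg += (local_loss(i, x + mu * u, z)
                          - local_loss(i, x - mu * u, z)) / (2 * mu) * u
            x -= lr * g_avg / n                         # average of n local ZO estimates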