
    Online Newton Step Algorithm with Estimated Gradient

    Online learning with limited-information (bandit) feedback addresses the setting in which an online learner receives only partial feedback from the environment over the course of learning. Under this setting, Flaxman et al. [8] extended Zinkevich's classical Online Gradient Descent (OGD) algorithm [29] by proposing the Online Gradient Descent with Expected Gradient (OGDEG) algorithm. Specifically, it uses a simple trick to approximate the gradient of the loss function $f_t$ by evaluating it at a single point, and bounds the expected regret by $\mathcal{O}(T^{5/6})$ [8], where $T$ is the number of rounds. Meanwhile, past research has shown that, compared with first-order algorithms, second-order online learning algorithms such as Online Newton Step (ONS) [11] can significantly accelerate the convergence of traditional online learning algorithms. Motivated by this, this paper aims to exploit second-order information to speed up the convergence of OGDEG. In particular, we extend the ONS algorithm with the expected-gradient trick and develop a novel second-order online learning algorithm, Online Newton Step with Expected Gradient (ONSEG). Theoretically, we show that the proposed ONSEG algorithm significantly reduces the expected regret of OGDEG from $\mathcal{O}(T^{5/6})$ to $\mathcal{O}(T^{2/3})$ in the bandit feedback scenario. Empirically, we further demonstrate the advantages of the proposed algorithm on multiple real-world datasets.
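    The single-point gradient estimator that OGDEG and ONSEG build on is easy to write down. Below is a minimal sketch, not the authors' implementation: the quadratic toy losses, the box projection via clipping, and the parameter values `delta`, `gamma`, and `eps` are illustrative choices, and ONS proper uses a generalized projection under the matrix $A_t$ rather than plain clipping.

    ```python
    import numpy as np

    def one_point_gradient(f, x, delta, rng):
        """Flaxman-style single-point estimator: (d/delta) * f(x + delta*u) * u,
        an unbiased estimate of the gradient of a smoothed version of f."""
        d = x.shape[0]
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)              # uniform direction on the unit sphere
        return (d / delta) * f(x + delta * u) * u

    # Toy ONS-style loop on estimated gradients; losses and parameters are illustrative.
    rng = np.random.default_rng(0)
    d, T = 5, 1000
    delta, gamma, eps = 0.1, 1.0, 1.0
    x = np.zeros(d)
    A = eps * np.eye(d)                      # running curvature matrix, as in Online Newton Step
    for t in range(T):
        target = rng.normal(size=d)
        f_t = lambda z: 0.5 * np.sum((z - target) ** 2)
        g = one_point_gradient(f_t, x, delta, rng)
        A += np.outer(g, g)
        x = x - (1.0 / gamma) * np.linalg.solve(A, g)   # Newton-style step
        x = np.clip(x, -1.0, 1.0)                       # crude stand-in for the ONS projection
    ```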

    Quantum Algorithm for Online Convex Optimization

    We explore whether quantum advantages can be found for the zeroth-order online convex optimization problem, also known as bandit convex optimization with multi-point feedback. In this setting, given access to zeroth-order oracles (that is, the loss function is accessed as a black box that returns the function value for any queried input), a player attempts to minimize a sequence of adversarially generated convex loss functions. This procedure can be described as a $T$-round iterative game between the player and the adversary. In this paper, we present quantum algorithms for the problem and show for the first time that potential quantum advantages are possible for problems of online convex optimization. Specifically, our contributions are as follows. (i) When the player is allowed to query zeroth-order oracles $O(1)$ times in each round as feedback, we give a quantum algorithm that achieves $O(\sqrt{T})$ regret without additional dependence on the dimension $n$, which outperforms the known optimal classical algorithm achieving only $O(\sqrt{nT})$ regret. Note that the regret of our quantum algorithm matches the lower bound of classical first-order methods. (ii) We show that for strongly convex loss functions, the quantum algorithm can achieve $O(\log T)$ regret with $O(1)$ queries as well, which means that the quantum algorithm can achieve the same regret bound as classical algorithms in the full-information setting.
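    For context, the classical two-point estimator used in the multi-point-feedback setting that the paper compares against can be sketched as follows; the spherical sampling and the smoothing radius `delta` are the usual illustrative choices, not details taken from the paper.

    ```python
    import numpy as np

    def two_point_gradient(f, x, delta, rng):
        """Classical zeroth-order estimate from two function queries per round:
        (d / (2*delta)) * (f(x + delta*u) - f(x - delta*u)) * u, with u uniform on the sphere."""
        d = x.shape[0]
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)
        return (d / (2.0 * delta)) * (f(x + delta * u) - f(x - delta * u)) * u
    ```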

    Stochastic Structured Prediction under Bandit Feedback

    Stochastic structured prediction under bandit feedback follows a learning protocol where, on each of a sequence of iterations, the learner receives an input, predicts an output structure, and receives partial feedback in the form of a task-loss evaluation of the predicted structure. We present applications of this learning scenario to convex and non-convex objectives for structured prediction and analyze them as stochastic first-order methods. We present an experimental evaluation on problems of natural language processing over exponential output spaces, and compare convergence speed across different objectives under the practical criterion of optimal task performance on development data and the optimization-theoretic criterion of minimal squared gradient norm. Best results under both criteria are obtained for a non-convex objective for pairwise preference learning under bandit feedback. Comment: 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
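    One concrete instance of the stochastic first-order view is the score-function gradient step for an expected-task-loss objective. The sketch below is a simplified illustration that assumes a small explicit candidate set so the expectation over outputs is exact; in the paper the output space is exponential and handled with structured computation, and the feature and loss values here are made up.

    ```python
    import numpy as np

    def expected_loss_sgd_step(w, feats, losses, eta, rng):
        """One stochastic step on the expected-task-loss objective E_{p_w}[loss]:
        sample a structure from the log-linear model p_w, observe its task loss
        (the only feedback), and take a score-function gradient step.
        feats:  (num_candidates, dim) joint feature vectors
        losses: (num_candidates,) task loss per candidate, queried only at the sample"""
        scores = feats @ w
        p = np.exp(scores - scores.max())
        p /= p.sum()                           # log-linear distribution over candidates
        y = rng.choice(len(p), p=p)            # predicted structure
        loss = losses[y]                       # bandit feedback: a single task-loss value
        grad = loss * (feats[y] - p @ feats)   # score-function gradient estimate
        return w - eta * grad

    # Tiny usage example with made-up features and losses.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 3))
    losses = np.array([0.2, 0.5, 0.9, 0.1])
    w = np.zeros(3)
    for _ in range(100):
        w = expected_loss_sgd_step(w, feats, losses, eta=0.1, rng=rng)
    ```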

    Bandit Convex Optimization for Scalable and Dynamic IoT Management

    The present paper deals with online convex optimization involving both time-varying loss functions and time-varying constraints. The loss functions are not fully accessible to the learner; instead, only the function values (a.k.a. bandit feedback) are revealed at queried points. The constraints are revealed after decisions are made and can be instantaneously violated, yet they must be satisfied in the long term. This setting fits nicely with emerging online network tasks such as fog computing in the Internet-of-Things (IoT), where online decisions must flexibly adapt to changing user preferences (loss functions) and the temporally unpredictable availability of resources (constraints). Tailored for such human-in-the-loop systems where the loss functions are hard to model, a family of bandit online saddle-point (BanSaP) schemes is developed, which adaptively adjust the online operations based on (possibly multiple) bandit feedback of the loss functions and the changing environment. Performance here is assessed by: i) dynamic regret, which generalizes the widely used static regret; and ii) fit, which captures the accumulated amount of constraint violations. Specifically, BanSaP is proved to simultaneously yield sub-linear dynamic regret and fit, provided that the best dynamic solutions vary slowly over time. Numerical tests in fog-computation offloading tasks corroborate that the proposed BanSaP approach offers competitive performance relative to existing approaches based on gradient feedback.
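    The saddle-point idea can be sketched with a one-point bandit estimate of the loss gradient, a projected primal step, and dual ascent on the observed constraint violations. The sketch below is illustrative rather than the paper's exact BanSaP scheme; the box projection, step sizes, and the toy constraint are assumptions.

    ```python
    import numpy as np

    def bansap_style_step(x, lam, f_value_oracle, g, jac_g, alpha, mu, delta, rng):
        """One primal-dual update in the spirit of BanSaP: bandit (value-only) access to
        the loss, constraints revealed after acting, dual variables track violations."""
        d = x.shape[0]
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)
        grad_f_hat = (d / delta) * f_value_oracle(x + delta * u) * u   # one-point loss-gradient estimate
        x_new = np.clip(x - alpha * (grad_f_hat + jac_g(x).T @ lam), -1.0, 1.0)
        lam_new = np.maximum(lam + mu * g(x_new), 0.0)                 # dual ascent on violations
        return x_new, lam_new

    # Example with a single made-up linear constraint g(x) = sum(x) - 1 <= 0.
    rng = np.random.default_rng(0)
    x, lam = np.zeros(3), np.zeros(1)
    f_val = lambda z: np.sum(z ** 2)                 # stand-in loss, value access only
    g = lambda z: np.array([np.sum(z) - 1.0])
    jac_g = lambda z: np.ones((1, 3))
    x, lam = bansap_style_step(x, lam, f_val, g, jac_g, alpha=0.1, mu=0.1, delta=0.1, rng=rng)
    ```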

    Multi-objective Bandits: Optimizing the Generalized Gini Index

    We study the multi-armed bandit (MAB) problem where the agent receives vectorial feedback that encodes many possibly competing objectives to be optimized. The goal of the agent is to find a policy which can optimize these objectives simultaneously in a fair way. This multi-objective online optimization problem is formalized using the Generalized Gini Index (GGI) aggregation function. We propose an online gradient descent algorithm which exploits the convexity of the GGI aggregation function and controls the exploration in a careful way, achieving a distribution-free regret of $\tilde{\mathcal{O}}(T^{-1/2})$ with high probability. We test our algorithm on synthetic data as well as on an electric battery control problem where the goal is to trade off the use of the different cells of a battery in order to balance their respective degradation rates. Comment: 13 pages, 3 figures, draft version of ICML'17 paper.
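    For reference, the GGI aggregation and a subgradient of it (the quantity an online gradient descent step would use) can be computed as below; the two-objective example and the particular non-increasing weight vector are illustrative choices, not taken from the paper.

    ```python
    import numpy as np

    def ggi(costs, weights):
        """Generalized Gini Index: weighted sum of costs sorted in decreasing order,
        with non-increasing weights, so the worst objectives receive the largest weight."""
        order = np.argsort(costs)[::-1]
        return weights @ costs[order]

    def ggi_subgradient(costs, weights):
        """A subgradient of the convex, piecewise-linear GGI at `costs`:
        assign each weight to the coordinate it multiplies in the sorted order."""
        order = np.argsort(costs)[::-1]
        g = np.empty_like(weights)
        g[order] = weights
        return g

    # Example: two objectives with weights (2/3, 1/3).
    print(ggi(np.array([0.9, 0.1]), np.array([2/3, 1/3])))   # 0.6333...
    ```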

    Regret Analysis for Continuous Dueling Bandit

    The dueling bandit is a learning framework wherein the feedback information in the learning process is restricted to a noisy comparison between a pair of actions. In this research, we address a dueling bandit problem based on a cost function over a continuous space. We propose a stochastic mirror descent algorithm and show that the algorithm achieves an $O(\sqrt{T\log T})$ regret bound under strong convexity and smoothness assumptions on the cost function. Subsequently, we clarify the equivalence between regret minimization in the dueling bandit setting and convex optimization of the cost function. Moreover, considering a lower bound in convex optimization, our algorithm is shown to achieve the optimal convergence rate in convex optimization and the optimal regret in the dueling bandit setting, up to a logarithmic factor. Comment: 14 pages. This paper was accepted at NIPS 2017 as a spotlight presentation.
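    A common way to turn a single noisy comparison into a descent direction is sketched below. This is a plain comparison-based descent step in the spirit of, but not identical to, the paper's stochastic mirror descent algorithm; the perturbation radius and step size are illustrative.

    ```python
    import numpy as np

    def comparison_descent_step(x, duel_oracle, delta, eta, rng):
        """Query the noisy comparison oracle between x and a perturbed point x + delta*u,
        then move toward whichever point the oracle prefers (a simplified sketch,
        not the mirror-descent update analyzed in the paper)."""
        d = x.shape[0]
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)
        y = x + delta * u
        winner_is_y = duel_oracle(x, y)          # True if the noisy comparison prefers y
        direction = u if winner_is_y else -u
        return x + eta * direction
    ```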

    Towards minimax policies for online linear optimization with bandit feedback

    We address the online linear optimization problem with bandit feedback. Our contribution is twofold. First, we provide an algorithm (based on exponential weights) with a regret of order $\sqrt{d n \log N}$ for any finite action set with $N$ actions, under the assumption that the instantaneous loss is bounded by 1. This shaves off an extraneous $\sqrt{d}$ factor compared to previous works, and gives a regret bound of order $d \sqrt{n \log n}$ for any compact set of actions. Without further assumptions on the action set, this last bound is minimax optimal up to a logarithmic factor. Interestingly, our result also shows that the minimax regret for bandit linear optimization with expert advice in dimension $d$ is the same as for the basic $d$-armed bandit with expert advice. Our second contribution is to show how to use the Mirror Descent algorithm to obtain computationally efficient strategies with minimax-optimal regret bounds in specific examples. More precisely, we study two canonical action sets: the hypercube and the Euclidean ball. In the former case, we obtain the first computationally efficient algorithm with a $d \sqrt{n}$ regret, thus improving by a factor $\sqrt{d \log n}$ over the best known result for a computationally efficient algorithm. In the latter case, our approach gives the first algorithm with a $\sqrt{d n \log n}$ regret, again shaving off an extraneous $\sqrt{d}$ compared to previous works.
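    The exponential-weights scheme over a finite action set, paired with a least-squares estimator of the loss vector, can be sketched as follows. This is a generic sketch: the uniform exploration mixture and the parameters `eta` and `gamma` are illustrative and differ from the exploration distribution analyzed in the paper.

    ```python
    import numpy as np

    def exp_weights_linear_bandit(actions, loss_vectors, eta, gamma, rng):
        """Exponential weights for linear bandits over a finite action set.
        actions: (N, d) array of actions; loss_vectors: iterable of (d,) adversarial losses."""
        N, d = actions.shape
        log_w = np.zeros(N)
        total_loss = 0.0
        for ell in loss_vectors:
            p = np.exp(log_w - log_w.max())
            p /= p.sum()
            q = (1.0 - gamma) * p + gamma / N             # mix in uniform exploration
            i = rng.choice(N, p=q)
            observed = actions[i] @ ell                   # bandit feedback: one scalar loss
            total_loss += observed
            Sigma = (actions * q[:, None]).T @ actions    # E_q[a a^T]
            ell_hat = np.linalg.pinv(Sigma) @ actions[i] * observed   # least-squares loss estimate
            log_w -= eta * (actions @ ell_hat)            # exponential-weights update
        return total_loss

    # Tiny usage example with a made-up action set and random losses.
    rng = np.random.default_rng(0)
    actions = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    losses = [rng.uniform(0.0, 0.5, size=2) for _ in range(200)]
    exp_weights_linear_bandit(actions, losses, eta=0.05, gamma=0.1, rng=rng)
    ```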

    Bandit Convex Optimization: $\sqrt{T}$ Regret in One Dimension

    We analyze the minimax regret of the adversarial bandit convex optimization problem. Focusing on the one-dimensional case, we prove that the minimax regret is $\widetilde{\Theta}(\sqrt{T})$ and partially resolve a decade-old open problem. Our analysis is non-constructive, as we do not present a concrete algorithm that attains this regret rate. Instead, we use minimax duality to reduce the problem to a Bayesian setting, where the convex loss functions are drawn from a worst-case distribution, and then we solve the Bayesian version of the problem with a variant of Thompson Sampling. Our analysis features a novel use of convexity, formalized as a "local-to-global" property of convex functions, that may be of independent interest.

    Bandit Structured Prediction for Learning from Partial Feedback in Statistical Machine Translation

    We present an approach to structured prediction from bandit feedback, called Bandit Structured Prediction, where only the value of a task loss function at a single predicted point, instead of a correct structure, is observed in learning. We present an application to discriminative reranking in Statistical Machine Translation (SMT) where the learning algorithm only has access to a (1-BLEU) loss evaluation of a predicted translation instead of obtaining a gold-standard reference translation. In our experiment, bandit feedback is obtained by evaluating BLEU on reference translations without revealing them to the algorithm. This can be thought of as a simulation of interactive machine translation where an SMT system is personalized by a user who provides single-point feedback on predicted translations. Our experiments show that our approach improves translation quality and is comparable to approaches that employ more informative feedback in learning. Comment: In Proceedings of MT Summit XV, 2015. Miami, FL.
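    The feedback protocol of the simulation is simple to spell out: the reference translation stays hidden, and only the scalar (1-BLEU) value of the predicted translation is returned to the learner. The snippet below is a toy illustration of that protocol; it assumes the sacrebleu package for sentence-level BLEU, and the k-best list and sentences are made up.

    ```python
    import sacrebleu  # assumed available; used only to simulate the hidden-reference feedback

    def simulate_bandit_feedback(predicted, hidden_reference):
        """Simulator/user side: return the task loss 1 - BLEU of the predicted translation
        without ever revealing the reference to the learning algorithm."""
        bleu = sacrebleu.sentence_bleu(predicted, [hidden_reference]).score / 100.0
        return 1.0 - bleu

    # Learner side: pick a candidate from a k-best list and observe only the scalar loss.
    kbest = ["the cat sat on the mat", "a cat sits on a mat", "cat on the mat it sat"]
    hidden_reference = "the cat sat on the mat"        # visible to the simulator only
    loss = simulate_bandit_feedback(kbest[1], hidden_reference)
    print(round(loss, 3))
    ```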

    Extended Formulations for Online Linear Bandit Optimization

    Online linear optimization on combinatorial action sets ($d$-dimensional actions) with bandit feedback is known to have complexity on the order of the dimension of the problem. The exponentially weighted strategy achieves the best known regret bound, of order $d^{2}\sqrt{n}$ (where $d$ is the dimension of the problem and $n$ is the time horizon). However, such strategies are provably suboptimal or computationally inefficient. The complexity is attributed to the combinatorial structure of the action set and the dearth of efficient exploration strategies over the set. Mirror descent with an entropic regularization function comes close to solving this problem by enforcing a meticulous projection of weights with an inherent boundary condition. Entropic regularization in mirror descent is the only known way of achieving a logarithmic dependence on the dimension. Here, we argue otherwise and recover the original intuition of exponential weighting by borrowing a technique from discrete optimization and approximation algorithms called `extended formulation'. Such formulations appeal to the underlying geometry of the set, with a guaranteed logarithmic dependence on the dimension underpinned by an information-theoretic entropic analysis.