Online Newton Step Algorithm with Estimated Gradient
Online learning with limited-information (bandit) feedback addresses the
problem where an online learner receives only partial feedback from the
environment in the course of learning. Under this setting, Flaxman et al. [8]
extended Zinkevich's classical Online Gradient Descent (OGD) algorithm [29] by
proposing the Online Gradient Descent with Expected Gradient (OGDEG) algorithm.
Specifically, it uses a simple trick to approximate the gradient of the loss
function by evaluating it at a single point, and bounds the expected
regret by O(T^{3/4}) [8], where T is the number of rounds.
Meanwhile, past research efforts have shown that compared with the first-order
algorithms, second-order online learning algorithms such as Online Newton Step
(ONS) [11] can significantly accelerate the convergence rate of traditional
online learning algorithms. Motivated by this, we aim to exploit
second-order information to speed up the convergence of the OGDEG algorithm. In
particular, we extend the ONS algorithm with the expected-gradient trick and
develop a novel second-order online learning algorithm, i.e., Online Newton
Step with Expected Gradient (ONSEG). Theoretically, we show that the proposed
ONSEG algorithm significantly reduces the expected regret of the OGDEG
algorithm from O(T^{3/4}) to O(T^{2/3}) in the bandit feedback
scenario. Empirically, we further demonstrate the advantages of the proposed
algorithm on multiple real-world datasets.
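The single-point gradient trick used by OGDEG and ONSEG can be sketched as follows. This is a minimal illustration under assumed settings (a toy quadratic loss, box feasible set, and ad hoc step sizes chosen here for the example; none of these come from the paper):

```python
import numpy as np

def one_point_gradient_estimate(loss, x, delta, rng):
    """One-point gradient estimator in the style of Flaxman et al.

    Queries the loss at a single perturbed point x + delta*u, with u drawn
    uniformly from the unit sphere, and returns (d/delta) * loss(...) * u,
    an unbiased estimate of the gradient of a delta-smoothed version of loss.
    """
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)  # uniform random direction on the unit sphere
    return (d / delta) * loss(x + delta * u) * u

# Toy usage: projected online gradient descent driven by estimated gradients.
rng = np.random.default_rng(0)
c = np.array([0.5, -0.3])  # hidden optimum of the illustrative quadratic loss
x = np.zeros(2)
for t in range(1, 5001):
    g = one_point_gradient_estimate(lambda z: np.sum((z - c) ** 2), x, 0.05, rng)
    x = np.clip(x - 0.01 / np.sqrt(t) * g, -1.0, 1.0)  # projection onto a box
```

The key point is that only one loss value per round is needed, at the cost of estimator variance growing as 1/delta.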
Quantum Algorithm for Online Convex Optimization
We explore whether quantum advantages can be found for the zeroth-order
online convex optimization problem, which is also known as bandit convex
optimization with multi-point feedback. In this setting, given access to
zeroth-order oracles (that is, the loss function is accessed as a black box
that returns the function value for any queried input), a player attempts to
minimize a sequence of adversarially generated convex loss functions. This
procedure can be described as a T-round iterative game between the player and
the adversary. In this paper, we present quantum algorithms for the problem and
show for the first time that potential quantum advantages are possible for
problems of online convex optimization. Specifically, our contributions are as
follows. (i) When the player is allowed to query zeroth-order oracles O(1)
times in each round as feedback, we give a quantum algorithm that achieves
O(\sqrt{T}) regret without additional dependence on the dimension n, which
outperforms the known optimal classical algorithm, which only achieves
O(\sqrt{nT}) regret. Note that the regret of our quantum algorithm has
achieved the lower bound of classical first-order methods. (ii) We show that
for strongly convex loss functions, the quantum algorithm can achieve
O(\log T) regret with O(1) queries as well, which means that the quantum algorithm
can achieve the same regret bound as the classical algorithms in the full
information setting.
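A standard classical ingredient in this multi-point feedback setting is the two-point gradient estimator, which the quantum method is benchmarked against. A minimal sketch (the quadratic test function is illustrative, not from the paper):

```python
import numpy as np

def two_point_gradient_estimate(loss, x, delta, rng):
    """Two-point (multi-point feedback) gradient estimator.

    Queries the zeroth-order oracle twice per round, at x + delta*u and
    x - delta*u, and returns (d / (2*delta)) * (difference of losses) * u.
    Unlike the one-point estimator, its variance stays bounded as delta -> 0.
    """
    d = x.shape[0]
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)  # uniform random direction on the unit sphere
    return (d / (2 * delta)) * (loss(x + delta * u) - loss(x - delta * u)) * u
```

For a quadratic loss the finite difference along u recovers u-dot-gradient exactly, so averaging many estimates recovers the true gradient.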
Stochastic Structured Prediction under Bandit Feedback
Stochastic structured prediction under bandit feedback follows a learning
protocol where on each of a sequence of iterations, the learner receives an
input, predicts an output structure, and receives partial feedback in form of a
task loss evaluation of the predicted structure. We present applications of
this learning scenario to convex and non-convex objectives for structured
prediction and analyze them as stochastic first-order methods. We present an
experimental evaluation on problems of natural language processing over
exponential output spaces, and compare convergence speed across different
objectives under the practical criterion of optimal task performance on
development data and the optimization-theoretic criterion of minimal squared
gradient norm. Best results under both criteria are obtained for a non-convex
objective for pairwise preference learning under bandit feedback.
Comment: 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
Bandit Convex Optimization for Scalable and Dynamic IoT Management
The present paper deals with online convex optimization involving both
time-varying loss functions, and time-varying constraints. The loss functions
are not fully accessible to the learner, and instead only the function values
(a.k.a. bandit feedback) are revealed at queried points. The constraints are
revealed after making decisions, and can be instantaneously violated, yet they
must be satisfied in the long term. This setting nicely fits emerging
online network tasks such as fog computing in the Internet-of-Things (IoT),
where online decisions must flexibly adapt to the changing user preferences
(loss functions), and the temporally unpredictable availability of resources
(constraints). Tailored for such human-in-the-loop systems where the loss
functions are hard to model, a family of bandit online saddle-point (BanSaP)
schemes are developed, which adaptively adjust the online operations based on
(possibly multiple) bandit feedback of the loss functions, and the changing
environment. Performance here is assessed by: i) dynamic regret that
generalizes the widely used static regret; and, ii) fit that captures the
accumulated amount of constraint violations. Specifically, BanSaP is proved to
simultaneously yield sub-linear dynamic regret and fit, provided that the best
dynamic solutions vary slowly over time. Numerical tests in fog computation
offloading tasks corroborate that our proposed BanSaP approach offers
competitive performance relative to existing approaches that are based on
gradient feedback.
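The saddle-point idea behind such schemes can be illustrated with a toy primal-dual loop. This is a hypothetical instance, not the paper's BanSaP implementation: a fixed quadratic loss, a single linear long-term constraint, a one-point bandit gradient estimate, and ad hoc step sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, delta = 2, 3000, 0.1
target = np.array([0.6, 0.2])           # optimum of the illustrative loss
loss = lambda z: np.sum((z - target) ** 2)
constraint = lambda z: np.sum(z) - 1.0  # long-term constraint g(x) <= 0

x, lam = np.zeros(d), 0.0
for t in range(1, T + 1):
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    grad_est = (d / delta) * loss(x + delta * u) * u  # bandit feedback on loss
    step = 0.05 / np.sqrt(t)
    # primal descent on the online Lagrangian, with a box projection
    x = np.clip(x - step * (grad_est + lam * np.ones(d)), -1.0, 1.0)
    # dual ascent on the constraint violation (the accumulated "fit")
    lam = max(0.0, lam + step * constraint(x))
```

The dual variable lam grows while the constraint is violated and pushes the primal iterate back toward feasibility, which is the mechanism behind sub-linear fit.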
Multi-objective Bandits: Optimizing the Generalized Gini Index
We study the multi-armed bandit (MAB) problem where the agent receives a
vectorial feedback that encodes many possibly competing objectives to be
optimized. The goal of the agent is to find a policy, which can optimize these
objectives simultaneously in a fair way. This multi-objective online
optimization problem is formalized by using the Generalized Gini Index (GGI)
aggregation function. We propose an online gradient descent algorithm which
exploits the convexity of the GGI aggregation function, and controls the
exploration in a careful way, achieving a distribution-free regret of
\tilde{O}(T^{-1/2}) with high probability. We test our algorithm on
synthetic data as well as on an electric battery control problem where the goal
is to trade off the use of the different cells of a battery in order to balance
their respective degradation rates.
Comment: 13 pages, 3 figures, draft version of ICML'17 paper.
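For concreteness, the GGI aggregation itself is simple to compute. A sketch follows; the weight vector below is an illustrative choice, as GGI only requires the weights to be non-increasing:

```python
import numpy as np

def ggi(costs, weights):
    """Generalized Gini Index of a cost vector.

    Sorts the costs in decreasing order and takes the dot product with a
    non-increasing weight vector, so the worst objective receives the largest
    weight; balanced cost vectors therefore score lower than unbalanced ones
    with the same total, which is what makes GGI a fairness-encouraging
    aggregation function.
    """
    return float(np.dot(weights, np.sort(costs)[::-1]))

w = np.array([0.5, 0.3, 0.2])                   # illustrative weights
unbalanced = ggi(np.array([3.0, 2.0, 1.0]), w)  # worst cost 3.0 weighted most
balanced = ggi(np.array([2.0, 2.0, 2.0]), w)    # same total, spread evenly
```

Minimizing GGI of the cumulative cost vector thus trades off the competing objectives in a fair way.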
Regret Analysis for Continuous Dueling Bandit
The dueling bandit is a learning framework wherein the feedback information
in the learning process is restricted to a noisy comparison between a pair of
actions. In this research, we address a dueling bandit problem based on a cost
function over a continuous space. We propose a stochastic mirror descent
algorithm and show that the algorithm achieves an O(\sqrt{T \log T})-regret
bound under strong convexity and smoothness assumptions for the cost function.
Subsequently, we clarify the equivalence between regret minimization in dueling
bandit and convex optimization for the cost function. Moreover, when
considering a lower bound in convex optimization, our algorithm is shown to
achieve the optimal convergence rate in convex optimization and the optimal
regret in dueling bandit, except for a logarithmic factor.
Comment: 14 pages. This paper was accepted at NIPS 2017 as a spotlight presentation.
Towards minimax policies for online linear optimization with bandit feedback
We address the online linear optimization problem with bandit feedback. Our
contribution is twofold. First, we provide an algorithm (based on exponential
weights) with a regret of order \sqrt{dn \log N} for any finite action set
with N actions, under the assumption that the instantaneous loss is bounded
by 1. This shaves off an extraneous \sqrt{d} factor compared to previous
works, and gives a regret bound of order d\sqrt{n \log n} for any compact
set of actions. Without further assumptions on the action set, this last bound
is minimax optimal up to a logarithmic factor. Interestingly, our result also
shows that the minimax regret for bandit linear optimization with expert advice
in dimension d is the same as for the basic d-armed bandit with expert
advice. Our second contribution is to show how to use the Mirror Descent
algorithm to obtain computationally efficient strategies with minimax optimal
regret bounds in specific examples. More precisely we study two canonical
action sets: the hypercube and the Euclidean ball. In the former case, we
obtain the first computationally efficient algorithm with a d\sqrt{n}
regret, thus improving by a factor \sqrt{\log n} over the best known result
for a computationally efficient algorithm. In the latter case, our approach
gives the first algorithm with a \sqrt{dn \log n} regret, again shaving off
an extraneous \sqrt{d} compared to previous works.
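The exponential-weights template over a finite action set can be sketched as follows. Note this is an Exp3-style simplification, importance-weighting only the played action's scalar loss, whereas bandit linear optimization proper estimates the full loss vector; the action set and hidden loss vector below are toy assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
actions = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # N=3 actions, d=2
N = actions.shape[0]
theta = np.array([0.8, 0.2])   # hidden loss vector; loss of action a is <a, theta>
weights = np.ones(N)
eta, gamma, T = 0.05, 0.1, 2000
for t in range(T):
    p = (1 - gamma) * weights / weights.sum() + gamma / N  # mix in exploration
    i = rng.choice(N, p=p)
    observed = actions[i] @ theta                  # only this scalar is revealed
    weights[i] *= np.exp(-eta * observed / p[i])   # importance-weighted update
```

Dividing the observed loss by p[i] keeps the per-round loss estimate unbiased, so the weights concentrate on the lowest-loss action over time.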
Bandit Convex Optimization: sqrt{T} Regret in One Dimension
We analyze the minimax regret of the adversarial bandit convex optimization
problem. Focusing on the one-dimensional case, we prove that the minimax regret
is \tilde{\Theta}(\sqrt{T}) and partially resolve a decade-old open
problem. Our analysis is non-constructive, as we do not present a concrete
algorithm that attains this regret rate. Instead, we use minimax duality to
reduce the problem to a Bayesian setting, where the convex loss functions are
drawn from a worst-case distribution, and then we solve the Bayesian version of
the problem with a variant of Thompson Sampling. Our analysis features a novel
use of convexity, formalized as a "local-to-global" property of convex
functions, that may be of independent interest.
Bandit Structured Prediction for Learning from Partial Feedback in Statistical Machine Translation
We present an approach to structured prediction from bandit feedback, called
Bandit Structured Prediction, where only the value of a task loss function at a
single predicted point, instead of a correct structure, is observed in
learning. We present an application to discriminative reranking in Statistical
Machine Translation (SMT) where the learning algorithm only has access to a
1-BLEU loss evaluation of a predicted translation instead of obtaining a gold
standard reference translation. In our experiment bandit feedback is obtained
by evaluating BLEU on reference translations without revealing them to the
algorithm. This can be thought of as a simulation of interactive machine
translation where an SMT system is personalized by a user who provides single
point feedback to predicted translations. Our experiments show that our
approach improves translation quality and is comparable to approaches that
employ more informative feedback in learning.
Comment: In Proceedings of MT Summit XV, 2015. Miami, Florida.
Extended Formulations for Online Linear Bandit Optimization
Online linear optimization over combinatorial action sets (d-dimensional
actions) with bandit feedback is known to have complexity on the order of the
dimension of the problem. The exponential weights strategy achieves the best
known regret bound, which grows with the dimension d of the problem and the
time horizon T. However, such strategies
are provably suboptimal or computationally inefficient. The complexity is
attributed to the combinatorial structure of the action set and the dearth of
efficient exploration strategies of the set. Mirror descent with entropic
regularization function comes close to solving this problem by enforcing a
meticulous projection of weights with an inherent boundary condition. Entropic
regularization in mirror descent is the only known way of achieving a
logarithmic dependence on the dimension. Here, we argue otherwise and recover
the original intuition of exponential weighting by borrowing a technique from
discrete optimization and approximation algorithms called `extended
formulation'. Such formulations appeal to the underlying geometry of the set
with a guaranteed logarithmic dependence on the dimension, underpinned by an
information-theoretic entropic analysis.
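The entropic mirror descent update mentioned above has a simple closed form on the probability simplex. A minimal full-information sketch, with a fixed illustrative loss vector standing in for the sequence of losses:

```python
import numpy as np

def entropic_md_step(p, grad, eta):
    """One entropic mirror descent (exponentiated-gradient) step on the simplex.

    With the negative-entropy regularizer, the mirror descent update has the
    closed form p_i <- p_i * exp(-eta * grad_i) followed by renormalization,
    so the projection back onto the simplex is just this normalization. This
    is the update whose regret depends only logarithmically on the dimension.
    """
    q = p * np.exp(-eta * grad)
    return q / q.sum()

# Usage: repeated steps against a fixed loss vector concentrate the iterate
# on the lowest-loss coordinate.
p = np.full(3, 1.0 / 3.0)
loss_vec = np.array([0.9, 0.1, 0.5])  # illustrative losses
for _ in range(200):
    p = entropic_md_step(p, loss_vec, eta=0.1)
```

The multiplicative form is what makes the boundary of the simplex an implicit barrier: coordinates shrink but never leave the simplex.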