Extended Formulations for Online Linear Bandit Optimization
Online linear optimization over combinatorial action sets ($d$-dimensional actions) with bandit feedback is known to have complexity that scales with the dimension of the problem. The exponentially weighted strategy achieves the best known regret bound, which grows sublinearly in the time horizon $T$ and polynomially in the dimension $d$ of the problem. However, such strategies are provably suboptimal or computationally inefficient. The complexity is attributed to the combinatorial structure of the action set and the dearth of efficient exploration strategies over the set. Mirror descent with an entropic
regularization function comes close to solving this problem by enforcing a
meticulous projection of weights with an inherent boundary condition. Entropic
regularization in mirror descent is the only known way of achieving a
logarithmic dependence on the dimension. Here, we argue otherwise and recover
the original intuition of exponential weighting by borrowing a technique from
discrete optimization and approximation algorithms called 'extended formulation'. Such formulations appeal to the underlying geometry of the set, with a guaranteed logarithmic dependence on the dimension underpinned by an information-theoretic entropic analysis.
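For context on the exponential-weighting baseline discussed above, here is a minimal sketch of an exponentially weighted (Exp2-style) strategy for bandit feedback over a finite combinatorial action set. The uniform-exploration mixing, learning rate, and least-squares loss estimator are standard textbook choices, not the paper's extended-formulation method.

```python
import numpy as np

def exp2_bandit(actions, loss_fn, T, eta=0.05, gamma=0.1, seed=0):
    """Exponentially weighted (Exp2-style) strategy with uniform exploration.

    actions: (N, d) array of combinatorial actions (e.g., 0/1 incidence vectors).
    loss_fn(t): hidden d-dimensional loss vector at round t; only the scalar
    loss of the played action is observed (bandit feedback).
    """
    rng = np.random.default_rng(seed)
    N, d = actions.shape
    log_w = np.zeros(N)                      # log-weights, for numerical stability
    total = 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        p = (1 - gamma) * p + gamma / N      # mix in uniform exploration
        k = rng.choice(N, p=p)
        z = loss_fn(t)                       # adversary's loss vector (never seen)
        obs = float(actions[k] @ z)          # the only feedback received
        total += obs
        # Least-squares estimate: with Q = sum_i p_i a_i a_i^T, the vector
        # z_hat = Q^+ a_k * obs is an unbiased estimate of z on the action span.
        Q = (actions.T * p) @ actions
        z_hat = np.linalg.pinv(Q) @ actions[k] * obs
        log_w -= eta * (actions @ z_hat)     # multiplicative-weights update
    return total
```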
Addressing the Loss-Metric Mismatch with Adaptive Loss Alignment
In most machine learning training paradigms, a fixed, often handcrafted, loss
function is assumed to be a good proxy for an underlying evaluation metric. In
this work we assess this assumption by meta-learning an adaptive loss function
to directly optimize the evaluation metric. We propose a sample efficient
reinforcement learning approach for adapting the loss dynamically during
training. We empirically show how this formulation improves performance by
simultaneously optimizing the evaluation metric and smoothing the loss
landscape. We verify our method in metric learning and classification
scenarios, showing considerable improvements over the state-of-the-art on a
diverse set of tasks. Importantly, our method is applicable to a wide range of
loss functions and evaluation metrics. Furthermore, the learned policies are
transferable across tasks and data, demonstrating the versatility of the
method.
Comment: Accepted to ICML 2019.
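To make the loss-metric mismatch idea concrete, here is a minimal sketch in which a bandit-style controller periodically switches between candidate surrogate losses and is rewarded by the change in the evaluation metric on held-out data. The toy task, the two candidate losses, and the epsilon-greedy controller are illustrative assumptions, not the paper's reinforcement-learning formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary-classification task (a hypothetical stand-in for a real one).
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = (X @ w_true + 0.5 * rng.normal(size=200) > 0).astype(float)
Xtr, ytr, Xva, yva = X[:150], y[:150], X[150:], y[150:]

def grad_logistic(w, X, y):          # gradient of the cross-entropy surrogate
    p = 1 / (1 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def grad_squared(w, X, y):           # gradient of the squared-error surrogate
    p = 1 / (1 + np.exp(-X @ w))
    return X.T @ ((p - y) * p * (1 - p)) / len(y)

def accuracy(w, X, y):               # the evaluation metric we actually care about
    return np.mean(((X @ w) > 0) == y)

grads = [grad_logistic, grad_squared]
q = np.zeros(2)                      # controller's value estimate per candidate loss
w = np.zeros(5)
prev = accuracy(w, Xva, yva)
for step in range(200):
    a = rng.integers(2) if rng.random() < 0.2 else int(np.argmax(q))
    w -= 0.5 * grads[a](w, Xtr, ytr)         # train with the currently chosen loss
    if step % 10 == 9:                       # periodically reward the controller
        acc = accuracy(w, Xva, yva)
        q[a] += 0.5 * ((acc - prev) - q[a])  # reward = metric improvement
        prev = acc
print("validation accuracy:", accuracy(w, Xva, yva))
```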
Multi-Objective Generalized Linear Bandits
In this paper, we study the multi-objective bandits (MOB) problem, where a
learner repeatedly selects one arm to play and then receives a reward vector
consisting of multiple objectives. MOB has found many real-world applications
as varied as online recommendation and network routing. On the other hand,
these applications typically contain contextual information that can guide the
learning process which, however, is ignored by most existing work. To
utilize this information, we associate each arm with a context vector and
assume the reward follows the generalized linear model (GLM). We adopt the
notion of Pareto regret to evaluate the learner's performance and develop a
novel algorithm for minimizing it. The essential idea is to apply a variant of
the online Newton step to estimate model parameters, based on which we utilize
the upper confidence bound (UCB) policy to construct an approximation of the
Pareto front, and then uniformly at random choose one arm from the approximate
Pareto front. Theoretical analysis shows that the proposed algorithm achieves
an $\tilde{O}(d\sqrt{T})$ Pareto regret, where $T$ is the time horizon and $d$ is the dimension of contexts, which matches the optimal result for the single-objective contextual bandits problem. Numerical experiments demonstrate the effectiveness of our method.
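A minimal sketch of the selection rule described above: form optimistic (UCB) estimates of each arm's reward vector, keep the arms whose optimistic vectors are not dominated by any other arm's (an approximate Pareto front), and play one of them uniformly at random. The confidence widths here are a generic placeholder, not the paper's online-Newton-step-based estimator.

```python
import numpy as np

def pareto_ucb_choice(means, widths, rng):
    """Pick an arm uniformly at random from the optimistic Pareto front.

    means:  (K, m) estimated reward vector per arm (m objectives).
    widths: (K,) confidence widths (placeholder for the GLM-based bound).
    """
    ucb = means + widths[:, None]    # optimistic estimate per objective
    K = len(ucb)
    # an arm stays on the front unless some arm dominates it in every objective
    front = [i for i in range(K)
             if not any(np.all(ucb[j] >= ucb[i]) and np.any(ucb[j] > ucb[i])
                        for j in range(K) if j != i)]
    return int(rng.choice(front))

# usage: 5 arms, 2 objectives (illustrative numbers)
rng = np.random.default_rng(0)
means = rng.normal(size=(5, 2))
widths = np.full(5, 0.1)
print("played arm:", pareto_ucb_choice(means, widths, rng))
```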
Sharp Dichotomies for Regret Minimization in Metric Spaces
The Lipschitz multi-armed bandit (MAB) problem generalizes the classical
multi-armed bandit problem by assuming one is given side information consisting
of a priori upper bounds on the difference in expected payoff between certain
pairs of strategies. Classical results of (Lai and Robbins 1985) and (Auer et
al. 2002) imply a logarithmic regret bound for the Lipschitz MAB problem on
finite metric spaces. Recent results on continuum-armed bandit problems and
their generalizations imply lower bounds of $\Omega(\sqrt{t})$, or stronger, for many
infinite metric spaces such as the unit interval. Is this dichotomy universal?
We prove that the answer is yes: for every metric space, the optimal regret of
a Lipschitz MAB algorithm is either bounded above by any $f \in \omega(\log t)$, or bounded below by any $g \in o(\sqrt{t})$. Perhaps surprisingly, this
dichotomy does not coincide with the distinction between finite and infinite
metric spaces; instead it depends on whether the completion of the metric space
is compact and countable. Our proof connects upper and lower bound techniques
in online learning with classical topological notions such as perfect sets and
the Cantor-Bendixson theorem. Among many other results, we show a similar
dichotomy for the "full-feedback" (a.k.a. "best-expert") version.
Comment: Full version of a paper in ACM-SIAM SODA 2010.
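Stated as a worked formula, the dichotomy reconstructed above reads as follows, with $R(t)$ denoting the optimal regret (the compact-and-countable criterion is quoted from the abstract):

```latex
% Dichotomy for the Lipschitz MAB problem on a metric space (X, d):
% exactly one of the two regimes holds for the optimal regret R(t).
\[
\text{either}\quad \forall f \in \omega(\log t):\; R(t) = O(f(t)),
\qquad \text{or}\quad \forall g \in o(\sqrt{t}):\; R(t) = \Omega(g(t)),
\]
% with the first regime holding if and only if the completion of
% (X, d) is compact and countable.
```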
On Regret-Optimal Learning in Decentralized Multi-player Multi-armed Bandits
We consider the problem of learning in single-player and multiplayer
multiarmed bandit models. Bandit problems are classes of online learning
problems that capture exploration versus exploitation tradeoffs. In a
multiarmed bandit model, players can pick among many arms, and each play of an
arm generates an i.i.d. reward from an unknown distribution. The objective is
to design a policy that maximizes the expected reward over a time horizon for a
single player setting and the sum of expected rewards for the multiplayer
setting. In the multiplayer setting, arms may give different rewards to
different players. There is no separate channel for coordination among the
players. Any attempt at communication is costly and adds to regret. We propose
two decentralizable policies, $E^3$ ($E$-cubed) and $E^3$-TS, that can be used in both single-player and multiplayer settings. These policies are shown to yield expected regret that grows at most as $O(\log^{1+\epsilon} T)$. It is well known that $O(\log T)$ is the lower bound on the rate of growth of regret even in the centralized case. The proposed algorithms improve on prior work, where regret grew at $O(\log^2 T)$. More fundamentally, these policies address the question of the additional cost incurred in decentralized online learning, suggesting that there is at most an $\epsilon$-factor cost in terms of the order of regret. This solves a problem of relevance in many domains that had been open for a while.
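For intuition, here is a generic deterministic explore-then-exploit epoch schedule of the kind that underlies such logarithmic bounds; it is a sketch of the general pattern, not the authors' $E^3$ or $E^3$-TS policies, and all constants are illustrative.

```python
import numpy as np

def epoch_policy(pull, K, epochs=15):
    """Deterministic epoch schedule: in epoch l, explore every arm l times,
    then exploit the empirical best arm for a geometrically longer block.
    Exploration totals O(log T) pulls, which drives near-logarithmic regret."""
    counts = np.zeros(K)
    sums = np.zeros(K)
    reward = 0.0
    for ell in range(1, epochs + 1):
        for k in range(K):                   # exploration sub-block
            for _ in range(ell):
                r = pull(k)
                counts[k] += 1; sums[k] += r; reward += r
        best = int(np.argmax(sums / counts))
        for _ in range(2 ** ell):            # exploitation sub-block
            reward += pull(best)
    return reward

# usage with Bernoulli arms (illustrative)
rng = np.random.default_rng(1)
mu = np.array([0.3, 0.5, 0.7])
print(epoch_policy(lambda k: float(rng.random() < mu[k]), K=3))
```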
Online Dynamic Programming
We consider the problem of repeatedly solving a variant of the same dynamic
programming problem in successive trials. An instance of the type of problems
we consider is to find a good binary search tree in a changing environment. At the beginning of each trial, the learner probabilistically chooses a tree with the $n$ keys at the internal nodes and the $n+1$ gaps between keys at the leaves. The learner is then told the frequencies of the keys and gaps and is
charged by the average search cost for the chosen tree. The problem is online
because the frequencies can change between trials. The goal is to develop
algorithms with the property that their total average search cost (loss) in all
trials is close to the total loss of the best tree chosen in hindsight for all
trials. The challenge, of course, is that the algorithm has to deal with an exponential number of trees. We develop a general methodology for tackling such
problems for a wide class of dynamic programming algorithms. Our framework
allows us to extend online learning algorithms like Hedge and Component Hedge
to a significantly wider class of combinatorial objects than was possible
before.
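For reference, a minimal sketch of the Hedge update that the abstract extends; it enumerates the objects explicitly, which is exactly what the paper's framework avoids for exponentially large classes such as trees (the learning rate is an illustrative choice).

```python
import numpy as np

def hedge(losses, eta=0.5, seed=0):
    """Hedge over N explicitly listed combinatorial objects.

    losses: (T, N) array; losses[t, i] is the loss of object i at trial t
    (e.g., the average search cost of tree i under trial t's frequencies).
    Returns the total loss of the sampled predictions.
    """
    rng = np.random.default_rng(seed)
    T, N = losses.shape
    log_w = np.zeros(N)
    total = 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        i = rng.choice(N, p=p)        # probabilistically choose an object
        total += losses[t, i]
        log_w -= eta * losses[t]      # multiplicative update on all objects
    return total

# usage: 100 trials over 8 objects with random losses (illustrative)
print(hedge(np.random.default_rng(2).random((100, 8))))
```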
Towards Distribution-Free Multi-Armed Bandits with Combinatorial Strategies
In this paper we study a generalized version of the classical multi-armed bandit (MAB) problem by allowing for arbitrary constraints on constituent bandits at
each decision point. The motivation of this study comes from many situations
that involve repeatedly making choices subject to arbitrary constraints in an
uncertain environment: for instance, regularly deciding which advertisements to
display online in order to gain high click-through-rate without knowing user
preferences, or what route to drive home each day under uncertain weather and
traffic conditions. Assume that there are unknown random variables (RVs),
i.e., arms, each evolving as an \emph{i.i.d} stochastic process over time. At
each decision epoch, we select a strategy, i.e., a subset of RVs, subject to
arbitrary constraints on constituent RVs.
We then gain a reward that is a linear combination of observations on
selected RVs.
The performance of prior results for this problem depends heavily on the distribution of strategies generated by the corresponding learning policy. For example, if the reward difference between the best and second-best strategies approaches zero, prior results may lead to arbitrarily large regret. Meanwhile, when there is an exponential number of possible strategies at each decision point, a naive extension of a prior distribution-free policy would cause poor performance in terms of regret, computation, and space complexity.
To this end, we propose an efficient Distribution-Free Learning (DFL) policy that achieves zero regret regardless of the probability distribution of the resultant strategies. Our learning policy is efficient in both time and space complexity. We further show that even if finding the optimal strategy at each decision point is NP-hard, our policy still allows for approximate solutions while retaining near-zero regret.
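A minimal sketch of the generic combinatorial-UCB pattern this line of work builds on: maintain a confidence index per arm and let a constraint-aware oracle pick the best feasible subset under the optimistic indices. The top-k oracle and the index constants are illustrative stand-ins for arbitrary constraints and for the DFL policy's details.

```python
import numpy as np

def combinatorial_ucb(pull, K, oracle, T):
    """Generic combinatorial UCB sketch (not the paper's DFL policy).

    pull(S):     plays subset S, returns one observation per selected arm.
    oracle(idx): returns the best feasible subset for index vector idx,
                 encoding the (arbitrary) constraints on constituent arms.
    """
    counts = np.ones(K)                                # each arm initialized once
    means = np.array([pull([k])[0] for k in range(K)])
    for t in range(K + 1, T + 1):
        idx = means + np.sqrt(2 * np.log(t) / counts)  # optimistic indices
        S = oracle(idx)
        for k, r in zip(S, pull(S)):                   # reward is linear in observations
            means[k] += (r - means[k]) / (counts[k] + 1)
            counts[k] += 1
    return means

# usage: best-2-of-5 Bernoulli arms; the top-k oracle stands in for constraints
rng = np.random.default_rng(1)
mu = np.array([0.2, 0.4, 0.5, 0.6, 0.8])
pull = lambda S: [float(rng.random() < mu[k]) for k in S]
oracle = lambda idx: list(np.argsort(idx)[-2:])
print(combinatorial_ucb(pull, 5, oracle, T=2000))
```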
Imitation Learning as f-Divergence Minimization
We address the problem of imitation learning with multi-modal demonstrations.
Instead of attempting to learn all modes, we argue that in many tasks it is
sufficient to imitate any one of them. We show that the state-of-the-art
methods such as GAIL and behavior cloning, due to their choice of loss
function, often incorrectly interpolate between such modes. Our key insight is
to minimize the right divergence between the learner and the expert
state-action distributions, namely the reverse KL divergence or I-projection.
We propose a general imitation learning framework for estimating and minimizing
any f-divergence. By plugging in different divergences, we are able to recover existing algorithms such as Behavior Cloning (Kullback-Leibler), GAIL (Jensen-Shannon), and DAgger (Total Variation). Empirical results show that our
approximate I-projection technique is able to imitate multi-modal behaviors
more reliably than GAIL and behavior cloning.
Comment: International Workshop on the Algorithmic Foundations of Robotics (WAFR) 2020.
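As a worked reference for the divergences named above, here is the f-divergence family with the standard generator functions that recover each special case (textbook definitions, not quoted from the paper):

```latex
% f-divergence between expert distribution P and learner distribution Q:
\[
D_f(P \,\|\, Q) \;=\; \mathbb{E}_{x \sim Q}\!\left[\, f\!\left(\frac{P(x)}{Q(x)}\right) \right],
\qquad f \text{ convex},\ f(1) = 0.
\]
% Special cases (u = P/Q):
%   forward KL (behavior cloning):  f(u) = u log u
%   reverse KL (I-projection):      f(u) = -log u
%   total variation (DAgger):       f(u) = |u - 1| / 2
%   Jensen-Shannon (GAIL):          f(u) = (1/2)[u log(2u/(u+1)) + log(2/(u+1))]
```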
Stochastic Contextual Bandits with Known Reward Functions
Many sequential decision-making problems in communication networks can be
modeled as contextual bandit problems, which are natural extensions of the
well-known multi-armed bandit problem. In contextual bandit problems, at each
time, an agent observes some side information or context, pulls one arm and
receives the reward for that arm. We consider a stochastic formulation where
the context-reward tuples are independently drawn from an unknown distribution
in each trial. Motivated by networking applications, we analyze a setting where
the reward is a known non-linear function of the context and the chosen arm's
current state. We first consider the case of discrete and finite context-spaces
and propose DCB, an algorithm that we prove, through a careful
analysis, yields regret (cumulative reward gap compared to a distribution-aware
genie) scaling logarithmically in time and linearly in the number of arms that
are not optimal for any context, improving over existing algorithms where the
regret scales linearly in the total number of arms. We then study continuous
context-spaces with Lipschitz reward functions and propose CCB, an algorithm that uses DCB as a subroutine. CCB reveals a novel regret-storage trade-off governed by a tunable parameter. Tuning this parameter to the time horizon allows us to obtain sub-linear regret bounds while requiring sub-linear storage. By exploiting joint learning for all contexts, we get regret bounds for CCB that are unachievable by any existing contextual bandit
algorithm for continuous context-spaces. We also show similar performance
bounds for the unknown-horizon case.
Comment: A version of this technical report is under submission to IEEE/ACM Transactions on Networking.
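A minimal sketch of the structural idea behind DCB as described above: because the reward function is known, a single estimate of each arm's state serves every context, so observations are shared across contexts. The endpoint-based optimism and the state model are illustrative assumptions, not the paper's exact indices.

```python
import numpy as np

def known_reward_ucb(contexts, reward_fn, pull_state, K, T, seed=0):
    """UCB when reward = known function of (context, arm state).

    reward_fn(c, s): known non-linear reward for context c and arm state s.
    pull_state(k):   noisy observation of arm k's hidden scalar state.
    One state estimate per arm is reused for every context (joint learning).
    """
    rng = np.random.default_rng(seed)
    counts = np.ones(K)
    s_hat = np.array([pull_state(k) for k in range(K)])  # initial estimates
    total = 0.0
    for t in range(K + 1, T + 1):
        c = contexts[rng.integers(len(contexts))]        # context arrives i.i.d.
        width = np.sqrt(2 * np.log(t) / counts)
        # optimism pushed through the known reward function (endpoint heuristic):
        ucb = [max(reward_fn(c, s_hat[k] - width[k]),
                   reward_fn(c, s_hat[k] + width[k])) for k in range(K)]
        k = int(np.argmax(ucb))
        obs = pull_state(k)
        total += reward_fn(c, obs)
        s_hat[k] += (obs - s_hat[k]) / (counts[k] + 1)
        counts[k] += 1
    return total

# usage (illustrative): hidden states mu, reward c * sin(state)
mu = np.array([0.2, 0.8, 1.4])
noise = np.random.default_rng(1)
print(known_reward_ucb(contexts=[0.5, 1.0, 2.0],
                       reward_fn=lambda c, s: c * np.sin(s),
                       pull_state=lambda k: mu[k] + 0.1 * noise.normal(),
                       K=3, T=1000))
```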
Derivative-free optimization methods
In many optimization problems arising from scientific, engineering and
artificial intelligence applications, objective and constraint functions are
available only as the output of a black-box or simulation oracle that does not
provide derivative information. Such settings necessitate the use of methods
for derivative-free, or zeroth-order, optimization. We provide a review and
perspectives on developments in these methods, with an emphasis on highlighting
recent developments and on unifying treatment of such problems in the
non-linear optimization and machine learning literature. We categorize methods
based on assumed properties of the black-box functions, as well as features of
the methods. We first overview the primary setting of deterministic methods
applied to unconstrained, non-convex optimization problems where the objective
function is defined by a deterministic black-box oracle. We then discuss
developments in randomized methods, methods that assume some additional
structure about the objective (including convexity, separability and general
non-smooth compositions), methods for problems where the output of the
black-box oracle is stochastic, and methods for handling different types of
constraints.
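As one concrete instance of the randomized methods surveyed here, a two-point Gaussian-smoothing gradient estimator, a standard zeroth-order technique; the step sizes and the test objective are illustrative.

```python
import numpy as np

def zo_gradient(f, x, rng, sigma=1e-2, n_dirs=20):
    """Two-point Gaussian-smoothing gradient estimate of a black-box f."""
    g = np.zeros_like(x)
    for _ in range(n_dirs):
        u = rng.normal(size=x.shape)                  # random search direction
        g += (f(x + sigma * u) - f(x - sigma * u)) / (2 * sigma) * u
    return g / n_dirs

# usage: minimize a smooth function using only function evaluations
rng = np.random.default_rng(0)
f = lambda x: float(np.sum((x - 1.0) ** 2))           # illustrative objective
x = np.zeros(5)
for _ in range(200):
    x -= 0.1 * zo_gradient(f, x, rng)                 # zeroth-order descent step
print("final value:", f(x))                           # approaches the minimum, 0
```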