164 research outputs found
Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
We study the problem of learning Markov decision processes with finite state
and action spaces when the transition probability distributions and loss
functions are chosen adversarially and are allowed to change with time. We
introduce an algorithm whose regret with respect to any policy in a comparison
class grows as the square root of the number of rounds of the game, provided
the transition probabilities satisfy a uniform mixing condition. Our approach
is efficient as long as the comparison class is polynomial and we can compute
expectations over sample paths for each policy. Designing an efficient
algorithm with small regret for the general case remains an open problem.
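To make the efficiency claim concrete, here is a minimal sketch, assuming a small enumerated policy class and a hypothetical oracle expected_loss(policy, t) for each policy's expected loss over sample paths; this is plain exponential weights over policies, in the spirit of (but not necessarily identical to) the paper's algorithm:

```python
import math
import random

def exp_weights_over_policies(policies, expected_loss, T):
    """Hedged sketch: exponential weights over an enumerated policy class.

    expected_loss(policy, t) is an assumed oracle returning the policy's
    expected loss (in [0, 1]) over sample paths at round t, revealed after
    the round is played.
    """
    n = len(policies)
    eta = math.sqrt(8 * math.log(n) / T)
    weights = [1.0] * n
    total = 0.0
    for t in range(T):
        Z = sum(weights)
        probs = [w / Z for w in weights]
        i = random.choices(range(n), weights=probs)[0]    # policy for round t
        losses = [expected_loss(p, t) for p in policies]  # revealed afterwards
        total += losses[i]
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    return total
```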
Competitive ratio versus regret minimization: achieving the best of both worlds
We consider online algorithms under both the competitive ratio criterion and
the regret minimization one. Our main goal is to build a unified methodology
that can guarantee both criteria simultaneously.
For a general class of online algorithms, namely any Metrical Task System
(MTS), we show that one can simultaneously guarantee the best known competitive
ratio and a natural regret bound. For the paging problem we further show an
efficient online algorithm (polynomial in the number of pages) with this
guarantee.
To this end, we extend an existing regret minimization algorithm
(specifically, that of Kapralov and Panigrahy) to handle movement cost (the cost of
switching between states of the online system). We then show how to use the
extended regret minimization algorithm to combine multiple online algorithms.
Our end result is an online algorithm that can combine a "base" online
algorithm, having a guaranteed competitive ratio, with a range of online
algorithms that guarantee a small regret over any interval of time. The
combined algorithm guarantees both a competitive ratio that matches the base
algorithm's and a low regret over any time interval.
As a by-product, we obtain an expert algorithm with a close-to-optimal regret
bound on every time interval, even in the presence of switching costs. This
result is of independent interest.
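For intuition on the combining step, a minimal sketch follows: plain multiplicative weights over candidate algorithms with explicit movement-cost bookkeeping. This is not the paper's movement-cost-aware extension of Kapralov and Panigrahy, just the basic pattern of treating online algorithms as experts:

```python
import math
import random

def combine_online_algorithms(costs, movement_cost=1.0, eta=0.1):
    """Hedged sketch: treat candidate online algorithms as experts.

    costs[t][i] is the cost algorithm i incurs at step t (assumed in [0, 1]).
    The combiner pays movement_cost whenever the followed algorithm changes.
    This is plain multiplicative weights, not the movement-cost-aware
    extension developed in the paper.
    """
    n = len(costs[0])
    weights = [1.0] * n
    current, total = None, 0.0
    for step_costs in costs:
        Z = sum(weights)
        choice = random.choices(range(n), weights=[w / Z for w in weights])[0]
        if current is not None and choice != current:
            total += movement_cost   # pay for switching between algorithms
        current = choice
        total += step_costs[choice]
        weights = [w * math.exp(-eta * c) for w, c in zip(weights, step_costs)]
    return total
```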
Multi-Armed Bandits with Metric Movement Costs
We consider the non-stochastic Multi-Armed Bandit problem in a setting where
there is a fixed and known metric on the action space that determines a cost
for switching between any pair of actions. The loss of the online learner has
two components: the first is the usual loss of the selected actions, and the
second is an additional loss due to switching between actions. Our main
contribution gives a tight characterization of the expected minimax regret in
this setting, in terms of a complexity measure $\mathcal{C}$ of the underlying
metric which depends on its covering numbers. In finite metric spaces with $k$
actions, we give an efficient algorithm that achieves regret of the form
$\widetilde{O}(\max\{\mathcal{C}^{1/3} T^{2/3}, \sqrt{kT}\})$, and show that this
is the best possible. Our regret bound generalizes previously known regret bounds
for some special cases: (i) the unit-switching cost regret
$\widetilde{\Theta}(\max\{k^{1/3} T^{2/3}, \sqrt{kT}\})$, where
$\mathcal{C} = \Theta(k)$, and (ii) the interval metric with regret
$\widetilde{\Theta}(\max\{T^{2/3}, \sqrt{kT}\})$, where $\mathcal{C} = \Theta(1)$.
For infinite metric spaces with Lipschitz loss functions, we derive a tight
regret bound of $\widetilde{\Theta}(T^{\frac{d+1}{d+2}})$, where $d \ge 1$ is
the Minkowski dimension of the space; this bound is known to be tight even when
there are no switching costs.
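Under the reconstruction above, the two special cases follow by substituting the stated values of $\mathcal{C}$ into the general bound:

```latex
\widetilde{O}\!\left(\max\{\mathcal{C}^{1/3} T^{2/3},\, \sqrt{kT}\}\right)
\;=\;
\begin{cases}
\widetilde{O}\!\left(\max\{k^{1/3} T^{2/3},\, \sqrt{kT}\}\right), & \mathcal{C} = \Theta(k) \ \text{(unit switching costs)},\\[3pt]
\widetilde{O}\!\left(\max\{T^{2/3},\, \sqrt{kT}\}\right), & \mathcal{C} = \Theta(1) \ \text{(interval metric)}.
\end{cases}
```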
Online Convex Optimization Against Adversaries with Memory and Application to Statistical Arbitrage
The framework of online learning with memory naturally captures learning
problems with temporal constraints, and was previously studied for the experts
setting. In this work we extend the notion of learning with memory to the
general Online Convex Optimization (OCO) framework, and present two algorithms
that attain low regret. The first algorithm applies to Lipschitz continuous
loss functions, obtaining optimal regret bounds for both convex and strongly
convex losses. The second algorithm attains the optimal regret bounds and
applies more broadly to convex losses without requiring Lipschitz continuity,
yet is more complicated to implement. We complement our theoretical results with
an application to statistical arbitrage in finance: we devise algorithms for
constructing mean-reverting portfolios.
Comment: 22 pages, 2 figures
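A minimal sketch of the standard reduction used in this line of work, assuming a gradient oracle grad_unary for the unary loss x ↦ f_t(x, ..., x) (names are illustrative): run online gradient descent on the unary losses; because consecutive iterates move only O(η) apart, Lipschitzness controls the extra memory term.

```python
import numpy as np

def ogd_with_memory(grad_unary, project, T, dim, eta=0.1):
    """Hedged sketch of OCO with memory via the stationary-loss reduction.

    grad_unary(t, x): assumed oracle for the gradient of x -> f_t(x, ..., x).
    The true loss at time t may depend on the last m plays; the memory length
    only enters the regret analysis, since consecutive OGD iterates already
    move just O(eta) apart.
    """
    x = project(np.zeros(dim))
    plays = []
    for t in range(T):
        plays.append(x.copy())
        g = grad_unary(t, x)       # revealed after playing x
        x = project(x - eta * g)   # gradient step on the unary loss
    return plays

# Toy usage (assumed objective): track moving targets under a simplex
# constraint, as in a long-only portfolio sketch.
def project_simplex(v):
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

rng = np.random.default_rng(0)
targets = rng.random((100, 3))
plays = ogd_with_memory(lambda t, x: 2 * (x - targets[t]),
                        project_simplex, T=100, dim=3)
```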
Bandits with Switching Costs: T^{2/3} Regret
We study the adversarial multi-armed bandit problem in a setting where the
player incurs a unit cost each time he switches actions. We prove that the
player's $T$-round minimax regret in this setting is
$\widetilde{\Theta}(T^{2/3})$, thereby closing a fundamental gap in our
understanding of learning with bandit feedback. In the corresponding
full-information version of the problem, the minimax regret is known to grow at
a much slower rate of $\Theta(\sqrt{T})$. The difference between these two
rates provides the \emph{first} indication that learning with bandit feedback
can be significantly harder than learning with full-information feedback
(previous results only showed a different dependence on the number of actions,
but not on $T$).
In addition to characterizing the inherent difficulty of the multi-armed
bandit problem with switching costs, our results also resolve several other
open problems in online learning. One direct implication is that learning with
bandit feedback against bounded-memory adaptive adversaries has a minimax
regret of $\widetilde{\Theta}(T^{2/3})$. Another implication is that the
minimax regret of online learning in adversarial Markov decision processes
(MDPs) is $\widetilde{\Theta}(T^{2/3})$. The key to all of our results is a new
randomized construction of a multi-scale random walk, which is of independent
interest and likely to prove useful in additional settings.
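A hedged sketch of this style of construction (parameters are illustrative, not the paper's exact choices): each round's walk value hangs off the round obtained by clearing the lowest set bit of t, so noise enters at multiple time scales and the walk has depth O(log T).

```python
import numpy as np

def multiscale_walk_losses(T, k, best_arm=0, eps=None, sigma=0.1, seed=0):
    """Hedged sketch of a multi-scale random walk loss sequence.

    W[t] = W[parent(t)] + Gaussian noise, where parent(t) clears the lowest
    set bit of t, so the walk has depth O(log T). All arms share W[t]; the
    designated best arm gets an eps advantage. Parameters are illustrative.
    """
    rng = np.random.default_rng(seed)
    eps = eps if eps is not None else 0.1 * T ** (-1 / 3)
    W = np.zeros(T + 1)
    for t in range(1, T + 1):
        parent = t - (t & -t)               # clear the lowest set bit of t
        W[t] = W[parent] + sigma * rng.standard_normal()
    losses = np.tile(W[1:, None], (1, k))   # shared walk component
    losses[:, best_arm] -= eps              # small advantage for the best arm
    return np.clip(losses + 0.5, 0.0, 1.0)  # keep losses in [0, 1]
```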
Online Learning with Composite Loss Functions
We study a new class of online learning problems where each of the online
algorithm's actions is assigned an adversarial value, and the loss of the
algorithm at each step is a known and deterministic function of the values
assigned to its recent actions. This class includes problems where the
algorithm's loss is the minimum over the recent adversarial values, the maximum
over the recent values, or a linear combination of the recent values. We
analyze the minimax regret of this class of problems when the algorithm
receives bandit feedback, and prove that when the minimum or maximum functions
are used, the minimax regret is $\widetilde{\Theta}(T^{2/3})$ (so-called hard
online learning problems), and when a linear function is used, the minimax
regret is $\widetilde{\Theta}(\sqrt{T})$ (so-called easy learning problems).
Previously, the only online learning problem that was known to be provably hard
was the multi-armed bandit with switching costs.
Online learning over a finite action set with limited switching
This paper studies the value of switching actions in the Prediction From
Experts (PFE) problem and Adversarial Multi-Armed Bandits (MAB) problem. First,
we revisit the well-studied and practically motivated setting of PFE with
switching costs. Many algorithms are known to achieve the minimax optimal order
of $O(\sqrt{T \log n})$ in expectation for both regret and number of switches,
where $T$ is the number of iterations and $n$ the number of actions. However,
no high probability (h.p.) guarantees are known. Our main technical
contribution is the first algorithms which, with h.p., achieve this optimal order
for both regret and switches. This settles an open problem of [Devroye et al.,
2015], and directly implies the first h.p. guarantees for several problems of
interest.
Next, to investigate the value of switching actions at a more granular level,
we introduce the setting of switching budgets, in which algorithms are limited
to $S$ switches between actions. This entails a limited number of free
switches, in contrast to the unlimited number of expensive switches in the
switching cost setting. Using the above result and several reductions, we unify
previous work and completely characterize the complexity of this switching
budget setting up to small polylogarithmic factors: for both PFE and MAB, for
all switching budgets $S$, and for both expectation and h.p. guarantees.
For PFE, we show the optimal rate is $\widetilde{\Theta}(\sqrt{T \log n})$ for
$S = \Omega(\sqrt{T \log n})$, and $\min(\widetilde{\Theta}(\tfrac{T \log n}{S}), T)$
for $S = O(\sqrt{T \log n})$. Interestingly, the bandit setting does not exhibit
such a phase transition; instead we show the minimax rate decays steadily as
$\min(\widetilde{\Theta}(\tfrac{T \sqrt{n}}{\sqrt{S}}), T)$ for all ranges of
$S \le T$. These results recover and generalize the known minimax rates for the
(arbitrary) switching cost setting.
Comment: Extended abstract to appear in the proceedings of the 2018 Conference
on Learning Theory (COLT).
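For intuition on how PFE algorithms keep both regret and switches near $\sqrt{T \log n}$ in expectation, here is a hedged sketch of the classical lazy-resampling ("shrinking dartboard" style) technique; the paper's contribution, high-probability guarantees, requires additional machinery.

```python
import math
import random

def shrinking_dartboard(losses, eta=None):
    """Hedged sketch of a low-switching experts algorithm (in expectation).

    losses[t][i] in [0, 1]. Keep the current expert with probability
    w_new[i] / w_old[i]; otherwise resample from the new weight distribution.
    The per-step marginal matches multiplicative weights, while the expected
    number of switches stays of the same order as the regret.
    """
    T, n = len(losses), len(losses[0])
    eta = eta if eta is not None else math.sqrt(math.log(n) / T)
    w = [1.0] * n
    i = random.randrange(n)
    total, switches = 0.0, 0
    for step in losses:
        total += step[i]
        w_new = [wj * math.exp(-eta * lj) for wj, lj in zip(w, step)]
        if random.random() > w_new[i] / w[i]:   # leave current expert lazily
            Z = sum(w_new)
            j = random.choices(range(n), weights=[x / Z for x in w_new])[0]
            switches += i != j
            i = j
        w = w_new
    return total, switches
```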
Online Caching with Optimal Switching Regret
We consider the classical uncoded caching problem from an online learning
point-of-view. A cache of limited storage capacity can hold $C$ files at a time
from a large catalog. A user requests an arbitrary file from the catalog at
each time slot. Before the file request from the user arrives, a caching policy
populates the cache with any $C$ files of its choice. In the case of a
cache-hit, the policy receives a unit reward and zero rewards otherwise. In
addition to that, there is a cost associated with fetching files to the cache,
which we refer to as the switching cost. The objective is to design a caching
policy that incurs minimal regret while considering both the rewards due to
cache-hits and the switching cost due to the file fetches. The main
contribution of this paper is the switching regret analysis of a Follow the
Perturbed Leader-based anytime caching policy, which is shown to have an order
optimal switching regret. In this pursuit, we improve the best-known switching
regret bound for this problem by a factor of $\Theta(\sqrt{C})$. We conclude
the paper by comparing the performance of different popular caching policies
using a publicly available trace from a commercial CDN server.
Comment: 11 pages, 3 figures, to be submitted to ISIT, 2021
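A hedged sketch of a Follow the Perturbed Leader caching rule of the kind analyzed here (illustrative $\sqrt{t}$ noise scaling; the paper's anytime policy may differ in details): perturb cumulative request counts with Gaussian noise and cache the top $C$ files.

```python
import numpy as np

def ftpl_caching(requests, catalog_size, C, seed=0):
    """Hedged sketch of FTPL for caching: cache the C files (C < catalog_size)
    with the highest Gaussian-perturbed cumulative request counts. The noise
    scale grows like sqrt(t) to stabilize the cache (illustrative choice)."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(catalog_size)
    gamma = rng.standard_normal(catalog_size)  # one perturbation, reused
    hits, fetches, prev_cache = 0, 0, set()
    for t, f in enumerate(requests, start=1):
        scores = counts + np.sqrt(t) * gamma   # perturbed leader scores
        cache = set(np.argpartition(-scores, C)[:C])
        fetches += len(cache - prev_cache)     # switching cost: files fetched
        hits += f in cache
        counts[f] += 1                         # count updated after serving
        prev_cache = cache
    return hits, fetches
```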
Learning to Cache With No Regrets
This paper introduces a novel caching analysis that, contrary to prior work,
makes no modeling assumptions for the file request sequence. We cast the
caching problem in the framework of Online Linear Optimization (OLO), and
introduce a class of minimum regret caching policies, which minimize the losses
with respect to the best static configuration in hindsight when the request
model is unknown. These policies are very important since they are robust to
popularity deviations in the sense that they learn to adjust their caching
decisions when the popularity model changes. We first prove a novel lower bound
for the regret of any caching policy, improving existing OLO bounds for our
setting. Then we show that the Online Gradient Ascent (OGA) policy guarantees a
regret that matches the lower bound, hence it is universally optimal. Finally,
we shift our attention to a network of caches arranged to form a bipartite
graph, and show that the Bipartite Subgradient Algorithm (BSA) has no regret.
Comment: IEEE INFOCOM 2019
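A hedged sketch of the OGA policy in this OLO framing, for a single cache (the capped-simplex projection below is a generic bisection construction, not necessarily the paper's implementation): maintain a fractional cache state, move it toward each requested file, and project back onto the feasible set.

```python
import numpy as np

def project_capped_simplex(y, C):
    """Project y onto {x : 0 <= x <= 1, sum(x) = C} by bisecting on the
    dual variable (a standard construction, sketched here)."""
    lo, hi = y.min() - 1.0, y.max()
    for _ in range(60):                          # bisection on the threshold
        mu = (lo + hi) / 2
        if np.clip(y - mu, 0.0, 1.0).sum() > C:
            lo = mu
        else:
            hi = mu
    return np.clip(y - (lo + hi) / 2, 0.0, 1.0)

def oga_caching(requests, catalog_size, C, eta=0.1):
    """Hedged sketch of Online Gradient Ascent for single-cache OLO:
    the reward at time t is y[f_t] (fractional hit), the gradient is the
    one-hot request vector, so the ascent step just bumps the requested file."""
    y = np.full(catalog_size, C / catalog_size)  # uniform fractional cache
    reward = 0.0
    for f in requests:
        reward += y[f]                           # fractional cache-hit reward
        y[f] += eta                              # gradient ascent on <y, e_f>
        y = project_capped_simplex(y, C)
    return reward
```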
Online learning with feedback graphs and switching costs
We study online learning when partial feedback information is provided
following every action of the learning process, and the learner incurs
switching costs for changing his actions. In this setting, the feedback
information system can be represented by a graph, and previous works studied
the expected regret of the learner in the case of a clique (Expert setup), or
disconnected single loops (Multi-Armed Bandits (MAB)). This work provides a
lower bound on the expected regret in the Partial Information (PI) setting,
namely for general feedback graphs, excluding the clique. Additionally, it
shows that all algorithms that are optimal without switching costs are
necessarily sub-optimal in the presence of switching costs, which motivates the
need to design new algorithms. We propose two new algorithms: Threshold Based
EXP3 and EXP3.SC. For the two special cases of the symmetric PI setting and MAB,
the expected regret of both of these algorithms is order optimal in the
duration of the learning process. Additionally, Threshold Based EXP3 is order
optimal in the switching cost, whereas EXP3.SC is not. Finally, empirical
evaluations show that Threshold Based EXP3 outperforms the previously proposed
order-optimal algorithms EXP3-SET in the presence of switching costs, and Batch
EXP3 in the MAB setting with switching costs.
Comment: Published in Proceedings of the 22nd International Conference on
Artificial Intelligence and Statistics (AISTATS) 2019. PMLR: Volume 89
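As a reference point for the Batch EXP3 baseline mentioned above, here is a hedged sketch of the classical mini-batching idea (run EXP3 over blocks and switch at most once per block; the tuning is illustrative):

```python
import math
import random

def batch_exp3(loss_fn, T, k, batch=None, eta=None, seed=0):
    """Hedged sketch of Batch EXP3 for bandits with switching costs.

    loss_fn(t, arm) -> loss in [0, 1] (bandit feedback). The arm is frozen
    within each batch, so switches <= T / batch; batch ~ T^{1/3} recovers the
    O(T^{2/3}) switching-cost rate up to logs (illustrative tuning).
    """
    rng = random.Random(seed)
    batch = batch or max(1, round(T ** (1 / 3)))
    n_batches = -(-T // batch)                   # ceil(T / batch)
    eta = eta or math.sqrt(math.log(k) / (n_batches * k))
    w = [1.0] * k
    total = 0.0
    for b in range(n_batches):
        Z = sum(w)
        p = [wi / Z for wi in w]
        arm = rng.choices(range(k), weights=p)[0]    # one arm per batch
        batch_loss = 0.0
        for t in range(b * batch, min((b + 1) * batch, T)):
            batch_loss += loss_fn(t, arm)
        total += batch_loss
        est = (batch_loss / batch) / p[arm]          # importance-weighted loss
        w[arm] *= math.exp(-eta * est)
    return total
```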