
    Online Markov decision processes under bandit feedback

    Abstract—We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in hindsight in terms of the total reward received. Specifically, in each time step the agent observes the current state and the reward associated with the last transition; however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state-of-the-art result for this setting is an algorithm with an expected regret of $O(T^{2/3} \ln T)$. In this paper, assuming that stationary policies mix uniformly fast, we show that after $T$ time steps, the expected regret of this algorithm (more precisely, a slightly modified version thereof) is $O(\sqrt{T} \ln T)$, giving the first rigorously proven, essentially tight regret bound for the problem.
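
    The regret criterion above can be made concrete with a small numerical sketch. The toy two-state MDP, the uniform-random learner, and all constants below are hypothetical illustrations, not the paper's algorithm (or its modified version); the sketch only shows what competing with the best stationary policy in hindsight means when the transition kernel is known and the reward functions are fixed in advance by an oblivious adversary.

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions, T = 2, 2, 500

    # Known transition kernel P[s, a, s'] (the setting assumes the agent knows it).
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

    # Oblivious adversary: the whole sequence of reward functions r_t(s, a) in [0, 1]
    # is fixed before the interaction starts.
    rewards = rng.random((T, n_states, n_actions))

    def expected_total_reward(policy):
        """Expected cumulative reward of a stationary deterministic policy,
        obtained by propagating the state distribution through the known kernel."""
        dist = np.zeros(n_states)
        dist[0] = 1.0  # start in state 0
        total = 0.0
        for t in range(T):
            total += sum(dist[s] * rewards[t, s, policy[s]] for s in range(n_states))
            dist = sum(dist[s] * P[s, policy[s]] for s in range(n_states))
        return total

    def uniform_learner_reward():
        """Expected cumulative reward of a naive learner that acts uniformly at random."""
        dist = np.zeros(n_states)
        dist[0] = 1.0
        total = 0.0
        for t in range(T):
            total += dist @ rewards[t].mean(axis=1)
            dist = dist @ P.mean(axis=1)  # kernel averaged over the uniform action choice
        return total

    # Best stationary deterministic policy in hindsight (enumeration works only at toy sizes).
    best = max(expected_total_reward(pi)
               for pi in itertools.product(range(n_actions), repeat=n_states))

    regret = best - uniform_learner_reward()
    print(f"regret of the uniform learner after T={T} steps: {regret:.1f}")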

    Stochastic Online Shortest Path Routing: The Value of Feedback

    This paper studies online shortest path routing over multi-hop networks. Link costs or delays are time-varying and modeled by independent and identically distributed random processes, whose parameters are initially unknown. The parameters, and hence the optimal path, can only be estimated by routing packets through the network and observing the realized delays. Our aim is to find a routing policy that minimizes the regret (the cumulative difference of expected delay) between the path chosen by the policy and the unknown optimal path. We formulate the problem as a combinatorial bandit optimization problem and consider several scenarios that differ in where routing decisions are made and in the information available when making the decisions. For each scenario, we derive a tight asymptotic lower bound on the regret that has to be satisfied by any online routing policy. These bounds help us to understand the performance improvements we can expect when (i) taking routing decisions at each hop rather than at the source only, and (ii) observing per-link delays rather than end-to-end path delays. In particular, we show that (i) is of no use while (ii) can have a spectacular impact. Three algorithms, with a trade-off between computational complexity and performance, are proposed. The regret upper bounds of these algorithms improve over those of the existing algorithms, and they significantly outperform state-of-the-art algorithms in numerical experiments.
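
    A minimal sketch of the semi-bandit (per-link) feedback model may help build intuition. The four-node network, the three candidate paths, the exponential delay distributions, and the epsilon-greedy rule below are all hypothetical; this is a generic baseline illustrating how per-link observations feed estimates of each link's mean delay, not one of the three algorithms proposed in the paper.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical network: links with unknown mean delays, three candidate a->c paths.
    true_mean = {"ab": 1.0, "bc": 1.0, "ad": 2.0, "dc": 0.5, "ac": 3.0}
    paths = [["ab", "bc"], ["ad", "dc"], ["ac"]]
    opt_delay = min(sum(true_mean[l] for l in p) for p in paths)

    est_sum = {l: 0.0 for l in true_mean}  # running sums of observed per-link delays
    est_cnt = {l: 0 for l in true_mean}

    T, eps, regret = 5000, 0.1, 0.0
    for t in range(T):
        if rng.random() < eps or min(est_cnt.values()) == 0:
            path = paths[rng.integers(len(paths))]  # explore a random path
        else:
            # Exploit: pick the path with the smallest estimated total delay.
            path = min(paths, key=lambda p: sum(est_sum[l] / est_cnt[l] for l in p))
        # Semi-bandit feedback: the realized delay of every link on the chosen
        # path is observed and used to update that link's estimate.
        for l in path:
            est_sum[l] += rng.exponential(true_mean[l])
            est_cnt[l] += 1
        regret += sum(true_mean[l] for l in path) - opt_delay

    print(f"cumulative expected regret after {T} packets: {regret:.1f}")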

    Learning Adversarial Markov Decision Processes with Delayed Feedback

    Reinforcement learning typically assumes that agents observe feedback for their actions immediately, but in many real-world applications (like recommendation systems) feedback is observed with a delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs and unrestricted delayed feedback. That is, the costs and trajectory of episode $k$ are revealed to the learner only at the end of episode $k + d^k$, where the delays $d^k$ are neither identical nor bounded, and are chosen by an oblivious adversary. We present novel algorithms based on policy optimization that achieve near-optimal high-probability regret of $\sqrt{K + D}$ under full-information feedback, where $K$ is the number of episodes and $D = \sum_k d^k$ is the total delay. Under bandit feedback, we prove a similar $\sqrt{K + D}$ regret bound assuming the costs are stochastic, and $(K + D)^{2/3}$ regret in the general case. We are the first to consider regret minimization in the important setting of MDPs with delayed feedback.
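
    The delayed-feedback protocol can be illustrated with a short sketch. The delay values below are arbitrary, and the buffering learner is a placeholder rather than the paper's policy-optimization algorithms; the point is only that the costs and trajectory of episode $k$ become usable at the end of episode $k + d^k$, and that $D = \sum_k d^k$ enters the bounds quoted above.

    from collections import defaultdict

    K = 10
    delays = [0, 3, 1, 0, 4, 2, 0, 1, 1, 0]  # hypothetical oblivious delays d^k
    D = sum(delays)                          # total delay D = sum_k d^k

    # Map each arrival episode to the episodes whose costs become visible then.
    pending = defaultdict(list)
    for k in range(K):
        pending[k + delays[k]].append(k)

    received = []
    for k in range(K):
        # ... the learner plays episode k using only the feedback received so far ...
        # At the end of episode k, the costs and trajectory of every episode j
        # with j + d^j = k are revealed and can be used to update the policy.
        for j in pending.pop(k, []):
            received.append(j)

    print(f"K = {K}, D = {D}; feedback arrives in order {received}")
    print(f"the full-information bound above scales like sqrt(K + D) = {(K + D) ** 0.5:.2f}")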

    Chasing Ghosts: Competing with Stateful Policies

    We consider sequential decision making in a setting where regret is measured with respect to a set of stateful reference policies, and feedback is limited to observing the rewards of the actions performed (the so-called "bandit" setting). If either the reference policies are stateless rather than stateful, or the feedback includes the rewards of all actions (the so-called "expert" setting), previous work shows that the optimal regret grows like $\Theta(\sqrt{T})$ in terms of the number of decision rounds $T$. The difficulty in our setting is that the decision maker unavoidably loses track of the internal states of the reference policies, and thus cannot reliably attribute rewards observed in a certain round to any of the reference policies. In fact, in this setting it is impossible for the algorithm to estimate which policy gives the highest (or even approximately highest) total reward. Nevertheless, we design an algorithm that achieves expected regret that is sublinear in $T$, of the form $O(T/\log^{1/4} T)$. Our algorithm is based on a certain local repetition lemma that may be of independent interest. We also show that no algorithm can guarantee expected regret better than $O(T/\log^{3/2} T)$.
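
    A toy example of a stateful reference policy may clarify the difficulty described above. The two-state reference policy and the uniform-random learner below are hypothetical and do not reflect the paper's algorithm; they only illustrate that under bandit feedback the learner observes the reward of its own action, while the reference policy's internal state (and hence its reward) stays hidden.

    import numpy as np

    rng = np.random.default_rng(2)
    T, n_actions = 20, 3

    class StatefulReferencePolicy:
        """Hypothetical two-state reference policy: its action depends on an
        internal state that evolves on its own and is invisible to the learner."""
        def __init__(self):
            self.state = 0
        def act(self):
            action = 0 if self.state == 0 else 1
            self.state = 1 - self.state  # internal transition the learner never sees
            return action

    reference = StatefulReferencePolicy()
    rewards = rng.random((T, n_actions))  # oblivious adversary fixes all rewards up front

    learner_total, reference_total = 0.0, 0.0
    for t in range(T):
        learner_action = rng.integers(n_actions)     # placeholder learner (uniform)
        learner_total += rewards[t, learner_action]  # bandit feedback: only this reward is seen
        reference_total += rewards[t, reference.act()]  # never revealed to the learner

    print(f"regret against this single reference policy: {reference_total - learner_total:.2f}")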