
    Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

    We study the problem of learning Markov decision processes with finite state and action spaces when the transition probability distributions and loss functions are chosen adversarially and are allowed to change with time. We introduce an algorithm whose regret with respect to any policy in a comparison class grows as the square root of the number of rounds of the game, provided the transition probabilities satisfy a uniform mixing condition. Our approach is efficient as long as the comparison class is polynomial and we can compute expectations over sample paths for each policy. Designing an efficient algorithm with small regret for the general case remains an open problem.
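
    The comparison-class requirement above has an exponential-weights flavor. Below is a minimal sketch of generic Hedge over a finite policy class, assuming each round we can evaluate (or estimate from sample paths) every policy's expected loss; this is standard online-learning background, not the paper's specific algorithm, and all names are illustrative.

```python
import numpy as np

def hedge_over_policies(policy_losses, eta=0.1):
    """Run Hedge (exponential weights) over a finite comparison class.

    policy_losses: (T, K) array; entry [t, k] is the expected loss of
    policy k at round t (assumed computable, e.g. from sample paths).
    Returns the sequence of weight vectors used at each round.
    """
    T, K = policy_losses.shape
    log_w = np.zeros(K)                     # log-weights, start uniform
    weights_history = []
    for t in range(T):
        w = np.exp(log_w - log_w.max())
        w /= w.sum()                        # distribution over policies
        weights_history.append(w)
        log_w -= eta * policy_losses[t]     # multiplicative update
    return np.array(weights_history)

# Toy usage: 3 policies, 100 adversarial rounds.
rng = np.random.default_rng(0)
losses = rng.uniform(size=(100, 3))
weights = hedge_over_policies(losses, eta=np.sqrt(np.log(3) / 100))
print(weights[-1])                          # concentrates on the low-loss policy
```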

    Large Scale Markov Decision Processes with Changing Rewards

    We consider Markov Decision Processes (MDPs) where the rewards are unknown and may change in an adversarial manner. We provide an algorithm that achieves a state-of-the-art regret bound of $O(\sqrt{\tau(\ln|S|+\ln|A|)T}\ln(T))$, where $S$ is the state space, $A$ is the action space, $\tau$ is the mixing time of the MDP, and $T$ is the number of periods. The algorithm's computational complexity is polynomial in $|S|$ and $|A|$ per period. We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. By approximating the state-action occupancy measures with a linear architecture of dimension $d\ll|S|$, we propose a modified algorithm with computational complexity polynomial in $d$. We also prove a regret bound for this modified algorithm, which, to the best of our knowledge, is the first $\tilde{O}(\sqrt{T})$ regret bound for large-scale MDPs with changing rewards.
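
    The occupancy-measure viewpoint used in this line of work can be illustrated with a single entropy-regularized mirror-descent step on a state-action occupancy vector. The sketch below shows only the multiplicative-update core; real occupancy-measure algorithms also project back onto the polytope of valid occupancy measures (flow constraints), which is omitted here, and all names are mine.

```python
import numpy as np

def md_step_on_occupancy(mu, reward, eta=0.1):
    """One entropy-regularized mirror-descent step on a state-action
    occupancy vector mu (flattened over S x A).

    Illustrative only: the projection onto the flow-constraint polytope
    of valid occupancy measures is replaced by plain renormalization.
    """
    mu_new = mu * np.exp(eta * reward)      # move mass toward high reward
    return mu_new / mu_new.sum()            # renormalize to a distribution

# Toy usage: 4 states, 2 actions, one step against a reward vector.
S, A = 4, 2
mu = np.full(S * A, 1.0 / (S * A))
reward = np.random.default_rng(1).uniform(size=S * A)
mu = md_step_on_occupancy(mu, reward)
print(mu.reshape(S, A))
```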

    Online Linear Quadratic Control

    We study the problem of controlling linear time-invariant systems with known noisy dynamics and adversarially chosen quadratic losses. We present the first efficient online learning algorithms in this setting that guarantee $O(\sqrt{T})$ regret under mild assumptions, where $T$ is the time horizon. Our algorithms rely on a novel SDP relaxation for the steady-state distribution of the system. Crucially, and in contrast to previously proposed relaxations, the feasible solutions of our SDP all correspond to "strongly stable" policies that mix exponentially fast to a steady state.
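
    For context, the standard steady-state covariance SDP for LQR (the kind of program such relaxations build on, not necessarily the paper's exact relaxation) can be written down in a few lines. The sketch below assumes cvxpy with an SDP-capable solver is installed; the decision variable is the joint steady-state covariance of state and control, and a linear policy is read off as K = Sigma_ux Sigma_xx^{-1}.

```python
import numpy as np
import cvxpy as cp

# Toy 2-state, 1-input system with Gaussian noise of covariance sigma2 * I.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)          # state cost
R = np.eye(1)          # control cost
sigma2 = 1.0
n, m = 2, 1

# Joint steady-state covariance of (x, u), required to be PSD.
Sigma = cp.Variable((n + m, n + m), PSD=True)
Sxx = Sigma[:n, :n]
AB = np.hstack([A, B])

cost = cp.trace(Q @ Sxx) + cp.trace(R @ Sigma[n:, n:])
constraints = [Sxx == AB @ Sigma @ AB.T + sigma2 * np.eye(n)]  # stationarity of the state covariance
cp.Problem(cp.Minimize(cost), constraints).solve()

Sigma_val = Sigma.value
K = Sigma_val[n:, :n] @ np.linalg.inv(Sigma_val[:n, :n])       # linear policy u = K x
print("steady-state cost:", cost.value)
print("policy K:", K)
```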

    Markov Decision Processes with Continuous Side Information

    We consider a reinforcement learning (RL) setting in which the agent interacts with a sequence of episodic MDPs. At the start of each episode the agent has access to some side-information or context that determines the dynamics of the MDP for that episode. Our setting is motivated by applications in healthcare where baseline measurements of a patient at the start of a treatment episode form the context that may provide information about how the patient might respond to treatment decisions. We propose algorithms for learning in such Contextual Markov Decision Processes (CMDPs) under an assumption that the unobserved MDP parameters vary smoothly with the observed context. We also give lower and upper PAC bounds under the smoothness assumption. Because our lower bound has an exponential dependence on the dimension, we consider a tractable linear setting where the context is used to create linear combinations of a finite set of MDPs. For the linear setting, we give a PAC learning algorithm based on KWIK learning techniques.
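
    One concrete way to exploit a smoothness assumption like the one above is to pool transition counts from episodes whose contexts are close to the current one. The sketch below is only an illustration of that idea (the function name, kernel, and bandwidth are mine); it is not the paper's algorithm, which comes with PAC guarantees and a KWIK-based variant for the linear setting.

```python
import numpy as np

def smoothed_transition_estimate(contexts, counts, x, bandwidth=0.5):
    """Estimate P(s' | s, a) at context x by kernel-weighting transition
    counts collected at nearby contexts.

    contexts: (N, d) array of observed contexts.
    counts:   (N, S, A, S) array of per-episode transition counts.
    Returns an (S, A, S) estimate, uniform where no data is available.
    """
    dists = np.linalg.norm(contexts - x, axis=1)
    w = np.exp(-(dists / bandwidth) ** 2)           # Gaussian kernel weights
    pooled = np.tensordot(w, counts, axes=1)        # weighted count pooling
    totals = pooled.sum(axis=-1, keepdims=True)
    S = counts.shape[-1]
    return np.where(totals > 0, pooled / np.maximum(totals, 1e-12), 1.0 / S)

# Toy usage: 5 past episodes, 3 states, 2 actions, 1-D context.
rng = np.random.default_rng(2)
contexts = rng.uniform(size=(5, 1))
counts = rng.integers(0, 4, size=(5, 3, 2, 3)).astype(float)
print(smoothed_transition_estimate(contexts, counts, x=np.array([0.3])))
```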

    Provably Efficient Exploration in Policy Optimization

    While policy-based reinforcement learning (RL) achieves tremendous successes in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge such a gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. This paper proves that, in the problem of episodic Markov decision process with linear function approximation, unknown transition, and adversarial reward with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores. Comment: We have fixed a technical issue in the first version of this paper. We remark that the technical assumption of the linear MDP in this version of the paper is different from that in the first version.
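
    The optimistic policy update can be pictured as a softmax (mirror-descent) improvement step against a bonus-inflated Q estimate. The sketch below is generic and tabular, with a simple count-based bonus standing in for the paper's linear-MDP construction; all names and parameters are illustrative, not OPPO itself.

```python
import numpy as np

def optimistic_softmax_update(policy, q_hat, counts, eta=0.5, beta=1.0):
    """One optimistic policy-improvement step.

    policy: (S, A) current stochastic policy.
    q_hat:  (S, A) estimated action values.
    counts: (S, A) visit counts; fewer visits -> larger exploration bonus.
    """
    bonus = beta / np.sqrt(np.maximum(counts, 1.0))   # optimism for rarely tried actions
    logits = np.log(policy) + eta * (q_hat + bonus)   # multiplicative-weights step
    logits -= logits.max(axis=1, keepdims=True)
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum(axis=1, keepdims=True)

# Toy usage: 3 states, 2 actions.
S, A = 3, 2
policy = np.full((S, A), 0.5)
q_hat = np.array([[1.0, 0.0], [0.2, 0.8], [0.5, 0.5]])
counts = np.array([[10, 1], [5, 5], [0, 20]])
print(optimistic_softmax_update(policy, q_hat, counts))
```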

    Online Reinforcement Learning in Stochastic Games

    We study online reinforcement learning in average-reward stochastic games (SGs). An SG models a two-player zero-sum game in a Markov environment, where state transitions and one-step payoffs are determined simultaneously by a learner and an adversary. We propose the UCSG algorithm that achieves a sublinear regret compared to the game value when competing with an arbitrary opponent. This result improves previous ones under the same setting. The regret bound has a dependency on the diameter, which is an intrinsic value related to the mixing property of SGs. If we let the opponent play an optimistic best response to the learner, UCSG finds an $\varepsilon$-maximin stationary policy with a sample complexity of $\tilde{\mathcal{O}}\left(\text{poly}(1/\varepsilon)\right)$, where $\varepsilon$ is the gap to the best policy.
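
    The per-state subproblem in a zero-sum stochastic game is a zero-sum matrix game, solvable by a small linear program. The sketch below is standard background (a maximin solver assuming scipy), not any part of the UCSG algorithm itself.

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(G):
    """Maximin value and optimal mixed strategy for the row player
    of a zero-sum matrix game with payoff matrix G (row maximizes).
    """
    m, n = G.shape
    # Variables: row strategy x (m entries) and game value v (last entry).
    c = np.zeros(m + 1)
    c[-1] = -1.0                                # maximize v  <=>  minimize -v
    A_ub = np.hstack([-G.T, np.ones((n, 1))])   # v <= x^T G e_j for every column j
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])                      # x is a probability vector
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:m]

# Matching pennies: value 0, uniform strategy.
value, strategy = matrix_game_value(np.array([[1.0, -1.0], [-1.0, 1.0]]))
print(value, strategy)
```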

    Variational Regret Bounds for Reinforcement Learning

    We consider undiscounted reinforcement learning in Markov decision processes (MDPs) where both the reward functions and the state-transition probabilities may vary (gradually or abruptly) over time. For this problem setting, we propose an algorithm and provide performance guarantees for the regret evaluated against the optimal non-stationary policy. The upper bound on the regret is given in terms of the total variation in the MDP. This is the first variational regret bound for the general reinforcement learning setting. Comment: Presented at UAI 2019.

    The Online Coupon-Collector Problem and Its Application to Lifelong Reinforcement Learning

    Transferring knowledge across a sequence of related tasks is an important challenge in reinforcement learning (RL). Despite much encouraging empirical evidence, there has been little theoretical analysis. In this paper, we study a class of lifelong RL problems: the agent solves a sequence of tasks modeled as finite Markov decision processes (MDPs), each of which is from a finite set of MDPs with the same state/action sets and different transition/reward functions. Motivated by the need for cross-task exploration in lifelong learning, we formulate a novel online coupon-collector problem and give an optimal algorithm. This allows us to develop a new lifelong RL algorithm, whose overall sample complexity in a sequence of tasks is much smaller than single-task learning, even if the sequence of tasks is generated by an adversary. Benefits of the algorithm are demonstrated in simulated problems, including a recently introduced human-robot interaction problem. Comment: 13 pages.
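
    Purely as background for the online variant the paper introduces: the classical (offline) coupon-collector problem has expected completion time n * H_n, where H_n is the n-th harmonic number. A quick numerical check (illustrative only, not the paper's online formulation):

```python
import numpy as np

def expected_coupon_time(n):
    """Expected number of draws to see all n coupon types: n * H_n."""
    return n * sum(1.0 / k for k in range(1, n + 1))

def simulate_coupon_time(n, rng):
    """Draw coupon types uniformly until every type has appeared once."""
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(rng.integers(n))
        draws += 1
    return draws

rng = np.random.default_rng(3)
n = 10
sims = [simulate_coupon_time(n, rng) for _ in range(2000)]
print(expected_coupon_time(n), np.mean(sims))   # both around 29.3
```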

    Non-Stationary Markov Decision Processes, a Worst-Case Approach using Model-Based Reinforcement Learning, Extended version

    This work tackles the problem of robust zero-shot planning in non-stationary stochastic environments. We study Markov Decision Processes (MDPs) evolving over time and consider Model-Based Reinforcement Learning algorithms in this setting. We make two hypotheses: 1) the environment evolves continuously with a bounded evolution rate; 2) a current model is known at each decision epoch but not its evolution. Our contribution can be presented in four points. 1) We define a specific class of MDPs that we call Non-Stationary MDPs (NSMDPs). We introduce the notion of regular evolution by making a hypothesis of Lipschitz continuity on the transition and reward functions w.r.t. time; 2) we consider a planning agent using the current model of the environment but unaware of its future evolution. This leads us to consider a worst-case method where the environment is seen as an adversarial agent; 3) following this approach, we propose the Risk-Averse Tree-Search (RATS) algorithm, a zero-shot Model-Based method similar to minimax search; 4) we illustrate the benefits brought by RATS empirically and compare its performance with reference Model-Based algorithms. Comment: Published at NeurIPS 2019, 17 pages, 3 figures.
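
    A compact sketch of the worst-case idea behind such a method: a depth-limited minimax search where the "adversary" picks, at each node, the worst transition model from a small candidate set (standing in for the uncertainty around the current model). This is an illustration of the minimax principle under made-up names, not the published RATS implementation.

```python
import numpy as np

def worst_case_value(s, depth, actions, models, reward, gamma=0.95):
    """Depth-limited minimax value of state s.

    The agent maximizes over actions; an adversary then picks the
    transition model (from `models`, a list of (S, A, S) matrices)
    that minimizes the agent's value, mimicking a worst-case MDP
    inside an uncertainty set.
    """
    if depth == 0:
        return 0.0
    best = -np.inf
    for a in actions:
        worst = np.inf
        for P in models:                       # adversary's choice of dynamics
            v = sum(P[s, a, s2] * (reward[s, a] +
                    gamma * worst_case_value(s2, depth - 1, actions, models, reward, gamma))
                    for s2 in range(P.shape[2]))
            worst = min(worst, v)
        best = max(best, worst)
    return best

# Toy usage: 2 states, 2 actions, two candidate transition models.
rng = np.random.default_rng(4)
models = []
for _ in range(2):
    P = rng.uniform(size=(2, 2, 2))
    models.append(P / P.sum(axis=2, keepdims=True))
reward = np.array([[1.0, 0.0], [0.0, 1.0]])
print(worst_case_value(0, depth=3, actions=[0, 1], models=models, reward=reward))
```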

    Online Markov decision processes with policy iteration

    The online Markov decision process (MDP) is a generalization of the classical Markov decision process that incorporates changing reward functions. In this paper, we propose practical online MDP algorithms with policy iteration and theoretically establish a sublinear regret bound. A notable advantage of the proposed algorithm is that it can be easily combined with function approximation, and thus large and possibly continuous state spaces can be efficiently handled. Through experiments, we demonstrate the usefulness of the proposed algorithm.
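
    For reference, the policy-iteration primitive such algorithms build on, in its plain tabular form. This is a minimal sketch with illustrative names; the paper's version wraps this inside an online update and supports function approximation.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Tabular policy iteration.

    P: (S, A, S) transition probabilities, R: (S, A) rewards.
    Returns a deterministic policy (array of actions) and its value.
    """
    S, A, _ = P.shape
    policy = np.zeros(S, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[np.arange(S), policy]          # (S, S)
        R_pi = R[np.arange(S), policy]          # (S,)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Policy improvement: greedy w.r.t. one-step lookahead.
        Q = R + gamma * P @ V                   # (S, A)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

# Toy 3-state, 2-action MDP.
rng = np.random.default_rng(5)
P = rng.uniform(size=(3, 2, 3)); P /= P.sum(axis=2, keepdims=True)
R = rng.uniform(size=(3, 2))
print(policy_iteration(P, R))
```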