Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
We study the problem of learning Markov decision processes with finite state
and action spaces when the transition probability distributions and loss
functions are chosen adversarially and are allowed to change with time. We
introduce an algorithm whose regret with respect to any policy in a comparison
class grows as the square root of the number of rounds of the game, provided
the transition probabilities satisfy a uniform mixing condition. Our approach
is efficient as long as the comparison class is of polynomial size and we can compute
expectations over sample paths for each policy. Designing an efficient
algorithm with small regret for the general case remains an open problem.
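A minimal sketch of the expert-style scheme such a guarantee typically rests on, assuming a finite comparison class whose per-round expected losses are computable (here simulated with random numbers); the learning rate and loss matrix are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
T, K = 1000, 5                       # rounds, size of the policy comparison class
losses = rng.uniform(size=(T, K))    # stand-in for per-round expected losses per policy

eta = np.sqrt(8 * np.log(K) / T)     # standard Hedge learning rate
log_w = np.zeros(K)                  # log-weights, kept in log space for stability
total = 0.0

for t in range(T):
    p = np.exp(log_w - log_w.max())
    p /= p.sum()                     # current distribution over policies
    total += p @ losses[t]           # expected loss of the randomized policy choice
    log_w -= eta * losses[t]         # multiplicative-weights update (full information)

best = losses.sum(axis=0).min()      # loss of the best single policy in hindsight
print("regret:", total - best)       # grows as O(sqrt(T log K))
```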
Large Scale Markov Decision Processes with Changing Rewards
We consider Markov Decision Processes (MDPs) where the rewards are unknown
and may change in an adversarial manner. We provide an algorithm that achieves
a state-of-the-art regret bound of $\tilde{O}(\sqrt{\tau(\ln|S|+\ln|A|)T}\ln(T))$,
where $S$ is the state space, $A$ is the action space, $\tau$ is the mixing
time of the MDP, and $T$ is the number of periods. The algorithm's
computational complexity is polynomial in $|S|$ and $|A|$ per period. We then
consider a setting often encountered in practice, where the state space of the
MDP is too large to allow for exact solutions. By approximating the
state-action occupancy measures with a linear architecture of dimension
$d \ll |S||A|$, we propose a modified algorithm with computational complexity
polynomial in $d$. We also prove a regret bound for this modified algorithm,
which, to the best of our knowledge, is the first $\tilde{O}(\sqrt{T})$ regret
bound for large-scale MDPs with changing rewards.
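A minimal sketch of online mirror descent over state-action occupancy measures, the object the abstract approximates linearly; the projection onto the occupancy polytope is simplified here to plain simplex normalization, so this illustrates the update, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, T = 4, 3, 500
mu = np.full((S, A), 1.0 / (S * A))     # occupancy measure over state-action pairs
eta = np.sqrt(np.log(S * A) / T)        # mirror-descent step size

for t in range(T):
    r = rng.uniform(size=(S, A))        # changing reward, revealed after the round
    mu *= np.exp(eta * r)               # entropic ascent on the revealed reward
    mu /= mu.sum()                      # simplex normalization; the full algorithm
                                        # instead projects onto the occupancy polytope
                                        # {mu : sum_a mu(s',a) = sum_{s,a} mu(s,a) P(s'|s,a)}
```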
Online Linear Quadratic Control
We study the problem of controlling linear time-invariant systems with known
noisy dynamics and adversarially chosen quadratic losses. We present the first
efficient online learning algorithms in this setting that guarantee
$O(\sqrt{T})$ regret under mild assumptions, where $T$ is the time horizon. Our
algorithms rely on a novel SDP relaxation for the steady-state distribution of
the system. Crucially, and in contrast to previously proposed relaxations, the
feasible solutions of our SDP all correspond to "strongly stable" policies that
mix exponentially fast to a steady state.
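A minimal sketch of a steady-state SDP of this kind in cvxpy, following the standard formulation in which the decision variable is the joint state-input second-moment matrix; the paper's additional "strong stability" constraints are omitted, and all system matrices are toy values:

```python
import numpy as np
import cvxpy as cp

n, m = 2, 1                                      # state and input dimensions (toy)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R, W = np.eye(n), np.eye(m), 0.01 * np.eye(n)  # costs and noise covariance

Sigma = cp.Variable((n + m, n + m), PSD=True)     # joint steady-state second moment
Sxx, Sux, Suu = Sigma[:n, :n], Sigma[n:, :n], Sigma[n:, n:]
AB = np.hstack([A, B])

constraints = [Sxx == AB @ Sigma @ AB.T + W]      # stationarity of the state covariance
objective = cp.Minimize(cp.trace(Q @ Sxx) + cp.trace(R @ Suu))
cp.Problem(objective, constraints).solve()

K = Sux.value @ np.linalg.inv(Sxx.value)          # recover a linear policy u = K x
print(K)
```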
Markov Decision Processes with Continuous Side Information
We consider a reinforcement learning (RL) setting in which the agent
interacts with a sequence of episodic MDPs. At the start of each episode the
agent has access to some side-information or context that determines the
dynamics of the MDP for that episode. Our setting is motivated by applications
in healthcare where baseline measurements of a patient at the start of a
treatment episode form the context that may provide information about how the
patient might respond to treatment decisions. We propose algorithms for
learning in such Contextual Markov Decision Processes (CMDPs) under an
assumption that the unobserved MDP parameters vary smoothly with the observed
context. We also give lower and upper PAC bounds under the smoothness
assumption. Because our lower bound has an exponential dependence on the
dimension, we consider a tractable linear setting where the context is used to
create linear combinations of a finite set of MDPs. For the linear setting, we
give a PAC learning algorithm based on KWIK learning techniques.
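A minimal sketch of the linear setting, assuming (hypothetically) that the context weights are given directly; each episode's MDP is a convex combination of a finite set of base MDPs:

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, d = 5, 2, 3                            # states, actions, number of base MDPs

# Finite set of base MDPs: transition kernels and rewards.
P_base = rng.dirichlet(np.ones(S), size=(d, S, A))   # P_base[i, s, a] -> dist over s'
R_base = rng.uniform(size=(d, S, A))

def contextual_mdp(context_weights):
    """Hypothetical linear CMDP: the episode's MDP is a convex combination
    of the base MDPs, with weights derived from the observed context."""
    w = np.asarray(context_weights)
    w = w / w.sum()
    P = np.tensordot(w, P_base, axes=1)      # P(s'|s,a) = sum_i w_i P_i(s'|s,a)
    R = np.tensordot(w, R_base, axes=1)
    return P, R

P, R = contextual_mdp([0.2, 0.5, 0.3])
```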
Provably Efficient Exploration in Policy Optimization
While policy-based reinforcement learning (RL) achieves tremendous successes
in practice, it is significantly less understood in theory, especially compared
with value-based RL. In particular, it remains elusive how to design a provably
efficient policy optimization algorithm that incorporates exploration. To
bridge such a gap, this paper proposes an Optimistic variant of the Proximal
Policy Optimization algorithm (OPPO), which follows an "optimistic version"
of the policy gradient direction. This paper proves that, in the problem of
episodic Markov decision process with linear function approximation, unknown
transition, and adversarial reward with full-information feedback, OPPO
achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret. Here $d$ is the feature
dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To
the best of our knowledge, OPPO is the first provably efficient policy
optimization algorithm that explores.
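A minimal sketch of an optimism-plus-mirror-descent policy update in the spirit described; the 1/sqrt(n) bonus and all constants are generic stand-ins, not the paper's linear-MDP bonus:

```python
import numpy as np

S, A = 4, 3
eta, beta = 0.1, 0.5                    # step size and bonus scale (assumed values)

def optimistic_policy_step(pi, Q_hat, counts):
    """Add an exploration bonus to the Q estimate, then take a KL-regularized
    (exponentiated-gradient) policy improvement step per state."""
    bonus = beta / np.sqrt(np.maximum(counts, 1))       # optimism on uncertain pairs
    logits = np.log(pi) + eta * (Q_hat + bonus)         # optimistic mirror-descent step
    pi_new = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return pi_new / pi_new.sum(axis=-1, keepdims=True)  # normalize per state

rng = np.random.default_rng(5)
pi = np.full((S, A), 1.0 / A)
pi = optimistic_policy_step(pi, rng.uniform(size=(S, A)),
                            rng.integers(1, 50, size=(S, A)))
```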
Online Reinforcement Learning in Stochastic Games
We study online reinforcement learning in average-reward stochastic games
(SGs). An SG models a two-player zero-sum game in a Markov environment, where
state transitions and one-step payoffs are determined simultaneously by a
learner and an adversary. We propose the UCSG algorithm that achieves a
sublinear regret compared to the game value when competing with an arbitrary
opponent. This result improves previous ones under the same setting. The regret
bound has a dependency on the diameter, which is an intrinsic value related to
the mixing property of SGs. If we let the opponent play an optimistic best
response to the learner, UCSG finds an $\epsilon$-maximin stationary policy
with a sample complexity of $\tilde{\mathcal{O}}(\mathrm{poly}(1/\epsilon))$,
where $\epsilon$ is the gap to the best policy.
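A minimal sketch of the maxmin (Shapley) backup that value computation in zero-sum stochastic games reduces to, using an LP for the matrix-game value; UCSG's confidence sets and optimism are omitted:

```python
import numpy as np
from scipy.optimize import linprog

def matrix_game_value(M):
    """Value of the zero-sum matrix game max_x min_y x^T M y (row player maximizes).
    Variables are (v, x); maximize v subject to (M^T x)_j >= v and x in the simplex."""
    n_rows, n_cols = M.shape
    c = np.zeros(n_rows + 1); c[0] = -1.0           # linprog minimizes, so minimize -v
    A_ub = np.hstack([np.ones((n_cols, 1)), -M.T])  # v - (M^T x)_j <= 0 for each column j
    b_ub = np.zeros(n_cols)
    A_eq = np.hstack([[[0.0]], np.ones((1, n_rows))])
    b_eq = [1.0]
    bounds = [(None, None)] + [(0, None)] * n_rows
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return -res.fun

# Inside value iteration for a stochastic game, each state's backup is
# V(s) <- matrix_game_value([ r(s,a,b) + sum_s' P(s'|s,a,b) V(s') ]_{a,b}).
M = np.array([[0.0, 1.0], [1.0, 0.0]])   # matching-pennies-like stage game
print(matrix_game_value(M))              # 0.5
```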
Variational Regret Bounds for Reinforcement Learning
We consider undiscounted reinforcement learning in Markov decision processes
(MDPs) where both the reward functions and the state-transition probabilities
may vary (gradually or abruptly) over time. For this problem setting, we
propose an algorithm and provide performance guarantees for the regret
evaluated against the optimal non-stationary policy. The upper bound on the
regret is given in terms of the total variation in the MDP. This is the first
variational regret bound for the general reinforcement learning setting.
The Online Coupon-Collector Problem and Its Application to Lifelong Reinforcement Learning
Transferring knowledge across a sequence of related tasks is an important
challenge in reinforcement learning (RL). Despite much encouraging empirical
evidence, there has been little theoretical analysis. In this paper, we study a
class of lifelong RL problems: the agent solves a sequence of tasks modeled as
finite Markov decision processes (MDPs), each of which is from a finite set of
MDPs with the same state/action sets and different transition/reward functions.
Motivated by the need for cross-task exploration in lifelong learning, we
formulate a novel online coupon-collector problem and give an optimal
algorithm. This allows us to develop a new lifelong RL algorithm, whose overall
sample complexity in a sequence of tasks is much smaller than single-task
learning, even if the sequence of tasks is generated by an adversary. Benefits
of the algorithm are demonstrated in simulated problems, including a recently
introduced human-robot interaction problem.
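A minimal toy simulation of the coupon-collector effect the abstract exploits: once every MDP in the finite set has been met, later tasks reuse earlier-learned models, so exploration cost scales with the number of distinct MDPs rather than the number of tasks (the uniform task draw is illustrative; the paper also handles adversarial sequences):

```python
import numpy as np

rng = np.random.default_rng(3)
n_models = 6                     # size of the finite set of candidate MDPs
seen = set()
rounds = 0
exploration_cost = 0

while len(seen) < n_models:
    task = rng.integers(n_models)        # an adversary could pick this instead
    rounds += 1
    if task not in seen:
        seen.add(task)
        exploration_cost += 1            # cross-task exploration on a new MDP
    # else: transfer — reuse the model learned earlier at no extra cost

print(rounds, exploration_cost)          # cost stops growing once all MDPs are seen
```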
Non-Stationary Markov Decision Processes, a Worst-Case Approach using Model-Based Reinforcement Learning, Extended version
This work tackles the problem of robust zero-shot planning in non-stationary
stochastic environments. We study Markov Decision Processes (MDPs) evolving
over time and consider Model-Based Reinforcement Learning algorithms in this
setting. We make two hypotheses: 1) the environment evolves continuously with a
bounded evolution rate; 2) a current model is known at each decision epoch but
not its evolution. Our contribution can be presented in four points. 1) we
define a specific class of MDPs that we call Non-Stationary MDPs (NSMDPs). We
introduce the notion of regular evolution by making a hypothesis of Lipschitz
continuity of the transition and reward functions w.r.t. time; 2) we
consider a planning agent using the current model of the environment but
unaware of its future evolution. This leads us to consider a worst-case method
where the environment is seen as an adversarial agent; 3) following this
approach, we propose the Risk-Averse Tree-Search (RATS) algorithm, a zero-shot
Model-Based method similar to Minimax search; 4) we illustrate the benefits
brought by RATS empirically and compare its performance with reference
Model-Based algorithms.
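A minimal toy sketch of the minimax recursion behind a RATS-style search, on a deterministic two-state NSMDP whose rewards may drift by at most L per time step; in this toy the worst case decouples into a simple penalty, whereas the actual algorithm searches over adversarial model evolutions:

```python
def rats_minimax(state, steps_ahead, depth, r0, L):
    """Toy risk-averse tree search: r0[s][a] is the reward snapshot known now;
    at a horizon of steps_ahead it may have drifted by at most L per step, and
    nature adversarially realizes the worst admissible drift."""
    if depth == 0:
        return 0.0
    values = []
    for a in (0, 1):
        worst_r = r0[state][a] - L * steps_ahead   # worst reward within the envelope
        nxt = a                                    # toy dynamics: action a moves to state a
        values.append(worst_r + rats_minimax(nxt, steps_ahead + 1, depth - 1, r0, L))
    return max(values)                             # agent maximizes against worst-case nature

r0 = [[1.0, 0.0], [0.2, 0.8]]
print(rats_minimax(0, 1, 3, r0, L=0.05))
```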
Online Markov decision processes with policy iteration
The online Markov decision process (MDP) is a generalization of the classical
Markov decision process that incorporates changing reward functions. In this
paper, we propose practical online MDP algorithms with policy iteration and
theoretically establish a sublinear regret bound. A notable advantage of the
proposed algorithm is that it can be easily combined with function
approximation, and thus large and possibly continuous state spaces can be
efficiently handled. Through experiments, we demonstrate the usefulness of the
proposed algorithm.
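A minimal sketch of an online policy-iteration loop of the general kind described: evaluate the current policy under the newly revealed reward, then take a soft improvement step; this uses a generic discounted evaluation on a tabular MDP, not the paper's exact scheme:

```python
import numpy as np

rng = np.random.default_rng(4)
S, A, T, gamma, eta = 4, 2, 200, 0.9, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))   # fixed transitions P[s, a] -> dist over s'
pi = np.full((S, A), 1.0 / A)
pref = np.zeros((S, A))                      # accumulated action preferences

for t in range(T):
    r = rng.uniform(size=(S, A))             # changing reward, revealed each round
    # Policy evaluation: solve (I - gamma * P_pi) V = r_pi for the current policy.
    P_pi = np.einsum('sa,sap->sp', pi, P)
    r_pi = (pi * r).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = r + gamma * (P @ V)                  # Q[s, a] under the current reward
    # Soft policy improvement: exponential weights per state.
    pref += eta * Q
    pi = np.exp(pref - pref.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
```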