Monte Carlo Matrix Inversion Policy Evaluation
Forsythe and Leibler (1950) introduced a statistical technique for
finding the inverse of a matrix by characterizing the elements of the matrix
inverse as expected values of a sequence of random walks. Barto and Duff (1994)
subsequently showed relations between this technique and standard dynamic
programming and temporal differencing methods. The advantage of the Monte Carlo
matrix inversion (MCMI) approach is that it scales better with respect to
state-space size than alternative techniques. In this paper, we introduce an
algorithm for performing reinforcement learning policy evaluation using MCMI.
We demonstrate that MCMI improves on runtime over a maximum likelihood
model-based policy evaluation approach and on both runtime and accuracy over
the temporal differencing (TD) policy evaluation approach. We further improve
on MCMI policy evaluation by adding an importance sampling technique to our
algorithm to reduce the variance of our estimator. Lastly, we illustrate
techniques for scaling up MCMI to large state spaces in order to perform policy
improvement.
Comment: Appears in Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI2003).
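To make the characterization concrete, the following is a minimal sketch (an illustration under assumed names, not the authors' implementation) of estimating V = (I - gamma*P)^{-1} r for a fixed policy by averaging discounted reward sums along random walks drawn from the policy's transition matrix, i.e. a truncated Neumann-series estimate of the matrix inverse:

import numpy as np

def mcmi_policy_evaluation(P, r, gamma=0.9, walks_per_state=500, max_len=200,
                           seed=None):
    """Estimate V = (I - gamma * P)^{-1} r by simulating random walks."""
    rng = np.random.default_rng(seed)
    n = len(r)
    V = np.zeros(n)
    for s in range(n):
        total = 0.0
        for _ in range(walks_per_state):
            state, discount, ret = s, 1.0, 0.0
            for _ in range(max_len):              # truncate the Neumann series
                ret += discount * r[state]
                discount *= gamma
                state = rng.choice(n, p=P[state])
            total += ret
        V[s] = total / walks_per_state
    return V

if __name__ == "__main__":
    # Tiny 3-state chain: the Monte Carlo estimate should approach the direct solve.
    P = np.array([[0.5, 0.5, 0.0],
                  [0.0, 0.5, 0.5],
                  [0.5, 0.0, 0.5]])
    r = np.array([0.0, 1.0, 2.0])
    print(mcmi_policy_evaluation(P, r, seed=0))
    print(np.linalg.solve(np.eye(3) - 0.9 * P, r))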
Self Training Autonomous Driving Agent
Intrinsically, driving is a Markov Decision Process, which makes it well suited to the reinforcement learning paradigm. In this paper, we propose a novel agent which learns to drive a vehicle without any human assistance. We use the concepts of reinforcement learning and evolutionary strategies to train our agent in a 2D simulation environment. Our model's architecture goes beyond the World Models architecture by introducing difference images into the autoencoder. This novel use of difference images gives a better representation of the latent space with respect to the motion of the vehicle and helps the autonomous agent learn how to drive more efficiently. Results show that our method requires 96% fewer total agents, 87.5% fewer agents per generation, 70% fewer generations, and 90% fewer rollouts than the original architecture while achieving the same accuracy.
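As a concrete illustration of the difference-image idea (frame shapes, scaling, and function names below are assumptions, not the paper's code), the encoder input can be formed by stacking the current frame with its pixel-wise difference from the previous frame, so motion is explicit in what the autoencoder must compress:

import numpy as np

def encoder_input(prev_frame, curr_frame):
    """Stack the current grayscale frame with its temporal difference."""
    diff = curr_frame.astype(np.float32) - prev_frame.astype(np.float32)
    diff = (diff + 255.0) / 510.0                # map the difference into [0, 1]
    curr = curr_frame.astype(np.float32) / 255.0
    return np.stack([curr, diff], axis=0)        # (channels, height, width)

if __name__ == "__main__":
    prev = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
    curr = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
    print(encoder_input(prev, curr).shape)       # (2, 64, 64)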
A Greedy Approach to Adapting the Trace Parameter for Temporal Difference Learning
One of the main obstacles to broad application of reinforcement learning
methods is the parameter sensitivity of our core learning algorithms. In many
large-scale applications, online computation and function approximation
represent key strategies in scaling up reinforcement learning algorithms. In
this setting, we have effective and reasonably well understood algorithms for
adapting the learning-rate parameter, online during learning. Such
meta-learning approaches can improve robustness of learning and enable
specialization to the current task, improving learning speed. For temporal-difference learning algorithms, which we study here, there is yet another parameter, λ, that similarly impacts learning speed and stability in practice. Unfortunately, unlike the learning-rate parameter, λ parametrizes the objective function that temporal-difference methods optimize. Different choices of λ produce different fixed-point solutions, and thus adapting λ online and characterizing the optimization is substantially more complex than adapting the learning-rate parameter. There is no meta-learning method for λ that can achieve (1) incremental updating, (2) compatibility with function approximation, and (3) stability of learning under both on- and off-policy sampling. In this paper we contribute a novel objective function for optimizing λ as a function of state rather than time. We derive a new incremental, linear-complexity λ-adaptation algorithm that does not require offline batch updating or access to a model of the world, and present a suite of experiments illustrating the practicality of our new algorithm in three different settings. Taken together, our contributions represent a concrete step towards black-box application of temporal-difference learning methods in real-world problems.
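The following sketch shows what "λ as a function of state rather than time" means operationally for linear TD(λ) with accumulating traces; the feature map, step size, and the convention of decaying the trace with the current state's λ are illustrative assumptions, not the paper's algorithm:

import numpy as np

def td_lambda_state_dependent(episodes, phi, lambda_fn, num_features,
                              alpha=0.05, gamma=0.99):
    """episodes: iterable of episodes, each a list of (s, r, s_next, done)."""
    w = np.zeros(num_features)
    for episode in episodes:
        z = np.zeros(num_features)                 # eligibility trace
        for s, r, s_next, done in episode:
            x, x_next = phi(s), phi(s_next)
            v_next = 0.0 if done else w @ x_next
            delta = r + gamma * v_next - w @ x     # TD error
            z = gamma * lambda_fn(s) * z + x       # state-dependent trace decay
            w += alpha * delta * z
    return w

if __name__ == "__main__":
    # 4-state chain, one-hot features, and a (here constant) lambda schedule.
    phi = lambda s: np.eye(4)[s]
    episode = [(0, 0.0, 1, False), (1, 0.0, 2, False), (2, 1.0, 3, True)]
    print(td_lambda_state_dependent([episode] * 300, phi, lambda s: 0.8, 4))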
Non-stationary Stochastic Optimization under L_{p,q}-Variation Measures
We consider a non-stationary sequential stochastic optimization problem, in
which the underlying cost functions change over time under a variation budget
constraint. We propose an L_{p,q}-variation functional to quantify the
change, which yields less variation for dynamic function sequences whose
changes are constrained to short time periods or small subsets of input domain.
Under the L_{p,q}-variation constraint, we derive both upper and matching
lower regret bounds for smooth and strongly convex function sequences, which
generalize previous results in Besbes et al. (2015). Furthermore, we provide an
upper bound for general convex function sequences with noisy gradient feedback,
which matches the optimal rate as p → ∞. Our results reveal some
surprising phenomena under this general variation functional, such as the curse
of dimensionality of the function domain. The key technical novelties in our
analysis include affinity lemmas that characterize the distance of the
minimizers of two convex functions with bounded L_p difference, and a cubic spline-based construction that attains matching lower bounds.
Comment: 38 pages, 3 figures. Revised version.
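For intuition only, one plausible discretization of such a variation functional takes a q-norm of each per-step change over a grid of the domain and a p-norm of those changes over time; the exact functional and normalizations in the paper may differ, so treat this as an assumption-laden sketch:

import numpy as np

def lpq_variation(function_values, p=2, q=2):
    """function_values: array of shape (T, n_grid) with f_t sampled on a grid."""
    step_changes = np.diff(function_values, axis=0)            # f_{t+1} - f_t
    per_step = np.linalg.norm(step_changes, ord=q, axis=1)     # q-norm in space
    return np.linalg.norm(per_step, ord=p)                     # p-norm in time

if __name__ == "__main__":
    grid = np.linspace(0.0, 1.0, 200)
    # A slowly drifting quadratic: each step changes little, so the sequence
    # has low variation under this measure.
    fs = np.array([(grid - 0.5 - 0.001 * t) ** 2 for t in range(50)])
    print(lpq_variation(fs, p=1, q=2))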
Context-Dependent Upper-Confidence Bounds for Directed Exploration
Directed exploration strategies for reinforcement learning are critical for
learning an optimal policy in a minimal number of interactions with the
environment. Many algorithms use optimism to direct exploration, either through
visitation estimates or upper confidence bounds, as opposed to data-inefficient
strategies like ε-greedy that use random, undirected exploration. Most
data-efficient exploration methods require significant computation, typically
relying on a learned model to guide exploration. Least-squares methods have the
potential to provide some of the data-efficiency benefits of model-based
approaches -- because they summarize past interactions -- with the computation
closer to that of model-free approaches. In this work, we provide a novel,
computationally efficient, incremental exploration strategy, leveraging this
property of least-squares temporal difference learning (LSTD). We derive upper
confidence bounds on the action-values learned by LSTD, with context-dependent
(or state-dependent) noise variance. Such context-dependent noise focuses
exploration on a subset of variable states, and allows for reduced exploration
in other states. We empirically demonstrate that our algorithm can converge
more quickly than other incremental exploration strategies using confidence
estimates on action-values.
Comment: Neural Information Processing Systems 2018.
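The following is a simplified, hedged sketch of the general idea: maintain least-squares estimates of action-values and act greedily with respect to an upper confidence bound whose width comes from the inverse feature-covariance matrix. The paper derives its bounds for LSTD with context-dependent noise variance; the LinUCB-style bonus and all names below are assumptions:

import numpy as np

class LeastSquaresUCBAgent:
    def __init__(self, num_features, num_actions, beta=1.0, reg=1.0):
        self.beta = beta
        self.A_inv = [np.eye(num_features) / reg for _ in range(num_actions)]
        self.b = [np.zeros(num_features) for _ in range(num_actions)]

    def act(self, x):
        """Pick the action with the largest upper-confidence action-value."""
        scores = []
        for A_inv, b in zip(self.A_inv, self.b):
            mean = (A_inv @ b) @ x
            width = self.beta * np.sqrt(x @ A_inv @ x)   # confidence width
            scores.append(mean + width)
        return int(np.argmax(scores))

    def update(self, x, action, target):
        """Sherman-Morrison rank-one update of the inverse covariance."""
        A_inv = self.A_inv[action]
        Ax = A_inv @ x
        self.A_inv[action] = A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)
        self.b[action] += target * x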
Finite Sample Analyses for TD(0) with Function Approximation
TD(0) is one of the most commonly used algorithms in reinforcement learning.
Despite this, there is no existing finite sample analysis for TD(0) with
function approximation, even for the linear case. Our work is the first to
provide such results. Existing convergence rates for Temporal Difference (TD)
methods apply only to somewhat modified versions, e.g., projected variants or
ones where stepsizes depend on unknown problem parameters. Our analyses obviate
these artificial alterations by exploiting strong properties of TD(0). We
provide convergence rates both in expectation and with high-probability. The
two are obtained via different approaches that use relatively unknown, recently
developed stochastic approximation techniques.
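For reference, this is the unmodified algorithm such analyses target: plain linear TD(0) with no projection step and a step size that does not depend on unknown problem parameters (the environment interface and feature map below are illustrative assumptions):

import numpy as np

def linear_td0(env_step, phi, num_features, start_state=0,
               num_steps=10_000, alpha=0.01, gamma=0.99, seed=None):
    """env_step(state, rng) -> (reward, next_state, done)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(num_features)
    s = start_state
    for _ in range(num_steps):
        r, s_next, done = env_step(s, rng)
        v_next = 0.0 if done else w @ phi(s_next)
        w += alpha * (r + gamma * v_next - w @ phi(s)) * phi(s)   # TD(0) update
        s = start_state if done else s_next
    return w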
Safe and Efficient Off-Policy Reinforcement Learning
In this work, we take a fresh look at some old and new algorithms for
off-policy, return-based reinforcement learning. Expressing these in a common
form, we derive a novel algorithm, Retrace(λ), with three desired
properties: (1) it has low variance; (2) it safely uses samples collected from
any behaviour policy, whatever its degree of "off-policyness"; and (3) it is
efficient as it makes the best use of samples collected from near on-policy
behaviour policies. We analyze the contractive nature of the related operator
under both off-policy policy evaluation and control settings and derive online
sample-based algorithms. We believe this is the first return-based off-policy
control algorithm converging a.s. to Q* without the GLIE assumption (Greedy
in the Limit with Infinite Exploration). As a corollary, we prove the
convergence of Watkins' Q(λ), which had been an open problem since 1989. We illustrate the benefits of Retrace(λ) on a standard suite of Atari 2600 games.
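To spell out the operator in the tabular case, the sketch below computes the Retrace(λ) correction for the first state-action pair of one off-policy trajectory, using the truncated per-step coefficients c_s = λ min(1, π(a_s|x_s)/μ(a_s|x_s)); variable names and the tabular simplification are assumptions:

import numpy as np

def retrace_correction(Q, trajectory, pi, mu, lam=0.9, gamma=0.99):
    """Retrace correction for the first (state, action) of a trajectory.

    trajectory: list of (state, action, reward, next_state) tuples.
    pi, mu: arrays of shape (num_states, num_actions) with policy probabilities.
    """
    coeff, correction = 1.0, 0.0
    for t, (x, a, r, x_next) in enumerate(trajectory):
        if t > 0:
            # Truncated importance weight: keeps variance low while staying
            # safe for arbitrary behaviour policies.
            coeff *= gamma * lam * min(1.0, pi[x, a] / mu[x, a])
        expected_q = pi[x_next] @ Q[x_next]          # E_{a' ~ pi} Q(x', a')
        delta = r + gamma * expected_q - Q[x, a]     # off-policy TD error
        correction += coeff * delta
    return correction

# Usage (tabular): Q[x0, a0] += step_size * retrace_correction(Q, traj, pi, mu)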
Non-Stationary Markov Decision Processes, a Worst-Case Approach using Model-Based Reinforcement Learning, Extended version
This work tackles the problem of robust zero-shot planning in non-stationary
stochastic environments. We study Markov Decision Processes (MDPs) evolving
over time and consider Model-Based Reinforcement Learning algorithms in this
setting. We make two hypotheses: 1) the environment evolves continuously with a
bounded evolution rate; 2) a current model is known at each decision epoch but
not its evolution. Our contribution can be presented in four points. 1) we
define a specific class of MDPs that we call Non-Stationary MDPs (NSMDPs). We
introduce the notion of regular evolution by making a hypothesis of Lipschitz continuity of the transition and reward functions w.r.t. time; 2) we
consider a planning agent using the current model of the environment but
unaware of its future evolution. This leads us to consider a worst-case method
where the environment is seen as an adversarial agent; 3) following this
approach, we propose the Risk-Averse Tree-Search (RATS) algorithm, a zero-shot
Model-Based method similar to Minimax search; 4) we illustrate the benefits
brought by RATS empirically and compare its performance with reference
Model-Based algorithms.
Comment: Published at NeurIPS 2019, 17 pages, 3 figures.
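A heavily simplified, hedged sketch of the worst-case planning step: a depth-limited search in which the agent maximizes over actions while an adversary picks, at every node, the model that is worst for the agent from an admissible set (standing in for the Lipschitz ball around the current snapshot). Deterministic transitions and all names are assumptions, not the published RATS algorithm:

def worst_case_value(state, depth, actions, candidate_models, gamma=0.95):
    """candidate_models: list of (step_fn, reward_fn) pairs, where
    step_fn(state, action) -> next_state and reward_fn(state, action) -> float.
    """
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        # The adversary evaluates the action under every admissible model
        # and keeps the worst outcome for the agent.
        worst = float("inf")
        for step_fn, reward_fn in candidate_models:
            v = reward_fn(state, a) + gamma * worst_case_value(
                step_fn(state, a), depth - 1, actions, candidate_models, gamma)
            worst = min(worst, v)
        best = max(best, worst)
    return best

def worst_case_action(state, depth, actions, candidate_models, gamma=0.95):
    """Act greedily with respect to the worst-case depth-limited value."""
    return max(actions, key=lambda a: min(
        reward_fn(state, a) + gamma * worst_case_value(
            step_fn(state, a), depth - 1, actions, candidate_models, gamma)
        for step_fn, reward_fn in candidate_models))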
Dyna-H: a heuristic planning reinforcement learning algorithm applied to role-playing-game strategy decision systems
In a Role-Playing Game, finding optimal trajectories is one of the most
important tasks. In fact, the strategy decision system becomes a key component
of a game engine. Determining the way in which decisions are taken (online,
batch or simulated) and the consumed resources in decision making (e.g.
execution time, memory) will influence, to a major degree, the game performance.
When classical search algorithms such as A* can be used, they are the very
first option. Nevertheless, such methods rely on precise and complete models of
the search space, and there are many interesting scenarios where their
application is not possible. Then, model free methods for sequential decision
making under uncertainty are the best choice. In this paper, we propose a
heuristic planning strategy to incorporate the ability of heuristic-search in
path-finding into a Dyna agent. The proposed Dyna-H algorithm, as A* does, selects the branches that are most likely to produce good outcomes ahead of other branches. In addition, it has the advantage of being a model-free online reinforcement learning algorithm. The proposal was evaluated against the one-step Q-Learning and Dyna-Q algorithms, obtaining excellent experimental results: Dyna-H significantly outperforms both methods in all experiments. We also suggest a functional analogy between the proposed heuristic of sampling from the worst trajectories and the role of dreams (e.g. nightmares) in human behavior.
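In the spirit of the proposal (though not its published pseudocode), the sketch below shows a Dyna-style planning step whose simulated updates are biased by a heuristic, e.g. Euclidean distance to the goal: among a few remembered transitions, it replays the one whose successor looks best under the heuristic. All names and the candidate-sampling scheme are assumptions:

import random
from collections import defaultdict

def dyna_h_planning(Q, model, heuristic, actions, n_updates=10,
                    n_candidates=5, alpha=0.1, gamma=0.95):
    """Q: defaultdict(float) keyed by (state, action).
    model: dict mapping (state, action) -> (reward, next_state)."""
    if not model:
        return
    for _ in range(n_updates):
        # Draw a few remembered transitions; replay the one whose successor
        # scores best (lowest heuristic cost, e.g. distance to the goal).
        candidates = random.sample(list(model.items()),
                                   k=min(n_candidates, len(model)))
        (s, a), (r, s_next) = min(candidates,
                                  key=lambda item: heuristic(item[1][1]))
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Usage sketch: Q = defaultdict(float); after each real environment step,
# store the transition in `model` and call dyna_h_planning(Q, model, h, actions).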
Contextual Bandits with Similarity Information
In a multi-armed bandit (MAB) problem, an online algorithm makes a sequence
of choices. In each round it chooses from a time-invariant set of alternatives
and receives the payoff associated with this alternative. While the case of
small strategy sets is by now well-understood, a lot of recent work has focused
on MAB problems with exponentially or infinitely large strategy sets, where one
needs to assume extra structure in order to make the problem tractable. In
particular, recent literature considered information on similarity between
arms.
We consider similarity information in the setting of "contextual bandits", a
natural extension of the basic MAB problem where before each round an algorithm
is given the "context" -- a hint about the payoffs in this round. Contextual
bandits are directly motivated by placing advertisements on webpages, one of
the crucial problems in sponsored search. A particularly simple way to
represent similarity information in the contextual bandit setting is via a
"similarity distance" between the context-arm pairs which gives an upper bound
on the difference between the respective expected payoffs.
Prior work on contextual bandits with similarity uses "uniform" partitions of
the similarity space, which is potentially wasteful. We design more efficient
algorithms that are based on adaptive partitions adjusted to "popular" contexts and "high-payoff" arms.
Comment: This is the full version of a conference paper in COLT 2011, to
appear in JMLR in 2014. A preliminary version of this manuscript (with all
the results) has been posted to arXiv in February 2011. An earlier version on
arXiv, which does not include the results in Section 6, dates back to July
2009. The present revision addresses various presentation issues pointed out
by the journal referees.
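For concreteness, here is a hedged sketch of the uniform-partition baseline the abstract contrasts against: the context space (assumed here to be the interval [0, 1]) is split into equal cells and an independent UCB1 learner runs in each cell, so similar contexts share statistics. The paper's contribution, adaptive partitions refined around popular contexts and high-payoff arms, is not shown:

import math

class UniformPartitionUCB:
    def __init__(self, num_arms, num_cells=10):
        self.num_arms, self.num_cells = num_arms, num_cells
        self.counts = [[0] * num_arms for _ in range(num_cells)]
        self.sums = [[0.0] * num_arms for _ in range(num_cells)]
        self.t = 0

    def _cell(self, context):
        return min(int(context * self.num_cells), self.num_cells - 1)

    def act(self, context):
        self.t += 1
        c = self._cell(context)
        best_arm, best_score = 0, float("-inf")
        for a in range(self.num_arms):
            if self.counts[c][a] == 0:
                return a                      # play every arm in a cell once
            mean = self.sums[c][a] / self.counts[c][a]
            bonus = math.sqrt(2.0 * math.log(self.t) / self.counts[c][a])
            if mean + bonus > best_score:
                best_arm, best_score = a, mean + bonus
        return best_arm

    def update(self, context, arm, payoff):
        c = self._cell(context)
        self.counts[c][arm] += 1
        self.sums[c][arm] += payoff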