Extreme State Aggregation Beyond MDPs

Author: A.L. Strehl
I. Fazekas
M. Hutter
M.L. Puterman
O.-A. Maillard
P. Nguyen
P. Sunehag
R. Givan
R.S. Sutton
S.J. Russell
T. Jaksch
T. Lattimore
V. Vovk
Publication date: 01/01/2014

We consider a Reinforcement Learning setup where an agent interacts with an environment in observation-reward-action cycles without any (esp. MDP) assumptions on the environment. State aggregation and, more generally, feature reinforcement learning are concerned with mapping histories/raw-states to reduced/aggregated states. The idea behind both is that the resulting reduced process (approximately) forms a small stationary finite-state MDP, which can then be efficiently solved or learnt. We considerably generalize existing aggregation results by showing that even if the reduced process is not an MDP, the (q-)value functions and (optimal) policies of an associated MDP with the same state-space size solve the original problem, as long as the solution can approximately be represented as a function of the reduced states. This implies an upper bound on the required state space size that holds uniformly for all RL problems. It may also explain why RL algorithms designed for MDPs sometimes perform well beyond MDPs. Comment: 28 LaTeX pages. 8 Theorem
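
The aggregation idea in the abstract above can be sketched as follows: a map sends each history to a reduced state, and ordinary tabular Q-learning then runs on the reduced states. This is only an illustration under stated assumptions. The names `phi` and `q_update`, the choice of map (keep the last observation only), and the learning-rate and discount values are hypothetical and not taken from the paper.

```python
from collections import defaultdict

def phi(history):
    """Hypothetical aggregation map: reduce a history to its last observation.

    A history is a list of (observation, reward, action) tuples.
    The empty history maps to the placeholder state None.
    """
    return history[-1][0] if history else None

def q_update(q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step on aggregated states (assumed hyperparameters)."""
    best_next = max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
    return q[(s, a)]

q = defaultdict(float)  # Q-values indexed by (reduced state, action)
actions = ["left", "right"]

h1 = [("A", 0.0, "left")]
h2 = [("B", 0.0, "right"), ("A", 1.0, "left")]  # different history, same reduced state

s = phi(h1)
s_next = phi(h2 + [("B", 1.0, "right")])
new_value = q_update(q, s, "right", 1.0, s_next, actions)
```

Here `h1` and `h2` are distinct histories that share one reduced state, so they also share one row of the Q-table. Whether such sharing loses anything depends on whether the solution can be represented as a function of the reduced states, which is the condition the abstract states.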

arXiv.org e-Print Archive

Crossref

The Australian National University

Self-Modification of Policy and Utility Function in Rational Agents

Author: B Hibbard
D Dewey
D Silver
J Schmidhuber
L Orseau
LP Kaelbling
M Hutter
M Ring
N Bostrom
R Sutton
RV Yampolskiy
S Legg
V Mnih
Publication date: 10/05/2016

Any agent that is part of the environment it interacts with and has versatile actuators (such as arms and fingers) will in principle have the ability to self-modify, for example by changing its own source code. As we continue to create more and more intelligent agents, chances increase that they will learn about this ability. The question is: will they want to use it? For example, highly intelligent systems may find ways to change their goals to something more easily achievable, thereby 'escaping' the control of their designers. In an important paper, Omohundro (2008) argued that goal preservation is a fundamental drive of any intelligent system, since a goal is more likely to be achieved if future versions of the agent strive towards the same goal. In this paper, we formalise this argument in general reinforcement learning, and explore situations where it fails. Our conclusion is that the self-modification possibility is harmless if and only if the value function of the agent anticipates the consequences of self-modifications and uses the current utility function when evaluating the future. Comment: Artificial General Intelligence (AGI) 201

arXiv.org e-Print Archive

Crossref

The Australian National University

On overfitting and asymptotic bias in batch reinforcement learning with partial observability

Author: Ernst Damien
Fonteneau Raphael
Francois-Lavet Vincent
Pineau Joelle
Rabusseau Guillaume
Publication date: 06/02/2019

This paper provides an analysis of the tradeoff between asymptotic bias (suboptimality with unlimited data) and overfitting (additional suboptimality due to limited data) in the context of reinforcement learning with partial observability. Our theoretical analysis formally shows that a smaller state representation decreases the risk of overfitting while potentially increasing the asymptotic bias. This analysis relies on expressing the quality of a state representation by bounding L1 error terms of the associated belief states. Theoretical results are empirically illustrated when the state representation is a truncated history of observations, both on synthetic POMDPs and on a large-scale POMDP in the context of smart grids, with real-world data. Finally, similarly to known results in the fully observable setting, we also briefly discuss and empirically illustrate how using function approximators and adapting the discount factor may enhance the tradeoff between asymptotic bias and overfitting in the partially observable context. Comment: Accepted at the Journal of Artificial Intelligence Research (JAIR) - 31 page
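
To make the "truncated history of observations" representation concrete, here is a small sketch that is not taken from the paper. The class `TruncatedHistory`, the helper `count_distinct_states`, and the toy observation stream are all hypothetical. It only illustrates that a longer truncation length yields more distinct states, and therefore fewer samples per state for a fixed amount of data.

```python
from collections import deque

class TruncatedHistory:
    """Keeps the last `length` observations as a hashable state (illustrative)."""

    def __init__(self, length):
        self.buffer = deque(maxlen=length)  # old observations fall off the left

    def observe(self, observation):
        self.buffer.append(observation)
        return tuple(self.buffer)

def count_distinct_states(observations, length):
    """Number of distinct truncated-history states visited along a stream."""
    tracker = TruncatedHistory(length)
    return len({tracker.observe(o) for o in observations})

stream = [0, 1, 0, 1, 1, 0, 0, 1]  # toy binary observation stream (assumed data)
n_short = count_distinct_states(stream, 1)
n_long = count_distinct_states(stream, 3)
```

With this stream the length-1 representation visits far fewer states than the length-3 one. That is the intuition behind the tradeoff in the abstract: the coarser representation risks more asymptotic bias, while the finer one spreads the same data over more states.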

arXiv.org e-Print Archive

Open Repository and Bibliography - Liège

On learning history based policies for controlling Markov decision processes

Author: Mahajan Aditya
Patil Gandharv
Precup Doina
Publication date: 05/11/2022

Reinforcement learning (RL) folklore suggests that history-based function approximation methods, such as recurrent neural nets or history-based state abstraction, perform better than their memory-less counterparts, due to the fact that function approximation in Markov decision processes (MDP) can be viewed as inducing a partially observable MDP. However, there has been little formal analysis of such history-based algorithms, as most existing frameworks focus exclusively on memory-less features. In this paper, we introduce a theoretical framework for studying the behaviour of RL algorithms that learn to control an MDP using history-based feature abstraction mappings. Furthermore, we use this framework to design a practical RL algorithm and we numerically evaluate its effectiveness on a set of continuous control tasks.

arXiv.org e-Print Archive

Explainable reinforcement learning for broad-XAI: a conceptual framework and survey

Author: Cruz Francisco
Dazeley Richard
Vamplew Peter
Publication venue: Springer Science and Business Media Deutschland GmbH
Publication date: 01/01/2023

Broad-XAI moves away from interpreting individual decisions based on a single datum and aims to provide integrated explanations from multiple machine learning algorithms into a coherent explanation of an agent’s behaviour that is aligned to the communication needs of the explainee. Reinforcement Learning (RL) methods, we propose, provide a potential backbone for the cognitive model required for the development of Broad-XAI. RL represents a suite of approaches that have had increasing success in solving a range of sequential decision-making problems. However, these algorithms operate as black-box problem solvers, where they obfuscate their decision-making policy through a complex array of values and functions. EXplainable RL (XRL) aims to develop techniques to extract concepts from the agent’s: perception of the environment; intrinsic/extrinsic motivations/beliefs; Q-values, goals and objectives. This paper aims to introduce the Causal XRL Framework (CXF), which unifies the current XRL research and uses RL as a backbone for the development of Broad-XAI. CXF is designed to incorporate many standard RL extensions and to be integrated with external ontologies and communication facilities so that the agent can answer questions that explain the outcomes of its decisions. This paper aims to: establish XRL as a distinct branch of XAI; introduce a conceptual framework for XRL; review existing approaches explaining agent behaviour; and identify opportunities for future research. Finally, this paper discusses how additional information can be extracted and ultimately integrated into models of communication, facilitating the development of Broad-XAI. © 2023, The Author(s)

Federation ResearchOnline

Monte Carlo Tree Search with Fixed and Adaptive Abstractions

Author: Hostetler Jesse A.
Publication venue: Oregon State University

Monte Carlo tree search (MCTS) is a class of online planning algorithms for Markov decision processes (MDPs) and related models that has found success in challenging applications. In the online planning approach, the agent makes a decision in the current state by performing a limited forward search over possible futures and selecting the course of action that is expected to lead to the best outcomes. This thesis proposes a new approach to MCTS based on abstraction and progressive abstraction refinement that makes better use of a limited number of samples. Our first contribution is an analysis of state abstraction in the MCTS setting. We describe a class of state aggregation abstractions that generalizes previously-proposed abstraction criteria and show that the regret due to planning with such abstractions is bounded. We then adapt popular MCTS algorithms to use fixed state abstractions. Our second contribution is a novel approach to MCTS based on abstraction refinement. We propose the Progressive Abstraction Refinement for Sparse Sampling (PARSS) algorithm, which begins by performing sparse sampling with a coarse state abstraction and then refines the abstraction progressively to make it more accurate. The PARSS algorithm provides the same formal guarantees as ordinary sparse sampling, and we show experimentally that PARSS outperforms sparse sampling in the ground state space and with fixed uninformed abstractions. Our third contribution is an extension of the progressive refinement idea to incorporate other kinds of abstraction. For this purpose, we introduce the formalism of abstraction diagrams (ADs) and show that ADs can express diverse kinds of abstraction, including state abstraction, temporal abstraction, and action pruning. We then describe refinement operators for ADs, extending the progressive refinement search framework to abstractions represented as ADs. 
Our fourth and final contribution is an application of online planning algorithms to the problem of controlling electrical transmission grids to mitigate the effects of equipment failures. Our work in this area is distinguished by the use of a full dynamical model of the power grid, which captures more mechanisms of cascading failure than simpler models. Because of the computational cost of the simulation, we choose simple online planning algorithms that require a small number of simulation trajectories. Our results demonstrate the superiority of the online planning approach to fixed expert policies, while also highlighting the need for faster simulators to enable more sophisticated solution algorithms.

ScholarsArchive@OSU

Model-based reinforcement learning and navigation in animals and machines

Author: Corneil Dane Sterling
Publication venue: Lausanne, EPFL
Publication date: 15/10/2018

For decades, neuroscientists and psychologists have observed that animal performance on spatial navigation tasks suggests an internal learned map of the environment. More recently, map-based (or model-based) reinforcement learning has become a highly active research area in machine learning. With a learned model of their environment, both animals and artificial agents can generalize between tasks and learn rapidly. In this thesis, I present approaches for developing efficient model-based behaviour in machines and explaining model-based behaviour in animals. From a neuroscience perspective, I focus on the hippocampus, believed to be a major substrate of model-based behaviour in the brain. I consider how hippocampal connectivity enables path-finding between different locations in an environment. The model describes how environments with boundaries and barriers can be represented in recurrent neural networks (i.e. attractor networks), and how the transient activity in these networks, after being stimulated with a goal location, could be used for determining a path to the goal. I also propose how the connectivity of these map-like networks can be learned from the spatial firing patterns observed in the input pathway to the hippocampus (i.e. grid cells and border cells). From a machine learning perspective, I describe a reinforcement learning model that integrates model-based methods and "episodic control", an approach to reinforcement learning based on episodic memory. According to episodic control, the agent learns how to act in the environment by storing snapshot-like memories of its observations, then comparing its current observations to similar snapshot memories where it took an action that resulted in high reward. In our approach, the agent augments these real-world memories with episodes simulated offline using a learned model of the environment. These "simulated memories" allow the agent to adapt faster when the reward locations change.
Next, I describe Variational State Tabulation (VaST), a model-based method for learning quickly with continuous and high-dimensional observations (like those found in 3D navigation tasks). The VaST agent learns to map its observations to a limited number of discrete abstract states, and builds a transition model over those abstract states. The long-term values of different actions in each state are updated continuously and efficiently in the background as the agent explores the environment. I show how the VaST agent can learn faster than other state-of-the-art algorithms, even changing its policy after a single new experience, and how it can respond quickly to changing rewards in complex 3D environments. The models I present allow the agent to rapidly adapt to changing goals and rewards, a key component of intelligence. They use a combination of features attributed to model-based and episodic controllers, suggesting that the division between the two fields is not strict. I therefore also consider the consequences of these findings on theories of model-based learning, episodic control and hippocampal function.
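
The episodic-control idea described in this abstract (store snapshots, then act like the most rewarding similar snapshot) can be sketched roughly as below. This is a simplified illustration, not the thesis's method. The class `EpisodicMemory`, the Euclidean distance measure, and the neighbourhood size `k` are assumptions, and neither the simulated-memory augmentation nor VaST is implemented here.

```python
import math

class EpisodicMemory:
    """Stores (observation, action, return) snapshots; illustrative only."""

    def __init__(self):
        self.snapshots = []

    def store(self, observation, action, episode_return):
        self.snapshots.append((tuple(observation), action, episode_return))

    def act(self, observation, k=3):
        """Among the k stored snapshots nearest to `observation`,
        return the action of the one with the highest return."""
        if not self.snapshots:
            raise ValueError("memory is empty")
        nearest = sorted(
            self.snapshots,
            key=lambda snap: math.dist(snap[0], observation),
        )[:k]
        return max(nearest, key=lambda snap: snap[2])[1]

memory = EpisodicMemory()
memory.store((0.0, 0.0), "north", 1.0)
memory.store((0.1, 0.0), "east", 5.0)
memory.store((5.0, 5.0), "south", 9.0)

# The far-away snapshot has the highest return but is not among the 2 nearest.
chosen = memory.act((0.05, 0.0), k=2)
```

Restricting the choice to nearby snapshots is what keeps the distant high-return memory from dominating. A linear scan over all snapshots is used for brevity; a practical agent would likely need a faster nearest-neighbour lookup.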

Infoscience - École polytechnique fédérale de Lausanne