Extreme State Aggregation Beyond MDPs
We consider a reinforcement learning setup in which an agent interacts with an
environment in observation-reward-action cycles, without any assumptions (in
particular, no MDP assumption) on the environment. State aggregation, and more
generally feature reinforcement learning, is concerned with mapping
histories/raw states to reduced/aggregated states. The idea behind both is that
the resulting reduced process (approximately) forms a small stationary
finite-state MDP, which can then be efficiently solved or learnt. We
considerably generalize existing aggregation results by showing that even if
the reduced process is not an MDP, the (q-)value functions and (optimal)
policies of an associated MDP with the same state-space size solve the original
problem, as long as the solution can be approximately represented as a function
of the reduced states. This implies an upper bound on the required state-space
size that holds uniformly for all RL problems. It may also explain why RL
algorithms designed for MDPs sometimes perform well beyond MDPs.
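As a rough illustration of the aggregation idea (a minimal sketch, not the paper's construction), the snippet below maps interaction histories through a hypothetical feature map phi and runs tabular Q-learning on the aggregated states; phi, the environment interface, and all hyperparameters are illustrative assumptions:

```python
import random
from collections import defaultdict

def phi(history, n_states=16):
    """Hypothetical aggregation map: hash the recent history into one of
    n_states reduced states. A real feature map would be learned or designed."""
    return hash(tuple(history[-4:])) % n_states

def q_learning_over_aggregated_states(env, episodes=500,
                                      alpha=0.1, gamma=0.99, eps=0.1):
    # Q-table indexed by (aggregated state, action), not by raw history.
    Q = defaultdict(float)
    for _ in range(episodes):
        history, done = [env.reset()], False
        while not done:
            s = phi(history)
            a = (random.randrange(env.n_actions) if random.random() < eps
                 else max(range(env.n_actions), key=lambda b: Q[(s, b)]))
            obs, reward, done = env.step(a)
            history += [a, reward, obs]
            s_next = phi(history)
            target = reward if done else reward + gamma * max(
                Q[(s_next, b)] for b in range(env.n_actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])  # standard TD update
    return Q
```

The paper's point is that this kind of reduction can succeed even when the process over phi-states is not itself an MDP, provided the solution is approximately a function of those states.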
Self-Modification of Policy and Utility Function in Rational Agents
Any agent that is part of the environment it interacts with and has versatile
actuators (such as arms and fingers) will in principle have the ability to
self-modify, for example by changing its own source code. As we continue to
create more and more intelligent agents, chances increase that they will learn
about this ability. The question is: will they want to use it? For example,
highly intelligent systems may find ways to change their goals to something
more easily achievable, thereby 'escaping' the control of their designers. In
an important paper, Omohundro (2008) argued that goal preservation is a
fundamental drive of any intelligent system, since a goal is more likely to be
achieved if future versions of the agent strive towards the same goal. In this
paper, we formalise this argument in general reinforcement learning, and
explore situations where it fails. Our conclusion is that the possibility of
self-modification is harmless if and only if the value function of the agent
anticipates the consequences of self-modifications and uses the current utility
function when evaluating the future.
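The distinction can be made concrete with a toy example (my own illustration, not the paper's formalism): an agent chooses between keeping its utility function and self-modifying to one that is trivially satisfied. Scoring futures with the current utility makes the modification worthless; scoring them with the post-modification utility makes it look optimal:

```python
# Hypothetical toy setup: two utility functions over final outcomes.
u_current = {"goal_reached": 1.0, "idle": 0.0}
u_trivial = {"goal_reached": 1.0, "idle": 1.0}  # post-modification "wireheaded" utility

p_success = 0.9  # assumed probability that honestly pursuing the goal succeeds

def naive_value(action):
    """Scores the future with the utility the agent will hold *after* acting."""
    if action == "self_modify":
        return u_trivial["idle"]                  # 1.0: the new utility is always satisfied
    return p_success * u_current["goal_reached"]  # 0.9

def realistic_value(action):
    """Scores every future with the *current* utility (the harmlessness condition)."""
    if action == "self_modify":
        return u_current["idle"]                  # 0.0: modifying abandons the real goal
    return p_success * u_current["goal_reached"]  # 0.9

print(naive_value("self_modify") > naive_value("keep_goal"))          # True: would self-modify
print(realistic_value("self_modify") < realistic_value("keep_goal"))  # True: preserves its goal
```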
On overfitting and asymptotic bias in batch reinforcement learning with partial observability
This paper provides an analysis of the tradeoff between asymptotic bias
(suboptimality with unlimited data) and overfitting (additional suboptimality
due to limited data) in the context of reinforcement learning with partial
observability. Our theoretical analysis formally shows that a smaller state
representation decreases the risk of overfitting while potentially increasing
the asymptotic bias. The analysis relies on expressing the quality of a state
representation by bounding L1 error terms of the associated belief states. The
theoretical results are empirically illustrated when the state representation
is a truncated history of observations, both on synthetic POMDPs and on a
large-scale POMDP in the context of smart grids, with real-world data. Finally,
similarly to known results in the fully observable setting, we also briefly
discuss and empirically illustrate how using function approximators and
adapting the discount factor may improve the tradeoff between asymptotic bias
and overfitting in the partially observable context.
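The truncated-history representation makes the tradeoff tangible: the "state" is just the last h items of the interaction, so h acts as a capacity knob (small h risks more asymptotic bias; large h yields many distinct states that a finite batch may not cover). A minimal sketch, with the deque-based encoding as an illustrative assumption:

```python
from collections import deque

class TruncatedHistoryState:
    """Represents a POMDP 'state' as the last h (action, observation) items.

    h controls the bias/overfitting tradeoff: a short history is a coarse,
    biased representation; a long history yields many distinct states that a
    finite batch of data may not cover.
    """

    def __init__(self, h):
        self.buffer = deque(maxlen=h)  # keeps only the h most recent items

    def reset(self, first_obs):
        self.buffer.clear()
        self.buffer.append(first_obs)

    def update(self, action, obs):
        self.buffer.append((action, obs))

    def state(self):
        # Hashable key usable as the index of a tabular value function.
        return tuple(self.buffer)
```

A batch RL method would then estimate values over these tuple-keyed states from a fixed dataset, sweeping h to trade bias against overfitting.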
On learning history-based policies for controlling Markov decision processes
Reinforcement learning (RL) folklore suggests that history-based function
approximation methods, such as recurrent neural nets or history-based state
abstraction, perform better than their memory-less counterparts, because
function approximation in Markov decision processes (MDPs) can be viewed as
inducing a partially observable MDP. However, there has been little formal
analysis of such history-based algorithms, as most existing frameworks focus
exclusively on memory-less features. In this paper, we introduce a theoretical
framework for studying the behaviour of RL algorithms that learn to control an
MDP using history-based feature abstraction mappings. Furthermore, we use this
framework to design a practical RL algorithm, and we numerically evaluate its
effectiveness on a set of continuous control tasks.
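One concrete instance of a history-based feature abstraction (an illustrative choice, not necessarily the paper's algorithm) is a recurrent encoder whose hidden state summarizes the interaction history, so a memory-less policy can act on a fixed-size feature:

```python
import numpy as np

class RecurrentFeatureAbstraction:
    """Toy history-based feature map: a fixed random RNN compresses the
    (observation, action) stream into a fixed-size feature vector.
    The dimensions and tanh recurrence are illustrative, not learned here."""

    def __init__(self, obs_dim, act_dim, feat_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_h = rng.normal(scale=0.5, size=(feat_dim, feat_dim))
        self.W_x = rng.normal(scale=0.5, size=(feat_dim, obs_dim + act_dim))
        self.h = np.zeros(feat_dim)

    def reset(self):
        self.h[:] = 0.0  # start of a new episode: empty history

    def update(self, obs, action):
        x = np.concatenate([obs, action])
        self.h = np.tanh(self.W_h @ self.h + self.W_x @ x)  # fold step into memory
        return self.h  # feature of the entire history so far
```

Any standard RL method can then treat the returned feature as its state; such abstraction mappings are what the framework above analyzes.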
Explainable reinforcement learning for broad-XAI: a conceptual framework and survey
Broad-XAI moves away from interpreting individual decisions based on a single datum and aims to integrate explanations from multiple machine learning algorithms into a coherent account of an agent's behaviour, aligned to the communication needs of the explainee. Reinforcement Learning (RL) methods, we propose, provide a potential backbone for the cognitive model required for the development of Broad-XAI. RL represents a suite of approaches that have had increasing success in solving a range of sequential decision-making problems. However, these algorithms operate as black-box problem solvers, obfuscating their decision-making policy behind a complex array of values and functions. EXplainable RL (XRL) aims to develop techniques for extracting concepts from the agent's perception of the environment; its intrinsic/extrinsic motivations and beliefs; and its Q-values, goals, and objectives. This paper introduces the Causal XRL Framework (CXF), which unifies current XRL research and uses RL as a backbone for the development of Broad-XAI. CXF is designed to incorporate many standard RL extensions and to integrate with external ontologies and communication facilities, so that the agent can answer questions that explain the outcomes of its decisions. This paper aims to: establish XRL as a distinct branch of XAI; introduce a conceptual framework for XRL; review existing approaches to explaining agent behaviour; and identify opportunities for future research. Finally, the paper discusses how additional information can be extracted and ultimately integrated into models of communication, facilitating the development of Broad-XAI.
Monte Carlo Tree Search with Fixed and Adaptive Abstractions
Monte Carlo tree search (MCTS) is a class of online planning algorithms for Markov decision processes (MDPs) and related models that has found success in challenging applications. In the online planning approach, the agent makes a decision in the current state by performing a limited forward search over possible futures and selecting the course of action that is expected to lead to the best outcomes. This thesis proposes a new approach to MCTS based on abstraction and progressive abstraction refinement that makes better use of a limited number of samples.
Our first contribution is an analysis of state abstraction in the MCTS setting. We describe a class of state aggregation abstractions that generalizes previously proposed abstraction criteria and show that the regret due to planning with such abstractions is bounded. We then adapt popular MCTS algorithms to use fixed state abstractions.
Our second contribution is a novel approach to MCTS based on abstraction refinement. We propose the Progressive Abstraction Refinement for Sparse Sampling (PARSS) algorithm, which begins by performing sparse sampling with a coarse state abstraction and then refines the abstraction progressively to make it more accurate. The PARSS algorithm provides the same formal guarantees as ordinary sparse sampling, and we show experimentally that PARSS outperforms sparse sampling both in the ground state space and with fixed uninformed abstractions.
Our third contribution is an extension of the progressive refinement idea to incorporate other kinds of abstraction. For this purpose, we introduce the formalism of abstraction diagrams (ADs) and show that ADs can express diverse kinds of abstraction, including state abstraction, temporal abstraction, and action pruning. We then describe refinement operators for ADs, extending the progressive refinement search framework to abstractions represented as ADs.
Our fourth and final contribution is an application of online planning algorithms to the problem of controlling electrical transmission grids to mitigate the effects of equipment failures. Our work in this area is distinguished by the use of a full dynamical model of the power grid, which captures more mechanisms of cascading failure than simpler models. Because of the computational cost of the simulation, we choose simple online planning algorithms that require a small number of simulation trajectories. Our results demonstrate the superiority of the online planning approach over fixed expert policies, while also highlighting the need for faster simulators to enable more sophisticated solution algorithms.
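To illustrate the statistics-sharing half of planning with a state abstraction (a generic sketch under my own assumptions, not the PARSS algorithm), the snippet below runs UCT with visit counts and value estimates keyed by an abstract state phi(s), so all ground states in the same aggregate pool their samples:

```python
import math
import random
from collections import defaultdict

def abstract_uct_search(root, simulate, phi, n_actions,
                        num_iters=1000, c=1.4, gamma=0.95, depth=20):
    """UCT with statistics keyed by (phi(state), action): all ground states in
    the same aggregate share visit counts and value estimates.
    Assumed interface: simulate(state, action) -> (next_state, reward, done)
    is a generative model, and phi is a fixed state-aggregation function."""
    N = defaultdict(int)    # visits per (abstract state, action)
    Q = defaultdict(float)  # mean return per (abstract state, action)

    def ucb_action(s):
        z = phi(s)
        total = sum(N[(z, a)] for a in range(n_actions)) + 1
        return max(range(n_actions),
                   key=lambda a: Q[(z, a)]
                   + c * math.sqrt(math.log(total) / (N[(z, a)] + 1)))

    def rollout(s, d):
        if d == 0:
            return 0.0
        z = phi(s)
        visited = any(N[(z, a)] > 0 for a in range(n_actions))
        a = ucb_action(s) if visited else random.randrange(n_actions)
        s2, r, done = simulate(s, a)
        ret = r if done else r + gamma * rollout(s2, d - 1)
        N[(z, a)] += 1
        Q[(z, a)] += (ret - Q[(z, a)]) / N[(z, a)]  # incremental mean
        return ret

    for _ in range(num_iters):
        rollout(root, depth)
    return max(range(n_actions), key=lambda a: Q[(phi(root), a)])
```

Progressive refinement, as in PARSS, would start from a coarse phi and split aggregates as evidence accumulates; the fixed-phi version above shows only the sample-sharing mechanism.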
Model-based reinforcement learning and navigation in animals and machines
For decades, neuroscientists and psychologists have observed that animal performance on spatial navigation tasks suggests an internal learned map of the environment. More recently, map-based (or model-based) reinforcement learning has become a highly active research area in machine learning. With a learned model of their environment, both animals and artificial agents can generalize between tasks and learn rapidly. In this thesis, I present approaches for developing efficient model-based behaviour in machines and for explaining model-based behaviour in animals.
From a neuroscience perspective, I focus on the hippocampus, believed to be a major substrate of model-based behaviour in the brain. I consider how hippocampal connectivity enables path-finding between different locations in an environment. The model describes how environments with boundaries and barriers can be represented in recurrent neural networks (i.e. attractor networks), and how the transient activity in these networks, after being stimulated with a goal location, could be used to determine a path to the goal. I also propose how the connectivity of these map-like networks can be learned from the spatial firing patterns observed in the input pathway to the hippocampus (i.e. grid cells and border cells).
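The path-finding mechanism can be caricatured in a few lines (a toy diffusion sketch under my own assumptions, not the thesis model): encode the environment's adjacency in recurrent weights, stimulate the goal unit, let transient activity spread, then climb the activity gradient from the start location:

```python
import numpy as np

def plan_path(adjacency, start, goal, steps=50, decay=0.8):
    """Toy attractor/diffusion planner. `adjacency` is a symmetric 0/1 matrix
    over place units (recurrent connectivity); activity spreading from the
    stimulated goal unit forms a gradient that leads back to the goal.
    Assumes the graph is connected."""
    n = adjacency.shape[0]
    activity = np.zeros(n)
    activity[goal] = 1.0  # stimulate the goal location
    for _ in range(steps):
        spread = decay * (adjacency * activity).max(axis=1)  # strongest neighbour signal
        activity = np.maximum(activity, spread)  # activity ~ decay^(distance to goal)
    # Greedy ascent on the activity gradient from start to goal.
    path, current = [start], start
    while current != goal:
        neighbours = np.flatnonzero(adjacency[current])
        current = max(neighbours, key=lambda j: activity[j])
        path.append(int(current))
    return path
```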
From a machine learning perspective, I describe a reinforcement learning model that integrates model-based methods and "episodic control", an approach to reinforcement learning based on episodic memory. In episodic control, the agent learns how to act in the environment by storing snapshot-like memories of its observations, then comparing its current observations to similar snapshot memories in which it took an action that resulted in high reward. In our approach, the agent augments these real-world memories with episodes simulated offline using a learned model of the environment. These "simulated memories" allow the agent to adapt faster when the reward locations change.
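A minimal sketch of this loop (the nearest-neighbour scoring and the model.rollout interface are illustrative assumptions, not the thesis implementation):

```python
import numpy as np

class EpisodicController:
    """Toy episodic control: store (observation, action, return) snapshots and
    act by nearest-neighbour lookup over the stored memories."""

    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.memory = []  # list of (obs_vector, action, return) tuples

    def store(self, obs, action, ret):
        self.memory.append((np.asarray(obs, dtype=float), action, ret))

    def act(self, obs, k=5):
        obs = np.asarray(obs, dtype=float)
        values = []
        for a in range(self.n_actions):
            entries = [(np.linalg.norm(o - obs), r)
                       for o, act, r in self.memory if act == a]
            if not entries:
                values.append(np.inf)  # untried action: be optimistic
                continue
            entries.sort(key=lambda dr: dr[0])  # nearest memories first
            values.append(np.mean([r for _, r in entries[:k]]))
        return int(np.argmax(values))

def augment_with_simulated_episodes(controller, model, policy, n_episodes=100):
    """Offline: roll out a learned model and store imagined transitions as if
    they were real. `model.rollout(policy)` is assumed to yield
    (obs, action, return) triples from one simulated episode."""
    for _ in range(n_episodes):
        for obs, action, ret in model.rollout(policy):
            controller.store(obs, action, ret)
```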
Next, I describe Variational State Tabulation (VaST), a model-based method for learning quickly with continuous and high-dimensional observations (like those found in 3D navigation tasks). The VaST agent learns to map its observations to a limited number of discrete abstract states and builds a transition model over those abstract states. The long-term values of different actions in each state are updated continuously and efficiently in the background as the agent explores the environment. I show how the VaST agent can learn faster than other state-of-the-art algorithms, even changing its policy after a single new experience, and how it can respond quickly to changing rewards in complex 3D environments.
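The separation VaST exploits, a mapping from observations to discrete states plus cheap tabular updates running in the background, can be sketched as follows (the encoder, count-based model, and sweep schedule are simplified assumptions, not the published method):

```python
from collections import defaultdict

class DiscreteModelAgent:
    """Toy VaST-flavoured agent: observations are mapped to discrete abstract
    states by some encoder; a count-based transition/reward model over those
    states is re-solved in the background by value-iteration sweeps."""

    def __init__(self, encode, n_actions, gamma=0.99):
        self.encode, self.n_actions, self.gamma = encode, n_actions, gamma
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': n}
        self.rsum = defaultdict(float)                       # (s, a) -> total reward
        self.Q = defaultdict(float)

    def observe(self, obs, action, reward, next_obs):
        s, s2 = self.encode(obs), self.encode(next_obs)
        self.counts[(s, action)][s2] += 1
        self.rsum[(s, action)] += reward

    def background_sweep(self):
        # One value-iteration sweep over the learned tabular model; cheap enough
        # to run continuously while the agent explores. A change in rewards only
        # needs fresh sweeps, not re-learning the observation-to-state mapping.
        for (s, a), nexts in self.counts.items():
            n = sum(nexts.values())
            exp_next = sum(cnt * max(self.Q[(s2, b)] for b in range(self.n_actions))
                           for s2, cnt in nexts.items()) / n
            self.Q[(s, a)] = self.rsum[(s, a)] / n + self.gamma * exp_next
```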
The models I present allow the agent to rapidly adapt to changing goals and rewards, a key component of intelligence. They use a combination of features attributed to model-based and episodic controllers, suggesting that the division between the two fields is not strict. I therefore also consider the consequences of these findings for theories of model-based learning, episodic control, and hippocampal function.