Search CORE

1,399 research outputs found

Deep Ordinal Reinforcement Learning

Author: C Wirth
CJ Watkins
RS Sutton
V Mnih
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 11/07/2019

Reinforcement learning usually makes use of numerical rewards, which have nice properties but also come with drawbacks and difficulties. Using rewards on an ordinal scale (ordinal rewards) is an alternative to numerical rewards that has received more attention in recent years. In this paper, a general approach to adapting reinforcement learning problems to the use of ordinal rewards is presented and motivated. We show how to convert common reinforcement learning algorithms to an ordinal variation by the example of Q-learning and introduce Ordinal Deep Q-Networks, which adapt deep reinforcement learning to ordinal rewards. Additionally, we run evaluations on problems provided by the OpenAI Gym framework, showing that our ordinal variants exhibit a performance that is comparable to the numerical variations for a number of problems. We also give first evidence that our ordinal variant is able to produce better results for problems with less engineered and simpler-to-design reward signals. Comment: replaced figures for better visibility, added github repository, more details about source of experimental results, updated target value calculation for standard and ordinal Deep Q-Networ

arXiv.org e-Print Archive

Crossref

Assessing the Potential of Classical Q-learning in General Game Playing

Author: CB Browne
CJCH Watkins
CP Robert
D Silver
H Wang
J Hu
J Méhat
M Genesereth
M Świechowski
RS Sutton
V Mnih
Publication date: 14/10/2018

After the recent groundbreaking results of AlphaGo and AlphaZero, we have seen strong interest in deep reinforcement learning and artificial general intelligence (AGI) in game playing. However, deep learning is resource-intensive and the theory is not yet well developed. For small games, simple classical table-based Q-learning might still be the algorithm of choice. General Game Playing (GGP) provides a good testbed for reinforcement learning to research AGI. Q-learning is one of the canonical reinforcement learning methods, and has been used by (Banerjee & Stone, IJCAI 2007) in GGP. In this paper we implement Q-learning in GGP for three small-board games (Tic-Tac-Toe, Connect Four, Hex; source code: https://github.com/wh1992v/ggp-rl), to allow comparison to Banerjee et al. We find that Q-learning converges to a high win rate in GGP. For the ε-greedy strategy, we propose a first enhancement, the dynamic ε algorithm. In addition, inspired by (Gelly & Silver, ICML 2007) we combine online search (Monte Carlo Search) to enhance offline learning, and propose QM-learning for GGP. Both enhancements improve the performance of classical Q-learning. In this work, GGP allows us to show, if augmented by appropriate enhancements, that classical table-based Q-learning can perform well in small games. Comment: arXiv admin note: substantial text overlap with arXiv:1802.0594
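
As a rough illustration of the ingredients named in this abstract, the sketch below shows a tabular Q-learning update together with an ε-greedy policy whose ε changes over training. The abstract does not say how the dynamic ε is computed, so the linear decay in `epsilon_at`, the function names, and all default parameter values are assumptions made for illustration, not the authors' method.

```python
import random

def epsilon_at(episode, total_episodes, start=0.5, end=0.0):
    """Exploration rate for a given episode.

    Assumption: a linear decay from `start` to `end`; the abstract
    does not specify the actual dynamic-epsilon schedule.
    """
    frac = min(episode / max(total_episodes - 1, 1), 1.0)
    return start + (end - start) * frac

def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, else act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])

def q_update(q, state, action, reward, next_state, done, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: Q(s,a) += alpha * (target - Q(s,a))."""
    best_next = 0.0 if done else max(q[next_state])
    target = reward + gamma * best_next
    q[state][action] += alpha * (target - q[state][action])
```

Here `q` is a hypothetical list of per-state action-value lists; a real GGP agent would key the table on game states instead.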

arXiv.org e-Print Archive

Crossref

Leiden University Scholary Publications

Multi-agent Hierarchical Reinforcement Learning with Dynamic Termination

Author: C Watkins
G Tesauro
M Giannakis
M Riedmiller
NR Jennings
P Stone
RS Sutton
TG Dietterich
V Lesser
V Mnih
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 21/10/2019

In a multi-agent system, an agent's optimal policy will typically depend on the policies chosen by others. Therefore, a key issue in multi-agent systems research is that of predicting the behaviours of others, and responding promptly to changes in such behaviours. One obvious possibility is for each agent to broadcast their current intention, for example, the currently executed option in a hierarchical reinforcement learning framework. However, this approach results in inflexibility of agents if options have an extended duration and are dynamic. While adjusting the executed option at each step improves flexibility from a single-agent perspective, frequent changes in options can induce inconsistency between an agent's actual behaviour and its broadcast intention. In order to balance flexibility and predictability, we propose a dynamic termination Bellman equation that allows the agents to flexibly terminate their options. We evaluate our model empirically on a set of multi-agent pursuit and taxi tasks, and show that our agents learn to adapt flexibly across scenarios that require different termination behaviours. Comment: PRICAI 201

arXiv.org e-Print Archive

Crossref

Learning from Monte Carlo Rollouts with Opponent Models for Playing Tron

Author: AL Samuel
CJ Watkins
D Silver
G Tesauro
J Baxter
J Schmidhuber
L Kocsis
M Otterlo van
RS Sutton
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 30/12/2018

This paper describes a novel reinforcement learning system for learning to play the game of Tron. The system combines Q-learning, multi-layer perceptrons, vision grids, opponent modelling, and Monte Carlo rollouts in a novel way. By learning an opponent model, Monte Carlo rollouts can be effectively applied to generate state trajectories for all possible actions from which improved action estimates can be computed. This makes it possible to extend experience replay so that the state-action values of all actions in a given game state are updated simultaneously. The results show that experience replay that updates the Q-values of all actions simultaneously strongly outperforms conventional experience replay, which only updates the Q-value of the performed action. The results also show that using short or long rollout horizons during training leads to similarly good performance against two fixed opponents.
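
The contrast between the two replay variants described in this abstract can be sketched as follows. This is a simplified tabular illustration only: the paper uses multi-layer perceptrons, and the function names, the `alpha` step size, and the idea of passing in precomputed per-action targets are assumptions, not the authors' implementation.

```python
def update_performed_action(q_row, action, target, alpha=0.5):
    """Conventional replay: move only the performed action's value toward its target."""
    new_row = list(q_row)
    new_row[action] += alpha * (target - new_row[action])
    return new_row

def update_all_actions(q_row, targets, alpha=0.5):
    """All-action replay: move every action's value toward its own target.

    `targets` is assumed to hold one rollout-based estimate per action.
    """
    return [q + alpha * (t - q) for q, t in zip(q_row, targets)]
```

With a row of three action values, the first function changes a single entry per replayed sample, while the second changes all three, which is the difference the abstract credits for the improved results.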

Crossref

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

Dissertations of the University of Groningen

Pseudorehearsal in value function approximation

Author: A Robins
B Baddeley
CJ Watkins
J Gama
JL McClelland
JN Tsitsiklis
KP Murphy
M Frean
M Hattori
M McCloskey
R Coop
R Ratcliff
RJ Williams
RM French
RS Sutton
S Adam
Publication date: 21/03/2017

Catastrophic forgetting is of special importance in reinforcement learning, as the data distribution is generally non-stationary over time. We study and compare several pseudorehearsal approaches for Q-learning with function approximation in a pole balancing task. We have found that pseudorehearsal seems to assist learning even in such very simple problems, given proper initialization of the rehearsal parameters

arXiv.org e-Print Archive

Crossref

Synthesis of novel thieno[3,2-b]thienobis(silolothiophene) based low bandgap polymers for organic photovoltaics

Author: Anthopoulos TD
Ashraf RS
Biniek L
Durrant JR
Huang Z
McCulloch I
Nielsen CB
Schroeder BC
Thomas S
Tuladhar PS
Watkins SE
White AJP
Zhang W
Publication venue: 'Royal Society of Chemistry (RSC)'
Publication date: 01/01/2012

Thieno[3,2-b]thienobis(silolothiophene), a new electron-rich hexacyclic monomer, has been synthesized and incorporated into three novel donor–acceptor low-bandgap polymers. By carefully choosing the acceptor co-monomer, the energy levels of the polymers could be modulated and high power conversion efficiencies of 5.52% were reached in OPV devices.

UCL Discovery

Spiral - Imperial College Digital Repository

Hal-Diderot

Probabilistic inference for determining options in reinforcement learning

Author: Christian Daniel
Christopher M Bishop
CJCH Watkins
E Theodorou
Gerhard Neumann
Herke van Hoof
J Morimoto
Jan Peters
LE Baum
M Lagoudakis
ML Puterman
RS Sutton
TG Dietterich
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

Tasks that require many sequential decisions or complex solutions are hard to solve using conventional reinforcement learning algorithms. Based on the semi Markov decision process setting (SMDP) and the option framework, we propose a model which aims to alleviate these concerns. Instead of learning a single monolithic policy, the agent learns a set of simpler sub-policies as well as the initiation and termination probabilities for each of those sub-policies. While existing option learning algorithms frequently require manual specification of components such as the sub-policies, we present an algorithm which infers all relevant components of the option framework from data. Furthermore, the proposed approach is based on parametric option representations and works well in combination with current policy search methods, which are particularly well suited for continuous real-world tasks. We present results on SMDPs with discrete as well as continuous state-action spaces. The results show that the presented algorithm can combine simple sub-policies to solve complex tasks and can improve learning performance on simpler tasks

University of Lincoln Institutional Repository

TUbiblio

Crossref

MPG.PuRe

A sense of embodiment is reflected in people's signature size

Author: A Dijksterhuis
A Rawal
AB Warriner
Adhip Rawal
AG Greenwald
AJ Yap
Alessio Avenanti
BR Swanson
Catherine J. Harmer
DV Sheehan
E Watkins
F Strack
FA Cowdrey
GL Wells
J. Mark G. Williams
JM Ackerman
LR Aiken
LW Barsalou
M Häfner
MK Sekar
MM Duguid
P Beumont
PM Niedenthal
Rebecca J. Park
RJ Park
RL Zweigenhaft
RS Friedman
RW Robins
SE Duclos
Ursula D. O'Sullivan
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2014

BACKGROUND: The size of a person's signature may reveal implicit information about how the self is perceived although this has not been closely examined. METHODS/RESULTS: We conducted three experiments to test whether increases in signature size can be induced. Specifically, the aim of these experiments was to test whether changes in signature size reflect a person's current implicit sense of embodiment. Experiment 1 showed that an implicit affect task (positive subliminal evaluative conditioning) led to increases in signature size relative to an affectively neutral task, showing that implicit affective cues alter signature size. Experiments 2 and 3 demonstrated increases in signature size following experiential self-focus on sensory and affective stimuli relative to both conceptual self-focus and external (non-self-focus) in both healthy participants and patients with anorexia nervosa, a disorder associated with self-evaluation and a sense of disembodiment. In all three experiments, increases in signature size were unrelated to changes in self-reported mood and larger than manipulation unrelated variations. CONCLUSIONS: Together, these findings suggest that a person's sense of embodiment is reflected in their signature size

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Oxford University Research Archive

Sussex Research Online

Reinforcement learning or active inference?

This paper questions the need for reinforcement learning or control theory when optimising behaviour. We show that it is fairly simple to teach an agent complicated and adaptive behaviours using a free-energy formulation of perception. In this formulation, agents adjust their internal states and sampling of the environment to minimize their free-energy. Such agents learn causal structure in the environment and sample it in an adaptive and self-supervised fashion. This results in behavioural policies that reproduce those optimised by reinforcement learning and dynamic programming. Critically, we do not need to invoke the notion of reward, value or utility. We illustrate these points by solving a benchmark problem in dynamic programming; namely the mountain-car problem, using active perception or inference under the free-energy principle. The ensuing proof-of-concept may be important because the free-energy formulation furnishes a unified account of both action and perception and may speak to a reappraisal of the role of dopamine in the brain

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

Game theory of mind

Author: A Benveniste
A Ng
A Traulsen
AN Hampton
B Skyrms
CF Camerer
CJCH Watkins
D Fudenberg
D Kahneman
D Wilson
DG Premack
DM Kreps
DO Stahl
E Fehr
E Todorov
H Gintis
HA Simon
HL Gallagher
J Moll
JM Smith
K McCabe
Karl J. Friston
KJ Friston
M Costa-Gomes
P Davies
P Milgrom
PA Haile
PJ Gmytrasiewicz
R Bellman
R McKelvey
Ray J. Dolan
RS Sutton
S Avner
Tim Behrens
U Frith
W Nelson
Wako Yoshida
Publication date: 01/01/2008

This paper introduces a model of ‘theory of mind’, namely, how we represent the intentions and goals of others to optimise our mutual interactions. We draw on ideas from optimum control and game theory to provide a ‘game theory of mind’. First, we consider the representations of goals in terms of value functions that are prescribed by utility or rewards. Critically, the joint value functions and ensuing behaviour are optimised recursively, under the assumption that I represent your value function, your representation of mine, your representation of my representation of yours, and so on ad infinitum. However, if we assume that the degree of recursion is bounded, then players need to estimate the opponent's degree of recursion (i.e., sophistication) to respond optimally. This induces a problem of inferring the opponent's sophistication, given behavioural exchanges. We show it is possible to deduce whether players make inferences about each other and quantify their sophistication on the basis of choices in sequential games. This rests on comparing generative models of choices with, and without, inference. Model comparison is demonstrated using simulated and real data from a ‘stag-hunt’. Finally, we note that exactly the same sophisticated behaviour can be achieved by optimising the utility function itself (through prosocial utility), producing unsophisticated but apparently altruistic agents. This may be relevant ethologically in hierarchal game theory and coevolution

CiteSeerX

Crossref

Directory of Open Access Journals

UCL Discovery

PubMed Central

MPG.PuRe