9 research outputs found
Market-based reinforcement learning in partially observable worlds
Unlike traditional reinforcement learning (RL), market-based RL is in principle applicable to worlds described by partially observable Markov Decision Processes (POMDPs), where an agent needs to learn short-term memories of relevant previous events in order to execute optimal actions. Most previous work, however, has focused on reactive settings (MDPs) instead of POMDPs. Here we reimplement a recent approach to market-based RL and for the first time evaluate it in a toy POMDP setting.This work was supported by SNF grants 21-55409.98 and 2000-61847.0
Smooth markets: A basic mechanism for organizing gradient-based learners
With the success of modern machine learning, it is becoming increasingly
important to understand and control how learning algorithms interact.
Unfortunately, negative results from game theory show there is little hope of
understanding or controlling general n-player games. We therefore introduce
smooth markets (SM-games), a class of n-player games with pairwise zero sum
interactions. SM-games codify a common design pattern in machine learning that
includes (some) GANs, adversarial training, and other recent algorithms. We
show that SM-games are amenable to analysis and optimization using first-order
methods.Comment: 18 pages, 3 figure
Decentralized Reinforcement Learning: Global Decision-Making via Local Economic Transactions
This paper seeks to establish a framework for directing a society of simple,
specialized, self-interested agents to solve what traditionally are posed as
monolithic single-agent sequential decision problems. What makes it challenging
to use a decentralized approach to collectively optimize a central objective is
the difficulty in characterizing the equilibrium strategy profile of
non-cooperative games. To overcome this challenge, we design a mechanism for
defining the learning environment of each agent for which we know that the
optimal solution for the global objective coincides with a Nash equilibrium
strategy profile of the agents optimizing their own local objectives. The
society functions as an economy of agents that learn the credit assignment
process itself by buying and selling to each other the right to operate on the
environment state. We derive a class of decentralized reinforcement learning
algorithms that are broadly applicable not only to standard reinforcement
learning but also for selecting options in semi-MDPs and dynamically composing
computation graphs. Lastly, we demonstrate the potential advantages of a
society's inherent modular structure for more efficient transfer learning.Comment: 18 pages, 13 figures, accepted to the International Conference on
Machine Learning (ICML) 202
Universal Algorithmic Intelligence: A mathematical top->down approach
Sequential decision theory formally solves the problem of rational agents in
uncertain worlds if the true environmental prior probability distribution is
known. Solomonoff's theory of universal induction formally solves the problem
of sequence prediction for unknown prior distribution. We combine both ideas
and get a parameter-free theory of universal Artificial Intelligence. We give
strong arguments that the resulting AIXI model is the most intelligent unbiased
agent possible. We outline how the AIXI model can formally solve a number of
problem classes, including sequence prediction, strategic games, function
minimization, reinforcement and supervised learning. The major drawback of the
AIXI model is that it is uncomputable. To overcome this problem, we construct a
modified algorithm AIXItl that is still effectively more intelligent than any
other time t and length l bounded agent. The computation time of AIXItl is of
the order t x 2^l. The discussion includes formal definitions of intelligence
order relations, the horizon problem and relations of the AIXI theory to other
AI approaches.Comment: 70 page
Policy-Gradient Algorithms for Partially Observable Markov Decision Processes
Partially observable Markov decision processes are interesting because of their ability to model most conceivable real-world learning problems, for example, robot navigation, driving a car, speech recognition, stock trading, and playing games. The downside of this generality is that exact algorithms are computationally intractable. Such computational complexity motivates approximate approaches. One such class of algorithms are the so-called policy-gradient methods from reinforcement learning. They seek to adjust the parameters of an agent in the direction that maximises the long-term average of a reward signal. Policy-gradient methods are attractive as a \emph{scalable} approach for controlling partially observable Markov decision processes (POMDPs). In the most general case POMDP policies require some form of internal state, or memory, in order to act optimally. Policy-gradient methods have shown promise for problems admitting memory-less policies but have been less successful when memory is required. This thesis develops several improved algorithms for learning policies with memory in an infinite-horizon setting. Directly, when the dynamics of the world are known, and via Monte-Carlo methods otherwise. The algorithms simultaneously learn how to act and what to remember. ..
Policy-Gradient Algorithms for Partially Observable Markov Decision Processes
Partially observable Markov decision processes are interesting because of their ability to model most conceivable real-world learning problems, for example, robot navigation, driving a car, speech recognition, stock trading, and playing games. The downside of this generality is that exact algorithms are computationally intractable. Such computational complexity motivates approximate approaches. One such class of algorithms are the so-called policy-gradient methods from reinforcement learning. They seek to adjust the parameters of an agent in the direction that maximises the long-term average of a reward signal. Policy-gradient methods are attractive as a \emph{scalable} approach for controlling partially observable Markov decision processes (POMDPs). In the most general case POMDP policies require some form of internal state, or memory, in order to act optimally. Policy-gradient methods have shown promise for problems admitting memory-less policies but have been less successful when memory is required. This thesis develops several improved algorithms for learning policies with memory in an infinite-horizon setting. Directly, when the dynamics of the world are known, and via Monte-Carlo methods otherwise. The algorithms simultaneously learn how to act and what to remember. ..