7 research outputs found

    Monte Carlo Bayesian Reinforcement Learning

    Bayesian reinforcement learning (BRL) encodes prior knowledge of the world in a model and represents uncertainty in model parameters by maintaining a probability distribution over them. This paper presents Monte Carlo BRL (MC-BRL), a simple and general approach to BRL. MC-BRL samples a priori a finite set of hypotheses for the model parameter values and forms a discrete partially observable Markov decision process (POMDP) whose state space is a cross product of the state space for the reinforcement learning task and the sampled model parameter space. The POMDP does not require conjugate distributions for belief representation, as earlier works do, and can be solved relatively easily with point-based approximation algorithms. MC-BRL naturally handles both fully and partially observable worlds. Theoretical and experimental results show that the discrete POMDP approximates the underlying BRL task well with guaranteed performance. Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)
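
    The construction described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Bernoulli parameter, the uniform prior, the task state names, and the helper names `sample_hypotheses` and `update_belief` are all assumptions made for illustration. The sketch samples a finite set of parameter hypotheses up front, builds the cross-product state space, and keeps a discrete belief over the hypotheses, which is why no conjugate prior is needed.

```python
import random

def sample_hypotheses(k, rng):
    """Draw k candidate values of an unknown success probability (uniform prior, assumed)."""
    return [rng.random() for _ in range(k)]

def update_belief(belief, hypotheses, success):
    """Bayes update of a discrete belief over the sampled hypotheses after one outcome."""
    likelihoods = [theta if success else 1.0 - theta for theta in hypotheses]
    unnormalized = [b * l for b, l in zip(belief, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

rng = random.Random(0)
hypotheses = sample_hypotheses(5, rng)

# POMDP state space: cross product of task states (hypothetical names) and hypothesis indices.
task_states = ["s0", "s1"]
pomdp_states = [(s, i) for s in task_states for i in range(len(hypotheses))]

# The hypothesis index is the hidden part of the state; the belief tracks it.
belief = [1.0 / len(hypotheses)] * len(hypotheses)
true_theta = 0.8  # unknown to the learner, used here only to simulate outcomes
for _ in range(200):
    belief = update_belief(belief, hypotheses, rng.random() < true_theta)
```

    A point-based POMDP solver would then plan over `pomdp_states`; that step is omitted here.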

    Probably Approximately Correct MDP Learning and Control With Temporal Logic Constraints

    We consider synthesis of control policies that maximize the probability of satisfying given temporal logic specifications in unknown, stochastic environments. We model the interaction between the system and its environment as a Markov decision process (MDP) with initially unknown transition probabilities. The solution we develop builds on the so-called model-based probably approximately correct Markov decision process (PAC-MDP) methodology. The algorithm attains an $\varepsilon$-approximately optimal policy with probability $1-\delta$ using samples (i.e. observations), time and space that grow polynomially with the size of the MDP, the size of the automaton expressing the temporal logic specification, $\frac{1}{\varepsilon}$, $\frac{1}{\delta}$ and a finite time horizon. In this approach, the system maintains a model of the initially unknown MDP, and constructs a product MDP based on its learned model and the specification automaton that expresses the temporal logic constraints. During execution, the policy is iteratively updated using observation of the transitions taken by the system. The iteration terminates in finitely many steps. With high probability, the resulting policy is such that, for any state, the difference between the probability of satisfying the specification under this policy and the optimal one is within a predefined bound. Comment: 9 pages, 5 figures, Accepted by 2014 Robotics: Science and Systems (RSS)
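
    The guarantee stated in this abstract can be written compactly as follows. The notation is assumed for illustration and is not taken from the abstract: $S$ is the state set, $\varphi$ the temporal logic specification, $\pi$ the policy returned by the algorithm, $\pi^{*}$ an optimal policy, and $P^{\pi}(s \models \varphi)$ the probability of satisfying $\varphi$ from state $s$ under $\pi$.

```latex
\Pr\Bigl[\,\forall s \in S:\;
  \bigl|\, P^{\pi^{*}}(s \models \varphi) - P^{\pi}(s \models \varphi) \,\bigr| \le \varepsilon
\Bigr] \;\ge\; 1 - \delta
```

    Read this way, $\varepsilon$ is the predefined bound on the per-state gap and $1-\delta$ is the "high probability" with which the bound holds.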

    Efficient methods for near-optimal sequential decision making under uncertainty

    This chapter discusses decision making under uncertainty. More specifically, it offers an overview of efficient Bayesian and distribution-free algorithms for making near-optimal sequential decisions under uncertainty about the environment. Due to the uncertainty, such algorithms must not only learn from their interaction with the environment but also perform as well as possible while learning is taking place. © 2010 Springer-Verlag Berlin Heidelberg

    Using linear programming for Bayesian exploration in Markov Decision Processes

    A key problem in reinforcement learning is finding a good balance between the need to explore the environment and the need to gain rewards by exploiting existing knowledge. Much research has been devoted to this topic, and many of the proposed methods are aimed simply at ensuring that enough samples are gathered to estimate the value function well. In contrast, [Bellman and Kalaba, 1959] proposed constructing a representation in which the states of the original system are paired with knowledge about the current model. Hence, knowledge about the possible Markov models of the environment is represented and maintained explicitly. Unfortunately, this approach is intractable except for bandit problems (where it gives rise to Gittins indices, an optimal exploration method). In this paper, we explore ideas for making this method computationally tractable. We maintain a model of the environment as a Markov Decision Process. We sample finite-length trajectories from the infinite tree using ideas based on sparse sampling. Finding the values of the nodes of this sparse subtree can then be expressed as an optimization problem, which we solve using Linear Programming. We illustrate this approach on a few domains and compare it with other exploration algorithms.
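
    The idea of pairing states with knowledge about the model can be made concrete with a small sketch for a Bernoulli bandit. It is not the paper's method: it uses exact finite-horizon recursion rather than sparse sampling or linear programming, and the Beta(1, 1) prior, the count-based hyperstate, and the function name `value` are assumptions made for illustration.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def value(counts, horizon):
    """Optimal expected total reward from a hyperstate.

    counts is a tuple of (successes, failures) pairs, one per arm, observed on
    top of an assumed Beta(1, 1) prior; horizon is the number of pulls left.
    """
    if horizon == 0:
        return 0.0
    best = 0.0
    for arm, (s, f) in enumerate(counts):
        p = (s + 1) / (s + f + 2)  # posterior mean success probability of this arm
        win = counts[:arm] + ((s + 1, f),) + counts[arm + 1:]
        lose = counts[:arm] + ((s, f + 1),) + counts[arm + 1:]
        # Expected reward now plus the value of the updated knowledge state.
        q = p * (1.0 + value(win, horizon - 1)) + (1.0 - p) * value(lose, horizon - 1)
        best = max(best, q)
    return best

fresh = ((0, 0), (0, 0))   # two arms, nothing observed yet
two_step = value(fresh, 2)  # exceeds 1.0 because the first pull also informs the second
```

    The number of reachable hyperstates grows quickly with the horizon and the number of arms, which is the tractability problem that sampling-based approximations aim to address.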

    Compact parametric models for efficient sequential decision making in high-dimensional, uncertain domains

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. Cataloged from PDF version of thesis. Includes bibliographical references (p. 137-144). Within artificial intelligence and robotics there is considerable interest in how a single agent can autonomously make sequential decisions in large, high-dimensional, uncertain domains. This thesis presents decision-making algorithms for maximizing the expected sum of future rewards in two types of large, high-dimensional, uncertain situations: when the agent knows its current state but does not have a model of the world dynamics within a Markov decision process (MDP) framework, and in partially observable Markov decision processes (POMDPs), when the agent knows the dynamics and reward models, but only receives information about its state through its potentially noisy sensors. One of the key challenges in the sequential decision making field is the tradeoff between optimality and tractability. To handle high-dimensional (many variables), large (many potential values per variable) domains, an algorithm must have a computational complexity that scales gracefully with the number of dimensions. However, many prior approaches achieve such scalability through the use of heuristic methods with limited or no guarantees on how close to optimal, and under what circumstances, are the decisions made by the algorithm. Algorithms that do provide rigorous optimality bounds often do so at the expense of tractability. This thesis proposes that the use of parametric models of the world dynamics, rewards and observations can enable efficient, provably close to optimal, decision making in large, high-dimensional uncertain environments. In support of this, we present a reinforcement learning (RL) algorithm where the use of a parametric model allows the algorithm to make close to optimal decisions on all but a number of samples that scales polynomially with the dimension, a significant improvement over most prior RL provably approximately optimal algorithms. We also show that parametric models can be used to reduce the computational complexity from an exponential to polynomial dependence on the state dimension in forward search partially observable MDP planning. Under mild conditions our new forward-search POMDP planner maintains prior optimality guarantees on the resulting decisions. We present experimental results on a robot navigation over varying terrain RL task and a large global driving POMDP planning simulation. by Emma Patricia Brunskill. Ph.D.