
    Self-Optimizing and Pareto-Optimal Policies in General Environments based on Bayes-Mixtures

    The problem of making sequential decisions in unknown probabilistic environments is studied. In cycle $t$, action $y_t$ results in perception $x_t$ and reward $r_t$, where all quantities may in general depend on the complete history. The perception $x_t$ and reward $r_t$ are sampled from the (reactive) environmental probability distribution $\mu$. This very general setting includes, but is not limited to, (partially observable, $k$-th order) Markov decision processes. Sequential decision theory tells us how to act in order to maximize the total expected reward, called value, if $\mu$ is known. Reinforcement learning is usually used if $\mu$ is unknown. In the Bayesian approach one defines a mixture distribution $\xi$ as a weighted sum of distributions $\nu\in\mathcal{M}$, where $\mathcal{M}$ is any class of distributions including the true environment $\mu$. We show that the Bayes-optimal policy $p^\xi$ based on the mixture $\xi$ is self-optimizing in the sense that the average value converges asymptotically for all $\mu\in\mathcal{M}$ to the optimal value achieved by the (infeasible) Bayes-optimal policy $p^\mu$, which knows $\mu$ in advance. We show that the necessary condition, that $\mathcal{M}$ admits self-optimizing policies at all, is also sufficient. No other structural assumptions are made on $\mathcal{M}$. As an example application, we discuss ergodic Markov decision processes, which allow for self-optimizing policies. Furthermore, we show that $p^\xi$ is Pareto-optimal in the sense that there is no other policy yielding higher or equal value in all environments $\nu\in\mathcal{M}$ and a strictly higher value in at least one. Comment: 15 pages
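
The mixture construction in the abstract above can be illustrated with a minimal sketch. It is not taken from the paper: it assumes a finite class of three passive i.i.d. Bernoulli environments (no actions or rewards), a uniform prior, and hypothetical helper names `likelihood`, `mixture`, and `posterior`. It only shows how a weighted sum of distributions yields a mixture probability and posterior weights over the class.

```python
from fractions import Fraction

def likelihood(theta, xs):
    """Probability of the binary sequence xs under an i.i.d. Bernoulli(theta) environment."""
    p = Fraction(1)
    for x in xs:
        p *= theta if x == 1 else 1 - theta
    return p

def mixture(prior, thetas, xs):
    """Mixture probability: the prior-weighted sum of the class members' likelihoods."""
    return sum(w * likelihood(t, xs) for w, t in zip(prior, thetas))

def posterior(prior, thetas, xs):
    """Posterior weight of each environment after observing xs (Bayes' rule)."""
    xi = mixture(prior, thetas, xs)
    return [w * likelihood(t, xs) / xi for w, t in zip(prior, thetas)]

# Assumed toy class: three coin biases with a uniform prior.
thetas = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
prior = [Fraction(1, 3)] * 3
xs = [1, 1, 1, 0]  # hypothetical observation sequence

print(mixture(prior, thetas, xs))
print(posterior(prior, thetas, xs))
```

In this toy run the posterior weight shifts toward the bias that best explains the data. The reactive setting of the abstract additionally conditions every probability on the agent's past actions, which this sketch omits.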

    Geometry of Policy Improvement

    We investigate the geometry of optimal memoryless time independent decision making in relation to the amount of information that the acting agent has about the state of the system. We show that the expected long term reward, discounted or per time step, is maximized by policies that randomize among at most $k$ actions whenever at most $k$ world states are consistent with the agent's observation. Moreover, we show that the expected reward per time step can be studied in terms of the expected discounted reward. Our main tool is a geometric version of the policy improvement lemma, which identifies a polyhedral cone of policy changes in which the state value function increases for all states. Comment: 8 pages
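
For orientation, the classical (non-geometric) policy improvement lemma that the abstract above builds on can be checked numerically. The sketch below is an assumption-laden illustration, not the paper's construction: it uses a made-up fully observed MDP with two states and two actions, a discount factor of 0.9, and hypothetical helper names `q_values`, `evaluate`, and `greedy`.

```python
GAMMA = 0.9  # assumed discount factor

# Assumed toy MDP: P[s][a][t] is the probability of moving from state s to t under action a.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
# R[s][a]: being in state 1 pays reward 1, being in state 0 pays 0.
R = [[0.0, 0.0], [1.0, 1.0]]

def q_values(V):
    """Action values: immediate reward plus discounted expected next-state value."""
    return [[R[s][a] + GAMMA * sum(p * v for p, v in zip(P[s][a], V))
             for a in range(2)] for s in range(2)]

def evaluate(policy, sweeps=2000):
    """Iterative policy evaluation for a stochastic policy given as policy[s][a]."""
    V = [0.0, 0.0]
    for _ in range(sweeps):
        Q = q_values(V)
        V = [sum(policy[s][a] * Q[s][a] for a in range(2)) for s in range(2)]
    return V

def greedy(V):
    """Deterministic policy that picks the action with the highest value in each state."""
    Q = q_values(V)
    policy = []
    for s in range(2):
        best = max(range(2), key=lambda a: Q[s][a])
        policy.append([1.0 if a == best else 0.0 for a in range(2)])
    return policy

uniform = [[0.5, 0.5], [0.5, 0.5]]
V_old = evaluate(uniform)
improved = greedy(V_old)
V_new = evaluate(improved)
print(V_old, improved, V_new)
```

Here the greedy step moves the uniform policy to a deterministic one and raises the value in both states. With full observability exactly one world state is consistent with each observation, so a deterministic optimum matches the abstract's bound with $k = 1$.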