6 research outputs found
Context tree maximizing reinforcement learning
Recent developments in reinforcement learning for non-Markovian problems have produced a surge of history-based methods, among which we are particularly interested in two frameworks: ΦMDP and MC-AIXI-CTW. ΦMDP attempts to reduce the general RL problem, where the environment's states and dynamics are both unknown, to an MDP, thereby connecting generic reinforcement learning with classical reinforcement learning; MC-AIXI-CTW instead incrementally learns a mixture of context trees as its environment model. The first implementation of ΦMDP relies on a stochastic search procedure to find a tree that minimizes a certain cost function. Given limited search time, this is not guaranteed to find the minimizing tree, or even a good one, and as a consequence the approach appears to have difficulty with large domains. MC-AIXI-CTW is attractive in that it computes its internal model incrementally and analytically through interaction with the environment; unfortunately, it is computationally demanding because it requires heavy planning simulations at every single time step. We devise a novel approach called CTMRL that finds the cost-minimizing tree analytically and efficiently. Instead of the context-tree weighting method on which MC-AIXI-CTW is based, we use the closely related context-tree maximizing algorithm, which selects a single tree. This places our approach within the ΦMDP framework and allows the costly planning component of MC-AIXI-CTW to be replaced with simple Q-learning. Our empirical investigation shows that CTMRL finds policies as good as MC-AIXI-CTW's on six domains, including a challenging Pac-Man domain, in an order of magnitude less time.
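As a rough illustration of the ΦMDP recipe this abstract builds on: map each interaction history to a derived state, then run ordinary tabular Q-learning on those states. The sketch below is not CTMRL itself; the context-tree-maximizing step is abstracted into a fixed-depth suffix map, and the environment interface, names, and parameters are all illustrative assumptions.

```python
import random
from collections import defaultdict

def phi(history, depth=3):
    """Map a history of actions/observations to a derived state.
    CTMRL would pick the suffix via its context-tree-maximizing
    cost criterion; a fixed-depth suffix stands in for that here."""
    return tuple(history[-depth:])

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    """Standard tabular Q-learning update on the derived state space."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def run(env, actions, steps=10000, epsilon=0.1):
    """env is a hypothetical interface with reset() -> obs and
    step(a) -> (obs, reward)."""
    Q = defaultdict(float)
    history = [env.reset()]
    for _ in range(steps):
        s = phi(history)
        a = (random.choice(actions) if random.random() < epsilon
             else max(actions, key=lambda b: Q[(s, b)]))  # epsilon-greedy
        obs, r = env.step(a)
        history += [a, obs]
        q_update(Q, s, a, r, phi(history), actions)
    return Q
```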
IST Austria Technical Report
We consider partially observable Markov decision processes (POMDPs) with a set of target states, where every transition is associated with an integer cost. The optimization objective we study asks to minimize the expected total cost until the target set is reached, while ensuring that the target set is reached almost-surely (with probability 1). We show that, for integer costs, approximating the optimal cost is undecidable. For positive costs, our results are as follows: (i) we establish matching lower and upper bounds for the optimal cost, and the bound is double exponential; (ii) we show that the problem of approximating the optimal cost is decidable and present approximation algorithms building on existing algorithms for POMDPs with finite-horizon objectives. While the worst-case running time of our algorithm is double exponential, we also present efficient stopping criteria for the algorithm and show experimentally that it performs well in many examples of interest.
Optimal Cost Almost-sure Reachability in POMDPs
We consider partially observable Markov decision processes (POMDPs) with a set of target states, where every transition is associated with an integer cost. The optimization objective we study asks to minimize the expected total cost until the target set is reached, while ensuring that the target set is reached almost-surely (with probability 1). We show that, for integer costs, approximating the optimal cost is undecidable. For positive costs, our results are as follows: (i) we establish matching lower and upper bounds for the optimal cost, and the bound is double exponential; (ii) we show that the problem of approximating the optimal cost is decidable and present approximation algorithms building on existing algorithms for POMDPs with finite-horizon objectives. While the worst-case running time of our algorithm is double exponential, we also present efficient stopping criteria for the algorithm and show experimentally that it performs well in many examples of interest. (Full version of the AAAI 2015 paper.)
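Spelled out in notation of my own choosing (not taken from the paper), the objective is: among the policies σ that reach the target set T almost surely, minimize the expected total cost accumulated before the first visit to T:

```latex
\min_{\sigma \,:\, \Pr^{\sigma}(\lozenge T) = 1} \;
\mathbb{E}^{\sigma}\!\left[ \sum_{t=0}^{\tau_T - 1} c(s_t, a_t) \right],
\qquad \tau_T = \min\{\, t \ge 0 : s_t \in T \,\},
```

where c is the cost attached to each transition. The abstract's dichotomy is then: approximating this value is undecidable for general integer costs, but decidable, with double-exponential bounds, when all costs are positive.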
Online discovery and learning of predictive state representations
Predictive state representations (PSRs) model dynamical systems using only observable quantities, such as actions and observations. A PSR summarizes the system state through predictions about the outcomes of future tests. The best existing techniques for discovering and learning PSRs use a Monte Carlo approach to estimate these outcome probabilities explicitly. In this paper, we present a new discovery and learning algorithm for PSRs that uses gradient descent to compute the predictions for the current state. The algorithm exploits the large amount of structure inherent in a valid prediction matrix to constrain its predictions. Furthermore, the algorithm can be used online by an agent to continually improve its prediction quality, something current state-of-the-art discovery and learning algorithms cannot do. We give empirical results showing that our constrained-gradient algorithm discovers core tests from very small amounts of data and, with larger amounts of data, computes accurate predictions of the system dynamics.
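For concreteness, the state a PSR maintains is a prediction vector over its core tests, advanced by a linear update after each action-observation pair. A minimal NumPy sketch, assuming the standard linear-PSR parameterization (update matrices M_ao and one-step columns m_ao, which a learning procedure such as the paper's would estimate; the names are illustrative, not the paper's API):

```python
import numpy as np

def psr_update(p, M_ao, m_ao):
    """Advance the PSR state after executing action a and observing o.

    p    : prediction vector over the core tests given history h
    M_ao : matrix mapping core-test predictions through (a, o)
    m_ao : column whose inner product with p gives Pr(o | h, a)
    """
    denom = p @ m_ao  # predicted probability of seeing o after a
    if denom <= 0:
        raise ValueError("observation had zero predicted probability")
    p_next = (p @ M_ao) / denom
    # A valid prediction vector lies in [0, 1]; the paper exploits such
    # structure as constraints on its gradient steps, here we merely clip.
    return np.clip(p_next, 0.0, 1.0)
```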