409 research outputs found
Solving Factored MDPs with Hybrid State and Action Variables
Efficient representations and solutions for large decision problems with
continuous and discrete variables are among the most important challenges faced
by the designers of automated decision support systems. In this paper, we
describe a novel hybrid factored Markov decision process (MDP) model that
allows for a compact representation of these problems, and a new hybrid
approximate linear programming (HALP) framework that permits their efficient
solutions. The central idea of HALP is to approximate the optimal value
function by a linear combination of basis functions and optimize its weights by
linear programming. We analyze both theoretical and computational aspects of
this approach, and demonstrate its scale-up potential on several hybrid
optimization problems
Planning in Hybrid Structured Stochastic Domains
Efficient representations and solutions for large structured decision problems with continuous and discrete variables are among the important challenges faced by the designers of automated decision support systems. In this work, we describe a novel hybrid factored Markov decision process (MDP) model that allows for a compact representation of these problems, and a hybrid approximate linear programming (HALP) framework that permits their efficient solutions. The central idea of HALP is to approximate the optimal value function of an MDP by a linear combination of basis functions and optimize its weights by linear programming. We study both theoretical and practical aspects of this approach, and demonstrate its scale-up potential on several hybrid optimization problems
Global Optimization for Value Function Approximation
Existing value function approximation methods have been successfully used in
many applications, but they often lack useful a priori error bounds. We propose
a new approximate bilinear programming formulation of value function
approximation, which employs global optimization. The formulation provides
strong a priori guarantees on both robust and expected policy loss by
minimizing specific norms of the Bellman residual. Solving a bilinear program
optimally is NP-hard, but this is unavoidable because the Bellman-residual
minimization itself is NP-hard. We describe and analyze both optimal and
approximate algorithms for solving bilinear programs. The analysis shows that
this algorithm offers a convergent generalization of approximate policy
iteration. We also briefly analyze the behavior of bilinear programming
algorithms under incomplete samples. Finally, we demonstrate that the proposed
approach can consistently minimize the Bellman residual on simple benchmark
problems
Reinforcement Learning with Parameterized Actions
We introduce a model-free algorithm for learning in Markov decision processes
with parameterized actions-discrete actions with continuous parameters. At each
step the agent must select both which action to use and which parameters to use
with that action. We introduce the Q-PAMDP algorithm for learning in these
domains, show that it converges to a local optimum, and compare it to direct
policy search in the goal-scoring and Platform domains.Comment: Accepted for AAAI 201
Factored temporal difference learning in the new ties environment
Although reinforcement learning is a popular method for training an agent for decision making based on rewards, well studied tabular methods are not applicable for large, realistic problems. In this paper, we experiment with a factored version of temporal difference learning, which boils down to a linear function approximation scheme utilising natural features coming from the structure of the task. We conducted experiments in the New Ties environment, which is a novel platform for multi-agent simulations. We show that learning utilising a factored representation is effective even in large state spaces, furthermore it outperforms tabular methods even in smaller problems both in learning speed and stability, because of its generalisation capabilities
Factored Value Iteration Converges
In this paper we propose a novel algorithm, factored value iteration (FVI),
for the approximate solution of factored Markov decision processes (fMDPs). The
traditional approximate value iteration algorithm is modified in two ways. For
one, the least-squares projection operator is modified so that it does not
increase max-norm, and thus preserves convergence. The other modification is
that we uniformly sample polynomially many samples from the (exponentially
large) state space. This way, the complexity of our algorithm becomes
polynomial in the size of the fMDP description length. We prove that the
algorithm is convergent. We also derive an upper bound on the difference
between our approximate solution and the optimal one, and also on the error
introduced by sampling. We analyze various projection operators with respect to
their computation complexity and their convergence when combined with
approximate value iteration.Comment: 17 pages, 1 figur
- …