Inductive Policy Selection for First-Order MDPs
We select policies for large Markov Decision Processes (MDPs) with compact
first-order representations. We find policies that generalize well as the
number of objects in the domain grows, potentially without bound. Existing
dynamic-programming approaches based on flat, propositional, or first-order
representations either are impractical here or do not naturally scale as the
number of objects grows without bound. We implement and evaluate an alternative
approach that induces first-order policies using training data constructed by
solving small problem instances using PGraphplan (Blum & Langford, 1999). Our
policies are represented as ensembles of decision lists, using a taxonomic
concept language. This approach extends the work of Martin and Geffner (2000)
to stochastic domains, ensemble learning, and a wider variety of problems.
Empirically, we find "good" policies for several stochastic first-order MDPs
that are beyond the scope of previous approaches. We also discuss the
application of this work to the relational reinforcement-learning problem.
Comment: Appears in Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI 2002).
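To make the policy representation concrete, here is a minimal Python sketch of a decision-list policy and a voting ensemble. The interfaces (make_decision_list, ensemble_policy, and plain boolean tests standing in for taxonomic concepts) are illustrative inventions, not the paper's code.

```python
# Minimal sketch of a decision-list policy ensemble. The paper's
# taxonomic concept language is abstracted here into arbitrary
# boolean tests over a state (hypothetical interface).
from collections import Counter

def make_decision_list(rules, default_action):
    """rules: ordered list of (test, action); test is state -> bool."""
    def policy(state):
        for test, action in rules:
            if test(state):
                return action
        return default_action
    return policy

def ensemble_policy(policies):
    """Combine decision lists by majority vote over their actions."""
    def policy(state):
        votes = Counter(p(state) for p in policies)
        return votes.most_common(1)[0][0]
    return policy

# Toy blocks-world-style example over a state dict.
p1 = make_decision_list([(lambda s: s["holding"], "putdown")], "pickup")
p2 = make_decision_list([(lambda s: not s["holding"], "pickup")], "putdown")
pi = ensemble_policy([p1, p2])
print(pi({"holding": False}))  # -> "pickup"
```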
Exploiting First-Order Regression in Inductive Policy Selection
We consider the problem of computing optimal generalised policies for
relational Markov decision processes. We describe an approach combining some of
the benefits of purely inductive techniques with those of symbolic dynamic
programming methods. The latter reason about the optimal value function using
first-order decision theoretic regression and formula rewriting, while the
former, when provided with a suitable hypotheses language, are capable of
generalising value functions or policies for small instances. Our idea is to
use reasoning and in particular classical first-order regression to
automatically generate a hypotheses language dedicated to the domain at hand,
which is then used as input by an inductive solver. This approach avoids the
more complex reasoning of symbolic dynamic programming while focusing the
inductive solver's attention on concepts that are specifically relevant to the
optimal value function for the domain considered.
Comment: Appears in Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI 2004).
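The idea of generating a hypotheses language by regression can be illustrated at the STRIPS level. The encoding below (literals as strings, actions as precondition/add/delete sets) is a deliberate simplification of the paper's first-order machinery, meant only to show how regressing the goal yields candidate concepts for the inductive learner.

```python
# Simplified sketch: classical goal regression used to seed a
# hypothesis language (STRIPS-style, hypothetical encoding).
def regress(goal, action):
    """Conditions that must hold before `action` so `goal` holds after.
    Returns None if the action deletes part of the goal."""
    if goal & action["del"]:
        return None
    return (goal - action["add"]) | action["pre"]

def hypothesis_language(goal, actions, depth=2):
    """Collect formulas reachable by regressing the goal up to `depth`
    steps; these become candidate concepts for the inductive solver."""
    layers, concepts = [goal], {frozenset(goal)}
    for _ in range(depth):
        new = [g for g in (regress(f, a) for f in layers for a in actions)
               if g is not None]
        concepts.update(frozenset(g) for g in new)
        layers = new
    return concepts

load = {"pre": {"at(obj,loc)", "at(truck,loc)"},
        "add": {"in(obj,truck)"}, "del": {"at(obj,loc)"}}
print(hypothesis_language({"in(obj,truck)"}, [load], depth=1))
```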
Fast Planning in Stochastic Games
Stochastic games generalize Markov decision processes (MDPs) to a multiagent
setting by allowing the state transitions to depend jointly on all player
actions, and having rewards determined by multiplayer matrix games at each
state. We consider the problem of computing Nash equilibria in stochastic
games, the analogue of planning in MDPs. We begin by providing a generalization
of finite-horizon value iteration that computes a Nash strategy for each player
in general-sum stochastic games. The algorithm takes an arbitrary Nash selection
function as input, which allows the translation of local choices between
multiple Nash equilibria into the selection of a single global Nash
equilibrium.
Our main technical result is an algorithm for computing near-Nash equilibria
in large or infinite state spaces. This algorithm builds on our finite-horizon
value iteration algorithm, and adapts the sparse sampling methods of Kearns,
Mansour and Ng (1999) to stochastic games. We conclude by describing a counterexample showing that infinite-horizon discounted value iteration, which was shown by Shapley to converge in the zero-sum case (a result we extend slightly here), does not converge in the general-sum case.
Comment: Appears in Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI 2000).
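A minimal sketch of the finite-horizon value iteration described above, with the Nash selection function passed in as a parameter. The pure-equilibrium selection function used for demonstration is an assumption of this sketch; a full implementation would return mixed equilibria.

```python
# Sketch of finite-horizon value iteration for a two-player
# general-sum stochastic game, parameterized by a Nash selection
# function. Hypothetical interfaces throughout.
import numpy as np

def pure_nash_selection(Q1, Q2):
    """Return ((i, j), (v1, v2)) for some pure Nash equilibrium."""
    for i in range(Q1.shape[0]):
        for j in range(Q1.shape[1]):
            # i is a best response to j, and j to i.
            if Q1[i, j] >= Q1[:, j].max() and Q2[i, j] >= Q2[i, :].max():
                return (i, j), (Q1[i, j], Q2[i, j])
    raise ValueError("no pure equilibrium; need a mixed-strategy solver")

def nash_value_iteration(R1, R2, P, T, select=pure_nash_selection):
    """R1, R2: rewards [S,A1,A2]; P: transitions [S,A1,A2,S]; T: horizon.
    Returns the first-stage policy and value vectors after T backups."""
    S = R1.shape[0]
    V1, V2 = np.zeros(S), np.zeros(S)
    policy = [None] * S
    for _ in range(T):
        nV1, nV2 = np.zeros(S), np.zeros(S)
        for s in range(S):
            Q1 = R1[s] + P[s] @ V1   # stage-game payoff matrices
            Q2 = R2[s] + P[s] @ V2
            policy[s], (nV1[s], nV2[s]) = select(Q1, Q2)
        V1, V2 = nV1, nV2
    return policy, V1, V2
```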
Practical Linear Value-approximation Techniques for First-order MDPs
Recent work on approximate linear programming (ALP) techniques for
first-order Markov Decision Processes (FOMDPs) represents the value function
linearly w.r.t. a set of first-order basis functions and uses linear
programming techniques to determine suitable weights. This approach offers the
advantage that it does not require simplification of the first-order value
function, and allows one to solve FOMDPs independent of a specific domain
instantiation. In this paper, we address several questions to enhance the
applicability of this work: (1) Can we extend the first-order ALP framework to
approximate policy iteration to address performance deficiencies of previous
approaches? (2) Can we automatically generate basis functions and evaluate
their impact on value function quality? (3) How can we decompose intractable
problems with universally quantified rewards into tractable subproblems? We
propose answers to these questions along with a number of novel optimizations
and provide a comparative empirical evaluation on logistics problems from the
ICAPS 2004 Probabilistic Planning Competition.
Comment: Appears in Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence (UAI 2006).
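Question (1) can be illustrated at the ground (propositional) level: each policy-evaluation step is cast as a small LP over basis-function weights, followed by greedy improvement. The function names and the scipy-based formulation below are illustrative stand-ins for the paper's first-order implementation.

```python
# Ground-level analogue of approximate policy iteration with a linear
# value function; policy evaluation is an LP over weights w.
import numpy as np
from scipy.optimize import linprog

def api(Phi, R, P, gamma, policy, iters=10):
    """Phi: [S,K] basis; R: [S,A]; P: [A,S,S]; policy: [S] action indices."""
    S, K = Phi.shape
    A = R.shape[1]
    for _ in range(iters):
        # Evaluate: min sum_s phi(s)w  s.t.  phi(s)w >= R(s,pi(s)) + g*E[phi(s')w]
        rows = np.array([Phi[s] - gamma * P[policy[s], s] @ Phi
                         for s in range(S)])
        b = np.array([R[s, policy[s]] for s in range(S)])
        res = linprog(c=Phi.sum(axis=0), A_ub=-rows, b_ub=-b,
                      bounds=[(None, None)] * K)
        w = res.x
        V = Phi @ w
        # Improve: greedy one-step lookahead under the approximate value.
        Q = R + gamma * np.stack([P[a] @ V for a in range(A)], axis=1)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, w
```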
On the Complexity of Policy Iteration
Decision-making problems in uncertain or stochastic domains are often
formulated as Markov decision processes (MDPs). Policy iteration (PI) is a
popular algorithm for searching over policy-space, the size of which is
exponential in the number of states. We are interested in bounds on the
complexity of PI that do not depend on the value of the discount factor. In
this paper we prove the first such non-trivial, worst-case, upper bounds on the
number of iterations required by PI to converge to the optimal policy. Our
analysis also sheds new light on the manner in which PI progresses through the
space of policies.
Comment: Appears in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI 1999).
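For reference, a standard tabular implementation of the policy iteration algorithm whose iteration count the paper's bounds concern. The tie-breaking rule that keeps the current action is one common choice that guarantees termination; it is a detail of this sketch, not of the paper.

```python
# Howard policy iteration on a tabular MDP, counting iterations.
import numpy as np

def policy_iteration(P, R, gamma):
    """P: [A,S,S] transitions; R: [S,A] rewards."""
    S, A = R.shape
    policy = np.zeros(S, dtype=int)
    iterations = 0
    while True:
        iterations += 1
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[policy, np.arange(S)]            # [S,S]
        R_pi = R[np.arange(S), policy]            # [S]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Greedy improvement over all actions.
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        # Keep the current action on (near-)ties so the loop must stop.
        new_policy = np.where(Q[np.arange(S), policy] >= Q.max(axis=1) - 1e-12,
                              policy, Q.argmax(axis=1))
        if np.array_equal(new_policy, policy):
            return policy, V, iterations
        policy = new_policy
```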
Practicality of Nested Risk Measures for Dynamic Electric Vehicle Charging
We consider the sequential decision problem faced by the manager of an
electric vehicle (EV) charging station, who aims to satisfy the charging demand
of the customer while minimizing cost. Since the total time needed to charge
the EV up to capacity is often less than the amount of time that the customer
is away, there are opportunities to exploit electricity spot price variations
within some reservation window. We formulate the problem as a finite horizon
Markov decision process (MDP) and consider a risk-averse objective function by
optimizing under a dynamic risk measure constructed using a convex combination
of expected value and conditional value at risk (CVaR). It has been recognized
that the objective function of a risk-averse MDP lacks a practical
interpretation. Therefore, in both academic and industry practice, the dynamic
risk measure objective is often not of primary interest; instead, the
risk-averse MDP is used as a computational tool for solving problems with
predefined "practical" risk and reward objectives (termed the base model). In
this paper, we study the extent to which the two sides of this framework are
compatible with each other for the EV setting -- roughly speaking, does a "more
risk-averse" MDP provide lower risk in the practical sense as well? In order to
answer such a question, the effect of the degree of dynamic risk-aversion on
the optimal MDP policy is analyzed. Based on these results, we also propose a
principled approximation approach to finding an instance of the risk-averse MDP
whose optimal policy behaves well under the practical objectives of the base
model. Our numerical experiments suggest that EV charging stations can be
operated at a significantly higher level of profitability if dynamic charging
is adopted and a small amount of risk is tolerated.
Comment: 45 pages, 15 figures.
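The one-step building block of such a dynamic risk measure can be written down directly for a discrete cost distribution. In the sketch below, the parameter names alpha and lam are hypothetical; the functional form (a convex combination of expectation and CVaR) is the one named in the abstract, applied at every stage of the dynamic-programming backup.

```python
# One-step risk measure: (1 - lam) * E[C] + lam * CVaR_alpha[C],
# evaluated on a discrete cost distribution (higher cost = worse).
import numpy as np

def cvar(costs, probs, alpha):
    """CVaR_alpha: average cost over the worst alpha probability mass."""
    order = np.argsort(costs)[::-1]          # worst outcomes first
    c, p = np.asarray(costs)[order], np.asarray(probs)[order]
    tail = np.minimum(np.cumsum(p), alpha)   # cap at alpha mass
    weights = np.diff(np.concatenate(([0.0], tail)))
    return float(weights @ c) / alpha

def one_step_risk(costs, probs, alpha=0.05, lam=0.5):
    exp = float(np.dot(costs, probs))
    return (1 - lam) * exp + lam * cvar(costs, probs, alpha)

# In a nested (dynamic) risk measure, this functional replaces the
# plain expectation in each Bellman backup.
print(one_step_risk([10.0, 1.0], [0.1, 0.9], alpha=0.1, lam=0.5))  # 5.95
```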
Approximate Linear Programming for First-order MDPs
We introduce a new approximate solution technique for first-order Markov
decision processes (FOMDPs). Representing the value function linearly w.r.t. a
set of first-order basis functions, we compute suitable weights by casting the
corresponding optimization as a first-order linear program and show how
off-the-shelf theorem prover and LP software can be effectively used. This
technique allows one to solve FOMDPs independent of a specific domain
instantiation; furthermore, it allows one to determine bounds on approximation
error that apply equally to all domain instantiations. We apply this solution
technique to the task of elevator scheduling with a rich feature space and
multi-criteria additive reward, and demonstrate that it outperforms a number of
intuitive, heuristically guided policies.
Comment: Appears in Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI 2005).
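At the ground level, the linear program in question takes a familiar form. The scipy-based sketch below is a propositional stand-in for the paper's first-order program, in which the constraints are instead generated and simplified symbolically with a theorem prover.

```python
# Ground-level sketch of approximate linear programming: V = Phi @ w,
# with one constraint per (state, action) enforcing V >= T V.
import numpy as np
from scipy.optimize import linprog

def alp_weights(Phi, R, P, gamma):
    """Phi: [S,K] basis; R: [S,A]; P: [A,S,S]. Returns weights w."""
    S, K = Phi.shape
    A = R.shape[1]
    # Constraint per (s, a): phi(s)w >= R(s,a) + gamma * E[phi(s')w]
    rows = np.concatenate([Phi - gamma * P[a] @ Phi for a in range(A)])
    b = np.concatenate([R[:, a] for a in range(A)])
    res = linprog(c=Phi.sum(axis=0), A_ub=-rows, b_ub=-b,
                  bounds=[(None, None)] * K)
    return res.x
```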
Approximate Policy Iteration with a Policy Language Bias: Solving Relational Markov Decision Processes
We study an approach to policy selection for large relational Markov Decision
Processes (MDPs). We consider a variant of approximate policy iteration (API)
that replaces the usual value-function learning step with a learning step in
policy space. This is advantageous in domains where good policies are easier to
represent and learn than the corresponding value functions, which is often the
case for the relational MDPs we are interested in. In order to apply API to
such problems, we introduce a relational policy language and corresponding
learner. In addition, we introduce a new bootstrapping routine for goal-based
planning domains, based on random walks. Such bootstrapping is necessary for
many large relational MDPs, where reward is extremely sparse, as API is
ineffective in such domains when initialized with an uninformed policy. Our
experiments show that the resulting system is able to find good policies for a
number of classical planning domains and their stochastic variants by solving
them as extremely large relational MDPs. The experiments also point to some limitations of our approach, suggesting directions for future work.
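Random-walk bootstrapping is easy to state concretely. In the sketch below, the actions, step, and goal_of interfaces are hypothetical stand-ins for a planning domain definition.

```python
# Sketch of random-walk bootstrapping for goal-based domains: goals
# are generated by walking n random steps forward from the initial
# state, so early iterations see problems the current policy can solve.
import random

def random_walk_problem(initial_state, actions, step, goal_of, n_steps):
    """actions(s): applicable actions; step(s, a): successor state;
    goal_of(s): extract a goal formula from a state."""
    state = initial_state
    for _ in range(n_steps):
        state = step(state, random.choice(actions(state)))
    return initial_state, goal_of(state)

# Toy 1-D domain: states are integers, actions move left or right.
init, goal = random_walk_problem(
    0, actions=lambda s: [-1, +1], step=lambda s, a: s + a,
    goal_of=lambda s: ("at", s), n_steps=5)
print(init, goal)  # e.g. 0 ('at', 3)
```

As the learned policy improves, n_steps is increased, yielding a curriculum of progressively harder goal-based problems.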
PAC Reinforcement Learning with Rich Observations
We propose and study a new model for reinforcement learning with rich
observations, generalizing contextual bandits to sequential decision making.
These models require an agent to take actions based on observations (features)
with the goal of achieving long-term performance competitive with a large set
of policies. To avoid barriers to sample-efficient learning associated with
large observation spaces and general POMDPs, we focus on problems that can be
summarized by a small number of hidden states and have long-term rewards that
are predictable by a reactive function class. In this setting, we design and
analyze a new reinforcement learning algorithm, Least Squares Value Elimination
by Exploration. We prove that the algorithm learns near optimal behavior after
a number of episodes that is polynomial in all relevant parameters, logarithmic
in the number of policies, and independent of the size of the observation
space. Our result provides theoretical justification for reinforcement learning with function approximation.
Probabilistic Relational Planning with First Order Decision Diagrams
Dynamic programming algorithms have been successfully applied to
propositional stochastic planning problems by using compact representations, in
particular algebraic decision diagrams, to capture domain dynamics and value
functions. Work on symbolic dynamic programming lifted these ideas to first
order logic using several representation schemes. Recent work introduced a
first order variant of decision diagrams (FODD) and developed a value iteration
algorithm for this representation. This paper develops several improvements to
the FODD algorithm that make the approach practical. These include new
reduction operators that decrease the size of the representation, several
speedup techniques, and techniques for value approximation. Incorporating
these, the paper presents a planning system, FODD-Planner, for solving
relational stochastic planning problems. The system is evaluated on several
domains, including problems from the recent international planning competition,
and shows competitive performance with top-ranking systems. This is the first demonstration of the feasibility of the approach, and it shows that abstraction through compact representation is a promising route to stochastic planning.
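The role of reduction operators can be illustrated in the much simpler propositional case. The sketch below performs the classic merge-isomorphic-subgraphs and remove-redundant-test reductions; the paper's FODD reductions must additionally reason about first-order structure, so this is an analogy, not the paper's algorithm.

```python
# Illustrative propositional diagram reduction: hash-consing merges
# isomorphic subgraphs, and nodes whose branches coincide are removed.
# This is what keeps diagram-represented value functions compact.
def reduce_diagram(node, table=None):
    """node: ('leaf', value) or ('test', var, low, high)."""
    if table is None:
        table = {}
    if node[0] == "leaf":
        key = node
    else:
        _, var, low, high = node
        low, high = reduce_diagram(low, table), reduce_diagram(high, table)
        if low is high:            # redundant test: both branches equal
            return low
        key = ("test", var, id(low), id(high))
        node = ("test", var, low, high)
    return table.setdefault(key, node)

v = ("test", "p", ("leaf", 1.0), ("leaf", 1.0))
print(reduce_diagram(v))  # ('leaf', 1.0) -- the test is eliminated
```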