
    Inductive Policy Selection for First-Order MDPs

    We select policies for large Markov Decision Processes (MDPs) with compact first-order representations. We find policies that generalize well as the number of objects in the domain grows, potentially without bound. Existing dynamic-programming approaches based on flat, propositional, or first-order representations either are impractical here or do not naturally scale as the number of objects grows without bound. We implement and evaluate an alternative approach that induces first-order policies using training data constructed by solving small problem instances using PGraphplan (Blum & Langford, 1999). Our policies are represented as ensembles of decision lists, using a taxonomic concept language. This approach extends the work of Martin and Geffner (2000) to stochastic domains, ensemble learning, and a wider variety of problems. Empirically, we find "good" policies for several stochastic first-order MDPs that are beyond the scope of previous approaches. We also discuss the application of this work to the relational reinforcement-learning problem. Comment: Appears in Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI2002).
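
    To make the policy representation concrete, here is a minimal sketch of a decision-list policy of the kind described above: an ordered list of (condition, action) rules where the first rule whose condition holds in the current state fires, with an ensemble voting over several such lists. The conditions are plain Python predicates rather than taxonomic concept expressions, and all names are illustrative, not taken from the paper's implementation.

        from collections import Counter

        class DecisionListPolicy:
            """An ordered list of (condition, action) rules; the first match fires."""
            def __init__(self, rules, default_action):
                self.rules = rules            # list of (predicate(state) -> bool, action)
                self.default_action = default_action

            def act(self, state):
                for condition, action in self.rules:
                    if condition(state):
                        return action
                return self.default_action

        class EnsemblePolicy:
            """Majority vote over several decision-list policies (the ensemble step)."""
            def __init__(self, policies):
                self.policies = policies

            def act(self, state):
                votes = Counter(p.act(state) for p in self.policies)
                return votes.most_common(1)[0][0]

    Each member list would be induced from training data generated by solving small instances (with PGraphplan, in the paper); only the rule-firing and voting mechanics are shown here.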

    Exploiting First-Order Regression in Inductive Policy Selection

    We consider the problem of computing optimal generalised policies for relational Markov decision processes. We describe an approach combining some of the benefits of purely inductive techniques with those of symbolic dynamic programming methods. The latter reason about the optimal value function using first-order decision-theoretic regression and formula rewriting, while the former, when provided with a suitable hypothesis language, are capable of generalising value functions or policies for small instances. Our idea is to use reasoning, and in particular classical first-order regression, to automatically generate a hypothesis language dedicated to the domain at hand, which is then used as input by an inductive solver. This approach avoids the more complex reasoning of symbolic dynamic programming while focusing the inductive solver's attention on concepts that are specifically relevant to the optimal value function for the domain considered. Comment: Appears in Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI2004).
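
    As rough intuition for the regression step, the propositional STRIPS-style sketch below regresses a set of goal literals through an action: the result is what must hold before the action so that the goal holds afterwards. The paper's first-order regression lifts this idea to quantified formulae; the data structures here are assumptions made for illustration only.

        def regress(goal, action):
            """Regress a goal (a set of literals) through a STRIPS-style action.

            action.add / action.delete / action.precond are sets of literals.
            Returns the condition that must hold before the action so that
            `goal` holds afterwards, or None if the action cannot achieve it.
            """
            if goal & action.delete:
                return None                      # the action destroys part of the goal
            # Literals not added by the action must already hold, plus the preconditions.
            return (goal - action.add) | action.precond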

    Fast Planning in Stochastic Games

    Stochastic games generalize Markov decision processes (MDPs) to a multiagent setting by allowing the state transitions to depend jointly on all player actions, and having rewards determined by multiplayer matrix games at each state. We consider the problem of computing Nash equilibria in stochastic games, the analogue of planning in MDPs. We begin by providing a generalization of finite-horizon value iteration that computes a Nash strategy for each player in general-sum stochastic games. The algorithm takes an arbitrary Nash selection function as input, which allows the translation of local choices between multiple Nash equilibria into the selection of a single global Nash equilibrium. Our main technical result is an algorithm for computing near-Nash equilibria in large or infinite state spaces. This algorithm builds on our finite-horizon value iteration algorithm, and adapts the sparse sampling methods of Kearns, Mansour and Ng (1999) to stochastic games. We conclude by describing a counterexample showing that infinite-horizon discounted value iteration, which was shown by Shapley to converge in the zero-sum case (a result we extend slightly here), does not converge in the general-sum case. Comment: Appears in Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI2000).
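
    Read schematically, and written here for two players without discounting (our notation, not the paper's), the finite-horizon value iteration above backs up a matrix game at every state and stage and lets the Nash selection function f pick one of its equilibria:

        \[
        Q^i_t(s, a_1, a_2) \;=\; R^i(s, a_1, a_2) \;+\; \sum_{s'} P(s' \mid s, a_1, a_2)\, V^i_{t-1}(s'),
        \]
        \[
        \big(\pi^1_t(s), \pi^2_t(s)\big) \;=\; f\big(Q^1_t(s, \cdot, \cdot),\, Q^2_t(s, \cdot, \cdot)\big),
        \qquad
        V^i_t(s) \;=\; \mathbb{E}_{a_1 \sim \pi^1_t(s),\; a_2 \sim \pi^2_t(s)}\big[\, Q^i_t(s, a_1, a_2) \,\big],
        \]

    with V^i_0 given by the terminal rewards. Sparse sampling, in general, replaces exact expectations of this kind with averages over sampled next states.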

    Practical Linear Value-approximation Techniques for First-order MDPs

    Recent work on approximate linear programming (ALP) techniques for first-order Markov Decision Processes (FOMDPs) represents the value function linearly w.r.t. a set of first-order basis functions and uses linear programming techniques to determine suitable weights. This approach offers the advantage that it does not require simplification of the first-order value function, and allows one to solve FOMDPs independent of a specific domain instantiation. In this paper, we address several questions to enhance the applicability of this work: (1) Can we extend the first-order ALP framework to approximate policy iteration to address performance deficiencies of previous approaches? (2) Can we automatically generate basis functions and evaluate their impact on value function quality? (3) How can we decompose intractable problems with universally quantified rewards into tractable subproblems? We propose answers to these questions along with a number of novel optimizations and provide a comparative empirical evaluation on logistics problems from the ICAPS 2004 Probabilistic Planning Competition. Comment: Appears in Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence (UAI2006).
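
    Both the ALP baseline and the policy-iteration extension discussed above rest on the same linear representation of the value function. Schematically, in propositional notation (the paper works with lifted, first-order analogues of these expressions):

        \[
        V(s) \;\approx\; \sum_{i=1}^{k} w_i\, b_i(s),
        \]

    and the approximate policy iteration variant alternates between an evaluation step that fits weights w^{\pi} for the current policy \pi and a greedy improvement step

        \[
        \pi'(s) \;\in\; \arg\max_{a}\ \Big[\, R(s,a) \;+\; \gamma \sum_{s'} P(s' \mid s, a) \sum_{i} w^{\pi}_i\, b_i(s') \,\Big].
        \]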

    On the Complexity of Policy Iteration

    Decision-making problems in uncertain or stochastic domains are often formulated as Markov decision processes (MDPs). Policy iteration (PI) is a popular algorithm for searching over policy-space, the size of which is exponential in the number of states. We are interested in bounds on the complexity of PI that do not depend on the value of the discount factor. In this paper we prove the first such non-trivial, worst-case, upper bounds on the number of iterations required by PI to converge to the optimal policy. Our analysis also sheds new light on the manner in which PI progresses through the space of policies. Comment: Appears in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI1999).
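
    As a point of reference, the sketch below is a minimal tabular version of policy iteration, the algorithm whose iteration count is bounded above: exact policy evaluation by solving a linear system, followed by greedy improvement, repeated until the policy stops changing. The array-based encoding of P and R is an assumption of the example, not anything prescribed by the paper.

        import numpy as np

        def policy_iteration(P, R, gamma=0.95):
            """Tabular policy iteration.

            P: transition probabilities, shape (A, S, S); R: rewards, shape (S, A).
            Returns an optimal deterministic policy (array of length S) and its value.
            """
            n_actions, n_states, _ = P.shape
            policy = np.zeros(n_states, dtype=int)
            while True:
                # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
                P_pi = P[policy, np.arange(n_states)]          # shape (S, S)
                r_pi = R[np.arange(n_states), policy]          # shape (S,)
                v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
                # Policy improvement: act greedily with respect to v.
                q = R.T + gamma * P @ v                        # shape (A, S)
                new_policy = q.argmax(axis=0)
                if np.array_equal(new_policy, policy):
                    return policy, v
                policy = new_policy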

    Practicality of Nested Risk Measures for Dynamic Electric Vehicle Charging

    We consider the sequential decision problem faced by the manager of an electric vehicle (EV) charging station, who aims to satisfy the charging demand of the customer while minimizing cost. Since the total time needed to charge the EV up to capacity is often less than the amount of time that the customer is away, there are opportunities to exploit electricity spot price variations within some reservation window. We formulate the problem as a finite horizon Markov decision process (MDP) and consider a risk-averse objective function by optimizing under a dynamic risk measure constructed using a convex combination of expected value and conditional value at risk (CVaR). It has been recognized that the objective function of a risk-averse MDP lacks a practical interpretation. Therefore, in both academic and industry practice, the dynamic risk measure objective is often not of primary interest; instead, the risk-averse MDP is used as a computational tool for solving problems with predefined "practical" risk and reward objectives (termed the base model). In this paper, we study the extent to which the two sides of this framework are compatible with each other for the EV setting -- roughly speaking, does a "more risk-averse" MDP provide lower risk in the practical sense as well? In order to answer such a question, the effect of the degree of dynamic risk-aversion on the optimal MDP policy is analyzed. Based on these results, we also propose a principled approximation approach to finding an instance of the risk-averse MDP whose optimal policy behaves well under the practical objectives of the base model. Our numerical experiments suggest that EV charging stations can be operated at a significantly higher level of profitability if dynamic charging is adopted and a small amount of risk is tolerated. Comment: 45 pages, 15 figures.
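
    Concretely, a one-step risk measure of the kind described above can be written, for a stage cost Z, tail level α in (0, 1], and mixing weight λ in [0, 1] (our notation; the paper's exact parameterization may differ), as

        \[
        \rho(Z) \;=\; (1-\lambda)\,\mathbb{E}[Z] \;+\; \lambda\,\mathrm{CVaR}_{\alpha}(Z),
        \qquad
        \mathrm{CVaR}_{\alpha}(Z) \;=\; \min_{z \in \mathbb{R}} \Big\{ z + \tfrac{1}{\alpha}\,\mathbb{E}\big[(Z - z)^{+}\big] \Big\},
        \]

    and the dynamic (nested) risk objective over a horizon of T stages with costs c_1, ..., c_T under a given policy composes this measure stage by stage:

        \[
        J \;=\; \rho\Big( c_1 + \rho\big( c_2 + \cdots + \rho(c_T) \big) \Big).
        \]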

    Approximate Linear Programming for First-order MDPs

    We introduce a new approximate solution technique for first-order Markov decision processes (FOMDPs). Representing the value function linearly w.r.t. a set of first-order basis functions, we compute suitable weights by casting the corresponding optimization as a first-order linear program and show how off-the-shelf theorem prover and LP software can be effectively used. This technique allows one to solve FOMDPs independent of a specific domain instantiation; furthermore, it allows one to determine bounds on approximation error that apply equally to all domain instantiations. We apply this solution technique to the task of elevator scheduling with a rich feature space and multi-criteria additive reward, and demonstrate that it outperforms a number of intuitive, heuristically-guided policies. Comment: Appears in Proceedings of the Twenty-First Conference on Uncertainty in Artificial Intelligence (UAI2005).
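
    For intuition, the propositional analogue of the linear program solved here looks as follows; the paper's version is lifted to first-order constraints that cover all domain instantiations at once, and the state-relevance weights \alpha_i are a standard ingredient of ALP assumed for this sketch:

        \[
        \min_{w}\ \sum_{i} \alpha_i\, w_i
        \qquad \text{s.t.} \qquad
        \sum_{i} w_i\, b_i(s) \;\ge\; R(s,a) \;+\; \gamma \sum_{s'} P(s' \mid s, a) \sum_{i} w_i\, b_i(s')
        \quad \text{for all } s, a.
        \]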

    Approximate Policy Iteration with a Policy Language Bias: Solving Relational Markov Decision Processes

    We study an approach to policy selection for large relational Markov Decision Processes (MDPs). We consider a variant of approximate policy iteration (API) that replaces the usual value-function learning step with a learning step in policy space. This is advantageous in domains where good policies are easier to represent and learn than the corresponding value functions, which is often the case for the relational MDPs we are interested in. In order to apply API to such problems, we introduce a relational policy language and corresponding learner. In addition, we introduce a new bootstrapping routine for goal-based planning domains, based on random walks. Such bootstrapping is necessary for many large relational MDPs, where reward is extremely sparse, as API is ineffective in such domains when initialized with an uninformed policy. Our experiments show that the resulting system is able to find good policies for a number of classical planning domains and their stochastic variants by solving them as extremely large relational MDPs. The experiments also point to some limitations of our approach, suggesting future work.
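
    The overall loop can be pictured as in the sketch below: sample states reachable under the current policy, label each sampled state with an approximately best action obtained from policy rollouts, and train a new policy (a classifier in some policy language) on those labels. The rollout-based labeling and the generic `learn_policy`, `sample_states`, `simulate`, `reward`, and `actions` hooks are illustrative placeholders, not the paper's relational learner.

        def rollout_value(state, action, policy, simulate, reward, gamma, depth, width):
            """Estimate Q(state, action) by sampling `width` rollouts of the current policy."""
            total = 0.0
            for _ in range(width):
                s = simulate(state, action)
                ret, discount = reward(state, action), gamma
                for _ in range(depth):
                    a = policy(s)
                    ret += discount * reward(s, a)
                    s = simulate(s, a)
                    discount *= gamma
                total += ret
            return total / width

        def approximate_policy_iteration(initial_policy, sample_states, actions, simulate,
                                         reward, learn_policy, gamma=0.95,
                                         iterations=10, depth=20, width=10):
            policy = initial_policy
            for _ in range(iterations):
                # Label sampled states with the action that looks best under rollouts.
                dataset = []
                for s in sample_states(policy):
                    best = max(actions(s),
                               key=lambda a: rollout_value(s, a, policy, simulate,
                                                           reward, gamma, depth, width))
                    dataset.append((s, best))
                # Learning step in policy space: fit a new policy to the labeled states.
                policy = learn_policy(dataset)
            return policy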

    PAC Reinforcement Learning with Rich Observations

    We propose and study a new model for reinforcement learning with rich observations, generalizing contextual bandits to sequential decision making. These models require an agent to take actions based on observations (features) with the goal of achieving long-term performance competitive with a large set of policies. To avoid barriers to sample-efficient learning associated with large observation spaces and general POMDPs, we focus on problems that can be summarized by a small number of hidden states and have long-term rewards that are predictable by a reactive function class. In this setting, we design and analyze a new reinforcement learning algorithm, Least Squares Value Elimination by Exploration. We prove that the algorithm learns near optimal behavior after a number of episodes that is polynomial in all relevant parameters, logarithmic in the number of policies, and independent of the size of the observation space. Our result provides theoretical justification for reinforcement learning with function approximation.
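
    Stated schematically (our notation: M hidden states, K actions, H horizon, policy class Π, accuracy ε, failure probability δ; the exact polynomial is given in the paper), the sample-complexity guarantee above has the form

        \[
        \#\text{episodes} \;=\; \mathrm{poly}\Big(M,\, K,\, H,\, \tfrac{1}{\epsilon},\, \log\tfrac{1}{\delta}\Big) \cdot \log |\Pi|,
        \]

    with no dependence on the size of the observation space.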

    Probabilistic Relational Planning with First Order Decision Diagrams

    Dynamic programming algorithms have been successfully applied to propositional stochastic planning problems by using compact representations, in particular algebraic decision diagrams, to capture domain dynamics and value functions. Work on symbolic dynamic programming lifted these ideas to first order logic using several representation schemes. Recent work introduced a first order variant of decision diagrams (FODD) and developed a value iteration algorithm for this representation. This paper develops several improvements to the FODD algorithm that make the approach practical. These include new reduction operators that decrease the size of the representation, several speedup techniques, and techniques for value approximation. Incorporating these, the paper presents a planning system, FODD-Planner, for solving relational stochastic planning problems. The system is evaluated on several domains, including problems from the recent international planning competition, and shows competitive performance with top ranking systems. This is the first demonstration of feasibility of this approach and it shows that abstraction through compact representation is a promising approach to stochastic planning.
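
    At the level of equations, the backup implemented on FODDs is the standard value iteration update; the contribution above lies in carrying out each operation directly on diagrams and then shrinking them with the reduction operators (generic notation, not the paper's):

        \[
        Q^{a}_{n+1}(s) \;=\; R(s) \;+\; \gamma \sum_{s'} P(s' \mid s, a)\, V_n(s'),
        \qquad
        V_{n+1}(s) \;=\; \max_{a}\, Q^{a}_{n+1}(s),
        \]

    with R, P, V_n, and Q^a_{n+1} each represented as (first-order) decision diagrams and the sums, products, and maximizations computed by combining diagrams.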