Solving Factored MDPs with Hybrid State and Action Variables
Efficient representations and solutions for large decision problems with
continuous and discrete variables are among the most important challenges faced
by the designers of automated decision support systems. In this paper, we
describe a novel hybrid factored Markov decision process (MDP) model that
allows for a compact representation of these problems, and a new hybrid
approximate linear programming (HALP) framework that permits their efficient
solution. The central idea of HALP is to approximate the optimal value
function by a linear combination of basis functions and optimize its weights by
linear programming. We analyze both theoretical and computational aspects of
this approach, and demonstrate its scale-up potential on several hybrid
optimization problems.
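To make the central idea concrete, the sketch below sets up the exact-LP version of this approximation for a small finite MDP; the hybrid machinery of HALP replaces the sums over states with expectations over continuous variables. All names and array shapes are illustrative, not taken from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def alp_weights(P, R, Phi, gamma=0.95, c=None):
    """Fit V(s) ~ Phi @ w by approximate linear programming.

    P:   (A, S, S) transitions, P[a, s, s2] = Pr(s2 | s, a)
    R:   (S, A) rewards
    Phi: (S, K) basis-function matrix
    c:   (S,) state-relevance weights (uniform if omitted)
    """
    A, S, _ = P.shape
    K = Phi.shape[1]
    c = np.ones(S) / S if c is None else c
    # One Bellman constraint per (s, a):
    #   Phi[s] @ w >= R[s, a] + gamma * (P[a, s] @ Phi) @ w
    A_ub = np.vstack([gamma * P[a] @ Phi - Phi for a in range(A)])  # (A*S, K)
    b_ub = -R.T.reshape(-1)                                         # (A*S,)
    res = linprog(c @ Phi, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * K)
    return res.x  # weights of the basis functions
```

Including a constant basis function keeps the LP bounded; the scale-up potential comes from the number of weights K being far smaller than the number of states.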
Temporal coordination under uncertainty: initial results for the two agents case
We focus on the problem of decentralized planning and coordination for two
heterogeneous autonomous agents that share a common mission in an uncertain
environment. For example, we consider a helicopter UAV and a ground rover
cooperating in the exploration of a dangerous zone where communication is
limited, which forces planning to be decentralized. After proposing a framework for decentralized planning, we underline the need for a planner that handles both uncertainty and continuous time in time-dependent
problems, and we present initial results on temporal planning under uncertainty.
Planning in Hybrid Structured Stochastic Domains
Efficient representations and solutions for large structured decision problems with continuous and discrete variables are among the important challenges faced by the designers of automated decision support systems. In this work, we describe a novel hybrid factored Markov decision process (MDP) model that allows for a compact representation of these problems, and a hybrid approximate linear programming (HALP) framework that permits their efficient solution. The central idea of HALP is to approximate the optimal value function of an MDP by a linear combination of basis functions and optimize its weights by linear programming. We study both theoretical and practical aspects of this approach, and demonstrate its scale-up potential on several hybrid optimization problems.
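Written out for the finite case in standard notation (a reconstruction, not copied from the paper), the LP at the heart of this approach is:

$$
\begin{aligned}
\min_{w}\quad & \sum_{s} c(s)\sum_{i} w_i\,\phi_i(s)\\
\text{s.t.}\quad & \sum_{i} w_i\,\phi_i(s)\;\ge\; R(s,a)+\gamma\sum_{s'} P(s'\mid s,a)\sum_{i} w_i\,\phi_i(s'), \qquad \forall\,(s,a),
\end{aligned}
$$

where $c$ is a state-relevance weighting. In the hybrid setting the sums over $s'$ become mixed sums and integrals, and the contribution of HALP is keeping the objective and constraints tractable for factored models.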
Finding Approximate POMDP Solutions Through Belief Compression
Standard value function approaches to finding policies for Partially
Observable Markov Decision Processes (POMDPs) are generally considered to be
intractable for large models. The intractability of these algorithms is to a
large extent a consequence of computing an exact, optimal policy over the
entire belief space. However, in real-world POMDP problems, computing the
optimal policy for the full belief space is often unnecessary for good control
even for problems with complicated policy classes. The beliefs experienced by
the controller often lie near a structured, low-dimensional subspace embedded
in the high-dimensional belief space. Finding a good approximation to the
optimal value function for only this subspace can be much easier than computing
the full value function. We introduce a new method for solving large-scale
POMDPs by reducing the dimensionality of the belief space. We use Exponential
family Principal Components Analysis (Collins, Dasgupta and Schapire, 2002) to
represent sparse, high-dimensional belief spaces using small sets of learned
features of the belief state. We then plan only in terms of the low-dimensional
belief features. By planning in this low-dimensional space, we can find
policies for POMDP models that are orders of magnitude larger than models that
can be handled by conventional techniques. We demonstrate the use of this
algorithm on a synthetic problem and on mobile robot navigation tasks.
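The sketch below illustrates the compression step on sampled beliefs. The paper uses exponential-family PCA, which respects the non-negativity and normalization of belief vectors; the plain SVD-based PCA here is a simplified stand-in, with names of my own choosing.

```python
import numpy as np

def compress_beliefs(B, k):
    """B: (N, S) matrix of sampled belief vectors; k: target dimension."""
    mean = B.mean(axis=0)
    U, s, Vt = np.linalg.svd(B - mean, full_matrices=False)
    basis = Vt[:k]                      # (k, S) learned belief features
    codes = (B - mean) @ basis.T        # (N, k) low-dimensional coordinates
    return mean, basis, codes

def reconstruct(mean, basis, codes):
    """Map low-dimensional coordinates back to (approximate) beliefs."""
    B_hat = mean + codes @ basis
    B_hat = np.clip(B_hat, 0.0, None)   # beliefs must be non-negative...
    return B_hat / B_hat.sum(axis=1, keepdims=True)  # ...and sum to one
```

Planning then proceeds entirely in the k-dimensional code space, which is what makes models orders of magnitude larger than usual tractable.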
Optimal and Approximate Q-value Functions for Decentralized POMDPs
Decision-theoretic planning is a popular approach to sequential decision
making problems, because it treats uncertainty in sensing and acting in a
principled way. In single-agent frameworks like MDPs and POMDPs, planning can
be carried out by resorting to Q-value functions: an optimal Q-value function
Q* is computed in a recursive manner by dynamic programming, and then an
optimal policy is extracted from Q*. In this paper we study whether similar
Q-value functions can be defined for decentralized POMDP models (Dec-POMDPs),
and how policies can be extracted from such value functions. We define two
forms of the optimal Q-value function for Dec-POMDPs: one that gives a
normative description as the Q-value function of an optimal pure joint policy
and another one that is sequentially rational and thus gives a recipe for
computation. This computation, however, is infeasible for all but the smallest
problems. Therefore, we analyze various approximate Q-value functions that
allow for efficient computation. We describe how they relate, and we prove that
they all provide an upper bound to the optimal Q-value function Q*. Finally,
unifying some previous approaches for solving Dec-POMDPs, we describe a family
of algorithms for extracting policies from such Q-value functions, and perform
an experimental evaluation on existing test problems, including a new
firefighting benchmark problem.
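The upper-bound result can be summarized as follows (a paraphrase in standard notation, with $\vec{\theta}$ a joint action-observation history): each approximate value function $\widehat{Q}$ studied in the paper satisfies

$$
Q^{*}(\vec{\theta},\vec{a}) \;\le\; \widehat{Q}(\vec{\theta},\vec{a}),
\qquad\text{e.g.}\qquad
\widehat{Q}_{\mathrm{MDP}}(\vec{\theta},\vec{a}) \;=\; \sum_{s}\Pr\!\big(s\mid\vec{\theta}\big)\,Q^{*}_{\mathrm{MDP}}(s,\vec{a}),
$$

where $Q^{*}_{\mathrm{MDP}}$ is the optimal Q-value function of the underlying fully observable MDP. Because these approximations never underestimate $Q^{*}$, they can guide heuristic policy search without excluding the optimal joint policy.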
Techniques for the allocation of resources under uncertainty
Resource allocation is a ubiquitous problem that arises whenever limited resources must be distributed among multiple autonomous entities (e.g., people, companies, robots). Standard approaches to determining the optimal allocation are computationally prohibitive. The goal of this thesis is to propose computationally efficient algorithms for allocating consumable and non-consumable resources among autonomous agents whose preferences for these resources are induced by a stochastic process. To this end, we develop new models of planning problems, based on the framework of Markov Decision Processes (MDPs), in which the action sets are explicitly parameterized by the available resources. Given these models, we design algorithms based on dynamic programming and real-time heuristic search to compute resource allocations for agents acting in stochastic environments. In particular, we use the acyclic property of task creation to decompose the resource allocation problem. We also propose an approximate decomposition strategy in which the agents managing the resources account for positive and negative interactions as well as simultaneous actions. The main contribution of this thesis, however, is the adoption of real-time heuristic search for resource allocation. To this end, we develop an approach based on Q-decomposition with tight bounds to drastically reduce the planning time needed to formulate an optimal policy. These tight bounds allow the action space to be pruned for each agent. We show analytically and empirically that the proposed approaches lead to drastic (in many cases, exponential) improvements in computational efficiency over standard planning methods. Finally, we test real-time heuristic search in SADM, a resource allocation simulator for a frigate.
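As a minimal sketch of how admissible bounds shrink the search, assuming bound functions derived from the Q-decomposition (the names below are placeholders, not the thesis's API):

```python
def prune_actions(actions, lower_bound, upper_bound):
    """Keep only the actions that could still be optimal in this state.

    lower_bound(a), upper_bound(a): admissible bounds on the Q-value of a,
    e.g. obtained by summing per-agent bounds under Q-decomposition.
    """
    best_lower = max(lower_bound(a) for a in actions)
    return [a for a in actions if upper_bound(a) >= best_lower]
```

An action whose upper bound falls below the best lower bound cannot be optimal, so the expensive exact backup is performed only on the survivors; the tighter the bounds, the smaller the surviving action set.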
Intrinsically Motivated Exploration in Hierarchical Reinforcement Learning
The acquisition of hierarchies of reusable skills is one of the distinguishing characteristics of human intelligence, and the learning of such hierarchies is an important open problem in computational reinforcement learning (RL). In humans, these skills are learned during a substantial developmental period in which individuals are intrinsically motivated to explore their environment and learn about the effects of their actions. The skills learned during this period of exploration are then reused to great effect later in life to solve many unfamiliar problems very quickly. This thesis presents novel methods for achieving such developmental acquisition of skill hierarchies in artificial agents by rewarding them for using their current skill set to better understand the effects of their actions on unfamiliar parts of their environment, which in turn leads to the formation of new skills and further exploration, in a life-long process of hierarchical exploration and skill learning.
In particular, we present algorithms for intrinsically motivated hierarchical exploration of Markov Decision Processes (MDPs) and finite factored MDPs (FMDPs). These methods integrate existing research on temporal abstraction in MDPs, intrinsically motivated RL, hierarchical decomposition of finite FMDPs, Bayesian network structure learning, and information theory to achieve long-term, incremental acquisition of skill hierarchies in these environments. Moreover, we show that the skill hierarchies learned in this fashion afford an agent the ability to solve novel tasks in its environment much more quickly than solving them from scratch.
Applying these techniques to environments whose representational properties differ from traditional MDPs and finite FMDPs requires methods for incrementally learning transition models of environments with such representations. Taking a step in this direction, we also present novel methods for incremental model learning in two other types of environments. The first is an algorithm for online, incremental structure learning of transition functions for FMDPs with continuous-valued state and action variables. The second is an algorithm for learning the parameters of a predictive state representation, which serves as a model of partially observable dynamical systems with continuous-valued observations and actions. These techniques serve as a prerequisite to future work applying intrinsically motivated skill learning to these types of environments.
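A toy version of the intrinsic-reward idea is sketched below: the agent is rewarded for learning progress on its transition model, drawing it toward poorly understood parts of the environment. The count-based model and the progress measure are deliberate simplifications of the information-theoretic machinery described above.

```python
import collections

class TabularModel:
    """Count-based model of discrete transitions (s, a) -> s2."""

    def __init__(self):
        self.counts = collections.defaultdict(collections.Counter)

    def prob(self, s, a, s2):
        c = self.counts[(s, a)]
        total = sum(c.values())
        return c[s2] / total if total else 0.0

    def update(self, s, a, s2):
        self.counts[(s, a)][s2] += 1

def intrinsic_reward(model, s, a, s2):
    """Reward the agent for improving its prediction of this transition."""
    before = model.prob(s, a, s2)
    model.update(s, a, s2)
    after = model.prob(s, a, s2)
    return max(0.0, after - before)  # learning progress, clipped at zero
```

As the model converges in a region, the intrinsic reward there decays toward zero, pushing exploration outward in the life-long fashion the thesis describes.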
Approximate Dynamic Programming with Parallel Stochastic Planning Operators
This thesis presents an approximate dynamic programming (ADP) technique for environment modelling agents. The agent learns a set of parallel stochastic planning operators (P-SPOs) by evaluating changes in its environment in response to actions, using an association rule mining approach. An approximate policy is then derived by iteratively improving state value aggregation estimates attached to the operators using the P-SPOs as a model in a Dyna-Q-like architecture.
Reinforcement learning and dynamic programming are powerful techniques for automated agent decision making in stochastic environments. Dynamic programming is effective when there is a known environment model, while reinforcement learning is effective when a model is not available. Both techniques derive a policy: a mapping from each environment state to an action that optimizes the long-term reward the agent receives.
The standard methods become less effective as the state space of the environment grows, because they require a value to be associated with each state, the storage and processing of which is exponential in the number of state variables. Resolving this “curse of dimensionality” is an important topic of research amongst all communities working on this problem. Two key methods are to: (i) derive an estimate of the value (approximate dynamic programming) using function approximation or state aggregation; or (ii) build a model of the environment from experience.
This thesis presents a method of combining these approaches by exploiting structure in the state transition and value functions captured in a set of planning operators which are learnt through experience in the environment. Standard planning operators define the deterministic changes that occur in an environment in response to an action. This work presents Parallel Stochastic Planning Operators (P-SPOs), a novel form of planning operator providing a structured model of the state transition function in environments which are both non-deterministic and for which changes can occur outside the influence of actions. Next, an automated method for extracting P-SPOs from observations in an environment is explored using an adaptation of association rule mining. Finally, methods of relating the state transition structure encapsulated in the P-SPOs to state values, using the operators to store state value aggregation estimates, are evaluated.
The framework described provides a method by which approximate dynamic programming can be applied by designers of AI agents and AI planning systems for which they have minimal prior knowledge. The framework and P-SPO-based implementations are tested against standard techniques in two benchmark stochastic environments: a “slippery gripper” block-painting robot; and a “predator-prey” agent environment.
Experimental results show that an agent using a P-SPO-based approach is able to learn an accurate model of its environment if successor state variables exhibit conditional independence, and an approximate model in the non-independent case. Results also demonstrate that the agent’s ability to generalise to previously unseen states using the model allows it to form an improved policy over an agent employing a standard Dyna-Q based technique. Finally, an approximate policy stored in state aggregation estimates attached to operators is shown to be optimal in experiments for which the P-SPO set contains sufficient information for effective aggregations to be formed.
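For reference, a compact tabular Dyna-Q loop of the kind the P-SPO agent is compared against is sketched below: real experience updates both the Q-table and a learned (here deterministic) model, and the model then feeds extra simulated backups. The gym-style environment API is an assumption for illustration only.

```python
import random
import collections

def dyna_q(env, episodes=100, planning_steps=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = collections.defaultdict(float)  # Q[(state, action)]
    model = {}                          # model[(s, a)] = (reward, next_state, done)
    actions = list(range(env.action_space.n))

    def greedy(s):
        return max(actions, key=lambda a: Q[(s, a)])

    def backup(s, a, r, s2, done):
        target = r if done else r + gamma * Q[(s2, greedy(s2))]
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice(actions) if random.random() < epsilon else greedy(s)
            s2, r, done, _ = env.step(a)
            backup(s, a, r, s2, done)          # direct RL from real experience
            model[(s, a)] = (r, s2, done)      # deterministic model (a simplification)
            for _ in range(planning_steps):    # planning: replay simulated transitions
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                backup(ps, pa, pr, ps2, pdone)
            s = s2
    return Q
```

The P-SPO approach replaces the flat `model` dictionary with structured operators, which is what enables generalisation to previously unseen states.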