Q-Learning for MDPs with General Spaces: Convergence and Near Optimality via Quantization under Weak Continuity
Reinforcement learning algorithms often require finiteness of state and action spaces in Markov decision processes (MDPs), and various efforts have been made in the literature to extend such algorithms to continuous state and action spaces. In this paper, we show that under very mild regularity conditions (in particular, involving only weak continuity of the transition kernel of an MDP), Q-learning for standard Borel MDPs via quantization of states and actions converges to a limit, and furthermore this limit satisfies an optimality equation which leads to near optimality, either with explicit performance bounds or with guarantees of asymptotic optimality. Our approach builds on (i) viewing quantization as a measurement
kernel and thus a quantized MDP as a POMDP, (ii) utilizing near optimality and
convergence results of Q-learning for POMDPs, and (iii) finally,
near-optimality of finite state model approximations for MDPs with weakly
continuous kernels which we show to correspond to the fixed point of the
constructed POMDP. Thus, our paper presents a very general convergence and
approximation result on the applicability of Q-learning to continuous MDPs.
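As a rough illustration of the quantization idea, the following sketch runs standard tabular Q-learning on uniformly discretized states and actions of a continuous MDP. The grid sizes, the `env.reset()`/`env.step()` interface, the exploration scheme, and the step-size schedule are assumptions made for illustration; they are not the paper's construction.

```python
import numpy as np

# Hypothetical sketch: tabular Q-learning on a uniformly quantized MDP with
# state space [0, 1] and action space [0, 1]. Environment interface and bin
# counts are illustrative assumptions.
N_S, N_A = 20, 5      # number of state/action quantization bins (assumed)
GAMMA = 0.95

def quantize(x, n_bins):
    """Map a point of [0, 1] to a bin index (quantization as a 'measurement')."""
    return min(int(x * n_bins), n_bins - 1)

def quantized_q_learning(env, steps=100_000):
    Q = np.zeros((N_S, N_A))
    visits = np.zeros((N_S, N_A))
    s = quantize(env.reset(), N_S)
    for _ in range(steps):
        a = np.random.randint(N_A)                  # exploratory policy
        x_next, r, done = env.step(a / (N_A - 1))   # de-quantized action
        s_next = quantize(x_next, N_S)
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]                  # vanishing step size
        Q[s, a] += alpha * (r + GAMMA * Q[s_next].max() - Q[s, a])
        s = quantize(env.reset(), N_S) if done else s_next
    return Q
```

The abstract's claim is that iterates of this form converge and that the limit yields near-optimal policies, with explicit bounds or asymptotically as the quantization is refined.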
Restricted Value Iteration: Theory and Algorithms
Value iteration is a popular algorithm for finding near optimal policies for
POMDPs. It is inefficient due to the need to account for the entire belief
space, which necessitates the solution of large numbers of linear programs. In
this paper, we study value iteration restricted to belief subsets. We show
that, together with properly chosen belief subsets, restricted value iteration
yields near-optimal policies and we give a condition for determining whether a
given belief subset would bring about savings in space and time. We also apply
restricted value iteration to two interesting classes of POMDPs, namely
informative POMDPs and near-discernible POMDPs.
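To make the belief-subset idea concrete, here is a hedged sketch of a point-based style backup evaluated only at a chosen set of belief points B; the array layout of the model (transition, observation, and reward matrices) is an assumption of this illustration, and the paper's exact restricted value iteration (including its linear-program machinery) is not reproduced.

```python
import numpy as np

def restricted_backup(B, V, T, O, R, gamma):
    """One value-iteration backup evaluated only at the beliefs in B (sketch).

    B: (m, |S|) belief points; V: (k, |S|) current alpha-vectors;
    T[a]: (|S|, |S|) transition matrix; O[a]: (|S|, |Obs|) observation
    probabilities indexed by next state; R[a]: (|S|,) rewards (all assumed).
    """
    new_vectors = []
    for b in B:
        best_val, best_vec = -np.inf, None
        for a in range(len(T)):
            vec = R[a].astype(float)
            for o in range(O[a].shape[1]):
                # back-project every alpha-vector through (a, o), then pick
                # the one that is best at this particular belief b
                proj = (T[a] * O[a][:, o]).dot(V.T)     # shape (|S|, k)
                vec = vec + gamma * proj[:, np.argmax(b.dot(proj))]
            if b.dot(vec) > best_val:
                best_val, best_vec = b.dot(vec), vec
        new_vectors.append(best_vec)
    return np.array(new_vectors)
```

Restricting the backup to B is what yields the savings discussed above: the per-stage cost grows with the size of B rather than with the whole belief space.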
Q-Learning for Stochastic Control under General Information Structures and Non-Markovian Environments
As a primary contribution, we present a convergence theorem for stochastic
iterations, and in particular, Q-learning iterates, under a general, possibly
non-Markovian, stochastic environment. Our conditions for convergence involve
an ergodicity and a positivity criterion. We provide a precise characterization of the limit of the iterates and conditions on the environment and
initializations for convergence. As our second contribution, we discuss the
implications and applications of this theorem to a variety of stochastic
control problems with non-Markovian environments involving (i) quantized
approximations of fully observed Markov Decision Processes (MDPs) with
continuous spaces (where quantization breaks down the Markovian structure), (ii) quantized approximations of belief-MDP reduced partially observable MDPs
(POMDPs) with weak Feller continuity and a mild version of filter stability
(which requires the knowledge of the model by the controller), (iii) finite
window approximations of POMDPs under a uniform controlled filter stability
(which does not require the knowledge of the model), and (iv) multi-agent models, where convergence of learning dynamics to a new class of equilibria, subjective Q-learning equilibria, is studied. In addition to the
convergence theorem, some implications of the theorem above are new to the
literature and others are interpreted as applications of the convergence
theorem. Some open problems are noted.
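As one concrete instance of item (iii), the sketch below runs Q-learning with a finite window of recent observations and actions used in place of a Markov state. The environment interface, the window length, and the uniform exploration policy are assumptions for illustration only; the convergence guarantees discussed above rely on the ergodicity, positivity, and filter-stability conditions of the theorem.

```python
from collections import defaultdict, deque
import random

def finite_window_q_learning(env, window=3, gamma=0.95, steps=200_000):
    """Q-learning over a finite observation-action window (hypothetical sketch)."""
    Q = defaultdict(float)                 # keys: (window_tuple, action)
    counts = defaultdict(int)
    obs_hist, act_hist = deque(maxlen=window), deque(maxlen=window)
    obs_hist.append(env.reset())
    for _ in range(steps):
        z = (tuple(obs_hist), tuple(act_hist))       # non-Markovian proxy state
        a = random.randrange(env.n_actions)          # exploratory policy
        obs, reward, done = env.step(a)
        act_hist.append(a)
        obs_hist.append(obs)
        z_next = (tuple(obs_hist), tuple(act_hist))
        counts[(z, a)] += 1
        alpha = 1.0 / counts[(z, a)]                 # vanishing step size
        best_next = max(Q[(z_next, b)] for b in range(env.n_actions))
        Q[(z, a)] += alpha * (reward + gamma * best_next - Q[(z, a)])
        if done:
            obs_hist.clear(); act_hist.clear()
            obs_hist.append(env.reset())
    return Q
```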
Learning for Multi-robot Cooperation in Partially Observable Stochastic Environments with Macro-actions
This paper presents a data-driven approach for multi-robot coordination in
partially-observable domains based on Decentralized Partially Observable Markov
Decision Processes (Dec-POMDPs) and macro-actions (MAs). Dec-POMDPs provide a
general framework for cooperative sequential decision making under uncertainty
and MAs allow temporally extended and asynchronous action execution. To date,
most methods assume the underlying Dec-POMDP model is known a priori or a full
simulator is available during planning time. Previous methods which aim to
address these issues suffer from local optimality and sensitivity to initial
conditions. Additionally, few hardware demonstrations involving a large team of
heterogeneous robots and with long planning horizons exist. This work addresses
these gaps by proposing an iterative sampling based Expectation-Maximization
algorithm (iSEM) to learn policies using only trajectory data containing
observations, MAs, and rewards. Our experiments show the algorithm is able to
achieve better solution quality than the state-of-the-art learning-based
methods. We implement two variants of multi-robot Search and Rescue (SAR)
domains (with and without obstacles) on hardware to demonstrate the learned
policies can effectively control a team of distributed robots to cooperate in a
partially observable stochastic environment.
Hardware-Efficient Scalable Reinforcement Learning Systems
Reinforcement Learning (RL) is a machine learning discipline in which an agent learns by interacting with its environment. In this paradigm, the agent is required to perceive its state and take actions accordingly. Upon taking each action, a numerical reward is provided by the environment. The goal of the agent is thus to maximize the aggregate rewards it receives over time. Over the past two decades, a large variety of algorithms have been proposed to select actions in order to explore the environment and gradually construct an effective strategy that maximizes the rewards. These RL techniques have been successfully applied to numerous real-world, complex applications including board games and motor control tasks.
Almost all RL algorithms involve the estimation of a value function, which indicates how good it is for the agent to be in a given state, in terms of the total expected reward in the long run. Alternatively, the value function may reflect the impact of taking a particular action at a given state. The most fundamental approach for constructing such a value function consists of updating a table that contains a value for each state (or each state-action pair). However, this approach is impractical for large-scale problems, in which the state and/or action spaces are large. In order to deal with such problems, it is necessary to exploit the generalization capabilities of non-linear function approximators, such as artificial neural networks.
This dissertation focuses on practical methodologies for solving reinforcement learning problems with large state and/or action spaces. In particular, the work addresses scenarios in which an agent does not have full knowledge of its state, but rather receives partial information about its environment via sensory-based observations. In order to address such intricate problems, novel solutions for both tabular and function-approximation-based RL frameworks are proposed. A resource-efficient recurrent neural network algorithm is presented, which exploits adaptive step-size techniques to improve learning characteristics. Moreover, a consolidated actor-critic network is introduced, which omits the modeling redundancy found in typical actor-critic systems. Pivotal concerns are the scalability and speed of the learning algorithms, for which we devise architectures that map efficiently to hardware. As a result, a high degree of parallelism can be achieved. Simulation results that correspond to relevant testbench problems clearly demonstrate the solid performance attributes of the proposed solutions.
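For context on the actor-critic structure mentioned above, the following is a generic one-step tabular actor-critic in its textbook form; it is a baseline sketch under assumed discrete state and action spaces and an assumed environment interface, and it does not reproduce the dissertation's consolidated or recurrent architectures.

```python
import numpy as np

def actor_critic(env, n_states, n_actions, gamma=0.99,
                 alpha_v=0.1, alpha_pi=0.01, episodes=2000):
    """Generic one-step actor-critic with a tabular critic and softmax actor."""
    V = np.zeros(n_states)                   # critic: state-value table
    theta = np.zeros((n_states, n_actions))  # actor: softmax preferences
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            prefs = theta[s] - theta[s].max()
            pi = np.exp(prefs) / np.exp(prefs).sum()
            a = np.random.choice(n_actions, p=pi)
            s_next, r, done = env.step(a)
            # the TD error drives both the critic and the actor update
            td = r + (0.0 if done else gamma * V[s_next]) - V[s]
            V[s] += alpha_v * td
            grad = -pi
            grad[a] += 1.0                   # gradient of log softmax
            theta[s] += alpha_pi * td * grad
            s = s_next
    return V, theta
```

The separate critic table V and actor table theta in this baseline illustrate the duplication that a consolidated actor-critic design, as described above, aims to remove.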
Partially Observable Total-Cost Markov Decision Processes with Weakly Continuous Transition Probabilities
This paper describes sufficient conditions for the existence of optimal policies for partially observable Markov decision processes (POMDPs) with Borel state, observation, and action sets, when the goal is to minimize the expected total costs over finite or infinite horizons. For infinite-horizon problems, one-step costs are either discounted or assumed to be nonnegative. Action sets may be noncompact and one-step cost functions may be unbounded. The introduced conditions are also sufficient for the validity of optimality equations, semicontinuity of value functions, and convergence of value iterations to optimal values. Since POMDPs can be reduced to completely observable Markov decision processes (COMDPs), whose states are posterior state distributions, this paper focuses on the validity of the above-mentioned optimality properties for COMDPs. The central question is whether the transition probabilities for the COMDP are weakly continuous. We introduce sufficient conditions for this and show that the transition probabilities for a COMDP are weakly continuous, if transition probabilities of the underlying Markov decision process are weakly continuous and observation probabilities for the POMDP are continuous in total variation. Moreover, the continuity in total variation of the observation probabilities cannot be weakened to setwise continuity. The results are illustrated with counterexamples and examples.
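The reduction mentioned above replaces the POMDP state with the posterior (belief) distribution. One common way to write the resulting belief update, under the added assumption that the observation kernel admits a density $q(y \mid x')$ with respect to a reference measure, is

\[
  z_{t+1}(B) \;=\;
  \frac{\int_{\mathsf{X}} \int_{B} q(y_{t+1} \mid x')\, T(dx' \mid x, a_t)\, z_t(dx)}
       {\int_{\mathsf{X}} \int_{\mathsf{X}} q(y_{t+1} \mid x')\, T(dx' \mid x, a_t)\, z_t(dx)},
  \qquad B \in \mathcal{B}(\mathsf{X}),
\]

where $z_t$ is the posterior on the state given the information available at time $t$, $T$ is the transition kernel of the underlying MDP, $a_t$ is the action, and $y_{t+1}$ is the next observation. The central continuity question studied in the paper concerns the transition kernel of this belief MDP, that is, the distribution of $z_{t+1}$ given $(z_t, a_t)$.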
Multiscale Markov Decision Problems: Compression, Solution, and Transfer Learning
Many problems in sequential decision making and stochastic control often have
natural multiscale structure: sub-tasks are assembled together to accomplish
complex goals. Systematically inferring and leveraging hierarchical structure,
particularly beyond a single level of abstraction, has remained a longstanding
challenge. We describe a fast multiscale procedure for repeatedly compressing,
or homogenizing, Markov decision processes (MDPs), wherein a hierarchy of
sub-problems at different scales is automatically determined. Coarsened MDPs
are themselves independent, deterministic MDPs, and may be solved using
existing algorithms. The multiscale representation delivered by this procedure
decouples sub-tasks from each other and can lead to substantial improvements in
convergence rates both locally within sub-problems and globally across
sub-problems, yielding significant computational savings. A second fundamental
aspect of this work is that these multiscale decompositions yield new transfer
opportunities across different problems, where solutions of sub-tasks at
different levels of the hierarchy may be amenable to transfer to new problems.
Localized transfer of policies and potential operators at arbitrary scales is
emphasized. Finally, we demonstrate compression and transfer in a collection of
illustrative domains, including examples involving discrete and continuous
state spaces.
Simultaneous Auctions for "Rendez-Vous" Coordination Phases in Multi-robot Multi-task Mission
This paper presents a protocol that makes it possible to automatically allocate tasks, in a distributed way, among a fleet of agents when communication is not permanently available. In cooperative settings where communication is available only during short periods, it is difficult to build joint policies for the agents to collectively accomplish a mission defined by a set of tasks. The proposed approach aims to coordinate the agents at specific points in time, during "Rendezvous" phases defined by the short periods when communication is available. The approach consists of a series of simultaneous auctions used to coordinate individual policies that are computed in a distributed way from Markov decision processes oriented toward several goals. These policies allow the agents to evaluate their own relevance for each task and to communicate bids when possible. The approach is illustrated on multi-mobile-robot missions similar to a distributed traveling salesman problem. Experimental results (through simulation and on real robots) demonstrate that high-quality allocations are computed quickly.
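A minimal sketch of one such coordination phase is given below: every robot bids on every remaining task, and the single best (robot, task) offer is awarded repeatedly until the round's tasks are allocated. The `bid` function is an assumed input; in the paper it would be derived from each robot's goal-oriented MDP policy values.

```python
def rendezvous_round(robots, tasks, bid):
    """One hypothetical 'Rendezvous' allocation round.

    robots: list of robot ids; tasks: iterable of task ids;
    bid(robot, task, tasks_already_held): estimated value (assumed callable).
    """
    assignment = {r: [] for r in robots}
    remaining = set(tasks)
    while remaining:
        # all robots bid on all remaining tasks "simultaneously"
        offers = [(bid(r, t, assignment[r]), r, t)
                  for r in robots for t in remaining]
        _, winner, task = max(offers, key=lambda o: o[0])
        assignment[winner].append(task)
        remaining.remove(task)
    return assignment
```

For instance, with two robots and bids equal to the negative travel cost of appending a task to a robot's current route, the round greedily spreads the tasks across the fleet.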
Perseus: Randomized Point-based Value Iteration for POMDPs
Partially observable Markov decision processes (POMDPs) form an attractive
and principled framework for agent planning under uncertainty. Point-based
approximate techniques for POMDPs compute a policy based on a finite set of
points collected in advance from the agent's belief space. We present a
randomized point-based value iteration algorithm called Perseus. The algorithm
performs approximate value backup stages, ensuring that in each backup stage
the value of each point in the belief set is improved; the key observation is
that a single backup may improve the value of many belief points. Contrary to
other point-based methods, Perseus backs up only a (randomly selected) subset
of points in the belief set, sufficient for improving the value of each belief
point in the set. We show how the same idea can be extended to dealing with
continuous action spaces. Experimental results show the potential of Perseus in
large-scale POMDP problems.
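The randomized backup stage described above can be sketched as follows; the alpha-vector representation and the `backup` routine (a standard point-based Bellman backup at a single belief) are assumed, and this illustrates the stage's improve-every-point guarantee rather than the authors' implementation.

```python
import numpy as np

def perseus_stage(B, V, backup):
    """One Perseus-style value-update stage over belief set B (sketch).

    B: (m, |S|) beliefs; V: (k, |S|) current alpha-vectors;
    backup(b, V): alpha-vector of a full point-based backup at belief b.
    Only a random subset of B is backed up, yet no belief's value decreases.
    """
    rng = np.random.default_rng()
    old_vals = np.max(B.dot(V.T), axis=1)       # current value of each belief
    V_new, not_improved = [], list(range(len(B)))
    while not_improved:
        i = rng.choice(not_improved)            # pick a random un-improved point
        alpha = backup(B[i], V)
        if B[i].dot(alpha) < old_vals[i]:
            # the backup did not help b_i: keep its old best vector instead
            alpha = V[np.argmax(B[i].dot(V.T))]
        V_new.append(alpha)
        new_vals = np.max(B.dot(np.array(V_new).T), axis=1)
        not_improved = [j for j in not_improved if new_vals[j] < old_vals[j]]
    return np.array(V_new)
```

Each pass through the loop improves at least the sampled point, so the stage terminates with a (typically much smaller) set of vectors under which every belief in B is valued at least as highly as before.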