Partially Observable Total-Cost Markov Decision Processes with Weakly Continuous Transition Probabilities
This paper describes sufficient conditions for the existence of optimal policies for partially observable Markov decision processes (POMDPs) with Borel state, observation, and action sets, when the goal is to minimize the expected total costs over finite or infinite horizons. For infinite-horizon problems, one-step costs are either discounted or assumed to be nonnegative. Action sets may be noncompact and one-step cost functions may be unbounded. The introduced conditions are also sufficient for the validity of optimality equations, semicontinuity of value functions, and convergence of value iterations to optimal values. Since POMDPs can be reduced to completely observable Markov decision processes (COMDPs), whose states are posterior state distributions, this paper focuses on the validity of the above-mentioned optimality properties for COMDPs. The central question is whether the transition probabilities for the COMDP are weakly continuous. We introduce sufficient conditions for this and show that the transition probabilities for a COMDP are weakly continuous if the transition probabilities of the underlying Markov decision process are weakly continuous and the observation probabilities for the POMDP are continuous in total variation. Moreover, the continuity in total variation of the observation probabilities cannot be weakened to setwise continuity. The results are illustrated with counterexamples and examples.
Maintenance optimization for a Markovian deteriorating system with population heterogeneity
We develop a partially observable Markov decision process model to incorporate population heterogeneity when scheduling replacements for a deteriorating system. The single-component system deteriorates over a finite set of condition states according to a Markov chain. The population of spare components available for replacements is composed of multiple component types that cannot be distinguished by their exterior appearance but deteriorate according to different transition probability matrices. This situation may arise, for example, because of variations in the production process of components. We provide a set of conditions under which we characterize the structure of the optimal policy that minimizes the total expected discounted operating and replacement cost over an infinite horizon. In a numerical experiment, we benchmark the optimal policy against a heuristic policy that neglects population heterogeneity.
LEARNING ALGORITHMS FOR MARKOV DECISION PROCESSES
We propose various computational schemes for solving Partially Observable
Markov Decision Processes with the finite stage additive cost and infinite
horizon discounted cost criterion. Error bounds for the corresponding algorithms
are given and it is further shown that at the expense of more computational
effort the Partially Observable Markov Decision Problem (POMDP) can be solved
as closely to the optimal as desired.
It is well known that a sufficient statistic for taking the best action at any time for
the POMDP is the a posteriori probability distribution on the underlying states, given
all the past history, and that this can be updated recursively. We prove that the finite
stage optimal costs as well as the optimal cost for the infinite horizon discounted
cost problem are both Lipschitz continuous (with domain the unit simplex of probability
distributions over the underlying states) and give bounds for the Lipschitz constant.
We use these bounds to provide error bounds for computational algorithms for solving
POMDPs.
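The recursive update of the a posteriori distribution mentioned in this abstract can be illustrated with a minimal sketch. It assumes a finite model stored as NumPy arrays; the names `belief_update`, `P`, and `O`, and the two-state example numbers, are hypothetical and do not come from the dissertation.

```python
import numpy as np

def belief_update(belief, action, observation, P, O):
    """One Bayes step on the belief (a posteriori distribution over states).

    Assumed layout (hypothetical, for illustration only):
      P[a][s, s2] = probability of moving from state s to s2 under action a
      O[a][s2, o] = probability of observing o after landing in state s2
    """
    predicted = belief @ P[action]                       # prior over next states
    unnormalized = predicted * O[action][:, observation]  # weight by likelihood
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return unnormalized / total

# Hypothetical model: two states, one action, two observations.
P = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])
O = np.array([[[0.8, 0.2],
               [0.3, 0.7]]])
b0 = np.array([0.5, 0.5])
b1 = belief_update(b0, action=0, observation=1, P=P, O=O)
```

Because the new belief depends only on the previous belief, the action, and the observation, the full history never has to be stored.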
We extend the almost sure convergence result of a very general stochastic approximation
algorithm to the case when the underlying Markov process exhibits periodicity. This result
is used to extend the proof of convergence of Temporal Difference (TD) reinforcement learning
schemes with linear function approximation for Markov Cost processes in order to estimate the
cost to go function for the discounted cost criterion, and the differential cost function for the
average cost criterion, respectively.
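As a rough illustration of the kind of scheme referred to here, the following sketch runs TD(0) with linear function approximation to estimate the discounted cost-to-go of a finite Markov cost process. It covers only the discounted criterion, not the average cost case or the periodicity result, and it is not the dissertation's algorithm. The function name `td0_linear`, the step-size rule, and the two-state chain are assumptions chosen for the example.

```python
import numpy as np

def td0_linear(P, cost, features, gamma, steps, seed=0):
    """Estimate weights theta so that features[s] @ theta approximates the cost-to-go.

    P        : (n, n) transition matrix of the Markov chain
    cost     : (n,) one-step cost incurred in each state
    features : (n, k) feature vector for each state
    """
    rng = np.random.default_rng(seed)
    n_states = P.shape[0]
    theta = np.zeros(features.shape[1])
    s = 0
    for t in range(1, steps + 1):
        s_next = rng.choice(n_states, p=P[s])
        # Temporal difference: one-step cost plus discounted estimate, minus current estimate.
        td_error = cost[s] + gamma * features[s_next] @ theta - features[s] @ theta
        step_size = 1.0 / t ** 0.6   # assumed diminishing step size
        theta += step_size * td_error * features[s]
        s = s_next
    return theta

# Hypothetical two-state chain with tabular (identity) features.
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
cost = np.array([1.0, 0.0])
gamma = 0.5
theta = td0_linear(P, cost, np.eye(2), gamma, steps=20000)
exact = np.linalg.solve(np.eye(2) - gamma * P, cost)  # exact cost-to-go for comparison
```

With identity features the weights are the state values themselves, so `theta` can be compared directly with `exact`.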
Adaptive control of Markov Decision Problems (MDPs) is a problem in which full knowledge
of the system parameters, namely the transition probabilities and the distribution of the
immediate costs, is not available a priori. We give direct adaptive control schemes for
infinite horizon discounted cost and average cost MDPs. Approximate Policy Iteration
using on-line TD schemes for policy evaluation is detailed for the discounted cost and
average cost criteria.
Possible extensions of direct adaptive control schemes to the POMDP framework are
discussed.
Auxiliary results relevant to the core results of the dissertation are stated
and proved in the appendices. In particular an efficient discretization scheme
for the finite dimensional unit simplex is given. Some general error bounds for
MDPs are also given. Also TD schemes for learning in Stochastic Shortest Path
problems (SSP) are discussed.
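The abstract does not describe the discretization scheme itself. As a generic illustration only, the sketch below enumerates a regular grid on the unit simplex, namely all points whose coordinates are multiples of 1/resolution. The function name `simplex_grid` and the stars-and-bars enumeration are assumptions and should not be read as the dissertation's method.

```python
from itertools import combinations
import numpy as np

def simplex_grid(n_states, resolution):
    """Return all points of the unit simplex with coordinates k / resolution.

    Uses stars and bars: place n_states - 1 bars among
    resolution + n_states - 1 slots; the gaps between bars are the counts.
    """
    slots = resolution + n_states - 1
    points = []
    for bars in combinations(range(slots), n_states - 1):
        edges = (-1,) + bars + (slots,)
        counts = [edges[i + 1] - edges[i] - 1 for i in range(n_states)]
        points.append(counts)
    return np.array(points, dtype=float) / resolution

# Grid over beliefs on three states with spacing 1/4.
grid = simplex_grid(3, 4)
```

A Lipschitz bound on the cost function is what turns the spacing of such a grid into an error bound for values computed only at grid points.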
Producing efficient error-bounded solutions for transition independent decentralized MDPs
Pages 539-546. There has been substantial progress on algorithms for single-agent sequential decision making problems represented as partially observable Markov decision processes (POMDPs). A number of efficient algorithms for solving POMDPs share two desirable properties: error bounds and fast convergence rates. Despite significant efforts, no algorithms for solving decentralized POMDPs benefit from these properties, leading to either poor solution quality or limited scalability. This paper presents the first approach for solving transition independent decentralized Markov decision processes (MDPs) that inherits these properties. Two related algorithms illustrate this approach. The first recasts the original problem as a finite-horizon deterministic and completely observable Markov decision process. In this form, the original problem is solved by combining heuristic search with constraint optimization to quickly converge to a near-optimal policy. This algorithm also provides the foundation for the first algorithm for solving infinite-horizon transition independent decentralized MDPs. We demonstrate that both methods outperform state-of-the-art algorithms by multiple orders of magnitude, and for infinite-horizon decentralized MDPs, the algorithm is able to construct more concise policies by searching cyclic policy graphs.
- …