    On the Complexity of Solving Markov Decision Problems

    Markov decision problems (MDPs) provide the foundations for a number of problems of interest to AI researchers studying automated planning and reinforcement learning. In this paper, we summarize results regarding the complexity of solving MDPs and the running time of MDP solution algorithms. We argue that, although MDPs can be solved efficiently in theory, more study is needed to reveal practical algorithms for solving large problems quickly. To encourage future research, we sketch some alternative methods of analysis that rely on the structure of MDPs. Comment: Appears in Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (UAI1995).

    An Analysis of Primal-Dual Algorithms for Discounted Markov Decision Processes

    Several well-known algorithms in the field of combinatorial optimization can be interpreted in terms of the primal-dual method for solving linear programs. For example, Dijkstra's algorithm, the Ford-Fulkerson algorithm, and the Hungarian algorithm can all be viewed as the primal-dual method applied to the linear programming formulations of their respective optimization problems. Roughly speaking, successfully applying the primal-dual method to an optimization problem that can be posed as a linear program relies on the ability to find a simple characterization of the optimal solutions to a related linear program, called the 'dual of the restricted primal' (DRP). This paper is motivated by the following question: What is the algorithm we obtain if we apply the primal-dual method to a linear programming formulation of a discounted-cost Markov decision process? We will first show that several widely used algorithms for Markov decision processes can be interpreted in terms of the primal-dual method, where the value function is updated with suboptimal solutions to the DRP in each iteration. We then provide the optimal solution to the DRP in closed form, and present the algorithm that results when using this solution to update the value function in each iteration. Unlike the algorithms obtained from suboptimal DRP updates, this algorithm is guaranteed to yield the optimal value function in a finite number of iterations. Finally, we show that the iterations of the primal-dual algorithm can be interpreted as repeated application of the policy iteration algorithm to a special class of Markov decision processes. When considered alongside recent results characterizing the computational complexity of the policy iteration algorithm, this observation could provide new insights into the computational complexity of solving discounted-cost Markov decision processes. Comment: Earlier version published in the Proceedings of the 2015 European Control Conference. Appendix added and corrected an erroneous reference to Lemma 7 in the appendix.
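
    For orientation, one textbook linear programming formulation of a discounted-cost MDP is sketched below. It is not taken from the paper and may differ from the formulation the authors use. The symbols are assumptions introduced here: state set $S$, action set $A$, stage cost $c(s,a)$, transition probabilities $P(s' \mid s,a)$, discount factor $\gamma \in (0,1)$, and arbitrary positive weights $\alpha(s)$.

```latex
\[
\begin{aligned}
\text{maximize}\quad   & \sum_{s \in S} \alpha(s)\, v(s) \\
\text{subject to}\quad & v(s) \le c(s,a) + \gamma \sum_{s' \in S} P(s' \mid s,a)\, v(s')
                         \quad \text{for all } s \in S,\ a \in A .
\end{aligned}
\]
```

    Under these assumptions, the optimal $v$ of this program coincides with the optimal cost-to-go function of the MDP.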

    On the Reduction of Total-Cost and Average-Cost MDPs to Discounted MDPs

    This paper provides conditions under which total-cost and average-cost Markov decision processes (MDPs) can be reduced to discounted ones. Results are given for transient total-cost MDPs with transition rates whose values may be greater than one, as well as for average-cost MDPs with transition probabilities satisfying the condition that there is a state such that the expected time to reach it is uniformly bounded for all initial states and stationary policies. In particular, these reductions imply sufficient conditions for the validity of optimality equations and the existence of stationary optimal policies for MDPs with undiscounted total cost and average-cost criteria. When the state and action sets are finite, these reductions lead to linear programming formulations and complexity estimates for MDPs under the aforementioned criteria.

    A unified view of entropy-regularized Markov decision processes

    We propose a general framework for entropy-regularized average-reward reinforcement learning in Markov decision processes (MDPs). Our approach is based on extending the linear-programming formulation of policy optimization in MDPs to accommodate convex regularization functions. Our key result is showing that using the conditional entropy of the joint state-action distributions as regularization yields a dual optimization problem closely resembling the Bellman optimality equations. This result enables us to formalize a number of state-of-the-art entropy-regularized reinforcement learning algorithms as approximate variants of Mirror Descent or Dual Averaging, and thus to argue about the convergence properties of these methods. In particular, we show that the exact version of the TRPO algorithm of Schulman et al. (2015) actually converges to the optimal policy, while the entropy-regularized policy gradient methods of Mnih et al. (2016) may fail to converge to a fixed point. Finally, we illustrate empirically the effects of using various regularization techniques on learning performance in a simple reinforcement learning setup.

    Variance Reduced Value Iteration and Faster Algorithms for Solving Markov Decision Processes

    In this paper we provide faster algorithms for approximately solving discounted Markov Decision Processes in multiple parameter regimes. Given a discounted Markov Decision Process (DMDP) with $|S|$ states, $|A|$ actions, discount factor $\gamma\in(0,1)$, and rewards in the range $[-M, M]$, we show how to compute an $\epsilon$-optimal policy, with probability $1 - \delta$, in time $\tilde{O}\left( \left(|S|^2 |A| + \frac{|S| |A|}{(1 - \gamma)^3} \right) \log\left( \frac{M}{\epsilon} \right) \log\left( \frac{1}{\delta} \right) \right)$. This contribution reflects the first nearly linear time, nearly linearly convergent algorithm for solving DMDPs for intermediate values of $\gamma$. We also show how to obtain improved sublinear time algorithms provided we can sample from the transition function in $O(1)$ time. Under this assumption we provide an algorithm which computes an $\epsilon$-optimal policy with probability $1 - \delta$ in time $\tilde{O} \left(\frac{|S| |A| M^2}{(1 - \gamma)^4 \epsilon^2} \log \left(\frac{1}{\delta}\right) \right)$. Lastly, we extend both these algorithms to solve finite horizon MDPs. Our algorithms improve upon the previous best for approximately computing optimal policies for fixed-horizon MDPs in multiple parameter regimes. Interestingly, we obtain our results by a careful modification of approximate value iteration. We show how to combine classic approximate value iteration analysis with new techniques in variance reduction. Our fastest algorithms leverage further insights to ensure that our algorithms make monotonic progress towards the optimal value. This paper is one of few instances of using sampling to obtain a linearly convergent linear programming algorithm and we hope that the analysis may be useful more broadly.
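
    To make the sampling setting concrete, here is a minimal sketch of plain sampling-based approximate value iteration, assuming access to a sampler for the transition function. It is not the variance-reduced algorithm of the paper, and it carries none of the stated running-time guarantees. The function names, the toy MDP, and all parameter values are hypothetical.

```python
import random


def sampled_value_iteration(sample_next, reward, n_states, n_actions,
                            gamma, n_samples, n_iters, rng):
    """Approximate value iteration using Monte Carlo Bellman backups.

    sample_next(s, a, rng) draws one next state (assumed sampling oracle);
    reward(s, a) is the immediate reward.
    """
    v = [0.0] * n_states
    for _ in range(n_iters):
        new_v = []
        for s in range(n_states):
            best = float("-inf")
            for a in range(n_actions):
                # Estimate E[v(next state)] from n_samples independent draws.
                draws = (v[sample_next(s, a, rng)] for _ in range(n_samples))
                estimate = sum(draws) / n_samples
                best = max(best, reward(s, a) + gamma * estimate)
            new_v.append(best)
        v = new_v
    return v


# Hypothetical toy MDP: action 0 stays put; action 1 switches state w.p. 0.5.
def toy_sample_next(s, a, rng):
    if a == 0:
        return s
    return 1 - s if rng.random() < 0.5 else s


def toy_reward(s, a):
    # Reward 1 is earned only in state 1, regardless of the action.
    return 1.0 if s == 1 else 0.0


approx_values = sampled_value_iteration(
    toy_sample_next, toy_reward, n_states=2, n_actions=2,
    gamma=0.9, n_samples=50, n_iters=200, rng=random.Random(0))
```

    In this toy problem the value of state 1 approaches $1/(1-\gamma) = 10$, while the estimate for state 0 is noisy because it depends on sampled transitions.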

    Sequential Decision Making with Limited Observation Capability: Application to Wireless Networks

    This work studies a generalized class of restless multi-armed bandits with hidden states that allow cumulative feedback, as opposed to the conventional instantaneous feedback. We call them lazy restless bandits (LRB) because decision-making events are sparser than state-transition events. Hence, feedback after each decision event is the cumulative effect of the following state transition events. The states of arms are hidden from the decision-maker and rewards for actions are state dependent. The decision-maker needs to choose one arm in each decision interval, such that the long-term cumulative reward is maximized. As the states are hidden, the decision-maker maintains and updates its belief about them. It is shown that LRBs admit an optimal policy which has a threshold structure in belief space. The Whittle-index policy for solving the LRB problem is analyzed; indexability of LRBs is shown. Further, closed-form index expressions are provided for two sets of special cases; for more general cases, an algorithm for index computation is provided. An extensive simulation study is presented; Whittle-index, modified Whittle-index and myopic policies are compared. Lagrangian relaxation of the problem provides an upper bound on the optimal value function; it is used to assess the degree of sub-optimality of various policies.
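
    As a rough illustration of belief maintenance over several hidden transitions, the sketch below propagates the belief for a single two-state arm. It assumes the arm is a two-state Markov chain and that no feedback arrives between updates. The paper's model also incorporates cumulative feedback, which is omitted here. The function name and the transition probabilities are hypothetical.

```python
def propagate_belief(belief, p01, p11, steps=1):
    """Propagate the belief that a hidden two-state arm is in state 1.

    p01 = P(next state is 1 | current state is 0)
    p11 = P(next state is 1 | current state is 1)
    steps = number of unobserved state transitions between decisions.
    """
    for _ in range(steps):
        belief = belief * p11 + (1.0 - belief) * p01
    return belief


one_step = propagate_belief(1.0, p01=0.2, p11=0.7, steps=1)
long_run = propagate_belief(1.0, p01=0.2, p11=0.7, steps=200)
```

    With these example numbers, the belief after many unobserved transitions approaches the stationary probability $p_{01} / (p_{01} + 1 - p_{11}) = 0.4$.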

    Optimal control in Markov decision processes via distributed optimization

    Optimal control synthesis in stochastic systems with respect to quantitative temporal logic constraints can be formulated as linear programming problems. However, centralized synthesis algorithms do not scale to many practical systems. To tackle this issue, we propose a decomposition-based distributed synthesis algorithm. By decomposing a large-scale stochastic system modeled as a Markov decision process into a collection of interacting sub-systems, the original control problem is formulated as a linear programming problem with a sparse constraint matrix, which can be solved through distributed optimization methods. Additionally, we propose a decomposition algorithm which automatically exploits the modular structure of a given large-scale system, if such structure exists. We illustrate the proposed methods through robotic motion planning examples. Comment: 8 pages, 5 figures, submitted to CDC 2015 conference.

    Partially Observed Markov Decision Processes. Problem Sets and Internet Supplement

    This document is an internet supplement to my book "Partially Observed Markov Decision Processes - From Filtering to Controlled Sensing" published by Cambridge University Press in 2016. This internet supplement contains exercises, examples and case studies. The material appears in this internet supplement (instead of the book) so that it can be updated. This document is a work in progress: it will be updated periodically, and further discussion and examples will be added over time. I welcome constructive comments from readers of the book and this internet supplement.

    A Possibilistic Model for Qualitative Sequential Decision Problems under Uncertainty in Partially Observable Environments

    In this article we propose a qualitative (ordinal) counterpart for the Partially Observable Markov Decision Processes model (POMDP) in which the uncertainty, as well as the preferences of the agent, are modeled by possibility distributions. This qualitative counterpart of the POMDP model relies on a recently developed possibilistic theory of decision under uncertainty. One advantage of such a qualitative framework is its ability to escape from the classical obstacle of stochastic POMDPs, in which even with a finite state space, the obtained belief state space of the POMDP is infinite. Instead, in the possibilistic framework, the belief state space remains finite, even if it is exponentially larger than the state space. Comment: Appears in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI1999).

    Acceleration Operators in the Value Iteration Algorithms for Markov Decision Processes

    We study the general approach to accelerating the convergence of the most widely used solution method of Markov decision processes with the total expected discounted reward. Inspired by the monotone behavior of the contraction mappings in the feasible set of the linear programming problem equivalent to the MDP, we establish a class of operators that can be used in combination with a contraction mapping operator in the standard value iteration algorithm and its variants. We then propose two such operators, which can be easily implemented as part of the value iteration algorithm and its variants. Numerical studies show that the computational savings can be significant, especially when the discount factor approaches 1 and the transition probability matrix becomes dense, cases in which the standard value iteration algorithm and its variants suffer from slow convergence. Comment: 32 pages, 2 figures, 2 tables.
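
    For context, the sketch below implements the standard (unaccelerated) value iteration algorithm for a discounted-reward MDP. It does not implement the acceleration operators proposed in the paper. The data layout, function name, and toy MDP are assumptions made for illustration.

```python
def value_iteration(P, R, gamma, tol=1e-8, max_iter=100_000):
    """Standard value iteration for a finite discounted-reward MDP.

    P[s][a] is a list of (next_state, probability) pairs and R[s][a] is the
    immediate reward (a hypothetical layout chosen for this sketch).
    Returns the value estimate and the number of iterations performed.
    """
    n_states = len(P)
    v = [0.0] * n_states
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Bellman optimality backup applied to every state.
        new_v = [
            max(R[s][a] + gamma * sum(p * v[t] for t, p in P[s][a])
                for a in range(len(P[s])))
            for s in range(n_states)
        ]
        change = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if change < tol:
            break
    return v, iteration


# Hypothetical toy MDP: state 0 can stay (reward 0) or move to state 1
# (reward 1); state 1 is absorbing with reward 2 per step.
P = [[[(0, 1.0)], [(1, 1.0)]], [[(1, 1.0)]]]
R = [[0.0, 1.0], [2.0]]

values, iterations = value_iteration(P, R, gamma=0.9)
# values is approximately [19.0, 20.0]
```

    Running the same function with a discount factor closer to 1 (for example `gamma=0.99`) takes many more iterations to meet the same tolerance, which is the slow-convergence regime the abstract refers to.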