On the Complexity of Solving Markov Decision Problems
Markov decision problems (MDPs) provide the foundations for a number of
problems of interest to AI researchers studying automated planning and
reinforcement learning. In this paper, we summarize results regarding the
complexity of solving MDPs and the running time of MDP solution algorithms. We
argue that, although MDPs can be solved efficiently in theory, more study is
needed to reveal practical algorithms for solving large problems quickly. To
encourage future research, we sketch some alternative methods of analysis that
rely on the structure of MDPs.
Comment: Appears in Proceedings of the Eleventh Conference on Uncertainty in
Artificial Intelligence (UAI1995).
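For readers unfamiliar with the algorithms whose running times such complexity results concern, a minimal value-iteration sketch for a finite discounted MDP is shown below; the transition array, rewards, and discount factor are illustrative placeholders, not taken from the paper.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Minimal value iteration for a finite discounted MDP.

    P: transition probabilities, shape (A, S, S), P[a, s, s2] = Pr(s2 | s, a)
    R: expected immediate rewards, shape (A, S)
    Returns an approximately optimal value function and a greedy policy.
    """
    V = np.zeros(P.shape[1])
    while True:
        Q = R + gamma * P @ V          # Bellman backup for every (a, s) pair
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new

# Tiny illustrative 2-state, 2-action MDP (made-up numbers).
P = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
V_opt, policy = value_iteration(P, R)
```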
An Analysis of Primal-Dual Algorithms for Discounted Markov Decision Processes
Several well-known algorithms in the field of combinatorial optimization can
be interpreted in terms of the primal-dual method for solving linear programs.
For example, Dijkstra's algorithm, the Ford-Fulkerson algorithm, and the
Hungarian algorithm can all be viewed as the primal-dual method applied to the
linear programming formulations of their respective optimization problems.
Roughly speaking, successfully applying the primal-dual method to an
optimization problem that can be posed as a linear program relies on the
ability to find a simple characterization of the optimal solutions to a related
linear program, called the `dual of the restricted primal' (DRP).
This paper is motivated by the following question: What is the algorithm we
obtain if we apply the primal-dual method to a linear programming formulation
of a discounted cost Markov decision process? We will first show that several
widely-used algorithms for Markov decision processes can be interpreted in
terms of the primal-dual method, where the value function is updated with
suboptimal solutions to the DRP in each iteration. We then provide the optimal
solution to the DRP in closed-form, and present the algorithm that results when
using this solution to update the value function in each iteration. Unlike the
algorithms obtained from suboptimal DRP updates, this algorithm is guaranteed
to yield the optimal value function in a finite number of iterations. Finally,
we show that the iterations of the primal-dual algorithm can be interpreted as
repeated application of the policy iteration algorithm to a special class of
Markov decision processes. When considered alongside recent results
characterizing the computational complexity of the policy iteration algorithm,
this observation could provide new insights into the computational complexity
of solving discounted-cost Markov decision processes.
Comment: Earlier version published in the Proceedings of the 2015 European
Control Conference. Appendix added and corrected an erroneous reference to
Lemma 7 in the appendix.
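For context, the exact linear-programming formulation of a discounted-cost MDP that such primal-dual interpretations start from can be sketched as follows (generic notation with state-relevance weights $\alpha(s) > 0$, not necessarily the paper's):

```latex
% Exact LP for a discounted-cost MDP; the optimal value function v* is the
% unique optimal solution for any state-relevance weights alpha(s) > 0.
\begin{aligned}
\max_{v \in \mathbb{R}^{S}} \quad & \sum_{s \in S} \alpha(s)\, v(s) \\
\text{s.t.} \quad & v(s) \le c(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, v(s')
  \qquad \forall\, s \in S,\ a \in A.
\end{aligned}
```

Its dual is a linear program over discounted state-action occupation measures; the restricted primals and their duals (the DRP) discussed in the abstract are derived from this pair.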
On the Reduction of Total-Cost and Average-Cost MDPs to Discounted MDPs
This paper provides conditions under which total-cost and average-cost Markov
decision processes (MDPs) can be reduced to discounted ones. Results are given
for transient total-cost MDPs with transition rates whose values may be
greater than one, as well as for average-cost MDPs with transition
probabilities satisfying the condition that there is a state such that the
expected time to reach it is uniformly bounded for all initial states and
stationary policies. In particular, these reductions imply sufficient
conditions for the validity of optimality equations and the existence of
stationary optimal policies for MDPs with undiscounted total cost and
average-cost criteria. When the state and action sets are finite, these
reductions lead to linear programming formulations and complexity estimates for
MDPs under the aforementioned criteria.
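To make the flavor of such reductions concrete, here is the classical special case in which the reduction is immediate (much narrower than the conditions treated in the paper): uniformly substochastic transition rates bounded by some $\beta < 1$.

```latex
% Classical special case: transition rates q with  \sum_{s'} q(s' \mid s,a) \le \beta < 1.
% Add a cost-free absorbing state \bar{s}, take discount factor \beta, and set
p(s' \mid s, a) = \frac{q(s' \mid s, a)}{\beta}, \qquad
p(\bar{s} \mid s, a) = 1 - \sum_{s''} \frac{q(s'' \mid s, a)}{\beta}.
% Then  c(s,a) + \sum_{s'} q(s' \mid s,a)\, v(s') = c(s,a) + \beta \sum_{s'} p(s' \mid s,a)\, v(s'),
% so undiscounted total costs under q coincide with \beta-discounted costs under p.
```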
A unified view of entropy-regularized Markov decision processes
We propose a general framework for entropy-regularized average-reward
reinforcement learning in Markov decision processes (MDPs). Our approach is
based on extending the linear-programming formulation of policy optimization in
MDPs to accommodate convex regularization functions. Our key result is showing
that using the conditional entropy of the joint state-action distributions as
regularization yields a dual optimization problem closely resembling the
Bellman optimality equations. This result enables us to formalize a number of
state-of-the-art entropy-regularized reinforcement learning algorithms as
approximate variants of Mirror Descent or Dual Averaging, and thus to argue
about the convergence properties of these methods. In particular, we show that
the exact version of the TRPO algorithm of Schulman et al. (2015) actually
converges to the optimal policy, while the entropy-regularized policy gradient
methods of Mnih et al. (2016) may fail to converge to a fixed point. Finally,
we illustrate empirically the effects of using various regularization
techniques on learning performance in a simple reinforcement learning setup.
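As a rough sketch of the kind of regularized linear program the abstract refers to, in the average-reward case and in generic notation (the regularization weight $1/\eta$ and the exact form are assumptions, not the paper's precise formulation), the optimization over state-action occupancy measures $\mu$ might read:

```latex
% Entropy-regularized average-reward policy optimization over occupancy measures
% (a sketch; the regularization weight 1/\eta is an assumption):
\begin{aligned}
\max_{\mu \ge 0} \quad & \sum_{s,a} \mu(s,a)\, r(s,a)
  - \frac{1}{\eta} \sum_{s,a} \mu(s,a) \log \frac{\mu(s,a)}{\sum_{a'} \mu(s,a')} \\
\text{s.t.} \quad & \sum_{a} \mu(s',a) = \sum_{s,a} P(s' \mid s,a)\, \mu(s,a) \quad \forall s',
  \qquad \sum_{s,a} \mu(s,a) = 1,
\end{aligned}
% where the second objective term equals (1/\eta) times the conditional
% entropy H_\mu(a \mid s) of the joint state-action distribution.
```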
Variance Reduced Value Iteration and Faster Algorithms for Solving Markov Decision Processes
In this paper we provide faster algorithms for approximately solving
discounted Markov Decision Processes in multiple parameter regimes. Given a
discounted Markov Decision Process (DMDP) with $|S|$ states, $|A|$ actions,
discount factor $\gamma$, and rewards in a bounded range, we show how to compute
an $\epsilon$-optimal policy, with probability $1-\delta$, in nearly linear
time. This contribution reflects the first nearly linear time, nearly linearly
convergent algorithm for solving DMDPs for intermediate values of $\gamma$.
We also show how to obtain improved sublinear time algorithms provided we can
sample from the transition function in $O(1)$ time. Under this assumption we
provide an algorithm which computes an $\epsilon$-optimal policy with
probability $1-\delta$ in sublinear time.
Lastly, we extend both these algorithms to solve finite horizon MDPs. Our
algorithms improve upon the previous best for approximately computing optimal
policies for fixed-horizon MDPs in multiple parameter regimes.
Interestingly, we obtain our results by a careful modification of approximate
value iteration. We show how to combine classic approximate value iteration
analysis with new techniques in variance reduction. Our fastest algorithms
leverage further insights to ensure that our algorithms make monotonic progress
towards the optimal value. This paper is one of the few instances of using
sampling to obtain a linearly convergent linear programming algorithm, and we
hope that the analysis may be useful more broadly.
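The variance-reduction idea described above can be sketched roughly as follows, assuming sampling access to the transition function; this is an illustrative simplification, not the paper's actual algorithm or parameter choices.

```python
import numpy as np

def variance_reduced_backup(sample_next, R, V_ref, offsets, V, gamma, n_small):
    """One sampled Bellman backup with a variance-reduction baseline (sketch).

    Rather than estimating E[V(s')] from scratch for every (s, a), reuse a
    precomputed high-accuracy estimate of E[V_ref(s')] stored in offsets[a, s]
    and estimate only the correction E[V(s') - V_ref(s')], whose variance is
    small whenever V stays close to the reference V_ref.
    sample_next(s, a, n) should return n sampled next-state indices.
    """
    A, S = R.shape
    Q = np.empty((A, S))
    for a in range(A):
        for s in range(S):
            nxt = sample_next(s, a, n_small)            # only a few samples needed
            correction = np.mean(V[nxt] - V_ref[nxt])
            Q[a, s] = R[a, s] + gamma * (offsets[a, s] + correction)
    return Q.max(axis=0)
```

Here `offsets[a, s]` would hold a many-sample estimate of the expected value of `V_ref` at the next state, refreshed only at the start of each epoch, so the cheap inner backups only need to estimate a small-variance correction term; the monotonicity-preserving tweaks and precise sample sizes are what the paper's analysis pins down.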
Sequential Decision Making with Limited Observation Capability: Application to Wireless Networks
This work studies a generalized class of restless multi-armed bandits with
hidden states and cumulative feedback, as opposed to the conventional
instantaneous feedback. We call them lazy restless bandits (LRB) as the events
of decision-making are sparser than events of state transition. Hence, feedback
after each decision event is the cumulative effect of the following state
transition events. The states of arms are hidden from the decision-maker and
rewards for actions are state dependent. The decision-maker needs to choose one
arm in each decision interval, such that the long-term cumulative reward is
maximized.
As the states are hidden, the decision-maker maintains and updates its belief
about them. It is shown that LRBs admit an optimal policy which has threshold
structure in belief space. The Whittle-index policy for solving the LRB problem is
analyzed; indexability of LRBs is shown. Further, closed-form index expressions
are provided for two sets of special cases; for more general cases, an
algorithm for index computation is provided. An extensive simulation study is
presented; Whittle-index, modified Whittle-index and myopic policies are
compared. Lagrangian relaxation of the problem provides an upper bound on the
optimal value function; it is used to assess the degree of sub-optimality of
the various policies.
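As an illustration of the belief-maintenance step, a standard Bayes filter for a hypothetical two-state arm might look like the following; the state space, transition matrix, and feedback likelihoods are invented for the example and are not the paper's model.

```python
import numpy as np

# Hypothetical two-state arm (e.g. a "bad"/"good" channel); numbers are made up.
P = np.array([[0.9, 0.1],     # P[i, j] = Pr(next state j | current state i)
              [0.3, 0.7]])

def predict(belief, k):
    """Belief after k unobserved state transitions between sparse decision epochs."""
    return belief @ np.linalg.matrix_power(P, k)

def bayes_update(belief, likelihood):
    """Condition on cumulative feedback; likelihood[i] = Pr(feedback | state i)."""
    posterior = belief * likelihood
    return posterior / posterior.sum()

belief = np.array([0.5, 0.5])                        # initial belief over the hidden state
belief = predict(belief, k=4)                        # arm keeps evolving while unobserved
belief = bayes_update(belief, np.array([0.2, 0.8]))  # feedback more likely under state 1
```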
Optimal control in Markov decision processes via distributed optimization
Optimal control synthesis in stochastic systems with respect to quantitative
temporal logic constraints can be formulated as linear programming problems.
However, centralized synthesis algorithms do not scale to many practical
systems. To tackle this issue, we propose a decomposition-based distributed
synthesis algorithm. By decomposing a large-scale stochastic system modeled as
a Markov decision process into a collection of interacting sub-systems, the
original control problem is formulated as a linear programming problem with a
sparse constraint matrix, which can be solved through distributed optimization
methods. Additionally, we propose a decomposition algorithm which automatically
exploits, if it exists, the modular structure in a given large-scale system. We
illustrate the proposed methods through robotic motion planning examples.
Comment: 8 pages, 5 figures, submitted to the CDC 2015 conference.
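The block structure that makes distributed optimization applicable can be sketched generically as follows, where $x_i$ collects the variables of sub-system $i$ (a schematic form, not the paper's exact program):

```latex
% Schematic block-structured LP produced by the decomposition (generic notation):
\begin{aligned}
\min_{x_1, \dots, x_N} \quad & \sum_{i=1}^{N} c_i^{\top} x_i \\
\text{s.t.} \quad & A_i x_i = b_i, \quad x_i \ge 0, \qquad i = 1, \dots, N,
  && \text{(local constraints of sub-system } i\text{)} \\
 & \sum_{i=1}^{N} E_i x_i = d,
  && \text{(sparse coupling between interacting sub-systems)}
\end{aligned}
```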
Partially Observed Markov Decision Processes. Problem Sets and Internet Supplement
This document is an internet supplement to my book "Partially Observed Markov
Decision Processes - From Filtering to Controlled Sensing" published by
Cambridge University Press in 2016. This internet supplement contains
exercises, examples and case studies. The material appears in this internet
supplement (instead of the book) so that it can be updated. This document will
evolve over time and further discussion and examples will be added. This
internet supplement document is a work in progress and will be updated
periodically. I welcome constructive comments from readers of the book and this
internet supplement.
A Possibilistic Model for Qualitative Sequential Decision Problems under Uncertainty in Partially Observable Environments
In this article we propose a qualitative (ordinal) counterpart for the
Partially Observable Markov Decision Processes model (POMDP) in which the
uncertainty, as well as the preferences of the agent, are modeled by
possibility distributions. This qualitative counterpart of the POMDP model
relies on a recently developed possibilistic theory of decision under
uncertainty. One advantage of such a qualitative framework is that it escapes a
classical obstacle of stochastic POMDPs: even with a finite state space, the
resulting belief state space is infinite. In the possibilistic framework, by
contrast, the belief state space remains finite, even if it is exponentially
larger than the state space.
Comment: Appears in Proceedings of the Fifteenth Conference on Uncertainty in
Artificial Intelligence (UAI1999).
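A small sketch of why the possibilistic belief space stays finite: if possibility degrees are drawn from a fixed finite scale and the belief update uses only min and max, every reachable belief is a vector over that scale, giving at most |scale|^|states| beliefs. The particular max-min update and the numbers below are illustrative assumptions, not necessarily the exact operators of the paper.

```python
from itertools import product

SCALE = (0.0, 0.5, 1.0)        # finite ordinal scale of possibility degrees (assumed)
STATES = (0, 1)

# Possibilistic transition model: pi_T[s][s2] is the possibility of moving
# from s to s2 under some fixed action (hypothetical values on the scale).
pi_T = {0: {0: 1.0, 1: 0.5},
        1: {0: 0.5, 1: 1.0}}

def possibilistic_predict(belief):
    """Qualitative (max-min) counterpart of the probabilistic prediction step."""
    new = {s2: max(min(belief[s], pi_T[s][s2]) for s in STATES) for s2 in STATES}
    top = max(new.values())
    # Qualitative normalization: at least one state must be fully possible.
    return {s: 1.0 if new[s] == top else new[s] for s in STATES}

# Every belief is a STATES-indexed vector with entries in SCALE, so the belief
# space is contained in a finite set of size |SCALE| ** |STATES|.
print(len(list(product(SCALE, repeat=len(STATES)))))   # -> 9 candidate beliefs
```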
Acceleration Operators in the Value Iteration Algorithms for Markov Decision Processes
We study a general approach to accelerating the convergence of the most
widely used solution method for Markov decision processes with the total
expected discounted reward. Inspired by the monotone behavior of the
contraction mappings in the feasible set of the linear programming problem
equivalent to the MDP, we establish a class of operators that can be used in
combination with a contraction mapping operator in the standard value iteration
algorithm and its variants. We then propose two such operators, which can be
easily implemented as part of the value iteration algorithm and its variants.
Numerical studies show that the computational savings can be significant
especially when the discount factor approaches 1 and the transition probability
matrix becomes dense, in which case the standard value iteration algorithm and
its variants suffer from slow convergence.
Comment: 32 pages, 2 figures, 2 tables.
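As a concrete, classical example of composing value iteration with an extra operator, the sketch below applies a MacQueen-style constant shift derived from the backup residual after each Bellman update; it illustrates the general pattern only and is not one of the specific operators proposed in the paper.

```python
import numpy as np

def accelerated_value_iteration(P, R, gamma, iters=100):
    """Value iteration composed with a classical error-bound shift (a sketch).

    P: shape (A, S, S), R: shape (A, S); total expected discounted reward criterion.
    After each Bellman backup a constant MacQueen-style shift, derived from the
    backup residual, is added.  By the standard bound
        V* >= T V + (gamma / (1 - gamma)) * min_s (T V - V)(s),
    the shifted iterate never overshoots V*, and with the monotone
    initialization below the iterates increase toward V*.
    """
    S = P.shape[1]
    V = np.full(S, R.min() / (1.0 - gamma))       # guarantees V <= T V initially
    for _ in range(iters):
        TV = (R + gamma * P @ V).max(axis=0)      # contraction mapping (Bellman backup)
        shift = gamma / (1.0 - gamma) * np.min(TV - V)   # extra "acceleration" operator
        V = TV + shift
    return V
```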