Uniform positive recurrence and long term behavior of Markov decision processes, with applications in sensor scheduling
In this dissertation, we show a number of new results relating to stability, optimal control, and value iteration algorithms for discrete-time Markov decision processes (MDPs). First, we adapt two recent results from controlled diffusion processes to countable-state MDPs by making assumptions that approximate continuous behavior. We show that if the MDP is stable under every stationary policy, then it must be uniformly stable over all policies. This abstract result is very useful in the analysis of optimal control problems, and extends the characterization of uniform stability properties for MDPs. We then derive two useful local bounds on the discounted value functions for a large class of MDPs, facilitating analysis of the ergodic cost problem via the Arzela-Ascoli theorem. We also examine and exploit the previously underutilized Harnack inequality for discrete Markov chains; one aim of this work was to discover how much can be accomplished for models with this property.
Convergence of the value iteration algorithm is typically treated in the literature under blanket stability assumptions. We establish two new sufficient conditions for convergence of the value iteration algorithm without blanket stability, requiring only geometric ergodicity under the optimal policy. These results provide the theoretical basis for applying value iteration to classes of problems that were previously out of reach.
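As a concrete reference point, here is a minimal value iteration sketch for a finite, discounted MDP. The dissertation treats countable state spaces and average-cost criteria, so this finite discounted version is purely illustrative; the array shapes and function signature are assumptions, not the author's implementation:

```python
import numpy as np

def value_iteration(P, r, gamma=0.95, tol=1e-8, max_iter=10_000):
    """Value iteration for a finite discounted MDP (illustrative sketch).

    P : (A, S, S) array, P[a, s, s2] = Pr(next state s2 | state s, action a)
    r : (S, A) array of immediate rewards
    Returns (V, policy), where policy is greedy with respect to V.
    """
    S, A = r.shape
    V = np.zeros(S)
    for _ in range(max_iter):
        # Bellman optimality backup: Q[s, a] = r[s, a] + gamma * E[V(next state)]
        Q = r + gamma * np.einsum("asd,d->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:  # sup-norm stopping criterion
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)
```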
We then consider a discrete-time linear system with Gaussian white noise and quadratic costs, observed via multiple sensors that communicate over a congested network. Observations are lost or received according to a Bernoulli random variable whose loss rate is determined by the state of the network and the choice of sensor. We completely analyze the finite-horizon, discounted, and long-term average optimal control problems. Assuming that the system is stabilizable, we use a partial separation principle to transform the problem into an MDP on the set of symmetric, positive definite matrices. A special case of these results generalizes a known result for Kalman filters with intermittent observations to the multiple-sensor case.
Finally, we show that the value iteration algorithm converges without additional assumptions, as the structure of the problem guarantees geometric ergodicity under the optimal policy. The results allow the incorporation of adaptive schemes to determine unknown system parameters without affecting stability or long-term average cost. We also show that after only a few steps of the value iteration algorithm, the generated policy is geometrically ergodic and near-optimal.
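The MDP on positive definite matrices described above is driven by the Kalman error-covariance recursion with randomly lost measurements. Below is a minimal sketch of that recursion under a sensor schedule; the linear-Gaussian model is the standard one from intermittent-observation Kalman filtering, but the function names, the schedule interface, and the per-sensor Bernoulli loss parameterization are illustrative assumptions:

```python
import numpy as np

def covariance_step(P, A, Q, C, R, received):
    """One step of the Kalman error-covariance recursion when the
    measurement may be lost (intermittent-observation setting).

    P, Q : state error/noise covariances; A : dynamics; C, R : sensor model.
    received : bool, whether this step's measurement actually arrived.
    """
    P_pred = A @ P @ A.T + Q            # time update (always performed)
    if received:                        # measurement update only if packet arrives
        S = C @ P_pred @ C.T + R        # innovation covariance
        K = P_pred @ C.T @ np.linalg.inv(S)
        P_pred = P_pred - K @ C @ P_pred
    return P_pred

def simulate(P0, A, Q, sensors, schedule, loss_rates, steps, rng):
    """Evolve the covariance under a sensor schedule; the sensor chosen at
    each step delivers its measurement w.p. 1 - loss_rates[i] (Bernoulli)."""
    P = P0
    for k in range(steps):
        i = schedule(k, P)              # pick a sensor based on current covariance
        C, R = sensors[i]
        P = covariance_step(P, A, Q, C, R, rng.random() < 1 - loss_rates[i])
    return P
```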
Reinforcement Learning: A Survey
This paper surveys the field of reinforcement learning from a computer-science perspective. It is written to be accessible to researchers familiar with machine learning. Both the historical basis of the field and a broad selection of current work are summarized. Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment. The work described here has a resemblance to work in psychology, but differs considerably in the details and in the use of the word "reinforcement." The paper discusses central issues of reinforcement learning, including trading off exploration and exploitation, establishing the foundations of the field via Markov decision theory, learning from delayed reinforcement, constructing empirical models to accelerate learning, making use of generalization and hierarchy, and coping with hidden state. It concludes with a survey of some implemented systems and an assessment of the practical utility of current methods for reinforcement learning.

Comment: See http://www.jair.org/ for any accompanying files.
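To make the trial-and-error setting concrete, here is a minimal tabular Q-learning sketch with epsilon-greedy exploration, one of the canonical methods such surveys cover; the `env` interface (reset/step) and the hyperparameter values are assumptions for illustration:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, eps=0.1, rng=None):
    """Tabular Q-learning with epsilon-greedy exploration.

    `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done), with integer states.
    """
    rng = rng or np.random.default_rng()
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # explore with probability eps, otherwise exploit current estimate
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
            s2, r, done = env.step(a)
            # one-step temporal-difference update toward the Bellman target
            Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s2].max()) - Q[s, a])
            s = s2
    return Q
```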
Sensor Scheduling for Optimal Observability Using Estimation Entropy
We consider sensor scheduling as the optimal observability problem for partially observable Markov decision processes (POMDPs). This model fits cases in which a Markov process is observed either by a single sensor that needs to be dynamically adjusted, or by a set of sensors that are selected one at a time so as to maximize information acquisition from the process. As in conventional POMDP problems, the control action is based on all past measurements; here, however, the action does not control the state process, which is autonomous, but instead influences how that process is measured. This POMDP is a controlled version of the hidden Markov process, and we show that its optimal observability problem can be formulated as an average-cost Markov decision process (MDP) scheduling problem. In this problem, a policy is a rule for selecting sensors or adjusting the measuring device based on the measurement history. Given a policy, we can evaluate the estimation entropy of the joint state-measurement process, which inversely measures the observability of the state process under that policy. Taking estimation entropy as the cost of a policy, we show that finding an optimal policy is equivalent to an average-cost MDP scheduling problem whose cost function is the entropy over the belief space. This allows the policy iteration algorithm to be applied to find the policy achieving minimum estimation entropy, and thus optimal observability.

Comment: 5 pages, submitted to the 2007 IEEE PerCom/PerSeNS conference.
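A minimal sketch of the two ingredients this formulation combines, the belief (HMM filter) update under a selected sensor and an entropy cost on the belief space, is below. The function names are assumptions, and using one-step belief entropy as the stage cost is a simplification of the paper's estimation entropy, which is defined for the joint state-measurement process:

```python
import numpy as np

def belief_update(b, T, O_a, y):
    """HMM filter step: propagate the belief through the (autonomous) state
    transition matrix, then condition on the chosen sensor's observation.

    b   : (S,) current belief over hidden states
    T   : (S, S) transition matrix, T[i, j] = Pr(j | i)
    O_a : (S, Y) observation matrix of the selected sensor
    y   : observed symbol index
    """
    b_pred = b @ T                      # prediction: state evolves on its own
    b_post = b_pred * O_a[:, y]         # correction: weight by likelihood of y
    return b_post / b_post.sum()

def belief_entropy(b, eps=1e-12):
    """Shannon entropy of the belief, used here as the per-stage cost on the
    belief space; lower entropy means a better-observed state process."""
    b = np.clip(b, eps, 1.0)
    return -np.sum(b * np.log(b))
```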