Q-Learning for Continuous State and Action MDPs under Average Cost Criteria
For infinite-horizon average-cost criterion problems, we present several
approximation and reinforcement learning results for Markov Decision Processes
with standard Borel spaces. Toward this end, (i) we first provide a
discretization-based approximation method for fully observed Markov Decision
Processes (MDPs) with continuous spaces under average cost criteria, and we
provide error bounds for the approximations when the dynamics are only weakly
continuous under certain ergodicity assumptions. In particular, we relax the
total variation condition given in prior work to weak continuity as well as
Wasserstein continuity conditions. (ii) We provide synchronous and asynchronous
Q-learning algorithms for continuous spaces via quantization, and establish
their convergence. (iii) We show that the convergence is to the optimal Q
values of the finite approximate models constructed via quantization. Our
Q-learning convergence results, and the near optimality of their limits, are new
for continuous spaces, and the proof method is new even for finite spaces, to
our knowledge.
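The quantized Q-learning scheme described above admits a minimal sketch; the uniform quantizer, the relative-value (RVI-style) update anchored at a reference state-action pair, the uniform exploration, and the step-size choice below are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np

def quantize(x, grid):
    """Map a continuous state to the index of the nearest point of a 1-D grid."""
    grid = np.asarray(grid)
    return int(np.argmin(np.abs(grid - x)))

def rvi_q_learning(sim_step, cost, grid, actions, n_iters=100_000, ref=(0, 0)):
    """Asynchronous relative-value (average-cost) Q-learning on a quantized state space.

    sim_step(x, a) -> next continuous state sampled from the true transition kernel
    cost(x, a)     -> per-stage cost
    ref            -> reference (state index, action index) anchoring the relative updates
    The quantizer, step sizes, exploration rule, and reference pair are illustrative choices.
    """
    n_s, n_a = len(grid), len(actions)
    Q = np.zeros((n_s, n_a))
    x = grid[0]                          # arbitrary initial state
    for t in range(1, n_iters + 1):
        s = quantize(x, grid)
        a = np.random.randint(n_a)       # uniform exploration for illustration
        x_next = sim_step(x, actions[a])
        s_next = quantize(x_next, grid)
        alpha = 1.0 / t                  # diminishing step size
        target = cost(x, actions[a]) + Q[s_next].min() - Q[ref]
        Q[s, a] += alpha * (target - Q[s, a])
        x = x_next
    return Q
```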
Learning Algorithms for Markov Decision Processes
We propose various computational schemes for solving Partially Observable
Markov Decision Processes under the finite-stage additive cost and the
infinite-horizon discounted cost criteria. Error bounds for the corresponding
algorithms are given, and it is further shown that, at the expense of more
computational effort, the Partially Observable Markov Decision Problem (POMDP)
can be solved as close to optimal as desired.
It is well known that a sufficient statistic for taking the best action at any time for
the POMDP is the a posteriori probability distribution over the underlying states given
the past history, and that this distribution can be updated recursively. We prove that the
finite-stage optimal costs, as well as the optimal cost for the infinite-horizon discounted
cost problem, are Lipschitz continuous (with domain the unit simplex of probability
distributions over the underlying states) and give bounds for the Lipschitz constants.
We use these bounds to provide error bounds for computational algorithms for solving
POMDPs.
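As a reference point for the sufficient-statistic discussion above, here is a minimal sketch of the recursive a posteriori (belief) update; the array layout and the convention that the observation depends on the next state and the action are assumptions made for illustration.

```python
import numpy as np

def belief_update(belief, a, y, P, O):
    """One step of the recursive a posteriori (belief) update for a POMDP.

    belief : current distribution over the hidden states, shape (n,)
    P[a]   : transition matrix under action a, shape (n, n)
    O[a]   : observation likelihoods, O[a][x_next, y] = Pr(y | x_next, a), shape (n, m)
    The array layout and the observation convention are illustrative assumptions.
    """
    predicted = belief @ P[a]                 # prior over the next state
    unnormalized = predicted * O[a][:, y]     # weight by the observation likelihood
    return unnormalized / unnormalized.sum()  # normalize (assumes the observation has positive probability)
```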
We extend the almost sure convergence result of a very general stochastic approximation
algorithm to the case when the underlying Markov process exhibits periodicity. This result
is used to extend the proof of convergence of Temporal Difference (TD) reinforcement learning
schemes with linear function approximation for Markov cost processes, used to estimate the
cost-to-go function under the discounted cost criterion and the differential cost function
under the average cost criterion, respectively.
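A minimal single-trajectory sketch of the TD(0) schemes referred to above, with linear function approximation; the constant step size, the single pass, and the running average-cost estimate are illustrative choices, not the dissertation's exact algorithm.

```python
import numpy as np

def td0_linear(features, costs, next_features, gamma=None, step=0.01):
    """TD(0) with linear function approximation along a single sample trajectory.

    features, next_features : per-step feature vectors, shape (T, d)
    costs                   : per-step costs, shape (T,)
    If gamma is given, the iterate estimates the discounted cost-to-go; otherwise a
    running average-cost estimate rho is maintained and the iterate estimates the
    differential cost. Step sizes and the single pass are illustrative choices.
    """
    theta = np.zeros(features.shape[1])
    rho = 0.0                                   # average-cost estimate (average-cost case only)
    for phi, c, phi_next in zip(features, costs, next_features):
        if gamma is not None:                   # discounted cost criterion
            delta = c + gamma * phi_next @ theta - phi @ theta
        else:                                   # average cost criterion
            delta = c - rho + phi_next @ theta - phi @ theta
            rho += step * (c - rho)
        theta += step * delta * phi
    return theta, rho
```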
Adaptive control of Markov Decision Problems (MDPs) addresses the setting in which full
knowledge of the system parameters, namely the transition probabilities and the distribution
of the immediate costs, is not available a priori. We give direct adaptive control schemes for
infinite horizon discounted cost and average cost MDPs. Approximate Policy Iteration
using on-line TD schemes for policy evaluation is detailed for the discounted cost and
average cost criteria.
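The approximate policy iteration loop mentioned above can be sketched as follows; evaluate_td and greedy_policy are hypothetical placeholders standing in for the on-line TD evaluation and the greedy improvement step based on the current parameter estimates.

```python
import numpy as np

def approx_policy_iteration(evaluate_td, greedy_policy, n_states, rounds=20):
    """Skeleton of approximate policy iteration with TD-based policy evaluation.

    evaluate_td(policy)   -> approximate value (or differential-cost) vector for the policy,
                             e.g. via an on-line TD scheme such as the one sketched earlier
    greedy_policy(values) -> a policy greedy with respect to the value estimate, obtained
                             from the current model/parameter estimates
    Both callables are hypothetical placeholders; the dissertation's on-line schemes
    interleave evaluation and improvement rather than running them in separate rounds.
    """
    policy = np.zeros(n_states, dtype=int)      # arbitrary initial policy
    for _ in range(rounds):
        values = evaluate_td(policy)            # policy evaluation (TD)
        policy = greedy_policy(values)          # policy improvement
    return policy
```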
Possible extensions of direct adaptive control schemes to the POMDP framework are
discussed.
Auxiliary results relevant to the core results of the dissertation are stated
and proved in the appendices. In particular, an efficient discretization scheme
for the finite-dimensional unit simplex is given, along with some general error
bounds for MDPs. TD schemes for learning in Stochastic Shortest Path (SSP)
problems are also discussed.
Energy Sharing for Multiple Sensor Nodes with Finite Buffers
We consider the problem of finding optimal energy sharing policies that
maximize the network performance of a system comprising multiple sensor
nodes and a single energy harvesting (EH) source. Sensor nodes periodically
sense the random field and generate data, which is stored in the corresponding
data queues. The EH source harnesses energy from ambient energy sources and the
generated energy is stored in an energy buffer. Sensor nodes receive energy for
data transmission from the EH source. The EH source has to efficiently share
the stored energy among the nodes in order to minimize the long-run average
delay in data transmission. We formulate the problem of energy sharing between
the nodes in the framework of average cost infinite-horizon Markov decision
processes (MDPs). We develop efficient energy sharing algorithms, namely a
Q-learning algorithm with exploration mechanisms based on the ε-greedy
method as well as the upper confidence bound (UCB). We extend these algorithms by
incorporating state and action space aggregation to tackle state-action space
explosion in the MDP. We also develop a cross-entropy-based method that
incorporates policy parameterization in order to find near-optimal energy
sharing policies. Through simulations, we show that our algorithms yield energy
sharing policies that outperform the heuristic greedy method.
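The two exploration mechanisms mentioned above admit a simple sketch for a cost-minimizing Q-learning agent; the bonus form, the constants, and the cost-minimization convention are illustrative assumptions rather than the paper's exact design.

```python
import numpy as np

def select_action(Q, counts, s, t, eps=0.1, use_ucb=False, c=2.0):
    """Exploration rules of the kind described above, for a cost-minimizing Q-learner.

    Q      : state-action value (cost) estimates, shape (nS, nA)
    counts : visit counts per state-action pair, shape (nS, nA)
    The bonus form, the constants, and the cost-minimization convention are illustrative.
    """
    n_a = Q.shape[1]
    if use_ucb:
        # UCB-style choice: subtract an exploration bonus so that rarely tried
        # actions look cheaper than their current estimate suggests.
        bonus = c * np.sqrt(np.log(t + 1) / (counts[s] + 1))
        return int(np.argmin(Q[s] - bonus))
    # epsilon-greedy: explore uniformly with probability eps, otherwise act greedily.
    if np.random.rand() < eps:
        return np.random.randint(n_a)
    return int(np.argmin(Q[s]))
```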
Risk-Sensitive Reinforcement Learning: A Constrained Optimization Viewpoint
The classic objective in a reinforcement learning (RL) problem is to find a
policy that minimizes, in expectation, a long-run objective such as the
infinite-horizon discounted or long-run average cost. In many practical
applications, optimizing the expected value alone is not sufficient, and it may
be necessary to include a risk measure in the optimization process, either as
the objective or as a constraint. Various risk measures have been proposed in
the literature, e.g., the mean-variance tradeoff, exponential utility,
percentile performance, value at risk, conditional value at risk, and prospect
theory together with its later enhancement, cumulative prospect theory. In this article,
we focus on the combination of risk criteria and reinforcement learning in a
constrained optimization framework, i.e., a setting where the goal is to find a
policy that optimizes the usual objective of infinite-horizon
discounted/average cost while ensuring that an explicit risk constraint is
satisfied. We introduce the risk-constrained RL framework, cover popular risk
measures based on variance, conditional value-at-risk and cumulative prospect
theory, and present a template for a risk-sensitive RL algorithm. We survey
some of our recent work on this topic, covering problems encompassing
discounted cost, average cost, and stochastic shortest path settings, together
with the aforementioned risk measures in a constrained framework. This
non-exhaustive survey is aimed at giving a flavor of the challenges involved in
solving a risk-sensitive RL problem, and outlining some potential future
research directions.
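As a concrete instance of the constrained viewpoint described above (notation ours, not the article's): with J(θ) the expected discounted or long-run average cost of a parameterized policy, G(θ) a risk measure of the cost (e.g., its variance or conditional value-at-risk), and α a risk tolerance, the problem and its Lagrangian relaxation read

```latex
\min_{\theta}\; J(\theta)
\quad \text{subject to} \quad
G(\theta) \le \alpha,
\qquad
\mathcal{L}(\theta, \lambda) \;=\; J(\theta) + \lambda \,\bigl( G(\theta) - \alpha \bigr),
\quad \lambda \ge 0.
```

A typical algorithm template then performs stochastic gradient descent in θ and gradient ascent in λ on the Lagrangian, usually on two timescales.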