Q-Learning for Continuous State and Action MDPs under Average Cost Criteria
For infinite-horizon average-cost criterion problems, we present several
approximation and reinforcement learning results for Markov Decision Processes
with standard Borel spaces. Toward this end, (i) we first provide a
discretization-based approximation method for fully observed Markov Decision
Processes (MDPs) with continuous spaces under average cost criteria, and we
provide error bounds for the approximations when the dynamics are only weakly
continuous under certain ergodicity assumptions. In particular, we relax the
total variation condition given in prior work to weak continuity as well as
Wasserstein continuity conditions. (ii) We provide synchronous and asynchronous
Q-learning algorithms for continuous spaces via quantization, and establish
their convergence. (iii) We show that the convergence is to the optimal Q
values of the finite approximate models constructed via quantization. Our
Q-learning convergence results, and the near optimality of their limits, are new
for continuous spaces, and the proof method is new even for finite spaces, to
our knowledge.
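The quantized Q-learning scheme described above admits a minimal sketch; the uniform quantizer, the relative-value (RVI-style) update anchored at a reference state-action pair, the uniform exploration, and the step-size choice below are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np

def quantize(x, grid):
    """Map a continuous state to the index of the nearest point of a 1-D grid."""
    grid = np.asarray(grid)
    return int(np.argmin(np.abs(grid - x)))

def rvi_q_learning(sim_step, cost, grid, actions, n_iters=100_000, ref=(0, 0)):
    """Asynchronous relative-value (average-cost) Q-learning on a quantized state space.

    sim_step(x, a) -> next continuous state sampled from the true transition kernel
    cost(x, a)     -> per-stage cost
    ref            -> reference (state index, action index) anchoring the relative updates
    The quantizer, step sizes, exploration rule, and reference pair are illustrative choices.
    """
    n_s, n_a = len(grid), len(actions)
    Q = np.zeros((n_s, n_a))
    x = grid[0]                          # arbitrary initial state
    for t in range(1, n_iters + 1):
        s = quantize(x, grid)
        a = np.random.randint(n_a)       # uniform exploration for illustration
        x_next = sim_step(x, actions[a])
        s_next = quantize(x_next, grid)
        alpha = 1.0 / t                  # diminishing step size
        target = cost(x, actions[a]) + Q[s_next].min() - Q[ref]
        Q[s, a] += alpha * (target - Q[s, a])
        x = x_next
    return Q
```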
Learning Algorithms for Markov Decision Processes
We propose various computational schemes for solving Partially Observable
Markov Decision Processes under the finite-stage additive cost and the
infinite-horizon discounted cost criteria. Error bounds for the corresponding
algorithms are given, and it is further shown that, at the expense of more
computational effort, the Partially Observable Markov Decision Problem (POMDP)
can be solved as close to optimal as desired.
It is well known that a sufficient statistic for taking the best action at any time for
the POMDP is the a posteriori probability distribution over the underlying states given
the past history, and that this distribution can be updated recursively. We prove that the
finite-stage optimal costs, as well as the optimal cost for the infinite-horizon discounted
cost problem, are Lipschitz continuous (with domain the unit simplex of probability
distributions over the underlying states) and give bounds for the Lipschitz constants.
We use these bounds to provide error bounds for computational algorithms for solving
POMDPs.
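As a reference point for the sufficient-statistic discussion above, here is a minimal sketch of the recursive a posteriori (belief) update; the array layout and the convention that the observation depends on the next state and the action are assumptions made for illustration.

```python
import numpy as np

def belief_update(belief, a, y, P, O):
    """One step of the recursive a posteriori (belief) update for a POMDP.

    belief : current distribution over the hidden states, shape (n,)
    P[a]   : transition matrix under action a, shape (n, n)
    O[a]   : observation likelihoods, O[a][x_next, y] = Pr(y | x_next, a), shape (n, m)
    The array layout and the observation convention are illustrative assumptions.
    """
    predicted = belief @ P[a]                 # prior over the next state
    unnormalized = predicted * O[a][:, y]     # weight by the observation likelihood
    return unnormalized / unnormalized.sum()  # normalize (assumes the observation has positive probability)
```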
We extend the almost sure convergence result of a very general stochastic approximation
algorithm to the case when the underlying Markov process exhibits periodicity. This result
is used to extend the proof of convergence of Temporal Difference (TD) reinforcement learning
schemes with linear function approximation for Markov cost processes, used to estimate the
cost-to-go function under the discounted cost criterion and the differential cost function
under the average cost criterion, respectively.
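A minimal single-trajectory sketch of the TD(0) schemes referred to above, with linear function approximation; the constant step size, the single pass, and the running average-cost estimate are illustrative choices, not the dissertation's exact algorithm.

```python
import numpy as np

def td0_linear(features, costs, next_features, gamma=None, step=0.01):
    """TD(0) with linear function approximation along a single sample trajectory.

    features, next_features : per-step feature vectors, shape (T, d)
    costs                   : per-step costs, shape (T,)
    If gamma is given, the iterate estimates the discounted cost-to-go; otherwise a
    running average-cost estimate rho is maintained and the iterate estimates the
    differential cost. Step sizes and the single pass are illustrative choices.
    """
    theta = np.zeros(features.shape[1])
    rho = 0.0                                   # average-cost estimate (average-cost case only)
    for phi, c, phi_next in zip(features, costs, next_features):
        if gamma is not None:                   # discounted cost criterion
            delta = c + gamma * phi_next @ theta - phi @ theta
        else:                                   # average cost criterion
            delta = c - rho + phi_next @ theta - phi @ theta
            rho += step * (c - rho)
        theta += step * delta * phi
    return theta, rho
```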
Adaptive control of Markov Decision Problems (MDPs) addresses the setting in which full
knowledge of the system parameters, namely the transition probabilities and the distribution
of the immediate costs, is not available a priori. We give direct adaptive control schemes for
infinite horizon discounted cost and average cost MDPs. Approximate Policy Iteration
using on-line TD schemes for policy evaluation is detailed for the discounted cost and
average cost criteria.
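The approximate policy iteration loop mentioned above can be sketched as follows; evaluate_td and greedy_policy are hypothetical placeholders standing in for the on-line TD evaluation and the greedy improvement step based on the current parameter estimates.

```python
import numpy as np

def approx_policy_iteration(evaluate_td, greedy_policy, n_states, rounds=20):
    """Skeleton of approximate policy iteration with TD-based policy evaluation.

    evaluate_td(policy)   -> approximate value (or differential-cost) vector for the policy,
                             e.g. via an on-line TD scheme such as the one sketched earlier
    greedy_policy(values) -> a policy greedy with respect to the value estimate, obtained
                             from the current model/parameter estimates
    Both callables are hypothetical placeholders; the dissertation's on-line schemes
    interleave evaluation and improvement rather than running them in separate rounds.
    """
    policy = np.zeros(n_states, dtype=int)      # arbitrary initial policy
    for _ in range(rounds):
        values = evaluate_td(policy)            # policy evaluation (TD)
        policy = greedy_policy(values)          # policy improvement
    return policy
```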
Possible extensions of direct adaptive control schemes to the POMDP framework are
discussed.
Auxiliary results relevant to the core results of the dissertation are stated
and proved in the appendices. In particular, an efficient discretization scheme
for the finite-dimensional unit simplex is given, along with some general error
bounds for MDPs. TD schemes for learning in Stochastic Shortest Path (SSP)
problems are also discussed.
Energy Sharing for Multiple Sensor Nodes with Finite Buffers
We consider the problem of finding optimal energy sharing policies that
maximize the network performance of a system comprising multiple sensor
nodes and a single energy harvesting (EH) source. Sensor nodes periodically
sense the random field and generate data, which is stored in the corresponding
data queues. The EH source harnesses energy from ambient energy sources and the
generated energy is stored in an energy buffer. Sensor nodes receive energy for
data transmission from the EH source. The EH source has to efficiently share
the stored energy among the nodes in order to minimize the long-run average
delay in data transmission. We formulate the problem of energy sharing between
the nodes in the framework of average cost infinite-horizon Markov decision
processes (MDPs). We develop efficient energy sharing algorithms, namely a
Q-learning algorithm with exploration mechanisms based on the ε-greedy
method as well as the upper confidence bound (UCB). We extend these algorithms by
incorporating state and action space aggregation to tackle state-action space
explosion in the MDP. We also develop a cross-entropy-based method that
incorporates policy parameterization in order to find near-optimal energy
sharing policies. Through simulations, we show that our algorithms yield energy
sharing policies that outperform the heuristic greedy method.
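The two exploration mechanisms mentioned above admit a simple sketch for a cost-minimizing Q-learning agent; the bonus form, the constants, and the cost-minimization convention are illustrative assumptions rather than the paper's exact design.

```python
import numpy as np

def select_action(Q, counts, s, t, eps=0.1, use_ucb=False, c=2.0):
    """Exploration rules of the kind described above, for a cost-minimizing Q-learner.

    Q      : state-action value (cost) estimates, shape (nS, nA)
    counts : visit counts per state-action pair, shape (nS, nA)
    The bonus form, the constants, and the cost-minimization convention are illustrative.
    """
    n_a = Q.shape[1]
    if use_ucb:
        # UCB-style choice: subtract an exploration bonus so that rarely tried
        # actions look cheaper than their current estimate suggests.
        bonus = c * np.sqrt(np.log(t + 1) / (counts[s] + 1))
        return int(np.argmin(Q[s] - bonus))
    # epsilon-greedy: explore uniformly with probability eps, otherwise act greedily.
    if np.random.rand() < eps:
        return np.random.randint(n_a)
    return int(np.argmin(Q[s]))
```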
Risk-Sensitive Reinforcement Learning: A Constrained Optimization Viewpoint
The classic objective in a reinforcement learning (RL) problem is to find a
policy that minimizes, in expectation, a long-run objective such as the
infinite-horizon discounted or long-run average cost. In many practical
applications, optimizing the expected value alone is not sufficient, and it may
be necessary to include a risk measure in the optimization process, either as
the objective or as a constraint. Various risk measures have been proposed in
the literature, e.g., the mean-variance tradeoff, exponential utility,
percentile performance, value at risk, conditional value at risk, and prospect
theory together with its later enhancement, cumulative prospect theory. In this article,
we focus on the combination of risk criteria and reinforcement learning in a
constrained optimization framework, i.e., a setting where the goal is to find a
policy that optimizes the usual objective of infinite-horizon
discounted/average cost while ensuring that an explicit risk constraint is
satisfied. We introduce the risk-constrained RL framework, cover popular risk
measures based on variance, conditional value-at-risk and cumulative prospect
theory, and present a template for a risk-sensitive RL algorithm. We survey
some of our recent work on this topic, covering problems encompassing
discounted cost, average cost, and stochastic shortest path settings, together
with the aforementioned risk measures in a constrained framework. This
non-exhaustive survey is aimed at giving a flavor of the challenges involved in
solving a risk-sensitive RL problem, and outlining some potential future
research directions.
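As a concrete instance of the constrained viewpoint described above (notation ours, not the article's): with J(θ) the expected discounted or long-run average cost of a parameterized policy, G(θ) a risk measure of the cost (e.g., its variance or conditional value-at-risk), and α a risk tolerance, the problem and its Lagrangian relaxation read

```latex
\min_{\theta}\; J(\theta)
\quad \text{subject to} \quad
G(\theta) \le \alpha,
\qquad
\mathcal{L}(\theta, \lambda) \;=\; J(\theta) + \lambda \,\bigl( G(\theta) - \alpha \bigr),
\quad \lambda \ge 0.
```

A typical algorithm template then performs stochastic gradient descent in θ and gradient ascent in λ on the Lagrangian, usually on two timescales.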