
    The Divergence of Reinforcement Learning Algorithms with Value-Iteration and Function Approximation

    This paper gives specific divergence examples of value iteration for several major Reinforcement Learning and Adaptive Dynamic Programming algorithms when a function approximator is used for the value function. These divergence examples differ from previous examples in the literature in that they apply under a greedy policy, i.e. in a "value iteration" scenario. Perhaps surprisingly, with a greedy policy it is also possible to obtain divergence for the algorithms TD(1) and Sarsa(1). In addition, we demonstrate divergence for the Adaptive Dynamic Programming algorithms HDP, DHP and GDHP.
    Comment: 8 pages, 4 figures. In Proceedings of the IEEE International Joint Conference on Neural Networks, June 2012, Brisbane (IEEE IJCNN 2012), pp. 3070--307
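    For concreteness, below is a minimal sketch of the setting the paper studies: TD(λ) with a linear value-function approximator. The two-state features, step size, and transitions are illustrative assumptions, not the paper's specific counterexample.

        import numpy as np

        # Illustrative sketch of TD(lambda) with a linear value-function
        # approximator, the setting in which divergence examples are built.
        # Features, step size, and transitions are assumptions for demo only.
        phi = np.array([[1.0], [2.0]])       # one feature per state
        w = np.array([1.0])                  # value-function weights
        gamma, lam, alpha = 0.9, 1.0, 0.1    # discount, trace decay, step size

        def td_lambda_episode(transitions):
            """One pass of TD(lambda) updates over a list of
            (state, reward, next_state, done) tuples."""
            global w
            z = np.zeros_like(w)             # eligibility trace
            for s, r, s_next, done in transitions:
                v = phi[s] @ w
                v_next = 0.0 if done else phi[s_next] @ w
                delta = r + gamma * v_next - v      # TD error
                z = gamma * lam * z + phi[s]        # accumulate trace
                w = w + alpha * delta * z           # semi-gradient update

        td_lambda_episode([(0, 0.0, 1, False), (1, 0.0, 0, True)])
        print(w)

    In the divergence examples, the interaction between such semi-gradient updates and the greedy (value-iteration) policy is what drives the weights away from any fixed point.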

    On Resource Allocation in Fading Multiple Access Channels - An Efficient Approximate Projection Approach

    We consider the problem of rate and power allocation in a multiple-access channel. Our objective is to obtain rate and power allocation policies that maximize a general concave utility function of average transmission rates over the information-theoretic capacity region of the multiple-access channel. Our policies do not require queue-length information. We consider several different scenarios. First, we address the utility maximization problem in a nonfading channel to obtain the optimal operating rates, and present an iterative gradient projection algorithm that uses approximate projection. By exploiting the polymatroid structure of the capacity region, we show that the approximate projection can be implemented in time polynomial in the number of users. Second, we consider resource allocation in a fading channel. Optimal rate and power allocation policies are presented for the case where power control is possible and channel statistics are available. For the case where transmission power is fixed and channel statistics are unknown, we propose a greedy rate allocation policy and provide bounds on the performance difference between this policy and the optimal policy in terms of the channel variations and the structure of the utility function. We present numerical results demonstrating the superior convergence rate of the greedy policy compared to queue-length-based policies. To reduce the computational complexity of the greedy policy, we present approximate rate allocation policies that track the greedy policy within a certain neighborhood, which is characterized in terms of the speed of fading.
    Comment: 32 pages, Submitted to IEEE Trans. on Information Theory
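    A hedged sketch of the gradient projection idea follows: step along the gradient of a concave utility, then project back onto the rate region. The paper's approximate projection exploits the polymatroid structure of the MAC capacity region; as a simplified stand-in, the sketch below projects only onto a single sum-rate constraint, and the capacity value and utility are assumptions.

        import numpy as np

        # Sketch: gradient projection for maximizing U(r) = sum(log r_i)
        # over a rate region. Projecting onto one sum-rate hyperplane is a
        # stand-in for the paper's polymatroid-based approximate projection.
        C_SUM = 3.0                          # assumed sum-rate capacity

        def utility_grad(r):
            return 1.0 / r                   # gradient of sum(log r_i)

        def approx_project(r):
            r = np.maximum(r, 1e-6)          # keep rates positive
            excess = r.sum() - C_SUM
            if excess > 0:                   # pull back onto the sum-rate
                r = r - excess / len(r)      # hyperplane (approximate)
            return np.maximum(r, 1e-6)

        r = np.array([0.5, 0.5, 0.5])        # initial rates, three users
        for k in range(200):
            r = approx_project(r + 0.01 * utility_grad(r))
        print(r)                             # approaches the equal split

    For the symmetric log utility the fixed point is the equal rate split; the polymatroid structure is what lets the full projection, over exponentially many subset constraints, run in time polynomial in the number of users.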

    Multiuser Scheduling in a Markov-modeled Downlink using Randomly Delayed ARQ Feedback

    We focus on the downlink of a cellular system, which carries the bulk of the data transfer in such wireless systems. We address the problem of opportunistic multiuser scheduling under imperfect channel state information by exploiting the memory inherent in the channel. In our setting, the channel between the base station and each user is modeled by a two-state Markov chain, and the scheduled user sends back an ARQ feedback signal that arrives at the scheduler with a random delay that is i.i.d. across users and time. The scheduler indirectly estimates the channel via accumulated delayed ARQ feedback and uses this information to make scheduling decisions. We formulate a throughput maximization problem as a partially observable Markov decision process (POMDP). For the case of two users in the system, we show that a greedy policy is sum-throughput optimal for any distribution on the ARQ feedback delay. For the case of more than two users, we prove that the greedy policy is suboptimal and demonstrate, via numerical studies, that it has near-optimal performance. We show that the greedy policy can be implemented by a simple algorithm that does not require the statistics of the underlying Markov channel or of the ARQ feedback delay, thus making it robust against errors in system parameter estimation. Establishing an equivalence between the two-user system and a genie-aided system, we obtain a simple closed-form expression for the sum capacity of the Markov-modeled downlink. We further derive inner and outer bounds on the capacity region of the Markov-modeled downlink and tighten these bounds for special cases of the system parameters.
    Comment: Contains 22 pages, 6 figures and 8 tables; revised version including additional analytical and numerical results; work submitted, Feb 2010, to IEEE Transactions on Information Theory, revised April 2011
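    As a hedged illustration of the greedy policy's simplicity: the scheduler can track, per user, a belief that the channel is ON, propagate it through the two-state Markov chain each slot, reset it when delayed ARQ feedback arrives, and serve the user with the highest belief. The transition probabilities below, and treating feedback as one slot old, are simplifying assumptions.

        import numpy as np

        # Sketch of belief-based greedy scheduling over two-state (ON/OFF)
        # Markov channels with delayed ARQ feedback. Parameters are assumed.
        P_ON_ON, P_OFF_ON = 0.8, 0.3         # chain transition probabilities
        n_users = 4
        belief = np.full(n_users, 0.5)       # P(channel ON) per user

        def propagate(b):
            """One-step belief update through the Markov chain."""
            return b * P_ON_ON + (1 - b) * P_OFF_ON

        def schedule_slot(arq_feedback):
            """arq_feedback: dict user -> observed state (1=ON, 0=OFF) whose
            delayed ACK/NACK arrived this slot. Returns the scheduled user."""
            global belief
            for user, state in arq_feedback.items():
                # Feedback pins down a past channel state; for simplicity
                # we treat it as exactly one slot old here.
                belief[user] = float(state)
            belief = propagate(belief)       # advance all beliefs one slot
            return int(np.argmax(belief))    # greedy: most-likely-ON user

        print(schedule_slot({2: 1}))         # user 2's delayed ACK arrived

    Note that this rule needs no knowledge of the delay distribution beyond the feedback arrivals themselves, which is the robustness property the abstract highlights.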

    An Equivalence Between Adaptive Dynamic Programming With a Critic and Backpropagation Through Time

    We consider the adaptive dynamic programming technique called Dual Heuristic Programming (DHP), which is designed to learn a critic function when using learned model functions of the environment. DHP is designed for optimal control problems in large and continuous state spaces. We extend DHP into a new algorithm that we call Value-Gradient Learning, VGL(λ), and prove the equivalence of an instance of the new algorithm to Backpropagation Through Time for Control with a greedy policy. Not only does this equivalence provide a link between these two different approaches, but it also enables our variant of DHP to have guaranteed convergence, under certain smoothness conditions and a greedy policy, when using a general smooth nonlinear function approximator for the critic. We consider several experimental scenarios, including some that prove divergence of DHP under a greedy policy, which contrasts with our proven-convergent algorithm.
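    To make the critic-learning step concrete, here is a hedged sketch of a DHP-style update: the critic G(s) approximates the value gradient dV/ds via the recursion G_target(s) = dr/ds + γ (ds'/ds)ᵀ G(s'), using model derivatives. The linear dynamics, quadratic reward, and linear critic below are illustrative assumptions, not the paper's experimental setup.

        import numpy as np

        # Sketch of a DHP-style value-gradient critic update on an assumed
        # linear system s' = A s with reward r = -s^T Q s (so dr/ds = -2 Q s).
        gamma, alpha = 0.9, 0.05
        A = np.array([[0.9, 0.1], [0.0, 0.95]])  # dynamics; Jacobian ds'/ds = A
        Q = np.eye(2)
        W = np.zeros((2, 2))                      # linear critic: G(s) = W s

        def dhp_update(s):
            """One critic update at state s under the policy baked into A."""
            global W
            s_next = A @ s
            g_target = -2 * Q @ s + gamma * A.T @ (W @ s_next)
            err = g_target - W @ s               # gradient-space TD error
            W += alpha * np.outer(err, s)        # move critic toward target

        for _ in range(1000):
            dhp_update(np.random.randn(2))
        print(W)                                  # critic approximates dV/ds

    The paper's point is that the same kind of update can diverge under a greedy policy with a general approximator, whereas the VGL(λ) variant tied to Backpropagation Through Time carries a convergence guarantee under the stated smoothness conditions.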