The Divergence of Reinforcement Learning Algorithms with Value-Iteration and Function Approximation
This paper gives specific divergence examples of value-iteration for several
major Reinforcement Learning and Adaptive Dynamic Programming algorithms, when
using a function approximator for the value function. These divergence examples
differ from previous divergence examples in the literature, in that they are
applicable for a greedy policy, i.e. in a "value iteration" scenario. Perhaps
surprisingly, with a greedy policy, it is also possible to get divergence for
the algorithms TD(1) and Sarsa(1). In addition to these, we also demonstrate
divergence for the Adaptive Dynamic Programming algorithms HDP, DHP and GDHP.

Comment: 8 pages, 4 figures. In Proceedings of the IEEE International Joint
Conference on Neural Networks, June 2012, Brisbane (IEEE IJCNN 2012), pp.
3070--307
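To make the phenomenon in the abstract concrete, here is a minimal sketch of value iteration diverging under a linear value-function approximator. This is the classic two-state construction of Tsitsiklis and Van Roy, not one of the paper's own examples; the discount factor and feature values are illustrative.

```python
# Two states s1, s2 with features phi(s1)=1, phi(s2)=2, so V_w(s) = w*phi(s).
# Both states transition deterministically to s2 with zero reward.
# Each sweep refits w by least squares to the Bellman targets gamma*V_w(s2).

gamma = 0.95  # any gamma > 5/6 makes this iteration diverge

def sweep(w, gamma):
    target = gamma * 2 * w  # Bellman target gamma*V_w(s2), same for both states
    # least-squares fit of w' to the pairs (phi=1, target) and (phi=2, target):
    # w' = (1*target + 2*target) / (1^2 + 2^2) = (6/5)*gamma*w
    return (1 * target + 2 * target) / (1**2 + 2**2)

w = 1.0
history = [w]
for _ in range(50):
    w = sweep(w, gamma)
    history.append(w)

print(history[-1])  # grows without bound, since (6/5)*gamma > 1
```

Each sweep multiplies the weight by 6γ/5, so for γ > 5/6 the iterates explode even though the exact tabular iteration would contract.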
On Resource Allocation in Fading Multiple Access Channels - An Efficient Approximate Projection Approach
We consider the problem of rate and power allocation in a multiple-access
channel. Our objective is to obtain rate and power allocation policies that
maximize a general concave utility function of average transmission rates on
the information theoretic capacity region of the multiple-access channel. Our
policies do not require queue-length information. We consider several
different scenarios. First, we address the utility maximization problem in a
nonfading channel to obtain the optimal operating rates, and present an
iterative gradient projection algorithm that uses approximate projection. By
exploiting the polymatroid structure of the capacity region, we show that the
approximate projection can be implemented in time polynomial in the number of
users. Second, we consider resource allocation in a fading channel. Optimal
rate and power allocation policies are presented for the case that power
control is possible and channel statistics are available. For the case that
transmission power is fixed and channel statistics are unknown, we propose a
greedy rate allocation policy and provide bounds on the performance difference
of this policy and the optimal policy in terms of channel variations and
structure of the utility function. We present numerical results that
demonstrate superior convergence rate performance for the greedy policy
compared to queue-length based policies. In order to reduce the computational
complexity of the greedy policy, we present approximate rate allocation
policies which track the greedy policy within a certain neighborhood that is
characterized in terms of the speed of fading.

Comment: 32 pages. Submitted to IEEE Trans. on Information Theory
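The gradient-projection idea in the abstract can be sketched for a toy two-user case. The rate constraints and the log utility below are placeholder choices, and the "approximate projection" is a crude sequential clipping of violated constraints, not the paper's polynomial-time polymatroid algorithm.

```python
import math

# Two-user MAC capacity region (illustrative constants):
#   r1 <= c1, r2 <= c2, r1 + r2 <= c12, r >= 0
c1, c2, c12 = 1.0, 1.2, 1.8

def approx_project(r1, r2):
    """Approximately project (r1, r2) back onto the capacity region
    by handling individual-rate and sum-rate constraints in turn."""
    r1 = min(max(r1, 1e-9), c1)
    r2 = min(max(r2, 1e-9), c2)
    excess = r1 + r2 - c12
    if excess > 0:               # split the sum-rate violation equally
        r1 -= excess / 2
        r2 -= excess / 2
    return r1, r2

def utility(r1, r2):             # concave utility of the rates
    return math.log(r1) + math.log(r2)

r1, r2, step = 0.1, 0.1, 0.05
for _ in range(500):
    # gradient of log(r) is 1/r; ascend, then project back
    r1, r2 = approx_project(r1 + step / r1, r2 + step / r2)

print(r1, r2, utility(r1, r2))
```

For the symmetric log utility the iterates settle on the sum-rate face at r1 = r2 = c12/2 = 0.9, which is the utility-maximizing point for these constants.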
Multiuser Scheduling in a Markov-modeled Downlink using Randomly Delayed ARQ Feedback
We focus on the downlink of a cellular system, which corresponds to the bulk
of the data transfer in such wireless systems. We address the problem of
opportunistic multiuser scheduling under imperfect channel state information,
by exploiting the memory inherent in the channel. In our setting, the channel
between the base station and each user is modeled by a two-state Markov chain
and the scheduled user sends back an ARQ feedback signal that arrives at the
scheduler with a random delay that is i.i.d. across users and time. The
scheduler indirectly estimates the channel via accumulated delayed-ARQ feedback
and uses this information to make scheduling decisions. We formulate a
throughput maximization problem as a partially observable Markov decision
process (POMDP). For the case of two users in the system, we show that a greedy
policy is sum throughput optimal for any distribution on the ARQ feedback
delay. For the case of more than two users, we prove that the greedy policy is
suboptimal and demonstrate, via numerical studies, that it has near optimal
performance. We show that the greedy policy can be implemented by a simple
algorithm that does not require the statistics of the underlying Markov channel
or the ARQ feedback delay, thus making it robust against errors in system
parameter estimation. Establishing an equivalence between the two-user system
and a genie-aided system, we obtain a simple closed form expression for the sum
capacity of the Markov-modeled downlink. We further derive inner and outer
bounds on the capacity region of the Markov-modeled downlink and tighten these
bounds for special cases of the system parameters.

Comment: 22 pages, 6 figures, 8 tables; revised version including additional
analytical and numerical results. Submitted Feb 2010 to IEEE Transactions on
Information Theory; revised April 2011. Authors can be reached at
[email protected]/[email protected]/[email protected]
An Equivalence Between Adaptive Dynamic Programming With a Critic and Backpropagation Through Time
We consider the adaptive dynamic programming technique called Dual Heuristic Programming (DHP), which is designed to learn a critic function when using learned model functions of the environment. DHP is designed for optimizing control problems in large and continuous state spaces. We extend DHP into a new algorithm that we call Value-Gradient Learning, VGL(λ), and prove equivalence of an instance of the new algorithm to Backpropagation Through Time for Control with a greedy policy. Not only does this equivalence provide a link between these two different approaches, but it also enables our variant of DHP to have guaranteed convergence, under certain smoothness conditions and a greedy policy, when using a general smooth nonlinear function approximator for the critic. We consider several experimental scenarios, including some that prove divergence of DHP under a greedy policy, which contrasts with our proven-convergent algorithm.
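The core object in DHP is a critic that learns the value *gradient* G(s) ≈ dV/ds rather than V itself. A minimal sketch of that update on a 1-D linear system is below; the dynamics, reward, fixed policy gain, and learning rate are all illustrative choices (the policy is fixed rather than greedy, and VGL(λ) itself is not implemented).

```python
# 1-D linear-quadratic example: dynamics s' = a*s + b*u, reward r = -(s^2 + u^2),
# fixed linear policy u = -k*s, linear critic G_w(s) = w*s approximating dV/ds.

a, b = 1.0, 0.5
k = 0.6
gamma, alpha = 0.9, 0.1

m = a - b * k            # closed-loop gain: s' = m*s
w = 0.0                  # critic weight, G_w(s) = w*s

for _ in range(2000):
    # DHP target: d/ds [ r(s, u(s)) + gamma * V(s') ].
    # With r = -(s^2 + u^2) and u = -k*s, the target has slope
    #   -2*(1 + k^2) + gamma * m^2 * w   (chain rule through s' = m*s)
    target_slope = -2.0 * (1 + k**2) + gamma * m**2 * w
    w += alpha * (target_slope - w)    # move critic toward the target slope

w_star = -2.0 * (1 + k**2) / (1 - gamma * m**2)  # analytic fixed point
print(w, w_star)
```

Since γm² < 1 here, the update is a contraction and w converges to the analytic fixed point; the divergence results in the abstract concern the harder setting where the policy is greedy with respect to the critic.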