Projections for Approximate Policy Iteration Algorithms
Approximate policy iteration is a class of reinforcement learning (RL) algorithms in which the policy is encoded using a function approximator; it has been especially prominent in RL with continuous action spaces. In this class of RL algorithms, ensuring that the policy return increases during a policy update often requires constraining the change in the action distribution. Several approximations exist in the literature for solving this constrained policy update problem. In this paper, we propose to improve on such solutions by introducing a set of projections that transform the constrained problem into an unconstrained one, which is then solved by standard gradient descent. Using these projections, we empirically demonstrate that our approach can improve the policy update solution and the control over exploration of existing approximate policy iteration algorithms.
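To make the idea concrete, here is a minimal sketch (not the paper's actual projections) of how a projection can turn a constrained policy update into an unconstrained one: an entropy floor on a one-dimensional Gaussian policy is enforced inside the parameterization, so gradient ascent on the raw parameters can never leave the feasible set. All names and constants below are illustrative assumptions.

```python
# Minimal sketch (not the paper's construction): enforce an entropy
# floor on a 1-D Gaussian policy inside the parameterization, so the
# optimization over the raw parameters (mu, rho) is unconstrained
# while every parameter value maps to a feasible policy.
import numpy as np

ENTROPY_FLOOR = 0.1  # hypothetical lower bound on policy entropy

def project_sigma(rho):
    """Map unconstrained rho to a sigma whose Gaussian entropy
    0.5 * log(2*pi*e*sigma^2) is at least ENTROPY_FLOOR."""
    sigma = np.exp(rho)
    sigma_min = np.exp(ENTROPY_FLOOR - 0.5 * np.log(2 * np.pi * np.e))
    return max(sigma, sigma_min)

def objective(mu, rho, actions, advantages):
    """Surrogate objective: advantage-weighted log-likelihood, always
    evaluated with the projected (hence feasible) sigma."""
    sigma = project_sigma(rho)
    logp = -0.5 * ((actions - mu) / sigma) ** 2 - np.log(sigma)
    return np.mean(advantages * logp)

def update(mu, rho, actions, advantages, lr=1e-2, h=1e-5):
    """One step of finite-difference gradient ascent on (mu, rho)."""
    g_mu = (objective(mu + h, rho, actions, advantages)
            - objective(mu - h, rho, actions, advantages)) / (2 * h)
    g_rho = (objective(mu, rho + h, actions, advantages)
             - objective(mu, rho - h, actions, advantages)) / (2 * h)
    return mu + lr * g_mu, rho + lr * g_rho

rng = np.random.default_rng(0)
mu, rho = 0.0, 0.0
mu, rho = update(mu, rho, rng.normal(size=32), rng.normal(size=32))
print("updated policy:", mu, project_sigma(rho))
```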
Controlled Sequential Monte Carlo
Sequential Monte Carlo methods, also known as particle methods, are a popular set of techniques for approximating high-dimensional probability distributions and their normalizing constants. These methods have found numerous applications in statistics and related fields, e.g., for inference in non-linear non-Gaussian state space models and in complex static models. Like many Monte Carlo sampling schemes, they rely on proposal distributions which crucially impact their performance. We introduce here a class of controlled sequential Monte Carlo algorithms, where the proposal distributions are determined by approximating the solution to an associated optimal control problem using an iterative scheme. This method builds upon a number of existing algorithms in econometrics, physics, and statistics for inference in state space models, and generalizes these methods so as to accommodate complex static models. We provide a theoretical analysis concerning the fluctuation and stability of this methodology that also provides insight into the properties of related algorithms. We demonstrate significant gains over state-of-the-art methods at a fixed computational complexity on a variety of applications.
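As a point of reference, the sketch below shows a plain bootstrap particle filter for a one-dimensional Gaussian state space model, with the proposal isolated as the single component that controlled SMC would replace with an approximately optimal one. The model, parameters, and stand-in observations are illustrative assumptions, not from the paper.

```python
# Minimal sketch: a sequential Monte Carlo sampler for a 1-D Gaussian
# state space model. `propose` is the knob that controlled SMC tunes;
# here it is the plain bootstrap proposal (the transition prior).
import numpy as np

rng = np.random.default_rng(0)
T, N = 50, 1000                 # time steps, particles
phi, sig_x, sig_y = 0.9, 1.0, 1.0

ys = rng.normal(size=T)         # stand-in observations

def propose(x_prev):
    # Bootstrap proposal: sample from the transition prior.
    return phi * x_prev + sig_x * rng.normal(size=x_prev.shape)

def log_g(x, y):
    # Log observation density log N(y; x, sig_y^2).
    return -0.5 * ((y - x) / sig_y) ** 2 - np.log(sig_y * np.sqrt(2 * np.pi))

x = rng.normal(size=N)
log_Z = 0.0                     # running log normalizing constant estimate
for t in range(T):
    x = propose(x)
    logw = log_g(x, ys[t])
    m = logw.max()
    w = np.exp(logw - m)
    log_Z += m + np.log(w.mean())
    # Multinomial resampling.
    x = x[rng.choice(N, size=N, p=w / w.sum())]

print("log-normalizing-constant estimate:", log_Z)
```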
On Resource Allocation in Fading Multiple Access Channels - An Efficient Approximate Projection Approach
We consider the problem of rate and power allocation in a multiple-access channel. Our objective is to obtain rate and power allocation policies that maximize a general concave utility function of average transmission rates over the information-theoretic capacity region of the multiple-access channel. Our policies do not require queue-length information. We consider several different scenarios. First, we address the utility maximization problem in a nonfading channel to obtain the optimal operating rates, and present an iterative gradient projection algorithm that uses approximate projection. By exploiting the polymatroid structure of the capacity region, we show that the approximate projection can be implemented in time polynomial in the number of users. Second, we consider resource allocation in a fading channel. Optimal rate and power allocation policies are presented for the case where power control is possible and channel statistics are available. For the case where transmission power is fixed and channel statistics are unknown, we propose a greedy rate allocation policy and provide bounds on the performance difference between this policy and the optimal policy in terms of channel variations and the structure of the utility function. We present numerical results demonstrating the superior convergence-rate performance of the greedy policy compared to queue-length-based policies. In order to reduce the computational complexity of the greedy policy, we present approximate rate allocation policies which track the greedy policy within a certain neighborhood that is characterized in terms of the speed of fading.

Comment: 32 pages, submitted to IEEE Trans. on Information Theory
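For intuition about how the polymatroid structure helps, here is an illustrative sketch (not the paper's algorithm) of one slot of greedy rate allocation on a two-user Gaussian MAC: for a linear utility gradient, the maximizing vertex of the polymatroid capacity region is reached by successive decoding in order of increasing weight, so the highest-weight user sees no interference. All parameter values are made up for the example.

```python
# Minimal sketch: one slot of greedy rate allocation on a two-user
# Gaussian MAC. For linear weights, the optimal point of the
# polymatroid capacity region is the vertex given by decoding users
# in order of increasing weight.
import numpy as np

def greedy_rates(weights, powers, gains, noise=1.0):
    """Rate vector maximizing sum_i weights[i] * R_i over the MAC
    capacity region with received powers p_i * h_i."""
    rx = powers * gains                  # received powers
    order = np.argsort(weights)          # decode low-weight users first
    rates = np.zeros_like(rx)
    interference = rx.sum()
    for i in order:
        interference -= rx[i]            # users decoded later still interfere
        rates[i] = 0.5 * np.log2(1 + rx[i] / (noise + interference))
    return rates

w = np.array([2.0, 1.0])   # utility gradient (weights) at current rates
p = np.array([1.0, 1.0])   # fixed transmit powers
h = np.array([0.8, 1.5])   # fading gains this slot
print(greedy_rates(w, p, h))
```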
Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view
We investigate projection methods for evaluating a linear approximation of the value function of a policy in a Markov Decision Process context. We consider two popular approaches: the one-step Temporal Difference fixed-point computation (TD(0)) and Bellman Residual (BR) minimization. We describe examples where each method outperforms the other. We highlight a simple relation between the objective functions they minimize, and show that while BR enjoys a performance guarantee, TD(0) does not in general. We then propose a unified view in terms of oblique projections of the Bellman equation, which substantially simplifies and extends the characterization of (Schoknecht, 2002) and the recent analysis of (Yu & Bertsekas, 2008). Finally, we describe simulations suggesting that although the TD(0) solution is usually slightly better than the BR solution, its inherent numerical instability makes it very bad in some cases, and thus worse on average.
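Both solutions have standard closed forms in the linear setting, which the following sketch computes side by side; the small random MDP is purely illustrative.

```python
# Minimal sketch: closed-form TD(0) fixed point vs. Bellman-residual
# minimizer for linear value approximation V = Phi @ theta on a known
# MDP (P, r, gamma) with state weight matrix D.
import numpy as np

rng = np.random.default_rng(1)
n, k, gamma = 6, 3, 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)  # transitions
r = rng.random(n)                                          # rewards
Phi = rng.random((n, k))                                   # features
D = np.diag(np.full(n, 1.0 / n))                           # uniform weights

A = Phi - gamma * P @ Phi

# TD(0) fixed point: Phi' D (Phi - gamma P Phi) theta = Phi' D r.
theta_td = np.linalg.solve(Phi.T @ D @ A, Phi.T @ D @ r)

# Bellman residual: weighted least squares min ||A theta - r||_D^2.
theta_br = np.linalg.solve(A.T @ D @ A, A.T @ D @ r)

V_true = np.linalg.solve(np.eye(n) - gamma * P, r)         # exact values
for name, th in [("TD(0)", theta_td), ("BR", theta_br)]:
    err = np.sqrt((Phi @ th - V_true) @ D @ (Phi @ th - V_true))
    print(name, "weighted error:", err)
```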
On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
We consider infinite-horizon stationary $\gamma$-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error $\epsilon$ at each iteration, it is well-known that one can compute stationary policies that are $\frac{2\gamma}{(1-\gamma)^2}\epsilon$-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iteration for computing non-stationary policies that can be up to $\frac{2\gamma}{1-\gamma}\epsilon$-optimal, which constitutes a significant improvement in the usual situation when $\gamma$ is close to 1. Surprisingly, this shows that the problem of "computing near-optimal non-stationary policies" is much simpler than that of "computing near-optimal stationary policies".
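Assuming the bounds as reconstructed above, a short numeric illustration of the size of the improvement:

```python
# The stationary guarantee is 2*gamma/(1-gamma)**2 * eps; the
# non-stationary one is 2*gamma/(1-gamma) * eps, a 1/(1-gamma) factor
# better, which dominates as gamma approaches 1.
eps = 0.1
for gamma in (0.9, 0.99, 0.999):
    stationary = 2 * gamma / (1 - gamma) ** 2 * eps
    non_stationary = 2 * gamma / (1 - gamma) * eps
    print(f"gamma={gamma}: stationary {stationary:.1f}, "
          f"non-stationary {non_stationary:.2f}, "
          f"factor {stationary / non_stationary:.0f}")
```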
Bellman Error Based Feature Generation using Random Projections on Sparse Spaces
We address the problem of automatic generation of features for value function approximation. Bellman Error Basis Functions (BEBFs) have been shown to improve the error of policy evaluation with function approximation, with a convergence rate similar to that of value iteration. We propose a simple, fast, and robust algorithm based on random projections to generate BEBFs for sparse feature spaces. We provide a finite sample analysis of the proposed method, and prove that projections logarithmic in the dimension of the original space are enough to guarantee contraction in the error. Empirical results demonstrate the strength of this method.
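A minimal sketch of the flavor of the approach (illustrative, not the paper's exact procedure): project sparse features through a random Gaussian matrix of logarithmic dimension, then regress sampled Bellman errors on the projected features to obtain a new basis function. The dimensions and the stand-in errors are assumptions for the example.

```python
# Minimal sketch: one round of Bellman-error basis-function (BEBF)
# generation via random projections on sparse features.
import numpy as np

rng = np.random.default_rng(2)
n, D, d = 200, 10_000, 40         # samples, sparse dim, projected dim

X = np.zeros((n, D))              # sparse state features
for i in range(n):                # ~10 active features per state
    X[i, rng.choice(D, size=10, replace=False)] = 1.0

bellman_err = rng.normal(size=n)  # stand-in sampled Bellman errors

# Random projection; d = O(log D) rows suffice per the paper's analysis.
R = rng.normal(size=(d, D)) / np.sqrt(d)
Z = X @ R.T                       # projected features, shape (n, d)

# Regress the Bellman error on projected features; the fitted
# predictor becomes the new basis function added to the feature set.
w, *_ = np.linalg.lstsq(Z, bellman_err, rcond=None)
new_bebf = Z @ w
print("new BEBF for first 5 states:", new_bebf[:5])
```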
Computing Probabilistic Bisimilarity Distances for Probabilistic Automata
The probabilistic bisimilarity distance of Deng et al. has been proposed as a robust quantitative generalization of Segala and Lynch's probabilistic bisimilarity for probabilistic automata. In this paper, we present a characterization of the bisimilarity distance as the solution of a simple stochastic game. The characterization gives us an algorithm to compute the distances by applying Condon's simple policy iteration on these games. The correctness of Condon's approach, however, relies on the assumption that the games are stopping. Our games may be non-stopping in general, yet we are able to prove termination for this extended class of games. Other algorithms have already been proposed in the literature to compute these distances, with complexity in $\textbf{UP} \cap \textbf{coUP}$ and $\textbf{PPAD}$. Despite their theoretical relevance, these algorithms are inefficient in practice. To the best of our knowledge, our algorithm is the first practical solution.

The characterization of the probabilistic bisimilarity distance mentioned above crucially uses a dual presentation of the Hausdorff distance due to M\'emoli. As an additional contribution, in this paper we show that M\'emoli's result can also be used to prove that the bisimilarity distance bounds the difference in the maximal (or minimal) probability of two states satisfying arbitrary $\omega$-regular properties, expressed, e.g., as LTL formulas.
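For readers unfamiliar with simple policy iteration, here is a toy sketch on a hand-built simple stochastic game (illustrative only; the games in the paper arise from the distance characterization and may be non-stopping): the max player repeatedly switches to a better successor, with the opponent's best response evaluated by value iteration.

```python
# Toy sketch of simple policy iteration on a simple stochastic game
# with max, min, and average vertices plus 0/1 sinks. The graph is
# hand-built for illustration.
import numpy as np

kind = ["max", "min", "avg", "sink0", "sink1"]
succ = [(1, 2), (3, 4), (3, 4), (3, 3), (4, 4)]

def evaluate(max_choice, iters=100):
    """Vertex values when the max player fixes `max_choice` and the
    min player best-responds (computed by value iteration)."""
    v = np.zeros(len(kind)); v[4] = 1.0   # sink values stay fixed
    for _ in range(iters):
        for u, k in enumerate(kind):
            a, b = succ[u]
            if k == "max":
                v[u] = v[succ[u][max_choice[u]]]
            elif k == "min":
                v[u] = min(v[a], v[b])
            elif k == "avg":
                v[u] = 0.5 * (v[a] + v[b])
    return v

choice = {0: 0}                 # max player's current strategy
while True:
    v = evaluate(choice)
    a, b = succ[0]
    better = 1 if v[b] > v[a] else 0
    if better == choice[0] or v[a] == v[b]:
        break                   # no switchable max vertex: optimal
    choice[0] = better          # switch to the strictly better successor
print("values:", v, "max strategy:", choice)
```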