Experiments with Infinite-Horizon, Policy-Gradient Estimation
In this paper, we present algorithms that perform gradient ascent of the
average reward in a partially observable Markov decision process (POMDP). These
algorithms are based on GPOMDP, an algorithm introduced in a companion paper
(Baxter and Bartlett, this volume), which computes biased estimates of the
performance gradient in POMDPs. The algorithm's chief advantages are that it
uses only one free parameter beta, which has a natural interpretation in terms
of bias-variance trade-off, it requires no knowledge of the underlying state,
and it can be applied to infinite state, control and observation spaces. We
show how the gradient estimates produced by GPOMDP can be used to perform
gradient ascent, both with a traditional stochastic-gradient algorithm, and
with an algorithm based on conjugate-gradients that utilizes gradient
information to bracket maxima in line searches. Experimental results are
presented illustrating both the theoretical results of (Baxter and Bartlett,
this volume) on a toy problem, and practical aspects of the algorithms on a
number of more realistic problems.
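
A minimal sketch of the kind of estimator this abstract describes, assuming the usual GPOMDP recursion: an eligibility trace of policy score vectors discounted by beta, averaged against the observed rewards. The one-observation toy environment, the sigmoid policy, and all function names are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def policy(theta, obs):
    """Two-action policy (assumed form): P(action=1 | obs) = sigmoid(theta * obs)."""
    p1 = 1.0 / (1.0 + math.exp(-theta * obs))
    return [1.0 - p1, p1]


def gpomdp_estimate(theta, beta, steps, seed=0):
    """Estimate d(average reward)/d(theta) from a single trajectory.

    beta in [0, 1) trades bias against variance: larger beta keeps more
    history in the trace (lower bias, higher variance).
    """
    rng = random.Random(seed)
    z = 0.0      # eligibility trace of score values
    delta = 0.0  # running average of reward * trace
    for t in range(steps):
        obs = 1.0  # toy environment: one constant observation
        probs = policy(theta, obs)
        action = 1 if rng.random() < probs[1] else 0
        reward = 1.0 if action == 1 else 0.0  # toy reward: action 1 pays 1
        # Score: derivative of log P(action | obs) with respect to theta.
        score = (action - probs[1]) * obs
        z = beta * z + score
        delta += (reward * z - delta) / (t + 1)
    return delta
```

In this toy problem the average reward is sigmoid(theta), so the exact gradient at theta = 0 is 0.25, which gives a simple reference point for checking the estimate. The estimator never inspects an underlying state, only observations, actions, and rewards.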
Traffic Light Control Using Deep Policy-Gradient and Value-Function Based Reinforcement Learning
Recent advances in combining deep neural network architectures with
reinforcement learning techniques have shown promising results in
solving complex control problems with high dimensional state and action spaces.
Inspired by these successes, in this paper, we build two kinds of reinforcement
learning algorithms: deep policy-gradient and value-function based agents which
can predict the best possible traffic signal for a traffic intersection. At
each time step, these adaptive traffic light control agents receive a snapshot
of the current state of a graphical traffic simulator and produce control
signals. The policy-gradient based agent maps its observation directly to the
control signal, whereas the value-function based agent first estimates values
for all legal control signals and then selects the control action with the
highest value. Our methods show promising results in a traffic
network simulated in the SUMO traffic simulator, without suffering from
instability issues during the training process.
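
The difference between the two agents' action selection can be sketched as follows. This is an illustration only: it assumes the policy-gradient agent outputs one logit per signal and samples from the softmax, and that the value-based agent has one value estimate per signal. The function names and the list-based interface are hypothetical, and no network or SUMO interface is modeled here.

```python
import math
import random


def policy_gradient_action(logits, rng):
    """Sample a control signal directly from the policy's softmax output."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cumulative = 0.0
    for action, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return action
    return len(probs) - 1  # guard against floating-point rounding


def value_based_action(q_values, legal_actions):
    """Pick the legal control signal with the highest estimated value."""
    return max(legal_actions, key=lambda a: q_values[a])
```

For example, `value_based_action([0.1, 0.9, 0.5], legal_actions=[0, 2])` ignores signal 1 because it is not legal and picks signal 2.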
Decentralized Delay Optimal Control for Interference Networks with Limited Renewable Energy Storage
In this paper, we consider delay minimization for interference networks with
a renewable energy source, where the transmission power of a node comes from both
the conventional utility power (AC power) and the renewable energy source. We
assume the transmission power of each node is a function of the local channel
state, local data queue state and local energy queue state only. In turn, we
consider two delay optimization formulations, namely the decentralized
partially observable Markov decision process (DEC-POMDP) and non-cooperative
partially observable stochastic game (POSG). In the DEC-POMDP formulation, we
derive a decentralized online learning algorithm to determine the control
actions and Lagrangian multipliers (LMs) simultaneously, based on the policy
gradient approach. Under some mild technical conditions, the proposed
decentralized policy gradient algorithm converges almost surely to a local
optimal solution. On the other hand, in the non-cooperative POSG formulation,
the transmitter nodes are non-cooperative. We extend the decentralized policy
gradient solution and establish the technical proof for almost-sure convergence
of the learning algorithms. In both cases, the solutions are very robust to
model variations. Finally, the delay performance of the proposed solutions is
compared with conventional baseline schemes for interference networks and it is
illustrated that substantial delay performance gain and energy savings can be
achieved.
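
The idea of updating control parameters and Lagrangian multipliers simultaneously can be illustrated with a deterministic toy problem. This is not the paper's decentralized stochastic algorithm, whose details the abstract does not give. It assumes a single node, a hypothetical delay surrogate 1/p for transmit power p, and a power budget constraint p <= power_budget; every name and constant below is an assumption.

```python
def primal_dual(power_budget, step_size=0.05, iterations=20000):
    """Gradient descent on the control and ascent on the multiplier, in lockstep.

    Lagrangian (toy, assumed): L(p, lam) = 1/p + lam * (p - power_budget).
    """
    power, multiplier = 1.0, 0.0
    for _ in range(iterations):
        # Both gradients use the values from the start of the iteration,
        # so the two updates are truly simultaneous.
        grad_power = -1.0 / power ** 2 + multiplier  # dL/dp
        grad_multiplier = power - power_budget       # dL/dlam
        power = max(1e-3, power - step_size * grad_power)
        # Multipliers of inequality constraints are projected onto lam >= 0.
        multiplier = max(0.0, multiplier + step_size * grad_multiplier)
    return power, multiplier
```

With `power_budget=2.0` the iterates settle at the constrained optimum p = 2 with multiplier 1/p^2 = 0.25, the point where the marginal delay reduction equals the price of power.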
ARES: Adaptive receding-horizon synthesis of optimal plans
We introduce ARES, an efficient approximation algorithm for generating optimal plans (action sequences) that take an initial state of a Markov Decision Process (MDP) to a state whose cost is below a specified (convergence) threshold. ARES uses Particle Swarm Optimization, with adaptive sizing for both the receding horizon and the particle swarm. Inspired by Importance Splitting, the length of the horizon and the number of particles are chosen such that at least one particle reaches a next-level state, that is, a state where the cost decreases by a required delta from the previous-level state. The level relation on states and the plans constructed by ARES implicitly define a Lyapunov function and an optimal policy, respectively, both of which could be explicitly generated by applying ARES to all states of the MDP, up to some topological equivalence relation. We also assess the effectiveness of ARES by statistically evaluating its rate of success in generating optimal plans. The ARES algorithm resulted from our desire to clarify if flying in V-formation is a flocking policy that optimizes energy conservation, clear view, and velocity alignment. That is, we were interested to see if one could find optimal plans that bring a flock from an arbitrary initial state to a state exhibiting a single connected V-formation. For flocks with 7 birds, ARES is able to generate a plan that leads to a V-formation in 95% of the 8,000 random initial configurations within 63 s, on average. ARES can also be easily customized into a model-predictive controller (MPC) with an adaptive receding horizon and statistical guarantees of convergence. To the best of our knowledge, our adaptive-sizing approach is the first to provide convergence guarantees in receding-horizon techniques.
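
The level-by-level search with adaptive sizing can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the ARES algorithm itself: it uses a plain textbook particle swarm with fixed coefficients, scalar actions clipped to [-1, 1], and a simple nested retry order over horizons and swarm sizes. All function names, parameters, and default values are hypothetical.

```python
import random


def rollout(state, plan, step, cost):
    """Apply a sequence of actions and return the final state and its cost."""
    for action in plan:
        state = step(state, action)
    return state, cost(state)


def pso_plan(state, horizon, n_particles, step, cost, rng, iters=30):
    """Plain PSO over action sequences of length `horizon`."""
    pos = [[rng.uniform(-1, 1) for _ in range(horizon)] for _ in range(n_particles)]
    vel = [[0.0] * horizon for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]
    best_cost = [rollout(state, p, step, cost)[1] for p in pos]
    g = min(range(n_particles), key=lambda i: best_cost[i])
    g_pos, g_cost = best_pos[g][:], best_cost[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(horizon):
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * rng.random() * (best_pos[i][d] - pos[i][d])
                             + 1.5 * rng.random() * (g_pos[d] - pos[i][d]))
                pos[i][d] = max(-1.0, min(1.0, pos[i][d] + vel[i][d]))
            c = rollout(state, pos[i], step, cost)[1]
            if c < best_cost[i]:
                best_pos[i], best_cost[i] = pos[i][:], c
                if c < g_cost:
                    g_pos, g_cost = pos[i][:], c
    return g_pos, g_cost


def adaptive_plan(state, step, cost, delta, threshold, rng,
                  max_horizon=3, swarm_sizes=(5, 10, 20), max_levels=100):
    """Grow the horizon and swarm until the cost drops by `delta` (next level)."""
    plan = []
    for _ in range(max_levels):
        current = cost(state)
        if current <= threshold:
            return plan, state
        advanced = False
        for horizon in range(1, max_horizon + 1):
            for n in swarm_sizes:
                actions, c = pso_plan(state, horizon, n, step, cost, rng)
                if c <= current - delta or c <= threshold:
                    state, _ = rollout(state, actions, step, cost)
                    plan.extend(actions)
                    advanced = True
                    break
            if advanced:
                break
        if not advanced:
            return None, state  # no next-level state found with these sizes
    return None, state
```

On a one-dimensional toy system such as `step = lambda x, a: x + a` with `cost = abs`, each level moves the state closer to zero until the cost falls below the threshold. The cost at successive levels decreases monotonically, which is the sense in which the level relation behaves like a Lyapunov function.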
Better Exploration with Optimistic Actor-Critic
Actor-critic methods, a type of model-free Reinforcement Learning, have been
successfully applied to challenging tasks in continuous control, often
achieving state-of-the-art performance. However, wide-scale adoption of these
methods in real-world domains is made difficult by their poor sample
efficiency. We address this problem both theoretically and empirically. On the
theoretical side, we identify two phenomena preventing efficient exploration in
existing state-of-the-art algorithms such as Soft Actor Critic. First,
combining a greedy actor update with a pessimistic estimate of the critic leads
to the avoidance of actions that the agent does not know about, a phenomenon we
call pessimistic underexploration. Second, current algorithms are directionally
uninformed, sampling actions with equal probability in opposite directions from
the current mean. This is wasteful, since we typically need actions taken along
certain directions much more than others. To address both of these phenomena,
we introduce a new algorithm, Optimistic Actor Critic, which approximates a
lower and upper confidence bound on the state-action value function. This
allows us to apply the principle of optimism in the face of uncertainty to
perform directed exploration using the upper bound while still using the lower
bound to avoid overestimation. We evaluate OAC in several challenging
continuous control tasks, achieving state-of-the-art sample efficiency.
Comment: 20 pages (including supplement)
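
One common way to build such bounds, shown here as a hedged sketch, is to treat two critic estimates as a small ensemble and place the bounds a multiple of their spread below and above their mean. The abstract does not give the formulas, so the construction, the function name, and the default coefficients are assumptions.

```python
def confidence_bounds(q1, q2, beta_lb=1.0, beta_ub=4.0):
    """Approximate lower and upper bounds on Q(s, a) from two critic estimates.

    beta_lb and beta_ub are assumed tuning coefficients (hypothetical defaults).
    """
    mean = 0.5 * (q1 + q2)
    spread = 0.5 * abs(q1 - q2)  # standard deviation of the two estimates
    lower = mean - beta_lb * spread  # pessimistic: guards against overestimation
    upper = mean + beta_ub * spread  # optimistic: used to direct exploration
    return lower, upper
```

With `beta_lb=1.0` the lower bound equals `min(q1, q2)`, the pessimistic target familiar from clipped double-Q learning. The upper bound grows where the critics disagree, so actions with uncertain value look more attractive to the exploration policy.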
Reinforcement Learning for Humanoid Robots - Policy Gradients and Beyond
Reinforcement learning offers one of the most general frameworks to take traditional robotics towards true autonomy
and versatility. However, applying reinforcement learning to high dimensional movement systems like humanoid
robots remains an unsolved problem. In this paper, we discuss different approaches of reinforcement learning in terms
of their applicability in humanoid robotics. Methods can be coarsely classified into three different categories, i.e.,
greedy methods, ‘vanilla’ policy gradient methods, and natural gradient methods. We argue that greedy methods are
not likely to scale into the domain of humanoid robotics as they are problematic when used with function approximation.
‘Vanilla’ policy gradient methods, on the other hand, have been successfully applied on real-world robots including at
least one humanoid robot [3]. We demonstrate that these methods can be significantly improved using the natural
policy gradient instead of the regular policy gradient. A derivation of the natural policy gradient is provided, proving
that the average policy gradient of Kakade [10] is indeed the true natural gradient. A general algorithm for estimating
the natural gradient, the Natural Actor-Critic algorithm, is introduced. This algorithm converges to the nearest local
minimum of the cost function with respect to the Fisher information metric under suitable conditions. The algorithm
outperforms non-natural policy gradients by far in a cart-pole balancing evaluation, and for learning non-linear dynamic
motor primitives for humanoid robot control. It offers a promising route for the development of reinforcement
learning for truly high-dimensional continuous state-action systems.
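
The relationship between the regular and the natural policy gradient can be sketched for a two-parameter policy: the natural gradient is the regular gradient premultiplied by the inverse Fisher information matrix. The sample-based estimator below is a generic illustration, not the Natural Actor-Critic algorithm from the paper. The function name, the optional ridge term, and the restriction to two parameters are assumptions.

```python
def natural_gradient(scores, returns, ridge=0.0):
    """Estimate F^{-1} g for a policy with two parameters.

    scores:  list of 2-vectors, each the gradient of log pi(action | state)
    returns: list of scalar returns paired with each score
    ridge:   optional diagonal regularizer (assumed) for a near-singular F
    """
    n = len(scores)
    # Regular ('vanilla') gradient estimate: mean of score * return.
    g = [sum(s[k] * r for s, r in zip(scores, returns)) / n for k in range(2)]
    # Fisher information estimate: mean outer product of the scores.
    f = [[sum(s[i] * s[j] for s in scores) / n + (ridge if i == j else 0.0)
          for j in range(2)] for i in range(2)]
    det = f[0][0] * f[1][1] - f[0][1] * f[1][0]
    # Closed-form 2x2 solve of F x = g.
    return [(f[1][1] * g[0] - f[0][1] * g[1]) / det,
            (-f[1][0] * g[0] + f[0][0] * g[1]) / det]
```

For scores `[(1, 0), (0, 2)]` with returns `[1, 1]`, the regular gradient is `[0.5, 1.0]` while the natural gradient is `[1.0, 0.5]`: the direction is rescaled by how sensitive the policy distribution is to each parameter.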