Q-Prop: Sample-efficient policy gradient with an off-policy critic
Model-free deep reinforcement learning (RL) methods have been successful in a
wide variety of simulated domains. However, a major obstacle facing deep RL in
the real world is its high sample complexity. Batch policy gradient methods
offer stable learning, but at the cost of high variance, which often requires
large batches. TD-style methods, such as off-policy actor-critic and
Q-learning, are more sample-efficient but biased, and often require costly
hyperparameter sweeps to stabilize. In this work, we aim to develop methods
that combine the stability of policy gradients with the efficiency of
off-policy RL. We present Q-Prop, a policy gradient method that uses a Taylor
expansion of the off-policy critic as a control variate. Q-Prop is both sample
efficient and stable, and effectively combines the benefits of on-policy and
off-policy methods. We analyze the connection between Q-Prop and existing
model-free algorithms, and use control variate theory to derive two variants of
Q-Prop with conservative and aggressive adaptation. We show that conservative
Q-Prop provides substantial gains in sample efficiency over trust region policy
optimization (TRPO) with generalized advantage estimation (GAE), and improves
stability over deep deterministic policy gradient (DDPG), the state-of-the-art
on-policy and off-policy methods, on OpenAI Gym's MuJoCo continuous control
environments.
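The control-variate idea behind Q-Prop can be illustrated with a toy one-dimensional example. The sketch below is hypothetical and much simpler than the paper's setup: a quadratic function stands in for the critic, and the "policy" is a scalar Gaussian. Subtracting the first-order Taylor expansion of the critic from the score-function estimator, and adding its analytic gradient back, leaves the estimator unbiased while reducing variance:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 1.0, 0.5
f = lambda a: a ** 2          # stand-in "critic" (hypothetical)
df = lambda a: 2.0 * a        # its derivative

n = 10_000
a = rng.normal(mu, sigma, n)
score = (a - mu) / sigma ** 2  # d/dmu of log N(a; mu, sigma^2)

# Plain score-function (REINFORCE-style) estimator of dJ/dmu, J = E[f(a)]
plain = f(a) * score

# Control variate: first-order Taylor expansion of f around the policy mean
fbar = f(mu) + df(mu) * (a - mu)
# Residual score term plus the analytic gradient of E[fbar(a)] = f(mu)
cv = (f(a) - fbar) * score + df(mu)

# Both estimators target dJ/dmu = 2*mu = 2.0; the CV one has lower variance
print(plain.mean(), plain.var())
print(cv.mean(), cv.var())
```

Here the analytic term `df(mu)` plays the role of the deterministic, critic-based gradient contribution, while only the residual between the true function and its linearization is estimated by sampling.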
Differential Dynamic Programming for time-delayed systems
Trajectory optimization considers the problem of deciding how to control a
dynamical system to move along a trajectory which minimizes some cost function.
Differential Dynamic Programming (DDP) is an optimal control method which
utilizes a second-order approximation of the problem to find the control. It is
fast enough to allow real-time control and has been shown to work well for
trajectory optimization in robotic systems. Here we extend classic DDP to
systems with multiple time-delays in the state. Being able to find optimal
trajectories for time-delayed systems with DDP opens up the possibility to use
richer models for system identification and control, including recurrent neural
networks with multiple timesteps in the state. We demonstrate the algorithm on
a two-tank continuous stirred tank reactor. We also demonstrate the algorithm
on a recurrent neural network trained to model an inverted pendulum with
position information only.

Comment: 7 pages, 6 figures; 2016 IEEE 55th Conference on Decision and Control (CDC)
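To see what a time-delayed state means in practice, consider the classic workaround of state augmentation (the paper handles delays inside DDP directly, but augmentation illustrates the setting). The sketch below is a hypothetical scalar system with a one-step delay; for linear dynamics and quadratic cost, DDP's backward pass reduces exactly to the finite-horizon LQR Riccati recursion:

```python
import numpy as np

# Hypothetical scalar system with one step of delay in the state:
#   x[t+1] = 0.9 x[t] + 0.3 x[t-1] + 0.5 u[t]
# Augmenting z[t] = (x[t], x[t-1]) removes the delay.
A = np.array([[0.9, 0.3],
              [1.0, 0.0]])
B = np.array([[0.5],
              [0.0]])
Q = np.diag([1.0, 0.0])   # penalize x[t] only
R = np.array([[0.1]])

T = 50
P = Q.copy()
gains = []
for _ in range(T):  # backward (Riccati) pass
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ (A - B @ K)
    gains.append(K)
gains.reverse()     # gains[0] is the first-step feedback gain

# Forward rollout from x[0] = 1, x[-1] = 0
z = np.array([[1.0], [0.0]])
for K in gains:
    z = A @ z + B @ (-K @ z)
print(float(z[0, 0]))   # driven close to zero despite the delay
```

The open-loop system here is unstable (one eigenvalue above 1), so the rollout converging to zero confirms the delayed coupling is handled correctly by the augmented formulation.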
Average energy efficiency contours with multiple decoding policies
This letter addresses energy-efficient design in multi-user, single-carrier uplink channels by employing multiple decoding policies. The comparison metric used in this study is based on average energy efficiency contours, where an optimal rate vector is obtained for four system targets: maximum energy efficiency; a trade-off between maximum energy efficiency and rate fairness; achieving an energy-efficiency target with maximum sum rate; and achieving an energy-efficiency target with fairness. The transmit power function is approximated using a Taylor series expansion, with simulation results demonstrating the achievability of the optimal rate vector and a negligible performance difference when employing this approximation.
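The Taylor approximation of a transmit power function can be sketched numerically. The example below is hypothetical and not the letter's exact model: it uses the inverse Shannon formula as a stand-in power function for a single link (the actual form depends on the decoding policy), linearizes it around an operating rate, and compares the resulting energy-efficiency values:

```python
import numpy as np

# Stand-in transmit power needed for rate r (bits/s/Hz) on a Gaussian
# channel with gain g and noise N0 -- inverse of the Shannon formula.
g, N0 = 1.0, 1.0
P = lambda r: (2.0 ** r - 1.0) * N0 / g
dP = lambda r: np.log(2.0) * 2.0 ** r * N0 / g

# First-order Taylor expansion of P around an operating point r0
r0 = 2.0
P_taylor = lambda r: P(r0) + dP(r0) * (r - r0)

# Energy efficiency = rate / (circuit power + transmit power)
r = 2.1
ee_exact = r / (1.0 + P(r))
ee_approx = r / (1.0 + P_taylor(r))
print(ee_exact, ee_approx)  # nearly identical near r0
```

Near the expansion point the two energy-efficiency values differ by well under one percent, consistent with the letter's observation that the approximation costs negligible performance.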