Inverse Reinforcement Learning from a Gradient-based Learner
Inverse Reinforcement Learning addresses the problem of inferring an expert's
reward function from demonstrations. However, in many applications, we not only
have access to the expert's near-optimal behavior, but we also observe part of
her learning process. In this paper, we propose a new algorithm for this
setting, in which the goal is to recover the reward function being optimized by
an agent, given a sequence of policies produced during learning. Our approach
is based on the assumption that the observed agent is updating her policy
parameters along the gradient direction. Then we extend our method to deal with
the more realistic scenario where we only have access to a dataset of learning
trajectories. For both settings, we provide theoretical insights into our
algorithms' performance. Finally, we evaluate the approach in a simulated
GridWorld environment and on MuJoCo environments, comparing it with a
state-of-the-art baseline.
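To make the gradient-based assumption concrete, here is a minimal sketch of
the weight-recovery step in Python (the linear reward parameterization
r(s,a) = w . phi(s,a), the known step size alpha, and the per-feature
gradient estimates are assumptions for illustration, not the paper's exact
estimator). Under this parameterization the policy gradient is linear in the
reward weights, so each observed parameter update yields a linear system in w
that can be solved by least squares:

    import numpy as np

    def recover_reward_weights(thetas, grad_per_feature, alpha):
        """Recover reward weights w from a sequence of observed policy
        parameters, assuming the learner took gradient steps
            theta_{t+1} = theta_t + alpha * sum_i w_i * g_i(theta_t),
        where g_i is the policy gradient computed with feature i as the
        only reward.  Stacks one least-squares block per observed step.

        thetas:            list of T+1 parameter vectors, each (d,)
        grad_per_feature:  list of T matrices of shape (d, k); column i
                           is the estimate of g_i(theta_t)
        alpha:             the learner's (assumed known) step size
        """
        A_blocks, b_blocks = [], []
        for t in range(len(thetas) - 1):
            A_blocks.append(alpha * grad_per_feature[t])   # (d, k)
            b_blocks.append(thetas[t + 1] - thetas[t])     # (d,)
        A = np.vstack(A_blocks)
        b = np.concatenate(b_blocks)
        w, *_ = np.linalg.lstsq(A, b, rcond=None)
        return w

In the trajectory-only setting, the per-feature gradients g_i(theta_t) would
themselves have to be estimated from the observed trajectories, e.g., with a
REINFORCE-style estimator, before solving the same system.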
Traffic Light Control Using Deep Policy-Gradient and Value-Function Based Reinforcement Learning
Recent advances in combining deep neural network architectures with
reinforcement learning techniques have shown promising results in solving
complex control problems with high-dimensional state and action spaces.
Inspired by these successes, in this paper we build two kinds of
reinforcement learning agents: a deep policy-gradient agent and a
value-function-based agent, each of which predicts the best possible traffic
signal for a traffic intersection. At
each time step, these adaptive traffic light control agents receive a snapshot
of the current state of a graphical traffic simulator and produce control
signals. The policy-gradient-based agent maps its observation directly to the
control signal, whereas the value-function-based agent first estimates a value
for every legal control signal and then selects the action with the highest
value. Our methods show promising results in a traffic network simulated in
the SUMO traffic simulator, without suffering from instability issues during
the training process.
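As a concrete illustration of the value-function-based variant, the sketch
below (PyTorch; the state dimension, the number of phases, and the network
shape are illustrative assumptions, not the paper's architecture) estimates
one value per legal control signal and selects the argmax:

    import torch
    import torch.nn as nn

    class SignalValueNet(nn.Module):
        """Q-network: maps an intersection state snapshot to one value
        per control signal (traffic-light phase)."""
        def __init__(self, state_dim=64, n_phases=4):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 128), nn.ReLU(),
                nn.Linear(128, 128), nn.ReLU(),
                nn.Linear(128, n_phases),
            )

        def forward(self, state):
            return self.net(state)

    def select_phase(q_net, state, legal_mask):
        """Pick the legal phase with the highest estimated value."""
        with torch.no_grad():
            q = q_net(state)
            q[~legal_mask] = float("-inf")  # rule out illegal signals
            return int(q.argmax())

The policy-gradient-based agent would instead end in a softmax over phases
and sample its control signal directly from that distribution.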
Control Regularization for Reduced Variance Reinforcement Learning
Dealing with high variance is a significant challenge in model-free
reinforcement learning (RL). Existing methods are unreliable, exhibiting high
variance in performance from run to run using different initializations/seeds.
Focusing on problems arising in continuous control, we propose a functional
regularization approach to augmenting model-free RL. In particular, we
regularize the behavior of the deep policy to be similar to a policy prior,
i.e., we regularize in function space. We show that functional regularization
yields a bias-variance trade-off, and propose an adaptive tuning strategy to
optimize this trade-off. When the policy prior has control-theoretic stability
guarantees, we further show that this regularization approximately preserves
those stability guarantees throughout learning. We validate our approach
empirically on a range of settings, and demonstrate significantly reduced
variance, guaranteed dynamic stability, and more efficient learning than deep
RL alone.
Comment: Appearing in ICML 2019.
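A minimal sketch of regularizing in function space, assuming a simple
quadratic penalty between the learned policy and the control prior (the
penalty form, the weight lam, and the interfaces are illustrative choices;
the paper's formulation and its adaptive tuning of the trade-off may differ):

    import torch

    def regularized_policy_loss(rl_loss, policy, prior, states, lam):
        """Augment a model-free RL loss with a function-space penalty
        that keeps the learned policy close to a control prior.

        rl_loss: the usual policy loss (e.g., from PPO or DDPG)
        policy:  nn.Module mapping a batch of states to actions
        prior:   callable mapping the same states to actions, e.g., a
                 PD/LQR controller with stability guarantees
        lam:     regularization weight, trading bias for variance
        """
        deviation = policy(states) - prior(states)
        return rl_loss + lam * deviation.pow(2).sum(dim=-1).mean()

Larger values of lam bias the policy toward the prior, reducing run-to-run
variance and inheriting more of the prior's stability; lam = 0 recovers
unregularized deep RL.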
A new Potential-Based Reward Shaping for Reinforcement Learning Agent
Potential-based reward shaping (PBRS) is a particular category of machine
learning methods which aims to improve the learning speed of a reinforcement
learning agent by extracting and utilizing extra knowledge while performing a
task. There are two steps in the process of transfer learning: extracting
knowledge from previously learned tasks and transferring that knowledge to a
target task. The latter step is well covered in the literature, with various
methods proposed for it, while the former has been explored far less. With
this in mind, the type of knowledge that is transferred is very important and
can lead to considerable improvement. In the literature on both transfer
learning and potential-based reward shaping, one source of knowledge that has
never been addressed is the knowledge gathered during the learning process
itself. In this paper, we present a novel potential-based reward shaping
method that extracts knowledge from the learning process itself, building its
potential function from episodes' cumulative rewards. The proposed method is
evaluated in the Arcade Learning Environment, and the results indicate an
improved learning process for both single-task and multi-task reinforcement
learning agents.
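For reference, the shaping mechanism itself is the standard one: the agent
receives r' = r + gamma*Phi(s') - Phi(s), which is known to preserve the
optimal policy for any potential Phi. The sketch below builds Phi from
episodes' cumulative rewards as the abstract describes, though the particular
running-average construction is an illustrative assumption, not the paper's
exact method:

    import collections

    class ReturnBasedShaper:
        """Potential-based shaping, r' = r + gamma*phi(s') - phi(s).
        The potential phi(s) is an illustrative running average of the
        cumulative episode reward observed after visiting state s."""
        def __init__(self, gamma=0.99):
            self.gamma = gamma
            self.total = collections.defaultdict(float)
            self.count = collections.defaultdict(int)

        def update(self, episode_states, episode_return):
            # Attribute the episode's cumulative reward to every state
            # visited, building the potential from learning-time data.
            for s in set(episode_states):
                self.total[s] += episode_return
                self.count[s] += 1

        def phi(self, s):
            return self.total[s] / self.count[s] if self.count[s] else 0.0

        def shaped_reward(self, r, s, s_next):
            return r + self.gamma * self.phi(s_next) - self.phi(s)

States must be hashable here; for the Arcade Learning Environment, raw frames
would first need to be mapped to discrete abstract states (e.g., via
discretized features) before a tabular potential like this applies.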