Smoothing Policies and Safe Policy Gradients
Policy gradient algorithms are among the best candidates for the much
anticipated application of reinforcement learning to real-world control tasks,
such as the ones arising in robotics. However, the trial-and-error nature of
these methods introduces safety issues whenever the learning phase itself must
be performed on a physical system. In this paper, we address a specific safety
formulation, where danger is encoded in the reward signal and the learning
agent is constrained to never worsen its performance. By studying actor-only
policy gradient from a stochastic optimization perspective, we establish
improvement guarantees for a wide class of parametric policies, generalizing
existing results on Gaussian policies. This, together with novel upper bounds
on the variance of policy gradient estimators, allows us to identify those
meta-parameter schedules that guarantee monotonic improvement with high
probability. The two key meta-parameters are the step size of the parameter
updates and the batch size of the gradient estimators. By a joint, adaptive
selection of these meta-parameters, we obtain a safe policy gradient algorithm
From Language to Programs: Bridging Reinforcement Learning and Maximum Marginal Likelihood
Our goal is to learn a semantic parser that maps natural language utterances
into executable programs when only indirect supervision is available: examples
are labeled with the correct execution result, but not the program itself.
Consequently, we must search the space of programs for those that output the
correct result, while not being misled by spurious programs: incorrect programs
that coincidentally output the correct result. We connect two common learning
paradigms, reinforcement learning (RL) and maximum marginal likelihood (MML),
and then present a new learning algorithm that combines the strengths of both.
The new algorithm guards against spurious programs by combining the systematic
search traditionally employed in MML with the randomized exploration of RL, and
by updating parameters such that probability is spread more evenly across
consistent programs. We apply our learning algorithm to a new neural semantic
parser and show significant gains over existing state-of-the-art results on a
recent context-dependent semantic parsing task.
Comment: Proceedings of the 55th Annual Meeting of the Association for
Computational Linguistics (2017)
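The connection between RL and MML mentioned in the abstract can be made concrete with one common way of writing the two objectives. The notation here is an assumption, not taken from the listing: x is an utterance, y its labeled execution result, z a candidate program, p_theta(z | x) the parser's distribution over programs, and R(z) a binary reward that is 1 when z executes to y and 0 otherwise.

```latex
% Assumed notation: x utterance, y labeled result, z program,
% R(z) = 1 if z executes to y, else 0.
\begin{align}
J_{\mathrm{RL}}(\theta)  &= \sum_{(x,y)} \mathbb{E}_{z \sim p_\theta(z \mid x)}\big[ R(z) \big]
                          = \sum_{(x,y)} \sum_{z} p_\theta(z \mid x)\, R(z), \\
J_{\mathrm{MML}}(\theta) &= \sum_{(x,y)} \log \sum_{z} p_\theta(z \mid x)\, R(z).
\end{align}
% Both gradients are weighted sums of the same score function:
\begin{align}
\nabla_\theta J &= \sum_{(x,y)} \sum_{z} q(z)\, \nabla_\theta \log p_\theta(z \mid x), \\
q_{\mathrm{RL}}(z)  &= p_\theta(z \mid x)\, R(z), \qquad
q_{\mathrm{MML}}(z) = \frac{p_\theta(z \mid x)\, R(z)}{\sum_{z'} p_\theta(z' \mid x)\, R(z')}.
\end{align}
```

Under these assumptions the two objectives differ only in how the weight q(z) is normalized over the programs that produce the correct result, which is the kind of weighting the abstract refers to when it speaks of spreading probability more evenly across consistent programs.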
Trajectory-Based Off-Policy Deep Reinforcement Learning
Policy gradient methods are powerful reinforcement learning algorithms and
have been demonstrated to solve many complex tasks. However, these methods are
also data-inefficient, afflicted with high-variance gradient estimates, and
frequently get stuck in local optima. This work addresses these weaknesses by
combining recent improvements in the reuse of off-policy data and exploration
in parameter space with deterministic behavioral policies. The resulting
objective is amenable to standard neural network optimization strategies like
stochastic gradient descent or stochastic gradient Hamiltonian Monte Carlo.
Incorporation of previous rollouts via importance sampling greatly improves
data-efficiency, whilst stochastic optimization schemes facilitate the escape
from local optima. We evaluate the proposed approach on a series of continuous
control benchmark tasks. The results show that the proposed algorithm is able
to successfully and reliably learn solutions using fewer system interactions
than standard policy gradient methods.
Comment: Includes appendix. Accepted for ICML 201
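The reuse of previous rollouts via importance sampling can be illustrated with a generic estimator. This is a minimal sketch, not the paper's method: the function name `importance_weighted_return` and its arguments are hypothetical, and it assumes each stored rollout comes with a log-density under the policy that generated it and under the policy being evaluated. How those densities are defined for deterministic behavioral policies is left open here.

```python
import math


def importance_weighted_return(returns, logp_target, logp_behavior,
                               self_normalize=True):
    """Estimate the expected return of a target policy from old rollouts.

    returns        -- total return of each stored rollout
    logp_target    -- log-density of each rollout under the target policy
    logp_behavior  -- log-density of each rollout under the policy that
                      actually generated it
    (All three are hypothetical inputs assumed for this illustration.)
    """
    # Log importance weights: log p_target(tau) - log p_behavior(tau).
    log_w = [t - b for t, b in zip(logp_target, logp_behavior)]

    if self_normalize:
        # Shifting by the max leaves the ratio unchanged and avoids overflow.
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        return sum(wi * r for wi, r in zip(w, returns)) / sum(w)

    # Plain (unnormalized) importance sampling estimate.
    w = [math.exp(lw) for lw in log_w]
    return sum(wi * r for wi, r in zip(w, returns)) / len(returns)


if __name__ == "__main__":
    rets = [1.0, 3.0]
    lp_target = [math.log(0.5), math.log(0.5)]
    lp_behavior = [math.log(0.25), math.log(0.75)]
    print(importance_weighted_return(rets, lp_target, lp_behavior))
    print(importance_weighted_return(rets, lp_target, lp_behavior,
                                     self_normalize=False))
```

The self-normalized variant trades a small bias for lower variance, which is one common choice when the weights are spread widely; the plain variant is unbiased but can be dominated by a few large weights.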