Diverse Exploration via Conjugate Policies for Policy Gradient Methods
We address the challenge of effective exploration while maintaining good
performance in policy gradient methods. As a solution, we propose diverse
exploration (DE) via conjugate policies. DE learns and deploys a set of
conjugate policies which can be conveniently generated as a byproduct of
conjugate gradient descent. We provide both theoretical and empirical results
showing the effectiveness of DE at achieving exploration and improving policy
performance, and its advantage over exploration by random policy
perturbations.
Comment: AAAI 201
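To illustrate what "a byproduct of conjugate gradient descent" can mean, here is a minimal sketch, not the authors' implementation. It runs plain conjugate gradient on a small symmetric positive-definite matrix and keeps the search directions, which are mutually conjugate with respect to that matrix. The matrix `A`, the vector `b`, and all function names are illustrative assumptions; how such directions would be turned into policies is not specified in the abstract and is not shown here.

```python
def matvec(A, v):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-12):
    """Solve A x = b for symmetric positive-definite A.

    Returns the solution and the list of search directions, which
    satisfy p_i^T A p_j = 0 for i != j (up to rounding error).
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)              # residual b - A x, with x = 0
    p = list(r)              # first search direction
    directions = []
    rs_old = dot(r, r)
    for _ in range(n):
        if rs_old < tol:     # squared residual norm already negligible
            break
        Ap = matvec(A, p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        directions.append(list(p))
        rs_new = dot(r, r)
        beta = rs_new / rs_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x, directions


# Hypothetical example data: a small SPD matrix standing in for
# whatever curvature matrix the optimizer would use.
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x, directions = conjugate_gradient(A, b)
```

The directions come for free from the solve, which is the sense in which they are a byproduct: no extra matrix-vector products are needed to obtain them.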
Reward is enough for convex MDPs
Maximising a cumulative reward function that is Markov and stationary, i.e.,
defined over state-action pairs and independent of time, is sufficient to
capture many kinds of goals in a Markov decision process (MDP). However, not
all goals can be captured in this manner. In this paper we study convex MDPs in
which goals are expressed as convex functions of the stationary distribution
and show that they cannot be formulated using stationary reward functions.
Convex MDPs generalize the standard reinforcement learning (RL) problem
formulation to a larger framework that includes many supervised and
unsupervised RL problems, such as apprenticeship learning, constrained MDPs,
and so-called `pure exploration'. Our approach is to reformulate the convex MDP
problem as a min-max game involving policy and cost (negative reward)
`players', using Fenchel duality. We propose a meta-algorithm for solving this
problem and show that it unifies many existing algorithms in the literature
- …
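The Fenchel-duality reformulation mentioned above can be sketched as follows. The notation is an assumption made for illustration and is not taken from the abstract: $d$ is the stationary state-action distribution, $\mathcal{K}$ the set of admissible distributions, $f$ a closed convex objective, $f^{*}$ its convex conjugate, and $\lambda$ the cost vector.

```latex
\begin{align}
f^{*}(\lambda) &= \max_{d} \bigl( \lambda \cdot d - f(d) \bigr), \\
\min_{d \in \mathcal{K}} f(d)
  &= \min_{d \in \mathcal{K}} \max_{\lambda}
     \bigl( \lambda \cdot d - f^{*}(\lambda) \bigr).
\end{align}
```

For a fixed $\lambda$, the inner problem for the policy player is to minimize the linear cost $\lambda \cdot d$, which is a standard RL problem with reward $r = -\lambda$. The cost player in turn chooses $\lambda$ to maximize $\lambda \cdot d - f^{*}(\lambda)$.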