Trust Region Policy Optimization (TRPO) is an iterative method that, at each iteration, maximizes a surrogate objective while enforcing a trust region constraint between consecutive policies. The combination of surrogate objective maximization and trust region enforcement has been shown to be crucial to guaranteeing monotonic policy improvement. However, solving a trust-region-constrained optimization problem can be computationally intensive, as it requires many conjugate gradient steps and a large number of on-policy samples.
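For reference, the constrained problem that TRPO solves at each iteration is standardly written as follows; this is a sketch of the usual formulation from the TRPO literature, with notation assumed here rather than taken from this paper ($\theta_{\text{old}}$ are the current policy parameters, $\hat{A}$ an advantage estimate, and $\delta$ the trust region radius):
\[
\max_{\theta}\;\mathbb{E}_{s,a\sim\pi_{\theta_{\text{old}}}}\!\left[\frac{\pi_{\theta}(a\mid s)}{\pi_{\theta_{\text{old}}}(a\mid s)}\,\hat{A}^{\pi_{\theta_{\text{old}}}}(s,a)\right]
\quad\text{subject to}\quad
\mathbb{E}_{s\sim\pi_{\theta_{\text{old}}}}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot\mid s)\,\big\|\,\pi_{\theta}(\cdot\mid s)\big)\right]\le\delta.
\]
Approximately solving this KL-constrained step is what requires the conjugate gradient iterations mentioned above.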
In this paper, we show that the trust region constraint over policies can be safely substituted by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee.
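The guarantee in question traces back to the classical lower bound behind TRPO's analysis, restated here as a sketch with assumed notation ($J$ is the expected discounted return, $L_{\pi_{\text{old}}}$ the advantage-weighted surrogate objective, $\gamma$ the discount factor; these symbols are not taken from this paper):
\[
J(\pi)\;\ge\;L_{\pi_{\text{old}}}(\pi)\;-\;\frac{4\epsilon\gamma}{(1-\gamma)^{2}}\,\max_{s} D_{\mathrm{KL}}\!\big(\pi_{\text{old}}(\cdot\mid s)\,\big\|\,\pi(\cdot\mid s)\big),
\qquad
\epsilon=\max_{s,a}\big|A^{\pi_{\text{old}}}(s,a)\big|.
\]
Improving the surrogate while keeping the penalty term small therefore improves the true return, which is the sense in which the improvement guarantee is monotonic.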
The key idea is to generalize the surrogate objective used in TRPO so that a monotonic improvement guarantee still emerges from constraining the maximum advantage-weighted ratio between policies.
This new constraint outlines a conservative mechanism for iterative policy optimization and sheds light on practical ways to optimize the generalized surrogate objective. We show that, in practice, the new constraint can be effectively enforced by being conservative when optimizing the generalized objective. We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree), as it is free of any explicit trust region constraints. Empirical results show that
TREFree outperforms TRPO and Proximal Policy Optimization (PPO) in terms of
policy performance and sample efficiency.