Reinforcement learning from human feedback (RLHF) has emerged as a reliable
approach to aligning large language models (LLMs) to human preferences. Among
the plethora of RLHF techniques, proximal policy optimization (PPO) is of the
most widely used methods. Despite its popularity, however, PPO may suffer from
mode collapse, instability, and poor sample efficiency. We show that these
issues can be alleviated by a novel algorithm that we refer to as
Advantage-Induced Policy Alignment (APA), which leverages a squared error loss
function based on the estimated advantages. We demonstrate empirically that APA
consistently outperforms PPO in language tasks by a large margin, when a
separate reward model is employed as the evaluator. In addition, compared with
PPO, APA offers a more stable form of control over the deviation from the
model's initial policy, ensuring that the model improves its performance
without collapsing to deterministic output. In addition to empirical results,
we also provide a theoretical justification supporting the design of our loss
function