Maximum likelihood estimation (MLE) is the predominant algorithm for training
text generation models. This paradigm relies on direct supervision examples,
which are not available in many applications, such as generating adversarial
attacks or generating prompts to control language models. Reinforcement
learning (RL), on the other hand, offers a more flexible solution by allowing
users to plug in arbitrary task metrics as rewards. Yet previous RL algorithms
for text generation, such as policy gradient (on-policy RL) and Q-learning
(off-policy RL), are often notoriously inefficient or unstable to train due to
the large sequence space and the sparse reward received only at the end of
sequences. In this paper, we introduce a new RL formulation for text generation
from the soft Q-learning perspective. It further enables us to draw from the
latest RL advances, such as path consistency learning, to combine the best of
on-/off-policy updates, and learn effectively from sparse reward. We apply the
approach to a wide range of tasks, including learning from noisy/negative
examples, adversarial attacks, and prompt generation. Experiments show our
approach consistently outperforms both task-specialized algorithms and the
previous RL methods. On standard supervised tasks where MLE prevails, our
approach also achieves competitive performance and stability when training
text generation models from scratch.

Comment: Code available at
https://github.com/HanGuo97/soft-Q-learning-for-text-generatio
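As a hedged sketch of the machinery the abstract names: soft Q-learning and path consistency learning (PCL) typically rest on the soft Bellman consistency below, written here in standard soft-RL notation (the symbols and the discount-free, temperature-$\tau$ form are assumptions for illustration, not taken from this abstract):

```latex
\begin{align*}
% Soft Bellman consistency for a token a_t appended to a prefix s_t,
% with reward r and soft value V induced by Q at temperature \tau:
Q(s_t, a_t) &= r(s_t, a_t) + V(s_{t+1}),\\
V(s) &= \tau \log \sum_{a \in \mathcal{V}} \exp\!\big(Q(s, a)/\tau\big).\\
% PCL minimizes the squared multi-step residual of this consistency,
% which admits both on-policy and off-policy (sub-)trajectories:
\mathcal{L}_{\mathrm{PCL}} &= \Big( V(s_t) - V(s_{t+l})
  - \sum_{i=0}^{l-1}\big[\, r(s_{t+i}, a_{t+i})
  - \tau \log \pi(a_{t+i} \mid s_{t+i}) \,\big] \Big)^{2}
\end{align*}
```

Because the residual is defined over any stored sub-trajectory, minimizing it does not require samples from the current policy, which is one way a method can combine on- and off-policy updates and propagate a sparse end-of-sequence reward backward through intermediate steps.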