Policy Representations with Successor Features
This paper describes a method for representing the behaviors of
black-box policies as feature vectors. The policy representations capture how
the statistics of foundation-model features change in response to the policy's
behavior in a task-agnostic way, and they can be trained from offline data, allowing
them to be used in offline policy selection. This work provides a key piece of
a recipe for fusing together three modern lines of research: offline policy
evaluation as a counterpart to offline RL, foundation models as generic and
powerful state representations, and efficient policy selection in resource-constrained environments.
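A minimal sketch of the underlying idea, under stated assumptions: the policy is summarized by the discounted sum of foundation-model features observed along trajectories it generated, estimated here by simple Monte-Carlo averaging over offline data. The feature extractor `phi`, the discount `gamma`, and the toy usage below are illustrative stand-ins, not the paper's API.

```python
import numpy as np

def successor_feature_vector(trajectories, phi, gamma=0.99):
    """Monte-Carlo estimate of a policy's successor features.

    trajectories: iterable of state sequences collected while the policy acted.
    phi: callable mapping a state to a fixed-size feature vector
         (e.g., embeddings from a frozen foundation model).
    Returns the average discounted sum of features: a single vector
    used as the policy's representation.
    """
    reps = []
    for states in trajectories:
        feats = np.stack([phi(s) for s in states])        # (T, d)
        discounts = gamma ** np.arange(len(states))       # (T,)
        reps.append((discounts[:, None] * feats).sum(axis=0))
    return np.mean(reps, axis=0)

# Toy usage: random "states" and a stand-in feature extractor.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    phi = lambda s: np.tanh(s)                            # placeholder for a foundation model
    trajs = [rng.normal(size=(20, 8)) for _ in range(5)]  # 5 offline trajectories
    v = successor_feature_vector(trajs, phi)
    print(v.shape)                                        # (8,)
```

Vectors of this kind could then feed an offline policy-selection step, for example by fitting a predictor of policy value on policies whose returns are already known; that downstream use is implied by the abstract rather than shown here.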
Reinforced Self-Training (ReST) for Language Modeling
Reinforcement learning from human feedback (RLHF) can improve the quality of
a large language model's (LLM) outputs by aligning them with human preferences.
We propose a simple algorithm for aligning LLMs with human preferences, inspired
by growing-batch reinforcement learning (RL), which we call Reinforced
Self-Training (ReST). Given an initial LLM policy, ReST produces a dataset by
generating samples from the policy, which are then used to improve the LLM
policy using offline RL algorithms. ReST is more efficient than typical online
RLHF methods because the training dataset is produced offline, which allows
data reuse. While ReST is a general approach applicable to all generative
learning settings, we focus on its application to machine translation. Our
results show that ReST can substantially improve translation quality, as
measured by automated metrics and human evaluation on machine translation
benchmarks, in a compute- and sample-efficient manner.
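One way to read the sample-then-train-offline loop the abstract describes is sketched below; the round structure, the reward-threshold filter, and the placeholder callables `generate`, `reward_model`, and `finetune` are assumptions for illustration, not the paper's exact algorithm.

```python
from typing import Callable, List, Tuple

def rest_loop(
    prompts: List[str],
    generate: Callable[[List[str]], List[Tuple[str, str]]],  # policy: prompts -> (prompt, output) pairs
    reward_model: Callable[[str, str], float],                # scores a (prompt, output) pair
    finetune: Callable[[List[Tuple[str, str]]], None],        # offline update on a filtered dataset
    n_rounds: int = 3,
    n_improve_steps: int = 2,
    threshold: float = 0.0,
    threshold_growth: float = 0.1,
) -> None:
    """Sketch of a grow-then-improve alignment loop.

    Each round: (1) sample completions from the current policy to build an
    offline dataset, then (2) repeatedly filter that fixed dataset by a
    reward threshold and fine-tune on the survivors, raising the threshold
    between improvement steps.
    """
    for _ in range(n_rounds):
        # Grow: sample from the current policy once; the dataset is then fixed.
        dataset = generate(prompts)
        scored = [(p, y, reward_model(p, y)) for p, y in dataset]

        tau = threshold
        for _ in range(n_improve_steps):
            # Improve: keep only samples scoring above the threshold and
            # fine-tune offline on them (no fresh sampling inside a round).
            kept = [(p, y) for p, y, r in scored if r >= tau]
            if kept:
                finetune(kept)
            tau += threshold_growth
```

Because the dataset stays fixed within a round, the same samples are reused across several improvement steps, which is where the efficiency advantage over fully online RLHF claimed in the abstract comes from.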