29,087 research outputs found
RRHF: Rank Responses to Align Language Models with Human Feedback without tears
Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment
of large language models with human preferences, significantly enhancing the
quality of interactions between humans and these models. InstructGPT implements
RLHF through several stages, including Supervised Fine-Tuning (SFT), reward
model training, and Proximal Policy Optimization (PPO). PPO, however, is
sensitive to hyperparameters and requires a minimum of four models in its
standard implementation, which makes it hard to train. In contrast, we propose
a novel learning paradigm called RRHF, which scores responses generated by
different sampling policies and learns to align them with human preferences
through ranking loss. RRHF can efficiently align language model output
probabilities with human preferences as robust as fine-tuning and it only needs
1 to 2 models during tuning. In addition, RRHF can be considered an extension
of SFT and reward models while being simpler than PPO in terms of coding, model
counts, and hyperparameters. The entire alignment process can be accomplished
within a single RRHF training session. We evaluate RRHF using LLaMA and Alpaca
on Helpful and Harmless data, demonstrating performance comparable to PPO.Comment: Codes available at https://github.com/GanjinZero/RRH
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
While large-scale unsupervised language models (LMs) learn broad world
knowledge and some reasoning skills, achieving precise control of their
behavior is difficult due to the completely unsupervised nature of their
training. Existing methods for gaining such steerability collect human labels
of the relative quality of model generations and fine-tune the unsupervised LM
to align with these preferences, often with reinforcement learning from human
feedback (RLHF). However, RLHF is a complex and often unstable procedure, first
fitting a reward model that reflects the human preferences, and then
fine-tuning the large unsupervised LM using reinforcement learning to maximize
this estimated reward without drifting too far from the original model. In this
paper, we leverage a mapping between reward functions and optimal policies to
show that this constrained reward maximization problem can be optimized exactly
with a single stage of policy training, essentially solving a classification
problem on the human preference data. The resulting algorithm, which we call
Direct Preference Optimization (DPO), is stable, performant and computationally
lightweight, eliminating the need for fitting a reward model, sampling from the
LM during fine-tuning, or performing significant hyperparameter tuning. Our
experiments show that DPO can fine-tune LMs to align with human preferences as
well as or better than existing methods. Notably, fine-tuning with DPO exceeds
RLHF's ability to control sentiment of generations and improves response
quality in summarization and single-turn dialogue while being substantially
simpler to implement and train
Safe RLHF: Safe Reinforcement Learning from Human Feedback
With the development of large language models (LLMs), striking a balance
between the performance and safety of AI systems has never been more critical.
However, the inherent tension between the objectives of helpfulness and
harmlessness presents a significant challenge during LLM training. To address
this issue, we propose Safe Reinforcement Learning from Human Feedback (Safe
RLHF), a novel algorithm for human value alignment. Safe RLHF explicitly
decouples human preferences regarding helpfulness and harmlessness, effectively
avoiding the crowdworkers' confusion about the tension and allowing us to train
separate reward and cost models. We formalize the safety concern of LLMs as an
optimization task of maximizing the reward function while satisfying specified
cost constraints. Leveraging the Lagrangian method to solve this constrained
problem, Safe RLHF dynamically adjusts the balance between the two objectives
during fine-tuning. Through a three-round fine-tuning using Safe RLHF, we
demonstrate a superior ability to mitigate harmful responses while enhancing
model performance compared to existing value-aligned algorithms.
Experimentally, we fine-tuned the Alpaca-7B using Safe RLHF and aligned it with
collected human preferences, significantly improving its helpfulness and
harmlessness according to human evaluations
An Emulator for Fine-Tuning Large Language Models using Small Language Models
Widely used language models (LMs) are typically built by scaling up a
two-stage training pipeline: a pre-training stage that uses a very large,
diverse dataset of text and a fine-tuning (sometimes, 'alignment') stage that
uses targeted examples or other specifications of desired behaviors. While it
has been hypothesized that knowledge and skills come from pre-training, and
fine-tuning mostly filters this knowledge and skillset, this intuition has not
been extensively tested. To aid in doing so, we introduce a novel technique for
decoupling the knowledge and skills gained in these two stages, enabling a
direct answer to the question, "What would happen if we combined the knowledge
learned by a large model during pre-training with the knowledge learned by a
small model during fine-tuning (or vice versa)?" Using an RL-based framework
derived from recent developments in learning from human preferences, we
introduce emulated fine-tuning (EFT), a principled and practical method for
sampling from a distribution that approximates (or 'emulates') the result of
pre-training and fine-tuning at different scales. Our experiments with EFT show
that scaling up fine-tuning tends to improve helpfulness, while scaling up
pre-training tends to improve factuality. Beyond decoupling scale, we show that
EFT enables test-time adjustment of competing behavioral traits like
helpfulness and harmlessness without additional training. Finally, a special
case of emulated fine-tuning, which we call LM up-scaling, avoids
resource-intensive fine-tuning of large pre-trained models by ensembling them
with small fine-tuned models, essentially emulating the result of fine-tuning
the large pre-trained model. Up-scaling consistently improves helpfulness and
factuality of instruction-following models in the Llama, Llama-2, and Falcon
families, without additional hyperparameters or training
CycleAlign: Iterative Distillation from Black-box LLM to White-box Models for Better Human Alignment
Language models trained on large-scale corpus often generate content that is
harmful, toxic, or contrary to human preferences, making their alignment with
human values a critical concern. Reinforcement learning from human feedback
(RLHF) with algorithms like PPO is a prevalent approach for alignment but is
often complex, unstable, and resource-intensive. Recently, ranking-based
alignment methods have emerged, offering stability and effectiveness by
replacing the RL framework with supervised fine-tuning, but they are costly due
to the need for annotated data. Considering that existing large language models
(LLMs) like ChatGPT are already relatively well-aligned and cost-friendly,
researchers have begun to align the language model with human preference from
AI feedback. The common practices, which unidirectionally distill the
instruction-following responses from LLMs, are constrained by their bottleneck.
Thus we introduce CycleAlign to distill alignment capabilities from
parameter-invisible LLMs (black-box) to a parameter-visible model (white-box)
in an iterative manner. With in-context learning (ICL) as the core of the
cycle, the black-box models are able to rank the model-generated responses
guided by human-craft instruction and demonstrations about their preferences.
During iterative interaction, the white-box models also have a judgment about
responses generated by them. Consequently, the agreement ranking could be
viewed as a pseudo label to dynamically update the in-context demonstrations
and improve the preference ranking ability of black-box models. Through
multiple interactions, the CycleAlign framework could align the white-box model
with the black-box model effectively in a low-resource way. Empirical results
illustrate that the model fine-tuned by CycleAlign remarkably exceeds existing
methods, and achieves the state-of-the-art performance in alignment with human
value
Group Preference Optimization: Few-Shot Alignment of Large Language Models
Many applications of large language models (LLMs), ranging from chatbots to
creative writing, require nuanced subjective judgments that can differ
significantly across different groups. Existing alignment algorithms can be
expensive to align for each group, requiring prohibitive amounts of
group-specific preference data and computation for real-world use cases. We
introduce Group Preference Optimization (GPO), an alignment framework that
steers language models to preferences of individual groups in a few-shot
manner. In GPO, we augment the base LLM with an independent transformer module
trained to predict the preferences of a group for the LLM generations. For
few-shot learning, we parameterize this module as an in-context autoregressive
transformer and train it via meta-learning on several groups. We empirically
validate the efficacy of GPO through rigorous evaluations using LLMs with
varied sizes on three human opinion adaptation tasks. These tasks involve
adapting to the preferences of US demographic groups, global countries, and
individual users. Our results demonstrate that GPO not only aligns models more
accurately but also requires fewer group-specific preferences, and less
training and inference computing resources, outperforming existing strategies
such as in-context steering and fine-tuning methods.Comment: 24 pages, 12 figure
SALMON: Self-Alignment with Principle-Following Reward Models
Supervised Fine-Tuning (SFT) on response demonstrations combined with
Reinforcement Learning from Human Feedback (RLHF) constitutes a powerful
paradigm for aligning LLM-based AI agents. However, a significant limitation of
such an approach is its dependency on high-quality human annotations, making
its application to intricate tasks challenging due to difficulties in obtaining
consistent response demonstrations and in-distribution response preferences.
This paper presents a novel approach, namely SALMON (Self-ALignMent with
principle-fOllowiNg reward models), to align base language models with minimal
human supervision, using only a small set of human-defined principles, yet
achieving superior performance. Central to our approach is a
principle-following reward model. Trained on synthetic preference data, this
model can generate reward scores based on arbitrary human-defined principles.
By merely adjusting these principles during the RL training phase, we gain full
control over the preferences with the reward model, subsequently influencing
the behavior of the RL-trained policies, and eliminating the reliance on the
collection of online human preferences. Applying our method to the LLaMA-2-70b
base language model, we developed an AI assistant named Dromedary-2. With only
6 exemplars for in-context learning and 31 human-defined principles,
Dromedary-2 significantly surpasses the performance of several state-of-the-art
AI systems, including LLaMA-2-Chat-70b, on various benchmark datasets. We have
open-sourced the code and model weights to encourage further research into
aligning LLM-based AI agents with enhanced supervision efficiency, improved
controllability, and scalable oversight.Comment: Project page: https://github.com/IBM/SALMO
OpenAssistant Conversations -- Democratizing Large Language Model Alignment
Aligning large language models (LLMs) with human preferences has proven to
drastically improve usability and has driven rapid adoption as demonstrated by
ChatGPT. Alignment techniques such as supervised fine-tuning (SFT) and
reinforcement learning from human feedback (RLHF) greatly reduce the required
skill and domain knowledge to effectively harness the capabilities of LLMs,
increasing their accessibility and utility across various domains. However,
state-of-the-art alignment techniques like RLHF rely on high-quality human
feedback data, which is expensive to create and often remains proprietary. In
an effort to democratize research on large-scale alignment, we release
OpenAssistant Conversations, a human-generated, human-annotated assistant-style
conversation corpus consisting of 161,443 messages in 35 different languages,
annotated with 461,292 quality ratings, resulting in over 10,000 complete and
fully annotated conversation trees. The corpus is a product of a worldwide
crowd-sourcing effort involving over 13,500 volunteers. Models trained on
OpenAssistant Conversations show consistent improvements on standard benchmarks
over respective base models. We release our code and data under a fully
permissive licence.Comment: Published in NeurIPS 2023 Datasets and Benchmark
- …