Reinforcement Learning from Human Feedback (RLHF) is a vital strategy for
improving the safety of language models. However, annotating preference data
for RLHF is a resource-intensive and creativity-demanding process, while
automatic generation methods face limitations in data diversity and quality. In
response, we present Safer-Instruct, a novel pipeline for semi-automatically
constructing large-scale preference datasets. Our approach leverages reversed
instruction tuning, instruction induction, and expert model evaluation to
efficiently generate high-quality preference data without human annotators. We
evaluate Safer-Instruct using LLaMA for instruction induction and GPT-4 as an
expert model, generating approximately 10K preference samples. An Alpaca model
fine-tuned on this dataset shows improved harmlessness while maintaining
competitive performance on conversation and downstream tasks.
Safer-Instruct addresses the challenges in preference data acquisition,
advancing the development of safer and more responsible AI systems. Our code
and data are available at https://github.com/uscnlp-lime/safer-instruct

Comment: 11 pages