Reward models (RMs) are crucial for aligning large language models (LLMs) with
human preferences and improving interaction quality. However, the real world is
pluralistic, and human preferences diverge across religions, politics, cultures,
etc. Moreover, each individual can hold unique preferences on various topics.
Current LLM training pipelines neglect this diversity and rely on a single
general reward model, which is unsatisfactory for customized or personalized
application scenarios. To explore customized preference learning, we collect a
domain-specific preference (DSP) dataset, which gathers the preferred responses
to each query from four practical domains. In addition, from the perspective of
data efficiency, we propose a three-stage customized RM learning scheme, whose
effectiveness is empirically verified on both general preference datasets and
our DSP set. Furthermore, we test multiple training and data strategies across
the three learning stages and find several ways to better preserve the general
preference ability while training customized RMs, in particular general
preference enrichment and customized preference imitation learning. The
DSP dataset and code are available at https://github.com/Linear95/DSP
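
For context on what reward-model training optimizes, the sketch below shows the standard pairwise (Bradley-Terry) loss commonly used to fit RMs on (chosen, rejected) response pairs. It is a minimal, generic PyTorch illustration; the function name `pairwise_rm_loss` and the toy tensors are hypothetical and are not taken from the DSP codebase or the three-stage scheme described above.

```python
# Minimal sketch of the pairwise (Bradley-Terry) reward-model loss commonly
# used to train RMs on preference data. Names are illustrative only and are
# not taken from the DSP repository.
import torch
import torch.nn.functional as F


def pairwise_rm_loss(chosen_rewards: torch.Tensor,
                     rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Encourage the RM to score the preferred response above the rejected one.

    Both inputs are shape (batch,) scalar rewards, e.g. produced by a language
    model with a scalar value head applied to the final token of each response.
    """
    # -log(sigmoid(r_chosen - r_rejected)), averaged over the batch
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


if __name__ == "__main__":
    # Toy example: scalar rewards for 4 (chosen, rejected) response pairs
    chosen = torch.tensor([1.2, 0.3, 0.9, 2.0])
    rejected = torch.tensor([0.4, 0.5, -0.1, 1.0])
    print(pairwise_rm_loss(chosen, rejected))  # smaller when chosen > rejected
```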