The complexity of designing reward functions has been a major obstacle to the
wide application of deep reinforcement learning (RL) techniques. Describing an
agent's desired behaviors and properties can be difficult, even for experts. A
new paradigm called reinforcement learning from human preferences (or
preference-based RL) has emerged as a promising solution, in which reward
functions are learned from human preference labels over behavior trajectories.
However, existing methods for preference-based RL are limited by the need for
accurate oracle preference labels. This paper addresses this limitation by
developing a method for crowd-sourcing preference labels and learning from
diverse human preferences. The key idea is to stabilize reward learning through
regularization and correction in a latent space. To ensure temporal
consistency, a strong constraint is imposed on the reward model that forces its
latent space to be close to the prior distribution. Additionally, a
confidence-based reward model ensembling method is designed to generate more
stable and reliable predictions. The proposed method is tested on a variety of
tasks in DMControl and Meta-world and shows consistent and significant
improvements over existing preference-based RL algorithms when learning from
diverse feedback, paving the way for real-world applications of RL methods.
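
The abstract does not specify the exact form of the latent-space constraint or of the confidence measure, so the following Python/PyTorch sketch only illustrates one plausible realization under stated assumptions: a reward model whose latent code is regularized toward a standard-normal prior through a KL term, and an ensemble whose members are weighted by how closely their predictions agree with the ensemble mean. The names LatentRewardModel and confidence_weighted_reward, and the specific confidence definition, are hypothetical and are not taken from the paper.

# Hypothetical sketch, not the paper's implementation: a latent-space reward
# model with a KL penalty toward a standard-normal prior, plus
# confidence-weighted ensembling of reward predictions.
import torch
import torch.nn as nn


class LatentRewardModel(nn.Module):
    """Encodes a state-action pair into a latent z and decodes a scalar reward."""

    def __init__(self, obs_dim: int, act_dim: int, latent_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 128), nn.ReLU(),
            nn.Linear(128, 2 * latent_dim),  # outputs mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor):
        mu, log_var = self.encoder(torch.cat([obs, act], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # reparameterization
        reward = self.decoder(z).squeeze(-1)
        # KL(q(z|s,a) || N(0, I)) keeps the latent space close to the prior;
        # this is one common way to realize a "close to the prior" constraint.
        kl = 0.5 * (mu.pow(2) + log_var.exp() - 1.0 - log_var).sum(-1)
        return reward, kl


def confidence_weighted_reward(models, obs, act):
    """Ensemble rewards, down-weighting members that deviate from the mean.

    The confidence measure here (softmax over negative deviation from the
    ensemble mean) is an illustrative stand-in, not the paper's definition.
    """
    preds = torch.stack([m(obs, act)[0] for m in models])        # (n_models, batch)
    deviation = (preds - preds.mean(dim=0, keepdim=True)).abs()
    conf = torch.softmax(-deviation, dim=0)                      # higher = more confident
    return (conf * preds).sum(dim=0)


if __name__ == "__main__":
    models = [LatentRewardModel(obs_dim=8, act_dim=2) for _ in range(3)]
    obs, act = torch.randn(16, 8), torch.randn(16, 2)
    r_hat = confidence_weighted_reward(models, obs, act)
    print(r_hat.shape)  # torch.Size([16])

In a preference-based RL loop, the KL term would typically be added to the usual cross-entropy preference loss for each ensemble member, while confidence_weighted_reward would supply the reward signal used to train the policy.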