Contextual Bandits and Imitation Learning via Preference-Based Active Queries
We consider the problem of contextual bandits and imitation learning, where
the learner lacks direct knowledge of the executed action's reward. Instead,
the learner can actively query an expert at each round to compare two actions
and receive noisy preference feedback. The learner's objective is two-fold: to
minimize the regret associated with the executed actions, while simultaneously
minimizing the number of comparison queries made to the expert. In this paper,
we assume that the learner has access to a function class that can represent
the expert's preference model under appropriate link functions, and provide an
algorithm that leverages an online regression oracle with respect to this
function class for choosing its actions and deciding when to query. For the
contextual bandit setting, our algorithm achieves a regret bound that combines
the best of both worlds, scaling as $O(\min\{\sqrt{T}, d/\Delta\})$, where $T$
represents the number of interactions, $d$ represents the eluder dimension of
the function class, and $\Delta$ represents the minimum preference of the
optimal action over any suboptimal action under all contexts. Our algorithm
does not require the knowledge of $\Delta$, and the obtained regret bound is
comparable to what can be achieved in the standard contextual bandits setting
where the learner observes reward signals at each round. Additionally, our
algorithm makes only $O(\min\{T, d^2/\Delta^2\})$ queries to the expert. We
then extend our algorithm to the imitation learning setting, where the learning
agent engages with an unknown environment in episodes of length $H$ each, and
provide similar guarantees for regret and query complexity. Interestingly, our
algorithm for imitation learning can even learn to outperform the underlying
expert, when it is suboptimal, highlighting a practical benefit of
preference-based feedback in imitation learning.
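The feedback model in the abstract above can be illustrated with a minimal sketch. It assumes a logistic link function and a hypothetical uncertainty width; the function names are illustrative, and this is not the paper's algorithm, which relies on an online regression oracle over a function class.

```python
import math
import random


def preference_prob(f_a: float, f_b: float) -> float:
    """Probability that action a is preferred to action b.

    Assumes a logistic link applied to the score difference f_a - f_b.
    """
    return 1.0 / (1.0 + math.exp(-(f_a - f_b)))


def noisy_preference(f_a: float, f_b: float, rng: random.Random) -> bool:
    """Sample the expert's noisy answer: True if a is reported as preferred."""
    return rng.random() < preference_prob(f_a, f_b)


def should_query(est_a: float, est_b: float, width: float) -> bool:
    """Hypothetical query rule: ask the expert only when the estimated score
    gap is no larger than the current uncertainty width."""
    return abs(est_a - est_b) <= width
```

Under these assumptions, the learner skips the query whenever its estimates already separate the two actions by more than the width. This is one simple way to trade regret against query count.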
A survey of preference-based reinforcement learning methods
Reinforcement learning (RL) techniques optimize the accumulated long-term reward of a suitably chosen reward function. However, designing such a reward function often requires a lot of task-specific prior knowledge. The designer needs to consider different objectives that influence not only the learned behavior but also the learning progress. To alleviate these issues, preference-based reinforcement learning algorithms (PbRL) have been proposed that can directly learn from an expert's preferences instead of a hand-designed numeric reward. PbRL has gained traction in recent years due to its ability to resolve the reward shaping problem, its ability to learn from non-numeric rewards and the possibility to reduce the dependence on expert knowledge. We provide a unified framework for PbRL that describes the task formally and points out the different design principles that affect the evaluation task for the human as well as the computational complexity. The design principles include the type of feedback that is assumed, the representation that is learned to capture the preferences, the optimization problem that has to be solved as well as how the exploration/exploitation problem is tackled. Furthermore, we point out shortcomings of current algorithms, propose open research questions and briefly survey practical tasks that have been solved using PbRL.
Efficient Preference-based Reinforcement Learning
Common reinforcement learning algorithms assume access to a numeric feedback signal. The numeric feedback contains a high amount of information and can be maximized efficiently. However, defining a numeric feedback signal can be difficult in practice due to several limitations, and badly defined values may lead to an unintended outcome. For humans, it is usually easier to define qualitative feedback signals than quantitative ones. Hence, we want to solve reinforcement learning problems with a qualitative signal, potentially capable of overcoming several of the limitations of numeric feedback. Preferences have several advantages over other qualitative settings, like ordinal feedback or advice. Preferences are scale-free and do not require assumptions about the optimal outcome. However, preferences are difficult to use for solving sequential decision problems, because it is unknown which decisions are responsible for the observed preference. Hence, we analyze different approaches for learning from preferences and show the design principles that can be used, as well as the advantages and problems that occur. We also survey the field of preference-based reinforcement learning and categorize the algorithms according to the design principles. Efficiency is of special interest in this setting, as it is important to keep the number of required preferences low, because they depend on human evaluation. Hence, our focus is on efficient use of the preferences. Generalizing the obtained preferences is important, as this keeps the number of required preferences low. Therefore, we consider methods that are able to generalize the obtained preferences to models not yet evaluated. However, this introduces uncertain feedback, and the exploration/exploitation problem already known from classical reinforcement learning has to be considered with the preferences in mind.
We show how to efficiently solve this dual exploration problem by interleaving both tasks in an undirected manner. We use undirected exploration methods, because they scale better to high-dimensional spaces. Furthermore, human feedback has to be assumed to be error-prone and we analyze the problems that arise when using human evaluation. We show that noise is the most substantial problem when dealing with human preferences and present a solution to this problem.
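As a rough illustration of generalizing from a small set of pairwise preferences, the sketch below fits a linear utility with a logistic (Bradley-Terry style) likelihood. The linear model, the feature vectors, and all names are assumptions made for this example; it is not the method developed in the work above.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_utility(pairs, dim, lr=0.1, epochs=200):
    """Fit weights w so that w . x_winner > w . x_loser for observed pairs.

    pairs: list of (x_winner, x_loser) feature vectors of length dim.
    Uses stochastic gradient ascent on the log-likelihood log sigmoid(w . diff).
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for x_win, x_lose in pairs:
            diff = [a - b for a, b in zip(x_win, x_lose)]
            p = sigmoid(sum(wi * di for wi, di in zip(w, diff)))
            for i in range(dim):
                w[i] += lr * (1.0 - p) * diff[i]
    return w


def prefers(w, x_a, x_b) -> bool:
    """Predict whether item a is preferred to item b under the learned utility."""
    return sum(wi * (a - b) for wi, a, b in zip(w, x_a, x_b)) > 0


if __name__ == "__main__":
    # Three hypothetical preferences over 2-dimensional feature vectors.
    observed = [([1, 0], [0, 0]), ([0, 0], [0, 1]), ([1, 0], [0, 1])]
    weights = fit_utility(observed, dim=2)
    # Query a pair that was never shown to the "expert".
    print(prefers(weights, [2, 0], [0, 2]))
```

Because the learned utility is a function of features rather than of individual items, it can rank pairs that were never compared, which is the sense in which generalization reduces the number of preferences needed. With noisy labels the same likelihood still applies, though more comparisons would typically be required.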