1,093 research outputs found
A survey of preference-based reinforcement learning methods
Reinforcement learning (RL) techniques optimize the accumulated long-term reward of a suitably chosen reward function. However, designing such a reward function often requires a lot of task-specific prior knowledge. The designer needs to consider different objectives that influence not only the learned behavior but also the learning progress. To alleviate these issues, preference-based reinforcement learning algorithms (PbRL) have been proposed that can directly learn from an expert's preferences instead of a hand-designed numeric reward. PbRL has gained traction in recent years due to its ability to resolve the reward shaping problem, its ability to learn from non-numeric rewards and the possibility to reduce the dependence on expert knowledge. We provide a unified framework for PbRL that describes the task formally and points out the different design principles that affect the evaluation task for the human as well as the computational complexity. The design principles include the type of feedback that is assumed, the representation that is learned to capture the preferences, the optimization problem that has to be solved as well as how the exploration/exploitation problem is tackled. Furthermore, we point out shortcomings of current algorithms, propose open research questions and briefly survey practical tasks that have been solved using PbRL.
Preference-Based Monte Carlo Tree Search
Monte Carlo tree search (MCTS) is a popular choice for solving sequential
anytime problems. However, it depends on a numeric feedback signal, which can
be difficult to define. Real-time MCTS is a variant which may only rarely
encounter states with an explicit, extrinsic reward. To deal with such cases,
the experimenter has to supply an additional numeric feedback signal in the
form of a heuristic, which intrinsically guides the agent. Recent work has
shown evidence that in different areas the underlying structure is ordinal and
not numerical. Hence erroneous and biased heuristics are inevitable, especially
in such domains. In this paper, we propose an MCTS variant which only depends on
qualitative feedback, and therefore opens up new applications for MCTS. We also
find indications that translating absolute into ordinal feedback may be
beneficial. Using a puzzle domain, we show that our preference-based MCTS
variant, which only receives qualitative feedback, is able to reach a
performance level comparable to a regular MCTS baseline, which obtains
quantitative feedback. Comment: To be publishe
Certified Reinforcement Learning with Logic Guidance
This paper proposes the first model-free Reinforcement Learning (RL)
framework to synthesise policies for unknown, and continuous-state Markov
Decision Processes (MDPs), such that a given linear temporal property is
satisfied. We convert the given property into a Limit Deterministic Büchi
Automaton (LDBA), namely a finite-state machine expressing the property.
Exploiting the structure of the LDBA, we shape a synchronous reward function
on-the-fly, so that an RL algorithm can synthesise a policy resulting in traces
that probabilistically satisfy the linear temporal property. This probability
(certificate) is also calculated in parallel with policy learning when the
state space of the MDP is finite: as such, the RL algorithm produces a policy
that is certified with respect to the property. Under the assumption of finite
state space, theoretical guarantees are provided on the convergence of the RL
algorithm to an optimal policy, maximising the above probability. We also show
that our method produces "best available" control policies when the logical
property cannot be satisfied. In the general case of a continuous state space,
we propose a neural network architecture for RL and we empirically show that
the algorithm finds satisfying policies, if there exist such policies. The
performance of the proposed framework is evaluated via a set of numerical
examples and benchmarks, where we observe an improvement of one order of
magnitude in the number of iterations required for the policy synthesis,
compared to existing approaches whenever available. Comment: This article draws from arXiv:1801.08099, arXiv:1809.0782
Efficient Model Learning for Human-Robot Collaborative Tasks
We present a framework for learning human user models from joint-action
demonstrations that enables the robot to compute a robust policy for a
collaborative task with a human. The learning takes place completely
automatically, without any human intervention. First, we describe the
clustering of demonstrated action sequences into different human types using an
unsupervised learning algorithm. These demonstrated sequences are also used by
the robot to learn a reward function that is representative for each type,
through the employment of an inverse reinforcement learning algorithm. The
learned model is then used as part of a Mixed Observability Markov Decision
Process formulation, wherein the human type is a partially observable variable.
With this framework, we can infer, either offline or online, the human type of
a new user that was not included in the training set, and can compute a policy
for the robot that will be aligned to the preference of this new user and will
be robust to deviations of the human actions from prior demonstrations. Finally,
we validate the approach using data collected in human subject experiments, and
conduct proof-of-concept demonstrations in which a person performs a
collaborative task with a small industrial robot.
Bayesian Generational Population-Based Training
Reinforcement learning (RL) offers the potential for training generally
capable agents that can interact autonomously in the real world. However, one
key limitation is the brittleness of RL algorithms to core hyperparameters and
network architecture choice. Furthermore, non-stationarities such as evolving
training data and increased agent complexity mean that different
hyperparameters and architectures may be optimal at different points of
training. This motivates AutoRL, a class of methods seeking to automate these
design choices. One prominent class of AutoRL methods is Population-Based
Training (PBT), which has led to impressive performance in several large-scale
settings. In this paper, we introduce two innovations in PBT-style methods.
First, we employ trust-region based Bayesian Optimization, enabling full
coverage of the high-dimensional mixed hyperparameter search space. Second, we
show that using a generational approach, we can also learn both architectures
and hyperparameters jointly on-the-fly in a single training run. Leveraging the
new highly parallelizable Brax physics engine, we show that these innovations
lead to large performance gains, significantly outperforming the tuned baseline
while learning entire configurations on the fly. Code is available at
https://github.com/xingchenwan/bgpbt. Comment: AutoML Conference 2022. 10 pages, 4 figures, 3 tables (28 pages, 10
figures, 7 tables including references and appendices)
An Ordinal Agent Framework
In this thesis, we introduce algorithms to solve ordinal multi-armed bandit problems, Monte-Carlo tree search, and reinforcement learning problems. With ordinal problems, an agent does not receive numerical rewards, but ordinal rewards that cope without any distance measure. For humans, it is often hard to define or to determine exact numerical feedback signals but simpler to come up with an ordering over possibilities. For instance, when looking at medical treatment, the ordering patient death < patient ill < patient cured is easy to come up with but it is hard to assign numerical values to them. As most state-of-the-art algorithms rely on numerical operations, they cannot be applied in the presence of ordinal rewards. We present a preference-based approach that applies dueling bandits to sequential decision problems and discuss its disadvantages in terms of sample efficiency and scalability. Following another idea, our final approach to identify optimal arms is based on the comparison of reward distributions using the Borda method. We test this approach on multi-armed bandits, extend it to Monte-Carlo tree search, and also apply it to reinforcement learning. To do so, we introduce a framework that encapsulates the similarities of the different problem definitions. We test our ordinal algorithms on frameworks like the General Video Game Framework (GVGAI), OpenAI, or synthetic data and compare them to ordinal, numerical, or domain-specific algorithms. Since the running time of our algorithms depends on the number of perceived ordinal rewards, we introduce a binning method that artificially reduces the number of rewards.
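The idea of comparing ordinal reward distributions with the Borda method can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the treatment labels, the counts, and the convention that a tie counts as half a win are all assumptions made for illustration. Each arm is summarised by how often each ordinal rank was observed (index 0 is the worst rank), and an arm's Borda score is its average probability of beating another arm.

```python
from itertools import product


def win_probability(counts_a, counts_b):
    """Estimate P(a sample from arm a beats a sample from arm b).

    counts_x[i] is how often ordinal rank i (higher is better) was observed.
    Ties count as half a win (an assumption of this sketch).
    """
    total_a, total_b = sum(counts_a), sum(counts_b)
    if total_a == 0 or total_b == 0:
        return 0.5  # no data for one arm: treat the comparison as undecided
    wins = 0.0
    for (i, ca), (j, cb) in product(enumerate(counts_a), enumerate(counts_b)):
        if i > j:
            wins += ca * cb
        elif i == j:
            wins += 0.5 * ca * cb
    return wins / (total_a * total_b)


def borda_scores(counts):
    """Average win probability of each arm against every other arm."""
    arms = list(counts)
    if len(arms) < 2:
        raise ValueError("need at least two arms to compare")
    return {
        a: sum(win_probability(counts[a], counts[b]) for b in arms if b != a)
        / (len(arms) - 1)
        for a in arms
    }


# Hypothetical observations; ranks: 0 = patient death, 1 = patient ill, 2 = patient cured.
observed = {
    "treatment_x": [1, 3, 6],
    "treatment_y": [2, 6, 2],
    "treatment_z": [5, 4, 1],
}
scores = borda_scores(observed)
best_arm = max(scores, key=scores.get)
print(scores, best_arm)
```

Only the order of the ranks enters the computation; no numerical distance between "ill" and "cured" is ever needed, which is the property the abstract emphasises.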
Goal-Directed Decision Making with Spiking Neurons.
Behavioral and neuroscientific data on reward-based decision making point to a fundamental distinction between habitual and goal-directed action selection. The formation of habits, which requires simple updating of cached values, has been studied in great detail, and the reward prediction error theory of dopamine function has enjoyed prominent success in accounting for its neural bases. In contrast, the neural circuit mechanisms of goal-directed decision making, requiring extended iterative computations to estimate values online, are still unknown. Here we present a spiking neural network that provably solves the difficult online value estimation problem underlying goal-directed decision making in a near-optimal way and reproduces behavioral as well as neurophysiological experimental data on tasks ranging from simple binary choice to sequential decision making. Our model uses local plasticity rules to learn the synaptic weights of a simple neural network to achieve optimal performance and solves one-step decision-making tasks, commonly considered in neuroeconomics, as well as more challenging sequential decision-making tasks within 1 s. These decision times, and their parametric dependence on task parameters, as well as the final choice probabilities match behavioral data, whereas the evolution of neural activities in the network closely mimics neural responses recorded in frontal cortices during the execution of such tasks. Our theory provides a principled framework to understand the neural underpinning of goal-directed decision making and makes novel predictions for sequential decision-making tasks with multiple rewards.
SIGNIFICANCE STATEMENT: Goal-directed actions requiring prospective planning pervade decision making, but their circuit-level mechanisms remain elusive. We show how a model circuit of biologically realistic spiking neurons can solve this computationally challenging problem in a novel way. The synaptic weights of our network can be learned using local plasticity rules such that its dynamics devise a near-optimal plan of action. By systematically comparing our model results to experimental data, we show that it reproduces behavioral decision times and choice probabilities as well as neural responses in a rich set of tasks. Our results thus offer the first biologically realistic account for complex goal-directed decision making at a computational, algorithmic, and implementational level.
This research was supported by the Swiss National Science Foundation (J.F., Grant PBBEP3 146112) and the Wellcome Trust (J.F. and M.L.). This is the author accepted manuscript. It is currently under an indefinite embargo pending publication by the Society for Neuroscience.
Abstractions in Reasoning for Long-Term Autonomy
The path to building adaptive, robust, intelligent agents has led researchers to develop a suite of powerful models and algorithms for agents with a single objective. However, in recent years, attempts to use this monolithic approach to solve an ever-expanding set of complex real-world problems, which increasingly include long-term autonomous deployments, have illuminated challenges in its ability to scale. Consequently, a fragmented collection of hierarchical and multi-objective models was developed. This trend continues into the algorithms as well, as each approximates an optimal solution in a different manner for scalability. These models and algorithms represent an attempt to solve pieces of an overarching problem: how can an agent explicitly model and integrate the necessary aspects of reasoning required to achieve long-term autonomy?
This thesis presents a general hierarchical and multi-objective model called a policy network that unifies prior fragmented solutions into a single graphical decision-making structure. Policy networks are broadly useful to solve numerous real-world problems. This thesis focuses on autonomous vehicle (AV) problems: (1) route-planning with multiple objectives; (2) semi-autonomy with proactive transfer of control; and (3) intersection decision-making for reasoning online about any number of other vehicles and pedestrians. Formal models are presented for each of the distinct problems. Solutions are evaluated using real-world map data in simulation and demonstrated on a fully operational AV prototype driving on real public roads. Policy networks serve as a shared underlying framework for all three, enabling their seamless integration as parts of an overall solution for rich, real-world, scalable decision-making in agents with long-term autonomy.