Ordered Preference Elicitation Strategies for Supporting Multi-Objective Decision Making
In multi-objective decision planning and learning, much attention is paid to
producing optimal solution sets that contain an optimal policy for every
possible user preference profile. We argue that the step that follows, i.e.,
determining which policy to execute by maximising the user's intrinsic utility
function over this (possibly infinite) set, is under-studied. This paper aims
to fill this gap. We build on previous work on Gaussian processes and pairwise
comparisons for preference modelling, extend it to the multi-objective decision
support scenario, and propose new ordered preference elicitation strategies
based on ranking and clustering. Our main contribution is an in-depth
evaluation of these strategies using computer and human-based experiments. We
show that our proposed elicitation strategies outperform the currently used
pairwise methods, and find that users prefer ranking most. Our experiments
further show that utilising monotonicity information in GPs, via a linear
prior mean at the start and virtual comparisons to the nadir and ideal points,
increases performance. We demonstrate our decision support framework in a
real-world study on traffic regulation, conducted with the city of Amsterdam.
Comment: AAMAS 2018, Source code at https://github.com/lmzintgraf/gp_pref_elici
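As a minimal sketch of pairwise preference modelling with a GP prior, the snippet below treats each elicited comparison as a noisy Gaussian observation of a utility difference (a crude stand-in for the probit likelihood common in GP preference learning), uses a linear prior mean to encode monotonicity, and adds virtual comparisons to the nadir and ideal points as the abstract suggests. All names and the toy data are illustrative, not taken from the paper's code.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between two sets of points."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def fit_preference_gp(X, comparisons, prior_w, noise=0.1, margin=1.0):
    """Posterior utility estimates from ordered comparisons.

    X           : (n, d) candidate policy value vectors.
    comparisons : list of (i, j) pairs meaning item i was preferred to j.
    prior_w     : linear prior-mean weights (encodes monotonicity in
                  each objective, as suggested in the abstract).
    Each comparison is treated as a noisy observation of the utility
    difference f(x_i) - f(x_j) ~ margin -- a crude Gaussian stand-in
    for the probit likelihood used in GP preference learning.
    """
    n, m = X.shape[0], len(comparisons)
    A = np.zeros((m, n))                      # difference operator
    for k, (i, j) in enumerate(comparisons):
        A[k, i], A[k, j] = 1.0, -1.0
    mu = X @ prior_w                          # linear prior mean
    K = rbf_kernel(X, X)
    S = A @ K @ A.T + noise * np.eye(m)       # observation covariance
    y = margin * np.ones(m) - A @ mu          # residual targets
    return mu + K @ A.T @ np.linalg.solve(S, y)

# Toy use: 4 policy value vectors over 2 objectives, plus "virtual"
# comparisons to the nadir (index 4) and ideal (index 5) points.
X = np.array([[1.0, 2.0], [2.0, 1.0], [1.5, 1.5], [0.5, 0.5],
              [0.0, 0.0],            # nadir: worse than everything
              [2.0, 2.0]])           # ideal: better than everything
comps = [(0, 1), (2, 1)]                       # elicited answers
comps += [(i, 4) for i in range(4)]            # each x preferred to nadir
comps += [(5, i) for i in range(4)]            # ideal preferred to each x
print(fit_preference_gp(X, comps, prior_w=np.array([0.5, 0.5])))
```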
Steering approaches to Pareto-optimal multiobjective reinforcement learning
For reinforcement learning tasks with multiple objectives, it may be advantageous to learn stochastic or non-stationary policies. This paper investigates two novel algorithms for learning non-stationary policies which produce Pareto-optimal behaviour (w-steering and Q-steering), by extending prior work based on the concept of geometric steering. Empirical results demonstrate that both new algorithms offer substantial performance improvements over stationary deterministic policies, while Q-steering significantly outperforms w-steering when the agent has no information about recurrent states within the environment. It is further demonstrated that Q-steering can be used interactively by providing a human decision-maker with a visualisation of the Pareto front and allowing them to adjust the agent’s target point during learning. To demonstrate broader applicability, the use of Q-steering in combination with function approximation is also illustrated on a task involving control of local battery storage for a residential solar power system.
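The abstract does not spell out the steering updates; the sketch below only illustrates the geometric idea behind such approaches: repeatedly follow whichever base policy moves the running average return vector closest to a target point (e.g. one chosen interactively on the Pareto front). All names are illustrative.

```python
import numpy as np

def steer(base_returns, target, episodes=1000):
    """Geometric-steering sketch: per episode, follow whichever base
    policy moves the running average return vector closest to `target`.

    base_returns : (k, d) expected return vector of each base policy.
    target       : (d,) desired point, e.g. picked on the Pareto front.
    Returns the achieved average return vector.
    """
    avg = np.zeros_like(target)
    for t in range(1, episodes + 1):
        # Hypothetical new averages if each base policy were run once.
        candidates = avg + (base_returns - avg) / t
        choice = np.argmin(np.linalg.norm(candidates - target, axis=1))
        avg = candidates[choice]
    return avg

# Two deterministic base policies whose returns bracket the target:
bases = np.array([[10.0, 0.0], [0.0, 10.0]])
print(steer(bases, target=np.array([6.0, 4.0])))   # approx. [6, 4]
```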
Revisiting Norm Optimization for Multi-Objective Black-Box Problems: A Finite-Time Analysis
The complexity of Pareto fronts imposes a great challenge on the convergence
analysis of multi-objective optimization methods. While most theoretical
convergence studies have addressed finite-set and/or discrete problems, others
have provided probabilistic guarantees, assumed a total order on the solutions,
or studied their asymptotic behaviour. In this paper, we revisit the
Tchebycheff weighted method in a hierarchical bandits setting and provide a
finite-time bound on the Pareto-compliant additive $\epsilon$-indicator. To the
best of our knowledge, this paper is one of few that establish a link between
weighted sum methods and quality indicators in finite time.
Comment: Submitted to the Journal of Global Optimization. This article's notation and terminology are based on arXiv:1612.0841
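For reference, the weighted Tchebycheff scalarization and the additive epsilon-indicator mentioned above are commonly defined as follows (generic notation, not necessarily the paper's):

```latex
% Weighted Tchebycheff scalarization of objectives f_1,...,f_m with
% weights w and ideal (reference) point z^*:
g^{\mathrm{tch}}(x \mid w, z^*) = \max_{1 \le i \le m} w_i \,\lvert f_i(x) - z_i^* \rvert

% Additive epsilon-indicator of an approximation set A with respect
% to a reference set B (minimization):
I_{\epsilon+}(A, B) = \max_{b \in B} \, \min_{a \in A} \, \max_{1 \le i \le m} \left( a_i - b_i \right)
```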
A Multi-Objective Deep Reinforcement Learning Framework
This paper introduces a new scalable multi-objective deep reinforcement
learning (MODRL) framework based on deep Q-networks. We develop a
high-performance MODRL framework that supports both single-policy and
multi-policy strategies, as well as both linear and non-linear approaches to
action selection. The experimental results on two benchmark problems
(two-objective deep sea treasure environment and three-objective Mountain Car
problem) indicate that the proposed framework is able to find the
Pareto-optimal solutions effectively. The proposed framework is generic and
highly modular, which allows different deep reinforcement learning algorithms
to be integrated across complex problem domains, thereby overcoming many of
the limitations of standard multi-objective reinforcement learning methods in
the current literature. The proposed framework acts as a testbed platform that
accelerates the development of MODRL for solving increasingly complicated
multi-objective problems.
Comment: 21 page
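To make the single-policy side concrete, here is a hedged sketch of linear versus non-linear action selection over vector-valued Q-estimates; the function and variable names are illustrative and not the framework's actual API.

```python
import numpy as np

def scalarized_action(q_vectors, weights, nonlinear=None):
    """Action selection over vector-valued Q-estimates (illustrative).

    q_vectors : (num_actions, num_objectives) Q(s, a) per objective.
    weights   : (num_objectives,) user preference weights.
    nonlinear : optional utility u(q_vec) -> float for non-linear
                action selection; plain linear scalarization if None.
    """
    if nonlinear is None:
        scores = q_vectors @ weights              # linear scalarization
    else:
        scores = np.array([nonlinear(q) for q in q_vectors])
    return int(np.argmax(scores))

q = np.array([[3.0, 1.0], [2.0, 2.5], [0.5, 3.0]])
w = np.array([0.7, 0.3])
ideal = np.array([3.0, 3.0])
print(scalarized_action(q, w))                                 # linear
print(scalarized_action(q, w,                                  # Tchebycheff
      nonlinear=lambda v: -np.max(w * np.abs(v - ideal))))
```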
Multi-objective reinforcement learning methods for action selection : dealing with multiple objectives and non-stationarity
Multi-objective decision-making entails planning based on a model of the environment to find the best policy. If this model is unknown, learning through interaction provides the means to behave in the environment. Multi-objective decision-making in a multi-agent system poses many unsolved challenges. Among them, multiple objectives and non-stationarity, caused by simultaneous learners, have so far been addressed separately. In this work, algorithms that address both issues by combining the strengths of different methods are proposed and applied to a route choice scenario formulated as a multi-armed bandit problem; the focus is therefore on action selection. In the route choice problem, drivers must select a route while aiming to minimize both their travel time and toll. The proposed algorithms take and combine important aspects of works that each tackle only one issue, non-stationarity or multiple objectives, making it possible to handle both together. The methods drawn from these works are a set of Upper Confidence Bound (UCB) algorithms and the Pareto Q-learning (PQL) algorithm. The UCB-based algorithms are Pareto UCB1 (PUCB1), discounted UCB (DUCB) and sliding-window UCB (SWUCB). PUCB1 deals with multiple objectives, while DUCB and SWUCB address non-stationarity in different ways. PUCB1 was extended to include characteristics from DUCB and SWUCB. Since PQL is a state-based method that handles more than one objective, it was modified to tackle a problem focused on action selection. Results from a comparison in a route choice scenario show that the proposed algorithms deal with both non-stationarity and multiple objectives, with the discount-factor variant performing best. Advantages, limitations and differences of these algorithms are discussed.
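As one concrete ingredient, the discounted UCB index (recent pulls weighted more heavily, so estimates track non-stationary rewards) can be sketched as below; this is a plain DUCB sketch, not the thesis's multi-objective extension, and all names are illustrative.

```python
import numpy as np

def ducb_indices(rewards_by_arm, gamma=0.99, c=1.0):
    """Discounted UCB (DUCB) index per arm.

    rewards_by_arm : list over arms; each entry is the time-ordered
                     list of (t, reward) pulls of that arm.
    Older pulls are down-weighted by gamma**(age), so both the mean
    and the effective pull count adapt to non-stationary rewards.
    """
    t_now = max(t for pulls in rewards_by_arm for (t, _) in pulls)
    N, means = [], []
    for pulls in rewards_by_arm:
        w = np.array([gamma ** (t_now - t) for (t, _) in pulls])
        r = np.array([rew for (_, rew) in pulls])
        N.append(w.sum())                       # discounted pull count
        means.append((w * r).sum() / w.sum())   # discounted mean reward
    n = sum(N)
    return [m + c * np.sqrt(np.log(n) / Ni) for m, Ni in zip(means, N)]

# Arm 0 used to pay well but decayed; arm 1 recently improved.
arm0 = [(1, 1.0), (2, 1.0), (3, 0.0), (4, 0.0)]
arm1 = [(5, 0.2), (6, 0.9), (7, 1.0)]
print(ducb_indices([arm0, arm1], gamma=0.9))
```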
Softmax exploration strategies for multiobjective reinforcement learning
Despite growing interest over recent years in applying reinforcement learning to multiobjective problems, there has been little research into the applicability and effectiveness of exploration strategies within the multiobjective context. This work considers several widely-used approaches to exploration from the single-objective reinforcement learning literature, and examines their incorporation into multiobjective Q-learning. In particular, this paper proposes two novel approaches which extend the softmax operator to work with vector-valued rewards. The performance of these exploration strategies is evaluated across a set of benchmark environments. Issues arising from the multiobjective formulation of these benchmarks which impact on the performance of the exploration strategies are identified. It is shown that of the techniques considered, the combination of the novel softmax–epsilon exploration with optimistic initialisation provides the most effective trade-off between exploration and exploitation.
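The paper defines its own operators; as a generic sketch, softmax exploration over scalarized vector-valued Q-estimates with an epsilon-uniform floor might look like this (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_epsilon(q_vectors, weights, tau=0.5, eps=0.1):
    """Softmax-epsilon exploration over vector-valued Q-estimates.

    Scalarize each action's Q-vector, form a Boltzmann distribution
    at temperature tau, then mix in an eps-uniform floor so every
    action keeps non-zero selection probability.
    """
    scores = q_vectors @ weights
    z = np.exp((scores - scores.max()) / tau)   # numerically stable softmax
    p = z / z.sum()
    p = (1 - eps) * p + eps / len(p)
    return rng.choice(len(p), p=p)

q = np.array([[3.0, 1.0], [2.0, 2.5], [0.5, 3.0]])
print(softmax_epsilon(q, np.array([0.5, 0.5])))
```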
Recommended from our members
Cost Efficient Distributed Load Frequency Control in Power Systems
The introduction of new technologies and the increased penetration of renewable resources are altering the power distribution landscape, which now includes a larger number of micro-generators. The centralized strategies currently employed for performing frequency control in a cost-efficient way need to be revisited and decentralized to conform with the increase of distributed generation in the grid. In this paper, the use of Multi-Agent and Multi-Objective Reinforcement Learning techniques to train models that perform cost-efficient frequency control through decentralized decision making is proposed. More specifically, we cast the frequency control problem as a Markov Decision Process, propose the use of reward composition and action composition multi-objective techniques, and compare the results between the two. Reward composition is achieved by increasing the dimensionality of the reward function, while action composition is achieved through a linear combination of the actions produced by multiple single-objective models. The proposed framework is validated by comparing the observed dynamics with the acceptable limits enforced in the industry and with the cost-optimal setups.
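A hedged sketch of the action-composition step as described, blending the continuous actions proposed by single-objective models with trade-off weights (the controllers themselves are not shown; all names are illustrative):

```python
import numpy as np

def compose_action(per_objective_actions, weights):
    """Action composition: linearly blend the continuous actions
    proposed by single-objective models.

    per_objective_actions : (k, action_dim) one action per objective.
    weights               : (k,) trade-off weights, normalized to sum to 1.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return weights @ np.asarray(per_objective_actions)

# e.g. a frequency-tracking model and a cost-minimising model each
# proposing a generator setpoint adjustment:
print(compose_action([[0.8], [0.1]], weights=[0.6, 0.4]))  # -> [0.52]
```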
Distributional Multi-Objective Decision Making
For effective decision support in scenarios with conflicting objectives, sets
of potentially optimal solutions can be presented to the decision maker. We
explore both what policies these sets should contain and how such sets can be
computed efficiently. With this in mind, we take a distributional approach and
introduce a novel dominance criterion relating return distributions of policies
directly. Based on this criterion, we present the distributional undominated
set and show that it contains optimal policies otherwise ignored by the Pareto
front. In addition, we propose the convex distributional undominated set and
prove that it comprises all policies that maximise expected utility for
multivariate risk-averse decision makers. We propose a novel algorithm to learn
the distributional undominated set and further contribute pruning operators to
reduce the set to the convex distributional undominated set. Through
experiments, we demonstrate the feasibility and effectiveness of these methods,
making this a valuable new approach for decision support in real-world
problems.
Comment: Accepted at IJCAI 202
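The abstract does not define the new dominance criterion; for contrast, a standard way to relate two return distributions directly is first-order stochastic dominance, sketched below for scalar empirical returns (the paper's criterion is more general):

```python
import numpy as np

def fosd(returns_a, returns_b):
    """First-order stochastic dominance on empirical returns:
    A dominates B iff A's CDF lies at or below B's everywhere, i.e.
    A puts no more mass on low returns, with strict gap somewhere.
    Scalar returns only; the paper's criterion goes beyond this.
    """
    grid = np.union1d(returns_a, returns_b)
    cdf_a = np.searchsorted(np.sort(returns_a), grid, side="right") / len(returns_a)
    cdf_b = np.searchsorted(np.sort(returns_b), grid, side="right") / len(returns_b)
    return bool(np.all(cdf_a <= cdf_b)) and bool(np.any(cdf_a < cdf_b))

print(fosd(np.array([2.0, 3.0, 4.0]), np.array([1.0, 2.0, 3.0])))  # True
```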
Sample-Efficient Multi-Objective Learning via Generalized Policy Improvement Prioritization
Multi-objective reinforcement learning (MORL) algorithms tackle sequential
decision problems where agents may have different preferences over (possibly
conflicting) reward functions. Such algorithms often learn a set of policies
(each optimized for a particular agent preference) that can later be used to
solve problems with novel preferences. We introduce a novel algorithm that uses
Generalized Policy Improvement (GPI) to define principled, formally-derived
prioritization schemes that improve sample-efficient learning. They implement
active-learning strategies by which the agent can (i) identify the most
promising preferences/objectives to train on at each moment, to more rapidly
solve a given MORL problem; and (ii) identify which previous experiences are
most relevant when learning a policy for a particular agent preference, via a
novel Dyna-style MORL method. We prove our algorithm is guaranteed to always
converge to an optimal solution in a finite number of steps, or an
$\epsilon$-optimal solution (for a bounded $\epsilon$) if the agent is limited
and can only identify possibly sub-optimal policies. We also prove that our
method monotonically improves the quality of its partial solutions while
learning. Finally, we introduce a bound that characterizes the maximum utility
loss (with respect to the optimal solution) incurred by the partial solutions
computed by our method throughout learning. We empirically show that our method
outperforms state-of-the-art MORL algorithms in challenging multi-objective
tasks, both with discrete and continuous state and action spaces.
Comment: Accepted to AAMAS 202
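For context, Generalized Policy Improvement over successor features picks, at each state, the action that is best under any previously learned policy for the current preference vector; below is a minimal sketch of standard GPI, not the paper's prioritization scheme itself (names illustrative).

```python
import numpy as np

def gpi_action(psi_set, w):
    """GPI action selection over successor features.

    psi_set : (num_policies, num_actions, num_features) successor
              features psi^pi(s, a) at the current state s.
    w       : (num_features,) preference/reward weights, so that
              Q^pi(s, a) = psi^pi(s, a) . w
    Acts greedily w.r.t. the best previously-learned policy under w.
    """
    q = psi_set @ w                       # (num_policies, num_actions)
    return int(np.argmax(q.max(axis=0)))  # best action over all policies

# Two policies, each strong on a different objective:
psi = np.array([[[5.0, 0.0], [4.0, 1.0]],
                [[0.0, 5.0], [1.0, 4.0]]])
print(gpi_action(psi, np.array([0.2, 0.8])))
```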