Selfishness Level Induces Cooperation in Sequential Social Dilemmas
A key contributor to the success of modern societies is humanity's innate ability to cooperate meaningfully. Modern game-theoretic reasoning shows, however, that an individual's amenability to cooperation is directly linked to the mechanics of the scenario at hand. Social dilemmas constitute a subset of particularly thorny such scenarios, typically modelled as normal-form or sequential games, in which players face a dichotomy between cooperating with teammates and defecting to further their own goals. In this work, we study such social dilemmas through the lens of the 'selfishness level', a standard game-theoretic metric which quantifies the extent to which a game's payoffs incentivize defective behaviours. The selfishness level is significant in this context because it doubles as a prescriptive notion, describing the exact payoff modifications necessary to induce prosocial preferences in players. Using this framework, we derive conditions, and means, under which normal-form social dilemmas can be resolved. We also take a first step towards extending this metric to Markov-game, or sequential, social dilemmas, with the aim of quantitatively measuring the degree to which such environments incentivize selfish behaviours. Finally, we present an exploratory empirical analysis showing the positive effects of using a selfishness-level-directed reward-shaping scheme in such environments.
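To make the metric concrete (a minimal sketch, not the paper's own code), the selfishness level in the Apt–Schäfer sense is the smallest α ≥ 0 such that adding α times the social welfare to every player's payoff makes some socially optimal outcome a Nash equilibrium. A grid search over α for a standard Prisoner's Dilemma; the payoff values are illustrative:

```python
import numpy as np

def is_nash(p0, p1, cell):
    """Check that neither player can profit by deviating from `cell`."""
    a, b = cell
    return p0[a, b] >= p0[:, b].max() and p1[a, b] >= p1[a, :].max()

def selfishness_level(p0, p1, alphas=np.arange(0, 10.25, 0.25)):
    """Smallest alpha on the grid making a social optimum a Nash equilibrium."""
    sw = p0 + p1                                   # social welfare per outcome
    optima = [tuple(c) for c in np.argwhere(sw == sw.max())]
    for alpha in alphas:
        m0, m1 = p0 + alpha * sw, p1 + alpha * sw  # prescriptive payoff shift
        if any(is_nash(m0, m1, c) for c in optima):
            return float(alpha)
    return None  # above the searched range

# Prisoner's Dilemma, rows/cols: C=0, D=1; T=5 > R=3 > P=1 > S=0.
p0 = np.array([[3, 0], [5, 1]])
p1 = p0.T
print(selfishness_level(p0, p1))  # -> 2.0
```

At α = 2 the cooperative outcome's modified payoff (3 + 2·6 = 15) exactly matches the defection temptation (5 + 2·5 = 15), so (C, C) becomes an equilibrium.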
Resolving social dilemmas with minimal reward transfer
Multi-agent cooperation is an important topic, and is particularly
challenging in mixed-motive situations where it does not pay to be nice to
others. Consequently, self-interested agents often avoid collective behaviour,
resulting in suboptimal outcomes for the group. In response, in this paper we
introduce a metric to quantify the disparity between what is rational for
individual agents and what is rational for the group, which we call the general
self-interest level. This metric represents the maximum proportion of
individual rewards that all agents can retain while ensuring that achieving
the social welfare optimum becomes a dominant strategy. By aligning the individual
and group incentives, rational agents acting to maximise their own reward will
simultaneously maximise the collective reward. As agents transfer their rewards
to motivate others to consider their welfare, we diverge from traditional
concepts of altruism or prosocial behaviours. The general self-interest level
is a property of a game that is useful for assessing the propensity of players
to cooperate and understanding how features of a game impact this. We
illustrate the effectiveness of our method on several novel game
representations of social dilemmas with arbitrary numbers of players.
Comment: 34 pages, 13 tables, submitted to the Journal of Autonomous Agents
and Multi-Agent Systems: Special Issue on Citizen-Centric AI System
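A minimal sketch of the idea (not the authors' implementation): for a symmetric two-player game under an equal-transfer scheme, where each agent keeps a fraction s of its own reward and transfers the rest to the other (r_i' = s·r_i + (1−s)·r_j), search for the largest s at which cooperation is weakly dominant. The scheme and payoffs are illustrative assumptions; exact rational arithmetic avoids floating-point ties:

```python
from fractions import Fraction as F

def transferred(p0, p1, s):
    """Equal-transfer scheme for two players: r_i' = s*r_i + (1-s)*r_j."""
    t0 = [[s * p0[a][b] + (1 - s) * p1[a][b] for b in range(2)] for a in range(2)]
    t1 = [[s * p1[a][b] + (1 - s) * p0[a][b] for b in range(2)] for a in range(2)]
    return t0, t1

def coop_dominant(p0, p1, c=0):
    """Cooperation (index c) weakly dominates for both players."""
    ok0 = all(p0[c][b] >= max(p0[a][b] for a in range(2)) for b in range(2))
    ok1 = all(p1[a][c] >= max(p1[a][b] for b in range(2)) for a in range(2))
    return ok0 and ok1

def general_self_interest_level(p0, p1, step=F(1, 100)):
    """Largest retained fraction s making cooperation dominant after transfer."""
    s = F(1)
    while s >= 0:                      # scan from full self-interest downwards
        if coop_dominant(*transferred(p0, p1, s)):
            return s
        s -= step
    return None

# Prisoner's Dilemma payoffs, C=0, D=1 (T=5 > R=3 > P=1 > S=0).
p0 = [[3, 0], [5, 1]]   # row player
p1 = [[3, 5], [0, 1]]   # column player (transpose)
print(general_self_interest_level(p0, p1))  # -> 3/5
```

Here agents can keep 60% of their own reward: above that, defecting against a cooperator (5s) still beats mutual cooperation (3).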
An End-to-End Task Allocation Framework for Autonomous Mobile Systems
This work aims to unravel the problem of task allocation and planning for multi-agent systems, with a particular interest in promoting adaptability. We propose a novel end-to-end task allocation framework employing reinforcement learning methods to replace the handcrafted heuristics used in previous works. The proposed framework achieves high adaptability while also attaining competitive results, advantages that stem from learning directly from environmental feedback. The system's objectives are adjustable and respond intuitively to the reward design. The framework is validated in a set of tests with various parameter settings, in which both its adaptability and its performance are demonstrated.
Estimating α-Rank from A Few Entries with Low Rank Matrix Completion
Multi-agent evaluation aims to assess an agent's strategy on the basis of its interactions with others. Existing methods such as α-rank and its approximations typically still require exhaustively comparing all pairs of joint strategies for an accurate ranking, which in practice is computationally expensive. In this paper, we aim to reduce the number of pairwise comparisons needed to recover a satisfactory ranking of n strategies in two-player meta-games, exploiting the fact that agents with similar skills tend to achieve similar payoffs against others. Two situations are considered: one in which we can obtain the true payoffs, and one in which we can only access noisy payoffs. Based on these formulations, we leverage low-rank matrix completion and design two novel algorithms for noise-free and noisy evaluations, respectively. For both settings, we show that O(nr log n) payoff entries (where n is the number of agents and r is the rank of the payoff matrix) suffice for sufficiently accurate strategy evaluation. Empirical results on three synthetic games and twelve real-world games demonstrate that strategy evaluation from a few entries can achieve performance comparable to algorithms with full knowledge of the payoff matrix.
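As a hedged illustration of the completion step (not the paper's algorithms), a simple iterative SVD hard-thresholding scheme can recover a synthetic low-rank "payoff" matrix from a random subset of its entries; the sampling rate, sizes, and iteration count here are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 30, 2
# Ground-truth rank-r matrix standing in for a meta-game payoff table.
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))

# Observe a random ~35% of entries (on the order of n*r*log n of them).
mask = rng.random((n, n)) < 0.35

def complete(observed, mask, rank, iters=2000):
    """Fill unknowns, project to the best rank-r approximation, repeat."""
    X = np.where(mask, observed, 0.0)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        X = (U[:, :rank] * s[:rank]) @ Vt[:rank]  # rank-r projection
        X[mask] = observed[mask]                  # re-impose known entries
    return X

M_hat = complete(M, mask, r)
err = np.linalg.norm(M_hat - M) / np.linalg.norm(M)
print(f"relative error: {err:.2e}")
```

With only a third of the entries observed, the rank-2 structure pins down the rest of the matrix to small relative error.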
Zero-shot Preference Learning for Offline RL via Optimal Transport
Preference-based Reinforcement Learning (PbRL) has demonstrated remarkable
efficacy in aligning rewards with human intentions. However, a significant
challenge lies in the need for substantial human labels, which are costly and
time-consuming. Additionally, the expensive preference data obtained from prior
tasks is not typically reusable for subsequent task learning, leading to
extensive labeling for each new task. In this paper, we propose a novel
zero-shot preference-based RL algorithm that leverages labeled preference data
from source tasks to infer labels for target tasks, eliminating the requirement
for human queries. Our approach utilizes Gromov-Wasserstein distance to align
trajectory distributions between source and target tasks. The solved optimal
transport matrix serves as a correspondence between trajectories of two tasks,
making it possible to identify corresponding trajectory pairs between tasks and
transfer the preference labels. However, learning directly from inferred labels
that contain a fraction of noisy labels will result in an inaccurate reward
function, subsequently affecting policy performance. To this end, we introduce
Robust Preference Transformer, which models the rewards as Gaussian
distributions and incorporates reward uncertainty in addition to reward mean.
The empirical results on robotic manipulation tasks of Meta-World and Robomimic
show that our method has strong capabilities of transferring preferences
between tasks and learns reward functions from noisy labels robustly.
Furthermore, we reveal that our method attains near-oracle performance with a
small proportion of scripted labels.
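To make the label-transfer step concrete (a toy sketch, not the paper's code): given a transport coupling T between source and target trajectories, which the paper obtains by solving a Gromov-Wasserstein problem (e.g. via the POT library's `ot.gromov.gromov_wasserstein`), each target trajectory can inherit the preference label of its most strongly coupled source trajectory. The coupling below is hand-made for illustration:

```python
import numpy as np

source_labels = np.array([1, 0, 1, 0])   # 1 = preferred trajectory segment

# Stand-in coupling matrix (rows = source, cols = target); in practice it
# comes from solving GW between the two tasks' trajectory distributions.
T = np.array([[0.4, 0.0, 0.1, 0.0],
              [0.0, 0.3, 0.0, 0.0],
              [0.1, 0.0, 0.5, 0.1],
              [0.0, 0.2, 0.0, 0.3]])
T /= T.sum()  # normalise to a joint distribution

def transfer_labels(T, source_labels):
    """Label each target trajectory by its most strongly coupled source."""
    match = T.argmax(axis=0)   # best source index for each target column
    return source_labels[match]

print(transfer_labels(T, source_labels))  # -> [1 0 1 0]
```

The noise the abstract refers to arises exactly here: when the coupling matches a target trajectory to a source with a misleading label, the inferred label is wrong, which is what the reward-uncertainty model is meant to absorb.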
Learning to Identify Top Elo Ratings: A Dueling Bandits Approach
The Elo rating system is widely adopted to evaluate the skills of (chess) game and sports players. Recently, it has also been integrated into machine learning algorithms for evaluating the performance of computerised AI agents. However, an accurate estimate of the Elo rating (for the top players) often requires many rounds of competition, which can be expensive to carry out. In this paper, to improve the sample efficiency of Elo evaluation (for top players), we propose an efficient online match-scheduling algorithm. Specifically, we identify and match the top players through a dueling-bandits framework and tailor the bandit algorithm to the gradient-based update of Elo. We show that it reduces the per-step memory and time complexity to constant, compared to traditional likelihood-maximization approaches requiring O(t) time. Our algorithm has a regret guarantee of Õ(√T), sublinear in the number of competition rounds, and has been extended to multidimensional Elo ratings for handling intransitive games. We empirically demonstrate that our method achieves superior convergence speed and time efficiency on a variety of gaming tasks.
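The gradient-based Elo update the abstract builds on can be sketched in a few lines: each match moves both ratings by K times the prediction error, in O(1) time and memory per match. The K-factor of 32 and the ratings are illustrative:

```python
def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo (logistic) model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, k=32):
    """score_a: 1 if A wins, 0 if A loses, 0.5 for a draw."""
    delta = k * (score_a - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta   # zero-sum: total rating is conserved

r_a, r_b = 1600.0, 1400.0
print(expected_score(r_a, r_b))   # ~0.76: the 200-point favourite usually wins
print(elo_update(r_a, r_b, 0))    # upset loss: A drops ~24 points, B gains them
```

Because only the two current ratings and one expected-score evaluation are needed per match, scheduling matches adaptively (as the dueling-bandits framework does) is where the remaining sample-efficiency gains come from.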
- …