MIR2: Towards Provably Robust Multi-Agent Reinforcement Learning by Mutual Information Regularization

Authors: Feng Pu, Guo Jun, Li Simin, Liu Aishan, Liu Xianglong, Lv Weifeng, Wang Jiakai, Xu Ruixiao, Yang Yaodong
Publication date: 31/10/2023

Robust multi-agent reinforcement learning (MARL) necessitates resilience to uncertain or worst-case actions by unknown allies. Existing max-min optimization techniques in robust MARL seek to enhance resilience by training agents against worst-case adversaries, but this becomes intractable as the number of agents grows, because the number of worst-case scenarios increases exponentially. Attempts to simplify this complexity often yield overly pessimistic policies, inadequate robustness across scenarios, and high computational demands. Unlike these approaches, humans naturally learn adaptive and resilient behaviors without needing to prepare for every conceivable worst-case scenario. Motivated by this, we propose MIR2, which trains policies in routine scenarios and minimizes Mutual Information as Robust Regularization. Theoretically, we frame robustness as an inference problem and prove that minimizing mutual information between histories and actions implicitly maximizes a lower bound on robustness under certain assumptions. Further analysis reveals that our proposed approach prevents agents from overreacting to others through an information bottleneck and aligns the policy with a robust action prior. Empirically, MIR2 displays even greater resilience against worst-case adversaries than max-min optimization in StarCraft II, Multi-agent Mujoco, and rendezvous. This advantage holds when MIR2 is deployed in a challenging real-world robot swarm control scenario. See code and demo videos in Supplementary Materials.
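
As a rough illustration of the quantity this abstract regularizes, the sketch below computes the mutual information I(H; A) between a discrete history variable and a discrete action variable from a joint probability table. The function name `mutual_information` and both example tables are hypothetical and chosen only for illustration; the abstract does not say how the paper estimates or minimizes this term, so this is not the authors' implementation.

```python
import math


def mutual_information(joint):
    """Return I(H; A) in nats for a joint table joint[h][a] that sums to 1.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    p_h = [sum(row) for row in joint]        # marginal over histories
    p_a = [sum(col) for col in zip(*joint)]  # marginal over actions
    mi = 0.0
    for h, row in enumerate(joint):
        for a, p in enumerate(row):
            if p > 0.0:  # zero-probability cells contribute nothing
                mi += p * math.log(p / (p_h[h] * p_a[a]))
    return mi


# Actions chosen independently of the history: no information shared.
independent = [[0.25, 0.25], [0.25, 0.25]]
# Action fully determined by the history: maximal dependence for two outcomes.
deterministic = [[0.5, 0.0], [0.0, 0.5]]

print(mutual_information(independent))
print(mutual_information(deterministic))
```

In this toy setting, a policy whose actions ignore the history has zero mutual information, while a policy whose action is fully determined by the history attains log 2 nats. A regularizer that penalizes this quantity pushes the policy toward the former, which is one way to read the abstract's remark about agents not overreacting to others.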

arXiv.org e-Print Archive

ES-ENAS: Blackbox Optimization over Hybrid Spaces via Combinatorial and Continuous Evolution

Authors: Choromanski Krzysztof, Gao Wenbo, Jain Deepali, Pacchiano Aldo, Parker-Holder Jack, Peng Daiyi, Sarlos Tamas, Song Xingyou, Tang Yunhao, Yang Yuxiang, Zhang Qiuyi
Publication date: 06/12/2021

We consider the problem of efficient blackbox optimization over a large hybrid search space, consisting of a mixture of a high dimensional continuous space and a complex combinatorial space. Such problems arise commonly in evolutionary computation and, more recently, in neuroevolution and architecture search for Reinforcement Learning (RL) policies. Unfortunately, previous mutation-based approaches suffer in high dimensional continuous spaces both theoretically and practically. We thus instead propose ES-ENAS, a simple joint optimization procedure that combines Evolutionary Strategies (ES) and combinatorial optimization techniques in a highly scalable and intuitive way, inspired by the one-shot or supernet paradigm introduced in Efficient Neural Architecture Search (ENAS). Through this relatively simple marriage between two different lines of research, we are able to gain the best of both worlds, and empirically demonstrate our approach by optimizing BBOB functions over hybrid spaces as well as combinatorial neural network architectures via edge pruning and quantization on popular RL benchmarks. Due to the modularity of the algorithm, we are also able to incorporate a wide variety of popular techniques, ranging from different continuous and combinatorial optimizers to constrained optimization. Comment: 22 pages. See https://github.com/google-research/google-research/tree/master/es_enas for associated code.
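
For readers unfamiliar with the continuous half of this approach, the following is a minimal sketch of a generic ES-style update with antithetic Gaussian perturbations, applied to a toy sphere function. It covers only the continuous part and leaves out the combinatorial controller entirely. The names `sphere` and `es_minimize` and all hyperparameter values are assumptions made for illustration, not details from the paper or its repository.

```python
import random


def sphere(x):
    """Toy blackbox objective to minimize (assumed for illustration)."""
    return sum(v * v for v in x)


def es_minimize(f, x0, sigma=0.1, lr=0.1, num_pairs=20, steps=100, seed=0):
    """Minimize f with a basic antithetic evolution-strategies gradient estimate.

    Hypothetical helper: a generic ES update, not the ES-ENAS algorithm.
    """
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        grad = [0.0] * len(x)
        for _ in range(num_pairs):
            # Sample one Gaussian direction and evaluate f on both sides of x.
            eps = [rng.gauss(0.0, 1.0) for _ in x]
            plus = f([xi + sigma * e for xi, e in zip(x, eps)])
            minus = f([xi - sigma * e for xi, e in zip(x, eps)])
            # Finite-difference weight for this direction, averaged over pairs.
            scale = (plus - minus) / (2.0 * sigma * num_pairs)
            for i, e in enumerate(eps):
                grad[i] += scale * e
        # Gradient-descent step using only function evaluations.
        x = [xi - lr * g for xi, g in zip(x, grad)]
    return x


start = [1.0] * 5
result = es_minimize(sphere, start)
print(sphere(start), sphere(result))
```

The update needs only function evaluations, which is why it suits blackbox objectives. According to the abstract, ES-ENAS pairs updates of this kind with a combinatorial optimizer over the discrete part of the search space; that coupling is not shown here.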

arXiv.org e-Print Archive

Kernelized Offline Contextual Dueling Bandits

Authors: Das Vikramjeet, Lin Sen, Mehta Viraj, Neiswanger Willie, Neopane Ojash, Schneider Jeff
Publication date: 20/07/2023

Preference-based feedback is important for many applications where direct evaluation of a reward function is not feasible. A notable recent example arises in reinforcement learning from human feedback on large language models. For many of these applications, the cost of acquiring human feedback can be substantial or even prohibitive. In this work, we take advantage of the fact that the agent can often choose the contexts at which to obtain human feedback in order to identify a good policy most efficiently, and we introduce the offline contextual dueling bandit setting. We give an upper-confidence-bound-style algorithm for this setting and prove a regret bound. We also give empirical confirmation that this method outperforms a similar strategy that uses uniformly sampled contexts.

arXiv.org e-Print Archive