TempLe: Learning Template of Transitions for Sample Efficient Multi-task RL
Transferring knowledge among various environments is important to efficiently
learn multiple tasks online. Most existing methods directly use the previously
learned models or previously learned optimal policies to learn new tasks.
However, these methods may be inefficient when the underlying models or optimal
policies are substantially different across tasks. In this paper, we propose
Template Learning (TempLe), the first PAC-MDP method for multi-task
reinforcement learning that can be applied to tasks with varying state/action
spaces. TempLe generates transition dynamics templates, abstractions of the
transition dynamics across tasks, to gain sample efficiency by extracting
similarities between tasks even when their underlying models or optimal
policies have limited commonalities. We present two algorithms for an "online"
and a "finite-model" setting respectively. We prove that our proposed TempLe
algorithms achieve much lower sample complexity than single-task learners or
state-of-the-art multi-task methods. We show via systematically designed
experiments that our TempLe method universally outperforms the state-of-the-art
multi-task methods (PAC-MDP or not) in various settings and regimes.
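To make the template idea concrete, here is a minimal Python sketch of one plausible pooling mechanism: (state, action) pairs whose empirical transition distributions are close are merged into a single template, so their visit counts are shared. The L1 tolerance, the function names, and the fixed next-state dimension are our illustrative assumptions, not the paper's algorithm.

```python
# Illustrative sketch only; not the authors' PAC-MDP construction.
import numpy as np

def build_templates(counts, tol=0.1):
    """counts: list of visit-count vectors over next states, one vector per
    (task, state, action) pair (assumed same length here for simplicity).
    Returns a template id per pair and the pooled counts per template."""
    reps, pooled, assignment = [], [], []
    for c in counts:
        p = c / max(c.sum(), 1)                   # empirical transition dist.
        for tid, q in enumerate(reps):
            if np.abs(p - q).sum() <= tol:        # close dynamics -> same template
                pooled[tid] += c                  # pool samples across tasks
                reps[tid] = pooled[tid] / pooled[tid].sum()
                assignment.append(tid)
                break
        else:                                     # no close template: start one
            reps.append(p)
            pooled.append(c.astype(float))
            assignment.append(len(reps) - 1)
    return assignment, pooled
```

Pooling counts this way is where the sample-efficiency gain would come from: each template is estimated from the union of samples of all pairs mapped to it.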
Safe and Robust Multi-Agent Reinforcement Learning for Connected Autonomous Vehicles under State Perturbations
Sensing and communication technologies have enhanced learning-based decision
making methodologies for multi-agent systems such as connected autonomous
vehicles (CAVs). However, most existing safe reinforcement learning-based
methods assume accurate state information. It remains challenging to satisfy
safety requirements under state uncertainty for CAVs, given noisy sensor
measurements and vulnerable communication channels. In this
work, we propose a Robust Multi-Agent Proximal Policy Optimization with robust
Safety Shield (SR-MAPPO) for CAVs in various driving scenarios. Our approach
combines a robust MARL algorithm with a control barrier function (CBF)-based
safety shield to cope with perturbed or uncertain state inputs. The former
trains the robust policy with a worst-case Q-function regularization module
that pursues a higher lower-bounded reward, whereas the latter, the robust CBF
safety shield, enforces the CAVs' collision-free constraints in complicated
driving scenarios even under perturbed vehicle state information. We validate
the advantages of SR-MAPPO in robustness and safety
and compare it with baselines under different driving and state perturbation
scenarios in the CARLA simulator. The SR-MAPPO policy is verified to maintain
higher safety rates and efficiency (reward) when threatened by both state
perturbations and unconnected vehicles' dangerous behaviors.
Comment: 6 pages, 5 figures
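For intuition about the shield component, here is a hedged sketch of a generic CBF-style safety filter: keep the RL action when the discrete-time barrier condition h(f(x, a)) >= (1 - gamma) * h(x) holds, otherwise search for the nearest action that satisfies it. The dynamics model f, barrier h, and sampling-based projection are our illustrative assumptions; practical shields typically solve a small quadratic program instead.

```python
# Generic CBF safety filter sketch; not SR-MAPPO's implementation.
import numpy as np

def cbf_shield(x, a_rl, h, f, gamma=0.2, n_candidates=64, scale=1.0):
    """x: state; a_rl: action proposed by the MARL policy;
    h: barrier function (h >= 0 on the safe set); f: one-step dynamics model."""
    if h(f(x, a_rl)) >= (1.0 - gamma) * h(x):
        return a_rl                                # proposed action is safe
    best, best_dist = a_rl, np.inf
    for _ in range(n_candidates):                  # crude sampling-based
        a = a_rl + scale * np.random.randn(*np.shape(a_rl))  # projection
        if h(f(x, a)) >= (1.0 - gamma) * h(x):
            d = np.linalg.norm(a - a_rl)
            if d < best_dist:                      # keep the closest safe action
                best, best_dist = a, d
    return best
```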
Rethinking Adversarial Policies: A Generalized Attack Formulation and Provable Defense in Multi-Agent RL
Most existing works consider direct perturbations of the victim's state/action or
the underlying transition dynamics to show vulnerability of reinforcement
learning agents under adversarial attacks. However, such direct manipulation
may not always be feasible in practice. In this paper, we consider another
common and realistic attack setup: in a multi-agent RL setting with
well-trained agents, at deployment time the victim agent is
exploited by an attacker who controls another agent to act
adversarially against the victim using an \textit{adversarial policy}. Prior
attack models under such a setup do not account for the possibility that the
attacker can face resistance and thus take only partial control of the agent;
they also introduce perceivable ``abnormal'' behaviors that are easily
detectable. A provable defense against these adversarial policies is also
lacking. To resolve these issues, we introduce a more general attack
formulation that models to what extent the adversary is able to control the
agent to produce the adversarial policy. Based on such a generalized attack
framework, the attacker can also regulate the state distribution shift caused
by the attack through an attack budget, and thus produce stealthy adversarial
policies that can exploit the victim agent. Furthermore, we provide the first
provably robust defense, with a convergence guarantee to the most robust victim
policy via adversarial training with timescale separation, in sharp contrast to
adversarial training in supervised learning, which may only provide {\it
empirical} defenses.
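A minimal illustration of the "partial control" formulation described above, with parameter names of our own choosing: the adversary overrides the controlled agent's action only with probability alpha, optionally capped by a per-episode budget, rather than assuming full control.

```python
# Illustrative attack-model sketch; alpha and budget are hypothetical names.
import random

def partially_controlled_action(agent_policy, adversarial_policy, obs,
                                alpha=0.3, budget=None, used=0):
    """Returns (action, used): the executed action and the updated count of
    adversarial overrides in the current episode."""
    wants_override = random.random() < alpha       # limited extent of control
    within_budget = budget is None or used < budget
    if wants_override and within_budget:
        return adversarial_policy(obs), used + 1   # adversary acts
    return agent_policy(obs), used                 # agent acts normally
```

Keeping alpha and the budget small is what would make the resulting adversarial policy stealthy: the induced state distribution stays close to the unattacked one.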
Robustness to Multi-Modal Environment Uncertainty in MARL using Curriculum Learning
Multi-agent reinforcement learning (MARL) plays a pivotal role in tackling
real-world challenges. However, the seamless transition of trained policies
from simulation to the real world requires them to be robust to various
environmental uncertainties. Existing works focus on finding a Nash equilibrium
or the optimal policy under uncertainty in a single environment variable (i.e.,
action, state, or reward), because a multi-agent system itself is already highly
complex and non-stationary. However, in real-world situations uncertainty can
occur in multiple environment variables simultaneously. This work is the first
to formulate the generalised problem of robustness to multi-modal environment
uncertainty in MARL. To this end, we propose a general robust training approach
for multi-modal uncertainty based on curriculum learning techniques. We handle
two distinct environmental uncertainties simultaneously and present extensive
results across both cooperative and competitive MARL environments,
demonstrating that our approach achieves state-of-the-art levels of robustness.
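As a rough illustration of such a curriculum (not the paper's exact schedule), the sketch below linearly ramps up the magnitudes of two uncertainty modes, here state and reward noise, so agents face mild perturbations early in training and harder ones later.

```python
# Hypothetical curriculum sketch; the noise caps and linear ramp are ours.
import numpy as np

def uncertainty_schedule(step, total_steps,
                         max_state_noise=0.1, max_reward_noise=0.5):
    frac = min(step / total_steps, 1.0)            # linear curriculum
    return frac * max_state_noise, frac * max_reward_noise

def perturb(obs, reward, step, total_steps, rng):
    """Apply curriculum-scaled Gaussian noise to both observation and reward."""
    s_noise, r_noise = uncertainty_schedule(step, total_steps)
    noisy_obs = obs + rng.normal(0.0, s_noise, size=np.shape(obs))
    noisy_reward = reward + rng.normal(0.0, r_noise)
    return noisy_obs, noisy_reward
```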
Beyond Worst-case Attacks: Robust RL with Adaptive Defense via Non-dominated Policies
In light of the burgeoning success of reinforcement learning (RL) in diverse
real-world applications, considerable focus has been directed towards ensuring
RL policies are robust to adversarial attacks during test time. Current
approaches largely revolve around solving a minimax problem to prepare for
potential worst-case scenarios. While effective against strong attacks, these
methods often compromise performance in the absence of attacks or the presence
of only weak attacks. To address this, we study policy robustness under the
well-accepted state-adversarial attack model, extending our focus beyond only
worst-case attacks. We first formalize this task at test time as a regret
minimization problem and establish its intrinsic hardness in achieving
sublinear regret when the baseline policy is from a general continuous policy
class \Pi. This finding prompts us to \textit{refine} the baseline policy
class prior to test time, aiming for efficient adaptation within a finite
policy class \Tilde{\Pi}, which can be achieved with an adversarial bandit
subroutine. In light of the importance of a small, finite \Tilde{\Pi}, we
propose a novel training-time algorithm to iteratively discover
\textit{non-dominated policies}, forming a near-optimal and minimal
\Tilde{\Pi}, thereby ensuring both robustness and test-time efficiency.
Empirical validation on the MuJoCo benchmark corroborates the superiority of
our approach in terms of natural and robust performance, as well as
adaptability to various attack scenarios.
Comment: International Conference on Learning Representations (ICLR) 2024, spotlight
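The adversarial-bandit subroutine could look like the simplified EXP3 sketch below, which treats each policy in the finite class \Tilde{\Pi} as an arm; the hyperparameters and the exact update rule are our assumptions, not necessarily the paper's.

```python
# Simplified EXP3 over a finite policy class; illustrative only.
import numpy as np

class EXP3:
    def __init__(self, n_policies, eta=0.1, gamma=0.05):
        self.w = np.zeros(n_policies)              # log-weights, one per policy
        self.eta, self.gamma = eta, gamma

    def probs(self):
        p = np.exp(self.w - self.w.max())          # softmax over log-weights
        p /= p.sum()
        return (1 - self.gamma) * p + self.gamma / len(self.w)  # exploration mix

    def select(self, rng):
        return rng.choice(len(self.w), p=self.probs())

    def update(self, arm, reward):                 # reward scaled to [0, 1]
        self.w[arm] += self.eta * reward / self.probs()[arm]  # importance weight
```

At test time, each episode one would select a policy, deploy it against the (possibly attacking) environment, and feed the episode return back into update.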
InfoNet: Neural Estimation of Mutual Information without Test-Time Optimization
Estimating mutual correlations between random variables or data streams is
essential for intelligent behavior and decision-making. As a fundamental
quantity for measuring statistical relationships, mutual information has been
extensively studied and utilized for its generality and equitability. However,
existing methods often lack either the efficiency needed for real-time
applications (e.g., estimators requiring test-time optimization of a neural
network) or the differentiability required for end-to-end learning (e.g.,
histogram-based estimators). We introduce a neural
network called InfoNet, which directly outputs mutual information estimations
of data streams by leveraging the attention mechanism and the computational
efficiency of deep learning infrastructures. By maximizing a dual formulation
of mutual information through large-scale simulated training, our approach
circumvents time-consuming test-time optimization and offers generalization
ability. We evaluate the effectiveness and generalization of our proposed
mutual information estimation scheme on various families of distributions and
applications. Our results demonstrate that InfoNet and its training process
provide a graceful efficiency-accuracy trade-off and order-preserving
properties. We will make the code and models available as a comprehensive
toolbox to facilitate studies in different fields requiring real-time mutual
information estimation.
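One widely used dual formulation is the Donsker-Varadhan lower bound, I(X;Y) >= E_P[T(x,y)] - log E_{P_X x P_Y}[exp(T(x,y))]; whether InfoNet maximizes exactly this bound is an assumption on our part. The numpy sketch below evaluates the bound for a given critic T on paired samples, using a shuffle to approximate the product of marginals.

```python
# Donsker-Varadhan MI lower bound; an illustrative evaluation, not InfoNet code.
import numpy as np

def dv_lower_bound(T, xs, ys, rng):
    """T: critic mapping (x, y) -> scalar score; xs, ys: paired samples."""
    joint = np.mean([T(x, y) for x, y in zip(xs, ys)])
    perm = rng.permutation(len(ys))                # shuffle y to approximate
    marginal = np.mean([np.exp(T(xs[i], ys[j]))    # P_X x P_Y
                        for i, j in enumerate(perm)])
    return joint - np.log(marginal)
```

A learned estimator in the spirit of InfoNet amortizes the maximization over T across many simulated distributions, which is what removes the per-sample test-time optimization.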