Emergence of Addictive Behaviors in Reinforcement Learning Agents
This paper presents a novel approach to the technical analysis of wireheading
in intelligent agents. Inspired by the natural analogues of wireheading and
their prevalent manifestations, we propose the modeling of such phenomena in
Reinforcement Learning (RL) agents as psychological disorders. In a preliminary
step towards evaluating this proposal, we study the feasibility and dynamics of
emergent addictive policies in Q-learning agents in the tractable environment
of the game of Snake. We consider a slightly modified setting for this game,
in which the environment provides a "drug" seed alongside the original
"healthy" seed for the consumption of the snake. We adopt and extend an
RL-based model of natural addiction to Q-learning agents in this setting, and
derive sufficient parametric conditions for the emergence of addictive
behaviors in such agents. Furthermore, we evaluate our theoretical analysis
with three sets of simulation-based experiments. The results demonstrate the
feasibility of addictive wireheading in RL agents, and provide promising avenues
for further research on the psychopathological modeling of complex AI safety
problems.
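For readers unfamiliar with the learning rule the abstract refers to, the following is a minimal sketch of the standard tabular Q-learning update. It is not the paper's extended addiction model: the state names, the two action labels, and the reward value are hypothetical placeholders chosen only to echo the "healthy" and "drug" seeds.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """Apply one standard tabular Q-learning step and return the TD error.

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error


# Hypothetical labels and reward, for illustration only.
actions = ["eat_healthy", "eat_drug"]
q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
td = q_update(q, "s0", "eat_drug", reward=1.0, next_state="s1",
              actions=actions)
```

With an all-zero table, the first update moves the chosen entry by `alpha` times the reward; how a "drug" reward would distort this update is the subject of the paper's own model and is not reproduced here.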
Founding The Domain of AI Forensics
With the widespread integration of AI in everyday and critical technologies, it seems inevitable to witness increasing instances of failure in AI systems. In such cases, there arises a need for technical investigations that produce legally acceptable and scientifically indisputable findings and conclusions on the causes of such failures. Inspired by the domain of cyber forensics, this paper introduces the need for the establishment of AI Forensics as a new discipline under AI safety. Furthermore, we propose a taxonomy of the subfields under this discipline, and present a discussion on the foundational challenges that lie ahead of this new research area.
A Theory of Mind Approach as Test-Time Mitigation Against Emergent Adversarial Communication
Multi-Agent Systems (MAS) is the study of multi-agent interactions in a
shared environment. Communication for cooperation is a fundamental construct
for sharing information in partially observable environments. Cooperative
Multi-Agent Reinforcement Learning (CoMARL) is a learning framework in which
agent policies are learned either with cooperative mechanisms or so that they
exhibit cooperative behavior. Explicitly, there are works on learning to
communicate messages from CoMARL agents; however, non-cooperative agents, when
capable of accessing a cooperative team's communication channel, have been shown
to learn adversarial communication messages, sabotaging the cooperative team's
performance, particularly when objectives depend on finite resources. To address
this issue, we propose a technique which leverages local formulations of
Theory-of-Mind (ToM) to distinguish exhibited cooperative behavior from
non-cooperative behavior before accepting messages from any agent. We
demonstrate the efficacy and feasibility of the proposed technique in empirical
evaluations in a centralized training, decentralized execution (CTDE) CoMARL
benchmark. Furthermore, while we propose our explicit ToM defense for
test-time, we emphasize that ToM is a construct for designing a cognitive
defense rather than being the objective of the defense.
Comment: 6 pages, 7 figures