Micro- and Macro-Level Validation in Agent-Based Simulation: Reproduction of Human-Like Behaviors and Thinking in a Sequential Bargaining Game
This paper addresses both micro- and macro-level validation in agent-based simulation (ABS) to explore validated agents that can reproduce not only human-like behaviors externally but also human-like thinking internally. For this purpose, we employ the sequential bargaining game, which can track changes in humans' behaviors and thinking over a longer horizon than the ultimatum game (i.e., the one-shot bargaining game), and compare simulation results of Q-learning agents employing any of three action selection methods (i.e., ε-greedy, roulette, and Boltzmann distribution selection) in the game. Intensive simulations have revealed the following implications: (1) Q-learning agents with any of the three action selection methods can reproduce human-like behaviors but not human-like thinking, which means that they are validated from the macro-level viewpoint but not from the micro-level viewpoint; and (2) Q-learning agents employing Boltzmann distribution selection with a changing randomness parameter can reproduce both human-like behaviors and thinking, which means that they are validated from both micro- and macro-level viewpoints.

Keywords: Micro- and Macro-Level Validation, Agent-Based Simulation, Agent Modeling, Sequential Bargaining Game, Reinforcement Learning
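A minimal sketch of the three action-selection rules this abstract compares (ε-greedy, roulette, and Boltzmann distribution selection), assuming a discrete vector of Q-values; the function names and parameters are illustrative, not taken from the paper:

```python
import numpy as np

def boltzmann_select(q_values, temperature):
    """Sample an action from the Boltzmann (softmax) distribution over
    Q-values; a higher temperature means more random exploration."""
    prefs = np.array(q_values, dtype=float) / temperature
    prefs -= prefs.max()                      # shift for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return np.random.choice(len(q_values), p=probs)

def epsilon_greedy_select(q_values, epsilon):
    """With probability epsilon explore uniformly, otherwise exploit."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def roulette_select(q_values):
    """Select with probability proportional to (shifted) Q-values."""
    q = np.array(q_values, dtype=float)
    q = q - q.min() + 1e-8                    # ensure positive weights
    probs = q / q.sum()
    return np.random.choice(len(q), p=probs)
```

Annealing the Boltzmann temperature over time, as the abstract's "changing randomness parameter" suggests, shifts the agent from exploration toward exploitation.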
Online Learning in Iterated Prisoner's Dilemma to Mimic Human Behavior
Studies of the Prisoner's Dilemma mainly treat the choice to cooperate or defect as an
atomic action. We propose to study online learning algorithm behavior in the
Iterated Prisoner's Dilemma (IPD) game, where we explored the full spectrum of
reinforcement learning agents: multi-armed bandits, contextual bandits and
reinforcement learning. We evaluated them in a tournament of iterated
prisoner's dilemma where multiple agents can compete in a sequential fashion.
This allows us to analyze the dynamics of policies learned by multiple
self-interested independent reward-driven agents, and also allows us to study
the capacity of these algorithms to fit human behaviors. Results suggest that
considering the current situation to make decisions performs worst in this kind
of social dilemma game. Multiple discoveries on online learning behaviors and
clinical validations are stated.

Comment: To the best of our knowledge, this is the first attempt to explore
the full spectrum of reinforcement learning agents (multi-armed bandits,
contextual bandits and reinforcement learning) in the sequential social
dilemma. This mental variants section supersedes and extends our work
arXiv:1706.02897 (MAB), arXiv:2005.04544 (CB) and arXiv:1906.11286 (RL) into
the multi-agent setting.
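The tournament setup can be sketched with the standard IPD payoff matrix and a context-free multi-armed bandit agent; this is a hypothetical minimal example, not the paper's implementation:

```python
import random

# Standard IPD payoff matrix: (my_payoff, opponent_payoff)
PAYOFFS = {
    ("C", "C"): (3, 3),  # mutual cooperation
    ("C", "D"): (0, 5),  # I cooperate, opponent defects
    ("D", "C"): (5, 0),  # I defect, opponent cooperates
    ("D", "D"): (1, 1),  # mutual defection
}

class EpsilonGreedyBandit:
    """Multi-armed bandit treating cooperate/defect as context-free arms."""
    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {"C": 0, "D": 0}
        self.values = {"C": 0.0, "D": 0.0}

    def act(self):
        if random.random() < self.epsilon:
            return random.choice(["C", "D"])
        return max(self.values, key=self.values.get)

    def update(self, action, reward):
        self.counts[action] += 1
        # incremental mean estimate of each arm's value
        self.values[action] += (reward - self.values[action]) / self.counts[action]

def play(agent, opponent_policy, rounds=100):
    """Run repeated rounds, feeding the bandit its own payoffs."""
    total = 0
    for _ in range(rounds):
        a, b = agent.act(), opponent_policy()
        r, _ = PAYOFFS[(a, b)]
        agent.update(a, r)
        total += r
    return total
```

Against an unconditional cooperator, a bandit like this learns to defect; contextual and full-RL agents would additionally condition on the opponent's recent moves.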
RL-Duet: Online Music Accompaniment Generation Using Deep Reinforcement Learning
This paper presents a deep reinforcement learning algorithm for online
accompaniment generation, with potential for real-time interactive
human-machine duet improvisation. Different from offline music generation and
harmonization, online music accompaniment requires the algorithm to respond to
human input and generate the machine counterpart in a sequential order. We cast
this as a reinforcement learning problem, where the generation agent learns a
policy to generate a musical note (action) based on previously generated
context (state). The key of this algorithm is the well-functioning reward
model. Instead of defining it using music composition rules, we learn this
model from monophonic and polyphonic training data. This model considers the
compatibility of the machine-generated note with both the machine-generated
context and the human-generated context. Experiments show that this algorithm
is able to respond to the human part and generate a melodic, harmonic and
diverse machine part. Subjective evaluations on preferences show that the
proposed algorithm generates music pieces of higher quality than the baseline
method.
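The sequential generation loop described above (state: previously generated context; action: the next note; reward: compatibility with both the machine and human contexts) can be sketched as follows. The consonance-based reward here is a hand-coded stand-in for the paper's learned reward model, and all names are illustrative:

```python
# Semitone intervals (mod 12) treated as consonant in this toy sketch
CONSONANT_INTERVALS = {0, 3, 4, 5, 7, 8, 9}

def compatibility(note, context):
    """Fraction of context notes forming a consonant interval with `note`."""
    if not context:
        return 1.0
    return sum(abs(note - c) % 12 in CONSONANT_INTERVALS for c in context) / len(context)

def reward(note, machine_ctx, human_ctx, w=0.5):
    # The paper learns this model from monophonic and polyphonic data;
    # this weighted sum is only a hand-coded placeholder.
    return w * compatibility(note, machine_ctx) + (1 - w) * compatibility(note, human_ctx)

def generate_step(human_ctx, machine_ctx, candidates=range(60, 72)):
    """Greedy policy: pick the candidate MIDI note with the highest reward."""
    return max(candidates, key=lambda n: reward(n, machine_ctx, human_ctx))
```

In the paper's online setting the policy is trained with RL rather than acting greedily, but the state/action/reward decomposition is the same.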
Advice Conformance Verification by Reinforcement Learning agents for Human-in-the-Loop
Human-in-the-loop (HiL) reinforcement learning is gaining traction in domains
with large action and state spaces and sparse rewards, by allowing the agent to
take advice from the HiL. Beyond advice accommodation, a sequential decision-making
agent must be able to express the extent to which it was able to utilize the
human advice. Subsequently, the agent should provide a means for the HiL to
inspect parts of advice that it had to reject in favor of the overall
environment objective. We introduce the problem of Advice-Conformance
Verification which requires reinforcement learning (RL) agents to provide
assurances to the human in the loop regarding how much of their advice is being
conformed to. We then propose a Tree-based lingua-franca to support this
communication, called a Preference Tree. We study two cases of good and bad
advice scenarios in MuJoCo's Humanoid environment. Through our experiments, we
show that our method can provide an interpretable means of solving the
Advice-Conformance Verification problem by conveying whether or not the agent
is using the human's advice. Finally, we present a human-user study with 20
participants that validates our method.

Comment: Accepted at IROS-RLCONFORM 202
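A Preference Tree serving as a conformance report might be structured like this hypothetical sketch; the node fields and the example advice are assumptions for illustration, not the paper's data structure:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PreferenceNode:
    """One piece of human advice and whether the agent conformed to it."""
    advice: str
    conformed: bool = True
    children: List["PreferenceNode"] = field(default_factory=list)

def report(node, depth=0):
    """Render the tree as an indented conformance report for the HiL."""
    mark = "+" if node.conformed else "-"
    lines = [f"{'  ' * depth}[{mark}] {node.advice}"]
    for child in node.children:
        lines.extend(report(child, depth + 1))
    return lines

# Hypothetical advice for a humanoid locomotion task
root = PreferenceNode("walk forward", children=[
    PreferenceNode("keep torso upright"),
    # rejected: conflicts with the environment objective of walking
    PreferenceNode("keep both feet on the ground", conformed=False),
])
```

Walking such a tree lets the human see at a glance which sub-pieces of advice were rejected in favor of the overall environment objective.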
Rethinking the Discount Factor in Reinforcement Learning: A Decision Theoretic Approach
Reinforcement learning (RL) agents have traditionally been tasked with
maximizing the value function of a Markov decision process (MDP), either in
continuous settings, with fixed discount factor γ < 1, or in episodic
settings, with γ = 1. While this has proven effective for specific tasks
with well-defined objectives (e.g., games), it has never been established that
fixed discounting is suitable for general purpose use (e.g., as a model of
human preferences). This paper characterizes rationality in sequential decision
making using a set of seven axioms and arrives at a form of discounting that
generalizes traditional fixed discounting. In particular, our framework admits
a state-action dependent "discount" factor that is not constrained to be less
than 1, so long as there is eventual long run discounting. Although this
broadens the range of possible preference structures in continuous settings, we
show that there exists a unique "optimizing MDP" with fixed γ whose
optimal value function matches the true utility of the optimal policy, and we
quantify the difference between value and utility for suboptimal policies. Our
work can be seen as providing a normative justification for (a slight
generalization of) Martha White's RL task formalism (2017) and other recent
departures from the traditional RL, and is relevant to task specification in
RL, inverse RL and preference-based RL.

Comment: 8 pages + 1 page supplement. In proceedings of AAAI 2019. Slides,
poster and bibtex available at
https://silviupitis.com/#rethinking-the-discount-factor-in-reinforcement-learning-a-decision-theoretic-approac
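Value iteration with a state-action dependent discount γ(s, a), of the kind the framework above admits, can be sketched on a toy MDP; the transition and reward numbers are illustrative, not from the paper:

```python
import numpy as np

def value_iteration(rewards, transitions, gamma_fn, n_states, n_actions,
                    iters=500):
    """Value iteration with a state-action dependent discount gamma(s, a).

    rewards[s][a]     : immediate reward for action a in state s
    transitions[s][a] : probability vector over next states
    gamma_fn(s, a)    : discount applied after taking a in s
    """
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = np.array([[rewards[s][a] + gamma_fn(s, a) * transitions[s][a] @ V
                       for a in range(n_actions)] for s in range(n_states)])
        V = Q.max(axis=1)
    return V

# Toy 2-state, 2-action MDP (illustrative numbers)
rewards = [[1.0, 0.0], [0.0, 2.0]]
transitions = [
    [np.array([1.0, 0.0]), np.array([0.0, 1.0])],  # from state 0
    [np.array([0.0, 1.0]), np.array([1.0, 0.0])],  # from state 1
]
# discount varies with the action taken, with long-run discounting preserved
gamma_fn = lambda s, a: 0.9 if a == 0 else 0.5
V = value_iteration(rewards, transitions, gamma_fn, n_states=2, n_actions=2)
```

Here the fixed point is V = [10, 7]: state 0 loops on action 0 (1 + 0.9·V₀ ⇒ V₀ = 10) and state 1 jumps back via action 1 (2 + 0.5·V₀ = 7). With a constant γ this reduces to the ordinary Bellman update.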