
    Micro- and Macro-Level Validation in Agent-Based Simulation: Reproduction of Human-Like Behaviors and Thinking in a Sequential Bargaining Game

    This paper addresses both micro- and macro-level validation in agent-based simulation (ABS) to explore validated agents that can reproduce not only human-like behaviors externally but also human-like thinking internally. For this purpose, we employ the sequential bargaining game, which can investigate changes in human behaviors and thinking over a longer horizon than the ultimatum game (i.e., the one-shot bargaining game), and compare simulation results of Q-learning agents employing each of three types of action selection (i.e., ε-greedy, roulette, and Boltzmann distribution selection) in the game. Intensive simulations have revealed the following implications: (1) Q-learning agents with any of the three action selection types can reproduce human-like behaviors but not human-like thinking, which means they are validated from the macro-level viewpoint but not from the micro-level viewpoint; and (2) Q-learning agents employing Boltzmann distribution selection with a changing random parameter can reproduce both human-like behaviors and thinking, which means they are validated from both the micro- and macro-level viewpoints.
    Keywords: Micro- and Macro-Level Validation, Agent-Based Simulation, Agent Modeling, Sequential Bargaining Game, Reinforcement Learning
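
    As a concrete illustration of the kind of agent the abstract compares, the sketch below implements tabular Q-learning with Boltzmann (softmax) action selection and a temperature that is annealed over episodes, loosely corresponding to the "changing random parameter" mentioned above. The payoff function, offer discretization, and schedule constants are illustrative assumptions, not the paper's actual sequential bargaining setup.

```python
import numpy as np

# Tabular Q-learning with Boltzmann (softmax) action selection and an
# annealed temperature. The bargaining payoff below is a toy stand-in
# (assumption), not the paper's actual sequential bargaining game.

rng = np.random.default_rng(0)

N_OFFERS = 10                 # discretized offers the proposer can make (assumption)
ALPHA = 0.1                   # learning rate
T_START, T_END = 5.0, 0.1     # temperature schedule, high -> low (assumption)
EPISODES = 5000

Q = np.zeros(N_OFFERS)        # one-state Q-table over the proposer's offers

def boltzmann_select(q, temperature):
    """Sample an action with probability proportional to exp(Q / T)."""
    prefs = q / temperature
    prefs -= prefs.max()                      # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return rng.choice(len(q), p=probs)

def payoff(offer):
    """Toy responder: the more generous the offer, the likelier it is accepted."""
    accept_prob = (offer + 1) / N_OFFERS
    return float(N_OFFERS - offer) if rng.random() < accept_prob else 0.0

for ep in range(EPISODES):
    # Anneal the temperature so play shifts from exploratory to near-greedy,
    # mirroring the idea of changing the randomness parameter over time.
    t = T_START + (T_END - T_START) * ep / (EPISODES - 1)
    a = boltzmann_select(Q, t)
    r = payoff(a)
    Q[a] += ALPHA * (r - Q[a])                # incremental Q-value update

print("Most preferred offer index:", int(np.argmax(Q)))
```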

    Online Learning in Iterated Prisoner's Dilemma to Mimic Human Behavior

    Prior studies of the Prisoner's Dilemma mainly treat the choice to cooperate or defect as an atomic action. We propose to study the behavior of online learning algorithms in the Iterated Prisoner's Dilemma (IPD) game, where we explore the full spectrum of reinforcement learning agents: multi-armed bandits, contextual bandits, and reinforcement learning. We evaluate them in a tournament of the iterated prisoner's dilemma in which multiple agents compete in a sequential fashion. This allows us to analyze the dynamics of policies learned by multiple self-interested, independent, reward-driven agents, and also to study how well these algorithms fit human behaviors. Results suggest that conditioning decisions only on the current situation performs worst in this kind of social dilemma game. Multiple findings on online learning behaviors and clinical validations are reported.
    Comment: To the best of our knowledge, this is the first attempt to explore the full spectrum of reinforcement learning agents (multi-armed bandits, contextual bandits and reinforcement learning) in the sequential social dilemma. This mental variants section supersedes and extends our work arXiv:1706.02897 (MAB), arXiv:2005.04544 (CB) and arXiv:1906.11286 (RL) into the multi-agent setting
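
    To make the tournament setting concrete, the following sketch pits a memoryless ε-greedy bandit (which ignores the current situation entirely) against a fixed tit-for-tat strategy over repeated rounds with the standard IPD payoff matrix. The agents and payoffs are common textbook choices used here for illustration, not the paper's exact tournament configuration.

```python
import random

# Minimal Iterated Prisoner's Dilemma match: a memoryless epsilon-greedy
# bandit versus tit-for-tat, with the standard payoff matrix (assumption).

C, D = 0, 1
PAYOFF = {  # (my_action, opponent_action) -> my reward
    (C, C): 3, (C, D): 0,
    (D, C): 5, (D, D): 1,
}

class EpsilonGreedyBandit:
    """Treats cooperate/defect as two arms; ignores the opponent's last move."""
    def __init__(self, eps=0.1):
        self.eps = eps
        self.value = [0.0, 0.0]
        self.count = [0, 0]

    def act(self, _last_opp_action):
        if random.random() < self.eps:
            return random.choice([C, D])
        return max((C, D), key=lambda a: self.value[a])

    def update(self, action, reward):
        self.count[action] += 1
        self.value[action] += (reward - self.value[action]) / self.count[action]

class TitForTat:
    """Cooperates first, then copies the opponent's previous action."""
    def act(self, last_opp_action):
        return C if last_opp_action is None else last_opp_action
    def update(self, action, reward):
        pass

def play(agent_a, agent_b, rounds=200):
    last_a, last_b = None, None
    score_a = score_b = 0
    for _ in range(rounds):
        a, b = agent_a.act(last_b), agent_b.act(last_a)
        ra, rb = PAYOFF[(a, b)], PAYOFF[(b, a)]
        agent_a.update(a, ra); agent_b.update(b, rb)
        score_a += ra; score_b += rb
        last_a, last_b = a, b
    return score_a, score_b

print(play(EpsilonGreedyBandit(), TitForTat()))
```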

    RL-Duet: Online Music Accompaniment Generation Using Deep Reinforcement Learning

    This paper presents a deep reinforcement learning algorithm for online accompaniment generation, with potential for real-time interactive human-machine duet improvisation. Different from offline music generation and harmonization, online music accompaniment requires the algorithm to respond to human input and generate the machine counterpart in a sequential order. We cast this as a reinforcement learning problem, where the generation agent learns a policy to generate a musical note (action) based on the previously generated context (state). The key to this algorithm is a well-functioning reward model. Instead of defining it using music composition rules, we learn this model from monophonic and polyphonic training data. This model considers the compatibility of the machine-generated note with both the machine-generated context and the human-generated context. Experiments show that the algorithm is able to respond to the human part and generate a melodic, harmonic, and diverse machine part. Subjective evaluations on preferences show that the proposed algorithm generates music pieces of higher quality than the baseline method.
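
    The sketch below illustrates the online generation loop described above: at each step the agent observes the recent human and machine context (state), scores candidate notes with a reward model, and emits the next machine note (action). Here the reward model and policy are crude placeholders (a proximity-to-the-human-pitch heuristic and a softmax over its scores); in the paper both are learned from monophonic and polyphonic data, and the reward also accounts for the machine-generated context.

```python
import numpy as np

# Placeholder version of an online accompaniment loop: human notes arrive
# one at a time and the machine responds note by note (assumed setup).

rng = np.random.default_rng(0)
N_PITCHES = 128          # MIDI-style pitch vocabulary (assumption)
CONTEXT_LEN = 8          # how many past notes form the state (assumption)

def reward_model(human_ctx, machine_ctx, note):
    """Placeholder reward: prefer notes close to the last human pitch.
    (A learned model would also score compatibility with machine_ctx.)"""
    return -abs(int(note) - int(human_ctx[-1])) / 12.0

def policy(human_ctx, machine_ctx):
    """Score every candidate note and sample from a softmax over the scores."""
    scores = np.array([reward_model(human_ctx, machine_ctx, n)
                       for n in range(N_PITCHES)])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return rng.choice(N_PITCHES, p=probs)

human_stream = rng.integers(60, 72, size=32)            # fake human melody
human_ctx = [60] * CONTEXT_LEN
machine_ctx = [60] * CONTEXT_LEN
for human_note in human_stream:
    human_ctx = human_ctx[1:] + [int(human_note)]       # update state with human input
    machine_note = policy(human_ctx, machine_ctx)       # generate machine counterpart
    machine_ctx = machine_ctx[1:] + [int(machine_note)]
    print(human_note, "->", machine_note)
```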

    Advice Conformance Verification by Reinforcement Learning agents for Human-in-the-Loop

    Human-in-the-loop (HiL) reinforcement learning is gaining traction in domains with large action and state spaces and sparse rewards, as it allows the agent to take advice from a human in the loop. Beyond accommodating advice, a sequential decision-making agent must be able to express the extent to which it was able to utilize the human advice. Subsequently, the agent should provide a means for the HiL to inspect the parts of the advice that it had to reject in favor of the overall environment objective. We introduce the problem of Advice-Conformance Verification, which requires reinforcement learning (RL) agents to provide assurances to the human in the loop regarding how much of their advice is being conformed to. We then propose a tree-based lingua franca to support this communication, called a Preference Tree. We study two cases of good and bad advice scenarios in MuJoCo's Humanoid environment. Through our experiments, we show that our method provides an interpretable means of solving the Advice-Conformance Verification problem by conveying whether or not the agent is using the human's advice. Finally, we present a human-user study with 20 participants that validates our method.
    Comment: Accepted at IROS-RLCONFORM 202
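
    The abstract does not detail the Preference Tree's internal structure, so the following is only a hypothetical sketch of such a tree-based report: each node carries one piece of human advice together with a flag and explanation for whether the agent conformed to it, which the human in the loop could then inspect. All node names and advice strings are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical preference-tree structure for reporting advice conformance;
# this is an illustrative guess, not the paper's actual data structure.

@dataclass
class PreferenceNode:
    advice: str                       # e.g. "keep torso upright" (made-up)
    conformed: Optional[bool] = None  # None until the agent reports back
    reason: str = ""                  # why the advice was rejected, if it was
    children: List["PreferenceNode"] = field(default_factory=list)

    def report(self, indent: int = 0) -> str:
        """Render a human-readable conformance report for the HiL."""
        status = {True: "conformed", False: "rejected", None: "pending"}[self.conformed]
        line = " " * indent + f"- {self.advice}: {status}"
        if self.reason:
            line += f" ({self.reason})"
        return "\n".join([line] + [c.report(indent + 2) for c in self.children])

# Illustrative usage with made-up Humanoid advice.
root = PreferenceNode("walk forward efficiently", conformed=True, children=[
    PreferenceNode("keep both feet on the ground", conformed=False,
                   reason="conflicts with forward-velocity objective"),
    PreferenceNode("limit joint torque", conformed=True),
])
print(root.report())
```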

    Rethinking the Discount Factor in Reinforcement Learning: A Decision Theoretic Approach

    Reinforcement learning (RL) agents have traditionally been tasked with maximizing the value function of a Markov decision process (MDP), either in continuous settings, with fixed discount factor γ < 1, or in episodic settings, with γ = 1. While this has proven effective for specific tasks with well-defined objectives (e.g., games), it has never been established that fixed discounting is suitable for general purpose use (e.g., as a model of human preferences). This paper characterizes rationality in sequential decision making using a set of seven axioms and arrives at a form of discounting that generalizes traditional fixed discounting. In particular, our framework admits a state-action dependent "discount" factor that is not constrained to be less than 1, so long as there is eventual long run discounting. Although this broadens the range of possible preference structures in continuous settings, we show that there exists a unique "optimizing MDP" with fixed γ < 1 whose optimal value function matches the true utility of the optimal policy, and we quantify the difference between value and utility for suboptimal policies. Our work can be seen as providing a normative justification for (a slight generalization of) Martha White's RL task formalism (2017) and other recent departures from the traditional RL, and is relevant to task specification in RL, inverse RL and preference-based RL.
    Comment: 8 pages + 1 page supplement. In proceedings of AAAI 2019. Slides, poster and bibtex available at https://silviupitis.com/#rethinking-the-discount-factor-in-reinforcement-learning-a-decision-theoretic-approac
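
    To illustrate the generalized discounting the abstract describes, the sketch below runs value iteration on a toy MDP with a state-action dependent discount γ(s, a), including one entry above 1, relying on the remaining entries to provide eventual long-run discounting. The MDP, constants, and convergence check are illustrative assumptions rather than the paper's construction.

```python
import numpy as np

# Value iteration with a state-action dependent discount gamma(s, a).
# The toy MDP below is randomly generated for illustration (assumption).

N_S, N_A = 4, 2
rng = np.random.default_rng(0)

R = rng.uniform(0, 1, size=(N_S, N_A))                # rewards r(s, a)
P = rng.dirichlet(np.ones(N_S), size=(N_S, N_A))      # transitions P(s' | s, a)

# Mostly gamma < 1, with one entry above 1; long-run products still shrink
# because every successor state discounts at 0.8.
GAMMA = np.full((N_S, N_A), 0.8)
GAMMA[0, 1] = 1.05

V = np.zeros(N_S)
for _ in range(500):
    # Generalized Bellman backup: Q(s,a) = r(s,a) + gamma(s,a) * E[V(s')]
    Q = R + GAMMA * (P @ V)
    V_new = Q.max(axis=1)
    converged = np.max(np.abs(V_new - V)) < 1e-8
    V = V_new
    if converged:
        break

print("Value function:", np.round(V, 3))
print("Greedy policy:", Q.argmax(axis=1))
```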