
    Imitation learning based on entropy-regularized forward and inverse reinforcement learning

    This paper proposes Entropy-Regularized Imitation Learning (ERIL), which combines forward and inverse reinforcement learning under the framework of the entropy-regularized Markov decision process. ERIL minimizes the reverse Kullback-Leibler (KL) divergence between two probability distributions induced by a learner and an expert. Inverse reinforcement learning (RL) in ERIL evaluates the log-ratio between the two distributions using the density ratio trick, which is widely used in generative adversarial networks. More specifically, the log-ratio is estimated by building two binary discriminators. The first discriminator is a state-only function, and it tries to distinguish states generated by the forward RL step from the expert's states. The second discriminator is a function of the current state, action, and transitioned state, and it distinguishes the generated experiences from those provided by the expert. Since the second discriminator shares the hyperparameters of the forward RL step, it can be used to control the discriminator's ability. The forward RL step minimizes the reverse KL estimated by the inverse RL step. We show that minimizing the reverse KL divergence is equivalent to finding an optimal policy under entropy regularization. Consequently, the new policy is derived from an algorithm that resembles Dynamic Policy Programming and Soft Actor-Critic. Our experimental results on MuJoCo-simulated environments show that ERIL is more sample-efficient than previous methods. We further apply the method to human behaviors in performing a pole-balancing task and show that the estimated reward functions reveal how each subject achieves the goal.
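
    As a rough illustration of the density ratio trick mentioned above, the sketch below trains a binary discriminator on expert versus learner samples and reads the log-ratio off its logit; the network shapes and names are illustrative assumptions, not details taken from the ERIL paper.

        import torch
        import torch.nn as nn

        class Discriminator(nn.Module):
            """Binary classifier whose logit approximates log p_expert(x) / p_learner(x)."""
            def __init__(self, in_dim, hidden=64):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(in_dim, hidden), nn.ReLU(),
                    nn.Linear(hidden, 1),
                )

            def forward(self, x):
                return self.net(x)  # raw logit

        def discriminator_loss(disc, expert_x, learner_x):
            """Cross-entropy with expert samples labelled 1 and learner samples 0."""
            bce = nn.BCEWithLogitsLoss()
            return (bce(disc(expert_x), torch.ones(expert_x.size(0), 1))
                    + bce(disc(learner_x), torch.zeros(learner_x.size(0), 1)))

        def log_density_ratio(disc, x):
            """At the optimum, the logit equals the log density ratio."""
            return disc(x).squeeze(-1)

    In ERIL, one such discriminator would take only the state and the other the (state, action, next state) tuple; their estimated log-ratios provide the signal that the forward RL step minimizes as a reverse KL divergence.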

    Constrained Reinforcement Learning from Intrinsic and Extrinsic Rewards


    Model-Free Deep Inverse Reinforcement Learning by Logistic Regression

    This paper proposes model-free deep inverse reinforcement learning to find nonlinear reward function structures. We formulate inverse reinforcement learning as a problem of density ratio estimation and show that, under the framework of linearly solvable Markov decision processes, the log of the ratio between an optimal state transition and a baseline one is given by a part of the reward and the difference of the value functions. The logarithm of the density ratio is efficiently estimated by binomial logistic regression, whose classifier is constructed from the reward and the state value function. The classifier tries to discriminate between samples drawn from the optimal state transition probability and those drawn from the baseline one. The estimated state value function is then used to initialize part of the deep neural network for forward reinforcement learning. The proposed deep forward and inverse reinforcement learning is applied to two benchmark games: Atari 2600 and Reversi. Simulation results show that our method reaches the best performance substantially faster than the standard combination of forward and inverse reinforcement learning, as well as behavior cloning.
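
    A schematic version of the classifier described above can be written with a reward network and a state-value network whose outputs form the classifier's logit; the architecture and the exact decomposition below are assumptions for illustration, not the authors' implementation.

        import torch
        import torch.nn as nn

        STATE_DIM = 4  # placeholder state dimensionality

        r_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, 1))
        v_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, 1))

        def classifier_logit(s, s_next):
            """Log density ratio between optimal and baseline state transitions,
            expressed through the reward and the difference of the state values."""
            return r_net(s_next) + v_net(s_next) - v_net(s)

        def logistic_irl_loss(opt_s, opt_s_next, base_s, base_s_next):
            """Binomial logistic regression: optimal transitions are the positive
            class, baseline transitions the negative class."""
            bce = nn.BCEWithLogitsLoss()
            return (bce(classifier_logit(opt_s, opt_s_next), torch.ones(opt_s.size(0), 1))
                    + bce(classifier_logit(base_s, base_s_next), torch.zeros(base_s.size(0), 1)))

    After training, the state-value network would be the piece used to initialize part of the forward reinforcement learning network, as described in the abstract.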

    Cooperative and Competitive Reinforcement and Imitation Learning for a Mixture of Heterogeneous Learning Modules

    This paper proposes Cooperative and competitive Reinforcement And Imitation Learning (CRAIL) for selecting an appropriate policy from a set of multiple heterogeneous modules and training all of them in parallel. Each learning module has its own network architecture and improves its policy based on an off-policy reinforcement learning algorithm and behavior cloning from samples collected by a behavior policy that is constructed as a combination of all the policies. Since the mixing weights are determined by the performance of each module, a better policy is automatically selected based on the learning progress. Experimental results on a benchmark control task show that CRAIL successfully achieves fast learning by allowing modules with complicated network structures to exploit task-relevant samples for training.
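
    A minimal sketch of the performance-based module mixing described above might look as follows; the softmax weighting and its temperature are assumptions made for illustration rather than the exact CRAIL rule.

        import numpy as np

        def mixing_weights(recent_returns, temperature=1.0):
            """Softmax over each module's recent average return, so that modules
            that are currently learning well contribute more to the behavior policy."""
            r = np.asarray(recent_returns, dtype=float)
            z = (r - r.max()) / temperature  # shift for numerical stability
            w = np.exp(z)
            return w / w.sum()

        def behavior_action(modules, state, weights, rng):
            """Sample one module in proportion to its weight and act with its policy."""
            k = rng.choice(len(modules), p=weights)
            return modules[k].act(state)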

    Sigmoid-weighted linear units for neural network function approximation in reinforcement learning

    In recent years, neural networks have enjoyed a renaissance as function approximators in reinforcement learning. Two decades after Tesauro's TD-Gammon achieved near top-level human performance in backgammon, the deep reinforcement learning algorithm DQN achieved human-level performance in many Atari 2600 games. The purpose of this study is twofold. First, we propose two activation functions for neural network function approximation in reinforcement learning: the sigmoid-weighted linear unit (SiLU) and its derivative function (dSiLU). The activation of the SiLU is computed by the sigmoid function multiplied by its input. Second, we suggest that the more traditional approach of using on-policy learning with eligibility traces, instead of experience replay, and softmax action selection can be competitive with DQN, without the need for a separate target network. We validate our proposed approach by, first, achieving new state-of-the-art results in both stochastic SZ-Tetris and Tetris with a small 10 × 10 board, using TD(λ) learning and shallow dSiLU network agents, and, then, by outperforming DQN in the Atari 2600 domain by using a deep Sarsa(λ) agent with SiLU and dSiLU hidden units.
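
    Both activation functions follow directly from the definitions above: the SiLU multiplies the input by its sigmoid, and the dSiLU is the SiLU's derivative. The NumPy sketch below is for illustration only, not the authors' implementation.

        import numpy as np

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        def silu(x):
            """Sigmoid-weighted linear unit: x * sigmoid(x)."""
            return x * sigmoid(x)

        def dsilu(x):
            """Derivative of the SiLU: sigmoid(x) * (1 + x * (1 - sigmoid(x)))."""
            s = sigmoid(x)
            return s * (1.0 + x * (1.0 - s))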

    Cooperative Behavior Acquisition by Learning and Evolution in a Multi-Agent Environment for Mobile Robots

    The objective of my research described in this dissertation is to realize learning and evolutionary methods for multiagent systems. This dissertation mainly consists of four parts. In Chapter 3, we propose a method that acquires purposive behaviors based on the estimation of state vectors. In order to acquire cooperative behaviors in multiagent environments, each learning robot estimates a Local Prediction Model (hereafter LPM) between the learner and each of the other objects separately. The LPM estimates the local interaction, while reinforcement learning copes with the global interaction between multiple LPMs and the given tasks. Based on the LPMs, which satisfy the Markovian environment assumption as far as possible, the robots learn the desired behaviors using reinforcement learning. We also propose a learning schedule in order to make learning stable, especially in the early stage of learning in multiagent systems. Chapter 4 discusses how an agent can develop its behavior according to the complexity of the interactions with its environment. A method for controlling the complexity i…

    Co-evolution for Cooperative Behavior Acquisition in a Multiple Mobile Robot Environment

    Co-evolution has been receiving increased attention as a method for multi-agent simultaneous learning. This paper discusses how cooperative behaviors can emerge among multiple robots through co-evolutionary processes. As an example task, a simplified soccer game with three learning robots is selected, and a genetic programming method is applied to an individual population corresponding to each robot so as to obtain cooperative and competitive behaviors. The complexity of the problem is twofold: co-evolution for cooperative behaviors needs exact synchronization of the mutual evolutions, and three-robot co-evolution requires well-designed environment setups that may gradually change from simpler to more complicated situations. Simulation results are shown, and a discussion is given. 1 Introduction: Realization of autonomous robots that organize their own internal structures to accomplish given tasks through interactions with their environments is one of the ultimate goals of Robotics…