
    Imitation learning based on entropy-regularized forward and inverse reinforcement learning

    This paper proposes Entropy-Regularized Imitation Learning (ERIL), which combines forward and inverse reinforcement learning under the framework of the entropy-regularized Markov decision process. ERIL minimizes the reverse Kullback-Leibler (KL) divergence between two probability distributions induced by a learner and an expert. The inverse reinforcement learning (RL) step of ERIL evaluates the log-ratio between the two distributions using the density ratio trick, which is widely used in generative adversarial networks. More specifically, the log-ratio is estimated by building two binary discriminators. The first discriminator is a state-only function that tries to distinguish states generated by the forward RL step from the expert's states. The second discriminator is a function of the current state, the action, and the transitioned state, and it distinguishes the generated experiences from those provided by the expert. Since the second discriminator shares the hyperparameters of the forward RL step, those hyperparameters can be used to control the discriminator's ability. The forward RL step minimizes the reverse KL divergence estimated by the inverse RL step. We show that minimizing the reverse KL divergence is equivalent to finding an optimal policy under entropy regularization. Consequently, a new policy is derived from an algorithm that resembles Dynamic Policy Programming and Soft Actor-Critic. Our experimental results on MuJoCo-simulated environments show that ERIL is more sample-efficient than previous methods. We further apply the method to human behaviors in a pole-balancing task and show that the estimated reward functions reveal how each subject achieves the goal.
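    The density ratio trick mentioned above can be illustrated concretely: a binary discriminator trained to separate expert samples (label 1) from learner samples (label 0) has, at its optimum, a logit equal to the log-ratio of the two densities. The following is a minimal NumPy sketch of that idea, not the authors' implementation; the linear discriminator, the toy Gaussian samples, and the training loop are assumptions made for illustration.

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def fit_discriminator(expert_x, learner_x, lr=0.1, steps=2000):
            """Logistic regression separating expert samples (label 1) from learner samples (label 0)."""
            x = np.vstack([expert_x, learner_x])
            x = np.hstack([x, np.ones((len(x), 1))])                 # append a bias feature
            y = np.concatenate([np.ones(len(expert_x)), np.zeros(len(learner_x))])
            w = np.zeros(x.shape[1])
            for _ in range(steps):
                w -= lr * x.T @ (sigmoid(x @ w) - y) / len(y)        # cross-entropy gradient step
            return w

        def log_ratio(w, x):
            """The optimal discriminator's logit estimates log p_expert(x) / p_learner(x)."""
            x = np.hstack([x, np.ones((len(x), 1))])
            return x @ w

        # Toy 1-D example: expert states centered at 1, learner states centered at 0.
        rng = np.random.default_rng(0)
        expert_states = rng.normal(1.0, 1.0, size=(500, 1))
        learner_states = rng.normal(0.0, 1.0, size=(500, 1))
        w = fit_discriminator(expert_states, learner_states)
        print(log_ratio(w, np.array([[0.0], [1.0]])))                # larger where the expert density dominates

    In ERIL itself the forward RL step then minimizes the reverse KL divergence built from such log-ratio estimates; the sketch only covers the ratio-estimation step.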

    Constrained Reinforcement Learning from Intrinsic and Extrinsic Rewards


    Model-Free Deep Inverse Reinforcement Learning by Logistic Regression

    This paper proposes model-free deep inverse reinforcement learning to find nonlinear reward function structures. We formulate inverse reinforcement learning as a problem of density ratio estimation and show that, under the framework of linearly solvable Markov decision processes, the log of the ratio between an optimal state transition and a baseline one is given by a part of the reward and the difference of the value functions. The logarithm of the density ratio is efficiently estimated by binomial logistic regression, in which the classifier is constructed from the reward and the state value function. The classifier tries to discriminate between samples drawn from the optimal state transition probability and those drawn from the baseline one. The estimated state value function is then used to initialize part of the deep neural network for forward reinforcement learning. The proposed deep forward and inverse reinforcement learning is applied to two benchmark games: Atari 2600 and Reversi. Simulation results show that our method reaches the best performance substantially faster than the standard combination of forward and inverse reinforcement learning, as well as behavior cloning.
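    A hedged sketch of this density-ratio formulation may help: transitions from the optimal policy (label 1) and from the baseline policy (label 0) are fed to a logistic-regression classifier whose logit is structured as q(s') + V(s') - V(s), so fitting the classifier yields reward and value estimates at once. The linear parameterization, the feature dimension, and the toy data below are illustrative assumptions, not the paper's deep-network setup.

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def fit_logreg_irl(opt_trans, base_trans, dim, lr=0.05, steps=3000):
            """Fit a linear reward q(s') and value V(s) so that the classifier logit
            q(s') + V(s') - V(s) separates optimal transitions (label 1) from baseline ones (label 0)."""
            s = np.vstack([opt_trans[0], base_trans[0]])             # current states
            s_next = np.vstack([opt_trans[1], base_trans[1]])        # next states
            y = np.concatenate([np.ones(len(opt_trans[0])), np.zeros(len(base_trans[0]))])
            w_q = np.zeros(dim)                                      # reward weights
            w_v = np.zeros(dim)                                      # state value weights
            for _ in range(steps):
                logit = s_next @ w_q + s_next @ w_v - s @ w_v
                err = sigmoid(logit) - y                             # cross-entropy gradient factor
                w_q -= lr * s_next.T @ err / len(y)
                w_v -= lr * (s_next - s).T @ err / len(y)
            return w_q, w_v

        # Toy usage with random 4-D state features (illustrative only).
        rng = np.random.default_rng(1)
        opt = (rng.normal(size=(400, 4)), rng.normal(loc=0.5, size=(400, 4)))
        base = (rng.normal(size=(400, 4)), rng.normal(size=(400, 4)))
        w_q, w_v = fit_logreg_irl(opt, base, dim=4)

    In the paper, the estimated value function is what initializes part of the forward-RL network; the sketch stops at the estimation step.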

    Cooperative and Competitive Reinforcement and Imitation Learning for a Mixture of Heterogeneous Learning Modules

    This paper proposes Cooperative and competitive Reinforcement And Imitation Learning (CRAIL), which selects an appropriate policy from a set of multiple heterogeneous learning modules and trains all of them in parallel. Each learning module has its own network architecture and improves its policy by an off-policy reinforcement learning algorithm and by behavior cloning from samples collected by a behavior policy, which is constructed as a mixture of all the modules' policies. Since the mixing weights are determined by each module's performance, a better policy is automatically selected as learning progresses. Experimental results on a benchmark control task show that CRAIL achieves fast learning by allowing modules with complicated network structures to exploit task-relevant samples for training.
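    The performance-based module selection can be pictured with a short sketch: each module keeps a running estimate of its return, the behavior policy mixes the modules with softmax weights over those estimates, and modules that learn faster are selected more often. The softmax temperature, the exponential averaging of returns, and the module interface (a single act method) are illustrative assumptions rather than CRAIL's exact specification.

        import numpy as np

        class MixtureBehaviorPolicy:
            """Behavior policy that mixes heterogeneous learning modules by their recent performance."""

            def __init__(self, modules, temperature=1.0, decay=0.99):
                self.modules = modules                       # each module is assumed to expose .act(state)
                self.scores = np.zeros(len(modules))         # running return estimate per module
                self.temperature = temperature
                self.decay = decay

            def mixing_weights(self):
                z = self.scores / self.temperature
                z -= z.max()                                 # shift for numerical stability
                w = np.exp(z)
                return w / w.sum()

            def act(self, state):
                """Sample a module according to the mixing weights, then act with it."""
                idx = np.random.choice(len(self.modules), p=self.mixing_weights())
                return idx, self.modules[idx].act(state)

            def update_score(self, idx, episode_return):
                """Exponentially weighted average of returns as the performance measure."""
                self.scores[idx] = self.decay * self.scores[idx] + (1 - self.decay) * episode_return

    Every module still trains on the samples gathered by this shared behavior policy, which is how modules with complicated architectures get to exploit task-relevant samples.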

    Sigmoid-weighted linear units for neural network function approximation in reinforcement learning

    In recent years, neural networks have enjoyed a renaissance as function approximators in reinforcement learning. Two decades after Tesauro's TD-Gammon achieved near top-level human performance in backgammon, the deep reinforcement learning algorithm DQN achieved human-level performance in many Atari 2600 games. The purpose of this study is twofold. First, we propose two activation functions for neural network function approximation in reinforcement learning: the sigmoid-weighted linear unit (SiLU) and its derivative function (dSiLU). The activation of the SiLU is computed by the sigmoid function multiplied by its input. Second, we suggest that the more traditional approach of using on-policy learning with eligibility traces, instead of experience replay, together with softmax action selection can be competitive with DQN, without the need for a separate target network. We validate the proposed approach, first, by achieving new state-of-the-art results in both stochastic SZ-Tetris and Tetris with a small 10 x 10 board, using TD(lambda) learning and shallow dSiLU network agents, and then by outperforming DQN in the Atari 2600 domain with a deep Sarsa(lambda) agent with SiLU and dSiLU hidden units.
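    As stated above, the SiLU multiplies its input by the sigmoid of the input, and the dSiLU is its derivative; a minimal NumPy sketch of the two activation functions follows (the surrounding networks and the TD(lambda)/Sarsa(lambda) agents are left out).

        import numpy as np

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        def silu(x):
            """SiLU: the input multiplied by its sigmoid, x * sigmoid(x)."""
            return x * sigmoid(x)

        def dsilu(x):
            """dSiLU: the derivative of the SiLU, sigmoid(x) * (1 + x * (1 - sigmoid(x)))."""
            s = sigmoid(x)
            return s * (1.0 + x * (1.0 - s))

        x = np.linspace(-6.0, 6.0, 5)
        print(silu(x))       # smooth, unbounded above, with a shallow minimum below zero
        print(dsilu(x))      # bounded, sigmoid-like curve used as a hidden-unit activation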

    Organoids with cancer stem cell-like properties secrete exosomes and HSP90 in a 3D nanoenvironment

    The ability to form cellular aggregations such as tumorspheres and spheroids has been used as a morphological marker of malignant cancer cells, and in particular of cancer stem cells (CSC). However, a common definition of the types of cellular aggregation formed by cancer cells has not been available. We examined the morphologies of 67 cell lines cultured on three-dimensional, morphology-enhancing NanoCulture Plates (NCPs) and classified the types of cellular aggregates they formed. Among the 67 cell lines, 49 formed spheres or spheroids, 8 formed grape-like aggregations (GLA), 8 formed other types of aggregation, and 3 formed monolayer sheets. Seven of the 8 GLA-forming cell lines were derived from adenocarcinoma. The neuroendocrine adenocarcinoma cell line PC-3 formed asymmetric GLA with ductal structures on the NCPs and, in immunocompromised mice, rapidly growing asymmetric tumors that metastasized to the lymph nodes. In contrast, another adenocarcinoma cell line, DU-145, formed spheroids in vitro and spheroid-like tumors in vivo that did not metastasize to the lymph nodes until day 50 after transplantation. Culture in the 3D nanoenvironment and in a defined stem cell medium enabled the neuroendocrine adenocarcinoma cells to form slowly growing large organoids that expressed multiple stem cell markers, neuroendocrine markers, intercellular adhesion molecules, and oncogenes in vitro. In contrast, the more commonly used 2D serum-containing environment reduced intercellular adhesion, induced mesenchymal transition, and promoted rapid growth of the cells. In addition, the 3D stemness nanoenvironment promoted secretion of HSP90 and of EpCAM exosomes, a marker of the CSC phenotype, from the neuroendocrine organoids. These findings indicate that the NCP-based 3D environment enables cells to form stem cell tumoroids with multipotency and to model more accurately the in vivo tumor status at the levels of morphology and gene expression.