
    Addressing Function Approximation Error in Actor-Critic Methods

    In value-based reinforcement learning methods such as deep Q-learning, function approximation errors are known to lead to overestimated value estimates and suboptimal policies. We show that this problem persists in an actor-critic setting and propose novel mechanisms to minimize its effects on both the actor and the critic. Our algorithm builds on Double Q-learning, by taking the minimum value between a pair of critics to limit overestimation. We draw the connection between target networks and overestimation bias, and suggest delaying policy updates to reduce per-update error and further improve performance. We evaluate our method on the suite of OpenAI gym tasks, outperforming the state of the art in every environment tested. Comment: Accepted at ICML 2018.
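
    A minimal sketch of the clipped double-Q target described above (taking the minimum over a pair of target critics, with target-policy smoothing noise). The network callables, hyperparameters, and batch format are illustrative assumptions, not the authors' released code; in the full algorithm the actor and target networks would additionally be updated only every few critic steps (the delayed policy updates mentioned above).

        import torch

        def td3_critic_target(reward, not_done, next_state,
                              actor_target, critic1_target, critic2_target,
                              gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
            """y = r + gamma * min(Q1', Q2') evaluated at a smoothed target action."""
            with torch.no_grad():
                # Target-policy smoothing: perturb the target action with clipped noise.
                next_action = actor_target(next_state)
                noise = (torch.randn_like(next_action) * noise_std).clamp(-noise_clip, noise_clip)
                next_action = (next_action + noise).clamp(-max_action, max_action)
                # Clipped double Q-learning: keep the smaller of the two critic estimates.
                target_q = torch.min(critic1_target(next_state, next_action),
                                     critic2_target(next_state, next_action))
                return reward + not_done * gamma * target_q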

    Game Theory Based Distributed Coordination with Multi-Agent Reinforcement Learning

    We study the problem of automated object manipulation using the two arms of a Baxter robot. The robot uses a novel multi-agent reinforcement learning strategy to learn how to complete the task without any prior experience. It learns which actions to take by storing its interactions with the environment and using these experiences to build a policy that guides future actions. Each of Baxter’s arms is modeled as an independent agent that can move and learn separately from the other. Each arm learns its own policy (i.e., a mapping from environment states to robot actions) for how best to move in order to complete a collaborative (i.e., two-arm) task (e.g., pushing an item, pick-and-place, etc.). The individual agents are trained with the standard TD3 algorithm, using experiences that record how well the agent’s past actions moved it towards completing the task. TD3 comprises an actor network, which takes the states (joint angles) as input and outputs the actions (joint movements), and twin critic networks, which evaluate the quality of those actions. The actions of the two agents are then coordinated through a game theory-based distributed coordination strategy. This coordination framework yields a policy that provides a good set of actions for each arm to execute. Finally, Baxter uses this policy to complete tasks using both arms in collaboration. To the best of our knowledge, this work is the first to use a game theory-based strategy for dual-arm manipulation learning.
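
    For illustration only, a game-theoretic coordination step could be realized at execution time by evaluating each arm's critic over a small set of candidate actions and selecting a joint action that is a pure-strategy Nash equilibrium of the resulting two-player game. The discretization, the payoff construction from critic values, and the function name below are assumptions for this sketch, not necessarily the paper's coordination strategy.

        import numpy as np

        def pure_nash_joint_actions(payoff_left, payoff_right):
            """Return joint candidate-action indices (i, j) forming pure-strategy Nash
            equilibria of a two-player game, where payoff_left[i, j] and payoff_right[i, j]
            are the two arms' estimated returns (e.g. critic values) for that joint action."""
            equilibria = []
            n_left, n_right = payoff_left.shape
            for i in range(n_left):
                for j in range(n_right):
                    left_best = payoff_left[i, j] >= payoff_left[:, j].max()
                    right_best = payoff_right[i, j] >= payoff_right[i, :].max()
                    if left_best and right_best:
                        equilibria.append((i, j))
            return equilibria  # may be empty; a fallback (e.g. independently greedy actions) is needed

        # Toy usage with 3 candidate actions per arm and random critic estimates:
        left_q, right_q = np.random.rand(3, 3), np.random.rand(3, 3)
        print(pure_nash_joint_actions(left_q, right_q))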

    Episodic Reinforcement Learning with Expanded State-reward Space

    Empowered by deep neural networks, deep reinforcement learning (DRL) has demonstrated tremendous empirical success in various domains, including games, health care, and autonomous driving. Despite these advances, DRL remains data-inefficient: effective policies demand vast numbers of environmental samples. Recently, episodic control (EC)-based model-free DRL methods have improved sample efficiency by recalling past experiences from episodic memory. However, existing EC-based methods suffer from a potential misalignment between the state and reward spaces, because they neglect the information-rich states retrieved from past experience, which can cause inaccurate value estimation and degraded policy performance. To tackle this issue, we introduce an efficient EC-based DRL framework with an expanded state-reward space, in which both the expanded states used as input and the expanded rewards used in training contain historical as well as current information. Specifically, we reuse the historical states retrieved by EC as part of the input states and integrate the retrieved MC-returns into the immediate reward of each interaction transition. As a result, our method simultaneously achieves full utilization of the retrieved information and better evaluation of state values through a Temporal Difference (TD) loss. Empirical results on challenging Box2D and MuJoCo tasks demonstrate the superiority of our method over a recent sibling method and common baselines. Further experiments comparing Q-values verify our method's effectiveness in alleviating Q-value overestimation. Comment: Accepted at AAMAS'24.
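
    A minimal sketch of the expansion described above, under the assumption that the episodic memory exposes a retrieve lookup returning the most similar stored state and its Monte-Carlo return; the mixing weight lam, the concatenation scheme, and the memory interface are illustrative assumptions, not the paper's exact design.

        import numpy as np

        def expand_transition(state, reward, memory, lam=0.5):
            """Expand one transition with episodic-memory information.

            memory.retrieve(state) is assumed to return (retrieved_state, mc_return):
            the stored state most similar to `state` and its Monte-Carlo return.
            """
            retrieved_state, mc_return = memory.retrieve(state)
            expanded_state = np.concatenate([state, retrieved_state])  # historical + current info as input
            expanded_reward = reward + lam * mc_return                 # fold the retrieved MC-return into the reward
            return expanded_state, expanded_reward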

    Maximum Reward Formulation In Reinforcement Learning

    Reinforcement learning (RL) algorithms typically deal with maximizing the expected cumulative return (discounted or undiscounted, finite or infinite horizon). However, several crucial applications in the real world, such as drug discovery, do not fit within this framework because an RL agent only needs to identify states (molecules) that achieve the highest reward within a trajectory and does not need to optimize for the expected cumulative return. In this work, we formulate an objective function to maximize the expected maximum reward along a trajectory, derive a novel functional form of the Bellman equation, introduce the corresponding Bellman operators, and provide a proof of convergence. Using this formulation, we achieve state-of-the-art results on the task of molecule generation that mimics a real-world drug discovery pipeline. Comment: 13 pages, 5 figures.
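
    One way to write the max-reward objective and a Bellman-style recursion consistent with it, for a deterministic policy \(\pi\), is sketched below; the paper's exact operator, discounting, and stochasticity conventions may differ.

        \[
        V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\,\max_{t \ge 0} r(s_t, a_t) \;\middle|\; s_0 = s\right],
        \qquad
        Q^{\pi}(s, a) \;=\; \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\!\left[\max\!\big(r(s, a),\; Q^{\pi}(s', \pi(s'))\big)\right].
        \]

    Here the usual backup \(r + \gamma \, Q^{\pi}(s', \pi(s'))\) is replaced by a maximum, so the operator propagates the best reward encountered along the trajectory rather than the accumulated return.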

    CrossNorm: Normalization for Off-Policy TD Reinforcement Learning

    Off-policy temporal difference (TD) methods are a powerful class of reinforcement learning (RL) algorithms. Intriguingly, deep off-policy TD algorithms are not commonly used in combination with feature normalization techniques, despite positive effects of normalization in other domains. We show that naive application of existing normalization techniques is indeed not effective, but that well-designed normalization improves optimization stability and removes the necessity of target networks. In particular, we introduce a normalization based on a mixture of on- and off-policy transitions, which we call cross-normalization. It can be regarded as an extension of batch normalization that re-centers data for two different distributions, as present in off-policy learning. Applied to DDPG and TD3, cross-normalization improves over the state of the art across a range of MuJoCo benchmark tasks.
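
    A hedged sketch of the mixture idea: compute normalization statistics over a batch that combines on-policy and off-policy transitions and apply them to both. The equal mixing, the affine-free normalization, and the function name are simplifying assumptions for illustration, not the paper's exact layer.

        import torch

        def cross_normalize(on_policy_feats, off_policy_feats, eps=1e-5):
            """Re-center and rescale both feature batches with shared mixture statistics."""
            mixed = torch.cat([on_policy_feats, off_policy_feats], dim=0)
            mean = mixed.mean(dim=0, keepdim=True)   # statistics over the on/off-policy mixture
            std = mixed.std(dim=0, keepdim=True)
            normalize = lambda x: (x - mean) / (std + eps)
            return normalize(on_policy_feats), normalize(off_policy_feats)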