Addressing Function Approximation Error in Actor-Critic Methods
In value-based reinforcement learning methods such as deep Q-learning,
function approximation errors are known to lead to overestimated value
estimates and suboptimal policies. We show that this problem persists in an
actor-critic setting and propose novel mechanisms to minimize its effects on
both the actor and the critic. Our algorithm builds on Double Q-learning, by
taking the minimum value between a pair of critics to limit overestimation. We
draw the connection between target networks and overestimation bias, and
suggest delaying policy updates to reduce per-update error and further improve
performance. We evaluate our method on the suite of OpenAI gym tasks,
outperforming the state of the art in every environment tested. Comment: Accepted at ICML 201
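The two mechanisms named in this abstract, taking the minimum over a pair of critics and delaying policy updates, can be sketched roughly as follows. This is a minimal illustration with scalar values, not the authors' implementation; the function names, the discount `gamma`, and the `policy_delay` of 2 are assumptions made for the example.

```python
def clipped_double_q_target(reward, done, q1_next, q2_next, gamma=0.99):
    """TD target built from the smaller of two critic estimates.

    Using min(q1_next, q2_next) limits overestimation: the target can
    never exceed the more pessimistic critic's value. `gamma` is an
    assumed discount factor.
    """
    min_q_next = min(q1_next, q2_next)
    # No bootstrapping past a terminal transition (done == 1.0).
    return reward + gamma * (1.0 - done) * min_q_next


def should_update_actor(step, policy_delay=2):
    """Delayed policy updates: update the actor once every
    `policy_delay` critic updates (the value 2 is an assumption)."""
    return step % policy_delay == 0


target = clipped_double_q_target(
    reward=1.0, done=0.0, q1_next=10.0, q2_next=8.0, gamma=0.5
)
print(target)  # 1.0 + 0.5 * 8.0 = 5.0
```

In a full training loop the critics would be regressed toward `target` at every step, while the actor would change only on steps where `should_update_actor` returns true.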
Game Theory Based Distributed Coordination with Multi-Agent Reinforcement Learning
We study the problem of automated object manipulation using the two arms of a Baxter robot. The robot uses a novel multi-agent reinforcement learning strategy to learn how to complete the task without any prior experience. It learns which actions to take by storing its interactions with the environment and using these experiences to create a policy that guides future actions. Each of Baxter’s arms is modeled as an independent agent that can move and learn separately from the other. Each arm learns an independent policy (i.e., a mapping from environment state to robot action) for how best to move in order to complete a collaborative, two-arm task (e.g., pushing an item or pick-and-place). The individual agents are trained using the standard TD3 algorithm, which learns from experiences that record how well the agent’s past actions guided it towards completing the task. TD3 uses an actor network, which takes the states (joint angles) as input and outputs the actions (joint movements), and twin critic networks, which evaluate the quality of those actions. The agents’ actions are coordinated through a game theory-based distributed coordination strategy. This coordination learning framework yields a policy that produces a good set of actions for each arm to execute. Finally, Baxter uses this policy to complete tasks with both arms in collaboration. To the best of our knowledge, this work is the first to use a game theory-based strategy for dual-arm manipulation learning.
Episodic Reinforcement Learning with Expanded State-reward Space
Empowered by deep neural networks, deep reinforcement learning (DRL) has
demonstrated tremendous empirical successes in various domains, including
games, health care, and autonomous driving. Despite these advancements, DRL is
still identified as data-inefficient as effective policies demand vast numbers
of environmental samples. Recently, episodic control (EC)-based model-free DRL
methods enable sample efficiency by recalling past experiences from episodic
memory. However, existing EC-based methods can suffer from misalignment between
the state and reward spaces because they neglect the (past) retrieved states,
which carry extensive information; this likely causes inaccurate value
estimation and degraded policy performance. To
tackle this issue, we introduce an efficient EC-based DRL framework with
expanded state-reward space, where the expanded states used as the input and
the expanded rewards used in the training both contain historical and current
information. To be specific, we reuse the historical states retrieved by EC as
part of the input states and integrate the retrieved MC-returns into the
immediate reward in each interactive transition. As a result, our method
simultaneously makes full use of the retrieved information and better
evaluates state values through a Temporal Difference (TD) loss.
Empirical results on challenging Box2d and Mujoco tasks demonstrate the
superiority of our method over a recent sibling method and common baselines.
Further, we also verify our method's effectiveness in alleviating Q-value
overestimation by additional experiments of Q-value comparison. Comment: Accepted at AAMAS'2
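The expansion described in this abstract, reusing a retrieved historical state as part of the input and folding the retrieved MC-return into the immediate reward, might look roughly like the sketch below. The abstract does not say how the two are combined, so concatenation and a weighted sum are assumptions made purely for illustration, and `expand_transition` and `weight` are hypothetical names.

```python
def expand_transition(state, reward, retrieved_state, retrieved_return,
                      weight=0.5):
    """Build an expanded (state, reward) pair for one transition.

    Assumptions (not stated in the abstract): the state is expanded by
    concatenation, and the retrieved MC-return is mixed into the reward
    through a weighted sum with a hypothetical coefficient `weight`.
    """
    # Expanded state: current observation followed by the state
    # retrieved from episodic memory.
    expanded_state = list(state) + list(retrieved_state)
    # Expanded reward: immediate reward plus a share of the
    # retrieved Monte Carlo return.
    expanded_reward = reward + weight * retrieved_return
    return expanded_state, expanded_reward


s, r = expand_transition([0.1, 0.2], 1.0, [0.3, 0.4], 4.0, weight=0.5)
print(s)  # [0.1, 0.2, 0.3, 0.4]
print(r)  # 1.0 + 0.5 * 4.0 = 3.0
```

The expanded pair would then stand in for the raw state and reward when computing the TD loss.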
Maximum Reward Formulation In Reinforcement Learning
Reinforcement learning (RL) algorithms typically deal with maximizing the
expected cumulative return (discounted or undiscounted, finite or infinite
horizon). However, several crucial applications in the real world, such as drug
discovery, do not fit within this framework because an RL agent only needs to
identify states (molecules) that achieve the highest reward within a trajectory
and does not need to optimize for the expected cumulative return. In this work,
we formulate an objective function to maximize the expected maximum reward
along a trajectory, derive a novel functional form of the Bellman equation,
introduce the corresponding Bellman operators, and provide a proof of
convergence. Using this formulation, we achieve state-of-the-art results on the
task of molecule generation that mimics a real-world drug discovery pipeline. Comment: 13 pages, 5 figure
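To make the contrast with the cumulative-return objective concrete, the sketch below compares a standard backup with one plausible form of a maximum-reward backup in a deterministic, tabular setting. The abstract does not state the functional form, so the rule `max(reward, gamma * best_next_value)` is an assumption for illustration, as are the function names and the toy values.

```python
def cumulative_backup(reward, next_values, gamma=0.9):
    """Standard backup: immediate reward plus the discounted best
    next value (zero after a terminal transition)."""
    if not next_values:
        return reward
    return reward + gamma * max(next_values)


def max_reward_backup(reward, next_values, gamma=0.9):
    """Assumed maximum-reward backup: the larger of the immediate
    reward and the discounted best value reachable afterwards."""
    if not next_values:
        # Terminal transition: only the immediate reward counts.
        return reward
    return max(reward, gamma * max(next_values))


# Toy two-step chain: reward 1.0 now, reward 5.0 on the final step.
q_last = max_reward_backup(5.0, [], gamma=0.5)
q_first = max_reward_backup(1.0, [q_last], gamma=0.5)
q_first_cumulative = cumulative_backup(
    1.0, [cumulative_backup(5.0, [], gamma=0.5)], gamma=0.5
)
print(q_first)             # max(1.0, 0.5 * 5.0) = 2.5
print(q_first_cumulative)  # 1.0 + 0.5 * 5.0 = 3.5
```

Under the maximum-reward rule the early reward of 1.0 adds nothing to the value of the first state; only the best single reward along the trajectory matters, which matches the drug-discovery motivation given above.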
CrossNorm: Normalization for Off-Policy TD Reinforcement Learning
Off-policy temporal difference (TD) methods are a powerful class of
reinforcement learning (RL) algorithms. Intriguingly, deep off-policy TD
algorithms are not commonly used in combination with feature normalization
techniques, despite positive effects of normalization in other domains. We show
that naive application of existing normalization techniques is indeed not
effective, but that well-designed normalization improves optimization stability
and removes the necessity of target networks. In particular, we introduce a
normalization based on a mixture of on- and off-policy transitions, which we
call cross-normalization. It can be regarded as an extension of batch
normalization that re-centers data for two different distributions, as present
in off-policy learning. Applied to DDPG and TD3, cross-normalization improves
over the state of the art across a range of MuJoCo benchmark tasks.
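The re-centering idea in this abstract can be sketched as follows: both batches are shifted by one shared mean computed from a mixture of the two distributions. This is a simplified illustration on scalar features, not the authors' method; the mixing weight `alpha`, the function name, and the omission of variance scaling are assumptions.

```python
def cross_normalize(off_policy_batch, on_policy_batch, alpha=0.5):
    """Re-center two batches of scalar features with one shared mean.

    `alpha` is a hypothetical mixing weight between the off-policy and
    on-policy batch means. Only re-centering is shown; scaling by a
    variance estimate is left out of this sketch.
    """
    mean_off = sum(off_policy_batch) / len(off_policy_batch)
    mean_on = sum(on_policy_batch) / len(on_policy_batch)
    # Mixture mean shared by both distributions.
    shared_mean = alpha * mean_off + (1.0 - alpha) * mean_on
    centered_off = [x - shared_mean for x in off_policy_batch]
    centered_on = [x - shared_mean for x in on_policy_batch]
    return centered_off, centered_on


off, on = cross_normalize([1.0, 3.0], [5.0, 7.0], alpha=0.5)
print(off)  # [-3.0, -1.0]
print(on)   # [1.0, 3.0]
```

Normalizing each batch with its own statistics would instead map both to the same centered values and erase the offset between the two distributions; a shared mean preserves that offset.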