3 research outputs found
Reinforcement Learning with Goal-Distance Gradient
Reinforcement learning usually trains agents with reward feedback from the environment. In real environments, however, rewards are sparse, and some environments provide no rewards at all. Most current methods struggle to perform well in sparse-reward or reward-free environments. Although shaped rewards are effective for solving sparse-reward tasks, they are limited to specific problems, and learning remains susceptible to local optima. We propose a model-free method that does not rely on environmental rewards to solve the sparse-reward problem in general environments. Our method uses the minimum number of transitions between states as a distance measure in place of environmental rewards, and proposes a goal-distance gradient to achieve policy improvement. We also introduce a bridge-point planning method, based on the characteristics of our method, to improve exploration efficiency and thereby solve more complex tasks. Experiments show that our method outperforms previous work on sparse-reward and local-optimum problems in complex environments.
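The core idea, using the minimum number of transitions between states as a goal-conditioned distance, can be illustrated with a small sketch. The following is a hypothetical tabular version under assumed names (`init_distance`, `update_distance`, `greedy_action`, `successors` are all illustrative, not the authors' code): the distance table shrinks toward a Bellman-style target of one step plus the successor's remaining distance, and policy improvement then moves "downhill" in predicted distance, with -d standing in for the missing environment reward.

```python
import numpy as np

# Hypothetical sketch (not the paper's implementation): tabular goal-distance
# learning on a small discrete MDP. d[s, g] estimates the minimum number of
# transitions from state s to goal g.

def init_distance(num_states, d_max=100.0):
    d = np.full((num_states, num_states), d_max)  # pessimistic initial distances
    np.fill_diagonal(d, 0.0)                      # every state is 0 steps from itself
    return d

def update_distance(d, s, s_next, lr=0.5):
    """Shrink d[s, :] toward min(d[s, :], 1 + d[s_next, :]) from one observed transition."""
    target = 1.0 + d[s_next]            # one real step plus remaining distance, per goal
    better = np.minimum(d[s], target)   # distances only move toward the minimum
    d[s] += lr * (better - d[s])
    return d

def greedy_action(d, successors, s, g):
    """Pick the action whose successor minimizes the remaining distance to goal g.
    successors[s] is an assumed mapping from action to next state."""
    return min(successors[s], key=lambda a: d[successors[s][a], g])
```

In this reading, bridge points would be intermediate states with reliably small learned distances from both the start and the goal, so planning through them splits a long-horizon task into shorter, easier segments.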
Fever Basketball: A Complex, Flexible, and Asynchronized Sports Game Environment for Multi-agent Reinforcement Learning
The development of deep reinforcement learning (DRL) has benefited from the emergence of a variety of game environments in which new challenging problems are posed and new algorithms can be tested safely and quickly, such as board games, RTS, FPS, and MOBA games. However, many existing environments lack complexity and flexibility, and assume that actions are executed synchronously in multi-agent settings, which makes them less valuable. We introduce the Fever Basketball game, a novel reinforcement learning environment in which agents are trained to play basketball. It is a complex and challenging environment that supports multiple characters, multiple positions, and both single-agent and multi-agent player control modes. In addition, to better simulate real-world basketball games, the execution time of actions differs among players, which makes Fever Basketball a novel asynchronized environment. We evaluate commonly used multi-agent algorithms, both independent learners and joint-action learners, in three game scenarios of varying difficulty, and heuristically propose two baseline methods to diminish the extra non-stationarity introduced by asynchronism in the Fever Basketball benchmarks. Besides, we propose an integrated curricula training (ICT) framework to better handle Fever Basketball problems, which includes several game-rule-based cascading curricula learners and a coordination curricula switcher focused on enhancing coordination within the team. The results show that the game remains challenging and can serve as a benchmark environment for studies of long-horizon tasks, sparse rewards, credit assignment, and non-stationarity in multi-agent settings.
Comment: 7 pages, 12 figures
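The asynchronism the abstract describes can be pictured as an event-driven control loop rather than a lockstep step function. Below is a minimal, hypothetical sketch of such a loop; `env.observe`, `env.apply`, and the `duration` table are placeholder names for illustration, not the Fever Basketball API.

```python
import heapq

# Hypothetical sketch: asynchronous multi-agent control. Each agent's action
# occupies a different amount of game time, so decision points interleave
# instead of aligning, and between one agent's two decisions the others may
# have acted several times (the extra non-stationarity the paper targets).

def run_episode(env, policies, duration, horizon=1000.0):
    # Event queue of (time of next decision, agent id); all agents act at t = 0.
    events = [(0.0, agent) for agent in range(len(policies))]
    heapq.heapify(events)
    while events:
        t, agent = heapq.heappop(events)   # agent whose previous action finished earliest
        if t >= horizon:
            break
        obs = env.observe(agent)           # per-agent observation at time t
        action = policies[agent](obs)
        env.apply(agent, action)           # effect unfolds over the action's duration
        heapq.heappush(events, (t + duration[action], agent))
```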
Learning to Play Imperfect-Information Games by Imitating an Oracle Planner
We consider learning to play multiplayer imperfect-information games with
simultaneous moves and large state-action spaces. Previous attempts to tackle
such challenging games have largely focused on model-free learning methods,
often requiring hundreds of years of experience to produce competitive agents.
Our approach is based on model-based planning. We tackle the problem of partial
observability by first building an (oracle) planner that has access to the full
state of the environment and then distilling the knowledge of the oracle to a
(follower) agent which is trained to play the imperfect-information game by
imitating the oracle's choices. We experimentally show that planning with naive
Monte Carlo tree search does not perform very well in large combinatorial
action spaces. We therefore propose planning with a fixed-depth tree search and
decoupled Thompson sampling for action selection. We show that the planner is
able to discover efficient playing strategies in the games of Clash Royale and
Pommerman, and that the follower policy successfully learns to implement them by training on a few hundred battles.
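The key trick for the large combinatorial action space is that Thompson sampling is decoupled across players: each player samples from independent posteriors over its own actions, so the joint action space is never enumerated. A minimal sketch of one search node follows; the class name and the Beta/Bernoulli payoff model are assumptions for illustration, not details from the paper.

```python
import numpy as np

# Hypothetical sketch: decoupled Thompson sampling at a single node of a
# fixed-depth search over a simultaneous-move game. Each player keeps
# independent Beta posteriors over its own actions only.

class DecoupledTSNode:
    def __init__(self, num_players, num_actions):
        self.wins = np.ones((num_players, num_actions))    # Beta(1, 1) priors
        self.losses = np.ones((num_players, num_actions))

    def select(self, rng):
        """Each player samples its posteriors and independently picks its best action."""
        samples = rng.beta(self.wins, self.losses)   # shape (num_players, num_actions)
        return samples.argmax(axis=1)                # decoupled per-player choice

    def update(self, joint_action, payoffs):
        """payoffs[p] in [0, 1], e.g. a value estimate from the depth-limited search."""
        for p, a in enumerate(joint_action):
            self.wins[p, a] += payoffs[p]
            self.losses[p, a] += 1.0 - payoffs[p]
```

Under this reading, the oracle planner repeats select/update at each node down to the fixed depth using full-state values, and distillation then treats the planner's root choices as labels for training the follower on imperfect-information observations with a standard imitation loss.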