4 research outputs found
Distilling Reinforcement Learning Tricks for Video Games
Reinforcement learning (RL) research focuses on general solutions that can be
applied across different domains. This results in methods that RL practitioners
can use in almost any domain. However, recent studies often lack the
engineering steps ("tricks") which may be needed to effectively use RL, such as
reward shaping, curriculum learning, and splitting a large task into smaller
chunks. Such tricks are common, if not necessary, to achieve state-of-the-art
results and win RL competitions. To ease this engineering effort, we distill
descriptions of tricks from state-of-the-art results and study how well these
tricks can improve a standard deep Q-learning agent. The long-term goal of this
work is to enable combining proven RL methods with domain-specific tricks by
providing a unified software framework and accompanying insights in multiple
domains.
Comment: To appear in IEEE Conference on Games 2021. Experiment code is
available at https://github.com/Miffyli/rl-human-prior-trick
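As a concrete illustration of one such trick, below is a minimal sketch of
potential-based reward shaping written as a Gymnasium wrapper. The potential
function and the CartPole usage line are hypothetical stand-ins, not the
paper's implementation; the actual framework lives in the linked repository.

```python
import gymnasium as gym


class PotentialShapingWrapper(gym.Wrapper):
    """Adds the shaping term gamma * phi(s') - phi(s) to each reward."""

    def __init__(self, env, potential_fn, gamma=0.99):
        super().__init__(env)
        self.potential_fn = potential_fn  # phi: observation -> float
        self.gamma = gamma
        self._last_potential = 0.0

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._last_potential = self.potential_fn(obs)
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        potential = self.potential_fn(obs)
        # Potential-based shaping leaves the optimal policy unchanged
        # (Ng et al., 1999) while densifying the learning signal.
        reward += self.gamma * potential - self._last_potential
        self._last_potential = potential
        return obs, reward, terminated, truncated, info


# Hypothetical usage: shape CartPole to favour staying near the centre.
env = PotentialShapingWrapper(gym.make("CartPole-v1"), lambda obs: -abs(obs[0]))
```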
Reinforcement Learning with Goal-Distance Gradient
Reinforcement learning usually trains agents with reward feedback from the
environment. In real environments, however, rewards are sparse, and some
environments provide no rewards at all. Most current methods struggle to
perform well in sparse-reward or reward-free environments. Although shaped
rewards are effective for solving sparse-reward tasks, they are limited to
specific problems, and learning with them is susceptible to local optima. We
propose a model-free method that does not rely on environmental rewards to
solve the sparse-reward problem in general environments. Our method uses the
minimum number of transitions between states as a distance measure in place
of environmental rewards and proposes a goal-distance gradient to achieve
policy improvement. We also introduce a bridge-point planning method, based
on the characteristics of our approach, to improve exploration efficiency
and thereby solve more complex tasks. Experiments show that our method
outperforms previous work on sparse-reward and local-optimum problems in
complex environments.
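The abstract does not spell out the formulation, but the core idea, replacing
environmental rewards with the minimum number of transitions to the goal and
descending that distance, can be illustrated with a small tabular sketch. The
BFS over a known transition graph is a simplifying assumption; the paper
learns this distance model-free from experience.

```python
from collections import deque


def min_transition_distances(transitions, goal):
    """BFS on a known graph: dist[s] = fewest transitions from s to goal."""
    reverse = {}  # which states reach s in one step?
    for s, next_states in transitions.items():
        for s2 in next_states:
            reverse.setdefault(s2, []).append(s)
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        s = queue.popleft()
        for prev in reverse.get(s, []):
            if prev not in dist:
                dist[prev] = dist[s] + 1
                queue.append(prev)
    return dist


def greedy_policy(state, transitions, dist):
    """Policy improvement: move to the successor closest to the goal."""
    return min(transitions[state], key=lambda s2: dist.get(s2, float("inf")))


# Toy graph: states 0..3, with a shortcut from 0 straight to 2.
transitions = {0: [1, 2], 1: [2], 2: [3], 3: [3]}
dist = min_transition_distances(transitions, goal=3)
print(dist)                                  # {3: 0, 2: 1, 0: 2, 1: 2}
print(greedy_policy(0, transitions, dist))   # 2 (takes the shortcut)
```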
Counter-Strike Deathmatch with Large-Scale Behavioural Cloning
This paper describes an AI agent that plays the popular first-person-shooter
(FPS) video game `Counter-Strike: Global Offensive' (CSGO) from pixel input.
The agent, a deep neural network, matches the performance of the medium
difficulty built-in AI on the deathmatch game mode, whilst adopting a humanlike
play style. Unlike much prior work in games, no API is available for CSGO, so
algorithms must train and run in real-time. This limits the quantity of
on-policy data that can be generated, precluding many reinforcement learning
algorithms. Our solution uses behavioural cloning: training on a large noisy
dataset scraped from human play on online servers (4 million frames, comparable
in size to ImageNet), and a smaller dataset of high-quality expert
demonstrations. This scale is an order of magnitude larger than prior work on
imitation learning in FPS games.
Comment: Offline Reinforcement Learning Workshop at Neural Information
Processing Systems, 2021
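A minimal behavioural-cloning sketch in PyTorch is given below. The tiny
network, the NUM_ACTIONS constant, and the random stand-in batch are
illustrative assumptions, not the paper's architecture or data; the real
agent trains on roughly 4 million scraped frames plus expert demonstrations.

```python
import torch
import torch.nn as nn

NUM_ACTIONS = 10  # placeholder for the game's discretised action space

# Tiny stand-in CNN; for 3x84x84 input the conv stack flattens to 64*9*9.
policy = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 9 * 9, 256), nn.ReLU(),
    nn.Linear(256, NUM_ACTIONS),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Random stand-in for batches of (frame, human_action) pairs from servers.
frames = torch.rand(32, 3, 84, 84)
actions = torch.randint(0, NUM_ACTIONS, (32,))

for _ in range(100):
    logits = policy(frames)
    loss = loss_fn(logits, actions)  # supervised: imitate the logged action
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```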
Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping
Reward shaping is an effective technique for incorporating domain knowledge
into reinforcement learning (RL). Existing approaches such as potential-based
reward shaping normally make full use of a given shaping reward function.
However, since the transformation of human knowledge into numeric reward values
is often imperfect due to reasons such as human cognitive bias, completely
utilizing the shaping reward function may fail to improve the performance of RL
algorithms. In this paper, we consider the problem of adaptively utilizing a
given shaping reward function. We formulate the utilization of shaping rewards
as a bi-level optimization problem, where the lower level is to optimize policy
using the shaping rewards and the upper level is to optimize a parameterized
shaping weight function for true reward maximization. We formally derive the
gradient of the expected true reward with respect to the shaping weight
function parameters and accordingly propose three learning algorithms based on
different assumptions. Experiments in sparse-reward cartpole and MuJoCo
environments show that our algorithms can fully exploit beneficial shaping
rewards while ignoring unbeneficial shaping rewards or even transforming
them into beneficial ones.
Comment: Accepted by NeurIPS 2020
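To make the bi-level structure concrete, here is a toy sketch: the lower
level trains a policy on the true reward plus a weighted shaping reward, and
the upper level adjusts the weight to maximize the true return. The
finite-difference outer update and all helper names are illustrative
assumptions, standing in for the paper's derived analytic gradient.

```python
def shaped_reward(r_true, r_shaping, w):
    """Lower level: the policy is trained on r_true + w * r_shaping."""
    return r_true + w * r_shaping


def outer_step(w, train_policy, true_return, eps=0.05, lr=0.5):
    """Upper level: nudge the shaping weight toward higher TRUE return."""
    ret_plus = true_return(train_policy(min(w + eps, 1.0)))
    ret_minus = true_return(train_policy(max(w - eps, 0.0)))
    grad = (ret_plus - ret_minus) / (2 * eps)  # finite-difference estimate
    return min(max(w + lr * grad, 0.0), 1.0)


# Toy stand-ins: suppose shaping helps up to w = 0.6, then starts to hurt.
train = lambda w: w                        # "policy" is just the weight here
true_return = lambda pi: -(pi - 0.6) ** 2  # true return peaks at w = 0.6
w = 0.0
for _ in range(20):
    w = outer_step(w, train, true_return)
print(round(w, 2))  # converges to 0.6, the beneficial amount of shaping
```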