A new Potential-Based Reward Shaping for Reinforcement Learning Agent
Potential-based reward shaping (PBRS) is a particular category of machine
learning methods which aims to improve the learning speed of a reinforcement
learning agent by extracting and utilizing extra knowledge while performing a
task. There are two steps in the process of transfer learning: extracting
knowledge from previously learned tasks and transferring that knowledge to use
it in a target task. The latter step is well discussed in the literature with
various methods being proposed for it, while the former has been explored less.
With this in mind, the type of knowledge that is transferred is very important
and can lead to considerable improvement. In the literature on both transfer
learning and potential-based reward shaping, a subject that has never been
addressed is the knowledge gathered during the learning process itself. In this
paper, we present a novel potential-based reward shaping method that attempts
to extract knowledge from the learning process. The proposed method extracts
knowledge from episodes' cumulative rewards. It has been evaluated in the
Arcade Learning Environment, and the results indicate an improvement in the
learning process for both single-task and multi-task reinforcement learning
agents.
Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces
While recent advances in deep reinforcement learning have allowed autonomous
learning agents to succeed at a variety of complex tasks, existing algorithms
generally require a lot of training data. One way to increase the speed at
which agents are able to learn to perform tasks is by leveraging the input of
human trainers. Although such input can take many forms, real-time,
scalar-valued feedback is especially useful in situations where it proves
difficult or impossible for humans to provide expert demonstrations. Previous
approaches have shown the usefulness of human input provided in this fashion
(e.g., the TAMER framework), but they have thus far not considered
high-dimensional state spaces or employed the use of deep learning. In this
paper, we do both: we propose Deep TAMER, an extension of the TAMER framework
that leverages the representational power of deep neural networks in order to
learn complex tasks in just a short amount of time with a human trainer. We
demonstrate Deep TAMER's success by using it and just 15 minutes of
human-provided feedback to train an agent that performs better than humans on
the Atari game of Bowling - a task that has proven difficult for even
state-of-the-art reinforcement learning methods.
Comment: 9 pages, 6 figures
Pattern transfer learning for reinforcement learning in order dispatching
Order dispatch is one of the central problems for ridesharing platforms. Recently, value-based reinforcement learning algorithms have shown promising performance in solving this task. However, in real-world applications, the demand-supply system is typically nonstationary over time, which makes it challenging to reuse data generated in different time periods to learn the value function. In this work, motivated by the fact that the relative relationship between the values of some states is largely stable across various environments, we propose a pattern transfer learning framework for value-based reinforcement learning in the order dispatch problem. Our method efficiently captures the value patterns by incorporating a concordance penalty. The superior performance of the proposed method is supported by experiments.
Learning to Terminate in Object Navigation
This paper tackles the critical challenge of object navigation in autonomous
navigation systems, particularly focusing on the problem of target approach and
episode termination in environments with long optimal episode length in Deep
Reinforcement Learning (DRL) based methods. While effective in environment
exploration and object localization, conventional DRL methods often struggle
with optimal path planning and termination recognition due to a lack of depth
information. To overcome these limitations, we propose a novel approach, namely
the Depth-Inference Termination Agent (DITA), which incorporates a supervised
model called the Judge Model to implicitly infer object-wise depth and decide
termination jointly with reinforcement learning. We train our Judge Model in
parallel with reinforcement learning and supervise the former efficiently using
the reward signal. Our evaluation shows that the method achieves superior
performance: a 9.3% gain in success rate over our baseline method across all
room types and a 51.2% improvement on long-episode environments, while
maintaining a slightly better Success Weighted by Path Length (SPL). Code,
resources, and visualizations are available at:
https://github.com/HuskyKingdom/DITA_acml2023
Comment: 16 pages
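For readers unfamiliar with the SPL metric mentioned in this abstract, the sketch below computes it under its commonly used definition: the mean over episodes of S_i * l_i / max(p_i, l_i), where S_i is a binary success flag, l_i the shortest-path length, and p_i the length of the path the agent actually took. The function name `spl` and the sample numbers are illustrative assumptions, not taken from the DITA code base.

```python
def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length over a batch of episodes.

    successes        -- iterable of 0/1 (or bool) success flags, one per episode
    shortest_lengths -- optimal path length per episode (assumed > 0)
    path_lengths     -- length of the path actually taken per episode
    """
    successes = list(successes)
    shortest_lengths = list(shortest_lengths)
    path_lengths = list(path_lengths)
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same number of episodes")
    if not successes:
        raise ValueError("at least one episode is required")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if success:
            # A successful episode scores 1.0 only if the agent's path is optimal;
            # longer paths are penalised proportionally.
            total += shortest / max(taken, shortest)
    return total / len(successes)


# Illustrative data: one inefficient success, one failure, one optimal success.
example_spl = spl([1, 0, 1], [10.0, 5.0, 4.0], [20.0, 5.0, 4.0])
```

Failed episodes contribute zero regardless of path length, which is why SPL can stay roughly flat even when the raw success rate improves.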
Reward shaping using directed graph convolution neural networks for reinforcement learning and games
Game theory can employ reinforcement learning algorithms to identify the optimal policy or equilibrium solution. Potential-based reward shaping (PBRS) methods are widely used to accelerate reinforcement learning while ensuring that the optimal policy remains unchanged. Existing PBRS research performs message passing based on graph convolution neural networks (GCNs) to propagate information from rewarding states. However, in an irreversible time-series reinforcement learning problem, undirected graphs will not only mislead message-passing schemes but also lose a distinctive direction structure. In this paper, a novel approach called directed graph convolution neural networks for reward shaping (φDCN) is proposed to tackle this problem. The key innovation of φDCN is the extension of spectral-based undirected graph convolution to directed graphs. Messages can be efficiently propagated by leveraging a directed graph Laplacian as a substitute for the state transition matrix. As a consequence, potential-based reward shaping can then be implemented using the propagated messages. The incorporation of temporal dependencies between states makes φDCN more suitable for real-world scenarios than existing potential-based reward shaping methods based on undirected graph convolutional networks. Preliminary experiments demonstrate that the proposed φDCN exhibits a substantial improvement compared to other competing algorithms on both Atari and MuJoCo benchmarks.
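The policy-invariance property of PBRS mentioned above can be illustrated with a small sketch. It uses the standard shaping term F(s, s') = γΦ(s') − Φ(s) and shows that, along any trajectory, the shaped and unshaped discounted returns differ only by a term that depends on the first and last potentials. The function names and the hand-picked potential values are illustrative assumptions; this is not the φDCN method or any other approach from the listed papers, which learn or propagate Φ in their own ways.

```python
def shaping_reward(phi_s, phi_next, gamma):
    """Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s)."""
    return gamma * phi_next - phi_s


def plain_return(rewards, gamma):
    """Discounted return of the original environment rewards."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def shaped_return(rewards, potentials, gamma):
    """Discounted return of r_t + F(s_t, s_{t+1}) along one trajectory.

    potentials[t] is phi(s_t), so it must have len(rewards) + 1 entries.
    """
    if len(potentials) != len(rewards) + 1:
        raise ValueError("need one potential per visited state")
    total = 0.0
    for t, r in enumerate(rewards):
        f = shaping_reward(potentials[t], potentials[t + 1], gamma)
        total += gamma ** t * (r + f)
    return total


# Hypothetical 3-step episode with a sparse reward at the end.
gamma = 0.9
rewards = [0.0, 0.0, 1.0]
potentials = [0.1, 0.5, 0.9, 0.0]  # phi(s_0..s_3); terminal potential set to 0

difference = shaped_return(rewards, potentials, gamma) - plain_return(rewards, gamma)
# The shaping terms telescope to gamma**T * phi(s_T) - phi(s_0).
telescoped = gamma ** len(rewards) * potentials[-1] - potentials[0]
```

Because the difference depends only on the start and end states and not on the actions taken in between, ranking policies by shaped return gives the same ordering as ranking them by the original return from a given start state.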