Generative Exploration and Exploitation
Sparse reward is one of the biggest challenges in reinforcement learning
(RL). In this paper, we propose a novel method called Generative Exploration
and Exploitation (GENE) to overcome sparse reward. GENE automatically generates
start states to encourage the agent to explore the environment and to exploit
received reward signals. GENE can adaptively tradeoff between exploration and
exploitation according to the varying distributions of states experienced by
the agent as the learning progresses. GENE relies on no prior knowledge about
the environment and can be combined with any RL algorithm, no matter on-policy
or off-policy, single-agent or multi-agent. Empirically, we demonstrate that
GENE significantly outperforms existing methods in three tasks with only binary
rewards, including Maze, Maze Ant, and Cooperative Navigation. Ablation studies
verify the emergence of progressive exploration and automatic reversing.Comment: AAAI'2
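A minimal sketch of the idea the abstract describes, not the authors' implementation: model the distribution of states the agent has experienced, then propose start states from its low-density regions (exploration) or from states that previously led to reward (exploitation). All names and the kernel-density choice here are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def generate_start_states(visited, rewarded, n, explore_ratio=0.5, rng=None):
    """Propose n start states from experienced states (illustrative sketch).

    visited:       (N, d) array of states experienced so far
    rewarded:      (M, d) array of states from which reward was later received
    explore_ratio: fraction of starts drawn from low-density regions
    """
    rng = rng or np.random.default_rng()
    kde = KernelDensity(bandwidth=0.5).fit(visited)

    # Exploration: prefer states the density model considers unlikely.
    log_dens = kde.score_samples(visited)
    p_explore = np.exp(-log_dens)            # inverse-density weighting
    p_explore /= p_explore.sum()
    n_explore = int(n * explore_ratio)
    explore_idx = rng.choice(len(visited), size=n_explore, p=p_explore)

    # Exploitation: restart near states that previously led to reward.
    exploit_idx = rng.choice(len(rewarded), size=n - n_explore)

    return np.vstack([visited[explore_idx], rewarded[exploit_idx]])
```

Shrinking `explore_ratio` as reward signals accumulate is one plausible way to realize the adaptive tradeoff the abstract mentions.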
Model-Based Decentralized Policy Optimization
Decentralized policy optimization has been commonly used in cooperative
multi-agent tasks. However, since all agents update their policies simultaneously, the environment is non-stationary from the perspective of each individual agent, making it hard to guarantee monotonic policy improvement. To make policy improvement stable and monotonic, we propose model-based decentralized policy optimization (MDPO), which incorporates a latent variable function to help construct the transition and reward functions from an individual perspective. We theoretically show that the policy optimization of MDPO is more stable than that of model-free decentralized policy optimization. Moreover, due to non-stationarity, the latent variable function varies over time and is hard to model. We therefore propose a latent variable prediction method to reduce the error of the latent variable function, which theoretically contributes to monotonic policy improvement. Empirically, MDPO indeed achieves superior performance to model-free decentralized policy optimization in a variety of cooperative multi-agent tasks.

Comment: 24 page
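A minimal sketch, under assumed architectural details, of the core idea: from agent i's perspective, the dynamics depend on a latent variable z summarizing the (non-stationary) effect of the other agents; a local model predicts the next state and reward from (s_i, a_i, z), and a predictor estimates z from the agent's recent local history. The module names and the GRU predictor are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LatentDynamicsModel(nn.Module):
    """Individual-perspective dynamics model conditioned on a latent z."""

    def __init__(self, state_dim, action_dim, latent_dim, hidden=128):
        super().__init__()
        # Transition/reward head: (s_i, a_i, z) -> (next state, reward).
        self.dynamics = nn.Sequential(
            nn.Linear(state_dim + action_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim + 1),   # next state + scalar reward
        )
        # Latent predictor: infer z from a window of past local transitions.
        self.latent_predictor = nn.GRU(state_dim + action_dim, latent_dim,
                                       batch_first=True)

    def forward(self, state, action, history):
        # history: (B, T, state_dim + action_dim) recent local transitions
        _, h = self.latent_predictor(history)
        z = h[-1]                               # (B, latent_dim)
        out = self.dynamics(torch.cat([state, action, z], dim=-1))
        next_state, reward = out[:, :-1], out[:, -1]
        return next_state, reward, z
```

Because z is re-predicted from fresh history, the model can track the other agents' changing policies, which is the role the latent variable prediction method plays in the abstract.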
Online Tuning for Offline Decentralized Multi-Agent Reinforcement Learning
Offline reinforcement learning can learn effective policies from a fixed dataset, which is promising for real-world applications. However, in offline decentralized multi-agent reinforcement learning, due to the discrepancy between the behavior policy and the learned policy, the transition dynamics in offline experiences do not match the transition dynamics in online execution, which creates severe errors in value estimates and leads to uncoordinated, low-performing policies. One way to overcome this problem is to bridge offline training and online tuning. However, considering both deployment efficiency and sample efficiency, we can collect only very limited online experiences, which is insufficient for updating the agent policy with online data alone. To exploit both offline and online experiences when tuning the agents' policies, we introduce online transition correction (OTC), which implicitly corrects the offline transition dynamics by modifying sampling probabilities. We design two types of distances, i.e., embedding-based and value-based distances, to measure the similarity between transitions, and further propose an adaptive rank-based prioritization to sample transitions according to transition similarity. OTC is simple yet effective at increasing data efficiency and improving agent policies in online tuning. Empirically, OTC outperforms baselines in a variety of tasks.
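A minimal sketch of rank-based prioritization as the abstract describes it: offline transitions more similar to freshly collected online transitions receive higher sampling probability. A plain L2 distance between transition feature vectors stands in for the paper's embedding-based and value-based distances; the function names and the exponent `alpha` are illustrative assumptions.

```python
import numpy as np

def rank_based_priorities(offline_feats, online_feats, alpha=0.7):
    """offline_feats: (N, d), online_feats: (M, d) transition features."""
    # Distance of each offline transition to its nearest online transition.
    d = np.linalg.norm(offline_feats[:, None, :] - online_feats[None, :, :],
                       axis=-1).min(axis=1)
    ranks = d.argsort().argsort() + 1    # rank 1 = most similar transition
    p = 1.0 / ranks ** alpha             # heavier weight on low ranks
    return p / p.sum()

def sample_batch(offline_feats, online_feats, batch_size, rng=None):
    """Sample offline transitions with similarity-weighted probabilities."""
    rng = rng or np.random.default_rng()
    p = rank_based_priorities(offline_feats, online_feats)
    return rng.choice(len(offline_feats), size=batch_size, p=p)
```

Ranks, unlike raw distances, are insensitive to the distance scale, which is one reason rank-based prioritization is a common choice in prioritized sampling schemes.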
Learning from Visual Observation via Offline Pretrained State-to-Go Transformer
Learning from visual observation (LfVO), aiming at recovering policies from only visual observation data, is a promising yet challenging problem. Existing LfVO approaches either adopt inefficient online learning schemes or require additional task-specific information like goal states, making them ill-suited for open-ended tasks. To address these issues, we propose a two-stage
framework for learning from visual observation. In the first stage, we
introduce and pretrain State-to-Go (STG) Transformer offline to predict and
differentiate latent transitions of demonstrations. Subsequently, in the second
stage, the STG Transformer provides intrinsic rewards for downstream reinforcement learning tasks, where an agent learns solely from intrinsic
rewards. Empirical results on Atari and Minecraft show that our proposed method
outperforms baselines and in some tasks even achieves performance comparable to
the policy learned from environmental rewards. These results shed light on the
potential of utilizing video-only data to solve difficult visual reinforcement
learning tasks rather than relying on complete offline datasets containing
states, actions, and rewards. The project's website and code can be found at https://sites.google.com/view/stgtransformer.

Comment: 19 page
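A minimal sketch, illustrative rather than the released code at the project site, of how a pretrained transition model can supply intrinsic rewards in the second stage: encode consecutive observations into latent states and reward the agent when the pretrained discriminator scores the latent transition as demonstration-like. The `encoder` and `discriminator` modules and the sigmoid reward shaping are assumptions.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def intrinsic_reward(encoder: nn.Module, discriminator: nn.Module,
                     obs: torch.Tensor, next_obs: torch.Tensor) -> torch.Tensor:
    """obs, next_obs: (B, C, H, W) image observations from the environment."""
    z, z_next = encoder(obs), encoder(next_obs)
    # Discriminator outputs a logit: high if (z, z_next) resembles a latent
    # transition seen in the demonstration videos.
    logits = discriminator(torch.cat([z, z_next], dim=-1)).squeeze(-1)
    return torch.sigmoid(logits)            # intrinsic reward in (0, 1)
```

The downstream RL algorithm would then treat this value as the reward signal in place of the environment's reward, which is the sense in which the agent learns solely from intrinsic rewards.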