Thompson sampling based Monte-Carlo planning in POMDPs
Monte-Carlo tree search (MCTS) has been drawing great interest in recent years for planning under uncertainty. One of the key challenges is the trade-off between exploration and exploitation. To address this, we introduce a novel online planning algorithm for large POMDPs using Thompson sampling based MCTS that balances between cumulative and simple regrets. The proposed algorithm, Dirichlet-Dirichlet-NormalGamma based Partially Observable Monte-Carlo Planning (D2NG-POMCP), treats the accumulated reward of performing an action from a belief state in the MCTS search tree as a random variable following an unknown distribution with hidden parameters. A Bayesian method is used to model and infer the posterior distribution of these parameters by choosing the conjugate prior in the form of a combination of two Dirichlet and one NormalGamma distributions. Thompson sampling is exploited to guide action selection in the search tree. Experimental results confirm that our algorithm outperforms state-of-the-art approaches on several common benchmark problems.
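As a rough illustration of the action-selection idea, the sketch below (not the authors' code) applies Thompson sampling over NormalGamma posteriors on each action's mean return at a single node; the Dirichlet components over observation distributions are omitted, and all names and priors here are illustrative assumptions.

```python
# Minimal sketch, assuming a NormalGamma posterior per action at one tree node.
import numpy as np

class NormalGammaArm:
    """Conjugate NormalGamma posterior over an action's mean return."""
    def __init__(self, mu0=0.0, lam0=1.0, alpha0=1.0, beta0=1.0):
        self.mu, self.lam, self.alpha, self.beta = mu0, lam0, alpha0, beta0

    def update(self, reward):
        # Standard NormalGamma conjugate update for a single observation.
        mu, lam, alpha, beta = self.mu, self.lam, self.alpha, self.beta
        self.mu = (lam * mu + reward) / (lam + 1.0)
        self.lam = lam + 1.0
        self.alpha = alpha + 0.5
        self.beta = beta + 0.5 * lam * (reward - mu) ** 2 / (lam + 1.0)

    def sample_mean(self, rng):
        # Draw precision tau ~ Gamma(alpha, rate=beta), then mean ~ Normal.
        tau = rng.gamma(self.alpha, 1.0 / self.beta)
        return rng.normal(self.mu, 1.0 / np.sqrt(self.lam * tau))

def thompson_select(arms, rng):
    """Pick the action whose sampled mean return is largest."""
    return int(np.argmax([arm.sample_mean(rng) for arm in arms]))

# Toy usage with a bandit-like simulator standing in for MCTS rollouts.
rng = np.random.default_rng(0)
arms = [NormalGammaArm() for _ in range(3)]
for _ in range(100):
    a = thompson_select(arms, rng)
    arms[a].update(rng.normal([0.0, 0.5, 1.0][a], 1.0))
```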
Policy Regularization with Dataset Constraint for Offline Reinforcement Learning
We consider the problem of learning the best possible policy from a fixed
dataset, known as offline Reinforcement Learning (RL). A common class of
existing offline RL methods is policy regularization, which typically
constrains the learned policy to the distribution or support of the behavior
policy. However, distribution and support constraints are overly conservative,
since they both force the policy to choose actions similar to those of the
behavior policy at particular states. This limits the learned policy's
performance, especially when the behavior policy is sub-optimal. In this
paper, we find that
regularizing the policy towards the nearest state-action pair can be more
effective and thus propose Policy Regularization with Dataset Constraint
(PRDC). When updating the policy in a given state, PRDC searches the entire
dataset for the nearest state-action sample and then restricts the policy with
the action of this sample. Unlike previous works, PRDC can guide the policy
with proper behaviors from the dataset, allowing it to choose actions that are
not paired with the given state anywhere in the dataset. This is a softer
constraint that still keeps enough conservatism against out-of-distribution
actions. Empirical
evidence and theoretical analysis show that PRDC can alleviate offline RL's
fundamentally challenging value overestimation issue with a bounded performance
gap. Moreover, on a set of locomotion and navigation tasks, PRDC achieves
state-of-the-art performance compared with existing methods. Code is available
at https://github.com/LAMDA-RL/PRDC
Comment: Accepted to ICML 202
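As a rough illustration of the dataset constraint, the sketch below (not the official PRDC code) brute-force searches the offline dataset for the nearest (state, action) pair to the current (state, policy action) and penalizes the distance to that sample's action; the `beta` weight and all interfaces are hypothetical stand-ins.

```python
# Minimal sketch of a nearest-sample dataset constraint, assuming torch tensors.
import torch

def dataset_constraint_loss(policy, states, data_states, data_actions, beta=2.0):
    """
    policy:       torch module mapping a state batch (B, ds) to actions (B, da)
    states:       (B, ds) states sampled from the offline dataset
    data_states:  (N, ds) all dataset states
    data_actions: (N, da) all dataset actions
    beta:         hypothetical weight on the state part of the joint distance
    """
    actions = policy(states)                                      # (B, da)
    query = torch.cat([beta * states, actions], dim=-1)           # (B, ds+da)
    keys = torch.cat([beta * data_states, data_actions], dim=-1)  # (N, ds+da)
    # Brute-force nearest neighbour; a real implementation would use a KD-tree.
    nearest = torch.cdist(query.detach(), keys).argmin(dim=1)     # (B,)
    target_actions = data_actions[nearest]                        # (B, da)
    # Pull the policy's action toward the nearest dataset action.
    return ((actions - target_actions) ** 2).sum(dim=-1).mean()
```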
Attention-Guided Contrastive Role Representations for Multi-Agent Reinforcement Learning
Real-world multi-agent tasks usually involve dynamic team composition with
the emergence of roles, which should also be a key to efficient cooperation in
multi-agent reinforcement learning (MARL). Drawing inspiration from the
correlation between roles and agents' behavior patterns, we propose a novel
framework of **A**ttention-guided **CO**ntrastive **R**ole representation
learning for **M**ARL (**ACORM**) to promote behavior heterogeneity, knowledge
transfer, and skillful coordination across agents. First, we introduce mutual
information maximization to formalize role representation learning, derive a
contrastive learning objective, and concisely approximate the distribution of
negative pairs. Second, we leverage an attention mechanism to prompt the global
state to attend to learned role representations in value decomposition,
implicitly guiding agent coordination in a skillful role space to yield more
expressive credit assignment. Experiments on challenging StarCraft II
micromanagement and Google research football tasks demonstrate the
state-of-the-art performance of our method and its advantages over existing
approaches. Our code is available at
[https://github.com/NJU-RL/ACORM](https://github.com/NJU-RL/ACORM)
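The contrastive component can be pictured, very roughly, as an InfoNCE-style loss over role embeddings in which agents sharing a role label form positive pairs; this is an illustrative approximation rather than the released ACORM code, and the role labels (e.g., from clustering the embeddings) are assumed.

```python
# Minimal sketch of a contrastive loss over role embeddings (illustrative only).
import torch
import torch.nn.functional as F

def role_contrastive_loss(role_embs, role_labels, temperature=0.1):
    """
    role_embs:   (n_agents, d) role representations from a role encoder
    role_labels: (n_agents,) assumed role assignments, e.g. cluster indices
    """
    z = F.normalize(role_embs, dim=-1)
    logits = z @ z.t() / temperature                       # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(self_mask, float('-inf'))  # exclude self-pairs
    positives = (role_labels.unsqueeze(0) == role_labels.unsqueeze(1)) & ~self_mask
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)        # avoid -inf * 0
    pos_counts = positives.sum(dim=1).clamp(min=1)
    loss = -(log_prob * positives.float()).sum(dim=1) / pos_counts
    # Average over agents that have at least one positive partner.
    return loss[positives.any(dim=1)].mean()
```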
Efficient Deep Reinforcement Learning via Adaptive Policy Transfer
Transfer Learning (TL) has shown great potential to accelerate Reinforcement
Learning (RL) by leveraging prior knowledge from past learned policies of
relevant tasks. Existing transfer approaches either explicitly compute the
similarity between tasks or select appropriate source policies to provide
guided exploration for the target task. However, how to directly optimize the
target policy by alternately utilizing knowledge from appropriate source
policies, without explicitly measuring the similarity, remains an open
problem. In
this paper, we propose a novel Policy Transfer Framework (PTF) to accelerate RL
by taking advantage of this idea. Our framework learns when and which source
policy is best to reuse for the target policy, and when to terminate it, by
modeling multi-policy transfer as an option learning problem. PTF can be
easily combined with existing deep RL approaches. Experimental results show it
significantly accelerates the learning process and surpasses state-of-the-art
policy transfer methods in terms of learning efficiency and final performance
in both discrete and continuous action spaces.
Comment: Accepted by IJCAI'202
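A rough sketch of the option-style transfer module could look like the snippet below, with one head scoring which source policy to reuse and a termination head deciding when to stop reusing it; the architecture and names are assumptions, not the authors' implementation.

```python
# Minimal sketch of an option module over source policies (illustrative only).
import torch
import torch.nn as nn

class OptionModule(nn.Module):
    def __init__(self, state_dim, n_source_policies, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.option_value = nn.Linear(hidden, n_source_policies)  # which source to reuse
        self.termination = nn.Linear(hidden, n_source_policies)   # when to stop reusing it

    def forward(self, state):
        h = self.body(state)
        q_options = self.option_value(h)            # value of reusing each source policy
        beta = torch.sigmoid(self.termination(h))   # per-option termination probability
        return q_options, beta

# Toy usage: pick a source policy to reuse, then sample whether to terminate it.
module = OptionModule(state_dim=8, n_source_policies=3)
q, beta = module(torch.randn(1, 8))
active = q.argmax(dim=-1, keepdim=True)             # index of the reused source policy
terminate = torch.bernoulli(beta.gather(-1, active))
```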
Disentangling Policy from Offline Task Representation Learning via Adversarial Data Augmentation
Offline meta-reinforcement learning (OMRL) enables an agent to tackle novel
tasks while relying solely on a static dataset. For precise and efficient task
identification, existing OMRL research suggests learning separate task
representations that can be incorporated with the policy input, thus forming a
context-based meta-policy. A major approach to training task
representations is to adopt contrastive learning using multi-task offline data.
The dataset typically encompasses interactions from various policies (i.e., the
behavior policies), thus providing a plethora of contextual information
regarding different tasks. Nonetheless, amassing data from a substantial number
of policies is not only impractical but also often unattainable in realistic
settings. Instead, we resort to a more constrained yet practical scenario,
where multi-task data collection occurs with a limited number of policies. We
observed that learned task representations from previous OMRL methods tend to
correlate spuriously with the behavior policy instead of reflecting the
essential characteristics of the task, resulting in unfavorable
out-of-distribution generalization. To alleviate this issue, we introduce a
novel algorithm to disentangle the impact of behavior policy from task
representation learning through a process called adversarial data augmentation.
Specifically, the objective of adversarial data augmentation is not merely to
generate data analogous to offline data distribution; instead, it aims to
create adversarial examples designed to confound learned task representations
and lead to incorrect task identification. Our experiments show that learning
from such adversarial samples significantly enhances the robustness and
effectiveness of the task identification process and realizes satisfactory
out-of-distribution generalization.
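As a rough sketch of the adversarial data augmentation idea, the snippet below perturbs offline transitions by gradient ascent on a task-identification loss (a PGD-style approximation); the encoder and classifier interfaces are hypothetical, and this is not the paper's implementation.

```python
# Minimal sketch: craft adversarial transitions that confound task identification.
import torch
import torch.nn.functional as F

def adversarial_augment(task_encoder, task_classifier, transitions, task_ids,
                        step_size=0.01, n_steps=5):
    """
    task_encoder:    assumed module mapping transitions (B, d) to embeddings (B, k)
    task_classifier: assumed module mapping embeddings to task logits (B, n_tasks)
    transitions:     (B, d) flattened (s, a, r, s') tuples from the offline data
    task_ids:        (B,) ground-truth task indices
    """
    adv = transitions.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        logits = task_classifier(task_encoder(adv))
        # Ascend the loss so the current encoder misidentifies the task.
        loss = F.cross_entropy(logits, task_ids)
        grad, = torch.autograd.grad(loss, adv)
        adv = (adv + step_size * grad.sign()).detach().requires_grad_(True)
    return adv.detach()  # train the task encoder against these samples
```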
Retrosynthetic Planning with Dual Value Networks
Retrosynthesis, which aims to find a route to synthesize a target molecule
from commercially available starting materials, is a critical task in drug
discovery and materials design. Recently, the combination of ML-based
single-step reaction predictors with multi-step planners has led to promising
results. However, the single-step predictors are mostly trained offline to
optimize the single-step accuracy, without considering complete routes. Here,
we leverage reinforcement learning (RL) to improve the single-step predictor,
by using a tree-shaped MDP to optimize complete routes. Specifically, we
propose a novel online training algorithm, called Planning with Dual Value
Networks (PDVN), which alternates between the planning phase and updating
phase. In PDVN, we construct two separate value networks to predict the
synthesizability and cost of molecules, respectively. To maintain the
single-step accuracy, we design a two-branch network structure for the
single-step predictor. On the widely-used USPTO dataset, our PDVN algorithm
improves the search success rate of existing multi-step planners (e.g.,
increasing the success rate from 85.79% to 98.95% for Retro*, and reducing the
number of model calls by half while solving 99.47% of molecules for RetroGraph).
Additionally, PDVN helps find shorter synthesis routes (e.g., reducing the
average route length from 5.76 to 4.83 for Retro*, and from 5.63 to 4.78 for
RetroGraph).
Comment: Accepted to ICML 202
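The dual value networks can be sketched as two heads on a shared molecule encoder, one predicting synthesizability and one predicting expected route cost; this is an illustrative stub with assumed feature dimensions, not the PDVN release.

```python
# Minimal sketch of a dual-value network over molecule features (illustrative only).
import torch
import torch.nn as nn

class DualValueNetwork(nn.Module):
    def __init__(self, mol_feat_dim, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(mol_feat_dim, hidden), nn.ReLU())
        self.syn_head = nn.Linear(hidden, 1)   # probability the molecule is synthesizable
        self.cost_head = nn.Linear(hidden, 1)  # expected cost of a synthesis route

    def forward(self, mol_features):
        h = self.encoder(mol_features)
        return torch.sigmoid(self.syn_head(h)), self.cost_head(h)

# Toy usage with a hypothetical 2048-bit fingerprint as the molecule feature.
net = DualValueNetwork(mol_feat_dim=2048)
p_synthesizable, cost = net(torch.rand(1, 2048))
```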