Efficient Diffusion Policies for Offline Reinforcement Learning
Offline reinforcement learning (RL) aims to learn optimal policies from
offline datasets, where the parameterization of policies is crucial but often
overlooked. Recently, Diffusion-QL significantly boosts the performance of
offline RL by representing a policy with a diffusion model, whose success
relies on a parameterized Markov chain with hundreds of steps for sampling.
However, Diffusion-QL suffers from two critical limitations. 1) It is
computationally inefficient to forward and backward through the whole Markov
chain during training. 2) It is incompatible with maximum likelihood-based RL
algorithms (e.g., policy gradient methods) as the likelihood of diffusion
models is intractable. Therefore, we propose efficient diffusion policy (EDP)
to overcome these two challenges. EDP approximately reconstructs actions from
corrupted ones during training to avoid running the sampling chain. We conduct
extensive experiments on the D4RL benchmark. The results show that EDP can
reduce the diffusion policy training time from 5 days to 5 hours on
gym-locomotion tasks. Moreover, we show that EDP is compatible with various
offline RL algorithms (TD3, CRR, and IQL) and achieves new state-of-the-art results
on D4RL, surpassing previous methods by large margins. Our code is available at
https://github.com/sail-sg/edp.
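The action-approximation trick at the core of EDP can be sketched in a few lines: instead of unrolling the reverse diffusion chain, a clean action is reconstructed in a single step from its noise-corrupted version and passed to the critic. The snippet below is a minimal illustration under standard DDPM assumptions; `eps_model`, `q_net`, and the schedule `alpha_bar` are hypothetical interfaces, not the released implementation.

```python
import torch

def edp_training_loss(eps_model, q_net, state, action, alpha_bar):
    """One EDP-style training step (sketch): corrupt a dataset action, denoise it
    in a single step instead of unrolling the reverse chain, and feed the
    approximate action to the critic. Interfaces are assumed, not the authors' API."""
    B = action.shape[0]
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (B,), device=action.device)            # random diffusion timestep
    a_bar = alpha_bar[t].unsqueeze(-1)                              # (B, 1) cumulative schedule
    noise = torch.randn_like(action)
    noisy_action = a_bar.sqrt() * action + (1.0 - a_bar).sqrt() * noise
    eps_pred = eps_model(state, noisy_action, t)                    # predicted noise
    # One-step reconstruction of a clean action from its corrupted version.
    approx_action = (noisy_action - (1.0 - a_bar).sqrt() * eps_pred) / a_bar.sqrt()
    bc_loss = ((eps_pred - noise) ** 2).mean()                      # standard denoising objective
    q_loss = -q_net(state, approx_action).mean()                    # policy-improvement term
    return bc_loss + q_loss
```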
Mutual Information Regularized Offline Reinforcement Learning
The major challenge of offline RL is the distribution shift that appears when
out-of-distribution actions are queried, which makes the policy improvement
direction biased by extrapolation errors. Most existing methods address this
problem by penalizing the policy or value for deviating from the behavior
policy during policy improvement or evaluation. In this work, we propose a
novel MISA framework to approach offline RL from the perspective of Mutual
Information between States and Actions in the dataset by directly constraining
the policy improvement direction. MISA constructs lower bounds on the mutual
information, parameterized by the policy and Q-values. We show that optimizing
such a lower bound is equivalent to maximizing the likelihood of a one-step
improved policy on the offline dataset. Hence, we constrain the policy
improvement direction to lie in the data manifold. The resulting algorithm
simultaneously augments policy evaluation and improvement by adding mutual
information regularization terms. MISA is a general framework that unifies
conservative Q-learning (CQL) and behavior regularization methods (e.g.,
TD3+BC) as special cases. We introduce three variants of MISA and empirically
demonstrate that a tighter mutual information lower bound yields better offline
RL performance. In addition, our extensive experiments show that MISA
significantly outperforms a wide range of baselines on various tasks of the
D4RL benchmark, e.g., achieving 742.9 total points on gym-locomotion tasks. Our
code is available at https://github.com/sail-sg/MISA.
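The mutual-information lower bound parameterized by the policy and Q-values can be illustrated with a generic contrastive estimator. The sketch below uses an InfoNCE-style bound on I(S; A) with Q(s, a) as the critic and in-batch actions as negatives; this is one common construction of such a bound, not necessarily MISA's exact one, and `q_net` is an assumed interface.

```python
import torch
import torch.nn.functional as F

def infonce_mi_lower_bound(q_net, states, actions):
    """InfoNCE-style lower bound on I(S; A) using Q(s, a) as the critic and
    in-batch actions as negatives. A generic estimator in the spirit of MISA's
    mutual-information regularizer, not the paper's exact construction."""
    B = states.shape[0]
    # Score every state against every action in the batch: scores[i, j] = Q(s_i, a_j).
    s_rep = states.unsqueeze(1).expand(B, B, -1).reshape(B * B, -1)
    a_rep = actions.unsqueeze(0).expand(B, B, -1).reshape(B * B, -1)
    scores = q_net(s_rep, a_rep).reshape(B, B)
    labels = torch.arange(B, device=states.device)                  # positives on the diagonal
    # log(B) minus the cross-entropy is the standard InfoNCE bound on mutual information.
    return torch.log(torch.tensor(float(B))) - F.cross_entropy(scores, labels)
```

Subtracting a scaled version of this quantity from the policy-evaluation and policy-improvement losses would act as the kind of mutual-information regularizer the abstract describes.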
FreeMask: Synthetic Images with Dense Annotations Make Stronger Segmentation Models
Semantic segmentation has witnessed tremendous progress due to the proposal
of various advanced network architectures. However, they are extremely hungry
for fine-grained annotations to train, and acquiring them is laborious and
expensive. Therefore, we present FreeMask in this work, which resorts to
synthetic images from generative models to ease the burden of both data
collection and annotation procedures. Concretely, we first synthesize abundant
training images conditioned on the semantic masks provided by realistic
datasets. This yields extra well-aligned image-mask training pairs for semantic
segmentation models. We surprisingly observe that models trained solely on
synthetic images already achieve performance comparable to their counterparts
trained on real images (e.g., 48.3 vs. 48.5 mIoU on ADE20K, and 49.3 vs. 50.5 on
COCO-Stuff). Then, we
investigate the role of synthetic images through joint training with real images
or pre-training for subsequent fine-tuning on real images. Meanwhile, we design a
robust filtering principle
to suppress incorrectly synthesized regions. In addition, we propose to treat
different semantic masks unequally, prioritizing the harder ones and sampling
more corresponding synthetic images for them. As a result, either
jointly trained or pre-trained with our filtered and re-sampled synthesized
images, segmentation models can be greatly enhanced, e.g., from 48.7 to 52.0 on
ADE20K. Code is available at https://github.com/LiheYoung/FreeMask.
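The hardness-aware re-sampling described above can be sketched as a simple budget-allocation step: masks that a real-image-trained model finds harder receive a larger share of the synthetic images. The helper below is hypothetical (the hardness scores and budget are assumptions), shown only to illustrate the idea, not FreeMask's code.

```python
import numpy as np

def resample_synthetic_per_mask(mask_hardness, total_budget, rng=None):
    """Allocate a synthetic-image budget across semantic masks in proportion to
    their hardness (e.g., the loss a real-image-trained model incurs on each mask).
    Hypothetical helper illustrating the re-sampling idea, not FreeMask's code."""
    rng = rng if rng is not None else np.random.default_rng(0)
    hardness = np.asarray(mask_hardness, dtype=np.float64)
    probs = hardness / hardness.sum()                  # harder masks get a larger share
    return rng.multinomial(total_budget, probs)        # images to synthesize per mask

# Example: three masks; the second is hardest, so it receives the most images.
print(resample_synthetic_per_mask([0.2, 0.6, 0.2], total_budget=100))
```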
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
This work presents Depth Anything, a highly practical solution for robust
monocular depth estimation. Without pursuing novel technical modules, we aim to
build a simple yet powerful foundation model dealing with any images under any
circumstances. To this end, we scale up the dataset by designing a data engine
to collect and automatically annotate large-scale unlabeled data (~62M images),
which significantly enlarges the data coverage and thus helps reduce the
generalization error. We investigate two simple yet effective strategies that
make data scaling-up promising. First, a more challenging optimization target
is created by leveraging data augmentation tools. It compels the model to
actively seek extra visual knowledge and acquire robust representations.
Second, an auxiliary supervision is developed to make the model inherit rich
semantic priors from pre-trained encoders. We evaluate its zero-shot capabilities
extensively on six public datasets and randomly captured photos, where it
demonstrates impressive generalization ability. Further, by fine-tuning it with
metric depth information from NYUv2 and KITTI, we set new SOTAs. Our better depth
model also results in a better depth-conditioned
ControlNet. Our models are released at
https://github.com/LiheYoung/Depth-Anything.
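The two strategies above, a harder optimization target from strong augmentation and an auxiliary alignment to a pre-trained semantic encoder, can be sketched as a single training step on an unlabeled image. All module interfaces below (`student`, `teacher`, `frozen_encoder`, `strong_aug`) are assumptions for illustration, not the released code, and the augmentation is assumed to be photometric so the pseudo depth stays pixel-aligned.

```python
import torch
import torch.nn.functional as F

def unlabeled_step(student, teacher, frozen_encoder, image, strong_aug, alpha=0.1):
    """One training step on an unlabeled image (sketch): the teacher produces a
    pseudo depth label on the clean view, the student is trained on a strongly
    (photometrically) augmented view, and its features are aligned with a frozen
    semantic encoder. Hypothetical interfaces, not Depth Anything's API."""
    with torch.no_grad():
        pseudo_depth = teacher(image)            # pseudo label from the data engine's teacher
        sem_feat = frozen_encoder(image)         # frozen semantic features to inherit
    aug_image = strong_aug(image)                # challenging view (color jitter, blur, ...)
    pred_depth, student_feat = student(aug_image)
    depth_loss = F.l1_loss(pred_depth, pseudo_depth)
    # Cosine alignment encourages the student to keep rich semantic priors.
    align_loss = 1.0 - F.cosine_similarity(student_feat, sem_feat, dim=-1).mean()
    return depth_loss + alpha * align_loss
```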
Offline Prioritized Experience Replay
Offline reinforcement learning (RL) is challenged by the distributional shift
problem. To address this problem, existing works mainly focus on designing
sophisticated policy constraints between the learned policy and the behavior
policy. However, these constraints are applied equally to well-performing and
inferior actions through uniform sampling, which might negatively affect the
learned policy. To alleviate this issue, we propose Offline Prioritized
Experience Replay (OPER), featuring a class of priority functions designed to
prioritize highly-rewarding transitions, making them more frequently visited
during training. Through theoretical analysis, we show that this class of
priority functions induces an improved behavior policy, and when constrained to
this improved policy, a policy-constrained offline RL algorithm is likely to
yield a better solution. We develop two practical strategies to obtain priority
weights by estimating advantages based on a fitted value network (OPER-A) or
utilizing trajectory returns (OPER-R) for quick computation. OPER is a
plug-and-play component for offline RL algorithms. As case studies, we evaluate
OPER on five different algorithms: BC, TD3+BC, Onestep RL, CQL, and
IQL. Extensive experiments demonstrate that both OPER-A and OPER-R
significantly improve the performance of all baseline methods. Code and
priority weights are available at https://github.com/sail-sg/OPER.
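The return-based variant (OPER-R) amounts to turning trajectory returns into sampling probabilities over transitions. The sketch below is a hypothetical illustration of that idea (normalized returns passed through a softmax with a temperature), not the released priority weights.

```python
import numpy as np

def return_based_priorities(trajectory_returns, traj_ids, temperature=1.0):
    """OPER-R-style priorities (sketch): weight each transition by a softmax over
    the normalized return of its trajectory, so transitions from highly rewarding
    trajectories are visited more often. Not the released implementation."""
    returns = np.asarray(trajectory_returns, dtype=np.float64)
    z = (returns - returns.mean()) / (returns.std() + 1e-8)   # normalize returns
    traj_w = np.exp(z / temperature)                           # per-trajectory weight
    probs = traj_w[np.asarray(traj_ids)]                       # broadcast weights to transitions
    return probs / probs.sum()

# Example: transitions from the high-return trajectory dominate the sampled batch.
probs = return_based_priorities([10.0, 100.0, 20.0], traj_ids=[0, 0, 1, 1, 2, 2])
batch_idx = np.random.default_rng(0).choice(len(probs), size=4, p=probs)
```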