170 research outputs found
Double Check Your State Before Trusting It: Confidence-Aware Bidirectional Offline Model-Based Imagination
The learned policy of model-free offline reinforcement learning (RL) methods
is often constrained to stay within the support of the dataset to avoid
possibly dangerous out-of-distribution actions or states, making it challenging
to handle out-of-support regions. Model-based RL methods offer a richer dataset
and benefit generalization by generating imaginary trajectories with either a
trained forward or reverse dynamics model. However, the imagined transitions
may be inaccurate, thus degrading the performance of the underlying offline RL
method. In this paper, we propose to augment the offline dataset using trained
bidirectional dynamics models and rollout policies with a double check. We
introduce conservatism by trusting only samples on which the forward model and
the backward model agree. Our method, confidence-aware bidirectional offline
model-based imagination, generates reliable samples and can be combined with
any model-free offline RL method. Experimental results on the D4RL benchmarks
demonstrate that our method significantly boosts the performance of existing
model-free offline RL algorithms and achieves competitive or better scores
against baseline methods.
Comment: NeurIPS 202
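A minimal sketch of the double-check idea described above, using toy stand-ins for the learned dynamics models (the forward_model / backward_model functions and the agreement threshold are hypothetical placeholders, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for learned dynamics models; real ones would be
# trained neural networks fitted on the offline dataset.
def forward_model(state, action):
    return state + 0.1 * action + 0.01 * rng.normal(size=state.shape)

def backward_model(next_state, action):
    return next_state - 0.1 * action + 0.01 * rng.normal(size=next_state.shape)

def double_check(state, action, threshold=0.05):
    """Keep an imagined transition only if the forward and backward models agree."""
    next_state = forward_model(state, action)
    reconstructed_state = backward_model(next_state, action)
    disagreement = np.linalg.norm(reconstructed_state - state)
    return (state, action, next_state) if disagreement < threshold else None

# Example: filter a batch of imagined transitions before adding them to the
# augmented offline dataset.
kept = [t for t in (double_check(rng.normal(size=4), rng.normal(size=4))
                    for _ in range(100)) if t is not None]
print(f"kept {len(kept)} / 100 imagined transitions")
```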
The primacy bias in Model-based RL
The primacy bias in deep reinforcement learning (DRL), which refers to the
agent's tendency to overfit early data and lose the ability to learn from new
data, can significantly decrease the performance of DRL algorithms. Previous
studies have shown that employing simple techniques, such as resetting the
agent's parameters, can substantially alleviate the primacy bias. However, we
observe that resetting the agent's parameters harms its performance in the
context of model-based reinforcement learning (MBRL). In fact, on further
investigation, we find that the primacy bias in MBRL differs from that in
model-free RL. In this work, we focus on investigating the primacy bias in MBRL
and propose world model resetting, which works in MBRL. We apply our method to
two different MBRL algorithms, MBPO and DreamerV2. We validate the
effectiveness of our method on multiple continuous control tasks on MuJoCo and
DeepMind Control Suite, as well as discrete control tasks on the Atari 100k
benchmark. The results show that world model resetting can significantly
alleviate the primacy bias in the model-based setting and improve the
algorithms' performance. We also give a guide on how to perform world model
resetting effectively.
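The core operation, re-initializing the world model at intervals while keeping the replay buffer and the rest of the agent, might be sketched as follows (a hedged PyTorch illustration; the reset_interval schedule and training-loop names are assumptions, not the authors' code):

```python
import torch.nn as nn

def reset_world_model(world_model: nn.Module) -> None:
    """Re-initialize every submodule of the world model that defines reset_parameters()."""
    for module in world_model.modules():
        if hasattr(module, "reset_parameters"):
            module.reset_parameters()

# Hypothetical usage inside an MBRL training loop (e.g., MBPO or DreamerV2):
# the replay buffer and the policy/value networks are kept; only the world
# model's weights are re-initialized.
#
#   if step % reset_interval == 0:
#       reset_world_model(world_model)
```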
Off-Policy RL Algorithms Can be Sample-Efficient for Continuous Control via Sample Multiple Reuse
Sample efficiency is one of the most critical issues for online reinforcement
learning (RL). Existing methods achieve higher sample efficiency by adopting
model-based methods, Q-ensemble, or better exploration mechanisms. We instead
propose to train an off-policy RL agent by updating it on a fixed sampled batch
multiple times, thus reusing these samples and better exploiting them within a
single optimization loop. We name our method sample multiple reuse (SMR). We
theoretically show the properties of Q-learning with SMR, e.g., convergence.
Furthermore, we incorporate SMR with off-the-shelf off-policy RL algorithms and
conduct experiments on a variety of continuous control benchmarks. Empirical
results show that SMR significantly boosts the sample efficiency of the base
methods across most of the evaluated tasks without any hyperparameter tuning or
additional tricks.
Comment: 37 pages
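Since the method is a small change to the standard off-policy loop, a hedged sketch is easy to give (the gym-style env API and the agent/buffer objects below are generic placeholders, not the authors' code):

```python
def train_with_smr(env, agent, buffer, total_steps, batch_size=256, reuse_m=5):
    """Off-policy training loop in which each sampled batch is reused M times."""
    state = env.reset()
    for step in range(total_steps):
        # Collect one transition as usual.
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        buffer.add(state, action, reward, next_state, done)
        state = env.reset() if done else next_state

        # Sample Multiple Reuse: draw a single batch and update on it M times,
        # instead of performing one update per environment step.
        batch = buffer.sample(batch_size)
        for _ in range(reuse_m):
            agent.update(batch)
```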
Understanding What Affects Generalization Gap in Visual Reinforcement Learning: Theory and Empirical Evidence
Recently, there have been many efforts to learn useful policies for
continuous control in visual reinforcement learning (RL). In this scenario, it
is important to learn a generalizable policy, as the testing environment may
differ from the training environment, e.g., there exist distractors during
deployment. Many practical algorithms are proposed to handle this problem.
However, to the best of our knowledge, none of them provide a theoretical
understanding of what affects the generalization gap and why their proposed
methods work. In this paper, we bridge this gap by theoretically identifying
the key factors that contribute to the generalization gap when the testing
environment has distractors. Our theory indicates that minimizing the
representation distance between training and testing environments, which aligns
with human intuition, is the most critical factor for reducing the
generalization gap. Our theoretical results are supported by empirical evidence
on the DMControl Generalization Benchmark (DMC-GB).
Comment: Part of this work is accepted as an AAMAS 2024 extended abstract
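The paper itself is theoretical, but its main takeaway, minimizing the representation distance between training and testing (distracted) observations, can be illustrated with a hedged auxiliary loss (the encoder, the distractor-style augmentation, and the weighting term are assumptions for illustration, not part of the paper):

```python
import torch.nn.functional as F

def representation_distance_loss(encoder, clean_obs, distracted_obs):
    """Penalize the feature gap between clean and distractor-perturbed observations."""
    z_clean = encoder(clean_obs)
    z_distracted = encoder(distracted_obs)
    # Pull the representation of the distracted view toward that of the clean view.
    return F.mse_loss(z_distracted, z_clean.detach())

# Hypothetical usage during training:
#   total_loss = rl_loss + lambda_rep * representation_distance_loss(encoder, obs, augment(obs))
```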
Zero-shot Preference Learning for Offline RL via Optimal Transport
Preference-based Reinforcement Learning (PbRL) has demonstrated remarkable
efficacy in aligning rewards with human intentions. However, a significant
challenge lies in the need for substantial human labels, which are costly and
time-consuming to collect. Additionally, the expensive preference data obtained
from prior tasks is typically not reusable for subsequent task learning, leading to
extensive labeling for each new task. In this paper, we propose a novel
zero-shot preference-based RL algorithm that leverages labeled preference data
from source tasks to infer labels for target tasks, eliminating the requirement
for human queries. Our approach utilizes Gromov-Wasserstein distance to align
trajectory distributions between source and target tasks. The solved optimal
transport matrix serves as a correspondence between trajectories of two tasks,
making it possible to identify corresponding trajectory pairs between tasks and
transfer the preference labels. However, learning directly from inferred labels
that contain a fraction of noisy labels results in an inaccurate reward
function, subsequently degrading policy performance. To this end, we introduce
Robust Preference Transformer, which models the rewards as Gaussian
distributions and incorporates reward uncertainty in addition to reward mean.
The empirical results on robotic manipulation tasks of Meta-World and Robomimic
show that our method has strong capabilities of transferring preferences
between tasks and learns reward functions from noisy labels robustly.
Furthermore, we reveal that our method attains near-oracle performance with a
small proportion of scripted labels.
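A hedged sketch of the label-transfer step using the POT library's Gromov-Wasserstein solver (the trajectory distance matrices and the per-trajectory labels below are simplifying assumptions; the paper works with pairwise preference labels and its own trajectory representations):

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def transfer_preference_labels(D_source, D_target, source_labels):
    """Align source and target trajectories with Gromov-Wasserstein and copy
    labels across the resulting correspondence.

    D_source, D_target: pairwise distance matrices among trajectories of each
    task (how these distances are computed is an assumption here).
    source_labels: per-trajectory scores in the source task (a simplification
    of pairwise preference labels).
    """
    p = ot.unif(D_source.shape[0])
    q = ot.unif(D_target.shape[0])
    # Coupling matrix: T[i, j] is the mass transported between source
    # trajectory i and target trajectory j.
    T = ot.gromov.gromov_wasserstein(D_source, D_target, p, q, loss_fun="square_loss")
    # For each target trajectory, take the label of its best-matched source trajectory.
    matched_source = T.argmax(axis=0)
    return np.asarray(source_labels)[matched_source]
```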
State Advantage Weighting for Offline RL
We present state advantage weighting for offline reinforcement learning (RL).
In contrast to the action advantage A(s, a) that we commonly adopt in QSA
learning, we leverage the state advantage A(s, s') and QSS learning for
offline RL, hence decoupling the action from values. We expect the agent to
reach high-reward states, with the action determined by how the agent can get
to the corresponding state. Experiments on D4RL datasets show that our
proposed method can achieve remarkable performance against the common
baselines. Furthermore, our method shows good generalization capability when
transferring from offline to online.
Comment: 3rd Offline RL workshop at NeurIPS 2022. arXiv admin note: text
overlap with arXiv:2206.0798
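One way to picture the state-advantage idea is an exponentially weighted regression on next states, where the weight comes from A(s, s') = Q(s, s') - V(s); the networks, the clipping, and the exact objective below are assumptions in the spirit of advantage-weighted methods, not necessarily the paper's formulation:

```python
import torch

def state_advantage_weighted_loss(qss_net, v_net, state_proposer,
                                  states, next_states, beta=1.0):
    """Weight a next-state proposal loss by exp(A(s, s') / beta),
    where the state advantage is A(s, s') = Q(s, s') - V(s).
    Actions are decoupled from values and can later be recovered by an
    inverse model mapping (s, s') to a.
    """
    with torch.no_grad():
        advantage = qss_net(states, next_states) - v_net(states)
        weights = torch.clamp(torch.exp(advantage / beta), max=100.0)
    predicted_next = state_proposer(states)
    return (weights * ((predicted_next - next_states) ** 2).sum(dim=-1)).mean()
```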
Normalization Enhances Generalization in Visual Reinforcement Learning
Recent advances in visual reinforcement learning (RL) have led to impressive
success in handling complex tasks. However, these methods have demonstrated
limited generalization capability to visual disturbances, which poses a
significant challenge for their real-world application and adaptability. Though
normalization techniques have demonstrated huge success in supervised and
unsupervised learning, their applications in visual RL are still scarce. In
this paper, we explore the potential benefits of integrating normalization into
visual RL methods with respect to generalization performance. We find that,
perhaps surprisingly, incorporating suitable normalization techniques is
sufficient to enhance the generalization capabilities, without any additional
special design. We utilize the combination of two normalization techniques,
CrossNorm and SelfNorm, for generalizable visual RL. Extensive experiments are
conducted on DMControl Generalization Benchmark and CARLA to validate the
effectiveness of our method. We show that our method significantly improves
generalization capability while only marginally affecting sample efficiency. In
particular, when integrated with DrQ-v2, our method enhances the test
performance of DrQ-v2 on CARLA across various scenarios, from 14% of the
training performance to 97%.
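As a rough illustration of the kind of normalization involved, a simplified CrossNorm operation exchanges per-channel statistics between randomly paired images in a batch (a generic sketch, not the exact module used in the paper; SelfNorm, which recalibrates statistics with learned attention, is omitted):

```python
import torch

def crossnorm(x: torch.Tensor) -> torch.Tensor:
    """Swap per-channel mean/std between randomly paired images in a batch.

    x: image batch of shape (B, C, H, W).
    """
    perm = torch.randperm(x.size(0))
    mean = x.mean(dim=(2, 3), keepdim=True)
    std = x.std(dim=(2, 3), keepdim=True) + 1e-6
    # Normalize each image, then re-style it with another image's statistics.
    return (x - mean) / std * std[perm] + mean[perm]
```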
A Survey of Embodied AI: From Simulators to Research Tasks
There has been an emerging paradigm shift from the era of "internet AI" to
"embodied AI", where AI algorithms and agents no longer learn from datasets of
images, videos or text curated primarily from the internet. Instead, they learn
through interactions with their environments via an egocentric perception
similar to that of humans. Consequently, there has been substantial growth in the
demand for embodied AI simulators to support various embodied AI research
tasks. This growing interest in embodied AI is beneficial to the greater
pursuit of Artificial General Intelligence (AGI), but there has not been a
contemporary and comprehensive survey of this field. This paper aims to provide
an encyclopedic survey for the field of embodied AI, from its simulators to its
research. By evaluating nine current embodied AI simulators against seven
proposed features, this paper aims to understand their suitability for use in
embodied AI research and their limitations. This paper then surveys the three
main research tasks in embodied AI -- visual exploration, visual navigation and
embodied question answering (QA) -- covering the state-of-the-art approaches,
evaluation metrics and datasets. Finally, with the new insights revealed
through surveying the field, the paper provides suggestions for
simulator-for-task selection and recommendations for the future directions of
the field.
Comment: Under Review for IEEE TETC
Application of Fibonacci Sequence and Lucas Sequence on the Design of the Toilet Siphon Pipe Shape
The purpose of this study was to explore a method for designing the toilet siphon pipe shape to improve flushing performance. The Fibonacci sequence and the Lucas sequence were used to design the structural parameters of the siphon pipe. The flushing processes of the toilet were simulated using the computational fluid dynamics (CFD) method to analyze the flushing performance under different siphon pipe shapes. Experimental studies were conducted to verify the reliability of the simulation results. The results indicated that the flushing performance of the toilet was optimal when the Lucas numbers and the Fibonacci numbers were used to regulate the curvature of the siphon pipe in the X_i direction and the Y_j direction, respectively. To obtain better flushing performance, the curvature of the siphon pipe should be smooth and have obvious transitions at the connections of different sections. When the overall size of the siphon pipe is kept constant, a shorter siphon pipe length helps improve toilet flushing performance.
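For reference, the two integer sequences used to set the structural parameters are generated below; how the numbers map onto the pipe curvatures in the X_i and Y_j directions is specific to the paper's design and is not reproduced here:

```python
def fibonacci(n):
    """First n Fibonacci numbers: 1, 1, 2, 3, 5, 8, ..."""
    seq = [1, 1]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2])
    return seq[:n]

def lucas(n):
    """First n Lucas numbers: 2, 1, 3, 4, 7, 11, ..."""
    seq = [2, 1]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2])
    return seq[:n]

print(fibonacci(8))  # [1, 1, 2, 3, 5, 8, 13, 21]
print(lucas(8))      # [2, 1, 3, 4, 7, 11, 18, 29]
```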