Double Check Your State Before Trusting It: Confidence-Aware Bidirectional Offline Model-Based Imagination
The learned policy of model-free offline reinforcement learning (RL) methods is often constrained to stay within the support of the dataset to avoid potentially dangerous out-of-distribution actions or states, making it challenging to handle out-of-support regions. Model-based RL methods offer a richer dataset and benefit generalization by generating imaginary trajectories with either a trained forward or reverse dynamics model. However, the imagined transitions may be inaccurate, thus degrading the performance of the underlying offline RL method. In this paper, we propose to augment the offline dataset using trained bidirectional dynamics models and rollout policies with a double check. We introduce conservatism by trusting samples on which the forward model and backward model agree. Our method, confidence-aware bidirectional offline model-based imagination, generates reliable samples and can be combined with any model-free offline RL method. Experimental results on the D4RL benchmarks demonstrate that our method significantly boosts the performance of existing model-free offline RL algorithms and achieves competitive or better scores
against baseline methods.
Comment: NeurIPS 2022
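The double-check rule itself is simple to illustrate. Below is a minimal sketch, not the authors' implementation: an imagined transition is kept only when the forward and backward dynamics models roughly agree on it. The stand-in models, the distance threshold, and all shapes are assumptions for illustration.

```python
# Minimal sketch of agreement-based filtering (all models and the
# threshold are illustrative stand-ins, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 2))  # toy action-to-state effect matrix

def forward_model(state, action):
    """Stand-in for a learned forward dynamics model: (s, a) -> s'."""
    return state + B @ action + 0.01 * rng.standard_normal(4)

def backward_model(next_state, action):
    """Stand-in for a learned reverse dynamics model: (s', a) -> s."""
    return next_state - B @ action + 0.01 * rng.standard_normal(4)

def double_check(state, action, threshold=0.05):
    """Keep an imagined transition only if both models agree on it."""
    next_state = forward_model(state, action)
    reconstructed = backward_model(next_state, action)  # trace s' back to s
    if np.linalg.norm(state - reconstructed) < threshold:
        return state, action, next_state
    return None

# Filter a batch of imagined transitions before augmenting the dataset.
kept = [t for t in (double_check(rng.standard_normal(4), rng.standard_normal(2))
                    for _ in range(1000)) if t is not None]
print(f"kept {len(kept)} / 1000 imagined transitions")
```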
Off-Policy RL Algorithms Can be Sample-Efficient for Continuous Control via Sample Multiple Reuse
Sample efficiency is one of the most critical issues for online reinforcement
learning (RL). Existing methods achieve higher sample efficiency by adopting
model-based methods, Q-ensembles, or better exploration mechanisms. We instead propose to train an off-policy RL agent by updating on a fixed sampled batch multiple times, thus reusing these samples and better exploiting them within a single optimization loop. We name our method sample multiple reuse (SMR). We theoretically characterize the properties of Q-learning with SMR, e.g., its convergence.
Furthermore, we incorporate SMR with off-the-shelf off-policy RL algorithms and
conduct experiments on a variety of continuous control benchmarks. Empirical
results show that SMR significantly boosts the sample efficiency of the base
methods across most of the evaluated tasks without any hyperparameter tuning or
additional tricks.
Comment: 37 pages
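Since the method is essentially a change to the update loop, a sketch is easy to give. The following is a hedged illustration (assuming PyTorch, a simplified state-value TD update standing in for a full actor-critic, and toy data in place of a replay-buffer sample) of the one idea SMR adds: the same sampled batch drives M consecutive gradient steps instead of one.

```python
# Sketch of sample multiple reuse (SMR): reuse one sampled batch for
# M consecutive updates. The critic here is a simplified state-value
# network; real usage would wrap any off-the-shelf off-policy learner.
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
target = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
target.load_state_dict(critic.state_dict())
opt = torch.optim.Adam(critic.parameters(), lr=3e-4)
gamma, M = 0.99, 10  # M is the reuse count, SMR's only new knob

def smr_update(batch):
    states, rewards, next_states, dones = batch
    for _ in range(M):  # the same batch is exploited M times
        with torch.no_grad():
            td_target = rewards + gamma * (1 - dones) * target(next_states).squeeze(-1)
        loss = ((critic(states).squeeze(-1) - td_target) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

# Toy batch standing in for a replay-buffer sample.
smr_update((torch.randn(32, 4), torch.randn(32), torch.randn(32, 4), torch.zeros(32)))
```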
Understanding What Affects Generalization Gap in Visual Reinforcement Learning: Theory and Empirical Evidence
Recently, many efforts have attempted to learn useful policies for continuous control in visual reinforcement learning (RL). In this scenario, it
is important to learn a generalizable policy, as the testing environment may
differ from the training environment, e.g., there exist distractors during
deployment. Many practical algorithms are proposed to handle this problem.
However, to the best of our knowledge, none of them provide a theoretical
understanding of what affects the generalization gap and why their proposed
methods work. In this paper, we bridge this gap by theoretically identifying the key factors that contribute to the generalization gap when the testing environment has distractors. Our theory indicates that minimizing the representation distance between training and testing environments, which aligns with human intuition, is the most critical factor in reducing the generalization gap. Our theoretical results are supported by empirical evidence on the DMControl Generalization Benchmark (DMC-GB).
Comment: Part of this work is accepted as an AAMAS 2024 extended abstract
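The takeaway suggests a concrete training signal. As a hedged illustration only (the encoder, the distractor stand-in, and the plain L2 distance are all assumptions, not the paper's construction), one could penalize the distance between representations of an observation and its distracted counterpart:

```python
# Illustrative auxiliary loss: pull representations of clean and
# distracted views of the same observation together.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 84 * 84, 128))

def representation_distance(obs, distracted_obs):
    """Mean squared L2 distance between paired representations."""
    return (encoder(obs) - encoder(distracted_obs)).pow(2).sum(-1).mean()

obs = torch.rand(8, 3, 84, 84)                               # clean training frames
distracted = (obs + 0.3 * torch.rand_like(obs)).clamp(0, 1)  # toy distractors
aux_loss = representation_distance(obs, distracted)          # add to the RL loss
```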
Tackling Non-Stationarity in Reinforcement Learning via Causal-Origin Representation
In real-world scenarios, the application of reinforcement learning is
significantly challenged by complex non-stationarity. Most existing methods
attempt to model changes in the environment explicitly, often requiring
impractical prior knowledge. In this paper, we propose a new perspective,
positing that non-stationarity can propagate and accumulate through complex
causal relationships during state transitions, thereby compounding its complexity and affecting policy learning. We believe that this challenge
can be more effectively addressed by tracing the causal origin of
non-stationarity. To this end, we introduce the Causal-Origin REPresentation
(COREP) algorithm. COREP primarily employs a guided updating mechanism to learn a stable graph representation for states, termed the causal-origin representation. By leveraging this representation, the learned policy exhibits impressive resilience to non-stationarity. We supplement our approach with a theoretical analysis grounded in a causal interpretation of non-stationary reinforcement learning, supporting the validity of the causal-origin representation. Experimental results further demonstrate the superior performance of COREP over existing methods in tackling non-stationarity.
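The abstract gives no implementation detail, so the following is purely illustrative of one generic reading of a "guided updating mechanism": a slow-moving copy of the representation network is guided toward the online one (here by exponential moving average), so the representation the policy relies on stays stable as the environment drifts. COREP's actual graph machinery is not reproduced.

```python
# Purely illustrative: stabilize a representation with a guided
# (EMA) update; not COREP's algorithm, just the generic idea.
import torch
import torch.nn as nn

online_encoder = nn.Linear(16, 32)  # updated by policy-learning gradients
stable_encoder = nn.Linear(16, 32)  # slow-moving, guided copy
stable_encoder.load_state_dict(online_encoder.state_dict())

@torch.no_grad()
def guided_update(tau=0.005):
    """Nudge the stable encoder a small step toward the online one."""
    for p_s, p_o in zip(stable_encoder.parameters(), online_encoder.parameters()):
        p_s.mul_(1 - tau).add_(tau * p_o)

guided_update()
z = stable_encoder(torch.randn(4, 16))  # representation fed to the policy
```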
State Advantage Weighting for Offline RL
We present state advantage weighting for offline reinforcement learning (RL).
In contrast to the action advantage commonly adopted in QSA learning, we leverage the state advantage and QSS learning for offline RL, hence decoupling actions from values. We expect the agent to reach a high-reward state, with the action determined by how the agent can get to that state. Experiments on D4RL datasets show that our
proposed method can achieve remarkable performance against the common
baselines. Furthermore, our method shows good generalization capability when
transferring from offline to online.
Comment: 3rd Offline RL workshop at NeurIPS 2022. arXiv admin note: text overlap with arXiv:2206.0798
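The decoupling is straightforward to write down. As a hedged sketch (the network shapes, the exponential weighting form, and the temperature are assumptions, not necessarily the paper's exact choices), transitions can be scored by the state advantage A(s, s') = Q(s, s') - V(s) and used to weight a downstream loss:

```python
# Sketch of state-advantage weighting: score a transition by how much
# better reaching s' is than the average value of s, then weight with it.
import torch
import torch.nn as nn

q_ss = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))  # Q(s, s')
v = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))     # V(s)

def state_advantage_weights(states, next_states, temperature=1.0, clip=100.0):
    """exp(A(s, s') / temperature), clipped for numerical stability."""
    advantage = q_ss(torch.cat([states, next_states], dim=-1)) - v(states)
    return (advantage / temperature).exp().clamp(max=clip)

s, s_next = torch.randn(32, 4), torch.randn(32, 4)
w = state_advantage_weights(s, s_next)  # e.g., weights for a cloning loss
```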
Violation of Electrostatic Rules: Shifting Balance Between Pnicogen Bond and Lone Pair−π Interaction Tuned by Substituents
Complexes were formed pairing ZCl3 (Z = P, As, Sb) with C2R4 (R = H, F, CN). The first interaction present is a pnicogen bond between the Z atom and the C=C π-bond. This bond weakens as the H atoms of ethylene are replaced by electron-withdrawing F and CN and the potential above the alkene switches from negative to positive. In the latter two cases, another set of noncovalent bonds is formed between the Cl lone pairs of ZCl3 and the π*(C=C) antibonding orbital, as well as with the F or CN substituents. The growing strength of these interactions, coupled with a large dispersion energy, more than compensates for the weak pnicogen bond in C2(CN)4, with its repulsion between areas of positive charge on each subunit, making its complexes with ZCl3 very strong, as high as 25 kJ/mol. The pnicogen bond in C2F4 is weaker than in C2H4, and its subsidiary lone pair−π bonds are weaker than in C2(CN)4, so the complexes of this alkene with ZCl3 are the weakest of the set.
A Survey on Transformers in Reinforcement Learning
The Transformer has been considered the dominant neural architecture in NLP and CV, mostly under supervised settings. Recently, a similar surge in the use of Transformers has appeared in the domain of reinforcement learning (RL), but it is faced with unique design choices and challenges brought by the nature of RL. However, the evolution of Transformers in RL has not yet been well charted.
In this paper, we seek to systematically review motivations and progress on
using Transformers in RL, provide a taxonomy of existing works, discuss each sub-field, and summarize future prospects.
Carbene Triel Bonds Between TrR3 (Tr=B, Al) and N-Heterocyclic Carbenes
The carbene triel bond is predicted and characterized by theoretical calculations. The C lone pair of N‐heterocyclic carbenes (NHCs) is allowed to interact with the central triel atom of TrR3 (Tr = B and Al; R = H, F, Cl, and Br). The ensuing bond is very strong, with an interaction energy of nearly 90 kcal/mol. Replacement of the C lone pair by that of either N or Si weakens the binding. The bond is strengthened by electron‐withdrawing substituents on the triel atom, and the reverse occurs with substitution on the NHC. However, these effects do not strictly follow the typical pattern of F > Cl > Br. The TrR3 molecule suffers a good deal of geometric deformation, requiring on the order of 30 kcal/mol, in forming the complex. The R(C···Tr) bond is quite short, for example, 1.6 Å for Tr = B, and shows other indications of at least a partially covalent bond, such as a high electron density at the bond critical point and a good deal of intermolecular charge transfer.
SEABO: A Simple Search-Based Method for Offline Imitation Learning
Offline reinforcement learning (RL) has attracted much attention due to its ability to learn from static offline datasets, eliminating the need to interact with the environment. Nevertheless, the success of offline RL
relies heavily on the offline transitions annotated with reward labels. In
practice, we often need to hand-craft the reward function, which is sometimes
difficult, labor-intensive, or inefficient. To tackle this challenge, we set
our focus on the offline imitation learning (IL) setting, and aim to derive a reward function from the expert data and unlabeled data. To that end, we
propose a simple yet effective search-based offline IL method, tagged SEABO.
SEABO assigns a larger reward to a transition that is close to its nearest neighbor in the expert demonstration, and a smaller reward otherwise, all in an unsupervised learning manner. Experimental results on a variety of D4RL
datasets indicate that SEABO achieves performance competitive with offline RL algorithms that use ground-truth rewards, given only a single expert trajectory,
and can outperform prior reward learning and offline IL methods across many
tasks. Moreover, we demonstrate that SEABO also works well if the expert
demonstrations contain only observations. Our code is publicly available at
https://github.com/dmksjfl/SEABO.
Comment: To appear in ICLR 2024
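The search itself is the whole method, which makes a sketch short. The following is a hedged illustration (the exact squashing function, its coefficients, and the use of raw flat transition vectors are assumptions; see the repository above for the real implementation): a KD-tree over the expert demonstration labels each unlabeled transition with a reward that decays with its distance to the nearest expert sample.

```python
# Sketch of search-based reward labeling: reward decays with distance
# to the nearest expert neighbor. Coefficients are illustrative.
import numpy as np
from scipy.spatial import cKDTree

expert = np.random.randn(500, 6)   # expert transitions as flat vectors
tree = cKDTree(expert)             # one-off index over the expert data

def label_rewards(queries, alpha=1.0, beta=0.5):
    """Larger reward for transitions closer to the expert demonstration."""
    dist, _ = tree.query(queries, k=1)
    return alpha * np.exp(-beta * dist)

unlabeled = np.random.randn(32, 6)  # unlabeled offline transitions
rewards = label_rewards(unlabeled)  # unsupervised reward annotation
```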
Effect of Carbon Hybridization in C—F Bond as an Electron Donor in Triel Bonds
The ability of the F atom of HC≡CF, H2C=CHF and H3CCH2F to serve as an electron donor to the triel (Tr) atom of TrR3 in the context of a triel bond is assessed by ab initio calculations. The triel bond formed by Csp3—F is strongest, as high as 30 kcal/mol, followed by Csp2—F, and then by Csp—F, whose triel bonds can be as small as 1 kcal/mol. The noncovalent bond strength diminishes in the order Tr = Al > Ga > B, consistent with the intensity of the π-hole above the Tr atom in the monomer. The triel bond strength of the Al and Ga complexes increases along with the electronegativity of the R substituent but is largest for R = H when Tr = B. Electrostatics play the largest role in the stronger triel bonds, but dispersion makes an outsized contribution for the weakest such bonds.