
    Double Check Your State Before Trusting It: Confidence-Aware Bidirectional Offline Model-Based Imagination

    The learned policy of model-free offline reinforcement learning (RL) methods is often constrained to stay within the support of the dataset to avoid potentially dangerous out-of-distribution actions or states, making it challenging to handle out-of-support regions. Model-based RL methods offer richer data and better generalization by generating imaginary trajectories with either a trained forward or reverse dynamics model. However, the imagined transitions may be inaccurate, thus degrading the performance of the underlying offline RL method. In this paper, we propose to augment the offline dataset using trained bidirectional dynamics models and rollout policies with a double check. We introduce conservatism by trusting only samples that the forward model and backward model agree on. Our method, confidence-aware bidirectional offline model-based imagination, generates reliable samples and can be combined with any model-free offline RL method. Experimental results on the D4RL benchmarks demonstrate that our method significantly boosts the performance of existing model-free offline RL algorithms and achieves competitive or better scores than baseline methods. Comment: NeurIPS 202
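
    As a rough illustration of the double-check idea, the sketch below keeps only imagined transitions on which a forward and a backward dynamics model agree. The model interfaces, the L2 disagreement measure, and the threshold are illustrative assumptions, not the paper's exact formulation.

    ```python
    import numpy as np

    def double_check_filter(states, actions, forward_model, backward_model,
                            threshold=0.1):
        """Keep imagined transitions that the forward and backward dynamics
        models agree on (illustrative sketch; the models and the L2
        disagreement measure are assumptions, not the paper's exact design)."""
        next_states = forward_model(states, actions)          # s' predicted from (s, a)
        reconstructed = backward_model(next_states, actions)  # s  predicted back from (s', a)
        disagreement = np.linalg.norm(states - reconstructed, axis=-1)
        mask = disagreement < threshold                       # trust only low-disagreement samples
        return states[mask], actions[mask], next_states[mask]
    ```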

    Off-Policy RL Algorithms Can be Sample-Efficient for Continuous Control via Sample Multiple Reuse

    Sample efficiency is one of the most critical issues for online reinforcement learning (RL). Existing methods achieve higher sample efficiency by adopting model-based methods, Q-ensembles, or better exploration mechanisms. We, instead, propose to train an off-policy RL agent by updating on a fixed sampled batch multiple times, thus reusing these samples and better exploiting them within a single optimization loop. We name our method sample multiple reuse (SMR). We theoretically show the properties of Q-learning with SMR, e.g., its convergence. Furthermore, we incorporate SMR into off-the-shelf off-policy RL algorithms and conduct experiments on a variety of continuous control benchmarks. Empirical results show that SMR significantly boosts the sample efficiency of the base methods across most of the evaluated tasks without any hyperparameter tuning or additional tricks. Comment: 37 pages
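
    A minimal sketch of the SMR loop described above: one sampled batch is reused for several gradient updates before the next environment step. `replay_buffer.sample` and `agent.update` are placeholder interfaces standing in for any off-the-shelf off-policy algorithm, not the paper's code.

    ```python
    def train_step_with_smr(agent, replay_buffer, batch_size=256, reuse_m=5):
        """Sample Multiple Reuse (SMR) sketch: sample one batch, then update
        the off-policy agent on that same batch several times."""
        batch = replay_buffer.sample(batch_size)   # one fixed batch per environment step
        for _ in range(reuse_m):                   # reuse the same batch M times
            agent.update(batch)                    # any SAC/TD3-style update would fit here
    ```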

    Understanding What Affects Generalization Gap in Visual Reinforcement Learning: Theory and Empirical Evidence

    Recently, many efforts have attempted to learn useful policies for continuous control in visual reinforcement learning (RL). In this scenario, it is important to learn a generalizable policy, as the testing environment may differ from the training environment, e.g., there may exist distractors during deployment. Many practical algorithms have been proposed to handle this problem. However, to the best of our knowledge, none of them provide a theoretical understanding of what affects the generalization gap and why their proposed methods work. In this paper, we bridge this gap by theoretically characterizing the key factors that contribute to the generalization gap when the testing environment contains distractors. Our theory indicates that minimizing the representation distance between the training and testing environments, which aligns with human intuition, is the most critical factor in reducing the generalization gap. Our theoretical results are supported by empirical evidence on the DMControl Generalization Benchmark (DMC-GB). Comment: Part of this work was accepted as an AAMAS 2024 extended abstract.
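
    One plausible instantiation of the takeaway, assuming a learned image encoder and paired clean/distracted observations of the same state: penalize the representation distance between the two views. The L2 objective below is an illustrative choice, not the paper's prescribed regularizer.

    ```python
    import torch.nn.functional as F

    def representation_distance_loss(encoder, clean_obs, distracted_obs):
        """Pull representations of distracted observations toward those of the
        corresponding clean observations (illustrative regularizer only)."""
        z_clean = encoder(clean_obs)
        z_distracted = encoder(distracted_obs)
        return F.mse_loss(z_distracted, z_clean.detach())  # smaller distance -> smaller generalization gap
    ```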

    Tackling Non-Stationarity in Reinforcement Learning via Causal-Origin Representation

    In real-world scenarios, the application of reinforcement learning is significantly challenged by complex non-stationarity. Most existing methods attempt to model changes in the environment explicitly, often requiring impractical prior knowledge. In this paper, we propose a new perspective, positing that non-stationarity can propagate and accumulate through complex causal relationships during state transitions, thereby compounding its sophistication and affecting policy learning. We believe this challenge can be more effectively addressed by tracing the causal origin of the non-stationarity. To this end, we introduce the Causal-Origin REPresentation (COREP) algorithm. COREP primarily employs a guided updating mechanism to learn a stable graph representation for states, termed the causal-origin representation. By leveraging this representation, the learned policy exhibits impressive resilience to non-stationarity. We supplement our approach with a theoretical analysis grounded in a causal interpretation of non-stationary reinforcement learning, supporting the validity of the causal-origin representation. Experimental results further demonstrate the superior performance of COREP over existing methods in tackling non-stationarity.
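
    As a loose illustration only: one generic way to keep a state representation stable under changing dynamics is a slowly guided (exponential-moving-average) update of a target encoder, sketched below. COREP's actual guided updating mechanism over graph representations is more involved than this; the code is an assumption for illustration.

    ```python
    import torch

    @torch.no_grad()
    def guided_representation_update(online_encoder, target_encoder, tau=0.01):
        """Exponential-moving-average update: the target (stable) encoder only
        slowly tracks the online encoder, damping abrupt representation shifts."""
        for p, tp in zip(online_encoder.parameters(), target_encoder.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
    ```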

    State Advantage Weighting for Offline RL

    We present state advantage weighting for offline reinforcement learning (RL). In contrast to the action advantage A(s, a) commonly adopted in QSA learning, we leverage the state advantage A(s, s') and QSS learning for offline RL, hence decoupling actions from values. We expect the agent to reach high-reward states, with the action determined by how the agent can get to the corresponding state. Experiments on D4RL datasets show that our proposed method achieves remarkable performance against the common baselines. Furthermore, our method shows good generalization capability when transferring from offline to online. Comment: 3rd Offline RL workshop at NeurIPS 2022. arXiv admin note: text overlap with arXiv:2206.0798
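
    A hedged sketch of how the state advantage could be turned into weights for a weighted-regression update, assuming learned Q(s, s') and V(s) estimates. The exponential weighting and clipping mirror common advantage-weighted schemes and are assumptions, not necessarily the paper's exact coefficients.

    ```python
    import torch

    def state_advantage_weights(q_ss, v_s, temperature=3.0, max_weight=100.0):
        """Compute A(s, s') = Q(s, s') - V(s) and map it to positive weights,
        so transitions leading to better states receive larger weight."""
        advantage = q_ss - v_s                        # state advantage, no action involved
        weights = torch.exp(temperature * advantage)  # exponential advantage weighting
        return torch.clamp(weights, max=max_weight)   # clip to keep the update stable
    ```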

    Violation of Electrostatic Rules: Shifting Balance Between Pnicogen Bond and Lone Pair−π Interaction Tuned by Substituents

    Complexes were formed by pairing ZCl3 (Z = P, As, Sb) with C2R4 (R = H, F, CN). The first interaction present is a pnicogen bond between the Z atom and the C=C π-bond. This bond weakens as the H atoms of ethylene are replaced by electron-withdrawing F and CN and the potential above the alkene switches from negative to positive. In the latter two cases, another set of noncovalent bonds is formed between the Cl lone pairs of ZCl3 and the π*(C=C) antibonding orbital, as well as with the F or CN substituents. The growing strength of these interactions, coupled with a large dispersion energy, more than compensates for the weak pnicogen bond in C2(CN)4, with its repulsion between areas of positive charge on each subunit, making its complexes with ZCl3 very strong, with binding energies as high as 25 kJ/mol. The pnicogen bond in C2F4 is weaker than in C2H4, and its subsidiary lone pair-π bonds are weaker than in C2(CN)4, so the complexes of this alkene with ZCl3 are the weakest of the set.

    A Survey on Transformers in Reinforcement Learning

    The Transformer has been considered the dominant neural architecture in NLP and CV, mostly under supervised settings. Recently, a similar surge in the use of Transformers has appeared in the domain of reinforcement learning (RL), but it faces unique design choices and challenges brought by the nature of RL. However, the evolution of Transformers in RL has not yet been well unraveled. In this paper, we seek to systematically review the motivations and progress of using Transformers in RL, provide a taxonomy of existing works, discuss each sub-field, and summarize future prospects.

    Carbene Triel Bonds Between TrR3 (Tr=B, Al) and N-Heterocyclic Carbenes

    The carbene triel bond is predicted and characterized by theoretical calculations. The C lone pair of N‐heterocyclic carbenes (NHCs) is allowed to interact with the central triel atom of TrR3 (Tr = B and Al; R = H, F, Cl, and Br). The ensuing bond is very strong, with an interaction energy of nearly 90 kcal/mol. Replacement of the C lone pair by that of either N or Si weakens the binding. The bond is strengthened by electron‐withdrawing substituents on the triel atom, and the reverse occurs with substitution on the NHC. However, these effects do not strictly follow the typical pattern of F > Cl > Br. The TrR3 molecule suffers a good deal of geometric deformation, requiring on the order of 30 kcal/mol, in forming the complex. The R(C···Tr) bond is quite short, for example, 1.6 Å for Tr = B, and shows other indications of at least a partially covalent bond, such as a high electron density at the bond critical point and a good deal of intermolecular charge transfer.

    SEABO: A Simple Search-Based Method for Offline Imitation Learning

    Offline reinforcement learning (RL) has attracted much attention due to its ability to learn from static offline datasets, eliminating the need to interact with the environment. Nevertheless, the success of offline RL relies heavily on offline transitions annotated with reward labels. In practice, we often need to hand-craft the reward function, which is sometimes difficult, labor-intensive, or inefficient. To tackle this challenge, we focus on the offline imitation learning (IL) setting and aim to obtain a reward function based on expert data and unlabeled data. To that end, we propose a simple yet effective search-based offline IL method, tagged SEABO. SEABO assigns a larger reward to a transition that is close to its nearest neighbor in the expert demonstration, and a smaller reward otherwise, all in an unsupervised learning manner. Experimental results on a variety of D4RL datasets indicate that SEABO can achieve performance competitive with offline RL algorithms that have access to ground-truth rewards, given only a single expert trajectory, and can outperform prior reward learning and offline IL methods across many tasks. Moreover, we demonstrate that SEABO also works well when the expert demonstrations contain only observations. Our code is publicly available at https://github.com/dmksjfl/SEABO. Comment: To appear in ICLR202
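
    The search-based reward labelling described above can be sketched with a KD-tree over the expert transitions: each unlabeled transition receives a reward that decays with the distance to its nearest expert neighbor. The squashing function and hyperparameters below are illustrative assumptions rather than the paper's exact choices.

    ```python
    import numpy as np
    from scipy.spatial import cKDTree

    def search_based_rewards(expert_transitions, unlabeled_transitions,
                             alpha=1.0, beta=0.5):
        """Assign rewards by nearest-neighbor search against expert data.
        Both inputs are arrays of transition features, e.g. concatenated (s, a)."""
        tree = cKDTree(expert_transitions)                  # index the expert demonstration
        distances, _ = tree.query(unlabeled_transitions, k=1)
        return alpha * np.exp(-beta * distances)            # closer to expert => larger reward
    ```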

    Effect of Carbon Hybridization in C—F Bond as an Electron Donor in Triel Bonds

    The ability of the F atom of HC≡CF, H2C=CHF and H3CCH2F to serve as an electron donor to the triel (Tr) atom of TrR3 in the context of a triel bond is assessed by ab initio calculations. The triel bond formed by Csp3—F is strongest, as high as 30 kcal/mol, followed by Csp2—F, and then by Csp—F, whose triel bonds can be as small as 1 kcal/mol. The noncovalent bond strength diminishes in the order Tr = Al > Ga > B, consistent with the intensity of the π-hole above the Tr atom in the monomer. The triel bond strength of the Al and Ga complexes increases along with the electronegativity of the R substituent but is largest for R = H when Tr = B. Electrostatics play the largest role in the stronger triel bonds, but dispersion makes an outsized contribution for the weakest such bonds.