Only Relevant Information Matters: Filtering Out Noisy Samples to Boost RL
In reinforcement learning, policy gradient algorithms optimize the policy
directly and rely on efficiently sampling an environment. However, while
most sampling procedures are based on direct policy sampling, self-performance
measures could be used to improve such sampling prior to each policy update.
Following this line of thought, we introduce SAUNA, a method where
non-informative transitions are rejected from the gradient update. The level of
information is estimated according to the fraction of variance explained by the
value function: a measure of the discrepancy between V and the empirical
returns. In this work, we use this metric to select samples that are useful to
learn from, and we demonstrate that this selection can significantly improve
the performance of policy gradient methods. In this paper: (a) We define
SAUNA's metric and introduce its method to filter transitions. (b) We conduct
experiments on a set of benchmark continuous control problems. SAUNA
significantly improves performance. (c) We investigate how SAUNA reliably
selects samples with the most positive impact on learning and study its
improvement on both performance and sample efficiency.
Comment: Accepted at IJCAI 2020.
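As a rough illustration of the measure described above, here is a minimal Python sketch of the explained-variance metric and a simple batch filter; the function names, the deviation-based rule, and the drop fraction are assumptions for illustration, not SAUNA's exact selection procedure.

import numpy as np

def explained_variance(returns, values):
    # Fraction of the returns' variance explained by the value function:
    # 1 - Var(returns - values) / Var(returns). Values near 1 mean V fits the
    # empirical returns well; values near 0 or below mean it does not.
    var_returns = np.var(returns)
    if var_returns == 0:
        return float("nan")
    return 1.0 - np.var(returns - values) / var_returns

def filter_batch(batch, drop_fraction=0.2):
    # Illustrative filter: drop the transitions whose value predictions deviate
    # least from the empirical returns and keep the rest for the gradient update.
    # `batch` is assumed to be a dict of per-transition arrays of equal length.
    dev = np.abs(batch["returns"] - batch["values"])
    keep = dev >= np.quantile(dev, drop_fraction)
    return {k: v[keep] for k, v in batch.items()}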
SAAC: Safe Reinforcement Learning as an Adversarial Game of Actor-Critics
Although Reinforcement Learning (RL) is effective for sequential
decision-making problems under uncertainty, it still fails to thrive in
real-world systems where risk or safety is a binding constraint. In this paper,
we formulate the RL problem with safety constraints as a non-zero-sum game.
When deployed with maximum entropy RL, this formulation leads to a safe
adversarially guided soft actor-critic framework, called SAAC. In SAAC, the
adversary aims to break the safety constraint while the RL agent aims to
maximize the constrained value function given the adversary's policy. The
safety constraint on the agent's value function manifests only as a repulsion
term between the agent's and the adversary's policies. Unlike previous
approaches, SAAC can address different safety criteria such as safe
exploration, mean-variance risk sensitivity, and CVaR-like coherent risk
sensitivity. We illustrate the design of the adversary for these constraints.
Then, in each of these variations, we show the agent differentiates itself from
the adversary's unsafe actions in addition to learning to solve the task.
Finally, for challenging continuous control tasks, we demonstrate that SAAC
achieves faster convergence, better efficiency, and fewer failures to satisfy
the safety constraints than risk-averse distributional RL and risk-neutral soft
actor-critic algorithms.
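A minimal sketch of how the repulsion term described above could enter a soft actor-critic style update, assuming PyTorch and diagonal Gaussian policies; the function name, coefficients, and exact placement of the term are assumptions rather than the paper's implementation.

import torch
from torch.distributions import Normal, kl_divergence

def agent_actor_loss(q_value, log_prob, agent_dist, adversary_dist,
                     alpha=0.2, beta=1.0):
    # SAC-style actor objective with an added repulsion term: the agent is
    # pushed to keep its action distribution away from the adversary's.
    # agent_dist and adversary_dist are assumed to be Normal distributions
    # with batch shape [batch, act_dim]; alpha (entropy weight) and beta
    # (repulsion weight) are illustrative coefficients.
    repulsion = kl_divergence(agent_dist, adversary_dist).sum(-1)
    # Minimizing this loss maximizes Q, the policy entropy, and the repulsion.
    return (alpha * log_prob - q_value - beta * repulsion).mean()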
MERL: Multi-Head Reinforcement Learning
A common challenge in reinforcement learning is how to convert the agent's interactions with an environment into fast and robust learning. For instance, earlier work makes use of domain knowledge to improve existing reinforcement learning algorithms in complex tasks. While promising, previously acquired knowledge is often costly and challenging to scale up. Instead, we decide to consider problem knowledge with signals from quantities relevant to solve any task, e.g., self-performance assessment and accurate expectations. V^ex is such a quantity. It is the fraction of variance explained by the value function and measures the discrepancy between V and the returns. Taking advantage of V^ex, we propose MERL, a general framework for structuring reinforcement learning by injecting problem knowledge into policy gradient updates. As a result, the agent is not only optimized for a reward but learns using problem-focused quantities provided by MERL, applicable out-of-the-box to any task. In this paper: (a) We introduce and define MERL, the multi-head reinforcement learning framework we use throughout this work. (b) We conduct experiments across a variety of standard benchmark environments, including 9 continuous control tasks, where results show improved performance. (c) We demonstrate that MERL also improves transfer learning on a set of challenging pixel-based tasks. (d) We ponder how MERL tackles the problem of reward sparsity and better conditions the feature space of reinforcement learning agents.
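A minimal sketch of a multi-head architecture in the spirit of MERL, assuming PyTorch: a shared torso feeds the policy and value heads plus auxiliary heads for problem-focused quantities such as V^ex. The class name, layer sizes, and number of auxiliary heads are illustrative, not the paper's configuration.

import torch.nn as nn

class MultiHeadActorCritic(nn.Module):
    # Shared torso with a policy head, a value head, and auxiliary heads that
    # predict problem-focused quantities used as extra learning signals.
    def __init__(self, obs_dim, act_dim, hidden=64, n_aux=2):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, hidden), nn.Tanh())
        self.policy_head = nn.Linear(hidden, act_dim)
        self.value_head = nn.Linear(hidden, 1)
        self.aux_heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_aux)])

    def forward(self, obs):
        h = self.torso(obs)
        return (self.policy_head(h), self.value_head(h),
                [head(h) for head in self.aux_heads])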
Waypoint Transformer: Reinforcement Learning via Supervised Learning with Intermediate Targets
Despite the recent advancements in offline reinforcement learning via
supervised learning (RvS) and the success of the decision transformer (DT)
architecture in various domains, DTs have fallen short in several challenging
benchmarks. The root cause of this underperformance lies in their inability to
seamlessly connect segments of suboptimal trajectories. To overcome this
limitation, we present a novel approach to enhance RvS methods by integrating
intermediate targets. We introduce the Waypoint Transformer (WT), an
architecture that builds upon the DT framework and is conditioned on
automatically generated waypoints. The results show a significant increase in
the final return compared to existing RvS methods, with performance on par or
greater than existing state-of-the-art temporal difference learning-based
methods. Additionally, the performance and stability improvements are largest
in the most challenging environments and data configurations, including AntMaze
Large Play/Diverse and Kitchen Mixed/Partial.
Comment: Accepted to the Conference on Neural Information Processing Systems 2023 (NeurIPS 2023).
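A rough sketch of waypoint conditioning, assuming `states` is a NumPy array of shape [T, obs_dim] for one trajectory; the fixed look-ahead used here to produce intermediate targets is an assumption for illustration, whereas the paper conditions on automatically generated waypoints.

import numpy as np

def make_waypoints(states, horizon=30):
    # Toy waypoint generation for an RvS-style pipeline: the intermediate
    # target for timestep t is the state observed `horizon` steps later in the
    # same trajectory (clamped at the final state).
    T = len(states)
    idx = np.minimum(np.arange(T) + horizon, T - 1)
    return states[idx]

# The policy would then be conditioned on (state_t, waypoint_t) pairs rather
# than on return-to-go alone, leaving the rest of a DT-style training loop
# unchanged.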
Learning Value Functions in Deep Policy Gradients using Residual Variance
Policy gradient algorithms have proven to be successful in diverse decision
making and control tasks. However, these methods suffer from high sample
complexity and instability issues. In this paper, we address these challenges
by providing a different approach for training the critic in the actor-critic
framework. Our work builds on recent studies indicating that traditional
actor-critic algorithms do not succeed in fitting the true value function,
calling for the need to identify a better objective for the critic. In our
method, the critic uses a new state-value (resp. state-action-value) function
approximation that learns the value of the states (resp. state-action pairs)
relative to their mean value rather than the absolute value as in conventional
actor-critic. We prove the theoretical consistency of the new gradient
estimator and observe dramatic empirical improvement across a variety of
continuous control tasks and algorithms. Furthermore, we validate our method in
tasks with sparse rewards, where we provide experimental evidence and
theoretical insights.
Comment: Accepted at ICLR 2021.
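A minimal sketch of a critic objective that values states relative to their mean, assuming PyTorch tensors of predictions and bootstrapped targets; it illustrates the residual-variance idea stated above rather than reproducing the authors' exact training code.

import torch

def residual_variance_critic_loss(value_pred, value_target):
    # Fit the value function up to a constant by minimizing the variance of the
    # residual (target - prediction) instead of its mean squared error, so
    # states are valued relative to their mean value.
    residual = value_target - value_pred
    return ((residual - residual.mean()) ** 2).mean()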
PASTA: Pretrained Action-State Transformer Agents
Self-supervised learning has brought about a revolutionary paradigm shift in
various computing domains, including NLP, vision, and biology. Recent
approaches involve pre-training transformer models on vast amounts of unlabeled
data, serving as a starting point for efficiently solving downstream tasks. In
the realm of reinforcement learning, researchers have recently adapted these
approaches by developing models pre-trained on expert trajectories, enabling
them to address a wide range of tasks, from robotics to recommendation systems.
However, existing methods mostly rely on intricate pre-training objectives
tailored to specific downstream applications. This paper presents a
comprehensive investigation of models we refer to as Pretrained Action-State
Transformer Agents (PASTA). Our study uses a unified methodology and covers an
extensive set of general downstream tasks including behavioral cloning, offline
RL, sensor failure robustness, and dynamics change adaptation. Our goal is to
systematically compare various design choices and provide valuable insights to
practitioners for building robust models. Key highlights of our study include
tokenization at the action and state component level, using fundamental
pre-training objectives like next token prediction, training models across
diverse domains simultaneously, and using parameter efficient fine-tuning
(PEFT). The developed models in our study contain fewer than 10 million
parameters and the application of PEFT enables fine-tuning of fewer than 10,000
parameters during downstream adaptation, allowing a broad community to use
these models and reproduce our experiments. We hope that this study will
encourage further research into the use of transformers with first-principles
design choices to represent RL trajectories and contribute to robust policy
learning.
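A minimal sketch of the component-level tokenization highlighted above, assuming NumPy arrays of continuous states and actions; the discretization scheme, bin count, and value range are assumptions for illustration, not the study's exact pipeline.

import numpy as np

def tokenize_component_level(states, actions, n_bins=256, low=-1.0, high=1.0):
    # Every state dimension and every action dimension becomes its own discrete
    # token, interleaved per timestep as [s_1 .. s_ds, a_1 .. a_da].
    edges = np.linspace(low, high, n_bins - 1)
    def discretize(x):
        return np.digitize(np.clip(x, low, high), edges)
    tokens = []
    for s, a in zip(states, actions):
        tokens.extend(discretize(s).tolist())
        tokens.extend(discretize(a).tolist())
    # The resulting sequence is what a transformer would consume under a
    # next-token prediction objective.
    return np.array(tokens, dtype=np.int64)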
Adversarially Guided Actor-Critic
Despite definite success in deep reinforcement learning problems, actor-critic algorithms are still confronted with sample inefficiency in complex environments, particularly in tasks where efficient exploration is a bottleneck. These methods consider a policy (the actor) and a value function (the critic) whose respective losses are built using different motivations and approaches. This paper introduces a third protagonist: the adversary. While the adversary mimics the actor by minimizing the KL-divergence between their respective action distributions, the actor, in addition to learning to solve the task, tries to differentiate itself from the adversary's predictions. This novel objective stimulates the actor to follow strategies that could not have been correctly predicted from previous trajectories, making its behavior innovative in tasks where the reward is extremely rare. Our experimental analysis shows that the resulting Adversarially Guided Actor-Critic (AGAC) algorithm leads to more exhaustive exploration. Notably, AGAC outperforms current state-of-the-art methods on a set of various hard-exploration and procedurally-generated tasks.
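A minimal sketch of the adversarial guidance described above, assuming PyTorch and discrete action logits; the function name, the bonus coefficient, and the exact way the divergence enters a PPO-style update are assumptions for illustration.

import torch
from torch.distributions import Categorical, kl_divergence

def agac_terms(actor_logits, adversary_logits, coef=0.01):
    # The adversary is trained to imitate the actor (minimize the KL between
    # their action distributions), while the actor receives a bonus
    # proportional to that same divergence, rewarding behavior the adversary
    # failed to predict.
    actor_dist = Categorical(logits=actor_logits)
    frozen_adversary = Categorical(logits=adversary_logits.detach())
    actor_bonus = coef * kl_divergence(actor_dist, frozen_adversary)
    adversary_loss = kl_divergence(Categorical(logits=actor_logits.detach()),
                                   Categorical(logits=adversary_logits)).mean()
    # actor_bonus would be added to the advantage in the policy update;
    # adversary_loss trains the adversary network.
    return actor_bonus, adversary_loss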
Hearables in hearing care: discovering usage patterns through IoT devices
Hearables are on the rise as next-generation wearables, capable of streaming audio, modifying soundscapes or functioning as biometric sensors. The recent introduction of IoT (Internet of Things) connected hearing aids offers new opportunities for hearables to collect quantified self (QS) data that capture user intents and thereby provide insights for adjusting the settings of the device. In our study, 6 participants shared their QS data, capturing when they remotely changed their device settings over 6 weeks. The data confirms that the participants preferred to actively change programs rather than use a single default setting provided by an audiologist. Furthermore, their unique usage patterns indicate a need for designing hearing aids which, as hearables, adapt their settings dynamically to individual preferences during the day.