Recurrent Value Functions
Despite recent successes in Reinforcement Learning, value-based methods often
suffer from high variance, which hinders performance. In this paper, we illustrate
this in a continuous control setting where state-of-the-art methods perform
poorly whenever sensor noise is introduced. To overcome this issue, we
introduce Recurrent Value Functions (RVFs) as an alternative way to estimate the
value function of a state. We propose to estimate the value function of the
current state using the value function of past states visited along the
trajectory. Due to the nature of their formulation, RVFs have a natural way of
learning an emphasis function that selectively emphasizes important states.
First, we establish the asymptotic convergence properties of RVFs in tabular
settings. We then demonstrate their robustness on a partially observable domain
and continuous control tasks. Finally, we provide a qualitative interpretation
of the learned emphasis function.
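
As a rough illustration of the idea (a minimal sketch based only on this abstract; the function and variable names below are mine, not the authors'), the recurrent estimate can be pictured as blending the current state's value with the estimate carried over from earlier states along the trajectory, with an emphasis weight controlling how much the current state contributes:

    import numpy as np

    def recurrent_value_estimates(values, emphasis):
        """Illustrative recurrent value estimate along one trajectory.

        values:   per-state value estimates V(s_0..s_T)      (hypothetical inputs)
        emphasis: per-state emphasis weights beta_t in [0,1]  (hypothetical inputs)

        The recurrent estimate mixes the current value with the previous
        recurrent estimate:
            V_rec(s_t) = beta_t * V(s_t) + (1 - beta_t) * V_rec(s_{t-1}).
        """
        v_rec = np.empty_like(values, dtype=float)
        v_rec[0] = values[0]
        for t in range(1, len(values)):
            v_rec[t] = emphasis[t] * values[t] + (1.0 - emphasis[t]) * v_rec[t - 1]
        return v_rec

    # Toy example: the noisy value at t=2 gets low emphasis and barely
    # perturbs the recurrent estimate.
    values = np.array([1.0, 1.1, 5.0, 1.2])
    emphasis = np.array([1.0, 0.9, 0.1, 0.9])
    print(recurrent_value_estimates(values, emphasis))
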
Regret Minimization for Partially Observable Deep Reinforcement Learning
Deep reinforcement learning algorithms that estimate state and state-action
value functions have been shown to be effective in a variety of challenging
domains, including learning control strategies from raw image pixels. However,
algorithms that estimate state and state-action value functions typically
assume a fully observed state and must compensate for partial observations by
using finite length observation histories or recurrent networks. In this work,
we propose a new deep reinforcement learning algorithm based on counterfactual
regret minimization that iteratively updates an approximation to an
advantage-like function and is robust to partially observed state. We
demonstrate that this new algorithm can substantially outperform strong
baseline methods on several partially observed reinforcement learning tasks:
learning first-person 3D navigation in Doom and Minecraft, and acting in the
presence of partially observed objects in Doom and Pong.
Comment: ICML 2018.
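
One way to picture a counterfactual-regret-minimization-style policy (a hedged sketch, not necessarily the paper's exact algorithm; the names here are illustrative) is regret matching over a clipped advantage-like quantity, where actions receive probability in proportion to their positive advantage:

    import numpy as np

    def regret_matching_policy(advantages):
        """Hypothetical illustration: turn advantage-like estimates into a policy
        via regret matching, i.e. probabilities proportional to the clipped
        (positive) advantages; fall back to uniform if no advantage is positive."""
        positive = np.maximum(advantages, 0.0)
        total = positive.sum()
        if total > 0:
            return positive / total
        return np.full_like(advantages, 1.0 / len(advantages))

    # Example with three actions: only actions with positive advantage get mass
    # (probabilities 0.25, 0.0, 0.75).
    print(regret_matching_policy(np.array([0.5, -0.2, 1.5])))
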
Optimal Multiple Stopping Rule for Warm-Starting Sequential Selection
In this paper we present the Warm-starting Dynamic Thresholding algorithm,
developed using dynamic programming, for a variant of the standard online
selection problem. The problem allows job positions to be either free or
already occupied at the beginning of the process. Throughout the selection
process, the decision maker interviews the new candidates one after the other,
revealing a quality score for each of them. Based on that information, she
can (re)assign each job at most once by taking immediate and irrevocable
decisions. We relax the hard requirement, inherent to dynamic programming
algorithms, of perfectly knowing the distribution from which the candidates'
scores are drawn by presenting extensions for the partial- and no-information
cases, in which the decision maker can learn the underlying score distribution
sequentially while interviewing candidates.
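
For intuition, a toy threshold-based online selection loop might look as follows (an illustrative sketch under my own assumptions, not the Warm-starting Dynamic Thresholding algorithm itself; here the time-dependent thresholds are assumed to have been precomputed, e.g. by dynamic programming):

    import numpy as np

    def threshold_selection(scores, thresholds, free_positions):
        """Toy threshold-based online selection (illustrative only).

        scores:         sequence of revealed candidate scores
        thresholds:     time-dependent acceptance thresholds (assumed given)
        free_positions: number of positions still free at the start
        """
        assigned = []
        for t, score in enumerate(scores):
            if free_positions == 0:
                break
            if score >= thresholds[t]:      # immediate, irrevocable decision
                assigned.append((t, score))
                free_positions -= 1
        return assigned

    rng = np.random.default_rng(0)
    scores = rng.uniform(size=10)
    thresholds = np.linspace(0.8, 0.0, num=10)  # thresholds relax over time
    print(threshold_selection(scores, thresholds, free_positions=2))
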
Scaling data-driven robotics with reward sketching and batch reinforcement learning
We present a framework for data-driven robotics that makes use of a large
dataset of recorded robot experience and scales to several tasks using learned
reward functions. We show how to apply this framework to accomplish three
different object manipulation tasks on a real robot platform. Given
demonstrations of a task together with task-agnostic recorded experience, we
use a special form of human annotation as supervision to learn a reward
function, which enables us to deal with real-world tasks where the reward
signal cannot be acquired directly. Learned rewards are used in combination
with a large dataset of experience from different tasks to learn a robot policy
offline using batch RL. We show that using our approach it is possible to train
agents to perform a variety of challenging manipulation tasks including
stacking rigid objects and handling cloth.
Comment: Project website: https://sites.google.com/view/data-driven-robotics
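
A minimal sketch of the relabelling step implied by this pipeline (illustrative only; the function and the toy reward model below are hypothetical, not part of the released framework): a learned reward model fills in rewards for task-agnostic logged transitions so that an offline (batch) RL algorithm can then train a policy on them:

    def relabel_with_learned_reward(transitions, reward_model):
        """Hypothetical relabelling step: a learned reward model supplies the
        reward for logged (obs, action, next_obs) transitions, producing a
        dataset that a batch (offline) RL algorithm could train on."""
        return [(obs, act, reward_model(obs, act), next_obs)
                for (obs, act, next_obs) in transitions]

    # Toy example with scalar observations/actions and a hand-made reward model.
    logged = [(0.0, 1.0, 0.5), (0.5, -1.0, 0.2)]
    relabelled = relabel_with_learned_reward(logged, lambda obs, act: obs + act)
    print(relabelled)  # [(0.0, 1.0, 1.0, 0.5), (0.5, -1.0, -0.5, 0.2)]
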
Preferential Temporal Difference Learning
Temporal-Difference (TD) learning is a general and very useful tool for
estimating the value function of a given policy, which in turn is required to
find good policies. Generally speaking, TD learning updates states whenever
they are visited. When the agent lands in a state, its value can be used to
compute the TD-error, which is then propagated to other states. However, it may
be interesting, when computing updates, to take into account other information
than whether a state is visited or not. For example, some states might be more
important than others (such as states which are frequently seen in a successful
trajectory). Or, some states might have unreliable value estimates (for
example, due to partial observability or lack of data), making their values
less desirable as targets. We propose an approach to re-weighting states used
in TD updates, both when they are the input and when they provide the target
for the update. We prove that our approach converges with linear function
approximation and illustrate its desirable empirical behaviour compared to
other TD-style methods.
Comment: Accepted at the 38th International Conference on Machine Learning (ICML 2021).
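
A hedged sketch of what re-weighting states in a TD-style update could look like (illustrative only, with hypothetical names; the paper's actual update rule and convergence analysis are more involved): a preference weight scales how strongly a state is updated when it is the input, and how much it is trusted when it provides the bootstrap target:

    import numpy as np

    def preferential_td_update(V, s, r, s_next, mc_return, beta,
                               alpha=0.1, gamma=0.99):
        """Illustrative tabular update with per-state preference weights beta.

        beta[s]      scales how strongly state s is updated when it is the input;
        beta[s_next] scales how much s_next is trusted as a bootstrap target,
                     falling back to a longer return (here a Monte Carlo return
                     from s_next onwards) when its value estimate is unreliable.
        """
        bootstrap = beta[s_next] * V[s_next] + (1.0 - beta[s_next]) * mc_return
        td_error = r + gamma * bootstrap - V[s]
        V[s] += alpha * beta[s] * td_error
        return V

    V = np.zeros(3)
    beta = np.array([1.0, 0.2, 1.0])   # state 1 has an unreliable value estimate
    V = preferential_td_update(V, s=0, r=1.0, s_next=1, mc_return=2.0, beta=beta)
    print(V)  # state 0 moves towards a target dominated by the Monte Carlo return
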