The Simulator: Understanding Adaptive Sampling in the Moderate-Confidence Regime
We propose a novel technique for analyzing adaptive sampling called the {\em
Simulator}. Our approach differs from the existing methods by considering not
how much information could be gathered by any fixed sampling strategy, but how
difficult it is to distinguish a good sampling strategy from a bad one given
the limited amount of data collected up to any given time. This change of
perspective allows us to match the strength of both Fano and change-of-measure
techniques, without succumbing to the limitations of either method. For
concreteness, we apply our techniques to a structured multi-arm bandit problem
in the fixed-confidence pure exploration setting, where we show that the
constraints on the means imply a substantial gap between the
moderate-confidence sample complexity and the asymptotic sample complexity as
found in the literature. We also prove the first instance-based
lower bounds for the top-k problem which incorporate the appropriate
log factors. Moreover, our lower bounds zero in on the number of times each
\emph{individual} arm needs to be pulled, uncovering new phenomena which are
drowned out in the aggregate sample complexity. Our new analysis inspires a
simple and near-optimal algorithm for best-arm and top-k identification,
the first {\em practical} algorithm of its kind for the latter problem which
removes extraneous log factors and outperforms the state-of-the-art in
experiments.
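The abstract does not spell out the algorithm, so the following is only a minimal sketch of the same flavour for the best-arm case: successive elimination with a standard Hoeffding-style anytime confidence radius. The `pull` interface, the constants, and the stopping rule are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def best_arm_successive_elimination(pull, n_arms, delta=0.05, max_rounds=100_000):
    """pull(i) -> one observation in [0, 1] from arm i (assumed interface)."""
    active = list(range(n_arms))
    sums = np.zeros(n_arms)
    t = 0
    while len(active) > 1 and t < max_rounds:
        t += 1
        for i in active:
            sums[i] += pull(i)
        means = sums / t
        # Anytime Hoeffding radius with a crude union bound over arms and rounds.
        radius = np.sqrt(np.log(4 * n_arms * t * t / delta) / (2 * t))
        leader = max(means[i] for i in active)
        # Eliminate every arm whose upper bound falls below the leader's lower bound.
        active = [i for i in active if means[i] + radius >= leader - radius]
    return active[0]
```

The same elimination pattern extends to top-k by accepting an arm once its lower bound clears all but k-1 rivals; the Simulator-based lower bounds and the removal of extraneous log factors claimed in the abstract are not reproduced by this sketch.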
Bridging RL Theory and Practice with the Effective Horizon
Deep reinforcement learning (RL) works impressively in some environments and
fails catastrophically in others. Ideally, RL theory should be able to provide
an understanding of why this is, i.e. bounds predictive of practical
performance. Unfortunately, current theory does not quite have this ability. We
compare standard deep RL algorithms to prior sample complexity bounds by
introducing a new dataset, BRIDGE. It consists of 155 MDPs from common deep RL
benchmarks, along with their corresponding tabular representations, which
enables us to exactly compute instance-dependent bounds. We find that prior
bounds do not correlate well with when deep RL succeeds vs. fails, but discover
a surprising property that does. When actions with the highest Q-values under
the random policy also have the highest Q-values under the optimal policy, deep
RL tends to succeed; when they don't, deep RL tends to fail. We generalize this
property into a new complexity measure of an MDP that we call the effective
horizon, which roughly corresponds to how many steps of lookahead search are
needed in order to identify the next optimal action when leaf nodes are
evaluated with random rollouts. Using BRIDGE, we show that the effective
horizon-based bounds are more closely reflective of the empirical performance
of PPO and DQN than prior sample complexity bounds across four metrics. We also
show that, unlike existing bounds, the effective horizon can predict the
effects of using reward shaping or a pre-trained exploration policy.
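To make the highlighted property concrete, here is a small sketch that computes the Q-values of the uniformly random policy and of the optimal policy in a toy tabular MDP and checks whether acting greedily on the former is optimal under the latter. The paper works with finite-horizon MDPs and defines the effective horizon via lookahead search with random rollouts; neither is reproduced here, and the discounted setup below is an assumption made for brevity.

```python
import numpy as np

def q_values(P, R, gamma=0.99, iters=2000):
    """P: (S, A, S) transition tensor, R: (S, A) rewards -- a toy discounted MDP."""
    # Q of the uniformly random policy, via iterated Bellman expectation backups.
    Q_rand = np.zeros(R.shape)
    for _ in range(iters):
        Q_rand = R + gamma * (P @ Q_rand.mean(axis=1))
    # Optimal Q, via value iteration.
    Q_star = np.zeros(R.shape)
    for _ in range(iters):
        Q_star = R + gamma * (P @ Q_star.max(axis=1))
    return Q_rand, Q_star

def greedy_over_random_is_optimal(P, R, gamma=0.99):
    """Does acting greedily w.r.t. the random policy's Q pick an optimal action
    in every state? (The qualitative property described in the abstract.)"""
    Q_rand, Q_star = q_values(P, R, gamma)
    greedy = Q_rand.argmax(axis=1)
    optimal = np.isclose(Q_star, Q_star.max(axis=1, keepdims=True))
    return bool(all(optimal[s, a] for s, a in enumerate(greedy)))
```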
Active Exploration for Inverse Reinforcement Learning
Inverse Reinforcement Learning (IRL) is a powerful paradigm for inferring a
reward function from expert demonstrations. Many IRL algorithms require a known
transition model and sometimes even a known expert policy, or they at least
require access to a generative model. However, these assumptions are too strong
for many real-world applications, where the environment can be accessed only
through sequential interaction. We propose a novel IRL algorithm: Active
exploration for Inverse Reinforcement Learning (AceIRL), which actively
explores an unknown environment and expert policy to quickly learn the expert's
reward function and identify a good policy. AceIRL uses previous observations
to construct confidence intervals that capture plausible reward functions and
find exploration policies that focus on the most informative regions of the
environment. AceIRL is the first approach to active IRL with sample-complexity
bounds that does not require a generative model of the environment. AceIRL
matches the sample complexity of active IRL with a generative model in the
worst case. Additionally, we establish a problem-dependent bound that relates
the sample complexity of AceIRL to the suboptimality gap of a given IRL
problem. We empirically evaluate AceIRL in simulations and find that it
significantly outperforms more naive exploration strategies.
Comment: Presented at Conference on Neural Information Processing Systems (NeurIPS), 202
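As a rough illustration of the exploration side of this idea (not the actual AceIRL construction), the sketch below plans against an uncertainty pseudo-reward so that the agent steers toward the state-action pairs whose estimates, and hence the plausible reward set, are least constrained. The `P_hat` and `counts` interfaces and the 1/sqrt(n) bonus are assumptions for illustration.

```python
import numpy as np

def uncertainty_directed_policy(P_hat, counts, horizon):
    """Plan greedily against an uncertainty 'pseudo-reward' so the agent visits
    the state-action pairs whose estimates are least constrained.
    P_hat: (S, A, S) estimated transitions; counts: (S, A) visit counts."""
    # A Hoeffding-style confidence width shrinks like 1/sqrt(n).
    bonus = 1.0 / np.sqrt(np.maximum(counts, 1))
    Q = np.zeros(counts.shape)
    for _ in range(horizon):            # finite-horizon value iteration on the bonus
        Q = bonus + P_hat @ Q.max(axis=1)
    return Q.argmax(axis=1)             # one exploration action per state
```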
Adaptive Reward-Free Exploration
Reward-free exploration is a reinforcement learning setting studied by Jin et
al. (2020), who address it by running several algorithms with regret guarantees
in parallel. In our work, we instead give a more natural adaptive approach for
reward-free exploration which directly reduces upper bounds on the maximum MDP
estimation error. We show that, interestingly, our reward-free UCRL algorithm, RF-UCRL,
can be seen as a variant of an algorithm of Fiechter from 1994, originally
proposed for a different objective that we call best-policy identification. We
prove that RF-UCRL needs of order $\frac{SAH^4}{\varepsilon^2}\left(\log\frac{1}{\delta} + S\right)$ episodes to
output, with probability $1-\delta$, an $\varepsilon$-approximation of the
optimal policy for any reward function. This bound improves over existing
sample-complexity bounds in both the small $\varepsilon$ and the small $\delta$
regimes. We further investigate the relative complexities of reward-free
exploration and best-policy identification.
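In a similar spirit, here is a minimal sketch of the error-bound propagation behind reward-free exploration: an upper bound on the estimation error is backed up over the horizon and the agent acts greedily with respect to it. The exact RF-UCRL bonus differs; the Hoeffding-style width below is only a stand-in, and the `P_hat`/`counts` interfaces are assumptions.

```python
import numpy as np

def error_bound_policy(P_hat, counts, H, delta=0.05):
    """Back up an upper bound on the MDP estimation error over horizon H and
    act greedily with respect to it (a stand-in for the RF-UCRL bonus)."""
    S, A, _ = P_hat.shape
    width = np.sqrt(np.log(2 * S * A / delta) / (2 * np.maximum(counts, 1)))
    W = np.zeros((H + 1, S, A))          # W[h] bounds the error from step h onward
    for h in range(H - 1, -1, -1):
        next_err = W[h + 1].max(axis=1)  # worst-case error at the next state
        W[h] = np.minimum(H - h, H * width + P_hat @ next_err)
    greedy = W[0].argmax(axis=1)         # explore where the error bound is largest
    return greedy, float(W[0].max())
```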
- …