Latent Contextual Bandits and their Application to Personalized Recommendations for New Users
Personalized recommendations for new users, also known as the cold-start
problem, can be formulated as a contextual bandit problem. Existing contextual
bandit algorithms generally rely on features alone to capture user variability.
Such methods are inefficient in learning new users' interests. In this paper we
propose Latent Contextual Bandits. We consider both the benefit of leveraging a
set of learned latent user classes for new users, and how we can learn such
latent classes from prior users. We show that our approach achieves a better
regret bound than existing algorithms. We also demonstrate the benefit of our
approach using a large real-world dataset and a preliminary user study.
Comment: 25th International Joint Conference on Artificial Intelligence (IJCAI 2016)
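As a rough illustration of how learned latent user classes might speed up a new user's bandit problem, here is a minimal Python sketch: a UCB-style learner whose optimism is capped by per-class reward estimates assumed to have been learned from prior users. The class/arm structure, the clipping rule, and all constants are illustrative assumptions, not the paper's Latent Contextual Bandits algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_arms = 3, 5
# Hypothetical per-class mean rewards, assumed learned offline from prior users.
class_arm_means = rng.uniform(0, 1, size=(n_classes, n_arms))

def recommend(counts, rewards, t):
    """Pick an arm by UCB, capping optimism at the best latent-class estimate."""
    ucb = np.full(n_arms, np.inf)
    for a in range(n_arms):
        if counts[a] > 0:
            mean = rewards[a] / counts[a]
            bonus = np.sqrt(2 * np.log(t + 1) / counts[a])
            ucb[a] = min(mean + bonus, class_arm_means[:, a].max())
    return int(np.argmax(ucb))

# Simulate one new user whose true latent class is unknown to the learner.
true_class = rng.integers(n_classes)
counts, rewards = np.zeros(n_arms), np.zeros(n_arms)
for t in range(200):
    a = recommend(counts, rewards, t)
    r = float(rng.random() < class_arm_means[true_class, a])  # Bernoulli reward
    counts[a] += 1
    rewards[a] += r
print("empirical means:", np.round(rewards / np.maximum(counts, 1), 2))
```

The cap is where the prior users' structure enters: arms that no latent class rewards highly attract less exploration for the new user.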
The Online Coupon-Collector Problem and Its Application to Lifelong Reinforcement Learning
Transferring knowledge across a sequence of related tasks is an important
challenge in reinforcement learning (RL). Despite much encouraging empirical
evidence, there has been little theoretical analysis. In this paper, we study a
class of lifelong RL problems: the agent solves a sequence of tasks modeled as
finite Markov decision processes (MDPs), each of which is from a finite set of
MDPs with the same state/action sets and different transition/reward functions.
Motivated by the need for cross-task exploration in lifelong learning, we
formulate a novel online coupon-collector problem and give an optimal
algorithm. This allows us to develop a new lifelong RL algorithm whose overall
sample complexity across a sequence of tasks is much smaller than that of
single-task learning, even if the sequence of tasks is generated by an
adversary. Benefits of the algorithm are demonstrated in simulated problems,
including a recently introduced human-robot interaction problem.
Comment: 13 pages
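One ingredient of such lifelong learning is recognizing whether the current task is one of the MDPs already seen. The hypothetical sketch below compares observed transition frequencies against a library of previously learned models; it only gestures at the cross-task exploration issue and is not the paper's online coupon-collector algorithm.

```python
def identify_task(observations, known_models, tol=0.1):
    """Return the index of a known model consistent with observed transition
    frequencies, or None if the task appears to be new (hypothetical test)."""
    for i, model in enumerate(known_models):
        if all(abs(observations.get(k, 0.0) - p) <= tol for k, p in model.items()):
            return i
    return None

# Hypothetical library of previously learned tasks: transition probabilities
# keyed by (state, action, next_state).
known_models = [
    {("s0", "a0", "s1"): 0.9, ("s0", "a0", "s0"): 0.1},
    {("s0", "a0", "s1"): 0.2, ("s0", "a0", "s0"): 0.8},
]

# Frequencies estimated from a short exploration phase in the current task.
observed = {("s0", "a0", "s1"): 0.85, ("s0", "a0", "s0"): 0.15}
match = identify_task(observed, known_models)
print("reuse model", match if match is not None else "-> learn from scratch")
```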
Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
In this paper we present a new way of predicting the performance of a
reinforcement learning policy given historical data that may have been
generated by a different policy. The ability to evaluate a policy from
historical data is important for applications where the deployment of a bad
policy can be dangerous or costly. We show empirically that our algorithm
produces estimates that often have orders of magnitude lower mean squared error
than existing methods; that is, it makes more efficient use of the available
data. Our new estimator is based on two advances: an extension of the doubly
robust estimator (Jiang and Li, 2015), and a new way to mix between model-based
estimates and importance-sampling-based estimates.
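For context, here is a minimal sketch of the per-decision doubly robust estimator that this line of work builds on (following our reading of Jiang and Li, 2015). The trajectory format, the crude model values q_hat and v_hat, and the toy data are assumptions, and the paper's contribution of learning how to mix model-based and importance-sampling estimates is not shown here.

```python
import numpy as np

def doubly_robust(trajectories, q_hat, v_hat, gamma=1.0):
    """Per-decision doubly robust off-policy estimate of a policy's value.

    Each trajectory is a list of (s, a, r, p_e, p_b), where p_e and p_b are the
    evaluation- and behavior-policy probabilities of the logged action.
    """
    estimates = []
    for traj in trajectories:
        total, rho_prev, discount = 0.0, 1.0, 1.0
        for s, a, r, p_e, p_b in traj:
            rho = rho_prev * (p_e / p_b)      # cumulative importance weight
            # Model values act as a control variate on the sampled reward.
            total += discount * (rho * (r - q_hat(s, a)) + rho_prev * v_hat(s))
            rho_prev = rho
            discount *= gamma
        estimates.append(total)
    return float(np.mean(estimates))

# Tiny illustrative example with a single state and a crude constant model.
q_hat = lambda s, a: 0.5
v_hat = lambda s: 0.5
trajs = [[("s0", 0, 1.0, 0.8, 0.5)], [("s0", 1, 0.0, 0.2, 0.5)]]
print(doubly_robust(trajs, q_hat, v_hat))
```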
Policy Gradient Methods for Reinforcement Learning with Function Approximation and Action-Dependent Baselines
We show how an action-dependent baseline can be used with the policy gradient
theorem with function approximation, which was originally presented with
action-independent baselines by Sutton et al. (2000).
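A sketch of what an unbiased action-dependent baseline can look like for a linear softmax policy: the sampled term uses Q(s, a) - b(s, a), and an analytic correction summed over actions restores unbiasedness. The policy class, the critic, and the baseline here are illustrative assumptions rather than the construction in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, dim = 3, 4
theta = np.zeros((n_actions, dim))            # linear softmax policy parameters

def policy(s):
    logits = theta @ s
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(s, a):
    """Gradient of log pi(a|s) w.r.t. theta for a linear softmax policy."""
    g = -np.outer(policy(s), s)
    g[a] += s
    return g

def pg_sample(s, a, q_value, baseline):
    """One gradient sample with an action-dependent baseline b(s, a).

    The sampled term uses Q(s, a) - b(s, a) as its signal; the analytic term
    sum_a pi(a|s) * grad log pi(a|s) * b(s, a) adds back what subtracting an
    action-dependent baseline would otherwise remove, keeping it unbiased.
    """
    p = policy(s)
    correction = sum(p[i] * grad_log_pi(s, i) * baseline(s, i)
                     for i in range(n_actions))
    return correction + grad_log_pi(s, a) * (q_value - baseline(s, a))

# Illustrative use with a made-up critic and a made-up baseline.
q_hat = lambda s, a: float(a)                  # hypothetical Q estimate
b = lambda s, a: float(a) - 0.1                # hypothetical action-dependent baseline
s = rng.normal(size=dim)
a = int(rng.choice(n_actions, p=policy(s)))
print(pg_sample(s, a, q_hat(s, a), b))
```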
When Simple Exploration is Sample Efficient: Identifying Sufficient Conditions for Random Exploration to Yield PAC RL Algorithms
Efficient exploration is one of the key challenges for reinforcement learning
(RL) algorithms. Most traditional sample efficiency bounds require strategic
exploration. Recently, many deep RL algorithms with simple heuristic exploration
strategies and few formal guarantees have achieved surprising success in many
domains. These results raise important questions about understanding such
exploration strategies, for example ε-greedy, as well as about what
characterizes the difficulty of exploration in MDPs. In this work we propose
problem-specific sample complexity bounds for learning with random walk
exploration that rely on several structural properties. We also link our
theoretical results to empirical benchmark domains, to illustrate whether our
bound gives polynomial sample complexity in these domains and how that relates
to the empirical performance.
Comment: Appeared in The 14th European Workshop on Reinforcement Learning (EWRL), 201
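To make the object of study concrete, here is a minimal ε-greedy Q-learning loop on a small chain MDP, the kind of simple random exploration whose sample complexity such bounds characterize; the environment and hyperparameters are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small chain MDP: actions move left/right; reward only at the right end.
n_states, n_actions, eps, alpha, gamma = 6, 2, 0.2, 0.5, 0.95
Q = np.zeros((n_states, n_actions))

def step(s, a):
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

for episode in range(200):
    s = 0
    for t in range(30):
        # Epsilon-greedy: purely random exploration with probability eps.
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s_next, r = step(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(np.round(Q.max(axis=1), 2))   # learned state values along the chain
```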
Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning
Statistical performance bounds for reinforcement learning (RL) algorithms can
be critical for high-stakes applications like healthcare. This paper introduces
a new framework for theoretically measuring the performance of such algorithms
called Uniform-PAC, which is a strengthening of the classical Probably
Approximately Correct (PAC) framework. In contrast to the PAC framework, the
uniform version may be used to derive high probability regret guarantees and so
forms a bridge between the two setups that has been missing in the literature.
We demonstrate the benefits of the new framework for finite-state episodic MDPs
with a new algorithm that is Uniform-PAC and simultaneously achieves optimal
regret and PAC guarantees except for a factor of the horizon.
Comment: appears in Neural Information Processing Systems 201
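As we understand it (a paraphrase, not the paper's exact statement), Uniform-PAC asks that a single high-probability event control the number of ε-suboptimal episodes for every ε simultaneously, with F_UPAC polynomial in its arguments:

```latex
% Paraphrase of the Uniform-PAC criterion; F_{UPAC} must be polynomial in both arguments.
\Pr\left( \forall \epsilon > 0 :\;
  \left|\left\{\, k \in \mathbb{N} : V^{*}(s_k) - V^{\pi_k}(s_k) > \epsilon \,\right\}\right|
  \le F_{\mathrm{UPAC}}\!\left(\tfrac{1}{\epsilon}, \log\tfrac{1}{\delta}\right)
\right) \ge 1 - \delta
```

Because the bound holds for all ε at once, summing the suboptimality gaps yields a regret guarantee, while fixing a single ε recovers an ordinary PAC guarantee.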
A PAC RL Algorithm for Episodic POMDPs
Many interesting real world domains involve reinforcement learning (RL) in
partially observable environments. Efficient learning in such domains is
important, but existing sample complexity bounds for partially observable RL
are at least exponential in the episode length. We give, to our knowledge, the
first partially observable RL algorithm with a polynomial bound on the number
of episodes on which the algorithm may not achieve near-optimal performance.
Our algorithm is suitable for an important class of episodic POMDPs. Our
approach builds on recent advances in the method of moments for latent variable
model estimation.
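To illustrate the flavor of method-of-moments estimation for latent variable models, here is a toy example: recovering the biases of a uniform two-coin mixture from pairwise flip moments. This is only an analogy for the spectral machinery such POMDP results typically build on, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent variable model: each sample picks one of two coins uniformly at
# random (the latent variable) and flips that coin twice.
p_true, q_true, n = 0.9, 0.3, 200_000
coins = rng.random(n) < 0.5
bias = np.where(coins, p_true, q_true)
flips = rng.random((n, 2)) < bias[:, None]

# Method of moments: match E[x1] and E[x1 * x2] to their model expressions
#   m1 = (p + q) / 2,   m2 = (p^2 + q^2) / 2,
# which gives p, q = m1 +/- sqrt(m2 - m1^2).
m1 = flips[:, 0].mean()
m2 = (flips[:, 0] & flips[:, 1]).mean()
spread = np.sqrt(max(m2 - m1 ** 2, 0.0))
p_hat, q_hat = m1 + spread, m1 - spread
print(round(p_hat, 3), round(q_hat, 3))   # should approach 0.9 and 0.3
```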
Off-Policy Policy Gradient with State Distribution Correction
We study the problem of off-policy policy optimization in Markov decision
processes, and develop a novel off-policy policy gradient method. Prior
off-policy policy gradient approaches have generally ignored the mismatch
between the distribution of states visited under the behavior policy used to
collect data, and what would be the distribution of states under the learned
policy. Here we build on recent progress for estimating the ratio of the state
distributions under behavior and evaluation policies for policy evaluation, and
present an off-policy policy gradient optimization technique that can account
for this mismatch in distributions. We present an illustrative example of why
this is important and a theoretical convergence guarantee for our approach.
Empirically, we compare our method in simulations to several strong baselines
that do not correct for this mismatch, and find that it significantly improves
the quality of the policies discovered.
Comment: to appear at UAI 18; camera-ready version
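A hypothetical sketch of the correction: each logged gradient sample is reweighted by an estimated state-distribution ratio w(s) ≈ d_pi(s) / d_mu(s) in addition to the usual action-probability ratio. How w is estimated is the crux of such approaches and is simply stubbed out here; the data format and every component below are assumptions.

```python
import numpy as np

def corrected_off_policy_gradient(batch, grad_log_pi, pi, mu, w):
    """Average policy-gradient sample over logged data with both corrections.

    batch: list of (s, a, advantage) tuples logged under the behavior policy mu.
    w(s) approximates the state-distribution ratio d_pi(s) / d_mu(s);
    pi(a, s) and mu(a, s) give action probabilities under the two policies.
    """
    grads = [w(s) * (pi(a, s) / mu(a, s)) * adv * grad_log_pi(s, a)
             for s, a, adv in batch]
    return np.mean(grads, axis=0)

# Illustrative use with scalar states, two actions, and made-up components.
grad_log_pi = lambda s, a: np.array([s, 1.0]) * (1 if a == 1 else -1)
pi = lambda a, s: 0.7 if a == 1 else 0.3          # hypothetical target policy
mu = lambda a, s: 0.5                             # hypothetical behavior policy
w = lambda s: 1.0 + 0.1 * s                       # hypothetical density-ratio estimate
batch = [(0.5, 1, 1.0), (1.0, 0, -0.5), (0.2, 1, 0.3)]
print(corrected_off_policy_gradient(batch, grad_log_pi, pi, mu, w))
```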
Fast Exploration with Simplified Models and Approximately Optimistic Planning in Model Based Reinforcement Learning
Humans learn to play video games significantly faster than the
state-of-the-art reinforcement learning (RL) algorithms. People seem to build
simple models that are easy to learn to support planning and strategic
exploration. Inspired by this, we investigate two issues in leveraging
model-based RL for sample efficiency. First, we investigate how to perform
strategic exploration when exact planning is not feasible, and we empirically
show that optimistic Monte Carlo Tree Search outperforms posterior sampling
methods. Second, we show how to learn simple deterministic models, using object
representations, to support fast learning. We illustrate the benefit of these
ideas by introducing a novel algorithm, Strategic Object Oriented Reinforcement
Learning (SOORL), that outperforms state-of-the-art algorithms in the game of
Pitfall! in fewer than 50 episodes.
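A toy sketch of the "optimism with a simple deterministic model" idea: depth-limited lookahead through a learned deterministic model with a count-based bonus on rarely tried state-action pairs. The bonus form, the model interface, and the toy domain are assumptions and do not reproduce SOORL or its optimistic MCTS.

```python
from collections import defaultdict

visit_counts = defaultdict(int)          # how often each (state, action) was tried
BONUS, GAMMA, GOAL = 1.0, 0.95, 3        # hypothetical bonus scale, discount, goal

def model(s, a):
    """Hypothetical learned deterministic 'object' model: move along a line."""
    s_next = s + (1 if a == "right" else -1)
    return s_next, (1.0 if s_next == GOAL else 0.0)

def q_value(actions, state, action, depth):
    """Optimistic depth-limited lookahead: rarely tried pairs get a count bonus."""
    next_state, reward = model(state, action)
    optimism = BONUS / (1 + visit_counts[(state, action)])
    if depth == 0:
        return reward + optimism
    future = max(q_value(actions, next_state, a, depth - 1) for a in actions)
    return reward + optimism + GAMMA * future

def plan(actions, state, depth=3):
    """Pick the action with the highest optimistic lookahead value."""
    return max(actions, key=lambda a: q_value(actions, state, a, depth))

action = plan(["left", "right"], 0)
visit_counts[(0, action)] += 1           # acting updates the counts for next time
print(action)
```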
Using Options and Covariance Testing for Long Horizon Off-Policy Policy Evaluation
Evaluating a policy by deploying it in the real world can be risky and
costly. Off-policy policy evaluation (OPE) algorithms use historical data
collected from running a previous policy to evaluate a new policy, which
provides a means for evaluating a policy without requiring it to ever be
deployed. Importance sampling is a popular OPE method because it is robust to
partial observability and works with continuous states and actions. However,
the amount of historical data required by importance sampling can scale
exponentially with the horizon of the problem: the number of sequential
decisions that are made. We propose using policies over temporally extended
actions, called options, and show that combining these policies with importance
sampling can significantly improve performance for long-horizon problems. In
addition, we can take advantage of special cases that arise due to
options-based policies to further improve the performance of importance
sampling. We further generalize these special cases to a general covariance
testing rule that can be used to decide which weights to drop in an IS
estimate, and derive a new IS algorithm called Incremental Importance Sampling
that can provide significantly more accurate estimates for a broad class of
domains.
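A minimal sketch of the intuition behind dropping weights: when the behavior and evaluation policies provably agree at a step (for example, both are executing the same option), that step's importance ratio is exactly 1 and can be omitted without introducing bias, which reduces variance. The trajectory format and the shared_option flag are assumptions; the covariance test described above decides which weights to drop from data rather than from a given flag.

```python
import numpy as np

def is_estimate(trajectories):
    """Trajectory-wise importance sampling estimate of the expected return.

    Each step is (reward, pi_e_prob, pi_b_prob, shared_option): when
    shared_option is True the two policies take the same decision, so its
    weight is identically 1 and can be dropped.
    """
    values = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for r, p_e, p_b, shared_option in traj:
            if not shared_option:          # skip weights that are exactly 1
                weight *= p_e / p_b
            ret += r
        values.append(weight * ret)
    return float(np.mean(values))

# Two short logged trajectories; most steps are inside a shared option, so only
# the option-initiation steps contribute importance weights.
trajs = [
    [(0.0, 0.9, 0.5, False), (1.0, 0.7, 0.7, True), (1.0, 0.6, 0.6, True)],
    [(0.0, 0.1, 0.5, False), (0.0, 0.3, 0.3, True), (1.0, 0.2, 0.2, True)],
]
print(is_estimate(trajs))
```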
