Near-optimal Optimistic Reinforcement Learning using Empirical Bernstein Inequalities
We study model-based reinforcement learning in an unknown finite
communicating Markov decision process. We propose a simple algorithm that
leverages a variance-based confidence interval. We show that the proposed
algorithm, UCRL-V, achieves the optimal regret $\tilde{\mathcal{O}}(\sqrt{DSAT})$
(where $D$ is the diameter of the MDP, $S$ the number of states, $A$ the number
of actions and $T$ the time horizon) up to logarithmic factors, and so our work
closes a gap with the lower bound without additional assumptions on the MDP. We
perform experiments in a variety of environments that validate the theoretical
bounds and show UCRL-V to outperform state-of-the-art algorithms.
Comment: the algorithm has been simplified (no need to look at the lower bound of
the reward and transitions). The proof has been significantly cleaned up. The
previous "assumption" is clarified as a condition of the algorithm, well known
as sub-modularity. The proof that the bounds satisfy submodularity is cleaned up.
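For intuition, the variance-based confidence interval that UCRL-V relies on has the flavor of an empirical Bernstein bound. Below is a minimal sketch of such a confidence radius, assuming the Maurer-Pontil form for i.i.d. samples in [0, 1]; the function name and exact constants are illustrative, not taken from the paper.

```python
import math

def empirical_bernstein_radius(sample_variance, n, delta):
    """One-sided confidence radius from the Maurer-Pontil empirical
    Bernstein bound: for n >= 2 i.i.d. samples in [0, 1], with probability
    at least 1 - delta,
        true_mean <= empirical_mean + radius.
    (Use delta/2 for a two-sided interval.) The variance term dominates
    when the observed variance is small, which is what lets a UCRL-style
    algorithm keep its optimism tight per state-action pair.
    """
    if n < 2:
        return 1.0  # trivial radius: the full range of the variables
    log_term = math.log(2.0 / delta)
    return (math.sqrt(2.0 * sample_variance * log_term / n)
            + 7.0 * log_term / (3.0 * (n - 1)))
```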
Near-optimal Bayesian Solution For Unknown Discrete Markov Decision Process
We tackle the problem of acting in an unknown finite and discrete Markov
Decision Process (MDP) for which the expected shortest path from any state to
any other state is bounded by a finite number $D$. An MDP consists of $S$
states and $A$ possible actions per state. Upon choosing an action $a_t$ at
state $s_t$, one receives a real-valued reward $r_t$, then one transits to a
next state $s_{t+1}$. The reward $r_t$ is generated from a fixed reward
distribution depending only on $(s_t, a_t)$ and, similarly, the next state
$s_{t+1}$ is generated from a fixed transition distribution depending only on
$(s_t, a_t)$. The objective is to maximize the accumulated rewards after $T$
interactions. In this paper, we consider the case where the reward
distributions, the transitions, and $D$ are all unknown. We derive the
first polynomial-time Bayesian algorithm, BUCRL, that achieves, up to logarithmic
factors, a regret (i.e. the difference between the accumulated rewards of the
optimal policy and of our algorithm) of the optimal order
$\tilde{\mathcal{O}}(\sqrt{DSAT})$. Importantly, our result holds with high
probability for the worst-case (frequentist) regret and not the weaker notion
of Bayesian regret. We perform experiments in a variety of environments that
demonstrate the superiority of our algorithm over previous techniques.
Our work also yields several results of independent interest. In particular,
we derive a sharper upper bound for the KL-divergence of Bernoulli random
variables. We also derive sharper upper and lower bounds for Beta and Binomial
quantiles. All the bounds are very simple and use only elementary functions.
Comment: Improved the text and added detailed proofs of claims. Changed the
title to better express the proposed solution.
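Written out, the regret defined parenthetically above takes the standard average-reward form; the symbol $\rho^{*}$ for the optimal gain is assumed notation, not the abstract's.

```latex
% Regret after T interactions: reward accumulated by the optimal policy
% minus the reward accumulated by the algorithm.
\mathrm{Regret}(T) \;=\; T\,\rho^{*} \;-\; \sum_{t=1}^{T} r_t,
\qquad
\rho^{*} \;=\; \max_{\pi} \lim_{T \to \infty} \frac{1}{T}\,
\mathbb{E}\!\left[\sum_{t=1}^{T} r_t^{\pi}\right].
```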
Exploratory Grasping: Asymptotically Optimal Algorithms for Grasping Challenging Polyhedral Objects
There has been significant recent work on data-driven algorithms for learning
general-purpose grasping policies. However, these policies can consistently
fail to grasp challenging objects which are significantly out of the
distribution of objects in the training data or which have very few
high-quality grasps. Motivated by such objects, we propose a novel problem setting,
Exploratory Grasping, for efficiently discovering reliable grasps on an unknown
polyhedral object via sequential grasping, releasing, and toppling. We
formalize Exploratory Grasping as a Markov Decision Process, study the
theoretical complexity of Exploratory Grasping in the context of reinforcement
learning and present an efficient bandit-style algorithm, Bandits for Online
Rapid Grasp Exploration Strategy (BORGES), which leverages the structure of the
problem to efficiently discover high-performing grasps for each object stable
pose. BORGES can be used to complement any general-purpose grasping algorithm
with any grasp modality (parallel-jaw, suction, multi-fingered, etc.) to learn
policies for objects on which they exhibit persistent failures. Simulation
experiments suggest that BORGES can significantly outperform both
general-purpose grasping pipelines and two other online learning algorithms,
achieving performance within 5% of the optimal policy within 1000 and 8000
timesteps on average across 46 challenging objects from the Dex-Net adversarial
and EGAD! object datasets, respectively. Initial physical experiments suggest
that BORGES can improve the grasp success rate by 45% over a Dex-Net baseline
with just 200 grasp attempts in the real world. See https://tinyurl.com/exp-grasping
for supplementary material and videos.
Comment: Conference on Robot Learning (CoRL) 2020. First two authors
contributed equally.
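The abstract does not give BORGES's pseudocode; the sketch below illustrates the general "bandit per stable pose" structure with a plain UCB1 index. The interface (`try_grasp`, pose ids) and the UCB1 choice are assumptions for illustration, not the authors' algorithm.

```python
import math
import random

def ucb_grasp_exploration(grasps_per_pose, try_grasp, horizon, c=2.0):
    """UCB1-style exploration over candidate grasps, kept separately per
    stable pose: grasping, releasing, or toppling moves the object to a
    new pose, and each pose runs its own bandit over its grasp candidates.

    grasps_per_pose: dict pose_id -> list of candidate grasp ids.
    try_grasp(pose, grasp) -> (success in {0, 1}, next pose_id).
    Returns dict pose_id -> empirical success rate of each grasp.
    """
    counts = {p: [0] * len(g) for p, g in grasps_per_pose.items()}
    wins = {p: [0] * len(g) for p, g in grasps_per_pose.items()}
    pose = random.choice(list(grasps_per_pose))
    for _ in range(horizon):
        n = counts[pose]
        if 0 in n:                        # try each grasp in this pose once
            i = n.index(0)
        else:                             # then pick the largest UCB index
            total = sum(n)
            i = max(range(len(n)),
                    key=lambda j: wins[pose][j] / n[j]
                    + math.sqrt(c * math.log(total) / n[j]))
        success, next_pose = try_grasp(pose, grasps_per_pose[pose][i])
        counts[pose][i] += 1
        wins[pose][i] += success
        pose = next_pose
    return {p: [w / cnt if cnt else 0.0
                for w, cnt in zip(wins[p], counts[p])]
            for p in grasps_per_pose}
```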
Towards Tractable Optimism in Model-Based Reinforcement Learning
The principle of optimism in the face of uncertainty is prevalent throughout
sequential decision making problems such as multi-armed bandits and
reinforcement learning (RL). To be successful, an optimistic RL algorithm must
over-estimate the true value function (optimism) but not by so much that it is
inaccurate (estimation error). In the tabular setting, many state-of-the-art
methods produce the required optimism through approaches which are intractable
when scaling to deep RL. We re-interpret these scalable optimistic model-based
algorithms as solving a tractable noise-augmented MDP. This formulation
achieves a competitive regret bound: $\tilde{\mathcal{O}}(|\mathcal{S}|H\sqrt{|\mathcal{A}|T})$
when augmenting using Gaussian noise, where $T$ is the total number of environment
steps and $H$ is the episode horizon. We also explore how this
trade-off changes in the deep RL setting, where we show empirically that
estimation error is significantly more troublesome. However, we also show that
if this error is reduced, optimistic model-based RL algorithms can match
state-of-the-art performance in continuous control problems.
Comment: Presented as a conference paper at UAI 2021
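A rough picture of the noise-augmented MDP idea: perturb the learned model with Gaussian noise whose scale shrinks with visitation counts, then plan in the perturbed model. The sketch below assumes a tabular finite-horizon setting and a count-based noise scale; both are illustrative choices, not the paper's exact calibration.

```python
import numpy as np

def noise_augmented_value_iteration(P_hat, R_hat, counts, H, sigma0=1.0,
                                    rng=None):
    """Backward induction on a noise-augmented tabular model.

    P_hat:  (S, A, S) estimated transition probabilities
    R_hat:  (S, A)    estimated mean rewards
    counts: (S, A)    visit counts; noise ~ N(0, sigma0 / sqrt(count)),
                      so rarely visited pairs tend to look optimistic
    Returns a greedy policy (S,) for the perturbed H-step MDP (H >= 1).
    """
    rng = rng or np.random.default_rng()
    noise = rng.normal(0.0, sigma0 / np.sqrt(np.maximum(counts, 1)))
    R_tilde = R_hat + noise          # optimism enters as a random reward bonus
    V = np.zeros(P_hat.shape[0])
    for _ in range(H):               # finite-horizon value iteration
        Q = R_tilde + P_hat @ V      # (S, A) one-step lookahead values
        V = Q.max(axis=1)
    return Q.argmax(axis=1)
```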
Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes
We study reinforcement learning (RL) with linear function approximation where
the underlying transition probability kernel of the Markov decision process
(MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou et
al., 2020) and the learning agent has access to either an integration or a
sampling oracle of the individual basis kernels. We propose a new
Bernstein-type concentration inequality for self-normalized martingales for
linear bandit problems with bounded noise. Based on the new inequality, we
propose a new, computationally efficient algorithm with linear function
approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture
MDPs in the episodic undiscounted setting. We show that $\text{UCRL-VTR}^{+}$
attains an $\tilde{\mathcal{O}}(dH\sqrt{T})$ regret, where $d$ is the dimension of the feature
mapping, $H$ is the length of the episode and $T$ is the number of interactions
with the MDP. We also prove a matching lower bound $\Omega(dH\sqrt{T})$ for
this setting, which shows that $\text{UCRL-VTR}^{+}$ is minimax optimal up to
logarithmic factors. In addition, we propose the $\text{UCLK}^{+}$ algorithm
for the same family of MDPs under discounting and show that it attains an
$\tilde{\mathcal{O}}(d\sqrt{T}/(1-\gamma)^{1.5})$ regret, where $\gamma \in [0,1)$ is the
discount factor. Our upper bound matches the lower bound
$\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$ proved by Zhou et al. (2020) up to
logarithmic factors, suggesting that $\text{UCLK}^{+}$ is nearly minimax
optimal. To the best of our knowledge, these are the first computationally
efficient, nearly minimax optimal algorithms for RL with linear function
approximation.
Comment: 59 pages, 1 figure
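For reference, the linear mixture assumption above is standardly stated as follows; the weight vector $\theta^{*}$ and basis kernels $\phi_i$ are conventional notation for this model, assumed here rather than quoted from the paper.

```latex
% Linear mixture MDP: the transition kernel is a linear combination of
% d known basis kernels phi_i with an unknown weight vector theta*.
P(s' \mid s, a)
\;=\; \bigl\langle \phi(s' \mid s, a),\, \theta^{*} \bigr\rangle
\;=\; \sum_{i=1}^{d} \theta^{*}_{i}\, \phi_{i}(s' \mid s, a),
\qquad \theta^{*} \in \mathbb{R}^{d}.
```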