5 research outputs found

    Near-optimal Optimistic Reinforcement Learning using Empirical Bernstein Inequalities

    Full text link
    We study model-based reinforcement learning in an unknown finite communicating Markov decision process. We propose a simple algorithm that leverages a variance-based confidence interval. We show that the proposed algorithm, UCRL-V, achieves the optimal regret $\tilde{\mathcal{O}}(\sqrt{DSAT})$ up to logarithmic factors, and so our work closes a gap with the lower bound without additional assumptions on the MDP. We perform experiments in a variety of environments that validate the theoretical bounds and show UCRL-V to be better than state-of-the-art algorithms.
    Comment: The algorithm has been simplified (no need to look at lower bounds of the rewards and transitions). The proof has been significantly cleaned up. The previous "assumption" is clarified as a condition of the algorithm well known as sub-modularity. The proof that the bounds satisfy submodularity has been cleaned up.
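    The abstract does not spell out UCRL-V's bonus construction, but the variance-based interval it mentions is typically built from an empirical Bernstein bound. A minimal sketch of such a confidence radius (the Maurer-Pontil form; the function name and constants here are illustrative, not taken from the paper):

```python
import numpy as np

def empirical_bernstein_radius(samples: np.ndarray, delta: float) -> float:
    """Empirical Bernstein confidence radius (Maurer & Pontil style) for
    samples bounded in [0, 1]: with probability >= 1 - delta, the true mean
    lies within +/- this radius of the sample mean."""
    n = len(samples)
    if n < 2:
        return 1.0  # vacuous radius when the variance cannot be estimated
    var = samples.var(ddof=1)          # unbiased sample variance
    log_term = np.log(2.0 / delta)
    return float(np.sqrt(2.0 * var * log_term / n) + 7.0 * log_term / (3.0 * (n - 1)))

# Example: confidence interval for the mean reward of one (state, action) pair.
rng = np.random.default_rng(0)
rewards = rng.binomial(1, 0.3, size=200).astype(float)
radius = empirical_bernstein_radius(rewards, delta=0.05)
print(f"mean estimate {rewards.mean():.3f} +/- {radius:.3f}")
```

    Because the radius shrinks with the empirical variance rather than only with the count, low-variance (state, action) pairs get much tighter intervals than a Hoeffding-style bonus would give.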

    Near-optimal Bayesian Solution For Unknown Discrete Markov Decision Process

    Full text link
    We tackle the problem of acting in an unknown finite and discrete Markov Decision Process (MDP) for which the expected shortest path from any state to any other state is bounded by a finite number $D$. An MDP consists of $S$ states and $A$ possible actions per state. Upon choosing an action $a_t$ at state $s_t$, one receives a real-valued reward $r_t$, then one transits to a next state $s_{t+1}$. The reward $r_t$ is generated from a fixed reward distribution depending only on $(s_t, a_t)$ and, similarly, the next state $s_{t+1}$ is generated from a fixed transition distribution depending only on $(s_t, a_t)$. The objective is to maximize the accumulated rewards after $T$ interactions. In this paper, we consider the case where the reward distributions, the transitions, $T$ and $D$ are all unknown. We derive the first polynomial-time Bayesian algorithm, BUCRL, that achieves, up to logarithmic factors, a regret (i.e., the difference between the accumulated rewards of the optimal policy and our algorithm) of the optimal order $\tilde{\mathcal{O}}(\sqrt{DSAT})$. Importantly, our result holds with high probability for the worst-case (frequentist) regret and not the weaker notion of Bayesian regret. We perform experiments in a variety of environments that demonstrate the superiority of our algorithm over previous techniques. Our work also illustrates several results that will be of independent interest. In particular, we derive a sharper upper bound for the KL-divergence of Bernoulli random variables. We also derive sharper upper and lower bounds for Beta and Binomial quantiles. All the bounds are very simple and only use elementary functions.
    Comment: Improved the text and added detailed proofs of claims. Changed the title to better express the proposed solution.
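    As a rough illustration of the Bayesian-optimistic flavour described above (not BUCRL's actual construction), an upper quantile of a Beta posterior gives an optimistic estimate of a Bernoulli mean, which is exactly where sharp Beta/Binomial quantile bounds become useful:

```python
from scipy.stats import beta

def optimistic_bernoulli_mean(successes: int, failures: int, delta: float) -> float:
    """Upper (1 - delta) quantile of a Beta(1 + successes, 1 + failures)
    posterior over a Bernoulli mean -- a generic Bayesian-optimistic estimate,
    not the paper's exact construction."""
    return float(beta.ppf(1.0 - delta, 1 + successes, 1 + failures))

# Example: 7 successes out of 20 trials, 5% optimism level.
# The returned value sits well above the empirical mean 7/20 = 0.35.
print(optimistic_bernoulli_mean(7, 13, delta=0.05))
```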

    Exploratory Grasping: Asymptotically Optimal Algorithms for Grasping Challenging Polyhedral Objects

    Full text link
    There has been significant recent work on data-driven algorithms for learning general-purpose grasping policies. However, these policies can consistently fail to grasp challenging objects which are significantly out of the distribution of objects in the training data or which have very few high-quality grasps. Motivated by such objects, we propose a novel problem setting, Exploratory Grasping, for efficiently discovering reliable grasps on an unknown polyhedral object via sequential grasping, releasing, and toppling. We formalize Exploratory Grasping as a Markov Decision Process, study the theoretical complexity of Exploratory Grasping in the context of reinforcement learning, and present an efficient bandit-style algorithm, Bandits for Online Rapid Grasp Exploration Strategy (BORGES), which leverages the structure of the problem to efficiently discover high-performing grasps for each object stable pose. BORGES can be used to complement any general-purpose grasping algorithm with any grasp modality (parallel-jaw, suction, multi-fingered, etc.) to learn policies for objects on which they exhibit persistent failures. Simulation experiments suggest that BORGES can significantly outperform both general-purpose grasping pipelines and two other online learning algorithms, achieving performance within 5% of the optimal policy within 1000 and 8000 timesteps on average across 46 challenging objects from the Dex-Net adversarial and EGAD! object datasets, respectively. Initial physical experiments suggest that BORGES can improve grasp success rate by 45% over a Dex-Net baseline with just 200 grasp attempts in the real world. See https://tinyurl.com/exp-grasping for supplementary material and videos.
    Comment: Conference on Robot Learning (CoRL) 2020. First two authors contributed equally.
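    The abstract describes a bandit-style learner over candidate grasps for each stable pose. A generic per-pose UCB1 sketch along those lines (a stand-in with invented names, not the BORGES algorithm itself):

```python
import math
import random
from collections import defaultdict

class PerPoseUCB:
    """One UCB1 bandit per stable pose; arms are the candidate grasps for that pose."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))     # pose -> grasp -> pulls
        self.successes = defaultdict(lambda: defaultdict(int))  # pose -> grasp -> successes

    def select_grasp(self, pose, candidate_grasps):
        total = sum(self.counts[pose][g] for g in candidate_grasps) + 1
        def ucb(g):
            n = self.counts[pose][g]
            if n == 0:
                return float("inf")                 # try each grasp at least once
            mean = self.successes[pose][g] / n
            return mean + math.sqrt(2.0 * math.log(total) / n)
        return max(candidate_grasps, key=ucb)

    def update(self, pose, grasp, success: bool):
        self.counts[pose][grasp] += 1
        self.successes[pose][grasp] += int(success)

# Toy usage: one pose with three grasps of unknown reliability.
bandit, true_p = PerPoseUCB(), {0: 0.2, 1: 0.8, 2: 0.5}
for _ in range(500):
    g = bandit.select_grasp("pose_A", [0, 1, 2])
    bandit.update("pose_A", g, random.random() < true_p[g])
print({g: bandit.counts["pose_A"][g] for g in true_p})  # grasp 1 gets most attempts
```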

    Towards Tractable Optimism in Model-Based Reinforcement Learning

    Full text link
    The principle of optimism in the face of uncertainty is prevalent throughout sequential decision-making problems such as multi-armed bandits and reinforcement learning (RL). To be successful, an optimistic RL algorithm must over-estimate the true value function (optimism) but not by so much that it is inaccurate (estimation error). In the tabular setting, many state-of-the-art methods produce the required optimism through approaches which are intractable when scaling to deep RL. We re-interpret these scalable optimistic model-based algorithms as solving a tractable noise-augmented MDP. This formulation achieves a competitive regret bound: $\tilde{\mathcal{O}}(|\mathcal{S}|H\sqrt{|\mathcal{A}|T})$ when augmenting using Gaussian noise, where $T$ is the total number of environment steps. We also explore how this trade-off changes in the deep RL setting, where we show empirically that estimation error is significantly more troublesome. However, we also show that if this error is reduced, optimistic model-based RL algorithms can match state-of-the-art performance in continuous control problems.
    Comment: Presented as a conference paper at UAI 2021.
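    One way to read the noise-augmented MDP idea is sketched below, under the assumption that optimism is injected by perturbing the estimated rewards with Gaussian noise scaled by inverse square-root visit counts; the scaling and noise placement are illustrative, not the paper's exact recipe.

```python
import numpy as np

def noise_augmented_value_iteration(P_hat, R_hat, counts, H, sigma=1.0, rng=None):
    """Finite-horizon value iteration on an estimated tabular model whose rewards
    are perturbed by zero-mean Gaussian noise shrinking with visit counts.

    P_hat:  (S, A, S) estimated transition probabilities
    R_hat:  (S, A)    estimated mean rewards
    counts: (S, A)    visit counts used to scale the perturbation
    """
    rng = rng or np.random.default_rng()
    S, A = R_hat.shape
    noise = sigma * rng.standard_normal((S, A)) / np.sqrt(np.maximum(counts, 1))
    R_tilde = R_hat + noise                 # optimism enters only through the noise
    V = np.zeros(S)
    for _ in range(H):                      # backward induction over the horizon
        Q = R_tilde + P_hat @ V             # (S, A) Bellman backup
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V              # greedy policy and its optimistic value

# Toy 2-state, 2-action model.
P = np.array([[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 0.1], [0.5, 1.0]])
N = np.array([[50, 5], [5, 50]])
policy, values = noise_augmented_value_iteration(P, R, N, H=10)
print(policy, values)
```

    The appeal of this reading is that planning stays an ordinary MDP solve: the noisy model is handed to any planner or deep-RL learner unchanged, which is what makes the construction tractable at scale.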

    Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

    Full text link
    We study reinforcement learning (RL) with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2020) and the learning agent has access to either an integration or a sampling oracle of the individual basis kernels. We propose a new Bernstein-type concentration inequality for self-normalized martingales for linear bandit problems with bounded noise. Based on the new inequality, we propose a new, computationally efficient algorithm with linear function approximation named UCRL-VTR+ for the aforementioned linear mixture MDPs in the episodic undiscounted setting. We show that UCRL-VTR+ attains an $\tilde{\mathcal{O}}(dH\sqrt{T})$ regret where $d$ is the dimension of the feature mapping, $H$ is the length of the episode and $T$ is the number of interactions with the MDP. We also prove a matching lower bound $\Omega(dH\sqrt{T})$ for this setting, which shows that UCRL-VTR+ is minimax optimal up to logarithmic factors. In addition, we propose the UCLK+ algorithm for the same family of MDPs under discounting and show that it attains an $\tilde{\mathcal{O}}(d\sqrt{T}/(1-\gamma)^{1.5})$ regret, where $\gamma\in[0,1)$ is the discount factor. Our upper bound matches the lower bound $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$ proved by Zhou et al. (2020) up to logarithmic factors, suggesting that UCLK+ is nearly minimax optimal. To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
    Comment: 59 pages, 1 figure.
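    Algorithms in this family maintain a ridge-regression estimate of the unknown mixture parameter together with an elliptical (self-normalized) confidence set. A minimal sketch with a user-supplied width beta, since the paper's Bernstein-type radius is not reproduced in the abstract:

```python
import numpy as np

class LinearConfidenceSet:
    """Ridge regression with an elliptical confidence set
    {theta : ||theta - theta_hat||_A <= beta}; a generic building block of the
    UCRL-VTR+ style of analysis, not the paper's algorithm itself."""

    def __init__(self, dim: int, lam: float = 1.0):
        self.A = lam * np.eye(dim)   # regularized Gram matrix: sum_t x_t x_t^T + lam * I
        self.b = np.zeros(dim)       # sum_t y_t x_t

    def update(self, x: np.ndarray, y: float):
        self.A += np.outer(x, x)
        self.b += y * x

    def estimate(self) -> np.ndarray:
        return np.linalg.solve(self.A, self.b)

    def optimistic_value(self, x: np.ndarray, beta: float) -> float:
        """Upper confidence bound on <theta, x> over the confidence set."""
        theta_hat = self.estimate()
        width = np.sqrt(x @ np.linalg.solve(self.A, x))   # ||x||_{A^{-1}}
        return float(theta_hat @ x + beta * width)

# Toy usage: recover a 3-dimensional parameter from noisy linear observations.
rng = np.random.default_rng(1)
theta_true = np.array([0.5, -0.2, 0.8])
cs = LinearConfidenceSet(dim=3)
for _ in range(1000):
    x = rng.standard_normal(3)
    cs.update(x, theta_true @ x + 0.1 * rng.standard_normal())
print(cs.estimate())   # close to theta_true
```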