5 research outputs found

    Near-optimal Optimistic Reinforcement Learning using Empirical Bernstein Inequalities

    Full text link
    We study model-based reinforcement learning in an unknown finite communicating Markov decision process. We propose a simple algorithm that leverages a variance-based confidence interval. We show that the proposed algorithm, UCRL-V, achieves the optimal regret $\tilde{\mathcal{O}}(\sqrt{DSAT})$ up to logarithmic factors, and so our work closes a gap with the lower bound without additional assumptions on the MDP. We perform experiments in a variety of environments that validate the theoretical bounds and show UCRL-V to be better than state-of-the-art algorithms.
    Comment: The algorithm has been simplified (no need to look at lower bounds of the rewards and transitions). The proof has been significantly cleaned up. The previous "assumption" is clarified as a condition of the algorithm well known as sub-modularity. The proof that the bounds satisfy submodularity has been cleaned up.
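    The abstract does not spell out UCRL-V's bonus construction, but the variance-based interval it mentions is typically built from an empirical Bernstein bound. A minimal sketch of such a confidence radius (the Maurer-Pontil form; the function name and constants here are illustrative, not taken from the paper):

```python
import numpy as np

def empirical_bernstein_radius(samples: np.ndarray, delta: float) -> float:
    """Empirical Bernstein confidence radius (Maurer & Pontil style) for
    samples bounded in [0, 1]: with probability >= 1 - delta, the true mean
    lies within +/- this radius of the sample mean."""
    n = len(samples)
    if n < 2:
        return 1.0  # vacuous radius when the variance cannot be estimated
    var = samples.var(ddof=1)          # unbiased sample variance
    log_term = np.log(2.0 / delta)
    return float(np.sqrt(2.0 * var * log_term / n) + 7.0 * log_term / (3.0 * (n - 1)))

# Example: confidence interval for the mean reward of one (state, action) pair.
rng = np.random.default_rng(0)
rewards = rng.binomial(1, 0.3, size=200).astype(float)
radius = empirical_bernstein_radius(rewards, delta=0.05)
print(f"mean estimate {rewards.mean():.3f} +/- {radius:.3f}")
```

    Because the radius shrinks with the empirical variance rather than only with the count, low-variance (state, action) pairs get much tighter intervals than a Hoeffding-style bonus would give.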

    Near-optimal Bayesian Solution For Unknown Discrete Markov Decision Process

    Full text link
    We tackle the problem of acting in an unknown finite and discrete Markov Decision Process (MDP) for which the expected shortest path from any state to any other state is bounded by a finite number $D$. An MDP consists of $S$ states and $A$ possible actions per state. Upon choosing an action $a_t$ at state $s_t$, one receives a real-valued reward $r_t$, then one transits to a next state $s_{t+1}$. The reward $r_t$ is generated from a fixed reward distribution depending only on $(s_t, a_t)$ and, similarly, the next state $s_{t+1}$ is generated from a fixed transition distribution depending only on $(s_t, a_t)$. The objective is to maximize the accumulated rewards after $T$ interactions. In this paper, we consider the case where the reward distributions, the transitions, $T$ and $D$ are all unknown. We derive the first polynomial-time Bayesian algorithm, BUCRL, that achieves, up to logarithmic factors, a regret (i.e., the difference between the accumulated rewards of the optimal policy and our algorithm) of the optimal order $\tilde{\mathcal{O}}(\sqrt{DSAT})$. Importantly, our result holds with high probability for the worst-case (frequentist) regret and not the weaker notion of Bayesian regret. We perform experiments in a variety of environments that demonstrate the superiority of our algorithm over previous techniques. Our work also illustrates several results that will be of independent interest. In particular, we derive a sharper upper bound for the KL-divergence of Bernoulli random variables. We also derive sharper upper and lower bounds for Beta and Binomial quantiles. All the bounds are very simple and only use elementary functions.
    Comment: Improved the text and added detailed proofs of claims. Changed the title to better express the proposed solution.
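    As a rough illustration of the Bayesian-optimistic flavour described above (not BUCRL's actual construction), an upper quantile of a Beta posterior gives an optimistic estimate of a Bernoulli mean, which is exactly where sharp Beta/Binomial quantile bounds become useful:

```python
from scipy.stats import beta

def optimistic_bernoulli_mean(successes: int, failures: int, delta: float) -> float:
    """Upper (1 - delta) quantile of a Beta(1 + successes, 1 + failures)
    posterior over a Bernoulli mean -- a generic Bayesian-optimistic estimate,
    not the paper's exact construction."""
    return float(beta.ppf(1.0 - delta, 1 + successes, 1 + failures))

# Example: 7 successes out of 20 trials, 5% optimism level.
# The returned value sits well above the empirical mean 7/20 = 0.35.
print(optimistic_bernoulli_mean(7, 13, delta=0.05))
```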

    Exploratory Grasping: Asymptotically Optimal Algorithms for Grasping Challenging Polyhedral Objects

    Full text link
    There has been significant recent work on data-driven algorithms for learning general-purpose grasping policies. However, these policies can consistently fail to grasp challenging objects which are significantly out of the distribution of objects in the training data or which have very few high-quality grasps. Motivated by such objects, we propose a novel problem setting, Exploratory Grasping, for efficiently discovering reliable grasps on an unknown polyhedral object via sequential grasping, releasing, and toppling. We formalize Exploratory Grasping as a Markov Decision Process, study the theoretical complexity of Exploratory Grasping in the context of reinforcement learning, and present an efficient bandit-style algorithm, Bandits for Online Rapid Grasp Exploration Strategy (BORGES), which leverages the structure of the problem to efficiently discover high-performing grasps for each object stable pose. BORGES can be used to complement any general-purpose grasping algorithm with any grasp modality (parallel-jaw, suction, multi-fingered, etc.) to learn policies for objects on which they exhibit persistent failures. Simulation experiments suggest that BORGES can significantly outperform both general-purpose grasping pipelines and two other online learning algorithms, achieving performance within 5% of the optimal policy within 1000 and 8000 timesteps on average across 46 challenging objects from the Dex-Net adversarial and EGAD! object datasets, respectively. Initial physical experiments suggest that BORGES can improve grasp success rate by 45% over a Dex-Net baseline with just 200 grasp attempts in the real world. See https://tinyurl.com/exp-grasping for supplementary material and videos.
    Comment: Conference on Robot Learning (CoRL) 2020. First two authors contributed equally.
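    The abstract describes a bandit-style learner over candidate grasps for each stable pose. A generic per-pose UCB1 sketch along those lines (a stand-in with invented names, not the BORGES algorithm itself):

```python
import math
import random
from collections import defaultdict

class PerPoseUCB:
    """One UCB1 bandit per stable pose; arms are the candidate grasps for that pose."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))     # pose -> grasp -> pulls
        self.successes = defaultdict(lambda: defaultdict(int))  # pose -> grasp -> successes

    def select_grasp(self, pose, candidate_grasps):
        total = sum(self.counts[pose][g] for g in candidate_grasps) + 1
        def ucb(g):
            n = self.counts[pose][g]
            if n == 0:
                return float("inf")                 # try each grasp at least once
            mean = self.successes[pose][g] / n
            return mean + math.sqrt(2.0 * math.log(total) / n)
        return max(candidate_grasps, key=ucb)

    def update(self, pose, grasp, success: bool):
        self.counts[pose][grasp] += 1
        self.successes[pose][grasp] += int(success)

# Toy usage: one pose with three grasps of unknown reliability.
bandit, true_p = PerPoseUCB(), {0: 0.2, 1: 0.8, 2: 0.5}
for _ in range(500):
    g = bandit.select_grasp("pose_A", [0, 1, 2])
    bandit.update("pose_A", g, random.random() < true_p[g])
print({g: bandit.counts["pose_A"][g] for g in true_p})  # grasp 1 gets most attempts
```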

    Towards Tractable Optimism in Model-Based Reinforcement Learning

    Full text link
    The principle of optimism in the face of uncertainty is prevalent throughout sequential decision-making problems such as multi-armed bandits and reinforcement learning (RL). To be successful, an optimistic RL algorithm must over-estimate the true value function (optimism) but not by so much that it is inaccurate (estimation error). In the tabular setting, many state-of-the-art methods produce the required optimism through approaches which are intractable when scaling to deep RL. We re-interpret these scalable optimistic model-based algorithms as solving a tractable noise-augmented MDP. This formulation achieves a competitive regret bound: $\tilde{\mathcal{O}}(|\mathcal{S}|H\sqrt{|\mathcal{A}|T})$ when augmenting using Gaussian noise, where $T$ is the total number of environment steps. We also explore how this trade-off changes in the deep RL setting, where we show empirically that estimation error is significantly more troublesome. However, we also show that if this error is reduced, optimistic model-based RL algorithms can match state-of-the-art performance in continuous control problems.
    Comment: Presented as a conference paper at UAI 2021.
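    One way to read the noise-augmented MDP idea is sketched below, under the assumption that optimism is injected by perturbing the estimated rewards with Gaussian noise scaled by inverse square-root visit counts; the scaling and noise placement are illustrative, not the paper's exact recipe.

```python
import numpy as np

def noise_augmented_value_iteration(P_hat, R_hat, counts, H, sigma=1.0, rng=None):
    """Finite-horizon value iteration on an estimated tabular model whose rewards
    are perturbed by zero-mean Gaussian noise shrinking with visit counts.

    P_hat:  (S, A, S) estimated transition probabilities
    R_hat:  (S, A)    estimated mean rewards
    counts: (S, A)    visit counts used to scale the perturbation
    """
    rng = rng or np.random.default_rng()
    S, A = R_hat.shape
    noise = sigma * rng.standard_normal((S, A)) / np.sqrt(np.maximum(counts, 1))
    R_tilde = R_hat + noise                 # optimism enters only through the noise
    V = np.zeros(S)
    for _ in range(H):                      # backward induction over the horizon
        Q = R_tilde + P_hat @ V             # (S, A) Bellman backup
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V              # greedy policy and its optimistic value

# Toy 2-state, 2-action model.
P = np.array([[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 0.1], [0.5, 1.0]])
N = np.array([[50, 5], [5, 50]])
policy, values = noise_augmented_value_iteration(P, R, N, H=10)
print(policy, values)
```

    The appeal of this reading is that planning stays an ordinary MDP solve: the noisy model is handed to any planner or deep-RL learner unchanged, which is what makes the construction tractable at scale.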

    Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

    Full text link
    We study reinforcement learning (RL) with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2020) and the learning agent has access to either an integration or a sampling oracle of the individual basis kernels. We propose a new Bernstein-type concentration inequality for self-normalized martingales for linear bandit problems with bounded noise. Based on the new inequality, we propose a new, computationally efficient algorithm with linear function approximation named UCRL-VTR+ for the aforementioned linear mixture MDPs in the episodic undiscounted setting. We show that UCRL-VTR+ attains an $\tilde{\mathcal{O}}(dH\sqrt{T})$ regret where $d$ is the dimension of the feature mapping, $H$ is the length of the episode and $T$ is the number of interactions with the MDP. We also prove a matching lower bound $\Omega(dH\sqrt{T})$ for this setting, which shows that UCRL-VTR+ is minimax optimal up to logarithmic factors. In addition, we propose the UCLK+ algorithm for the same family of MDPs under discounting and show that it attains an $\tilde{\mathcal{O}}(d\sqrt{T}/(1-\gamma)^{1.5})$ regret, where $\gamma\in[0,1)$ is the discount factor. Our upper bound matches the lower bound $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$ proved by Zhou et al. (2020) up to logarithmic factors, suggesting that UCLK+ is nearly minimax optimal. To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
    Comment: 59 pages, 1 figure.
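    Algorithms in this family maintain a ridge-regression estimate of the unknown mixture parameter together with an elliptical (self-normalized) confidence set. A minimal sketch with a user-supplied width beta, since the paper's Bernstein-type radius is not reproduced in the abstract:

```python
import numpy as np

class LinearConfidenceSet:
    """Ridge regression with an elliptical confidence set
    {theta : ||theta - theta_hat||_A <= beta}; a generic building block of the
    UCRL-VTR+ style of analysis, not the paper's algorithm itself."""

    def __init__(self, dim: int, lam: float = 1.0):
        self.A = lam * np.eye(dim)   # regularized Gram matrix: sum_t x_t x_t^T + lam * I
        self.b = np.zeros(dim)       # sum_t y_t x_t

    def update(self, x: np.ndarray, y: float):
        self.A += np.outer(x, x)
        self.b += y * x

    def estimate(self) -> np.ndarray:
        return np.linalg.solve(self.A, self.b)

    def optimistic_value(self, x: np.ndarray, beta: float) -> float:
        """Upper confidence bound on <theta, x> over the confidence set."""
        theta_hat = self.estimate()
        width = np.sqrt(x @ np.linalg.solve(self.A, x))   # ||x||_{A^{-1}}
        return float(theta_hat @ x + beta * width)

# Toy usage: recover a 3-dimensional parameter from noisy linear observations.
rng = np.random.default_rng(1)
theta_true = np.array([0.5, -0.2, 0.8])
cs = LinearConfidenceSet(dim=3)
for _ in range(1000):
    x = rng.standard_normal(3)
    cs.update(x, theta_true @ x + 0.1 * rng.standard_normal())
print(cs.estimate())   # close to theta_true
```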