43 research outputs found

    Efficient Robot Learning for Optimal Decision Making in Complex and Uncertain Environments

    Get PDF
    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2021. Advisor: Songhwai Oh.

    The problem of sequential decision making in an uncertain and complex environment is a long-standing challenge in robotics. In this thesis, we focus on learning a policy function of a robotic system for sequential decision making, which we call a robot learning framework. In particular, we are interested in reducing the sample complexity of the robot learning framework. Hence, we develop three sample-efficient robot learning frameworks: maximum entropy reinforcement learning, perturbation-based exploration, and learning from demonstrations with mixed qualities.

    For maximum entropy reinforcement learning, we employ a generalized Tsallis entropy regularization as an efficient exploration method. Tsallis entropy generalizes the Shannon-Gibbs entropy by introducing an entropic index. By changing the entropic index, we can control the sparsity and multi-modality of the policy. Based on this fact, we first propose a sparse Markov decision process (sparse MDP), which induces a sparse and multi-modal optimal policy distribution. In this MDP, the sparse entropy, a special case of Tsallis entropy, is employed as a policy regularizer. We first analyze the optimality condition of a sparse MDP. We then propose dynamic programming methods for the sparse MDP and prove their convergence and optimality. We also show that the performance error of a sparse MDP has a constant bound, while the error of a soft MDP increases logarithmically with the number of actions, where this performance error is caused by the introduced regularization term. Furthermore, we generalize sparse MDPs to a new class of entropy-regularized Markov decision processes, referred to as Tsallis MDPs, and analyze the different types of optimal policies, with interesting properties related to the stochasticity of the optimal policy, obtained by controlling the entropic index.

    We also develop perturbation-based exploration methods to handle heavy-tailed noise. In many robot learning problems, the learning signal is corrupted by noise, such as sub-Gaussian or heavy-tailed noise. While most exploration strategies have been analyzed under a sub-Gaussian noise assumption, few methods exist for handling heavy-tailed rewards. Hence, to overcome heavy-tailed noise, we consider stochastic multi-armed bandits with heavy-tailed rewards. First, we propose a novel robust estimator that does not require prior information about the noise distribution, while other existing robust estimators demand such prior knowledge. We then show that the error probability of the proposed estimator decays exponentially fast. Using this estimator, we propose a perturbation-based exploration strategy and develop a generalized regret analysis scheme that provides upper and lower regret bounds by revealing the relationship between the regret and the cumulative distribution function of the perturbation. From the proposed analysis scheme, we obtain gap-dependent and gap-independent upper and lower regret bounds for various perturbations. We also find the optimal hyperparameters for each perturbation, which achieve the minimax optimal regret bound with respect to the total number of rounds.
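    The claim that the entropic index controls the sparsity and multi-modality of the policy can be made concrete with the q = 2 special case of Tsallis-entropy regularization, whose greedy step is the sparsemax projection of Q-values onto the probability simplex (Martins & Astudillo, 2016). The minimal sketch below is illustrative only: the Q-values, the temperature, and the function names are hypothetical and not taken from the thesis. It shows that sparsemax assigns exactly zero probability to sufficiently poor actions, whereas softmax never does.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of a score vector z onto the probability simplex.

    Unlike softmax, the result can contain exact zeros, so the policy is
    sparse while still able to keep positive mass on several actions.
    """
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum           # prefix of actions kept in the support
    k_z = k[support][-1]                          # support size
    threshold = (cumsum[support][-1] - 1) / k_z
    return np.maximum(z - threshold, 0.0)

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical Q-values for a 4-action state and a hypothetical temperature.
q_values = np.array([2.0, 1.9, 0.1, -1.0])
tau = 0.5
print("sparsemax policy:", sparsemax(q_values / tau))   # [0.6, 0.4, 0.0, 0.0]
print("softmax policy:  ", softmax(q_values / tau))     # strictly positive everywhere
```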
    For learning from demonstrations with mixed qualities, we develop a novel inverse reinforcement learning framework using leveraged Gaussian processes (LGP), which can handle negative demonstrations. In LGP, the correlation between two Gaussian processes is captured by a leveraged kernel function. Using the properties of this kernel, the proposed inverse reinforcement learning algorithm can learn from both positive and negative demonstrations. While most existing inverse reinforcement learning (IRL) methods suffer from a lack of information near low-reward regions, the proposed method alleviates this issue by incorporating negative demonstrations. To mathematically formulate negative demonstrations, we introduce a novel generative model which can generate both positive and negative demonstrations using a parameter called proficiency. Moreover, since we represent the reward function using a leveraged Gaussian process, which can model a nonlinear function, the proposed method can effectively estimate the structure of a nonlinear reward function.

    This dissertation addresses robot learning problems based on demonstrations and reward functions. A robot learning method aims to find an optimal policy function that can perform uncertain and complex tasks well. Among the various problems in robot learning, we focus on reducing sample complexity. In particular, our goal is to develop efficient exploration methods and techniques for learning from mixed demonstrations so that an effective policy function can be learned from a small number of samples. To develop an efficient exploration method, we use the generalized Tsallis entropy. Tsallis entropy generalizes the Shannon-Gibbs entropy by introducing a new parameter called the entropic index. By adjusting the entropic index, entropies of various forms can be obtained, each with a different regularization effect. Based on this property, we propose the sparse Markov decision process, which uses the sparse Tsallis entropy and is effective at representing policy distributions that are sparse and, at the same time, multi-modal. We mathematically prove that this yields better performance than using the Shannon-Gibbs entropy, and we theoretically quantify the performance degradation caused by the sparse Tsallis entropy. By further generalizing the sparse Markov decision process, we propose the generalized Tsallis entropy Markov decision process and, likewise, mathematically characterize how adding the Tsallis entropy to a Markov decision process changes the optimal policy and degrades performance. Furthermore, we propose entropic index scheduling, which removes this performance degradation, and show experimentally that it achieves the best performance.

    In addition, to solve learning problems with heavy-tailed noise, we develop an exploration method based on perturbations. Many robot learning problems are affected by noise: the learning signal is often corrupted by noise of various forms, and finding the optimal action while coping with this noise requires an efficient exploration method. Whereas existing methods are applicable only to sub-Gaussian noise, the approach proposed in this dissertation has the advantage of handling heavy-tailed noise. We first prove regret bounds for general perturbations and establish the relationship between the cumulative distribution function (CDF) of the perturbation and the regret. Using this relationship, we make regret bounds computable for various perturbation distributions and derive the most efficient exploration parameters for each distribution.
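    As a rough illustration of the perturbation-based exploration idea summarized above, and not the APE/APE2 algorithms analyzed in the thesis, the sketch below runs a stochastic multi-armed bandit in which each arm's empirical mean is perturbed by random noise whose scale shrinks with the number of pulls, and the arm with the highest perturbed estimate is played. The bandit instance, the Gaussian choice of perturbation, and the scaling constant are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def perturbed_bandit(true_means, horizon=5000, scale=1.0):
    """Stochastic K-armed bandit with perturbation-based exploration.

    Each round, every arm's empirical mean is perturbed by noise whose
    magnitude decays as 1/sqrt(number of pulls), and the arm with the
    largest perturbed estimate is pulled. Illustrative sketch only.
    """
    k = len(true_means)
    pulls = np.zeros(k)
    means = np.zeros(k)
    regret = 0.0
    best = max(true_means)
    for t in range(horizon):
        if t < k:                                   # pull each arm once to initialize
            arm = t
        else:
            noise = rng.standard_normal(k)          # illustrative Gaussian perturbation
            arm = int(np.argmax(means + scale * noise / np.sqrt(pulls)))
        reward = true_means[arm] + rng.standard_normal()    # noisy reward observation
        pulls[arm] += 1
        means[arm] += (reward - means[arm]) / pulls[arm]    # incremental mean update
        regret += best - true_means[arm]
    return regret

print("cumulative pseudo-regret:", perturbed_bandit([0.1, 0.3, 0.9]))
```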
ν˜Όν•©μ‹œλ²”μœΌλ‘œ λΆ€ν„°μ˜ ν•™μŠ΅ 기법을 κ°œλ°œν•˜κΈ° μœ„ν•΄μ„œ, μ˜€μ‹œλ²”μ„ λ‹€λ£° 수 μžˆλŠ” μƒˆλ‘œμš΄ ν˜•νƒœμ˜ κ°€μš°μ‹œμ•ˆ ν”„λ‘œμ„ΈμŠ€ νšŒκ·€λΆ„μ„ 방식을 κ°œλ°œν•˜μ˜€κ³ , 이 방식을 ν™•μž₯ν•˜μ—¬ λ ˆλ²„λ¦¬μ§€ κ°€μš°μ‹œμ•ˆ ν”„λ‘œμ„ΈμŠ€ μ—­κ°•ν™”ν•™μŠ΅ 기법을 κ°œλ°œν•˜μ˜€λ‹€. 개발된 κΈ°λ²•μ—μ„œλŠ” μ •μ‹œλ²”μœΌλ‘œλΆ€ν„° 무엇을 ν•΄μ•Ό ν•˜λŠ”μ§€μ™€ μ˜€μ‹œλ²”μœΌλ‘œλΆ€ν„° 무엇을 ν•˜λ©΄ μ•ˆλ˜λŠ”μ§€λ₯Ό λͺ¨λ‘ ν•™μŠ΅ν•  수 μžˆλ‹€. 기쑴의 λ°©λ²•μ—μ„œλŠ” 쓰일 수 μ—†μ—ˆλ˜ μ˜€μ‹œλ²”μ„ μ‚¬μš© ν•  수 있게 λ§Œλ“¦μœΌλ‘œμ¨ μƒ˜ν”Œ λ³΅μž‘λ„λ₯Ό 쀄일 수 μžˆμ—ˆκ³  μ •μ œλœ 데이터λ₯Ό μˆ˜μ§‘ν•˜μ§€ μ•Šμ•„λ„ λœλ‹€λŠ” μ μ—μ„œ 큰 μž₯점을 κ°–μŒμ„ μ‹€ν—˜μ μœΌλ‘œ λ³΄μ˜€λ‹€.1 Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . 4 2 Background 5 2.1 Learning from Rewards . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.1 Multi-Armed Bandits . . . . . . . . . . . . . . . . . . . . . 7 2.1.2 Contextual Multi-Armed Bandits . . . . . . . . . . . . . . . 7 2.1.3 Markov Decision Processes . . . . . . . . . . . . . . . . . . 9 2.1.4 Soft Markov Decision Processes . . . . . . . . . . . . . . . . 10 2.2 Learning from Demonstrations . . . . . . . . . . . . . . . . . . . . 12 2.2.1 Behavior Cloning . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.2 Inverse Reinforcement Learning . . . . . . . . . . . . . . . . 13 3 Sparse Policy Learning 19 3.1 Sparse Policy Learning for Reinforcement Learning . . . . . . . . . 19 3.1.1 Sparse Markov Decision Processes . . . . . . . . . . . . . . 23 3.1.2 Sparse Value Iteration . . . . . . . . . . . . . . . . . . . . . 29 3.1.3 Performance Error Bounds for Sparse Value Iteration . . . 30 3.1.4 Sparse Exploration and Update Rule for Sparse Deep QLearning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.1.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.1.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.2 Sparse Policy Learning for Imitation Learning . . . . . . . . . . . . 46 3.2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.2.2 Principle of Maximum Causal Tsallis Entropy . . . . . . . . 50 3.2.3 Maximum Causal Tsallis Entropy Imitation Learning . . . 54 3.2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4 Entropy-based Exploration 65 4.1 Generalized Tsallis Entropy Reinforcement Learning . . . . . . . . 65 4.1.1 Maximum Generalized Tsallis Entropy in MDPs . . . . . . 71 4.1.2 Dynamic Programming for Tsallis MDPs . . . . . . . . . . 74 4.1.3 Tsallis Actor Critic for Model-Free RL . . . . . . . . . . . . 78 4.1.4 Experiments Setup . . . . . . . . . . . . . . . . . . . . . . . 79 4.1.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . 84 4.1.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 4.2 E cient Exploration for Robotic Grasping . . . . . . . . . . . . . . 92 4.2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.2.2 Shannon Entropy Regularized Neural Contextual Bandit Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 4.2.3 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . 99 4.2.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . 104 4.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 5 Perturbation-Based Exploration 113 5.1 Perturbed Exploration for sub-Gaussian Rewards . . . . . . . . . . 
115 5.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 115 5.1.2 Heavy-Tailed Perturbations . . . . . . . . . . . . . . . . . . 117 5.1.3 Adaptively Perturbed Exploration . . . . . . . . . . . . . . 119 5.1.4 General Regret Bound for Sub-Gaussian Rewards . . . . . . 120 5.1.5 Regret Bounds for Speci c Perturbations with sub-Gaussian Rewards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 5.1.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 5.2 Perturbed Exploration for Heavy-Tailed Rewards . . . . . . . . . . 128 5.2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 130 5.2.2 Sub-Optimality of Robust Upper Con dence Bounds . . . . 132 5.2.3 Adaptively Perturbed Exploration with A p-Robust Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 5.2.4 General Regret Bound for Heavy-Tailed Rewards . . . . . . 136 5.2.5 Regret Bounds for Speci c Perturbations with Heavy-Tailed Rewards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 5.2.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 144 5.2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 6 Inverse Reinforcement Learning with Negative Demonstrations149 6.1 Leveraged Gaussian Processes Inverse Reinforcement Learning . . 151 6.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 152 6.1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 6.1.3 Gaussian Process Regression . . . . . . . . . . . . . . . . . 156 6.1.4 Leveraged Gaussian Processes . . . . . . . . . . . . . . . . . 159 6.1.5 Gaussian Process Inverse Reinforcement Learning . . . . . 164 6.1.6 Inverse Reinforcement Learning with Leveraged Gaussian Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 6.1.7 Simulations and Experiment . . . . . . . . . . . . . . . . . 175 6.1.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 7 Conclusion 185 Appendices 189 A Proofs of Chapter 3.1. 191 A.1 Useful Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 A.2 Sparse Bellman Optimality Equation . . . . . . . . . . . . . . . . . 192 A.3 Sparse Tsallis Entropy . . . . . . . . . . . . . . . . . . . . . . . . . 195 A.4 Upper and Lower Bounds for Sparsemax Operation . . . . . . . . . 196 A.5 Comparison to Log-Sum-Exp . . . . . . . . . . . . . . . . . . . . . 200 A.6 Convergence and Optimality of Sparse Value Iteration . . . . . . . 201 A.7 Performance Error Bounds for Sparse Value Iteration . . . . . . . . 203 B Proofs of Chapter 3.2. 209 B.1 Change of Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 209 B.2 Concavity of Maximum Causal Tsallis Entropy . . . . . . . . . . . 210 B.3 Optimality Condition of Maximum Causal Tsallis Entropy . . . . . 212 B.4 Interpretation as Robust Bayes . . . . . . . . . . . . . . . . . . . . 215 B.5 Generative Adversarial Setting with Maximum Causal Tsallis Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 B.6 Tsallis Entropy of a Mixture of Gaussians . . . . . . . . . . . . . . 217 B.7 Causal Entropy Approximation . . . . . . . . . . . . . . . . . . . . 218 C Proofs of Chapter 4.1. 221 C.1 q-Maximum: Bounded Approximation of Maximum . . . . . . . . . 223 C.2 Tsallis Bellman Optimality Equation . . . . . . . . . . . . . . . . . 226 C.3 Variable Change . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228 C.4 Tsallis Bellman Optimality Equation . . . . . . . . . . . . . . . . . 
230 C.5 Tsallis Policy Iteration . . . . . . . . . . . . . . . . . . . . . . . . . 234 C.6 Tsallis Bellman Expectation (TBE) Equation . . . . . . . . . . . . 234 C.7 Tsallis Bellman Expectation Operator and Tsallis Policy Evaluation235 C.8 Tsallis Policy Improvement . . . . . . . . . . . . . . . . . . . . . . 237 C.9 Tsallis Value Iteration . . . . . . . . . . . . . . . . . . . . . . . . . 239 C.10 Performance Error Bounds . . . . . . . . . . . . . . . . . . . . . . 241 C.11 q-Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243 D Proofs of Chapter 4.2. 245 D.1 In nite Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . 245 D.2 Regret Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247 E Proofs of Chapter 5.1. 255 E.1 General Regret Lower Bound of APE . . . . . . . . . . . . . . . . . 255 E.2 General Regret Upper Bound of APE . . . . . . . . . . . . . . . . 257 E.3 Proofs of Corollaries . . . . . . . . . . . . . . . . . . . . . . . . . . 266 F Proofs of Chapter 5.2. 279 F.1 Regret Lower Bound for Robust Upper Con dence Bound . . . . . 279 F.2 Bounds on Tail Probability of A p-Robust Estimator . . . . . . . . 284 F.3 General Regret Upper Bound of APE2 . . . . . . . . . . . . . . . . 287 F.4 General Regret Lower Bound of APE2 . . . . . . . . . . . . . . . . 299 F.5 Proofs of Corollaries . . . . . . . . . . . . . . . . . . . . . . . . . . 302Docto

    Sparse Randomized Shortest Paths Routing with Tsallis Divergence Regularization

    Full text link
    This work elaborates on the important problem of (1) designing optimal randomized routing policies for reaching a target node t from a source node s on a weighted directed graph G, and (2) defining distance measures between nodes that interpolate between the least cost (based on optimal movements) and the commute cost (based on a random walk on G), depending on a temperature parameter T. To this end, the randomized shortest path formalism (RSP, [2,99,124]) is rephrased in terms of Tsallis divergence regularization instead of Kullback-Leibler divergence regularization. The main consequence of this change is that the resulting routing policy (local transition probabilities) becomes sparser as T decreases, therefore inducing a sparse random walk on G that converges to the least-cost directed acyclic graph as T tends to 0. Experimental comparisons on node clustering and semi-supervised classification tasks show that the derived dissimilarity measures based on expected routing costs provide state-of-the-art results. The sparse RSP is therefore a promising model of movements on a graph, balancing sparse exploitation and exploration in an optimal way.
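    The sparsity mechanism described above can be illustrated with a toy routing policy: compute least cost-to-go values toward the target and turn them into local transition probabilities with a sparse projection at temperature T, so that as T decreases the walk concentrates on least-cost edges and assigns exactly zero probability to the rest. This is only a sketch of the intuition under simplifying assumptions (a tiny hypothetical graph, and a sparsemax projection standing in for the Tsallis-divergence-regularized RSP solution), not the formalism derived in the paper.

```python
import numpy as np

def sparsemax(z):
    """Projection onto the probability simplex that can return exact zeros (same helper as the earlier sketch)."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum
    threshold = (cumsum[support][-1] - 1) / k[support][-1]
    return np.maximum(z - threshold, 0.0)

# Hypothetical weighted directed graph: costs[i][j] is the cost of edge i -> j.
costs = {0: {1: 1.0, 2: 4.0}, 1: {2: 1.0, 3: 5.0}, 2: {3: 1.0}, 3: {}}
target = 3

# Least cost-to-go toward the target via a few Bellman sweeps (small example).
v = {n: (0.0 if n == target else np.inf) for n in costs}
for _ in range(len(costs)):
    for i, nbrs in costs.items():
        if i != target and nbrs:
            v[i] = min(c + v[j] for j, c in nbrs.items())

def routing_policy(node, T):
    """Sparse local transition probabilities at temperature T (illustrative only)."""
    nbrs = list(costs[node])
    scores = np.array([-(costs[node][j] + v[j]) / T for j in nbrs])
    return dict(zip(nbrs, sparsemax(scores)))

for T in (5.0, 0.5):
    print(f"T={T}: policy at node 1 ->", routing_policy(1, T))  # low T drops the costly edge entirely
```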

    Generalized Munchausen Reinforcement Learning using Tsallis KL Divergence

    Full text link
    Many policy optimization approaches in reinforcement learning incorporate a Kullback-Leibler (KL) divergence with respect to the previous policy, to prevent the policy from changing too quickly. This idea was initially proposed in a seminal paper on Conservative Policy Iteration, with approximations given by algorithms like TRPO and Munchausen Value Iteration (MVI). We continue this line of work by investigating a generalized KL divergence, called the Tsallis KL divergence, which uses the q-logarithm in its definition. The approach is a strict generalization, as q = 1 corresponds to the standard KL divergence, while q > 1 provides a range of new options. We characterize the types of policies learned under the Tsallis KL and motivate when q > 1 could be beneficial. To obtain a practical algorithm that incorporates Tsallis KL regularization, we extend MVI, which is one of the simplest approaches to incorporating KL regularization. We show that this generalized MVI(q) obtains significant improvements over the standard MVI(q = 1) across 35 Atari games.
    Comment: Accepted by NeurIPS 2023.
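    The q-logarithm mentioned above is ln_q(x) = (x^(1-q) - 1) / (1 - q), which recovers the natural logarithm as q tends to 1. The sketch below uses one common convention for the resulting Tsallis KL divergence and checks numerically that it reduces to the standard KL divergence near q = 1; the convention, the example policies, and the function names are assumptions and may differ in detail from the paper's definition.

```python
import numpy as np

def log_q(x, q):
    """q-logarithm: ln_q(x) = (x**(1-q) - 1) / (1 - q), recovering ln(x) as q -> 1."""
    if np.isclose(q, 1.0):
        return np.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def tsallis_kl(p, m, q):
    """One common form of the Tsallis KL divergence: D_q(p || m) = -sum_a p(a) * ln_q(m(a) / p(a))."""
    p, m = np.asarray(p, float), np.asarray(m, float)
    return -np.sum(p * log_q(m / p, q))

# Hypothetical current policy p and previous policy m over 3 actions.
p = np.array([0.7, 0.2, 0.1])
m = np.array([0.4, 0.4, 0.2])

kl = np.sum(p * np.log(p / m))                               # standard KL for reference
print("KL(p||m)              :", kl)
print("Tsallis KL, q = 1.001 :", tsallis_kl(p, m, 1.001))    # close to the standard KL
print("Tsallis KL, q = 2     :", tsallis_kl(p, m, 2.0))      # penalizes large ratios p/m more heavily
```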

    A Theory of Regularized Markov Decision Processes

    Full text link
    Many recent successful (deep) reinforcement learning algorithms make use of regularization, generally based on entropy or Kullback-Leibler divergence. We propose a general theory of regularized Markov Decision Processes that generalizes these approaches in two directions: we consider a larger class of regularizers, and we consider the general modified policy iteration approach, encompassing both policy iteration and value iteration. The core building blocks of this theory are a notion of regularized Bellman operator and the Legendre-Fenchel transform, a classical tool of convex optimization. This approach allows for error propagation analyses of general algorithmic schemes of which (possibly variants of) classical algorithms such as Trust Region Policy Optimization, Soft Q-learning, Stochastic Actor Critic or Dynamic Policy Programming are special cases. This also draws connections to proximal convex optimization, especially to Mirror Descent.
    Comment: ICML 2019.
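    One concrete instance of the two building blocks named above, the regularized Bellman operator and the Legendre-Fenchel transform, is Shannon-entropy regularization: the convex conjugate of the scaled negative entropy over the simplex is the scaled log-sum-exp function, and its gradient is the softmax policy, so the regularized greedy step has a closed form. The sketch below checks this numerically for a single state; the Q-values, the temperature, and the names are illustrative assumptions, not the paper's notation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
q = rng.normal(size=5)           # Q-values for one state (illustrative)
tau = 0.5                        # regularization temperature (illustrative)

# Closed form from the Legendre-Fenchel transform of -tau * Shannon entropy:
#   max_pi <pi, q> + tau * H(pi) = tau * log(sum(exp(q / tau)))
closed_form = tau * np.log(np.sum(np.exp(q / tau)))
greedy_policy = np.exp(q / tau) / np.sum(np.exp(q / tau))    # softmax = gradient of the conjugate

def neg_objective(x):
    """Negative regularized objective, with pi kept on the simplex via a softmax parameterization."""
    pi = np.exp(x) / np.sum(np.exp(x))
    entropy = -np.sum(pi * np.log(pi + 1e-12))
    return -(pi @ q + tau * entropy)

res = minimize(neg_objective, np.zeros(5), method="BFGS")    # direct numerical maximization
print("closed form (tau*logsumexp):", closed_form)
print("numerical maximum          :", -res.fun)
print("softmax policy             :", np.round(greedy_policy, 4))
```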

    Identifiability and Generalizability in Constrained Inverse Reinforcement Learning

    Full text link
    Two main challenges in Reinforcement Learning (RL) are designing appropriate reward functions and ensuring the safety of the learned policy. To address these challenges, we present a theoretical framework for Inverse Reinforcement Learning (IRL) in constrained Markov decision processes. From a convex-analytic perspective, we extend prior results on reward identifiability and generalizability to both the constrained setting and a more general class of regularizations. In particular, we show that identifiability up to potential shaping (Cao et al., 2021) is a consequence of entropy regularization and may no longer hold in general for other regularizations or in the presence of safety constraints. We also show that to ensure generalizability to new transition laws and constraints, the true reward must be identified up to a constant. Additionally, we derive a finite-sample guarantee for the suboptimality of the learned rewards, and validate our results in a gridworld environment.
    Comment: Published at ICML 2023.
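    The identifiability notion referenced above, "up to potential shaping (Cao et al., 2021)", has a concrete meaning: a reward r and any shaped reward r'(s,a,s') = r(s,a,s') + γ φ(s') - φ(s) induce the same optimal policies for every potential function φ (Ng et al., 1999), so behavior alone cannot distinguish them. The sketch below verifies this on a small random tabular MDP; the MDP, discount factor, and potential are arbitrary illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 4, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)    # random transition kernel
r = rng.random((S, A))                                          # random reward r(s, a)
phi = rng.normal(size=S)                                        # arbitrary potential function

# Potential-based shaping: r'(s, a) = r(s, a) + gamma * E[phi(s')] - phi(s)
r_shaped = r + gamma * P @ phi - phi[:, None]

def optimal_policy(reward, iters=2000):
    """Greedy policy from tabular value iteration."""
    v = np.zeros(S)
    for _ in range(iters):
        q = reward + gamma * P @ v
        v = q.max(axis=1)
    return q.argmax(axis=1)

print("optimal policy, original reward:", optimal_policy(r))
print("optimal policy, shaped reward  :", optimal_policy(r_shaped))   # identical, as predicted by Ng et al. (1999)
```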