
    Generalised Entropy MDPs and Minimax Regret

    Bayesian methods suffer from the problem of how to specify prior beliefs. One interesting idea is to consider worst-case priors. This requires solving a stochastic zero-sum game. In this paper, we extend well-known results from bandit theory in order to discover minimax-Bayes policies and discuss when they are practical. Comment: 7 pages, NIPS workshop "From bad models to good policies".
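    To make the worst-case-prior idea concrete, here is a minimal sketch assuming the problem has already been reduced to a finite regret matrix (rows: candidate environments nature may choose; columns: the agent's policies; the numbers below are made up). Solving the resulting zero-sum game with a linear program yields a minimax-regret policy mixture. This is an illustration of the setup, not the paper's algorithm.

```python
# Illustrative sketch: minimax regret over a tiny finite zero-sum game between
# nature (rows, environments) and the agent (columns, policies).
import numpy as np
from scipy.optimize import linprog

# Hypothetical regret matrix R[i, j] = regret of policy j in environment i.
R = np.array([
    [0.0, 0.8, 0.5],
    [0.9, 0.0, 0.4],
    [0.6, 0.7, 0.1],
])
n_env, n_pol = R.shape

# Variables: policy mixture x (n_pol entries) and the game value v.
# Minimise v subject to: R[i] @ x <= v for every environment i, sum(x) = 1, x >= 0.
c = np.concatenate([np.zeros(n_pol), [1.0]])
A_ub = np.hstack([R, -np.ones((n_env, 1))])
b_ub = np.zeros(n_env)
A_eq = np.concatenate([np.ones(n_pol), [0.0]]).reshape(1, -1)
b_eq = np.array([1.0])
bounds = [(0, None)] * n_pol + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
mixture, value = res.x[:n_pol], res.x[-1]
print("minimax-regret policy mixture:", mixture.round(3))
print("worst-case expected regret:", round(value, 3))
```

    The dual solution of the same linear program gives nature's mixed strategy over environments, i.e. the worst-case prior.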

    Efficient RL with Impaired Observability: Learning to Act with Delayed and Missing State Observations

    In real-world reinforcement learning (RL) systems, various forms of impaired observability can complicate matters. These situations arise when an agent is unable to observe the most recent state of the system due to latency or lossy channels, yet the agent must still make real-time decisions. This paper introduces a theoretical investigation into efficient RL in control systems where agents must act with delayed and missing state observations. We establish near-optimal regret bounds of the form $\tilde{\mathcal{O}}(\sqrt{\mathrm{poly}(H)\,SAK})$ for RL in both the delayed and missing observation settings. Despite impaired observability posing significant challenges to the policy class and planning, our results demonstrate that learning remains efficient, with the regret bound optimally depending on the state-action size of the original system. Additionally, we provide a characterization of the performance of the optimal policy under impaired observability, comparing it to the optimal value obtained with full observability.
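    As an illustration of the delayed-observation setting (not the paper's algorithm), the sketch below runs an episode in which the agent only sees the state from `delay` steps ago and augments it with the actions issued since, one standard way to restore a Markovian interface. Here `env` is assumed to expose a Gymnasium-style `reset()`/`step()` API, and `policy` is a hypothetical callable taking the delayed state and the action history.

```python
# Sketch: acting under a fixed observation delay by pairing the last observed
# state with the actions taken since it was observed.
from collections import deque

def run_delayed_episode(env, policy, delay, max_steps=500):
    state, _ = env.reset()
    obs_buffer = deque([state], maxlen=delay + 1)   # observed states, oldest first
    action_hist = deque(maxlen=delay)               # actions issued since the observed state
    total_reward = 0.0
    for _ in range(max_steps):
        observed = obs_buffer[0]                    # state from `delay` steps ago
        action = policy(observed, tuple(action_hist))
        state, reward, terminated, truncated, _ = env.step(action)
        obs_buffer.append(state)
        action_hist.append(action)
        total_reward += reward
        if terminated or truncated:
            break
    return total_reward
```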

    Efficient Robot Learning for Optimal Decision Making in Complex and Uncertain Environments

    Ph.D. dissertation -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2021. Songhwai Oh.

    The problem of sequential decision making under an uncertain and complex environment is a long-standing challenge in robotics. In this thesis, we focus on learning a policy function of robotic systems for sequential decision making, a setting we call a robot learning framework. In particular, we are interested in reducing the sample complexity of the robot learning framework. Hence, we develop three sample-efficient robot learning frameworks: maximum entropy reinforcement learning, perturbation-based exploration, and learning from demonstrations with mixed qualities.

    For maximum entropy reinforcement learning, we employ a generalized Tsallis entropy regularization as an efficient exploration method. Tsallis entropy generalizes the Shannon-Gibbs entropy by introducing an entropic index. By changing the entropic index, we can control the sparsity and multi-modality of the policy. Based on this fact, we first propose a sparse Markov decision process (sparse MDP), which induces a sparse and multi-modal optimal policy distribution. In this MDP, the sparse entropy, a special case of Tsallis entropy, is employed as the policy regularizer. We first analyze the optimality condition of a sparse MDP. Then, we propose dynamic programming methods for the sparse MDP and prove their convergence and optimality. We also show that the performance error of a sparse MDP has a constant bound, while the error of a soft MDP increases logarithmically with the number of actions, where this performance error is caused by the introduced regularization term. Furthermore, we generalize sparse MDPs to a new class of entropy-regularized Markov decision processes, referred to as Tsallis MDPs, and analyze the different types of optimal policies, with interesting properties related to the stochasticity of the optimal policy, obtained by controlling the entropic index.

    We also develop perturbation-based exploration methods to handle heavy-tailed noise. In many robot learning problems, the learning signal is corrupted by noise, such as sub-Gaussian or heavy-tailed noise. While most exploration strategies have been analyzed under a sub-Gaussian noise assumption, few methods exist for handling heavy-tailed rewards. Hence, to overcome heavy-tailed noise, we consider stochastic multi-armed bandits with heavy-tailed rewards. First, we propose a novel robust estimator that does not require prior information about the noise distribution, whereas other existing robust estimators demand such prior knowledge. Then, we show that the error probability of the proposed estimator decays exponentially fast. Using this estimator, we propose a perturbation-based exploration strategy and develop a generalized regret analysis scheme that provides upper and lower regret bounds by revealing the relationship between the regret and the cumulative distribution function of the perturbation. From the proposed analysis scheme, we obtain gap-dependent and gap-independent upper and lower regret bounds for various perturbations. We also find the optimal hyperparameters for each perturbation, which achieve the minimax optimal regret bound with respect to the total number of rounds.
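    As a small illustration of the sparsity controlled by the entropic index, the sketch below contrasts the softmax policy (the Shannon-Gibbs case) with the sparsemax policy (the Tsallis $q = 2$ case used in sparse MDPs): sparsemax assigns exactly zero probability to low-value actions, while softmax never does. The Q-values are made up, and the code follows the standard simplex-projection formula for sparsemax rather than the thesis's implementation.

```python
# Sketch: softmax vs. sparsemax action distributions over a set of Q-values.
import numpy as np

def softmax(q):
    z = np.exp(q - q.max())
    return z / z.sum()

def sparsemax(q):
    # Euclidean projection of q onto the probability simplex
    # (Martins & Astudillo, 2016); the Tsallis q=2 maximum-entropy policy.
    z = np.sort(q)[::-1]
    k = np.arange(1, len(q) + 1)
    support = 1 + k * z > np.cumsum(z)
    k_max = k[support][-1]
    tau = (np.cumsum(z)[k_max - 1] - 1) / k_max
    return np.maximum(q - tau, 0.0)

q_values = np.array([2.0, 1.9, 0.3, -1.0])
print("softmax  :", softmax(q_values).round(3))    # every action keeps some mass
print("sparsemax:", sparsemax(q_values).round(3))  # low-value actions get exactly 0
```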
    For learning from demonstrations with mixed qualities, we develop a novel inverse reinforcement learning framework using leveraged Gaussian processes (LGP), which can handle negative demonstrations. In LGP, the correlation between two Gaussian processes is captured by a leveraged kernel function. Using this property, the proposed inverse reinforcement learning algorithm can learn from both positive and negative demonstrations. While most existing inverse reinforcement learning (IRL) methods suffer from a lack of information near low-reward regions, the proposed method alleviates this issue by incorporating negative demonstrations. To mathematically formulate negative demonstrations, we introduce a novel generative model which can generate both positive and negative demonstrations using a parameter called proficiency. Moreover, since we represent the reward function using a leveraged Gaussian process, which can model a nonlinear function, the proposed method can effectively estimate the structure of a nonlinear reward function.

    This dissertation addresses robot learning problems based on demonstrations and reward functions. Robot learning aims to find an optimal policy function that performs well on uncertain and complex tasks. Among the many problems in robot learning, we concentrate on reducing sample complexity. In particular, the goal is to develop efficient exploration methods and techniques for learning from mixed demonstrations, so that a highly effective policy function can be learned from a small number of samples. To develop an efficient exploration method, we use the generalized Tsallis entropy. Tsallis entropy generalizes the Shannon-Gibbs entropy by introducing a new parameter called the entropic index. By adjusting the entropic index, entropies of various forms can be obtained, each with a different regularization effect. Based on this property, we propose the sparse Markov decision process, which uses the sparse Tsallis entropy and is effective at representing policy distributions that are sparse and multi-modal at the same time. We mathematically prove that this achieves better performance than using the Shannon-Gibbs entropy, and we theoretically quantify the performance degradation caused by the sparse Tsallis entropy. We further generalize sparse Markov decision processes to generalized Tsallis entropy decision processes, and likewise mathematically characterize the change in the optimal policy and the performance degradation that result from adding Tsallis entropy to a Markov decision process. We also propose entropic-index scheduling, a method that removes this performance degradation, and show experimentally that it achieves optimal performance.

    In addition, to solve learning problems with heavy-tailed noise, we develop an exploration technique based on perturbations. Many robot learning problems are affected by noise: the learning signal can be corrupted in various forms, and finding the optimal action while suppressing this noise requires efficient exploration. Whereas existing methods are applicable only to sub-Gaussian noise, the approach proposed in this dissertation has the advantage of handling heavy-tailed noise. We first prove regret bounds for general perturbations, establishing the relationship between the regret and the cumulative distribution function (CDF) of the perturbation. Using this relationship, we make the regret bounds of various perturbation distributions computable and derive the most efficient exploration parameter for each distribution.
ν˜Όν•©μ‹œλ²”μœΌλ‘œ λΆ€ν„°μ˜ ν•™μŠ΅ 기법을 κ°œλ°œν•˜κΈ° μœ„ν•΄μ„œ, μ˜€μ‹œλ²”μ„ λ‹€λ£° 수 μžˆλŠ” μƒˆλ‘œμš΄ ν˜•νƒœμ˜ κ°€μš°μ‹œμ•ˆ ν”„λ‘œμ„ΈμŠ€ νšŒκ·€λΆ„μ„ 방식을 κ°œλ°œν•˜μ˜€κ³ , 이 방식을 ν™•μž₯ν•˜μ—¬ λ ˆλ²„λ¦¬μ§€ κ°€μš°μ‹œμ•ˆ ν”„λ‘œμ„ΈμŠ€ μ—­κ°•ν™”ν•™μŠ΅ 기법을 κ°œλ°œν•˜μ˜€λ‹€. 개발된 κΈ°λ²•μ—μ„œλŠ” μ •μ‹œλ²”μœΌλ‘œλΆ€ν„° 무엇을 ν•΄μ•Ό ν•˜λŠ”μ§€μ™€ μ˜€μ‹œλ²”μœΌλ‘œλΆ€ν„° 무엇을 ν•˜λ©΄ μ•ˆλ˜λŠ”μ§€λ₯Ό λͺ¨λ‘ ν•™μŠ΅ν•  수 μžˆλ‹€. 기쑴의 λ°©λ²•μ—μ„œλŠ” 쓰일 수 μ—†μ—ˆλ˜ μ˜€μ‹œλ²”μ„ μ‚¬μš© ν•  수 있게 λ§Œλ“¦μœΌλ‘œμ¨ μƒ˜ν”Œ λ³΅μž‘λ„λ₯Ό 쀄일 수 μžˆμ—ˆκ³  μ •μ œλœ 데이터λ₯Ό μˆ˜μ§‘ν•˜μ§€ μ•Šμ•„λ„ λœλ‹€λŠ” μ μ—μ„œ 큰 μž₯점을 κ°–μŒμ„ μ‹€ν—˜μ μœΌλ‘œ λ³΄μ˜€λ‹€.1 Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . 4 2 Background 5 2.1 Learning from Rewards . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.1 Multi-Armed Bandits . . . . . . . . . . . . . . . . . . . . . 7 2.1.2 Contextual Multi-Armed Bandits . . . . . . . . . . . . . . . 7 2.1.3 Markov Decision Processes . . . . . . . . . . . . . . . . . . 9 2.1.4 Soft Markov Decision Processes . . . . . . . . . . . . . . . . 10 2.2 Learning from Demonstrations . . . . . . . . . . . . . . . . . . . . 12 2.2.1 Behavior Cloning . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.2 Inverse Reinforcement Learning . . . . . . . . . . . . . . . . 13 3 Sparse Policy Learning 19 3.1 Sparse Policy Learning for Reinforcement Learning . . . . . . . . . 19 3.1.1 Sparse Markov Decision Processes . . . . . . . . . . . . . . 23 3.1.2 Sparse Value Iteration . . . . . . . . . . . . . . . . . . . . . 29 3.1.3 Performance Error Bounds for Sparse Value Iteration . . . 30 3.1.4 Sparse Exploration and Update Rule for Sparse Deep QLearning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.1.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.1.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.2 Sparse Policy Learning for Imitation Learning . . . . . . . . . . . . 46 3.2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 48 3.2.2 Principle of Maximum Causal Tsallis Entropy . . . . . . . . 50 3.2.3 Maximum Causal Tsallis Entropy Imitation Learning . . . 54 3.2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 58 3.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4 Entropy-based Exploration 65 4.1 Generalized Tsallis Entropy Reinforcement Learning . . . . . . . . 65 4.1.1 Maximum Generalized Tsallis Entropy in MDPs . . . . . . 71 4.1.2 Dynamic Programming for Tsallis MDPs . . . . . . . . . . 74 4.1.3 Tsallis Actor Critic for Model-Free RL . . . . . . . . . . . . 78 4.1.4 Experiments Setup . . . . . . . . . . . . . . . . . . . . . . . 79 4.1.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . 84 4.1.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 4.2 E cient Exploration for Robotic Grasping . . . . . . . . . . . . . . 92 4.2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 94 4.2.2 Shannon Entropy Regularized Neural Contextual Bandit Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 4.2.3 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . 99 4.2.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . 104 4.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 5 Perturbation-Based Exploration 113 5.1 Perturbed Exploration for sub-Gaussian Rewards . . . . . . . . . . 
115 5.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 115 5.1.2 Heavy-Tailed Perturbations . . . . . . . . . . . . . . . . . . 117 5.1.3 Adaptively Perturbed Exploration . . . . . . . . . . . . . . 119 5.1.4 General Regret Bound for Sub-Gaussian Rewards . . . . . . 120 5.1.5 Regret Bounds for Speci c Perturbations with sub-Gaussian Rewards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 5.1.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 5.2 Perturbed Exploration for Heavy-Tailed Rewards . . . . . . . . . . 128 5.2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 130 5.2.2 Sub-Optimality of Robust Upper Con dence Bounds . . . . 132 5.2.3 Adaptively Perturbed Exploration with A p-Robust Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 5.2.4 General Regret Bound for Heavy-Tailed Rewards . . . . . . 136 5.2.5 Regret Bounds for Speci c Perturbations with Heavy-Tailed Rewards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 5.2.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 144 5.2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 6 Inverse Reinforcement Learning with Negative Demonstrations149 6.1 Leveraged Gaussian Processes Inverse Reinforcement Learning . . 151 6.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 152 6.1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 6.1.3 Gaussian Process Regression . . . . . . . . . . . . . . . . . 156 6.1.4 Leveraged Gaussian Processes . . . . . . . . . . . . . . . . . 159 6.1.5 Gaussian Process Inverse Reinforcement Learning . . . . . 164 6.1.6 Inverse Reinforcement Learning with Leveraged Gaussian Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167 6.1.7 Simulations and Experiment . . . . . . . . . . . . . . . . . 175 6.1.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 7 Conclusion 185 Appendices 189 A Proofs of Chapter 3.1. 191 A.1 Useful Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191 A.2 Sparse Bellman Optimality Equation . . . . . . . . . . . . . . . . . 192 A.3 Sparse Tsallis Entropy . . . . . . . . . . . . . . . . . . . . . . . . . 195 A.4 Upper and Lower Bounds for Sparsemax Operation . . . . . . . . . 196 A.5 Comparison to Log-Sum-Exp . . . . . . . . . . . . . . . . . . . . . 200 A.6 Convergence and Optimality of Sparse Value Iteration . . . . . . . 201 A.7 Performance Error Bounds for Sparse Value Iteration . . . . . . . . 203 B Proofs of Chapter 3.2. 209 B.1 Change of Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 209 B.2 Concavity of Maximum Causal Tsallis Entropy . . . . . . . . . . . 210 B.3 Optimality Condition of Maximum Causal Tsallis Entropy . . . . . 212 B.4 Interpretation as Robust Bayes . . . . . . . . . . . . . . . . . . . . 215 B.5 Generative Adversarial Setting with Maximum Causal Tsallis Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 B.6 Tsallis Entropy of a Mixture of Gaussians . . . . . . . . . . . . . . 217 B.7 Causal Entropy Approximation . . . . . . . . . . . . . . . . . . . . 218 C Proofs of Chapter 4.1. 221 C.1 q-Maximum: Bounded Approximation of Maximum . . . . . . . . . 223 C.2 Tsallis Bellman Optimality Equation . . . . . . . . . . . . . . . . . 226 C.3 Variable Change . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228 C.4 Tsallis Bellman Optimality Equation . . . . . . . . . . . . . . . . . 
230 C.5 Tsallis Policy Iteration . . . . . . . . . . . . . . . . . . . . . . . . . 234 C.6 Tsallis Bellman Expectation (TBE) Equation . . . . . . . . . . . . 234 C.7 Tsallis Bellman Expectation Operator and Tsallis Policy Evaluation235 C.8 Tsallis Policy Improvement . . . . . . . . . . . . . . . . . . . . . . 237 C.9 Tsallis Value Iteration . . . . . . . . . . . . . . . . . . . . . . . . . 239 C.10 Performance Error Bounds . . . . . . . . . . . . . . . . . . . . . . 241 C.11 q-Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243 D Proofs of Chapter 4.2. 245 D.1 In nite Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . 245 D.2 Regret Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247 E Proofs of Chapter 5.1. 255 E.1 General Regret Lower Bound of APE . . . . . . . . . . . . . . . . . 255 E.2 General Regret Upper Bound of APE . . . . . . . . . . . . . . . . 257 E.3 Proofs of Corollaries . . . . . . . . . . . . . . . . . . . . . . . . . . 266 F Proofs of Chapter 5.2. 279 F.1 Regret Lower Bound for Robust Upper Con dence Bound . . . . . 279 F.2 Bounds on Tail Probability of A p-Robust Estimator . . . . . . . . 284 F.3 General Regret Upper Bound of APE2 . . . . . . . . . . . . . . . . 287 F.4 General Regret Lower Bound of APE2 . . . . . . . . . . . . . . . . 299 F.5 Proofs of Corollaries . . . . . . . . . . . . . . . . . . . . . . . . . . 302Docto
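    Returning to the perturbation-based exploration strategy summarized in the abstract above, the following is a schematic sketch of the general idea: each arm's estimate is perturbed by random noise whose scale shrinks with the pull count, and the arm with the highest perturbed estimate is played. The clipped-mean estimator and the Weibull perturbation used here are stand-ins chosen for the sketch, not the estimator or tuning analysed in the thesis; `pull_arm` is an assumed callable returning a stochastic (possibly heavy-tailed) reward.

```python
# Schematic perturbation-based bandit exploration for heavy-tailed rewards.
import numpy as np

def perturbed_bandit(pull_arm, n_arms, horizon, clip=10.0, scale=1.0, rng=None):
    rng = rng or np.random.default_rng(0)
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)                            # sums of clipped rewards
    for t in range(horizon):
        if t < n_arms:                                 # pull each arm once to initialise
            arm = t
        else:
            means = sums / counts                      # clipped-mean estimates (stand-in robust estimator)
            widths = scale / np.sqrt(counts)           # perturbation scale shrinks with pulls
            perturb = rng.weibull(0.5, size=n_arms)    # heavy-tailed perturbation (illustrative choice)
            arm = int(np.argmax(means + widths * perturb))
        reward = pull_arm(arm)
        counts[arm] += 1
        sums[arm] += np.clip(reward, -clip, clip)      # clipping tames heavy-tailed rewards
    return counts
```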

    Many-agent Reinforcement Learning

    Multi-agent reinforcement learning (RL) solves the problem of how each agent should behave optimally in a stochastic environment in which multiple agents are learning simultaneously. It is an interdisciplinary domain with a long history that lies in the joint area of psychology, control theory, game theory, reinforcement learning, and deep learning. Following the remarkable success of the AlphaGo series in single-agent RL, 2019 was a booming year that witnessed significant advances in multi-agent RL techniques; impressive breakthroughs have been made on developing AIs that outperform humans on many challenging tasks, especially multi-player video games. Nonetheless, one of the key challenges of multi-agent RL techniques is scalability: it is still non-trivial to design efficient learning algorithms that can solve tasks involving far more than two agents ($N \gg 2$), which I name many-agent reinforcement learning (MARL; I use the word "MARL" to denote multi-agent reinforcement learning with a particular focus on the case of many agents; otherwise, it is denoted as "Multi-Agent RL" by default) problems. In this thesis, I contribute to tackling MARL problems from four aspects. Firstly, I offer a self-contained overview of multi-agent RL techniques from a game-theoretical perspective. This overview fills the research gap that most of the existing work either fails to cover the recent advances since 2010 or does not pay adequate attention to game theory, which I believe is the cornerstone to solving many-agent learning problems. Secondly, I develop a tractable policy evaluation algorithm -- $\alpha^\alpha$-Rank -- for many-agent systems. The critical advantage of $\alpha^\alpha$-Rank is that it can compute the solution concept of $\alpha$-Rank tractably in multi-player general-sum games with no need to store the entire pay-off matrix. This is in contrast to classic solution concepts such as the Nash equilibrium, which is known to be PPAD-hard to compute even in two-player cases. $\alpha^\alpha$-Rank allows us, for the first time, to practically conduct large-scale multi-agent evaluations. Thirdly, I introduce a scalable policy learning algorithm -- mean-field MARL -- for many-agent systems. The mean-field MARL method takes advantage of the mean-field approximation from physics, and it is the first provably convergent algorithm that tries to break the curse of dimensionality for MARL tasks. With the proposed algorithm, I report the first result of solving the Ising model and multi-agent battle games through a MARL approach. Fourthly, I investigate the many-agent learning problem in open-ended meta-games (i.e., the game of a game in the policy space). Specifically, I focus on modelling the behavioural diversity in meta-games, and on developing algorithms that are guaranteed to enlarge diversity during training. The proposed metric, based on determinantal point processes, serves as the first mathematically rigorous definition of diversity. Importantly, the diversity-aware learning algorithms beat the existing state-of-the-art game solvers in terms of exploitability by a large margin. On top of the algorithmic developments, I also contribute two real-world applications of MARL techniques. Specifically, I demonstrate the great potential of applying MARL to study the emergent population dynamics in nature, and to model diverse and realistic interactions in autonomous driving.
    Both applications embody the prospect that MARL techniques could achieve huge impacts in the real physical world, outside of video games.
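    To give a flavour of the mean-field idea described above, the sketch below shows a schematic tabular Q-update in which an agent conditions on the average (one-hot encoded) action of its neighbours instead of the full joint action, so the table size stays fixed as the population grows. This is a simplified illustration under assumed discrete states and actions, not the mean-field MARL algorithm as analysed in the thesis.

```python
# Schematic mean-field style Q-update: Q(s, own action, mean neighbour action).
from collections import defaultdict
import numpy as np

def make_q_table(n_actions):
    return defaultdict(lambda: np.zeros(n_actions))

def mean_action(neighbour_actions, n_actions):
    # Average of the one-hot encoded actions of the neighbours.
    return np.eye(n_actions)[neighbour_actions].mean(axis=0)

def boltzmann(q_row, temperature=1.0):
    z = np.exp((q_row - q_row.max()) / temperature)
    return z / z.sum()

def mean_field_q_update(Q, s, a, r, s_next, a_mean, a_mean_next, alpha=0.1, gamma=0.95):
    # Keys pair the state with a coarsely discretised neighbour mean action,
    # so each agent's table is independent of the number of agents.
    key = (s, tuple(a_mean.round(1)))
    key_next = (s_next, tuple(a_mean_next.round(1)))
    q_next = Q[key_next]
    v_next = boltzmann(q_next) @ q_next      # soft value under a Boltzmann policy
    Q[key][a] += alpha * (r + gamma * v_next - Q[key][a])
```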

    Expert Iteration

    In this thesis, we study how reinforcement learning algorithms can tackle classical board games without recourse to human knowledge. Specifically, we develop a framework and algorithms which learn to play the board game Hex starting from random play. We first describe Expert Iteration (ExIt), a novel reinforcement learning framework which extends Modified Policy Iteration. ExIt explicitly decomposes the reinforcement learning problem into two parts: planning and generalisation. A planning algorithm explores possible move sequences starting from a particular position to find good strategies from that position, while a parametric function approximator is trained to predict those plans, generalising to states not yet seen. Subsequently, planning is improved by using the approximated policy to guide search, increasing the strength of new plans. This decomposition allows ExIt to combine the benefits of both planning methods and function approximation methods. We demonstrate the effectiveness of the ExIt paradigm by implementing ExIt with two different planning algorithms. First, we develop a version based on Monte Carlo Tree Search (MCTS), a search algorithm which has been successful both in specific games, such as Go, Hex and Havannah, and in general game playing competitions. We then develop a new planning algorithm, Policy Gradient Search (PGS), which uses a model-free reinforcement learning algorithm for online planning. Unlike MCTS, PGS does not require an explicit search tree. Instead PGS uses function approximation within a single search, allowing it to be applied to problems with larger branching factors. Both MCTS-ExIt and PGS-ExIt defeated MoHex 2.0, the most recent Hex Olympiad winner to be open sourced, in 9 × 9 Hex. More importantly, whereas MoHex makes use of many Hex-specific improvements and knowledge, all our programs were trained tabula rasa using general reinforcement learning methods. This bodes well for ExIt's applicability to both other games and real-world decision making problems.
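    The ExIt loop itself is compact; the sketch below shows its two alternating phases, with `search`, `apprentice`, and `sample_positions` as assumed interfaces rather than the thesis's actual implementation. The planner (expert), guided by the current network, produces improved policy targets at sampled positions, and the network (apprentice) is trained to imitate them before guiding the next round of search.

```python
# Schematic Expert Iteration loop: plan with the expert, imitate with the apprentice.
import random

def expert_iteration(apprentice, search, sample_positions, n_iterations,
                     positions_per_iter=1024):
    dataset = []
    for _ in range(n_iterations):
        # 1. Planning: the expert = search guided by the current apprentice,
        #    run from sampled (e.g. self-play) positions, yields improved
        #    policy targets such as MCTS visit counts.
        for pos in sample_positions(apprentice, positions_per_iter):
            expert_policy = search(pos, guide=apprentice)
            dataset.append((pos, expert_policy))
        # 2. Generalisation: train the apprentice to predict the expert's
        #    improved policy (imitation on the search targets).
        batch = random.sample(dataset, min(len(dataset), 4096))
        apprentice.fit(batch)
    return apprentice
```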