494 research outputs found
Efficient Robot Learning for Optimal Decision Making under Complex and Uncertain Environments
Thesis (Ph.D.) -- Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, February 2021. Songhwai Oh.
The problem of sequential decision making under an uncertain and complex environment is a long-standing challenge in robotics. In this thesis, we focus on learning a policy function of a robotic system for sequential decision making under such environments, which we call a robot learning framework. In particular, we are interested in reducing the sample complexity of the robot learning framework. Hence, we develop three sample-efficient robot learning frameworks: maximum entropy reinforcement learning, perturbation-based exploration, and learning from demonstrations with mixed qualities.
For maximum entropy reinforcement learning, we employ a generalized Tsallis entropy regularization as an efficient exploration method. Tsallis entropy generalizes the Shannon-Gibbs entropy by introducing an entropic index. By changing the entropic index, we can control the sparsity and multi-modality of the policy. Based on this fact, we first propose a sparse Markov decision process (sparse MDP), which induces a sparse and multi-modal optimal policy distribution. In this MDP, the sparse entropy, a special case of Tsallis entropy, is employed as a policy regularizer. We first analyze the optimality condition of a sparse MDP. Then, we propose dynamic programming methods for sparse MDPs and prove their convergence and optimality.
We also show that the performance error of a sparse MDP has a constant bound, while the error of a soft MDP increases logarithmically with the number of actions, where this performance error is caused by the introduced regularization term. Furthermore, we generalize sparse MDPs to a new class of entropy-regularized Markov decision processes, referred to as Tsallis MDPs, and analyze the different types of optimal policies, whose stochasticity can be controlled through the entropic index.
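To make the role of the entropic index concrete, the following sketch (our own NumPy illustration, not code from the thesis; `tsallis_entropy` and `sparsemax` are names we introduce) computes the Tsallis entropy of a discrete distribution and the sparsemax projection that produces the sparse, multi-modal action distributions a sparse MDP induces:

```python
import numpy as np

def tsallis_entropy(p, q):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1);
    the limit q -> 1 recovers the Shannon-Gibbs entropy."""
    p = np.asarray(p, dtype=float)
    if abs(q - 1.0) < 1e-12:
        pos = p[p > 0]
        return float(-np.sum(pos * np.log(pos)))
    return float((1.0 - np.sum(p ** q)) / (q - 1.0))

def sparsemax(z):
    """Euclidean projection of a score vector onto the probability simplex.
    Unlike softmax, it can assign exactly zero probability to low-scoring
    actions, which is the source of sparse optimal policies."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    k = np.arange(1, z.size + 1)
    cssv = np.cumsum(z_sorted)
    support = 1.0 + k * z_sorted > cssv
    k_z = k[support][-1]                  # size of the support
    tau = (cssv[k_z - 1] - 1.0) / k_z     # threshold
    return np.maximum(z - tau, 0.0)
```

For example, `sparsemax([2.0, 0.0, -1.0])` returns `[1.0, 0.0, 0.0]`: the low-scoring actions receive exactly zero probability, whereas softmax would keep them strictly positive.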
Furthermore, we develop perturbation-based exploration methods to handle heavy-tailed noise. In many robot learning problems, the learning signal is corrupted by noise, such as sub-Gaussian or heavy-tailed noise. While most exploration strategies have been analyzed under a sub-Gaussian noise assumption, few methods exist for handling heavy-tailed rewards. Hence, to overcome heavy-tailed noise, we consider stochastic multi-armed bandits with heavy-tailed rewards. First, we propose a novel robust estimator that, unlike existing robust estimators, does not require prior information about the noise distribution. Then, we show that the error probability of the proposed estimator decays exponentially fast. Using this estimator, we propose a perturbation-based exploration strategy and develop a generalized regret analysis scheme that provides upper and lower regret bounds by revealing the relationship between the regret and the cumulative density function of the perturbation. From the proposed analysis scheme, we obtain gap-dependent and gap-independent upper and lower regret bounds for various perturbations. We also find the optimal hyperparameters for each perturbation, which achieve the minimax optimal regret bound with respect to the total number of rounds.
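The perturbation-based exploration idea can be sketched as follows (a toy illustration under our own assumptions: a Gaussian perturbation, unit-variance Gaussian reward noise, and hypothetical names; the thesis analyzes a general family of perturbations and heavy-tailed rewards):

```python
import numpy as np

rng = np.random.default_rng(0)

def perturbed_bandit(true_means, horizon=2000, scale=1.0):
    """Illustrative perturbation-based exploration for a stochastic bandit:
    pull the arm maximizing (empirical mean + perturbation / sqrt(pulls)),
    so arms with few pulls keep a chance of being re-explored."""
    n_arms = len(true_means)
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    for t in range(horizon):
        if t < n_arms:
            arm = t  # pull each arm once to initialize
        else:
            noise = rng.standard_normal(n_arms)  # perturbation sample
            index = sums / counts + scale * noise / np.sqrt(counts)
            arm = int(np.argmax(index))
        reward = true_means[arm] + rng.standard_normal()  # noisy reward
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = perturbed_bandit(np.array([1.0, 0.0, -1.0]))
```

Swapping `rng.standard_normal` for a heavier-tailed sampler (e.g., a Weibull or Pareto perturbation) changes the exploration profile; this dependence on the perturbation's cumulative density function is exactly what the regret analysis scheme captures.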
For learning from demonstrations with mixed qualities, we develop a novel inverse reinforcement learning (IRL) framework using leveraged Gaussian processes (LGP), which can handle negative demonstrations. In LGP, the correlation between two Gaussian processes is captured by a leveraged kernel function. Using this property, the proposed IRL algorithm can learn from both positive and negative demonstrations. While most existing IRL methods suffer from a lack of information near low-reward regions, the proposed method alleviates this issue by incorporating negative demonstrations. To formulate negative demonstrations mathematically, we introduce a novel generative model that can generate both positive and negative demonstrations using a parameter called proficiency. Moreover, since we represent the reward function with a leveraged Gaussian process, which can model nonlinear functions, the proposed method can effectively estimate the structure of a nonlinear reward function.
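A minimal sketch of the leveraged-kernel idea follows (our own toy construction: positive and negative demonstrations are coupled by scaling an RBF kernel with leverage values in {+1, -1}; the thesis's leveraged kernel and IRL objective are more general):

```python
import numpy as np

def rbf(a, b, ell=0.5):
    """Squared-exponential kernel between two 1-D input arrays."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def leveraged_gp_mean(x_train, leverage, y, x_test, noise=1e-2):
    """GP posterior mean with a leverage-scaled kernel
    k((x, l), (x', l')) = l * l' * rbf(x, x').
    Leverage +1 marks positive and -1 negative demonstrations
    (an illustrative coupling, not necessarily the thesis's exact kernel);
    test points are treated as having leverage +1."""
    L = np.outer(leverage, leverage)
    K = L * rbf(x_train, x_train) + noise * np.eye(x_train.size)
    k_star = leverage[None, :] * rbf(x_test, x_train)
    return k_star @ np.linalg.solve(K, y)
```

With a positive demonstration at x = 0 and a negative one at x = 2 (both observed with target 1), the posterior mean is pushed up near the positive demonstration and down near the negative one, which is how negative demonstrations supply information about low-reward regions.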
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . 4
2 Background 5
2.1 Learning from Rewards . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Multi-Armed Bandits . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Contextual Multi-Armed Bandits . . . . . . . . . . . . . . . 7
2.1.3 Markov Decision Processes . . . . . . . . . . . . . . . . . . 9
2.1.4 Soft Markov Decision Processes . . . . . . . . . . . . . . . . 10
2.2 Learning from Demonstrations . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Behavior Cloning . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 Inverse Reinforcement Learning . . . . . . . . . . . . . . . . 13
3 Sparse Policy Learning 19
3.1 Sparse Policy Learning for Reinforcement Learning . . . . . . . . . 19
3.1.1 Sparse Markov Decision Processes . . . . . . . . . . . . . . 23
3.1.2 Sparse Value Iteration . . . . . . . . . . . . . . . . . . . . . 29
3.1.3 Performance Error Bounds for Sparse Value Iteration . . . 30
3.1.4 Sparse Exploration and Update Rule for Sparse Deep Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.2 Sparse Policy Learning for Imitation Learning . . . . . . . . . . . . 46
3.2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.2.2 Principle of Maximum Causal Tsallis Entropy . . . . . . . . 50
3.2.3 Maximum Causal Tsallis Entropy Imitation Learning . . . 54
3.2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4 Entropy-based Exploration 65
4.1 Generalized Tsallis Entropy Reinforcement Learning . . . . . . . . 65
4.1.1 Maximum Generalized Tsallis Entropy in MDPs . . . . . . 71
4.1.2 Dynamic Programming for Tsallis MDPs . . . . . . . . . . 74
4.1.3 Tsallis Actor Critic for Model-Free RL . . . . . . . . . . . . 78
4.1.4 Experiments Setup . . . . . . . . . . . . . . . . . . . . . . . 79
4.1.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . 84
4.1.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.2 Efficient Exploration for Robotic Grasping . . . . . . . . . . . . . . 92
4.2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.2.2 Shannon Entropy Regularized Neural Contextual Bandit
Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.2.3 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . 99
4.2.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . 104
4.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5 Perturbation-Based Exploration 113
5.1 Perturbed Exploration for sub-Gaussian Rewards . . . . . . . . . . 115
5.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.1.2 Heavy-Tailed Perturbations . . . . . . . . . . . . . . . . . . 117
5.1.3 Adaptively Perturbed Exploration . . . . . . . . . . . . . . 119
5.1.4 General Regret Bound for Sub-Gaussian Rewards . . . . . . 120
5.1.5 Regret Bounds for Specific Perturbations with sub-Gaussian Rewards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.1.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.2 Perturbed Exploration for Heavy-Tailed Rewards . . . . . . . . . . 128
5.2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.2.2 Sub-Optimality of Robust Upper Confidence Bounds . . . . 132
5.2.3 Adaptively Perturbed Exploration with a p-Robust Estimator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
5.2.4 General Regret Bound for Heavy-Tailed Rewards . . . . . . 136
5.2.5 Regret Bounds for Specific Perturbations with Heavy-Tailed Rewards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.2.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6 Inverse Reinforcement Learning with Negative Demonstrations 149
6.1 Leveraged Gaussian Processes Inverse Reinforcement Learning . . 151
6.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
6.1.3 Gaussian Process Regression . . . . . . . . . . . . . . . . . 156
6.1.4 Leveraged Gaussian Processes . . . . . . . . . . . . . . . . . 159
6.1.5 Gaussian Process Inverse Reinforcement Learning . . . . . 164
6.1.6 Inverse Reinforcement Learning with Leveraged Gaussian Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.1.7 Simulations and Experiment . . . . . . . . . . . . . . . . . 175
6.1.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
7 Conclusion 185
Appendices 189
A Proofs of Chapter 3.1. 191
A.1 Useful Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
A.2 Sparse Bellman Optimality Equation . . . . . . . . . . . . . . . . . 192
A.3 Sparse Tsallis Entropy . . . . . . . . . . . . . . . . . . . . . . . . . 195
A.4 Upper and Lower Bounds for Sparsemax Operation . . . . . . . . . 196
A.5 Comparison to Log-Sum-Exp . . . . . . . . . . . . . . . . . . . . . 200
A.6 Convergence and Optimality of Sparse Value Iteration . . . . . . . 201
A.7 Performance Error Bounds for Sparse Value Iteration . . . . . . . . 203
B Proofs of Chapter 3.2. 209
B.1 Change of Variables . . . . . . . . . . . . . . . . . . . . . . . . . . 209
B.2 Concavity of Maximum Causal Tsallis Entropy . . . . . . . . . . . 210
B.3 Optimality Condition of Maximum Causal Tsallis Entropy . . . . . 212
B.4 Interpretation as Robust Bayes . . . . . . . . . . . . . . . . . . . . 215
B.5 Generative Adversarial Setting with Maximum Causal Tsallis Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
B.6 Tsallis Entropy of a Mixture of Gaussians . . . . . . . . . . . . . . 217
B.7 Causal Entropy Approximation . . . . . . . . . . . . . . . . . . . . 218
C Proofs of Chapter 4.1. 221
C.1 q-Maximum: Bounded Approximation of Maximum . . . . . . . . . 223
C.2 Tsallis Bellman Optimality Equation . . . . . . . . . . . . . . . . . 226
C.3 Variable Change . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
C.4 Tsallis Bellman Optimality Equation . . . . . . . . . . . . . . . . . 230
C.5 Tsallis Policy Iteration . . . . . . . . . . . . . . . . . . . . . . . . . 234
C.6 Tsallis Bellman Expectation (TBE) Equation . . . . . . . . . . . . 234
C.7 Tsallis Bellman Expectation Operator and Tsallis Policy Evaluation 235
C.8 Tsallis Policy Improvement . . . . . . . . . . . . . . . . . . . . . . 237
C.9 Tsallis Value Iteration . . . . . . . . . . . . . . . . . . . . . . . . . 239
C.10 Performance Error Bounds . . . . . . . . . . . . . . . . . . . . . . 241
C.11 q-Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
D Proofs of Chapter 4.2. 245
D.1 Infinite Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
D.2 Regret Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
E Proofs of Chapter 5.1. 255
E.1 General Regret Lower Bound of APE . . . . . . . . . . . . . . . . . 255
E.2 General Regret Upper Bound of APE . . . . . . . . . . . . . . . . 257
E.3 Proofs of Corollaries . . . . . . . . . . . . . . . . . . . . . . . . . . 266
F Proofs of Chapter 5.2. 279
F.1 Regret Lower Bound for Robust Upper Confidence Bound . . . . . 279
F.2 Bounds on Tail Probability of a p-Robust Estimator . . . . . . . . 284
F.3 General Regret Upper Bound of APE2 . . . . . . . . . . . . . . . . 287
F.4 General Regret Lower Bound of APE2 . . . . . . . . . . . . . . . . 299
F.5 Proofs of Corollaries . . . . . . . . . . . . . . . . . . . . . . . . . . 302
Quantum Hamiltonian Learning Using Imperfect Quantum Resources
Identifying an accurate model for the dynamics of a quantum system is a
vexing problem that underlies a range of problems in experimental physics and
quantum information theory. Recently, a method called quantum Hamiltonian
learning has been proposed by the present authors that uses quantum simulation
as a resource for modeling an unknown quantum system. This approach can, under
certain circumstances, allow such models to be efficiently identified. A major
caveat of that work is the assumption that all elements of the protocol are
noise-free. Here, we show that quantum Hamiltonian learning can tolerate
substantial amounts of depolarizing noise and show numerical evidence that it
can tolerate noise drawn from other realistic models. We further provide
evidence that the learning algorithm will find a model that is maximally close
to the true model in cases where the hypothetical model lacks terms present in
the true model. Finally, we also provide numerical evidence that the algorithm
works for non-commuting models. This work illustrates that quantum Hamiltonian
learning can be performed using realistic resources and suggests that even
imperfect quantum resources may be valuable for characterizing quantum systems.
Comment: 16 pages, 11 figures
Reinforcement learning for robotic manipulation using simulated locomotion demonstrations
Mastering robotic manipulation skills through reinforcement learning (RL) typically requires the design of shaped reward functions. Recent developments in this area have demonstrated that using sparse rewards, i.e., rewarding the agent only when the task has been successfully completed, can lead to better policies. However, state-action space exploration is more difficult in this case. Recent RL approaches to learning with sparse rewards have leveraged high-quality human demonstrations for the task, but these can be costly, time-consuming, or even impossible to obtain. In this paper, we propose a novel and effective approach that does not require human demonstrations. We observe that every robotic manipulation task can be seen as involving a locomotion task from the perspective of the object being manipulated, i.e., the object could learn how to reach a target state on its own. To exploit this idea, we introduce a framework whereby an object locomotion policy is first obtained using a realistic physics simulator. This policy is then used to generate auxiliary rewards, called simulated locomotion demonstration rewards (SLDRs), which enable us to learn the robot manipulation policy. The proposed approach has been evaluated on 13 tasks of increasing complexity and achieves higher success rates and faster learning compared to alternative algorithms. SLDRs are especially beneficial for tasks such as multi-object stacking and non-rigid object manipulation.
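As a rough sketch of the auxiliary-reward idea (our own illustrative form and names; the paper's exact SLDR definition may differ), the manipulation policy can be rewarded for keeping the object close to the trajectory the pretrained object-locomotion policy would produce:

```python
import numpy as np

def sldr_reward(object_traj, locomotion_traj, sigma=0.1):
    """Illustrative simulated-locomotion-demonstration reward: the closer the
    object's trajectory (under the robot's actions) stays to the trajectory
    produced by the pretrained object-locomotion policy, the higher the
    reward, with 1.0 for a perfect match and ~0 for a distant trajectory."""
    diffs = np.asarray(object_traj) - np.asarray(locomotion_traj)
    dists = np.linalg.norm(diffs, axis=-1)
    return float(np.mean(np.exp(-dists ** 2 / (2 * sigma ** 2))))
```

This dense auxiliary signal is what lets the robot policy learn even though the task reward itself is sparse.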
Cieran: Designing Sequential Colormaps via In-Situ Active Preference Learning
Quality colormaps can help communicate important data patterns. However,
finding an aesthetically pleasing colormap that looks "just right" for a given
scenario requires significant design and technical expertise. We introduce
Cieran, a tool that allows any data analyst to rapidly find quality colormaps
while designing charts within Jupyter Notebooks. Our system employs an active
preference learning paradigm to rank expert-designed colormaps and create new
ones from pairwise comparisons, allowing analysts who are novices in color
design to tailor colormaps to their data context. We accomplish this by
treating colormap design as a path planning problem through the CIELAB
colorspace with a context-specific reward model. In an evaluation with twelve
scientists, we found that Cieran effectively modeled user preferences to rank
colormaps and leveraged this model to create new quality designs. Our work
shows the potential of active preference learning for supporting efficient
visualization design optimization.
Comment: CHI 2024, 12 pages, 9 figures
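A generic ingredient of such active preference learning is a Bradley-Terry-style update from a pairwise comparison. The sketch below is our own, using a plain linear reward model over hypothetical colormap features; Cieran's actual reward model operates over paths through CIELAB space:

```python
import numpy as np

def preference_update(w, feat_a, feat_b, preferred_a, lr=0.1):
    """One gradient step of Bradley-Terry preference learning:
    P(A preferred over B) = sigmoid(w . (f_A - f_B)), and the weights move
    along the log-likelihood gradient of the observed comparison."""
    diff = np.asarray(feat_a) - np.asarray(feat_b)
    p_a = 1.0 / (1.0 + np.exp(-w @ diff))          # model's current belief
    grad = ((1.0 if preferred_a else 0.0) - p_a) * diff
    return w + lr * grad
```

Repeating such updates over a handful of comparisons raises the modeled reward of designs resembling the preferred ones, which is what allows new colormaps to be ranked and generated from few judgments.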
A Survey on Causal Reinforcement Learning
While Reinforcement Learning (RL) achieves tremendous success in sequential
decision-making problems of many domains, it still faces key challenges of data
inefficiency and the lack of interpretability. Interestingly, many researchers
have leveraged insights from the causality literature recently, bringing forth
flourishing works to unify the merits of causality and address well the
challenges from RL. As such, it is of great necessity and significance to
collate these Causal Reinforcement Learning (CRL) works, offer a review of CRL
methods, and investigate the potential functionality from causality toward RL.
In particular, we divide existing CRL approaches into two categories according
to whether their causality-based information is given in advance or not. We
further analyze each category in terms of the formalization of different
models, ranging from the Markov Decision Process (MDP), Partially Observed
Markov Decision Process (POMDP), Multi-Arm Bandits (MAB), and Dynamic Treatment
Regime (DTR). Moreover, we summarize the evaluation matrices and open sources
while we discuss emerging applications, along with promising prospects for the
future development of CRL.
Comment: 29 pages, 20 figures
Using Multi-Relational Embeddings as Knowledge Graph Representations for Robotics Applications
User demonstrations of robot tasks in everyday environments, such as households, can be brittle due in part to the dynamic, diverse, and complex properties of those environments. Humans can find solutions in ambiguous or unfamiliar situations by using a wealth of common-sense knowledge about their domains to make informed generalizations, for example, inferring likely locations for food in a novel household. Prior work has shown that robots can benefit from reasoning about this type of semantic knowledge, which can be modeled as a knowledge graph of interrelated facts that define whether a relationship exists between two entities. Semantic reasoning about domain knowledge using knowledge graph representations has improved the robustness and usability of end-user robots by enabling more fault-tolerant task execution. Knowledge graph representations define the underlying representation of facts, how facts are organized, and implement semantic reasoning by defining the possible computations over facts (e.g., association, fact prediction).
This thesis examines the use of multi-relational embeddings as knowledge graph representations within the context of robust task execution and develops methods to explain the inferences of, and sequentially train, multi-relational embeddings. This thesis contributes: (i) a survey of knowledge graph representations that model semantic domain knowledge in robotics, (ii) the development and evaluation of our knowledge graph representation based on multi-relational embeddings, (iii) the integration of our knowledge graph representation into a robot architecture to improve robust task execution, (iv) the development and evaluation of methods to sequentially update multi-relational embeddings, and (v) the development and evaluation of an inference reconciliation framework for multi-relational embeddings. (Ph.D. dissertation)
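As one concrete member of the multi-relational embedding family, the classic TransE model scores a fact (head, relation, tail) by how well head + relation ≈ tail holds in embedding space. This is an illustrative stand-in with invented entity vectors; the thesis's specific embedding model and data are not given here:

```python
import numpy as np

def transe_score(h, r, t):
    """TransE score for a knowledge-graph triple: plausible facts satisfy
    h + r ≈ t, so higher (less negative) scores are more plausible."""
    return -float(np.linalg.norm(h + r - t))

# Hypothetical embeddings mirroring the 'likely food locations' example.
emb = {"food": np.array([0.2, 0.7]),
       "kitchen": np.array([0.5, 1.0]),
       "garage": np.array([2.0, -1.0])}
rel_at_location = np.array([0.3, 0.3])

# Ranking candidate tails for (food, atLocation, ?) is the fact-prediction
# computation a knowledge graph representation exposes.
scores = {t: transe_score(emb["food"], rel_at_location, emb[t])
          for t in ("kitchen", "garage")}
```

Here `scores["kitchen"]` exceeds `scores["garage"]`, so the embedding predicts the kitchen as the more plausible location, illustrating the kind of informed generalization described above.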