
    Vision-Based Multi-Task Manipulation for Inexpensive Robots Using End-To-End Learning from Demonstration

    Full text link
    We propose a technique for multi-task learning from demonstration that trains the controller of a low-cost robotic arm to accomplish several complex picking and placing tasks, as well as non-prehensile manipulation. The controller is a recurrent neural network that takes raw images as input and generates robot arm trajectories, with its parameters shared across tasks. The controller also combines VAE-GAN-based reconstruction with autoregressive multimodal action prediction. Our results demonstrate that it is possible to learn complex manipulation tasks, such as picking up a towel, wiping an object, and depositing the towel to its previous position, entirely from raw images with direct behavior cloning. We show that weight sharing and reconstruction-based regularization substantially improve generalization and robustness, and that training on multiple tasks simultaneously increases the success rate on all tasks.
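    As a rough illustration of the training setup described above, the sketch below shows a recurrent behavior-cloning policy with parameters shared across tasks and a reconstruction term used as a regularizer. It is a minimal PyTorch-style sketch under our own assumptions: the class and argument names are hypothetical, and the paper's VAE-GAN reconstruction and autoregressive multimodal action head are reduced here to a plain feature decoder and an MSE action loss.

        import torch
        import torch.nn as nn

        class MultiTaskBCPolicy(nn.Module):
            """Recurrent visuomotor policy shared across tasks (hypothetical sketch)."""
            def __init__(self, img_feat_dim=128, num_tasks=5, action_dim=7, hidden=256):
                super().__init__()
                # Small CNN encoder for raw images (e.g. 3x64x64 frames).
                self.encoder = nn.Sequential(
                    nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
                    nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                    nn.Linear(64, img_feat_dim), nn.ReLU(),
                )
                # One GRU and one action head serve every task; a task one-hot is appended.
                self.rnn = nn.GRU(img_feat_dim + num_tasks, hidden, batch_first=True)
                self.action_head = nn.Linear(hidden, action_dim)
                # Auxiliary decoder reconstructing the image feature (regularizer).
                self.decoder = nn.Linear(hidden, img_feat_dim)

            def forward(self, images, task_onehot):
                # images: (B, T, 3, H, W); task_onehot: (B, num_tasks)
                B, T = images.shape[:2]
                feats = self.encoder(images.flatten(0, 1)).view(B, T, -1)
                task = task_onehot.unsqueeze(1).expand(-1, T, -1)
                h, _ = self.rnn(torch.cat([feats, task], dim=-1))
                return self.action_head(h), self.decoder(h), feats

        def bc_loss(policy, images, task_onehot, expert_actions, recon_weight=0.1):
            """Behavior-cloning loss plus reconstruction-based regularization."""
            actions, recon, feats = policy(images, task_onehot)
            action_loss = ((actions - expert_actions) ** 2).mean()
            recon_loss = ((recon - feats.detach()) ** 2).mean()
            return action_loss + recon_weight * recon_loss

        # Example shapes: images (2, 10, 3, 64, 64), task_onehot (2, 5), expert_actions (2, 10, 7).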

    Latent Plans for Task-Agnostic Offline Reinforcement Learning

    Full text link
    Everyday long-horizon tasks that comprise a sequence of implicit subtasks still pose a major challenge in offline robot control. While a number of prior methods aimed to address this setting with variants of imitation and offline reinforcement learning, the learned behavior is typically narrow and often struggles to reach configurable long-horizon goals. As both paradigms have complementary strengths and weaknesses, we propose a novel hierarchical approach that combines the strengths of both methods to learn task-agnostic long-horizon policies from high-dimensional camera observations. Concretely, we combine a low-level policy that learns latent skills via imitation learning with a high-level policy, learned via offline reinforcement learning, that chains the latent behavior priors. Experiments in various simulated and real robot control tasks show that our formulation produces previously unseen combinations of skills to reach temporally extended goals by "stitching" together latent skills through goal chaining, with an order-of-magnitude improvement in performance over state-of-the-art baselines. We even learn one multi-task visuomotor policy for 25 distinct manipulation tasks in the real world that outperforms both imitation learning and offline reinforcement learning techniques. Comment: CoRL 2022. Project website: http://tacorl.cs.uni-freiburg.de
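    To make the two-level structure concrete, here is a minimal NumPy sketch of inference-time skill chaining under our own assumptions: high_level and low_level are untrained stand-ins for the paper's goal-conditioned offline-RL policy and latent-plan decoder, and the re-planning interval is arbitrary.

        import numpy as np

        rng = np.random.default_rng(0)
        OBS_DIM, GOAL_DIM, LATENT_DIM, ACTION_DIM = 32, 32, 8, 7

        # Untrained stand-in weights; in the paper both levels are learned networks.
        W_HIGH = rng.standard_normal((LATENT_DIM, OBS_DIM + GOAL_DIM)) * 0.01
        W_LOW = rng.standard_normal((ACTION_DIM, OBS_DIM + LATENT_DIM)) * 0.01

        def high_level(obs, goal):
            """Goal-conditioned policy over latent skills (offline RL in the paper)."""
            return np.tanh(W_HIGH @ np.concatenate([obs, goal]))

        def low_level(obs, z):
            """Latent-skill decoder producing one action (imitation learning in the paper)."""
            return np.tanh(W_LOW @ np.concatenate([obs, z]))

        def rollout(env_step, obs, goal, horizon=150, replan_every=15):
            """Chain skills toward a long-horizon goal by re-querying the high level."""
            for t in range(horizon):
                if t % replan_every == 0:
                    z = high_level(obs, goal)      # select the next latent skill
                obs = env_step(low_level(obs, z))  # decode and execute it
            return obs

        # Toy usage with a dummy environment that returns a random next observation.
        final_obs = rollout(lambda action: rng.standard_normal(OBS_DIM),
                            rng.standard_normal(OBS_DIM), rng.standard_normal(GOAL_DIM))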

    Probabilistic movement primitives for coordination of multiple human–robot collaborative tasks

    Get PDF
    This paper proposes an interaction learning method for collaborative and assistive robots based on movement primitives. The method allows for both action recognition and human–robot movement coordination. It uses imitation learning to construct a mixture model of human–robot interaction primitives. This probabilistic model allows the assistive trajectory of the robot to be inferred from human observations. The method is scalable in relation to the number of tasks and can learn nonlinear correlations between the trajectories that describe the human–robot interaction. We evaluated the method experimentally with a lightweight robot arm in a variety of assistive scenarios, including the coordinated handover of a bottle to a human, and the collaborative assembly of a toolbox. Potential applications of the method are personal caregiver robots, control of intelligent prosthetic devices, and robot coworkers in factories.
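    The core inference step, conditioning the robot's movement-primitive weights on the observed human motion, comes down to Gaussian conditioning on the jointly learned weight distribution. Below is a minimal NumPy sketch under our own assumptions: a single primitive rather than the paper's mixture, no observation noise, and hypothetical variable names.

        import numpy as np

        def condition_interaction_primitive(mu, cov, w_h, dim_h):
            """Infer robot ProMP weights from observed human weights by Gaussian conditioning.

            mu and cov describe the joint distribution over stacked [human; robot]
            primitive weights learned from paired demonstrations; w_h are weights
            fitted to the currently observed human motion.
            """
            mu_h, mu_r = mu[:dim_h], mu[dim_h:]
            S_hh = cov[:dim_h, :dim_h]
            S_rh = cov[dim_h:, :dim_h]
            S_rr = cov[dim_h:, dim_h:]
            K = S_rh @ np.linalg.inv(S_hh)
            mu_r_given_h = mu_r + K @ (w_h - mu_h)      # conditional mean of robot weights
            S_r_given_h = S_rr - K @ S_rh.T             # conditional covariance
            return mu_r_given_h, S_r_given_h

        # Toy usage with random statistics standing in for learned ones.
        rng = np.random.default_rng(0)
        d_h, d_r = 10, 10
        A = rng.standard_normal((d_h + d_r, d_h + d_r))
        cov = A @ A.T + np.eye(d_h + d_r)               # a valid joint covariance
        mu = rng.standard_normal(d_h + d_r)
        mu_r, S_r = condition_interaction_primitive(mu, cov, rng.standard_normal(d_h), d_h)

    In the full method, one such joint model is kept per task within the mixture, and the most likely task is first selected from the human observation before conditioning.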

    Behavior Retrieval: Few-Shot Imitation Learning by Querying Unlabeled Datasets

    Full text link
    Enabling robots to learn novel visuomotor skills in a data-efficient manner remains an unsolved problem with myriad challenges. A popular paradigm for tackling this problem is to leverage large unlabeled datasets that contain many behaviors and then adapt a policy to a specific task using a small amount of task-specific human supervision (i.e., interventions or demonstrations). However, how best to leverage the narrow task-specific supervision and balance it with offline data remains an open question. Our key insight in this work is that task-specific data not only provides new data for an agent to train on but can also inform the type of prior data the agent should use for learning. Concretely, we propose a simple approach that uses a small amount of downstream expert data to selectively query relevant behaviors from an offline, unlabeled dataset (including many sub-optimal behaviors). The agent is then jointly trained on the expert and queried data. We observe that our method learns to query only the transitions relevant to the task, filtering out sub-optimal or task-irrelevant data. By doing so, it is able to learn more effectively from the mix of task-specific and offline data than naively mixing the data or using the task-specific data alone. Furthermore, we find that our simple querying approach outperforms more complex goal-conditioned methods by 20% across simulated and real robotic manipulation tasks from images. See https://sites.google.com/view/behaviorretrieval for videos and code.
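    The retrieval step amounts to scoring each offline transition by its similarity to the small expert set and keeping the best-matching fraction. The NumPy sketch below is our own simplification: it assumes embeddings from some state-action encoder are already available (the paper learns its own), and the cosine-similarity score and keep fraction are illustrative choices.

        import numpy as np

        def retrieve_relevant(offline_emb, expert_emb, keep_frac=0.1):
            """Score each offline sample by its best cosine similarity to the expert set
            and return the indices of the top keep_frac fraction."""
            def normalize(x):
                return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

            sims = normalize(offline_emb) @ normalize(expert_emb).T   # (N_offline, N_expert)
            score = sims.max(axis=1)                                  # best match per sample
            k = max(1, int(keep_frac * len(offline_emb)))
            return np.argsort(score)[-k:]

        # Toy usage: the policy would then be trained on expert data plus the retrieved subset.
        rng = np.random.default_rng(0)
        offline_emb = rng.standard_normal((10_000, 64))   # embeddings of unlabeled transitions
        expert_emb = rng.standard_normal((50, 64))        # embeddings of task-specific expert data
        keep_idx = retrieve_relevant(offline_emb, expert_emb)
        # train(policy, concat(expert_data, offline_data[keep_idx]))  # hypothetical training call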

    Bayesian Disturbance Injection: Robust Imitation Learning of Flexible Policies

    Full text link
    Scenarios requiring humans to choose from multiple seemingly optimal actions are commonplace; however, standard imitation learning often fails to capture this behavior. Instead, an over-reliance on replicating expert actions induces inflexible and unstable policies, leading to poor generalizability in application. To address the problem, this paper presents the first imitation learning framework that incorporates Bayesian variational inference for learning flexible non-parametric multi-action policies, while simultaneously robustifying the policies against sources of error by introducing and optimizing disturbances to create a richer demonstration dataset. This combined approach forces the policy to adapt to challenging situations, enabling stable multi-action policies to be learned efficiently. The effectiveness of the proposed method is evaluated through simulations and real-robot experiments on a table-sweep task using the UR3 6-DOF robotic arm. Results show that, through improved flexibility and robustness, learning performance and control safety are better than those of comparison methods. Comment: 7 pages, accepted by the 2021 International Conference on Robotics and Automation (ICRA 2021).
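    The data-collection side of the idea can be sketched very simply: perturb the executed action while recording the expert's intended action from the resulting states, so that the dataset covers recovery behavior. The NumPy sketch below is our own simplification; the paper infers the disturbance level jointly with the policy via Bayesian variational inference, whereas here the disturbance scale sigma is fixed and the expert is assumed to be queryable online.

        import numpy as np

        def collect_with_disturbance(env_reset, env_step, expert_policy,
                                     sigma=0.05, episodes=20, horizon=100, seed=0):
            """Collect demonstrations while perturbing the executed action.

            The expert's intended action is recorded as the label for the current
            (possibly already perturbed) state, so the dataset covers recovery
            behavior from off-nominal states.
            """
            rng = np.random.default_rng(seed)
            data = []
            for _ in range(episodes):
                obs = env_reset()
                for _ in range(horizon):
                    action = np.asarray(expert_policy(obs), dtype=float)
                    data.append((np.copy(obs), action.copy()))         # clean label
                    noisy = action + sigma * rng.standard_normal(action.shape)
                    obs = env_step(noisy)                              # execute perturbed action
            return data

        # Toy usage with a dummy environment and expert.
        obs_dim, act_dim = 6, 3
        dummy_reset = lambda: np.zeros(obs_dim)
        dummy_step = lambda a: np.random.default_rng().standard_normal(obs_dim)
        dummy_expert = lambda obs: -0.1 * obs[:act_dim]
        dataset = collect_with_disturbance(dummy_reset, dummy_step, dummy_expert)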

    Bayesian Disturbance Injection: Robust Imitation Learning of Flexible Policies for Robot Manipulation

    Full text link
    Humans demonstrate a variety of interesting behavioral characteristics when performing tasks, such as selecting between seemingly equivalent optimal actions, performing recovery actions when deviating from the optimal trajectory, or moderating actions in response to sensed risks. However, imitation learning, which attempts to teach robots to perform these same tasks from observations of human demonstrations, often fails to capture such behavior. Specifically, commonly used learning algorithms embody inherent contradictions between their learning assumptions (e.g., a single optimal action) and actual human behavior (e.g., multiple optimal actions), thereby limiting robot generalizability, applicability, and demonstration feasibility. To address this, this paper proposes designing imitation learning algorithms around actual demonstrator behavioral characteristics, so that these characteristics are captured and exploited. It presents the first imitation learning framework, Bayesian Disturbance Injection (BDI), that typifies human behavioral characteristics by incorporating model flexibility, robustification, and risk sensitivity. Bayesian inference is used to learn flexible non-parametric multi-action policies, while simultaneously robustifying the policies by injecting risk-sensitive disturbances to induce human recovery actions and ensure demonstration feasibility. Our method is evaluated through risk-sensitive simulations and real-robot experiments (a table-sweep task, a shaft-reach task, and a shaft-insertion task) using the UR5e 6-DOF robotic arm to demonstrate the improved characterisation of behavior. Results show significant improvements in task performance through improved flexibility, robustness, and demonstration feasibility. Comment: 69 pages, 9 figures, accepted by Elsevier Neural Networks - Journal.
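    The "multi-action" aspect, keeping several optimal action modes instead of averaging them away, can be illustrated with a conditional Gaussian mixture policy. The NumPy sketch below is our own simplification: the paper learns a non-parametric mixture with variational Bayes, whereas here the mixture parameters are hand-set and only the conditioning and sampling step is shown.

        import numpy as np

        class GMMPolicy:
            """Multi-action policy as a mixture of Gaussians over (state, action) pairs."""
            def __init__(self, weights, means, covs, state_dim):
                self.w, self.means, self.covs, self.ds = weights, means, covs, state_dim

            def _log_gauss(self, x, mu, cov):
                d = x - mu
                _, logdet = np.linalg.slogdet(cov)
                return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + len(x) * np.log(2 * np.pi))

            def sample_action(self, state, rng):
                ds = self.ds
                # Responsibility of each component given the state marginal.
                logp = np.array([np.log(w) + self._log_gauss(state, m[:ds], C[:ds, :ds])
                                 for w, m, C in zip(self.w, self.means, self.covs)])
                resp = np.exp(logp - logp.max())
                k = rng.choice(len(self.w), p=resp / resp.sum())
                # Condition component k on the state to get a Gaussian over actions.
                m, C = self.means[k], self.covs[k]
                K = C[ds:, :ds] @ np.linalg.inv(C[:ds, :ds])
                mu_a = m[ds:] + K @ (state - m[:ds])
                cov_a = C[ds:, ds:] - K @ C[:ds, ds:]
                return rng.multivariate_normal(mu_a, cov_a)

        # Toy usage: two distinct action modes for the same state region.
        rng = np.random.default_rng(0)
        means = [np.array([0.0, 1.0]), np.array([0.0, -1.0])]   # (state, action) means
        covs = [np.eye(2) * 0.1, np.eye(2) * 0.1]
        policy = GMMPolicy([0.5, 0.5], means, covs, state_dim=1)
        print(policy.sample_action(np.array([0.0]), rng))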

    Efficient Robot Learning for Optimal Decision Making in Complex and Uncertain Environments

    Get PDF
    Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2021. 2. Songhwai Oh. The problem of sequential decision making in an uncertain and complex environment is a long-standing challenge in robotics. In this thesis, we focus on learning a policy function of a robotic system for sequential decision making, a setting we refer to as a robot learning framework. In particular, we are interested in reducing the sample complexity of robot learning, and we develop three sample-efficient robot learning frameworks: maximum entropy reinforcement learning, perturbation-based exploration, and learning from demonstrations with mixed qualities.
    For maximum entropy reinforcement learning, we employ generalized Tsallis entropy regularization as an efficient exploration method. Tsallis entropy generalizes the Shannon-Gibbs entropy by introducing an entropic index; by changing the entropic index, we can control the sparsity and multi-modality of the policy. Based on this fact, we first propose a sparse Markov decision process (sparse MDP), which induces a sparse and multi-modal optimal policy distribution. In this MDP, the sparse entropy, a special case of Tsallis entropy, is employed as a policy regularizer. We analyze the optimality condition of a sparse MDP, propose dynamic programming methods for it, and prove their convergence and optimality. We also show that the performance error of a sparse MDP, which is caused by the introduced regularization term, has a constant bound, whereas the error of a soft MDP grows logarithmically with the number of actions. Furthermore, we generalize sparse MDPs to a new class of entropy-regularized MDPs, referred to as Tsallis MDPs, and analyze the different types of optimal policies, with interesting properties related to the stochasticity of the optimal policy, obtained by controlling the entropic index.
    We also develop perturbation-based exploration methods to handle heavy-tailed noise. In many robot learning problems, the learning signal is corrupted by noise, such as sub-Gaussian or heavy-tailed noise. While most exploration strategies have been analyzed under a sub-Gaussian noise assumption, few methods exist for handling heavy-tailed rewards. To overcome heavy-tailed noise, we consider stochastic multi-armed bandits with heavy-tailed rewards. First, we propose a novel robust estimator that, unlike existing robust estimators, does not require prior information about the noise distribution, and we show that its error probability decays exponentially fast. Using this estimator, we propose a perturbation-based exploration strategy and develop a generalized regret-analysis scheme that provides upper and lower regret bounds by revealing the relationship between the regret and the cumulative distribution function of the perturbation. From the proposed analysis scheme, we obtain gap-dependent and gap-independent upper and lower regret bounds for various perturbations, and we find the optimal hyperparameters for each perturbation, which achieve the minimax-optimal regret bound with respect to the total number of rounds.
    For learning from demonstrations with mixed qualities, we develop a novel inverse reinforcement learning framework using leveraged Gaussian processes (LGP), which can handle negative demonstrations. In an LGP, the correlation between two Gaussian processes is captured by a leveraged kernel function; using this property, the proposed inverse reinforcement learning algorithm can learn from both positive and negative demonstrations. While most existing inverse reinforcement learning (IRL) methods suffer from a lack of information near low-reward regions, the proposed method alleviates this issue by incorporating negative demonstrations. To formulate negative demonstrations mathematically, we introduce a novel generative model that can generate both positive and negative demonstrations using a parameter called proficiency. Moreover, since we represent the reward function with a leveraged Gaussian process, which can model a nonlinear function, the proposed method can effectively estimate the structure of a nonlinear reward function.
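    As a small illustration of the sparse-MDP idea described in this abstract, the NumPy sketch below computes the sparsemax projection of a vector of Q-values, which yields the kind of sparse, multi-modal action distribution that the sparse Tsallis entropy regularizer induces. The regularization temperature is folded into the Q-values here, and the function name is ours, not the thesis code.

        import numpy as np

        def sparsemax(q):
            """Project Q-values onto the probability simplex (sparsemax).

            Near-optimal actions share the probability mass; clearly suboptimal
            actions receive exactly zero, giving a sparse, multi-modal policy.
            """
            z = np.sort(q)[::-1]                     # Q-values sorted in descending order
            cumsum = np.cumsum(z)
            k = np.arange(1, len(q) + 1)
            support = 1 + k * z > cumsum             # entries that stay in the support
            k_z = k[support][-1]                     # support size
            tau = (cumsum[support][-1] - 1) / k_z    # threshold
            return np.maximum(q - tau, 0.0)

        # Only the two near-optimal actions receive probability (0.55 and 0.45).
        print(sparsemax(np.array([2.0, 1.9, 0.5, -1.0])))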

    Feature learning for multi-task inverse reinforcement learning

    Get PDF
    In this paper we study the question of lifelong learning of behaviors from human demonstrations by an intelligent system. One approach is to model the observed demonstrations by a stationary policy. Inverse reinforcement learning, on the other hand, searches for a reward function that makes the observed policy close to optimal in the corresponding Markov decision process. This approach provides a model of the task solved by the demonstrator and has been shown to lead to better generalization in unknown contexts. However, both approaches focus on learning a single task from the expert demonstration. In this paper we propose a feature learning approach for inverse reinforcement learning in which several different tasks are demonstrated, but in which each task is modeled as a mixture of several simpler, primitive tasks. We present an algorithm based on alternating gradient descent that simultaneously learns a dictionary of primitive tasks (in the form of reward functions) and their combination into an approximation of the task underlying the observed behavior. We illustrate how this approach enables efficient reuse of knowledge from previous demonstrations. Namely, knowledge of tasks that were previously observed by the learner is used to improve the learning of a new composite behavior, thus achieving transfer of knowledge between tasks.
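    The structure of the alternation can be seen in toy form below: per-task reward samples are factored into task mixture weights and a dictionary of primitive rewards by alternating closed-form ridge updates. This NumPy sketch is our own simplification; the paper alternates gradient steps inside an IRL objective rather than regressing onto a given reward matrix.

        import numpy as np

        def alternating_dictionary_learning(R, n_primitives=4, iters=50, lam=1e-2, seed=0):
            """Factor per-task reward samples R (tasks x states) into task weights
            W (tasks x primitives) and a dictionary of primitive rewards D
            (primitives x states) by alternating ridge-regression updates."""
            rng = np.random.default_rng(seed)
            n_tasks, n_states = R.shape
            W = rng.standard_normal((n_tasks, n_primitives))
            D = rng.standard_normal((n_primitives, n_states))
            I = lam * np.eye(n_primitives)
            for _ in range(iters):
                # Update weights with the dictionary fixed, then the dictionary with weights fixed.
                W = R @ D.T @ np.linalg.inv(D @ D.T + I)
                D = np.linalg.inv(W.T @ W + I) @ W.T @ R
            return W, D

        # A new task's reward is then approximated as a combination of learned primitives.
        R = np.random.default_rng(1).standard_normal((6, 40))
        W, D = alternating_dictionary_learning(R)
        print(np.linalg.norm(R - W @ D) / np.linalg.norm(R))   # relative error of the rank-4 fit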