10 research outputs found

    Bayesian nonparametric reward learning from demonstration

    Thesis: Ph.D., Massachusetts Institute of Technology, Department of Aeronautics and Astronautics, 2013. Cataloged from PDF version of thesis. Includes bibliographical references (pages 123-132).
    Learning from demonstration provides an attractive solution to the problem of teaching autonomous systems how to perform complex tasks. Demonstration opens autonomy development to non-experts and is an intuitive means of communication for humans, who naturally use demonstration to teach others. This thesis focuses on a specific form of learning from demonstration, namely inverse reinforcement learning, whereby the reward of the demonstrator is inferred. Formally, inverse reinforcement learning (IRL) is the task of learning the reward function of a Markov Decision Process (MDP) given knowledge of the transition function and a set of observed demonstrations. While reward learning is a promising method of inferring a rich and transferable representation of the demonstrator's intents, current algorithms suffer from intractability and inefficiency in large, real-world domains. This thesis presents a reward learning framework that infers multiple reward functions from a single, unsegmented demonstration, provides several key approximations which enable scalability to large real-world domains, and generalizes to fully continuous demonstration domains without the need for discretization of the state space, all of which are not handled by previous methods. In the thesis, modifications are proposed to an existing Bayesian IRL algorithm to improve its efficiency and tractability in situations where the state space is large and the demonstrations span only a small portion of it. A modified algorithm is presented and simulation results show substantially faster convergence while maintaining the solution quality of the original method. Even with the proposed efficiency improvements, a key limitation of Bayesian IRL (and most current IRL methods) is the assumption that the demonstrator is maximizing a single reward function. This presents problems when dealing with unsegmented demonstrations containing multiple distinct tasks, common in robot learning from demonstration (e.g. in large tasks that may require multiple subtasks to complete). A key contribution of this thesis is the development of a method that learns multiple reward functions from a single demonstration. The proposed method, termed Bayesian nonparametric inverse reinforcement learning (BNIRL), uses a Bayesian nonparametric mixture model to automatically partition the data and find a set of simple reward functions corresponding to each partition. The simple rewards are interpreted intuitively as subgoals, which can be used to predict actions or analyze which states are important to the demonstrator. Simulation results demonstrate the ability of BNIRL to handle cyclic tasks that break existing algorithms due to the existence of multiple subgoal rewards in the demonstration. The BNIRL algorithm is easily parallelized, and several approximations to the demonstrator likelihood function are offered to further improve computational tractability in large domains. Since BNIRL is only applicable to discrete domains, the Bayesian nonparametric reward learning framework is extended to general continuous demonstration domains using Gaussian process reward representations.
The resulting algorithm, termed Gaussian process subgoal reward learning (GPSRL), is the only learning from demonstration method that is able to learn multiple reward functions from unsegmented demonstration in general continuous domains. GPSRL does not require discretization of the continuous state space and focuses computation efficiently around the demonstration itself. Learned subgoal rewards are cast as Markov decision process options to enable execution of the learned behaviors by the robotic system and provide a principled basis for future learning and skill refinement. Experiments conducted in the MIT RAVEN indoor test facility demonstrate the ability of both BNIRL and GPSRL to learn challenging maneuvers from demonstration on a quadrotor helicopter and a remote-controlled car. By Bernard J. Michini, Ph.D.
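    To make the partitioning idea concrete, the sketch below implements a toy version of the CRP-style subgoal clustering that BNIRL is built on: each demonstrated state-action pair is assigned to a partition whose reward is simply "reach this subgoal state", and assignments are resampled by Gibbs sweeps. The 1-D grid world, the distance-based action likelihood, and all variable names are illustrative assumptions, not the thesis's implementation.

```python
"""Minimal sketch of the CRP-based subgoal partitioning idea behind BNIRL.

Illustrative only: a 1-D grid world, subgoal rewards of the form
"+1 at a goal cell", and a distance-based action likelihood standing in for
the thesis's MDP value computation. All names are hypothetical.
"""
import numpy as np

rng = np.random.default_rng(0)

ALPHA_CRP = 1.0     # concentration parameter of the Chinese restaurant process
BETA = 5.0          # confidence of the (approximately optimal) demonstrator

# A toy unsegmented demonstration: walk right to cell 15, then back to cell 3.
states = list(range(0, 15)) + list(range(15, 3, -1))
actions = [+1] * 15 + [-1] * 12          # +1 = move right, -1 = move left

def action_loglik(s, a, subgoal):
    """Log-likelihood of action a in state s if the demonstrator pursues subgoal."""
    # Action quality: how much the action reduces the distance to the subgoal.
    q = abs(subgoal - s) - abs(subgoal - (s + a))
    # Softmax over the two available actions.
    qs = np.array([abs(subgoal - s) - abs(subgoal - (s + d)) for d in (-1, +1)])
    return BETA * q - np.log(np.exp(BETA * qs).sum())

def gibbs_sweep(z, subgoals):
    """One Gibbs sweep over partition assignments z (one per demo point)."""
    for i, (s, a) in enumerate(zip(states, actions)):
        counts = {k: (z == k).sum() - (z[i] == k) for k in set(z)}
        cand, logp = [], []
        for k, c in counts.items():
            if c == 0:
                continue
            cand.append(k)
            logp.append(np.log(c) + action_loglik(s, a, subgoals[k]))
        # Candidate new partition: draw its subgoal from the demonstrated states.
        new_k = max(subgoals) + 1
        new_goal = rng.choice(states)
        cand.append(new_k)
        logp.append(np.log(ALPHA_CRP) + action_loglik(s, a, new_goal))
        p = np.exp(np.array(logp) - max(logp))
        choice = rng.choice(len(cand), p=p / p.sum())
        z[i] = cand[choice]
        if cand[choice] == new_k:
            subgoals[new_k] = new_goal
    return z, subgoals

z = np.zeros(len(states), dtype=int)
subgoals = {0: states[-1]}
for _ in range(50):
    z, subgoals = gibbs_sweep(z, subgoals)

print("partition sizes:", {int(k): int((z == k).sum()) for k in set(z)})
print("active subgoals:", {int(k): int(subgoals[k]) for k in set(z)})
```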

    Shared Control Policies and Task Learning for Hydraulic Earth-Moving Machinery

    This thesis develops a shared control design framework for improving operator efficiency and performance on hydraulic excavation tasks. The framework is based on blended shared control (BSC), a technique whereby the operator’s command input is continually augmented by an assistive controller. Designing a BSC control scheme is subdivided here into four key components. Task learning utilizes nonparametric inverse reinforcement learning to identify the underlying goal structure of a task as a sequence of subgoals directly from the demonstration data of an experienced operator. These subgoals may be distinct points in the actuator space or distributions over the space, from which the operator draws a subgoal location during the task. The remaining three steps are executed on-line during each update of the BSC controller. In real time, the subgoal prediction step uses the subgoal decomposition from the learning process to predict the current subgoal of the operator. Novel deterministic and probabilistic prediction methods are developed and evaluated for their ease of implementation and performance against manually labeled trial data. The control generation component involves computing polynomial trajectories to the predicted subgoal location or mean of the subgoal distribution, and computing a control input which tracks those trajectories. Finally, the blending law synthesizes both inputs through a weighted averaging of the human and control input, using a blending parameter which can be static or dynamic. In the latter case, mapping probabilistic quantities such as the maximum a posteriori probability or statistical entropy to the value of the dynamic blending parameter may yield more intelligent control assistance, scaling the intervention according to the confidence of the prediction. A reduced-scale (1/12) fully hydraulic excavator model was instrumented for BSC experimentation, equipped with absolute position feedback of each hydraulic actuator. Experiments were conducted using a standard operator control interface and a common earthmoving task: loading a truck from a pile. Under BSC, operators experienced an 18% improvement in mean digging efficiency, defined as mass of material moved per cycle time. Effects of BSC vary with regard to pure cycle time, although most operators experienced a reduced mean cycle time
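    The blending law described above lends itself to a very small sketch. The following is a minimal, illustrative take on dynamic blending, where the operator's and controller's commands are averaged with a weight driven by the maximum a posteriori probability of the subgoal prediction; the linear confidence mapping and all names are assumptions, not the thesis's exact law.

```python
import numpy as np

def blend_command(u_human, u_assist, subgoal_posterior, alpha_max=0.6):
    """Blended shared control: weighted average of operator and assistive inputs.

    The blending parameter scales with the confidence (maximum a posteriori
    probability) of the current subgoal prediction. The linear mapping and
    the cap alpha_max are illustrative assumptions.
    """
    confidence = float(np.max(subgoal_posterior))   # MAP probability in [0, 1]
    alpha = alpha_max * confidence                  # intervene more when confident
    return (1.0 - alpha) * np.asarray(u_human) + alpha * np.asarray(u_assist)

# Example: joystick commands for two hydraulic actuators (boom, bucket).
u_h = [0.4, -0.2]                    # operator input
u_c = [0.7, -0.5]                    # controller input tracking a subgoal trajectory
posterior = [0.1, 0.8, 0.1]          # belief over three learned subgoals
print(blend_command(u_h, u_c, posterior))
```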

    CODA Algorithm: An Immune Algorithm for Reinforcement Learning Tasks

    This document presents the design of an algorithm built on three foundations: reinforcement learning, learning from demonstration and, most importantly, Artificial Immune Systems. The main advantage of this algorithm, named CODA (Cognition from Data), is that it can learn from limited data samples: given a single example, the algorithm creates its own knowledge. The algorithm imitates the clonal procedure of the Natural Immune System to obtain a repertoire of antibodies from a single antigen. It also uses self-organised memory to reduce searching time in the whole action-state space by searching within specific clusters. The CODA algorithm is presented and explained in detail in order to show how these three principles are used. The algorithm is explained with pseudocode, flowcharts and block diagrams. The clonal/mutation results are presented with a simple example, showing graphically how new data with a completely new probability distribution is generated. Finally, the first application where CODA is used, a humanoid hand, is presented. In this application the algorithm created affordable grasping postures from limited examples, creating its own knowledge and storing data in memory in order to recognise whether it has been in a similar situation
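    The clonal/mutation step can be illustrated with a short sketch. The code below generates a repertoire of mutated clones from a single example vector, with mutation strength inversely related to affinity, in the spirit of clonal selection; the affinity measure and parameters are illustrative assumptions rather than the CODA implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def clonal_expand(antigen, n_clones=50, mutation_scale=0.1):
    """Sketch of clonal expansion with hypermutation from a single example.

    Starting from one 'antigen' (e.g. a demonstrated grasp posture encoded as
    a joint-angle vector), produce a repertoire of mutated clones. Clones far
    from the example (low affinity) receive larger additional mutations,
    mimicking affinity-proportional hypermutation. Names and the affinity
    measure are illustrative assumptions.
    """
    antigen = np.asarray(antigen, dtype=float)
    clones = antigen + rng.normal(0.0, mutation_scale, size=(n_clones, antigen.size))
    affinity = -np.linalg.norm(clones - antigen, axis=1)       # closeness to the example
    # Re-mutate low-affinity clones more aggressively (inverse-affinity mutation).
    spread = mutation_scale * (1.0 + (affinity.max() - affinity))
    clones = clones + rng.normal(0.0, 1.0, clones.shape) * spread[:, None]
    return clones

repertoire = clonal_expand([0.2, 0.9, 1.3, 0.4])   # a single 4-joint grasp example
print(repertoire.shape)                            # (50, 4): a varied repertoire from one sample
```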

    Machine Teaching for Inverse Reinforcement Learning: Algorithms and Applications

    Inverse reinforcement learning (IRL) infers a reward function from demonstrations, allowing for policy improvement and generalization. However, despite much recent interest in IRL, little work has been done to understand the minimum set of demonstrations needed to teach a specific sequential decision-making task. We formalize the problem of finding maximally informative demonstrations for IRL as a machine teaching problem where the goal is to find the minimum number of demonstrations needed to specify the reward equivalence class of the demonstrator. We extend previous work on algorithmic teaching for sequential decision-making tasks by showing a reduction to the set cover problem which enables an efficient approximation algorithm for determining the set of maximally-informative demonstrations. We apply our proposed machine teaching algorithm to two novel applications: providing a lower bound on the number of queries needed to learn a policy using active IRL and developing a novel IRL algorithm that can learn more efficiently from informative demonstrations than a standard IRL approach. Comment: In proceedings of the AAAI Conference on Artificial Intelligence, 201
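    Because the paper reduces demonstration selection to set cover, the standard greedy approximation conveys the flavour of the algorithm. In the hedged sketch below, each candidate demonstration is assumed to "cover" an abstract set of reward-equivalence constraints; how those constraints are actually derived from trajectories is the paper's contribution and is not reproduced here.

```python
def greedy_demo_selection(candidate_demos, constraints_covered):
    """Greedy set-cover approximation for picking maximally informative demos.

    candidate_demos: list of demonstration identifiers.
    constraints_covered: dict mapping each demo to the set of reward-equivalence
        constraints (e.g. half-space constraints on reward weights) it pins down;
        these are abstract placeholders here.
    Returns a small subset of demos whose constraints jointly cover the union,
    using the classic greedy ln(n)-approximation rule.
    """
    universe = set().union(*constraints_covered.values())
    covered, chosen = set(), []
    while covered != universe:
        best = max(candidate_demos, key=lambda d: len(constraints_covered[d] - covered))
        gain = constraints_covered[best] - covered
        if not gain:        # remaining constraints cannot be covered
            break
        chosen.append(best)
        covered |= gain
    return chosen

# Toy usage with three candidate demonstrations and three abstract constraints.
demos = ["d1", "d2", "d3"]
cover = {"d1": {"c1", "c2"}, "d2": {"c2", "c3"}, "d3": {"c3"}}
print(greedy_demo_selection(demos, cover))   # e.g. ['d1', 'd2']
```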

    Optimized Endpoint Delivery Via Unmanned Aerial Vehicles

    Unmanned Aerial Vehicles (UAVs) are remotely piloted aircraft with a range of varying applications. Though early adoption of UAVs focused on military applications, surveillance, photography, and agricultural applications are presently on the rise. This work aims to ascertain how UAVs may be employed to achieve decreased transportation times, increased power efficiency, and improved safety, resulting in optimized endpoint delivery. A combination of tools and techniques, involving a mathematical model, UAV simulations, redundant control systems, and custom designed electrical and mechanical components, was used toward reaching the goal of a 10-kilogram maximum payload delivered 10 miles in under 30 minutes. Two UAV prototypes were developed, the second of which (V2) showed promising results. Velocities achieved in V2, in combination with a versatile payload connector and proper networking, allowed 5-10 mile deliveries of goods under 8 kilograms to be achieved within a metropolis faster than the 30-minute benchmark
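    A back-of-envelope calculation shows what the 10-mile, 30-minute benchmark implies for cruise speed; the overhead allowance for takeoff, landing, and payload release below is an illustrative assumption, not a figure from this work.

```python
# Back-of-envelope check of the 10-mile / 30-minute delivery benchmark described
# above. The non-cruise overhead is an assumed figure for illustration only.
MILES_TO_M = 1609.34

distance_m = 10 * MILES_TO_M          # 10-mile delivery leg
window_s = 30 * 60                    # 30-minute benchmark
overhead_s = 4 * 60                   # assumed takeoff, landing and drop time

required_speed = distance_m / (window_s - overhead_s)
print(f"required average cruise speed: {required_speed:.1f} m/s "
      f"({required_speed * 3.6:.1f} km/h)")
# Roughly 10 m/s (about 37 km/h) of average cruise speed is needed, assuming
# battery endurance and payload capacity permit it.
```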

    Learning Sequential Force Interaction Skills

    Learning skills from kinesthetic demonstrations is a promising way of minimizing the gap between human manipulation abilities and those of robots. We propose an approach to learn sequential force interaction skills from such demonstrations. The demonstrations are decomposed into a set of movement primitives by inferring the underlying sequential structure of the task. The decomposition is based on a novel probability distribution which we call the Directional Normal Distribution. The distribution allows inferring each movement primitive’s composition, i.e., its coordinate frames, control variables and target coordinates, from the demonstrations. In addition, it permits determining an appropriate number of movement primitives for a task via model selection. After finding the task’s composition, the system learns to sequence the resulting movement primitives in order to be able to reproduce the task on a real robot. We evaluate the approach on three different tasks: unscrewing a light bulb, box stacking and box flipping. All tasks are kinesthetically demonstrated and then reproduced on a Barrett WAM robot
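    The model-selection step (choosing how many movement primitives a task needs) can be illustrated with a generic stand-in: the sketch below scores candidate primitive counts with BIC using an off-the-shelf Gaussian mixture. This substitutes an ordinary Gaussian mixture for the paper's Directional Normal Distribution purely to show the mechanism.

```python
"""Generic stand-in for the model-selection step described above: choosing the
number of movement primitives by penalised likelihood (BIC). Toy data only."""
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy "demonstration features": three clusters standing in for three primitives.
X = np.vstack([rng.normal(m, 0.2, size=(60, 2)) for m in ([0, 0], [2, 1], [4, -1])])

bic = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bic[k] = gmm.bic(X)

best_k = min(bic, key=bic.get)
print("BIC per candidate primitive count:", {k: round(v, 1) for k, v in bic.items()})
print("selected number of primitives:", best_k)   # typically 3 for this toy data
```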

    Using learning from demonstration to enable automated flight control comparable with experienced human pilots

    Modern autopilots fall under the domain of control theory, which utilizes Proportional Integral Derivative (PID) controllers that can provide relatively simple autonomous control of an aircraft, such as maintaining a certain trajectory. However, PID controllers cannot cope with uncertainties due to their non-adaptive nature. In addition, modern autopilots of airliners have contributed to several air catastrophes due to their robustness issues. Therefore, the aviation industry is seeking solutions that would enhance safety. A potential solution is to develop intelligent autopilots that can learn how to pilot aircraft in a manner comparable with experienced human pilots. This work proposes the Intelligent Autopilot System (IAS), which provides a comprehensive level of autonomy and intelligent control to the aviation industry. The IAS learns piloting skills by observing experienced teachers while they provide demonstrations in simulation. A robust learning from demonstration approach is proposed which uses human pilots to demonstrate the task to be learned in a flight simulator while training datasets are captured. The datasets are then used by Artificial Neural Networks (ANNs) to generate control models automatically. The control models imitate the skills of the experienced pilots when performing the different piloting tasks while handling flight uncertainties such as severe weather conditions and emergency situations. Experiments show that the IAS performs learned skills and tasks with high accuracy even after being presented with limited examples, which suits the proposed approach of relying on many single-hidden-layer ANNs instead of one or a few large deep ANNs that produce a black box which cannot be explained to aviation regulators. The results demonstrate that the IAS is capable of imitating low-level sub-cognitive skills, such as rapid and continuous stabilization attempts in stormy weather conditions, and high-level strategic skills, such as the sequence of sub-tasks necessary to take off, land, and handle emergencies
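    The core learning step, many small single-hidden-layer networks each imitating one piloting behaviour from captured demonstration data, can be sketched briefly. The feature names, network size, and synthetic data below are assumptions for illustration, not the IAS's actual configuration.

```python
"""Illustrative sketch of the learning-from-demonstration step described above:
one small single-hidden-layer network mapping observed flight state to a single
control output, trained on data captured from a human demonstration."""
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Hypothetical captured dataset: [pitch, airspeed_error, altitude_error] -> elevator
X = rng.normal(size=(500, 3))
y = -0.8 * X[:, 0] - 0.3 * X[:, 1] + 0.1 * X[:, 2] + 0.02 * rng.normal(size=500)

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(X, y)

# The learned model can then be queried at each autopilot update step.
state = np.array([[0.05, -1.2, 3.0]])
print("elevator command:", float(model.predict(state)[0]))
```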

    Lattice-Based Motion Planning with Optimal Motion Primitives

    In the field of navigation for autonomous vehicles, it is the responsibility of a local planner to compute reference trajectories that are then followed by a tracking controller. These trajectories should be safe, kinematically feasible, and optimize certain desirable features like low travel time and smoothness/comfort. Determining such trajectories is known as the motion planning problem and is the focus of this work. In general, the motion planning problem is intractable, and simplifications must be made in order to compute reference trajectories quickly and in real time. A common strategy involves adopting a simple kinematic model for the trajectory. However, overly simplified models can lead to references that are infeasible for the vehicle. These are hard for a tracking controller to follow, resulting in large tracking error and frequent re-planning. In contrast, lattice-based motion planning simplifies the motion planning problem by restricting the set of allowable motions. In detail, lattice-based motion planning works by discretizing the configuration space of a vehicle into a regularly repeating grid called a lattice. The set of all optimal feasible trajectories between vertices of this lattice is pre-computed, and a subset called a control set is selected. Trajectories of this pre-computed subset are then joined together online to form more complex compound maneuvers. Because trajectories between lattice vertices are pre-computed, the complexities of the motion planning problem are considered offline. While not every trajectory is available to a lattice-based planner, every trajectory that is available is feasible and optimal. Selecting a control set is an important step in lattice-based motion planning since the optimality of each element of the control set does not guarantee the optimality of compound maneuvers. These control sets are often selected based on intuition and experience. Broadly, the size of a control set has a positive effect on the quality of computed trajectories, but at the expense of run-time performance. A control set is said to t-span a lattice if trajectories between lattice vertices can be approximated to within a factor of t as compound maneuvers of elements of the control set. Given an acceptable allowance t on the sub-optimality of compound maneuvers in a lattice, the problem of computing the smallest control set that t-spans the lattice is called the minimum t-spanning control set problem. In essence, this problem seeks to optimize a trade-off between the quality of compound maneuvers and the time required to compute them. This work details solutions and applications of the minimum t-spanning control set problem in autonomous vehicle navigation. In particular, we first investigate an instance of the problem that can be solved efficiently, provide an intuitive solution, and outline the applications of this instance in the field of any-angle path planning in a two-dimensional environment. Next, we provide a novel method to compute trajectories that optimize an adjustable trade-off between certain desirable features. The relative importance of each of these features may differ by user, and the techniques developed here are able to reflect these preferences. The NP-completeness of the general minimum t-spanning control set problem is established here, and we present a mixed integer linear program that encodes the problem.
The trajectories we propose, in conjunction with the mixed integer linear program, result in a method to compute a minimum t-spanning control set whose elements are kinematically feasible and reflect the preferences of a user if those preferences are known. Finally, we consider the problem of simultaneously learning the preferences of a single user from demonstrations and computing sparse control sets for that user. We propose a technique to solve this problem that leverages a separation principle: first estimate the preferences of the user based on demonstrations, then compute a control set of trajectories that are optimal given the estimated preferences. We show that this approach optimally solves the problem. Combining the work of this thesis results in a method by which tailored control sets that reflect the preferences of a user can be determined from the demonstrations of that user. These control sets have the following beneficial attributes: 1) each element of the control set is optimal for the estimated preferences of the user, and 2) the control set optimizes a trade-off between the quality of compound maneuvers between lattice vertices -- as defined by the estimated preferences of the user -- and the time required to compute them
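    The t-spanning condition is easy to check computationally for a toy lattice: run a shortest-path search restricted to control-set primitives and compare the resulting compound-maneuver costs against the optimal direct cost. In the sketch below, straight-line Euclidean cost stands in for the true optimal trajectory cost, and the lattice and primitive costs are placeholders.

```python
"""Sketch of the t-spanning check described above: a control set t-spans the
lattice if, for every reachable vertex, the best compound maneuver built from
control-set primitives costs at most t times the optimal direct trajectory.
Euclidean distance stands in here for the true optimal cost."""
import heapq
import math

# Toy control set: displacement -> primitive cost (placeholder values).
control_set = {(1, 0): 1.0, (0, 1): 1.0, (1, 1): 1.5, (2, 1): 2.3}

def compound_costs(control_set, radius=6):
    """Dijkstra over lattice vertices using only control-set primitives."""
    dist = {(0, 0): 0.0}
    heap = [(0.0, (0, 0))]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, math.inf):
            continue
        for (dx, dy), c in control_set.items():
            u = (v[0] + dx, v[1] + dy)
            if max(abs(u[0]), abs(u[1])) > radius:
                continue
            if d + c < dist.get(u, math.inf):
                dist[u] = d + c
                heapq.heappush(heap, (d + c, u))
    return dist

def spanning_ratio(control_set, radius=6):
    """Worst-case ratio of compound cost to direct (Euclidean stand-in) cost."""
    dist = compound_costs(control_set, radius)
    worst = 1.0
    for (x, y), d in dist.items():
        if (x, y) == (0, 0):
            continue
        worst = max(worst, d / math.hypot(x, y))
    return worst

t = spanning_ratio(control_set)
print(f"control set t-spans the reachable lattice vertices with t = {t:.3f}")
```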

    Specifying User Preferences for Autonomous Robots through Interactive Learning

    This thesis studies a central problem in human-robot interaction (HRI): How can non-expert users specify complex behaviours for autonomous robots? A common technique for robot task specification that does not require expert knowledge is active preference learning. The desired behaviour of a robot is learned by iteratively presenting the user with alternative behaviours of the robot. The user then chooses the alternative they prefer. It is assumed that they make this decision based on an internal, hidden cost function. From the user's choice among the alternatives, the robot learns the hidden user cost function. We use an interactive framework allowing users to create robot task specifications. The behaviour of an autonomous robot can be specified by defining constraints on allowable robot states and actions. For instance, for a mobile robot a user can define traffic rules such as roads, slow zones or areas of avoidance. These constraints form the user-specified terms of the cost function. However, inexperienced users might be oblivious to the impact such constraints have on the robot task performance. Employing an active preference learning framework, we present users with the behaviour of the robot following their specification, i.e., the constraints, together with an alternative behaviour where some constraints might be violated. A user cost function trades off the importance of constraints and the performance of the robot. From the user feedback, the robot learns about the importance of constraints, i.e., parameters in the cost function. We first introduce an algorithm for specification revision that is based on a deterministic user model: We assume that the user always follows the proposed cost function. This allows for dividing the set of possible weights for the user constraints into infeasible and feasible weights whenever user feedback is obtained. In each iteration we again present the path the user preferred previously, together with an alternative path that is optimal for a weight that is feasible with respect to all previous iterations. This path is found with a local search, iterating over the feasible weights until a new path is found. As the number of paths is finite for any discrete motion planner, the algorithm is guaranteed to find the optimal solution within a finite number of iterations. Simulation results show that this approach is suitable to effectively revise user specifications within a few iterations. The practicality of the framework is investigated in a user study. The algorithm is extended to learn about multiple tasks for the robot simultaneously, which allows for more realistic scenarios and another active learning component: the choice of task for which the user is presented with two alternative solutions. Through the study we show that nearly all users accept alternative solutions and thus obtain a revised specification through the learning process, leading to a substantial improvement in robot performance. Also, the users whose initial specifications had the largest impact on performance benefit the most from the interactive learning. Next, we weaken the assumptions about the user: In a probabilistic model we do not require the user to always follow our cost function. Based on the sensitivity of a motion planning problem, we show that different values in the user cost function, i.e., weights for the user constraints, do not necessarily lead to different robot behaviour.
From the implied discretization of the space of possible parameters we derive an algorithm for efficiently learning a specification revision and demonstrate its performance and robustness in simulations. We build on the notion of sensitivity to develop an active preference learning technique based on maximum regret, i.e., the maximum error ratio over all possible solutions. We show that active preference learning based on regret substantially outperforms other state-of-the-art approaches. Further, regret-based preference learning can be used as a heuristic for both discrete and continuous state and action spaces. An emerging technique for real-time motion planning is state lattice planners, based on a regular discrete set of robot states and pre-computed motions connecting the states, called motion primitives. We study how learning from demonstrations can be used to learn global preferences for robot movement, such as the trade-off between time and jerkiness of the motions. We show how to compute a user-optimal set of motion primitives of given size, based on an estimate of the user preferences. We demonstrate that by learning about the motion primitives of a lattice planner, we can shape the robot's behaviour to follow the global user preferences while ensuring good computation time of the motion planner. Furthermore, we study how a robot can simultaneously learn about user preferences on both motions of a lattice planner and parts of the environment when a user is iteratively correcting the robot behaviour. We demonstrate in simulations that this approach is suitable to adapt to user preferences even when the features of the environment that a user considers are not given
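    The deterministic user-model update has a compact interpretation: every pairwise choice adds a half-space constraint on the constraint weights, and any weight vector satisfying all accumulated constraints is feasible for the next iteration. The sketch below finds such a weight with a linear program; the two-dimensional features and the normalisation are illustrative assumptions.

```python
"""Sketch of the deterministic user-model update described above: each user
choice between two presented paths implies a half-space constraint on the
weights (the chosen path must have lower cost), and any weight vector that
satisfies all accumulated constraints is feasible. Toy features only."""
import numpy as np
from scipy.optimize import linprog

def feasible_weight(choices, dim):
    """Return a weight vector consistent with all (chosen, rejected) feature pairs,
    or None if the feasible region is empty."""
    # One constraint per choice: w . (phi_chosen - phi_rejected) <= 0
    A_ub = np.array([np.asarray(c) - np.asarray(r) for c, r in choices])
    b_ub = np.zeros(len(choices))
    A_eq = np.ones((1, dim))          # weights normalised to sum to 1
    b_eq = np.array([1.0])
    res = linprog(c=np.zeros(dim), A_ub=A_ub, b_ub=b_ub,
                  A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * dim)
    return res.x if res.success else None

# Toy feedback: path features = (travel time, constraint-violation penalty).
choices = [
    ((10.0, 0.0), (6.0, 2.0)),   # user preferred the slower but rule-abiding path
    ((8.0, 1.0), (12.0, 0.0)),   # ...but accepted a mild violation to save 4 time units
]
w = feasible_weight(choices, dim=2)
print("a feasible weight vector:", w)
```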