Bayesian learning of noisy Markov decision processes
We consider the inverse reinforcement learning problem, that is, the problem
of learning from, and then predicting or mimicking, a controller based on
state/action data. We propose a statistical model for such data, derived from
the structure of a Markov decision process. Adopting a Bayesian approach to
inference, we show how latent variables of the model can be estimated and how
predictions about actions can be made within a unified framework. A new Markov
chain Monte Carlo (MCMC) sampler is devised for simulation from the posterior
distribution. The sampler includes a parameter expansion step, which is shown
to be essential for its good convergence properties. As an illustration, the
method is applied to learning a human controller.
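To make the parameter-expansion idea concrete, here is a minimal parameter-expanded Gibbs sampler for a Bayesian probit model, the classic setting for this trick. This is an illustrative sketch only, not the paper's actual sampler (which targets the posterior of the MDP model); the expansion draws a working scale alpha² that the vanilla sampler would fix to 1, which is what accelerates mixing:

```python
import numpy as np
from scipy.stats import truncnorm

def px_gibbs_probit(X, y, n_iter=1000, seed=0):
    """Parameter-expanded Gibbs sampler for Bayesian probit regression.

    Sketch of the parameter-expansion technique; degrees of freedom
    and priors are simplified for illustration.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    beta = np.zeros(d)
    XtX_inv = np.linalg.inv(X.T @ X + 1e-6 * np.eye(d))
    samples = []
    for _ in range(n_iter):
        # 1. Sample latent utilities z_i, truncated at 0 by the label y_i.
        mu = X @ beta
        lo = np.where(y == 1, -mu, -np.inf)
        hi = np.where(y == 1, np.inf, -mu)
        z = mu + truncnorm.rvs(lo, hi, size=n, random_state=rng)
        # 2. Expanded step: draw a working scale alpha^2 (the expansion
        #    parameter), which the non-expanded sampler would fix to 1.
        mean = XtX_inv @ (X.T @ z)
        resid = z - X @ mean
        alpha2 = (resid @ resid) / rng.chisquare(n - d)
        # 3. Sample beta in the expanded model, then rescale back.
        beta = rng.multivariate_normal(mean, alpha2 * XtX_inv) / np.sqrt(alpha2)
        samples.append(beta.copy())
    return np.array(samples)
```

Without step 2, the sampler can mix slowly when the latent utilities and coefficients are strongly coupled; the extra scale parameter breaks that coupling, which is the same convergence argument the paper makes for its sampler.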
Bayesian Disturbance Injection: Robust Imitation Learning of Flexible Policies
Scenarios requiring humans to choose from multiple seemingly optimal actions
are commonplace; however, standard imitation learning often fails to capture
this behavior. Instead, an over-reliance on replicating expert actions induces
inflexible and unstable policies, leading to poor generalizability in
application. To address this problem, this paper presents the first imitation
learning framework that incorporates Bayesian variational inference for
learning flexible non-parametric multi-action policies, while simultaneously
robustifying the policies against sources of error by introducing and
optimizing disturbances to create a richer demonstration dataset. This
combined approach forces the policy to adapt to challenging situations,
enabling stable multi-action policies to be learned efficiently. The
effectiveness of the proposed method is evaluated through simulations and
real-robot experiments on a table-sweep task using the UR3 6-DOF robotic arm.
Results show that, through improved flexibility and robustness, the method
outperforms comparison methods in both learning performance and control
safety. Comment: 7 pages. Accepted by the 2021 International Conference on
Robotics and Automation (ICRA 2021).
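A minimal sketch of the disturbance-injection half of the method, under the simplifying assumption of a fixed noise scale (the paper instead optimizes the disturbance distribution via variational inference). The `env`, `expert_policy`, and older Gym-style step signature are illustrative assumptions:

```python
import numpy as np

def collect_with_disturbance(env, expert_policy, noise_scale, episodes=10):
    """Collect demonstrations while injecting action disturbances.

    The expert's commanded action is perturbed before execution, but the
    *intended* action is stored as the supervision signal, so the dataset
    covers recovery behavior off the expert's nominal trajectory.
    Assumes a Gym-style environment with the 4-tuple step API.
    """
    dataset = []
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = expert_policy(state)                 # expert's intent
            disturbed = action + np.random.normal(0.0, noise_scale,
                                                  size=np.shape(action))
            next_state, _, done, _ = env.step(disturbed)  # execute perturbed action
            dataset.append((state, action))               # supervise with clean action
            state = next_state
    return dataset
```

The design choice worth noting is that the perturbed action is executed but never imitated; the richer state coverage is what robustifies the learned policy.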
Artificial Neural Network-Based Flight Control Using Distributed Sensors on Fixed-Wing Unmanned Aerial Vehicles
Conventional control systems for autonomous aircraft use a small number of precise sensors in combination with classical control laws to maintain flight. These sensing systems encode center-of-mass motion and are generally set up for flight regimes where rigid-body assumptions and linear flight dynamics models are valid. Gain scheduling is used to overcome some of the limitations of these assumptions, taking advantage of well-tuned controllers over a range of design points. In contrast, flying animals achieve efficient and robust flight control by exploiting highly non-linear structural dynamics and aerodynamics. It has been suggested that the distributed arrays of flow and force sensors found in flying animals could be behind their remarkable flight control. Using a wind tunnel aircraft model instrumented with distributed arrays of load and flow sensors, we developed artificial neural network (ANN) flight control algorithms that use signals from the sensing array, as well as the signals available in conventional sensing suites, to control angle-of-attack. These controllers were trained to match the response of a conventional controller, achieving a similar level of performance over a wide range of angle-of-attack and wind speed values. Wind tunnel testing showed that an ANN-based controller, combined with signals from a distributed array of pressure and strain sensors on a wing, could control angle-of-attack. The end-to-end learning approach used here controlled angle-of-attack by directly learning the mapping between control inputs and system outputs, without explicitly estimating or being given the angle-of-attack.
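For a concrete picture of such an end-to-end controller, here is a minimal sketch: a small multilayer perceptron mapping distributed pressure and strain readings to an elevator command, trained to match a conventional controller's output. The array sizes and architecture are assumptions for illustration, not taken from the paper:

```python
import torch
import torch.nn as nn

# Hypothetical sensor-array dimensions (not the paper's actual counts).
N_PRESSURE, N_STRAIN, N_CONVENTIONAL = 32, 8, 4

# End-to-end controller: raw sensor signals in, elevator command out.
controller = nn.Sequential(
    nn.Linear(N_PRESSURE + N_STRAIN + N_CONVENTIONAL, 64),
    nn.Tanh(),
    nn.Linear(64, 64),
    nn.Tanh(),
    nn.Linear(64, 1),   # single elevator deflection command
)

def train_to_match(sensor_batches, reference_commands, epochs=100):
    """Supervised training against the conventional controller's output,
    mirroring the matching procedure described in the abstract."""
    opt = torch.optim.Adam(controller.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x, u_ref in zip(sensor_batches, reference_commands):
            opt.zero_grad()
            loss = loss_fn(controller(x), u_ref)
            loss.backward()
            opt.step()
```

Note that angle-of-attack never appears as an input or intermediate target: the network only ever sees raw sensor signals and the reference command, which is the sense in which the learning is end-to-end.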
Modularity through Attention: Efficient Training and Transfer of Language-Conditioned Policies for Robot Manipulation
Language-conditioned policies allow robots to interpret and execute human
instructions. Learning such policies requires a substantial investment of time
and compute resources, and the resulting controllers are highly
device-specific: they cannot easily be transferred to a robot with a different
morphology, capability, appearance, or dynamics. In this paper, we propose a
sample-efficient approach for training language-conditioned manipulation
policies that allows for rapid transfer across different types of robots. By
introducing a novel method, namely Hierarchical Modularity, and adopting
supervised attention across multiple sub-modules, we bridge the divide between
modular and end-to-end learning and enable the reuse of functional building
blocks. In both simulated and real-world robot manipulation experiments, we
demonstrate that our method outperforms the current state-of-the-art methods
and can transfer policies across four different robots in a sample-efficient
manner. Finally, we show that the functionality of learned sub-modules is
maintained beyond the training process and can be used to introspect the
robot's decision-making process. Code is available at
https://github.com/ir-lab/ModAttn. Comment: 2022 Conference on Robot Learning
(CoRL).
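A rough sketch of what supervised attention over a sub-module could look like; the dimensions, pooling scheme, and losses below are illustrative assumptions, not the ModAttn implementation. The key point is that the attention map itself is trained against a relevance target in addition to the task loss, which keeps each sub-module's role fixed and inspectable:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SupervisedAttentionModule(nn.Module):
    """One sub-module whose attention over input tokens is supervised."""
    def __init__(self, d_model=128):
        super().__init__()
        self.query = nn.Parameter(torch.randn(d_model))
        self.out = nn.Linear(d_model, d_model)

    def forward(self, tokens):                     # tokens: (batch, n, d)
        scores = tokens @ self.query               # (batch, n)
        attn = F.softmax(scores, dim=-1)
        pooled = (attn.unsqueeze(-1) * tokens).sum(dim=1)
        return self.out(pooled), attn

def module_loss(pred, target, attn, attn_target, weight=0.1):
    # Task loss plus supervision of the attention distribution itself;
    # attn_target is a normalized mask marking the tokens this module
    # is *supposed* to attend to.
    task = F.mse_loss(pred, target)
    attention = F.kl_div(attn.log(), attn_target, reduction="batchmean")
    return task + weight * attention
```

Because each module's attention target pins down what it reads, a trained module can plausibly be reused on a new robot, and its attention map read off at test time for introspection, which is the transfer-and-inspection story the abstract describes.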
A Case Study on Air Combat Decision Using Approximated Dynamic Programming
As a continuous-state-space problem, air combat is difficult to solve with traditional dynamic programming (DP) over a discretized state space. The approximated dynamic programming (ADP) approach is studied in this paper to build a high-performance decision model for air combat in a one-versus-one scenario, in which the iterative process for policy improvement is replaced by mass sampling from history trajectories and utility function approximation, ultimately yielding efficient policy improvement. A continuous reward function is also constructed to better guide the plane from any initial situation toward the "winner" state. According to our experiments, the plane is more offensive when following the policy derived from the ADP approach rather than the baseline Min-Max policy: the "time to win" is reduced greatly, but the cumulative probability of being killed by the enemy is higher. The reasons for this trade-off are analyzed in the paper.
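The core ADP recipe, regressing a utility (value) function on features of sampled states instead of sweeping a discretized state space, can be sketched as fitted value iteration. The feature map, regressor, and evaluation-style backup below are illustrative simplifications; the full method would alternate this with a policy-improvement step:

```python
import numpy as np
from sklearn.linear_model import Ridge

def fitted_value_iteration(trajectories, features, gamma=0.99, sweeps=20):
    """Approximate DP on sampled history trajectories.

    `trajectories` is a list of (state, reward, next_state) triples
    sampled from past engagements; `features` maps a raw state to a
    feature vector. Both are assumptions of this sketch.
    """
    model = Ridge(alpha=1.0)
    states = np.array([features(s) for s, _, _ in trajectories])
    next_states = np.array([features(s2) for _, _, s2 in trajectories])
    rewards = np.array([r for _, r, _ in trajectories])
    values = np.zeros(len(trajectories))
    for _ in range(sweeps):
        model.fit(states, values)                       # utility approximation
        values = rewards + gamma * model.predict(next_states)  # DP backup
    model.fit(states, values)
    return model
```

Because the backup only touches sampled states, the cost per sweep scales with the data rather than with the size of a discretized grid, which is what makes the approach viable in a continuous state space.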
Intelligent Inventory Control via Ruminative Reinforcement Learning
Inventory management is a sequential decision problem that can be solved with reinforcement learning (RL). Although RL in its conventional form does not require domain knowledge, exploiting knowledge of problem structure, which is usually available in inventory management, can improve the quality and speed of learning. Ruminative reinforcement learning (RRL) was recently introduced based on this approach; it is motivated by how humans contemplate the consequences of their actions when trying to learn to make better decisions. This study further investigates issues in RRL and proposes new RRL methods applied to inventory management. Our investigation provides insight into different RRL characteristics, and our experimental results show the viability of the new methods.
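A hedged sketch of the ruminative idea in the inventory setting: because the inventory dynamics are known once the demand is realized, the agent can evaluate every order quantity counterfactually and update all of them, not just the action actually taken. The tabular representation and function names below are assumptions for illustration, not the paper's algorithms:

```python
import numpy as np

def ruminative_q_update(Q, state, demand, alpha, gamma,
                        actions, transition, reward):
    """One ruminative Q-learning update for inventory control.

    `transition(state, action, demand)` and `reward(state, action, demand)`
    encode the known inventory model (order arrives, demand is served,
    holding/shortage costs accrue). Given the realized demand, every
    counterfactual order quantity can be rolled out exactly.
    """
    for a in actions:
        next_state = transition(state, a, demand)   # counterfactual rollout
        r = reward(state, a, demand)
        target = r + gamma * np.max(Q[next_state])
        Q[state, a] += alpha * (target - Q[state, a])
    return Q
```

Updating all actions from a single observed demand is the "rumination": it recycles one sample of the unknown quantity (demand) into a full sweep over the known part of the dynamics, which is where the speed-up over conventional RL comes from.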
Bayesian nonparametric reward learning from demonstration
Thesis: Ph.D., Massachusetts Institute of Technology, Department of Aeronautics and Astronautics, 2013. By Bernard J. Michini.

Learning from demonstration provides an attractive solution to the problem of teaching autonomous systems how to perform complex tasks. Demonstration opens autonomy development to non-experts and is an intuitive means of communication for humans, who naturally use demonstration to teach others. This thesis focuses on a specific form of learning from demonstration, namely inverse reinforcement learning, whereby the reward of the demonstrator is inferred. Formally, inverse reinforcement learning (IRL) is the task of learning the reward function of a Markov decision process (MDP) given knowledge of the transition function and a set of observed demonstrations. While reward learning is a promising method of inferring a rich and transferable representation of the demonstrator's intents, current algorithms suffer from intractability and inefficiency in large, real-world domains. This thesis presents a reward learning framework that infers multiple reward functions from a single, unsegmented demonstration, provides several key approximations which enable scalability to large real-world domains, and generalizes to fully continuous demonstration domains without the need for discretization of the state space, none of which are handled by previous methods.

In the thesis, modifications are proposed to an existing Bayesian IRL algorithm to improve its efficiency and tractability in situations where the state space is large and the demonstrations span only a small portion of it. A modified algorithm is presented, and simulation results show substantially faster convergence while maintaining the solution quality of the original method. Even with the proposed efficiency improvements, a key limitation of Bayesian IRL (and most current IRL methods) is the assumption that the demonstrator is maximizing a single reward function. This presents problems when dealing with unsegmented demonstrations containing multiple distinct tasks, which are common in robot learning from demonstration (e.g., in large tasks that may require multiple subtasks to complete).

A key contribution of this thesis is the development of a method that learns multiple reward functions from a single demonstration. The proposed method, termed Bayesian nonparametric inverse reinforcement learning (BNIRL), uses a Bayesian nonparametric mixture model to automatically partition the data and find a set of simple reward functions corresponding to each partition. The simple rewards are interpreted intuitively as subgoals, which can be used to predict actions or analyze which states are important to the demonstrator. Simulation results demonstrate the ability of BNIRL to handle cyclic tasks that break existing algorithms due to the existence of multiple subgoal rewards in the demonstration. The BNIRL algorithm is easily parallelized, and several approximations to the demonstrator likelihood function are offered to further improve computational tractability in large domains.

Since BNIRL is only applicable to discrete domains, the Bayesian nonparametric reward learning framework is extended to general continuous demonstration domains using Gaussian process reward representations. The resulting algorithm, termed Gaussian process subgoal reward learning (GPSRL), is the only learning-from-demonstration method able to learn multiple reward functions from unsegmented demonstration in general continuous domains. GPSRL does not require discretization of the continuous state space and focuses computation efficiently around the demonstration itself. Learned subgoal rewards are cast as Markov decision process options to enable execution of the learned behaviors by the robotic system and to provide a principled basis for future learning and skill refinement. Experiments conducted in the MIT RAVEN indoor test facility demonstrate the ability of both BNIRL and GPSRL to learn challenging maneuvers from demonstration on a quadrotor helicopter and a remote-controlled car.
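To make the partitioning mechanism concrete, below is a minimal Chinese-restaurant-process Gibbs sweep of the kind that underlies BNIRL's nonparametric mixture model. The `likelihood` function stands in for the demonstrator likelihood under a candidate subgoal and is an assumption of this sketch; the thesis's approximations to that likelihood are not reproduced here:

```python
import numpy as np

def crp_partition_step(assignments, demo, likelihood, alpha=1.0, rng=None):
    """One Gibbs sweep of a Chinese-restaurant-process partition.

    Each state/action pair in `demo` is reassigned either to an existing
    subgoal cluster (probability proportional to cluster size times how
    well that subgoal explains the pair) or to a brand-new subgoal
    (probability proportional to the concentration parameter alpha).
    """
    rng = rng or np.random.default_rng()
    n = len(demo)
    for i in range(n):
        clusters = sorted({assignments[j] for j in range(n) if j != i})
        weights = []
        for c in clusters:
            members = [demo[j] for j in range(n)
                       if j != i and assignments[j] == c]
            weights.append(len(members) * likelihood(demo[i], members))
        weights.append(alpha * likelihood(demo[i], []))   # open a new subgoal
        probs = np.array(weights) / np.sum(weights)
        choice = rng.choice(len(probs), p=probs)
        assignments[i] = (clusters[choice] if choice < len(clusters)
                          else max(clusters, default=-1) + 1)
    return assignments
```

The appeal of the nonparametric prior is visible in the last weight: the number of subgoals is never fixed in advance, so a demonstration containing an unknown number of distinct tasks can grow exactly as many clusters as the data support.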