Hybrid Reinforcement Learning with Expert State Sequences
Existing imitation learning approaches often require that the complete
demonstration data, including sequences of actions and states, are available.
In this paper, we consider a more realistic and difficult scenario where a
reinforcement learning agent only has access to the state sequences of an
expert, while the expert actions are unobserved. We propose a novel
tensor-based model to infer the unobserved actions of the expert state
sequences. The policy of the agent is then optimized via a hybrid objective
combining reinforcement learning and imitation learning. We evaluated our
hybrid approach on an illustrative domain and Atari games. The empirical
results show that (1) the agents are able to leverage expert state sequences to
learn faster than pure reinforcement learning baselines, (2) our tensor-based
action inference model is advantageous compared to standard deep neural
networks in inferring expert actions, and (3) the hybrid policy optimization
objective is robust against noise in expert state sequences.
Comment: AAAI 2019; https://github.com/XiaoxiaoGuo/tensor4r
Time-Contrastive Networks: Self-Supervised Learning from Video
We propose a self-supervised approach for learning representations and
robotic behaviors entirely from unlabeled videos recorded from multiple
viewpoints, and study how this representation can be used in two robotic
imitation settings: imitating object interactions from videos of humans, and
imitating human poses. Imitation of human behavior requires a
viewpoint-invariant representation that captures the relationships between
end-effectors (hands or robot grippers) and the environment, object attributes,
and body pose. We train our representations using a metric learning loss, where
multiple simultaneous viewpoints of the same observation are attracted in the
embedding space, while being repelled from temporal neighbors which are often
visually similar but functionally different. In other words, the model
simultaneously learns to recognize what is common between different-looking
images, and what is different between similar-looking images. This signal
causes our model to discover attributes that do not change across viewpoint,
but do change across time, while ignoring nuisance variables such as
occlusions, motion blur, lighting and background. We demonstrate that this
representation can be used by a robot to directly mimic human poses without an
explicit correspondence, and that it can be used as a reward function within a
reinforcement learning algorithm. While representations are learned from an
unlabeled collection of task-related videos, robot behaviors such as pouring
are learned by watching a single 3rd-person demonstration by a human. Reward
functions obtained by following the human demonstrations under the learned
representation enable efficient reinforcement learning that is practical for
real-world robotic systems. Video results, open-source code and dataset are
available at https://sermanet.github.io/imitat
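The metric learning loss described in this abstract can be illustrated with a minimal sketch. It assumes a triplet-style margin loss, which is one common choice for this kind of objective. The function names, the squared Euclidean distance, the margin value, and the toy embeddings are all assumptions made for illustration, not details taken from the abstract or the released code.

```python
# Hypothetical sketch of a time-contrastive triplet loss on toy embeddings.
# anchor:   embedding of view 1 at time t
# positive: embedding of view 2 at the same time t (should be pulled closer)
# negative: embedding of view 1 at a nearby time t+k (should be pushed away)

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def time_contrastive_triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that is zero once the positive is closer than the
    negative by at least `margin` (an assumed hyperparameter)."""
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

anchor = [0.0, 1.0]
same_moment_other_view = [0.1, 0.9]
temporal_neighbor = [1.0, 0.0]

# A well-separated embedding: the simultaneous view is close, the neighbor is far.
loss_good = time_contrastive_triplet_loss(
    anchor, same_moment_other_view, temporal_neighbor)

# A poor embedding: the roles are swapped, so the loss is large.
loss_bad = time_contrastive_triplet_loss(
    anchor, temporal_neighbor, same_moment_other_view)
```

In this toy case the loss vanishes when the simultaneous view is embedded near the anchor and grows when the temporal neighbor is nearer instead, which is the attract-and-repel behavior the abstract describes.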
Human-in-the-Loop Methods for Data-Driven and Reinforcement Learning Systems
Recent successes combine reinforcement learning algorithms and deep neural
networks, despite reinforcement learning not being widely applied to robotics
and real world scenarios. This can be attributed to the fact that current
state-of-the-art, end-to-end reinforcement learning approaches still require
thousands or millions of data samples to converge to a satisfactory policy and
are subject to catastrophic failures during training. Conversely, in real world
scenarios and after just a few data samples, humans are able to either provide
demonstrations of the task, intervene to prevent catastrophic actions, or
simply evaluate if the policy is performing correctly. This research
investigates how to integrate these human interaction modalities into the
reinforcement learning loop, increasing sample efficiency and enabling
real-time reinforcement learning in robotics and real world scenarios. This
novel theoretical foundation is called Cycle-of-Learning, a reference to how
different human interaction modalities, namely, task demonstration,
intervention, and evaluation, are cycled and combined with reinforcement learning
algorithms. Results presented in this work show that the reward signal that is
learned based upon human interaction accelerates the rate of learning of
reinforcement learning algorithms and that learning from a combination of human
demonstrations and interventions is faster and more sample efficient when
compared to traditional supervised learning algorithms. Finally,
Cycle-of-Learning develops an effective transition between policies learned
using human demonstrations and interventions to reinforcement learning. The
theoretical foundation developed by this research opens new research paths to
human-agent teaming scenarios where autonomous agents are able to learn from
human teammates and adapt to mission performance metrics in real-time and in
real world scenarios.
Comment: PhD thesis, Aerospace Engineering, Texas A&M (2020). For more
information, see https://vggoecks.com
End to End Learning in Autonomous Driving Systems
Convolutional neural networks have advanced visual perception significantly in recent years. Two major ingredients that enable this success are the composition of simple modules into a complex network and end-to-end optimization. However, this success has not yet revolutionized robotics as much as vision, even though robotics suffers from problems similar to those of traditional computer vision, i.e. the imperfection of manually designed system pipelines. This thesis investigates using end-to-end learning for the autonomous driving system, a concrete robotic application. End-to-end learning can produce reasonable driving behaviors, even in complex urban driving scenarios. Representation learning in end-to-end driving models is crucial, and auxiliary vision tasks such as semantic segmentation can help to form a more informative driving representation, especially when training data is limited. Naive convolutional neural networks are usually only capable of reactive control and cannot perform complex reasoning in a particular scenario. This thesis also studies how to handle scene-conditioned driving behavior, which goes beyond the capability of reactive control. Alongside the end-to-end structure, learning methods also play a critical role. Imitation learning methods acquire meaningful behaviors, but the robot usually cannot master the skill. Reinforcement learning, by contrast, either barely learns anything if the environment is too complex, or masters the skill otherwise. To get the best of both worlds, this thesis proposes an algorithmically unified method to learn from both demonstration data and the environment
Human-Machine Collaborative Optimization via Apprenticeship Scheduling
Coordinating agents to complete a set of tasks with intercoupled temporal and
resource constraints is computationally challenging, yet human domain experts
can solve these difficult scheduling problems using paradigms learned through
years of apprenticeship. A process for manually codifying this domain knowledge
within a computational framework is necessary to scale beyond the
"single-expert, single-trainee" apprenticeship model. However, human domain
experts often have difficulty describing their decision-making processes,
causing the codification of this knowledge to become laborious. We propose a
new approach for capturing domain-expert heuristics through a pairwise ranking
formulation. Our approach is model-free and does not require enumerating or
iterating through a large state space. We empirically demonstrate that this
approach accurately learns multifaceted heuristics on a synthetic data set
incorporating job-shop scheduling and vehicle routing problems, as well as on
two real-world data sets consisting of demonstrations of experts solving a
weapon-to-target assignment problem and a hospital resource allocation problem.
We also demonstrate that policies learned from human scheduling demonstration
via apprenticeship learning can substantially improve the efficiency of a
branch-and-bound search for an optimal schedule. We employ this human-machine
collaborative optimization technique on a variant of the weapon-to-target
assignment problem. We demonstrate that this technique generates solutions
substantially superior to those produced by human domain experts at a rate up
to 9.5 times faster than an optimization approach and can be applied to
optimally solve problems twice as complex as those solved by a human
demonstrator.
Comment: Portions of this paper were published in the Proceedings of the
International Joint Conference on Artificial Intelligence (IJCAI) in 2016 and
in the Proceedings of Robotics: Science and Systems (RSS) in 2016. The paper
consists of 50 pages with 11 figures and 4 table
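One way to picture the pairwise ranking formulation mentioned in this abstract is sketched below. It assumes that each expert decision is turned into feature differences between the task the expert scheduled and every task left unscheduled, and that a classifier trained on those differences is then used to vote for the next task. The helper names, the two-element feature vectors, and the hand-written `earliest_deadline` scorer (a stand-in for a learned classifier) are hypothetical and not taken from the paper.

```python
# Hypothetical sketch: turning one scheduling decision into pairwise
# ranking examples, then picking a task by counting pairwise "wins".

def pairwise_examples(features, chosen):
    """features: dict mapping task name -> feature list.
    chosen: the task the expert scheduled at this decision point.
    Returns (difference_vector, label) pairs: label 1 when the chosen task
    is listed first, label 0 for the reversed pair."""
    examples = []
    x_chosen = features[chosen]
    for task, x_other in features.items():
        if task == chosen:
            continue
        diff = [a - b for a, b in zip(x_chosen, x_other)]
        examples.append((diff, 1))
        examples.append(([-d for d in diff], 0))
    return examples

def pick_task(features, score):
    """Select the task whose pairwise comparisons score highest in total.
    `score` maps a difference vector to a preference in [0, 1]."""
    def wins(task):
        return sum(
            score([a - b for a, b in zip(features[task], features[other])])
            for other in features if other != task)
    return max(features, key=wins)

# Toy features per task: [deadline, travel distance] (illustrative only).
features = {"A": [2.0, 1.0], "B": [5.0, 3.0], "C": [4.0, 0.0]}
examples = pairwise_examples(features, chosen="A")

# Stand-in for a trained classifier: prefer the task with the earlier deadline.
earliest_deadline = lambda diff: 1.0 if diff[0] < 0 else 0.0
selected = pick_task(features, earliest_deadline)
```

Because the examples are built only from tasks available at each decision point, a construction like this avoids enumerating the full state space, in line with the model-free claim in the abstract.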
Machine Learning Meets Advanced Robotic Manipulation
Automated industries lead to high quality production, lower manufacturing
cost and better utilization of human resources. Robotic manipulator arms play a
major role in the automation process. However, for complex manipulation tasks,
hard coding efficient and safe trajectories is challenging and time consuming.
Machine learning methods have the potential to learn such controllers based on
expert demonstrations. Despite promising advances, better approaches must be
developed to improve safety, reliability, and efficiency of ML methods in both
training and deployment phases. This survey aims to review cutting edge
technologies and recent trends on ML methods applied to real-world manipulation
tasks. After reviewing the related background on ML, the rest of the paper is
devoted to ML applications in different domains such as industry, healthcare,
agriculture, space, military, and search and rescue. The paper is closed with
important research directions for future works
Hierarchical Reinforcement Learning in Minecraft
ENGLISH ABSTRACT: Humans have the remarkable ability to perform actions at various levels of abstraction. In
addition to this, humans are also able to learn new skills by applying relevant knowledge,
observing experts and refining through experience. Many current reinforcement learning
(RL) algorithms rely on a lengthy trial-and-error training process, making it infeasible
to train them in the real world. In this thesis, to address sparse, hierarchical problems
we propose the following: (1) an RL algorithm, Branched Rainbow from Demonstrations
(BRfD), which combines several improvements to the Deep Q-Networks (DQN) algorithm,
and is capable of learning from human demonstrations; (2) a hierarchically structured RL
algorithm using BRfD to solve a set of sub-tasks in order to reach a goal. We evaluate both
of these algorithms in the 2019 MineRL challenge environments. The MineRL competition
challenged participants to find a Diamond in Minecraft, a 3D, open-world, procedurally
generated game. We analyse the efficiency of several improvements implemented in the
BRfD algorithm through an extensive ablation study. For this study, the agents are tasked
with collecting 64 logs in a Minecraft forest environment. We show that our algorithm
outperforms the overall winner of the MineRL challenge in the TreeChop environment.
Additionally, we show that nearly all of the improvements impact the performance either in
terms of learning speed or rewards received. For the hierarchical algorithm, we segment the
demonstrations into the respective sub-tasks. The algorithm then trains a version of BRfD
on these demonstrations before learning from its own experiences in the environment. We
then evaluate the algorithm by inspecting the proportion of episodes in which certain items
were obtained. While our algorithm is able to obtain iron ore, the current state-of-the-art
algorithms are capable of obtaining a diamond.
AFRIKAANS SUMMARY (in translation): Humans have the exceptional ability to perform various
tasks at different levels of abstraction. Furthermore, new skills can be learned by applying
relevant knowledge, observing experts and refining through experience. Several existing
reinforcement learning algorithms rely on cumbersome trial-and-error training processes,
which makes them unviable in practice. In this thesis, to address the limited order
of importance, we propose the following: (1) a reinforcement learning
algorithm, "Branched Rainbow from Demonstrations (BRfD)", which combines several improvements
to the "Deep Q-Networks (DQN)" algorithm and learns from human demonstration;
(2) a hierarchically structured reinforcement learning algorithm that can solve several
sub-tasks by means of BRfD. We analyse both of the above algorithms in the 2019
"MineRL" environment. The "MineRL" competition challenged participants to find a Diamond
in "Minecraft". "Minecraft" is a three-dimensional, "open-world", progressively
generated computer game. Several improvements applied in the BRfD algorithm
are analysed through extensive ablation study methods. For the study, the agents
were tasked with collecting 64 "logs" in a "Minecraft" forest environment. We
show that this algorithm beats the overall winner in the "Treechop" environment of the 2019
"MineRL" challenge.
Furthermore, we show that nearly all improvements have a positive impact
with respect to learning speed or reward received. For the hierarchical algorithm, the
demonstrations are broken up into their various sub-tasks. The algorithm then learns a version
of BRfD by means of these demonstrations, based on its own experience in the
environment. We then evaluate the algorithms by examining the proportion of
episodes in which certain items were obtained. Our algorithm could only find iron ore, in contrast
with the current modern algorithms, which find a diamond.
Master
Policy-Based Planning for Robust Robot Navigation
This thesis proposes techniques for constructing and implementing an extensible navigation framework suitable for operating alongside or in place of traditional navigation systems. Robot navigation is only possible when many subsystems work in tandem such as localization and mapping, motion planning, control, and object tracking. Errors in any one of these subsystems can result in the robot failing to accomplish its task, oftentimes requiring human interventions that diminish the benefits theoretically provided by autonomous robotic systems.
Our first contribution is Direction Approximation through Random Trials (DART), a method for generating human-followable navigation instructions optimized for followability instead of traditional metrics such as path length. We show how this strategy can be extended to robot navigation planning, allowing the robot to compute the sequence of control policies and switching conditions maximizing the likelihood with which the robot will reach its goal. This technique allows robots to select plans based on reliability in addition to efficiency, avoiding error-prone actions or areas of the environment. We also show how DART can be used to build compact, topological maps of its environments, offering opportunities to scale to larger environments.
DART depends on the existence of a set of behaviors and switching conditions describing ways the robot can move through an environment. In the remainder of this thesis, we present methods for learning these behaviors and conditions in indoor environments. To support landmark-based navigation, we show how to train a Convolutional Neural Network (CNN) to distinguish between semantically labeled 2D occupancy grids generated from LIDAR data. By providing the robot the ability to recognize specific classes of places based on human labels, not only do we support transitioning between control laws, but also provide hooks for human-aided instruction and direction.
Additionally, we suggest a subset of behaviors that provide DART with a sufficient set of actions to navigate in most indoor environments and introduce a method to learn these behaviors from teleoperated demonstrations. Our method learns a cost function suitable for integration into gradient-based control schemes. This enables the robot to execute behaviors in the absence of global knowledge. We present results demonstrating these behaviors working in several environments with varied structure, indicating that they generalize well to new environments.
This work was motivated by the weaknesses and brittleness of many state-of-the-art navigation systems. Reliable navigation is the foundation of any mobile robotic system. It provides access to larger work spaces and enables a wide variety of tasks. Even though navigation systems have continued to improve, catastrophic failures can still occur (e.g. due to an incorrect loop closure) that limit their reliability. Furthermore, as work areas approach the scale of kilometers, constructing and operating on precise localization maps becomes expensive. These limitations prevent large scale deployments of robots outside of controlled settings and laboratory environments.
The work presented in this thesis is intended to augment or replace traditional navigation systems to mitigate concerns about scalability and reliability by considering the effects of navigation failures for particular actions. By considering these effects when evaluating the actions to take, our framework can adapt navigation strategies to best take advantage of the capabilities of the robot in a given environment. A natural output of our framework is a topological network of actions and switching conditions, providing compact representations of work areas suitable for fast, scalable planning.
PHD
Computer Science & Engineering
University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/144073/1/rgoeddel_1.pd