307 research outputs found
Human-Machine Collaborative Optimization via Apprenticeship Scheduling
Coordinating agents to complete a set of tasks with intercoupled temporal and
resource constraints is computationally challenging, yet human domain experts
can solve these difficult scheduling problems using paradigms learned through
years of apprenticeship. A process for manually codifying this domain knowledge
within a computational framework is necessary to scale beyond the
"single-expert, single-trainee" apprenticeship model. However, human domain
experts often have difficulty describing their decision-making processes,
causing the codification of this knowledge to become laborious. We propose a
new approach for capturing domain-expert heuristics through a pairwise ranking
formulation. Our approach is model-free and does not require enumerating or
iterating through a large state space. We empirically demonstrate that this
approach accurately learns multifaceted heuristics on a synthetic data set
incorporating job-shop scheduling and vehicle routing problems, as well as on
two real-world data sets consisting of demonstrations of experts solving a
weapon-to-target assignment problem and a hospital resource allocation problem.
We also demonstrate that policies learned from human scheduling demonstration
via apprenticeship learning can substantially improve the efficiency of a
branch-and-bound search for an optimal schedule. We employ this human-machine
collaborative optimization technique on a variant of the weapon-to-target
assignment problem. We demonstrate that this technique generates solutions
substantially superior to those produced by human domain experts at a rate up
to 9.5 times faster than an optimization approach and can be applied to
optimally solve problems twice as complex as those solved by a human
demonstrator.

Comment: Portions of this paper were published in the Proceedings of the
International Joint Conference on Artificial Intelligence (IJCAI) in 2016 and
in the Proceedings of Robotics: Science and Systems (RSS) in 2016. The paper
consists of 50 pages with 11 figures and 4 tables.
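The pairwise-ranking idea described above can be sketched as follows. This is a minimal illustrative reconstruction, not the paper's exact formulation: at each decision point in an expert demonstration, the action the expert chose is paired against each action not chosen, and a ranking function is trained on the feature differences. The function names (`make_pairs`, `train_ranker`, `pick_action`), the toy features, and the plain logistic trainer are all assumptions for illustration.

```python
import numpy as np

def make_pairs(demos):
    """Each demo is (features of the expert's chosen action,
    list of feature vectors of the actions not chosen).
    Emits difference vectors labeled +1 (chosen - other)
    and -1 (other - chosen)."""
    X, y = [], []
    for chosen, others in demos:
        for other in others:
            X.append(chosen - other); y.append(1.0)
            X.append(other - chosen); y.append(-1.0)
    return np.array(X), np.array(y)

def train_ranker(X, y, lr=0.1, epochs=200):
    """Logistic regression on the pairwise difference vectors;
    the learned weights score scheduling actions."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))       # P(first action preferred)
        w += lr * X.T @ ((y + 1) / 2 - p) / len(y)  # gradient ascent step
    return w

def pick_action(w, candidates):
    """At schedule time, greedily pick the highest-scoring candidate action."""
    return int(np.argmax(candidates @ w))
```

Note that this is model-free in the sense the abstract describes: the ranker never enumerates the scheduling state space; it only compares candidate actions at each decision point.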
Distance-rank Aware Sequential Reward Learning for Inverse Reinforcement Learning with Sub-optimal Demonstrations
Inverse reinforcement learning (IRL) aims to explicitly infer an underlying
reward function based on collected expert demonstrations. Considering that
obtaining expert demonstrations can be costly, the focus of current IRL
techniques is on learning a better-than-demonstrator policy using a reward
function derived from sub-optimal demonstrations. However, existing IRL
algorithms primarily tackle the challenge of trajectory-ranking ambiguity when
learning the reward function, but overlook the degree of difference between
trajectories in terms of their returns, which is
essential for further reducing reward ambiguity. Additionally, it is important
to note that the reward of a single transition is heavily influenced by the
context information within the trajectory. To address these issues, we
introduce the Distance-rank Aware Sequential Reward Learning (DRASRL)
framework. Unlike existing approaches, DRASRL takes into account both the
ranking of trajectories and the degrees of dissimilarity between them to
collaboratively eliminate reward ambiguity when learning a sequence of
contextually informed reward signals. Specifically, we leverage the distance
between policies, from which the trajectories are generated, as a measure to
quantify the degree of differences between traces. This distance-aware
information is then used to infer embeddings in the representation space for
reward learning, employing the contrastive learning technique. Meanwhile, we
integrate the pairwise ranking loss function to incorporate ranking information
into the latent features. Moreover, we resort to the Transformer architecture
to capture the contextual dependencies within the trajectories in the latent
space, leading to more accurate reward estimation. Through extensive
experimentation, our DRASRL framework demonstrates significant performance
improvements over previous state-of-the-art methods.
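The pairwise ranking loss mentioned above can be sketched as a Bradley-Terry style objective over predicted trajectory returns, with a per-pair weight standing in for the policy-distance information. This is a hedged illustration, not DRASRL's exact loss: the function name `pairwise_ranking_loss` and the way the distance weight multiplies the loss are assumptions.

```python
import numpy as np

def pairwise_ranking_loss(ret_lo, ret_hi, weight=1.0):
    """Bradley-Terry style ranking loss: the trajectory with predicted
    return `ret_hi` is ranked above the one with `ret_lo`.
    `weight` is an assumed distance-derived pair weight, standing in for
    the policy-distance information the abstract describes."""
    # Numerically stable -log softmax over the two predicted returns
    m = max(ret_lo, ret_hi)
    log_z = m + np.log(np.exp(ret_lo - m) + np.exp(ret_hi - m))
    return weight * (log_z - ret_hi)
```

The loss is near zero when the reward model already assigns the higher return to the better-ranked trajectory, and grows when the ranking is violated; scaling it by a distance-derived weight pushes the model to separate very different trajectory pairs by a larger return margin.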
Inferring Versatile Behavior from Demonstrations by Matching Geometric Descriptors
Humans intuitively solve tasks in versatile ways, varying their behavior both
in trajectory-level planning and in individual steps. Thus, they can
easily generalize and adapt to new and changing environments. Current Imitation
Learning algorithms often only consider unimodal expert demonstrations and act
in a state-action-based setting, making it difficult for them to imitate human
behavior in case of versatile demonstrations. Instead, we combine a mixture of
movement primitives with a distribution matching objective to learn versatile
behaviors that match the expert's behavior and versatility. To facilitate
generalization to novel task configurations, we do not directly match the
agent's and expert's trajectory distributions but rather work with concise
geometric descriptors which generalize well to unseen task configurations. We
empirically validate our method on various robot tasks using versatile human
demonstrations and compare to imitation learning algorithms in a state-action
setting as well as a trajectory-based setting. We find that the geometric
descriptors greatly help in generalizing to new task configurations and that
combining them with our distribution-matching objective is crucial for
representing and reproducing versatile behavior.

Comment: Accepted as a poster at the 6th Conference on Robot Learning (CoRL),
2022.
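The descriptor-matching idea above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the paper's method: the toy `descriptor` (first, middle, and last points of a trajectory) stands in for the paper's task-specific geometric descriptors, and squared maximum mean discrepancy (`mmd2`) stands in for its distribution-matching objective.

```python
import numpy as np

def descriptor(traj):
    """Toy geometric descriptor: concatenation of the first, middle, and
    last points of a trajectory. The paper's descriptors are task-specific;
    this choice is purely illustrative."""
    return np.concatenate([traj[0], traj[len(traj) // 2], traj[-1]])

def mmd2(X, Y, gamma=1.0):
    """Squared maximum mean discrepancy with an RBF kernel, one common
    objective for matching two sample distributions (here, the agent's
    and the expert's descriptor sets)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```

Matching in descriptor space rather than raw trajectory space is the key point of the abstract: two trajectories that differ in detail but share the same start, via, and end structure count as similar, which is what lets the learned behavior transfer to unseen task configurations.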