Generating Multi-Agent Trajectories using Programmatic Weak Supervision
We study the problem of training sequential generative models for capturing
coordinated multi-agent trajectory behavior, such as offensive basketball
gameplay. When modeling such settings, it is often beneficial to design
hierarchical models that can capture long-term coordination using intermediate
variables. Furthermore, these intermediate variables should capture interesting
high-level behavioral semantics in an interpretable and manipulatable way. We
present a hierarchical framework that can effectively learn such sequential
generative models. Our approach is inspired by recent work on leveraging
programmatically produced weak labels, which we extend to the spatiotemporal
regime. In addition to synthetic settings, we show how to instantiate our
framework to effectively model complex interactions between basketball players
and generate realistic multi-agent trajectories of basketball gameplay over
long time periods. We validate our approach using both quantitative and
qualitative evaluations, including a user study comparison conducted with
professional sports analysts.
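
As a rough illustration of what a programmatically produced weak label for trajectories could look like, the sketch below labels each agent with the grid cell it reaches at the end of a fixed window. The function name, the grid parameters, and the windowing rule are all hypothetical choices made for this example; the abstract does not specify the labeling functions actually used.

```python
import numpy as np

def label_macro_intents(traj, window=10, cell=2.0, grid_w=5, grid_h=5):
    """Hypothetical labeling function (an assumption, not the paper's).

    traj: array of shape (T, n_agents, 2) holding x, y positions.
    Returns an int array of shape (T, n_agents). Every step in a window
    receives the index of the grid cell the agent occupies at the
    window's last step.
    """
    traj = np.asarray(traj, dtype=float)
    T, n_agents, _ = traj.shape
    labels = np.empty((T, n_agents), dtype=int)
    for start in range(0, T, window):
        end = min(start + window, T)
        target = traj[end - 1]  # positions at the end of the window
        cols = np.clip(np.floor(target[:, 0] / cell).astype(int), 0, grid_w - 1)
        rows = np.clip(np.floor(target[:, 1] / cell).astype(int), 0, grid_h - 1)
        labels[start:end] = rows * grid_w + cols
    return labels

# One agent moving along the x axis, labeled in windows of two steps.
demo = np.array([[[0.0, 0.0]], [[1.0, 0.0]], [[2.0, 0.0]], [[3.0, 0.0]]])
demo_labels = label_macro_intents(demo, window=2)
```

Labels of this kind are cheap to compute over a whole dataset and could then serve as the intermediate variables of a hierarchical model.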
Causal-aware Safe Policy Improvement for Task-oriented dialogue
The recent success of reinforcement learning (RL) in solving complex tasks
is most often attributed to its capacity to explore and exploit an environment
where it has been trained. Sample efficiency is usually not an issue since
cheap simulators are available to sample data on-policy. On the other hand,
task-oriented dialogues are usually learnt from offline data collected using
human demonstrations. Collecting diverse demonstrations and annotating them is
expensive. Unfortunately, RL methods trained on off-policy data are
prone to issues of bias and generalization, which are further exacerbated by
stochasticity in human responses and the non-Markovian belief state of a dialogue
management system. To this end, we propose a batch RL framework for
task-oriented dialogue policy learning: causal-aware safe policy improvement
(CASPI). This method gives guarantees on the dialogue policy's performance and also
learns to shape rewards according to the intentions behind human responses, rather
than just mimicking demonstration data; this, coupled with batch RL, helps overall
with the sample efficiency of the framework. We demonstrate the effectiveness of
this framework on the dialogue-context-to-text generation and end-to-end dialogue
tasks of the MultiWOZ 2.0 dataset. The proposed method outperforms the current
state of the art on these metrics in both cases. In the end-to-end case, our
method trained on only 10% of the data was able to outperform the current state
of the art in three out of four evaluation metrics.
An Improved Data Augmentation Scheme for Model Predictive Control Policy Approximation
This paper considers the problem of data generation for MPC policy
approximation. Learning an approximate MPC policy from expert demonstrations
requires a large data set consisting of optimal state-action pairs, sampled
across the feasible state space. Yet, the key challenge of efficiently
generating the training samples has not been studied widely. Recently, a
sensitivity-based data augmentation framework for MPC policy approximation was
proposed, where the parametric sensitivities are exploited to cheaply generate
several additional samples from a single offline MPC computation. The error due
to augmenting the training data set with inexact samples was shown to increase
with the size of the neighborhood around each sample used for data
augmentation. Building upon this work, this letter presents an improved
data augmentation scheme based on predictor-corrector steps that enforces a
user-defined level of accuracy, and shows that the error bound of the augmented
samples is independent of the size of the neighborhood used for data
augmentation.
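
The predictor-corrector idea can be sketched on a toy problem. The example below is an assumption-laden stand-in, not the letter's algorithm: it replaces the MPC problem with a scalar unconstrained problem whose optimal input u*(x) satisfies u + u^3 - x = 0, uses the parametric sensitivity du*/dx for the predictor, and applies Newton steps as the corrector until a user-defined tolerance is met. All function names are hypothetical.

```python
def grad_u(u, x):
    """Stationarity residual of the toy problem min_u 0.5*u**2 + 0.25*u**4 - x*u."""
    return u + u**3 - x

def hess_uu(u):
    """Second derivative of the toy objective with respect to u."""
    return 1.0 + 3.0 * u**2

def solve(x, u_init=0.0, tol=1e-12, max_iter=50):
    """Stand-in for one offline 'MPC' solve: Newton's method on the residual."""
    u = u_init
    for _ in range(max_iter):
        r = grad_u(u, x)
        if abs(r) < tol:
            break
        u -= r / hess_uu(u)
    return u

def augment(x0, u0, dx, tol=1e-8, max_corr=10):
    """Generate one extra sample near (x0, u0) without a full re-solve."""
    # Predictor: first-order step along the sensitivity du*/dx = 1 / hess_uu(u0).
    x = x0 + dx
    u = u0 + dx / hess_uu(u0)
    # Corrector: Newton steps until the residual meets the requested accuracy.
    for _ in range(max_corr):
        r = grad_u(u, x)
        if abs(r) <= tol:
            break
        u -= r / hess_uu(u)
    return x, u

x0 = 1.0
u0 = solve(x0)
x_new, u_new = augment(x0, u0, dx=0.5)
predictor_only = u0 + 0.5 / hess_uu(u0)
```

In this toy setting the predictor-only sample drifts from the true optimum as `dx` grows, while the corrected sample meets the tolerance regardless of `dx`, which mirrors the qualitative claim in the abstract.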
Learning Calibratable Policies using Programmatic Style-Consistency
We study the important and challenging problem of controllable generation of long-term sequential behaviors. Solutions to this problem would impact many applications, such as calibrating behaviors of AI agents in games or predicting player trajectories in sports. In contrast to the well-studied areas of controllable generation of images, text, and speech, there are significant challenges that are unique to or exacerbated by generating long-term behaviors: how should we specify the factors of variation to control, and how can we ensure that the generated temporal behavior faithfully demonstrates diverse styles? In this paper, we leverage large amounts of raw behavioral data to learn policies that can be calibrated to generate a diverse range of behavior styles (e.g., aggressive versus passive play in sports). Inspired by recent work on leveraging programmatic labeling functions, we present a novel framework that combines imitation learning with data programming to learn style-calibratable policies. Our primary technical contribution is a formal notion of style-consistency as a learning objective, and its integration with conventional imitation learning approaches. We evaluate our framework using demonstrations from professional basketball players and agents in the MuJoCo physics environment, and show that our learned policies can be accurately calibrated to generate interesting behavior styles in both domains.
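
One plausible reading of style-consistency is the fraction of generated trajectories whose programmatic label matches the style that was requested. The sketch below measures that quantity under this assumption; the speed-based labeling function, its threshold, and the function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def speed_label(traj, threshold=1.0):
    """Hypothetical labeling function: 1 if the mean step length exceeds threshold, else 0."""
    steps = np.diff(np.asarray(traj, dtype=float), axis=0)
    return int(np.linalg.norm(steps, axis=1).mean() > threshold)

def style_consistency(trajs, requested, labeler=speed_label):
    """Fraction of trajectories whose programmatic label equals the requested style."""
    hits = [labeler(t) == y for t, y in zip(trajs, requested)]
    return float(np.mean(hits))

slow = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]]  # mean step length 0.5
fast = [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]]  # mean step length 2.0
score_matched = style_consistency([slow, fast], [0, 1])
score_mixed = style_consistency([slow, fast], [1, 1])
```

A policy would be well calibrated under this measure when the score stays close to 1 for every style it is asked to produce.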
Generative adversarial training of product of policies for robust and adaptive movement primitives
In learning from demonstrations, many generative models of trajectories make
simplifying assumptions of independence. Correctness is sacrificed in the name
of tractability and speed of the learning phase.
The ignored dependencies, which often are the kinematic and dynamic
constraints of the system, are then only restored when synthesizing the motion,
which introduces possibly heavy distortions.
In this work, we propose to use those approximate trajectory distributions as
close-to-optimal discriminators in the popular generative adversarial framework
to stabilize and accelerate the learning procedure.
The two problems of adaptability and robustness are addressed with our
method.
In order to adapt the motions to varying contexts, we propose to use a
product of Gaussian policies defined in several parametrized task spaces.
Robustness to perturbations and varying dynamics is ensured with the use of
stochastic gradient descent and ensemble methods to learn the stochastic
dynamics. Two experiments are performed on a 7-DoF manipulator to validate the
approach. Comment: Source code can be found here:
https://github.com/emmanuelpignat/tf_robot_learnin
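
The product of Gaussian policies mentioned above has a closed form when each task space is related to the control variable by a linear map. The sketch below assumes exactly that (each expert scores A x + b under a Gaussian with mean mu and covariance Sigma) and fuses the experts by summing their precisions. The linear-map assumption and the function name are illustrative; the abstract does not state how the task spaces are parametrized.

```python
import numpy as np

def product_of_gaussians(experts):
    """Fuse Gaussian experts defined in linearly related task spaces.

    experts: list of (A, b, mu, Sigma). Expert i evaluates A @ x + b under
    N(mu, Sigma). The product over experts is Gaussian in x with
    precision sum_i A_i^T Sigma_i^{-1} A_i.
    Returns the mean and covariance of that Gaussian.
    """
    dim = experts[0][0].shape[1]
    precision = np.zeros((dim, dim))
    info = np.zeros(dim)
    for A, b, mu, Sigma in experts:
        lam = np.linalg.inv(Sigma)       # precision of this expert
        precision += A.T @ lam @ A
        info += A.T @ lam @ (mu - b)
    cov = np.linalg.inv(precision)
    return cov @ info, cov

# Two one-dimensional experts with unit variance that disagree on the mean.
eye = np.eye(1)
fused_mean, fused_cov = product_of_gaussians([
    (eye, np.zeros(1), np.array([0.0]), eye),
    (eye, np.zeros(1), np.array([2.0]), eye),
])
```

Experts with small covariance dominate the fused mean, so a task space that matters in the current context pulls the resulting action towards its own target.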