116 research outputs found
Adversarial Imitation Learning from Incomplete Demonstrations
Imitation learning aims to derive a mapping from states to actions, i.e., a policy, from expert demonstrations. Existing methods for imitation learning typically require all actions in the demonstrations to be fully available, which is hard to ensure in real applications. Although algorithms for learning with unobservable actions have been proposed, they focus solely on state information and overlook the fact that the action sequence may still be partially available and provide useful information for deriving a policy. In this
paper, we propose a novel algorithm called Action-Guided Adversarial Imitation
Learning (AGAIL) that learns a policy from demonstrations with incomplete
action sequences, i.e., incomplete demonstrations. The core idea of AGAIL is to
separate demonstrations into state and action trajectories, and train a policy
with state trajectories while using actions as auxiliary information to guide
the training whenever applicable. Built upon Generative Adversarial Imitation Learning (GAIL), AGAIL has three components: a generator, a discriminator,
and a guide. The generator learns a policy with rewards provided by the
discriminator, which tries to distinguish state distributions between
demonstrations and samples generated by the policy. The guide provides
additional rewards to the generator when demonstrated actions for specific
states are available. We compare AGAIL to other methods on benchmark tasks and
show that AGAIL consistently delivers comparable performance to the
state-of-the-art methods even when the action sequence in demonstrations is
only partially available.
Comment: Accepted to the International Joint Conference on Artificial Intelligence (IJCAI-19).
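The sketch below illustrates how an AGAIL-style reward could be composed from a state-only discriminator plus an optional guide term that is added only when the demonstrated action for a state is available. It is a minimal sketch: the network sizes, the log pi(a_demo|s) guide term, and the weighting guide_coef are illustrative assumptions, not the paper's implementation.

# Minimal sketch of an AGAIL-style reward (assumption: guide term and
# `guide_coef` weighting are illustrative, not the authors' exact code).
import torch
import torch.nn as nn

class StateDiscriminator(nn.Module):
    """Distinguishes the expert state distribution from policy-generated states."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, s):
        return self.net(s)  # probability that `s` comes from the expert

def agail_reward(disc, policy_states, guide_logprob=None, guide_coef=1.0):
    """Adversarial reward on states, plus a guide reward that is only added
    for states whose demonstrated actions are available."""
    with torch.no_grad():
        d = disc(policy_states).clamp(1e-6, 1 - 1e-6)
        reward = -torch.log(1.0 - d).squeeze(-1)      # reward from the discriminator
        if guide_logprob is not None:                 # e.g., log pi(a_demo | s)
            reward = reward + guide_coef * guide_logprob
    return reward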
TTA-Nav: Test-time Adaptive Reconstruction for Point-Goal Navigation under Visual Corruptions
Robot navigation under visual corruption presents a formidable challenge. To
address this, we propose a Test-time Adaptation (TTA) method, named TTA-Nav, for point-goal navigation under visual corruptions. Our "plug-and-play" method incorporates a top-down decoder into a pre-trained navigation model. First, the pre-trained navigation model receives a corrupted image and extracts features. Second, the top-down decoder produces a reconstruction from the high-level features extracted by the pre-trained model. The reconstruction of the corrupted image is then fed back to the pre-trained model. Finally, the pre-trained model performs a forward pass again to output an action. Despite being trained solely on
clean images, the top-down decoder can reconstruct cleaner images from
corrupted ones without the need for gradient-based adaptation. The pre-trained
navigation model with our top-down decoder significantly enhances navigation
performance across almost all visual corruptions in our benchmarks. Our method
improves the success rate of point-goal navigation from the state-of-the-art
result of 46% to 94% on the most severe corruption. This suggests its potential
for broader application in robotic visual navigation. Project page:
https://sites.google.com/view/tta-nav
Comment: Submitted to IROS202
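The feedback loop described above can be summarized in a few lines. This is a sketch only: the module names encoder, top_down_decoder, and policy_head are placeholders I introduce for illustration, not the authors' actual API.

# Illustrative sketch of the TTA-Nav reconstruction feedback loop
# (assumption: module names are placeholders, not the authors' code).
import torch

@torch.no_grad()  # no gradient-based adaptation is needed at test time
def tta_nav_step(encoder, top_down_decoder, policy_head, corrupted_obs):
    feats = encoder(corrupted_obs)            # 1) extract high-level features
    reconstruction = top_down_decoder(feats)  # 2) reconstruct a cleaner image
    clean_feats = encoder(reconstruction)     # 3) feed the reconstruction back
    return policy_head(clean_feats)           # 4) forward pass again to get an action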
Deterministic and Discriminative Imitation (D2-Imitation): Revisiting Adversarial Imitation for Sample Efficiency
Sample efficiency is crucial for imitation learning methods to be applicable in real-world settings. Many studies improve sample efficiency by extending adversarial imitation to be off-policy, even though these off-policy extensions may change the original objective or involve complicated optimization. We revisit the foundation of adversarial imitation and propose an off-policy, sample-efficient approach that requires no
adversarial training or min-max optimization. Our formulation capitalizes on
two key insights: (1) the similarity between the Bellman equation and the
stationary state-action distribution equation allows us to derive a novel
temporal difference (TD) learning approach; and (2) the use of a deterministic
policy simplifies the TD learning. Combined, these insights yield a practical
algorithm, Deterministic and Discriminative Imitation (D2-Imitation), which
operates by first partitioning samples into two replay buffers and then
learning a deterministic policy via off-policy reinforcement learning. Our
empirical results show that D2-Imitation is effective in achieving good sample
efficiency, outperforming several off-policy extension approaches of
adversarial imitation on many control tasks.
Comment: AAAI 202
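The abstract's two-buffer partitioning step can be pictured as below. This is a rough sketch of the idea only; assigning a fixed +1/0 reward per buffer and feeding the batch to a DDPG-style deterministic-policy TD update is my illustrative reading, not necessarily the paper's exact formulation.

# Rough sketch of the two-replay-buffer partitioning (assumption: +1/0 rewards
# per buffer are illustrative, not the paper's exact reward definition).
import random

expert_buffer, policy_buffer = [], []   # the two replay buffers

def store(transition, is_expert_like):
    """Partition incoming (s, a, s') samples into the two buffers."""
    (expert_buffer if is_expert_like else policy_buffer).append(transition)

def sample_batch(batch_size=64):
    """Draw half of each batch from each buffer; the reward depends on the buffer."""
    half = batch_size // 2
    batch = [(s, a, 1.0, s2) for (s, a, s2) in random.sample(expert_buffer, half)]
    batch += [(s, a, 0.0, s2) for (s, a, s2) in random.sample(policy_buffer, half)]
    return batch  # fed to an off-policy TD update of a deterministic policy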
Investigating the Effects of Robot Engagement Communication on Learning from Demonstration
Robot Learning from Demonstration (RLfD) is a technique for robots to derive
policies from instructors' examples. Although the reciprocal effects of student
engagement on teacher behavior are widely recognized in the educational
community, it is unclear whether the same phenomenon holds true for RLfD. To
fill this gap, we first design three types of robot engagement behavior
(attention, imitation, and a hybrid of the two) based on the learning
literature. We then conduct, in a simulation environment, a within-subject user
study to investigate the impact of different robot engagement cues on humans
compared to a "without-engagement" condition. Results suggest that engagement communication significantly changes humans' estimation of the robot's capability and significantly raises their expectations of the learning outcomes, even though we do not run actual learning algorithms in the experiments. Moreover, imitation behavior affects humans more than attention does on all metrics, while their combination has the most profound influence on humans. We also find that communicating engagement via imitation or the combined behavior significantly improves humans' perception of the quality of demonstrations, even though all demonstrations are of the same quality.
Comment: Under review.
Comparing the Efficacy of Fine-Tuning and Meta-Learning for Few-Shot Policy Imitation
In this paper we explore few-shot imitation learning for control problems,
which involves learning to imitate a target policy by accessing a limited set
of offline rollouts. This setting has been relatively under-explored despite
its relevance to robotics and control applications. State-of-the-art methods
developed to tackle few-shot imitation rely on meta-learning, which is
expensive to train as it requires access to a distribution over tasks (rollouts
from many target policies and variations of the base environment). Given this limitation, we investigate an alternative approach, fine-tuning, a family of
methods that pretrain on a single dataset and then fine-tune on unseen
domain-specific data. Recent work has shown that fine-tuners outperform
meta-learners in few-shot image classification tasks, especially when the data
is out-of-domain. Here we evaluate to what extent this is true for control
problems, proposing a simple yet effective baseline which relies on two stages:
(i) training a base policy online via reinforcement learning (e.g. Soft
Actor-Critic) on a single base environment, (ii) fine-tuning the base policy
via behavioral cloning on a few offline rollouts of the target policy. Despite its simplicity, this baseline is competitive with meta-learning methods under a variety of conditions and is able to imitate target policies trained on unseen
variations of the original environment. Importantly, the proposed approach is
practical and easy to implement, as it does not need any complex meta-training
protocol. As a further contribution, we release an open source dataset called
iMuJoCo (iMitation MuJoCo) consisting of 154 variants of popular OpenAI-Gym
MuJoCo environments with associated pretrained target policies and rollouts,
which can be used by the community to study few-shot imitation learning and
offline reinforcement learning.
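Stage (ii) of the baseline described above is plain behavioral cloning on a handful of target rollouts. The sketch below shows what that fine-tuning step could look like; the continuous-action policy, the MSE cloning loss, and the hyperparameters are assumptions made for illustration.

# Minimal sketch of stage (ii): behavioral-cloning fine-tuning of a pre-trained
# base policy (assumption: MSE loss and hyperparameters are illustrative).
import torch
import torch.nn as nn

def finetune_bc(base_policy: nn.Module, rollouts, epochs=50, lr=3e-4):
    """`rollouts` is a list of (state, action) tensor pairs from the target policy."""
    optim = torch.optim.Adam(base_policy.parameters(), lr=lr)
    states = torch.stack([s for s, _ in rollouts])
    actions = torch.stack([a for _, a in rollouts])
    for _ in range(epochs):
        pred = base_policy(states)                    # actions proposed by the base policy
        loss = nn.functional.mse_loss(pred, actions)  # imitate the target's actions
        optim.zero_grad()
        loss.backward()
        optim.step()
    return base_policy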
Improve Transformer Pre-Training with Decoupled Directional Relative Position Encoding and Representation Differentiations
In this work, we revisit the Transformer-based pre-trained language models
and identify two problems that may limit the expressiveness of the model.
Firstly, existing relative position encoding models (e.g., T5 and DeBERTa) conflate two heterogeneous kinds of information: relative distance and direction. This may leave the model unable to capture the associative semantics of the same direction or the same distance, which in turn affects the performance of downstream tasks. Secondly, we notice that BERT pre-trained with the Masked Language Modeling (MLM) objective outputs similar token representations and similar attention weights across different heads, which may impose difficulties in
capturing discriminative semantic representations. Motivated by the above
investigation, we propose two novel techniques to improve pre-trained language
models: Decoupled Directional Relative Position (DDRP) encoding and MTH
pre-training objective. DDRP decouples the relative distance features and the
directional features in classical relative position encoding for better
position information understanding. MTH adds two novel auxiliary losses besides MLM to enlarge the dissimilarities between (a) the last hidden states of different tokens and (b) the attention weights of different heads, alleviating the homogenization and anisotropy problems in representation learning for better optimization. Extensive experiments and ablation studies on the GLUE benchmark demonstrate the effectiveness of our proposed methods.
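One simple way to picture the decoupling of distance and direction is an additive relative position bias with separate distance and direction embedding tables per attention head, as sketched below. This is my illustrative reading of the decoupling idea; the paper's exact DDRP parameterization may differ.

# Illustrative sketch of a decoupled directional relative position bias
# (assumption: additive per-head distance and direction embeddings; not
# necessarily the exact DDRP formulation).
import torch
import torch.nn as nn

class DecoupledRelativeBias(nn.Module):
    def __init__(self, num_heads, max_distance=128):
        super().__init__()
        self.distance_emb = nn.Embedding(max_distance + 1, num_heads)  # encodes |i - j|
        self.direction_emb = nn.Embedding(3, num_heads)                # encodes sign(i - j)

    def forward(self, seq_len):
        pos = torch.arange(seq_len)
        rel = pos[None, :] - pos[:, None]                              # (L, L) signed offsets
        dist = rel.abs().clamp(max=self.distance_emb.num_embeddings - 1)
        direction = rel.sign() + 1                                     # map {-1, 0, 1} -> {0, 1, 2}
        bias = self.distance_emb(dist) + self.direction_emb(direction)
        return bias.permute(2, 0, 1)                                   # (heads, L, L), added to attention logits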
Trust-Region-Free Policy Optimization for Stochastic Policies
Trust Region Policy Optimization (TRPO) is an iterative method that
simultaneously maximizes a surrogate objective and enforces a trust region
constraint over consecutive policies in each iteration. The combination of the
surrogate objective maximization and the trust region enforcement has been
shown to be crucial to guarantee a monotonic policy improvement. However,
solving a trust-region-constrained optimization problem can be computationally intensive, as it requires many conjugate gradient steps and a large number of on-policy samples. In this paper, we show that the trust region constraint over
policies can be safely substituted by a trust-region-free constraint without
compromising the underlying monotonic improvement guarantee. The key idea is to
generalize the surrogate objective used in TRPO in a way that a monotonic
improvement guarantee still emerges as a result of constraining the maximum
advantage-weighted ratio between policies. This new constraint outlines a
conservative mechanism for iterative policy optimization and sheds light on
practical ways to optimize the generalized surrogate objective. We show that
the new constraint can be effectively enforced by being conservative when
optimizing the generalized objective function in practice. We call the
resulting algorithm Trust-REgion-Free Policy Optimization (TREFree) as it is
free of any explicit trust region constraints. Empirical results show that
TREFree outperforms TRPO and Proximal Policy Optimization (PPO) in terms of
policy performance and sample efficiency.
Comment: RLDM 202
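To make the "constrain the maximum advantage-weighted ratio" idea concrete, the sketch below adds a penalty on that maximum to a standard advantage-weighted-ratio surrogate. This is a generic penalty-method stand-in that I introduce purely for illustration; the abstract does not specify the actual TREFree objective, and this is not it.

# Generic sketch of a conservative, advantage-weighted-ratio surrogate
# (assumption: the max-ratio penalty and `beta` are illustrative stand-ins,
# not the actual TREFree objective).
import torch

def surrogate_with_max_ratio_penalty(logp_new, logp_old, advantages, beta=1.0):
    ratio = torch.exp(logp_new - logp_old)                 # pi_new(a|s) / pi_old(a|s)
    surrogate = (ratio * advantages).mean()                # standard surrogate objective
    max_weighted_ratio = (ratio * advantages.abs()).max()  # quantity to keep small
    return surrogate - beta * max_weighted_ratio           # penalized, conservative objective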
Revisiting Estimation Bias in Policy Gradients for Deep Reinforcement Learning
We revisit the estimation bias in policy gradients for the discounted episodic Markov decision process (MDP) from a Deep Reinforcement Learning (DRL) perspective. The objective is formulated theoretically as the expected return
discounted over the time horizon. One of the major policy gradient biases is
the state distribution shift: the state distribution used to estimate the
gradients differs from the theoretical formulation in that it does not take
into account the discount factor. Existing discussions of the influence of this bias in the literature have been limited to the tabular and softmax cases. Therefore, in this paper, we extend the analysis to the DRL setting, where the policy is parameterized, and demonstrate theoretically how this bias can lead to suboptimal policies. We then discuss why implementations with a shifted state distribution, though theoretically inaccurate, can still be effective empirically. We show that, despite
such state distribution shift, the policy gradient estimation bias can be
reduced in the following three ways: 1) a small learning rate; 2) an
adaptive-learning-rate-based optimizer; and 3) KL regularization. Specifically,
we show that a smaller learning rate, or an adaptive learning rate such as that used by the Adam and RMSProp optimizers, makes the policy optimization robust
to the bias. We further draw connections between optimizers and the
optimization regularization to show that both the KL and the reverse KL
regularization can significantly rectify this bias. Moreover, we provide
extensive experiments on continuous control tasks to support our analysis. Our
paper sheds light on how successful PG algorithms optimize policies in the DRL
setting, and contributes insights into the practical issues in DRL.
Comment: 12 pages, 9 figures.
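The state-distribution-shift bias described above comes from dropping the discount on the state visitation weights. The toy sketch below contrasts the theoretically correct estimator, which weights the state at time t by gamma**t, with the common undiscounted implementation; the REINFORCE-style single-trajectory form is an assumption made for illustration.

# Toy sketch contrasting the discounted (theoretical) and undiscounted
# (common-practice) policy-gradient estimators (assumption: REINFORCE-style
# single-trajectory loss, for illustration only).
import torch

def policy_gradient_loss(logps, returns, gamma=0.99, discount_states=True):
    """`logps[t]` is log pi(a_t | s_t); `returns[t]` is the return from step t."""
    T = len(logps)
    if discount_states:
        # Theoretical objective: the state at time t is weighted by gamma**t.
        weights = torch.tensor([gamma**t for t in range(T)])
    else:
        # Common DRL implementation: the discount on the state distribution is
        # dropped, which introduces the state-distribution-shift bias.
        weights = torch.ones(T)
    return -(weights * logps * returns).mean()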