A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning
Sequential prediction problems such as imitation learning, where future
observations depend on previous predictions (actions), violate the common
i.i.d. assumptions made in statistical learning. This leads to poor performance
in theory and often in practice. Some recent approaches provide stronger
guarantees in this setting, but remain somewhat unsatisfactory as they train
either non-stationary or stochastic policies and require a large number of
iterations. In this paper, we propose a new iterative algorithm that trains a
stationary deterministic policy and can be seen as a no-regret algorithm in
an online learning setting. We show that any such no-regret algorithm, combined
with additional reduction assumptions, must find a policy with good performance
under the distribution of observations it induces in such sequential settings.
We demonstrate that this new approach outperforms previous approaches on two
challenging imitation learning problems and a benchmark sequence labeling
problem.
Comment: Appearing in the 14th International Conference on Artificial
Intelligence and Statistics (AISTATS 2011).
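To make the iterative procedure concrete, the following is a minimal sketch of the dataset-aggregation loop this abstract describes; the environment/expert interface and the base learner are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of the iterative training loop described above (DAgger-style).
# The env/expert interface and the base learner are illustrative assumptions.
from sklearn.linear_model import SGDClassifier

def train_iteratively(env, expert, n_iters=10, horizon=100):
    states, actions = [], []          # aggregated dataset of expert labels
    policy = None
    for _ in range(n_iters):
        s = env.reset()
        for _ in range(horizon):
            # Visit states under the current learned policy (the expert on the
            # first iteration), but always record the expert's action label.
            states.append(s)
            actions.append(expert(s))
            a = expert(s) if policy is None else policy.predict([s])[0]
            s, done = env.step(a)
            if done:
                break
        # Retrain a single stationary deterministic policy on all data so far;
        # any no-regret online learner in this slot yields the paper's bound.
        policy = SGDClassifier().fit(states, actions)
    return policy
```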
Reinforcement and Imitation Learning via Interactive No-Regret Learning
Recent work has demonstrated that problems, particularly imitation learning
and structured prediction, where a learner's predictions influence the
input distribution it is tested on, can be naturally addressed by an interactive
approach and analyzed using no-regret online learning. These approaches to
imitation learning, however, neither require nor benefit from information about
the cost of actions. We extend existing results in two directions: first, we
develop an interactive imitation learning approach that leverages cost
information; second, we extend the technique to address reinforcement learning.
The results provide theoretical support to the commonly observed successes of
online approximate policy iteration. Our approach suggests a broad new family
of algorithms and provides a unifying view of existing techniques for imitation
and reinforcement learning.
Comment: 14 pages. Under review for the NIPS 2014 conference.
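As a rough illustration of how cost information can enter the interactive loop, here is a hedged sketch of one iteration: roll in with the learner to a random time, probe each action, and finish the trajectory with the expert to estimate cost-to-go. The resettable-simulator interface (`env.set_state`) and all names are assumptions for the sketch.

```python
# Hedged sketch of one cost-aware interactive iteration: roll in with the
# learner, probe each action, then roll out with the expert to estimate
# cost-to-go. The resettable simulator is an assumed interface.
import random

def cost_to_go(env, expert, s, steps):
    total = 0.0
    for _ in range(steps):             # expert roll-out from state s
        s, c = env.step(expert(s))
        total += c
    return total

def collect_cost_labels(env, expert, policy, dataset, horizon=100):
    t = random.randrange(horizon)      # random switching time
    s = env.reset()
    for _ in range(t):                 # roll in with the current learner
        s, _ = env.step(policy(s))
    costs = {}
    for a in env.actions:              # probe every action once
        env.set_state(s)               # assumes the simulator can be reset
        s2, c = env.step(a)
        costs[a] = c + cost_to_go(env, expert, s2, horizon - t - 1)
    dataset.append((s, costs))         # later: train a cost-sensitive
    return dataset                     # classifier on the aggregate
```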
Deeply AggreVaTeD: Differentiable Imitation Learning for Sequential Prediction
Researchers have demonstrated state-of-the-art performance in sequential
decision making problems (e.g., robotics control, sequential prediction) with
deep neural network models. One often has access to near-optimal oracles that
achieve good performance on the task during training. We demonstrate that
AggreVaTeD, a policy gradient extension of the Imitation Learning (IL)
approach of Ross & Bagnell (2014), can leverage such an oracle to achieve
faster and better solutions with less training data than a less-informed
Reinforcement Learning (RL) technique. Using both feedforward and recurrent
neural network predictors, we present stochastic gradient procedures on a
sequential prediction task, dependency-parsing from raw image data, as well as
on various high dimensional robotics control problems. We also provide a
comprehensive theoretical study of IL that demonstrates we can expect up to
exponentially lower sample complexity for learning with AggreVaTeD than with RL
algorithms, which backs our empirical findings. Our results and theory indicate
that the proposed approach can achieve superior performance with respect to the
oracle when the demonstrator is sub-optimal.
Comment: 17 pages.
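A hedged sketch of the differentiable update this abstract suggests: a policy-gradient step whose per-action weight is the oracle's cost-to-go advantage rather than an environment return. `oracle_advantage` and the environment interface are assumed for illustration, not taken from the paper.

```python
# Illustrative policy-gradient step weighted by an oracle's cost-to-go
# advantage (lower cost is better), in the spirit of the approach above.
# `oracle_advantage`, `env`, and all shapes are assumptions for the sketch.
import torch

def differentiable_il_step(policy_net, optimizer, env, oracle_advantage,
                           horizon=100):
    s, loss = env.reset(), torch.tensor(0.0)
    for _ in range(horizon):
        logits = policy_net(torch.as_tensor(s, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()
        # Minimizing log-prob times a cost-based advantage pushes probability
        # away from actions the oracle scores as expensive.
        loss = loss + dist.log_prob(a) * oracle_advantage(s, a.item())
        s, done = env.step(a.item())
        if done:
            break
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```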
Learning Reductions that Really Work
We provide a summary of the mathematical and computational techniques that
have enabled learning reductions to effectively address a wide class of
problems, and show that this approach to solving machine learning problems can
be broadly useful.
Learning Beam Search Policies via Imitation Learning
Beam search is widely used for approximate decoding in structured prediction
problems. Models often use a beam at test time but ignore its existence at
train time, and therefore do not explicitly learn how to use the beam. We
develop a unifying meta-algorithm for learning beam search policies using
imitation learning. In our setting, the beam is part of the model, and not just
an artifact of approximate decoding. Our meta-algorithm captures existing
learning algorithms and suggests new ones. It also lets us show novel no-regret
guarantees for learning beam search policies.
Comment: Published in NIPS 2018.
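To show what "the beam is part of the model" can look like in code, here is a minimal sketch in which a learned scorer drives the beam at train time exactly as at test time, and a training signal fires when the reference hypothesis falls out of the beam. All callables (`expand`, `score`, `is_gold`) are assumed interfaces, not the paper's meta-algorithm.

```python
# Minimal sketch: the learned scorer ranks expansions during training just as
# it does during decoding, and a loss is recorded whenever the gold hypothesis
# leaves the beam. `expand`, `score`, and `is_gold` are assumed callables.
def beam_step(beam, expand, score, width):
    candidates = [c for h in beam for c in expand(h)]
    return sorted(candidates, key=score, reverse=True)[:width]

def run_with_imitation_signal(init, expand, score, width, is_gold, max_steps):
    beam, violations = [init], []
    for _ in range(max_steps):
        beam = beam_step(beam, expand, score, width)
        if not any(is_gold(h) for h in beam):
            # The reference prefix fell off the beam: pair it with the best
            # surviving hypothesis as a training example for the scorer.
            violations.append(beam[0])
            break
    return beam, violations
```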
Curriculum-Based Neighborhood Sampling For Sequence Prediction
The task of multi-step ahead prediction in language models is challenging
considering the discrepancy between training and testing. At test time, a
language model is required to make predictions given past predictions as input,
instead of the past targets that are provided during training. This difference,
known as exposure bias, can lead to the compounding of errors along a generated
sequence at test time.
In order to improve generalization in neural language models and address
compounding errors, we propose a curriculum-learning-based method, which we
refer to as \textit{Nearest-Neighbor Replacement Sampling}, that gradually
changes an initially deterministic teacher policy into an increasingly
stochastic one. At a given timestep, a chosen input is replaced with a nearest
neighbor of the past target, sampled with truncated probability proportional
to the cosine similarity between the original word and its most similar words.
This allows exploration of alternatives when the teacher provides a
sub-optimal policy or when the initial policy is difficult for the learner to
model. The proposed strategy is straightforward, operates online, and requires
little additional memory. We report our main findings on two language
modelling benchmarks and find that the proposed approach performs particularly
well when used in conjunction with scheduled sampling, which also attempts to
mitigate compounding errors in language models.
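A hedged sketch of the replacement rule as described: with some curriculum probability, the ground-truth input token is swapped for one of its nearest neighbors, sampled in proportion to cosine similarity. The embedding matrix `emb`, the value of k, and the schedule for `replace_prob` are assumptions.

```python
# Sketch of nearest-neighbor replacement: occasionally swap the teacher-forced
# token for a neighbor sampled proportionally to cosine similarity. The
# embedding matrix `emb`, k, and the curriculum schedule are assumptions.
import numpy as np

def replace_with_neighbor(token_id, emb, k=5, replace_prob=0.25, rng=np.random):
    if rng.random() > replace_prob:       # curriculum: replace_prob starts
        return token_id                   # near 0 and grows during training
    v = emb[token_id]
    sims = emb @ v / (np.linalg.norm(emb, axis=1) * np.linalg.norm(v) + 1e-8)
    sims[token_id] = -np.inf              # never "replace" with the same word
    top = np.argsort(sims)[-k:]           # the k most similar words
    w = np.clip(sims[top], 1e-8, None)    # guard against negative similarities
    return int(rng.choice(top, p=w / w.sum()))
```

Increasing `replace_prob` over training epochs gives the deterministic-to-stochastic curriculum the abstract describes.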
Accelerating Imitation Learning with Predictive Models
Sample efficiency is critical in solving real-world reinforcement learning
problems, where agent-environment interactions can be costly. Imitation
learning from expert advice has proved to be an effective strategy for reducing
the number of interactions required to train a policy. Online imitation
learning, which interleaves policy evaluation and policy optimization, is a
particularly effective technique with provable performance guarantees. In this
work, we seek to further accelerate the convergence rate of online imitation
learning, thereby making it more sample efficient. We propose two model-based
algorithms inspired by Follow-the-Leader (FTL) with prediction: MoBIL-VI based
on solving variational inequalities and MoBIL-Prox based on stochastic
first-order updates. These two methods leverage a model to predict future
gradients to speed up policy learning. When the model oracle is learned online,
these algorithms can provably accelerate the best known convergence rate up to
an order. Our algorithms can be viewed as a generalization of stochastic
Mirror-Prox (Juditsky et al., 2011), and admit a simple constructive FTL-style
analysis of performance.
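The acceleration mechanism can be illustrated with an optimistic two-step first-order update, in which a learned model supplies a cheap prediction of the next gradient; this mirrors the stochastic Mirror-Prox structure the abstract cites, with both gradient oracles assumed for the sketch.

```python
# Illustrative optimistic update: step with the model's predicted gradient,
# then correct with a real gradient evaluated at the predicted point (the
# extragradient / Mirror-Prox pattern). Both gradient oracles are assumed.
def optimistic_step(theta, real_grad, predicted_grad, lr=0.01):
    theta_hat = theta - lr * predicted_grad(theta)   # cheap model prediction
    return theta - lr * real_grad(theta_hat)         # corrected real step
```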
Inspiration Learning through Preferences
Current imitation learning techniques are too restrictive because they
require the agent and expert to share the same action space. However,
oftentimes agents that act differently from the expert can solve the task just
as well. For example, a person lifting a box can be imitated by a
ceiling-mounted robot or a desktop-based robotic arm. In both cases, the end goal of
lifting the box is achieved, perhaps using different strategies. We denote this
setup as \textit{Inspiration Learning} - knowledge transfer between agents that
operate in different action spaces. Since state-action expert demonstrations
can no longer be used, Inspiration Learning requires novel methods to guide the
agent towards the end goal. In this work, we rely on ideas from
Preference-based Reinforcement Learning (PbRL) to design Advantage Actor-Critic algorithms
for solving inspiration learning tasks. Unlike classic actor-critic
architectures, the critic we use consists of two parts: a) a state-value
estimation as in common actor-critic algorithms and b) a single step reward
function derived from an expert/agent classifier. We show that our method is
capable of extending the current imitation framework to new horizons. This
includes continuous-to-discrete action imitation, as well as primitive-to-macro
action imitation.
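To make the two-part critic concrete, here is a hedged PyTorch sketch: a standard state-value head plus a per-step reward derived from an expert-vs-agent classifier defined on states, which is what allows the expert and agent to use different action spaces. Module names and sizes are illustrative.

```python
# Hedged sketch of the two-part critic described above: (a) a state-value head
# as in common actor-critic methods, and (b) a single-step reward from an
# expert/agent classifier over states. All module names are illustrative.
import torch
import torch.nn as nn

class InspirationCritic(nn.Module):
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.value = nn.Sequential(              # (a) state-value estimate
            nn.Linear(state_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))
        self.disc = nn.Sequential(               # (b) expert/agent classifier
            nn.Linear(state_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def reward(self, state):
        # Single-step reward: log-probability the classifier assigns to
        # "expert" for this state; no shared action space is required.
        return torch.log(torch.sigmoid(self.disc(state)) + 1e-8)

    def forward(self, state):
        return self.value(state), self.reward(state)
```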
Imitation Learning with Recurrent Neural Networks
We present a novel view that unifies two frameworks that aim to solve
sequential prediction problems: learning to search (L2S) and recurrent neural
networks (RNN). We point out equivalences between elements of the two
frameworks. By complementing what is missing from one framework compared to
the other, we introduce a more advanced imitation learning framework that, on
one hand, augments L2S's notion of search space and, on the other hand,
enhances RNNs' training procedure to be more robust to compounding errors
arising from training on highly correlated examples.
Comment: 5 pages.
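One way to read "enhances RNNs' training procedure" is a roll-in scheme that mixes reference tokens with the model's own predictions, so training visits the correlated states the model induces at test time. The following sketch assumes a GRU-style interface and is not the paper's exact construction.

```python
# Illustrative mixed roll-in training step for an RNN language model: with
# probability beta feed the reference token (teacher forcing); otherwise feed
# the model's own greedy prediction. All module interfaces are assumptions.
import random
import torch

def mixed_rollin_loss(rnn, embed, readout, targets, beta=0.5):
    h, inp = None, targets[0]
    loss = torch.tensor(0.0)
    ce = torch.nn.functional.cross_entropy
    for t in range(1, len(targets)):
        out, h = rnn(embed(inp).view(1, 1, -1), h)
        logits = readout(out.view(1, -1))
        loss = loss + ce(logits, targets[t].view(1))
        # Feeding back the model's own prediction exposes training to the
        # state distribution the model induces on its own at test time.
        inp = targets[t] if random.random() < beta else logits.argmax(dim=-1)[0]
    return loss
```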
Learning to Search for Dependencies
We demonstrate that a dependency parser can be built using a credit
assignment compiler which removes the burden of worrying about low-level
machine learning details from the parser implementation. The result is a simple
parser that applies robustly to many languages and provides statistical and
computational performance comparable to the best transition-based parsing
approaches to date, while avoiding various downsides including randomization,
extra feature requirements, and custom learning algorithms.