Towards Resolving Unidentifiability in Inverse Reinforcement Learning
We consider a setting for Inverse Reinforcement Learning (IRL) where the
learner is extended with the ability to actively select multiple environments,
observing an agent's behavior on each environment. We first demonstrate that if
the learner can experiment with any transition dynamics on some fixed set of
states and actions, then there exists an algorithm that reconstructs the
agent's reward function to the fullest extent theoretically possible, and that
requires only a small (logarithmic) number of experiments. We contrast this
result to what is known about IRL in single fixed environments, namely that the
true reward function is fundamentally unidentifiable. We then extend this
setting to the more realistic case where the learner may not select any
transition dynamic, but rather is restricted to some fixed set of environments
that it may try. We connect the problem of maximizing the information derived
from experiments to submodular function maximization and demonstrate that a
greedy algorithm is near optimal (up to logarithmic factors). Finally, we
empirically validate our algorithm on an environment inspired by behavioral
psychology.
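
As a rough illustration of the greedy selection idea the abstract connects to submodular maximization (a sketch under assumptions, not the paper's algorithm), suppose info_gain is a hypothetical monotone submodular function scoring how much a set of environments constrains the reward function:

def greedy_select(candidate_envs, info_gain, budget):
    # Greedy maximization of a monotone submodular objective: repeatedly add
    # the environment with the largest marginal gain, up to the experiment budget.
    chosen = []
    for _ in range(budget):
        best_env, best_gain = None, 0.0
        for env in candidate_envs:
            if env in chosen:
                continue
            gain = info_gain(chosen + [env]) - info_gain(chosen)  # marginal gain
            if gain > best_gain:
                best_env, best_gain = env, gain
        if best_env is None:  # nothing left adds information
            break
        chosen.append(best_env)
    return chosen

For a monotone submodular objective, this greedy rule is the standard route to the near-optimality guarantee the abstract refers to.
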
Learning to Make Predictions In Partially Observable Environments Without a Generative Model
When faced with the problem of learning a model of a high-dimensional
environment, a common approach is to limit the model to make only a restricted
set of predictions, thereby simplifying the learning problem. These partial
models may be directly useful for making decisions or may be combined together
to form a more complete, structured model. However, in partially observable
(non-Markov) environments, standard model-learning methods learn generative
models, i.e. models that provide a probability distribution over all possible
futures (such as POMDPs). It is not straightforward to restrict such models to
make only certain predictions, and doing so does not always simplify the
learning problem. In this paper we present prediction profile models:
non-generative partial models for partially observable systems that make only a
given set of predictions, and are therefore far simpler than generative models
in some cases. We formalize the problem of learning a prediction profile model
as a transformation of the original model-learning problem, and show
empirically that one can learn prediction profile models that make a small set
of important predictions even in systems that are too complex for standard
generative models.
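
One way to picture such a partial model (an illustrative sketch only, not necessarily the paper's construction): a small automaton whose states are profiles, i.e. the current values of the tracked predictions, and whose transitions are driven by action-observation pairs.

class PredictionProfileModel:
    # Illustrative sketch: a finite-state machine over prediction "profiles".
    # All names here are assumptions, not the paper's API.
    def __init__(self, initial_profile, transitions):
        # transitions: dict mapping (profile, action, observation) -> next profile
        self.profile = initial_profile
        self.transitions = transitions

    def update(self, action, observation):
        # Advance the model after taking `action` and observing `observation`.
        self.profile = self.transitions[(self.profile, action, observation)]

    def predict(self):
        # Current values of the tracked predictions; no distribution over full
        # futures is ever represented, which is what makes the model non-generative.
        return self.profile
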
On the Complexity of Policy Iteration
Decision-making problems in uncertain or stochastic domains are often
formulated as Markov decision processes (MDPs). Policy iteration (PI) is a
popular algorithm for searching over policy-space, the size of which is
exponential in the number of states. We are interested in bounds on the
complexity of PI that do not depend on the value of the discount factor. In
this paper we prove the first such non-trivial, worst-case, upper bounds on the
number of iterations required by PI to converge to the optimal policy. Our
analysis also sheds new light on the manner in which PI progresses through the
space of policies. Comment: Appears in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI 1999).
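
For reference, a standard policy iteration routine for a finite MDP (the generic algorithm whose iteration count is being bounded, not code from the paper); P[a] is assumed to be an S x S transition matrix for action a and R an S x A reward matrix:

import numpy as np

def policy_iteration(P, R, gamma):
    # Alternate exact policy evaluation with greedy policy improvement
    # until the policy stops changing.
    S, A = R.shape
    policy = np.zeros(S, dtype=int)
    while True:
        # Evaluation: solve (I - gamma * P_pi) V = R_pi for the current policy.
        P_pi = np.array([P[policy[s]][s] for s in range(S)])
        R_pi = R[np.arange(S), policy]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        # Improvement: act greedily with respect to V.
        Q = np.stack([R[:, a] + gamma * P[a] @ V for a in range(A)], axis=1)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy

The paper's question is how many passes through this loop can be required in the worst case, independent of gamma.
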
Approximate Planning for Factored POMDPs using Belief State Simplification
We are interested in the problem of planning for factored POMDPs. Building on
the recent results of Kearns, Mansour and Ng, we provide a planning algorithm
for factored POMDPs that exploits the accuracy-efficiency tradeoff in the
belief state simplification introduced by Boyen and Koller. Comment: Appears in Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI 1999).
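
A rough numpy sketch of the Boyen-Koller style simplification the abstract builds on (names and the two-factor case are assumptions): after an exact belief update, the joint belief over the state factors is projected back onto the product of its marginals, trading accuracy for a compact factored representation.

import numpy as np

def project_to_marginals(joint_belief):
    # joint_belief: 2-D array b[x, y] over two state factors.
    bx = joint_belief.sum(axis=1)   # marginal over the first factor
    by = joint_belief.sum(axis=0)   # marginal over the second factor
    return np.outer(bx, by)         # product-form (factored) approximation

Interleaving this projection with exact updates keeps the belief in a compact family, which is the accuracy-efficiency tradeoff the planner exploits.
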
Repeated Inverse Reinforcement Learning
We introduce a novel repeated Inverse Reinforcement Learning problem: the
agent has to act on behalf of a human in a sequence of tasks and wishes to
minimize the number of tasks in which it surprises the human by acting suboptimally
with respect to how the human would have acted. Each time the human is
surprised, the agent is provided a demonstration of the desired behavior by the
human. We formalize this problem, including how the sequence of tasks is
chosen, in a few different ways and provide some foundational results. Comment: The first two authors contributed equally to this work. The paper appears in NIPS 2017.
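
The interaction protocol described above can be sketched as a simple loop (illustrative only; the helper names are hypothetical stand-ins, not the paper's algorithm):

def repeated_irl(tasks, agent, human):
    # The agent acts on the human's behalf across a sequence of tasks and only
    # receives a demonstration when it surprises the human.
    surprises = 0
    for task in tasks:
        behavior = agent.act(task)
        if human.is_surprised(task, behavior):        # behavior was suboptimal for the human
            surprises += 1
            demo = human.demonstrate(task)            # human shows the desired behavior
            agent.update_reward_estimate(task, demo)  # agent refines its model of the reward
    return surprises                                  # the quantity the agent wants to keep small
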
Value Prediction Network
This paper proposes a novel deep reinforcement learning (RL) architecture,
called Value Prediction Network (VPN), which integrates model-free and
model-based RL methods into a single neural network. In contrast to typical
model-based RL methods, VPN learns a dynamics model whose abstract states are
trained to make option-conditional predictions of future values (discounted sum
of rewards) rather than of future observations. Our experimental results show
that VPN has several advantages over both model-free and model-based baselines
in a stochastic environment where careful planning is required but building an
accurate observation-prediction model is difficult. Furthermore, VPN
outperforms Deep Q-Network (DQN) on several Atari games even with
short-lookahead planning, demonstrating its potential as a new way of learning
a good state representation. Comment: NIPS 2017.
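
A compressed sketch of the kind of module the abstract describes (illustrative PyTorch-style code under assumed shapes, not the authors' implementation): an encoder maps observations to abstract states, and per-option heads predict reward, discount, value, and the next abstract state, which is enough to plan by rolling forward in the abstract space.

import torch
import torch.nn as nn

class ValuePredictionNet(nn.Module):
    # Sketch of a VPN-style model: it never predicts future observations, only
    # option-conditional reward, discount, next abstract state, and value.
    def __init__(self, obs_dim, state_dim, num_options):
        super().__init__()
        self.encode = nn.Linear(obs_dim, state_dim)
        self.value = nn.Linear(state_dim, 1)
        self.outcome = nn.Linear(state_dim, num_options * 2)              # per-option reward, discount
        self.transition = nn.Linear(state_dim, num_options * state_dim)   # per-option next abstract state
        self.num_options, self.state_dim = num_options, state_dim

    def forward(self, obs, option):
        s = torch.relu(self.encode(obs))
        r, g = self.outcome(s).view(-1, self.num_options, 2)[:, option].unbind(-1)
        g = torch.sigmoid(g)   # keep the predicted discount in (0, 1)
        s_next = self.transition(s).view(-1, self.num_options, self.state_dim)[:, option]
        # One-step backup in abstract space: predicted reward + discount * value of next state.
        return r + g * self.value(s_next).squeeze(-1)
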
Predictive State Representations: A New Theory for Modeling Dynamical Systems
Modeling dynamical systems, both for control purposes and to make predictions
about their behavior, is ubiquitous in science and engineering. Predictive
state representations (PSRs) are a recently introduced class of models for
discrete-time dynamical systems. The key idea behind PSRs and the closely
related OOMs (Jaeger's observable operator models) is to represent the state of
the system as a set of predictions of observable outcomes of experiments one
can do in the system. This makes PSRs rather different from history-based
models such as nth-order Markov models and hidden-state-based models such as
HMMs and POMDPs. We introduce an interesting construct, the system-dynamics
matrix, and show how PSRs can be derived simply from it. We also use this
construct to show formally that PSRs are more general than both nth-order
Markov models and HMMs/POMDPs. Finally, we discuss the main difference between
PSRs and OOMs and conclude with directions for future work. Comment: Appears in Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI 2004).
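
For concreteness, the standard linear-PSR state update from this literature (standard notation, not quoted from the paper): with p(Q \mid h) the vector of predictions of the core tests Q after history h, and m_{ao}, m_{aoq_i} the model's fixed weight vectors, taking action a and observing o gives

p(q_i \mid hao) = \frac{p(aoq_i \mid h)}{p(ao \mid h)} = \frac{p(Q \mid h)^{\top} m_{aoq_i}}{p(Q \mid h)^{\top} m_{ao}},

so the prediction vector itself serves as the state and can be maintained with simple linear algebra.
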
Nash Convergence of Gradient Dynamics in Iterated General-Sum Games
Multi-agent games are becoming an increasingly prevalent formalism for the
study of electronic commerce and auctions. The speed at which transactions can
take place and the growing complexity of electronic marketplaces make the
study of computationally simple agents an appealing direction. In this work, we
analyze the behavior of agents that incrementally adapt their strategy through
gradient ascent on expected payoff, in the simple setting of two-player,
two-action, iterated general-sum games, and present a surprising result. We
show that either the agents will converge to Nash equilibrium, or if the
strategies themselves do not converge, then their average payoffs will
nevertheless converge to the payoffs of a Nash equilibrium. Comment: Appears in Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI 2000).
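
A minimal sketch of the dynamics being analyzed (illustrative step size and loop only; not the paper's analysis, which studies the infinitesimal limit): both players follow projected gradient ascent on their expected payoff in a 2x2 general-sum game.

import numpy as np

def gradient_dynamics(A, B, alpha, beta, eta=0.01, steps=10000):
    # A, B: 2x2 numpy payoff matrices for the row and column players.
    # alpha, beta: probabilities each player assigns to their first action.
    avg_r = avg_c = 0.0
    for t in range(1, steps + 1):
        x, y = np.array([alpha, 1 - alpha]), np.array([beta, 1 - beta])
        avg_r += (x @ A @ y - avg_r) / t      # running-average payoffs
        avg_c += (x @ B @ y - avg_c) / t
        d_alpha = (A[0] - A[1]) @ y           # gradient of row player's expected payoff
        d_beta = x @ (B[:, 0] - B[:, 1])      # gradient of column player's expected payoff
        alpha = np.clip(alpha + eta * d_alpha, 0.0, 1.0)   # project back to [0, 1]
        beta = np.clip(beta + eta * d_beta, 0.0, 1.0)
    return (alpha, beta), (avg_r, avg_c)

The paper's result says that even when (alpha, beta) fails to converge, the running-average payoffs converge to the payoffs of some Nash equilibrium.
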
Minimizing Maximum Regret in Commitment Constrained Sequential Decision Making
In cooperative multiagent planning, it can often be beneficial for an agent
to make commitments about aspects of its behavior to others, allowing them in
turn to plan their own behaviors without taking the agent's detailed behavior
into account. Extending previous work in the Bayesian setting, we consider
instead a worst-case setting in which the agent has a set of possible
environments (MDPs) it could be in, and develop a commitment semantics that
allows for probabilistic guarantees on the agent's behavior in any of the
environments it could end up facing. Crucially, an agent receives observations
(of reward and state transitions) that allow it to potentially eliminate
possible environments and thus obtain higher utility by adapting its policy to
the history of observations. We develop algorithms and provide theory and some
preliminary empirical results showing that they ensure an agent meets its
commitments with history-dependent policies while minimizing maximum regret
over the possible environments.
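
A toy sketch of the minimax-regret criterion itself (the selection rule only, under hypothetical evaluator functions; not the paper's algorithm): the regret of a policy in an environment is its gap to that environment's optimal value, and the agent prefers the candidate whose worst-case gap is smallest.

def minimax_regret_policy(candidate_policies, possible_envs, value, optimal_value):
    # value(pi, env) and optimal_value(env) are assumed to be given evaluators.
    def max_regret(pi):
        return max(optimal_value(env) - value(pi, env) for env in possible_envs)
    return min(candidate_policies, key=max_regret)

In the paper's setting the candidates are history-dependent policies further constrained to meet the probabilistic commitments.
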
Fast Planning in Stochastic Games
Stochastic games generalize Markov decision processes (MDPs) to a multiagent
setting by allowing the state transitions to depend jointly on all player
actions, and having rewards determined by multiplayer matrix games at each
state. We consider the problem of computing Nash equilibria in stochastic
games, the analogue of planning in MDPs. We begin by providing a generalization
of finite-horizon value iteration that computes a Nash strategy for each player
in general-sum stochastic games. The algorithm takes an arbitrary Nash selection
function as input, which allows the translation of local choices between
multiple Nash equilibria into the selection of a single global Nash
equilibrium.
Our main technical result is an algorithm for computing near-Nash equilibria
in large or infinite state spaces. This algorithm builds on our finite-horizon
value iteration algorithm, and adapts the sparse sampling methods of Kearns,
Mansour and Ng (1999) to stochastic games. We conclude by describing a counterexample showing that infinite-horizon discounted value iteration, which was shown by Shapley to converge in the zero-sum case (a result we extend slightly here), does not converge in the general-sum case. Comment: Appears in Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI 2000).
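
A sketch of the finite-horizon backup described in the first paragraph (illustrative names and data layout; the Nash computation is deferred to the supplied selection function, as in the abstract): at each state and horizon, a stage matrix game is formed from the immediate rewards plus expected continuation values, one of its equilibria is chosen by the selection function, and its value is backed up.

import numpy as np

def finite_horizon_vi(states, actions1, actions2, R1, R2, P, T, nash_select):
    # R1[s][a][b], R2[s][a][b]: stage payoffs; P[s][a][b]: dict of next-state
    # probabilities; nash_select(G1, G2) -> (x, y) mixed strategies for one
    # chosen Nash equilibrium of the stage game.
    V1 = {s: 0.0 for s in states}
    V2 = {s: 0.0 for s in states}
    policy = {}
    for t in range(T, 0, -1):
        newV1, newV2 = {}, {}
        for s in states:
            G1 = np.array([[R1[s][a][b] + sum(p * V1[s2] for s2, p in P[s][a][b].items())
                            for b in actions2] for a in actions1])
            G2 = np.array([[R2[s][a][b] + sum(p * V2[s2] for s2, p in P[s][a][b].items())
                            for b in actions2] for a in actions1])
            x, y = nash_select(G1, G2)            # local equilibrium choice
            policy[(t, s)] = (x, y)
            newV1[s], newV2[s] = x @ G1 @ y, x @ G2 @ y
        V1, V2 = newV1, newV2
    return policy, V1, V2
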