Bayesian multitask inverse reinforcement learning
We generalise the problem of inverse reinforcement learning to multiple
tasks, from multiple demonstrations. Each demonstration may represent one expert
trying to solve a different task, or different experts trying to solve the same
task. Our main contribution is to formalise the problem as statistical
preference elicitation, via a number of structured priors, whose form captures
our biases about the relatedness of different tasks or expert policies. In
doing so, we introduce a prior on policy optimality, which is more natural to
specify. We show that our framework allows us not only to learn efficiently
from multiple experts but also to effectively differentiate between the goals
of each. Possible applications include analysing the intrinsic motivations of
subjects in behavioural experiments and learning from multiple teachers.
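
Since the framework is Bayesian, the structured priors have a generative reading: related tasks share structure, and each expert is only approximately optimal. A toy sketch of that reading in Python follows; the names, distributions, and the function sample_multitask_prior are illustrative assumptions for this listing, not the paper's actual model.

import numpy as np

def sample_multitask_prior(n_experts, n_features, seed=0):
    # Toy generative reading of a structured prior: each expert's reward-weight
    # vector is drawn around a shared task-level mean, and each expert gets an
    # optimality temperature beta. Names and distributions are illustrative only.
    rng = np.random.default_rng(seed)
    shared_mean = rng.normal(size=n_features)
    rewards = shared_mean + 0.1 * rng.normal(size=(n_experts, n_features))
    betas = rng.gamma(shape=2.0, scale=1.0, size=n_experts)
    return rewards, betas
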
Regret Bounds for Reinforcement Learning with Policy Advice
In some reinforcement learning problems an agent may be provided with a set
of input policies, perhaps learned from prior experience or provided by
advisors. We present a reinforcement learning with policy advice (RLPA)
algorithm which leverages this input set and learns to use the best policy in
the set for the reinforcement learning task at hand. We prove that RLPA has a
sub-linear regret of \tilde O(\sqrt{T}) relative to the best input policy, and
that both this regret and its computational complexity are independent of the
size of the state and action space. Our empirical simulations support our
theoretical analysis. This suggests RLPA may offer significant advantages in
large domains where some good prior policies are provided.
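
RLPA's own construction and regret analysis are given in the paper; as a loose illustration of the underlying idea of treating a small advice set as candidate policies and concentrating play on the empirically best one, here is a minimal sketch. The UCB-style selection rule and the run_episode hook are assumptions for this example, not RLPA itself.

import math

def select_policy(stats, t, c=2.0):
    # UCB-style score: pick the policy with the best optimistic estimate.
    # stats[i] = (total_return, num_plays) for input policy i.
    best, best_score = 0, -float("inf")
    for i, (total, n) in enumerate(stats):
        if n == 0:
            return i  # try every advised policy at least once
        score = total / n + c * math.sqrt(math.log(t + 1) / n)
        if score > best_score:
            best, best_score = i, score
    return best

def learn_with_policy_advice(policies, run_episode, num_episodes):
    # policies: list of callables mapping state -> action (the advice set).
    # run_episode(policy) -> episode return is an assumed environment hook.
    stats = [(0.0, 0) for _ in policies]
    for t in range(num_episodes):
        i = select_policy(stats, t)
        ret = run_episode(policies[i])
        total, n = stats[i]
        stats[i] = (total + ret, n + 1)
    return max(range(len(policies)),
               key=lambda i: stats[i][0] / max(stats[i][1], 1))
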
Approximating the Termination Value of One-Counter MDPs and Stochastic Games
One-counter MDPs (OC-MDPs) and one-counter simple stochastic games (OC-SSGs)
are 1-player, and 2-player turn-based zero-sum, stochastic games played on the
transition graph of classic one-counter automata (equivalently, pushdown
automata with a 1-letter stack alphabet). A key objective for the analysis and
verification of these games is the termination objective, where the players aim
to maximize (minimize, respectively) the probability of hitting counter value
0, starting at a given control state and given counter value. Recently, we
studied qualitative decision problems ("is the optimal termination value = 1?")
for OC-MDPs (and OC-SSGs) and showed them to be decidable in P-time (in NP and
coNP, respectively). However, quantitative decision and approximation problems
("is the optimal termination value ? p", or "approximate the termination value
within epsilon") are far more challenging. This is so in part because optimal
strategies may not exist, and because even when they do exist they can have a
highly non-trivial structure. It thus remained open even whether any of these
quantitative termination problems are computable. In this paper we show that
all quantitative approximation problems for the termination value for OC-MDPs
and OC-SSGs are computable. Specifically, given an OC-SSG, and given epsilon >
0, we can compute a value v that approximates the value of the OC-SSG
termination game within additive error epsilon, and furthermore we can compute
epsilon-optimal strategies for both players in the game. A key ingredient in
our proofs is a subtle martingale, derived from solving certain LPs that we can
associate with a maximizing OC-MDP. An application of Azuma's inequality on
these martingales yields a computable bound for the "wealth" at which a "rich
person's strategy" becomes epsilon-optimal for OC-MDPs.Comment: 35 pages, 1 figure, full version of a paper presented at ICALP 2011,
invited for submission to Information and Computatio
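
For reference, the Azuma inequality applied to these martingales is the standard bounded-differences concentration bound, stated here in its textbook form; the specific martingale and constants derived from the LPs are not reproduced here.

  For a martingale (X_n) with |X_i - X_{i-1}| \le c_i for all i,
  \Pr( |X_n - X_0| \ge t ) \le 2 \exp\!\left( - \frac{t^2}{2 \sum_{i=1}^{n} c_i^2} \right).
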
Extreme State Aggregation Beyond MDPs
We consider a Reinforcement Learning setup where an agent interacts with an
environment in observation-reward-action cycles without any (esp. MDP)
assumptions on the environment. State aggregation and, more generally, feature
reinforcement learning are concerned with mapping histories/raw-states to
reduced/aggregated states. The idea behind both is that the resulting reduced
process (approximately) forms a small stationary finite-state MDP, which can
then be efficiently solved or learnt. We considerably generalize existing
aggregation results by showing that even if the reduced process is not an MDP,
the (q-)value functions and (optimal) policies of an associated MDP with the same
state-space size solve the original problem, as long as the solution can
approximately be represented as a function of the reduced states. This implies
an upper bound on the required state space size that holds uniformly for all RL
problems. It may also explain why RL algorithms designed for MDPs sometimes
perform well beyond MDPs.
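
The practical upshot is that one can map histories to a small set of aggregated states and run an ordinary tabular method on the reduced process. A minimal sketch of that pipeline follows, assuming caller-supplied phi and env_step hooks; this is generic Q-learning on phi(history), not the paper's analysis.

from collections import defaultdict
import random

def q_learning_on_aggregated(env_step, phi, actions, episodes=1000,
                             alpha=0.1, gamma=0.99, eps=0.1):
    # Tabular Q-learning run on aggregated states phi(history).
    # env_step(history, action) -> (observation, reward, done) and phi are
    # assumed hooks supplied by the caller; neither is the paper's construction.
    Q = defaultdict(float)
    for _ in range(episodes):
        history, done = (), False
        while not done:
            s = phi(history)
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            obs, r, done = env_step(history, a)
            history = history + (a, obs, r)
            s2 = phi(history)
            target = r if done else r + gamma * max(Q[(s2, b)] for b in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q
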
Real-Reward Testing for Probabilistic Processes (Extended Abstract)
We introduce a notion of real-valued reward testing for probabilistic
processes by extending the traditional nonnegative-reward testing with negative
rewards. In this richer testing framework, the may and must preorders turn out
to be inverses. We show that for convergent processes with finitely many states
and transitions, but not in the presence of divergence, the real-reward
must-testing preorder coincides with the nonnegative-reward must-testing
preorder. To prove this coincidence we characterise the usual resolution-based
testing in terms of the weak transitions of processes, without having to
involve policies, adversaries, schedulers, resolutions, or similar structures
that are external to the process under investigation. This requires
establishing the continuity of our function for calculating testing outcomes.
Variations on the Stochastic Shortest Path Problem
In this invited contribution, we revisit the stochastic shortest path
problem, and show how recent results allow one to improve over the classical
solutions: we present algorithms to synthesize strategies with multiple
guarantees on the distribution of the length of paths reaching a given target,
rather than simply minimizing its expected value. The concepts and algorithms
that we propose here are applications of more general results that have been
obtained recently for Markov decision processes and that are described in a
series of recent papers.
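
As background for the contrast drawn here, the classical solution minimizes only the expected length of paths to the target. A minimal value-iteration sketch for that baseline follows, under assumed input encodings (P[s][a] as a list of (probability, next_state) pairs, actions[s] listing enabled actions, unit step costs).

def ssp_value_iteration(states, actions, P, target, iters=10000, tol=1e-9):
    # Classical stochastic shortest path: minimize the expected number of steps
    # to reach `target`. P[s][a] is a list of (probability, next_state) pairs
    # and actions[s] lists the actions enabled in s (assumed encodings).
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        delta = 0.0
        for s in states:
            if s == target:
                continue
            best = min(1.0 + sum(p * V[t] for p, t in P[s][a]) for a in actions[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    return V
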
Decision Problems for Nash Equilibria in Stochastic Games
We analyse the computational complexity of finding Nash equilibria in
stochastic multiplayer games with omega-regular objectives. While the
existence of an equilibrium whose payoff falls into a certain interval may be
undecidable, we single out several decidable restrictions of the problem.
First, restricting the search space to stationary, or pure stationary,
equilibria results in problems that are typically contained in PSPACE and NP,
respectively. Second, we show that the existence of an equilibrium with a
binary payoff (i.e. an equilibrium where each player either wins or loses with
probability 1) is decidable. We also establish that the existence of a Nash
equilibrium with a certain binary payoff entails the existence of an
equilibrium with the same payoff in pure, finite-state strategies.
Improving Strategies via SMT Solving
We consider the problem of computing numerical invariants of programs by
abstract interpretation. Our method eschews two traditional sources of
imprecision: (i) the use of widening operators for enforcing convergence within
a finite number of iterations, and (ii) the use of merge operations (often, convex
hulls) at the merge points of the control flow graph. It instead computes the
least inductive invariant expressible in the domain at a restricted set of
program points, and analyzes the rest of the code en bloc. We emphasize that we
compute this inductive invariant precisely. For that we extend the strategy
improvement algorithm of [Gawlitza and Seidl, 2007]. If we applied their method
directly, we would have to solve an exponentially sized system of abstract
semantic equations, resulting in memory exhaustion. Instead, we keep the system
implicit and discover strategy improvements using SAT modulo real linear
arithmetic (SMT). For evaluating strategies we use linear programming. Our
algorithm has low polynomial space complexity and, in the worst case, performs
exponentially many strategy improvement steps on contrived examples; this
is unsurprising, since we show that the associated abstract reachability
problem is \Pi^p_2-complete.
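
The key loop alternates between (a) asking an SMT solver whether the current strategy admits an improvement and (b) evaluating the improved strategy exactly, e.g. with an LP. A toy illustration of step (a) with z3 follows; the constraint shapes, variable names, and the transitions encoding are invented for this example and are not taken from the paper.

from z3 import Solver, Real, Or, sat

def find_improvement(current_bound, transitions):
    # Ask the SMT solver whether some transition can push the template value
    # above the bound computed for the current strategy. The encoding here is
    # a toy: transitions are callables mapping a z3 Real to a z3 expression.
    x, y = Real("x"), Real("y")
    s = Solver()
    s.add(x <= current_bound)                    # a state allowed by the current invariant
    s.add(Or([y == t(x) for t in transitions]))  # some transition fires
    s.add(y > current_bound)                     # ...and strictly improves the bound
    return s.model() if s.check() == sat else None

# Toy usage: transitions x -> x + 1 and x -> 2*x - 3, current bound 5
witness = find_improvement(5, [lambda x: x + 1, lambda x: 2 * x - 3])
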
Characterising Probabilistic Processes Logically
In this paper we work on (bi)simulation semantics of processes that exhibit
both nondeterministic and probabilistic behaviour. We propose a probabilistic
extension of the modal mu-calculus and show how to derive characteristic
formulae for various simulation-like preorders over finite-state processes
without divergence. In addition, we show that even without the fixpoint
operators this probabilistic mu-calculus can be used to characterise these
behavioural relations in the sense that two states are equivalent if and only
if they satisfy the same set of formulae.
A survey of solution techniques for the partially observed Markov decision process
We survey several computational procedures for the partially observed Markov decision process (POMDP) that have been developed since the Monahan survey was published in 1982. The POMDP generalizes the standard, completely observed Markov decision process by permitting the possibility that state observations may be noise-corrupted and/or costly. Several of the computational procedures presented are convergence-accelerating variants of, or approximations to, the Smallwood-Sondik algorithm. Finite-memory suboptimal design results are reported, and new research directions involving heuristic search are discussed.
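
All of the surveyed procedures operate on the belief state, which is updated by Bayes' rule after each action and observation. A minimal belief-update sketch follows; the dictionary layouts for T and O are assumptions of this example rather than any particular algorithm in the survey.

def belief_update(belief, action, observation, states, T, O):
    # Bayes-filter update of a POMDP belief state:
    #   b'(s') is proportional to O[(s', a)][o] * sum_s T[(s, a)][s'] * b(s)
    # The dictionary layouts for T and O are assumptions of this sketch.
    new_belief = {}
    for s2 in states:
        pred = sum(belief.get(s, 0.0) * T[(s, action)].get(s2, 0.0) for s in states)
        new_belief[s2] = O[(s2, action)].get(observation, 0.0) * pred
    norm = sum(new_belief.values())
    if norm == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return {s2: p / norm for s2, p in new_belief.items()}
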