
    Bayesian multitask inverse reinforcement learning

    We generalise the problem of inverse reinforcement learning to multiple tasks, from multiple demonstrations. Each demonstration may represent one expert trying to solve a different task, or different experts trying to solve the same task. Our main contribution is to formalise the problem as statistical preference elicitation via a number of structured priors, whose form captures our biases about the relatedness of different tasks or expert policies. In doing so, we introduce a prior on policy optimality, which is more natural to specify. We show that our framework allows us not only to learn efficiently from multiple experts but also to effectively differentiate between the goals of each. Possible applications include analysing the intrinsic motivations of subjects in behavioural experiments and learning from multiple teachers. Comment: Corrected version. 13 pages, 8 figures.
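    Schematically, and with symbols assumed here for illustration rather than taken from the paper, such a hierarchical multitask model has the structure

        \rho_m \sim P(\rho \mid \eta), \qquad \pi_m \sim P(\pi \mid \rho_m), \qquad D_m \sim P(D \mid \pi_m), \qquad m = 1, \dots, M,

    where \eta are shared hyperparameters encoding the assumed relatedness of tasks or experts, \rho_m is the reward (task) underlying the m-th demonstration, \pi_m the corresponding expert policy, and D_m the observed demonstration; a prior on policy optimality would shape P(\pi \mid \rho_m).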

    Regret Bounds for Reinforcement Learning with Policy Advice

    In some reinforcement learning problems an agent may be provided with a set of input policies, perhaps learned from prior experience or provided by advisors. We present a Reinforcement Learning with Policy Advice (RLPA) algorithm which leverages this input set and learns to use the best policy in the set for the reinforcement learning task at hand. We prove that RLPA has a sub-linear regret of \tilde O(\sqrt{T}) relative to the best input policy, and that both this regret and its computational complexity are independent of the size of the state and action spaces. Our empirical simulations support our theoretical analysis. This suggests RLPA may offer significant advantages in large domains where some good prior policies are provided.
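    As a minimal sketch of the underlying idea (not the published RLPA procedure), one can treat each input policy as an arm and select among them optimistically, so that performance is compared against the best policy in the set; the environment interface run_episode and the round budget below are illustrative assumptions.

        import math

        def select_best_policy(policies, run_episode, num_rounds):
            """Optimism-based selection among a fixed set of input policies.
            A simplified stand-in for RLPA's select-and-eliminate scheme."""
            counts = [0] * len(policies)
            totals = [0.0] * len(policies)
            for t in range(1, num_rounds + 1):
                def index(i):
                    # Empirical mean plus an exploration bonus; untried policies first.
                    if counts[i] == 0:
                        return float("inf")
                    return totals[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i])
                i = max(range(len(policies)), key=index)
                ret = run_episode(policies[i])  # assumed: runs one episode, returns its total reward
                counts[i] += 1
                totals[i] += ret
            best = max(range(len(policies)), key=lambda i: totals[i] / max(counts[i], 1))
            return policies[best]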

    Approximating the Termination Value of One-Counter MDPs and Stochastic Games

    One-counter MDPs (OC-MDPs) and one-counter simple stochastic games (OC-SSGs) are 1-player and 2-player turn-based zero-sum stochastic games, respectively, played on the transition graph of classic one-counter automata (equivalently, pushdown automata with a 1-letter stack alphabet). A key objective for the analysis and verification of these games is the termination objective, where the players aim to maximize (respectively, minimize) the probability of hitting counter value 0, starting at a given control state and given counter value. Recently, we studied qualitative decision problems ("is the optimal termination value = 1?") for OC-MDPs (and OC-SSGs) and showed them to be decidable in P-time (in NP and coNP, respectively). However, quantitative decision and approximation problems ("is the optimal termination value ≥ p?", or "approximate the termination value within epsilon") are far more challenging. This is so in part because optimal strategies may not exist, and because even when they do exist they can have a highly non-trivial structure. It thus remained open even whether any of these quantitative termination problems are computable. In this paper we show that all quantitative approximation problems for the termination value for OC-MDPs and OC-SSGs are computable. Specifically, given an OC-SSG and epsilon > 0, we can compute a value v that approximates the value of the OC-SSG termination game within additive error epsilon, and furthermore we can compute epsilon-optimal strategies for both players in the game. A key ingredient in our proofs is a subtle martingale, derived from solving certain LPs that we can associate with a maximizing OC-MDP. An application of Azuma's inequality to these martingales yields a computable bound for the "wealth" at which a "rich person's strategy" becomes epsilon-optimal for OC-MDPs. Comment: 35 pages, 1 figure, full version of a paper presented at ICALP 2011, invited for submission to Information and Computation.
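    For reference, the Azuma-Hoeffding inequality invoked in such arguments has the following standard form (general statement only; the paper's specific martingale and constants are not reproduced here): for a martingale X_0, X_1, \dots with bounded differences |X_i - X_{i-1}| \le c_i,

        \Pr(X_n - X_0 \ge t) \;\le\; \exp\!\left(-\frac{t^2}{2 \sum_{i=1}^{n} c_i^2}\right).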

    Extreme State Aggregation Beyond MDPs

    We consider a Reinforcement Learning setup where an agent interacts with an environment in observation-reward-action cycles without any (especially MDP) assumptions on the environment. State aggregation, and more generally feature reinforcement learning, is concerned with mapping histories/raw states to reduced/aggregated states. The idea behind both is that the resulting reduced process (approximately) forms a small stationary finite-state MDP, which can then be efficiently solved or learnt. We considerably generalize existing aggregation results by showing that even if the reduced process is not an MDP, the (q-)value functions and (optimal) policies of an associated MDP with the same state-space size solve the original problem, as long as the solution can approximately be represented as a function of the reduced states. This implies an upper bound on the required state-space size that holds uniformly for all RL problems. It may also explain why RL algorithms designed for MDPs sometimes perform well beyond MDPs. Comment: 28 LaTeX pages. 8 Theorems.
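    In schematic form (the symbols below are illustrative and not the paper's notation): if \phi maps histories to aggregated states and the optimal value is approximately representable over them, i.e. there is a function q with

        |Q^{*}(h,a) - q(\phi(h), a)| \;\le\; \epsilon \quad \text{for all histories } h \text{ and actions } a,

    then a policy obtained from the associated finite-state MDP over the \phi-states is O(\epsilon)-optimal in the original problem.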

    Real-Reward Testing for Probabilistic Processes (Extended Abstract)

    We introduce a notion of real-valued reward testing for probabilistic processes by extending the traditional nonnegative-reward testing with negative rewards. In this richer testing framework, the may and must preorders turn out to be inverses. We show that for convergent processes with finitely many states and transitions, but not in the presence of divergence, the real-reward must-testing preorder coincides with the nonnegative-reward must-testing preorder. To prove this coincidence we characterise the usual resolution-based testing in terms of the weak transitions of processes, without having to involve policies, adversaries, schedulers, resolutions, or similar structures that are external to the process under investigation. This requires establishing the continuity of our function for calculating testing outcomes. Comment: In Proceedings QAPL 2011, arXiv:1107.074
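    The inversion of the two preorders mentioned above can be stated compactly (notation assumed): for the real-reward framework,

        p \sqsubseteq^{\mathbb{R}}_{\mathrm{may}} q \quad\Longleftrightarrow\quad q \sqsubseteq^{\mathbb{R}}_{\mathrm{must}} p.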

    Variations on the Stochastic Shortest Path Problem

    In this invited contribution, we revisit the stochastic shortest path problem and show how recent results allow one to improve over the classical solutions: we present algorithms to synthesize strategies with multiple guarantees on the distribution of the length of paths reaching a given target, rather than strategies that simply minimize its expected value. The concepts and algorithms that we propose here are applications of more general results that have been obtained recently for Markov decision processes and that are described in a series of recent papers. Comment: Invited paper for VMCAI 201
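    Schematically (symbols assumed for illustration), the classical objective and one distribution-sensitive variant can be contrasted as

        \text{classical: } \min_{\sigma} \mathbb{E}^{\sigma}[\mathsf{len}_T] \qquad \text{versus} \qquad \text{variant: find } \sigma \text{ with } \Pr^{\sigma}(\mathsf{len}_T \le \ell) \ge \alpha \text{ and } \mathbb{E}^{\sigma}[\mathsf{len}_T] \le \beta,

    where \mathsf{len}_T is the length of a path until it reaches the target T, and \sigma ranges over strategies.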

    Decision Problems for Nash Equilibria in Stochastic Games

    We analyse the computational complexity of finding Nash equilibria in stochastic multiplayer games with ω-regular objectives. While the existence of an equilibrium whose payoff falls into a certain interval may be undecidable, we single out several decidable restrictions of the problem. First, restricting the search space to stationary, or pure stationary, equilibria results in problems that are typically contained in PSPACE and NP, respectively. Second, we show that the existence of an equilibrium with a binary payoff (i.e. an equilibrium where each player either wins or loses with probability 1) is decidable. We also establish that the existence of a Nash equilibrium with a certain binary payoff entails the existence of an equilibrium with the same payoff in pure, finite-state strategies. Comment: 22 pages, revised version.
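    In symbols (notation assumed): the binary-payoff problem asks, for a vector z \in \{0,1\}^{\Pi} over the set of players \Pi, whether there is a Nash equilibrium in which each player i receives payoff z_i, i.e. wins with probability 1 if z_i = 1 and loses with probability 1 if z_i = 0.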

    Improving Strategies via SMT Solving

    We consider the problem of computing numerical invariants of programs by abstract interpretation. Our method eschews two traditional sources of imprecision: (i) the use of widening operators for enforcing convergence within a finite number of iterations, and (ii) the use of merge operations (often, convex hulls) at the merge points of the control flow graph. It instead computes the least inductive invariant expressible in the domain at a restricted set of program points, and analyzes the rest of the code en bloc. We emphasize that we compute this inductive invariant precisely. For that we extend the strategy improvement algorithm of [Gawlitza and Seidl, 2007]. If we applied their method directly, we would have to solve an exponentially sized system of abstract semantic equations, resulting in memory exhaustion. Instead, we keep the system implicit and discover strategy improvements using SAT modulo real linear arithmetic (SMT). For evaluating strategies we use linear programming. Our algorithm has low polynomial space complexity and, on contrived examples, performs exponentially many strategy improvement steps in the worst case; this is unsurprising, since we show that the associated abstract reachability problem is \Pi^p_2-complete.
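    The overall scheme can be sketched as the following generic strategy-improvement loop (a simplified sketch, not the paper's algorithm; evaluate and find_improvement are assumed callables, standing in for the linear-programming evaluation and the SMT query described above).

        def strategy_improvement(initial_strategy, evaluate, find_improvement):
            """Generic strategy-improvement skeleton: evaluate the current
            strategy, ask an oracle for a profitable switch, stop at a fixed
            point."""
            strategy = initial_strategy
            value = evaluate(strategy)                        # stand-in for the LP evaluation
            while True:
                improved = find_improvement(strategy, value)  # stand-in for the SMT query
                if improved is None:
                    return strategy, value                    # no improvement: fixed point reached
                strategy = improved
                value = evaluate(strategy)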

    Characterising Probabilistic Processes Logically

    In this paper we work on (bi)simulation semantics of processes that exhibit both nondeterministic and probabilistic behaviour. We propose a probabilistic extension of the modal mu-calculus and show how to derive characteristic formulae for various simulation-like preorders over finite-state processes without divergence. In addition, we show that even without the fixpoint operators this probabilistic mu-calculus can be used to characterise these behavioural relations in the sense that two states are equivalent if and only if they satisfy the same set of formulae. Comment: 18 pages.
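    The characteristic-formula property has the usual shape (notation assumed): for each finite-state, divergence-free process s and each preorder considered, there is a formula \varphi_s such that a process t stands in the preorder relation with s exactly when t \models \varphi_s.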

    A survey of solution techniques for the partially observed Markov decision process

    We survey several computational procedures for the partially observed Markov decision process (POMDP) that have been developed since the Monahan survey was published in 1982. The POMDP generalizes the standard, completely observed Markov decision process by permitting the possibility that state observations may be noise-corrupted and/or costly. Several computational procedures presented are convergence-accelerating variants of, or approximations to, the Smallwood-Sondik algorithm. Finite-memory suboptimal design results are reported, and new research directions involving heuristic search are discussed. Peer reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/44198/1/10479_2005_Article_BF02204836.pd
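    The belief-state update underlying these methods has the standard form (with T the transition model, O the observation model, b the current belief, a the action taken, and o the observation received):

        b'(s') = \frac{O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s)}{\Pr(o \mid b, a)}, \qquad \Pr(o \mid b, a) = \sum_{s'} O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s).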