
    Gambler's Ruin Bandit Problem

    In this paper, we propose a new multi-armed bandit problem called the Gambler's Ruin Bandit Problem (GRBP). In the GRBP, the learner proceeds in a sequence of rounds, where each round is a Markov Decision Process (MDP) with two actions (arms): a continuation action that moves the learner randomly over the state space around the current state, and a terminal action that moves the learner directly into one of the two terminal states (the goal state and the dead-end state). The current round ends when a terminal state is reached, and the learner incurs a positive reward only when the goal state is reached. The objective of the learner is to maximize its long-term reward (the expected number of times the goal state is reached), without any prior knowledge of the state transition probabilities. We first prove a result on the form of the optimal policy for the GRBP. Then, we define the regret of the learner with respect to an omnipotent oracle, which acts optimally in each round, and prove that it increases logarithmically over rounds. We also identify a condition under which the learner's regret is bounded. A potential application of the GRBP is optimal medical treatment assignment, in which the continuation action corresponds to a conservative treatment and the terminal action corresponds to a risky treatment such as surgery.
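    As a concrete illustration, here is a minimal sketch of one GRBP round, assuming simple random-walk continuation dynamics and a threshold-style policy; the state-space size, probabilities, and names below are illustrative assumptions, not taken from the paper.

```python
import random

def grbp_round(threshold, p_up=0.45, p_goal_terminal=0.6, n_states=10):
    """One round of a toy Gambler's Ruin Bandit instance.

    States 0..n_states, where 0 is the dead-end and n_states is the goal.
    The continuation action takes a random-walk step around the current
    state; the terminal action jumps directly to the goal (with the unknown
    probability p_goal_terminal) or to the dead-end. The policy sketched
    here takes the terminal action once the state reaches `threshold`.
    Returns 1 if the goal state was reached, else 0.
    """
    state = n_states // 2  # start in the middle of the state space
    while 0 < state < n_states:
        if state >= threshold:
            # terminal action: move directly into a terminal state
            return 1 if random.random() < p_goal_terminal else 0
        # continuation action: move randomly around the current state
        state += 1 if random.random() < p_up else -1
    return 1 if state == n_states else 0

# Estimate the expected per-round reward of one threshold policy; the
# learner's task is to find the best such policy without knowing p_up
# or p_goal_terminal in advance.
rewards = [grbp_round(threshold=7) for _ in range(10_000)]
print(sum(rewards) / len(rewards))
```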

    No-Regret Exploration in Goal-Oriented Reinforcement Learning

    Many popular reinforcement learning problems (e.g., navigation in a maze, some Atari games, mountain car) are instances of the episodic setting under its stochastic shortest path (SSP) formulation, where an agent has to achieve a goal state while minimizing the cumulative cost. Despite the popularity of this setting, the exploration-exploitation dilemma has been sparsely studied in general SSP problems, with most of the theoretical literature focusing on different problems (i.e., fixed-horizon and infinite-horizon) or making the restrictive loop-free SSP assumption (i.e., no state can be visited twice during an episode). In this paper, we study the general SSP problem with no assumption on its dynamics (some policies may actually never reach the goal). We introduce UC-SSP, the first no-regret algorithm in this setting, and prove a regret bound scaling as $\widetilde{\mathcal{O}}(DS\sqrt{ADK})$ after $K$ episodes for any unknown SSP with $S$ states, $A$ actions, positive costs and SSP-diameter $D$, defined as the smallest expected hitting time from any starting state to the goal. We achieve this result by crafting a novel stopping rule, such that UC-SSP may interrupt the current policy if it is taking too long to achieve the goal and switch to alternative policies that are designed to rapidly terminate the episode.
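    The stopping-rule idea can be sketched as follows; the toy environment, the fixed step budget, and all names are assumptions for illustration, not the paper's actual algorithm or confidence-bound construction.

```python
import random

class ChainSSP:
    """Toy SSP instance: states 0..n, goal = n, every step costs 1.
    Action 0 moves forward with probability 0.3; action 1 with only 0.05,
    so a policy that always picks action 1 is very slow to reach the goal."""
    def __init__(self, n=10):
        self.n = n
        self.state = 0
    def reset(self):
        self.state = 0
        return self.state
    def is_goal(self, s):
        return s == self.n
    def step(self, a):
        if random.random() < (0.3 if a == 0 else 0.05):
            self.state += 1
        return self.state, 1.0  # (next state, cost)

def run_episode(env, policy, fallback, horizon_cap):
    # UC-SSP-style stopping rule (illustrative): follow the current policy,
    # but if the goal has not been reached within horizon_cap steps, switch
    # to a policy designed to terminate the episode rapidly.
    state, steps, total_cost = env.reset(), 0, 0.0
    while not env.is_goal(state):
        action = policy(state) if steps < horizon_cap else fallback(state)
        state, cost = env.step(action)
        total_cost += cost
        steps += 1
    return total_cost

env = ChainSSP()
slow = lambda s: 1  # stand-in for an exploratory policy that may stall
fast = lambda s: 0  # stand-in for a rapidly terminating policy
print(run_episode(env, slow, fast, horizon_cap=50))
```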

    Exploration Using Without-Replacement Sampling of Actions Is Sometimes Inferior

    In many statistical and machine learning applications, without-replacement sampling is considered superior to with-replacement sampling. In some cases, this has been proven, and in others the heuristic is so intuitively attractive that it is taken for granted. In reinforcement learning, many count-based exploration strategies are justified by reliance on the aforementioned heuristic. This paper details the non-intuitive discovery that, when measuring the goodness of an exploration strategy by the stochastic shortest path to a goal state, there is a class of processes for which an action selection strategy based on without-replacement sampling of actions can be worse than with-replacement sampling. Specifically, the expected time until a specified goal state is first reached can be provably larger under without-replacement sampling. Numerical experiments describe the frequency and severity of this inferiority.
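    The evaluation criterion used in the paper (expected time until the goal state is first reached) can be made concrete with a small simulation; the chain environment below is an assumption for illustration and, unlike the paper's counterexample class, it happens to favor without-replacement sampling.

```python
import random

def hitting_time(without_replacement, n_states=4, n_actions=3, seed=None,
                 max_steps=10_000):
    """Time until the goal is first reached on a toy chain where exactly one
    hidden action per state moves forward and all others leave the state
    unchanged. Compares per-state without-replacement cycling of untried
    actions against uniform with-replacement resampling."""
    rng = random.Random(seed)
    good = [rng.randrange(n_actions) for _ in range(n_states)]  # hidden correct actions
    untried = [list(range(n_actions)) for _ in range(n_states)]
    state, t = 0, 0
    while state < n_states and t < max_steps:
        if without_replacement:
            if not untried[state]:
                untried[state] = list(range(n_actions))  # refill after a full pass
            a = untried[state].pop(rng.randrange(len(untried[state])))
        else:
            a = rng.randrange(n_actions)
        if a == good[state]:
            state += 1
        t += 1
    return t

for wor in (True, False):
    times = [hitting_time(wor, seed=i) for i in range(2_000)]  # paired seeds
    label = "without-replacement" if wor else "with-replacement   "
    print(label, sum(times) / len(times))
```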

    CAPIR: Collaborative Action Planning with Intention Recognition

    We apply decision theoretic techniques to construct non-player characters that are able to assist a human player in collaborative games. The method is based on solving Markov decision processes, which can be difficult when the game state is described by many variables. To scale to more complex games, the method allows decomposition of a game task into subtasks, each of which can be modelled by a Markov decision process. Intention recognition is used to infer the subtask that the human is currently performing, allowing the helper to assist the human in performing the correct task. Experiments show that the method can be effective, giving near-human level performance in helping a human in a collaborative game.
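    The intention recognition step can be sketched as a Bayesian posterior update over candidate subtasks; the subtask names and the likelihood model below are illustrative assumptions, not CAPIR's exact formulation.

```python
def update_subtask_posterior(prior, likelihoods):
    """One step of Bayesian intention recognition: given a prior over the
    candidate subtasks and, for each subtask, the likelihood of the human's
    observed action under that subtask's policy, return the posterior."""
    unnormalized = {task: prior[task] * likelihoods[task] for task in prior}
    z = sum(unnormalized.values())
    return {task: p / z for task, p in unnormalized.items()}

# Example: the human moves toward the door, an action that is likely while
# pursuing "open_door" and unlikely while pursuing "fetch_key".
prior = {"open_door": 0.5, "fetch_key": 0.5}
likelihoods = {"open_door": 0.8, "fetch_key": 0.2}  # P(action | subtask)
posterior = update_subtask_posterior(prior, likelihoods)
print(posterior)  # the helper then assists with the most probable subtask
```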

    Conflict-driven learning in AI planning state-space search

    Many combinatorial computation problems in computer science can be cast as a reachability problem in an implicitly described, potentially huge, graph: the state space. State-space search is a versatile and widespread method to solve such reachability problems, but it requires some form of guidance to avoid exploring that combinatorial space exhaustively. Conflict-driven learning is an indispensable search ingredient for solving constraint satisfaction problems (most prominently, Boolean satisfiability). It guides search towards solutions by identifying conflicts during the search, i.e., search branches not leading to any solution, and learning from them knowledge that avoids similar conflicts in the remainder of the search.

    This thesis adapts the conflict-driven learning methodology to more general classes of reachability problems. Specifically, our work is placed in AI planning. We consider goal-reachability objectives in classical planning and in planning under uncertainty. The canonical form of "conflicts" in this context are dead-end states, i.e., states from which the desired goal property cannot be reached. We pioneer methods for learning sound and generalizable dead-end knowledge from conflicts encountered during forward state-space search. This embraces the following core contributions:

    When acting under uncertainty, the presence of dead-end states may make it impossible to satisfy the goal property with absolute certainty. The natural planning objective then is MaxProb: maximizing the probability of reaching the goal. However, algorithms for MaxProb probabilistic planning are severely underexplored. We close this gap by developing a large design space of probabilistic state-space search methods, contributing new search algorithms, admissible state-space reduction techniques, and goal-probability bounds suitable for heuristic state-space search. We systematically explore this design space through an extensive empirical evaluation.

    The key to our conflict-driven learning adaptation are unsolvability detectors, i.e., goal-reachability overapproximations. We design three complementary families of such unsolvability detectors, building upon known techniques: critical-path heuristics, linear-programming-based heuristics, and dead-end traps. We develop search methods to identify conflicts in deterministic and probabilistic state spaces, and suitable refinement methods for the different unsolvability detectors so as to recognize these states. Arranged in a depth-first search, our techniques approach the elegance of conflict-driven learning in constraint satisfaction, featuring the ability to learn to refute search subtrees, and intelligent backjumping to the root cause of a conflict.

    We provide a comprehensive experimental evaluation, demonstrating that the proposed techniques yield state-of-the-art performance for finding plans for solvable classical planning tasks, proving classical planning tasks unsolvable, and solving MaxProb in probabilistic planning, on benchmarks where dead-end states abound.
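    The MaxProb objective can be made concrete with a basic value iteration that computes, for every state, the maximum probability of eventually reaching the goal; this is a minimal textbook-style sketch, not one of the heuristic-search algorithms developed in the thesis.

```python
def maxprob_value_iteration(P, goal, n_iters=200):
    """Compute V(s) = maximum probability of eventually reaching `goal`.
    P maps (state, action) to a list of (successor, probability) pairs.
    Starting from V = 0 and iterating the Bellman operator converges to
    the least fixed point, which is exactly the MaxProb value."""
    states = {s for (s, _) in P} | {t for outs in P.values() for (t, _) in outs}
    V = {s: 0.0 for s in states}
    V[goal] = 1.0
    for _ in range(n_iters):
        for s in states:
            acts = [a for (s0, a) in P if s0 == s]
            if s != goal and acts:
                V[s] = max(sum(p * V[t] for (t, p) in P[(s, a)]) for a in acts)
    return V

# Tiny example: from s0, "risky" reaches the goal w.p. 0.8 and a dead-end
# state w.p. 0.2, while "safe" just loops; the dead end caps MaxProb at 0.8.
P = {("s0", "risky"): [("goal", 0.8), ("dead", 0.2)],
     ("s0", "safe"):  [("s0", 1.0)]}
print(maxprob_value_iteration(P, "goal"))  # V("s0") converges to 0.8
```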
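    The conflict-driven learning loop itself can be sketched as a depth-first search that records dead-end states once all of their successors have been refuted, so that later visits prune immediately; this toy version memorizes individual states, whereas the thesis refines unsolvability detectors that generalize to unseen states.

```python
def dfs_with_deadend_learning(initial, is_goal, successors):
    """Forward depth-first search with conflict-driven dead-end learning.
    A state is learned as a dead end once every successor branch failed;
    re-encountering it later refutes the whole subtree without search."""
    deadends = set()
    path = []  # current DFS branch, used for cycle detection

    def search(state):
        if is_goal(state):
            return True
        if state in deadends or state in path:
            return False  # learned conflict or cycle: prune immediately
        path.append(state)
        for succ in successors(state):
            if search(succ):
                path.pop()
                return True
        path.pop()
        deadends.add(state)  # conflict: no successor reaches the goal
        return False

    return search(initial), deadends

# Toy graph with goal 4: the subtree {2, 3} is a dead end reachable from
# both 0 and 1; learning it during the first descent lets the second
# encounter (from 1) be pruned instead of re-explored.
graph = {0: [2, 1], 1: [2, 4], 2: [3], 3: [], 4: []}
found, learned = dfs_with_deadend_learning(0, lambda s: s == 4, graph.get)
print(found, learned)  # True {2, 3}
```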