136 research outputs found

    Log-Distributional Approach for Learning Covariate Shift Ratios

    Distributional Reinforcement Learning theory suggests that distributional fixed points could play a fundamental role in learning non-additive value functions. In particular, we propose a distributional approach for learning Covariate Shift Ratios, whose original update rule is multiplicative.

    The Impatient May Use Limited Optimism to Minimize Regret

    Discounted-sum games provide a formal model for the study of reinforcement learning, where the agent is enticed to get rewards early since later rewards are discounted. When the agent interacts with the environment, she may regret her actions, realizing that a previous choice was suboptimal given the behavior of the environment. The main contribution of this paper is a PSPACE algorithm for computing the minimum possible regret of a given game. To this end, several results of independent interest are shown. (1) We identify a class of regret-minimizing and admissible strategies that first assume that the environment is collaborating, then assume it is adversarial; the precise timing of the switch is key here. (2) Disregarding the computational cost of numerical analysis, we provide an NP algorithm that checks that the regret entailed by a given time-switching strategy exceeds a given value. (3) We show that determining whether a strategy minimizes regret is decidable in PSPACE.
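
The two notions in this abstract, discounted sums and regret, can be illustrated with a minimal sketch. It is a finite toy with made-up numbers, not the paper's PSPACE algorithm: the strategy names, the reward sequences, and the discount factor 0.5 are all assumptions chosen for illustration. Regret is taken as the worst case, over environment behaviors, of the gap between the best achievable value and the value actually obtained.

```python
from typing import Sequence


def discounted_sum(rewards: Sequence[float], discount: float) -> float:
    """Sum of rewards where the reward at step t is weighted by discount**t."""
    return sum(r * discount**t for t, r in enumerate(rewards))


def regret(payoff: Sequence[Sequence[float]], strategy: int) -> float:
    """Worst-case gap to the best response in hindsight.

    payoff[i][j] is the value obtained when the agent plays strategy i
    and the environment behaves according to j.
    """
    n_env = len(payoff[0])
    return max(
        max(row[j] for row in payoff) - payoff[strategy][j]
        for j in range(n_env)
    )


# Hypothetical reward sequences (agent strategy x environment behavior).
DISCOUNT = 0.5
plays = {
    ("optimistic", "collaborating"): [4.0, 0.0],
    ("optimistic", "adversarial"): [0.0, 0.0],
    ("cautious", "collaborating"): [1.0, 2.0],
    ("cautious", "adversarial"): [1.0, 0.0],
}
strategies = ["optimistic", "cautious"]
environments = ["collaborating", "adversarial"]
payoff = [
    [discounted_sum(plays[(s, e)], DISCOUNT) for e in environments]
    for s in strategies
]
regrets = {s: regret(payoff, i) for i, s in enumerate(strategies)}
best_strategy = min(regrets, key=regrets.get)
```

In this toy instance the optimistic strategy has the smaller regret, which loosely mirrors the abstract's point that some optimism can be regret-minimizing; the actual result concerns strategies over infinite plays.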

    Average cost temporal-difference learning

    John N. Tsitsiklis and Benjamin Van Roy. Includes bibliographical references (p. 23). Supported by NSF grant DMI-9625489 and AFOSR grant F49620-95-1-0219.

    Navigation with uncertain spatio-temporal resources

    Supporting people with intelligent navigation instructions enables them to efficiently achieve trip-related objectives (e.g., minimum travel time or fuel consumption) and keeps them from making unnecessary detours. This, in turn, saves time and money and reduces CO2 emissions. For these reasons, manufacturers integrate navigation systems into almost all modern automobiles. Nevertheless, most of them support only simple routing instructions, i.e., how to drive from location A to B. However, people regularly face more complex decisions, e.g., navigating to a cheap gas station on the route while accounting for dynamic gas price changes. Another example scenario arises after reaching the destination, when an available parking facility needs to be found. So far, people cruise almost randomly around the goal area in search of a parking space. As a consequence, people's valuable time is consumed and unnecessary traffic arises. Besides private individuals, transportation companies have to make complex mobility decisions. For instance, taxi drivers have to decide where to move next whenever the taxi is idle. There are plenty of possibilities for where the taxi driver could go. If the last drop-off was in a sparsely populated region, waiting for a call from the taxi office will likely result in a longer drive to the next customer. In turn, customer satisfaction decreases with a longer waiting time, which implies a potential loss of customers. Recently, the number of data sources that can improve these mobility decisions has increased. For instance, on-street parking sensors track the current state of the spaces (e.g., in Melbourne), mobile applications collect taxi requests from customers, and gas stations publish their current prices, all in real time. This thesis investigates the question of how to design algorithms that exploit this volatile data. Standard routing algorithms assume a static world.
But the availability of passengers, gas prices, and the availability of parking spots change over time in a non-deterministic manner. Hence, we model multiple real-world applications as Markov decision processes (MDPs), a framework for sequential decision making under uncertainty. Depending on the task, we propose to solve the MDP with dynamic programming, replanning and hindsight planning, or reinforcement learning. Ultimately, we combine all applications in a single problem domain. Subsequently, we propose a reinforcement learning approach that solves all applications in this domain without modification. Furthermore, it decouples the routing task from solving the application itself. Hence, it is transferable to previously unseen street networks without further training.
Intelligent navigation systems keep road users from driving detours. This saves them time and money and reduces CO2 emissions. For this reason, manufacturers install navigation systems in almost all new cars. To date, most systems support only simple route planning that computes the shortest or fastest path from A to B. Yet drivers regularly have to make decisions that go beyond this. For example, a driver may want to visit the cheapest possible gas station on the way to the actual destination. However, the station may change its prices dynamically while the driver is on the way there. Then, once the actual destination is reached, a parking space must be found. So far, drivers searching for parking cruise randomly through the destination area, hoping to find a free space as quickly as possible. The search causes additional traffic, and the driver spends more time on the road. Besides private individuals, transportation companies also have to make complex decisions about movements. For example, a taxi driver without a passenger must decide where to position the taxi next. The driver could wait at the last drop-off location until a call comes from the taxi dispatch office. However, if the last drop-off was in a remote area, the next passenger will probably have to wait a long time until the driver arrives. This lowers customer satisfaction, which in turn means a potential loss of customers. Recently, more and more data sources have become available that improve decisions for these problems. For example, parking sensors track the availability of parking spaces, mobile applications collect passenger requests, and gas stations publish their current prices in real time. This thesis investigates the research question of how algorithms can be designed so that they make use of this changing information. Standard routing algorithms assume a static world. But the availability of passengers, gas prices, and parking space states change non-deterministically. For this reason, we model a number of applications as Markov decision processes (MDPs). Depending on the application, we propose solving the MDP with dynamic programming, replanning or hindsight planning, or reinforcement learning. Finally, we combine all applications in a single domain. This allows us to define a reinforcement learning approach that can solve all applications in this domain without modification. This approach makes it possible to decouple route planning from the actual problem. As a result, the learned function approximation is also applicable to previously unseen street networks without further training.
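
To make the MDP framing concrete, here is a minimal sketch of dynamic programming (value iteration) on a toy parking-search problem. The states, actions, probabilities, and costs are hypothetical and not taken from the thesis; the sketch only illustrates the general technique the abstract names.

```python
from typing import Dict, List, Tuple

# transitions[state][action] = list of (probability, next_state, reward)
Transitions = Dict[str, Dict[str, List[Tuple[float, str, float]]]]

# Hypothetical toy MDP (assumed numbers): a driver looks for parking.
TOY_PARKING_MDP: Transitions = {
    "searching": {
        # Cruising costs one time unit; a spot is found with probability 0.3.
        "cruise": [(0.3, "parked", -1.0), (0.7, "searching", -1.0)],
        # The paid garage always has room but costs more.
        "garage": [(1.0, "parked", -5.0)],
    },
    "parked": {},  # terminal state: no actions available
}


def q_value(outcomes, values, gamma):
    """Expected return of one action given current state-value estimates."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)


def value_iteration(mdp: Transitions, gamma: float = 1.0,
                    tol: float = 1e-9, max_iter: int = 10_000):
    """Repeat Bellman optimality backups until the largest change is below tol."""
    values = {s: 0.0 for s in mdp}
    for _ in range(max_iter):
        delta = 0.0
        for state, actions in mdp.items():
            if not actions:  # terminal states keep value 0
                continue
            best = max(q_value(o, values, gamma) for o in actions.values())
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {
        state: max(actions, key=lambda a: q_value(actions[a], values, gamma))
        for state, actions in mdp.items() if actions
    }
    return values, policy


values, policy = value_iteration(TOY_PARKING_MDP)
```

With these assumed numbers, cruising has an expected cost of 1/0.3 ≈ 3.33 time units, which beats the garage cost of 5, so the greedy policy is to keep cruising. Changing the probability or the garage cost flips the decision, which is the kind of trade-off an MDP formulation captures.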

    Markov Decision Processes with Embedded Agents

    We present Markov Decision Processes with Embedded Agents (MDPEAs), an extension of multi-agent POMDPs that allows for the modeling of environments that can change the actuators, sensors, and learning function of the agent, e.g., a household robot that could gain and lose hardware from its frame, or a sovereign software agent that could encounter viruses on computers that modify its code. We show several toy problems for which standard reinforcement-learning methods fail to converge, and give an algorithm, `just-copy-it`, which learns some of them. Unlike MDPs, MDPEAs are closed systems, and hence their evolution over time can be treated as a Markov chain. In future work, we hope MDPEAs can be extended to model even fully embedded agents acting in real digital or physical environments.

    Contributions to Optimal Stopping and Long-Term Average Impulse Control

    In this thesis we consider undiscounted, infinite time horizon optimal stopping problems with generalized linear costs and long-term average impulse control problems. The main goal is to find (semi-)explicit solutions in case the underlying process contains jumps. To solve the stopping problems, we use embedded monotone problems to find easy-to-handle sufficient conditions for a threshold time to be optimal. Further, we characterize the threshold for one-dimensional Markov processes in both discrete and continuous time. While in the discrete time case the concept of ladder times can be used to exploit inherent monotone structures, in continuous time we develop an integral-type maximum representation to enable a comparable line of argument. The findings on long-term average impulse control problems are structured in two main areas. First, for a general one-dimensional Markov process we characterize the problem’s value and possible optimal strategies by an associated stopping problem. Then, we develop a step-by-step solution technique in case the process is a Lévy process and demonstrate its usefulness by applying it to relevant examples, among them problems from inventory control and optimal harvesting. Apart from these direct applications, we use our theoretical findings to investigate the influence of varying fixed costs on the impulse control problem, study a control problem with a restriction on the impulse frequency, and treat mean field games and problems of impulse control.