
    Log-Distributional Approach for Learning Covariate Shift Ratios

    Distributional Reinforcement Learning theory suggests that distributional fixed points could play a fundamental role in learning non-additive value functions. In particular, we propose a distributional approach for learning covariate shift ratios, whose update rule is multiplicative in its original form.
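    One way to read the "originally multiplicative" update rule: a density-ratio estimate is rescaled by an importance weight on each transition, so taking logarithms makes the update additive and TD-like. A minimal, hypothetical tabular sketch of that reading (not the paper's actual algorithm; all names and step sizes below are assumptions):

```python
import numpy as np

# Minimal tabular sketch of the idea (not the paper's method; log_w, log_pi,
# log_mu and the learning rate are assumptions).
# The naive per-transition update of a covariate shift ratio is multiplicative,
#     w(s') <- w(s) * pi(a|s) / mu(a|s),
# so keeping the estimate in log space turns it into an additive, TD-style step.
def log_ratio_update(log_w, s, a, s_next, log_pi, log_mu, lr=0.05):
    """One stochastic update of log w at the successor state s_next."""
    target = log_w[s] + log_pi[s, a] - log_mu[s, a]  # additive in log space
    log_w[s_next] += lr * (target - log_w[s_next])   # relax towards the target
    return log_w

# Hypothetical usage on a 4-state, 2-action problem.
rng = np.random.default_rng(0)
log_pi = np.log(rng.dirichlet(np.ones(2), size=4))  # target policy (log probs)
log_mu = np.log(rng.dirichlet(np.ones(2), size=4))  # behaviour policy (log probs)
log_w = np.zeros(4)                                 # log ratio estimate
log_w = log_ratio_update(log_w, s=0, a=1, s_next=2, log_pi=log_pi, log_mu=log_mu)
```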

    The Impatient May Use Limited Optimism to Minimize Regret

    Discounted-sum games provide a formal model for the study of reinforcement learning, where the agent is enticed to collect rewards early, since later rewards are discounted. When the agent interacts with the environment, she may regret her actions, realizing that a previous choice was suboptimal given the behavior of the environment. The main contribution of this paper is a PSPACE algorithm for computing the minimum possible regret of a given game. To this end, several results of independent interest are shown. (1) We identify a class of regret-minimizing and admissible strategies that first assume that the environment is collaborating, then assume it is adversarial; the precise timing of the switch is key here. (2) Disregarding the computational cost of numerical analysis, we provide an NP algorithm that checks whether the regret entailed by a given time-switching strategy exceeds a given value. (3) We show that determining whether a strategy minimizes regret is decidable in PSPACE.
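    For orientation, regret in discounted-sum games is commonly formalized along the following lines (the notation below is ours, not necessarily the paper's): a play induces a discounted sum of edge weights, and the regret of an agent strategy measures how much better she could have done against the environment behavior that actually occurred.

```latex
% Common formalization (our notation, not necessarily the paper's):
% e_0 e_1 ... is the play produced by agent strategy \sigma against
% environment strategy \tau, and w assigns weights to edges.
\mathrm{val}(\sigma, \tau) \;=\; \sum_{i \ge 0} \lambda^{i}\, w(e_i),
  \qquad 0 < \lambda < 1,
% regret of \sigma, and the minimal regret of the game
\mathrm{reg}(\sigma) \;=\; \sup_{\tau}\Bigl(\,\sup_{\sigma'} \mathrm{val}(\sigma', \tau)
  \;-\; \mathrm{val}(\sigma, \tau)\Bigr),
  \qquad
  \mathrm{Reg} \;=\; \inf_{\sigma} \mathrm{reg}(\sigma).
```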

    Average cost temporal-difference learning

    Includes bibliographical references (p. 23). Supported by NSF grant DMI-9625489 and AFOSR grant F49620-95-1-0219. John N. Tsitsiklis and Benjamin Van Roy.
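    The entry above carries only catalogue metadata. For context, average-cost temporal-difference methods of this kind estimate a differential value function together with a running estimate of the average reward; the sketch below is a tabular TD(0) variant under our own assumptions (step sizes, tabular values), not the paper's linear-function-approximation algorithm.

```python
# Tabular average-cost TD(0) sketch (our simplification; the paper itself
# analyses the linear function-approximation case).
def avg_cost_td0(V, rho, s, r, s_next, alpha=0.1, beta=0.01):
    """One update of the differential values V and the average-reward estimate rho."""
    delta = r - rho + V[s_next] - V[s]  # differential TD error
    V[s] += alpha * delta               # move V(s) towards the differential target
    rho += beta * (r - rho)             # track the long-run average reward
    return V, rho

# Hypothetical usage with three states.
V, rho = [0.0, 0.0, 0.0], 0.0
V, rho = avg_cost_td0(V, rho, s=0, r=1.0, s_next=2)
```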

    Navigation with uncertain spatio-temporal resources

    Supporting people with intelligent navigation instructions enables users to efficiently achieve trip-related objectives (e.g., minimum travel time or fuel consumption) and prevents them from making unnecessary detours. This, in turn, enables them to save time and money and, additionally, to minimize CO₂ emissions. For these reasons, manufacturers integrate navigation systems into almost all modern automobiles. Nevertheless, most of them support only simple routing instructions, i.e., how to drive from location A to B. However, people are regularly faced with more complex decisions, e.g., navigating to a cheap gas station on the route while incorporating dynamic gas price changes. Another example scenario: after reaching the destination, an available parking facility needs to be found. So far, people cruise almost randomly around the goal area in search of a parking space. As a consequence, people's valuable time is consumed and unnecessary traffic arises.

    Besides private individuals, transportation companies also have to make complex mobility decisions. For instance, taxi drivers have to decide where to move next whenever the taxi is idle. There are plenty of possibilities for where the taxi driver could go. If the last drop-off was in a sparsely populated region, waiting for a call from the taxi office will likely result in a longer drive to the next customer. In turn, customer satisfaction decreases with longer waiting times, which implies a potential loss of customers. Recently, the number of data sources that can potentially improve these mobility decisions has increased. For instance, on-street parking sensors track the current state of the spaces (e.g., in Melbourne), mobile applications collect taxi requests from customers, and gas stations publish their current prices, all in real time.

    This thesis investigates the question of how to design algorithms such that they exploit this volatile data. Standard routing algorithms assume a static world, but the availability of passengers, gas prices, and the availability of parking spots change over time in a non-deterministic manner. Hence, we model multiple real-world applications as Markov decision processes (MDPs), i.e., within a framework for sequential decision making under uncertainty. Depending on the task, we propose to solve the MDP with dynamic programming, replanning and hindsight planning, or reinforcement learning. Ultimately, we combine all applications in a single problem domain. Subsequently, we propose a reinforcement learning approach that solves all applications in this domain without modification. Furthermore, this approach decouples the routing task from solving the application itself; hence, it is transferable to previously unseen street networks without further training.
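    As a minimal illustration of the MDP framing used throughout the thesis (the toy transition model, costs, and discount below are invented for illustration and are not taken from the thesis), value iteration, i.e., the dynamic-programming solution, on a small street-graph-like state space could look as follows:

```python
import numpy as np

# Toy value iteration for a navigation-style MDP (illustrative only).
# P[a][s, s'] is the probability of reaching s' when taking action a in s,
# c[s, a] is the expected immediate cost, and gamma is a discount factor.
def value_iteration(P, c, gamma=0.95, tol=1e-8):
    n_states, n_actions = c.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = c[s, a] + gamma * sum_{s'} P[a][s, s'] * V[s']
        Q = c + gamma * np.stack([P[a] @ V for a in range(n_actions)], axis=1)
        V_new = Q.min(axis=1)               # minimise expected discounted cost
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmin(axis=1)  # optimal values and greedy policy
        V = V_new

# Invented three-node example: two actions per node, costs chosen arbitrarily.
P = np.array([[[0, 1, 0], [0, 0, 1], [0, 0, 1]],
              [[0, 0, 1], [1, 0, 0], [0, 0, 1]]], dtype=float)
c = np.array([[1.0, 4.0], [2.0, 0.5], [0.0, 0.0]])
V, policy = value_iteration(P, c)
```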

    Markov Decision Processes with Embedded Agents

    We present Markov Decision Processes with Embedded Agents (MDPEAs), an extension of multi-agent POMDPs that allows for the modeling of environments that can change the actuators, sensors, and learning function of the agent, e.g., a household robot which could gain and lose hardware from its frame, or a sovereign software agent which could encounter viruses on computers that modify its code. We show several toy problems for which standard reinforcement-learning methods fail to converge, and give an algorithm, `just-copy-it`, which learns some of them. Unlike MDPs, MDPEAs are closed systems, and hence their evolution over time can be treated as a Markov chain. In future work, we hope MDPEAs can be extended to model even fully embedded agents acting in real digital or physical environments.
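    A rough data-structure sketch of the idea that the environment can rewrite the agent itself (this is our illustration of the concept, not the paper's formalism and not the `just-copy-it` algorithm; all names are ours):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    # All parts are plain callables so the environment can replace them.
    sense: Callable   # environment state -> observation
    policy: Callable  # observation -> action
    learn: Callable   # reward -> new Agent (possibly with different parts)

def mdpea_step(env_state, agent, transition):
    """One step of an embedded-agent process: the transition may return a
    modified agent (changed sensors, actuators, or learning function)."""
    action = agent.policy(agent.sense(env_state))
    next_state, reward, next_agent = transition(env_state, action, agent)
    return next_state, next_agent.learn(reward)
```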

    Contributions to Optimal Stopping and Long-Term Average Impulse Control

    In this thesis we consider undiscounted, infinite-time-horizon optimal stopping problems with generalized linear costs and long-term average impulse control problems. The main goal is to find (semi-)explicit solutions in the case where the underlying process contains jumps. In order to solve the stopping problems, we utilize embedded monotone problems to find easy-to-handle sufficient conditions for a threshold time to be optimal. Further, we characterize the threshold for one-dimensional Markov processes in both discrete and continuous time. While in the discrete-time case the concept of ladder times can be used to exploit inherent monotone structures, in continuous time we develop an integral-type maximum representation to enable a comparable line of argument. The findings on long-term average impulse control problems are structured in two main areas. First, for a general one-dimensional Markov process we characterize the problem's value and possible optimal strategies by an associated stopping problem. Then, we develop a step-by-step solution technique for the case where the process is a Lévy process and demonstrate its usefulness by applying it to relevant examples, among others, problems from inventory control and optimal harvesting. Apart from these direct applications, we use our theoretical findings to investigate the influence of varying fixed costs on the impulse control problem, study a control problem with a restriction on the impulse frequency, and treat mean field games and problems of impulse control.
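    For orientation, a simplified instance of the kind of stopping problem studied (our notation; the thesis treats generalized linear costs and also the long-term average impulse control case) is an undiscounted problem with a linear running cost, solved by a threshold rule:

```latex
% Simplified illustration, not the thesis' exact setting:
% undiscounted stopping of a process (X_t) with a linear running cost c > 0
V(x) \;=\; \sup_{\tau}\; \mathbb{E}_x\!\bigl[\, g(X_\tau) - c\,\tau \,\bigr],
% with the candidate optimal rule a threshold (first-passage) time
\tau^{*} \;=\; \inf\{\, t \ge 0 \,:\, X_t \ge x^{*} \,\}.
```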