Answer Set Programming for Non-Stationary Markov Decision Processes
Non-stationary domains, where unforeseen changes happen, make it challenging
for agents to find an optimal policy for a sequential decision-making problem.
This work investigates a solution to this problem that combines Markov Decision
Processes (MDP) and Reinforcement Learning (RL) with Answer Set Programming
(ASP) in a method we call ASP(RL). In this method, Answer Set Programming is
used to find the possible trajectories of an MDP, from which Reinforcement
Learning is applied to learn the optimal policy of the problem. Results show
that ASP(RL) is capable of efficiently finding the optimal solution of an MDP
representing non-stationary domains.
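
The two-stage pipeline lends itself to a compact illustration. Below is a minimal Python sketch under stated assumptions: the ASP step is replaced by a hypothetical `admissible` predicate that prunes logically impossible moves (a real implementation would query an ASP solver such as clingo over an action theory), and tabular Q-learning then runs over the pruned state-action space. The grid domain and all constants are illustrative, not the paper's benchmark.

```python
import random
from collections import defaultdict

N = 5                       # 5x5 grid, start (0,0), goal (4,4) -- assumed domain
ACTIONS = [(0, 1), (0, -1), (1, 0), (-1, 0)]
GOAL = (N - 1, N - 1)

def admissible(state, action):
    """Stand-in for the ASP step: rule out moves that leave the grid."""
    x, y = state
    dx, dy = action
    return 0 <= x + dx < N and 0 <= y + dy < N

def step(state, action):
    nxt = (state[0] + action[0], state[1] + action[1])
    return nxt, (1.0 if nxt == GOAL else -0.01), nxt == GOAL

# Tabular Q-learning restricted to the trajectories the logic step allows.
Q = defaultdict(float)
alpha, gamma, eps = 0.5, 0.95, 0.1
for _ in range(2000):
    s, done = (0, 0), False
    while not done:
        acts = [a for a in ACTIONS if admissible(s, a)]    # pruned action set
        a = random.choice(acts) if random.random() < eps else \
            max(acts, key=lambda a: Q[(s, a)])
        s2, r, done = step(s, a)
        best = 0.0 if done else max(Q[(s2, b)] for b in ACTIONS if admissible(s2, b))
        Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])
        s = s2
```

Pruning before learning is the design point: the learner never spends samples on trajectories the logic program has already ruled out.
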
Feature Markov Decision Processes
General purpose intelligent learning agents cycle through (complex, non-MDP)
sequences of observations, actions, and rewards. On the other hand,
reinforcement learning is well-developed for small finite state Markov Decision
Processes (MDPs). So far, extracting the right state representation from the
bare observations, i.e. reducing the agent setup to the MDP framework, has been
an art performed by human designers. Before we can think of mechanizing this
search for suitable MDPs, we need a formal objective criterion. The main
contribution of this article is to develop such a criterion. I also integrate
the various parts into one learning algorithm. Extensions to more realistic
dynamic Bayesian networks are developed in a companion article.
Comment: 7 pages
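
One plausible way to make such a criterion concrete, offered here as a hedged sketch rather than the article's actual definition, is to score a candidate feature map phi by the empirical log-loss (a coding-length proxy) of the state and reward sequences it induces; smaller scores favor maps whose induced process looks more nearly Markovian. The history format and the `code_length` helper are assumptions.

```python
import math
from collections import Counter, defaultdict

def code_length(history, phi):
    """Approximate coding length (in nats) of the state and reward sequences
    induced by a candidate map phi(history_prefix) -> state, where history is
    a list of (observation, action, reward) triples. Illustrative only."""
    trans = defaultdict(Counter)   # (s, a) -> counts of next states
    rew = defaultdict(Counter)     # (s, a) -> counts of rewards
    states = [phi(history[:i]) for i in range(len(history) + 1)]
    for i, (_obs, a, r) in enumerate(history):
        trans[(states[i], a)][states[i + 1]] += 1
        rew[(states[i], a)][r] += 1
    total = 0.0
    for table in (trans, rew):
        for counts in table.values():
            n = sum(counts.values())
            for c in counts.values():
                total -= c * math.log(c / n)   # empirical log-loss of the model
    return total
```
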
Feature Reinforcement Learning: Part I: Unstructured MDPs
General-purpose, intelligent, learning agents cycle through sequences of
observations, actions, and rewards that are complex, uncertain, unknown, and
non-Markovian. On the other hand, reinforcement learning is well-developed for
small finite state Markov decision processes (MDPs). Up to now, extracting the
right state representations out of bare observations, that is, reducing the
general agent setup to the MDP framework, has been an art that involves significant
effort by designers. The primary goal of this work is to automate the reduction
process and thereby significantly expand the scope of many existing
reinforcement learning algorithms and the agents that employ them. Before we
can think of mechanizing this search for suitable MDPs, we need a formal
objective criterion. The main contribution of this article is to develop such a
criterion. I also integrate the various parts into one learning algorithm.
Extensions to more realistic dynamic Bayesian networks are developed in Part
II. The role of POMDPs is also considered there.
Comment: 24 LaTeX pages, 5 diagrams
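
Given a score such as the `code_length` sketch above, the mechanized search reduces in its simplest form to selecting the best map from a candidate pool. The two candidate maps below are hypothetical examples, not the article's feature class.

```python
# Toy "mechanized search" over two hypothetical candidate maps, reusing
# code_length from the sketch above.
phi_last = lambda h: h[-1][0] if h else None       # state = last observation
phi_pair = lambda h: tuple(x[0] for x in h[-2:])   # state = last two observations

def best_map(history, candidates):
    """Pick the candidate feature map with minimal coding length."""
    return min(candidates, key=lambda phi: code_length(history, phi))

# best_map(history, [phi_last, phi_pair]) keeps whichever abstraction
# compresses the observed state and reward sequences better.
```
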
A PAC Learning Algorithm for LTL and Omega-regular Objectives in MDPs
Linear temporal logic (LTL) and omega-regular objectives -- a superset of LTL
-- have seen recent use as a way to express non-Markovian objectives in
reinforcement learning. We introduce a model-based probably approximately
correct (PAC) learning algorithm for omega-regular objectives in Markov
decision processes. Unlike prior approaches, our algorithm learns from sampled
trajectories of the system and does not require prior knowledge of the system's
topology.
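
A hedged sketch of the model-based ingredients follows: estimate transition probabilities from samples, then solve the learned model for the maximal satisfaction probability, here for plain reachability, the simplest omega-regular special case (general objectives would add a product construction with an automaton). For brevity the sketch assumes generative-model access through a hypothetical `sample_step`, whereas the paper learns from sampled trajectories; all sizes and sample counts are assumptions.

```python
import numpy as np

S, A, GOAL = 4, 2, 3   # illustrative sizes, not from the paper

def sample_step(s, a, rng=np.random.default_rng(42)):
    """Hypothetical black-box system: action 1 tends to move toward GOAL."""
    if s == GOAL:
        return GOAL
    return min(s + 1, GOAL) if (a == 1 and rng.random() < 0.8) else s

def estimate_model(n_samples=2000, rng=np.random.default_rng(0)):
    """Empirical transition estimates P_hat[s, a, s'] from sampled steps."""
    counts = np.zeros((S, A, S))
    for _ in range(n_samples):
        s, a = int(rng.integers(S)), int(rng.integers(A))
        counts[s, a, sample_step(s, a)] += 1
    return counts / np.maximum(counts.sum(axis=2, keepdims=True), 1)

def max_reach_prob(P, goal, iters=200):
    """Value iteration for the max probability of eventually reaching goal."""
    v = np.zeros(S)
    v[goal] = 1.0
    for _ in range(iters):
        new = (P * v).sum(axis=2).max(axis=1)   # best action per state
        new[goal] = 1.0
        v = new
    return v

print(max_reach_prob(estimate_model(), GOAL))   # approaches the true values
```
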
On the Expressivity of Multidimensional Markov Reward
We consider the expressivity of Markov rewards in sequential decision making
under uncertainty. We view reward functions in Markov Decision Processes (MDPs)
as a means to characterize desired behaviors of agents. Assuming desired
behaviors are specified as a set of acceptable policies, we investigate whether
there exists a scalar or multidimensional Markov reward function that makes the
policies in the set more desirable than the other policies. Our main result
states both necessary and sufficient conditions for the existence of such
reward functions. We also show that for every non-degenerate set of
deterministic policies, there exists a multidimensional Markov reward function
that characterizes it.
Comment: Presented at the RLDM Workshop on Reinforcement Learning as a Model of
Agency
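
The existence question can be posed computationally for a small MDP. The sketch below, an illustration rather than the paper's construction, uses the fact that discounted values are linear in the reward via occupancy measures, and asks a linear program for a scalar reward that gives the acceptable policies a strict margin over all others; the toy MDP and the margin formulation are assumptions.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

S, A, gamma = 3, 2, 0.9                     # assumed toy MDP
P = np.random.default_rng(1).dirichlet(np.ones(S), size=(S, A))  # P[s,a] -> dist
mu = np.full(S, 1.0 / S)                    # initial state distribution

def occupancy(policy):
    """Discounted state occupancy d solving d = mu + gamma * P_pi^T d,
    expanded to a state-action coefficient vector so that V^pi = c . r."""
    P_pi = np.stack([P[s, policy[s]] for s in range(S)])
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)
    c = np.zeros((S, A))
    c[np.arange(S), policy] = d
    return c.ravel()

policies = list(itertools.product(range(A), repeat=S))
acceptable = {policies[0]}                  # assumption: one designated policy
occ = {pi: occupancy(np.array(pi)) for pi in policies}

# Variables: r in [-1, 1]^(S*A) plus a margin t; maximize t subject to
# (c_pi - c_pi') . r >= t for every acceptable pi and unacceptable pi'.
A_ub, b_ub = [], []
for pi in acceptable:
    for pi2 in policies:
        if pi2 in acceptable:
            continue
        A_ub.append(np.append(occ[pi2] - occ[pi], 1.0))   # rearranged constraint
        b_ub.append(0.0)
res = linprog(c=np.append(np.zeros(S * A), -1.0),         # minimize -t
              A_ub=np.array(A_ub), b_ub=b_ub,
              bounds=[(-1, 1)] * (S * A) + [(None, None)])
print("scalar reward exists" if res.x[-1] > 1e-9 else "no strict scalar reward")
```

A strictly positive optimal margin certifies that a scalar reward suffices for this policy set; an optimum of zero means no scalar reward strictly separates it, which is where the paper's multidimensional analysis takes over.
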
Reinforcement Learning for Non-Stationary Markov Decision Processes: The Blessing of (More) Optimism
We consider undiscounted reinforcement learning (RL) in Markov decision
processes (MDPs) under drifting non-stationarity, i.e., both the reward and
state transition distributions are allowed to evolve over time, as long as
their respective total variations, quantified by suitable metrics, do not
exceed certain variation budgets. We first develop the Sliding Window
Upper-Confidence bound for Reinforcement Learning with Confidence Widening
(SWUCRL2-CW) algorithm, and establish its dynamic regret bound when the
variation budgets are known. In addition, we propose the
Bandit-over-Reinforcement Learning (BORL) algorithm to adaptively tune the
SWUCRL2-CW algorithm to achieve the same dynamic regret bound, but in a
parameter-free manner, i.e., without knowing the variation budgets. Notably,
learning non-stationary MDPs via the conventional optimistic exploration
technique presents a unique challenge absent in existing (non-stationary)
bandit learning settings. We overcome the challenge by a novel confidence
widening technique that incorporates additional optimism.
Comment: To appear in the proceedings of the 37th International Conference on
Machine Learning. Shortened conference version of the journal version
(available at arXiv:1906.02922).
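
The two mechanisms named in the abstract, a sliding window and confidence widening, can be sketched in a few lines. In the illustration below, estimates use only the last W observations, and the transition confidence radius is enlarged by an additive term eta; the constants and exact bonus shapes are assumptions, not the SWUCRL2-CW specification.

```python
import math
from collections import deque

class SlidingWindowEstimator:
    """Sliding-window estimates with a widened transition confidence radius."""

    def __init__(self, window=500, eta=0.05, delta=0.1):
        self.window, self.eta, self.delta = window, eta, delta
        self.buf = deque()                 # (s, a, r, s') tuples in the window

    def update(self, s, a, r, s2):
        self.buf.append((s, a, r, s2))
        if len(self.buf) > self.window:
            self.buf.popleft()             # forget stale, possibly drifted data

    def _visits(self, s, a):
        rs = [r for (s_, a_, r, _) in self.buf if (s_, a_) == (s, a)]
        return len(rs), (sum(rs) / len(rs) if rs else 0.0)

    def optimistic_reward(self, s, a):
        n, mean = self._visits(s, a)
        bonus = math.sqrt(2 * math.log(1 / self.delta) / max(n, 1))
        return min(mean + bonus, 1.0)      # UCB-style optimistic estimate

    def transition_radius(self, s, a, n_states):
        n, _ = self._visits(s, a)
        radius = math.sqrt(2 * n_states * math.log(1 / self.delta) / max(n, 1))
        return radius + self.eta           # confidence widening: the extra eta
```
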