93 research outputs found

    Stochastic Planning with Lifted Symbolic Trajectory Optimization

    This paper investigates online stochastic planning for problems with large factored state and action spaces. One promising approach in recent work estimates the quality of applicable actions in the current state through aggregate simulation from the states they reach. This leads to a significant speedup compared to search over concrete states and actions, and suffices to guide decision making in cases where the performance of a random policy is informative of the quality of a state. The paper makes two significant improvements to this approach. The first, taking inspiration from lifted belief propagation, exploits the structure of the problem to derive a more compact computation graph for aggregate simulation. The second replaces the random policy embedded in the computation graph with symbolic variables that are optimized simultaneously with the search for high-quality actions. This expands the scope of the approach to problems that require deep search and where information is lost quickly with random steps. An empirical evaluation shows that these ideas significantly improve performance, leading to state-of-the-art results on hard planning problems.
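
    A minimal sketch of the joint-optimization idea described above, assuming a differentiable aggregate simulation over state-variable marginals; `transition_marginals` and `expected_reward` are hypothetical stand-ins, not the paper's actual computation graph.

```python
# Hedged sketch: jointly optimize the first action and the rollout policy by
# gradient ascent on a differentiable aggregate simulation. The factored
# transition model `transition_marginals` and reward `expected_reward` are
# hypothetical placeholders supplied by the caller.
import torch

def aggregate_value(action_logits, policy_params, transition_marginals,
                    expected_reward, state_marginals, horizon):
    """Approximate value of a soft first action followed by a parametric
    rollout policy, propagating state-variable marginals (not concrete states)."""
    action_probs = torch.softmax(action_logits, dim=-1)
    total = expected_reward(state_marginals, action_probs)
    for _ in range(horizon - 1):
        state_marginals = transition_marginals(state_marginals, action_probs)
        # Rollout actions come from optimizable symbolic parameters rather
        # than a fixed random policy.
        action_probs = torch.softmax(policy_params @ state_marginals, dim=-1)
        total = total + expected_reward(state_marginals, action_probs)
    return total

def plan(transition_marginals, expected_reward, state_marginals,
         n_actions, n_state_vars, horizon=10, steps=200):
    action_logits = torch.zeros(n_actions, requires_grad=True)
    policy_params = torch.zeros(n_actions, n_state_vars, requires_grad=True)
    opt = torch.optim.Adam([action_logits, policy_params], lr=0.1)
    for _ in range(steps):
        opt.zero_grad()
        loss = -aggregate_value(action_logits, policy_params, transition_marginals,
                                expected_reward, state_marginals, horizon)
        loss.backward()
        opt.step()
    return int(torch.argmax(action_logits))  # best first action under the learned rollout
```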

    Multi-Period Stochastic Resource Planning: Models, Algorithms and Applications

    This research addresses the problem of sequential decision making under uncertainty in the professional service industry. Specifically, it considers the problem of dynamically assigning resources to tasks in a stochastic environment with both uncertainty of resource availability due to attrition and uncertainty of job availability due to unknown project bid outcomes. The problem is motivated by the resource planning application at Hewlett Packard Enterprise (HPE). The challenge is to provide resource planning support over a time horizon under the influence of internal resource attrition and demand uncertainty. To ensure demand is satisfied, external contingent resources can be engaged to make up for internal resource attrition. The objective is to maximize profitability by identifying the optimal mix of internal and contingent resources and their assignments to project tasks under explicit uncertainty. While sequential decision problems under uncertainty can often be modeled as a Markov decision process (MDP), the classical dynamic programming (DP) method based on the Bellman equation suffers from the well-known curse of dimensionality and only works for small instances. To tackle this challenge, the research focuses on developing computationally tractable closed-loop Approximate Dynamic Programming (ADP) algorithms that obtain near-optimal solutions in reasonable computational time. Various approximation schemes are developed to approximate the cost-to-go function. A comprehensive computational experiment is conducted to investigate the performance and behavior of the ADP algorithm, and the performance of ADP is compared with that of a rolling horizon approach as a benchmark. Computational results show that the optimization model and algorithm developed in this thesis offer solutions with higher profitability and utilization of internal resources for companies in the professional service industry.
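
    A hedged sketch of a generic closed-loop ADP loop with a linear cost-to-go approximation; the feature map, simulator, and decision generator are hypothetical placeholders, not the thesis's actual resource-planning model.

```python
# Hedged sketch of closed-loop ADP with a linear approximation of the
# cost-to-go (value) function. `simulate_step`, `candidate_decisions`, and
# `features` are hypothetical caller-supplied components.
import numpy as np

def adp(simulate_step, candidate_decisions, features, initial_state,
        horizon, n_iterations=100, alpha=0.05, gamma=1.0):
    """Learn weights w so that features(state) @ w approximates the
    downstream profit-to-go, then act greedily against that approximation."""
    w = np.zeros(features(initial_state).shape[0])
    for _ in range(n_iterations):
        state = initial_state
        for t in range(horizon):
            # Greedy decision against the current value approximation.
            best_decision, best_value, best_next = None, -np.inf, state
            for decision in candidate_decisions(state, t):
                next_state, profit = simulate_step(state, decision, t)
                value = profit + gamma * features(next_state) @ w
                if value > best_value:
                    best_decision, best_value, best_next = decision, value, next_state
            # Temporal-difference style update toward the observed value.
            w += alpha * (best_value - features(state) @ w) * features(state)
            state = best_next
    return w
```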

    How to Fine-tune the Model: Unified Model Shift and Model Bias Policy Optimization

    Designing and deriving effective model-based reinforcement learning (MBRL) algorithms with a performance improvement guarantee is challenging, mainly because of the tight coupling between model learning and policy optimization. Many prior methods that rely on return discrepancy to guide model learning ignore the impact of model shift, which can lead to performance deterioration due to excessive model updates. Other methods use a performance difference bound to explicitly consider model shift; however, they rely on a fixed threshold to constrain model shift, resulting in a heavy dependence on the threshold and a lack of adaptability during training. In this paper, we theoretically derive an optimization objective that unifies model shift and model bias and then formulate a fine-tuning process. This process adaptively adjusts the model updates to obtain a performance improvement guarantee while avoiding model overfitting. Based on this, we develop a straightforward algorithm, USB-PO (Unified model Shift and model Bias Policy Optimization). Empirical results show that USB-PO achieves state-of-the-art performance on several challenging benchmark tasks.
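
    A hedged sketch of the general idea of adaptive model fine-tuning, not the paper's actual USB-PO objective: keep updating the dynamics model only while a combined proxy of model bias (validation error) and model shift (divergence from the previous model's predictions) keeps improving, rather than stopping at a fixed threshold. All names and the proxy itself are assumptions for illustration.

```python
# Hedged sketch (not the paper's derived objective): adaptive fine-tuning of a
# dynamics model driven by a unified bias + shift proxy.
import copy
import torch

def fine_tune_model(model, loss_fn, train_batch, val_batch,
                    shift_weight=1.0, max_steps=50, lr=1e-3):
    prev_model = copy.deepcopy(model)          # frozen snapshot of the pre-update model
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_score = float("inf")
    best_state = copy.deepcopy(model.state_dict())
    for _ in range(max_steps):
        opt.zero_grad()
        loss_fn(model, train_batch).backward()
        opt.step()
        with torch.no_grad():
            bias = loss_fn(model, val_batch)                          # model bias proxy
            shift = ((model(val_batch["obs"]) -
                      prev_model(val_batch["obs"])) ** 2).mean()      # model shift proxy
            score = (bias + shift_weight * shift).item()
        if score < best_score:
            best_score, best_state = score, copy.deepcopy(model.state_dict())
        else:
            break        # stop adapting once the unified proxy stops improving
    model.load_state_dict(best_state)
    return model
```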

    Language representations for generalization in reinforcement learning

    The choice of state and action representation in Reinforcement Learning (RL) has a significant effect on agent performance for the training task, but its relationship with generalization to new tasks is under-explored. One approach to improving generalization investigated here is the use of language as a representation. We compare vector states and discrete actions to language representations. We find that agents using language representations generalize better and can solve tasks with more entities, new entities, and more complexity than seen in the training task. We attribute this to the compositionality of language.
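
    A hedged illustration of the two representations being compared, using a hypothetical grid-world observation rather than the paper's environments: the fixed-size vector encoding caps the number of entities, while the language encoding composes one clause per entity and so extends naturally to more or new entities.

```python
# Hypothetical grid-world observation rendered two ways: as a fixed-size
# vector and as a language description.
from typing import List, Tuple

def vector_state(entities: List[Tuple[str, int, int]], max_entities: int = 4) -> List[float]:
    """Fixed-size numeric encoding: (type_id, x, y) per slot, zero-padded."""
    type_ids = {"agent": 1, "key": 2, "door": 3, "goal": 4}
    vec: List[float] = []
    for name, x, y in entities[:max_entities]:
        vec += [float(type_ids.get(name, 0)), float(x), float(y)]
    vec += [0.0] * (3 * max_entities - len(vec))   # pad unused slots
    return vec

def language_state(entities: List[Tuple[str, int, int]]) -> str:
    """Compositional text encoding: one clause per entity, any number of them."""
    return " ".join(f"There is a {name} at ({x}, {y})." for name, x, y in entities)

obs = [("agent", 0, 0), ("key", 2, 3), ("door", 5, 1)]
print(vector_state(obs))
print(language_state(obs))
```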

    Provably Efficient Adversarial Imitation Learning with Unknown Transitions

    Imitation learning (IL) has proven to be an effective method for learning good policies from expert demonstrations. Adversarial imitation learning (AIL), a subset of IL methods, is particularly promising, but its theoretical foundation in the presence of unknown transitions has yet to be fully developed. This paper explores the theoretical underpinnings of AIL in this context, where the stochastic and uncertain nature of environment transitions presents a challenge. We examine the expert sample complexity and interaction complexity required to recover good policies. To this end, we establish a framework connecting reward-free exploration and AIL, and propose an algorithm, MB-TAIL, that achieves the minimax optimal expert sample complexity of $\widetilde{O}(H^{3/2} |S| / \varepsilon)$ and interaction complexity of $\widetilde{O}(H^{3} |S|^2 |A| / \varepsilon^2)$. Here, $H$ is the planning horizon, $|S|$ is the state space size, $|A|$ is the action space size, and $\varepsilon$ is the desired imitation gap. MB-TAIL is the first algorithm to achieve this level of expert sample complexity in the unknown transition setting and improves upon the interaction complexity of the best-known algorithm, OAL, by $O(H)$. Additionally, we demonstrate the generalization ability of MB-TAIL by extending it to the function approximation setting and proving that it can achieve expert sample and interaction complexity independent of $|S|$.
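
    For orientation, a hedged restatement of the guarantees in standard notation; the paper's exact conventions (in particular how the imitation gap is defined) may differ.

```latex
% Hedged restatement; the paper's exact conventions may differ.
% The learned policy \hat{\pi} is an \varepsilon-good imitation of the expert \pi^{E} when
\[
  V(\pi^{E}) - V(\hat{\pi}) \;\le\; \varepsilon ,
\]
% and MB-TAIL attains this using
\[
  \widetilde{O}\!\left(\frac{H^{3/2}\,|S|}{\varepsilon}\right) \ \text{expert samples}
  \qquad\text{and}\qquad
  \widetilde{O}\!\left(\frac{H^{3}\,|S|^{2}\,|A|}{\varepsilon^{2}}\right) \ \text{environment interactions.}
\]
```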

    Safe Model-Based Multi-Agent Mean-Field Reinforcement Learning

    Many applications, e.g., in shared mobility, require coordinating a large number of agents. Mean-field reinforcement learning addresses the resulting scalability challenge by optimizing the policy of a representative agent. In this paper, we address an important generalization where there exist global constraints on the distribution of agents (e.g., capacity constraints or minimum coverage requirements). We propose Safe-$\text{M}^3$-UCRL, the first model-based algorithm that attains safe policies even in the case of unknown transition dynamics. As a key ingredient, it uses the epistemic uncertainty in the transition model within a log-barrier approach to ensure pessimistic constraint satisfaction with high probability. We showcase Safe-$\text{M}^3$-UCRL on the vehicle repositioning problem faced by many shared mobility operators and evaluate its performance through simulations built on Shenzhen taxi trajectory data. Our algorithm effectively meets the demand in critical areas while ensuring service accessibility in regions with low demand.
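
    A hedged sketch of the general log-barrier idea with pessimistic constraint evaluation; all names and numbers are hypothetical and the sketch is not the Safe-M^3-UCRL implementation.

```python
# Hedged sketch: distribution constraints are evaluated pessimistically by
# inflating the predicted constraint value with the model's epistemic
# uncertainty, and violations are discouraged via a log-barrier term.
import torch

def log_barrier_objective(expected_return, constraint_mean, constraint_std,
                          constraint_limit, beta=2.0, eta=0.1):
    """Return expected_return - eta * sum(-log(slack)), where slack is the
    pessimistic margin to the limit (mean + beta * std <= limit)."""
    pessimistic_value = constraint_mean + beta * constraint_std   # worst plausible case
    slack = constraint_limit - pessimistic_value
    if torch.any(slack <= 0):
        return torch.tensor(float("-inf"))    # pessimistically infeasible policy
    return expected_return - eta * torch.sum(-torch.log(slack))

# Example: capacity constraints on the mean-field distribution over 3 regions.
ret = torch.tensor(12.0)
mu = torch.tensor([0.30, 0.45, 0.25])     # predicted share of agents per region
sigma = torch.tensor([0.02, 0.03, 0.01])  # epistemic std of those predictions
cap = torch.tensor([0.40, 0.55, 0.35])    # capacity limit per region
print(log_barrier_objective(ret, mu, sigma, cap))
```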

    Understanding and Addressing the Pitfalls of Bisimulation-based Representations in Offline Reinforcement Learning

    While bisimulation-based approaches hold promise for learning robust state representations for Reinforcement Learning (RL) tasks, their efficacy in offline RL tasks has not been up to par; in some instances they have even significantly underperformed alternative methods. We aim to understand why bisimulation methods succeed in online settings but falter in offline tasks. Our analysis reveals that missing transitions in the dataset are particularly harmful to the bisimulation principle, leading to ineffective estimation. We also shed light on the critical role of reward scaling in bounding the scale of bisimulation measurements and of the value error they induce. Based on these findings, we propose to apply the expectile operator for representation learning in our offline RL setting, which helps prevent overfitting to incomplete data. Meanwhile, by introducing an appropriate reward scaling strategy, we avoid the risk of feature collapse in the representation space. We implement these recommendations on two state-of-the-art bisimulation-based algorithms, MICo and SimSR, and demonstrate performance gains on two benchmark suites: D4RL and Visual D4RL. Code is available at https://github.com/zanghyu/Offline_Bisimulation.
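
    A minimal sketch of the expectile operator as an asymmetric squared loss (tau = 0.5 recovers the ordinary mean); how it is wired into the MICo/SimSR bisimulation targets is simplified here, and `reward_scale` is a generic knob standing in for the paper's reward scaling strategy.

```python
# Hedged sketch: expectile regression loss plus a generic bisimulation-style
# target with an explicit reward-scaling knob.
import torch

def expectile_loss(pred, target, tau=0.7):
    """Weight under-estimates by tau and over-estimates by 1 - tau, so the
    minimizer is the tau-expectile of the target rather than its mean."""
    diff = target - pred
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff ** 2).mean()

def bisimulation_target(rewards_i, rewards_j, next_distance, gamma=0.99,
                        reward_scale=0.1):
    """Generic bisimulation-style target: scaled reward difference plus the
    discounted distance between next-state representations."""
    return reward_scale * (rewards_i - rewards_j).abs() + gamma * next_distance
```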