
    Stateful Posted Pricing with Vanishing Regret via Dynamic Deterministic Markov Decision Processes

    An online problem called dynamic resource allocation with capacity constraints (DRACC) is introduced and studied in the realm of posted-price mechanisms. This problem subsumes several applications of stateful pricing, including but not limited to posted prices for online job scheduling and matching over a dynamic bipartite graph. Because existing online learning techniques do not yield vanishing regret for this problem, we develop a novel online learning framework over deterministic Markov decision processes with dynamic state transition and reward functions. Based on a reduction to the well-studied problem of online learning with switching costs, we then prove that if the Markov decision process admits a chasing oracle (i.e., an oracle that simulates any given policy from any initial state with bounded loss), then the online learning problem can be solved with vanishing regret. Our results for the DRACC problem and its applications are then obtained by devising (randomized and deterministic) chasing oracles that exploit the particular structure of these problems.
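
    As background for the reduction, online learning with switching costs charges the learner an extra unit of loss every time it changes action between rounds. The sketch below is a minimal illustration of one standard way to keep both the learning regret and the switching cost small: Hedge (multiplicative weights) with lazy, blocked updates. It is not the chasing-oracle construction from the paper, and the loss matrix, block length, and learning rate are illustrative assumptions.

        import numpy as np

        def blocked_hedge(losses, block_len, eta, seed=0):
            # Hedge with lazy updates: the played action is resampled only at
            # block boundaries, so the number of switches (and hence the total
            # switching cost) is at most T / block_len.
            #   losses: (T, K) array of per-round losses in [0, 1]
            T, K = losses.shape
            rng = np.random.default_rng(seed)
            weights = np.ones(K)
            current = int(rng.integers(K))
            actions = []
            for t in range(T):
                if t % block_len == 0:
                    probs = weights / weights.sum()
                    current = int(rng.choice(K, p=probs))  # switch at most once per block
                actions.append(current)
                weights *= np.exp(-eta * losses[t])  # standard Hedge update
            return actions

    Choosing block_len on the order of roughly T^(1/3) trades the learning regret off against the number of switches; this is the flavor of trade-off the reduction exploits, with the paper's chasing oracle additionally handling the state that persists across switches.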

    Competing Against Adaptive Agents by Minimizing Counterfactual Notions of Regret

    Online learning, or sequential decision making, is formally defined as a repeated game between an adversary and a player. At every round of the game, the player chooses an action from a fixed action set and the adversary reveals a reward/loss for the action played. The goal of the player is to maximize the cumulative reward of her actions. The rewards/losses may be sampled from an unknown distribution, or other, less restrictive assumptions can be made. The standard measure of performance is the cumulative regret, that is, the difference between the cumulative reward of the player and the best reward achievable by a fixed action, or more generally a fixed policy, on the observed reward sequence. For adversaries that are oblivious to the player's strategy, regret is a meaningful measure. However, the adversary is usually adaptive: in healthcare, a patient will respond to given treatments, and for self-driving cars, other traffic will react to the behavior of the autonomous agent. In such settings the notion of regret is hard to interpret, as the best action in hindsight might not be the best action overall given the behavior of the adversary. To resolve this problem, a new notion called policy regret is introduced. Policy regret is fundamentally different from other forms of regret in that it is counterfactual in nature: the player competes against all other policies, whose rewards are calculated by taking into account how the adversary would have behaved had the player followed that policy. This thesis studies policy regret in a partial (bandit) feedback environment, beyond the worst-case setting, by leveraging additional structure such as stochasticity/stability of the adversary or additional feedback.
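
    To make the counterfactual distinction concrete, the toy sketch below evaluates both regret notions against a hypothetical memory-one adversary whose loss depends on the player's previous action. The adversary, the loss function, and the helper names are illustrative assumptions, not constructions from the thesis.

        import numpy as np

        K = 2  # two actions: 0 and 1

        def loss(action, prev_action):
            # Hypothetical memory-one adaptive adversary: penalizes repeating
            # the previous action, plus a small bias against action 1.
            return 0.5 * (action == prev_action) + 0.25 * action

        def incurred_loss(played):
            # Loss actually suffered on the played sequence (initial state 0).
            return sum(loss(played[t], played[t - 1] if t else 0)
                       for t in range(len(played)))

        def external_regret(played):
            # External regret: the constant comparator is scored on the losses
            # actually revealed, i.e. the adversary still reacts to `played`.
            best = min(sum(loss(a, played[t - 1] if t else 0)
                           for t in range(len(played))) for a in range(K))
            return incurred_loss(played) - best

        def policy_regret(played):
            # Policy regret: the comparator's losses are recomputed
            # counterfactually, replaying how the adversary would have reacted
            # had the constant action been played all along.
            best = min(sum(loss(a, a if t else 0)
                           for t in range(len(played))) for a in range(K))
            return incurred_loss(played) - best

        played = list(np.random.default_rng(1).integers(K, size=100))
        print(external_regret(played), policy_regret(played))

    On such an adversary the two comparator values generally differ, because the counterfactual benchmark re-simulates the adversary's reaction to the alternative policy rather than reusing the revealed losses; this is exactly the distinction the thesis's notion of policy regret captures.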