6 research outputs found

    Conservative Bayesian Model-Based Value Expansion for Offline Policy Optimization

    Full text link
    Offline reinforcement learning (RL) addresses the problem of learning a performant policy from a fixed batch of data collected by following some behavior policy. Model-based approaches are particularly appealing in the offline setting since they can extract more learning signal from the logged dataset by learning a model of the environment. However, the performance of existing model-based approaches falls short of that of their model-free counterparts, due to the compounding of estimation errors in the learned model. Driven by this observation, we argue that it is critical for a model-based method to understand when to trust the model and when to rely on model-free estimates, and how to act conservatively w.r.t. both. To this end, we derive an elegant and simple methodology called conservative Bayesian model-based value expansion for offline policy optimization (CBOP), which trades off model-free and model-based estimates during the policy evaluation step according to their epistemic uncertainties, and facilitates conservatism by taking a lower bound on the Bayesian posterior value estimate. On the standard D4RL continuous control tasks, we find that our method significantly outperforms previous model-based approaches: e.g., MOPO by 116.4%, MOReL by 23.2%, and COMBO by 23.7%. Further, CBOP achieves state-of-the-art performance on 11 out of 18 benchmark datasets while performing on par on the remaining datasets.
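    A minimal Python sketch of the kind of conservative Bayesian value target the abstract describes: h-step model-based value-expansion estimates from an ensemble are aggregated by inverse-variance weighting (a Gaussian posterior assumption) and then lower-bounded for conservatism. The function name, the Gaussian aggregation, and the beta parameter are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def conservative_value_target(h_step_returns, beta=1.0):
    """Illustrative CBOP-style conservative value target for one state-action pair.

    h_step_returns: array of shape (num_horizons, ensemble_size) holding
    h-step value-expansion estimates; horizon 0 corresponds to the purely
    model-free bootstrap.
    """
    # Per-horizon mean and epistemic (across-ensemble) variance.
    means = h_step_returns.mean(axis=1)
    variances = h_step_returns.var(axis=1) + 1e-8  # avoid division by zero

    # Precision (inverse-variance) weighting: horizons the ensemble agrees on
    # contribute more, mimicking a Gaussian posterior over the true value
    # given conditionally independent estimates.
    precisions = 1.0 / variances
    posterior_var = 1.0 / precisions.sum()
    posterior_mean = posterior_var * (precisions * means).sum()

    # Conservatism: use a lower confidence bound on the posterior value.
    return posterior_mean - beta * np.sqrt(posterior_var)
```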

    Thompson Sampling for the Control of a Queue with Demand Uncertainty

    No full text
    We study an admission control problem in which the customer arrival rate is unknown and needs to be learned from data using Bayesian inference. Two key defining features of this model are that: (1) when the arrival rate is known, the dynamic programming (DP) equations can be solved explicitly to obtain the optimal policy over the infinite horizon, and (2) uninformative actions are unavoidable and occur infinitely often. We extend the standard proof techniques for Thompson sampling to admission control, in which uninformative actions occur infinitely often, and show that asymptotically optimal convergence rates of the posterior error and worst-case average regret are achieved. Finally, we show that under simple assumptions, our techniques generalize to a broader class of policies, which we call Generalized Thompson sampling. We show that this class of policies achieves asymptotically optimal convergence rates and can outperform standard Thompson sampling in numerical simulation.
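    A rough Python sketch of the Thompson-sampling loop for this setting, assuming a conjugate Gamma-Poisson model for the unknown arrival rate and a threshold-type admission policy; the optimal_threshold placeholder, the prior, and the episode structure are illustrative assumptions, and the queue dynamics and rewards are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def optimal_threshold(arrival_rate):
    """Hypothetical placeholder for the admission threshold obtained by
    solving the known-rate DP explicitly for the given arrival rate."""
    return max(1, int(10.0 / (1.0 + arrival_rate)))

# Gamma prior over the unknown Poisson arrival rate (conjugate choice,
# assumed here for illustration).
alpha, beta = 1.0, 1.0
true_rate = 0.7

for episode in range(100):
    # 1. Sample a plausible arrival rate from the current posterior.
    rate_sample = rng.gamma(alpha, 1.0 / beta)

    # 2. Act with the policy that is optimal for the sampled rate
    #    (queue simulation and rewards omitted from this sketch).
    threshold = optimal_threshold(rate_sample)

    # 3. Observe arrivals over the episode and update the posterior.
    #    Arrivals are informative; the admission decisions themselves may be
    #    uninformative, which is the difficulty the paper addresses.
    horizon = 20
    arrivals = rng.poisson(true_rate, size=horizon)
    alpha += arrivals.sum()
    beta += horizon
```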

    Thompson Sampling for Parameterized Markov Decision Processes with Uninformative Actions

    Full text link
    We study parameterized MDPs (PMDPs) in which the key parameters of interest are unknown and must be learned using Bayesian inference. One key defining feature of such models is the presence of "uninformative" actions that provide no information about the unknown parameters. We contribute a set of assumptions for PMDPs under which Thompson sampling guarantees an asymptotically optimal expected regret bound of O(T^{-1}); these assumptions are easily verified for many classes of problems, such as queuing, inventory control, and dynamic pricing.
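    A generic Python skeleton of Thompson sampling for a parameterized MDP, highlighting that only informative observations update the posterior. Every callable interface here (sample_posterior, update_posterior, solve_mdp, step) is a hypothetical assumption used for illustration, not an API from the paper.

```python
from typing import Callable, Sequence

def thompson_sampling_pmdp(
    sample_posterior: Callable[[], float],
    update_posterior: Callable[[Sequence[float]], None],
    solve_mdp: Callable[[float], Callable[[int], int]],
    step: Callable[[int, int], tuple[int, float, object]],
    episodes: int,
    horizon: int,
) -> float:
    """Thompson-sampling skeleton for a PMDP with uninformative actions.

    solve_mdp(theta) returns the optimal policy for parameter theta, and
    step(state, action) returns (next_state, reward, observation), where
    observation is None when the chosen action is uninformative.
    """
    total_reward = 0.0
    for _ in range(episodes):
        theta = sample_posterior()      # sample a parameter from the posterior
        policy = solve_mdp(theta)       # plan as if the sampled value were true
        state, observations = 0, []
        for _ in range(horizon):
            action = policy(state)
            state, reward, obs = step(state, action)
            total_reward += reward
            if obs is not None:         # only informative actions
                observations.append(obs)  # contribute to learning
        update_posterior(observations)
    return total_reward
```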