Conservative Bayesian Model-Based Value Expansion for Offline Policy Optimization
Offline reinforcement learning (RL) addresses the problem of learning a
performant policy from a fixed batch of data collected by following some
behavior policy. Model-based approaches are particularly appealing in the
offline setting since they can extract more learning signals from the logged
dataset by learning a model of the environment. However, the performance of
existing model-based approaches falls short of that of their model-free counterparts, due to
the compounding of estimation errors in the learned model. Driven by this
observation, we argue that it is critical for a model-based method to
understand when to trust the model and when to rely on model-free estimates,
and how to act conservatively w.r.t. both. To this end, we derive an elegant
and simple methodology called conservative Bayesian model-based value expansion
for offline policy optimization (CBOP), that trades off model-free and
model-based estimates during the policy evaluation step according to their
epistemic uncertainties, and facilitates conservatism by taking a lower bound
on the Bayesian posterior value estimate. On the standard D4RL continuous
control tasks, we find that our method significantly outperforms previous
model-based approaches: e.g., MOPO by %, MOReL by % and COMBO by
%. Further, CBOP achieves state-of-the-art performance on out of
benchmark datasets while performing on par on the remaining datasets.
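The aggregation step described above can be pictured as inverse-variance weighting of the per-horizon targets followed by a lower confidence bound. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: each h-step value-expansion target is treated as a Gaussian observation of the true value whose noise variance is its ensemble (epistemic) variance, and the name cbop_target and the lcb_coef parameter are hypothetical.

```python
import numpy as np

def cbop_target(mu, var, lcb_coef=1.0):
    """Conservative Bayesian aggregation of h-step value-expansion targets.

    mu, var: arrays of shape [H+1] holding the ensemble mean and variance of
    each h-step model-based value-expansion target (h = 0 being the pure
    model-free bootstrap). Treating each target as a noisy Gaussian
    observation of the true value, the posterior mean is the
    inverse-variance weighted average; conservatism comes from returning a
    lower confidence bound on that posterior.
    """
    precision = 1.0 / (var + 1e-8)                 # epistemic precision per horizon
    post_var = 1.0 / precision.sum()               # Gaussian posterior variance
    post_mean = post_var * (precision * mu).sum()  # inverse-variance weighting
    return post_mean - lcb_coef * np.sqrt(post_var)  # conservative lower bound

# Toy usage: three targets for horizons h = 0, 1, 2 with growing uncertainty.
mu = np.array([1.00, 1.15, 0.90])
var = np.array([0.05, 0.10, 0.40])
print(cbop_target(mu, var))
```

Under this weighting, horizons with low epistemic uncertainty (typically the short model rollouts, or the model-free bootstrap at h = 0) dominate the average, which is how the trade-off between model-based and model-free estimates emerges.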
Thompson Sampling for the Control of a Queue with Demand Uncertainty
We study an admission control problem in which the customer arrival rate is unknown
and needs to be learned from data using Bayesian inference. Two key defining features of
this model are that: (1) when the arrival rate is known, the dynamic programming (DP) equations can be solved
explicitly to obtain the optimal policy over the infinite horizon, and (2) uninformative
actions are unavoidable and occur infinitely often.
We extend the standard proof techniques for Thompson sampling to admission control,
in which uninformative actions occur infinitely often, and show that asymptotically
optimal convergence rates of the posterior error and worst-case average regret are achieved.
Finally, we show that under simple assumptions, our techniques generalize to a
broader class of policies, which we call Generalized Thompson sampling. We show that
this class of policies achieves asymptotically optimal convergence rates and can outperform
standard Thompson sampling in numerical simulation.
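As a rough picture of the loop the abstract describes, here is a toy Thompson-sampling sketch under assumptions not taken from the paper: arrivals are Poisson with a Gamma conjugate prior, threshold_policy is a hypothetical stand-in for the closed-form DP solution mentioned in point (1), and rejecting is modeled as blocking the demand observation, which is what makes it uninformative.

```python
import numpy as np

rng = np.random.default_rng(0)

true_lam = 0.7                       # unknown Poisson arrival rate (hidden)
a, b = 1.0, 1.0                      # Gamma(a, b) posterior over the rate
mu, hold_cost, fee = 1.0, 0.2, 1.0   # service rate, holding cost, admission fee

def threshold_policy(lam):
    """Hypothetical stand-in for the closed-form DP solution: admit while
    the queue is below a rate-dependent threshold (formula is made up)."""
    return max(1, int(fee * mu / (hold_cost * max(lam, 1e-6))))

queue = 0
for t in range(10_000):
    lam_hat = rng.gamma(a, 1.0 / b)          # Thompson sample from posterior
    admit_below = threshold_policy(lam_hat)  # act as if lam_hat were true
    if queue < admit_below:                  # admitting: informative action
        arrivals = rng.poisson(true_lam)     # observe one period of demand
        queue = min(queue + arrivals, admit_below)
        a, b = a + arrivals, b + 1.0         # conjugate Gamma-Poisson update
    # else: rejection blocks the demand observation, so the action is
    # uninformative and the posterior stays unchanged
    queue = max(queue - rng.poisson(mu), 0)  # departures during the period

print(f"posterior mean rate = {a / b:.3f} (true rate {true_lam})")
```

Whenever the queue sits at the sampled threshold, the controller rejects and learns nothing that period, illustrating why uninformative actions can recur infinitely often.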
Thompson Sampling for Parameterized Markov Decision Processes with Uninformative Actions
We study parameterized MDPs (PMDPs) in which the key parameters of interest
are unknown and must be learned using Bayesian inference. One key defining
feature of such models is the presence of "uninformative" actions that provide
no information about the unknown parameters. We contribute a set of assumptions
for PMDPs under which Thompson sampling guarantees an asymptotically optimal
expected regret bound of , and these assumptions are easily verified for many
classes of problems such as queueing, inventory control, and dynamic pricing.
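To make the "uninformative action" notion concrete, here is a hypothetical dynamic-pricing-style toy instance (one of the problem classes listed above); the model, names, and numbers are illustrative assumptions, not from the paper: demand is Bernoulli with unknown purchase probability, a Beta conjugate posterior is maintained, and withholding the offer produces no observation.

```python
import numpy as np

rng = np.random.default_rng(1)

true_p = 0.4                  # unknown purchase probability (the PMDP parameter)
alpha, beta = 1.0, 1.0        # Beta posterior over true_p (conjugate prior)
margin, offer_cost = 1.0, 0.3

revenue = 0.0
for t in range(5_000):
    p_hat = rng.beta(alpha, beta)        # Thompson sample of the parameter
    if p_hat * margin > offer_cost:      # optimal action under the sample
        sale = rng.random() < true_p     # informative: demand is observed
        revenue += margin * sale - offer_cost
        alpha, beta = alpha + sale, beta + (1 - sale)
    # else: withholding the offer yields no observation -- an
    # "uninformative" action in the paper's sense -- so no update

print(f"posterior mean = {alpha / (alpha + beta):.3f}, revenue = {revenue:.1f}")
```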