6 research outputs found

    Conservative Bayesian Model-Based Value Expansion for Offline Policy Optimization

    Full text link
    Offline reinforcement learning (RL) addresses the problem of learning a performant policy from a fixed batch of data collected by following some behavior policy. Model-based approaches are particularly appealing in the offline setting since they can extract more learning signal from the logged dataset by learning a model of the environment. However, the performance of existing model-based approaches falls short of that of their model-free counterparts, due to the compounding of estimation errors in the learned model. Driven by this observation, we argue that it is critical for a model-based method to understand when to trust the model and when to rely on model-free estimates, and how to act conservatively w.r.t. both. To this end, we derive an elegant and simple methodology called conservative Bayesian model-based value expansion for offline policy optimization (CBOP), which trades off model-free and model-based estimates during the policy evaluation step according to their epistemic uncertainties, and facilitates conservatism by taking a lower bound on the Bayesian posterior value estimate. On the standard D4RL continuous control tasks, we find that our method significantly outperforms previous model-based approaches: e.g., MOPO by 116.4%, MOReL by 23.2%, and COMBO by 23.7%. Further, CBOP achieves state-of-the-art performance on 11 out of 18 benchmark datasets while performing on par on the remaining datasets.
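    A minimal Python sketch of the kind of conservative Bayesian value target the abstract describes: h-step model-based value-expansion estimates from an ensemble are aggregated by inverse-variance weighting (a Gaussian posterior assumption) and then lower-bounded for conservatism. The function name, the Gaussian aggregation, and the beta parameter are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def conservative_value_target(h_step_returns, beta=1.0):
    """Illustrative CBOP-style conservative value target for one state-action pair.

    h_step_returns: array of shape (num_horizons, ensemble_size) holding
    h-step value-expansion estimates; horizon 0 corresponds to the purely
    model-free bootstrap.
    """
    # Per-horizon mean and epistemic (across-ensemble) variance.
    means = h_step_returns.mean(axis=1)
    variances = h_step_returns.var(axis=1) + 1e-8  # avoid division by zero

    # Precision (inverse-variance) weighting: horizons the ensemble agrees on
    # contribute more, mimicking a Gaussian posterior over the true value
    # given conditionally independent estimates.
    precisions = 1.0 / variances
    posterior_var = 1.0 / precisions.sum()
    posterior_mean = posterior_var * (precisions * means).sum()

    # Conservatism: use a lower confidence bound on the posterior value.
    return posterior_mean - beta * np.sqrt(posterior_var)
```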

    Thompson Sampling for the Control of a Queue with Demand Uncertainty

    No full text
    We study an admission control problem in which the customer arrival rate is unknown and needs to be learned from data using Bayesian inference. Two key defining features of this model are that: (1) when the arrival rate is known, the dynamic programming (DP) equations can be solved explicitly to obtain the optimal policy over the infinite horizon, and (2) uninformative actions are unavoidable and occur infinitely often. We extend the standard proof techniques for Thompson sampling to admission control, in which uninformative actions occur infinitely often, and show that asymptotically optimal convergence rates of the posterior error and worst-case average regret are achieved. Finally, we show that under simple assumptions, our techniques generalize to a broader class of policies, which we call Generalized Thompson sampling. We show that this class of policies achieves asymptotically optimal convergence rates and can outperform standard Thompson sampling in numerical simulation.
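    A rough Python sketch of the Thompson-sampling loop for this setting, assuming a conjugate Gamma-Poisson model for the unknown arrival rate and a threshold-type admission policy; the optimal_threshold placeholder, the prior, and the episode structure are illustrative assumptions, and the queue dynamics and rewards are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def optimal_threshold(arrival_rate):
    """Hypothetical placeholder for the admission threshold obtained by
    solving the known-rate DP explicitly for the given arrival rate."""
    return max(1, int(10.0 / (1.0 + arrival_rate)))

# Gamma prior over the unknown Poisson arrival rate (conjugate choice,
# assumed here for illustration).
alpha, beta = 1.0, 1.0
true_rate = 0.7

for episode in range(100):
    # 1. Sample a plausible arrival rate from the current posterior.
    rate_sample = rng.gamma(alpha, 1.0 / beta)

    # 2. Act with the policy that is optimal for the sampled rate
    #    (queue simulation and rewards omitted from this sketch).
    threshold = optimal_threshold(rate_sample)

    # 3. Observe arrivals over the episode and update the posterior.
    #    Arrivals are informative; the admission decisions themselves may be
    #    uninformative, which is the difficulty the paper addresses.
    horizon = 20
    arrivals = rng.poisson(true_rate, size=horizon)
    alpha += arrivals.sum()
    beta += horizon
```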

    Thompson Sampling for Parameterized Markov Decision Processes with Uninformative Actions

    Full text link
    We study parameterized MDPs (PMDPs) in which the key parameters of interest are unknown and must be learned using Bayesian inference. One key defining feature of such models is the presence of "uninformative" actions that provide no information about the unknown parameters. We contribute a set of assumptions for PMDPs under which Thompson sampling guarantees an asymptotically optimal expected regret bound of O(T^{-1}); these assumptions are easily verified for many classes of problems, such as queuing, inventory control, and dynamic pricing.
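    A generic Python skeleton of Thompson sampling for a parameterized MDP, highlighting that only informative observations update the posterior. Every callable interface here (sample_posterior, update_posterior, solve_mdp, step) is a hypothetical assumption used for illustration, not an API from the paper.

```python
from typing import Callable, Sequence

def thompson_sampling_pmdp(
    sample_posterior: Callable[[], float],
    update_posterior: Callable[[Sequence[float]], None],
    solve_mdp: Callable[[float], Callable[[int], int]],
    step: Callable[[int, int], tuple[int, float, object]],
    episodes: int,
    horizon: int,
) -> float:
    """Thompson-sampling skeleton for a PMDP with uninformative actions.

    solve_mdp(theta) returns the optimal policy for parameter theta, and
    step(state, action) returns (next_state, reward, observation), where
    observation is None when the chosen action is uninformative.
    """
    total_reward = 0.0
    for _ in range(episodes):
        theta = sample_posterior()      # sample a parameter from the posterior
        policy = solve_mdp(theta)       # plan as if the sampled value were true
        state, observations = 0, []
        for _ in range(horizon):
            action = policy(state)
            state, reward, obs = step(state, action)
            total_reward += reward
            if obs is not None:         # only informative actions
                observations.append(obs)  # contribute to learning
        update_posterior(observations)
    return total_reward
```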