Multi-armed bandits with costly probes

Abstract

The multi-armed bandit problem is a sequential decision-making problem in which an agent must choose among multiple actions to maximize its cumulative reward over time while facing uncertainty about the rewards associated with each action. The challenge lies in balancing the exploration of potentially higher-rewarding actions with the exploitation of known high-reward actions. We consider a multi-armed bandit problem with probes, where before pulling an arm, the decision-maker is allowed to probe one of the $K$ arms for a cost $c \geq 0$ to observe its reward. We introduce a new regret definition that is based on the expected reward of the optimal action. We develop UCBP, a novel algorithm that utilizes probing to achieve a gap-independent regret upper bound that scales with the number of rounds $T$ as $O(\sqrt{KT\log T})$, and an order-optimal gap-dependent upper bound of $O(K\log T)$. As baselines, we introduce UCB-naive-probe, a naive UCB-based approach with a gap-independent regret upper bound of $O(K\sqrt{T\log T})$ and a gap-dependent regret bound of $O(K^{2}\log T)$, and TSP, the Thompson sampling version of UCBP. In empirical simulations, UCBP outperforms UCB-naive-probe and performs comparably to TSP, verifying the utility of the UCBP and TSP algorithms in practical settings.
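
To make the probe-then-pull interaction concrete, below is a minimal, hypothetical Python sketch of a naive UCB-style baseline in the spirit of UCB-naive-probe. The function name, the probing rule (probe the arm with the highest UCB index, then keep it only if its observed reward beats the best empirical mean), and the assumption that pulling the probed arm yields the reward just observed are all illustrative assumptions; this is not the paper's UCBP algorithm.

import math
import random


def ucb_naive_probe_sketch(arms, T, c=0.05):
    """Hypothetical probe-then-pull loop in a UCB style.

    arms: list of callables, each returning a stochastic reward in [0, 1].
    T: number of rounds. c: probe cost (c >= 0).
    Illustrates the interaction protocol only, not the UCBP algorithm.
    """
    K = len(arms)
    counts = [0] * K       # reward observations per arm
    means = [0.0] * K      # empirical mean reward per arm
    net_reward = 0.0       # cumulative reward minus probe costs

    def update(i, r):
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]

    def ucb(i, t):
        # UCB index: empirical mean plus an exploration bonus.
        if counts[i] == 0:
            return float("inf")
        return means[i] + math.sqrt(2.0 * math.log(t) / counts[i])

    for t in range(1, T + 1):
        # Probe the arm with the highest index; pay c and observe its reward.
        probe_arm = max(range(K), key=lambda i: ucb(i, t))
        probed_reward = arms[probe_arm]()
        update(probe_arm, probed_reward)
        net_reward -= c

        # Pull: keep the probed arm if its observed reward is at least the
        # best empirical mean; otherwise pull the empirically best arm for a
        # fresh draw. (Assumes pulling the probed arm yields the reward that
        # was just observed.)
        best_arm = max(range(K), key=lambda i: means[i])
        if probed_reward >= means[best_arm]:
            net_reward += probed_reward
        else:
            fresh_reward = arms[best_arm]()
            update(best_arm, fresh_reward)
            net_reward += fresh_reward

    return net_reward


if __name__ == "__main__":
    random.seed(0)
    # Three Bernoulli arms with success probabilities 0.3, 0.5, 0.8.
    arms = [lambda p=p: float(random.random() < p) for p in (0.3, 0.5, 0.8)]
    print(ucb_naive_probe_sketch(arms, T=1000, c=0.05))

Note that this naive loop pays the probe cost every round; an algorithm such as UCBP would presumably also decide whether a probe is worth its cost, which this sketch does not attempt.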

This paper was published in Bilkent University Institutional Repository.

Licence: https://creativecommons.org/licenses/by-nc-nd/4.0/