Improved Asymptotics for Multi-armed Bandit Experiments under Optimism-based Policies: Theory and Applications

Abstract

The classical multi-armed bandit paradigm is a foundational framework for online decision making underlying a wide variety of important applications, e.g., clinical trials, advertising, sequential assignments, and assortment optimization. This work examines two salient aspects of decision making that arise naturally in settings with large action spaces. The first issue pertains to the division of samples across arms at the level of a trajectory (or sample path). Traditional ensemble-level (in-expectation) bounds translate to meaningful pathwise guarantees (high-probability bounds) only when the separation between mean rewards is ``large,'' commonly referred to as the ``well-separated'' regime in the literature. Applications with a large action space, on the other hand, are intrinsically endowed with smaller separations between arm means (e.g., multiple products of similar quality in e-retail). As a result, classical ensemble-level guarantees for such problems become vacuous at the sample-path level in several settings. This theoretical gap in the understanding of bandit algorithms in the ``small gap'' regime can be of significant consequence in applications where considerations such as fairness and post hoc inference play an important role. Our work provides the first systematic treatment and analysis of this aspect under the celebrated UCB class of optimism-based bandit algorithms, including a complete diffusion-limit characterization of its regret. The diffusion-scale lens also reveals profound insights and highlights distinctions between UCB and the popular posterior-sampling-based method, Thompson Sampling, such as an ``incomplete learning'' phenomenon that is characteristic of the latter.

The second research question studied in this work concerns the complexity of decision making in problems where the action space is endowed with a large number of substitutable alternatives. For example, it is common in e-retail for multiple brands to offer similar products (in terms of quality of service) that compete for revenue within a given product segment. We model the platform's decision problem in this example as a bandit with countably many arms, and investigate limits of achievable performance under canonical bandit algorithms adapted to this setting. We also propose novel rate-optimal algorithms that leverage results for the ``small gap'' regime alluded to earlier, and show that these outperform the aforementioned conventional adaptations. We extend the countable-armed bandit paradigm to serve also as a basal motif in sequential assignment and dynamic matching problems typical of settings such as online labor markets.

The last chapter of this thesis investigates achievable performance in the countable-armed bandit problem under non-stationarity that is attributable to vanishing arms. This characteristic abstracts away certain attrition and churn processes observable in online markets; e.g., a popular brand may retract its product from a platform owing to under-exposure within its category -- a potential negative externality of the exploration carried out by the platform's policy.
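
For concreteness, the optimism principle referenced above can be illustrated with the canonical UCB1 index of Auer, Cesa-Bianchi and Fischer (2002); the thesis may analyze a different or more general index, so this should be read as a representative instance only. At time $t$, UCB1 plays the arm $i$ maximizing

\[
\mathrm{UCB}_i(t) \;=\; \hat{\mu}_i(t) \;+\; \sqrt{\frac{2\log t}{N_i(t)}},
\]

where $\hat{\mu}_i(t)$ is the empirical mean reward of arm $i$ and $N_i(t)$ its play count through time $t$. The classical gap-dependent regret guarantee scales as $O\big(\sum_{i:\Delta_i>0}\log T/\Delta_i\big)$, which degenerates as the gaps $\Delta_i \to 0$; the ``small gap'' (diffusion) regime accordingly considers separations on the order of $1/\sqrt{T}$, where in-expectation bounds of this form no longer pin down pathwise behavior.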
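
A minimal simulation sketch of the pathwise sample-allocation question is given below, assuming the UCB1 index above and Bernoulli rewards; the function name, seed, and parameter choices are illustrative and not taken from the thesis. With a gap of order $1/\sqrt{T}$, the exploration bonus dominates the tiny mean separation, so neither arm's share of pulls dominates over the horizon, in contrast with the well-separated regime.

    import numpy as np

    def run_ucb1(means, horizon, rng):
        # Simulate UCB1 on Bernoulli arms; return per-arm pull counts.
        k = len(means)
        counts = np.zeros(k, dtype=int)
        sums = np.zeros(k)
        for a in range(k):                        # initialize: one pull per arm
            sums[a] += rng.random() < means[a]
            counts[a] += 1
        for t in range(k, horizon):
            index = sums / counts + np.sqrt(2.0 * np.log(t + 1) / counts)
            a = int(np.argmax(index))             # optimism: play the highest index
            sums[a] += rng.random() < means[a]
            counts[a] += 1
        return counts

    # ``Small gap'' regime: mean separation shrinks like 1/sqrt(horizon).
    horizon = 100_000
    gap = 1.0 / np.sqrt(horizon)
    counts = run_ucb1([0.5 + gap, 0.5], horizon, np.random.default_rng(0))
    print("fraction of pulls per arm:", counts / horizon)

Re-running the sketch with a horizon-independent gap (e.g., 0.2 in place of $1/\sqrt{T}$) recovers the familiar well-separated behavior in which the optimal arm absorbs all but a logarithmic fraction of the pulls.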
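
The countable-armed setting can likewise be illustrated with a simple baseline of the ``conventional adaptation'' variety alluded to in the abstract: sample a fixed number of arms from the infinite reservoir up front and run UCB1 restricted to that subset. This is a generic baseline, not the rate-optimal policy proposed in the thesis; the two-type reward model and all names below are illustrative assumptions.

    import numpy as np

    def sample_then_ucb(type_means, type_probs, m, horizon, rng):
        # Draw m arms from the reservoir (arm types are i.i.d.), then run
        # UCB1 on the sampled subset; return cumulative pseudo-regret.
        means = rng.choice(type_means, size=m, p=type_probs)
        best = float(np.max(type_means))
        counts = np.zeros(m, dtype=int)
        sums = np.zeros(m)
        regret = 0.0
        for a in range(m):                        # one initial pull per sampled arm
            sums[a] += rng.random() < means[a]
            counts[a] += 1
            regret += best - means[a]
        for t in range(m, horizon):
            index = sums / counts + np.sqrt(2.0 * np.log(t + 1) / counts)
            a = int(np.argmax(index))
            sums[a] += rng.random() < means[a]
            counts[a] += 1
            regret += best - means[a]
        return regret

    # Two arm types of similar quality, equally likely in the reservoir;
    # a toy stand-in for substitutable products competing within a segment.
    rng = np.random.default_rng(1)
    print(sample_then_ucb(np.array([0.6, 0.55]), np.array([0.5, 0.5]),
                          m=8, horizon=50_000, rng=rng))

A known pitfall of such fixed-subset adaptations is that, with positive probability, the sample contains no arm of the optimal type, in which case regret grows linearly; this is one motivation for policies that adaptively query new arms from the reservoir.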
