
    Adversarial Bandits with Knapsacks

    We consider Bandits with Knapsacks (henceforth, BwK), a general model for multi-armed bandits under supply/budget constraints. In particular, a bandit algorithm needs to solve a well-known knapsack problem: find an optimal packing of items into a limited-size knapsack. The BwK problem is a common generalization of numerous motivating examples, which range from dynamic pricing to repeated auctions to dynamic ad allocation to network routing and scheduling. While the prior work on BwK focused on the stochastic version, we pioneer the other extreme in which the outcomes can be chosen adversarially. This is a considerably harder problem, compared to both the stochastic version and the "classic" adversarial bandits, in that regret minimization is no longer feasible. Instead, the objective is to minimize the competitive ratio: the ratio of the benchmark reward to the algorithm's reward. We design an algorithm with competitive ratio O(log T) relative to the best fixed distribution over actions, where T is the time horizon; we also prove a matching lower bound. The key conceptual contribution is a new perspective on the stochastic version of the problem. We suggest a new algorithm for the stochastic version, which builds on the framework of regret minimization in repeated games and admits a substantially simpler analysis compared to prior work. We then analyze this algorithm for the adversarial version and use it as a subroutine to solve the latter. Comment: Extended abstract appeared in FOCS 201
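
The competitive-ratio objective stated in this abstract can be written out as a rough sketch. The symbols below are illustrative assumptions, not notation taken from the abstract: OPT_FD stands for the total reward of the best fixed distribution over actions, and REW(ALG) for the algorithm's total reward. Any lower-order terms the abstract does not state are ignored.

```latex
\[
  \text{competitive ratio}
  \;=\;
  \frac{\mathrm{OPT}_{\mathrm{FD}}}{\mathbb{E}\!\left[\mathrm{REW}(\mathrm{ALG})\right]}
  \;\le\; O(\log T),
  \qquad T = \text{time horizon}.
\]
```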

    Approximately Stationary Bandits with Knapsacks

    Bandits with Knapsacks (BwK), the generalization of the Bandits problem under global budget constraints, has received a lot of attention in recent years. Previous work has focused on one of the two extremes: Stochastic BwK, where the rewards and resource consumptions of each round are sampled from an i.i.d. distribution, and Adversarial BwK, where these parameters are picked by an adversary. Achievable guarantees in the two cases exhibit a massive gap: no-regret learning is achievable in the stochastic case, but in the adversarial case only competitive-ratio-style guarantees are achievable, where the competitive ratio depends either on the budget or on both the time and the number of resources. What makes this gap so vast is that in Adversarial BwK the guarantees get worse in the typical case when the budget is more binding. While "best-of-both-worlds" type algorithms are known (single algorithms that provide the best achievable guarantee in each extreme case), their bounds degrade to the adversarial case as soon as the environment is not fully stochastic. Our work aims to bridge this gap, offering guarantees for a workload that is not exactly stochastic but is also not worst-case. We define a condition, Approximately Stationary BwK, that parameterizes how close to stochastic or adversarial an instance is. Based on these parameters, we explore the best competitive ratio attainable in BwK. We explore two algorithms that are oblivious to the values of the parameters but guarantee competitive ratios that smoothly transition between the best possible guarantees in the two extreme cases, depending on the values of the parameters. Our guarantees offer great improvement over the adversarial guarantee, especially when the available budget is small. We also prove bounds on the achievable guarantee, showing that our results are approximately tight when the budget is small.

    Clustered Linear Contextual Bandits with Knapsacks

    In this work, we study clustered contextual bandits where rewards and resource consumption are the outcomes of cluster-specific linear models. The arms are divided into clusters, with the cluster memberships being unknown to the algorithm. Pulling an arm in a time period results in a reward and in consumption of each of multiple resources; the algorithm terminates as soon as the total consumption of any resource exceeds its constraint. Thus, maximizing the total reward requires learning not only models of the reward and the resource consumption, but also cluster memberships. We provide an algorithm that achieves regret sublinear in the number of time periods, without requiring access to all of the arms. In particular, we show that it suffices to perform clustering only once, on a randomly selected subset of the arms. To achieve this result, we provide a sophisticated combination of techniques from the literature on econometrics and on bandits with constraints.

    High-dimensional Linear Bandits with Knapsacks

    We study the contextual bandits with knapsack (CBwK) problem under the high-dimensional setting where the dimension of the feature is large. The reward of pulling each arm equals the inner product of a sparse high-dimensional weight vector and the feature of the current arrival, plus random noise. In this paper, we investigate how to exploit this sparsity structure to achieve improved regret for the CBwK problem. To this end, we first develop an online variant of the hard thresholding algorithm that performs the sparse estimation in an online manner. We further combine our online estimator with a primal-dual framework, where we assign a dual variable to each knapsack constraint and utilize an online learning algorithm to update the dual variable, thereby controlling the consumption of the knapsack capacity. We show that this integrated approach allows us to achieve a sublinear regret that depends logarithmically on the feature dimension, thus improving the polynomial dependency established in the previous literature. We also apply our framework to the high-dimensional contextual bandit problem without the knapsack constraint and achieve optimal regret in both the data-poor regime and the data-rich regime. We finally conduct numerical experiments to demonstrate the empirical efficiency of our algorithms under the high-dimensional setting.
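
The two ingredients named in this abstract, hard thresholding and a per-constraint dual update, can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names, the sparsity level `s`, the per-round budget rate, the step size `eta`, and the choice of projected gradient ascent as the online learning rule are all assumptions made for illustration.

```python
def hard_threshold(w, s):
    """Keep the s largest-magnitude entries of w and zero out the rest.

    `s` is an assumed sparsity level, not a parameter taken from the paper.
    """
    if s <= 0:
        return [0.0] * len(w)
    # Indices of the s entries with the largest absolute value.
    order = sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)
    keep = set(order[:s])
    return [float(w[i]) if i in keep else 0.0 for i in range(len(w))]


def dual_step(lam, consumption, budget_rate, eta):
    """One projected gradient-ascent step on the dual variables.

    There is one dual variable per knapsack constraint. A dual variable
    grows when the round's consumption exceeds the per-round budget rate
    and shrinks otherwise, and it is projected back onto [0, inf).
    The update rule itself is an assumption; the abstract only says that
    an online learning algorithm updates the dual variable.
    """
    return [max(0.0, l + eta * (c - budget_rate))
            for l, c in zip(lam, consumption)]


if __name__ == "__main__":
    # Sparse estimate: only the two largest-magnitude weights survive.
    print(hard_threshold([0.1, -3.0, 0.5, 2.0], 2))
    # Resource 0 was over-consumed this round, resource 1 under-consumed.
    print(dual_step([0.0, 1.0], [1.0, 0.0], budget_rate=0.5, eta=0.1))
```

In a primal-dual scheme of this kind, the dual variables typically act as per-resource prices subtracted from the estimated reward when choosing an arm, so a resource that is being depleted too quickly becomes more expensive to use.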