
    Combinatorial Bandits Revisited

    This paper investigates stochastic and adversarial combinatorial multi-armed bandit problems. In the stochastic setting under semi-bandit feedback, we derive a problem-specific regret lower bound and discuss its scaling with the dimension of the decision space. We propose ESCB, an algorithm that efficiently exploits the structure of the problem, and provide a finite-time analysis of its regret. ESCB has better performance guarantees than existing algorithms and significantly outperforms them in practice. In the adversarial setting under bandit feedback, we propose CombEXP, an algorithm with the same regret scaling as state-of-the-art algorithms but with lower computational complexity for some combinatorial problems. Comment: 30 pages, Advances in Neural Information Processing Systems 28 (NIPS 2015).
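
    The abstract does not give ESCB's exact index, so the sketch below is only a minimal UCB-style combinatorial index in the spirit it describes: the optimism of an action is the sum of its arms' empirical means plus a joint confidence width. The exploration function `f_t` and the brute-force maximization over subsets are illustrative assumptions (part of ESCB's contribution is performing this maximization efficiently).

```python
import numpy as np
from itertools import combinations

def escb_style_index(action, mu_hat, pulls, t):
    """Optimistic index of one size-m action (illustrative form, not the paper's exact index)."""
    idx = list(action)
    f_t = np.log(max(t, 2))                          # assumed exploration function
    mean_term = mu_hat[idx].sum()
    conf_term = np.sqrt(0.5 * f_t * np.sum(1.0 / np.maximum(pulls[idx], 1)))
    return mean_term + conf_term

# toy run: brute-force the best-index subset (feasible only for tiny n)
rng = np.random.default_rng(0)
n, m, T = 6, 2, 500
true_mu = rng.uniform(0.1, 0.9, n)
mu_hat, pulls = np.zeros(n), np.zeros(n)
for t in range(1, T + 1):
    best = max(combinations(range(n), m),
               key=lambda a: escb_style_index(a, mu_hat, pulls, t))
    for i in best:                                   # semi-bandit: each chosen arm is observed
        r = rng.binomial(1, true_mu[i])
        pulls[i] += 1
        mu_hat[i] += (r - mu_hat[i]) / pulls[i]
print("most pulled arms:", np.argsort(-pulls)[:m], "| true best arms:", np.argsort(-true_mu)[:m])
```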

    Learning to Route Efficiently with End-to-End Feedback: The Value of Networked Structure

    We introduce efficient algorithms that achieve nearly optimal regret for the problem of stochastic online shortest path routing with end-to-end feedback. The setting is a natural application of the combinatorial stochastic bandit problem, a special case of the linear stochastic bandit problem. We show how the difficulties posed by the large-scale action set can be overcome by exploiting the networked structure of the action set. Our approach presents a novel connection between bandit learning and shortest path algorithms. Our main contribution is an adaptive exploration algorithm with nearly optimal instance-dependent regret for any directed acyclic network. We then modify it so that nearly optimal worst-case regret is achieved simultaneously. Driven by the carefully designed Top-Two Comparison (TTC) technique, the algorithms are efficiently implementable. We further conduct extensive numerical experiments to show that our proposed algorithms not only achieve superior regret performance but also drastically reduce the runtime.
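
    The Top-Two Comparison technique itself is not described here in enough detail to reproduce, so the sketch below only illustrates the reduction the abstract alludes to: each source-to-sink path is an edge-indicator vector, end-to-end feedback is the noisy total path cost, and a generic linear-bandit (LinUCB-style) rule is run over a small hand-enumerated path set. The toy graph, noise level, and the constant `alpha` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# toy DAG: 4 edges, 3 source->sink paths given as edge-indicator vectors (assumed example)
paths = np.array([[1, 1, 0, 0],     # path A uses edges 0 and 1
                  [1, 0, 1, 0],     # path B uses edges 0 and 2
                  [0, 0, 1, 1]])    # path C uses edges 2 and 3
true_edge_cost = np.array([0.3, 0.5, 0.2, 0.4])

d, lam, alpha, T = paths.shape[1], 1.0, 0.5, 2000
A = lam * np.eye(d)                  # ridge design matrix over edge space
b = np.zeros(d)
picks = np.zeros(len(paths), dtype=int)

for t in range(T):
    theta = np.linalg.solve(A, b)    # least-squares estimate of per-edge costs
    A_inv = np.linalg.inv(A)
    # optimistic (lower) confidence bound on total path cost, since routing minimizes cost
    lcb = paths @ theta - alpha * np.sqrt(np.einsum('ij,jk,ik->i', paths, A_inv, paths))
    k = int(np.argmin(lcb))
    picks[k] += 1
    cost = paths[k] @ true_edge_cost + rng.normal(0, 0.1)   # end-to-end feedback only
    A += np.outer(paths[k], paths[k])
    b += cost * paths[k]

print("pick counts per path:", picks, "| true path costs:", paths @ true_edge_cost)
```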

    Top-k Combinatorial Bandits with Full-Bandit Feedback

    Top-k combinatorial bandits generalize multi-armed bandits: at each round, any subset of k out of n arms may be chosen, and the sum of their rewards is gained. We address full-bandit feedback, in which the agent observes only the sum of rewards, in contrast to semi-bandit feedback, in which the agent also observes the individual arms' rewards. We present the Combinatorial Successive Accepts and Rejects (CSAR) algorithm, which generalizes SAR (Bubeck et al., 2013) to top-k combinatorial bandits. Our main contribution is an efficient sampling scheme that uses Hadamard matrices to accurately estimate the individual arms' expected rewards. We discuss two variants of the algorithm: the first minimizes the sample complexity, and the second minimizes the regret. We also prove a lower bound on the sample complexity, which is tight for k = O(1). Finally, we run experiments and show that our algorithm outperforms other methods.
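
    A small illustration of the recovery idea the abstract credits to Hadamard matrices: play the subsets given by the +1 pattern of each Hadamard row, observe only the subset sums (full-bandit feedback), and invert the resulting 0/1 design to estimate every arm's mean. How CSAR interleaves this with its accept/reject phases is not shown, and the repetition count is an assumption.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(2)
n = 8                                    # number of arms; power of 2 for the Sylvester construction
true_mu = rng.uniform(0.2, 0.8, n)

H = hadamard(n)                          # entries +/-1, first row all ones
subsets = (H + 1) // 2                   # 0/1 rows: the +1 entries define the played subset
reps = 200                               # assumed number of plays per designed subset

# full-bandit feedback: for each designed subset we only see noisy sums of its rewards
y = np.array([
    np.mean([rng.binomial(1, true_mu[row.astype(bool)]).sum() for _ in range(reps)])
    for row in subsets
])

mu_hat = np.linalg.solve(subsets, y)     # invert the design to recover per-arm means
print("max estimation error:", np.abs(mu_hat - true_mu).max())
```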

    Thompson Sampling Algorithms for Cascading Bandits

    Motivated by efficient optimization for online recommender systems, we revisit the cascading bandit model proposed by Kveton et al. (2015). While Thompson sampling (TS) algorithms have been shown to be empirically superior to Upper Confidence Bound (UCB) algorithms for cascading bandits, theoretical guarantees are only known for the latter, not the former. In this paper, we close the gap by designing and analyzing a TS algorithm, TS-Cascade, that achieves the state-of-the-art regret bound for cascading bandits. Next, we derive a nearly matching regret lower bound using information-theoretic techniques and judiciously constructed cascading bandit instances. As a complement, we also provide a problem-dependent upper bound on the regret of the Thompson sampling algorithm with Beta-Bernoulli updates; this upper bound is tighter than a recent derivation by Huyuk and Tekin (2019). Finally, we consider a linear generalization of the cascading bandit model, which allows efficient learning in large cascading bandit problem instances. We introduce a TS algorithm that enjoys a regret bound depending on the dimension of the linear model but not on the number of items. Our paper establishes the first theoretical guarantees on TS algorithms for the stochastic combinatorial bandit problem with partial feedback. Numerical experiments demonstrate the superiority of our TS algorithms over existing UCB algorithms. Comment: 54 pages, 3 figures.
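
    A minimal sketch of Thompson sampling with Beta-Bernoulli updates under cascade feedback, the variant the abstract analyzes alongside TS-Cascade; this is not TS-Cascade itself, and the problem sizes and priors are assumptions. On each round a posterior sample is drawn per item, the top-K samples are recommended, and only items up to (and including) the first click are observed.

```python
import numpy as np

rng = np.random.default_rng(3)
L, K, T = 20, 4, 5000                      # L items, recommend K, horizon T (toy sizes)
true_attract = rng.uniform(0.05, 0.4, L)   # unknown per-item click (attraction) probabilities

alpha = np.ones(L)                         # Beta(1, 1) prior for every item
beta = np.ones(L)

for t in range(T):
    theta = rng.beta(alpha, beta)          # posterior sample of each item's attraction
    ranked = np.argsort(-theta)[:K]        # recommend the K items with the highest samples
    for item in ranked:                    # cascade feedback: user clicks the first attractive item
        clicked = rng.random() < true_attract[item]
        if clicked:
            alpha[item] += 1               # examined and clicked
            break
        beta[item] += 1                    # examined but not clicked; later items are unobserved

post_mean = alpha / (alpha + beta)
print("estimated top-K:", sorted(np.argsort(-post_mean)[:K].tolist()),
      "| true top-K:", sorted(np.argsort(-true_attract)[:K].tolist()))
```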

    Combinatorial Semi-Bandits with Knapsacks

    We unify two prominent lines of work on multi-armed bandits: bandits with knapsacks (BwK) and combinatorial semi-bandits. The former concerns limited "resources" consumed by the algorithm, e.g., limited supply in dynamic pricing. The latter allows a huge number of actions but assumes combinatorial structure and additional feedback to make the problem tractable. We define a common generalization, support it with several motivating examples, and design an algorithm for it. Our regret bounds are comparable with those for BwK and combinatorial semi-bandits.
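
    The paper's algorithm is not spelled out in the abstract, so the toy below only combines the two ingredients it names: a UCB choice over subsets with semi-bandit feedback, and a global resource budget whose exhaustion ends the run. A real BwK-style method would additionally trade reward against consumption; all names, costs, and constants here are assumptions.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
n, m, budget = 5, 2, 400.0                # choose m of n arms per round; shared resource budget
true_reward = rng.uniform(0.2, 0.9, n)
true_cost = rng.uniform(0.1, 0.5, n)      # each played arm also consumes some resource

mu_hat, pulls, spent, t = np.zeros(n), np.zeros(n), 0.0, 0
while spent < budget:
    t += 1
    ucb = mu_hat + np.sqrt(2 * np.log(t + 1) / np.maximum(pulls, 1))
    action = max(combinations(range(n), m), key=lambda a: ucb[list(a)].sum())
    for i in action:                      # semi-bandit: per-arm reward and cost are observed
        pulls[i] += 1
        mu_hat[i] += (rng.binomial(1, true_reward[i]) - mu_hat[i]) / pulls[i]
        spent += true_cost[i]

print(f"stopped after {t} rounds, resource spent {spent:.1f} of {budget}")
```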

    DCM Bandits: Learning to Rank with Multiple Clicks

    A search engine recommends a list of web pages to the user. The user examines this list from the first page to the last and clicks on all attractive pages until satisfied. This user behavior can be described by the dependent click model (DCM). We propose DCM bandits, an online learning variant of the DCM in which the goal is to maximize the probability of recommending satisfactory items, such as web pages. The main challenge of our learning problem is that we do not observe which attractive item is satisfactory. We propose a computationally efficient learning algorithm for solving our problem, dcmKL-UCB; derive gap-dependent upper bounds on its regret under reasonable assumptions; and also prove a matching lower bound up to logarithmic factors. We evaluate our algorithm on synthetic and real-world problems and show that it performs well even when our model is misspecified. This work presents the first practical and regret-optimal online algorithm for learning to rank with multiple clicks in a cascade-like click model. Comment: Proceedings of the 33rd International Conference on Machine Learning.
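
    dcmKL-UCB builds on KL-UCB-style indices. The helper below computes a standard Bernoulli KL-UCB upper confidence value by bisection, which is the generic ingredient rather than the full DCM algorithm; the bisection depth and the log(t) confidence level are the usual textbook choices, not values taken from the paper.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, iters=30):
    """Largest q >= mean with pulls * KL(mean, q) <= log(t): the KL-UCB upper confidence bound."""
    if pulls == 0:
        return 1.0
    level = math.log(max(t, 2)) / pulls
    lo, hi = mean, 1.0
    for _ in range(iters):               # bisection; KL(mean, q) is increasing in q >= mean
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

print(kl_ucb_index(mean=0.3, pulls=50, t=1000))   # optimistic click-probability estimate
```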

    Cascading Bandits for Large-Scale Recommendation Problems

    Most recommender systems recommend a list of items. The user examines the list from the first item to the last, often chooses the first attractive item, and does not examine the rest. This type of user behavior can be modeled by the cascade model. In this work, we study cascading bandits, an online learning variant of the cascade model in which the goal is to recommend the K most attractive items from a large set of L candidate items. We propose two algorithms for solving this problem, both based on the idea of linear generalization. The key idea in our solutions is that we learn a predictor of the attraction probabilities of items from their features, as opposed to learning the attraction probability of each item independently as in existing work. This results in practical learning algorithms whose regret does not depend on the number of items L. We bound the regret of one algorithm and comprehensively evaluate the other on a range of recommendation problems. The algorithm performs well and outperforms all baselines. Comment: Accepted to UAI 2016.
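
    A hedged sketch of the linear-generalization idea: attraction probabilities are predicted from item features via a ridge-regression estimate with an optimistic bonus, and the top-K optimistic items are recommended under cascade feedback. This is not the paper's exact algorithm; the feature model, the constant `alpha`, and the sizes are assumptions chosen so the true probabilities are linear in the features.

```python
import numpy as np

rng = np.random.default_rng(5)
L, d, K, T = 50, 6, 4, 3000                   # many items, low-dimensional features
X = rng.uniform(0, 1, (L, d))                 # item feature vectors (assumed given)
theta_star = rng.uniform(0, 1, d)
theta_star /= theta_star.sum()                # convex weights keep probabilities in [0, 1]
true_attract = X @ theta_star                 # attraction probabilities are linear in features

lam, alpha = 1.0, 0.5
A, b = lam * np.eye(d), np.zeros(d)

for t in range(T):
    theta = np.linalg.solve(A, b)
    A_inv = np.linalg.inv(A)
    width = np.sqrt(np.einsum('ij,jk,ik->i', X, A_inv, X))
    ucb = X @ theta + alpha * width           # optimistic attraction estimate per item
    ranked = np.argsort(-ucb)[:K]
    for item in ranked:                       # cascade feedback: stop at the first click
        clicked = rng.random() < true_attract[item]
        A += np.outer(X[item], X[item])
        b += float(clicked) * X[item]
        if clicked:
            break

theta_hat = np.linalg.solve(A, b)
print("estimated top-K:", sorted(np.argsort(-(X @ theta_hat))[:K].tolist()),
      "| true top-K:", sorted(np.argsort(-true_attract)[:K].tolist()))
```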

    Regret Bounds for Stochastic Combinatorial Multi-Armed Bandits with Linear Space Complexity

    Many real-world problems face the dilemma of choosing the best K out of N options at a given time instant. This setup can be modeled as a combinatorial bandit that chooses K out of N arms at each time, with the aim of achieving an efficient tradeoff between exploration and exploitation. This is the first work on combinatorial bandits where the reward received can be a non-linear function of the chosen K arms. The direct use of multi-armed bandits requires choosing among N-choose-K options, making the state space large. In this paper, we present a novel algorithm that is computationally efficient and whose storage is linear in N. The proposed algorithm is a divide-and-conquer strategy, which we call CMAB-SM. Further, the proposed algorithm achieves a regret bound of $\tilde O(K^{1/2} N^{1/3} T^{2/3})$ for a time horizon $T$, which is sub-linear in all parameters $T$, $N$, and $K$. Evaluation results on different reward functions and arm distributions show significantly improved performance compared to the standard multi-armed bandit approach with $\binom{N}{K}$ choices. Comment: 32 pages.
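
    To make the scale argument concrete, the snippet below simply counts the actions a naive multi-armed bandit reduction would have to track for a few moderate values of N and K; the example sizes are illustrative only.

```python
from math import comb

# naive reduction: one "arm" per size-K subset of the N base arms
for N, K in [(20, 5), (50, 10), (100, 10)]:
    print(f"N={N}, K={K}: {comb(N, K):,} candidate actions, "
          f"versus storage linear in N for a structured approach")
```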

    Batch-Size Independent Regret Bounds for the Combinatorial Multi-Armed Bandit Problem

    We consider the combinatorial multi-armed bandit (CMAB) problem, where the reward function is nonlinear. In this setting, the agent chooses a batch of arms on each round and receives feedback from each arm in the batch. The reward that the agent aims to maximize is a function of the selected arms and their expectations. In many applications, the reward function is highly nonlinear, and the performance of existing algorithms relies on a global Lipschitz constant to encapsulate the function's nonlinearity. This may lead to loose regret bounds, since by itself a large gradient does not necessarily cause a large regret, but only in regions where the uncertainty in the reward's parameters is high. To overcome this problem, we introduce a new smoothness criterion, which we term Gini-weighted smoothness, that takes into account both the nonlinearity of the reward and the concentration properties of the arms. We show that the linear dependence of the regret on the batch size in existing algorithms can be replaced by this smoothness parameter. This, in turn, leads to much tighter regret bounds when the smoothness parameter is batch-size independent. For example, in the probabilistic maximum coverage (PMC) problem, which has many applications including influence maximization and diverse recommendations, we achieve dramatic improvements in the upper bounds. We also prove matching lower bounds for the PMC problem and show that our algorithm is tight, up to a logarithmic factor in the problem's parameters. Comment: Accepted to COLT 2019.

    Hedging the Drift: Learning to Optimize under Non-Stationarity

    We introduce data-driven decision-making algorithms that achieve state-of-the-art dynamic regret bounds for non-stationary bandit settings. These settings capture applications such as advertisement allocation, dynamic pricing, and traffic network routing in changing environments. We show how the difficulty posed by the (a priori unknown and possibly adversarial) non-stationarity can be overcome by an unconventional marriage between stochastic and adversarial bandit learning algorithms. Our main contribution is a general algorithmic recipe for a wide variety of non-stationary bandit problems. Specifically, we design and analyze a sliding-window upper confidence bound algorithm that achieves the optimal dynamic regret bound for each of the settings when we know the respective underlying variation budget, which quantifies the total amount of temporal variation of the latent environments. Boosted by the novel bandit-over-bandit framework that adapts to the latent changes, we can further enjoy (nearly) optimal dynamic regret bounds in a (surprisingly) parameter-free manner. In addition to the classical exploration-exploitation trade-off, our algorithms leverage the power of the "forgetting principle" in the learning process, which is vital in changing environments. Our extensive numerical experiments on both synthetic and real-world online auto-loan datasets show that our proposed algorithms achieve superior empirical performance compared to existing algorithms. Comment: Journal version of the AISTATS 2019 paper (available at arXiv:1810.03024). This version fixes an error in the proof of Theorem 2 under Assumption 4 of arXiv:2103.0575.
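
    A minimal sketch of a sliding-window UCB on a two-armed piecewise-stationary toy problem: only the last `window` observations of each arm are kept, so old evidence is forgotten. The window length is assumed known here, whereas the paper's bandit-over-bandit layer is precisely what removes that assumption; the constants and the change point are illustrative.

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(6)
T, window, arms = 4000, 300, 2             # horizon, (assumed known) window length, two arms

history = [deque(maxlen=window) for _ in range(arms)]   # only recent rewards are kept
rounds_on_best = 0

for t in range(1, T + 1):
    # non-stationary environment: the better arm switches halfway through the horizon
    means = (0.7, 0.3) if t <= T // 2 else (0.3, 0.7)
    ucb = []
    for a in range(arms):
        n = len(history[a])
        if n == 0:
            ucb.append(float('inf'))       # force at least one pull of each arm
        else:
            ucb.append(np.mean(history[a]) + np.sqrt(2 * np.log(min(t, window)) / n))
    a = int(np.argmax(ucb))
    history[a].append(rng.binomial(1, means[a]))
    best = 0 if t <= T // 2 else 1
    rounds_on_best += (a == best)

print(f"fraction of rounds playing the currently best arm: {rounds_on_best / T:.2f}")
```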