
    Stochastic Online Learning with Probabilistic Graph Feedback

    We consider a problem of stochastic online learning with general probabilistic graph feedback, where each directed edge $(i,j)$ in the feedback graph has probability $p_{ij}$. Two cases are covered. (a) The one-step case: after playing arm $i$, the learner observes a sample reward of arm $j$ with independent probability $p_{ij}$. (b) The cascade case: after playing arm $i$, the learner observes feedback of all arms $j$ reached by a probabilistic cascade starting from $i$ -- for each edge $(i,j)$, if arm $i$ is played or observed, then a reward sample of arm $j$ is observed with independent probability $p_{ij}$. Previous works mainly focus on deterministic graphs, which correspond to the one-step case with $p_{ij} \in \{0,1\}$, on adversarial sequences of graphs with certain topology guarantees, or on specific types of random graphs. We analyze the asymptotic lower bounds and design algorithms for both cases. The regret upper bounds of the algorithms match the lower bounds with high probability.
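
    The two feedback models are easy to make concrete with a small simulation. The sketch below is illustrative only: it assumes Bernoulli arm rewards, a matrix P with P[i, j] equal to $p_{ij}$, and that the played arm's own reward is revealed in the cascade case; these names and assumptions are ours, not the paper's.

```python
import numpy as np

def one_step_feedback(i, P, means, rng):
    """One-step case: after playing arm i, a reward sample of each arm j is
    observed independently with probability P[i, j]."""
    observed = {}
    for j in range(len(means)):
        if rng.random() < P[i, j]:
            observed[j] = int(rng.random() < means[j])  # Bernoulli reward sample
    return observed

def cascade_feedback(i, P, means, rng):
    """Cascade case: whenever an arm is played or observed, each outgoing edge
    (u, j) fires independently with probability P[u, j] and reveals a sample of
    arm j. The played arm's own reward is assumed observed here."""
    observed = {i: int(rng.random() < means[i])}
    frontier = [i]
    while frontier:
        u = frontier.pop()
        for j in range(len(means)):
            if j not in observed and rng.random() < P[u, j]:
                observed[j] = int(rng.random() < means[j])
                frontier.append(j)
    return observed

rng = np.random.default_rng(0)
P = np.array([[1.0, 0.3], [0.2, 1.0]])   # P[i, j] = p_ij
means = [0.6, 0.4]
print(one_step_feedback(0, P, means, rng), cascade_feedback(0, P, means, rng))
```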

    Combinatorial Bandits Revisited

    This paper investigates stochastic and adversarial combinatorial multi-armed bandit problems. In the stochastic setting under semi-bandit feedback, we derive a problem-specific regret lower bound and discuss its scaling with the dimension of the decision space. We propose ESCB, an algorithm that efficiently exploits the structure of the problem, and provide a finite-time analysis of its regret. ESCB has better performance guarantees than existing algorithms, and significantly outperforms these algorithms in practice. In the adversarial setting under bandit feedback, we propose CombEXP, an algorithm with the same regret scaling as state-of-the-art algorithms, but with lower computational complexity for some combinatorial problems. Comment: 30 pages, Advances in Neural Information Processing Systems 28 (NIPS 2015).

    Online Clustering of Bandits

    We introduce a novel algorithmic approach to content recommendation based on adaptive clustering of exploration-exploitation ("bandit") strategies. We provide a sharp regret analysis of this algorithm in a standard stochastic noise setting, demonstrate its scalability properties, and prove its effectiveness on a number of artificial and real-world datasets. Our experiments show a significant increase in prediction performance over state-of-the-art methods for bandit problems. Comment: In E. Xing and T. Jebara (Eds.), Proceedings of the 31st International Conference on Machine Learning, Journal of Machine Learning Research Workshop and Conference Proceedings, Vol. 32 (JMLR W&CP-32), Beijing, China, Jun. 21-26, 2014 (ICML 2014), Submitted by Shuai Li (https://sites.google.com/site/shuailidotsli

    Decentralized Cooperative Stochastic Bandits

    We study a decentralized cooperative stochastic multi-armed bandit problem with $K$ arms on a network of $N$ agents. In our model, the reward distribution of each arm is the same for each agent and rewards are drawn independently across agents and time steps. In each round, each agent chooses an arm to play and subsequently sends a message to her neighbors. The goal is to minimize the overall regret of the entire network. We design a fully decentralized algorithm that uses an accelerated consensus procedure to compute (delayed) estimates of the average of rewards obtained by all the agents for each arm, and then uses an upper confidence bound (UCB) algorithm that accounts for the delay and error of the estimates. We analyze the regret of our algorithm and also provide a lower bound. The regret is bounded by the optimal centralized regret plus a natural and simple term depending on the spectral gap of the communication matrix. Our algorithm is simpler to analyze than those proposed in prior work and it achieves better regret bounds, while requiring less information about the underlying network. It also performs better empirically.
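
    As a rough illustration of the running-consensus idea (not the paper's algorithm), the sketch below replaces the accelerated consensus step with plain gossip averaging through a doubly stochastic matrix W and uses a vanilla UCB index without the delay-aware correction; all names and constants are illustrative.

```python
import numpy as np

def decentralized_ucb(W, means, T, rng, c=2.0):
    """Each agent keeps consensus estimates of the network-wide reward sums and
    pull counts per arm, plays a UCB index on them, then mixes the estimates
    with its neighbours via the doubly stochastic matrix W (plain gossip)."""
    N, K = W.shape[0], len(means)
    S = np.zeros((N, K))      # estimated network-wide reward sums
    M = np.ones((N, K))       # estimated network-wide pull counts (init 1 to avoid /0)
    best, regret = max(means), 0.0
    for t in range(1, T + 1):
        for a in range(N):
            ucb = S[a] / M[a] + np.sqrt(c * np.log(t + 1) / M[a])
            arm = int(np.argmax(ucb))
            S[a, arm] += float(rng.random() < means[arm])   # Bernoulli reward
            M[a, arm] += 1.0
            regret += best - means[arm]
        S, M = W @ S, W @ M    # gossip step: average estimates with neighbours
    return regret

rng = np.random.default_rng(0)
W = np.full((3, 3), 1.0 / 3.0)             # complete 3-agent network
print(decentralized_ucb(W, [0.9, 0.5, 0.4], T=2000, rng=rng))
```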

    Counterfactual Reasoning and Learning Systems

    This work shows how to leverage causal inference to understand the behavior of complex learning systems interacting with their environment and predict the consequences of changes to the system. Such predictions allow both humans and algorithms to select changes that improve both the short-term and long-term performance of such systems. This work is illustrated by experiments carried out on the ad placement system associated with the Bing search engine. Comment: revised version.

    Combinatorial Semi-Bandits with Knapsacks

    We unify two prominent lines of work on multi-armed bandits: bandits with knapsacks (BwK) and combinatorial semi-bandits. The former concerns limited "resources" consumed by the algorithm, e.g., limited supply in dynamic pricing. The latter allows a huge number of actions but assumes combinatorial structure and additional feedback to make the problem tractable. We define a common generalization, support it with several motivating examples, and design an algorithm for it. Our regret bounds are comparable with those for BwK and combinatorial semi-bandits.

    Bridging the gap between regret minimization and best arm identification, with application to A/B tests

    State-of-the-art online learning procedures focus either on selecting the best alternative ("best arm identification") or on minimizing the cost (the "regret"). We merge these two objectives by providing the theoretical analysis of cost-minimizing algorithms that are also delta-PAC (with a proven guaranteed bound on the decision time), hence fulfilling regret minimization and best arm identification at the same time. This analysis sheds light on the common observation that ill-calibrated UCB algorithms minimize regret while still identifying the best arm quickly. We also extend these results to the non-i.i.d. case faced by many practitioners. This provides a technique for trading off cost against decision time when running adaptive tests, with applications ranging from website A/B testing to clinical trials.
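
    The interplay between the two objectives can be illustrated with a generic confidence-bound sampler (not the paper's procedure): play the UCB arm to keep regret low, and stop with a delta-PAC recommendation once one arm's lower confidence bound dominates every other arm's upper confidence bound. The confidence radius and names below are illustrative assumptions.

```python
import numpy as np

def ucb_with_stopping(means, delta, rng, horizon=100_000):
    """Play the UCB arm each round; stop and recommend an arm once its lower
    confidence bound exceeds all other arms' upper confidence bounds."""
    K = len(means)
    n = np.ones(K)                                   # pull counts (each arm pulled once)
    s = np.array([float(rng.random() < m) for m in means])
    for t in range(K + 1, horizon):
        rad = np.sqrt(np.log(4.0 * K * t * t / delta) / (2.0 * n))  # Hoeffding-style radius
        mu = s / n
        leader = int(np.argmax(mu))
        if mu[leader] - rad[leader] >= np.delete(mu + rad, leader).max():
            return leader, t                         # recommendation and decision time
        arm = int(np.argmax(mu + rad))               # regret-minimising UCB play
        n[arm] += 1
        s[arm] += float(rng.random() < means[arm])   # Bernoulli reward
    return int(np.argmax(s / n)), horizon

rng = np.random.default_rng(0)
print(ucb_with_stopping([0.7, 0.5, 0.5], delta=0.05, rng=rng))
```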

    Dynamic Pricing with Demand Covariates

    We consider a firm that sells products over $T$ periods without knowing the demand function. The firm sequentially sets prices to earn revenue and to learn the underlying demand function simultaneously. A natural heuristic for this problem, commonly used in practice, is greedy iterative least squares (GILS). At each time period, GILS estimates the demand as a linear function of the price by applying least squares to the set of prior prices and realized demands. Then a price that maximizes the revenue, given the estimated demand function, is used for the next time period. The performance is measured by the regret, which is the expected revenue loss from the optimal (oracle) pricing policy when the demand function is known. Recently, den Boer and Zwart (2014) and Keskin and Zeevi (2014) demonstrated that GILS is sub-optimal. They introduced algorithms which integrate forced price dispersion with GILS and achieve asymptotically optimal performance. In this paper, we consider this dynamic pricing problem in a data-rich environment. In particular, we assume that the firm knows the expected demand under a particular price from historical data, and in each period, before setting the price, the firm has access to extra information (demand covariates) which may be predictive of the demand. We prove that in this setting GILS achieves asymptotically optimal regret of order $\log(T)$. We also show the following surprising result: in the original dynamic pricing problem of den Boer and Zwart (2014) and Keskin and Zeevi (2014), inclusion of any set of covariates in GILS as potential demand covariates (even though they could carry no information) would make GILS asymptotically optimal. We validate our results via extensive numerical simulations on synthetic and real data sets. Comment: 28 pages, 6 figures.
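
    For concreteness, here is a minimal sketch of the GILS heuristic described above, assuming a linear demand model with additive Gaussian noise; the seed prices, clipping interval, and parameter names are illustrative choices rather than the paper's setup.

```python
import numpy as np

def gils(true_a, true_b, T, rng, p_min=0.1, p_max=10.0, noise=0.5):
    """At each period, refit demand ~ a + b * price by least squares on all past
    (price, demand) pairs, then charge the price maximising the estimated
    revenue p * (a_hat + b_hat * p), i.e. p = -a_hat / (2 b_hat) when b_hat < 0."""
    prices = [p_min, p_max]                           # two seed prices so least squares is identified
    demands = [true_a + true_b * p + noise * rng.standard_normal() for p in prices]
    for _ in range(T):
        X = np.column_stack([np.ones(len(prices)), prices])
        a_hat, b_hat = np.linalg.lstsq(X, np.array(demands), rcond=None)[0]
        p = -a_hat / (2.0 * b_hat) if b_hat < 0 else p_max   # revenue-maximising price estimate
        p = float(np.clip(p, p_min, p_max))
        prices.append(p)
        demands.append(true_a + true_b * p + noise * rng.standard_normal())
    return prices

rng = np.random.default_rng(0)
print(gils(true_a=10.0, true_b=-1.5, T=20, rng=rng)[-3:])   # last few greedy prices
```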

    A Decentralized Policy with Logarithmic Regret for a Class of Multi-Agent Multi-Armed Bandit Problems with Option Unavailability Constraints and Stochastic Communication Protocols

    This paper considers a multi-armed bandit (MAB) problem in which multiple mobile agents receive rewards by sampling from a collection of spatially dispersed stochastic processes, called bandits. The goal is to formulate a decentralized policy for each agent, in order to maximize the total cumulative reward over all agents, subject to option availability and inter-agent communication constraints. The problem formulation is motivated by applications in which a team of autonomous mobile robots cooperates to accomplish an exploration and exploitation task in an uncertain environment. Bandit locations are represented by the vertices of a spatial graph. At any time, an agent's options consist of sampling the bandit at its current location, or traveling along an edge of the spatial graph to a new bandit location. Communication constraints are described by a directed, non-stationary, stochastic communication graph. At any time, agents may receive data only from their communication graph in-neighbors. For the case of a single agent on a fully connected spatial graph, it is known that the expected regret for any optimal policy is necessarily bounded below by a function that grows as the logarithm of time. A class of policies called upper confidence bound (UCB) algorithms asymptotically achieves logarithmic regret for the classical MAB problem. In this paper, we propose a UCB-based decentralized motion and option selection policy and a non-stationary stochastic communication protocol that guarantee logarithmic regret. To our knowledge, this is the first such decentralized policy for non-fully connected spatial graphs with communication constraints. When the spatial graph is fully connected and the communication graph is stationary, our decentralized algorithm matches or exceeds the best reported prior results from the literature. Comment: Pre-print submitted for review to the 2020 CD
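
    A rough single-agent sketch of the motion-and-option-selection idea (not the paper's policy, and with communication omitted): at each step the agent either samples the bandit at its current vertex or travels to a neighbouring vertex, choosing whichever vertex has the largest UCB index. The graph, index, and names are illustrative.

```python
import numpy as np

def step(v, adj, n, s, t, means, rng, c=2.0):
    """One decision step: compare UCB indices of the current vertex and its
    spatial-graph neighbours; sample locally if the current vertex wins,
    otherwise travel along the edge to the winning neighbour."""
    def index(u):
        pulls = max(n[u], 1)
        return s[u] / pulls + np.sqrt(c * np.log(t + 1) / pulls)
    target = max([v] + adj[v], key=index)     # best UCB among current vertex and neighbours
    if target == v:                           # stay and sample the local bandit
        n[v] += 1
        s[v] += float(rng.random() < means[v])   # Bernoulli reward
        return v
    return target                             # otherwise travel along the edge

rng = np.random.default_rng(0)
adj = {0: [1], 1: [0, 2], 2: [1]}             # a path graph with three bandit locations
n, s, means, v = [0, 0, 0], [0.0, 0.0, 0.0], [0.3, 0.8, 0.5], 0
for t in range(200):
    v = step(v, adj, n, s, t, means, rng)
print(n)                                      # sample counts concentrate on the best vertex
```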

    Combinatorial Multi-Armed Bandit and Its Extension to Probabilistically Triggered Arms

    We define a general framework for a large class of combinatorial multi-armed bandit (CMAB) problems, where subsets of base arms with unknown distributions form super arms. In each round, a super arm is played, the base arms contained in the super arm are played, and their outcomes are observed. We further consider the extension in which more base arms could be probabilistically triggered based on the outcomes of already triggered arms. The reward of the super arm depends on the outcomes of all played arms, and it only needs to satisfy two mild assumptions, which allow a large class of nonlinear reward instances. We assume the availability of an offline $(\alpha,\beta)$-approximation oracle that takes the means of the outcome distributions of arms and outputs a super arm that with probability $\beta$ generates an $\alpha$ fraction of the optimal expected reward. The objective of an online learning algorithm for CMAB is to minimize the $(\alpha,\beta)$-approximation regret, which is the difference between the $\alpha\beta$ fraction of the expected reward when always playing the optimal super arm, and the expected reward of playing super arms according to the algorithm. We provide the CUCB algorithm, which achieves $O(\log n)$ distribution-dependent regret, where $n$ is the number of rounds played, and we further provide distribution-independent bounds for a large class of reward functions. Our regret analysis is tight in that it matches the bound of the UCB1 algorithm (up to a constant factor) for the classical MAB problem, and it significantly improves the regret bound in an earlier paper on combinatorial bandits with linear rewards. We apply our CMAB framework to two new applications, probabilistic maximum coverage and social influence maximization, both having nonlinear reward structures. In particular, the application to social influence maximization requires our extension to probabilistically triggered arms. Comment: A preliminary version of the paper is published in ICML'201
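
    A minimal sketch of the CUCB loop with probabilistic triggering follows. The $(\alpha,\beta)$-approximation oracle and the environment's triggering mechanism are placeholders supplied by the application (the toy example uses an exact top-2 oracle), and the confidence radius is a standard choice rather than a claim about the paper's exact constants.

```python
import numpy as np

def cucb(m, oracle, play, T, rng):
    """Maintain empirical means and counts for the m base arms, feed
    UCB-adjusted means to the offline oracle to pick a super arm, then update
    every base arm that was played or probabilistically triggered."""
    counts = np.zeros(m)                         # times each base arm was observed
    means = np.zeros(m)                          # empirical means of base-arm outcomes
    for t in range(1, T + 1):
        safe = np.maximum(counts, 1.0)
        bonus = np.sqrt(3.0 * np.log(t + 1) / (2.0 * safe))
        ucb = np.where(counts > 0, np.minimum(means + bonus, 1.0), 1.0)  # outcomes assumed in [0, 1]
        super_arm = oracle(ucb)                  # offline approximation oracle
        for arm, outcome in play(super_arm, rng):    # observed (played or triggered) arms
            counts[arm] += 1.0
            means[arm] += (outcome - means[arm]) / counts[arm]
    return means

# Toy example: the super arm is the best pair of base arms, outcomes are Bernoulli.
rng = np.random.default_rng(0)
true = np.array([0.2, 0.8, 0.5, 0.9])
oracle = lambda u: np.argsort(u)[-2:]                       # exact top-2 oracle
play = lambda S, rng: [(a, float(rng.random() < true[a])) for a in S]
print(cucb(len(true), oracle, play, T=3000, rng=rng))
```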