Stochastic Online Learning with Probabilistic Graph Feedback
We consider a problem of stochastic online learning with general
probabilistic graph feedback, where each directed edge (i, j) in the feedback
graph has probability p_{ij}. Two cases are covered. (a) The one-step case,
where after playing arm i the learner observes a sample reward feedback of arm
j with independent probability p_{ij}. (b) The cascade case, where after playing
arm i the learner observes feedback of all arms in a probabilistic
cascade starting from i -- for each edge (j, k) with probability p_{jk}, if arm j
is played or observed, then a reward sample of arm k would be observed
with independent probability p_{jk}. Previous works mainly focus on
deterministic graphs, which correspond to the one-step case with p_{ij} in {0, 1},
on adversarial sequences of graphs with certain topology guarantees,
or on specific types of random graphs. We analyze the asymptotic lower bounds and
design algorithms in both cases. The regret upper bounds of the algorithms
match the lower bounds with high probability.
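The two feedback models above are easy to state concretely. Below is a minimal simulation sketch of both (not the paper's algorithm); the arm count K, the edge-probability matrix P, and the Bernoulli reward means are hypothetical stand-ins, and the notation p_{ij} is reconstructed from context.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5                                    # number of arms (hypothetical)
P = rng.uniform(0.1, 0.9, size=(K, K))   # edge probabilities p_ij (hypothetical)
mu = rng.uniform(0.0, 1.0, size=K)       # Bernoulli reward means (hypothetical)

def play_one_step(i):
    """Play arm i under one-step feedback: edge (i, j) fires independently
    with probability p_ij, revealing a fresh reward sample of arm j."""
    return {j: float(rng.random() < mu[j])
            for j in range(K) if rng.random() < P[i, j]}

def play_cascade(i):
    """Play arm i under cascade feedback: whenever arm j is played or
    observed, each edge (j, k) fires independently with probability p_jk,
    revealing a reward sample of arm k (each arm is observed at most once
    in this simplified sketch)."""
    obs, frontier, seen = {}, [i], {i}
    while frontier:
        j = frontier.pop()
        for k in range(K):
            if k not in seen and rng.random() < P[j, k]:
                seen.add(k)
                obs[k] = float(rng.random() < mu[k])
                frontier.append(k)
    return obs
```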
Combinatorial Bandits Revisited
This paper investigates stochastic and adversarial combinatorial multi-armed
bandit problems. In the stochastic setting under semi-bandit feedback, we
derive a problem-specific regret lower bound, and discuss its scaling with the
dimension of the decision space. We propose ESCB, an algorithm that efficiently
exploits the structure of the problem and provide a finite-time analysis of its
regret. ESCB has better performance guarantees than existing algorithms, and
significantly outperforms these algorithms in practice. In the adversarial
setting under bandit feedback, we propose \textsc{CombEXP}, an algorithm with
the same regret scaling as state-of-the-art algorithms, but with lower
computational complexity for some combinatorial problems.
Comment: 30 pages. Advances in Neural Information Processing Systems 28 (NIPS 2015).
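For intuition, here is a sketch of the kind of combined confidence index that ESCB-style algorithms compute per super arm under semi-bandit feedback. The exploration rate f(t) is simplified (the paper's rate includes an additional log log term), size-m subsets stand in for a general combinatorial decision space, and the brute-force maximization shown is only viable for small instances; the per-round cost of index maximization is exactly what the paper's efficiency discussion addresses.

```python
import numpy as np
from itertools import combinations

def escb_style_index(mu_hat, counts, t, members):
    """Optimistic index of a super arm from per-base-arm statistics."""
    f_t = np.log(t + 1)  # simplified; the paper's f(t) has a log log t term
    mean = sum(mu_hat[i] for i in members)
    bonus = np.sqrt(f_t / 2.0 * sum(1.0 / max(counts[i], 1) for i in members))
    return mean + bonus

def escb_select(mu_hat, counts, t, K, m):
    """Brute-force index maximization over all size-m super arms."""
    return max(combinations(range(K), m),
               key=lambda S: escb_style_index(mu_hat, counts, t, S))

# tiny demo with hypothetical statistics
mu_hat = np.array([0.5, 0.6, 0.4, 0.7])
counts = np.array([10, 5, 8, 3])
print(escb_select(mu_hat, counts, t=100, K=4, m=2))
```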
Online Clustering of Bandits
We introduce a novel algorithmic approach to content recommendation based on
adaptive clustering of exploration-exploitation ("bandit") strategies. We
provide a sharp regret analysis of this algorithm in a standard stochastic
noise setting, demonstrate its scalability properties, and prove its
effectiveness on a number of artificial and real-world datasets. Our
experiments show a significant increase in prediction performance over
state-of-the-art methods for bandit problems.
Comment: In E. Xing and T. Jebara (Eds.), Proceedings of the 31st International
Conference on Machine Learning (ICML 2014), Journal of Machine Learning Research
Workshop and Conference Proceedings, Vol. 32 (JMLR W&CP-32), Beijing, China,
Jun. 21-26, 2014. Submitted by Shuai Li
(https://sites.google.com/site/shuailidotsli).
Decentralized Cooperative Stochastic Bandits
We study a decentralized cooperative stochastic multi-armed bandit problem
with K arms on a network of N agents. In our model, the reward distribution
of each arm is the same for each agent and rewards are drawn independently
across agents and time steps. In each round, each agent chooses an arm to play
and subsequently sends a message to her neighbors. The goal is to minimize the
overall regret of the entire network. We design a fully decentralized algorithm
that uses an accelerated consensus procedure to compute (delayed) estimates of
the average of rewards obtained by all the agents for each arm, and then uses
an upper confidence bound (UCB) algorithm that accounts for the delay and error
of the estimates. We analyze the regret of our algorithm and also provide a
lower bound. The regret is bounded by the optimal centralized regret plus a
natural and simple term depending on the spectral gap of the communication
matrix. Our algorithm is simpler to analyze than those proposed in prior work
and it achieves better regret bounds, while requiring less information about
the underlying network. It also performs better empirically.
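A stripped-down sketch of the mix-then-act pattern described above follows: each agent runs a UCB index on network-averaged statistics that are mixed by a doubly stochastic matrix W once per round. The ring network, mixing weights, and reward parameters are hypothetical, and the sketch uses plain (unaccelerated) gossip with no delay correction, both of which the paper's algorithm adds.

```python
import numpy as np

rng = np.random.default_rng(0)

N, K, T = 4, 3, 2000                    # agents, arms, horizon (hypothetical)
mu = np.array([0.2, 0.5, 0.8])          # common reward means (hypothetical)

# doubly stochastic mixing matrix for a 4-agent ring (hypothetical weights)
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])

S = np.zeros((N, K))                    # mixed running reward sums
C = np.zeros((N, K))                    # mixed running pull counts

for t in range(1, T + 1):
    rewards, pulls = np.zeros((N, K)), np.zeros((N, K))
    for a in range(N):
        if t <= K:                      # initialization: pull each arm once
            arm = t - 1
        else:                           # UCB on network-averaged statistics;
                                        # N * C[a] approximates total network pulls
            idx = S[a] / np.maximum(C[a], 1e-9) \
                + np.sqrt(2.0 * np.log(t) / np.maximum(N * C[a], 1e-9))
            arm = int(np.argmax(idx))
        pulls[a, arm] = 1.0
        rewards[a, arm] = rng.normal(mu[arm], 0.1)
    # one plain gossip step; the paper accelerates this mixing and makes the
    # UCB index account for the resulting delay and error, omitted here
    S = W @ (S + rewards)
    C = W @ (C + pulls)

print(C.sum(axis=0) / (N * T))          # fraction of all pulls per arm (sanity check)
```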
Counterfactual Reasoning and Learning Systems
This work shows how to leverage causal inference to understand the behavior
of complex learning systems interacting with their environment and predict the
consequences of changes to the system. Such predictions allow both humans and
algorithms to select changes that improve both the short-term and long-term
performance of such systems. This work is illustrated by experiments carried
out on the ad placement system associated with the Bing search engine.
Comment: revised version.
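The workhorse behind such counterfactual predictions in this line of work is importance-weighted estimation from logged data. Below is a minimal sketch, assuming the logging system recorded the propensity of each action it took; the function and argument names are illustrative, and the clipping knob reflects the bias-variance trade-off this literature studies.

```python
import numpy as np

def ips_estimate(rewards, log_propensities, new_propensities, clip=10.0):
    """Clipped inverse-propensity estimate of a candidate policy's reward.

    rewards          : rewards observed under the deployed (logging) policy
    log_propensities : probability the logging policy assigned to each logged action
    new_propensities : probability the candidate policy assigns to the same action
    clip             : cap on importance weights, trading bias for variance
    """
    w = np.minimum(np.asarray(new_propensities) / np.asarray(log_propensities), clip)
    return float(np.mean(w * np.asarray(rewards)))
```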
Combinatorial Semi-Bandits with Knapsacks
We unify two prominent lines of work on multi-armed bandits: bandits with
knapsacks (BwK) and combinatorial semi-bandits. The former concerns limited
"resources" consumed by the algorithm, e.g., limited supply in dynamic pricing.
The latter allows a huge number of actions but assumes combinatorial structure
and additional feedback to make the problem tractable. We define a common
generalization, support it with several motivating examples, and design an
algorithm for it. Our regret bounds are comparable with those for BwK and
combinatorial semi-bandits.
Bridging the gap between regret minimization and best arm identification, with application to A/B tests
State of the art online learning procedures focus either on selecting the
best alternative ("best arm identification") or on minimizing the cost (the
"regret"). We merge these two objectives by providing the theoretical analysis
of cost-minimizing algorithms that are also delta-PAC (with a proven guarantee
on the decision time), hence fulfilling regret minimization and best arm
identification at the same time. This analysis sheds light on the common
observation that ill-calibrated UCB algorithms minimize regret while still
identifying the best arm quickly.
We also extend these results to the non-i.i.d. case faced by many practitioners.
This yields a technique for trading off cost against decision time in adaptive
tests, with applications ranging from website A/B testing to clinical trials.
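As a concrete illustration of the merged objective, the sketch below runs a standard UCB index policy while monitoring a delta-PAC stopping rule: stop once one arm's lower confidence bound dominates every other arm's upper confidence bound. It is a schematic stand-in, not the paper's calibrated procedure; the Bernoulli rewards and the confidence radius are assumptions.

```python
import numpy as np

def ucb_with_stopping(means, delta=0.05, T=100_000, seed=0):
    """UCB play plus a delta-PAC stopping rule; returns (best arm, stop time)."""
    rng = np.random.default_rng(seed)
    K = len(means)
    n, s = np.zeros(K), np.zeros(K)
    for t in range(1, T + 1):
        # standard UCB index after one initial pull of each arm
        a = t - 1 if t <= K else int(np.argmax(s / n + np.sqrt(2 * np.log(t) / n)))
        s[a] += float(rng.random() < means[a])   # Bernoulli reward
        n[a] += 1
        if t > K:
            # anytime Hoeffding radius with a union bound over arms and rounds
            rad = np.sqrt(np.log(2 * K * t * t / delta) / (2 * n))
            lcb, ucb = s / n - rad, s / n + rad
            best = int(np.argmax(lcb))
            # stop when the leader's LCB beats every other arm's UCB
            if lcb[best] >= np.max(np.delete(ucb, best)):
                return best, t
    return int(np.argmax(s / n)), T

print(ucb_with_stopping([0.4, 0.5, 0.7]))
```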
Dynamic Pricing with Demand Covariates
We consider a firm that sells products over T periods without knowing the
demand function. The firm sequentially sets prices to earn revenue and to learn
the underlying demand function simultaneously. A natural heuristic for this
problem, commonly used in practice, is greedy iterative least squares (GILS).
At each time period, GILS estimates the demand as a linear function of the
price by applying least squares to the set of prior prices and realized
demands. Then a price that maximizes the revenue, given the estimated demand
function, is used for the next time period. The performance is measured by the
regret, which is the expected revenue loss from the optimal (oracle) pricing
policy when the demand function is known. Recently, den Boer and Zwart (2014)
and Keskin and Zeevi (2014) demonstrated that GILS is sub-optimal. They
introduced algorithms which integrate forced price dispersion with GILS and
achieve asymptotically optimal performance.
In this paper, we consider this dynamic pricing problem in a data-rich
environment. In particular, we assume that the firm knows the expected demand
under a particular price from historical data, and in each period, before
setting the price, the firm has access to extra information (demand covariates)
which may be predictive of the demand. We prove that in this setting GILS
achieves an asymptotically optimal regret of order log(T). We also show the
following surprising result: in the original dynamic pricing problem of den
Boer and Zwart (2014) and Keskin and Zeevi (2014), inclusion of any set of
covariates in GILS as potential demand covariates (even though they could carry
no information) would make GILS asymptotically optimal. We validate our results
via extensive numerical simulations on synthetic and real data sets.
Comment: 28 pages, 6 figures.
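The GILS loop described above is short enough to write down. Here is a minimal sketch with a hypothetical linear demand model; the paper's point is that this plain loop is sub-optimal in the original problem but becomes asymptotically optimal once demand covariates, even uninformative ones, are included as regressors (a covariate version would simply add those features as extra columns of X).

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical true demand: d(p) = a - b * p + noise (unknown to the firm)
a_true, b_true, sigma = 10.0, 2.0, 0.5

prices = [1.0, 4.0]                     # two seed prices so least squares is well-posed
demands = [a_true - b_true * p + rng.normal(0.0, sigma) for p in prices]

for t in range(200):
    # GILS step 1: least-squares fit of demand as a linear function of price
    X = np.column_stack([np.ones(len(prices)), prices])
    a_hat, b_hat = np.linalg.lstsq(X, np.array(demands), rcond=None)[0]
    b_hat = min(b_hat, -1e-6)           # keep the slope negative so revenue is concave
    # GILS step 2: charge the price maximizing estimated revenue p * (a_hat + b_hat * p)
    p = float(np.clip(-a_hat / (2.0 * b_hat), 0.5, 5.0))  # hypothetical price range
    demands.append(a_true - b_true * p + rng.normal(0.0, sigma))
    prices.append(p)

print(prices[-1], a_true / (2 * b_true))   # learned price vs. oracle price 2.5
```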
A Decentralized Policy with Logarithmic Regret for a Class of Multi-Agent Multi-Armed Bandit Problems with Option Unavailability Constraints and Stochastic Communication Protocols
This paper considers a multi-armed bandit (MAB) problem in which multiple
mobile agents receive rewards by sampling from a collection of spatially
dispersed stochastic processes, called bandits. The goal is to formulate a
decentralized policy for each agent, in order to maximize the total cumulative
reward over all agents, subject to option availability and inter-agent
communication constraints. The problem formulation is motivated by applications
in which a team of autonomous mobile robots cooperates to accomplish an
exploration and exploitation task in an uncertain environment. Bandit locations
are represented by vertices of the spatial graph. At any time, an agent's
options consist of sampling the bandit at its current location, or traveling
along an edge of the spatial graph to a new bandit location. Communication
constraints are described by a directed, non-stationary, stochastic
communication graph. At any time, agents may receive data only from their
communication graph in-neighbors. For the case of a single agent on a fully
connected spatial graph, it is known that the expected regret for any optimal
policy is necessarily bounded below by a function that grows as the logarithm
of time. A class of policies called upper confidence bound (UCB) algorithms
asymptotically achieve logarithmic regret for the classical MAB problem. In
this paper, we propose a UCB-based decentralized motion and option selection
policy and a non-stationary stochastic communication protocol that guarantee
logarithmic regret. To our knowledge, this is the first such decentralized
policy for non-fully connected spatial graphs with communication constraints.
When the spatial graph is fully connected and the communication graph is
stationary, our decentralized algorithm matches or exceeds the best reported
prior results from the literature.
Comment: Preprint submitted for review to the 2020 CDC.
Combinatorial Multi-Armed Bandit and Its Extension to Probabilistically Triggered Arms
We define a general framework for a large class of combinatorial multi-armed
bandit (CMAB) problems, where subsets of base arms with unknown distributions
form super arms. In each round, a super arm is played and the base arms
contained in the super arm are played and their outcomes are observed. We
further consider the extension in which more base arms could be
probabilistically triggered based on the outcomes of already triggered arms.
The reward of the super arm depends on the outcomes of all played arms, and it
only needs to satisfy two mild assumptions, which allow a large class of
nonlinear reward instances. We assume the availability of an offline
(\alpha,\beta)-approximation oracle that takes the means of the outcome
distributions of arms and outputs a super arm that with probability {\beta}
generates an {\alpha} fraction of the optimal expected reward. The objective of
an online learning algorithm for CMAB is to minimize
(\alpha,\beta)-approximation regret, which is the difference between the
\alpha{\beta} fraction of the expected reward when always playing the optimal
super arm, and the expected reward of playing super arms according to the
algorithm. We provide the CUCB algorithm, which achieves O(log n)
distribution-dependent regret, where n is the number of rounds played, and we
further provide distribution-independent bounds for a large class of reward
functions. Our regret analysis is tight in that it matches the bound of UCB1
algorithm (up to a constant factor) for the classical MAB problem, and it
significantly improves the regret bound in an earlier paper on combinatorial
bandits with linear rewards. We apply our CMAB framework to two new
applications, probabilistic maximum coverage and social influence maximization,
both having nonlinear reward structures. In particular, application to social
influence maximization requires our extension on probabilistically triggered
arms.
Comment: A preliminary version of this paper was published in ICML'2013.
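A skeleton of the CUCB loop described above is below, hedged in two ways: the (alpha, beta)-approximation oracle is replaced by an exact top-m oracle for a linear reward, and probabilistic triggering is omitted, so the sketch shows only the play-observe-update structure with optimistic adjusted means. All instance parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

K, m, T = 6, 2, 5000                    # base arms, super-arm size, rounds (hypothetical)
mu = rng.uniform(0.1, 0.9, size=K)      # Bernoulli outcome means (hypothetical)

def oracle(weights):
    """Stand-in for the (alpha, beta)-approximation oracle: an exact top-m
    oracle for a linear reward over size-m super arms."""
    return np.argsort(weights)[-m:]

sums, counts = np.zeros(K), np.zeros(K)

# initialization: play super arms until every base arm has been observed once
while counts.min() == 0:
    for j in oracle(np.where(counts == 0, 1.0, 0.0)):
        sums[j] += float(rng.random() < mu[j])
        counts[j] += 1

for t in range(1, T + 1):
    mu_hat = sums / counts
    # CUCB's optimistic "adjusted means", capped at 1, fed to the oracle
    adj = np.minimum(mu_hat + np.sqrt(3.0 * np.log(t + K) / (2.0 * counts)), 1.0)
    for j in oracle(adj):               # play the super arm; semi-bandit feedback
        sums[j] += float(rng.random() < mu[j])
        counts[j] += 1

print(np.argsort(mu)[-m:], oracle(sums / counts))  # true vs. learned top-m arms
```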