Distributed Algorithms for Learning and Cognitive Medium Access with Logarithmic Regret
The problem of distributed learning and channel access is considered in a
cognitive network with multiple secondary users. The availability statistics of
the channels are initially unknown to the secondary users and are estimated
using sensing decisions. There is no explicit information exchange or prior
agreement among the secondary users. We propose policies for distributed
learning and access which achieve order-optimal cognitive system throughput
(number of successful secondary transmissions) under self play, i.e., when
implemented at all the secondary users. Equivalently, our policies minimize the
regret in distributed learning and access. We first consider the scenario when
the number of secondary users is known to the policy, and prove that the total
regret is logarithmic in the number of transmission slots. Our distributed
learning and access policy achieves order-optimal regret, as established by
comparison with an asymptotic lower bound on the regret of any uniformly good
learning and access policy. We then consider the case when the number of secondary users is fixed
but unknown, and is estimated through feedback. We propose a policy for this
scenario whose asymptotic sum regret grows slightly faster than logarithmically
in the number of transmission slots.
Comment: Submitted to IEEE JSAC on Advances in Cognitive Radio Networking and Communications, Dec. 2009, Revised May 201
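
The abstract leaves the policy unspecified beyond its guarantees. As a rough illustration of the ingredients it names (learning channel availability from sensing and resolving contention without explicit coordination), here is a minimal, hypothetical sketch in which each secondary user runs a UCB-style index over channels and re-draws a random rank among the top channels whenever it collides. The class name, the index constant, and the rank-reset rule are all assumptions of this sketch, not the paper's policy.

```python
import math
import random

class DistributedUser:
    """Illustrative secondary user: UCB-style index plus a random rank.

    A hedged sketch, not the paper's policy: availability statistics are
    learned from sensing outcomes, and a collision triggers a fresh
    random rank among the estimated top channels. Assumes the number of
    users is known and no larger than the number of channels.
    """

    def __init__(self, num_channels, num_users):
        self.C, self.U = num_channels, num_users
        self.counts = [0] * num_channels    # sensing observations per channel
        self.means = [0.0] * num_channels   # empirical availability
        self.rank = random.randrange(num_users)

    def index(self, c, t):
        if self.counts[c] == 0:
            return float("inf")             # force initial sensing of channel c
        return self.means[c] + math.sqrt(2.0 * math.log(t) / self.counts[c])

    def choose(self, t):
        # Rank channels by index and target the one matching our current rank.
        order = sorted(range(self.C), key=lambda c: -self.index(c, t))
        return order[self.rank]

    def update(self, c, available, collided):
        self.counts[c] += 1
        self.means[c] += (available - self.means[c]) / self.counts[c]
        if collided:                        # another user picked channel c too
            self.rank = random.randrange(self.U)
```

In each slot, every user would call choose(t), sense the selected channel, transmit if it is free, and feed the availability bit and collision feedback back through update.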
Extended UCB Policy for Multi-Armed Bandit with Light-Tailed Reward Distributions
We consider the multi-armed bandit problems in which a player aims to accrue
reward by sequentially playing a given set of arms with unknown reward
statistics. In the classic work, policies were proposed to achieve the optimal
logarithmic regret order for some special classes of light-tailed reward
distributions, e.g., Auer et al.'s UCB1 index policy for reward distributions
with finite support. In this paper, we extend Auer et al.'s UCB1 index policy
to achieve the optimal logarithmic regret order for all light-tailed (or
equivalently, locally sub-Gaussian) reward distributions defined by the (local)
existence of the moment-generating function.
Comment: 9 pages, 1 figure
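
For concreteness, the UCB1 index that the paper extends selects the arm maximizing the empirical mean plus a confidence term. A minimal sketch follows, with the exploration scale b exposed as a parameter, since the extension replaces the constant in the confidence term with one tuned to the tail of the (light-tailed) reward distribution. The exact extended index is in the paper; b here is an assumption.

```python
import math

def ucb1_play(means, counts, t, b=2.0):
    """Return the arm maximizing a UCB1-style index.

    means[i]  - empirical mean reward of arm i
    counts[i] - number of plays of arm i
    t         - current time step (1-indexed)
    b         - exploration scale; UCB1 uses b = 2 for rewards with
                bounded support. The paper's extension tunes this term
                to the reward tails; the value here is an assumption.
    """
    best, best_idx = None, -float("inf")
    for i, (m, n) in enumerate(zip(means, counts)):
        idx = float("inf") if n == 0 else m + math.sqrt(b * math.log(t) / n)
        if idx > best_idx:
            best, best_idx = i, idx
    return best
```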
Deterministic Sequencing of Exploration and Exploitation for Multi-Armed Bandit Problems
In the Multi-Armed Bandit (MAB) problem, there is a given set of arms with
unknown reward models. At each time, a player selects one arm to play, aiming
to maximize the total expected reward over a horizon of length T. An approach
based on a Deterministic Sequencing of Exploration and Exploitation (DSEE) is
developed for constructing sequential arm selection policies. It is shown that
for all light-tailed reward distributions, DSEE achieves the optimal
logarithmic order of the regret, where regret is defined as the total expected
reward loss against the ideal case with known reward models. For heavy-tailed
reward distributions, DSEE achieves O(T^{1/p}) regret when the moments of the
reward distributions exist up to the p-th order for 1 < p <= 2, and
O(T^{1/(1+p/2)}) regret for p > 2. With knowledge of an upper bound on a
finite moment of the heavy-tailed reward distributions, DSEE offers the
optimal logarithmic regret order. The proposed DSEE approach complements existing work on MAB by providing
corresponding results for general reward distributions. Furthermore, with a
clearly defined tunable parameter (the cardinality of the exploration sequence),
the DSEE approach is easily extendable to variations of MAB, including MAB with
various objectives, decentralized MAB with multiple players and incomplete
reward observations under collisions, MAB with unknown Markov dynamics, and
combinatorial MAB with dependent arms that often arise in network optimization
problems such as the shortest path, the minimum spanning tree, and the
dominating set problems under unknown random weights.
Comment: 22 pages, 2 figures
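
Since DSEE is defined by its deterministic split of the time axis, a toy sketch may help: explore the arms round-robin whenever the number of exploration slots used so far falls below a logarithmically growing budget, and otherwise play the empirically best arm. The budget form w * ln(t) and the constant w are assumptions of this sketch; the paper specifies how the exploration sequence's cardinality is actually chosen.

```python
import math
import random

def dsee(arms, horizon, w=10.0):
    """Toy DSEE loop: a deterministic exploration/exploitation split.

    arms : list of zero-argument callables returning a random reward.
    w    : scales the exploration budget w * ln(t); the form and the
           constant are assumptions of this sketch, not the paper's tuning.
    """
    K = len(arms)
    counts, means = [0] * K, [0.0] * K
    explored, total = 0, 0.0
    for t in range(1, horizon + 1):
        if explored < w * math.log(t + 1):  # exploration slot
            a = explored % K                # round-robin over the arms
            explored += 1
        else:                               # exploitation slot
            a = max(range(K), key=lambda i: means[i])
        r = arms[a]()
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]
        total += r
    return total, means

# Example: two Bernoulli arms with means 0.3 and 0.7 over 10,000 slots.
# total, means = dsee([lambda: random.random() < 0.3,
#                      lambda: random.random() < 0.7], 10_000)
```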
Decentralized Cooperative Stochastic Bandits
We study a decentralized cooperative stochastic multi-armed bandit problem
with arms on a network of agents. In our model, the reward distribution
of each arm is the same for each agent and rewards are drawn independently
across agents and time steps. In each round, each agent chooses an arm to play
and subsequently sends a message to her neighbors. The goal is to minimize the
overall regret of the entire network. We design a fully decentralized algorithm
that uses an accelerated consensus procedure to compute (delayed) estimates of
the average of rewards obtained by all the agents for each arm, and then uses
an upper confidence bound (UCB) algorithm that accounts for the delay and error
of the estimates. We analyze the regret of our algorithm and also provide a
lower bound. The regret is bounded by the optimal centralized regret plus a
natural and simple term depending on the spectral gap of the communication
matrix. Our algorithm is simpler to analyze than those proposed in prior work
and it achieves better regret bounds, while requiring less information about
the underlying network. It also performs better empirically.
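
A hedged sketch of the two ingredients named above (a consensus step that mixes per-agent estimates through a communication matrix, followed by a UCB choice on the mixed estimates) is given below. The paper uses an accelerated consensus procedure and a confidence width that accounts for delay and estimation error; here a plain mixing step and a single slack constant stand in for both, so treat delay_slack and the update form as assumptions.

```python
import numpy as np

def consensus_ucb_round(P, est, counts, t, delay_slack=1.0):
    """One illustrative round, vectorized over all N agents.

    P      : (N, N) doubly stochastic communication (gossip) matrix
    est    : (N, K) per-agent estimates of each arm's mean reward
    counts : (N, K) per-agent effective observation counts
    delay_slack stands in for the delay- and spectral-gap-dependent
    correction in the paper's confidence width; it is an assumption.
    """
    est = P @ est                  # consensus step: mix neighbors' estimates
    counts = P @ counts            # mix effective counts the same way
    width = np.sqrt(2.0 * np.log(t) / np.maximum(counts, 1e-9))
    choices = np.argmax(est + delay_slack * width, axis=1)
    return est, counts, choices
```

After playing its chosen arm, each agent would fold the fresh reward into its own row of est and counts before the next mixing round.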
Combinatorial Network Optimization with Unknown Variables: Multi-Armed Bandits with Linear Rewards
In the classic multi-armed bandits problem, the goal is to have a policy for
dynamically operating arms that each yield stochastic rewards with unknown
means. The key metric of interest is regret, defined as the gap between the
expected total reward accumulated by an omniscient player that knows the reward
means for each arm, and the expected total reward accumulated by the given
policy. The policies presented in prior work have storage, computation and
regret all growing linearly with the number of arms, which is not scalable when
the number of arms is large. We consider in this work a broad class of
multi-armed bandits with dependent arms that yield rewards as a linear
combination of a set of unknown parameters. For this general framework, we
present efficient policies that are shown to achieve regret that grows
logarithmically with time, and polynomially in the number of unknown parameters
(even though the number of dependent arms may grow exponentially). Furthermore,
these policies only require storage that grows linearly in the number of
unknown parameters. We show that this generalization is broadly applicable and
useful for many interesting tasks in networks that can be formulated as
tractable combinatorial optimization problems with linear objective functions,
such as maximum weight matching, shortest path, and minimum spanning tree
computations.
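
The storage-efficiency argument above (maintain statistics per unknown parameter rather than per arm) can be illustrated on the shortest-path instance: keep a mean and a count per edge, form an optimistic per-edge weight, and hand the weights to an ordinary solver. The confidence constant L, the treatment of unobserved edges, and the function name are assumptions of this sketch, which stands in for, rather than reproduces, the policies in the paper.

```python
import math
import heapq

def optimistic_shortest_path(graph, means, counts, t, source, target, L=2.0):
    """Hedged sketch of per-edge optimism for a shortest-path bandit.

    graph  : dict node -> list of (neighbor, edge_id)
    means  : dict edge_id -> empirical mean edge cost
    counts : dict edge_id -> number of observations of that edge
    Since rewards here are costs, optimism means a *lower* confidence
    bound per edge, then Dijkstra on the optimistic weights. Assumes
    target is reachable from source.
    """
    def lcb(e):
        n = counts.get(e, 0)
        if n == 0:
            return 0.0                      # unobserved edges look free
        return max(0.0, means[e] - math.sqrt(L * math.log(t) / n))

    dist, prev = {source: 0.0}, {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue                        # stale heap entry
        for v, e in graph.get(u, []):
            nd = d + lcb(e)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, (u, e)
                heapq.heappush(heap, (nd, v))

    path, node = [], target                 # recover edge ids along the path
    while node != source:
        u, e = prev[node]
        path.append(e)
        node = u
    return list(reversed(path))
```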
Spectrum Bandit Optimization
We consider the problem of allocating radio channels to links in a wireless
network. Links interact through interference, modelled as a conflict graph
(i.e., two interfering links cannot be simultaneously active on the same
channel). We aim at identifying the channel allocation maximizing the total
network throughput over a finite time horizon. Should we know the average radio
conditions on each channel and on each link, an optimal allocation would be
obtained by solving an Integer Linear Program (ILP). When radio conditions are
unknown a priori, we look for a sequential channel allocation policy that
converges to the optimal allocation while minimizing on the way the throughput
loss or {\it regret} due to the need for exploring sub-optimal allocations. We
formulate this problem as a generic linear bandit problem, and analyze it first
in a stochastic setting where radio conditions are driven by a stationary
stochastic process, and then in an adversarial setting where radio conditions
can evolve arbitrarily. We provide new algorithms in both settings and derive
upper bounds on their regrets.
Comment: 21 pages
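
In the stochastic setting sketched above, one illustrative policy (not the paper's) is to attach a UCB index to each (link, channel) pair and pick the feasible allocation, under the conflict-graph constraint, with the largest total index. The brute-force search below stands in for the ILP solve and is only viable for tiny instances; the function name and the index constant are assumptions.

```python
import math
from itertools import product

def spectrum_ucb_allocation(num_links, num_channels, conflicts,
                            means, counts, t):
    """Hedged sketch: brute-force the feasible channel allocation
    maximizing the sum of per-(link, channel) UCB indices.

    conflicts     : set of frozenset({i, j}) pairs of interfering links
    means, counts : dicts keyed by (link, channel)
    Brute force replaces the ILP the paper solves; it is only an
    assumption of this sketch, not the paper's algorithm.
    """
    def index(l, c):
        n = counts.get((l, c), 0)
        if n == 0:
            return float("inf")             # force initial exploration
        return means[(l, c)] + math.sqrt(2.0 * math.log(t) / n)

    def feasible(alloc):
        # Two interfering links may not share the same channel.
        return all(alloc[i] != alloc[j]
                   for i, j in (tuple(p) for p in conflicts))

    best, best_val = None, -float("inf")
    for alloc in product(range(num_channels), repeat=num_links):
        if not feasible(alloc):
            continue
        val = sum(index(l, c) for l, c in enumerate(alloc))
        if val > best_val:
            best, best_val = alloc, val
    return best
```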