Hellinger KL-UCB based Bandit Algorithms for Markovian and i.i.d. Settings
In the regret-based formulation of multi-armed bandit (MAB) problems, except
in rare instances, much of the literature focuses on arms with i.i.d. rewards.
In this paper, we consider the problem of obtaining regret guarantees for MAB
problems in which the rewards of each arm form a Markov chain which may not
belong to a single parameter exponential family. To achieve logarithmic regret
in such problems is not difficult: a variation of standard KL-UCB does the job.
However, the constants obtained from such an analysis are poor for the
following reason: i.i.d. rewards are a special case of Markov rewards and it is
difficult to design an algorithm that works well independent of whether the
underlying model is truly Markovian or i.i.d. To overcome this issue, we
introduce a novel algorithm that identifies whether the rewards from each arm
are truly Markovian or i.i.d. using a Hellinger distance-based test. Our
algorithm then switches from using a standard KL-UCB to a specialized version
of KL-UCB when it determines that the arm reward is Markovian, thus resulting
in low regret for both i.i.d. and Markovian settings.
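As a rough illustration of the two ingredients named in this abstract, the sketch below computes a standard KL-UCB index for Bernoulli rewards and the Hellinger distance between two discrete distributions. It is not the paper's algorithm: the Bernoulli reward model, the function names, the exploration constant `c`, and the bisection depth are all assumptions made for the example, and the paper's actual Markovian index and test statistic are not given in the abstract.

```python
import math

def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, c=0.0, iters=50):
    """Largest q in [mean, 1] with pulls * KL(mean, q) <= log t + c * log log t.

    Found by bisection, since KL(mean, q) is increasing in q for q >= mean.
    """
    if pulls == 0:
        return 1.0  # unexplored arm: maximally optimistic
    budget = math.log(t)
    if c > 0 and t >= 3:
        budget += c * math.log(math.log(t))
    budget /= pulls
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same support."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))
```

One way such a distance could be used, purely as an assumption here, is to compare the empirical next-reward distributions conditioned on different previous rewards: under i.i.d. rewards those conditional distributions should be close, while under a genuinely Markovian arm they can differ.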
Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems
Multi-armed bandit problems are the most basic examples of sequential
decision problems with an exploration-exploitation trade-off. This is the
balance between staying with the option that gave highest payoffs in the past
and exploring new options that might give higher payoffs in the future.
Although the study of bandit problems dates back to the Thirties,
exploration-exploitation trade-offs arise in several modern applications, such
as ad placement, website optimization, and packet routing. Mathematically, a
multi-armed bandit is defined by the payoff process associated with each
option. In this survey, we focus on two extreme cases in which the analysis of
regret is particularly simple and elegant: i.i.d. payoffs and adversarial
payoffs. Besides the basic setting of finitely many actions, we also analyze
some of the most important variants and extensions, such as the contextual
bandit model. Comment: To appear in Foundations and Trends in Machine Learning
Distributed Algorithms for Learning and Cognitive Medium Access with Logarithmic Regret
The problem of distributed learning and channel access is considered in a
cognitive network with multiple secondary users. The availability statistics of
the channels are initially unknown to the secondary users and are estimated
using sensing decisions. There is no explicit information exchange or prior
agreement among the secondary users. We propose policies for distributed
learning and access which achieve order-optimal cognitive system throughput
(number of successful secondary transmissions) under self play, i.e., when
implemented at all the secondary users. Equivalently, our policies minimize the
regret in distributed learning and access. We first consider the scenario when
the number of secondary users is known to the policy, and prove that the total
regret is logarithmic in the number of transmission slots. Our distributed
learning and access policy achieves order-optimal regret by comparing to an
asymptotic lower bound for regret under any uniformly-good learning and access
policy. We then consider the case when the number of secondary users is fixed
but unknown, and is estimated through feedback. We propose a policy in this
scenario whose asymptotic sum regret grows slightly faster than
logarithmic in the number of transmission slots. Comment: Submitted to IEEE JSAC on Advances in Cognitive Radio Networking and
Communications, Dec. 2009, Revised May 201
Distributed Channel Access for Control Over Unknown Memoryless Communication Channels
We consider the distributed channel access problem for a system consisting of
multiple control subsystems that close their loop over a shared wireless
network. We propose a distributed method for providing deterministic channel
access without requiring explicit information exchange between the subsystems.
This is achieved by utilizing timers for prioritizing channel access with
respect to a local cost which we derive by transforming the control objective
cost to a form that allows its local computation. This property is then
exploited for developing our distributed deterministic channel access scheme. A
framework to verify the stability of the system under the resulting scheme is
then proposed. Next, we consider a practical scenario in which the channel
statistics are unknown. We propose learning algorithms for learning the
parameters of imperfect communication links for estimating the channel quality
and, hence, define the local cost as a function of this estimation and control
performance. We establish that our learning approach results in collision-free
channel access. The behavior of the overall system is exemplified via a
proof-of-concept illustrative example, and the efficacy of this mechanism is
evaluated for large-scale networks via simulations. Comment: This work has been submitted to the IEEE for possible publication.
Copyright may be transferred without notice, after which this version may no
longer be accessible.