Factored Bandits
We introduce the factored bandits model, which is a framework for learning
with limited (bandit) feedback, where actions can be decomposed into a
Cartesian product of atomic actions. Factored bandits incorporate rank-1
bandits as a special case, but significantly relax the assumptions on the form
of the reward function. We provide an anytime algorithm for stochastic factored
bandits, together with upper and lower regret bounds that match up to constant
factors. Furthermore, we show that, with a slight modification, the proposed
algorithm can be applied to utility-based dueling bandits. We obtain an
improvement in the additive terms of the regret bound compared to
state-of-the-art algorithms (the additive terms dominate up to time horizons
that are exponential in the number of arms).
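The factored action structure described above can be illustrated with a short sketch. This is not the paper's algorithm: the factor sets, the toy reward model, and the `pull` helper are all hypothetical, chosen only to show what "a Cartesian product of atomic actions with bandit feedback" means.

```python
# Illustrative sketch (not from the paper): a factored action space where
# each composite action is a tuple of atomic actions, one per factor.
import itertools
import random

# Hypothetical factors: each factor contributes its own set of atomic actions.
factors = [["a0", "a1"], ["b0", "b1", "b2"], ["c0", "c1"]]

# The composite action set is the Cartesian product of the factors.
actions = list(itertools.product(*factors))
assert len(actions) == 2 * 3 * 2  # 12 composite actions

def pull(action, rng=random.Random(0)):
    """Bandit feedback: one noisy scalar reward for the whole tuple,
    with no per-factor reward observed."""
    # Toy reward model for illustration only; the paper assumes a much
    # weaker structure than, e.g., rank-1 (product-form) rewards.
    base = sum(hash(a) % 5 for a in action) / (5 * len(action))
    return base + rng.gauss(0, 0.1)

reward = pull(actions[0])
```

The key point the sketch makes concrete is that the learner observes only the aggregate reward of the chosen tuple, never the contribution of each atomic action.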
One Arrow, Two Kills: A Unified Framework for Achieving Optimal Regret Guarantees in Sleeping Bandits
We address the problem of \emph{`Internal Regret'} in \emph{Sleeping Bandits}
in the fully adversarial setup, as well as draw connections between different
existing notions of sleeping regrets in the multiarmed bandits (MAB) literature
and consequently analyze the implications: Our first contribution is to propose
the new notion of \emph{Internal Regret} for sleeping MAB. We then propose an
algorithm that yields sublinear regret in that measure, even for a completely
adversarial sequence of losses and availabilities. We further show that low
sleeping internal regret always implies low external regret, as well as low
policy regret for i.i.d. sequences of losses. The main contribution of this
work lies precisely in unifying the different existing notions of regret in
sleeping bandits and understanding their implications for one another.
Finally, we also extend our results to the setting of \emph{Dueling Bandits}
(DB)--a preference-feedback variant of MAB--and propose a reduction-to-MAB
approach to design a low-regret algorithm for sleeping dueling bandits with
stochastic preferences and adversarial availabilities. The efficacy of our
algorithms is justified through empirical evaluations.
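A single round of the sleeping setup this abstract studies can be sketched as follows. This is a minimal toy, not the paper's algorithm: the exponential-weights scores, the learning rate, and the update rule are illustrative assumptions; only the round structure (an adversary wakes a subset of arms, the learner must play an awake arm, and only that arm's loss is observed) comes from the text.

```python
# Minimal sketch of one sleeping-bandit round under adversarial
# availabilities. The weighting scheme is a toy stand-in.
import random

K = 4
rng = random.Random(1)

def sleeping_round(weights, available, losses):
    """Play one round restricted to the awake arms.

    weights:   per-arm scores over all K arms (toy exponential weights).
    available: set of arm indices the adversary left awake this round.
    losses:    adversarial loss in [0, 1) for every arm this round.
    """
    # Renormalize the scores over the awake arms only.
    total = sum(weights[i] for i in available)
    probs = {i: weights[i] / total for i in available}
    # Sample an awake arm.
    arm = rng.choices(list(probs), weights=list(probs.values()))[0]
    # Bandit feedback: only the played arm's loss is observed.
    observed = losses[arm]
    # Toy multiplicative update on the played arm (hypothetical rate 0.1).
    weights[arm] *= (1 - 0.1) ** observed
    return arm, observed

weights = [1.0] * K
available = {0, 2, 3}            # arm 1 is asleep this round
losses = [rng.random() for _ in range(K)]
arm, loss = sleeping_round(weights, available, losses)
assert arm in available
```

Internal, external, and policy regret as defined in the abstract differ only in which comparator is measured against the sequence of such rounds; the round mechanics stay the same.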
Dueling Bandits with Adversarial Sleeping
We introduce the problem of sleeping dueling bandits with stochastic
preferences and adversarial availabilities (DB-SPAA). In almost all dueling
bandit applications, the decision space changes over time; e.g., retail
store management, online shopping, restaurant recommendation, search engine
optimization, etc. Surprisingly, this `sleeping aspect' of dueling bandits has
never been studied in the literature. Like dueling bandits, the goal is to
compete with the best arm by sequentially querying the preference feedback of
item pairs. The non-triviality, however, arises from non-stationary item
spaces that allow an arbitrary subset of items to become unavailable each round.
The goal is to find an optimal `no-regret' policy that can identify the best
available item at each round, as opposed to the standard `fixed best-arm regret
objective' of dueling bandits. We first derive an instance-specific lower bound
for DB-SPAA in terms of the number of items and the pairwise preference gaps
between them. This indicates that the sleeping problem with preference
feedback is inherently more difficult than its classical multi-armed bandit
(MAB) counterpart. We then propose two algorithms with near-optimal regret
guarantees. Our results are corroborated empirically.
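The preference-feedback loop with adversarial availability can be made concrete with a short sketch. The preference matrix `P` and the `duel` helper are hypothetical illustrations, not taken from the paper; only the interaction protocol (query a pair of currently available items, observe which one won) is from the abstract.

```python
# Illustrative sketch of one round of sleeping dueling bandits: the
# learner queries a pair of available items and observes only the winner.
import random

rng = random.Random(2)
# Hypothetical stochastic preferences: P[i][j] is the probability that
# item i beats item j, with P[i][j] + P[j][i] = 1.
P = [[0.5, 0.6, 0.7],
     [0.4, 0.5, 0.6],
     [0.3, 0.4, 0.5]]

def duel(i, j):
    """Query the pair (i, j); return the index of the winning item."""
    return i if rng.random() < P[i][j] else j

# Adversarial availability: only a subset of items may be queried this round.
available = [0, 2]
i, j = rng.sample(available, 2)
winner = duel(i, j)
assert winner in (i, j)
```

This highlights why the regret objective must track the best *available* item each round: the globally best item may simply be absent from `available`.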
- …