Contextual Bandits with Cross-learning
In the classical contextual bandits problem, in each round $t$, a learner
observes some context $c_t$, chooses some action $a_t$ to perform, and receives
some reward $r_t(a_t, c_t)$. We consider the variant of this problem where in
addition to receiving the reward $r_t(a_t, c_t)$, the learner also learns the
values of $r_t(a_t, c')$ for all other contexts $c'$; i.e., the rewards that
would have been achieved by performing that action under different contexts.
This variant arises in several strategic settings, such as learning how to bid
in non-truthful repeated auctions (in this setting the context is the decision
maker's private valuation for each auction). We call this problem the
contextual bandits problem with cross-learning. The best algorithms for the
classical contextual bandits problem achieve $\tilde{O}(\sqrt{CKT})$ regret
against all stationary policies, where $C$ is the number of contexts, $K$ the
number of actions, and $T$ the number of rounds. We demonstrate algorithms for
the contextual bandits problem with cross-learning that remove the dependence
on $C$ and achieve regret $\tilde{O}(\sqrt{KT})$ (when contexts are stochastic with
known distribution), $\tilde{O}(K^{1/3}T^{2/3})$ (when contexts are stochastic
with unknown distribution), and $\tilde{O}(\sqrt{KT})$ (when contexts are
adversarial but rewards are stochastic).

Comment: 48 pages, 5 figures
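The cross-learning idea can be illustrated in a few lines: each pull of an action yields a reward sample under every context, so per-(context, action) estimates accumulate without splitting samples across contexts. Below is a minimal toy simulation, not the paper's algorithm: a plain UCB1 rule where each pull also updates the cross-learned estimates, with an illustrative reward model and uniform context distribution.

```python
import math
import random

def cross_learning_ucb(means, T, seed=0):
    """Toy UCB1 simulation with cross-learning: playing action a in the
    realized context also reveals a's reward under every other context,
    so every (context, action) pair gains a sample whenever a is played.
    `means[c][a]` and the noise model are illustrative assumptions."""
    rng = random.Random(seed)
    C, K = len(means), len(means[0])
    counts = [[0] * K for _ in range(C)]   # samples per (context, action)
    sums = [[0.0] * K for _ in range(C)]
    total = 0.0
    for t in range(1, T + 1):
        c = rng.randrange(C)  # context drawn from a known uniform distribution

        def ucb(a):
            if counts[c][a] == 0:
                return float("inf")
            mean = sums[c][a] / counts[c][a]
            return mean + math.sqrt(2.0 * math.log(t) / counts[c][a])

        a = max(range(K), key=ucb)
        total += means[c][a] + rng.uniform(-0.1, 0.1)
        # cross-learning: update the estimate of action a under *all* contexts
        for cc in range(C):
            counts[cc][a] += 1
            sums[cc][a] += means[cc][a] + rng.uniform(-0.1, 0.1)
    return total / T
```

Because every pull informs all $C$ contexts at once, sample counts do not split across contexts, which is the intuition behind removing the dependence on $C$.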
Decentralized Exploration in Multi-Armed Bandits
We consider the decentralized exploration problem: a set of players
collaborate to identify the best arm by asynchronously interacting with the
same stochastic environment. The objective is to ensure privacy in the best-arm
identification problem among asynchronous, collaborative, and thrifty
players. In the context of a digital service, we advocate that this
decentralized approach allows a good balance between the interests of users and
those of service providers: the providers optimize their services, while
protecting the privacy of the users and saving resources. We define the privacy
level as the amount of information an adversary could infer by intercepting the
messages concerning a single user. We provide a generic algorithm Decentralized
Elimination, which uses any best arm identification algorithm as a subroutine.
We prove that this algorithm ensures privacy, with a low communication cost,
and that in comparison to the lower bound of the best arm identification
problem, its sample complexity suffers from a penalty depending on the inverse
of the probability of the most frequent players. Then, thanks to the generality
of the approach, we extend the proposed algorithm to non-stationary
bandits. Finally, experiments illustrate and complete the analysis.
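The privacy mechanism can be illustrated with a toy version of the protocol: players sample arms independently and broadcast only arm-elimination messages, never raw rewards, so intercepting one player's messages reveals little about their observations. This is a simplified sketch under assumed Bernoulli rewards and a Hoeffding elimination rule, not the paper's exact Decentralized Elimination algorithm.

```python
import math
import random

def decentralized_elimination(means, n_players=3, delta=0.05, seed=0):
    """Toy sketch: each player runs a local successive-elimination test on
    the surviving arms and broadcasts only the ids of arms it eliminates.
    Rewards are Bernoulli(means[a]); all parameters are illustrative."""
    rng = random.Random(seed)
    K = len(means)
    active = set(range(K))
    counts = [[0] * K for _ in range(n_players)]   # local, never shared
    sums = [[0.0] * K for _ in range(n_players)]

    def radius(n):
        # Hoeffding confidence radius with a crude union bound
        return math.sqrt(math.log(4.0 * K * n * n / delta) / (2.0 * n))

    for _ in range(100_000):  # safety cap on rounds
        if len(active) == 1:
            break
        for p in range(n_players):
            for a in list(active):
                counts[p][a] += 1
                sums[p][a] += 1.0 if rng.random() < means[a] else 0.0
            est = {a: sums[p][a] / counts[p][a] for a in active}
            best = max(active, key=est.get)
            # broadcast elimination messages (arm ids only, no reward data)
            for a in list(active):
                if a != best and (est[best] - radius(counts[p][best])
                                  > est[a] + radius(counts[p][a])):
                    active.discard(a)
    # pool only to break ties if the safety cap was hit
    pooled = {a: sum(sums[p][a] for p in range(n_players))
                 / max(1, sum(counts[p][a] for p in range(n_players)))
              for a in active}
    return max(active, key=pooled.get)
```

The design point is that the messages carry arm identifiers, not samples, which is what bounds the information an eavesdropper can infer about any single player.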
A Neural Networks Committee for the Contextual Bandit Problem
This paper presents a new contextual bandit algorithm, NeuralBandit, which
requires no stationarity assumptions on contexts and rewards. Several
neural networks are trained to model the value of rewards given the
context. Two variants, based on a multi-expert approach, are proposed to choose
the parameters of the multilayer perceptrons online. The proposed algorithms are
successfully tested on a large dataset with and without stationarity of
rewards.

Comment: 21st International Conference on Neural Information Processing
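The committee idea can be sketched minimally: one online-trained reward predictor per action, with epsilon-greedy exploration over the predictors' outputs. In this sketch, linear predictors stand in for the paper's multilayer perceptrons, and all hyperparameters are illustrative.

```python
import random

class NeuralBanditSketch:
    """Toy sketch of a per-action committee: each action has its own
    online predictor of reward given the context; actions are chosen
    epsilon-greedily over the predictions. Linear models stand in for
    the multilayer perceptrons of the paper."""

    def __init__(self, n_actions, dim, eps=0.1, lr=0.05, seed=0):
        self.rng = random.Random(seed)
        self.w = [[0.0] * dim for _ in range(n_actions)]
        self.eps, self.lr = eps, lr

    def predict(self, a, x):
        # predicted reward of action a for context vector x
        return sum(wi * xi for wi, xi in zip(self.w[a], x))

    def act(self, x):
        if self.rng.random() < self.eps:           # explore
            return self.rng.randrange(len(self.w))
        return max(range(len(self.w)),              # exploit best predictor
                   key=lambda a: self.predict(a, x))

    def update(self, a, x, reward):
        # one SGD step on the squared prediction error
        err = reward - self.predict(a, x)
        for i, xi in enumerate(x):
            self.w[a][i] += self.lr * err * xi
```

Because each predictor is updated online, the committee tracks drifting rewards without assuming stationarity, which is the property the abstract emphasizes.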