Corrupt Bandits for Preserving Local Privacy
We study a variant of the stochastic multi-armed bandit (MAB) problem in
which the rewards are corrupted. In this framework, motivated by privacy
preservation in online recommender systems, the goal is to maximize the sum of
the (unobserved) rewards, based on the observation of transformation of these
rewards through a stochastic corruption process with known parameters. We
provide a lower bound on the expected regret of any bandit algorithm in this
corrupted setting. We devise a frequentist algorithm, KLUCB-CF, and a Bayesian
algorithm, TS-CF, and give upper bounds on their regret. We also provide the
appropriate corruption parameters to guarantee a desired level of local privacy
and analyze how this impacts the regret. Finally, we present some experimental
results that confirm our analysis.
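A corruption process with known parameters can, for binary rewards, be instantiated by randomized response, which is a standard way to obtain local privacy. The sketch below (function names `corrupt` and `debias` are illustrative, not from the paper) shows both the corruption step and how a learner can debias the observed mean using the known flip probability:

```python
import random

def corrupt(reward, p):
    """Randomized response: report the true binary reward with
    probability p, otherwise flip it. Choosing
    p = e^eps / (1 + e^eps) yields eps-local differential privacy."""
    return reward if random.random() < p else 1 - reward

def debias(observed_mean, p):
    """Invert the known corruption: E[observed] = p*mu + (1-p)*(1-mu),
    hence mu = (observed_mean - (1 - p)) / (2p - 1)."""
    return (observed_mean - (1.0 - p)) / (2.0 * p - 1.0)
```

Algorithms such as KLUCB-CF operate on exactly this kind of debiased estimate, which is why the regret depends on the chosen corruption (i.e., privacy) parameters.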
Decentralized Exploration in Multi-Armed Bandits
We consider the decentralized exploration problem: a set of players
collaborate to identify the best arm by asynchronously interacting with the
same stochastic environment. The objective is to ensure privacy in the best arm
identification problem between asynchronous, collaborative, and thrifty
players. In the context of a digital service, we advocate that this
decentralized approach allows a good balance between the interests of users and
those of service providers: the providers optimize their services, while
protecting the privacy of the users and saving resources. We define the privacy
level as the amount of information an adversary could infer by intercepting the
messages concerning a single user. We provide a generic algorithm Decentralized
Elimination, which uses any best arm identification algorithm as a subroutine.
We prove that this algorithm ensures privacy, with a low communication cost,
and that in comparison to the lower bound of the best arm identification
problem, its sample complexity suffers from a penalty depending on the inverse
of the probability of the most frequent players. Then, thanks to the genericity
of the approach, we extend the proposed algorithm to non-stationary
bandits. Finally, experiments illustrate and complete the analysis.
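As a concrete example of a best-arm-identification subroutine that a generic wrapper like Decentralized Elimination could use, here is a minimal successive-elimination sketch (a textbook construction with a standard confidence radius, not the paper's code; `pull(a)` stands for one interaction with the environment):

```python
import math
import random

def successive_elimination(pull, n_arms, horizon, delta=0.05):
    """Minimal best-arm identification by successive elimination.
    `pull(a)` returns a stochastic reward in [0, 1] for arm a."""
    active = set(range(n_arms))
    sums = [0.0] * n_arms
    counts = [0] * n_arms
    t = 0
    while len(active) > 1 and t < horizon:
        # Pull every surviving arm once per round.
        for a in list(active):
            sums[a] += pull(a)
            counts[a] += 1
            t += 1
        # Eliminate arms whose upper confidence bound falls below
        # the best lower confidence bound.
        rad = {a: math.sqrt(math.log(4 * n_arms * counts[a] ** 2 / delta)
                            / (2 * counts[a])) for a in active}
        best_lcb = max(sums[a] / counts[a] - rad[a] for a in active)
        active = {a for a in active
                  if sums[a] / counts[a] + rad[a] >= best_lcb}
    return max(active, key=lambda a: sums[a] / counts[a])
```

In the decentralized setting, each player would run such a subroutine on its own samples and only broadcast eliminated arms, which is what keeps both the communication cost and the leaked information low.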
Differentially Private Episodic Reinforcement Learning with Heavy-tailed Rewards
In this paper, we study the problem of (finite horizon tabular) Markov
decision processes (MDPs) with heavy-tailed rewards under the constraint of
differential privacy (DP). Compared with the previous studies for private
reinforcement learning that typically assume rewards are sampled from some
bounded or sub-Gaussian distributions to ensure DP, we consider the setting
where reward distributions are only assumed to have finite low-order moments.
By resorting to robust mean estimators for rewards, we first propose
two frameworks for heavy-tailed MDPs, i.e., one is for value iteration and
another is for policy optimization. Under each framework, we consider both
joint differential privacy (JDP) and local differential privacy (LDP) models.
Based on our frameworks, we provide regret upper bounds for both JDP and LDP
cases and show that both the moment condition of the reward distribution and
the privacy budget have a significant impact on the regret. Finally, we
establish a lower bound for regret minimization for heavy-tailed MDPs in the
JDP model by reducing it to the instance-independent lower bound for
heavy-tailed multi-armed bandits in the DP model. We also show a lower bound
for the problem in the LDP model by adopting private minimax methods. Our
results reveal that there are fundamental
differences between private RL with sub-Gaussian rewards and private RL with
heavy-tailed rewards.
Comment: ICML 2023
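The core idea of combining a robust mean estimator with a privacy mechanism can be sketched as follows, using truncation to control heavy tails and Laplace noise calibrated to the clipped sensitivity (a generic construction; the paper's exact estimators and calibration may differ):

```python
import math
import random

def private_truncated_mean(samples, tau, eps):
    """Truncate each sample to [-tau, tau] so heavy tails cannot blow
    up the sensitivity, then add Laplace noise of scale 2*tau/(n*eps)
    for eps-differential privacy. tau and eps are tuning parameters."""
    n = len(samples)
    clipped = [max(-tau, min(tau, x)) for x in samples]
    mean = sum(clipped) / n
    # Sample Laplace(b) noise via inverse-CDF sampling.
    b = 2.0 * tau / (n * eps)
    u = random.random() - 0.5
    noise = -b * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)
    return mean + noise
```

Truncation introduces a bias that grows as tau shrinks, while the Laplace noise grows as tau grows, and balancing the two against the available moment condition is precisely where the moment order and the privacy budget enter the regret.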
Differentially Private Federated Combinatorial Bandits with Constraints
There is a rapid rise of cooperative learning paradigms in online learning
settings, i.e., federated learning (FL). Unlike most FL settings, however, in
many situations the agents are competitive: each agent would like to learn
from the others, but the information it shares for the others to learn from
could be sensitive, and it therefore wants that information kept private. This work
investigates a group of agents working concurrently to solve similar
combinatorial bandit problems while maintaining quality constraints. Can these
agents collectively learn while keeping their sensitive information
confidential by employing differential privacy? We observe that communicating
can reduce the regret. However, differential privacy techniques for protecting
sensitive information make the shared data noisy, and may hurt rather than
help the regret. Hence, it is essential to decide when to communicate and
which shared data to learn from, in order to strike a workable balance
between regret and privacy. For such a federated combinatorial MAB setting, we
propose a Privacy-preserving Federated Combinatorial Bandit algorithm, P-FCB.
We illustrate the efficacy of P-FCB through simulations. We further show that
our algorithm provides an improvement in regret while upholding the quality
threshold and meaningful privacy guarantees.
Comment: 12 pages, 4 figures. A version of this paper has appeared in the
Proceedings of ECML PKDD '2
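The privacy side of such a protocol typically perturbs an agent's per-arm statistics before they are broadcast to other agents. A hedged sketch using the Gaussian mechanism (function name and calibration are illustrative, not P-FCB's exact protocol):

```python
import math
import random

def privatize_shared_stats(sum_rewards, n_pulls, eps, delta=1e-5, clip=1.0):
    """Before an agent broadcasts its per-arm reward sums and pull
    counts, add Gaussian noise calibrated for (eps, delta)-DP
    (Gaussian mechanism with per-reward sensitivity `clip`)."""
    sigma = clip * math.sqrt(2.0 * math.log(1.25 / delta)) / eps
    noisy_sums = [s + random.gauss(0.0, sigma) for s in sum_rewards]
    noisy_counts = [n + random.gauss(0.0, sigma) for n in n_pulls]
    return noisy_sums, noisy_counts
```

Because every broadcast consumes privacy budget and injects noise, communicating less often (and only when the shared statistics are informative enough) is exactly the when-to-communicate trade-off the abstract describes.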
Federated Linear Contextual Bandits with User-level Differential Privacy
This paper studies federated linear contextual bandits under the notion of
user-level differential privacy (DP). We first introduce a unified federated
bandits framework that can accommodate various definitions of DP in the
sequential decision-making setting. We then formally introduce user-level
central DP (CDP) and local DP (LDP) in the federated bandits framework, and
investigate the fundamental trade-offs between the learning regrets and the
corresponding DP guarantees in a federated linear contextual bandits model. For
CDP, we propose a federated algorithm termed ROBIN and show that it is
near-optimal in terms of the number of clients and the privacy budget,
by deriving nearly-matching upper and lower regret bounds when
user-level DP is satisfied. For LDP, we obtain several lower bounds, indicating
that learning under user-level LDP must suffer a regret blow-up factor whose
form depends on the privacy parameters, under different conditions.
Comment: Accepted by ICML 202
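In linear contextual bandits, privacy is usually enforced on the sufficient statistics (the Gram matrix and the reward-weighted feature vector) before they leave a client. A minimal one-step sketch with plain Gaussian perturbation (the literature typically uses tree-based aggregation for tighter noise composition; all names here are illustrative):

```python
import numpy as np

def private_gram_update(V, b, x, r, sigma):
    """One private update of linear-bandit sufficient statistics:
    incorporate observation (x, r), then add symmetric Gaussian noise
    to the Gram matrix V and vector b before they are shared."""
    d = len(x)
    V = V + np.outer(x, x)
    b = b + r * x
    noise = np.random.normal(0.0, sigma, size=(d, d))
    V_noisy = V + (noise + noise.T) / 2.0  # keep the matrix symmetric
    b_noisy = b + np.random.normal(0.0, sigma, size=d)
    return V_noisy, b_noisy
```

The server then estimates the parameter from the aggregated noisy statistics, e.g. via `np.linalg.solve(V_noisy + reg * np.eye(d), b_noisy)`; the extra regret caused by the injected noise is what the upper and lower bounds in such papers quantify.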