130 research outputs found
Incentivized Exploration for Multi-Armed Bandits under Reward Drift
We study incentivized exploration for the multi-armed bandit (MAB) problem
where the players receive compensation for exploring arms other than the greedy
choice and may provide biased feedback on reward. We seek to understand the
impact of this drifted reward feedback by analyzing the performance of three
instantiations of the incentivized MAB algorithm: UCB, ε-Greedy,
and Thompson Sampling. Our results show that they all achieve O(log T) regret and O(log T) compensation under the drifted reward, and are therefore
effective in incentivizing exploration. Numerical examples are provided to
complement the theoretical analysis.
Comment: 10 pages, 2 figures, AAAI 2020
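As a concrete illustration of the setting, the sketch below shows one hypothetical incentivized instantiation (an ε-Greedy player). It is not the paper's exact algorithm: the compensation rule, the drift model (reported rewards inflated in proportion to the compensation paid), and all parameter values are illustrative assumptions.

```python
import random

def incentivized_eps_greedy(true_means, horizon, eps=0.1, drift=0.1, seed=0):
    """Hypothetical sketch: epsilon-Greedy where the principal pays
    compensation whenever the chosen arm differs from the player's greedy
    choice, and the player's reported reward drifts upward in proportion
    to the compensation paid (an assumed biased-feedback model)."""
    rng = random.Random(seed)
    n = len(true_means)
    counts = [0] * n
    est = [0.0] * n          # estimates built from *drifted* feedback
    regret = compensation = 0.0
    best = max(true_means)
    for _ in range(horizon):
        greedy = max(range(n), key=lambda a: est[a])
        arm = rng.randrange(n) if rng.random() < eps else greedy
        reward = 1.0 if rng.random() < true_means[arm] else 0.0  # Bernoulli
        pay = max(0.0, est[greedy] - est[arm]) if arm != greedy else 0.0
        compensation += pay
        reported = reward + drift * pay      # biased feedback on reward
        counts[arm] += 1
        est[arm] += (reported - est[arm]) / counts[arm]
        regret += best - true_means[arm]
    return regret, compensation
```

On a small instance, both the cumulative regret and the total compensation stay far below the horizon, in the spirit of the logarithmic guarantees the abstract describes.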
Secure-UCB: Saving Stochastic Bandits from Poisoning Attacks via Limited Data Verification
This paper studies bandit algorithms under data poisoning attacks in a
bounded reward setting. We consider a strong attacker model in which the
attacker can observe both the selected actions and their corresponding rewards,
and can contaminate the rewards with additive noise. We show that any
bandit algorithm with regret O(log T) can be forced to suffer a regret Ω(T)
with an expected amount of contamination O(log T). This amount
of contamination is also necessary, as we prove that there exists an
O(log T) regret bandit algorithm, specifically the classical UCB, that requires
Ω(log T) amount of contamination to suffer regret Ω(T). To
combat such poisoning attacks, our second main contribution is to propose a novel
algorithm, Secure-UCB, which uses limited verification to access a
limited number of uncontaminated rewards. We show that with an O(log T)
expected number of verifications, Secure-UCB can restore the order-optimal
regret O(log T) irrespective of the amount of contamination used by
the attacker. Finally, we prove that for any bandit algorithm, this number of
verifications is necessary to recover the order-optimal regret. We
can then conclude that Secure-UCB is order-optimal in terms of both the
expected regret and the expected number of verifications, and can save
stochastic bandits from any data poisoning attack.
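The limited-verification idea can be illustrated with a toy sketch. This is not the paper's Secure-UCB: here the learner simply distrusts every observed reward and spends a logarithmic per-arm budget of verified, uncontaminated samples to build its estimates; the budget size, attack model, and parameters are all illustrative assumptions.

```python
import math
import random

def ucb_with_verification(true_means, horizon, seed=0):
    """Toy illustration of limited verification (NOT the paper's exact
    Secure-UCB): observed rewards may all be contaminated, so estimates are
    built only from a logarithmic budget of verified clean samples per arm,
    while arm selection follows the standard UCB1 index."""
    rng = random.Random(seed)
    n = len(true_means)
    budget = math.ceil(20 * math.log(horizon))  # verified samples per arm
    pulls = [0] * n
    verified = [0] * n
    est = [0.0] * n              # mean of verified rewards only
    regret, nverif = 0.0, 0
    best = max(true_means)
    for t in range(1, horizon + 1):
        if t <= n:
            arm = t - 1          # pull each arm once to initialize
        else:
            arm = max(range(n), key=lambda a:
                      est[a] + math.sqrt(2 * math.log(t) / pulls[a]))
        clean = 1.0 if rng.random() < true_means[arm] else 0.0
        # contaminated observations are discarded entirely; only a verified
        # (uncontaminated) sample, while the budget lasts, updates est
        if verified[arm] < budget:
            verified[arm] += 1
            nverif += 1
            est[arm] += (clean - est[arm]) / verified[arm]
        pulls[arm] += 1
        regret += best - true_means[arm]
    return regret, nverif
```

Even with every unverified observation thrown away, the verified budget of roughly log-many samples per arm is enough for the UCB index to concentrate on the best arm in this toy setup.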
Contextual Search in the Presence of Irrational Agents
We study contextual search, a generalization of binary search in higher
dimensions, which captures settings such as feature-based dynamic pricing.
Standard game-theoretic formulations of this problem assume that agents act in
accordance with a specific behavioral model. In practice, however, some agents
may not subscribe to the dominant behavioral model or may act in ways that are
seemingly arbitrarily irrational. Existing algorithms heavily depend on the
behavioral model being (approximately) accurate for all agents and have poor
performance in the presence of even a few such arbitrarily irrational agents.
We initiate the study of contextual search when some of the agents can behave
in ways inconsistent with the underlying behavioral model. In particular, we
provide two algorithms, one built on robustifying multidimensional binary
search methods and one on translating the setting to a proxy setting
appropriate for gradient descent. Our techniques draw inspiration from learning
theory, game theory, high-dimensional geometry, and convex analysis.
Comment: Compared to the first version titled "Corrupted Multidimensional
Binary Search: Learning in the Presence of Irrational Agents", this version
provides a broader scope of behavioral models of irrationality, specifies how
the results apply to different loss functions, and discusses the power and
limitations of additional algorithmic approaches.
Learning in Non-Cooperative Configurable Markov Decision Processes
The Configurable Markov Decision Process framework includes two entities: a Reinforcement Learning agent and a configurator that can modify some environmental parameters to improve the agent's performance. This presupposes that the two actors have the same reward function. What if the configurator does not have the same intentions as the agent? This paper introduces the Non-Cooperative Configurable Markov Decision Process, a setting that allows the configurator and the agent to have two (possibly different) reward functions. We then consider an online learning problem in which the configurator has to find the best among a finite set of possible configurations. We propose two learning algorithms, which exploit the problem's structure, to minimize the configurator's expected regret, depending on the agent's feedback. While a naive application of the UCB algorithm yields a regret that grows indefinitely over time, we show that our approach suffers only bounded regret. Furthermore, we empirically demonstrate the performance of our algorithms in simulated domains.
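The naive reduction mentioned above, which the paper improves upon, can be sketched: treat each candidate configuration as an arm of a stochastic bandit whose reward is the configurator's return under that configuration, and select configurations with UCB1. The reward model and all parameters are illustrative assumptions, not the paper's algorithm.

```python
import math
import random

def configurator_ucb(config_means, horizon, seed=0):
    """Naive baseline sketch: the configurator runs UCB1 over a finite set
    of configurations, observing only its own (assumed Bernoulli) return.
    This is the reduction whose regret grows logarithmically forever; the
    paper's structure-aware algorithms achieve bounded regret instead."""
    rng = random.Random(seed)
    n = len(config_means)
    pulls, est = [0] * n, [0.0] * n
    regret, best = 0.0, max(config_means)
    for t in range(1, horizon + 1):
        if t <= n:
            c = t - 1            # try each configuration once
        else:
            c = max(range(n), key=lambda i:
                    est[i] + math.sqrt(2 * math.log(t) / pulls[i]))
        r = 1.0 if rng.random() < config_means[c] else 0.0
        pulls[c] += 1
        est[c] += (r - est[c]) / pulls[c]
        regret += best - config_means[c]
    return regret
```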
Computational and cognitive mechanisms of exploration heuristics
Should I leave or stay in academia? Many decisions we make require arbitrating between novelty and the benefits of familiar options. This is called the exploration-exploitation trade-off. Solving this trade-off is not trivial, but approximations (called ‘exploration strategies’) exist. Humans are known to rely on different exploration strategies, varying in performance and computational requirements. More complex strategies perform well but are computationally expensive (e.g., they require computing expected values). Cheaper strategies, i.e., heuristics, require fewer cognitive resources but can lead to sub-optimal performance. The simplest heuristic strategy is to ignore prior knowledge, such as expected values, and to choose entirely randomly. In effect, this is like rolling a die to choose between different options. Such a ‘value-free random’ exploration strategy may not always lead to optimal performance but spares cognitive resources. In this thesis, I investigate the mechanisms of exploration heuristics in human decision making. I developed a cognitive task that dissociates between different strategies for exploration. In my first study, I demonstrate that humans supplement complex strategies with exploration heuristics and, using a pharmacological manipulation, that value-free random exploration is specifically modulated by the neurotransmitter noradrenaline. Exploration heuristics are of particular interest when access to cognitive resources is limited and prior knowledge is uncertain, such as in development and in mental health disorders. In a cross-sectional developmental study, I demonstrate that value-free random exploration is used more at a younger age. Additionally, in a large-sample online study, I show that it is specifically associated with impulsivity. Together, this indicates that value-free random exploration is useful in certain contexts (e.g., childhood) but that high levels of it can be detrimental. Overall, this thesis attempts to better understand the process of exploration in humans, and opens the way for understanding the mechanisms of arbitration between complex and simple strategies for decision making.
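The contrast between a value-free heuristic and a value-guided strategy can be sketched on a simple two-armed bandit. The strategies and parameters below are illustrative only, not the cognitive task used in the thesis.

```python
import random

def run_bandit(strategy, true_means, horizon, seed=0):
    """Toy contrast between 'value-free random' choice (ignore all learned
    values, pick uniformly) and a value-guided epsilon-greedy strategy
    that exploits learned expected values. Returns the average reward."""
    rng = random.Random(seed)
    n = len(true_means)
    counts, est = [0] * n, [0.0] * n
    total = 0.0
    for _ in range(horizon):
        if strategy == "value-free":       # ignore prior knowledge entirely
            arm = rng.randrange(n)
        else:                              # value-guided epsilon-greedy
            arm = (rng.randrange(n) if rng.random() < 0.1
                   else max(range(n), key=lambda a: est[a]))
        r = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        est[arm] += (r - est[arm]) / counts[arm]
        total += r
    return total / horizon
```

Value-free random choice earns roughly the average of the arm means, while the value-guided strategy approaches the best arm's mean; the gap between the two is the performance cost the thesis weighs against the heuristic's computational savings.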
What-is and How-to for Fairness in Machine Learning: A Survey, Reflection, and Perspective
Algorithmic fairness has attracted increasing attention in the machine
learning community. Various definitions are proposed in the literature, but the
differences and connections among them are not clearly addressed. In this
paper, we review and reflect on various fairness notions previously proposed in
machine learning literature, and make an attempt to draw connections to
arguments in moral and political philosophy, especially theories of justice. We
also consider fairness inquiries from a dynamic perspective, and further
consider the long-term impact induced by current predictions and
decisions. In light of the differences among the characterized fairness
notions, we present a flowchart that encompasses the implicit assumptions and
expected outcomes of different types of fairness inquiries on the data
generating process, on the predicted outcome, and on the induced impact,
respectively. This paper demonstrates the importance of matching the mission
(which kind of fairness one would like to enforce) and the means (which
spectrum of fairness analysis is of interest, what the appropriate analyzing
scheme is) to fulfill the intended purpose.