130 research outputs found

    Incentivized Exploration for Multi-Armed Bandits under Reward Drift

    Full text link
    We study incentivized exploration for the multi-armed bandit (MAB) problem, where players receive compensation for exploring arms other than the greedy choice and may provide biased feedback on reward. We seek to understand the impact of this drifted reward feedback by analyzing the performance of three instantiations of the incentivized MAB algorithm: UCB, ε-Greedy, and Thompson Sampling. Our results show that they all achieve O(log T) regret and compensation under the drifted reward, and are therefore effective in incentivizing exploration. Numerical examples are provided to complement the theoretical analysis.
    Comment: 10 pages, 2 figures, AAAI 202
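    A minimal sketch of how such an incentivized UCB loop could look, assuming Bernoulli arms, compensation equal to the empirical gap to the greedy arm, and a bounded drift that shrinks with the number of pulls; these modeling choices are illustrative assumptions, not the paper's exact algorithm.

```python
# Sketch: UCB-based incentivized exploration under drifted reward feedback.
# Arm means, the compensation rule, and the drift model are illustrative.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])    # assumed Bernoulli arm means
K, T = len(true_means), 5000
counts = np.zeros(K)
est = np.zeros(K)                          # empirical means from (drifted) feedback
total_compensation = 0.0

for t in range(1, T + 1):
    if t <= K:
        arm = t - 1                        # pull each arm once
    else:
        ucb = est + np.sqrt(2 * np.log(t) / counts)
        arm = int(np.argmax(ucb))
    greedy = int(np.argmax(est))
    if arm != greedy:
        # Compensation set to the empirical gap (one common modeling choice).
        total_compensation += est[greedy] - est[arm]
    reward = rng.binomial(1, true_means[arm])
    drift = 0.1 / np.sqrt(counts[arm] + 1)  # assumed bounded, diminishing bias
    feedback = reward + drift               # player reports biased reward
    counts[arm] += 1
    est[arm] += (feedback - est[arm]) / counts[arm]

print("pulls per arm:", counts, "total compensation:", round(total_compensation, 2))
```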

    Secure-UCB: Saving Stochastic Bandits from Poisoning Attacks via Limited Data Verification

    Full text link
    This paper studies bandit algorithms under data poisoning attacks in a bounded reward setting. We consider a strong attacker model in which the attacker can observe both the selected actions and their corresponding rewards, and can contaminate the rewards with additive noise. We show that any bandit algorithm with regret O(log T) can be forced to suffer a regret Ω(T) with an expected amount of contamination O(log T). This amount of contamination is also necessary, as we prove that there exists an O(log T) regret bandit algorithm, specifically the classical UCB, that requires Ω(log T) amount of contamination to suffer regret Ω(T). To combat such poisoning attacks, our second main contribution is a novel algorithm, Secure-UCB, which uses limited verification to access a limited number of uncontaminated rewards. We show that with an expected number of verifications of O(log T), Secure-UCB can restore the order-optimal O(log T) regret irrespective of the amount of contamination used by the attacker. Finally, we prove that for any bandit algorithm, this number of verifications, O(log T), is necessary to recover the order-optimal regret. We conclude that Secure-UCB is order-optimal in terms of both the expected regret and the expected number of verifications, and can save stochastic bandits from any data poisoning attack.
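    The sketch below illustrates the general idea of limited verification wrapped around a UCB learner, assuming a toy attacker and a power-of-two verification schedule (roughly O(log T) verifications); the schedule, divergence threshold, and correction rule are assumptions for illustration, not the paper's Secure-UCB.

```python
# Sketch: UCB on possibly contaminated rewards, plus occasional verified
# (clean) samples kept in a separate estimate that overrides the contaminated
# one when the two diverge. Attacker model and constants are illustrative.
import numpy as np

rng = np.random.default_rng(1)
true_means = np.array([0.4, 0.6])
K, T = len(true_means), 5000
n = np.zeros(K); mean_obs = np.zeros(K)        # from (possibly attacked) feedback
n_ver = np.zeros(K); mean_ver = np.zeros(K)    # from verified feedback only

for t in range(1, T + 1):
    arm = t - 1 if t <= K else int(np.argmax(mean_obs + np.sqrt(2 * np.log(t) / n)))
    clean = rng.binomial(1, true_means[arm])
    # Toy attacker: push the better arm's observed reward down early on.
    observed = clean - 0.5 if arm == 1 and t < 200 else clean
    n[arm] += 1; mean_obs[arm] += (observed - mean_obs[arm]) / n[arm]
    # Verify on a logarithmically sparse schedule (rounds 1, 2, 4, 8, ...).
    if (t & (t - 1)) == 0:
        n_ver[arm] += 1; mean_ver[arm] += (clean - mean_ver[arm]) / n_ver[arm]
        if abs(mean_ver[arm] - mean_obs[arm]) > 0.2:
            mean_obs[arm] = mean_ver[arm]      # discard the contaminated estimate

print("pulls:", n, "observed means:", np.round(mean_obs, 2))
```

    The power-of-two schedule is one simple way to keep the number of verifications logarithmic in T, mirroring the O(log T) verification budget discussed in the abstract.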

    Contextual Search in the Presence of Irrational Agents

    Full text link
    We study contextual search, a generalization of binary search in higher dimensions, which captures settings such as feature-based dynamic pricing. Standard game-theoretic formulations of this problem assume that agents act in accordance with a specific behavioral model. In practice, however, some agents may not subscribe to the dominant behavioral model or may act in ways that are seemingly arbitrarily irrational. Existing algorithms heavily depend on the behavioral model being (approximately) accurate for all agents and have poor performance in the presence of even a few such arbitrarily irrational agents. We initiate the study of contextual search when some of the agents can behave in ways inconsistent with the underlying behavioral model. In particular, we provide two algorithms, one built on robustifying multidimensional binary search methods and one on translating the setting to a proxy setting appropriate for gradient descent. Our techniques draw inspiration from learning theory, game theory, high-dimensional geometry, and convex analysis.
    Comment: Compared to the first version, titled "Corrupted Multidimensional Binary Search: Learning in the Presence of Irrational Agents", this version provides a broader scope of behavioral models of irrationality, specifies how the results apply to different loss functions, and discusses the power and limitations of additional algorithmic approaches.
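    As a rough illustration of the gradient-descent-style route, the sketch below assumes a linear buyer valuation, posted-price (purchase / no purchase) feedback, and a small fraction of arbitrarily responding agents; the update rule and all constants are illustrative assumptions, not the paper's algorithm.

```python
# Sketch: feature-based dynamic pricing with a few irrational responders.
# A rational buyer purchases iff price <= <theta_star, x>; a bounded,
# signed gradient-style step makes each corrupted round cost at most one step.
import numpy as np

rng = np.random.default_rng(2)
d, T = 5, 2000
theta_star = rng.uniform(0, 1, d); theta_star /= np.linalg.norm(theta_star)
theta_hat = np.zeros(d)
step = 0.05

for t in range(T):
    x = rng.uniform(0, 1, d); x /= np.linalg.norm(x)
    price = float(theta_hat @ x)              # post the estimated value as the price
    value = float(theta_star @ x)
    irrational = rng.random() < 0.05          # assumed 5% arbitrary responders
    bought = rng.random() < 0.5 if irrational else price <= value
    # Raise the estimate after a sale, lower it otherwise; clip to the unit box.
    theta_hat += step * (1.0 if bought else -1.0) * x
    theta_hat = np.clip(theta_hat, 0.0, 1.0)

print("estimation error:", round(float(np.linalg.norm(theta_hat - theta_star)), 3))
```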

    Learning in Non-Cooperative Configurable Markov Decision Processes

    Get PDF
    The Configurable Markov Decision Process framework includes two entities: a Reinforcement Learning agent and a configurator that can modify some environmental parameters to improve the agent's performance. This presupposes that the two actors have the same reward function. What if the configurator does not have the same intentions as the agent? This paper introduces the Non-Cooperative Configurable Markov Decision Process, a setting that allows two (possibly different) reward functions for the configurator and the agent. Then, we consider an online learning problem, where the configurator has to find the best among a finite set of possible configurations. We propose two learning algorithms to minimize the configurator's expected regret, which exploit the problem's structure depending on the agent's feedback. While a naive application of the UCB algorithm yields regret that grows indefinitely over time, we show that our approach suffers only bounded regret. Furthermore, we empirically show the performance of our algorithm in simulated domains.
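    The sketch below makes the configurator's online-learning loop concrete as a bandit over a finite set of configurations, using the naive UCB baseline that the abstract argues is insufficient (the agent keeps learning inside each configuration, so its return is non-stationary); the simulated agent_return function and all constants are illustrative assumptions.

```python
# Sketch: configurator-as-bandit over configurations, naive UCB baseline.
# agent_return stands in for one episode of agent-environment interaction.
import numpy as np

rng = np.random.default_rng(3)

def agent_return(config_id, times_seen):
    # Assumed stand-in: the agent's performance in a configuration improves
    # with experience, which is exactly what makes naive UCB misleading.
    base = [0.4, 0.6, 0.5][config_id]
    learning_bonus = 0.3 * (1 - np.exp(-times_seen / 20))
    return base + learning_bonus + rng.normal(0, 0.05)

n_configs, T = 3, 1000
counts = np.zeros(n_configs); means = np.zeros(n_configs)

for t in range(1, T + 1):
    if t <= n_configs:
        c = t - 1
    else:
        c = int(np.argmax(means + np.sqrt(2 * np.log(t) / counts)))
    r = agent_return(c, counts[c])            # configurator observes the agent's return
    counts[c] += 1
    means[c] += (r - means[c]) / counts[c]

print("episodes per configuration:", counts)
```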

    Computational and cognitive mechanisms of exploration heuristics

    Get PDF
    Should I leave or stay in academia? Many decisions we make require arbitrating between novelty and the benefits of familiar options. This is called the exploration-exploitation trade-off. Solving this trade-off is not trivial, but approximations (called ‘exploration strategies’) exist. Humans are known to rely on different exploration strategies, varying in performance and computational requirements. More complex strategies perform well but are computationally expensive (e.g., they require computing expected values). Cheaper strategies, i.e., heuristics, require fewer cognitive resources but can lead to sub-optimal performance. The simplest heuristic strategy is to ignore prior knowledge, such as expected values, and to choose entirely at random. In effect, this is like rolling a die to choose between different options. Such a ‘value-free random’ exploration strategy may not always lead to optimal performance but spares cognitive resources. In this thesis, I investigate the mechanisms of exploration heuristics in human decision making. I developed a cognitive task that dissociates between different strategies for exploration. In my first study, I demonstrate that humans supplement complex strategies with exploration heuristics and, using a pharmacological manipulation, that value-free random exploration is specifically modulated by the neurotransmitter noradrenaline. Exploration heuristics are of particular interest when access to cognitive resources is limited and prior knowledge is uncertain, such as in development and in mental health disorders. In a cross-sectional developmental study, I demonstrate that value-free random exploration is used more at a younger age. Additionally, in a large-sample online study, I show that it is specifically associated with impulsivity. Together, this indicates that value-free random exploration is useful in certain contexts (e.g., childhood) but that high levels of it can be detrimental. Overall, this thesis attempts to better understand the process of exploration in humans, and opens the way for understanding the mechanisms of arbitration between complex and simple strategies for decision making.
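    A minimal sketch of the contrast between a value-guided (softmax) choice rule and value-free random exploration, where with some probability the chooser ignores value estimates entirely and picks uniformly at random; the parameter values are illustrative assumptions, not fitted to the thesis data.

```python
# Sketch: value-guided softmax choice vs. value-free random exploration.
import numpy as np

rng = np.random.default_rng(4)

def choose(values, eps=0.2, temperature=0.1):
    if rng.random() < eps:                     # value-free random exploration
        return rng.integers(len(values))
    logits = np.asarray(values) / temperature  # value-guided (softmax) choice
    p = np.exp(logits - logits.max()); p /= p.sum()
    return rng.choice(len(values), p=p)

estimated_values = [0.2, 0.8, 0.5]
picks = [choose(estimated_values) for _ in range(1000)]
print("choice frequencies:", np.bincount(picks, minlength=3) / 1000)
```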

    What-is and How-to for Fairness in Machine Learning: A Survey, Reflection, and Perspective

    Full text link
    Algorithmic fairness has attracted increasing attention in the machine learning community. Various definitions have been proposed in the literature, but the differences and connections among them are not clearly addressed. In this paper, we review and reflect on various fairness notions previously proposed in the machine learning literature, and attempt to draw connections to arguments in moral and political philosophy, especially theories of justice. We also consider fairness inquiries from a dynamic perspective and further consider the long-term impact induced by current prediction and decision making. In light of the differences among the characterized fairness notions, we present a flowchart that encompasses the implicit assumptions and expected outcomes of different types of fairness inquiries on the data generating process, on the predicted outcome, and on the induced impact, respectively. This paper demonstrates the importance of matching the mission (which kind of fairness one would like to enforce) and the means (which spectrum of fairness analysis is of interest, and what the appropriate analyzing scheme is) to fulfill the intended purpose.
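    As a small, self-contained illustration of how two widely used group-fairness notions can be computed and compared (the survey covers a much broader spectrum), the sketch below evaluates demographic parity and equal opportunity on synthetic data; the data-generating process and decision threshold are assumptions for illustration only.

```python
# Sketch: demographic parity vs. equal opportunity on synthetic predictions.
import numpy as np

rng = np.random.default_rng(5)
n = 10000
group = rng.integers(0, 2, n)                              # sensitive attribute A
label = rng.binomial(1, np.where(group == 0, 0.5, 0.4))    # true outcome Y
score = 0.6 * label + 0.1 * group + rng.normal(0, 0.2, n)
pred = (score > 0.4).astype(int)                           # predicted outcome Y_hat

def rate(mask):
    return pred[mask].mean()

# Demographic parity compares P(Y_hat = 1 | A = a) across groups.
dp = rate(group == 0), rate(group == 1)
# Equal opportunity compares P(Y_hat = 1 | Y = 1, A = a) across groups.
eo = rate((group == 0) & (label == 1)), rate((group == 1) & (label == 1))
print("demographic parity rates:", np.round(dp, 3))
print("equal opportunity rates:", np.round(eo, 3))
```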