Simple threshold rules solve explore/exploit trade-offs in a resource accumulation search task
How, and how well, do people switch between exploration and exploitation to search for and accumulate resources? We study the decision processes underlying such exploration/exploitation trade-offs using a novel card selection task that captures the common situation of searching among multiple resources (e.g., jobs) that can be exploited without depleting. With experience, participants learn to switch appropriately between exploration and exploitation and approach optimal performance. We model participants' behavior on this task with random, threshold, and sampling strategies, and find that a linear decreasing threshold rule best fits participants' results. Further evidence that participants use decreasing threshold-based strategies comes from reaction time differences between exploration and exploitation; however, participants themselves report non-decreasing thresholds. Decreasing threshold strategies that "front-load" exploration and switch quickly to exploitation are particularly effective in resource accumulation tasks, in contrast to optimal stopping problems like the Secretary Problem, which require longer exploration.
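The linear decreasing threshold rule described above can be illustrated with a minimal sketch. All parameter names and values here (`t0`, `slope`, the uniform payoffs) are assumptions for illustration, not the paper's actual task or fitted parameters:

```python
import random

def simulate(n_trials=20, n_arms=5, t0=0.8, slope=0.03, seed=0):
    """Sketch of a linear decreasing threshold strategy: explore new
    resources until the best payoff seen so far exceeds a threshold
    that decreases linearly with trial number, then exploit it.
    (Hypothetical parameters; the resources do not deplete.)"""
    rng = random.Random(seed)
    payoffs = [rng.random() for _ in range(n_arms)]  # hidden resource values
    best_seen, best_arm = -1.0, None
    total = 0.0
    for t in range(n_trials):
        threshold = t0 - slope * t            # linearly decreasing threshold
        if best_arm is not None and best_seen >= threshold:
            total += payoffs[best_arm]        # exploit: reuse the best resource
        else:
            arm = rng.randrange(n_arms)       # explore a (possibly new) resource
            if payoffs[arm] > best_seen:
                best_seen, best_arm = payoffs[arm], arm
            total += payoffs[arm]
    return total
```

Because exploiting does not deplete the resource, "front-loading" exploration and then locking in the best-known resource accumulates payoff on every remaining trial, which is why decreasing thresholds work well here.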
Solving Non-Stationary Bandit Problems by Random Sampling from Sibling Kalman Filters
The multi-armed bandit problem is a classical optimization problem where an agent sequentially pulls one of multiple arms attached to a gambling machine, with each pull resulting in a random reward. The reward distributions are unknown, and thus one must balance between exploiting existing knowledge about the arms and obtaining new information. Dynamically changing (non-stationary) bandit problems are particularly challenging because each change of the reward distributions may progressively degrade the performance of any fixed strategy. Although computationally intractable in many cases, Bayesian methods provide a standard for optimal decision making. This paper proposes a novel solution scheme for bandit problems with non-stationary, normally distributed rewards. The scheme is inherently Bayesian in nature, yet avoids computational intractability by relying simply on updating the hyperparameters of sibling Kalman filters and on random sampling from these posteriors. Furthermore, it is able to track the better actions, thus supporting non-stationary bandit problems. Extensive experiments demonstrate that our scheme outperforms recently proposed bandit playing algorithms, not only in non-stationary environments but also in stationary ones. Furthermore, our scheme is robust to inexact parameter settings. We thus believe that our methodology opens avenues for obtaining improved novel solutions.
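A minimal sketch of the general idea, one Kalman filter per arm with Thompson-style sampling from the posteriors, might look as follows. This is one plausible reading of the scheme, not the paper's exact algorithm; the parameter names (`obs_noise`, `drift`) and the random-walk drift model are assumptions:

```python
import math
import random

def kalman_ts(rewards_fn, n_arms, horizon, obs_noise=1.0, drift=0.01, seed=0):
    """Thompson sampling with one Kalman filter per arm (a sketch,
    assuming each arm's mean reward follows a random walk). Each arm
    keeps a Gaussian posterior (mean, var) over its drifting mean."""
    rng = random.Random(seed)
    mean = [0.0] * n_arms
    var = [1e3] * n_arms                # diffuse prior
    pulls = []
    for t in range(horizon):
        # inflate every arm's variance to model random-walk drift
        var = [v + drift for v in var]
        # Thompson step: sample from each posterior, pull the argmax
        samples = [rng.gauss(mean[a], math.sqrt(var[a])) for a in range(n_arms)]
        a = max(range(n_arms), key=lambda i: samples[i])
        r = rewards_fn(a, t)
        # Kalman update of the chosen arm's hyperparameters
        k = var[a] / (var[a] + obs_noise)   # Kalman gain
        mean[a] += k * (r - mean[a])
        var[a] *= (1.0 - k)
        pulls.append(a)
    return pulls
```

Because unplayed arms accumulate drift variance, their posteriors widen over time and they are eventually re-sampled, which is what lets the scheme track changes in the reward distributions.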
Exploiting simple corporate memory in iterative coalition games
Amongst the challenging problems that must be addressed in order to create increasingly automated electronic commerce systems are those that involve forming coalitions of agents to exploit a particular market opportunity. Furthermore, economic systems are normally continuous dynamic systems that generate many instances of the same or similar problems (regular calls for tender, the regular emergence of new markets, etc.). The work described in this paper explores how simple forms of memory can be exploited by agents over time to guide decision making in iterative sequences of coalition formation problems, enabling them to build up social knowledge that improves both their own utility and the population's ability to produce increasingly well-suited coalitions in a simple call-for-tender economy.
Distributed Algorithms for Learning and Cognitive Medium Access with Logarithmic Regret
The problem of distributed learning and channel access is considered in a
cognitive network with multiple secondary users. The availability statistics of
the channels are initially unknown to the secondary users and are estimated
using sensing decisions. There is no explicit information exchange or prior
agreement among the secondary users. We propose policies for distributed
learning and access which achieve order-optimal cognitive system throughput
(number of successful secondary transmissions) under self play, i.e., when
implemented at all the secondary users. Equivalently, our policies minimize the
regret in distributed learning and access. We first consider the scenario when
the number of secondary users is known to the policy, and prove that the total
regret is logarithmic in the number of transmission slots. Our distributed
learning and access policy achieves order-optimal regret by comparing to an
asymptotic lower bound for regret under any uniformly-good learning and access
policy. We then consider the case when the number of secondary users is fixed
but unknown, and is estimated through feedback. We propose a policy in this
scenario whose asymptotic sum regret grows slightly faster than
logarithmically in the number of transmission slots.
Comment: Submitted to IEEE JSAC on Advances in Cognitive Radio Networking and Communications, Dec. 2009, Revised May 201
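The flavor of such a distributed policy can be sketched in a toy simulation: each secondary user independently estimates channel availabilities from its own sensing outcomes, targets the channel whose estimate holds the user's current rank, and re-draws a random rank after a collision. This is an illustrative reading of the randomized-rank idea, not the paper's exact policy; all names and the assumption that users do not exceed channels are ours:

```python
import random

def simulate_access(avail, n_users, horizon, seed=0):
    """Toy sketch of rank-based distributed channel access (assumed
    scheme, not the paper's exact policy). Requires n_users <= number
    of channels. Returns the number of successful transmissions."""
    rng = random.Random(seed)
    n_ch = len(avail)
    est = [[0.0] * n_ch for _ in range(n_users)]   # per-user sample means
    cnt = [[0] * n_ch for _ in range(n_users)]
    ranks = [rng.randrange(n_users) for _ in range(n_users)]
    successes = 0
    for _ in range(horizon):
        choices = []
        for u in range(n_users):
            # channels sorted by this user's own estimates, best first
            order = sorted(range(n_ch), key=lambda c: -est[u][c])
            choices.append(order[ranks[u]])
        for u, c in enumerate(choices):
            free = rng.random() < avail[c]          # channel idle this slot?
            cnt[u][c] += 1                          # update the sample mean
            est[u][c] += (float(free) - est[u][c]) / cnt[u][c]
            if choices.count(c) > 1:
                ranks[u] = rng.randrange(n_users)   # collision: re-draw rank
            elif free:
                successes += 1                      # lone user on a free channel
    return successes
```

Note that no information is exchanged between users: collisions are the only implicit coordination signal, matching the abstract's "no explicit information exchange or prior agreement" setting.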
Robustness of Anytime Bandit Policies
This paper studies the deviations of the regret in a stochastic multi-armed
bandit problem. When the total number of plays n is known beforehand by the
agent, Audibert et al. (2009) exhibit a policy such that with probability at
least 1-1/n, the regret of the policy is of order log(n). They have also shown
that such a property is not shared by the popular ucb1 policy of Auer et al.
(2002). This work first answers an open question: it extends this negative
result to any anytime policy. The second contribution of this paper is to
design anytime robust policies for specific multi-armed bandit problems in
which some restrictions are put on the set of possible distributions of the
different arms.
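For reference, the UCB1 policy of Auer et al. (2002) discussed above is a few lines long. This is a standard textbook rendering, not code from the paper; the reward function and horizon are illustrative:

```python
import math

def ucb1(rewards_fn, n_arms, horizon):
    """Minimal UCB1 (Auer et al., 2002): pull each arm once, then pull
    the arm maximizing sample mean + sqrt(2 ln t / n_pulls).
    Returns the pull counts per arm."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            a = t - 1                         # initialization: one pull per arm
        else:
            a = max(range(n_arms),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2.0 * math.log(t) / counts[i]))
        counts[a] += 1
        sums[a] += rewards_fn(a)
    return counts
```

UCB1's guarantee is on *expected* regret; the point of the abstract is that its regret *deviations* can be large, a weakness this paper extends to every anytime policy.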