12 research outputs found

    Incentivizing Exploration with Selective Data Disclosure

    We study the design of rating systems that incentivize (more) efficient social learning among self-interested agents. Agents arrive sequentially and are presented with a set of possible actions, each of which yields a positive reward with an unknown probability. A disclosure policy sends messages about the rewards of previously-chosen actions to arriving agents. These messages can alter agents' incentives towards exploration, taking potentially sub-optimal actions for the sake of learning more about their rewards. Prior work achieves much progress with disclosure policies that merely recommend an action to each user, but relies heavily on standard, yet very strong rationality assumptions. We study a particular class of disclosure policies that use messages, called unbiased subhistories, consisting of the actions and rewards from a subsequence of past agents. Each subsequence is chosen ahead of time, according to a predetermined partial order on the rounds. We posit a flexible model of frequentist agent response, which we argue is plausible for this class of "order-based" disclosure policies. We measure the success of a policy by its regret, i.e., the difference, over all rounds, between the expected reward of the best action and the reward induced by the policy. A disclosure policy that reveals full history in each round risks inducing herding behavior among the agents, and typically has regret linear in the time horizon $T$. Our main result is an order-based disclosure policy that obtains regret $\tilde{O}(\sqrt{T})$. This regret is known to be optimal in the worst case over reward distributions, even absent incentives. We also exhibit simpler order-based policies with higher, but still sublinear, regret. These policies can be interpreted as dividing a sublinear number of agents into constant-sized focus groups, whose histories are then revealed to future agents.
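The verbal definition of regret in this abstract can be written out as a formula. The following is a sketch under assumed notation that does not appear in the abstract: $\mu_a$ denotes the expected reward of action $a$, and $a_t$ denotes the action chosen in round $t$ under the disclosure policy.

```latex
% Assumed notation (not from the abstract): \mu_a = expected reward of
% action a; a_t = action taken in round t; T = time horizon.
\[
  R(T) \;=\; T \cdot \max_{a} \mu_a \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t}\right]
\]
```

Under this reading, a full-disclosure policy that induces herding would typically have $R(T)$ growing linearly in $T$, whereas the main result states a bound of $\tilde{O}(\sqrt{T})$.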

    Optimal Algorithm for Bayesian Incentive-Compatible Exploration

    We consider a social planner faced with a stream of myopic selfish agents. The goal of the social planner is to maximize the social welfare; however, the planner is limited to using only information asymmetry (regarding previous outcomes) and cannot use any monetary incentives. The planner recommends actions to agents, but her recommendations need to be Bayesian Incentive Compatible to be followed by the agents. Our main result is an optimal algorithm for the planner, in the case that the action realizations are deterministic and have limited support, making significant progress on this open problem. Our optimal protocol has two interesting features. First, it always completes the exploration of a priori more beneficial actions before exploring a priori less beneficial actions. Second, the randomization in the protocol is correlated across agents and actions (and not independent at each decision time). Comment: EC 201

    The Price of Incentivizing Exploration: A Characterization via Thompson Sampling and Sample Complexity

    We consider incentivized exploration: a version of multi-armed bandits where the choice of arms is controlled by self-interested agents, and the algorithm can only issue recommendations. The algorithm controls the flow of information, and the information asymmetry can incentivize the agents to explore. Prior work achieves optimal regret rates up to multiplicative factors that become arbitrarily large depending on the Bayesian priors, and scale exponentially in the number of arms. A more basic problem of sampling each arm once runs into similar factors. We focus on the price of incentives: the loss in performance, broadly construed, incurred for the sake of incentive-compatibility. We prove that Thompson Sampling, a standard bandit algorithm, is incentive-compatible if initialized with sufficiently many data points. The performance loss due to incentives is therefore limited to the initial rounds when these data points are collected. The problem is largely reduced to that of sample complexity: how many rounds are needed? We address this question, providing matching upper and lower bounds and instantiating them in various corollaries. Typically, the optimal sample complexity is polynomial in the number of arms and exponential in the "strength of beliefs".
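The two-phase structure described in this abstract (collect initial data points, then run Thompson Sampling) can be illustrated with a minimal sketch. It assumes Bernoulli rewards and uniform Beta(1, 1) priors, and the function name and the parameter `n0` (initial samples per arm) are hypothetical. It does not check incentive-compatibility and does not compute how large `n0` must be, which is the question the paper addresses.

```python
import random


def thompson_sampling_warm_start(means, horizon, n0, seed=0):
    """Bernoulli Thompson Sampling preceded by n0 forced samples per arm.

    means   -- true success probabilities (unknown to the algorithm)
    horizon -- number of Thompson Sampling rounds after the warm start
    n0      -- initial samples per arm (hypothetical parameter)
    Returns (pulls per arm, pseudo-regret summed over all rounds).
    """
    rng = random.Random(seed)
    k = len(means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    best = max(means)
    regret = 0.0

    def play(arm):
        nonlocal regret
        reward = 1 if rng.random() < means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
        regret += best - means[arm]  # gap to the best arm's mean

    # Phase 1: warm start, sampling every arm n0 times.
    for arm in range(k):
        for _ in range(n0):
            play(arm)

    # Phase 2: Thompson Sampling with Beta(1 + successes, 1 + failures)
    # posteriors; play the arm with the largest posterior sample.
    for _ in range(horizon):
        samples = [
            rng.betavariate(successes[a] + 1, failures[a] + 1)
            for a in range(k)
        ]
        play(max(range(k), key=samples.__getitem__))

    return pulls, regret


if __name__ == "__main__":
    pulls, regret = thompson_sampling_warm_start([0.2, 0.5, 0.8], 500, 5, seed=1)
    print(pulls, round(regret, 2))
```

In this sketch, all regret attributable to forced exploration is incurred in phase 1, which mirrors the abstract's point that the performance loss due to incentives is limited to the initial rounds.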

    Competing Bandits: The Perils of Exploration Under Competition

    Most online platforms strive to learn from interactions with users, and many engage in exploration: making potentially suboptimal choices for the sake of acquiring new information. We study the interplay between exploration and competition: how such platforms balance the exploration for learning and the competition for users. Here users play three distinct roles: they are customers that generate revenue, they are sources of data for learning, and they are self-interested agents who choose among the competing platforms. We consider a stylized duopoly model in which two firms face the same multi-armed bandit problem. Users arrive one by one and choose between the two firms, so that each firm makes progress on its bandit problem only if it is chosen. Through a mix of theoretical results and numerical simulations, we study whether and to what extent competition incentivizes the adoption of better bandit algorithms, and whether it leads to welfare increases for users. We find that stark competition induces firms to commit to a "greedy" bandit algorithm that leads to low welfare. However, weakening competition by providing firms with some "free" users incentivizes better exploration strategies and increases welfare. We investigate two channels for weakening the competition: relaxing the rationality of users and giving one firm a first-mover advantage. Our findings are closely related to the "competition vs. innovation" relationship, and elucidate the first-mover advantage in the digital economy. Comment: merged and extended version of arXiv:1702.08533 and arXiv:1902.0559