    A Relative Exponential Weighing Algorithm for Adversarial Utility-based Dueling Bandits

    We study the $K$-armed dueling bandit problem, a variation of the classical Multi-Armed Bandit (MAB) problem in which the learner receives only relative feedback about the selected pairs of arms. We propose a new algorithm, the Relative Exponential-weight algorithm for Exploration and Exploitation (REX3), to handle the adversarial utility-based formulation of this problem. This algorithm is a non-trivial extension of the Exponential-weight algorithm for Exploration and Exploitation (EXP3). We prove a finite-time expected regret upper bound of order $O(\sqrt{K \ln(K)\, T})$ for this algorithm and a general lower bound of order $\Omega(\sqrt{KT})$. Finally, we provide experimental results using real data from information retrieval applications.
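
As a point of reference for the abstract above, the following is a minimal sketch of the standard EXP3 update that REX3 is said to extend. It is not the REX3 algorithm itself, which works from relative (pairwise) feedback. The names `exp3`, `mixed_distribution`, `reward_fn`, and the default `gamma` are illustrative assumptions, and rewards are assumed to lie in [0, 1].

```python
import math
import random


def mixed_distribution(weights, gamma):
    """Mix the normalised weights with uniform exploration (rate gamma)."""
    total = sum(weights)
    k = len(weights)
    return [(1 - gamma) * w / total + gamma / k for w in weights]


def exp3(reward_fn, n_arms, horizon, gamma=0.1, seed=0):
    """Run standard EXP3 and return the final sampling distribution.

    reward_fn(t, arm) is a hypothetical callback returning a reward in [0, 1].
    """
    rng = random.Random(seed)
    weights = [1.0] * n_arms
    for t in range(horizon):
        probs = mixed_distribution(weights, gamma)
        arm = rng.choices(range(n_arms), weights=probs)[0]
        reward = reward_fn(t, arm)
        # Importance-weighted estimate: unbiased for the pulled arm's reward.
        estimate = reward / probs[arm]
        weights[arm] *= math.exp(gamma * estimate / n_arms)
        # Rescale to avoid overflow; the distribution is unchanged.
        top = max(weights)
        weights = [w / top for w in weights]
    return mixed_distribution(weights, gamma)
```

For example, `exp3(lambda t, arm: 1.0 if arm == 2 else 0.0, n_arms=3, horizon=2000)` should concentrate most of the probability mass on arm 2 while keeping at least `gamma / n_arms` on every arm.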

    Stochastic Online Learning with Probabilistic Graph Feedback

    We consider a problem of stochastic online learning with general probabilistic graph feedback, where each directed edge $(i,j)$ in the feedback graph has probability $p_{ij}$. Two cases are covered. (a) The one-step case, where after playing arm $i$ the learner observes a sample reward feedback of arm $j$ with independent probability $p_{ij}$. (b) The cascade case, where after playing arm $i$ the learner observes feedback of all arms $j$ in a probabilistic cascade starting from $i$: for each $(i,j)$, if arm $i$ is played or observed, then a reward sample of arm $j$ is observed with independent probability $p_{ij}$. Previous works mainly focus on deterministic graphs, which correspond to the one-step case with $p_{ij} \in \{0,1\}$, on an adversarial sequence of graphs with certain topology guarantees, or on a specific type of random graphs. We analyze the asymptotic lower bounds and design algorithms in both cases. The regret upper bounds of the algorithms match the lower bounds with high probability.
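
The one-step feedback model in case (a) can be illustrated with a small simulator. This is only a sketch of the observation process, not of the paper's algorithms; the function name `one_step_feedback`, the matrix `p`, and the choice of Bernoulli rewards are assumptions made for illustration.

```python
import random


def one_step_feedback(played, p, mean_rewards, rng):
    """Return {arm j: reward sample} observed after playing arm `played`.

    p[i][j] is the probability that playing arm i reveals a sample of arm j.
    Rewards are assumed Bernoulli with the given means (an illustrative choice).
    """
    observed = {}
    for j, p_ij in enumerate(p[played]):
        # Each edge (played, j) fires independently with probability p_ij.
        if rng.random() < p_ij:
            observed[j] = 1.0 if rng.random() < mean_rewards[j] else 0.0
    return observed


rng = random.Random(0)
p = [
    [1.0, 0.0, 1.0],  # playing arm 0 always reveals arms 0 and 2, never arm 1
    [0.5, 1.0, 0.5],
    [0.0, 0.0, 1.0],
]
feedback = one_step_feedback(0, p, mean_rewards=[0.2, 0.5, 0.9], rng=rng)
```

With entries restricted to 0 and 1, as in the first row of `p`, the model reduces to the deterministic graph feedback that the abstract attributes to previous work.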

    Reducing Dueling Bandits to Cardinal Bandits

    We present algorithms for reducing the Dueling Bandits problem to the conventional (stochastic) Multi-Armed Bandits problem. The Dueling Bandits problem is an online model of learning with ordinal feedback of the form "A is preferred to B" (as opposed to cardinal feedback like "A has value 2.5"), giving it wide applicability in learning from implicit user feedback and revealed and stated preferences. In contrast to existing algorithms for the Dueling Bandits problem, our reductions -- named Doubler, MultiSbm and DoubleSbm -- provide a generic schema for translating the extensive body of known results about conventional Multi-Armed Bandit algorithms to the Dueling Bandits setting. For Doubler and MultiSbm we prove regret upper bounds in both finite and infinite settings, and conjecture about the performance of DoubleSbm, which empirically outperforms the other two as well as previous algorithms in our experiments. In addition, we provide the first almost optimal regret bound in terms of second order terms, such as the differences between the values of the arms.

    Online Learning with Gaussian Payoffs and Side Observations

    We consider a sequential learning problem with Gaussian payoffs and side information: after selecting an action $i$, the learner receives information about the payoff of every action $j$ in the form of Gaussian observations whose mean is the same as the mean payoff, but the variance depends on the pair $(i,j)$ (and may be infinite). The setup allows a more refined information transfer from one action to another than previous partial monitoring setups, including the recently introduced graph-structured feedback case. For the first time in the literature, we provide non-asymptotic problem-dependent lower bounds on the regret of any algorithm, which recover existing asymptotic problem-dependent lower bounds and finite-time minimax lower bounds available in the literature. We also provide algorithms that achieve the problem-dependent lower bound (up to some universal constant factor) or the minimax lower bounds (up to logarithmic factors).
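
The observation model described above can be sketched as follows. The code simulates the side observations and combines them with inverse-variance weighting, which is a standard estimator chosen here for illustration and is not claimed to be the paper's algorithm. The names `observe`, `update`, and the standard-deviation matrix `sigma` (strictly positive entries, `math.inf` for "no information") are assumptions.

```python
import math
import random


def observe(action, means, sigma, rng):
    """Sample one Gaussian observation per arm after playing `action`.

    sigma[i][j] is the standard deviation of the observation of arm j when
    arm i is played; math.inf means the observation carries no information.
    """
    obs = []
    for j, mu in enumerate(means):
        s = sigma[action][j]
        obs.append(None if math.isinf(s) else rng.gauss(mu, s))
    return obs


def update(weighted_sums, precisions, action, obs, sigma):
    """Accumulate inverse-variance weighted statistics in place."""
    for j, x in enumerate(obs):
        if x is None:
            continue
        w = 1.0 / sigma[action][j] ** 2  # precision of this observation
        weighted_sums[j] += w * x
        precisions[j] += w


rng = random.Random(0)
means = [0.2, 0.8]
sigma = [[1.0, math.inf], [2.0, 1.0]]
weighted_sums, precisions = [0.0, 0.0], [0.0, 0.0]
obs = observe(0, means, sigma, rng)
update(weighted_sums, precisions, 0, obs, sigma)
```

The estimate of arm `j` is `weighted_sums[j] / precisions[j]` whenever `precisions[j] > 0`. In this example, playing action 0 yields no information about arm 1, so its precision stays at zero.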