
    On the Performance of Thompson Sampling on Logistic Bandits

    We study the logistic bandit, in which rewards are binary with success probability $\exp(\beta a^\top \theta) / (1 + \exp(\beta a^\top \theta))$ and actions $a$ and coefficients $\theta$ are within the $d$-dimensional unit ball. While prior regret bounds for algorithms that address the logistic bandit exhibit exponential dependence on the slope parameter $\beta$, we establish a regret bound for Thompson sampling that is independent of $\beta$. Specifically, we establish that, when the set of feasible actions is identical to the set of possible coefficient vectors, the Bayesian regret of Thompson sampling is $\tilde{O}(d\sqrt{T})$. We also establish a $\tilde{O}(\sqrt{d\eta T}/\lambda)$ bound that applies more broadly, where $\lambda$ is the worst-case optimal log-odds and $\eta$ is the "fragility dimension," a new statistic we define to capture the degree to which an optimal action for one model fails to satisfice for others. We demonstrate that the fragility dimension plays an essential role by showing that, for any $\epsilon > 0$, no algorithm can achieve $\mathrm{poly}(d, 1/\lambda)\cdot T^{1-\epsilon}$ regret.
    Comment: Accepted for presentation at the Conference on Learning Theory (COLT) 201
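
    As a rough, hedged illustration of the algorithm being analysed (not of the paper's analysis), the Python sketch below runs Thompson sampling on a finite action set from the unit ball. The Gaussian, Laplace-style posterior approximation, the Newton update, and all function and variable names are assumptions made for the sketch, not details taken from the abstract.

    import numpy as np

    # Minimal sketch of Thompson sampling for a logistic bandit.
    # Assumption (not from the paper): the posterior over theta is approximated
    # by a Gaussian centred at a regularised MAP estimate (Laplace-style),
    # refreshed by a few Newton steps after each observation.

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def thompson_logistic_bandit(actions, theta_true, beta, T, prior_var=1.0, seed=0):
        rng = np.random.default_rng(seed)
        d = actions.shape[1]
        mean, cov = np.zeros(d), prior_var * np.eye(d)
        X, y = [], []
        for t in range(T):
            theta_sample = rng.multivariate_normal(mean, cov)          # sample a model
            a = actions[np.argmax(actions @ theta_sample)]             # act greedily for the sample
            r = float(rng.random() < sigmoid(beta * a @ theta_true))   # binary reward
            X.append(a); y.append(r)
            # Laplace-style refit: a few Newton steps on the regularised MAP objective.
            A, Y = np.array(X), np.array(y)
            for _ in range(5):
                p = sigmoid(beta * A @ mean)
                grad = beta * A.T @ (Y - p) - mean / prior_var
                H = beta**2 * (A * (p * (1 - p))[:, None]).T @ A + np.eye(d) / prior_var
                mean = mean + np.linalg.solve(H, grad)
            cov = np.linalg.inv(H)
        return mean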

    On the Prior Sensitivity of Thompson Sampling

    The empirically successful Thompson Sampling algorithm for stochastic bandits has drawn much interest in understanding its theoretical properties. One important benefit of the algorithm is that it allows domain knowledge to be conveniently encoded as a prior distribution to balance exploration and exploitation more effectively. While it is generally believed that the algorithm's regret is low (high) when the prior is good (bad), little is known about the exact dependence. In this paper, we fully characterize the algorithm's worst-case dependence of regret on the choice of prior, focusing on a special yet representative case. These results also provide insights into the general sensitivity of the algorithm to the choice of priors. In particular, with $p$ being the prior probability mass of the true reward-generating model, we prove $O(\sqrt{T/p})$ and $O(\sqrt{(1-p)T})$ regret upper bounds for the bad- and good-prior cases, respectively, as well as matching lower bounds. Our proofs rely on the discovery of a fundamental property of Thompson Sampling and make heavy use of martingale theory, both of which appear novel in the literature, to the best of our knowledge.
    Comment: Appears in the 27th International Conference on Algorithmic Learning Theory (ALT), 201
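
    A hedged sketch of the object being studied: Thompson sampling over a finite set of candidate reward-generating models, with prior mass p placed on the true model. The finite-model setup and all names are illustrative assumptions rather than the paper's exact construction.

    import numpy as np

    # Thompson sampling over M candidate Bernoulli reward models; the prior
    # places probability mass p on the true model and spreads the rest evenly.

    def thompson_finite_models(models, true_idx, p, T, seed=0):
        """models: array of shape (M, K) of per-arm success probabilities."""
        rng = np.random.default_rng(seed)
        M, K = models.shape
        prior = np.full(M, (1.0 - p) / (M - 1))
        prior[true_idx] = p                        # prior mass p on the true model
        log_post = np.log(prior)
        total_reward = 0.0
        for t in range(T):
            w = np.exp(log_post - log_post.max())
            m = rng.choice(M, p=w / w.sum())       # sample a model from the posterior
            arm = int(np.argmax(models[m]))        # play the best arm for that model
            r = float(rng.random() < models[true_idx, arm])
            total_reward += r
            # Posterior update: each model's likelihood of the observed reward.
            lik = models[:, arm] if r == 1.0 else 1.0 - models[:, arm]
            log_post += np.log(np.clip(lik, 1e-12, None))
        return total_reward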

    Learning to Optimize via Information-Directed Sampling

    We propose information-directed sampling -- a new approach to online optimization problems in which a decision-maker must balance between exploration and exploitation while learning from partial feedback. Each action is sampled in a manner that minimizes the ratio between squared expected single-period regret and a measure of information gain: the mutual information between the optimal action and the next observation. We establish an expected regret bound for information-directed sampling that applies across a very general class of models and scales with the entropy of the optimal action distribution. We illustrate through simple analytic examples how information-directed sampling accounts for kinds of information that alternative approaches do not adequately address and that this can lead to dramatic performance gains. For the widely studied Bernoulli, Gaussian, and linear bandit problems, we demonstrate state-of-the-art simulation performance.
    Comment: arXiv admin note: substantial text overlap with arXiv:1403.534
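
    To make the selection rule concrete, here is a hedged Python sketch of one round of information-directed sampling for a K-armed Bernoulli bandit. The Beta posteriors, the Monte Carlo estimates of regret and information gain, and the grid search over two-point action distributions (the minimiser is known to be supported on at most two actions) are implementation assumptions, not prescriptions from the abstract.

    import numpy as np

    def kl_bern(p, q, eps=1e-12):
        # KL divergence between Bernoulli(p) and Bernoulli(q), elementwise.
        p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    def ids_action(alpha, beta, n_samples=2000, seed=0):
        rng = np.random.default_rng(seed)
        K = len(alpha)
        theta = rng.beta(alpha, beta, size=(n_samples, K))    # posterior samples
        astar = np.argmax(theta, axis=1)                      # sampled optimal arms
        p_star = np.bincount(astar, minlength=K) / n_samples  # P(A* = a)
        mean = theta.mean(axis=0)
        regret = theta.max(axis=1).mean() - mean              # expected single-period regret
        # Information gain about A*: sum over a* of P(A*=a*) KL(E[theta_a | A*=a*] || E[theta_a]).
        gain = np.zeros(K)
        for s in range(K):
            if p_star[s] > 0:
                gain += p_star[s] * kl_bern(theta[astar == s].mean(axis=0), mean)
        # Minimise (pi . regret)^2 / (pi . gain) over two-point action distributions.
        best = np.inf
        best_pi = (int(np.argmin(regret)), int(np.argmin(regret)), 1.0)
        qs = np.linspace(0.0, 1.0, 101)
        for i in range(K):
            for j in range(K):
                d = qs * regret[i] + (1 - qs) * regret[j]
                g = qs * gain[i] + (1 - qs) * gain[j]
                ratio = np.where(g > 0, d**2 / np.maximum(g, 1e-12), np.inf)
                k = int(np.argmin(ratio))
                if ratio[k] < best:
                    best, best_pi = ratio[k], (i, j, qs[k])
        i, j, q = best_pi
        return i if rng.random() < q else j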

    Thompson Sampling for the MNL-Bandit

    We consider a sequential subset selection problem under parameter uncertainty, where at each time step, the decision maker selects a subset of cardinality $K$ from $N$ possible items (arms), and observes a (bandit) feedback in the form of the index of one of the items in said subset, or none. Each item in the index set is ascribed a certain value (reward), and the feedback is governed by a Multinomial Logit (MNL) choice model whose parameters are a priori unknown. The objective of the decision maker is to maximize the expected cumulative rewards over a finite horizon $T$, or alternatively, minimize the regret relative to an oracle that knows the MNL parameters. We refer to this as the MNL-Bandit problem. This problem is representative of a larger family of exploration-exploitation problems that involve a combinatorial objective, and arise in several important application domains. We present an approach to adapt Thompson Sampling to this problem and show that it achieves near-optimal regret as well as attractive numerical performance.
    Comment: Accepted for presentation at Conference on Learning Theory (COLT) 201
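
    For concreteness, the sketch below simulates the MNL feedback channel that defines the bandit: from an offered subset, item i is chosen with probability v_i / (1 + sum of v_j over the offered items), and nothing is chosen with the remaining probability mass. The variable names and the helper function are illustrative assumptions.

    import numpy as np

    def mnl_feedback(offered, v, rng):
        """offered: indices of the K offered items; v: MNL preference weights."""
        weights = np.concatenate(([1.0], v[offered]))    # slot 0 is the "no purchase" option
        probs = weights / weights.sum()
        pick = rng.choice(len(weights), p=probs)
        return None if pick == 0 else int(offered[pick - 1])

    rng = np.random.default_rng(0)
    v = np.array([0.5, 1.2, 0.3, 0.8, 0.1])              # unknown to the decision maker
    chosen = mnl_feedback(np.array([1, 3, 4]), v, rng)   # offer the subset {1, 3, 4}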

    Learning to Route Efficiently with End-to-End Feedback: The Value of Networked Structure

    We introduce efficient algorithms which achieve nearly optimal regret for the problem of stochastic online shortest-path routing with end-to-end feedback. The setting is a natural application of the combinatorial stochastic bandit problem, a special case of the linear stochastic bandit problem. We show how the difficulties posed by the large-scale action set can be overcome by exploiting the networked structure of the action set. Our approach presents a novel connection between bandit learning and shortest-path algorithms. Our main contribution is an adaptive exploration algorithm with nearly optimal instance-dependent regret for any directed acyclic network. We then modify it so that nearly optimal worst-case regret is achieved simultaneously. Driven by the carefully designed Top-Two Comparison (TTC) technique, the algorithms are efficiently implementable. We further conduct extensive numerical experiments to show that our proposed algorithms not only achieve superior regret performance, but also reduce the runtime drastically.
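
    The sketch below is not the paper's TTC algorithm; it is only a hedged posterior-sampling baseline that makes the problem setting concrete: sample edge delays from a per-edge posterior, route along the sampled shortest path in the DAG, and observe only the end-to-end delay. The Gaussian model, the crude way the end-to-end observation is spread across traversed edges, and all names are assumptions.

    import numpy as np

    def shortest_path_dag(n, edges, cost):
        """edges: list of (u, v, idx) in topological order of u, from source 0 to sink n-1."""
        dist = np.full(n, np.inf)
        dist[0] = 0.0
        back = {}
        for u, v, idx in edges:
            if dist[u] + cost[idx] < dist[v]:
                dist[v], back[v] = dist[u] + cost[idx], (u, idx)
        path, node = [], n - 1
        while node != 0:
            u, idx = back[node]
            path.append(idx)
            node = u
        return path

    def ts_routing(n, edges, true_mean, T, noise=1.0, seed=0):
        rng = np.random.default_rng(seed)
        E = len(true_mean)
        mu, prec = np.zeros(E), np.ones(E)               # Gaussian posterior per edge
        for t in range(T):
            sample = rng.normal(mu, 1.0 / np.sqrt(prec))
            path = shortest_path_dag(n, edges, sample)   # route on the sampled delays
            delay = true_mean[path].sum() + rng.normal(0.0, noise)  # end-to-end feedback only
            share = delay / len(path)                    # crude per-edge attribution
            for idx in path:
                prec[idx] += 1.0
                mu[idx] += (share - mu[idx]) / prec[idx]
        return mu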

    Satisficing in Time-Sensitive Bandit Learning

    Much of the recent literature on bandit learning focuses on algorithms that aim to converge on an optimal action. One shortcoming is that this orientation does not account for time sensitivity, which can play a crucial role when learning an optimal action requires much more information than near-optimal ones. Indeed, popular approaches such as upper-confidence-bound methods and Thompson sampling can fare poorly in such situations. We consider instead learning a satisficing action, which is near-optimal while requiring less information, and propose satisficing Thompson sampling, an algorithm that serves this purpose. We establish a general bound on expected discounted regret and study the application of satisficing Thompson sampling to linear and infinite-armed bandits, demonstrating arbitrarily large benefits over Thompson sampling. We also discuss the relation between the notion of satisficing and the theory of rate distortion, which offers guidance on the selection of satisficing actions.
    Comment: This submission largely supersedes earlier work in arXiv:1704.0902
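
    One simple way to operationalise the idea, offered only as a hedged sketch and not necessarily the paper's construction: sample a model from the posterior as in Thompson sampling, but instead of playing the exact argmax, settle for the first arm whose sampled mean is within epsilon of the sampled optimum.

    import numpy as np

    def satisficing_ts(true_means, T, epsilon=0.05, seed=0):
        # Bernoulli bandit with independent Beta(1,1) posteriors per arm.
        rng = np.random.default_rng(seed)
        K = len(true_means)
        alpha, beta = np.ones(K), np.ones(K)
        for t in range(T):
            theta = rng.beta(alpha, beta)                          # posterior sample
            good = np.flatnonzero(theta >= theta.max() - epsilon)  # near-optimal arms
            arm = int(good[0])                                     # settle for the first one
            r = float(rng.random() < true_means[arm])
            alpha[arm] += r
            beta[arm] += 1.0 - r
        return alpha, beta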

    A Short Survey on Probabilistic Reinforcement Learning

    A reinforcement learning agent tries to maximize its cumulative payoff by interacting in an unknown environment. It is important for the agent to explore suboptimal actions as well as to pick actions with the highest known rewards. Yet, in sensitive domains, collecting more data through exploration is not always possible, but it is important to find a policy with a certain performance guarantee. In this paper, we present a brief survey of methods available in the literature for balancing the exploration-exploitation trade-off and computing robust solutions from fixed samples in reinforcement learning.
    Comment: 7 pages, originally written as a literature survey for a PhD candidacy exam

    Posterior sampling for reinforcement learning: worst-case regret bounds

    We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov Decision Process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high-probability regret upper bound of $\tilde{O}(DS\sqrt{AT})$ for any communicating MDP with $S$ states, $A$ actions, and diameter $D$. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average-reward policy, in time horizon $T$. This result closely matches the known lower bound of $\Omega(\sqrt{DSAT})$. Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.
    Comment: This revision fixes an error due to the use of some incorrect results (Lemma C.1 and Lemma C.2) in the earlier version. The regret bounds in this version are worse by a factor of $\sqrt{S}$ compared to the previous version.
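
    The central posterior object in this line of analysis is the Dirichlet distribution over each row of the transition kernel. As a hedged illustration (the prior choice and all names are assumptions), drawing one sampled kernel from visit counts looks like this:

    import numpy as np

    def sample_transition_kernel(counts, alpha0=1.0, seed=0):
        """counts: array of shape (S, A, S) of observed transition counts."""
        rng = np.random.default_rng(seed)
        S, A, _ = counts.shape
        P = np.empty(counts.shape, dtype=float)
        for s in range(S):
            for a in range(A):
                # Dirichlet(alpha0 + counts) posterior for the row P(. | s, a).
                P[s, a] = rng.dirichlet(alpha0 + counts[s, a])
        return P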

    An Information-Theoretic Approach to Minimax Regret in Partial Monitoring

    We prove a new minimax theorem connecting the worst-case Bayesian regret and minimax regret under partial monitoring with no assumptions on the space of signals or decisions of the adversary. We then generalise the information-theoretic tools of Russo and Van Roy (2016) for proving Bayesian regret bounds and combine them with the minimax theorem to derive minimax regret bounds for various partial monitoring settings. The highlight is a clean analysis of 'non-degenerate easy' and 'hard' finite partial monitoring, with new regret bounds that are independent of arbitrarily large game-dependent constants. The power of the generalised machinery is further demonstrated by proving that the minimax regret for $k$-armed adversarial bandits is at most $\sqrt{2kn}$, improving on existing results by a factor of 2. Finally, we provide a simple analysis of the cops and robbers game, also improving on the best known constants.
    Comment: 29 pages, to appear in COLT 201
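
    For context, the Bayesian tool being generalised is the information-ratio bound of Russo and Van Roy (2016); a standard form of that bound, restated here as background with notation lightly adapted rather than as a claim about this paper's statement, is:

    % If the per-round information ratio
    %   Gamma_t = (E_t[regret_t])^2 / I_t(A^*; Y_t)
    % is bounded by \bar{\Gamma} for every round t, then over n rounds
    \[
      \mathbb{E}\left[\sum_{t=1}^{n} \mathrm{regret}_t\right]
      \;\le\; \sqrt{\bar{\Gamma}\, H(A^*)\, n},
    \]
    % where H(A^*) is the entropy of the prior distribution of the optimal action.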

    (More) Efficient Reinforcement Learning via Posterior Sampling

    Most provably-efficient learning algorithms introduce optimism about poorly-understood states and actions to encourage exploration. We study an alternative approach for efficient exploration, posterior sampling for reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of known duration. At the start of each episode, PSRL updates a prior distribution over Markov decision processes and takes one sample from this posterior. PSRL then follows the policy that is optimal for this sample during the episode. The algorithm is conceptually simple, computationally efficient, and allows an agent to encode prior knowledge in a natural way. We establish an $\tilde{O}(\tau S \sqrt{AT})$ bound on the expected regret, where $T$ is time, $\tau$ is the episode length, and $S$ and $A$ are the cardinalities of the state and action spaces. This bound is one of the first for an algorithm not based on optimism, and close to the state of the art for any reinforcement learning algorithm. We show through simulation that PSRL significantly outperforms existing algorithms with similar regret bounds.
    Comment: 10 pages
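
    A minimal sketch of the per-episode step just described, under simplifying assumptions not in the abstract (known mean rewards, Dirichlet transition posteriors, finite-horizon value iteration, illustrative names):

    import numpy as np

    def psrl_episode_policy(counts, rewards, horizon, alpha0=1.0, seed=0):
        """counts: (S, A, S) transition counts; rewards: known (S, A) mean rewards."""
        rng = np.random.default_rng(seed)
        S, A, _ = counts.shape
        # 1) Take one sample of the MDP from the Dirichlet posterior over transitions.
        P = np.array([[rng.dirichlet(alpha0 + counts[s, a]) for a in range(A)]
                      for s in range(S)])
        # 2) Solve the sampled MDP by finite-horizon value iteration.
        V = np.zeros(S)
        policy = np.zeros((horizon, S), dtype=int)
        for h in reversed(range(horizon)):
            Q = rewards + P @ V              # (S, A): immediate reward plus continuation value
            policy[h] = Q.argmax(axis=1)
            V = Q.max(axis=1)
        return policy                        # followed for the entire episode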