On the Performance of Thompson Sampling on Logistic Bandits
We study the logistic bandit, in which rewards are binary with success probability $\exp(\beta a^\top \theta)/(1+\exp(\beta a^\top \theta))$, and actions $a$ and coefficients $\theta$ lie within the $d$-dimensional unit ball. While prior regret bounds for algorithms that address the logistic bandit exhibit exponential dependence on the slope parameter $\beta$, we establish a regret bound for Thompson sampling that is independent of $\beta$. Specifically, we establish that, when the set of feasible actions is identical to the set of possible coefficient vectors, the Bayesian regret of Thompson sampling is $\tilde{O}(d\sqrt{T})$. We also establish a bound that applies more broadly, where $\Delta$ is the worst-case optimal log-odds and $\eta$ is the "fragility dimension," a new statistic we define to capture the degree to which an optimal action for one model fails to satisfice for others. We demonstrate that the fragility dimension plays an essential role by showing that, for any $\epsilon > 0$, no algorithm can achieve $\mathrm{poly}(d, 1/\Delta)\,\tilde{O}(T^{1-\epsilon})$ regret.
Comment: Accepted for presentation at the Conference on Learning Theory (COLT) 2019
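As a concrete illustration of the setting, the sketch below runs Thompson sampling on a small logistic bandit in which a discretized unit circle serves as both the action set and the set of candidate coefficient vectors, with an exact posterior maintained over that finite grid. The grid size, the slope value, and the horizon are illustrative choices, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Discretized unit circle serves as both the action set and the set of
# possible coefficient vectors (the "actions = coefficients" regime).
angles = np.linspace(0.0, 2 * np.pi, 16, endpoint=False)
vectors = np.stack([np.cos(angles), np.sin(angles)], axis=1)

beta = 3.0                                            # slope parameter (illustrative)
theta_star = vectors[rng.integers(len(vectors))]      # true coefficient vector
posterior = np.full(len(vectors), 1.0 / len(vectors)) # uniform prior over the grid

T = 2000
cum_regret = 0.0
best_p = sigmoid(beta * 1.0)   # optimal action aligns exactly with theta_star

for t in range(T):
    # Thompson sampling: draw coefficients from the posterior, act greedily for them.
    theta_hat = vectors[rng.choice(len(vectors), p=posterior)]
    a = vectors[np.argmax(vectors @ theta_hat)]

    # Binary reward with logistic success probability.
    p_success = sigmoid(beta * a @ theta_star)
    r = rng.random() < p_success
    cum_regret += best_p - p_success

    # Exact Bayesian update over the discrete coefficient grid.
    lik = sigmoid(beta * vectors @ a)
    posterior *= lik if r else (1.0 - lik)
    posterior /= posterior.sum()

print(f"cumulative regret after {T} rounds: {cum_regret:.1f}")
```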
On the Prior Sensitivity of Thompson Sampling
The empirically successful Thompson Sampling algorithm for stochastic bandits
has drawn much interest in understanding its theoretical properties. One
important benefit of the algorithm is that it allows domain knowledge to be
conveniently encoded as a prior distribution to balance exploration and
exploitation more effectively. While it is generally believed that the
algorithm's regret is low (high) when the prior is good (bad), little is known
about the exact dependence. In this paper, we fully characterize the
algorithm's worst-case dependence of regret on the choice of prior, focusing on
a special yet representative case. These results also provide insights into the
general sensitivity of the algorithm to the choice of priors. In particular,
with $p$ denoting the prior probability mass of the true reward-generating model,
we prove regret upper bounds, as functions of $p$, for the bad-prior and
good-prior cases, as well as matching lower
bounds. Our proofs rely on the discovery of a fundamental property of Thompson
Sampling and make heavy use of martingale theory, both of which appear novel in
the literature, to the best of our knowledge.
Comment: Appears in the 27th International Conference on Algorithmic Learning Theory (ALT), 2016
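To make the role of the prior concrete, here is a minimal sketch of Thompson sampling over a finite set of candidate Bernoulli reward models, where the prior mass placed on the true model can be varied. The two-armed instance, the candidate models, and the specific prior weights are illustrative, not the construction analysed in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Candidate reward models: each row gives Bernoulli means for two arms.
models = np.array([[0.9, 0.1],    # true model
                   [0.1, 0.9]])   # misleading alternative
true_model = 0

def run(prior_mass_on_truth, T=5000):
    """Thompson sampling over the model set with a configurable prior."""
    post = np.array([prior_mass_on_truth, 1.0 - prior_mass_on_truth])
    regret = 0.0
    best_mean = models[true_model].max()
    for _ in range(T):
        m = rng.choice(len(models), p=post)       # sample a model from the posterior
        arm = int(np.argmax(models[m]))           # act greedily under the sample
        r = rng.random() < models[true_model, arm]
        regret += best_mean - models[true_model, arm]
        # Bayesian update of the posterior over models.
        lik = models[:, arm] if r else 1.0 - models[:, arm]
        post = post * lik
        post /= post.sum()
    return regret

for p in (0.5, 0.1, 0.01):
    print(f"prior mass on true model {p:>5}: regret {run(p):.1f}")
```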
Learning to Optimize via Information-Directed Sampling
We propose information-directed sampling -- a new approach to online
optimization problems in which a decision-maker must balance between
exploration and exploitation while learning from partial feedback. Each action
is sampled in a manner that minimizes the ratio between squared expected
single-period regret and a measure of information gain: the mutual information
between the optimal action and the next observation. We establish an expected
regret bound for information-directed sampling that applies across a very
general class of models and scales with the entropy of the optimal action
distribution. We illustrate through simple analytic examples how
information-directed sampling accounts for kinds of information that
alternative approaches do not adequately address and that this can lead to
dramatic performance gains. For the widely studied Bernoulli, Gaussian, and
linear bandit problems, we demonstrate state-of-the-art simulation performance.
Comment: arXiv admin note: substantial text overlap with arXiv:1403.534
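The sketch below computes one step of the information-ratio minimization described above for a Bernoulli bandit with Beta posteriors: it estimates, by Monte Carlo, each action's expected single-period regret and the mutual information between the identity of the optimal action and that action's next observation, then picks the action minimizing squared regret over information. This is the deterministic simplification of the rule (the full method optimizes over randomized action pairs), and the sample count and problem instance are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def bernoulli_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def ids_action(alpha, beta, n_samples=10_000):
    """One deterministic IDS step for a Bernoulli bandit with Beta posteriors."""
    k = len(alpha)
    theta = rng.beta(alpha, beta, size=(n_samples, k))  # posterior samples
    best = theta.argmax(axis=1)                          # sampled optimal arm

    # Expected single-period regret of each action.
    delta = theta.max(axis=1).mean() - theta.mean(axis=0)

    # Mutual information between the optimal action and action a's observation:
    # I(A*; Y_a) = H(Y_a) - sum_i P(A*=i) H(Y_a | A*=i), with Y_a Bernoulli.
    info = np.zeros(k)
    for a in range(k):
        h_marginal = bernoulli_entropy(theta[:, a].mean())
        h_cond = 0.0
        for i in range(k):
            mask = best == i
            if mask.any():
                h_cond += mask.mean() * bernoulli_entropy(theta[mask, a].mean())
        info[a] = max(h_marginal - h_cond, 1e-12)

    return int(np.argmin(delta ** 2 / info))  # minimize the information ratio

# Example: posteriors after a few observations on a 3-armed bandit.
alpha = np.array([5.0, 2.0, 1.0])
beta = np.array([3.0, 2.0, 1.0])
print("IDS plays arm", ids_action(alpha, beta))
```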
Thompson Sampling for the MNL-Bandit
We consider a sequential subset selection problem under parameter
uncertainty, where at each time step, the decision maker selects a subset of
cardinality $K$ from $N$ possible items (arms), and observes a (bandit)
feedback in the form of the index of one of the items in said subset, or none.
Each item in the index set is ascribed a certain value (reward), and the
feedback is governed by a Multinomial Logit (MNL) choice model whose parameters
are a priori unknown. The objective of the decision maker is to maximize the
expected cumulative rewards over a finite horizon $T$, or alternatively,
minimize the regret relative to an oracle that knows the MNL parameters. We
refer to this as the MNL-Bandit problem. This problem is representative of a
larger family of exploration-exploitation problems that involve a combinatorial
objective, and arise in several important application domains. We present an
approach to adapt Thompson Sampling to this problem and show that it achieves
near-optimal regret as well as attractive numerical performance.
Comment: Accepted for presentation at the Conference on Learning Theory (COLT) 2017
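To illustrate the feedback structure, here is a simplified posterior-sampling loop for an MNL choice model with a finite set of candidate attraction-parameter vectors: each round a candidate is drawn from the posterior, the assortment maximizing expected revenue under that draw is offered, and the posterior is updated from the observed choice (an item index or no purchase). This is not the paper's epoch-based algorithm; the candidate set, revenues, and sizes are illustrative.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)

N, K = 5, 2                                   # items and assortment cardinality
revenues = np.array([1.0, 0.8, 0.6, 0.5, 0.4])

# Finite set of candidate MNL attraction-parameter vectors (illustrative).
candidates = np.array([
    [0.3, 0.9, 0.2, 0.8, 0.1],
    [0.9, 0.2, 0.7, 0.1, 0.3],
    [0.5, 0.5, 0.5, 0.5, 0.5],
])
true_idx = 0
posterior = np.full(len(candidates), 1.0 / len(candidates))

def expected_revenue(S, v):
    denom = 1.0 + v[list(S)].sum()
    return (revenues[list(S)] * v[list(S)]).sum() / denom

def best_assortment(v):
    return max(combinations(range(N), K), key=lambda S: expected_revenue(S, v))

for t in range(3000):
    v_hat = candidates[rng.choice(len(candidates), p=posterior)]  # Thompson draw
    S = best_assortment(v_hat)

    # MNL feedback: index of the chosen item, or None for no purchase.
    v_true = candidates[true_idx]
    probs = np.append(v_true[list(S)], 1.0) / (1.0 + v_true[list(S)].sum())
    pick = rng.choice(len(S) + 1, p=probs)
    chosen = S[pick] if pick < len(S) else None

    # Posterior update: likelihood of the observed choice under each candidate.
    lik = np.empty(len(candidates))
    for m, v in enumerate(candidates):
        denom = 1.0 + v[list(S)].sum()
        lik[m] = (v[chosen] / denom) if chosen is not None else (1.0 / denom)
    posterior *= lik
    posterior /= posterior.sum()

print("posterior over candidate models:", np.round(posterior, 3))
```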
Learning to Route Efficiently with End-to-End Feedback: The Value of Networked Structure
We introduce efficient algorithms which achieve nearly optimal regrets for
the problem of stochastic online shortest path routing with end-to-end
feedback. The setting is a natural application of the combinatorial stochastic
bandits problem, a special case of the linear stochastic bandits problem. We
show how the difficulties posed by the large scale action set can be overcome
by the networked structure of the action set. Our approach presents a novel
connection between bandit learning and shortest path algorithms. Our main
contribution is an adaptive exploration algorithm with nearly optimal
instance-dependent regret for any directed acyclic network. We then modify it
so that nearly optimal worst case regret is achieved simultaneously. Driven by
the carefully designed Top-Two Comparison (TTC) technique, the algorithms are
efficiently implementable. We further conduct extensive numerical experiments
to show that our proposed algorithms not only achieve superior regret
performance, but also reduce the runtime drastically.
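For a sense of the problem structure, the sketch below is a generic linear-bandit Thompson-sampling baseline for online shortest-path routing with end-to-end feedback on a tiny DAG: edge means get a Gaussian prior, the observed quantity is only the total delay of the chosen path, and each round a shortest path is computed for a posterior draw of the edge means. This is not the paper's Top-Two Comparison algorithm; the graph, delays, and noise level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

# Small DAG from node 0 to node 3; each edge has an unknown mean delay.
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
true_means = np.array([1.0, 2.5, 0.5, 3.0, 1.0])
noise_sd = 0.3
E = len(edges)

# Candidate 0 -> 3 paths, as index lists into `edges`, and incidence vectors.
paths = [[0, 3], [0, 2, 4], [1, 4]]
X = np.zeros((len(paths), E))
for i, p in enumerate(paths):
    X[i, p] = 1.0

# Conjugate Bayesian linear regression over edge means:
# end-to-end delay y = x^T theta + noise, with x the path's incidence vector.
prec = np.eye(E)                 # posterior precision (starts at an N(0, I) prior)
b = np.zeros(E)                  # precision-weighted mean accumulator

best_true = (X @ true_means).min()
cum_regret = 0.0

for t in range(2000):
    cov = np.linalg.inv(prec)
    mu = cov @ b
    theta = rng.multivariate_normal(mu, cov)        # Thompson draw of edge means
    i = int(np.argmin(X @ theta))                   # shortest path for the draw

    # End-to-end feedback only: the total delay of the chosen path.
    y = X[i] @ true_means + noise_sd * rng.standard_normal()
    cum_regret += X[i] @ true_means - best_true

    prec += np.outer(X[i], X[i]) / noise_sd**2
    b += X[i] * y / noise_sd**2

print(f"cumulative regret: {cum_regret:.1f}")
```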
Satisficing in Time-Sensitive Bandit Learning
Much of the recent literature on bandit learning focuses on algorithms that
aim to converge on an optimal action. One shortcoming is that this orientation
does not account for time sensitivity, which can play a crucial role when
learning an optimal action requires much more information than near-optimal
ones. Indeed, popular approaches such as upper-confidence-bound methods and
Thompson sampling can fare poorly in such situations. We consider instead
learning a satisficing action, which is near-optimal while requiring less
information, and propose satisficing Thompson sampling, an algorithm that
serves this purpose. We establish a general bound on expected discounted regret
and study the application of satisficing Thompson sampling to linear and
infinite-armed bandits, demonstrating arbitrarily large benefits over Thompson
sampling. We also discuss the relation between the notion of satisficing and
the theory of rate distortion, which offers guidance on the selection of
satisficing actions.
Comment: This submission largely supersedes earlier work in arXiv:1704.0902
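As a hedged sketch of the satisficing idea on a Bernoulli bandit: each round the algorithm samples arm means from the posterior and plays the lowest-indexed arm whose sampled mean is within a tolerance of the sampled optimum, so it does not pay to distinguish among near-optimal arms. The instance, the tolerance, and this particular satisficing rule are illustrative simplifications, not the paper's general formulation.

```python
import numpy as np

rng = np.random.default_rng(5)

means = np.array([0.50, 0.78, 0.80])   # arm 2 is best, arm 1 is near-optimal
k = len(means)
epsilon = 0.05                          # satisficing tolerance
alpha = np.ones(k)                      # Beta posterior parameters
beta = np.ones(k)

counts = np.zeros(k, dtype=int)
for t in range(5000):
    theta = rng.beta(alpha, beta)                    # posterior sample
    ok = np.flatnonzero(theta >= theta.max() - epsilon)
    arm = int(ok[0])                                 # lowest-indexed satisficing arm
    counts[arm] += 1
    r = rng.random() < means[arm]
    alpha[arm] += r                                  # conjugate Beta-Bernoulli update
    beta[arm] += 1 - r

print("pulls per arm:", counts)
```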
A Short Survey on Probabilistic Reinforcement Learning
A reinforcement learning agent tries to maximize its cumulative payoff by
interacting in an unknown environment. It is important for the agent to explore
suboptimal actions as well as to pick actions with the highest known rewards. Yet,
in sensitive domains, collecting more data with exploration is not always
possible, but it is important to find a policy with a certain performance
guarantee. In this paper, we present a brief survey of methods available in the
literature for balancing the exploration-exploitation trade-off and computing
robust solutions from fixed samples in reinforcement learning.
Comment: 7 pages, originally written as a literature survey for PhD candidacy exam
Posterior sampling for reinforcement learning: worst-case regret bounds
We present an algorithm based on posterior sampling (aka Thompson sampling)
that achieves near-optimal worst-case regret bounds when the underlying Markov
Decision Process (MDP) is communicating with a finite, though unknown,
diameter. Our main result is a high-probability regret upper bound, polynomial
in $S$, $A$, and $D$ and scaling as $\sqrt{T}$, for any communicating MDP with
$S$ states, $A$ actions, and diameter $D$. Here, regret compares the total
reward achieved by the algorithm to the total expected reward of an optimal
infinite-horizon undiscounted average-reward policy, in time horizon $T$. This
result closely matches the known lower bound of $\Omega(\sqrt{DSAT})$. Our
techniques involve
proving some novel results about the anti-concentration of the Dirichlet
distribution, which may be of independent interest.
Comment: This revision fixes an error due to the use of some incorrect results (Lemma C.1 and Lemma C.2) in the earlier version. The regret bounds in this version are worse by a factor of $\sqrt{S}$ as compared to the previous version.
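Since the analysis hinges on properties of Dirichlet posteriors over transition probabilities, here is a minimal sketch of the corresponding sampling step: per-(state, action) Dirichlet counts are updated from observed transitions, and a full transition kernel is drawn from the posterior. The dimensions and counts are illustrative; the planning step that consumes such a sample is sketched under the PSRL abstract further down.

```python
import numpy as np

rng = np.random.default_rng(6)

S, A = 3, 2
# Dirichlet pseudo-counts over next states, one vector per (state, action).
counts = np.ones((S, A, S))            # Dirichlet(1, ..., 1) prior

def observe(s, a, s_next):
    counts[s, a, s_next] += 1          # posterior update is just a count increment

def sample_kernel():
    """Draw a full transition kernel P[s, a, :] from the Dirichlet posterior."""
    P = np.empty((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(counts[s, a])
    return P

observe(0, 1, 2)
observe(0, 1, 2)
print(sample_kernel()[0, 1])           # concentrates on state 2 as counts grow
```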
An Information-Theoretic Approach to Minimax Regret in Partial Monitoring
We prove a new minimax theorem connecting the worst-case Bayesian regret and
minimax regret under partial monitoring with no assumptions on the space of
signals or decisions of the adversary. We then generalise the
information-theoretic tools of Russo and Van Roy (2016) for proving Bayesian
regret bounds and combine them with the minimax theorem to derive minimax
regret bounds for various partial monitoring settings. The highlight is a clean
analysis of `non-degenerate easy' and `hard' finite partial monitoring, with
new regret bounds that are independent of arbitrarily large game-dependent
constants. The power of the generalised machinery is further demonstrated by
proving that the minimax regret for $k$-armed adversarial bandits is at most
$\sqrt{2kn}$, improving on existing results by a factor of 2. Finally, we provide
a simple analysis of the cops and robbers game, also improving the best known constants.
Comment: 29 pages, to appear in COLT 2019
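For background, the information-theoretic tool of Russo and Van Roy (2016) that this paper generalises bounds Bayesian regret through the information ratio. In its standard finite-decision form (symbols as conventionally defined, not taken from this abstract),

$$\mathbb{E}\big[\mathrm{Regret}(n)\big] \;\le\; \sqrt{\bar{\Gamma}\, H(A^\star)\, n},$$

where $\bar{\Gamma}$ bounds the per-round ratio of squared expected regret to the mutual information between the optimal decision $A^\star$ and the next observation, and $H(A^\star)$ is the entropy of the optimal-decision distribution.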
(More) Efficient Reinforcement Learning via Posterior Sampling
Most provably-efficient learning algorithms introduce optimism about
poorly-understood states and actions to encourage exploration. We study an
alternative approach for efficient exploration, posterior sampling for
reinforcement learning (PSRL). This algorithm proceeds in repeated episodes of
known duration. At the start of each episode, PSRL updates a prior distribution
over Markov decision processes and takes one sample from this posterior. PSRL
then follows the policy that is optimal for this sample during the episode. The
algorithm is conceptually simple, computationally efficient and allows an agent
to encode prior knowledge in a natural way. We establish an $\tilde{O}(\tau S\sqrt{AT})$ bound on the expected regret, where $T$ is time, $\tau$ is the
episode length, and $S$ and $A$ are the cardinalities of the state and action
spaces. This bound is one of the first for an algorithm not based on optimism,
and close to the state of the art for any reinforcement learning algorithm. We
show through simulation that PSRL significantly outperforms existing algorithms
with similar regret bounds.
Comment: 10 pages
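A minimal sketch of the episode loop described above, for a small tabular MDP with known rewards: at the start of each episode a transition kernel is drawn from per-(state, action) Dirichlet posteriors, the optimal policy for that draw is computed by finite-horizon value iteration, and that policy is followed for the episode. The dimensions, the random instance, and the known-reward simplification are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

S, A, tau, episodes = 4, 2, 10, 200
R = rng.uniform(0, 1, size=(S, A))              # known rewards (simplification)
P_true = rng.dirichlet(np.ones(S), size=(S, A)) # unknown true transitions
counts = np.ones((S, A, S))                     # Dirichlet posterior counts

def plan(P):
    """Finite-horizon value iteration; returns a policy for each step."""
    V = np.zeros(S)
    policy = np.zeros((tau, S), dtype=int)
    for h in reversed(range(tau)):
        Q = R + P @ V                            # Q[s, a] = R[s, a] + sum_s' P V
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy

for ep in range(episodes):
    # PSRL: one posterior sample per episode, then act optimally for that sample.
    P_hat = np.array([[rng.dirichlet(counts[s, a]) for a in range(A)]
                      for s in range(S)])
    policy = plan(P_hat)

    s = 0
    for h in range(tau):
        a = policy[h, s]
        s_next = rng.choice(S, p=P_true[s, a])
        counts[s, a, s_next] += 1               # posterior update from the transition
        s = s_next

print("visit counts for (s=0, a=0):", counts[0, 0])
```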