Gaussian Process bandits with adaptive discretization
In this paper, the problem of maximizing a black-box function is studied in the Bayesian framework with a Gaussian Process
(GP) prior. In particular, a new algorithm for this problem is proposed, and
high probability bounds on its simple and cumulative regret are established.
The query point selection rule in most existing methods involves an exhaustive
search over an increasingly fine sequence of uniform discretizations of the
domain $\mathcal{X}$. The proposed algorithm, in contrast, adaptively refines
$\mathcal{X}$, which leads to a lower computational complexity, particularly
when $\mathcal{X}$ is a subset of a high-dimensional Euclidean space. In
addition to the computational gains, sufficient conditions are identified under
which the regret bounds of the new algorithm improve upon the known results.
Finally, an extension of the algorithm to the case of contextual bandits is
proposed, and high probability bounds on the contextual regret are presented.
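The adaptive-refinement idea is easy to sketch. Below is a minimal illustration, assuming a one-dimensional domain, a squared-exponential kernel, and a simple diameter bonus in place of the paper's actual refinement rule (all of these are our own choices, not the authors'): keep a partition of the domain, score each cell with a GP-UCB value at its center, and split only the winning cell.

```python
import numpy as np

np.random.seed(0)

def rbf(a, b, ls=0.2):
    # Squared-exponential kernel (an illustrative choice of GP prior).
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(X, y, Xq, noise=1e-2):
    # Standard GP posterior mean and variance at the query points Xq.
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(Xq, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mu, np.maximum(var, 1e-12)

def f(x):
    # The unknown black-box objective, observed with noise (toy example).
    return np.sin(3 * x) + 0.1 * np.random.randn(*x.shape)

cells = [(0.0, 1.0)]                  # adaptive partition of the domain
X = np.array([0.5]); y = f(X)
beta = 2.0                            # exploration weight

for t in range(30):
    centers = np.array([(lo + hi) / 2 for lo, hi in cells])
    mu, var = gp_posterior(X, y, centers)
    widths = np.array([hi - lo for lo, hi in cells])
    # UCB per cell plus a diameter bonus, so coarse cells stay explorable.
    i = int(np.argmax(mu + beta * np.sqrt(var) + widths))
    lo, hi = cells.pop(i)
    mid = (lo + hi) / 2
    X = np.append(X, mid); y = np.append(y, f(np.array([mid])))
    cells += [(lo, mid), (mid, hi)]   # refine only the most promising cell

print(f"best query point so far: {X[np.argmax(y)]:.3f}")
```

The computational point of the abstract shows up here directly: each round scores only the current cells rather than a full uniform grid, so the number of posterior evaluations tracks the refinement tree, not the grid resolution.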
Contextual Bandits with Random Projection
Contextual bandits with linear payoffs, which are also known as linear
bandits, provide a powerful alternative for solving practical problems of
sequential decisions, e.g., online advertisements. In the era of big data,
contextual data usually tend to be high-dimensional, which leads to new
challenges for traditional linear bandits mostly designed for the setting of
low-dimensional contextual data. Due to the curse of dimensionality, most
current bandit algorithms face two challenges: high time complexity, and
extremely large regret upper bounds under high-dimensional data. To address
these two challenges, we develop an algorithm of Contextual Bandits via
RAndom Projection (\texttt{CBRAP}) in the setting of linear payoffs, which
works especially for high-dimensional contextual data. The proposed
\texttt{CBRAP} algorithm is time-efficient and flexible, because it enables
players to choose an arm in a low-dimensional space, and relaxes the sparsity
assumption of a constant number of non-zero components made in previous work.
In addition, we provide a linear upper regret bound for the proposed algorithm,
which is associated with the reduced dimension.
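The central trick is straightforward to sketch: fix a random matrix once, project every incoming context into a low-dimensional space, and run a standard LinUCB there, so all per-round statistics are $d \times d$ instead of $D \times D$. The sketch below is our illustrative reconstruction, not the paper's \texttt{CBRAP} pseudocode; the Gaussian projection, dimensions, and confidence width are assumptions.

```python
import numpy as np

np.random.seed(1)
D, d, K, T = 1000, 20, 5, 500     # ambient dim, projected dim, arms, rounds
A = np.random.randn(d, D) / np.sqrt(d)    # fixed random projection matrix

V = np.eye(d)                     # d x d Gram matrix (never D x D)
b = np.zeros(d)
alpha = 1.0                       # confidence width (tuning parameter)

theta_star = np.zeros(D); theta_star[:10] = 1.0   # sparse ground truth

for t in range(T):
    X = np.random.randn(K, D)     # high-dimensional contexts, one per arm
    Z = X @ A.T                   # project once per round: K x d
    Vinv = np.linalg.inv(V)
    theta = Vinv @ b
    # LinUCB scores computed entirely in the projected space.
    ucb = Z @ theta + alpha * np.sqrt(np.einsum('ij,jk,ik->i', Z, Vinv, Z))
    a = int(np.argmax(ucb))
    r = X[a] @ theta_star + 0.1 * np.random.randn()
    V += np.outer(Z[a], Z[a])     # all updates stay d-dimensional
    b += r * Z[a]

print("projected estimate norm:",
      round(float(np.linalg.norm(np.linalg.solve(V, b))), 2))
```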
Action Centered Contextual Bandits
Contextual bandits have become popular as they offer a middle ground between
very simple approaches based on multi-armed bandits and very complex approaches
using the full power of reinforcement learning. They have demonstrated success
in web applications and have a rich body of associated theoretical guarantees.
Linear models are well understood theoretically and preferred by practitioners
because they are not only easily interpretable but also simple to implement and
debug. Furthermore, if the linear model is true, we get very strong performance
guarantees. Unfortunately, in emerging applications in mobile health, the
time-invariant linear model assumption is untenable. We provide an extension of
the linear model for contextual bandits that has two parts: baseline reward and
treatment effect. We allow the former to be complex but keep the latter simple.
We argue that this model is plausible for mobile health applications. At the
same time, it leads to algorithms with strong performance guarantees as in the
linear model setting, while still allowing for complex nonlinear baseline
modeling. Our theory is supported by experiments on data gathered in a recently
concluded mobile health study.
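The baseline/treatment decomposition admits a compact illustration. In the hedged sketch below (our reconstruction of the action-centering idea, not the authors' algorithm), the reward is an arbitrary nonlinear baseline plus a linear treatment effect, and regressing the action-centered target $(a_t - \pi_t) r_t$ on the context recovers the treatment effect without ever modeling the baseline.

```python
import numpy as np

np.random.seed(2)
d, T = 5, 2000
theta_star = np.array([1.0, -0.5, 0.3, 0.0, 0.2])   # true treatment effect

V, b = np.eye(d), np.zeros(d)    # weighted ridge statistics for the effect

def baseline(s):
    # Arbitrarily complex, nonlinear baseline reward; never modeled below.
    return float(np.sin(s).sum() + 0.5 * s[0] * s[1])

for t in range(T):
    s = np.random.randn(d)                       # observed context
    theta = np.linalg.solve(V, b)                # current effect estimate
    # Randomized policy with propensities bounded away from 0 and 1.
    pi = float(np.clip(0.5 + 0.5 * np.tanh(s @ theta), 0.1, 0.9))
    a = int(np.random.rand() < pi)               # a=1: treat, a=0: abstain
    r = baseline(s) + a * (s @ theta_star) + 0.1 * np.random.randn()
    # Action centering: E[(a - pi) * r | s] = pi*(1-pi) * s @ theta_star,
    # so the baseline cancels and the weighted regression is consistent.
    V += pi * (1 - pi) * np.outer(s, s)
    b += (a - pi) * r * s

print("estimated treatment effect:", np.round(np.linalg.solve(V, b), 2))
```

The key line is the centered update: since $\mathbb{E}[a - \pi \mid s] = 0$, the baseline term drops out in expectation, which is exactly why the baseline can be left complex while the treatment effect stays simple.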
Online Clustering of Bandits
We introduce a novel algorithmic approach to content recommendation based on
adaptive clustering of exploration-exploitation ("bandit") strategies. We
provide a sharp regret analysis of this algorithm in a standard stochastic
noise setting, demonstrate its scalability properties, and prove its
effectiveness on a number of artificial and real-world datasets. Our
experiments show a significant increase in prediction performance over
state-of-the-art methods for bandit problems.
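The adaptive-clustering mechanism can be sketched as a graph over users whose edges are deleted as individual estimates drift apart, with arm selection driven by cluster-aggregated statistics. The code below is an illustrative simplification in the spirit of this approach: the fixed edge-deletion threshold stands in for the confidence-width-based rule an actual algorithm would use, and all dimensions and constants are our assumptions.

```python
import numpy as np
from itertools import combinations

np.random.seed(3)
n_users, d, K, T = 8, 5, 10, 3000
# Two latent clusters of users, each sharing one preference vector.
true_theta = [np.ones(d) / np.sqrt(d), -np.ones(d) / np.sqrt(d)]
user_cluster = [0] * 4 + [1] * 4

V = [np.eye(d) for _ in range(n_users)]        # per-user Gram matrices
b = [np.zeros(d) for _ in range(n_users)]
edges = set(combinations(range(n_users), 2))   # start fully connected

def component(u):
    # Users currently reachable from u in the similarity graph (BFS).
    seen, stack = {u}, [u]
    while stack:
        x = stack.pop()
        for i, j in edges:
            if x == i and j not in seen:
                seen.add(j); stack.append(j)
            elif x == j and i not in seen:
                seen.add(i); stack.append(i)
    return seen

for t in range(1, T + 1):
    u = np.random.randint(n_users)
    cl = component(u)
    # Aggregate statistics across u's cluster (keep one shared ridge term).
    Vc = sum(V[i] for i in cl) - (len(cl) - 1) * np.eye(d)
    bc = sum(b[i] for i in cl)
    Vc_inv = np.linalg.inv(Vc)
    X = np.random.randn(K, d) / np.sqrt(d)     # candidate items this round
    ucb = X @ (Vc_inv @ bc) + np.sqrt(np.log(t + 1)) * np.sqrt(
        np.einsum('ij,jk,ik->i', X, Vc_inv, X))
    a = int(np.argmax(ucb))
    r = X[a] @ true_theta[user_cluster[u]] + 0.1 * np.random.randn()
    V[u] += np.outer(X[a], X[a]); b[u] += r * X[a]
    # Delete edges whose endpoint estimates have drifted apart; the fixed
    # threshold stands in for a shrinking confidence-width-based rule.
    theta_u = np.linalg.solve(V[u], b[u])
    for v in component(u) - {u}:
        theta_v = np.linalg.solve(V[v], b[v])
        if np.linalg.norm(theta_u - theta_v) > 0.5:
            edges.discard((min(u, v), max(u, v)))
```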
Structured Stochastic Linear Bandits
The stochastic linear bandit problem proceeds in rounds where at each round
the algorithm selects a vector from a decision set after which it receives a
noisy linear loss parameterized by an unknown vector. The goal in such a
problem is to minimize the (pseudo) regret which is the difference between the
total expected loss of the algorithm and the total expected loss of the best
fixed vector in hindsight. In this paper, we consider settings where the
unknown parameter has structure, e.g., sparse, group sparse, low-rank, which
can be captured by a norm, e.g., the $\ell_1$, group $\ell_{1,2}$, or nuclear norm. We focus on
constructing confidence ellipsoids which contain the unknown parameter across
all rounds with high probability. We show that the radius of such ellipsoids depends
on the Gaussian width of sets associated with the norm capturing the structure.
Such a characterization leads to tighter confidence ellipsoids and, therefore,
sharper regret bounds than existing results, which are based on the ambient
dimensionality.
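The Gaussian width in question is $w(A) = \mathbb{E}_{g \sim N(0, I)}\left[\sup_{a \in A} \langle g, a \rangle\right]$, and the gap it captures is easy to verify numerically. The standard Monte Carlo check below (not from the paper) compares the unit $\ell_2$ ball, whose width grows like $\sqrt{d}$, against the unit $\ell_1$ ball, whose width grows like $\sqrt{2 \log d}$; this is exactly the kind of gap that structure-aware radii exploit.

```python
import numpy as np

np.random.seed(4)
d, n_samples = 200, 2000
G = np.random.randn(n_samples, d)

# Gaussian width w(A) = E[sup_{a in A} <g, a>] for two unit balls.
# Over the l2 ball the supremum is ||g||_2; over the l1 ball it is ||g||_inf
# (the dual norm evaluated at g, attained at a signed basis vector).
w_l2 = np.linalg.norm(G, axis=1).mean()
w_l1 = np.abs(G).max(axis=1).mean()

print(f"l2 ball: {w_l2:.1f}  (sqrt(d) = {np.sqrt(d):.1f})")
print(f"l1 ball: {w_l1:.2f} (sqrt(2 log d) = {np.sqrt(2 * np.log(d)):.2f})")
```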
Algorithms for Linear Bandits on Polyhedral Sets
We study the stochastic linear optimization problem with bandit feedback. The set
of arms takes values in an $N$-dimensional space and belongs to a bounded
polyhedron described by finitely many linear inequalities. We provide a lower
bound for the expected regret that scales as $\Omega(N \log T)$. We then provide
a nearly optimal algorithm and show that its expected regret scales as
$O(N \log^{1+\epsilon} T)$ for an arbitrarily small $\epsilon > 0$. The algorithm
alternates between exploration and exploitation intervals, where a
deterministic set of arms is played in the exploration intervals and a greedily
selected arm is played in the exploitation intervals. We also develop an
algorithm that achieves the optimal regret when the sub-Gaussianity parameter of
the noise term is known. Our key insight is that for a polyhedron the optimal
arm is robust to small perturbations in the reward function. Consequently, a
greedily selected arm is guaranteed to be optimal when the estimation error
falls below some suitable threshold. Our solution resolves a question posed by
Rusmevichientong and Tsitsiklis (2011) that left open the possibility of
efficient algorithms with asymptotic logarithmic regret bounds. We also show
that the regret upper bounds hold with probability one. Our numerical
investigations show that, while the theoretical results are asymptotic, the
performance of our algorithms compares favorably with state-of-the-art
algorithms in finite time as well.
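The alternating structure is simple to sketch: play a fixed, deterministic spanning set of arms during exploration intervals, refit the parameter estimate, then play the greedy vertex of the polyhedron (a linear program) during geometrically growing exploitation intervals. The sketch below uses the unit box as the polyhedron and scipy's LP solver; phase lengths and noise levels are our assumptions, not the paper's schedule.

```python
import numpy as np
from scipy.optimize import linprog

np.random.seed(5)
N, T = 3, 2000
theta_star = np.array([1.0, -0.4, 0.6])   # unknown reward direction
bounds = [(0, 1)] * N                     # polyhedron: the unit box
X_explore = np.eye(N)                     # deterministic spanning set of arms

S, y, total_reward, rounds, phase = [], [], 0.0, 0, 1
while rounds < T:
    # Exploration interval: play the fixed arms, refit the estimate.
    for x in X_explore:
        r = x @ theta_star + 0.1 * np.random.randn()
        S.append(x); y.append(r); total_reward += r; rounds += 1
    theta_hat = np.linalg.lstsq(np.array(S), np.array(y), rcond=None)[0]
    # Exploitation interval (geometrically growing): play the greedy vertex,
    # the LP maximizer of the estimated reward over the polyhedron. The key
    # property is that this vertex is exactly optimal once the estimation
    # error drops below the polyhedron's robustness threshold.
    res = linprog(-theta_hat, bounds=bounds, method="highs")
    for _ in range(min(2 ** phase, T - rounds)):
        total_reward += res.x @ theta_star + 0.1 * np.random.randn()
        rounds += 1
    phase += 1

print("estimate:", np.round(theta_hat, 2), " greedy vertex:", np.round(res.x, 2))
```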
Alternating Linear Bandits for Online Matrix-Factorization Recommendation
We consider the problem of collaborative filtering in the online setting,
where items are recommended to users over time. At each time step,
the user (selected by the environment) consumes an item (selected by the agent)
and provides a rating of the selected item. In this paper, we propose a novel
algorithm for online matrix factorization recommendation that combines linear
bandits and alternating least squares. In this formulation, the bandit feedback
is equal to the difference between the ratings of the best and selected items.
We evaluate the performance of the proposed algorithm over time using both
cumulative regret and average cumulative NDCG. Simulation results over three
synthetic datasets as well as three real-world datasets for online
collaborative filtering indicate the superior performance of the proposed
algorithm over two state-of-the-art online algorithms.
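The combination can be sketched in a few lines: the current item factors act as the arms of a linear bandit for the arriving user's latent factor, and both factor matrices are updated after each rating. In the hedged sketch below (our reconstruction, not the paper's algorithm), the user side is updated by regularized least squares and the item side by a single gradient step standing in for a full alternating least-squares pass; all dimensions and constants are assumptions.

```python
import numpy as np

np.random.seed(6)
n_users, n_items, k, T = 20, 50, 3, 4000
U_true = np.random.randn(n_users, k)
V_true = np.random.randn(n_items, k)
R = U_true @ V_true.T                        # ground-truth rating matrix

U = np.random.randn(n_users, k) * 0.1        # learned user factors
V = np.random.randn(n_items, k) * 0.1        # learned item factors
Gram = [np.eye(k) for _ in range(n_users)]   # per-user design matrices
bvec = [np.zeros(k) for _ in range(n_users)]

for t in range(T):
    u = np.random.randint(n_users)           # environment selects the user
    Ginv = np.linalg.inv(Gram[u])
    # Linear-bandit step: current item factors act as the arms for user u.
    ucb = V @ U[u] + 0.5 * np.sqrt(np.einsum('ij,jk,ik->i', V, Ginv, V))
    i = int(np.argmax(ucb))                  # agent selects the item
    r = R[u, i] + 0.1 * np.random.randn()    # observed rating
    # Alternating updates: ridge regression on the user side, and a single
    # gradient step on the item side standing in for a full ALS pass.
    Gram[u] += np.outer(V[i], V[i]); bvec[u] += r * V[i]
    U[u] = np.linalg.solve(Gram[u], bvec[u])
    V[i] += 0.05 * (r - U[u] @ V[i]) * U[u]

print("rating RMSE of learned factors:",
      round(float(np.sqrt(((U @ V.T - R) ** 2).mean())), 2))
```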
Horde of Bandits using Gaussian Markov Random Fields
The gang of bandits (GOB) model of Cesa-Bianchi et al. (2013) is a recent contextual
bandits framework that shares information between a set of bandit problems,
related by a known (possibly noisy) graph. This model is useful in problems
like recommender systems where the large number of users makes it vital to
transfer information between users. Despite its effectiveness, the existing GOB
model can only be applied to small problems due to its quadratic
time-dependence on the number of nodes. Existing solutions to combat the
scalability issue require an often-unrealistic clustering assumption. By
exploiting a connection to Gaussian Markov random fields (GMRFs), we show that
the GOB model can be made to scale to much larger graphs without additional
assumptions. In addition, we propose a Thompson sampling algorithm which uses
the recent GMRF sampling-by-perturbation technique, allowing it to scale to
even larger problems (leading to a "horde" of bandits). We give regret bounds
and experimental results for GOB with Thompson sampling and epoch-greedy
algorithms, indicating that these methods are as good as or significantly
better than ignoring the graph or adopting a clustering-based approach.
Finally, when an existing graph is not available, we propose a heuristic for
learning it on the fly and show promising results.
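The sampling-by-perturbation trick itself is concrete enough to demonstrate: to draw $\theta \sim N(\Lambda^{-1} b, \Lambda^{-1})$ for a sparse precision $\Lambda$, perturb the right-hand side with noise whose covariance is $\Lambda$ and solve one sparse linear system, never forming $\Lambda^{-1}$. The sketch below uses a chain graph so that $\Lambda = D^\top D + I$ factors through the incidence matrix $D$; the graph and all constants are our assumptions, not the paper's setup.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

np.random.seed(7)
n = 500                                    # number of bandit problems (nodes)
# Chain graph: incidence matrix D, one row per edge, entries (-1, +1).
D = sp.diags([-1, 1], [0, 1], shape=(n - 1, n), format="csr")
Lam = (D.T @ D + sp.eye(n)).tocsr()        # sparse GMRF precision matrix

b = np.random.randn(n)                     # accumulated reward statistics

# Thompson sample theta ~ N(Lam^{-1} b, Lam^{-1}) by perturbation: draw
# z with Cov(z) = D^T D + I = Lam by perturbing each factor of the
# precision, then solve a single sparse SPD system by conjugate gradient.
z = D.T @ np.random.randn(n - 1) + np.random.randn(n)
theta, info = cg(Lam, b + z)
assert info == 0                           # conjugate gradient converged

# theta is one joint posterior sample over all n problems; an agent would
# act greedily on it and fold the observed reward back into b.
print("sampled values for first 5 problems:", np.round(theta[:5], 2))
```

Since $\theta = \Lambda^{-1}(b + z)$ has mean $\Lambda^{-1} b$ and covariance $\Lambda^{-1} \Lambda \Lambda^{-1} = \Lambda^{-1}$, the sample is exact, and the cost is one sparse solve per round rather than a quadratic-time dense operation.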
Estimation Considerations in Contextual Bandits
Contextual bandit algorithms are sensitive to the estimation method of the
outcome model as well as the exploration method used, particularly in the
presence of rich heterogeneity or complex outcome models, which can lead to
difficult estimation problems along the path of learning. We study a
consideration for the exploration vs. exploitation framework that does not
arise in multi-armed bandits but is crucial in contextual bandits: the way
exploration and exploitation are conducted in the present affects the bias and
variance in the potential outcome model estimation in subsequent stages of
learning. We develop parametric and non-parametric contextual bandits that
integrate balancing methods from the causal inference literature in their
estimation to make them less prone to problems of estimation bias. We provide the
first regret bound analyses for contextual bandits with balancing in the domain
of linear contextual bandits that match the state of the art regret bounds. We
demonstrate the strong practical advantage of balanced contextual bandits on a
large number of supervised learning datasets and on a synthetic example that
simulates model mis-specification and prejudice in the initial training data.
Additionally, we develop contextual bandits with simpler assignment policies by
leveraging sparse model estimation methods from the econometrics literature and
demonstrate empirically that, in the early stages, they can improve the rate of
learning and decrease regret.
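One standard balancing device from the causal inference literature is inverse-propensity weighting, which we use for the illustrative sketch below (the paper's estimators may differ): an $\epsilon$-greedy policy gives every arm a known nonzero propensity, and weighting each regression update by the inverse propensity counteracts the sampling bias that adaptive data collection induces.

```python
import numpy as np

np.random.seed(8)
d, K, T = 5, 3, 3000
theta_star = [np.random.randn(d) for _ in range(K)]   # per-arm parameters

# Per-arm weighted ridge statistics; weights are inverse propensities.
V = [np.eye(d) for _ in range(K)]
b = [np.zeros(d) for _ in range(K)]

for t in range(T):
    x = np.random.randn(d)
    est = np.array([x @ np.linalg.solve(V[a], b[a]) for a in range(K)])
    # Epsilon-greedy assigns every arm a known, nonzero propensity.
    eps = 0.1
    probs = np.full(K, eps / K)
    probs[int(np.argmax(est))] += 1 - eps
    a = int(np.random.choice(K, p=probs))
    r = x @ theta_star[a] + 0.1 * np.random.randn()
    # Balancing: weight the update by 1/propensity, as in importance-weighted
    # regression, so contexts on which an arm is rarely chosen are not
    # systematically under-represented in that arm's fit.
    w = 1.0 / probs[a]
    V[a] += w * np.outer(x, x)
    b[a] += w * r * x

print("estimated arm-0 parameter:", np.round(np.linalg.solve(V[0], b[0]), 2))
print("true arm-0 parameter:     ", np.round(theta_star[0], 2))
```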
Stochastic Process Bandits: Upper Confidence Bounds Algorithms via Generic Chaining
The paper considers the problem of global optimization in the setup of
stochastic process bandits. We introduce a UCB algorithm which builds a
cascade of discretization trees based on generic chaining in order to make it
applicable over a continuous domain. The theoretical framework
applies to functions under weak probabilistic smoothness assumptions and also
significantly extends the scope of application of UCB strategies. Moreover,
generic regret bounds are derived which are then specialized to Gaussian
processes indexed on infinite-dimensional spaces as well as to quadratic forms
of Gaussian processes. Lower bounds are also proved in the case of Gaussian
processes to assess the optimality of the proposed algorithm.
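The cascade-of-discretizations idea can be illustrated under a much cruder assumption than generic chaining, namely a known Lipschitz bound: maintain a set of cells, keep only those whose optimistic value (empirical mean plus noise width plus a diameter term) can still beat the best pessimistic value, and refine the survivors at the next resolution. The sketch below is ours, not the paper's algorithm, and every constant in it is an assumption.

```python
import numpy as np

np.random.seed(9)

def f(x):
    # Unknown objective, observed with noise (toy example).
    return -(x - 0.37) ** 2 + 0.05 * np.random.randn()

lip = 2.0        # assumed smoothness: |f(x) - f(y)| <= lip * |x - y|
n_obs = 10       # noisy evaluations per cell center
sigma = 0.05

cells = [(0.0, 1.0)]
for level in range(12):
    stats = []
    for lo, hi in cells:
        c = (lo + hi) / 2
        mean = np.mean([f(c) for _ in range(n_obs)])
        conf = 2 * sigma / np.sqrt(n_obs)          # noise confidence width
        diam = lip * (hi - lo) / 2                 # smoothness (diameter) term
        stats.append((mean, conf, diam, lo, hi))
    # Best pessimistic value among cells at this resolution level.
    best_lcb = max(m - c for m, c, _, _, _ in stats)
    # Keep cells whose optimistic value can still beat it, then refine them.
    cells = []
    for m, c, dm, lo, hi in stats:
        if m + c + dm >= best_lcb:
            mid = (lo + hi) / 2
            cells += [(lo, mid), (mid, hi)]

print("remaining region around the maximizer:",
      round(min(lo for lo, _ in cells), 3), "to",
      round(max(hi for _, hi in cells), 3))
```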