Algorithms with Logarithmic or Sublinear Regret for Constrained Contextual Bandits
We study contextual bandits with budget and time constraints, referred to as
constrained contextual bandits. The time and budget constraints significantly
complicate the exploration and exploitation tradeoff because they introduce
complex coupling among contexts over time. Such coupling effects make it
difficult to obtain oracle solutions that assume known statistics of bandits.
To gain insight, we first study unit-cost systems with known context
distribution. When the expected rewards are known, we develop an approximation
of the oracle, referred to as Adaptive-Linear-Programming (ALP), which achieves
near-optimality and only requires the ordering of expected rewards. With these
highly desirable features, we then combine ALP with the upper-confidence-bound
(UCB) method in the general case where the expected rewards are unknown a
priori. We show that the proposed UCB-ALP algorithm achieves logarithmic
regret except for certain boundary cases. Further, we design algorithms and
obtain similar regret analysis results for more general systems with unknown
context distribution and heterogeneous costs. To the best of our knowledge,
this is the first work that shows how to achieve logarithmic regret in
constrained contextual bandits. Moreover, this work also sheds light on the
study of computationally efficient algorithms for general constrained
contextual bandits.
Comment: 36 pages, 4 figures; accepted by the 29th Annual Conference on Neural
Information Processing Systems (NIPS), Montréal, Canada, Dec. 201
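As a rough illustration of the unit-cost setting with known context distribution and known expected rewards described in the abstract above, the sketch below solves the small linear program that an ALP-style rule could use at each step: with average remaining budget rho = remaining_budget / remaining_time, it takes the action with probability 1 for the highest-reward contexts and with a fractional probability for the context at the threshold. The function name, parameter names, and the greedy solution are illustrative assumptions, not the authors' published implementation. Only the ordering of the rewards matters, which matches the property stated in the abstract.

```python
def alp_probabilities(rewards, context_probs, remaining_budget, remaining_time):
    """Hypothetical helper: per-context probability of taking the unit-cost action.

    Maximizes sum_j context_probs[j] * p[j] * rewards[j]
    subject to sum_j context_probs[j] * p[j] <= rho and 0 <= p[j] <= 1,
    where rho is the average remaining budget per remaining round.
    Assumes nonnegative expected rewards.
    """
    if remaining_time <= 0:
        raise ValueError("remaining_time must be positive")
    # Average budget per round, clipped to [0, 1] because each action costs one unit.
    rho = min(1.0, max(0.0, remaining_budget / remaining_time))
    # Only the ordering of expected rewards is used (highest reward first).
    order = sorted(range(len(rewards)), key=lambda j: rewards[j], reverse=True)
    probs = [0.0] * len(rewards)
    left = rho
    for j in order:
        if left <= 1e-12:
            break  # average budget exhausted; remaining contexts are skipped
        if context_probs[j] <= 0:
            continue  # context never occurs, so it consumes no budget
        take = min(1.0, left / context_probs[j])
        probs[j] = take
        left -= take * context_probs[j]
    return probs


if __name__ == "__main__":
    # Three contexts; with budget 4 over 10 rounds the average budget is 0.4.
    print(alp_probabilities([0.2, 0.9, 0.5], [0.5, 0.3, 0.2], 4, 10))
```

In a UCB-style variant, the known rewards would be replaced by upper-confidence estimates before sorting; that step is omitted here.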
Linear Contextual Bandits with Knapsacks
We consider the linear contextual bandit problem with resource consumption,
in addition to reward generation. In each round, the outcome of pulling an arm
is a reward as well as a vector of resource consumptions. The expected values
of these outcomes depend linearly on the context of that arm. The
budget/capacity constraints require that the total consumption doesn't exceed
the budget for each resource. The objective is once again to maximize the total
reward. This problem turns out to be a common generalization of classic linear
contextual bandits (linContextual), bandits with knapsacks (BwK), and the
online stochastic packing problem (OSPP). We present algorithms with
near-optimal regret bounds for this problem. Our bounds compare favorably to
results on the unstructured version of the problem where the relation between
the contexts and the outcomes could be arbitrary, but the algorithm only
competes against a fixed set of policies accessible through an optimization
oracle. We combine techniques from the work on linContextual, BwK, and OSPP in
a nontrivial manner while also tackling new difficulties that are not present
in any of these special cases.
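One way to write down the model described in the abstract above is sketched here. The symbols are notational assumptions and do not appear in the abstract: $x_t(a)$ is the context of arm $a$ in round $t$, $\mu_*$ and $W_*$ are the unknown reward vector and consumption matrix, $\mathbf{v}_t(a)$ is the resource consumption vector, and $\mathbf{B}$ is the vector of per-resource budgets.

\[
\mathbb{E}\left[r_t(a) \mid x_t(a)\right] = \mu_*^\top x_t(a),
\qquad
\mathbb{E}\left[\mathbf{v}_t(a) \mid x_t(a)\right] = W_*^\top x_t(a),
\]
\[
\text{maximize } \sum_{t=1}^{T} r_t(a_t)
\quad \text{subject to} \quad
\sum_{t=1}^{T} \mathbf{v}_t(a_t) \le \mathbf{B} \ \text{(componentwise)}.
\]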
BISTRO: An Efficient Relaxation-Based Method for Contextual Bandits
We present efficient algorithms for the problem of contextual bandits with
i.i.d. covariates, an arbitrary sequence of rewards, and an arbitrary class of
policies. Our algorithm BISTRO requires d calls to the empirical risk
minimization (ERM) oracle per round, where d is the number of actions. The
method uses unlabeled data to make the problem computationally simple. When the
ERM problem itself is computationally hard, we extend the approach by employing
multiplicative approximation algorithms for the ERM. The integrality gap of the
relaxation only enters in the regret bound rather than the benchmark. Finally,
we show that the adversarial version of the contextual bandit problem is
learnable (and efficient) whenever the full-information supervised online
learning problem has a non-trivial regret guarantee (and efficient).
An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives
We consider a contextual version of multi-armed bandit problem with global
knapsack constraints. In each round, the outcome of pulling an arm is a scalar
reward and a resource consumption vector, both dependent on the context, and
the global knapsack constraints require the total consumption for each resource
to be below some pre-fixed budget. The learning agent competes with an
arbitrary set of context-dependent policies. This problem was introduced by
Badanidiyuru et al. (2014), who gave a computationally inefficient algorithm
with near-optimal regret bounds for it. We give a computationally efficient
algorithm for this problem with slightly better regret bounds, by generalizing
the approach of Agarwal et al. (2014) for the non-constrained version of the
problem. The computational time of our algorithm scales logarithmically in the
size of the policy space. This answers the main open question of Badanidiyuru
et al. (2014). We also extend our results to a variant where there are no
knapsack constraints but the objective is an arbitrary Lipschitz concave
function of the sum of outcome vectors.
Comment: Extended abstract appeared in COLT 201
Resourceful Contextual Bandits
We study contextual bandits with ancillary constraints on resources, which
are common in real-world applications such as choosing ads or dynamic pricing
of items. We design the first algorithm for solving these problems that handles
constrained resources other than time, and improves over a trivial reduction to
the non-contextual case. We consider very general settings for both contextual
bandits (arbitrary policy sets, e.g. Dudik et al. (UAI'11)) and bandits with
resource constraints (bandits with knapsacks, Badanidiyuru et al. (FOCS'13)),
and prove a regret guarantee with near-optimal statistical properties.
Comment: This is the full version of a paper in COLT 2014. Version history:
(v2) Added some details to one of the proofs, (v3) a big revision following
comments from COLT reviewers (but no new results), (v4) edits in related
work, minor edits elsewhere. (v6) A correction for Theorem 3, corollary for
contextual dynamic pricing with discretization; updated follow-up work & open
question
Combinatorial Semi-Bandits with Knapsacks
We unify two prominent lines of work on multi-armed bandits: bandits with
knapsacks (BwK) and combinatorial semi-bandits. The former concerns limited
"resources" consumed by the algorithm, e.g., limited supply in dynamic pricing.
The latter allows a huge number of actions but assumes combinatorial structure
and additional feedback to make the problem tractable. We define a common
generalization, support it with several motivating examples, and design an
algorithm for it. Our regret bounds are comparable with those for BwK and
combinatorial semi-bandits.
Exploration-Exploitation Trade-off in Reinforcement Learning on Online Markov Decision Processes with Global Concave Rewards
We consider an agent who is involved in a Markov decision process and
receives a vector of outcomes every round. Her objective is to maximize a
global concave reward function on the average vectorial outcome. The problem
models applications such as multi-objective optimization, maximum entropy
exploration, and constrained optimization in Markovian environments. In our
general setting where a stationary policy could have multiple recurrent
classes, the agent faces a subtle yet consequential trade-off in alternating
among different actions for balancing the vectorial outcomes. In particular,
stationary policies are in general sub-optimal. We propose a no-regret
algorithm based on online convex optimization (OCO) tools (Agrawal and Devanur
2014) and UCRL2 (Jaksch et al. 2010). Importantly, we introduce a novel
gradient threshold procedure, which carefully controls the switches among
actions to handle the subtle trade-off. By delaying the gradient updates, our
procedure produces a non-stationary policy that diversifies the outcomes for
optimizing the objective. The procedure is compatible with a variety of OCO
tools.
Comment: 54 pages, 1 figure
Personalized Advertisement Recommendation: A Ranking Approach to Address the Ubiquitous Click Sparsity Problem
We study the problem of personalized advertisement recommendation (PAR),
which consists of a user visiting a system (website) and the system displaying
one of several ads to the user. The system uses an internal ad recommendation
policy to map the user's profile (context) to one of the ads. The user either
clicks or ignores the ad and correspondingly, the system updates its
recommendation policy. The PAR problem is usually tackled by scalable
contextual bandit algorithms, where the policies are generally based on
classifiers. A practical problem in PAR is extreme click sparsity, due to very
few users actually clicking on ads. We systematically study the drawback of
using contextual bandit algorithms with classifier-based policies in the face
of extreme click sparsity. We then suggest an alternate policy, based on
rankers, learnt by optimizing the Area Under the Curve (AUC) ranking loss,
which can significantly alleviate the problem of click sparsity. We conduct
extensive experiments on public datasets, as well as three industry proprietary
datasets, to illustrate the improvement in click-through-rate (CTR) obtained by
using the ranker-based policy over classifier-based policies.
Comment: Under review
Deep Neural Linear Bandits: Overcoming Catastrophic Forgetting through Likelihood Matching
We study the neural-linear bandit model for solving sequential
decision-making problems with high dimensional side information. Neural-linear
bandits leverage the representation power of deep neural networks and combine
it with efficient exploration mechanisms, designed for linear contextual
bandits, on top of the last hidden layer. Since the representation is being
optimized during learning, information regarding exploration with "old"
features is lost. Here, we propose the first limited memory neural-linear
bandit that is resilient to this phenomenon, which we term catastrophic
forgetting. We evaluate our method on a variety of real-world data sets,
including regression, classification, and sentiment analysis, and observe that
our algorithm is resilient to catastrophic forgetting and achieves superior
performance.
ADARES: Adaptive Resource Management for Virtual Machines
Virtual execution environments allow for consolidation of multiple
applications onto the same physical server, thereby enabling more efficient use
of server resources. However, users often statically configure the resources of
virtual machines through guesswork, resulting in either insufficient resource
allocations that hinder VM performance, or excessive allocations that waste
precious data center resources. In this paper, we first characterize real-world
resource allocation and utilization of VMs through the analysis of an extensive
dataset, consisting of more than 250k VMs from over 3.6k private enterprise
clusters. Our large-scale analysis confirms that VMs are often misconfigured,
either overprovisioned or underprovisioned, and that this problem is pervasive
across a wide range of private clusters. We then propose ADARES, an adaptive
system that dynamically adjusts VM resources using machine learning techniques.
In particular, ADARES leverages the contextual bandits framework to effectively
manage the adaptations. Our system exploits easily collectible data, at the
cluster, node, and VM levels, to make more sensible allocation decisions, and
uses transfer learning to safely explore the configuration space and speed up
training. Our empirical evaluation shows that ADARES can significantly improve
system utilization without sacrificing performance. For instance, when compared
to threshold and prediction-based baselines, it achieves more predictable
VM-level performance and also reduces the number of virtual CPUs and the amount
of memory provisioned by up to 35% and 60%, respectively, for synthetic
workloads on real clusters.