5 research outputs found
Budget-Constrained Bandits over General Cost and Reward Distributions
We consider a budget-constrained bandit problem where each arm pull incurs a
random cost, and yields a random reward in return. The objective is to maximize
the total expected reward under a budget constraint on the total cost. The
model is general in the sense that it allows correlated and potentially
heavy-tailed cost-reward pairs that can take on negative values as required by
many applications. We show that if moments of order 2+γ for some γ > 0
exist for all cost-reward pairs, O(log B) regret is achievable
for a budget B > 0. In order to achieve tight regret bounds, we propose
algorithms that exploit the correlation between the cost and reward of each arm
by extracting the common information via linear minimum mean-square error
estimation. We prove a regret lower bound for this problem, and show that the
proposed algorithms achieve tight problem-dependent regret bounds, which are
optimal up to a universal constant factor in the case of jointly Gaussian cost
and reward pairs.
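The linear minimum mean-square error (LMMSE) idea described above can be sketched in a toy form: for a jointly Gaussian cost-reward pair, the observed cost carries information about the unobserved reward, and conditioning on it shrinks the estimator's residual variance. All parameters below are invented for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical jointly Gaussian cost-reward pair for a single arm
# (illustrative parameters, not the paper's).
mu_c, mu_r = 1.0, 2.0
var_c, var_r, cov_cr = 1.0, 1.0, 0.8
n = 10_000
cov = np.array([[var_c, cov_cr],
                [cov_cr, var_r]])
samples = rng.multivariate_normal([mu_c, mu_r], cov, size=n)
costs, rewards = samples[:, 0], samples[:, 1]

# LMMSE estimate of the reward given the observed cost:
#   R_hat = mu_r + (cov_cr / var_c) * (C - mu_c)
r_hat = mu_r + (cov_cr / var_c) * (costs - mu_c)

# Residual variance drops from var_r to var_r - cov_cr**2 / var_c
# (here 1.0 -> 0.36), which is the gain from exploiting correlation.
print(np.var(rewards - r_hat))
```

The same extraction of common information is what lets a budget-constrained learner tighten its confidence intervals faster than treating cost and reward independently.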
Continuous-Time Multi-Armed Bandits with Controlled Restarts
Time-constrained decision processes have been ubiquitous in many fundamental
applications in physics, biology and computer science. Recently, restart
strategies have gained significant attention for boosting the efficiency of
time-constrained processes by expediting the completion times. In this work, we
investigate the bandit problem with controlled restarts for time-constrained
decision processes, and develop provably good learning algorithms. In
particular, we consider a bandit setting where each decision takes a random
completion time, and yields a random and correlated reward at the end, with
unknown values at the time of decision. The goal of the decision-maker is to
maximize the expected total reward subject to a time constraint τ. As an
additional control, we allow the decision-maker to interrupt an ongoing task
and forgo its reward for a potentially more rewarding alternative. For this
problem, we develop efficient online learning algorithms with O(log τ)
and O(√(τ log τ)) regret in a finite and continuous action space
of restart strategies, respectively. We demonstrate the applicability of our
algorithm by using it to boost the performance of SAT solvers.
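The benefit of controlled restarts can be illustrated with a minimal simulation, assuming heavy-tailed task durations (a hypothetical Pareto model, not the paper's): interrupting slow tasks and forgoing their reward can complete more tasks within a fixed time budget than never restarting.

```python
import numpy as np

rng = np.random.default_rng(1)

def completions_per_budget(cutoff, budget=10_000.0):
    """Run tasks back-to-back with heavy-tailed random durations.
    Any task exceeding `cutoff` is interrupted (its reward forgone)
    and a fresh task is started. Returns the number of tasks
    completed within the time budget. Illustrative model only."""
    t, done = 0.0, 0
    while t < budget:
        d = rng.pareto(1.5) + 0.1       # random completion time
        if d <= cutoff:
            t += d                       # task finishes in time
            done += 1
        else:
            t += cutoff                  # interrupt and restart
    return done

# A small finite grid of restart strategies a learner might explore.
for c in (0.5, 2.0, np.inf):
    print(c, completions_per_budget(c))
```

With a Pareto tail, a moderate cutoff completes noticeably more tasks than `np.inf` (no restarts); a bandit over restart strategies learns such a cutoff online without knowing the duration distribution.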
Group-Fair Online Allocation in Continuous Time
The theory of discrete-time online learning has been successfully applied in
many problems that involve sequential decision-making under uncertainty.
However, in many applications including contractual hiring in online
freelancing platforms and server allocation in cloud computing systems, the
outcome of each action is observed only after a random and action-dependent
time. Furthermore, as a consequence of certain ethical and economic concerns,
the controller may impose deadlines on the completion of each task, and require
fairness across different groups in the allocation of the total time budget B.
In order to address these applications, we consider a continuous-time online
learning problem with fairness considerations, and present a novel framework
based on continuous-time utility maximization. We show that this formulation
recovers reward-maximizing, max-min fair and proportionally fair allocation
rules across different groups as special cases. We characterize the optimal
offline policy, which allocates the total time between different actions in an
optimally fair way (as defined by the utility function), and imposes deadlines
to maximize time-efficiency. In the absence of any statistical knowledge, we
propose a novel online learning algorithm based on dual ascent optimization for
time averages, and prove a bound on its regret.
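The dual-ascent idea behind such utility-maximizing allocation can be sketched on a static toy instance: proportionally fair division of a time budget across groups via subgradient ascent on the dual variable of the budget constraint. The weights, budget, and step size below are invented for illustration and this is not the paper's online algorithm.

```python
import numpy as np

# Proportional fairness: maximize sum_g w_g * log(x_g)
# subject to sum_g x_g <= B, solved by dual ascent.
w = np.array([1.0, 2.0, 3.0])   # hypothetical group weights
B = 6.0                          # total time budget
lam, step = 0.3, 0.1             # dual variable and step size
for _ in range(2000):
    x = w / lam                                  # primal argmax of w_g*log(x) - lam*x
    lam = max(lam + step * (x.sum() - B), 1e-6)  # dual (sub)gradient ascent
print(x)  # → close to w * B / w.sum() = [1, 2, 3]
```

The closed-form optimum x_g = w_g · B / Σw shows the allocation is proportional to the weights; other utility functions in the same template recover reward-maximizing or max-min fair rules.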
Federated Bandit: A Gossiping Approach
In this paper, we study \emph{Federated Bandit}, a decentralized Multi-Armed
Bandit problem with a set of N agents, who can only communicate their local
data with neighbors described by a connected graph G. Each agent makes a
sequence of decisions on selecting an arm from M candidates, yet they only
have access to local and potentially biased feedback/evaluation of the true
reward for each action taken. Learning only locally will lead agents to
sub-optimal actions, while converging to a no-regret strategy requires
collecting distributed data. Motivated by the proposal of federated
learning, we aim for a solution with which agents will never share their local
observations with a central entity, and will be allowed to only share a private
copy of their own information with their neighbors. We first propose a
decentralized bandit algorithm Gossip_UCB, which is a coupling of variants of
both the classical gossiping algorithm and the celebrated Upper Confidence
Bound (UCB) bandit algorithm. We show that Gossip_UCB successfully adapts local
bandit learning into a global gossiping process for sharing information among
connected agents, and achieves guaranteed regret of order
O(max{poly(N, M) log T, poly(N, M) log_{1/λ₂} N}) for
all agents, where λ₂ ∈ (0, 1) is the second largest eigenvalue of
the expected gossip matrix, which is a function of G. We then propose
Fed_UCB, a differentially private version of Gossip_UCB, in which the agents
preserve ε-differential privacy of their local data while achieving a
comparable regret guarantee.
Comment: Accepted by ACM SIGMETRICS 2021.
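The gossiping component can be illustrated in isolation (this is plain gossip averaging on a toy ring graph, not Gossip_UCB itself): agents repeatedly mix their local estimates with neighbors' through a doubly stochastic matrix W, and the consensus error contracts geometrically at a rate governed by the second largest eigenvalue of W.

```python
import numpy as np

# Gossip averaging on a 4-agent ring: each agent keeps half its own
# estimate and takes a quarter from each neighbor. W is symmetric and
# doubly stochastic, so repeated mixing converges to the global mean.
W = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
x = np.array([1.0, 0.0, 0.0, 0.0])   # biased local estimates
for _ in range(50):
    x = W @ x                         # one gossip round
lam2 = sorted(abs(np.linalg.eigvals(W)))[-2]
print(x, lam2)   # x -> [0.25]*4; lam2 = 0.5 sets the mixing rate
```

A smaller λ₂ (better-connected graph) means fewer gossip rounds are needed before every agent's estimate is close to the network-wide average, which is why λ₂ appears in the regret bound.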
Fast Learning for Renewal Optimization in Online Task Scheduling
This paper considers online optimization of a renewal-reward system. A
controller performs a sequence of tasks back-to-back. Each task has a random
vector of parameters, called the task type vector, that affects the task
processing options and also affects the resulting reward and time duration of
the task. The probability distribution for the task type vector is unknown and
the controller must learn to make efficient decisions so that time average
reward converges to optimality. Prior work on such renewal optimization
problems leaves open the question of optimal convergence time. This paper
develops an algorithm with an optimality gap that decays like O(1/√k),
where k is the number of tasks processed. The same algorithm is shown to have
faster O(log(k)/k) performance when the system satisfies a strong concavity
property. The proposed algorithm uses an auxiliary variable that is updated
according to a classic Robbins-Monro iteration. It makes online scheduling
decisions at the start of each renewal frame based on this variable and on the
observed task type. A matching converse is obtained for the strongly concave
case by constructing an example system for which all algorithms have
performance at best Ω(log(k)/k). A matching
converse is also shown for the general case without strong concavity.
Comment: 32 pages, 9 figures.
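The auxiliary-variable idea can be sketched with a toy two-option task model (the options, noise levels, and step sizes below are invented, not the paper's): each frame, the controller picks the option maximizing reward minus θ times duration, then nudges θ with a classic Robbins-Monro step so that it tracks the best achievable time-average reward rate.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two hypothetical processing options per task: (mean reward, mean duration).
# Their reward rates are 1.0/1.0 = 1.0 and 1.8/2.0 = 0.9, so the optimal
# time-average reward is 1.0.
options = [(1.0, 1.0), (1.8, 2.0)]

theta = 0.0                          # auxiliary variable (rate estimate)
for k in range(1, 20_001):
    # Greedy frame decision: maximize the surplus reward - theta * duration.
    r_mean, t_mean = max(options, key=lambda o: o[0] - theta * o[1])
    r = r_mean + 0.1 * rng.standard_normal()          # observed reward
    t = max(t_mean + 0.1 * rng.standard_normal(), 0.1)  # observed duration
    theta += (r - theta * t) / k     # Robbins-Monro update
print(theta)  # ≈ 1.0, the rate of the better option
```

At the fixed point E[r − θt] = 0 for the chosen option, i.e. θ equals that option's reward rate, and the surplus rule makes the higher-rate option the stable choice.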