5 research outputs found

    Budget-Constrained Bandits over General Cost and Reward Distributions

    We consider a budget-constrained bandit problem where each arm pull incurs a random cost, and yields a random reward in return. The objective is to maximize the total expected reward under a budget constraint on the total cost. The model is general in the sense that it allows correlated and potentially heavy-tailed cost-reward pairs that can take on negative values as required by many applications. We show that if moments of order $(2+\gamma)$ for some $\gamma > 0$ exist for all cost-reward pairs, $O(\log B)$ regret is achievable for a budget $B > 0$. In order to achieve tight regret bounds, we propose algorithms that exploit the correlation between the cost and reward of each arm by extracting the common information via linear minimum mean-square error estimation. We prove a regret lower bound for this problem, and show that the proposed algorithms achieve tight problem-dependent regret bounds, which are optimal up to a universal constant factor in the case of jointly Gaussian cost and reward pairs.
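
    A minimal, illustrative sketch of the budget-constrained setting described above (not the paper's algorithm): pulls draw correlated (cost, reward) pairs, spending is charged against the budget, and arms are ranked by an optimistic reward-per-unit-cost index. The distributions, constants, and the names pull and ratio_ucb are assumptions for illustration; the paper's LMMSE coupling of cost and reward is omitted.

        import numpy as np

        rng = np.random.default_rng(0)

        # Toy budget-constrained bandit: each pull of arm i draws a correlated
        # (cost, reward) pair.  Means and covariances are illustrative only.
        MEANS = [(1.0, 0.8), (1.2, 1.1), (0.9, 0.5)]   # (mean cost, mean reward)
        RHO = 0.6                                       # cost-reward correlation

        def pull(i):
            mc, mr = MEANS[i]
            cov = [[0.1, RHO * 0.1], [RHO * 0.1, 0.1]]
            c, r = rng.multivariate_normal([mc, mr], cov)
            return max(c, 0.05), r                      # keep the cost positive

        def ratio_ucb(budget, n_arms=3, alpha=2.0):
            """Spend the budget on the arm with the highest optimistic
            reward-per-unit-cost index (empirical ratio plus exploration bonus)."""
            cost_sum = np.zeros(n_arms)
            rew_sum = np.zeros(n_arms)
            pulls = np.zeros(n_arms)
            spent, total_reward, t = 0.0, 0.0, 0
            while spent < budget:
                t += 1
                if t <= n_arms:                         # pull every arm once first
                    i = t - 1
                else:
                    ratio = rew_sum / cost_sum
                    bonus = alpha * np.sqrt(np.log(t) / pulls)
                    i = int(np.argmax(ratio + bonus))
                c, r = pull(i)
                spent += c
                total_reward += r
                cost_sum[i] += c
                rew_sum[i] += r
                pulls[i] += 1
            return total_reward

        print(ratio_ucb(budget=200.0))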

    Continuous-Time Multi-Armed Bandits with Controlled Restarts

    Time-constrained decision processes have been ubiquitous in many fundamental applications in physics, biology and computer science. Recently, restart strategies have gained significant attention for boosting the efficiency of time-constrained processes by expediting the completion times. In this work, we investigate the bandit problem with controlled restarts for time-constrained decision processes, and develop provably good learning algorithms. In particular, we consider a bandit setting where each decision takes a random completion time, and yields a random and correlated reward at the end, with unknown values at the time of decision. The goal of the decision-maker is to maximize the expected total reward subject to a time constraint $\tau$. As an additional control, we allow the decision-maker to interrupt an ongoing task and forgo its reward for a potentially more rewarding alternative. For this problem, we develop efficient online learning algorithms with $O(\log(\tau))$ and $O(\sqrt{\tau\log(\tau)})$ regret in a finite and continuous action space of restart strategies, respectively. We demonstrate the applicability of our algorithm by using it to boost the performance of SAT solvers.
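
    As a rough illustration of the finite action space of restart strategies, the sketch below learns over a small menu of restart cutoffs under a total time budget: a task that exceeds its cutoff is interrupted and its reward forgone, and a UCB index on each cutoff's empirical reward rate drives selection. The log-normal completion times, the cutoff menu, and the names run_task and ucb_restarts are illustrative assumptions, not the paper's construction.

        import numpy as np

        rng = np.random.default_rng(1)

        # Toy time-constrained process: heavy-tailed (log-normal) completion
        # times, reward 1 on completion.  Values are illustrative only.
        def run_task():
            return rng.lognormal(mean=0.0, sigma=1.5)   # random completion time

        CUTOFFS = [0.5, 1.0, 2.0, 4.0, np.inf]          # finite restart strategies

        def ucb_restarts(tau, alpha=2.0):
            """UCB over a finite menu of restart cutoffs, estimating each
            cutoff's reward rate from the time charged to it."""
            k = len(CUTOFFS)
            time_used = np.zeros(k)      # total time charged to each cutoff
            reward = np.zeros(k)         # completions achieved with each cutoff
            pulls = np.zeros(k)
            elapsed, total_reward, t = 0.0, 0.0, 0
            while elapsed < tau:
                t += 1
                if t <= k:                               # try each cutoff once
                    i = t - 1
                else:
                    rate = reward / np.maximum(time_used, 1e-9)
                    bonus = alpha * np.sqrt(np.log(t) / pulls)
                    i = int(np.argmax(rate + bonus))
                c = CUTOFFS[i]
                d = run_task()
                if d <= c:                               # task completed in time
                    spent, r = d, 1.0
                else:                                    # interrupt, forgo reward
                    spent, r = c, 0.0
                elapsed += spent
                total_reward += r
                time_used[i] += spent
                reward[i] += r
                pulls[i] += 1
            return total_reward

        print(ucb_restarts(tau=500.0))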

    Group-Fair Online Allocation in Continuous Time

    The theory of discrete-time online learning has been successfully applied in many problems that involve sequential decision-making under uncertainty. However, in many applications including contractual hiring in online freelancing platforms and server allocation in cloud computing systems, the outcome of each action is observed only after a random and action-dependent time. Furthermore, as a consequence of certain ethical and economic concerns, the controller may impose deadlines on the completion of each task, and require fairness across different groups in the allocation of the total time budget $B$. In order to address these applications, we consider a continuous-time online learning problem with fairness considerations, and present a novel framework based on continuous-time utility maximization. We show that this formulation recovers reward-maximizing, max-min fair and proportionally fair allocation rules across different groups as special cases. We characterize the optimal offline policy, which allocates the total time between different actions in an optimally fair way (as defined by the utility function), and imposes deadlines to maximize time-efficiency. In the absence of any statistical knowledge, we propose a novel online learning algorithm based on dual ascent optimization for time averages, and prove that it achieves an $\tilde{O}(B^{-1/2})$ regret bound.
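
    The offline characterization can be illustrated numerically: split a known time budget across groups with known reward rates so as to maximize a concave utility, which recovers the reward-maximizing and proportionally fair rules as special cases (max-min fairness and deadlines are omitted). The rates, budget, and projected-gradient solver below are assumptions for illustration and are not the paper's dual-ascent online algorithm.

        import numpy as np

        # Offline fair split of a time budget B across groups with known reward
        # rates, maximizing sum_g U(rate_g * b_g).  Rates and budget are
        # illustrative; max-min fairness is not shown.
        RATES = np.array([1.0, 2.0, 4.0])      # reward per unit time, per group
        B = 10.0

        # Marginal utility d/dx U(x) for the two special cases illustrated here.
        GRADS = {
            "reward_max": lambda x: np.ones_like(x),   # U(x) = x
            "prop_fair":  lambda x: 1.0 / x,           # U(x) = log x
        }

        def project_simplex(v, z):
            """Euclidean projection of v onto {b >= 0, sum(b) = z}."""
            u = np.sort(v)[::-1]
            css = np.cumsum(u) - z
            rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
            theta = css[rho] / (rho + 1.0)
            return np.maximum(v - theta, 0.0)

        def fair_split(grad, steps=5000, lr=0.01):
            """Projected gradient ascent on the time allocation b."""
            b = np.full(len(RATES), B / len(RATES))
            for _ in range(steps):
                g = grad(RATES * b) * RATES            # d/db_g U(rate_g * b_g)
                b = project_simplex(b + lr * g, B)
                b = np.maximum(b, 1e-6)                # keep log utility finite
            return b

        for name, grad in GRADS.items():
            print(name, np.round(fair_split(grad), 3))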

    Federated Bandit: A Gossiping Approach

    In this paper, we study \emph{Federated Bandit}, a decentralized Multi-Armed Bandit problem with a set of $N$ agents, who can only communicate their local data with neighbors described by a connected graph $G$. Each agent makes a sequence of decisions on selecting an arm from $M$ candidates, yet they only have access to local and potentially biased feedback/evaluation of the true reward for each action taken. Learning only locally will lead agents to sub-optimal actions, while converging to a no-regret strategy requires a collection of distributed data. Motivated by the proposal of federated learning, we aim for a solution with which agents will never share their local observations with a central entity, and will be allowed to only share a private copy of their own information with their neighbors. We first propose a decentralized bandit algorithm Gossip_UCB, which is a coupling of variants of both the classical gossiping algorithm and the celebrated Upper Confidence Bound (UCB) bandit algorithm. We show that Gossip_UCB successfully adapts local bandit learning into a global gossiping process for sharing information among connected agents, and achieves a guaranteed regret of order $O(\max\{\texttt{poly}(N,M)\log T, \texttt{poly}(N,M)\log_{\lambda_2^{-1}} N\})$ for all $N$ agents, where $\lambda_2\in(0,1)$ is the second largest eigenvalue of the expected gossip matrix, which is a function of $G$. We then propose Fed_UCB, a differentially private version of Gossip_UCB, in which the agents preserve $\epsilon$-differential privacy of their local data while achieving $O(\max\{\frac{\texttt{poly}(N,M)}{\epsilon}\log^{2.5} T, \texttt{poly}(N,M)(\log_{\lambda_2^{-1}} N + \log T)\})$ regret.
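
    A heavily simplified coupling of gossiping with UCB, in the spirit of the algorithm described above but not its pseudocode: agents on a ring graph mix running reward sums and pull counts through a doubly stochastic gossip matrix each round and act on the mixed statistics. The graph, the bias model, and all constants are illustrative assumptions; the differentially private variant is not shown.

        import numpy as np

        rng = np.random.default_rng(2)

        # Toy decentralized bandit: N agents on a ring graph, M arms, local
        # rewards observed with an agent-specific bias.  Illustrative only.
        N, M, T = 4, 3, 2000
        TRUE_MEANS = np.array([0.2, 0.5, 0.8])
        BIAS = rng.normal(0.0, 0.1, size=(N, M))        # local, biased feedback

        # Ring-graph gossip matrix (lazy, doubly stochastic).
        W = np.zeros((N, N))
        for i in range(N):
            W[i, i] = 0.5
            W[i, (i - 1) % N] = 0.25
            W[i, (i + 1) % N] = 0.25

        S = np.zeros((N, M))      # gossiped reward sums
        C = np.zeros((N, M))      # gossiped pull counts

        for t in range(1, T + 1):
            picks = np.zeros((N, M))
            gains = np.zeros((N, M))
            for i in range(N):
                if t <= M:
                    a = t - 1                            # initial exploration
                else:
                    mean = S[i] / np.maximum(C[i], 1e-9)
                    bonus = np.sqrt(2 * np.log(t) / np.maximum(C[i], 1e-9))
                    a = int(np.argmax(mean + bonus))
                r = TRUE_MEANS[a] + BIAS[i, a] + rng.normal(0, 0.1)
                picks[i, a] += 1
                gains[i, a] += r
            # Gossip step: mix neighbours' statistics, then add local observations.
            S = W @ S + gains
            C = W @ C + picks

        print("agent arm estimates:\n", np.round(S / np.maximum(C, 1e-9), 2))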

    Fast Learning for Renewal Optimization in Online Task Scheduling

    This paper considers online optimization of a renewal-reward system. A controller performs a sequence of tasks back-to-back. Each task has a random vector of parameters, called the task type vector, that affects the task processing options and also affects the resulting reward and time duration of the task. The probability distribution for the task type vector is unknown and the controller must learn to make efficient decisions so that time average reward converges to optimality. Prior work on such renewal optimization problems leaves open the question of optimal convergence time. This paper develops an algorithm with an optimality gap that decays like $O(1/\sqrt{k})$, where $k$ is the number of tasks processed. The same algorithm is shown to have faster $O(\log(k)/k)$ performance when the system satisfies a strong concavity property. The proposed algorithm uses an auxiliary variable that is updated according to a classic Robbins-Monro iteration. It makes online scheduling decisions at the start of each renewal frame based on this variable and on the observed task type. A matching converse is obtained for the strongly concave case by constructing an example system for which all algorithms have performance at best $\Omega(\log(k)/k)$. A matching $\Omega(1/\sqrt{k})$ converse is also shown for the general case without strong concavity.
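
    A small sketch of the frame-based structure described above, under assumed reward and duration functions: an auxiliary variable theta tracks the achievable time-average reward through a Robbins-Monro update, and each frame's decision maximizes reward minus theta times duration for the observed task type. This is a simplified stand-in, not the paper's exact algorithm or step-size schedule.

        import numpy as np

        rng = np.random.default_rng(3)

        # Toy renewal-reward system: each frame a task type is observed and one
        # of two processing options is chosen.  The (reward, duration) functions
        # below are illustrative, not from the paper.
        def draw_task_type():
            return rng.uniform(0.0, 1.0)

        def options(task_type):
            # (reward, duration) for a "fast" and a "careful" processing mode.
            fast    = (1.0 + task_type, 1.0)
            careful = (2.5 * task_type, 1.0 + task_type)
            return [fast, careful]

        def renewal_learning(num_frames=20000):
            """Per-frame rule: maximize reward - theta * duration for the
            observed task type, where theta is a Robbins-Monro estimate of the
            achievable time-average reward."""
            theta = 0.0
            total_r, total_t = 0.0, 0.0
            for k in range(1, num_frames + 1):
                task_type = draw_task_type()
                r, d = max(options(task_type), key=lambda o: o[0] - theta * o[1])
                total_r += r
                total_t += d
                eta = 1.0 / k                        # diminishing step size
                theta += eta * (r - theta * d)       # Robbins-Monro update
            return total_r / total_t

        print("time-average reward:", round(renewal_learning(), 3))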