    Extended Formulations for Online Linear Bandit Optimization

    Online linear optimization on combinatorial action sets ($d$-dimensional actions) with bandit feedback is known to have complexity on the order of the dimension of the problem. The exponentially weighted strategy achieves the best known regret bound, which is of order $d^{2}\sqrt{n}$ (where $d$ is the dimension of the problem and $n$ is the time horizon). However, such strategies are provably suboptimal or computationally inefficient. The complexity is attributed to the combinatorial structure of the action set and the dearth of efficient strategies for exploring it. Mirror descent with an entropic regularization function comes close to solving this problem by enforcing a meticulous projection of weights with an inherent boundary condition. Entropic regularization in mirror descent is the only known way of achieving a logarithmic dependence on the dimension. Here, we argue otherwise and recover the original intuition of exponential weighting by borrowing a technique from discrete optimization and approximation algorithms called 'extended formulation'. Such formulations appeal to the underlying geometry of the set, with a guaranteed logarithmic dependence on the dimension underpinned by an information-theoretic entropic analysis.
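
    For orientation, one common textbook form of exponential weighting under bandit feedback can be written as below. This is a sketch under assumed notation (finite action set $A$, learning rate $\eta$, loss vectors $\ell_t$, pseudo-inverse $P_t^{+}$), it omits the exploration mixing that such schemes usually add, and it is not the extended-formulation construction of the paper. The normalizing sum over all of $A$ is what becomes expensive when $A$ is combinatorial.

```latex
% Assumed notation: finite action set A \subset \mathbb{R}^d, learning rate \eta > 0,
% loss vector \ell_t, played action a_t \sim p_t, observed scalar loss a_t^\top \ell_t.
\begin{align*}
  p_t(a) &= \frac{\exp\bigl(-\eta \sum_{s<t} \hat\ell_s^\top a\bigr)}
                 {\sum_{b \in A} \exp\bigl(-\eta \sum_{s<t} \hat\ell_s^\top b\bigr)},
  & a &\in A, \\
  \hat\ell_t &= P_t^{+}\, a_t \,\bigl(a_t^\top \ell_t\bigr),
  & P_t &= \mathbb{E}_{a \sim p_t}\bigl[a a^\top\bigr].
\end{align*}
```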

    Addressing the Loss-Metric Mismatch with Adaptive Loss Alignment

    In most machine learning training paradigms a fixed, often handcrafted, loss function is assumed to be a good proxy for an underlying evaluation metric. In this work we assess this assumption by meta-learning an adaptive loss function to directly optimize the evaluation metric. We propose a sample-efficient reinforcement learning approach for adapting the loss dynamically during training. We empirically show how this formulation improves performance by simultaneously optimizing the evaluation metric and smoothing the loss landscape. We verify our method in metric learning and classification scenarios, showing considerable improvements over the state-of-the-art on a diverse set of tasks. Importantly, our method is applicable to a wide range of loss functions and evaluation metrics. Furthermore, the learned policies are transferable across tasks and data, demonstrating the versatility of the method. Comment: Accepted to ICML 201

    Multi-Objective Generalized Linear Bandits

    In this paper, we study the multi-objective bandits (MOB) problem, where a learner repeatedly selects one arm to play and then receives a reward vector consisting of multiple objectives. MOB has found many real-world applications as varied as online recommendation and network routing. On the other hand, these applications typically contain contextual information that can guide the learning process, which, however, is ignored by most existing work. To utilize this information, we associate each arm with a context vector and assume the reward follows the generalized linear model (GLM). We adopt the notion of Pareto regret to evaluate the learner's performance and develop a novel algorithm for minimizing it. The essential idea is to apply a variant of the online Newton step to estimate model parameters, based on which we utilize the upper confidence bound (UCB) policy to construct an approximation of the Pareto front, and then choose one arm uniformly at random from the approximate Pareto front. Theoretical analysis shows that the proposed algorithm achieves an $\tilde O(d\sqrt{T})$ Pareto regret, where $T$ is the time horizon and $d$ is the dimension of contexts, which matches the optimal result for the single-objective contextual bandit problem. Numerical experiments demonstrate the effectiveness of our method.
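
    The last step of the procedure described above (build an approximate Pareto front from per-arm UCB vectors, then pick one arm uniformly at random) can be sketched as follows. This is a minimal illustration, not the paper's algorithm: the online Newton step estimation is omitted, and the names `dominates`, `pareto_front`, `choose_arm` and the numbers in `ucb_scores` are hypothetical.

```python
import random

def dominates(u, v):
    """True if u Pareto-dominates v: at least as good everywhere, strictly better somewhere."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def pareto_front(scores):
    """Indices of arms whose score vectors are not dominated by any other arm."""
    return [i for i, u in enumerate(scores)
            if not any(dominates(v, u) for j, v in enumerate(scores) if j != i)]

def choose_arm(scores, rng=random):
    """Pick one arm uniformly at random from the (approximate) Pareto front."""
    return rng.choice(pareto_front(scores))

# Hypothetical UCB vectors for four arms and two objectives (made-up numbers).
ucb_scores = [(0.9, 0.2), (0.5, 0.5), (0.4, 0.4), (0.1, 0.8)]
front = pareto_front(ucb_scores)          # arm 2 is dominated by arm 1
arm = choose_arm(ucb_scores, random.Random(0))
```

    The quadratic-time dominance check is chosen for readability; it is adequate for a handful of arms but would need a smarter data structure for large arm sets.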

    Sharp Dichotomies for Regret Minimization in Metric Spaces

    The Lipschitz multi-armed bandit (MAB) problem generalizes the classical multi-armed bandit problem by assuming one is given side information consisting of a priori upper bounds on the difference in expected payoff between certain pairs of strategies. Classical results of (Lai and Robbins 1985) and (Auer et al. 2002) imply a logarithmic regret bound for the Lipschitz MAB problem on finite metric spaces. Recent results on continuum-armed bandit problems and their generalizations imply lower bounds of $\sqrt{t}$, or stronger, for many infinite metric spaces such as the unit interval. Is this dichotomy universal? We prove that the answer is yes: for every metric space, the optimal regret of a Lipschitz MAB algorithm is either bounded above by any $f\in \omega(\log t)$, or bounded below by any $g\in o(\sqrt{t})$. Perhaps surprisingly, this dichotomy does not coincide with the distinction between finite and infinite metric spaces; instead it depends on whether the completion of the metric space is compact and countable. Our proof connects upper and lower bound techniques in online learning with classical topological notions such as perfect sets and the Cantor-Bendixson theorem. Among many other results, we show a similar dichotomy for the "full-feedback" (a.k.a., "best-expert") version. Comment: Full version of a paper in ACM-SIAM SODA 201

    On Regret-Optimal Learning in Decentralized Multi-player Multi-armed Bandits

    We consider the problem of learning in single-player and multiplayer multiarmed bandit models. Bandit problems are classes of online learning problems that capture exploration versus exploitation tradeoffs. In a multiarmed bandit model, players can pick among many arms, and each play of an arm generates an i.i.d. reward from an unknown distribution. The objective is to design a policy that maximizes the expected reward over a time horizon in the single-player setting and the sum of expected rewards in the multiplayer setting. In the multiplayer setting, arms may give different rewards to different players. There is no separate channel for coordination among the players. Any attempt at communication is costly and adds to regret. We propose two decentralizable policies, $\tt E^3$ ($\tt E$-$\tt cubed$) and $\tt E^3$-$\tt TS$, that can be used in both single-player and multiplayer settings. These policies are shown to yield expected regret that grows at most as O($\log^{1+\epsilon} T$). It is well known that $\log T$ is the lower bound on the rate of growth of regret even in a centralized case. The proposed algorithms improve on prior work where regret grew at O($\log^2 T$). More fundamentally, these policies address the question of additional cost incurred in decentralized online learning, suggesting that there is at most an $\epsilon$-factor cost in terms of order of regret. This solves a problem that is relevant in many domains and had been open for a while.

    Online Dynamic Programming

    We consider the problem of repeatedly solving a variant of the same dynamic programming problem in successive trials. An instance of the type of problems we consider is to find a good binary search tree in a changing environment. At the beginning of each trial, the learner probabilistically chooses a tree with the $n$ keys at the internal nodes and the $n+1$ gaps between keys at the leaves. The learner is then told the frequencies of the keys and gaps and is charged the average search cost of the chosen tree. The problem is online because the frequencies can change between trials. The goal is to develop algorithms with the property that their total average search cost (loss) in all trials is close to the total loss of the best tree chosen in hindsight for all trials. The challenge, of course, is that the algorithm has to deal with an exponential number of trees. We develop a general methodology for tackling such problems for a wide class of dynamic programming algorithms. Our framework allows us to extend online learning algorithms like Hedge and Component Hedge to a significantly wider class of combinatorial objects than was possible before.
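
    To make the difficulty concrete, here is a minimal sketch of plain Hedge over an explicitly enumerated set of experts with losses in [0, 1]. It is the naive baseline, not the paper's method: it keeps one weight per expert, which is exactly what becomes infeasible when the experts are all binary search trees. The function name `hedge`, the learning rate, and the loss values are assumptions made for illustration.

```python
import math

def hedge(loss_rounds, eta):
    """Run Hedge on explicitly listed experts.

    loss_rounds: list of per-round loss lists, one loss in [0, 1] per expert.
    Returns the final weight vector and the learner's total expected loss.
    """
    n = len(loss_rounds[0])
    weights = [1.0 / n] * n
    total = 0.0
    for losses in loss_rounds:
        # Expected loss of sampling an expert from the current weights.
        total += sum(w * l for w, l in zip(weights, losses))
        # Multiplicative update followed by renormalization.
        scaled = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
        z = sum(scaled)
        weights = [s / z for s in scaled]
    return weights, total

# Three hypothetical experts with fixed per-round losses, 50 rounds.
rounds = [[0.1, 0.9, 0.5]] * 50
final_weights, learner_loss = hedge(rounds, eta=0.5)
```

    With these made-up losses the weight mass concentrates on the first expert, and the learner's total loss stays within the standard Hedge regret bound of the best expert's total loss.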

    Towards Distribution-Free Multi-Armed Bandits with Combinatorial Strategies

    In this paper we study a generalized version of the classical multi-armed bandit (MAB) problem by allowing for arbitrary constraints on constituent bandits at each decision point. The motivation for this study comes from many situations that involve repeatedly making choices subject to arbitrary constraints in an uncertain environment: for instance, regularly deciding which advertisements to display online in order to gain a high click-through rate without knowing user preferences, or what route to drive home each day under uncertain weather and traffic conditions. Assume that there are $K$ unknown random variables (RVs), i.e., arms, each evolving as an i.i.d. stochastic process over time. At each decision epoch, we select a strategy, i.e., a subset of RVs, subject to arbitrary constraints on constituent RVs. We then gain a reward that is a linear combination of observations on selected RVs. The performance of prior results for this problem heavily depends on the distribution of strategies generated by the corresponding learning policy. For example, if the reward difference between the best and second-best strategy approaches zero, prior results may lead to arbitrarily large regret. Meanwhile, when there is an exponential number of possible strategies at each decision point, a naive extension of a prior distribution-free policy would perform poorly in terms of regret, computation and space complexity. To this end, we propose an efficient Distribution-Free Learning (DFL) policy that achieves zero regret, regardless of the probability distribution of the resultant strategies. Our learning policy has both $O(K)$ time complexity and $O(K)$ space complexity. In successive generations, we show that even if finding the optimal strategy at each decision point is NP-hard, our policy still allows for approximated solutions while retaining near zero-regret.

    Imitation Learning as $f$-Divergence Minimization

    We address the problem of imitation learning with multi-modal demonstrations. Instead of attempting to learn all modes, we argue that in many tasks it is sufficient to imitate any one of them. We show that state-of-the-art methods such as GAIL and behavior cloning, due to their choice of loss function, often incorrectly interpolate between such modes. Our key insight is to minimize the right divergence between the learner and the expert state-action distributions, namely the reverse KL divergence or I-projection. We propose a general imitation learning framework for estimating and minimizing any $f$-divergence. By plugging in different divergences, we are able to recover existing algorithms such as Behavior Cloning (Kullback-Leibler), GAIL (Jensen-Shannon) and Dagger (Total Variation). Empirical results show that our approximate I-projection technique is able to imitate multi-modal behaviors more reliably than GAIL and behavior cloning. Comment: International Workshop on the Algorithmic Foundations of Robotics (WAFR) 202
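
    The idea of "plugging in different divergences" can be illustrated on small discrete distributions using the standard definition $D_f(p \,\|\, q) = \sum_x q(x)\, f(p(x)/q(x))$. The sketch below is an assumption-laden illustration, not the paper's estimator: it works with exact, strictly positive probability vectors rather than samples, it does not fix which argument plays the learner or the expert, and the names `f_divergence`, `GENERATORS` and the example vectors are hypothetical.

```python
import math

def f_divergence(p, q, f):
    """D_f(p || q) = sum_x q(x) * f(p(x) / q(x)), for strictly positive p and q."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def xlogx(u):
    """u * log(u), extended by continuity to 0 at u = 0."""
    return u * math.log(u) if u > 0 else 0.0

# Convex generators f with f(1) = 0, one per divergence named in the abstract.
GENERATORS = {
    "kl": xlogx,                                      # KL(p || q)
    "reverse_kl": lambda u: -math.log(u),             # KL(q || p)
    "total_variation": lambda u: 0.5 * abs(u - 1.0),
    "jensen_shannon": lambda u: 0.5 * (xlogx(u) - (u + 1.0) * math.log((u + 1.0) / 2.0)),
}

# Two made-up distributions over three outcomes.
p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]
values = {name: f_divergence(p, q, f) for name, f in GENERATORS.items()}
```

    Swapping the generator changes which mismatches between `p` and `q` are penalized most, which is the lever the abstract refers to when it contrasts forward KL, Jensen-Shannon and reverse KL.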

    Stochastic Contextual Bandits with Known Reward Functions

    Many sequential decision-making problems in communication networks can be modeled as contextual bandit problems, which are natural extensions of the well-known multi-armed bandit problem. In contextual bandit problems, at each time, an agent observes some side information or context, pulls one arm and receives the reward for that arm. We consider a stochastic formulation where the context-reward tuples are independently drawn from an unknown distribution in each trial. Motivated by networking applications, we analyze a setting where the reward is a known non-linear function of the context and the chosen arm's current state. We first consider the case of discrete and finite context-spaces and propose DCB($\epsilon$), an algorithm that we prove, through a careful analysis, yields regret (cumulative reward gap compared to a distribution-aware genie) scaling logarithmically in time and linearly in the number of arms that are not optimal for any context, improving over existing algorithms where the regret scales linearly in the total number of arms. We then study continuous context-spaces with Lipschitz reward functions and propose CCB($\epsilon, \delta$), an algorithm that uses DCB($\epsilon$) as a subroutine. CCB($\epsilon, \delta$) reveals a novel regret-storage trade-off that is parametrized by $\delta$. Tuning $\delta$ to the time horizon allows us to obtain sub-linear regret bounds, while requiring sub-linear storage. By exploiting joint learning for all contexts we get regret bounds for CCB($\epsilon, \delta$) that are unachievable by any existing contextual bandit algorithm for continuous context-spaces. We also show similar performance bounds for the unknown horizon case. Comment: A version of this technical report is under submission in IEEE/ACM Transactions on Networkin

    Derivative-free optimization methods

    In many optimization problems arising from scientific, engineering and artificial intelligence applications, objective and constraint functions are available only as the output of a black-box or simulation oracle that does not provide derivative information. Such settings necessitate the use of methods for derivative-free, or zeroth-order, optimization. We provide a review and perspectives on developments in these methods, with an emphasis on highlighting recent developments and on unifying treatment of such problems in the non-linear optimization and machine learning literature. We categorize methods based on assumed properties of the black-box functions, as well as features of the methods. We first overview the primary setting of deterministic methods applied to unconstrained, non-convex optimization problems where the objective function is defined by a deterministic black-box oracle. We then discuss developments in randomized methods, methods that assume some additional structure about the objective (including convexity, separability and general non-smooth compositions), methods for problems where the output of the black-box oracle is stochastic, and methods for handling different types of constraints.
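
    As a concrete instance of a deterministic derivative-free method for an unconstrained problem, the sketch below implements a basic compass (coordinate) search: it polls the objective at plus and minus a step along each coordinate and halves the step when no poll point improves. It is one simple representative chosen for illustration, not a method singled out by the review, and the names `compass_search` and `black_box`, the tolerances, and the test objective are assumptions.

```python
def compass_search(f, x0, step=1.0, tol=1e-6, max_evals=10_000):
    """Minimize f using only function values (no gradients)."""
    x, fx, evals = list(x0), f(x0), 1
    while step > tol and evals < max_evals:
        improved = False
        for i in range(len(x)):
            for delta in (step, -step):
                trial = x[:]
                trial[i] += delta
                ft = f(trial)
                evals += 1
                if ft < fx:
                    # Accept the first improving poll point along this coordinate.
                    x, fx, improved = trial, ft, True
                    break
        if not improved:
            step *= 0.5  # no poll point improved: shrink the stencil
    return x, fx

def black_box(x):
    """Stand-in for a simulation oracle; the optimizer only sees its output."""
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 0.5) ** 2

x_best, f_best = compass_search(black_box, [0.0, 0.0])
```

    On this smooth two-dimensional toy objective the search reaches the minimizer near (1, -0.5); noisy oracles or constraints, which the review also covers, would need a different acceptance rule.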