164 research outputs found

    Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

    We study the problem of learning Markov decision processes with finite state and action spaces when the transition probability distributions and loss functions are chosen adversarially and are allowed to change with time. We introduce an algorithm whose regret with respect to any policy in a comparison class grows as the square root of the number of rounds of the game, provided the transition probabilities satisfy a uniform mixing condition. Our approach is efficient as long as the comparison class is of polynomial size and we can compute expectations over sample paths for each policy. Designing an efficient algorithm with small regret for the general case remains an open problem.
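
    A minimal sketch of the kind of reduction the abstract alludes to: exponentiated weights (Hedge) run over an explicitly enumerated comparison class of policies, assuming an oracle `expected_loss(policy, t)` that returns the expected per-round loss of a policy (e.g., computed from sample paths). The function name and interface are illustrative assumptions, not the paper's algorithm.

```python
import math
import random

def hedge_over_policies(policies, expected_loss, T, eta=0.1):
    """Exponentiated weights over a small, explicit comparison class of policies.

    expected_loss(policy, t) is assumed to return the expected loss (in [0, 1])
    of following `policy` at round t, e.g. computed from sample paths.
    """
    weights = [1.0] * len(policies)
    for t in range(T):
        total = sum(weights)
        probs = [w / total for w in weights]
        chosen = random.choices(range(len(policies)), weights=probs)[0]
        # Follow policies[chosen] for this round (environment interaction omitted).
        losses = [expected_loss(p, t) for p in policies]
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
        yield policies[chosen], losses[chosen]
```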

    Competitive ratio versus regret minimization: achieving the best of both worlds

    We consider online algorithms under both the competitive ratio criterion and the regret minimization one. Our main goal is to build a unified methodology that would be able to guarantee both criteria simultaneously. For a general class of online algorithms, namely any Metrical Task System (MTS), we show that one can simultaneously guarantee the best known competitive ratio and a natural regret bound. For the paging problem we further show an efficient online algorithm (polynomial in the number of pages) with this guarantee. To this end, we extend an existing regret minimization algorithm (specifically, that of Kapralov and Panigrahy) to handle movement cost (the cost of switching between states of the online system). We then show how to use the extended regret minimization algorithm to combine multiple online algorithms. Our end result is an online algorithm that can combine a "base" online algorithm, having a guaranteed competitive ratio, with a range of online algorithms that guarantee a small regret over any interval of time. The combined algorithm guarantees both that the competitive ratio matches that of the base algorithm and a low regret over any time interval. As a by-product, we obtain an expert algorithm with close to optimal regret bound on every time interval, even in the presence of switching costs. This result is of independent interest.
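
    The combining idea can be illustrated with a toy meta-learner. The sketch below is not the Kapralov-Panigrahy extension used in the paper: it runs multiplicative weights over the candidate algorithms' per-round costs and limits movement with a naive lazy-switching rule (the `threshold` parameter is an arbitrary illustrative choice).

```python
import math

def combine(costs_per_round, eta=0.1, threshold=0.25):
    """costs_per_round yields lists where costs[i] is algorithm i's cost that round.

    The meta-learner follows one candidate algorithm and only moves off it when
    its weight share drops below `threshold` (a simplistic way to limit switching).
    """
    followed, weights = 0, None
    for costs in costs_per_round:
        if weights is None:
            weights = [1.0] * len(costs)
        share = weights[followed] / sum(weights)
        if share < threshold:
            # Lazily switch to the current leader.
            followed = max(range(len(weights)), key=lambda i: weights[i])
        yield followed  # play the followed algorithm's action this round
        weights = [w * math.exp(-eta * c) for w, c in zip(weights, costs)]
```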

    Multi-Armed Bandits with Metric Movement Costs

    We consider the non-stochastic Multi-Armed Bandit problem in a setting where there is a fixed and known metric on the action space that determines a cost for switching between any pair of actions. The loss of the online learner has two components: the first is the usual loss of the selected actions, and the second is an additional loss due to switching between actions. Our main contribution gives a tight characterization of the expected minimax regret in this setting, in terms of a complexity measure $\mathcal{C}$ of the underlying metric which depends on its covering numbers. In finite metric spaces with $k$ actions, we give an efficient algorithm that achieves regret of the form $\widetilde{O}(\max\{\mathcal{C}^{1/3}T^{2/3},\sqrt{kT}\})$, and show that this is the best possible. Our regret bound generalizes previously known regret bounds for some special cases: (i) the unit-switching cost regret $\widetilde{\Theta}(\max\{k^{1/3}T^{2/3},\sqrt{kT}\})$ where $\mathcal{C}=\Theta(k)$, and (ii) the interval metric with regret $\widetilde{\Theta}(\max\{T^{2/3},\sqrt{kT}\})$ where $\mathcal{C}=\Theta(1)$. For infinite metric spaces with Lipschitz loss functions, we derive a tight regret bound of $\widetilde{\Theta}(T^{\frac{d+1}{d+2}})$ where $d \ge 1$ is the Minkowski dimension of the space, which is known to be tight even when there are no switching costs.
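
    As a quick numerical illustration of how the general bound recovers the two special cases (constants and logarithmic factors ignored):

```python
def regret_rate(C, k, T):
    """The stated rate max{C^(1/3) T^(2/3), sqrt(kT)}, up to log factors."""
    return max(C ** (1 / 3) * T ** (2 / 3), (k * T) ** 0.5)

T, k = 10**6, 50
print("unit switching cost (C ~ k):", regret_rate(k, k, T))  # ~ k^(1/3) T^(2/3)
print("interval metric     (C ~ 1):", regret_rate(1, k, T))  # ~ max{T^(2/3), sqrt(kT)}
```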

    Online Convex Optimization Against Adversaries with Memory and Application to Statistical Arbitrage

    The framework of online learning with memory naturally captures learning problems with temporal constraints, and was previously studied for the experts setting. In this work we extend the notion of learning with memory to the general Online Convex Optimization (OCO) framework, and present two algorithms that attain low regret. The first algorithm applies to Lipschitz continuous loss functions, obtaining optimal regret bounds for both convex and strongly convex losses. The second algorithm attains the optimal regret bounds and applies more broadly to convex losses without requiring Lipschitz continuity, yet is more complicated to implement. We complement our theoretical results with an application to statistical arbitrage in finance: we devise algorithms for constructing mean-reverting portfolios. Comment: 22 pages, 2 figures
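
    A rough sketch of the standard recipe behind such policy-regret results: run online gradient descent on the "unary" surrogate $x \mapsto f_t(x,\ldots,x)$ and change the decision slowly. The `unary_grads` oracle and the Euclidean-ball feasible set are assumptions of this illustration, not the paper's specific algorithms.

```python
import numpy as np

def ogd_with_memory(unary_grads, x0, T, eta=0.01, radius=1.0):
    """unary_grads(t, x) -> gradient of the unary surrogate x |-> f_t(x, ..., x)."""
    x = np.array(x0, dtype=float)
    for t in range(T):
        yield x.copy()               # decision actually played at round t
        g = unary_grads(t, x)
        x = x - eta * g
        norm = np.linalg.norm(x)
        if norm > radius:            # project back onto the Euclidean ball
            x *= radius / norm
```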

    Bandits with Switching Costs: T^{2/3} Regret

    We study the adversarial multi-armed bandit problem in a setting where the player incurs a unit cost each time he switches actions. We prove that the player's $T$-round minimax regret in this setting is $\widetilde{\Theta}(T^{2/3})$, thereby closing a fundamental gap in our understanding of learning with bandit feedback. In the corresponding full-information version of the problem, the minimax regret is known to grow at a much slower rate of $\Theta(\sqrt{T})$. The difference between these two rates provides the first indication that learning with bandit feedback can be significantly harder than learning with full-information feedback (previous results only showed a different dependence on the number of actions, but not on $T$). In addition to characterizing the inherent difficulty of the multi-armed bandit problem with switching costs, our results also resolve several other open problems in online learning. One direct implication is that learning with bandit feedback against bounded-memory adaptive adversaries has a minimax regret of $\widetilde{\Theta}(T^{2/3})$. Another implication is that the minimax regret of online learning in adversarial Markov decision processes (MDPs) is $\widetilde{\Theta}(T^{2/3})$. The key to all of our results is a new randomized construction of a multi-scale random walk, which is of independent interest and likely to prove useful in additional settings.
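
    For concreteness, the quantity whose minimax rate is characterized here can be written down directly. The harness below is purely illustrative (it is not the paper's multi-scale random walk construction): it charges a unit cost on top of the per-round losses whenever the played arm changes.

```python
def total_loss_with_switching(actions, loss_fn):
    """actions: sequence of arms played; loss_fn(t, arm) -> loss in [0, 1]."""
    total, prev = 0.0, None
    for t, arm in enumerate(actions):
        total += loss_fn(t, arm)
        if prev is not None and arm != prev:
            total += 1.0             # unit switching cost
        prev = arm
    return total
```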

    Online Learning with Composite Loss Functions

    We study a new class of online learning problems where each of the online algorithm's actions is assigned an adversarial value, and the loss of the algorithm at each step is a known and deterministic function of the values assigned to its recent actions. This class includes problems where the algorithm's loss is the minimum over the recent adversarial values, the maximum over the recent values, or a linear combination of the recent values. We analyze the minimax regret of this class of problems when the algorithm receives bandit feedback, and prove that when the minimum or maximum functions are used, the minimax regret is $\tilde{\Omega}(T^{2/3})$ (so-called hard online learning problems), and when a linear function is used, the minimax regret is $\tilde{O}(\sqrt{T})$ (so-called easy learning problems). Previously, the only online learning problem that was known to be provably hard was the multi-armed bandit with switching costs.
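
    The composite losses in question are easy to state in code; the helper below simply evaluates the minimum, maximum, or a linear combination of the adversarial values assigned to the last few actions (the `kind` and `coeffs` arguments are illustrative names).

```python
def composite_loss(recent_values, kind="min", coeffs=None):
    """Loss of the current step as a known function of the recent adversarial values."""
    if kind == "min":
        return min(recent_values)
    if kind == "max":
        return max(recent_values)
    if kind == "linear":
        # coeffs is assumed to have the same length as the memory window.
        return sum(c * v for c, v in zip(coeffs, recent_values))
    raise ValueError(f"unknown composite loss kind: {kind}")
```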

    Online learning over a finite action set with limited switching

    This paper studies the value of switching actions in the Prediction From Experts (PFE) problem and the Adversarial Multi-Armed Bandits (MAB) problem. First, we revisit the well-studied and practically motivated setting of PFE with switching costs. Many algorithms are known to achieve the minimax optimal order of $O(\sqrt{T \log n})$ in expectation for both regret and number of switches, where $T$ is the number of iterations and $n$ the number of actions. However, no high probability (h.p.) guarantees are known. Our main technical contribution is the first algorithms which with h.p. achieve this optimal order for both regret and switches. This settles an open problem of [Devroye et al., 2015], and directly implies the first h.p. guarantees for several problems of interest. Next, to investigate the value of switching actions at a more granular level, we introduce the setting of switching budgets, in which algorithms are limited to $S \leq T$ switches between actions. This entails a limited number of free switches, in contrast to the unlimited number of expensive switches in the switching cost setting. Using the above result and several reductions, we unify previous work and completely characterize the complexity of this switching budget setting up to small polylogarithmic factors: for both PFE and MAB, for all switching budgets $S \leq T$, and for both expectation and h.p. guarantees. For PFE, we show the optimal rate is $\tilde{\Theta}(\sqrt{T\log n})$ for $S = \Omega(\sqrt{T\log n})$, and $\min(\tilde{\Theta}(\tfrac{T\log n}{S}), T)$ for $S = O(\sqrt{T \log n})$. Interestingly, the bandit setting does not exhibit such a phase transition; instead we show the minimax rate decays steadily as $\min(\tilde{\Theta}(\tfrac{T\sqrt{n}}{\sqrt{S}}), T)$ for all ranges of $S \leq T$. These results recover and generalize the known minimax rates for the (arbitrary) switching cost setting. Comment: Extended abstract to appear in the proceedings of the 2018 Conference on Learning Theory (COLT)
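
    A minimal sketch of the classical way to respect a switching budget in PFE, included only to make the setting concrete: run Hedge, but let it resample its expert only at block boundaries, so at most roughly $S$ switches occur. This captures the in-expectation behaviour; the paper's high-probability guarantees require different algorithms.

```python
import math
import random

def blocked_hedge(expert_losses, n, T, S, eta=0.1):
    """expert_losses(t) -> list of n losses in [0, 1]; at most ~S switches are made."""
    block = max(1, T // max(1, S))
    weights = [1.0] * n
    current = random.randrange(n)
    for t in range(T):
        if t > 0 and t % block == 0:
            # Resample (and possibly switch) only at block boundaries.
            total = sum(weights)
            current = random.choices(range(n), weights=[w / total for w in weights])[0]
        yield current
        losses = expert_losses(t)
        weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
```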

    Online Caching with Optimal Switching Regret

    We consider the classical uncoded caching problem from an online learning point of view. A cache of limited storage capacity can hold $C$ files at a time from a large catalog. A user requests an arbitrary file from the catalog at each time slot. Before the file request from the user arrives, a caching policy populates the cache with any $C$ files of its choice. In the case of a cache-hit, the policy receives a unit reward and zero reward otherwise. In addition, there is a cost associated with fetching files to the cache, which we refer to as the switching cost. The objective is to design a caching policy that incurs minimal regret while considering both the rewards due to cache-hits and the switching cost due to the file fetches. The main contribution of this paper is the switching regret analysis of a Follow the Perturbed Leader-based anytime caching policy, which is shown to have an order optimal switching regret. In this pursuit, we improve the best-known switching regret bound for this problem by a factor of $\Theta(\sqrt{C})$. We conclude the paper by comparing the performance of different popular caching policies using a publicly available trace from a commercial CDN server. Comment: 11 pages, 3 figures, to be submitted to ISIT, 202
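
    A minimal sketch of a Follow-the-Perturbed-Leader caching rule of the kind analyzed here, with a generic $\sqrt{t}$ anytime noise schedule; the exact scaling and constants used in the paper may differ from this illustration.

```python
import numpy as np

def ftpl_cache(requests, num_files, C, sigma=1.0, seed=0):
    """Cache the C files with the highest Gaussian-perturbed cumulative request counts."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(num_files)
    gamma = rng.standard_normal(num_files)        # one perturbation vector, reused
    for t, req in enumerate(requests, start=1):
        scores = counts + sigma * (t ** 0.5) * gamma
        cache = {int(i) for i in np.argpartition(scores, -C)[-C:]}
        yield req in cache                        # cache hit?
        counts[req] += 1.0
```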

    Learning to Cache With No Regrets

    This paper introduces a novel caching analysis that, contrary to prior work, makes no modeling assumptions about the file request sequence. We cast the caching problem in the framework of Online Linear Optimization (OLO), and introduce a class of minimum regret caching policies, which minimize the losses with respect to the best static configuration in hindsight when the request model is unknown. These policies are important because they are robust to popularity deviations, in the sense that they learn to adjust their caching decisions when the popularity model changes. We first prove a novel lower bound for the regret of any caching policy, improving existing OLO bounds for our setting. Then we show that the Online Gradient Ascent (OGA) policy guarantees a regret that matches the lower bound, hence it is universally optimal. Finally, we shift our attention to a network of caches arranged to form a bipartite graph, and show that the Bipartite Subgradient Algorithm (BSA) has no regret. Comment: IEEE INFOCOM 201
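
    A sketch of an Online Gradient Ascent caching policy over fractional cache states $y \in [0,1]^N$ with $\sum_i y_i \le C$: ascend along the one-hot reward gradient of the requested file, then project back onto the constraint set. The bisection-based projection routine below is a standard procedure assumed for this sketch, not taken from the paper.

```python
import numpy as np

def project_capped_simplex(x, C):
    """Euclidean projection onto {y : 0 <= y <= 1, sum(y) <= C}."""
    y = np.clip(x, 0.0, 1.0)
    if y.sum() <= C:
        return y
    lo, hi = 0.0, float(x.max())                  # bisect on the shift tau >= 0
    for _ in range(50):
        tau = 0.5 * (lo + hi)
        if np.clip(x - tau, 0.0, 1.0).sum() > C:
            lo = tau
        else:
            hi = tau
    return np.clip(x - hi, 0.0, 1.0)

def oga_cache(requests, num_files, C, eta=0.1):
    """Fractional caching by online gradient ascent on the cache-hit reward."""
    y = np.full(num_files, C / num_files)         # feasible starting state
    for req in requests:
        yield y[req]                              # expected (fractional) hit reward
        grad = np.zeros(num_files)
        grad[req] = 1.0
        y = project_capped_simplex(y + eta * grad, C)
```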

    Online learning with feedback graphs and switching costs

    We study online learning when partial feedback information is provided following every action of the learning process, and the learner incurs switching costs for changing his actions. In this setting, the feedback information system can be represented by a graph, and previous works studied the expected regret of the learner in the case of a clique (Expert setup) or disconnected single loops (Multi-Armed Bandits (MAB)). This work provides a lower bound on the expected regret in the Partial Information (PI) setting, namely for general feedback graphs, excluding the clique. Additionally, it shows that all algorithms that are optimal without switching costs are necessarily sub-optimal in the presence of switching costs, which motivates the need to design new algorithms. We propose two new algorithms: Threshold Based EXP3 and EXP3.SC. For the two special cases of the symmetric PI setting and MAB, the expected regret of both of these algorithms is order optimal in the duration of the learning process. Additionally, Threshold Based EXP3 is order optimal in the switching cost, whereas EXP3.SC is not. Finally, empirical evaluations show that Threshold Based EXP3 outperforms the previously proposed order-optimal algorithms EXP3 SET in the presence of switching costs, and Batch EXP3 in the MAB setting with switching costs. Comment: Published in Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019. PMLR: Volume 8
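
    For context, the Batch EXP3 baseline mentioned above can be sketched as follows. This is the generic batching reduction (commit to one arm per batch so switching costs are paid at most once per batch), not the Threshold Based EXP3 or EXP3.SC algorithms proposed in the paper; the batch length $\approx T^{1/3}$ is a conventional default assumed here.

```python
import math
import random

def batch_exp3(loss_fn, k, T, batch=None, eta=0.05):
    """loss_fn(t, arm) -> loss in [0, 1]; only the pulled arm's loss is observed."""
    batch = batch or max(1, round(T ** (1 / 3)))
    weights = [1.0] * k
    for start in range(0, T, batch):
        total = sum(weights)
        probs = [w / total for w in weights]
        arm = random.choices(range(k), weights=probs)[0]
        batch_loss = sum(loss_fn(t, arm) for t in range(start, min(start + batch, T)))
        # Importance-weighted estimate of the pulled arm's average per-round loss.
        estimate = batch_loss / (batch * probs[arm])
        weights[arm] *= math.exp(-eta * estimate)
        yield arm, batch_loss
```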