Minimax Policies for Combinatorial Prediction Games
We address the online linear optimization problem when the actions of the
forecaster are represented by binary vectors. Our goal is to understand the
magnitude of the minimax regret for the worst possible set of actions. We study
the problem under three different assumptions for the feedback: full
information, and the partial information models of the so-called "semi-bandit",
and "bandit" problems. We consider both -, and -type of
restrictions for the losses assigned by the adversary.
We formulate a general strategy using Bregman projections on top of a
potential-based gradient descent, which generalizes the ones studied in the
series of papers Gyorgy et al. (2007), Dani et al. (2008), Abernethy et al.
(2008), Cesa-Bianchi and Lugosi (2009), Helmbold and Warmuth (2009), Koolen et
al. (2010), Uchiya et al. (2010), Kale et al. (2010) and Audibert and Bubeck
(2010). We provide simple proofs that recover most of the previous results. We
propose new upper bounds for the semi-bandit game. Moreover we derive lower
bounds for all three feedback assumptions. With the only exception of the
bandit game, the upper and lower bounds are tight, up to a constant factor.
Finally, we answer a question asked by Koolen et al. (2010) by showing that the exponentially weighted average forecaster is suboptimal against $L_\infty$ adversaries.
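As a rough illustration of the "Bregman projections on top of a potential-based gradient descent" strategy, the sketch below instantiates the full-information case with a negative-entropy potential and the action set of $m$-subsets of $\{1,\dots,d\}$; the step size, starting point, and simplified projection (which enforces only the sum constraint of the convex hull) are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def osmd_full_info(losses, d, m, eta=0.1):
    """Potential-based gradient step followed by a Bregman projection
    (full-information feedback, negative-entropy potential).

    Illustrative assumptions: actions are m-subsets of {1,...,d}, so their
    convex hull is {x in [0,1]^d : sum(x) = m}; the projection below only
    enforces the equality constraint (an exact Bregman projection would
    also enforce x_i <= 1).  `losses` is a (T, d) array of loss vectors.
    """
    x = np.full(d, m / d)              # uniform starting point in the hull
    plays = []
    for loss in losses:
        plays.append(x.copy())         # play a random action whose mean is x (decomposition omitted)
        x = x * np.exp(-eta * loss)    # gradient step in the dual space of the entropy potential
        x = m * x / x.sum()            # KL (Bregman) projection onto {x >= 0 : sum(x) = m}
    return np.array(plays)

# Example: d = 10 components, actions of size m = 3, T = 100 rounds.
plays = osmd_full_info(np.random.rand(100, 10), d=10, m=3)
```

In the semi-bandit and bandit settings described above, the observed loss vector would be replaced by an unbiased estimator built from the revealed coordinates before the same update is applied.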
k-server via multiscale entropic regularization
We present an $O((\log k)^2)$-competitive randomized algorithm for the $k$-server problem on hierarchically separated trees (HSTs). This is the first $o(k)$-competitive randomized algorithm for which the competitive ratio is independent of the size of the underlying HST. Our algorithm is designed in the framework of online mirror descent where the mirror map is a multiscale entropy. When combined with Bartal's static HST embedding reduction, this leads to an $O((\log k)^2 \log n)$-competitive algorithm on any $n$-point metric space. We give a new dynamic HST embedding that yields an $O((\log k)^3 \log \Delta)$-competitive algorithm on any metric space where the ratio of the largest to smallest non-zero distance is at most $\Delta$.
How to Fine-Tune Vision Models with SGD
SGD and AdamW are the two most commonly used optimizers for fine-tuning large neural
networks in computer vision. When the two methods perform the same, SGD is
preferable because it uses less memory (12 bytes/parameter with momentum and 8
bytes/parameter without) than AdamW (16 bytes/parameter). However, on a suite
of downstream tasks, especially those with distribution shifts, we find that
fine-tuning with AdamW performs substantially better than SGD on modern Vision
Transformer and ConvNeXt models. We find that large gaps in performance between
SGD and AdamW occur when the fine-tuning gradients in the first "embedding"
layer are much larger than in the rest of the model. Our analysis suggests an
easy fix that works consistently across datasets and models: freezing the
embedding layer (less than 1% of the parameters) leads to SGD with or without
momentum performing slightly better than AdamW while using less memory (e.g.,
on ViT-L, SGD uses 33% less GPU memory). Our insights result in
state-of-the-art accuracies on five popular distribution shift benchmarks:
WILDS-FMoW, WILDS-Camelyon, BREEDS-Living-17, Waterbirds, and DomainNet.
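A minimal sketch of the "freeze the embedding layer, then fine-tune with SGD" recipe, assuming a timm Vision Transformer checkpoint; the model name, learning rate, and the exact set of parameters treated as the embedding layer are placeholder assumptions rather than the paper's configuration.

```python
import timm
import torch
import torch.nn.functional as F

# Pretrained ViT to be fine-tuned on a downstream classification task (placeholder head size).
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=10)

# Freeze the input "embedding" parameters (well under 1% of the model).
for p in model.patch_embed.parameters():
    p.requires_grad = False
model.cls_token.requires_grad = False   # optionally also freeze the class token
model.pos_embed.requires_grad = False   # and the positional embeddings

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)  # momentum is optional

def train_step(images, labels):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the frozen parameters carry no optimizer state, this keeps the memory advantage of SGD over AdamW that motivates the recipe.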
Statistically Preconditioned Accelerated Gradient Method for Distributed Optimization
We consider the setting of distributed empirical risk minimization where
multiple machines compute the gradients in parallel and a centralized server
updates the model parameters. In order to reduce the number of communications
required to reach a given accuracy, we propose a \emph{preconditioned}
accelerated gradient method where the preconditioning is done by solving a
local optimization problem over a subsampled dataset at the server. The
convergence rate of the method depends on the square root of the relative
condition number between the global and local loss functions. We estimate the
relative condition number for linear prediction models by studying
\emph{uniform} concentration of the Hessians over a bounded domain, which
allows us to derive improved convergence rates for existing preconditioned
gradient methods and our accelerated method. Experiments on real-world datasets
illustrate the benefits of acceleration in the ill-conditioned regime.
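The sketch below shows one non-accelerated preconditioned step for a least-squares (linear prediction) objective, where the server's local subproblem over the subsampled dataset reduces to a linear solve; the regularization parameter mu and the data layout are illustrative assumptions, and the extrapolation terms that give acceleration are omitted.

```python
import numpy as np

def preconditioned_step(w, grad_global, X_server, mu):
    """One statistically preconditioned step for least squares.

    grad_global : gradient of the global empirical risk at w, averaged over machines.
    X_server    : subsampled feature matrix held at the server; its Hessian
                  H = X^T X / n preconditions the step.
    mu          : regularizer accounting for the gap between the local and global
                  Hessians (related to the relative condition number).
    """
    n, d = X_server.shape
    H_server = X_server.T @ X_server / n
    # Solving the server's local subproblem exactly amounts to this linear solve.
    return w - np.linalg.solve(H_server + mu * np.eye(d), grad_global)
```

Each communication round, the workers send their local gradients, the server averages them into grad_global, applies this step, and broadcasts the new iterate; the accelerated method adds an extrapolation step on top of the same preconditioned update.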
AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation
Mixture-of-Experts (MoE) models have obtained state-of-the-art performance in
Neural Machine Translation (NMT) tasks. Existing works in MoE mostly consider a
homogeneous design where the same number of experts of the same size are placed
uniformly throughout the network. Furthermore, existing MoE works do not
consider computational constraints (e.g., FLOPs, latency) to guide their
design. To this end, we develop AutoMoE -- a framework for designing
heterogeneous MoEs under computational constraints. AutoMoE leverages Neural
Architecture Search (NAS) to obtain efficient sparse MoE sub-transformers with
4x inference speedup (CPU) and FLOPs reduction over manually designed
Transformers, with parity in BLEU score over dense Transformer and within 1
BLEU point of MoE SwitchTransformer, on aggregate over benchmark datasets for
NMT. Heterogeneous search space with dense and sparsely activated Transformer
modules (e.g., how many experts? where to place them? what should be their
sizes?) allows for adaptive compute -- where different amounts of computations
are used for different tokens in the input. Adaptivity comes naturally from
routing decisions which send tokens to experts of different sizes. AutoMoE
code, data, and trained models are available at https://aka.ms/AutoMoE. (ACL 2023 Findings)
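To make the "experts of different sizes" idea concrete, here is a toy PyTorch layer with a top-1 token router dispatching to feed-forward experts of unequal hidden widths; the widths, number of experts, and routing rule are illustrative choices, not the architectures found by AutoMoE's search.

```python
import torch
import torch.nn as nn

class HeterogeneousMoE(nn.Module):
    """Top-1 routed mixture of experts where each expert FFN has its own hidden size."""

    def __init__(self, d_model=512, expert_hidden_dims=(512, 1024, 2048)):
        super().__init__()
        self.router = nn.Linear(d_model, len(expert_hidden_dims))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, h), nn.ReLU(), nn.Linear(h, d_model))
            for h in expert_hidden_dims
        )

    def forward(self, x):                       # x: (num_tokens, d_model)
        scores = self.router(x).softmax(dim=-1)
        top1 = scores.argmax(dim=-1)            # one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i
            if mask.any():                      # tokens routed to a wider expert receive more FLOPs
                out[mask] = expert(x[mask]) * scores[mask, i].unsqueeze(-1)
        return out

# Example: 16 tokens of width 512; per-token compute depends on the routing decision.
y = HeterogeneousMoE()(torch.randn(16, 512))
```

This is where the adaptive compute comes from: the per-token cost is determined by which expert the router selects.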
Shortest paths without a map, but with an entropic regularizer
In a 1989 paper titled 'Shortest paths without a map', Papadimitriou and Yannakakis introduced an online model of searching in a weighted layered graph for a target node, while attempting to minimize the total length of the path traversed by the searcher. This problem, later called layered graph traversal, is parametrized by the maximum cardinality $k$ of a layer of the input graph. It is an online setting for dynamic programming, and it is known to be a rather general and fundamental model of online computing, which includes as special cases other acclaimed models. The deterministic competitive ratio for this problem was soon discovered to be exponential in $k$, and it is now nearly resolved: it lies between $\Omega(2^k)$ and $O(k\,2^k)$. Regarding the randomized competitive ratio, in 1993 Ramesh proved, surprisingly, that this ratio has to be at least $\Omega(k^2/\log^{1+\epsilon} k)$ (for any constant $\epsilon > 0$). In the same paper, Ramesh also gave an $O(k^{13})$-competitive randomized online algorithm. Since 1993, no progress has been reported on the randomized competitive ratio of layered graph traversal. In this work we show how to apply the mirror descent framework on a carefully selected evolving metric space, and obtain an $O(k^2)$-competitive randomized online algorithm, nearly matching the known lower bound on the randomized competitive ratio.