Minimax Policies for Combinatorial Prediction Games
We address the online linear optimization problem when the actions of the
forecaster are represented by binary vectors. Our goal is to understand the
magnitude of the minimax regret for the worst possible set of actions. We study
the problem under three different assumptions for the feedback: full
information, and the partial information models of the so-called "semi-bandit",
and "bandit" problems. We consider both -, and -type of
restrictions for the losses assigned by the adversary.
We formulate a general strategy using Bregman projections on top of a
potential-based gradient descent, which generalizes the ones studied in the
series of papers Gyorgy et al. (2007), Dani et al. (2008), Abernethy et al.
(2008), Cesa-Bianchi and Lugosi (2009), Helmbold and Warmuth (2009), Koolen et
al. (2010), Uchiya et al. (2010), Kale et al. (2010) and Audibert and Bubeck
(2010). We provide simple proofs that recover most of the previous results. We
propose new upper bounds for the semi-bandit game. Moreover we derive lower
bounds for all three feedback assumptions. With the only exception of the
bandit game, the upper and lower bounds are tight, up to a constant factor.
Finally, we answer a question asked by Koolen et al. (2010) by showing that the exponentially weighted average forecaster is suboptimal against $L_\infty$ adversaries.
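As a rough illustration of the "Bregman projections on top of a potential-based gradient descent" strategy, the sketch below instantiates the full-information case with a negative-entropy potential and the action set of $m$-subsets of $\{1,\dots,d\}$; the step size, starting point, and simplified projection (which enforces only the sum constraint of the convex hull) are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def osmd_full_info(losses, d, m, eta=0.1):
    """Potential-based gradient step followed by a Bregman projection
    (full-information feedback, negative-entropy potential).

    Illustrative assumptions: actions are m-subsets of {1,...,d}, so their
    convex hull is {x in [0,1]^d : sum(x) = m}; the projection below only
    enforces the equality constraint (an exact Bregman projection would
    also enforce x_i <= 1).  `losses` is a (T, d) array of loss vectors.
    """
    x = np.full(d, m / d)              # uniform starting point in the hull
    plays = []
    for loss in losses:
        plays.append(x.copy())         # play a random action whose mean is x (decomposition omitted)
        x = x * np.exp(-eta * loss)    # gradient step in the dual space of the entropy potential
        x = m * x / x.sum()            # KL (Bregman) projection onto {x >= 0 : sum(x) = m}
    return np.array(plays)

# Example: d = 10 components, actions of size m = 3, T = 100 rounds.
plays = osmd_full_info(np.random.rand(100, 10), d=10, m=3)
```

In the semi-bandit and bandit settings described above, the observed loss vector would be replaced by an unbiased estimator built from the revealed coordinates before the same update is applied.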
k-server via multiscale entropic regularization
We present an $O((\log k)^2)$-competitive randomized algorithm for the $k$-server problem on hierarchically separated trees (HSTs). This is the first $o(k)$-competitive randomized algorithm for which the competitive ratio is independent of the size of the underlying HST. Our algorithm is designed in the framework of online mirror descent where the mirror map is a multiscale entropy. When combined with Bartal's static HST embedding reduction, this leads to an $O((\log k)^2 \log n)$-competitive algorithm on any $n$-point metric space. We give a new dynamic HST embedding that yields an $O((\log k)^3 \log \Delta)$-competitive algorithm on any metric space where the ratio of the largest to smallest non-zero distance is at most $\Delta$.
How to Fine-Tune Vision Models with SGD
SGD and AdamW are the two most commonly used optimizers for fine-tuning large neural
networks in computer vision. When the two methods perform the same, SGD is
preferable because it uses less memory (12 bytes/parameter with momentum and 8
bytes/parameter without) than AdamW (16 bytes/parameter). However, on a suite
of downstream tasks, especially those with distribution shifts, we find that
fine-tuning with AdamW performs substantially better than SGD on modern Vision
Transformer and ConvNeXt models. We find that large gaps in performance between
SGD and AdamW occur when the fine-tuning gradients in the first "embedding"
layer are much larger than in the rest of the model. Our analysis suggests an
easy fix that works consistently across datasets and models: freezing the
embedding layer (less than 1% of the parameters) leads to SGD with or without
momentum performing slightly better than AdamW while using less memory (e.g.,
on ViT-L, SGD uses 33% less GPU memory). Our insights result in
state-of-the-art accuracies on five popular distribution shift benchmarks:
WILDS-FMoW, WILDS-Camelyon, BREEDS-Living-17, Waterbirds, and DomainNet.
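A minimal sketch of the "freeze the embedding layer, then fine-tune with SGD" recipe, assuming a timm Vision Transformer checkpoint; the model name, learning rate, and the exact set of parameters treated as the embedding layer are placeholder assumptions rather than the paper's configuration.

```python
import timm
import torch
import torch.nn.functional as F

# Pretrained ViT to be fine-tuned on a downstream classification task (placeholder head size).
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=10)

# Freeze the input "embedding" parameters (well under 1% of the model).
for p in model.patch_embed.parameters():
    p.requires_grad = False
model.cls_token.requires_grad = False   # optionally also freeze the class token
model.pos_embed.requires_grad = False   # and the positional embeddings

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)  # momentum is optional

def train_step(images, labels):
    optimizer.zero_grad()
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the frozen parameters carry no optimizer state, this keeps the memory advantage of SGD over AdamW that motivates the recipe.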
Statistically Preconditioned Accelerated Gradient Method for Distributed Optimization
We consider the setting of distributed empirical risk minimization where
multiple machines compute the gradients in parallel and a centralized server
updates the model parameters. In order to reduce the number of communications
required to reach a given accuracy, we propose a \emph{preconditioned}
accelerated gradient method where the preconditioning is done by solving a
local optimization problem over a subsampled dataset at the server. The
convergence rate of the method depends on the square root of the relative
condition number between the global and local loss functions. We estimate the
relative condition number for linear prediction models by studying
\emph{uniform} concentration of the Hessians over a bounded domain, which
allows us to derive improved convergence rates for existing preconditioned
gradient methods and our accelerated method. Experiments on real-world datasets
illustrate the benefits of acceleration in the ill-conditioned regime.
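The sketch below shows one non-accelerated preconditioned step for a least-squares (linear prediction) objective, where the server's local subproblem over the subsampled dataset reduces to a linear solve; the regularization parameter mu and the data layout are illustrative assumptions, and the extrapolation terms that give acceleration are omitted.

```python
import numpy as np

def preconditioned_step(w, grad_global, X_server, mu):
    """One statistically preconditioned step for least squares.

    grad_global : gradient of the global empirical risk at w, averaged over machines.
    X_server    : subsampled feature matrix held at the server; its Hessian
                  H = X^T X / n preconditions the step.
    mu          : regularizer accounting for the gap between the local and global
                  Hessians (related to the relative condition number).
    """
    n, d = X_server.shape
    H_server = X_server.T @ X_server / n
    # Solving the server's local subproblem exactly amounts to this linear solve.
    return w - np.linalg.solve(H_server + mu * np.eye(d), grad_global)
```

Each communication round, the workers send their local gradients, the server averages them into grad_global, applies this step, and broadcasts the new iterate; the accelerated method adds an extrapolation step on top of the same preconditioned update.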
AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation
Mixture-of-Experts (MoE) models have obtained state-of-the-art performance in
Neural Machine Translation (NMT) tasks. Existing works in MoE mostly consider a
homogeneous design where the same number of experts of the same size are placed
uniformly throughout the network. Furthermore, existing MoE works do not
consider computational constraints (e.g., FLOPs, latency) to guide their
design. To this end, we develop AutoMoE -- a framework for designing
heterogeneous MoEs under computational constraints. AutoMoE leverages Neural
Architecture Search (NAS) to obtain efficient sparse MoE sub-transformers with
4x inference speedup (CPU) and FLOPs reduction over manually designed
Transformers, with parity in BLEU score over dense Transformer and within 1
BLEU point of MoE SwitchTransformer, on aggregate over benchmark datasets for
NMT. Heterogeneous search space with dense and sparsely activated Transformer
modules (e.g., how many experts? where to place them? what should be their
sizes?) allows for adaptive compute -- where different amounts of computations
are used for different tokens in the input. Adaptivity comes naturally from
routing decisions which send tokens to experts of different sizes. AutoMoE
code, data, and trained models are available at https://aka.ms/AutoMoE. (ACL 2023 Findings)
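To make the "experts of different sizes" idea concrete, here is a toy PyTorch layer with a top-1 token router dispatching to feed-forward experts of unequal hidden widths; the widths, number of experts, and routing rule are illustrative choices, not the architectures found by AutoMoE's search.

```python
import torch
import torch.nn as nn

class HeterogeneousMoE(nn.Module):
    """Top-1 routed mixture of experts where each expert FFN has its own hidden size."""

    def __init__(self, d_model=512, expert_hidden_dims=(512, 1024, 2048)):
        super().__init__()
        self.router = nn.Linear(d_model, len(expert_hidden_dims))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, h), nn.ReLU(), nn.Linear(h, d_model))
            for h in expert_hidden_dims
        )

    def forward(self, x):                       # x: (num_tokens, d_model)
        scores = self.router(x).softmax(dim=-1)
        top1 = scores.argmax(dim=-1)            # one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top1 == i
            if mask.any():                      # tokens routed to a wider expert receive more FLOPs
                out[mask] = expert(x[mask]) * scores[mask, i].unsqueeze(-1)
        return out

# Example: 16 tokens of width 512; per-token compute depends on the routing decision.
y = HeterogeneousMoE()(torch.randn(16, 512))
```

This is where the adaptive compute comes from: the per-token cost is determined by which expert the router selects.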
Shortest paths without a map, but with an entropic regularizer
In a 1989 paper titled 'Shortest paths without a map', Papadimitriou and Yannakakis introduced an online model of searching in a weighted layered graph for a target node, while attempting to minimize the total length of the path traversed by the searcher. This problem, later called layered graph traversal, is parametrized by the maximum cardinality $k$ of a layer of the input graph. It is an online setting for dynamic programming, and it is known to be a rather general and fundamental model of online computing, which includes as special cases other acclaimed models. The deterministic competitive ratio for this problem was soon discovered to be exponential in $k$, and it is now nearly resolved: it lies between $\Omega(2^k)$ and $O(k\,2^k)$. Regarding the randomized competitive ratio, in 1993 Ramesh proved, surprisingly, that this ratio has to be at least $\Omega(k^2/\log^{1+\epsilon} k)$ (for any constant $\epsilon > 0$). In the same paper, Ramesh also gave an $O(k^{13})$-competitive randomized online algorithm. Since 1993, no progress has been reported on the randomized competitive ratio of layered graph traversal. In this work we show how to apply the mirror descent framework on a carefully selected evolving metric space, and obtain an $O(k^2)$-competitive randomized online algorithm, nearly matching the known lower bound on the randomized competitive ratio.