Reprogramming under constraints: Revisiting efficient and reliable transferability of lottery tickets
In the era of foundation models with huge pre-training budgets, the
focus for downstream tasks has shifted to efficient and fast
adaptation. For classification-based tasks in the domain of computer vision,
the two most efficient approaches have been linear probing (LP) and visual
prompting/reprogramming (VP); the former aims to learn a classifier in the form
of a linear head on the features extracted by the pre-trained model, while the
latter maps the input data to the domain of the source data on which the model
was originally pre-trained. Although extensive studies have demonstrated the
differences between LP and VP in terms of downstream performance, we explore
the capabilities of the two aforementioned methods via the sparsity axis: (a)
Data sparsity: the impact of few-shot adaptation and (b) Model sparsity: the
impact of lottery tickets (LT). We demonstrate that LT are not universal
reprogrammers, i.e., for certain target datasets, reprogramming an LT yields
significantly lower performance than the reprogrammed dense model although
their corresponding upstream performance is similar. Further, we demonstrate
that the calibration of dense models is always superior to that of their
lottery ticket counterparts under both LP and VP regimes. Our empirical study
opens a new avenue of research into VP for sparse models and encourages further
understanding of the performance beyond the accuracy achieved by VP under
constraints of sparsity. Code and logs can be accessed at
\url{https://github.com/landskape-ai/Reprogram_LT}. Comment: Preprint
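The abstract above contrasts linear probing (a linear head on frozen features) with visual prompting (mapping the target input into the source input domain). A minimal sketch of the two forward computations may help make the difference concrete. Everything here is an illustrative assumption rather than the authors' implementation: the function names `linear_probe` and `visual_prompt` are hypothetical, images are single-channel square grids, and the prompt is assumed to be a padding-style border around the centered target image.

```python
from typing import List

Image = List[List[float]]  # single-channel square image as rows of pixel values


def linear_probe(features: List[float],
                 weights: List[List[float]],
                 bias: List[float]) -> List[float]:
    """Return class logits from a linear head applied to frozen features.

    Only `weights` and `bias` would be trained; the pre-trained model that
    produced `features` stays fixed (assumed setup).
    """
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def visual_prompt(image: Image, prompt: Image) -> Image:
    """Place `image` at the center of `prompt` (assumed padding-style prompt).

    The border pixels of `prompt` play the role of trainable parameters; the
    result has the input size expected by the frozen source model.
    """
    size, inner = len(prompt), len(image)
    if inner > size:
        raise ValueError("image must not be larger than the prompt canvas")
    offset = (size - inner) // 2
    canvas = [row[:] for row in prompt]  # copy so the prompt is not mutated
    for r in range(inner):
        for c in range(inner):
            canvas[offset + r][offset + c] = image[r][c]
    return canvas
```

In this sketch, linear probing changes only what happens after the frozen model, while visual prompting changes only what is fed into it; a full visual prompting pipeline would also need a mapping from source labels to target labels, which is omitted here.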
APP: Anytime Progressive Pruning
With the latest advances in deep learning, there has been a lot of focus on
the online learning paradigm due to its relevance in practical settings.
Although many methods have been investigated for optimal learning settings in
scenarios where the data stream is continuous over time, sparse network
training in such settings has often been overlooked. In this paper, we explore
the problem of training a neural network with a target sparsity in a particular
case of online learning: the anytime learning at macroscale paradigm (ALMA). We
propose a novel way of progressive pruning, referred to as \textit{Anytime
Progressive Pruning} (APP); the proposed approach significantly outperforms the
baseline dense and Anytime OSP models across multiple architectures and
datasets under short, moderate, and long-sequence training. Our method, for
example, improves accuracy and reduces the generalization gap while being
a fraction of the size of the dense baseline model in few-shot restricted
ImageNet training. We
further observe interesting nonmonotonic transitions in the generalization gap
in ALMA with a high number of megabatches. The code and experiment
dashboards can be accessed at
\url{https://github.com/landskape-ai/Progressive-Pruning} and
\url{https://wandb.ai/landskape/APP}, respectively. Comment: 21 pages including 4 pages of references. Preprint version
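The abstract above describes pruning progressively toward a target sparsity as megabatches arrive. The sketch below illustrates the general idea only and is not the APP algorithm: it assumes a simple linear sparsity ramp and global magnitude pruning, both swapped in for illustration, and the names `sparsity_schedule` and `magnitude_prune` are hypothetical.

```python
from typing import List


def sparsity_schedule(step: int, total_steps: int, target: float) -> float:
    """Linearly ramp sparsity from 0 to `target` over `total_steps` (assumed schedule)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    clamped = min(max(step, 0), total_steps)
    return target * clamped / total_steps


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    # Small epsilon guards against float products such as 0.29 * 100 rounding down.
    n_prune = int(len(weights) * sparsity + 1e-9)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


if __name__ == "__main__":
    weights = [0.5, -0.1, 2.0, 0.05, -1.5, 0.3, 0.0, -0.7]
    # One pruning pass per (hypothetical) megabatch; training between passes is omitted.
    for megabatch in range(1, 5):
        level = sparsity_schedule(megabatch, total_steps=4, target=0.5)
        weights = magnitude_prune(weights, level)
        print(megabatch, level, weights)
```

In an anytime setting, each pruning pass would be interleaved with training on the newly arrived megabatch, so the model remains usable at every point in the stream while approaching the target sparsity.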
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
Language models demonstrate both quantitative improvement and new qualitative
capabilities with increasing scale. Despite their potentially transformative
impact, these new capabilities are as yet poorly characterized. In order to
inform future research, prepare for disruptive new model capabilities, and
ameliorate socially harmful effects, it is vital that we understand the present
and near-future capabilities and limitations of language models. To address
this challenge, we introduce the Beyond the Imitation Game benchmark
(BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442
authors across 132 institutions. Task topics are diverse, drawing problems from
linguistics, childhood development, math, common-sense reasoning, biology,
physics, social bias, software development, and beyond. BIG-bench focuses on
tasks that are believed to be beyond the capabilities of current language
models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense
transformer architectures, and Switch-style sparse transformers on BIG-bench,
across model sizes spanning millions to hundreds of billions of parameters. In
addition, a team of human expert raters performed all tasks in order to provide
a strong baseline. Findings include: model performance and calibration both
improve with scale, but are poor in absolute terms (and when compared with
rater performance); performance is remarkably similar across model classes,
though with benefits from sparsity; tasks that improve gradually and
predictably commonly involve a large knowledge or memorization component,
whereas tasks that exhibit "breakthrough" behavior at a critical scale often
involve multiple steps or components, or brittle metrics; social bias typically
increases with scale in settings with ambiguous context, but this can be
improved with prompting. Comment: 27 pages, 17 figures + references and appendices, repo:
https://github.com/google/BIG-benc
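Several abstracts in this listing report findings about model calibration. One common way to quantify calibration is the expected calibration error (ECE), sketched below with equal-width confidence bins. This is offered as an illustration under stated assumptions; it is not necessarily the metric or binning scheme used in the works listed here, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Weighted average gap between mean confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        return 0.0
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

A lower value means predicted confidence tracks empirical accuracy more closely; for example, a model that is 90% confident on a set of predictions but correct on only 75% of them contributes a gap of 0.15 for that bin.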