Fast Trainable Projection for Robust Fine-Tuning
Robust fine-tuning aims to achieve competitive in-distribution (ID)
performance while maintaining the out-of-distribution (OOD) robustness of a
pre-trained model when transferring it to a downstream task. Recently,
projected gradient descent has been successfully used in robust fine-tuning by
constraining the deviation from the initialization of the fine-tuned model
explicitly through projection. However, two algorithmic limitations, scalability
and efficiency, prevent this method from being adopted more widely.
In this paper, we propose a new projection-based fine-tuning algorithm, Fast
Trainable Projection (FTP), for computationally efficient learning of per-layer
projection constraints, resulting in an average speedup on our
benchmarks compared to prior works. FTP can be combined with existing
optimizers such as AdamW and used in a plug-and-play fashion. Finally, we
show that FTP is a special instance of hyper-optimizers that tune the
hyper-parameters of optimizers in a learnable manner through nested
differentiation. Empirically, we show superior robustness on OOD datasets,
including domain shifts and natural corruptions, across four different vision
tasks with five different pre-trained models. Additionally, we demonstrate that
FTP is broadly applicable and beneficial to other learning scenarios such as
low-label and continual learning settings thanks to its easy adaptability. The
code will be available at https://github.com/GT-RIPL/FTP.git.
Comment: Accepted to NeurIPS 2023
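The projection step that FTP builds on can be illustrated with a short sketch. The PyTorch snippet below is a minimal, hypothetical version of a per-layer projection applied after each optimizer step; the function name, the fixed `radius` argument, and the usage hook are illustrative assumptions (FTP additionally learns the radii efficiently), not the released implementation.

```python
import torch

def project_to_radius(w_finetuned, w_pretrained, radius):
    """Project fine-tuned weights onto an L2 ball of the given radius
    centred at the pre-trained weights (one per-layer constraint)."""
    delta = w_finetuned - w_pretrained
    norm = delta.norm()
    if norm > radius:
        delta = delta * (radius / norm)
    return w_pretrained + delta

# Hypothetical post-step hook (e.g. after an AdamW step); `pretrained` and
# `radii` are assumed dicts keyed by parameter name, not part of the paper:
# with torch.no_grad():
#     for name, p in model.named_parameters():
#         p.copy_(project_to_radius(p, pretrained[name], radii[name]))
```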
Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion
Producing quality segmentation masks for images is a fundamental problem in
computer vision. Recent research has explored large-scale supervised training
to enable zero-shot segmentation on virtually any image style and unsupervised
training to enable segmentation without dense annotations. However,
constructing a model capable of segmenting anything in a zero-shot manner
without any annotations is still challenging. In this paper, we propose to
utilize the self-attention layers in stable diffusion models to achieve this
goal because the pre-trained stable diffusion model has learned inherent
concepts of objects within its attention layers. Specifically, we introduce a
simple yet effective iterative merging process based on measuring KL divergence
among attention maps to merge them into valid segmentation masks. The proposed
method does not require any training or language dependency to extract quality
segmentation for any image. On COCO-Stuff-27, our method surpasses the prior
unsupervised zero-shot SOTA method by an absolute 26% in pixel accuracy and 17%
in mean IoU. The project page is at
https://sites.google.com/view/diffseg/home
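A minimal sketch of the kind of KL-divergence-based merging the abstract describes, in PyTorch. The symmetric divergence, the greedy single-pass merge, and the `threshold` parameter are assumptions made for illustration; the paper's actual iterative merging over stable-diffusion self-attention maps differs in detail.

```python
import torch

def sym_kl(p, q, eps=1e-8):
    """Symmetric KL divergence between two attention maps,
    each flattened and normalized to a probability distribution."""
    p = p.flatten() + eps
    q = q.flatten() + eps
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * (torch.sum(p * (p / q).log()) + torch.sum(q * (q / p).log()))

def merge_attention_maps(maps, threshold):
    """Greedy single pass: average maps whose symmetric KL divergence is
    below a threshold; in practice this would be repeated until stable."""
    merged = [maps[0]]
    for m in maps[1:]:
        for i, proto in enumerate(merged):
            if sym_kl(proto, m) < threshold:
                merged[i] = 0.5 * (proto + m)
                break
        else:
            merged.append(m)
    return merged
```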
Polyhistor: Parameter-Efficient Multi-Task Adaptation for Dense Vision Tasks
Adapting large-scale pretrained models to various downstream tasks via
fine-tuning is a standard method in machine learning. Recently,
parameter-efficient fine-tuning methods show promise in adapting a pretrained
model to different tasks while training only a few parameters. Despite their
success, most existing methods are proposed for Natural Language Processing
tasks with language Transformers, and adaptation to Computer Vision tasks with
Vision Transformers remains under-explored, especially for dense vision tasks.
Further, in multi-task settings, individually fine-tuning and storing separate
models for different tasks is inefficient. In this work, we provide an
extensive multi-task parameter-efficient benchmark and examine existing
parameter-efficient fine-tuning NLP methods for vision tasks. Our results on
four different dense vision tasks show that existing methods cannot be
efficiently integrated due to the hierarchical structure of Hierarchical
Vision Transformers. To overcome this issue, we propose Polyhistor and
Polyhistor-Lite, consisting of Decomposed HyperNetworks and Layer-wise Scaling
Kernels, to share information across different tasks with a few trainable
parameters. This leads to favorable performance improvements over existing
parameter-efficient methods while using fewer trainable parameters.
Specifically, Polyhistor achieves competitive accuracy compared to the
state-of-the-art while only using ~10% of their trainable parameters.
Furthermore, our methods show larger performance gains when large networks and
more pretraining data are used.
Comment: Accepted to NeurIPS 2022; Project Page is at
https://ycliu93.github.io/projects/polyhistor.htm
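To make the parameter-sharing idea concrete, here is a small, hypothetical PyTorch sketch of a hypernetwork that emits low-rank adapter weights per task from a shared task embedding. The class name, shapes, and the omission of the layer-wise scaling kernels are assumptions; this is not Polyhistor's architecture, only an illustration of generating per-task adapters from a small shared set of trainable parameters.

```python
import torch
import torch.nn as nn

class DecomposedHyperNetwork(nn.Module):
    """Illustrative sketch: generate low-rank adapter factors A (d x r) and
    B (r x d) for each task from a shared task embedding, so the per-task
    adapter weight W_task = A @ B is never stored explicitly."""
    def __init__(self, num_tasks, embed_dim, d, r):
        super().__init__()
        self.task_embed = nn.Embedding(num_tasks, embed_dim)
        self.gen_A = nn.Linear(embed_dim, d * r)  # produces factor A
        self.gen_B = nn.Linear(embed_dim, r * d)  # produces factor B
        self.d, self.r = d, r

    def forward(self, task_id):
        e = self.task_embed(task_id)
        A = self.gen_A(e).view(self.d, self.r)
        B = self.gen_B(e).view(self.r, self.d)
        return A @ B  # adapter weight for this task

# Hypothetical usage: four tasks sharing one generator for 768-dim adapters.
# hyper = DecomposedHyperNetwork(num_tasks=4, embed_dim=32, d=768, r=8)
# W_task2 = hyper(torch.tensor(2))
```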
Trainable Projected Gradient Method for Robust Fine-tuning
Recent studies on transfer learning have shown that selectively fine-tuning a
subset of layers or customizing different learning rates for each layer can
greatly improve robustness to out-of-distribution (OOD) data and retain
generalization capability in the pre-trained models. However, most of these
methods employ manually crafted heuristics or expensive hyper-parameter
searches, which prevent them from scaling up to large datasets and neural
networks. To solve this problem, we propose Trainable Projected Gradient Method
(TPGM) to automatically learn the constraint imposed for each layer for a
fine-grained fine-tuning regularization. This is motivated by formulating
fine-tuning as a bi-level constrained optimization problem. Specifically, TPGM
maintains a set of projection radii, i.e., distance constraints between the
fine-tuned model and the pre-trained model, for each layer, and enforces them
through weight projections. To learn the constraints, we propose a bi-level
optimization to automatically learn the best set of projection radii in an
end-to-end manner. Theoretically, we show that the bi-level optimization
formulation could explain the regularization capability of TPGM. Empirically,
with little hyper-parameter search cost, TPGM outperforms existing fine-tuning
methods in OOD performance while matching the best in-distribution (ID)
performance. For example, when fine-tuned on DomainNet-Real and ImageNet,
compared to vanilla fine-tuning, TPGM shows relative OOD improvements on their
sketch counterparts. Code is available at
https://github.com/PotatoTian/TPGM.
Comment: Accepted to CVPR 2023
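The central idea can be sketched as a projection whose radius is itself a learnable parameter, so an outer (e.g. validation) loss can update it while an inner loop fine-tunes the weights. The PyTorch snippet below is an illustrative assumption: class and argument names are invented, and neither the bi-level update schedule nor TPGM's exact projection is reproduced.

```python
import torch
import torch.nn as nn

class TrainableProjection(nn.Module):
    """Differentiable per-layer projection with a learnable radius,
    so gradients from an outer loss can adjust the constraint."""
    def __init__(self, init_radius=1.0):
        super().__init__()
        self.radius = nn.Parameter(torch.tensor(init_radius))

    def forward(self, w, w0):
        delta = w - w0
        norm = delta.norm() + 1e-12
        # Scale the update so ||w - w0|| <= radius while keeping the
        # radius in the computation graph.
        scale = torch.clamp(self.radius / norm, max=1.0)
        return w0 + scale * delta
```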
A Closer Look at Rehearsal-Free Continual Learning
Continual learning describes a setting where machine learning models learn
novel concepts from continuously shifting training data, while simultaneously
avoiding degradation of knowledge on previously seen classes, which may
disappear from the training data for extended periods of time (a phenomenon
known as the catastrophic forgetting problem). Current approaches for continual
learning of a single expanding task (aka class-incremental continual learning)
require extensive rehearsal of previously seen data to avoid this degradation
of knowledge. Unfortunately, rehearsal comes at a sharp cost to memory and
computation, and it may also violate data-privacy. Instead, we explore
combining knowledge distillation and parameter regularization in new ways to
achieve strong continual learning performance without rehearsal. Specifically,
we take a deep dive into common continual learning techniques: prediction
distillation, feature distillation, L2 parameter regularization, and EWC
parameter regularization. We first disprove the common assumption that
parameter regularization techniques fail for rehearsal-free continual learning
of a single, expanding task. Next, we explore how to leverage knowledge from a
pre-trained model in rehearsal-free continual learning and find that vanilla L2
parameter regularization outperforms EWC parameter regularization and feature
distillation. We then highlight the impact of the rehearsal-free continual
learning settings with a classifier expansion benchmark, showing that a
strategy based on our findings combined with a positive/negative label
balancing heuristic can close the performance gap between the upper bound and
the existing strategies by up to roughly 50%. Finally, we show that a simple
method consisting of pre-training, L2 regularization, and prediction
distillation can even outperform rehearsal-based methods on the common
CIFAR-100 benchmark.
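The recipe in the final sentence combines three standard losses. Below is a minimal PyTorch sketch of such a combination; the weighting coefficients, the distillation temperature, and the assumption that old and new logits share the same class dimension are illustrative choices, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def rehearsal_free_loss(logits, labels, old_logits, params, anchor_params,
                        l2_weight=0.01, kd_weight=1.0, temperature=2.0):
    """Illustrative combination of the studied components: cross-entropy on
    the current task, L2 regularization toward anchor weights (e.g. the
    pre-trained or previous-task model), and prediction distillation from
    the frozen previous model's logits."""
    ce = F.cross_entropy(logits, labels)
    l2 = sum(((p - a) ** 2).sum() for p, a in zip(params, anchor_params))
    kd = F.kl_div(
        F.log_softmax(logits / temperature, dim=1),
        F.softmax(old_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
    return ce + l2_weight * l2 + kd_weight * kd
```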