Fine-tuning pre-trained Vision Transformers (ViT) has consistently
demonstrated promising performance in the realm of visual recognition. However,
adapting large pre-trained models to various tasks poses a significant
challenge. This challenge arises from the need for each model to undergo an
independent and comprehensive fine-tuning process, leading to substantial
computational and memory demands. While recent advances in
Parameter-Efficient Transfer Learning (PETL) can surpass full fine-tuning
while updating only a small subset of parameters, they tend to overlook dense
prediction tasks such as
object detection and segmentation. In this paper, we introduce Hierarchical
Side-Tuning (HST), a novel PETL approach that enables ViT transfer to various
downstream tasks effectively. Unlike existing methods that fine-tune only
parameters in the input space or specific modules attached to the
backbone, we tune a lightweight and hierarchical side network (HSN) that
leverages intermediate activations extracted from the backbone and generates
multi-scale features to make predictions. To validate HST, we conducted
extensive experiments encompassing diverse visual tasks, including
classification, object detection, instance segmentation, and semantic
segmentation. Notably, our method achieves state-of-the-art average Top-1
accuracy of 76.0% on VTAB-1k, all while fine-tuning a mere 0.78M parameters.
When applied to object detection tasks on the COCO test-dev benchmark, HST
even surpasses full fine-tuning, achieving 49.7 box AP and 43.2 mask AP with
Cascade Mask R-CNN.