Large pretrained plain vision Transformers (ViTs) have been the workhorse for
many downstream tasks. However, existing approaches that adapt off-the-shelf ViTs are
inefficient in training and deployment, because adopting ViTs of individual
sizes requires separate training runs and each deployed model is restricted to a fixed
performance-efficiency trade-off. In this paper, we draw inspiration from stitchable
neural networks (SN-Net), a recent framework that cheaply produces a
single model covering a rich set of subnetworks by stitching a family of pretrained
models, thereby supporting diverse performance-efficiency trade-offs at runtime.
Building upon this foundation, we introduce SN-Netv2, a systematically improved
model stitching framework to facilitate downstream task adaptation.
Specifically, we first propose a two-way stitching scheme to enlarge the
stitching space. We then design a resource-constrained sampling strategy that
takes into account the underlying FLOPs distributions in the space for better
sampling. Finally, we observe that learning the stitching layers as a low-rank
update is essential on downstream tasks, stabilizing training and
ensuring a good Pareto frontier. With extensive experiments on ImageNet-1K,
ADE20K, COCO-Stuff-10K and NYUv2, SN-Netv2 demonstrates superior performance
over SN-Netv1 on downstream dense prediction tasks and serves as a strong,
flexible vision backbone, with clear advantages in both training
efficiency and deployment flexibility. Code is available at
https://github.com/ziplab/SN-Netv2.
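To make the low-rank stitching-layer idea concrete, the following is a minimal PyTorch sketch, assuming the stitching layer is a linear projection initialized by a least-squares fit between paired anchor activations and then trained only through a low-rank residual; the names (e.g., LowRankStitchingLayer, init_from_activations) and hyperparameters are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn as nn


class LowRankStitchingLayer(nn.Module):
    """Illustrative sketch: a linear stitching layer whose effective weight is
    a frozen initialization plus a trainable low-rank update, W = W0 + B @ A."""

    def __init__(self, in_dim: int, out_dim: int, rank: int = 16):
        super().__init__()
        # Frozen base weight, e.g. obtained by a least-squares fit between
        # paired activations of the two anchor ViTs (assumption).
        self.register_buffer("weight0", torch.zeros(out_dim, in_dim))
        self.bias = nn.Parameter(torch.zeros(out_dim))
        # Trainable low-rank factors; B starts at zero so the update is
        # initially zero and training begins from the least-squares solution.
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_dim, rank))

    @torch.no_grad()
    def init_from_activations(self, x_src: torch.Tensor, x_tgt: torch.Tensor):
        # Least-squares init: find W0 minimizing ||x_src @ W0^T - x_tgt||.
        # x_src: (N, in_dim) tokens from the smaller anchor,
        # x_tgt: (N, out_dim) tokens from the larger anchor.
        sol = torch.linalg.lstsq(x_src, x_tgt).solution  # (in_dim, out_dim)
        self.weight0.copy_(sol.T)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.weight0 + self.B @ self.A  # low-rank residual update
        return nn.functional.linear(x, weight, self.bias)


# Example (hypothetical dimensions): map ViT-Ti tokens (dim 192) into a
# ViT-S block (dim 384) at a stitching point.
layer = LowRankStitchingLayer(192, 384, rank=16)
```

Because only the low-rank factors (and bias) receive gradients while the least-squares initialization is kept frozen, each stitching layer adds few trainable parameters, which is consistent with the stabilizing effect described above.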