Knowledge distillation (KD) is an effective training strategy for improving
lightweight student models under the guidance of cumbersome teachers. However,
large architectural differences across teacher-student pairs limit the
distillation gains. In contrast to previous adaptive distillation methods that
reduce the teacher-student gap, we explore a novel training-free framework to
search for the best student architecture for a given teacher. We first show
empirically that the model that is optimal under vanilla training is not
necessarily the winner in distillation. Second, we find that the similarity of
feature semantics and sample relations between randomly initialized
teacher-student networks correlates well with the final distillation
performance. Thus, we efficiently measure similarity matrices conditioned on the semantic activation
maps to select the optimal student via an evolutionary algorithm without any
training. In this way, our student architecture search for Distillation WithOut
Training (DisWOT) significantly improves the performance of the searched model
in the distillation stage, with at least 180× training acceleration.
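To make the idea concrete, the sketch below shows one way such a training-free score could be computed: it compares the pairwise sample-relation matrices of a randomly initialized teacher and student on a single mini-batch. This is a minimal illustration under our own assumptions, not the released DisWOT implementation; the function names and toy backbones are hypothetical, and the conditioning on semantic activation maps is omitted for brevity.

# Illustrative sketch (hypothetical names, not the authors' code):
# score how well a random-init teacher and student agree on sample relations.
import torch
import torch.nn as nn

def batch_relation(feat: torch.Tensor) -> torch.Tensor:
    # Pairwise sample-relation (cosine Gram) matrix over a mini-batch:
    # (B, C, H, W) feature map -> (B, B) relation matrix.
    b = feat.size(0)
    flat = feat.reshape(b, -1)
    flat = nn.functional.normalize(flat, dim=1)   # unit-normalize each sample
    return flat @ flat.t()

def relation_similarity(t_feat: torch.Tensor, s_feat: torch.Tensor) -> float:
    # Higher score = teacher/student relations agree more (negative Frobenius gap).
    rt, rs = batch_relation(t_feat), batch_relation(s_feat)
    return -torch.norm(rt - rs, p="fro").item()

# Toy teacher/student backbones at random initialization (no training at all).
teacher = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
student = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())

x = torch.randn(32, 3, 32, 32)                    # one CIFAR-sized batch
with torch.no_grad():
    score = relation_similarity(teacher(x), student(x))
print(f"training-free relation score: {score:.4f}")

In an evolutionary search, a score of this kind could rank candidate student architectures without any gradient updates, so that only the highest-scoring candidate is then trained with distillation.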
Additionally, we extend the similarity metrics in DisWOT to serve as new
distillers and KD-based zero-cost proxies. Our experiments on CIFAR, ImageNet,
and NAS-Bench-201 demonstrate that our technique achieves state-of-the-art
results across different search spaces. Our project and code are available at
https://lilujunai.github.io/DisWOT-CVPR2023/.