12 research outputs found
Efficient Vision Transformers via Fine-Grained Manifold Distillation
This paper studies the model compression problem of vision transformers.
Benefiting from the self-attention module, transformer architectures have shown
extraordinary performance on many computer vision tasks. Although the network
performance is boosted, transformers often require more computational
resources, including memory usage and inference complexity. In contrast to
existing knowledge distillation approaches, we propose to excavate useful
information from the teacher transformer through the relationship between
images and their divided patches. We then explore an efficient fine-grained
manifold distillation approach that simultaneously calculates cross-image,
cross-patch, and randomly selected manifolds in the teacher and student models.
Experimental results on several benchmarks demonstrate the superiority of the
proposed algorithm for distilling portable transformer models with higher
performance. For example, our approach achieves 75.06% Top-1 accuracy on the
ImageNet-1k dataset when training a DeiT-Tiny model, outperforming other ViT
distillation methods.
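The abstract only sketches the method, so the following PyTorch-style sketch is
a rough illustration rather than the authors' implementation: the choice of
cosine similarity as the relation metric, the tensor shapes, and the equal loss
weights are all assumptions. It shows how cross-image, cross-patch, and
randomly sampled relation matrices over patch tokens might be matched between
teacher and student.

```python
import torch
import torch.nn.functional as F

def relation_matrix(feats: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine-similarity ("manifold") matrix over a set of embeddings.

    feats: (n, d) tensor of n embeddings. Returns an (n, n) relation matrix.
    """
    feats = F.normalize(feats, dim=-1)
    return feats @ feats.t()

def manifold_distill_loss(student_patches, teacher_patches, num_sampled=192):
    """Match relation matrices between student and teacher patch tokens.

    student_patches: (B, P, d_s), teacher_patches: (B, P, d_t). The relation
    matrices are (n, n) on both sides, so differing embedding widths between
    teacher and student are not a problem.
    """
    teacher_patches = teacher_patches.detach()      # teacher is frozen
    B, P, _ = student_patches.shape

    # Cross-image term: relations between images (mean-pooled patch tokens).
    s_img, t_img = student_patches.mean(dim=1), teacher_patches.mean(dim=1)
    loss = F.mse_loss(relation_matrix(s_img), relation_matrix(t_img))

    # Cross-patch term: relations between patches within each image.
    for b in range(B):
        loss = loss + F.mse_loss(relation_matrix(student_patches[b]),
                                 relation_matrix(teacher_patches[b])) / B

    # Randomly sampled term: relations over a random subset of all patch tokens.
    s_flat = student_patches.reshape(B * P, -1)
    t_flat = teacher_patches.reshape(B * P, -1)
    idx = torch.randperm(B * P)[:num_sampled]
    loss = loss + F.mse_loss(relation_matrix(s_flat[idx]),
                             relation_matrix(t_flat[idx]))
    return loss
```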
Self-Supervised GAN Compression
Deep learning's success has led to larger and larger models to handle more and
more complex tasks; trained models can contain millions of parameters. These
large models are compute- and memory-intensive, which makes it a challenge to
deploy them within latency, throughput, and storage constraints. Some model
compression methods have been successfully applied to image classification,
detection, or language models, but there has been very little work on
compressing generative adversarial networks (GANs) performing complex tasks. In
this paper, we show that a standard model compression technique, weight
pruning, cannot be applied to GANs using existing methods. We then develop a
self-supervised compression technique which uses the trained discriminator to
supervise the training of a compressed generator. We show that this framework
maintains compelling performance up to high degrees of sparsity, can be easily
applied to new tasks and models, and enables meaningful comparisons between
different pruning granularities.
Comment: The appendix for this paper is in the following repository
https://gitlab.com/dxxz/Self-Supervised-GAN-Compression-Appendi
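As a rough illustration of the idea of a trained discriminator supervising a
compressed generator, here is a minimal PyTorch sketch, not the authors' code:
the magnitude-pruning scheme, the non-saturating loss, and the assumption that
the discriminator returns a single logit per sample are all illustrative
choices.

```python
import torch
import torch.nn as nn

def magnitude_prune_(module: nn.Module, sparsity: float = 0.8) -> None:
    """In-place global magnitude pruning: zero the smallest-magnitude weights."""
    weights = [p for p in module.parameters() if p.dim() > 1]
    all_w = torch.cat([w.detach().abs().flatten() for w in weights])
    threshold = torch.quantile(all_w, sparsity)
    for w in weights:
        w.data.mul_((w.detach().abs() > threshold).float())

def compress_generator(generator, discriminator, noise_dim=128, steps=1000,
                       batch_size=32, lr=1e-4, device="cpu"):
    """Fine-tune a pruned generator with the trained discriminator as supervisor.

    The frozen discriminator scores the pruned generator's samples, and the
    generator is pushed to make them look "real". (A full implementation would
    also reapply the pruning mask after each step so zeroed weights stay zero.)
    """
    magnitude_prune_(generator, sparsity=0.8)
    discriminator.eval()                      # frozen, pre-trained critic
    for p in discriminator.parameters():
        p.requires_grad_(False)

    opt = torch.optim.Adam(generator.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        z = torch.randn(batch_size, noise_dim, device=device)
        fake = generator(z)
        # Assumes the discriminator returns one logit per sample: shape (B, 1).
        loss = bce(discriminator(fake), torch.ones(batch_size, 1, device=device))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator
```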
Improved Knowledge Distillation via Teacher Assistant
Despite the fact that deep neural networks are powerful models and achieve
appealing results on many tasks, they are too large to be deployed on edge
devices like smartphones or embedded sensor nodes. There have been efforts to
compress these networks, and a popular method is knowledge distillation, where
a large (teacher) pre-trained network is used to train a smaller (student)
network. However, in this paper, we show that the student network's performance
degrades when the gap between student and teacher is large. Given a fixed
student network, one cannot employ an arbitrarily large teacher; in other
words, a teacher can effectively transfer its knowledge only to students down
to a certain size, not smaller. To alleviate this shortcoming, we introduce
multi-step knowledge distillation, which employs an intermediate-sized network
(teacher assistant) to bridge the gap between the student and the teacher.
Moreover, we study the effect of teacher assistant size and extend the
framework to multiple distillation steps. Theoretical analysis and extensive
experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets and on CNN and
ResNet architectures substantiate the effectiveness of our proposed approach.
Comment: AAAI 202
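For context, the sketch below shows how multi-step distillation with a teacher
assistant might be wired up in PyTorch, using the standard softened-logit
distillation loss. It is a hedged illustration rather than the paper's exact
training recipe: the temperature, loss weights, and optimizer settings are
assumptions.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style distillation: softened-logit KL plus hard-label cross-entropy."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

def distill(teacher, student, loader, epochs=1, lr=0.01, device="cpu"):
    """One distillation stage: train `student` to mimic a frozen `teacher`."""
    teacher.eval()
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            with torch.no_grad():
                t_logits = teacher(x)
            loss = kd_loss(student(x), t_logits, y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student

# Multi-step distillation: teacher -> teacher assistant -> student.
# assistant = distill(teacher, assistant, train_loader)
# student   = distill(assistant, student, train_loader)
```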
Beyond Disentangled Representations: An Attentive Angular Distillation Approach to Large-scale Lightweight Age-Invariant Face Recognition
Disentangled representations have been commonly adopted in Age-invariant Face
Recognition (AiFR) tasks. However, these methods have reached some limitations:
(1) they require large-scale face recognition (FR) training data with age
labels, which is limited in practice; (2) they rely on heavy deep network
architectures for high performance; and (3) they are usually evaluated only on
age-related face databases, neglecting the standard large-scale FR databases
that would guarantee robustness. This work presents a novel Attentive Angular
Distillation (AAD) approach to large-scale lightweight AiFR that overcomes
these limitations. Given two high-performance heavy networks as teachers with
different specialized knowledge, AAD introduces a learning paradigm to
efficiently distill the age-invariant attentive and angular knowledge from
those teachers to a lightweight student network, making the student more
powerful, with higher FR accuracy and robustness against the age factor.
Consequently, the AAD approach is able to take advantage of FR datasets both
with and without age labels to train an AiFR model. Unlike prior distillation
methods, which mainly focus on accuracy and compression ratios in closed-set
problems, our AAD aims to solve the open-set problem, i.e., large-scale face
recognition. Evaluations on LFW, IJB-B and IJB-C Janus, AgeDB, and
MegaFace-FGNet with one million distractors have demonstrated the efficiency of
the proposed approach. This work also presents a new longitudinal face aging
(LogiFace) database for further studies of age-related facial problems in the
future.
Comment: arXiv admin note: substantial text overlap with arXiv:1905.1062
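As a loose illustration of distilling angular knowledge from heavy teachers to
a lightweight student, the hypothetical PyTorch sketch below normalizes
embeddings and minimizes one minus their cosine similarity. The attentive
weighting and the exact AAD formulation from the paper are not reproduced; the
class name, projection layer, and two-teacher weighting are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularDistillLoss(nn.Module):
    """Push the student's embedding direction toward a teacher's on the unit sphere.

    Embeddings are L2-normalized (as in angular-margin FR losses) and the loss
    is one minus their cosine similarity. A linear projection handles differing
    student/teacher embedding widths.
    """
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim, bias=False)

    def forward(self, student_emb: torch.Tensor, teacher_emb: torch.Tensor):
        s = F.normalize(self.proj(student_emb), dim=-1)
        t = F.normalize(teacher_emb.detach(), dim=-1)
        return (1.0 - (s * t).sum(dim=-1)).mean()

# With two specialized teachers (a large-scale FR teacher and an age-invariant
# teacher), the student could be trained with a weighted sum of both terms:
# loss = w_fr * fr_loss(s_emb, t_fr_emb) + w_age * age_loss(s_emb, t_age_emb)
```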