Deep Face Recognition Model Compression via Knowledge Transfer and Distillation
Fully convolutional networks (FCNs) have become the de facto tool for achieving very high performance in many vision and non-vision tasks in general, and face recognition in particular. Such high accuracies are normally obtained by very deep networks or their ensembles. However, deploying such high-performing models to resource-constrained devices or real-time applications is challenging. In this paper, we present a novel model compression approach based on the student-teacher paradigm for face recognition applications. The proposed approach consists of training a teacher FCN at a higher image resolution, while student FCNs are trained at lower image resolutions than that of the teacher FCN. We explored three different approaches to train student FCNs: knowledge transfer (KT), knowledge distillation (KD), and their combination. Experimental evaluation on the LFW and IJB-C datasets demonstrates comparable improvements in accuracy with these approaches. Training low-resolution student FCNs from a higher-resolution teacher offers a fourfold advantage: accelerated training, accelerated inference, reduced memory requirements, and improved accuracy. We evaluated all models on the IJB-C dataset and achieved state-of-the-art results on this benchmark. The teacher network and some student networks even achieved Top-1 performance on the IJB-C dataset. The proposed approach is simple and hardware friendly, thus enabling the deployment of high-performing face recognition deep models to resource-constrained devices.
Comment: 7 pages, 5 figures
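The abstract does not spell out the KT and KD objectives; the following is a minimal sketch of a combined loss under common assumptions: a softened-logit KD term plus a feature-level knowledge-transfer term, with hypothetical teacher and student FCNs that return (logits, embedding) pairs and illustrative weight values.

import torch
import torch.nn.functional as F

def kd_kt_loss(student_out, teacher_out, labels, T=4.0, alpha=0.5, beta=0.5):
    """Combine cross-entropy, softened-logit distillation (KD), and
    feature-level knowledge transfer (KT) in one objective.

    student_out / teacher_out: (logits, embedding) tuples.
    T: softmax temperature; alpha/beta: loss weights (illustrative values).
    """
    s_logits, s_feat = student_out
    t_logits, t_feat = teacher_out

    ce = F.cross_entropy(s_logits, labels)                     # task loss
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),          # soften logits
                  F.softmax(t_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    kt = F.mse_loss(F.normalize(s_feat, dim=1),                # match embeddings
                    F.normalize(t_feat, dim=1))
    return ce + alpha * kd + beta * kt

# Usage sketch: the teacher sees high-resolution crops while the student
# sees the same faces downsampled, so the student learns to reproduce the
# teacher's outputs from lower-resolution input.
# with torch.no_grad():
#     t_out = teacher(high_res_batch)
# loss = kd_kt_loss(student(low_res_batch), t_out, labels)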
Triplet Distillation for Deep Face Recognition
Convolutional neural networks (CNNs) have achieved great success in face recognition, which unfortunately comes at the cost of massive computation and storage consumption. Many compact face recognition networks have thus been proposed to address this problem. Triplet loss is effective in further improving the performance of those compact models. However, it normally applies a fixed margin to all samples, which neglects the informative similarity structure between different identities. In this paper, we propose an enhanced version of triplet loss, named triplet distillation, which exploits the capability of a teacher model to transfer similarity information to a small model by adaptively varying the margin between positive and negative pairs. Experiments on the LFW, AgeDB, and CPLFW datasets show the merits of our method compared to the original triplet loss.
Comment: 5 pages, 2 tables, accepted by the ICML 2019 ODML-CDNNR Workshop
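The exact margin schedule is not given in the abstract; one plausible reading is to scale the triplet margin by how dissimilar the teacher considers the anchor and negative to be, as in this sketch. The margin bounds m_min and m_max are illustrative assumptions.

import torch
import torch.nn.functional as F

def triplet_distillation_loss(s_a, s_p, s_n, t_a, t_n, m_min=0.2, m_max=0.6):
    """Triplet loss whose margin is set per-triplet from the teacher.

    s_a, s_p, s_n: student embeddings for anchor / positive / negative.
    t_a, t_n:      teacher embeddings for anchor / negative.
    The less similar the teacher finds anchor and negative, the larger
    the margin the student is asked to enforce.
    """
    # Teacher's view of anchor-negative similarity, mapped to [0, 1].
    t_sim = (F.cosine_similarity(t_a, t_n, dim=1) + 1.0) / 2.0
    margin = m_min + (m_max - m_min) * (1.0 - t_sim)           # adaptive margin

    d_ap = (F.normalize(s_a) - F.normalize(s_p)).pow(2).sum(1)
    d_an = (F.normalize(s_a) - F.normalize(s_n)).pow(2).sum(1)
    return F.relu(d_ap - d_an + margin).mean()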
ShrinkTeaNet: Million-scale Lightweight Face Recognition via Shrinking Teacher-Student Networks
Large-scale face recognition in the wild has recently achieved mature performance in many real-world applications. However, such systems are built on GPU platforms and mostly deploy heavy deep network architectures. Given a high-performance heavy network as a teacher, this work presents a simple and elegant teacher-student learning paradigm, namely ShrinkTeaNet, to train a portable student network that has significantly fewer parameters and competitive accuracy against the teacher network. Unlike prior teacher-student frameworks, which mainly focus on accuracy and compression ratios in closed-set problems, our proposed teacher-student network is shown to be more robust on the open-set problem, i.e., large-scale face recognition. In addition, this work introduces a novel Angular Distillation Loss for distilling the feature direction and the sample distributions of the teacher's hypersphere to its student. The ShrinkTeaNet framework can then efficiently guide the student's learning process with the teacher's knowledge presented in both intermediate and final stages of the feature embedding. Evaluations on LFW, CFP-FP, AgeDB, IJB-B and IJB-C Janus, and MegaFace with one million distractors have demonstrated the efficiency of the proposed approach in learning robust student networks with satisfactory accuracy and compact size. ShrinkTeaNet enables lightweight architectures to achieve high performance, with 99.77% on LFW and 95.64% on the large-scale MegaFace protocol.
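The abstract describes distilling the feature direction on the teacher's hypersphere. The exact ShrinkTeaNet formulation is not given here; a minimal sketch of such an angular distillation term, under that assumption, could be:

import torch
import torch.nn.functional as F

def angular_distillation_loss(student_feat, teacher_feat):
    """Penalize the angle between student and teacher embeddings rather
    than their magnitudes: only the direction on the hypersphere matters."""
    s = F.normalize(student_feat, dim=1)
    t = F.normalize(teacher_feat, dim=1)
    # 1 - cos(theta); zero when the student points in the teacher's direction.
    return (1.0 - (s * t).sum(dim=1)).mean()

# Applied at intermediate stages and at the final embedding, this kind of
# term guides the student with the teacher's hypersphere structure while
# leaving the student free to choose its own feature norms.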
Correlation Congruence for Knowledge Distillation
Most teacher-student frameworks based on knowledge distillation (KD) depend on a strong congruence constraint at the instance level. However, they usually ignore the correlation between multiple instances, which is also valuable for knowledge transfer. In this work, we propose a new framework named correlation congruence for knowledge distillation (CCKD), which transfers not only the instance-level information but also the correlation between instances. Furthermore, a generalized kernel method based on Taylor series expansion is proposed to better capture the correlation between instances. Empirical experiments and ablation studies on image classification tasks (including CIFAR-100 and ImageNet-1K) and metric learning tasks (including ReID and face recognition) show that the proposed CCKD substantially outperforms the original KD and achieves state-of-the-art accuracy compared with other KD-based methods. CCKD can be easily deployed in most teacher-student frameworks, such as KD and hint-based learning methods.
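A minimal sketch of the correlation-congruence idea: match the teacher's and student's batch-wise instance-correlation matrices in addition to the usual instance-level KD term. The kernel-based approximation mentioned in the abstract is simplified here to a plain cosine-similarity (linear) kernel; the weight gamma is an illustrative assumption.

import torch
import torch.nn.functional as F

def correlation_matrix(feats):
    """Pairwise correlations (here: cosine similarities) between all
    instances in the batch."""
    f = F.normalize(feats, dim=1)
    return f @ f.t()                       # (B, B) instance-correlation matrix

def cckd_loss(s_logits, t_logits, s_feat, t_feat, T=4.0, gamma=1.0):
    """Instance-level KD plus correlation congruence between batches."""
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                  F.softmax(t_logits / T, dim=1),
                  reduction="batchmean") * (T * T)
    cc = F.mse_loss(correlation_matrix(s_feat), correlation_matrix(t_feat))
    return kd + gamma * cc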
Knowledge Squeezed Adversarial Network Compression
Deep network compression has achieved notable progress via knowledge distillation, where a teacher-student learning scheme is adopted with a predetermined loss. Recently, attention has shifted to employing adversarial training to minimize the discrepancy between the output distributions of the two networks. However, these approaches emphasize result-oriented learning while neglecting process-oriented learning, losing the rich information contained in the whole network pipeline. Inspired by the assumption that a small network cannot perfectly mimic a large one due to the huge gap in network scale, we propose a knowledge transfer method with effective intermediate supervision under the adversarial training framework to learn the student network. To achieve a powerful yet highly compact intermediate representation, the squeezed knowledge is realized by a task-driven attention mechanism, so that the knowledge transferred from the teacher network can accommodate the size of the student network. As a result, the proposed method integrates the merits of both process-oriented and result-oriented learning. Extensive experimental results on three typical benchmark datasets, i.e., CIFAR-10, CIFAR-100, and ImageNet, demonstrate that our method performs favorably against other state-of-the-art methods.
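The abstract leaves the mechanism open; the sketch below illustrates one way intermediate "squeezed" knowledge can supervise a student, using spatial attention maps pooled from feature tensors, with the adversarial output-matching part reduced to a comment. Function names and shapes are assumptions, not the paper's actual formulation.

import torch
import torch.nn.functional as F

def attention_map(feat):
    """Squeeze a (B, C, H, W) feature tensor into a (B, H*W) spatial
    attention map by averaging channel-wise energy."""
    att = feat.pow(2).mean(dim=1)                  # (B, H, W)
    return F.normalize(att.flatten(1), dim=1)      # (B, H*W)

def intermediate_supervision_loss(student_feats, teacher_feats):
    """Match squeezed attention maps at several depths of the two networks.
    Both lists must contain feature tensors with equal spatial sizes
    (interpolate beforehand if they differ)."""
    loss = 0.0
    for s, t in zip(student_feats, teacher_feats):
        loss = loss + F.mse_loss(attention_map(s), attention_map(t))
    return loss

# In the adversarial part (not shown), a discriminator is additionally
# trained to tell student outputs from teacher outputs, and the student
# is updated to fool it, aligning the two output distributions.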
Knowledge Distillation via Route Constrained Optimization
Distillation-based learning boosts the performance of a miniaturized neural network based on the hypothesis that the representation of a teacher model can serve as structured and relatively weak supervision, and thus can be easily learned by the miniaturized model. However, we find that the representation of a converged heavy model is still a strong constraint for training a small student model, which leads to a high lower bound on the congruence loss. In this work, we consider knowledge distillation from the perspective of curriculum learning by routing. Instead of supervising the student model with a converged teacher model, we supervise it with anchor points selected from the route in parameter space that the teacher model traversed, which we call route constrained optimization (RCO). We experimentally demonstrate that this simple operation greatly reduces the lower bound of the congruence loss for knowledge distillation, hint learning, and mimicking learning. On closed-set classification tasks such as CIFAR-100 and ImageNet, RCO improves knowledge distillation by 2.14% and 1.5%, respectively. To evaluate generalization, we also test RCO on the open-set face recognition task MegaFace.
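A minimal sketch of route constrained optimization as described: instead of distilling from the final teacher, the student is distilled from a sequence of intermediate teacher checkpoints ("anchor points") in the order the teacher reached them. The checkpoint paths, the kd_loss helper, and the schedule are placeholders.

import torch
import torch.nn.functional as F

def kd_loss(s_logits, t_logits, T=4.0):
    """Standard softened-logit distillation term."""
    return F.kl_div(F.log_softmax(s_logits / T, dim=1),
                    F.softmax(t_logits / T, dim=1),
                    reduction="batchmean") * (T * T)

def train_rco(student, teacher, anchor_ckpts, loader, optimizer, epochs_per_anchor=10):
    """Supervise the student with teacher snapshots taken along the
    teacher's own optimization route, earliest (easiest) first."""
    for ckpt in anchor_ckpts:                      # e.g. ["ep20.pt", "ep60.pt", "final.pt"]
        teacher.load_state_dict(torch.load(ckpt))  # assumes checkpoints store state dicts
        teacher.eval()
        for _ in range(epochs_per_anchor):
            for x, y in loader:
                with torch.no_grad():
                    t_logits = teacher(x)
                s_logits = student(x)
                loss = F.cross_entropy(s_logits, y) + kd_loss(s_logits, t_logits)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()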
A Survey of Model Compression and Acceleration for Deep Neural Networks
Deep neural networks (DNNs) have recently achieved great success in many visual recognition tasks. However, existing deep neural network models are computationally expensive and memory intensive, hindering their deployment in devices with low memory resources or in applications with strict latency requirements. Therefore, a natural thought is to perform model compression and acceleration in deep networks without significantly decreasing model performance. During the past five years, tremendous progress has been made in this area. In this paper, we review the recent techniques for compacting and accelerating DNN models. In general, these techniques are divided into four categories: parameter pruning and quantization, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. Methods of parameter pruning and quantization are described first, and the other techniques are introduced afterwards. For each category, we also provide insightful analysis of the performance, related applications, advantages, and drawbacks. Then we go through some very recent successful methods, for example, dynamic capacity networks and stochastic depth networks. After that, we survey the evaluation metrics, the main datasets used for evaluating model performance, and recent benchmark efforts. Finally, we conclude the paper and discuss the remaining challenges and possible directions for future work.
Comment: Published in IEEE Signal Processing Magazine; updated version including more recent work
Cross-Resolution Face Recognition via Prior-Aided Face Hallucination and Residual Knowledge Distillation
Recent deep learning based face recognition methods have achieved great performance, but it remains challenging to recognize a very low-resolution query face, e.g., 28x28 pixels, when the CCTV camera is far from the captured subject. Such a very low-resolution face lacks the identity detail available at normal resolution in a gallery, making it hard to find the corresponding faces therein. To this end, we propose a Resolution Invariant Model (RIM) for addressing such cross-resolution face recognition problems, with three distinct novelties. First, RIM is a novel and unified deep architecture, containing a Face Hallucination sub-Net (FHN) and a Heterogeneous Recognition sub-Net (HRN), which are jointly learned end to end. Second, FHN is a well-designed tri-path Generative Adversarial Network (GAN) which simultaneously perceives facial structure and geometry prior information, i.e., landmark heatmaps and parsing maps, and incorporates an unsupervised cross-domain adversarial training strategy to super-resolve a very low-resolution query image to an 8x larger one without requiring them to be well aligned. Third, HRN is a generic Convolutional Neural Network (CNN) for heterogeneous face recognition with our proposed residual knowledge distillation strategy for learning discriminative yet generalized feature representations. Quantitative and qualitative experiments on several benchmarks demonstrate the superiority of the proposed model over the state of the art. Code and models will be released upon acceptance.
Comment: 10 pages, 4 figures
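The RIM architecture is only described at a high level here; the sketch below shows the overall data flow as read from the abstract: a hallucination sub-net upsamples the low-resolution query, and a recognition sub-net is trained with a distillation-style term against features from a fixed normal-resolution recognizer. All module names, interfaces, and loss weights are placeholders, and the GAN, landmark-heatmap, and parsing-map components are omitted.

import torch
import torch.nn.functional as F

def rim_step(fhn, hrn, gallery_net, lr_face, hr_face, identity,
             sr_weight=1.0, kd_weight=1.0):
    """One joint training step of the two sub-nets described in the abstract.

    fhn: face hallucination sub-net (low-res face -> 8x super-resolved face).
    hrn: heterogeneous recognition sub-net (face -> (logits, embedding)).
    gallery_net: fixed normal-resolution recognizer providing target features.
    hr_face: the 8x ground-truth image paired with lr_face.
    """
    sr_face = fhn(lr_face)                                   # super-resolve the query
    logits, feat = hrn(sr_face)

    with torch.no_grad():
        _, hr_feat = gallery_net(hr_face)                    # reference features

    sr_loss = F.l1_loss(sr_face, hr_face)                    # reconstruction term
    id_loss = F.cross_entropy(logits, identity)              # recognition term
    kd_term = F.mse_loss(F.normalize(feat, dim=1),           # distillation-style term
                         F.normalize(hr_feat, dim=1))
    return sr_weight * sr_loss + id_loss + kd_weight * kd_term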
Learning Metrics from Teachers: Compact Networks for Image Embedding
Metric learning networks are used to compute image embeddings, which are widely used in many applications such as image retrieval and face recognition. In this paper, we propose to use network distillation to efficiently compute image embeddings with small networks. Network distillation has been successfully applied to improve image classification, but has hardly been explored for metric learning. To do so, we propose two new loss functions that model the communication of a deep teacher network to a small student network. We evaluate our system on several datasets, including CUB-200-2011, Cars-196, and Stanford Online Products, and show that embeddings computed using small student networks perform significantly better than those computed using standard networks of similar size. Results on a very compact network (MobileNet-0.25), which can be used on mobile devices, show that the proposed method can greatly improve Recall@1 results from 27.5% to 44.6%. Furthermore, we investigate various aspects of distillation for embeddings, including hint and attention layers, semi-supervised learning, and cross-quality distillation. (Code is available at https://github.com/yulu0724/EmbeddingDistillation.)
Comment: To appear at CVPR 2019
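The two proposed loss functions are not specified in the abstract; two natural candidates for teacher-to-student communication in metric learning are an absolute term (match the teacher's embedding directly) and a relative term (match the teacher's pairwise distance structure), sketched below under that assumption.

import torch
import torch.nn.functional as F

def absolute_embedding_loss(s_feat, t_feat):
    """Pull each student embedding toward the teacher's embedding."""
    return F.mse_loss(F.normalize(s_feat, dim=1), F.normalize(t_feat, dim=1))

def relative_distance_loss(s_feat, t_feat):
    """Preserve the teacher's pairwise distance structure within a batch."""
    s_d = torch.cdist(F.normalize(s_feat, dim=1), F.normalize(s_feat, dim=1))
    t_d = torch.cdist(F.normalize(t_feat, dim=1), F.normalize(t_feat, dim=1))
    return F.mse_loss(s_d, t_d)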
An Embarrassingly Simple Approach for Knowledge Distillation
Knowledge Distillation (KD) aims at improving the performance of a low-capacity student model by inheriting knowledge from a high-capacity teacher model. Previous KD methods typically train a student by minimizing a task-related loss and the KD loss simultaneously, using a pre-defined loss weight to balance the two terms. In this work, we propose to first transfer the backbone knowledge from the teacher to the student, and then learn only the task head of the student network. Such a decomposition of the training process circumvents the need to choose an appropriate loss weight, which is often difficult in practice, and thus makes the method easier to apply to different datasets and tasks. Importantly, the decomposition enables the core of our method, Stage-by-Stage Knowledge Distillation (SSKD), which facilitates progressive feature mimicking from teacher to student. Extensive experiments on CIFAR-100 and ImageNet suggest that SSKD significantly narrows the performance gap between student and teacher, outperforming state-of-the-art approaches. We also demonstrate the generalization ability of SSKD on other challenging benchmarks, including face recognition on the IJB-A dataset as well as object detection on the COCO dataset.
Comment: 8 pages and 5 figures
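A minimal sketch of the stage-by-stage idea as described in the abstract: the student's backbone stages are trained one at a time to mimic the teacher's corresponding features, and only afterwards is the task head trained with the task loss, so no KD/task loss weight has to be chosen. Stage boundaries, the make_optimizer helper, and matching feature shapes between the two backbones are assumptions.

import torch
import torch.nn.functional as F

def train_sskd(student_stages, teacher_stages, head, loader, make_optimizer, epochs=5):
    """student_stages / teacher_stages: lists of modules of equal length,
    splitting each backbone into stages with matching output shapes.
    head: the student's task head (e.g., a classifier)."""
    # 1) Progressive feature mimicking, one stage at a time.
    for k in range(len(student_stages)):
        opt = make_optimizer(student_stages[k].parameters())
        for _ in range(epochs):
            for x, _ in loader:
                s, t = x, x
                with torch.no_grad():
                    for ts in teacher_stages[:k + 1]:
                        t = ts(t)                   # teacher features up to stage k
                    for ss in student_stages[:k]:
                        s = ss(s)                   # earlier student stages stay frozen
                s = student_stages[k](s)
                loss = F.mse_loss(s, t)             # mimic the teacher's stage output
                opt.zero_grad()
                loss.backward()
                opt.step()

    # 2) Train only the task head with the task loss; no loss weight needed.
    opt = make_optimizer(head.parameters())
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                s = x
                for ss in student_stages:
                    s = ss(s)
            loss = F.cross_entropy(head(s), y)
            opt.zero_grad()
            loss.backward()
            opt.step()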