Group channel pruning and spatial attention distilling for object detection
Due to the over-parameterization of neural networks, many model compression
methods based on pruning and quantization have emerged. They are remarkable in
reducing the size, parameter number, and computational complexity of the model.
However, most of the models compressed by such methods need the support of
special hardware and software, which increases the deployment cost. Moreover,
these methods are mainly used in classification tasks, and rarely directly used
in detection tasks. To address these issues, we introduce a three-stage model
compression method for object detection networks: dynamic sparse training,
group channel pruning, and spatial attention distilling. First, to identify
the unimportant channels in the network while maintaining a good balance
between sparsity and accuracy, we propose a dynamic sparse training method
that introduces a variable sparse rate, which changes as training
progresses. Second, to reduce the effect of pruning on
network accuracy, we propose a novel pruning method called group channel
pruning. In particular, we divide the network into multiple groups according to
the scales of the feature layer and the similarity of module structure in the
network, and then we use different pruning thresholds to prune the channels in
each group. Finally, to recover the accuracy of the pruned network, we apply an
improved knowledge distillation method. Specifically, we
extract spatial attention information from the feature maps of specific scales
in each group as knowledge for distillation. In the experiments, we use YOLOv4
as the object detection network and PASCAL VOC as the training dataset. Our
method reduces the parameters of the model by 64.7% and the computation by
34.9%.
Comment: Appl Intel
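As a rough illustration of the pruning stage described above (this is a minimal sketch, not the authors' implementation; the schedule in `dynamic_sparse_rate` and the per-group quantile thresholds are assumptions), pruning channels by group-specific thresholds on batch-norm-style scale factors might look like:

```python
import numpy as np

def dynamic_sparse_rate(epoch, total_epochs, final_rate=0.5):
    # Hypothetical schedule: the sparse rate grows with training progress,
    # so sparsity pressure is mild early on and strongest near the end.
    return final_rate * (epoch / total_epochs)

def group_channel_prune(group_scales, prune_ratio):
    # group_scales: dict mapping group name -> array of per-channel scale
    # factors (e.g. batch-norm gammas). Each group gets its own threshold
    # (the prune_ratio quantile of that group's scales), so groups with
    # different scale statistics are pruned fairly.
    keep_masks = {}
    for name, scales in group_scales.items():
        thresh = np.quantile(np.abs(scales), prune_ratio)
        keep_masks[name] = np.abs(scales) > thresh
    return keep_masks
```

Channels whose mask entry is `False` would then be removed from that group's layers before the distillation stage.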
Multi-Dataset Multi-Domain Multi-Task Network for Facial Expression Recognition, Age and Gender Estimation
Thesis (Master's)--Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2019. 8. Cho, Nam Ik.
The convolutional neural network (CNN) works very well in many computer vision tasks, including face-related problems. However, in the case of age estimation and facial expression recognition (FER), the accuracy provided by the CNN is still not good enough for real-world problems. The CNN does not seem to capture the subtle differences in the thickness and amount of wrinkles on the face,
which are the essential features for age estimation and FER. Also, face images in the real world have many variations due to rotation and illumination, and the CNN is not robust at finding rotated objects when not every possible variation is in the training data.
Moreover, Multi-Task Learning (MTL) based methods can be very helpful for achieving real-time visual understanding of a dynamic scene, as they are able to perform several different perceptual tasks simultaneously and efficiently. In typical MTL methods, we need to construct a dataset that contains all the labels for the different tasks together. However, as the target task becomes multi-faceted and more complicated, an unduly large dataset with stronger labels is sometimes required. Hence, the cost of generating the desired labeled data for complicated learning tasks is often an obstacle, especially for multi-task learning.
To alleviate these problems, we first propose methods that improve single-task baseline performance using Gabor filters and capsule-based networks. Then we propose a new semi-supervised learning method for face-related tasks based on Multi-Task Learning (MTL) and data distillation.
1 INTRODUCTION
1.1 Motivation
1.2 Background
1.2.1 Age and Gender Estimation
1.2.2 Facial Expression Recognition (FER)
1.2.3 Capsule Networks (CapsNet)
1.2.4 Semi-Supervised Learning
1.2.5 Multi-Task Learning
1.2.6 Knowledge and Data Distillation
1.2.7 Domain Adaptation
1.3 Datasets
2. GF-CapsNet: Using Gabor Jet and Capsule Networks for Face-Related Tasks
2.1 Feeding CNN with Hand-Crafted Features
2.1.1 Preparation of Input
2.1.2 Age and Gender Estimation using the Gabor Responses
2.2 GF-CapsNet
2.2.1 Modification of CapsNet
3. Distill-2MD-MTL: Data Distillation based on Multi-Dataset Multi-Domain Multi-Task Framework to Solve Face-Related Tasks
3.1 MTL Learning
3.2 Data Distillation
4. Experiments and Results
4.1 Experiments on GF-CNN and GF-CapsNet
4.2 GF-CNN Results
4.2.1 GF-CapsNet Results
4.3 Experiment on Distill-2MD-MTL
4.3.1 Semi-Supervised MTL
4.3.2 Cross-Dataset Cross-Domain Evaluation
5. Conclusion
Abstract (In Korean)
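The Gabor-filter preprocessing that the GF-CapsNet abstract above refers to can be sketched as follows. This is a minimal illustration rather than the thesis code; the kernel size, orientations, and all parameter values are assumptions.

```python
import numpy as np

def gabor_kernel(size=9, sigma=2.0, theta=0.0, lam=4.0, gamma=0.5):
    # Build one real-valued oriented Gabor kernel (illustrative parameters).
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr**2 + gamma**2 * yr**2) / (2 * sigma**2)) \
        * np.cos(2 * np.pi * xr / lam)

def gabor_bank_responses(image, n_orientations=4):
    # Correlate the image with a small bank of oriented Gabor filters and
    # stack the responses, as a stand-in for hand-crafted input features
    # fed to the network.
    thetas = [k * np.pi / n_orientations for k in range(n_orientations)]
    out = []
    for t in thetas:
        k = gabor_kernel(theta=t)
        pad = k.shape[0] // 2
        padded = np.pad(image, pad, mode="edge")
        resp = np.zeros_like(image, dtype=float)
        for i in range(image.shape[0]):
            for j in range(image.shape[1]):
                resp[i, j] = np.sum(padded[i:i + k.shape[0],
                                           j:j + k.shape[1]] * k)
        out.append(resp)
    return np.stack(out)
```

The stacked responses would replace or augment the raw pixels as network input, emphasizing oriented texture such as wrinkles.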
Shortcut-V2V: Compression Framework for Video-to-Video Translation based on Temporal Redundancy Reduction
Video-to-video translation aims to generate video frames of a target domain
from an input video. Despite its usefulness, the existing networks require
enormous computations, necessitating their model compression for wide use.
While there exist compression methods that improve computational efficiency in
various image/video tasks, a generally-applicable compression method for
video-to-video translation has not been studied much. In response, we present
Shortcut-V2V, a general-purpose compression framework for video-to-video
translation. Shortcut-V2V avoids full inference for every neighboring video
frame by approximating the intermediate features of a current frame from those
of the previous frame. Moreover, in our framework, a newly-proposed block
called AdaBD adaptively blends and deforms features of neighboring frames,
which makes more accurate predictions of the intermediate features possible. We
conduct quantitative and qualitative evaluations using well-known
video-to-video translation models on various tasks to demonstrate the general
applicability of our framework. The results show that Shortcut-V2V achieves
performance comparable to the original video-to-video translation model while
reducing computational cost by 3.2-5.7x and memory by 7.8-44x at test time.
Comment: to be updated
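The blend-and-deform idea behind AdaBD can be illustrated with a toy stand-in (not the paper's module; the integer per-channel offsets and the scalar blend weight are simplifying assumptions, whereas the real block learns deformations and blending weights):

```python
import numpy as np

def blend_and_deform(prev_feat, curr_coarse, offsets, alpha):
    # prev_feat, curr_coarse: (channels, H, W) feature maps.
    # Deform previous-frame features with integer per-channel offsets,
    # then adaptively blend them with a coarse estimate of the current
    # frame's features using a weight alpha in [0, 1].
    deformed = np.empty_like(prev_feat)
    for c, (dy, dx) in enumerate(offsets):
        deformed[c] = np.roll(np.roll(prev_feat[c], dy, axis=0), dx, axis=1)
    return alpha * deformed + (1.0 - alpha) * curr_coarse
```

The saving comes from `curr_coarse` being cheap to compute, so the expensive backbone only runs in full on keyframes.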
Model Compression Techniques in Biometrics Applications: A Survey
The development of deep learning algorithms has extensively empowered
humanity's task automatization capacity. However, the huge improvement in the
performance of these models is highly correlated with their increasing level of
complexity, limiting their usefulness in human-oriented applications, which are
usually deployed in resource-constrained devices. This led to the development
of compression techniques that drastically reduce the computational and memory
costs of deep learning models without significant performance degradation. This
paper aims to systematize the current literature on this topic by presenting a
comprehensive survey of model compression techniques in biometrics
applications, namely quantization, knowledge distillation and pruning. We
conduct a critical analysis of the comparative value of these techniques,
focusing on their advantages and disadvantages and presenting suggestions for
future work directions that can potentially improve the current methods.
Additionally, we discuss and analyze the link between model bias and model
compression, highlighting the need to direct compression research toward model
fairness in future works.
Comment: Under review at IEEE Journal
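Of the three surveyed families, quantization is the simplest to show concretely. A minimal sketch of symmetric per-tensor int8 weight quantization (one illustrative scheme among many, not any particular surveyed method):

```python
import numpy as np

def quantize_int8(w):
    # Symmetric uniform quantization of a weight tensor to int8 with a
    # single scale per tensor: q = round(w / scale), scale = max|w| / 127.
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original weights.
    return q.astype(np.float32) * scale
```

The reconstruction error is bounded by half a quantization step, which is the source of the "without significant performance degradation" trade-off discussed above.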
DeepFake detection based on high-frequency enhancement network for highly compressed content
DeepFakes, which generate synthetic content, have sparked a revolution in the fight against deception and forgery. However, most existing DeepFake detection methods mainly focus on improving detection performance with high-quality data while ignoring low-quality synthetic content that suffers from high compression. To address this issue, we propose a novel High-Frequency Enhancement framework, which leverages a learnable adaptive high-frequency enhancement network to enrich weak high-frequency information in compressed content without uncompressed data supervision. The framework consists of three branches, i.e., the Basic branch with the RGB domain, the Local High-Frequency Enhancement branch with the Block-wise Discrete Cosine Transform, and the Global High-Frequency Enhancement branch with the Multi-level Discrete Wavelet Transform. Among them, the local branch utilizes the Discrete Cosine Transform coefficients and a channel attention mechanism to indirectly achieve adaptive frequency-aware multi-spatial attention, while the global branch supplements the high-frequency information by extracting coarse-to-fine multi-scale high-frequency cues and performing cascade-residual-based multi-level fusion on Discrete Wavelet Transform coefficients. In addition, we design a Two-Stage Cross-Fusion module to effectively integrate all of this information, thereby greatly enhancing weak high-frequency information in low-quality data. Experimental results on the FaceForensics++, Celeb-DF, and OpenForensics datasets show that the proposed method outperforms existing state-of-the-art methods and can effectively improve the detection of DeepFakes, especially on low-quality data. The code is publicly available.
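The block-wise DCT idea used by the local branch can be sketched as follows; this is a generic illustration of extracting high-frequency content from 8x8 DCT blocks (the cutoff `keep_low` is an assumption, not the paper's design), without the attention machinery:

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis matrix, so d @ block @ d.T is the 2-D DCT
    # and d.T @ coef @ d inverts it.
    m = np.zeros((n, n))
    for k in range(n):
        for i in range(n):
            m[k, i] = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0] *= np.sqrt(1.0 / n)
    m[1:] *= np.sqrt(2.0 / n)
    return m

def blockwise_high_freq(image, keep_low=2):
    # Zero out the low-frequency corner of each 8x8 DCT block and invert,
    # leaving only the high-frequency content of the image.
    d = dct_matrix(8)
    out = np.zeros_like(image, dtype=float)
    for i in range(0, image.shape[0], 8):
        for j in range(0, image.shape[1], 8):
            coef = d @ image[i:i + 8, j:j + 8] @ d.T
            coef[:keep_low, :keep_low] = 0.0  # drop low frequencies
            out[i:i + 8, j:j + 8] = d.T @ coef @ d
    return out
```

For a compressed face, this residual is exactly the weak signal the enhancement network is meant to amplify.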
The Emerging Trends of Multi-Label Learning
Exabytes of data are generated daily by humans, leading to the growing need
for new efforts in dealing with the grand challenges for multi-label learning
brought by big data. For example, extreme multi-label classification is an
active and rapidly growing research area that deals with classification tasks
with an extremely large number of classes or labels; utilizing massive data
with limited supervision to build a multi-label classification model becomes
valuable for practical applications, etc. Besides these, there are tremendous
efforts on how to harvest the strong learning capability of deep learning to
better capture the label dependencies in multi-label learning, which is the key
for deep learning to address real-world classification tasks. However, it is
noted that there has been a lack of systematic studies that focus explicitly on
analyzing the emerging trends and new challenges of multi-label learning in the
era of big data. It is imperative to call for a comprehensive survey to fulfill
this mission and delineate future research directions and new applications.
Comment: Accepted to TPAMI 202