2,154 research outputs found

    Group channel pruning and spatial attention distilling for object detection

    Full text link
    Due to the over-parameterization of neural networks, many model compression methods based on pruning and quantization have emerged. They are remarkable in reducing a model's size, parameter count, and computational complexity. However, most models compressed by such methods need the support of special hardware and software, which increases the deployment cost. Moreover, these methods are mainly used in classification tasks and are rarely applied directly to detection tasks. To address these issues, we introduce a three-stage model compression method for object detection networks: dynamic sparse training, group channel pruning, and spatial attention distilling. First, to select the unimportant channels in the network while maintaining a good balance between sparsity and accuracy, we propose a dynamic sparse training method that introduces a variable sparse rate, which changes over the course of training. Second, to reduce the effect of pruning on network accuracy, we propose a novel pruning method called group channel pruning: we divide the network into multiple groups according to the scales of the feature layers and the similarity of module structures, and then prune the channels in each group with a group-specific threshold. Finally, to recover the accuracy of the pruned network, we use an improved knowledge distillation method, extracting spatial attention information from the feature maps of specific scales in each group as the knowledge for distillation. In the experiments, we use YOLOv4 as the object detection network and PASCAL VOC as the training dataset. Our method reduces the model's parameters by 64.7% and its computation by 34.9%.
    Comment: Appl Intel
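    The group-wise thresholding step lends itself to a short illustration. Below is a minimal sketch, assuming a Network-Slimming-style setup in which channel importance is read from BatchNorm scale factors; the grouping, the helper name group_channel_prune_masks, and the keep ratios are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch: per-group channel-pruning masks from BatchNorm |gamma|,
# with a different threshold per group (an assumption about the mechanism).
import torch
import torch.nn as nn

def group_channel_prune_masks(groups, keep_ratios):
    """Compute a binary keep-mask per BatchNorm layer.

    groups:      list of lists of nn.BatchNorm2d, one inner list per group
                 (e.g. grouped by feature-map scale, as the paper describes).
    keep_ratios: fraction of channels to keep in each group, which yields a
                 group-specific pruning threshold.
    """
    masks = {}
    for bns, keep in zip(groups, keep_ratios):
        # Pool the |gamma| values of the whole group to find its threshold.
        gammas = torch.cat([bn.weight.detach().abs().flatten() for bn in bns])
        k = max(1, int(keep * gammas.numel()))
        threshold = torch.topk(gammas, k).values.min()
        for bn in bns:
            masks[bn] = bn.weight.detach().abs() >= threshold
    return masks

# Toy usage: two groups with different keep ratios.
g1 = [nn.BatchNorm2d(64), nn.BatchNorm2d(64)]
g2 = [nn.BatchNorm2d(128)]
for bn, m in group_channel_prune_masks([g1, g2], [0.5, 0.3]).items():
    print(type(bn).__name__, int(m.sum()), "of", m.numel(), "channels kept")
```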

    Multi-Dataset, Multi-Domain, Multi-Task Network for Facial Expression Recognition and Age and Gender Estimation

    Get PDF
    Master's thesis (M.S.), Seoul National University Graduate School, Department of Electrical and Computer Engineering, August 2019. Advisor: Cho, Nam Ik.
    The convolutional neural network (CNN) works very well in many computer vision tasks, including face-related problems. However, in the case of age estimation and facial expression recognition (FER), the accuracy provided by the CNN is still not good enough for real-world problems. The CNN does not find well the subtle differences in the thickness and amount of wrinkles on the face, which are essential features for age estimation and FER. Also, face images in the real world vary greatly due to face rotation and illumination, and the CNN is not robust at finding rotated objects when not every possible variation is present in the training data. Moreover, Multi-Task Learning (MTL) based methods can be very helpful in achieving real-time visual understanding of a dynamic scene, as they can perform several different perceptual tasks simultaneously and efficiently. In typical MTL methods, we need to construct a dataset that contains all the labels for the different tasks together. However, as the target task becomes multi-faceted and more complicated, an unduly large dataset with stronger labels may be required. Hence, the cost of generating the desired labeled data for complicated learning tasks is often an obstacle, especially for multi-task learning. Therefore, to alleviate these problems, we first propose a few methods to improve single-task baseline performance using Gabor filters and capsule-based networks; a sketch of the Gabor input preparation follows the outline below. We then propose a new semi-supervised learning method for face-related tasks based on Multi-Task Learning (MTL) and data distillation.

    Contents:
    1 Introduction
      1.1 Motivation
      1.2 Background
        1.2.1 Age and Gender Estimation
        1.2.2 Facial Expression Recognition (FER)
        1.2.3 Capsule Networks (CapsNet)
        1.2.4 Semi-Supervised Learning
        1.2.5 Multi-Task Learning
        1.2.6 Knowledge and Data Distillation
        1.2.7 Domain Adaptation
      1.3 Datasets
    2 GF-CapsNet: Using Gabor Jet and Capsule Networks for Face-Related Tasks
      2.1 Feeding CNN with Hand-Crafted Features
        2.1.1 Preparation of Input
        2.1.2 Age and Gender Estimation using the Gabor Responses
      2.2 GF-CapsNet
        2.2.1 Modification of CapsNet
    3 Distill-2MD-MTL: Data Distillation based on a Multi-Dataset Multi-Domain Multi-Task Framework to Solve Face-Related Tasks
      3.1 MTL Learning
      3.2 Data Distillation
    4 Experiments and Results
      4.1 Experiments on GF-CNN and GF-CapsNet
      4.2 GF-CNN Results
        4.2.1 GF-CapsNet Results
      4.3 Experiments on Distill-2MD-MTL
        4.3.1 Semi-Supervised MTL
        4.3.2 Cross-Dataset Cross-Domain Evaluation
    5 Conclusion
    Abstract (In Korean)
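    As referenced above, here is a minimal sketch of feeding a network with Gabor responses, in the spirit of the thesis's input preparation; the kernel parameters, bank size, and helper names below are illustrative assumptions, not the thesis's exact Gabor-jet configuration.

```python
# Hedged sketch: build a small Gabor filter bank and stack the filtered maps
# as a multi-channel input for a CNN/CapsNet (parameter values are assumed).
import cv2
import numpy as np

def gabor_bank(ksize=15, sigmas=(2.0, 4.0), n_orientations=4):
    kernels = []
    for sigma in sigmas:
        for i in range(n_orientations):
            theta = i * np.pi / n_orientations  # evenly spaced orientations
            kernels.append(cv2.getGaborKernel(
                (ksize, ksize), sigma, theta,
                lambd=10.0, gamma=0.5, psi=0.0))
    return kernels

def gabor_responses(gray_face):
    # One filtered map per kernel; this stack replaces (or augments) the raw
    # grayscale image as the network input.
    return np.stack([cv2.filter2D(gray_face, cv2.CV_32F, k)
                     for k in gabor_bank()], axis=0)

face = np.random.rand(64, 64).astype(np.float32)  # stand-in for an aligned face crop
print(gabor_responses(face).shape)                # (8, 64, 64)
```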

    Shortcut-V2V: Compression Framework for Video-to-Video Translation based on Temporal Redundancy Reduction

    Full text link
    Video-to-video translation aims to generate video frames of a target domain from an input video. Despite its usefulness, existing networks require enormous computation, necessitating model compression for wide use. While compression methods exist that improve computational efficiency in various image/video tasks, a generally applicable compression method for video-to-video translation has not been studied much. In response, we present Shortcut-V2V, a general-purpose compression framework for video-to-video translation. Shortcut-V2V avoids full inference for every neighboring video frame by approximating the intermediate features of the current frame from those of the previous frame. Moreover, a newly proposed block in our framework, called AdaBD, adaptively blends and deforms the features of neighboring frames, which enables more accurate prediction of the intermediate features. We conduct quantitative and qualitative evaluations using well-known video-to-video translation models on various tasks to demonstrate the general applicability of our framework. The results show that Shortcut-V2V achieves performance comparable to the original video-to-video translation model while saving 3.2-5.7x in computational cost and 7.8-44x in memory at test time.
    Comment: to be updated
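    The abstract gives only the high-level behavior of AdaBD, so the following is a toy sketch of the blend-and-deform idea under stated assumptions: previous-frame features are deformed by a predicted offset field and adaptively blended with current-frame features via a learned mask. The module name ToyBlendDeform, the flow-based warping, and all layer sizes are assumptions, not the paper's design.

```python
# Hedged sketch: predict a per-pixel offset field (deform) and a blending
# mask from concatenated neighboring-frame features, then fuse them.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyBlendDeform(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # 3 output channels: 2 for the offset field, 1 for the blend mask.
        self.head = nn.Conv2d(2 * channels, 3, kernel_size=3, padding=1)

    def forward(self, prev_feat, curr_feat):
        b, _, h, w = prev_feat.shape
        out = self.head(torch.cat([prev_feat, curr_feat], dim=1))
        flow, mask = out[:, :2], torch.sigmoid(out[:, 2:3])
        # Build a sampling grid shifted by the predicted offsets.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
        base = torch.stack([xs, ys], dim=-1).expand(b, h, w, 2)
        grid = base + flow.permute(0, 2, 3, 1)
        warped = F.grid_sample(prev_feat, grid, align_corners=True)
        # Adaptive blend: reuse warped previous features where the mask is high.
        return mask * warped + (1 - mask) * curr_feat

x_prev = torch.randn(1, 16, 32, 32)
x_curr = torch.randn(1, 16, 32, 32)
print(ToyBlendDeform(16)(x_prev, x_curr).shape)  # torch.Size([1, 16, 32, 32])
```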

    Model Compression Techniques in Biometrics Applications: A Survey

    Full text link
    The development of deep learning algorithms has greatly expanded humanity's capacity for task automation. However, the huge improvement in the performance of these models is highly correlated with their increasing complexity, limiting their usefulness in human-oriented applications, which are usually deployed on resource-constrained devices. This has led to the development of compression techniques that drastically reduce the computational and memory costs of deep learning models without significant performance degradation. This paper aims to systematize the current literature on this topic by presenting a comprehensive survey of model compression techniques in biometrics applications, namely quantization, knowledge distillation, and pruning. We conduct a critical analysis of the comparative value of these techniques, focusing on their advantages and disadvantages and presenting suggestions for future work that can potentially improve the current methods. Additionally, we discuss and analyze the link between model bias and model compression, highlighting the need to direct compression research toward model fairness in future work.
    Comment: Under review at IEEE Journal
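    As a concrete illustration of one of the three surveyed techniques, the sketch below applies post-training dynamic quantization to a stand-in embedding network in PyTorch; the model itself is illustrative and not drawn from the survey.

```python
# Hedged sketch: dynamic quantization converts Linear weights to int8 after
# training, shrinking the model without retraining (a stand-in network).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)  # int8 weights for Linear layers

x = torch.randn(1, 512)
print(quantized(x).shape)  # same interface as the float model, smaller weights
```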

    DeepFake detection based on high-frequency enhancement network for highly compressed content

    Get PDF
    DeepFake technology, which generates synthetic content, has sparked a revolution in the fight against deception and forgery. However, most existing DeepFake detection methods focus on improving detection performance on high-quality data while ignoring low-quality synthetic content that suffers from heavy compression. To address this issue, we propose a novel High-Frequency Enhancement framework, which leverages a learnable adaptive high-frequency enhancement network to enrich weak high-frequency information in compressed content without supervision from uncompressed data. The framework consists of three branches: the Basic branch operating in the RGB domain, the Local High-Frequency Enhancement branch using a block-wise Discrete Cosine Transform, and the Global High-Frequency Enhancement branch using a multi-level Discrete Wavelet Transform. The local branch utilizes the Discrete Cosine Transform coefficients and a channel attention mechanism to indirectly achieve adaptive frequency-aware multi-spatial attention, while the global branch supplements the high-frequency information by extracting coarse-to-fine multi-scale high-frequency cues and applying cascade-residual-based multi-level fusion of the Discrete Wavelet Transform coefficients. In addition, we design a Two-Stage Cross-Fusion module to effectively integrate all this information, thereby greatly enhancing weak high-frequency information in low-quality data. Experimental results on the FaceForensics++, Celeb-DF, and OpenForensics datasets show that the proposed method outperforms existing state-of-the-art methods and can effectively improve the detection of DeepFakes, especially on low-quality data. The code is publicly available.
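    To make the local branch's starting point concrete, here is a minimal sketch of block-wise DCT feature extraction, assuming 8x8 blocks and a simple index-sum rule for separating high-frequency coefficients; both choices are assumptions, as the abstract does not specify them.

```python
# Hedged sketch: per-block 2D DCT of a grayscale image plus a mask selecting
# the higher-frequency coefficients (block size and cutoff are assumed).
import numpy as np
from scipy.fft import dctn

def blockwise_dct(gray, block=8):
    h, w = gray.shape
    h, w = h - h % block, w - w % block  # crop to a multiple of the block size
    out = np.zeros((h, w), dtype=np.float32)
    for y in range(0, h, block):
        for x in range(0, w, block):
            out[y:y+block, x:x+block] = dctn(
                gray[y:y+block, x:x+block], norm="ortho")
    return out

def high_freq_mask(block=8, cutoff=4):
    # Keep coefficients whose index sum exceeds the cutoff (higher frequencies).
    i, j = np.meshgrid(range(block), range(block), indexing="ij")
    return (i + j) > cutoff

img = np.random.rand(32, 32).astype(np.float32)
coeffs = blockwise_dct(img)
print(coeffs.shape, int(high_freq_mask().sum()), "high-freq coeffs per block")
```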

    The Emerging Trends of Multi-Label Learning

    Full text link
    Exabytes of data are generated daily by humans, creating a growing need for new efforts to address the grand challenges that big data brings to multi-label learning. For example, extreme multi-label classification is an active and rapidly growing research area that deals with classification tasks involving an extremely large number of classes or labels, and utilizing massive data with limited supervision to build multi-label classification models is becoming valuable in practical applications. Beyond these, there are tremendous efforts on how to harness the strong learning capability of deep learning to better capture label dependencies in multi-label learning, which is key for deep learning to address real-world classification tasks. However, there has been a lack of systematic studies focused explicitly on analyzing the emerging trends and new challenges of multi-label learning in the era of big data. It is imperative to call for a comprehensive survey to fulfill this mission and delineate future research directions and new applications.
    Comment: Accepted to TPAMI 202
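    For readers new to the area, the sketch below shows the standard deep multi-label setup that the surveyed trends build on: independent sigmoid outputs trained with binary cross-entropy. The network and label count are illustrative, not taken from the survey.

```python
# Hedged sketch: a multi-label classifier treats each label as an independent
# binary decision, so targets are multi-hot vectors rather than class indices.
import torch
import torch.nn as nn

n_labels = 10
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, n_labels))
criterion = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy per label

x = torch.randn(4, 64)                            # a toy batch of features
y = torch.randint(0, 2, (4, n_labels)).float()    # multi-hot targets
loss = criterion(model(x), y)
loss.backward()
print(float(loss))
```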