108 research outputs found

    Optimization of Mixed-precision Neural Architecture with Knowledge Distillation

    ํ•™์œ„๋…ผ๋ฌธ (์„์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ „๊ธฐยท์ •๋ณด๊ณตํ•™๋ถ€, 2020. 8. ์ดํ˜์žฌ.Quantization์€ ๋ฉ”๋ชจ๋ฆฌ์™€ ๊ณ„์‚ฐ ๋Šฅ๋ ฅ์ด ์ œํ•œ๋œ edge device์—์„œ deep neural network๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ธฐ ์œ„ํ•ด ํ•„์ˆ˜์ ์ธ ํ”„๋กœ์„ธ์Šค์ด๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ bit-width ๋ฐ transformation function์„ ํฌํ•จํ•˜์—ฌ ๋˜‘๊ฐ™์€ quantization ์„ ๋ชจ๋“  ๋ ˆ์ด์–ด์— ๊ทธ๋Œ€๋กœ ์ ์šฉํ•˜๋Š” uniform-precision quantization์€ ์‹ฌ๊ฐํ•œ ์„ฑ๋Šฅ ์ €ํ•˜๋ฅผ ๊ฒช๋Š” ๊ฒƒ์œผ๋กœ ๋„๋ฆฌ ์•Œ๋ ค์ ธ ์žˆ๋‹ค. ํ•œํŽธ ๊ฐ๊ฐ์˜ ๋ ˆ์ด์–ด์— ์„œ๋กœ ๋‹ค๋ฅธ quantization ์„ ์ฐพ์•„์„œ ์ ์šฉํ•˜๋Š” ๊ฒƒ์€ ํ›„๋ณด๊ตฐ์˜ ์ˆ˜๊ฐ€ ๋ ˆ์ด์–ด ์ˆ˜์— ๋”ฐ๋ผ ๊ธฐํ•˜๊ธ‰์ˆ˜์ ์œผ๋กœ ์ฆ๊ฐ€ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์ ์šฉํ•˜๊ธฐ ์–ด๋ ต๋‹ค. ์ด ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด, ๋ณธ ์—ฐ๊ตฌ์—์„œ๋Š” Knowledge Distillation ๊ธฐ๋ฒ•์„ ํ™œ์šฉํ•˜์—ฌ ์„ ํ˜•์‹œ๊ฐ„ ๋‚ด์— ๊ฒ€์ƒ‰ ๊ณต๊ฐ„์„ ํšจ์œจ์ ์œผ๋กœ ํƒ์ƒ‰ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ํŠนํžˆ, ์ œ์•ˆ๋œ ๋ฐฉ๋ฒ•์€ ๋Œ€์ƒ ๋ ˆ์ด์–ด์— ๋Œ€ํ•œ quantization ์˜ ์˜ํ–ฅ์„ ์ถ”์ •ํ•˜๊ธฐ ์œ„ํ•ด ๋ ˆ์ด์–ด๋ณ„๋กœ loss function์„ ๊ณต์‹ํ™”ํ•˜์˜€๋‹ค. ๋ ˆ์ด์–ด๊ฐ„์— ์˜์กด์„ฑ์ด ์ตœ์†Œ๋ผ๋Š” ๊ฐ€์ •์— ๊ทผ๊ฑฐํ•˜์—ฌ ๊ฐ ๋ ˆ์ด์–ด๋ณ„ ์ ์šฉ์„ ๊ฐœ๋ณ„์ ์œผ๋กœ ๊ฒฐ์ •ํ•ด์„œ ์„ฑ๋Šฅ์˜ ์†์‹ค์„ ์ตœ์†Œํ™”ํ•œ๋‹ค. ํ•˜๋“œ์›จ์–ด ์นœํ™”์ ์ธ quantization ๋งŒ ์‚ฌ์šฉํ•˜์—ฌ image classification๊ณผ object detection ์— ๋Œ€ํ•œ ์‹คํ—˜์„ ์‹ค์‹œํ•˜์˜€๋‹ค. ๊ทธ ๊ฒฐ๊ณผ, CIFAR-10 ๋ฐ์ดํ„ฐ ์„ธํŠธ์—์„œ ํฌ๊ธฐ๊ฐ€ 13.65๋ฐฐ ์ž‘์€ ๊ฐ€์žฅ ํšจ์œจ์ ์ธ mixed-precision์˜ ResNet20 ๋ชจ๋ธ๋กœ 93.62%์˜ ์ •ํ™•๋„๋ฅผ ๋‹ฌ์„ฑํ•˜์˜€์œผ๋ฉฐ, ์ด๋Š” ๊ธฐ๋ณธ full-precision ๋ชจ๋ธ๋ณด๋‹ค 0.19% ๋‚ฎ์€ ์ˆ˜์ค€์ด๋‹ค. VOC ๋ฐ์ดํ„ฐ ์„ธํŠธ์—์„œ, ์ œ์•ˆ๋œ ๋ฐฉ๋ฒ•์€ ํ‰๊ท  precision ์ด 63.87% ์ธ mixed-precision Sim-YOLOv2-FPGA ๋ชจ๋ธ์„ ์ƒ์„ฑํ•˜์˜€๊ณ , ๋™์ผ ์••์ถ•๋ฅ ์˜ ๋ชจ๋“  uniform-precision ๋ชจ๋ธ๋ณด๋‹ค ์„ฑ๋Šฅ์ด ๋›ฐ์–ด๋‚˜๋‹ค. ์ œ์•ˆํ•˜๋Š” ๋ฐฉ๋ฒ•์€ ๋‹ค๋ฅธ ์ตœ์ฒจ๋‹จ mixed-precision quantization ์ ‘๊ทผ๋ฒ•๊ณผ ์œ ์‚ฌํ•œ ํšจ์œจ์„ ๋‹ฌ์„ฑํ•˜๋ฉด์„œ๋„ ์‹คํ–‰์€ ๋‹จ์ˆœํ•˜๋‹ค.Quantization is an essential process in the deployment of deep neural networks on edge devices which only have limited memory and computation capacity. However, it is widely known that the straightforward uniform-precision quantization method, which applies the same quantization scheme including bit-width and transformation function to all layers, suffers a severe performance degradation. Meanwhile, finding and assigning different quantization schemes to different layers is challenging as the number of candidates is exponential to the the number of layers. To address this problem, this study proposes a method that utilizes Knowledge Distillation technique to efficiently explore the search space in linear time. In particular, the proposed method formulates a per-layer loss function to estimate the impact of a quantization scheme on a target layer. Based on the assumption that the dependence among layers is minimal, the assignment is then decided for each layer separately to minimize the performance loss. Experiments are conducted for both image classification and object detection task, using only hardware-friendly quantization schemes. The results show that the most efficient mixed-precision ResNet20 model with 13.65 times smaller size can still achieve up to 93.62% accuracy on CIFAR-10 dataset, which is only 0.19% lower than the baseline full-precision model. On VOC dataset, the proposed method generates a mixed-precision Sim-YOLOv2-FPGA model with a mean average precision of 63.87, which outperforms all uniform-precision models with the same compression rate. 
The proposed method is practically simple to carry out while still achieving a comparable efficiency to other state-of-the-art approaches on mixed-precision quantization.Chapter 1: Introduction 1 Chapter 2: Related Work 4 2.1. Quantization techniques 4 2.2. Uniform quantization 5 2.3. Shifter quantization 6 2.4. Knowledge Distillation 8 2.5. Neural Architecture Search 10 Chapter 3: Knowledge-Distillation Mixed-Precision 11 3.1. Complexity challenge and solution 11 3.2. Impact assessment via Knowledge Distillation 15 3.3. Parallel Knowledge Distillation training 17 3.4. Candidate architecture generation 18 3.5. Final model fine-tuning 20 3.6. Summary 21 Chapter 4: Experimental Results 23 4.1. Final model fine-tuning time reduction 23 4.2. Scalable and flexible training 25 4.3. Effectiveness of Knowledge Distillation Mixed Precision 27 4.4. Intra-layer mixed precision 30 Chapter 5: Conclusion and Future Work 32 Appendix 33 A. Experiment with Sim-YOLOv2-FPGA on VOC dataset 33 B. Experiment with ResNet20 and ResNet32 on CIFAR-10 34 Reference 35 ์ดˆ ๋ก 38 Acknowledgement 40Maste
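
    To make the linear-time search concrete, the sketch below shows in plain NumPy how a per-layer, distillation-style loss could be used to pick a bit-width for each layer independently of the others. The loss definition, the candidate bit-widths, the tolerance, and the uniform quantizer are illustrative assumptions, not the thesis's actual formulation.

```python
import numpy as np

# Hypothetical per-layer distillation loss: how far the quantized layer's output
# drifts from the full-precision teacher's output on calibration data.
def per_layer_kd_loss(teacher_out: np.ndarray, student_out: np.ndarray) -> float:
    return float(np.mean((teacher_out - student_out) ** 2))

def assign_bit_widths(layer_outputs, quantize_fn, candidates=(2, 4, 8), tol=1e-2):
    """Pick a bit-width per layer independently (linear in the number of layers).

    layer_outputs: list of full-precision activations, one array per layer.
    quantize_fn:   callable (x, bits) -> quantized x, standing in for a
                   hardware-friendly quantization scheme.
    """
    assignment = []
    for x in layer_outputs:
        chosen = max(candidates)              # fall back to the widest candidate
        for bits in sorted(candidates):       # try the cheapest schemes first
            if per_layer_kd_loss(x, quantize_fn(x, bits)) <= tol:
                chosen = bits                 # first scheme within the loss budget
                break
        assignment.append(chosen)
    return assignment

# Toy usage with a symmetric uniform quantizer as the stand-in scheme.
def uniform_quantize(x, bits):
    scale = (np.max(np.abs(x)) + 1e-12) / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
acts = [rng.normal(size=64) * s for s in (0.1, 1.0, 5.0)]  # fake layer activations
print(assign_bit_widths(acts, uniform_quantize))           # one bit-width per layer
```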

    Real-time Optimal Resource Allocation for Embedded UAV Communication Systems

    We consider device-to-device (D2D) wireless information and power transfer systems that use an unmanned aerial vehicle (UAV) as a relay-assisted node. As the energy capacity and flight time of UAVs are limited, a significant issue in deploying UAVs is managing energy consumption in real-time applications, which is proportional to the UAV transmit power. To tackle this issue, we develop a real-time resource allocation algorithm that maximizes energy efficiency by jointly optimizing the energy-harvesting time and the power control of the considered UAV-assisted D2D communication system. We demonstrate the effectiveness of the proposed algorithm, whose running time is on the order of milliseconds.
    Comment: 11 pages, 5 figures, 1 table. This paper is accepted for publication in IEEE Wireless Communications Letters.
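
    As a rough illustration of what jointly optimizing the energy-harvesting time and transmit power for energy efficiency involves, the toy Python sketch below grid-searches a harvesting-time fraction and a power level against a placeholder energy-efficiency ratio. The throughput model, circuit power, and all constants are invented for illustration; this is not the paper's system model or its real-time solver.

```python
import numpy as np

# Toy energy-efficiency surrogate: a fraction tau of each slot is spent harvesting
# energy and the rest carries data. 'gain' (an effective channel gain) and
# 'p_circ' (circuit power) are invented constants, not values from the paper.
def energy_efficiency(tau, p, gain=5.0, p_circ=0.1):
    throughput = (1.0 - tau) * np.log2(1.0 + gain * p)   # bits per slot (toy model)
    energy = (1.0 - tau) * p + p_circ                    # Joules per slot (toy model)
    return throughput / energy

def grid_search(p_max=1.0, steps=200):
    """Exhaustive search over (tau, p): a slow reference point, not the
    real-time solver described in the abstract."""
    best = (0.0, 0.0, -np.inf)
    for tau in np.linspace(0.01, 0.99, steps):
        for p in np.linspace(1e-3, p_max, steps):
            ee = energy_efficiency(tau, p)
            if ee > best[2]:
                best = (tau, p, ee)
    return best

tau_opt, p_opt, ee_opt = grid_search()
print(f"tau={tau_opt:.2f}, p={p_opt:.3f}, EE={ee_opt:.2f} (toy units)")
```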

    Towards a Robust WiFi-based Fall Detection with Adversarial Data Augmentation

    Recent WiFi-based fall detection systems have drawn much attention due to their advantages over other sensory systems. Various implementations have achieved impressive progress in performance, thanks to machine learning and deep learning techniques. However, many such high-accuracy systems have low reliability because they fail to remain robust in unseen environments. To address this, this paper investigates a method of generalization through adversarial data augmentation. Our results show a slight improvement for deep learning systems in unseen domains, though the gain is not significant.
    Comment: Will appear in the Proceedings of the 54th Annual Conference on Information Sciences and Systems (CISS 2020).
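
    A common way to realize adversarial data augmentation is the Fast Gradient Sign Method (FGSM): each training sample is perturbed in the direction that increases its loss, and the model is trained on both clean and perturbed data. The NumPy sketch below applies FGSM to a tiny logistic classifier standing in for the deep fall-detection model; the features, labels, weights, and epsilon are placeholders, and the paper's exact augmentation scheme may differ.

```python
import numpy as np

# A tiny logistic "fall / no-fall" classifier stands in for the deep model so the
# input gradient is available in closed form; the feature vectors are random
# placeholders for WiFi CSI features.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_augment(X, y, w, eps=0.05):
    """FGSM-style augmentation: nudge each sample in the direction that increases
    its loss, then train on clean plus perturbed data."""
    p = sigmoid(X @ w)                        # predicted fall probability
    grad_x = (p - y)[:, None] * w[None, :]    # d(cross-entropy)/d(input)
    X_adv = X + eps * np.sign(grad_x)         # bounded worst-case perturbation
    return np.vstack([X, X_adv]), np.concatenate([y, y])

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 16))                # placeholder CSI feature vectors
y = (rng.random(100) > 0.5).astype(float)     # placeholder fall labels
w = rng.normal(size=16) * 0.1                 # placeholder classifier weights
X_aug, y_aug = fgsm_augment(X, y, w)
print(X_aug.shape, y_aug.shape)               # (200, 16) (200,)
```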

    ShortcutFusion: From Tensorflow to FPGA-based accelerator with reuse-aware memory allocation for shortcut data

    The residual block is a very common component in recent state-of-the-art CNNs such as EfficientNet and EfficientDet. Shortcut data accounts for nearly 40% of feature-map accesses in ResNet152 [8], yet most previous DNN compilers and accelerators ignore shortcut data optimization. This paper presents ShortcutFusion, an optimization tool for FPGA-based accelerators with reuse-aware static memory allocation for shortcut data, which maximizes on-chip data reuse under resource constraints. From TensorFlow DNN models, the proposed design generates instruction sets for groups of nodes that use optimized data reuse for each residual block. The accelerator design, implemented on a Xilinx KCU1500 FPGA card, significantly outperforms the NVIDIA RTX 2080 Ti, Titan Xp, and GTX 1080 Ti for EfficientNet inference. Compared to the RTX 2080 Ti, the proposed design is 1.35-2.33x faster and 6.7-7.9x more power efficient. Compared to a baseline in which the weights, inputs, and outputs are accessed from off-chip memory exactly once per layer, ShortcutFusion reduces DRAM accesses by 47.8-84.8% for RetinaNet, YOLOv3, ResNet152, and EfficientNet. Given a buffer size similar to ShortcutMining [8], which also mines shortcut data in hardware, the proposed work reduces off-chip feature-map accesses by 5.27x while accessing weights from off-chip memory exactly once.
    Comment: 12 pages.
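
    To give a flavor of what a reuse-aware allocation for shortcut data involves, the sketch below makes a per-block decision to keep a residual block's shortcut feature map in a dedicated on-chip buffer when it fits, and counts the DRAM traffic avoided. The sizes, buffer budget, and keep-if-it-fits policy are hypothetical illustrations, not ShortcutFusion's actual allocator.

```python
# Per-block decision for shortcut data: keep a residual block's shortcut
# feature map in a dedicated on-chip buffer when it fits, otherwise spill it
# to DRAM. The sizes, budget, and keep-if-it-fits rule are illustrative only.
def allocate_shortcuts(shortcut_bytes, buffer_budget):
    """Return a per-block placement and the DRAM traffic avoided (bytes).

    Each shortcut is live only inside its own block, so successive blocks can
    reuse the same on-chip region in this simplified model.
    """
    decisions, dram_saved = [], 0
    for size in shortcut_bytes:
        if size <= buffer_budget:
            decisions.append("on-chip")
            dram_saved += 2 * size   # skip one write to and one read from DRAM
        else:
            decisions.append("DRAM")
    return decisions, dram_saved

# Toy residual blocks with shortcut maps of 256 KiB, 512 KiB, 1 MiB, and 4 MiB.
sizes = [256 << 10, 512 << 10, 1 << 20, 4 << 20]
plan, saved = allocate_shortcuts(sizes, buffer_budget=2 << 20)
print(plan, f"{saved / (1 << 20):.1f} MiB of DRAM traffic avoided")
```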

    Federated Deep Reinforcement Learning-based Bitrate Adaptation for Dynamic Adaptive Streaming over HTTP

    In video streaming over HTTP, bitrate adaptation selects the quality of video chunks depending on the current network conditions. Some previous works have applied deep reinforcement learning (DRL) algorithms to determine each chunk's bitrate from the observed states in order to maximize the quality of experience (QoE). However, to build an intelligent model that can predict across various environments, such as 3G, 4G, WiFi, etc., the states observed in these environments must be sent to a server for centralized training. In this work, we integrate federated learning (FL) into DRL-based rate adaptation to train a model suitable for different environments. The clients in the proposed framework train their models locally and upload only the weight updates to the server. Simulations show that our federated DRL-based rate adaptation, called FDRLABR, combined with different DRL algorithms such as deep Q-learning, advantage actor-critic, and proximal policy optimization, yields better performance than traditional bitrate adaptation methods in various environments.
    Comment: 13 pages, 1 column.
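
    The federated training loop described above can be pictured as clients running local DRL updates and a server averaging the uploaded weights, FedAvg-style. The NumPy sketch below fakes the local DRL step with random gradient noise; the parameter shapes, client count, and sample sizes are placeholders rather than FDRLABR's actual implementation.

```python
import numpy as np

# One FedAvg-style round over DRL policy weights, sketched with NumPy arrays.
# Local training is faked with random gradient noise; in the actual framework
# each client would run its DRL algorithm (DQN, A2C, or PPO) on its own traces.
def local_update(weights, rng, lr=0.01):
    """Placeholder for a client's local DRL training on private streaming traces."""
    return {k: w - lr * rng.normal(size=w.shape) for k, w in weights.items()}

def fed_avg(client_weights, client_sizes):
    """Server step: weighted average of the uploaded client models. Only model
    parameters are exchanged; the raw observed states stay on each client."""
    total = sum(client_sizes)
    return {
        k: sum(n / total * cw[k] for cw, n in zip(client_weights, client_sizes))
        for k in client_weights[0]
    }

rng = np.random.default_rng(42)
global_w = {"policy": rng.normal(size=(8, 4)), "value": rng.normal(size=(8, 1))}
for _ in range(3):                                             # a few federated rounds
    updates = [local_update(global_w, rng) for _ in range(5)]  # 5 simulated clients
    global_w = fed_avg(updates, client_sizes=[100, 80, 120, 90, 110])
print({k: v.shape for k, v in global_w.items()})
```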