108 research outputs found
Optimization of Mixed-precision Neural Architecture with Knowledge Distillation
Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, August 2020. Advisor: Hyuk-Jae Lee.
Quantization is an essential process in the deployment of deep neural networks on edge devices, which have only limited memory and computation capacity. However, it is widely known that the straightforward uniform-precision quantization method, which applies the same quantization scheme, including bit-width and transformation function, to all layers, suffers severe performance degradation. Meanwhile, finding and assigning different quantization schemes to different layers is challenging, as the number of candidates is exponential in the number of layers.
To address this problem, this study proposes a method that utilizes the Knowledge Distillation technique to explore the search space efficiently, in linear time. In particular, the proposed method formulates a per-layer loss function to estimate the impact of a quantization scheme on a target layer. Based on the assumption that the dependence among layers is minimal, the assignment is then decided for each layer separately to minimize the performance loss. Experiments are conducted on both image classification and object detection tasks, using only hardware-friendly quantization schemes. The results show that the most efficient mixed-precision ResNet20 model, with a 13.65 times smaller size, still achieves up to 93.62% accuracy on the CIFAR-10 dataset, which is only 0.19% lower than the baseline full-precision model. On the VOC dataset, the proposed method generates a mixed-precision Sim-YOLOv2-FPGA model with a mean average precision of 63.87, which outperforms all uniform-precision models with the same compression rate. The proposed method is practically simple to carry out while achieving efficiency comparable to other state-of-the-art approaches to mixed-precision quantization.
Chapter 1: Introduction
Chapter 2: Related Work
2.1. Quantization techniques
2.2. Uniform quantization
2.3. Shifter quantization
2.4. Knowledge Distillation
2.5. Neural Architecture Search
Chapter 3: Knowledge-Distillation Mixed-Precision
3.1. Complexity challenge and solution
3.2. Impact assessment via Knowledge Distillation
3.3. Parallel Knowledge Distillation training
3.4. Candidate architecture generation
3.5. Final model fine-tuning
3.6. Summary
Chapter 4: Experimental Results
4.1. Final model fine-tuning time reduction
4.2. Scalable and flexible training
4.3. Effectiveness of Knowledge Distillation Mixed Precision
4.4. Intra-layer mixed precision
Chapter 5: Conclusion and Future Work
Appendix
A. Experiment with Sim-YOLOv2-FPGA on VOC dataset
B. Experiment with ResNet20 and ResNet32 on CIFAR-10
Reference
Abstract (in Korean)
Acknowledgement
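The per-layer selection described in the abstract above — score each candidate bit-width with a distillation-style per-layer loss against the full-precision teacher, and decide each layer independently — can be sketched as follows. This is only an illustration of the idea: the network sizes, bit-widths, probe inputs, and loss budget are made-up assumptions, not the thesis's actual settings or loss formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(w, bits):
    """Uniform symmetric weight quantization at the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    return np.round(w / scale) * scale

# Hypothetical weights of a small 4-layer full-precision network.
layers = [rng.standard_normal((16, 16)) for _ in range(4)]
candidates = [2, 4, 8]     # hardware-friendly bit-widths
budget = 5e-3              # tolerated per-layer loss (assumed threshold)

assignment = []
for w in layers:
    x = rng.standard_normal((32, 16))     # probe inputs for this layer
    teacher = x @ w                       # full-precision (teacher) output
    # Because layers are assumed independent, each one is searched
    # separately: linear in (#layers x #candidates), not exponential.
    chosen = candidates[-1]
    for bits in candidates:               # try the cheapest scheme first
        student = x @ quantize(w, bits)   # quantized (student) output
        if np.mean((teacher - student) ** 2) <= budget:
            chosen = bits
            break
    assignment.append(chosen)

print(assignment)
```

The point of the sketch is the complexity argument: with L layers and K candidate schemes, the independent per-layer search evaluates L × K configurations instead of K^L.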
Real-time Optimal Resource Allocation for Embedded UAV Communication Systems
We consider device-to-device (D2D) wireless information and power transfer
systems using an unmanned aerial vehicle (UAV) as a relay-assisted node. As the
energy capacity and flight time of UAVs are limited, a significant issue in
deploying UAVs is managing energy consumption in real-time applications, which
is proportional to the UAV transmit power. To tackle this important issue, we
develop a real-time resource allocation algorithm that maximizes energy
efficiency by jointly optimizing the energy-harvesting time and the power
control for the considered D2D communication system embedded with a UAV. We
demonstrate the effectiveness of the proposed algorithm, which runs in
milliseconds.
Comment: 11 pages, 5 figures, 1 table. This paper is accepted for publication
in IEEE Wireless Communications Letters.
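The harvesting-time/transmit-power trade-off described above can be pictured with a toy grid search: a fraction tau of each slot harvests energy, the remainder transmits, and energy efficiency is the achievable rate per unit of consumed energy. All constants (channel gain, noise, circuit power, harvesting efficiency) and the grid search itself are illustrative assumptions, not the paper's model or real-time algorithm.

```python
import numpy as np

g = 0.8          # assumed channel gain
noise = 0.1      # assumed noise power
p_circuit = 0.05 # assumed static circuit power
eta = 0.6        # assumed energy-harvesting efficiency
p_beacon = 1.0   # assumed power of the energy source

best_ee, best_point = 0.0, None
for tau in np.linspace(0.05, 0.95, 19):        # harvesting-time fraction
    for p in np.linspace(0.05, 1.0, 20):       # transmit power
        # Feasibility: transmit energy cannot exceed harvested energy.
        if (1 - tau) * p > tau * eta * p_beacon:
            continue
        # Rate is earned only during the (1 - tau) transmission phase.
        rate = (1 - tau) * np.log2(1 + g * p / noise)
        # Energy efficiency: bits delivered per unit of consumed energy.
        ee = rate / ((1 - tau) * p + p_circuit)
        if ee > best_ee:
            best_ee, best_point = ee, (tau, p)

print(best_ee, best_point)
```

A real-time solver would replace this exhaustive grid with a closed-form or low-complexity iterative solution, which is what makes millisecond running times possible.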
Towards a Robust WiFi-based Fall Detection with Adversarial Data Augmentation
Recent WiFi-based fall detection systems have drawn much attention due to
their advantages over other sensory systems. Various implementations have
achieved impressive progress in performance, thanks to machine learning and
deep learning techniques. However, many such high-accuracy systems have low
reliability, as they fail to achieve robustness in unseen environments. To
address this, this paper investigates a method of generalization through
adversarial data augmentation. Our results show a slight improvement for deep
learning systems in unseen domains, though the improvement is not significant.
Comment: Will appear in Proceedings of the 54th Annual Conference on
Information Sciences and Systems (CISS 2020).
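The augmentation idea above — perturb training samples adversarially and add them back to the training set — can be sketched with a stand-in model. The paper targets deep models on WiFi sensing data; here a logistic-regression detector on synthetic features is used purely to illustrate the FGSM-style perturbation, so every size and constant is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic stand-in for CSI-derived features and fall / no-fall labels.
n, d = 200, 8
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = (X @ w_true > 0).astype(float)

# Train a plain logistic-regression detector by gradient descent.
w = np.zeros(d)
for _ in range(300):
    p = sigmoid(X @ w)
    w -= 0.5 * (X.T @ (p - y) / n)

# FGSM-style adversarial augmentation: step each sample along the sign
# of the input gradient of the logistic loss, then enlarge the set.
eps = 0.1
p = sigmoid(X @ w)
grad_x = (p - y)[:, None] * w[None, :]   # d(loss)/dx for logistic loss
X_adv = X + eps * np.sign(grad_x)

X_aug = np.vstack([X, X_adv])            # doubled training set
y_aug = np.concatenate([y, y])
print(X_aug.shape)                       # -> (400, 8)
```

Retraining on `X_aug` is what is meant to improve robustness to unseen domains; with a deep model, `grad_x` would come from backpropagation instead of this closed form.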
ShortcutFusion: From Tensorflow to FPGA-based accelerator with reuse-aware memory allocation for shortcut data
The residual block is a very common component in recent state-of-the-art CNNs
such as EfficientNet or EfficientDet. Shortcut data accounts for nearly 40% of
feature-map accesses in ResNet152 [8]. Most previous DNN compilers and
accelerators ignore shortcut data optimization. This paper presents
ShortcutFusion, an optimization tool for FPGA-based accelerators with
reuse-aware static memory allocation for shortcut data, to maximize on-chip
data reuse given resource constraints. From TensorFlow DNN models, the proposed
design generates instruction sets for groups of nodes that use optimized
data reuse for each residual block. The accelerator design implemented on the
Xilinx KCU1500 FPGA card significantly outperforms the NVIDIA RTX 2080 Ti, Titan
Xp, and GTX 1080 Ti for EfficientNet inference. Compared to the RTX 2080 Ti,
the proposed design is 1.35-2.33x faster and 6.7-7.9x more power efficient.
Compared to the baseline, in which the weights, inputs, and outputs
are accessed from off-chip memory exactly once per layer,
ShortcutFusion reduces DRAM accesses by 47.8-84.8% for RetinaNet, YOLOv3,
ResNet152, and EfficientNet. Given a buffer size similar to ShortcutMining [8],
which also mines shortcut data in hardware, the proposed work reduces
off-chip accesses for feature maps by 5.27x while accessing weights from
off-chip memory exactly once.
Comment: 12 pages.
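Why keeping shortcut data on-chip pays off can be shown with back-of-the-envelope DRAM-traffic arithmetic for one residual block. The tensor size and layer count below are illustrative assumptions, not measurements from the paper's accelerator.

```python
# One residual block: input -> conv -> conv -> add(shortcut) -> output.
fmap_bytes = 56 * 56 * 256   # assumed size of one feature map
n_convs = 2                  # convolutions inside the block

# Baseline: every conv reads its input and writes its output to DRAM,
# and the shortcut tensor is read from DRAM again for the add.
baseline = n_convs * 2 * fmap_bytes + fmap_bytes

# Reuse-aware: intermediate maps and the shortcut stay in on-chip
# buffers; only the block input is read and the block output written.
fused = 2 * fmap_bytes

saving = 1 - fused / baseline
print(f"{saving:.0%} less feature-map DRAM traffic")  # -> 60% for this block
```

The reported 47.8-84.8% reductions vary per network because real blocks differ in depth, tensor sizes, and how much of the shortcut fits in the available on-chip buffer.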
Federated Deep Reinforcement Learning-based Bitrate Adaptation for Dynamic Adaptive Streaming over HTTP
In video streaming over HTTP, the bitrate adaptation selects the quality of
video chunks depending on the current network condition. Some previous works
have applied deep reinforcement learning (DRL) algorithms to determine the
chunk's bitrate from the observed states to maximize the quality-of-experience
(QoE). However, to build an intelligent model that can predict well in various
environments, such as 3G, 4G, WiFi, etc., the states observed from
these environments must be sent to a server for centralized training. In this
work, we integrate federated learning (FL) into DRL-based rate adaptation to
train a model appropriate for different environments. The clients in the
proposed framework train their models locally and only send the weight updates
to the server. The simulations show that our federated DRL-based rate
adaptation, called FDRLABR, with different DRL algorithms, such as deep
Q-learning, advantage actor-critic, and proximal policy optimization, yields
better performance than traditional bitrate adaptation methods in various
environments.
Comment: 13 pages, 1 column.
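The train-locally, average-centrally loop described above can be sketched with federated averaging (FedAvg). This is not the paper's FDRLABR: the DRL clients are replaced by toy least-squares learners, and each "environment" is just a different random dataset, so every size and constant is an assumption.

```python
import numpy as np

def local_update(w, data, lr=0.1, steps=5):
    """A client's local training; only the resulting weights leave it."""
    X, y = data
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)   # least-squares gradient
        w = w - lr * grad
    return w

rng = np.random.default_rng(2)
d = 4
w_global = np.zeros(d)

# Each client observes a different "environment" (3G, 4G, WiFi, ...),
# modeled here as an independent random dataset.
clients = [(rng.standard_normal((50, d)), rng.standard_normal(50))
           for _ in range(3)]

for _ in range(10):                          # federated rounds
    local = [local_update(w_global.copy(), data) for data in clients]
    w_global = np.mean(local, axis=0)        # server-side averaging

print(w_global.shape)                        # -> (4,)
```

The raw observations never leave the clients; only the locally updated weights are averaged, which is the property that lets one model be trained across heterogeneous network environments.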
E-Learner's needs for sustainable tourism Higher Education: a case of Vietnam
E-learning has been suggested as a way to help higher-education organizations in tourism achieve sustainable development. Thus, this study aims to identify the factors determining learners' needs for e-learning programs, particularly in a tourism curriculum. Based on relevant research and the Technology Acceptance Model, the study conducted a survey of 1,109 learners in the Central Coastal region of Vietnam. The results show that the e-learning environment, perceived ease of use, perceived usefulness, playfulness, and information technology skills have a positive impact on learners' needs. These findings provide useful managerial implications for e-learning in tourism programs, which contributes to the sustainable development of higher education as well as the tourism industry through its workforce.