23 research outputs found

    PreVIous: A Methodology for Prediction of Visual Inference Performance on IoT Devices

    This paper presents PreVIous, a methodology for predicting the performance of convolutional neural networks (CNNs), in terms of throughput and energy consumption, on vision-enabled devices for the Internet of Things. CNNs typically impose a massive computational load on such devices, whose scarce hardware resources must be shared among multiple concurrent tasks. It is therefore critical to select the optimal CNN architecture for a particular hardware platform according to prescribed application requirements. However, the zoo of CNN models is already vast and rapidly growing. To facilitate a suitable selection, we introduce a prediction framework that evaluates the performance of CNNs prior to their actual implementation. The proposed methodology is based on PreVIousNet, a neural network specifically designed to build accurate per-layer performance predictive models. PreVIousNet incorporates the most common parameters found in state-of-the-art network architectures. The resulting predictive models for inference time and energy were tested against comprehensive characterizations of seven well-known CNN models running on two different software frameworks and two different embedded platforms. To the best of our knowledge, this is the most extensive study in the literature on CNN performance prediction for low-power, low-cost devices. The average deviation between predictions and real measurements is remarkably low, ranging from 3% to 10%, which constitutes state-of-the-art modeling accuracy. As an additional asset, the fine-grained a priori analysis provided by PreVIous could also be exploited by neural architecture search engines.
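    A minimal sketch of the per-layer prediction idea, assuming synthetic layer descriptors and a plain linear model; PreVIousNet itself is not reproduced here, and every feature choice and constant below is illustrative:

        import numpy as np
        from sklearn.linear_model import LinearRegression

        # One row per convolutional layer: [C_in, C_out, kernel, H_out*W_out, MACs]
        def layer_features(c_in, c_out, k, hw):
            macs = c_in * c_out * k * k * hw
            return [c_in, c_out, k, hw, macs]

        rng = np.random.default_rng(0)
        X = np.array([layer_features(rng.integers(16, 512), rng.integers(16, 512),
                                     rng.choice([1, 3, 5]), rng.integers(49, 12544))
                      for _ in range(200)])
        # Stand-in "measured" per-layer times: a compute term plus a memory-traffic term.
        y = 1e-10 * X[:, 4] + 1e-8 * (X[:, 0] + X[:, 1]) * np.sqrt(X[:, 3]) \
            + rng.normal(0, 1e-4, 200)

        model = LinearRegression().fit(X, y)   # the per-layer predictive model

        # A network's predicted latency is the sum of its layers' predictions,
        # so candidate architectures can be ranked before deploying anything.
        net = np.array([layer_features(64, 64, 3, 224 * 224),
                        layer_features(64, 128, 3, 112 * 112)])
        print("predicted network time: %.4f s" % model.predict(net).sum())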

    Deep learning in edge: evaluation of models and frameworks in ARM architecture

    The boom and popularization of edge devices have shaped their market through stiff competition to provide better functionality at low energy cost. The ARM architecture is virtually unopposed in the huge smartphone market segment and has a presence well beyond it: in drones, surveillance systems, cars, and robots. It has also been used successfully to develop solutions for chains that supply food, fuel, and other services. Until recently, ARM did not show much promise for high-level computation: with its limited RISC instruction set, it was considered power efficient but weak in performance compared to the x86 architecture. However, recent advances in ARM architecture have shifted that inflection point, thanks to the introduction of embedded GPUs with DMA into LPDDR memory boards. Since this development in boards such as the NVIDIA Jetson TK1, Jetson TX1, and Jetson TX2, it has perhaps finally become feasible to study and perform more challenging parallel and distributed workloads directly on a RISC-based architecture. On the other hand, the novelty of this technology poses a fundamental question: are these boards achieving a meaningful ratio between processing power and power consumption over conventional architectures, or have they reached their limits? This work explores the parallel processing of deep learning on the embedded GPU of the NVIDIA Jetson TX2 to evaluate this question comprehensively. It uses 4 ARM boards, 2 deep learning frameworks, 7 CNN models, and one medium-sized dataset, combined into six board settings, to conduct experiments. The experiments were conducted under similar environments, all built from source. Altogether, the experiments ran for a total of 4,804 hours and revealed a slight advantage for MXNet in GPU-reliant training and an overall advantage for PyTorch in total execution time and power, especially for CPU-only executions. The experiments also showed that the NVIDIA Jetson TX2 already makes some complex workloads feasible directly on its SoC.
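    The abstract omits the measurement harness; below is a minimal sketch of the kind of per-framework timing loop such evaluations rely on, written with PyTorch for concreteness (the model choice and iteration counts are illustrative, not the study's):

        import time
        import torch
        import torchvision.models as models

        device = "cuda" if torch.cuda.is_available() else "cpu"
        net = models.resnet18(weights=None).to(device).eval()   # untrained; timing only
        x = torch.randn(1, 3, 224, 224, device=device)

        with torch.no_grad():
            for _ in range(10):                  # warm-up iterations
                net(x)
            if device == "cuda":
                torch.cuda.synchronize()         # drain queued GPU work before timing
            t0 = time.perf_counter()
            for _ in range(100):
                net(x)
            if device == "cuda":
                torch.cuda.synchronize()
            dt = time.perf_counter() - t0

        print(f"{device}: {dt / 100 * 1e3:.2f} ms per inference")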

    ์ž„๋ฒ ๋””๋“œ ์‹œ์Šคํ…œ์—์„œ ์—ฌ๋Ÿฌ ์ปจ๋ณผ๋ฃจ์…˜ ๋‰ด๋Ÿด ๋„คํŠธ์›Œํฌ๋ฅผ ์œ„ํ•œ ํ•˜๋“œ์›จ์–ด๋ฅผ ๊ณ ๋ คํ•˜๋Š” ์†Œํ”„ํŠธ์›จ์–ด ์ตœ์ ํ™” ๊ธฐ๋ฒ•

    ํ•™์œ„๋…ผ๋ฌธ (๋ฐ•์‚ฌ) -- ์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› : ๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€, 2021. 2. ํ•˜์ˆœํšŒ.์ž„๋ฒ ๋””๋“œ ๊ธฐ๊ธฐ๋Š” ๋Œ€๊ฐœ ๊ณ„์‚ฐ๋Ÿ‰, ๋ฉ”๋ชจ๋ฆฌ ํฌ๊ธฐ, ์—๋„ˆ์ง€ ์†Œ๋ชจ๋Ÿ‰ ๋“ฑ์˜ ์ œ์•ฝ ์‚ฌํ•ญ์ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, ๋”ฅ ๋Ÿฌ๋‹ ์‘์šฉ์„ ์ž„๋ฒ ๋””๋“œ ๊ธฐ๊ธฐ์—์„œ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ์€ ์‰ฝ์ง€ ์•Š๋‹ค. ๋”ฅ ๋Ÿฌ๋‹ ์‘์šฉ์˜ ๊ณ„์‚ฐ๋Ÿ‰ ์ฆ๊ฐ€๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด์„œ ์—๋„ˆ์ง€ ํšจ์œจ์ ์ธ ๋ชจ๋ฐ”์ผ GPU, ๋””์ง€ํ„ธ ์‹ ํ˜ธ ์ฒ˜๋ฆฌ ํ”„๋กœ์„ธ์„œ์„ ์‚ฌ์šฉํ•˜๊ฑฐ๋‚˜, ๋˜๋Š” ์ƒˆ๋กœ์šด ๋‰ด๋Ÿด ํ”„๋กœ์„ธ์„œ ์นฉ์„ ๋งŒ๋“œ๋ ค๋Š” ํ•˜๋“œ์›จ์–ด ์˜์—ญ์˜ ์ตœ์ ํ™” ๋ฐฉ๋ฒ•์ด ์žˆ๋‹ค. ๋ฐ˜๋ฉด์— ๋”ฅ ๋Ÿฌ๋‹ ์‘์šฉ ์˜์—ญ์—์„œ๋Š” ์ƒˆ๋กœ์šด ๋”ฅ ๋Ÿฌ๋‹ ์‘์šฉ์„ ๋งŒ๋“ค๊ฑฐ๋‚˜, ๋”ฅ ๋Ÿฌ๋‹์˜ ํ†ต๊ณ„์ ์ธ ํŠน์„ฑ์„ ์ด์šฉํ•œ ๊ทผ์‚ฌ ๊ณ„์‚ฐ ๋ฐฉ๋ฒ•์„ ์ด์šฉํ•˜์—ฌ ์ตœ์ ํ™” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•˜๊ณ  ์žˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๋˜ ๋‹ค๋ฅธ ์ตœ์ ํ™” ๋ฐฉ๋ฒ•์œผ๋กœ๋Š” ๋จผ์ € ํ•˜๋“œ์›จ์–ด ํ”Œ๋žซํผ์˜ ์„ฑ๋Šฅ ๋ณ‘๋ชฉ ๋ถ€๋ถ„์„ ์ฐพ๊ณ , ์ผ์„ ๋™๋“ฑํ•˜๊ฒŒ ์—ฌ๋Ÿฌ ๊ณ„์‚ฐ ์ž์›์— ๋ถ„๋ฐฐํ•˜์—ฌ ์ตœ์ ํ™”ํ•˜๋Š” ํ•˜๋“œ์›จ์–ด๋ฅผ ๊ณ ๋ คํ•œ ์ตœ์ ํ™” ๋ฐฉ๋ฒ•์ด ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ํ•˜๋“œ์›จ์–ด๋ฅผ ๊ณ ๋ คํ•œ ์†Œํ”„ํŠธ์›จ์–ด ์ตœ์ ํ™” ๋ฐฉ๋ฒ•๋“ค์„ ๊ณ ์•ˆํ•˜์˜€๋‹ค. ๋จผ์ €, LPIRC ๋Œ€ํšŒ์— ์ฐธ๊ฐ€ํ•œ ๊ฒฝํ—˜์„ ๋ฐ”ํƒ•์œผ๋กœ ์ž„๋ฒ ๋””๋“œ ๋”ฅ ๋Ÿฌ๋‹ ์‹œ์Šคํ…œ์„ ์ตœ์ ํ™”ํ•˜๋Š” ์ฒด๊ณ„์ ์ธ ๋ฐฉ๋ฒ•๋ก ์„ ๊ณ ์•ˆํ•˜๊ณ , ๊ทธ ๋ฐฉ๋ฒ•๋ก ์— ๋”ฐ๋ฅธ C-GOOD์ด๋ผ๋Š” ๋”ฅ ๋Ÿฌ๋‹ ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ๊ตฌํ˜„ํ•˜์˜€๋‹ค. C-GOOD์€ ํ•˜๋“œ์›จ์–ด ํ”Œ๋žซํผ์— ๋…๋ฆฝ์ ์œผ๋กœ ์ž‘๋™ํ•˜๊ธฐ ์œ„ํ•ด ๋Œ€๋ถ€๋ถ„์˜ ์ž„๋ฒ ๋””๋“œ ๊ธฐ๊ธฐ์—์„œ ์ปดํŒŒ์ผ, ์ˆ˜ํ–‰์ด ๊ฐ€๋Šฅํ•œ C ์ฝ”๋“œ๋ฅผ ์ƒ์„ฑํ•œ๋‹ค. ๋˜ํ•œ ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ๋”ฅ ๋Ÿฌ๋‹ ์‘์šฉ ์˜์—ญ์˜ ์ตœ์ ํ™” ๋ฐฉ๋ฒ•์„ ์ ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ์˜ต์…˜๊ณผ ์‹œ์Šคํ…œ ์„ฑ๋Šฅ์„ ์ธก์ •ํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ๋Šฅ์„ ์ œ๊ณตํ•˜์˜€๋‹ค. ์ด ๋ฐฉ๋ฒ•๋ก ์„ Jetson TX2, Odroid XU4, SRP ๋“ฑ์˜ ์„œ๋กœ ๋‹ค๋ฅธ 3๊ฐœ์˜ ๊ธฐ๊ธฐ์— ์ ์šฉํ•ด ๋ด„์œผ๋กœ์จ, ๊ณ ์•ˆ๋œ ๋ฐฉ๋ฒ•๋ก ์ด ํ•˜๋“œ์›จ์–ด ํ”Œ๋žซํผ์— ๋…๋ฆฝ์ ์ด๋ฉฐ C-GOOD์„ ํ†ตํ•ด ์‰ฝ๊ฒŒ ์—ฌ๋Ÿฌ ๋”ฅ ๋Ÿฌ๋‹ ์‘์šฉ ์ตœ์ ํ™” ๋ฐฉ๋ฒ•์„ ์ ์šฉํ•  ์ˆ˜ ์žˆ์Œ์„ ํ™•์ธํ•˜์˜€๋‹ค. ์ตœ๊ทผ ์ž„๋ฒ ๋””๋“œ ๊ธฐ๊ธฐ์— ์ด์ข… ํ”„๋กœ์„ธ์„œ๋“ค์ด ๋งŽ์ด ํƒ‘์žฌ๋˜๊ณ  ์žˆ๊ณ , ๋™์‹œ์— ์ž์œจ ์ฃผํ–‰ ์ž๋™์ฐจ์™€ ์Šค๋งˆํŠธํฐ ๋“ฑ์˜ ํ•˜๋‚˜์˜ ์ž„๋ฒ ๋””๋“œ ๊ธฐ๊ธฐ์—์„œ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ๋”ฅ ๋Ÿฌ๋‹ ์‘์šฉ์„ ๋™์‹œ์— ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ์ด ํ•„์š”ํ•ด์ง€๊ณ  ์žˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์—ฌ๋Ÿฌ ๋”ฅ ๋Ÿฌ๋‹ ์‘์šฉ์„ ์ด์ข… ํ”„๋กœ์„ธ์„œ๋“ค์„ ํƒ‘์žฌํ•œ ์ž„๋ฒ ๋””๋“œ ๊ธฐ๊ธฐ์— ์Šค์ผ€์ค„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ๊ณ ์•ˆํ•˜๊ณ , ์Šค์ผ€์ค„๋ง ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ๊ตฌํ˜„ํ•˜์˜€๋‹ค. ์ด ๋ฐฉ๋ฒ•๋ก ์€ ์‹ค์ œ ๊ธฐ๊ธฐ์—์„œ์˜ ํ”„๋กœํŒŒ์ผ๋ง๋ถ€ํ„ฐ ์Šค์ผ€์ค„ ๊ฒฐ๊ณผ๋ฅผ ์‹ค์ œ ๊ธฐ๊ธฐ์—์„œ ํ™•์ธํ•˜๋Š” ๊ณผ์ •๊นŒ์ง€ ํฌํ•จํ•˜๋ฉฐ, ์‹ค์ œ ๊ธฐ๊ธฐ์—์„œ ๋ฐœ์ƒํ•˜๋Š” ์ด์Šˆ๋“ค์ธ DVFS, CPU Hot-plug ๋“ฑ์„ ๊ณ ๋ คํ•˜์˜€๋‹ค. ์ด์ข… ํ”„๋กœ์„ธ์„œ๋กœ์˜ ์Šค์ผ€์ค„๋ง ๊ธฐ๋ฒ•์œผ๋กœ๋Š” ๋งŽ์ด ์‚ฌ์šฉ๋˜๋Š” ๋ฉ”ํƒ€ ํœด๋ฆฌ์Šคํ‹ฑ ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ ์œ ์ „ ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ํŠนํžˆ, ์„œ๋กœ ๋‹ค๋ฅธ ์ฃผ๊ธฐ์™€ ์ƒ๋Œ€ ์˜คํ”„์…‹์„ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ์—ฌ๋Ÿฌ ์‘์šฉ์„ ๋™์‹œ์— ์Šค์ผ€์ค„ํ•˜๊ธฐ ์œ„ํ•ด์„œ ๋ชจ๋“  ํƒœ์Šคํฌ๋“ค์˜ ์Šค์ผ€์ค„ ๊ฐ€๋Šฅ์„ฑ์„ ๊ณ ๋ คํ•˜์—ฌ ์Šค์ผ€์ค„ํ•˜์˜€๋‹ค. ์Šค์ผ€์ค„ ๊ฒฐ๊ณผ๋ฅผ ๊ฒ€์ฆํ•˜๊ธฐ ์œ„ํ•ด์„œ, ACL์˜ ์ฝ”์–ด ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์ด์šฉํ•˜์—ฌ ๋”ฅ ๋Ÿฌ๋‹ ์ถ”๋ก  ์‘์šฉ์„ ๊ตฌํ˜„ํ•˜์˜€์œผ๋ฉฐ, ์Šค์ผ€์ค„ ๊ฒฐ๊ณผ์™€ ๊ฐ™์ด ๊ฐ ๋ ˆ์ด์–ด๋“ค์„ ์‹ค์ œ ํ•˜๋“œ์›จ์–ด์˜ ์„œ๋กœ ๋‹ค๋ฅธ ํ”„๋กœ์„ธ์„œ ๋งคํ•‘ํ•˜๋„๋ก ๊ตฌํ˜„ํ•˜์˜€๋‹ค. ๊ฐค๋Ÿญ์‹œ S9 ์Šค๋งˆํŠธํฐ๊ณผ Hikey 970 ๋ณด๋“œ์—์„œ ์„œ๋กœ ๋‹ค๋ฅธ ๋‘๊ฐœ์˜ ๋”ฅ ๋Ÿฌ๋‹ ๋„คํŠธ์›Œํฌ๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ณ , ์Šค์ผ€์ค„ ๊ฒฐ๊ณผ์™€ ๋น„๊ตํ•˜์—ฌ ๋ฐฉ๋ฒ•๋ก ์„ ๊ฒ€์ฆํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค. 
์ด์ „ ์ตœ์ ํ™” ๋ฐฉ๋ฒ•๋“ค์ด ๋”ฅ ๋Ÿฌ๋‹ ์‘์šฉ์˜ ๊ณ„์‚ฐ๋Ÿ‰๊ณผ ํ”„๋กœ์„ธ์„œ๋“ค์— ์ง‘์ค‘ํ•˜์˜€๋Š”๋ฐ, ๋”ฅ ๋Ÿฌ๋‹ ๊ฐ€์†๊ธฐ ๋˜๋Š” NPU์˜ ์„ฑ๋Šฅ ๋ณ‘๋ชฉ์ด ์ƒ๊ธฐ๋Š” ์›์ธ์€ ์˜คํ”„ ์นฉ ๋ฉ”๋ชจ๋ฆฌ์™€ ์˜จ ์นฉ ์‚ฌ์ด์˜ ํ†ต์‹ ์ด๋‹ค. ๋”์šฑ์ด ์˜คํ”„ ์นฉ ๋ฉ”๋ชจ๋ฆฌ DRAM ์ ‘๊ทผ์€ NPU์˜ ์ „๋ ฅ์†Œ๋ชจ์˜ ๋งŽ์€ ๋ถ€๋ถ„์„ ์ฐจ์ง€ํ•œ๋‹ค๊ณ  ์•Œ๋ ค์ ธ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ์ด์™€ ๊ฐ™์€ ์˜คํ”„ ์นฉ DRAM ์ ‘๊ทผ์œผ๋กœ ์ธํ•œ NPU์˜ ์„ฑ๋Šฅ๊ณผ ์—๋„ˆ์ง€ ์˜ํ–ฅ์„ ์ค„์ด๊ณ ์ž ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์˜จ ์นฉ ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ๋ฅผ ๊ด€๋ฆฌํ•˜๋Š” ์ปดํŒŒ์ผ๋Ÿฌ ๊ธฐ๋ฒ•์„ ๊ณ ์•ˆํ•˜์˜€๋‹ค. ์˜จ ์นฉ ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ๋ฑ…ํฌ๋กœ ๊ตฌ์„ฑํ•˜๊ณ  ์—ฐ์‚ฐ ๋„์ค‘์— ์ธํ’‹ ๋ฐ์ดํ„ฐ๋ฅผ ๋ฏธ๋ฆฌ ๋กœ๋“œํ•จ์œผ๋กœ์จ ์—ฐ์‚ฐ ์ง€์—ฐ ์‹œ๊ฐ„์„ ์ค„์ผ ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ๊ณผ ๋ ˆ์ด์–ด์˜ ์•„์›ƒํ’‹์„ ์˜จ ์นฉ ๋ฉ”๋ชจ๋ฆฌ์—์„œ ์žฌ์‚ฌ์šฉํ•˜์—ฌ ์˜คํ”„ ์นฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์ค„์ผ ์ˆ˜ ์žˆ๋‹ค๋Š” ์ ์„ ์ด์šฉํ•˜์—ฌ ์„œ๋กœ ๋‹ค๋ฅธ ๋‘ ๊ฐ€์ง€์˜ ๋ชฉ์  ํ•จ์ˆ˜๋ฅผ ๊ฐ€์ง„ ๋‘ ๊ฐ€์ง€ ๊ธฐ๋ฒ•์„ ๊ณ ์•ˆํ•˜์˜€๋‹ค. ๋ชฉ์  ํ•จ์ˆ˜๋Š” ๊ฐ๊ฐ ์˜คํ”„ ์นฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์„ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ๊ณผ ์˜คํ”„ ์นฉ ๋ฉ”๋ชจ๋ฆฌ ์ ‘๊ทผ์œผ๋กœ ์ธํ•œ ํ”„๋กœ์„ธ์„œ๋“ค์˜ ์ฒ˜๋ฆฌ ์ง€์—ฐ์‹œ๊ฐ„์„ ์ค„์ด๋Š” ๊ฒƒ์ด๋‹ค. ์„œ๋กœ ๋‹ค๋ฅธ 5๊ฐœ์˜ ๋”ฅ ๋Ÿฌ๋‹ ๋„คํŠธ์›Œํฌ๋ฅผ ์‚ฌ์ดํด ๋ ˆ๋ฒจ NPU ์‹œ๋ฎฌ๋ ˆ์ดํ„ฐ์—์„œ ์ˆ˜ํ–‰ํ•˜์—ฌ ๋‘ ๋ชฉ์  ํ•จ์ˆ˜์— ๋”ฐ๋ฅธ ์ ˆ์ถฉ (Trade-off) ๊ด€๊ณ„ ๋ฅผ ํ™•์ธํ•˜์˜€๋‹ค. ๋˜ํ•œ ์˜จ ์นฉ ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ๊ด€๋ฆฌ ๊ธฐ๋ฒ•์„ ๋ ˆ์ด์–ด ๊ฐ„ ํ”ผ์ฒ˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ตœ๋Œ€ํ•œ ์žฌ์‚ฌ์šฉํ•˜๋Š” ๋ ˆ์ด์–ด ์œตํ•ฉ ๋ฐฉ๋ฒ•์œผ๋กœ ํ™•์žฅํ•˜์˜€๋‹ค. ๊ธฐ์กด์˜ ์ˆœ์ˆ˜ํ•œ ๋ ˆ์ด์–ด ์œตํ•ฉ ๋ฐฉ๋ฒ•์˜ ๊ฒฝ์šฐ์—๋Š” ์ค‘๋ณต ๊ณ„์‚ฐํ•˜๋Š” ์˜ค๋ฒ„ํ—ค๋“œ์™€ ์ถ”๊ฐ€์ ์ธ ํ•„ํ„ฐ ์›จ์ดํŠธ ๋กœ๋“œ๊ฐ€ ์ƒ๊ธด๋‹ค. ๋”ฐ๋ผ์„œ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ๊ธฐ์กด์˜ ๋ ˆ์ด์–ด ๋ณ„๋กœ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐฉ๋ฒ•๊ณผ ์ˆœ์ˆ˜ํ•œ ๋ ˆ์ด์–ด ์œตํ•ฉ ๋ฐฉ๋ฒ• ์‚ฌ์ด์˜ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ๋ ˆ์ด์–ด ์œตํ•ฉ ๋ฐฉ๋ฒ•์„ ๊ณ ์•ˆํ•˜์˜€๋‹ค. ๋‘ ์˜จ ์นฉ ๋ฉ”๋ชจ๋ฆฌ ๋ฑ…ํฌ ๊ด€๋ฆฌ ๊ธฐ๋ฒ•์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜์ด๋ธŒ๋ฆฌ๋“œ ๋ ˆ์ด์–ด ์œตํ•ฉ ๋ฐฉ๋ฒ•์ด ๊ธฐ์กด์˜ ๋ ˆ์ด์–ด ๋ณ„ ์ฒ˜๋ฆฌํ•˜๋Š” ๊ธฐ๋ฒ•๊ณผ ์ˆœ์ˆ˜ํ•œ ๋ ˆ์ด์–ด ์œตํ•ฉ ๋ฐฉ๋ฒ•๋ณด๋‹ค ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ž„์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.Executing deep learning algorithms on mobile embedded devices is challenging because embedded devices usually have tight constraints on the computational power, memory size, and energy consumption, while the resource requirements of deep learning algorithms achieving high accuracy continue to increase. To cope with increasing computation complexity, it is common to use an energy-efficient accelerator, such as a mobile GPU or digital signal processor (DSP) array, or to develop a customized neural processor chip called neural processing unit (NPU). In the application domain, many optimization techniques have been proposed to change the application algorithm in order to reduce the computational amount and memory usage by developing new deep learning networks or software optimization techniques that take advantage of the statistical nature of deep learning algorithms. Another approach is hardware-ware software optimization, which finds the performance bottleneck first and then distributes the workload evenly by scheduling the workloads. This dissertation covers hardware-aware software optimization, which is based on a hardware processor or platform. First, we devise a systematic optimization methodology through the experience of participating in the Low Power Image Recognition Challenge (LPIRC) and build a deep learning framework called C-GOOD (C-code Generation Framework for Optimized On-device Deep Learning) based on the devised methodology. 
    For hardware independence, C-GOOD generates C code that can be compiled for and run on almost any embedded device. C-GOOD also provides various options for applying application-domain optimizations according to the devised methodology, along with facilities for measuring system performance. By applying the methodology to three hardware platforms, the NVIDIA Jetson TX2, Odroid XU4, and Samsung Reconfigurable Processor (SRP), we demonstrate that it is independent of the hardware platform and that application-domain optimizations can be performed easily with C-GOOD. Recently, embedded devices have come to be equipped with heterogeneous processing elements (PEs), and the need to run multiple deep learning applications concurrently on a single embedded system, such as a self-driving car or a smartphone, is growing at the same time. For such systems, we devise an end-to-end methodology for scheduling deep learning applications onto heterogeneous PEs and implement a scheduling framework according to it. The methodology covers everything from profiling on real embedded devices to verifying the schedule results on those devices. We use a genetic algorithm (GA)-based scheduling technique to map deep learning applications onto heterogeneous PEs, and we account for several practical issues in the profiling step, such as DVFS and CPU hot-plugging. Furthermore, we schedule multiple applications with different throughput constraints, considering the schedulability of the mapped tasks on each processor. After implementing a deep learning inference engine that can utilize heterogeneous PEs using a low-level library of the ARM Compute Library (ACL), we verify the devised methodology by running two widely used convolutional neural networks (CNNs) on a Galaxy S9 smartphone and a HiKey 970 board. While the previous optimization methods focus on computation and processing elements, the performance bottleneck of deep learning accelerators is the communication between off-chip and on-chip memory. Moreover, the off-chip DRAM access volume has a significant effect on the energy consumption of an NPU. To reduce the impact of off-chip DRAM access on the performance and energy of an NPU, we devise compiler techniques for an NPU to manage multi-bank on-chip memory with two different objectives: one is to minimize the off-chip memory access volume, and the other is to minimize the processing delay caused by unhidden DRAM accesses. The main idea is that, by organizing on-chip memory into multiple banks, we may hide the off-chip DRAM access delay by prefetching data into unused banks during computation, and reduce the off-chip DRAM access volume by keeping the output feature map of each layer in on-chip memory. By running CNN benchmarks on a cycle-level NPU simulator, we demonstrate the trade-off between the two objectives. The devised multi-bank on-chip memory management (MOMM) techniques are then extended to consider layer fusion, which aims to maximally reuse feature maps between layers. Since pure layer fusion incurs extra computation overhead and increases DRAM accesses for filter weights, a hybrid fusion technique is presented that lies between per-layer processing and pure layer fusion, based on the devised MOMM techniques with the two objectives.
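    The DRAM-hiding idea can be illustrated with a toy cost model; a sketch (all cycle counts invented) of how prefetching into a second on-chip bank removes load latency from the critical path:

        # Toy model of the hiding objective: with two on-chip banks, the input
        # load of layer i+1 can overlap the computation of layer i.
        compute = [10, 8, 12, 6]   # per-layer compute time (cycles, invented)
        load    = [5, 9, 4, 7]     # per-layer input-load time from off-chip DRAM

        # One bank: every DRAM load sits on the critical path.
        single = sum(compute) + sum(load)

        # Two banks: prefetch load[i+1] during compute[i]; only the excess shows.
        double = load[0] + compute[-1]
        for i in range(len(compute) - 1):
            double += max(compute[i], load[i + 1])

        print("one bank :", single, "cycles")   # 61
        print("two banks:", double, "cycles")   # 41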
    Experimental results confirm the superiority of the hybrid fusion technique over both the per-layer processing technique and the pure layer fusion technique.
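    The GA-based scheduling step described above also lends itself to a compact illustration; a toy sketch (PE names, per-layer costs, and GA parameters all invented) that maps a chain of layers onto heterogeneous PEs to minimize end-to-end latency:

        import random
        random.seed(0)

        COST = {  # per-layer execution time (ms) on each PE; numbers invented
            "CPU": [9, 14, 12, 8, 5],
            "GPU": [4,  6,  5, 4, 3],
            "NPU": [2,  3,  3, 9, 9],
        }
        PES, N, COMM = list(COST), 5, 2.0   # COMM: penalty when adjacent layers switch PEs

        def latency(assign):   # layers of a chain-structured CNN run in sequence
            t = sum(COST[pe][i] for i, pe in enumerate(assign))
            return t + COMM * sum(a != b for a, b in zip(assign, assign[1:]))

        pop = [[random.choice(PES) for _ in range(N)] for _ in range(30)]
        for _ in range(200):                        # generations
            pop.sort(key=latency)
            parents = pop[:10]                      # elitist truncation selection
            children = []
            while len(children) < 20:
                a, b = random.sample(parents, 2)
                cut = random.randrange(1, N)        # one-point crossover
                child = a[:cut] + b[cut:]
                if random.random() < 0.2:           # mutation: reassign one layer
                    child[random.randrange(N)] = random.choice(PES)
                children.append(child)
            pop = parents + children

        best = min(pop, key=latency)
        print("best mapping:", best, "->", latency(best), "ms")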

    Robust event-based non-intrusive appliance recognition using multi-scale wavelet packet tree and ensemble bagging tree

    Providing the user with appliance-level consumption data is at the core of every energy-efficiency system. To that end, non-intrusive load monitoring (NILM) is employed to extract appliance-specific consumption data at low cost, without installing separate submeters for each electrical device. In this context, we propose a novel non-intrusive appliance recognition system based on (i) detecting events in the aggregated power signal using a novel and powerful scheme, (ii) applying a multi-scale wavelet packet tree to collect comprehensive energy consumption features, and (iii) adopting an ensemble bagging tree classifier, whose performance we compare with various machine learning schemes. To validate the proposed model, an empirical investigation is conducted on two real, public energy consumption datasets, GREEND and REDD, in which consumption readings are collected at low frequencies. In addition, a comprehensive review of recent non-intrusive load monitoring approaches is presented, describing their characteristics, performance, and limitations. The proposed system achieves high appliance recognition performance in terms of accuracy and F1 score, with low time complexity, when applied to different households from the GREEND and REDD repositories, each of which includes various domestic appliances. For example, average accuracies of 97.01% and 96.36% are reached on the GREEND and REDD datasets, respectively, outperforming almost all existing solutions considered in this study.
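    A minimal sketch of the described feature/classifier pipeline, assuming PyWavelets and scikit-learn are available; the event detector and GREEND/REDD loading are omitted, and the data below is synthetic:

        import numpy as np
        import pywt
        from sklearn.ensemble import BaggingClassifier
        from sklearn.tree import DecisionTreeClassifier

        def wpt_features(window, wavelet="db4", level=3):
            # Energy of each terminal node of the wavelet packet tree as one feature.
            wp = pywt.WaveletPacket(window, wavelet=wavelet, maxlevel=level)
            return [float(np.sum(node.data ** 2)) for node in wp.get_level(level, "natural")]

        rng = np.random.default_rng(1)
        # Synthetic post-event windows for two "appliances" with different signatures.
        X = np.array([wpt_features(rng.normal(50.0 * cls, 1.0 + 5.0 * cls, 128))
                      for cls in (0, 1) for _ in range(100)])
        y = np.repeat([0, 1], 100)

        clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
        clf.fit(X, y)
        print("training accuracy:", clf.score(X, y))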

    Optimization of deep learning algorithms for an autonomous RC vehicle

    Dissertaรงรฃo de mestrado em Engenharia InformรกticaThis dissertation aims to evaluate and improve the performance of deep learning (DL) algorithms to autonomously drive a vehicle, using a Remo Car (an RC vehicle) as testbed. The RC vehicle was built with a 1:10 scaled remote controlled car and fitted with an embedded system and a video camera to capture and process real-time image data. Two different embedded systems were comparatively evaluated: an homogeneous system, a Raspberry Pi 4, and an heterogeneous system, a NVidia Jetson Nano. The Raspberry Pi 4 with an advanced 4-core ARM device supports multiprocessing, while the Jetson Nano, also with a 4-core ARM device, has an integrated accelerator, a 128 CUDA-core NVidia GPU. The captured video is processed with convolutional neural networks (CNNs), which interpret image data of the vehicleโ€™s surroundings and predict critical data, such as lane view and steering angle, to provide mechanisms to drive on its own, following a predefined path. To improve the driving performance of the RC vehicle, this work analysed the programmed DL algorithms, namely different computer vision approaches for object detection and image classification, aiming to explore DL techniques and improve their performance at the inference phase. The work also analysed the computational efficiency of the control software, while running intense and complex deep learning tasks in the embedded devices, and fully explored the advanced characteristics and instructions provided by the two embedded systems in the vehicle. Different machine learning (ML) libraries and frameworks were analysed and evaluated: TensorFlow, TensorFlow Lite, Arm NN, PyArmNN and TensorRT. They play a key role to deploy the relevant algorithms and to fully engage the hardware capabilities. The original algorithm was successfully optimized and both embedded systems could perfectly handle this workload. To understand the computational limits of both devices, an additional and heavy DL algorithm was developed that aimed to detect traffic signs. The homogeneous system, the Raspberry Pi 4, could not deliver feasible low-latency values, hence the detection of traffic signs was not possible in real-time. However, a great performance improvement was achieved using the heterogeneous system, Jetson Nano, enabling their CUDA-cores to process the additional workload.Esta dissertaรงรฃo tem como objetivo avaliar e melhorar o desempenho de algoritmos de deep learning (DL) orientados ร  conduรงรฃo autรณnoma de veรญculos, usando um carro controlado remotamente como ambiente de teste. O carro foi construรญdo usando um modelo de um veรญculo de controlo remoto de escala 1:10, onde foi colocado um sistema embebido e uma cรขmera de vรญdeo para capturar e processar imagem em tempo real. Dois sistemas embebidos foram comparativamente avaliados: um sistema homogรฉneo, um Raspberry Pi 4, e um sistema heterogรฉneo, uma NVidia Jetson Nano. O Raspberry Pi 4 possui um processador ARM com 4 nรบcleos, suportando multiprocessamento. A Jetson Nano, tambรฉm com um processador ARM de 4 nรบcleos, possui uma unidade adicional de processamento com 128 nรบcleos do tipo CUDA-core. O vรญdeo capturado e processado usando redes neuronais convolucionais (CNN), interpretando o meio envolvente do veรญculo e prevendo dados cruciais, como a visibilidade da linha da estrada e o angulo de direรงรฃo, de forma a que o veรญculo consiga conduzir de forma autรณnoma num determinado ambiente. 
De forma a melhorar o desempenho da conduรงรฃo autรณnoma do veรญculo, diferentes algoritmos de deep learning foram analisados, nomeadamente diferentes abordagens de visรฃo por computador para detecรงรฃo e classificaรงรฃo de imagens, com o objetivo de explorar tรฉcnicas de CNN e melhorar o seu desempenho na fase de inferรชncia. A dissertaรงรฃo tambรฉm analisou a eficiรชncia computacional do software usado para a execuรงรฃo de tarefas de aprendizagem profunda intensas e complexas nos dispositivos embebidos, e explorou completamente as caracterรญsticas avanรงadas e as instruรงรตes fornecidas pelos dois sistemas embebidos no veรญculo. Diferentes bibliotecas e frameworks de machine learning foram analisadas e avaliadas: TensorFlow, TensorFlow Lite, Arm NN, PyArmNN e TensorRT. Estes desempenham um papel fulcral no provisionamento dos algoritmos de deep learning para tirar mรกximo partido das capacidades do hardware usado. O algoritmo original foi otimizado com sucesso e ambos os sistemas embebidos conseguiram executar os algoritmos com pouco esforรงo. Assim, para entender os limites computacionais de ambos os dispositivos, um algoritmo adicional mais complexo de deep learning foi desenvolvido com o objetivo de detectar sinais de transito. O sistema homogรฉneo, o Raspberry Pi 4, nรฃo conseguiu entregar valores viรกveis de baixa latรชncia, portanto, a detecรงรฃo de sinais de trรขnsito nรฃo foi possรญvel em tempo real, usando este sistema. No entanto, foi alcanรงada uma grande melhoria de desempenho usando o sistema heterogeneo, Jetson Nano, que usaram os seus nรบcleos CUDA adicionais para processar a carga computacional mais intensa
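    As an illustration of the deployment path that TensorFlow Lite enables on both boards, a minimal inference sketch; the model file name, input resolution, and normalization are hypothetical, not the dissertation's artifacts:

        import numpy as np
        import tensorflow as tf

        # Hypothetical converted steering model; any .tflite regression model fits.
        interpreter = tf.lite.Interpreter(model_path="steering_model.tflite")
        interpreter.allocate_tensors()
        inp = interpreter.get_input_details()[0]
        out = interpreter.get_output_details()[0]

        def predict_steering(frame):
            # frame: HxWx3 uint8 camera image; normalize and add a batch dimension.
            x = frame.astype(np.float32)[np.newaxis] / 255.0
            interpreter.set_tensor(inp["index"], x)
            interpreter.invoke()
            return float(interpreter.get_tensor(out["index"])[0])

        frame = np.zeros((66, 200, 3), dtype=np.uint8)   # placeholder camera frame
        print("steering angle:", predict_steering(frame))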

    A 64mW DNN-based Visual Navigation Engine for Autonomous Nano-Drones

    Fully autonomous miniaturized robots (e.g., drones) with artificial intelligence (AI)-based visual navigation capabilities are extremely challenging drivers of Internet-of-Things edge intelligence. Visual navigation based on AI approaches, such as deep neural networks (DNNs), is becoming pervasive for standard-size drones, but is considered out of reach for nano-drones with a size of a few cm². In this work, we present the first (to the best of our knowledge) demonstration of a navigation engine for autonomous nano-drones capable of closed-loop, end-to-end, DNN-based visual navigation. To achieve this goal, we developed a complete methodology for parallel execution of complex DNNs directly on board resource-constrained, milliwatt-scale nodes. Our system is based on GAP8, a novel parallel ultra-low-power computing platform, and a 27 g commercial, open-source CrazyFlie 2.0 nano-quadrotor. As part of our general methodology, we discuss the software mapping techniques that enable the state-of-the-art deep convolutional neural network presented in [1] to be fully executed on board within a strict 6 fps real-time constraint, with no compromise in terms of flight results, while all processing is done with only 64 mW on average. Our navigation engine is flexible and can span a wide performance range: at its peak performance corner, it achieves 18 fps while still consuming on average just 3.5% of the power envelope of the deployed nano-aircraft. Accepted for publication in the IEEE Internet of Things Journal.
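    The reported operating point implies a per-frame energy budget, easy to check from the abstract's own numbers:

        avg_power_w, fps = 0.064, 6           # 64 mW average at 6 fps (from the abstract)
        print("energy per frame: %.1f mJ" % (avg_power_w / fps * 1e3))   # ~10.7 mJ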

    Developing Real-Time GPU-Sharing Platforms for Artificial-Intelligence Applications

    In modern autonomous systems such as self-driving cars, sustained safe operation requires running complex software at rates possible only with the help of specialized computational accelerators. Graphics processing units (GPUs) remain a foremost example of such accelerators, due to their relative ease of use and the proficiency with which they can accelerate the neural-network computations underlying modern computer-vision and artificial-intelligence algorithms. This means that ensuring GPU processing completes in a timely manner is essential---but doing so is not necessarily simple, especially when a single GPU is concurrently shared by many applications. Existing real-time research includes several techniques for improving the timing characteristics of shared-GPU workloads, each with varying tradeoffs and practical limitations. In the world of timing correctness, however, one problem stands above all others: the lack of detailed information about how GPU hardware and software behave. GPU manufacturers are usually willing to publish documentation sufficient for producing logically correct software, or guidance on tuning software to achieve "real-fast," high-throughput performance, but the same manufacturers neglect to provide the details needed to establish temporal predictability. Techniques for improving the reliability of GPU software's temporal performance are only as good as the information upon which they are based, incentivising researchers to spend inordinate amounts of time learning foundational facts about existing hardware---facts that chip manufacturers must know but are not willing to publish. This is both a continual inconvenience in established GPU research and a high barrier to entry for newcomers. This dissertation addresses the "information problem" hindering real-time GPU research in several ways. First, it seeks to push back against the monoculture that has arisen with respect to platform choice: virtually all prior real-time GPU research is developed for and evaluated using GPUs manufactured by NVIDIA, but this dissertation provides details about an alternative platform, AMD GPUs. Second, this dissertation works towards establishing a model with which GPU performance can be predicted or controlled. To this end, it uses a series of experiments to discern the policy that governs the queuing behavior of concurrent GPU-sharing processes, on both NVIDIA and AMD GPUs. Finally, this dissertation addresses the novel problems for safety-critical systems caused by the changing landscape of the applications that run on GPUs. In particular, the advent of neural-network-based artificial intelligence has catapulted GPU usage into safety-critical domains that are not prepared for the complexity of the new software or for the fact that it cannot guarantee logical correctness. The lack of logical guarantees is unlikely to be "solved" in the near future, motivating a focus on increased throughput: higher throughput increases the probability of producing a correct result within a fixed amount of time, but GPU-management efforts typically focus on worst-case performance, often at the expense of throughput. This dissertation's final chapter therefore evaluates existing GPU-management techniques' efficacy at managing neural-network applications, from both a throughput and a worst-case perspective.
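    Queue-policy experiments of the kind described are black-box timing probes; a minimal sketch using PyTorch CUDA streams and events, for illustration only (the dissertation's own tooling is lower-level and covers AMD as well):

        import torch
        assert torch.cuda.is_available()

        a = torch.randn(4096, 4096, device="cuda")
        streams = [torch.cuda.Stream() for _ in range(2)]
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

        torch.cuda.synchronize()
        start.record()
        for s in streams:                        # issue identical kernel sequences
            with torch.cuda.stream(s):
                for _ in range(10):
                    a @ a                        # large matmul kernels
        for s in streams:
            torch.cuda.current_stream().wait_stream(s)   # order the end event after both
        end.record()
        torch.cuda.synchronize()

        # Roughly one stream's serial time => the queues ran concurrently;
        # roughly double => the GPU serialized the two queues.
        print("elapsed: %.1f ms" % start.elapsed_time(end))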