Deep learning in edge: evaluation of models and frameworks in ARM architecture

Abstract

The boom and popularization of edge devices have shaped their market through stiff competition that delivers better functionality at low energy cost. The ARM architecture dominates the huge smartphone market segment and is also present well beyond it: in drones, surveillance systems, cars, and robots. It has likewise been used successfully to develop solutions for supply chains delivering food, fuel, and other services. Until recently, ARM showed little promise for heavy computation: with its limited RISC instruction set, it was considered power efficient but weak in performance compared to the x86 architecture. Recent advances in the ARM architecture, however, have shifted that balance with the introduction of embedded GPUs that access LPDDR memory through DMA. Since this development appeared in boards such as the NVIDIA TK1, NVIDIA Jetson TX1, and NVIDIA Jetson TX2, it may finally be feasible to study and run more demanding parallel and distributed workloads directly on a RISC-based architecture. At the same time, the novelty of this technology raises a fundamental question: do these boards achieve a meaningful ratio between processing power and power consumption compared to conventional architectures, or have they already reached their limits? This work explores parallel deep learning processing on the embedded GPU of the NVIDIA Jetson TX2 to evaluate this question comprehensively. It uses 4 ARM boards, 2 deep learning frameworks, 7 CNN models, and one medium-sized dataset, combined into six board configurations, to conduct the experiments. The experiments were run under similar environments, all built from source. Altogether, they ran for a total of 4,804 hours and revealed a slight advantage for MxNet in GPU-reliant training and an overall advantage for PyTorch in total execution time and power consumption, especially for CPU-only executions. The experiments also showed that the NVIDIA Jetson TX2 already makes some complex workloads feasible directly on its SoC.