5,234 research outputs found

    An automatic optimizer for heterogeneous devices

    Final accepted version of: https://doi.org/10.1016/j.future.2020.01.018. This version of the article: Fernández-Fabeiro, J., Andrade, D., Fraguela, B. B., & Doallo, R. (2020). 'An automatic optimizer for heterogeneous devices' has been accepted for publication in Future Generation Computer Systems, 106, 572–584. The Version of Record is available online at: https://doi.org/10.1016/j.future.2020.01.018.

    [Abstract]: Codes written in a naive way seldom exploit the computing resources effectively, while writing optimized codes is usually a complex task that requires a certain level of expertise. This problem is exacerbated in the presence of heterogeneous devices, which expose more tunable parameters than regular CPUs and are highly sensitive to the optimization decisions taken. Furthermore, portability is an added concern given the wide variety of accelerators available. This paper tackles the problem by adding an automatic optimizer to a library that already provides an easy and portable way to program heterogeneous devices, the Heterogeneous Programming Library (HPL). Our optimizer takes as input a simple version of a code and tunes it for the device where it will be executed by performing the most common optimizations applicable to heterogeneous devices. These optimizations are parametrized by a set of optimization parameters that must be tuned for each device, and HPL has also been equipped with an autotuner that can be used for this purpose. The effectiveness of the autotuner and the optimizer has been tested on several codes and devices. The results show that the combination of the autotuner and the optimizer makes the tested codes 16 times faster on average than the original codes written by the programmer.

    This research was supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds (80%) of the EU (TIN2016-75845-P), and by the Government of Galicia (Xunta de Galicia, Spain) co-funded by the European Regional Development Fund (ERDF) under the Consolidation Programme of Competitive Reference Groups (ED431C 2017/04), as well as under Xunta de Galicia and FEDER funds of the EU (Centro de Investigación de Galicia accreditation 2019–2022, ref. ED431G2019/01).
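    The abstract describes optimizations parametrized by device-specific parameters that an autotuner searches over. The sketch below illustrates the general idea of such an autotuner under stated assumptions: the parameter names (tile size, unroll factor, vector width), their ranges, the `benchmark_kernel` timing hook, and the exhaustive search strategy are hypothetical and do not reflect the actual HPL API.

```python
# Minimal sketch of a per-device autotuner over hypothetical optimization parameters.
import itertools
import time

# Hypothetical tunable parameters of the optimized code version (not the HPL API).
SEARCH_SPACE = {
    "tile_size": [8, 16, 32, 64],
    "unroll_factor": [1, 2, 4, 8],
    "vector_width": [1, 2, 4],
}

def benchmark_kernel(params):
    """Placeholder timing hook: build and run the kernel configured with `params`
    on the target device and return the observed runtime in seconds."""
    start = time.perf_counter()
    # ... launch the parametrized kernel on the device here ...
    return time.perf_counter() - start

def autotune(search_space):
    """Exhaustively evaluate every parameter combination and keep the fastest one."""
    names = list(search_space)
    best_params, best_time = None, float("inf")
    for values in itertools.product(*(search_space[n] for n in names)):
        params = dict(zip(names, values))
        elapsed = benchmark_kernel(params)
        if elapsed < best_time:
            best_params, best_time = params, elapsed
    return best_params, best_time

if __name__ == "__main__":
    params, seconds = autotune(SEARCH_SPACE)
    print(f"best configuration: {params} ({seconds:.6f} s)")
```

    In practice an exhaustive sweep like this only pays off for small parameter spaces; the paper's autotuner addresses the same selection problem for HPL codes.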

    Learning to infer: RL-based search for DNN primitive selection on Heterogeneous Embedded Systems

    Deep Learning is increasingly being adopted by industry for computer vision applications running on embedded devices. While the accuracy of Convolutional Neural Networks (CNNs) has reached a mature and remarkable state, inference latency and throughput remain major concerns, especially when targeting low-cost and low-power embedded platforms. CNN inference latency may become a bottleneck for the industrial adoption of Deep Learning, as it is a crucial specification for many real-time processes. Furthermore, deploying CNNs across heterogeneous platforms raises major compatibility issues due to vendor-specific technologies and acceleration libraries. In this work, we present QS-DNN, a fully automatic search based on Reinforcement Learning which, combined with an inference engine optimizer, efficiently explores the design space and empirically finds the optimal combinations of libraries and primitives to speed up CNN inference on heterogeneous embedded devices. We show that an optimized combination can achieve a 45x speedup in inference latency on a CPU compared to a dependency-free baseline, and 2x on average on a GPGPU compared to the best vendor library. Further, we demonstrate that the quality of results and time-to-solution are much better than with Random Search, achieving up to 15x better results for a short-time search.
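    The abstract describes an RL-driven search that picks, per network layer, the library/primitive with the lowest measured latency. The following is a simplified, bandit-style sketch of that idea under stated assumptions: the layer names, candidate backends, epsilon-greedy strategy, and `measure_latency` hook are illustrative and are not QS-DNN's actual implementation.

```python
# Simplified epsilon-greedy sketch of an RL-style search selecting one backend
# primitive per CNN layer, rewarded by (negative) measured latency.
import random
from collections import defaultdict

LAYERS = ["conv1", "conv2", "fc1"]                              # hypothetical layers
BACKENDS = ["vendor_lib", "im2col_gemm", "winograd", "direct"]  # hypothetical primitives

def measure_latency(layer, backend):
    """Placeholder: run `layer` with `backend` on the target device and return its
    latency in ms. A random value stands in for a real on-device measurement."""
    return random.uniform(1.0, 10.0)

def search(episodes=200, epsilon=0.2):
    q = defaultdict(float)      # running-average reward per (layer, backend)
    counts = defaultdict(int)
    for _ in range(episodes):
        for layer in LAYERS:
            # Epsilon-greedy: explore a random backend or exploit the best known one.
            if random.random() < epsilon:
                backend = random.choice(BACKENDS)
            else:
                backend = max(BACKENDS, key=lambda b: q[(layer, b)])
            reward = -measure_latency(layer, backend)
            counts[(layer, backend)] += 1
            # Incremental mean update of the action value.
            q[(layer, backend)] += (reward - q[(layer, backend)]) / counts[(layer, backend)]
    # Best backend per layer after the search.
    return {layer: max(BACKENDS, key=lambda b: q[(layer, b)]) for layer in LAYERS}

if __name__ == "__main__":
    print(search())
```

    The key property this sketch shares with the paper's approach is that backend choices are driven by empirical measurements on the target device rather than by static heuristics.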
