
    Efficient and portable Winograd convolutions for multi-core processors

    We take a step towards developing high-performance codes for the convolution operator, based on the Winograd algorithm, that are easy to customise for general-purpose processor architectures. In our approach, the portability of the solution is enhanced by introducing vector instructions from Intel SSE/AVX2/AVX512 and ARM NEON/SVE, to exploit the single-instruction multiple-data capabilities of current processors, as well as OpenMP pragmas to exploit multi-threaded parallelism. While this comes at the cost of sacrificing a fraction of the computational performance, our experimental results on three distinct platforms, equipped with Intel Xeon Skylake, ARM Cortex-A57 and Fujitsu A64FX processors, show that the impact is affordable and still renders a Winograd-based solution that is competitive with the lowering GEMM-based convolution.
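
    The abstract above does not come with code; as a concrete reference for the minimal-filtering idea it builds on, the sketch below implements the classic 1-D Winograd transform F(2,3), which produces two outputs of a 3-tap filter with four multiplications instead of six. This is only an illustration: the function name and test values are assumptions, and the 2-D F(2x2,3x3) variant used in convolution layers nests this transform over rows and columns and adds the SIMD/OpenMP machinery the paper discusses.

        /* Minimal sketch (not the paper's code) of the 1-D Winograd minimal-filtering
         * algorithm F(2,3): two outputs of a 3-tap filter with 4 multiplications. */
        #include <stdio.h>

        static void winograd_f23(const float d[4], const float g[3], float y[2])
        {
            /* Filter transform (can be precomputed once per filter). */
            const float G0 = g[0];
            const float G1 = 0.5f * (g[0] + g[1] + g[2]);
            const float G2 = 0.5f * (g[0] - g[1] + g[2]);
            const float G3 = g[2];

            /* Input transform followed by the four element-wise products. */
            const float m0 = (d[0] - d[2]) * G0;
            const float m1 = (d[1] + d[2]) * G1;
            const float m2 = (d[2] - d[1]) * G2;
            const float m3 = (d[1] - d[3]) * G3;

            /* Output transform. */
            y[0] = m0 + m1 + m2;
            y[1] = m1 - m2 - m3;
        }

        int main(void)
        {
            /* Illustrative data; expected output matches the direct 3-tap filter:
             * y0 = 1*0.5 + 2*0.25 + 3*0.125 = 1.375, y1 = 2*0.5 + 3*0.25 + 4*0.125 = 2.25 */
            const float d[4] = {1, 2, 3, 4}, g[3] = {0.5f, 0.25f, 0.125f};
            float y[2];
            winograd_f23(d, g, y);
            printf("%g %g\n", y[0], y[1]);
            return 0;
        }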

    Efficient direct convolution using long SIMD instructions

    This paper demonstrates that state-of-the-art proposals to compute convolutions on architectures with CPUs supporting SIMD instructions deliver poor performance for long SIMD lengths due to frequent cache conflict misses. We first discuss how to adapt the state-of-the-art SIMD direct convolution to architectures using long SIMD instructions and analyze the implications of increasing the SIMD length on the algorithm formulation. Next, we propose two new algorithmic approaches: the Bounded Direct Convolution (BDC), which adapts the amount of computation exposed to mitigate cache misses, and the Multi-Block Direct Convolution (MBDC), which redefines the activation memory layout to improve the memory access pattern. We evaluate BDC, MBDC, the state-of-the-art technique, and a proprietary library on an architecture featuring CPUs with 16,384-bit SIMD registers using ResNet convolutions. Our results show that BDC and MBDC achieve respective speed-ups of 1.44× and 1.28× compared to the state-of-the-art technique for ResNet-101, and 1.83× and 1.63× compared to the proprietary library. This work receives EuroHPC-JU funding under grant no. 101034126, with support from the Horizon 2020 program. Adrià Armejach is a Serra Hunter Fellow and has been partially supported by the Grant IJCI-2017-33945 funded by MCIN/AEI/10.13039/501100011033. Marc Casas has been partially supported by the Grant RYC-2017-23269 funded by MCIN/AEI/10.13039/501100011033 and ESF Investing in your future. This work is supported by the Spanish Ministry of Science and Technology through the PID2019-107255GB project and the Generalitat de Catalunya (contract 2017-SGR-1414). Peer reviewed. Postprint (author's final draft).
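
    For readers unfamiliar with the baseline the paper starts from, the sketch below gives a reference formulation of the direct convolution as a plain loop nest over NHWC data (stride 1, no padding), with the output-channel loop innermost so it exposes a long, contiguous vectorizable dimension. It is only an illustration under those assumptions: the BDC and MBDC variants proposed in the paper additionally block the computation and reorder the activation layout, which is not reproduced here, and all names and layouts are hypothetical.

        /* Reference formulation (not the paper's BDC/MBDC) of a direct convolution
         * over NHWC data, stride 1, no padding; "out" is assumed zero-initialized.
         * All names and layouts are illustrative. */
        void conv_direct(int H, int W, int C, int K, int R, int S,
                         const float *in,   /* (H+R-1) x (W+S-1) x C */
                         const float *flt,  /* R x S x C x K         */
                         float *out)        /* H x W x K             */
        {
            for (int h = 0; h < H; ++h)
              for (int w = 0; w < W; ++w)
                for (int r = 0; r < R; ++r)
                  for (int s = 0; s < S; ++s)
                    for (int c = 0; c < C; ++c) {
                      const float x = in[((h + r) * (W + S - 1) + (w + s)) * C + c];
                      const float *f = &flt[((r * S + s) * C + c) * K];
                      float *y = &out[(h * W + w) * K];
                      for (int k = 0; k < K; ++k)  /* SIMD-friendly innermost loop */
                        y[k] += x * f[k];
                    }
        }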

    Convolution Operators for Deep Learning Inference on the Fujitsu A64FX Processor

    Paper presented at the 2022 IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), held in Bordeaux, France. The convolution operator is a crucial kernel for many computer vision and signal processing applications that rely on deep learning (DL) technologies. As such, the efficient implementation of this operator has received considerable attention in the past few years for a fair range of processor architectures. In this paper, we follow the technology trend toward integrating long SIMD (single instruction, multiple data) arithmetic units into high-performance multicore processors to analyse the benefits of this type of hardware acceleration for latency-constrained DL workloads. For this purpose, we implement and optimise, for the Fujitsu A64FX processor, three distinct methods for the calculation of the convolution, namely the lowering approach, a blocked variant of the direct convolution algorithm, and the Winograd minimal filtering algorithm. Our experimental results include an extensive evaluation of the parallel scalability of these three methods and a comparison of their global performance using three popular DL models and a representative dataset.
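
    Of the three methods listed, the lowering approach is the easiest to sketch: im2col copies each RxSxC input patch into the row of a matrix, after which the convolution reduces to a single GEMM against the filters reshaped to (R*S*C) x K. The minimal C sketch below assumes stride 1, no padding, NHWC layout and zero-initialized output buffers, with illustrative names; in practice the GEMM would be a tuned BLAS call rather than the naive loop nest included here for self-containment.

        /* Sketch of the lowering approach: im2col gathers each RxSxC input patch
         * into a matrix row; the convolution then becomes one GEMM. */
        void im2col_nhwc(int H, int W, int C, int R, int S,
                         const float *in,  /* (H+R-1) x (W+S-1) x C */
                         float *col)       /* (H*W) x (R*S*C)       */
        {
            for (int h = 0; h < H; ++h)
              for (int w = 0; w < W; ++w) {
                float *row = &col[(h * W + w) * R * S * C];
                for (int r = 0; r < R; ++r)
                  for (int s = 0; s < S; ++s)
                    for (int c = 0; c < C; ++c)
                      *row++ = in[((h + r) * (W + S - 1) + (w + s)) * C + c];
              }
        }

        /* out(H*W x K) = col(H*W x R*S*C) * filt(R*S*C x K); in practice this
         * would be a call to a tuned BLAS sgemm rather than a naive loop nest. */
        void gemm_naive(int M, int N, int K, const float *A, const float *B, float *C)
        {
            for (int i = 0; i < M; ++i)
              for (int p = 0; p < K; ++p)
                for (int j = 0; j < N; ++j)
                  C[i * N + j] += A[i * K + p] * B[p * N + j];
        }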

    High Performance Depthwise and Pointwise Convolutions on Mobile Devices

    Lightweight convolutional neural networks (e.g., MobileNets) are specifically designed to carry out inference directly on mobile devices. Among the various lightweight models, depthwise convolution (DWConv) and pointwise convolution (PWConv) are their key operations. In this paper, we observe that the existing implementations of DWConv and PWConv do not utilize the ARM processors in mobile devices well: they exhibit frequent cache misses in multi-core execution and poor data reuse at the register level. We propose techniques to re-optimize the implementations of DWConv and PWConv for the ARM architecture. Experimental results show that our implementation can achieve speedups of up to 5.5x and 2.1x against TVM (Chen et al. 2018) on DWConv and PWConv, respectively. Comment: 8 pages, Thirty-Fourth AAAI Conference on Artificial Intelligence.
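
    As a reference for the two key operations named above, the sketch below expresses DWConv and PWConv as plain C loop nests over NHWC data (stride 1, no padding, zero-initialized outputs, illustrative names): depthwise convolution filters each channel with its own RxS kernel and performs no reduction across channels, while pointwise convolution is a 1x1 convolution, i.e. a per-pixel GEMM that mixes channels. The register tiling and cache blocking that the paper applies on ARM are not reproduced here.

        /* Reference loop nests (not the paper's tuned ARM kernels).
         * Depthwise: one RxS filter per channel, no reduction across channels. */
        void dwconv(int H, int W, int C, int R, int S,
                    const float *in,   /* (H+R-1) x (W+S-1) x C */
                    const float *flt,  /* R x S x C             */
                    float *out)        /* H x W x C             */
        {
            for (int h = 0; h < H; ++h)
              for (int w = 0; w < W; ++w)
                for (int r = 0; r < R; ++r)
                  for (int s = 0; s < S; ++s)
                    for (int c = 0; c < C; ++c)
                      out[(h * W + w) * C + c] +=
                          in[((h + r) * (W + S - 1) + (w + s)) * C + c] *
                          flt[(r * S + s) * C + c];
        }

        /* Pointwise: a 1x1 convolution, i.e. a per-pixel GEMM across channels. */
        void pwconv(int H, int W, int C, int K,
                    const float *in,   /* H x W x C */
                    const float *flt,  /* C x K     */
                    float *out)        /* H x W x K */
        {
            for (int p = 0; p < H * W; ++p)
              for (int c = 0; c < C; ++c)
                for (int k = 0; k < K; ++k)
                  out[p * K + k] += in[p * C + c] * flt[c * K + k];
        }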

    Optimizing Grouped Convolutions on Edge Devices

    When deploying a deep neural network on constrained hardware, it is possible to replace the network's standard convolutions with grouped convolutions. This allows for substantial memory savings with minimal loss of accuracy. However, current implementations of grouped convolutions in modern deep learning frameworks are far from performing optimally in terms of speed. In this paper we propose Grouped Spatial Pack Convolutions (GSPC), a new implementation of grouped convolutions that outperforms existing solutions. We implement GSPC in TVM, which provides state-of-the-art performance on edge devices. We analyze a set of networks utilizing different types of grouped convolutions and evaluate their performance in terms of inference time on several edge devices. We observe that our new implementation scales well with the number of groups and provides the best inference times in all settings, improving on the existing implementations of grouped convolutions in TVM, PyTorch and TensorFlow Lite by 3.4x, 8x and 4x on average, respectively. Code is available at https://github.com/gecLAB/tvm-GSPC/ Comment: Camera-ready version to be published at ASAP 2020, the 31st IEEE International Conference on Application-specific Systems, Architectures and Processors. 8 pages, 6 figures.
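
    For context, the sketch below gives a reference formulation of a grouped convolution with G groups over NHWC data (stride 1, no padding, zero-initialized output): the input and output channels are split into G slices that are convolved independently, so G = 1 recovers a standard convolution and G = C = K a depthwise convolution. This is only an illustration with hypothetical names; it does not reproduce the GSPC data layout proposed in the paper.

        /* Reference formulation (not GSPC) of a grouped convolution with G groups:
         * channel slices are convolved independently of each other. */
        void grouped_conv(int H, int W, int C, int K, int R, int S, int G,
                          const float *in,   /* (H+R-1) x (W+S-1) x C     */
                          const float *flt,  /* G x R x S x (C/G) x (K/G) */
                          float *out)        /* H x W x K                 */
        {
            const int Cg = C / G, Kg = K / G;
            for (int g = 0; g < G; ++g)
              for (int h = 0; h < H; ++h)
                for (int w = 0; w < W; ++w)
                  for (int r = 0; r < R; ++r)
                    for (int s = 0; s < S; ++s)
                      for (int c = 0; c < Cg; ++c) {
                        const float x =
                            in[((h + r) * (W + S - 1) + (w + s)) * C + g * Cg + c];
                        const float *f = &flt[(((g * R + r) * S + s) * Cg + c) * Kg];
                        float *y = &out[(h * W + w) * K + g * Kg];
                        for (int k = 0; k < Kg; ++k)
                          y[k] += x * f[k];
                      }
        }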

    Performance–energy trade-offs of deep learning convolution algorithms on ARM processors

    In this work, we assess the performance and energy efficiency of high-performance codes for the convolution operator, based on the direct, explicit/implicit lowering and Winograd algorithms used for deep learning (DL) inference on a series of ARM-based processor architectures. Specifically, we evaluate the NVIDIA Denver2 and Carmel processors, as well as the ARM Cortex-A57 and Cortex-A78AE CPUs, as part of a recent set of NVIDIA Jetson platforms. The performance–energy evaluation is carried out using the ResNet-50 v1.5 convolutional neural network (CNN) on varying configurations of convolution algorithms, number of threads/cores, and operating frequencies on the tested processor cores. The results demonstrate that the best throughput is obtained on all platforms with the Winograd convolution operator running on all the cores at their highest frequency. However, if the goal is to reduce the energy footprint, there is no rule of thumb for the optimal configuration. Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This research was funded by Project PID2020-113656RB-C21/C22 supported by MCIN/AEI/10.13039/501100011033. Manuel F. Dolz was also supported by the Plan Gen–T grant CDEIGENT/2018/014 of the Generalitat Valenciana. Héctor Martínez is a POSTDOC_21_00025 fellow supported by Junta de Andalucía. Adrián Castelló is a FJC2019-039222-I fellow supported by MCIN/AEI/10.13039/501100011033. Antonio Maciá is a PRE2021-099284 fellow supported by MCIN/AEI/10.13039/501100011033.

    Reformulating the direct convolution for high-performance deep learning inference on ARM processors

    We present two high-performance implementations of the convolution operator via the direct algorithm that outperform the so-called lowering approach based on the im2col transform plus the gemm kernel on an ARMv8-based processor. One of our methods presents the additional advantage of zero memory overhead, while the other employs an additional yet rather moderate workspace, substantially smaller than that required by the im2col+gemm solution. In contrast with a previous implementation of a similar zero-memory-overhead direct convolution, this work exhibits the key advantage of preserving the conventional NHWC data layout for the input/output activations of the convolution layers. Funding for open access charge: CRUE-Universitat Jaume I.
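
    To make the memory-overhead contrast concrete, the short calculation below estimates the im2col workspace for a hypothetical layer; the dimensions are assumptions chosen for illustration, not figures from the paper. The lowering approach needs a buffer of H*W*R*S*C extra elements, whereas a zero-memory-overhead direct algorithm needs none.

        /* Illustration only: estimated im2col workspace for an assumed layer size. */
        #include <stdio.h>

        int main(void)
        {
            const long H = 56, W = 56, R = 3, S = 3, C = 64;   /* assumed dimensions */
            const double im2col_mib = (double)H * W * R * S * C * sizeof(float)
                                      / (1024.0 * 1024.0);
            printf("im2col workspace: %.1f MiB; zero-overhead direct: 0 MiB\n",
                   im2col_mib);
            return 0;
        }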