Efficient and portable Winograd convolutions for multi-core processors
We take a step toward developing high-performance codes for the convolution operator, based on the Winograd algorithm, that are easy to customise for general-purpose processor architectures. In our approach, the portability of the solution is augmented via the introduction of vector instructions from Intel SSE/AVX2/AVX512 and ARM NEON/SVE, to exploit the single-instruction multiple-data capabilities of current processors, as well as OpenMP pragmas to exploit multi-threaded parallelism. While this comes at the cost of sacrificing a fraction of the computational performance, our experimental results on three distinct processors (an Intel Xeon Skylake, an ARM Cortex-A57 and a Fujitsu A64FX) show that the impact is affordable and still renders a Winograd-based solution that is competitive when compared with the lowering GEMM-based convolution.
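The Winograd minimal filtering idea underlying this abstract can be illustrated with the smallest 1D instance, F(2,3), which produces two convolution outputs from four inputs using four multiplications instead of six. This is a minimal NumPy sketch of that transform (using the standard transform matrices from the Winograd-convolution literature), not the paper's vectorised implementation; the function name `winograd_f23` is my own.

```python
import numpy as np

# Winograd minimal filtering F(2,3): 2 outputs of a 1D, 3-tap convolution
# with 4 elementwise multiplications instead of 6.
G = np.array([[1.0, 0.0, 0.0],     # filter transform
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
Bt = np.array([[1.0, 0.0, -1.0, 0.0],   # input transform
               [0.0, 1.0, 1.0, 0.0],
               [0.0, -1.0, 1.0, 0.0],
               [0.0, 1.0, 0.0, -1.0]])
At = np.array([[1.0, 1.0, 1.0, 0.0],    # output transform
               [0.0, 1.0, -1.0, -1.0]])

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 convolution outputs."""
    U = G @ g            # transformed filter (4 values)
    V = Bt @ d           # transformed input (4 values)
    return At @ (U * V)  # 4 multiplications, then output transform
```

For 2D convolutions the same scheme is nested (e.g. F(2x2,3x3)), which is where the arithmetic savings the paper exploits come from; the elementwise-product stage is also the part that maps naturally onto SIMD units.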
Efficient direct convolution using long SIMD instructions
This paper demonstrates that state-of-the-art proposals to compute convolutions on architectures with CPUs supporting SIMD instructions deliver poor performance for long SIMD lengths due to frequent cache conflict misses. We first discuss how to adapt the state-of-the-art SIMD direct convolution to architectures using long SIMD instructions and analyze the implications of increasing the SIMD length on the algorithm formulation. Next, we propose two new algorithmic approaches: the Bounded Direct Convolution (BDC), which adapts the amount of computation exposed to mitigate cache misses, and the Multi-Block Direct Convolution (MBDC), which redefines the activation memory layout to improve the memory access pattern. We evaluate BDC, MBDC, the state-of-the-art technique, and a proprietary library on an architecture featuring CPUs with 16,384-bit SIMD registers using ResNet convolutions. Our results show that BDC and MBDC achieve respective speed-ups of 1.44× and 1.28× compared to the state-of-the-art technique for ResNet-101, and 1.83× and 1.63× compared to the proprietary library. This work receives EuroHPC-JU funding under grant no. 101034126, with support from the Horizon 2020 programme. Adrià Armejach is a Serra Hunter Fellow and has been partially supported by the Grant IJCI-2017-33945 funded by MCIN/AEI/10.13039/501100011033. Marc Casas has been partially supported by the Grant RYC-2017-23269 funded by MCIN/AEI/10.13039/501100011033 and ESF Investing in your future. This work is supported by the Spanish Ministry of Science and Technology through the PID2019-107255GB project and the Generalitat de Catalunya (contract 2017-SGR-1414).
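For readers unfamiliar with the baseline that BDC and MBDC optimise, this is a minimal sketch of a direct convolution loop nest over an HWC-layout activation tensor (stride 1, no padding, single image). It is not the paper's BDC/MBDC algorithm, just the naive formulation whose channel loop a SIMD implementation would vectorise; the function name and layout choices here are assumptions for illustration.

```python
import numpy as np

def direct_conv_hwc(inp, filt):
    """Naive direct convolution, stride 1, no padding.
    inp:  (H, W, C) activations in HWC layout.
    filt: (KH, KW, C, K) filters for K output channels.
    Returns a (H-KH+1, W-KW+1, K) output tensor."""
    H, W, C = inp.shape
    KH, KW, _, K = filt.shape
    out = np.zeros((H - KH + 1, W - KW + 1, K))
    for ho in range(H - KH + 1):
        for wo in range(W - KW + 1):
            for kh in range(KH):
                for kw in range(KW):
                    # accumulate over input channels into all K outputs;
                    # this innermost reduction is the SIMD-friendly axis
                    out[ho, wo, :] += inp[ho + kh, wo + kw, :] @ filt[kh, kw, :, :]
    return out
```

The cache-conflict problem the paper targets arises because, with very long SIMD registers, the strided accesses this loop nest generates map repeatedly onto the same cache sets; BDC bounds the exposed computation and MBDC changes the activation layout to avoid that.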
Convolution Operators for Deep Learning Inference on the Fujitsu A64FX Processor
Paper presented at the 2022 IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), held in Bordeaux, France. The convolution operator is a crucial kernel for many computer vision and signal processing applications that rely on deep learning (DL) technologies. As such, the efficient implementation of this operator has received considerable attention in the past few years for a fair range of processor architectures. In this paper, we follow the technology trend toward integrating long SIMD (single instruction, multiple data) arithmetic units into high-performance multicore processors to analyse the benefits of this type of hardware acceleration for latency-constrained DL workloads. For this purpose, we implement and optimise for the Fujitsu A64FX processor three distinct methods for the calculation of the convolution, namely the lowering approach, a blocked variant of the direct convolution algorithm, and the Winograd minimal filtering algorithm. Our experimental results include an extensive evaluation of the parallel scalability of these three methods and a comparison of their global performance using three popular DL models and a representative dataset.
High Performance Depthwise and Pointwise Convolutions on Mobile Devices
Lightweight convolutional neural networks (e.g., MobileNets) are specifically designed to carry out inference directly on mobile devices. In these lightweight models, depthwise convolution (DWConv) and pointwise convolution (PWConv) are the key operations. In this paper, we observe that the existing implementations of DWConv and PWConv do not fully utilize the ARM processors in mobile devices: they exhibit many cache misses under multi-core execution and poor data reuse at the register level. We propose techniques to re-optimize the implementations of DWConv and PWConv for the ARM architecture. Experimental results show that our implementation achieves speedups of up to 5.5x and 2.1x against TVM (Chen et al. 2018) on DWConv and PWConv, respectively. Comment: 8 pages, Thirty-Fourth AAAI Conference on Artificial Intelligence
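To make the DWConv/PWConv distinction concrete: a depthwise convolution applies one spatial filter per channel with no cross-channel accumulation, while a pointwise (1x1) convolution mixes channels at every pixel. This is a minimal NumPy sketch of both (stride 1, no padding, HWC layout); it illustrates the operations only, not the paper's optimised ARM kernels, and the function names are my own.

```python
import numpy as np

def depthwise_conv(inp, filt):
    """DWConv: one filter per channel, no cross-channel reduction.
    inp: (H, W, C); filt: (KH, KW, C) -> (H-KH+1, W-KW+1, C)."""
    H, W, C = inp.shape
    KH, KW, _ = filt.shape
    out = np.zeros((H - KH + 1, W - KW + 1, C))
    for ho in range(out.shape[0]):
        for wo in range(out.shape[1]):
            patch = inp[ho:ho + KH, wo:wo + KW, :]
            out[ho, wo, :] = (patch * filt).sum(axis=(0, 1))  # per-channel
    return out

def pointwise_conv(inp, filt):
    """PWConv: 1x1 convolution mixing channels.
    inp: (H, W, C); filt: (C, K) -> (H, W, K).
    Equivalent to a small GEMM at every pixel."""
    return inp @ filt
```

The register-reuse problem the paper attacks is visible here: DWConv has very little arithmetic per byte loaded (no channel reduction), so its performance is dictated almost entirely by how well the memory accesses are blocked for the cache and register file.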
Optimizing Grouped Convolutions on Edge Devices
When deploying a deep neural network on constrained hardware, it is possible to replace the network's standard convolutions with grouped convolutions. This allows for substantial memory savings with minimal loss of accuracy. However, current implementations of grouped convolutions in modern deep learning frameworks are far from performing optimally in terms of speed. In this paper we propose Grouped Spatial Pack Convolutions (GSPC), a new implementation of grouped convolutions that outperforms existing solutions. We implement GSPC in TVM, which provides state-of-the-art performance on edge devices. We analyze a set of networks utilizing different types of grouped convolutions and evaluate their performance in terms of inference time on several edge devices. We observe that our new implementation scales well with the number of groups and provides the best inference times in all settings, improving the existing implementations of grouped convolutions in TVM, PyTorch and TensorFlow Lite by 3.4x, 8x and 4x on average, respectively. Code is available at https://github.com/gecLAB/tvm-GSPC/ Comment: Camera-ready version to be published at ASAP 2020 - The 31st IEEE International Conference on Application-specific Systems, Architectures and Processors. 8 pages, 6 figures
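The memory saving the abstract mentions comes from partitioning channels: with `g` groups, each output channel only sees `C/g` input channels, so the filter tensor shrinks by a factor of `g`. This is a minimal NumPy sketch of the grouped-convolution semantics (stride 1, no padding, HWC layout), not the GSPC implementation; the function name is my own.

```python
import numpy as np

def grouped_conv(inp, filt, groups):
    """Grouped convolution: input channels are split into `groups` disjoint
    sets, each convolved only with its own slice of the filters.
    inp:  (H, W, C)
    filt: (KH, KW, C // groups, K), with the K output channels split
          evenly across groups. groups=1 recovers a standard convolution."""
    H, W, C = inp.shape
    KH, KW, Cg, K = filt.shape
    assert C % groups == 0 and K % groups == 0 and Cg == C // groups
    Kg = K // groups
    out = np.zeros((H - KH + 1, W - KW + 1, K))
    for g in range(groups):
        ic = slice(g * Cg, (g + 1) * Cg)  # this group's input channels
        oc = slice(g * Kg, (g + 1) * Kg)  # this group's output channels
        for ho in range(out.shape[0]):
            for wo in range(out.shape[1]):
                patch = inp[ho:ho + KH, wo:wo + KW, ic]        # (KH, KW, Cg)
                out[ho, wo, oc] = np.einsum('ijc,ijck->k',
                                            patch, filt[:, :, :, oc])
    return out
```

Because each group is an independent small convolution, a naive implementation launches many under-sized kernels; packing the data so all groups are computed in one well-blocked pass is the kind of restructuring GSPC performs.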
Performance–energy trade-offs of deep learning convolution algorithms on ARM processors
In this work, we assess the performance and energy efficiency of high-performance codes for the convolution operator, based on the direct, explicit/implicit lowering and Winograd algorithms used for deep learning (DL) inference on a series of ARM-based processor architectures. Specifically, we evaluate the NVIDIA Denver2 and Carmel processors, as well as the ARM Cortex-A57 and Cortex-A78AE CPUs as part of a recent set of NVIDIA Jetson platforms. The performance–energy evaluation is carried out using the ResNet-50 v1.5 convolutional neural network (CNN) on varying configurations of convolution algorithms, number of threads/cores, and operating frequencies on the tested processor cores. The results demonstrate that the best throughput is obtained on all platforms with the Winograd convolution operator running on all the cores at their highest frequency. However, if the goal is to reduce the energy footprint, there is no rule of thumb for the optimal configuration. Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This research was funded by Project PID2020-113656RB-C21/C22 supported by MCIN/AEI/10.13039/501100011033. Manuel F. Dolz was also supported by the Plan GenT grant CDEIGENT/2018/014 of the Generalitat Valenciana. Héctor Martínez is a POSTDOC_21_00025 fellow supported by Junta de Andalucía. Adrián Castelló is a FJC2019-039222-I fellow supported by MCIN/AEI/10.13039/501100011033. Antonio Maciá is a PRE2021-099284 fellow supported by MCIN/AEI/10.13039/501100011033.
Reformulating the direct convolution for high-performance deep learning inference on ARM processors
We present two high-performance implementations of the convolution operator via the direct algorithm that outperform the so-called lowering approach, based on the im2col transform plus the gemm kernel, on an ARMv8-based processor. One of our methods presents the additional advantage of zero memory overhead, while the other employs an additional yet rather moderate workspace, substantially smaller than that required by the im2col+gemm solution. In contrast with a previous implementation of a similar zero-memory-overhead direct convolution, this work exhibits the key advantage of preserving the conventional NHWC data layout for the input/output activations of the convolution layers. Funding for open access charge: CRUE-Universitat Jaume
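The workspace overhead this abstract contrasts against is easy to quantify: the im2col buffer is a (KH*KW*C) x (OH*OW) matrix per image, which for typical CNN layers dwarfs the activations themselves. This is a small illustrative calculation (the formula follows directly from the im2col shape; the function name and the example layer sizes are my own, chosen to resemble a mid-network ResNet layer).

```python
def im2col_workspace_bytes(H, W, C, KH, KW, stride=1, pad=0, dtype_bytes=4):
    """Size of the im2col workspace for one convolution layer:
    a (KH*KW*C) x (OH*OW) matrix of dtype_bytes-sized elements.
    The direct algorithms in the abstract need zero (or a far smaller)
    workspace on top of the activations."""
    OH = (H + 2 * pad - KH) // stride + 1
    OW = (W + 2 * pad - KW) // stride + 1
    return KH * KW * C * OH * OW * dtype_bytes

# Illustrative example: a 3x3, 256-channel layer on 56x56 activations
# (stride 1, pad 1, fp32) needs 3*3*256 * 56*56 * 4 bytes ~ 27.6 MiB of
# extra workspace -- a KH*KW = 9x replication of the input activations.
ws = im2col_workspace_bytes(56, 56, 256, 3, 3, stride=1, pad=1)
```

This replication factor of KH*KW is exactly why a zero-overhead direct convolution that keeps the NHWC layout is attractive on memory-constrained ARM platforms.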