Over recent years heterogeneous systems have become more prevalent across HPC systems, with over 100 supercomputers in the TOP500 incorporating GPUs or other accelerators. These hardware platforms have different performance characteristics and optimization requirements. In order to make the most of multiple accelerators a developer has to provide implementations of their algorithms tuned for each device. Hardware vendors provide libraries targeting their devices specifically, which provide good performance but frequently have different API designs, hampering portability.
INTRODUCTION
The power and speed of modern GPUs coupled with their increased programmability through programming models such as SYCL [12] , OpenCL [34] and CUDA [10] have enabled many applications to achieve previously insurmountable performance. This has been one of the main driving factors behind the rise in machine learning since 2012. In the TOP500 [15] , 3 of the top 5 supercomputers in the world in terms of pure performance utilize GPUs, as do the top 4 supercomputers in the Green500 [4] , where flops per watt are considered.
Many hardware vendors provide highly specialized libraries for their platforms to extract optimal performance from their hardware. As more companies and computing centers move to support larger heterogeneous systems with more varied hardware this means users have to write applications which handle multiple different libraries for each available accelerator. This adds complexity both for developers and system maintainers.
RELATED WORKS
Highly specialized libraries-such as ARM Compute Library [2] , cuBLAS [8] , cuDNN [19] , CUTLASS [27] , MIOpen [5] , Intel MKL [35] and MKL-DNN [6] -provide optimal performance on the targeted hardware, however typically each of these libraries are restricted to the vendor's hardware.
A common way of achieving performance across various platforms is to specialize certain routines in a library or framework targeting a particular architecture. There are various high performance vendor-specified libraries tuned for particular architectures.
For ARM devices, ARM Compute Library [2] provides a set of low-level, optimized primitives for machine-learning and linear algebra. While for NVIDIA devices, cuBLAS [8] and cuDNN [19] provide highly optimized implementations of standard linear algebra subroutines and deep neural network routines, respectively. NVIDIA also provides a higher level CUTLASS [27] which takes advantage of CUDA C++ templates meta programming to deliver a high-performance GEMM computations allowing developers to specialize their matrix multiplies inside their application.
Similarly AMD provides MIOpen [5] , a deep learning acceleration framework developed to deliver highly optimized DNN and GEMM routines for AMD GPUs. For their multi-core and manycore systems, Intel provides MKL [35] for linear algebra routines, and MKL-DNN [6] as an open source, high performance library for accelerating deep learning frameworks.
Each of these libraries are optimized specifically for their target devices, however do not provide performance portability. The limitations of these libraries highlight two main key factors in providing this: the parallel programming framework used to develop the library and the performance metrics of devices.
Parallel programming frameworks
There are many different frameworks available to provide access to hardware parallelism, with different approaches and programming styles. Most are based on C or C++, with variations on the language to allow a developer to specify how tasks can be parallelized.
The Intel Thread Building Blocks (TBB) library [31] is a C++ template library for parallel programming on multi-core processors. TBB only provides parallelism on CPUs so applications developed on top of TBB are not portable to GPU or accelerated platforms.
On the other hand, CUDA [10] is used by NVIDIA as a low level framework to deliver performance on NVIDA GPUs and accelerators, but not CPUs. Although CUDA C++ supports template metaprogramming features which is necessary for performance portability, the lack of CUDA support on non-NVIDIA architecture prevents more widespread adoption.
A cross-platform framework like OpenCL [34] can be supported by various vendors and platforms, however the low level interface of OpenCL hampers development of easily tunable kernels that enable performance portability. Although OpenCL 2.1 supports template metaprogramming within kernels, the kernel interface itself cannot be templated, which limits the use of this feature.
SYCL [12, 32] is an open standard maintained by the Khronos Group, a collaboration of companies, universities and individuals dedicated to developing open standards, known for maintaining OpenCL, Vulkan, OpenGL and many others alongside SYCL. The standard is a royalty-free, cross-platform abstraction layer that enables developers to utilize heterogeneous systems with varied hardware while writing completely standard C++. SYCL provides a highly parallel programming model designed to abstract away much of the complexity of traditional parallel programming models like OpenCL. Using SYCL instead of a lower level interface like OpenCL allows the developer to write code using modern C++ language features, even within their accelerated kernels. This includes the use of templates and specializations, as well as the metaprogramming that these enable, instead of the heavy use of the preprocessor typical in OpenCL applications to provide specializations within kernels for data types, tile sizes and other constants. By allowing SYCL kernels to fully utilize C++, a developer can instead use templates to provide these compile time constants, as well as easily create many specialized kernels with very little additional code.
Combining the portability of OpenCL with modern C++ template metaprogramming makes SYCL a strong parallel programming framework for performance portability across different platforms.
ComputeCpp [3] , developed by Codeplay Software, is currently the only fully conformant implementation of the SYCL standard, supporting devices from AMD, Intel, NVIDIA and others. Several open-source libraries are compatible with ComputeCpp, including SYCL-DNN [13] , VisionCpp [16, 20] , SYCL-PRNG [14] , as well as contributed parts of Eigen [21, 24] and TensorFlow [17, 22, 23] .
Performance Metrics
Tuning the performance of accelerated compute kernels is not a trivial task requiring expertise in both parallel algorithms and hardware architectures. Having said that, many hardware accelerators share common factors affecting performance, allowing common algorithmic techniques for improving performance across devices.
We have studied common hardware-related optimization improvements on the following devices: Intel CPUs and GPUs, ARM Mali GPUs, AMD GPUs and the Renesas V3H and V3M accelerators. These metrics are used to develop parametric convolution and matrix multiply algorithms, two of the most computationally intense operations used in neural networks.
Thread reusability.
The number of individual compute units available in hardware varies significantly, from 2 in embedded systems to 84 in desktop GPUs e.g NVIDIA Tesla series. Depending on the number of compute units available the size of workload per thread can be parametrized in order to achieve maximal device occupancy. However, there is a trade off between increasing the number of work-groups and ensuring a sufficient workload per thread. For a given problem, creating more work groups or more threads will naturally decrease the amount of computation that any one thread will have to do.
Memory transactions.
In most accelerators, when a set of threads within a work-group loads data, an entire block of data called a cache-line is fetched from memory. This is based on the assumption that threads within a work-group access contiguous elements of data specified in the fetched block. Assuming the threads within a work-group access the same address, the permutation, masking or prediction has no effect on the loaded block of data, which can lead to more optimizations in the compiler. Moreover loading a block of data will reduce the number of memory transactions, improving performance [9].
Data reusability.
Memory accesses tend to be the bottleneck in GPU programming, despite graphics memory typically being much faster than standard DDR memory. As a result of this, many optimizations to GPU kernels involve maximizing data reuse to minimize the amount of data that needs to be fetched from memory. Hardware devices provide a memory hierarchy in order to reduce the memory access bottleneck which is categorized into global, local, and private memory regions in OpenCL terms, each of which can be controlled by programmers.
Common techniques to maximize data reusability include utilizing local memory as a programmable cache to allow a work group to reuse data loaded from global memory, and through using blocks of private memory to allow single threads to reuse memory. Private memory usually maps to hardware registers and so can be accessed incredibly quickly. Provided there are sufficient registers, the more we can re-use data located in registers, the better the performance.
Local memory access speeds are usually equivalent to caches. Nowadays many hardware vendors (e.g. Intel and ARM) provide large caches in front of their global memory and so can achieve similar performance to using programmer-controlled local memory when accessing data in a coalesced manner. Therefore, some embedded devices like ARM's Mali G-71 GPU do not dedicate any memory for local memory, instead relying on the cache and global memory. For such devices using local memory can be costly and negatively affect performance.
Vectorization.
Many GPU architectures have memory controllers which are designed to load and store multiple elements at once, as typical graphics workloads involve 4-element vectors. To get the best performance from these load-store units, computations should make use of vector loads and stores. Another benefit of using vectors in a GPU kernel even if the hardware does not support vector computations is that each vector element can be computed separately, and so there is increased instruction level parallelism. Table 1 summarizes the hardware features required for abstracting performance metrics for convolution and matrix multiply algorithms.
Research Niche
In this paper, we have designed and developed parametric algorithms for matrix multiplication and convolution-two of the most computationally intense operations in neural networks-which can be tuned according to the performance metrics of various devices. Written on top of SYCL, these implementations abstract common performance metrics, enabling performance portability across different platforms.
SYCL-BLAS
SYCL-BLAS [11] is an open source implementation of netlib BLAS [7] which enables performance portability across a range of accelerators. SYCL-BLAS uses an expression tree design to deliver BLAS co-routines; most of the BLAS Level 1 and BLAS Level 2 co-routines are memory-bound operations so using such an expression tree based approach allows multiple operations to be fused into a single compute kernel with a higher computational complexity. Increasing the computational intensity of memory-bound applications can significantly increase the performance by reducing the number of accesses to the device's global memory [18] .
The general matrix multiply operation (GEMM) is the heart of BLAS routines, used in a wide range of scientific domains from physics to machine learning. It is one of the computationally expensive operation in neural networks, and so optimizing GEMM can improve a DNN's performance.
GEMM
The General Matrix-Matrix product is a BLAS operation defined as:
where A, B and C are column-major matrices, and OP a and OP b are either identity or transpose operators (or conjugate transpose for complex data types).
We refer to M and N as the number of rows and columns of C, respectively, and K as the contracting dimension of A and B. A naive parallelization approach on massively parallel architectures is to assign one value of the output C per thread to accumulate the dot product of the i-th row of OP a (A) with the j-th column of OP b (B) in a local register r , and update the result: C(i, j) = αr + βC(i, j). However, this approach is highly memory-bounded as each thread is required to load 2 × k data elements while performing roughly 2 × k floating point operations.
A standard approach used to improve performance of dense linear algebra operations on modern hardware is to reduce the memory latency by increasing data reusability through a tiling technique [25, 29] .
3.1.1 Blocked GEMM. Given two matrices A M ×K and B K ×N and partitions
(1) then the following holds:
Thus, block matrices can be multiplied in the same way as regular matrices, by replacing each scalar multiply with a matrix multiply, and each scalar addition with a matrix addition.
A special case of the above property allows us to partition matrices A ′ and B ′ into panels, and matrix C into blocks where the partitioning of C dictates the partitioning of A ′ and B ′ :
Then each thread can compute a panel in the matrix multiply:
Each panel can then itself be broken down further into tiles and we recursively build tiles of smaller matrices in different memory hierarchies. To illustrate this tile based approach, suppose we are performing an 8 × 8 floating point matrix multiply, for a device with 128 bytes of local memory and a maximum of 12 float registers available without spilling. Further suppose that 4 work-groups with 4 threads in each work-group are created. Figure 1a represents the A ′ and B ′ panels and C blocks in global memory. The tile sizes are different in the memory hierarchies due to memory size limitations. The faster the memory is, the smaller the size of the memory, and therefore the smaller the tile size. Thus multiple iterations are required to load all the required data for the computation.
Since the total number of threads is 16, each thread computes 4 output elements. At first all threads of a work-group load the first tile of the A ′ and B ′ panel (the yellow square in Figure 1a) into local memory in a coalesced way. Given the size of the local memory, each input matrix must be broken into two tiles (represented by yellow and orange color in Figure 1a ) where one tile per matrix can be loaded into local memory at once (Figure 1b) . Similarly, each tile of the A ′ and B ′ panels in local memory is broken into 7 tiles of A ′′ and B ′′ to compute the matrix multiply without register spill (Figure 1c and Figure 1d ).
Moreover, there are different ways of loading tiles from local memory to registers. Figure 1c loads the local memory tiles into 7 separate register tiles, which enables coalesced writes to global memory. In Figure 1d , although the number of tiles is reduced due to using more registers, each thread writes a block of 2 consecutive elements, thus the memory writes are not coalesced.
Choosing between the two above tile configurations at the private memory level is platform specific. While in a SIMD platform coalesced read/write have a significant effect on performance, on non-SIMD devices (e.g CPUs) block data accesses and a reduced tile count provide better performance.
Providing a parametric GEMM enables a programmer to tune the operation for different platforms by choosing different configurations without re-implementing the kernel.
3.1.2 Data reuse in a GEMM operation. Assuming a m ′ × n ′ submatrix C i j is already in registers, the computation of a single block across k ′ elements requires reading m ′ k ′ + k ′ n ′ data entries from global memory, and requires 2m ′ n ′ k ′ floating point operations.
Overall, 2Km ′ n ′ floating point operations are needed to compute C i j , and m ′ n ′ + m ′ K + n ′ K data elements are required to be loaded from global memory.
The panels A i and B j are read in blocks of size k ′ , due to memory and register limitations. Then each block of A i will be multiplied with the appropriate block of B j and accumulated into C i j . In this way C i j is stored in registers during the entire operation and only written to global memory at the end.
The amount of data reuse obtained through this tiled approach to multiplying blocks is therefore given by:
Since the goal of the blocking is to improve the amount of data reuse, the block sizes should be chosen to maximize this property.
Equation 3 shows that increasing k ′ does not increase the amount of data reuse, but does increase the amount of register/local memory/cache space required to store the matrices. For the tiles in private memory k ′ = 1 seems to be the best choice. Increasing m ′ and n ′ increases data reusability, but it also increases the amount of local memory/register/cache space that is required. It can easily be shown that, with an upper limit on the available amount of memory, When supported, using local memory will significantly improve the performance when A is transposed or B is not transposed. The other two possibilities may not require explicit copies from global to local memory, as assigning items to elements of the block in the correct order yields a perfectly coalesced read of these matrices from memory. Such temporal locality ensures that the data is already cached when a different item of the same work group requests the same data entity. However, moving the data explicitly to local memory can improve performance when the hardware cache unit is a slower than the local memory unit.
Software pre-fetching or double buffering is a well-known technique to overcome the latency of off-chip memory accesses. By doubling the size of local memory, it is possible to load the next tile while computing the current tile. This can hide the latency of the loads by computations.
SYCL-DNN
SYCL-DNN [13] is an open-source library which provides optimized routines to accelerate neural network operations. It is built using the SYCL programming model to enable cross platform support, and so works on any accelerator supported by the user's chosen SYCL implementation.
Modern image recognition networks mainly involve many convolutions and matrix multiplies, which make up the majority of the runtime of these networks. The matrix multiplies can be supplied by a BLAS implementation like SYCL-BLAS. The purpose of SYCL-DNN is to provide convolutions and other similar routines which are required for these networks.
Convolutions
Convolutions are typically the most compute intensive operation in these networks, and so they are the most optimized in the library. To optimize this operation, SYCL-DNN provide implementations of a number of algorithms which compute a 2D convolution. These different algorithms have different performance characteristics and memory requirements, and so behave differently for different tensor sizes and on different devices.
In neural networks, a 2D convolution is an operation taking a batch of 3D input tensors and a 4D filter tensor. Each output element is the sum of products of an input value and a weight given by sliding a small window over the input tensor.
More formally, let H be the number of rows in the input, W the number of columns, C the number of input channels, K the number of output channels and R × S the filter size. Then the input tensor I ∈ R H ×W ×C and the filter tensor F ∈ R R×S ×C×K would give rise to an output tensor O ∈ R H ×W ×K .
Using these layouts, the convolution can be computed naively as described in Algorithm 1. For example with a filter size of 3 × 3, an output element at row h, column w and channel k would be computed as:
where i h,w is a 3 × 3 × C slice of the input tensor centered at row h and column w.
4.1.1 Tiling 2D convolutions. For convolutions with window sizes greater than 1, there are overlaps in the regions of the input tensor which affects two spatially adjacent output elements. The naive method of computing a convolution using Algorithm 1 on a GPU would use a single thread per output element, which means every thread has to load all of the input slice that it requires from memory.
Adjacent elements in the convolution output require overlapping slices of the input tensor, so using a single thread to compute a number of these elements will reduce the number of times a single input element is loaded from memory. We refer to this as a tiled algorithm, as each thread will compute a tile of the output. As the tile size increases the number of times each input element is reused also increases and so the total number of bytes read from memory in the whole operation decreases. Memory bandwidth is typically the largest bottleneck in these GPU computations, as discussed in Section 2.2, and so reducing the number of loads typically improves performance accordingly. Similarly the use of vector loads and stores, as well as vector math operations, can improve performance on some devices.
The downside of using large tiles and vectors in GPU kernels is that each significantly increases the number of registers required to hold the data required for a single thread. As an example, Figure 2 shows the number of registers required for varying tile and vector sizes as provided by AMD's CodeXL tools [1] .
As GPUs have a limited number of registers available in their register files, the number of registers required by a single thread determines how many threads can be concurrently executed. Overhead and memory fetches are largely hidden in a GPU by quickly switching threads when one becomes stalled. If each thread requires more registers then the number of concurrent threads decreases and so the chance that one is ready to execute as soon as another is stalled is also decreased.
The best performance then does not necessarily come when using the largest tile sizes to get the most data reuse and large vector sizes to get more efficient loads and instruction level parallelism, but also the best performance is not achieved when using the fewest registers and so allowing the GPU to run the most threads concurrently.
Optimal performance is typically achieved by balancing these three factors, as shown in Figure 3 where the peak performance on the AMD R9 Nano is achieved using a 4 × 5 tile, 4 element vectors for input channels and 2 element vectors for output channels (at Figure 3: Number of teraflops achieved for different tile sizes and vector sizes for a given 3×3 convolution. Each subplot gives the performance for a given tile size as the vector sizes vary. Performance data obtained using an AMD R9 Nano GPU, with amdgpu-pro divers. On different devices, the optimal tile sizes and vector widths to get best performance will differ depending on the device's loadstore units, the number of available registers and whether vector math units are available. [28] by using a transform studied by Winograd [37] 
Winograd technique for fast convolutions. Lavin and Gray suggested a technique to accelerate convolutions in

as well as Cook and Toom. In many DNN libraries this is referred to as the Winograd technique.
The technique involves transforming the input and filter tensors into a number of small matrices, then the convolution is computed using a batched matrix multiply between these matrices, followed by another transform to yield the output values. In this way the total number of floating point operations required to compute the convolution can be decreased to as little as 30% of that required by a naive computation.
For a convolution with a R × S filter size, the input transform converts an M ×N tile of the input tensor to a (M +R −1)×(N +S −1) tile, which is then scattered across (M + R − 1)(N + S − 1) matrices with one tile element equal to an corresponding element in one of the matrices.
Increasing the tile sizes used in the transforms increases the amount of data reuse, so overall fewer memory fetches are required as discussed in Section 4.1.1. However larger tile sizes also require more registers as above, which can lead to fewer concurrent threads and possibly worse performance.
Another effect of using larger tiles sizes is that the number of intermediate matrices increases, but the size of each individual matrix decreases. The majority of the computation is the batched matrix multiply, and for smaller matrices it can be harder to fully utilize a GPU even with specialized batched kernels.
Given these considerations, there is again the problem of balancing the tile size with register usage, optimal matrix multiply sizes and vector widths in order to get the best performance. As with the tiled convolution computations, this will differ depending on the device as well as the optimizations available in the matrix multiplies. The performance portability provided by the SYCL-BLAS matrix multiplies discussed in Section 3 significantly affects the achievable performance when using this technique to compute convolutions.
PERFORMANCE
The performance of our SYCL kernels is measured against handtuned, vendor provided libraries. The primary target devices of our libraries have been embedded devices, so the main comparison has been made to ARM's ComputeLibrary [2, v18.11] . This provides accelerated routines through both NEON instructions for the CPU and OpenCL kernels for GPUs.
We also provide comparisons for Intel hardware, utilizing both the CPU and GPU, with comparisons to Intel MKL-DNN [6, v0.18 .1] and to clBLAST [30] .
Benchmark device setup
The performance benchmarks are provided on a range of hardware, to show the portability of our SYCL libraries. When benchmarking using the GPU, the big (A73) CPUs are disabled to help keep device temperatures low. For benchmarking on the CPU, the big A73 cores are enabled.
Intel i7-6700K.
The i7-6700K is a Skylake desktop processor which contains 4 CPU cores with a base frequency of 4GHz, boosting to 4.2GHz, and an Intel HD Graphics 530 integrated GPU with a base frequency of 350 MHz, boosting to 1.15GHz.
For the neural network benchmarks in Section 5.3, we benchmarked SYCL-DNN using both the CPU and GPU in the i7-6700K processor, as Intel provide an OpenCL implementation for Xeon and Core CPUs and an open source Graphics Compute Runtime for their integrated graphics processors. When benchmarking using the CPU the scaling governor was set to 'performance'.
Intel i7-9700K.
The i7-9700K is a Skylake desktop processor which contains 8 CPU cores with a base frequency of 3.6GHz, boosting to 4.9GHz, and an Intel UHD Graphics 630 integrated GPU with a base frequency of 350 MHz, boosting to 1.2GHz.
For the BLAS benchmarks in Section 5.2, we benchmarked only the GPU in the i7-9700K processor.
SYCL-BLAS GEMM benchmarks
A comparison similar to the roofline model [36] is carried out to plot performance of a kernel (Gflop/s) as a function of its operational intensity, i.e the number of floating point operations per byte of data read or written (flop/byte).
The SYCL-BLAS GEMM implementation has been compared to clBLAST's hand-tuned GEMM OpenCL kernels [30] on an Intel UHD Graphics 630 GPU and to ARM Compute Library's GEMM kernels on an ARM Mali G-71 GPU. For each platform, different configurations of SYCL-BLAS have been used, with various register tile sizes, work-group sizes, and local memory sizes, as shown in Table 2. Each configuration contains h×w_r×c where h×w represents a tile of h by w registers per thread to compute the block matrix C i j , and r×c represents a work-group consisting of r × c threads.
The number of data elements stored in local memory for the computation of a matrix multiply between matrices of size M × K and K × N , for a configuration hxw_rxc is given by:
where X is the number of data elements that fit within a cache line. The local memory size is doubled when double buffering is used. In Figure 4a shows a roofline model comparing different configurations of SYCL-BLAS GEMM implementation against the GEMM implementation from clBLAST [30] . Comparing these results shows that the 8×4_8×16_loc configuration for SYCL-BLAS achieves close to the performance of the hand-tuned clBLAST OpenCL kernels on the Intel GPU. In addition, the results show that increasing the number of registers from 4 × 4 to 8 × 4 per thread significantly improves performance, as data reuse is increased. Figure 4b shows a roofline model comparing two different SYCL-BLAS GEMM configurations which use 16 registers, but in different tile layouts. The result shows that the square register tile 4×4_8×8 yields better performance than the non-square register tile 8×2_4×16, as discussed in Section 3.1.2. Similarly, Figure 4c shows the performance improvement from using double buffering to hide memory latency. In the region marked A, shown in Figure 5b , the matrices have a small size and are typically square, so the matrix multiply has a low computational intensity and so comparatively few floating point operations. The 4×4_8×8 configuration shows the best performance for SYCl-BLAS, and is competitive with ARM Compute Library. However, in the region marked B, Figure 5c , the 8×4_4×8 configuration is better. Here the matrices are small to medium sized and typically rectangular, so the operation has a higher computational intensity but still comparatively few operations, so launching fewer threads with a higher workload works well. Figure 5d shows the region marked as C in more detail, where the matrices are large and so the operation has high computational intensity and many floating point operations. In this region the 8×4_16×8 configuration is best, as the larger matrices can be split up much better across a larger number of threads.
Neural network benchmarks
To determine the performance of SYCL-DNN compared to other neural network acceleration libraries we extracted the convolution layers from well known standard image recognition networks. The networks chosen are VGG-16 [33] and ResNet-50 [26] , introduced in 2014 and 2015 respectively, which are fairly large convolutional networks which were state-of-the-art when introduced, and are still frequently used as they provide good results and are widely available.
The VGG model uses 16 layers of 3×3 convolutions interspersed with max pooling layers, while the ResNet model is made up of a number of blocks each consisting of a 1×1 convolution, a 3×3 convolution then a 1×1 convolution. Exact sizes for these layers as used in the benchmarks are shown in Table 3 and 4.
The performance results for the HiKey 960 SoC are shown in Figures 8 and 6 for the VGG and ResNet benchmarks respectively. 
CONCLUSION
The SYCL programming model allows us to write highly parametrized compute kernels, which can be instantiated to best suit the targeted hardware and so obtain good performance across different hardware.
Overall, we have shown that our general purpose, parameterized SYCL kernels give competitive performance against hand tuned, optimized libraries such as clBLAST, ARM Compute Library and MKL-DNN. Moreover these kernels can be easily tuned to perform well on other devices, even if they have significantly different performance characteristics.
There are many improvements still to make to these libraries, including adding vector operations to SYCL-BLAS and implementations of convolution kernels in SYCL-DNN which support data prefetching and local memory, as well as providing currently unsupported operations required for neural networks. There are also plans to develop a machine learning system to tune these libraries for new devices.
