Over the past few years machine learning has seen a renewed explosion of interest, following a number of studies showing the effectiveness of neural networks in a range of tasks which had previously been considered incredibly hard. Neural networks' effectiveness in the fields of image recognition and natural language processing stems primarily from the vast amounts of data available to companies and researchers, coupled with the huge amounts of compute power available in modern accelerators such as GPUs, FPGAs and ASICs. There are a number of approaches available to developers for utilizing GPGPU technologies, such as SYCL, OpenCL and CUDA; however, many applications require the same low-level mathematical routines. Libraries dedicated to accelerating these common routines allow developers to make full use of the available hardware without requiring low-level knowledge of the hardware themselves. However, such libraries are often provided by hardware manufacturers only for specific hardware, such as cuDNN [9] for Nvidia hardware or MIOpen [5] for AMD hardware.
INTRODUCTION
Deep neural networks (DNNs) have been widely studied in the past few years, as they have repeatedly proved effective at solving hard computational problems when trained on sufficiently large data sets.
The resurgence in the study of DNNs primarily started in 2012 when AlexNet [11], a convolutional neural network (CNN), beat all previous entries to the ImageNet competition [13] with an error rate of 15.3% on the Top-5 task, compared to that year's next best of 26.2%. Since then the error rates shown in the competition results have plummeted, with all entries using DNNs.
The effectiveness of these neural networks has been achieved through both the large datasets now available and the vast amounts of compute provided by GPUs and other hardware accelerators. While CPUs offer a few highly programmable and fast cores, GPUs provide many more cores which are individually more limited. This restricts the tasks which can be effectively accelerated on GPU hardware, but many of the numeric tasks in DNNs fit this many-core, highly parallel model well. These common tasks include routines such as matrix multiplies and convolutions, which typically make up the majority of the runtime of a neural network. As such, any performance improvement obtained in these routines translates into a substantial improvement in the performance of the whole DNN.
Massively parallel hardware is well suited to these tasks, yet requires a very different application model and low-level knowledge of the hardware to achieve good performance. This can be a barrier to obtaining good performance in machine learning models, so many hardware vendors provide libraries targeting their platforms to accelerate these standard routines, such as Nvidia's cuDNN [9] and AMD's MIOpen [5]. These libraries are typically tuned for the specific hardware they run on and are implemented in low-level assembly or some intermediate representation to get optimal performance.
THE SYCL PROGRAMMING MODEL
SYCL [6] is a royalty-free open standard maintained by the Khronos Group which provides a high-level abstraction of GPGPU concepts based on OpenCL [15]. Using SYCL a developer can write standard C++ code to be run on accelerators supporting OpenCL, and so use the functionality provided by modern C++, including templates, inheritance, metaprogramming and the standard library, while at the same time benefiting from the capabilities of the underlying hardware as exposed through OpenCL.
The design of SYCL is built around a single-source programming model in completely standard C++, in contrast to other similar programming models which typically require additional keywords or impose restrictions on the language. By building on top of standard C++, the SYCL standard inherits all the improvements made recently to the language along with support from many standard tools.
Underneath, SYCL interacts with OpenCL devices and so hardware manufacturers do not need to provide any further implementation to allow developers to use their devices, provided a compatible OpenCL driver is available.
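As an illustration of this single-source model, the following is a minimal sketch of a complete SYCL program, written against the SYCL 1.2.1 buffer and accessor API, which adds two vectors on whatever OpenCL device the default selector chooses. The kernel is the lambda passed to parallel_for and lives in the same C++ source file as the host code.

    #include <CL/sycl.hpp>
    #include <vector>

    int main() {
      constexpr size_t N = 1024;
      std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);

      cl::sycl::queue queue;  // default selector picks an available device
      {
        // Buffers manage data movement between the host and the device.
        cl::sycl::buffer<float, 1> buf_a(a.data(), cl::sycl::range<1>(N));
        cl::sycl::buffer<float, 1> buf_b(b.data(), cl::sycl::range<1>(N));
        cl::sycl::buffer<float, 1> buf_c(c.data(), cl::sycl::range<1>(N));

        queue.submit([&](cl::sycl::handler& cgh) {
          auto in_a = buf_a.get_access<cl::sycl::access::mode::read>(cgh);
          auto in_b = buf_b.get_access<cl::sycl::access::mode::read>(cgh);
          auto out = buf_c.get_access<cl::sycl::access::mode::write>(cgh);
          // The kernel: one work-item per element of the output vector.
          cgh.parallel_for<class vector_add>(
              cl::sycl::range<1>(N),
              [=](cl::sycl::id<1> idx) { out[idx] = in_a[idx] + in_b[idx]; });
        });
      }  // buffer destruction copies the result back into the host vector c
      return 0;
    }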
SYCL-DNN
SYCL-DNN [7] is an open-source library developed by Codeplay Software to accelerate machine learning applications on OpenCL hardware using the SYCL programming model. The library has been specifically tuned for certain devices, but is usable on any device supported by the user's choice of SYCL implementation.
The library provides a high-level interface allowing users to run neural network primitive routines accelerated on OpenCL hardware. Internally it contains a number of highly parameterized SYCL kernels implementing the computations required for each routine, and for some routines it offers several different algorithms which all produce the same numeric results. In this way the library can adapt to different hardware characteristics, either by choosing a different algorithm or by choosing different parameters for a given algorithm, to maximize performance on the hardware.
Such customization is currently provided through manual tuning for a target device; however, an automated, intelligent approach is planned for the future.
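To illustrate what such a manual tuning decision might look like, the sketch below chooses an algorithm and tile shape from the convolution parameters and the device name reported through SYCL. The Algorithm enum, Config struct and select_config function are hypothetical names introduced only for this example; they are not part of the SYCL-DNN API.

    #include <CL/sycl.hpp>
    #include <string>

    // Hypothetical illustration of a manual tuning decision; these names are
    // not part of the SYCL-DNN API.
    enum class Algorithm { Direct, Tiled, Winograd };

    struct Config {
      Algorithm algorithm;
      int tile_rows;
      int tile_cols;
    };

    // Pick an algorithm and tile shape from the convolution parameters and
    // the device the SYCL queue is bound to.
    Config select_config(const cl::sycl::device& device, int window, int stride) {
      // 3x3, stride-1 convolutions can use the Winograd transform.
      if (window == 3 && stride == 1) {
        return {Algorithm::Winograd, 2, 2};
      }
      const std::string name =
          device.get_info<cl::sycl::info::device::name>();
      // Hand-tuned choice for a device the library has been measured on.
      if (name.find("Mali") != std::string::npos) {
        return {Algorithm::Tiled, 4, 4};
      }
      // Fall back to the straightforward direct computation elsewhere.
      return {Algorithm::Direct, 1, 1};
    }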
Convolutions are the main component of modern image recognition networks, providing a mechanism to detect features in images which is invariant with respect to the feature's position in the image. State-of-the-art image recognition networks such as VGG [14] and ResNet [10] are made up of many layers of convolutions, interspersed with pooling and normalization layers. The convolutions are the most compute-intensive operations in these networks and so are the primary optimization target in SYCL-DNN.
SYCL-DNN provides a number of implementations of the convolution operation, from a vectorized naive compute kernel which runs a single thread per output vector, to a tiled Winograd operation [12] which uses data transforms to convert the convolution into a number of small matrix multiplies, reducing the total number of floating point operations. Figure 1 shows the performance of the different convolution algorithms implemented in SYCL-DNN for a range of different convolutions in the ResNet-50 DNN model [10]. Different algorithms perform best for different convolution parameters, with no single algorithm performing best in all cases.
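As a point of reference for the simplest end of that spectrum, the following sketch shows a direct convolution written as a SYCL kernel with one work-item per output element. It is a minimal illustration of the naive algorithm under simplifying assumptions (a single channel, unit stride and no padding) and is not SYCL-DNN's actual kernel; the function and buffer names are invented for the example.

    #include <CL/sycl.hpp>

    // Sketch of a direct convolution: one work-item computes one output
    // element of a single-channel, unit-stride, unpadded convolution with
    // a K x K filter. Illustrative only; this is not SYCL-DNN's kernel.
    void direct_conv2d(cl::sycl::queue& queue,
                       cl::sycl::buffer<float, 2>& input,   // H x W
                       cl::sycl::buffer<float, 2>& filter,  // K x K
                       cl::sycl::buffer<float, 2>& output,  // (H-K+1) x (W-K+1)
                       size_t K) {
      queue.submit([&](cl::sycl::handler& cgh) {
        auto in = input.get_access<cl::sycl::access::mode::read>(cgh);
        auto filt = filter.get_access<cl::sycl::access::mode::read>(cgh);
        auto out = output.get_access<cl::sycl::access::mode::write>(cgh);
        cgh.parallel_for<class direct_conv>(
            output.get_range(), [=](cl::sycl::item<2> item) {
              const size_t row = item[0];
              const size_t col = item[1];
              float acc = 0.0f;
              // Accumulate the window of input values under the filter.
              for (size_t r = 0; r < K; ++r) {
                for (size_t c = 0; c < K; ++c) {
                  acc += in[cl::sycl::id<2>(row + r, col + c)] *
                         filt[cl::sycl::id<2>(r, c)];
                }
              }
              out[cl::sycl::id<2>(row, col)] = acc;
            });
      });
    }

By contrast, a Winograd implementation computes whole output tiles from transformed data; for example, the F(2×2, 3×3) variant produces a 2×2 output tile using 16 multiplications where the direct method above needs 36.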
PERFORMANCE ACROSS DEVICES
SYCL-DNN can run on a large number of devices, as it is written in SYCL and so can target any hardware with a supported OpenCL implementation. The primary target of the library has been embedded devices, though with few changes the same code can target desktop GPUs and other hardware.
The embedded devices that the library has been targeted towards are ARM's Mali G-71, a high-end mobile GPU designed for low-power devices, and Renesas' R-Car platforms such as the V3H, a system on chip designed for automated driving solutions that provides a programmable CV engine accelerator.
In addition to these, the library is tested on Intel processors, making use of both Intel's CPU OpenCL implementation and the Intel Compute Runtime that targets integrated GPUs, and on AMD GPUs, with both Fiji and Hawaii devices covered. Figures 1, 2 and 3 show the performance achieved using SYCL-DNN on a range of hardware when running the convolutional layers from ResNet-50.
The performance results for ResNet convolutions on an Intel i7-6700K processor are shown in Figure 3, where SYCL-DNN can use either the CPU or the integrated GPU. The performance is compared against MKL-DNN [4, v0.18.1], which contains highly vectorized micro-kernels that are JIT compiled to best suit particular tasks. Compared to other platforms, less work has gone into optimizing SYCL-DNN for Intel processors, which is evident when set against the very good performance that MKL-DNN achieves.
On the HiKey 960 system-on-chip (SoC), SYCL-DNN utilizes the ARM Mali G-71 GPU and generally outperforms ARM's Compute Library [1, v18.11], which runs on both the CPU using NEON and on the GPU using OpenCL, as shown in Figure 2. Three sets of convolution parameters stand out, where the Compute Library achieves over 150 gigaflops; these are 3×3 convolutions with unit stride, which are very well optimized in that library. In almost all other cases SYCL-DNN achieves better performance than the hand-tuned OpenCL kernels in the Compute Library.
CONCLUSION
Overall the performance provided by our library is good, but there are still significant improvements to be made. Further planned optimizations include better support for memory prefetching, better utilization of user-programmable caches, and alternative implementations of the compute routines which may perform better on some hardware platforms.
The SYCL-DNN team intends to extend the set of devices that the library is tuned for, and to develop automated approaches to further ease this process in the future. As these tuning decisions involve many parameters and a large number of features, they are a good candidate for a learned solution rather than a hand-tuned one.
