Accelerated Neural Networks on OpenCL Devices Using SYCL-DNN
Rod Burns, John Lawson, Duncan McBain and Daniel Soutar∗
{rod,john,duncan,daniel.soutar}@codeplay.com
Codeplay Software Ltd.
Edinburgh, UK
ABSTRACT
Over the past few years machine learning has seen a renewed
explosion of interest, following a number of studies showing the
effectiveness of neural networks in a range of tasks which had
previously been considered incredibly hard. Neural networks’
effectiveness in the fields of image recognition and natural language
processing stems primarily from the vast amounts of data available
to companies and researchers, coupled with the huge amounts of
compute power available in modern accelerators such as GPUs,
FPGAs and ASICs. There are a number of approaches available to
developers for utilizing GPGPU technologies, such as SYCL, OpenCL
and CUDA; however, many applications require the same low-level
mathematical routines. Libraries dedicated to accelerating these
common routines allow developers to make full use of the
available hardware without requiring low-level knowledge of the
hardware itself. However, such libraries are often provided by
hardware manufacturers for specific hardware, such as cuDNN [9]
for Nvidia hardware or MIOpen [5] for AMD hardware.
SYCL-DNN is a new open-source library dedicated to providing
accelerated routines for neural network operations which are
hardware and vendor agnostic. Built on top of the SYCL open
standard and written entirely in standard C++, SYCL-DNN allows a
user to easily accelerate neural network code for a wide range of
hardware using a modern C++ interface. The library is tested on
AMD’s OpenCL for GPU, Intel’s OpenCL for CPU and GPU, ARM’s
OpenCL for Mali GPUs as well as ComputeAorta’s OpenCL for
R-Car CV engine and host CPU. In this talk we will present
performance figures for SYCL-DNN on this range of hardware, and
discuss how high performance was achieved on such a varied set
of accelerators with such different hardware features.
CCS CONCEPTS
• Computing methodologies → Neural networks; Massively
parallel algorithms; Parallel programming languages; Computer
vision problems.
KEYWORDS
SYCL, OpenCL, neural networks, GPGPU, machine learning
∗Authors listed alphabetically
Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the owner/author(s).
IWOCL’19, May 13–15, 2019, Boston, MA, USA
© 2019 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-6230-6/19/05.
https://doi.org/10.1145/3318170.3318183
ACM Reference Format:
Rod Burns, John Lawson, Duncan McBain and Daniel Soutar. 2019.
Accelerated Neural Networks on OpenCL Devices Using SYCL-DNN. In
International Workshop on OpenCL (IWOCL’19), May 13–15, 2019, Boston, MA, USA.
ACM, New York, NY, USA, 4 pages. https://doi.org/10.1145/3318170.3318183
1 INTRODUCTION
Deep neural networks (DNNs) have been widely studied in the past
few years, as they have repeatedly proved effective at solving hard
computational problems when trained on sufficiently large data
sets.
The resurgence in study of DNNs primarily started in 2012 when
AlexNet [11], a convolutional neural network (CNN), beat all
previous entries to the ImageNet competition [13] with an error rate of
15.3% on the Top-5 task, compared to that year’s next best 26.2%.
Since then the error rates shown in the competition results have
plummeted, with all entries using DNNs.
The effectiveness of these neural networks has been achieved
through both the large datasets now available and the vast amounts
of compute provided by GPU hardware and other hardware
acceleration. While CPUs offer a few highly programmable and fast
cores, GPUs provide many more cores which are more limited. This
restricts the tasks which can be effectively accelerated on GPU
hardware, but for many of the numeric tasks in DNNs this many-core,
highly parallel model fits well. These common tasks include
routines such as matrix multiplies and convolutions, which typically
make up the majority of the runtime of a neural network. As such,
any performance improvements obtained in these routines give
substantial improvements to the performance of the whole DNN.
Massively parallel hardware is well suited to these tasks, yet
requires a very different application model and low-level knowledge
of the hardware to achieve good performance. This can be a barrier
to obtaining good performance in machine learning models, so
many hardware vendors provide libraries targeting their platforms
to accelerate these standard routines, such as Nvidia’s cuDNN [9]
and AMD’s MIOpen [5]. These libraries are typically tuned for
the specific hardware they run on and are implemented in low-level
assembly or some intermediate representation to get optimal
performance.
2 THE SYCL PROGRAMMING MODEL
SYCL [6] is a royalty-free open standard maintained by the Khronos
Group which provides a high-level abstraction of GPGPU concepts
based on OpenCL [15]. Using SYCL a developer can write standard
C++ code to be run on accelerators supporting OpenCL, and so
use the functionality provided by modern C++ including templates,
inheritance, metaprogramming and the standard library while at
the same time benefiting from the capabilities of the underlying
hardware as exposed through OpenCL.
Figure 1: The number of gigaflops achieved by different algorithms for a range of convolutions from ResNet-50 with a batch
size of 32 on an AMD R9 Nano. The convolution parameters are given by: window size, stride, image rows, image columns,
input features, output features. Not all algorithms are compatible with all sets of parameters.
Figure 2: The number of gigaflops achieved on the ARM HiKey 960 SoC, with SYCL-DNN running on the Mali G-71 GPU
compared to ARM’s Compute Library running on the Mali G-71 GPU using OpenCL and on the CPU using NEON. The graph
shows the convolutions in ResNet-50, run with a batch size of 1.
arXiv:1904.04174v1 [cs.LG] 8 Apr 2019
The design of SYCL is built around a single source programming
model in completely standard C++, in contrast to other similar
programming models which typically require additional keywords
and restrictions on the language. By building on top of standard C++,
the SYCL standard inherits all the improvements made recently to
the language along with support from many standard tools.
Underneath, SYCL interacts with OpenCL devices and so
hardware manufacturers do not need to provide any further
implementation to allow developers to use their devices, provided a compatible
OpenCL driver is available.
As a royalty-free open standard, SYCL can be implemented by
anyone. At the time of writing the only fully conformant SYCL
implementation is Codeplay Software’s ComputeCpp [2], which
we use in all the following benchmarks. Other implementations
available include triSYCL [8] and hipSYCL [3].
3 SYCL-DNN
SYCL-DNN [7] is an open-source library developed by Codeplay
Software to accelerate machine learning applications on OpenCL
hardware using the SYCL programming model. The library has
been specifically tuned for certain devices, but is usable on any
device supported by the user’s choice of SYCL implementation.
The library provides a high level interface allowing users to run
neural network primitive routines accelerated on OpenCL
hardware. Internally it contains a number of highly parameterized SYCL
kernels to provide the computations required for each routine, and
for some routines it provides a variety of different algorithms which
all provide the same numeric results. In this way the library can
adapt to different hardware characteristics by either choosing
different algorithms or different parameters for each algorithm to
maximize performance on the hardware.
Figure 3: The number of gigaflops achieved on the Intel i7-6700K processor, with SYCL-DNN running on the integrated GPU
and on the CPU compared to MKL-DNN. The graph shows the convolutions in ResNet-50, run with a batch size of 4.
Such customization is currently provided through manual tuning
for a target device; however, an automated, intelligent approach is
planned for the future.
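As a rough illustration of what manual tuning of this kind can look like, the sketch below hard-codes a choice of algorithm from the convolution parameters. The names `ConvParams`, `Algorithm` and `select_algorithm`, and the specific rules and thresholds, are hypothetical and are not taken from SYCL-DNN's interface or its tuning data.

```cpp
#include <cassert>

// Hypothetical description of a convolution; field names are illustrative.
struct ConvParams {
  int window;        // square filter size, e.g. 1, 3 or 7
  int stride;        // step between neighbouring filter applications
  int in_features;   // input channels
  int out_features;  // output channels
};

// Hypothetical set of algorithm choices for this sketch.
enum class Algorithm { Naive, Tiled, Winograd };

// Hand-written heuristic standing in for a per-device tuning table.
// The rules below are assumptions made for illustration only.
inline Algorithm select_algorithm(const ConvParams& p) {
  assert(p.window > 0 && p.stride > 0);
  // The Winograd variant sketched here only handles 3x3, stride-1 filters.
  if (p.window == 3 && p.stride == 1) return Algorithm::Winograd;
  // Assumed rule: tiling pays off once there are enough channels to reuse.
  if (p.in_features >= 16 && p.out_features >= 16) return Algorithm::Tiled;
  // Fall back to the simplest kernel, which supports every configuration.
  return Algorithm::Naive;
}
```

A table of this kind has to be rewritten for each device, which is why an automated approach to choosing algorithms and parameters is attractive.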
Convolutions are the main component of modern image recognition
networks, providing a mechanism to detect features in images
which is invariant with respect to the feature’s position in the
image. State-of-the-art image recognition networks such as VGG [14]
and ResNet [10] are made up of many layers of convolutions,
interspersed with pooling and normalization layers. The convolutions
are the most compute intensive operations in these networks and
so are the primary optimization target in SYCL-DNN.
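To make the operation concrete, the following is a minimal host-side sketch of a direct 2D convolution in plain C++. It is not SYCL-DNN code: the function name `conv2d_direct`, the data layouts and the absence of padding are assumptions chosen to keep the example short. The same filter weights are applied at every image position, which is what lets a feature be detected wherever it appears.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct 2D convolution for a single image, without padding.
// Assumed layouts: input is [rows][cols][in_c], filter is
// [window][window][in_c][out_c], output is [out_rows][out_cols][out_c].
std::vector<float> conv2d_direct(const std::vector<float>& input,
                                 const std::vector<float>& filter,
                                 int rows, int cols, int in_c, int out_c,
                                 int window, int stride) {
  assert(rows >= window && cols >= window && stride > 0);
  assert(input.size() == static_cast<std::size_t>(rows) * cols * in_c);
  assert(filter.size() ==
         static_cast<std::size_t>(window) * window * in_c * out_c);
  const int out_rows = (rows - window) / stride + 1;
  const int out_cols = (cols - window) / stride + 1;
  std::vector<float> output(
      static_cast<std::size_t>(out_rows) * out_cols * out_c, 0.0f);
  // Every (r, c, k) output element is independent of the others, so on a
  // parallel device each one could be computed by a separate work-item.
  for (int r = 0; r < out_rows; ++r) {
    for (int c = 0; c < out_cols; ++c) {
      for (int k = 0; k < out_c; ++k) {
        float acc = 0.0f;
        for (int i = 0; i < window; ++i) {
          for (int j = 0; j < window; ++j) {
            for (int ch = 0; ch < in_c; ++ch) {
              const int in_idx =
                  ((r * stride + i) * cols + (c * stride + j)) * in_c + ch;
              const int f_idx = ((i * window + j) * in_c + ch) * out_c + k;
              acc += input[in_idx] * filter[f_idx];
            }
          }
        }
        output[(r * out_cols + c) * out_c + k] = acc;
      }
    }
  }
  return output;
}
```

For example, a 3×3 single-channel image convolved with a 2×2 filter at stride 1 produces a 2×2 output, each element being the weighted sum of one 2×2 patch of the image.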
SYCL-DNN provides a number of implementations of a convolution,
from a vectorized naive compute kernel which runs a single
thread per output vector, to a tiled Winograd operation [12] which
uses data transforms to convert the convolution into a number of
small matrix multiplies, reducing the total number of floating point
operations.
Figure 1 shows the performance of the different convolution
algorithms implemented in SYCL-DNN for a range of different
convolutions in the ResNet-50 DNN model [10]. For different convolution
parameters different algorithms perform better than others, with
no single algorithm always performing best.
4 PERFORMANCE ACROSS DEVICES
The number of devices that can support SYCL-DNN is large, as
SYCL-DNN is written in SYCL which can run on many OpenCL
implementations. The primary target of the library has been
embedded devices, though with few changes the same code can target
desktop GPUs and other hardware.
The embedded devices that the library has been targeted towards
are ARM’s Mali G-71 GPU, a high-end mobile GPU designed
for low-power devices, and Renesas’ R-Car platforms such as the V3H,
a system on chip designed for automated driving solutions that
provides a programmable CV engine accelerator.
In addition to these, the library is tested on Intel processors,
making use of both their CPU OpenCL implementation and the
Intel Compute Runtime that targets integrated GPUs, and on AMD
GPUs with both Fiji and Hawaii devices tested.
Figures 1, 2 and 3 show the performance achieved using SYCL-DNN
on a range of hardware when running the convolutional
layers from ResNet-50.
The performance results for ResNet convolutions on an Intel
i7-6700K processor are shown in Figure 3, where SYCL-DNN can
use either the CPU or the GPU. The performance is compared
against MKL-DNN [4, v0.18.1], which contains highly vectorized
micro-kernels that are JIT compiled to best suit particular tasks.
Compared to other platforms, less work has been channeled into
optimizing for Intel processors, which is evident when compared to
the very good performance that MKL-DNN achieves.
On the HiKey 960 system-on-chip (SoC), SYCL-DNN utilizes
the ARM Mali G-71 GPU and generally outperforms ARM’s
Compute Library [1, v18.11], which runs both on the CPU using
NEON and on the GPU using OpenCL, as shown in Figure 2. Three
sets of convolution parameters stand out, where the Compute Library
achieves over 150 gigaflops; these are 3×3 convolutions with a
stride of 1, which are very well optimized in that library. In almost all
other cases SYCL-DNN achieves better performance than the
hand-tuned OpenCL kernels in the Compute Library.
5 CONCLUSION
Overall the performance provided by our library is good, but there
are still significant improvements to be made. Further optimizations
that are planned include implementing better support for memory
prefetching and better utilization of user programmable caches,
as well as providing alternative implementations of the compute
routines which may perform better on some hardware platforms.
The SYCL-DNN team intend to extend the devices that the library
is tuned for, and develop additional automated approaches to further
ease this process in the future. As these tuning decisions involve
many parameters and a large number of features they are a good
candidate for a learned solution rather than a hand tuned one.
REFERENCES
[1] [n. d.]. The ARM Computer Vision and Machine Learning library.
https://github.com/ARM-software/ComputeLibrary/. Accessed: 2019-03-12.
[2] [n. d.]. ComputeCpp, Codeplay’s implementation of the open standard SYCL.
https://developer.codeplay.com/computecppce/latest/overview. Accessed: 2019-03-12.
[3] [n. d.]. hipSYCL, Implementation of SYCL 1.2.1 over AMD HIP/NVIDIA CUDA.
https://github.com/illuhad/hipSYCL. Accessed: 2019-03-12.
[4] [n. d.]. Intel Math Kernel Library for Deep Neural Networks.
https://github.com/intel/mkl-dnn. Accessed: 2019-04-05.
[5] [n. d.]. MIOpen, AMD’s Machine Intelligence Library.
https://github.com/ROCmSoftwarePlatform/MIOpen. Accessed: 2019-03-12.
[6] [n. d.]. SYCL, C++ Single-source Heterogeneous Programming for OpenCL.
https://www.khronos.org/sycl/. Accessed: 2019-03-11.
[7] [n. d.]. The SYCL-DNN neural network acceleration library.
https://github.com/CodeplaySoftware/SYCL-DNN. Accessed: 2019-03-12.
[8] [n. d.]. triSYCL, Generic system-wide modern C++ for heterogeneous platforms
with SYCL from Khronos Group. https://github.com/triSYCL/triSYCL. Accessed:
2019-03-12.
[9] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John
Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient Primitives
for Deep Learning. CoRR abs/1410.0759 (2014). arXiv:1410.0759
http://arxiv.org/abs/1410.0759
[10] K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep Residual Learning for Image
Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR). 770–778. https://doi.org/10.1109/CVPR.2016.90
[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet
Classification with Deep Convolutional Neural Networks. Commun. ACM 60, 6 (May
2017), 84–90. https://doi.org/10.1145/3065386
[12] A. Lavin and S. Gray. 2016. Fast Algorithms for Convolutional Neural Networks.
In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
4013–4021. https://doi.org/10.1109/CVPR.2016.435
[13] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma,
Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C.
Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge.
International Journal of Computer Vision (IJCV) 115, 3 (2015), 211–252.
https://doi.org/10.1007/s11263-015-0816-y
[14] K. Simonyan and A. Zisserman. 2014. Very Deep Convolutional Networks for
Large-Scale Image Recognition. CoRR abs/1409.1556 (2014).
[15] J. E. Stone, D. Gohara, and G. Shi. 2010. OpenCL: A Parallel Programming Standard
for Heterogeneous Computing Systems. Computing in Science & Engineering 12, 3
(May 2010), 66–73. https://doi.org/10.1109/MCSE.2010.69
