Performance Modelling of Deep Learning on Intel Many Integrated Core
  Architectures by Viebke, Andre et al.
(Preprint, HPCS, ©2019 IEEE)
Andre Viebke
Linnaeus University
Växjö, Sweden
av22cj@student.lnu.se
Sabri Pllana
Linnaeus University
Växjö, Sweden
sabri.pllana@lnu.se
Suejb Memeti
Linköping University
Linköping, Sweden
suejb.memeti@liu.se
Joanna Kolodziej
Cracow University of Technology
Cracow, Poland
jokoldziej@pk.edu.pl
Abstract—Many complex problems, such as natural language
processing or visual object detection, are solved using deep learn-
ing. However, efficient training of complex deep convolutional
neural networks for large data sets is computationally demanding
and requires parallel computing resources. In this paper, we
present two parameterized performance models for estimation
of execution time of training convolutional neural networks on
the Intel many integrated core architecture. While for the first
performance model we minimally use measurement techniques
for parameter value estimation, in the second model we estimate
more parameters based on measurements. We evaluate the pre-
diction accuracy of performance models in the context of training
three different convolutional neural network architectures on the
Intel Xeon Phi. The average deviation of the predicted from the
measured performance is about 15% for the first model and 11% for
the second model.
Index Terms—Deep Learning, Convolutional Neural Network
(CNN), Performance Modelling, Intel Many Integrated Core
(MIC) Architecture, Intel Xeon Phi
I. INTRODUCTION
Deep learning [1] is modeled as an artificial deep neural network
that uses many processing layers to learn complex functions, with
successful applications in various domains including
self-driving cars [2], object recognition [3], natural language
processing [4], speech recognition [5], language translation
[6], optimization of the Cloud [7], and parallel computing
systems [8].
Deep learning is becoming increasingly computationally demanding,
in line with the increasing volumes of available data [9] and the
growing complexity of deep neural networks.
Therefore, many-core parallel computing systems [10]–[14]
are used to accelerate the learning process of deep neural
networks [15], [16]. Many-core processors, such as NVIDIA
GPU or the Intel Xeon Phi, provide high performance and may
be used to accelerate the process of deep learning. Figure 1
depicts the performance of many-core processors compared to
the fastest supercomputers in the world in the TOP500 list
[17]. For instance, the peak performance of the Intel Xeon
Phi Knights Corner (KNC) or the Tesla K40 is similar to that of
ASCI Red, the fastest supercomputer in 1997, with
1.45 Teraflop/s peak performance.
This research has received funding from the Swedish Knowledge Founda-
tion under Grant No. 20150088
Fig. 1. Performance of many-core processors and #1 in the TOP500 list [17]
of most powerful supercomputers. The Intel Xeon Phi KNL in 2016 offered
performance similar to that of the supercomputer ASCI Red, which was #1 in
June 2000.
Related work has studied performance modeling of deep
learning on distributed systems [18], performance prediction
of asynchronous stochastic gradient descent [19], performance
modelling of various distributed deep learning frameworks
(such as Caffe-MPI or TensorFlow) [20], and analytical models for
predicting the usage of optimal resources of a GPU for deep
learning [21]. However, not much attention has been devoted to
performance modeling of deep convolutional neural networks
on the Intel many integrated core architecture.
In this paper, we describe our approach for performance
modeling of training convolutional neural networks on the
Intel Xeon Phi many core processor. We develop two param-
eterized performance models based on the theoretical analysis
of a code [22] for training convolutional neural networks that
we parallelized for the Intel Xeon Phi. Input variables of the
performance models are the number of training or validation
images, the number of test images, the number of network
instances, the number of epochs, and the number of processing
units. For the development of the first performance model,
we minimally use measurements for estimating parameter
values of the performance model; only memory contention
is estimated using measurements. For the second performance
model, we apply measurements for estimation of the sequential
work, and the forward- and back-propagation. For evaluation,
we use the MNIST [27] dataset of handwritten digits. The
average deviation of predicted from the measured performance
over all measured thread counts and various neural network
architectures is about 15% for the first model and 11% for
the second model. Major contributions of this paper include:
• development of two performance models for estimation of
execution time of training convolutional neural networks
on the Intel Xeon Phi,
• evaluation of prediction accuracy of performance models
for various execution contexts and neural network archi-
tectures,
• model-driven performance evaluation for larger number
of threads than the number of hardware threads of the
Intel Xeon Phi under study.
The rest of this paper is structured as follows. Section II
gives an overview of convolutional neural networks that are
addressed in this paper. We describe the Intel many integrated
core architecture in Section III. Section IV describes our
performance modelling approach. An empirical evaluation of
the performance models is described in Section V. We discuss
the related work in Section VI. Section VII concludes this
paper.
II. CONVOLUTIONAL NEURAL NETWORKS
An artificial deep neural network is the underlying model
used in deep learning [1]. A Convolutional Neural Network
(CNN) is a variant of a Deep Neural Network (DNN), which
introduces two additional layer types: convolutional layers
and pooling layers. The mammalian visual processing system
is hierarchical (deep) in nature. Higher level features are
abstractions of lower level ones. For instance, to understand
speech, waveforms are translated through several layers until
reaching a linguistic level. A similar analogy can be drawn for
images, where edges and corners are lower level abstractions
translated into more spatial patterns on higher levels.
The architecture of a DNN consists of multiple layers of
neurons. Neurons are connected to each other through edges
(weights). The network can simply be thought of as a weighted
graph; a directed acyclic graph represents a feed-forward
network. The depth and breadth of the network may differ, as may
the layer types. Regardless of the depth, a network has at least
one input and one output layer. A neuron has a set of incoming
weights, which have corresponding outgoing edges attached to
neurons in the previous layer. Also, a bias term is used at each
layer as an intercept term. The goal of the learning process
is to adjust the network weights and find a global minimum
by reducing the overall error, i.e. the deviation between the
predicted and the desired outcome of all the samples. The
resulting weight parameters can thereafter be used to make
predictions of unseen inputs [23].
DNNs can make predictions by forward propagating an
input through the network. Forward propagation proceeds by
performing calculations at each layer until reaching the output
layer, which contains a vector representing the prediction. For
example, in image classification problems, the output layer
(a) Small CNN: the first convolutional layer has 5 maps, 3380
neurons, uses a kernel size of 4x4, a map size of 26x26 and 85
weights.
(b) Medium CNN: the first convolutional layer has 20 maps, 13,520
neurons, uses a kernel size of 4x4, a map size of 26x26 and 340
weights.
(c) Large CNN: the last convolutional layer has 100 maps, 3,600 neurons, a
6x6 kernel, a map size of 6x6 and 216,100 weights.
Fig. 2. CNN architectures used for experimental evaluation in this study. I
stands for input, C for convolutional, M for max-pooling, F for fully connected
and O for output. The input layer has 841 neurons in a 29x29 grid. The output
layer has 10 neurons.
contains the prediction score that indicates the likelihood that
an image belongs to a category [23], [24].
The forward propagation starts from a given input layer;
then, at each layer, the activation of a neuron is computed using
the equation $y_i^l = \sigma(x_i^l) + I_i^l$, where $y_i^l$ is the
output value of neuron $i$ at layer $l$, $x_i^l$ is the input value
of the same neuron, and $\sigma$ (sigmoid) is the activation
function. $I_i^l$ is used for the input layer when there is no
previous layer. The goal of the activation function is to return a
normalized value (sigmoid returns values in [0,1], and tanh is used
in cases where the desired return values are in [-1,1]). The input
$x_i^l$ can be calculated as $x_i^l = \sum_j (w_{ji}^l y_j^{l-1})$,
where $w_{ji}^l$ denotes the weight between neuron $i$ in the
current layer $l$ and neuron $j$ in the previous layer, and
$y_j^{l-1}$ is the output of the $j$th neuron at the previous layer.
This process is repeated until reaching the output layer. At the
output layer, it is common to apply a softmax function, or similar,
to squash the output vector and hence derive the prediction.
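As an illustration of the forward pass described above, the following is a minimal Python sketch for one fully connected layer followed by a softmax. The function names, the toy weights, and the omission of the bias and input terms are assumptions made for brevity; this is not the implementation used in this study.

```python
import math

def sigmoid(x):
    """Logistic activation; maps any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward_layer(prev_outputs, weights):
    """Compute y_i = sigmoid(sum_j w[j][i] * y_prev[j]) for every neuron i.

    weights[j][i] is the weight between neuron j in the previous layer
    and neuron i in the current layer (w_ji in the text).
    """
    n_out = len(weights[0])
    outputs = []
    for i in range(n_out):
        x_i = sum(weights[j][i] * prev_outputs[j]
                  for j in range(len(prev_outputs)))
        outputs.append(sigmoid(x_i))
    return outputs

def softmax(values):
    """Squash a vector into a probability distribution."""
    m = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy layer: 2 inputs feeding 3 neurons.
inputs = [1.0, 0.0]
w = [[0.0, 1.0, -1.0],
     [0.5, 0.5, 0.5]]
hidden = forward_layer(inputs, w)
scores = softmax(hidden)
```

The largest entry of `scores` would be taken as the predicted class in a classification setting.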
Back-propagation is the process of propagating errors, i.e.
the loss calculated as the deviation between the predicted and
the desired output, backward in the network, by adjusting the
weights at each layer. The error and partial derivatives
$\delta_i^l$ are calculated at the output layer based on the
predicted values from forward propagation and the labeled value
(the correct value). At each layer, the relative error of each
neuron is calculated and the weight parameters are updated based on
how much the neuron participated in the faulty prediction.
The expression
$\partial E / \partial y_i^l = \sum_j (w_{ij}^l \, \partial E / \partial x_j^{l+1})$
denotes that the partial derivative of neuron $i$ at the current
layer $l$ is the sum of the derivatives of connected neurons at the
next layer multiplied with the weights, assuming $w^l$ denotes the
weights between the maps. Additionally, a decay is commonly used to
control the impact of the updates, which is omitted in the above
calculations. More concretely, the algorithm can be thought of
as updating the layer's weights based on "how much it was
responsible for the errors in the output" [23], [24].
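The derivative rule above can be sketched in a few lines of Python. The helper name `backprop_layer`, the plain gradient-descent update, and the learning rate are illustrative assumptions; decay is omitted, as in the text.

```python
def backprop_layer(y_cur, w, dE_dx_next, learning_rate=0.1):
    """One back-propagation step between layer l and layer l+1.

    y_cur[i]       output of neuron i in layer l
    w[i][j]        weight from neuron i (layer l) to neuron j (layer l+1)
    dE_dx_next[j]  derivative of the error w.r.t. the input of neuron j
    Returns dE/dy for layer l and the updated weights.
    """
    n_cur, n_next = len(y_cur), len(dE_dx_next)
    # dE/dy_i = sum_j w_ij * dE/dx_j, the expression given in the text.
    dE_dy = [sum(w[i][j] * dE_dx_next[j] for j in range(n_next))
             for i in range(n_cur)]
    # Assumed update rule: move each weight against its own gradient,
    # which is dE/dx_j * y_i for the weight connecting i to j.
    new_w = [[w[i][j] - learning_rate * dE_dx_next[j] * y_cur[i]
              for j in range(n_next)]
             for i in range(n_cur)]
    return dE_dy, new_w

# Hypothetical values for a 2-neuron layer feeding a 2-neuron layer.
dE_dy, new_w = backprop_layer([1.0, 0.5],
                              [[0.2, -0.4], [0.1, 0.3]],
                              [0.5, -1.0])
```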
A CNN is a multi-layer model constructed to learn various
levels of representations where higher level representations are
described based on the lower level ones [25]. It is a variant
of deep neural network that introduces two new layer types:
convolutional and pooling layers.
The convolutional layer consists of several feature maps
where neurons in each map connect to a grid of neurons in
maps in the previous layer through overlapping kernels. The
kernels are tiled to cover the whole input space. The approach
is inspired by the receptive fields of the mammalian visual cortex.
All neurons of a map extract the same features from a map
in the previous layer as they share the same set of weights.
Pooling layers are interleaved with convolutional layers and have
been shown to lead to faster convergence. Each neuron in a pooling
layer outputs the maximum (or average) value of a partition of
neurons in the previous layer, and hence only activates if the
underlying grid contains the sought feature. Besides lowering the
computational load, pooling also enables position invariance and
down-samples the input by a factor relative to the kernel size
[26].
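To make the down-sampling effect of pooling concrete, here is a small Python sketch of non-overlapping max-pooling on a single feature map. The function name and the example values are hypothetical, and non-overlapping partitions are assumed.

```python
def max_pool(feature_map, k):
    """Non-overlapping k x k max-pooling over a square 2-D list."""
    n = len(feature_map)
    usable = n - n % k  # ignore trailing rows/columns that do not fill a tile
    pooled = []
    for r in range(0, usable, k):
        row = []
        for c in range(0, usable, k):
            row.append(max(feature_map[r + dr][c + dc]
                           for dr in range(k) for dc in range(k)))
        pooled.append(row)
    return pooled

fmap = [[1, 2, 0, 0],
        [3, 4, 0, 1],
        [0, 0, 5, 6],
        [0, 2, 7, 8]]
pooled = max_pool(fmap, 2)  # [[4, 1], [2, 8]]
```

A 4x4 map is reduced to 2x2, i.e. each side shrinks by the kernel size.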
LeNet-5 is an example of a Convolutional Neural Network.
Each layer of convolution and pooling (that is, a specific
method of sub-sampling used in LeNet) comprises several
feature maps. Neurons in the feature map cover different sub-
fields of the neurons from the previous layer. All neurons in a
map share the same weight parameters, therefore they extract
the same features from different parts of the input from the
previous layers. CNNs are commonly constructed similarly
to the LeNet-5, beginning with an input layer, followed by
several convolutional/pooling combinations, ending with a
fully connected layer and an output layer [26].
In this study, the MNIST [27] dataset of handwritten digits
is used. In total, the MNIST dataset comprises 70,000 images,
60,000 of which are used for training/validation and the rest
for testing. Figure 2 depicts three different CNN architectures
that we use for evaluation: small, medium and large. There are
various CNN implementations, such as EbLearn at New York
University and Caffe at Berkeley. As a basis for our work we
selected a project developed by Cireșan [22], which targets the
MNIST dataset of handwritten digits and has the possibility to
dynamically configure the definition of layers, the activation
function, and the connection types using a configuration file.
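The first-layer figures quoted in the captions of Figure 2 can be reproduced with a short calculation. The sketch below assumes a stride of one, no padding, a single input map, and one bias weight per feature map; these assumptions are not stated in the text but are consistent with the quoted numbers.

```python
def conv_layer_stats(input_size, kernel_size, maps, input_maps=1):
    """Return (map size, neuron count, weight count) for a conv layer.

    Assumes stride 1, no padding, weights shared within a map, and one
    bias per map.
    """
    map_size = input_size - kernel_size + 1
    neurons = maps * map_size * map_size
    weights = maps * (input_maps * kernel_size * kernel_size + 1)
    return map_size, neurons, weights

# 29x29 input, 4x4 kernel, as in the small and medium CNN of Figure 2.
small = conv_layer_stats(29, 4, 5)     # (26, 3380, 85)
medium = conv_layer_stats(29, 4, 20)   # (26, 13520, 340)
```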
[Figure 3 diagram: 61 cores (Core 1 to Core 61), each with four hardware threads (T1-T4) and an L1/L2 cache, together with tag directories (TD), GDDR memory controllers (GDDR MC), and PCIe client logic on the device.]
Fig. 3. An example of the Intel Many Integrated Core Architecture: Intel
Xeon Phi.
III. INTEL MANY INTEGRATED CORE ARCHITECTURE
Figure 3 depicts an overview of the Intel Xeon Phi (co-
denamed Knights Corner) architecture, which is an example
of the Intel Many Integrated Core (MIC) Architecture. It is a
many-core shared-memory Intel Xeon Phi processor, which
runs a lightweight Linux operating system that offers the
possibility to communicate with it over ssh. The Intel Xeon
Phi processor used in our study runs a µOS of version 2.6.38.8
and a software stack MPSS version 3.1.1.
The Intel Xeon Phi used in this study is the 7120P model and
features 61 cores, each with a clock frequency of 1.2 GHz
[28]. Each core can switch between four hardware threads
in a round-robin manner, which amounts to a total of 244
threads per processor. Theoretically, the processor can deliver
up to one teraFLOP/s of double precision performance, or two
teraFLOP/s of single precision performance. Each core has
its own L1 (32KB) and L2 (512KB) cache. The L2 cache
is kept fully coherent by a global distributed tag-directory
(TD). The cores are connected through a bidirectional ring
bus interconnect, which forms a unified shared L2 cache
of 30.5MB. In addition to the cores, there are 16 memory
channels that in theory offer a maximum memory bandwidth
of 352GB/s.
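The headline figures of this section follow from simple arithmetic, sketched below in Python. The clock value is the one listed for s in Table III, and the 512-bit SIMD width is the one given in this section. The factor of two floating-point operations per lane per cycle (a fused multiply-add) is an assumption introduced here to relate the numbers to the quoted peak performance; it is not stated in the text.

```python
CORES = 61
THREADS_PER_CORE = 4
L2_PER_CORE_KB = 512
CLOCK_GHZ = 1.238   # value of s in Table III
SIMD_BITS = 512

hardware_threads = CORES * THREADS_PER_CORE      # 244
total_l2_mb = CORES * L2_PER_CORE_KB / 1024      # 30.5
dp_lanes = SIMD_BITS // 64                       # 8 double-precision lanes
sp_lanes = SIMD_BITS // 32                       # 16 single-precision lanes

# Assumption: each lane retires one fused multiply-add (2 operations)
# per cycle.
FLOPS_PER_LANE_PER_CYCLE = 2
peak_dp_gflops = CORES * CLOCK_GHZ * dp_lanes * FLOPS_PER_LANE_PER_CYCLE
peak_sp_gflops = CORES * CLOCK_GHZ * sp_lanes * FLOPS_PER_LANE_PER_CYCLE
```

Under these assumptions the estimate comes out at roughly 1.2 TFLOP/s in double precision and 2.4 TFLOP/s in single precision, in line with the "one" and "two" teraFLOP/s orders of magnitude mentioned above.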
Efficient usage of the available vector processing units of the
Intel Xeon Phi is essential to fully utilize the performance of
the processor [29]. Through the 512-bit wide SIMD registers
it can perform 16 (16 wide × 32 bit) single-precision or 8
(8 wide × 64 bit) double-precision operations per cycle. The
Xeon Phi offers two programming models:
1) offload - parts of the applications running on the host
are offloaded to the Intel Xeon Phi processor
2) native - the code is compiled specifically for running
natively on the Intel Xeon Phi processor. The code and
all the required libraries should be transferred to the
device. In this study, we use the native mode.
Fig. 4. An overview of our parallel deep learning algorithm for Intel Xeon
Phi. i is the number of training or validation images, it is the number of test
images, ns is the number of network instances, ep is the number of epochs,
p is the number of processing units. w', b', c', f', g', h' indicate the work at
various algorithmic steps.
In this study, we use OpenMP [30] for code implementation
that exploits the thread- and SIMD-parallelism available on the
Intel Xeon Phi. The Intel Compiler 15.0.0 was used for native
compilation of the application for the processor, with the
O3 optimization level.
IV. PERFORMANCE MODELLING
A performance model [31], [32] enables us to reason
about the behavior of an implementation in future execution
contexts. Our performance model can predict the performance
for numbers of threads that go beyond the number of hardware
threads supported in the Intel Xeon Phi model that we used
for evaluation. Additionally, it can predict the performance of
different CNN architectures with various numbers of images
and epochs.
The input variables of our performance model
T (i, it, ep, p, s) are: the number of training or validation
images (i), the number of test images (it), the number of
network instances (ns), the number of epochs (ep), and the
number of processing units (p).
Figure 4 depicts an overview of our parallel deep learning
algorithm for Intel Xeon Phi, using callouts to denote the
time complexity of different operations. Dashed lines denote
the critical path through the algorithm. Each processing
unit carries out an equal amount of work, so doing the work in
parallel reduces the computation required per worker; the
overall execution time is determined by the slowest worker. Here,
the creation of network instances is not parallelized. The span
TABLE I
VARIABLES USED IN THE PERFORMANCE MODEL.
| Variable | Explanation |
|---|---|
| Parameters | |
| p | Number of processing units |
| i | Number of training/validation images |
| it | Number of test images |
| ep | Number of epochs |
| Constants - hardware dependent | |
| CPI | Best theoretical CPI/thread |
| s | Speed of processing unit |
| OperationFactor | Operation factor |
| Measured - hardware dependent | |
| MemoryContention | Memory contention |
| TFprop+ | Forward propagation / image (ms) |
| TBprop+ | Back-propagation / image (ms) |
| TPrep+ | Time for preparations |
| Calculated - hardware independent | |
| FProp* | # FProp Operations / image |
| BProp* | # BProp Operations / image |
| Prep* | # Operations carried out for preparations |
* The parameter is only used in prediction strategy (a)
+ The parameter is only used in prediction strategy (b)
TABLE II
HARDWARE INDEPENDENT PARAMETERS USED IN THE PERFORMANCE
MODEL.
| Term | Value |
|---|---|
| Epochs (ep) | 70 (small, medium), 15 (large) |
| Images (i) | 60,000 |
| Images (it) | 10,000 |
| Processing units/threads (p) | 1 - 3,840 |
| FProp | See Table VII |
| BProp | See Table VIII |
| Prep | Small: 10^9; Medium: 10^10; Large: 10^11 |
can be thought of as the sequential amount of work required to
initialize images and labels, and other necessary variables, plus
the maximum time for each network instance to carry out its
intended amount of work in training, validation, and testing.
With an infinite number of processing units, what remains
is the initial amount of work and the maximum time spent
by each processing unit to process its chunk of the images.
The total execution time depends on several factors in-
cluding: speed, number of processing units, communication
costs (such as network latency), and memory contention.
Of particular interest are contentions causing waiting times,
including memory latencies and synchronization overhead. A
time penalty referred to as Tmem is added to the model to
reflect memory and synchronization overhead. The contention
is measured through an experimental approach by executing a
small script on the Intel Xeon Phi processor for different thread
counts, CNN weights and layers. The full set of variables is
shown in Table I.
TABLE III
HARDWARE SPECIFIC PARAMETERS USED IN THE PERFORMANCE MODEL.
| Parameter | Intel Xeon Phi |
|---|---|
| s | 1.238 GHz |
| Max processing units (p) | 244 (240 used for prediction) |
| TFprop (ms) | Small: 1.45; Medium: 12.55; Large: 148.88 |
| TBprop (ms) | Small: 5.3; Medium: 69.73; Large: 859.19 |
| TPrep (s) | Small: 12.56; Medium: 12.7; Large: 13.5 |
| CPI | 1-2 threads: 1; 3 threads: 1.5; 4 threads: 2 |
| MemoryContention | Table IV |
| OperationFactor | Small: 15; Medium: 15; Large: 15 |
TABLE IV
MEASURED AND PREDICTED MEMORY CONTENTION IN SECONDS [S] FOR
THE INTEL XEON PHI.
| # Threads | Small CNN | Medium CNN | Large CNN |
|---|---|---|---|
| 1 | 7.10 × 10^-6 | 1.56 × 10^-4 | 8.83 × 10^-4 |
| 15 | 6.40 × 10^-4 | 2.00 × 10^-3 | 8.75 × 10^-3 |
| 30 | 1.36 × 10^-3 | 3.97 × 10^-3 | 1.67 × 10^-2 |
| 60 | 3.07 × 10^-3 | 8.03 × 10^-3 | 3.22 × 10^-2 |
| 120 | 6.76 × 10^-3 | 1.65 × 10^-2 | 6.74 × 10^-2 |
| 180 | 9.95 × 10^-3 | 2.50 × 10^-2 | 1.00 × 10^-1 |
| 240 | 1.40 × 10^-2 | 3.83 × 10^-2 | 1.38 × 10^-1 |
| 480* | 2.78 × 10^-2 | 7.31 × 10^-2 | 2.73 × 10^-1 |
| 960* | 5.60 × 10^-2 | 1.47 × 10^-1 | 5.46 × 10^-1 |
| 1,920* | 1.12 × 10^-1 | 2.95 × 10^-1 | 1.09 |
| 3,840* | 2.25 × 10^-1 | 5.91 × 10^-1 | 2.19 |
* Predicted memory contention
We define the memory overhead as
$T_{mem}(ep, i, p) = (MemoryContention \cdot ep \cdot i) / p$,
where $MemoryContention$ is the measured memory contention when
p threads are competing for I/O concurrently. The measured and
predicted values for memory contention are listed in Table IV.
Table I lists the parameters used in the performance model; some
parameters are hardware dependent and others are independent of
the underlying hardware. Each parameter is either measured
or calculated. Table II shows the parameters that are independent
of the hardware, and Table III shows the parameters that are
specific to the Intel Xeon Phi.
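As a worked example of the memory overhead term, the Python sketch below evaluates $T_{mem}$ for the small CNN using the contention values of Table IV with ep = 70 and i = 60,000 from Table II. The function and dictionary names are hypothetical.

```python
# Memory contention in seconds for the small CNN (Table IV).
# Entries for 480 threads and above are predicted values in the source.
CONTENTION_SMALL = {
    1: 7.10e-6, 15: 6.40e-4, 30: 1.36e-3, 60: 3.07e-3,
    120: 6.76e-3, 180: 9.95e-3, 240: 1.40e-2,
    480: 2.78e-2, 960: 5.60e-2, 1920: 1.12e-1, 3840: 2.25e-1,
}

def t_mem(memory_contention, ep, i, p):
    """Memory overhead in seconds: MemoryContention * ep * i / p."""
    return memory_contention * ep * i / p

overhead = {p: t_mem(mc, ep=70, i=60_000, p=p)
            for p, mc in CONTENTION_SMALL.items()}
```

With these inputs the overhead is about 29.8 s for one thread and 245 s for 240 threads. Because the tabulated contention grows roughly in proportion to p beyond 240 threads, the term stays near 245 s for the predicted thread counts instead of shrinking as threads are added.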
We follow two strategies for performance modelling:
• Strategy (a) minimizes the use of measurements for esti-
mating parameter values of the performance model. Only
memory contention is estimated using measurements. The
performance model for strategy (a) is depicted in Table V.
• Strategy (b) applies the measurements to estimation of
the sequential work, the forward- and back-propagation.
The performance model for strategy (b) is depicted in
Table VI.
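TABLE V
PERFORMANCE MODEL FOR STRATEGY (A).
$$
\begin{aligned}
T(i, it, ep, p, s) ={}& T_{comp}(i, it, ep, p, s) + T_{mem}(ep, i, p) \\
={}& \Bigg( \frac{Prep + 4 \cdot i + 2 \cdot it + 10 \cdot ep}{s} && \text{(preparation work)} \\
&+ \bigg( \frac{FProp + BProp}{s} \cdot \frac{i}{p_i} \cdot ep && \text{(training)} \\
&\quad + \frac{FProp}{s} \cdot \frac{i}{p_i} \cdot ep && \text{(validation)} \\
&\quad + \frac{FProp}{s} \cdot \frac{it}{p_{it}} \cdot ep \bigg) \cdot CPI && \text{(testing; calculation penalty)} \\
&\Bigg) \cdot OperationFactor \\
&+ \frac{MemoryContention \cdot i \cdot ep}{p} && \text{(memory overhead)}
\end{aligned}
$$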
Please note that the constants are approximations: they are
meaningful relative to each other but far from precise. Prep is
different for each CNN architecture ($10^{9}$, $10^{10}$ and
$10^{11}$ for the small, medium and large architecture,
respectively) and denotes the number of operations required to
create network instances, prepare weights, etc. The
OperationFactor is adjusted to closely match the measured value
for 15 threads; it mitigates the approximations made when counting
instructions and, at the same time, accounts for vectorization.
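The model for strategy (a) can be evaluated numerically, as in the Python sketch below. It plugs in the small-CNN values from Tables II, III, IV, VII and VIII. Two points are assumptions of this sketch: the divisors written $p_i$ and $p_{it}$ in Table V are both taken to be p, and the CPI factor is applied to the training, validation and testing terms only. The function and parameter names are hypothetical.

```python
def predict_time_a(i, it, ep, p, s, prep_ops, fprop_ops, bprop_ops,
                   cpi, operation_factor, memory_contention):
    """Predicted execution time in seconds for strategy (a).

    Assumption: the divisors p_i and p_it of Table V both equal p.
    """
    preparation = (prep_ops + 4 * i + 2 * it + 10 * ep) / s
    training = (fprop_ops + bprop_ops) / s * (i / p) * ep
    validation = fprop_ops / s * (i / p) * ep
    testing = fprop_ops / s * (it / p) * ep
    t_comp = (preparation
              + (training + validation + testing) * cpi) * operation_factor
    t_mem = memory_contention * i * ep / p
    return t_comp + t_mem

# Small CNN: operation counts from Tables VII and VIII, CPI for four
# threads per core, contention values from Table IV.
SMALL = dict(i=60_000, it=10_000, ep=70, s=1.238e9, prep_ops=1e9,
             fprop_ops=58_000, bprop_ops=524_000,
             cpi=2, operation_factor=15)

minutes_240 = predict_time_a(p=240, memory_contention=1.40e-2, **SMALL) / 60
minutes_480 = predict_time_a(p=480, memory_contention=2.78e-2, **SMALL) / 60
```

Under these assumptions the sketch yields roughly 8.9 and 6.6 minutes, which agrees with the values reported for the small CNN at 240 and 480 threads in Tables XI and X.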
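TABLE VI
PERFORMANCE MODEL FOR STRATEGY (B).
$$
\begin{aligned}
T(i, it, ep, p) ={}& T_{comp}(i, it, ep, p) + T_{mem}(ep, i, p) \\
={}& T_{prep} && \text{(preparation work)} \\
&+ \bigg( \left( T_{FProp} + T_{BProp} \right) \cdot \frac{i}{p_i} \cdot ep && \text{(training)} \\
&\quad + T_{FProp} \cdot \frac{i}{p_i} \cdot ep && \text{(validation)} \\
&\quad + T_{FProp} \cdot \frac{it}{p_{it}} \cdot ep \bigg) \cdot CPI && \text{(testing; calculation penalty)} \\
&+ \frac{MemoryContention \cdot i \cdot ep}{p} && \text{(memory overhead)}
\end{aligned}
$$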
Tprep is the measured time it takes to prepare the training
(small 12.56 seconds, medium 12.7 seconds, and large 13.5
seconds); TFProp and TBProp indicate the required time to
forward- and back-propagate one image through the network.
When one hardware thread is available per core, then one
instruction per cycle can be assumed. For four threads per
core, only 0.5 instructions per cycle can be assumed per thread;
each thread gets to execute two instructions every fourth cycle
(CPI of 2). The speed s is defined in Table III. FProp and
BProp are placeholders for the actual number of operations
shown in Table VII and Table VIII respectively.
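Strategy (b) can be sketched in the same way, using the measured per-image times and preparation times of Table III. As before, treating $p_i$ and $p_{it}$ as p is an assumption of this sketch, and the function and dictionary names are hypothetical. The per-image times are given in milliseconds and converted to seconds.

```python
def predict_time_b(i, it, ep, p, t_prep_s, t_fprop_ms, t_bprop_ms,
                   cpi, memory_contention):
    """Predicted execution time in seconds for strategy (b).

    Assumption: the divisors p_i and p_it of Table VI both equal p.
    """
    t_fprop = t_fprop_ms / 1000.0
    t_bprop = t_bprop_ms / 1000.0
    training = (t_fprop + t_bprop) * (i / p) * ep
    validation = t_fprop * (i / p) * ep
    testing = t_fprop * (it / p) * ep
    t_comp = t_prep_s + (training + validation + testing) * cpi
    t_mem = memory_contention * i * ep / p
    return t_comp + t_mem

# Per-architecture inputs for 480 threads (Tables II, III and IV).
ARCHITECTURES = {
    "small": dict(ep=70, t_prep_s=12.56, t_fprop_ms=1.45,
                  t_bprop_ms=5.3, memory_contention=2.78e-2),
    "medium": dict(ep=70, t_prep_s=12.7, t_fprop_ms=12.55,
                   t_bprop_ms=69.73, memory_contention=7.31e-2),
    "large": dict(ep=15, t_prep_s=13.5, t_fprop_ms=148.88,
                  t_bprop_ms=859.19, memory_contention=2.73e-1),
}

minutes_480 = {
    name: predict_time_b(i=60_000, it=10_000, p=480, cpi=2, **params) / 60
    for name, params in ARCHITECTURES.items()
}
```

With these inputs the sketch gives approximately 6.7, 39.1 and 82.6 minutes for the small, medium and large CNN, which matches the strategy (b) column for 480 threads in Table X.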
V. EVALUATION OF PERFORMANCE MODEL
In this section, we compare the predicted and measured
execution times for various numbers of threads and CNN
architectures. The execution time is the total time the program
TABLE VII
FProp: NUMBER OF OPERATIONS WHEN FORWARD PROPAGATING ONE
IMAGE FOR SMALL, MEDIUM AND LARGE CNN ARCHITECTURES.
| Architecture | Max Pool. | Fully Con. | Convolution | Total | Ratio |
|---|---|---|---|---|---|
| Small | 7k | 5k | 46k | 58k | - |
| Medium | 29k | 56k | 474k | 559k | 9.64 |
| Large | 99k | 137k | 5,113k | 5,349k | 9.57 |
TABLE VIII
BProp: NUMBER OF OPERATIONS WHEN BACK PROPAGATING ONE IMAGE
FOR SMALL, MEDIUM AND LARGE CNN ARCHITECTURES.
| Architecture | Max Pool. | Fully Con. | Convolution | Total | Ratio |
|---|---|---|---|---|---|
| Small | 2k | 10k | 512k | 524k | - |
| Medium | 4k | 112k | 6,003k | 6,119k | 11.68 |
| Large | 8k | 274k | 72,896k | 73,178k | 11.96 |
runs, excluding the time required to initialize the network
instances and images. To evaluate our approach we use an Intel
Xeon Phi 7120P accelerator that comprises 61 cores that run
at 1.2 GHz. We use 1, 15, 30, 60, 120, 180, and 240 threads
of the Intel Xeon Phi processor. Each thread is responsible
for one network instance. In the figures, we use the following
notations: Par refers to the parallel version, and T denotes
threads, for instance, Phi Par. 120 T is the parallel version
that is executed by 120 threads on the Intel Xeon Phi.
Result 1: The predicted execution times obtained from the
performance model match the measured execution times well.
Figures 5, 6, and 7 depict the predicted and measured
execution times for the small, medium and large CNN architectures.
For the small network (Figure 5), the predictions are close to
the measured values, with a slight deviation at the end. The
prediction model seems to overestimate the execution time
by a small factor.
For the medium architecture (Figure 6), the predictions follow
the measured values closely, although they underestimate the
execution time slightly. At 120 threads, the measured and
predicted values start to deviate, but they converge again at 240
threads.
The large CNN architecture (Figure 7) yields similar per-
formance results as the medium CNN architecture. We may
Fig. 5. Comparing predicted execution times with measured execution times
on Intel Xeon Phi for the small CNN architecture.
Fig. 6. Comparing predicted execution times with measured execution times
on Intel Xeon Phi for the medium CNN architecture.
Fig. 7. Comparing predicted execution times with measured execution times
on Intel Xeon Phi for the large CNN architecture.
observe that the measured values are slightly higher than
the predictions; however, the predictions follow the measured
values. For 120 threads there is a deviation between the
measured and predicted values, which is reduced at 240
threads. While the predicted execution time increases between
120 and 240 threads, the measured execution time decreases.
This is most probably due to the CPI factor that is applied when
3 or more threads are present on the same core.
We use the expression
$\Delta = \left( |T_{\mu} - T_{\psi}| / T_{\psi} \right) \cdot 100\%$
to calculate the prediction accuracy of our performance model,
where $T_{\mu}$ is the measured and $T_{\psi}$ is the predicted value.
The average prediction accuracy for strategies (a) and (b) and
various CNN architectures is shown in Table IX. We may
observe that model (a) is more accurate for the small CNN,
whereas model (b) is better for the medium and large CNNs.
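The accuracy measure is straightforward to compute. The Python sketch below implements the expression for $\Delta$ and averages the per-architecture values reported in Table IX; the function name and the sample call are illustrative only.

```python
def prediction_deviation(measured, predicted):
    """Delta = |T_measured - T_predicted| / T_predicted * 100 (percent)."""
    return abs(measured - predicted) / predicted * 100.0

# Hypothetical example: 110 s measured against 100 s predicted.
example = prediction_deviation(110.0, 100.0)   # about 10 percent

# Average accuracy per strategy over the small, medium and large CNN,
# using the values of Table IX.
TABLE_IX = {"a": [14.57, 14.76, 15.36], "b": [16.35, 7.48, 10.22]}
mean_delta = {strategy: sum(values) / len(values)
              for strategy, values in TABLE_IX.items()}
```

The means come out at about 14.9% for strategy (a) and 11.35% for strategy (b), consistent with the figures of about 15% and 11% quoted in the abstract.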
Result 2: The performance model results indicate that CNN
training on Intel Xeon Phi scales well up to several thousands
of threads.
We used the prediction model to predict the execution
times for 480, 960, 1920, and 3840 threads for different CNN
TABLE IX
AVERAGE ACCURACY ∆ OF PERFORMANCE MODEL FOR PREDICTION
STRATEGIES (A) AND (B) AND ALL CONSIDERED CNN ARCHITECTURES.
| Strategy | Small CNN | Medium CNN | Large CNN |
|---|---|---|---|
| a | 14.57% | 14.76% | 15.36% |
| b | 16.35% | 7.48% | 10.22% |
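TABLE X
PREDICTED EXECUTION TIMES IN MINUTES FOR 480, 960, 1,920 AND
3,840 THREADS USING THE PERFORMANCE MODELS (A) AND (B).
| Threads | Small (a) | Small (b) | Medium (a) | Medium (b) | Large (a) | Large (b) |
|---|---|---|---|---|---|---|
| 480 | 6.6 | 6.7 | 36.8 | 39.1 | 92.9 | 82.6 |
| 960 | 5.4 | 5.5 | 23.9 | 25.1 | 60.8 | 45.7 |
| 1,920 | 4.9 | 4.9 | 17.4 | 18.0 | 44.8 | 27.2 |
| 3,840 | 4.6 | 4.6 | 14.2 | 14.5 | 36.8 | 18.0 |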
TABLE XI
THE EXECUTION TIMES IN MINUTES WHEN SCALING EPOCHS AND IMAGES
FOR 240 AND 480 THREADS USING THE PERFORMANCE MODEL (A) ON
THE SMALL CNN ARCHITECTURE.
| Images i (1) | Images it (2) | 240 threads, 70 epochs | 240 threads, 140 epochs | 240 threads, 280 epochs | 480 threads, 70 epochs | 480 threads, 140 epochs | 480 threads, 280 epochs |
|---|---|---|---|---|---|---|---|
| 60k | 10k | 8.9 | 17.6 | 35.0 | 6.6 | 12.9 | 25.6 |
| 120k | 20k | 17.6 | 35.0 | 69.7 | 12.9 | 25.6 | 51.1 |
| 240k | 40k | 35.0 | 69.7 | 139.3 | 25.6 | 51.1 | 101.9 |
(1) Number of images in the training/validation set
(2) Number of images in the test set
architectures, using the same parameters. The results in Table
X show that if 3,840 threads were available, the small network
should take about 4.6 minutes to train, the medium 14.5
minutes (model (b)) and the large 36.8 minutes (model (a)). The
predictions for the large CNN architecture are not as well aligned
when increasing to larger thread counts as for the small and
medium architectures.
Additionally, we evaluated the execution time for varying
image counts, and epochs, for 240 and 480 threads for the
small CNN architecture. As can be seen in Table XI, doubling
the number of images or epochs approximately doubles the
execution time. However, doubling the number of threads does
not halve the execution time.
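The last observation can be quantified with a short calculation on the predicted times for the small CNN under model (a), taken from Tables X and XI. The Python sketch below computes the speedup and the parallel efficiency relative to 240 threads; the helper name and the choice of 240 threads as the baseline are assumptions of this example.

```python
def speedup_and_efficiency(t_base, p_base, t_new, p_new):
    """Speedup of the new configuration and its efficiency relative to
    the increase in thread count."""
    speedup = t_base / t_new
    efficiency = speedup / (p_new / p_base)
    return speedup, efficiency

# Predicted minutes, small CNN, model (a): 240 threads from Table XI,
# larger thread counts from Table X.
T_240 = 8.9
PREDICTED = {480: 6.6, 960: 5.4, 1920: 4.9, 3840: 4.6}

scaling = {p: speedup_and_efficiency(T_240, 240, t, p)
           for p, t in PREDICTED.items()}
```

Going from 240 to 480 threads gives a speedup of about 1.35 rather than 2, i.e. an efficiency of roughly 0.67 for the doubling.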
VI. RELATED WORK
In this section, we discuss related work with respect to
performance modeling of deep learning.
Yan et al. [18] focus on performance modeling and opti-
mization of deep learning on distributed systems. The authors
use analytical performance modeling techniques to explore the
configuration space and find optimal system configurations to
minimize the iteration time over the training data. According to
the authors, error rates of under 25% allow them to
distinguish good combinations of system parameters from
poor ones.
Oyama et al. [19] propose a performance prediction model
for an asynchronous stochastic gradient descent deep learn-
ing system. The proposed approach considers the probability
distribution of mini-batch sizes and staleness (that is, the
number of updates done within one gradient computation). The
authors report model accuracy of 81-95% for various mini-
batch sizes. Similar to our work, the authors use the prediction
model to evaluate the scalability of deep learning for upcoming
hardware architectures.
Paleo, a performance model proposed by Qi et al. [33], can
efficiently predict a combination of the network architecture,
hardware and software choices, parallelization strategies, and
communication schemes to model the expected performance
and scalability of training deep neural networks.
Song et al. [21], in contrast, focus on the different require-
ments that the end-users need to perform various prediction
tasks. They propose an approach that combines offline compi-
lation (to select optimal batch-sizes) and run-time management
(to identify and schedule the fastest kernels, and partition the
available resources accordingly). The authors use analytical
models to predict the optimal resources of a GPU (such as
streaming multiprocessors) to use in each layer, and predict
the processing time of a given layer.
Shi et al. [20] use performance modelling to evaluate various
distributed deep learning frameworks (such as Caffe-MPI or
TensorFlow) on GPU-accelerated computing systems. The authors
observe performance gaps between the deep learning
implementations under study and identify methods that require
further optimization.
Yufei et al. [34] propose a performance model for prediction
of throughput on FPGAs, which is used to identify and explore
optimal design choices during the design phase. The authors
focus on modeling the DRAM access, latency, and on-chip
buffer access. The validation results show that estimations
derived from the model closely match (within 3%) the actual
test results executed on Arria 10 and Stratix 10 FPGAs.
In contrast to the related work, we focus on performance
modeling of training deep convolutional neural networks on
the Intel Xeon Phi many-core processor. In our previous work
[35], we used machine learning for performance prediction of
DNA sequence analysis [36], [37] on the Intel Xeon Phi.
VII. SUMMARY
Deep learning is essential for solving complex problems in
many domains, including self-driving cars, object recognition,
natural language processing, speech recognition, and language
translation. In this paper, we have described an approach
for performance modeling of training convolutional neural
networks on the Intel Xeon Phi many core processor. We
developed two parameterized performance models based on
the theoretical code analysis. For the development of the
first performance model, we minimally used measurements
for estimating parameter values of the performance model;
only memory contention was estimated using measurements.
For the second performance model, we used measurements
for estimation of the sequential work, and the forward- and
back-propagation. We used three different convolutional neural
network architectures for evaluation of performance prediction
accuracy of the developed models. The average deviation of
the predicted from the measured performance, over all measured
thread counts and neural network architectures, was
about 15% for the first model and 11% for the second model.
Future work will develop performance models of deep learn-
ing on large-scale parallel computing systems that comprise
multiple nodes with many-core processors.
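The average deviation reported above can be illustrated with a minimal sketch. It assumes that the deviation is the mean absolute relative difference between predicted and measured execution time, averaged over thread counts; this section does not state the exact formula, so that definition, the function name, and all numeric values below are hypothetical and serve illustration only.

```python
def mean_abs_percentage_deviation(measured, predicted):
    """Mean of |predicted - measured| / measured, expressed in percent.

    Assumed metric: the exact deviation formula is not given in this section.
    """
    if len(measured) != len(predicted) or len(measured) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - m) / m for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)


# Hypothetical execution times in seconds for four thread counts.
# These are made-up values, not measurements from the paper.
thread_counts = [1, 15, 30, 60]
measured_times = [100.0, 8.0, 4.0, 2.0]
predicted_times = [110.0, 7.0, 4.4, 2.3]

deviation = mean_abs_percentage_deviation(measured_times, predicted_times)
print(f"average deviation over {len(thread_counts)} thread counts: {deviation:.1f}%")
```

Averaging relative rather than absolute differences keeps runs with few threads, which have long execution times, from dominating the result.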
REFERENCES
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,”
Nature, vol. 521, pp. 436–444, 2015. [Online]. Available:
https://doi.org/10.1038/nature14539
[2] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp,
P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang,
X. Zhang, J. Zhao, and K. Zieba, “End to end learning for self-
driving cars,” CoRR, vol. abs/1604.07316, 2016. [Online]. Available:
http://arxiv.org/abs/1604.07316
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet
classification with deep convolutional neural networks,” Commun.
ACM, vol. 60, no. 6, pp. 84–90, May 2017. [Online]. Available:
http://doi.acm.org/10.1145/3065386
[4] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu,
and P. Kuksa, “Natural language processing (almost) from scratch,”
J. Mach. Learn. Res., vol. 12, pp. 2493–2537, Nov. 2011. [Online].
Available: http://dl.acm.org/citation.cfm?id=1953048.2078186
[5] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly,
A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury,
“Deep neural networks for acoustic modeling in speech recognition:
The shared views of four research groups,” IEEE Signal Processing
Magazine, vol. 29, no. 6, pp. 82–97, Nov 2012.
[6] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning
with neural networks,” in Proceedings of the 27th International
Conference on Neural Information Processing Systems - Volume 2, ser.
NIPS’14. Cambridge, MA, USA: MIT Press, 2014, pp. 3104–3112.
[Online]. Available: http://dl.acm.org/citation.cfm?id=2969033.2969173
[7] D. Grzonka, A. Jakobik, J. Kolodziej, and S. Pllana, “Using a
multi-agent system and artificial intelligence for monitoring and
improving the cloud performance and security,” Future Generation
Computer Systems, vol. 86, pp. 1106–1117, 2018. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0167739X17310531
[8] S. Memeti, S. Pllana, A. Binotto, J. Kołodziej, and I. Brandic, “Using
meta-heuristics and machine learning for software optimization of
parallel computing systems: a systematic literature review,” Computing,
Apr 2018. [Online]. Available:
https://doi.org/10.1007/s00607-018-0614-9
[9] S. Vitabile, M. Marks, D. Stojanovic, S. Pllana, J. M. Molina,
M. Krzyszton, A. Sikora, A. Jarynowski, F. Hosseinpour, A. Jakobik,
A. Stojnev Ilic, A. Respicio, D. Moldovan, C. Pop, and I. Salomie,
Medical Data Processing and Analysis for Remote Health and Activities
Monitoring. Cham: Springer International Publishing, 2019, pp. 186–220.
[Online]. Available: https://doi.org/10.1007/978-3-030-16272-6_7
[10] P. Czarnul, Parallel Programming for Modern High Performance Com-
puting Systems. Chapman and Hall/CRC, 2018.
[11] S. Pllana and F. Xhafa, Programming Multicore and Many-core Com-
puting Systems, 1st ed. Hoboken, New Jersey, USA: John Wiley &
Sons, Inc., 2017.
[12] W. W. Smari, M. Bakhouya, S. Fiore, and G. Aloisio, “New advances
in high performance computing and simulation: parallel and distributed
systems, algorithms, and applications,” Concurrency and Computation:
Practice and Experience, vol. 28, no. 7, pp. 2024–2030, 2016. [Online].
Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.3774
[13] S. Benkner, S. Pllana, J. L. Traff, P. Tsigas, U. Dolinsky, C. Augonnet,
B. Bachmayer, C. Kessler, D. Moloney, and V. Osipov, “Peppher:
Efficient and productive usage of hybrid computing systems,” IEEE
Micro, vol. 31, no. 05, pp. 28–41, sep 2011.
[14] E. Abraham, C. Bekas, I. Brandic, S. Genaim, E. B. Johnsen, I. Kondov,
S. Pllana, and A. Streit, “Preparing HPC Applications for Exascale:
Challenges and Recommendations,” in 2015 18th International Confer-
ence on Network-Based Information Systems, Sep. 2015, pp. 401–406.
[15] T. Ben-Nun and T. Hoefler, “Demystifying parallel and
distributed deep learning: An in-depth concurrency analy-
sis,” CoRR, vol. abs/1802.09941, 2018. [Online]. Available:
http://arxiv.org/abs/1802.09941
[16] A. Zlateski, K. Lee, and S. Seung, “Znn – a fast and scalable algorithm
for training 3d convolutional networks on multi-core and many-core
shared memory machines,” in 2016 IEEE International Parallel and
Distributed Processing Symposium (IPDPS), May 2016, pp. 801–811.
[17] “TOP500 list,” www.top500.org/, Nov. 2018.
[18] F. Yan, O. Ruwase, Y. He, and T. Chilimbi, “Performance modeling
and scalability optimization of distributed deep learning systems,”
in Proceedings of the 21th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, ser. KDD ’15. New
York, NY, USA: ACM, 2015, pp. 1355–1364. [Online]. Available:
http://doi.acm.org/10.1145/2783258.2783270
[19] Y. Oyama, A. Nomura, I. Sato, H. Nishimura, Y. Tamatsu, and S. Mat-
suoka, “Predicting statistics of asynchronous sgd parameters for a large-
scale distributed deep learning system on gpu supercomputers,” in 2016
IEEE International Conference on Big Data (Big Data), Dec 2016, pp.
66–75.
[20] S. Shi, Q. Wang, and X. Chu, “Performance modeling and evaluation
of distributed deep learning frameworks on gpus,” in 2018 IEEE 16th
Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl
Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big
Data Intelligence and Computing and Cyber Science and Technology
Congress (DASC/PiCom/DataCom/CyberSciTech), Aug 2018, pp. 949–
957.
[21] M. Song, Y. Hu, H. Chen, and T. Li, “Towards pervasive and user satis-
factory cnn across gpu microarchitectures,” in 2017 IEEE International
Symposium on High Performance Computer Architecture (HPCA), Feb
2017, pp. 1–12.
[22] D. Cireşan, “Simple C/C++ code for training and testing MLPs and
CNNs,” http://people.idsia.ch/~ciresan/, [Online; accessed
11-February-2019].
[23] A. Ng, J. Ngiam, C. Y. Foo, Y. Mai, and C. Suen, “UFLDL
tutorial on neural networks,” UFLDL Tutorial on Neural Networks, 2011.
[24] A. Gibiansky, “Fully connected neural network algorithms,”
http://andrew.gibiansky.com/blog/machine-learning/fully-connected-neural-networks/,
[Online; accessed 21-March-2019].
[25] J. Schmidhuber, “Deep learning in neural networks: An overview,”
Neural Networks, vol. 61, pp. 85–117, 2015.
[26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning
applied to document recognition,” Proceedings of the IEEE, vol. 86,
no. 11, pp. 2278–2324, 1998.
[27] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” AT&T Labs
[Online]. Available: http://yann.lecun.com/exdb/mnist, 2010.
[28] G. Chrysos, “Intel® Xeon Phi Coprocessor-the Architecture,” Intel
Whitepaper, 2012. [Online]. Available:
https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner
[29] X. Tian, H. Saito, S. Preis, E. N. Garcia, S. Kozhukhov, M. Masten,
A. G. Cherkasov, and N. Panchenko, “Practical SIMD Vectorization
Techniques for Intel Xeon Phi Coprocessors,” in IPDPS Workshops.
IEEE, 2013, pp. 1149–1158.
[30] S. Memeti, L. Li, S. Pllana, J. Kolodziej, and C. Kessler, “Benchmarking
opencl, openacc, openmp, and cuda: Programming productivity,
performance, and energy consumption,” in Proceedings of the 2017
Workshop on Adaptive Resource Management and Scheduling for Cloud
Computing, ser. ARMS-CC ’17. New York, NY, USA: ACM, 2017, pp.
1–6. [Online]. Available: http://doi.acm.org/10.1145/3110355.3110356
[31] S. Pllana, S. Benkner, F. Xhafa, and L. Barolli, “Hybrid performance
modeling and prediction of large-scale computing systems,” in 2008
International Conference on Complex, Intelligent and Software Intensive
Systems, March 2008, pp. 132–138.
[32] T. Fahringer, S. Pllana, and J. Testori, “Teuta: Tool support for perfor-
mance modeling of distributed and parallel applications,” in Computa-
tional Science - ICCS 2004, M. Bubak, G. D. van Albada, P. M. A. Sloot,
and J. Dongarra, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg,
2004, pp. 456–463.
[33] H. Qi, E. R. Sparks, and A. S. Talwalkar, “Paleo: A performance model
for deep neural networks,” in International Conference on Learning
Representations (ICLR), 2017.
[34] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, “Performance modeling for
cnn inference accelerators on fpga,” IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, pp. 1–1, 2019.
[35] S. Memeti and S. Pllana, “A machine learning approach for
accelerating DNA sequence analysis,” The International Journal of
High Performance Computing Applications, vol. 32, no. 3, pp. 363–379,
2018. [Online]. Available: https://doi.org/10.1177/1094342016654214
[36] S. Memeti and S. Pllana, “Accelerating DNA Sequence Analysis Using
Intel(R) Xeon Phi(TM),” in 2015 IEEE Trustcom/BigDataSE/ISPA,
vol. 3, Aug 2015, pp. 222–227.
[37] S. Memeti, S. Pllana, and J. Kołodziej, Optimal Worksharing of
DNA Sequence Analysis on Accelerated Platforms. Cham: Springer
International Publishing, 2016, pp. 279–309. [Online]. Available:
https://doi.org/10.1007/978-3-319-44881-7_14
