Abstract-Many complex problems, such as natural language processing or visual object detection, are solved using deep learning. However, efficient training of complex deep convolutional neural networks for large data sets is computationally demanding and requires parallel computing resources. In this paper, we present two parameterized performance models for estimating the execution time of training convolutional neural networks on the Intel many integrated core architecture. While for the first performance model we minimally use measurement techniques for parameter value estimation, in the second model we estimate more parameters based on measurements. We evaluate the prediction accuracy of the performance models in the context of training three different convolutional neural network architectures on the Intel Xeon Phi. The average deviation of predicted from measured performance is about 15% for the first model and 11% for the second model.
I. INTRODUCTION
Deep learning [1] is modeled as an artificial deep neural network that uses many processing layers to learn complex functions, with successful applications in various domains including self-driving cars [2], object recognition [3], natural language processing [4], speech recognition [5], language translation [6], and optimization of the Cloud [7] and parallel computing systems [8].
Deep learning is becoming increasingly computationally demanding, in line with the increasing volumes of available data [9] and the growing complexity of deep neural networks. Therefore, many-core parallel computing systems [10]-[14] are used to accelerate the learning process of deep neural networks [15], [16]. Many-core processors, such as NVIDIA GPUs or the Intel Xeon Phi, provide high performance and may be used to accelerate the process of deep learning. Figure 1 depicts the performance of many-core processors compared to the fastest supercomputers in the world in the TOP500 list [17]. For instance, the peak performance of the Intel Xeon Phi Knights Corner (KNC) or the Tesla K40 is similar to that of ASCI Red, the fastest supercomputer in 1997, with 1.45 Teraflop/s peak performance.
This research has received funding from the Swedish Knowledge Foundation under Grant No. 20150088.
Related work has studied performance modeling of deep learning on distributed systems [18], performance prediction of asynchronous stochastic gradient descent [19], performance modelling of various distributed deep learning frameworks (such as Caffe-MPI or TensorFlow) [20], and analytical models for predicting the usage of optimal resources of a GPU for deep learning [21]. However, not much attention has been devoted to performance modeling of deep convolutional neural networks on the Intel many integrated core architecture.
In this paper, we describe our approach for performance modeling of training convolutional neural networks on the Intel Xeon Phi many-core processor. We develop two parameterized performance models based on the theoretical analysis of a code [22] for training convolutional neural networks that we parallelized for the Intel Xeon Phi. Input variables of the performance models are the number of training or validation images, the number of test images, the number of network instances, the number of epochs, and the number of processing units. For the development of the first performance model, we minimally use measurements for estimating parameter values of the performance model; only memory contention is estimated using measurements. For the second performance model, we apply measurements for estimation of the sequential work and of the forward- and back-propagation. For evaluation, we use the MNIST [27] dataset of handwritten digits. The average deviation of predicted from measured performance over all measured thread counts and various neural network architectures is about 15% for the first model and 11% for the second model. Major contributions of this paper include:
• development of two performance models for estimation of execution time of training convolutional neural networks on the Intel Xeon Phi,
• evaluation of prediction accuracy of performance models for various execution contexts and neural network architectures,
• model-driven performance evaluation for a larger number of threads than the number of hardware threads of the Intel Xeon Phi under study.
The rest of this paper is structured as follows. Section II gives an overview of convolutional neural networks that are addressed in this paper. We describe the Intel many integrated core architecture in Section III. Section IV describes our performance modelling approach. An empirical evaluation of the performance model is described in Section V. We discuss the related work in Section VI. Section VII concludes this paper.
II. CONVOLUTIONAL NEURAL NETWORKS
An artificial deep neural network is the underlying model used in deep learning [1] . A Convolutional Neural Network (CNN) is a variant of a Deep Neural Network (DNN), which introduces two additional layer types: convolutional layers and pooling layers. The mammal visual processing system is hierarchical (deep) in nature. Higher level features are abstractions of lower level ones. For instance, to understand speech, waveforms are translated through several layers until reaching a linguistic level. A similar analogy can be drawn for images, where edges and corners are lower level abstractions translated into more spatial patterns on higher levels.
The architecture of a DNN consists of multiple layers of neurons. Neurons are connected to each other through edges (weights). The network can simply be thought of as a weighted graph; a directed acyclic graph represents a feed-forward network. The depth and breadth of the network may differ, as may the layer types. Regardless of the depth, a network has at least one input and one output layer. A neuron has a set of incoming weights, each of which corresponds to an outgoing edge of a neuron in the previous layer. Also, a bias term is used at each layer as an intercept term. The goal of the learning process is to adjust the network weights and find a global minimum of the overall error, i.e., the deviation between the predicted and the desired outcome over all the samples. The resulting weight parameters can thereafter be used to make predictions for unseen inputs [23].
DNNs can make predictions by forward propagating an input through the network. Forward propagation proceeds by performing calculations at each layer until reaching the output layer, which contains a vector representing the prediction. For example, in image classification problems, the output layer contains the prediction score that indicates the likelihood that an image belongs to a category [23], [24].
[Figure 2(c): Large CNN. The last convolutional layer has 100 maps, 3,600 neurons, a 6x6 kernel, a map size of 6x6, and 216,100 weights.]
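The layer figures quoted for the large CNN (100 maps, 3,600 neurons, a 6x6 kernel, a 6x6 map size, 216,100 weights) can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code. It assumes that every map holds one shared kernel per map of the previous layer plus one bias, and that the previous layer has 60 maps; neither assumption is stated in the text, but together they reproduce the quoted weight count.

```python
def conv_layer_counts(maps, map_side, kernel_side, prev_maps):
    """Return (neurons, weights) of a convolutional layer.

    Assumption (not from the paper): each map has one shared
    kernel per previous-layer map, plus a single bias weight.
    """
    neurons = maps * map_side * map_side
    weights = maps * (prev_maps * kernel_side * kernel_side + 1)
    return neurons, weights


# Values quoted for the large CNN; prev_maps=60 is inferred, not stated.
neurons, weights = conv_layer_counts(maps=100, map_side=6,
                                     kernel_side=6, prev_maps=60)
print(neurons, weights)  # 3600 216100
```

The neuron count follows directly from the stated map count and map size; only the weight count depends on the inferred number of previous-layer maps.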
Forward propagation starts from a given input layer; at each layer, the activation of a neuron is computed using the equation

$$y_i^l = \sigma(x_i^l), \qquad x_i^l = \sum_j w_{ji}^l \, y_j^{l-1},$$

where $y_i^l$ is the output of neuron $i$ at layer $l$, $\sigma$ is the activation function, $w_{ji}^l$ is the weight between neuron $i$ at layer $l$ and neuron $j$ at the previous layer, and $y_j^{l-1}$ is the output of the $j$th neuron at the previous layer. This process is repeated until reaching the output layer. At the output layer, it is common to apply a softmax function, or similar, to squash the output vector and hence derive the prediction.
Back-propagation is the process of propagating errors, i.e., the loss calculated as the deviation between the predicted and the desired output, backward through the network and adjusting the weights at each layer. The error and partial derivatives $\delta_i^l$ are calculated at the output layer based on the predicted values from forward propagation and the labeled value (the correct value). At each layer, the relative error of each neuron is calculated and the weight parameters are updated based on how much the neuron participated in the faulty prediction. The expression

$$\frac{\partial E}{\partial y_i^l} = \sum_j w_{ij}^{l+1} \, \delta_j^{l+1}$$

denotes that the partial derivative of neuron $i$ at the current layer $l$ is the sum of the derivatives of connected neurons at the next layer multiplied with the weights, assuming $w^{l+1}$ denotes the weights between the maps of layers $l$ and $l+1$. Additionally, a decay is commonly used to control the impact of the updates, which is omitted in the above calculations. More concretely, the algorithm can be thought of as updating the layer's weights based on "how much it was responsible for the errors in the output" [23], [24].
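To make the forward and backward passes described above concrete, the following minimal sketch propagates one input through a tiny fully connected network and then propagates the error back by one layer. It is an illustration only, not the parallelized code studied in this paper. The sigmoid activation, the squared-error loss at the output, the index convention `weights[j][i]`, and all function names are assumptions introduced here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(weights, y_prev):
    """Compute one layer's outputs.

    weights[j][i] connects neuron j of the previous layer
    to neuron i of the current layer (assumed convention).
    """
    n_out = len(weights[0])
    x = [sum(weights[j][i] * y_prev[j] for j in range(len(y_prev)))
         for i in range(n_out)]
    return [sigmoid(v) for v in x]


def softmax(v):
    """Squash a vector into values that sum to one."""
    m = max(v)  # subtract the maximum for numerical stability
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


def backward_deltas(weights_next, delta_next, y):
    """Propagate deltas one layer back.

    dE/dy_i is the weighted sum of the next layer's deltas;
    multiplying by the sigmoid derivative y_i * (1 - y_i) gives delta_i.
    """
    deltas = []
    for i in range(len(y)):
        dE_dy = sum(weights_next[i][j] * delta_next[j]
                    for j in range(len(delta_next)))
        deltas.append(dE_dy * y[i] * (1.0 - y[i]))
    return deltas


# Hypothetical 2-2-2 network and training sample.
w1 = [[0.1, -0.2], [0.4, 0.3]]
w2 = [[0.5, -0.5], [0.25, 0.75]]
hidden = forward(w1, [1.0, 0.5])
out = forward(w2, hidden)
probs = softmax(out)

target = [1.0, 0.0]
# Output deltas under an assumed squared-error loss.
delta_out = [(o - t) * o * (1.0 - o) for o, t in zip(out, target)]
delta_hidden = backward_deltas(w2, delta_out, hidden)
```

A weight update would then subtract a learning rate times `delta_i * y_prev_j` from each `weights[j][i]`; as in the text, weight decay is omitted.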
A CNN is a multi-layer model constructed to learn various levels of representations, where higher level representations are described based on the lower level ones [25]. It is a variant of the deep neural network that introduces two new layer types: convolutional and pooling layers.
The convolutional layer consists of several feature maps, where neurons in each map connect to a grid of neurons in maps in the previous layer through overlapping kernels. The kernels are tiled to cover the whole input space. The approach is inspired by the receptive fields of the mammalian visual cortex. All neurons of a map extract the same features from a map in the previous layer, as they share the same set of weights. Pooling layers are interleaved with convolutional layers and have been shown to lead to faster convergence. Each neuron in a pooling layer outputs the maximum or average value of a partition of neurons in the previous layer, and hence only activates if the underlying grid contains the sought feature. Besides lowering the computational load, pooling also enables position invariance and down-samples the input by a factor relative to the kernel size [26].
LeNet-5 is an example of a Convolutional Neural Network. Each layer of convolution and pooling (pooling being the specific method of sub-sampling used in LeNet) comprises several feature maps. Neurons in a feature map cover different subfields of the neurons from the previous layer. All neurons in a map share the same weight parameters; therefore, they extract the same features from different parts of the input from the previous layers. CNNs are commonly constructed similarly to LeNet-5, beginning with an input layer, followed by several convolutional/pooling combinations, and ending with a fully connected layer and an output layer [26].
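As an illustration of the pooling step described above, the sketch below applies max pooling over non-overlapping partitions of a square feature map. It is a minimal example under stated assumptions: the function name, the square input, and the requirement that the map side be a multiple of the partition size are choices made here, not details from the paper.

```python
def max_pool(feature_map, k):
    """Max pooling over non-overlapping k x k partitions.

    feature_map is a square 2D list whose side is assumed
    to be a multiple of k.
    """
    n = len(feature_map)
    return [[max(feature_map[r + dr][c + dc]
                 for dr in range(k) for dc in range(k))
             for c in range(0, n - k + 1, k)]
            for r in range(0, n - k + 1, k)]


fm = [[1, 2, 0, 0],
      [3, 4, 0, 1],
      [0, 0, 5, 6],
      [0, 9, 7, 8]]
print(max_pool(fm, 2))  # [[4, 1], [9, 8]]
```

The 4x4 input is down-sampled to 2x2, i.e., by a factor equal to the partition size in each dimension, and each output keeps only the strongest response within its partition.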
In this study, the MNIST [27] dataset of handwritten digits is used. In total, the MNIST dataset comprises 70,000 images, 60,000 of which are used for training/validation and the rest for testing. Figure 2 depicts the three CNN architectures that we use for evaluation: small, medium, and large. There are various CNN implementations, such as EbLearn at New York University and Caffe at Berkeley. As a basis for our work we selected a project developed by Cireşan [22], which targets the MNIST dataset of handwritten digits and allows the definition of layers, the activation function, and the connection types to be configured dynamically using a configuration file.
III. INTEL MANY INTEGRATED CORE ARCHITECTURE
Figure 3 depicts an overview of the Intel Xeon Phi (codenamed Knights Corner) architecture, which is an example of the Intel Many Integrated Core (MIC) Architecture. It is a many-core shared-memory processor, which runs a lightweight Linux operating system that offers the possibility to communicate with it over ssh. The Intel Xeon Phi processor used in our study runs a µOS of version 2.6.38.8 and a software stack MPSS version 3.1.1.
The Intel Xeon Phi used in this study is of model 7120P and features 61 cores, each with a clock frequency of 1.2 GHz [28]. Each core can switch between four hardware threads in a round-robin manner, which amounts to a total of 244 threads per processor. Theoretically, the processor can deliver up to one teraFLOP/s of double-precision performance, or two teraFLOP/s of single-precision performance. Each core has its own L1 (32KB) and L2 (512KB) cache. The L2 cache is kept fully coherent by a global distributed tag-directory (TD). The cores are connected through a bidirectional ring bus interconnect, which forms a unified shared L2 cache of 30.5MB. In addition to the cores, there are 16 memory channels that in theory offer a maximum memory bandwidth of 352GB/s.
Efficient usage of the available vector processing units of the Intel Xeon Phi is essential to fully utilize the performance of the processor [29]. Through the 512-bit wide SIMD registers it can perform 16 (16 wide × 32 bit) single-precision or 8 (8 wide × 64 bit) double-precision operations per cycle. The Xeon Phi offers two programming models:
1) offload: parts of the application running on the host are offloaded to the Intel Xeon Phi processor;
2) native: the code is compiled specifically for running natively on the Intel Xeon Phi processor, and the code and all the required libraries are transferred to the device.
In this study, we use the native mode and OpenMP [30] for a code implementation that exploits the thread- and SIMD-parallelism available on the Intel Xeon Phi. The Intel Compiler 15.0.0 was used for native compilation of the application for the processor, with the O3 optimization level.
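The hardware figures quoted in this section can be related to each other with simple arithmetic. The sketch below is a back-of-the-envelope check rather than a vendor specification. It uses the core count, clock frequency, thread count, cache size, and SIMD width stated above, and additionally assumes that each SIMD lane completes one fused multiply-add per cycle, counted as two floating-point operations; that factor is an assumption made here.

```python
CORES = 61
THREADS_PER_CORE = 4
CLOCK_HZ = 1.2e9          # clock frequency quoted in the text
SIMD_BITS = 512
L2_PER_CORE_KB = 512
# Assumption: one fused multiply-add per lane per cycle = 2 FLOPs.
FLOPS_PER_LANE_PER_CYCLE = 2


def peak_flops(bits_per_value):
    """Theoretical peak FLOP/s for a given operand width."""
    lanes = SIMD_BITS // bits_per_value
    return CORES * CLOCK_HZ * lanes * FLOPS_PER_LANE_PER_CYCLE


hardware_threads = CORES * THREADS_PER_CORE   # 244
l2_total_mb = CORES * L2_PER_CORE_KB / 1024   # 30.5

print(hardware_threads, l2_total_mb)
print(peak_flops(64) / 1e12, peak_flops(32) / 1e12)
```

Under these assumptions the double-precision peak comes out slightly above one teraFLOP/s and the single-precision peak at twice that, which is in line with the theoretical figures quoted above.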
IV. PERFORMANCE MODELLING
A performance model [31], [32] enables us to reason about the behavior of an implementation in future execution contexts. Our performance model can predict the performance for thread counts that exceed the number of hardware threads supported by the Intel Xeon Phi model that we used for evaluation. Additionally, it can predict the performance of different CNN architectures with various numbers of images and epochs.
The input variables of our performance model T (i, it, ep, p, s) are: the number of training or validation images (i), the number of test images (it), the number of network instances (ns), the number of epochs (ep), and the number of processing units (p). Figure 4 depicts an overview of our parallel deep learning algorithm for the Intel Xeon Phi, using call-outs to denote the time complexity of different operations. Dashed lines denote the critical path through the algorithm. Each processing unit carries out an equal amount of work; doing so in parallel reduces the computation required per worker, and the shortest achievable execution time is determined by the slowest worker. Here, the creation of network instances is not parallelized.
The span can be thought of as the sequential work required to initialize images, labels, and other necessary variables, plus the maximum time any network instance needs to carry out its intended amount of work in training, validation, and testing. With an infinite number of processing units, what remains is the initial work and the maximum time spent by any processing unit to process its chunk of the images.
The total execution time depends on several factors, including speed, number of processing units, communication costs (such as network latency), and memory contention. Of particular interest are contentions causing waiting times, including memory latencies and synchronization overhead. A time penalty, referred to as $T_{mem}$, is added to the model to reflect memory and synchronization overhead. The contention is measured experimentally by executing a small script on the Intel Xeon Phi processor for different thread counts, CNN weights, and layers. The full set of variables is shown in Table I. We define the memory overhead as

$$T_{mem}(ep, i, p) = \frac{MemoryContention \cdot ep \cdot i}{p},$$

where $MemoryContention$ is the measured memory contention when p threads are competing for I/O concurrently.
The measured and predicted values for memory contention are shown in Table IV. Table I lists the parameters used in the performance model; some parameters are hardware dependent and others are independent of the underlying hardware. Each parameter is either measured or calculated. Table II shows the parameters that are independent of the hardware, and Table III shows the parameters that are specific to the Intel Xeon Phi.
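The memory overhead term defined above translates directly into code. The following minimal sketch evaluates $T_{mem}$ for given inputs. The function name and the contention value used in the example are hypothetical placeholders chosen here; the measured contention values belong to Table IV and are not reproduced.

```python
def t_mem(memory_contention, ep, i, p):
    """Memory overhead T_mem(ep, i, p) = MemoryContention * ep * i / p.

    memory_contention: measured contention (seconds) when p threads
                       compete concurrently
    ep: number of epochs
    i:  number of training/validation images
    p:  number of processing units
    """
    return memory_contention * ep * i / p


# Hypothetical contention value, for illustration only.
overhead = t_mem(memory_contention=0.002, ep=10, i=60000, p=240)
print(overhead)  # about 5 seconds
```

Because the measured contention itself depends on the number of competing threads, a value measured for one thread count should not be reused for another; in practice one would look up the contention for the given p before calling the function.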
We follow two strategies for performance modelling:
• Strategy (a) minimizes the use of measurements for estimating parameter values of the performance model. Only memory contention is estimated using measurements. The performance model for strategy (a) is depicted in Table V.
• Strategy (b) applies measurements to the estimation of the sequential work and of the forward- and back-propagation. The performance model for strategy (b) is depicted in Table VI.
Please note that the constants are approximations; they are relative to each other, yet far from precise. $Prep$ is different for each CNN architecture ($10^9$, $10^{10}$, and $10^{11}$ for the small, medium, and large architecture, respectively) and denotes the number of operations required to create network instances, prepare weights, etc. The $OperationFactor$ is adjusted to closely match the measured value for 15 threads; it mitigates the approximations made when counting instructions and, at the same time, accounts for vectorization.
$T_{prep}$ is the measured time it takes to prepare the training (12.56 seconds for the small, 12.7 seconds for the medium, and 13.5 seconds for the large architecture); $T_{FProp}$ and $T_{BProp}$ indicate the time required to forward- and back-propagate one image through the network. When one hardware thread is available per core, one instruction per cycle can be assumed. For four threads per core, only 0.5 instructions per cycle can be assumed per thread; each thread gets to execute two instructions every fourth cycle (CPI of 2). The speed s is defined in Table III. $FProp$ and $BProp$ are placeholders for the actual number of operations shown in Table VII and Table VIII, respectively.
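The relation between threads per core, cycles per instruction (CPI), and per-thread instruction rate can be sketched as follows. This is a rough illustration and not the definition of the speed s from Table III. Only the two cases stated in the text (one and four threads per core) are encoded, and the helper names are hypothetical.

```python
CLOCK_HZ = 1.2e9  # core clock frequency quoted in this paper

# Cycles per instruction per thread; only these two cases are
# stated in the text, so other thread counts are left undefined.
STATED_CPI = {1: 1.0, 4: 2.0}


def instructions_per_second(threads_per_core):
    """Per-thread instruction rate for a stated threads-per-core case."""
    if threads_per_core not in STATED_CPI:
        raise ValueError("CPI for this case is not given in the text")
    return CLOCK_HZ / STATED_CPI[threads_per_core]


def seconds_for(operations, threads_per_core):
    """Hypothetical helper: time for one thread to issue `operations`."""
    return operations / instructions_per_second(threads_per_core)


print(instructions_per_second(1))  # 1200000000.0
print(instructions_per_second(4))  # 600000000.0
```

With four threads per core, each thread issues instructions at half the clock rate, which corresponds to the 0.5 instructions per cycle mentioned above.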
V. EVALUATION OF PERFORMANCE MODEL
In this section, we compare the predicted and measured execution times for various numbers of threads and CNN architectures. The execution time is the total time the program runs, excluding the time required to initialize the network instances and images. To evaluate our approach we use an Intel Xeon Phi 7120P accelerator that comprises 61 cores running at 1.2 GHz. We use 1, 15, 30, 60, 120, 180, and 240 threads of the Intel Xeon Phi processor. Each thread is responsible for one network instance. In the figures, we use the following notation: Par refers to the parallel version and T denotes threads; for instance, Phi Par. 120 T is the parallel version executed by 120 threads on the Intel Xeon Phi.
Result 1: The predicted execution times obtained from the performance model match the measured execution times well.
Figures 5, 6, and 7 depict the predicted and measured execution times for the small, medium, and large CNN architectures. For the small network (Figure 5), the predictions are close to the measured values, with a slight deviation at the end of the measured range. The prediction model seems to over-estimate the execution time by a small factor.
For the medium architecture (Figure 6), the predictions follow the measured values closely, although they underestimate the execution time slightly. At 120 threads, the measured and predicted values start to deviate, but they converge again at 240 threads.
The large CNN architecture (Figure 7) yields performance results similar to those of the medium CNN architecture. The measured values are slightly higher than the predictions; however, the predictions follow the measured values. For 120 threads there is a deviation between the measured and predicted values, which is reduced for 240 threads. While the predicted execution time increases between 120 and 240 threads, the measured execution time decreases. This is most probably due to the CPI factor that is added when 3 or more threads are present on the same core.
We use the expression $\Delta = \left(\frac{T_\mu - T_\psi}{T_\psi}\right) \cdot 100\%$ to calculate the prediction accuracy of our performance model, where $T_\mu$ is the measured and $T_\psi$ is the predicted value. The average prediction accuracy for strategies (a) and (b) and various CNN architectures is shown in Table IX. We observe that model (a) is more accurate for the small CNN, whereas model (b) is better for the medium and large CNNs.
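The deviation expression above can be evaluated as in the following minimal sketch. The function names and the sample timings are hypothetical and do not come from Table IX. Averaging the magnitudes of the per-point deviations, rather than the signed values, is an assumption made here so that over- and under-estimates do not cancel.

```python
def deviation_percent(measured, predicted):
    """Relative deviation of a measurement from its prediction, in percent."""
    return (measured - predicted) / predicted * 100.0


def mean_abs_deviation(pairs):
    """Average magnitude of the deviation over (measured, predicted) pairs.

    Assumption: magnitudes are averaged so that positive and
    negative deviations do not cancel each other.
    """
    return sum(abs(deviation_percent(m, p)) for m, p in pairs) / len(pairs)


# Hypothetical timings in seconds, for illustration only.
samples = [(110.0, 100.0), (90.0, 100.0), (52.5, 50.0)]
for m, p in samples:
    print(m, p, round(deviation_percent(m, p), 2))
print(round(mean_abs_deviation(samples), 2))
```

A positive value indicates that the model under-estimates the execution time, and a negative value indicates an over-estimate.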
Result 2: The performance model results indicate that CNN training on Intel Xeon Phi scales well up to several thousands of threads.
We used the prediction model to predict the execution times for 480, 960, 1920, and 3840 threads for different CNN architectures. The results in Table X show that if 3,840 threads were available, the small network should take about 4.6 minutes to train, the medium 14.5 minutes, and the large 36.8 minutes. The predictions for the large CNN architecture are not as well aligned when increasing to larger thread counts as those for the small and medium architectures.
Additionally, we evaluated the execution time for varying image counts and epochs, for 240 and 480 threads, for the small CNN architecture. As can be seen in Table XI, doubling the number of images or epochs approximately doubles the execution time. However, doubling the number of threads does not halve the execution time.
VI. RELATED WORK
In this section, we discuss related work with respect to performance modeling of deep learning.
Yan et al. [18] focus on performance modeling and optimization of deep learning on distributed systems. The authors use analytical performance modeling techniques to explore the configuration space and find optimal system configurations that minimize the iteration time over the training data. According to the authors, error rates of under 25% allow them to distinguish good combinations of system parameters from poor ones.
Oyama et al. [19] propose a performance prediction model for an asynchronous stochastic gradient descent deep learning system. The proposed approach considers the probability distribution of mini-batch sizes and staleness (that is, the number of updates done within one gradient computation). The authors report model accuracy of 81-95% for various minibatch sizes. Similar to our work, the authors use the prediction model to evaluate the scalability of deep learning for upcoming hardware architectures.
Paleo, a performance model proposed by Qi et al. [33], takes into account a combination of the network architecture, hardware and software choices, parallelization strategies, and communication schemes to efficiently model the expected performance and scalability of training deep neural networks.
Song et al. [21], in contrast, focus on the differing requirements of end-users who perform various prediction tasks. They propose an approach that combines offline compilation (to select optimal batch sizes) and run-time management (to identify and schedule the fastest kernels, and partition the available resources accordingly). The authors use analytical models to predict the optimal resources of a GPU (such as streaming multiprocessors) to use in each layer, and to predict the processing time of a given layer.
Shi et al. [20] use performance modelling to evaluate various distributed deep learning frameworks (such as Caffe-MPI or TensorFlow) on GPU-accelerated computing systems. The authors observe performance gaps between the deep learning implementations under study and identify methods that require further optimization.
Yufei et al. [34] propose a performance model for prediction of throughput on FPGAs, which is used to identify and explore optimal design choices during the design phase. The authors focus on modeling the DRAM access, latency, and on-chip buffer access. The validation results show that estimations derived from the model closely match (within 3%) the actual test results executed on Arria 10 and Stratix 10 FPGAs.
In contrast to the related work, we focus on performance modeling of training deep convolutional neural networks on the Intel Xeon Phi many-core processor. In our previous work [35] we used machine learning for performance prediction of DNA sequence analysis [36], [37] on the Intel Xeon Phi.
VII. SUMMARY
Deep learning is essential for solving complex problems in many domains, including self-driving cars, object recognition, natural language processing, speech recognition, and language translation. In this paper, we have described an approach for performance modeling of training convolutional neural networks on the Intel Xeon Phi many-core processor. We developed two parameterized performance models based on theoretical analysis of the code. For the development of the first performance model, we minimally used measurements for estimating parameter values of the performance model; only memory contention was estimated using measurements. For the second performance model, we used measurements for estimation of the sequential work and of the forward- and back-propagation. We used three different convolutional neural network architectures for evaluation of the performance prediction accuracy of the developed models. The average deviation of predicted from measured performance over all measured thread counts and various neural network architectures was about 15% for the first model and 11% for the second model. Future work will develop performance models of deep learning on large-scale parallel computing systems that comprise multiple nodes with many-core processors.
