At the heart of deep learning training and inferencing are computationally intensive primitives such as convolutions which form the building blocks of deep neural networks. Researchers have taken two distinct approaches to creating high performance implementations of deep learning kernels, namely, 1) library development exemplified by Intel MKL-DNN for CPUs, 2) automatic compilation represented by the TensorFlow XLA compiler. The two approaches have their drawbacks: even though a custom built library can deliver very good performance, the cost and time of development of the library can be high. Additionally, hand coding of a plethora of operators for performance is not scalable over the long term as more and more deep learning operators get invented. Automatic compilation of kernels is attractive but in practice, till date, automatically generated implementations lag expert coded kernels in performance by orders of magnitude.
Introduction
Deep learning has revolutionized many spheres of human activity, examples of which include, speech recognition [24] , image recognition [23, 28] , web search [1] , language translation [42] , conversational artificial intelligence [14] etc. Training and inferencing using deep neural networks (DNNs) that lie at the heart of deep learning are computationally intensive tasks. Because of the importance of deep learning and the need to speed up training and inferencing, custom chips such as Google TPUs, Intel Nervana chips, NVIDIA Tensor-Cores have been built. On the software side, frameworks such as TensorFlow, PyTorch, Intel MKL-DNN have been created to allow data scientists to write high performing deep learning code in an efficient manner. However, all these frameworks use manually optimized primitives to deliver high performance.
The proliferation of computer architectures and the invention of a variety of deep learning kernels has brought to the fore the need to generate high performing kernel implementations that undergird deep learning automatically. Existing automatic compilation, and auto-tuning techniques [4, 6-8, 10, 12, 13, 21, 27, 31, 35, 36, 38] are inadequate to match the performance of hand-tuned library implementations -later on in the paper we show that the state-of-the-art compiler generated code can lag library implementations on CPUs by as much as~10X or more. The main reason for the failure of automatic compiler techniques in achieving very high performance levels needed is that the sophistication in the CPU microarchitecture has increased over successive generations of CPUs (data prefetching, speculative execution, vector units, deep memory hierarchies, complex cache replacement algorithms etc). Consequently, the cost functions used to optimize code are unable to capture the nitty-gritties of the underlying architectures, and therefore are unable to derive the most effective execution schedules. Auto-tuning is an alternative approach where one explores a large number of program variants and selects the best performing version, sidestepping the complexity of deriving a cost function that adequately models the intricacies of the underlying hardware architecture. However, auto-tuning is expensive and furthermore, it may fall short of manually created library in performance [4] .
In this paper, we develop an approach that combines manually optimized microkernels with automatic compilation techniques. The inner most loops of kernels will be replaced with microkernels and the outer loops will be automatically optimized. It has been shown that 95% of all deep learning applications running in the data centers today have a recurring pattern in the inner most loops, namely blocked matrix multiplication [17, 26] . Therefore, if a microkernel such as that of matrix multiplication that exploits the vector units of modern microprocessors effectively is used for the inner most loops, and the outer loops are compiled to use the multilevel caches of the hardware system well, then the resulting performance will be very high and will be competitive with that of a hand-tuned library.
We use a code-generator to create a number of program variants -n in number -for a given program. The generated code variants are then analyzed by our novel data reuse algorithm -PolyScientist to characterize their cache behavior. We have developed a relative ranking algorithm which ranks the n variants based on their potential performance. The top k variants are selected and are run on the target hardware and the best performing program version is discovered. Thus, PolyScientist narrows down the number of variants to actually run on the target architecture from n to a small, manageable k (k << n). Through our experimental evaluation on convolutions of a range of popular and the state-of-the-art image recognition models, we show that the top k variants picked are some of the best performing variants, and the realized performance is close to and in many cases, higher than that of Intel MKL-DNN library [2], a hand-tuned library for deep learning kernels.
The contributions of the paper are the following:
• To the best of our knowledge, this is the first work that examines the use of microkernels and automatic compilation techniques in an integrated manner. • A novel cache data reuse analysis to characterize a loop nest's behavior with respect to a multi-level cache hierarchy is presented. • We describe a relative ranking heuristic that takes the compiler generated statistics and the system parameters, i.e., cache sizes and ranks the program variants based on their potential performance. • We conduct extensive experiments to demonstrate the efficacy of the developed system in being able to produce high performance on deep learning kernels.
The rest of the paper is organized as follows. The overall architecture of the PolyScientist system is presented in Section 3. Section 4 describes preliminary concepts that will be used later on. In Section 5, we develop the cache data reuse analysis and a poly-ranking system to rank the candidate program variants in terms of performance. Section 6 details the experimental evaluation conducted. The related work is discussed in Section 7 while Section 8 presents the conclusion of this work.
Motivation
The deep learning primitives are computationally intensive and most of the neural network training and inferencing time is spent in them. However, for different layers of the deep neural networks, the optimizations (e.g., loop order, tile sizes in tiled code etc) that need to applied are different. Using a version of the code optimized for one layer of a neural network for all others would yield poor performance. It is this need for custom design for different layers of neural networks (the number of layers in a deep neural network can be large) is what makes generating efficient code for deep learning primitives a challenging problem. To illustrate the need for such tailored loop optimizations we consider the convolution layers of the Fast R-CNN model [19] , one of the leading image recognition CNN models. We generate four variants of convolution code which differ only in the loop order and the rest of the loop structure remains the same for all of them (more details are provided in §6) and measure performance on a 28-core Intel(R) Xeon(R) Platinum 8280 (a.k.a Cascade Lake) CPU server. Figure 1 shows the normalized performance of the code variants on 25 convolution layers of Fast R-CNN: the performance is normalized with respect to the highest performing code among the four variants.
From Figure 1 , we observe that the performance of different versions of the code varies widely from layer to layer. A convolution layer differs from another convolution layer in problem sizes -image sizes, channel widths, filter sizes, strides, and padding. The code version v2 has high performance from layer 1 through 19, and is subpar from layer 20 to 27. The efficiencies of the other versions viz., v1, v3, and v4 are much more widely varying. Using the methodology detailed in the rest of the paper, the compiler technology we have developed -the PolyScientist, we are able to effectively analyze the four code variants and pick the best performing variant whose performance is shown in Figure  2 . The performance achieved by PolyScientist picked code is close to the highest performance among the four variants for all 25 layers of Fast R-CNN. Thus using PolyScientist, using compile-time static analysis alone, we will be able to automatically identify and apply the loop optimizations required for each layer of a deep neural network in order to achieve the best performance. For each generated code variant, the working set sizes are computed as described in §5.1. The statistics calculated for all the variants are then input to the poly-ranking algorithm described in §5.2 and it picks the top k best performing versions. The original microkernels are inserted back into the code of the k picks. The top performing variants selected analytically are now run on the target architecture and and best performing code among the k loop nests is determined.
Overall System Architecture
Microkernel Specification The microkernel function call is annotated with a pragma compiler directive which contains the loop-based functionally equivalent code. The microkernel function call is substituted with the loop based code for the compiler analysis in a pre-processing pass. When the cache data reuse analysis and ranking of the code variants are done, in a post-processing pass, the loop-based inner most loops are replaced with the call to the microkernel. 
Notation
We use the polyhedral model [16] , which is an advanced mathematical framework to reason about dependences and loop transformations, to develop our data reuse algorithm. We use the Integer Set Library [40] for performing polyhedral operations in this work and we use the same notation as used in ISL to elucidate the concepts and the algorithm. The matrix multiplication code shown in Figure 4 will be used to illustrate the workings of the data reuse analysis.
Sets A set is a tuple of variables x i s along with a collection of constraints c k s defined on the tuple variables.
The iteration spaces of loop nests are represented by sets. The iteration space of the loop in Figure 4 is defined as the following set.
Relations A relation is a mapping from input tuple variables x i s to output tuple variables y j s. In addition, a set of constraints c k s can be defined for a relation place constraints on the input/output tuple variables. 
The sole write relation in the loop is:
Apply operation When a relation r is applied on a set s, the domain of r will be intersected with s and the resulting range will be a new set s ′ . The set s ′ is said to be the result of the apply operation. The operation is mathematically defined as:
The data footprint of the loop can be computed by applying read and write relations on the iteration space set:
Polyhedral dependences
The exact data dependences in loop nests can be computed in the polyhedral model and are expressed as maps from source iterations to target iterations involved in the dependence. For cache data reuse analysis developed in §5, we consider four kinds of dependences -Read-After-Read (RAR), Read-After-Write (RAW, a.k.a flow), Write-After-Read (WAR, a.k.a anti), and Write-After-Write (WAW). The data dependencies of the matrix multiplication code in Figure 4 are shown below.
The source to target iteration relationships such as this are expressed in a parametric fashion as the relation d 2 .
Cache Data Reuse Analysis
We develop a polyhedral model based cache data reuse analysis to characterize a loop-nest's behavior with respect to a given cache hierarchy. The analysis computes the various existing data reuses of a program and then for the input cache hierarchy determines which data reuses are exploitable at various levels of cache.
Working set size computation
Each data dependence in a loop is also a case of data reusethe source and target iterations involved in the dependence touch the same data element and therefore, the data is reused. For a data dependence and hence data reuse to be realizable in a given level of cache, all the data elements accessed between the source and target iterations of the dependence -the working set -have to be retained in the cache so that when the execution reaches the target iteration, the data element(s) used in the source iteration will still be present in the cache. Add W S min and W S max to W S all Algorithm 1 computes all the working sets of the input loop nest. First, the input C source file is parsed using the Polyhedral Extraction Tool (PET) [41] to obtain the polyhedral representation of the program, namely iteration space of the loop nest, read and write relations and the schedule (line 1). The exact (and not transitive) RAR, RAW, WAR, WAW dependences are then computed and a union of all the four kinds of dependences is formed (line 2 and 3). The task now is to compute the working set size for each dependence which is carried out from line 5 through 13. For a given dependence, we consider a representative source -the first iteration (lexicographically) of all the source iterations (line 6). We can now compute the target iterations for the lexicographically first/minimum iteration. If the data element that is used in the source iteration is used in multiple subsequent iterations then there may be multiple target iterations for the same source iteration. Therefore, the working sets to exploit the data reuse may vary. For this reason, we compute the first (or lexicographically minimum) and last (lexicographically maximum) iterations of the target iterations (line 7 and 8). The intervening iterations between source and the first target iteration are determined (line 9). Similarly, the iterations between source and the target iteration are derived (line 10). The working sets will be the union of all the read and written data elements between the source and the first/last iterations of the target iteration set (line 11 and 12). Correspondingly, for a dependence we compute two working set sizes -W S min and W S max , if there are multiple target iterations for a source iteration in a given dependence. What this means is, in order to be able to exploit at least one data reuse arising from the dependence d, the cache memory should be capacious enough to hold at least W S min data elements. If all the data reuses are to be realized -till the last target iteration, then the cache should of size equal to W S max in terms of the datatype's elements.
We illustrate the operation of the algorithm using the running example in Figure 4 . Let us examine the dependence:
Of all the source iterations, the first/lexicographically minimum iteration is:
Among the target iterations, the first one is:
and the last one is:
The number of data elements of the three arrays -A, B, C accessed between I sour ce and I min_t ar is derived by applying the read and write relations on the intervening iteration set and it is:
The [1] accessed between the source iteration S[i = 0, j = 0, k = 0] and the target iteration I min_t ar = S[i = 0, j = 1, k = 0] lead to the W S min size of 2K + 3.
The maximum working set size -the size of the data touched between I sour ce and I max _t ar is:
TheW S max size is arrived at by counting the number of array elements accessed between the source iteration -S[i = 0, j = 0, k = 0] and the target iteration -
. Therefore, a total of N × K + N + 1 are read and written. for 8 Add the working sets W S j ∈ W S all that do not fit any cache to W S mem Figure 5 . The DNN architecture for ranking of code variants
Poly-ranking algorithm
We have built a code generator to emit a number of program variants. The code generator creates the loop variants by applying tiling and loop interchange program transformations. The tile sizes are varied as well. The working set size computation analysis - §5.1 is performed on each program version generated. Among the many variants generated, the poly-ranking algorithm described below picks the top k best performing versions, where k is a parameter. In this work, we pick the top 1 variant the code generator produces, i.e., k = 1.
If the working set size corresponding to a data reuse in the program is smaller than the cache size then the data reuse is exploitable in the cache. The poly-ranking system considers caches at different levels (typically L1, L2, and L3) and for each data reuse, determines at what level of cache hierarchy is the data reuse realizable. Algorithm 2 shows the steps to determine the cumulative working set sizes at each level of cache. The inputs to the algorithm are the working set sizes computed for a loop nest, and the cache sizes of the target system. The algorithm determines the fastest level of cache where the working set size corresponding to each data reuse fits and adds it to that cache's working set size. The working set sizes that fit in a particular level of cache L i are denoted by W S L i . If a working set does not fit in any cache, then the data reuse happens out of the main memory. Consequently, the memory's working set size is updated.
Performance cost model based ranking
The running time of the loop is directly related to the latency of the cache where the data reuse occurs as well as the working set size. Furthermore, the running time is inversely related to the bandwidth of the cache. Based on these observations, we define the following cost function:
The latency of cache L i is lat L i while its bandwidth is denoted by bw L i . For each code variant generated, we run the cache data reuse analysis and calculate the above cost function. Then, the variants are ranked in the decreasing order of the value of the cost function: the lower the value, the higher is its assumed performance, and higher is its rank.
DNN-based ranking algorithm
We explore the use of deep neural networks (DNNs) for ranking code variants. For the purposes of training the DNN model, we collect the performance data of code variants generated and the statistics as outputted by Algorithm 2working set sizes at different levels of the memory hierarchy.
We train the DNN model to perform relative ordering of two code variants. We then use a tournament based ranking system to assign ranks to the different code versions created -we play each code variant against every other code variant.
For each variant, we record the number of wins it has accumulated. We then rank the variants based on the number of wins -the higher the number of wins, the higher the rank. Figure 5 shows the architecture of the DNN. We normalize the compiler generated statistics of two code variants in the following fashion and input them to the DNN. We sum the working set sizes of the two variants together: sum = v 1 L 1 +v 1 L 2 +v 1 L 3 +v 1 mem +v 2 L 1 +v 2 L 2 +v 2 L 3 +v 2 mem and divide the individual statistic by this sum. The rationale for considering the sum of the two statistics together is that if one of the variants is creating higher volume working set sizes then its statistics should appear bigger to the DNN. This is because the smaller the working set sizes, we can expect higher performance. Therefore, for the DNN to learn the relative performances of the two variants, it is crucial that it sees the relative sizes of the working set sizes. Normalizing each variant individually (by considering the sum of statistics of one variant alone) would not bring out the differences in the absolute values of working set sizes of the two variants at different cache levels.
We use four intermediate layers of 64, 32, 16, 8 neurons respectively. We use relu, relu, softsign, and relu activation functions for the four intermediate layers. The output layer consists of two neurons and we use the softmax function for the output layer. The values of the two output neurons, because of the use of softmax function, sum to 1. If the output value is above a threshold -θ , we consider it a 1, otherwise a 0. If the first neuron fires a 1, then the first variant is considered the winner. If the second neuron fires a 1, then the second variant is considered the winner. If both of them are zero because none of them are above the threshold, then it is a draw between the two variants. In this work, we set the threshold θ to 0.6.
We experimented with deeper models as well. However, the depth beyond four layers did not have any discernible effect on accuracy.
Experimental Evaluation
We conduct experiments to evaluate the efficacy of the Poly-Scientist system which combines loop optimizations with expert-coded kernels. Since the aim of PolyScientist is to achieve performance competitive with manually optimized code and at the same time retaining the attractiveness of an automatic compiler, we gauge the performance of PolyScientist against a state-of-the-art library created specifically for deep learning networks -the latest version of Intel MKL-DNN [2] viz., v1.0.4 and against the compiler generated code using the Intel icc compiler of version 19.0.3.199.
Set up
We use the PolyScientist system to optimize the convolutions of Resnet-50 [23] , Fast R-CNN (fastrcnn) [19] , Mask R-CNN (maskrcnn) [22] , Xception (xception) [11] , You Only Look Once v2 (yolov2) [30] , MobileNets (mobilenet) [25] , AlexNet (alexnet) [29] , OverFeat (overfeat) [32] GoogLeNet v1 and v3 [34] , and (googlenetv1, googlenetv3), the popular and the state-of-the-art image recognition neural network models. We also gather the performance results from the hand-coded, highly optimized Intel MKL-DNN library for the same convolutions. In this work, we evaluate the performance benefits of the PolyScientist system on CPUs. Since in today's datacenters, CPUs are predominantly used for inference tasks partly due to latency considerations, we study the performance of forward-pass convolutions which are used in the inference tasks while performing image recognition. Figure 6 shows the convolution code with the GEMM (matrix multiplication) microkernel replaced with the equivalent C code. The shown code is data tiled in the input and output channel dimensions. The convolution code has a matrix multiplication operation (denoted GEMM in the code) embedded in it. The loops corresponding to matrix multiplication are moved to the inner most positions so that matrix multiplication is performed on the fastest varying dimensions of the arrays. We use the performance obtained using the code shown in 6 as the baseline.
We use the LIBXSMM [3] implementation for matrix multiplication -the microkernel. PolyScientist performs outer loop optimization around the call to the matrix multiplication microkernel by loop reordering and tiling using various tile sizes. We show the performance obtained by inserting the LIBXSMM microkernel in the code listed in Figure 6 under the banner of Microkernel in the subsequent performance graphs. Comparing the performance of Microkernel with PolyScientist will show the need to perform outer loop tuning as done by PolyScientist to obtain high performance for all layers and for all models.
The experiments are run on the latest Intel servers -Intel(R) Xeon(R) Platinum 8280 (Cascade Lake) CPU servers running at the frequency of 2.70GHz. Each processor has 28 cores, 32KB private L1 cache, 1MB private L2 cache, and 39MB shared L3 cache. Each program is run a 1000 times and the average performance across those runs is reported in the . Performance of fastrcnn layers on a 28-core Intel Cascade Lake server Figure 10 . Performance distribution of code variants for fastrcnn layers on a 28-core Intel Cascade Lake server Figure 11 . Performance of maskrcnn layers on a 28-core Intel Cascade Lake server Figure 12 . Performance distribution of code variants for maskrcnn layers on a 28-core Intel Cascade Lake server paper. The machine has a 512-bit SIMD vector unit and supports AVX-512 vector instructions. Consequently, 16 floating point arithmetic operations can be performed at a time (each floating point number is 32 bits long, and therefore, 16 floating point numbers make up 512 bits: 32×16 = 512). Since the microkernel vectorizes along the input and output channel loops (i f m and o f m loops in the code), to fully utilize the vector unit, the input and output channel widths have to be 16 or multiples of 16. In the CNN models considered, 86% of the convolutions meet this criterion and those convolutions are selected for experimental evaluation. The peak single precision floating point performance of a Cascade Lake processor is~3,300 GFLOPS/s. We set the mini-batch size to 28 and use data parallelism: the convolution operator is applied on 28 images simultaneously.
To train a DNN model for performing ranking of code variants as described in §5.2.2, we use 70% of the experimental data collected (to avoid overfitting). We create a single DNN model using data from all CNN models and use it to rank variants across the CNN models. Figure 13 . Performance of xception layers on a 28-core Intel Cascade Lake server Figure 14 . Performance distribution of code variants for xception layers on a 28-core Intel Cascade Lake server Figure 15 . Performance of yolov2 layers on a 28-core Intel Cascade Lake server Figure 16 . Performance distribution of code variants for yolov2 layers on a 28-core Intel Cascade Lake server Figure 17 . Performance of mobilenet layers on a 28-core Intel Cascade Lake server Figure 18 . Performance distribution of code variants for mobilenet layers on a 28-core Intel Cascade Lake server Figure 19 . Performance of alexnet layers on a 28-core Intel Cascade Lake server Figure 26 . Performance distribution of code variants for googlenetv3 layers on a 28-core Intel Cascade Lake server Figure 7 shows the performance in terms of GFLOPS/s (Giga Floating point Operations per second) of the baseline code, PolyScientist, Microkernel and MKL-DNN on convolutions of Resnet-50. The PolyScientist performance shown is the performance of the top code variant selected using the cost modeling based poly-ranking algorithm described in §5.2.1. The performance of PolyScientist over the baseline is 5X to 10X for all layers. The higher performance of PolyScientist is due to 1) the use of optimized GEMM microkernel and 2) the optimization of outer loops around the call to the microkernel. In most cases, PolyScientist closely matches the performance of MKL-DNN library. In several instances, Poly-Scientist outperforms MKL-DNN, notably for layers with IDs 15, 16, and 17 where the performance gain is 11%. On some layers such as layer 1, MKL-DNN fares better. This is explained by customizations for specific problem sizes including insertion of careful data prefetching instructions in the MKL-DNN library code. In contrast, PolyScientist's approach is automatic and in the case of Resnet-50, we observe that we are able to attain the same performance levels Table 1 . The geometric average of the performance of layers of various models in GFLOPS/s on a 28-core Intel Cascade Lake server as MKL-DNN. The geometric average of GFLOPS/s numbers are also shown in the graph. We report the geometric average of performance results in Table 1 as well for Resnet-50 and all other CNN models. For Resnet-50, we observe that performance of Microkernel and PolyScientist is similar indicating that the original loop order shown in Figure 6 gets good performance. Figure 8 shows the performance distribution of code variants generated for each layer of Resnet-50. The performance is normalized with respect to that of the best performing variant found empirically. The crux of the PolyScientist technology presented in the paper is to rank a given set of code variants using compile-time static analysis. Therefore, the closer the performance of the PolyScientist picked version is to the maximum performance seen by any code variant explored, the more efficacious the PolyScientist algorithms are. In the graph, we show the minimum performance observed, the maximum performance seen, the performance of the code picked per the poly-ranking algorithm ( §5.2.1) -PolyScientist and the performance of the code picked per the DNN based ranking algorithm ( §5.2.2) -PolyScientist-DNN. We note that the performance of PolyScientist version is close to the maximum performance in most layers save layer 19. Even though in terms of cache behavior (PolyScientist primarily models the cache behavior), the variant selected by PolyScientist may be the best, other factors such as prefetching, TLB behavior etc may cause their performance to be lower than those of other variants. The minimum performance seen i.e., the performance of the worst code variant, varies across layers -for layer 12 through 19, the minimum performance is much farther from the maximum performance. For the initial layers however, the different code variants generated perform similarly. In all cases, we note that PolyScientist performs significantly better than the worst variant including layer 19 where PolyScientist picked code is 14% slower than the best performing version and is 28% higher performing than the worst variant. We observe that there is no considerable difference in the performance achieved by cost model based ranking method -PolyScientist, and the DNN based ranking method -PolyScientist-DNN.
Experimental Results
The performance achieved by different methods for convolutions of fastrcnn is shown in Figure 9 . The performance of PolyScientist vis-a-vis the baseline code is anywhere between 4X and 11X across layers. PolyScientist performance is close to the MKL-DNN's. For some layers such as layer 11, PolyScientist is 9% faster while for a few layers notably layer 1, and 3, MKL-DNN performs better. In this case of fastrcnn, we see that PolyScientist outperforms Microkernel significantly clearly showing the need for outer loop tuning in addition to having a high performance implementation of matrix multiplication in the inner most loops. PolyScientist picked code achieves~2X performance gains over the code with the default loop order for layers 4, 7, 8, 10, and 11 while for layer 25, PolyScientist is 56% higher performing. Across all layers of fastrcnn, PolyScientist improves the performance by 28% on average. Figure 10 shows the performance distribution for all layers of fastrcnn. Here, we see that the performance distribution is great: the difference between the performance of the best and the worst code variant seen is vast for all layers except layer 18. We observe that PolyScientist is able to pick a variant whose performance is close to the performance of the best performing version.
From Figure 11 through Figure 26 , we show the performance achieved by various systems and the performance distribution of code variants seen for all other CNN models, namely, maskrcnn, xception, yolov2, mobilenet, alexnet, overfeat, googlenetv1, and finally googlenetv3. In Figure 11 we observe that the performance of two layers of maskrcnn -layer 31, and 32 is very low compared to the machine peak. The reason is, the image sizes for the two layers are 7X7 and 1X1 respectively. Consequently, the amount of work that each core has to perform is less and therefore, MKL-DNN and PolyScientist are not able to attain performance close to the machine peak. For maskrcnn too, we discover that the default loop order -Microkerenel, leaves a lot of performance on the table: for layers 4, 5, 6, 9, 10, 14, 15 , Poly-Scientist gets more than 2X extra performance compared to only the use of the microkernel. For layer 7, PolyScientist is 3X higher performing than Microkernel. Across all layers of maskrcnn, the average performance benefit is 41%. In Figure 14 , we see that PolyScientist picks the right variant for all layers for xception. For yolov2 from Figure 15 , we note that PolyScientist performance closely matches that of MKL-DNN and through Figure 16 , we see there is a great spread in performance of various code variants run. In mobilenet, PolyScientist is as high performing as MKL-DNN ( Figure 17) . Further, the different code variants perform very similarly for all layers of mobilenet ( Figure 18 ). In alexnet, we hardly see any difference in the relative performance across the layers - Figure 20 . In overfeat, PolyScientist is slightly higher performing than MKL-DNN and the performance spread is fair among different code variants generated (Figures 19, and 20) . googlenetv1 and googlenetv3 feature many more unique layers and PolyScientist's performance is slightly better than MKL-DNN's for googlenetv1 and is slightly worse for googlenetv3 (Figures 23, and 23) . Table 1 shows the average performance of the four systems -baseline, PolyScientist, PolyScientist-DNN, and MKL-DNN for various CNN models. The performance of PolyScientist-DNN is consistently slightly better than that of PolyScientist. Further, PolyScientist achieves magnitudes of higher performance compared to the baseline code and is very competitive with respect to the hand crafted MKL-DNN library code. In fact, PolyScientist outperforms MKL-DNN in the case of alexnet, overfeat, and googlenetv1.
Related Work
Researchers have developed auto parallelization and program transformation systems to automatically transform code to achieve high performance [7, 27] . The state-of-theart polyhedral compiler -Pluto [7] derives a schedule for the code that attempts to minimize data reuse distances. The effect of the Pluto transformation will be that the iterations that use the same data will be executed close to each other in time and therefore, it will be cache friendly. However, Pluto's performance can be far from what we can achieve with the use of microkernels that exploit the vector hardware effectively and by doing outer loop tuning in the way we have developed this work. Our initial experimental evaluation comparing our work with Pluto revealed that its performance can be off by as much as 100X. Kong et al. [27] develop a framework to decompose a program into sub-parts and use customized criteria (as opposed to using a single objective function) -such as stride optimization, outer loop optimization, inner loop optimization to transform code. They show that their work without tiling transformation is able to achieve comparable results to that of Pluto.
Compile-time modeling of cache behavior and in particular calculating the number of cache misses has been an active area of research [5, 18, 20] . The researchers have demonstrated good accuracy in predicting the number of cache misses on simulators. The modern computer architectures employ a hash based scheme to map memory addresses to cache sets [44] which breaks the assumptions behind the cache miss analysis. Consequently, their usefulness in modeling the performance of programs will suffer. In the present work, we model the behavior of caches as well. However, we do not model cache misses rather we consider data reuses and determine the size of cache needed to exploit the data reuses under conservative conditions. We ignore streaming accesses as their misses in cache will not be crucial in the resulting performance. Because of the these techniques, we show that we are able to accurately rank code variants in terms of performance. Latte [39] is a domain specific language and run time for coding of deep neural networks. Latte relies on pattern matching to transform code patterns to library calls. For example, when a convolution operation is recognized, a library call is made to the CPU-specific library, say Intel MKL-DNN. Our work in this paper could be complementary to that of Latte where instead of relying on an external hand coded library, Latte could be made to use PolyScientist for high performance implementations of deep learning primitives.
Strout et al [33] introduce the novel concept of universal occupancy vector (UOV) for storage optimization -expand the arrays to a minimal degree so that the false dependences are eliminated paving the way for better scheduling of the code. Thies et al [37] develop methods to find a good schedule when the storage mapping is fixed and vice versa. Ding et al [15] introduce approximate reuse distance analysis and sampling based profiling to predict whether a given application exhibits regular data accesses or irregular data accesses. Xiang et al [43] develop a theory of locality and show how different locality metrics are related to each other.
Autotuning systems using parametric tiling [6, 13, 21, 31, 35, 36] perform tile size exploration with tiled code generated with tile sizes as parameters (rather than hard-coded tile sizes). The parametric tiling technology reduces the code generation time during auto-tuning as the same code can be used for multiple tile size explorations. AutoTVM [10] is targeted at accelerating deep learning workloads and uses machine learning to guide auto-tuning of deep learning primitives. Tiramisu [4] is a polyhedral model based compiler framework that introduces a scheduling language to allow the programmer to explore various program transformations. Active Harmony [12] determines the parameters that are critical for performance among the set of parameters that need to be auto-tuned in a given application and focuses on optimizing them first in order to accelerate the process of auto-tuning. CHiLL [8] is a compiler framework that assists an auto-tuner in generating code for schedules expressed in a high level language. Tiwari et al [38] combine Active-Harmony and CHiLL to systematically explore the space of program transformations. All of the above auto-tuning systems, while useful, incur a huge expense in terms of computation cost and time. The compiler algorithm developed in the paper narrows down the search space to only a small number of code variants thereby saving the auto-tuning cost.
TVM [9] , a compiler for deep learning, introduces the concept of tensorization, where a unit of computation can be replaced with a microkernel written using hardware intrinsics. The TVM tensorization tool can be complementary to our work -our work can be leveraged within TVM to perform optimization of outer loops around tensorized code, much like how we optimize code around the microkernel in this paper.
Conclusion
In this paper, we presented a novel cache data reuse algorithm. We leverage the reuse analysis to perform relative ordering of program variants. Using these techniques, we develop a framework to generate high performing code for deep learning primitives. The framework also integrates the expert-coded microkernels, further enabling the development of high performing code that achieves performance nearly identical to that of a hand-coded library. The presented methodology will allow the creation of deep learning kernels in an automated fashion easing the burden of writing hand-tuned libraries and at the same time, realizing stateof-the-art efficiencies in utilizing modern dense processing engines.
