Abstract. Tensor operations are surging as the computational building blocks for a variety of scientific simulations and the development of highperformance kernels for such operations is known to be a challenging task. While for operations on one-and two-dimensional tensors there exist standardized interfaces and highly-optimized libraries (BLAS), for higher dimensional tensors neither standards nor highly-tuned implementations exist yet. In this paper, we consider contractions between two tensors of arbitrary dimensionality and take on the challenge of generating highperformance implementations by resorting to sequences of BLAS kernels. The approach consists in breaking the contraction down into operations that only involve matrices or vectors. Since in general there are many alternative ways of decomposing a contraction, we are able to methodically derive a large family of algorithms. The main contribution of this paper is a systematic methodology to accurately identify the fastest algorithms in the bunch, without executing them. The goal is instead accomplished with the help of a set of cache-aware micro-benchmarks for the underlying BLAS kernels. The predictions we construct from such benchmarks allow us to reliably single out the best-performing algorithms in a tiny fraction of the time taken by the direct execution of the algorithms.
Introduction
Tensor contractions play an increasingly important role in various scientific computations such as general relativity and electronic structure calculations in quantum chemistry. Computationally, contractions are generalizations of matrixvector and matrix-matrix products that involve operands of higher dimensionality. While there are several highly-tuned implementations of the Basic Linear Algebra Subprograms (BLAS) [1] [2] [3] for operands with up to 2 dimensions, there are no equivalently standardized high-performance libraries for general tensor contractions. Fortunately, just as matrix-matrix products can computationally be decomposed into a sequence of matrix-vector products, most higher dimensional tensor contractions can be cast in terms of matrix-matrix or matrix-vector BLAS kernels. However, each tensor contraction can be computed via BLAS kernels in many, even hundreds, of different ways, each with its own performance signature. This work addresses the problem of accurately predicting the performance of BLAS-based algorithms for tensor contractions.
One could argue that only algorithms that use the gemm kernel 1 are real candidates to achieve the best performance; while for the most part this observation is true, due to the fact that in practical contractions it is often the case that one or more dimensions are very small (while BLAS is mostly optimized for large dimensions), the difference in performance between two gemm-based algorithms can be dramatic. At any rate, with this work we aim at the accurate prediction of any BLAS-based contraction, irrespective of which kernel is used. Our approach, which never resorts to timing a full algorithm, makes use of what we call microbenchmarks. These are benchmarks that only execute one BLAS operation in a prescribed memory environment. The idea is to analyze the structure of the code, and determine the status of the cache (precondition) prior to the execution of the kernel; we recreate carefully such status within the micro-benchmark so that the specific kernel can be timed in conditions analogous to those experienced in the actual algorithm. Based on these timings, we extrapolate the total algorithm execution times with sufficient accuracy to single out the fastest algorithms. This micro-benchmark-based prediction proves to be several orders of magnitude faster than executions of the actual algorithms.
Tensor Notation. In the following, we denote tensor contractions by means of the Einstein notation; 2 let us briefly explain said notation by means of an example. In the contraction C abc = A ai B ibc , the entries C Related Work. The most prominent project targeting the efficient computation of tensor contractions is probably the Tensor Contraction Engine, a compiler built specifically for multi-tensor multi-index contractions to be executed within memory constraints [4] ; in light of the wide diffusion and nearly optimal efficiency of the BLAS library, an extension to TCE was proposed to compute contractions via BLAS operations [5] . In the same spirit, we provided simple rules to build a taxonomy for all contractions between two tensors, identifying which BLAS routines are usable and how to best exploit them [6] .
There also exists a variety of work in the field of performance prediction in the context of dense linear algebra. A notable example is Iakymchuk et al. [7, 8] , where the authors model the performance of dense linear algebra algorithms analytically based on very detailed models of the occurring cache-misses. Also, in [9] , we use measurement-based performance models to predict the behavior of blocked algorithms. However, none of these works target or address highperformance tensor contractions and their peculiarities, i.e., very regular patterns in routine invocation and memory access, but highly skewed dimensionality (tiny sizes for at least one of the dimensions).
Structure of the Paper. The rest of this paper is structured as follows. The systematic generation of BLAS-based algorithms for tensor contractions is discussed in Sec. 2. Our performance prediction framework is introduced in Sec. 3, and experimental results for a range of contractions are presented Sec. 4.
Algorithm Generation
In this section, we briefly explain how to systematically generate a family of BLAS-based algorithms for a tensor contraction. For a detailed discussion of the topic, we refer the reader to [6] .
Aware of the extreme level of efficiency inherent to the best BLAS implementations, our approach for computing a contraction consists in reducing it to a sequence of calls to one of the BLAS kernels. Since BLAS operates on scalars, vectors and matrices (zero-, one-and two-dimensional objects), tensors must be expressed in terms of a collection of such objects. To this end, we introduce the concept of slicing: With the help of Matlab's ":" notation, 3 slicing a d-dimensional operand Op ∈ R n1×n2×···×n d along the i-th index (or dimension) means creating the n i (d−1)-dimensional slices Op[:, . . . ,:
i−1 ,k, :, . . . ,:
], where k = 1, . . . , n i . Example 1. Consider the matrix-matrix product C ab := A ai B ib . The slicing of the matrix B along the b dimension reduces the matrix to a collection of column vectors; accordingly, the matrix-matrix product is reduced to a sequence of matrix-vector operations:
Similarly, a multi-dimensional tensor contraction can be reduced to operations involving solely matrices and vectors.
Depending on the slicing choices, a contraction is reduced to a number of nested loops with one of the following kernels at the innermost loop's body:
• dot (vector-vector inner product: α := x T y), • axpy (vector scaling and addition: y := αx + y), -BLAS-2:
• gemv (matrix-vector product: y := Ax + y),
• ger (vector-vector outer product: A := xy T + A), and -BLAS-3:
• gemm (matrix-matrix product: C := AB + C).
Notice that to comply with the BLAS interface, the elements in one of the two dimensions of a matrix must be contiguous. Therefore, algorithms that rely on gemv, ger" or gemm as computational kernel may require a temporary copy of slices prior and/or after the invocation of the corresponding BLAS routine. As case study, let us consider the contraction
which is visualized as follows: Instead of a blind search for appropriate slicings, we generate algorithms by following a goal-oriented approach: For each of the five kernels of interest, we know the dimensionality required for each operand; accordingly, we deduce how many slices are needed and which combination of free/contracted indices to slice. Table 1 (left) exhibits, for each kernel, the conditions necessary for a contraction to be computed in terms of that kernel. In particular, the second and the third columns indicate how many contracted and free indices, respectively, appear in each kernel. A and B refer to the first and the second input operand of the kernel; in a contraction between tensors of arbitrary dimension, all the indices beyond what indicated in these columns must be sliced.
Example 2. Since gemm involves one free index in each of its operands A and B, and one contracted index (common to both A and B), in order to reduce a contraction to a sequence of gemm calls, one must slice all free indices of A but one, all free indices of B but one, and all contracted indices but one. With reference to (1), this is achieved by slicing either dimension b or c, resulting in the two algorithms (b-gemm and c-gemm) 5 shown in the last two examples of Algs. 1. Algs. 1: C abc = A ai B ibc : 9 exemplary algorithms out of 36.
6
As already mentioned, given a contraction, there is no obvious a-priori choice of kernel and slicings to attain the highest performance. We therefore generate all possible combinations. Moreover, due to their impact on performance and to further stress our modeling tool, we generate all the permutations of the loops.
We developed a small algorithm and code generator that produces all such algorithms, constructs for each of them a C-implementation, as well as an ab- 5 The algorithm names are composed of two parts: the first part is the list of sliced tensor indices iterated over by the algorithm's loops and an apostrophe ′ for each copy-kernel, while the second part is the name of the used BLAS-kernel. Table 1 : Rules for tensor slicing to obtain a given BLAS kernel. Left: how many contracted and how many free indices appear in the operation corresponding to a kernel. Right: different slicings make it possible to express one contraction in terms of different kernels. The names in the rightmost column refer to the algorithms in Algs. 1.
Kernel
Number of indices Examples from C abc = AaiB ibc contracted free kernel sliced resulting indices indices algorithm
stract syntax tree (AST) representing its loop-based structure. The ASTs are then passed to the prediction tool introduced in the following section.
Performance Prediction
In this section, we present how to accurately model the performance of algorithms that compute tensor contractions through BLAS kernels. These algorithms consist of one or more nested loops and cast all the computation in terms of one single BLAS kernel. Taking advantage of this structure, we aim at estimating the execution time of a target algorithm with the help of only few micro-benchmarks of the kernels and with no direct execution of the algorithm itself. In order to obtain reliable estimates, the micro-benchmarks need to be executed in a setup that mirrors as closely as possible the computing environment (most importantly the cache) within the contraction algorithm. In the following, we incrementally go through the steps required to build a meaningful "replica" of the computing environment. Throughout this section, we track the changes in the performance prediction by considering the exemplary contraction C abc = A ai B ibc . We chose the tensors A and B of size i = 8 and a = b = c = 8, . . . , 1024 -a deliberately challenging scenario due to the thin tensor dimension i, for which BLAS kernels are generally not optimized. Our generator produces 36 algorithms for the considered contraction, some of which are shown in Algs. 1:
In this section, to focus our attention, we will only consider the BLAS-2 and BLAS-3 based algorithms (i.e., with kernels gemv, ger, and gemm).
We execute these algorithms on 1 core of an Intel Harpertown E5450 CPU
7
linking with the OpenBLAS library [10] . Figure 1a displays the performance, in terms of computed floating point operations per clock cycle (flops/cycle), measured for each algorithm; our goal is to accurately reproduce, without executing the algorithms, such performance profiles. While it is evident that only two of the algorithms -the gemm-based c-gemm ( ) and b-gemm ( ) -are competitive, we aim at predicting the behavior of all the algorithms to demonstrate the broad applicability of our methodology.
Repeated Execution
The first, most intuitive, attempt to predict the performance of an algorithm relies on the isolated and repeated measurement of its BLAS kernel. We implemented this approach by executing each kernel ten times and extracting the median execution time; the corresponding estimate is then obtained by multiplying the median by the number of kernel invocations within the algorithm. In our example, this boils down to multiplying the kernel execution time with the product of all loop lengths.
The performance profiles predicted by this first, rough approach are shown in Fig. 1b . By comparing this figure with the reference Fig. 1a , it becomes apparent that while the two top algorithms are already correctly identified, the performance of almost all algorithms is consistently overestimated. In other words, when executed as part of the algorithms, the BLAS kernels take longer to complete than in the isolated micro-benchmarks. The reason for this discrepancy is that the micro-benchmarks invoke the kernels repeatedly, with the same memory regions as operands, i.e., they operate on warm data (the operands remain in the CPU's cache). Within the algorithm, by contrast, at least one, and potentially even all of the operands, vary from one invocation to the next, i.e., the kernels operate at least partially on cold data.
Operand Access Distance
In order to improve the accuracy of the predictions, the idea is to first identify the status of the cache in the algorithm prior to the invocation of the BLAS kernel ("precondition"), and then to replicate such a status in the micro-benchmark. For this purpose, each algorithm is symbolically analyzed to reconstruct the order of memory accesses involving the kernel's operands. For each operand, we determine the set of memory regions M that were loaded into cache since its last access, and define the access distance as the sum of the size of these regions M . Once the access distances for all operands of a kernel are determined, we can create an artificial sequence of memory accesses to reconstruct the cache precondition. Based on this cache setup, the BLAS kernels are timed in a microbenchmark that closely resembles the actual execution of the algorithm. As before, these micro-benchmarks are repeated and timed ten times to yield a stable median. From the median, the performance of the algorithms is again obtained based on the number of kernel invocations per algorithm execution.
To predict which memory regions are in cache, we assume a fully associative Least Recently Used (LRU) cache replacement policy 8 and sum up the size of all memory regions accessed since an operand's last use, yielding the access distance. In first instance, we also assume that all loops surrounding the kernel are somewhere in the middle of their traversal (i.e., not in their first iteration); this assumption will be lifted later.
We now describe how to obtain the access distance for each of the operands. The presented method is general and allows for any combinations of loops and multiple kernels within the abstract syntax tree (AST), however for the sake of clarity, we limit the discussion to ASTs that only consist of a series of loops with a single call to a BLAS kernel at their innermost loop.
For each operand Op, we examine the algorithm's AST (see Sec. 2) with the kernel of interest as a starting point. The AST is traversed backwards until the previous access to Op (or the AST's root) is found, thereby collecting all other operands involved in kernels in the initially empty set M . Going up the AST, three different cases can be encountered. Op referred to a different memory region in the previous iteration of the loop. As a result, it is safe to assume that at least all memory regions covered by all kernel operands throughout these loops were accessed since the last access to Op. Hence, all operands are added to M and the memory regions are symbolically joined along the dimensions the loop iterated over. Since a previous access to Op was not yet detected, the traversal proceeds by going up one level in the AST, and applying the method recursively: the surrounding loop now takes the role of the starting node and we look for a previous access Op joined across this loop. In this case, the considered region is accessed only once (and for the first time). Since we do not know how the contraction is used (within a surrounding program), we can generally not make any assertions on the access distance. For the purpose of this paper, in which we execute the contraction repeatedly to measure its performance, however, we assume that no further memory regions were loaded since the last invocation of the contractioni.e., we compute the access distance from the previously collected memory regions in M .
Based on the such obtained access distances for each operand of an algorithm's kernel, we now construct a list of memory accesses that emulates the accesses within the algorithm prior to the kernel's execution. This list consists of accesses to the kernel's operands, interleaved with accesses to remote memory regions, in order to flush portions of the cache corresponding to the access distances: First, we access the operand with the largest access distance, then a remote region that accounts for the difference to the next smaller access distance, followed by the next operand, and so on until the operands with the smallest access distance followed by a remote access of this size. If the access distances to the first operands in this list are larger than 5 4 times the cache size, the list is truncated down to this limit at the front. The thus obtained benchmark, consisting of the setup followed by the kernel invocation, is once more executed ten times. The median of the kernel run-times of these ten benchmarks is then used to compute our second execution time estimate.
In Fig. 1c , we present the flops/cycle performance of our new estimates. These predictions are much closer to the measured performance (Fig. 1a) than the first rough estimates (Fig. 1b) . For several algorithms (such as ic-ger ( ), Algs. 1), the error is already within a few percent; for many others instead, the predictions are still off. In particular, the performance of some algorithms -for instance, bi-ger ( ) (see Algs. 1) -is underestimated; this is due to the fact that based on the access distance, certain operands are placed out of cache, while in practice they are (partially) brought into cache through either prefetching or because they share cache-lines across loop iterations. We address this discrepancy by further refining our micro-benchmarks.
Cache Prefetching
In the considered type of tensor contraction algorithms, prefetching of operands or sharing of cache lines across loop iterations occur frequently. Such prefetching situations occur when a certain set of conditions are met, namely:
1. the operand varies across the directly surrounding loop, and 2. the iterator of this loop indexes -either the first dimension of the operand, -or its second dimension, while the first is accessed entirely, or fits in a single cache-line.
As part of our AST-based algorithm analysis, such conditions are tested; when both of them are met, we can use a slight modification of the previously introduced method to compute the prefetch distance, i.e., how long ago the prefetching occurred. These prefetch distances are then integrated into the micro-benchmark's setup list just like the access distances, only that for prefetch accesses the access is limited to one cache-line along an operand's first dimension. (Since this setup consists only of accesses to the operands, it becomes redundant in out micro-benchmarks, because each of the ten repetitions will already touch all operands for the next repetition; hence, in such a case, we omit the setup altogether.)
Now accounting for prefetching, we obtain the performance estimates shown in Fig. 1d . Here, several algorithms, e.g. ba-gemv ( ), are estimated closer to their measured performance. However, several other algorithms, including cagemv ( ) are overestimated in performance (i.e., underestimated in execution time). There are two separate causes for this discrepancy.
-In several algorithms, such as ca-gemv ( ), where prefetching implicitly happens due to sharing of cache-lines, the prefetcher fails once a new cacheline is reached.
-In other algorithms, such as bi-ger ( ), the innermost loop is so short (here: 8 iterations) that each first iteration of the loop significantly impacts performance.
These two causes are teated in separately in the following sections.
Prefetching Failures
For those algorithms in which certain operands are identified as prefetched because they share cache lines across iterations (i.e., the surrounding loop indexes their first dimension), the CPU would need to prefetch the next cache-line every 8 iterations (1 cache-line = 8 doubles). However, as a detailed analysis of handinstrumented algorithms has shown, the CPU fails to do so. As a result, in every 8th iteration of the innermost loop, the operand is not available and the kernel may take significantly longer.
We account for this prefetching-artifact by performing two separate microbenchmarks: one simulating the 7 iterations, in which the operand is available in cache, as before, and one for the 8th iteration, where we account for the failure to prefetch and eliminate the emulated prefetching from our setup-list. The prediction for the total execution time is now obtained from weighting these two benchmark timings according to their number of occurrences in the algorithm and summing them up. 
First Loop Iterations
The predictions for several algorithms, such as ci-ger ( ), are still severely off, because the innermost loop of these algorithms is very short (in our example 8 iterations long). In such a case, the predictions are very accurate for all but the first iteration. Due to vastly different cache preconditions for this first iteration, however, its performance can be significantly different (in our case, up to 10× slower). Combined with the low total iteration count, this results in predictions that are off by a factor of up to 2.
To treat such situations, we introduce separate benchmarks to predict the performance of the first iteration of the innermost loop (and further loops if their first iterations account for more than 1% of the total kernel invocations). For this purpose, the access distance evaluation method is slightly modified: instead of the kernel itself, the starting point is now the loop whose first iteration is considered, and the set M already contains all of the kernel's memory regions joined across this loop. From these improved access distances, the cache setup and micro-benchmark are performed just as before. As for the "prefetching failures", the prediction for the total execution time is now obtained from weighting of all relevant benchmark timings with the corresponding number of occurrences within the algorithm.
In Fig. 1f , we present the improved performance predictions obtained from this modification. The performance of all algorithms is now predicted with satisfying accuracy.
Results
In order to showcase its applicability and effectiveness, in this section we apply our technique for performance prediction to a range of contractions. We consider three test cases: In Sec. 4.1, we use different hard-and software, as well as changing the problem sizes. In Sec. 4.2, we consider a contraction that only allows the use of BLAS-1 and BLAS-2. Finally, in Sec. 4.3, we consider a more complex contraction with numerous alternative algorithms and multithreading.
Test 1:
We commence with the same contraction used as case study in the previous section, yet with an entirely different setup: the sizes of a, b, and c are now fixed to 128, while the value of i ranges from 8 to 1,024. As experimental environment, we use a 10-core Intel Ivy Bridge-EP E5-2680 v2 processor running at a frequency of 3.6 GHz (Turbo) and 25 MB of L3 cache. Each core can execute 8 double precision flops/cycle. The routines for both the actual measurements and the micro-benchmarks were linked to the Intel Math Kernel Library (MKL, version 11.0) BLAS implementation. Figure 2 contains the performance measurements and the corresponding predictions for all 36 algorithms (see Algs. 1). Although everything, ranging from the problem size to the machine and BLAS library was changed in this setup, the predictions are of equivalent quality and our tool correclty determines that the gemm-based algorithms (c-gemm ( ) and b-gemm ( )) perform best and equally well. 
Test 2:
C a = A iaj B ji , only BLAS-1 and BLAS-2
For certain contractions (e.g., those involving 1D tensors), gemm cannot be used as a compute kernel, and only algorithms based on BLAS-2 or BLAS-1 are possible. One such scenario is encontered in the contraction C a = A iaj B ji , for which our generator yields 8 BLAS-based algorithms:
The measured and predicted performance for these algorithms is shown in Fig. 3 . Our predictions clearly discriminate the fastest algorithm j-gemv ( ) across the board. Furthermore, the next group of four algorithms is also correctly identified and the low performance (due to the overhead of the involved matrix-copy operation) of the second gemv-based algorithm i ′ -gemv ( ) is predicted too.
Test 3:
C abc = A ija B jbic , Challenging Contraction
We now turn to a more complex example: C abc = A ija B jbic . For this contraction, we look at a total of 176 different algorithms: -48 dot-based, -72 axpy-based, -36 gemv-based, -12 ger-based, and -8 gemm-based:
All gemm-based (see Algs. 3) and several of the gemv-based algorithms involve copy operations to ensure that each matrix has a contiguously stored dimension, as required by the BLAS interface. Once again, we consider a very challenging scenario where both contracted indices are of size i = j = 8 and the free indices a = b = c vary together. Starting with the predictions, in Fig. 4a , we present the expected flops/cycle of the 176 algorithms, where BLAS-1 and BLAS-2 algorithms are grouped by kernel. Even with the copy operations, the gemm-based algorithms are the fastest ones. However, within these 8 algorithms, the performance differs by more than 20%. Focusing on the gemm-algorithms, we compare with corresponding performance measurements 10 in Fig. 4b . The comparison shows that our predictions clearly separate the bulk of fast algorithms from the slightly less efficient ones.
Multithreading. The algorithms considered here can make use of shared memory parallelism by employing multithreaded BLAS kernels. To focus on the impact of parallelism, we increase the contracted tensor dimension sizes to i = j = 32 and use all 10 cores of the Ivy Bridge-EP CPU with OpenBLAS. Performance predictions and measurements for this setup are presented in Fig. 5 . 
Efficiency Study
The ultimate goal of this work is to automatically and quickly select the fastest algorithm for a given tensor contraction. The experiments presented so far provide evidence that our automated approach successfully identifies the fastest algorithm(s). With this last experiment, we investigate the efficiency of our micro-benchmark-based approach. For this purpose, we use again the contraction C abc = A ai B ibc , with i = 8 and varying a = b = c. Figure 6 displays the ratio of how much faster our micro-benchmark is compared to executing the corresponding algorithm. In general, our prediction proves to be several orders of magnitude faster than the algorithm itself. At a = b = c = 1,000, this relative improvement is smallest for the gemm-based algorithms ( ) at 10 3 ×, since each gemm performs a significant portion of the computation; for the ger-based algorithms ( ), it lies between 6 · 10 3 and 10 4 × and for the gemv-based algorithms ( ) the gain is 5 · 10 5 to 10 6 ×; finally, the gain for both BLAS-1-based algorithms ( , ), where each BLAS-call only performs a tiny fraction of the contraction, our prediction is between 6 and 9 orders of magnitude faster than the execution.
Conclusion
In this paper, we focused on the performance prediction of BLAS-based algorithms for tensors contractions. First, based on previous work, we developed an algorithm and code generator that given the mathematical description of a tensor contraction, casts the computation in terms of one of five different BLAS kernels; since, in general, a tensor contraction may be decomposed in terms of matrix and vector products in many different ways, the generator often returns dozens of alternative algorithms.
Then, we tackled the problem of selecting the fastest algorithms without ever executing them. Instead of executing the full algorithms, our approach is based on timing the BLAS kernels in a small set of micro-benchmarks. These micro-benchmarks are run in a context that emulates that of the actual computation; thanks to a careful treatment of cache-locality and a model of the cache prefetcher's behavior, our performance prediction tool is capable of identifying the best-performing algorithms in a tiny fraction of the time required to actually run and time all of them.
The quality of the predictions was showcased for a number of challenging scenarios, including contractions among tensors with small dimensions, contractions that can only be cast in terms of BLAS 1 and BLAS 2 kernels, and multi-threaded computations.
