The efficiency of tensor contraction is of great importance. Compilers cannot optimize it well enough to come close to the performance of expert-tuned implementations. All existing approaches that provide competitive performance require optimized external code. We introduce a compiler optimization that reaches the performance of optimized BLAS libraries without the need for an external implementation or automatic tuning. Our approach provides competitive performance across hardware architectures and can be generalized to deliver the same benefits for algebraic path problems. By making fast linear algebra kernels available to everyone, we expect productivity increases when optimized libraries are not available.
INTRODUCTION
Tensor contraction (TC) is an important and widely used computational pattern. A general ddimensional tensor T ∈ R n u 0 ×···×n u d −1 can be defined as the set of scalar elements indexed by the set of indices n u 0 . . . n u d −1 , T ≡ {A n u 0 ···n u d −1 ∈ R|(n u 0 , . . . , n u d −1 ) ∈ n u 0 × · · · × n u d −1 }.
Let A, B, and C be d A -, d B -, and d C -dimensional tensors, respectively. Let the free and the contracted indices of the tensor A be grouped into two bundles I = i 0 . . . i r −1 and P = p 0 . . . p t −1 , respectively. Similarly, the indices of B are grouped into bundles J = j 0 . . . j s−1 and P and the indices of C are grouped into bundles I and J . TC of tensors A and B into tensor C can be represented as C π C (I J ) = P α · A π A (I P ) · B π B (P J ) + β · C π C (I J ) , where P = n p 0 −1 p 0 =0 . . . SCoPs are defined in intermediate representations (e.g., LLVM IR (Lattner 2002) , SSA (Stallman 1999) ) used internally by compilers and compiler front-ends (e.g., Clang (Lattner 2002) , GCC (Stallman 1999) ), such that they can be extracted, optimized, and transformed into the corresponding intermediate representations (Grosser et al. 2012) . To perform optimizations in production compilers where the compile time is a crucial factor, every basic block, which is a sequence of code with no branches in (out) except for the entry (at the exit), is represented as a SCoP statement. Although it prevents the independent scheduling of statements belonging to the same basic block, it provides greater scalability.
Components of the Polyhedral Representation
For SCoPs, program regions that satisfy the requirements of the polyhedral model (Section 2.2), each compute statement is described by three components: iteration domain, scheduling function, and access relation. These components are defined using Z-polytopes (Loechner and Wilde 1997) and Presburger relations (Section 2.1).
An iteration domain describes the dynamic instances of the SCoP statement. It is represented as a Z-Polytope defined by affine constraints on the global parameters and iteration variables of loops enclosing the SCoP statement. The iteration domain of the SCoP statement S from Listing 1 has the form {S (i, j, p 
A scheduling function defines the execution order of the individual dynamic instance of a SCoP statement. It is described with a Z-polytope that relates dynamic statement instances to their execution time vectors. The lexicographical ordering of the execution time vectors defines the execution order for all dynamic instances. The scheduling function for the SCoP statement S from Listing 1 has the form {S (i, j, p) → (i, j, p)}.
An access relation maps dynamic instances of a SCoP statement to the array element(s) that they access. It is described using a Z-polytope that defines the relation between the iteration vector and the accessed array subscript. The pattern in which the array elements are accessed is evaluated using the stride of the access relation calculated with respect to the innermost loop. The stride is the distance between the memory accesses of two subsequently executed statement instances (Grosser et al. 2012 ). Accesses to scalar variables are modeled as accesses to zero-dimensional arrays. To distinguish the commonly multiple memory accesses of a SCoP statement, the access relations of the individual memory accesses are written in the expression evaluation order in the original program. The access relations describing the memory accesses of SCoP statement S from Listing 1 are {S (i, j, p) → A(i, p)}, {S (i, j, p) → B(p, j)}, {S (i, j, p) → C (i, j)}, {S (i, j, p) → C (i, j)}, where the first three relations represent reading from the memory and the last represents writing to the memory.
TC-LIKE KERNEL
We introduce a TC-like kernel as a set of programs with a data usage pattern that is similar to that produced by TC (Section 4). We show how it can be detected and mapped to matrices that may be optimized using approaches developed for MMA [×, +] . The TC-like kernel is defined as follows:
Definition 3.1. A TC-like kernel is a perfectly nested set of loops such that:
• It satisfies the requirements of the polyhedral model.
• Without loss of generality, it contains three nonempty bundles of one-dimensional forloops with induction variables that are grouped into bundles I = i 0 . . . i r −1 , J = j 0 . . . j s−1 , and P = p 0 . . . p t −1 , and they are incremented by one.
34:5
• The innermost loop body can be represented as a statement of the form C π C (I J ) = E(A π A (I P ) , B π B (P J ) , C π C (I J ) ), where A π A (I P ) , B π B (P J ) , C π C (I J ) are accesses to tensors A, B, C, respectively, π C (I J ), π A (IP ), and π B (P J ) are permutations of the enclosed indices, and E is an expression that contains reads from the tensors A, B, C, and an arbitrary number of reads from constants with respect to bundles I, J , and P.
According to the definition of a TC-like kernel, the following sufficient conditions for a SCoP to be a TC-like kernel can be stated. It has only one SCoP statement S as well as t true dependencies, t anti-dependencies, and t output dependencies. The dependencies have the form ∀i ∈ 0 .
, where t 1 * t 2 is a unique tuple such that each entry of tuples t 1 and t 2 appears only once and they appear in the same order as in the underlying set. Thus, it is legal to interchange the loops of I and J . In the case of P, it is necessary to check that either bundle P contains only one element or an associative operation is used to update C; otherwise, interchanging the loops of bundle P can violate the dependencies. According to the form of the innermost loop body, the SCoP should contain reads of the form S (.
, and possibly an arbitrary number of reads that have stride zero with respect to the loops with the induction variables of I, J , and P. The access relations of the individual memory accesses are written in the expression evaluation order of the original program (Section 2), so the last access relation should have the form S (. . . ) → C π C (I J ) and correspond to the last memory access in S that writes to memory.
To detect TC-like kernels, it is sufficient to check the specified conditions. We can verify that the last memory access is a store that writes the result from the TC-like kernel. It allows us determine the union of bundles I and J . Subsequently, we can check that ∀i ∈ 0 . . . t − 1, we only have true dependencies and anti-dependencies of the form
. This can also help to determine bundle P. Next, we can check that the SCoP contains at least three read accesses and determine bundles I and J . We can verify that all additional read memory accesses that do not correspond to the operands of the TC-like kernel have stride 0 if the innermost loop is exchanged with any of the loops. Finally, we can check that either bundle P contains only one element, or we use an associative operation in the case of generalized summation.
To optimize the detected TC-like kernels, we employ an approach that allows the mapping of tensors to matrices and apply optimized MMA[×, +] routines (Matthews 2016 ). This method is based on the new data layout, which provides a mapping between the elements of tensors and their locations in memory. An example of a commonly used data layout for matrices is the row-major data layout, where the elements are assigned successive locations moving across the first row and then continuing across subsequent rows. In the case of tensors, this can be generalized such that the tensor element
For example, if its dimensions are grouped into two bundles I and J , then the location can be represented as
If we consider the location of element M i j for an M×N matrix, i · n j + j, then we can deduce formulae for computing I , n j , and J for the new data layout, which are
n j l , and
n j l , respectively. Subsequently, new induction variables can be used to perform any type of logical matrix operation (Matthews 2016) . 1 In our case, the same formulae can be used to modify the scheduling function for the SCoP statement representing the TC, thereby operating on new induction variables and matrices instead of tensors. In the following sections, for the sake of simplicity, we let the innermost loop body of the TC-like kernel have the form [j] are accesses to matrices A, B, and C, respectively, and E is an expression that contains reads from the matrices. This representation allows us to consider only the algorithm for the optimization of MMA [×, +] and shows that it can be applied to TC-like kernels (Section 4).
OPTIMIZING A TC-LIKE KERNEL
In this section, we present an algorithm for transforming the TC-like kernel into an implementation that is structured in a similar manner to an expert-optimized kernel. First, we describe the expert-designed MMA[×, +] implementation in BLIS (Van Zee and van de Geijn 2015) and then discuss our optimization, which uses general purpose compiler transformations based on polyhedral modeling to obtain a MMA[×, +] implementation with comparable performance to an expertoptimized kernel. 
Expert Implementation of MMA[×, +]
An expert implementation of the MMA[×, +] algorithm is an implementation tuned by dense linear algebra experts. Examples of these implementations can be found in GotoBLAS (Goto and Geijn 2008) and its successors, OpenBLAS (Xianyi et al. 2012) and BLIS (Van Zee and van de Geijn 2015) . In contrast to GotoBLAS and OpenBLAS, BLIS exposes three innermost loops that facilitate the analytical identification of optimal parameters (Van Zee and van de Geijn 2015) .
The implementation of MMA [×, +] in BLIS comprises two packing routines and five loops around the micro-kernel. The micro-kernel is a loop around an outer product, which can be implemented in assembly. The micro-kernel and two surrounding loops form the macro-kernel. The pseudo-code for the implementation 2 can be found in Listing 2. The determination of the M c , N c , K c , M r , and N r parameters is considered in Section 4.5.
Optimization of a TC-Like Kernel
In this subsection, we present a new polyhedral optimization that helps to obtain code, which is a generalization of the implementation described by Section 4.1 in terms of the outer product defined in the semiring representing MMA [×, +] . The optimization transforms the SCoP statement containing the TC-like kernel (Section 3) by creating the micro-and macro-kernels, and by performing the packing transformation described in Section 4.3. The overall algorithm is summarized in Algorithm 1. Let us consider the algorithm and compare the code it produces as well as the implementation of MMA[×, +] of BLIS described in Section 4.1. First, we obtain the three loops around the macro-kernel using interchanging and tiling transformations (Lines 1-3). We then produce the macro-kernel by tiling the innermost loops (Line 4). Subsequently, we separate the domains of the innermost two loops into full and partial tiles to allow the vectorization in the case of parametric bounds and loop sizes that cannot be divided evenly by the tile sizes (Line 5). Finally, we create the micro-kernel by applying unrolling and vectorization of the innermost loops, and performing the packing transformation (Lines 6-8). Consequently, the generated code is similar to the expert implementation (Section 4.1), except the innermost loop body that is not necessarily the outer product. 3 Let us consider the innermost loop produced by the transformation. According to Section 3, this loop updates the blocks of the matrix, e.g., C, with length M r and width N r . Updating is achieved by a series of K c calculations involving M r elements from the packed matrix, e.g., A c , and N r elements from the packed matrix, e.g., B c . At each iteration of p c , M r N r intermediate results are computed by operating on M r elements from the matrix A c and N r elements from the matrix B c . Subsequently, they are accumulated into the block of the matrix C, e.g., C c , which is reused between iterations in loop p c . There is no reuse of the elements of A c and B c in p c , because they are used exactly 34:8 R. Gareev et al. once each time via p c . Performing this analysis on the remaining loops can show that our data usage pattern is similar to that produced by the expert implementation described in Section 4.1. Consequently, since the size of reused data becomes smaller as the loop depth increases, smaller amounts of reused data are mapped to faster memory layers in the memory hierarchy (Goto and Geijn 2008) .
As an example, we applied the algorithm to the statement for the SCoP presented in Listing 1. In addition, Figure 1 presents the single-threaded performance evaluation results for the steps in the implementation of MMA [×, +] provided by the PolyBench 3.2 benchmark suite (Pouchet 2011) , matrices containing elements of type double, and ARM (see Section 5 for details).
Applying the interchanging and tiling transformations (Lines 1-3) transforms the output dimensions of the scheduling function
, p mod K c , and the dimensions j mod N r , i mod M r are added. The transformed SCoP can be found in Listing 3. For simplicity,
The loop tiling and loop interchange transformations (Lines 1-4) help to achieve 23% of the theoretical peak (polly-1-4).
Loop unrolling transformations (Line 5) do not affect the performance (polly-1-5). The packing transformation (Line 6) helps to achieve 26% of the theoretical peak (polly-1-6). The transformed SCoP can be found in Listing 4. The result obtained by the unrolling and packing transformation (Lines 5 and 6) is described in Section 4.3.
In the case of Polly, the implementation of vectorization and the subsequent loop hoisting and loop sinking (Lines 7 and 8) are based on the LLVM's vectorization passes, and they allow us to achieve 73% of the theoretical peak (polly (opt)). To improve the performance further, prefetch instructions can be generated using the loop data prefetch pass in LLVM, which is not used by default and it is not available for all targets (e.g., x86_64 architecture) at the time of writing this article. In the case of ARM, it helps to achieve 75% of the theoretical peak. We conclude that the performance of the micro-kernel is degraded by the lack of prefetching instructions. Nevertheless, Figure 1 shows that using prefetch instructions is not sufficient to match the performance of optimized libraries. pure-c represents the evaluation of the implementation presented in Listing 1 with loop interchange transformations (Lines 1-4) as well as the application of the packing transformation (Line 6), which helps to achieve 12.31% of the theoretical peak.
Loop unrolling transformations (Line 5) and manual vectorization using SSE intrinsics allows us to achieve 79.74% of the theoretical peak (naive-sse). It generates a micro-kernel that can be represented as
It should be noted that the elements of the second operand of the matrix-matrix multiplication can be stored in a vector register to increase the performance (Lehn 2014) . At the time of writing this article, the LLVM vectorizer is not capable of performing these transformations. Thus, our optimization (Lines 1-8) generates the representation of the micro-kernel. 5 To improve the performance further and achieve 83.91% of the theoretical peak (sse), we also keep the elements of the matrix A in a vector register and use different permutations of its elements to obtain all of the rank-1 updates performed by the micro-kernel (Section 4.1). This process generates a micro-kernel that can be represented as
To avoid the limitations of SSE intrinsics, we translate the intrinsics to assembly by ourselves. It allows us to manually perform instruction scheduling to consider the latency of the SSE instructions and fully occupy the execution units of the CPU. For instance, there are several loads in the vector registers in the innermost loop but instead of waiting for all of them to be completed, we can execute FMA operations between the load calls. Furthermore, manual translation of the SSE 34:10 R. Gareev et al. intrinsics and additional unrolling of the innermost loop of the micro-kernel allows us to use as many of the vector registers as possible without modifying the data usage pattern. In addition, we use prefetch instructions to prefetch the elements of the operands of MMA from two iterations ahead, which helps to achieve 86.92% of the theoretical peak and match the performance of the code in the BLIS framework for all data sizes (sse-asm). The instruction scheduling and register allocation processes depend on the LLVM's vectorization passes as well as the compiler backends, which produce code for the specified machine or other languages. In addition to the previously described problem related to micro-kernel generation, a problem with register allocation occurs. Specifically, the LLVM's vectorization passes neither additionally unroll the innermost loop of the micro-kernel or vectorize this loop if it is unrolled by Polly. Improving these parts of the compiler are outside the scope of this study.
Packing Transformation
The copying of the data in matrices A and B, the operands of the TC-like kernel into the created arrays A c and B c , respectively, during the packing transformation (Van Zee and van de Geijn 2015) is illustrated in Figure 3 . In our case, A c and B c are three-dimensional arrays with sizes of
respectively, which helps to ensure that their elements are read during in-stride access, aligned to the cache line boundaries, and preloaded into certain cache levels. The packing transformation can be represented as a data-layout transformation (Section 4.4), which introduces a new array, copies data to it, and changes the memory access locations to reference the array.
We now consider how the packing transformation can be expressed as a sequence of transformations with a polyhedral model to produce a SCoP (Section 2). The corresponding access relation should be created to introduce a new array. Data can be copied to an array by introducing the SCoP statement that contains a read from the specific location in memory and storing it to another. This transformation requires updating the scheduling functions for all the SCoP statements as well as the iteration domains and access relation maps if new loops are introduced. The corresponding access relations should be modified to change a memory access location. Since the described transformations can be performed with the polyhedral model without violations of its requirements, the result of the packing transformation is the new SCoP.
As an example, we apply the packing transformation to the statement of the SCoP presented in Listing 1. We assume that Lines 1-4 of Algorithm 1 are applied. The modified access relations help to map every element of matrices A and B to the corresponding elements of matrices Ac and Bc, respectively, and they represent relations of the form: 
Data-Layout Transformation Infrastructure
Data layout transformation comprises a class of transformations that optimizes the layout of the data (Srikant and Shankar 2007) , which can be conducted within either a single aggregate data type or across data objects, e.g., to ensure that they are read during in-stride access, aligned to cache line boundaries, and preloaded into certain cache levels (Section 4.3). New data layout can eliminate the memory stream alignment issue, i.e., redundant loads where an element is moved with a different load for every distinct vector register position where it needs to be used . Consequently, data layout transformations can help to decrease the memory access latency. We extend Polly (Grosser et al. 2011 ) to perform polyhedral data-layout optimization. In particular, we add the ability to change, create access relations, and introduce SCoP statements, each of which represents copying elements from one array to another. We check that these transformations do not violate the requirements of a polyhedral model. Hence, the result obtained after their application is a new SCoP. Furthermore, we extend JSCoP, which is a file format based on JSoN (Grosser et al. 2012) , to allow affine-linear data-layout transformations. Thus, we can test the transformations manually without any knowledge of the compiler's internals as well as developing new research prototypes.
To the best of our knowledge, we developed the first general infrastructure for polyhedral affinelinear data-layout transformations, which allows their application in a production compiler. We use it to implement the packing transformation, which requires the introduction of a new array, copies data to it, and changes the memory access locations to reference the array (Section 4.3). Even without additional modifications, this approach can be used to implement existing data-layout transformations, such as dimension-lifted transposition, which helps to eliminate the memory stream alignment issue ).
Architecture Model
To obtain a high-performance implementation of a TC-like kernel, particulary MMA[×, +], the optimization should be mapped onto the architecture. A model of the hypothetical processor can be created for this purpose. For example, the analytical model for determining the blocksize parameters in BLIS (Low et al. 2016 ) describes caches, vector instructions, load/store architecture, and vector registers, as follows.
-Caches. All data caches are set-associative. Each cache level Li is characterized by the size of the cache line C Li , associativity degree W Li , size S Li , and the number of sets N Li . -Vector instructions. The throughput of the processor floating-point arithmetic units is an N VFMA vector fused multiply-add instructions (VFMA) per clock cycle. 6 The minimum number of cycles between the execution of two dependent consecutive VFMA instructions is L VFMA , which is the latency of each VFMA. In the case of architectures without VFMA instructions, L VFMA is the sum of the latencies for one multiply instruction and one addition instruction. -Load/store architecture and vector registers. Data are loaded into the processor registers before computations are performed based on them. Each vector register can hold N VEC elements of size S DATA .
Our optimization uses the modified version of the model, where instead of L VFMA , we use L VMMA , which is the sum of the latencies for instructions comprising vectorized matrix multiply-and-add operations (VMMA). Instead of N VFMA , we consider N VMMA , which is the processor throughput for the VMMA per clock cycle (Sedukhin and Paprzycki 2012) . If the vector instructions comprising the VMMA can be executed simultaneously, then N VMMA can be estimated as N VMMA = min (N 1 , N 2 ) , where N i is the throughput of an instruction comprising the VMMA; otherwise, it can be estimated as
. This is explained by the ability to interchange independent instructions for different VMMA instructions that comprise a long sequence. These sequences can be used to estimate N VMMA (Fog 2017) . Our optimization produces code with a similar data usage pattern to that used in the implementation of MMA[×, +] in BLIS (Section 4.2). Hence, after determining the values of the parameters, we can use the same strategy that is applied to BLIS (Low et al. 2016) to deduce the formulae needed to find the values of the parameters for the micro-and macro-kernels. M r and N r are the parameters of the micro-kernel and they can be computed as follows:
. They are set at sufficiently large values to avoid stalls in the floating-point pipelines caused by executing dependent consecutive VMMA during the updating of the matrix, e.g., C, by the micro-kernel (Section 4.2). In addition, M r and N r are set at the smallest values to release the vector registers for the elements of matrices A and B, which are the operands of the TC-like kernel.
The size of the reused data decreases as the loop depth increases, so smaller reused data are mapped onto faster memory layers in the memory hierarchy, where the faster layers have a lower capacity than the slow layers (Section 4.2). To allow rapid access and reduce cache trashing, the reused data should ideally be kept in the cache between iterations. Thus, there are restrictions on the values of the parameter of the micro-kernel K c , and the parameters of the macro-kernel M c and N c , which define the shape of reused data (Section 4.1). We use the following formulae to compute these parameters:
, and
N r , where C B c is the number of bytes of memory available for array B c created during the packing transformation (Section 4.3). These parameters are selected to conform to the sizes of the caches, the cache replacement policy, and the cache organization.
These formulae are similar to those applied in the case of BLIS (Low et al. 2016 ) and they are deduced in a similar manner. The exceptions are N VMMA and L VMMA , which are used instead of N VFMA and L VFMA , respectively. It allow us to apply the optimization to MMA [⊗, ⊕] and to solve the APP, such as finding the least and most reliable paths, or finding the paths with the maximum cost. In the case of MMA [×, +] , the values derived by our analytical model can be optimal, because N VMMA and L VMMA are equal to N VFMA and L VFMA , respectively.
Scalability
Parallelism is important for high-performance modern processing platforms and it is essential in the case of big data problems (Facchinei et al. 2014) . In this study, we restrict ourselves to the case of data level parallelism and a single-threaded implementation of a TC-like kernel. In this subsection, we briefly describe the issues and possibilities related to the scalability of the proposed approach.
According to the expert multithreaded implementation of MMA[×, +] (Smith et al. 2014) , j c , the second loop around the micro-kernel (Listing 2), can provide a good opportunity for parallelization. In particular, if the ratio of N c relative to the parameter N r is large, which is usually the case, then the time spent in this loop cancels out the cost of packing the elements of the created array A c (Section 4.3) into the L2 cache (Smith et al. 2014) . Polly automatically checks all of the generated loops and introduces OpenMP parallelism for the outermost parallel loops (Grosser et al. 2011 ). In the case of a TC-like kernel, particularly MMA[×, +], i, j, and p contain the packing transformation (Section 4.3), which introduces dependencies that prevent parallelization. Thus, j c is the outermost parallel loop that is automatically parallelized by Polly. Section 5 presents the performance evaluation results for the proposed approach.
However, if the L2 cache is not shared and j c is parallelized, then the elements of the created A c are duplicated across the L2 caches. This occurs during the execution of the micro-kernel and it may overlap with the computation, which can reduce the performance (Smith et al. 2014) . In this case, the first loop around the macro-kernel i, can be considered. If we parallelize this loop, then each thread will be assigned different elements of the matrix A, which reside in the L2 cache, and they are packed into the created array A c , where this may cause a race condition. In the case of Polly, arrays A c and B c are allocated statically during compilation. Thus, they can be replicated to avoid the race condition by using OpenMP parallelism. However, Polly does not support this mechanism at present.
Determining the optimal parameter values for a multithreaded implementation of MMA[×, +] as well as the loops that should be parallelized to obtain the best performance gain is orthogonal to the analytical model for determining the blocksize parameters in BLIS (Low et al. 2016) . The modified version of this model (Section 4.5), which is used by the proposed approach, is affected by the same issue.
EXPERIMENTAL RESULTS
In this section, we present the results of the performance evaluations for the proposed optimization, TCCG (Springer and Bientinesi 2018) , TBLIS (Matthews 2016) , and the current production compilers, as well as the compiler front-end (GCC (Stallman 1999) , Clang (Lattner 2002) , ICC (Intel 2015) , and IBM XL (IBM 2012)). In addition, we compared the performance of the code generated using our approach and the instances of MMA[×, +] that are available in Basic Linear Algebra Subprograms (Intel's MKL (Intel [n.d.] ), ARMPL (ARM 2015), BLIS (Van Zee and van de Geijn 2015), and OpenBLAS (Xianyi et al. 2012) ). We also showed that the approach can be applied to APPs.
Our experimental setup (Tables 1 and 2 ) comprised modern processors (IBM Power 8, Intel Xeon Phi, Intel Sandy Bridge, Intel Kaby Lake, and APM883208-X1) with different architectures (ppc64le, x86_64, and aarch64). Each single measurement or result reported in this section is the 6.0.0 -O3 -march=native 10 -mllvm -polly -O3 -march=native -mllvm -polly 11 -polly -target-throughput-vector-fma=N VMMA -mllvm -polly-target-latency-vector-fma=L VMMA polly (opt) 7 Information about the latencies and throughputs can be found in manuals with instruction tables (Fog 2017; Intel 2018). APM883208-X1 (AppliedMicro X-Gene) implements ARM v8. Consequently, the latency of FMA for Cortex A57 can be used (ARM 2016). However, the throughput is different (Dolbeau 2016) . 8 The values in brackets were empirically determined and they are larger than the real latency values. This due to the trade-off between obtaining results that are optimal according to the architecture model (Section 4.5) and using as many of vector registers as possible. This problem can be solved by improving the register allocation (Section 4.2). 9 The CPU frequencies with enabled Intel Turbo Boost technology are shown in brackets. 10 If the -march=native option is not supported by the target architecture, then the -mcpu and -mtune options with appropriate values are used instead. 11 In the cases of IBM Power 8 and DEGMM, the -disable-ppc-preinc-prep option in the LLVM's PowerPC backend is used. 
High-Performance Generalized

TC
We considered problems from a TC benchmark containing a wide range of user cases collected from previous studies related to TC (Springer and Bientinesi 2018) , which included tensors with different dimensionality encountered in coupled-cluster methods (Stock et al. 2012 ) and quantum chemistry calculations (Baumgartner et al. 2005) . We evaluated the single-threaded performance on the Intel Sandy Bridge machine for type double. To simplify the presentation, we encode a con-
A imjn ·B nlmk can be encoded as ijkl-imjn-nlmk. Figure 4 shows the evaluation results for abcde-efbad-cf and abcd-eafc-bfde, which are representative examples of TSc. The results for the full benchmark are provided in Appendix A. For abcde-efbad-cf, the sizes of the a, b, d, and e indices were fixed as 8, 8, 4, and 4, respectively. The sizes of the c and f indices were equal and they varied simultaneously, ranging from 4 to 1,024. In the case of abcd-eafc-bfde, the sizes of all the indices were equal and they varied simultaneously, ranging from 4 to 64. In addition, Figure 4 presents the evaluation results for MMA [×, +] with equivalent size and implemented in BLIS.
We observed substantial speedups when using the proposed approach compared with the production compilers, where it exceeded the reported ICC performance by up to 84× and outperformed GCC and Clang by up to 82×. However, we noted that if the tensors contained less than 32 2 elements, it was reasonable to use the default optimization techniques instead of the proposed approach and other approaches based on the mapping of TC to MMA[×, +] (e.g., TCCG (Springer and Bientinesi 2018) and TBLIS (Matthews 2016) ). It corresponds to the results obtained for MMA[×, +] (Heinecke et al. 2015) .
We conclude that the approach achieved more than 86.12% of the performance with TCCG 12 and TBLIS. In the next section, it is showed that this difference was caused by the suboptimal optimization of MMA [×, +] , although the same performance was achieved in some cases. In the case of TCCG, the difference was due to the inherent disadvantages of TTGT (Hirata 2003) and LoG (Napoli et al. 2014 ) approaches, which are used to generate implementations of TCs, and, subsequently, time them on the target platform. In particular, TTGT requires explicit tensor transpositions, which accounts for the pure overheads. LoG performs tensor slicing into a sequence of matrices, which can be small and/or incur strided memory accesses, thereby leading to poor performance. According to the results with BLIS, we note that BLIS exceeded the performance reported for TCCG by up to 1.5×.
In the case of TBLIS, the heuristics used to maximize the spatial locality of the data and reduce the number of cache and TLB misses can lead to suboptimal results. In particular, TBLIS logically reorders the dimensions from their original order so the dimensions in bundles I and J are sorted by increasing stride in the tensor that is produced by TC, e.g., C. Subsequently, the operand with the unit stride dimension of C is used as the first operand, e.g., A, for the mapped matrix-matrix multiplication and it is packed more frequently. However, it is not always possible to achieve unit stride access in both C and A using these heuristics (e.g., in the case of ijk-jli-lk with column-major order). The heuristics are modified in our approach. The LLVM's vectorization passes only store the second operand, e.g., B, of the mapped matrix-matrix multiplication to a vector register and load its elements after elements of A, so we swap A with B. The modified heuristics can also lead to suboptimal results. However, they allow us to keep the elements of B, which are reused more often than the elements of A, in the L1 cache. Furthermore, they can help to achieve the performance of TBLIS if its heuristics do not work (e.g., in the case of abcde-efbad-cf with row-major order).
MMA[×, +]
We consider the MMA[×, +] of the following form C ← α ⊗ C ⊕ β ⊗ A ⊗ B, 13 implemented in the Polybench 3.2 benchmark suite (Pouchet 2011) , which is a collection of programs from various domains that expose only the kernels that need to be optimized, and thus it is easy to work with.
Figures 5 and 6 present the results of single-threaded performance evaluations of the implementations of MMA[×, +] for type double and square and nonsquare matrices, respectively. Figure 7 shows the results for type float. We conclude that the optimization achieved a speedup of up to 20× compared with the modern production compilers and 1.63× compared with the Intel C compiler. We also note that if the matrices contained less than 80 2 elements, the optimization attained the performance of the single-threaded instances of MMA[×, +] available in MKL. However, if the matrices contained less than 32 2 elements, it was reasonable to use the default optimization techniques instead of the presented approach. It corresponds to Heinecke et al. (2015) .
We found that in the case of Kaby Lake, square matrices, and type double, our approach achieved 83.33% of the performance of OpenBLAS. However, in the case of IBM Power 8, it only achieved 75.54%. These differences are explained by suboptimal register allocation, as mentioned previously (Section 4.2). Thus, we estimated the number of vector registers used in the case of Kaby Lake. According to Section 4.2, at each iteration of the innermost loop, M r N r intermediate results are computed and M r + N r elements of the operands of MMA[×, +] are used. Consequently, since only the elements of the second operand are stored in the vector registers (Section 4.2), at most
vector registers are used (Section 4.5). In the case of Kaby Lake, this number is equal to 8·6+8 4 = 12, which is 87.5% of the vector registers are available on the machine. In the case of IBM Power 8, only 48.43% of the available vector registers are used. In the cases of Sandy Bridge and ARM, 62.5% and 65.62% of the available vector registers are used, respectively. Thus, Sandy Bridge 13 Our implementation of the Algorithm 1 is based on LLVM's vectorization passes (Section 4), which do not vectorize multiplication of the parameters β and A ⊗ B at the time of writing this article. Furthermore, α ⊗ C can be computed in a different SCoP statement, so our optimization does not influence its performance. Thus, different values for the parameters α and β only influence the vectorization but its optimization is not the goal of this study. For simplicity, we assume that their values are equal to one. and ARM could achieve 80.35% and 84.61% of the performance of OpenBLAS, respectively. We conclude that in the case of float type, Kaby Lake, IBM Power 8, Sandy Bridge, and ARM with our optimization method achieved 83.33%, 54.54%, 70%, and 86.25% of the performance of OpenBLAS, respectively, and this is due to the different numbers of vector registers used, i.e., 87.5%, 42.18%, 56.25%, and 65.62% in the cases of Kaby Lake, IBM Power 8, Sandy Bridge, and ARM, respectively.
High-Performance Generalized
We also evaluated the multi-threaded performance of the proposed approach for square matrices with dimensions of 8000 × 8000, type double, different values for the maximum number of threads in the OpenMP parallel region, and two target platforms. The first target platform evaluated had two sockets for 10-core Intel Xeon E5 CPUs (Table 1) . The second had one 64-core Intel Xeon Phi CPU booted in flat mode (Table 1) . Figure 8 shows the results of the multi-threaded performance evaluations. In the case of Intel Xeon E5-2630, the proposed approach achieved 86.18% of the performance of the code from the optimized libraries. In the case of Intel Xeon Phi, the proposed approach exceeded the performance of OpenBLAS, which does not support Intel Xeon Phi CPUs and it uses a reference implementation. However, in the latter case, the proposed approach achieved only 69.02% of the performance of 34:18 R. Gareev et al. BLIS due to the inability of Polly to parallelize i, which is the first loop around the macro-kernel. The Intel Xeon Phi cores are grouped in tiles, which comprise two cores with their own L2 cache. Consequently, i allows better parallelism than j c , which is the second loop around the microkernel, and it could be parallelized by Polly (Section 4.6). We only considered the case of data level parallelism and a single-threaded implementation of a TC-like kernel, and further evaluations were beyond the scope of this study.
APPs
Since the 1970s, it has been known that MMA in different semirings can be used by solvers for a large number of problems combined under a single umbrella, which are called Algebraic Path Problem (APPs) (Sedukhin and Paprzycki 2012). We consider the following APPs: finding the least and the most reliable paths, finding paths with the maximum capacity, finding paths with the maximum cost, and finding for all pairs the shortest connecting path implemented using MMA [×, min] , MMA [×, max] , MMA [min, max] , MMA [+, max] , and MMA [+, min] , respectively. We evaluated them for different types of elements (double and float) and square matrices. The single-threaded performance evaluation results obtained for APPs on the Intel Sandy Bridge machine are shown in Appendix A. To compute the theoretical peak performance in GFLOP/sec for a particular MMA, we used the following standard formula (Ngxande and Moorosi 2014): theoretical peak performance = (CPU speed in GHz) × (CPU instruction per cycle). Consequently, the theoretical peak performance differed according to the ability of the CPU to simultaneously execute the micro-operations used to compute the MMA. For example, in the case of the double floating point data type, MMA [×, min] , MMA [×, max] , and MMA[×, −], 8 floating point operations could be executed per cycle. However, only 4 floating point operations could be executed per cycle in the case of MMA [+, max] , MMA [+, min] , and MMA [min, max] . This difference can be explained by the inability of Intel Sandy Bridge to simultaneously execute instructions that require the same execution unit (e.g., vmaxpd and vaddpd). In the case of MMA [/, max] , only 0.4 floating point operations could be executed per cycle due to N VMMA , which was equal to 0.05 (Section 4.5).
For the problems that required finding the least and the most reliable paths, the optimization achieved more than 69% of the theoretical peak performance and a speedup of 1.56× compared with the Intel C loop optimizer. For the other problems, the optimization achieved more than 34:20 R. Gareev et al. 85% of the theoretical peak performance and produced similar results to the Intel C compiler. The performance of the code generated using our approach could be improved by applying prefetching instructions and optimized instruction scheduling (Section 4.2).
In addition to the performance evaluation results for APPs of type float, Appendix A shows the performance evaluation results for MMA [×, −] and MMA [/, max] , which do not exist, since the corresponding binary operations do not form semirings. Nevertheless, the results show that we achieved 78% and 89% of the theoretical peaks in the cases of MMA [×, −] and MMA [/, max] , respectively. Thus, the proposed approach can be applied even if the corresponding MMA operations do not form semirings. We also note that ICC allowed us to achieve high performance only for MMAs with throughputs less than one (e.g., MMA [+, max] , MMA [min, max] , MMA [+, min] , and MMA [/, max] with Sandy Bridge).
RELATED WORK
In this section, we describe previous studies related to optimization of TC and the contribution of our proposed approach.
Automatic Optimization of MMA[×, +]
Approaches are available for fully automatically optimizing BLAS functions, particularly MMA [×, +] . In general, they are based on autotuning or domain-specific knowledge of algorithms for dense linear algebra kernels.
The Portable High-Performance ANSI C (PHiPAC) (Bilmes et al. 2014) and Automatically Tuned Linear Algebra Software (ATLAS) (Whaley et al. 2001) projects introduced autotuning for empirically determining the optimal parameters for BLAS routines when running them on the target platform. Autotuning can be combined with analytical techniques to identify the best optimization decisions, and this approach is used by LGen (Spampinato and Püschel 2014) and Build to Order BLAS (BTO) (Belter et al. 2009 ).
LGen is a compiler for small-scale, basic linear algebra computations where the input is a fixed-size linear algebra expression and the output is a corresponding C function. BTO is a compiler where the input is a sequence of matrix and vector arithmetic statements in annotated MATLAB, and the output is a tuned implementation in C++. Autotuning can be time consuming and it can be difficult to apply it to production compilers where the execution time is a crucial factor. Several approaches can fully automatically optimize MMA[×, +] and obtain high-performance code that outperforms expert-tuned implementations. The POET optimization library (Yi et al. 2014) and AUGEM framework (Wang et al. 2013 ) use annotations and templates of sequential code, respectively, written by domain experts to guide general-purpose compilers to produce optimized MMA[×, +] kernels from specifically prepared code. The Portable Compiler Approach (POCA) (Su et al. 2017) generates an optimized micro-kernel based on LLVM IR representing MMA [×, +] and subsequent domain-specific but architecture-independent optimizations of its micro-kernel. LIBXSMM (Heinecke et al. 2015 ) is a library that employs just-in-time in-memory code generation to obtain a high-performance implementation of small dense and sparse MMA[×, +] when highperformance libraries such as the Intel's MKL do not deliver optimal performance. The approaches for automaticall optimizing MMA[×, +] complement our method, allowing the creation of the optimization that helps to obtain the code that outperforms expert-tuned implementations.
Optimization of TC
Two main types of approaches can be applied to optimize TC (Springer and Bientinesi 2018) : methods based on mapping TC to MMA[×, +] and methods based on nested loops.
The approaches based on nested loops (Apra et al. 2014) improve the performance of the highlevel specifications of TCs by applying a series of loop transformations. However, their implementations can be degraded by strided memory access. The performance of small TCs that fit into caches can be improved using vectorization ). In the case of larger TCs, loop-based code generators for GPUs can be used (Ma et al. 2011) .
Transpose-Transpose-GEMM-Transpose (TTGT) (Hirata 2003) and Loops-over-GEMMs (LoG) (Napoli et al. 2014 ) use highly efficient MMA[×, +] implementations provided by BLAS-compatible libraries. In TTGT, explicit tensor transpositions "flatten" the arbitrary dimensional tensors into matrices to recast TC as a single MMA [×, +] . However, additional time is required to transpose the tensors as well as extra memory to store the transposition results. In LoG, tensors are sliced into a sequence of matrices that are contracted via MMA [×, +] . If the sliced matrices are small and/or the memory accesses produced are strided, then this can lead to poor performance. GEMM-like Tensor-Tensor multiplication (GETT) (Springer and Bientinesi 2018) and Tensor-Based Library Instantiation Software (TBLIS) (Matthews 2016) pack elements from the input tensors into fixedsize buffers to improve cache reuse and they subsequently use a micro-kernel that is typically implemented in assembly. Other than this similarity, there are several critical differences between GETT and BLIS. GETT uses an auto-fine-tuning framework guided by a performance model to implement TC for a fixed size. By contrast, TBLIS uses an entirely run-time algorithm, which can operate on any size of shape of tensor.
CONCLUSION AND FUTURE DIRECTIONS
In this study, we proposed a new compiler optimization to obtain highly optimized MMA and TC code without external software. Our approach is based on a new algorithm for the automatic transformation of TC, which is implemented 14 using a novel general infrastructure for performing arbitrary affine-linear data-layout transformations on the low-level compiler IR as well as the use of an analytical model.
We compared the execution time of code produced by optimizing MMA[×, +] with instances of MMA [×, +] that are available in Intel's MKL (Intel [n.d.] ), ARMPL (ARM 2015), BLIS (Van Zee and van de Geijn 2015), and OpenBLAS (Xianyi et al. 2012) , as well as code produced using the current leading production compilers (GCC (Stallman 1999) , ICC (Intel 2015) , and IBM XL (IBM 2012) ) and the compiler front-end Clang (Lattner 2002) . Our method attained the performance of vendor optimized BLAS libraries, with a speedup of more than 1.63× compared with the state-ofart compilers. In the case of APPs, we achieved more than 85% of the theoretical peak performance and a speedup of 1.56× compared with the code produced by compilation. In the case of TC, we attained the performance of TCCG and TBLIS, and achieved 80.35% of the theoretical peak performance with a speedup of 84× compared with the Intel C loop optimizer.
We consider that our approach can be improved. In future research, we will map the proposed techniques to other BLAS operations. According to the expert multi-threaded implementation of MMA[×, +] (Smith et al. 2014) , the code produced by the proposed optimization is suitable for parallelization. The algorithm could be improved by modifying the automatic vectorization. Optimizing sequences of BLAS functions obtains speedups of up to 137% compared with vendor optimized BLAS libraries (Belter et al. 2009 ) and this approach could also improve the proposed algorithm. Finding blocks that contain MMA may help to automatically transform algorithms such as Floyd-Warshall into a blocked form (Matsumoto and Sedukhin 2009; Takahashi and Sedukhin 2005) , before their subsequent optimization. 
APPENDIX A FULL BENCHMARK
