We present Tensor Transpose Compiler (TTC), an open-source parallel compiler for multidimensional tensor transpositions. To generate high-performance C++ code, TTC explores a number of optimizations, including software prefetching, blocking, loop-reordering, and explicit vectorization. To evaluate the performance of multidimensional transpositions across a range of possible use-cases, we also release a benchmark covering arbitrary transpositions of up to six dimensions. Performance results show that the routines generated by TTC achieve close to peak memory bandwidth on both the Intel Haswell and the AMD Steamroller architectures and yield significant performance gains over modern compilers. By implementing a set of pruning heuristics, TTC allows users to limit the number of potential solutions; this option is especially useful when dealing with high-dimensional tensors, as the search space might become prohibitively large. Experiments indicate that when only 100 potential solutions are considered, the resulting performance is about 99% of that achieved with exhaustive search.
INTRODUCTION
Tensors appear in a wide range of applications, including electronic structure theory (Bartlett and Musiał 2007) , multiresolution analysis (Harrison et al. 2016) , quantum many-body theory (Pfeifer et al. 2014) , quantum computing simulation (Markov and Shi 2008) , machine learning (Chetlur et al. 2014; Abadi et al. 2015; Vasilache et al. 2015) , and data analysis (Kolda and Bader 2009) . While a range of software tools exist for computations involving one-and two-dimensional arrays, that is, vectors and matrices, the availability of high-performance software tools for tensors is much more limited. In part, this is due to the combinatorial explosion of different operations that need 15:2 P. Springer et al. to be supported: There are only four ways to multiply two matrices, but ( m k )( n k ) ways to contract an mand an n-dimensional tensor over k indices. Furthermore, the different dimensions and data layouts that are relevant to applications are much larger in the case of tensors, and these issues lead to memory access patterns that are particularly difficult to execute efficiently on modern computing platforms.
Efficient computation of tensor operations, particularly contractions, exposes a tension between generality and mathematical expression on one hand and performance and software reuse on the other. If one implements tensor contractions in a naive way-using perfectly-nested loops-then the connection with the mathematical formulae is obvious, but the performance will be suboptimal in all nontrivial cases. High performance can be obtained by using the Basic Linear Algebra Subprograms (BLAS) (Dongarra et al. 1990 ), but mapping from tensors to matrices efficiently is nontrivial (Di Napoli et al. 2014; Li et al. 2015) and optimal strategies are unlikely to be performance portable, due to the ways that multidimensional array striding taxes the memory hierarchy of modern microprocessors.
A well-known approach for optimizing tensor computations is to use the level-3 BLAS for contracting matrices at high efficiency and to always permute tensor objects into a compatible matrix format. This approach has been used successfully in the NWChem Tensor Contraction Engine module (Hirata 2003; Kowalski et al. 2011) , the Cyclops Tensor Framework (Solomonik et al. 2014) , and numerous other coupled-cluster codes dating back more than 25 years (Gauss et al. 1991) . The critical issue for this approach is the existence of a high-performance tensor permutation or tensor transpositions.
While tensor contractions appear in a range of scientific domains (e.g., climate simulation (Drake et al. 1995) and multidimensional Fourier transforms (Frigo and Johnson 2005; Pekurovsky 2012 )), they are perhaps of greatest importance in quantum chemistry (Baumgartner et al. 2005; Hirata 2003) , where the most expensive widely used methods-coupled-cluster methods-consist almost entirely of contractions that deal with 4-, 6-, and even 8-dimensional tensors. Such computations consume millions of processor-hours of supercomputing time at facilities around the world, so any improvement in their performance is of significant value.
To this end, we developed Tensor Transpose Compiler (TTC) , an open source code generator for high-performance multidimensional transpositions that also supports multithreading and vectorization. 1 Together with TTC, we provide a transpose benchmark that can be used to compare different algorithms and implementations on a range of multidimensional transpositions.
Let A i 1 ,i 2 , ...,i N be an N -dimensional tensor, and let Π(i 1 , i 2 , . . . , i N ) denote an arbitrary permutation of the indices i 1 , i 2 , . . . , i N . The transposition of A into B Π(i 1 ,i 2 , ...,i N ) is expressed as B Π(i 1 ,i 2 , ...,i N ) ← α × A i 1 ,i 2 , ...,i N . To make TTC flexible and applicable to a wide range of applications, we designed it to support the class of transpositions
where α and β ∈ R; that is, the output tensor B can be updated (as opposed to overwritten), and both A and B can be scaled. 2 As an example, let A i 1 ,i 2 be a two-dimensional tensor, Π(i 1 , i 2 ) = (i 2 , i 1 ), and α = 1, β = 0; then Equation (1) reduces to an ordinary out-of-place 2D matrix transposition of the form B i 2 ,i 1 ← A i 1 ,i 2 . Throughout this article, we adopt a Fortran storage scheme, that is, the tensors are assumed to be stored following the indices from left to right (i.e., the leftmost index has stride 1). Additionally, TTC: A High-Performance Compiler for Tensor Transpositions 15:3 Fig. 1 . Single-threaded bandwidth for all the 120 possible loop orders of two different 5D transpositions. explicit-vec and hw-pre, respectively, denote whether explicit vectorization and hardware prefetching are enabled (Y) or disabled (N).
we use the notation π (i a ) = i b to denote that the ath index of the input operand will become the bth index of the output operand.
For each input transposition, TCC explores a number of optimizations (e.g., blocking, vectorization, parallelization, prefetching, loop-reordering, non-temporal stores), each of which exposes one or more parameters; to achieve high-performance, these parameters have to be tuned for the specific hardware on which the transposition will be executed. Since the effects of such parameters are non-independent, the optimal configuration is a point (or a region) of an often large parameter space; indeed, as the size of the space grows as N !, an exhaustive search is feasible only for tensors of low dimensionality. In general, it is widely believed that the parameter space is too complex to design a perfect transpose from first principles (McCalpin and Smotherman 1995; Lu et al. 2006 ). For all these reasons, whenever an exhaustive search is not applicable, TTC's generation relies on heuristics.
In Figure 1 , we use two exemplary 5D transpositions (involving tensors of equal volume), as compelling evidence in favor of a search-based approach. 3 The figures show the bandwidth attained by each of the possible 5! = 120 loop orders 4 -sorted in descending order, from left to rightwith and without hardware prefetching (hw-pre:Y, hw-pre:N) , and with and without explicit vectorization (explicit-vec:Y, explicit-vec:N) . 5 The compiler-vectorized code consists of perfectly nested loops including #pragma ivdep, to assist the compiler. The horizontal line labeled "icpc" denotes the performance of the compiler-vectorized versions (with hardware prefetching enabled), where the size of the tensor was known at compile time; this enabled the compiler to choose the loop order according to its internal cost model. Thus, all 120 variants result in the same loop order.
Several observations can be made. (1) Despite the fact that the two transpositions move exactly the same amount of data, the resulting top bandwidth is clearly different. (2) By comparing the leftmost data point of the beige ( ) and blue lines ( ), one concludes that the explicit vectorization 3 The experiments were performed on an Intel Xeon E5-2670 v2 CPU, using one thread and icpc 14.0.2 as the compiler. 4 A d dimensional transposition can be implemented as d perfectly nested loops around an update statement. These loops can be ordered in d ! different ways. 5 We refer to explicit vectorization to denote code that is written with AVX intrinsics; the vectorized versions use a blocking of 8 × 8 (see Section 3.1). improves the performance over the fastest compiler-vectorized version by at least 20%. (3) The difference between the leftmost and the rightmost datapoints-of any color-provides clear evidence that the loop order has a huge impact on performance: 9.2× and 4.3× in Figures 1(a) and 1(b), respectively. This is in line with the findings of Hammond (2009) , who pointed out that the best loop order for a multidimensional transpose can have a huge impact on performance. (4) However, this impact on performance is less noticeable once explicit vectorization is used. (5) Since the cyan ( ) lines in Figures 1(a) and 1(b) are practically the same, one can conclude that the difference in performance between the two transpositions is due to hardware prefetching. (6) The difference between the blue line ( ) and the horizontal black line in Figure 1 (a) indicates that when it is possible for the compiler to reorder the loops, the code generated is much better than most loop orders, but still about 40% away from the best one; similarly for Figure 1(b) , the compiler fails to identify the best loop order.
While observation (6) suggests that modern compilers struggle to find the best loop order at compile time, an even bigger incentive to adopt a search-based approach is provided by observation (5), since detailed information about the mechanism of hardware prefetchers is not welldocumented. Moreover, the implementations of hardware prefetchers can vary between architectures and manufactures. Hence, designing a generic analytical model for different architectures seems infeasible at this point.
The contributions of this article can be summarized as follows.
-We introduce TTC, a high-performance transpose generator that supports single, double, single-complex, double-complex, and mixed precision data types, generates multithreaded C++ code with explicit vectorization for AVX-enabled processors, and exploits spatial locality. -We also introduce a comprehensive multidimensional transpose benchmark, to provide the means of comparing different implementations. By means of this benchmark, we perform a thorough performance comparison on two architectures. -We analyze both the effects of four different optimizations in isolation and quantify the impact of a dramatic reduction of the search space.
The remainder of the article is structured as follows. Section 2 summarizes the related work on tensor transpositions. In Section 3, we introduce TTC with a special focus on its optimizations. In Section 4, we present a thorough performance analysis of TTC's generated implementationsshowing both the attained peak bandwidth as well as the speedups over a realistic baseline implementation. Finally, Section 5 concludes our findings and outlines future work. McCalpin and Smotherman (1995) realized that search is necessary for high-performance 2D transpositions as early as 1995. Their code-generator explored the optimization space in an exhaustive fashion. Mateescu et al. (2012) developed a cache model for IBM's Power7 processor. Their optimizations include blocking, prefetching, and data alignment to avoid conflict-misses. They also illustrate the effect of large TLB 6 page sizes on performance. Lu et al. (2006) developed a code-generator for 2D transpositions using both an analytical model and search. They carried out an extensive work covering vectorization, blocking for both L1 cache and TLB, while parallelization was not explored. Vladimirov (2013) presented his research on in-place, square transpositions on Intel Xeon and Intel Xeon Phi processors. Chatterjee and Sen (2000) investigated the effect of cache and TLB misses on performance for square, in-place, 2D transpositions. Among other things, they concluded that "the limited number of TLB entries can easily lead to thrashing" and that "hierarchical non-linear layouts are inherently superior to the standard canonical layouts."
RELATED WORK
While there has been a lot of research targeted on 2D transpositions (McCalpin and Smotherman 1995; Mateescu et al. 2012; Goldbogen 1981; Vladimirov 2013; Chatterjee and Sen 2000; Lu et al. 2006 ) and 3D transpositions (Jodra et al. 2015; Ding 2001; van Heel 1991) , higher-dimensional transpositions have not yet experienced the same level of attention.
Ding (2001) and He and Ding (2002) present an algorithm dedicated to multidimensional in-place transpositions. Their approach is optimal with respect to the number of bytes moved. However, their results suggest that this approach does not yield good performance if the position of the stride-1 index changes (i.e., π (i 1 ) i 1 ).
The work of Wei and Mellor-Crummey (2014) is probably the most complete study of multidimensional transpositions so far. Their code-generator, which "uses exhaustive global search," explores blocking, in-cache buffers to avoid conflict misses (Gatlin and Carter 1999), loop unrolling, software prefetching, and vectorization. The generated code achieves a significant percentage of the system's memcpy bandwidth on an Intel Xeon and an IBM Power7 node for cache-aligned transpositions. However, a parallelization approach is not described, and different loop orders are not considered.
Lyakh (2015) designed a generic multidimensional transpose algorithm and evaluated it across different architectures (e.g., Intel Xeon, Intel Xeon Phi, AMD, and NVIDIA K20X). In contrast to our approach, theirs does not rely on search. Their results suggest that on both the Xeon Phi as well as the NVIDIA architectures, there still is a significant performance gap between their transposition algorithm and a direct copy.
TENSOR TRANSPOSE COMPILER
TTC is a domain-specific compiler for tensor transpositions of arbitrary dimension. It is written in Python and generates high-performance C++ code. 7 For a given permutation 8 and tensor-size, TTC explores a search space of possible implementations. These implementations differ in some properties that have direct effects on performance, for example, loop order, prefetch distance, blocking; each of such properties will be discussed in the following sections. Henceforth, we use the term "candidate" for all these implementations; similarly, we use the term "solution" to denote the fastest (i.e., best performing) candidate.
To reduce the compilation time, TTC allows the user to limit the number of candidates to explore; by default, TTC explores up to 200 candidates. Unless the user specifically wants to explore the entire search space exhaustively, TTC applies heuristics to prune the search space and to identify promising candidates that are expected to yield high performance. A good heuristic should be generic enough to be applicable on different architectures, and it should prune the search space to a degree so the remaining implementations can be evaluated exhaustively. Once the search space has been pruned, TTC generates the C++ routines for all the remaining candidates, which are compiled and timed.
The minimum required input to the compiler is the actual permutation and the size of each tensor dimension; while the generated code is optimized for the given tensor size, it also works 7 Changing TTC to generate C code instead of C++ is trivial. 8 We use the terms permutation and transposition interchangeably. 
for other sizes as well. Moreover, several additional arguments to pass extra information and to guide the code-generation process can be supplied by command-line arguments; a subset of the input arguments is listed in Table 1 . For instance, by using the --lda and --ldb arguments, it is possible to transpose tensors that are a portion of larger tensors. This feature is particularly interesting, because it enables TTC to generate efficient packing routines of scattered data elements into contiguous buffers ; such routines are frequently used in dense linear algebra algorithms such as a matrix-matrix multiplication (Low et al. 2015) . Moreover, TTC supports single-, double-, single-complex-, and double-complex data types for both the input A and the output B-via --dataType. TTC is also able to generate mixed-precision transpositions-where A and B are of different data type (e.g., --dataType=sd denotes that A uses single-precision while B uses double-precision). This feature is again especially interesting in the context of linear algebra libraries, since it allows us to implement mixed-precision routines effortlessly. Furthermore, the user can guide the search by choosing certain blockings, prefetch distances or loop orders-and thereby reduce the search space. With the argument --maxImplementations, the user influences the compile time by imposing a maximum size to the search space; in the extreme case in which this flag is set to one, the solution is returned without performing any search. On the other hand, setting --maxImplementations=-1 effectively disables the heuristics and instructs TTC to explore the search space exhaustively. A flowchart outlining the stages (1)-(9) of TTC is shown in Figure 2. (1) To reduce complexity, TTC starts off by merging indices, whenever possible, in the input and output tensor. For instance, given the permutation Π(i 1 , i 2 , i 3 ) = (i 2 , i 3 , i 1 ), the indices i 2 and i 3 are merged into a new "super index"ĩ 2 := (i 2 , i 3 ) of the same size as the combined indices (i.e., size(ĩ 2 ) = size(i 2 )×size(i 3 )); as a consequence, the permutation becomes Π(i 1 ,ĩ 2 ) = (ĩ 2 , i 1 ). 9 Next, TTC queries a local SQL database of known/previous solutions, to check whether a solution for the input transposition already exists; if so, no generation takes place, and the previous solution is returned. Otherwise, the code-generation proceeds as follows: (2) one of the possible blockings is chosen, (3) a loop order is selected, and (4) other optimizations (e.g., software prefetching, streaming-stores) are set. The combination of the chosen blocking, the loop order and the optimizations uniquely identify a candidate. After these steps, (5) an estimated cost for the current candidate is calculated. This cost is used to determine whether the current candidate should be added to a queue of candidates or if it should be neglected. The aforementioned input argument --maxImplementations determines the capacity of this queue.
The loop starting and ending at stage (2) is repeated, if different combinations of blockings, loop orders and optimizations are still possible. Once all candidates have been generated, they are (7) compiled by an external C++ compiler, and (8) timed. Finally, (9) the best candidate (i.e., the solution) is selected and stored to a .cpp/h file and its timing information as well as its properties (e.g., blocking, loop order) are saved in the SQL database for future references.
As a compiler, TCC needs the input arguments to be known at compile time, and not at runtime as a library would allow. While TTC's compilation time can be influenced by the flag --maxImplementations, the required time to generate executable code is too large to justify the usage of this domain specific compiler; indeed, the cost for the generation of optimized code should be amortized over multiple executions of the same tensor transposition. Building on the insights described in this work, we recently presented HPTT, a high-performance C++ library for tensor transpositions that avoids the compilation step entirely (Springer et al. 2017) .
For each and every transposition, TTC explores the following tuning opportunities:
-Explicit vectorization (Section 3.1) -Blocking (Section 3.2) -Loop-reordering (Section 3.3) -Software prefetching (Section 3.4) -Parallelization (Section 3.5) -Non-temporal stores, if applicable (Section 3.6)
The following sections discuss these optimizations in greater detail. For the remainder of this article, let w denote the vector-width, in elements, for any given precision and architecture (e.g., w = 8 for single-precision calculations on an AVX-enabled architecture).
Vectorization
With respect to vectorization, we distinguish two different cases: the stride-1 index (i.e., the leftmost index) of the input and output tensors is constant (see Figure 3 (a)) or not (see Figure 3(b) ). To achieve optimal performance, these cases require significantly different implementations. 3.1.1 Case 1: π (i 1 ) = i 1 . When the stride-1 index does not change, vectorization is straightforward. In this case, the transposition moves a contiguous chunk of memory (i.e., the first column) of the input/output tensor at once as opposed to a single element. Hence, the operation is essentially a series of nested loops around a memcopy of the size of the first column. In terms of the memory access pattern, this scenario is especially favorable, because of the available spatial data locality. Since the vectorization in this case does not require any in-register transpositions and merely boils down to a couple of vectorized loads and stores, we leave the vectorization to the compiler, which in this specific scenario is expected to yield good performance.
3.1.2 Case 2: π (i 1 ) i 1 . To take full advantage of the SIMD capabilities of modern processors, this second case requires a more sophisticated approach. Without loss of generality, let us assume that the index i b , with i b i 1 , will become the stride-1 index in B (e.g., i b = i 3 in Figure 3(b) ). Accesses to B are contiguous in memory for successive values of i b , while accesses to A are contiguous for successive values of i 1 . Full vectorization is achieved by unrolling the i 1 and i b loops by multiples of w elements, giving raise to an w × w transpose. Henceforth, we refer to such a w × w tile as a micro-tile. The transposition of a micro-tile is computed by a fully vectorized, in-register transposition, to which we refer as microkernel; an exemplary double-precision microkernel for an AVX-enabled processor is shown in Listing 1.
Using this scheme, an arbitrarily dimensional out-of-place tensor transposition is reduced to a series of independent two-dimensional w × w transpositions, each of which accesses w many w-wide consecutive elements of both A and B. 
Blocking
In addition to the w × w micro-tiles, we introduce a second level of blocking to further increase locality. The idea is to combine multiple micro-tiles into a so-called macro-tile of size b A × b B , where b A and b B correspond to the blocking in the stride-1 index of A and B, respectively. 10 By default, TTC explores the search space of all macro-tiles of size
The flexibility of supporting multiple sizes of macro-tiles has several desirable advantages: first, it enables TTC to adapt to different memory systems, which might favor contiguous writes (e.g., 16 × 32) over contiguous reads (e.g., 32 × 16) and vice versa; second, it implicitly exploits architectural features such as adjacent cacheline prefetching and cacheline-size (e.g., 16 elements for single precision for modern ×86 CPUs); finally, it reduces false-sharing of cache lines between different threads in a parallel setting. Figure 4 illustrates the decomposition of a tensor transposition into micro-tiles and macro-tiles for the 2D case (B i 2 ,i 1 ← A i 1 ,i 2 ). Each macro-tile (colored in green, blue and orange) comprises multiple micro-tiles (colored in shades of green). Each micro-tile is transposed by the in-register transposition discussed in the previous section.
If desired by the user, then TTC can effectively prune the search space and only evaluate the performance of a subset of the available tiles. We designed a heuristic that ranks the blockings for a given transposition and size. Specifically, the heuristic favours blockings with (1) b A and b B both being multiples of the cacheline-size (in elements), and (2) small remainders r i = S i 1 (mod) b i , i ∈ {A, B}, with S i 1 being the size of the stride-1 index of tensor i ∈ {A, B}. The quality of this heuristic is demonstrated in Section 4.4. Fig. 5 . Overview of the software prefetching mechanism for a distance d = 5. A i represents the current macro-tile while A i+1 and A i+2 denote the succeeding macro-tiles, each consisting of four micro-tiles. Arrows denote the order in which the micro-tiles are being processed.
Loop Order
As Figure 1 already suggested, the choice of the proper loop order has a significant influence on performance. Since the number of available orderings for a tensor with d dimensions is d!, determining the best loop order by exhaustive search is expensive, even for modest values of d.
Our heuristic to choose the loop order is designed to increase data locality in both A and B. This strategy fulfills multiple purposes: (1) it reduces cache-and TLB-misses, and (2) it reduces the stride within the innermost loop. The latter is especially important, because large strides can prevent modern hardware prefetchers from learning the memory access patterns. For instance, the maximal stride supported by hardware prefetchers of Intel Sandy Bridge CPUs is limited to 2KiB (Intel Corporation 2015). Other aspects of the hardware implementation affect the cost of different loop orders; for example, the write-through cache policy of the IBM Blue Gene/Q architecture makes it extremely important to exploit write locality, since writing to a cache line evicts it from cache (Finkel 2015) . The reader interested in further details on this heuristic is referred to the available source-code at www.github.com/HPAC/TTC.
Software Prefetching
Software prefetching is only enabled for the case of π (i 1 ) i 1 ; indeed, the memory access pattern for π (i 1 ) = i 1 is so regular that it should be easily caught by the hardware prefetcher.
We designed the software prefetching to operate on micro-tiles; hence, a given prefetch distance d has the same meaning irrespective of the chosen macro-blocking. The prefetching mechanism for A is depicted in Figure 5 ; the prefetching in B works similarly. TTC always prefetches entire w × w micro-tiles. Before transposing the current micro-tile j, TTC prefetches the micro-tile that is at distance d ahead of the current tile. This is illustrated by the colors in Figure 5 , where the macro-tile contains n = 4 micro-tiles: before processing the orange w × w block of the current macro-tile A i , TTC already prefetches the orange micro-tile of the corresponding macro-tile A p , with p = i + (j + d )/n (i.e., A i+1 in Figure 5 ).
Parallelization
Listing 2 contains the generated parallel code 11 for a given tensor transposition with software prefetching disabled. The collapse clause increases the available parallelism and improves load balancing among the threads such that each thread has to process roughly the same amount of macro-tiles.
Listing 2. Parallel code generated by TTC for the permutation (i 3 , i 2 , i 1 ).
The parallel implementation of the tensor transposition with prefetching enabled is almost identical to its sequential counterpart (discussed in the previous section). The only difference is that each thread maintains a private queue with the macro-tiles that it will access in the near future, to prefetch the corresponding blocks into its local caches. These macro-tiles are processed independently, following the sequential implementation.
Non-Temporal Stores
Modern write-back caches typically employ the write-allocate policy (Bryant et al. 2003) . As such, a write-miss (i.e., a write to a memory location that is not in the cache) fetches the corresponding cacheline from main memory before updating it; this additional memory traffic is referred to as write-allocate traffic. This mechanism is favorable if the data involved will be accessed again in the near future.
On the other hand, if the cacheline is not accessed again (i.e., non-temporal), it pollutes the cache with unnecessary data. Modern microarchitectures usually provide some form of non-temporal stores that are specifically designed for non-temporal data. On the Intel Haswell mircorarchitecture, for instance, non-temporal stores avoid the write-allocate traffic, and immediately invalidate the cacheline once it has been written. It is important to keep in mind that data is transferred at the granularity of a cacheline. For instance, the cacheline size on a Haswell CPU is 64 bytes (i.e., two AVX registers = 16 floats = 8 doubles). This means that one has to write 16 single-precision or 8 double-precision consecutive elements at a time; failing to do so will reduce the effective bandwidth, because a later store to the same (invalidated) cacheline will have to fetch this cacheline redundantly.
Non-temporal stores are only useful in the context of tensor transpositions if the output tensor B is not updated (i.e., β = 0 in Equation (1)). Furthermore, non-temporal stores should be avoided if the output tensor is temporal and fits in cache; the packing routines within high-performance BLAS kernels are an example.
When using non-temporal stores, TTC ensures that a full cacheline is transferred at a time. As a result b B is restricted to multiples of the cacheline size (e.g., b B = 16, 32, 48, . . . for singleprecision data). Moreover, instead of updating B from within the microkernel (see Lines 25 ff in Listing 1), each microkernel stores the transposed micro-tile to a corresponding location within a local buffer B of size b B × b A . B is eventually written to B via non-temporal stores at the end of the macro-kernel; this is done to execute stores to the same cacheline back to back, thus enabling the hardware to combine them into a single memory transaction (thereby avoiding redundant memory fetches).
PERFORMANCE EVALUATION
We evaluate the performance of TTC on two different systems, Intel and AMD. The Intel system consists of two Intel Xeon E5-2680 v3 CPUs (with 12 cores each) based on the Haswell microarchi-tecture. For all measurements, ECC is enabled, and both Intel Speedstep and Intel TurboBoost are disabled. The compiler of choice is the Intel icpc 15.0.4 with flags -O3 -openmp -xhost. Unless otherwise mentioned, this is the default configuration and system for the experiments. The AMD system consists of a single AMD A10-7850K APU with 4 cores based on the steamroller microarchitecture. The compiler for this system is gcc 5.3 with flags -O3 -fopenmp -march=native. All measurements are based on 24 threads and 4 threads for the Intel and AMD system, respectively (i.e., one thread per physical core).
Experimental results suggest that optimal performance is attained with one thread per physical core. We also experimented with thread affinity on both systems. 12 The reported bandwidth BW(x ) for solution x is computed as
where S is the size in bytes of the tensor and λ is either 2 or 3, depending on whether B is updated (β = 0) or not (β 0 
Transposition Benchmark
To evaluate the performance of multidimensional transpositions across a range of possible usecases, we designed a synthetic benchmark. The benchmark comprises 19 different transpositions, 13 chosen so no indices can be merged. For each tensor dimension (2D-6D), we included the inverse permutation (e.g., B i 3 ,i 2 ,i 1 ← A i 1 ,i 2 ,i 3 , B i 4 ,i 3 ,i 2 ,i 1 ← A i 1 ,i 2 ,i 3 ,i 4 , . . .) and exactly one permutation for which the stride-1 index does not change. These two scenarios typically cover both ends of the spectrum, yielding the worst and the best performance, respectively.
In the benchmark, each transposition is evaluated in three different configurations-for a total of 57 test cases: one configuration where all indices are of roughly the same size, and two additional configurations where the tensors have a ratio of six between the largest and smallest index; the full collection of test cases is listed in the Appendix (see Table 2 ). The desired volume of the tensors across all dimensions are roughly the same and can be chosen by the user; in our experiments, we fixed it to 200MiB, which is bigger than any L3 cache in use today.
The benchmark is publicly available at www.github.com/HPAC/TTC/benchmark.
TTC-Generated Code
We now present the performance of the fastest implementations-generated by TTC-for all transpositions in the benchmark. Furthermore, we analyze the influence of the individual optimizations (blocking, loop-reordering, software prefetching, and explicit vectorization) on the performance. Figure 6 illustrates the attained bandwidth and speedup across the benchmark for the Intel and AMD systems. The speedups are measured over a reference routine consisting of N perfectly nested loops annotated with both #pragma omp parallel for collapse(N − 1) on the outermost loop, and #pragma omp simd on the innermost loop. Moreover, the loop order for the Fig. 6 . Bandwidth and speedup of TTC's fastest solution across the benchmark for the Intel and AMD system for β 0. The vertical lines identify the dimensionality of the tensors. The black and green solid lines, respectively, denote the system's SAXPY and STREAM-triad bandwidth.
reference version is chosen such that the output tensor B is accessed in a perfectly linear fashion; this loop order reduces false sharing of cache lines between the threads. With this setup, the compiler is assisted as much as possible to yield a competitive routine. Figures 6(a) and 6(c) show the attained bandwidth of the TTC-generated solutions across the benchmark for the Intel and AMD system, respectively. The transpositions are classified in three categories: stride-1 ( ), inverse ( ), and general ( ), respectively, denoting those permutations in which the first index does not change, the inverse permutations, and those transpositions that do not fall into either of the previous two categories. In addition to the bandwidth, these figures also report the STREAM-triad bandwidth (solid green line), as well as the bandwidth of a SAXPY (i.e., single-precision vector-vector addition of the form y ← αx + y, α ∈ R, x, y ∈ R n ; see solid black line). The figures illustrate that TTC achieves a significant fraction of the SAXPY-bandwidth on both architectures (the average across the entire benchmark is 91.68% and 78.30% for the Intel and AMD systems, respectively). With the exception of some performance-outliers on the AMD system, the performance of TTC is stable across the entire benchmark.
It is interesting to note that on the AMD system, TTC attains a much higher bandwidth than the STREAM-triad benchmark. This phenomenon is due the fact that the STREAM benchmark does Fig. 7 . Bandwidth and speedup of TTC's fastest solution across the benchmark for the Intel and AMD system for β = 0. The vertical lines identify the dimensionality of the tensors. The black and green solid lines, respectively, denote the system's MKL-SCOPY and STREAM-copy bandwidth. Non-temporal stores are abbreviated by "NTS." not account for the write-allocate traffic, and that g++ 5.3 does not issue non-temporal stores; thus, it is possible to exceed the STREAM-triad bandwidth by up to 33%. 14 In terms of speedups over the reference implementation, TTC achieves considerable results on both systems (see Figures 6(b) and 6(d) ). When looking closely at the plots, it becomes apparent that the inverse permutations benefit the most from TTC, attaining speedups of up to 8.84× and 18.15× on the Intel and AMD system, respectively. While the speedups for the stride-1 transpositions are much smaller, they can still be as high as 1.66×. For general transpositions, the speedups range from 1.02× to 8.71×, and from 1.28× to 19.44×, for the Intel and AMD system, respectively. Figure 7 illustrates the bandwidth and speedups that TTC attains for the case where B is not updated (i.e., η = 0); the bandwidth with ( ) and without ( ) non-temporal stores is shown separately. We make the following observations. (1) In line with our expectations, the performance loss between β = 0 and β 0 (compare Figures 7(a) and 6(a) or see Table 2 ) is roughly 33%; this is due to the not-accounted write-allocate traffic. (2) If non-temporal stores are used, then this loss is partially reduced; a reason for this performance difference is that the remainder loops cannot utilize non-temporal stores and, as a result, will suffer from write-allocate traffic. (3) Some of the test cases do not exhibit a speedup from non-temporal stores (i.e., 28, 30, 43, 44, 45) ; the reason is that the compiler elects to ignore our hints and issues ordinary temporal stores instead. Finally, (4) one notices that the speedups over the reference implementation (see Figure 7 (b)) are similar to those outlined in Figure 6(b) ; thus, TTC also yields significant speedups in the case of β = 0.
To gain insights on where the performance gains come from, in Figure 8 we report the speedup due to each of TTC's optimizations separately. For each test-case from the benchmark, the speedup is measured as the bandwidth of the fastest candidate without the particular optimization over the fastest candidate with the particular optimization enabled, while all other optimizations are still enabled. 15 TTC: A High-Performance Compiler for Tensor Transpositions 15:15 Fig. 8 . Breakdown of the speedups for the Intel system. Figure 8 (a) presents the speedup that can be gained over a fixed 8 × 8 blocking. The optimal blocking results in up to 35% performance increase and motivates our search in this search dimension. Software prefetching (Figure 8(b) ) also yields a noticeable speedup of up to 11%. In contrast to the high speedups gained by loop-reordering shown in Section 1, TTC only exhibits much smaller ones (see Figure 8(c) ). The reason for this behaviour is twofold: first, we chose a good loop order for the reference implementation (i.e., the loop order for which the output tensor B is accessed in a linear fashion); second, some of the drawbacks of a suboptimal loop order might be mitigated by the other optimizations-especially blocking. Even then, an additional search yields speedups of up to 22% over the reference loop order. Despite the fact that transpositions are memory bandwidth bound, we see an appreciable speedup by implementing an in-register transpose via AVX intrinsics (see Figure 8(d) ). The speedup for vectorization is obtained by replacing our explicitly vectorized 8 × 8 microkernel with a scalar implementation (i.e., two perfectly nested loops with the loop-trip-counts being fixed to 8). While the reference implementation is also vectorized by the compiler, the compiler fails to find an in-register transpose implementation.
All in all, we see that each optimization has a positive effect on the attained bandwidth; the combination of all these optimizations results in significant speedups over modern compilers.
Reduction of the Search Space
We discuss the possibility of lowering the compilation time by reducing the search space by identifying "universal" settings that yield nearly optimal performance. Intuitively, the optimal prefetch distance (for software prefetching) should only depend on the memory latency to the main memory and thus be independent of the actual transposition (See Lee et al. (2012) for details). This observation motivates us to seek a "universal" prefetch distance that would reduce TTC's search space.
To evaluate the performance of TTC for a fixed prefetch distance d, we introduce the concept of efficiency. Let C t be the set of all candidates for the tensor transposition t, x a particular candidate implementation, and d x the prefetch distance used by candidate x; furthermore, let BW(x ) be the bandwidth attained by candidate x, and BW max t the maximum bandwidth among all candidates for transposition t. Then the efficiency E (d, t )-which quantifies the loss in performance one would experience if the prefetch distance were fixed to d-is defined as
(
The efficiency is bounded from above and below by 1.0 and 0.0, respectively-with 1.0 being the optimum. Figure 9 presents the maximum ( ), minimum ( ), and average ( ) efficiency across all the transpositions of the benchmark as a function of the prefetch distance. We notice that (1) there is at least one transposition within the benchmark for which the influence of software-prefetching is negligible (see the leftmost, blue triangle ), (2) for each fixed prefetch distance d, there is at least one transposition for which d is suboptimal (see cyan line ), (3) both the minimum and average efficiency increase with d (cyan and beige lines), and (4) once d is "large enough," the efficiency does not improve much.
Quantitatively, a prefetch distance greater or equal to five increases the average and the minimum efficiency across the benchmark to more than 99% and roughly 98%, respectively. Hence, fixing d to any value between 5 and 8 is a good choice for the given system, effectively reducing the search space by a factor of 9, without introducing a performance penalty.
Quality of Heuristics
On our Intel system, TTC evaluates roughly 8 candidates per second across the whole benchmark (for tensors of size 200MB); this includes all the necessary steps from code-generation to Fig. 10 . Speedup over the reference implementation with the number of TTC-generated candidates limited to 1, 10, 100, and all (i.e., no limit).
compilation and measurement. If a solution has to be generated in a short period of time (i.e., the search space needs to be pruned efficiently), then the quality of the heuristics to choose a proper loop order and blocking becomes especially important.
In TTC, the loop order and the blocking heuristics are given the same weight. More precisely, if the search space is limited to N candidates, then TCC evaluates N combinations of the best √ N loop orders with the best √ N blockings. Figure 10 presents the speedups that TTC achieves if the user limits the number of generated candidates to 1, 10 and 100, and when "no limit" 16 is given (i.e., all reflects the same results as in Figures 6(b) and 6(d) ). When the number of generated candidates is limited to 1, no search takes place, that is, the loop order and the blocking for the generated implementation are determined solely by the heuristics. Even in this extreme case, TTC still exhibits remarkable speedups over modern compilers. Specifically, with a search space of 1, 10, and 100 candidates, TTC achieves 94.58%, 97.35%, and 99.10% of the performance of the unlimited search (averaged across the whole benchmark). In other words, one can rely on the heuristics for the most part and resort to search only in scenarios in which even the last few bits of performance are critical. In those cases, we observe that a search space of 100 candidates already yields results within 1% from those obtained with an exhaustive search.
CONCLUSION AND FUTURE WORK
We presented TTC, a compiler for multidimensional tensor transpositions. By deploying various optimization techniques, TTC significantly outperforms modern compilers and achieves nearly optimal memory bandwidth. We investigated the source of the performance gain and illustrated the individual impact of blocking, software prefetching, explicit vectorization, and loop-reordering. Furthermore, we showed that the heuristics used by TTC efficiently prune the search space, so the remaining candidates are easily ranked via exhaustive search.
For the future, should compilation time become a concern, iterative compilation techniques should be considered (Kisuki et al. 2000; Knijnenburg et al. 2002) . While this current work focused on architectures using the AVX instruction set, TTC is designed to accommodate other instruction sets; in general, the effort to port TTC to a new architecture is related to optimizations such as explicit vectorization and software prefetching. As a next step, we plan to support the AVX512 instruction set (e.g., used by Intel's upcoming Knights Landing microarchitecture), ARM and IBM Power CPUs as well as NVIDIA GPUs. Finally, we will be using TTC as a building block for our upcoming Tensor Contraction Code Generator (TCCG) . 
APPENDIX: FULL BENCHMARK

ACKNOWLEDGMENTS
We thank the anonymous reviewers for their valuable feedback.
