High-performance DSL developers work hard to take advantage of modern hardware. DSL compilers have to build their own complex middle-ends before they can target a common back-end such as LLVM, which only handles single instruction streams with SIMD instructions. We introduce Tiramisu, a common middle-end that can generate efficient code for modern processors and accelerators such as multicores, GPUs, FPGAs, and distributed clusters. Tiramisu introduces a novel three-level IR that separates the algorithm, how that algorithm is executed, and where intermediate data are stored. This separation simplifies optimization and makes targeting multiple hardware architectures from the same algorithm easier. As a result, DSL compilers can be made considerably less complex with no loss of performance while immediately targeting multiple hardware platforms or hardware combinations such as distributed nodes with both CPUs and GPUs. We evaluated Tiramisu by creating a new middle-end for the Halide and Julia compilers. We show that Tiramisu extends Halide and Julia with many new capabilities, including the ability to: express new algorithms (such as recurrent filters and non-rectangular iteration spaces), perform new complex loop nest transformations (such as wavefront parallelization, loop shifting, and loop fusion), and generate efficient code for more architectures (such as combinations of distributed clusters, multicores, GPUs, and FPGAs). Finally, we demonstrate that Tiramisu can generate very efficient code that matches the highly optimized Intel MKL gemm (generalized matrix multiplication) implementation. We also show that Tiramisu can generate code that outperforms the Intel MKL DNN convolutions, with speedups reaching 4× over the original Halide and 16× over the original Julia due to optimizations enabled by Tiramisu.
Introduction
With a diverse set of high-performance hardware platforms available, the development of a credible high-performance Domain-Specific Language (DSL) compiler requires cross-platform code generation [1, 2, 7, 14, 45, 49, 50]. Many of these DSL compilers target intermediate representations (IRs) such as the LLVM IR [36], a low-level representation that models an abstract, RISC-like CPU with infinite registers. However, before generating the low-level LLVM IR, DSL compilers must create their own middle-end passes that transform their architecture-independent high-level IR to take advantage of target hardware platforms such as multicores, distributed clusters, GPUs, and FPGAs, each with very different requirements.
These platform-specific transformations cannot be done in the LLVM IR, as it encodes memory accesses with a concrete memory layout, which reduces optimization opportunities. It also adds unnecessary dependences, such as anti- and output-dependences, forcing the compiler to modify the data layout in order to perform certain optimizations. For example, compilers privatize [26, 38, 56] and expand [22, 43, 44] arrays and scalars in order to parallelize loop nests; subsequently, the arrays are contracted [19, 37, 48] to minimize the memory footprint.
This paper presents Tiramisu, a compiler middle-end that hides the complexity of the large variety of execution platforms and reduces the burden on DSL creators by providing a multi-layer representation suitable for lowering the high-level, domain-specific representation of a DSL to GPUs, multicore CPUs, distributed machines, and FPGAs. The novel multi-layer design exposes three different representations that make it easier to perform each program transformation at the right level of abstraction. The top layer (the abstract computation layer) describes the pure algorithm using producer-consumer relationships, without memory locations. The second layer (the computation placement layer) specifies the order of the computations, along with which processor computes each value; this layer is suitable for performing a vast number of optimizations without dealing with concrete memory layouts. Finally, the third layer (the concrete computation layer) specifies where to store intermediate data before they are consumed. Tiramisu then lowers the third layer to a target architecture while performing back-end-specific code optimizations.
Tiramisu does not automatically parallelize code or make other transformation decisions; the goal of the middle-end framework is to provide mechanisms, not policy. Policy is the responsibility of the DSL compiler or the application programmer, as in Halide [49]. Tiramisu instead concentrates on providing scheduling primitives that can be used to transform programs and map them to the underlying hardware. This frees DSL designers from needless bookkeeping, analysis, and redundant implementation of code transformations, and it provides code generation for multiple architectures.
Tiramisu uses a unified framework based on polyhedral sets to represent the three layers. This makes it simple for Tiramisu to reason about and implement iteration space and data layout transformations, since these are represented as transformations on polyhedral sets. It also simplifies deciding the legality of transformations (based on dependence analysis). The use of a unified general framework relieves compiler developers from developing custom algorithms and analyses that only work in certain situations and cannot be composed with other optimization passes.
This paper makes the following contributions:
• We show that a unified middle-end for DSL compilers can generate code for multiple high-performance architectures such as multicore CPUs, GPUs, FPGAs, distributed machines, or any combination of these, using a set of simple scheduling commands to guide the program transformation and code generation.
• We present the first compiler framework that matches the performance of the highly optimized Intel MKL gemm, comparing a single gemm kernel from MKL to a single gemm kernel generated by Tiramisu. To the best of our knowledge, no previous compiler framework has demonstrated such performance: previous frameworks could outperform Intel MKL only by fusing multiple kernels, which a library such as Intel MKL cannot do. Tiramisu supports such fusion as well, to obtain even better performance.
• We introduce a novel three-layer intermediate representation for the middle-end that separates the algorithm from the code transformations and the data layout transformations, simplifying the composition of architecture-specific lowering transformations.
• We implemented this middle-end, Tiramisu, and demonstrate its power and viability by using it as the middle-end for the Halide [49, 50] and Julia [7] compilers.
• We demonstrate the expressiveness of Tiramisu by extending Halide with many new capabilities such as tiling recurrent filters, performing precise bound inference for non-rectangular iteration spaces, and performing advanced loop nest transformations such as skewing, shifting, and loop fusion.
• We evaluate Tiramisu, showing that we match Intel MKL gemm and obtain performance gains of up to 4× in Halide and 16× in Julia.
Middle-End Challenges
Building a compiler middle-end requires overcoming many challenges. The middle-end must support a large enough class of computations to be usable by multiple DSLs. The middle-end must also produce code for most of the popular high-performance computing systems, including multicore CPUs, clusters of compute nodes, and accelerators such as GPUs and FPGAs, in many combinations. Finally, the middle-end has to be able to perform heroic program and data transformations and compose them together to generate highly efficient code. These transformations must be orchestrated and tailored to the application as well as the underlying execution hardware.
The Data Dependence Challenge
Most IRs use memory as a means of communication between program statements. This creates memory-based dependences in the program, and it means that the data layout is chosen before the code is optimized. Optimizing a program for different hardware architectures usually requires modifying the data layout (since different data layouts suit different hardware architectures) and eliminating memory-based dependences, since they restrict optimizations [42]. Consider the simple image processing pipeline in Figure 1-(a) (left), which has not been optimized yet. In order to optimize it for a multicore machine, we can parallelize the outermost loop (over the i iterator). To do so, we must first expand the two-dimensional arrays b1 and b2 into three-dimensional arrays. In addition to the parallelization, data locality can be improved by fusing the brightening and clamp stages. After fusion, we can replace the b1 array accesses with accesses to a scalar. After applying these optimizations, we get the code shown in Figure 1-(b) (left). Assuming now that the target architecture is multi-GPU instead of CPU, changing from an Array-of-Structures (AoS) layout to a Structure-of-Arrays (SoA) memory layout may lead to better performance. In order to maximize parallelism on the GPU, we apply loop fission to the loop over b2 and the loop over out. The result of applying these optimizations is shown in Figure 1-(c) (left).
Applying the previous data-layout transformations and eliminating memory-based dependences is, in general, challenging [19, 22, 26, 37, 38, 43, 44, 48, 56]. However, if the program dependences are captured using producer-consumer relationships and the data layout is not specified early on, all program transformations can be performed without the complexity of memory layout transformations. Most modern DSL IRs capture dependences with producer-consumer relationships. Thus, the middle-end compiler can implement program transformations using the producer-consumer relationships first and then introduce memory layouts as a further lowering step. This requires carefully designing the middle-end IR in multiple layers. We show that a three-layer IR design can overcome these difficulties.
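To make this concrete, the following is a hedged sketch of the Layer I (producer-consumer) form of the brightening and clamp stages of Figure 1, in the notation introduced later in the paper; the 1.5 brightening factor matches the Layer I example shown later, while the saturation at 255 in the clamp expression is our assumption:

{b1(i,j,c): 0<=i<N and 0<=j<M and 0<=c<3}: 1.5 * img(i,j,c)
{b2(i,j,c): 0<=i<N and 0<=j<M and 0<=c<3}: min(b1(i,j,c), 255)

Because b2 refers to b1 only through this producer-consumer relationship, storing b1 in a scalar, a 2D array, or an expanded 3D array remains a later, independent choice, which is exactly what the optimizations above exploit.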
The Program Representation Challenge
Lowering code to complex hardware architectures requires many transformations such as loop interchange, blocking, skewing, fission, fusion, and distribution. These transformations change the loop structure in complex ways by introducing new loops, complex loop bounds, and non-trivial access functions [59]. Analyzing code generated by one of these transformations is challenging, which complicates composition with other transformations. This problem can be solved if the loops are kept within a single unified representation through all the transformations. However, many representations are inadequate or too conservative to support complex transformations. They are also inadequate for performing tasks such as dependence analysis (which is necessary for deciding the correctness of optimizations) and the computation of communication sets in the general case. For example, the interval-based representation used in Halide [49] is unable to support accurate bounds computation for non-rectangular iteration spaces. It also cannot represent programs that have cyclic dependence graphs, and it does not naturally support complex transformations such as loop skewing (wavefront parallelism). We show that a polyhedral representation is more flexible and more powerful, and is capable of supporting all the transformations necessary in the middle-end.
The Optimization Orchestration Challenge
Producing high-performance code for a computer system from an architecture-independent algorithm requires a series of program transformations. These transformations are non-trivial and highly dependent on the program structure, the input data, and the target architecture. Parallelizing compilers have worked to fully automate this process using cost models, heuristics [28], and machine learning [54]. However, obtaining performance comparable to hand-coded implementations using such fully automated approaches remains a challenging problem. A more promising approach is to expose optimizations to the user by providing a set of optimization directives. Languages such as Halide [13, 49] have shown that this approach is viable. However, in order to extend this to a common middle-end, the scheduling language must encompass a much broader set of transformations and seamlessly compose them. Figures 1-(b) (Schedule) and 1-(c) (Schedule) show how a simple collection of scheduling commands can map the architecture-independent program onto different architectural configurations.
The MPI+OpenMP+CUDA Challenge
Most high-performance computer systems are complex and heterogeneous. A typical supercomputer has multiple interconnected nodes equipped with multicore CPUs with vector units, connected via NUMA shared memory. Nodes may also have multiple GPU accelerators [39], and recently, data centers have been adding FPGA accelerators to this mix [10]. Getting the best performance requires the program to take full advantage of all these different components. In the supercomputing community this is usually referred to as the MPI+OpenMP+CUDA challenge [60]. Writing code that targets such heterogeneous systems is non-trivial, as each component requires a drastically different style of code and optimization, using different libraries and languages. Getting all of these components to communicate and synchronize is non-trivial as well. The state-of-the-art practice is to manually write the program in separate language extensions; however, any change to the program partitioning between these heterogeneous units normally requires a complete rewrite of the program. With Tiramisu, it is possible to generate code for these complex computer systems in a simple, flexible, and malleable way. Figure 1-(c) (left) shows an example where the algorithm is mapped to a GPU cluster using simple scheduling commands, without the need to change the algorithm or to write code in different languages and libraries.
Tiramisu Overview
Tiramisu is a middle-end compiler framework for DSLs. Using this framework, DSLs can transform their architecture-independent IRs into architecture-specific, low-level IRs while taking advantage of modern architectural features such as multicore parallelism, non-uniform memory access (NUMA) hierarchies, clusters, and accelerators like GPUs and FPGAs. Tiramisu is designed for DSLs that logically operate over dense data using loop nests and sequences of statements, which is the case for DSLs for image processing, dense linear algebra, and stencil computations, among others.
The Three-Layer Intermediate Representation
Tiramisu uses polyhedral sets to represent each of the three IR layers and uses polyhedral set and relation operations to represent transformations on the iteration domain and data layout. Polyhedral sets and relations are described using affine (linear) constraints over loop iterators and program parameters (invariants) and are implemented in Tiramisu using ISL [58]. We use a combination of classical extensions to the polyhedral model in order to support non-affine iteration spaces; these extensions are sufficient and practical for large classes of programs [4, 5].

Figure 1. Three versions of the motivating example (left) and their equivalent Layer I, II, and III representations (right).
A typical workflow for using Tiramisu is illustrated in Figure 2. DSL compilers parse input programs and perform domain-specific optimizations before translating the DSL program into Layer I of the Tiramisu IR. The first layer of the IR is then transformed to the lower layers (Layer II and Layer III), and finally LLVM IR or another low-level IR is generated.
The three layers of the Tiramisu IR are:
Layer I (Abstract Computation Layer) which specifies the algorithm without specifying the schedule (when and where the computations occur) or how data should be stored in memory (data layout). As there is no notion of data location, values are communicated via explicit producer-consumer relationships.
Layer II (Computation Placement Layer) which specifies the order of execution of computations and the processor on which they execute. This layer does not specify how intermediate values are stored in memory; this simplifies optimization passes since these passes do not need to perform complicated data-layout transformations. The transformation of Layer I into Layer II is done automatically using scheduling and data layout commands. Examples of the scheduling commands supported in Tiramisu are presented in Table 1.
Layer III (Concrete Computation Layer) which makes the data layout concrete, specifying where intermediate values are stored. The separation into levels does not force data-layout mapping to occur after scheduling; in Tiramisu, the user can still specify data layout before scheduling (to constrain scheduling, for example). The separation ensures that the scheduling phase can safely assume no data-layout transformations are required, greatly simplifying scheduling transformations. If a user requests a transformation that cannot be performed with the specified data layout, Tiramisu will prevent the illegal transformation from occurring, ensuring correctness.
In the following section, we provide more details about the three layers of Tiramisu.
The Tiramisu IR
The input to Tiramisu is the Layer I computations and a set of scheduling and data layout commands. Layer II is generated by applying the schedule to Layer I. Commands for buffer allocation, data layout mapping, and communication (between CPU nodes for example) are then added to the Layer II representation; the result constitutes Layer III. An annotated abstract syntax tree (AST) is then generated from Layer III. This AST is traversed to generate the target code.
In this section, we describe in detail the three representations used in Tiramisu. We also describe scheduling via high-level scheduling commands as well as low-level scheduling maps. We begin by showing an example.
An Example in the Three-Layer IR
We first provide an overview of the concepts of polyhedral sets and maps. More details and a formal definition of these concepts are provided in the Appendix.
An integer set is a set of integer tuples described using affine constraints. An example of a set of integer tuples is {(1, 1); (2, 1); (3, 1); (1, 2); (2, 2); (3, 2)}. Instead of listing all tuples in a set, we describe the set using affine constraints over loop iterators and symbolic constants:

{S1(i,j): 1<=i<=3 and 1<=j<=2}

where i and j are the dimensions of the set. A map is a relation between two integer sets. For example,

{S1(i,j) -> S2(i+2,j+2): 1<=i<=3 and 1<=j<=2}
is a map between tuples in the set S1 and tuples in the set S2 (e.g. the tuple S1(i,j) maps to the tuple S2(i+2,j+2)). We use the Integer Set Library (ISL) [58] notation for sets and maps. Figure 1 shows the code for each optimized implementation discussed in the previous section. The original, unoptimized code is shown in Figure 1-(a), with the right side showing the Layer I representation. This Layer I representation is the same for all the code variants, as this layer specifies the computation in a high-level form separate from scheduling.
Each line in Layer I of Figure 1-(a) (right side in the figure) corresponds to a statement in the algorithm (left side of the figure): for example, the first line of Layer I represents line 5 in Figure 1-(a). The first part of that line, {b1(i,j,c): 0<=i<N and 0<=j<M and 0<=c<3}, specifies the iteration domain of the statement, while the second part, 1.5*img(i,j,c), is the computed expression. The iteration domain is the set of tuples b1(i,j,c) such that 0<=i<N, 0<=j<M, and 0<=c<3.
Computations in Layer I are not ordered; the declaration order does not affect their order of execution, which is specified in Layer II.

Table 1. Tiramisu provides a large set of high-level scheduling and data layout transformation commands.

Layer II is generated automatically by applying these commands to Layer I. The Layer II representation is also shown in Figure 1-(b). Computations in Layer II are ordered based on their lexicographical order. The set {b1(i(cpu),0,j,c,0): 0<=i<N and 0<=j<M and 0<=c<3} in the example is an ordered set of computations. The tag (cpu) on the i dimension indicates that this dimension is a space dimension and that the ith iteration is mapped to the ith CPU. In Layer II, the computation order is controlled by a total ordering of these tuples.
Layer III in Figure 1-(b) adds data layout mapping to Layer II, concretizing where each computation is stored (memory buffers and scalars are also declared in this layer). In the example, the data mapping

{b2(0,i(cpu),j,c,1) -> buf_b2[i,j,c]: 0<=i<N and 0<=j<M and 0<=c<3}

indicates that the result of the computation b2(0,i(cpu),j,c,1) is stored in the array element buf_b2[i,j,c]. Data mapping in Tiramisu is an affine relation that maps a computation from Layer II to a buffer element; scalars are single-element buffers. Tiramisu allows the expression of any data-layout mapping that can be expressed as an affine relation (examples are provided in Section 12.3). For brevity, the declaration of buffers, their types, and their allocation (including when and where they are allocated) are all omitted from the example, but such information must be specified for correct code generation.
Layer I: Abstract Computation Layer
The first layer defines abstract computations, which are not yet scheduled or mapped to memory. Each computation represents an expression that should be computed.
As an example, the following code

for (i in 0..4)
  for (j in 0..4)
    if (i < j and i != 2)
      A[i][j] = cos(i);

can be represented in Layer I as

{A(i,j): 0<=i<4 and 0<=j<4 and i<j and i!=2}: cos(i)

though it is important to remember that this representation, unlike the pseudocode above, does not necessarily store results to memory locations. A(i,j) is the computation, while the constraints over i and j define the iteration domain. The second part, cos(i), is the computed expression.
Computations in Layer I are in Static Single Assignment (SSA) form [18] ; each computation is defined only once (we use the ϕ operator to deal with branches as in classical SSA).
Reductions and Updates. Reductions and updates do not fit naturally in the memory-independent model used within Tiramisu, and thus we treat them as a special case. To implement algorithms that perform a reduction or update a variable (a histogram, for example), we declare a new computation for each update. These computations will all be mapped to the same buffer in Layer III. For example, a dense matrix multiplication, which has a reduction, is represented in Layer I as follows:

{c0(i,j): 0<=i<N and 0<=j<N}: 0
{c1(i,j,k): 0<=i<N and 0<=j<N and 1<=k<N}: ϕ(c0(i,j), c1(i,j,k-1)) + A(i,k)*B(k,j)
Since c1(i,j,k) needs to read the results of the computations c0(i,j) and c1(i,j,k-1), we use a ϕ node to merge them into the single expression ϕ(c0(i,j), c1(i,j,k-1)). Although the use of ϕ nodes could be avoided in this case, in the general case ϕ nodes are necessary to support cases such as the definition of computations within data-dependent conditions.
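To make the "mapped to the same buffer" step concrete, the Layer III mapping for this example could look as follows (the buffer name buf_C is hypothetical):

{c0(i,j) -> buf_C[i,j]: 0<=i<N and 0<=j<N}
{c1(i,j,k) -> buf_C[i,j]: 0<=i<N and 0<=j<N and 1<=k<N}

Each update c1(i,j,k) then overwrites the previous partial sum stored in buf_C[i,j], recovering the usual in-place reduction while keeping Layer I in single-assignment form.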
Support for Non-Static-Affine Iteration Spaces. Tiramisu can represent non-static-affine code. In particular, Tiramisu can represent non-static-affine array accesses, while loops, non-static-affine loop bounds, and non-static-affine conditionals. Tiramisu treats any non-static-affine conditional in a way similar to [4]: the conditional is represented as a single macro-statement together with its body (i.e., as a statement encapsulating both the control and the body). while loops and loops with non-static-affine bounds are handled in a way similar to [6].
Layer II: Computation Placement Layer
The computation placement layer describes when and where each computation is computed. Unlike computations in the first layer, computations in this layer are ordered (specifying when) and are assigned to a particular processor (specifying where). This order is dictated by space dimensions and time dimensions. Space dimensions specify on which processor computations should be executed; such dimensions are not relevant for determining the order of execution. Time dimensions, on the other hand, specify the order of execution relative to other computations. The order of execution of computations is determined by the lexicographical ordering of the dimensions. Space dimensions are distinguished from time dimensions using tags, which consist of a processor type followed by zero or more properties. Currently, Tiramisu supports the following space tags:
• cpu: the dimension runs on a CPU in a shared memory system;
• node: the dimension maps to nodes in a distributed system;
• gpu_thread_X: the dimension runs on a GPU thread (dimension X, where X=0 for the outermost dimension and X=2 for the innermost); similar tags are used for GPU blocks.
Tagging a dimension with a processor type indicates that the dimension should be distributed over processors of that type in a system; for example, tagging a dimension with cpu will execute each iteration in that dimension on a separate CPU.
In addition to processor type, tags can optionally include one of the following dimension properties:
• vec(s): vectorize the dimension (s is the vector length);
• unroll: unroll the dimension;
• pipeline: pipeline the dimension (FPGA only).
Computations mapped to the same processor are ordered by projecting the computation set onto the time dimensions and comparing their lexicographical order, without considering the name of the computation, since all computations in this layer are in the same time-space domain.
Layer III: Concrete Computation Layer
The concrete computation layer specifies memory locations for storing computed values. It consists of the Layer II representation along with allocation/deallocation statements, and a set of access relations, which map computations from Layer II to array elements read or written by those computations. Scalars are treated as single-element arrays. For each buffer, an allocation statement is created, specifying the type of the buffer (or scalar) and its size, and is scheduled by being mapped to the time-space domain. Similarly, a deallocation statement is also added.
Possible data mappings in Tiramisu include mapping computations to structures-of-arrays, arrays-of-structures, and contraction of multidimensional arrays into arrays with fewer dimensions or into scalars. It is also possible to specify more complicated accesses, such as storing a computation c(i,j) into the array element c(i%2,j%2) or into c(j,i).
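For instance, the mappings mentioned above can be written as the following access relations (buffer names are illustrative):

{c(i,j) -> buf[j,i]}      // transposed layout
{c(i,j) -> buf[i%2,j%2]}  // folded (modulo) storage: the array is contracted to 2x2
{c(i,j) -> buf[0]}        // full contraction to a scalar (a single-element buffer)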
Generating Layer II and III from Layer I
Transforming the first layer into the second layer is usually done using an affine relation called a scheduling map. This maps each computation in the first layer into a particular position in time-space. Composing many transformations can be done simply by composing different scheduling maps.
Scheduling Maps
Affine transformations including loop tiling, skewing, loop fusion, distribution, splitting, reordering, and many others can be expressed as an affine map that maps computations from Layer I into the time-space domain in Layer II. A scheduling map takes as input the iteration domain from Layer I and transforms it into a new set that represents the computation in the time-space domain. For example, suppose we want to tile the following computation (which is in Layer I) into 16 × 16 tiles and parallelize the outermost loop:
{C(i,j): 0<=i<N and 0<=j<N}: A(i,j) + B(i,j)
To do so, we provide the following scheduling map to Tiramisu:
{C(i,j)->C(i1(cpu),j1,i2,j2):i1=floor(i/16) and i2=i%16 and j1=floor(j/16) and j2=j%16 and 0<=i<N and 0<=j<N}
which will produce the following set in Layer II:
{C(i1(cpu),j1,i2,j2): i1=floor(i/16) and i2=i%16 and j1=floor(j/16) and j2=j%16 and 0<=i<N and 0<=j<N}: A(i1*16+i2, j1*16+j2) + B(i1*16+i2, j1*16+j2)
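From this Layer II set, code generation would then emit a loop nest of roughly the following shape (a sketch that ignores partial tiles when N is not a multiple of 16):

parallel for (i1 in 0..N/16)   // i1 is a cpu space dimension
  for (j1 in 0..N/16)
    for (i2 in 0..16)
      for (j2 in 0..16)
        C[i1*16+i2][j1*16+j2] = A[i1*16+i2][j1*16+j2] + B[i1*16+i2][j1*16+j2]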
High-Level Scheduling Commands
Tiramisu provides a set of predefined scheduling maps for common affine loop nest transformations. Table 1, presented previously, shows examples of Tiramisu high-level scheduling commands. These commands are similar to those in Halide [49] and CHiLL [13]. The high-level scheduling commands in Tiramisu provide an easy-to-use interface for advanced loop nest transformations in a composable way, while still enabling advanced users to provide their own low-level scheduling maps to modify the space-time mapping for scheduling not covered by typical compiler transformations.
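For example, the tiling-and-parallelization scheduling map shown above could equivalently be expressed by composing two high-level commands; the command names below follow the style of Table 1 but are illustrative:

C.tile(i, j, 16, 16, i1, j1, i2, j2);  // expands to the tiling scheduling map
C.parallelize(i1);                     // tags i1 as a cpu space dimension

Since each command is internally just a predefined scheduling map, composing commands amounts to composing the corresponding maps.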
Checking the Validity of Schedules
In order to check the validity of transformations, we first compute the dependences of the input program and then check the validity of each transformation using violated dependence analysis [57].
Generating Tiramisu from DSLs
To demonstrate the utility of Tiramisu, we integrated it into two DSL compilers: Halide [49] and Julia [7] . A DSL compiler that uses Tiramisu must generate three pieces of information:
• Layer I, which describes the algorithm;
• A scheduling map or a scheduling command;
• Commands declaring the buffers and mapping the computations to buffer elements.
Tiramisu takes these three inputs, generates Layer II and Layer III automatically, and then generates an abstract syntax tree (AST) from Layer III. The AST is traversed to generate target code (LLVM IR, Nvidia PTX, ...).
Halide to Tiramisu
Halide [49] is an industrial-quality DSL for image processing. We generate Tiramisu IR from Halide by mapping a Halide Func, which is equivalent to a statement in a loop nest, directly to a Tiramisu computation (Layer I). Reductions, which update the same function, are mapped to Tiramisu computations as described in Section 4.2. Halide scheduling directives, such as tiling, splitting, reordering, parallelization, and vectorization, are directly mapped to the equivalent high-level scheduling commands defined in Tiramisu. Finally, we map computations to buffer elements using the default Halide mappings, while allowing Halide scheduling commands that control data mappings to perform the equivalent transformations on the Layer III representation. The rest of the code generation down to low-level executable code takes place within Tiramisu.
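As a hedged illustration of this translation (the Halide Func below is hypothetical), a stage such as

blur_x(x, y) = (in(x, y) + in(x+1, y) + in(x+2, y)) / 3;

becomes the Layer I computation

{blur_x(x,y): 0<=x<W and 0<=y<H}: (in(x,y) + in(x+1,y) + in(x+2,y))/3

where the bounds W and H come from Halide's bounds inference, and the default Layer III mapping stores blur_x(x,y) into the buffer element buf_blur_x[x,y].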
Julia to Tiramisu
Julia is a high-level, dynamically-typed programming language designed for numerical computing. In contrast to Halide, Julia is more general: it supports while loops and recurrent computations, and it is memory-based (i.e., it uses variables, unlike Halide, which mostly defines pure functions). We extend Julia with a set of scheduling directives and function annotations; functions annotated with the @acc macro are optimized with Tiramisu. We generate Tiramisu from the low-level Julia IR (which is in SSA form) by translating each statement in the Julia IR into a computation in Tiramisu. This low-level IR does not have high-level control flow (it only has gotos); thus we change the compilation flow of Julia to annotate the low-level IR with information about the original control flow of the program. We use the annotations to recover the control flow and generate the iteration domain of each computation. Although Julia has another, higher-level IR that does have control flow information, we cannot use that IR because it lacks the necessary data type annotations.
We transform the memory-based Julia IR into the producer-consumer Tiramisu IR using classical array expansion techniques [22, 43, 44]. The goal here is to extract the data-flow representation of the code. The user is then free to change the data layout of computations using high-level data-layout directives.
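A hedged sketch of this expansion step, written in the paper's pseudocode notation rather than the actual Julia IR:

// memory-based form: the scalar t is overwritten in every iteration
for (i in 0..N)
  t = 2 * A[i]
  B[i] = t + 1

// after expansion: each write becomes a single-assignment computation
{t(i): 0<=i<N}: 2 * A(i)
{B(i): 0<=i<N}: t(i) + 1

Expanding t into t(i) removes the memory-based dependence between iterations; a Layer III directive can later contract t back to a scalar if the chosen schedule permits it.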
Generating Code for Different Platforms
Generating code from Layer III (an ordered set of computations) amounts to generating nested loops (an AST) that visit each computation in the set once and only once, following the lexicographical ordering between integer tuples. Array accesses are generated from the maps that describe the data mapping. The Tiramisu code generator (which uses the ISL [58] library for code generation) takes Layer III as input and generates an AST as output. The AST is then traversed to generate lower-level code targeting different hardware architectures.
Multicore CPU
When generating code that targets multicore shared memory systems, loop levels tagged as cpu space dimensions are translated into parallel loops in the generated code, using OpenMP-style parallelism. Loop levels tagged with the vec space tag are translated into vectorized loops. Currently we only support the vectorization of loops that do not contain any control flow.
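For instance, a computation whose outer dimension is tagged cpu and whose inner dimension is tagged vec would be lowered to approximately the following OpenMP-style sketch (the exact pragmas Tiramisu emits may differ):

#pragma omp parallel for  // from the cpu space tag on i
for (i = 0; i < N; i++)
  #pragma omp simd        // from the vec tag on j; the body must be free of control flow
  for (j = 0; j < N; j++)
    out[i][j] = f(in[i][j]);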
GPU
For GPU code generation, data copy commands are provided in Layer III of Tiramisu. These commands are translated into the equivalent data copy calls in the lowered code. Computation dimensions tagged with GPU thread or GPU block tags are translated into the appropriate GPU thread and block IDs in the lowered code.

1 // Layer I
2 {bx(y,x): 0<=y<rows and 0<=x<cols} :
3     (in(y,x) + in(y,x+1) + in(y,x+2))/3;
4 {by(y,x): 0<=y<rows and 0<=x<cols} :
5     (bx(y,x) + bx(y+1,x) + bx(y+2,x))/3;
6 // Layer II
7 bx.split(y,chunk_sz,y1,y2); by.split(y,chunk_sz,y1,y2);
8 // Layer III
9 comm_prop blk({BLOCK}), ablk({ASYNC,BLOCK});
10 send_recv xfer = create_transfer("{(q,y,x): 1<=q<N-1 and 0<=y<2 and 0<=x<cols}", "{(q,y,x): 0<=q<N-2 and 0<=y<2 and 0<=x<cols}", q-1, q, ablk, blk, bx(y,x));
11 bx.distribute(y1); by.distribute(y1);
12 xfer.s->distribute(q); xfer.r->distribute(q);

Figure 3. Tiramisu pseudocode for a 3x3 distributed blur
Distributed Memory Systems
Tiramisu utilizes MPI to generate code for distributed memory systems. Figure 3 shows Tiramisu pseudocode for a 3x3 distributed box blur. Lines 2 and 4 define the blur computation. This code remains the same regardless of whether we use a shared memory or distributed memory back-end.
For this example, we want to distribute the computation such that each MPI rank (process) operates on contiguous rows of the input data. Each rank gets chunk_sz rows. On line 7, the outer loop is split by chunk_sz. The resulting inner loop ranges over the rows in the chunk, and the outer loop ranges over the number of MPI ranks we want to use.
Lines 9 and 10 deal with communication. We assume that our image data is already distributed, thus only boundary rows need to be communicated among adjacent ranks. Line 9 defines two communication types, which will be used to select the appropriate MPI function. blk represents a blocking call, and ablk represents an asynchronous, blocking call. We use two-sided communication in Tiramisu, meaning communication is done with pairs of send and receive operations. The actual transfer is defined by create_transfer on line 10, which takes as input the send and receive iteration domains, the source and destination ranks, the communication types for send and receive, and the access into the producer for the send.
Line 11 tags dimension y1 of bx and by as distributed, and line 12 tags dimension q of the send and receive as distributed. During code generation, we postprocess the generated code and convert each distributed loop into a conditional based on the rank of the executing process. This covers everything from simple guards (e.g. q>=1 and q<N-1 for the transfer) to more advanced transformations (e.g. specializing a distributed computation based on rank).

Figure 4. Additional Tiramisu commands needed to generate a 3x3 distributed GPU box blur
GPU scheduling can also be composed with distribution, allowing programs to execute in either a multi-GPU or heterogeneous CPU-GPU environment. Only a few extra scheduling commands need to be added to distributed Tiramisu code to enable the use of GPU. Figure 4 shows the four additional scheduling commands needed to convert the distributed box blur code in Figure 3 to distributed GPU code. Lines 1 and 3 copy data from the host (CPU) to the device (GPU) and from the device to the host, respectively. Line 2 tags the computations to run on the GPU. The resulting code can be used to distribute the box blur computation across multiple GPUs that reside on different nodes. As with CPU distribution, we use MPI to control the inter-node communication.
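Putting these pieces together, the following is a hedged sketch of the code each rank executes after distribution for the blur of Figure 3; the MPI calls and their arguments are simplified, and the exact calls Tiramisu emits may differ:

// per-rank pseudocode, rank q out of N
if (q >= 1 && q < N-1)   // guard derived from the send iteration domain
  MPI_Send(border rows of bx, ..., /*dest=*/q-1, ...);
if (q < N-2)             // guard derived from the receive iteration domain
  MPI_Recv(halo rows of bx, ..., /*src=*/q+1, ...);
for (y2 in 0..chunk_sz)  // each rank computes its own chunk of rows
  for (x in 0..cols)
    compute bx(y,x) and by(y,x);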
FPGA
Tiramisu relies on FROST [20] to generate code for FPGAs. FROST is a common back-end for the hardware acceleration of DSLs on FPGA. It exposes an IR that DSLs can target, as well as a high-level scheduling language to express FPGA-specific optimizations.
We integrated FROST within Tiramisu, enabling us to target FPGAs. We use Tiramisu to perform loop nest transformations that are necessary to prepare the code for lowering to FPGA, while FROST focuses on the actual lowering to the target High-Level Synthesis (HLS) toolchain. For example, in order to vectorize a loop, Tiramisu first splits the loop so that the innermost loop has a fixed number of iterations and then tags that loop for later vectorization by FROST. FROST then performs the actual vectorization of both loop and input/output buffers.
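For example, preparing a loop for 64-wide vectorization on the FPGA could be expressed as follows (command names illustrative):

C.split(j, 64, j1, j2);  // the inner loop j2 now has a fixed trip count of 64
C.tag_vector_level(j2);  // FROST later vectorizes this loop and the I/O buffers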
The output of FROST is a C++ implementation of the input code suitable for HLS tools, like Xilinx Vivado HLS [33] . Finally, FROST leverages Xilinx SDAccel to synthesize the resulting FPGA design and produce the bitstream file for execution on actual hardware.
Evaluation
We performed the evaluation on a cluster of dual-socket machines, each with two 24-core Intel Xeon E5-2680 v3 CPUs running at 2.50GHz, Ubuntu 14.04.5 LTS, and an Infiniband interconnect.
Halide to Tiramisu
To evaluate the integration of Tiramisu with Halide, we used the following benchmarks: cvtColor, which converts RGB to grayscale; convolution, a simple 2D convolution;
gaussian, which performs a gaussian blur; warpAffine, which does affine warping on an image; heat2D, a simulation of the 2D heat equation; nb pipeline, a synthetic pipeline that computes two outputs from the same image, a negative and a brightened version; rgbyuv420, an image conversion from RGB to YUV420; heat3D, the 3D heat equation with time-stepping; and ticket #2373, a code snippet from a bug filed against Halide, where over-approximated bounds inference leads the generated code to fail at execution.

Figure 5 compares the execution time of code generated by Halide and by Halide-Tiramisu. In five of the benchmarks (namely convolution, cvtColor, gaussian, heat2D, and warpAffine), the performance of the code generated by Halide-Tiramisu matches that of Halide. We use the same schedule for both implementations.
Two of the other benchmarks, heat3D and ticket #2373, cannot be implemented in Halide. The following is an example of a recurrent filter, extracted from [12], a compiler designed to support recurrent filters.
heat3d(x,y,z,0) = a*in(x, y, z) + b*(in(x+1, y, z) + in(x-1, y, z)+ in(x, y+1, z) + in(x, y-1, z) + in(x, y, z+1) + in(x, y, z-1));
heat3d(x,y,z,t) = a*heat3d(x, y, z, t-1) + b*(heat3d(x+1, y, z, t-1) + heat3d(x-1, y, z, t-1) + heat3d(x, y+1, z, t-1) + heat3d(x, y-1, z, t-1) + heat3d(x, y, z+1, t-1) + heat3d(x, y, z-1, t-1));
This code cannot be implemented in Halide because it contains a cyclic dependence graph, due to the loop over time steps, while the Halide compiler assumes that the dependence graph is a DAG (directed acyclic graph). This limitation exists mainly because it is difficult to prove the legality of optimizations in an interval-based representation in the presence of a cyclic dependence graph. This is not the case for Tiramisu, which relies on precise dependence analysis [23] and on checking the legality of transformations in the polyhedral model [55] to decide whether a transformation can be performed. In ticket #2373, which exhibits a triangular iteration domain, Halide's bounds inference over-approximates the computed bounds, which leads the generated code to fail at execution. This over-approximation is due to Halide's use of intervals to represent iteration domains, which prevents precise bounds inference on non-rectangular iteration spaces. Tiramisu handles this case naturally, since it relies on a polyhedral model where sets can include any affine constraint in addition to min and max bounds. These examples show that the model exposed by Tiramisu naturally supports more complicated code patterns than an advanced, mature DSL compiler.

For nb pipeline and rgbyuv420, the code generated by Halide-Tiramisu achieves a 4× speedup over the code generated by Halide. This is primarily due to fusion: in both cases, Tiramisu can fuse multiple loops into one, which enhances data locality; such loop fusion is currently unsupported in Halide. This is another case demonstrating that the expressiveness of the polyhedral representation in Tiramisu allows the framework to naturally perform iteration domain transformations that are difficult in other models.
Julia to Tiramisu
We used the following benchmarks to evaluate the integration of Tiramisu within Julia: bicg, a biconjugate gradient method; doitgen, a multiresolution analysis kernel; mttkrp, the matricized tensor times Khatri-Rao product; covariance, which performs a covariance computation; and gesummv, summed matrix-vector multiplications. For a fair comparison, the Julia code was tagged with the @inbounds macro to remove boundary checks on buffer accesses. Figure 6 shows the execution time of code generated by Julia-Tiramisu compared to Julia without Tiramisu. The speedups of Julia-Tiramisu on covariance, doitgen, mttkrp, and bicg are mainly due to the improved data locality obtained after tiling with Tiramisu, which is not possible in Julia.

FPGA Backend

We evaluate the FPGA backend in Tiramisu using 6 image processing kernels: convolution, cvtColor, gaussian, scale, sobel, and threshold. We chose these kernels because they are already implemented in the Vivado HLS Video Library [32], which implements several OpenCV functions for FPGA. We compare the execution time of code generated by Tiramisu with the code extracted from the Vivado HLS Video Library. These codes are synthesized using the Xilinx SDAccel 2016.4 toolchain at 200MHz and run on an ADM-PCIE-7V3 board by Alpha Data (powered by a Xilinx Virtex 7 FPGA). For all the kernels, we use a 512 × 384 RGB image, except for the threshold kernel, which takes a single-channel image as input. The HLS Video Library kernels expect the input image to be arranged with channels as the innermost dimension. The accelerator on the FPGA receives a stream of pixels from off-chip memory and processes each channel of a pixel completely in parallel.
While the HLS Video Library parallelizes only the channel dimension, the flexibility of the Tiramisu scheduling commands allowed us to explore other alternatives, including parallelization over the width dimension of the image, leading to better performance (at the expense of more FPGA resources). Indeed, while the Video Library performs at most three computations in parallel (on the channels), the code generated by Tiramisu can perform at most sixty-four computations in parallel, in the case of a 512-bit vectorization of the input/output buffers for an 8-bit image. Figure 7 shows the execution time of Tiramisu/FROST and the Vivado HLS Video Library. Tiramisu with FROST outperformed the Video Library implementations by at least 3×. For each kernel, we used Tiramisu to arrange the input image and split the innermost loop to prepare for vectorization (we applied vectorization to both input and output buffers). We also applied loop pipelining and array partitioning (for convolution, gaussian, and sobel).
Generating BLAS sgemm using Tiramisu
To evaluate the performance of Tiramisu on an extreme case, we used Tiramisu to generate the BLAS generalized matrix multiplication (sgemm), which computes C = αAB + βC. The sgemm implementation in the Intel MKL library is known to be one of the most highly hand-optimized implementations for Intel CPU architectures. We used a large set of optimizations, including three-dimensional L2 and L3 blocking, fusion of the computation of T = αAB and C = T + βC into a single loop, vectorization, unrolling, array packing (as described in [25]), register blocking, and separation of full and partial tiles (which is crucial to enable vectorization and unrolling and to reduce control overhead). We also tuned the tile sizes and unrolling factors for the machine on which we ran our experiments. The resulting kernel matches the Intel MKL implementation, as shown in Figure 8. The Tiramisu implementations of saxpy, convolution, and two fused convolutions likewise all match or outperform the Intel MKL implementations.
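A hedged sketch of how such a schedule composes in Tiramisu, with illustrative command names and tile sizes; the real sgemm schedule also includes the array packing and full/partial tile separation described above:

// C = alpha*A*B + beta*C, with the two computations fused into one loop nest
c.tile(i, j, k, 32, 64, 256, i1, j1, k1, i2, j2, k2);  // 3D blocking for L2/L3
c.vectorize(j2, 8);  // vectorize the innermost j dimension
c.unroll(i2, 4);     // register blocking via unrolling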
Distributed and GPU Backend
For the Tiramisu distributed backend, we used 6 kernels for evaluation: blurxy, sobel, convolution, gaussian, pipeline, and cvtColor (we chose these kernels because they are already implemented in the distributed Halide compiler [21]). We assume that the data is already distributed across the nodes. Figure 9 compares the execution time of distributed Tiramisu and distributed Halide on 16 nodes for each of the kernels. Tiramisu is faster than distributed Halide in each case. For the kernels involving communication, code generated by distributed Halide has two problems compared to Tiramisu: (1) it overestimates the amount of data it needs to send; and (2) it unnecessarily packs contiguous data into a separate buffer before sending. Figure 10 shows the execution time of the kernels with distributed Tiramisu when running on 2, 4, 8, and 16 nodes. As expected, execution time decreases as the number of nodes increases.
Putting it All Together
As a final experiment, we ran a modified version of the cvtColor kernel in a distributed GPU configuration and compared it with a distributed CPU configuration. For this experiment, we ran on a small cluster of 4 nodes, each with a single Nvidia K40 GPU and a 12-core Intel Xeon E5-2695 v2 CPU clocked at 2.40GHz. We used OpenMPI 1.6.5 [46] as our MPI implementation. Figure 11 shows the results of this experiment. The back row shows the results of running the cvtColor kernel on one node, using either 1 core, 10 cores, or 1 GPU. As expected, 10 cores are better than 1 core, and the GPU outperforms the CPU. The front row shows the same configurations, except distributed across 4 nodes; so, from left to right, the columns of the front row represent a total of 4 cores, then 40 cores, then 4 GPUs. As with the single-node performance, 40 cores are better than 4 cores, and 4 GPUs are better than the CPUs.

Figure 11. Results for either CPU or GPU running on a single node (back row) and distributed across 4 nodes (front row).
Evaluation Summary
Overall, the experiments demonstrate the use of Tiramisu as an IR and optimization framework for two DSLs and multiple backends. We show that Tiramisu is expressive: it allows both Halide and Julia to perform new optimizations and allows Halide to express new algorithms. The experiments also show that Tiramisu is suitable for targeting multiple hardware architectures, such as multicores, GPUs, distributed systems, and FPGAs. Thanks to its flexible scheduling commands, it can generate highly optimized code for a variety of architectures and algorithms.
Related Work
High Performance DSL Compilers
High-performance DSL compilers such as Halide [50], Diderot [14], Simit [35], Polymage [45], OoLaLa [41], and others build custom compilation pipelines for specific domains such as image processing or linear algebra. These compilers have shown that it is possible to obtain high performance by applying domain-specific optimizations in the course of compilation. However, such compilers map DSL code directly to the target hardware, sometimes using a low-level compiler framework like LLVM. Our goal in this work is to build a more generic framework and intermediate representation that can be used by domain-specific language compilers in place of ad-hoc re-implementations of compute and data transformations.
DSL IRs and Optimization Frameworks
Delite [11] is a generic framework for building DSL compilers using Lightweight Modular Staging (LMS) [51], a technique for embedding code generators and compilers in Scala. Delite exposes several parallel computation patterns that DSLs can use to express computation; however, it has no facilities for advanced loop nest transformations. We therefore believe that generic DSL frameworks like Delite can benefit from using Tiramisu.
PENCIL [3, 4] is another generic DSL IR and automatic optimization framework that uses a polyhedral representation internally. It is a subset of C99 with additional constructs that help parallelizing compilers perform more accurate static analyses and, as a result, generate more efficient code. The Pluto [9] automatic scheduling algorithm used within PENCIL can be integrated seamlessly into Tiramisu on top of the first layer. The main difference between PENCIL and Tiramisu is that Tiramisu separates the computation, the schedule, and the data layout. In contrast, the PENCIL IR is a subset of C99 with arrays, scalar variables, etc., and thus successful parallelization and optimization sometimes require data-layout transformations such as expansion and privatization, which are not necessary in Tiramisu.
CHiLL [13, 27] is a polyhedral compiler framework for Fortran that allows users to express a composition of high-level transformations such as tiling and unrolling, which the system performs, freeing users from having to implement them. URUK [24] is a similar framework that also uses a polyhedral representation. Other systems, such as POET [61], parametrize loop transformations with a language-agnostic, purely syntactic transformation system. These frameworks require working with concrete data layouts, in contrast to Tiramisu, whose first layer has no concrete data layout.
Darkroom [29] is a language and compiler for image processing pipelines. Darkroom compiles input programs into optimized line-buffered pipelines (it relies on an ILP solver to optimally schedule the pipelines) and then synthesizes them for ASIC, FPGA, or CPU. Similarly, [47] presents an extension to Halide to hardware-accelerate applications on FPGA; the authors implemented additional scheduling commands to define and control the code generated for FPGA. These works are designed for image processing applications only, while Tiramisu with FROST can also support other types of computations (e.g. linear algebra).
Data-layout Transformation
Techniques such as scalar and array expansion remove false dependences, enabling loop nest transformation and parallelization [22, 34]. Expansion increases dimensionality to create private copies of data for each loop iteration. In Tiramisu, computations are single-assignment and thus fully expanded, obviating the need for privatization.
A family of array contraction techniques attempts to reduce the memory footprint without constraining loop nest transformations [19, 37, 48]: the compiler performs a maximal expansion before applying loop transformations and then attempts to contract the expanded arrays. Tiramisu simplifies this process, since maximal expansion is not needed. This is similar to Halide [49], where computations are mapped by default to fully expanded arrays and a compiler pass then performs storage folding to contract them.
Several alternative approaches try to constrain expansion. Maximal static expansion (MSE) restricts the elimination of dependences to situations where the data flow can be captured accurately at compilation time [16]. It is important when generalizing array dependence analyses and loop transformations to dynamic control flow, and it can be combined with array contraction [15]. A priori constraints on the memory footprint can also be enforced, up to linear volume approximations [53], and, more generally, trade-offs between parallelism and storage allocation can be explored. These techniques can also be applied in Tiramisu to constrain the schedule.
Data layout transformations for specific dependence patterns in the polyhedral model have been used to eliminate SIMD intra-vector operations [30] and to enhance cache locality on non-uniform cache access (NUCA) architectures [40]. These kinds of transformations can easily be implemented in Tiramisu.
Functional IRs and Data Layout Transformations

The NOVA functional language [17] was designed to be used as an IR for DSL compilers. It is a polymorphic, statically-typed functional language with a suite of higher-order functions such as map, reduce, and scan that are used to express parallelism. Although the NOVA IR does not represent memory explicitly, it does not provide any framework for advanced loop nest transformations. For example, only map functions that iterate over the same ranges can be fused; iteration space transformations such as skewing are not addressed, and transformations such as fusion are done at the function level. Tiramisu provides an IR that allows advanced iteration space transformations while still separating the algorithm from the schedule and the data layout.
Most functional languages do not expose notions of memory layout to programmers. Instead, programmers rely on profiling tools to characterize data movement [52] or design algorithms around models of memory traffic for such programming languages [8] . In contrast, Tiramisu enables writing the algorithm in a functional manner while separately dealing with data layout and computation scheduling.
Conclusion
In this paper we introduce Tiramisu, a middle-end compiler for domain specific languages that separates the algorithm, the schedule and the data layout in a three-layer intermediate representation. Tiramisu supports backend code generation for multicore CPUs, GPUs, FPGAs, and distributed systems, as well as machines that contain any combination of these architectures.
Tiramisu is designed so that most DSLs can use high-level scheduling and data mapping constructs to control the lowering from the algorithm to backend, cross-platform code. In addition, the underlying representations are accessible to advanced users who wish to implement new optimizations and transformations.
We evaluate Tiramisu by creating a new middle-end for the Halide and Julia compilers, targeting a variety of backends. We demonstrate transformations made possible by Tiramisu that increase performance by up to 4× in Halide and 16× in Julia, and we show that Tiramisu can generate very fast code matching one of the most highly hand-optimized kernels (Intel MKL gemm).
Layer II: Computation Placement Layer
The second layer is identical to the first layer except that computations in this layer are ordered based on their lexicographical order.
Layer III: Concrete Computation Layer
The third layer is a union of computation sets and a set of access relations. The computation sets are identical to the Layer II computation sets except that new allocation/deallocation statements are added. Each access relation is an affine map of the form {C(i0,...,in) -> buf[f(i0,...,in)]}, mapping a computation from Layer II to the buffer elements it reads or writes, as in the data mapping example given earlier.
Time-Processor Vectors
The time-space vector in Layer II indicates the logical time of execution of a computation and the processor on which it should be executed. Each vector has a name associated with it (the name of the computation). S1(0,0,0), S2(0,0,1), S1(i,j,0), and S2(i(cpu),j,1) are examples of time-space vectors representing computations in Layer II. In general, a time-space vector has two types of dimensions: time dimensions and space dimensions. The time dimensions provide the logical order of execution of the computations, while the space dimensions indicate on which processor the computations should be executed. In the previous example, the first three vectors have time dimensions only, while the last vector has one space dimension. We use a tag to indicate that a given dimension is a space dimension; the tag mainly indicates the type of processor to which the computations are mapped. To determine which of two time-space vectors executes first, we simply compare the two vectors lexicographically. In the example, S1(0,0,0) precedes S2(0,0,1) lexicographically, so S1(0,0,0) is scheduled to execute before S2(0,0,1). The ability to add dimensions and reorder them freely enables the expression of many possible mappings from the original iteration space of the computations to complex execution scenarios. Figure 14 provides examples of different optimizations for a simple algorithm and shows the time-space vectors used to express those optimizations. Each dimension of the vector can be an indexed variable, distributing the computation over that dimension, or a constant, providing a lexical ordering between statements. The algorithms use a custom intermediate representation within each DSL; however, we use a classical imperative language representation to describe them in this paper. A value can be annotated with a processor type, indicating where that computation will be placed and that the dimension will be run in parallel.
