Deep learning frameworks automate the deployment, distribution, synchronization, memory allocation, and hardware acceleration of models represented as graphs of computational operators. These operators wrap high-performance libraries such as cuDNN or NNPACK. When the computation does not match any predefined library call, custom operators must be implemented, often at high engineering cost and performance penalty, limiting the pace of innovation. To address this productivity gap, we propose and evaluate: (1) a domain-specific language with a tensor notation close to the mathematics of deep learning; (2) a Just-In-Time optimizing compiler based on the polyhedral framework; (3) carefully coordinated linear optimization and evolutionary algorithms to synthesize high-performance CUDA kernels; (4) the transparent integration of our flow into PyTorch and Caffe2, providing the fully automatic synthesis of high-performance GPU kernels from simple tensor algebra. The performance is comparable to, and often exceeds the performance of, highly tuned libraries.
(2) an intermediate representation and Just-in-Time optimizing compiler based on the polyhedral framework, enabling complex program transformations and levels of automation unmatched by any other compiler for the acceleration of computational sub-graphs of neural networks; (3) coordinated optimization algorithms with integrated functional correctness, profitability modeling, domain and target specialization; we propose a layered approach, relying on integer linear programming and other polyhedral algorithms to address the core program optimization and synthesis challenges, while resorting to evolutionary algorithms as a higher level of control, to select high-level strategies and fine-tune transformation parameters; (4) the transparent integration of our flow into PyTorch [48] and Caffe2 [29] , providing the fully automatic synthesis of high-performance GPU kernels from simple tensor algebra. The TC flow is also portable to other ML frameworks with a few lines of code. While our initial implementation focuses on Nvidia GPUs, the core technology applies to other types of accelerators with shared or partitioned memory [43, 51, 70, 76] ; these include vector and SIMD accelerators and also the generation of computational patterns suitable for ASICs with systolic designs and efficient storage management involving non-volatile memory technologies.
TENSOR COMPREHENSIONS
Tensor Comprehensions (TC) are an algorithmic notation for computing on multi-dimensional arrays. It borrows from the Einstein notation, a.k.a. summation convention: (1) index variables are defined implicitly, and their range is inferred from what they index; (2) indices that only appear on the right-hand side of a statement are assumed to be reduction dimensions; (3) the evaluation order of points in the iteration space does not affect the output.
A tensor comprehension function (or tensor comprehension for short) defines output tensors from pointwise and reduction operations over input tensors. These operations are defined declaratively as a sequence of pointwise equations or reductions, called tensor comprehension statements (or statements for short).
Let us consider matrix-vector product as a simple example of a tensor comprehension with two statements:
This defines the function mv with A and x as input tensors and C as an output. The shapes of A and X are of size (M, K ) and (K ), respectively. The shape of C is inferred automatically. The statements introduce two indices "i" and "k." Variables not defined in the function signature implicitly become indices. Their range is inferred based on how they are used in indexing (see Section 3.1); here, we will discover i ∈ [0, M ) and k ∈ [0, K ). Because k only appears on the right-hand side, stores into C will reduce over k with the reduction operator +.
Intuitively, a tensor comprehension may be thought of as the body of a loop whose control flow is inferred from context. The equivalent C-style pseudo-code is: Importantly, the nesting order (i then k) is arbitrary: The semantics of a tensor comprehension is always invariant to loop permutation. 2 TC allows in-place updates while preserving a functional semantics that is atomic on full tensors: RHS expressions are read in full before assigning any element on the LHS. This specification is important in case the LHS tensor also occurs in the RHS [24] : The compiler is responsible for checking the causality of in-place updates on element-wise dependences, currently allowing only pointwise updates. Also, to enable in-place updates across TC functions, outputs of a TC statement can also be used as inputs.
We provide a short-cut for an initializing reduction, where the result is initialized to the operator's neutral element before reduction by appending "!" to the operator, e.g., "+=!" instead of "+=". A one-line definition of the matrix-vector product mv is given below; and common ML kernels can be written in just a few lines, such as the sgemm function from BLAS:
Expressing general tensor contractions is equally easy. A fully connected layer followed by a rectified linear unit takes the form of a transposed matrix multiplication initialized to a broadcast bias term and followed by pointwise clamping (applying the built-in scalar function fmaxf with 0):
The where annotation informs the inference algorithm of the intended index variable ranges when they cannot be unambiguously inferred. In this case, "b" indexes only "out" whose size also needs to be inferred. Unlike tensor kernel libraries with predefined layout conventions, notice that TC lets the user control data layout through the order of tensor indexing dimensions. Here, we chose to reuse the out tensor across all comprehensions, indicating the absence of temporary storage.
Similarly, the where clause serves to indicate ranges of kh and kw in the max pooling layer, which would otherwise be under-constrained:
A 2-D convolution is also simple. Its reduction is initialized to 0 (note the use of +=!) with reduction dimensions kh, kw:
Subscript expressions can be any affine function of iterators, or subscript-of-subscript expressions (a tensor element indexing another), and combinations thereof. The latter capture datadependent accesses such as a gather operation:
TC algorithmic notation differs from today's prominent frameworks where most operators are defined as black-box functions. The design of TC makes it easy to experiment with small layer variations while preserving a concise, in-place expression. Thus, a strided convolution is easily created as a tweak on convolution, e.g., strided by 2 along h and 3 along w is: Fig. 1 . Simplified EBNF syntax for core TC. Parentheses denote inline alternatives, brackets denote optional clauses, and angle brackets contain textual descriptions used for simplicity. Figure 1 shows the grammar of the Tensor Comprehension language in EBNF notation.
Data Layout
TC makes data layout explicit and easy to reason about. It supports generalized tensor transpositions (i.e., applying an n-D permutation matrix where n > 2), and data tiling can be achieved by reshaping tensors and adjusting the index expressions. Range inference and checking guarantees such reshaping will always be consistent throughout the statements of a tensor comprehension. For instance, NCHW convolution operates on an explicit input, declared as float I(N, C, H, W), with the layout matching the expected row-major semantics.
In addition, the TC compiler may transparently apply layout transformations, e.g., when mapping tensor tiles to GPU shared memory.
Automatic Differentiation
TC does not natively deal with automatic differentiation, but we aim to add TC support to an existing differentiation tool in the future. DSLs like PlaidML [49] already demonstrated this.
However, backward passes can readily be implemented in TC as a few lines of code. Here is the backward pass of matrix multiplication:
TENSOR COMPREHENSIONS WORKFLOW
The Tensor Comprehensions workflow consists of several stages, progressively lowering the level of abstraction ( Figure 2 ). Given a TC with specialized tensor sizes and strides, 3 we lower it to a parametric Halide-IR expression, which is further lowered to a polyhedral representation where most transformations are applied. The output of the polyhedral flow is CUDA code that can be further JIT-compiled with NVRTC and executed. Complementing this flow, an autotuner and serializable compilation engine interacts with scheduling and mapping strategies to search the optimization space.
Much of TC's versatility and effectiveness resides in its embedding of a polyhedral compiler as the main optimization engine. The polyhedral framework is an algebraic representation of "sufficiently regular" program parts, covering arithmetic expressions on arrays surrounded by static control flow [23] . It has been a cornerstone of loop optimization in the past three decades [3, 8, 14, 22, 32, 70] and is integrated into production compilers [13, 30, 43, 62] . Despite its deceiving apparent simplicity, it covers a large class of computationally intensive kernels. It is parametric on loop bounds and array sizes and captures more transformations of the control and data flow than domain-specific representations such as Halide [55] or TVM [17] . The use of the polyhedral model by TC is derived from that of PPCG [70] , and this section only provides a general overview. Our transformation engine is composed of the following specially adapted or algorithmically novel components:
(1) range inference and lowering from high-level TC abstraction to the polyhedral representation; (2) core affine scheduling adapted from isl that automatically optimizes for (outer) loop parallelism and locality, tuned towards folding a complete TC function into a single GPU kernel; (3) the schedule is further tiled to facilitate the mapping and temporal reuse on the deep parallelism and memory hierarchy of GPUs [72] ; (4) mapping to GPUs borrows from PPCG [70] with extensions to support the more complex and imperfectly nested control structures of ML kernels; (5) memory promotion deals with explicit data transfers to and from shared and private memory.
This work demonstrates that the polyhedral framework is particularly well suited for deep neural networks, featuring large and deeply nested loops with long dependence chains and non-uniform or all-to-all patterns-arising from fully connected layers and tensor contractions, and transpositions. These features push the optimization problem into a different heuristic space than Halide's for image processing, and a wider space than linear algebra alone.
Range Inference
TC loops are implicit and output tensor sizes are inferred from index ranges, which themselves may also be inferred. Our algorithm infers the largest rectangular ranges that avoid out-of-bounds reads on inputs. A where clause allows for disambiguation if multiple such ranges exist.
Consider the conv2d kernel on page four. The sizes of the input tensors, in and weight, are known from the function signature. The algorithm needs to infer the ranges of the iterators and the size of the output tensor out. The iterators b, op, kh, and kw appear only once on the RHS and their ranges are therefore [0, B), [0, OP), [0, KH), [0, KW) so they index the input tensors maximally.
The iterator ip appears twice, but indexes the dimension of the same size, so its range is [0, IP). Had it been indexing dimensions of different sizes, its range would have been the intersection of all size-imposed ranges. Once the ranges of kh and kw are known, it is possible to infer those of h and w: We require h + kh ≤ H and w + kw ≤ W, which leads to the maximal ranges of [0, H − KH) and [0, W − KW), respectively. Finally, the size of out can be inferred given the ranges of the iterators that index it, yielding float(B, OP, H − KW, W − KW). The user of TC is able to inspect the symbolic sizes inferred for the output tensors using a command-line flag.
Consider now a typical stencil operation A(i)+= B(i + k) * K(k): There are multiple ways to maximize the ranges of i and k. To disambiguate without annotations, range inference proceeds in rounds. It maintains a set of index variables whose ranges are not yet resolved. Initially, it contains all variables not in any where clause. Each step considers argument expressions that contain a single unresolved variable and constructs a Boolean condition stating the accesses are within bounds. Using Halide [55] mechanisms, range inference computes the maximal range that satisfies this condition given the already known ranges of other variables. If different ranges are computed for the same variable, they are then intersected. For the stencil above, in the first round, we ignore the expression B(i + k), because it contains multiple unresolved variables. We use K(k) to deduce a range for k. In the second round, B(i + k) contains a single unresolved variable, and we use the already-inferred range of k to deduce a maximal range for i.
Lowering to the Polyhedral Representation
The role of lowering is to bridge the impedance mismatch between the logical layout of highlevel tensor operations (dimension ordering) and the data format the polyhedral code generator expects (C-style row-major arrays). It ensures the absence of aliasing and performs range inference for output tensors. Based on range inference, TC differs from NumPy-style implicit "broadcast" semantics (non-trivial tensor dimensionality extension) adopted by XLA, PyTorch, and MXNet.
Our representation derives from schedule trees [71] , implemented in the isl library [68] , and uses a set of node types. Each TC-statement corresponds to multiple runtime statement instances-one for every valuation of the index variables. The root domain node defines the set of statement instances to be executed. Due to the nature of the TC-language, the constraints on the index variables are always affine, resulting in an exact representation of the set of operations. A band node defines a partial execution order through one or multiple piecewise affine functions defined over iteration domains. The name refers to the notion of a permutable schedule band, a tuple of one-dimensional schedule functions that can be freely interchanged while preserving the semantics of the program. A filter node partitions the iteration space, binding its sub-tree to a subset of the iteration domain. It can be arranged into set or sequence nodes depending on whether or not the order of execution must be serialized. Context nodes provide additional information on the parameters, e.g., tensor extents or GPU grid/block sizes. Finally, extension nodes introduce auxiliary computations that are not part of the original iteration domain, which is useful for, e.g., introducing data-copy statements.
A canonical schedule tree for a TC is defined by an outer sequence node, followed by filter nodes for each TC statement. Inside each filtered branch, band nodes define an identity schedule with as many one-dimensional schedule functions as loop iterators for the statement. The implicit loops form a permutable band as per TC semantics.
In addition to the schedule tree, our representation includes tensor access functions that map the index variables to the subscripts of tensors they access. These subscripts are not necessarily affine, in which case over-approximations are used [11] : A non-affine access is assumed to potentially access all values along the given dimension. After the polyhedral representation is constructed, dependence analysis can be used to ensure the absence of out-of-bounds accesses [53] .
Additional lowering steps include forward substitution of convolution expressions (storage/ computation trade-off), padding, mirroring, and clipping. The process is analogous to Halide's [55] .
Example. Figure 3 (a) shows the canonical schedule tree for unions of relations where tuples of iterators are guarded with syntactic identifiers [53] 4 for the sgemm TC defined on page 4. One recognizes a 2-D nest from the initialization statement followed by a 3-D nest for the update statement. The schedule can be either parametric in input sizes or have extra context information on the tensor sizes. In cases where band nodes do not define an injective schedule, the statement instances are scheduled following the lexicographical order of their domain coordinates.
Tunable Polyhedral Scheduling
Program transformation in the polyhedral model involves defining a different schedule, which corresponds to a different (partial or total) order of traversing the iteration domain. The instances of all statements are scheduled completely automatically [14] using one of several scheduling strategies with which we extended the isl scheduler [72] .
The isl scheduler iteratively solves integer linear programming problems to compute piece-wise affine functions that form new schedule band nodes. Internally, it operates on a data dependence graph where nodes correspond to statements and edges express dependences between them. It introduces the affine clustering technique that is based on computing the schedule bands separately for individual strongly connected components of the dependence graph and then clustering these components iteratively and scheduling them with respect to each other. Clustering not only decreases the size of the linear problems the scheduler has to solve, but also serves as a basis for isl's loop fusion heuristic.
We extended isl to provide finer-grained control over the scheduling process. For affine transformations, the user can set additional scheduling options. For clustering, the user can supply a decision function for pairwise dependence graph component combination, after this combination was demonstrated to be valid by the scheduler. These configuration points serve as a basis for both fixed scheduling choices made by TC and scheduling strategies. In particular, TC tells the scheduler to produce schedules with only non-negative coefficients and without any skewing. Clustering decisions allow TC to control the conventional minimum and maximum fusion targets, and additionally, maximum fusion that preserves at least three nested parallel loops (to be mapped to CUDA blocks and threads). With the scheduling strategies, one may optionally enable point band rescheduling (i.e., scheduling the inner dimensions after tiling). In particular, two fusion strategies can be specified, one for the global schedule and one for the point band. If these fusion strategies are different, then the point band (along with all its descendants) is rescheduled after tiling, preserving only the outer tile band of the original schedule. Scheduling strategies can be selected through the autotuning process. In all cases, we enforce that a single GPU kernel is generated.
Example.
Observing that the C tensor in sgemm (see page four) is reused between two nests, the scheduler constructs the tree in Figure 3 (b) to leverage access locality and improve performance. This tree features an outer band node with i and j loops that became common to both statements, which corresponds to loop fusion. The sequence node ensures that instances of S are executed before respective instances of T enabling proper initialization. The second band is only applicable to T and corresponds to the innermost (reduction) loop k.
Overall, the tuning process is greatly simplified compared to Halide and TVM. Relying on a heavy-duty, well-understood analytical optimization framework based on integer linear programming, TC exposes a small, dedicated search space of high-level strategies and block-size parameters. Beyond guaranteeing the validity of the transformation, dependences can be used to explore parallelization opportunities (independent instances can be executed in parallel), to improve data access locality (dependent instances executed close in time) or to automate vectorization [14, 50, 66, 72, 77 ].
Imperfectly Nested Loop Tiling
Let us first describe the general setting for loop tiling on schedule trees, before developing the TC-specific specialization and extensions.
Tiling Permutable Bands. Pluto has been very successful at decoupling the actual implementation of loop tiling from the preparation of an affine schedule exposing permutable loops amenable to tiling [14] . This design allows exploring locality and parallelization tradeoffs without bloating the schedule representation with complex quasi-affine forms capturing the precise distribution of iterations into tile and point loops. Schedule trees ease the implementation of such a decoupled design, capturing tiling as the conversion of a permutable schedule band into a chain of two bands, with the outer band containing tile loops and the inner band containing point loops with fixed trip count. This can be seen as a conventional strip-mine and sink transformation.
In addition to conventional loop tiling, the schedule tree representation allows tiling imperfectly nested loops. The technique is based on the following observation: If a loop does not carry dependences, it can be sunk below any other loop. In valid schedules, all dependences are carried (or satisfied) by some loop, along which they feature a positive distance. A dependence is only violated if it has a negative distance along some loop before it is carried by another loop [35] . Parallel loops do not carry dependences by definition and therefore do not affect dependence satisfaction or violation. Therefore, imperfectly nested tiling may be implemented by first tiling bands in isolation and then sinking parallel point loops in the tree. During this process, the point band is replicated in each sub-tree below a sequence (or set) node and its schedule is restricted to only map the relevant points in the iteration domain. Such an extension is particularly helpful in Pluto, where bands of permutable loops are rediscovered through a post-pass traversal of the affine schedule.
Parallelism and Locality Trade-offs. TC applies two tiling schemes with complementary purposes. The first one takes place immediately after affine scheduling. It aims at exposing a sufficient number of parallel dimensions, some of which amenable to memory coalescing, and some better suited to block-level parallelism. It also aims at exploiting data locality within thread blocks (through shared memory) and individual threads (through register reuse). This tiling scheme is influenced by the strong emphasis on loop fusion in the affine scheduling heuristic (to enforce that the generated code runs as a single GPU kernel). In this context, conventional loop nest tilingconsidering a single band at a time-appears to be sufficient. This is the hypothesis we make in this article. 5 The second tiling scheme takes place in the block and thread mapping algorithm, which is the topic of the next sub-section.
Example. Figure 3 (c) shows the schedule tree for the fused and tiled sgemm. It purposely has two imperfectly nested bands. Dependence analysis shows that loops i and j are parallel. Therefore, we can tile them and sink the point loops below the band of the reduction k loop, resulting in the schedule tree in Figure 3 
Mapping to Blocks and Threads
A schedule tree can also be used to represent the mapping to an accelerator, in particular a GPU with multiple blocks and threads. This operation is performed by associating certain schedule band members, and the corresponding loops, to thread or block indices. The polyhedral code generator then omits the loops, if possible, and rewrites the index expressions accordingly. Building on PPCG, our mapping approach is decoupled from tiling for data locality: Grid and block sizes are specified independently from tile sizes and are exposed as tunable parameters. Due to the semantics of blocks and threads, only parallel loops that belong to a permutable schedule band can be mapped. If point loops are mapped to threads, the ratio between tile sizes and block sizes controls the number of iterations executed by each thread. Note that tile sizes smaller than the block sizes lead to some threads not performing any computation.
Contrary to PPCG, which may generate multiple kernels for a given input program, our mapping approach handles imperfectly nested loops in a way that generates a single kernel as expected by ML frameworks. We require the schedule tree to have at least an outermost band with outer parallel dimensions. The parallel dimensions of the (single) outermost band are mapped to GPU blocks. In each schedule tree branch, the innermost permutable band, typically consisting of point loops, is mapped to GPU threads with the following restrictions: The number of mapped dimensions must be equal across branches, and on each branch, there must be exactly one band mapped to threads. The mapping is performed bottom-up, first attempting to map the leaf bands to threads, before moving to a parent band only if none of the children could be mapped to threads.
Thread mapping can be extended to imperfectly nested loops, following the same principle as imperfect loop tiling. Within a given thread block, one may sink parallel point loops so multiple bands in a sequence (or set) may be equalized in depth and mapped together. However, TC currently does not perform any such sinking.
Example. Our mapping strategy produces the schedule tree in Figure 3 (e). We introduced a context node in the schedule tree to indicate the effective sizes of the parameters as well as the grid and block sizes (denoted as b x , b y and t x , t y , respectively, standing for the values eventually taken by blockIdx.x, blockIdx.x and threadIdx.x, threadIdx.y). This insertion is performed just in time, when the effective tensor sizes are known. Also notice the filter nodes referring to the b x , b y , t x , and t y parameters: these nodes express the mapping to the GPU.
Memory Promotion
We are interested in promoting parts of tensors into shared or private GPU memory. While the promotion decision is taken by a heuristic and the corresponding imperative code is generated at a later stage, schedule trees offer a convenient interface for attaching memory-related information. Memory promotion is based on the notion of an array tile, a form of data tiling for softwarecontrolled local memories. It is a constant-size potentially strided block in the array that covers all elements accessed by within a given (schedule) tile. We build upon and extend PPCG's support for memory promotion [70, 72] and expose the promotion to shared and private memory as Boolean options for the autotuner.
Promotion of Indirectly Accessed Arrays. Memory promotion is also applicable to indirectly accessed arrays. These frequently occur when modeling variable length data through embedding layers such as word embeddings in natural language processing. This is particularly important in the case of latency-bound benchmarks where there is little computational or additional data processing work to hide global memory latency. Indirect arrays used to be promoted in the initial TC implementation based on PPCG. When implementing parallel reductions, working towards the first released version of TC, we realized that parallelizing reductions was sufficient to deliver comparable or higher speedups in our word-embedding benchmarks. For this reason, indirect array promotion was dropped from the publicly available version of TC. We still report on the design, for it remains interesting to describe how the polyhedral TC flow may optimize non-affine data flow.
Without Because some values can be duplicated, indirect promotion is only possible if both the outer and the index arrays are only read, since writing to them could result in different values that cannot be trivially merged. In general, we require the index array to have an array tile, i.e., only a fixed-sized block of it is accessed. When computing the array tile for the outer array, we ignore the indirect parts of the subscript (affine parts are treated as usual). We then introduce as many additional index expressions in the promoted outer array as are associated to the index array. Extents of the array along these new dimensions correspond exactly to the array tile sizes of the index array. Hence, an element of the promoted array contains a copy of the global array element that would be accessed with the given index array. Indirect subscripts are only used when copying from global memory, while all other accesses are rewritten through code generation. In presence of multiple indirect index expressions that share sub-expressions and have equal tile sizes along the corresponding dimensions, it is sufficient to introduce a single index expression in the promoted array for all identical sub-expressions.
Promotion Heuristics. Directly accessed arrays are promoted to shared memory if there exists an array tile of fixed size, if individual elements are accessed more than once, and if at least one of the accesses does not feature memory coalescing. The latter is visible from the access relation with the schedule applied to the domain: The last access dimension should be aligned with the schedule dimension mapped to x threads.
For indirect arrays, the coalescing requirement may be dropped because of the presence of additional long memory dependences that these cases entail. The total amount of shared memory being fixed, one may follow a simple greedy heuristic, refusing promotion if the required amount of shared memory would outgrow the available resources.
Matching Library Calls
While TC aims at generating code for any computational kernel expressible in the DSL, if (part of) a kernel happens to match a pattern that is heavily optimized by some library, then it may as well be handled by that library. In particular, and as a proof of concept, TC looks for opportunities for letting CUB handle specific forms of reductions [57] . It is currently restricted to single-dimensional addition reductions.
A reduction is represented in TC by a binary relation between updated tensor elements and the statement instances that perform the corresponding updates. 6 Right before the mapping to threads, each permutable band with a sufficient number of parallel members is checked for reductions. In particular, the band should have at least one non-parallel member and the number of parallel members plus one (corresponding to the non-parallel member) should be greater than or equal to the number of dimensions that will be mapped to threads. If the band schedules instances of exactly one reduction statement and if the instances of any other statement scheduled by the band can be moved before or after the reduction instances, taking into account the active dependence at (the top of) the band, then the remaining band (involving only reduction statement instances) will be considered for replacement by a library call during thread mapping.
When a band marked for replacement is considered during thread mapping, full/partial tile separation is applied-using the block size tuning parameter-since only the full tiles can be handled directly by CUB. Furthermore, the condition separating full tiles from partial tiles should be simple enough, as otherwise the cost of determining when to invoke CUB would outweigh any possible benefit obtained from the invocation. If the condition is too complicated, the separation is discarded and the band is treated in the same way as bands that were not marked for replacement. Otherwise, the collection of full tiles is tiled along the parallel dimensions, since a single scalar variable is used to hold the result of the reduction mapped to CUB. Synchronization and a special marking is then inserted around the point band of this tiling, which is later used during code generation to replace each full tile by a call to CUB. Finally, since CUB uses some shared memory, its consumption is taken into account during the downstream memory promotion step.
Autotuning and Caching
While the polyhedral core of TC is capable of optimizing and generating code for any TC function, it is well known that the state-of-the-art linear optimization heuristics are not sufficient to account for all performance anomalies and interactions with downstream program transformations [39, 77] . Different kernels need different, target-specific optimization trade-offs. We thus complement our flow by an autotuner that varies the options of the polyhedral JIT compiler marked as tunable in the previous section. These options can be stored and reused for similar operations/kernels (similar shapes, target architecture), since autotuning may require significantly more time than compilation.
The tuning session is defined by a list of parameters to tune and their admissible values, initial values, and the search strategy. We currently implement a genetic search strategy [27] . It runs for multiple steps, each one evaluating multiple candidate values. Each candidate is assigned a fitness value inversely proportional to its runtime. The pool is updated on each generation by cross-breeding three candidates, chosen from the pool at random, with fitter candidates having a higher chance of being chosen, such that each candidate's value is inherited from one of its parents. A subsequent mutation phase can change the candidate's values at random with some low probability. Much of the autotuning effort resides in tile size selection, for which no linear objective functions exist in polyhedral compilers. Genetic approaches have been used successfully to explore such spaces, performing better than random search due to the strong coupling of optimization decisions-including tile sizes bound by the limits of the memory hierarchy- [18, 50] .
Autotuning evaluates hundreds to thousands of versions for each kernel. We devise a generic multi-threaded, multi-GPU autotuner. It maintains a queue of candidates to compile with the polyhedral flow and a queue of compiled kernels ready to be profiled on the GPU (see Figure 4 ). Candidates or kernels are picked up by available worker threads and compiled or profiled concurrently. Profiling results are accumulated in the tuning database and used for setting up successive search steps.
Each generated version is "warmed up" by a few executions before being profiled. Without any performance guarantees, autotuning needs to quickly prune poor candidates. Because CUDA kernels cannot be stopped once launched, we rely on the following pruning heuristics to decrease the autotuning time by an order of magnitude. (1) Parameter specialization allows the exact number of active threads and blocks to be computed beforehand. Kernels with fewer threads than some configurable threshold (e.g., 256) are not launched. (2) If during the first run, a kernel is more than 100× slower than the best version so far, or it is 5× slower after warmup, it is pruned immediately.
While autotuning time may become significant, compilation and autotuning time is not a fundamental limit to TC's applicability. In training scenarios, a significant amount of time is spent on computing the same kernel repeatedly over different data during the (stochastic) gradient descent. In inference scenarios, the network is optimized ahead of time. As a result, although TC operates as a JIT compiler, it only marginally hits the typical compilation/run-time trade-offs of JIT compilers. Autotuning time may become an issue in specific training scenarios where hyper-parameters would need to be frequently updated, but in such a case one may leverage TC's intrinsic handling of dynamic shapes and generate a single version of each operator or fused operators to handle all hyper-parameter configurations.
INTEGRATION WITH ML FRAMEWORKS
TC is designed to optimize individual layers or small subgraphs of an ML model. The entire model is not only computationally expensive, but often leads to most transformations being hindered by a large number of data dependences. Furthermore, ML frameworks perform work distribution and placement at the model level, treating a layer as a unit of work; extremely large layers could interfere with the framework operation.
Unlike XLA or Glow, TC supports completely custom layers. In TC, layer fusion is merely pasting the code that constitutes the layers into a single function, or inlining TC functions at the AST level. Unlike Halide and TVM, the polyhedral backbone of TC includes instance-wise dependence analysis, capturing dependences and tensor access relations at the level of individual loop iterations and tensor elements. This allows TC to fuse operations without introducing redundant computation, and to combine fusion with enabling transformations such as shifting (for convolutions) or scaling (for pooling layers). TC's polyhedral representation also enables it to automatically infer sizes, and to discover parallelism and locality-parallelism trade-offs beyond a predefined collection of map/reduce/scan combinators.
Let us now describe the transparent integration into a ML framework, from a user perspective. Until now, such levels of integration had only been demonstrated on operator graph compilers such as XLA [28] and Glow [58] , starting from a lower level of abstraction than TC, and missing the genericity and high reusability of a polyhedral framework as well as feedback-directed autotuning.
We opted for an "in process" implementation, streamlining the interaction with computation graph engines and ML applications built on top of them, a unique feature for a fully automated scheduling and mapping flow. TC is integrated into any ML framework as follows: We provide a thin API that translates the specific tensor object model to our own (see Figure 5 ). Operator definitions are overridden to generate TC rather than the framework's backend implementation, as well as provide users the ability to write their own TC. A single TC may correspond to a DAG of operators in the ML framework. The tensor comprehensions are then JIT-compiled as shown in Figure 2 . DAG partitioning, matching, and rewriting (like, e.g., TensorRT [47] ) is currently not part of the flow, although this would make an interesting future combination, with feedback from the compiler.
PERFORMANCE RESULTS
We evaluate our framework on two systems: (1) Nvidia Pascal nodes with 2 socket, 14 core Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, with two Quadro P100-12GB; and (2) Nvidia Volta nodes with 2 socket, 20 core Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz, with eight Tesla V100-SXM2-16GB. Both systems use CUDA 9.0 and cuDNN 7.0.
We report results for nine TC functions ranging from a simple matrix multiplication kernel to a full WaveNet cell [64] . The individual benchmarks are described below: Figure 6 and Figure 7 show the complete source code. The matrix multiplication and convolution kernels were selected for their dominance of the training and inference time of the most classical networks [4, 75] . The other kernels bring interesting computation patterns to enable expressiveness and performance comparisons in more diverse network architectures.
These results are all based on TC commit, 2e1a0dc54850 available at https://github.com/nicolasvasilache/TensorComprehensions. Running the autotuner for 25 generations of 100 candidates, the (parallel) autotuning process takes up to 1h on the longest running kernels, and 6h in total. 7 The relative performance of kernels automatically generated with TC compared to Caffe2 is shown in Figure 8 and Figure 9 . 8 Caffe2 provides a very strong baseline by wrapping tuned Table 1. implementations, which originate from either hand-tuned libraries or other high-performance code generators. 9 We chose to compare against Caffe2 rather than against other optimization flows due to expressivity and automation limitations: XLA or Glow do not support custom layers, and Halide or TVM lack range inference and automatic parallelism discovery, which significantly complicates the expression of new layers such as KRU and WaveNet. The common set of comparable layers would be limited to matrix multiplications and convolutions, while one of the main contributions of TC is to enable exploration of new unconventional layers before super-optimized implementations are available.
In addition, Figure 10 brings together the performance of TC-compiled kernels on both GPU systems, normalized to Caffe2 on P100. This consolidated graph conveys three classes of information in a common context: (1) speedup of Caffe2 V100 over Caffe2 P100 to illustrate the out-of-the-box benefits (or lack thereof) of a faster GPU; (2) speedup of TC over Caffe2 on P100 (main comparison); 9 A recent unification effort [59] made Caffe2 the backend for PyTorch 1.0. and (3) speedup of TC V100 over Caffe2 P100. The last choice may seem surprising, but presented in the context of the other two, allows for relative comparisons: the height of the Caffe2 V100 and TC V100 captures the raw speedups of TC on V100. We aim at compactly illustrating that TC provides a path to performance portability, improving on state-of-the-art frameworks and library primitives. Table 1 provides absolute runtime running TC and Caffe2; all values are reported in μs.
TMM: Transposed Matrix-Multiplication. On matrix multiplications of shapes and sizes relevant to deep learning workloads (i.e., small 128 × 32 × 256, medium 128 × 1,024 × 1,024 and large 128 × 4,096 × 16,384), TC does not perform competitively, except in the low-latency small case. This is due to: (1) the lack of a target-specific register blocking optimization, making kernels bound by shared memory bandwidth that is an order of magnitude slower than register bandwidth;
(2) the lack of target-specific, basic-block level optimizations including careful register allocation and instruction scheduling. Matrix multiplication is the most tuned computation kernel in history:
The missing optimizations are all well known and may be found in use cases and open-source implementations such like CUTLASS [36] . Alternatively, polyhedral compilation has been shown to match or outperform cuBLAS, provided sufficient target-and operator-specific information has been captured in the optimization heuristic and code generator [20] . While our scientific focus was on covering a wide range of layers with TC, a production release would need to embed such operator-specific strategies as well. One strategy would be to follow the classification and heuristic steering of Kong et al. [39] . Also, TC does not replace all layers: It only acts as a custom operation in a graph; one may use TC concurrently with numerical libraries as well as custom implementations provided through TVM.
Group Convolution. Group convolution is expressible with two lines of TC. We report comparisons for sizes relevant to the ResNext model [75] . Despite not using either register optimizations, Fourier or Winograd domain convolutions, TC produces faster kernels than the cuDNN ones, with running times between 250μs and 750μs. To check how TC fares w.r.t. recent advances in optimizing group convolutions, we performed an additional comparison with the PyTorch nightly package py36_cuda9.0.176_cudnn7.1.2_1 with torch.backends.cudnn.benchmark=True. TC speedups range from −2% to 8×. We also observe PyTorch performance on V100 to be worse than on P100, while TC achieves performance portability.
Group Normalization. Group Normalization was recently proposed as a way to overcome limitations of Batch Normalization at smaller batch sizes and increase parallelism [74] . In TC, group normalization is a five-line function. TC performance is roughly 30% better than the hand-tuned Caffe2 implementation. Whereas Caffe2 uses four handwritten CUDA kernels, we chose to write the TC version as two separately compiled TC functions for better reuse and overall performance. We also experimented with writing a single fused TC but performance degraded. This is mostly due to kernels requiring substantially different grid configurations, which makes their fusion unprofitable. A larger, graph-level compiler that decides on TC function granularity, informed by the TC mapper and the autotuner, is necessary to automate this decision process but is left for future work.
Production Model. The kernels 1LUT, 2LUT, MLP1, and MLP3 are the backbone of a low-latency production model used at scale in a large company and correspond to (1) reductions over a large lookup table embedding (10M rows); (2) fused reduction over two large lookup table embeddings (10M rows); (3) small size Multi-Layer Perceptron (fully connected, bias, ReLU); and (4) very small size, three consecutive Multi-Layer Perceptrons. Despite LUT sizes, this model is essentially latencybound. Existing libraries are often not tuned for low-latency regimes and tend to perform poorly.
On these examples, the need for reuse and instruction-level parallelism is dwarfed by the need to quickly load data from the memory into registers. TC is able to adapt to the problem size, leveraging reduction parallelism to hide memory latency. This results in large speedups over Caffe2 with cuBLAS 9.0.
Transposed Batch MatMul. This kernel is meant as a case study to characterize performance benefits and losses in the current flow, compared with reference libraries. For the sizes relevant to Factorization Machines [56] , (500 × 26 × 72 × 26), Nvidia Profiler reports the TC autotuned kernel taking 56μs on the Nvidia Quadro P6000 GPU (Pascal), while both Pytorch and Caffe2 resort to the specialized cuBLAS function maxwell_sgemm_128x64_nn that takes 87μs. Beyond architecture mismatch indicated in the function name, a detailed performance comparison demonstrates that TC executes 500 blocks of 26 × 13 = 338 threads, compared to 500 blocks of 128 threads for cuBLAS, reaching 81.8% occupancy instead of 23.6%. Additionally, the cuBLAS kernel shows a large number of predicated-off instructions due to the block size not matching the problem size. Occupancy is limited by the number of registers in both cases (11,264 vs. 15,360) , but the TC version can be distributed over five blocks instead of four. 10 TC promotes all tensors to shared memory, saturating its bandwidth, whereas arithmetic instructions are the performance limiter for cuBLAS. Given the large occupancy metric, performance can be further increased by promoting one tensor to registers instead, trading off lower occupancy for reduced pressure on memory bandwidth.
Kronecker Recurrent Units. These have been recently proposed as a solution to drastically reduce model sizes by replacing the weights matrix of a linear layer by a Kronecker product of much smaller matrices [33] . In TC, a Kronecker product of three matrices is easily written as shown in the kronecker3 function in Figure 6 . The following table shows the running time in μs-or out of memory (OOM)-of a large matrix multiplication in Caffe2 and the equivalent Kronecker product of three matrices. Note that the performance difference mostly comes from using a different algorithm. While no specialized GPU library primitives exist for Kronecker recurrent units, TC's automatic flow enabled rapid exploration and reached unprecedented levels of performance, as shown in Table 2 . Clearly, this benchmark deserves a deeper discussion of the space of possible TC derivations, including memory/computation/parallelism trade-offs falling outside the scope of this article. The kronecker3 function is one such possible implementation that performed well for the three selected matrix shapes; it avoids redundant computation at the expense of storage (two tensors for intermediate computations).
WaveNet. WaveNet [64] is a popular model that enables generation of realistic sounding voices as highlighted at Google I/O 2018. We encoded a full WaveNet cell using a single TC function and compared our generated kernel with a WaveNet layer from PyTorch. This experiment uses a batch size of 1, residual and dilation channels of 32, and 256 skip channels. With TC, we observe performance improvements up to 4× on Volta, as shown in Table 2 .
RELATED WORK
Despite decades of progress in optimizing and parallelizing compilation, programmers of computationally intensive applications complain about the poor performance of optimizing compilers, often missing the machine peak by orders of magnitude. Among the reasons for this state of affairs, one may cite the complexity and dynamic behavior of modern processors, domain knowledge required to prove optimizations' validity or profitability being unavailable to the compiler, program transformations whose profitability is difficult to assess, and the intrinsic difficulty of composing complex transformations, particularly in the case of computationally intensive loop nests [6, 26] .
Several contributions have successfully addressed this issue, not by improving a general-purpose compiler, but through the design of application-specific program generators, a.k.a. active libraries [67] . Such generators often rely on feedback-directed optimization to select the best generation schema [60] , as popularized by ATLAS [73] for dense matrix operations (and more recently BTO [10] ) and FFTW [25] for the fast Fourier transform. Most of these generators use transformations previously proposed for traditional compilers, which fail to apply them for the aforementioned reasons. The SPIRAL project [54] made a quantum leap over these active libraries, operating on a domain-specific language (DSL) of digital signal processing formulas. Compilers for DSLs typically rely on domain-specific constructs to capture the intrinsic parallelism and locality of the application. Using such an approach, DSL compilers such as Halide [55] for image processing show impressive results. Its inputs are images defined on an infinite range while TC sets a fixed size for each dimension using range inference. This is better suited to ML applications, dominated by fixed-size tensors with higher temporal locality than 2-D images; it is also less verbose in the case of reductions and does not carry the syntactic burden of anticipating the declaration of stage names and free variables (Halide needs this as a C++ embedded DSL). OoLaLa [42] takes a similar approach for linear algebra, and TACO [37] and Simit [38] use a similar notation as TC but generate sparse matrix code for numerical solvers.
Following this trend in the context of deep neural networks, we not only design yet another DSL and compiler but propose a more generic code generation and optimization framework, bringing together decades of research in loop nest optimization and parallelization for high-performance computing. We also design the domain language to cover a variety of existing and emerging machine learning models. Our framework automates a combination of affine transformations involving hierarchical tiling, mapping, shifting, fusion, distribution, interchange, on either parametric or fully instantiated problems, that are not accessible to Halide [45, 55] , Latte [63] , or XLA's [28] representations of tensor operations.
The polyhedral framework is a powerful abstraction for the analysis and transformation of loop nests, and a number of tools and libraries have been developed to realize its benefits [12, 14, 22, 70, 77] , including production compilers such as GCC (Graphite) and LLVM (Polly). Polyhedral techniques have also been tailored for domain-specific purposes. State-of-the-art examples include the PolyMage [46] DSL for image processing pipelines and the PENCIL approach to the construction of parallelizing and compilers for DSLs [5, 9] . PolyMage is a clear illustration of the benefits of operating at a high level of abstraction, closer to the mathematics of the domain of interest: While GCC/Graphite and LLVM/Polly struggle to recover affine control and flow from low-level code, PolyMage natively captures patterns amenable to domain-specific optimization, such as stencil-specific overlapped tiling with or without recomputation, and cache-conscious fusion and tiling heuristics; it also offers a more productive programming experience for end-users. Interestingly, some techniques derived from PolyMage crossed out of polyhedral representations into Halide's automatic scheduler [45] . Back to deep learning frameworks, TVM extends Halide with recurrent (parallel scan) operators, support for ML accelerators, and tight integration with ML frameworks [17] . It also provides autotuning capabilities [18] and shares several engineering goals of TC, such as transparent ML framework integration. Much like PolyMage, TC implements optimizations well suited to the long-distance, non-uniform reuse patterns of deep learning models; these heuristics are not available in general-purpose compilers such as LLVM/Polly, Pluto, or PPCG, or semi-automatic frameworks such as Halide and TVM.
None of the aforementioned frameworks offer the complete transparency of TC's end-to-end compilation flow. TVM involves some level of manual intervention and/or feedback-directed optimization even for producing the most baseline GPU implementation, and it guarantees functional correctness for a subset of the scheduling primitives and tensor operations: e.g., convolutions can only be fused at the expense of introducing redundant computations or involving lower-level transformations that cannot be verified at compilation time. In addition, the balance between analytical objective functions (profitability heuristics) and feedback-directed autotuning is completely different: Halide and TVM auto-schedulers expose all scheduling decisions to the autotuner and infer most performance-related information from execution profiles, while TC's polyhedral flow reduces the autotuning space to a narrow set of optimization options and tile sizes.
TC also shares several motivations with Latte [63] and PlaidML [49], including a high-level domain-specific language and an end-to-end flow. TC provides elementwise access that is just as expressive when implementing custom layers, but unlike Latte it is more concise (thanks to type and shape inference), safer regarding static bound checking and graph connectivity, and more flexible by decoupling indexing from representation and layout choices. In addition, our framework implements more complex scheduling and mapping transformations than both Latte and PlaidML, some of which are essential to GPU targets with partitioned memory architectures. Unlike Latte, it is also designed as a JIT compilation library for seamless integration with deep learning frameworks. Unlike PlaidML, it is not limited to high-level patterns and rewrite rules, but captures complex affine transformations resulting from analytical modeling and autotuning. As a consequence, the TC compilation process takes generally more time than PlaidML, a price to pay for the ability to implement a wider range of optimizations.
Like TC, XLA [28] provides automatic shape and size inference, it may operate "in process" as a JIT compilation library, and it integrates into a production deep learning framework (TensorFlow, Caffe2 [29] ). XLA shares many motivations with Latte, with a focus on integration and completeness of functionality rather than on the complexity of the optimizations and mapping strategies. Glow [58] is a recent domain-specific, retargetable compiler for PyTorch/Caffe2. It shares many of the motivations and capabilities of XLA, while emphasizing retargetability (CPUs as well GPUs and ML accelerators from multiple vendors) and the ability to differentiate, optimize, and lower operations and sub-graphs of operations within its own hierarchy of intermediate representations. It can leverage blackbox numerical libraries as well as generate custom vector processing kernels relying on LLVM. Our compiler design and algorithmic contributions would naturally fit XLA, Latte, or Glow, except for the following: TC remains independent from a specific computation graph while preserving tight integration with production frameworks; we did not use an embedded DSL approach-keeping C++ as an interface for implementing optimization strategies only-isolating the user from complexity and debugging hurdles of embedded DSLs, and we leverage polyhedral techniques to factor out most of the optimization heavy-lifting, while XLA, Latte, and Glow resort to operation-specific emitters/lowering, optimization schemas, and heuristics.
Recently, R-Stream·TF [52] was presented as a proof-of-concept adaptation of the R-Stream polyhedral compiler to the automatic optimization of TensorFlow operators. Similarly to our approach, the generated code is wrapped as a custom operator of TensorFlow. The tool takes a computation graph as input and partitions it into sub-graphs amenable to tensor fusion, contraction, and layout optimization. R-Stream·TF also leverages the broadcast semantics of TensorFlow to maximize the operator's polymorphism w.r.t. input tensor dimension and shapes. This makes R-Stream·TF very aggressive in terms of static memory management and kernel partitioning. We made the more pragmatic choice of leaving most of these decisions to the level of tensor algebra, allowing a domain-specific optimizer or ML expert to rewrite declarative comprehensions into capacity-and layout-optimized ones. However, TC is more ambitious in its domain-specialization of affine scheduling and mapping, aiming for the generation of a single accelerated kernel, with heuristics adapted to the high-dimensional, non-uniform, long-distance reuse patterns of neural networks. The lack of algorithmic detail in the R-Stream·TF paper prevents us from comparing those affine transformation heuristics.
CONCLUSION
We presented and evaluated the first fully automatic, end-to-end flow, mapping a high-level mathematical language to high-performance accelerated GPU kernels. TC resembles the mathematical notation of a deep neural network and makes it easy to reason about, communicate, and to manually alter the computation and storage/computation trade-offs. Our flow leverages decades of progress in polyhedral compilation to implement the heavy-duty program transformations, analytical modeling of profitable optimizations, and code synthesis. It also implements domain-specific optimizations, code generation, autotuning with a compilation cache, and lightweight integration within Caffe2 and PyTorch. This unique combination differs from alternative proposals relying mainly on autotuning such as TVM [18] , or pattern-based transformations such as PlaidML [49] .
TC is capable of quickly synthesizing solid accelerated implementations that effectively lift bottlenecks in large training runs. In practice, such bottlenecks slow down ML research significantly, requiring substantial engineering efforts to be mobilized. Our contribution addresses this productivity gap; it brings more expressive power and control into the hands of domain experts, relieving ML frameworks' dependence on highly tuned vendor libraries without compromising performance. TC automates boilerplate optimization that has been replicated over the numerous deep learning frameworks and builds on a generic polyhedral intermediate representation and libraries shared with other domains (image processing, linear algebra) and general-purpose compilers (LLVM/Polly). Future work includes additional model-based domain-specific optimizations, CPU code generation, learning best mapping configurations automatically, automatic differentiation, interaction with the graph-level optimizer, and providing a path to emit a series of calls to a native library or hardware acceleration blocks.
