Achieving high-performance GPU kernels requires tailoring algorithm implementations to the targeted GPU architecture. It is of utmost importance to fully exploit the compute and memory hierarchy, as well as the available specialized hardware.
Introduction
Implementing high-performance kernels for GPUs is incredibly challenging. Achieving performance close to the theoretical peak requires the careful application of a plethora of optimizations to optimally use the compute and memory hierarchy of the target architecture. In order to maximize available memory bandwidth, one needs to make use of vectorized loads and stores, padding to avoid shared memory bank conflicts, or storage layout transformations. Optimizations affecting the compute hierarchy include warp specialization, swizzling, or the exploitation of special built-in instructions like NVIDIA's WMMA (warp-level matrix multiply accumulate).
Figuring out which optimizations lead to the best performance is a tedious task. Today, kernels are either written in low-level programming languages such as CUDA or PTX assembly, or they are generated using high-level frameworks such as Halide [15, 16] or TVM [6]. In the first case, low-level implementation details are interwoven with, and obscure, the intended functionality. This makes it hard to try out different optimizations, as doing so usually requires rewriting large parts of the existing code. To mitigate this, high-level frameworks such as Halide separate the description of the computation (the algorithm) from its optimizations (the schedule) using a separate scheduling language that describes the optimizations to apply. Here, however, users are still limited to the set of optimizations, and the ways of applying them, that the specific framework supports.
In this paper, we propose Fireiron, a framework for gradual forging of high-performance GPU kernels. Fireiron is a language and compiler which, similar to Halide and TVM, provides a small set of decompositions which a programmer uses to design implementations of specifications. In Fireiron, important implementation concerns such as the mapping of data to threads are first-class concepts of the language and are specified before less important decisions are made. This allows a programmer, for instance, to decide how matrices are stored across the GPU, with the remainder of the implementation following these prior constraints.
Similar to Halide, a specification represents the computation to implement (algorithms in Halide) and a decomposition describes how to (partially) implement a given specification (schedule in Halide). Our decompositions are composable and modular, so programmers can start an implementation with a simple decomposition and then gradually refine it in order to specialize it for the current target hardware. This modularity allows decompositions to be reused when targeting different architectures.
When implementing high-performance applications, GPU programmers usually rely extensively on libraries of fine-tuned routines such as cuBLAS and cuDNN. One limitation is that inputs to these routines must be passed in addressable storage such as global memory or shared memory. This prevents programmers from reusing values stored in, for example, registers or FMA argument collectors, whose address cannot be taken. This means existing implementations cannot be decomposed into smaller, flexible components from which new, efficient implementations could be composed by library users.
Consider the case when the input matrices are distributed across the registers of the threads of a cooperative thread array (CTA). Ideally, we want to specify the location of the matrix (in registers) when calling the library procedure. However, today, the procedure needs to be compiled for a particular size of the matrix and a particular number of threads in the CTA. Additionally, the implementation of the producers of the matrices needs to be coordinated so that the matrices are placed in the registers expected by the consumers. This coordination prevents independent development and reuse of implementations. Fireiron is a language and compiler architecture that allows efficient library procedures to be broken into reusable pieces.
Since every DSL is only as good as the abstractions it provides, experts often struggle to break out of restrictions in situations where they need to make use of optimizations for which suitable abstractions cannot be defined easily. In contrast to existing high-level frameworks like Halide, Fireiron allows the programmer to regain full control and provide custom implementations (e.g., handwritten inline assembly), for arbitrary specifications at any point in a decomposition. This allows performance experts to fine-tune specific parts of their implementations using hand-crafted solutions to specifications.
To summarize, Fireiron is a framework for simplifying the development of high-performance GPU kernels, targeted towards experts. We achieve this by striking a balance between high-level abstractions and control, as well as by providing novel mechanisms to facilitate reuse of optimized implementations that go beyond reusing tuned library routines. In this paper, we make the following contributions:
1. We introduce Fireiron, a high-level language and compiler providing easy-to-use Decompositions to implement Specifications of GPU computations.
2. We enable rapid prototyping of complex optimizations by introducing Refinements for decompositions, which allow gradual improvement of an existing implementation with minimal effort. We illustrate these concepts using the example of matrix multiplication.
3. We show how to use and extend Fireiron's basic decompositions to make use of the TensorCores in newer NVIDIA architectures, and show how to define and reuse optimizations as first-class objects in Fireiron.
4. We evaluate our approach by comparing against handwritten CUDA code targeting different architectures. We express the same optimizations in Fireiron and show that our generated code achieves the same performance as the hand-tuned kernels. Additionally, we compare our approach to cuBLAS and show that by using Fireiron, in combination with mma.sync assembly instructions and carefully chosen tile sizes, we are able to outperform high-performance library implementations by up to 2×.
Motivation
Fireiron's design is based on four main goals. We describe each goal below, together with the mechanism we use to achieve it.
Control: Fine-grained Control over the Implementation Strategy. Every domain-specific language is only as good as the abstractions it provides to its user. When striving for optimal performance, there will always be implementations for which no suitable abstractions are available in high-level DSLs. In order to avoid fighting restrictive abstractions, Fireiron offers kernel programmers the ability to regain full control at any point and for any part of the implementation. By allowing the programmer to insert their own specialized code as micro-kernels for matching specifications, we achieve a balance between leveraging high-level abstractions and maximizing productivity and performance.
Mechanism: The programmer specifies the implementation strategy using a decomposition language. This language controls the decomposition of the computation, the placement of data into memory, the communication of data between levels of the memory hierarchy, as well as the mapping of computation onto the compute elements. Decomposing a specification yields a new sub-specification which describes the problem left to implement. A programmer decomposes a specification until it is executable, that is, until it matches either the specification of a built-in instruction such as fused multiply-add (FMA), or the specification of a user-provided micro-kernel for which they have provided an implementation. During code generation we then either emit the built-in instruction or the provided code snippet.
Reusability: Reuse of Implementation Decompositions. We allow the kernel programmer to reuse previously developed high-performance decompositions, including those that result in generated code which typically cannot be implemented as a traditional library procedure.
Mechanism: The implementation of a kernel is hierarchical and each fragment of the implementation can be described with a concise specification. Specifications are implemented with decompositions that are stored as first-class objects and can therefore be reused. Essentially, a decomposition breaks existing efficient library procedures (such as a kernel-level GEMM computation) into reusable pieces which can be stored as first-class objects. This allows the extraction and reuse of specific parts of an efficient implementation, such as a tiled warp-level GEMM whose operands are stored in shared memory or registers.
In more detail, the implementation is a decomposition tree whose leaves are specifications (representing the remaining sub-computations to implement). In a final implementation, the leaf specifications are executable, e.g. built-in instructions such as FMA or user-provided micro-kernels. In a partial implementation, the leaves can be specifications that still need to be implemented.
Flexibility: Target (Specialized) Instructions with Arbitrary Granularity. The architecture of GPUs changes rapidly and significantly. For example, the recent NVIDIA Volta and Turing architectures contain specialized MMA (Matrix-Multiply Accumulate) units, so-called TensorCores, which are able to efficiently compute matrix multiplications using small groups of cooperating threads. Currently, this functionality is exposed in two ways: as 8-thread HMMA instructions in PTX (mma.sync), and as warp-wide WMMA instructions in CUDA. The decomposition language must be able to express implementations which make use of these specialized instructions despite the varying granularity of their cooperating threads.
Mechanism: In Fireiron, kernels, micro-kernels, and built-in instructions are described using the same concept: specifications. Once a programmer decomposes a problem into a specification that is executable, i.e., a specification for which we know how to generate code, they can decide to stop decomposing the specification and instead pass the implementation to the code generator.
Imagine we decompose a kernel-level M×N×K matrix multiplication into a warp-level 16×16×16 matrix multiplication computation. We can either decompose this specification further into several thread-level FMA instructions, or stop decomposing here and generate a single WMMA invocation. This mechanism allows us to adapt to future architectures as we can simply add new built-in instructions and their specifications to Fireiron, which is exactly what we did for both HMMA and WMMA.
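As an illustration of what such a single WMMA invocation looks like at the CUDA level, consider the following sketch using the nvcuda::wmma API; the 16×16×16 half-precision configuration is the standard WMMA setup, and the function name and pointer parameters are illustrative rather than part of Fireiron:

  #include <cuda_fp16.h>
  #include <mma.h>
  using namespace nvcuda;

  // One warp multiplies a 16x16 half tile of A with a 16x16 half tile of B and
  // accumulates into a 16x16 float tile of C (column-major, leading dims lda/ldb/ldc).
  __device__ void warpMma16x16x16(const half *a, const half *b, float *c,
                                  int lda, int ldb, int ldc) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> aFrag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

    wmma::load_matrix_sync(aFrag, a, lda);                      // fill operand fragments
    wmma::load_matrix_sync(bFrag, b, ldb);
    wmma::load_matrix_sync(cFrag, c, ldc, wmma::mem_col_major);

    wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);                 // the single WMMA invocation

    wmma::store_matrix_sync(c, cFrag, ldc, wmma::mem_col_major);
  }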
Cooperation: Data Types for Parallel Cooperation. In parallel programs, multiple threads cooperate to (1) load arrays, (2) compute new values, and (3) store them back. In advanced implementations, the mapping of the array elements to threads may change between these three phases.
To simplify programming such implementations, we want to abstract away from the mapping as long as possible, specifying it only at lower levels of the decomposition. This makes higher-level specifications more general, permitting more possible implementations.
Mechanism: Distributed arrays are data types whose elements the programmer can distribute over private memories. Fireiron provides the illusion that a distributed array is indivisibly stored across, say, the registers of multiple warps; here, registers are memories that are private to each thread. The Fireiron compiler automatically divides the array and distributes the pieces onto registers of threads of the involved warps. This is especially useful for the epilog of a kernel implementation, where a CTA needs to move computed results (usually residing in registers spread across all its threads) back to global memory.
Specs and Decompositions
Most high-performance kernels for GPUs are written in a hierarchical style. The original problem to be computed is decomposable into smaller sub-problems of the same kind. These sub-problems are then assigned to and computed by the different levels of the compute hierarchy. Figure 1 visualizes this observation using a simple matrix multiplication kernel.
Step by step, the matrix multiplication is decomposed into a hierarchy of tiles and data is transferred to lower levels of the memory hierarchy until eventually every thread computes a single FMA instruction. Here, FMA can be viewed as a matrix multiplication of matrices which only contain a single element.
Fireiron introduces two main concepts: Specifications and Decompositions. These two concepts allow us to describe implementations, such as the one shown in Figure 1, and their mapping to the hardware in a natural way.
Specifications
A Specification (spec) is a data-structure describing the computation to implement. A spec contains enough information such that a programmer would be able to manually write an implementation. In particular, a spec keeps track of the shapes, locations and storage layouts of its input and output tensors, as well as which level of the compute hierarchy (i.e., Kernel, Block, Warp or Thread) is responsible for computing this operation. Currently, Fireiron supports two main classes of specs: Matrix Multiplication (MatMul) and data movement (Move). A kernel-level matrix multiplication spec, for example, records the sizes M, N and K, the locations and layouts of the three matrices, and the responsible compute hierarchy level. Unless specified otherwise, we assume that all matrices are stored in column-major format and contain elements of type float, and we write MatMul(M,N,K)(GL,GL,GL)(Kernel) as a short form for this kernel-level spec.
Given a spec, one can do one of two things: 1. execute it, if it matches an executable spec; or 2. decompose it into a smaller sub-spec.
Executable Specifications
A specification is called executable when it matches the specification of a user-provided micro-kernel or of a built-in instruction. Fireiron contains a built-in collection of executable specs matching different instructions like FFMA and HFMA. For example, the FMA instruction has the spec MatMul(1,1,1)(RF,RF,RF)(Thread). When a final implementation contains executable leaf specs, Fireiron will emit the matching built-in instruction or chosen micro-kernel when generating the implementation.
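As a concrete, illustrative example (not Fireiron's generated code verbatim), the thread-level spec MatMul(1,1,1)(RF,RF,RF)(Thread) corresponds to a single fused multiply-add on per-thread registers, which nvcc lowers to one FFMA instruction:

  // Thread-level MatMul(1,1,1)(RF,RF,RF): one fused multiply-add on registers.
  __device__ __forceinline__ void fma1x1x1(float a, float b, float &acc) {
    acc = fmaf(a, b, acc);   // equivalently: acc += a * b;
  }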
Micro-Kernels
At any time when decomposing specs with Fireiron, the user can provide a handwritten micro-kernel which implements the current spec. This allows the Fireiron user to break out of the DSL and use custom implementations, potentially written in low-level assembly, for which we cannot yet provide good abstractions.
Decompositions
A Decomposition describes how to (partially) implement a given spec. More specifically, a decomposition is a function Spec → Spec which, given a spec, returns a new spec that represents the smaller decomposed sub-problem. Fireiron provides two main decompositions, tile and load, which allow implementations to use the compute and memory hierarchy of a GPU.
• spec.tile(r,c) creates r×c shaped tiles in the output matrix. Input matrices are tiled accordingly. In order to assign tiles to a level of the compute hierarchy, we can further refine the tiling by applying .to(level), which changes the compute hierarchy level responsible for the resulting tiled spec. Figure 2 shows the effect of tiling a MatMul spec.
• spec.load(M,l,d) loads the matrix M to the level l of the memory hierarchy, with an implementation described by the decomposition d. As Figure 3 shows, applying load to a MatMul spec results in a new spec in which the memory location of the specified operand has changed; the load itself will be implemented as specified in the load-decomposition d. The memory hierarchy has three levels: global memory (GL), shared memory (SH) and registers (RF).
Fireiron is implemented as a domain-specific language, embedded in Scala, which generates CUDA. Users can also define custom decompositions in order to implement advanced optimisations, as described later.
Decomposing Matrix Multiplication
Listing 1 shows an example decomposition of a MatMul spec using only tile and load. The done operator marks the end of the decomposition and invokes the code generator. This example can already be compiled into a simple, correct implementation. Here, the load decompositions, which describe how to implement the data movement, are omitted for brevity (denoted _). Note that, unlike in the implementation shown in Figure 1, the K-dimension remains unchanged and the location of the C matrix has not changed. The residual spec (MatMul(1,1,K)(RF,RF,GL)(Thread)) is not executable; instead it describes a simple dot-product which is implemented in the micro-kernel dot.cu and passed as an optional argument to done.
This simple decomposition already establishes major design decisions for implementing matrix multiplication: it defines 1) how many elements each level of the compute hierarchy computes, and 2) how many elements of each operand are stored at which level of the memory hierarchy (which determines the overall memory consumption). Further decompositions will preserve this mapping, which emphasizes Fireiron's methodology for developing high-performance GPU kernels: Fireiron allows programmers to focus on one thing at a time and then to gradually improve unspecified or sub-optimal parts of the decomposition.
Every decomposition needs to specify three things, which also need to be specified when adding new decompositions to Fireiron: a. given the current spec, how to compute a new spec; b. the code to emit during code generation when processing this decomposition; and c. how to update the view into the current slices of the spec's operands, to be able to generate correct indexing expressions. For the tile decomposition, if a level of the compute hierarchy has been assigned (using the .to(level) refinement), then instead of emitting sequential for-loops we use the unique compute hierarchy indices of the specified level to assign a tile to each unit at that level.
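A minimal CUDA sketch of this difference, assuming 64×64 tiles of C and illustrative names (this is not Fireiron's generated code verbatim):

  // Without .to(level): tiling emits sequential loops over the 64x64 tiles of C.
  __device__ void tileSequential(int M, int N) {
    for (int tileN = 0; tileN < N / 64; ++tileN)
      for (int tileM = 0; tileM < M / 64; ++tileM) {
        /* code for the residual spec of tile (tileM, tileN) */
      }
  }

  // With .to(Block): the loops disappear and every CTA selects its tile
  // via its unique block index instead.
  __device__ void tileToBlock() {
    int tileM = blockIdx.x;   // one 64x64 tile of C per CTA
    int tileN = blockIdx.y;
    /* code for the residual spec of tile (tileM, tileN) */
  }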
For the load decomposition, we allocate a temporary array in the required memory hierarchy level at the beginning of the kernel and emit loops to copy the data to the new memory region. Finally, if the residual spec is executable, we emit the associated built-in instruction or micro-kernel (e.g., FMA or dot.cu in Listing 1, respectively).
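To make the emitted load code concrete, the following CUDA sketch copies a 128×8 tile of a column-major matrix A into a previously allocated shared-memory buffer, distributing the copy over the CTA's threads; tile sizes, names and the thread mapping are illustrative:

  // Copy one 128x8 tile of A (column-major, leading dimension lda) to shared memory.
  __device__ void loadTileOfAToShared(const float *A, float *aShared,
                                      int lda, int globalOffset) {
    for (int i = threadIdx.x; i < 128 * 8; i += blockDim.x) {
      int row = i % 128, col = i / 128;            // position within the 128x8 tile
      aShared[i] = A[globalOffset + col * lda + row];
    }
    __syncthreads();   // omitted when the .noSync refinement is used
  }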
c. Index Computation. Almost every optimization affects the computation of indexes in some way. In Fireiron, index computations for accessing all matrices are calculated and emitted when required. Every application of a decomposition returns a new spec in which we either sliced the operands or moved them to a new memory location. In both cases, these changes need to be considered when generating index expressions for accessing the array elements. For example, when applying tile to the MatMul spec, the generated index expressions are offset by the tile indices of the assigned compute hierarchy level, as one would expect. When applying load, a fresh indexing expression for the newly allocated array is generated.
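A small CUDA sketch of such index expressions after two tiling steps (a 64×64 block tile followed by an 8×8 thread tile of a column-major C with leading dimension ldc; the concrete thread mapping is illustrative):

  // View of C after .tile(64,64).to(Block) followed by .tile(8,8).to(Thread):
  // the final index is composed from the block-level and thread-level offsets.
  __device__ float *threadTileOfC(float *C, int ldc) {
    int row = blockIdx.x * 64 + (threadIdx.x % 8) * 8;
    int col = blockIdx.y * 64 + (threadIdx.x / 8) * 8;
    return &C[col * ldc + row];   // the thread's 8x8 view into C
  }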
Expressing Advanced Optimizations
In this section, we show how to express advanced optimization strategies by refining the strategy shown in Listing 1.
Extending Specialized Decompositions
Fireiron's existing decompositions can easily be extended by the user to express custom ways to decompose a given spec. In this section, we introduce two matrix multiplication specific decompositions (split and epilog). To add a new decomposition to Fireiron, we need to explain how to: a) compute a new sub-spec; b) generate a partial implementation; c) update the index computations for the involved operands.
Splitting the K-Dimension. For matrix multiplications, we need to be able to create tiles in the K-dimension. This allows the best use of shared memory, especially for large matrices, where a whole row of the A matrix (or column of B) might not fit into the limited shared memory.
Since the split decomposition is specific to matrix multiplication, it can only be applied to MatMul specs. Applying spec.split(k) to a MatMul spec with K-dimension K yields a residual MatMul spec with K-dimension k, covering one chunk of the reduction; during code generation, split emits a loop over the K/k chunks (a sketch of this loop is shown below).
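A minimal CUDA sketch of the partial implementation emitted for split; the chunk size and loop-variable names are illustrative:

  // spec.split(8): the residual MatMul(M,N,8) spec is executed once per chunk of
  // the K-dimension. The .unroll refinement unrolls this loop, and .sync emits a
  // __syncthreads() as the last statement of the loop body.
  __device__ void splitK(int K) {
    for (int kChunk = 0; kChunk < K / 8; ++kChunk) {
      /* code for the residual MatMul(M, N, 8) spec of chunk kChunk */
    }
  }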
Specifying Efficient Epilogues. The decomposition shown in Listing 1 did not change the memory location of C. This is because, without split, we write to C only once, namely after computing the dot-product of a whole row of A and a column of B. In an efficient implementation, however, the K-dimension is split into chunks and outer-products are accumulated in registers. Once we finish iterating over the chunks of the K-dimension, every CTA contains the final results of its assigned tile in registers spread across its threads.
In the simplest case, every thread stores its results to the required position in global memory. However, depending on the memory layout of the operand matrices, it can be more efficient to cooperate with other threads of the same CTA and accumulate partial results in shared memory before storing them to global memory. Fireiron supports the concept of distributed arrays and provides the illusion of an indivisible CTA-level matrix C, even though the actual matrix is distributed across registers private to each thread. This allows separate decompositions for storing the computed results back from registers to global memory to implement, for example, the more efficient cooperation via shared memory. Typical libraries cannot offer this kind of composability: we cannot pass this distributed CTA-level matrix to a procedure which implements the store, because the address of the accumulation registers cannot be taken.
Epilog. In order to express advanced decompositions for storing computed results, as well as for accumulating intermediate results in registers, we introduce a new decomposition .epilog(l,i,d). Similar to load, i and d are decompositions. Here, i describes the initialization of a new buffer in location l (usually registers) used for accumulating the results. The decomposition d describes the implementation of the Move spec which represents storing the results from l back to C's original location (usually global memory). Applying epilog changes the location of C in the residual spec to l; since Fireiron's decomposition language is hierarchical, the subsequent decompositions only need to know this new location of C. Similar to computing index expressions for load, we start with a fresh index expression for accessing the newly allocated buffer. During code generation, epilog emits the allocation and initialization of the buffer before the main computation, and the store back to C's original location after it.
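The following CUDA sketch shows the structure emitted for an epilog that accumulates in registers; the buffer size, thread mapping and index expressions are illustrative:

  // Each thread accumulates an 8x8 tile of C in registers (the buffer created by
  // the init decomposition i); the store decomposition d writes it back afterwards.
  __global__ void epilogSketch(float *C, int ldc) {
    float cAcc[8 * 8];                               // accumulation buffer in location l = RF
    for (int i = 0; i < 8 * 8; ++i) cAcc[i] = 0.0f;  // emitted for the init decomposition i

    /* main computation: the residual spec now sees C located in cAcc */

    int row0 = blockIdx.x * 64 + (threadIdx.x % 8) * 8;   // generated index expressions
    int col0 = blockIdx.y * 64 + (threadIdx.x / 8) * 8;
    for (int j = 0; j < 8; ++j)                      // emitted for the store decomposition d
      for (int i = 0; i < 8; ++i)
        C[(col0 + j) * ldc + (row0 + i)] = cAcc[j * 8 + i];
  }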
Advanced Optimization using Refinements
Listing 2 shows how to use these decompositions to express the implementation from Figure 1. A strategy like this already establishes the sizes of data assigned to all levels of the compute and memory hierarchy. In order to generate high-performance kernels, however, we need to specialize all parts of the decomposition to make optimal use of the target hardware. This is where the traditional approach of optimizing GPU kernels quickly becomes tedious. Conceptually trivial changes, like changing storage layouts during loads or using inline PTX rather than CUDA, require disproportionate amounts of work since a large fraction of the kernel needs to be rewritten. Fireiron provides easy-to-use refinements to liberate the programmer from these tedious tasks and allow them to focus productively on achieving the best possible performance.
Refinements are optional modifications to decompositions which enable advanced optimizations. For example, to(level) is a refinement we have already seen for the tile decomposition. Without this refinement, we generate sequential for-loops; with it, we assign a specific level of the compute hierarchy to the newly created tiles, effectively computing them in parallel. In this fashion, Fireiron provides multiple easy-to-use refinements which allow kernels to be gradually fine-tuned to achieve optimal performance.
Tile Refinements
The following refinements are available for tile decompositions:
• .to(level) assigns the created tiles to the specified level of the compute hierarchy.
• .unroll unrolls the generated for-loops.
• .layout(l) assigns the tiles to elements of the compute hierarchy in either column- or row-major order. This allows the programmer to match the storage layout so that array accesses are coalesced.
• .swizzle(perm) introduces the use of advanced swizzle expressions to further optimize the mapping of tiles to elements of the current level of the compute hierarchy.
The layout and swizzle refinements enable changing the mapping of tiles to blocks, warps and threads. They allow the programmer to use different tile shapes for different operations at the same level of the compute hierarchy (such as independent loads of the A and B matrices). This allows advanced optimisations where the assignment of data to threads is more complex than merely column- or row-major.
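As an illustration of what these refinements change, the following CUDA sketch maps a linear thread index to a two-dimensional tile coordinate; the concrete mappings are examples rather than Fireiron's defaults:

  // Map a linear thread index to the 2D coordinate of its assigned tile.
  __device__ void tileCoordinate(int tilesPerRow, int *row, int *col) {
    int t = threadIdx.x;
    // t = perm(t);                 // .swizzle(perm) would permute t first
    *row = t / tilesPerRow;         // .layout(RowMajor): threads walk along a row of tiles
    *col = t % tilesPerRow;
    // .layout(ColMajor) would instead compute:
    //   *col = t / tilesPerColumn;  *row = t % tilesPerColumn;
  }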
Load Refinements. Fireiron contains the following refinements for load decompositions:
• .noSync avoids emitting a __syncthreads() when loading to shared memory, removing potentially unnecessary synchronization.
• .storageLayout(l) stores the elements in the destination buffer using either row- or column-major storage layout.
• .pad(n) allocates n extra columns in the destination buffer, to avoid shared memory bank conflicts (see the sketch after this list).
• .align(n) ensures a specific alignment for the created destination buffer.
• .reuseBuffer reuses a previously allocated buffer in the same memory location if it is no longer used, to reduce memory usage.
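The sketch referenced in the .pad bullet, assuming 32 four-byte shared memory banks and illustrative tile sizes: padding over-allocates each row of the buffer so that column accesses no longer map to a single bank.

  __global__ void paddedTileSketch(const float *A, float *out, int lda) {
    // .pad(1): one extra column so that consecutive rows of the 32x32 tile
    // start in different shared-memory banks.
    __shared__ float tile[32][32 + 1];

    int row = threadIdx.y, col = threadIdx.x;   // assumes a 32x32 thread block
    tile[row][col] = A[col * lda + row];
    __syncthreads();

    // Each warp now reads a column of the buffer (tile[0][row], tile[1][row], ...).
    // With the padding these accesses hit 32 different banks; an unpadded 32x32
    // buffer would cause a 32-way bank conflict instead.
    out[row * 32 + col] = tile[col][row];       // per-block output offsets omitted
  }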
Split Refinements. Fireiron also allows the user to define refinements for custom decompositions. This is done by registering the new refinements, including their required code snippets, in the code generator. For the split decomposition, for example, we add two refinements:
• .unroll unrolls the created for-loop, like the equivalent refinement on tile.
• .sync emits __syncthreads() as the last statement in the body of the created for-loop. This may be required depending on how shared memory is used in a particular implementation.
We are aware that some refinements, especially noSync and sync, can allow incorrect implementations due to race conditions. However, a decomposition without refinements will always generate correct code. Until now these issues have not caused problems, as Fireiron has only been used by performance experts. We nevertheless intend to improve the analyses within Fireiron to ensure these refinements cannot cause correctness issues.
Instructions with Varying Granularity
In order to support tensor cores on newer GPUs, we have added new executable specs to Fireiron which represent the specialized MMA (Matrix Multiply Accumulate) operations.
Supporting WMMA. The new WMMA API in CUDA 10.0 introduces warp-wide matrix multiply operations which operate on warp-wide register collections called fragments. In order to generate kernels which use the WMMA API, we extend Fireiron in two ways. First, we extend Fireiron's memory hierarchy and add a new level Fragment<M,N,K> (parameterized to match the CUDA API) in between shared memory and registers. This allows data to be loaded into the fragments required by the WMMA operations.
Second, we add three new built-in instructions to Fireiron; the executable spec of wmma::mma_sync, for example, is MatMul(16,16,16)(FR,FR,FR)(Warp). These small additions allow us to write a simple Fireiron decomposition which uses the WMMA API, as shown in Listing 3 (lines 5-14):

  5    .tile(64, 64).to(Block)
  6    .epilog(fragment, Float,
  7      Move.tile(16, 16).to(Warp).done,    // init
  8      Move.tile(16, 16).to(Warp).done)    // store
  9    .split(16)
  10   ///// WARP-LEVEL /////
  11   .tile(16, 16).to(Warp)
  12   .load(MatMul.a, fragment, Move.done)
  13   .load(MatMul.b, fragment, Move.done)
  14   .done   // => MatMul(16,16,16)(FR,FR,FR)(Warp)

Note that this code does not apply any further decompositions. It computes the matrix multiplication as follows: 1) assign 64×64 elements to a CTA (line 5); 2) initialize the 16 (4×4) accumulator fragments (line 7); 3) fill the operand fragments (lines 12-13); 4) compute the result (line 14); and 5) store the results from the fragments back to global memory (line 8). Note that we only need to decompose the computation down to the warp level, because we can generate code for the residual executable spec (line 14) using the built-in wmma::mma_sync instruction.
Supporting HMMA. Using the HMMA instructions exposed via PTX allows even more fine-grained control over how the tensor cores are used. In order to support these instructions, we extend Fireiron with executable specs which exactly describe their semantics. Note the unusual shapes of the operands, which are dictated by the semantics of the HMMA instruction; none of the existing decompositions can be used to create such slices of the operands. In order to decompose a spec into this executable spec, we need to add an additional mma-specific decomposition. The mmaTile decomposition expects a contiguous 16×16 warp-level MatMul spec and returns an executable HMMA spec.
Evaluation
To evaluate Fireiron, we compare the performance of our generated code against three references: 1) a manually-tuned kernel targeting the Maxwell architecture written by NVIDIA's GPU performance experts, 2) the publicly available CUDA sample kernel cudaTensorCoreGemm targeting the WMMA API, which exploits NVIDIA's TensorCores on the Volta and Turing architectures, and 3) the high-performance GEMM implementations provided in cuBLAS, which are written in low-level assembly and are the fastest GEMM implementations available for NVIDIA GPUs today.
We chose this set of comparisons to highlight Fireiron's capability to generate efficient code for different GPU architectures, ranging from older ones such as Maxwell up to state-of-the-art ones such as Turing. Furthermore, this set of references includes a wide variety of optimizations which are all expressible using Fireiron. These range from simple tiling and shared memory padding (to avoid bank conflicts), as used in the CUDA sample code, to carefully-tuned swizzles [14] and inline PTX assembly which achieves the highest performance possible.
We expressed all reference algorithms using decompositions in Fireiron and compare the achieved performance of our generated code on different architectures. Specifically, we used three different GPUs: a GeForce GTX 750 Ti (Maxwell), a Quadro GV100 (Volta) and a GeForce RTX 2080 Ti (Turing), with CUDA 10.0 and driver version 425.00, and compiled all kernels using -O3 --use_fast_math -arch=sm_XX, where XX = 52, 70, and 75 for Maxwell, Volta, and Turing respectively. We locked the clocks to fixed frequencies, report the minimum kernel runtime of 1000 runs using nvprof, and omit data transfer times because we are only interested in the quality of our generated kernel code.
Targeting the Maxwell Architecture
The first kernel we compare against is manually tuned for larger input sizes (M, N, K ≥ 1024) and optimized to run efficiently on the Maxwell architecture. Listing 4 shows the decomposition used to express this specific implementation. Note that we express two different load strategies for prefetching operands A (lines 22-27) and B (lines 29-34) to shared memory. This is because we compute GEMM_NT, where one operand is transposed; in order to perform efficient loads, we need to consider the storage layout of both operands such that global memory loads are coalesced. We use advanced swizzle expressions (line 2) shuffling the mapping of data to threads to avoid shared memory load conflicts. Furthermore, we make heavy use of refinements to explicitly specify which loops to unroll and where to add or avoid synchronization. This decomposition also uses vectorized loads (lines 45 and 48), sparse thread-tiles (line 38) and a custom epilog (line 12) which swizzles the data via shared memory, using an unconventional data-layout, to achieve efficient stores to global memory.

  9     .tile(128, 128).to(Block)
  10    .layout(ColMajor)
  11    // --- epilog: store results RF => GL ---
  12    .epilog(RF,
  13      Move   // init 64 registers per thread
  14        .tile(64, 32).to(Warp)
  15        .tile(8, 8).to(Lane)
  16        .tile(1, 1).unroll
  17        .done,
  18      Move   // store results using micro-kernel
  19        .done(storeCUDA))
  20    .split(8) ...
  ...
  22    .load(MatMul.a, SH, Move
  23      .tile(128, 1).to(Warp)
  24      .tile(64, 1).unroll
  25      .tile(2, 1).to(Lane)
  26      .layout(ColMajor)
  27      .done).toColMajor.noSync
  ...
  30    .tile(8, 16).to(Warp)
  31    .tile(8, 4).unroll
  32    .tile(1, 1).to(Lane)
  33    .layout(ColMajor)
  34    .done).toRowMajor.pad(...)
  ...
  38    ... Strided((4, 32), (4, 16)) ...
  39    .to(Lane)
  40    .layout(ColMajor)
  41    .swizzle(swizz)
  42    .split(1).unroll
  43    // --- load A to RF ---
  44    .load(MatMul.a, RF,
  45      Move.tile(4, 1) ...
  ...
  48      Move.tile(1, 4).unroll.done)
  49    // --- perform computation ---
  50    .tile(1, 1).unroll
  51    .done   // residual = MatMul(1,1,1)(RF,RF,RF)(Thread)

Listing 4. Efficient decomposition for large input sizes on Maxwell (excerpt). Optimizations expressed include: vectorized loads, sparse thread-tiles, swizzling, and a custom epilog micro-kernel.

Figure 4 shows the achieved performance relative to the reference kernels (higher is better). We executed both the reference and the Fireiron-generated kernels on the Maxwell, Volta, and Turing architectures. Our automatically generated code achieves 94.4% of the available performance compared to the manually written CUDA code on Maxwell, as well as 98% and 99% on Volta and Turing respectively. Due to automatic index expression generation, our loops and array indices look slightly different to those of the hand-written CUDA code we compared against, and can be further optimised, which explains the remaining gap in performance. However, due to the tensor cores introduced with the Volta architecture, this implementation is now only of limited interest. The Volta architecture supports the MMA instruction, which has an 8× higher theoretical peak performance than the FMA instruction. In fact, we are able to significantly outperform this specific implementation (up to 5×) by avoiding the FMA instruction and using Fireiron decompositions better suited to the newer architectures. Similarly, this GEMM implementation only achieves 37% of the performance on Volta and 19% of the performance on Turing when comparing to cuBLAS. Still, this comparison highlights Fireiron's ability to express exactly the same optimizations as manual implementations and nonetheless achieve the same performance.
Targeting TensorCores using WMMA
The NVIDIA CUDA samples contain two kernels which use WMMA to put the TensorCores to use. The first kernel simply introduces the API calls and is not optimized. In fact, the decomposition shown in Section 4.3 generates exactly this kernel. The second kernel is slightly optimized, using tiling and padded shared memory buffers to avoid bank conflicts. Listing 5 shows this implementation expressed as a Fireiron decomposition. Figure 5 shows the achieved performance of the optimized decomposition on both Volta and Turing. Again, our decomposition expresses exactly the optimizations implemented by hand in the reference kernel. Therefore, the generated kernel achieves the same performance as the manually written kernel, but the decomposition is much more concise than the optimized CUDA code. In order to develop this decomposition, we were able to start with the unoptimized decomposition and gradually refine it, focusing on adding one optimization at a time.
  ...
  .epilog(FR,
    Move   // init
      ...
      .tile(16, 16).unroll.done,
    Move   // store: FR => GL
      // FR => SH (first step)
      .load(Move.src, SH, Move
        .tile(64, 32).to(Warp)
        .tile(16, 16).unroll.done).reuseBuffer
      // SH => GL (second step)
      .tile(16, 128).to(Warp)
      .tile(1, 128).unroll
      .tile(1, 4).to(Lane)
      .done)
  .split(128).sync.unroll
  .load(MatMul.a, SH, Move
    .tile(16, 128).to(Warp)
    .tile(2, 128).unroll
    .tile(1, 8).to(Lane)
    .done).noSync
  .load(MatMul.b, SH, Move
    .tile(128, 16).to(Warp)
    .tile(128, 2).unroll
    .tile(8, 1).to(Lane)
    .layout(ColMajor)
    .done).pad(8)
  ///// WARP-LEVEL /////
  .tile(64, 32).to(Warp)
  .split(16)
  // --- fill WMMA fragments for A ---
  .load(MatMul.a, FR, Move
    .tile(16, 16).unroll.done)
  // --- fill WMMA fragments for B ---
  .load(MatMul.b, FR, Move
    .tile(16, 16).unroll.done)
  // --- perform WMMA computation ---
  .tile(16, 16).unroll
  .done   // MatMul(16,16,16)(FR,FR,FR)(Warp)

Listing 5. Fireiron WMMA decomposition (excerpt). Note that we do not descend down to the thread level, as the WMMA instruction is a warp-wide instruction.

Compared to cuBLAS, however, this implementation still only achieves 61% of the performance on Volta and 68% of the performance on Turing. This is because the memory and compute hierarchy need to be used more efficiently. In particular, using PTX's mma.sync instruction, instead of the more coarse-grained WMMA API, enables delicate control over the TensorCore computation.
High-Performance HMMA Inline PTX
Finally, we focus on achieving the highest performance possible and compare against cuBLAS, which provides the most efficient GEMM implementations for several architectures and input sizes (written in optimized SASS assembly). We compare against the cublasGemmEx routine using half-precision floating point, so that the mma instruction is used on architectures with TensorCores. We express our implementations by composing Fireiron's decompositions and using the mmaTile decomposition to target the executable spec for HMMA. cuBLAS uses different CTA-tile sizes depending on the input size; we parameterized our decompositions, which allows us to choose different sizes as well.

Since we are interested in seeing whether Fireiron can express all optimizations required to achieve state-of-the-art performance on new architectures, we explore several input sizes in this experiment. Specifically, we run two separate experiments where we explore small (M, N, K ≤ 1024) and large matrices (1024 ≤ M, N, K ≤ 4096). The optimizations required to achieve high performance on small and large input sizes differ significantly, as small input sizes expose fewer opportunities for parallelism. cuBLAS provides multiple optimized implementations and chooses a specific one based on internal heuristics. In order to have a fair comparison, instead of generating only one Fireiron kernel, we wrote two parameterized decompositions (one generally better suited for smaller and one for larger input sizes), exhaustively explored CTA-tile sizes (powers of two from 2^4 to 2^8), and report the best performance.

Figure 6 shows the achieved performance compared to cuBLAS for the large input matrices (higher is better), and Figure 7 shows the achieved performance for small matrices. Generally, we are able to significantly increase the performance compared to the previous Maxwell and WMMA implementations. For large matrices, we exactly match the performance of the carefully tuned SASS kernels used in cuBLAS in three cases. On average, we achieve 93.1% of the cuBLAS performance, with a minimum of 88.3% in one case and a maximum of 101% in two cases. These results show that Fireiron can produce state-of-the-art CUDA kernels which achieve performance very close to the practical peak. This is emphasized by the fact that the cuBLAS kernels are provided as optimized SASS assembly, which exposes further optimization opportunities unavailable to Fireiron's generated CUDA kernels. We left out the decompositions for this experiment for brevity, but none made use of any micro-kernels. Instead, we used mmaTile and the library of executable HMMA specs to inject inline PTX for the appropriate residual specs.
For small input sizes (< 1024), cuBLAS typically uses implementations where the reduction of the K-dimension is additionally parallelized across warps (usually referred to as in-CTA split-k). This creates a three-dimensional warp arrangement and exposes more parallelism in the cases where two-dimensional tiles are not enough to saturate all cores.
The speedups we achieve for the smallest input sizes are due to the tile sizes we found to work best for this particular setup. We generally found a tile size of 16 × 16 in the M and N dimensions and 64 in the K dimension, computed by two warps per CTA, to perform best. The heuristics of cuBLAS also chose 64 for the K dimension, but larger sizes for the M and N dimensions, which reduces the parallelism available and prevents even better performance.
Developing high-performance kernels in Fireiron allows comfortable experimentation with optimizations. With Fireiron, developers can solely focus on expressing optimizations while the compiler automates tedious tasks like computing array indices which are easy to get wrong and often need to change throughout the whole kernel. This significantly increases programmer productivity and leads to high-quality GPU code.
Related Work
Schedule-based Compilers. Fireiron is heavily inspired by Halide [15, 16] and TVM [6] in using decompositions (i.e., scheduling primitives) to express optimizations for specs. Similar frameworks include Tiramisu [3], CHiLL [5], and, for graph algorithms, GraphIt [25]. Some frameworks also use C's pragmas and other annotations to create DSLs for manual loop optimisations [8, 11, 13]. None of these, however, provides the same flexibility, nor do they allow decomposing existing efficient library routines into reusable components. Languages [4, 20, 24] which orchestrate the application of rewrite rules in term rewriting systems can be seen as predecessors to today's schedule-based compilers. However, to the best of our knowledge, these languages have not been used in the context of generating efficient GPU code which implements state-of-the-art optimizations and which can compete with the performance of vendor-tuned libraries. Similar analogies can also be drawn between Fireiron and the use of tactics in theorem provers to manually decompose proofs into sub-goals [7], though our schedule DSL is not Turing-complete and does not allow the same levels of reflection found in these tactic languages.
High-Performance Code Generation. Currently, Fireiron provides optimized implementations for a set of specs. Lift [10, 19], Tensor Comprehensions [21] and Futhark [12] are frameworks also aiming at high-performance code generation. In contrast to the fixed specifications encoding the computations used in Fireiron, these frameworks allow computations to be specified in a flexible high-level (functional) programming language. Rather than simply extending the set of available Fireiron specs, we can imagine expressing computations in a similar pattern-based high-level programming language in the future.
Polyhedral Compilers. Diesel [9], NOVA [9] and PPCG [23] are compilers making heavy use of optimization via the polyhedral model. Since most of our decompositions eventually result in nested loops with affine accesses, Fireiron could potentially profit from using polyhedral techniques too, especially as the resulting implementation has a representation similar to Schedule Trees [22]. In contrast to polyhedral optimization techniques, however, Fireiron's decompositions operate at a higher level of abstraction: they directly modify specifications which are later compiled to low-level loops implemented in CUDA.
Auto-Tuning and Program Synthesis
Auto-tuning approaches, including Halide's auto-tuners [1, 16], OpenTuner [2], ATF [17] and MD-Hom [18], and program synthesis techniques such as SwizzleInventor [14], aim to automatically develop optimized implementations by navigating a search space of possible implementations. We see potential for a similar automatic search space exploration over Fireiron's decompositions; however, as of today, Fireiron is designed as a tool for performance experts, simplifying the development of optimizations rather than automatically searching for highly optimized implementations.
Conclusion
In this paper we introduced Fireiron, a scheduling language for high-performance linear algebra on GPUs. We introduced the concept of specifications, which represent a computation to optimize, and decompositions, which partially implement them. Defining low-level PTX assembly as well as macro-instructions like WMMA as executable specs allows us to flexibly target new architectures, including TensorCores.
Using matrix multiplication as a case study, we showed how to develop high-performance implementations using Fireiron's specs, decompositions and refinements. Fireiron is expressive enough to support all optimizations typically used in hand-tuned kernels and flexible enough to allow the insertion of micro-kernels when no suitable high-level abstractions can be built easily. Our experimental evaluation shows that Fireiron generates code with performance competitive with vendor-tuned high-performance libraries. Finally, all Fireiron programs are composed of building blocks which can be reused in future implementations targeting new hardware architectures, allowing these optimisations to be applied more widely.
