High performance dense linear algebra (DLA) libraries often rely on a general matrix multiply (Gemm) kernel that is implemented using assembly or with vector intrinsics. The real-valued Gemm kernels provide the overwhelming fraction of the performance for the complex-valued Gemm kernels, for the entire level-3 BLAS, and for many of the real and complex LAPACK routines. Achieving high performance for the Gemm kernel translates into a high performance linear algebra stack above this kernel. However, it is a monumental task for a domain expert to manually implement the kernel for every library-supported architecture. This leads to the belief that crafting a Gemm kernel is more dark art than science. It is this premise that drives the popularity of autotuning with code generation in the domain of DLA.
1 Introduction
High performance dense linear algebra libraries (e.g. the Level 3 Basic Linear Algebra Subroutines (BLAS) [3], and LAPACK [1]) are often written using a high performance general matrix-matrix multiplication routine (gemm) [5, 8]. This gemm routine is typically custom-written by a domain expert for a particular architecture. These domain experts are often affectionately called "assembly-ninjas", for two main reasons. Firstly, the routines written by these experts are frequently written in low-level, architecture-specific languages, i.e. assembly. More importantly, the process by which these experts implement these architecture-specific routines is often unknown, or considered too intricate and architecture-specific to be worth modeling. In some sense, the work performed by these experts is more akin to some form of dark art than science.
It is for that reason that many productivity-improvement tools such as auto-tuning and code generation [2, 19, 18] focus on generating code that is auto-tuned around a small black-box kernel that is still implemented by the expert. The main idea is for the auto-tuner/code-generator to find the loop ordering and blocking sizes surrounding the optimized kernel that best suit the targeted architecture, while being agnostic to the details of the hand-coded implementation. The smaller architecture-specific kernel reduces the implementation effort of the expert, thus allowing more high performance kernels, each targeting a different architecture, to be implemented.
Figure 1: The GotoBLAS approach for matrix multiplication as refactored in BLIS. Blocked arrows represent explicit data packing, and thin arrows represent the data layout after packing.
Around the turn of the century, Kazushige Goto revolutionized the way matrix-matrix multiplication is implemented with the GotoBLAS approach [4]. Specifically, Goto demonstrated that data movement and computation for computing gemm can be systematically orchestrated in a specific manner, as depicted in Figure 1, that does not change as we shift between architectures. In Goto's implementations, a small macro-kernel needs to be custom-implemented. The BLIS framework [16] refactored Goto's algorithm, exposing additional loops so that only a much smaller micro-kernel needs to be customized for a new architecture, with all the other parts implemented in the C programming language.
Prior to this paper, the point at which science became art was the implementation of the micro-kernel itself. With this paper, we push science one step further, exposing the system behind the implementation of the micro-kernel. Specifically, we make the observation that a micro-kernel is inherently implemented as a loop around outer-products. The execution of that loop can be thought of as three stages: a first stage consisting of one or more iterations that suffer a certain degradation in performance as hardware resources are being consumed, a second stage during which each iteration computes with data that already exists in registers while masking the data movement needed by this iteration or future iterations, and a third stage during which the final computations are performed and results are flushed from the system. The performance of the micro-kernel is primarily determined by how efficient the second stage is, and that is therefore where we focus our efforts. We codify this effort as a model-based code generation system for micro-kernels, thus automating the last mile for producing high performance gemm routines.
2 Anatomy of a Gemm micro-kernel
BLIS is a framework for rapidly instantiating the BLAS using the GotoBLAS (now maintained as OpenBLAS [13]) approach [4], and it is one of the most efficient expert-tuned implementations of the BLAS. The GotoBLAS approach essentially performs loop tiling [14] and packing of data for the different layers [6] of the cache hierarchy in a specific manner to expose an inner kernel. The specific loop tiling strategy in GotoBLAS has been shown to work well on many modern CPU architectures. BLIS extends these ideas, tiles the loops in the GotoBLAS inner kernel to expose an even smaller Gemm kernel, the micro-kernel, and shows that high performance is attained when this micro-kernel is optimized [20].
The micro-kernel
The micro-kernel is a small matrix multiplication that implements C += AB, where C is an m_r × n_r matrix, while A and B are micro-panels of size m_r × k_c and k_c × n_r, respectively. In addition, because A and B are packed prior to the invocation of the micro-kernel, it can be assumed that A is stored in a contiguous block of memory in column-major order while B is contiguously stored in row-major order. Since the micro-kernel is a small gemm kernel, it can be described, using compiler terminology, as a gemm computed with a triply-nested loop of the KIJ or KJI variant.
Within the BLIS framework, it can also be assumed that m_r, n_r ≪ k_c. In addition, we assume that the bounds of the loops (i.e. k_c, m_r, and n_r) are determined analytically using the models from [11].
Computing the micro-kernel. Mathematically, the micro-kernel is computed by first partitioning A into columns and B into rows. The output C is then computed as

    C += Σ_{i=0}^{k_c − 1} a_i b_i^T,

where the fundamental computation is now C += a_i b_i^T, a single outer-product, and our task is to compute the outer-product multiple times, each time with a new column and row of A and B, in as efficient a manner as possible.
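To make this structure concrete, the following is a minimal scalar C sketch of the micro-kernel as a loop of k_c rank-1 (outer-product) updates over the packed micro-panels. The function name, the row-major storage of C, and the leading dimension parameter ldc are illustrative choices for this sketch, not prescribed by BLIS.

#include <stddef.h>

/* Scalar sketch of the micro-kernel: a loop of k_c rank-1 updates.
 * A is an m_r x k_c micro-panel packed in column-major order, B is a
 * k_c x n_r micro-panel packed in row-major order, and C (m_r x n_r)
 * is assumed row-major with leading dimension ldc.                   */
void microkernel_ref(size_t mr, size_t nr, size_t kc,
                     const double *A, const double *B,
                     double *C, size_t ldc)
{
    for (size_t p = 0; p < kc; ++p) {          /* one outer-product per p */
        const double *a = &A[p * mr];          /* p-th column of A        */
        const double *b = &B[p * nr];          /* p-th row of B           */
        for (size_t i = 0; i < mr; ++i)
            for (size_t j = 0; j < nr; ++j)
                C[i * ldc + j] += a[i] * b[j]; /* C += a_p * b_p^T        */
    }
}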
Decomposing the outer-product
Focusing on a single outer-product, C += ab^T, we can decompose the outer-product further by performing loop tiling of C. This, in turn, requires us to block the columns of A and rows of B into sub-columns and sub-rows of conformal lengths, respectively, as follows:
    a = ( a_0, a_1, …, a_{m_r/m_v − 1} ),   b = ( b_0, b_1, …, b_{n_r/n_u − 1} ),
with C partitioned conformally into blocks C_{i,j}, 0 ≤ i < m_r/m_v, 0 ≤ j < n_r/n_u.
In this case, the outer-product is decomposed into a smaller unit of computation, which we will term the Unit Update, which computes

    C_{i,j} += a_i b_j^T,

where C_{i,j} is now an m_v × n_u matrix, and the subvectors a_i and b_j are of length m_v and n_u, respectively. Decomposing the outer-product into smaller unit updates allows us to determine how the outer-product can be computed with the instructions available on the targeted architecture. In the case where m_v = n_u = 1, each unit update computes a single element of the matrix. This means that the outer-product C is computed element-wise, which can be performed using the following code:

    for i = 0 : m_r/m_v − 1
      for j = 0 : n_r/n_u − 1
        C_{i,j} += a_i b_j^T
where b_j is streamed from the L1 cache, and a_i is loaded into the registers from the L2 cache. Alternatively, interchanging the two loops yields

    for j = 0 : n_r/n_u − 1
      for i = 0 : m_r/m_v − 1
        C_{i,j} += a_i b_j^T

where each iteration of the inner-most loop requires new values of a_i to be loaded from the L2 cache. In either case, we can replace the micro-kernel block in Figure 1 with the new diagram in Figure 2.
Figure 2: An additional three loops are introduced after decomposing the BLIS micro-kernel into smaller outer-product kernels of size m_v × n_u. This set of loops would replace the micro-kernel shown in Figure 1.
The interesting case is when m_v > 1 and/or n_u > 1. In this case, each unit update is a smaller outer-product. By selecting appropriate values of m_v and n_u, we gain the flexibility of mapping the computation of the unit update (C_{i,j}) to the vector / single-instruction-multiple-data (SIMD) instructions available on modern architectures. This flexibility also yields a family of algorithms that compute the outer-product, which is the kernel within the micro-kernel that we are trying to optimize.
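The decomposition can be sketched in plain C as follows. The loop structure, in the spirit of Figure 2, tiles one outer-product into m_v × n_u unit updates, with scalar arithmetic standing in for the SIMD instructions that would implement each unit update. The function and parameter names are illustrative, and m_r, n_r are assumed to be multiples of m_v, n_u.

#include <stddef.h>

/* Sketch of one outer-product C += a*b^T decomposed into m_v x n_u
 * unit updates.  The scalar inner loops stand in for whatever SIMD
 * instructions implement a single unit update.                       */
void outer_product_tiled(size_t mr, size_t nr, size_t mv, size_t nu,
                         const double *a, const double *b,
                         double *C, size_t ldc)
{
    for (size_t i = 0; i < mr; i += mv)        /* loop over row blocks    */
        for (size_t j = 0; j < nr; j += nu)    /* loop over column blocks */
            /* one m_v x n_u unit update: C_{i,j} += a_i * b_j^T */
            for (size_t ii = i; ii < i + mv; ++ii)
                for (size_t jj = j; jj < j + nu; ++jj)
                    C[ii * ldc + jj] += a[ii] * b[jj];
}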
Identifying Outer-Product Kernels
Recall that the micro-kernel is essentially a loop around multiple outer-products. In addition, each outer-product can be further decomposed into smaller unit updates of size m_v × n_u. In this section, we discuss how the values of m_v and n_u are determined by the instructions available on the desired architecture.
The building blocks: SIMD instructions
The key to high performance is to use the Single Instruction Multiple Data (SIMD) vector instructions available on many modern processors. We assume that all computation involves double precision arithmetic and that each vector register can store v double precision floating point numbers, where v is a power of two. In addition, we assume that the following classes of vector instructions are available:
1. Vector Stores. Vector store instructions write all v elements of a vector register to memory. 2. Vector Loads. Vector load instructions read v elements from memory into a vector register. We assume that all elements loaded by a single Load instruction are within v memory addresses of each other. Prefetches are considered Load instructions.
3. Vector Shuffles. Shuffle instructions reorder and/or duplicate the elements in a vector register. We restrict ourselves to shuffles that can be represented by a v × v matrix where each row contains exactly v − 1 zeros and a single one, but each column may contain multiple ones. In addition, we assume that the number of ones in each column is a power of two. 4. Vector Computation. Instructions that perform computation on vector registers are assumed to be element-wise operations. This means that, given vector registers reg_a and reg_b, the output is of the form

    reg_c[i] = reg_a[i] op_i reg_b[i],  for i = 0, …, v − 1,

where op_i and op_j, i ≠ j, may be different binary operators. The result of the computation may be stored in one of the input registers, or in a third vector register. 5. Composite Instructions. Some instructions (we will call them Composite Instructions) can be viewed as a combination of some of the previous three types of instruction. For example, the instruction vfmadd231pd reg, reg, mem{1to8} on the Xeon Phi can be expressed as a Load instruction, followed by a broadcast (Shuffle) instruction, followed by a fused multiply-add (Computation) instruction, as sketched below.
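As an illustration of this decomposition (using AVX2/FMA intrinsics as a stand-in for the Xeon Phi instruction named above, since that ISA is not assumed here), a broadcast-FMA composite can be assembled from its constituent steps:

#include <immintrin.h>

/* Stand-in illustration on AVX2/FMA: a "composite" broadcast-FMA
 * corresponds to a Load plus Shuffle (the broadcast, which duplicates
 * one element of b) followed by a Computation (fused multiply-add).  */
static inline __m256d composite_bcast_fma(__m256d c, __m256d a, const double *b)
{
    __m256d breg = _mm256_broadcast_sd(b);   /* Load + Shuffle: duplicate b[0] */
    return _mm256_fmadd_pd(a, breg, c);      /* Computation: c += a * breg     */
}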
Mapping unit updates to SIMD instructions
Given the available SIMD instructions, one possible size of a single unit update is m_v = v and n_u = 1, where v is the number of elements in a SIMD register. This means that v values from a and a single value from b are loaded into two SIMD registers, reg_a and reg_b. In addition, the loaded value of b is duplicated v times because n_u < v. Computing with reg_a and reg_b yields a single unit update of size v × 1. A single m_r × n_r outer-product can then be computed through m_r/m_v × n_r/n_u unit updates, as shown in Figure 3. Alternatively, a different algorithm emerges when m_v = v and n_u = 2. The difference is that reg_b contains two unique values, each duplicated v/2 times. After the first computation is performed, the values in reg_b have to be shuffled before the next computation can be performed. This computation-shuffle cycle has to be repeated at least n_u − 1 times in order to compute a single unit update. Pictorially, this is shown in Figure 4. The astute reader will recognize that we could instead have chosen to shuffle the values in reg_a without significantly changing the computation of the unit update.
Figure 4: SIMD computation of a 4 × 4 outer-product using two unit updates of size 4 × 2. As there are multiple (2) unique values, vector shuffles must be performed to compute each unit update. Shaded registers denote the output register for the current stage of computation.
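For concreteness, here is a small AVX2/FMA sketch (v = 4 doubles) of the broadcast-based case m_v = v, n_u = 1 applied to a full 4 × 4 outer-product; the function name and the choice of one accumulator register per column of C are illustrative, not taken from the paper's kernels. The shuffle-based variants of Figure 4 would instead load several unique values of b into one register and permute it between FMAs, at the cost of the extra shuffle instructions discussed above.

#include <immintrin.h>

/* Broadcast-based (m_v = v = 4, n_u = 1) unit updates forming one
 * 4 x 4 outer-product, kept entirely in registers.  c[j] accumulates
 * the j-th column of C.                                              */
static inline void outer_product_4x4_bcast(__m256d c[4],
                                           const double *a, const double *b)
{
    __m256d areg = _mm256_loadu_pd(a);                 /* 4 values of a   */
    for (int j = 0; j < 4; ++j) {
        __m256d breg = _mm256_broadcast_sd(&b[j]);     /* duplicate b[j]  */
        c[j] = _mm256_fmadd_pd(areg, breg, c[j]);      /* 4x1 unit update */
    }
}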
A family of outer-product algorithms
What we have learnt from the two algorithms described previously is the following:
-The size of a single unit update can be determined by the number of unique values loaded into registers reg_a and reg_b.
-When there is more than one unique value in registers reg_a and reg_b, the number of computation-shuffle stages required is the minimum of m_v and n_u.
-Loading more unique values into reg_b reduces the number of Loads of b from the L1 cache, at the cost of increasing the number of shuffles required to compute the unit update.
Given that we chose not to shuffle reg_a, there are log_2(v) + 1 different ways of picking n_u, i.e. the number of unique elements loaded into reg_b (while still being a power of 2); for example, with v = 4 the choices are n_u ∈ {1, 2, 4}. For a given choice of n_u, the different ways in which the data in reg_b can be shuffled yield different implementations of the unit update. By accumulating the instructions for computing all m_r/m_v × n_r/n_u unit updates, different sets of instructions, or instruction mixes, describing different implementations of the outer-product can be obtained.
Recall that different numbers of loaded unique elements result in different numbers of loads and shuffle stages. On different architectures, the cost (in terms of latency) of loads and shuffles may differ, which suggests a need for a means to estimate the cost of computing with a given instruction mix. To do so, we model each functional-unit pipeline as a queue and apply Little's Law,

    L = λ W,    (1)

where λ and W are the average arrival rate of new jobs and the average time spent in the system (waiting and processing time) of a job, respectively. Rearranging the above equation, we obtain

    λ = L / W,

which gives us the throughput of a particular queue with an average of L jobs, each taking an average of W units of time.
The overall throughput of the system for computing the outer-product can then be estimated using

    λ_outer-product = min_i ( λ_i / L_i ),

where λ_i is the throughput of pipeline i and L_i is the number of instructions from the instruction mix that have been assigned to that pipeline. Essentially, the throughput for an outer-product is the inverse of the time it takes for the instruction mix to clear the pipeline with the lowest throughput.
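The estimate is simple enough to state as code. The sketch below, with illustrative names, computes min_i(λ_i / L_i) over a set of pipelines; the numbers in main() correspond to the Sandy Bridge example worked out in the next subsection and are included only as a usage check.

#include <stdio.h>

/* Queueing estimate: pipeline i issues lambda[i] instructions per cycle
 * and is assigned L[i] instructions from the mix; the outer-product
 * throughput is the rate of the slowest pipeline.                       */
static double outer_product_throughput(const double *lambda,
                                       const double *L, int n_pipes)
{
    double rate = lambda[0] / L[0];
    for (int i = 1; i < n_pipes; ++i) {
        double r = lambda[i] / L[i];
        if (r < rate)
            rate = r;
    }
    return rate;                              /* outer-products per cycle */
}

int main(void)
{
    /* Illustrative counts for the 4x4 example of Figure 3 on a
     * Sandy Bridge-like model: 5 loads (2/cyc), 4 shuffles (1/cyc),
     * 4 multiplies (1/cyc), 4 adds (1/cyc).                         */
    double lambda[] = { 2.0, 1.0, 1.0, 1.0 };
    double L[]      = { 5.0, 4.0, 4.0, 4.0 };
    printf("%.2f outer-products/cycle\n",
           outer_product_throughput(lambda, L, 4));   /* prints 0.25 */
    return 0;
}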
Estimating throughput
Consider the instruction mix required to compute the 4 × 4 outer-product shown in Figure 3 being executed on the model Sandy Bridge architecture described in Figure 5. The instruction mix contains a single Load of a vector of a, four Loads (with duplication) of elements from b, and four multiply and four add (Computation) instructions. All instructions are sent to their respective pipelines. In addition, another four job items are sent to the pipeline connected to the shuffle functional unit, because the Load of an element of b on the Sandy Bridge is a Composite instruction that comprises two instructions, a Load and a Shuffle. Based on documentation from the hardware manufacturers [7], we know that the throughputs of the Shuffle and Computation pipelines are 1 per cycle, and Loads have a throughput of 2 per cycle. This means that the estimated throughput of the system is
    λ_outer-product = min( 2/5, 1/4, 1/4, 1/4 ) = 0.25 outer-products per cycle.

However, these throughput values are based on the assumption that the instructions in all queues are fully pipelined and independent. When the instructions in a pipeline are not independent, the latency of executing an instruction cannot be hidden, and the throughput of the pipeline drops. This can happen when one instruction has to complete before dependent instructions can be processed. What this means is that the pipeline has to stall, thus increasing the average waiting time of that pipeline.
Figure 6: Our outer-product kernel generation work flow produces expert-level-performing code by enumerating a space of implementations, modeling their expected performance, selecting the top candidate, and translating that into efficient code.
The new average waiting time for the stalled pipeline can be estimated as
where n is the number of dependent instructions and k is the latency of the instruction. Using this new value of W , we can then compute the effective throughput using Little's Law (Equation 1).
Dealing with dependencies across pipelines
The authors of [11] proposed an analytical model for sizing the micro-kernel, where the values of m_r and n_r are chosen such that all computations within a single iteration, i.e. the computations required to compute a single outer-product, are independent. Hence, by adopting the BLIS analytical model presented in [11], we know that all the computation instructions are independent and can be pipelined without introducing bubbles into the pipelines.
To overcome other dependencies between instructions, we can perform loop unrolling [14] and software pipelining [9] during code generation to identify and schedule independent instructions.
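The following is a minimal C sketch of the software-pipelining idea, assuming the packed layouts described earlier: the column of A and row of B needed by iteration p+1 are fetched while the outer-product of iteration p is computed, so their latency overlaps independent work. MR, NR and the explicit double-buffering are illustrative; an actual generator would express this with registers and statically scheduled instructions rather than arrays.

#define MR 4
#define NR 4

/* Software-pipelined k_c loop (illustrative): loads for the next
 * iteration are issued before the FMAs of the current one.          */
void k_loop_pipelined(int kc, const double *A, const double *B,
                      double *C, int ldc)
{
    double a_cur[MR], b_cur[NR], a_nxt[MR], b_nxt[NR];
    int i, j, p;

    for (i = 0; i < MR; ++i) a_cur[i] = A[i];   /* prologue: loads for p = 0 */
    for (j = 0; j < NR; ++j) b_cur[j] = B[j];

    for (p = 0; p < kc; ++p) {
        if (p + 1 < kc) {                       /* issue next iteration's loads */
            for (i = 0; i < MR; ++i) a_nxt[i] = A[(p + 1) * MR + i];
            for (j = 0; j < NR; ++j) b_nxt[j] = B[(p + 1) * NR + j];
        }
        for (i = 0; i < MR; ++i)                /* compute with current data */
            for (j = 0; j < NR; ++j)
                C[i * ldc + j] += a_cur[i] * b_cur[j];
        if (p + 1 < kc) {                       /* rotate the double buffers */
            for (i = 0; i < MR; ++i) a_cur[i] = a_nxt[i];
            for (j = 0; j < NR; ++j) b_cur[j] = b_nxt[j];
        }
    }
}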
Generating the Gemm micro-kernel
In the previous section, we described a mathematical technique for determining the most efficient mix of instructions for implementing the outer-product kernel on a given architecture; we use queueing theory to estimate the sustained throughput of a potential implementation without empirical measurements. The end goal is to transform this highly parallel mix of instructions into an efficient kernel that sustains the predicted throughput.
In this section we describe our code generator, which takes as its input an outer-product instruction mix and outputs a high performance kernel. By using the more parallel outer-product formulation over an inner(dot)-product formulation, the resulting mix contains a large number of independent instructions. Using these independent instructions, i.e. instruction level parallelism (ILP), the latency of these instructions is easily hidden through static scheduling, enabling the kernel to achieve the performance predicted by the model. Thus, the main functions of our Python-based code generator are to capture the desired instruction mix in a template, statically schedule the instruction mix, and emit ANSI C code in a manner that preserves the static instruction schedule. In these three steps, our generator produces a high performance kernel that approaches the theoretical performance of an instruction mix as determined by our queueing theory model.
A Work Flow for Kernel Generation
Our complete work flow is captured in Figure 6 and accomplishes the following. We start with the ISA for the target architecture; this includes the available instructions and their latencies and throughputs. This information is passed to our Unit Update Enumeration stage, which enumerates the space of all unit updates; for example, given the ISA in Figure 7, the unit updates in Figure 8 are enumerated. After the unit update space is enumerated, all possible tilings of unit updates that form outer-products of the desired size m_r × n_r are enumerated (Figure 9). These outer-products are then modeled and their performance estimated using the queueing theory model described in the previous section. This process ensures that the tiling, or instruction mix, selected can sustain a high throughput, given that all other instruction overheads are minimized. The steps that follow focus on minimizing said overhead through several optimizations, chiefly static instruction scheduling.
Continuing with the work flow, the highest performance instruction mix tiling is selected and passed to a kernel builder which blocks the outer-product. This reduces register pressure when we perform instruction scheduling (see Figure 11). The kernel builder then outputs a skeleton of the kernel, like the one in Figure 12, which captures the various blocking parameters of the kernel along with a set of embedding functions such as the one in Figure 10. These embedding functions capture the selected instruction mix as functions. At this point the embedding functions and skeleton represent an untuned matrix multiply kernel that is implemented using the selected instruction mix.
The embedding functions and the skeleton are then passed to a Scheduling and Optimization phase which hides the instruction latency by statically performing software pipelining [9], allowing the kernel to perform near the predicted performance. The resulting statically scheduled code is then emitted using inline assembly macros in ANSI C from [17], which produces C code that can be compiled with a fixed static schedule. This process yields outer-product kernel code (Figure 13) that achieves expert-level performance. The role of the external compiler on this emitted C code is to provide register coloring, simplify memory indexing computation, and ensure efficient instruction alignment for the fetch and decode stages of the processor.
Embedding Functions Capture the Instruction Mix
The instruction mix selected by our queueing model is not a complete kernel; instead it is a collection of instructions that describe the data movement and floating-point computation of data elements in a permuted outer-product. The dependencies, register utilization and memory address computation are implicit at this stage. Therefore, the first step in the transformation from instruction mix to kernel is to make these characteristics explicit. This is done by expressing the three components of an outer-product instruction mix (gathering the elements of A, gathering the elements of B, and performing the multiply and accumulate of C) as embedding functions: get_a_element, get_b_element, and fma (Figure 10). In these functions the dependencies, register utilization, and memory address computation are made explicit. These functions take the following inputs: arrays of registers that represent the elements of a, b and c, along with the indices for the m, n and k dimensions of the current unit update. The embedding functions map the indices of the outer-product to the appropriate instructions from the instruction mix. Because these functions capture the majority of the work, a handful of optimizations are applied to them: we optimize for the fetch and decode stage by minimizing bytes per instruction, and we simplify dependencies to minimize register utilization.
Figure 10: In order to pass the instruction mix to the kernel code generator, it is encoded as a set of functions, similar to the listing shown. These functions dispatch to a specific instruction depending on which element C_{ii,jj} of the outer product is being computed.
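As a rough illustration (not the actual Figure 10 listing, and with the dispatch rule invented for this example), an embedding function can be thought of as a small dispatcher that maps the unit-update indices to the instruction macro that must be emitted; the macro names below follow the generated excerpt shown later in Figure 13.

#include <stdio.h>

#define NU 2   /* illustrative: two unique b values per vector register */

/* Hypothetical sketch of a "get_b_elem" embedding function: the first
 * element of a register block of b is obtained with a Load, the rest
 * by Shuffles of the value already in registers.  The dispatch rule
 * (jj % NU) is an assumption made for the example.                    */
void get_b_elem(int ii, int jj)
{
    (void)ii;                                   /* b depends only on jj */
    if (jj % NU == 0)
        printf("VLOAD_IA(GET_B_ADDR(%d),GET_B_REG(%d))\n", jj / NU, jj);
    else
        printf("VSHUFFLE_IA(0x05,GET_B_REG(%d),GET_B_REG(%d))\n", jj - 1, jj);
}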
Optimizing for bytes per instruction. On most modern processors, the maximum throughput of the fetch and decode units is low enough to become a bottleneck. Thus, some of the optimizations performed by our generator minimize the instruction length and decode complexity in order to avoid this bottleneck. The following decisions ensure that shorter instructions are generated:
1. In some cases, we generate instructions that are meant to operate on single-precision data instead of instructions that operate on double-precision data. An example of this is the use of the vmovaps instruction to load reg a, instead of vmovapd. This is because both instructions perform the identical operation but the single-precision instruction can be encoded in fewer bytes.
2. We hold the partially accumulated intermediate m_r × n_r matrix of C, which we will refer to as T, in high-ordered registers (i.e. registers xmm8 to xmm15). On most architectures we tested, high-ordered SIMD registers require more bytes to encode. Thus, by using the low-ordered registers to hold working values and the high-ordered registers to store T, we ensure that each instruction has at most one register operand (i.e. the output operand) that is a high-ordered register.
3. For memory operations, address offsets that are beyond the range of −128 to 127 bytes require additional bytes to encode. Therefore, we restrict address offsets to fit in this range by offsetting the base pointers into A and B by 128 bytes.
Figure 11: Once a candidate outer-product tiling is selected (Figure 9), we perform an additional layer of blocking (m_s and n_s) to assist the code generator in minimizing register spills. Register blocking allows fewer registers to be live at a given cycle, thus allowing the code generator to aggressively schedule the instructions.
Eliminating unnecessary dependencies. Notice that each permute-multiply-add step of a unit update can be performed independently once the permutation of reg_b has been completed. The key is to ensure that, in generating the permutations, unnecessary dependencies are not introduced that make the independent permute-multiply-add steps dependent. Consider the following code snippet for producing four permutations of B:

    vmovapd (addr_B), %ymm1
    vshufpd $5, %ymm1, %ymm1, %ymm1
    vperm2f128 $1, %ymm1, %ymm1, %ymm1
    vshufpd $5, %ymm1, %ymm1, %ymm1

Notice that each instruction is dependent on the result of the permutation instruction immediately before it. As a result, the previous set of instructions would take longer to compute than the following sequence of instructions:

    vmovapd (addr_B), %ymm1
    vperm2f128 $1, %ymm1, %ymm1, %ymm2
    vshufpd $5, %ymm1, %ymm1, %ymm3
    vshufpd $5, %ymm2, %ymm2, %ymm4

In both cases the same permutations are being computed, but in the latter case the permutations are stored in different registers. This eliminates the false dependencies between the instructions, so the shuffle instructions are independent and can be executed concurrently.
From Embedding Functions to Generated Code
Once we have an instruction mix for the outer product, as determined by our queueing theory model, we can generate an implementation that hides the instruction latency in this instruction mix. Static instruction scheduling is key for this next step; however, this optimization is limited by the number of available registers. Thus, the primary step in the kernel builder (Figure 6) is to determine a further layer of blocking for the outer product.

    /* initialize temp buffer */
    for( i = 0; i < m_r; i++ )
      for( j = 0; j < n_r; j++ )
        c_reg[i][j] = 0;
    /* computation */
    #unroll(k_u)
    #schedule_software_pipeline
    for( pp = 0; pp < k_b; pp++ )
      /* perform the outer products */
      for( i = 0; i < m_r; i+=m_s )
        for( j = 0; j < n_r; j+=n_s )
          for( ii = i; ii < i+m_s; ii++ )
            get_a_elem( a_reg, ii, j );
            for( jj = j; jj < j+n_s; jj++ )
              get_b_elem( b_reg, ii, jj );
              fma( c_reg, a_reg, b_reg, ii, jj, pp );
    /* accumulate temp to results */
    for( i = 0; i < m_r; i++ )
      for( j = 0; j < n_r; j++ )
        C[(i,j)] += c_reg[i][j];

Figure 12: In this code skeleton we capture an outline of the generated kernel. We pass a similar outline to our code generator along with our instruction mix. This mix is encoded as the functions get_a_elem, get_b_elem and fma. The code generator uses this information to generate the code, perform optimizations such as unrolling and code motion, and schedule the resulting kernel code using software pipelining in a way that targets the microarchitecture.
Limits imposed by registers.
Recall that to compute a unit update, n_u permutations of the elements in reg_b are required. However, a multiply and an add are performed with each permutation. This implies that each unit update will require two new registers (R_R = 2): one to store the permutation of reg_b, and another to hold the output of the multiplication. On architectures with a fused multiply-add instruction, only one new register is required (i.e. R_R = 1). As the registers that hold the accumulated result are reused over multiple outer-products, the number of unit updates (N_updates) that can be performed without register spilling is constrained only by the number of registers, given by the following:
where R_total and R_A are the total number of registers and the number of registers required to hold the column of A. We select the additional blocking dimensions such that

    m_s · n_s ≤ N_updates · v.
In Figure 11 we show the blocking and scheduling of a vshuffle instruction mix (Figure 9) for an m_r × n_r = 8 × 4 outer-product.
Scheduling and Tuning
The instruction mix selected by the Queueing Model Estimator is translated by the Kernel Builder into several embedding functions (get_a_elem, get_b_elem and fma), like those in Figure 10. These functions are embedded in a looping structure that matches the outer-product kernel (Figure 12). These loops iterate over the m_r, n_r, m_s and n_s dimensions and generate the dependencies between the instructions inside the embedding functions.
    for( pp = 0; pp < k_b; pp+=KUNR ) {
      /* STEADY STATE CODE */
      VLOAD_IA(GET_A_ADDR(0),GET_A_REG(0))
      VLOAD_IA(GET_A_ADDR(1),GET_A_REG(1))
      VLOAD_IA(GET_B_ADDR(0),GET_B_REG(0))
      VSHUFFLE_IA(0x05,GET_B_REG(0),GET_B_REG(1))
      VFMA(GET_A_REG(0),GET_B_REG(0),GET_C_REG(0,0))
      VFMA(GET_A_REG(0),GET_B_REG(1),GET_C_REG(0,1))
      VPERM2F128_IA(0x01,GET_B_REG(1),GET_B_REG(2))
      VSHUFFLE_IA(0x05,GET_B_REG(2),GET_B_REG(3))
      VFMA(GET_A_REG(1),GET_B_REG(0),GET_C_REG(0,0))
      VFMA(GET_A_REG(1),GET_B_REG(1),GET_C_REG(0,1))
      /* snip */
    }

Figure 13: A generated excerpt from our kernel generator. The resulting kernel code implements the instruction mix identified by our queueing theory model and is statically scheduled to maintain the estimated performance of the mix.
Once these dependencies are built, a few basic optimizations, such as common sub-expression elimination, are performed such that only the original instruction mix plus a few looping instructions exist in the final output code.
Next, the code generator performs software pipelining [9] over the entire looping structure of the outerproduct. By statically scheduling the kernel, the risk of instruction stalls is minimized thus allowing the processor to compute the instruction mix near the rate predicted by our model. Once the code is scheduled, the generator emits a mix of C code and inline assembly instruction macros which preserve the schedule [17] . This resulting code implements a high performance outer-product kernel with the desired dimensions and the selected instruction mix. We provide an excerpt of a generated kernel in Figure 13 .
Experimental Results
In this section, we test the effectiveness of our kernel generation system in automating the last mile for high performance dense linear algebra. We evaluate both the queueing theory model, which finds an efficient outer-product instruction mix, and the code generation system, which translates that mix into a high performance kernel.
We use a variety of machines, listed in Table 16, that span a diverse range of double precision vector lengths (v ∈ {2, 4, 8}), numbers and partitionings of functional units, and instruction latencies. Because our kernel fits in the context of a larger Goto/BLIS-style Gemm, the blocking parameters m_c, k_c, m_r and n_r are determined from the analytical models developed in [4] and [11] along with the cache and microarchitecture details listed in Table 14 and Table 15, respectively. The microarchitecture details in particular (Table 15) were used by the queueing theory model to select the highest throughput outer-product instruction mix. Additionally, these details determined N_updates and the register sub-blocking dimensions m_s, n_s using the formulas developed in the previous section.
Lastly, the Xeon Phi requires that four threads run concurrently in order to effectively utilize a core. This requires that we distribute the work across multiple threads. Therefore we used the implementation in [15] with the following parameters: the numbers of threads used in each dimension (i_c and j_r) must satisfy i_c · j_r ≤ 59 · 4, and ideally should be factors of m/m_c and n, respectively. By empirical selection, i_c = 12 and j_r = 16 satisfied both of those requirements and resulted in the largest number of cores that achieved efficient per-core performance.
Figure 14: Cache details of the processors used in our experiments. These cache details are needed for determining m_c, k_c, m_r and n_r according to [4] and [11]. The value S_l corresponds to the size of the l-th level of cache, W_l is the number of ways, and N_l is the number of cache lines in each way.
Figure 15: Here we capture the pertinent microarchitecture parameters that are used for our queueing theory model. The latency columns give the latency in cycles of each instruction u, where L1 and L2 represent instruction reads that hit in those caches. In the case of a system without fused multiply-add (fma), the latency is represented as the sum of the multiply instruction and the add instruction. The columns R_u represent the functional units that are required to compute instruction u. For some instructions multiple functional units may be required (represented by ∧) and some instructions may take multiple paths (represented by ∨).
Figure 16: The cache blocking parameters m_c and k_c were determined using the results in [4] and the hardware parameters in Table 14 and Table 15. The register blocking parameters m_r and n_r were determined from [11] using the values in Table 14. Lastly, N_updates and subsequently the sub-blocking dimensions m_s and n_s were determined using Equation 2. m_c, k_c, m_r, n_r, m_s and n_s correspond to the values used in the generated code (see Figure 12). Note that N_updates · v ≥ m_s · n_s.
Figure 17: We estimate the number of cycles needed to compute our generated Xeon Phi outer-product kernels. The first column is the number of vpermute unit updates of size 4 × 8 used to implement the outer product. The remainder of the outer-product is computed using multiple 1 × 8 broadcast-based unit updates.
Analysis of Queueing Model
In order to demonstrate the effectiveness of our model, we compare the measured performance against the performance predicted by our queueing theory model.
For the Xeon Phi we compare the performance of eight different instruction-mix implementations of an 8 × 30 outer-product. We selected a family of instruction mixes where the work is partitioned between permute-based unit updates of size 4 × 8 and broadcast-based unit updates of size 1 × 8. In Table 17 we detail each outer-product implementation. Each row represents a specific implementation, where the first two columns give the number of permute and broadcast unit updates in the implementation. In the next two columns we compute the number of instructions that need the memory port (p_mem) and the vector port (p_0), respectively. For the Xeon Phi each permute component requires one load instruction, a permute instruction and four fma instructions. The broadcast-based components each require one load and one fma instruction. Additionally, each implementation requires four prefetch instructions that occupy the memory ports. In the fifth column we use our queueing theory model to estimate the performance of the implementation. We can estimate the performance in flops per cycle as

    flops/cycle = λ_outer-product · m_r · n_r · 2.
In the last column we estimate the performance in GFLOPS using the following formula:

    GFLOPS = (flops/cycle) · (clock frequency in GHz).
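A small helper, with illustrative names, that combines the two estimates above:

/* Modeled flops per cycle from the outer-product throughput, then
 * GFLOPS by scaling with the clock frequency (in GHz).              */
double est_flops_per_cycle(double lambda_op, int mr, int nr)
{
    return lambda_op * mr * nr * 2.0;        /* one fma counts as 2 flops */
}

double est_gflops(double flops_per_cycle, double clock_ghz)
{
    return flops_per_cycle * clock_ghz;      /* cycles per second, in units of 1e9 */
}

For instance, the 4 × 4 Sandy Bridge example earlier had λ = 0.25 outer-products per cycle, which corresponds to 0.25 × 4 × 4 × 2 = 8 flops per cycle.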
Each of these implementations has a different throughput predicted by our model. In our experiment (Figure 19), we compare the relative performance of these implementations. Assuming that the overheads are similar across all implementations, then if the model did not fit we would expect a significant difference between the relative ordering of the implementations and the predictions. However, for the Xeon Phi we see that the relative ordering of the implementations is preserved in the experimental results, with the exception of one implementation; we suspect that the overhead is slightly lower for the 0 permute, 30 broadcast implementation. We repeat the same experiment with the Sandy Bridge processor. In Table 18 we estimate the performance for several implementations. Unlike the Xeon Phi experiment, we chose two different kernel sizes (m_r × n_r). According to [11], the 8 × 4 implementation is more efficient than the 4 × 12. In Figure 19 we plot the performance of the four implementations. What we see is that, despite our model predicting identical performance, there is a significant difference between the kernels of different sizes. What this demonstrates is that even if we can produce an efficient kernel in isolation, our model operates within the constraints of the larger GotoBLAS/BLIS algorithm. There are also additional and subtle microarchitectural details that explain the difference between implementations of the same size on this system. For example, even though both ports p_2 and p_3 service memory operations, they are limited in the total number of bytes that can be read in a cycle. Therefore, the Sandy Bridge retires fewer than 2 memory operations per cycle. In the case where this is not an issue (between the two 8 × 4 implementations), we attribute the performance difference to scheduling, because the permute-based approach has fewer dependencies than the broadcast implementation, giving the scheduler greater freedom to hide instruction latency. This experiment demonstrates that for outer-product implementations of the same size we can accurately estimate the performance of our generated implementations. However, our kernels operate within the constraints of a bigger GotoBLAS/BLIS Gemm algorithm, and our performance is ultimately limited by the parameters selected for the bigger algorithm. In the next subsections, we look at this interaction in the opposite direction: how decisions made in generating the kernel affect the overall GotoBLAS/BLIS algorithm.
Figure 18: We estimate the number of cycles needed to compute our generated Sandy Bridge outer-product kernels. We implement outer-products of size m_r × n_r ∈ {8 × 4, 4 × 12}. Note that the model predicts similar performance across these implementations; however, due to subtle microarchitecture details the experimental performance is different.
Analysis of the Generated Kernel
We evaluate the effectiveness of our kernel generation approach by comparing the performance of our generated outer-product kernels against state-of-the-art Gemm implementations such as OpenBLAS [13] and ATLAS [19]. We selected OpenBLAS because it is the highest performing open source BLAS implementation on most architectures, including the systems used in this paper. ATLAS was also selected because it is a high performance code generation system. Unlike our code generator, that framework relies on hand-tuned assembly kernels and uses search to determine the blocking dimensions around these kernels. The systems used in this experiment represent the past four major microarchitecture designs from Intel (Table 16). The parameters for the GotoBLAS/BLIS Gemm algorithm were analytically selected to maximize performance; these values also match the ones used by OpenBLAS.
In these experiments (Figure 20 and Figure 21), our generated code is within 2-5% of the expert-tuned OpenBLAS. We suspect this difference is due to loop overhead: we rely on the compiler to optimize the loops, which results in several extra instructions over the expert code. The older the processor generation, the more pronounced this effect is, which is why ATLAS outperforms our code on the Penryn. We believe we can resolve this difference by implementing the looping structure in inline assembly, which should give us performance that is nearly identical to the expert-written code.
Sensitivity to Parameters
In addition to using static scheduling and avoiding register spilling, we observe that even in the presence of an Out-Of-Order engine there is a benefit from maintaining uniformly-sized N-updates for creating the outer product.
Given two implementations of the micro-kernel, we vary the instruction tile sizes and compare the performance. Our reference implementation uses uniformly sized N-updates of size m_s × n_s = 4 × 2. We compare this to an implementation composed of two types of N-updates, of sizes m_s × n_s = 4 × 3 and m_s × n_s = 4 × 1. We illustrate these two implementations in Figure 22, where each outer-product is partitioned according to a uniform or non-uniform scheme.
We ensure that both implementations are free of register spilling and are scheduled not only to avoid stalls but also to ensure that the number of instructions between prefetch instructions and their subsequent loads is uniform. We ran both implementations through the Intel Architecture Code Analyzer (IACA), a software simulator for Intel microarchitectures, and determined that both implementations lack instruction stalls, spend an equal number of cycles on each functional unit, and have an identical throughput (Figure 23).
Figure 20: We compare the performance of our generated kernels against ATLAS and the OpenBLAS for various problem sizes to demonstrate that expert level performance can be automated. We see that our generated code approaches the performance of hand tuned expert code and for most architectures exceeds the performance of the generated ATLAS code.
Figure 21: As in Figure 20, we compare the performance of our generated kernels against the OpenBLAS and ATLAS.
Figure 23: Each port represents a functional unit that is used for our operation. D represents data fetch latency. What this shows is that both the uniform and non-uniform shaped implementations of the same outer product look identical to the Out-of-Order engine, as simulated by the IACA tool. However, we will show that the performance of the two implementations is significantly different.
However, the results in Figure 24 do not reflect the results we obtained from IACA: the non-uniform N-update implementation performs 4% worse than the uniform N-update implementation.
The non-uniform N-update implementation leads to clusters of instructions with very long encodings. This would present a bottleneck for the decoder and slow down the overall execution rate. Using uniform N-updates results in large instructions being evenly distributed throughout the code which prevents the decoder from becoming a bottleneck.
Register Spilling. For generating our kernels, our aversion to register spilling goes beyond the performance penalty of the additional store to and load from memory. The reason is that these kernels sit inside a much larger Gemm algorithm that achieves high performance by reducing Translation Look-aside Buffer (TLB) misses; by spilling registers to memory this is disrupted, and performance degrades significantly as a result of these TLB misses.
To demonstrate how large an effect register spilling in the kernel has on the number of TLB misses, we evaluate three kernels with varying degrees of register spilling (No Spilling, Moderate and Heavy). This is achieved by varying how much the N-updates overlap when we schedule them using software pipelining. The greater the overlap, the greater the register pressure and the larger the number of spills. In addition to measuring performance (flops per cycle), we also measure TLB misses using PAPI [12]. The goal is to show that by increasing the amount of register spilling we disrupt how the larger Gemm algorithm avoids TLB misses.
The performance per cycle results in Figure 25 demonstrate that as we increase the number of spills, performance decreases, which is what we would expect. We see that for large problem sizes the number of TLB misses is greater for the Heavy amount of spilling than for the Moderate amount, which in turn is greater than for the No Spilling case. If the added latency incurred by the register spills were the only source of performance penalty, then we would not expect to see a change in the number of TLB misses between the three cases.
This shows that register spilling has performance implications beyond the additional round trip to the cache, because it disrupts the TLB-miss-avoiding characteristics of the Gemm algorithm described in [4]. For practical purposes this removes spilling as an option when the outer-product instruction mix is translated into a kernel.
Figure 26: The overall algorithm that our kernels are embedded in, the GotoBLAS/BLIS Gemm, maximizes performance by ensuring that the kernel receives data at a sufficient rate while minimizing TLB misses.
Spilling into memory requires that extra TLB entries be utilized to address memory that would not have been used otherwise. Thus even if spilling improves the kernel performance in isolation, it will degrade the overall performance of the Gemm operation.
Conclusion
In this paper, we address the last-mile problem of generating the architecture-specific micro-kernel for the general matrix-matrix multiplication routine that underlies most high performance linear algebra libraries. Specifically, we reveal the system behind generating high performance kernels that traditionally are implemented manually by a domain expert. We decompose the smallest unit of computation that is currently written by the expert into even smaller units, which we term unit updates. Using these unit updates, we generate a family of algorithms for computing the micro-kernel, and use analytical models to estimate the running time of each possible implementation of the micro-kernel. Finally, we generate code for the implementation that the model predicts to be the best. We demonstrate that our systematic approach for generating this micro-kernel yields kernels that are competitive with expert-optimized versions.
To validate that our generated micro-kernels are indeed high performance, we compared our generated results with those from OpenBLAS, which uses a similar approach to high performance matrix multiplication. On many of the architectures, we demonstrated that the generated kernels are within 2-5% of the OpenBLAS performance. In addition, we also showed that the generated kernels scale in a similar fashion when parallelized on an SMP system such as the Xeon Phi. While analytically generating the micro-kernels makes us competitive with expert-implemented kernels, empirical fine-tuning could then be employed to recover the missing performance without having to exhaustively search over a large space. This is something we will explore in the future.
