Many linear algebra libraries, such as the Intel MKL, Magma or Eigen, provide fast Cholesky factorization. These libraries are suited for big matrices but perform slowly on small ones. Even though State-of-the-Art studies begin to take an interest in small matrices, they usually feature a few hundreds rows. Fields like Computer Vision or High Energy Physics use tiny matrices. In this paper we show that it is possible to speed up the Cholesky factorization for tiny matrices by grouping them in batches and using highly specialized code. We provide High Level Transformations that accelerate the factorization for current multi-core and many-core SIMD architectures (SSE, AVX2, KNC, AVX512, Neon, Altivec). We focus on the fact that, on some architectures, compilers are unable to vectorize and on other architectures, vectorizing compilers are not efficient. Thus hand-made SIMDization is mandatory. We achieve with these transformations combined with SIMD a speedup from ×14 to ×28 for the whole resolution in single precision compared to the naive code on a AVX2 machine and a speedup from ×6 to ×14 on double precision, both with a strong scalability.
I. INTRODUCTION
Linear algebra is everywhere, especially in scientific computation. There are a lot of fast linear algebra libraries like MKL [1] , Magma [2] or Eigen [3] . However, these libraries are optimized for big matrices. Experience shows that these libraries are not adapted for tiny matrices and perform slow on them.
Many computer vision applications require real-time processing, especially for autonomous robots or associated to statistical figures, linear and curves fitting, ellipse fitting or even covariance matching [4] . This is also the case in High Energy Physics (HEP) where computation should be done on-the-fly. In these domains, it is usual to manipulate tiny matrices. There is, therefore, a need for a linear algebra which is different from classical High Performance Computing. Matrices up to a few dozen of rows are usual, for example through Kalman Filter: [5] uses a 5 dimensions Kalman Filter and [6] uses 4 dimensions. More and more people take an interest in smaller and smaller matrices [7] , [8] , [9] .
The goal of this paper is to present an optimized implementation of a linear system solver for tiny matrices using the Cholesky factorization on SIMD multi-core architectures, for which it exists no efficient implementation unlike on GPUs [10] . People tend to rely on the compiler to vectorize the scalar code, but the result is not efficient and can be improved by manually writing SIMD code.
It consists in a set of portable linear algebra routines and functions written in C. The chosen way to do it is to solve systems by batch, and parallelizing along matrices instead of inside one single factorization. Our approach is similar to Spiral [11] or ATLAS [12] : we compare many different implementations of the same algorithm to keep the best one for each architecture.
We expose first the Cholesky algorithm in section II. Then we explain the transformations we made to improve the performance for tiny matrices in section III. We discuss about the precision and the accuracy of the square root and how use these considerations to improve our implementation in section IV. And finally, we present the result of the benchmarks in section V.
II. CHOLESKY ALGORITHM
The whole resolution is composed of 3 steps: the Cholesky factorization (also known as decomposition), the forward substitution and the backward substitution. The substitution steps are grouped together.
A. Cholesky Factorization
The Cholesky factorization is a linear algebra algorithm used to express a symmetric positive-definite matrix as the product of a triangular matrix with its transposed matrix: A = L · L T (algorithm 1). equation is equivalent to L · L T · X = R. Triangular systems are easy to solve using the substitution algorithm.
The equation can be written like this: L · Y = R with Y = L T · X. So we need to first solve L · Y = R (forward substitution) and then to solve L T · X = Y (backward substitution). Those two steps are group together to entirely solve a Cholesky factorized system (algorithm 2). Like the factorization, substitutions are naturally in-place algorithms: R, Y and X can be the same storage. 
C. Batch
With small matrices, parallelization is not efficient as there is no long dimension. For instance, a 3-iteration loop cannot be efficiently vectorized.
The idea is to add one extra and long dimension to compute the Cholesky factorization of a large set of matrices instead of one. We can now parallelize along this dimension with both vectorization and multithreading. The principle is to have a for-loop iterating over the matrices, and within this loop, compute the factorization of the matrix. This is also the approach used in [15] .
III. TRANSFORMATIONS
Improving the performance of software requires transformations of the code, and especially High Level Transforms. For Cholesky, we made the following transforms:
• High Level Transforms: memory layout [16] and fast square root (the latter is detailed in section IV), • loop transforms (loop unwinding [17] , loop unrolling and unroll&jam), • Architectural transforms: SIMDization.
With all these transformations, the number of possible versions is high. More specially, loop unwinding generates different versions for each matrix size. To facilitate this, the code is automatically generated for all transformations and all sizes from 3×3 up to 16×16 with the template engine Jinja2 [18] in Python. It generates C99 code with the restrict keyword which helps the compiler to vectorize. This could be replaced by a C++ template metaprogram like in [19] .
The use of jinja2 instead of more common metaprogrammation methods allows us to have full access and control over the generated code. It is really important for some people like in embedded systems to have access to the source code before the compilation. They can understand more easily some bugs which are hard to track on black box systems.
A. Memory Layout Transform
The memory layout transform is the first transform to address as the other ones rely on it. The most important aspect of the memory layout is the battle between AoS (Array of Structures) and SoA (Structure of arrays) [20] (Figure 1 ).
The AoS memory layout is the natural way to store arrays of objects in C. It consists in putting full objects one after the other. The code to access the x member of the i th element of an array A looks like this: A[i].x. This memory layout uses only one active pointer and reduces the systematic cache eviction. The systematic cache eviction appears when multiple pointers share the same least significant bits and the cache associativity is not high enough to cache them all. But this memory layout is difficult to vectorize because the "xs" are not contiguous in memory.
The SoA memory layout addresses the vectorization problem. The idea is to have one array per member, and group them inside a structure. The access is written: A.x[i]. This memory layout is the default one in Fortran 77. It helps the vectorization of the code. But it uses as many active pointers as the number of members of the objects and can increase the number of systematic cache eviction when the number of active pointers is higher than the cache associativity.
The AoSoA memory layout (Array of SoA, also known as Hybrid SoA) tries to combine the advantages of AoS and SoA for SIMD. The idea is to have a SoA memory layout of fixed size, and packing these structures into an array. Thus, it gives the same opportunity to vectorize as with SoA, but it keeps only one active pointer like in AoS. A typical value for the size of the SoA part is the SIMD register cardinal (or a small multiple of it). This access scheme can be simplified when iterating over such objects. The loop over the elements is split into two nested loops: one iterating over the AoS part, and one iterating over the SoA part. It is harder to write, especially to deal with boundaries.
The SoA memory layout was not used in this paper, and the term SoA will refer to the hybrid memory layout for the next part of this paper.
AoS:
x0 y0 z0 x1 y1 z1 x2 y2 z2 x3 y3 z3 x4 y4 z4 . . .
SoA:
x0 x1 x2 x3 . . . y0 y1 y2 y3 . . . z0 z1 z2 z3 . . .
AoSoA:
x0 x1 x2 y0 y1 y2 z0 z1 z2 x3 x4 x5 y3 y4 y5 . . .
Fig. 1: Memory Layouts
The alignment of the data is also very important. The hardware has some requirements on the addresses of the elements. It is easier (if not mandatory) for the CPU to load a register from memory when the address is a multiple of the register size. In scalar code, float loads must be aligned with 4 bytes. This is done by the compiler automatically. However, vector registers are larger. The load address must be a multiple of the size of the SIMD register: 16 for SSE, 32 for AVX and 64 for AVX512. Aligned memory allocation should be enforced by specific functions like posix_memalign, _mm_malloc or aligned_alloc (in C11). One might also want to align data with the cache size (usually 64 bytes). This may improve cache hits by avoiding data being split into multiple cache lines when they fit within one cache line and avoid false sharing between threads.
The way data are stored and accessed is also important. The usual way to deal with multidimensional arrays in C is to linearize the addresses. For example, a N ×M 2D array will be allocated like a 1D array with N M elements.
The knowledge of the actual size including the padding is required to access elements. Iliffe vectors [21] allow to access multi-dimensional arrays more easily. They consist of a 1D array plus an array of pointers to the rows. A(i, j) is accessed through an Iliffe vector with A[i][j] (see Figure 2 ). It allows to store arrays of variable length rows like triangular matrices or padded/shifted arrays and remains completely transparent to the user at they will always access A(i, j) with A[i][j] whatever is used internally, as long as A(i, j) is mathematically correct. It is extensible to higher dimensions.
With this memory layout, it is still possible to get the address of the data beginning, and use it like a linearized array. The allocation of an Iliffe vector needs extra space for the array of pointers. It also requires an initialization of the pointers before any use. As we work with pre-allocated arrays, the initialization of the pointers is not part of the benchmarks.
Accessing an Iliffe vector requires to dereference multiple pointers. It is possible to access the elements of an Iliffe vector like a linearized array. Keeping the last accessed position allows to avoid the computation of the new linearized address. Indeed, the new address can be obtained by moving the pointer from the previous address. 
B. Loop unwinding
Loop unwinding is the special case of loop unrolling where the loop is entirely unrolled. It is possible to do it here as the matrices are tiny (see algorithms 3 and 4) . This technique has several advantages:
• it avoids branching, • it allows to keep all temporary results into registers (scalarization), • it helps out-of-order processors to efficiently reschedule instructions,
This transform is very important as the algorithm is memory bound. One can see that the arithmetic intensity of the scalarized version is higher, and even higher when the steps are fused together. When the factorization and the substitution are merged together and scalarized, even more memory accesses can be removed: storing L (lines 18-21 of algorithm 3) and loading L again (lines 2-5 of algorithm 4) are unnecessary as L registers are still available. This leads to algorithm 5 and reduces the amount of memory accesses (Table Ib) .
The register pressure is higher and the compiler may generate spill code to temporarily store variables into memory.
To facilitate the unwinding for multiple sizes, the code is automatically generated for any given sizes by Jinja2.
Algorithm 3: Factorization unwinded for 4×4 matrices input : A // 4×4 symmetric positive-definite matrix output : L // 4×4 lower triangular matrix 
The Cholesky factorization of n×n matrices involves n square roots + n divisions for a total of ∼n 3 /3 floating-point operations (see Table I ). The time before the execution of two data independent instructions (also known as throughput) is smaller than the latency. The latency of pipelined instructions Algorithm 5: Cholesky factorization + substitution unwinded and scalarized for 4×4 matrices input : A // 4×4 symmetric positive-definite matrix input : R // vector of size 4 output :
can be hidden by executing another instruction in the pipeline without any data-dependence with the previous one. The ipc (instructions per cycle) is then limited by the throughput 1 of the instruction and not by its latency. If the throughput is less than 1, several instructions can be launched during the same cycle.
As current processors are Out-of-Order, they can reschedule instructions on-the-fly in order to execute in the pipeline dataindependent instructions. The size of the rescheduling window is limited and the processor may not be able to reorder instructions efficiently. In order to help the processor to pipeline instructions, it is possible to unroll loops and to interleave instructions of data-independent loops (Unroll&Jam). Here, Unroll&Jam of factor 2, 4 and 8 is applied to the outer loop over the array of matrices.
This technique increases the register pressure with the order of unrolling k, the number of unrolled iterations. Unroll&jam of order k requires k times more local variables. Its efficiency is limited by the throughput of the unrolled loop instructions. II: SIMD instruction latencies and throughputs for single and double precision instructions on Haswell [22] latency/throughput 128-bit (SSE) 256-bit (AVX) · · ·_cvtps_pd() 2/1 5/ 1 · · ·_add_ps() 3/1 3/ 1 · · ·_mul_ps() 5/0.5 5/ 0.5 · · ·_rcp_ps() 5/1 7/ 2 · · ·_div_ps() 11/7 19/14 · · ·_rsqrt_ps() 5/1 7/ 2 · · ·_sqrt_ps() 11/7 19/14 · · ·_cvtpd_ps()
Unroll&jam is also generated by our Jinja2 templates.
IV. PRECISION AND ACCURACY
The Cholesky Algorithm requires n square roots and (n 2 + 3n)/2 divisions for a n×n matrix. But these arithmetic operations are slow, especially for double precision (Table II) and usually not fully pipelined. For example, divisions and square root require 7 cycles for single precision in SSE. The cycle penalty for these operations reaches 16 cycles for double precision in AVX. During these cycles, no other instructions can be executed in the same pipeline port, even with hyperthreading. Thus, square roots and divisions limit the overall Cholesky throughput.
As explained by Soderquist [23] , it is possible in hardware to compute them faster with less accuracy. That is why reciprocal functions are available: they are faster but have a lower accuracy: usually 12 bits for a 23-bit mantissa in single precision.
The accuracy is measured in ulp (Unit in Last Place). Given a floating-point number x, ulp(x) is the distance between x and the floating-point number that come just after x. For all normal floating-point numbers ulp(x) = ulp(x + ) iif x and x + have the same exponent ( log 2 (x) = log 2 (x + ) , power of 2 are the corner cases). In this case, one can omit which number ulp refer to.
A. Memorization of the reciprocal value
In the algorithm, a square root is needed to compute L(i, i). But L(i, i) is used in the algorithm only with divisions. The algorithm needs n 2 + 3n /2 of these divisions per n×n matrix.
Instead of computing x/L(i, i), one can compute x · L(i, i) −1 . It becomes obvious that one can store L(i, i) −1 and save several divisions. After this transformation, the algorithm needs only n divisions.
This transformation might affect the accuracy of the result. Indeed, x/y is rounded once (correct rounding as specified by IEEE 754). But x · y −1 requires two successive roundings: one to compute the reciprocal y −1 = 1/y and the other one to compute the product x · y −1 . Thus, x · y −1 has an error < 1 ulp instead of < 0.5 ulp when computed directly.
B. Fast square root reciprocal estimation
The algorithm performs a division by a square root and therefore needs to compute f (x) = 1/ √ x. There are some ways to compute an estimation of this function depending on the precision. 
Single Precision: Most of current CPUs have a specific instruction to compute an estimation of the square root reciprocal in single precision. In fact, some ISA (Instruction Set Architecture) like Neon and Altivec VMX do not have SIMD instruction for the square root and the division, but do have an instruction for a square root reciprocal estimation. On x86, ARM and Power, this instruction is as fast as the multiplication (Table II) and gives an estimation with 12-bit accuracy (Table III) . Unlike regular square root and division, this instruction is fully pipelined (throughput = 1) and thus avoids pipeline stall. 
2) Double Precision: On most CPUs, there is no such instruction for double precision (only AVX512F has such). Therefore, we need another way to get an estimate. A possibility is to convert two SIMD double precision registers into a single single precision register and execute the single precision instruction to get a 12-bit accurate estimation and convert the 
register back into two double precision registers (algorithm 6). This technique has a constraint: it can be used only if the input is within the range of single precision floating-point 2 −126 , 2 127 ( ∼10 −38 , ∼10 38 ). Cholesky algorithm needs to compute the square root of a difference, so if this difference is very close to 0, catastrophic cancellation may occur, and the value may not be in the range of single precision float. This issue is not handled by this approach.
3) Bit Trick for Double Precision:
The square root reciprocal can be estimated directly in double precision taking benefits from the IEEE 754 floating-point format (algorithm 7). This is mathematically explained by Lomont in [24] . It was initially attributed to John Carmack in the Quake III Arena source code. A quick explanation could be like this: the right bit shift allows to divide by 2 the exponent (effect of the square root) and the subtraction allows to take the opposite of the exponent (effect of the reciprocal). The rest of the magic constant 0x5fe6eb50c7b537a9 is set up to minimize the error of the result as explained by Lomont. This technique is really fast (especially when integer and floating-point operations can be executed in parallel by the CPU) but inaccurate ( ∼ 0.0342128 ∼ 1 29 ).
C. Accuracy recovering
Depending on the application, the previous techniques might not be accurate enough (especially the bit trick in double precision subsubsection IV-B3). It is possible to recover the accuracy with the Newton-Rahpson method or the Householder's method (a higher order generalization). It is worth noting that if one does not require full accuracy, they can reduce the number of iterations done in order to be faster. output: r // corrected estimation 1 α ← vrsqrtsq_f32(r · x,r) 2 r ←r · α // corrected approximation estimation x n of a root of f , one can find a more accurate estimation x n+1 with the following formula:
This method can be used to refine an estimation of a square root reciprocal. To compute the square root reciprocal of a, one can find the root of f (x) = 1 x 2 − a. Applying (1) to this function gives the following equation:
With the equation (2), the iteration needs 4 multiplications (algorithm 8). But one can see that the multiplication by 1 /2 can be moved inside the brackets and the product 1 /2 · a computed once before any iteration. After this, it requires a multiplication at initialization plus 3 multiplications and one subtraction per iteration. The subtraction can be fused into a Fused Multiply Add (FMA) if supported by the architecture.
The Newton-Raphson method has a quadratic convergence. This means that the number of correct bits doubles at each iteration ( becomes 2 ). With the single precision estimate, one iteration is needed and allows to recover almost full accuracy with a mean error < 0.5 ulp and a max error < 4.7 ulp (∼2.5 bits). The maximum and mean relative error are computed by computing relative error with algorithm 9 exhaustively on all normal single precision floats. For double precision, results are reported in Table IV. The Neon ISA supports an instruction to help Newton-Raphson method for f (x) = 1
x 2 − a (algorithm 10). The instruction vrsqrtsq_f32 is as fast as a multiplication and All additions are fused saves 2 multiplications and 1 subtraction (or 1 FMA and 1 multiplication). It is interesting not only because it requires fewer instructions, but also because it saves the need for two constants (0.5 and 3). 
As for Newton-Raphson, we need to find the zero of f (x) = 1 x 2 − a. Stopping at order 3 gives the following iteration:
We can notice that in (4), in brackets is a polynomial of x n 2 a. This leads to:
Horner scheme allows to compute a scalar polynomial with the least number of multiplication [26] . With Horner scheme, evaluating a n degree polynomial requires n multiplications and n additions. Moreover, these operations can be fused together. Thus, on CPUs with FMA, it can be computed with FMAs only. On current CPUs, FMA instructions are as fast as a single multiplication. This allows to write algorithm 11 which is the order 3 Householder's method for the square root reciprocal efficiently using only 3 multiplications and 3 FMAs. It is one multiplication less than using the Newton-Raphson method.
One can do the same calculations for other orders. It is then possible to see, for a given source accuracy and a given target precision, which order allows to compute full accuracy with a minimum number of operations. We computed the orders up to 5, and see which order requires the lowest number of operations. Results are reported in Table V .
V. BENCHMARKS

A. Benchmark protocol
In order to evaluate the impact of the transforms, we used exhaustive benchmarks.
The algorithms were benchmarked on eight machines whose specifications are provided in Table VI. The tested functions are the following: The function substitute1 has been tested as it is the only one to be available in the MKL in batch mode.
The time is measured with _rdtsc() which provides reliable time measures in cycles. Indeed, on current CPUs, the timestamp counter is normalized around the nominal frequency of the processor and is independent from any frequency changes.
In order to have reliable measures, we run several times each function and take the minimum execution time measured. Then, we divide the time by the number of matrices to have a time per matrix.
The code has been compiled for Intel architectures with Intel icc v17.0 with the following options: -std=c99 -O3 -vec -ansi-alias and for other architectures with gcc 6.2 with the following options: -std=c99 -O3 -ffast-math -ftree-vectorize -fstrict-aliasing.
The plots use the following conventions: • "legacy" tag stands for the version without the reciprocal storing (base version). • "fast" tag stands for the use of fast square root reciprocal. • "fastest" tag stands for the use of fast square root reciprocal estimation without any accuracy recovering. • "×k" tags stand for the order of unrolling of the outer loop (unroll&jam)
B. Results
We focus our explanations on solve and the HSW-i7 machine. All the machines have similar behaviors unless explicitly specified otherwise. ARM code has not been compiled for ARM 64 bits (aarch64) and thus the double precision version is not present.
We first present the impact on performance of the batch size. We consider the performance of all the functions. After, we show the differences between single and double precision.
Then we detail what transformations improve the performance and how. We exhibit with more details the impact of unrolling. And we show the scalability of our solution. Finally, we show summary results on all machines. 1) Batch size performance: Figure 3 shows important results for the understanding of our function performance. It shows the performance of factorize, substitute and solve on HSW-i7 for 3×3 matrices. If we look at these charts, we can notice similar behaviors for the 3 functions: the performance drops of a factor 2-3 for every version. It happens when data do not fit anymore in caches: this is a cache overflow. On the factorize chart (Figure 3a) , one can notice 3 intervals of batch size for 3×3 matrices on HSW-i7:
• [400, 1000]: this is the L1 cache overflow. As the L1 cache is 32 KB, we cannot store data for more than 546 systems. • [3000, 8000]: this is the L2 cache overflow. As the L2 cache is 256 KB, we cannot store data for more than 4,369 systems. • [10 5 , 6·10 5 ]: this is the L3 cache overflow. As the L3 cache is 8 MB, we cannot store data for more than 139,810 systems. After that, the data has to be fetched from the main memory.
As we repeat several times the same function and take the minimum time, data are as much as possible within caches. If SIMD F32 SIMD fast F32 SIMD fastest F32 SIMD F64 SIMD fast F64 SIMD fastest F64 data fit within a cache, they may not be within it at the first execution: the cache is cold. But at the next execution, data will be in the cache: the cache warms up. But if data size is larger than the cache, the cache will be constantly overflowed by new data. At the next execution, the needed data will not be within the cache as they have been overridden by the extra data of the previous execution. If data are only a bit larger than the cache, then a part can remain within the cache and be reused the next time.
Basically, one can interpret the performance plot like this: if all the matrices fit within the L1 cache, the performance per matrix will be the performance on the plot before the L1 cache overflow. The performance at the right end is actually the performance when none of the matrices are in any caches, ie: they are in main memory only. The performance drops after the cache overflow because lower level caches are faster.
After the L3 cache overflow, the best versions have almost the same performances: they are limited by the memory bandwidth. In this case, the bandwidth of the factorize function after the last cache overflow is about 7.9 GB/s, which is the bandwidth of the machine external memory.
On every plot of Figure 3 , for the fast and fastest versions, the performance starts by increasing on the left. This is mainly due to the amortization of the overheads mainly due to SIMD.
2) Function performances: Figure 4 shows the performance of all 4 functions factorize, substitute, substitute1 and solve in single and double precision for 3×3 and 16×16 matrices. When we look at Figure 4a and 4d, we can see that, for 3×3 matrices, the scalar SoA unwinded version performs very well on substitute and substitute1 in both single and double precision, but is slower on other functions. The function substitute1 provides the higher Gflops: the number of load/store is the lowest as L is kept in registers. In average, AVX version is twice faster than SSE version. Double precision has a similar shape than single precision.
The MKL is very slow. The reason is that it performs a lot of verification on input data, has many functions calls and has huge overheads. These overheads are required for speeding up the execution, but are not efficient for large matrices. However, for large matrices, these issues disappear and the MKL is extremely fast. Eigen has similar issues but is a bit faster on 3×3 matrices.
With 16×16 matrices, we can notice that all "slow" versions Fig. 6 : Speedups of the transformations for solve on HSW-i7 in single and double precision mono-core and multi-core are a faster. We can also see that the MKL is much better on substitute1: it is able to deal with batches, and overhead is reduced. However, the scalar SoA unwinded version is much slower: at this point, the compiler does not vectorize this code anymore.
3) float vs double: When we compare 32-bit single precision (float) performance with 64-bit double precision (double) performance in Figure 3 (3a, 3b, 3c vs 3d, 3e, 3f), we can see that the plots are similar. There is two main differences. First, the double version is slower than the float version. It can be easily explained by SIMD cardinal: a float SIMD instruction is able to compute twice more data as the double one in the same time or less. Quantitative comparison will be addressed later in the paper. The second difference is about cache overflow. On the double version, cache overflows happen twice earlier: the size of double is twice the size of float, but the cache remains the same size, so the number of double that can be in the cache is half.
On plots of Figure 5 , we can see that the speedup of float over double is higher than ×2: between ×3 and ×4 for both "non-fast" and "fast" on HSW-i7 (5a), SKL-i7 (5b), HSW Xeon (5c) and KNC (5d). A factor of 2 is explained by the cardinal of SIMD registers. The extra speedup depends on unwinding SIMD fast SQRT unroll&jam Fig. 7 : Speedups of the transformations for solve on HSW Xeon in single and double precision mono-core and multicore which version is considered. For "non-fast" versions, IEEE 754 divisions and square roots are used. These instructions are slower for doubles than for floats (Table II) and compute half the number of elements. The time to compute a square root or a division per element is then more than twice the time in float.
For fast versions, no square root nor division instruction is used. However, a fast square root reciprocal estimate is computed, and then the accuracy is recovered with the Newton-Rahpson method or the Householder's method. These methods require more iterations in double than in float because there is more precision to recover. So there is more computation to do in double precision. This also explains why the speedup "fast" over "non-fast" is higher in single precision than in double precision.
On KNC and KNL in single precision (Figure 5d and 5e), "fast" and "fastest" versions are completely identical (same code). These architectures have a square root reciprocal instruction that give full accuracy in single precision (Table III) , so there is no need for a Newton-Raphson iteration. Figure 6 gives the speedup of each transformation in the following order: unwinding, SoA + SIMD, fast square root, unroll&jam. The speedup of a transformation is dependent of the transformations already applied: the order is important.
If we look at the speedups on HSW-i7 mono-thread single precision (Figure 6a ), we can see that unwinding the inner loops improves the performance well: from ×2 to ×3. The impact of unwinding decreases when the size of the matrix increases: the register pressure is higher. Looking at the assembly, we can actually see that the compiler generates spill code for large matrices. Spill code consists in moving values from register to memory to free a register, and moving back SIMD gives a sub-linear speedup: from ×3.2 to ×6. In fact, SIMD instructions cannot be fully efficient on this function without fast square root (see subsection IV-B). With further analysis, we can see that the speedup of SIMD + fast square root is almost constant around ×6. The impact of the fast square root decreases as their number become negligible compared to the other floating-point operations. SIMD has more place to be efficient. For small matrices, unroll&jam allows to get the last part of the expected SIMD speedup. SIMD + fast square root + unroll&jam: from ×6.5 to ×9. Unroll&jam loses its efficiency for larger matrices: the register pressure is higher.
If we look at the multithreaded version (Figure 6b ), results are similar. We can notice that the speedup of fast SQRT + unroll&jam is similar on both single thread and multithread charts, but the fast square root gives more speedup. This is especially visible on the double precision version. This is due to the hyperthreading that has an effect similar to unroll&jam allowing to use free functional units of a core for another thread by interleaving instructions within the processor pipeline.
The doube precision versions are not exactly the same (Figure 6c ). The speedup of the unwinding/scalarization transformation does not decrease with the size of the matrix. In double precision, the required bandwidth is higher, so saving memory loads and stores has more impact. One can expect this speedup to decrease with even larger matrices. Another difference is the impact of the fast square root + unroll&jam. On HSW-i7, this set of transformations gives a higher speedup on double precision than in single precision. This is due to the latency of the square root and division instructions, and the throughput of the fast square root reciprocal. On this machine, square root and division instructions have a high latency and without unroll&jam, the performance of this very code is limited by the instruction latencies. With unroll&jam, the performance is limited by the instruction throughputs and fast square root reciprocal computation is highly pipelined. The double precision square root and division instructions have a much higher latency on this machine, while the fast square root reciprocal throughput in double precision is still good. On SKL-i7, this effect is not visible as the square root and division instructions have a lower latency than on HSW-i7.
If we look at HSW Xeon (Figure 7) we see similar results. Figure 8 shows the performance of solve for different AVX versions.
5) Impact of unrolling:
Without any unrolling, all versions except "legacy" have similar performance: performance seems to be limited by the latency between data-dependent instructions. Unwinding can help Out-of-Order engine and thus reduce data-dependency.
For 3×3 matrices (Figure 8a and 8d) , the performance of the "non-fast" and "legacy" versions are limited by the square Fig. 11 : Multi-core performance of solve in Gflops on tested machines root and division instruction throughput. The performance has reached a limit and cannot be improved further this limitation, even with unrolling: both unwinding and unroll&jam are inefficient in this case. The "legacy" version is more limited because it requires more divisions. These versions are even more limited in double precision as square root and division instructions are even slower.
For "fast" versions, both unrolling are efficient. Unroll&jam achieves a ×3 speedup on regular code and ×1.5 speedup on unwinded code. This transformation reduces pipeline stalls between data-dependent instructions (subsection III-C). We can see that unroll&jam is less efficient when the code is already unwinded but keeps improving the performance. Register pressure is higher when unrolling (unwinding or unroll&jam).
The unwinded "fastest" versions give an important benefit especially in double precision. By removing the accuracy recovering instructions, we save a lot of instructions (IV-C, Accuracy recovering). As the number of instructions to recover double precision is higher compared to single precision, the speedup of "fastest" over "fast" is higher in double precision.
For 16×16 matrices (Figure 8c and 8f) , the performance of all versions are leveled. The transformations we have done are good for small matrices, but become less efficient for bigger matrices.
• unwinding: register pressure becomes higher (for a n×n matrix, the code needs O n 3 registers). • unroll&jam: it allows to hide latencies, but with larger matrices, computations are more independent from each other. • fast square root: the proportion of square roots and divisions decreases with the size of the matrix: these operations are diluted among the others.
For such large matrices, unroll&jam slows down the code when it is already unwinded because of the register pressure. 6) multithread scaling: Figure 9 shows the efficiency of the multithreading for the best SIMD version of solve. The efficiency is defined as the speedup of the multi-core code over the single core (hyperthreaded) code divided by the number of cores.
The scaling is strong (between 80% and 100%) for all multi-core machines: HSW-i7, SKL-i7, HSW Xeon, TX1 and Power 8. On manycore machines (KNC and KNL), the scaling is lower. On KNC, the scaling is strong for small matrices and gets lower for larger matrices, especially in double precision. On KNL, the scaling is lower: ∼60% for all sizes. A memory bottleneck is suspected for both manycore architectures. 7) Summary: Figure 10 and Figure 11 show the performance of our best SIMD version against scalar versions and library versions (Eigen and MKL) for all architecture in both monocore and OPENMP. The MKL is not present on the multi-core 
results as its heuristic limits multithreading for tiny problems. So this is not possible to compare our implementation with the MKL on multithreaded code. Both Eigen and the MKL are slower than our scalar AoS code.
We can see that for Intel architecture, scalar SoA is good for small matrices, but becomes slow after 9×9 matrices in monocore. This is due to the compiler icc: it is able to vectorize the unwinded code up to 9×9. For larger matrices, it stops vectorizing. The threshold is after for the multithreaded versions, probably due to a change in the compiler heuristic. On other machines, gcc was used: gcc is unable to vectorize our scalar code, and there is no way to enforce it to vectorize, unlike icc. Writing SIMD code is mandatory to achieve efficient code as compilers are not always able to vectorize scalar code, even while enforcing them.
The speedup of the best version compared to the scalar AoS version for all tested architectures is in Table VII .
We achieve a high overall speedup for 3×3 up to 16×16 matrices compared to the basic scalar version. On HSW Xeon, we reach a ×28 speedup on single precision and a ×14 speedup on double precision. On Rasp3, we reach a ×15 speedup on single precision. And on Power 8, we reach a ×38 speedup on single precision and a ×16 speedup on double precision. The code scales also very well with a mutlithread efficiency above 80% on most of the machines.
CONCLUSION
In this paper, we have presented an efficient SIMD implementation for tiny matrices ( 16×16) of the Cholesky algorithm, because in some fields they are very used and because State-of-the-Art libraries are inefficient for such tiny matrices.
Moreover, on some architectures like ARM Cortex or IBM Power, the existing optimizing compilers are unable to vectorize this kind of code. On other architectures like x86, some compilers are able to vectorize but not efficiently and not for all sizes. Hand-written SIMD code is thus mandatory to fully benefit from architecture.
To reach a high level of performance, the proposed implementation combines low level transformations (loop unrolling and loop unwinding), hardware optimizations (SIMD and multi-core) and High Level Transforms (fast square root and memory layout). We achieve a high overall speedup outperforming existing codes for SIMD CPU architectures on tiny matrices on both single precision and double precision: a speedup of ×30 on a high-end Intel Xeon workstation, ×15 on a ARM Cortex embedded processor and ×38 on a IBM Power 8 HPC cluster node.
