Abstract—Many linear algebra libraries, such as Intel MKL, Magma or Eigen, provide fast Cholesky factorization. These libraries are designed for big matrices but perform slowly on small ones. Even though State-of-the-Art studies have begun to take an interest in small matrices, they usually consider matrices of a few hundred rows. Fields like Computer Vision or High Energy Physics use tiny matrices. In this paper we show that it is possible to speed up the Cholesky factorization for tiny matrices by grouping them in batches and using highly specialized code. We provide High Level Transforms that accelerate the factorization on current Intel SIMD architectures (SSE, AVX2, KNC, AVX512). Combined with SIMD, these transforms give a speedup from x13 to x31 for the whole resolution compared to the naive code on a single-core AVX2 machine, and a speedup from x15 to x33 with multithreading compared to the multithreaded naive code.
I. INTRODUCTION
Linear algebra is everywhere, especially in scientific computation. There are many fast linear algebra libraries like MKL, Magma or Eigen. However, these libraries are usually designed with big matrices in mind. Experience shows that they are not suited to tiny matrices and perform slowly on them.
Some fields like Computer Vision [14] or High Energy Physics often use tiny matrices and require linear algebra different from classical High Performance Computing. Matrices of up to a few dozen rows are usual, for example in the Kalman Filter [8] (in [4], a Kalman Filter with 4 dimensions is used). People have begun to take an interest in small matrices [15], [17], but these studies focus on matrices much larger than the SIMD register size.
The goal of this paper is to present an optimized implementation of the Cholesky factorization for tiny matrices. It consists of a set of portable linear algebra routines and functions written in C. The chosen approach is to solve systems by batch, parallelizing across matrices instead of inside one single factorization. Our approach is similar to Spiral [16] or ATLAS [3]: we compare many different implementations of the same algorithm and keep the best one. This paper presents the Cholesky factorization by batch. We first expose the Cholesky algorithm. Then we explain the transforms we made to improve the performance for tiny matrices. Finally, we present the results of the benchmarks.
II. CHOLESKY ALGORITHM
The whole resolution is composed of 3 steps: the Cholesky factorization (aka decomposition), the forward substitution and the backward substitution. The substitution steps are grouped together.
A. Cholesky Factorization
The Cholesky factorization is a linear algebra algorithm used to express a symmetric positive-definite matrix as the product of a lower triangular matrix with its transpose: A = L·L^T (Algorithm 1).
The Cholesky factorization of an nxn matrix has a complexity, in terms of floating-point operations, of n^3/3, which is half that of the LU factorization (2n^3/3), and it is numerically more stable [9], [10]. This algorithm is naturally in-place as every input element is accessed only once and before the associated element of the output is written: L and A can share the same storage.
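For reference, a minimal scalar sketch of this factorization is given below; the function name, row-major storage and use of sqrtf are illustrative assumptions, not the paper's actual code.

#include <math.h>

/* Scalar sketch of Algorithm 1 (row-major, in-place capable).
   A and L may alias: A[i*n+j] is read before L[i*n+j] is written. */
void cholesky_factorize(int n, float* A, float* L) {
    for (int j = 0; j < n; j++) {
        float s = A[j*n + j];
        for (int k = 0; k < j; k++)
            s -= L[j*n + k] * L[j*n + k];
        L[j*n + j] = sqrtf(s);
        for (int i = j + 1; i < n; i++) {
            float t = A[i*n + j];
            for (int k = 0; k < j; k++)
                t -= L[i*n + k] * L[j*n + k];
            L[i*n + j] = t / L[j*n + j];
        }
    }
}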
B. Substitution
Once we have the factorized form of A, we can easily solve systems of the form L·L^T·X = R with a forward substitution followed by a backward substitution.
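A minimal scalar sketch of the two substitutions, under the same illustrative conventions as above:

/* Sketch of the two triangular solves for L·L^T·x = r (row-major). */
void cholesky_substitute(int n, const float* L, const float* r, float* x) {
    float y[16];                          /* assumes n <= 16 (tiny matrices) */
    /* Forward substitution: L·y = r */
    for (int i = 0; i < n; i++) {
        float s = r[i];
        for (int k = 0; k < i; k++)
            s -= L[i*n + k] * y[k];
        y[i] = s / L[i*n + i];
    }
    /* Backward substitution: L^T·x = y */
    for (int i = n - 1; i >= 0; i--) {
        float s = y[i];
        for (int k = i + 1; k < n; k++)
            s -= L[k*n + i] * x[k];
        x[i] = s / L[i*n + i];
    }
}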
With small matrices, all the dimensions are small, and parallelization is not efficient on small dimensions: a 3-iteration loop cannot be efficiently vectorized.
The idea is to add one extra, long dimension: compute the Cholesky factorization of a large number of matrices instead of a single one.
Our problem now has a long dimension which can be parallelized with both vectorization and multithreading. The principle is to have a for loop iterating over the matrices and, within this loop, to compute the factorization of each matrix. This is also the approach used in [5], [6].
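A sketch of this batched structure, assuming the scalar kernels sketched above and a contiguous storage of the N systems; in the real code the loop body is vectorized so that one iteration processes several matrices at once.

/* Batched solve: the long dimension (the batch) is the one that is
   multithreaded and vectorized. */
void solve_batch(int n, long N, float* A, float* R, float* X) {
    #pragma omp parallel for             /* multithreading along the batch */
    for (long m = 0; m < N; m++) {       /* candidate loop for SIMDization */
        cholesky_factorize(n, &A[m*n*n], &A[m*n*n]);          /* in-place  */
        cholesky_substitute(n, &A[m*n*n], &R[m*n], &X[m*n]);
    }
}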
III. TRANSFORMATIONS
Improving the performance of software requires transformations of the code, and especially High Level Transforms. For Cholesky, we made the following transforms:
• memory layout transform [2],
• loop transforms (loop unwinding [13], loop unrolling and unroll&jam),
• architectural transforms: SIMDization + fast square root.
A. Memory Layout Transform
The memory layout is the first transform considered. The most important aspect of the memory layout is the battle between AoS (Array of Structures) and SoA (Structure of Arrays) [1] (Figure 1).
The AoS memory layout is the natural way to store arrays of objects in C. It consists of putting full objects one after the other. The code accessing the x member of the i-th element of an array A looks like: A[i].x. This memory layout uses only one active pointer and reduces systematic cache evictions. A systematic cache eviction appears when multiple pointers share the same least significant bits and the cache associativity is not high enough to cache them all. But this memory layout is difficult to vectorize because the "x"s are not contiguous in memory.
The SoA memory layout addresses the vectorization problem. The idea is to have one array per member, and to group these arrays inside a structure. The access is written: A.x[i]. This memory layout is the default one in Fortran 77. It helps the vectorization of the code. But it uses as many active pointers as there are members in the object and can increase the number of systematic cache evictions when the number of active pointers is higher than the cache associativity.
The AoSoA memory layout (aka Hybrid SoA) tries to combine the advantages of AoS and SoA. The idea is to have SoA blocks of fixed size and to pack these blocks into an array. Thus, it gives the same opportunity to vectorize as SoA, but it keeps only one active pointer like AoS. A typical value for the size of the SoA part is the SIMD register cardinal (or a small multiple of it). The access scheme can be simplified when iterating over such objects: the loop over the elements is split into two nested loops, one iterating over the AoS part and one iterating over the SoA part. It is harder to write, especially to deal with boundaries.
The plain SoA memory layout was not used in this paper, and the term SoA will refer to the hybrid memory layout in the remainder of this paper.
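The three layouts can be sketched in C as follows; the point structure and the CARD value (SIMD register cardinal) are illustrative.

#define CARD 8          /* SIMD cardinal for AVX (8 floats), illustrative */

/* AoS: one structure per point, objects stored contiguously.            */
typedef struct { float x, y, z; } point_aos;      /* access: A[i].x      */

/* SoA: one array per member, grouped in a structure.                    */
typedef struct { float *x, *y, *z; } points_soa;  /* access: A.x[i]      */

/* AoSoA (hybrid SoA): fixed-size SoA blocks packed into an array.       */
typedef struct { float x[CARD], y[CARD], z[CARD]; } point_block;
/* access: A[i / CARD].x[i % CARD], usually split into two nested loops  */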
Fig. 1. AoS memory layout: x0 y0 z0 | x1 y1 z1 | x2 y2 z2 | x3 y3 z3 | x4 y4 z4 | ...

The alignment of the data is also very important. The hardware has some requirements on the addresses of the elements. It is easier (if not mandatory) for the CPU to load a register from memory when the address is a multiple of the register size. In scalar code, float loads must be aligned on 4 bytes. This is done automatically by the compiler. However, vector registers are larger: the load address must be a multiple of the size of the SIMD register, that is 16 bytes for SSE, 32 bytes for AVX and 64 bytes for AVX512. Aligned memory allocation should be enforced by specific functions like posix_memalign or _mm_malloc. One might also want to align data with the cache line size (usually 64 bytes). This may improve cache hits by avoiding data being split across multiple cache lines when they fit within one, and it avoids false sharing between threads.
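A minimal sketch of such an aligned allocation with posix_memalign; the 64-byte alignment targets both the cache line and the AVX512 register size, and the function name is illustrative.

#include <stdlib.h>

/* Allocate a batch of N contiguous n×n matrices on a 64-byte boundary. */
float* alloc_matrices(size_t N, size_t n) {
    void* p = NULL;
    if (posix_memalign(&p, 64, N * n * n * sizeof(float)) != 0)
        return NULL;
    return (float*)p;            /* release with free(p) */
}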
The way data are stored and accessed is also important. The usual way to deal with multidimensional arrays in C is to linearize the addresses. For example, an NxM 2D array will be allocated like a 1D array with N·M elements, and A(i, j) is accessed with A[i*M + j].
The knowledge of the actual size, including the padding, is required to access elements. Iliffe vectors [11] allow accessing multi-dimensional arrays more easily. They consist of a 1D data array plus an array of pointers to the rows; A(i, j) is accessed through an Iliffe vector with A[i][j] (see Figure 2). This layout makes it easy to store arrays with rows of variable length, like triangular matrices or padded/shifted arrays, and it is easily extensible to higher dimensions.
With this memory layout, it is still possible to get the address of the beginning of the data and use it like a linearized array. The allocation of an Iliffe vector needs extra space for the array of pointers. It also requires an initialization of the pointers before any use. As we work with pre-allocated arrays, the initialization of the pointers is not part of the benchmarks.
Accessing an Iliffe vector requires dereferencing multiple pointers. It is also possible to access the elements of an Iliffe vector like a linearized array. Keeping the last accessed position avoids recomputing the full linearized address: the new address can be obtained by moving the pointer from the previous address.
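A minimal sketch of an Iliffe vector allocation for an NxM array; padding and alignment are omitted for brevity and the function name is illustrative.

#include <stdlib.h>

/* Iliffe vector: a 1D data array plus an array of row pointers,
   so that A[i][j] dereferences two pointers. */
float** iliffe_alloc(size_t N, size_t M) {
    float** rows = malloc(N * sizeof(float*));
    float*  data = malloc(N * M * sizeof(float));  /* could also be aligned */
    if (!rows || !data) { free(rows); free(data); return NULL; }
    for (size_t i = 0; i < N; i++)
        rows[i] = data + i * M;     /* shifted/padded rows also possible    */
    return rows;                    /* rows[0] is the linearized data array;
                                       free with free(rows[0]); free(rows); */
}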
B. Loop unwinding
It is possible to completely unroll the loops (see algorithms 3 and 4) as the matrices are tiny. This technique has several advantages:
• it avoids branching,
• it allows keeping all temporary results in registers (scalarization),
• it helps out-of-order processors to efficiently reschedule instructions,
• it can be combined with data prefetch.
When the factorization and the substitution are merged together, lines 18-21 of Algorithm 3 and lines 2-5 of Algorithm 4 disappear, reducing the amount of memory accesses (Table Ib). This transform is very important as the algorithm is memory bound. One can see that the arithmetic intensity of the scalarized version is higher, and even higher when the steps are fused together, leading to a version without intermediate memory accesses.
The register pressure is higher and the compiler may generate spill code to temporarily store variables in memory. It also makes the assembly code of the function bigger and may have a negative impact on the instruction cache.
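As an illustration, a fully unwound 3x3 factorization might look as follows; this is a scalar sketch, without the fused substitution and without the fast square root described later.

#include <math.h>

/* Fully unwound 3x3 factorization: every temporary lives in a register
   (scalarization), no inner loop, no branch. Row-major symmetric input. */
static inline void factorize_3x3(const float a[9], float l[9]) {
    float l00 = sqrtf(a[0]);
    float l10 = a[3] / l00;
    float l20 = a[6] / l00;
    float l11 = sqrtf(a[4] - l10 * l10);
    float l21 = (a[7] - l20 * l10) / l11;
    float l22 = sqrtf(a[8] - l20 * l20 - l21 * l21);
    l[0] = l00; l[3] = l10; l[4] = l11;
    l[6] = l20; l[7] = l21; l[8] = l22;
}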
The Cholesky factorization of an nxn matrix involves n square roots and n divisions for a total of ~n^3/3 floating-point operations (see Table I). The time between the launches of two data-independent instructions (the throughput) is smaller than the latency. The latency of pipelined instructions can be hidden by executing in the pipeline another instruction without any data dependence on the previous one. The IPC (instructions per cycle) is then limited by the throughput of the instruction and not by its latency. Note that the "throughput" term used in the Intel documentation is the inverse of the classical throughput: it is the number of cycles to wait between the launches of two consecutive instructions. If this throughput is less than 1, several instructions can be launched during the same cycle.
As current processors are out-of-order, they can reschedule instructions on-the-fly in order to execute data-independent instructions in the pipeline. The size of the rescheduling window is limited and the processor may not be able to reorder instructions efficiently. In order to help the processor pipeline instructions, it is possible to unroll loops and interleave the iterations (unroll&jam). The square root and the division (Table II) use the same execution unit and their throughput is shared between them. For example, a division can only be executed at least 7 cycles after a square root, and vice versa.
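A sketch of unroll&jam of order 2 on the batch loop, reusing the illustrative 3x3 kernel above; in practice the two bodies are merged so that the compiler can interleave the instructions of the two independent matrices.

/* Unroll&jam of order 2: two independent factorizations per iteration,
   so their data-independent instructions can fill the pipeline. */
void factorize_batch_x2(long N, const float* A, float* L) {
    long m = 0;
    for (; m + 1 < N; m += 2) {
        factorize_3x3(&A[m * 9],       &L[m * 9]);        /* matrix m     */
        factorize_3x3(&A[(m + 1) * 9], &L[(m + 1) * 9]);  /* matrix m + 1 */
        /* in the real unroll&jam version the two bodies are merged and
           their instructions interleaved                                */
    }
    for (; m < N; m++)                                     /* remainder    */
        factorize_3x3(&A[m * 9], &L[m * 9]);
}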
The factorization of a 3x3 matrix requires 3 square roots and 3 divisions. Consequently, the factorization in SSE takes at least 3 x 7 + 3 x 7 = 42 cycles for up to 4 matrices. The other 11 floating-point operations have a lower throughput and can be partially executed in parallel with the square roots and divisions. Thus, their execution time is negligible compared to the square roots and divisions.
The AVX square root and division have a throughput twice as long as their SSE counterparts. The factorization then takes at least 3 x 14 + 3 x 14 = 84 cycles for up to 8 matrices. In this case AVX is not faster than SSE.
These instructions can be replaced by faster and less accurate ones: rcp_ps() and rsqrt_ps(). They respectively compute an approximation of the reciprocal and an approximation of the square root reciprocal with a 12-bit accuracy. These instructions are highly pipelined and may improve the factorization speed. In our algorithm, only rsqrt_ps() is needed: sqrt(x) = x · (1/sqrt(x)). A lot of cycles are saved by using the fast instruction rsqrt_ps() instead of sqrt_ps() + div_ps().
We do not need to compute any other reciprocal, as the reciprocal of the square root is already obtained with the previous combination and is stored for later use. Every time the algorithm needs a division, a multiplication by the previously computed reciprocal is done instead.
It is less accurate than the regular instructions, but it can be enough for some uses. If needed, it is possible to recover accuracy [12]. The method consists in computing one Newton-Raphson iteration (Algorithm 5). The Newton-Raphson algorithm converges quadratically, which leads after one iteration to an accuracy of 0.46 ulp (Unit in the Last Place) on average over all normal finite positive floats (max: 4.7 ulp). Considering the latency only, it is actually slower than a regular square root, but it is much more pipelineable.
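A minimal SSE sketch of this scheme, combining _mm_rsqrt_ps with one Newton-Raphson iteration; the helper names are illustrative and do not come from the paper's code.

#include <immintrin.h>

/* 12-bit rsqrt estimate refined by one Newton-Raphson iteration:
   y' = y·(1.5 − 0.5·x·y·y). The refined y ≈ 1/sqrt(x) is kept so that
   later divisions become multiplications. */
static inline __m128 rsqrt_nr_ps(__m128 x) {
    const __m128 half = _mm_set1_ps(0.5f);
    const __m128 c1p5 = _mm_set1_ps(1.5f);
    __m128 y = _mm_rsqrt_ps(x);                        /* ~12-bit estimate */
    __m128 t = _mm_mul_ps(_mm_mul_ps(x, y), y);        /* x·y·y            */
    return _mm_mul_ps(y, _mm_sub_ps(c1p5, _mm_mul_ps(half, t)));
}

/* sqrt(x) is then obtained as x·(1/sqrt(x)), without sqrt_ps or div_ps. */
static inline __m128 sqrt_fast_ps(__m128 x) {
    return _mm_mul_ps(x, rsqrt_nr_ps(x));
}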
IV. BENCHMARKS

A. Benchmark protocol
In order to calculate the impact of the transforms, we used exhaustive benchmarks.
The algorithms were evaluated on 4 machines whose specifications are provided in Table III. The tested functions are the following:
• factorize: Cholesky factorization: A → L·L^T
• substitute: solve the 2 triangular systems: L·L^T·X = R
• substitute1: same as substitute, but with the same L for every R
• solve: solve the unfactorized system (factorize + substitute)

The function substitute1 has been tested as it is the only one available in the MKL in batch mode.
The time is measured with _rdtsc(), which provides reliable time measurements in cycles on current CPUs. In the plots, the version tags have the following meaning:
• the "unwinded" tag stands for inner loops unwinded (i.e., fully unrolled),
• the "fast" tag stands for the use of the fast square root reciprocal,
• the "xk" tags stand for the order of unrolling of the outer loop (unroll&jam).
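A minimal sketch of the measurement scheme, assuming the __rdtsc() intrinsic from x86intrin.h and a simple min-over-repetitions strategy; the function names are illustrative.

#include <stdint.h>
#include <x86intrin.h>

/* Run the measured function several times and keep the minimum cycle
   count given by the timestamp counter. */
uint64_t bench_min_cycles(void (*fn)(void), int repetitions) {
    uint64_t best = UINT64_MAX;
    for (int r = 0; r < repetitions; r++) {
        uint64_t t0 = __rdtsc();
        fn();
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best) best = t1 - t0;
    }
    return best;
}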
B. Results
We first present the monocore results and in a second part the multicore results with OPENMP.
We focus on the analysis of the 3x3 matrix factorization as it is the slowest. We will show that our analysis remains correct for larger matrices. 1) Monocore: Figure 3 shows important results for the understanding of our function performance. It shows the performance of factorize, substitute and solve on HSW for 3x3 matrices. Looking at these charts, we can notice similar behaviors for the 3 functions: the performance drops by a factor of 2-3 for every version. This happens when data no longer fit in the caches: this is a cache overflow. On the factorize plot (Figure 3a), one can notice 3 such intervals of batch size for 3x3 matrices on HSW:
• [400, 1000]: this is the L1 cache overflow. As the L1 cache is 32 KB, we cannot store data for more than 546 systems.
• [3000, 8000]: this is the L2 cache overflow. As the L2 cache is 256 KB, we cannot store data for more than 4369 systems.
• [3·10^5, 10^6]: this is the L3 cache overflow. As the L3 cache is 35 MB, we cannot store data for more than 611,669 systems. After that, the data has to be fetched from the main memory.
As we repeat the same function several times and take the minimum time, data stay within the caches as much as possible. If data fit within a cache, they may not be within it at the first execution, but they will be at the next one. But if the data size is larger than the cache, the cache will be constantly overflowed by new data. At the next execution, the needed data will not be within the cache as they have been overwritten by the extra data of the previous execution. If data are only a bit larger than the cache, then a part can remain within the cache and be reused the next time. Basically, one can interpret the performance plot like this: if all the matrices fit within the L1 cache, the performance per matrix will be the performance on the plot before the L1 cache overflow. The performance at the right end is actually the performance when none of the matrices is in any cache, i.e., they are in main memory only. The performance drops after each cache overflow because lower level caches are faster.
After the L3 cache overflow, the best versions have almost the same performance: they are limited by the memory bandwidth. In this case, the bandwidth of the factorize function after the last cache overflow is about 10 GB/s, which is the external memory bandwidth of our machine.
On every plot of Figure 3, for the fastest versions, the performance starts by increasing on the left. This is mainly due to the amortization of the overheads, in particular those related to SIMD.
The MKL is very slow. The reason is that it performs a lot of verification on input data, has many function calls and has huge overheads. These overheads are required for speeding up the execution, but are effective only for large matrices. For large matrices, the MKL is, of course, very fast and it would have been impossible to beat it.
Still on Figure 3, the scalar AoS versions are slow: they are not vectorized, unlike the other versions. The compiler is unable to vectorize them because of the memory layout, even with the -vec compiler option. It is possible to write SIMD code for AoS, but it requires scatter/gather instructions (or a way to emulate them).
When we look at Figure 4, we can see the results for the scalar SoA versions. The speedup given by the KNC plot (Figure 5d) is actually not representative as KNC is not adapted for scalar code. However, this plot allows us to see the impact of each transformation on the performance.
If we look at the speedups on HSW (Figure 5c), we can see that unwinding the inner loops improves the performance well: from x2 to x3.5. The impact of unwinding decreases when the size of the matrix increases: the register pressure is higher. SIMD gives a sub-linear speedup: from x3.2 to x5.5. In fact, SIMD instructions cannot be fully efficient on this function without the fast square root (see subsection III-D). With further analysis, we can see that the speedup of SIMD + fast square root is almost constant, around x6. The impact of the fast square root decreases as the number of square roots becomes negligible compared to the other floating-point operations, so SIMD has more room to be efficient. For small matrices, unroll&jam allows getting the last part of the expected SIMD speedup.
SIMD + fast square root + unroll&jam gives a speedup from x6.5 to x9. Unroll&jam loses its efficiency for larger matrices: the register pressure is higher. Figure 6 shows the performance of solve for different AVX versions. The performance of the "non-fast" versions is limited by the square roots and divisions around 9 Gflops (see subsection III-D). In this case, both kinds of unrolling (unwinding and unroll&jam) are not very efficient and cannot improve performance beyond this limitation. For the "fast" versions, both kinds of unrolling are efficient. Unroll&jam achieves a x3 speedup on regular code and a x1.5 speedup on unwinded code. We can see that unroll&jam is less efficient when the code is already unwinded but keeps improving the performance. Register pressure is higher when unrolling (unwinding or unroll&jam).
2) OPENMP: Figure 7 shows how our implementation scales with a higher number of cores. The scaling of the SIMD version is strong and, for some sizes, the code gets a super-linear speedup.
The MKL version, however, does not scale at all: it does not provide a batch function for the Cholesky factorization and it does not enable multithreading as the matrices are too small according to its heuristics. Again, the MKL is not adapted for tiny matrices, but scales very well for large matrices.
On KNC, the scaling is strong and, for some matrix sizes, it has a super-linear speedup (see Figure 7c). The scaling is actually not representative as it is compared to the mono-thread execution. Indeed, latencies on KNC are high and a single thread cannot use the whole core efficiently.
3) Extension to matrices up to 16x16: Figure 8 shows that the transformations studied for 3x3 matrices are also efficient for larger matrices. Speedups for the benchmarked machines are summarized in Table IV. Again, the speedup for KNC is not representative as it is compared to scalar code.
CONCLUSION
In this paper, we have presented a fast implementation of the well-known Cholesky algorithm for tiny matrices (up to 16x16). State-of-the-Art codes and libraries lack support for tiny matrices: existing libraries deal with large matrices (usually thousands of rows) and are usually slow for small matrices. For many people, a "small matrix" has a few hundred rows. But in Computer Vision or High Energy Physics, people need tiny matrices, for which scalar naive code used to be a rather good option. We have shown that the scalar naive code is much slower than the very specialized code we wrote.
Our approach is to compute the factorization by batch which enables efficient parallelization (both SIMD + OPENMP).
We achieve a high overall speedup: more than an order of magnitude compared to the scalar naive version in mono-thread. The speedup for 3x3 matrices on a 2x14-core Haswell Xeon is between x13 and x31. Our multithreaded implementation has almost the same speedup compared to the multithreaded naive code: our Haswell machine gets a speedup between x15 and x33. The speedup decreases as the reference version becomes faster, but for all matrix sizes from 3x3 to 16x16, the code performance is close to the sustainable peak performance of the processor.
In the future, we will add support for other SIMD architectures like ARM Neon or IBM Altivec. Double precision and mixed precision will also be part of future implementations.
