Abstract

Current compilers cannot generate code that competes with hand-tuned code in efficiency, even for a simple kernel such as matrix-matrix multiplication (MMM). A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and the number of levels of tiling. Selecting the scheduling parameter values is a difficult and time-consuming task, since the parameter values depend on each other; this is why they are usually found by searching methods and empirical techniques. To overcome this problem, the scheduling sub-problems must be optimized together, as one problem and not separately. In this paper, an MMM methodology is presented where the optimum scheduling parameters are found by decreasing the search space theoretically, while the major scheduling sub-problems are addressed together as one problem and not separately, according to the hardware architecture parameters and the input size; for different hardware architecture parameters and/or input sizes, a different implementation is produced. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and the hardware architecture parameters (e.g., data cache sizes and associativities), giving high-quality solutions and a smaller search space. The methodology applies to a wide range of CPU and GPU architectures.
Introduction
Matrix-matrix multiplication (MMM) is an important kernel in a wide variety of domains and application areas. Its performance is of great practical importance and highly depends on memory management; the most performance-critical sub-problems are those of finding the schedules with the minimum numbers of (i) L1 data cache accesses, (ii) L2 data cache accesses, (iii) L3 data cache accesses, (iv) main memory data accesses and (v) addressing instructions. The above sub-problems depend on each other (e.g., a decrease in the number of L2 data cache accesses will consequently increase the number of L1 data cache accesses) and this is why they should be addressed together as one problem and not separately (in [1], a methodology for matrix-vector multiplication is given).
Toward this, much research has been done, either to simultaneously optimize only two phases, e.g., register allocation and instruction scheduling (a problem known to be NP-complete) [2, 3], or to apply predictive heuristics [4, 5]. Nowadays, compilers and related works apply either iterative compilation techniques [6-9], or both iterative compilation and machine learning compilation techniques to restrict the configurations' search space [10-15]. A predictive heuristic tries to determine a priori whether or not applying a particular optimization will be beneficial, while in iterative compilation a large number of different versions of the program are generated and executed by applying transformations, and the fastest version is selected; iterative compilation provides good results, but requires extremely long compilation times.
The state-of-the-art (SOA) hand/self-tuning libraries for linear algebra, such as ATLAS [16], OpenBLAS [17], Eigen [18], Intel_MKL [19], PHiPAC [20] and a few of the many noteworthy papers from the past, such as [21-24] for CPUs and [25, 26] for GPUs, do not give a theoretical solution; instead, they find the performance-critical parameter values mostly by searching, using heuristics and empirical techniques. The selection of the parameter values is a difficult and time-consuming task for two reasons. First, many parameters have to be taken into account, such as the number of levels of tiling, tile sizes, loop unroll depth, software pipelining strategies, register allocation, code generation, data reuse and loop transformations. Second, the optimum parameters for two slightly different architectures are different. Such a case is the MMM algorithm, which is a major kernel in linear algebra and the topic of this paper. In this paper, the optimum scheduling parameters are found by decreasing the search space theoretically, for a wide range of CPU (including multi-core) and GPU architectures.
A former MMM methodology for CPUs supporting SIMD was introduced in [27]; the major contribution of [27] is that it addresses the major software and hardware parameters together as one problem and not separately. This paper extends [27] in four ways. First, the search space is decreased theoretically by applying static performance estimation. Second, this paper extends [27] to a wide range of computer architectures; the methodology has been extended to GPU architectures, CPU architectures without an SIMD unit and CPU architectures different from general-purpose CPUs, e.g., microcontrollers, smaller processors with one level of data cache and processors with a direct mapped data cache. Third, the methodology takes into account more hardware architecture parameters (the memories' latencies, the data cache line size, the number, the latencies and the types of the CPU function units, the number of load/store units) and software characteristics (number of matrix operations, number of addressing instructions). Fourth, due to the third contribution above, the proposed methodology gives a smaller search space, a smaller code size and a smaller compilation time, as it does not test a large number of alternative schedules.
The proposed methodology is compared with the state-of-the-art software libraries ATLAS and Intel_MKL. Although a performance comparison with Intel_MKL is unfair, a detailed experimental analysis has been made, as it is the fastest MMM library in the world for Intel general purpose processors. A performance comparison with Intel_MKL is unfair for two reasons. First, the Intel_MKL developers have access to all the Intel processor architecture details which we do not, e.g., the victim cache and the hardware prefetchers; this is why the Intel_MKL library is the fastest library on Intel processors only. Second, the Intel_MKL loop kernels are written in assembly code, while our method is written in C (assembly code is always more efficient); the Intel developers write assembly code to deal with the low-level transformations, e.g., register allocation, instruction selection and instruction scheduling. The proposed methodology lies at a higher level of abstraction and is used for a wide range of computer architectures. Implementing the proposed methodology in assembly code is beyond the scope of this paper and thus the low-level transformations are applied by the target compiler (which is less efficient). The scope of this paper is not to provide the peak-performance MMM implementations, but to analytically give the architecture-dependent high-level transformation parameters (e.g., tile sizes) that achieve peak performance. We strongly believe that if we could modify the MKL library scheduling parameters according to the proposed methodology, an even higher performance would be achieved.
The proposed methodology is compared to the SOA libraries ATLAS and Intel_MKL for CPUs and cuBLAS for GPUs. The evaluation is done using an Intel Xeon CPU E3-1241 v3, an Intel i7-2600K, the Valgrind tool [28], ARMv7-a on the GEM5 simulator [29], PowerPC-440 on a Xilinx Virtex-5 FPGA, an Nvidia GeForce GTX-580 and the SimpleScalar simulator [30].
The remainder of this paper is organized as follows. In Sect. 2, related work is given. The proposed methodology is presented in Sect. 3. In Sect. 4, the experimental results are presented, while Sect. 5 is dedicated to conclusions.
Related work
The problem of speeding up MMM has been studied in the last decades. A historical perspective is given in [31-41]; these works present how hardware and software can work on scalable multi-processor systems for matrix algorithms.
ATLAS [16, 42-45] is an implementation of a high-performance software production/maintenance paradigm called automated empirical optimization of software (AEOS). In an AEOS-enabled library, many different ways of performing a given kernel operation are supplied, and timers are used to empirically determine which implementation is best for a given architectural platform. ATLAS uses BLAS implementations. The basic linear algebra subprograms (BLAS) are routines that provide standard building blocks for performing basic vector and matrix operations. The Intel math kernel library (Intel MKL) [19] is a computing math library of highly optimized, extensively threaded routines for applications that require maximum performance. The Intel_MKL library has been written by Intel and this is why it performs best on Intel processors only. The MKL kernels have been written in assembly for maximum performance. During the installation of ATLAS and Intel_MKL, on the one hand an extremely complex empirical tuning step is required, and on the other a large number of compiler options are used; both are outside the scope of this paper. Although ATLAS is one of the SOA libraries for the MMM algorithm, its techniques for supplying different implementations of kernel operations concerning memory management are empirical, and hence it does not provide a theoretical methodology.
Regarding MMM implementations for one core, many related works exist, such as [21-23, 46-51]. In [22], BLIS is presented; BLIS is a framework for the rapid instantiation of BLAS. In [23], BLIS extends the GotoBLAS approach to implement peak-performance MMM implementations. In [21], a systematic analysis of the high-level issues affecting the design of high-performance matrix multiplication is given. Reference [24] gives a significant theoretical background on finding the optimum scheduling parameters, but refers to specific CPU architectures only. In [24], analytical models are presented for estimating the optimum tile size values, assuming only fully associative caches, which in practice are very rare. The aforementioned works refer to specific CPU architectures only and find the scheduling parameters mostly by using empirical techniques.
Although loop tiling is necessary to achieve high performance, the above works do not find the tile sizes and the number of levels of tiling by taking into account the cache sizes, their associativities and the data array layouts together as one problem; instead, searching is applied. Let us give an example. According to ATLAS [16] (only the cache size is taken into account), the size of three rectangular tiles (one for each matrix) must be smaller than or equal to the cache size; however, the elements of these tiles are not written in consecutive main memory locations and thus do not use consecutive data cache locations; this means that, with a set-associative cache, they cannot simultaneously fit in the data cache due to the cache modulo effect. Moreover, even if the tile elements are written in consecutive main memory locations (a different data array layout), the three tiles cannot simultaneously fit in the data cache if the cache is two-way associative or direct mapped. We will show that loop tiling is efficient only when the cache size, cache associativity and data array layouts are addressed together as one problem and not separately.
There are several works optimizing MMM for many cores [52-63]. The fastest implementations are given in [23], where MMM is parallelized on the Intel Xeon Phi and the IBM Blue Gene/Q; an analysis is made of which loop is to be parallelized. The vast majority of previous works regarding multi-core architectures deal with cluster architectures; they partition the MMM problem into many distributed memory computers (distributed memory refers to a multiple-processor computer system in which each processor has its own private memory). SRUMMA [63] describes one of the best parallel algorithms suitable for clusters and scalable shared memory systems. Although SRUMMA minimizes the communication contention between CPUs, it does not optimize the MMM problem for one CPU (it runs the cblas_sgemm ATLAS optimized routine). Furthermore, about half of the above works use Strassen's algorithm [64] to partition the MMM problem into many multi-core processors; Strassen's algorithm minimizes the number of multiplication instructions at the expense of the number of add instructions and of data locality. The MMM code for one core is either given by the Cilk tool [65] or by the cblas_sgemm routine of ATLAS. Finally, Refs. [66, 67] show how shared caches can be utilized. All the above works are empirical techniques and do not provide a theoretical model.
Regarding GPUs, several related works exist, such as [25, 26, 68-75]. Reference [26] shows how to modify the MAGMA GEMM kernels to use the Fermi architecture more efficiently. Reference [68] presents a method for producing MMM kernels tuned only for a specific architecture, through a canonical process of heuristic autotuning, based on the generation of multiple code variants and the selection of the fastest ones through benchmarking. Reference [70] provides implementations of Strassen's MMM algorithm as well as of Winograd's variant; it shows that only square matrices of very large sizes (16,384 × 16,384) achieve a 33% speedup over cblas_sgemm (the ATLAS routine for data of type float) and 21% over cblas_dgemm (the ATLAS routine for data of type double). Reference [72] presents an in-depth study revealing interesting tradeoffs between shared memory and the hardware-managed L1 data caches for MMM. Reference [73] investigates different performance techniques, such as tiling, memory coalescing, prefetching and loop unrolling, trying to evaluate which method is the most efficient. Reference [74] provides a theoretical analysis of why performance drawbacks appear for specific problem sizes when using cache memories. Finally, in [75], different data array layouts are evaluated, such as Z-Morton and X-Morton. All the above works are empirical techniques and do not give a methodology.
In contrast to the proposed methodology, the above works find tile sizes mostly by searching, since they do not fully exploit all the h/w and s/w constraints. However, if these constraints are fully exploited, the optimum solution can be found by enumerating only a small number of solutions; in this paper, tile sizes are given by inequalities which contain the cache sizes and cache associativities.
Proposed methodology
In this paper, an MMM methodology is presented where the sub-problems of finding the schedules with the minimum numbers of (i) L1 data cache accesses, (ii) L2 data cache accesses, (iii) L3 data cache accesses, (iv) main memory data accesses, (v) addressing instructions are addressed together as one problem and not separately. We find the scheduling parameters that achieve the best performance by fully exploiting the software characteristics (production-consumption, data reuse and MMM parallelism) and the major hardware architecture parameters, i.e., the (a) number of the cores, (b) number of memories, (c) size of each memory, (d) number of registers, (e) associativities of data cache memories, (f) memories' latencies, (g) SSE instruction latencies. For different hardware architecture parameters, different schedules for MMM are created.
It is well known that the search space, i.e., all different MMM implementations, is infinite and cannot be searched exhaustively. In this paper, we find the best schedules among those that exist in our search space. The search space being addressed consists of all the high-level transformations that affect MMM performance, including all different transformation orderings and all different transformation parameters (e.g., tile sizes); these are the number of levels of tiling, loop tiling for all the memories, register blocking, loop interchange, loop unroll, scalar replacement and data array layouts. The search space we use does not contain low-level transformations, e.g., instruction scheduling to decrease the number of pipeline stalls (this is beyond the scope of this paper). Although the low-level transformations are not taken into account, the search space being addressed is enormous and impractical to search, e.g., if the number of different tile sizes for each loop is 100, the number of all different schedules that apply loop tiling for L1 and L2 is 6! × 100^6 = 7.2 × 10^14 (two levels of tiling add 6 new loops); if we consider that the compilation time of each schedule is 1 s and given that 1 s ≈ 3.17 × 10^−8 years, the overall compilation time is prohibitively long; if we include all the above transformations, the number of schedules becomes even larger. Given that the above transformations are strongly interdependent, the only way to decrease the search space is to address them together as one problem and not separately. For the remainder of this paper, the three array names and sizes are those shown in Fig. 1, i.e., C = C + A × B, where C, A and B are of size N × M, N × P and P × M, respectively.
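For reference, the computation of Fig. 1 corresponds to the following untiled C loop nest (a minimal sketch; the row-wise one-dimensional array layout is an assumption consistent with Sect. 4):

    /* Baseline (untiled) MMM kernel, C = C + A x B, as in Fig. 1.
       A is N x P, B is P x M, C is N x M; all three are one-dimensional
       arrays stored row-wise (an assumption consistent with Sect. 4). */
    void mmm_baseline(const float *A, const float *B, float *C,
                      int N, int M, int P)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                for (int k = 0; k < P; k++)
                    C[i * M + j] += A[i * P + k] * B[k * M + j];
    }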
MMM performance depends on the time needed for the (i) data to be loaded/stored (C, A and B arrays), (ii) matrix operations to be executed, (iii) addressing instructions to be executed (integer instructions only), (iv) instructions to be loaded from instruction cache.
Regarding (iv), the time needed for the instructions to be fetched from the L1 instruction cache is lower than the (i)-(iii) values above, since no instruction cache conflicts occur; this is because the MMM code size is small and always fits in the L1 instruction cache. It is important to say that all present-day processors contain separate L1 data and instruction caches, and thus we can assume that the shared/unified L2/L3 caches contain only data. Architectures with unified L1 caches are not discussed in this paper.
Equations 1 and 2 approximate the MMM execution time; Eq. 1 holds for architectures where matrix operations and addressing operations are executed in parallel (we assume that the arrays contain floating point numbers) and Eq. 2 holds for architectures where they are not [either no floating point arithmetic logic unit (ALU) exists or the arrays are of type integer]; in most architectures, the load/store unit and the execution unit work in parallel.
Regarding the time needed to execute the matrix operations (T_matrix-operations), it is given by the following two equations, i.e., Eqs. 3 and 4; Eq. 3 is used in the case there is a separate multiplication unit working in parallel with the ALU and Eq. 4 otherwise (the number of multiplications is larger than the number of additions). T_matrix-operations is a constant number. In Eqs. 3 and 4, Mul_lat, Add_lat, Num_Mult and Num_ALU are the latencies and the numbers of the multiplication and ALU units, respectively.
Regarding addressing instructions, the T_addressing value is not a constant number; it depends on the implementation/schedule. The number of addressing instructions decreases when (a) the number of levels of tiling is decreased, (b) the tile sizes are increased, (c) more array references are assigned into the available registers and (d) the loop unroll factor is increased.
Regarding load/store instructions, T_data ≫ T_matrix-operations in most cases, i.e., if the data do not fit in the L1 data cache. In contrast to T_matrix-operations, T_data and T_addressing are not constant numbers; they depend on the schedule used. Also, T_data and T_addressing depend on each other; normally, by increasing the T_data value, T_addressing is decreased and vice versa, while the number of matrix operations remains constant.
To summarize, given that T_matrix-operations is a constant number and in most cases T_matrix-operations < T_addressing and T_matrix-operations < T_data, MMM performance can be increased only by minimizing both the T_data and T_addressing values; given that T_data and T_addressing are interdependent, high performance is achieved only for both low T_data and low T_addressing values. This is because the separate optimization of the T_data and T_addressing values gives different schedules which cannot coexist, each refining one value and degrading the other.
The T_data value is found by decreasing the search space theoretically according to the memory hierarchy architecture parameters. For different cache hierarchy architectures, a different equation that estimates the T_data value is created, e.g., for an architecture with one level of data cache, the T_data value is given by Eq. 7. These equations give theoretically the tile sizes that achieve a minimum T_data value.
As far as the T_addressing value is concerned, it cannot be found theoretically. This is because the number of addressing instructions highly depends on the target compiler and its optimizations, e.g., the unroll factor values. Thus, we find all the schedules achieving a low T_data value and, among those, only the ones that achieve a low T_addressing value are selected; all the schedules achieving a low T_data value are converted into assembly code (by using the target compiler) and their numbers of addressing instructions are measured. Then, all schedules achieving both low T_data and low T_addressing values are compiled and run on the target platform to find the fastest.
Instead of searching all different MMM schedules to find the best one, which is impractical because their number is enormous, only a small number is searched. The exploration space is decreased by orders of magnitude, since we test only solutions that are close to the best, i.e., only the schedules achieving both low T_data and low T_addressing values. In this way, the compilation time is drastically decreased. A different schedule emerges for different types of CPUs, CPU parameters and input sizes. The remainder of this paper presents all these schedules. The proposed methodology for CPUs with one core is given in Sect. 3.1, while the proposed methodology for CPUs with more than one core is given in Sect. 3.2. The proposed methodology for GPU architectures is given in Sect. 3.3.
A different schedule is given according to the (a) number and type of the cores, (b) number of cache memories, (c) cache sizes, (d) input sizes and (e) whether an SIMD unit is supported or not (Fig. 2).
CPUs with one core
A different schedule is given according to the (a) number of data cache memories, (b) cache sizes, (c) input sizes and (d) whether an SIMD unit is supported or not (Fig. 2). Sections 3.1.1-3.1.6, which are shown in Fig. 2, are given below.
Small input sizes and CPUs with L1 data cache and with/without L2 cache
For small input sizes, i.e., if all the data of B, the data of 2 × i_0 rows of A and the data of 2 × i_0 rows of C fit in different ways of the L1 data cache (Eq. 5), the scheduling given below is used (Fig. 3):

P × M × element_size ≤ (assoc − k1 − k2) × (L1/assoc)   (5)

where k1 = ⌈(2 × i_0 × P × element_size)/(L1/assoc)⌉, k2 = ⌈(2 × i_0 × M × element_size)/(L1/assoc)⌉, L1 is the size of the L1 data cache memory in bytes, element_size is the size of the array elements in bytes and assoc is the L1 associativity (assoc ≠ 1). P × M is the size of the B array in elements. k1 is an integer and gives the number of L1 data cache lines with identical L1 addresses used for the 2 × i_0 rows of A; k2 is an integer and gives the number of L1 data cache lines with identical L1 addresses used for the 2 × i_0 rows of C; for the remainder of this paper, we will more freely say that we use k1 cache ways for A, k2 ways for C and (assoc − k1 − k2) cache ways for B (in other words, A, B and C are written in separate data cache ways). In the case that P × M ≫ 2 × i_0 × P + 2 × i_0 × M, or in the case that a two-way set associative cache exists, 1 cache way is used for both A and C. In the case that assoc = 1, a slightly different schedule is given in the last paragraph of Sect. 3.1.3.
The optimum production-consumption (when an intermediate result is produced, it is directly consumed or used) of array C and the sub-optimum data reuse of array A have been selected by splitting the arrays into tiles according to the number of the registers, Eq. 6 (Fig. 3). All different (i_0, j_0) combinations satisfying Eq. 6 give a feasible solution:

i_0 × j_0 + i_0 + j_0 ≤ RF_FP   (6)
where RF_FP is the number of the available floating point registers. In the case that C, A and B contain integer numbers, Eq. 6 also contains the addressing variables and the loop iterators (since these also occupy integer registers). We use i_0 × j_0 registers for C, i_0 for A and j_0 for B (Fig. 3). We assign registers across the i, j iterators (Fig. 1) and not across the k iterator, because k is the innermost one (it changes its value in each iteration and thus no data reuse is achieved).
The schedule is shown in Fig. 3. First, the first i_0 elements of the first column of A are multiplied by the first j_0 elements of the first row of B. Then, the first i_0 elements of the second column of A are multiplied by the first j_0 elements of the second row of B, etc. This is repeated until the first i_0 rows of A have been multiplied by the first j_0 columns of B; then the same i_0 rows of A are multiplied by the next j_0 columns of B, etc.
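For illustration, a minimal C sketch of this schedule for i_0 = j_0 = 2 (e.g., RF_FP = 8) follows; row-wise layouts and N, M being multiples of 2 are simplifying assumptions (the methodology stores the columns of B consecutively):

    /* Register-blocked kernel of Fig. 3 with i0 = j0 = 2: 4 registers
       for C, 2 for A and 2 for B (i0*j0 + i0 + j0 = 8 = RF_FP). k is
       the innermost loop; the intermediate results of C are produced
       and directly consumed in registers. */
    void mmm_reg_blocked(const float *A, const float *B, float *C,
                         int N, int M, int P)
    {
        for (int i = 0; i < N; i += 2)
            for (int j = 0; j < M; j += 2) {
                float c00 = 0.0f, c01 = 0.0f, c10 = 0.0f, c11 = 0.0f;
                for (int k = 0; k < P; k++) {
                    float a0 = A[i * P + k], a1 = A[(i + 1) * P + k];
                    float b0 = B[k * M + j], b1 = B[k * M + j + 1];
                    c00 += a0 * b0;  c01 += a0 * b1;
                    c10 += a1 * b0;  c11 += a1 * b1;
                }
                C[i * M + j]           += c00;
                C[i * M + j + 1]       += c01;
                C[(i + 1) * M + j]     += c10;
                C[(i + 1) * M + j + 1] += c11;
            }
    }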
Given that each row of A is multiplied by all columns of B, A and B are reused M and N times, respectively; thus, they have to remain in the L1 data cache. To do this, the cache lines of A and B must be written in L1 without conflicting with each other and with C. 2 × i_0 rows of A and C have to fit in L1, the currently processed i_0 rows and the next ones, for two reasons. First, apart from the first i_0 rows of A and C, the second i_0 rows must also be loaded in L1 without conflicting with B. Second, when the third i_0 rows of A are multiplied by B, the L1 cache lines containing the first i_0 rows are replaced by the third i_0 rows according to the LRU cache replacement policy, without conflicting with the B ones. This is achieved by storing the rows of A, C and the columns of B in consecutive main memory locations and by using (k1 × L1/assoc) L1 memory size for A, (k2 × L1/assoc) for C and ((assoc − k1 − k2) × L1/assoc) for B (in Eq. 5). We can more freely say that this is equivalent to using k1 cache ways for A, k2 cache ways for C and (assoc − k1 − k2) cache ways for B. An empty cache line is always granted for each different modulo (with respect to the size of the cache) of the A, C and B memory addresses. It is important to say that if we use
L1 ≥ (2 × i_0 × P + 2 × i_0 × M + P × M) × element_size instead of Eq. 5, the number of L1 misses will be much larger because the A, B and C cache lines would conflict with each other.
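The way-allocation check described above can be illustrated by the following C helper (a sketch of the Eq. 5 feasibility test under the definitions above; the cache parameters in main() are illustrative, not taken from the paper):

    #include <stdio.h>

    /* Number of cache ways needed for a block of 'bytes' bytes, where
       one way holds L1/assoc bytes (ceiling division, as in k1, k2). */
    static int ways(long bytes, long way_size)
    {
        return (int)((bytes + way_size - 1) / way_size);
    }

    /* Sketch of the Eq. 5 check: 2*i0 rows of A take k1 ways, 2*i0
       rows of C take k2 ways and the whole B array must fit in the
       remaining (assoc - k1 - k2) ways. */
    int eq5_holds(long L1, int assoc, int elem, int i0, int M, int P)
    {
        long way = L1 / assoc;
        int k1 = ways(2L * i0 * P * elem, way);  /* ways for the A rows */
        int k2 = ways(2L * i0 * M * elem, way);  /* ways for the C rows */
        return (long)P * M * elem <= (long)(assoc - k1 - k2) * way;
    }

    int main(void)  /* illustrative values: 32 KB 8-way L1, 4-byte floats */
    {
        printf("%d\n", eq5_holds(32768, 8, 4, 4, 64, 64));  /* prints 1 */
        return 0;
    }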
The time needed for the array elements to be loaded/stored is approximated by Eq. 7 and Table 1.
where L1_load_lat, DDR_load_lat, L1_store_lat and DDR_store_lat are the L1 and DDR load and store latencies, respectively. L1_loads, DDR_loads, L1_stores and DDR_stores are the numbers of loads and stores occurring for each memory and are shown in Table 1. line_L1 is the number of elements each L1 cache line contains and L1_ports is the number of L1 load/store ports. Without loss of generality, in Eq. 7, we assume a memory architecture where only one L1 cache line is replaced at a time; if more than one cache line is replaced in parallel, then Eq. 7 is changed accordingly. Regarding the number of DDR writes, j_0/line_L1 L1 data cache lines are written to main memory for each j_0 elements of C. For data cache architectures where reads and writes are executed in parallel, Eq. 7 is changed accordingly.
From Eq. 7 and Table 1, we can approximate the T_data value. Supposing that the L1 and DDR load/store latencies are 1 and 50, respectively, and also that L1_ports = 1 and line_L1 = 4, Eq. 7 and Table 1 give the T_data value for this schedule.
Regarding DDR access time, it is minimized when j_0 is a multiple of the cache line size, i.e., 4. Regarding L1 data cache access time, it is minimized when (1/i_0 + 1/j_0) is minimized (i_0, j_0 are found according to Eq. 6 and RF_FP is a power of 2). For RF_FP = 16, i_0 = j_0 = 3 gives the minimum L1 data cache access time. Also, if RF_FP = 8, then i_0 = j_0 = 2 gives the minimum L1 data cache access time. However, in the case that RF_FP = 32, we cannot find a good i_0 = j_0 solution, since several registers are wasted, i.e., if i_0 = j_0 = 4, then 16, 4 and 4 registers are used for C, A and B, respectively, which means that only 24/32 registers are used; in this case, we do not fully utilize the RF size and thus a solution different from i_0 = j_0 is preferred, e.g., i_0 = 5 and j_0 = 4. On the other hand, if we select i_0 = j_0 = 5, then we need 35 registers, which are more than 32.
However, the best performance is not always achieved by minimizing the T_data value; although the i_0 = j_0 case achieves a lower number of data accesses than the j_0 ≫ i_0 case (for most register file sizes), it achieves a larger number of addressing instructions; this is because in the first case, more array addresses per iteration are computed (Fig. 4).
The above schedule achieves a minimum number of data accesses (for most register file sizes). In this case, the C array is loaded and stored only once from the L1 data cache, while A and B are loaded more than once. Otherwise, a larger number of data accesses occurs because C is always accessed twice (the C array is both loaded and stored); thus, it is more efficient to access the C array once and the A and B arrays more times than the opposite.
Medium input sizes with L1 data cache only
For medium input sizes where Eq. 9 holds, another schedule is used (Fig. 5). If all the data of a Tile1 of A and a Tile1 of B fit in separate ways of the L1 cache (Eq. 9), the scheduling given below is used:

II × P × element_size ≤ (assoc − k) × (L1/assoc)   (9)

where k = ⌈(P × j_0 × element_size)/(L1/assoc)⌉, L1 is the size of the L1 cache, assoc is the L1 associativity (assoc ≠ 1) and element_size is the size of the array elements in bytes. k is an integer and gives the number of L1 data cache lines with identical L1 addresses used for the Tile1 of B; we use k cache ways for B and (assoc − k) cache ways for A (in other words, A and B are written in separate data cache ways).
For the Tile1 tiles of A and B to remain in the L1 data cache, the cache lines of the Tile1 of A must be written in L1 without conflicting with those of the Tile1 of B. This is achieved by storing the Tile1 of A and the Tile1 of B in consecutive main memory locations and by using (k × L1/assoc) L1 memory size for B and ((assoc − k) × L1/assoc) L1 memory size for A (in Eq. 9). We can more freely say that this is equivalent to using k cache ways for B and (assoc − k) cache ways for A. An empty cache line is always granted for each different modulo (with respect to the size of the cache) of the A and B memory addresses. We do not need any empty space for C because (a) the Tile1 of C is much smaller than the other Tile1 tiles (normally, P ≫ II ≫ j_0), (b) each element of C is stored just once into memory (no data reuse) and (c) C is stored into the main memory infrequently; thus, the number of conflicts due to C can be neglected (the victim cache, if it exists, eliminates the misses of C). However, if P is comparable to II, then an additional cache way must be used for C.
Equation 9 holds only when the Tile1 tiles of A and B are written in consecutive main memory locations. Given that the arrays are written row-wise in main memory, the elements of a Tile1 of A are written in consecutive main memory locations, but the elements of a Tile1 of B are not. Thus, the data layout of B is changed from row-wise to tile-wise, i.e., all elements are written in main memory just as they are fetched; first, the first j_0 elements of the first row of B are written to main memory, then the first j_0 elements of the second row of B, etc.
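A minimal C sketch of this re-layout follows (M is assumed to be a multiple of j_0; the function name is illustrative):

    /* Row-wise to tile-wise re-layout of B: the elements of each
       P x j0 tile (j0 columns of B) are written to consecutive
       locations of Bt, in the exact order the schedule fetches them:
       j0 elements of row 0, then j0 elements of row 1, and so on. */
    void b_row_to_tilewise(const float *B, float *Bt, int P, int M, int j0)
    {
        int idx = 0;
        for (int jt = 0; jt < M; jt += j0)     /* one P x j0 tile of B */
            for (int k = 0; k < P; k++)        /* row inside the tile  */
                for (int j = jt; j < jt + j0; j++)
                    Bt[idx++] = B[k * M + j];
    }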
It is important to say that if we use L1 ≥ (II × P + P × j_0) × element_size instead of Eq. 9, the number of L1 misses will be much larger because the A and B cache lines would conflict with each other.
The scheduling follows. First, the first i_0 rows of A are multiplied by the first j_0 columns of B exactly as in the previous subsection. Then, the next i_0 rows of A are multiplied by the same columns of B as above. This is repeated until all the rows of the Tile1 of A have been multiplied by the first j_0 columns of B. After the first Tile1 of A has been multiplied by the first Tile1 of B, the first Tile1 of A is multiplied by the second Tile1 of B, etc. In this way, data reuse is achieved for both the A and B arrays.
The time needed for the array elements to be loaded/stored is approximated by Eq. 7 and Table 1. Supposing that the L1 and DDR load/store latencies are 1 and 50, respectively, and also that L1_ports = 1 and line_L1 = 4, Eq. 7 and Table 1 give the T_data value for this schedule.
Regarding DDR access time, it is minimized when II is maximized and j_0 is a multiple of the cache line size, i.e., 4; also, II is maximized when j_0 = 1 (II and j_0 are interdependent); there is a trade-off. Regarding L1 data cache access time, it is minimized when (1/i_0 + 1/j_0) is minimized (i_0, j_0 are found according to Eq. 6).
Large input sizes and CPUs with L1 data cache only
For large input sizes, where Eq. 9 does not hold, the scheduling given below is used (Fig. 6).
In this case, i_0 rows of A and j_0 columns of B do not fit in L1; Eq. 9 cannot give j_0 and i_0 values larger than 1. Thus, the P dimension is also tiled and the Tile1 tiles become even smaller; the Tile1 tiles contain i_0 sub-rows of A and j_0 sub-columns of B; the Tile1 tiles of A and B are of size II × KK and KK × j_0, respectively. KK is selected to be as large as possible, since the C array is both loaded and stored P/KK times from the main memory. Now, instead of multiplying rows of A by columns of B, sub-rows of A of size KK are multiplied by sub-columns of B. The largest II integer value satisfying Eq. 11 with KK = P/2 is selected. If Eq. 11 still cannot give an II value satisfying Eq. 6, the KK value is further decreased, i.e., KK = P/(2 × n):

II × KK × element_size ≤ (assoc − k) × (L1/assoc)   (11)

where k = ⌈(KK × j_0 × element_size)/(L1/assoc)⌉ and n is a positive integer (n ≥ 1).
II sub-rows of A and j_0 sub-columns of B of size KK have to fit in separate ways of the L1 data cache. It is important to say that Eq. 11 holds only when the tile elements of A and B are written in consecutive main memory locations; otherwise, the tile sub-rows/sub-columns will conflict with each other due to the cache modulo effect. As explained in the previous subsection, we do not need any empty space for C. However, if the size of the Tile1 of C is comparable to the others, then cache size equal to one cache way must be granted for C.
Regarding the data layout of A, when the P dimension is tiled, the data layout of A is changed from row-wise to tile-wise; the elements of A are written into memory just as they are fetched. If the data layout of A is not changed, Eq. 11 cannot give a minimum number of cache conflicts, since the sub-rows of A will conflict with each other. The same holds for B: the data array layout of B is changed from row-wise to tile-wise.
The scheduling follows. The multiplication between two Tile1 tiles is exactly the same as in the previous cases (Sect. 3.1.1). First, the first Tile1 of A is multiplied by all the Tile1 tiles of the first Tile1 block row of B. Then, the second Tile1 of the first Tile1 block column of A is multiplied by all the Tile1 tiles of the first Tile1 block row of B. Then, the second Tile1 block column of A is multiplied by the second Tile1 block row of B, etc.
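In loop form, this traversal can be sketched as follows (tile1_mult() is a hypothetical routine implementing the Fig. 3 register kernel; N, M and P are assumed to be multiples of II, j_0 and KK, respectively):

    /* Hypothetical kernel: multiplies the II x KK Tile1 of A starting
       at (ii, kk) by the KK x j0 Tile1 of B starting at (kk, jj),
       using the register-blocked schedule of Fig. 3, and accumulates
       the result into C. */
    void tile1_mult(const float *A, const float *B, float *C,
                    int ii, int jj, int kk,
                    int II, int j0, int KK, int N, int M, int P);

    /* Loop order of Sect. 3.1.3 (large inputs, L1 only): the P
       dimension is tiled by KK; each Tile1 of A is multiplied by all
       Tile1 tiles of the corresponding Tile1 block row of B. */
    void mmm_tiled_l1(const float *A, const float *B, float *C,
                      int N, int M, int P, int II, int j0, int KK)
    {
        for (int kk = 0; kk < P; kk += KK)
            for (int ii = 0; ii < N; ii += II)     /* Tile1 block column of A */
                for (int jj = 0; jj < M; jj += j0) /* Tile1 block row of B    */
                    tile1_mult(A, B, C, ii, jj, kk, II, j0, KK, N, M, P);
    }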
Regarding DDR access time, it is minimized when KK and II are maximized and when j_0 is a multiple of the cache line size, i.e., 4. T_data highly depends on the KK and II values, which are selected to be the largest possible according to the L1 data cache size.
In the case of a direct mapped data cache, the A and B tiles compete with each other for the same L1 addresses. Given that both tiles cannot remain in L1 due to the cache modulo effect, we select the Tile1 of A to be many times larger than the Tile1 of B, i.e., II ≫ j_0. In this way, the main part of the Tile1 of A remains in L1 and the number of L1 misses is kept low.
Medium input sizes and CPUs with L2 and L1 data cache
If all the data of A and the Tile1 tiles of B/C fit in separate ways of the L2 cache (Eq. 13), the scheduling given below is used (Fig. 7):

N × P × element_size ≤ (assoc − 1) × (L2/assoc)   (13)

where L2 is the size of the L2 cache, assoc is the L2 associativity and element_size is the size of the array elements in bytes (e.g., element_size = 4 for floating point numbers). Given that N ≫ JJ in most cases, the size of A is much larger than the sizes of the Tile1 of B and the Tile1 of C, and thus ((assoc − 1) × L2/assoc) L2 memory size is used for A and one L2 way for the Tile1 tiles of B and C. Regarding the L1 data cache, 2 × i_0 rows of A and JJ columns of B have to fit in separate ways of L1 (Eq. 14):

P × JJ × element_size ≤ (assoc − k) × (L1/assoc)   (14)

where k = ⌈(2 × i_0 × P × element_size)/(L1/assoc)⌉ ≤ assoc/2, L1 is the size of the L1 data cache memory in bytes, element_size is the size of the array elements in bytes and assoc is the L1 associativity (assoc ≠ 1).
The scheduling follows. First, the first Tile1 of A is multiplied by the first Tile1 of B. The multiplication between two Tile1 tiles is the same as in Sect. 3.1.1. Then, the second Tile1 of A is multiplied by the same Tile1 of B as above, etc. We select the whole of A to fit in L2, since, having a Tile1 of B in the L1 data cache, its elements need to be multiplied by as many rows of A as possible before they are spilled to upper level memories.
The time needed for the array elements to be loaded/stored is approximated by Eq. 15 and Table 1. Supposing that the L1, L2 and DDR load/store latencies are 1, 4 and 50, respectively, and also that L1_ports = 1, line_L1 = 4 and line_L2 = 4, Eq. 15 and Table 1 give the T_data value for this schedule.
Regarding DDR access time, it is minimized when j_0 is a multiple of the L2 cache line size, i.e., 4. Regarding L2 data cache access time, it is minimized when JJ is maximized and j_0 is a multiple of the L1 cache line size, i.e., 4. JJ is maximized according to the L1 data cache size. Regarding L1 data cache access time, it is minimized when (1/i_0 + 1/j_0) is minimized.
Large input sizes and CPUs with L2 and L1 data cache
For large input sizes, where Eq. 13 does not hold, the scheduling given below is used (Fig. 8).
In this case, Eq. 14 cannot give JJ and i_0 values larger than 1. Thus, the P dimension is also tiled as in Sect. 3.1.3 and the Tile1 tiles become even smaller.
2 × i_0 sub-rows of A and j_0 sub-columns of B of size KK have to fit in separate ways of the L1 data cache. It is important to say that Eq. 17 holds only when the tiles of A and B are written in consecutive main memory locations (tile-wise); otherwise, the tile sub-rows/sub-columns will conflict with each other due to the cache modulo effect.
To efficiently use the L2 cache, the array A is further partitioned into Tile2 tiles. A Tile2 tile of A (of size II × KK), a Tile1 tile of B (of size KK × j_0) and a Tile2 of C (of size II × j_0) have to fit in the L2 cache (in Eq. 18). Array A uses (assoc − 1) L2 ways, while the B and C arrays use only one L2 way. This is because the sum of the sizes of the Tile2 of C and the Tile1 of B is smaller than one L2 way (II ≫ j_0); moreover, the C and B tiles do not achieve data reuse in L2 and thus there is no need for them to remain in L2 (the Tile1 of B is reused in the L1 data cache and not in L2).
The scheduling follows. First, the first Tile1 of the first Tile2 of A is multiplied by the first Tile1 of the first Tile1 block row of B. Then, the second Tile1 of the first Tile2 of A is multiplied by the same Tile1 of B as above, etc. After the first Tile2 of A has been multiplied by the first Tile1 of B, it is multiplied by the remaining Tile1 tiles of the first Tile1 block row of B. Then, the second Tile2 of the first Tile2 block column of A is multiplied by all the Tile1 tiles of the first Tile1 block row of B, etc.
The procedure ends when all Tile2 block columns of A have been multiplied by all Tile1 block rows of B.
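The corresponding loop structure can be sketched as follows (tile1_mult() is the same hypothetical Fig. 3 kernel as in the previous sketch; even divisibility of all sizes is assumed):

    /* Loop order of Sect. 3.1.5 (L1 + L2): an II x KK Tile2 of A stays
       in L2 while it is multiplied by the KK x j0 Tile1 tiles of B;
       inside a Tile2, the i0-row Tile1 tiles of A reuse each Tile1 of
       B in the L1 data cache. */
    void mmm_tiled_l1_l2(const float *A, const float *B, float *C,
                         int N, int M, int P,
                         int II, int i0, int j0, int KK)
    {
        for (int kk = 0; kk < P; kk += KK)
            for (int ii = 0; ii < N; ii += II)             /* Tile2 of A */
                for (int jj = 0; jj < M; jj += j0)         /* Tile1 of B */
                    for (int i = ii; i < ii + II; i += i0) /* Tile1 of A */
                        tile1_mult(A, B, C, i, jj, kk, i0, j0, KK, N, M, P);
    }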
We select a big Tile2 of A to fit in L2, since a Tile1 of B which resides in the L1 data cache is multiplied by all rows of A; thus, we multiply the Tile1 of B by as many rows of A as possible before they are spilled to the upper level memory.
CPUs with SIMD
In this case, a scheduling similar to that explained in Sect. 3.1.1 is used. The optimum production-consumption (when an intermediate result is produced, it is directly consumed or used) of array C and the sub-optimum data reuse of array A have been selected by splitting the arrays into tiles according to the number of XMM/YMM registers (Eq. 20):

p + 2 ≤ RF   (20)
where RF is the number of the XMM/YMM registers and p is the number of the registers used for the C array. Thus, we assign p registers to C and 1 register each to A and B.
Regarding the T_data value, the best schedule here is similar to that given in Sect. 3.1.1, i.e., 2 × 2, 2 and 2 registers are used for C, A and B when RF = 8, and 3 × 3, 3 and 3 registers are used for C, A and B when RF = 16. However, the schedule according to Eq. 20 is faster on SIMD architectures, since its lower number of addressing instructions has a larger effect on performance. As the p value increases, the number of SSE instructions decreases (this is explained below). We have evaluated both solutions on a large number of different architectures and the schedule of Fig. 9 is the fastest on all of them.
The illustrative example assumes eight XMM registers (XMM0:XMM7, 16 bytes each) and arrays of floating point data (4-byte elements); Fig. 9 shows the resulting MMM schedule for CPUs with SIMD. The first four elements of the first row of A (A(0, 0:3)) and the first four elements of the first column of B (B(0:3, 0)) are loaded from memory and assigned to the XMM0 and XMM1 registers, respectively (the elements of B have been written into the main memory tile-wise, i.e., just as they are fetched). XMM0 is multiplied
by XMM1 and the result is stored in the XMM2 register (Fig. 4). Then, the next four elements of B (B(0:3, 1)) are loaded into the XMM1 register again; XMM0 is multiplied by XMM1 and the result is stored in the XMM3 register (Figs. 4, 9). XMM0 is multiplied by XMM1 six times and the XMM2:XMM7 registers contain the intermediate multiplication results of the C array. Then, the next four elements of A (A(0, 4:7)) are loaded into XMM0, which is multiplied by XMM1 six times, as above (Fig. 4); the intermediate results in the XMM2:XMM7 registers are always produced and consumed. When the first row of A has been multiplied by the first six columns of B, the four values of each of the XMM2:XMM7 registers are added and stored into the main memory (C array), e.g., the sum of the four XMM2 values is C(0, 0). The above procedure continues until all the rows of A have been multiplied by all the columns of B. There are several ways to add the XMM2:XMM7 data; three of them are shown in Fig. 10, where four XMM registers are used to store the data of C, i.e., XMM1, XMM2, XMM3 and XMM4 (the SSE instruction latencies are taken into account here). The first one (Fig. 10a) sums the four 32-bit values of each XMM register and the results of the four registers are packed into one, which is stored in memory (the four values are stored in memory using one SSE instruction). The second one sums the four 32-bit values of each XMM register and then each 32-bit value is stored in memory separately (without packing). For most SIMD architectures, the second one (Fig. 10b) is faster than the first, because the store and add operations can be executed in parallel (the first one has a larger critical path). The third one (Fig. 10c) unpacks the 32-bit values of the four registers and packs them into new ones to add elements of different registers. For most SIMD architectures, the third one is faster than the other two, because unpack and shuffle operations usually have smaller latency and throughput values than the slow hadd operations.
Using SSE instructions, Eq. 5 changes into Eq. 21. KK is not shown in Fig. 9; KK is the tile size across the P dimension and for large input sizes KK < P (Fig. 8). As the KK value decreases, the number of SSE instructions increases according to the following equation (the code shown in Fig. 10 is executed more times), where c is the sum of the SSE instruction latencies shown in Fig. 10. Thus, as the p value increases, the number of SSE instructions decreases.
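A condensed C sketch with SSE intrinsics follows; it uses p = 2 accumulators instead of the six of Fig. 9, assumes 16-byte-aligned arrays, KK = P with P a multiple of 4 and a tile-wise layout of B, and reduces the accumulators with the hadd variant of Fig. 10a rather than the faster unpack/shuffle variant of Fig. 10c (all of these are simplifying assumptions):

    #include <xmmintrin.h>   /* SSE               */
    #include <pmmintrin.h>   /* SSE3: _mm_hadd_ps */

    /* One row of A times two columns of B. Bt stores each column of B
       in consecutive memory locations (tile-wise layout), so that
       B(k:k+3, j) can be loaded with one aligned SSE load. */
    void simd_row_times_2cols(const float *A, const float *Bt, float *C,
                              int M, int P, int i, int j)
    {
        __m128 acc0 = _mm_setzero_ps(), acc1 = _mm_setzero_ps();
        for (int k = 0; k < P; k += 4) {
            __m128 a  = _mm_load_ps(&A[i * P + k]);        /* A(i, k:k+3)   */
            __m128 b0 = _mm_load_ps(&Bt[j * P + k]);       /* B(k:k+3, j)   */
            __m128 b1 = _mm_load_ps(&Bt[(j + 1) * P + k]); /* B(k:k+3, j+1) */
            acc0 = _mm_add_ps(acc0, _mm_mul_ps(a, b0));    /* produce and   */
            acc1 = _mm_add_ps(acc1, _mm_mul_ps(a, b1));    /* consume in RF */
        }
        /* horizontal reduction of the two accumulators (hadd variant) */
        __m128 s = _mm_hadd_ps(acc0, acc1);  /* pairwise partial sums   */
        s = _mm_hadd_ps(s, s);               /* lane 0: sum of acc0,    */
                                             /* lane 1: sum of acc1     */
        C[i * M + j]     += _mm_cvtss_f32(s);
        C[i * M + j + 1] += _mm_cvtss_f32(_mm_shuffle_ps(s, s, 0x55));
    }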
Multi-core CPUs
To run MMM efficiently on many cores, the MMM problem is partitioned into smaller sub-problems and each sub-problem corresponds to a thread; each thread runs on one core only. Each thread must contain at least p1 instructions, where p1 is found experimentally and differs from one CPU to another. Otherwise, the cores will remain idle for several CPU cycles, since the threads' initialization and synchronization time becomes comparable to the threads' execution time; this leads to low performance. Furthermore, to achieve high performance, the number of threads must be larger than p2, where p2 is found experimentally. The impact of p1 and p2 on performance is comparable to that of the memory management problem. Regarding small input sizes, a large speedup cannot be achieved. This is because either a low p1 and/or a low p2 value is selected. In this case, it is preferable to run MMM with a smaller number of cores to increase p1 and/or p2.
The MMM execution time on a quad core CPU is approximated by Eqs. 22 and 23.
where i = [1, 4] and T_total-thread_i is the T_total value given in Sect. 3.1. Most multi-core processors typically contain two or three levels of cache, i.e., (a) separate L1 data and instruction caches and a shared L2 cache, or (b) separate L1 data and instruction caches, separate unified L2 caches and a shared L3 cache. The proposed methodology for a shared L2 and a shared L3 is given in Sects. 3.2.1 and 3.2.2, respectively.
CPUs with shared L2
To utilize the L2 shared cache, we partition the three arrays into Tile2 tiles (Fig. 11). The Tile2 tiles of A, B and C are of size II × KK, KK × JJ and II × JJ, respectively. Each multiplication between two Tile2 tiles creates a different thread. For small and medium input sizes, we always select KK = P (Fig. 11). Each multiplication between two Tile2 tiles is made as in Sect. 3.1.5. Having q cores, each Tile2 of A is multiplied by q consecutive Tile2 tiles of B in parallel, each one on a different core (Fig. 11); thus, M is evenly divisible by JJJ (JJJ = q × JJ).
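A minimal sketch of this thread decomposition with OpenMP follows (tile2_mult() is a hypothetical serial routine implementing the Sect. 3.1.5 schedule; KK = P and even divisibility are assumed):

    #include <omp.h>

    /* Hypothetical serial routine implementing the Sect. 3.1.5
       schedule: multiplies the II x P Tile2 of A starting at row ii
       by the P x JJ Tile2 of B starting at column jj, into C. */
    void tile2_mult(const float *A, const float *B, float *C,
                    int ii, int jj, int II, int JJ, int N, int M, int P);

    /* Sect. 3.2.1 decomposition: each Tile2-by-Tile2 multiplication is
       one thread; q consecutive jj iterations run on q different
       cores, sharing the Tile2 of A through the shared L2 cache. */
    void mmm_shared_l2(const float *A, const float *B, float *C,
                       int N, int M, int P, int II, int JJ)
    {
        for (int ii = 0; ii < N; ii += II) {          /* one Tile2 of A   */
            #pragma omp parallel for schedule(static)
            for (int jj = 0; jj < M; jj += JJ)        /* Tile2 tiles of B */
                tile2_mult(A, B, C, ii, jj, II, JJ, N, M, P);
        }
    }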
One Tile2 of A and at least q Tile1 tiles of B have to fit in the L2 shared cache. The Tile2 of A is always fetched by all the cores. Also, q Tile1 tiles of different Tile2 tiles of B are loaded, which have no consecutive elements between themselves. The goal is that these q Tile1 tiles of B and the next q ones do not conflict with the Tile2 of A and do not conflict with each other. In general, an L2 cache with assoc ≥ q + 1 is needed here. Cache sizes equal to ((assoc − q) × L2/assoc) and (q × L2/assoc) are needed for A (array A is written into the main memory tile-wise) and for B-C, respectively (in Eq. 24). We select a big Tile2 of A to fit in L2, since, having one Tile1 of B in each of the q L1 data caches, their elements need to be multiplied by as many rows of A as possible before they are spilled to L2; the Tile2 of A is reused M/j_0 times. L2 cache size equal to q L2 ways is used for the Tile1 tiles of B and C, since their size is small and their elements are not reused (the Tile1 of B is reused in the L1 data cache and not in L2).
Regarding the L2 data cache, the largest II value which satisfies Eq. 24 is selected (Fig. 11, which shows the proposed methodology for four cores having a shared L2 cache). Given that a Tile1 of B is written in the L1 data cache, it is memory efficient for it to be multiplied by as many rows of A as possible before it is spilled from L1.
where the KK value is determined according to the L1 data cache size (in Eq. 17). The JJ value depends on the number of instructions each thread must contain and is found experimentally; if JJ is smaller than this minimum number, the thread initialization and synchronization time is comparable to its execution time. The large number of ways needed here is not a problem, as the L2 associativity is larger than or equal to 16 in most architectures. The scheduling follows; suppose that there are four cores (Fig. 11). Each Tile2 of A is multiplied by a Tile2 of B exactly as in Sect. 3.1.5. Each multiplication between two Tile2 tiles makes a different thread and each thread is executed on only one core.
CPUs with shared L3
To utilize the L3 cache, the A and B arrays are further partitioned into Tile3 tiles, of size (q × II) × KK and KK × JJJ, respectively (Fig. 12). For small and medium input sizes, we always select KK = P (Fig. 12). Each multiplication between two Tile2 tiles of A and B makes a different thread. Each multiplication between two Tile2 tiles is made as in Sect. 3.1.5. We determine the Tile1, Tile2 and Tile3 parameters by the data cache sizes and associativities. Regarding the L2 cache, we compute the II value according to the L2 architecture parameters (in Eq. 25).
The largest II value satisfying Eq. 25 is selected. JJ is found experimentally, since each thread has to contain a minimum number of instructions. KK is found according to the L1 data cache size (in Eq. 17).
Given that a Tile1 of B is written in the L1 data cache, it is memory efficient for it to be multiplied by as many rows of A as possible before it is spilled from L1. Thus, (L2 × (assoc − 1))/assoc of the L2 size is used for A; the layout of A is tile-wise here. L2 cache size equal to one L2 way is used for the Tile1 tiles of B and C, since their size is small and their elements are not reused (the Tile1 of B is reused in the L1 data cache, but not in L2).
It is important to say that tiling for L2 is not always performance efficient, because (a) an L2 miss has a small penalty here; this is because in this case, the data are loaded not from main memory but from L3 cache, which is fast enough; (b) extra addressing instructions are inserted, which may degrade performance. Thus, tiling only for L1 and L3 may be more efficient in several cases.
Regarding the L3 cache, we compute JJJ according to the L3 cache parameters. We choose the biggest Tile3 of B possible to fit in the L3 shared cache. There is ((assoc − k − 1) × L3/assoc) L3 memory size available for the Tile3 of B (in Eq. 26):

KK × JJJ × element_size ≤ (assoc − k − 1) × (L3/assoc)   (26)

where k = ⌈(q × II × KK × element_size)/(L3/assoc)⌉ and L3 is the size of the L3 cache. The largest JJJ value satisfying Eq. 26 is selected. Also, JJJ = l × JJ, where l is an integer. The JJ value depends on the number of instructions each thread must contain and is found experimentally. The KK value is found according to the L1 data cache size and is given by Eq. 17.
The elements of the Tile3 tiles of A and B are written in consecutive main memory locations and thus occupy consecutive cache lines in the L3 cache. These two Tile3 tiles must use different L3 cache ways, as the cache lines of the Tile3 of B must not conflict with those of the Tile3 of A. Cache size equal to one L3 way is used for the C array so that its cache lines do not conflict with the Tile3 tiles of A and B; the C elements are not reused and do not occupy a large space.
The scheduling follows; suppose that there are four cores (Fig. 12). Each Tile2 of A is multiplied by a Tile2 of B exactly as in Sect. 3.1.5. Each multiplication between two Tile2 tiles makes a different thread and each thread is executed on one core only. First, all Tile2 tiles of the first Tile3 of A are multiplied by all Tile2 tiles of the first Tile3 of B ((JJJ/JJ) × q different threads); these threads are executed in parallel, exploiting the data reuse of the Tile1 of B in L1, the data reuse of the Tile2 of A in L2 and the data reuse of the Tile3 of B in L3. Then, the same Tile3 of B as above is multiplied by all the Tile2 tiles of the second Tile3 of the first Tile3 block column of A, etc.; each Tile3 of B is reused N/II times (this is why the Tile3 of B has to fit in the L3 shared cache) and each Tile2 of A is reused JJJ/j_0 times in L2 (this is why the Tile2 of A has to fit in the L2 cache). The procedure is repeated until all Tile3 block columns of A have been multiplied by all Tile3 block rows of B.
GPUs
All modern GPU architectures (compute capability 2.x and 3.x) have a common structure. They consist of p (p is up to 16) streaming multiprocessors (SMs), which are connected to a common L2 cache. Each SM contains k processors (k is up to 192), each one having a configurable L1 memory; the L1 memory contains an L1 data cache and a shared L1 memory. The number of SMs, the number of processors and the memory sizes differ from one architecture to another. Also, GPUs have smaller and faster cache memories than CPUs.
Regarding MMM performance on GPUs, the most critical parameters are the number of the threads running in parallel, the number of SMs working in parallel, the GPU occupancy and the memory management.
The proposed methodology gives a different schedule according to the GPU architecture parameters and the input size.
As in Sect. 3.1.1, the best schedule here is the row-column one, for two reasons. First, this schedule gives the minimum number of DDR accesses. This is because the C array is not just loaded, but is also stored into the main memory (it is accessed twice); thus, it is preferable to access the C array only once and A and B more times than the opposite (Sect. 3.1.1). Second, by writing the C array just once to the main memory, we avoid synchronization problems and all threads run in parallel.
To utilize the lower level GPU memories, the three arrays are partitioned into smaller ones according to the number of registers and the number of threads each SM supports. The three arrays are partitioned into square tiles of size k × k, where k is a power of 2 and k^2 ≤ Num_Threads, where Num_Threads is the number of threads each SM supports. Furthermore, n1 Tile1 tiles of A and n2 Tile1 tiles of B constitute a Tile2 of A and B, respectively (Fig. 13); we select n = n1 = n2 and n ≥ 2 (a detailed analysis is given below). n depends on the input size and its maximum value is limited by the number of available registers. A different n value is selected for different input sizes to achieve high occupancy (this is explained below).
The number of thread blocks that run in parallel is (N/(n × k)) × (M/(n × k)). None of the SMs remains idle when Eq. 27 holds; Eq. 27 states that the number of thread blocks is always larger than or equal to the number of SMs:

(N/(n × k)) × (M/(n × k)) ≥ p   (27)

otherwise, several SMs will remain idle.
We select the largest k value possible satisfying Eq. 27, where n ≥ 2. We select the largest k value because, by increasing k, (a) the number of threads running in parallel increases and (b) the number of data accesses decreases; in modern GPU architectures, always k ≥ 16.
The schedule follows. There are k^2 threads per thread block. Thread1 multiplies the three elements of A shown in Fig. 13 by the three elements of B that exist at one position to the right of those shown in Fig. 13. Thread2 multiplies the three elements of A shown in Fig. 13 by the three elements of B that exist at two positions to the right of those shown in Fig. 13, etc. The thread of number k multiplies the three elements of A that exist one position down from those shown in Fig. 13 by the three elements of B shown in Fig. 13, etc. In general, each thread multiplies n elements of the first column of the Tile2 of A by n elements of the first row of the Tile2 of B (k^2 threads are executed in parallel). Then the procedure continues with the second column of the Tile2 of A and the second row of the Tile2 of B, etc. The multiplication of the first n × k rows of A by the first n × k columns of B takes place in the first SM. The multiplication of the first n × k rows of A by the second n × k columns of B takes place in the second SM, etc.
The number of data accesses in the memory hierarchy is given in Table 2 (in Table 3, we assume that none of the cores remains idle). In Table 2, the shared L1 and the L1 data cache values correspond to each core.
The Tile2 tiles of A and B achieve data reuse and are thus placed in the shared L1 (in Eq. 28). We use the shared L1 and not the L1 data cache. An implementation using the shared L1 is faster than one using the L1 data cache, because (a) the shared memory normally has 32 banks and is much less susceptible to conflicts, and (b) by using shared memory, the layout of A and B is not changed (except for the special case explained below), decreasing the number of load/store and addressing instructions. The shared L1 is fully utilized. The shared L1 access time equals one clock cycle if no bank conflicts occur. Normally, the shared L1 memory is organized into 32 banks and each bank has a width of 32 bits; successive 32-bit words are assigned to successive banks. If all threads of a warp access different banks, there is no bank conflict. For most GPU architectures, if k ≥ 16 no bank conflicts exist. On the other hand, if k < 16, the data array layouts are changed from row-wise to tile-wise to eliminate L1 conflicts, i.e., first we write the first tile's elements in main memory in the exact order they are loaded, then the second's, etc.
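For illustration, the bank that a 4-byte word maps to can be computed as follows (a sketch assuming the 32-bank, 32-bit-word organization described above):

    /* Shared-memory bank of a 4-byte word: successive 32-bit words map
       to successive banks, wrapping around every 32 words (128 bytes).
       A warp accessing 32 different banks causes no bank conflict. */
    unsigned bank_of(unsigned byte_offset)
    {
        return (byte_offset / 4U) % 32U;
    }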
Another critical parameter for GPUs is memory coalescing. Normally, the DDR memory accesses data in 64/128-byte segments. Thus, to achieve the peak bandwidth transfer rate, we have to access data within 64/128-byte boundaries at the minimum. Regarding array A, this means that in the case that k < 32 and the DDR memory accesses data in 128-byte segments, we have to change the A data array layout from row-wise to tile-wise (we assume that the arrays contain floating point numbers). Regarding array B, this means that in the case that n × k ≠ 32 × m, where m is a positive integer, the data array layout of B is changed from row-wise to tile-wise.
To sum up, whether the data array layouts change or not depends on the shared L1/DDR memory architecture parameters and on the input size. However, the change of the data array layouts introduces an additional cost. The arrays have to be loaded and rewritten from/to DDR memory. To find out whether changing the data array layouts is performance efficient or not, the two schedules are tested and the fastest is selected. Normally, if the above parameters are very close to the optimum ones, the performance is approximately the same.
Regarding the L2 data cache, tiling is not performance efficient; this is because the L2 size is small relative to the number of SMs. If we further partitioned the three arrays into Tile3 tiles so that the new tiles fit in L2, the tile sizes would be very small and the extra addressing and load/store instructions would outweigh the locality advantage.
The MMM execution time can be approximated by Eq. 1. Moreover, if we assume that none of the cores remains idle, T_matrix-operations (Eq. 3) is transformed into Eq. 29; the floating point multiplications and additions are implemented by the multiply-add instructions, which all GPUs support.
Also, T_data is approximated by Eq. 30. If we assume that shared L1, L2 and DDR have access latencies of 1, 3 and 50 cycles, respectively, and that line_L1 = line_L2 = 4, L1_ports = 2, line_Shared = 4, cores_SM = 32 and load/store_units = 16, Eq. 30 and Table 2 give Eq. 31.
The number of DDR accesses is the critical parameter. If n1 ≠ n2 had been used instead of n = n1 = n2, Eq. 31 would contain the term (M/(n2 × k) + N/(n1 × k)) instead. This is because array A is loaded M/(n2 × k) times, B is loaded N/(n1 × k) times and C is loaded/stored once; thus, the total number of loads is proportional to (M/(n2 × k) + N/(n1 × k)), and for a fixed register budget (i.e., a fixed n1 × n2 product) this value is minimized for n1 = n2. This is why we use n1 Tile1 tiles of A and n2 Tile1 tiles of B, where n = n1 = n2.
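The minimization step can be written out explicitly. A minimal derivation, assuming square matrices (M = N) and a fixed register budget n1 × n2 = c, follows from the AM–GM inequality:

M/(n2 × k) + N/(n1 × k) = (N/k) × (1/n1 + 1/n2) ≥ (N/k) × 2/√(n1 × n2) = 2N/(k × √c),

with equality exactly when n1 = n2 = √c; hence the symmetric choice n = n1 = n2 minimizes the DDR traffic.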
According to Eq. 31, the DDR/L2 access time is minimized when n × k is maximized. However, the maximum values of k and n are k ≤ 32 and n ≤ 6, respectively, because their values are restricted by the number of registers and the number of threads each SM supports.
The best T_data value does not necessarily give the best performance. If the number of threads that run in parallel is less than the number of cores, and/or the number of SMs that work in parallel is less than the number of available SMs, the performance is degraded.
Regarding small input sizes, we select a small n value according to Eq. 27 (n = 2); otherwise, the number of blocks of threads becomes smaller than the number of SMs, which means that several SMs remain idle, decreasing the performance. Since the number of blocks of threads is (N/(k × n)) × (M/(k × n)), their number increases when n is decreased.
Regarding medium input sizes, the n value is up to 4. For large input sizes, n > 4. The performance increases with the n value, since a smaller number of data accesses is achieved; the selection rule is sketched below.
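The selection rule can be summarized in a few lines of host code. This is a hedged sketch following the reasoning above, not the paper's implementation; the function name, the n_max bound and the descending search are our assumptions.

/* Pick the largest register-tile factor n such that the grid still has at
 * least one block of threads per SM; a smaller n keeps more SMs busy, a
 * larger n reduces the number of DDR accesses. */
int select_n(int N, int M, int k, int num_sms, int n_max)
{
    for (int n = n_max; n >= 2; --n) {
        long blocks = (long)(N / (k * n)) * (long)(M / (k * n));
        if (blocks >= num_sms)
            return n;   /* largest n that leaves no SM idle */
    }
    return 2;           /* small inputs: maximize the block count */
}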
MMM performance is further increased by applying software prefetching through software pipelining. In this case, we use twice as much shared L1 memory as in Eq. 28, since we need both the current Tile2 tiles of A and B and the next ones. While the current Tile2 tiles are multiplied by each other, the next Tile2 tiles are loaded from DDR to shared L1 in parallel; in this way, the cores do not remain idle until the Tile2 tiles are fetched.
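A double-buffered variant of the earlier sketch illustrates this (again our reconstruction, reusing the hypothetical K, NR and TW macros and the same divisibility assumptions): two shared-memory buffers alternate, so the loads for the next tiles overlap the multiply-adds on the current ones.

__device__ void load_tiles(float As[TW][K], float Bs[K][TW],
                           const float *A, const float *B,
                           int P, int N, int p0)
{
    int ty = threadIdx.y, tx = threadIdx.x;
    for (int i = 0; i < NR; ++i) {
        As[ty * NR + i][tx] = A[(blockIdx.y * TW + ty * NR + i) * P + p0 + tx];
        Bs[ty][tx * NR + i] = B[(p0 + ty) * N + blockIdx.x * TW + tx * NR + i];
    }
}

__global__ void mmm_tile2_pipelined(const float *A, const float *B, float *C,
                                    int M, int N, int P)
{
    __shared__ float As[2][TW][K], Bs[2][K][TW];  // twice the shared memory
    float acc[NR][NR] = {{0.0f}};
    int ty = threadIdx.y, tx = threadIdx.x, cur = 0;

    load_tiles(As[0], Bs[0], A, B, P, N, 0);      // fill the first buffer
    __syncthreads();
    for (int p0 = 0; p0 < P; p0 += K) {
        if (p0 + K < P)   // prefetch the next tiles into the idle buffer
            load_tiles(As[cur ^ 1], Bs[cur ^ 1], A, B, P, N, p0 + K);
        for (int d = 0; d < K; ++d)   // compute on the current buffer
            for (int i = 0; i < NR; ++i)
                for (int j = 0; j < NR; ++j)
                    acc[i][j] += As[cur][ty * NR + i][d]
                               * Bs[cur][d][tx * NR + j];
        __syncthreads();  // prefetch finished; safe to switch buffers
        cur ^= 1;
    }
    for (int i = 0; i < NR; ++i)
        for (int j = 0; j < NR; ++j)
            C[(blockIdx.y * TW + ty * NR + i) * N
              + blockIdx.x * TW + tx * NR + j] = acc[i][j];
}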
Finally, performance is slightly increased by applying the loop unroll transformation and by using texture memory to store the A and B arrays on devices of compute capability larger than 2.0. Using texture memory (a) does not interfere with the other caches (important if there is a lot of pressure on the L1/L2 caches) and (b) offloads the address computations to the texture hardware.
Experimental results
The experimental results for the proposed methodology, presented in this section, were carried out with an Intel Xeon CPU E3-1241 v3 (four physical cores), an Intel i7-2600K (six physical cores), the Valgrind tool [28], ARMv7-a on the GEM5 simulator [29], PowerPC-440 on a Xilinx Virtex-5 FPGA, the GEM5 and SimpleScalar [30] simulators and an Nvidia GeForce GTX-580 of compute capability 2.0. Optimization level -O3 was used in all cases. The three arrays are one-dimensional arrays and their elements are aligned in main memory according to the L1 data cache line size, since aligned load/store instructions have lower latency than unaligned ones. The routine that changes the array layouts is always included in the execution times.
Although a performance comparison with Intel_MKL is unfair, a detailed experimental analysis has been made, as Intel_MKL is the fastest MMM library for Intel general purpose processors. The comparison is unfair for two reasons. First, the Intel_MKL developers have access to all the Intel processor architecture details, which we do not, e.g., the victim cache and the hardware prefetchers; this is why the Intel_MKL library is the fastest library on Intel processors only. Second, the Intel_MKL loop kernels are written in assembly code, while our method is written in C (assembly code is always more efficient); the Intel developers write assembly code to deal with the low-level transformations, e.g., register allocation, instruction selection and instruction scheduling. The proposed methodology lies at a higher level of abstraction and is used for a wide range of computer architectures. Implementing the proposed methodology in assembly code is beyond the scope of this paper, and thus the low-level transformations are applied by the target compiler (which is less efficient). The scope of this paper is not to provide peak-performance MMM implementations, but to analytically give the architecture-dependent high-level transformation parameters (e.g., tile sizes) that achieve peak performance. We strongly believe that if we could modify the MKL library scheduling parameters according to the proposed methodology, an even higher performance would be achieved.
The experimental results for single-core and multi-core CPU architectures are given in Sects. 4.1 and 4.2, respectively. The experimental results for GPU architectures are given in Sect. 4.3. Finally, in Sect. 4.4, an evaluation between multi-core CPUs and GPUs is given.
Experimental results for single-core CPU architectures
Regarding general purpose processors, a performance comparison is made with the ATLAS SOA library (version 3.10.2). It is important to say that, when using only one core, one thread has to be manually assigned to that core; the programmer has to set the CPU thread affinity flag. Otherwise, the operating system (OS) will make the core assignment and will toggle the thread among the cores, degrading performance because of poor data locality. The proposed methodology is compared with the cblas_dgemm routine (double precision values) for square and non-square matrix sizes (Figs. 14, 15). Also, the Valgrind tool [28] is used to measure the number of load/store instructions and the number of DDR memory accesses of the three methods for Fig. 14. In all the figures shown in this section, the average execution time over ten executions is reported.

First, a performance evaluation for square matrices is given. Regarding small input sizes (we used the schedules given in Sects. 3.1.1 and 3.1.6), i.e., N = 32 and N = 56 (Fig. 14), the proposed methodology achieves a very large performance gain over ATLAS and a significant gain over Intel_MKL. The proposed methodology achieves the smallest number of DDR accesses (Table 3); moreover, it achieves a much smaller number of load/store instructions than MKL and about the same number as ATLAS (Table 3). The number of load/store instructions is smaller because the YMM registers have been fully exploited; 14, 1 and 1 YMM registers are used for the C, A and B arrays, respectively (p = 14 in Fig. 9). For the N = 56 case, both the proposed methodology and ATLAS execute about 8.5 × 10^6 instructions in total (both arithmetical and load/store instructions), while Intel_MKL executes 13.9 × 10^6. We believe that the number of arithmetical instructions here is close to the minimum, because we apply the loop tiling transformation to only one of the three iterators (the j iterator), decreasing the number of addressing instructions. Although ATLAS achieves a lower number of load/store instructions, arithmetical instructions and DDR accesses than Intel_MKL, it does not achieve a performance gain; we strongly believe that the reason is the better Intel_MKL hand-written assembly code and the more efficient low-level optimizations it applies (they are further explained below).

For medium and large input sizes (we used the schedules given in Sects. 3.1.4 and 3.1.5), the proposed methodology achieves a large performance gain over ATLAS (the speedup value is about 1.8) and a very small performance loss against Intel_MKL (from 1.01 up to 1.13 times slower). Regarding the number of load/store instructions, the three methods achieve approximately the same values. As far as the number of DDR data accesses is concerned, the proposed methodology achieves a smaller number in all cases. This is because for different input sizes and cache parameters, a different schedule is generated, minimizing the number of data accesses; this is also shown in Table 1, where the numbers of memory accesses are given for the corresponding tile sizes.
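Setting the thread affinity mentioned above takes only a few lines of host code. A minimal Linux-specific sketch follows (the function name is ours; sched_setaffinity is the standard glibc call, and pthread_setaffinity_np achieves the same effect for pthreads):

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to core 0 so the OS cannot migrate it between
 * cores and destroy its data locality. */
void pin_to_core0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                         /* allow core 0 only */
    sched_setaffinity(0, sizeof(set), &set);  /* pid 0 = calling thread */
}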
Given that the proposed methodology achieves about the same number of load/store and arithmetical instructions and a lower number of DDR accesses (for medium and large input sizes, the most performance critical parameter) than MKL in all cases, we strongly believe that the small performance loss is due to the five following low-level optimizations: (a) MKL uses the AVX instruction set more efficiently (especially when unpacking and storing the multiplication results), (b) MKL achieves a lower T_addressing value due to its more efficient hand-written assembly code, (c) the sequence of MKL assembly instructions is more efficient and thus the number of idle cycles in the pipeline is minimized, (d) MKL uses some Intel processor units (e.g., victim cache, hardware prefetchers) more efficiently, since it has access to the hardware details, and (e) software prefetch instructions are written at the exact code line. We strongly believe that if we could modify the MKL library to use the proposed methodology, a higher performance would be achieved. As far as non-square matrices are concerned, the performance is about the same as in Fig. 14, except for the third cluster of results (the P dimension is much smaller than the others). In this case, the Intel_MKL performance is reduced and the peak performance of 50 gigaflops is not achieved. ATLAS does not perform well for small input sizes, but for medium and large input sizes it achieves about 25 gigaflops in all cases. The proposed methodology is the fastest for small input sizes and also comparable to Intel_MKL for medium and large input sizes. In general, the proposed methodology and ATLAS achieve about the same number of flops for square and non-square input sizes, while MKL does not.
The proposed methodology is also evaluated on two different embedded processors, i.e., PowerPC-440 on a Xilinx Virtex-5 FPGA and ARMv7-a on the GEM5 simulator (Fig. 16). Given that there is no state-of-the-art software library for these processors, we evaluated our methodology against 'arm-linux-gnueabihf-gcc' and the Xilinx compiler, on ARM and PowerPC, respectively. The PowerPC-440 is a 32-bit embedded superscalar processor with a seven-stage pipeline, developed by IBM. It contains L1 data and L1 instruction cache memories of size 32 kbytes, highly associative (64-way). ARMv7-a is a RISC dual-core processor with two levels of data cache (the experiments are made on one of the two cores). It contains L1 data and instruction caches of 32 kbytes (2-way associative) and an L2 cache of 1 Mbyte (16-way associative). The speedup values are from 2.6 up to 3.3 and from 3.2 up to 4.8, for PowerPC and ARM, respectively. PowerPC performs better than ARM because the data cache of the former is 64-way associative, while the data cache of the latter is 2-way associative, which gives a much larger number of cache misses. As the input size increases, the speedup value increases (on both processors), as the memory management problem becomes more critical.
Furthermore, an evaluation of Sect. 3.1.1 (small input sizes, i.e., Eq. 5 holds) is made on the SimpleScalar simulator (Fig. 17). In Fig. 17, the number of load/store instructions is shown for different tile sizes; square matrices of size N = 216 are taken. Since Eq. 5 holds, all the data of B and 2 × i0 rows of A fit in the L1 data cache (Figs. 2, 5). According to Eq. 6, the best schedule regarding the minimum number of load/store instructions is (i0 = 2, j0 = 2) and the next is (i0 = 1, j0 = 8) (the first uses 8 registers and the second 10 registers); the first schedule achieves a slightly smaller number of load/store instructions, since the expression 2 × N × M + N × M × P × (1/i0 + 1/j0) (it gives the number of load/store instructions) gives 2 × N × M + N × M × P for the first schedule and 2 × N × M + 1.125 × N × M × P for the second. Furthermore, by using more than ten registers, the number of load/store instructions is greatly increased, since the data are spilled from the RF and reloaded many times (Fig. 17). The analysis made in Sect. 3.1.1 is confirmed by Fig. 17.
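To make the register-tiling arithmetic concrete, a minimal C sketch of the (i0 = 2, j0 = 2) schedule follows (our illustration of the technique, not the paper's generated code; the array shapes and the evenness assumptions are ours). The 2 × 2 sub-tile of C lives in four registers, so each step of the p loop issues two loads of A, two loads of B and eight multiply-adds, reproducing the N × M × P × (1/i0 + 1/j0) load term above.

/* C (N x M) = A (N x P) * B (P x M), register-tiled with i0 = j0 = 2.
 * Assumes N and M are even; all arrays are row-major. */
void mmm_reg_2x2(const float *A, const float *B, float *C,
                 int N, int M, int P)
{
    for (int i = 0; i < N; i += 2)
        for (int j = 0; j < M; j += 2) {
            float c00 = 0.0f, c01 = 0.0f, c10 = 0.0f, c11 = 0.0f;
            for (int p = 0; p < P; ++p) {
                float a0 = A[i * P + p], a1 = A[(i + 1) * P + p]; /* 2 loads */
                float b0 = B[p * M + j], b1 = B[p * M + j + 1];   /* 2 loads */
                c00 += a0 * b0; c01 += a0 * b1;   /* 8 multiply-adds, all */
                c10 += a1 * b0; c11 += a1 * b1;   /* operands in registers */
            }
            C[i * M + j] = c00;       C[i * M + j + 1] = c01;
            C[(i + 1) * M + j] = c10; C[(i + 1) * M + j + 1] = c11;
        }
}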
Moreover, an evaluation of Sect. 3.1.2 is made on the SimpleScalar simulator (Fig. 18). One level of cache is selected here, with an L1 data cache of size 8 kbytes, eight-way associative (each way is of size 1 kbyte); given that there is no L2 cache, the number of L1 misses equals the number of DDR accesses. The number of DDR accesses/L1 misses for different tile sizes is shown when Eq. 9 holds (Fig. 18) (floating point elements and N = M = P = 128). As expected, as long as Eq. 9 holds, i.e., if the II rows of A and the j0 columns of B fit in L1, the number of DDR accesses is kept low (the B array is written tile-wise in main memory). For the (i0 = 1, j0 = 16) case, Eq. 9 does not hold for any II value, as the Tile1 of B is 8 kbytes and fills the whole cache. Regarding the (i0 = 1, j0 = 4), (i0 = 1, j0 = 8) and (i0 = 2, j0 = 2) cases, the maximum II value satisfying Eq. 9 is II = 8; this is why the minimum number of DDR accesses is achieved for this value. When II = 16, the Tile1 of A becomes 8 kbytes and thus there is no cache space left for the Tile1 of B, increasing the number of DDR accesses.

Finally, an evaluation over different cache size and associativity values is made on the GEM5 simulator using the ARMv7-a processor for five different cache configurations (L1 size, L1 associativity) (Figs. 19, 20). The compiler used is 'arm-linux-gnueabihf-gcc' and the input size is N = 80 (floating point values and square matrix sizes). Regarding the gcc compiler, its performance does not change significantly when the cache parameters change, except for the case of L1 = 4 kbytes, where its performance is about five times lower; this is because the row of A cannot remain in the L1 cache in this case and both the A and B elements are always loaded from DDR. For the row of A to remain in L1, one row of A and one column of B have to be smaller than the L1 size (otherwise, the A elements are spilled); however, B is written row-wise in main memory (by default) and the column elements lie in non-consecutive main memory locations; thus, for one column element to be fetched, an entire L1 cache line is fetched (the cache line size is 64 bytes). Given that each array element is 4 bytes (float numbers), for one column of B to be fetched, 5120 bytes are necessarily fetched (16 columns of B, which are of size 80 × 16 × 4 bytes), and this is why the performance is highly decreased when L1 = 4 kbytes. This is also shown in Fig. 20, where the number of L1 misses is highly increased. Another important observation is that, in contrast to gcc, the proposed methodology achieves a much smaller number of L1 misses for small associativity values (Fig. 20); this is because the data cache associativity is taken into account. Moreover, by increasing the cache size, the proposed methodology further increases its performance, while gcc does not.
Experimental results for multi-core CPU architectures
An evaluation of the proposed methodology using more than one physical core is made in Fig. 21. The experimental results were carried out with the Intel Xeon CPU E3-1241 v3 using the AVX instructions. The proposed methodology is compared with the ATLAS SOA library 3.10.2 and Intel_MKL (Parallel Studio XE 2016) (cblas_dgemm routine). It is important to say that ATLAS does not support multiple threads. As far as the performance on one core is concerned, the results are similar to those in Fig. 14. By increasing the number of threads, the performance is increased accordingly. The CPU utilization factors are from 1.85 up to 1.95, from 2.67 up to 2.9 and from 3.4 up to 3.79, for two, three and four cores, respectively. Regarding small input sizes, the core utilization factor is smaller, as the thread initialization time is comparable to the execution time. The proposed methodology uses the schedule given in Sect. 3.2.2. The performance is highly affected by the number of YMM registers used and by the L1 tile size. Regarding the shared cache, the performance is not highly affected by using different tile sizes, provided that the tiles fit in the shared cache. Intel_MKL achieves a small speedup over the proposed methodology for the reasons explained in the previous subsection.
Experimental results for GPU architectures
The experimental results presented in this subsection were carried out with an Nvidia GeForce GTX-580. The GTX-580 contains 16 SMs, which are connected to a common L2 of size 768 kbytes. Each SM contains 32 cores and has its own configurable L1. The comparison is made with the cuBLAS (CUDA Toolkit 6.5) state-of-the-art library.
The proposed methodology achieves a significant speedup over the cuBLAS SOA library for small and medium input sizes and approximately the same performance for large input sizes (Fig. 22, speedup over cuBLAS versus input size N, square matrices of size N × N). For input sizes N = 64 and N = 128, n = 2 and k = 32 are selected according to the corresponding inequalities given in Sect. 3.3; if we use a larger n value in these cases, the number of blocks of threads becomes smaller than the number of SMs and the performance is degraded; the n value is selected according to Eq. 27. For larger n values, we cannot select k = 32, since we exceed the available number of registers; thus, k = 16 is selected. Furthermore, for input sizes N = 192 and 256 ≤ N < 2048, n = 3 and n = 4 are selected, respectively. Last, for input sizes N ≥ 2048, n = 5 is selected.
Evaluation between multi-core CPUs and GPUs
Finally, an evaluation between the GTX-580 GPU and the i7-2600K CPU (six physical cores) is made in Fig. 23 (execution time on the GPU over execution time on the i7, versus input size N, square matrices of size N × N). The GPU contains 512 small cores, while the CPU contains six big cores, each one using the SSE vector instructions; regarding floating point data, this means that each CPU core executes four floating point operations per cycle. For small and medium sizes, i.e., N ≤ 512, the CPU performs faster than the GPU (Fig. 23); this is because the time needed to transfer the three arrays from the CPU to the GPU and vice versa is comparable to the execution time. However, for larger input sizes, where the transfer time is smaller than the execution time, the GPU performs faster. For N = 1024, N = 2048 and N = 4096, the GPU is 1.92, 2.83 and 3.8 times faster, respectively. As the input size increases, the speedup over the CPU increases, since the transfer time becomes smaller relative to the execution time.
Conclusions
In this paper, an MMM speedup methodology has been presented, where the optimum scheduling parameters are found by decreasing the search space theoretically, while the major scheduling sub-problems are addressed together as one problem and not separately. For the first time, an MMM methodology is given for such a wide range of computer architectures. For different hardware architecture parameters, a different implementation is produced. This is achieved by fully exploiting the MMM algorithm characteristics and the hardware architecture parameters.
