Asymmetric processors have emerged as an appealing technology for severely energy-constrained environments, especially in the mobile market, where heterogeneity in applications is mainstream. In addition, given the growing interest in ultra-low-power architectures for high performance computing, platforms of this type are also being investigated on the road towards energy-efficient high-performance scientific applications. In this paper, we propose a first step towards a complete implementation of the BLAS interface adapted to asymmetric ARM big.LITTLE processors, analyzing the trade-offs between performance and energy efficiency when compared to existing homogeneous (symmetric) multi-threaded BLAS implementations. Our experimental results reveal important gains in performance while maintaining the energy efficiency of homogeneous solutions by efficiently exploiting all the resources of the asymmetric processor.
INTRODUCTION
The decay of Dennard scaling [4] during the past decade marked the end of the "GHz race" and the shift towards multicore designs due to their more favorable performance-energy ratio. In addition, the doubling of transistors on chip with each new semiconductor generation, dictated by Moore's law [15], has only exacerbated the power wall problem [5, 12, 14], leading to the rise of "dark silicon" [6] and the deployment of heterogeneous facilities for high performance computing.
Asymmetric multicore processors (AMPs) are a particular class of heterogeneous architectures equipped with cores that share the same instruction set architecture but differ in performance, complexity, and power consumption. AMPs have recently received considerable attention as a means to improve the performance-energy ratio of computing systems [9, 8, 16, 20], mainly by exploiting the presence of serial and parallel phases within applications.
In this paper we investigate the practical performance-power-energy balance of ARM's asymmetric big.LITTLE technology, employing as a case study the compute-intensive general matrix multiplication (gemm): C += A · B, where the sizes of A, B, C are respectively m × k, k × n, m × n. Most previous related work targets the parallelization of gemm on i) distributed-memory heterogeneous architectures (see [3, 2] and references therein); or ii) asymmetric multicores, but using trivial (unoptimized) implementations of gemm [11, 10]. Compared with these other efforts, our paper makes the following contributions: First, we leverage a static mapping of threads and we propose a workload partitioning strategy of the BLIS implementation of gemm specifically tailored for the Exynos 5422 big.LITTLE architecture, a system-on-chip (SoC) featuring two processing clusters: an ARM Cortex-A15 quad core and a Cortex-A7 quad core. Second, we perform a detailed evaluation of our solution in terms of performance compared with that of the symmetric counterpart on each of the processing clusters of the Exynos 5422. Third, we perform an energy efficiency evaluation of each approach using the GFLOPS/W metric (equivalent to billions of floating-point arithmetic operations, or flops, per Joule).

Loop 1  for jc = 0, . . . , n − 1 in steps of nc
Loop 2    for pc = 0, . . . , k − 1 in steps of kc
            B(pc : pc + kc − 1, jc : jc + nc − 1) → Bc            // Pack into Bc
Loop 3      for ic = 0, . . . , m − 1 in steps of mc
              A(ic : ic + mc − 1, pc : pc + kc − 1) → Ac          // Pack into Ac
Loop 4        for jr = 0, . . . , nc − 1 in steps of nr           // Macro-kernel
Loop 5          for ir = 0, . . . , mc − 1 in steps of mr
                  Cc(ir : ir + mr − 1, jr : jr + nr − 1)          // Micro-kernel
                    += Ac(ir : ir + mr − 1, 0 : kc − 1) · Bc(0 : kc − 1, jr : jr + nr − 1)
                endfor
              endfor
            endfor
          endfor
        endfor

Figure 1: High performance implementation of gemm in BLIS. In the code, Cc ≡ C(ic : ic + mc − 1, jc : jc + nc − 1) is just a notation artifact, introduced to ease the presentation of the algorithm, while Ac, Bc correspond to actual buffers that are involved in data copies.
MATRIX MULTIPLICATION FOR GENERAL-PURPOSE PROCESSORS
Modern implementations of gemm for general-purpose architectures, including BLIS and OpenBLAS, follow the approach pioneered by GotoBLAS [7]. Concretely, BLIS implements gemm as three nested loops around a macro-kernel plus two packing routines (see Loops 1-3 in Figure 1). The macro-kernel is then implemented in terms of two additional loops around a micro-kernel (Loops 4 and 5 in Figure 1). In BLIS, the micro-kernel is typically implemented as a loop around a rank-1 (i.e., outer product) update, written in assembly or with vector intrinsics, while the remaining five loops are implemented in C; see [19] for further details. Furthermore, the BLIS (cache) optimization parameters nc, kc, mc, nr and mr are adjusted taking into account the latencies of the floating-point units (FPUs), the number of vector registers, and the size and associativity degree of the cache levels. The goal is that Ac and a narrow column panel of Bc, say Br, are fed into the floating-point units from the L2 and L1 caches, respectively, and that these transfers are fully amortized with enough computation from within the micro-kernel; see [13].
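The loop structure of Figure 1 can be sketched in portable C as follows. This is a minimal illustration and not the BLIS source: the function and helper names are hypothetical, matrices are assumed to be stored in column-major order without padding, the micro-kernel is plain C rather than assembly or intrinsics, and edge blocks are handled by zero-padding the packed buffers, which is a simplification chosen here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { MR = 4, NR = 4 };   /* micro-tile dimensions mr x nr */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Pack a kc x nc block of B into micro-panels of NR columns (zero-padded). */
static void pack_B(const double *B, int ldb, int kc, int nc, double *Bc)
{
    for (int jr = 0; jr < nc; jr += NR)
        for (int p = 0; p < kc; ++p)
            for (int j = 0; j < NR; ++j)
                *Bc++ = (jr + j < nc) ? B[p + (size_t)(jr + j) * ldb] : 0.0;
}

/* Pack an mc x kc block of A into micro-panels of MR rows (zero-padded). */
static void pack_A(const double *A, int lda, int mc, int kc, double *Ac)
{
    for (int ir = 0; ir < mc; ir += MR)
        for (int p = 0; p < kc; ++p)
            for (int i = 0; i < MR; ++i)
                *Ac++ = (ir + i < mc) ? A[(ir + i) + (size_t)p * lda] : 0.0;
}

/* Micro-kernel: kc rank-1 updates accumulated into an MR x NR tile of C. */
static void micro_kernel(int kc, const double *a, const double *b,
                         double *C, int ldc, int m_eff, int n_eff)
{
    double acc[MR][NR] = {{0.0}};
    for (int p = 0; p < kc; ++p)
        for (int j = 0; j < NR; ++j)
            for (int i = 0; i < MR; ++i)
                acc[i][j] += a[p * MR + i] * b[p * NR + j];
    for (int j = 0; j < n_eff; ++j)          /* write back only the valid part */
        for (int i = 0; i < m_eff; ++i)
            C[i + (size_t)j * ldc] += acc[i][j];
}

/* C += A * B with A (m x k), B (k x n), C (m x n), all column-major. */
void gemm_blocked(int m, int n, int k, const double *A, const double *B,
                  double *C, int mc, int kc, int nc)
{
    assert(m >= 0 && n >= 0 && k >= 0 && mc > 0 && kc > 0 && nc > 0);
    size_t mc_pad = (size_t)(mc + MR - 1) / MR * MR;
    size_t nc_pad = (size_t)(nc + NR - 1) / NR * NR;
    double *Ac = malloc(sizeof(double) * mc_pad * (size_t)kc);
    double *Bc = malloc(sizeof(double) * nc_pad * (size_t)kc);
    assert(Ac != NULL && Bc != NULL);

    for (int jc = 0; jc < n; jc += nc) {                       /* Loop 1 */
        int nb = min_int(nc, n - jc);
        for (int pc = 0; pc < k; pc += kc) {                   /* Loop 2 */
            int kb = min_int(kc, k - pc);
            pack_B(&B[pc + (size_t)jc * k], k, kb, nb, Bc);
            for (int ic = 0; ic < m; ic += mc) {               /* Loop 3 */
                int mb = min_int(mc, m - ic);
                pack_A(&A[ic + (size_t)pc * m], m, mb, kb, Ac);
                for (int jr = 0; jr < nb; jr += NR)            /* Loop 4 */
                    for (int ir = 0; ir < mb; ir += MR)        /* Loop 5 */
                        micro_kernel(kb, &Ac[(size_t)ir * kb],
                                     &Bc[(size_t)jr * kb],
                                     &C[(ic + ir) + (size_t)(jc + jr) * m], m,
                                     min_int(MR, mb - ir),
                                     min_int(NR, nb - jr));
            }
        }
    }
    free(Ac);
    free(Bc);
}
```

In this sketch, the body of Loop 3 after packing Ac plays the role of the macro-kernel, and the block sizes mc, kc and nc are run-time arguments so that small values can be used to exercise the edge cases.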
The parallelization of gemm in BLIS is analyzed in [18] for conventional multi-threaded processors and in [17] for extreme many-threaded architectures such as the IBM PowerPC A2 (16 cores/64 threads) and the Intel Xeon Phi (60 cores/240 threads). Basically, in both "types" of architectures, the parallel implementations exploit the concurrency available in the nested 5-loop organization of the matrix multiplication algorithm at one or multiple levels (i.e., loops). In general, the approach takes into account the cache organization of the processor (e.g., the presence of multiple sockets, which cache levels are shared/private, etc.), while discarding the parallelization of loops that would incur race conditions in the update of C as well as loops with too fine a granularity. These analyses [18, 17] can be summarized as follows:
• Parallelization of Loop 5 (indexed by ir). With this option, different threads execute different instances of the micro-kernel. Furthermore, they access the same column block Br (of nr columns) in the L1 cache. The amount of parallelism in this case, ⌈mc/mr⌉, is limited as mc is usually a few hundreds.
• Parallelization of Loop 4 (indexed by jr). Different threads access the same block Ac, of dimension mc × kc, in the L2 cache. The time spent in this loop amortizes the cost of packing (moving) the block of Ac from main memory into the L2 cache. The amount of parallelism, ⌈nc/nr⌉, is in general larger than in the previous case, as nc is frequently in the order of several hundreds up to a few thousands.
• Parallelization of Loop 3 (indexed by ic). Each thread packs a different block Ac into the L2 cache and executes a different instance of the macro-kernel. The number of iterations of this loop is not limited by the blocking sizes, but instead depends on the problem dimension m. When m is less than the product of mc and the degree of parallelization of the loop, the blocks Ac will be smaller than the optimal dimension and performance may suffer. When there is a shared L2 cache, the size of the blocks Ac will have to be reduced by a factor equal to the degree of parallelization of this loop. However, reducing mc is equivalent to parallelizing the first loop around the micro-kernel.
• Parallelization of Loop 2 (indexed by pc). This is not a good option because multiple threads simultaneously update the same parts of C, requiring a mechanism to deal with race conditions.
• Parallelization of Loop 1 (indexed by jc). From a data-sharing perspective, this option is equivalent to gaining parallelism outside of BLIS. In any case, this parallelization is reasonable on a multi-socket system where each CPU has a separate LLC (last-level cache).
To sum up, these are general guidelines to decide which loops are theoretically good candidates to be parallelized in order to fully exploit the cache hierarchy of a target architecture. At a glance, the combination of loops to parallelize strongly depends on which cache(s) are shared. Usually, Loop 1 (jc) is a good candidate when the LLC is separated for each CPU (e.g., a multi-socket platform with on-chip L3 cache); Loop 3 (ic) should be parallelized when each core has its own L2 cache; and Loops 4 and/or 5 (jr and ir, respectively) are to be parallelized when the cores share the L2 cache.
MATRIX MULTIPLICATION ON AMPS
The ODROID-XU3 contains a Samsung Exynos 5422 SoC with an ARM Cortex-A15 quad-core processing cluster (running at 1.6 GHz in our setup) and a Cortex-A7 quad-core processing cluster (at 1.3 GHz). Both clusters access a shared DDR3 RAM (2 Gbytes) via 128-bit coherent bus interfaces. Each ARM core (either Cortex-A15 or Cortex-A7) has a 32+32-Kbyte L1 (instruction+data) cache. The four ARM Cortex-A15 cores share a 2-Mbyte L2 cache, while the four ARM Cortex-A7 cores share a smaller 512-Kbyte L2 cache; see Figure 2. In order to attain high performance, a preliminary step is to determine the optimal block sizes (mc, kc, nc) for the target architecture and precision (all our experiments use IEEE 754 double-precision arithmetic). For this purpose, we performed an empirical search on the Cortex-A15 cores, which found the optimal values to be mc = 176 and kc = 368. In this architecture, nc plays a minor role and is simply set to nc = 4,096 (nc is usually related to the L3 cache, which is not present on these ARM CPUs). The micro-kernel for this architecture is hand-coded with mr = 4 and nr = 4. These optimal values are used in this work for both the Cortex-A7 and the Cortex-A15 cores.
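To put these block sizes in context, the following small helpers compute the memory footprint of the packed panels and the iteration counts of Loops 4 and 5. The function names are hypothetical, and the comparison against the cache capacities assumes that 1 Kbyte = 1,024 bytes. With mc = 176 and kc = 368, the block Ac occupies 176 · 368 · 8 = 518,144 bytes (about 506 Kbytes): a fraction of the 2-Mbyte L2 cache of the Cortex-A15 cluster, but almost the entire 512-Kbyte L2 cache of the Cortex-A7 cluster. A micro-panel Br of kc × nr elements occupies 368 · 4 · 8 = 11,776 bytes, well within a 32-Kbyte L1 data cache.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes occupied by a rows x cols panel of double-precision values. */
size_t panel_bytes(size_t rows, size_t cols)
{
    return rows * cols * sizeof(double);
}

/* Number of iterations needed to cover `extent` in steps of `step`,
   i.e., ceil(extent / step). */
size_t num_iterations(size_t extent, size_t step)
{
    assert(step > 0);
    return (extent + step - 1) / step;
}
```

For the parameters above, `num_iterations(176, 4)` gives 44 iterations for Loop 5 and `num_iterations(4096, 4)` gives 1,024 iterations for Loop 4, which illustrates why Loop 4 offers more parallelism than Loop 5.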
Mapping multi-threaded BLIS to AMPs
BLIS allows the user to select, at run time, which (one or more) of the five internal loops are parallelized. In particular, if one of the loops is parallelized, a static partition and mapping of loop iteration chunks to the OpenMP threads is performed prior to the beginning of the loop.
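A static partition of this kind can be sketched as follows. This is an illustrative helper with hypothetical names, not the BLIS routine: it assigns contiguous, near-uniform chunks of iterations to threads, giving one extra iteration to the first threads when the iteration count is not a multiple of the number of threads.

```c
#include <assert.h>

typedef struct { int begin, end; } range_t;   /* half-open range [begin, end) */

/* Contiguous chunk of iterations [0, n_iter) assigned to thread `tid`
   out of `n_threads`; chunk sizes differ by at most one iteration. */
range_t static_chunk(int n_iter, int n_threads, int tid)
{
    assert(n_iter >= 0 && n_threads > 0 && tid >= 0 && tid < n_threads);
    int base  = n_iter / n_threads;   /* iterations every thread receives   */
    int extra = n_iter % n_threads;   /* leftover, spread over first threads */
    range_t r;
    r.begin = tid * base + (tid < extra ? tid : extra);
    r.end   = r.begin + base + (tid < extra ? 1 : 0);
    return r;
}
```

For instance, 1,024 iterations of Loop 4 distributed over 4 threads yield chunks of 256 iterations each, while 10 iterations over 4 threads yield chunks of 3, 3, 2 and 2 iterations.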
Our asymmetric version of BLIS integrates the following three new features, which modify the behavior of the multi-threaded BLIS at run time in order to accommodate an AMP architecture: i) a mechanism to create "slow" and "fast" threads, which will be bound upon initialization of the library to LITTLE (Cortex-A7) and big (Cortex-A15) cores; ii) a mechanism to decide which one of the loops that are parallelized needs to be partitioned and assigned to slow/fast cores asymmetrically (thus, chunks assigned to threads will no longer be of uniform size, but partitioned according to the capabilities of each type of core); and iii) an interface to specify the ratio of performance between LITTLE and big cores, which will ultimately define the number of iterations assigned to each thread/core. All these mechanisms are currently controlled via environment variables, but the development of an ad-hoc API is part of ongoing work.
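As an illustration of feature iii), the sketch below reads a performance ratio from an environment variable. The variable name ASYM_GEMM_RATIO, the "big:little" text format, and the fallback to a 1:1 ratio are all assumptions made for this example; the text does not specify the names or formats used by the actual implementation.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse a ratio written as "big:little" (e.g., "6:1").
   Returns 1 on success and 0 on malformed or non-positive input. */
int parse_ratio(const char *s, int *big, int *little)
{
    int b, l;
    char tail;
    assert(big != NULL && little != NULL);
    /* Exactly two integers must match; a trailing character is an error. */
    if (s == NULL || sscanf(s, "%d:%d%c", &b, &l, &tail) != 2)
        return 0;
    if (b <= 0 || l <= 0)
        return 0;
    *big = b;
    *little = l;
    return 1;
}

/* Read the ratio from the (hypothetical) ASYM_GEMM_RATIO variable,
   falling back to a symmetric 1:1 distribution when it is unset or invalid. */
void ratio_from_env(int *big, int *little)
{
    if (!parse_ratio(getenv("ASYM_GEMM_RATIO"), big, little)) {
        *big = 1;
        *little = 1;
    }
}
```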
Given the memory organization of the target Exynos 5422 big.LITTLE SoC (private L1 cache per core, shared L2 cache per cluster, no L3 cache) and the guidelines for the parallelization of BLIS gemm given at the end of Section 2, we chose the parallelization approach explained next.
At a coarse grain, the computational workload of the complete multiplication C += A · B is distributed among the Cortex-A15 and Cortex-A7 clusters by parallelizing either Loop 1 (jc) or Loop 3 (ic). In order to preserve the optimal cache parameters during the execution of gemm, while attaining a distribution of the workload proportional to the computational power of the A15 vs. A7 clusters, we assign a different number of iterations of the parallelized loop to each cluster; see, e.g., Figure 3. In particular, the ratio applied to distribute the iteration space between the Cortex-A15 and Cortex-A7 for gemm has been empirically determined to be 6:1.
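One way to realize such a distribution is sketched below. It is an assumption of this example, not necessarily the exact policy of the implementation, that the cut between the two clusters is placed on a block boundary (a multiple of the block size) so that every full block keeps its optimal dimension, and that the number of blocks given to the big cluster is rounded to the nearest integer. All names are hypothetical.

```c
#include <assert.h>

typedef struct {
    int big_begin, big_end;         /* [begin, end) handled by the big cluster    */
    int little_begin, little_end;   /* [begin, end) handled by the LITTLE cluster */
} cluster_split_t;

/* Split the index range [0, extent), traversed in steps of `block`,
   between the two clusters according to ratio_big : ratio_little. */
cluster_split_t split_by_ratio(int extent, int block,
                               int ratio_big, int ratio_little)
{
    assert(extent >= 0 && block > 0 && ratio_big > 0 && ratio_little > 0);
    int n_blocks = (extent + block - 1) / block;   /* loop iterations */
    int total    = ratio_big + ratio_little;
    /* Whole blocks for the big cluster, rounded to the nearest integer. */
    int big_blocks = (n_blocks * ratio_big + total / 2) / total;
    int cut = big_blocks * block;
    if (cut > extent)
        cut = extent;
    cluster_split_t s = { 0, cut, cut, extent };
    return s;
}
```

For example, with Loop 3 parallelized, m = 4,928 (28 blocks of mc = 176 rows) and a 6:1 ratio, the big cluster receives 24 blocks (rows 0 to 4,223) and the LITTLE cluster the remaining 4 blocks. When the number of blocks is not a multiple of 7, as for m = 4,096 (24 blocks, the last one partial), the split only approximates the requested ratio.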
At a finer grain, the execution of each macro-kernel Cc += Ac · Bc (see Figure 1) is partitioned among the cores of the same type by parallelizing Loops 4 (jr), 5 (ir) or both; see, e.g., Figure 4.
EVALUATION OF PERFORMANCE AND ENERGY EFFICIENCY
The goal of the performance and energy efficiency tests in this section is to carry out an experimental study of both metrics, comparing the original multi-threaded implementation of gemm in BLIS against our asymmetric-aware implementation. In all tests, we ensure the cores run at their highest frequency by setting the performance governor. Codes are instrumented with the pmlib [1] framework, which collects power consumption data corresponding to instantaneous power readings from four independent sensors in the board (for the Cortex-A7 cores, Cortex-A15 cores, DRAM and GPU), with a sampling period of 200 ms.
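The relation between sampled power, energy and the GFLOPS/W metric can be sketched as follows. The helpers below do not use the pmlib API; they are hypothetical functions that take an array of equally spaced power samples (in Watts) and approximate the energy with a rectangle rule, which is an assumption of this example. The flop count of gemm is taken as 2mnk, and GFLOPS/W is computed as Gflop per Joule, since (flops per second) per Watt equals flops per Joule.

```c
#include <assert.h>
#include <stddef.h>

/* Energy in Joules from `n` equally spaced power samples (Watts),
   approximated as the sum of sample * sampling period. */
double energy_joules(const double *watts, size_t n, double period_s)
{
    assert(watts != NULL && period_s > 0.0);
    double e = 0.0;
    for (size_t i = 0; i < n; ++i)
        e += watts[i] * period_s;
    return e;
}

/* Arithmetic cost of C += A * B in Gflop (2mnk flops). */
double gemm_gflop(double m, double n, double k)
{
    return 2.0 * m * n * k / 1e9;
}

/* Energy efficiency: GFLOPS/W, equivalently Gflop per Joule. */
double gflops_per_watt(double gflop, double joules)
{
    assert(joules > 0.0);
    return gflop / joules;
}
```

For m = n = k = 4,096, `gemm_gflop` returns about 137.4 Gflop; dividing this value by the measured energy-to-solution gives the GFLOPS/W figure for that run.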
The first round of experiments analyzes the performance and energy behavior of the Cortex-A7 and the Cortex-A15 core types when working in isolation. For this purpose, we execute a collection of gemm kernels using one of the fine-grain parallelizations described in Section 3. Concretely, as the L2 cache is shared among the cores of a cluster, we parallelize Loop 4 using 1, 2, 3 and 4 threads (cores), with the performance and energy results in Figure 5. These plots reveal that the Cortex-A15 cores clearly deliver higher performance, with a rough increase of 2.5 GFLOPS per core, attaining a peak performance of about 10.2 GFLOPS with 4 threads. For the Cortex-A7 cores, the performance peaks at around 2.0 GFLOPS, which is also attained with 4 cores. Regarding energy efficiency, the Cortex-A15 obtains the best results in terms of GFLOPS/W. However, the benefits from increasing the number of threads in this case are less significant (0.055 GFLOPS/W per core) when compared with those obtained with the Cortex-A7 cores (0.193 GFLOPS/W per core). It is also worth emphasizing that the use of 4 Cortex-A7 cores is more energy-efficient than an alternative that leverages a single Cortex-A15 core, though the overall performance of the former is slightly worse.
The second round of experiments evaluates the performance and energy efficiency of the asymmetric-aware port of BLIS to the big.LITTLE architecture. For this purpose, we run a collection of gemm kernels, relying on a 2-way parallelization to distribute iterations of Loop 3 (see Section 3), with a ratio of 6:1, among the cores of the fast and slow clusters, thus taking advantage of the independent L2 cache per cluster. For the fine-grain parallelization, 4 threads are leveraged in order to assign chunks of the iteration space for Loop 4 to each core within the cluster. Our experiments with different configurations revealed this option to be the most efficient for the target big.LITTLE architecture. Figure 6 reports the results for this second evaluation. The line labeled as "big.LITTLE (4+4 threads)" corresponds to the asymmetric-aware implementation. The same gemm kernels were computed with BLIS using a symmetric workload distribution (the iteration space is equally distributed among the Cortex-A7 and Cortex-A15 cores), with the results labeled as "A7+A15 (4+4 threads)" in the figure. For comparison purposes, the performance and energy obtained using exclusively four Cortex-A7 or four Cortex-A15 cores are also added. Finally, the "ideal" line corresponds to the sum of the peak performances of the configurations that use four cores of each of the two types in isolation (i.e., the performance of the four Cortex-A15 cores plus the performance of the four Cortex-A7 cores).
These performance results show that the AMP configuration outperforms all other configurations and comes close to the ideal case. The increment compared to the configuration that employs four Cortex-A15 cores for the largest tested problem is close to 20%. The asymmetric version does not outperform the original version for small matrices though, as the chunks assigned to the big and LITTLE cores are, in those cases, too small to exploit the asymmetric architecture. In terms of energy efficiency, the AMP configuration is as efficient as the homogeneous setup that uses exclusively the four Cortex-A15 cores.
The symmetric workload distribution attains about 40% of the highest performance that is observed when employing only the Cortex-A15 cores. The reason is that, with the symmetric workload distribution, thread scheduling is delegated to the operating system or the OpenMP runtime, using a homogeneous distribution of chunks. This causes a severe load imbalance, as the fast Cortex-A15 threads finish processing their assigned chunks and then have to wait a long time for the Cortex-A7 threads to complete theirs. The energy efficiency is also affected, and this configuration achieves the worst energy efficiency of all.

To explain the energy efficiency of our implementations in more detail, Table 1 shows a breakdown of power/energy per component of the SoC for a particular problem size: m = n = k = 4,096. This table shows the (average) power consumption and energy efficiency when employing i) from 1 to 4 threads of a single cluster; ii) the AMP configuration with all 4+4 cores; and iii) the symmetric configuration of BLIS using all 4+4 cores. The first four columns report the average power consumption gathered from the SoC sensors, while the average power consumption of the entire SoC is in the fifth column. The performance achieved by the different configurations is reported in the sixth column and the energy efficiency is displayed in the last one.
The first aspect to note is that, as expected, the Cortex-A15 cores dissipate more power than the Cortex-A7 cores. Indeed, a single Cortex-A15 core roughly doubles the power dissipation rate of four combined Cortex-A7 cores, and the Cortex-A15 CPU in idle state consumes more power than two Cortex-A7 cores in execution. A second issue is that the memory (DRAM) and total power consumption of the AMP and symmetric configurations are close to those obtained by adding the corresponding values of the two CPU clusters in isolation. An exception is the total power consumption with the symmetric configuration, in which a significant decrease is observed due to the Cortex-A15 cores completing their share of the work much earlier than the Cortex-A7 cores. This aspect strongly affects the energy efficiency of the symmetric configuration, as the power consumption is three times higher than that obtained with the entire Cortex-A7 cluster, but the performance is only doubled. As expected, the AMP configuration is the one that dissipates the most power, as it fully utilizes all the available resources. On the other hand, it also obtains the shortest execution time, yielding the best energy-to-solution.
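The load imbalance of the symmetric distribution can be reproduced with a simple model, sketched below under strong assumptions: each cluster sustains its isolated peak rate (about 10.2 GFLOPS for the four Cortex-A15 cores and about 2.0 GFLOPS for the four Cortex-A7 cores, as reported above), there is no interference between clusters, and the execution time is that of the slower cluster. The function name is hypothetical.

```c
#include <assert.h>

/* Aggregate GFLOPS when a fraction `frac_big` of the work is assigned to
   the big cluster (rate p_big) and the rest to the LITTLE cluster (rate
   p_little); the run ends when the slower cluster finishes. */
double combined_gflops(double p_big, double p_little, double frac_big)
{
    assert(p_big > 0.0 && p_little > 0.0);
    assert(frac_big >= 0.0 && frac_big <= 1.0);
    double t_big    = frac_big / p_big;              /* time per unit of work */
    double t_little = (1.0 - frac_big) / p_little;
    double t = (t_big > t_little) ? t_big : t_little;
    return 1.0 / t;
}
```

Under this model, an even split (`frac_big = 0.5`) yields 4.0 GFLOPS, i.e., twice the rate of the Cortex-A7 cluster and roughly 39% of the Cortex-A15 peak, in line with the behavior described for the symmetric configuration. A 6:1 split (`frac_big = 6.0/7.0`) yields about 11.9 GFLOPS, and the model's upper bound of 12.2 GFLOPS, the sum of both peaks, is reached when the split matches the ratio of the peaks (10.2:2.0).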
CONCLUSIONS
In this paper, we have proposed several mechanisms to map the high-performance multi-threaded implementation of the matrix multiplication in the BLIS library to an asymmetric ARM big.LITTLE (Cortex A15+A7) SoC. Our results reveal excellent improvements in performance compared with a homogeneous implementation that operates exclusively on one type of core (either A15 or A7), and also with respect to multi-threaded implementations that rely on a symmetric work distribution and delegate scheduling to the operating system. This is the first step towards a full BLAS implementation optimized for big.LITTLE architectures, which is the ultimate goal of our work. We believe that the approach applied to gemm carries over to the rest of the BLAS. However, there are still a number of issues that need to be addressed to further increase performance and adaptation to the architecture. Among those, the most significant ones are the integration of different micro-kernels and block sizes tuned to each type of core in order to extract the maximum performance, and the dynamic distribution and mapping of the workload to each type of core transparently to the programmer. A port to a 64-bit ARMv8 architecture and an experimental study on architectures with different numbers of big/LITTLE cores are also key milestones in our roadmap.

Table 1: Power consumption breakdown and energy efficiency for DGEMM (m = n = k = 4096) on the Exynos 5422 SoC, using different thread configurations. The rows labeled as Asymmetric BLIS and Symmetric BLIS use all the available eight cores in the SoC, using our modified BLIS version and the original BLIS multi-threaded implementation, respectively.
