Performance and Energy Optimization of Matrix Multiplication on
  Asymmetric big.LITTLE Processors by Catalán, Sandra et al.
arXiv:1507.05129v1 [cs.DC] 17 Jul 2015
Performance and Energy Optimization of Matrix
Multiplication on Asymmetric big.LITTLE Processors
Sandra Catalán
Depto. Ingeniería y Ciencia de
Computadores
Universitat Jaume I, Castellón,
Spain
catalans@uji.es
Francisco D. Igual
Depto. Arquitectura de
Computadores y Automática
Universidad Complutense de
Madrid, Spain
figual@ucm.es
Rafael Mayo
Depto. Ingeniería y Ciencia de
Computadores
Universitat Jaume I, Castellón,
Spain
mayo@uji.es
Luis Piñuel
Depto. Arquitectura de
Computadores y Automática
Universidad Complutense de
Madrid, Spain
lpinuel@ucm.es
Enrique S. Quintana-Ortí
Depto. Ingeniería y Ciencia de
Computadores
Universitat Jaume I, Castellón,
Spain
quintana@uji.es
Rafael Rodríguez-Sánchez
Depto. Ingeniería y Ciencia de
Computadores
Universitat Jaume I, Castellón,
Spain
rarodrig@uji.es
ABSTRACT
Asymmetric processors have emerged as an appealing technology
for severely energy-constrained environments, especially in the
mobile market, where heterogeneity in applications is mainstream.
In addition, given the growing interest in ultra-low-power
architectures for high performance computing, platforms of this
type are also being investigated on the road towards the
implementation of energy-efficient high-performance scientific
applications. In this
paper, we propose a first step towards a complete imple-
mentation of the BLAS interface adapted to asymmetric
ARM big.LITTLE processors, analyzing the trade-offs be-
tween performance and energy efficiency when compared
to existing homogeneous (symmetric) multi-threaded BLAS
implementations. Our experimental results reveal important
gains in performance while maintaining the energy efficiency
of homogeneous solutions by efficiently exploiting all the re-
sources of the asymmetric processor.
Categories and Subject Descriptors
C.1.3 [Computer Systems Organization]: Other Archi-
tecture Styles—heterogeneous (hybrid) systems; C.4 [Per-
formance of systems]: Performance and energy efficiency;
G.4 [Mathematical Software]: Efficiency
1. INTRODUCTION
The decay of Dennard scaling [4] during the past decade
marked the end of the “GHz race” and the shift towards
multicore designs due to their more favorable performance-
energy ratio. In addition, the doubling of transistors on
chip with each new semiconductor generation, dictated by
Moore’s law [15], has only exacerbated the power wall problem
[5, 12, 14], leading to the rise of “dark silicon” [6] and
the deployment of heterogeneous facilities for high performance
computing.
Asymmetric multicore processors (AMPs) are a particular
class of heterogeneous architectures equipped with cores that
share the same instruction set architecture (see footnote 1) but differ in
performance, complexity, and power consumption. AMPs have
recently received considerable attention as a means to im-
prove the performance-energy ratio of computing systems [9,
8, 16, 20], mainly by exploiting the presence of serial and
parallel phases within applications.
In this paper we investigate the practical performance-power-
energy balance of ARM’s asymmetric big.LITTLE technology,
employing as a case study the compute-intensive general
matrix multiplication (gemm): C += A · B, where the
sizes of A, B, C are respectively m × k, k × n, m × n. Most
previous related work targets the parallelization of gemm on
i) distributed-memory heterogeneous architectures (see [3, 2]
and references therein); or ii) asymmetric multicores, but us-
ing trivial (unoptimized) implementations of gemm [11, 10].
Compared with these other efforts, our paper makes the fol-
lowing contributions: First, we leverage a static mapping
of threads and we propose a workload partitioning strategy
of the BLIS implementation of gemm specifically tailored
for the Exynos 5422 big.LITTLE architecture, a system-
on-chip (SoC) featuring two processing clusters: an ARM
Cortex-A15 quad core and a Cortex-A7 quad core. Second,
we perform a detailed evaluation of our solution in terms of
performance compared with that of the symmetric counter-
part on each of the processing clusters of the Exynos 5422.
Third, we perform an energy efficiency evaluation of each
approach using the GFLOPS/W metric (equivalent to billions
of floating-point arithmetic operations, or flops, per
Joule).
Footnote 1: According to this definition, servers equipped with one (or
more) general-purpose multicore processor(s) and a PCIe-attached
graphics accelerator, or systems-on-chip like the
NVIDIA Tegra TK1, are excluded from this category.
Loop 1  for jc = 0, . . . , n − 1 in steps of nc
Loop 2    for pc = 0, . . . , k − 1 in steps of kc
            B(pc : pc + kc − 1, jc : jc + nc − 1) → Bc          // Pack into Bc
Loop 3      for ic = 0, . . . , m − 1 in steps of mc
              A(ic : ic + mc − 1, pc : pc + kc − 1) → Ac        // Pack into Ac
Loop 4        for jr = 0, . . . , nc − 1 in steps of nr         // Macro-kernel
Loop 5          for ir = 0, . . . , mc − 1 in steps of mr
                  Cc(ir : ir + mr − 1, jr : jr + nr − 1)        // Micro-kernel
                    += Ac(ir : ir + mr − 1, 0 : kc − 1)
                     · Bc(0 : kc − 1, jr : jr + nr − 1)
                endfor
              endfor
            endfor
          endfor
        endfor
Figure 1: High performance implementation of gemm in BLIS. In the code, Cc ≡ C(ic : ic + mc − 1, jc : jc + nc − 1)
is just a notation artifact, introduced to ease the presentation of the algorithm, while Ac, Bc correspond to
actual buffers that are involved in data copies.
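As a minimal sketch of the loop nest in Figure 1 (not the BLIS code itself), the five loops and the two packing steps can be written in plain Python. The function names gemm_blocked and gemm_reference, the list-of-lists storage, and the tiny default block sizes are assumptions made for illustration only; the micro-kernel is written here as a plain triple loop rather than in assembly.

```python
def gemm_blocked(A, B, C, mc=4, kc=3, nc=4, mr=2, nr=2):
    """C += A * B following the five-loop structure of Figure 1.

    A, B, C are row-major lists of lists. The default block sizes are
    illustrative placeholders, not tuned values.
    """
    m, k, n = len(A), len(B), len(B[0])
    for jc in range(0, n, nc):                      # Loop 1
        nb = min(nc, n - jc)
        for pc in range(0, k, kc):                  # Loop 2
            kb = min(kc, k - pc)
            # Pack B(pc:pc+kb, jc:jc+nb) into the buffer Bc
            Bc = [B[pc + p][jc:jc + nb] for p in range(kb)]
            for ic in range(0, m, mc):              # Loop 3
                mb = min(mc, m - ic)
                # Pack A(ic:ic+mb, pc:pc+kb) into the buffer Ac
                Ac = [A[ic + i][pc:pc + kb] for i in range(mb)]
                for jr in range(0, nb, nr):         # Loop 4 (macro-kernel)
                    for ir in range(0, mb, mr):     # Loop 5
                        # Micro-kernel on a block of at most mr x nr entries
                        for i in range(ir, min(ir + mr, mb)):
                            for j in range(jr, min(jr + nr, nb)):
                                acc = 0.0
                                for p in range(kb):
                                    acc += Ac[i][p] * Bc[p][j]
                                C[ic + i][jc + j] += acc
    return C


def gemm_reference(A, B, C):
    """Unblocked triple loop, used only to check the blocked version."""
    m, k, n = len(A), len(B), len(B[0])
    return [[C[i][j] + sum(A[i][p] * B[p][j] for p in range(k))
             for j in range(n)] for i in range(m)]
```

The blocked routine updates C in place, so a comparison against gemm_reference should be made on a copy of the initial C.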
2. MATRIX MULTIPLICATION FOR
GENERAL-PURPOSE PROCESSORS
Modern implementations of gemm for general-purpose ar-
chitectures, including BLIS and OpenBLAS, follow the ap-
proach pioneered by GotoBLAS [7]. Concretely, BLIS im-
plements gemm as three nested loops around a macro-kernel
plus two packing routines (see Loops 1–3 in Figure 1). The
macro-kernel is then implemented in terms of two additional
loops around a micro-kernel (Loops 4 and 5 in Figure 1). In
BLIS, the micro-kernel is typically implemented as a loop
around a rank–1 (i.e., outer product) update using assembly
or with vector intrinsics, while the remaining five loops are
implemented in C; see [19] for further details. Furthermore,
the BLIS (cache) optimization parameters nc, kc, mc, nr
and mr are adjusted taking into account the latencies of the
floating-point units (FPUs), number of vector registers, and
size/associativity degree of the cache levels. The goal is that
Ac and a narrow column panel of Bc, say Br, are fed into
the floating-point units from the L2 and L1 caches, respectively,
and these transfers are fully amortized with enough
computation from within the micro-kernel; see [13].
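To illustrate the statement that the micro-kernel is a loop around a rank-1 update, the following sketch accumulates an mr × nr block of C as a sequence of kc outer products. It is a plain-Python illustration under assumed names (micro_kernel, Ar, Br, Cr); the actual BLIS micro-kernels are written in assembly or with vector intrinsics and keep the block of C in registers.

```python
def micro_kernel(Ar, Br, Cr):
    """Cr (mr x nr) += Ar (mr x kc) * Br (kc x nr), one rank-1 update per step."""
    mr, kc, nr = len(Ar), len(Br), len(Br[0])
    for p in range(kc):
        a_col = [Ar[i][p] for i in range(mr)]    # p-th column of the A micro-panel
        b_row = Br[p]                            # p-th row of the B micro-panel
        for i in range(mr):
            for j in range(nr):
                Cr[i][j] += a_col[i] * b_row[j]  # outer-product (rank-1) update
    return Cr
```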
The parallelization of gemm in BLIS is analyzed in [18] for
conventional multi-threaded processors and [17] for extreme
many-threaded architectures such as the IBM PowerPC A2
(16 cores/64 threads) and the Intel Xeon Phi (60 cores/240
threads). Basically, in both “types” of architectures, the parallel
implementations exploit the concurrency available in
the nested 5–loop organization of the matrix multiplication
algorithm at one or multiple levels (i.e., loops). In general,
the approach takes into account the cache organization of
the processor (e.g., the presence of multiple sockets, which
cache levels are shared/private, etc.), while discarding the
parallelization of loops that would incur race conditions
in the update of C as well as loops with too fine a granularity.
These analyses [18, 17] can be summarized as follows:
• Parallelization of Loop 5 (indexed by ir). With this
option, different threads execute different instances of
the micro-kernel. Furthermore, they access the same
column block Br (of nr columns) in the L1 cache. The
amount of parallelism in this case, ⌈mc/mr⌉, is limited as
mc is usually a few hundred.
• Parallelization of Loop 4 (indexed by jr). Different
threads access the same block Ac, of dimension mc × kc,
in the L2 cache. The time spent in this loop amortizes
the cost of packing (moving) the block of Ac from main
memory into the L2 cache. The amount of parallelism,
⌈nc/nr⌉, is in general larger than in the previous case, as
nc is frequently on the order of several hundreds up to
a few thousands.
• Parallelization of Loop 3 (indexed by ic). Each thread
packs a different block Ac into the L2 cache and ex-
ecutes a different instance of the macro-kernel. The
number of iterations of this loop is not limited by the
blocking sizes, but instead depends on the problem di-
mension m. When m is less than the product of mc
and the degree of parallelization of the loop, the blocks
Ac will be smaller than the optimal dimension and per-
formance may suffer. When there is a shared L2 cache,
the size of the blocks Ac will have to be reduced by a
factor equal to the degree of parallelization of this loop.
However, reducing mc is equivalent to parallelizing the
first loop around the micro-kernel.
• Parallelization of Loop 2 (indexed by pc). This is not a
good option because multiple threads simultaneously
update the same parts of C, requiring a mechanism to
deal with race conditions.
• Parallelization of Loop 1 (indexed by jc). From a data-
sharing perspective, this option is equivalent to gaining
parallelism outside of BLIS. In any case, this paral-
lelization is reasonable on a multi-socket system where
each CPU has a separate LLC (last-level cache).
To sum up, these are general guidelines to decide which loops
are theoretically good candidates to be parallelized in order
to fully exploit the cache hierarchy of a target architecture.
At a glance, the combination of loops to parallelize strongly
depends on which cache(s) are shared. Usually, Loop 1 (jc)
is a good candidate when the LLC is separated for each
CPU (e.g., a multi-socket platform with on-chip L3 cache);
Loop 3 (ic) should be parallelized when each core has its own
L2 cache; and Loops 4 and/or 5 (jr and ir, respectively) are
to be parallelized when the cores share the L2 cache.
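The guidelines above can be condensed into a small decision helper. This is only a sketch of the rules of thumb stated in this section; the function name candidate_loops and its three boolean parameters are hypothetical and do not correspond to any BLIS interface.

```python
def candidate_loops(llc_private_per_socket, l2_private_per_core, l2_shared):
    """Return the loops of Figure 1 suggested for parallelization.

    Loop 2 (pc) is never suggested, since it would create race
    conditions in the update of C.
    """
    loops = []
    if llc_private_per_socket:      # e.g., multi-socket platform with on-chip L3
        loops.append("Loop 1 (jc)")
    if l2_private_per_core:         # each core packs its own block Ac
        loops.append("Loop 3 (ic)")
    if l2_shared:                   # cores share Ac in the L2 cache
        loops.extend(["Loop 4 (jr)", "Loop 5 (ir)"])
    return loops
```

For a cluster of cores that share an L2 cache and have no L3 cache, candidate_loops(False, False, True) suggests Loops 4 and 5.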
3. MATRIX MULTIPLICATION ON AMPS
The ODROID-XU3 contains a Samsung Exynos 5422 SoC
with an ARM Cortex-A15 quad-core processing cluster (running
at 1.6 GHz in our setup) and a Cortex-A7 quad-core
processing cluster (at 1.3 GHz). Both clusters access a
shared DDR3 RAM (2 Gbytes) via 128-bit coherent bus in-
terfaces. Each ARM core (either Cortex-A15 or Cortex-A7)
has a 32+32-Kbyte L1 (instruction+data) cache. The four
ARM Cortex-A15 cores share a 2-Mbyte L2 cache, while
the four ARM Cortex-A7 cores share a smaller 512-Kbyte
L2 cache; see Figure 2.
Figure 2: Exynos 5422 block diagram.
In order to attain high performance, a preliminary step is
to determine the optimal block sizes (mc, kc, nc) for the
target architecture and precision (all our experiments use
IEEE 754 double-precision arithmetic). For this purpose, we
performed an empirical search on the Cortex-A15 cores,
finding the optimal values to be mc = 176 and kc = 368. In
this architecture, nc plays a minor role and is simply set to
nc = 4,096 (nc is usually related to the L3 cache, which is not
present on these ARM CPUs). The micro-kernel for this
architecture is hand-coded with mr = 4 and nr = 4. These
optimal values are used in this work for both the Cortex-A7
and the Cortex-A15 cores.
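A short calculation relates these block sizes to the cache capacities and to the amounts of parallelism ⌈mc/mr⌉ and ⌈nc/nr⌉ discussed in Section 2. The sketch assumes 8 bytes per double-precision value and ignores any padding or alignment of the packed buffers; the variable names are illustrative.

```python
import math

DOUBLE = 8                     # bytes per IEEE 754 double-precision value
mc, kc, nc = 176, 368, 4096    # block sizes reported for the Cortex-A15
mr, nr = 4, 4                  # micro-kernel dimensions

ac_bytes = mc * kc * DOUBLE    # packed block Ac, intended to reside in L2
br_bytes = kc * nr * DOUBLE    # micro-panel Br, intended to reside in L1

loop5_iters = math.ceil(mc / mr)   # parallelism available in Loop 5
loop4_iters = math.ceil(nc / nr)   # parallelism available in Loop 4

print(ac_bytes, br_bytes, loop5_iters, loop4_iters)
```

Under these assumptions Ac occupies 518,144 bytes (506 KiB) and Br 11,776 bytes (11.5 KiB). Ac fits comfortably in the 2-Mbyte L2 cache of the Cortex-A15 cluster, but it takes nearly all of the 512-Kbyte L2 cache of the Cortex-A7 cluster, which suggests why block sizes tuned per core type may be worth exploring.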
3.1 Mapping multi-threaded BLIS to AMPs
BLIS allows the user to select, at run time, which (one or more) of
the five internal loops are parallelized. In particular, if one
of the loops is parallelized, a static partition and mapping of
loop iteration chunks to the OpenMP threads is performed
prior to the beginning of the loop.
Our asymmetric version of BLIS integrates the following
three new features, which modify the behavior of the multi-threaded
BLIS at run time, in order to accommodate an AMP
architecture: i) a mechanism to create “slow” and “fast”
threads, which will be bound upon initialization of the library
to LITTLE (Cortex-A7) and big (Cortex-A15) cores, respectively;
ii) a mechanism to decide which one of the loops that are
parallelized needs to be partitioned and assigned to slow/fast
cores asymmetrically (thus, chunks assigned to threads will
no longer be of uniform size, but partitioned according to
the capabilities of each type of core); and iii) an interface
to specify the ratio of performance between LITTLE and
big cores, which will ultimately define the number of iter-
ations assigned to each thread/core. All these mechanisms
are currently modified via environment variables, but the
development of an ad-hoc API is part of ongoing work.
For the target Exynos 5422 SoC, given the memory organization
of this big.LITTLE architecture (private L1 cache
per core, shared L2 cache per cluster, lack of L3 cache), and
the guidelines given for the parallelization of BLIS gemm at
the end of section 2, we chose the approach explained next
for the parallelization on the target Exynos 5422 AMP.
At a coarse grain, the computational workload of the complete
multiplication C += A · B is distributed among the
Cortex-A15 and Cortex-A7 clusters by parallelizing either
Loop 1 (jc) or Loop 3 (ic). In order to preserve the optimal cache
parameters during the execution of gemm, while attaining a
distribution of the workload proportional to the computational
power of the A15 and A7 clusters, we assign a different number
of iterations of the parallelized loop to each cluster; see,
e.g., Figure 3. In particular, the ratio applied to distribute
the iteration space between the Cortex-A15 and Cortex-A7
clusters for gemm has been empirically determined to be 6:1 (see footnote 2).
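One possible way to realize such a ratio-based split is sketched below. The function split_by_ratio and its rounding policy are assumptions made for illustration, not the partitioning code used in the modified BLIS; the split point is kept at a multiple of the block size so that all chunks except possibly the last retain the tuned dimension.

```python
def split_by_ratio(total, block, ratio_big=6, ratio_little=1):
    """Split the range [0, total) between the big and LITTLE clusters.

    The split point is a multiple of `block`. Returns two (start, end)
    pairs: the first for the big cluster, the second for the LITTLE one.
    """
    n_blocks = -(-total // block)                       # ceil(total / block)
    big_blocks = round(n_blocks * ratio_big / (ratio_big + ratio_little))
    big_blocks = max(0, min(n_blocks, big_blocks))
    cut = min(total, big_blocks * block)
    return (0, cut), (cut, total)
```

For Loop 3 with m = 4096 and mc = 176 there are 24 blocks; with a 6:1 ratio this sketch hands 21 of them (rows 0 to 3695) to the Cortex-A15 cluster and the remaining rows 3696 to 4095 to the Cortex-A7 cluster.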
At a finer grain, the execution of each macro-kernel Cc +=
Ac · Bc (see Figure 1) is partitioned among the cores of the
same type by parallelizing Loop 4 (jr), Loop 5 (ir), or both; see,
e.g., Figure 4.
Figure 3: Workload distributions for the matrix
multiplication C += A · B between the A15 and A7
quad-core clusters. Top: parallelization of Loop 1
(jc); bottom: parallelization of Loop 3 (ic). In the
bottom plot, the small rectangles, delimited by the
fine lines, denote the operands of the macro-kernel
Cc += Ac ·Bc.
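The fine-grain distribution among the cores of one cluster (Figure 4) follows a static mapping of iteration chunks to threads. The sketch below mimics the round-robin assignment of OpenMP's schedule(static, chunk) clause; whether BLIS uses exactly this clause is an assumption, and the function name static_schedule is hypothetical.

```python
def static_schedule(n_iters, n_threads, chunk):
    """Assign iterations to threads in round-robin chunks of size `chunk`,
    as OpenMP's schedule(static, chunk) does."""
    owner = [[] for _ in range(n_threads)]
    for it in range(n_iters):
        owner[(it // chunk) % n_threads].append(it)
    return owner
```

With 16 iterations of Loop 4, four threads and a chunk size of 2, thread C0 receives iterations 0, 1, 8 and 9, thread C1 receives 2, 3, 10 and 11, and so on.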
Footnote 2: This ratio varies depending on the target architecture, core
operating frequency, and specific routine, so it should be
adjusted accordingly.
Figure 4: Workload distributions for the macro-kernel
multiplication Cc += Ac · Bc between four cores
of the same type (C0, C1, C2, C3). Top: parallelization
of Loop 4 (jr); bottom: parallelization of
Loop 5 (ir). In this example, the OpenMP chunk
size equals 2 in the first case and 4 in the second.
4. EVALUATION OF PERFORMANCE AND
ENERGY EFFICIENCY
The goal of the performance and energy efficiency tests in
this section is to carry out an experimental study of both
metrics comparing the original multi-threaded implementation of gemm in
BLIS against our asymmetric-aware implementation. In all
tests, we ensure the cores run at their highest frequency by
setting the performance governor. Codes are instrumented
with the pmlib [1] framework, which collects power consumption
data corresponding to instantaneous power readings
from four independent sensors on the board (for the
Cortex-A7 cores, Cortex-A15 cores, DRAM and GPU), with
a sampling period of 200 ms.
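Average power and energy can be derived from such periodic readings with a simple rectangle rule. The sketch below is not the pmlib interface; the function names and the sample values in the usage note are made up for illustration, and only the 200 ms period comes from the setup described above.

```python
def average_power(samples_w):
    """Mean of the instantaneous power readings, in watts."""
    return sum(samples_w) / len(samples_w)


def energy_joules(samples_w, period_s=0.2):
    """Approximate energy as the sum of the readings times the sampling period."""
    return sum(samples_w) * period_s
```

For example, four hypothetical readings of 5.0, 6.0, 7.0 and 6.0 W taken 200 ms apart give an average power of 6.0 W and an energy of about 4.8 J.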
The first round of experiments analyzes the performance
and energy behavior of the Cortex-A7 and the Cortex-A15
core types when working in isolation. For this purpose, we
execute a collection of gemm kernels using one of the fine-
grain parallelization options presented in Section 3. Concretely, as
the L2 cache is shared among the cores of a cluster, we par-
allelize Loop 4 using 1, 2, 3 and 4 threads (cores), with the
performance and energy results in Figure 5. These plots
reveal that the Cortex-A15 cores clearly deliver higher per-
formance, with a rough increase of 2.5 GFLOPS per core,
attaining a peak performance of about 10.2 GFLOPS with
4 threads. For the Cortex-A7 cores, the performance peaks
at around 2.0 GFLOPS and is also attained with 4 cores.
Regarding energy efficiency, the Cortex-A15 obtains the best
results in terms of GFLOPS/W. However, the benefits from
increasing the number of threads in this case are less sig-
nificant (0.055 GFLOPS/W per core) when compared with
those obtained with the Cortex-A7 cores (0.193 GFLOPS/W
per core). It is also worth emphasizing that the use of 4
Cortex-A7 cores is more energy-efficient than an alternative
that leverages a single Cortex-A15 core, though the overall
performance of the former is slightly worse.
The second round of experiments evaluates the performance
and energy efficiency of the asymmetric-aware port of BLIS
to the big.LITTLE architecture. For this purpose, we run a
collection of gemm kernels, relying on a 2-way parallelization
to distribute iterations of Loop 3 (see Section 3), with
a ratio of 6:1, among the cores of the fast and slow clusters,
thus taking advantage of the independent L2 cache per
cluster. For the fine-grain parallelization, 4
threads are leveraged in order to assign chunks of the iteration
space for Loop 4 to each core within the cluster.
Our experiments with different configurations revealed this
option to be the most efficient for the target big.LITTLE
architecture.
Figure 5: Performance (top, in GFLOPS) and energy efficiency
(bottom, in GFLOPS/W) of the BLIS DGEMM on the Exynos 5422 as a
function of the problem size, using exclusively one type of core
(Cortex-A15 or Cortex-A7), for a varying number of threads (1 to 4).
Figure 6 reports the results for this second evaluation. The
line labeled as “big.LITTLE (4+4 threads)” corresponds to
the asymmetric-aware implementation. The same gemm
kernels were computed with BLIS using a symmetric work-
load distribution (the iteration space is equally distributed
among the Cortex-A7 and Cortex-A15 cores), with the re-
sults labelled as “A7+A15 (4+4 threads)” in the figure. For
comparison purposes, the performance and energy obtained
using exclusively four Cortex-A7 or four Cortex-A15 CPUs
are also added. Finally, the “ideal” line corresponds to the
sum of the peak performances of the configurations that use
four cores of each of the two types in isolation (i.e., the per-
formance of the four Cortex-A15 cores plus the performance
of the four Cortex-A7 cores).
These performance results show that the AMP configuration
outperforms the peak performance of all other configurations,
being close to the ideal case. The increment compared to
the configuration that employs four Cortex-A15 cores for
the largest tested problem is close to 20%. The asymmetric
version does not outperform the original version for small
matrices though, as the chunks assigned to the big and LITTLE
cores are, in those cases, too small to exploit the asymmetric
architecture. In terms of energy efficiency, the AMP
configuration is as efficient as the symmetric setup using
exclusively four Cortex-A15 cores.
The symmetric workload distribution attains about 40% of
the highest performance that is observed when employing
only the Cortex-A15 cores. The reason is that, with the
symmetric workload distribution, thread scheduling is del-
egated to the operating system or the OpenMP runtime,
using a homogeneous distribution of chunks. This causes a
severe load imbalance as the fast Cortex-A15 threads fin-
ish processing their assigned chunk, and have to wait a long
time for the Cortex-A7 threads to complete their assign-
ment. The energy-efficiency is also affected, and this config-
uration achieves the worst energy-efficiency.
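A back-of-the-envelope model helps to see the effect of the load imbalance. The sketch below assumes, as a simplification that is not part of the measurements, that each cluster sustains its isolated 4-thread rate from Table 1 and that the slower cluster determines the completion time; contention for memory and synchronization costs are ignored. The function name static_split_gflops is hypothetical.

```python
def static_split_gflops(frac_big, gflops_big, gflops_little):
    """Aggregate rate when a fraction `frac_big` of the work goes to the big
    cluster and the rest to the LITTLE one; the slower side sets the time."""
    t_big = frac_big / gflops_big
    t_little = (1.0 - frac_big) / gflops_little
    return 1.0 / max(t_big, t_little)


A15, A7 = 10.374, 2.086                            # GFLOPS of 4xA15 and 4xA7 (Table 1)
symmetric = static_split_gflops(0.5, A15, A7)      # equal shares for both clusters
asymmetric = static_split_gflops(6 / 7, A15, A7)   # 6:1 split
```

Under these assumptions the model yields roughly 4.2 GFLOPS for the equal split and 12.1 GFLOPS for the 6:1 split, in line with the 3.897 and 12.035 GFLOPS reported in Table 1 for the symmetric and asymmetric configurations.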
Figure 6: Performance (top, in GFLOPS) and energy efficiency
(bottom, in GFLOPS/W) of the BLIS DGEMM implementations on the
Exynos 5422 as a function of the problem size, using a single as
well as different types of cores. Plotted configurations:
big.LITTLE (4+4 threads), A15 (4 threads), A7 (4 threads),
A7+A15 (4+4 threads) and, in the performance plot only, Ideal.
Diving into details that explain the energy efficiency of our
implementations, Table 1 shows a breakdown of power/energy
per component of the SoC, for a particular problem size:
m = n = k = 4,096. This table shows the (average) power
consumption and energy efficiency when employing i) from 1
to 4 threads of a single cluster; ii) the AMP configuration
with all 4+4 cores; and iii) the symmetric configuration of
BLIS using all 4+4 cores. The first four columns report the
average power consumption gathered from the SoC sensors,
while the average power consumption of the entire SoC is in
the fifth column. The performance achieved by the differ-
ent configurations is reported in the sixth column and the
energy efficiency is displayed in the last one.
The first aspect to note is that, as expected, the Cortex-A15
cores dissipate more power than the Cortex-A7 cores. In-
deed, a single Cortex-A15 core roughly doubles the power
dissipation rate of four combined Cortex-A7 cores, and the
Cortex-A15 CPU in idle state consumes more power than
two Cortex-A7 cores in execution. A second issue is that the
memory (DRAM) and total power consumption of the AMP
and symmetric configurations are close to those obtained by
adding the corresponding values of the two CPU clusters in
isolation. An exception is the total power consumption with
the symmetric configuration, in which a significant decrease
is observed due to the Cortex-A15 cores completing their
share of the work much earlier than the Cortex-A7 cores.
This aspect strongly affects the energy efficiency of the sym-
metric configuration as the power consumption is three times
higher than that obtained with the entire Cortex-A7 cluster,
but the performance is only doubled. As expected, the AMP
configuration is the one that dissipates the most power,
as it fully utilizes all the available resources. On the other
hand, it also obtains the shortest execution time, yielding
the best energy-to-solution.
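The energy-to-solution can be recovered from the GFLOPS and total power columns of Table 1. The sketch below assumes the usual count of 2·m·n·k flops for gemm and treats the reported average power as constant over the run; the function name energy_to_solution is illustrative.

```python
def energy_to_solution(m, n, k, gflops, watts):
    """Return (seconds, joules) for one gemm of 2*m*n*k flops."""
    flops = 2.0 * m * n * k
    seconds = flops / (gflops * 1e9)
    return seconds, watts * seconds


rows = {                         # (GFLOPS, total W) copied from Table 1
    "Asymmetric BLIS": (12.035, 7.091),
    "4xA15": (10.374, 6.233),
    "4xA7": (2.086, 1.526),
    "Symmetric BLIS": (3.897, 4.562),
}
results = {name: energy_to_solution(4096, 4096, 4096, g, w)
           for name, (g, w) in rows.items()}
best = min(results, key=lambda name: results[name][1])
```

With these figures the asymmetric configuration needs about 81 J, slightly less than the roughly 83 J of the four Cortex-A15 cores, while the symmetric configuration needs about 161 J.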
5. CONCLUSIONS
In this paper, we have proposed several mechanisms to map
the high-performance multi-threaded implementation of the
matrix multiplication in the BLIS library to an asymmetric
ARM big.LITTLE (Cortex A15+A7) SoC. Our results re-
veal excellent improvements in performance compared with
a homogeneous implementation that operates exclusively on
one type of core (either A15 or A7), and also with respect
to multi-threaded implementations that rely on a symmetric
work distribution and delegate scheduling to the operating
system.
This is the first step towards a full BLAS implementation op-
timized for big.LITTLE architectures, which is the ultimate
goal of our work. We believe that the approach applied to
gemm carries over to the rest of the BLAS. However, there
are still a number of issues that need to be addressed to
further increase performance and adaptation to the architec-
ture. Among those, the most significant ones are the integra-
tion of different micro-kernels and block sizes tuned to each
type of core in order to extract the maximum performance,
and the dynamic distribution and mapping of the workload
to each type of core transparently to the programmer. A
port to a 64-bit ARMv8 architecture, and an experimental
study on architectures with different numbers of
big/LITTLE cores are also key milestones in our roadmap.
Acknowledgments
The researchers from Universitat Jaume I were supported by
project CICYT TIN2011-23283 of MINECO and FEDER,
the EU project FP7 318793 “EXA2GREEN” and the FPU
program of MECD. The researchers from Universidad Complutense
de Madrid were supported by project CICYT TIN2012-32180.
                   Average Dissipated Power (W)
Configuration      A7      A15     DRAM    GPU     Total    GFLOPS    GFLOPS/W
Asymmetric BLIS    0.785   5.994   0.191   0.119   7.091    12.035    1.697
1xA15              0.109   1.828   0.060   0.083   2.081     2.718    1.305
2xA15              0.124   3.242   0.076   0.099   3.543     5.377    1.517
3xA15              0.135   4.613   0.091   0.106   4.946     7.963    1.609
4xA15              0.140   5.878   0.105   0.110   6.233    10.374    1.664
1xA7               0.305   0.499   0.066   0.102   0.973     0.546    0.560
2xA7               0.488   0.501   0.072   0.102   1.164     1.098    0.942
3xA7               0.661   0.503   0.084   0.103   1.352     1.587    1.173
4xA7               0.831   0.502   0.089   0.103   1.526     2.086    1.366
Symmetric BLIS     0.810   3.440   0.201   0.109   4.562     3.897    0.854
Table 1: Power consumption breakdown and energy efficiency for DGEMM (m = n = k = 4096) on the Exynos
5422 SoC, using different thread configurations. The rows labeled as Asymmetric BLIS and Symmetric BLIS use
all the available eight cores in the SoC, using our modified BLIS version and the original BLIS multi-threaded
implementation, respectively.
6. REFERENCES
[1] P. Alonso, R. M. Badia, J. Labarta, M. Barreda, M. F.
Dolz, R. Mayo, E. S. Quintana-Ortí, and Ruymán
Reyes. Tools for power-energy modelling and analysis
of parallel scientific applications. In 41st Int. Conf. on
Parallel Processing – ICPP, pages 420–429, 2012.
[2] Olivier Beaumont and Loris Marchal. Analysis of
dynamic scheduling strategies for matrix
multiplication on heterogeneous platforms. In Proc.
23rd Int. Symp. High-performance Parallel and
Distributed Computing, HPDC’14, pages 141–152,
2014.
[3] David Clarke, Alexey Lastovetsky, and Vladimir
Rychkov. Column-based matrix partitioning for
parallel matrix multiplication on heterogeneous
processors based on functional performance models. In
Euro-Par 2011: Parallel Processing Workshops,
volume 7155 of LNCS, pages 450–459. 2012.
[4] R.H. Dennard, F.H. Gaensslen, V.L. Rideout,
E. Bassous, and A.R. LeBlanc. Design of
ion-implanted MOSFET’s with very small physical
dimensions. Solid-State Circuits, IEEE Journal of,
9(5):256–268, 1974.
[5] M. Duranton et al. The HiPEAC vision for advanced
computing in horizon 2020, 2013.
http://www.hipeac.net/roadmap.
[6] H. Esmaeilzadeh, E. Blem, R. St. Amant,
K. Sankaralingam, and D. Burger. Dark silicon and
the end of multicore scaling. In Proc. 38th Annual Int.
Symp. on Computer architecture, ISCA’11, pages
365–376, 2011.
[7] Kazushige Goto and Robert van de Geijn. Anatomy of
a high-performance matrix multiplication. ACM
Trans. Math. Softw., 34(3):12:1–12:25, 2008.
[8] M.D. Hill and M.R. Marty. Amdahl’s law in the
multicore era. Computer, 41(7):33–38, 2008.
[9] Rakesh Kumar, Dean M. Tullsen, Parthasarathy
Ranganathan, Norman P. Jouppi, and Keith I. Farkas.
Single-ISA heterogeneous multi-core architectures for
multithreaded workload performance. In Proc. 31st
Annual Int. Symp. on Computer Architecture,
ISCA’04, page 64, 2004.
[10] Nagesh B. Lakshminarayana, Jaekyu Lee, and
Hyesoon Kim. Age based scheduling for asymmetric
multiprocessors. In Proc. Conference on High
Performance Computing Networking, Storage and
Analysis, SC’09, pages 25:1–25:12, 2009.
[11] N.B. Lakshminarayana and Hyesoon Kim.
Understanding performance, power and energy
behavior in asymmetric multiprocessors. In IEEE Int.
Conf. Computer Design – ICCD 2008, pages 471–477,
2008.
[12] J. F. Lavignon et al. ETP4HPC strategic research
agenda achieving HPC leadership in Europe.
[13] Tze Meng Low, Francisco D. Igual, Tyler M. Smith,
and Enrique S. Quintana-Ortí. Analytical modeling is
enough for high performance BLIS. ACM Trans.
Math. Soft., 2014. In review. Available at
http://www.cs.utexas.edu/users/flame.
[14] R. Lucas et al. Top ten Exascale research challenges,
2014.
http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/20140210/Top10reportFEB14.pdf.
[15] G.E. Moore. Cramming more components onto
integrated circuits. Electronics, 38(8):114–117, 1965.
[16] T.Y. Morad, U.C. Weiser, A. Kolodny, M. Valero, and
E. Ayguade. Performance, power efficiency and
scalability of asymmetric cluster chip multiprocessors.
Computer Architecture Letters, 5(1):14–17, 2006.
[17] Tyler M. Smith, Robert van de Geijn, Mikhail
Smelyanskiy, Jeff R. Hammond, and Field G. Van Zee.
Anatomy of high-performance many-threaded matrix
multiplication. In Proc. IEEE 28th Int. Parallel and
Distributed Processing Symp., IPDPS’14, pages
1049–1059, 2014.
[18] Field G. Van Zee, Tyler M. Smith, Bryan Marker,
Tze Meng Low, Robert A. van de Geijn, Francisco D.
Igual, Mikhail Smelyanskiy, Xianyi Zhang, Michael
Kistler, Vernon Austel, John Gunnels, and Lee
Killough. The BLIS framework: Experiments in
portability. ACM Trans. Math. Soft., 2014. In review.
Available at
http://www.cs.utexas.edu/users/flame.
[19] Field G. Van Zee and Robert A. van de Geijn. BLIS:
A framework for generating blas-like libraries. ACM
Trans. Math. Soft., 2014. To appear.
[20] Jonathan A. Winter, David H. Albonesi, and
Christine A. Shoemaker. Scalable thread scheduling
and global power management for heterogeneous
many-core architectures. In Proc. 19th Int. Conf.
Parallel Architectures and Compilation Techniques,
PACT’10, pages 29–40, 2010.
