Decoupled Access-Execute on ARM big.LITTLE by Weber, Anton et al.
Decoupled Access-Execute on ARM big.LITTLE
Anton Weber
Uppsala University
anton.weber.0295@student.uu.se
Kim-Anh Tran
Uppsala University
kim-anh.tran@it.uu.se
Stefanos Kaxiras
Uppsala University
stefanos.kaxiras@it.uu.se
Alexandra Jimborean
Uppsala University
alexandra.jimborean@it.uu.se
ABSTRACT
Energy-efficiency plays a significant role given the battery
lifetime constraints in embedded systems and hand-held de-
vices. In this work we target the ARM big.LITTLE, a het-
erogeneous platform that is dominant in the mobile and em-
bedded market, which allows code to run transparently on
different microarchitectures with individual energy and
performance characteristics. This makes it possible to use
the more energy-efficient cores to conserve power during
simple tasks and idle times, and to switch over to the
faster, more power-hungry cores when performance is needed.
This proposal explores the power-savings and the perfor-
mance gains that can be achieved by utilizing the ARM
big.LITTLE core in combination with Decoupled Access-
Execute (DAE). DAE is a compiler technique that splits
code regions into two distinct phases: a memory-bound Ac-
cess phase and a compute-bound Execute phase. By schedul-
ing the memory-bound phase on the LITTLE core, and the
compute-bound phase on the big core, we conserve energy
while caching data from main memory and perform compu-
tations at maximum performance. Our preliminary findings
show that applying DAE on ARM big.LITTLE has poten-
tial. By prefetching data in Access we can achieve an IPC
improvement of up to 37% in the Execute phase, and man-
age to shift more than half of the program runtime to the
LITTLE core. We also provide insight into advantages and
disadvantages of our approach, present preliminary results
and discuss potential solutions to overcome locking over-
head.
Keywords
Decoupled Access-Execute, Energy Efficiency, Compiler, Em-
bedded Systems, Heterogeneous Architectures, ARM
big.LITTLE
ACM ISBN 978-1-4503-2138-9.
DOI: 10.1145/1235
1. INTRODUCTION
Designed for embedded devices with strict size and en-
ergy constraints [13], ARM was quickly adopted in modern
mobile hardware, such as smartphones and tablets, making
it the de-facto standard in these devices today. One of the
main reasons for this is the low-power RISC design, allow-
ing for small, energy efficient chips that at the same time are
powerful enough to run modern mobile operating systems.
While chip designs have been able to keep up with the
rapid development of the mobile segment in the past, ARM,
just like its competitors, is facing new challenges with the in-
creasing demand on performance and battery life in portable
devices and the end of Dennard scaling where it is no longer
possible to lower transistor size and keep the same power
density. To address this issue, ARM developed a new hetero-
geneous design named big.LITTLE that aims at combining
different CPU designs on the same system on chip (SoC).
The ARM big.LITTLE opens up opportunities for the com-
piler to make scheduling decisions for phases of the pro-
gram with different performance characteristics. One such
compiler technique that can profit from the processor set
up in the ARM big.LITTLE is Decoupled Access-Execute
(DAE)[9].
DAE has been developed as a way to improve perfor-
mance and energy efficiency on modern CPUs. It decou-
ples code in coarse-grained phases, namely memory-bound
Access phases and compute-heavy Execute phases. Using
Dynamic Voltage Frequency Scaling (DVFS) these phases
can be executed at different CPU frequencies and voltages.
During the memory-bound phase the CPU is stalled waiting
for data to arrive from memory and can be clocked down to
conserve energy without performance loss. Once the code
reaches the Execute phase, the CPU can be scaled back to
perform the compute-heavy tasks at maximum performance.
As a result, processors can save energy by executing parts
of a program at lower voltage or frequency without slowing
down the overall execution.
In this work we bring the two concepts together, not only
to demonstrate the advantages of Decoupled Access-Execute
on ARM, but also to adapt the previous ideas to the new
architecture and take advantage of the unique features that
the heterogeneous design of ARM big.LITTLE has to offer.
In particular, we want to benefit from two different core de-
signs with individual performance and energy characteristics
being available to run decoupled execution. To this end, our
contributions are:
arXiv:1701.05478v1 [cs.DC] 13 Jan 2017
1. DAE on ARM big.LITTLE: We propose a new transformation
pattern that shows how DAE can be applied to efficiently
use the big and LITTLE cores of the ARM big.LITTLE
architecture.
2. Proof-of-Concept Implementation: We provide a
proof-of-concept implementation of our transformation
patterns for a selection of benchmarks. Our ideas lay the
ground for the development of automatic compiler
transformations for DAE on ARM big.LITTLE.
Our experiments show that Execute phases can benefit
significantly from data prefetched by Access phases. By
scheduling memory-bound Access phases on the LITTLE core and
compute-bound Execute phases on the big core, we effectively
reduce the time spent on the big core (we observe IPC
improvements of up to 37% for the analyzed benchmarks), with
more than half of the runtime being spent on the LITTLE
core. As a proof of concept, our current implementation
still introduces a large synchronization overhead. However,
we present solutions to overcome it.
2. BACKGROUND
In the following, we will first give an overview of existing
DAE transformations. Afterwards, we will introduce details
on the processor in focus, the ARM big.LITTLE.
2.1 Decoupled Access-Execute
While caches and hardware prefetchers have been introduced
to decrease latency when accessing data from main memory,
memory access latency remains a common bottleneck in current
computer architectures. With the processor waiting for data
to arrive, there is not only a negative impact on the overall
runtime of programs, but also on energy consumption.
One reason for this is that the processor runs at high(est)
frequency while stalled in order to compute at full perfor-
mance as soon as the data arrives. Lowering the frequency
to reduce energy consumption during stalls would result in
slower computations and further increase in program execu-
tion time. As an ideal approach, the memory-bound instruc-
tions that stall the CPU should be executed at low frequen-
cies while the processor should run at maximum frequency
for compute-bound parts of the program.
Spiliopoulos et al. [14] explored this possibility and created
a framework that detects the memory- and compute-bound
regions in the program and scales the CPU’s frequency ac-
cordingly. Their work shows that this is indeed a viable
approach but only if the regions are coarse enough.
One of the reasons for this is that with current techniques,
such as DVFS, switching the frequency or core voltage of the
processor involves a significant transition overhead,
preventing us from switching between low and high
frequencies as rapidly as the ideal approach would require.
With Decoupled Access-Execute (DAE), Koukos et al. [11]
proposed a solution to this problem by grouping memory-
and compute-bound instructions in a program, creating larger
code regions and thus reducing the number of frequency
switches required. The result is two distinct phases,
referred to as the Access phase (memory-bound) and the
Execute phase (compute-bound).
Jimborean et al. [9] created a set of compiler passes in
LLVM allowing for these transformations of the program
to be performed statically at compile-time. These passes
create the Access phase by duplicating the original code and
stripping the copy of all instructions that are not memory
reads or address computations. The original code becomes
the Execute phase.
While the added Access phase introduces overhead ini-
tially, the Execute phase will no longer be stalled by mem-
ory accesses, as the data will be available in the cache. In
addition, the Access phase allows for more memory level
parallelism, potentially speeding up main memory accesses
compared to the original code.
Figure 1: Loop transformation with multiple Access phases
in SMVDAE.
Software Multiversioned Decoupled Access-Execute (SMVDAE)
[12] deals with the difficulties of applying DAE to
general-purpose applications. Here, complex control flow and
pointer aliasing make it hard to reason statically about the
efficiency of the transformations. The SMVDAE compiler
passes apply the transformations with a variety of
parameters, thus generating multiple Access phase
candidates, which makes it possible to select the
best-performing option for the given code dynamically. With
this approach, Koukos et al. achieved energy-delay-product
(EDP) improvements of over 20% on average across 14
different benchmarks.
To create coarse grained code regions, DAE targets hot
loops within programs. These loops are split into smaller
slices that typically consist of several iterations of the origi-
nal loop. Figure 1 illustrates the SMVDAE transformation:
for each slice of the original code, SMVDAE creates different
Access phase versions and one Execute phase. Using a dy-
namic version selector, the best performing Access version
is selected during runtime. As the amount of data accessed
in each loop iteration depends on the task, the optimal size
for the slices, the granularity, is benchmark- and cache de-
pendent.
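As a rough illustration of why the granularity depends on the cache, the sketch below bounds a slice so that the data it touches fits into a chosen share of the target cache. This is only a heuristic starting point and not the selection method used in this work: the function `slice_granularity` and its parameters `cache_bytes`, `bytes_per_iteration` and `fill_percent` are hypothetical, and the best value still has to be determined experimentally.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of loop iterations per slice such that the
 * data touched by one slice occupies at most fill_percent of the cache. */
size_t slice_granularity(size_t cache_bytes, size_t bytes_per_iteration,
                         unsigned fill_percent)
{
    assert(bytes_per_iteration > 0 && fill_percent <= 100);

    size_t budget = cache_bytes * fill_percent / 100;
    size_t iterations = budget / bytes_per_iteration;

    /* Always make progress, even if one iteration exceeds the budget. */
    return iterations > 0 ? iterations : 1;
}
```

For example, assuming a loop that loads two 8-byte values per iteration and a budget of half of a 32 kB cache, `slice_granularity(32 * 1024, 16, 50)` yields 1024 iterations per slice.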
2.2 ARM big.LITTLE
Since its first prototype, the ARM1 from 1985, many it-
erations of ARM CPU cores have been developed. One key
characteristic that distinguishes all these core designs from
competitors like Intel and AMD is the RISC architecture.
Instead of implementing a set of complex instructions in
hardware, the RISC architecture shifts the complexity away
from the chip and towards the software [13]. This makes
RISC compiler tools more complex but also allows for much
simpler hardware designs on the CPU. As a result, ARM has
been able to create small, energy efficient chips that have
seen great popularity in the embedded and mobile segment
where size and battery life are major factors.
A recent development by ARM is the big.LITTLE architecture.
First released in 2011, this heterogeneous design combines
fast, powerful cores with slower, more energy-efficient ones
on the same SoC, allowing the devices to conserve power
during small tasks and idle states but at the same time
deliver high performance when needed. While the big and
LITTLE processor cores are fully compatible at the
instruction set architecture (ISA) level, they feature
different microarchitectures. big cores are characterized by
complex, out-of-order designs with many pipeline stages,
while the LITTLE cores are simpler, in-order processors. For
comparison, the ARM Cortex-A15 (big) pipeline has 15 integer
and 17-25 floating point pipeline stages, while the
Cortex-A7 (LITTLE) only has 8 pipeline stages.
Figure 2: Typical ARM big.LITTLE design
In modern ARM SoCs, CPU cores are grouped in clusters.
Each core has an individual L1 data and instruction cache
but shares the L2 cache with other cores in the same clus-
ter. CPU clusters and other components on the chip, such
as GPU and peripherals, are connected through a shared
bus. ARM provides reference designs for these intercon-
nects, but manufacturers commonly use custom implemen-
tations in their products. The interconnect uses ARM’s Ad-
vanced Microcontroller Bus Architecture (AMBA) and pro-
vides system-wide coherency through the AMBA AXI Co-
herency Extensions (ACE) and AMBA AXI Coherency Ex-
tensions Lite (ACE-Lite) [15]. These protocols allow mem-
ory coherency across CPUs, one of the main prerequisites
for big.LITTLE processing [5].
ARM big.LITTLE usually features two clusters: one for
all big cores and one cluster containing all LITTLE cores.
These designs allow three different techniques for runtime
migrations of tasks: cluster switching, CPU migration and
Global Task Scheduling (GTS) [5, 7].
Cluster switching was the first and simplest implementation.
Only one of the two clusters is active at a time
while the other is powered down. If a switch is triggered, the
other cluster is powered up, all tasks are migrated and the
previously used cluster is deactivated until the next switch.
In CPU migration, each individual big core is paired with a
LITTLE core. Each pair is visible as one virtual core and
the system can transparently move a task between the two
physical cores without affecting any of the other pairs. GTS
is the most flexible method, as all cores are visible to the
system and can all be active at the same time.
Cluster switching and GTS also allow for asymmetric de-
signs, where the number of big and LITTLE cores does not
necessarily have to be equal.
3. METHODOLOGY
Our methodology applies DAE on ARM and takes advan-
tage of the heterogeneous hardware features on the big.LITTLE
architecture. With two different types of processors available
on the same system, energy savings can now be a result of
lower core frequencies and running code on the simpler, more
energy efficient microarchitecture of the LITTLE cores. A
straightforward way to benefit from this in decoupled
execution is to place our Access and Execute phases onto
different cores.
Running the two phases on separate cores also eliminates
DVFS transition overhead, as the cores can constantly be
kept at ideal frequencies throughout the entire execution.
Since the cores are located on different clusters, we are no
longer prefetching the data into a shared cache and are
instead providing the Execute phase with prefetched data
through coherence.
We have chosen to implement this approach for a selection
of benchmarks.
Figure 3: Applying DAE techniques on big.LITTLE. The
memory-bound Access phase runs on the LITTLE core,
while the compute-bound Execute phase runs on the big
core.
3.1 Transformation pattern
As our future goal is to have a set of compiler passes that
can transform any input program, we design our new DAE
transformations as a step by step process. The following
sections describe this pattern and illustrate it on a sample
program in pseudocode. The sample program in Figure 4 is
used as a running example.
Although some of the steps have remained largely un-
changed from current DAE compiler passes, we have chosen
to implement this approach as prototypes in C code first.
This allows us to adjust parameters and details in the im-
plementation more rapidly and with more flexibility.
void do_work() { // Main thread
    for (i = 0; i < N; i++) {
        c[i] = a[i+1] + b[i+2];
    }
}
Figure 4: Our running example: the original program. This
example is kept small for the sake of simplicity.
Spawning threads for Access and Execute phase
Similar to the current DAE compiler passes, we are targeting
hot loops. The loop is duplicated, but instead of executing
the Access and Execute phases within the same thread, we
move them to individual cores (see Figure 3). This is done
by creating two threads: one for the Access- and one for the
Execute phase.
The threads are created using the Linux POSIX thread in-
terfaces (Pthreads). Configuring the attributes to the Pthread
calls enables us to manually define CPU affinity and spawn
the Execute phase on a big core and the Access phase on a
LITTLE core. This allows us to benefit from the flexibility
of Global Task Scheduling and as our phases are meant to
remain on the same core for the entire execution, we can
avoid task migration and any of the associated overhead en-
tirely. Spawning two additional threads instead of reusing
the main thread for one of the phases allows us to run the
phases on the two different cores without affecting the CPU
affinity of the remaining parts of the program running on
the main thread.
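A minimal sketch of this step, assuming Linux with glibc, is shown below. `pthread_attr_setaffinity_np` is the GNU extension for setting the CPU affinity in the thread attributes; the helper names `spawn_pinned` and `run_phases` and the placeholder phase bodies are hypothetical and only stand in for the generated functions.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sched.h>

/* Spawn fn(arg) on a thread pinned to a single CPU.
 * Returns 0 on success or an errno-style error code. */
int spawn_pinned(pthread_t *thread, int cpu, void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    cpu_set_t cpus;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);

    int rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    CPU_ZERO(&cpus);
    CPU_SET(cpu, &cpus);
    rc = pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);
    if (rc == 0)
        rc = pthread_create(thread, &attr, fn, arg);

    pthread_attr_destroy(&attr);
    return rc;
}

/* Placeholder phase bodies; the real phases contain the duplicated loop. */
static void *access_phase(void *arg)  { *(int *)arg += 1; return NULL; }
static void *execute_phase(void *arg) { *(int *)arg += 2; return NULL; }

/* Stands in for the original loop: spawn both phases, then join them. */
int run_phases(int little_cpu, int big_cpu, int *access_arg, int *execute_arg)
{
    pthread_t access_thread, execute_thread;

    int rc = spawn_pinned(&access_thread, little_cpu, access_phase, access_arg);
    if (rc != 0)
        return rc;

    rc = spawn_pinned(&execute_thread, big_cpu, execute_phase, execute_arg);
    pthread_join(access_thread, NULL);
    if (rc == 0)
        pthread_join(execute_thread, NULL);
    return rc;
}
```

Which CPU numbers belong to the LITTLE and the big cluster is system-specific and has to be checked on the target, so both are passed in as parameters here rather than hard-coded.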
For the first transformation step this effectively means
copying the loop twice and placing it into an empty func-
tion each. These two new functions will become our Access
and Execute phase. The original loop is replaced with the
calls required to spawn two threads, one for each of the new
functions, and join them when they finish. Figure 5 shows
the output of applying this transformation to the example
program. The relevant changes are highlighted in red.
void do_work() { // Main thread
    spawn(access_thread, access);
    spawn(execute_thread, execute);
    join(access_thread);
    join(execute_thread);
}
void access() { // Access thread
    for (i = 0; i < N; i++) {
        c[i] = a[i+1] + b[i+2];
    }
}
void execute() { // Execute thread
    for (i = 0; i < N; i++) {
        c[i] = a[i+1] + b[i+2];
    }
}
Figure 5: We extract the original loop into a function (Ex-
ecute phase) and duplicate the function to become the Ac-
cess phase. These two versions are executed as individual
threads: one running on the LITTLE core, one on the big
core.
Generating Access and Execute Phases
Similar to previous DAE approaches, our Access phases
include control flow, loads and memory address calculations.
Once all other instructions have been removed from the
Access phase, the compiler can potentially optimize the
resulting code further, for example by removing dead code or
control flow that is no longer needed as a result of our
changes.
The Execute phase remains unchanged after loop chunking
(generation of slices) as we can issue the same data requests
to the interconnect as the original code. All accesses to
data that has already been prefetched by the Access phase
automatically benefit from it being available closer in the
memory hierarchy, as the request will be serviced by the
other cluster transparently.
Both phases contain address calculations. This means
that we are now calculating addresses twice, introducing ad-
ditional instructions compared to the original program. As
in previous DAE implementations, we aim at compensating
for this overhead through the positive side-effects of decou-
pled execution.
Figure 6 shows the generation of Access and Execute: first
we chunk the loop (i.e. creating an inner and outer loop),
then we remove unnecessary instructions for Access and re-
place loads with prefetch instructions.
void do_work() { // Main thread
    spawn(access_thread, access);
    spawn(execute_thread, execute);
    join(access_thread);
    join(execute_thread);
}
void access() { // Access thread
    // Outer loop
    offset = 0;
    for (j = 0; j < (N/granularity); j++) {
        // Inner loop
        for (k = 0; k < granularity; k++) {
            i = offset + k;
            prefetch(a[i+1]);
            prefetch(b[i+2]);
        }
        offset += granularity;
    }
}
void execute() { // Execute thread
    // Outer loop
    offset = 0;
    for (j = 0; j < (N/granularity); j++) {
        // Inner loop
        for (k = 0; k < granularity; k++) {
            i = offset + k;
            c[i] = a[i+1] + b[i+2];
        }
        offset += granularity;
    }
}
Figure 6: Access Phase Generation: Chunking the loop and
removing instructions from the Access phase.
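The chunked phases of the running example can be sketched in compilable C as follows. This is an illustration under assumptions rather than the generated code: `__builtin_prefetch` is a GCC/Clang builtin used as a stand-in for the `prefetch` call (which instruction it lowers to depends on compiler and target), the element type `double` is assumed, and the single-threaded driver `run_chunked` only demonstrates the slicing. Unlike the pseudocode, the sketch also processes a final partial slice when N is not a multiple of the granularity.

```c
#include <assert.h>
#include <stddef.h>

/* Access phase for one slice [begin, end): only address computations
 * and prefetch hints remain. */
void access_slice(const double *a, const double *b, size_t begin, size_t end)
{
    for (size_t i = begin; i < end; i++) {
        __builtin_prefetch(&a[i + 1]);
        __builtin_prefetch(&b[i + 2]);
    }
}

/* Execute phase for one slice: the unchanged loop body. */
void execute_slice(double *c, const double *a, const double *b,
                   size_t begin, size_t end)
{
    for (size_t i = begin; i < end; i++)
        c[i] = a[i + 1] + b[i + 2];
}

/* Single-threaded driver that shows the chunking only.
 * a must hold n + 1 elements and b must hold n + 2 elements. */
void run_chunked(double *c, const double *a, const double *b,
                 size_t n, size_t granularity)
{
    assert(granularity > 0);

    for (size_t offset = 0; offset < n; offset += granularity) {
        size_t end = offset + granularity < n ? offset + granularity : n;
        access_slice(a, b, offset, end);
        execute_slice(c, a, b, offset, end);
    }
}
```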
Synchronization
As we are parallelizing decoupled execution, synchronization
is required to enforce two rules: First, the Execute phase
must not start computing before prefetching has finished, as
it can only benefit from decoupled execution when the data
is available in the cache when it is requested. Second, the
Access phase cannot start the next slice before the Execute
phase has completed the current one. As cache space
is limited, prefetching the next set of data will potentially
evict previously prefetched cache lines. As seen in Figure 7,
running Access and Execute phases in turn on the two cores
can be achieved using a combination of two locks.
Figure 8 shows how we can implement such a locking
scheme in this transformation step. Each phase waits on
one of these locks before starting the next chunk.
Figure 7: Synchronization between individual Access and
Execute phases.
void do_work() { // Main thread
    init(access_lock);
    init(execute_lock);
    unlock(access_lock);
    lock(execute_lock);
    spawn(access_thread, access);
    spawn(execute_thread, execute);
    join(access_thread);
    join(execute_thread);
}
void access() { // Access thread
    // Outer loop
    offset = 0;
    for (j = 0; j < (N/granularity); j++) {
        lock(access_lock);
        // Inner loop
        for (k = 0; k < granularity; k++) {
            i = offset + k;
            prefetch(a[i+1]);
            prefetch(b[i+2]);
        }
        offset += granularity;
        unlock(execute_lock);
    }
}
void execute() { // Execute thread
    // Outer loop
    offset = 0;
    for (j = 0; j < (N/granularity); j++) {
        lock(execute_lock);
        // Inner loop
        for (k = 0; k < granularity; k++) {
            i = offset + k;
            c[i] = a[i+1] + b[i+2];
        }
        offset += granularity;
        unlock(access_lock);
    }
}
Figure 8: Adding synchronization between Access and Exe-
cute phase.
Access lock (blue):
The first lock is used to start the Access phase and is
unlocked at the end of each slice in the Execute phase. It
is initially unlocked, so that the Access phase can start as
soon as the thread spawns.
Execute lock (red):
The second lock signals the Execute phase to continue once
the Access phase has finished the current slice. The Execute
phase follows the same pattern, with the difference that its
lock is initialized as locked, making sure that the phase
does not start before the first Access slice has been
completed.
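The two-lock hand-off can be sketched with POSIX semaphores as shown below. Note that this swaps in a different primitive than the text describes: a Pthreads mutex must be unlocked by the thread that locked it, whereas each lock here is released by the other phase, so counting semaphores (`sem_t`) are used as a stand-in. The array size `N`, the `GRANULARITY` value and all function names are assumptions for illustration, and CPU pinning is left out for brevity.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

enum { N = 1000, GRANULARITY = 128 };  /* assumed sizes */

static double a[N + 1], b[N + 2], c[N];
static sem_t access_sem;   /* posted by Execute: next slice may be prefetched */
static sem_t execute_sem;  /* posted by Access: current slice may be computed */

static size_t slice_end(size_t offset)
{
    return offset + GRANULARITY < N ? offset + GRANULARITY : N;
}

static void *access_thread_fn(void *unused)
{
    (void)unused;
    for (size_t offset = 0; offset < N; offset += GRANULARITY) {
        sem_wait(&access_sem);
        for (size_t i = offset; i < slice_end(offset); i++) {
            __builtin_prefetch(&a[i + 1]);
            __builtin_prefetch(&b[i + 2]);
        }
        sem_post(&execute_sem);
    }
    return NULL;
}

static void *execute_thread_fn(void *unused)
{
    (void)unused;
    for (size_t offset = 0; offset < N; offset += GRANULARITY) {
        sem_wait(&execute_sem);
        for (size_t i = offset; i < slice_end(offset); i++)
            c[i] = a[i + 1] + b[i + 2];
        sem_post(&access_sem);
    }
    return NULL;
}

int do_work(void)
{
    pthread_t access_thread, execute_thread;

    sem_init(&access_sem, 0, 1);   /* "unlocked": Access starts at once */
    sem_init(&execute_sem, 0, 0);  /* "locked": Execute waits for slice 0 */

    if (pthread_create(&access_thread, NULL, access_thread_fn, NULL) != 0)
        return -1;
    if (pthread_create(&execute_thread, NULL, execute_thread_fn, NULL) != 0) {
        pthread_cancel(access_thread);  /* sem_wait is a cancellation point */
        pthread_join(access_thread, NULL);
        return -1;
    }
    pthread_join(access_thread, NULL);
    pthread_join(execute_thread, NULL);
    sem_destroy(&access_sem);
    sem_destroy(&execute_sem);
    return 0;
}
```

Because each semaphore is posted exactly once per slice by the opposite phase, the two threads strictly alternate: Access on slice 0, Execute on slice 0, Access on slice 1, and so on.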
3.2 Optimizations
The pattern above describes a basic implementation of
DAE on big.LITTLE. Further optimizations to this method-
ology can improve results significantly in some scenarios.
While only the first optimization has been used as part of
this work, all of the methods below have shown benefits for
decoupled execution on big.LITTLE in initial testing.
Reducing thread overhead
As we create a pair of threads every time a loop is exe-
cuted, the overhead of setting up, spawning and joining the
threads can degrade performance noticeably when the pro-
gram executes the loop frequently. A solution to this is to
keep the threads running over the course of the program and
provide them with new data on every loop execution. This
thread-pool approach does not come without a downside as
the threads need to be signaled when new data is available
and the next loop should be executed. In many cases the
benefits outweigh the added overhead and we decide exper-
imentally whether to apply this optimization to the chosen
benchmarks.
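One way to keep a thread alive across loop executions is a persistent worker that sleeps on a condition variable until new work is signaled. The sketch below illustrates the idea for a single worker; in the DAE setting there would be one for Access and one for Execute. The type `worker_t`, the functions `worker_start`, `worker_submit`, `worker_wait` and `worker_stop`, and the placeholder `example_job` are all hypothetical and do not describe the authors' implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    pthread_t thread;
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    void (*job)(void *);  /* loop to run for the current submission */
    void *arg;            /* data for the current loop execution */
    bool busy;            /* a job is pending or running */
    bool shutdown;
} worker_t;

static void *worker_main(void *p)
{
    worker_t *w = p;
    pthread_mutex_lock(&w->mutex);
    for (;;) {
        while (!w->busy && !w->shutdown)
            pthread_cond_wait(&w->cond, &w->mutex);
        if (!w->busy)                      /* shutdown, nothing pending */
            break;
        pthread_mutex_unlock(&w->mutex);
        w->job(w->arg);                    /* run outside the lock */
        pthread_mutex_lock(&w->mutex);
        w->busy = false;
        pthread_cond_broadcast(&w->cond);  /* wake waiting submitters */
    }
    pthread_mutex_unlock(&w->mutex);
    return NULL;
}

int worker_start(worker_t *w)
{
    pthread_mutex_init(&w->mutex, NULL);
    pthread_cond_init(&w->cond, NULL);
    w->job = NULL;
    w->arg = NULL;
    w->busy = false;
    w->shutdown = false;
    return pthread_create(&w->thread, NULL, worker_main, w);
}

/* Hand a new loop execution to the worker (takes the place of spawn). */
void worker_submit(worker_t *w, void (*job)(void *), void *arg)
{
    pthread_mutex_lock(&w->mutex);
    while (w->busy)
        pthread_cond_wait(&w->cond, &w->mutex);
    w->job = job;
    w->arg = arg;
    w->busy = true;
    pthread_cond_broadcast(&w->cond);
    pthread_mutex_unlock(&w->mutex);
}

/* Wait for the submitted loop execution (takes the place of join). */
void worker_wait(worker_t *w)
{
    pthread_mutex_lock(&w->mutex);
    while (w->busy)
        pthread_cond_wait(&w->cond, &w->mutex);
    pthread_mutex_unlock(&w->mutex);
}

void worker_stop(worker_t *w)
{
    pthread_mutex_lock(&w->mutex);
    w->shutdown = true;
    pthread_cond_broadcast(&w->cond);
    pthread_mutex_unlock(&w->mutex);
    pthread_join(w->thread, NULL);
    pthread_cond_destroy(&w->cond);
    pthread_mutex_destroy(&w->mutex);
}

/* Placeholder job standing in for one phase of a hot loop. */
void example_job(void *arg) { *(int *)arg += 1; }
```

The signaling cost mentioned above shows up here as the mutex and condition variable traffic in `worker_submit` and `worker_wait`, which replaces the cost of creating and joining threads on every loop execution.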
Overlap
Running the Access and Execute phases in parallel intro-
duces a new way to reduce overall execution time by over-
lapping the two phases for each individual slice. While it
is problematic to start the Execute phase too early, as the
data has not yet arrived in the cache of the LITTLE cluster,
it does not have to wait for the entire slice to be prefetched.
In other words, to benefit from prefetched data in the
Execute phase, the prefetch has to be completed by the time
a particular data set is needed. Hence, any prefetches for
data that is required at a later stage can still be pending
when the Execute phase starts as long as they complete by
the time the data is requested.
Figure 9 illustrates one possible scenario where the Execute
phase is started before the last data set has been
prefetched. As this data set is only required towards the
end of the current Execute slice, the early stages of the
Execute phase can overlap with the Access phase, which
prefetches the last set of data in parallel.
Achieving the correct timing for this can be difficult, as
the two phases are executed at different speeds and the Ex-
ecute phase processes the data at a different rate than it is
prefetched in the Access phase.
Figure 9: Overlapping Access and Execute phases.
In fact, as we employ the Preload Data (PLD) instruc-
tion [2] to prefetch data in the Access phase, we already deal
with overlapping in the current implementation to some de-
gree. The PLD prefetch hint is non-blocking, meaning that
it will not wait for the data to actually arrive in the cache
before completing. Hence, finishing the Access phase only
guarantees that we have issued all hints while any number
of them might not have prefetched the data into the cache
yet. We are currently extending this work to monitor the
cache behavior (misses/hits for each big and LITTLE core)
and origin of fetched data (local cache, cache of the other
cores, memory).
3.2.1 Timing-based implementation
As we are expecting the synchronization overhead to be
our main problem with the methodology described above,
reducing or removing it entirely would greatly improve how
the implementation performs.
While we generally need synchronization when dealing
with multiple threads that work together, we have the ad-
vantage that we do not affect the correctness of the program
by starting the Execute phase early or late. Incorrect tim-
ing merely affects the performance of the Execute phase,
as the operations within the Access phase are limited to
address calculations and prefetches, i.e. operations that have no
side-effects. As a matter of fact, we already have a loose syn-
chronization model as a result of the non-blocking prefetches
described in the section above.
Figure 10: Timing-based DAE: instead of locking, approx-
imate the sleep times of Access and Execute, in order to
synchronize the phases.
This allows us to go with a more speculative approach.
Instead of locking, we can suspend the two threads and let
the Access and Execute phases wait for a specific amount of
time before processing the next slice. The amount of wait
time should correspond to the execution time of the other
phase. Runtime measurements of the individual phases pro-
vide us with a starting point for these sleep times, while the
exact values that perform best can be found experimentally.
As a result, we can approximate a similar timing between
the two threads without any of the locking overhead (see
Figure 10).
The main problem with this alternative is that it is diffi-
cult to reliably approximate the timing of the two threads as
our system does not guarantee real-time constraints. Other
programs running on the same system and the operating sys-
tem itself can interfere with the DAE execution and affect
the runtime of individual slices. Suspending the threads is
imprecise as well: functions like nanosleep can resume a
thread later than requested, or return early when
interrupted by a signal.
The consequence in these cases is that the two threads can
go out of sync, working on two different slices and negating
all DAE benefits. The impact and likelihood of this depend
on many factors, including the number and size of the
slices.
A trade-off would be a hybrid solution that only locks pe-
riodically to synchronize the two threads to the same slice.
A group of slices would still be executed lock-free (i.e. with
reduced overhead) and the synchronization point would pre-
vent one of the phases from running too far ahead. In a more
advanced implementation, these moments can also be used
to adjust the sleep times of the individual threads dynami-
cally to adapt to the current execution behavior caused by
system load and other external factors.
void execute() { // Execute thread
    // Outer loop
    offset = 0;
    for (j = 0; j < (N/granularity); j++) {
        nanosleep(S);
        // Inner loop
        for (k = 0; k < granularity; k++) {
            i = offset + k;
            c[i] = a[i+1] + b[i+2];
        }
        offset += granularity;
    }
}
Figure 11: Replacing both locks with a single sleep call in
each phase.
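The timing-based variant and the hybrid trade-off can be sketched together as follows. The sketch makes several assumptions that are not taken from this work: the interval `SYNC_INTERVAL`, the type `phase_t`, the helper `sleep_ns` and the placeholder `count_slice` are hypothetical, a POSIX barrier is used as the periodic synchronization point, and the Access phase is assumed to sleep after its slice while the Execute phase sleeps before it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

enum { SYNC_INTERVAL = 8 };  /* hypothetical: slices between two barriers */

/* Sleep for ns nanoseconds; continue sleeping if a signal interrupts. */
int sleep_ns(long ns)
{
    assert(ns >= 0);
    struct timespec req = { .tv_sec = ns / 1000000000L,
                            .tv_nsec = ns % 1000000000L };
    while (nanosleep(&req, &req) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return 0;
}

typedef struct {
    pthread_barrier_t *barrier;  /* shared by both phase threads */
    long wait_ns;                /* estimated duration of the other phase */
    bool sleep_first;            /* true for Execute, false for Access */
    size_t num_slices;           /* must be equal for both threads */
    void (*slice)(size_t j, void *ctx);
    void *ctx;
} phase_t;

void *phase_loop(void *p)
{
    phase_t *ph = p;
    for (size_t j = 0; j < ph->num_slices; j++) {
        if (j % SYNC_INTERVAL == 0)
            pthread_barrier_wait(ph->barrier);  /* realign on the same slice */
        if (ph->sleep_first)
            sleep_ns(ph->wait_ns);   /* Execute: let Access prefetch first */
        ph->slice(j, ph->ctx);
        if (!ph->sleep_first)
            sleep_ns(ph->wait_ns);   /* Access: let Execute consume the slice */
    }
    return NULL;
}

/* Placeholder slice body that only counts how often it was called. */
void count_slice(size_t j, void *ctx) { (void)j; *(int *)ctx += 1; }
```

Between two barriers the threads run lock-free and rely on the sleep times alone; the barrier bounds how far one phase can drift ahead of the other.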
4. EVALUATION
In this section we first describe the ARM big.LITTLE that
we use in our evaluation. We further describe our measure-
ment techniques and our evaluation criteria. Afterwards we
present and discuss the experimental results obtained on the
selected benchmarks.
4.1 Experimental Setup
Test system
We evaluate the benchmarks on an ODROID-XU4 single-
board computer, running the Samsung Exynos 5422 SoC.
This ARMv7-A chip features one big and one LITTLE clus-
ter. Each cluster shares the L2 cache while each individual
core has a private L1 data and private L1 instruction cache
available. The 2 GB of LPDDR3 main memory is speci-
fied with a bandwidth of 14.9 GB/s. Table 1 contains more
details about the processors on this chip.
                  big cluster   LITTLE cluster
Number of cores   4             4
Core type         Cortex-A15    Cortex-A7
fmax              2 GHz         1.4 GHz
fmin              200 MHz       200 MHz
L2 cache          2 MB          512 kB
L1d cache         32 kB         32 kB
Table 1: Exynos 5422 specifications.
The operating system is a 64 bit Ubuntu Linux with a
kernel maintained by ODROID that contains all relevant
drivers to run the scheduler in GTS mode. To reduce sched-
uler interference, two processor cores are removed from the
scheduler queues using the isolcpus kernel option. The bench-
marks are cross-compiled on an Intel-based x86 machine us-
ing LLVM 3.8 [1].
Measurement technique
Measurements through the Linux perf events interface, while
convenient to set up, have been shown to produce too much
overhead, especially with frequent, fine-grained
measurements. Instead, we directly access the performance
statistics available as part of the on-chip Performance
Monitor Units (PMUs). In addition to a dedicated cycle
counter, there are a number of configurable counters
available (4 per Cortex-A7 core and 6 per Cortex-A15 core
[3, 4]). One of these event counters on each
processor is set to capture the number of retired instruc-
tions [2]. The cycle counters are configured to increment
once every 64 clock cycles to avoid overflows in the 32 bit
registers when measuring around larger code regions. We
enable PMU counters and allow user-space access through
a custom kernel module. This enables us to read the rel-
evant register values through inline assembly instructions
from within our code.
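A sketch of such a counter read on ARMv7 is given below. The `mrc` instruction reads the cycle count register (PMCCNTR) from coprocessor 15; it traps unless user-space access has been enabled from kernel mode beforehand. The helper names `read_cycle_ticks` and `cycles_between` are hypothetical. With the divider enabled, a 32-bit counter at 2 GHz wraps after roughly 137 s instead of about 2.1 s (2^32 / 2 GHz, times 64).

```c
#include <assert.h>
#include <stdint.h>

enum { CYCLE_DIVIDER = 64 };  /* counter ticks once every 64 clock cycles */

#if defined(__arm__)
/* Read the ARMv7 cycle counter (PMCCNTR). Requires that the counter is
 * enabled and user-space access has been granted by kernel code. */
static inline uint32_t read_cycle_ticks(void)
{
    uint32_t ticks;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(ticks));
    return ticks;
}
#endif

/* Clock cycles between two samples. Unsigned subtraction keeps the
 * result correct across a single wrap-around of the 32-bit register. */
uint64_t cycles_between(uint32_t start_ticks, uint32_t end_ticks)
{
    return (uint64_t)(uint32_t)(end_ticks - start_ticks) * CYCLE_DIVIDER;
}
```

The read function is only compiled for 32-bit ARM targets; the conversion helper is portable and can be checked on any host.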
4.2 Evaluation criteria
As the goal of this project is to improve the energy
efficiency of programs without sacrificing performance,
these two aspects will be our main criteria for evaluating
the prototypes.
Similar to Software Multiversioned Decoupled Access Ex-
ecute (SMVDAE) [12], we generate multiple versions with
different granularities for each benchmark to find the opti-
mal setting.
Performance:
We measure the performance impact with the overall runtime
of the benchmark. While previous DAE work has been shown to
speed up benchmark execution in some cases, we now expect
synchronization and coherence overhead to affect our
results.
Energy:
Without a sophisticated power model, the speed-up of the
code running on the big core, together with the individual
execution times for each phase, is our main indicator of
energy savings. Speeding up the Execute phase means that the
big core will be active for less time and hence consume less
energy. On the other hand, these savings are only relevant
if the overall overhead is small enough not to negate this
effect.
Measurement considerations:
Changing the original benchmark into a threaded version
introduces synchronization overhead. This is reflected in
the runtime measurements, where faster execution does not
only result from data being available faster, but also from
lower synchronization overhead. This means that we need
different measurements to evaluate how much we speed up
the Execute phase purely by providing the data from our
warmed up cache.
Currently, the most accurate way to measure this is to count
how many CPU cycles the calculations in the Execute phase
need to finish. If the data is not available in the cache,
the instructions will take longer to execute. On the other
hand, if the data can be brought in through coherence, the
core will spend less time stalled waiting for it to arrive
and finish the instructions in fewer cycles. The baseline
for runtime comparisons is the unmodified version of each
benchmark. While loop chunking alone can affect performance,
the impact has been shown to be negligible for the
benchmarks we evaluate.
A common unit to visualize these results is instructions
per cycle (IPC). For this we capture the cycle and instruc-
tion count in the Execute phase individually and calculate
the IPC. The samples are taken around the inner loop of the
Execute phase (see Figure 12) to avoid capturing the exe-
cution of the other phase and locking overhead. These are
measured separately. The IPC of the baseline is obtained
by chunking the unmodified version of the benchmark, and
by measuring the instructions and cycles of the otherwise unchanged
inner loop. Thus, the IPC numbers of both versions
refer to the exact same region of code.
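As a minimal sketch of this calculation, the following C helpers derive the IPC of a measured region from two counter deltas and express the DAE result relative to the chunked baseline. The struct and function names are hypothetical and not taken from our prototypes; the counter values are assumed to have been captured around the inner loop as described above.

```c
#include <stdint.h>

/* Counter deltas captured around the inner loop of the Execute phase.
   Hypothetical container, used for illustration only. */
typedef struct {
    uint64_t instructions;
    uint64_t cycles;
} region_sample;

/* Instructions per cycle for one sample; returns 0.0 for an empty sample. */
double region_ipc(region_sample s) {
    if (s.cycles == 0) return 0.0;
    return (double)s.instructions / (double)s.cycles;
}

/* Relative IPC improvement of the DAE version over the chunked
   baseline, in percent. Returns 0.0 if the baseline IPC is zero. */
double ipc_speedup_percent(region_sample baseline, region_sample dae) {
    double base = region_ipc(baseline);
    if (base == 0.0) return 0.0;
    return (region_ipc(dae) / base - 1.0) * 100.0;
}
```

Because both samples cover the same region of code, a higher IPC in the DAE version corresponds to fewer cycles spent in that region.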
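void execute() { // Execute thread
    ...
    // Outer loop: one iteration per slice
    for (j = 0; j < (N / granularity); j++) {
        lock(execute_lock);    // wait until Access has finished this slice
        start = read_counter();
        // Inner loop: the measured region
        for (k = 0; k < granularity; k++) {
            ...
        }
        end = read_counter();
        ...
        unlock(access_lock);   // let Access continue with the next slice
    }
}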
Figure 12: Measuring the effect of Access on Execute: we
take measurements around the inner loop, in order to avoid
capturing overhead associated with locking or other phases.
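The hand-off between the two phases in Figure 12 can be sketched as a runnable program. This is an illustration under stated assumptions, not the code of our prototypes: it swaps the two locks for POSIX semaphores (a pthread mutex must not be unlocked by a thread other than its owner), uses the GCC/Clang builtin `__builtin_prefetch` as the Access phase, uses a plain sum as a stand-in for the Execute computation, and omits pinning the threads to the LITTLE and big cores. All type and function names are hypothetical.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

/* Shared state for one decoupled loop (all names are illustrative). */
typedef struct {
    const double *data;   /* input array processed by the loop */
    size_t n;             /* total number of iterations */
    size_t granularity;   /* iterations per slice */
    sem_t access_go;      /* posted when Access may prefetch the next slice */
    sem_t execute_go;     /* posted when Execute may consume the current slice */
    double result;        /* written by the Execute thread */
} dae_loop;

static size_t slice_end(const dae_loop *loop, size_t start) {
    size_t end = start + loop->granularity;
    return end < loop->n ? end : loop->n;
}

/* Access phase: touches the slice with prefetches only, no computation. */
static void *access_thread(void *arg) {
    dae_loop *loop = arg;
    for (size_t start = 0; start < loop->n; start += loop->granularity) {
        sem_wait(&loop->access_go);
        for (size_t k = start; k < slice_end(loop, start); k++)
            __builtin_prefetch(&loop->data[k]);
        sem_post(&loop->execute_go);
    }
    return NULL;
}

/* Execute phase: the original computation, here a plain sum. */
static void *execute_thread(void *arg) {
    dae_loop *loop = arg;
    double sum = 0.0;
    for (size_t start = 0; start < loop->n; start += loop->granularity) {
        sem_wait(&loop->execute_go);
        /* performance counters would be read around this inner loop */
        for (size_t k = start; k < slice_end(loop, start); k++)
            sum += loop->data[k];
        sem_post(&loop->access_go);
    }
    loop->result = sum;
    return NULL;
}

/* Runs both phases to completion and returns the computed sum. */
double run_dae(const double *data, size_t n, size_t granularity) {
    dae_loop loop = { .data = data, .n = n, .granularity = granularity };
    pthread_t access, execute;
    if (granularity == 0) return 0.0;
    sem_init(&loop.access_go, 0, 1);   /* Access starts with the first slice */
    sem_init(&loop.execute_go, 0, 0);
    pthread_create(&access, NULL, access_thread, &loop);
    pthread_create(&execute, NULL, execute_thread, &loop);
    pthread_join(access, NULL);
    pthread_join(execute, NULL);
    sem_destroy(&loop.access_go);
    sem_destroy(&loop.execute_go);
    return loop.result;
}
```

The two semaphores enforce strict alternation: Execute never starts a slice before Access has finished it, and Access does not run ahead of the slice that Execute is currently processing. Each slice therefore costs one wait and one post per thread, which is the synchronization that the counter reads deliberately exclude.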
Benchmarks:
For the evaluation of our approach we have modified two
benchmarks from the SPEC 2006 suite [8], libquantum and
LBM, and CIGAR [16], a genetic algorithm search. All three
benchmarks have been considered by Koukos et al. [11] in the
initial task-based DAE framework and allow a comparison
of DAE on big.LITTLE to previous results.
Each benchmark has individual characteristics and mem-
ory access patterns which have an impact on how the proto-
types perform. While libquantum and CIGAR are consid-
ered memory-bound, LBM has been classified as intermedi-
ate [11].
4.3 Benchmark results
LBM
The loop we target in LBM performs a number of irregular
memory accesses with limited control flow. The if-conditions
only affect how calculations are performed in the Execute
phase while the required data remains the same in all cases.
This results in a simplified Access phase without any control
flow, as we can always prefetch the same values, independent
of which path in the control flow graph the Execute phase
takes. Most of the values that are accessed from memory
are double-precision floating point numbers.
The irregular access pattern and memory-bound calcu-
lations are an ideal target for decoupled execution. And in
fact, the benchmark results show that we are able to improve
the IPC of the Execute phase by up to 31% by prefetching
the data on the LITTLE core.
The total benchmark runtime increases significantly as
we lower the granularity. A smaller granularity increases
the number of total slices and with that also the number of
synchronizations performed between the two phases. This
additional locking overhead results in a penalty to overall
runtime. The loop is executed several times based on an
input parameter for the benchmark, which multiplies this
negative effect.
Figure 13 relates these two findings, the IPC speed-up
and the overall benchmark slow-down, to each other. While
we observe the best runtime at large granularities, choosing
a slightly higher overall slow-down results in much better
Execute phase performance (i.e. we spend less time on the
big core). This is discussed in more detail further below.
Figure 13: Execute IPC and overall slow-down for LBM.
CIGAR
In this case we apply DAE to a function with a higher de-
gree of indirection in the memory accesses. The calculations
themselves are relatively simple and compare two double
values within a struct to determine a new maximum and
swap two values within an integer array.
The results show that we can achieve slightly better IPC
improvements compared to LBM with a peak speed-up of
37%. The slow-down of overall execution time at lower gran-
ularities is still significant, yet not as extreme as in the pre-
vious benchmark. This can be explained by the smaller loop
size (i.e. the same granularity results in fewer total slices) and
the fact that the loop is only executed once as part of the
benchmark.
The trend of the IPC graph indicates that we can potentially
speed up the Execute phase even further by lowering
the granularity. Yet, as the overhead at low granularities
increases significantly, any improvement in IPC would be
negated.
Figure 14: Execute IPC and overall slow-down for CIGAR.
libquantum
While the other two benchmarks spawn threads every time
they execute their loop, frequent calls to our target loop in
libquantum make it a good candidate for the thread-pool
optimization. As we in fact have been able to reduce the
overhead noticeably in this case, we have taken all measure-
ments with the thread-pool variant.
The loop itself has a regular access pattern and performs
a single bitwise XOR operation on a struct member on each
iteration. Despite the memory-bound nature of this loop,
we only observe a maximum Execute phase IPC speed-up of
6.7% (see Figure 15). This is an interesting finding, as pre-
vious DAE evaluations of libquantum show improvements in
energy and performance [9, 11, 12]. Further investigations
are needed to determine whether this new behavior is caused
by coherence side-effects or other new factors introduced by
DAE execution on big.LITTLE or whether the shift to the
new architecture of the Cortex-A15 and Cortex-A7 alone is
responsible.
Figure 15: Execute IPC and overall slow-down for libquan-
tum.
4.4 Performance
Breaking down the time we spend inside the individual
functions that contain the targeted loop, we can analyze
which part (Access, Execute or synchronization) is causing
the increase in execution time. Figure 16 illustrates this for
all three benchmarks.
Here we can clearly see that, while the Execute phase gets
faster, overhead is slowing down the overall execution as we
reduce the granularity. This portion of the function time is
dominated by the locking overhead between the two phases,
as the time to initialize and join the threads is insignifi-
cant (<1ms). As mentioned before, the locking overhead
is proportional to the total number of slices, as each slice
causes two locking operations. This makes large granularities
perform significantly better. While the overhead of synchronizing
Access and Execute currently outweighs the benefits achieved
from prefetching, the synchronization mechanism can be
replaced by a more lightweight one in the future (such as
the timing-based DAE described in Section 3.2.1).
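The proportionality between locking overhead and slice count can be made concrete with a small sketch. The function names are hypothetical, and `cost_per_lock_us` is an assumed, separately measured average cost of one locking operation rather than a value from our evaluation. The slice count rounds up so that a partial final slice is counted, which differs slightly from the integer division used in Figure 12.

```c
#include <stddef.h>

/* Number of slices for a loop of n iterations (the last slice may be partial). */
size_t slice_count(size_t n, size_t granularity) {
    if (granularity == 0) return 0;
    return (n + granularity - 1) / granularity;
}

/* Two locking operations per slice, repeated for every invocation of the loop. */
size_t lock_operations(size_t n, size_t granularity, size_t invocations) {
    return 2 * slice_count(n, granularity) * invocations;
}

/* Estimated synchronization time in microseconds. cost_per_lock_us is a
   hypothetical input, not a measurement reported in this paper. */
double sync_overhead_us(size_t n, size_t granularity, size_t invocations,
                        double cost_per_lock_us) {
    return (double)lock_operations(n, granularity, invocations) * cost_per_lock_us;
}
```

Under this model, halving the granularity doubles the number of locking operations, and a loop that is invoked repeatedly (as in LBM) multiplies the overhead by the number of invocations.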
Compared to previous DAE implementations, our approach
achieves a smaller speed-up of the Execute phase. A major factor for
this is the lack of a shared LLC on our system. Running
the Access phase on the A7 only brings the data into the
LITTLE cluster. As a result, touching prefetched data in
the Execute phase no longer results in an instant cache hit.
The cache miss is just serviced by the A7 cluster instead of
by main memory. Effectively this means that we are not
making data instantly available in the cache, but are merely
reducing the cache miss latency on the A15. The result is
that we are now observing the full memory latency on the
A7 and a reduced load time on the A15 instead of a much
shorter combination of full memory latency in the Access
phase plus cache hit latency in the Execute phase.
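The latency argument above can be written down as a small model. This is a sketch with hypothetical names, and all latencies are inputs supplied by the user; none of them are measurements from our platform.

```c
/* Per-access latencies in nanoseconds (hypothetical inputs). */
typedef struct {
    double memory;     /* full main-memory latency, paid in the Access phase */
    double coherence;  /* miss on the A15 serviced from the A7 cluster */
    double cache_hit;  /* hit in a cache shared with the Execute core */
} latency_model;

/* Access + Execute cost per access when the clusters do not share an LLC:
   full memory latency on the A7 plus a coherence transfer on the A15. */
double cost_without_shared_llc(latency_model m) {
    return m.memory + m.coherence;
}

/* Access + Execute cost per access with a shared LLC:
   full memory latency in Access plus a cache hit in Execute. */
double cost_with_shared_llc(latency_model m) {
    return m.memory + m.cache_hit;
}

/* Stall time removed from the Execute phase on the current system,
   compared to fetching the data from main memory on the A15 itself. */
double execute_stall_saved(latency_model m) {
    return m.memory - m.coherence;
}
```

The model makes the limitation explicit: on the current system the Execute phase still pays `coherence` per access instead of `cache_hit`, so the benefit shrinks as the coherence latency approaches the memory latency.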
4.5 Energy savings
The previous individual results show that we managed to
speed up the Execute phase significantly for two out of the
three benchmarks. While we were only able to do this at the
cost of slowing down the overall execution time, it is important
to consider that more than half of it is now spent on the
LITTLE core as part of the Access phase. During this time
we not only run on a more energy-efficient microarchitecture
but also at a lower base clock frequency with the option to
lower it even further at relatively small performance penalties.
(a) LBM (b) CIGAR (c) Libquantum
Figure 16: Function runtime breakdown: runtime associated
with Access, Execute and synchronization.
With excessive overhead potentially nullifying all energy
savings that we achieve through speeding up the Execute
phase, the current implementation of DAE for big.LITTLE
requires us to find a balance between reducing the time we
spend on the big core and keeping the overall runtime low.
As we do not only observe notable IPC improvements for
small granularities where locking overhead becomes prob-
lematic, but also for larger slices (e.g. 28% IPC speed-up
at 1.77x slow-down compared to 37% peak improvement at
2.82x slow-down for CIGAR), the granularity effectively becomes
the main point of adjustment when we want to strike this
balance.
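One simple way to strike this balance is to fix an acceptable overall slow-down and pick the measured configuration with the best Execute phase IPC below that limit. The sketch below assumes such a policy; the slow-down limit and all names are hypothetical and not part of our implementation.

```c
#include <stddef.h>

/* One measured granularity setting (illustrative structure). */
typedef struct {
    double ipc_speedup_percent;  /* Execute phase IPC improvement */
    double slowdown;             /* overall runtime relative to baseline, e.g. 1.77 */
} granularity_result;

/* Returns the index of the result with the highest IPC improvement whose
   slow-down does not exceed max_slowdown, or -1 if none qualifies. */
int pick_granularity(const granularity_result *r, size_t count, double max_slowdown) {
    int best = -1;
    for (size_t i = 0; i < count; i++) {
        if (r[i].slowdown > max_slowdown) continue;
        if (best < 0 || r[i].ipc_speedup_percent > r[best].ipc_speedup_percent)
            best = (int)i;
    }
    return best;
}
```

With the two CIGAR data points quoted above (28% at 1.77x and 37% at 2.82x), a limit of 2.0x selects the 28% configuration, while a limit of 3.0x selects the 37% peak.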
The suboptimal performance that results from the lack of
a shared LLC between the big and LITTLE core mentioned
in the previous section also affects our energy consumption
directly. Not being able to reduce the stalls in the Execute
phase further means that we are spending more time on the
power-hungry big core than an ideal DAE implementation
would. On a system with a shared LLC, on the other hand, we
would expect the prototype to show significantly improved
results in both performance and energy savings.
5. RELATED WORK
5.1 Inter-core prefetching
Kamruzzaman et al. [10] investigated how a multi-core
system can be used to improve performance in single-threaded
applications. Their inter-core prefetching technique uses
helper threads on different cores to prefetch data. The orig-
inal compute thread is migrated between these cores at spe-
cific points in the execution to benefit from the data that
has been brought into the caches. This technique performs
transformations in software without the need of special hard-
ware and, similar to the DAE approach, targets large loops
within a program.
With memory-bound applications, they have been able
to achieve up to 2.8x speed-up and an average energy reduction
between 11 and 26%. Their work also shows the advantages
of moving the prefetches to a different core in comparison
to previous simultaneous multithreading (SMT) approaches.
This prevents the helper threads from competing for CPU resources with the
main thread and avoids the negative impact on L1 cache
behavior. On the other hand, it is also mentioned that this
approach has downsides, such as problems with cache coher-
ence when working on the same data in the main and helper
threads.
While this concept has some similarities to the method-
ology we design for DAE on big.LITTLE, we avoid migrat-
ing tasks during execution and rely on accessing prefetched
values through coherence. Additionally, we coordinate our
Access and Execute phases to work on the same chunk of
data instead of prefetching ahead.
5.2 big.LITTLE task migration
big.LITTLE implementations currently choose between
two distinct methods for task migration. The first method
relies on CPU frequency frameworks, such as cpufreq, and
works with cluster switching and CPU migration. When a
certain performance threshold is reached, tasks are migrated
to another core or a cluster switch is triggered. This is com-
parable to traditional DVFS techniques with the difference
that lower power is not represented by a change in CPU
voltage or frequency but by a migration to a different core
or cluster [5].
The second method, Global Task Scheduling (GTS), relies on the OS
scheduler to migrate tasks. In this model the scheduler is aware of the
different characteristics of the cores in the system and cre-
ates and migrates tasks accordingly. For this, it tracks the
performance requirements of threads and CPU load on the
system. This data can then be used together with heuristics
to decide the scheduling behavior [5, 7]. As all cores are
visible to the system and can be active at the same time,
this method is regarded as the most flexible. Manufacturer
white papers show that GTS improves benchmark perfor-
mance by 20% at similar power consumption compared to
cluster switching on the same hardware [7].
5.3 Scheduling on heterogeneous architectures
Chen et al. [6] analyzed scheduling techniques on heteroge-
neous architectures. For this they created a model that bases
scheduling decisions on matching the characteristics of the
different hardware to the resource requirements of the tasks
to be scheduled. In their work, they consider the instruction-level
parallelism (ILP), branch predictability, and data locality
of a task and, on the hardware side, the issue
width, branch predictor size and cache size. Using this
method they reduce the energy-delay product (EDP) by an average of 24.5% and improve
throughput and energy savings. While DAE does not
consider the same workload characteristics, we are taking
a related approach by matching the different phases to the
appropriate core type within the heterogeneous big.LITTLE
design.
Van Craeynest et al. [17] created a method to predict
how the different cores in single-ISA heterogeneous archi-
tectures, including ARM big.LITTLE, perform for a given
type of task and developed a dynamic scheduling mechanism
based on their findings. Their work mentions that simple,
in-order cores perform best on tasks with high ILP while
complex out-of-order cores benefit from memory-level par-
allelism (MLP) or where ILP can be extracted dynamically.
As a conclusion, choosing cores based on whether the execu-
tion is memory- or compute-intensive without taking those
aspects into account can lead to suboptimal performance.
Decoupled execution focuses on the memory- or compute-
bound properties to divide tasks into the two phases. While
the work of Van Craeynest et al. [17] evaluates benchmarks
as a whole and DAE is applied on a much finer scale within
individual functions of the program, their insights are impor-
tant to take into account when moving DAE to a heteroge-
neous architecture. When deciding which tasks to consider
for the individual phases for example, taking the level of ILP
and MLP into account becomes important as the phases can
now be scheduled on different types of cores.
6. CONCLUSIONS
The results show that Decoupled Access-Execute on ARM
big.LITTLE can indeed provide noticeable energy savings,
but currently only at the expense of sacrificing performance
due to the increased overhead of synchronization. Our pro-
totypes successfully demonstrate decoupled execution on a
heterogeneous architecture by running memory-bound sec-
tions on the energy-efficient LITTLE core and compute-
bound parts of the program on the performance-focused big
core. For two out of three benchmarks we were able to
improve the Execute phase IPC significantly (up to 37%),
reducing the time they are executed at high frequencies and
on performance-focused hardware.
As part of the development and evaluation process, we
have identified the bottlenecks of the current implementa-
tion and suggest concrete optimization concepts for future
iterations of this work. The locking overhead that has been
introduced as a result of parallelizing the decoupled execu-
tion is the main reason for the overall benchmark slow-down.
As it is proportional to the number of slices, choosing small
granularities is no longer viable. Instead we are now facing
a trade-off between IPC improvement and slow-down in the
overall runtime with this approach. The proposed optimiza-
tions show the potential to reduce or remove this overhead
entirely while still benefiting from decoupled execution.
While the synchronization overhead plays a big role, we
are also limited by the fact that the CPU clusters in the
Exynos 5422 do not share an LLC. For our implementation
this has the far-reaching disadvantage that any prefetches
performed in the A7 cluster are not directly available for
the A15 cores and have to be brought in through coherence
instead. This prevents DAE from performing at its full potential.
While this problem is unavoidable on the current system,
adding a shared LLC is a simple solution to it. In fact,
newer interconnect designs used in recent homogeneous de-
signs already support a shared L3 cache between clusters
that is directly connected to the bus.
7. REFERENCES
[1] The LLVM Compiler Infrastructure, 2016. [Online]
accessed 2016-05-28. Available at http://llvm.org/.
[2] ARM. Architecture Reference Manual. ARMv7-A and
ARMv7-R edition. 2012.
[3] ARM. Cortex-A7 MPCore Processor Technical
Reference Manual. 2013. Revision: r0p5.
[4] ARM. Cortex-A15 MPCore Processor Technical
Reference Manual. 2013. Revision: r4p0.
[5] ARM Limited. big.LITTLE Technology: The Future
of Mobile. Technical report, 2013.
[6] J. Chen and L. K. John. Efficient program scheduling
for heterogeneous multi-core processors. In Proceedings
of the 46th Annual Design Automation Conference,
pages 927–930. ACM, 2009.
[7] H. Chung, M. Kang, and H.-D. Cho. Heterogeneous
Multi-Processing Solution of Exynos 5 Octa with
ARM® big.LITTLE™ Technology.
[8] J. L. Henning. SPEC CPU2006 benchmark
descriptions. SIGARCH Computer Architecture News,
34(4):1–17, 2006.
[9] A. Jimborean, K. Koukos, V. Spiliopoulos,
D. Black-Schaffer, and S. Kaxiras. Fix the code. don’t
tweak the hardware: A new compiler approach to
voltage-frequency scaling. In Proceedings of Annual
IEEE/ACM International Symposium on Code
Generation and Optimization, page 262. ACM, 2014.
[10] M. Kamruzzaman, S. Swanson, and D. M. Tullsen.
Inter-core prefetching for multicore processors using
migrating helper threads. In ACM SIGARCH
Computer Architecture News, volume 39, pages
393–404. ACM, 2011.
[11] K. Koukos, D. Black-Schaffer, V. Spiliopoulos, and
S. Kaxiras. Towards more efficient execution: A
decoupled access-execute approach. In Proceedings of
the 27th international ACM conference on
International conference on supercomputing, pages
253–262. ACM, 2013.
[12] K. Koukos, P. Ekemark, G. Zacharopoulos,
V. Spiliopoulos, S. Kaxiras, and A. Jimborean.
Multiversioned decoupled access-execute: the key to
energy-efficient compilation of general-purpose
programs. In Proceedings of the 25th International
Conference on Compiler Construction, pages 121–131.
ACM, 2016.
[13] A. Sloss, D. Symes, and C. Wright. ARM system
developer’s guide: designing and optimizing system
software. Morgan Kaufmann, 2004.
[14] V. Spiliopoulos, S. Kaxiras, and G. Keramidas. Green
governors: A framework for continuously adaptive
dvfs. In Green Computing Conference and Workshops
(IGCC), 2011 International, pages 1–8. IEEE, 2011.
[15] A. Stevens. Introduction to AMBA® 4 ACE™
and big.LITTLE™ Processing Technology. Technical
report, 2013.
[16] University of Nevada, Reno. Evolutionary Computing
Systems Lab, 2016. [Online] accessed 2016-05-28.
Available at http://ecsl.cse.unr.edu/.
[17] K. Van Craeynest, A. Jaleel, L. Eeckhout, P. Narvaez,
and J. Emer. Scheduling heterogeneous multi-cores
through performance impact estimation (pie). In ACM
SIGARCH Computer Architecture News, volume 40,
pages 213–224. IEEE Computer Society, 2012.
