Speedup and Power Scaling Models for Heterogeneous Many-Core Systems by Rafiev A et al.
 
 
 
 
 
Newcastle University ePrints | eprint.ncl.ac.uk 
Rafiev A, Al-hayanni MAN, Xia F, Shafik R, Romanovsky A, Yakovlev A.  
Speedup and Power Scaling Models for Heterogeneous Many-Core Systems.  
IEEE Transactions on Multi-Scale Computing Systems 2018,  
DOI: 10.1109/TMSCS.2018.2791531.
DOI link 
https://doi.org/10.1109/TMSCS.2018.2791531 
ePrints link 
http://eprint.ncl.ac.uk/pub_details2.aspx?pub_id=245030 
Date deposited 
25/12/2018 
Copyright 
© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be 
obtained for all other uses, in any current or future media, including reprinting/republishing 
this material for advertising or promotional purposes, creating new collective works, for 
resale or redistribution to servers or lists, or reuse of any copyrighted component of this work 
in other works. 
 
IEEE TRANSACTIONS ON MULTI-SCALE COMPUTING SYSTEMS 1
Speedup and Power Scaling Models
for Heterogeneous Many-Core Systems
Ashur Rafiev, Mohammed A. N. Al-hayanni, Student member, IEEE, Fei Xia, Rishad Shafik, Member, IEEE,
Alexander Romanovsky, Alex Yakovlev, Fellow, IEEE
Abstract—Traditional speedup models, such as Amdahl’s Law,
Gustafson’s, and Sun and Ni’s models, have helped the research
community and industry to better understand the performance
capabilities of systems and the parallelizability of applications.
Mostly targeting homogeneous hardware platforms or a limited
form of processor heterogeneity, these models do not cover newly
emerging multi-core heterogeneous architectures. This paper
reports novel speedup and energy consumption models based
on a more general representation of heterogeneity, called normal
form heterogeneity, supporting a wide range of heterogeneous
many-core architectures. The modelling method aims to predict
system energy efficiency and performance ranges and facilitates
research and development for the hardware and system software
levels. Extensive experimentation on an off-the-shelf big.LITTLE
heterogeneous platform validates the models showing less than
1% error for speedup and less than 4% error for power
dissipation. The practical use of the method is demonstrated
with a quantitative study of system load balancing efficiency.
Index Terms—Heterogeneous systems, speedup modelling,
energy-aware systems, load balancing, Amdahl’s law, multi-core
processors
I. INTRODUCTION
FROM the early days of computing systems, there has been a persistent engineering effort to improve computation
speed by distributing the work across multiple devices. Pre-
dicting the system’s gain in performance, called the speedup,
has been a major focus in this area of research. Amdahl’s
law has been known since 1967 [1]. It assumes that a fixed
workload is executed in n processors and compares the perfor-
mance with the same workload executed in a single processor.
The model shows that the speedup will quickly saturate with
increasing n if the workload requires synchronization. In
1988, Gustafson introduced the principle of workload scaling
pertaining to the fixed time model [2]. This model proposes
to extend the workload in proportion to the system’s scalability,
which results in a linear increase in the speedup. In
1990, Sun and Ni suggested a new model, which included
extended workload calculations by considering the capability
of the memory [3], [4].
A. Rafiev, M. Al-hayanni, F. Xia, R. Shafik, A. Romanovsky, and
A. Yakovlev are with Newcastle University, UK.
E-mail: {ashur.rafiev, m.a.n.al-hayanni, fei.xia, rishad.shafik,
alexander.romanovsky, alex.yakovlev}@ncl.ac.uk
M. Al-hayanni is also with University of Technology and HCED, Iraq.
This work is supported by EPSRC/UK as a part of PRiME project
EP/K034448/1.
Over the years, technology scaling has facilitated significant
performance improvement at reduced power consumption
through increased operating frequency and smaller device
geometries [5]. The number of transistors per unit of area
has increased substantially, conforming to Moore’s [6] and
Koomey’s laws [7], and Pollack’s rule suggests that performance
increases approximately in proportion to the square
root of the complexity [8].
As a result, nowadays almost every consumer device or
embedded system uses the computational power of multi-
core processing. The number of cores in a device is con-
stantly growing, hence the speedup scaling models remain
of high importance. The convenience of using Amdahl’s law
and derived models is in that they do not require complex
modelling and simulation of individual inter-process com-
munications. Instead, these models operate on the average
platform and application characteristics and provide simple
analytical solutions that project the system’s capabilities in a clear
and understandable way. They provide a valuable insight into
system scalability and have become pivotal for multi-scale
systems research. However, it is still important to keep the
models up to date, to make sure they stay relevant and correctly
represent novel aspects of platform design.
From the increase in system complexity and integration, the
concept of heterogeneous computation has emerged. Initially,
the heterogeneity appeared in the form of specialized accelerators,
such as GPUs and DSPs. In recent years, multiple types of
CPU cores in a single device have also been made popular.
For instance, the ARM big.LITTLE processor has found a
wide use in mobile devices [9]. Heterogeneous systems pose
additional engineering and research challenges. In the area of
scheduling and load balancing, the aim is to improve core
utilization for more efficient use of the available performance.
Operating systems traditionally implement symmetric multi-
processor (SMP) scheduling algorithms designed for homo-
geneous systems, and ARM have done dedicated work on
modifying the Linux kernel to make load balancing suitable
for their big.LITTLE processor [10].
In addition to performance concerns, power dissipation
management is also a significant issue in scalable systems: ac-
cording to Dennard’s CMOS scaling law [11] despite smaller
geometries the power density of devices remains constant.
Hill and Marty extended Amdahl’s speedup model to cover
simple heterogeneous configurations consisting of a single big
core and many smaller ones of exactly the same type [12],
which relates to the CPU-GPU type of heterogeneity. The
studies in [13], [14] extended Hill-Marty analysis to all three
major speedup models. The problem of energy efficiency
has been addressed in [15] for the homogeneous and simple
heterogeneous Amdahl’s model.
TABLE I
EXISTING SPEEDUP MODELS AND THE PROPOSED MODEL
                 homogeneity  heterogeneity  power  Amdahl’s  Gustafson’s  Sun and Ni’s
                                                    law       model        model
[1]              yes          no             no     yes       no           no
[2]              yes          no             no     yes       yes          no
[3]              yes          no             no     yes       yes          yes
[12]             yes          simple         no     yes       no           no
[15]             yes          simple         yes    yes       no           no
[13], [14]       yes          simple         no     yes       yes          yes
proposed models  yes          normal form    yes    yes       yes          yes
                              (Section 3)
A. Research Contributions
In order to be relevant to more general and emerging types
of heterogeneous systems, new types of models need to be
developed. This paper extends the classical speedup models to
a so-called normal form representation of heterogeneity, which
describes core performances as a vector. This representation
can fit a wider range of systems, including the big.LITTLE
processor or homogeneous processors with multiple dynamic
voltage-frequency scaling (DVFS) islands. The initial work
on this topic has been published in [16], which includes the
derivation of the speedup models as well as a set of power
models for this extended representation of heterogeneity. In
addition to fixed-workload Amdahl’s law, workload scaling
from the works of Gustafson and Sun-Ni has also been
considered. In the current publication we expand the work
by addressing the effects of workload distribution and load
balancing, and also explore additional modes of workload
scaling relevant only to heterogeneous systems. We discover
that the presented models inherit certain limitations from
Amdahl’s law, which may be significant for heterogeneous
modelling and need to be taken into account. The paper
provides a concise discussion on the matter. In addition, an
extensive set of experiments has been carried out to validate
the proposed models and to explore their practical use.
This paper makes the following contributions:
• extending the classical speedup models to normal form
heterogeneity in order to represent modern examples of
heterogeneous systems;
• extending heterogeneous speedup models to include
power and energy estimation and studying the
performance-energy trade-offs in the heterogeneous
systems;
• clarifying the limitations of the Amdahl-like heteroge-
neous models and outlining further challenges of hetero-
geneous speedup and power modelling;
• validating the models on a real heterogeneous platform
under a set of carefully controlled model parameters;
• practically using the models to evaluate the efficiency
of the Linux scheduler’s load balancing while running
realistic workload in a heterogeneous system.
Table 1 compares this paper’s contributions to the range of
related research publications.
The experimental work presented in this paper has been
carried out on the Odroid-XU3 [17] development platform
centred around ARM big.LITTLE Cortex A7-A15 cores.
The paper is organized as follows. Section 2 gives an
overview of the existing homogeneous and heterogeneous
speedup models. Section 3 discusses the model assumptions
and formally defines the normal form heterogeneous system’s
structure. Sections 4 and 5 present the new heterogeneous
speedup and power models respectively. Section 6 experimen-
tally validates the models. Section 7 shows the experiments
with real life benchmarks. Section 8 concludes the work.
II. EXISTING SPEEDUP MODELS
In homogeneous systems all cores are identical in terms of
performance, power, and workload execution.
For a homogeneous system we consider a system consisting
of n cores, each core having a performance of θ = I / t (1),
where I is the given workload and t (1) is the time needed
to execute the workload on the core. This section describes
various existing models for determining the system’s speedup
S (n) in relation to a single core, which can be used to find
the performance Θ (n) of the system:
Θ (n) = θS (n) . (1)
Amdahl-like speedup models are built around the paralleliz-
ability factor p, 0 ≤ p ≤ 1, which reflects the application’s
capability of performing parallel computation. Given a total
workload of I , the parallel part of a workload is pI and the
sequential part is (1− p) I .
A. Amdahl’s Law (Fixed Workload)
The general idea of this model is to compare execution time
for some fixed workload I on a single core with the execution
time for the same workload on the entire n-core system [1].
Time to execute workload I on a single core is t (1),
whereas t (n) adds up the sequential execution time on one
core at the performance θ and the parallel execution time on
all n cores at the performance nθ:
$$ t(1) = \frac{I}{\theta}, \qquad t(n) = \frac{(1-p) I}{\theta} + \frac{p I}{n \theta}, \qquad (2) $$
thus the speedup can be found as follows:
$$ S(n) = \frac{t(1)}{t(n)} = \frac{1}{(1-p) + \frac{p}{n}}. \qquad (3) $$
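As an illustration of (3), the following minimal Python sketch evaluates the fixed-workload speedup for a given p and n. The function name `amdahl_speedup` and the input checks are our own illustrative choices and are not part of the original model description.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Fixed-workload speedup of an n-core homogeneous system, eq. (3)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Sequential share (1 - p) runs on one core; parallel share p is split over n cores.
    return 1.0 / ((1.0 - p) + p / n)
```

For any p < 1 the returned value stays below 1 / (1 − p) regardless of n, which is the saturation effect discussed in Section I.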
B. Gustafson’s Model (Fixed Time)
Gustafson re-evaluated the fixed workload speedup model to
derive a new fixed time model [2]. In this model, the workload
increases with the number of cores, while the execution
time is fixed. An important note is that the workload scales
asymmetrically: the parallel part is scaled to the number of
cores, whilst the sequential part is not increased.
Let’s denote the initial workload as I and extended work-
load as I ′. The time to execute initial workload and extended
workload are t (n) and t′ (n) respectively. The workload
scaling ratio can be found from:
$$ t(1) = \frac{I}{\theta}, \qquad t(n) = \frac{(1-p) I}{\theta} + \frac{p I'}{n \theta}, \qquad (4) $$
and, since t (1) = t (n) , the extended workload can be found
as I ′ = nI. The time it would take to execute I ′ on a single
core is:
$$ t'(1) = \frac{(1-p) I}{\theta} + \frac{p n I}{\theta}, \qquad (5) $$
which means that the achieved speedup equals:
$$ S(n) = \frac{t'(1)}{t(1)} = (1-p) + p n. \qquad (6) $$
Not all applications can provide workload extension only
in the parallelizable part without changing the sequential part,
hence there is a limitation on the applicability of the extended
workload models. A typical example is an application that
changes the computation quality depending on the number of
available cores. The main contribution of Gustafson’s model,
however, is to show that it is possible to build an application
that scales to multiple cores without suffering saturation.
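The fixed-time speedup (6) can be sketched in the same way. This is only an illustrative helper under the assumptions of this subsection (the parallel part scales with n, the sequential part does not); the name `gustafson_speedup` is hypothetical.

```python
def gustafson_speedup(p: float, n: int) -> float:
    """Fixed-time speedup of an n-core homogeneous system, eq. (6)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if n < 1:
        raise ValueError("n must be a positive integer")
    # The sequential share is unchanged; the parallel share grows n times.
    return (1.0 - p) + p * n
```

Unlike the fixed-workload case, the result grows linearly with n for any p > 0.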
C. Sun and Ni’s Model (Memory Bounded)
Sun and Ni took into account the previous two speedup
models by considering the memory bounded constraints [3],
[4]. In this model the execution time and the workload change
according to the memory capability. The parameter g (n)
reflects the scaling of the workload in relation to scaling the
memory with the number of cores:
I ′ = g (n) I. (7)
A typical example g (n) is given for an m × m matrix
multiplication, which has the memory requirement of $O(m^2)$
and the computation cost (workload) of $O(m^3)$. In this case,
$g(n) = n^{3/2}$.
The model calculates the speedup as follows:
$$ S(n) = \frac{t'(1)}{t'(n)} = \frac{(1-p) + p\,g(n)}{(1-p) + \frac{p\,g(n)}{n}}. \qquad (8) $$
Because the workload is scaled by g (n) according to (7),
one of the important properties of this model is that for g (n) =
1 Sun and Ni’s model (8) transforms into Amdahl’s law (3),
and for g (n) = n it becomes Gustafson’s law (6). Further in
this paper, we do not specifically relate g (n) to the memory
access or any other property of the system and consider it as
a given or determined parameter pertaining to a general case
of workload scaling.
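The relation between (8), (3) and (6) can be checked numerically. The sketch below is illustrative only: `sun_ni_speedup` and `matrix_g` are hypothetical names, and g is passed in as an arbitrary callable, in line with treating g (n) as a given parameter.

```python
from typing import Callable


def sun_ni_speedup(p: float, n: int, g: Callable[[int], float]) -> float:
    """Memory-bounded speedup, eq. (8), for a workload scaling function g."""
    gn = g(n)
    return ((1.0 - p) + p * gn) / ((1.0 - p) + p * gn / n)


def matrix_g(n: int) -> float:
    """g(n) = n^(3/2), the matrix multiplication example from the text."""
    return float(n) ** 1.5
```

Passing `lambda n: 1.0` reproduces the fixed-workload result (3), and passing `lambda n: float(n)` reproduces the fixed-time result (6).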
D. Existing Heterogeneous Models
Previous attempts to extend speedup laws to heterogeneous
systems were mainly focused on a single high performance
core and many smaller cores of the same type [12].
Core performances are related to some base-core equivalent
(BCE), which is considered to have θ = 1. This model studies
a system with one big core having Θ (r) relative performance
and (n− r) little cores with BCE performances, as shown in
Figure 1(b). The sequential workload is executed on the faster
core, while the parallel part exercises all cores simultaneously.
This transforms Amdahl’s law (3) as follows:
$$ S(n, r) = \frac{1}{\frac{(1-p)}{\Theta(r)} + \frac{p}{\Theta(r) + (n-r)}}. \qquad (9) $$
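A minimal sketch of (9) is given below. The relative performance Θ (r) of the big core is passed in directly as `theta_r`, since its dependence on r is not specified here; the function and parameter names are illustrative assumptions.

```python
def hill_marty_speedup(p: float, n: int, r: int, theta_r: float) -> float:
    """Speedup of one big core (theta_r BCEs) plus (n - r) BCE cores, eq. (9)."""
    if not 1 <= r <= n:
        raise ValueError("r must satisfy 1 <= r <= n")
    if theta_r <= 0.0:
        raise ValueError("theta_r must be positive")
    sequential = (1.0 - p) / theta_r          # sequential part on the big core
    parallel = p / (theta_r + (n - r))        # parallel part on all cores
    return 1.0 / (sequential + parallel)
```

With r = 1 and `theta_r = 1.0` the expression reduces to the homogeneous form (3).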
In this work we aim to cover more diverse cases of
heterogeneity pertaining to such modern architectures as ARM
big.LITTLE [17] which are not directly covered by existing
speedup models.
III. HETEROGENEOUS PLATFORM ASSUMPTIONS
Homogeneous models are used to compare the speedup
between different numbers of cores. Similarly, heterogeneous
models should compare the speedup between core config-
urations, where each configuration defines the number of
cores in each available core type. This section discusses the
problems of modelling consistency across different core types
and provides the foundation for all heterogeneous models
presented later in this paper.
A. The Challenges of Heterogeneous Modelling
Heterogeneous models must capture the performance and
other characteristics across different types of cores in a com-
parable way. Such a comparison is not always straightforward,
and in many ways similar to cross-platform comparison. This
section discusses the assumptions behind Amdahl’s law and
similar models under the scope of heterogeneous modelling
and outlines the limitations they may cause.
1) Hardware-dependent parallelizability: In the models
presented in Section 2, there is a clear time-separation of the
synchronous and parallel executions of the entire workload. In
other words, all threads synchronize at the same time. These
models do not explore complex interactions between the pro-
cesses, hence they do not provide exact timing predictions and
should not be used for time-critical analyses like real-time sys-
tems research. Solving for process interactions is possible with
Petri Net simulations [18] or process algebra [19]. Amdahl-
like models, in contrast, focus on generic analytical solutions
that give approximate envelopes for platform capabilities.
These models use parallelizability factor p as one of the
main parameters to the model, however they give little expla-
nation on how to obtain p. This problem has been a challenge
for numerous research efforts [20]. A naïve intuition is that p
is a property of an algorithm. However, there are a number of
known embarrassingly-parallelizable algorithms (p = 1), but it
has never been possible to achieve this level of parallelization
running such algorithms on real platforms.
The real value of p is a combined property of an algorithm
running on a specific hardware, and represents algorithm paral-
lelizability pa being modulated by platform architecture’s par-
allelizability ph. In a simplified case, there are two sources of
sequential execution: computation synchronization (1− pa) I ,
required by the algorithm itself, and the hardware critical
section (1− ph) paI , when algorithmically parallel computations
have to wait for shared hardware resources.
[Figure 1: (a) n cores of 1 BCE each; (b) (n − r) small cores of 1 BCE each and 1 large core of Θ (r); (c) n1 type 1 cores, n2 type 2 cores, ..., nx type x cores, related to a virtual BCE.]
Fig. 1. The proposed extended structure of a heterogeneous system (c)
compared to a homogeneous system (a) and the previous assumption [12]
on heterogeneity (b). The numbers in the core boxes denote the equivalent
number of BCEs.
Hence, the combined sequential workload is (1− paph) I and the fully
parallel part is paphI , which leads to:
$$ p = p_a p_h. \qquad (10) $$
On the surface, this equation divides p into application-
dependent and hardware-dependent components, but this is
not quite true as ph is also dependent on the instructions
being executed. As an example, consider two embarrassingly-parallelizable
(pa = 1) applications: one performing a large
calculation on a small set of data, and another performing a
small calculation over a large amount of data. The second
application requires considerably more memory accesses per
unit of output, and if the memory is a shared resource, this
would cause different ph for the two algorithms on the same
processor. Hence, (10) does not solve the problem, but pushes
it one step further.
From the standpoint of heterogeneous modelling, the poten-
tial differences in ph between core types or cache islands will
cause the overall p to change between core configurations.
In this paper, we do not attempt to solve this challenge.
As demonstrated further in this paper, it is still possible to
build heterogeneous models around a constant p and use a
range of possible values [pmin, pmax] to determine the system’s
minimum and maximum speedup capabilities.
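The use of a range [pmin, pmax] can be sketched as follows. For brevity the example uses the homogeneous fixed-workload speedup (3) as a stand-in for whichever speedup model is being applied; all function names are illustrative, and the values of pa and ph are assumed to be known or estimated.

```python
def combined_parallelizability(p_a: float, p_h: float) -> float:
    """Overall parallelizability from algorithm and hardware factors, eq. (10)."""
    return p_a * p_h


def amdahl_speedup(p: float, n: int) -> float:
    """Fixed-workload homogeneous speedup, eq. (3)."""
    return 1.0 / ((1.0 - p) + p / n)


def speedup_envelope(p_min: float, p_max: float, n: int) -> tuple:
    """Minimum and maximum speedup for p anywhere in [p_min, p_max]."""
    if not 0.0 <= p_min <= p_max <= 1.0:
        raise ValueError("expected 0 <= p_min <= p_max <= 1")
    # Eq. (3) is non-decreasing in p for n >= 1, so the end points bound the range.
    return amdahl_speedup(p_min, n), amdahl_speedup(p_max, n)
```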
2) Workload equivalence and performance comparison:
Workload is a model parameter that links performance with
the execution time. In many cases, a popular metric for
performance is instructions per second (IPS), in which case
a workload is characterized by its number of instructions. IPS
is convenient as it is an application-independent property of
the platform; it is also used for deriving power optimization
metrics such as energy per instruction.
In heterogeneous models, it is important to have a con-
sistent metric across all core types. For devices of different
architecture types, the same computation may be compiled
into different numbers of instructions. In this case, the total
number of instructions can no longer meaningfully represent
the same workload, and IPS cannot be universally used for
cross-platform performance comparison. This is particularly
clear when comparing CPU and GPU devices.
In order to build a valid cross-platform performance com-
parison model, we need to reason about the workload as a
meaningful computation, and two workloads are considered
equivalent as long as they perform the same task. In this
paper we measure workload in so-called “workload items”,
which can be defined on a case by case basis depending on
the practical application. Respectively, instead of energy per
instruction, we use energy per workload item.
Hill and Marty’s model, presented in Section 2.4, describes
the performance difference between the core types as Θ (r).
In real life, this relation is application dependent, as will be
demonstrated in Section 6. Differences in hardware, such as
pipeline depth and cache sizes, cause performance differences
on a per-instruction basis [21]. As a result, even within the
same instruction set, core type i may execute workload A
faster than core type j, but core type j may execute workload
B faster than core type i. Hence, A and B must use different
performance ratios to describe the same heterogeneous plat-
form.
B. Platform Assumptions
We build our models under the assumptions listed below.
These assumptions put limitations on the models as discussed
earlier in this section. However, the same assumptions are used
in the classical Amdahl’s law and similar models, hence they
do not reduce the applicability of the presented models.
• The models and model parameters are both application
and hardware specific.
• The relation between performances of cores of different
types can be approximated to a constant ratio.
• The parallelizability factor p can be approximated by a
constant and is known or can be determined (exactly or
within a range).
• Environmental factors, such as temperature, are not con-
sidered.
Inter-core communication overheads, addressed in Li and
Malek’s model [22], are not considered in this paper and are
the subject of future work.
C. Normal Form Representation of Heterogeneity
Performance-wise, the models presented in subsequent sec-
tions describe heterogeneity using the following normal form
representation.
The normal form of heterogeneous system configuration
considered in this paper consists of x clusters (types) of
homogeneous cores with the numbers of cores defined as a
vector n = (n1, . . . , nx). The total number of cores in the
system is denoted as $N = \sum_{i=1}^{x} n_i$. Vector α = (α1, . . . , αx)
defines the performance of each core by cluster (type) in
relation to some base core equivalent (BCE), such that for
all 1 ≤ i ≤ x we have θi = αiθ. As discussed earlier,
the parameter α is application- and platform-dependent. The
structure is shown in Figure 1(c).
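The normal form can be captured in a small data structure. The sketch below is only one possible encoding; the class name `NormalForm`, its methods, and the example values are hypothetical and are not measured parameters of any platform.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NormalForm:
    """x clusters of homogeneous cores: core counts n and relative performances alpha."""
    n: Tuple[int, ...]
    alpha: Tuple[float, ...]

    def __post_init__(self) -> None:
        if len(self.n) != len(self.alpha):
            raise ValueError("n and alpha must have one entry per core type")
        if any(c < 0 for c in self.n) or any(a <= 0 for a in self.alpha):
            raise ValueError("core counts must be >= 0 and alpha must be > 0")

    @property
    def total_cores(self) -> int:
        """N, the sum of n_i over all core types."""
        return sum(self.n)

    def core_performance(self, theta: float) -> Tuple[float, ...]:
        """theta_i = alpha_i * theta for every core type."""
        return tuple(a * theta for a in self.alpha)


# Hypothetical two-cluster configuration: 4 cores at 1 BCE and 4 cores at 2 BCEs.
example = NormalForm(n=(4, 4), alpha=(1.0, 2.0))
```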
[Figure 2: execution timelines with sequential (ts) and parallel (tp) phases for three cores with α1 = 4, α2 = 3, α3 = 6; panels (a) and (b).]
Fig. 2. Workload distribution examples following (a) equal-share model and
(b) balanced model.
IV. PROPOSED SPEEDUP MODELS
This section extends homogeneous speedup models for
determining the speedup S (n) of a heterogeneous system in
relation to a single BCE, which can then be used to find the
performance of the system using (1).
A. Workload Distribution Models
The fundamental idea behind the speedup modelling is that
the cores do not contribute to overall system performance
when idle. Homogeneous models distinguish two states of
performance: the parallel execution exercises all cores, and
the sequential execution exercises only one core while others
are idle. The cores in such systems are considered identical,
hence they all execute equal shares of the parallelizable part
of the workload and finish at the same time. As the result, the
combined performance of the cores working in parallel is θn.
In heterogeneous systems this is not as straightforward: each
type of cores works at a different performance rate, hence the
execution time depends greatly on the workload distribution
between the cores. Imperfect distribution causes some cores
to finish early and become idle, even when the parallelizable
part of the workload has not been completed.
In real systems, the scheduler is assisted by a load balancer,
whose task is to redistribute the workload during run-time
from busy cores to idle cores, however its efficiency is not
guaranteed to be optimal [23]. The actual algorithm behind the
load balancer may vary between different operating systems,
and the load balancer typically has access to run-time only
information like CPU time of individual processes and the
sizes of waiting queues. Hence it is virtually impossible to
accurately describe the behaviour of the load balancer as an
analytical formula. In this section we address the problem by
studying two boundary cases, which may provide a range of
minimum and maximum parallel performances.
By definition, the total execution time for the workload I
is a sum of sequential and parallel execution times, ts (n) and
tp (n), and it represents the time interval between the first
instruction in I starting and the last instruction in I finishing,
meaning that, during a parallel execution, only the longest
running core has an effect on the total execution time. In other
words:
$$ t_p(n) = \max_{i=1}^{x} \frac{I_{pi}}{\alpha_i \theta}, \qquad (11) $$
where Ipi is a share of the parallelizable workload (pI) for a
single core of the type i.
To be analogous to the homogeneous models and to simplify
our equations, we also define the system’s parallel performance
via the performance-equivalent number of BCEs denoted
as Nα:
$$ N_\alpha = \frac{pI}{t_p(n)\,\theta} = pI \cdot \min_{i=1}^{x} \frac{\alpha_i}{I_{pi}}. \qquad (12) $$
1) Equal-share workload distribution: In homogeneous
systems, the parallelizable workload is equally split between
all cores. As a result, many legacy applications, developed
with the homogeneous system architecture in mind, would
also equally split the workload by the total number of cores
(threads): Ipi = pI/N , which leads to a very inefficient execution
in heterogeneous systems, where everyone is waiting for the
slowest core (thread), as illustrated in Figure 2(a). In this case,
Nα is calculated from the minimum of α:
$$ N_\alpha = N \cdot \min_{i=1}^{x} \alpha_i. \qquad (13) $$
The above equation implies that the workload cannot be
moved between the cores. If the system load balancer is
allowed to re-distribute the work, then the real Nα may be
greater than (13). This equation can be used to define a lower
performance bound corresponding to a naïve scheduling policy
with no balancing.
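Equations (11) to (13) can be illustrated with a short sketch that takes an arbitrary per-core workload share for each core type. The function names are hypothetical; the α values are those shown in Figure 2, and the parallel workload of 39 items is an assumed value chosen for convenience.

```python
from typing import List, Sequence


def parallel_time(shares: Sequence[float], alpha: Sequence[float], theta: float = 1.0) -> float:
    """Eq. (11): the longest-running core type determines the parallel time."""
    return max(s / (a * theta) for s, a in zip(shares, alpha))


def n_alpha_from_shares(n: Sequence[int], shares: Sequence[float], alpha: Sequence[float]) -> float:
    """Eq. (12): performance-equivalent number of BCEs for given per-core shares."""
    parallel_workload = sum(c * s for c, s in zip(n, shares))  # pI
    return parallel_workload * min(a / s for a, s in zip(alpha, shares))


def equal_shares(n: Sequence[int], parallel_workload: float) -> List[float]:
    """Equal-share distribution: every core receives pI / N."""
    return [parallel_workload / sum(n)] * len(n)


n = (1, 1, 1)            # one core of each type
alpha = (4.0, 3.0, 6.0)  # relative performances, as in Figure 2
shares = equal_shares(n, 39.0)
n_alpha = n_alpha_from_shares(n, shares, alpha)
```

Under these assumptions every core receives 13 items, the slowest core (α = 3) determines the parallel time, and `n_alpha` agrees with N · min α from (13).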
2) Balanced workload distribution: Figure 2(b) shows the
ideal case of workload balancing, which implies zero waiting
time, hence all cores should theoretically finish at the same
time. In other words, a parallel execution time tpi (n) in core
type i must be equal to tp (n) for all 1 ≤ i ≤ x:
$$ t_p(n) = t_{pi}(n) = \frac{I_{pi}}{\alpha_i \theta}. \qquad (14) $$
Because of this, we can agree that, from (12), individual core
workload Ipi can be found for all 1 ≤ i ≤ x as follows:
$$ I_{pi} = \frac{\alpha_i\, pI}{N_\alpha}. \qquad (15) $$
We also know that all individual core workloads must add up
to the total parallelizable workload:
$$ pI = \sum_{i=1}^{x} n_i I_{pi}, \qquad (16) $$
and we can solve equations (15) and (16), giving us Nα for
optimal workload distribution:
$$ N_\alpha = \sum_{i=1}^{x} \alpha_i n_i. \qquad (17) $$
Nαθ represents the system’s performance during the parallel
execution, hence Nα values from (13) and (17) define the
range for heterogeneous system parallel performances. A load
balancer that violates the lower bound (13) is deemed to
be worse than naïve. The upper bound (17) represents the
theoretical maximum, and it cannot be exceeded.
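The two bounds (13) and (17), together with the balanced shares (15), can be computed as in the sketch below. The helper names are illustrative; the α values are those of Figure 2 and the parallel workload of 39 items is an assumed value.

```python
from typing import List, Sequence


def n_alpha_equal_share(n: Sequence[int], alpha: Sequence[float]) -> float:
    """Lower bound, eq. (13): N times the smallest alpha."""
    return sum(n) * min(alpha)


def n_alpha_balanced(n: Sequence[int], alpha: Sequence[float]) -> float:
    """Upper bound, eq. (17): sum of alpha_i * n_i."""
    return sum(a * c for a, c in zip(alpha, n))


def balanced_shares(n: Sequence[int], alpha: Sequence[float], parallel_workload: float) -> List[float]:
    """Per-core workload for each core type under ideal balancing, eq. (15)."""
    n_alpha = n_alpha_balanced(n, alpha)
    return [a * parallel_workload / n_alpha for a in alpha]


n = (1, 1, 1)
alpha = (4.0, 3.0, 6.0)
lower = n_alpha_equal_share(n, alpha)
upper = n_alpha_balanced(n, alpha)
shares = balanced_shares(n, alpha, 39.0)
```

The shares add up to the total parallel workload as required by (16), and each core finishes after the same time Ipi / (αi θ).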
B. Heterogeneous Amdahl’s Law
We assume that the sequential part is executed on a single
core in the cluster s, hence the system’s performance during
sequential execution is αsθ. In Section 4.1 we defined parallel
performance as Nαθ. Hence, the time to execute the fixed
workload I on the given heterogeneous system is:
$$ t(n) = t_s(n) + t_p(n) = \frac{(1-p) I}{\alpha_s \theta} + \frac{pI}{N_\alpha \theta}. \qquad (18) $$
The speedup in relation to a single BCE is:
$$ S(n) = \frac{t(1)}{t(n)} = \frac{1}{\frac{(1-p)}{\alpha_s} + \frac{p}{N_\alpha}}. \qquad (19) $$
One can verify that this equation also covers Hill-Marty’s
model (9), in which case n = (n− r, 1), α = (1,Θ (r)),
αs = Θ (r), and Nα is calculated for the balanced workload
distribution (17).
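The correspondence between (19) and (9) can be verified numerically. The sketch below is illustrative: the function names and the sample values of p, n, r and Θ (r) are assumptions chosen only for the check.

```python
from typing import Sequence


def n_alpha_balanced(n: Sequence[int], alpha: Sequence[float]) -> float:
    """Balanced parallel performance in BCEs, eq. (17)."""
    return sum(a * c for a, c in zip(alpha, n))


def hetero_amdahl_speedup(p: float, alpha_s: float, n_alpha: float) -> float:
    """Heterogeneous fixed-workload speedup relative to one BCE, eq. (19)."""
    return 1.0 / ((1.0 - p) / alpha_s + p / n_alpha)


def hill_marty_speedup(p: float, n_total: int, r: int, theta_r: float) -> float:
    """Eq. (9), written out directly for comparison."""
    return 1.0 / ((1.0 - p) / theta_r + p / (theta_r + (n_total - r)))


# Hill-Marty as a special case: n = (n - r, 1), alpha = (1, Theta(r)), alpha_s = Theta(r).
p, n_total, r, theta_r = 0.9, 16, 4, 2.0
n_alpha = n_alpha_balanced((n_total - r, 1), (1.0, theta_r))
s_normal_form = hetero_amdahl_speedup(p, alpha_s=theta_r, n_alpha=n_alpha)
s_hill_marty = hill_marty_speedup(p, n_total, r, theta_r)
```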
C. Workload Scaling
Like in the homogeneous case, Amdahl’s law works with a
fixed workload, while Gustafson and Sun-Ni allow changing
the workload with respect to the system’s capabilities. In this
section we consider a general assumption on workload scaling,
which defines the extended workload using characteristic
functions g (n) and h (n) as follows:
I ′ = h (n) · ((1− p) I + pg (n) I) , (20)
where h (n) represents the symmetric scaling of the entire
workload, and g (n) represents the scaling of the parallelizable
part only.
The sequential and parallel execution times are respectively:
$$ t'_s(n) = \frac{(1-p) I}{\alpha_s \theta}, \qquad t'_p(n) = \frac{p\,g(n)\,I}{N_\alpha \theta}. \qquad (21) $$
Hence, in the general case, for given workload scaling
functions g (n) and h (n), the speedup is calculated as follows:
$$ S(n) = \frac{I'}{(t'_s + t'_p)\,\theta} = \frac{(1-p) + p\,g(n)}{\frac{(1-p)}{\alpha_s} + \frac{p\,g(n)}{N_\alpha}}. \qquad (22) $$
Note that the speedup does not depend on the symmetric scal-
ing h (n). Indeed, the execution time proportionally increases
with the workload, and the performance ratio (i.e. the speedup)
remains constant. However, changing the execution time is
important for the fixed-time Gustafson’s model.
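A minimal sketch of the right-hand side of (22) is shown below, with the value of g (n) passed in as a plain number. The function name `scaled_speedup` is hypothetical.

```python
def scaled_speedup(p: float, alpha_s: float, n_alpha: float, g: float) -> float:
    """General heterogeneous speedup with parallel workload scaling g, eq. (22)."""
    if alpha_s <= 0.0 or n_alpha <= 0.0:
        raise ValueError("alpha_s and n_alpha must be positive")
    numerator = (1.0 - p) + p * g
    denominator = (1.0 - p) / alpha_s + p * g / n_alpha
    return numerator / denominator
```

Setting `g = 1.0` gives the heterogeneous Amdahl's law (19), while `alpha_s = 1.0` and `n_alpha = n` give the homogeneous memory-bounded form (8).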
D. Heterogeneous Gustafson’s Model
In the Gustafson model, the workload is extended to
achieve equal time execution: t′ (n) = t (1). For homogeneous
Gustafson’s model: g (n) = n and h (n) = 1. For a heterogeneous
system, there is more than one way to achieve equal
time execution.
1) Purely parallel scaling mode: The maximum speedup
for equal time execution is achieved by scaling only the
parallel part, i.e. h (n) = 1. We know that Gustafson’s model
requires equal execution time, hence:
$$ t'_s(n) + t'_p(n) = t(1), \qquad (23) $$
which leads to:
$$ \frac{p\,g(n)}{N_\alpha} = 1 - \frac{(1-p)}{\alpha_s}. \qquad (24) $$
From this, we can find that:
$$ g(n) = \left(1 - \frac{(1-p)}{\alpha_s}\right) \frac{N_\alpha}{p}, \qquad (25) $$
however this equation puts a number of restrictions on the
system. Firstly, it doesn’t work for p = 0, because it is not
possible to achieve equal time execution for a purely sequential
program if αs ≠ 1 and only the parallel workload scaling
is allowed. Secondly, a negative g (n) does not make sense,
hence the relation αs > (1− p) must hold true. This means
that the sequential core performance must be high enough to
overcome the lack of parallelization. Another drawback of this
mode is that it requires the knowledge of p in order to properly
scale the workload.
In this scenario, the speedup is calculated as:
$$ S(n) = (1-p) + \left(1 - \frac{(1-p)}{\alpha_s}\right) N_\alpha. \qquad (26) $$
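The purely parallel scaling mode, including the restrictions p > 0 and αs > (1 − p), can be sketched as follows. The function names and the sample values used in the check are illustrative assumptions.

```python
def purely_parallel_g(p: float, alpha_s: float, n_alpha: float) -> float:
    """Parallel workload scaling that keeps the execution time equal to t(1), eq. (25)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("this mode requires 0 < p <= 1")
    if alpha_s <= 1.0 - p:
        raise ValueError("this mode requires alpha_s > (1 - p)")
    return (1.0 - (1.0 - p) / alpha_s) * n_alpha / p


def purely_parallel_speedup(p: float, alpha_s: float, n_alpha: float) -> float:
    """Heterogeneous Gustafson speedup in purely parallel scaling mode, eq. (26)."""
    return (1.0 - p) + (1.0 - (1.0 - p) / alpha_s) * n_alpha


p, alpha_s, n_alpha = 0.9, 2.0, 14.0
g = purely_parallel_g(p, alpha_s, n_alpha)
# Execution time of the scaled workload in units of t(1), from (21) and (23).
relative_time = (1.0 - p) / alpha_s + p * g / n_alpha
```

In the homogeneous case (`alpha_s = 1.0`, `n_alpha = n`) the scaling reduces to g (n) = n and the speedup to (6).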
2) Classical scaling mode: In order to remove the restric-
tions of the purely parallel scaling mode, and to provide a
model generalizable to p = 0, we need to allow scaling of
the sequential execution. However, since this mode potentially
increases the sequential execution time, it exercises the cores
less efficiently than the previous mode and leads to lower
speedup. In this case, (23) can be updated to:
h (n) · (t′s (n) + t′p (n)) = t (1) . (27)
From this, in the case of p = 0, we find that h (n) = αs. And
for the case of p > 0 and h (n) = αs:
$$ g(n) = \left(\frac{1}{h(n)} - \frac{(1-p)}{\alpha_s}\right) \frac{N_\alpha}{p} = \frac{N_\alpha}{\alpha_s}. \qquad (28) $$
This scaling mode relates to the classical homogeneous
Gustafson’s model, which requires g (n) to be proportional
to the ratio between the system performances of the parallel
and sequential executions. In the homogeneous case, if the
sequential performance is θ, the parallel performance would
be nθ, leading to g (n) = n.
For the heterogeneous Gustafson’s model in classical scaling
mode, the speedup is calculated as:
S (n) = αs (1− p) + pNα. (29)
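The classical scaling mode can be checked against the general expression (22) and the equal-time condition (27). The sketch below uses hypothetical function names and sample values.

```python
def classical_scaling(alpha_s: float, n_alpha: float) -> tuple:
    """Scaling factors for classical mode: h(n) = alpha_s and g(n) from eq. (28)."""
    h = alpha_s
    g = n_alpha / alpha_s
    return h, g


def classical_gustafson_speedup(p: float, alpha_s: float, n_alpha: float) -> float:
    """Heterogeneous Gustafson speedup in classical scaling mode, eq. (29)."""
    return alpha_s * (1.0 - p) + p * n_alpha


p, alpha_s, n_alpha = 0.9, 2.0, 14.0
h, g = classical_scaling(alpha_s, n_alpha)
# Left-hand side of (27) in units of t(1), using the times from (21).
relative_time = h * ((1.0 - p) / alpha_s + p * g / n_alpha)
# Right-hand side of (22) evaluated for the same g.
general = ((1.0 - p) + p * g) / ((1.0 - p) / alpha_s + p * g / n_alpha)
```

For p = 0 the speedup equals αs, and for `alpha_s = 1.0`, `n_alpha = n` it reduces to the homogeneous form (6).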
V. PROPOSED POWER AND ENERGY MODELS
We base our power models on the concept of power state
modelling, in which a device has a number of distinct power
states, and the average power over an execution is calculated
from the time the system spent in each state.
For each core in the system we consider two power states:
active and idle. Lower power states like sleeping and shutting
down the cores are not included in the presented models,
however it is possible to extend the models to cater to
these effects. Let’s denote the active power of a core in a
homogeneous system as wa and the idle power of a core as w0.
Active power can also be expressed as a sum of idle power w0
and effective power w that is spent on workload computation,
wa = w0 + w. In this view, the idle component is no longer
dependent on the system’s activity and can be expressed as a
system-wide constant term W0, called background power. The
total power dissipation of the system is:
$$ W_{total} = W_0 + W(n), \qquad (30) $$
where W (n) is the total effective power of active cores – this is the
focus of our models. The constant term of background power
W0 can be studied separately.
A. Power Modelling Basics
In the normal form representation of a heterogeneous system
(Section 3), the difference between power dissipations of the
cores is expressed by the vector β = (β1, . . . , βx), which
defines the effective power in relation to a BCE’s effective
power, such that for all 1 ≤ i ≤ x we have effective power
wi = βiw.
The effective power model can be found as a time-weighted
average of the sequential effective power ws and parallel
effective power wp of the system:
$$ W(n) = \frac{w_s t'_s(n) + w_p t'_p(n)}{t'_s(n) + t'_p(n)}, \qquad (31) $$
where t′s (n) and t′p (n) are the speedup-dependent times required to execute the sequential and parallel parts of the extended
workload respectively.
In a homogeneous system:
ws = w, wp = nw. (32)
In a heterogeneous system, if we execute the sequential code
on a single core s:
$$ w_s = \beta_s w, \qquad w_p = N_\beta w, \qquad (33) $$
which gives for the balanced case of parallel execution (17):
$$ N_\beta = \sum_{i=1}^{x} \beta_i n_i. \qquad (34) $$
For equal-share execution (13), Nβ is calculated as follows:
$$ N_\beta = \min \alpha \cdot \sum_{i=1}^{x} \frac{\beta_i n_i}{\alpha_i}. \qquad (35) $$
Nβ is called a power-equivalent number of BCEs. Hetero-
geneous power models will transform into homogeneous if
αs = βs = 1 and Nα = Nβ = n.
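As an illustration of (34) and (35), the sketch below computes the power-equivalent number of BCEs for both distribution policies. The function names and the tuple layout are hypothetical, and taking min α over the active core types only is an assumption; the α and β values are those listed for the int benchmark in Table 2.

```python
def n_beta_balanced(cores):
    """Power-equivalent number of BCEs for balanced execution, eq. (34).

    cores: list of (alpha_i, beta_i, n_i) tuples, one per core type.
    """
    return sum(beta * n for _alpha, beta, n in cores)


def n_beta_equal_share(cores):
    """Power-equivalent number of BCEs for equal-share execution, eq. (35).

    Assumption: min(alpha) is taken over core types with n_i > 0.
    """
    active = [c for c in cores if c[2] > 0]
    min_alpha = min(alpha for alpha, _beta, _n in active)
    return min_alpha * sum(beta * n / alpha for alpha, beta, n in active)


# int benchmark: two A7 cores (the BCE) and two A15 cores.
int_cores = [(1.0, 1.0, 2), (1.2399, 3.6839, 2)]
print(n_beta_balanced(int_cores), n_beta_equal_share(int_cores))

# Homogeneous check: four BCEs give N_beta = n = 4 under both policies.
print(n_beta_balanced([(1.0, 1.0, 4)]), n_beta_equal_share([(1.0, 1.0, 4)]))
```

The equal-share value is never larger than the balanced one, because each term is weighted by min α / αi ≤ 1; this reflects the faster cores idling while they wait for the slower ones.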
B. Power Distribution and Scaling Models
We express the scaling of effective power in the system via
the speedup and the power distribution characteristic function
Dw (n):
W (n) = wDw (n)S (n) . (36)
Dw (n) represents the relation between the power and perfor-
mance in a heterogeneous configuration. Since the speedup
models are known from Section 4, this section focuses on
finding the matching power distribution functions.
From (33) and (31), we can find that in the general case:
$$ D_w(n) = \frac{\left(\beta_s t'_s(n) + N_\beta t'_p(n)\right) \cdot \theta}{I'}, \qquad (37) $$
thus substituting the workload scaling definition (20) and
execution times (21) will give us:
$$ D_w(n) = \frac{\frac{\beta_s}{\alpha_s}(1 - p) + p\,g(n)\,\frac{N_\beta}{N_\alpha}}{(1 - p) + p\,g(n)}. \qquad (38) $$
It is worth noting that for homogeneous systems, Dw (n) =
1 in all cases, and the effective power equation will transform
into:
W (n) = wS (n) , (39)
i.e. in homogeneous systems the power scales in proportion to
the speedup.
Power distribution for Amdahl’s workload: For Amdahl’s
workload, g (n) = 1, hence the power distribution function
becomes:
$$ D_w(n) = \frac{\beta_s}{\alpha_s}(1 - p) + p \cdot \frac{N_\beta}{N_\alpha}. \qquad (40) $$
Power distribution for Gustafson’s workload: Following
the same general form (38) for the effective power equation,
we can find power distribution functions Dw (n) for two cases
of workload scaling described in Section 4.4.
For the classical scaling mode:
$$ D_w(n) = \frac{\beta_s (1 - p) + p N_\beta}{\alpha_s (1 - p) + p N_\alpha}. \qquad (41) $$
For the purely parallel scaling mode:
$$ D_w(n) = \frac{\beta_s (1 - p) + \left(\alpha_s - (1 - p)\right) N_\beta}{\alpha_s (1 - p) + \left(\alpha_s - (1 - p)\right) N_\alpha}. \qquad (42) $$
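The special cases (40) to (42) can be checked against the general form (38) by substituting the corresponding g (n). The sketch below does this numerically; the function names are hypothetical, the parameter values are illustrative (log benchmark from Table 2, one A7 and one A15 core, sequential part on the A15), and Nα = n · min α is an assumed equal-share form analogous to (35).

```python
def d_w_general(p, g, alpha_s, beta_s, n_alpha, n_beta):
    """General power distribution function, eq. (38)."""
    numerator = (beta_s / alpha_s) * (1 - p) + p * g * n_beta / n_alpha
    denominator = (1 - p) + p * g
    return numerator / denominator


def d_w_amdahl(p, alpha_s, beta_s, n_alpha, n_beta):
    """Amdahl's workload, eq. (40)."""
    return (beta_s / alpha_s) * (1 - p) + p * n_beta / n_alpha


def d_w_classical(p, alpha_s, beta_s, n_alpha, n_beta):
    """Gustafson's workload, classical scaling mode, eq. (41)."""
    return (beta_s * (1 - p) + p * n_beta) / (alpha_s * (1 - p) + p * n_alpha)


def d_w_purely_parallel(p, alpha_s, beta_s, n_alpha, n_beta):
    """Gustafson's workload, purely parallel scaling mode, eq. (42)."""
    k = alpha_s - (1 - p)
    return (beta_s * (1 - p) + k * n_beta) / (alpha_s * (1 - p) + k * n_alpha)


# Illustrative parameters (assumptions, see the lead-in).
p, alpha_s, beta_s = 0.3, 1.7623, 3.7672
n_alpha = 2 * 1.0                                # assumed n * min(alpha)
n_beta = 1.0 * (1.0 / 1.0 + 3.7672 / 1.7623)     # eq. (35)

g_cl = n_alpha / alpha_s                         # eq. (28)
g_pp = (1 - (1 - p) / alpha_s) * n_alpha / p     # implied by eq. (26)

args = (alpha_s, beta_s, n_alpha, n_beta)
print(d_w_general(p, 1.0, *args), d_w_amdahl(p, *args))
print(d_w_general(p, g_cl, *args), d_w_classical(p, *args))
print(d_w_general(p, g_pp, *args), d_w_purely_parallel(p, *args))
```

Each printed pair should coincide up to floating-point rounding, and setting αs = βs = 1 with Nα = Nβ = n makes the general form evaluate to 1, as stated for homogeneous systems.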
C. Energy and Power-Normalized Performance
Power modelling is typically used for optimizing system
power dissipation. Due to the power-performance trade-off,
advanced metrics are required as optimization targets. For ex-
ample, power-normalized performance (performance per Watt)
represents the performance achievable at a given power capac-
ity. This parameter is the reciprocal of energy per instruction
(in our case, per workload item) Ei, which can be found from
dividing the total power (30) by the system’s performance (1):
$$ E_i = \frac{W_{total}}{\Theta(n)} = \frac{W_0 + W(n)}{\theta S(n)}. \qquad (43) $$
For a single BCE we can denote energy per workload item
as a sum of effective energy e and idle energy. Applying the
power model (36) to (43):
$$ E_i = e \cdot D_w(n) + \frac{E_0}{S(n)}. \qquad (44) $$

TABLE II
CHARACTERIZATION EXPERIMENTS: SINGLE CORE EXECUTION

benchmark                        sqrt               int                log
base workload, iterations        2.4 · 10^8         4.08 · 10^9        2.4 · 10^8
core type i                      A7       A15       A7       A15       A7       A15
measured execution time, ms      74943    79798     79254    63920     62857    35668
measured active power, W         0.2805   0.8402    0.2887   0.8400    0.3143   0.9474
power measurement std dev        0.61%    0.24%     0.55%    0.19%     0.55%    1.89%
calculated effective power, W    0.1234   0.4850    0.1316   0.4848    0.1572   0.5922
αi                               1        0.9392    1        1.2399    1        1.7623
βi                               1        3.9303    1        3.6839    1        3.7672

The system-wide sum of idle energy per workload item is
denoted as E0. This equation shows that the power distribution
function Dw (n) increases the effective component of the
energy as more power-hungry cores are being active, and the
speedup S (n) decreases the idle energy component due to
better core utilization. The total energy consumption during
the execution of the extended workload I ′ is E = EiI ′.
Energy-delay product (EDP) is another optimization metric
that accounts for energy and performance at the same time:
$$ E_t(n) = W_{total} \cdot \left(\frac{I'}{\Theta(n)}\right)^2. \qquad (45) $$
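The metrics (43) to (45) can be evaluated directly once S (n) and Dw (n) are known. The following sketch uses hypothetical function names and placeholder parameter values (they are not measurements from this paper); it also takes e = w/θ and E0 = W0/θ, which is the reading implied by comparing (43) with (44).

```python
def energy_per_item(w0_total, w, theta, d_w, s):
    """Energy per workload item, eq. (43), with W(n) = w * D_w * S from eq. (36)."""
    w_total = w0_total + w * d_w * s
    return w_total / (theta * s)


def energy_per_item_split(w0_total, w, theta, d_w, s):
    """Same quantity in the decomposed form of eq. (44)."""
    e = w / theta            # effective energy per item on one BCE (assumed reading)
    e0 = w0_total / theta    # system-wide idle energy per item (assumed reading)
    return e * d_w + e0 / s


def energy_delay_product(w0_total, w, theta, d_w, s, workload):
    """Energy-delay product, eq. (45), using Theta(n) = theta * S(n)."""
    w_total = w0_total + w * d_w * s
    exec_time = workload / (theta * s)
    return w_total * exec_time ** 2


# Placeholder values: W0 = 0.5 W, w = 0.13 W, theta = 1000 items/s,
# D_w = 1.8, S = 3.0, extended workload I' = 60000 items.
params = dict(w0_total=0.5, w=0.13, theta=1000.0, d_w=1.8, s=3.0)
ei = energy_per_item(**params)
ei_split = energy_per_item_split(**params)
edp = energy_delay_product(workload=60000, **params)
print(ei, ei_split, edp)
```

The two energy functions return the same value, which illustrates how Dw (n) scales the effective component while S (n) reduces the idle component.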
VI. EXPERIMENTAL VALIDATIONS
This section validates the models presented in Sections 4 and 5 against a set of experiments on a real heterogeneous platform. In these experiments, the goal is to determine
the accuracy of the models when all model parameters, such
as parallelization factor p, are under control.
A. Platform Description
This study is based on a multi-core mobile platform, the
Odroid-XU3 board [17]. The main part of it is the 28nm
application processor Exynos 5422. It is an SoC hosting an
ARM big.LITTLE heterogeneous processor consisting of four
Cortex A7 cores (C0 to C3) and four Cortex A15 cores
(C4 to C7). The big Cortex A15 is a high performance 32-
bit core having 32KB instruction and 32KB data L1 caches
and 2MB L2 cache and the maximum frequency of 2.0GHz.
The LITTLE Cortex A7 is a low power 32-bit core including
the same L1 cache size and 512 KB L2 cache, and the
maximum frequency of 1.4 GHz. There are compatible Linux
and Android distributions available for Odroid-XU3; in our
experiments we used Ubuntu 14.04. This SoC also has four
power domains: A7 power domain, A15 power domain, GPU
and memory power domains. The Odroid-XU3 board allows
per-domain DVFS using predefined voltage-frequency pairs.
The previous assumption by Hill and Marty for heteroge-
neous architectures, shown in Figure 1(b), cannot describe
systems such as big.LITTLE. Our models do not suffer from
these restrictions and can be applied to big.LITTLE and
similar structures.
B. Benchmark Description and Model Characterization
[Figure 3: flowchart of the synthetic application. START → pin to core s and execute (1−p)·I cycles (sequential part) → create N threads; each thread j = 1, ..., N is pinned to core cj and executes p·g(n)·I/N cycles (parallel part) → join threads → END.]
Fig. 3. Synthetic application with controllable parallelization factor and equal-share workload distribution. Parameter p, workload size I and scaling g (n),
the number of threads (cores) N , and the core allocation s, c = (c1, . . . , cN )
are specified as the program arguments.
The models operate on application- and platform-dependent
parameters, which are typically unknown and imply high
efforts in characterization. However, in order to prove that
the proposed models work, it is sufficient to show that, if α,
β and p are defined, the performance and power behaviour of
the system follows the model’s prediction. These parameters
can be fixed by a synthetic benchmark. This benchmark does
not represent realistic application behaviour and was designed
only for validation purposes. Experiments with real application
examples are presented in Section 7.
The model characterization is derived from single core
experiments. These characterized models are used to predict
multi-core execution in different core configurations. The pre-
dictions are then cross-validated against experimental results.
1) Controlled parameters: The benchmark has been devel-
oped specifically for these experiments in order to provide
control over the parallelization parameter p. Hence, p is not
a measured parameter, but a control parameter that tells the
application the ratio between the parallel (multi-threaded) and
sequential (single thread) execution.
The application is based on POSIX threads, and its flow
is shown in Figure 3. Core configurations, including homo-
geneous and heterogeneous, can be specified per application
run as the sequential execution core s and the set of core al-
locations c = (c1, . . . , cN ), where N is the number of parallel
threads; s, cj ∈ {C1, . . . ,C7} for 1 ≤ j ≤ N . C0 is reserved
for OS and power monitors. These variables define n used in
the models. We do not shut down the cores and use per-thread
core pinning via pthread_attr_setaffinity_np to
avoid unexpected task migration.
[Figure 4: theoretical and measured speedup in six panels (sqrt, int and log, each for p = 0.3 and p = 0.9) over the core configurations 1 A7 + 0 A15, 2 A7 + 0 A15, 3 A7 + 0 A15, 0 A7 + 1 A15, 0 A7 + 2 A15, 0 A7 + 3 A15, 0 A7 + 4 A15, 1 A7 + 1 A15, 2 A7 + 2 A15, 2 A7 + 3 A15 and 3 A7 + 4 A15; each configuration is labelled with the percentage error.]
Fig. 4. Speedup validation results for the heterogeneous Amdahl’s law showing percentage error of the theoretical model in relation to the measured speedup.
[Figure 5: theoretical and measured total power dissipation for the same six panels and core configurations as Figure 4; each configuration is labelled with the percentage error.]
Fig. 5. Total power dissipation results for the heterogeneous Amdahl’s law showing percentage error of the theoretical model in relation to the measured
power.
The workload size I and the workload scaling g (n) are
also given parameters, which are used to test Gustafson’s
models against Amdahl’s law. The application implements
three workload functions: square root calculation (sqrt), integer
arithmetic (int), and logarithm calculation (log) repeated in
a loop. These computation-heavy tasks use minimal memory
access to reduce the impact of hardware on the controlled
p. A fixed number of loop iterations represents one workload
item. The functions are expected to give different performance
characteristics, hence the characterization and cross-validation
experiments are done separately for each function.
Figure 3 shows equal-share workload distribution, where
each parallel thread receives an equal number of $p\,g(n)\,\frac{I}{N}$ workload items. This execution gives Nα and Nβ that correspond to
naive load balancing according to (13) and (35). Additionally,
after collecting the characterization data for α, we implemented a version that uses α to do optimal (balanced) workload distribution by giving each core cj ∈ c a performance-adjusted workload of $p\,g(n)\,\frac{I}{N} \cdot \frac{\alpha_j}{A}$, where $A = \sum_{j=1}^{N} \alpha_j$.
This execution follows different Nα and Nβ , which can be
calculated from (17) and (34).
2) Relative performances of cores: All experiments in this
section are run with both A7 and A15 cores at 1.4GHz.
Running both cores at the same frequency exposes the effects
of architectural differences on the performance. In addition, by
avoiding higher frequencies we reduce the temperature effects
and avoid throttling. In this study, we set BCE to A7, hence
αA7 = 1; and αA15 can be found as a ratio of single core
execution times αA15 = tA7 (1) /tA15 (1), as shown in Table 2.
The three different functions provide different αA15 values.
It can be seen that A15 is unsurprisingly faster than A7
for integer arithmetic and logarithm calculation, however the
square root calculation is faster on A7. This is confirmed mul-
tiple times in many experiments. We did not fully investigate
the reason for this behaviour since the board’s production and
support have been discontinued, and this is in any case outside
the scope of this paper. A newer version of the board, Odroid-XU4, which is also built around Exynos 5422, does not have
this issue. It is important to note that we compiled all our
benchmarks using the same gcc settings. We include this
case of non-standard behaviour in our experiments to explore
possible negative impacts on the performance modelling and
optimization.
[Figure 6: measured speedup under classical scaling and purely parallel scaling in three panels (sqrt, int and log, all for p = 0.3) over the same core configurations as Figure 4; each configuration is labelled with the percentage speedup gain of purely parallel scaling.]
Fig. 6. Gustafson’s model outcomes showing the measured speedup gain from using the purely parallel workload scaling compared to the classical scaling;
sqrt shows poor speedup because of αs < 1.

TABLE III
GUSTAFSON’S WORKLOAD SCALING CALCULATIONS

bench   nA7   nA15   classical g(n)   purely parallel g(n)
                                      p = 0.3     p = 0.9
sqrt    1     1      2                1.594       1.865
sqrt    2     2      4                3.189       3.730
sqrt    2     3      5                3.986       4.662
sqrt    3     4      7                5.580       6.527
int     1     1      1.613            2.903       2.043
int     2     2      3.226            5.806       4.086
int     2     3      4.033            7.257       5.107
int     3     4      5.646            10.160      7.150
log     1     1      1.135            4.019       2.096
log     2     2      2.270            8.037       4.192
log     2     3      2.837            10.046      5.240
log     3     4      3.972            14.065      7.336

3) Core idle and active powers: The Odroid-XU3 board
provides power readings per power domain, i.e. one combined
reading per core type, from which it is possible to derive single
core characteristic values w0 and w.
Idle powers are determined by averaging over 1min of
measurements while the platform is running only the operating
system and the power logging software. The idle power values
are w0,A7 = 0.1571W and w0,A15 = 0.3552W, which are used
across all benchmarks. The standard deviation during the idle
power measurements is 1.1% of the mean value.
Effective powers wA7, wA15 are calculated from the mea-
sured active powers by subtracting idle power according
to (30). The power ratios are then found as βA7 = 1 and
βA15 = wA15/wA7; the values are presented in Table 2.
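The characterization steps of this subsection can be reproduced from the Table 2 measurements. The sketch below is illustrative only (the dictionary layout and the function name are hypothetical): α is the ratio of single-core execution times, effective power is active power minus idle power, and β is the ratio of effective powers.

```python
# Table 2 measurements: (execution time in ms, active power in W) per core type.
MEASURED = {
    "sqrt": {"A7": (74943, 0.2805), "A15": (79798, 0.8402)},
    "int": {"A7": (79254, 0.2887), "A15": (63920, 0.8400)},
    "log": {"A7": (62857, 0.3143), "A15": (35668, 0.9474)},
}
# Idle power per core type in W, from the idle measurements in the text.
IDLE_POWER = {"A7": 0.1571, "A15": 0.3552}


def characterize(bench, bce="A7"):
    """Return {core_type: (alpha, beta)} relative to the BCE core type."""
    t_bce, active_bce = MEASURED[bench][bce]
    w_bce = active_bce - IDLE_POWER[bce]          # effective power of the BCE
    result = {}
    for core, (t, active) in MEASURED[bench].items():
        alpha = t_bce / t                          # relative performance
        beta = (active - IDLE_POWER[core]) / w_bce  # relative effective power
        result[core] = (alpha, beta)
    return result


for bench in MEASURED:
    alpha_a15, beta_a15 = characterize(bench)["A15"]
    print(bench, round(alpha_a15, 4), round(beta_a15, 4))
```

The printed values can be compared with the αi and βi rows of Table 2; small differences in the last digit are possible because the table entries are rounded.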
C. Amdahl’s Workload Outcomes
A large number of experiments have been carried out
covering all functions (sqrt, int, log) in various core config-
urations, and repeated for p = 0.3 and p = 0.9. This set of
runs uses a fixed workload of 60000 items with equal-share
workload distribution between threads. Model predictions and
experimental measurements for a set of selected homogeneous
and heterogeneous multi-core configurations can be found
in Figures 4 and 5. The measured speedup is calculated as
the measured time for a single A7 core execution tA7 (1),
shown in Table 2, over the benchmark’s measured execution
time t (n):
$$ S(n) = \frac{t_{A7}(1)}{t(n)}. \qquad (46) $$
The observations validate the model (19) by showing that
the differences between the model predictions and the experi-
mental measurements are very small. The speedup error never
exceeds 1%, and the power error never exceeds 4%, which
is comparable to the standard deviation of the characterization
measurements. A possible explanation for the low error values
can be that our synthetic benchmark produces very stable α
and β, and accurately emulates p. However, these small errors
also prove that the model can be used with high confidence
if it is possible to track these parameters. The model can
also be confidently used in reverse to derive parallelization
and performance properties of the system from the speedup
measurements, as demonstrated in Section 7.
The counter-intuitive result for 7-core (three A7 cores and
four A15 cores) execution having lower power dissipation than
four A15 cores and no A7 cores can be explained by the equal-
share workload distribution. Because the parallel workload is
equally split between these cores, the A15 cores finish early
and wait for A7 cores. This idling reduces the average total
power dissipation, however it implies that intelligent workload
distribution can improve core utilization by scheduling more
tasks to A15 cores than to A7 ones so that they finish at the
same time. This is investigated in Section 6.5.
D. Gustafson’s Workload Outcomes
Two sets of experiments have been carried out to validate
heterogeneous Gustafson’s models in both purely parallel and
classical workload scaling modes described in Section 4.4. The
initial workload I is set to 60000, and the scaled workload I ′
is defined by (20). Table 3 shows selected examples of calcu-
lating g (n) for both scaling modes according to (25) and (28).
These experiments also use equal-share workload distribution
and s is fixed to A15. The integer g (n) values in the case of
sqrt are because minα = αs for this function.
The measured speedup is calculated as the ratio of perfor-
mances according to (1), or as the time ratio multiplied by the
workload size ratio:
$$ S(n) = \frac{t_{A7}(1)}{t(n)} \cdot \frac{I'}{I}. \qquad (47) $$
[Figure 7: equal-share versus balanced execution in six panels (speedup; power dissipation, W; energy-delay product, Js; each for int and log at p = 0.9) over the sixteen core configurations from 1 A7 + 1 A15 to 4 A7 + 4 A15; each configuration is labelled with the percentage difference between the two distributions.]
Fig. 7. Comparison of the speedup, power and energy between equal-share and balanced execution.
The observed errors are similar to those for Amdahl’s model, with the
speedup estimated within 1% error and the power dissipation
estimated within 4% difference between the theory and the
measurements.
Figure 6 compares the speedup between two workload
scaling modes for p = 0.3. The purely parallel scaling has
more effect for less parallelizable applications as it focuses
on reducing the sequential part of the execution, hence the
experiments with p = 0.9 show insignificant gain in the
speedup and are not presented here. Even though the purely
parallel scaling is harder to achieve in practice as it requires the
knowledge of p, it provides a highly significant speedup gain,
especially if the difference between the core performances is
high, like in the case of log, which gives almost 50% better
speedup. However, improper use may cause poor performance,
as seen in the sqrt example using A15 for the sequential
execution, which is slower than A7 for this function.
E. Balanced Execution
Previously described experiments use equal-share workload
distribution, which is simpler to implement, but results in faster
cores being idle while waiting for slower cores. The balanced
distribution, defined in (17), gives the optimal speedup for a
given workload. This section implements balanced distribution
of a fixed workload and compares it to the equal-share distri-
bution outcomes of Amdahl’s law. The results are presented
for p = 0.9, as it provides larger differences for this scenario.
In terms of model validation, the results are also very
accurate, giving up to 4% error in power estimation and, in
most cases, within 1% error for the speedup. Slightly
larger errors appear for the log benchmark in the cases of
7 or more cores, which show speedup errors of almost 4%. It
is important to note that it is virtually impossible to achieve
perfectly balanced workload distribution in real life, hence
a slight mismatch between the benchmark execution and the
theoretical model is to be expected.
Figure 7 explores the differences between the equal-share
and optimal (balanced) cases of workload distribution in terms
of performance and energy properties of the system, calculated
according to Sections 4 and 5. The balanced distribution
leads to an increase of up to 41% in the speedup. The average power
dissipation also increases by up to 33% as the cores are
exercised with as little idling as possible. However, the power
increase is not as big as the performance gain, hence the
balancing of the workload is generally beneficial to the system,
as can be seen from the calculated EDP metric.
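The speedup gain from balancing can be estimated from the model. The sketch below rests on several assumptions that are not restated in this section: the heterogeneous Amdahl speedup is taken as 1/((1 − p)/αs + p/Nα), which is the form consistent with (36) and (40); Nα is n · min α for equal-share execution (by analogy with (35)) and Σ αi ni for balanced execution (by analogy with (34)); and the sequential part is assumed to run on an A15 core. Function names are hypothetical.

```python
def amdahl_speedup(p, alpha_s, n_alpha):
    """Assumed heterogeneous Amdahl speedup: 1 / ((1 - p)/alpha_s + p/N_alpha)."""
    return 1.0 / ((1 - p) / alpha_s + p / n_alpha)


def n_alpha_equal_share(cores):
    """Assumed equal-share N_alpha: all threads progress at the slowest core's pace.

    cores: list of (alpha_i, n_i) tuples for the active core types.
    """
    return min(alpha for alpha, _n in cores) * sum(n for _alpha, n in cores)


def n_alpha_balanced(cores):
    """Assumed balanced N_alpha: every core contributes its full performance."""
    return sum(alpha * n for alpha, n in cores)


# int benchmark (alpha_A15 = 1.2399 from Table 2), one A7 and one A15, p = 0.9.
cores = [(1.0, 1), (1.2399, 1)]
p, alpha_s = 0.9, 1.2399

s_equal = amdahl_speedup(p, alpha_s, n_alpha_equal_share(cores))
s_balanced = amdahl_speedup(p, alpha_s, n_alpha_balanced(cores))
gain = s_balanced / s_equal - 1
print(round(s_equal, 3), round(s_balanced, 3), f"{gain:.2%}")
```

For this small configuration the modelled gain is about 10%; larger gains arise when the performance gap between the core types is wider, as with the log benchmark.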
On the other hand, the large differences in the speedup
between these two cases pose a slight drawback to the practical
use of the models. As discussed in Section 4.1, these
models form the corner cases for a load balancing algorithm,
hence a larger range would reduce the predictive value of the
method. However, in real life the experiments show that the
load balancing indeed produces vastly different performance
results as demonstrated in Section 7. Thus, the model predic-
tions are correct, and narrowing the prediction range would
require detailed knowledge of the system load balancer.
VII. REAL APPLICATION WORKLOADS
This section is focused on experiments with realistic work-
loads based on the Parsec benchmark suite [24]. Parsec bench-
marks are designed for parallel multi-threaded computation
and include diverse workloads that are not exclusively focused
on high performance computing. Each application is supplied
with a set of pre-defined input data, ranging from small sizes
(test) to large (simlarge) and very large (native) sizes. Each
input is assumed to generate a fixed workload on a given sys-
tem. To our knowledge, Parsec benchmarks do not implement
workload scaling to Gustafson’s or Sun-Ni’s models, hence
this section is focused on Amdahl’s law only.
In our experiments we run a subset of Parsec bench-
marks, namely ferret (CPU-heavy), fluidanimate (memory-
heavy), and bodytrack (mixed), and use simlarge input. Core
pinning is done at the application level using the taskset
command in Linux. The command takes a set of cores as an
argument and ensures that every thread of the application is
scheduled onto one of these cores. However, the threads are
still allowed to move between the cores within this set due to
the influence of the system load balancer [23]. This is different
from the synthetic benchmark described in Section 6, which
performed pinning of individual threads, one thread per core.
In this work, we do not study the actual algorithm of the
load balancing or the internal structure of Parsec benchmarks,
hence the workload distribution between the cores is con-
sidered a black box function: Nα is unknown. Section 4.1
addressed this issue by providing the range of values for
Nα. The minimum value corresponds to equal-share workload
distribution and gives the lower speedup limit Slow (n); the
maximum value is defined by the balanced workload and gives
the higher speedup limit Shigh (n).
The goal of the following experiments is to calculate these
limits and to find how the real measured speedup fits in
the range. The relation provides a quality metric q for the
load balancing algorithm, where q = 1 corresponds to the
theoretically optimal load balancer, and q = 0 is equivalent to
a naive approach (equal-share). Negative values may also be
possible and show that the balancing algorithm is not working
properly and creates an obstacle to the workload execution.
The metric q is calculated as follows:
$$ q = \frac{S(n) - S_{low}(n)}{S_{high}(n) - S_{low}(n)}. \qquad (48) $$
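A minimal sketch of (48) follows; the function name is hypothetical and the input values are made up for illustration, not taken from the experiments.

```python
def load_balancer_quality(s_measured, s_low, s_high):
    """Load balancer quality q, eq. (48)."""
    if s_high <= s_low:
        raise ValueError("s_high must be greater than s_low")
    return (s_measured - s_low) / (s_high - s_low)


# Illustrative speedup range [3.0, 5.0].
print(load_balancer_quality(5.0, 3.0, 5.0))   # measured speedup at the upper limit
print(load_balancer_quality(3.0, 3.0, 5.0))   # measured speedup at the lower limit
print(load_balancer_quality(2.5, 3.0, 5.0))   # below the equal-share limit
```

The three calls correspond to the optimal balancer (q = 1), the naive equal-share behaviour (q = 0), and a negative q that signals a distribution worse than equal-share.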
The motivation for load balancing is to improve speedup by
approaching the balanced workload behaviour. Hill-Marty [12]
and related existing work [13], [14] covering core hetero-
geneity all assume that the workload is already balanced in
their models, implying q = 1. This work makes no such
assumption and studies real load balancer behaviours for dif-
ferent benchmarks, using novel models facilitating quantitative
comparisons.
A. Model Characterization
The model characterization is obtained from the homoge-
neous configuration experiments, and then the models are used
to predict system behaviour in heterogeneous configurations.
Each benchmark is studied independently. Table 4 shows the
obtained parameter values.
A7 is once again used as BCE, αA7 = 1; αA15 values are
derived from single core executions. Core frequencies of both
A7 and A15 are set to 1.4GHz.
Parameter αs is not known because it is not guaranteed that
the sequential part of the workload will be executed on the
fastest core, and it is also possible for the sequential execution
to be re-scheduled to different core types, however αs must
stay within the range of [αA7, αA15].
Parallelization factor p is determined from the measured
speedup S (n) for n > 1 using the equation:
$$ p = \frac{n}{n - 1}\left(1 - \frac{1}{S(n)}\right), \qquad (49) $$
which is derived from the homogeneous Amdahl’s law (3). For
different values of n, the equation gives different p, however
the differences are insignificant within the same type of core.
On the other hand, the differences in p for different core types
are substantial and cannot be ignored.
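The estimation of p can be illustrated with the bodytrack speedups measured on A7 cores (Table 4). The sketch solves the homogeneous Amdahl's law S (n) = 1/((1 − p) + p/n) for p, which gives p = n/(n − 1) · (1 − 1/S (n)); averaging the per-n estimates is an assumption about how the tabulated value was obtained, and the function name is hypothetical.

```python
from statistics import mean


def parallel_fraction(s, n):
    """Solve the homogeneous Amdahl's law S = 1 / ((1 - p) + p / n) for p."""
    if n <= 1:
        raise ValueError("n must be greater than 1")
    return n / (n - 1) * (1 - 1 / s)


# bodytrack on A7 cores: measured S(2), S(3), S(4) from Table 4.
speedups = {2: 1.8721, 3: 2.6499, 4: 3.3081}
estimates = [parallel_fraction(s, n) for n, s in speedups.items()]
p_a7 = mean(estimates)
print([round(p, 4) for p in estimates], round(p_a7, 4))
```

The per-n estimates differ only in the third decimal place, and their mean is close to the pA7 value listed for bodytrack in Table 4.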
The lowest values of the model parameters p and αs are
used to calculate the lower limit of the heterogeneous speedup
Slow (n), and the highest values are used to calculate Shigh (n).
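The construction of the speedup range can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure in full: the heterogeneous Amdahl speedup is taken as 1/((1 − p)/αs + p/Nα), consistent with (36) and (40); the lower limit uses the equal-share Nα = n · min α and the upper limit the balanced Nα = Σ αi ni, following the range described for Section 4.1. Function names are hypothetical; parameter values are the bodytrack entries of Table 4.

```python
def amdahl_speedup(p, alpha_s, n_alpha):
    """Assumed heterogeneous Amdahl speedup: 1 / ((1 - p)/alpha_s + p/N_alpha)."""
    return 1.0 / ((1 - p) / alpha_s + p / n_alpha)


def speedup_range(cores, p_values, alpha_s_values):
    """Return (S_low, S_high) for a heterogeneous configuration.

    cores: list of (alpha_i, n_i) tuples for the active core types.
    p_values, alpha_s_values: candidate values of p and alpha_s.
    """
    n_total = sum(n for _alpha, n in cores)
    n_alpha_low = n_total * min(alpha for alpha, _n in cores)   # equal-share
    n_alpha_high = sum(alpha * n for alpha, n in cores)         # balanced
    s_low = amdahl_speedup(min(p_values), min(alpha_s_values), n_alpha_low)
    s_high = amdahl_speedup(max(p_values), max(alpha_s_values), n_alpha_high)
    return s_low, s_high


# bodytrack on two A7 and two A15 cores.
s_low, s_high = speedup_range(
    cores=[(1.0, 2), (1.9728, 2)],
    p_values=(0.8910, 0.9320),        # p_A15 and p_A7
    alpha_s_values=(1.0, 1.9728),     # alpha_A7 and alpha_A15
)
print(round(s_low, 3), round(s_high, 3))
```

A measured speedup for the same configuration can then be placed within this interval to obtain q from (48).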
B. Quality of Load Balancer
Figure 8 presents the outcomes of the experiments for the
selected benchmarks and heterogeneous core configurations.
Time measurements have been collected from 10 runs in
each configuration to avoid any random flukes, however the
results were surprisingly consistent within 0.2% variability.
This indicates that the system scheduler and load balancer
behave deterministically in given conditions.
The graphs display the calculated speedup ranges
[Slow (n) , Shigh (n)] and the measured speedup S (n). The
numbers represent the load balancer quality q, calculated
from (48).
The first interesting observation is that the ferret benchmark
is executed with incredible scheduling efficiency despite the
system’s heterogeneity. The average value of q is 0.81 and
the maximum goes to 0.94. According to the benchmark’s
description, its data parallelism employs a pipeline, i.e. the
application implements a producer-consumer paradigm. In this
case, the workload distribution is managed by the application.
Consequently, the cores are always given work items to
execute and the longest possible idling time is less than the
execution of one item.
The observed q values never exceed 1, which validates the
hypothesis that (17) refers to the optimal workload distribution
and can be used to predict the system’s performance capacity.
The lower bound of q = 0 is also mostly respected. This
is not a hard limit, but a guideline that separates appropriate
workload distributions. This boundary is significantly violated
only in one case, as described below.
Bodytrack and fluidanimate show much less efficient work-
load distribution, compared to ferret, and their efficiency seems
to decrease when the core configuration includes more little
IEEE TRANSACTIONS ON MULTI-SCALE COMPUTING SYSTEMS 13
TABLE IV
CHARACTERIZATION OF PARSEC BENCHMARK PARALLELIZABILITY FROM HOMOGENEOUS SYSTEM SETUP
| app | A7 S(2) | A7 S(3) | A7 S(4) | p_A7 | A15 S(2) | A15 S(3) | A15 S(4) | p_A15 | α15 |
|---|---|---|---|---|---|---|---|---|---|
| bodytrack | 1.8721 | 2.6499 | 3.3081 | 0.9320 ±0.0020 | 1.8040 | 2.4606 | 3.0154 | 0.8910 ±0.0006 | 1.9728 |
| ferret | 1.8772 | 2.6697 | 3.3786 | 0.9371 ±0.0026 | 1.9192 | 2.7778 | 3.4648 | 0.9555 ±0.0070 | 1.8795 |
| fluidanimate | 1.5726 | – | 2.2250 | 0.7311 ±0.0029 | 1.4522 | – | 1.9422 | 0.6348 ±0.0120 | 1.8164 |
[Fig. 8 plot: three panels (bodytrack, ferret, fluidanimate). Each panel shows the low, measured, and high speedup (vertical axis 0.0 to 10.0) for A7/A15 core configurations from 1 A7 + 1 A15 up to 4 A7 + 4 A15, annotated with the q value of each configuration; the annotated q values range from -0.68 to 0.94.]
Fig. 8. Parsec speedup range results from heterogeneous system setup determining q – the quality of the system load balancer
than big cores. This effect is especially pronounced in the
case of three A7 cores and one A15 core executing four
threads of the bodytrack application. The value of q for this
configuration lies far in the negative range and can serve as
evidence of a load balancer malfunction. Indeed, the speedup
of this four-thread execution is only slightly higher than that of
two-threaded runs on one A7 and one A15. The execution time is
close to that of a single thread executed on one A15 core, showing
almost zero benefit from bringing in three more cores, and the
result is consistent across multiple runs of the experiment. This
issue requires substantial investigation and lies beyond the
scope of this paper; however, it demonstrates how the presented
method may help analyse the system behaviour and detect
problems in the scheduler and load balancer.
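As a cross-check of the homogeneous characterization in Table IV, the following sketch assumes that each parallelizability value p is obtained by inverting Amdahl's law, S(n) = 1 / ((1 - p) + p / n), for every measured S(n) and averaging the results, with the ± value taken as the largest deviation from that mean. This interpretation is an assumption made here for illustration, and the function names are hypothetical.

```python
def amdahl_parallel_fraction(speedup: float, n: int) -> float:
    """Invert Amdahl's law S(n) = 1 / ((1 - p) + p / n) for p."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)


def characterize(speedups: dict) -> tuple:
    """Return (mean p, largest absolute deviation from the mean).

    `speedups` maps the core count n to the measured speedup S(n).
    """
    estimates = [amdahl_parallel_fraction(s, n) for n, s in speedups.items()]
    mean_p = sum(estimates) / len(estimates)
    spread = max(abs(p - mean_p) for p in estimates)
    return mean_p, spread


# Homogeneous A7 speedups for bodytrack, copied from Table IV.
bodytrack_a7 = {2: 1.8721, 3: 2.6499, 4: 3.3081}
p, spread = characterize(bodytrack_a7)
print(f"bodytrack on A7: p = {p:.4f} +/- {spread:.4f}")
```

Under these assumptions the result for the bodytrack A7 row agrees with the tabulated 0.9320 ±0.0020 to within rounding, which suggests that the measured homogeneous speedups closely follow Amdahl's law for this benchmark.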
VIII. CONCLUSIONS
The models presented in the paper enhance our understand-
ing of scalability in heterogeneous many-core systems and will
be useful for platform designers and electronic engineers, as
well as for system level software developers.
This paper extends three classical speedup models – Amdahl’s
law, Gustafson’s model, and Sun and Ni’s model – to
the range of heterogeneous system configurations that can
be described as normal form heterogeneity. This type of
heterogeneity is defined and analysed with respect to the model
assumptions. The analysis includes an in-depth discussion of
possible model limitations and shows that the proposed models
do not reduce applicability compared to the original
models. The provided discussion can serve as a foundation
for multiple research directions in the future. Important
aspects, such as workload distribution between heterogeneous
cores and various modes of workload scaling, are included
in the model derivation. In addition to performance, this
paper addresses the issue of power and energy modelling
by calculating power dissipation, energy consumption, and
energy delay product for the respective heterogeneous speedup
models.
The practical part of this work includes experiments on the
Odroid-XU3 board pertaining to model validation and real-
life application. The models have been validated against a
synthetic benchmark in a controlled environment. The exper-
iments confirm the accuracy of the models and show that
the models provide deeper insights and clearly demonstrate
the effects of various system parameters on performance and
energy scaling in different heterogeneous configurations.
The modelling method enables the study of the quality of
load balancing, used for improving speedup. A quantitative
metric for load balancing quality is proposed and a series
of experiments involving Parsec benchmarks are conducted.
The modelling method provides quantitative guidelines of
load balancing quality against which experimental results
can be compared. The Linux load balancer is shown to not
always provide high quality results. In certain situations it
may even produce worse results than the naïve equal-share
approach. The study also showed that application-specific load
balancing using pipelines can produce results of much higher
quality, approaching the theoretical optimum obtained from
the models.
ACKNOWLEDGEMENT
M. A. N. Al-hayanni thanks the Iraqi Government for PhD
studentship funding. The authors thank Ali M. Aalsaud for
useful discussions.
REFERENCES
[1] G. M. Amdahl, “Validity of the single processor approach to achieving
large scale computing capabilities,” in Proceedings of the Spring Joint
Computer Conference, ser. AFIPS ’67 (Spring). ACM, 1967, pp. 483–
485. [Online]. Available: http://doi.acm.org/10.1145/1465482.1465560
[2] J. L. Gustafson, “Reevaluating amdahl’s law,” Communications of the
ACM, vol. 31, no. 5, pp. 532–533, 1988.
[3] X.-H. Sun and L. M. Ni, “Another view on parallel speedup,” in
Supercomputing’90., Proceedings of. IEEE, 1990, pp. 324–333.
[4] ——, “Scalable problems and memory-bounded speedup,” Journal of
Parallel and Distributed Computing, vol. 19, no. 1, pp. 27–37, 1993.
[5] S. Borkar, “Thousand core chips: A technology perspective,” in
Proceedings of the 44th Annual Design Automation Conference, ser.
DAC ’07. New York, NY, USA: ACM, 2007, pp. 746–749. [Online].
Available: http://doi.acm.org/10.1145/1278480.1278667
[6] G. E. Moore et al., “Cramming more components onto integrated
circuits,” Proceedings of the IEEE, vol. 86, no. 1, pp. 82–85, 1998.
[7] J. G. Koomey, S. Berard, M. Sanchez, and H. Wong, “Implications of
historical trends in the electrical efficiency of computing,” Annals of the
History of Computing, IEEE, vol. 33, no. 3, pp. 46–54, 2011.
IEEE TRANSACTIONS ON MULTI-SCALE COMPUTING SYSTEMS 14
[8] F. J. Pollack, “New microarchitecture challenges in the coming gener-
ations of cmos process technologies (keynote address),” in Proceedings
of the 32nd annual ACM/IEEE international symposium on Microarchi-
tecture. IEEE Computer Society, 1999, p. 2.
[9] P. Greenhalgh, big.LITTLE Processing with ARM Cortex-A15 & Cortex-
A7 – Improving Energy Efficiency in High-Performance Mobile Plat-
forms, ARM, 2011, white Paper.
[10] “Juno ARM development platform SoC technical overview,” ARM,
Tech. Rep., 2014. [Online]. Available: http://www.arm.com
[11] R. H. Dennard, V. Rideout, E. Bassous, and A. Leblanc, “Design of ion-
implanted mosfet’s with very small physical dimensions,” Solid-State
Circuits, IEEE Journal of, vol. 9, no. 5, pp. 256–268, 1974.
[12] M. D. Hill and M. R. Marty, “Amdahl’s law in the multicore era,”
Computer, no. 7, pp. 33–38, 2008.
[13] X.-H. Sun and Y. Chen, “Reevaluating amdahl’s law in the multicore
era,” Journal of Parallel and Distributed Computing, vol. 70, no. 2, pp.
183–188, 2010.
[14] N. Ye, Z. Hao, and X. Xie, “The speedup model for manycore proces-
sor,” in Information Science and Cloud Computing Companion (ISCC-
C), 2013 International Conference on. IEEE, 2013, pp. 469–474.
[15] D. H. Woo and H.-H. S. Lee, “Extending amdahl’s law for energy-
efficient computing in the many-core era,” Computer, no. 12, pp. 24–31,
2008.
[16] M. A. N. Al-Hayanni, A. Rafiev, R. Shafik, and F. Xia, “Power and
energy normalized speedup models for heterogeneous many core com-
puting,” in 16th International Conference on Application of Concurrency
to System Design (ACSD), June 2016, pp. 84–93.
[17] “Odroid XU3,” http://www.hardkernel.com/main/products.
[18] J. L. Peterson, Petri Net Theory and the Modeling of Systems. Upper
Saddle River, NJ, USA: Prentice Hall PTR, 1981.
[19] J. Baeten, C. A. Middelburg, and E. T. Netherlands, “Process algebra
with timing: Real time and discrete time,” in Handbook of Process
Algebra. Elsevier, 2000, pp. 627–684.
[20] A. B. Downey, “A model for speedup of parallel
programs,” EECS Department, University of California, Berkeley,
Tech. Rep. UCB/CSD-97-933, Jan 1997. [Online]. Available:
http://www.eecs.berkeley.edu/Pubs/TechRpts/1997/5394.html
[21] K. Georgiou, S. Kerrison, Z. Chamski, and K. Eder, “Energy trans-
parency for deeply embedded programs,” ACM Transactions on Archi-
tecture and Code Optimization (TACO), vol. 14, no. 1, pp. 1–26, 4 2017.
[22] X. Li and M. Malek, “Analysis of speedup and communica-
tion/computation ratio in multiprocessor systems,” in Proceedings. Real-
Time Systems Symposium, Dec 1988, pp. 282–288.
[23] J.-P. Lozi, B. Lepers, J. Funston, F. Gaud, V. Que´ma, and A. Fedorova,
“The linux scheduler: A decade of wasted cores,” in Proceedings of the
Eleventh European Conference on Computer Systems, ser. EuroSys ’16.
New York, NY, USA: ACM, 2016, pp. 1:1–1:16. [Online]. Available:
http://doi.acm.org/10.1145/2901318.2901326
[24] C. Bienia and K. Li, “Parsec 2.0: A new benchmark suite for chip-
multiprocessors,” in Proceedings of the 5th Annual Workshop on Mod-
eling, Benchmarking and Simulation, June 2009.
Ashur Rafiev received his PhD in 2011 from
the School of Electrical, Electronic and Computer
Engineering, Newcastle University. He currently
works in the School of Computing Science,
Newcastle University, as a Research Associate. His
research interest is focused on power modelling
and hardware-software co-simulation of many-core
systems.
Mohammed Al-hayanni (Student Member, IEEE
and IET) is an experienced electronics, computer and
software engineer. He is currently studying for his
PhD with the School of Electrical and Electronic En-
gineering, Newcastle University. His research inter-
ests include developing practically validated robust
performance adaptation models for energy-efficient
many-core computing systems.
Fei Xia is a Senior Research Associate with the
School of Electrical and Electronic Engineering,
Newcastle University. His research interests are in
asynchronous and concurrent systems with an em-
phasis on power and energy. He holds a PhD from
King’s College, London, an MSc from the University
of Alberta, Edmonton, and a BEng from Tsinghua
University, Beijing.
Rishad Shafik (MIEE06-to-date) is a Lecturer in
Electronic Systems at Newcastle University. His
research interests include design of intelligent and
energy-efficient embedded systems. He holds a PhD
and an MSc from Southampton Uni., and a BEng
from IUT, Bangladesh. He has authored 80+ research
articles published by IEEE/ACM, and is the co-editor of
“Energy-efficient Fault-Tolerant Systems”. He is chairing
DFT’17 (http://www.dfts.org),
to be held in Cambridge, UK.
Alexander Romanovsky is a Professor at Newcastle
University, UK and the leader of the Secure and
Resilient Systems group at the School of Computing
Science there. His main research interests are system
dependability, fault tolerance, software architectures,
exception handling, error recovery, system verifica-
tion for safety, system structuring and verification
of fault tolerance. He is a member of the edito-
rial boards of Computer Journal, IEEE Transactions
on Reliability, Journal of System Architecture and
International Journal of Critical Computer-Based
Systems.
Alex Yakovlev is a professor in the School of Electrical
and Electronic Engineering, Newcastle University.
His research interests include asynchronous
circuits and systems, concurrency models, and
energy-modulated computing. Yakovlev received a DSc in
Engineering at Newcastle University. He is a Senior
Member of IEEE and a Fellow of IET. In 2011-2013
he was a Dream Fellow of the UK Engineering and
Physical Sciences Research Council (EPSRC).
