The performance wall of parallelized sequential computing: the dark
  performance and the roofline of performance gain by Végh, János
Graphical Abstract
Using parallel computing to reach the needed higher performance adds one more limitation to the already known ones.
The performance gain of parallelized sequential processor systems has a theoretical upper limit, derived from the laws of
nature, the paradigm, and the technical implementation of computing. Based on the different limiting factors, the paper
derives a theoretical upper limit of the performance gain. Using the rigorously controlled database of supercomputers,
the performance gains of all supercomputers built so far are analyzed and compared to the theoretical bound. It is shown that
the present implementations have already reached the theoretical upper bound. The upper bound of the performance gain for
processor-based brain simulation is also derived.
[Figure: "The roofline of performance gain of supercomputers". Horizontal axis: Year (1994 to 2020); vertical axis: Performance gain (logarithmic, 10^2 to 10^8). Series: 1st, 2nd and 3rd by R^HPL_Max; best by α^HPL_eff; 1st, 2nd, 3rd and best by R^HPCG_Max; brain simulation.]
arXiv:1908.02280v1 [cs.DC] 2 Aug 2019
Highlights
• A new limitation of computing performance is derived
• A new merit for parallelization is introduced
• The theoretically possible performance gain is already achieved
• The performance of brain simulation is compared with that of the benchmarks
The performance wall of parallelized sequential computing:
the dark performance and the roofline of performance gain⋆
János Végh
ᵃ Kalimános BT, Hungary. 4032 Debrecen, Komlóssy 26.
ARTICLE INFO
Keywords:
efficiency
single-processor
performance
performance gain
supercomputer
brain simulation
ABSTRACT
Computing performance today is developed mainly through parallelized sequential computing, in
many forms. The paper scrutinizes whether the performance of that type of computing has an upper
limit. Simple considerations point out that the theoretically possible upper bound is practically
achieved, and that the main obstacles to stepping further are the presently used computing paradigm and
implementation technology. In addition to the former "walls", the "performance wall" must also be
considered. As the paper points out, similarly to "dark silicon", "dark performance" is
always present in parallelized many-processor systems.
1. Introduction
Parallelization is coming more and more into the focus of computing today [5]. Since the growth of single-processor performance has stalled [45], the only hope for making computing systems with higher performance is to assemble them from a large number of sequentially working computers. "Computer architects have long sought the "The City of Gold" (El Dorado) of computer design: to create powerful computers simply by connecting many existing smaller ones." [37]. Achieving satisfactory scalability for the task, however, is not simple at all. One of the major motivations behind the Gordon Bell Prize was to increase the resulting performance gain of parallelized sequential systems: "a speedup of at least 200 times on a real problem running on a general purpose parallel processor" [6].
It has been known from the very beginning that the "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities" [3, 41] is at least questionable, and that really large-scale tasks (like building supercomputers comprising millions of single processors, or building brain simulators from single processors) may face serious scalability issues. After the initial difficulties, today's large-scale supercomputers are stretched to the limits [7], and although the EFlops (payload) performance has not yet been achieved, a $10^4$ times higher performance is already planned [32]. Many of the planned large-scale supercomputers, however, are delayed, canceled, paused, withdrawn, re-targeted, etc. It looks as if, in the present "gold rush", it has not been scrutinized whether the resulting payload performance of parallelized systems has some upper bound. In the known feasibility studies this aspect remains out of sight in the USA [44], in the EU [17], in Japan [18] and in China [32]. The designers did not follow the advice "that system designers must make the effort to understand the relevant characteristics of the benchmark applications they use, if they are to arrive at the correct design decisions for building larger multiprocessor systems" [41]. The rules of the game are different for segregated processors and parallelized ones [51]; the larger the systems, the more remarkable the deviations that appear in their behavior. In addition to the already known limitations [35], valid for segregated processors, a new limitation, valid for parallelized processors, appears [55].

⋆ This work is supported by the National Research, Development and Innovation Fund of Hungary, under K funding scheme Projects no. 125547 and 132683; ERC-ECAS support of project 861938 is also acknowledged.
∗ Corresponding author
Vegh.Janos@gmail.com (J. Végh)
ORCID(s): 0000-0001-7511-2910 (J. Végh)
This paper shows that the well-known Amdahl's law, the commonly used computing paradigm (the single-processor approach [3]) and the commonly used implementation technology together form a strong upper bound for the performance of parallelized sequential computing systems, and that the experienced failures, saturation, etc. can be attributed to approaching and attempting to exceed that upper bound. In section 2 "We realize that Amdahl's Law is one of the few, fundamental laws of computing" [38], and reinterpret it for the targeted goal. The section introduces the mathematical formalism used and derives the deployed logical merits. Based on the idea of Amdahl, in section 3 a general model of parallel computing is constructed. In this (by intention) strongly simplified model the contributions are classified as either parallelizable or non-parallelizable. The section provides an overview of the components contributing to the non-parallelizable fraction of processing, and attempts to reveal their origin and behavior. By properly interpreting the contributions, it shows that under different conditions different contributions can take the role of the performance-limiting factor. The section also interprets the role of the benchmarks, and explains why different benchmarks produce different results. Section 4 shows that parallelized sequential computing systems have an inherent performance bound, and shows numerous potential limiting factors. The section also provides some supercomputer-specific numerical examples.
The last two sections directly underpin the previous theoretical discussion with measured data. In section 5 some specific supercomputer features are discussed, where the case is well documented and the special case makes it possible to draw general conclusions. This section also discusses the near (predictable) future of large-scale supercomputers and their behavior. Section 6 draws some statistical conclusions, based on the available supercomputer database, containing rigorously verified, reliable data for the complete supercomputer history. The large amount of data makes it possible, among other things, to derive conclusions about a new law of electronic development, to tell where and when it is advantageous to apply graphic accelerator units, and to derive a "roofline" model of supercomputing.

J. Végh: Preprint submitted to Elsevier Page 1 of 17
The roofline of parallelized performance gain
2. Amdahl’s classic analysis
2.1. Origin and interpretation
The most commonly known and cited limitation on parallelization speedup ([3], the so-called Amdahl's law) considers the fact that some parts ($P_i$) of a code can be parallelized, while some ($S_i$) must remain sequential. Amdahl only wanted to draw attention to the fact that when putting together several single processors, and using the Single Processor Approach (SPA), the available speed gain due to using large-scale computing capabilities has a theoretical upper bound. He also mentioned that data housekeeping (non-payload calculations) causes some overhead, and that the nature of that overhead appears to be sequential, independently of its origin.
2.2. Validity
A general misconception (introduced by successors of
Amdahl) is to assume that Amdahl’s law is valid for software
only and that the non-parallelizable fraction means some-
thing like the ratio of numbers of the corresponding instruc-
tions. Amdahl in his famous paper speaks about "the frac-
tion of the computational load" and explicitly mentions, in
the same sentence and same rank, algorithmic reasons like
"computations required may be dependent on the states of
the variables at each point"; architectural aspects like "may
be strongly dependent on sweeping through the array along
different axes on succeeding passes" as well as "physical
problems" like "propagation rates of different physical ef-
fects may be quite different". His point of view is valid also
today: one has to consider the load of the complex hard-
ware (HW)/software (SW) system, rather than some segre-
gated component, and his idea describes parallelization im-
perfectness of any kind. When applied to a particular case,
however, one shall scrutinize which contributions can actu-
ally be neglected.
Actually, Amdahl's law is valid for any partly parallelizable activity (including ones unrelated to computers), and the non-parallelizable fraction shall be given as the ratio of the time spent with non-parallelizable activity to the total time. The concept has frequently been utilized successfully in quite unexpected fields, as well as misunderstood and abused (see [30, 23]).
As discussed in [35]:
• many parallel computations today are limited by sev-
eral forms of communication and synchronization
• the parallel and sequential runtime components are
only slightly affected by cache operation
• the wires get increasingly slower relative to gates
2.3. Amdahl’s case under realistic conditions
The realistic case is that the parallelized parts are not of equal length (even if they comprise exactly the same instructions). The hardware operation in modern processors may execute them in considerably different times; for examples see [53, 24] and references cited therein, the operation of hardware accelerators inside a core, or network operation between processors, etc. One can also see that the time required to control parallelization is not negligible and varies, representing another source of performance bound.
The static correspondence between program chunks and processing units can be very inefficient: all assigned processing units must wait for the delayed unit. The measurable performance does not match the nominal performance, leading to the appearance of "dark performance": the processors cannot all be utilized at the same time, much like how a fraction of the cores cannot be utilized at the same time (because of energy dissipation), leading to the issue of "dark silicon" [16, 15].
Also, some capacity is lost if the number of computing resources exceeds the number of parallelized chunks. If the number of processing units is smaller than that of the parallelized threads, several "rounds" must be organized for the remaining threads, with all the disadvantages of the duty of synchronization [57, 10]. In such cases it is not possible to apply Amdahl's Law directly: the actual architecture is too complex or not known. However, in all cases the speedup can be measured and expressed as a function of the number of processors.
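The effect of unequal chunk lengths and of a mismatch between the number of chunks and processing units can be illustrated with a minimal sketch. It assumes a simple round-based static schedule, which is an illustrative simplification and not a model taken from the paper; the function names and the chunk durations are hypothetical.

```python
def wall_time(chunk_times, units):
    """Static round-based schedule (an assumed simplification): chunks are
    dealt out `units` at a time, and every round lasts as long as its
    slowest chunk, because all units wait for the delayed one."""
    rounds = [chunk_times[i:i + units] for i in range(0, len(chunk_times), units)]
    return sum(max(r) for r in rounds)


def utilization(chunk_times, units):
    """Useful work divided by nominal capacity (units x wall time).
    The missing part plays the role of the 'dark performance'."""
    return sum(chunk_times) / (units * wall_time(chunk_times, units))


times = [1.0, 1.0, 1.0, 2.0]      # hypothetical chunk durations
print(utilization(times, 4))      # one round; three units wait for the slow one
print(utilization(times, 8))      # more units than chunks: capacity is lost
print(utilization(times, 2))      # fewer units than chunks: two rounds needed
```

In all three cases the utilization stays below one, although no sequential-only code is present at all.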
2.3.1. Factors affecting parallelism
Usually, Amdahl's law is expressed as

$$S^{-1} = (1-\alpha) + \frac{\alpha}{k} \tag{1}$$

where $k$ is the number of parallelized code fragments, $\alpha$ is the ratio of the parallelizable fraction to the total, and $S$ is the measurable speedup. The assumption can be visualized as follows: (assuming many processors) in the $\alpha$ fraction of the running time the processors are processing data, and in the $(1-\alpha)$ fraction they are waiting (all but one). That is, $\alpha$ describes how much, on average, the processors are utilized. Having those data, the resulting speedup can be estimated.
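As a quick numerical illustration of Equ. (1), the following sketch evaluates the speedup for given $\alpha$ and $k$. The function name and the sample values are hypothetical and chosen only for illustration.

```python
def amdahl_speedup(alpha, k):
    """Speedup S from Equ. (1): 1/S = (1 - alpha) + alpha / k."""
    return 1.0 / ((1.0 - alpha) + alpha / k)


# Hypothetical values: 95 % parallelizable fraction on 20 processors.
print(amdahl_speedup(0.95, 20))
# Limiting cases: fully parallelizable code scales with k, fully sequential code does not.
print(amdahl_speedup(1.0, 8), amdahl_speedup(0.0, 8))
```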
For today's complex systems, calculating $\alpha$ is hopeless, but for a system under test, where $\alpha$ is not known a priori, one can derive from the measurable speedup $S$ an effective parallelization factor as

$$\alpha_{eff} = \frac{k}{k-1}\,\frac{S-1}{S} \tag{2}$$

Obviously, this is nothing more than $\alpha$ expressed in terms of $S$ and $k$ from Equ. (1). So, for the classical case, $\alpha = \alpha_{eff}$, which simply means that in the ideal case the actually measurable effective parallelization achieves the theoretically possible one. In other words, $\alpha$ describes a system the architecture of which is completely known, while $\alpha_{eff}$ characterizes a system the performance of which is known from experiments. Put differently again, $\alpha$ is the theoretical upper limit, which can hardly be achieved, while $\alpha_{eff}$ is the actual experimental value, which describes the complex architecture and the actual conditions.
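The relation between Equ. (1) and Equ. (2) can be checked with a short round trip: a speedup generated from a known $\alpha$ should give back the same value as $\alpha_{eff}$. This is only a sketch; the function names and the numbers are hypothetical.

```python
def amdahl_speedup(alpha, k):
    """Equ. (1): 1/S = (1 - alpha) + alpha / k."""
    return 1.0 / ((1.0 - alpha) + alpha / k)


def effective_alpha(speedup, k):
    """Equ. (2): alpha_eff = k / (k - 1) * (S - 1) / S, for k > 1."""
    return k / (k - 1.0) * (speedup - 1.0) / speedup


k = 1000                          # hypothetical number of processors
s = amdahl_speedup(0.999, k)      # "measured" speedup in the classical case
print(s, effective_alpha(s, k))   # alpha_eff reproduces alpha = 0.999
```

For a real system only `effective_alpha` is needed: the speedup and the number of processors are measured, and no knowledge of the architecture is assumed.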
The value of $\alpha_{eff}$ can then be used to refer back to Amdahl's classical assumption even in realistic cases when the detailed architecture is not known. On one side, the speedup $S$ can be measured and $\alpha_{eff}$ can be utilized to characterize the measurement setup and conditions [54], i.e. how much of the theoretically possible maximum parallelization is realized. On the other side, the theoretically achievable $S$ or $\alpha_{eff}$ can be guessed from some general assumptions.
In the case of real tasks a Sequential/Parallel Execution Model [57] shall be applied, which cannot use the simple picture reflected by $\alpha$, but $\alpha_{eff}$ gives a good merit of the degree of parallelization for the duration of executing the process on the given hardware configuration, and can be compared to the results of technology-dependent parametrized formulas¹. Numerically, $(1-\alpha_{eff})$ equals the $f$ value established theoretically [29].
Several models can be applied to the scaling of parallel systems, and all of them can be good in their specific field. However, one must recall that "The truth is that there is probably no specific parameter scaling principle that can be universally applied [41]." This also means that the validity of the scaling methods for an extremely large number of processors must be scrutinized.
2.4. Efficiency of parallelization
The distinguished constituent in Amdahl's classic analysis is the parallelizable payload fraction $\alpha$; all the rest (including wait time, communication, system contribution and any other non-payload activities) goes into the apparently "sequential-only" fraction according to this extremely simple model.
When using several processors, one of them makes the sequential-only calculation while the others are waiting² (using the same amount of time). In the age of Amdahl the number of processors was small and the contribution of the SW was high, relative to the contribution of the parallelized HW. Because of this, the contribution of the SW dominated the sequential part, so the value of $(1-\alpha)$ really could be considered a constant. Technical development has decreased all non-parallelizable components; the large systems today can be idle also because of HW reasons.
¹ Just notice here that passing parameters among cores as well as blocking each other (of course, through the operating system (OS)) are all a kind of synchronization or communication, and their amount differs task by task.
² In a different technology age, the same phenomenon was already described: "Amdahl argued that most parallel programs have some portion of their execution that is inherently serial and must be executed by a single processor while others remain idle." [41]
[Figure 1 plot: "Dependence of E_HPL and E_HPCG on (1 − α^HPL_eff) and N". Axes: No of cores (10^4 to 10^7), (1 − α^HPL_eff) (10^-7 to 10^-5), Efficiency (10^-2 to 10^0). Data points: TOP5'2018.11 (Summit, Sierra, Taihulight, Tianhe-2, Piz Daint).]
Figure 1: The dependence of computing efficiency of paral-
lelized sequential computing systems on the parallelization ef-
ficacy and the number of cores. The surface is described by
Eq. (4), the data points for the TOP5 supercomputers are
calculated from the publicly available database [42]
Anyhow, when calculating speedup, one calculates

$$S = \frac{(1-\alpha)+\alpha}{(1-\alpha)+\alpha/k} = \frac{k}{k(1-\alpha)+\alpha} \tag{3}$$

hence efficiency³ (how speedup scales with the number of processors)

$$E = \frac{S}{k} = \frac{1}{k(1-\alpha)+\alpha} \tag{4}$$
This means that according to Amdahl, as presented in Fig. 1, the efficiency depends both on the total number of processors in the system and on the perfectness⁴ of the parallelization. The perfectness comprises two factors: the theoretical limitation and the engineering ingenuity.
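The dependence described by Equ. (4) can be made tangible with a small sketch. It is parametrized directly by the sequential-only fraction $(1-\alpha)$ to avoid numerical cancellation for values of $\alpha$ very close to one; the function name and the sample values are hypothetical and are not data from the TOP500 database.

```python
def efficiency(seq_fraction, k):
    """Efficiency E from Equ. (4), written with seq_fraction = 1 - alpha:
    E = 1 / (k * (1 - alpha) + alpha)."""
    return 1.0 / (k * seq_fraction + (1.0 - seq_fraction))


seq = 1e-7  # hypothetical sequential-only fraction
for cores in (1e4, 1e5, 1e6, 1e7):
    print(int(cores), efficiency(seq, cores))
```

With this hypothetical fraction the efficiency is still close to one at $10^4$ cores but drops to about one half at $10^7$ cores, i.e. the same parallelization quality yields very different efficiencies at different system sizes.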
If parallelization is well-organized (load balanced, small overhead, right number of Processing Units (PUs)), $\alpha$ saturates at unity (in other words: the sequential-only fraction approaches zero), so the tendencies can be better displayed by using $(1-\alpha_{eff})$ in the diagrams below.
The importance of this practical term $\alpha_{eff}$ is underlined by the fact that it can be interpreted and utilized in many different areas [54, 52], and that the achievable speedup (the maximum achievable performance gain when using an infinitely large number of processors) can easily be derived from Equ. (1) as

$$G = \frac{1}{1-\alpha_{eff}} \tag{5}$$
³ This quantity is almost exclusively used to describe the computing performance of multi-processor systems. In the case of supercomputers, $R_{Max}/R_{Peak}$ is provided, which is identical with $E$.
⁴ As will be explained below, in a higher-order approximation the value of $(1-\alpha)$ itself also depends on the number of processors.
Provided that the value of $\alpha_{eff}$ does not depend on the number of processors, for a homogeneous system the total payload performance is

$$P_{total\ payload} = G \cdot P_{single\ processor} \tag{6}$$

i.e. the total payload performance can be increased by increasing the performance gain or by increasing the single-processor performance, or both. Notice, however, that increasing the single-processor performance through accelerators also has its drawbacks and limitations [47], and that the performance gain and the single-processor performance are players of the same rank in defining the payload performance.
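Equ. (5) and Equ. (6) can be sketched numerically as follows. The sequential-only fraction and the single-processor performance used here are hypothetical illustration values, and the function names are not taken from the paper.

```python
def performance_gain(seq_fraction):
    """Equ. (5): G = 1 / (1 - alpha_eff), with seq_fraction = 1 - alpha_eff."""
    return 1.0 / seq_fraction


def speedup(seq_fraction, k):
    """Equ. (1) rewritten with the sequential-only fraction."""
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / k)


def total_payload(seq_fraction, single_processor_perf):
    """Equ. (6): total payload performance = G * single-processor performance."""
    return performance_gain(seq_fraction) * single_processor_perf


seq = 1e-7        # hypothetical (1 - alpha_eff)
p_single = 10.0   # hypothetical single-processor performance, in Gflop/s
print(performance_gain(seq))          # upper limit of the gain
print(speedup(seq, 1e9))              # stays below G even for 10^9 processors
print(total_payload(seq, p_single))   # in Gflop/s
```

The sketch shows that, under the assumption of a constant $\alpha_{eff}$, adding processors beyond roughly $1/(1-\alpha_{eff})$ hardly increases the speedup any more.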
2.4.1. Connecting efficiency and 훼
Through using Equ. (4), $\frac{S}{k}$ can be equally good for describing the efficiency of parallelization of a setup, but in any case a second parameter, the number of processors $k$, is also required. From Equ. (4)

$$\alpha_{E,k} = \frac{E k - 1}{E (k-1)} \tag{7}$$

This quantity depends on both $E$ and $k$, but in some cases it can be assumed that $\alpha$ is independent of the number of processors. This seems to be confirmed by data calculated from several publications, as was noticed early.
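A minimal sketch of Equ. (7) follows: it recovers $\alpha$ from an efficiency value and the number of processors, and checks the result against Equ. (4). The function names and numbers are hypothetical.

```python
def efficiency(alpha, k):
    """Equ. (4): E = 1 / (k * (1 - alpha) + alpha)."""
    return 1.0 / (k * (1.0 - alpha) + alpha)


def alpha_from_efficiency(e, k):
    """Equ. (7): alpha = (E * k - 1) / (E * (k - 1)), for k > 1."""
    return (e * k - 1.0) / (e * (k - 1.0))


k = 1000                           # hypothetical number of processors
e = efficiency(0.999, k)           # efficiency as it would be reported
print(e, alpha_from_efficiency(e, k))
```

For supercomputers, `e` would be taken as the reported ratio of the measured to the peak performance, so no single-processor run is required.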
At this point one can notice that $\frac{1}{E}$ in Equ. (4) is a linear function of the number of processors, and its slope equals $(1-\alpha_{eff})$. The value calculated in this way is denoted by $(1-\alpha_{\Delta})$. Its numerical value is quite near to the value calculated (see Equ. (2)) using all processors, and so it is not displayed in the rest of the figures. This also means that from efficiency data one can estimate the value of $\alpha_{\Delta}$ even for intermediate regions, i.e. without knowing the execution time on a single processor (for technical reasons, this is the usual case for supercomputers). From a handful of processors one can find out whether the supercomputer under construction can hope to beat⁵ the No. 1 in the Top500 [42]. This result can also be used for investment protection.
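The slope-based estimate can be sketched with an ordinary least-squares fit of $1/E$ against the number of processors. The fitting method is an assumption made here for illustration (the text only states that the slope is used), and the efficiency values are synthetic, generated from Equ. (4) with a hypothetical sequential-only fraction.

```python
def slope_of_inverse_efficiency(ks, efficiencies):
    """Least-squares slope of 1/E versus k; by Equ. (4) this estimates
    (1 - alpha_delta) without any single-processor measurement."""
    ys = [1.0 / e for e in efficiencies]
    k_mean = sum(ks) / len(ks)
    y_mean = sum(ys) / len(ys)
    numerator = sum((k - k_mean) * (y - y_mean) for k, y in zip(ks, ys))
    denominator = sum((k - k_mean) ** 2 for k in ks)
    return numerator / denominator


seq = 1e-5                                   # hypothetical (1 - alpha)
ks = [1000, 2000, 4000, 8000]                # intermediate system sizes
es = [1.0 / (k * seq + (1.0 - seq)) for k in ks]   # synthetic data from Equ. (4)
print(slope_of_inverse_efficiency(ks, es))   # close to the assumed 1e-5
```

With measured data the points will scatter around the line, so the estimate should be read together with the quality of the fit.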
2.4.2. Time to organize parallelization
The timing analysis given above can be applied to dif-
ferent kinds of parallelizations, from processor-level paral-
lelization (instruction or data level parallelization, in nanosec-
onds range) to OS-level parallelization (including thread-
level parallelization using several processors or cores, in mi-
croseconds range), to network-level (between networked com-
puters, like grids, in milliseconds range). The principles are
the same [10], independently of the kind of implementation.
In agreement with [57], housekeeping overhead is always present (and mainly depends on the HW+SW architectural solution) and remains a key question. The main focus is always on reducing its effect. Notice that the application itself also comprises some (variable) amount of sequential contribution.
⁵ At least a lower bound on $\alpha_{eff}$ (i.e. a higher bound on parallelization gain) can be derived.
The actual speedup (or effective parallelization) depends strongly on the 'tricks' used during implementation. Although HW and SW parallelisms are interpreted differently [26], they can even be combined [8], resulting in hybrid architectures. For those greatly different architectural solutions it is hard even to interpret $\alpha$, while $\alpha_{eff}$ makes it possible to compare different implementations (or the same implementation under different conditions).
3. Our model of parallel execution
As mentioned in section 2.2, Amdahl listed different reasons why losses in the "computational load" can occur. To understand the operation of computing systems working in parallel, one needs to extend Amdahl's original model (rather than that of his successors) in such a way that the non-parallelizable (i.e. apparently sequential) part comprises contributions from HW, OS, SW and Propagation delay (PD), and also some access time is needed for reaching the parallelized system. The technical implementations of the different parallelization methods show an infinite variety, so here a (by intention) strongly simplified model is presented. Amdahl's idea makes it possible to put everything⁶ that cannot be parallelized into the sequential-only fraction. The model is general enough to discuss qualitatively some examples of systems working in parallel, neglecting different contributions as possible in the different cases. The model can also be converted to a technical (quantitative) one of limited validity.
3.1. Formal introduction of the model
The contribution of the model component $XXX$ to $\alpha_{eff}$ will be denoted by $\alpha^{XXX}_{eff}$ in the following. Notice the different nature of those contributions. They have only one common feature: they all consume time. The extended Amdahl's model is shown in Fig. 2. The vertical scale displays the actual activity for the processing units shown on the horizontal scale.
Notice that our model assumes no interaction between processes running on the parallelized systems in addition to the absolutely necessary minimum: starting and terminating the otherwise independent processes, which take parameters at the beginning and return results at the end. It can, however, be trivially extended to the more general case when processes must share some resource (like a database, which shall provide different records for the different processes), either implicitly or explicitly. Concurrent objects have inherent sequentiality [13], and synchronization and communication among those objects considerably increase [57] the non-parallelizable fraction (i.e. the contribution $(1-\alpha^{SW}_{eff})$), so in the case of an extremely large number of processors special attention must be devoted to their role in the efficiency of the application on the parallelized system.

⁶ Although modern diagnostic methods make it possible to separate the theoretically different contributions and measure their values separately, unfortunately their effect is summed up in the non-parallelizable fraction. The summing rule is not simple: some sequential contribution may occur in parallel with some other, like the propagation delay with the OS functionality; and it must also be considered whether the given non-parallelizable item contributes to the main thread or to one of the fellow threads.

[Figure 2 diagram: "Model of parallel execution". Horizontal axis: Proc (processing units P0 to P4); vertical axis: Time (not proportional), 0 to 10. Blocks in time order: AccessInitiation, SoftwarePre, OSPre, then for each unit i: T_i, PD_i0, Process_i, PD_i1 and "Just waiting", followed by OSPost, SoftwarePost, AccessTermination. Side brackets: Payload, Total, Extended.]
Figure 2: The extended Amdahl's model of parallelizing sequential processing (somewhat idealistic and not proportional)
Let us notice that all contributions have a role during
measurement: the effect of contributions due to SW, HW,
OS and PD cannot be separated, though dedicated measure-
ments can reveal their role, at least approximately. The rela-
tive weights of the different contributions are very different
for the different parallelized systems, and even within those
cases depend on many specific factors, so in every single
parallelization case a careful analysis is required.
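How the time contributions of the model add up to a non-parallelizable fraction can be sketched under strong assumptions. The sketch below assumes that the sequential contributions simply add (the model itself warns that the summing rule is not that simple), that every processing unit runs an equally long payload, and that the access time is left out. All names and numbers are hypothetical.

```python
def sequential_fraction(contributions, payload_per_unit, units):
    """(1 - alpha_eff) under the additive assumption: sequential-only time
    divided by the total single-processor time (sequential + all payloads)."""
    sequential = sum(contributions.values())
    return sequential / (sequential + units * payload_per_unit)


# Hypothetical time contributions, in arbitrary but identical time units.
contributions = {
    "SoftwarePre": 1.0, "OSPre": 0.5, "PD": 1.0,
    "OSPost": 0.5, "SoftwarePost": 1.0,
}
fraction = sequential_fraction(contributions, payload_per_unit=24.0, units=4)
print(fraction)   # sequential-only share of the single-processor time

# Share of each contribution within the sequential-only part.
total = sum(contributions.values())
for name, value in contributions.items():
    print(name, value / total)
```

Increasing `units` with a fixed payload per unit lowers the fraction, while any growth of a single contribution (for example the propagation delay) raises it, which is why each contribution needs a separate look.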
3.1.1. Access time
Initiating and terminating the parallel processing is usually done from within the same computer, except when one can only access the parallelized computer system from another computer (as in the case of clouds). This latter access time is independent of the parallelized system, and one must properly correct for the access time when deriving timing data for the parallelized system. Amdahl's law is valid only for a properly selected computing system. This is a one-time and usually fixed-size time contribution.
3.1.2. Execution time
The execution time Total covers all processing on the parallelized system. All applications running on a parallelized system must perform some non-parallelizable activity at least before beginning and after terminating the parallelizable activity. This SW activity represents what was assumed by Amdahl as the total sequential fraction⁷. As shown in Fig. 2, the apparent execution time includes the real payload activity, as well as waiting and OS and SW activity. Recall that the execution times may be different [36, 37, 39] in the individual cases, even if the same processor executes the same instruction, but executing an instruction mix many times results in practically identical execution times, at least at model level. Note that the standard deviation of the execution times appears as a contribution to the non-parallelizable fraction, and in this way increases the "imperfectness" of the architecture. This feature of processors deserves serious consideration when utilizing a large number of processors. Over-optimizing a processor for the single-thread regime backfires when using it in a many-processor environment.
3.2. The principle of the measurements
When measuring performance, one faces serious difficulties, see for example [37], chapter 1, both with making measurements and interpreting them. When making a measurement (i.e. running a benchmark) either on a single processor or on a system of parallelized processors, an instruction mix is executed many times. There is, however, a crucial difference: in the second case an extra activity is also included: the job of organizing the joint work. This is the origin of the 'efficiency', and it leads to critical issues in the case of an extremely large number of processors.
The large number of executions averages the rather different execution times [36], with an acceptable standard deviation. In the case when the executed instruction mix is the same, the conditions (like cache and/or memory size, the network bandwidth, Input/Output (I/O) operations, etc.) are different and they form the subject of the comparison. In the case when comparing different algorithms (like results of different benchmarks), the instruction mix itself is also different.
Notice that the so-called "algorithmic effects" manifest through the HW/SW architecture, and they can hardly be separated. Examples are dealing with sparse data structures (which affects cache behavior) or communication between the threads running in parallel, like returning results repeatedly to the main thread in an iteration (which greatly increases the non-parallelizable fraction in the main thread). Also notice that there are fixed-size contributions, like utilizing time measurement facilities or calling system services. Since $\alpha_{eff}$ is a relative merit, the absolute measurement time shall be long. When utilizing efficiency data from measurements which were dedicated to some other goal, proper caution must be exercised with the interpretation and accuracy of the data.
3.3. The measurement method (benchmarking)
Not surprisingly, the method of the measurement fundamentally affects the result of the measurement: the "device under test" and the "measurement device" are the same.
The benchmarks, utilized to derive numerical parameters for supercomputers, are specialized and standardized programs, which run in the HW/OS environment provided by the parallelized computer under test. One can use benchmarks for different goals. Two typical fields of utilization are: to describe the environment the computer application runs in (a "best case" estimation), and to guess how quickly an application will run on a given parallelized computer (a "real-life" estimation).

⁷ Although some OS activity was surely included, Amdahl assumed some 20 % SW fraction, so the other contributions could be neglected compared to the SW contribution.
If the goal is to characterize the supercomputer's HW+OS system itself, a benchmark program should distort the HW+OS contribution as little as possible, i.e. the SW contribution must be much lower than the HW+OS contribution. In the case of supercomputers, the benchmark High Performance LINPACK [25] (HPL) (with minor modifications) has been used for this goal since the beginning of the supercomputer age. The mathematical behavior of HPL makes it possible to minimize the SW contribution, i.e. HPL delivers the best possible estimation for $\alpha^{HW+OS}_{eff}$.
If the goal is to estimate the expectable behavior of an application, the benchmark program should imitate the structure and behavior of the application. In the case of supercomputers, a couple of years ago the benchmark High Performance Conjugate Gradients [25] (HPCG) was introduced for this goal, since "HPCG is designed to exercise computational and data access patterns that more closely match a different and broad set of important applications, and to give incentive to computer system designers to invest in capabilities that will have impact on the collective performance of these applications" [25]. However, its utilization can be misleading: the ranking is only valid for the HPCG application, and only when utilizing that number of processors. HPCG really seems to give better hints for designing supercomputer applications⁸ than HPL does. According to our model, in the case of using the HPCG benchmark, the SW contribution dominates⁹, i.e. HPCG delivers the best possible estimation for $\alpha^{SW}_{eff}$ for this class of supercomputer applications.
The different benchmarks provide different $(1-\alpha^{SW}_{eff})$ contributions to the non-parallelizable fraction (resulting in different efficiencies and ranking [27]), so comparing results (and especially establishing ranking!) derived using different benchmarks shall be done with maximum care. Since the efficiency depends heavily on the number of cores (see also Fig. 1 and Eq. (4)), the different configurations shall be compared using the same benchmark and the same number of processors (or same $R_{Peak}$).¹⁰
4. The inherent limitations of supercomputing
The limitations can be derived basically in two ways. Either
experience with the implementations can be used to draw conclusions
(an empirical technical limit), or theoretical assumptions can be
used to derive a theoretical limit. The first way can be followed
only if a large number of rigorously verified data are available
for drawing conclusions. This method will be followed in connection
with supercomputers, where the reliable database TOP500 [42] is
available. This method is purely empirical and results only in
an "up to now" achieved value, so one cannot be sure whether the
experienced limitation is just a kind of engineering imperfection.
The other method results in a theoretical upper bound, and one
cannot be sure whether it can technically be achieved. It is a
strong confirmation, however, that the two ways lead to the
same limitation: what can be achieved theoretically has already
been achieved in practice.
⁸ This is why, for example, [11] considers HPCG as "practical performance".
⁹ Returning calculated gradients requires much more sequential communication (unintended blocking).
¹⁰ This is why it is misleading in some cases on the TOP500 lists to compare an HPL benchmark result, measured with all available cores, with an HPCG benchmark result, measured with a small fraction of the cores.
The technical implementations of parallelized sequential computing
systems show an infinite variety, so it is not really possible to
describe all of them in a uniform scheme. Instead, some originating
factors are mentioned and their corresponding terms in the model are
named. At this point the simplicity of the model is a real advantage:
all possible contributions are classified only as parallelizable or
non-parallelizable ones. The model uses time-equivalent units, so all
contributions are expressed as time, independently of their origin.
That parallel programs have inherently sequential parts (and so an
inherent performance limit) has been known for decades: "Amdahl
argued that most parallel programs have some portion of their
execution that is inherently serial and must be executed by a single
processor while others remain idle." [41] Those limitations follow
immediately from the physical implementation and the computing
paradigm; which of them dominates depends on the actual conditions.
It is crucial to understand that the decreasing efficiency (see
Eq. (4)) comes from the computing paradigm itself rather than from
some kind of engineering imperfection. This inherent limitation
cannot be mitigated without changing the computing/implementation
principle.
4.1. Propagation delay PD
In modern high-clock-speed processors it is increasingly hard to
reach the right component inside the processor at the right time, and
it is even harder if the PUs are at a distance much larger than the
die size. Also, the technical implementation of the interconnection
can contribute seriously.
4.1.1. Wiring
As discussed in [35], the weight of wiring compared to the processing
is continuously increasing. The gates may become (much) faster, but
the speed of light is an absolute limit for the signal propagation on
the wiring connecting them. This is increasingly true when
considering large systems: the need to cool the modern high-density
processors increases the length of the wiring between them.
4.1.2. Physical size
Although signals travel in a computing system at nearly the speed of
light, with increasing physical size of the computer system a
considerable time passes between issuing and receiving a signal,
causing the other party to wait,
J. Végh: Preprint submitted to Elsevier Page 6 of 17
The roofline of parallelized performance gain
without doing any payload work. At today's frequencies and chip sizes
a signal cannot even travel from one side of the chip to the other in
one clock period; in the case of a stadium-sized supercomputer this
delay can be in the order of several hundred clock cycles. Since the
time of Amdahl, the ratio of the computing time to the propagation
time has changed drastically, so (as [35] points out) it cannot be
neglected any more, although presently it is not (yet) a major
dominating term.
As long as computer components are in proximity in the mm range, the
contribution of PD can be neglected, but the distance between PUs in
supercomputers is typically in the 100 m range, and the nodes of a
cloud system can be geographically far from each other, so
considerable propagation delays can also occur.
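The order of magnitude of the propagation delay can be checked with a short sketch. It assumes propagation at the vacuum speed of light (real cables are slower, so the numbers are lower bounds) and a 1 GHz clock; the distances and the helper name are illustrative assumptions.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # upper bound for any signal speed


def propagation_cycles(distance_m: float, clock_hz: float,
                       round_trip: bool = False) -> float:
    """Clock periods elapsing while a signal covers distance_m at light speed."""
    path_m = 2.0 * distance_m if round_trip else distance_m
    return path_m / SPEED_OF_LIGHT_M_PER_S * clock_hz


CLOCK_HZ = 1e9  # assumed 1 GHz clock
for label, distance_m in (("rack (1 m)", 1.0),
                          ("supercomputer (100 m)", 100.0),
                          ("cloud nodes (1000 km)", 1.0e6)):
    cycles = propagation_cycles(distance_m, CLOCK_HZ)
    print(f"{label}: at least {cycles:.0f} clock periods one way")
```

For the 100 m case the sketch gives a few hundred clock periods one way, in line with the estimate of several hundred clock cycles given above.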
4.1.3. Interconnection
The interconnection between cores happens in very different contexts,
from the public Internet connection between clouds through the
various connections used inside supercomputers down to the
System-on-Chip (SoC) connections. The OS only initiates accessing the
processors; after that, the HW works partly in parallel with the next
action of the OS and with other actions initiating access to other
processors. This period is denoted in Fig. 2 by $T_x$. After the
corresponding signals are generated, they must reach the target
processor, that is, they need some propagation time. PDs are denoted
by $PD_{x0}$ and $PD_{x1}$, corresponding to actions delivering input
data and result, respectively. This propagation time (which of course
occurs in parallel with actions on other processors, but which is a
sequential contribution within the thread) depends strongly on how
the processors are interconnected: this contribution can be
considerable if the distance to travel is large or the message
transfer takes a long time (lengthy messages, signal latency,
handshaking, store and forward operations in networks, etc.).
Although sometimes even in quite large-scale systems like [22]
Ethernet-based internal communication is deployed, it is becoming
accepted that "The idea of using the popular shared bus to implement
the communication medium [in large systems] is no longer acceptable,
mainly due to its high contention." [34].
4.2. Other sources of delay
The internal operation of the processors can also con-
tribute to the issues experienced in the large parallel com-
puting systems.
4.2.1. Internal latency
The instruction execution micro-environment can be quite different
for the different parallelized sequential systems. Here it is
interpreted as the internal non-payload time around executing (a
bunch of) machine instructions, like waiting for the instruction
being processed in the pipeline, the bus being disabled for some
short period, copying data between address spaces, speculating or
predicting.
4.2.2. Accelerators
It is a trivial idea that since the single-processor performance
cannot be increased anymore, some external computing accelerator
(such as Graphics Processing Units (GPUs)) shall be used. However,
because of the SPA the data must be copied to the memory of the
accelerator, and this takes time. This non-payload activity is a kind
of sequential contribution and surely makes the value of
$(1-\alpha_{eff})$ worse. The difference is negligible at a low
number of cores, but in large-scale systems it strongly degrades the
efficiency, see section 6.2.
4.2.3. Complexity
The processors are optimized for single-processor performance. As a
result, they attempt to perform more and more operations in a single
clock cycle, and doing so introduces a limitation for the length of
the clock period itself: "we believed that the ever-increasing
complexity of superscalar processors would have a negative impact
upon their clock rate, eventually leading to a leveling off of the
rate of increase in microprocessor performance" [40].
4.3. The computing paradigm
At the time when the basic operating principles of the computer were
formulated, there was literally only one processor, which led
naturally to using the SPA. By today, due to the development of the
technology, the processor has become a "free resource" [22]. Despite
that, mainly by inertia (the preferred incremental development) and
because up to now the performance could be improved even using SPA
components (and thinking), the SPA is still commonly used when
building large parallel computing systems, although the importance of
"cooperative computing" is already recognized and demonstrated [58].
The stalling of the parallel performance may lead to the need to
introduce the Explicitly Many-Processor Approach (EMPA) [48].
4.3.1. Addressing
One of the major drawbacks of the SPA is its effect on the components
(constructed for SPA systems). Among others, the SPA processors use
SPA memories through SPA buses. This also means that only one single
addressing action can take place at a time, which is why the time
needed for addressing processors increases linearly with the size of
the system. Although this issue can be mitigated by segmenting,
clustering and vectoring, the basic limiting effect is present.
4.3.2. The context switching
All applications must use OS services and some HW facilities to
initiate themselves as well as to access other processors. Because
the operating system works in a different (supervisor) mode, a
considerable amount of time is required for switching context.
Actually, this means virtually "another processor": a different
(extended) Instruction Set Architecture (ISA), and a new set of
processor registers. The processor registers are very useful in
single-processor optimization, but saving and restoring them
considerably increases the internal latency and, what is worse, also
introduces many otherwise unneeded memory operations. This
[Figure 3 a) plot: "Prediction of $R_{Max}^{HPL}$ of Top10 Supercomputers"; horizontal axis: $R_{Peak}$ (exaFLOPS); vertical axis: $R_{Max}$ (exaFLOPS); curves: Summit, Sierra, Taihulight, Tianhe-2, Piz Daint, Trinity, ABCI, SuperMUC-NG, Titan, Sequoia.]
[Figure 3 b) plot: "Development of $R_{Max}^{HPL}$ for Piz Daint Supercomputer"; horizontal axis: $R_{Peak}$ (exaFLOPS); vertical axis: $R_{Max}$ (exaFLOPS); curves: Xeon E5-2690 + NVIDIA Tesla P100 (2018), Xeon E5-2690 + NVIDIA Tesla P100 (2017), Xeon E5-2690 + NVIDIA Tesla P100 (2016), Xeon E5-2670 + NVIDIA K20x (2013), Xeon E5-2670 (2013), Xeon E5-2670 (2012).]
Figure 3: a) Dependence of payload supercomputer performance on the nominal performance for the TOP10 supercomputers
(as of November 2018) in case of utilizing the HPL benchmark. b) The timeline of the development of the payload performance
$R_{Max}$ as documented in the database TOP500. The actual positions are marked by bubbles on the diagram lines.
is usually not a really crucial contribution, but under the extreme
conditions represented by supercomputers (and especially if
single-port memory is used), specialized operating systems must be
used [22], or the calculation must be run in supervisor mode [20], or
every single core must run a lightweight OS [58].
4.3.3. Synchronization
Although not explicitly dealt with here, notice that the data
exchange between the first thread and the other ones also contributes
to the non-parallelizable fraction and typically uses system calls;
for details see [57, 19, 10]. Actually, we may have communicating
serial processes, which does not improve the effective parallelism at
all [1]. Some classes of applications (like artificial neural
networks) need intensive and frequent data exchange, and in addition,
because of the "time grid" they need to use to coordinate the
operation of the "neurons" they use, they overload the network with
bursts of messages.
5. Supercomputer case studies
From the sections above it can be concluded that parallelized
sequential computing systems have some upper limit on their payload
performance (the nominal performance can of course be increased
without limitation, but the efficiency decreases proportionally). In
this section some case studies on supercomputer implementations are
presented, utilizing only public information. Examples of deploying
the formalism in other fields of parallelized computing are given in
[54].
5.1. Taihulight (Sunway)
In parallelized sequential computing systems implemented in the
SPA [3], life begins in one such sequential subsystem. In large
parallelized applications running on general-purpose supercomputers,
only one thread exists initially and finally, i.e. the minimal,
absolutely necessary non-parallelizable activity is to fork the other
threads and to join them again. With the present technology, no such
action can be shorter than one processor clock period¹¹. That is, the
absolute minimum value of the non-parallelizable fraction is given
as the ratio of the time of the two clock periods to the total
execution time. The latter time is a free parameter in describing the
efficiency, i.e. the value of the effective parallelization
$\alpha_{eff}$ also depends on the total benchmarking time (and so
does the achievable parallelization gain).
This dependence is of course well known to supercomputer scientists:
for measuring the efficiency with better accuracy (and also for
producing better $\alpha_{eff}$ values), execution times of hours are
used in practice. In the case of benchmarking Taihulight [12], a
benchmark runtime of 13,298 seconds was used; on the 1.45 GHz
processors this means $2 \times 10^{13}$ clock periods. The inherent
limit of $(1-\alpha_{eff})$ at such a benchmarking time is $10^{-13}$
(or, equivalently, the achievable performance gain is $10^{13}$). If
the fork/join is executed by the OS as usual, then because of the
required context switches $2 \times 10^{4}$ [43] clock cycles are
needed rather than the 2 clock cycles considered in the idealistic
case, i.e. the derived values differ correspondingly by 4 orders of
magnitude; that is, the performance gain cannot be above $10^{9}$.
For the development of the achieved performance gain and the values
for the top supercomputers, see Fig. 8. In the following, for
simplicity, 1.00 GHz processors (i.e. 1 ns clock cycle time) will be
assumed.
¹¹ Taking these two clock periods as an ideal (but not realistic) case, the actual limitation will surely be (thousands of times) worse than the one calculated for this idealistic case. The actual number of clock periods depends on many factors, as discussed below.
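The arithmetic of this estimate can be reproduced with a few lines. The sketch below uses only the numbers quoted above (13,298 s runtime, 1.45 GHz clock, 2 or $2 \times 10^{4}$ non-parallelizable clock cycles); the function name is an illustrative assumption, and the gain bound $1/(1-\alpha_{eff})$ is the usual Amdahl limit for an unlimited number of cores.

```python
def min_nonparallel_fraction(overhead_cycles: float,
                             runtime_s: float,
                             clock_hz: float) -> float:
    """Ratio of the non-parallelizable clock cycles to all cycles of the run."""
    total_cycles = runtime_s * clock_hz
    return overhead_cycles / total_cycles


RUNTIME_S = 13_298.0  # benchmark runtime quoted for Taihulight
CLOCK_HZ = 1.45e9     # processor clock quoted for Taihulight

ideal = min_nonparallel_fraction(2, RUNTIME_S, CLOCK_HZ)       # fork + join only
with_os = min_nonparallel_fraction(2e4, RUNTIME_S, CLOCK_HZ)   # with context switches

for label, fraction in (("ideal fork/join", ideal), ("fork/join via OS", with_os)):
    print(f"{label}: (1 - alpha) >= {fraction:.1e}, gain <= {1.0 / fraction:.1e}")
```

The same function also shows the role of the benchmarking time: halving the runtime doubles the lower bound of $(1-\alpha_{eff})$.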
Supercomputers are also distributed systems. In a stadium-sized
supercomputer, a distance between the processors (cable length) of
about 100 m can be assumed. The net signal round trip time is ca.
$10^{-6}$ seconds, or $10^{3}$ clock periods, i.e. in the case of a
finite-sized supercomputer the performance gain cannot be above
$10^{10}$ (or $10^{6}$ if context switching is also needed). The
presently available network interfaces have latency times of
100...200 ns, and sending a message between processors takes a time
of the same order of magnitude. Since the signal propagation time is
longer than the latency of the network, this also means that the
interconnection is not really the bottleneck in enhancing computing
performance. This statement is also underpinned by statistical
considerations [47].
Taking the (maybe optimistic) value of $2 \times 10^{3}$ clock
periods for the signal propagation time, the value of
$(1-\alpha_{eff})$ will be at best in the range of $10^{-10}$, only
because of the physical size of the supercomputer. This also means
that the expectations of the absolute performance of supercomputers
are excessive: assuming a 100 Gflop/s processor and a realistic
physical size, no operating system and no non-parallelizable code
fraction, the achievable absolute nominal performance (see Eq. (5))
is $10^{11} \times 10^{10}$ flop/s, i.e. 1000 EFlops. To implement
this, around $10^{9}$ processors are required. One can assume that
the value of $(1-\alpha_{eff})$ will be¹² around $10^{-7}$. With
those very optimistic assumptions (see Eq. (4)), the payload
performance for the benchmark HPL will be less than 10 EFlops, and
for real-life applications of the class of the benchmark HPCG it will
surely be below 0.01 EFlops, i.e. lower than the payload performance
of the present TOP1-3 supercomputers.
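The HPL estimate above can be traced with a short worked equation. It is a sketch that assumes Eq. (4) has the usual Amdahl-type form for the efficiency $E$ and takes $N = 10^{9}$, $(1-\alpha_{eff}) = 10^{-7}$ and $R_{Peak} = 1000$ EFlops from the text:

```latex
E = \frac{1}{N\,(1-\alpha_{eff}) + \alpha_{eff}}
  \approx \frac{1}{10^{9}\cdot 10^{-7} + 1}
  = \frac{1}{101} \approx 10^{-2},
\qquad
R_{Max} = E \cdot R_{Peak}
  \approx \frac{1000\ \mathrm{EFlops}}{101}
  \approx 9.9\ \mathrm{EFlops}
```

Under these assumptions about 99% of the nominal performance does not appear as payload performance.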
These predictions make it possible to assume that the presently
achieved value of $(1-\alpha_{eff})$ persists also for roughly a
hundred times more cores. However, another major issue arises from
the computing principle SPA: on an SPA bus only one core at a time
can be addressed. As a consequence, at least as many clock cycles
must be used for organizing the parallel work as there are addressing
steps required. Basically, this number equals the number of cores in
the supercomputer, i.e. the addressing in the TOP10 positions
typically needs clock cycles in the order of
$5 \times 10^{5} \ldots 10^{7}$, degrading the value of
$(1-\alpha_{eff})$ into the range $10^{-6} \ldots 2 \times 10^{-5}$.
The number of addressing steps can be mitigated using clustering,
vectoring, etc., or, at the other end, the processor itself can take
over the responsibility of addressing its cores [58]. Depending on
the actual construction, the reducing factor of clustering of those
types can be in the range $10^{1} \ldots 5 \times 10^{3}$, i.e. the
resulting value of $(1-\alpha_{eff})$ is expected to be around
$10^{-7}$. Notice that utilizing "cooperative computing" [58] further
improves the value of $(1-\alpha_{eff})$, but it already means
utilizing a (slightly) different computing paradigm: the cores have a
direct connection and can communicate without involving the main
memory.
¹² With the present technology the best achievable value is ca. $10^{-6}$, which was successfully improved by clustering to ca. $2 \times 10^{-7}$ for Summit and Sierra, and the special cooperating cores of Taihulight made it possible to achieve $3 \times 10^{-8}$.
An operating system must also be used, for protection and
convenience. If one considers context switching with its consumed
$2 \times 10^{4}$ cycles [43], the absolute limit is ca.
$5 \times 10^{-8}$, on a zero-sized supercomputer. This value is
somewhat better than the limiting value derived above, but it is
close to that value and surely represents a considerable
contribution. This is why Taihulight runs the actual computations in
kernel mode [58].
5.2. Sum of the non-parallelizable contributions
Notice the special role of the non-parallelizable activities:
independently of their origin, they are summed up as a
'sequential-only' contribution and considerably degrade the payload
performance. In systems comprising parallelized sequential processes,
actions like communication (including also MPI), synchronization,
accessing shared resources, etc. [1, 10, 19, 57] all contribute to
the sequential-only part. Their effect becomes more and more drastic
as the number of processors increases. One must take care, however,
how the communication is implemented. A nice example in [4] shows how
direct core-to-core (in other words: direct thread-to-thread)
communication can enhance parallelism in large-scale systems.
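How the contributions add up, and which of them dominates, can be sketched as follows. The cycle counts are order-of-magnitude values quoted in Section 5.1; combining them in one hypothetical system (13,298 s runtime at 1 GHz, $10^{7}$ cores, clustering factor $10^{2}$) is an assumption made only for illustration.

```python
TOTAL_CYCLES = 13_298.0 * 1e9  # assumed runtime of 13,298 s at a 1 GHz clock

# Non-parallelizable contributions in clock cycles (illustrative mix).
contribution_cycles = {
    "fork/join (ideal)": 2.0,
    "signal propagation": 2e3,
    "context switching": 2e4,
    "addressing (1e7 cores, clustering factor 1e2)": 1e7 / 1e2,
}

# Time-equivalent units: every contribution becomes a fraction of the run time.
fractions = {name: cycles / TOTAL_CYCLES
             for name, cycles in contribution_cycles.items()}
total_fraction = sum(fractions.values())
dominant = max(fractions, key=lambda name: fractions[name])

for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1e}")
print(f"sum of (1 - alpha) contributions: {total_fraction:.1e}")
print(f"dominant contribution: {dominant}")
```

Because the contributions are simply summed, the largest one sets the order of magnitude of the total; with other input values a different term takes over the dominance.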
5.3. Competition of the $\alpha_{eff}^{X}$ contributions for dominance
As discussed above, the different contributions to
$(1-\alpha_{eff})$ depend on different factors, so their ranking in
affecting the value of $R_{Max}$ changes with the nominal performance
and with how the system is assembled from SPA processors. Fig. 4
attempts to give a feeling for the effect of the software
contribution. A fictive supercomputer (with behavior somewhat similar
to that of the supercomputer Taihulight) is modeled. All subfigures
have dual scaling. The blue diagram line refers to the right-hand
scale and shows the payload performance corresponding to the actual
$\alpha_{eff}$ contributions; all the rest refer to the left-hand
scale and display the $(1-\alpha_{eff}^{XX})$ contributions (for the
details see [49]) to the non-parallelizable fraction. The turn-back
of the $(1-\alpha_{eff})$ diagram clearly shows the presence of the
"performance wall" (compare it to Fig. 1 in [41]).
For the sake of simplicity, only those components are depicted that
have some role in forming the $(1-\alpha_{eff})$ value. In some other
special cases other contributions may dominate. For example, as
presented in [52], in the case of brain simulation a hidden clock
signal is introduced, and its effect is in close competition with the
effect of the frequent context switches for dominating the achievable
performance. Notice that the performance breakdown shown in the
figures was experimentally measured by [41], [28] (Fig. 7) and [2]
(Fig. 8).
[Figure 4 plots: left panel for HPL, right panel for HPCG; horizontal axis: $R_{Peak}$ (Eflop/s); left vertical axis: $\alpha_{eff}^{HPL}$ and $\alpha_{eff}^{HPCG}$ contributions, respectively; right vertical axis: $R_{Max}^{HPL}$ and $R_{Max}^{HPCG}$ (Eflop/s), respectively; curves: $\alpha^{SW}$, $\alpha^{OS}$, $\alpha_{eff}$, $R_{Max}$ (Eflop/s).]
Figure 4: Contributions $(1-\alpha_{eff}^{X})$ to $(1-\alpha_{eff}^{total})$ and max payload performance $R_{Max}$ of a fictive supercomputer ($P = 1$ Gflop/s
@ 1 GHz) as a function of the nominal performance. The blue diagram line refers to the right-hand scale ($R_{Max}$ values), all others
($(1-\alpha_{eff}^{X})$ contributions) to the left-hand scale. The left subfigure illustrates the behavior measured with the benchmark HPL. The looping
contribution becomes remarkable around 0.1 Eflops and breaks down the payload performance when approaching 1 Eflops. In the
right subfigure the behavior measured with the benchmark HPCG is displayed. In this case the contribution of the application (thin
brown line) is much higher; the looping contribution (thin green line) is the same as above. As a consequence, the achievable
payload performance is lower and the breakdown of the performance is also softer.
5.4. The future of supercomputing
Because of all the above, in the name of the company PEZY¹³ the last
two letters are surely obsolete. Also, no Zettaflops supercomputers
will be delivered for science and military [14].
Experts expect the performance¹⁴ to reach the magic 1 Eflop/s around
the year 2020, see Fig. 1 in [33], although question marks, mystic
events and communications have already appeared as the date
approaches. The authors noticed that "the performance increase of the
No. 1 systems slowed down around 2013, and it was the same for the
sum performance", but they extrapolate linearly and expect that the
development continues and that "zettascale computing" (i.e. $10^{4}$
times more than the present performance) will be achieved in just
more than a decade. Although they address a series of important
questions, the question of whether building computers of such size is
feasible remains out of their sight.
From TOP500 data, as a prediction, $R_{Max}$ values as a function of
$R_{Peak}$ can be calculated, see Fig. 3 a). The reported (measured)
performance values are marked by bubbles in the figure. When making
that prediction, the number of processors was virtually changed for
the different configurations, without correcting for the increasing
looping delay; i.e. the graphs are strongly optimistic, see also
Fig. 4. As expected, the $R_{Max}$ values (calculated in this
optimistic way) saturate around 0.35 Eflop/s. Without some
breakthrough in the technology and/or paradigm, even approaching the
"dream limit" is not possible.
¹³ https://en.wikipedia.org/wiki/PEZY_Computing: The name PEZY is an acronym derived from the Greek-derived metric prefixes Peta, Exa, Zetta, Yotta.
¹⁴ There are some doubts about the definition of exaFLOPS: whether it means $R_{Peak}$ or $R_{Max}$, in the former case whether it includes accelerator cores, and in the latter case by which benchmark it is measured. Here the term is used as $R_{Max}^{HPL}$.
5.5. Piz Daint
Due to the quick development of the technology, supercomputers
usually do not have many items registered in the database TOP500 on
their development. One of the rare exceptions is the supercomputer
Piz Daint. Its development history spans 6 years and two orders of
magnitude in performance, and it used both non-accelerated computing
and accelerated computing with two different accelerators. Although
usually more than one of its parameters was changed between the
registered stages of its development, it nicely underpins the
statements of the paper.
Fig. 3 b) displays how the payload performance as a function of the
nominal performance has developed in the case of Piz Daint (see also
Fig. 2 in [52]). The bubbles display the measured performance values
documented in the database TOP500 [42], and the diagram lines show
the performance predicted at that stage. As the diagram lines show
the "predicted performance", the accuracy of the prediction can also
be estimated from the data measured in the next stage. It is very
accurate for short distances, and the jumps can be qualitatively
understood when the reason is known. In the previous section the
accuracy of the predictions based on the model has been left open.
This figure also validates the prediction for the TOP10
supercomputers depicted in Fig. 3 a).
The data from the first two years of Piz Daint (non-accelerated mode
of operation) can be compared directly. Increasing the number of
cores results in the expected higher performance, as the working
point is still in the linear region of the efficiency surface. The
value slightly above the predicted one can be attributed to the
fine-tuning of the architecture.
Introducing accelerators resulted in a jump of payload efficiency
(and also moved the working point to the slightly non-linear region,
see Fig. 5), and the payload performance
[Figure 5 plot: "Dependence of $E_{HPL}$ on $(1-\alpha_{eff})$ and $N$"; axes: number of cores ($10^{4}$ to $10^{7}$), $(1-\alpha_{eff})$ ($10^{-7}$ to $10^{-5}$), efficiency (0.2 to 1); data points for Piz Daint: 2012/11, 2013/06, 2013/11, 2016/11, 2017/06, 2018/11.]
Figure 5: The efficiency values of the supercomputer
"Piz Daint" on the two-dimensional efficiency surface in the
different stages of its building; calculated from the publicly
available database [42]
is roughly 3 times higher than would be expected purely from the
predicted value calculated for the non-accelerated architecture.
According to the general experience [31], only a small fraction of
the computing power hidden in the GPU can be turned into payload
performance, and the efficiency is only about 3 times higher than it
would be without accelerators.
The designers may not have been satisfied with the accelerator, so
they changed to another one, with a slightly higher nominal
performance but a much larger separate memory space. The result was
disappointing: the slight increase of the nominal performance of the
GPU could not counterbalance the increased time needed to copy
between the separate, larger address spaces, and finally resulted in
a breakdown of both the value of $(1-\alpha_{eff})$ and the
efficiency, although the payload performance slightly increased.
Introducing the GPU accelerator increases the absolute performance,
but (through introducing the extra non-parallelizable component of
copying the data) increases the value of $(1-\alpha_{eff})$ and
decreases the efficiency; for a discussion see section 6.2. The more
data are to be copied, the more considerable the decrease is. Again,
the fine-tuning has helped both the efficiency and
$(1-\alpha_{eff})$ to reach a better value.
5.6. Gyoukou
A nice "experimental proof" for the existence of the performance
limit is the one-time appearance of the supercomputer Gyoukou on the
TOP500 list in Nov. 2017. It participated in the competition using
2.5M cores (out of the 20M available), and its $(1-\alpha_{eff})$
value was $1.9 \times 10^{-7}$, comparable with the data of Summit
(2.4M and $1.7 \times 10^{-7}$). Simply, the performance bound did
not allow the payload performance to be increased further.
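The effect can be made tangible with a small sketch. It assumes the Amdahl-type form of Eq. (4) and, optimistically, that the quoted $(1-\alpha_{eff}) = 1.9 \times 10^{-7}$ would stay unchanged if all 20M cores were used; both the helper names and this constancy are assumptions.

```python
def efficiency(n_cores: float, one_minus_alpha: float) -> float:
    """Amdahl-type efficiency (assumed form of Eq. (4))."""
    return 1.0 / (n_cores * one_minus_alpha + (1.0 - one_minus_alpha))


def payload_gain(n_cores: float, one_minus_alpha: float) -> float:
    """Performance gain relative to a single core: N * E."""
    return n_cores * efficiency(n_cores, one_minus_alpha)


ONE_MINUS_ALPHA = 1.9e-7              # value quoted for Gyoukou
USED_CORES, ALL_CORES = 2.5e6, 20e6   # cores used vs. cores available

gain_used = payload_gain(USED_CORES, ONE_MINUS_ALPHA)
gain_all = payload_gain(ALL_CORES, ONE_MINUS_ALPHA)  # hypothetical full configuration

print(f"E with 2.5M cores: {efficiency(USED_CORES, ONE_MINUS_ALPHA):.2f}")
print(f"E with 20M cores:  {efficiency(ALL_CORES, ONE_MINUS_ALPHA):.2f}")
print(f"8x more cores would give only {gain_all / gain_used:.1f}x more payload")
```

Even under this optimistic assumption, eight times more cores would yield only about two and a half times more payload performance, at an efficiency of roughly 0.2.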
5.7. Brain simulation
Artificial intelligence (including simulating the brain operation
using computing devices) attracts exponentially growing interest, and
the size of such systems is also continuously growing. In the case of
brain simulation the "flagship goal" is to simulate tens of billions
of neurons, corresponding to the capacity of the human brain. Those
definitely huge systems really go to the extremes, but they are also
subject to the common limitation of large-scale parallelized
sequential systems. Recent studies have shown that using the present
methods (paradigm and technology) the behavioral level of the brain
is simply out of reach of the research [2].
It was shown recently [52] that the special method of simulating
artificial neural networks, using a "time grid", causes a breakdown
at relatively low computing performance, and that under those special
conditions the frequent context switches and the permanent need for
synchronization compete for dominating the performance of the
application.
6. Statistical underpinning
By now, supercomputing has a quarter of a century of history and a
well-documented and rigorously verified database [42] of
architectural and performance data. The huge variety of solutions and
ideas does not make it easy to draw conclusions, and especially to
make forecasts for the future of supercomputing. The large number of
available data, however, makes it possible to draw reliable general
conclusions about some features of parallelized sequential computing
systems. Those conclusions of course have only statistical validity,
because of the variety of sources of components, the different
technologies and ideas, as well as the interplay of many factors.
That is, the result shows considerable scattering and requires an
extremely careful analysis.
6.1. Correlation between the number of cores and
the achieved rank
Since the resulting performance (and so the ranking) depends both on
the number of processors and on the effective parallelization, those
quantities are correlated in Fig. 6. As expected, in the TOP50
supercomputers the higher the ranking position is, the higher is the
required number of processors in the configuration, and as outlined
above, the more processors, the lower the $(1-\alpha_{eff})$ that is
required (provided that the same efficiency is targeted).
In TOP10, the slope of the regression line in the left subfigure
changes sharply relative to the TOP50 regression line, showing the
strong competition for a better ranking position. Maybe the value of
the slope can provide the "cut line" between "racing supercomputers"
and "commodity supercomputers". In the right subfigure, the TOP10
data points provide the same slope as the TOP50 data points,
demonstrating that to produce a reasonable efficiency, the increasing
number of cores must be accompanied by a proper decrease in the value
of $(1-\alpha_{eff})$, as expected from Eq. (4). Furthermore, to
achieve a good ranking, a good value of $(1-\alpha_{eff})$
[Figure 6 a) plot: horizontal axis: ranking by HPL (0 to 50); vertical axis: number of processors/1e6; series: data points, regression TOP50, regression TOP10.]
[Figure 6 b) plot: horizontal axis: number of processors/1e6; vertical axis: $(1-\alpha_{eff})$ by HPL; series: data points, regression TOP50, regression TOP10.]
Figure 6: a) The correlation of the TOP50 supercomputer ranking with the number of the cores b) The correlation of the TOP50
supercomputer ranking with the value of $(1-\alpha_{eff})$
[Figure 7 a) plot: horizontal axis: ranking by HPL (0 to 50); vertical axis: processor performance (Gflop/s); series: accelerated, non-accelerated, GPU-accelerated, each with its regression line.]
[Figure 7 b) plot: horizontal axis: ranking by HPL (0 to 50); vertical axis: performance amplification factor; series: accelerated, non-accelerated, GPU-accelerated, each with its regression line.]
Figure 7: a) The correlation of the single-processor performance of the TOP50 supercomputers (having processors with and
without accelerator) with ranking, in 2017. b) The correlation of the performance gain of the TOP50 supercomputers (having
processors with and without accelerator) with ranking, in 2017.
must be provided. Recall that the excellent performance of
TaihuLight should be attributed to its special processor, which
deploys "cooperative computing" [58].
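The relation between core count and (1 − α_eff) used in this argument can be made explicit with a short worked derivation. It is only a sketch: it assumes that Equ. (4) has the textbook Amdahl form for the efficiency, and the symbols E (efficiency) and N (number of processors) are notation chosen here for illustration.

```latex
\begin{align}
E(N,\alpha) &= \frac{S}{N} = \frac{1}{N\,(1-\alpha) + \alpha}
  && \text{(assumed form of Equ. (4))} \\
\frac{1}{E} &= N\,(1-\alpha) + 1 - (1-\alpha) = 1 + (N-1)\,(1-\alpha) \\
(1-\alpha) &= \frac{1/E - 1}{N-1} = \frac{1-E}{E\,(N-1)}
\end{align}
```

Under this assumption, keeping the efficiency E fixed requires (1 − α) to shrink roughly in proportion to 1/N, which on a log-log plot such as the right panel of Fig. 6 would appear as a line of slope close to −1.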
6.2. Deploying accelerators (GPUs)
As suggested by Eq. (5), the trivial way to increase the absolute
performance of a supercomputer is to increase the single-processor
performance of its processors. Since single-processor performance has
reached its limits, some kind of accelerator (mostly a General-Purpose
Graphics Processing Unit (GPGPU)) is frequently used for this goal. Fig. 7
shows how utilizing accelerators influences the ranking of supercomputers.
It displays the two important factors of supercomputers, the
single-processor performance and the parallelization efficiency, as a
function of ranking.
As the left side of the figure depicts, the coprocessor-accelerated
cores show the lowest performance; they really can benefit from
acceleration.¹⁵ GPGPU acceleration does increase the performance of the
processors, by a factor of 2-3. This result confirms a former study,
in which an average factor of 2.5 was found [31].
However, this increased performance is about 40 to 70 times
lower than the nominal performance of the GPGPU accelerator.
The effect is attributed to the considerable overhead [9],
and it was demonstrated that improving the transfer performance
can considerably enhance the application performance.
Indirectly, that research also proved that the operating
principle itself (i.e. that the data must be transferred to
and from the GPU memory; and recall that GPUs do not have
cache memory) takes some extra time. In terms of Amdahl’s
law, this transfer time contributes to the non-parallelizable
¹⁵ The number of coprocessors is included in the total number of cores.
[Figure 8 plots. Left, "Supercomputers, Top 500 1st-3rd": (1 − α) (logarithmic, 10^-8 to 10^-2) versus year (1994 to 2018); series 1st, 2nd, 3rd, best α and the trend of (1 − α), with Sunway TaihuLight marked. Right, "The roofline of performance gain of supercomputers": performance gain (logarithmic, 10^2 to 10^8) versus year (1994 to 2020); series 1st, 2nd and 3rd by R_Max (HPL), best by α_eff (HPL), and 1st, 2nd, 3rd and best by R_Max (HPCG).]
Figure 8: a) The trend of the development of (1 − α) in the past 25 years, based on the first three (by R_max) and the first (by
(1 − α)) in the year in question. b) The performance gain of supercomputers modeled as "roofline" [56], as measured with the
benchmarks HPL and HPCG, and the one concluded for brain simulation.
fraction, i.e. it increases (1 − α_eff) and thus decreases the achievable performance gain.
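The effect of the transfer time can be illustrated with a toy model. The sketch below is not the method of [9] or [31]; the function name `offload_speedup` and all numbers are assumptions chosen for illustration. It treats the transfer to and from the accelerator memory as a purely sequential cost added to the accelerated compute time.

```python
def offload_speedup(t_compute: float, t_transfer: float, nominal_gain: float) -> float:
    """Effective speedup of offloading a kernel to an accelerator (toy model).

    t_compute:    time the kernel takes on the host processor
    t_transfer:   time to move data to and from the accelerator memory
    nominal_gain: how many times faster the accelerator executes the kernel
    """
    return t_compute / (t_transfer + t_compute / nominal_gain)


# Illustrative (assumed) numbers only: a nominally 100x faster accelerator
# and a transfer cost given as a share of the host compute time.
for transfer_share in (0.0, 0.1, 0.4):
    speedup = offload_speedup(1.0, transfer_share, 100.0)
    print(f"transfer = {transfer_share:.0%} of compute time -> speedup {speedup:.1f}x")
```

In this model, a transfer cost of a few tenths of the host compute time is already enough to bring a nominally 100-fold accelerator down to a single-digit effective gain.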
The right side of the figure reveals this effect. The
effective parallelization of the GPU-accelerated systems is
nearly ten times worse than that of the coprocessor-accelerated
systems and about 5 times worse than that of the non-accelerated
systems, i.e. the resulting efficiency is worse than with
unaccelerated processors; this is a definite disadvantage when
GPUs are used in systems with an extremely large number of
processors.
The key to this enigma is hidden in Eq. (4): the payload
performance increases by a factor of nearly 3, but the value of
(1 − α_eff), which increases by nearly an order of magnitude, is
multiplied by the number of cores in the system. In other words:
while deploying GPGPU-accelerated cores is advantageous in
systems having a few thousand cores, in supercomputers having
processors in the range of millions it is a rather expensive way
to make supercomputer performance worse. This makes it at least
questionable whether it is worth utilizing GPGPUs in large-scale
supercomputers. For a discussion see section 5.5; for a direct
experimental proof see Figs. 3 and 5.
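The trade-off described above can be sketched numerically. The example assumes an Amdahl-type payload performance P = N · P_single / (N · (1 − α) + α); whether this matches Eq. (4) and Eq. (5) term by term is an assumption, and the constants `BASE_GAP`, `GPU_GAP` and `GPU_BOOST` are illustrative values, not measured data.

```python
def payload_performance(n_cores: int, p_single: float, one_minus_alpha: float) -> float:
    """Amdahl-type payload performance of n_cores processors (assumed form)."""
    alpha = 1.0 - one_minus_alpha
    return n_cores * p_single / (n_cores * one_minus_alpha + alpha)


# Illustrative assumptions, not measured values:
BASE_GAP = 1e-6          # (1 - alpha_eff) of a non-accelerated system
GPU_GAP = 10 * BASE_GAP  # "nearly an order of magnitude" worse
GPU_BOOST = 3.0          # single-processor gain "of nearly 3"

for n in (1_000, 10_000, 100_000, 1_000_000, 10_000_000):
    plain = payload_performance(n, 1.0, BASE_GAP)
    gpu = payload_performance(n, GPU_BOOST, GPU_GAP)
    print(f"{n:>10,d} cores: GPU/plain payload ratio = {gpu / plain:.2f}")
```

With these assumed values the accelerated system keeps almost its full threefold advantage at a few thousand cores, while the ratio drops below 1 at a few hundred thousand cores; the exact crossover depends entirely on the chosen constants.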
As the left side of the figure shows, for neither kind of
acceleration does the ranking of a supercomputer correlate with
the type of acceleration. The right side of the figure confirms
essentially the same: the performance amplification rises with
a better ranking position, and the slope is higher for any kind
of acceleration, since moving data from one memory to another
takes time.
6.3. The supercomputer timeline and the "roofline" model
As a quick test, Equ. (7) can be applied to data from [42],
see Fig. 8 a). As shown, supercomputer history is about the
development of the effective parallelism, and Amdahl’s law as
formulated by Equ. (7) is for effective parallelism what Moore’s
law is for the size of electronic components.¹⁶ (The effect of
Moore’s law is eliminated when calculating R_Max/R_Peak.) To
understand the behavior of the trend line, just recall Equ. (4):
to increase the absolute performance, more processors must be
included, and to provide a reasonable efficiency, the value
of (1 − α) must be properly reduced.
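As a minimal sketch of such a quick test, the snippet below derives (1 − α_eff) from a measured efficiency R_Max/R_Peak and a processor count. It assumes that Equ. (7) is equivalent to inverting the textbook Amdahl efficiency E = 1 / (N · (1 − α) + α); the function name `one_minus_alpha_eff` is hypothetical, and the input figures are the values commonly quoted for Sunway TaihuLight, which should be checked against [42] before reuse.

```python
def one_minus_alpha_eff(r_max: float, r_peak: float, n_procs: int) -> float:
    """(1 - alpha_eff) from the measured efficiency E = r_max / r_peak.

    Inverts E = 1 / (N * (1 - alpha) + alpha), the textbook Amdahl form
    that Equ. (7) is assumed to be equivalent to.
    """
    efficiency = r_max / r_peak
    return (1.0 - efficiency) / (efficiency * (n_procs - 1))


# Approximate, commonly quoted TOP500 figures for Sunway TaihuLight;
# treat them as illustrative and verify them against the list itself.
R_MAX_PFLOPS = 93.01
R_PEAK_PFLOPS = 125.44
N_CORES = 10_649_600

gap = one_minus_alpha_eff(R_MAX_PFLOPS, R_PEAK_PFLOPS, N_CORES)
print(f"(1 - alpha_eff) ~ {gap:.1e}")
```

Because only the ratio R_Max/R_Peak enters, the units of the two performance values cancel, which is also why the effect of Moore’s law drops out of the result.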
The "roofline" model [56] is successful in fields where
some resource limits the maximum performance resulting
from the interplay of the other components. Here the limiting
resource is the performance gain (originating from Amdahl’s
law, the technical implementation and the computing paradigm
together), which limits the engineering solutions. Their
interplay may result in more or less perfect performance
gains, but no combination can exceed that absolute limit.
The two roofline levels displayed in Fig. 8 b)
correspond to the values measurable with the benchmarks
HPL and HPCG, respectively. The latter benchmark has a
documented history of only three items, but the data are
convincing enough to set a reliable roofline level. Here the
contribution of the SW dominates (the "real-life" application
class), so the architectural solution makes no big difference;
see also section 3.3.
The third roofline level (see section 5.7) is inferred from
the single available measured data set [2]. The two smaller black
dots show the performance data of the full configuration (as
measured by the benchmark HPCG) of the two supercomputers
the authors had access to, and the red dot denotes the
saturation value they experienced (and so they did not deploy
more hardware). This strongly supports the assumption that
¹⁶ They are also connected in that their validity terminates in the present
technological epoch.
the achievable supercomputer performance depends on the
type of the application. The roof level concluded from the
performance gain measured by the benchmark HPL reflects only
the effect of organizing the joint work plus the fork/join
operation.
The roof level concluded from the HPCG benchmark is about
two orders of magnitude lower because of the increased amount
of communication needed for the iteration. In the case of
brain simulation, the top level of the performance gain
is two more orders of magnitude lower because of the need
for more intensive communication (many other neurons must
also be periodically informed of the result of the neural
calculation). In the case of AI networks, the intensity of
communication is between the last two (depending on the type
and size of the network), so the achievable performance gain
must correspondingly also reside between the last two roof
levels.
The figure demonstrates how the "communication-to-computation
ratio" introduced by [41] affects the achievable performance
gain. Notice that the achieved performance gain (= speedup)
of brain simulation is about 10^3 and that, based on the amount
of communication, Artificial Neural Networks cannot show
a much higher performance gain either. The bottleneck is
not the performance of the floating-point operations but rather
the need for communication and the exceptionally high
"communication-to-computation ratio". Without the need to
organize the joint work there would be no roof level at all:
adding more cores would increase the performance gain with a
constant slope. With a non-zero non-parallelizable contribution
the roof level appears, and the higher that contribution is, the
lower the roof level (or, in other words: the higher the
non-parallelizable contribution, the lower the nominal
performance at which the roofline effect appears; in the figure
this is expressed in years).
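How a non-zero non-parallelizable contribution produces a roof level can be sketched with the textbook form of Amdahl's law. The values of (1 − α) below are assumptions, spaced two orders of magnitude apart to mimic the three roof levels discussed in the text; they are not fitted to the data of Fig. 8, and the workload labels are hypothetical.

```python
def performance_gain(n_procs: int, one_minus_alpha: float) -> float:
    """Amdahl speedup S = 1 / ((1 - alpha) + alpha / N)."""
    alpha = 1.0 - one_minus_alpha
    return 1.0 / (one_minus_alpha + alpha / n_procs)


def roof_level(one_minus_alpha: float) -> float:
    """Limit of the gain as N grows without bound: 1 / (1 - alpha)."""
    return 1.0 / one_minus_alpha


# Illustrative (assumed) non-parallelizable contributions.
WORKLOADS = {"HPL-like": 1e-7, "HPCG-like": 1e-5, "brain-simulation-like": 1e-3}

for name, gap in WORKLOADS.items():
    gains = [performance_gain(n, gap) for n in (10**3, 10**5, 10**7, 10**9)]
    shown = ", ".join(f"{g:.3g}" for g in gains)
    print(f"{name:>22}: gains {shown}; roof {roof_level(gap):.0e}")
```

For (1 − α) = 0 the gain equals the number of processors and no roof exists; for any positive value the gain flattens once N becomes comparable to 1 / (1 − α), so a larger non-parallelizable contribution means the roof is reached at a smaller system size.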
The results of the benchmark HPL show considerable
scatter, and some points are even above the roofline. This
benchmark is sensitive to architectural solutions such as
clustering (internal or external), accelerators, absolute
performance, etc., since here no component dominates ab ovo.
Because of this, some "relaxation time" was and will be needed
until the right combination, resulting in a performance gain
approaching the roofline, was or will be found.
The three points above the roofline belong to the same
supercomputer, TaihuLight. Its "cooperating processors" [58]
work using a slightly different computing paradigm: they
slightly violate the principles of the SPA, which is applied
by all the other supercomputers. Because of this, its
performance gain is limited by a slightly different roofline:
changing the computing paradigm (or the principles of
implementation) changes the rules of the game. This also hints
at a possible way out: the computing paradigm must be modified
[50, 48] in order to introduce a higher roofline level.
7. Summary
The payload performance of parallelized sequential computing
systems has been analyzed both theoretically and using the
supercomputer database with well-documented performance
values. It was shown that both the (strongly simplified)
theoretical description and the empirical trend show a
limitation of the payload performance of large-scale
parallelized computing, at the same value. The difficulties
experienced in building ever-larger supercomputers, and
especially in running artificial intelligence applications on
supercomputers or building brain simulators from SPA computer
components, convincingly prove that present supercomputing
has achieved what the computing paradigm and implementation
technology enable. To step further to the next level [21], a
real rebooting is required, among others renewing computing
[50] and introducing a new computing paradigm [46].
The "performance wall" [55] has been hit.
References
[1] Abdallah, A.E., Jones, C., Sanders, J.W. (Eds.), 2005. Communicat-
ing Sequential Processes. The First 25 Years. Springer. doi:10.1007/
b136154.
[2] van Albada, S.J., Rowley, A.G., Senk, J., Hopkins, M., Schmidt, M.,
Stokes, A.B., Lester, D.R., Diesmann, M., Furber, S.B., 2018.
Performance Comparison of the Digital Neuromorphic Hardware
SpiNNaker and the Neural Network Simulation Software NEST for a
Full-Scale Cortical Microcircuit Model. Frontiers in Neuroscience 12, 291.
[3] Amdahl, G.M., 1967. Validity of the Single Processor Approach to
Achieving Large-Scale Computing Capabilities, in: AFIPS Confer-
ence Proceedings, pp. 483–485. doi:10.1145/1465482.1465560.
[4] Ao, Y., Yang, C., Liu, F., Yin, W., Jiang, L., Sun, Q., 2018. Perfor-
mance Optimization of the HPCG Benchmark on the Sunway Taihu-
Light Supercomputer. ACM Trans. Archit. Code Optim. 15, 11:1–
11:20.
[5] Asanovic, K., Bodik, R., Demmel, J., Keaveny, T., Keutzer, K., Kubi-
atowicz, J., Morgan, N., Patterson, D., Sen, K., Wawrzynek, J., Wes-
sel, D., Yelick, K., 2009. A View of the Parallel Computing Land-
scape. Communications of the ACM 52, 56–67.
[6] Bell, G., Bailey, D.H., Dongarra, J., Karp, A.H., Walsh, K., 2017. A
look back on 30 years of the Gordon Bell Prize. The International
Journal of High Performance Computing Applications 31, 469–484.
URL: https://doi.org/10.1177/1094342017738610, doi:10.1177/
1094342017738610.
[7] Bourzac, K., 2017. Stretching supercomputers to the limit. Nature
551, 554–556.
[8] Chandy, J.A., Singaraju, J., 2009. Hardware parallelism vs. software
parallelism, in: Proceedings of the First USENIX Conference on Hot
Topics in Parallelism, USENIX Association, Berkeley, CA, USA. pp.
2–2.
[9] Daga, M., Aji, A.M., c. Feng, W., 2011. On the efficacy of a fused
cpu+gpu processor (or apu) for parallel computing, in: 2011 Sympo-
sium on Application Accelerators in High-Performance Computing,
pp. 141–149. doi:10.1109/SAAHPC.2011.29.
[10] David, T., Guerraoui, R., Trigonakis, V., 2013. Everything you always
wanted to know about synchronization but were afraid to ask, in: Pro-
ceedings of the Twenty-Fourth ACM Symposium on Operating Sys-
tems Principles (SOSP ’13), pp. 33–48. doi:10.1145/2517349.2522714.
[11] Dettmers, T., 2015. The Brain vs Deep Learning Part
I: Computational Complexity – Or Why the Singularity
Is Nowhere Near. http://timdettmers.com/2015/07/27/
brain-vs-deep-learning-singularity/.
[12] Dongarra, J., 2016. Report on the Sunway TaihuLight System. Tech-
nical Report Tech Report UT-EECS-16-742. University of Tennessee
Department of Electrical Engineering and Computer Science.
[13] Ellen, F., Hendler, D., Shavit, N., 2012. On the Inherent Sequentiality
of Concurrent Objects. SIAM J. Comput. 43, 519–536. doi:10.
1137/08072646X.
[14] ERIK P. DeBenedictis, 2005. Petaflops, Exaflops, and Zettaflops
for Science and Defense. http://debenedictis.org/erik/SAND-2005/
SAND2005-2690-CUG2005-B.pdf.
[15] Esmaeilzadeh, H., 2015. Approximate acceleration: A path through
the era of dark silicon and big data, in: Proceedings of the 2015 In-
ternational Conference on Compilers, Architecture and Synthesis for
Embedded Systems, IEEE Press, Piscataway, NJ, USA. pp. 31–32.
URL: http://dl.acm.org/citation.cfm?id=2830689.2830693.
[16] Esmaeilzadeh, H., Blem, E., St. Amant, R., Sankaralingam, K., et al.,
2012. Dark Silicon and the End of Multicore Scaling. IEEE Micro
32, 122–134.
[17] European Commission, 2016. Implementation of the Action
Plan for the European High-Performance Computing strategy.
http://ec.europa.eu/newsroom/dae/document.cfm?doc_id=15269.
[18] Extremtech, 2018. Japan Tests Silicon for Exascale Com-
puting in 2021. URL: https://www.extremetech.com/computing/
272558-japan-tests-silicon-for-exascale-computing-in-2021.
[19] Eyerman, S., Eeckhout, L., 2010. Modeling Critical Sections in Am-
dahl’s Law and Its Implications for Multicore Design. SIGARCH
Comput. Archit. News 38, 362–370. URL: http://doi.acm.org/10.
1145/1816038.1816011, doi:10.1145/1816038.1816011.
[20] Fu, H., Liao, J., Yang, J., Wang, L., Song, Z., Huang, X., Yang, C.,
Xue, W., Liu, F., Qiao, F., Zhao, W., Yin, X., Hou, C., Zhang, C.,
Ge, W., Zhang, J., Wang, Y., Zhou, C., Yang, G., 2016. The Sun-
way TaihuLight supercomputer: system and applications. Science
China Information Sciences 59, 1–16. URL: http://dx.doi.org/10.
1007/s11432-016-5588-7, doi:10.1007/s11432-016-5588-7.
[21] Fuller, S.H., Millett, L.I., 2011. Computing Performance: Game Over
or Next Level? Computer 44, 31–38.
[22] Furber, S.B., Lester, D.R., Plana, L.A., Garside, J.D., Painkras,
E., Temple, S., Brown, A.D., 2013. Overview of the SpiN-
Naker System Architecture. IEEE Transactions on Computers 62,
2454–2467. URL: doi.ieeecomputersociety.org/10.1109/TC.2012.
142, doi:10.1109/TC.2012.142.
[23] Gustafson, J.L., 1988. Reevaluating Amdahl’s Law. Commun. ACM
31, 532–533. doi:10.1145/42411.42415.
[24] Hennessy, J.L., Patterson, D.A., 2007. Computer Architecture: A
Quantitative Approach. Morgan Kaufmann Publishers.
[25] HPCG Benchmark, 2016. Hpcg benchmark. http://www.
hpcg-benchmark.org/.
[26] Hwang, K., Jotwani, N., 2016. Advanced Computer Architecture:
Parallelism, Scalability, Programmability. 3 ed., Mc Graw Hill.
[27] IEEE Spectrum, 2017. Two Different Top500 Supercom-
puting Benchmarks Show Two Different Top Supercomput-
ers. https://spectrum.ieee.org/tech-talk/computing/hardware/
two-different-top500-supercomputing-benchmarks-show-two-different-top-supercomputers.
[28] Ippen, T., Eppler, J.M., Plesser, H.E., Diesmann, M., 2017. Con-
structing Neuronal Network Models in Massively Parallel Environ-
ments. Frontiers in Neuroinformatics 11, 30. URL: https://
www.frontiersin.org/article/10.3389/fninf.2017.00030, doi:10.3389/
fninf.2017.00030.
[29] Karp, A.H., Flatt, H.P., 1990. Measuring Parallel Processor Perfor-
mance. Commun. ACM 33, 539–543. doi:10.1145/78607.78614.
[30] Krishnaprasad, S., 2001. Uses and Abuses of Amdahl’s Law. J. Com-
put. Sci. Coll. 17, 288–293. URL: http://dl.acm.org/citation.cfm?
id=775339.775386.
[31] Lee, V.W., Kim, C., Chhugani, J., Deisher, M., Kim, D., Nguyen,
A.D., Satish, N., Smelyanskiy, M., Chennupaty, S., Hammarlund,
P., Singhal, R., Dubey, P., 2010. Debunking the 100X GPU vs.
CPU Myth: An Evaluation of Throughput Computing on CPU
and GPU, in: Proceedings of the 37th Annual International Sym-
posium on Computer Architecture, ACM, New York, NY, USA.
pp. 451–460. URL: http://doi.acm.org/10.1145/1815961.1816021,
doi:10.1145/1815961.1816021.
[32] Liao, X., 2019. Moving from exascale to zettascale computing: chal-
lenges and techniques. Frontiers of Information Technology & Elec-
tronic Engineering , 1236–1244.
[33] Liao, X.k., Lu, K., Yang, C.q., Li, J.w., Yuan, Y., Lai, M.c.,
Huang, L.b., Lu, P.j., Fang, J.b., Ren, J., Shen, J., 2018.
Moving from exascale to zettascale computing: challenges and
techniques. Frontiers of Information Technology & Electronic
Engineering 19, 1236–1244. URL: https://doi.org/10.1631/FITEE.1800494,
doi:10.1631/FITEE.1800494.
[34] de Macedo Mourelle, L., Nedjah, N., Pessanha, F.G., 2016. Re-
configurable and Adaptive Computing: Theory and Applications.
CRC press. chapter 5: Interprocess Communication via Crossbar for
Shared Memory Systems-on-chip.
[35] Markov, I., 2014. Limits on fundamental limits to computation. Na-
ture 512(7513), 147–154.
[36] Molnár, P., Végh, J., 2017. Measuring Performance of Processor In-
structions and Operating System Services in Soft Processor Based
Systems, in: 18th Internat. Carpathian Control Conf. ICCC, pp. 381–
387.
[37] Patterson, D., Hennessy, J. (Eds.), 2017. Computer Organization and
design. RISC-V Edition. Morgan Kaufmann.
[38] Paul, J.M., Meyer, B.H., 2007. Amdahl’s Law Revisited for Single
Chip Systems. International Journal of Parallel Programming 35,
101–123. URL: https://doi.org/10.1007/s10766-006-0028-8, doi:10.
1007/s10766-006-0028-8.
[39] Randal E. Bryant and David R. O’Hallaron, 2014. Computer Systems:
A Programmer’s Perspective. Pearson.
[40] Schlansker, M., Rau, B., 2000. EPIC: Explicitly Parallel Instruction
Computing. Computer 33, 37–45. doi:10.1109/2.820037.
[41] Singh, J.P., Hennessy, J.L., Gupta, A., 1993. Scaling parallel pro-
grams for multiprocessors: Methodology and examples. Computer
26, 42–50. doi:10.1109/MC.1993.274941.
[42] TOP500.org, 2019. The top 500 supercomputers. https://www.
top500.org/.
[43] Tsafrir, D., 2007. The context-switch overhead inflicted by hardware
interrupts (and the enigma of do-nothing loops), in: Proceedings of
the 2007 Workshop on Experimental Computer Science, ACM, New
York, NY, USA. pp. 3–3. URL: http://doi.acm.org/10.1145/1281700.
1281704, doi:10.1145/1281700.1281704.
[44] US Government NSA and DOE, 2016. A Report from the
NSA-DOE Technical Meeting on High Performance Comput-
ing. https://www.nitrd.gov/nitrdgroups/images/b/b4/NSA_DOE_HPC_
TechMeetingReport.pdf.
[45] US National Research Council, 2011. The Future of Computing Per-
formance: Game Over or Next Level? URL: http://science.energy.
gov/~/media/ascr/ascac/pdf/meetings/mar11/Yelick.pdf.
[46] Végh, J., 2016. A configurable accelerator for manycores: the Explic-
itly Many-Processor Approach. ArXiv e-prints URL: http://adsabs.
harvard.edu/abs/2016arXiv160701643V, arXiv:1607.01643.
[47] Végh, J., 2017. Statistical considerations on limitations of supercom-
puters. CoRR abs/1710.08951. URL: http://arxiv.org/abs/1710.
08951, arXiv:1710.08951.
[48] Végh, J., 2018. Introducing the Explicitly Many-Processor Approach.
Parallel Computing 75, 28 – 40.
[49] Végh, J., 2018. Limitations of performance of Exascale Applications
and supercomputers they are running on. ArXiv e-prints; submitted to
special issue of IEEE Journal of Parallel and Distributed Computing
arXiv:1808.05338.
[50] Végh, J., 2018. Renewing computing paradigms for more efficient
parallelization of single-threads. IOS Press. volume 29 of Advances
in Parallel Computing. chapter 13. pp. 305–330.
[51] Végh, J., 2019a. Classic versus modern computing: analogies with
classic versus modern physics. Information Science In review, ??–
???
[52] Végh, J., 2019b. How Amdahl’s Law limits the performance of
large artificial neural networks: (Why the functionality of full-
scale brain simulation on processor-based simulators is lim-
ited). Brain Informatics 6, 1–11. URL: https://braininformatics.
springeropen.com/articles/10.1186/s40708-019-0097-2.
[53] Végh, J., Bagoly, Z., Kicsák, A., Molnár, P., 2014. An alterna-
tive implementation for accelerating some functions of operating sys-
tem, in: Proceedings of the 9th International Conference on Soft-
ware Engineering and Applications (ICSOFT-EA-2014), pp. 494–
499. doi:10.5220/0005104704940499.
[54] Végh, J., Molnár, P., 2017. How to measure perfectness of paral-
lelization in hardware/software systems, in: 18th Internat. Carpathian
Control Conf. ICCC, pp. 394–399.
[55] Végh, J., Vásárhelyi, J., Drótos, D., 2019. The performance wall of
large parallel computing systems, Springer. pp. 224–237.
[56] Williams, S., Waterman, A., Patterson, D., 2009. Roofline: An in-
sightful visual performance model for multicore architectures. Com-
mun. ACM 52, 65–76.
[57] Yavits, L., Morad, A., Ginosar, R., 2014. The effect of communication
and synchronization on Amdahl’s law in multicore systems. Parallel
Computing 40, 1–16.
[58] Zheng, F., Li, H.L., Lv, H., Guo, F., Xu, X.H., Xie, X.H., 2015.
Cooperative computing techniques for a deeply fused and heteroge-
neous many-core processor architecture. Journal of Computer Sci-
ence and Technology 30, 145–162. URL: https://doi.org/10.1007/
s11390-015-1510-9, doi:10.1007/s11390-015-1510-9.
References
[1] Abdallah, A.E., Jones, C., Sanders, J.W. (Eds.), 2005. Communicat-
ing Sequential Processes. The First 25 Years. Springer. doi:10.1007/
b136154.
[2] van Albada, S.J., Rowley, A.G., Senk, J., Hopkins, M., Schmidt, M.,
Stokes, A.B., Lester, D.R., Diesmann, M., Furber, S.B., 2018. Per-
formance Comparison of the Digital Neuromorphic Hardware SpiN-
Naker and the Neural Network Simulation Software NEST for a Full-
Scale CorticalMicrocircuitModel. Frontiers inNeuroscience 12, 291.
[3] Amdahl, G.M., 1967. Validity of the Single Processor Approach to
Achieving Large-Scale Computing Capabilities, in: AFIPS Confer-
ence Proceedings, pp. 483–485. doi:10.1145/1465482.1465560.
[4] Ao, Y., Yang, C., Liu, F., Yin, W., Jiang, L., Sun, Q., 2018. Perfor-
mance Optimization of the HPCG Benchmark on the Sunway Taihu-
Light Supercomputer. ACM Trans. Archit. Code Optim. 15, 11:1–
11:20.
[5] Asanovic, K., Bodik, R., Demmel, J., Keaveny, T., Keutzer, K., Kubi-
atowicz, J., Morgan, N., Patterson, D., Sen, K., Wawrzynek, J., Wes-
sel, D., Yelick, K., 2009. A View of the Parallel Computing Land-
scape. Communications of the ACM 52, 56–67.
[6] Bell, G., Bailey, D.H., Dongarra, J., Karp, A.H., Walsh, K., 2017. A
look back on 30 years of the gordon bell prize. The International Jour-
nal of High Performance Computing Applications 31, 469âĂŞ484.
URL: https://doi.org/10.1177/1094342017738610, doi:10.1177/
1094342017738610, arXiv:https://doi.org/10.1177/1094342017738610.
[7] Bourzac, K., 2017. Streching supercomputers to the limit. Nature
551, 554–556.
[8] Chandy, J.A., Singaraju, J., 2009. Hardware parallelism vs. software
parallelism, in: Proceedings of the First USENIX Conference on Hot
Topics in Parallelism, USENIX Association, Berkeley, CA, USA. pp.
2–2.
[9] Daga, M., Aji, A.M., c. Feng, W., 2011. On the efficacy of a fused
cpu+gpu processor (or apu) for parallel computing, in: 2011 Sympo-
sium on Application Accelerators in High-Performance Computing,
pp. 141–149. doi:10.1109/SAAHPC.2011.29.
[10] David, T., Guerraoui, R., Trigonakis, V., 2013. Everything you always
wanted to know about synchronization but were afraid to ask, in: Pro-
ceedings of the Twenty-Fourth ACM Symposium on Operating Sys-
tems Principles (SOSP ’13), pp. 33–48. doi:10.1145/2517349.2522714.
[11] Dettmers, T., 2015. The Brain vs Deep Learning Part
I: Computational Complexity âĂŤ Or Why the Singular-
ity Is Nowhere Near. http://timdettmers.com/2015/07/27/
brain-vs-deep-learning-singularity/.
[12] Dongarra, J., 2016. Report on the Sunway TaihuLight System. Tech-
nical Report Tech Report UT-EECS-16-742. University of Tennessee
Department of Electrical Engineering and Computer Science.
[13] Ellen, F., Hendler, D., Shavit, N., 2012. On the Inherent Sequentiality
of Concurrent Objects. SIAM J. Comput. 43, 519âĂŞ536. doi:10.
1137/08072646X.
[14] ERIK P. DeBenedictis, 2005. Petaflops, Exaflops, and Zettaflops
for Science and Defense. http://debenedictis.org/erik/SAND-2005/
SAND2005-2690-CUG2005-B.pdf.
[15] Esmaeilzadeh, H., 2015. Approximate acceleration: A path through
the era of dark silicon and big data, in: Proceedings of the 2015 In-
ternational Conference on Compilers, Architecture and Synthesis for
Embedded Systems, IEEE Press, Piscataway, NJ, USA. pp. 31–32.
URL: http://dl.acm.org/citation.cfm?id=2830689.2830693.
[16] Esmaeilzadeh, H., Blem, E., St. Amant, R., Sankaralingam, K., et al.,
2012. Dark Silicon and the End of Multicore Scaling. IEEE Micro
32, 122–134.
[17] European Commission, 2016. Implementation of the Action
Plan for the European High-Performance Computing strategy.
http://ec.europa.eu/newsroom/dae/document.cfm?doc_id=15269.
[18] Extremtech, 2018. Japan Tests Silicon for Exascale Com-
puting in 2021. URL: https://www.extremetech.com/computing/
272558-japan-tests-silicon-for-exascale-computing-in-2021.
[19] Eyerman, S., Eeckhout, L., 2010. Modeling Critical Sections in Am-
dahl’s Law and Its Implications for Multicore Design. SIGARCH
Comput. Archit. News 38, 362–370. URL: http://doi.acm.org/10.
1145/1816038.1816011, doi:10.1145/1816038.1816011.
[20] Fu, H., Liao, J., Yang, J., Wang, L., Song, Z., Huang, X., Yang, C.,
Xue, W., Liu, F., Qiao, F., Zhao, W., Yin, X., Hou, C., Zhang, C.,
Ge, W., Zhang, J., Wang, Y., Zhou, C., Yang, G., 2016. The Sun-
way TaihuLight supercomputer: system and applications. Science
China Information Sciences 59, 1–16. URL: http://dx.doi.org/10.
1007/s11432-016-5588-7, doi:10.1007/s11432-016-5588-7.
[21] Fuller, S.H., Millett, L.I., 2011. Computing Performance: GameOver
or Next Level? Computer 44, 31–38.
[22] Furber, S.B., Lester, D.R., Plana, L.A., Garside, J.D., Painkras,
E., Temple, S., Brown, A.D., 2013. Overview of the SpiN-
Naker System Architecture. IEEE Transactions on Computers 62,
2454–2467. URL: doi.ieeecomputersociety.org/10.1109/TC.2012.
142, doi:10.1109/TC.2012.142.
[23] Gustafson, J.L., 1988. Reevaluating Amdahl’s Law. Commun. ACM
31, 532–533. doi:10.1145/42411.42415.
[24] Hennessy, J.L., Patterson, D.A., 2007. Computer Architecture: A
Quantitative Approach. Morgan Kaufmann Publishers.
[25] HPCG Benchmark, 2016. Hpcg benchmark. http://www.
hpcg-benchmark.org/.
[26] Hwang, K., Jotwani, N., 2016. Advanced Computer Architecture:
Parallelism, Scalability, Programmability. 3 ed., Mc Graw Hill.
[27] IEEE Spectrum, 2017. Two Different Top500 Supercom-
puting Benchmarks Show Two Different Top Supercomput-
ers. https://spectrum.ieee.org/tech-talk/computing/hardware/
two-different-top500-supercomputing-benchmarks-show-two-different-top-supercomputers.
[28] Ippen, T., Eppler, J.M., Plesser, H.E., Diesmann, M., 2017. Con-
structing Neuronal Network Models in Massively Parallel Environ-
ments. Frontiers in Neuroinformatics 11, 30. URL: https://
www.frontiersin.org/article/10.3389/fninf.2017.00030, doi:10.3389/
fninf.2017.00030.
[29] Karp, A.H., Flatt, H.P., 1990. Measuring Parallel Processor Perfor-
mance. Commun. ACM 33, 539–543. doi:10.1145/78607.78614.
[30] Krishnaprasad, S., 2001. Uses and Abuses of Amdahl’s Law. J. Com-
put. Sci. Coll. 17, 288–293. URL: http://dl.acm.org/citation.cfm?
id=775339.775386.
[31] Lee, V.W., Kim, C., Chhugani, J., Deisher, M., Kim, D., Nguyen,
A.D., Satish, N., Smelyanskiy, M., Chennupaty, S., Hammarlund,
P., Singhal, R., Dubey, P., 2010. Debunking the 100X GPU vs.
CPU Myth: An Evaluation of Throughput Computing on CPU
and GPU, in: Proceedings of the 37th Annual International Sym-
posium on Computer Architecture, ACM, New York, NY, USA.
pp. 451–460. URL: http://doi.acm.org/10.1145/1815961.1816021,
doi:10.1145/1815961.1816021.
[32] Liao, X., 2019. Moving from exascale to zettascale computing: chal-
lenges and techniques. Frontiers of Information Technology & Elec-
tronic Engineering , 1236–1244.
[33] Liao, X.k., Lu, K., Yang, C.q., Li, J.w., Yuan, Y., Lai, M.c.,
J. Végh: Preprint submitted to Elsevier Page 16 of 17
The roofline of parallelized performance gain
Huang, L.b., Lu, P.j., Fang, J.b., Ren, J., Shen, J., 2018. Mov-
ing from exascale to zettascale computing: challenges and tech-
niques. Frontiers of Information Technology & Electronic Engineer-
ing 19, 1236âĂŞ1244. URL: https://doi.org/10.1631/FITEE.1800494,
doi:10.1631/FITEE.1800494.
[34] de Macedo Mourelle, L., Nedjah, N., Pessanha, F.G., 2016. Re-
configurable and Adaptive Computing: Theory and Applications.
CRC press. chapter 5: Interprocess Communication via Crossbar for
Shared Memory Systems-on-chip.
[35] Markov, I., 2014. Limits on fundamental limits to computation. Na-
ture 512(7513), 147–154.
[36] Molnár, P., Végh, J., 2017. Measuring Performance of Processor In-
structions and Operating System Services in Soft Processor Based
Systems, in: 18th Internat. Carpathian Control Conf. ICCC, pp. 381–
387.
[37] Patterson, D., Hennessy, J. (Eds.), 2017. Computer Organization and
design. RISC-V Edition. Morgan Kaufmann.
[38] Paul, J.M., Meyer, B.H., 2007. Amdahl’s Law Revisited for Single
Chip Systems. International Journal of Parallel Programming 35,
101–123. URL: https://doi.org/10.1007/s10766-006-0028-8, doi:10.
1007/s10766-006-0028-8.
[39] Randal E. Bryant andDavid R. O’Hallaron, 2014. Computer Systems:
A Programmer’s Perspective. Pearson.
[40] Schlansker, M., Rau, B., 2000. EPIC: Explicitly Parallel Instruction
Computing. Computer 33, 37–45. doi:10.1109/2.820037.
[41] Singh, J.P., Hennessy, J.L., Gupta, A., 1993. Scaling parallel pro-
grams for multiprocessors: Methodology and examples. Computer
26, 42–50. doi:10.1109/MC.1993.274941.
[42] TOP500.org, 2019. The top 500 supercomputers. https://www.
top500.org/.
[43] Tsafrir, D., 2007. The context-switch overhead inflicted by hardware
interrupts (and the enigma of do-nothing loops), in: Proceedings of
the 2007 Workshop on Experimental Computer Science, ACM, New
York, NY, USA. pp. 3–3. URL: http://doi.acm.org/10.1145/1281700.
1281704, doi:10.1145/1281700.1281704.
[44] US Government NSA and DOE, 2016. A Report from the
NSA-DOE Technical Meeting on High Performance Comput-
ing. https://www.nitrd.gov/nitrdgroups/images/b/b4/NSA_DOE_HPC_
TechMeetingReport.pdf.
[45] US National Research Council, 2011. The Future of Computing Per-
formance: Game Over or Next Level? URL: http://science.energy.
gov/~/media/ascr/ascac/pdf/meetings/mar11/Yelick.pdf.
[46] Végh, J., 2016. A configurable accelerator for manycores: the Explic-
itly Many-Processor Approach. ArXiv e-prints URL: http://adsabs.
harvard.edu/abs/2016arXiv160701643V, arXiv:1607.01643.
[47] Végh, J., 2017. Statistical considerations on limitations of supercom-
puters. CoRR abs/1710.08951. URL: http://arxiv.org/abs/1710.
08951, arXiv:1710.08951.
[48] Végh, J., 2018. Introducing the Explicitly Many-Processor Approach.
Parallel Computing 75, 28 – 40.
[49] Végh, J., 2018. Limitations of performance of Exascale Applications
and supercomputers they are running on. ArXiv e-prints; submitted to
special issue of IEEE Journal of Parallel and Distributed Computing
arXiv:1808.05338.
[50] Végh, J., 2018. Renewing computing paradigms for more efficient
parallelization of single-threads. IOS Press. volume 29 of Advances
in Parallel Computing. chapter 13. pp. 305–330.
[51] Végh, J., 2019a. Classic versus modern computing: analogies with
classic versus modern physics. Information Science In review, ??–
???
[52] Végh, J., 2019b. How Amdahl’s Law limits the performance of
large artificial neural networks: (Why the functionality of
full-scale brain simulation on processor-based simulators is
limited). Brain Informatics 6, 1–11. URL:
https://braininformatics.springeropen.com/articles/10.1186/s40708-019-0097-2.
[53] Végh, J., Bagoly, Z., Kicsák, A., Molnár, P., 2014. An alternative
implementation for accelerating some functions of operating system,
in: Proceedings of the 9th International Conference on Software
Engineering and Applications (ICSOFT-EA-2014), pp. 494–499.
doi:10.5220/0005104704940499.
[54] Végh, J., Molnár, P., 2017. How to measure perfectness of
parallelization in hardware/software systems, in: 18th Internat.
Carpathian Control Conf. ICCC, pp. 394–399.
[55] Végh, J., Vásárhelyi, J., Drótos, D., 2019. The performance wall of
large parallel computing systems, Springer. pp. 224–237.
[56] Williams, S., Waterman, A., Patterson, D., 2009. Roofline: An
insightful visual performance model for multicore architectures.
Commun. ACM 52, 65–76.
[57] Yavits, L., Morad, A., Ginosar, R., 2014. The effect of communication
and synchronization on Amdahl’s law in multicore systems. Parallel
Computing 40, 1–16.
[58] Zheng, F., Li, H.L., Lv, H., Guo, F., Xu, X.H., Xie, X.H., 2015.
Cooperative computing techniques for a deeply fused and heterogeneous
many-core processor architecture. Journal of Computer Science and
Technology 30, 145–162. URL: https://doi.org/10.1007/s11390-015-1510-9,
doi:10.1007/s11390-015-1510-9.
