Finally, how many efficiencies supercomputers have? And, what do they
  measure? by Végh, János
János Végh
Abstract Using an extremely large number of processing elements in computing
systems leads to unexpected phenomena, such as different efficiencies of the
same system for different tasks, that cannot be explained in the frame of the
classical computing paradigm. The simple non-technical model introduced here
enables us to set up the frame and formalism needed to explain those
unexpected experiences around supercomputing. The paper shows that the
degradation of the efficiency of parallelized sequential systems is a natural
consequence of the classical computing paradigm rather than an engineering
imperfection. The workload is largely responsible for wasting energy, as well
as for limiting the size and type of tasks that supercomputers can run. Case
studies provide insight into how different contributions compete for
dominating the resulting payload performance of a computing system, and how
enhancing technology made computing+communication the dominant factor in
defining the efficiency of supercomputers. Our model also enables us to derive
predictions about supercomputer performance limitations for the near future,
and it provides hints for enhancing supercomputer components. Phenomena
experienced in large-scale computing show interesting parallels with phenomena
experienced in science more than a century ago, the study of which led to the
development of modern science.
Keywords Supercomputer performance · Parallelized sequential processing ·
Efficiency of supercomputers · Limitations of parallel processing · Behavior of
extreme-scale systems · ANN performance · ANN efficiency
Project no. 125547 has been implemented with the support provided from the National
Research, Development and Innovation Fund of Hungary, financed under the K funding
scheme. Also the ERC-ECAS support of project 886183 is acknowledged.
An extended and updated form of MS submitted to J. Supercomputing
J. Végh
Kalimános BT, Hungary
E-mail: Vegh.Janos@gmail.com
arXiv:2001.01266v3 [cs.PF] 22 Jul 2020
1 Introduction
Given that the dynamic growth of single-processor performance stalled about
two decades ago [1], the only remaining way to achieve the required high
computing performance is to parallelize the work of a vast number of
sequentially working single processors. However, as was predicted very early
[2] and experimentally confirmed decades later [3], the scaling of
parallelized computing is not linear. Moreover, "there comes a point when
using more processors . . . actually increases the execution time rather than
reducing it" [3]. Parallelized sequential processing has different rules of
the game [3], [4]: the performance gain ("speedup") has its inherent
bounds [5].
Just as the laws of science limit the performance of single-thread
processors [6], the commonly used computing paradigm (through its technical
implementation) limits the payload performance of supercomputers [4]. On the
one hand, experts expected performance¹ to reach the magic 1 Eflops around the
year 2020, see Figure 1 in [7]². "The performance increase of the No. 1
systems slowed down around 2013, and it was the same for the sum
performance" [7], but the authors extrapolated linearly, expecting that the
development continues and that "zettascale computing" (i.e., $10^4$-fold more
than the present performance) will be achieved in just over a decade. On the
other hand, it has recently been admitted that linearity is "A trend that
can't go on ad infinitum." Furthermore, that it "can be seen in our current
situation where the historical ten-year cadence between the attainment of
megaflops, teraflops, and petaflops has not been the case for exaflops" [8].
The expectations of supercomputers are excessive. For example, as the name of
the company PEZY³ attests, a billion-fold increase in payload performance is
expected. It appears that in feasibility studies of supercomputing using
parallelized sequential systems, an analysis of whether building computers of
such size is feasible (and reasonable) remained out of sight in the USA [9,10],
in the EU [11], in Japan [12], and in China [7] alike. The "gold rush" is
going on, even in the most prestigious journals [13,10]. In addition to the
previously existing "two different efficiencies of supercomputers" [14],
further efficiency/performance values appeared⁴ (of course, with higher
numeric figures), and several more can easily be derived.
¹ There are some doubts about the definition of exaFLOPS: whether it means
nominal performance $R_{Peak}$ or payload performance $R_{Max}$, measured by
which benchmark, and finally using what operand length. Here the term is used
as $R_{Max}^{HPL-64}$. To produce higher figures, several other benchmark
results, not related to floating-point computation, have been published.
² A special issue: https://link.springer.com/journal/11714/19/10
³ https://en.wikipedia.org/wiki/PEZY_Computing: The name PEZY is an acronym
derived from the Greek-derived metric prefixes Peta, Exa, Zetta, Yotta
⁴ https://blogs.nvidia.com/blog/2019/06/17/hpc-ai-performance-record-summit/
https://www.olcf.ornl.gov/2018/06/08/genomics-code-exceeds-exaops-on-summit-supercomputer/
Although severe counter-arguments were also published, mainly based on the
power consumption of both single processors and large computing centers [15],
the moon-shot of limitless parallelized processing performance is still
pursued. Probable sources of the idea are [16,3]. The former⁵ is based simply
on a misinterpretation [17,18] of terms in Amdahl's law [55]⁶. In the latter,
it might be misunderstood that "the serial fraction . . . is a diminishing
function of the problem size." This was valid at the time of writing. Still,
because the relative weight of housekeeping activity grows linearly with the
number of cores in the system (and so does the idle time of cores [19]), at
today's large number of cores ("the problem size"), the function turns from
diminishing to dominating. In reality, Amdahl's Law (in its original spirit)
is valid for all parallelized sequential activities, including
computing-unrelated ones, and it is the governing law of distributed
(including super-) computing.
Demonstrative failures of some systems (such as the supercomputers Gyoukou
and Aurora'18⁷, and the brain simulator SpiNNaker⁸) are already known, and
many more are expected to follow, such as Aurora'21 [22], the mystic China
supercomputers⁹, and the planned EU supercomputers¹⁰. Fugaku, although it
considerably enhanced the efficacy of computing, mainly due to the clever
placement and mode of use of its on-chip memory, also stalled at about 40% of
its planned capacity [23]. A general warning sign is that three of the first
ten supercomputers (as of June 2020) did not provide their High Performance
Conjugate Gradients (HPCG) performance (i.e., performance when running
"real-life" applications), and four of them (including the top 3) used only a
smaller portion of their cores, typically an order of magnitude less, in HPCG
benchmarking (see also Fig. 2). Only three of the top ten supercomputers used
all of their cores when running HPCG, while, of course, they used all their
cores when running the High Performance Linpack (HPL) benchmark.
The case is similar with exascale applications, such as brain simulation.
Exaggerated news appeared about simulating the brain of some animals or a
large percentage of the human brain. The reality is that the many-thread
version of the brain simulator can fill an extremely large amount of memory
with data of billions of artificial neurons [24], and a purpose-built brain
simulator can be designed to simulate one billion neurons [25], but in
practice, they both can simulate only about 80 thousand neurons [26], mainly
because of "the quantal nature of the computing time" [27]. "More Is
Different" [28].
⁵ The related work and speedup earned the Gordon Bell Prize.
⁶ As explicitly suspected in [18]: Gustafson's formulation gives the illusion
that N can increase indefinitely.
⁷ It was also learned that a specific processor design is needed for exascale.
As part of the announcement, the development line Knights Hill [20] was
canceled, to be replaced by a "new platform and new microarchitecture
specifically designed for exascale".
⁸ Despite its failure, SpiNNaker2 is also under construction [21].
⁹ https://www.scmp.com/tech/policy/article/3015997/china-has-decided-not-fan-flames-super-computing-rivalry-amid-us
¹⁰ https://ec.europa.eu/newsroom/dae/document.cfm?doc_id=60156

[Figure 1: two panels. Left panel: time-space diagram (axes x, y, t) with
Proc1, Proc2, Res1, and Res2. Right panel: time contributions (time axis not
proportional) for processes P0 to P4: AccessInitiation, SoftwarePre, OSPre,
T0 to T4, propagation delays PD, Process0 to Process4, "Just waiting", OSPost,
SoftwarePost, AccessTermination; the Payload, Total, and Extended times are
marked, with $\alpha = \mathrm{Payload}/\mathrm{Total}$. Annotated examples
for Total = $10^{13}$ clocks: 1993, $\alpha = 1 - 1\cdot10^{-3}$,
$N_{cores} = 10^{3}$,
$R_{Max}/R_{Peak} = 1/(N\cdot(1-\alpha)+\alpha) = 1/(10^{3}\cdot10^{-3}+1) = 0.5$;
2018 (Sunway/Taihulight), $\alpha = 1 - 3.3\cdot10^{-8}$, $N_{cores} = 10^{7}$,
$R_{Max}/R_{Peak} = 1/(N\cdot(1-\alpha)+\alpha) = 0.74$.]
Fig. 1 Left: Time diagram of parallelized sequential operation in time-space [19]. Right: A
non-technical, simplified model [55] of parallelized sequential computing operations, based
on their temporal behavior. Notice the different nature of those contributions, and that they
have only one common feature: they all consume time.

The confusion is growing. The paper attempts to clear it up by scrutinizing
the basic terms, contributions, and measurement methods. In section 2, an
intentionally strongly simplified non-technical model, based on the temporal
behavior of the physical implementation of computing [19], is presented. The
notations for Amdahl's Law, which form the basis of the present paper, are
introduced in section 3. It is shown that the degradation of the efficiency of
parallelized sequential systems, as was suspected early [3], is a natural
consequence of the computing paradigm rather than an engineering imperfection
(in the sense that it could be fixed later). Furthermore, its consequence is
that parallelized sequential computing systems, by their very nature, have an
upper performance bound. In section 4, the form of correction for adding
performances (i.e., the experienced 'empirical' efficiency of
parallelization), stemming from Amdahl's law, is introduced. Interestingly
enough, under extreme conditions, technical objects of computing show a series
of behaviors (for more details see [4]) similar to those of natural objects.
Different contributions form the sequential portion of the task (and through
this, degrade its parallelized performance), as detailed in section 5. The
established model is validated in section 6.
Given that the race to produce computing systems and components with higher
performance numbers is going on, section 7 predicts the results that can be
expected from developments in the near future. The section introduces some
further performance merits and, through interpreting them, concludes that
increasing the size of supercomputers further, and making expensive
enhancements in their technologies, only increases their non-payload
performance. Section 8 explains how the workload on a supercomputer defines
the payload portions that its different components can utilize out of the
total performance figures in the datasheet. In this way, the workload selects
one of the potential performance values of the supercomputer system, much as
measuring selects one of the potential quantum states in science.
2 A non-technical model of parallelized sequential operation
Performance measurements are simple time measurements¹¹ (although they need
careful handling and proper interpretation, see good textbooks such as [29]):
a standardized set of machine instructions is executed (a large number of
times), and the known number of operations is divided by the measurement
time; this holds for both single-processor and distributed parallelized
sequential systems. In the latter case, however, the joint work must also be
organized, which is implemented with additional machine instructions and
additional execution time, forming an overhead¹². This additional activity is
the origin of the efficiency loss: while one of the processors orchestrates
the joint operation, the others are waiting. At this point, "dark performance"
appears: processing units are ready to operate and consume power, but do not
do any payload work. As discussed in detail in [19], the "stealthy nature" of
the incremental development of technology made its appearance go unnoticed.
However, today "idle time" is the primary reason that power is consumed
mostly for delaying electronic signals [30] inside our computing systems and
for delivering data, rather than for making computations [15].
Amdahl listed [2] different reasons why losses in "computational load" can
occur. Amdahl's idea enables us to put everything that cannot be parallelized,
i.e., distributed between fellow processing units, into the sequential-only
fraction of the task. For describing the parallelized operation of
sequentially working units, the model depicted in Figure 1 was prepared (based
on the temporal behavior of components, as described in [19]). Technical
implementations of different parallelization methods show a virtually infinite
variety [31], so here an intentionally strongly simplified model is presented.
The model is general enough, however, to discuss qualitatively some examples
of systems working in parallel. We shall neglect different contributions
where possible in the different cases. Our model can easily be converted to a
technical (quantitative) one by interpreting its contributions in technical
terms, although with some obvious limitations. Such technical interpretations
also enable us to find some technical limiting factors of the performance of
parallelized computing.
¹¹ Sometimes secondary merits, such as GFlops/Watt or GFlops/USD, are also derived.
¹² This aspect is neglected in the weak scaling approximation.
The non-parallelizable (i.e., apparently sequential) part of tasks comprises
contributions from hardware (HW), operating system (OS), software (SW), and
Propagation Delay (PD); some access time is also needed for reaching the
parallelized system. This separation is conceptual rather than strict,
although dedicated measurements can reveal their role, at least approximately.
Some features can be implemented in either SW or HW, or shared between them.
Furthermore, some sequential activities may happen partly in parallel with
each other. The relative weights of these different contributions are very
different for different parallelized systems, and even within those cases
depend on many specific factors. That means that a careful analysis is
required in every single parallelization case. SW activity represents what
Amdahl assumed to be the total sequential fraction¹³. The non-determinism of
modern HW systems [32] [33] also contributes to the non-parallelizable portion
of the task: the resulting execution time of processing elements working in
parallel is defined by the slowest unit.
Notice that our model assumes no interaction between processes running on the
parallelized system beyond the necessary minimum: starting and terminating
otherwise independent processes, which take parameters at the beginning and
return a result at the end. It can, however, be trivially extended to more
general cases when processes must share some resource (such as a database,
which shall provide different records for the different processes), either
implicitly or explicitly. Concurrent objects have their inherent
sequentiality [34]. Synchronization and communication between those objects
considerably increase [35] the non-parallelizable portion (i.e., the
contribution to $(1-\alpha_{eff}^{SW})$ or $(1-\alpha_{eff}^{OS})$). Because of
this effect, in the case of an extremely large number of processors, special
attention must be devoted to their role in the efficiency of applications on
parallelized systems.
The physical size of the computing system also matters. A processor connected
to the first one with a cable dozens of meters long must spend several hundred
clock cycles waiting. This waiting is due solely to the finite speed of
propagation of light, topped by the latency time and hops of their
interconnection (not to mention geographically distributed computer systems,
such as some clouds, connected through general-purpose networks).
Detailed calculations are given in [36].
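The order of magnitude of this waiting can be checked with a minimal sketch. The cable length, clock frequency, and velocity factor below are illustrative assumptions, not values from the paper, and the function name `waiting_cycles` is hypothetical; switch latency and hops, which come on top, are ignored.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # vacuum value; signals in real cables are slower


def waiting_cycles(cable_length_m, clock_hz, velocity_factor=1.0):
    """Clock cycles elapsing while a signal travels one way along the cable.

    velocity_factor is the signal speed as a fraction of the vacuum speed of
    light (assumed 1.0 here, which gives the most optimistic estimate).
    """
    propagation_time_s = cable_length_m / (velocity_factor * SPEED_OF_LIGHT_M_PER_S)
    return propagation_time_s * clock_hz


# Hypothetical example: a 50 m cable and a 2 GHz clock (assumed values).
one_way = waiting_cycles(50.0, 2.0e9)
round_trip = 2 * one_way
print(f"one-way: {one_way:.0f} cycles, round trip: {round_trip:.0f} cycles")
```

Even under these optimistic assumptions, the propagation time alone amounts to hundreds of clock cycles; a lower velocity factor only increases the figure.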
After reaching a certain number of processors, adding more processors no
longer increases the payload fraction. The first fellow processor has already
finished its task and is idle, waiting, while the last one is still idle,
waiting for its start command. This limiting number can be increased by
organizing the processors into clusters: the first computer must speak
directly only to the head of the cluster. Another way is to distribute the job
close to the processing units. This can happen either inside the
processor [37], or by letting the processors have the job done by the
processing units of a GPGPU¹⁴.
¹³ Although some OS activity was surely included, Amdahl concluded some 20 % SW
fraction, so at that time the other contributions could be neglected compared
to the SW contribution. As shown in Figure 1 and discussed below, by today
this contribution has become several orders of magnitude smaller. However, at
the same time, the number of cores grew by several orders of magnitude.
¹⁴ Notice, however, that any additional actor on the scene increases the
latency of computation.
¹⁵ This strongly depends on the workload and the architecture.
This looping contribution is not considerable (and so is not noticeable) at a
low number of processing units, but can be a dominating factor at a high
number of processing units. This "high number" was a few dozen at the time of
writing the paper [3]; today it is a few million¹⁵. The way in which the
effect of the looping contribution is considered is the borderline between
first-order and second-order approximations in modeling performance.
Housekeeping keeps growing with the number of processors. In contrast, the
resulting performance of the system does not increase anymore. The
first-order approximation considers the contribution of housekeeping as
constant. The second-order approximation also considers that, as the number
of processing units grows, housekeeping grows with it, gradually becomes the
dominating factor of performance limitation,
and leads to a decrease in payload performance.
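The difference between the two approximations can be illustrated with a short sketch built on the Amdahl-type expression of Eq. (6) in section 3. The linear growth of the non-payload fraction with the number of units, and both coefficients `base` and `per_unit`, are assumptions chosen for illustration only; they are not fitted to any system.

```python
def payload_performance(n, one_minus_alpha, p_single=1.0):
    """Amdahl-type payload performance: N * P_single / (N * (1 - alpha) + alpha)."""
    alpha = 1.0 - one_minus_alpha
    return n * p_single / (n * one_minus_alpha + alpha)


def first_order(n, base=1e-7):
    """First order: housekeeping is a constant non-payload fraction (assumed value)."""
    return payload_performance(n, base)


def second_order(n, base=1e-7, per_unit=1e-14):
    """Second order: housekeeping grows linearly with N (assumed form and coefficients)."""
    return payload_performance(n, base + per_unit * n)


for n in (10**5, 10**6, 10**7, 10**8, 10**9):
    print(f"N = {n:>10}: first order {first_order(n):.3e}, "
          f"second order {second_order(n):.3e}")
```

With these assumed coefficients, the first-order curve saturates, while the second-order curve passes through a maximum and then decreases as more units are added.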
As Figure 1 shows, in parallelized operating mode (in addition to the
calculation itself, and to the communication of data between its processing
units), both software and hardware contribute to the execution time, i.e.,
they both must be considered in Amdahl's Law. This is not new, again: see [2].
Figure 1 also shows where there is room to improve computing efficiency. When
PD is properly combined with sequential scheduling, non-payload time can be
considerably reduced during fine-tuning of the system (see the performance
increases of Sierra and Summit half a year after their appearance on the
TOP500 list). Also, mismatching the total time and the extended measurement
time (or not making a proper correction) may lead to completely wrong
conclusions [38], as discussed in [36].
3 Amdahl’s Law in terms of our model
Usually, Amdahl's law is expressed as
$$S^{-1} = (1-\alpha) + \alpha/N \tag{1}$$
where $N$ is the number of parallelized code fragments (or Processing Units
(PUs)), $\alpha$ is the ratio of the parallelizable fraction to the total, and
$S$ is the measurable speedup. From this,
$$\alpha = \frac{N}{N-1}\,\frac{S-1}{S} \tag{2}$$
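When calculating speedup, one actually calculates
$$S = \frac{(1-\alpha)+\alpha}{(1-\alpha)+\alpha/N} = \frac{N}{N\cdot(1-\alpha)+\alpha} \tag{3}$$
hence the resulting efficiency of the system (see Figure 2) is
$$E(N,\alpha) = \frac{S}{N} = \frac{1}{N\cdot(1-\alpha)+\alpha} = \frac{R_{Max}}{R_{Peak}} \tag{4}$$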
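As a numerical check of Eqs. (2) to (4), the following sketch evaluates them for the 1993 example annotated in Fig. 1 ($N = 10^3$, $1-\alpha = 10^{-3}$). The function names are hypothetical and chosen only for this illustration.

```python
def speedup(n, alpha):
    """Eq. (3): S = N / (N * (1 - alpha) + alpha)."""
    return n / (n * (1.0 - alpha) + alpha)


def efficiency(n, alpha):
    """Eq. (4): E = S / N = R_Max / R_Peak."""
    return speedup(n, alpha) / n


def alpha_from_speedup(n, s):
    """Eq. (2): alpha = N / (N - 1) * (S - 1) / S, valid for N > 1."""
    return n / (n - 1.0) * (s - 1.0) / s


n, alpha = 1000, 1.0 - 1e-3
s = speedup(n, alpha)
print(f"S = {s:.2f}, E = {efficiency(n, alpha):.4f}, "
      f"alpha recovered from S = {alpha_from_speedup(n, s):.6f}")
```

The efficiency comes out close to the 0.5 annotated in the figure, and Eq. (2) recovers the value of $\alpha$ that was put in.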
This phenomenon itself has been known for decades [3], and $\alpha$ is
theoretically established [39]. Presently, however, the theory has somewhat
faded, mainly due to the quick development of parallelization technology and
the increase of single-processor performance.
During the past quarter of a century, the proportion of the contributions
changed considerably: today the number of processors is thousands of times
higher than it was a quarter of a century ago. The growing physical size and
the higher processing speed increased the role of the propagation overhead;
furthermore, the large number of processing units strongly amplified the role
of the looping overhead. As a result of technical development, the phenomenon
of performance limitation returned in a technically different form, at a much
higher number of processors.

[Figure 2: surface plot titled "TOP500'2020.06". Axes (logarithmic): No of
processors ($10^{5}$ to $10^{7}$), Non-payload/payload ($10^{-7}$ to
$10^{-4}$), Efficiency ($10^{-4}$ to $10^{0}$). Marked systems: Fugaku, Summit,
Sierra, Taihulight, K computer, and ≈Brain.]
Fig. 2 The 2-parameter efficiency surface (as a function of the parallelization efficiency
measured by benchmark HPL and the number of processing elements), as concluded from
Amdahl's Law (see Eq. (4)), in first-order approximation. Some sample efficiency values for
some selected supercomputers are shown, measured with benchmarks HPL and HPCG,
respectively. This decay in performance is not a fault of the architecture, but is dictated
by the limited parallelism [3]
Using Eq. (4), $E = S/N = R_{Max}/R_{Peak}$ can equally well be used for
describing the efficiency of parallelization of a setup:
$$\alpha_{E,N} = \frac{E\cdot N - 1}{E\cdot(N-1)} \tag{5}$$
As we discuss below, except for an extremely high number of processors, it
can be safely assumed that $\alpha$ is independent of the number of processors
in the system. Eq. (5) can be used to derive the value of $\alpha$ from the
values of the parameters $R_{Max}/R_{Peak}$ and the number of cores $N$.
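A minimal sketch of how Eq. (5) might be applied to a list entry follows. The inputs are rounded, illustrative values of the kind annotated in Fig. 1 (efficiency 0.74 with $10^7$ cores), not an exact TOP500 record; `one_minus_alpha_from_efficiency` is a hypothetical helper that uses the algebraically equivalent form $(1-E)/(E\cdot(N-1))$ to avoid subtracting two nearly equal numbers.

```python
def alpha_from_efficiency(e, n):
    """Eq. (5): alpha = (E * N - 1) / (E * (N - 1)), valid for N > 1 and E > 0."""
    return (e * n - 1.0) / (e * (n - 1.0))


def one_minus_alpha_from_efficiency(e, n):
    """Equivalent form of 1 - alpha from Eq. (5): (1 - E) / (E * (N - 1))."""
    return (1.0 - e) / (e * (n - 1.0))


def efficiency(n, alpha):
    """Eq. (4): E = 1 / (N * (1 - alpha) + alpha)."""
    return 1.0 / (n * (1.0 - alpha) + alpha)


# Illustrative inputs (assumed, rounded): R_Max / R_Peak and number of cores.
e, n_cores = 0.74, 10_000_000
alpha = alpha_from_efficiency(e, n_cores)
print(f"(1 - alpha) = {one_minus_alpha_from_efficiency(e, n_cores):.3e}")
print(f"efficiency recomputed from alpha = {efficiency(n_cores, alpha):.4f}")
```

Feeding the derived $\alpha$ back into Eq. (4) reproduces the input efficiency, which is a quick consistency check when processing many list entries.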
According to Eq. (4), efficiency can be described with a 2-dimensional
surface, as shown in Figure 2. On the surface, some measured efficiencies of
present top supercomputers are also depicted, just to illustrate some general
rules. The HPL¹⁶ efficiencies sit on the surface, while the corresponding
HPCG¹⁷ values are much below them. As Figure 2 attests, Taihulight and
K computer stand out from the "million-core" middle group. Thanks to its 0.3M
cores, K computer has the best efficiency for the HPCG benchmark, while
Taihulight, with its 10M cores, has the lowest one. The middle group follows
the rules. For the HPL benchmark: the more cores, the lower the efficiency.
For the HPCG benchmark: having reached the "roofline" [40] [55] of that
communication intensity, they have about the same efficiency. In the latter
benchmark, using fewer cores is advantageous: using more processors does not
increase payload performance, but decreases efficiency. The tendency of
K computer, Summit, and Taihulight clearly shows how a higher number of PUs
degrades HPCG efficiency. Fugaku and Sierra can produce seemingly out-of-order
efficiency because they reach the performance roofline at a lower number of
PUs and report that efficiency, rather than the lower value measured when
using all cores.
¹⁶ http://www.netlib.org/benchmark/hpl/
¹⁷ https://www.epcc.ed.ac.uk/blog/2015/07/30/hpcg

[Figure 3: two panels, each showing data points with regression lines for the
TOP50 and the TOP10. Left panel: No of Processors/1e6 (logarithmic, $10^{-1}$
to $10^{1}$) versus Ranking by HPL (0 to 50). Right panel:
$(1-\alpha_{eff}^{HPL})$ (logarithmic, $10^{-7}$ to $10^{-5}$) versus No of
Processors/1e6 (logarithmic, $10^{-1}$ to $10^{1}$).]
Fig. 3 The interrelation of ranking (by the benchmark HPL), the number of processors,
and the efficiency of parallelization. The data are taken from the TOP500 database [41]. The
correlation is drawn for both the TOP10 and the TOP50 supercomputers.
Projections to the axes show that the top few supercomputers have very similar
parallelization efficiency and core number values: both are required to
receive one of the top slots; see also Figure 3. Supercomputers Taihulight and
Fugaku are exceptions on both axes. They have the highest number of cores and
the best HPL parallelization efficiency. An interesting coincidence is that
the processors of both supercomputers have "assistant cores" (i.e., some cores
do not do payload computing; instead, they take over "housekeeping duties").
This solution decreases the internal latency of the processors doing payload
computing and increases the efficiency of the system. They both use a
"light-weight operating system" (and so do Fugaku and Sierra; four out of the
first four), also to reduce processor latency. This efficiency, of course,
requires executing several floating-point instructions per clock cycle. That
mode of operation gets more and more challenging for the interconnection,
which delivers data to and from the data processing units. Notice also, in
their cases, the role of "near" memories: as explained in [19], data delivery
time considerably increases the "idle time" of computing. This idle time is
why Fugaku, with its cleverly placed L2 cache memories, can be more effective
when measured with HPL. This trick, however, does not work in the case of
HPCG, because its "sparse" computations use those cache memories
ineffectively. The "true" HPCG efficiency of Fugaku is expected to be between
the corresponding values of Summit and Taihulight.
In addition, the processor of Taihulight comprises cooperating cores [37]. The
direct core-to-core transfer uses a (slightly) different computing paradigm:
processor cores explicitly assume the presence of another core, and in this
way, their effective parallelism becomes much better, see also Fig. 5. In that
figure, this data point and the ones using shorter operands (Summit and
Fugaku) result in effective parallelization values below the limiting line.
Reducing the loop count by internal clustering (in addition to the "hidden
clustering" enabled by its assistant cores) and exchanging data without using
global memory, however, works only in the HPL case, where the contribution of
SW is low. The poor value of $(1-\alpha_{eff}^{HPCG})$ is not necessarily a
sign of architectural weakness [9]: Taihulight comprises about four times more
cores than the now second Summit. Given that HPCG mimics "real-life"
applications, one can conclude that for practical purposes, only systems
comprising a few hundred thousand cores¹⁸ shall be built. More cores
contribute only to "dark performance".
According to Eq. (4), efficiency can be interpreted in terms of $\alpha$ and
$N$, and the payload performance of a parallelized sequential computing system
can be calculated as
$$P(N,\alpha) = \frac{N\cdot P_{single}}{N\cdot(1-\alpha)+\alpha} \tag{6}$$
This simple formula explains why the payload performance is not a linear
function of the nominal performance and why, in the case of very good
parallelization ($(1-\alpha) \ll 1$) and low $N$, this nonlinearity cannot be
noticed.
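The nonlinearity can be made visible with a small sketch of Eq. (6). The value $1-\alpha = 10^{-6}$ and the unit single-processor performance are assumptions for illustration; `cores_for_efficiency` is a hypothetical helper obtained by solving Eq. (4) for $N$.

```python
def payload_performance(n, one_minus_alpha, p_single=1.0):
    """Eq. (6): P = N * P_single / (N * (1 - alpha) + alpha)."""
    alpha = 1.0 - one_minus_alpha
    return n * p_single / (n * one_minus_alpha + alpha)


def cores_for_efficiency(e, one_minus_alpha):
    """Eq. (4) solved for N: N = (1 / E - alpha) / (1 - alpha)."""
    alpha = 1.0 - one_minus_alpha
    return (1.0 / e - alpha) / one_minus_alpha


ONE_MINUS_ALPHA = 1e-6  # assumed non-payload fraction, for illustration only

for n in (10, 10**3, 10**5, 10**6, 10**7, 10**8):
    payload = payload_performance(n, ONE_MINUS_ALPHA)
    print(f"N = {n:>9}: nominal {float(n):.3e}, payload {payload:.3e}, "
          f"ratio {payload / n:.4f}")

print(f"efficiency drops to 0.5 at N = {cores_for_efficiency(0.5, ONE_MINUS_ALPHA):.0f}")
```

At low $N$ the payload performance is practically equal to the nominal one, while at high $N$ it approaches the bound $P_{single}/(1-\alpha)$ and no longer follows the nominal performance.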
The value of $\alpha$, however, can hardly be calculated for the present
complex HW/SW systems from their technical data (for a detailed discussion
see [55]). Two ways can be followed to estimate their value of $\alpha$. One
way is to calculate $\alpha$ for existing supercomputing systems (making
"computational experiments" [18]) by applying Eq. (5) to the data in the
TOP500 list [4]. This way provides a lower bound for $(1-\alpha)$, which is
already achieved. The other way round is to consider the contributions of
different origin, see section 5, and to calculate the high limit of the value
of $(1-\alpha)$ that the given contribution alone does not allow to be
exceeded (provided that this contribution is the dominant one). It gives us
good confidence in the reliability of the parameters that the values derived
in these two ways differ only within a factor of two. At the same time, this
also means that technology is already very close to its theoretical
limitations.
¹⁸ Assuming good interconnection. For computing systems connected with
general-purpose networks, such as some high-performance clouds, this limiting
number is much lower.
Notice that the "algorithmic effects", like dealing with sparse data
structures (which affects cache behavior, and will have growing importance in
the age of "real-time everything" and neural networks) or communication
between threads running in parallel, such as returning results repeatedly to
the main thread in an iteration (which greatly increases the
non-parallelizable fraction in the main thread), manifest through the HW/SW
architecture, and they can hardly be separated. Also notice that there are
one-time and fixed-size contributions, such as utilizing time measurement
facilities or calling system services. Since $\alpha_{eff}$ is a relative
merit, the absolute measurement time shall be large. When utilizing
efficiency data from measurements that were dedicated to some other goal,
proper caution must be exercised with the interpretation and accuracy of
those data.
The 'right efficiency metric' [42] has always been a question (for a summary
see the references cited in [43]) when defining efficient supercomputing. The
goal of the present discussion is to find the inherent limitations of
parallelized sequential computing and to provide numerical values for them.
For this goal, the 'classic interpretation' [2,3,39] of performance was used,
in its original spirit. The contributions mentioned in those papers were
scrutinized, and their importance under current technical conditions was
revised.
The left subfigure of Figure 3 shows that a higher number of processors is
required to get a better ranking on the TOP500 list. The regression line is
different for the TOP10 and the TOP50 positions. The cut line between "racing"
and "commodity" supercomputers is around slot 10. As the right subfigure
underpins, a high number of processors must be accompanied by good
parallelization efficiency; otherwise, the large number of cores cannot
counterbalance the decreasing efficiency, see Eq. (6).
4 Analogies with the case of modern vs classic science
Eq. (6) simply tells that (in first-order approximation) the speedup of a
parallelized computing system cannot exceed $1/(1-\alpha)$, a well-known
consequence of Amdahl's statement. Due to this, computing performance cannot
be increased above the performance defined by single-processor performance,
parallelization technology, and the number of processors. Laws of nature
prohibit exceeding a specific computing performance (using the classical
paradigm and its classical implementation). There is an analogy between adding
speeds in physics and adding performances in computing. In both cases, a
correction term is introduced that has a noticeable effect only at extremely
large values. One more analogy, in that case with quantum theory, is
introduced in section 7, another is mentioned in section 5.6, and several more
are discussed in [4]. It seems to be an interesting parallel that both nature
and extremely cutting-edge technical (computing) systems show some
extraordinary behavior, i.e., the linear behavior experienced under normal
conditions gets strongly non-linear at large values of the independent
variable. That behavior makes the validity of linear extrapolations, as well
as the linear addition of performances, at least questionable at high
performance values in computing.
The analogies are not meant to imply a direct correspondence between certain
physical and computing phenomena. Instead, the paper draws attention both to
the fact that under extreme conditions qualitatively different behavior may be
encountered, and to the fact that scrutinizing certain formerly unnoticed or
neglected aspects enables us to explain the new phenomena. Unlike in nature,
in computing the technical implementation of critical points can be changed,
and through this, the behavior of computing systems can also be altered.
5 Effect of different contributions to α
Theory can display data for systems with any contributors and any parameters,
but from measured data only the sum of all contributions can be derived,
although dedicated measurements could reveal the values of the separate
contributions experimentally. The publicly available data enable us to draw
only conclusions of limited validity.
5.1 Estimating different limiting factors of α
The estimations below assume that the actual contribution is the dominat-
ing one, and as such, defines achievable performance alone. This situation is
usually not the case in practice, but this approach enables us to find out the
limiting (1− α) values for all contributions.
In the systems implemented in Single Processor Approach (SPA) [2] as par-
allelized sequential systems, the life begins in one such sequential subsystem,
see also Fig. 1. In large parallelized applications, running on general-purpose
supercomputers, initially and finally only one thread exists, i.e., the minimal
necessary non-parallelizable activity is to fork the other threads and join them
again.
With the present technology, no such action can be shorter than one processor
clock period (see footnote 19). That is, the theoretical absolute minimum
value of the non-parallelizable portion of the task is the ratio of the time
of those two clock periods (one for the fork, one for the join) to the total
execution time. The latter time is a free parameter in describing efficiency.
That is, the value of the effective parallelization $\alpha_{eff}$ depends on
the total benchmarking time (and so does the achievable parallelization gain).
This dependence is, of course, well known to supercomputer scientists: to
measure efficiency with better accuracy (and also to produce better
$\alpha_{eff}$ values), execution times of hours are used in practice. In the
case of benchmarking the supercomputer Taihulight [44], a 13,298-second HPL
benchmark runtime was used; on the 1.45 GHz processors this means
$2 \times 10^{13}$ clock periods. The inherent limit of $(1-\alpha_{eff})$ at
such a benchmarking time is $10^{-13}$ (or, equivalently, the achievable
performance gain is $10^{13}$). In the paper, for simplicity, 1.00 GHz
processors (i.e., 1 ns clock cycle time) will be assumed.
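The arithmetic behind these figures can be sketched as follows. The runtime,
the clock frequency, and the two serial clock periods come from the text; the
function name and its signature are illustrative assumptions.

```python
def inherent_limit(runtime_s: float, clock_hz: float, serial_cycles: float = 2.0):
    """Return (total clock periods, minimal (1 - alpha_eff)) for a benchmark run."""
    total_cycles = runtime_s * clock_hz
    # The serial part (fork + join) is at least `serial_cycles` clock periods.
    return total_cycles, serial_cycles / total_cycles


if __name__ == "__main__":
    # Taihulight HPL run: 13,298 s on 1.45 GHz processors.
    cycles, one_minus_alpha = inherent_limit(13_298, 1.45e9)
    print(f"clock periods:            {cycles:.2e}")
    print(f"minimal (1 - alpha_eff):  {one_minus_alpha:.1e}")
    print(f"maximal performance gain: {1.0 / one_minus_alpha:.1e}")
```

A shorter benchmark run shrinks the denominator and therefore worsens the
limit proportionally.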
19 This statement is valid even if some parallelly working units can execute more than
one instruction in a clock period. One can take these two clock periods as an ideal (but
not realistic) case. However, the actual limitation will inevitably be (much) worse than the
one calculated for this idealistic case. The exact number of clock periods depends on many
factors, as discussed below.
Supercomputers are also distributed systems. In a stadium-sized
supercomputer, a distance between processors (cable length) of up to about
100 m can be assumed. The net signal round-trip time is ca. $10^{-6}$
seconds, or $10^{3}$ clock periods; i.e., in the case of a finite-sized
supercomputer, the performance gain cannot be above $10^{10}$, only because
of the physical size of the supercomputer. Presently available network
interfaces have 100...200 ns latency times, and sending a message between
processors takes a time of the same order of magnitude, typically 500 ns.
This timing also means that the interconnection is not the bottleneck in
enhancing performance. This statement is also underpinned by the discussion
in section 5.3.
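The size-related limit can be reproduced with a few lines. The 100 m
distance, the 1 GHz clock, and the 13,298 s runtime are taken from the text;
the signal speed of $2 \times 10^{8}$ m/s (about two-thirds of the speed of
light) and all names are assumptions made for this sketch.

```python
SIGNAL_SPEED_M_S = 2.0e8  # assumed propagation speed in cable, not from the paper


def round_trip_cycles(distance_m: float, clock_hz: float,
                      signal_speed_m_s: float = SIGNAL_SPEED_M_S) -> float:
    """Clock periods spent on one signal round trip over `distance_m`."""
    return 2.0 * distance_m / signal_speed_m_s * clock_hz


def gain_limit(runtime_s: float, clock_hz: float, serial_cycles: float) -> float:
    """Largest performance gain if `serial_cycles` cannot be parallelized."""
    return runtime_s * clock_hz / serial_cycles


if __name__ == "__main__":
    cycles = round_trip_cycles(100.0, 1.0e9)
    print(f"round trip: {cycles:.0f} clock periods")
    print(f"gain limit: {gain_limit(13_298, 1.0e9, cycles):.1e}")
```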
These predictions enable us to assume that the presently achieved value of
$(1-\alpha_{eff})$ could also persist for roughly a hundred times more cores.
However, another major issue arises from the computing principle SPA: the
first processor can address only one processor at a time. As a consequence,
at least as many clock cycles must be used for organizing the parallelized
work as there are addressing steps. This number equals the number of cores in
the supercomputer, i.e., the addressing operation in supercomputers in the
TOP10 positions typically needs clock cycles in the order of
$5 \times 10^{5} \ldots 10^{7}$, degrading the value of $(1-\alpha_{eff})$
into the range $10^{-6} \ldots 2 \times 10^{-5}$. Two tricks may be used to
reduce the number of addressing steps: either the cores are organized into
clusters, as many supercomputer builders do, or, at the other end, the
processor itself can take over the responsibility of addressing its
cores [37]. Depending on the actual construction, the reducing factor of
clustering of those types can be in the range
$10^{1} \ldots 5 \times 10^{3}$, i.e., the resulting value of
$(1-\alpha_{eff})$ is expected to be around $10^{-7}$.
An operating system must also be used for protection and convenience. If
fork/join is executed by the OS as usual, then because of the needed context
switches, $2 \times 10^{4}$ [45,46] clock cycles are used, rather than the 2
clock cycles considered in the idealistic case. The derived values
correspondingly differ by four orders of magnitude, that is, the absolute
limit is $\approx 5 \times 10^{-8}$ on a zero-sized supercomputer. This value
is somewhat better than the limiting value derived above, but it is close to
that value and surely represents a considerable contribution. This limitation
is why a growing number of supercomputers use a "light-weight kernel" or run
their actual computations in kernel mode; a method of computing that can be
used only with well-known benchmarks.
However, this optimistic limit assumes that an instruction can be accessed
in one clock cycle. This is usually not the case, but it seems to be a good
approximation. On one side, even a cached instruction needs about five times
more access time, and the time required to access 'far' memory is roughly
100 times longer. Correspondingly, the most optimistic achievable performance
gain values shall be scaled down by a factor of 5...100. A considerable part
of the difference between the efficiencies $\alpha^{HPL}_{eff}$ and
$\alpha^{HPCG}_{eff}$ can be attributed to different cache behavior because
of the 'sparse' matrix operations.
5.2 Effect of workflow
The overly complex Figure 4 attempts to explain why and how the performance
of a supercomputer configuration depends on the application it runs. The
non-parallelizable fraction (denoted in the figure by $\alpha^{X}_{eff}$) of
the computing task comprises components X of different origin. As already
discussed, and as was noticed decades ago, "the inherent
communication-to-computation ratio in a parallel application is one of the
important determinants of its performance on any architecture" [3],
suggesting that communication can be a dominant contribution in system
performance. Figure 4.A displays a case with minimum communication, and
Figure 4.B a case with moderately increased communication (corresponding to
real-life supercomputer tasks). As nominal performance increases linearly and
payload performance decreases inversely with the number of cores, at some
critical value where an inflection point occurs, the resulting payload
performance starts to drop. The resulting non-parallelizable fraction sharply
decreases the efficacy (or, in other words, the performance gain or speedup)
of the system [47,48]. The effect was noticed early [3], under different
technical conditions, but it somewhat faded from attention due to the
successes in developing parallelization technology.
Figure 4.A illustrates behavior measured with HPL benchmark. The loop-
ing contribution becomes remarkable around 0.1 Eflops, and breaks down pay-
load performance (see also Figure 1 in [3]) when approaching 1 Eflops. In
Figure 4.B, behavior measured with benchmark HPCG is displayed. In this
case contribution of the application (brown line) is much higher. The looping
contribution (thin green line) is the same as above. Consequently, achievable
payload performance is lower, and also the breakdown of the payload perfor-
mance is softer.
Given that no dedicated measurements exist, it is hard to make a direct
comparison between theoretical prediction and measured data. However, the
impressive and quick development of interconnecting technologies provides a
helping hand.
5.3 Contribution of the interconnection
As discussed above, in a somewhat simplified view, resulting performance can
be calculated using the contributions to α as
$$P(N,\alpha) \approx \frac{N \cdot P_{single}}{N \cdot (1-\alpha_{Net}-\alpha_{Compute}-\alpha_{Others}) + 1} \qquad (7)$$
That is, two of the contributions are handled with emphasis. Theory provides
values for the contributions of interconnection and calculation separately.
Fortunately, the public database TOP500 [50] also provides data measured
under conditions greatly similar to the 'net' interconnection contribution.
Of course, measured data reflect the sum of the contributions of all
components. However, as will be shown below, in the total contribution those
mentioned contributions dominate, and all but the contribution from
networking are (nearly) unchanged, so the difference of the measured α
values can be directly compared to the difference of the corresponding sums
of calculated α values, although here only qualitative agreement can be
expected.

[Figure 4: three panels (A: HPL, B: HPCG, C: neural network). Left column:
layer models (input layer; "HPL Layer", "HPCG Layer 1" to "HPCG Layer n", or
hidden layers 1 to n; output layer). Right column: $(1-\alpha_{eff})$
contributions ($\alpha_{SW}$, $\alpha_{OS}$, $\alpha_{eff}$; left scale,
$10^{-10}$ to $10^{-4}$) and $R_{Max}$ (Eflop/s; right scale, $10^{-5}$ to
$10^{0}$) versus $R_{Peak}$ (Eflop/s, $10^{-3}$ to $10^{0}$).]

Fig. 4 The figure explains how different communication/computation
intensities of applications lead to different payload performance values in
the same supercomputer system. Left column: models of the computing
intensities for different benchmarks. Right column: the corresponding payload
performances and α contributions as a function of the nominal performance of
a fictive supercomputer (P = 1 Gflop/s @ 1 GHz). The blue diagram line refers
to the right-hand scale ($R_{Max}$ values), all others
($(1-\alpha^{X}_{eff})$ contributions) to the left-hand scale. The figure
purely illustrates the concepts; the displayed numbers are somewhat similar
to real ones. The performance breakdowns shown in the figure were
experimentally measured by [3], [49] (Figure 7) and [26] (Figure 8).
Both the quality of the interconnection and the nominal performance are
parametric functions of their time of construction, so one can assume on the
theory side that (in a limited period) the interconnection contribution was
changing as a function of nominal performance as shown in Figure 5.A. The
other major contribution is assumed to be the calculation itself (see
footnote 20). Benchmark calculation contributions for HPL and HPCG are very
different, so the sums of the respective component plus the interconnection
component are also very different. Given that at the beginning of the
considered period, the contribution from the HPCG calculation and that of the
interconnection were of the same order of magnitude, their sum changes only
marginally (see upper diagram lines), i.e., measured performance improved
only slightly. Because benchmark HPCG is communication-bound (and so are
real-life programs), their efficiency would be an order of magnitude worse.
The reason is Eq. (4): when supercomputers use all of their cores, the
achievable performance is not higher (or maybe even lower), only the power
consumption is higher (and the calculated efficiency is lower). As predicted:
"scaling thus put larger machines at an inherent disadvantage" [3]. The
cloud-like supercomputers have a disadvantage in the HPCG competition: the
Ethernet-like operation results in relatively high (1−α) values.
20 This time, accessing data is also included.
The case with HPL calculation is drastically different (see lower diagram
lines). Since in this case at the beginning of the considered period, the con-
tribution from interconnection is very much larger than that from the com-
putation, the sum of these two contributions changes sensitively as the speed
of the interconnection improves. As soon as the contribution from intercon-
nection decreases to a value comparable with that from the computation, the
decrease of the sum slows down considerably, and further improvement of in-
terconnection causes only marginal decrease in the value of resulting α (and
so only a marginal increase in payload performance).
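The saturation argued above can be illustrated numerically. The sketch below
uses the shape of Eq. (7) and reads the individual contributions as parts of
the non-parallelizable fraction (1 − α), as they are plotted in Figure 5;
that reading, the function name, and every number in the example are
assumptions chosen only for illustration.

```python
def payload_performance(n_cores: int, p_single: float, seq_net: float,
                        seq_compute: float, seq_others: float = 0.0) -> float:
    """Payload performance in the shape of Eq. (7).

    seq_* are the (assumed) contributions of networking, computation and
    everything else to the non-parallelizable fraction (1 - alpha).
    """
    one_minus_alpha = seq_net + seq_compute + seq_others
    return n_cores * p_single / (n_cores * one_minus_alpha + 1.0)


if __name__ == "__main__":
    n_cores, p_single = 10**7, 1.0e9   # hypothetical machine, 1 Gflop/s per core
    seq_compute = 1.0e-7               # hypothetical, kept constant
    previous = None
    for seq_net in (1e-5, 1e-6, 1e-7, 1e-8, 1e-9):  # improving interconnection
        p = payload_performance(n_cores, p_single, seq_net, seq_compute)
        gain = "" if previous is None else f"  (x{p / previous:.2f})"
        print(f"seq_net = {seq_net:.0e}: payload = {p:.2e} flop/s{gain}")
        previous = p
```

While the networking contribution dominates, every tenfold improvement of the
interconnection pays off almost fully; once it falls below the constant
computing contribution, the same improvement changes the payload performance
by only a few percent.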
Measured data enable us to draw the same conclusion, but one must consider
that here multiple parameters may have changed. Their tendency, however, is
surprisingly clear. Figure 5.B is actually a 2.5D diagram: the size of the
marks is proportional to the time passed since the beginning of the
considered period. A decade ago, the speed of the interconnection gave the
major contribution to $\alpha_{total}$. Enhancing it drastically in the past
few years increased efficacy. At the same time, because of stalled
single-processor performance, other technology components changed only
marginally. The calculation contribution to α from benchmark HPL remained
constant as a function of time, so the quick improvement of interconnection
technology resulted in a quick decrease of $\alpha_{total}$, and the relative
weights of $\alpha_{Net}$ and $\alpha_{Compute}$ reversed. The decrease in
the value of (1 − α) can be considered the result of the decreased
contribution from the interconnection.
However, the total α contribution decreased considerably only until αNet
reached the order of magnitude of αCompute. This match occurred in the first
4-5 years of the period shown in Figure 5B: the sloping line is due to the en-
hancement of interconnection. Then, the two contributors changed their role,
and the constant contribution due to computation started to dominate, i.e.,
the total α contribution decreased only marginally. As soon as computing
contribution took over the dominating role, the αtotal did not fall any more:
all measured data remained above that value of α. Correspondingly, the pay-
load performance improved only marginally (and due to factors other than
interconnection).
[Figure 5: two panels plotting (1 − α), from $10^{-8}$ to $10^{-2}$, versus
$R_{Peak}$ (Eflop/s), from $10^{-3}$ to $10^{0}$. Panel A,
"$\alpha_{total}$ and $\alpha_{interconnection}$, theory": curves
$\alpha_{Interconnect}$, $\alpha_{HPL}$, $\alpha_{HPCG}$,
$\alpha_{Interconnect+HPL}$, $\alpha_{Interconnect+HPCG}$. Panel B,
"$\alpha_{total}$ and $\alpha_{interconnection}$, measured": HPL data for
June 2009, 2011, 2013, 2015, 2017, and 2019; HPL-AI data for June 2020; HPCG
data for June 2017 and June 2020.]
Fig. 5 The effect of changing the dominating contribution. The left subfigure shows the-
oretical estimation, the right subfigure the corresponding measured data, as derived from
public database TOP500 [50] (only values for the first four supercomputers are shown).
When the contribution from interconnection drops below that of calculation, the value of
(1− α) (and the performance gain) get saturated.
At this point, as a consequence of the change of the dominating contributor,
it was noticed that the efficacy of benchmark HPL and the efficacy of
real-life applications started to differ by up to two orders of magnitude.
Because of this, the new benchmark program HPCG [51] was introduced, since
"HPCG is designed to exercise computational and data access patterns that
more closely match a different and broad set of important applications" [52].
Since the major contributor is computing itself, the different benchmarks
contribute differently, and since that time "supercomputers have two
different efficiencies" [14]. Indeed, if the dominating α contribution (from
the benchmark calculation) is different, then the same computer shows
different efficiencies depending on the computation it runs. Since that time,
the interconnection has provided a smaller contribution than the computation
of the benchmark. Due to that change, enhancing the interconnection
contributes mainly to the dark performance, rather than to the payload
performance.
5.4 Effect of reduced operand length
The so-called HPL-AI benchmark used Mixed Precision (see footnote 21) rather
than Double Precision computations. The name suggests that AI applications
may run on the supercomputer with that efficiency. However, the type of
workload does not change, so one can expect the same overall behavior for AI
applications, including Artificial Neural Networks (ANNs), as for
double-precision operands. For AI applications, limitations remain the same
as described above, except that when using Mixed Precision, efficiency shall
be better by a factor of 2...3.
21 Both names are rather inconsistent. On one side, the test itself has not
much to do with AI; it just uses the operand length common in Artificial
Intelligence (AI) tasks (HPL, similarly to AI, is a workload type). On the
other side, Mixed Precision is Half Precision: it is natural that for
multiplication, operands twice as long are used temporarily. That operations
are contracted is a different question.
Unfortunately, when using half-precision, the enhancement comes from
accessing less data in memory and using quicker operations on shorter
operands, instead of reducing the communication intensity (see footnote 22)
that defines efficiency. Similarly, exchanging data directly between
processing units [37] (without using global and even local memory) also
enhances α (and payload performance) [54], but it represents a (slightly)
different computing paradigm. Only the two mentioned measured data fall below
the limiting line of (1 − α) in Figure 5.
Recent supercomputers Fugaku [23] and Summit [53] provided their HPL
performance for both 64-bit and 16-bit operands. Of course, performance seems
to be much better with shorter operand length (at the same number of op-
erations, the total measurement time is much shorter). It was expected that
their performance should be four times higher when using four times shorter
operands. Power consumption data [53] underpin the expectations: power con-
sumption is about four times lower for four times shorter operands. Comput-
ing performance, however, shows a slighter performance enhancement: 3.01 for
Summit and 3.42 for Fugaku, because of the needed housekeeping.
In the long run, a $Time_X$ value comprises housekeeping and computation.
We assume that housekeeping (indexing, branching) takes the same fixed amount
of time for different operand lengths, and the other time contribution (data
delivery and bit manipulation) is proportional to the operand length. Given
that, according to our model, measured payload performance directly reflects
the sum of all contributions, we can assume that
$$\begin{aligned} Time_{16} &= F_0 + F_{16} \\ Time_{64} &= F_0 + 4 \cdot F_{16} \end{aligned} \qquad (8)$$
where $F_0$ is the time contribution from housekeeping (in a long-term run,
using benchmark HPL), and $F_{16}$ is the time contribution due to
manipulating 16 bits.
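Solving the two equations of Eq. (8) for the two unknowns makes explicit how
$F_{16}$ and $F_0$ follow from the two time values; this is only a sketch of
the algebra, not an additional result:

```latex
\begin{align*}
Time_{64} - Time_{16} &= 3 \cdot F_{16} \\
F_{16} &= \frac{Time_{64} - Time_{16}}{3} \\
F_{0} &= Time_{16} - F_{16} = \frac{4 \cdot Time_{16} - Time_{64}}{3}
\end{align*}
```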
Table 1 shows data published in the TOP500 list for supercomputers Fugaku
and Summit, together with their parameters calculated as described above.
$Eff_{64}$ and $Eff_{16}$ are calculated from the corresponding published
$R_{Max}/R_{Peak}$ values. Amdahl's parameter is calculated using Eq. (5),
for the two different operand lengths. As discussed, (1 − α) is the
sequential portion of the total measurement time. Assuming that the total
measurement time is the time unit, the limiting time of performing a floating
operation with 64-bit operands, $Time_{64}$ (on our arbitrary time scale), is
directly derived from the $(1-\alpha_{64})$ value. To get a $Time_{16}$ value
on the same scale, the measured $(1-\alpha_{16})$ value must be corrected for
the differing measurement time, i.e., divided by the measured performance
ratio (3.42 for Fugaku and 3.01 for Summit).
22 On the contrary, the relative weight of communication is increased in this way.
Table 1 Floating point characteristics of supercomputers Fugaku and Summit

Name    Eff64  Eff16  (1−α64)  (1−α16)  Time64   Time16   F16      F0
Fugaku  0.808  0.691  3.25e-8  6.12e-8  3.25e-8  1.79e-8  0.49e-8  1.3e-8
Summit  0.74   0.557  14.7e-8  33.2e-8  14.7e-8  11.0e-8  1.23e-8  9.77e-8
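The last three columns of Table 1 can be recomputed from the (1 − α) columns
and the measured performance ratios with a few lines. The sketch below
assumes only Eq. (8) and the correction by the performance ratio described in
the text; the function name is hypothetical, and small deviations from the
table come from rounding of the published values.

```python
def decompose(one_minus_alpha_64: float, one_minus_alpha_16: float,
              perf_ratio: float):
    """Return (Time16, F16, F0) on the time scale of the 64-bit measurement."""
    time64 = one_minus_alpha_64
    # Rescale the 16-bit value to the (longer) 64-bit measurement time.
    time16 = one_minus_alpha_16 / perf_ratio
    # Eq. (8): Time64 - Time16 = 3 * F16, and F0 = Time16 - F16.
    f16 = (time64 - time16) / 3.0
    f0 = time16 - f16
    return time16, f16, f0


if __name__ == "__main__":
    systems = {
        "Fugaku": (3.25e-8, 6.12e-8, 3.42),
        "Summit": (14.7e-8, 33.2e-8, 3.01),
    }
    for name, args in systems.items():
        time16, f16, f0 = decompose(*args)
        print(f"{name}: Time16 = {time16:.3g}, F16 = {f16:.3g}, "
              f"F0 = {f0:.3g}, F16/F0 = {f16 / f0:.2f}")
```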
The absolute values of the data in the last two columns of the table shall
not be compared directly with those of the other supercomputer (their
measurement times were different). Given that the task and the computing
model were the same, we can directly compare the ratios of the values
$F_{16}$ and $F_0$. The proportions of both the $F_0$ values and the
$F_{16}/F_0$ values show that housekeeping is much better for Fugaku than for
Summit. Given that their architectures are globally similar, the plausible
reason for the difference in their efficacy (and performance) is that in the
case of Summit, the processor core is in the role of a proxy (and in this way
it represents a bottleneck), while Fugaku uses "assistant cores".
Housekeeping increases latency and significantly decreases the performance of
the system. The plausible reason for the better $F_{16}$ value of Fugaku is
the clever use (and positioning! see [19]) of L2-level cache memories.
In the case of Summit, we also know the HPCG efficiency. In its case, the
$Time^{HPCG}_{64}$ value is $2.08 \cdot 10^{-5}$, i.e., more than a hundred
times higher than $Time^{HPL}_{64}$. Given that $F_{16}$ and $F_{64}$ are the
same in the case of the two benchmarks, the difference is caused by $F_0$.
In the long run, the different workload (iteration, more intensive
communication, "sparse" computation forcing different cache utilization)
forces a different $F_0$ value, and that leads to "different
efficiencies" [14] of supercomputers under different workloads.
In the case of Fugaku, only a fraction of its cores was used in the
benchmark HPCG, so only the achieved performance can be validated. It is very
plausible that the HPCG performance reached its roofline, and (because of the
higher number of cores) its real HPCG efficiency would be around that of
Taihulight. Anyhow, it would not be fair either to assume that speculation
or to accept the value measured at a different number of cores. There are no
measured data.
These data directly underpin that technology is (almost) perfect: the
contribution from the benchmark calculation HPCG-FP64 is by orders of
magnitude larger (see footnote 23) than the contribution from all the rest.
Recalling that that benchmark program imitates the behavior (as defined by
the resulting α) of real-life programs, one can see that the contribution
from other computing-related actors is about a thousand times smaller than
the contribution of the computation+communication.
The unique role of "mixed precision" efficiencies (a third kind of
efficiency of a supercomputer), see the red 'x' marks in Figure 5, deserves
special attention. Strictly speaking, the points cannot be correctly
positioned in the figure; they belong to a different scale (they are measured
on different HW). On one side, the same number of operations is performed,
using the same number of PUs. On the other side, four times less data are
transferred and manipulated. The nominal performance is expected to be four
times higher than in the case of using double-precision operands. Without
correcting for the more than three times shorter measurement (see below), the
efficiency mark is slightly above the corresponding value measured with
double-length operands (the relative weight of $F_0$ is higher); with
correction, it is somewhat below it. This difference, however, is noticeable
only with benchmark HPL; in the case of the HPCG workload, computation
(including operand length) has a marginal effect. In the former case, the
contributions of computation and communication are in the same range of
magnitude, and they compete for dominating the performance of the system. In
the latter case, communication dominates performance; computing has a
marginal role.
23 Recall here that cache behavior may be included.
5.5 Further efficiency values
The performance corresponding to $\alpha^{FP0}_{HPL}$ is slightly above
1 EFlops (when making no floating operations, i.e., rather Eops). Another
peak performance, reported (see footnote 24) when running genomics code on
Summit (by using a mixture of numerical precisions and mostly non-floating
point instructions), is 1.88 Eops, corresponding to
$\alpha^{FP0}_{Genom} = 1 \times 10^{-8}$. Given that those two values refer
to a different mixture of instructions, the agreement is more than
satisfactory.
Our simple calculations show that in the case of benchmark HPL, the FP0
values are in the order of the FP16 values, and that benchmark HPL is
computing-bound: reducing housekeeping (including communication) makes some
sense for "racing supercomputers", but for real-life applications it has only
a marginal effect. On the other side, increasing housekeeping (more
communication, such as in the case of benchmark HPCG or ANNs) degrades
apparent performance. At a sufficiently large amount of communication [55],
housekeeping dominates performance, and the contribution of FPX becomes
marginal. For large ANN applications, using FP16 operands makes no real
difference; their workload defines their performance (and efficiency).
5.6 Effect of the length of clock period
The behavior of time in computing systems parallels the quantal nature of
energy, known from modern science. Time in computing passes in discrete steps
rather than continuously. This difference is not noticeable under usual
conditions: both human perception and macroscopic computing operations are a
million times longer. Under the extreme conditions represented by many-many
core systems, however, the quantal nature is the source of the inherent
limitation of parallelized sequential systems. The fundamental issue is that
operations must be synchronized; asynchronous operation provides performance
advantages [56].
24 https://www.olcf.ornl.gov/2018/06/08/genomics-code-exceeds-exaops-on-summit-supercomputer/
The need to synchronize operations (including those of many-many proces-
sors) using a central clock signal is especially disadvantageous when attempt-
ing to imitate behavior of biological systems without such a central signal.
Although the intention to provide asynchronous operating mode was a major
design point [25], the hidden synchronization (mainly introduced by thinking
in conventional SW solutions) led to very poor efficiency [26] when the sys-
tem was attempting to perform its flagship goal: simulating functionality of (a
large part of) human brain.
As was discussed in section 5.1, performance also depends on the length of
the measurement time, because of fixed-time contributions. When making only
10-second measurements, the smaller denominator (compared to the HPL
benchmarking time) may result in up to $10^{3}$ times worse
$(1-\alpha_{eff})$ and performance gain values. The dominant limiting factor,
however, is a different one.
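The factor of about $10^{3}$ follows directly from the ratio of the two
measurement times. A minimal sketch, assuming that the fixed-time
contribution is the same in both runs (its value below is hypothetical and
cancels out) and using the 13,298 s HPL runtime quoted in section 5.1:

```python
def one_minus_alpha_eff(fixed_time_s: float, total_time_s: float) -> float:
    """Non-parallelizable fraction caused by a fixed-time contribution."""
    return fixed_time_s / total_time_s


if __name__ == "__main__":
    fixed = 1.0e-6  # hypothetical fixed-time contribution in seconds
    short_run = one_minus_alpha_eff(fixed, 10.0)      # 10 s measurement
    long_run = one_minus_alpha_eff(fixed, 13_298.0)   # HPL benchmarking time
    print(f"(1 - alpha_eff) is {short_run / long_run:.0f} times worse "
          f"for the 10 s run")
```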
In brain simulation, a 1 ms integration time (essentially sampling time) is
commonly used [26]. Biological time (when events happen) and computing time
(how much computing time is required to perform the computing operations that
describe the same) are not only different but also not directly related.
Working with "signals from the future" must be excluded. For this goal, at
the end of this period, the calculated new values of the state variables must
be communicated to all (interested) fellow neurons. This action essentially
introduces a "biology clock signal period", a million times longer than the
electronic clock signal period. Correspondingly, the achievable performance
is desperately low: fewer than $10^{5}$ neurons can be simulated, out of the
planned $10^{9}$ [26] (see footnote 25). For a detailed discussion
see [27,19,55].
Figure 6 depicts the experimental equivalent of Figure 4. In [26], power
consumption efficiency was also investigated. It is presumed that (to avoid
needless energy consumption) they performed the measurement at the point
where involving more cores increases the power consumption but does not
increase the payload simulation performance. This assumption resulted in the
"reasoned guess" for the efficiency of brain simulation in Figure 4. As using
AI workloads (for a discussion from this point of view see [57]) on
supercomputers is of growing importance, the performance gain of an AI
application can be estimated to be between those of HPCG and brain
simulation, closer to that of HPCG. As discussed experimentally in [58] and
theoretically in [57], in the case of neural networks (especially when an
improper layering depth is selected) the efficiency can be much lower (see
footnote 26).
25 Despite its failure, SpiNNaker2 is also under construction [21].
26 https://www.nextplatform.com/2019/10/30/cray-revamps-clusterstor-for-the-exascale-era/ :
"artificial intelligence, ... it's the most disruptive workload from an I/O
pattern perspective"

[Figure 6: "The rooflines of performance gain of supercomputers":
performance gain ($10^{2}$ to $10^{8}$) versus year (1994 to 2020). Series:
1st, 2nd, and 3rd by $R^{HPL}_{Max}$; best by $\alpha^{HPL}_{eff}$; 1st, 2nd,
and 3rd by $R^{HPCG}_{Max}$; best by $R^{HPCG}_{Max}$; brain simulation.]

Fig. 6 Performance gains of supercomputers modeled as "roofline" [40], as
measured with benchmarks HPL and HPCG (data taken from database TOP500 [41]);
the one for brain simulation is inferred from [26].

Recall that since AI nodes usually perform simple calculations compared to
the functionality of supercomputer benchmarks, their
communication/calculation ratio is much higher, making the efficacy even
worse. Our conclusions are underpinned by experimental research [58]:
– ”strong scaling is stalling after only a few dozen nodes”
– ”The scalability stalls when the compute times drop below the communica-
tion times, leaving compute units idle. Hence becoming an communication
bound problem.”
– ”network layout has a large impact on the crucial communication/compute
ratio: shallow networks with many neurons per layer . . . scale worse than
deep networks with less neurons.”
The massively "bursty" nature of the data (different nodes of the layer want
to use the communication at the same moment) also makes the case harder. The
commonly used global bus is overloaded with messages (for a detailed
discussion see [19]), which may lead to a "communicational collapse"
(demonstrated in Figure 5.(a) in [59]): at an extremely large number of
cores, exceeding the critical threshold of communication intensity leads to
an unexpected and drastic change of network latency.
[Figure 7: two panels. Left, "Dependence of $E_{HPL}$ on $(1-\alpha_{eff})$
and N": efficiency (0.2 to 1) versus $(1-\alpha_{eff})$ ($10^{-7}$ to
$10^{-5}$) and number of cores ($10^{4}$ to $10^{7}$) for Piz Daint at
2012/11, 2013/06, 2013/11, 2016/11, 2017/06, and 2018/11. Right, "Development
of $R^{HPL}_{Max}$ for Piz Daint Supercomputer": $R_{Max}$ (exaFLOPS,
$10^{-5}$ to $10^{-1}$) versus $R_{Peak}$ (exaFLOPS, $10^{-3}$ to $10^{0}$)
for the configurations Xeon E5-2670 (2012), Xeon E5-2670 (2013), Xeon E5-2670
+ NVIDIA K20x (2013), and Xeon E5-2690 + NVIDIA Tesla P100 (2016, 2017,
2018).]
Fig. 7 History of supercomputer Piz Daint in terms of efficiency and payload perfor-
mance [50]. Left subfigure shows how efficiency changed as developers proceeded towards
higher performance. Right subfigure shows the reported performance data (the bubbles),
together with diagram line calculated from the value as described above. Compare value of
diagram line to measured performance data in the next reported stage.
6 Accuracy and reliability of our model
As the parameters of our model are inferred from non-dedicated single-shot
measurements, their reliability is limited. One can verify, however, how well
our model predicts values derived from later measurements. Supercomputers
usually do not have a long lifespan with several documented stages. One of
the rare exceptions is the supercomputer Piz Daint. Its documented lifetime
spans over six years. During that period, different numbers of cores, without
and with acceleration, using different accelerators, were used.
Figure 7 depicts the performance and efficiency values published during its
lifetime, together with diagram lines predicting (at the time of making the
prediction) the values at higher nominal performance. The left subfigure
shows how changes made in the configuration affected its efficiency (the
timeline starts in the top right corner, and a line connects adjacent stages).
In the right subfigure, the bubbles represent data published in successive
editions of the TOP500 list, and the diagram lines crossing them are
predictions made from the corresponding snapshot. Each predicted value should
be compared to the value published in the next list. It is especially
remarkable that introducing GPGPU acceleration resulted in only a slight
increase (in good agreement with [60] and [15]) compared to the value expected
purely from the increase in the number of cores. Because more than one
parameter changed between our "samplings", the net effect of any single change
cannot be demonstrated clearly. Nevertheless, the measured data sufficiently
underpin our conclusions, within their limited validity, and show that the
theory correctly describes the tendency of the development of performance and
efficiency, and that even its predicted performance values are reasonably
accurate.
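The snapshot-based prediction described above can be sketched in a few lines.
The sketch assumes that efficiency follows the Amdahl-type form
E = 1/(N(1 − α) + α); this form is an assumption here, since this section does
not restate the model equations. The function names and all numeric inputs are
hypothetical illustrations, not TOP500 data.

```python
def one_minus_alpha(n_cores: float, efficiency: float) -> float:
    """Infer (1 - alpha_eff) from one snapshot of core count and efficiency.

    Assumed model (not restated in this section):
    E = 1 / (N*(1 - alpha) + alpha), i.e. E = 1 / (1 + (N - 1)*(1 - alpha)).
    """
    return (1.0 / efficiency - 1.0) / (n_cores - 1.0)


def predicted_efficiency(n_cores: float, oma: float) -> float:
    """Efficiency expected at n_cores if (1 - alpha_eff) stays unchanged."""
    return 1.0 / (1.0 + (n_cores - 1.0) * oma)


def predicted_rmax(r_peak: float, n_cores: float, oma: float) -> float:
    """Payload performance expected at the nominal performance r_peak."""
    return r_peak * predicted_efficiency(n_cores, oma)


# Hypothetical snapshot: 100,000 cores running at 80 % efficiency.
oma_snapshot = one_minus_alpha(100_000, 0.80)

# Hypothetical next stage: twice the cores and twice the nominal performance.
e_next = predicted_efficiency(200_000, oma_snapshot)
r_next = predicted_rmax(2.0, 200_000, oma_snapshot)  # r_peak in arbitrary units

print(f"(1 - alpha_eff) = {oma_snapshot:.3e}")
print(f"predicted efficiency = {e_next:.3f}, predicted R_max = {r_next:.3f}")
```

Comparing such a predicted R_max with the value published in the next list is
the check that the right subfigure of Figure 7 performs graphically.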
Introducing a GPU accelerator is a one-time performance increase step [15]
and cannot be taken into account by the theory. Notice that introducing the
accelerator increased payload performance but decreased efficiency (copying
data from one address space to another increases latency). Changing the
accelerator to another type with slightly higher performance (but higher
latency due to its larger GPGPU memory) caused a slight decrease in the
absolute performance because of the considerably lower efficiency.
7 Towards zettaflops
As detailed above, our theoretical model enables us to calculate payload
performance in first-order approximation at any nominal performance value. In
light of this, one can estimate the short-term and longer-term development of
supercomputer performance, see Figs. 8.A and 8.B. The diagram lines are
calculated using Eq. (4), with α parameter values derived from the TOP500 data
of the Summit supercomputer; the bubbles show measured values. From the bottom
up, the diagram lines show the double-precision floating-point HPCG and HPL
diagrams and the half-precision [53] HPL (HPL-AI) diagram. Given that the
parameter values are calculated from a snapshot, that the calculation is
essentially an extrapolation, and that at high nominal performance values the
need for a second-order approximation becomes more and more pressing, the
predictions shown in Figure 8 are rough and very optimistic approximations,
although they are somewhat similar to the real upper limit values.
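The first-order behavior described above can be illustrated numerically. The
following sketch assumes that Eq. (4) has the Amdahl-type form
P = N · P_single / (N(1 − α) + α); the single-core performance and the
(1 − α) value are hypothetical and are not the Summit parameters. Second-order
effects are not modeled.

```python
def payload_performance(n_cores: float, p_single: float, oma: float) -> float:
    """First-order payload performance of n_cores cores.

    Assumed Amdahl-type form: P = N * P_single / (N*(1 - alpha) + alpha),
    where oma stands for (1 - alpha).
    """
    nominal = n_cores * p_single
    return nominal / (1.0 + (n_cores - 1.0) * oma)


def saturation_level(p_single: float, oma: float) -> float:
    """Limit of the payload performance as the number of cores grows."""
    return p_single / oma


# Hypothetical values: 10 GFlops per core and (1 - alpha) = 1e-7.
P_SINGLE = 1e10
OMA = 1e-7

for n in (1e5, 1e6, 1e7, 1e8, 1e9):
    nominal = n * P_SINGLE
    payload = payload_performance(n, P_SINGLE, OMA)
    print(f"nominal {nominal:.1e} Flops -> payload {payload:.2e} Flops")
print(f"saturation level: {saturation_level(P_SINGLE, OMA):.1e} Flops")
```

With these inputs the payload performance follows the nominal performance at
small core counts and flattens as it approaches P_single / (1 − α), which is
the saturation that the diagram lines of Figure 8 display.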
In addition to the diagram lines based on measured and published performance
data, two more diagram lines, representing two more calculated α values, are
depicted. The 'FP0' (orange) diagram line is calculated with the assumption
that the supercomputer performs all the housekeeping needed to run the HPL
benchmark, but the actual FP operations are not performed. In other words, the
computer works with zero-bit-length floating-point operands (FP0) (see
footnote 27).
The 'Science' (red) diagram line is calculated with the assumption that
nothing is calculated and only science limits payload performance, through the
finite propagation time due to the finite speed of light (see footnote 28).
The 'ideal interconnection' diagram line should lie between the 'Science' and
'FP0' diagram lines. The nonlinearity of payload performance around the Eflops
nominal performance is visible; it depends both on the amount of
computing+communication and on the nominal performance (represented by the
number of cores).
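The size of the propagation-time contribution behind the 'Science' line can be
estimated with a short sketch. It uses the 100 m cable length assumed in
footnote 28 and the vacuum speed of light, so the result is a lower bound
(signals in real cables travel slower). The 2 GHz clock rate is a hypothetical
value chosen only for illustration.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # exact by definition of the metre


def propagation_time(cable_length_m: float,
                     signal_speed_m_per_s: float = SPEED_OF_LIGHT_M_PER_S) -> float:
    """One-way signal propagation time over a cable, in seconds."""
    return cable_length_m / signal_speed_m_per_s


def elapsed_clock_periods(delay_s: float, clock_hz: float) -> float:
    """Number of clock periods that elapse during delay_s."""
    return delay_s * clock_hz


delay = propagation_time(100.0)               # 100 m, as assumed in footnote 28
cycles = elapsed_clock_periods(delay, 2.0e9)  # 2 GHz is a hypothetical clock rate

print(f"one-way delay: {delay * 1e9:.1f} ns, about {cycles:.0f} clock periods")
```

Even with no computation at all, a core waiting for a message from the far end
of the system stays idle for hundreds of clock periods, which is why the
'Science' line also runs into saturation.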
Figure 8.B shows the farther future (in first-order approximation): towards
Zflops [7]. Unsurprisingly, all payload performance diagram lines run into
saturation, even the 'FP0' and 'Science' ones. For comparison, the
double-precision performance of Fugaku is also displayed. Recall that the
diagram lines are calculated in first-order approximation. In second-order
approximation, the diagram lines are expected to reach their inflection point
and break down. The current top supercomputers are near that point; this is
why their development has stalled.
27 The role of αFP0HPL is akin to the execution time of the "empty loop" in programming.
28 A cable length of 100 m was assumed, which means 10^6 processors per cm and
some GW of dissipation.
[Figure 8. Panel A: "Payload performances @Summit"; axes: nominal performance
(EFlops), payload performance (EFlops); diagram lines labeled by the source of
(1 − α): HPCG-FP64, HPL-FP64, HPL-FP16, HPL-FP0, Science. Panel B: "Payload
performances @Exascale"; same axes; diagram lines: HPCG-FP64, HPL-FP64,
HPL-FP0, HPL-FP64 (Fugaku), Science.]
Fig. 8 Tendency of the development of payload performance in the near and
farther future of supercomputing. The parameters of Summit are used for
illustration; for comparison, the double-precision performance diagram line of
Fugaku is also shown. The diagram lines are calculated from the theory; the
marks are the measured values from the TOP500 database [50] and from [53].
8 ’Quantum states’ of supercomputers
As discussed in detail in [4], the behavior of computing systems under extreme
conditions shows surprising parallels with the behavior of natural objects.
Indeed, "More Is Different" [28]. The behavior of supercomputers is somewhat
analogous to that of quantum systems, where a measurement selects one of the
possible states of the system (and, at the same time, kills all other possible
states). In computing, a supercomputer, as a general-purpose computing system,
has the potential for high performance defined by the impressive parameters of
its components. However, when we run a computation (that is, when we measure
the computing performance of our computer), that workload selects the best
possible combination of limitations, which defines the performance and kills
all other potential performances.
The logical dependence of the operation of components implicitly also means
their temporal dependence [19], and it introduces idle times into computing.
In this way, the workload defines how much of those potential abilities can be
used: datasheet values represent hard limits, and the workload sets the soft
limits. The workload defines a fill-out factor: it introduces different idle
times into the operation of the components and, in this way, imposes
workload-defined soft performance limits on the components of the
supercomputer. Different workloads impose different limitations (they use the
available resources differently), which gives a natural explanation of the
"different efficiencies" [14]. In other words, running any calculation
destroys the potentially achievable high performance defined individually by
the components.
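How two workloads can yield two efficiencies on the same hardware can be shown
with a minimal sketch. It assumes the Amdahl-type efficiency
E = 1/(N(1 − α) + α); the core count and both (1 − α) values are hypothetical
and only mimic a compute-bound and a communication-bound workload.

```python
def efficiency(n_cores: float, oma: float) -> float:
    """Amdahl-type efficiency, with oma standing for (1 - alpha)."""
    return 1.0 / (1.0 + (n_cores - 1.0) * oma)


N_CORES = 1_000_000  # hypothetical machine size

# Hypothetical (1 - alpha) values: a workload that introduces more idle time
# (more communication and synchronization) gets the larger value.
WORKLOADS = {
    "compute-bound (HPL-like)": 1e-7,
    "communication-bound (HPCG-like)": 1e-5,
}

efficiencies = {name: efficiency(N_CORES, oma) for name, oma in WORKLOADS.items()}
for name, value in efficiencies.items():
    print(f"{name}: efficiency = {value:.3f}")
```

The hardware and its datasheet values are identical in both cases; only the
workload-defined (1 − α) differs, and with it the soft performance limit.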
Benchmarking such computing systems introduces one more limiting component:
the computation needed by the benchmark itself. For floating-point
computations, the 'best possible' benchmark (the one producing the highest
figures of merit) is HPL. With the development of parallelization and
processor technology, the floating-point computation itself became the major
contributor defining the efficiency and performance of the system. Since the
benchmark measurement method is itself a computation, the measurable
floating-point payload performance value cannot be smaller than the value that
the benchmark procedure itself represents.
For real-life programs (such as HPCG), the workload-defined performance level
(saturation value) is already reached well below the Eflops nominal
performance, see Fig. 2. Further enhancements in technology, such as tensor
processors and the OpenCAPI connection bus, can slightly increase the
saturation level but cannot change the science-defined shape of the diagram
line. Supercomputers have reached their technical limitations; their
development has run out of steam. It is no longer worthwhile to keep enhancing
the components of a supercomputer that is meant to run any calculation without
changing its underlying computing paradigm. To enter the "next level", a real
renewal of the classic computing paradigm is needed [61–63,19].
9 Conclusion
The ironic remark that 'Perhaps supercomputers should just be required to have
written in small letters at the bottom on their shiny cabinets: Object
manipulations in this supercomputer run slower than they appear.' [14] is
becoming increasingly relevant. The impressive numbers about the performance
of their components (including single-processor and/or GPU performance and the
speed of interconnection) are becoming less relevant when going to the
extremes. Given that the most substantial α contribution today originates in
the computation the supercomputer runs, even the best possible benchmark, HPL,
dominates the floating-point performance measurement. Enhancing other
contributions, such as interconnection, results in only a marginal enhancement
of performance; i.e., the overwhelming majority of expenses increase "dark
performance" only. Because of this, the answers to the questions in the title
are: there are as many performance values as there are measurement methods
(which can also be varied by the portion of available cores used in the
measurement), and benchmarks actually measure mainly how much
mathematics/communication the benchmark program does, rather than the
supercomputer architecture (provided that all components deliver their
technically achievable best parameters).
References
1. S. H. Fuller and L. I. Millett, Eds., The Future of Computing Performance: Game Over
or Next Level? National Academies Press, Washington, 2011.
2. G. M. Amdahl, “Validity of the Single Processor Approach to Achieving Large-Scale
Computing Capabilities,” in AFIPS Conference Proceedings, vol. 30, 1967, pp. 483–485.
3. J. P. Singh, J. L. Hennessy, and A. Gupta, “Scaling parallel programs for multiproces-
sors: Methodology and examples,” Computer, vol. 26, no. 7, pp. 42–50, Jul. 1993.
4. J. Végh and A. Tisan, “The need for modern computing paradigm: Science applied to
computing,” in Computational Science and Computational Intelligence CSCI The 25th
Int’l Conf on Parallel and Distributed Processing Techniques and Applications. IEEE,
2019, pp. 1523–1532. [Online]. Available: http://arxiv.org/abs/1908.02651
5. J. Végh, “The performance wall of parallelized sequential computing: the roofline
of supercomputer performance gain,” Parallel Computing, in review, 2019. [Online].
Available: http://arxiv.org/abs/1908.02280
6. I. Markov, “Limits on fundamental limits to computation,” Nature, vol. 512(7513), pp.
147–154, 2014.
7. X.-K. Liao et al., “Moving from exascale to zettascale computing: challenges
and techniques,” Frontiers of Information Technology & Electronic
Engineering, vol. 19, no. 10, pp. 1236–1244, Oct. 2018. [Online]. Available:
https://doi.org/10.1631/FITEE.1800494
8. M. Feldman, “Exascale is Not Your Grandfather’s HPC,”
https://www.nextplatform.com/2019/10/22/exascale-is-not-your-grandfathers-hpc/,
2019.
9. US Government NSA and DOE, “A Report from the NSA-DOE Technical Meeting on
High Performance Computing,”
https://www.nitrd.gov/nitrdgroups/images/b/b4/NSA_DOE_HPC_TechMeetingReport.pdf,
December 2016.
10. R. F. Service, “Design for U.S. exascale computer takes shape,” Science, vol. 359, pp.
617–618, 2018.
11. European Commission, “Implementation of the Action Plan for the European High-
Performance Computing strategy,”
http://ec.europa.eu/newsroom/dae/document.cfm?doc_id=15269, 2016.
12. ExtremeTech, “Japan Tests Silicon for Exascale Computing in 2021,”
https://www.extremetech.com/computing/272558-japan-tests-silicon-for-exascale-computing-in-2021,
2018.
13. K. Bourzac, “Stretching supercomputers to the limit,” Nature, vol. 551, pp. 554–556,
2017.
14. IEEE Spectrum, “Two Different Top500 Supercomputing Benchmarks
Show Two Different Top Supercomputers,”
https://spectrum.ieee.org/tech-talk/computing/hardware/two-different-top500-supercomputing-benchmarks-show-two-different-top-supercomputers,
2017.
15. H. Simon, “Why we need Exascale and why we won’t get there by 2020,”
in Exascale Radioastronomy Meeting, ser. AASCTS2, 2014. [Online]. Available:
https://www.researchgate.net/publication/261879110_Why_we_need_Exascale_and_why_we_won’t_get_there_by_2020
16. J. L. Gustafson, “Reevaluating Amdahl’s Law,” Commun. ACM, vol. 31, no. 5, pp.
532–533, May 1988.
17. S. Krishnaprasad, “Uses and Abuses of Amdahl’s Law,” J. Comput. Sci. Coll., vol. 17,
no. 2, pp. 288–293, Dec. 2001.
18. Y. Shi, “Reevaluating Amdahl’s Law and Gustafson’s Law,”
https://www.researchgate.net/publication/228367369_Reevaluating_Amdahl’s_law_and_Gustafson’s_law,
1996.
19. J. Végh, “Introducing Temporal Behavior to Computing Science,” in 2020 CSCE,
Fundamentals of Computing Science. IEEE, 2020, pp. Accepted FCS2930, in print.
[Online]. Available: https://arxiv.org/abs/2006.01128
20. TOP500.org, “Intel dumps Knights Hill, future of Xeon Phi product line
uncertain,”
https://www.top500.org/news/intel-dumps-knights-hill-future-of-xeon-phi-product-line-uncertain/,
2017.
21. C. Liu, et al., “Memory-Efficient Deep Learning on a SpiNNaker 2 Proto-
type,” Frontiers in Neuroscience, vol. 12, p. 840, 2018. [Online]. Available:
https://www.frontiersin.org/article/10.3389/fnins.2018.00840
22. TOP500.org, “Retooled Aurora Supercomputer Will Be America’s First Exascale
System,”
https://www.top500.org/news/retooled-aurora-supercomputer-will-be-americas-first-exascale-system/,
2017.
23. J. Dongarra, “Report on the Fujitsu Fugaku System,” University of Tennessee
Department of Electrical Engineering and Computer Science, Tech. Rep.
ICL-UT-20-06, June 2020. [Online]. Available: http://bit.ly/fugaku-report
24. S. Kunkel, M. Schmidt, J. M. Eppler, H. E. Plesser, G. Masumoto, J. Igarashi, S. Ishii,
T. Fukai, A. Morrison, M. Diesmann, and M. Helias, “Spiking network simulation code
for petascale computers,” Frontiers in Neuroinformatics, vol. 8, p. 78, 2014.
25. S. B. Furber, D. R. Lester, L. A. Plana, J. D. Garside, E. Painkras, S. Temple, and
A. D. Brown, “Overview of the SpiNNaker System Architecture,” IEEE Transactions
on Computers, vol. 62, no. 12, pp. 2454–2467, 2013.
26. S. J. van Albada, A. G. Rowley, J. Senk, M. Hopkins, M. Schmidt, A. B. Stokes, D. R.
Lester, M. Diesmann, and S. B. Furber, “Performance Comparison of the Digital Neuro-
morphic Hardware SpiNNaker and the Neural Network Simulation Software NEST for
a Full-Scale Cortical Microcircuit Model,” Frontiers in Neuroscience, vol. 12, p. 291,
2018.
27. J. Végh, “How Amdahl’s Law limits performance of large artificial
neural networks,” Brain Informatics, vol. 6, pp. 1–11, 2019. [Online]. Available:
https://braininformatics.springeropen.com/articles/10.1186/s40708-019-0097-2/metrics
28. P. W. Anderson, “More Is Different,” Science, vol. 177, pp. 393–396, 1972.
29. D. Patterson and J. Hennessy, Eds., Computer Organization and design. RISC-V Edi-
tion. Morgan Kaufmann, 2017.
30. R. Waser, Ed., Advanced Electronics Materials and Novel Devices, ser. Nanoelectronics
and Information Technology. Wiley, 2012.
31. K. Hwang and N. Jotwani, Advanced Computer Architecture: Parallelism, Scalability,
Programmability, 3rd ed. Mc Graw Hill, 2016.
32. V. Weaver, D. Terpstra, and S. Moore, “Non-determinism and overcount on modern
hardware performance counter implementations,” in Performance Analysis of Systems
and Software (ISPASS), 2013 IEEE International Symposium on, April 2013, pp. 215–
224.
33. P. Molnár and J. Végh, “Measuring Performance of Processor Instructions and Oper-
ating System Services in Soft Processor Based Systems,” in 18th Internat. Carpathian
Control Conf. ICCC, 2017, pp. 381–387.
34. F. Ellen, D. Hendler, and N. Shavit, “On the Inherent Sequentiality of Concurrent
Objects,” SIAM J. Comput., vol. 43, no. 3, p. 519-536, 2012.
35. L. Yavits, A. Morad, and R. Ginosar, “The effect of communication and synchronization
on Amdahl’s law in multicore systems,” Parallel Computing, vol. 40, no. 1, pp. 1–16,
2014.
36. J. Végh and P. Molnár, “How to measure perfectness of parallelization in hard-
ware/software systems,” in 18th Internat. Carpathian Control Conf. ICCC, 2017, pp.
394–399.
37. F. Zheng, H.-L. Li, H. Lv, F. Guo, X.-H. Xu, and X.-H. Xie, “Cooperative computing
techniques for a deeply fused and heterogeneous many-core processor architecture,”
Journal of Computer Science and Technology, vol. 30, no. 1, pp. 145–162, Jan 2015.
38. M. Mohammadi and T. Bazhirov, “Comparative Benchmarking of Cloud Computing
Vendors with High Performance Linpack,” in Proceedings of the 2nd International
Conference on High Performance Compilation, Computing and Communications, ser.
HP3C. New York, NY, USA: ACM, 2018, pp. 1–5.
39. A. H. Karp and H. P. Flatt, “Measuring Parallel Processor Performance,” Commun.
ACM, vol. 33, no. 5, pp. 539–543, May 1990.
40. S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual perfor-
mance model for multicore architectures,” Commun. ACM, vol. 52, no. 4, pp. 65–76,
Apr. 2009.
41. TOP500, “November 2017 list of supercomputers,”
https://www.top500.org/lists/2017/11/, 2017.
42. C.-H. Hsu, J. A. Kuehn, and S. W. Poole, “Towards efficient supercomputing: search-
ing for the right efficiency metric,” in Proceedings of the 3rd ACM/SPEC Interna-
tional Conference on Performance Engineering, 2012, p. 1157-1162. [Online]. Available:
https://doi.org/10.1145/2188286.2188309
43. D. S. Martin, “Hardware and Software Techniques for Scalable
Thousand-Core Systems,” Ph.D. dissertation, Stanford University,
2012.
44. J. Dongarra, “Report on the Sunway TaihuLight System,” University of
Tennessee Department of Electrical Engineering and Computer Science,
Tech. Rep. Tech Report UT-EECS-16-742, June 2016. [Online]. Available:
http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf
45. D. Tsafrir, “The context-switch overhead inflicted by hardware interrupts (and the
enigma of do-nothing loops),” in Proceedings of the 2007 Workshop on Experimental
Computer Science, ser. ExpCS ’07. New York, NY, USA: ACM, 2007, pp. 3–3.
46. F. M. David, J. C. Carlyle, and R. H. Campbell, “Context Switch Overheads for Linux
on ARM Platforms,” in Proceedings of the 2007 Workshop on Experimental Computer
Science, ser. ExpCS ’07. New York, NY, USA: ACM, 2007. [Online]. Available:
http://doi.acm.org/10.1145/1281700.1281703
47. J. Végh, J. Vásárhelyi, and D. Drótos, “The performance wall of large parallel
computing systems,” in Lecture Notes in Networks and Systems 68. Springer, 2019,
pp. 224–237. [Online]. Available:
https://link.springer.com/chapter/10.1007%2F978-3-030-12450-2_21
48. J. Végh, “How Amdahl’s law restricts supercomputer applications and building
ever bigger supercomputers,” CoRR, vol. abs/1708.01462, 2018. [Online]. Available:
http://arxiv.org/abs/1708.01462
49. T. Ippen, J. M. Eppler, H. E. Plesser, and M. Diesmann, “Constructing Neuronal
Network Models in Massively Parallel Environments,” Frontiers in Neuroinformatics,
vol. 11, p. 30, 2017.
50. TOP500.org, “The top 500 supercomputers,” https://www.top500.org/, 2019.
51. J. Dongarra, M. A. Heroux, and P. Luszczek, “High-performance conjugate-gradient
benchmark: A new metric for ranking high-performance computing systems,” The
International Journal of High Performance Computing Applications, 2015. [Online].
Available: https://doi.org/10.1177/1094342015593158
52. HPCG Benchmark, “HPCG Benchmark,” http://www.hpcg-benchmark.org/, 2016.
53. A. Haidar, S. Tomov, J. Dongarra, and N. J. Higham, “Harnessing GPU Tensor Cores
for Fast FP16 Arithmetic to Speed Up Mixed-precision Iterative Refinement Solvers,”
in Proceedings of the International Conference for High Performance Computing, Net-
working, Storage, and Analysis, ser. SC ’18. IEEE Press, 2018, pp. 47:1–47:11.
54. Y. Ao, C. Yang, F. Liu, W. Yin, L. Jiang, and Q. Sun, “Performance Optimization
of the HPCG Benchmark on the Sunway TaihuLight Supercomputer,” ACM Trans.
Archit. Code Optim., vol. 15, no. 1, pp. 11:1–11:20, Mar. 2018.
55. J. Végh, “Which scaling rule applies to Artificial Neural Networks,” in Com-
putational Intelligence (CSCE) The 22nd Int’l Conf on Artificial Intelligence
(ICAI’20). IEEE, 2020, pp. Accepted ICA2246, in print. [Online]. Available:
http://arxiv.org/abs/2005.08942
56. G. P, Horn.J, J. He, A. Papageorgiou, and C. Poole, “IBM
CICS Asynchronous API: Concurrent Processing Made Simple,”
http://www.redbooks.ibm.com/redbooks/pdfs/sg248411.pdf , 2017.
57. J. Végh, How deep machine learning can be, ser. A Closer Look at Convolutional
Neural Networks. Nova, In press, 2020, pp. 141–169. [Online]. Available:
https://arxiv.org/abs/2005.00872
58. J. Keuper and F.-J. Pfreundt, “Distributed Training of Deep Neural Networks:
Theoretical and Practical Limits of Parallel Scalability,” in 2nd Workshop on Machine
Learning in HPC Environments (MLHPC). IEEE, 2016, pp. 1469–1476. [Online].
Available: https://www.researchgate.net/publication/308457837
59. S. Moradi and R. Manohar, “The impact of on-chip communication on memory tech-
nologies for neuromorphic systems,” Journal of Physics D: Applied Physics, vol. 52,
no. 1, p. 014003, oct 2018.
60. V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish,
M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey,
“Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing
on CPU and GPU,” in Proceedings of the 37th Annual International Symposium on
Computer Architecture, ser. ISCA ’10. New York, NY, USA: ACM, 2010, pp. 451–460.
[Online]. Available: http://doi.acm.org/10.1145/1815961.1816021
61. J. Végh, Renewing computing paradigms for more efficient parallelization of single-
threads, ser. Advances in Parallel Computing. IOS Press, 2018, vol. 29, ch. 13, pp.
305–330. [Online]. Available: https://arxiv.org/abs/1803.04784
62. J. Végh, “Introducing the Explicitly Many-Processor Approach,” Parallel Computing,
vol. 75, pp. 28 – 40, 2018.
63. ——, “How to extend the Single-Processor Paradigm to the Explicitly Many-Processor
Approach,” in 2020 CSCE, Fundamentals of Computing Science. IEEE, 2020, pp.
Accepted FCS2243, in print. [Online]. Available: https://arxiv.org/abs/2006.00532
