Statistical considerations on limitations of supercomputers by Végh, János
arXiv:1710.08951v3 [cs.DC] 28 Mar 2018
Abstract Building supercomputers is a game with many scenes and many players, comprising many different technologies, manufacturers and ideas. By checking the data available in the public database in a systematic way, some general tendencies and limitations can be identified, both for the past and for the future. The feasibility of building exa-scale computers, as well as their limitations and utilization, is also discussed. The statistical considerations provide strong support for the conclusions.
Keywords Supercomputer · efficiency · Limits
1 Introduction
Supercomputing now has a quarter-century history and a well-documented, verified database [8] of architectural and performance data. The huge variety of solutions and ideas, however, makes it hard to draw conclusions and, especially, to make forecasts for the future of supercomputing.
In section 2, Amdahl's law is reconsidered and interpreted for modern computing architectures, keeping an eye on measurability. By choosing the right merit [14] for their characteristics and utilizing a large number of reliable measured data [8], clear conclusions are drawn in section 3. After validating the method, some predictions for the near future are made in section 4 by extrapolating the tendencies.
Project no. 125547 has been implemented with the support provided from the National
Research, Development and Innovation Fund of Hungary, financed under the K funding
scheme.
J. Végh
University of Miskolc
Tel.: +36-46-565-111/1753
E-mail: J.Vegh@uni-miskolc.hu
2 Supercomputers and Amdahl’s law
Amdahl's law [1] on the joint performance of systems working in parallel "is one of the few, fundamental laws of computing" [7], yet it seems to be nearly forgotten in the field of supercomputing. As taught in introductory courses on parallel processing, some fraction of the computing job cannot be parallelized (i.e., cannot be distributed among the units working in parallel), and this fraction limits the achievable computing performance.
Amdahl only wanted to draw attention to the fact that the so-called Single-Processor Approach introduces serious limitations on computing performance (especially when a large number of processors is utilized in a parallel system); his successors, however, formulated his idea (commonly known as Amdahl's law) differently. A common misconception is to assume that Amdahl's law is valid for software only, and that the parallelizable fraction $\alpha$ is something like the ratio of the number of parallelizable instructions to the total number of instructions.
Amdahl's law is much more general, and is actually used in many different fields [5]. If the law is interpreted correctly, that is, in terms of the time needed for some activity rather than some fraction of the code, it should also describe the performance and the operating limits of supercomputers. Moreover, supercomputers are an excellent playground for checking the validity of Amdahl's law in the case of an extremely large number of processors.
2.1 Terms in Amdahl’s law
First, the notations used in [14] are introduced, and the ideas explained and illustrated in detail there are summarized. If $\alpha$ stands for the time fraction of the activity that can be outsourced to several processing units working in parallel, then all the rest, the $(1-\alpha)$ fraction, independently of its origin, falls into the category of non-parallelizable activity and (as discussed by Amdahl) appears as if it were sequential-only activity. If the parallelizable fraction is distributed among $k$ processing units, the achievable speedup $S$ is given by
$$S^{-1} = (1-\alpha) + \frac{\alpha}{k} \tag{1}$$
Multiplying the speedup by the absolute performance $P$ of one processor gives the (apparent) resulting performance. For $k \to \infty$, Equ. (1) yields $S \to 1/(1-\alpha)$, so the resulting performance is bounded by
$$P_{Max} = P \cdot \frac{1}{1-\alpha} \tag{2}$$
The factor $\frac{1}{1-\alpha}$ can be considered a kind of performance gain, or performance amplification factor. Equ. (2) is a theoretical upper limit for the performance (also of a supercomputer), which can only be achieved in the ideal case, as discussed in [14]. It usually cannot be computed beforehand, because $\alpha$ is not known in advance.
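As a minimal numerical sketch of Equ. (1) and Equ. (2), not part of the original analysis, the following Python snippet evaluates the resulting performance for a growing number of processing units and compares it with the limit $P/(1-\alpha)$. The function names and the values of `alpha` and `p_single` are illustrative assumptions only.

```python
def speedup(alpha: float, k: int) -> float:
    """Amdahl speedup for parallelizable time fraction alpha on k units, Equ. (1)."""
    return 1.0 / ((1.0 - alpha) + alpha / k)


def limiting_performance(p_single: float, alpha: float) -> float:
    """Upper limit of the resulting performance as k grows without bound, Equ. (2)."""
    return p_single / (1.0 - alpha)


if __name__ == "__main__":
    alpha = 0.999      # assumed parallelizable fraction (illustrative)
    p_single = 10.0    # assumed single-processor performance in Gflop/s (illustrative)
    for k in (10, 1_000, 100_000, 10_000_000):
        print(f"k = {k:>10d}  performance = {speedup(alpha, k) * p_single:10.2f} Gflop/s")
    print(f"limit for k -> infinity: {limiting_performance(p_single, alpha):.2f} Gflop/s")
```

However large $k$ is chosen, the printed performance stays below the limiting value, which is the point the text makes about the non-parallelizable fraction.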
However, on a ”black box” supercomputer one can measure RMax and it is
also known that RPeak = kP . Since
$$S = \frac{(1-\alpha) + \alpha}{(1-\alpha) + \alpha/k} = \frac{k}{k(1-\alpha) + \alpha} \tag{3}$$
and the efficiency
$$E = \frac{S}{k} = \frac{1}{k(1-\alpha) + \alpha} = \frac{R_{Max}}{R_{Peak}} \tag{4}$$
the measured payload performance also provides information on the "effective parallelization". That is, only a fraction of the nominal performance can be utilized as payload performance; the rest remains a kind of "dark performance". One can easily express the effective parallelization $\alpha_{eff}$ from the measured efficiency as
$$\alpha_{eff} = \frac{k}{k-1} \cdot \frac{S-1}{S} \tag{5}$$
or equivalently
$$\alpha_{eff} = \frac{Ek - 1}{E(k-1)} \tag{6}$$
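The step from measured efficiency to effective parallelization can be sketched in a few lines of Python. This is only an illustration of Equ. (4) and Equ. (6); the function names are hypothetical, and the example values ($E = 0.73$, $k = 10^7$) are round numbers of the magnitude mentioned later in the text, not entries taken from the database [8]. Because $\alpha_{eff}$ is extremely close to 1, the sketch also computes $(1-\alpha_{eff}) = (1-E)/(E(k-1))$ directly, which follows algebraically from Equ. (6).

```python
def efficiency(r_max: float, r_peak: float) -> float:
    """Efficiency E = RMax / RPeak, Equ. (4)."""
    return r_max / r_peak


def alpha_eff(e: float, k: int) -> float:
    """Effective parallelization from efficiency E and processor count k, Equ. (6)."""
    return (e * k - 1.0) / (e * (k - 1.0))


def one_minus_alpha_eff(e: float, k: int) -> float:
    """(1 - alpha_eff) computed directly, avoiding the subtraction of two nearly equal numbers."""
    return (1.0 - e) / (e * (k - 1.0))


if __name__ == "__main__":
    e = efficiency(r_max=0.73, r_peak=1.0)   # assumed round value
    k = 10_000_000                           # assumed round value
    print(f"alpha_eff       = {alpha_eff(e, k):.12f}")
    print(f"(1 - alpha_eff) = {one_minus_alpha_eff(e, k):.3e}")
```

Inserting the resulting $\alpha_{eff}$ back into Equ. (4) reproduces the efficiency, which is a quick consistency check on the two formulas.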
Using measured performance values published for supercomputers [8], αeff
values for the supercomputer configurations can be calculated, see Fig 2. Notice
that for a given configuration αeff depends on k.
2.2 A simple model for supercomputing
To understand the meaning of the values derived in this way, a simple model, shown in Fig. 1, is introduced. Although the model is empirical rather than technical, it can easily be converted to a technical model by slightly extending it and giving technical meaning to its terms. Note also that no communication is assumed here between the units working in parallel, but the model can be trivially extended to the case when the processors communicate (explicitly or implicitly, e.g., by sharing some resource). The model assumes that several components contribute to the total execution time, either as a simple sum of some components or as the largest of some components.
The access time is usually small: whether the time is measured on the parallelized system or outside of it, one must compensate for its contribution (in the case of supercomputers, it is usually negligible). The contribution of the executed program, $\alpha_{eff}^{SW}$, depends heavily on the nature of the program. The contributions due to the OS and the HW are tightly connected, so it is not easy to separate them without making dedicated measurements; at this level their joint contribution is handled as $\alpha_{eff}^{HW+OS}$. Within these contributions some parts may become critical, such as the looping delay $T_x$ due to utilizing an extremely large number of processors, or the propagation delay $PD_{xx}$ due to the large physical size of the supercomputer; they are mentioned separately, and in the technical model they shall be handled specifically. The time scale shown in the figure serves only for illustration; the actual contributions vary strongly with the actual conditions.

[Figure 1: diagram titled "Model of parallel execution". Vertical axis: time (not proportional), 0 to 10; horizontal axis: processors P0 to P4. Stages from top to bottom: AccessInitiation, SoftwarePre, OSPre; then, for each processor x, the looping delay Tx, the propagation delay PDx0, Process x and the propagation delay PDx1, with "Just waiting" periods; then OSPost, SoftwarePost, AccessTermination. Side brackets mark the Payload, Total and Extended times.]
Fig. 1 The extended Amdahl's model (somewhat idealistic)
From the figure, the meaning of $\alpha_{eff}$ can easily be identified as Payload/Total. The reason for the "dark performance" can also be identified: the ready-to-fire processing units are simply idle. The common mistake of handling the access time improperly can falsify the conclusions, although for long measurement times this effect can be neglected.
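One possible reading of the model in Fig. 1 can be sketched as follows: the sequential stages add up, while the parallel phase lasts as long as its slowest unit, and the Payload/Total ratio is then formed as described above. The class and function names, as well as all time values, are hypothetical and chosen only to mirror the labels of the figure; the sketch is not the technical model of [14].

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Unit:
    """Time components of one processing unit (all values hypothetical)."""
    start_delay: float  # T_x: looping delay until unit x is started
    pd_out: float       # PD_x0: propagation delay towards the unit
    process: float      # payload processing time of the unit
    pd_back: float      # PD_x1: propagation delay back from the unit


def total_time(pre: float, units: List[Unit], post: float) -> float:
    """Sequential parts are summed; the parallel phase is the largest per-unit time."""
    parallel_phase = max(u.start_delay + u.pd_out + u.process + u.pd_back for u in units)
    return pre + parallel_phase + post


def payload_fraction(pre: float, units: List[Unit], post: float) -> float:
    """Payload/Total ratio of the model."""
    payload = max(u.process for u in units)
    return payload / total_time(pre, units, post)


# Five units started one after the other, as P0 to P4 in Fig. 1.
units = [Unit(start_delay=0.1 * i, pd_out=0.2, process=5.0, pd_back=0.2) for i in range(5)]
pre = 1.0    # assumed SoftwarePre + OSPre time
post = 1.0   # assumed OSPost + SoftwarePost time
print("total time      :", total_time(pre, units, post))
print("payload fraction:", payload_fraction(pre, units, post))
```

In this toy setting, increasing the number of units lengthens the last unit's start delay, so the looping delay alone lowers the payload fraction even though every unit does the same amount of work.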
3 Performance and architecture checks
The available, rigorously validated database [8] enables reliable conclusions to be drawn, although the variety of component sources, the different technologies and ideas, and the interplay of different factors cause considerable scatter and require extremely careful analysis.
[Figure 2: scatter plot titled "Supercomputers, Top 500 1st-3rd". Horizontal axis: Year, 1994 to 2016; vertical axis: $(1-\alpha)$, logarithmic, $10^{-8}$ to $10^{-2}$. Series: 1st, 2nd, 3rd, Best α, Trend of $(1-\alpha)$; Sunway TaihuLight is marked.]
Fig. 2 The trend of the development of $(1-\alpha)$ in the past 25 years, based on the first three (by $R_{Max}$) and the first (by $(1-\alpha)$) supercomputers in the year in question.
3.1 Supercomputer timeline
As a quick test, Equ. (6) can be applied to data from [8], see Fig. 2. As shown, supercomputer history is about the development of effective parallelization, and Amdahl's law as formulated by Equ. (6) is to supercomputing what Moore's law is to the size of electronic components. (The effect of Moore's law is eliminated when calculating $R_{Max}/R_{Peak}$.) To understand the behavior of the trend line, recall Equ. (4): to increase the absolute performance, more processors must be included, and to maintain a reasonable efficiency, the value of $(1-\alpha)$ must be reduced accordingly. Note that the excellent performance of Taihulight shall be attributed to its special processor, deploying "Cooperative computing" [15].
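The requirement on $(1-\alpha)$ can be made explicit by rearranging Equ. (4). The following short derivation is a sketch added for illustration; it uses only the symbols already defined ($E$, $k$, $\alpha$), and the approximation in the last step assumes $k \gg 1$.

```latex
\begin{align}
\frac{1}{E} &= k(1-\alpha) + \alpha = (k-1)(1-\alpha) + 1 \\
(1-\alpha) &= \frac{1-E}{E\,(k-1)} \approx \frac{1-E}{E\,k} \qquad (k \gg 1)
\end{align}
```

At a fixed target efficiency $E$, the tolerable $(1-\alpha)$ is thus inversely proportional to the number of processors, which is consistent with the falling trend line in Fig. 2.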
3.2 Single-processor performance
A common myth is that, as suggested by Equ. (2), the trivial way to increase the absolute performance of a supercomputer is to increase the single-processor performance of its processors. Since single-processor performance has reached its limits, some kind of accelerator is frequently used for this goal. Fig. 3 shows how utilizing accelerators influences the ranking of supercomputers.

[Figure 3, left panel: horizontal axis: Ranking by HPL, 0 to 50; vertical axis: Processor performance (Gflop/s), 0 to 150. Right panel: horizontal axis: Ranking by HPL, 0 to 50; vertical axis: Performance amplification factor, logarithmic, $10^{5}$ to $10^{8}$. Series in both panels: Accelerated, Non-accelerated, GPU-accelerated, each with its regression line.]
Fig. 3 Correlation of the efficiency and the performance amplification with the ranking, for different acceleration methods.
As the left side of the figure depicts, the coprocessor-accelerated cores show the lowest performance; they really can benefit from acceleration (see footnote 2). GPU acceleration does increase the performance of processors, by a factor of 2-3; this increased performance, however, is about 40 to 70 times lower than the nominal performance of the GPU accelerator. This confirms the result of a former study, where an average factor of 2.5 was found [6]. The effect is attributed to the considerable overhead [2], and it was demonstrated that the computing performance can be considerably enhanced by improving the transfer performance. Indirectly, this research also proved that the operating principle itself (i.e., that the data must be transferred to and from the GPU memory; and recall that GPUs do not have cache memory) takes some extra time. In terms of Amdahl's law, this transfer time contributes to the non-parallelizable fraction, i.e., it increases $(1-\alpha_{eff})$ and thus decreases the achievable performance gain. See also Fig. 5.
The right side of the figure reveals this effect. The performance amplification factor of the GPU-accelerated systems is about ten times worse than that of the coprocessor-accelerated processors and about five times worse than that of the non-accelerated processors; i.e., the resulting efficiency is worse than with unaccelerated processors. This is a definite disadvantage when GPUs are used in systems with an extremely large number of processors, and it makes it at least questionable whether it is worth utilizing GPUs in supercomputers.
As the left side of the figure shows, for none of the processor types does the ranking of the supercomputer correlate with the type of acceleration. Essentially the same is confirmed by the right side of the figure: the performance gain decreases with the ranking position, because moving the data from one memory to another takes time.
2 The number of coprocessors is included in the total number of cores.
[Figure 4, left panel: horizontal axis: Ranking by HPL, 0 to 50; vertical axis: No of Processors/1e6, logarithmic, $10^{-1}$ to $10^{1}$. Right panel: horizontal axis: No of Processors/1e6, logarithmic, $10^{-1}$ to $10^{1}$; vertical axis: $(1-\alpha_{eff})$ by HPL, logarithmic, $10^{-7}$ to $10^{-5}$. Series in both panels: Data points, Regression TOP50, Regression TOP10.]
Fig. 4 Correlation of number of processors with ranking and effective parallelism with number of processors.
[Figure 5, left panel: horizontal axis: Number of processors, logarithmic, $10^{4}$ to $10^{7}$; vertical axis: Performance amplification factor, logarithmic, $10^{5}$ to $10^{7}$. Right panel: horizontal axis: Number of processors, logarithmic, $10^{3}$ to $10^{7}$; vertical axis: Efficiency, 0.4 to 1. Series in both panels: Sunway, PEZY, Spark, Power PC, Intel, Intel+NVIDIA, Intel+Intel.]
Fig. 5 Correlation of the performance amplification and the efficiency with the number of processors, for some Intel-based systems, with and without acceleration. For comparison, data for some other processors are also depicted.
3.3 Number of processors
Since the resulting performance depends both on the number of processors and on the effective parallelization, the two quantities are correlated in Fig. 4. As expected, in the TOP50 the better the ranking position, the higher the number of processors in the configuration, and, as outlined above, the more processors there are, the lower the required $(1-\alpha_{eff})$ is (provided that the same efficiency is targeted).
In the TOP10, the slope of the regression line changes sharply in the left figure, showing the strong competition for a better ranking position. Maybe this marks the cut line between the "race supercomputers" and the "commodity supercomputers". In the right figure, the TOP10 data points provide the same slope as the TOP50 data points, demonstrating that to produce a reasonable efficiency, an increasing number of cores must be accompanied by a proper decrease in the value of $(1-\alpha_{eff})$, as expected from Equ. (4), and furthermore that a good value of $(1-\alpha_{eff})$ is needed to achieve a good ranking.
[Figure 6, left panel: horizontal axis: Rank of supercomputer in 2000, 0 to 50; vertical axis: $(1-\alpha_{eff})$, logarithmic, $10^{-5}$ to $10^{-3}$; series: MPP in 2000, Cluster in 2000, each with its regression line. Right panel: horizontal axis: Rank of supercomputer in 2016, 0 to 50; vertical axis: $(1-\alpha_{eff})$, logarithmic, $10^{-7}$ to $10^{-5}$; series: MPP in 2016, Cluster in 2016, each with its regression line.]
Fig. 6 Dependence of $(1-\alpha_{eff})$ on the architectural solution of supercomputer in 2000 and 2016. Data derived using the HPL benchmark.
The effect of acceleration, discussed in section 3.2, can also be scrutinized under cleaner conditions, as a function of the number of cores rather than of the payload performance. As a further cleanup, only data for processors from the same manufacturer are depicted in Fig. 5. As shown, GPU acceleration results in rather poor values of both the performance amplification factor and the efficiency, even at processor numbers below $10^{5}$. In other words: deploying GPU-accelerated cores in supercomputers having millions of processors is a rather expensive way to make supercomputer performance worse.
3.4 Architectural solution
Another common myth is that the internal interconnection method can considerably enhance the effective parallelism. As shown in Fig. 6, both the mix of architectural solutions and the value of the parallelization efficiency have changed considerably with time. However, in neither year did one architectural solution differ significantly from the other; the slope is the same for both solutions, in both years. This means that the internal connection bandwidth is not a real bottleneck in improving performance. At the same time, $(1-\alpha_{eff})$ has improved considerably, independently of the architectural solution.
3.5 Benchmarking
According to the model, the SW (including the benchmark program) also contributes to the measured $(1-\alpha_{eff})$, and its contribution is different for different programs. Fortunately, the same benchmark program, HPL, has been used to qualify supercomputers since the beginning. HPL contributes only a small amount of overhead activity, so it is the best available estimator for describing the HW+OS environment of a supercomputer. Unfortunately, most real-life applications have a much higher SW contribution, so recently the benchmark HPCG has been suggested to imitate their behavior. Fig. 7 shows how $(1-\alpha_{eff})$ correlates with the number of processing units, for the two benchmark programs mentioned. The behavior is quite similar in the left and right figures, but the values differ by about two orders of magnitude. Because of this, it can be safely stated that HPCG measures the behavior of the program on the architecture rather than the architecture ($\alpha_{eff}^{HW+OS}$) itself. Notice also how the relative measured $\alpha_{eff}$ values change between the two benchmarks.

[Figure 7, left panel: horizontal axis: No of Processors/1e6, logarithmic, $10^{0}$ to $10^{1}$; vertical axis: $(1-\alpha_{eff})$ by HPL, logarithmic, $10^{-7}$ to $10^{-6}$. Right panel: horizontal axis: Number of processors/1e6, logarithmic, $10^{0}$ to $10^{1}$; vertical axis: $(1-\alpha_{eff})$ by HPCG, logarithmic, $10^{-5}$ to $10^{-4}$. Series in both panels: Data points, Regression TOP10.]
Fig. 7 Correlation of $(1-\alpha_{eff}^{HPL})$ and $(1-\alpha_{eff}^{HPCG})$ with the number of processors.

[Figure 8, left panel: horizontal axis: Ranking by HPL, 0 to 10; vertical axis: Ranking by HPCG, 0 to 10; series: Rankings, Regression of rankings. Right panel: horizontal axis: $(1-\alpha_{eff})$ by HPL, logarithmic, $10^{-8}$ to $10^{-6}$; vertical axis: $(1-\alpha_{eff})$ by HPCG, logarithmic, $10^{-5}$ to $10^{-4}$; series: $\alpha_{eff}$, Regression of $\alpha_{eff}$ values.]
Fig. 8 Correlation of ranking and $\alpha_{eff}$, derived using HPL and HPCG.
3.6 Ranking
For ranking, different merits can be used. One possible approach is to measure $R_{Max}$, using either the HPL or the HPCG benchmark. Of course, these two measurements lead to different rankings. Another possible approach is to rank by $\alpha_{eff}$, measured with either of the two benchmarks. Fig. 8 compares how these measurements correlate with each other. The data points in the left figure show no correlation, strongly supporting the statement that HPL measures the architecture while HPCG measures the SW contribution, so the two rankings are not correlated at all. In contrast, the two $(1-\alpha_{eff})$ values correlate strongly, although the dominating contribution changes the order of magnitude between the two axes.

[Figure 9, left panel: horizontal axis: No of Processors/1e6, logarithmic, 1 to 10; vertical axis: $R_{Max}/R_{Peak}$, 0.5 to 0.9; series: Efficiency for HPL, Regression TOP10. Right panel: horizontal axis: Number of processors/1e6, logarithmic, 1 to 10; vertical axis: $R_{Max}/R_{Peak}$, logarithmic, 0.005 to 0.05; series: Efficiency for HPCG, Regression TOP10.]
Fig. 9 Correlation of efficiency with the number of processors, for the TOP10 supercomputers in 2017. Left: results for benchmark HPL. Right: results for benchmark HPCG.
3.7 Efficiency
Although $R_{Max}/R_{Peak}$ measured with the HPL benchmark is an important feature of the HW+OS assembly, it is a reliable merit only when $\alpha_{eff}^{SW}$ is less than $\alpha_{eff}^{HW+OS}$. As long as HPL is used to rank supercomputers, architects keep the efficiency around 0.73, although in the case of Taihulight [4] (because of the extremely large number of processors) this is only possible by using special HW units, the MPEs [15]. In the case of real-life programs, however, $\alpha_{eff}^{SW}$ is about two orders of magnitude higher than $\alpha_{eff}^{HW+OS}$ (see Fig. 7), so the efficiency decreases steeply as the number of processors increases, see the right side of Fig. 9. Notice that the measured efficiency of Taihulight changes drastically: utilizing MPEs decreases $\alpha_{eff}^{HW+OS}$, which is considerable in the case of the HPL benchmark, but in the case of HPCG $\alpha_{eff}^{SW}$ dominates, so the effect of the MPEs is negligible.
It is important to notice that $\alpha_{eff}$ changes sensitively with the number of processors (see Fig. 7), while the efficiency does not. This mostly follows from the fact that supercomputers are ranked on the basis of the HPL benchmark. When changing to HPCG, the ranking, and accordingly the direction of development, will change, which will also affect these parameters.
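How strongly a hundredfold larger $(1-\alpha_{eff})$ affects the efficiency can be illustrated with Equ. (4). The sketch below is an assumption-laden example: the function name is hypothetical, the HPL-like value $3.3 \times 10^{-8}$ is the Taihulight figure quoted in section 4, and the HPCG-like value is simply taken two orders of magnitude higher, following the statement above rather than any measured entry.

```python
def efficiency(k: float, one_minus_alpha: float) -> float:
    """Efficiency E = 1 / (k (1 - alpha) + alpha), Equ. (4)."""
    alpha = 1.0 - one_minus_alpha
    return 1.0 / (k * one_minus_alpha + alpha)


HPL_LIKE = 3.3e-8    # (1 - alpha_eff) quoted in the text for Taihulight with HPL
HPCG_LIKE = 3.3e-6   # assumed: two orders of magnitude higher, as stated for HPCG

if __name__ == "__main__":
    for k in (1e5, 1e6, 1e7):
        print(f"k = {k:.0e}  E(HPL-like) = {efficiency(k, HPL_LIKE):.3f}  "
              f"E(HPCG-like) = {efficiency(k, HPCG_LIKE):.4f}")
```

With the HPL-like value the efficiency stays in the range targeted by architects up to about ten million processors, whereas with the HPCG-like value it drops to a few percent, in line with the two panels of Fig. 9.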
4 Future of supercomputers
[Figure 10: plot titled "RMax of Top10 Supercomputers for benchmark HPL". Horizontal axis: $R_{Peak}$ (exaFLOPS), logarithmic, $10^{-2}$ to $10^{0}$; vertical axis: $R_{Max}$ (exaFLOPS), logarithmic, $10^{-2}$ to $10^{-1}$. Curves: Taihulight, Tianhe-2, Piz Daint, Gyoukou, Titan, Sequoia, Trinity, Cori, Oakforest, K computer.]
Fig. 10 $R_{Max}$ performance of selected TOP10 (as of July 2017) supercomputers as a function of their peak performance $R_{Peak}$, for the HPL benchmark. The actual $R_{Peak}$ values are denoted by a bubble.

The race for achieving Eflop/s performance is continuing. Some conclusions can already be drawn from the presently existing implementations. From Fig. 3 one can conclude, optimistically, a single-processor performance $P$ of 50 Gflop/s for the near future. From Equ. (2) it follows that achieving 1 Eflop/s payload performance requires an effective parallelization of $(1-\alpha) = 5 \times 10^{-8}$. Compare these values to the case of Taihulight: with $P = 11.8$ Gflop/s and $(1-\alpha_{eff}) = 3.3 \times 10^{-8}$, the limiting top performance is about 0.4 Eflop/s. Even for the system with the best parameters, achieving that dream limit does not seem realistic.
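The two estimates above follow directly from Equ. (2) and can be checked with a few lines of Python. The function names are hypothetical; the numbers are the ones quoted in the text (50 Gflop/s, 1 Eflop/s, 11.8 Gflop/s and $3.3 \times 10^{-8}$).

```python
def required_one_minus_alpha(target_flops: float, p_single_flops: float) -> float:
    """(1 - alpha) needed so that P / (1 - alpha) reaches the target, from Equ. (2)."""
    return p_single_flops / target_flops


def limiting_performance(p_single_flops: float, one_minus_alpha: float) -> float:
    """Limiting performance P / (1 - alpha), Equ. (2)."""
    return p_single_flops / one_minus_alpha


if __name__ == "__main__":
    exa = 1e18
    print(f"required (1 - alpha) for 1 Eflop/s at 50 Gflop/s: "
          f"{required_one_minus_alpha(exa, 50e9):.1e}")
    print(f"limit for P = 11.8 Gflop/s, (1 - alpha) = 3.3e-8: "
          f"{limiting_performance(11.8e9, 3.3e-8) / exa:.2f} Eflop/s")
```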
4.1 Extrapolating the empirical parameters
One way to derive more accurate estimates of the performance limitations is to utilize the empirical model. Keeping all other parameters constant, the number of processors can be changed virtually. Fig. 10 depicts how the virtual versions of the present TOP10 supercomputers would approach the nominal 1 Eflop/s.
[Figure 11: horizontal axis: $R_{Peak}$ (exaFLOPS), logarithmic, $10^{-6}$ to $10^{-1}$; vertical axis: $R_{Max}$ (exaFLOPS), logarithmic, $10^{-6}$ to $10^{-1}$. Curves for $(1-\alpha_{eff})$ = $1 \times 10^{-8}$, $1 \times 10^{-7}$, $1 \times 10^{-6}$, $1 \times 10^{-5}$, $1 \times 10^{-4}$ and $3 \times 10^{-4}$, with markers labeled HPL and HPCG.]
Fig. 11 $R_{Max}$ performance as a function of the peak performance $R_{Peak}$, at different $(1-\alpha_{eff})$ values. Bubbles display measured values when using the HPL and HPCG benchmarks, for Taihulight and K computer, respectively.

To give a feeling for how the effective parallelization influences the measurable performance, Fig. 11 depicts what payload performance could be measured on that virtual Taihulight when running benchmark programs having different $(1-\alpha_{eff})$ values. This could be crucial when running real-life programs, when the need for communication between processing units arises, and especially when they must share some resource.
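The virtual scaling described above can be sketched by keeping the single-processor performance and $(1-\alpha_{eff})$ fixed and varying only the number of processors in Equ. (4). This is an illustrative reading of the procedure, not the code used to produce Figs. 10 and 11; the function name is hypothetical, and the Taihulight-like parameters are the values quoted in section 4.

```python
def payload_performance(r_peak: float, p_single: float, one_minus_alpha: float) -> float:
    """RMax of a virtual machine with RPeak = k * P, using the efficiency of Equ. (4)."""
    k = r_peak / p_single                      # virtual number of processors
    alpha = 1.0 - one_minus_alpha
    efficiency = 1.0 / (k * one_minus_alpha + alpha)
    return r_peak * efficiency


if __name__ == "__main__":
    p_single = 11.8e9                          # flop/s, value quoted for Taihulight
    for one_minus_alpha in (3.3e-8, 1e-6, 1e-4):
        for r_peak in (1e15, 1e17, 1e18):
            r_max = payload_performance(r_peak, p_single, one_minus_alpha)
            print(f"(1-alpha) = {one_minus_alpha:.1e}  RPeak = {r_peak:.0e}  RMax = {r_max:.2e}")
```

For small machines the payload performance follows the peak performance closely; as the virtual machine grows, it saturates at $P/(1-\alpha_{eff})$, the limit given by Equ. (2).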
4.2 Introducing a technical model
Based on the empirical model, some technical meaning can be attributed to the $\alpha_{eff}^{X}$ components. Without considering the technical specifications in detail, only the order of magnitude of the contributions can be estimated, but this is accurate enough to draw some qualitative conclusions, especially about the limiting values of the different contributions. The total $(1-\alpha_{eff})$ is about $3.3 \times 10^{-8}$, so one upper limiting value is known in advance: $(1-\alpha_{eff}^{SW})$ cannot be higher than that value.
To turn our empirical model into a technical one, data published in [3] are used. The 13,298-second benchmark runtime on the 1.45 GHz processors corresponds to about $2 \times 10^{13}$ clock periods. The absolutely necessary non-parallelizable activity is to start and stop the calculation. If starting and stopping a program on a zero-sized supercomputer without an OS could be done in 2 clock periods, then the absolute limit for $(1-\alpha)$ would be $10^{-13}$.
From the model it follows that two of the contributions can be critical when building "big" supercomputers. The OS looping contribution increases linearly with the number of processors, and the PD contribution increases linearly with the physical size of the computer. As depicted in Fig. 1, these contributions can be combined in such a way that small contributions from the OS are linked to large contributions from PD and vice versa. In any case, these two contributions also impose an upper bound on the absolute performance of supercomputers. Since either of them can be quite small, the limit is the lower of the two individual bounds.
To consider the PD bound, take a computer about 100 meters in size, having 1 GHz cores: the signal round-trip time is about $10^{-6}$ seconds, or $10^{3}$ clock periods. When a high-speed internal network is used, the message length makes no considerable contribution, and a network message exchange time (including the operating time of the HW) can be estimated as $10^{-5}$ seconds, or $10^{4}$ clock periods. Related to the roughly $10^{13}$ clock periods of the benchmark run, the absolute limit for $(1-\alpha)$ of a supercomputer with realistic size, but no operating system, is therefore $10^{-9}$.
An operating system must, however, be used. If one considers a context change with its roughly $10^{4}$ consumed cycles [9], the absolute limit is about $10^{-9}$, even on a zero-sized supercomputer. In addition, all cores must be manipulated through system calls; this contribution increases linearly with the number of cores, so the contribution from the OS can be dominant at a high number of cores.
For the 10 million processors of Taihulight, at least $10^{7}$ clock cycles must be used. Even if parameters could be passed in one clock cycle, for 10M parameter passings the absolute bound due to the OS looping contribution would be in the range of $10^{-6}$. This is surely the dominating contribution for such a large number of processors. Is something wrong with the model, then? The measurable $(1-\alpha_{eff})$ for Taihulight must not be lower than any of the contributions, including the one due to looping in the OS.
At this point one can understand the role of the modularization that some supercomputers utilize. In the case of Taihulight, 4 of the 260 cores serve as management processing elements (MPEs) [4,3], so only the processors (or core groups) rather than the individual cores need to be addressed; the rest is organized by the MPEs. On the one hand, this trick reduces the absolute computing performance of a processor by only 2%; on the other hand, it reduces the loop count by about two orders of magnitude, decreasing the contribution $(1-\alpha_{eff}^{OS})$ by two orders of magnitude and in this way making it possible to achieve an effective parallelization of $1 \times 10^{-8}$. Notice also that the processors [15] in Taihulight attempt to reduce the non-payload time through utilizing special OS operating modes on the system, which enable the application program to run without needing a context change.
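The order-of-magnitude bounds of this section can be reproduced by relating each non-payload contribution, expressed in clock periods, to the length of the benchmark run. The sketch below does only that; the function and variable names are hypothetical, the clock counts are the estimates given in the text, and dividing the 10M loop count by the 260 cores of a processor is one simple way to represent the "about two orders of magnitude" reduction attributed to the MPEs.

```python
RUNTIME_S = 13_298                    # benchmark runtime quoted in the text, in seconds
CLOCK_HZ = 1.45e9                     # processor clock quoted in the text
RUN_CLOCKS = RUNTIME_S * CLOCK_HZ     # about 2e13 clock periods


def bound(non_payload_clocks: float, run_clocks: float = RUN_CLOCKS) -> float:
    """Lower bound on (1 - alpha) caused by a non-payload activity of the given length."""
    return non_payload_clocks / run_clocks


contributions = {
    "start/stop, 2 clocks": 2,
    "network message exchange, 1e4 clocks": 1e4,
    "context change, 1e4 clocks": 1e4,
    "OS looping over 1e7 cores": 1e7,
    "OS looping over 1e7/260 processors": 1e7 / 260,   # assumed effect of MPE grouping
}

for name, clocks in contributions.items():
    print(f"{name:40s} (1 - alpha) >= {bound(clocks):.1e}")
```

The printed values agree with the estimates in the text to within a factor of two, which is the accuracy one can expect from such order-of-magnitude reasoning.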
4.3 Changing the computing model
Introducing MPEs decreased $(1-\alpha_{eff})$ and made it possible to build a supercomputer with 10M processors and, at the same time, reasonable efficiency. Using MPEs, however, violates computing paradigms: those "more equal" processors know that some other processors exist. As the above analysis demonstrated, (among others) the presently used Single-Processor Approach (SPA), that is, the computing paradigm itself, is a limiting factor in building larger supercomputers. The Explicitly Many-Processor Approach (EMPA) [10,12,13] enables fork-like handling of starting processing units, and in this way the OS looping contribution can be reduced from 10M cycles to 24, eliminating the most limiting obstacle to building supercomputers from even more processors.
This does not contradict Amdahl's law: if the processors can cooperate, $f(k)$ should be used instead of $k$ in Equ. (1), and the nature of $f(k)$ enables such drastic changes in the behavior of systems working in parallel. It looks like Amdahl was right in saying: "the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution".
After introducing EMPA, the context change becomes the largest contribution to $\alpha_{eff}^{OS}$. Through introducing a reasonable layering [11], this contribution can be lowered by orders of magnitude, making the propagation time PD the dominating contribution. That can be reduced by decreasing the physical size of supercomputers, say by using a 3D arrangement. With all the mentioned changes, in principle even Zflop/s supercomputers can be built. Without all those changes, however, even Eflop/s cannot be achieved.
5 Conclusions
The present technical implementations of supercomputers have practically reached their technical limits. The reliable database of supercomputer parameters can be used to draw reliable statistical conclusions on some parameters and limitations of supercomputers. Although extrapolating the tendencies makes it possible to make predictions for some future configurations, careful analysis reveals that the presently exclusively used Single-Processor Approach really forms an upper bound for the performance of supercomputers. The difficulties experienced in building ever-larger supercomputers are a matter of principle rather than of technology.
References
1. Amdahl, G.M.: Validity of the Single Processor Approach to Achieving Large-Scale
Computing Capabilities. In: AFIPS Conference Proceedings, vol. 30, pp. 483–485 (1967).
DOI 10.1145/1465482.1465560
2. Daga, M., Aji, A.M., Feng, W.-c.: On the efficacy of a fused CPU+GPU processor (or
APU) for parallel computing. In: 2011 Symposium on Application Accelerators in High-
Performance Computing, pp. 141–149 (2011). DOI 10.1109/SAAHPC.2011.29
3. Dongarra, J.: Report on the Sunway TaihuLight System. Tech. Rep. Tech Report UT-
EECS-16-742, University of Tennessee Department of Electrical Engineering and Com-
puter Science (2016)
4. Fu, H., Liao, J., Yang, J., Wang, L., Song, Z., Huang, X., Yang, C., Xue, W., Liu, F.,
Qiao, F., Zhao, W., Yin, X., Hou, C., Zhang, C., Ge, W., Zhang, J., Wang, Y., Zhou,
C., Yang, G.: The Sunway TaihuLight supercomputer: system and applications. Science
China Information Sciences 59(7), 1–16 (2016). DOI 10.1007/s11432-016-5588-7. URL
http://dx.doi.org/10.1007/s11432-016-5588-7
5. Krishnaprasad, S.: Uses and Abuses of Amdahl’s Law. J. Comput. Sci. Coll. 17(2),
288–293 (2001). URL http://dl.acm.org/citation.cfm?id=775339.775386
6. Lee, V.W., Kim, C., Chhugani, J., Deisher, M., Kim, D., Nguyen, A.D., Satish, N.,
Smelyanskiy, M., Chennupaty, S., Hammarlund, P., Singhal, R., Dubey, P.: Debunking
the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU
and GPU. In: Proceedings of the 37th Annual International Symposium on Computer
Architecture, ISCA ’10, pp. 451–460. ACM, New York, NY, USA (2010). DOI 10.1145/
1815961.1816021. URL http://doi.acm.org/10.1145/1815961.1816021
7. Paul, J.M., Meyer, B.H.: Amdahl’s law revisited for single chip systems. Inter-
national Journal of Parallel Programming 35(2), 101–123 (2007). DOI 10.1007/
s10766-006-0028-8. URL https://doi.org/10.1007/s10766-006-0028-8
8. TOP500.org: The top 500 supercomputers. https://www.top500.org/ (2016)
9. Tsafrir, D.: The context-switch overhead inflicted by hardware interrupts (and the
enigma of do-nothing loops). In: Proceedings of the 2007 Workshop on Experi-
mental Computer Science, ExpCS ’07. ACM, New York, NY, USA (2007). DOI
10.1145/1281700.1281704. URL http://doi.acm.org/10.1145/1281700.1281704
10. Végh, J.: EMPAthY86: A cycle accurate simulator for Explicitly Many-Processor
Approach (EMPA) computer (2016). DOI 10.5281/zenodo.58063. URL
https://github.com/jvegh/EMPAthY86
11. Végh, J.: Do we need cross layering activities or reasonable layering in computing sys-
tems? IEEE Design & Test, in review (2017)
12. Végh, J.: Introducing the explicitly many-processor approach. Parallel Comput-
ing 75, 28–40 (2018). DOI https://doi.org/10.1016/j.parco.2018.03.001. URL
https://www.sciencedirect.com/science/article/pii/S0167819118300577
13. Végh, J.: Renewing computing paradigms for more efficient parallelization of single-
threads, pp. 305–330. Advances in Parallel Computing. IOS Press (2017)
14. Végh, J., Molnár, P.: How to measure perfectness of parallelization in hardware/software
systems. In: 18th Internat. Carpathian Control Conf. ICCC, p. 394–399 121 (2017)
15. Zheng, F., Li, H.L., Lv, H., Guo, F., Xu, X.H., Xie, X.H.: Cooperative comput-
ing techniques for a deeply fused and heterogeneous many-core processor architec-
ture. Journal of Computer Science and Technology 30(1), 145–162 (2015). DOI
10.1007/s11390-015-1510-9. URL https://doi.org/10.1007/s11390-015-1510-9
