The need for modern computing paradigm: Science applied to computing by Végh, János
The need for modern computing paradigm:
Science applied to computing
János Végh
Kalimános BT
Debrecen, Hungary
Vegh.Janos@gmail.com
Abstract—More than a hundred years ago, 'classic physics' was
at its full power, with just a few unexplained phenomena, which
nevertheless led to a revolution and the development of 'modern
physics'. Today computing is in a similar position: it is a sound
success story with exponentially growing utilization, but with a
growing number of difficulties and unexpected issues as it moves
towards extreme utilization conditions. In physics, studying
nature under extreme conditions has led to the understanding of
relativistic and quantal behavior. Quite similarly, in computing
some phenomena observed under extreme (computing) conditions
cannot be understood on the basis of the 'classic computing
paradigm'. The paper draws attention to the fact that under
extreme conditions qualitatively different behaviors may be
encountered in both physics and computing, and pinpoints that
certain formerly unnoticed or neglected aspects make it possible
to explain new phenomena as well as to enhance computing
features. Moreover, an idea for implementing the modern computing
paradigm is proposed.
Index Terms—computing paradigm, performance limitation,
efficiency, parallelized computing, distributed computing
Type: Full/Regular Research Papers
Relevant symposium: 6th Annual Conf. on Computational
Science & Computational Intelligence (CSCI’19) — Dec 05-
07, 2019 — Las Vegas, Nevada, USA
Parallel & Distributed Computing (CSCI-ISPD)
Projects no. 125547 and 132683 have been implemented with the support
provided from the National Research, Development and Innovation Fund of
Hungary, financed under the K funding scheme. Also the ERC-ECAS support
of project 886183 is acknowledged.
I. INTRODUCTION
Initially computers were constructed with the goal to re-
duce the computation time of rather complex but sequential
mathematical tasks [1]–[4]. Today a computer is deployed for
radically different goals, where the mathematical computation
in most cases is just a very small part of the task. The majority
of the non-computational activities are reduced to or imitated
by some kind of computations, because this is the only activity
that the computer can do.
Because of this, the current computer architectures (derived
from the 70-year-old 'classic' paradigm) are not really suitable
for the goals they are expected to serve: to handle a varying
degree of parallelism dynamically; to work in multi-tasking mode;
to build systems comprising very many processors with good
efficacy; to cooperate with a large number of other processors
(both on-chip and off-chip); to monitor many external actions and
react to them in real time; to operate 'big data' processing
systems, etc. Not only is the efficacy of computing very low
because of the different performance losses [5], but more and
more limitations come to light [6].
Not only does computing quickly reach its performance limit
under extreme conditions [6], [7], but this performance
degradation is even more accelerated in systems comprising
parallelized sequential processors, which also introduce their
own limits [8]. Consequently, the following situations can be
encountered, among others in real-time multi-tasking systems.
The very rigid HW-SW separation leads to the phenomenon known as
priority inversion [9]; very large rigid architectures suffer
from frequent component errors; complex cyber-physical systems
must be equipped with excessive computing facilities to provide
real-time operation; delivering large amounts of data from big
storage centres to the places where the processing capacity is
concentrated constitutes a major challenge; etc. All these issues
have a common reason [10]: the classical computing paradigm,
which reflects the state of the art of computing 70 years ago.
Computing needs renewal [11].
It was discovered early that the age of conventional
architectures was over [12], [13]; the only question that
remained open was whether the game is over [14], too. When it
comes to parallel computing, today's technology is able to
deliver not only many, but "too many" cores [15]. Their computing
performance, however, does not increase linearly with the
number of cores [16], and only a few of them can be utilized
simultaneously [16]. Consequently, it is unreasonable to use a
high number of cores for such extreme system operation [17].
Approaching the limits of the classic paradigm of computing
has rearranged the technology ranking [18]. "New ways of
exploiting the silicon real estate need to be explored" [19],
and a modern computing paradigm is indeed needed.
One of the major issues is that "being able to run software
developed for today's computers on tomorrow's has reigned in
the cost of software development. But this guarantee between
computer designers and software designers has become a
barrier to a new computing era." [18] A new computing
paradigm that is an extension rather than a replacement of
the old paradigm has a better chance of being accepted, since it
provides a smooth transition from the age of old computing
to the age of modern computing.
II. ANALOGIES WITH THE CLASSIC VERSUS MODERN
PHYSICS
The case of computing is very much analogous to the case of
classic physics versus modern (relativistic and quantum)
physics. In the world we live in, it is rather counter-intuitive
to accept that as we move towards unusual conditions, the
addition of velocities behaves differently, energy becomes
discontinuous, and the momentum and the position of a particle
cannot be measured accurately at the same time. Under normal
conditions there is no notable difference whether or not the
non-classical principles are applied. However, as we get farther
from everyday conditions, the difference becomes more
considerable, and even leads to phenomena one can never
experience under usual, everyday conditions. The analogies are
not meant to imply a direct correspondence between certain
physical and computing phenomena. Rather, the paper draws
attention both to the fact that under extreme conditions
qualitatively different behavior may be encountered, and to the
fact that scrutinizing certain formerly unnoticed or neglected
aspects makes it possible to explain the new phenomena. Unlike in
nature, the technical implementation of the critical points can
be changed, and through this the behavior of computing systems
can also be changed. In this paper, only two of the affected
important areas of computing are discussed in more detail:
parallel processing and multitasking.
A. Analogy with the special relativity
In the above sense, there is an important difference between
the operation and performance of single-processor systems and
those of parallelized but sequentially working computer systems.
As long as just a few (thousands of) single processors are
aggregated into a large computer system, the resulting
performance corresponds (approximately) to the sum of the
single-processor performance values, similarly to the classic
rule of adding speeds (see also Table I). However, when
assembling larger computing systems (and approaching with their
performance "the speed of light" of computing systems, in the
range of millions of processors), the nominal performance starts
to deviate from the experienced payload performance: the
phenomenon known as efficiency appears. Moreover, there is not
just one efficiency [20]: the measurable efficiency depends on
the method of measurement (the benchmark program utilized).
Actually, the simple Amdahl's Law results in a two-dimensional
surface (see footnote 1), shown in Figure 1.

Physics                   | Computing
Adding of velocities      | Adding of performance
Classic                   | Classic
$v(t) = t \times g$       | $Perf_{total}(n) = n \times Perf_{single}$
$c$ = light speed         |
$t$ = time                | $n$ = number of cores
$g$ = acceleration        | $Perf_{single}$ = single-processor performance
$n$ = optical density     | $\alpha$ = parallelism
Modern (relativistic)     | Modern [21]
$v(t) = \frac{t \times g}{\sqrt{1 + \left(\frac{t \times g}{c/n}\right)^2}}$ | $Perf_{total}(n) = \frac{n \times Perf_{single}}{n \times (1-\alpha) + \alpha}$
TABLE I
THE "ADDING VELOCITIES" ANALOGIES BETWEEN THE CLASSIC AND
MODERN ARTS OF SCIENCE AND COMPUTING, RESPECTIVELY

[Figure 1: surface plot titled "Dependence of $E_{HPL}$ and
$E_{HPCG}$ on $(1-\alpha^{HPL}_{eff})$ and $N$". Axes: number of
cores ($10^5$ to $10^8$), $(1-\alpha^{HPL}_{eff})$ ($10^{-8}$ to
$10^{-6}$), efficiency ($10^{-2}$ to $10^0$). Marked data points,
TOP5'2018.11: Summit, Sierra, Taihulight, Tianhe-2, K computer.]
Fig. 1. The efficiency surface corresponding to the "modern paradigm",
see Table I, with some measured (with HPL and HPCG) efficiencies of
supercomputers. Recall that "this decay in performance is not a fault of the
architecture, but is dictated by the limited parallelism" [22].
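The "modern" performance formula of Table I can be tried out numerically. The following is a minimal sketch, not the author's code: the function names are hypothetical, and the non-parallelizable fraction (1 − α) is passed directly as an argument, which is an implementation choice made here to avoid losing precision for values such as 3.3 × 10^-8.

```python
"""Amdahl-type efficiency as written in Table I (illustrative sketch)."""


def efficiency(n_cores: float, one_minus_alpha: float) -> float:
    """Return R_Max / R_Peak = 1 / (N * (1 - alpha) + alpha).

    (1 - alpha) is passed directly: computing it from alpha would
    lose precision when alpha is very close to 1.
    """
    alpha = 1.0 - one_minus_alpha
    return 1.0 / (n_cores * one_minus_alpha + alpha)


def payload_performance(n_cores: float, one_minus_alpha: float,
                        perf_single: float = 1.0) -> float:
    """Return Perf_total(n) = n * Perf_single / (n * (1 - alpha) + alpha)."""
    return n_cores * perf_single * efficiency(n_cores, one_minus_alpha)


if __name__ == "__main__":
    # Parameter sets quoted with Fig. 2 of the paper.
    print(f"1993: efficiency = {efficiency(1e3, 1e-3):.2f}")
    print(f"2018: efficiency = {efficiency(1.06e7, 3.3e-8):.2f}")
```

With the two parameter sets given with Fig. 2 (10^3 cores at 1 − α = 10^-3, and 1.06 × 10^7 cores at 1 − α = 3.3 × 10^-8), the sketch yields efficiencies of about 0.5 and 0.74, matching the ratios quoted there.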
The phenomenon is surprisingly similar to increasing the
speed of a body with constant acceleration (see Figure 5): at
low time (or speed) values there is no noticeable difference
between the speeds calculated with and without the relativistic
correction, but as the limiting speed comes closer, the speed
of the accelerated body saturates. Whether one formulates it as
the mass increasing or as time slowing down, the essence is the
same: when approaching its (specific) limits, the behavior of
the subject changes drastically.
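The saturation described here can be made tangible with the two velocity formulas of Table I. This is a small sketch under stated assumptions: g = 9.81 m/s² is assumed for the acceleration, the limiting speed is taken as c/n with n the optical density as in Table I, and the function names are illustrative only.

```python
import math

C_LIGHT = 299_792_458.0  # speed of light in vacuum, m/s
G_ACCEL = 9.81           # ASSUMED value for the acceleration 'g', m/s^2


def v_classic(t: float, g: float = G_ACCEL) -> float:
    """Classic rule from Table I: v(t) = t * g."""
    return t * g


def v_relativistic(t: float, g: float = G_ACCEL, n: float = 1.0,
                   c: float = C_LIGHT) -> float:
    """Relativistic rule from Table I: t*g / sqrt(1 + (t*g / (c/n))^2)."""
    limit = c / n  # limiting speed in a medium of optical density n
    return t * g / math.sqrt(1.0 + (t * g / limit) ** 2)


if __name__ == "__main__":
    # Same time range as the horizontal axis of Fig. 5.
    for t in (1e5, 1e6, 1e7, 1e8):
        print(f"t = {t:.0e} s  classic = {v_classic(t):.3e} m/s  "
              f"n=1: {v_relativistic(t):.3e} m/s  "
              f"n=2: {v_relativistic(t, n=2.0):.3e} m/s")
```

At short times the classic and relativistic values are practically identical; at long times the relativistic value stays below c/n however long the acceleration lasts.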
Footnote 1: As can be seen from Figure 6, at an extremely high
number of cores (1 − α) also depends on the number of cores.

The performance measurements are simple time measurements: a
standardized set of machine instructions is executed (a large
number of times) and the known number of operations is divided
by the measurement time, both for single-processor and for
distributed parallelized sequential systems. In the latter case,
however, the joint work must also be organized, which is
implemented with extra machine instructions and takes extra
execution time. As Fig. 2 shows, in the parallel operating mode
both the software and the hardware contribute to the execution
time, i.e. both must be considered in Amdahl's Law. This is the
origin of the efficiency: one of the processors orchestrates the
joint operation while the others are waiting. After reaching a
certain number of processors, adding more processors no longer
increases the payload fraction: the first fellow processor has
already finished its task and is waiting while the last one is
still waiting for the start command. The physical size of the
computing system also matters: a processor connected to the first
one with a cable dozens of meters long must spend several hundred
clock cycles waiting (not to mention geographically distributed
computer systems, such as some clouds, connected through
general-purpose networks). Detailed calculations are given
in [24].

[Figure 2: diagram titled "Model of distributed parallel
processing". One axis: processors P0 to P4; other axis: time (not
proportional). Stages in order: AccessInitiation, SoftwarePre,
OSPre; then for each processor i = 0..4 the boxes T_i, PD_i0,
Process_i, PD_i1, with "Just waiting" periods; then OSPost,
SoftwarePost, AccessTermination. Marked time spans: Payload,
Total, Extended; $\alpha = Payload / Total$.]

Parameter           | 1993                  | 2018 (Sunway/Taihulight)
$\alpha$            | $1 - 10^{-3}$         | $1 - 3.3 \times 10^{-8}$
$N_{cores}$         | $10^3$                | $1.06 \times 10^7$
$R_{Max}/R_{Peak}$  | $\frac{1}{N \times (1-\alpha) + \alpha} = \frac{1}{10^3 \times 10^{-3} + 1} = 0.5$ | $\frac{1}{N \times (1-\alpha) + \alpha} = 0.74$
Total = $10^{13}$ clocks

Fig. 2. A general model of parallel operation. For better visibility, the lengths of the boxes are not proportional to the time the corresponding action needs.
The data are taken from [22] and [23].
The phenomenon itself has been known for decades [22]:
"Amdahl argued that most parallel programs have some portion of
their execution that is inherently serial and must be executed
by a single processor while others remain idle. . . . In fact,
there comes a point when using more processors . . . actually
increases the execution time rather than reducing it."
At present, however, the theory is almost forgotten, mainly due
to the quick development of parallelization technology and
the increase in single-processor performance.
During the past quarter of a century, the proportion of the
contributions has changed considerably: today the number of
processors is thousands of times higher than a quarter of a
century ago, and the growing physical size and the higher
processing speed have increased the role of the propagation
delay. As a result of this technical development, the same
phenomenon has returned in a technically different form at a
much higher number of processors. Due to this, the computing
performance cannot be increased above the performance defined by
the parallelization technology and the number of processors
(this is similar to stating that an object moving at the speed of
light cannot be accelerated further). Exceeding a certain
computing performance (using the classic paradigm and its classic
implementation) is prohibited by the laws of nature.
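The existence of such a bound follows directly from the performance formula of Table I. The short derivation below is a sketch that treats α as independent of n, which is a simplifying assumption: the paper itself notes that at extreme core counts (1 − α) also grows with the number of cores, which makes the real limit lower.

```latex
\begin{align}
Perf_{total}(n)
  &= \frac{n \times Perf_{single}}{n \times (1-\alpha) + \alpha}
   = \frac{Perf_{single}}{(1-\alpha) + \alpha / n}, \\
\lim_{n \to \infty} Perf_{total}(n)
  &= \frac{Perf_{single}}{1-\alpha},
  \qquad 0 < 1-\alpha \le 1 .
\end{align}
```

For example, with the value (1 − α) = 3.3 × 10^-8 quoted for Taihulight, the payload performance could not exceed about 3 × 10^7 times the single-processor performance, no matter how many cores were added.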
The "speed of light" limit is specific to the different
architectures, but the higher the achieved performance, the
harder it is to increase it further. It appears that an analysis
of whether some inherent performance bound exists has remained
out of sight in the feasibility studies in the USA [25], [26],
in the EU [27], in Japan [28], and in China [29].
In Fig. 3, the development of the payload performance of
some top supercomputers is depicted as a function of the year of
construction. In Fig. 4 the dependence of the payload
performance on the nominal performance is depicted for different
benchmark types, specifically for the commonly utilized High
Performance Linpack (HPL) and High Performance Conjugate
Gradients (HPCG) benchmarks. All diagram lines clearly show
the signs of saturation [30] (or rooflines [31]): the more
inter-processor communication is needed for solving the task,
the more "dense" is the environment and the smaller is the
specific "speed of light". The third "roofline" is rather a
guess [32], for the case of processor-based brain simulation,
where communication higher by orders of magnitude is required.
The payload performance of processor-based Artificial
Intelligence tasks, given the amount of communication involved,
is expected to lie between the latter two rooflines, closer to
the HPCG level.

[Figure 3: line chart. Horizontal axis: years (2010 to 2020);
vertical axis: RPeak (petaFLOPS), logarithmic, $10^0$ to $10^2$.
Series: Summit, Sierra, Trinity, Taihulight, Tianhe-2A,
K computer, Gyoukou, Piz Daint.]
Fig. 3. The RMax payload performance as a function of the year of
construction for different configurations. The performances of all
configurations seem to have their individual saturation value. The later a
supercomputer appears in the competition, the smaller is the performance
ratio with respect to its predecessor; the higher its rank, the harder it is
to improve its performance.
Supercomputers are computing systems stretched to the
limit [26], [33]. They are a reflection of engineering
perfection, which, in the case of the Summit supercomputer,
showed an increase of 17% in computing performance when adding
5% more cores (and fine-tuning after its quick startup) in the
first half year after its appearance, and another 3.5% increase
when adding 0.7% more cores after another half year. The one-time
appearance of Gyoukou is mysterious: it reached slot #4 on the
list using just 12% of its 20M cores, although the explicit
ambition was to be #1. These examples show that the lack of
understanding of the behaviour of computing systems under extreme
conditions led to false presumptions of fraud when reporting
computing performance [34]. Another champion candidate, the
supercomputer Aurora [35], was retargeted just weeks before its
planned startup, after years of building. In November 2017 Intel
announced that Aurora had been shifted to 2021. As part of the
announcement, Knights Hill was canceled, to be replaced by a "new
platform and new microarchitecture specifically designed for
exascale". The lesson learned was that a specific design is
needed for exascale.

[Figure 4: log-log plot. Horizontal axis: RPeak (exaFLOPS),
$10^{-6}$ to $10^{-1}$; vertical axis: RMax (exaFLOPS), $10^{-6}$
to $10^{-1}$. Legend entries, in extraction order: HPL,
$5 \times 10^{-7}$, $1 \times 10^{-5}$, HPCG, $1 \times 10^{-4}$,
$1.5 \times 10^{-3}$.]
Fig. 4. The payload performance $R_{Max}$ as a function of the nominal
performance $R_{Peak}$, at different $(1-\alpha_{eff})$ values. The figure
displays the measured values derived using the HPL (empty marks) and HPCG
(filled marks) benchmarks for the TOP15 supercomputers (as of June 2019).
The diagram lines marked HPL and HPCG correspond to the behavior of the
supercomputer Taihulight at $(1-\alpha_{eff})$ values $3.3 \times 10^{-8}$
and $2.4 \times 10^{-5}$, respectively. The uncorrected values of the new
supercomputers Summit and Sierra are shown as diamonds, and the same values
corrected for single-processor performance are shown as rectangles. The
black dots mark the performance data of the supercomputers JUQUEEN and K as
of June 2014, for the HPL and HPCG benchmarks, respectively. The saturation
effect can be observed for both the HPL and HPCG benchmarks. The shaded area
only highlights the nonlinearity. The red dot denotes the performance value
of the system used by [36]. The dashed line shows the plausible saturation
performance value of the brain simulation.
As Figure 4 shows (and as discussed in detail in [8]), the
deviation from the linear dependence is unnoticeable at low
performance values: here the "classic speed addition" is valid.
At extremely large performance values, however, the dependence
is strongly non-linear and specific to the measurement
conditions: here the performance limits manifest themselves. The
large scatter of the measured data is caused by the variety of
processor and connection types, design ideas, manufacturers,
etc. Nevertheless, the diagram reflects well the tendency that
the theory describes.
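The (1 − α_eff) values used in this section can be obtained from published data by inverting the efficiency formula of Table I. The sketch below shows that inversion; the function name is hypothetical, and treating the ratio R_Max/R_Peak as the measured efficiency is the reading assumed here.

```python
def one_minus_alpha_eff(efficiency: float, n_cores: float) -> float:
    """Invert E = 1 / (N*(1 - alpha) + alpha) for (1 - alpha).

    Substituting alpha = 1 - (1 - alpha) gives
        E = 1 / (1 + (N - 1) * (1 - alpha)),
    hence
        (1 - alpha) = (1/E - 1) / (N - 1).
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    if n_cores <= 1:
        raise ValueError("at least two cores are needed")
    return (1.0 / efficiency - 1.0) / (n_cores - 1.0)


if __name__ == "__main__":
    # Efficiency 0.74 at 1.06e7 cores, the Taihulight figures quoted
    # with Fig. 2 of the paper.
    print(f"(1 - alpha_eff) = {one_minus_alpha_eff(0.74, 1.06e7):.2e}")
```

For an efficiency of 0.74 at 1.06 × 10^7 cores this gives roughly 3.3 × 10^-8, in line with the HPL value quoted for Taihulight.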
B. Analogy with the general relativity
The mentioned decrease of the system's performance manifests
itself in the appearance of the performance wall [8], [31], a
new limitation due to parallelized sequential computing. To draw
a parallel with modern physics, it is known that objects with
extremely large masses behave differently from how we know they
behave under 'normal' conditions. In other words, the behavior of
large-scale 'matter' deviates largely from that of small-scale
'matter'. Another analogy to physics is 'dark silicon' [16],
whose behaviour resembles that of 'dark matter': the (silicon)
cores are there and usable, but (because of the thermal
dissipation) a large number of cores behaves differently.

[Figure 5: log-log plot titled "Relativistic speed of body
accelerated by 'g'". Horizontal axis: time (s), $10^5$ to $10^8$;
vertical axis: velocity (m/s), $10^6$ to $10^8$. Curves: v(t) for
n = 1 and v(t) for n = 2.]
Fig. 5. How the velocity of a body accelerated by g depends on time in the
relativistic approach; see also Table I. Compare to Figs. 3 and 6.
Moreover, parallel computing has introduced "dark
performance". Due to the classic computing principles, the first
core must speak to all fellow cores, and this non-parallelizable
fraction of the time increases with the number of cores [31],
[37]. The result is that the top supercomputers show an efficacy
of only around 1% when solving real-life tasks. A further
analogy can be made with "gravitational collapse": a
"communicational collapse" is demonstrated in Fig. 5(a) in [38],
showing that at an extremely large number of cores, exceeding a
critical communication intensity leads to unexpected and drastic
changes in the network latency.
The signal propagation time is also very similar to the finite
propagation speed of physical fields, and even the latency time
of the interfaces can be paired with creating and attenuating the
physical carriers. Zero-time on/off signals are possible in both
classical physics and classical computing, while in the
corresponding modern counterparts the time needed to create,
transfer and detect signals must be accounted for; the effect is
noticed as the time delay through wires (in this extended sense)
growing compared to the time of gating [7].
Figure 6 introduces a further analogy. The non-parallelizable
fraction (denoted in the figure by $\alpha^X_{eff}$) of the
computing task comprises components of different origin. As
already discussed, and as was noticed decades ago, "The inherent
communication-to-computation ratio in a parallel application
is one of the important determinants of its performance on
any architecture" [22], suggesting that communication can be a
dominant contribution to the system's performance. Figure 6.A
displays the case of minimum communication, and Figure 6.B that
of moderately increased communication (corresponding to real-life
supercomputer tasks). As the nominal performance increases
linearly while the efficiency decreases exponentially with the
number of cores, at some critical value, where an inflection
point occurs, the resulting performance starts to decrease. The
resulting large non-parallelizable fraction strongly decreases
the efficacy (or, in other words, the performance gain or
speedup) of the system [8], [39]. The effect was noticed
early [22], under different technical conditions, but was
forgotten due to the successes of the development of
parallelization technology.

[Figure 6: three log-log panels with the same layout (A: HPL,
B: HPCG, C: NN). Horizontal axis: RPeak (Eflop/s), $10^{-3}$ to
$10^0$; left vertical axis: $(1-\alpha^X_{eff})$, $10^{-10}$ to
$10^{-4}$; right vertical axis: $R^X_{Max}$ (Eflop/s), $10^{-5}$
to $10^0$; X = HPL, HPCG, NN. Legend: $\alpha_{SW}$,
$\alpha_{OS}$, $\alpha_{eff}$, RMax (Eflop/s).]
Fig. 6. Contributions $(1-\alpha^X_{eff})$ to $(1-\alpha^{total}_{eff})$ and
maximum payload performance $R_{Max}$ of a fictive supercomputer
($P = 1$ Gflop/s @ 1 GHz) as a function of the nominal performance. The blue
diagram line refers to the right-hand scale ($R_{Max}$ values), all others
($(1-\alpha^X_{eff})$ contributions) to the left-hand scale. The figure
purely illustrates the concepts; the displayed numbers are somewhat similar
to the real ones.
Figure 6.A illustrates the behavior measured with the HPL
benchmark. The looping contribution becomes remarkable
around 0.1 Eflops, and breaks down the payload performance
when approaching 1 Eflops (see also Fig. 1 in [22]).
Figure 6.B displays the behavior measured with the HPCG
benchmark. In this case the contribution of the application (thin
brown line) is much higher, while the looping contribution (thin
green line) is the same as above. Consequently, the achievable
payload performance is lower and the breakdown of the
performance is softer. The black dots mark the HPL and HPCG
performance of the computer used in [36], [40].
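The breakdown of the payload performance can be reproduced qualitatively with a toy model. The sketch below is an illustration only and not the model behind Fig. 6: it ASSUMES that the non-parallelizable fraction grows linearly with the number of cores, (1 − α)(N) = a + b·N, where a stands for the core-count-independent contributions and b·N for a "looping"-type contribution. The parameter values and function names are hypothetical.

```python
import math


def payload_with_growing_serial_part(n_cores: float, fixed_part: float,
                                     per_core_part: float,
                                     perf_single: float = 1.0) -> float:
    """Payload performance when (1 - alpha) = a + b * N (ASSUMED form)."""
    one_minus_alpha = fixed_part + per_core_part * n_cores
    # Amdahl-type efficiency, written as 1 / (1 + (N - 1) * (1 - alpha)).
    eff = 1.0 / (1.0 + (n_cores - 1.0) * one_minus_alpha)
    return n_cores * perf_single * eff


def turning_point(fixed_part: float, per_core_part: float) -> float:
    """Core count at which the payload performance is maximal.

    With D(n) = 1 + (n - 1)(a + b n), d/dn [n / D(n)] = 0 gives
    (1 - a) - b n^2 = 0, i.e. n* = sqrt((1 - a) / b).
    """
    return math.sqrt((1.0 - fixed_part) / per_core_part)


if __name__ == "__main__":
    a, b = 1e-8, 1e-16  # hypothetical illustration values
    n_star = turning_point(a, b)
    for n in (n_star / 10, n_star / 2, n_star, 2 * n_star, 10 * n_star):
        perf = payload_with_growing_serial_part(n, a, b)
        print(f"N = {n:.2e}  payload = {perf:.3e} x single-processor")
```

Under this assumption the payload performance first grows with the number of cores, peaks near sqrt((1 − a)/b), and then decreases although the nominal performance keeps growing linearly.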
C. Analogy with the quantum physics
Electronic computers are clock-driven systems, i.e. no
action can happen in a time period shorter than the length
of one clock period. The typical value of that "quantum of
time" today is in the nanosecond range, so in everyday
practice time seems to be continuous and the "quantal
nature of time" cannot be noticed. Some (sequential) non-payload
fraction of the total time is always present in parallelized
sequential systems: it cannot be smaller than the length of two
clock periods divided by the total measurement time, since
neither forking nor joining the other threads can take less than
one clock period.
Unfortunately, the technical implementation needs about ten
thousand times longer [41], [42]. The total time of the
performance measurement is large (typically hours) but finite,
so the non-parallelizable fraction is small but finite. As
discussed, the latter increases with the number of cores in the
system. Because of this, in contrast with the statement in [22]
that "the serial fraction . . . is a diminishing function of the
problem size", at a sufficiently large number of cores the serial
fraction even starts to dominate. Because of Amdahl's Law, the
absolute value of the computing performance of parallelized
systems inherently has an upper limit, and the higher the number
of aggregated processing units, the lower the efficiency.
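The order of magnitude of this lower bound is easy to estimate. The following sketch uses the figures mentioned in the text (two clock periods for forking and joining, and an implementation that takes about ten thousand times longer); the one-hour measurement at a 1 GHz clock is an assumed example, and the function name is hypothetical.

```python
def minimal_serial_fraction(total_clocks: float,
                            fork_join_clocks: float = 2.0,
                            implementation_factor: float = 1.0) -> float:
    """Lower bound of the non-payload (sequential) fraction.

    fork_join_clocks: forking and joining take at least one clock each.
    implementation_factor: how many times longer the real implementation
        takes than the theoretical minimum (about 1e4 per the text).
    """
    return fork_join_clocks * implementation_factor / total_clocks


if __name__ == "__main__":
    # ASSUMED example: a one-hour measurement at a 1 GHz clock.
    total = 3600.0 * 1e9
    ideal = minimal_serial_fraction(total)
    real = minimal_serial_fraction(total, implementation_factor=1e4)
    print(f"theoretical floor of (1 - alpha): {ideal:.1e}")
    print(f"with 1e4 times longer fork/join:  {real:.1e}")
    # The bound Perf_single / (1 - alpha) then caps the achievable speedup.
    print(f"corresponding speedup limit:      {1.0 / real:.1e}")
```

Even the theoretical floor is non-zero for any finite measurement time, so under Amdahl's Law the achievable speedup is bounded whatever the number of cores.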
Processor-based brain simulation provides "experimental
evidence" [32] that time in computing shows quantal behavior,
analogously to energy in physics. When simulating neurons using
processors, the ratio of the simulated (biological) time to the
processor time used to simulate the biological effect may differ
considerably, so to avoid working with "signals from the future",
periodic synchronization is required, which introduces a special
"biological clock cycle". The role of this clock period is the
same as that of the clock signal in clocked digital electronics:
whatever happens within this period happens "at the same time"
(see footnote 2).
Footnote 2: This periodic synchronization will be a limiting
factor in the large-scale utilization of processor-based
artificial neural chips [43], although, thanks to the ca.
thousand times higher "single-processor performance", only when
approaching the computing capacity of (part of) the brain.

Brain simulation (and, on a somewhat smaller scale, artificial
neural computing) requires intensive data exchange between the
parallel threads: the neurons are expected to tell the result of
their neural calculations periodically to thousands of fellow
neurons. The commonly used 1 ms "grid time" is, however, $10^6$
times longer than the 1 ns clock cycle common in digital
electronics [36]. Correspondingly, its influence on the
performance is noticeable. Figure 6.C demonstrates what happens
if the clock cycle is 5000 times longer: it causes a drastic
decrease in the achievable performance and strongly shifts the
performance breakdown toward lower nominal performance values.
As shown, the "quantal nature of time" in computing changes the
behavior of the performance drastically.
In addition, the thousands of times more communication
contributes considerably to the non-payload, sequential-only
fraction, so it further degrades the efficacy of the computing
system. What is worse, the neurons are expected to send their
messages at the end of the grid time period, causing a huge burst
of messages.
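The effect of the grid time can be illustrated with a very small time-stepped model. This is a sketch under simplifying assumptions, not a description of any actual simulator: every neuron is ASSUMED to send one message to each of its fan-out partners in every grid period, and all names are hypothetical.

```python
def clock_cycles_per_grid_period(grid_time_s: float = 1e-3,
                                 clock_period_s: float = 1e-9) -> float:
    """How many processor clock cycles fit into one 'biological clock cycle'."""
    return grid_time_s / clock_period_s


def simulate_bursts(n_neurons: int, fan_out: int,
                    n_grid_periods: int) -> list[int]:
    """Return the number of messages released at each synchronization point.

    Within a grid period the neurons compute independently; their
    results are only exchanged at the end of the period, so all
    messages of that period leave together as one burst.
    """
    bursts = []
    for _period in range(n_grid_periods):
        outbox = 0
        for _neuron in range(n_neurons):
            # Compute phase: ASSUMED that each neuron produces one result
            # per period, addressed to `fan_out` fellow neurons.
            outbox += fan_out
        # Synchronization point: everything queued is sent at once.
        bursts.append(outbox)
    return bursts


if __name__ == "__main__":
    print(f"clock cycles per grid period: {clock_cycles_per_grid_period():.0e}")
    print("burst sizes:", simulate_bursts(n_neurons=10_000, fan_out=1_000,
                                          n_grid_periods=3))
```

The burst size grows with the product of the number of neurons and the fan-out, while the time between bursts is fixed by the grid time; this is the communication pattern that the text identifies as degrading the efficacy.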
Not only is the achievable performance lower by orders of
magnitude, but the "communicational collapse" (see also [22])
also occurs at a nominal performance lower by orders of
magnitude. This is the reason why less than one percent of the
planned capacity can be achieved even by custom-built large-scale
ANN simulators [44]. Similarly, the SW- and HW-based simulators
show the same limitation [32], [36]. This is why only a few tens
of thousands of neurons can be simulated on processor-based brain
simulators (including both the many-thread software simulators
and the purpose-built brain simulator), as found in [31]. The
memory of extremely large supercomputers can be populated with
objects simulating neurons [45], but as soon as they need to
communicate, the task collapses as predicted in Fig. 6. This
is indirectly underpinned [40] by the facts that handling the
threads differently changes the efficacy sensitively and that
the time required for a more detailed simulation increases
non-linearly [32], [36].
D. Analogy with the interactions of particles
In classic computing, the ability of processors to communicate
with each other is not a native feature: in the Single Processor
Approach, questions like sending messages to and receiving them
from some other party, as well as sharing resources, make no
sense at all (as no other party exists); messaging is very
ineffectively imitated by SW in the layer between the HW and the
real SW. This feature alone prevents building exascale
supercomputers [8]: after reaching a critical number of
processors, adding more processors leads to a decreasing
resulting performance [8], [32], as experienced in [22], [40],
and causes demonstrative failures such as the cases of Gyoukou,
Aurora or SpiNNaker. This critical number (using the present
technology and implementation) is under 10M cores; the only
exception (as of the end of 2019) is Taihulight, because it uses
a slightly different computing principle with its cooperating
processors [46].
According to the laws of parallel computing, the more
communication takes place, the more the actual behavior of
computing systems differs from that expected on the basis of
classical computing [31]. Similarly, in physics the behavior of
an atom is strongly changed by its interaction (communication)
with other particles.
This phenomenon cannot be explained within the framework of
"classic computing". The limits of single-processor performance
enforced by the laws of nature [7] are topped by the limitations
of parallel computing [8], [31], and the performance is further
limited by introducing the "biological clock period" [32]. Notice
that these contributions compete with each other; the actual
circumstances decide which of them will dominate. Their effect,
however, is very similar: according to Amdahl, what is not
parallel is qualified as sequential.
E. Analogy with the uncertainty principle
Even the quantum physical uncertainty principle, which
states that (unlike in classical physics) one cannot accurately
measure certain pairs of physical properties of a particle
(like the position and the momentum) at the same time, has
its counterpart in computing. Using registers (and caches and
pipelines), one can perform computations at much higher
speed, but to service an interrupt, one has to save and restore
registers and renew the cache content. The case is similar with
accelerators: copying data from one memory to another or
dealing with coherence increases latency. That is, one cannot
have low latency and high performance at the same time using
the same processor design principles. The same processor
design principles cannot be equally good for building
high-performance single-thread applications and high-performance
parallelized sequential systems.
III. THE CLASSIC VERSUS MODERN PARADIGM
Today we have extremely inexpensive (and at the same time
extremely complex and powerful) processors around (a "free
resource" [44]), and we have come to the age when no additional
reasonable functionality can be implemented in processors by
adding more transistors; the over-engineered processors
optimized for the single-processor regime do not enable reducing
the clock period [47]. The computing power hidden in many-core
processors cannot be utilized effectively for payload work
because of the "power wall" (partly because of the improper
working regime [48]): we have arrived at the age of "dark
silicon" [16], and we have "too many" processors [15] around.
Supercomputers face critical efficiency and performance issues;
real-time (especially cyber-physical) systems experience serious
predictability, latency and throughput issues; in summary,
computing performance (without changing the present paradigm)
has reached its technological bounds. Computing needs
renewal [11]. Our proposal, the Explicitly Many-Processor
Approach (EMPA) [49], is to introduce a new computing paradigm
and through that to reshape the way in which computers are
designed and used today.
A. Overview of the modern paradigm
The new paradigm is based on making fine distinctions at
some points that are also present in the old paradigm. For each
of those points, however, it must be scrutinized in all occurring
cases whether and how long it can be neglected. These points are:
• consider explicitly that not only one processor (aka
Central Processing Unit) exists, i.e.
– the processing capability (akin to the data storage
capability) is one of the resources rather than a
central singleton
– not necessarily the same processing unit (out of the
several identical ones) is used to solve all parts of
the problem
– a kind of redundancy (an easy method of replac-
ing a flawed processing unit) through using virtual
processing units is provided (mainly to increase the
mean time between technical errors), like [50], [51]
– different processors can and must cooperate in solving
a task, i.e. direct data and control exchange
among the processing units is made possible; the
ability to communicate with other processing units,
like [46], is a native feature
– flexibility for making ad-hoc assemblies for more
efficient processing is provided
– the large number of processors is used for unusual
tasks, like replacing memory operations by using
additional processors
• the misconception of the segregated computer compo-
nents is reinterpreted
– the efficacy of utilization of the several processors
is increased by using multi-port memories (similar
to [52])
– a ”memory only” concept (somewhat similar to that
in [53]) is introduced (as opposed to the ”registers
only” concept), using purpose-oriented, optionally
distributed, partly local, memory banks
– the principle of locality is introduced into memory
handling at hardware level, through introducing hi-
erarchic buses
• the misconception of the ”sequential only” execution [54]
is reinterpreted
– von Neumann required only ”proper sequencing” for
the single processing unit; this is extended to several
processing units
– the tasks are broken into reasonably sized and logically
interconnected fragments, rather than being fragmented
unreasonably by the scheduler
– the “one-processor-one process” principle remains
valid for the task fragments, but not necessarily
for the complete task
– the fragments can be executed in parallel if both
data dependence and hardware availability enable
it (another kind of asynchronous computing [55])
• a closer hardware/software cooperation is elaborated
– the hardware and software only exist together (akin
to ”stack memory”)
– when a hardware unit has no duty, it can sleep (it “does not
exist” and does not take power)
– the overwhelming part of the duties of synchroniza-
tion, scheduling, etc. of the OS are taken over by the
hardware
– the compiler helps the processor with compile-time
information and the processor is able to adapt (con-
figure) itself to the task depending on the actual
hardware availability
– strong support for multi-threading and resource shar-
ing, as well as low real-time latency is provided, at
processor level
– the internal latency of the assembled large-scale
systems is much reduced, while their performance
is considerably enhanced
– the task fragments are able to return control voluntarily,
without the intervention of the operating system
(OS), which enables implementing more effective and
simpler operating systems
B. Details of the concept
Our proposal introduces a new concept that permits working
with virtual processors at programming level and
mapping them to physical cores at runtime level, i.e. it lets
the computing system adapt itself to the task. A major idea
of EMPA (for an early and less mature version see [49]) is
to use the quasi-thread (QT) as the atomic unit of processing that
comprises both the HW (the physical core) and the SW (the
code fragment running on the core). The idea was derived
with the best features of both the HW core
and the SW thread in mind. In analogy again: the QTs have a “dual
nature”: in the HW world of ‘classic computing’ they
are represented as a ‘core’, in the SW world as a ‘thread’.
However, they are the same entity in the sense of ‘modern
computing’. The terms ‘core’ and ‘thread’ are borrowed from
conventional computing, but in ‘modern computing’
they can actually exist only together, in a time-limited way³.
EMPA is a new computing paradigm that needs a new
underlying architecture, rather than a new kind of parallel
processing running on a conventional architecture. It can therefore
be reasonably compared to the terms and ideas used in
conventional computing only in a very limited way, although
many of its ideas and solutions are adapted from ‘classic
computing’.
The executable task is broken into reasonably sized and
loosely dependent QTs. (The QTs can optionally be embedded
into each other, akin to subroutines.) In EMPA, every new
QT also implies a new independent processing unit (PU):
its internals (PC and part of the registers) are set up properly,
and the PUs execute their tasks independently⁴ (but under the
supervision of the processor comprising the cores).
³ Akin to dynamic variables on the stack: their lifetime is limited to the
period when the HW and SW are properly connected. The physical memory
is always there, but it is “stack memory” only when properly handled by the
HW/SW components.
⁴ Although the idea of executing the single-thread task “in pieces” may
look strange at first, actually the same happens when the
OS schedules/blocks a task. The key differences are that in EMPA not the
same processor is used, the task is cut into fragments in a reasonable way
(preventing issues like priority inversion [9]), and the QTs can be processed at
the same time as long as their mathematical dependence and the actual HW
resource availability enable it.
In other words, the processing capacity is considered a
resource in the same sense as the memory is considered
a storage resource. This approach enables programmers
to work with virtual processors (mapped to physical PUs
by the computer at run time), and they can utilize the quick
resource (PUs) where it can replace the slow resource
(memory); say, hiring a quick processor from a core
pool can be competitive with saving and restoring registers
in the slow memory, for example when making a recursive
call. The third major idea is that the PUs can cooperate in
various ways, including data and control synchronization, as
well as outsourcing part of the received job (received as an
embedded QT) to a helper core. An obvious example is
outsourcing the housekeeping activity of loop organization to
a helper core: counting, addressing, comparing, etc. can be
done by the helper core, while the main calculation remains with
the originally delegated core. As the mapping to physical cores
occurs at runtime (depending on the actual HW availability),
the processor can exclude the (maybe temporarily) denied
cores, as well as adapt the resource need of the task (requested by
the compiler) to the actual computing resource
availability.
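As an illustration only, the run-time mapping of virtual processors to physical PUs can be sketched in Python. The names CorePool, rent, release and run_qt are hypothetical and are not part of EMPA; the sketch merely assumes that a QT asks a core pool for a free physical core and, if none can be granted, continues on the core of its parent.

```python
from dataclasses import dataclass, field


@dataclass
class CorePool:
    """Hypothetical pool of physical PUs (illustration only, not EMPA's interface)."""
    total: int
    denied: set = field(default_factory=set)   # cores that are (maybe temporarily) unusable
    rented: set = field(default_factory=set)   # cores currently executing a QT

    def rent(self):
        """Return the id of a free physical core, or None if no core is available."""
        for core in range(self.total):
            if core not in self.denied and core not in self.rented:
                self.rented.add(core)
                return core
        return None

    def release(self, core):
        """Put a rented core back into the pool."""
        self.rented.discard(core)


def run_qt(fragment, pool, parent_core):
    """Run one code fragment on a rented core, or on the parent's core as a fallback."""
    core = pool.rent()
    used = parent_core if core is None else core
    try:
        return fragment(), used
    finally:
        if core is not None:
            pool.release(core)


pool = CorePool(total=3, denied={1})
parent = pool.rent()                                # the parent QT occupies core 0
result, used = run_qt(lambda: 2 + 3, pool, parent)
print(result, used)                                 # prints "5 2": the child QT ran on core 2
```

The fragment itself never names a physical core; which core executes it is decided only when it is started, so a denied core simply never appears in the mapping.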
The processor has an additional control layer for organizing
the joint work of its cores. The cores have just a few
extra communication signals and are able to execute both
conventional and so-called meta-instructions (for configuring
the architecture). The latter are executed in a co-processor
style: when it finds a meta-instruction, the core notifies the
processor, which suspends the conventional operation of the
core, controls the execution of the meta-instruction (utilizing the
resources of the core, providing helper cores and handling
the connections between the cores as requested), and then resumes
core operation.
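The co-processor style handling described above can be mimicked by a toy model. This is a minimal sketch under our own assumptions: the classes Processor and Core, the instruction tuples, and the meta-instruction name "create_qt" are all hypothetical and only serve to show the order of events (notify, suspend, execute, resume).

```python
class Processor:
    """Hypothetical control layer that supervises the cores."""

    def __init__(self):
        self.log = []

    def handle_meta(self, core, name):
        """Suspend the core, carry out the meta-instruction, then resume the core."""
        core.suspended = True
        self.log.append(f"suspend core {core.ident}")
        self.log.append(f"execute meta-instruction {name}")
        core.suspended = False
        self.log.append(f"resume core {core.ident}")


class Core:
    """Hypothetical core that executes conventional instructions itself."""

    def __init__(self, ident, processor):
        self.ident = ident
        self.processor = processor
        self.acc = 0
        self.suspended = False

    def run(self, program):
        for kind, arg in program:
            if kind == "meta":
                # The core only notifies the processor; it does not execute the
                # meta-instruction on its own.
                self.processor.handle_meta(self, arg)
            elif kind == "add":
                self.acc += arg
            else:
                raise ValueError(f"unknown instruction kind: {kind}")
        return self.acc


processor = Processor()
core = Core(0, processor)
total = core.run([("add", 2), ("meta", "create_qt"), ("add", 3)])
print(total)   # prints 5; the conventional instructions are unaffected
```

In the sketch the conventional instruction stream continues unchanged after the meta-instruction, which is the property the text attributes to the additional control layer.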
The processor needs to find the needed PUs (cores), and the
processing ability has to accommodate itself to the task, also inside
the processor: quickly, flexibly, effectively and inexpensively.
This is a kind of ‘on demand’ computing that works ‘as a service’.
It is a task not only for the processor: the complete
computing system must participate, and for that goal the complete
computing stack must be rebuilt.
Behind the former attempts to optimize code execution
inside the processor there was no established theory, and
they could actually achieve only moderate success, because in
SPA the processor works in real time and does not have enough
resources, knowledge and time to discover those options
completely [56]. In classic computing, the compiler can
find out anything about enhancing the performance but has
no information about the actual run-time HW availability;
furthermore, it has no way to tell its findings to the processor.
The processor has the HW availability information, but has to
“reinvent the wheel” with respect to enhancing performance,
in real time. In EMPA, the compiler puts its findings in the
executable code in the form of meta-instructions (“configware”),
and the actual core executes them with the assistance of the
new control layer of the processor. The processor can choose
from those options, considering the actual HW availability, in the
style ‘if NeededNumberOfResourcesAvailable then Method1
else Method2’, possibly nested within one another.
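The ‘if NeededNumberOfResourcesAvailable then Method1 else Method2’ style of selection can be sketched as follows. This is an illustration under stated assumptions, not the EMPA mechanism itself: the list ALTERNATIVES stands for the options the compiler is assumed to have stored, select_method stands for the run-time choice, and the chunked sum is only a sequential stand-in for a variant that would use several cores. All names are hypothetical.

```python
import math


def sum_sequential(values):
    """Variant that needs a single core."""
    return sum(values)


def sum_in_chunks(values, n_chunks=4):
    """Stand-in for a multi-core variant: partial sums per chunk, then a final sum."""
    size = max(1, math.ceil(len(values) / n_chunks))
    partial = [sum(values[i:i + size]) for i in range(0, len(values), size)]
    return sum(partial)


# Assumed compile-time findings: (number of cores needed, method), best option first.
ALTERNATIVES = [(4, sum_in_chunks), (1, sum_sequential)]


def select_method(alternatives, free_cores):
    """Pick the first alternative whose resource need fits the cores free right now."""
    for needed, method in alternatives:
        if needed <= free_cores:
            return method
    raise RuntimeError("no alternative fits the available cores")


data = list(range(10))
for free in (4, 2):
    method = select_method(ALTERNATIVES, free)
    print(free, method.__name__, method(data))
```

Both alternatives deliver the same result (45 for the data above); only the resource need differs, so the choice can be postponed until the actual HW availability is known.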
C. Some advantages of EMPA
The approach results in several considerable advantages, but
the page limit allows mentioning only a few of them.
• as a new QT receives its own new Processing Unit(s) (PU),
there is no need to save/restore registers and the return
address (less memory utilization and fewer instruction cycles)
• the OS can receive its own PU, which is initialized in
kernel mode and can promptly (i.e. without the need for a
context switch) service the requests from the requestor
core
• for resource sharing, temporarily a PU can be delegated
to protect the critical section; the next call to run the code
fragment with the same offset will be delayed until the
processing by the first PU terminates
• the processor can natively accommodate the varying
need for parallelization
• the cores currently out of use wait in a low-energy-
consumption mode
• the hierarchic core-to-core communication greatly in-
creases the memory throughput
• the asynchronous-style computing [57] largely reduces
the loss due to the gap [58] between speed of the
processor and that of the memory
• the direct core-to-core connection (more dynamic than
in [46]) greatly enhances efficacy in large systems [59]
• the thread-like feature to fork() and the hierarchic buses
change the dependence on the number of cores from
linear to logarithmic [8] (which enables building truly exa-scale
supercomputers)
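The change from linear to logarithmic dependence in the last item can be made plausible by a simple counting argument. The sketch below rests on our own simplifying assumptions, which are not taken from [8]: starting work on another core costs one step, a single initiating core can start only one core per step, and in the fork-like case every already active core can start one further core per step. The function names are hypothetical.

```python
def one_by_one_steps(n_cores):
    """Steps needed when one initiating core starts all other cores in turn."""
    return max(n_cores - 1, 0)


def fork_steps(n_cores):
    """Steps needed when every active core forks one further core per step."""
    active, steps = 1, 0
    while active < n_cores:
        active *= 2          # the number of active cores doubles in each step
        steps += 1
    return steps


for n in (16, 1024):
    print(n, one_by_one_steps(n), fork_steps(n))
# prints:
# 16 15 4
# 1024 1023 10
```

Under these assumptions the start-up cost grows with the number of cores in the first case, but only with its base-2 logarithm in the second.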
The very first version of EMPA [11] was implemented
in the form of a simple (practically untimed) simulator [60]; now
an advanced (Transaction Level Modelled) simulator is being
prepared in SystemC. The initial version adapted Y86 cores [61],
the new one uses RISC-V cores. Partial solutions are also modeled
in FPGA.
SUMMARY
Today’s computing is more and more typically utilized
under extreme conditions: measuring the speed of light
using a system having components slower than the speed of
light; providing extremely low latency in interrupt-driven
systems; providing extremely large computing performance in
systems working in parallel; maintaining relatively high performance
for a long time on a complex computer system; servicing requests in
an energy-aware mode. To some measure, these activities are
solved under the umbrella of the old paradigm for non-extreme
scale systems. Those experiences, however, must be reviewed
when working with extremely large systems, because the scaling
is so nonlinear that qualitatively new phenomena are
experienced. Through scrutinizing the details of the basic
principles of computing, a “modern computing paradigm”
can be constructed that, on the one hand, enables explaining the
new phenomena and, on the other hand, enables constructing
computing systems with much more advantageous features.
REFERENCES
[1] J. P. Eckert, Jr. and J. W. Mauchly, “Automatic High-Speed Computing:
A Progress Report on the EDVAC,” Moore School Library, University of
Pennsylvania, Philadelphia, Tech. Rep. Report of Work under Contract
No. W-670-ORD-4926, Supplement No 4, September 1945.
[2] J. von Neumann, “First Draft of a Report on the EDVAC,”
http://www.wiley.com/legacy/wileychi/wang_archi/supp/appendix_a.pdf, 1945.
[3] M. R. Williams, “The Origins, Uses, and Fate of the EDVAC,”
IEEE Ann. Hist. Comput., vol. 15, no. 1, pp. 22–38, 1993. [Online].
Available: http://dx.doi.org/10.1109/85.194089
[4] M. D. Godfrey and D. F. Hendry, “The Computer as von Neumann
Planned It,” IEEE Annals of the History of Computing, vol. 15, no. 1,
pp. 11–21, 1993.
[5] R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B. C.
Lee, S. Richardson, C. Kozyrakis, and M. Horowitz, “Understanding
sources of inefficiency in general-purpose chips,” in Proceedings of the
37th Annual International Symposium on Computer Architecture, ser.
ISCA ’10. New York, NY, USA: ACM, 2010, pp. 37–47. [Online].
Available: http://doi.acm.org/10.1145/1815961.1815968
[6] US DOE Office of Science, “Report of a Roundtable Convened to Consider
Neuromorphic Computing Basic Research Needs,”
https://science.energy.gov/~/media/bes/pdf/reports/2016/NCFMtSA_rpt.pdf, 2015.
[7] I. Markov, “Limits on fundamental limits to computation,” Nature, vol.
512(7513), pp. 147–154, 2014.
[8] J. Végh, J. Vásárhelyi, and D. Drótos, “The performance wall of large
parallel computing systems,” in Lecture Notes in Networks and Systems
68. Springer, 2019, pp. 224–237.
[9] O. Babaoglu, K. Marzullo, and F. B. Schneider, “A formalization of
priority inversion,” Real-Time Systems, vol. 5, no. 4, pp. 285–303, Oct
1993. [Online]. Available: https://doi.org/10.1007/BF01088832
[10] S(o)OS project, “Resource-independent execution support on exa-scale
systems,” http://www.soos-project.eu/index.php/related-initiatives, 2010.
[11] J. Végh, Renewing computing paradigms for more efficient parallelization
of single-threads, ser. Advances in Parallel Computing. IOS Press,
2018, vol. 29, ch. 13, pp. 305–330.
[12] G. M. Amdahl, “Validity of the Single Processor Approach to Achieving
Large-Scale Computing Capabilities,” in AFIPS Conference Proceed-
ings, vol. 30, 1967, pp. 483–485.
[13] M. D. Godfrey, “Innovation in Computational Architecture and Design,”
ICL Technical Journal, vol. 5, pp. 18–31, 1986.
[14] S. H. Fuller and L. I. Millett, Eds., The Future of Computing Per-
formance: Game Over or Next Level? National Academies Press,
Washington, 2011.
[15] A. Mendelson, “How many cores are too many cores?”
https://www.research.ibm.com/haifa/Workshops/compiler2007/present/avi_mendelson.pdf,
2007.
[16] M. Shafique and S. Garg, “Computing in the dark silicon era: Current
trends and research challenges,” IEEE Design and Test, vol. 34, no. 2,
pp. 8–23, 4 2017.
[17] T. Ungerer, “Multi-core execution of hard real-time applications sup-
porting analyzability,” IEEE Micro, vol. 99, pp. 66–75, 2010.
[18] T. M. Conte, E. P. DeBenedictis, R. S. Williams, and M. D. Hill,
“Challenges to Keeping the Computer Industry Centered in the US,”
https://arxiv.org/abs/1706.10267, 2017.
[19] J. A. Chandy and J. Singaraju, “Hardware parallelism vs. software
parallelism,” in Proceedings of the First USENIX Conference on Hot
Topics in Parallelism, ser. HotPar’09. Berkeley, CA, USA: USENIX
Association, 2009, pp. 2–2.
[20] IEEE Spectrum, “Two Different Top500 Supercomputing
Benchmarks Show Two Different Top Supercomput-
ers,” https://spectrum.ieee.org/tech-talk/computing/hardware/
two-different-top500-supercomputing-benchmarks-show-two-different-top-supercomputers,
2017.
[21] J. Végh, P. Molnár, and J. Vásárhelyi, “A figure of merit for
describing the performance of scaling of parallelization,” CoRR, vol.
abs/1606.02686, 2016. [Online]. Available: http://arxiv.org/abs/1606.02686
[22] J. P. Singh, J. L. Hennessy, and A. Gupta, “Scaling parallel programs for
multiprocessors: Methodology and examples,” Computer, vol. 26, no. 7,
pp. 42–50, 1993.
[23] J. Dongarra, “Report on the Sunway TaihuLight System,”
http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf,
University of Tennessee Department of Electrical Engineering and
Computer Science, Tech. Rep. Tech Report UT-EECS-16-742, June
2016.
[24] J. Végh and P. Molnár, “How to measure perfectness of parallelization
in hardware/software systems,” in 18th Internat. Carpathian Control
Conf. ICCC, 2017, pp. 394–399.
[25] US Government NSA and DOE, “A Report from the NSA-
DOE Technical Meeting on High Performance Computing,”
https://www.nitrd.gov/nitrdgroups/images/b/b4/NSA_DOE_HPC_TechMeetingReport.pdf,
December 2016.
[26] R. F. Service, “Design for U.S. exascale computer takes shape,” Science,
vol. 359, pp. 617–618, 2018.
[27] European Commission, “Implementation of the Action Plan
for the European High-Performance Computing strategy,”
http://ec.europa.eu/newsroom/dae/document.cfm?doc_id=15269, 2016.
[28] ExtremeTech, “Japan Tests Silicon for Exascale Computing in 2021,”
2018. [Online]. Available:
https://www.extremetech.com/computing/272558-japan-tests-silicon-for-exascale-computing-in-2021
[29] X.-k. Liao, K. Lu, C.-q. Yang, J.-w. Li, Y. Yuan, M.-c. Lai,
L.-b. Huang, P.-j. Lu, J.-b. Fang, J. Ren, and J. Shen, “Moving
from exascale to zettascale computing: challenges and techniques,”
Frontiers of Information Technology & Electronic Engineering,
vol. 19, no. 10, pp. 1236–1244, Oct 2018. [Online]. Available:
https://doi.org/10.1631/FITEE.1800494
[30] P. J. Denning and T. Lewis, “Exponential Laws of Computing Growth,”
Communications of the ACM, pp. 54–65, 2017.
[31] J. Végh, “The performance wall of parallelized sequential computing:
the roofline of supercomputer performance gain,” Parallel Computing,
in review, 2019. [Online]. Available: http://arxiv.org/abs/1908.02280
[32] J. Végh, “How Amdahl’s Law limits the performance of large artificial
neural networks: (Why the functionality of full-scale brain simulation
on processor-based simulators is limited),” Brain Informatics,
vol. 6, pp. 1–11, 2019.
[33] K. Bourzac, “Stretching supercomputers to the limit,” Nature, vol. 551,
pp. 554–556, 2017.
[34] The Japan Times, “Chief of firm behind world’s fourth-fastest
supercomputer arrested in Tokyo for alleged fraud,”
https://www.japantimes.co.jp/news/2017/12/05/national/crime-legal/chief-firm-behind-worlds-fourth-fastest-supercomputer-arrested-tokyo-alleged-fraud/#.WmQ-KXRG3CI,
2017.
[35] Inside HPC, “Is Aurora Morphing into an Exascale AI Supercomputer?”
https://insidehpc.com/2017/06/told-aurora-morphing-novel-architecture-ai-supercomputer/,
2017.
[36] S. J. van Albada, A. G. Rowley, J. Senk, M. Hopkins, M. Schmidt, A. B.
Stokes, D. R. Lester, M. Diesmann, and S. B. Furber, “Performance
Comparison of the Digital Neuromorphic Hardware SpiNNaker and the
Neural Network Simulation Software NEST for a Full-Scale Cortical
Microcircuit Model,” Frontiers in Neuroscience, vol. 12, p. 291, 2018.
[37] J. Végh, The performance wall of the parallelized sequential computing
– Can parallelization save the (computing) world?, 1st ed. Lambert
Academic Publishing, 2019.
[38] S. Moradi and R. Manohar, “The impact of on-chip communication on
memory technologies for neuromorphic systems,” Journal of Physics D:
Applied Physics, vol. 52, no. 1, p. 014003, oct 2018.
[39] J. Végh, “How Amdahl’s law restricts supercomputer applications
and building ever bigger supercomputers,” CoRR, vol. abs/1708.01462,
2018. [Online]. Available: http://arxiv.org/abs/1708.01462
[40] T. Ippen, J. M. Eppler, H. E. Plesser, and M. Diesmann, “Constructing
Neuronal Network Models in Massively Parallel Environments,” Fron-
tiers in Neuroinformatics, vol. 11, p. 30, 2017.
[41] D. Tsafrir, “The context-switch overhead inflicted by hardware
interrupts (and the enigma of do-nothing loops),” in Proceedings of
the 2007 Workshop on Experimental Computer Science, ser. ExpCS
’07. New York, NY, USA: ACM, 2007, pp. 3–3. [Online]. Available:
http://doi.acm.org/10.1145/1281700.1281704
[42] F. M. David, J. C. Carlyle, and R. H. Campbell, “Context switch
overheads for linux on arm platforms,” in Proceedings of the
2007 Workshop on Experimental Computer Science, ser. ExpCS
’07. New York, NY, USA: ACM, 2007. [Online]. Available:
http://doi.acm.org/10.1145/1281700.1281703
[43] M. Davies et al., “Loihi: A Neuromorphic Manycore Processor with
On-Chip Learning,” IEEE Micro, vol. 38, pp. 82–99, 2018.
[44] S. B. Furber, D. R. Lester, L. A. Plana, J. D. Garside, E. Painkras,
S. Temple, and A. D. Brown, “Overview of the SpiNNaker System
Architecture,” IEEE Transactions on Computers, vol. 62, no. 12, pp.
2454–2467, 2013.
[45] S. Kunkel, M. Schmidt, J. M. Eppler, H. E. Plesser, G. Masumoto,
J. Igarashi, S. Ishii, T. Fukai, A. Morrison, M. Diesmann, and M. Helias,
“Spiking network simulation code for petascale computers,” Frontiers in
Neuroinformatics, vol. 8, p. 78, 2014.
[46] F. Zheng, H.-L. Li, H. Lv, F. Guo, X.-H. Xu, and X.-H. Xie, “Co-
operative computing techniques for a deeply fused and heterogeneous
many-core processor architecture,” Journal of Computer Science and
Technology, vol. 30, no. 1, pp. 145–162, 2015.
[47] M. Schlansker and B. Rau, “EPIC: Explicitly Parallel Instruction Com-
puting,” Computer, vol. 33, no. 2, pp. 37–45, Feb 2000.
[48] L. A. Barroso and U. Hölzle, “The Case for Energy-Proportional
Computing,” Computer, vol. 40, pp. 33–37, 2007.
[49] J. Végh, “Introducing the Explicitly Many-Processor Approach,” Parallel
Computing, vol. 75, pp. 28–40, 2018.
[50] ARM, “big.LITTLE technology,” 2011. [Online]. Available:
https://developer.arm.com/technologies/big-little
[51] J. Cong et al., “Accelerating Sequential Applications on CMPs
Using Core Spilling,” IEEE Transactions on Parallel and Distributed
Systems, vol. 18, pp. 1094–1107, 2007.
[52] Cypress, “CY7C026A: 16K x 16 Dual-Port Static RAM,”
http://www.cypress.com/documentation/datasheets/cy7c026a-16k-x-16-dual-port-static-ram,
2015.
[53] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel,
“Scratchpad memory: Design alternative for cache on-chip memory in
embedded systems,” in Proceedings of the Tenth International Sympo-
sium on Hardware/Software Codesign, ser. CODES ’02. New York,
NY, USA: ACM, 2002, pp. 73–78.
[54] J. Backus, “Can Programming Languages Be liberated from the von
Neumann Style? A Functional Style and its Algebra of Programs,”
Communications of the ACM, vol. 21, pp. 613–641, 1978.
[55] P. Gohil, J. Horn, J. He, A. Papageorgiou, and C. Poole, “IBM CICS
Asynchronous API: Concurrent Processing Made Simple,”
http://www.redbooks.ibm.com/redbooks/pdfs/sg248411.pdf, 2017.
[56] D. W. Wall, “Limits of instruction-level parallelism,” New York, NY,
USA, pp. 176–188, 1991. [Online]. Available:
http://doi.acm.org/10.1145/106974.106991
[57] IBM, “IBM CICS Asynchronous API: Concurrent Processing Made Simple,”
http://www.redbooks.ibm.com/redbooks/pdfs/sg248411.pdf, 2019.
[58] N. Satish, C. Kim, J. Chhugani, H. Saito, R. Krishnaiyer,
M. Smelyanskiy, M. Girkar, and P. Dubey, “Can Traditional
Programming Bridge the Ninja Performance Gap for Parallel
Computing Applications?” Commun. ACM, vol. 58, no. 5, pp. 77–86,
2015. [Online]. Available: http://doi.acm.org/10.1145/2742910
[59] Y. Ao, C. Yang, F. Liu, W. Yin, L. Jiang, and Q. Sun, “Performance
Optimization of the HPCG Benchmark on the Sunway TaihuLight
Supercomputer,” ACM Trans. Archit. Code Optim., vol. 15, no. 1, pp.
11:1–11:20, 2018.
[60] J. Végh, “EMPAthY86: A cycle accurate simulator for Explicitly
Many-Processor Approach (EMPA) computer,” Jul 2016. [Online].
Available: https://github.com/jvegh/EMPAthY86
[61] R. E. Bryant and D. R. O’Hallaron, Computer Systems: A
Programmer’s Perspective. Pearson, 2014.
