How deep the machine learning can be by Végh, János
How deep machine learning can be 141
Chapter 1
HOW DEEP THE MACHINE LEARNING CAN BE∗
János Végh †
Kalimános BT, Debrecen, Hungary
Abstract
Today we live in the age of artificial intelligence and machine learning; from small
startups to HW or SW giants, everyone wants to build machine intelligence chips and
applications. The task, however, is hard, and not only because of the size of the problem:
the technology one can utilize (and the paradigm it is based upon) strongly degrades
the chances to succeed efficiently. Today, single-processor performance has practically
reached the limits that the laws of nature enable. The only feasible way to achieve
the needed high computing performance seems to be parallelizing many sequentially
working units. The laws of (massively) parallelized computing, however, are different
from those experienced in connection with assembling and utilizing systems
comprising just a few single processors. As machine learning is mostly based on
conventional computing (processors), we scrutinize the (known, but somewhat faded)
laws of parallel computing as they concern AI. This paper attempts to review some
of the caveats, especially concerning scaling the computing performance of AI
solutions.
Key Words: computing performance, efficacy of ANN, parallel processing, high-
performance computing, deep learning, scaling, machine learning
∗Submitted to appear in book ”A closer look at deep learning”, by Nova
†E-mail address: Vegh.Janos@gmail.com
arXiv:2005.00872v1 [cs.DC] 2 May 2020
1 Introduction
Initially, the computer had the task of automating lengthy computations on a small amount
of data; there was only one processor, and the memory access time was of the same order
of magnitude as the time of performing a machine instruction. At that time, the limiting
resource was the speed at which the computer could perform the needed elementary operations.
Today, the development of the technology and the way of utilizing computers for AI have
changed the situation to the exact opposite: a huge (and growing) number of processors must
process a huge amount of data. The processors spend a considerable fraction of their working
time waiting for data (from either the instruction or the data memory) or for shared
resources (like the high-speed bus), communicating the results of the computations to and
from the fellow processors, or just being idle because of the computing paradigm and
its implementation. In addition to the parallel performance issues detailed below, similar
issues occur under other specific extreme conditions too, but they receive much less attention.
Some examples are:
• the very inflexible and inefficient separation of the hardware (HW) and software (SW)
in a real-time multitasking system leads to the phenomenon known as 'priority
inversion' [1]
• massive inflexible architectures suffer from frequent component errors
• complex cyber-physical systems must be equipped with excessive computing
facilities to provide real-time operation
• delivering the vast amount of data from the big storage centers to the places where
the processing capacity is concentrated is a serious challenge
• only a fraction of the continuously running cores can be utilized [2] simultaneously
• supplying energy to the huge number of computers has become more and more
problematic
All the issues have a common reason: the classic computing paradigm that reflects the state
of the art of 70 years ago [3]. Computing needs renewal [4].
In section 2, we briefly review some of the terms, milestones, and issues of the paral-
lelization. Section 3 introduces the terms of the efficiency of parallelization and demon-
strates that parallelization introduces a new performance limit (on the top of the already
known limitations of the single-processor performance). Section 4 presents that when us-
ing computing under extreme conditions (such as the vast number of cores in the paral-
lelized systems or extremely high communication to computing ratio) unusual phenomena
occur. The section draws analogies with the classic versus modern physics: the new (and
unexplained) phenomena experienced under extreme conditions forced revising our basic
knowledge.
As in all parallelized systems, the different synchronization mechanisms also play a unique
role in AI, as detailed in section 5. This section comprises a case study about brain
simulation: at such extremely large scales of parallelized systems, the operating principle
and its implementation make the 'quantal nature' of computing time a dominating factor of
performance loss.
In the light of the previous sections, section 6 discusses how the present architectural
principles define the resulting efficacy.
2 Major terms and milestones of parallelization
It was discovered early that the age of conventional architectures was over [5, 6]; the only
question that remained open for decades was whether the "game is over", too [7].
2.1 The moon-shot of linear performance scaling
Despite the early warnings [5, 8], computing followed the path dictated by Moore's
Law: the moon-shot of unlimited growth of computer performance, not considering that
the growth seems to be exponential only at the beginning [9] and is "a trend that can't go on
ad infinitum." [10]. The development efforts concentrated on achieving high
single-processor performance, but there are scientific reasons why 'single-processor'
computation must reach a limit [11, 12]. Another moon-shot today is parallelization: the only
way to increase the computing performance of a system built from conventional processors is
to parallelize the sequentially working processors. Even today, an (exponentially) linear
dependence is expected [13], and the "gold rush" for achieving exascale performance [14] is
going on, even in the most prestigious journals [15, 16].
However, as demonstrated decades ago, the computing performance not only fails to
increase linearly with the number of computing units but also decreases after exceeding a
critical number of processors [8]. Although the EFlops (payload) performance has not yet
been achieved, a supercomputer performance $10^{4}$ times higher is already planned [13].
It seems that in the feasibility studies, the analysis of whether an inherent performance
bound exists remained out of sight, in the USA [16, 17], the EU [18], Japan [19], and
China [13] alike. However, at extremely large scales, serious performance limitations
occur [20, 21]. Parallelized sequential processing has different rules of the game [8, 22]:
the performance gain ("the speedup") has its inherent bounds [23]. It has recently been
admitted that this "can be seen in our current situation where the historical ten-year cadence
between the attainment of megaflops, teraflops, and petaflops has not been the case for
exaflops" [10].
2.2 Model of parallelized sequential processing
The only form of cooperation of processing units that researchers have elaborated so far is
parallelizing the operation of otherwise sequentially working processors. There are,
however, attempts to transfer the workload from one core to another [24, 25]; the immediate
register-to-register transfer [26] made it possible to build a supercomputer with 10M cores
and considerably increased the efficiency for the HPCG benchmark [27]. As was predicted [5],
the principle of parallel processing introduces its own limitations on the resulting parallel
performance. The simple model shown in Fig. 1 may help to understand the parallelized
sequential operation. Despite its non-technical nature and simplicity, the model correctly
describes the development of supercomputing in its quarter-century history and predicts
the present stalling and limitations.

[Figure 1 (diagram): "Model of distributed parallel processing". Horizontal axis: processors
P0 to P4; vertical axis: time (not proportional). Stages from top to bottom: AccessInitiation,
SoftwarePre, OSPre; for each processor n the addressing time Tn, the propagation delay PDn0,
Process n, the propagation delay PDn1, and "Just waiting"; then OSPost, SoftwarePost,
AccessTermination. The time spans Payload, Total, and Extended are marked; α = Payload/Total.
Inset, 1993: 1 − α = 10^-3, Ncores = 10^3, RMax/RPeak = 1/(N × (1 − α) + α) = 0.5.
Inset, 2018 (Sunway/Taihulight): 1 − α = 3.3 × 10^-8, Ncores = 1.06 × 10^7, RMax/RPeak = 0.74.
Total = 10^13 clocks.]

Figure 1: The simple model of distributed parallel processing. With proper interpretation
of the terms, it can describe all kinds of distributed parallel processing.
Supercomputing is distributed parallelized computing: one of the processors distributes
the job among the rest of the processors, which perform their parts sequentially. A one-time
initialization (as well as termination) is needed in both the SW and the OS. Then the
first processing unit must spend some time Tn addressing the nth processor (also part of
the OS). After this, the propagation delay takes place (PDn0 to and PDn1 from the nth
processor)¹. The same kind of delay follows after the corresponding processing unit has
performed the computation. Notice that the first processing unit must work alone until all
tasks are scheduled to the fellow processors, and it must also wait until the last fellow
finishes its task.
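The timing described above can be turned into a small numerical sketch. The code below is only an illustration of the model in Fig. 1, not the author's implementation: the class `Worker`, the functions `total_time` and `payload_fraction`, the lumped `pre` and `post` terms, and all numbers are assumptions made here, and the payload is taken to be the longest per-processor process time.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Worker:
    """Timing of one fellow processor (arbitrary clock units, hypothetical values)."""
    t_address: float  # T_n: time the first unit spends addressing this processor
    pd_out: float     # PD_n0: propagation delay towards the processor
    process: float    # payload computation performed on the processor
    pd_back: float    # PD_n1: propagation delay of the result on its way back


def total_time(workers: Sequence[Worker], pre: float, post: float) -> float:
    """Total time of one distributed job.

    `pre` lumps access initiation plus the SW and OS initialization, `post`
    the corresponding termination (an assumption of this sketch). The first
    unit addresses the workers one after another, then waits for the slowest
    result to return.
    """
    clock = pre
    last_return = pre
    for w in workers:
        clock += w.t_address  # addressing is strictly sequential
        returned = clock + w.pd_out + w.process + w.pd_back
        last_return = max(last_return, returned)
    return last_return + post


def payload_fraction(workers: Sequence[Worker], pre: float, post: float) -> float:
    """alpha = Payload / Total, with the longest process time taken as the payload."""
    payload = max(w.process for w in workers)
    return payload / total_time(workers, pre, post)


if __name__ == "__main__":
    # Five identical workers with made-up timings.
    workers = [Worker(t_address=1.0, pd_out=0.5, process=100.0, pd_back=0.5)
               for _ in range(5)]
    print(total_time(workers, pre=3.0, post=3.0))
    print(payload_fraction(workers, pre=3.0, post=3.0))
```

With these made-up numbers, the last worker is addressed 5 units after the initialization, so the sequential addressing alone already delays the end of the job; adding more workers lengthens that part linearly while the payload stays the same.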
¹ Notice that the Tn and PDnx terms can be combined to achieve a lower total execution time, see Fig. 1.

Amdahl's original intention was to call attention to the fact that parallelizing sequentially
working processing units introduces serious performance limitations. His successors
formulated Amdahl's law as

$$S^{-1} = (1 - \alpha) + \frac{\alpha}{N} \qquad (1)$$

where N is the number of parallelized code fragments, α is the ratio of the parallelizable
portion to the total, and S is the measurable speedup. Although calculating α for today's
sophisticated hardware/software systems is extremely hard, one can express the parallel
portion as

$$\alpha = \frac{N}{N - 1} \cdot \frac{S - 1}{S} \qquad (2)$$
The speedup can be measured, and we know the number of parallelized threads (or processing
units), so α provides an 'empirical parallelization'. As we can easily conclude from the
simple model of parallelized sequential processing (see Fig. 1), α actually corresponds to
the ratio of the time of the payload computations to the total measurement time. One can
visualize Amdahl's assumption as follows: (assuming many processors) in the α = Payload/Total
fraction of the measured processing time, the processors are processing data; in the (1 − α)
fraction, all but one of them are waiting. That is, α describes how much, on average, the
processors are utilized.
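Equations (1) and (2) are inverses of each other, which a few lines of code can confirm. The following is a minimal sketch; the function names `amdahl_speedup` and `effective_alpha` and the sample values are chosen here for illustration and do not come from the original text.

```python
def amdahl_speedup(alpha: float, n: int) -> float:
    """Eq. (1): 1/S = (1 - alpha) + alpha/N."""
    return 1.0 / ((1.0 - alpha) + alpha / n)


def effective_alpha(speedup: float, n: int) -> float:
    """Eq. (2): alpha = N/(N - 1) * (S - 1)/S; defined only for N > 1."""
    if n <= 1:
        raise ValueError("at least two processing units are needed")
    return n / (n - 1) * (speedup - 1.0) / speedup


if __name__ == "__main__":
    n = 1000
    s = amdahl_speedup(0.999, n)   # speedup predicted for alpha = 0.999
    print(s)
    print(effective_alpha(s, n))   # recovers alpha from the "measured" speedup
```

In practice only S and N are known from a measurement, so `effective_alpha` is the direction that matters: it turns a benchmark result into the empirical α discussed above.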
This quantity is an empirical factor that includes everything, from the engineering
perfectness of assembling the components, through the sequential-only fragments of the
computation, to the delay of the internal connections. According to Amdahl, any fraction
that we cannot parallelize contributes to the sequential-only fraction. As Amdahl discussed,
and as can quickly be concluded from Fig. 1, the sequential-only fraction has contributions
from the software, the operating system, and the hardware. Even science provides its
contribution: the physical size of the computer plays a role. At extreme sizes, the
speed of signal propagation (and, of course, the technology components of the network) leads
to a considerable increase of the 'idle time'. In many-processor systems, serious
competition takes place for the shared resources (such as the buses); the queuing of the
messages inside the system seriously increases the sequential-only fraction and so
considerably reduces the parallelizable fraction of the processing time. Depending on the
actual situation, those contributions differ by orders of magnitude, and they can also
compete for dominance.
2.3 The performance of parallel processing
The fact that the parallel efficiency has its limitations was recognized early, and even its
reason was correctly identified: ”As pointed out by Amdahl [5], the [constant problem
size] CPS scaling leads to a rapid reduction in parallel efficiency as more processors are
used to solve a fixed-size, deterministic problem. It was Amdahl argued that most parallel
programs have some portion of their execution that is inherently serial and must be executed
by a single processor while others remain idle.” [8]
When calculating speedup, one actually calculates

$$S = \frac{(1 - \alpha) + \alpha}{(1 - \alpha) + \alpha/N} = \frac{N}{N(1 - \alpha) + \alpha} \qquad (3)$$

hence the efficiency² (how speedup scales with the number of processors)

$$E = \frac{S}{N} = \frac{1}{N(1 - \alpha) + \alpha} \qquad (4)$$

² This quantity is almost exclusively used to describe the computing performance of
multi-processor systems. In the case of supercomputers, $R_{Max}/R_{Peak}$ is provided, which
is identical with E, except that it is actually a 2-parameter function, see Fig. 2 and the
bottom row of Table 1.
[Figure 2 (plot): "Dependence of E_HPL and E_HPCG on (1 − α_eff^HPL) and N". Axes:
(1 − α_eff^HPL) from 10^-8 to 10^-6; number of cores from 10^5 to 10^8; efficiency from
10^-2 to 10^0. Data points (TOP5, 2018.11): Summit, Sierra, Taihulight, Tianhe-2, K computer.]
Figure 2: The dependence of the efficiency of parallelized computing systems on two
parameters: the number of cores and the efficiency of parallelization. The data points for
the present top supercomputers are derived using data in the TOP500 database. The top marks
on the vertical lines refer to the HPL benchmark, the bottom marks to HPCG.
hence the efficiency

$$E(N, \alpha) = \frac{S}{N} = \frac{1}{N(1 - \alpha) + \alpha} = \frac{R_{Max}}{R_{Peak}} \qquad (5)$$
That is, according to Amdahl, the efficiency depends on two critical variables: the
parallelization efficiency (the mathematical details are discussed in [28]) and the number
of processing units (or threads), as shown in Fig. 2. The surface corresponds to the
algorithmic dependence of the efficiency on those two variables³. The data points show the
efficiency of some of the TOP25 supercomputers at their respective values of the two
parameters.
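As a quick numerical check of Eq. (5), the sketch below evaluates the efficiency for the two parameter sets listed with Fig. 1 (the 1993-like and the Taihulight-like case). The function name `efficiency` is an assumption of this example; passing (1 − α) rather than α is a choice made here to avoid losing precision when α is extremely close to 1.

```python
def efficiency(n_cores: float, one_minus_alpha: float) -> float:
    """Eq. (5): E = 1 / (N*(1 - alpha) + alpha), parameterized by (1 - alpha)."""
    alpha = 1.0 - one_minus_alpha
    return 1.0 / (n_cores * one_minus_alpha + alpha)


# Parameter values as listed with Fig. 1.
print(round(efficiency(1e3, 1e-3), 2))       # 1993-like system
print(round(efficiency(1.06e7, 3.3e-8), 2))  # Taihulight-like system
```

Rounded to two digits, the two calls give 0.5 and 0.74, the RMax/RPeak values quoted with Fig. 1. The product N × (1 − α) is what matters: a system with ten thousand times more cores keeps a comparable efficiency only because its (1 − α) is smaller by several orders of magnitude.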
The inefficiency was attributed (at that time correctly) to the software contribution, "for
example, include initialization, task creation, or some other phase of computation". The
key statements, that the processors spend "some portion of their execution that is inherently
serial and must be executed by a single processor while others remain idle" and that "scaling
thus put larger machines at an inherent disadvantage" [8], remained out of sight for a
considerable while. As Fig. 2 shows, Taihulight and K computer stand out from the
"million core" group. Thanks to its 0.3M cores, K computer has the best efficiency for
the HPCG benchmark, while Taihulight, with its 10M cores, has the lowest one (its good
efficiency for HPL originates in using "cooperating processors" [26]). The middle group
follows the rules. For the HPL benchmark: the more cores, the lower the efficiency. For the
HPCG benchmark: the "roofline" [23] of that communication intensity has been reached, so the
systems have about the same efficiency.

³ Recall that "this decay in performance is not a fault of the architecture, but is dictated
by the limited parallelism." [8].
2.4 The syndrome of ’idle processors’
At that time, both the parallel efficiency of the hardware and the number of processors were
low. In the 1993 TOP500 list, one can find "supercomputers" assembled from only 2 (or even 1)
processors; the clock frequency was about 1% of today's clock frequency, and the perfectness
of their parallelization [28] was about $10^{-3}$. Today, the top supercomputers comprise
millions of processors, and their perfectness of parallelization is about $10^{-7}$. That
is, all important factors have changed by orders of magnitude. Today, despite the huge
improvement in parallel efficiency and the respectable single-processor performance, the
large number of processors dominates. Millions of processors are idle (for both software and
hardware reasons) while one processor executes the serial-only fraction of the code [20]. In
contrast with the statement in [8] that "the serial fraction . . . is a diminishing function
of the problem size", the weight of the housekeeping activity grows linearly with the number
of cores (and so does the idle time of the other cores). The issue, known for decades, has
returned in a technically different form at a vast number of cores.
After reaching a critical number of processors (using the present paradigm and technology,
one can guess its value to be slightly less than 10M), adding more processors leads to
decreasing performance [20, 8, 29], as witnessed by demonstrative supercomputer failures
such as Gyoukou, Aurora, or SpiNNaker. It was also noticed that at a larger number of
computing nodes, the performance starts to decrease [30]. The mathematical details are
discussed in [20, 21].
At the time of writing the paper [8], the quantitative breakdown of the factors of
inefficiency was not yet sufficiently known: the relatively low HW parallel efficiency (as
well as the high contribution of the software) did not make it possible to study the other
contributions. Unfortunately, other scalings were introduced later (sometimes without
proper care, see [31]), suggesting that the parallel efficiency can be enhanced infinitely,
i.e., that the computing performance of parallel systems has no (at least no close-lying)
limit. Some researchers, however, correctly guessed that initialization and threading are
the main reasons for the inefficiency. The software also contributes: the different need for
synchronization of the different tasks causes different supercomputer efficiencies [32],
and, of course, the payload computation also needs time. All this leads to different
"rooflines" of parallelized computing [23]. As also predicted by [8], the amount of
synchronization (data and control communication) of the processors is a decisive factor in
defining the efficiency [33, 34, 35]. Another critical factor can be a "higher-level
synchronization", such as the commonly used "biological clock time" in brain simulation [21],
which also decreases the parallel efficacy of the computing systems by orders of magnitude.
This paper discusses how much the parallel efficiency depends on those different factors.
3 The efficacy of parallel processing
The architectures based on the classic paradigm, the 'Single Processor Approach', have
serious limitations [5]; among others, their efficiency strongly degrades [8] as the number
of parallelized units increases. The quick development of the underlying technology (the
false promise of Moore's observation) masked the need to look for ways of utilizing
'cooperating processors', as advised. By now, it has become obvious that the single-processor
performance will reach its limits within years because of some laws of nature [12], but
today computing prepares for the post-Moore era [36] rather than looking for a new
computing paradigm. The term "new computing paradigm" is commonly understood to mean using
gates other than those built from transistors, even when thinking about rebooting
computing [37], or speculating on utilizing computing in some specific fields.
3.1 The role of the communication
As was analyzed a quarter of a century ago [8], the principle of parallelization
itself creates an additional bottleneck [38, 20]. As also seen from our model, at the
beginning and end of the measurement, most of the processors are idle. The bottleneck is
the non-parallelizable fraction of the job: according to Amdahl, this factor alone limits the
achievable speedup (or performance gain: the resulting performance compared to the
performance of a component processor). The researchers also recognized that one of the major
contributors to the bottleneck is the needed communication between the parallelized
sequential units. We also know that, at a given technology of parallelization, after some
point the computing performance of the system not only saturates but (because of the rapidly
growing fraction of housekeeping) starts to decrease when the number of processors in the
parallelized system is increased further [8]. At that time, the critical number was a few
dozen processors, see Fig. 1 in [8]; today, one can guess it to be a few million processing
units.
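The appearance of a critical number of processors can be illustrated with a toy model. The sketch below is not taken from the cited works: it simply assumes that the sequential-only fraction consists of a constant part `s0` plus a housekeeping part `k * n` that grows linearly with the number of processors, and inserts this into Amdahl's formula. The parameter values are invented solely so that the maximum falls near ten million processors.

```python
import math


def speedup(n: float, s0: float, k: float) -> float:
    """Amdahl-type speedup when the sequential-only fraction is s0 + k*n.

    s0 and k are hypothetical parameters of this toy model.
    """
    sequential = s0 + k * n
    parallel = 1.0 - sequential
    if parallel < 0.0:
        raise ValueError("sequential fraction exceeds 1 for this n")
    return 1.0 / (sequential + parallel / n)


def critical_cores(s0: float, k: float) -> float:
    """Processor count with maximum speedup.

    The normalized run time is T(n) = s0 + k*n + (1 - s0)/n - k;
    setting dT/dn = k - (1 - s0)/n**2 to zero gives n = sqrt((1 - s0)/k).
    """
    return math.sqrt((1.0 - s0) / k)


if __name__ == "__main__":
    s0, k = 1e-8, 1e-14  # illustrative values only
    n_star = critical_cores(s0, k)
    for n in (n_star / 10, n_star, n_star * 10):
        print(f"n = {n:.3g}  speedup = {speedup(n, s0, k):.3g}")
```

Below the critical number the speedup still grows with n; above it, the linearly growing housekeeping outweighs the gain from further parallelization, and the speedup falls again.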
3.2 Why the issue re-appeared
The technology of parallelization at that time was at the stage [39] where the performance
gain could not reach 200 (even though thousands of processors were used in the systems).
Accordingly, one considered the major bottleneck to be the software contribution
(like initialization and thread creation), at that time correctly. It was stated [8] that
"the serial fraction, s, in most applications does not remain constant but is a diminishing
function of the problem size, Amdahls law does not apply directly when the problem is
scaled". The benchmarking time in the case of supercomputers is in the order of several
hours [40] (or $10^{13}$ clock cycles), so the non-parallelizable fraction (the one-time
initialization and task creation) becomes small but remains finite. That small
sequential-only fraction, during which one processor organizes the work (and talks
individually to all fellow processors), must be multiplied by the total number of processors
in the system⁴, so the formerly 'diminishing part' is strongly amplified. Not only is it
non-negligible, but it can even dominate. For a detailed discussion, see [20].

⁴ One can decrease this number by clustering or, more successfully, by using 'internal
clustering' [26].

Although that statement was valid at the time of writing paper [8], in the age of only
several dozen processors, it must be revised when there are millions of processors
in the computing systems. Nowadays, the time during which one processor performs the
non-parallelizable activity while the others remain idle must be multiplied by a factor
$10^{3}$ to $10^{6}$ times higher (so the idle time is that many times longer).
During the past quarter of a century, the number of processors increased, and the
efficiency of their parallelization improved. Now it is time to scrutinize the roles of the
different contributions. As it turned out [20], the technical development reordered the
ranking of the different contributions to the non-parallelizable fraction of computing. In
modern systems, the contribution of the HW parallelization (mainly due to the higher
interconnection speed) can be smaller than that of the already mentioned software. Due to
the larger sizes, one must re-evaluate the role of the physical size and the interconnection.
Today, the need for communication between the parallelized parts has become a dominant
factor [8]. It is well known that at large core numbers, the efficiency of supercomputers
also depends on the task they run [32]. As experienced with the two popular benchmark
programs HPL and HPCG, there are as many efficiencies as there are benchmarks. The lack of
understanding of the role of the number of processors and the parallelization efficiency
leads to explanations that some 'architectural weakness' is behind the much lower efficiency
of Taihulight compared with the K computer. The real reason is the 30 times higher number of
processors, together with Amdahl's law. And, of course, the efficiency depends on the task,
more precisely on the amount of communication the task requires.
As a careful analysis made it possible to conclude [23], the mentioned 'inherently
non-parallelizable contribution' [38, 20] of the task effectively represents a resource that
limits the achievable performance, so the popular 'roofline model' [41] can also be
applied [23] to the achievable performance of parallelized systems, see Fig. 3.
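The limiting role of the non-parallelizable fraction can be made explicit from Eq. (3). What follows is only the plain Amdahl limit for α < 1, offered as an illustration; the construction of the rooflines used in [23] may differ in detail.

```latex
\[
  \lim_{N \to \infty} S(N)
  = \lim_{N \to \infty} \frac{N}{N(1-\alpha) + \alpha}
  = \frac{1}{1-\alpha},
  \qquad
  \lim_{N \to \infty} E(N)
  = \lim_{N \to \infty} \frac{S(N)}{N} = 0 .
\]
```

For example, with the Taihulight-like value $1 - \alpha = 3.3 \times 10^{-8}$ listed with Fig. 1, the performance gain cannot exceed about $3 \times 10^{7}$, however many processors are added.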
4 The need for ’modern computing’
The case of computing is very much analogous to the case of classic physics versus
modern (relativistic and quantum) physics [22]. In the world we live in, it is rather
counter-intuitive to accept that, as we move towards unusual conditions, the addition of
speeds behaves differently, the energy becomes discontinuous, and the momentum and the
position of a particle cannot be measured accurately at the same time. In the cases we
experience around us, computations with and without the non-classical principles show no
noticeable difference. However, as we get farther from the normal conditions, the difference
becomes more considerable and even leads to phenomena one can never experience under usual,
everyday conditions. The analogies are not meant to derive a direct correspondence between
certain physical and computing phenomena. Rather, this paper wants to call attention
both to the fact that under extreme conditions, qualitatively different behavior may be
encountered, and to the fact that scrutinizing certain formerly unnoticed or neglected
aspects makes it possible to explain the new phenomena. In computing, unlike in nature, the
technical implementation of the critical points can be changed, and in this way the behavior
of the computing systems can also be changed. In this paper, only one affected area of
computing (and one that is relevant for AI operation) can be touched on in more detail:
parallel processing.
[Figure 3 (plot): "The roofline of performance gain of supercomputers". Horizontal axis:
year, 1994 to 2020; vertical axis: performance gain, 10^2 to 10^8. Series: 1st, 2nd, and 3rd
by R_Max^HPL; best by α_eff^HPL; 1st, 2nd, and 3rd by R_Max^HPCG; best by R_Max^HPCG;
brain simulation.]
Figure 3: The rooflines of the performance gain, for different levels of communication: the
minimum level (the HPL benchmark), the medium level (the HPCG benchmark) and the
very high level (brain simulation) [21]
4.1 Anomalies in extreme-scale computing
In both previous figures, one can see an anomaly (the same anomaly from two different
points of view).
In Fig. 2, the only supercomputer with 10M processors, Taihulight, stands out from the group
of other top-class supercomputers, and in Fig. 3 the HPL performance gain values for
Taihulight fall above the corresponding roofline level. The reason is the same: its
processor [26] utilizes direct core-to-core data transfer, i.e., the computing principle is
(slightly) different. The amount of communicated data and the computations are the same
as on the other supercomputers, but the communication takes place without using the
global bus system; because of this, the communication takes less time.
Physics | Computing
Adding of speeds | Adding of performance
Classic | Classic
$v(t) = t \cdot g$ | $Perf_{total}(n) = n \cdot Perf_{single}$
c = Light Speed |
t = time | n = number of cores
g = acceleration | $Perf_{single}$
n = optical density | α = parallelism
Modern (relativistic) | Modern [43]
$v(t) = \frac{t \cdot g}{\sqrt{1 + \left(\frac{t \cdot g}{c/n}\right)^{2}}}$ | $Perf_{total}(n) = \frac{n \cdot Perf_{single}}{n \cdot (1 - \alpha) + \alpha}$

Table 1: The "adding speeds" analogies between the classic and modern arts of science and
computing, respectively

Reducing the communication makes sense. The so-called HPL-AI benchmark uses
Mixed Precision [42] rather than Double Precision computations. This change in the
operand length made it possible to achieve a nearly 3 times better performance gain; as
correctly stated in the announcement: "Achieving a 445 petaflops mixed-precision result
on HPL (equivalent to our 148.6 petaflops DP result)", i.e., the peak DP performance did
not change. Unfortunately, this achievement has not much to do with AI: it utilizes the
data representation commonly used in AI, but the achievement comes from accessing less
data in memory and using quicker operations on the shorter data rather than from reducing
the communication intensity. For AI applications, the limitations remain the same as
described above, except that when using Mixed Precision, the efficiency can be better by a
factor of 2-3. For other effects, see also section 6.
This performance increase clearly shows that it is not the ratio of the number of
communication operations to the number of computation operations, but the time spent on
those operations, that defines the efficiency of the parallel system. Thanks to this,
Taihulight is the only competitor with 10M cores; all the others have only a fraction of
this number. This difference is why Gyoukou, which was able to utilize only 12 % of its
20M available cores, disappeared within half a year, and this is why Aurora did not appear.
The phenomena observed with large-scale systems show that computing under extreme conditions
(such as in large-scale parallel systems) behaves differently from computing in systems of
just a few processors.
The supercomputers are stretched to the limit [15, 16]. The case of Summit underlines the
importance of engineering perfectness: adding 5 % more cores (and fine-tuning after its
quick startup) resulted in a 17 % increase in computing performance within half a year after
its appearance on the list, and in another half year, a 0.7 % increase in the number of
cores caused a 3.5 % increase in its performance. The one-time appearance of Gyoukou is
puzzling: it could reach slot #4 on the list using just 12 % of its 20M cores, although
the explicit ambition was to be #1. The lack of understanding that computing performance
behaves differently under extreme conditions led to false presumptions of fraud in
reporting computing performances [44]. Another champion candidate, the supercomputer
Aurora [45], was retargeted (after years of building) weeks before its planned
startup. In November 2017, Intel announced that Aurora had been shifted to 2021. As part
of the announcement, the development line Knights Hill [46] was canceled, to be
replaced by a "new platform and new microarchitecture specifically designed for exascale".
The lesson learned was that one needs a specific design for exascale.
[Figure 4 (two plots). Left: "Relativistic speed of body accelerated by 'g'"; horizontal
axis: time (s), 10^5 to 10^8; vertical axis: speed (m/s), 10^6 to 10^8; curves v(t) for
n = 1, n = 2.5, and n = 5. Right: "Payload performances @Summit"; horizontal axis: nominal
performance (EFlops), 10^-4 to 10^0; vertical axis: payload performance (EFlops), 10^-4 to
10^0; curves with 1 − α taken from HPCG-FP64, HPL-FP64, HPL-FP16, HPL-FP0, and Science.]
Figure 4: The effect of the correction term for the relativistic acceleration and the payload
performance of supercomputers.
4.2 Analogy with the special relativity
In the above sense, there is an important difference between the operation and performance
of single-processor systems and those of parallelized but sequentially working
computer systems. As long as just a few (thousand) single processors are aggregated into
a large computer system, the resulting performance corresponds (approximately) to the sum
of the single-processor performance values, similarly to the classic rule of adding speeds.
However, when assembling larger computing systems (and approaching with their performance
"the speed of light" of computing systems, in the range of millions of processors),
the experienced payload performance starts to deviate from the nominal performance: the
phenomenon known as efficiency appears.
The performance measurements are simple time measurements: the benchmark program
executes a standardized set of machine instructions (a large number of times), and the
known number of operations is divided by the measurement time. The procedure is the same
for single-processor and for parallelized sequential computing systems. In the latter case,
however, the joint work must also be organized. That activity is an extra task (implemented
with additional machine instructions and additional execution time); see also Fig. 1.
Because of this, the computing performance cannot increase above the limit defined by (the
parallelization technology and) the number of processors, just as an object moving at nearly
the speed of light cannot be accelerated further. Exceeding a specific computing performance
(using the classic paradigm and its implementation) is prohibited by the laws of nature.
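The saturation described above can be made concrete with a small numerical sketch. It
assumes the textbook form of Amdahl's law, with alpha denoting the parallelizable fraction
(so that 1 - alpha is the sequential part, as on the figure axes), and the standard
special-relativistic formula for a body under constant acceleration g. The function names
and the sample values are illustrative assumptions, not taken from the chapter.

```python
import math

C = 299_792_458.0  # speed of light in m/s


def relativistic_speed(g, t):
    """Speed after time t under constant acceleration g (special relativity)."""
    classic = g * t  # classic rule: v = g * t
    # The square root is the correction term; it stays near 1 while classic << C.
    return classic / math.sqrt(1.0 + (classic / C) ** 2)


def payload_performance(n_cores, p_single, one_minus_alpha):
    """Payload performance predicted by Amdahl's law.

    one_minus_alpha is the non-parallelizable fraction (assumed constant here).
    """
    alpha = 1.0 - one_minus_alpha
    speedup = 1.0 / (one_minus_alpha + alpha / n_cores)
    return p_single * speedup


if __name__ == "__main__":
    p_single = 1e9           # assumed 1 Gflop/s per core
    one_minus_alpha = 1e-7   # assumed sequential fraction
    limit = p_single / one_minus_alpha  # the "speed of light" of this system
    for n in (10**3, 10**5, 10**7, 10**9):
        nominal = n * p_single
        payload = payload_performance(n, p_single, one_minus_alpha)
        print(f"N={n:>10}  nominal={nominal:.2e}  payload={payload:.2e}  limit={limit:.2e}")
```

For small N the payload value follows the nominal one; for large N it approaches, but never
exceeds, the limit p_single / (1 - alpha), much as v(t) approaches but never exceeds C.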
Table 1 and Fig. 4 show why the analogy is relevant. Although their functional forms
are different, the modern approach introduces a correction term in both cases. That term
remains close to unity until extreme conditions are approached; then both functions
saturate: the specific “speed of light” cannot be exceeded. The figure shows the performance
of the current “world champion” Summit. The diagram lines show the performance calculated
from the theory for (from the bottom up) the HPCG benchmark, the HPL benchmark (both with
FP-64), and the HPL-AI (FP-16) benchmark [42]. The next diagram line shows the estimated
performance for the FP-0 case, that is, when no floating-point operations are executed and
the program performs only non-floating operations such as addressing, comparing,
incrementing, and jumping. (This case is similar to the “empty loop” frequently used in
programming.) The top diagram line corresponds to the case when only the physical size of
the computer limits the performance (but the computer does nothing useful). The bubbles
refer to the values measured on Summit. As shown, at low nominal performance the payload
performance values do not differ considerably. At higher nominal performance, however, the
payload performance depends on the task the supercomputer runs.
4.3 Analogy with the general relativity
The mentioned losses manifest in the appearance of the performance wall [20, 23], a new
limitation due to parallelized sequential computing. In science, we know that enormous
masses behave differently from what we know under ‘normal’ conditions. If we assume we
know how ‘matter’ behaves, we need to assume the presence of ‘dark matter’. The latter is
like ‘matter’ but not quite; in other words, the large-scale behavior of ‘matter’ deviates
considerably from what we have concluded from smaller amounts of ‘matter’. An analogy with
this field is already in use: the phenomenon of ‘dark silicon’ [2] is named in analogy with
‘dark matter’: the (silicon) cores are there and usable, but (because of the thermal
dissipation) a large number of cores behaves differently.
Analogously, parallel computing introduces “dark performance”: the cores are there and
clocked, they consume power, but they do no payload work: they are idle. Because of the
principle of classic computing, the first core must speak to all fellow cores, and this
non-parallelizable fraction of the time increases with the number of cores [23, 47]. The
result is that the top supercomputers (depending on the number of cores) show an efficacy
of around 1% when solving real-life tasks. In analogy with the “gravitational collapse”, a
“communicational collapse” is demonstrated in Fig. 5(a) in [48]: at an extremely large
number of cores, exceeding the critical threshold of communication intensity leads to an
unexpected and drastic change in network latency.
The propagation time of signals is also very similar to that of the effects of physical
fields. The latency time of the interfaces can be paired with creating and attenuating
the physical carriers. Zero-time on/off signals are possible both in classical physics and
in classical computing, while in their modern counterparts the time needed to create,
transfer, and detect the signals must also be accounted for; the effect is noticed as the
time of wiring (in this extended sense) growing compared to the time of gating [12].
4.4 Analogy with the quantum physics
Electronic computers are clock-driven systems, i.e., no action can happen in a time period
shorter than the length of one clock period. The typical value of that “quantum of time”
in computing today is in the nanosecond range, so in everyday practice time seems to be
continuous, and the “quantal nature of time” cannot be noticed. Some (sequential)
non-payload fraction of the total time is always present in parallelized sequential systems,
not only because of spawning and joining the threads but also because of their
concurrency [38]. That fraction cannot be smaller than the length of two clock periods
divided by the total measurement time, since neither forking nor joining the other threads
can take less than one clock period. Unfortunately, the technical implementation needs
about ten thousand times longer to perform those actions [49, 51].
The total time of the performance measurement is large (typically hours) but finite, so
the non-parallelizable fraction is small but finite. Because of Amdahl's Law (the computing
paradigm and the implementation technology together), the absolute value of the computing
performance of parallelized systems inherently has an upper limit, and the higher the
number of aggregated processing units, the lower the efficiency.
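A rough numerical sketch of this bound follows. It assumes a 1 ns clock period, a one-hour
measurement, and the textbook Amdahl limit of 1/(1 - alpha) on the performance gain; the
factor of ten thousand is the implementation overhead quoted above. All names and sample
values are illustrative assumptions.

```python
def min_sequential_fraction(clock_period_s, measurement_time_s, overhead_factor=1.0):
    """Smallest possible (1 - alpha): one clock period each for fork and join,
    optionally scaled by an assumed implementation overhead factor."""
    return 2.0 * clock_period_s * overhead_factor / measurement_time_s


def max_performance_gain(one_minus_alpha):
    """Amdahl's upper limit on the performance gain as N grows without bound."""
    return 1.0 / one_minus_alpha


if __name__ == "__main__":
    clock = 1e-9       # assumed 1 ns clock period
    duration = 3600.0  # assumed one-hour benchmark run
    for factor in (1.0, 1e4):
        s = min_sequential_fraction(clock, duration, factor)
        print(f"overhead x{factor:g}: (1-alpha) >= {s:.2e}, "
              f"gain <= {max_performance_gain(s):.2e}")
```

Under these assumptions the bound on the gain drops by the same four orders of magnitude
as the overhead factor rises.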
As discussed below and in detail in [21], processor-based brain simulation provides
“experimental evidence” that time in computing shows quantal behavior, analogously to
energy in physics. When simulating neurons using processors, the ratio of the simulated
(biological) time to the processor time used to simulate the biological effect may differ
considerably, so to avoid working with “signals from the future”, periodic synchronization
is required, which introduces a particular “biological clock cycle”. The role of this clock
period is the same as that of the clock signal in clocked digital electronics: whatever
happens within this period happens “at the same time”.
The commonly used 1 ms [52] “grid time” is, however, $10^6$ times longer than the 1 ns
clock cycle common in digital electronics. Correspondingly, its influence on the
performance is noticeable; see subfigure Fig. 6.C. As shown, the “quantal nature of time”
in computing changes the behavior of the performance drastically. Not only is the achievable
performance orders of magnitude lower, but the “communicational collapse” (see also [8])
also occurs at orders of magnitude lower nominal performance. This effect is the reason
why less than one percent of the planned capacity can be achieved even by the purpose-built
brain simulator [53], and why the SW-based and HW-based simulations show the same
limitation [52, 21]. The memory of huge supercomputers can be populated [54] with objects
simulating neurons, but as soon as they need to start communicating, the task collapses as
predicted in Fig. 6. This reasoning is indirectly underpinned [29] by the observation that
handling the threads differently changes the efficacy sensitively, and that the time
required for more detailed simulation increases non-linearly [52, 21].
4.5 Analogy with the interactions of particles
The ability to communicate with each other is not a native feature of processors in
‘classic computing’: in the Single Processor Approach (SPA), questions like sending
messages to and receiving them from some other party make no sense at all (as no other
party exists); messaging is imitated very ineffectively by SW in the layer between the HW
and the real SW. In the SPA, communication is a non-parallelizable fraction of the activity
of the cores, and similarly, sharing resources makes no sense (although it is an elementary
requirement in all modern systems).
The laws of parallel computing result in the actual behavior of the computing systems:
the more communication takes place [23], the more deviation from the classic behavior
is experienced. Similarly, in physics, the behavior of an atom changes sharply through
interaction (communication) with other particles.
4.6 Rebooting our thinking
”Classic computing” cannot explain these phenomena. The limits of single-processor per-
formance enforced by the laws of nature [12] are topped by the limitation of parallel com-
puting [20, 23]. The ”quantal nature of time” [21] limits that performance further. Notice
that these contributions are competing with each other; the actual circumstances decide
which of them can dominate. Their effect, however, is very similar: according to Amdahl,
what is not parallel is qualified as sequential. To understand the phenomena, we introduced
a modern computing paradigm [22] (as opposed to the 70-year old classic paradigm).
Anyhow, if we want to understand completely the phenomena experienced under extreme
(computing) conditions, we must also reboot our thinking; using another technology
implementation, as in “rebooting computing models” [37], is not sufficient. Today we have
extremely inexpensive (and at the same time extremely complex and powerful) processors
around (a “free resource” [53]), and we have arrived at the age when no additional
reasonable functionality can be implemented in processors by adding more transistors; the
over-engineered processors optimized for the single-processor regime do not enable reducing
the clock period [55]. The computing power hidden in many-core processors cannot be utilized
effectively for payload work because of the “power wall” (partly because of the improper
working regime [56]): we have arrived at the age of “dark silicon” [2]; we have “too many”
processors [57] around. Supercomputers face critical efficiency and performance issues;
real-time (especially cyber-physical) systems experience serious predictability, latency,
and throughput issues; in summary, the computing performance (without changing the present
paradigm) has reached its technological bounds. Computing needs renewal [4]. Our proposal,
the Explicitly Many-Processor Approach (EMPA) [58], is to introduce a new computing
paradigm and through that to reshape the way in which computers (including ANNs) are
designed and used today.
5 The effect of synchronization on the efficiency
Engineering practice commonly synchronizes the different, independently working circuits.
On one side, this makes the conditions clear: the time dependence of the signals is removed.
On the other side, however, it also synchronizes the communication needs: the different
units want to send their results at the same time, and they also receive their inputs at
the same time, leading to a considerable amount of idle waiting.
5.1 Benchmarking and efficiency
Communication is a dominating factor in the operation of many-processor systems. Fig. 5
attempts to provide a feeling of how the different tasks behave at different communication
intensities. The communication-to-computation intensity [8] is, of course, not drawn to
scale in the subfigures, but the figure illustrates well how the communication need changes
with the type of the task.
There are two commonly used benchmarks in supercomputing. As discussed in [23], the HPL
class tasks essentially need communication only at the very beginning and at the very end
of the task. This communication type is, however, not how real-life programs work. Because
of this difference, the benchmark HPCG was introduced a couple of years ago: experience
shows that the expectable payload performance is approximated much more accurately by HPCG
than by HPL, because real-life tasks need much more communication. Supercomputers show
different efficiencies when using different benchmark programs [32]. The efficiencies
measured by HPL and by HPCG differ by a factor of ca. 200-500, mainly due to the differing
number of cores.
[Figure 5, three panels drawn in the style of neural networks. HPL: Input Layer, “HPL
Layer”, Output Layer (node labels x; a1, a2, a3, ..., am; y). HPCG: Input Layer, “HPCG
Layer 1”, “HPCG Layer n”, Output Layer (node labels x; a1 ... am; n1 ... nm; y1, yn).
AI: Input Layer, Hidden Layer 1, Hidden Layer n, Output Layer (node labels x1 ... xn;
a1 ... am; n1 ... nm; y1 ... yk).]
Figure 5: The communication intensity in the different supercomputer tasks. The subfigures
correspond to the HPL, HPCG and AI cases. The arrows represent a communication action.
In Fig. 5, three special cases (having different communication intensities) are compared.
In the top and middle subfigures, the communication intensities of the popular
supercomputer benchmarks HPL and HPCG are displayed in the style of AI networks. The
“input layer” and the “output layer” are the same and comprise the initiating node only,
while the other “layers” are again the same: the rest of the cores. The bottom subfigure
depicts an AI network comprising n input nodes and k output nodes, and furthermore h hidden
layers comprising m nodes each.
In the HPL class, the communication intensity is the lowest possible one: the computing
units receive their task (and parameters) at the beginning of the computation, and they return
their result at the very end. That is, the core coordinating their work must deal with the
fellow cores only in these periods, so the communication intensity is proportional to the
number of the cores in the system. Notice the need to queue the requests at the beginning
and the end of the task.
In the HPCG class, iteration takes place: the cores return the result of one iteration to
the coordinator core, which performs sequential operations: it not only receives and
re-sends the parameters but also needs to compute the new parameters before sending them to
the fellow cores, and it repeats this several times. As a consequence, the non-parallelizable
fraction of the benchmarking time grows proportionally with the number of iterations. The
extra communication lowers the achievable performance roofline: as shown in Fig. 3, the
HPCG roofline is about 200 times lower than the HPL one. The fact that supercomputers show
two different efficacies for these two benchmarks [32] is well known, but its reason is not
fully understood. Rather than recognizing that the reason is simply the effect of the
number of cores on the efficiency of parallelized sequential systems, some “architectural
weaknesses” are supposed.
As can easily be seen from the figure, in the case of the benchmark HPL the initiating
node must issue m communication messages and collect m returned results, i.e., the
execution time is O(2m). In the case of the benchmark HPCG, this execution time is O(2Nm),
where N is the number of iterations. (One cannot compare the execution times directly
because of the different amounts of computation.)
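The two message counts can be written down as a minimal sketch. The function names and the
sample values are hypothetical; the sketch counts only the messages handled by the
initiating node and ignores the computation itself.

```python
def hpl_messages(m):
    """Initiating node sends m tasks and collects m results: 2*m messages."""
    return 2 * m


def hpcg_messages(m, iterations):
    """The send/collect round of the HPL case is repeated once per iteration."""
    return iterations * hpl_messages(m)


if __name__ == "__main__":
    m = 1_000_000  # assumed number of fellow cores
    for n_iter in (1, 10, 100):  # assumed iteration counts
        print(f"N={n_iter:>3}  HPL: {hpl_messages(m)}  HPCG: {hpcg_messages(m, n_iter)}")
```

With a single iteration the two counts coincide; each additional iteration adds another
2m sequentially handled messages.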
5.2 Supercomputer efficiency in terms of AI
The bottom part of Fig. 5 depicts how Artificial Neural Networks are supposed to operate.
The life begins in several input channels (rather than one, as in the HPL and HPCG cases),
which would be advantageous. However, the values must be communicated to all nodes in the
first hidden layer: the more input nodes and the more nodes in the hidden layer(s), the
many times more communication is required for the operation. The same also happens when
the first hidden layer communicates data to the second one, except that here the square of
the number of nodes is to be used as the weight factor of communication.
Initially, each of the n input nodes issues m messages (queuing #1) to the nodes in the
first hidden layer, i.e., altogether nm messages. If one uses a shared bus to transfer the
messages, these nm messages must be queued (queuing #2). Also, every single node in the
hidden layer receives (and processes) m input messages (queuing #3). Between the hidden
layers, the same is repeated (maybe several times) with $m^2$ messages, and finally km
messages are sent to the output nodes. In all cases, the messages are queued three times.
To make a fair comparison with the benchmarks HPL and HPCG, let us assume one input and
one output node. In this case, the AI execution time is $O(h \times m^2)$, provided that
h hidden layers are implemented. (Here it was assumed that the messaging mechanisms
between the layers are independent of each other. It is not so if they share a global bus;
see footnote 5.)
For a numerical example, let us assume that 1M cores are used in the supercomputers, that
1K nodes are present in each hidden layer of the AI network, and that only one input node
and one output node are used. In that case, all execution times are O(1M) (again, the
amount of computation is sharply different, so the scaling can be compared, but not the
execution times). This communication intensity explains why in Fig. 3 the HPCG “roofline”
falls hundreds of times lower than that of the HPL: the increased communication need
strongly decreases the achievable performance gain.
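The numerical example can be reproduced with a short sketch. It assumes full connectivity
between consecutive layers and h - 1 transitions between h hidden layers; the value h = 2
is an assumption made here only for illustration, since the text does not fix it. The
function name is hypothetical.

```python
def ai_messages(n_inputs, m_hidden, h_layers, k_outputs):
    """Messages in a fully connected feed-forward network: n*m into the first
    hidden layer, m*m per transition between hidden layers, k*m to the outputs."""
    between_hidden = (h_layers - 1) * m_hidden * m_hidden
    return n_inputs * m_hidden + between_hidden + k_outputs * m_hidden


if __name__ == "__main__":
    cores = 10**6    # 1M cores in the supercomputer case
    hidden = 10**3   # 1K nodes per hidden layer in the AI case
    print("HPL messages (2m):", 2 * cores)
    print("AI messages, h=2 :", ai_messages(1, hidden, 2, 1))
```

Both counts are of the order of one million, even though the AI network has a thousand
times fewer nodes than the supercomputer has cores.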
Notice that the number of computation operations increases with m, while the number of
communication operations increases with $m^2$. In other words, the more nodes in the hidden
layers, the higher the communication intensity (communication/computation ratio), and
because of this, the lower the efficiency of the system. Recall that since the AI nodes
perform simple computations compared to the functionality of the supercomputer benchmarks,
the communication/computation ratio is much higher, making the efficacy even worse. The
conclusions are underpinned by experimental research [60]:
• ”strong scaling is stalling after only a few dozen nodes”
• ”The scalability stalls when the compute times drop below the communication times,
leaving compute units idle. Hence becoming a communication bound problem.”
• ”the network layout has a large impact on the crucial communication/computation
ratio: shallow networks with many neurons per layer . . . scale worse than deep net-
works with less neurons.”
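The scaling argument that precedes the quoted findings (computation growing with m,
communication with the square of m) can be sketched as follows. The unit costs are
placeholder assumptions; only the trend of the ratio is meaningful.

```python
def comm_to_comp_ratio(m, comp_cost_per_node=1.0, comm_cost_per_message=1.0):
    """Communication/computation ratio for one fully connected layer pair."""
    computation = m * comp_cost_per_node           # grows with m
    communication = m * m * comm_cost_per_message  # grows with m**2
    return communication / computation


if __name__ == "__main__":
    for m in (10, 100, 1000):
        print(f"m={m:>5}  communication/computation = {comm_to_comp_ratio(m):.0f}")
```

The ratio grows linearly with m, which is in line with the quoted observation that shallow
networks with many neurons per layer scale worse.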
The massively “bursty” nature of the data (the different nodes of a layer want to use the
communication channel at the same moment) also makes the case harder. The commonly used
global bus is overloaded with messages. The possibility of wired point-to-point
communication is limited, but deploying it at least for the inter-layer communication
buses can help a lot.
The communication circuits receive the task of sending the data to N other nodes. The
computation and communication are ab ovo sequential, and the communication channel can
only transfer one data value at a time. What is worse, bus arbitration, addressing, and
latency prolong the transfer time (and in this way decrease the efficacy of the system).
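A minimal sketch of the idle waiting caused by such a burst follows. It assumes that all
senders become ready at the same moment, that the shared bus serves them one at a time in a
fixed order, and that arbitration, addressing and latency are ignored (so the real waiting
times would be longer). The names are hypothetical.

```python
def bus_waiting_times(n_senders, transfer_time):
    """Waiting time of each message when all senders are ready at t = 0 and a
    shared bus transfers one message at a time."""
    return [i * transfer_time for i in range(n_senders)]


def mean_wait(n_senders, transfer_time):
    """Average idle waiting time per sender for one burst."""
    waits = bus_waiting_times(n_senders, transfer_time)
    return sum(waits) / len(waits)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(f"{n:>5} senders: mean wait = {mean_wait(n, 1.0):.1f} transfer times")
```

The mean wait grows in proportion to the number of senders, which is why the burst-like
traffic hurts most in wide layers.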
5 “The idea of using the popular shared bus to implement the communication medium is no
longer acceptable, mainly due to its high contention.” [59]
5.3 The case of brain simulation
The case of brain simulation is very similar to the combined HPCG and AI case. Each neuron
in a layer has fellow neurons in the range $10^3 \ldots 10^4$, i.e., after one
computational step the layer neurons must wait up to $10^4$ times the communication time.
Assuming that the communication time is $10^2$ times longer than the computation time, the
efficacy of such a system can be about $10^5 \ldots 10^6$ times lower than the efficacy
without this type of communication, even if one does not consider other limiting
(technical) factors, like the bandwidth of communication.
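The arithmetic behind this estimate is a simple product, sketched below under the stated
assumptions (serial exchange with every fellow neuron, a fixed communication-to-computation
time ratio). The function name is hypothetical.

```python
def efficacy_loss_factor(fan_out, comm_to_comp_time):
    """Communication time per unit of computation time when a neuron must
    exchange data serially with fan_out fellow neurons."""
    return fan_out * comm_to_comp_time


if __name__ == "__main__":
    ratio = 10**2  # communication assumed 100 times slower than computation
    for fan_out in (10**3, 10**4):
        print(f"fan-out {fan_out}: efficacy about "
              f"{efficacy_loss_factor(fan_out, ratio):.0e} times lower")
```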
On one side, even with today's large number of available cores in the top supercomputers,
several orders of magnitude are missing to reach the number of units in the brain. On the
other side, a processor core today has much more computing performance than needed to
simulate the operation of a single neuron. The need and the possibility lead to resource
sharing: the computing capacity of one core is shared by several ‘neurons’. Rearranging the
scene (switching to another thread), however, needs extra organizational (again,
sequential-only) time, and context switching is extremely expensive [49, 51].
Brain simulation (and, on a somewhat smaller scale, artificial neural computing) requires
intensive data exchange between the parallel threads: the neurons are expected to report
the result of their neural computations periodically to thousands of fellow neurons.
Because the neurons must work on the same (biological) time scale, the (commonly used)
1 millisecond “grid time” (“the quantum of computing time”) has a noticeable effect on the
performance.
As pointed out in [21], using the commonly accepted 1 ms ‘time slot’ (the ‘biological
clock period’ or the manifestation of the ‘quantal nature of time’) alone degrades the
efficacy of the simulator program, which is topped by the large amount of burst-like
communication. As a consequence, the maximum performance gain of supercomputer programs
(or purpose-built, but processor-based, nervous system simulators) cannot exceed
≈ $10^4$, although introducing at least partly hierarchic communication paths like those
in [50] can reduce the communication by about 2 orders of magnitude (see footnote 6), i.e.,
it can increase the performance gain by about 2 orders of magnitude. The efficacy of the AI
networks must be between the efficacy of the real-life supercomputer tasks and that of the
brain simulation; for details see [23].
6 Efficacy of AI networks
In the history of computing, the “need for speed” is a central question. The different
efforts to develop single-processor performance aimed at achieving a higher number of
operations, literally at any price. The case of parallelized processing is no different.
Since the limits of single-processor performance are being approached, and the limitations
of parallel operation are not yet entirely discovered, one must be careful with new designs
to avoid some traps: to make some features better, one must make some others worse.
6 Loihi is a purpose-built computing chip, i.e., its performance in its special field, as
demonstrated, can be hundreds of times higher than that of its competitors, but the
resulting scaling in extremely large systems is also subject to limitations.
As mentioned above, to achieve high performance and low latency at the same time is
not possible, so the proper balance between them depends heavily on the conditions of the
application area.
6.1 Factors affecting efficacy of the systems
Fig. 6 attempts to summarize the role of the different contributions to the resulting
efficacy of parallelized systems. Subfigure A illustrates the behavior measured with the
benchmark High Performance Linpack (HPL). The looping contribution becomes remarkable
around 0.1 Eflops and breaks down the payload performance when approaching 1 Eflops. The
black dot marks the HPL performance of the computer used in works [52, 29]. Subfigure B
depicts the behavior measured with the benchmark High Performance Conjugate Gradients
(HPCG). In this case, the contribution of the application (thin brown line) is much higher,
while the looping contribution (thin green line) is the same as above. As a consequence,
the achievable payload performance is lower, and the breakdown of the performance is
softer. The black dot marks the HPCG performance of the same computer. Subfigure C
demonstrates what happens if the clock cycle is 5000 times longer: it causes a drastic
decrease in the achievable performance and firmly shifts the performance breakdown toward
lower nominal performance values.
Subfigures A and B roughly correspond to the supercomputer benchmark cases (HPL and HPCG,
respectively), while subfigure C approximately illustrates the case of AI performance. As
shown, using the present principles and implementations, the performance of parallelized
sequential systems is drastically lower than expected from the nominal performance. The
actual values depend on the actual methods of implementation, i.e., the methods and
principles of the implementation (including both the architecture and the mathematical
algorithm) of the AI networks should be selected carefully if one wants to achieve not
only the functionality but also a proper efficacy.
6.2 The role of accelerators
As an illustration, the special role of some accelerators is discussed here. “How those
accelerators connect to systems is a significant concern” [61], indeed. On one side, they
help to perform the computations more quickly. On the other side, they increase the
latency, i.e., decrease the efficacy. The communication intensity is a decisive factor in
the operation of the system, both in latency and in efficacy. It can be improved either by
increasing the proportion of time spent with computations or by decreasing the time of
data communication (increasing the communication bandwidth or using other methods of
communication).
Using GPGPUs to accelerate computing is very popular in supercomputing as well, and the
huge and rigorously controlled database [62] enables us to demonstrate these two effects.
Fig. 7 depicts the dependence of the “single processor performance” and the
“parallelization efficiency” on the ranking in 2017, for supercomputers with and without
GPGPU accelerators. As the figure shows, the GPGPU-accelerated performance is,
independently of the ranking, about 2-3 times higher than the non-accelerated one. The
right subfigure shows that the performance gain changes with the ranking (see footnote 7).
At higher ranking (at a lower number of cores), the performance amplification is higher
than 4, while at a high number of cores (at low ranking) the performance amplification is
only slightly above 2. This result is in good accordance with the result of a systematic
CPU/GPU performance comparison [63].
[Figure 6, panels A (HPL), B (HPCG) and C (NN). x-axis: RPeak (Eflop/s), $10^{-3}$ to
$10^{0}$; left y-axis: $(1-\alpha_{eff})$ for the HPL, HPCG and NN case, respectively,
$10^{-10}$ to $10^{-4}$; right y-axis: $R_{Max}$ (Eflop/s) for the same case, $10^{-5}$ to
$10^{0}$. Curves in each panel: αSW, αOS, αeff and RMax(Eflop/s).]
Figure 6: Contributions $(1-\alpha^{X}_{eff})$ to $(1-\alpha^{total}_{eff})$ and max payload
performance $R_{Max}$ of a fictive supercomputer (P = 1 Gflop/s @ 1 GHz) as a function of
the nominal performance. The blue diagram line refers to the right-hand scale ($R_{Max}$
values), all others ($(1-\alpha^{X}_{eff})$ contributions) to the left scale. The figure is
purely illustrating the concepts; the displayed numbers are somewhat similar to the real
ones.
[Figure 7. Left panel: x-axis: Ranking by HPL (0 to 50), y-axis: Processor performance
(Gflop/s) (0 to 150). Right panel: x-axis: Ranking by HPL (0 to 50), y-axis: Performance
amplification factor ($10^5$ to $10^9$). Series in both panels: Accelerated,
Non-accelerated, GPU-accelerated, and the regression line of each.]
Figure 7: The effect of the GPGPU acceleration on the single-processor performance and
parallelization efficiency. The data are taken from the database [62].
One must, however, consider that in the case of AI networks the network latency times are
about a hundred times higher than the computational times, so making the computations
faster does not relieve the bottleneck. On one side, the absolute time of processing
decreases. On the other side, the communication intensity (i.e., the efficiency) gets
worse. The effect is very similar to that of the GPGPUs used as processor accelerators:
the computing performance increases linearly, while the efficiency decreases exponentially.
These accelerators behave differently in small-scale and large-scale systems. In
small-scale systems, only the performance increase can be noticed, as the number of cores
is low. In large-scale systems, the large number of cores increases the latency (i.e.,
decreases the efficacy). That is, new principles for data communication must be sought [59].
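The first point of this paragraph can be illustrated with a two-term time model, sketched
below. It assumes that the total time is simply computation plus communication and that an
accelerator shortens only the computation; the factor of one hundred is the latency ratio
quoted above, and the function name is hypothetical.

```python
def overall_speedup(comp_time, comm_time, accel_factor):
    """Speedup of the total time when only the computation is accelerated."""
    before = comp_time + comm_time
    after = comp_time / accel_factor + comm_time
    return before / after


if __name__ == "__main__":
    comp, comm = 1.0, 100.0  # communication assumed 100 times the computation time
    for k in (2, 4, 1000):
        print(f"computation x{k} faster -> total speedup {overall_speedup(comp, comm, k):.4f}")
```

In this model the total speedup stays below 1.01 however fast the computation becomes,
because the communication term is untouched.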
6.3 Traps of enhancing the efficacy of systems
Preparing systems with high payload computing performance is a more complex task than
commonly assumed. The efficacy changes with the number of processing units and with the
contributions to the non-parallelizable portion (1 − α) of the task. Both must be
considered in concert to achieve high performance. Again, this is not something new: “the
effort expended on achieving high parallel processing rates is wasted unless it is
accompanied by achievements in sequential processing rates of very nearly the same
magnitude” [5].
On one side, the demonstrative failures of supercomputers with an extremely large number
of processing units prove that, using the conventional architecture (the conventional
contributions of the technical components), the number of processing units is limited to
about 1 million. On the other side, it is shown that clustering successfully attacks the
looping contribution, although it was already noticed that at an extremely large number of
nodes a similar performance breakdown occurs [30]. The chip-internal clustering introduced
by [26] is successful; at least, it enables us to operate 10+ million processing units.
That solution can produce outstanding computing performance when measured via the HPL
benchmark, but for real-life tasks (having efficacy similar to that of the HPCG benchmark)
Amdahl's law limits the performance. The reason for that difference is that the dominating
contribution changes between the two benchmarks: in HPL the contributions from
interconnection+HW+OS dominate [22], while in HPCG the dominating contribution comes from
the computation+communication (i.e., the “measuring device” itself).
7 This is due to the increased latency caused by the need to copy the data from one memory
to another. Introducing OpenCAPI [61] improves the latency (see the success of Summit and
Sierra), but presently no statistically evaluable data are available on that.
The GPGPU acceleration, as discussed, also behaves differently in small-scale and
large-scale systems; see Fig. 7. The basic reason (without using the OpenCAPI [61]
interconnection) is that the data must be copied between the different address spaces,
which increases the sequential-only component.
Ironically enough, using lower precision also hides a similar trap. As discussed, the
communication/computation ratio is one of the decisive factors of the payload performance.
When using half-precision instead of double-precision, the computation time should be four
times less, but the communication time remains the same, i.e., the
communication-to-computation ratio becomes higher. On one side, the absolute amount of time
spent with computation+communication decreases, but so does the total measured time. As a
consequence, the contribution to the (relative!) factor (1−α) is less than expected. When
using half-precision arithmetic, Summit consumes four times less power [42], but the
execution time of the benchmark HPL is only three times shorter. Consequently, using
shorter floating operands [42] and new representations of floating numbers [64] in AI
applications increases the relative communication intensity and through this decreases the
efficiency (although this is counter-balanced by the decreased computing time). Besides,
reduced precision also has its limitations [65].
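The half-precision trap can be illustrated with a toy model, sketched below. It assumes
that the run time splits into just two parts, a computation part that becomes four times
shorter and a communication part that does not change; the communication shares tried here
are arbitrary illustrative values, not measured data, and the function name is
hypothetical.

```python
def time_ratio_after_speedup(comm_fraction, comp_speedup):
    """Factor by which the total time shrinks when the computation becomes
    comp_speedup times faster and the communication time stays the same."""
    comp_fraction = 1.0 - comm_fraction
    new_time = comp_fraction / comp_speedup + comm_fraction
    return 1.0 / new_time


if __name__ == "__main__":
    for comm in (0.0, 0.05, 1.0 / 9.0, 0.25):
        gain = time_ratio_after_speedup(comm, 4.0)
        print(f"communication share {comm:.3f} -> run is {gain:.2f} times shorter")
```

Within this toy model, even a communication share of about one ninth already turns a
fourfold faster computation into a run that is only three times shorter; it is a
plausibility check of the argument, not an analysis of the Summit measurement.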
6.4 The case of Piz Daint
The supercomputers usually have not many items registered in the database TOP500 on
their development. One of the rare exceptions is supercomputer Piz Daint. Its devel-
opment history spans 6 years, two orders of magnitude in performance, and used both
non-accelerated computing and accelerated computing using two different accelerators. Al-
though usually more than one of its parameters was changed between the registered stages
of its development, it nicely underpins the statements of this paper. Fig. 8 displays how the
payload performance in the function of the nominal performance has developed in the case
of supercomputer Piz Daint, and how at the same time the efficacy evolved.
In the right subfigure, the bubbles display the measured performance values documented
in the TOP500 database [62], and the diagram lines show the performance predicted at that
stage. Because the lines show the "predicted performance", the accuracy of the
prediction can be estimated from the data measured in the next stage. The left subfigure
shows the efficacy values on the 2-parameter efficacy surface at the same stages of
development.
[Figure 8, left panel: dependence of the efficiency E_HPL on (1 − αeff) and on the number of
cores N, with the working points of Piz Daint for 2012/11, 2013/06, 2013/11, 2016/11, 2017/06,
and 2018/11. Right panel: development of R_Max^HPL (exaFLOPS) as a function of RPeak (exaFLOPS)
for Piz Daint, for the configurations Xeon E5-2670 (2012), Xeon E5-2670 (2013),
Xeon E5-2670 + NVIDIA K20x (2013), and Xeon E5-2690 + NVIDIA Tesla P100 (2016, 2017, 2018).]
Figure 8: The history of supercomputer Piz Daint in terms of efficiency and payload
performance [62].
The data from the first two years of Piz Daint (non-accelerated mode of operation)
can be compared directly. Increasing the number of cores results in the expected higher
performance, as the working point is still in the linear region of the efficiency surface. The
value slightly above the predicted one can be attributed to the fine-tuning of the architecture.
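The notion of a "linear region" can be made concrete with a minimal sketch, assuming the Amdahl-type efficiency E = 1/(N·(1−α) + α) that follows directly from Amdahl's law. The function names and the numerical values below are illustrative assumptions, not TOP500 data for Piz Daint.

```python
def efficiency(n_cores: float, one_minus_alpha: float) -> float:
    """Amdahl-type efficiency E = S / N = 1 / (N * (1 - alpha) + alpha)."""
    alpha = 1.0 - one_minus_alpha
    return 1.0 / (n_cores * one_minus_alpha + alpha)


def effective_one_minus_alpha(n_cores: float, eff: float) -> float:
    """Invert the formula above: (1 - alpha_eff) from a measured efficiency.

    1 / E = 1 + (N - 1) * (1 - alpha)  =>  (1 - alpha) = (1 / E - 1) / (N - 1)
    """
    return (1.0 / eff - 1.0) / (n_cores - 1.0)


def payload_performance(r_peak: float, n_cores: float, one_minus_alpha: float) -> float:
    """Payload performance as nominal performance times efficiency."""
    return r_peak * efficiency(n_cores, one_minus_alpha)


# Hypothetical sequential-only fraction, chosen only for illustration.
ONE_MINUS_ALPHA = 1e-7
for n in (1e4, 1e5, 1e6, 1e7):
    e = efficiency(n, ONE_MINUS_ALPHA)
    print(f"N = {n:.0e}  E = {e:.3f}")
```

As long as N·(1−α) is much smaller than 1, the efficiency stays close to 1 and the payload performance grows almost linearly with the number of cores; once N·(1−α) approaches 1, the efficiency drops noticeably and the growth becomes non-linear.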
Introducing accelerators resulted in a jump in payload efficiency (and also moved the
working point to the slightly non-linear region, see Fig. 8), and the payload performance
is roughly 3 times higher than would be expected from the value predicted
for the non-accelerated architecture. According to general experience [63], only a
small fraction of the computing power hidden in the GPU can be turned into payload
performance.
The designers may not have been satisfied with the accelerator, so they changed to another
one, with a slightly higher nominal performance but a much larger separate memory space.
The result was disappointing: the slight increase in the nominal performance of the GPU
could not counterbalance the increased time needed to copy between the larger separate
address spaces, and finally resulted in a deterioration of both the value of (1 − αeff) and
the efficiency, although the payload performance slightly increased. Introducing the GPU
accelerator increases the absolute performance, but (by introducing the extra
non-parallelizable component of copying the data) it also increases the value of (1 − αeff)
and decreases the efficiency.
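This trade-off can be reproduced with a minimal sketch, assuming a toy model in which the accelerator speeds up only the parallelizable part of each unit and the data copying adds a fixed, non-parallelizable time. All names and numbers are hypothetical and chosen for illustration; they are neither measured Piz Daint values nor the model used to produce Fig. 8.

```python
def run_time(one_minus_alpha, n_units, unit_speedup=1.0, copy_time=0.0):
    """Normalized run time: sequential part + copying + parallel part."""
    alpha = 1.0 - one_minus_alpha
    return one_minus_alpha + copy_time + alpha / (n_units * unit_speedup)


def summarize(one_minus_alpha, n_units, unit_speedup=1.0, copy_time=0.0):
    """Return (payload, efficiency, effective 1 - alpha) for one configuration.

    Performance is expressed in single-unit equivalents (an assumption).
    """
    nominal = n_units * unit_speedup
    payload = 1.0 / run_time(one_minus_alpha, n_units, unit_speedup, copy_time)
    eff = payload / nominal
    # Same inversion as for Amdahl-type efficiency, using the nominal size.
    eff_one_minus_alpha = (1.0 / eff - 1.0) / (nominal - 1.0)
    return payload, eff, eff_one_minus_alpha


# Hypothetical configurations: same units, with and without accelerators.
plain = summarize(one_minus_alpha=1e-6, n_units=1e5)
accelerated = summarize(one_minus_alpha=1e-6, n_units=1e5,
                        unit_speedup=10.0, copy_time=2e-6)

for label, (payload, eff, x) in (("plain", plain), ("accelerated", accelerated)):
    print(f"{label:12s} payload = {payload:10.0f}  E = {eff:.3f}  (1 - alpha_eff) = {x:.2e}")
```

In this toy setting the accelerated configuration delivers a higher payload performance, yet its efficiency is lower and its effective (1 − αeff) is larger, because the copy time is added to the sequential-only component.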
7 Conclusion
Biology-inspired computing fields (such as Artificial Neural Networks, Deep Learning, and
Brain Simulation) are exciting and useful. One must not forget, however, that there are
fundamental differences between biologically and electronically implemented systems.
Clock-driven, parallelized sequential electronic systems can, in some cases, imitate
the behavior of biological systems only very ineffectively. This paper aimed to call
attention to some key efficacy questions of their design.
Acknowledgements
Project no. 125547 has been implemented with the support provided from the National
Research, Development and Innovation Fund of Hungary, financed under the K funding
scheme.
References
[1] O. Babaoglu, K. Marzullo, F.B. Schneider, Real-Time Systems 5(4), 285–303 (1993).
DOI 10.1007/BF01088832. URL https://doi.org/10.1007/BF01088832
[2] M. Shafique, S. Garg, IEEE Design and Test 34(2), 8 (2017). DOI
10.1109/MDAT.2016.2633408
[3] S(o)OS project. Resource-independent execution support on exa-scale systems.
http://www.soos-project.eu/index.php/related-initiatives (2010)
[4] J. Végh, Renewing computing paradigms for more efficient parallelization of
single-threads (IOS Press, 2018), Advances in Parallel Computing, vol. 29, chap. 13, pp.
305–330
[5] G.M. Amdahl, in AFIPS Conference Proceedings, vol. 30 (1967), pp. 483–485.
DOI 10.1145/1465482.1465560
[6] M.D. Godfrey, ICL Technical Journal 5, 18 (1986)
[7] S.H. Fuller, L.I. Millett (eds.), The Future of Computing Performance: Game Over or
Next Level? (National Academies Press, Washington, 2011)
[8] J.P. Singh, J.L. Hennessy, A. Gupta, Computer 26(7), 42 (1993). DOI
10.1109/MC.1993.274941
[9] P.J. Denning, T. Lewis, Communications of the ACM pp. 54–65 (2017)
[10] M. Feldman. Exascale Is Not Your Grandfathers HPC.
https://www.nextplatform.com/2019/10/22/exascale-is-not-your-grandfathers-hpc/
(2019)
[11] US DOE Office of Science. Report of a Roundtable Convened to Consider
Neuromorphic Computing Basic Research Needs. https://science.osti.gov/-
/media/ascr/pdf/programdocuments/docs/Neuromorphic-Computing-
Report FNLBLP.pdf (2015)
[12] I. Markov, Nature 512(7513), 147 (2014)
[13] Liao, Xiang-ke et al., Frontiers of Information Technology & Electronic
Engineering 19(10), 1236–1244 (2018). DOI 10.1631/FITEE.1800494. URL
https://doi.org/10.1631/FITEE.1800494
[14] US DOE. The Opportunities and Challenges of Exascale Computing.
https://science.energy.gov/ /media/ascr/ascac/pdf/reports/Exascale subcommittee report.pdf (2010)
[15] K. Bourzac, Nature 551, 554 (2017)
[16] R.F. Service, Science 359, 617 (2018)
[17] US Government NSA and DOE. A Report from the NSA-DOE Technical Meeting on
High Performance Computing. https://www.nitrd.gov/
nitrdgroups/images/b/b4/NSA DOE HPC TechMeetingReport.pdf (2016)
[18] European Commission. Implementation of the Action Plan for the European High-
Performance Computing strategy. http://ec.europa.eu/newsroom/dae/document.cfm?
doc id=15269 (2016)
[19] Extremtech. Japan Tests Silicon for Exascale Computing in 2021.
https://www.extremetech.com/computing/272558-japan-tests-silicon-for-exascale-computing-in-2021
(2018)
[20] J. Végh, J. Vásárhelyi, D. Drótos, in Lecture Notes in Networks and Systems 68
(Springer, 2019), pp. 224–237
[21] J. Végh, Brain Informatics 6, 1 (2019)
[22] J. Végh, A. Tisan, in 2019 International Conference on Computational Science
and Computational Intelligence (CSCI) (IEEE, 2019), awarded as "Outstanding
Achievement", in print. URL http://arxiv.org/abs/1908.02651
[23] J. Végh, Parallel Computing in review, http://arxiv.org/abs/1908.02280 (2019)
[24] J. Congy, et al, Parallel and Distributed Systems 18, 1094 (2007)
[25] ARM. big.LITTLE technology (2011). URL
https://developer.arm.com/technologies/big-little
[26] F. Zheng, H.L. Li, H. Lv, F. Guo, X.H. Xu, X.H. Xie, Journal of Computer Science
and Technology 30(1), 145 (2015)
[27] Y. Ao, C. Yang, F. Liu, W. Yin, L. Jiang, Q. Sun, ACM Trans. Archit. Code Optim.
15(1), 11:1 (2018)
[28] J. Végh, P. Molnár, in 18th Internat. Carpathian Control Conf. ICCC (2017), pp.
394–399
[29] T. Ippen, J.M. Eppler, H.E. Plesser, M. Diesmann, Frontiers in Neuroinformatics 11,
30 (2017)
[30] www.nextplatform.com. Tooling Up For Exascale.
https://www.nextplatform.com/2019/11/12/tooling-up-for-exascale/ (2019)
[31] S. Krishnaprasad, J. Comput. Sci. Coll. 17(2), 288 (2001)
[32] IEEE Spectrum. Two Different Top500 Supercomputing Benchmarks
Show Two Different Top Supercomputers.
https://spectrum.ieee.org/tech-talk/computing/hardware/two-different-top500-supercomputing-benchmarks-show-two-different-top-supercomputers
(2017)
[33] S. Eyerman, L. Eeckhout, SIGARCH Comput. Archit. News 38(3), 362 (2010)
[34] T. David, R. Guerraoui, V. Trigonakis, in Proceedings of the Twenty-Fourth ACM
Symposium on Operating Systems Principles (SOSP ’13) (2013), pp. 33–48. DOI
10.1145/2517349.2522714
[35] L. Yavits, A. Morad, R. Ginosar, Parallel Computing 40(1), 1 (2014)
[36] J.S. Vetter, E.P. DeBenedictis, T.M. Conte, IEEE Micro 37, 6 (2017)
[37] P.C. et al., in Proceedings of the 2019 Design, Automation & Test in Europe
Conference & Exhibition (DATE) (IEEE Press, 2019), pp. 1469–1476. DOI
10.23919/DATE.2019.8715167
[38] F. Ellen, D. Hendler, N. Shavit, SIAM J. Comput. 43(3), 519–536 (2012). DOI
10.1137/08072646X
[39] G. Bell, D.H. Bailey, J. Dongarra, A.H. Karp, K. Walsh, The International
Journal of High Performance Computing Applications 31(6), 469–484 (2017). URL
https://doi.org/10.1177/1094342017738610
[40] J. Dongarra, Report on the Sunway TaihuLight System. Tech.
Rep. Tech Report UT-EECS-16-742, University of Tennessee Department
of Electrical Engineering and Computer Science (2016). URL
http://www.netlib.org/utk/people/JackDongarra/PAPERS/sunway-report-2016.pdf
[41] S. Williams, A. Waterman, D. Patterson, Commun. ACM 52(4), 65 (2009)
[42] A. Haidar, S. Tomov, J. Dongarra, N.J. Higham, in Proceedings of the International
Conference for High Performance Computing, Networking, Storage, and Analysis
(IEEE Press, 2018), SC ’18, pp. 47:1–47:11
[43] J. Végh, P. Molnár, J. Vásárhelyi, CoRR abs/1606.02686 (2016). URL
http://arxiv.org/abs/1606.02686
[44] The Japan Times. Chief of firm behind worlds fourth-
fastest supercomputer arrested in Tokyo for alleged fraud.
https://www.japantimes.co.jp/news/2017/12/05/national/crime-legal/chief-firm-
behind-worlds-fourth-fastest-supercomputer-arrested-tokyo-alleged-fraud/#.WmQ-
KXRG3CI (2017)
[45] Inside HPC. Is Aurora Morphing into an Exascale AI Supercomputer?
https://insidehpc.com/2017/06/told-aurora-morphing-novel-architecture-ai-
supercomputer/ (2017)
[46] www.top500.org. Intel dumps knights hill, future of xeon phi product line
uncertain.
https://www.top500.org/news/intel-dumps-knights-hill-future-of-xeon-phi-product-line-uncertain///
(2017)
[47] J. Végh, The performance wall of the parallelized sequential computing – Can
parallelization save the (computing) world?, 1st edn. (Lambert Academic Publishing,
2019)
[48] S. Moradi, R. Manohar, Journal of Physics D: Applied Physics 52(1), 014003 (2018)
[49] D. Tsafrir, in Proceedings of the 2007 Workshop on Experimental Computer Science
(ACM, New York, NY, USA, 2007), ExpCS ’07, pp. 3–3
[50] M. Davies, et al, IEEE Micro 38, 82–99 (2018)
[51] F.M. David, J.C. Carlyle, R.H. Campbell, in Proceedings of the 2007 Workshop on
Experimental Computer Science (ACM, New York, NY, USA, 2007), ExpCS ’07.
DOI 10.1145/1281700.1281703. URL http://doi.acm.org/10.1145/1281700.1281703
[52] S.J. van Albada, A.G. Rowley, J. Senk, M. Hopkins, M. Schmidt, A.B. Stokes, D.R.
Lester, M. Diesmann, S.B. Furber, Frontiers in Neuroscience 12, 291 (2018)
[53] S.B. Furber, D.R. Lester, L.A. Plana, J.D. Garside, E. Painkras, S. Temple, A.D.
Brown, IEEE Transactions on Computers 62(12), 2454 (2013)
[54] S. Kunkel, M. Schmidt, J.M. Eppler, H.E. Plesser, G. Masumoto, J. Igarashi, S. Ishii,
T. Fukai, A. Morrison, M. Diesmann, M. Helias, Frontiers in Neuroinformatics 8, 78
(2014). DOI 10.3389/fninf.2014.00078
[55] M. Schlansker, B. Rau, Computer 33(2), 37 (2000)
[56] L.A. Barroso, U. Hölzle, Computer 40, 33 (2007)
[57] A. Mendelson, (2007)
[58] J. Végh, Parallel Computing 75, 28 (2018)
[59] L. de Macedo Mourelle, N. Nedjah, F.G. Pessanha, Reconfigurable and Adaptive
Computing: Theory and Applications (CRC press, 2016), chap. 5: Interprocess
Communication via Crossbar for Shared Memory Systems-on-chip
[60] J. Keuper, F.J. Preundt, in 2nd Workshop on Machine Learning in HPC Environments
(MLHPC) (IEEE, 2016), pp. 1469–1476. DOI 10.1109/MLHPC.2016.006. URL
https://www.researchgate.net/publication/308457837
[61] IBM. Why IBM sees OpenCAPI and OMI as the future for accelerator-driven
computing.
https://www.techrepublic.com/article/why-ibm-sees-opencapi-and-omi-as-the-future-for-accelerator-driven-computing/
(2019)
[62] TOP500.org. The top 500 supercomputers. https://www.top500.org/ (2019)
[63] V.W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A.D. Nguyen, N. Satish,
M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, P. Dubey, in Proceedings
of the 37th Annual International Symposium on Computer Architecture (ACM,
New York, NY, USA, 2010), ISCA ’10, pp. 451–460. DOI 10.1145/1815961.1816021.
URL http://doi.acm.org/10.1145/1815961.1816021
[64] J.L. Gustafson, I. Yonemoto. Beating Floating Point at its Own Game: Posit
Arithmetic. http://www.johngustafson.net/pdfs/BeatingFloatingPoint.pdf (2018). DOI
10.14529/js170206
[65] A. Haidar, P. Wu, S. Tomov, J. Dongarra, in Proceedings of the 8th Workshop on
Latest Advances in Scalable Algorithms for Large-Scale Systems (ACM, New York,
NY, USA, 2017), ScalA ’17, pp. 10:1–10:8
