Which scaling rule applies to Artificial Neural Networks by Végh, János
Which scaling rule applies
to Artificial Neural Networks
János Végh
Kalimános BT
Debrecen, Hungary
Vegh.Janos@gmail.com ORCID: 0000-0002-3247-7810
Abstract—Although Artificial Neural Network (ANN)s are
biology-mimicking systems, they are built from components
designed/fabricated for use in conventional computing, created
by experts trained in conventional computing. Building strongly
cooperating and communicating computing systems using segre-
gated single processors, however, has severe performance limita-
tions. The achievable payload computing performance sensitively
depends on workload type, and this effect is only poorly known.
The type of workload that Artificial Intelligence (AI) systems
require yields an especially low payload computational performance
for ANN applications. Unfortunately, the initial successes of
demo systems that comprise only a few ”neurons” and solve
simple tasks are misleading: the scaling of computing-based
ANN systems is strongly nonlinear. The paper discusses some
major limiting factors that affect performance and points out
that building large biology-mimicking systems inevitably
requires drastic changes in the present computing paradigm
and architectures.
Index Terms—energy efficiency, computing efficiency, Artificial
Intelligence, scaling rule, neural network
I. INTRODUCTION
By now, the single-processor performance has practically
reached its inherent limits defined by the laws of nature.
Given the proliferation of ANN-based devices, applications,
and methods; and the fact that even supercomputers are re-
targeted for AI applications, the efficacy of such systems is
gaining growing importance. As the attempts to build
truly parallel computing systems failed [1], it seems that the
only way to produce the required high computing performance
is by creating parallelized sequential computing systems. In
such systems, the component processors work sequentially,
but their work is orchestrated (usually by one of the
processors) in such a way that it gives the illusion that the system
is working in a parallel manner. The laws of (massively) parallelized
computing systems [2], however, differ greatly from those of
systems comprising a single processor (or only a few processors).
Amdahl proposed to build ”general purpose computers with
a generalized interconnection of memories or as specialized
computers with geometrically related memory interconnections
and controlled by one or more instruction streams” [3].
Amdahl has also pointed out why the processors should be
Project no. 125547 has been implemented with the support provided
from the National Research, Development and Innovation Fund of Hungary,
financed under the K funding scheme.
submitted to The 22nd International Conference on Artificial Intelligence
(ICAI’20: July 27-30, 2020, USA)
assembled to form a higher-performance system in the way he
proposed: ”the non-parallelizable portion [(aka serial part)] of
the task seriously restricts the achievable performance of the
resulting system”. His followers derived the famous Amdahl’s
formula [4], which served as the base for the ”strong scaling”.
That formalism provided a ”pessimistic” prediction for the
resulting performance of the parallelized systems, and led to
decades-long debates about the validity of the Law.
The appearance of the ”massively parallel” systems im-
proved the degree of parallelization so much that the re-
searchers suspected that Amdahl’s Law was not valid. As
all the scaling laws represent approximations [4] (they are
based on some omissions, and so they have their range of
validity), the new experiences led to a new approximation: the
”weak scaling” [5] appeared. The underlying assumption was
that the payload performance increases linearly with the size
of the task (provided that the nominal performance increases
correspondingly); that is, the efficiency remains unity.
However, the observation that ”in practice, for several applications, the
fraction of the serial part happens to be very, very small thus
leading to near-linear speedups” [6] misled the researchers.
Due to this, as [7] discovered: ”Gustafson’s formulation gives
an illusion that as if N can increase indefinitely”. Gustafson
established his ”scaling” for several hundred processors only,
and the interplay of improving parallelization and general
hardware (HW) development (including the non-determinism
of the modern HW [8]) concealed for decades that the scaling
was used far outside of its range of validity. Although common
experience showed that it was not valid for the case of systems
comprising an ever higher number of processors (an ”empirical
efficiency” appeared), and later researchers measured ”two
different efficiencies” [9] for the same supercomputer (under
different workloads), the ”weak scaling” was not suspected to
be responsible for the issues. Within its range of validity it
was a good approximation.
On one side, as discussed in [4], [7], ”weak scaling” is
based on a simple arithmetical (or logical) mistake. After
correcting that mistake, these two famous scaling laws became
two different formulations of the same behavior. For modern
complex HW and utilization conditions one must introduce
a slight correction, called ”modern scaling”, according to
which the parallelized sequential systems have their inherent
performance limitation [2], [4]. Using that formalism and
parameters from the TOP500 database [10], we could estimate
arXiv:2005.08942v1 [cs.DC] 15 May 2020
the performance limits for present supercomputers. It enabled
us to comprehend why the supercomputers have a performance
limit [11] as well as why ”can be seen in our current situation
where the historical ten-year cadence between the attainment
of megaflops, teraflops, and petaflops has not been the case
for exaflops” [12]. On the other side, ”weak scaling” omits
all non-payload (but needed for the operation) activities, such
as interconnection time, physical size (signal propagation
time), accessing data in an amount exceeding cache size,
synchronization of different kinds, that are surely present when
working with ANNs. Because of this, except for systems comprising
very few neurons, ”weak scaling” cannot be safely used for
ANNs, even as a rough approximation.
We also validated [4] the ”modern scaling” by applying
it, among others, to qualifying a load-balancing compiler,
cloud operation, and on-chip communication. Given that experts
with the same background build ANN systems from similar
components, we can safely assume that the same scaling is
valid for those systems, too. However, ANNs represent a
specific workload (see below). Calibrating the systems for
some specific workload (due to lack of validated data) is not
possible, but one can compare the behavior of systems and
draw some general conclusions.
II. PERFORMANCE LIMIT OF COMPUTER-BASED AI
SYSTEMS
As discussed in detail in [11], the payload performance
P(N, α) of parallelized systems comprising N processors is
described¹ as
$$P(N,\alpha) = \frac{N \cdot P_{single}}{N \cdot (1-\alpha) + \alpha} \qquad (1)$$
where Psingle is the single-thread performance of the indi-
vidual processors, and α is describing parallelization of the
given system for the given workload (i.e., it depends on both
of them). In contrast with the prediction of ’weak scaling’,
payload performance and nominal performance differ by a
factor, growing with the number of cores. This conclusion
is well-known, but forgotten: ”This decay in performance is
not a fault of the architecture, but is dictated by the limited
parallelism” [13].
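Eq. (1) can be evaluated numerically to see how payload and nominal performance drift apart as N grows. The following is a minimal Python sketch, not the author's code; the function names and the sample value of α are assumptions chosen only for illustration.

```python
def payload_performance(n, alpha, p_single=1.0):
    """Eq. (1): P(N, alpha) = N * P_single / (N * (1 - alpha) + alpha)."""
    return n * p_single / (n * (1.0 - alpha) + alpha)


def efficiency(n, alpha):
    """Payload performance divided by the nominal performance N * P_single."""
    return payload_performance(n, alpha) / n


if __name__ == "__main__":
    # Hypothetical parallelization: a non-parallelizable fraction of 1e-6.
    alpha = 1.0 - 1e-6
    for n in (1, 10**3, 10**6, 10**9):
        print(f"N={n:>10}  P={payload_performance(n, alpha):14.1f}  "
              f"efficiency={efficiency(n, alpha):.6f}")
```

For N = 1 the expression reduces to the single-processor performance, and for very large N it approaches P_single/(1 − α), so adding processors beyond roughly 1/(1 − α) mostly increases the nominal performance rather than the payload performance.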
The key issue is, however, that one can hardly calculate the
value of α for the present complex HW/software (SW) systems
from their technical data, although some estimated values
can be derived. For supercomputers, however, one can derive
a theoretical ”best possible”, and already achieved ”worst
case” values. It gives us good confidence that those values
deviate only within a factor of two. We cannot expect similar
results for ANNs: there are no generally accepted benchmark
computations, and also there are no standard architectures².
Using a benchmark means a particular workload, and com-
paring results of even a standardized ANN benchmark on
¹ At least in a first approximation, see [11].
² Just notice that selecting a benchmark also directs the architectural
development: the benchmarks High Performance Linpack (HPL) and High
Performance Conjugate Gradients (HPCG) result in different rankings.
different architectures is of as little use as comparing results
of benchmarks HPL and HPCG on the same architecture.
Recall also that with a large number of processors, the internal
latency of the processors also matters. Following the failure of
supercomputer Aurora’18, Intel admitted: ”Knights Hill was
canceled and instead be replaced by a ”new platform and new
microarchitecture specifically designed for exascale”” [14].
We expect that shortly it shall be admitted that building large-
scale AI systems is simply not possible based on the old archi-
tectural principles [15], [16]. The potential new architectures,
however, require a new computing paradigm, that can give a
proper reply to power consumption and performance issues of
– among others – ANN computing.
III. FACTORS AFFECTING PERFORMANCE
Many factors affect the payload performance of ANN
systems; we can discuss only some of them here. Communication
is a dominating factor in the operation of many-processor
systems, and especially for ANNs. This paper
attempts to provide a feeling of how different factors affect
the performance of many-processor systems.
A. Communication-to-computation ratio
As we learned decades ago, ”the inherent communication-
to-computation ratio in a parallel application is one of the
important determinants of its performance on any architecture” [13],
suggesting that communication can be a dominant
contribution to the system’s performance. In the case of neural
simulation, very intensive communication must take place,
so the non-payload to payload ratio has a significant impact
on the performance of ANN type computations. That ratio and
the corresponding workload type are closely related: using a
specific benchmark implies using a specific communication-
to-computation ratio. In the case of supercomputing, the same
workload is running on (nearly) the same type of architecture,
which is not the case for ANNs.
B. The computing benchmarks
There are two commonly used benchmarks in supercom-
puting. The HPL class tasks essentially need communication
only at the very beginning and at the very end of the task. The
real-life programs, however, usually work in a non-standard
way. Because of this reason, a couple of years ago, the
community introduced the benchmark HPCG: the experience
shows that payload performance is much more accurately
approximated by HPCG than by HPL, because real-life tasks
need much more communication than HPL. Importantly, since
the quality of their interconnection improved considerably,
the supercomputers show different efficiencies when using
different benchmark programs [9]. Their efficiencies differ by
a factor of ca. 200-500 (a fact that remains unexplained in the
frame of ”weak scaling”), when measured by HPL and HPCG,
respectively.
In the HPL class, communication intensity is the lowest
possible one: computing units receive their task (and param-
eters) at the beginning of computation, and they return their
[Figure 1, panels A, B and C. Left column: network-style models of the communication in HPL (Input Layer, ”HPL Layer”, Output Layer), in HPCG (Input Layer, ”HPCG Layer 1” ... ”HPCG Layer n”, Output Layer) and in an ANN (Input Layer, Hidden Layer 1 ... Hidden Layer n, Output Layer). Right column: log-log plots against RPeak (Eflop/s); left scale: (1 − αeff) with the contributions αSW, αOS and αeff; right scale: RMax (Eflop/s).]
Fig. 1. The different communication/computation intensities of the applications lead to different payload performance values in the same supercomputer system.
Left column: the models of computing intensities for different benchmarks. Right column: the corresponding payload performances and α contributions as a
function of the nominal performance of a fictive supercomputer (P = 1 Gflop/s @ 1 GHz). The blue diagram
line refers to the right-hand scale (RMax values), all others ((1 − αXeff) contributions) to the left scale. The figure purely illustrates the concepts; the
displayed numbers are somewhat similar to the real ones. The performance breakdowns shown in the figures were experimentally measured by [13], [17] (Fig. 7)
and [18] (Fig. 8).
result at the very end. That is, the core coordinating their
work must deal with the fellow cores only in these periods,
so the communication intensity is proportional to the number
of cores in the system. Notice the need to queue requests at
the beginning and the end of the task.
In the HPCG class, iteration takes place: cores return the
result of one iteration to the coordinator core, which performs
sequential operations: it not only receives and re-sends the
parameters, but also needs to compute new parameters before
sending them to the fellow cores. The program repeats the process
several times. As a consequence, the non-parallelizable
fraction of benchmarking time grows proportionally to the
number of iterations. That extra communication
decreases the achievable performance roofline [19]: as shown
in Fig. 2, the HPCG roofline is about 200 times lower than
the HPL one.
C. The workload type
The role of workload came to light after interconnection
technology was greatly improved, and as a consequence, the
benchmarking computation (defining the type of the workload)
became the dominating contributor defining the value of
α (and as a consequence, the payload performance); for a
discussion see [11]. The rather complex Fig. 1 illustrates
why and how the performance of a configuration
depends on the application it runs.
Fig. 1 compares three workloads having different communication
intensities. In the top and middle figures the communication
intensities of the standard supercomputer benchmarks
HPL and HPCG are displayed in the style of AI networks.
The ”input layer” and ”output layer” are the same, and
comprise the initiating node only, while the other ”layers” are
again the same: the rest of the cores. Subfigure 1.C depicts an
AI network comprising n input nodes and k output nodes,
as well as h hidden layers, each comprising m nodes. The
communication-to-computation intensity [13] is, of course, not
depicted proportionally in the subfigures, but the figure illustrates
well how the communication need of different computer
tasks changes with the type of the task.
As can be easily seen from the figure, in the case of
benchmark HPL the initiating node must issue m commu-
nication messages and collect m returned results, i.e., the
execution time is O(2m). In the case of benchmark HPCG
this execution time is O(2Nm) where N is the number of
iterations (one cannot directly compare the execution times
because of the different amounts of computations).
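The two message counts quoted above can be written down directly. This is a small Python sketch under the simplifying assumption that every message costs the initiating node one time unit; the function names are hypothetical.

```python
def hpl_messages(m):
    """HPL-like workload: the initiating node sends m task messages
    and later collects m results, i.e. 2*m messages in total."""
    return 2 * m


def hpcg_messages(m, iterations):
    """HPCG-like workload: the same send/collect round is repeated
    once per iteration, i.e. 2*N*m messages for N iterations."""
    return 2 * iterations * m


if __name__ == "__main__":
    m = 1_000_000  # number of fellow cores (sample value)
    print("HPL :", hpl_messages(m))
    print("HPCG:", hpcg_messages(m, iterations=50))  # 50 is a hypothetical count
```

The ratio of the two counts equals the number of iterations, which is why the sequential load on the coordinating core grows with the iteration count even though the number of cores is unchanged.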
Figure 1.A displays the case of minimum communication,
and Figure 1.B a moderately increased one (corresponding to
real-life supercomputer tasks). As the nominal performance
increases linearly and payload performance decreases expo-
nentially with the number of cores, at some critical value
where an inflection point occurs, the resulting performance
starts to decrease. The resulting sizeable non-parallelizable
fraction sharply decreases the efficacy (in other words: the
performance gain or speedup) of the system [20]. The effect
was noticed early [13], under different technical conditions but
somewhat faded due to the development of the parallelization
technology.
The non-parallelizable fraction (denoted on the figure by
αXeff ) of a computing task comprises components X of
different origin. As already discussed, and was noticed decades
ago, ”the inherent communication-to-computation ratio in a
parallel application is one of the important determinants of
its performance on any architecture” [13], suggesting that
communication can be a dominant contribution to the system’s
performance.
The workload in ANN systems comprises components of
type ”computation” and ”communication” (this time also com-
prising data access and synchronization, i.e., everything that
is ’non-computation’). As the logical interdependence between
those contributions is strictly defined, the payload performance
of the system is limited by both factors, and the same system
(maybe even within the same workload, case by case) can be
either computing-bound or communication-bound, or both.
As section III-G discusses, in some cases, a third competitor
can also appear on the scene, and can even play a major role.
That is, it is not easy at all to describe an ANN system in
terms of the ”roofline” [19] model: depending on the actual
conditions, the dominant player (the one that defines the level
of the roofline) may change. In any case, the
one defining the lowest roofline shall dominate, but their
competition may result in unexpected issues (for an example
see how the computation and interconnection changed their
dominating role, in [11]). Because of this, Fig. 2 has limited
validity, but provides a feeling 1/ that for all workload types
a performance plateau exists and has already been reached; 2/ how
high a performance gain can be achieved for the different
workloads; 3/ where the efficiency of the particular kind of
ANNs, the brain simulation on supercomputers, is located
compared to that of the standard benchmarks (a reasoned
guess); 4/ that the efficiency of ANNs can be somewhere between
that of HPCG and brain simulation, closer to the latter one.
D. Accelerators
Presumably, as a side-effect of ”weak scaling”, it is usually
assumed that decreasing the time needed for the payload
contribution affects the efficiency of ANN systems linearly.
However, it is not so. As discussed in detail in [21], doing so also
changes the non-payload to payload ratio that defines efficiency.
We mention two prominent examples here: using shorter
operands (moving less data and performing fewer bit manipulations)
and mimicking the operation of the neuron in an entirely
different way: using quick analog signal processing rather than
slow digital calculation.
The so-called HPL-AI benchmark used Mixed Precision³
[22] rather than Double Precision computations. The researchers
succeeded in achieving nearly 3 times better performance
gain, namely (as correctly stated in the announcement)
³ Neither of the names is consistent. On the one side, the test itself has not
much to do with AI; it just uses the operand length common in AI tasks (HPL,
similarly to AI, is a workload type). On the other side, the Mixed Precision is
Half Precision: it is natural that for multiplication twice as long operands are
used temporarily. It is a different question that the operations are contracted.
[Figure 2: ”The roofline of performance gain of supercomputers”. Horizontal axis: Year (1994 to 2020); vertical axis: Performance gain (logarithmic, 10^2 to 10^8). Legend: 1st, 2nd and 3rd by RHPLMax; Best by αHPLeff; 1st, 2nd and 3rd by RHPCGMax; Best by RHPCGMax; Brain simulation.]
Fig. 2. The RMax payload performance in function of the nominal performance RPeak , at different (1−αeff ) values. The figures display the measured values
derived using HPL and HPCG benchmarks, for the TOP3 supercomputers. The small black dots mark the performance data of supercomputers JUQUEEN
and K as of 2014 June, for HPL and HPCG benchmarks, respectively. The big black dot denotes the performance value of the system used by [18]. The
saturation effect can be observed for both HPL and HPCG benchmarks.
”Achieving a 445 petaflops mixed-precision result on HPL
(equivalent to our 148.6 petaflops DP result)”, i.e. the peak
DP performance did not change. However, this suggests the
illusion that when using supercomputers for AI tasks and using
half-precision, one can expect this peak performance.
Unfortunately, the achievement comes from accessing less
data in memory and using quicker operations on the shorter
operands rather than reducing the communication intensity.
For AI applications, limitations remain the same as described
above; except that when using Mixed Precision, the efficiency
shall be better by a factor of 2-3, compared to the efficiency
measured when using double precision operands.⁴
We expect that when using half-precision (FP16) rather
than double precision (FP64) operands in the calculations,
four times less data are transferred and manipulated by the
system. The measured power consumption data underpin the
statement. However, computing performance is only 3 times
higher than in the case of using 64-bit (FP64) operands. The
non-linearity has its effect even in this simple case (recall
that HPL uses minimum communication). In the benchmark,
housekeeping activity (data access, indexing, counting, addressing)
also takes time. The measured performance data even
⁴ Similarly, exchanging data directly between the processing units [23]
(without using the global memory) also enhances α (and payload perfor-
mance) [24], but it represents a (slightly) different computing paradigm.
enable us to estimate the execution time with zero precision
(FP0) operands, see [11].
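The gap between "four times less data" and "only 3 times higher performance" can be turned into a rough estimate of the operand-independent (housekeeping) share of the run time. The sketch below is in Python and rests on an assumed linear model that is not taken from the paper or from [11]: the FP64 run time is normalized to 1 and split into a part h that does not depend on operand length and a part 1 − h that shrinks in proportion to the operand length.

```python
def housekeeping_fraction(speedup, data_ratio):
    """Operand-independent share h of the long-operand run time.

    Assumed model (an illustration, not the method of [11]):
        T_long  = h + (1 - h) = 1
        T_short = h + (1 - h) / data_ratio
        speedup = T_long / T_short
    Solving for h gives (1/speedup - 1/data_ratio) / (1 - 1/data_ratio).
    """
    if speedup <= 0 or data_ratio <= 1:
        raise ValueError("speedup must be positive and data_ratio greater than 1")
    inv_s = 1.0 / speedup
    inv_r = 1.0 / data_ratio
    return (inv_s - inv_r) / (1.0 - inv_r)


if __name__ == "__main__":
    # Figures quoted in the text: 4 times less data, about 3 times the performance.
    h = housekeeping_fraction(speedup=3.0, data_ratio=4.0)
    print(f"operand-independent share of the FP64 run time: {h:.3f}")
```

Under this assumption the quoted figures give h = 1/9, i.e. about 11 % of the FP64 run time would remain even with hypothetical zero-length operands. The number only illustrates the reasoning; the estimate referred to in the text is the one in [11].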
Another plausible assumption is that if we use quick analog
signal processing to replace the slow digital calculation, as
proposed in [25], [26], the system gets proportionally quicker.
Presumably, on systems comprising just a few neurons, one
can measure a considerable, but less than expected, speedup.
The relative weight of housekeeping becomes larger than in the
case of purely digital processing. That is, in a hypothetical
measurement, the speedup would be much less than the ratio
of the analog/digital processing times, even in the case of HPL
benchmark. Recall that here the workload is of AI type, with
much worse parallelization (and non-linearity). As a conse-
quence, one cannot expect a considerable speedup in large
neuromorphic systems. Besides, adding analog components to
a digital processor has its price. Given that a digital processor
cannot handle resources outside of its world, one must call the
operating system (OS) for help. That help, however, is rather
expensive in terms of execution time. The required context
switching takes time in the order of executing 10^4 instructions
[27], [28], which greatly increases total execution time.
This effect dramatically increases the time of housekeeping,
and makes the non-payload to payload ratio much worse than
it was before introducing that enhancement.
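The argument of this subsection can be illustrated with a simple ratio of per-operation times. The Python sketch below is an assumption-laden toy model: the function name and all sample timings are hypothetical, and only the order of magnitude of the context-switch cost (10^4 instruction times) comes from the text.

```python
def speedup(t_compute_old, t_compute_new, t_housekeeping, t_extra=0.0):
    """Speedup of one operation when only its compute part gets faster.

    All times are in units of one instruction time. t_extra models added
    non-payload time, e.g. an OS call needed to reach an analog unit.
    """
    old_total = t_housekeeping + t_compute_old
    new_total = t_housekeeping + t_extra + t_compute_new
    return old_total / new_total


if __name__ == "__main__":
    # Hypothetical timings: the analog unit is 100 times quicker than digital.
    digital, analog, housekeeping = 1000.0, 10.0, 100.0
    context_switch = 1e4  # order of magnitude quoted in the text

    print("compute-only ratio   :", digital / analog)
    print("with housekeeping    :", speedup(digital, analog, housekeeping))
    print("with OS context switch:",
          speedup(digital, analog, housekeeping, t_extra=context_switch))
```

With these sample values the 100-fold quicker compute step yields only about a 10-fold speedup once housekeeping is counted, and a slowdown (a ratio below 1) once a context switch is added to every operation.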
E. The timing of activities
In ANNs, data transfer time must be considered seriously.
In both biological and electronic systems, both the distance
between the entities of the network and the signal propagation
speed are finite. Because of this, in physically large-sized and/or
intensively communicating systems the ”idle time” of proces-
sors defines the final performance a parallelized sequential
system can achieve. In conventional computing systems also
the ’data dependence’ limits the achievable parallelism: we
must compute data before we can use it as an argument for
another computation. Although, of course, also in conventional
computing, data must be delivered to the place of second
utilization, thanks to ”weak scaling” [5], this ”communication
time” is neglected. For example, scaling of matrix operations
and ”sparsity”, mentioned in [29], work linearly only if data
transfer time is neglected.
In neuromorphic computing, including ANNs, transfer time
is a vital part of information processing. A biological brain
must deploy a ”speed accelerator” to ensure that the control
signals arrive at the target destination before the arrival of
the controlled messages, even though the former originate from
a distant part of the brain [30]. This aspect is so vital in
biology that the brain deploys many cells with the associated
energy investment to keep the communication speed higher for
the control signal. Computer technology cannot speed up the
communication selectively, as biology does, and it is not worthwhile to
keep part of the system at a lower speed selectively. However,
as discussed in [21], handling data timing adequately is vital,
especially for bio-mimicking ANNs.
F. The layer structure
The bottom part of Fig. 1 depicts how ANNs are supposed to
operate. Operation begins in several input channels (rather than
one as in the HPL and HPCG cases), which would be advantageous.
However, the system must communicate the values to all nodes
in the first hidden layer: the more input nodes and the more
nodes in the hidden layer(s), the more communication
is required for the operation. The same situation also
happens when the first hidden layer communicates data to the
second one, except that here the square of the number of nodes
is to be used as a weight factor of communication.
Initially, n input nodes issue messages, each one m mes-
sages (queuing#1) to nodes in the first hidden layer, i.e.,
altogether nm messages. If one uses a commonly used shared
bus to transfer messages, these nm messages must be queued
(queuing#2). Also, every single node in the hidden layer
receives (and processes) m input messages (queuing#3). Between
hidden layers, the same queuing is repeated (maybe
several times) with m·m messages, and finally, km messages
are sent to the output nodes. During this process, the system
queues messages three times. To make a fair comparison
with benchmarks HPL and HPCG, let us assume one input
and one output node. In this case, the AI execution time is
O(h × m^2), provided that the AI system has h hidden layers.
(Here we assumed that the messaging mechanisms between
layers are independent of each other. It is not so if they
share a global bus.⁵)
For a numerical example: let us assume that in the
supercomputers, 1M cores are used, in the AI network 1K nodes
are present in the hidden layers, and only one input and one output
node are used. In that case, all execution times are O(1M)
(again, the amount of computation is sharply different, so the
scaling can be compared, but not the execution times). This
communication intensity explains why in Fig. 2 the HPCG
”roofline” falls hundreds of times lower than that of the
HPL: the increased communication need strongly decreases
the achievable performance gain.
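The message counting of this subsection and the numerical example can be checked with a few lines of Python. This is a sketch only: the function name is hypothetical, the layers are assumed to be fully connected, and the number of hidden layers (two) is an assumption, since the text does not fix it for the example.

```python
def ann_messages(n_inputs, m, h, k_outputs):
    """Messages in a fully connected layered network.

    n_inputs * m  messages from the input nodes to the first hidden layer,
    m * m         messages between each pair of consecutive hidden layers
                  (h - 1 such pairs for h hidden layers),
    k_outputs * m messages from the last hidden layer to the output nodes.
    """
    return n_inputs * m + (h - 1) * m * m + k_outputs * m


if __name__ == "__main__":
    cores = 1_000_000   # 1M supercomputer cores
    hidden_nodes = 1_000  # 1K nodes per hidden layer

    hpl = 2 * cores  # one send and one collect per core
    ann = ann_messages(n_inputs=1, m=hidden_nodes, h=2, k_outputs=1)  # h=2 assumed

    print("HPL messages:", hpl)
    print("ANN messages:", ann)
```

Both counts are of the order of 10^6, in line with the O(1M) statement above, although the ANN reaches that order with a thousand times fewer nodes because its dominant term grows with m^2 rather than with m.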
Notice that the number of computation operations increases
with m, while the number of communication operations increases with
m^2. In other words: the more nodes in the hidden layers,
the higher is the communication intensity (communication-
to-computation ratio), and because of this, the lower is the
efficiency of the system. Recall that since AI nodes perform
simple computations compared to the functionality of su-
percomputer benchmarks, the communication-to-computation
ratio is much higher, making efficacy even worse. The con-
clusions are underpinned by experimental research [15]:
• ”strong scaling is stalling after only a few dozen nodes”
• ”The scalability stalls when the compute times drop
below the communication times, leaving compute units
idle. Hence becoming a communication bound problem.”
• ”the network layout has a large impact on the crucial
communication/compute ratio: shallow networks with
⁵ ”The idea of using the popular shared bus to implement the communication
medium is no longer acceptable, mainly due to its high contention.” [31]
many neurons per layer . . . scale worse than deep net-
works with less neurons.”
The massively ”bursty” nature of the data (the different
nodes of the layer want to communicate at the same moment)
also makes the case harder. Communication circuits receive
the task of sending data to N other nodes. What is worse, bus
arbitration, addressing, and latency prolong the transfer time (and
in this way decrease the efficacy of the system). This type of
communicational burst may easily lead to a ”communicational
collapse” [32].
G. The ”quantal nature of computing time”
One of the famous cases demonstrating the existence and
the competition of those limitations in the fields of AI is the
research published in [18]. The systems used in the study
were a HW simulator [33] explicitly designed to simulate
10^9 neurons (10^6 cores and 10^3 neurons per core) and a many-thread
simulation running on a supercomputer [34] able to
simulate 2·10^8 neurons (the authors mention 2·10^3 neurons per
core and supercomputers having 10^5 cores). The experience,
however, showed [18] that the scaling stalled at 8·10^4 neurons,
i.e., about 4 orders of magnitude less than expected. They
experienced stalling at about the same number of neurons, for
both the HW and the SW simulator.
Given that supercomputers have a performance limit [11],
one can comprehend the former experience: the brain simulation
needs heavy communication (the authors estimated that
≈ 10% of the execution time was spent with non-payload
activity), which sharply decreases the achievable performance, so
their system reached the maximum performance that the given
(1−α) enables: the sequential portion was too high. But why
can the purpose-built brain simulator not reach its maximum
expected performance? And is it just an accident that they
both stalled at the same value, or did some other limiting factor
come into play? Paper [35] gives the detailed explanation.
The short reply is that the digital systems, including brain
simulators, have a central clock signal that represents an
inherent performance limit: no action in the system can happen
in a shorter time. The total time divided by the length of the
clock period defines the maximum performance gain [11]. If
the length of the clock period is the commonly used 1 ns,
and the measurement time (in the case of supercomputers) is
in the order of several hours, this does not mean a limitation.
Computational time and biological time are not only not
equal, but they are also not proportional. To synchronize the
neurons periodically, a ”time grid”, commonly with 1 ms
integration time, was introduced. The systems use this grid
time to put the free-running artificial neurons back to the
biological time scale, i.e., they act as a clock signal: simulation
of the next computation step can only start when this clock
signal arrives. This action is analogous with introducing a
clock signal for executing machine instructions: the processor,
even when it is idle, cannot begin execution of its next
machine instruction until this clock signal arrives. That is,
in this case, the clock period is 10^6 times longer than the
clock period of the processor. Just because neurons must work
on the same (biological) time scale, when using this method
of synchronization, the (commonly used) 1 millisecond ”grid
time” (”the quantum of computing time”) has a noticeable
effect on the performance.⁶
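The bound described above (total time divided by the length of the clock period) is easy to evaluate for the two time quanta mentioned in the text. The Python sketch below uses a hypothetical function name and hypothetical measurement times; only the 1 ns clock period and the 1 ms grid time are taken from the text.

```python
def max_performance_gain(total_time_s, quantum_s):
    """Upper bound on the performance gain: the number of time quanta
    (clock periods or grid periods) that fit into the total time."""
    return total_time_s / quantum_s


if __name__ == "__main__":
    processor_clock = 1e-9  # 1 ns clock period
    grid_time = 1e-3        # 1 ms "time grid" of the neural simulation

    print("grid time / clock period:", grid_time / processor_clock)

    # Hypothetical measurement times, for illustration only.
    print("benchmark, 1 hour at 1 ns :", max_performance_gain(3600.0, processor_clock))
    print("simulation, 10 s at 1 ms  :", max_performance_gain(10.0, grid_time))
```

With the 1 ns clock and hours of run time the bound is so high that it never matters, whereas with the 1 ms grid and a run of seconds it falls to the order of thousands, which is the sense in which the grid time acts as a limiting ”quantum of computing time”.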
Also, the measurement time is shorter than for supercomputers.
Due to this, the inherent non-parallelizable portion is much
higher. Under these circumstances, as witnessed by Fig. 2, a
performance gain of about 10^3 was guessed. This performance
was reached using 1% of the available HW.
As using more cores only increased the nominal performance
(and, correspondingly, the power consumption) of their
system [18], the authors decided to use only a small fraction
of the resources, for example only 1% of the cores available
in the HW simulator. This measurement makes it possible to
estimate the efficacy of ANNs and to place the efficacy of brain
simulation on the scale of supercomputer benchmarking; see
Fig. 2.
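Why adding cores beyond a small fraction brings little payload gain can be sketched with the classic form of Amdahl's law [3]. This is a simplification, not the detailed model used in the paper. The non-parallelizable fraction of 10^-3 is an assumption chosen only so that the saturation value matches the estimated gain of about 10^3, and the total core count is hypothetical.

```python
def amdahl_speedup(serial_fraction: float, n_cores: int) -> float:
    """Classic Amdahl's law: speedup on n_cores for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)


SERIAL_FRACTION = 1e-3      # ASSUMPTION: chosen so the limit is about 10^3
TOTAL_CORES = 1_000_000     # HYPOTHETICAL core count of the HW simulator
USED_CORES = TOTAL_CORES // 100   # 1% of the cores, as in the text

gain_1pct = amdahl_speedup(SERIAL_FRACTION, USED_CORES)
gain_all = amdahl_speedup(SERIAL_FRACTION, TOTAL_CORES)

print(f"gain with 1% of the cores: {gain_1pct:.1f}")
print(f"gain with all cores:       {gain_all:.1f}")
print(f"saturation limit:          {1.0 / SERIAL_FRACTION:.1f}")
```

With these illustrative numbers, using a hundred times more cores raises the gain by only about ten percent, while the nominal performance and the power consumption grow roughly a hundredfold.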
Recall also the “communicational collapse” from the previous
section: even if the communication packets are randomized
in time, they represent a colossal peak traffic, especially if a
single global (although high-speed) bus is used. It was found [18]
that only a few tens of thousands of neurons can be simulated on
processor-based brain simulators (including both many-thread
software simulators and purpose-built brain simulators).⁷ Recall
also from [15] that “strong scaling is stalling after only a few
dozen nodes”.
IV. SUMMARY
The operating characteristics of ANNs are practically
unknown, mainly because of their mostly proprietary
design/documentation. The existing theoretical predictions [16]
and measured results [15] show good agreement, but dedicated
measurements using well-documented benchmarks and a
variety of well-documented architectures are needed. The low
efficacy predicted by the present “real scaling”, on the one hand,
calls for careful design using the existing components (i.e.,
selecting the “least bad” configuration; millions of devices will
work with low energy and computational efficacy!) and, on the
other hand, urges working out a different computing paradigm [2]
(and an architecture [39] based on it).
The Gordon Bell Prize jury noticed [40] that ”Surprisingly,
there have been no brain inspired massively parallel special-
ized computers [among the winners]”. It was predicted [13]:
”scaling thus put larger machines [the brain inspired com-
puters built from components designed for Single Processor
Approach (SPA) computers] at an inherent disadvantage”.
⁶ This periodic synchronization shall be a limiting factor in large-scale
utilization of processor-based artificial neural chips [36], [37], although thanks
to the ca. thousand times higher “single-processor performance”, only when
approaching the computing capacity of (part of) the brain; or when the
simulation turns to be communication-bound.
⁷ Despite this, Spinnaker2, this time with 10M processors is under building [38]
REFERENCES
[1] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz,
N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel,
and K. Yelick, “A View of the Parallel Computing Landscape,” Comm.
ACM, vol. 52, no. 10, pp. 56–67, 2009.
[2] J. Végh and A. Tisan, “The need for modern computing paradigm:
Science applied to computing,” in 2019 International Conference on
Computational Science and Computational Intelligence (CSCI). IEEE,
2019, pp. 1523–1532. [Online]. Available: http://arxiv.org/abs/1908.02651
[3] G. M. Amdahl, “Validity of the Single Processor Approach to Achieving
Large-Scale Computing Capabilities,” in AFIPS Conference Proceed-
ings, vol. 30, 1967, pp. 483–485.
[4] J. Végh, “Re-evaluating scaling methods for distributed parallel
systems,” IEEE Transactions on Distributed and Parallel Computing,
vol. ??, p. in review, 2020. [Online]. Available: https://arxiv.org/abs/2002.08316
[5] J. L. Gustafson, “Reevaluating Amdahl’s Law,” Commun. ACM, vol. 31,
no. 5, pp. 532–533, May 1988.
[6] S. Krishnaprasad, “Uses and Abuses of Amdahl’s Law,” J. Comput. Sci.
Coll., vol. 17, no. 2, pp. 288–293, Dec. 2001.
[7] Y. Shi, “Reevaluating Amdahl’s Law and Gustafson’s Law,”
https://www.researchgate.net/publication/228367369_Reevaluating_Amdahl's_law_and_Gustafson's_law,
1996.
[8] V. Weaver, D. Terpstra, and S. Moore, “Non-determinism and over-
count on modern hardware performance counter implementations,” in
Performance Analysis of Systems and Software (ISPASS), 2013 IEEE
International Symposium on, April 2013, pp. 215–224.
[9] IEEE Spectrum, “Two Different Top500 Supercomputing
Benchmarks Show Two Different Top Supercomputers,”
https://spectrum.ieee.org/tech-talk/computing/hardware/two-different-top500-supercomputing-benchmarks-show-two-different-top-supercomputers,
2017.
[10] TOP500.org, “The top 500 supercomputers,” https://www.top500.org/,
2019.
[11] J. Végh, “Finally, how many efficiencies the supercomputers have?” The
Journal of Supercomputing, Feb. 2020.
[12] M. Feldman, “Exascale Is Not Your Grandfathers HPC,”
https://www.nextplatform.com/2019/10/22/exascale-is-not-your-grandfathers-hpc/,
2019.
[13] J. P. Singh, J. L. Hennessy, and A. Gupta, “Scaling parallel programs for
multiprocessors: Methodology and examples,” Computer, vol. 26, no. 7,
pp. 42–50, Jul. 1993.
[14] www.top500.org, “Intel dumps knights hill, future of xeon
phi product line uncertain,”
https://www.top500.org/news/intel-dumps-knights-hill-future-of-xeon-phi-product-line-uncertain/,
2017.
[15] J. Keuper and F.-J. Pfreundt, “Distributed Training of Deep Neural
Networks: Theoretical and Practical Limits of Parallel Scalability,”
in 2nd Workshop on Machine Learning in HPC Environments
(MLHPC). IEEE, 2016, pp. 1469–1476. [Online]. Available:
https://www.researchgate.net/publication/308457837
[16] J. Végh, How deep the machine learning can be, ser. A Closer Look at
Convolutional Neural Networks. Nova, In press, 2020, pp. 141–169.
[Online]. Available: https://arxiv.org/abs/2005.00872
[17] T. Ippen, J. M. Eppler, H. E. Plesser, and M. Diesmann, “Constructing
Neuronal Network Models in Massively Parallel Environments,” Fron-
tiers in Neuroinformatics, vol. 11, p. 30, 2017.
[18] S. J. van Albada, A. G. Rowley, J. Senk, M. Hopkins, M. Schmidt, A. B.
Stokes, D. R. Lester, M. Diesmann, and S. B. Furber, “Performance
Comparison of the Digital Neuromorphic Hardware SpiNNaker and the
Neural Network Simulation Software NEST for a Full-Scale Cortical
Microcircuit Model,” Frontiers in Neuroscience, vol. 12, p. 291, 2018.
[19] S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful
visual performance model for multicore architectures,” Commun. ACM,
vol. 52, no. 4, pp. 65–76, Apr. 2009.
[20] J. Végh, J. Vásárhelyi, and D. Drótos, “The performance wall of
large parallel computing systems,” in Lecture Notes in Networks
and Systems 68. Springer, 2019, pp. 224–237. [Online]. Available:
https://link.springer.com/chapter/10.1007%2F978-3-030-12450-2_21
[21] J. Végh and A. J. Berki, “Do we know the operating principles of our
computers better than those of our brain?” Neurocomputing, vol. ??, p.
in review, 2020. [Online]. Available: https://arxiv.org/abs/2005.05061
[22] A. Haidar, S. Tomov, J. Dongarra, and N. J. Higham, “Harnessing GPU
Tensor Cores for Fast FP16 Arithmetic to Speed Up Mixed-precision
Iterative Refinement Solvers,” in Proceedings of the International Con-
ference for High Performance Computing, Networking, Storage, and
Analysis, ser. SC ’18. IEEE Press, 2018, pp. 47:1–47:11.
[23] F. Zheng, H.-L. Li, H. Lv, F. Guo, X.-H. Xu, and X.-H. Xie, “Co-
operative computing techniques for a deeply fused and heterogeneous
many-core processor architecture,” Journal of Computer Science and
Technology, vol. 30, no. 1, pp. 145–162, Jan 2015.
[24] Y. Ao, C. Yang, F. Liu, W. Yin, L. Jiang, and Q. Sun, “Performance
Optimization of the HPCG Benchmark on the Sunway TaihuLight
Supercomputer,” ACM Trans. Archit. Code Optim., vol. 15, no. 1, pp.
11:1–11:20, Mar. 2018.
[25] E. Chicca and G. Indiveri, “A recipe for creating ideal hybrid
memristive-CMOS neuromorphic processing systems,” Applied Physics
Letters, vol. 116, no. 12, p. 120501, 2020. [Online]. Available:
https://doi.org/10.1063/1.5142089
[26] “Building brain-inspired computing,” Nature Communications, vol. 10,
no. 12, p. 4838, 2019. [Online]. Available:
https://doi.org/10.1038/s41467-019-12521-x
[27] F. M. David, J. C. Carlyle, and R. H. Campbell, “Context Switch
Overheads for Linux on ARM Platforms,” in Proceedings of the
2007 Workshop on Experimental Computer Science, ser. ExpCS
’07. New York, NY, USA: ACM, 2007. [Online]. Available:
http://doi.acm.org/10.1145/1281700.1281703
[28] D. Tsafrir, “The context-switch overhead inflicted by hardware interrupts
(and the enigma of do-nothing loops),” in Proceedings of the 2007
Workshop on Experimental Computer Science, ser. ExpCS ’07. New
York, NY, USA: ACM, 2007, pp. 3–3.
[29] J. D. Kendall and S. Kumar, “The building blocks of a brain-inspired
computer,” Appl. Phys. Rev., vol. 7, p. 011305, 2020.
[30] G. Buzsáki and X.-J. Wang, “Mechanisms of Gamma Oscillations,”
Annual Reviews of Neurosciences, vol. 3, no. 4, pp. 19:1–19:29,
Nov. 2012.
[31] L. de Macedo Mourelle, N. Nedjah, and F. G. Pessanha, Reconfigurable
and Adaptive Computing: Theory and Applications. CRC press, 2016,
ch. 5: Interprocess Communication via Crossbar for Shared Memory
Systems-on-chip.
[32] S. Moradi and R. Manohar, “The impact of on-chip communication on
memory technologies for neuromorphic systems,” Journal of Physics D:
Applied Physics, vol. 52, no. 1, p. 014003, oct 2018.
[33] S. B. Furber, D. R. Lester, L. A. Plana, J. D. Garside, E. Painkras,
S. Temple, and A. D. Brown, “Overview of the SpiNNaker System
Architecture,” IEEE Transactions on Computers, vol. 62, no. 12, pp.
2454–2467, 2013.
[34] S. Kunkel, M. Schmidt, J. M. Eppler, H. E. Plesser, G. Masumoto,
J. Igarashi, S. Ishii, T. Fukai, A. Morrison, M. Diesmann, and M. Helias,
“Spiking network simulation code for petascale computers,” Frontiers in
Neuroinformatics, vol. 8, p. 78, 2014.
[35] J. Végh, “How Amdahl’s Law limits the performance of
large artificial neural networks: (Why the functionality of
full-scale brain simulation on processor-based simulators
is limited),” Brain Informatics, vol. 6, pp. 1–11, 2019.
[Online]. Available:
https://braininformatics.springeropen.com/articles/10.1186/s40708-019-0097-2/metrics
[36] M. Davies, et al, “Loihi: A Neuromorphic Manycore Processor with
On-Chip Learning,” IEEE Micro, vol. 38, pp. 82–99, 2018.
[37] J. S. et al, “TrueNorth Ecosystem for Brain-Inspired Computing: Scal-
able Systems, Software, and Applications,” in SC ’16: Proceedings of the
International Conference for High Performance Computing, Networking,
Storage and Analysis, 2016, pp. 130–141.
[38] C. Liu, G. Bellec, B. Vogginger, D. Kappel, J. Partzsch, F. Neumärker,
S. Höppner, W. Maass, S. B. Furber, R. Legenstein, and C. G. Mayr,
“Memory-Efficient Deep Learning on a SpiNNaker 2 Prototype,”
Frontiers in Neuroscience, vol. 12, p. 840, 2018. [Online]. Available:
https://www.frontiersin.org/article/10.3389/fnins.2018.00840
[39] J. Végh, “How to extend the Single-Processor Paradigm to the Explicitly
Many-Processor Approach,” in 2020 International Conference on
Computational Science and Computational Intelligence (CSCE). IEEE,
2020, p. In print. [Online]. Available: http://arxiv.org/abs/1908.02651
[40] G. Bell, D. H. Bailey, J. Dongarra, A. H. Karp, and K. Walsh, “A look
back on 30 years of the Gordon Bell Prize,” The International Journal
of High Performance Computing Applications, vol. 31, no. 6, pp. 469–484,
2017. [Online]. Available: https://doi.org/10.1177/1094342017738610
