Do we know the operating principles of our computers better than those
  of our brain? by Végh, János & Berki, Ádám J.
János Végh∗
Kalimános BT, Hungary
Ádám J. Berki∗
University of Medicine, Pharmacy, Sciences and Technology of Targu Mures, Romania
Abstract
The increasing interest in understanding the behavior of the biological neural
networks, and the increasing utilization of artificial neural networks in differ-
ent fields and scales, both require a thorough understanding of how neuromor-
phic computing works. On the one side, the need to program those artificial
neuron-like elements, and, on the other side, the necessity for a large number
of such elements to cooperate, communicate and compute during tasks, need to
be scrutinized to determine how efficiently conventional computing can assist
in implementing such systems. Some electronic components bear a surprising
resemblance to some biological structures. However, combining them with com-
ponents that work using different principles can result in systems with very poor
efficacy. The paper discusses how the conventional principles, components and
thinking about computing limit mimicking the biological systems. We describe
what changes will be necessary in the computing paradigms to get closer to the
marvelously efficient operation of biological neural networks.
Keywords: operating principle of computers; operating principle of brain;
efficiency of simulating neural networks; single-processor approach
Contents
1 Introduction
2 Issues with the large scale computing
3 Limitations due to the Single Processor Approach
4 The importance of imitating the timely behavior
5 The role of the workload on the computing efficiency
6 Limitations due to the classic computing paradigm
7 A system is not a simple sum of its components
8 Summary
∗Corresponding author
Email address: Vegh.Janos@gmail.com (János Végh)
Preprint submitted to Neurocomputing May 12, 2020
arXiv:2005.05061v1 [cs.DC] 6 May 2020
1. Introduction
Today we live in the “golden age” of neuromorphic (brain-inspired, artificial
intelligence) architectures and computing. However, the meaning of the word
has changed considerably since Carver Mead [1] coined it. Today, practically
every solution that borrows at least one operating principle from biology and
mimics some of its functionality more or less successfully deserves this name.
As always, it is dangerous to single out one aspect and implement it in an
environment, and from components, based on entirely different principles.
Historically, ‘neuromorphic’ architectures were proposed on the basis of
different principles and components, such as mechanics, pneumatics,
telephones, analog and digital electronics, and computing. Some initial
resemblance surely exists, and even some straightforward systems can
demonstrate, more or less successfully, functionality that is in some aspects
similar to that of the nervous system. There is a noteworthy analogy between
the deep learning of neuronal nodes and the long-term potentiation found in
synapses. However, when scrutinizing the scalability (i.e., how those systems
work under real-life conditions, in which a vast number of similar subsystems
must work and cooperate), the picture is not favorable at all. The stated
target was that “Successfully addressing these challenges [of neuromorphic
computing] will lead to a new class of computers and systems architectures”
[2]. However, as noticed by the judges of the Gordon Bell Prize,
“surprisingly, [among the winners,] there have been no brain-inspired
massively parallel specialized computers” [3]. This is so despite the vast
need and investments and the concentrated and coordinated efforts, simply
because computing mimics the biological systems inadequately.
Given “that the quest to build an electronic computer based on the operational
principles of biological brains has attracted attention over many years” [4],
modeling the neuronal operation has become a well-known field in electronics.
At the same time, more and more details come to light about the computational
operations of the brain. However, it would appear that the ‘wet’ neuroscience
is miles ahead of the ‘silicon’ neuroscience. There are projects and
exaggerated claims about extremely large computing systems, even about
targeting the simulation of the brain of some animals and eventually even the
human brain. Often these claims are followed by a long silence, or by rather
slim results or none at all. Given that the operating principles of large
computer systems tend to deviate from the operating principles of a single
processor, it is worth reopening the discussion on a decade-old question: “Do
computer engineers have something to contribute . . . to the understanding of
brain and mind?” [4]. Maybe, and they surely have something to contribute to
the understanding of computing itself. There is no doubt that the brain does
computing; the key question is how.
Section 2 presents some computing systems that, as we point out, have
enormously high energy consumption as a consequence of the computing paradigm.
Section 3 discusses the primary reasons for the issues and failures: the
computing paradigm and its consequences, such as the serial bus and the effect
of the physical size. The timely behavior is especially important in
biological objects, so its fair imitation in computing systems is of crucial
importance, as section 4 discusses. Neuromorphic computing is a special type
of workload, and the workload has a dominating role in forming the
computational efficiency of computing systems, as section 5 discusses.
Section 6 presents some further limitations rooted in the classical paradigm;
furthermore, it draws parallels between classic versus modern science and
classic versus modern computing. Section 7 provides examples of why
considering the role of a single, isolated component is of limited validity:
a neuromorphic system is not a simple sum of its components. In section 8, the
paper attempts to point out clearly where we can continue using classic
computing and where we shall base the systems on the new principles.
2. Issues with the large scale computing
The worst limiting factor in conventional computing is the method of
communication between processors, the burden of which increases exponentially
with increasing complexity/number of processors. Historically, in the model of
computing proposed by von Neumann, there is one single entity, an isolated
(non-communicating) processor, whereas in the bio-inspired models, billions of
entities, organized into specific assemblies, cooperate via communication.
(Communication here means not only sending data, but also sending/receiving
signals, including synchronization of the operation of the entities.)
Neuromorphic systems that are expected to perform tasks in one paradigm, but
are assembled from components manufactured using the principles of (and
implemented by experts trained in) the other paradigm, are unable to perform
at the speed and efficacy required for real-world solutions. The larger the
system, the higher the communication load and the performance debt.
To get nearer to the marvelously efficient operation of the biological brain,
other features of biology must also be mimicked. Only a small portion of the
neurons work simultaneously on solving the actual task; there is a massive
number of very simple (‘very thin’) processors rather than a ‘fat’ processor¹;
only a portion of the functionality and connections is pre-wired, the rest is
mobile; there is an inherent redundancy, and replacing a faulty neuron may be
possible via systematic training. Conventional processors can only run or
halt; they cannot take a short break. Biology uses purely event-driven
(asynchronous) computing, while modern electronics uses clock-driven systems;
for the catastrophic consequences of attempting to simulate a neuromorphic
system, such as the human brain, using components prepared for conventional
computing, see the case of SpiNNaker, discussed below.
¹One should scrutinize whether it is worthwhile to implement accelerators
(such as pipelines, branch predictors) intended to be used in large computing
systems, to achieve just a couple of times higher processing speed at the
price of using several hundred times more transistors.
Large computing systems can cope with their tasks only with growing
difficulty, enormously decreasing computing efficiency, and enormously growing
energy consumption. Not being aware that the collaboration between processors
needs a different approach (another paradigm) has resulted in demonstrative
failures already known (such as the supercomputers Gyoukou and Aurora’18, or
the brain simulator SpiNNaker)², and many more may follow (all of them intend
to deliver 0.13-0.2 Eflops), such as Aurora’21 [6], the mysterious Chinese
supercomputers³ and the planned EU supercomputers⁴. Systems having “only”
millions of processors already show the issues, and the brain-like systems
want to comprise four orders of magnitude more computing elements. Besides,
the scaling is strongly nonlinear [7, 8]. When targeting neuromorphic features
such as “deep learning training”, the issues start to manifest at just a
couple of dozen processors [9][10].
3. Limitations due to the Single Processor Approach
As suspected by many experts, the computing paradigm itself, “the implicit
hardware/software contract” [11], is responsible for the experienced issues:
“No current programming model is able to cope with this development [of
processors], though, as they essentially still follow the classical van
Neumann model” [12]. When thinking about “advances beyond 2020”, however, the
solution was expected from the “more efficient implementation of the von
Neumann architecture” [13]. Even when speaking about building up computing
from scratch (“rebooting the model” [14]), only implementing a different
gating technology for the same computing model is meant. However, the paradigm
prevents building large neuromorphic systems, too.
The bottleneck is essentially the “technical implementation” of the
communication, stemming from the Single Processor Approach (SPA), as
illustrated in Fig. 1. The inset shows a simple neuromorphic use case: one
input neuron and one output neuron communicating through a hidden layer
comprising only two neurons. Fig. 1.A mostly shows the biological
implementation: all neurons are directly wired to their partners, i.e., a
system of “parallel buses” (the axons) exists. Notice that the operating time
also comprises two non-payload times: the data input and the data output,
which coincide with the non-payload time of the other communication party. The
diagram displays the logical and temporal dependencies of the neuronal
functionality.
²The explanations are quite different: Gyoukou is simply withdrawn; Aurora is
practically withdrawn: retargeted and delayed; despite the failure of
SpiNNaker1, SpiNNaker2 is also under construction [5]; “Chinese
decision-makers decided to withhold the country’s newest Shuguang
supercomputers even though they operate more than 50 percent faster than the
best current US machines”.
³https://www.scmp.com/tech/policy/article/3015997/china-has-decided-not-fan-flames-super-computing-rivalry-amid-us
⁴https://ec.europa.eu/newsroom/dae/document.cfm?doc_id=60156
[Figure 1 image: three timing diagrams (panels A, B, C) for the neurons N0 to
N3, each showing the T_in, Process and T_out phases of every neuron and the
total time; vertical axis: Time (not proportional), 0 to 13. Panel A: Parallel
buses, linear communication/processing. Panel B: Sequential bus, communication
bound, nonlinear, with BUS activity column and Bandwidth arrow. Panel C:
Sequential bus, communication roofline, nonlinear, with BUS activity column
and Bandwidth arrow. Inset: network with Input Layer (x), Hidden Layer (a1,
a2) and Output Layer (y).]
Figure 1: Implementing neuronal communication in different technical
approaches. A: the parallel bus; B and C: the shared serial bus, before and
after reaching the communication “roofline” [15]
The payload operation (“the computing”) can only start after the data is
delivered (by the input-side communication, which is non-payload functionality
from this point of view), and the output communication can only begin when the
computing has finished. Importantly, the communication and the calculation
mutually block each other. Two important points that neuromorphic systems must
mimic can be noticed immediately: i/ the communication time is an integral
part of the total execution time, and ii/ the ability to communicate is a
native functionality of the system. In such a parallel implementation, the
performance of the system, measured as the resulting total time (processing +
transmitting), scales linearly with increasing both the non-payload
communication speed and the payload processing speed.
The present technical approaches assume a similar linearity of the performance
of computing systems, as “Gustafson’s formulation [16] gives an illusion that
as if N [the number of the processors] can increase indefinitely” [17]. The
fact that “in practice, for several applications, the fraction of the serial
part happens to be very, very small thus leading to near-linear speedups”
[17], however, misled the researchers. Gustafson’s ‘linear scaling’ neglects
the communication entirely (which is not justified, especially not in
neuromorphic computing). He based his conclusions on only several hundred
processors, and the interplay of the improving parallelization and the general
hardware (HW) development (including the non-determinism of modern HW [18])
concealed for decades that the scaling was used far outside of its range of
validity [7, 8]. Not considering the effect of the communication time (i.e.,
the timely behavior) means not considering a vital feature of the biological
system. Essentially the same effect (the vastly increased number of idle
cycles due to the physical size of supercomputers) leads to the failures of
supercomputer projects (for a detailed discussion, see [19]). The ‘real
scaling’ is strongly nonlinear, with a bound defined by nature.
Fig. 1.B shows a technical implementation with a high-speed shared bus for
communication. To the right of the grid, the activity that loads the bus at
the given time is shown. A double arrow illustrates the communication
bandwidth; its length is proportional to the number of packages the bus can
deliver in a given time unit. The high-speed bus is only very slightly loaded.
We assume that the input neuron can send its information in a single message
to the hidden layer and, furthermore, that the processing by the neurons in
the hidden layer both starts and ends at the same time. However, the neurons
must compete for access to the bus, and only one of them can send its message
immediately; the other(s) must wait until the bus is released. The output
neuron can only receive the message when the first neuron completes it.
Furthermore, the output neuron must first acquire the second message from the
bus, and the processing can only begin after having both input arguments. This
constraint results in sequential bus delays both during the non-payload
processing in the hidden layer and during the payload processing in the output
neuron. Adding one more neuron to the layer introduces one more delay, which
explains why “shallow networks with many neurons per layer . . . scale worse
than deep networks with less neurons” [9]: the system sends the messages at
different times in the different layers (and the layers may even have
independent buses between them), although the shared bus persists in limiting
the communication.
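The timing argument of Fig. 1.A and Fig. 1.B can be sketched in a few lines of
Python. This is only a toy illustration under simplifying assumptions that are
ours and not taken from the figure: every transfer takes the same time
`t_xfer`, every neuron computes for the same time `t_proc`, and the function
names are hypothetical.

```python
def total_time_parallel(hidden: int, t_proc: float, t_xfer: float) -> float:
    """Directly wired neurons (Fig. 1.A): hidden-layer messages travel concurrently."""
    # `hidden` does not enter the result: every hidden neuron has its own wire.
    # Phases: input delivery, N0 computes, transfer to the hidden layer,
    # hidden layer computes, transfer to the output neuron, output neuron
    # computes, output delivery.
    return 4 * t_xfer + 3 * t_proc


def total_time_shared_bus(hidden: int, t_proc: float, t_xfer: float) -> float:
    """One shared serial bus (Fig. 1.B): hidden neurons send one after the other."""
    t = t_xfer + t_proc + t_xfer   # input delivery, N0 computes, one message to the layer
    t += t_proc                    # all hidden neurons start and finish together
    for _ in range(hidden):
        t += t_xfer                # each hidden neuron holds the bus exclusively
    t += t_proc + t_xfer           # output neuron computes, then delivers its result
    return t


for hidden in (2, 4, 8, 16):
    print(hidden,
          total_time_parallel(hidden, 1.0, 1.0),
          total_time_shared_bus(hidden, 1.0, 1.0))
```

In this toy model the total time of the directly wired case does not depend on
the width of the hidden layer, while on the shared bus every additional hidden
neuron adds one more transfer time before the output neuron can start.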
The dependence of the performance is strongly nonlinear at higher nominal
performance values (implemented using a large number of processors). The
effect is especially disadvantageous for systems, such as the neuromorphic
ones, that need much more communication, which makes the non-payload to
payload ratio very unfavorable. The linear dependence at low nominal
performance values explains why the initial successes of any new technology,
material or method in the field, using the classic computing model, can be
misleading: in simple cases, the classic paradigm performs tolerably well
because, compared to biological neural networks, current neuron/dendrite
models are simple, the networks are small, and the learning models appear to
be rather basic. Recall that for artificial neuronal networks the saturation
is reached at just dozens of processors [9], because of the extremely high
proportion of communication.
[Figure 2 image: three-dimensional surface plot; axes: No of processors (10^5
to 10^7), Non-payload/payload (10^-7 to 10^-4), Efficiency (10^-4 to 10^0);
markers: TOP5’2018.11 (Summit, Sierra, Taihulight, Tianhe-2, K computer) and
Brain.]
Figure 2: The effect of the SPA implementation on the computing performance.
The surface displays how the efficiency of the SPA systems depends on the
goodness of the parallelization (aka the non-payload/payload ratio) and the
number of cores. The figure marks show at what efficiency values the top
supercomputers can run different workloads (for a discussion see section 5):
the ‘best load’ benchmark HPL, and the ‘real-life load’ HPCG. The bottom right
part displays the expected efficiency of running neuromorphic calculations on
SPA computers.
Fig. 2 depicts how the SPA paradigm also defines the computational performance
of the parallelized sequential system. Given that the task defines how many
computations it needs, and the computational time is inversely proportional to
the efficiency, one can trivially conclude that the decreasing computational
efficiency leads to increasing energy consumption, just because of the SPA.
“This decay in performance is not a fault of the architecture, but is dictated
by the limited parallelism” [20].
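The shape of such an efficiency surface can be reproduced with a short sketch.
We assume here, as a labeled assumption rather than the formula used for
Fig. 2 (for that, see [19]), an Amdahl-type model in which a fraction
`one_minus_alpha` of the work is non-payload (not parallelizable); the
function name and the sample values are hypothetical.

```python
def efficiency(n_cores: int, one_minus_alpha: float) -> float:
    """Amdahl-type efficiency: speedup divided by the number of cores.

    Assumed model: S(N) = 1 / ((1 - alpha) + alpha / N), E(N) = S(N) / N,
    which simplifies to E(N) = 1 / (N * (1 - alpha) + alpha).
    """
    alpha = 1.0 - one_minus_alpha
    return 1.0 / (n_cores * one_minus_alpha + alpha)


# Illustrative (not measured) values: a million cores with a well-parallelized
# workload versus the same machine with a communication-heavy workload.
for one_minus_alpha in (1e-7, 1e-6, 1e-5, 1e-4):
    print(one_minus_alpha, efficiency(10**6, one_minus_alpha))
```

Under this assumption the efficiency of a fixed-size machine drops by roughly
one order of magnitude for every order of magnitude by which the non-payload
fraction grows, once the product of core count and non-payload fraction
exceeds one.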
4. The importance of imitating the timely behavior
In both biological and electronic systems, both the distance between the
entities of the network and the signal propagation speed are finite. Because
of this, in physically large systems the ‘idle time’ of the processors defines
the final performance a parallelized sequential system can achieve. In
conventional computing systems, the ‘data dependence’ also limits the
available parallelism: we must compute the data before we can use it as an
argument for another computation. Although, of course, in conventional
computing the data must also be delivered to the place of the second
utilization, thanks to the ‘weak scaling’ [16], this ‘communication time’ is
neglected.
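A back-of-the-envelope sketch shows why the physical size matters. The
distances, the clock frequency and the function name below are hypothetical
illustration values; the only physical input is the speed of light as an upper
bound on the signal propagation speed.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def idle_cycles(distance_m: float, clock_hz: float, velocity_factor: float = 1.0) -> float:
    """Clock cycles that elapse while a signal travels `distance_m`.

    `velocity_factor` is the assumed propagation speed as a fraction of the
    speed of light (1.0 is the physical upper bound).
    """
    travel_time_s = distance_m / (SPEED_OF_LIGHT_M_PER_S * velocity_factor)
    return travel_time_s * clock_hz


# Assumed examples: 1 cm (within a chip) and 100 m (across a machine hall),
# both with a 3 GHz processor clock.
print(idle_cycles(0.01, 3e9))
print(idle_cycles(100.0, 3e9))
```

With these assumed numbers, a signal crossing a machine hall costs on the
order of a thousand processor clock periods, during which the waiting
processor cannot do payload work.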
In neuromorphic computing, however, as discussed in connection with Fig. 1,
the transfer time is a vital part of information processing. A biological
brain must deploy a “speed accelerator” to ensure that the control signals
arrive at the target destination before the arrival of the controlled
messages, even though the former derive from a distant part of the brain [21].
This aspect is so vital in biology that the brain deploys many cells, with the
associated energy investment, to keep the communication speed higher for the
control signal. Computer technology cannot speed up the communication
selectively, as biology does, and it is not worthwhile to keep part of the
system selectively at a lower speed.
Extending our previous work [22], here we introduce the Explicitly
Many-Processor Approach (EMPA), a new method that uses clusters of
computational units (‘neurons’, Fig. 3B) to mimic the timely behaviour of
‘biological brains’. The clusters are built using the following key novel
ideas: 1) implementing directly-wired connections between physically
neighbouring cells; 2) creating a special hierarchical bus system; 3) placing
a special communication unit, the Inter-Core Communication Block (ICCB,
Fig. 3B, purple), between the computer cores mimicking neurons (Fig. 3B,
green); 4) creating a specialized ‘cluster head’ (Fig. 3B) with the extra
abilities to access the local and far memories (Fig. 3 M) and to forward
messages via the gateway (Fig. 3 G) to other ‘clusters’ (similar gateways can
be implemented for the inter-processor communication and for higher
organizational levels, providing access to the different levels of the
hierarchical buses). The cluster members are denoted by their relative
position (the addressing mode enables using virtual cores, mapped to physical
cores at runtime), and they can access the memory and other clusters only
through the head of the cluster. This enables easy sharing of locally
important state variables, keeps local traffic away from the bus(es), and
reduces wiring inside the chip. The ICCBs can forward messages via direct
wired connections with up to 2 ‘hops’, to the immediate neighbors and the
second neighbors (even if they belong to another cluster). This solution
enables billions of ‘neurons’ to communicate at the same time, without delay,
although distant neurons must use one (or more) of the hierarchical buses,
depending on their location.
[Figure 3 image: panel A: Inter-ICCB Communication (Local Neuron
Communication) and Inter-Cluster Communication (Distant Neuron Communication)
via G, M; panel B: cluster of cores with neighbours labelled N, NW, SW, S, SE,
NE; legend: ICCB = Inter-Core Communication Block, M = Memory handling/bus,
G = Gateway handling/bus.]
Figure 3: Subfigure A (compare to Fig. 7 in [21]) shows a proposal [22] of how
to reduce the limiting effect of the SPA, via mimicking the communication
between local neurons using direct-wired inter-core communication and the
communication between the farther neurons via using the inter-cluster
communication bus, in the cluster head. Subfigure B suggests a possible
implementation of the principle: the Inter-Core Communication Blocks represent
a “local bus” (directly wired, with no contention), while the cores can
communicate with the cores in other clusters through the ‘G’ gateway as well
as the ‘M’ (local and global) memory.
The resemblance between Fig. 3A and Fig. 7 in reference [21] underlines the
importance of making a clear distinction between handling ‘near’ and ‘far’
signals, and of accounting for relative signal timing. Although the ICCB
blocks can adequately represent ‘locally connected’ interneurons and the ‘G’
gateway the ‘long-range interneurons’ [21], the conduction time found in
biological systems must be separately maintained by the neurons in
biology-mimicking computing systems. Making time-stamps and relying on the
computer network delivery principles is not sufficient for maintaining correct
relative timing. The timely behavior is a vital feature of biology-mimicking
systems; it cannot be replaced with the synchronization principles of
computing. Ignoring this requirement massively contributes to the failures of
biology-mimicking computing systems. Communication time is less vital when
using neurons in Artificial Intelligence (AI), but even in that case, one must
consider the communication time explicitly.
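The distinction between ‘near’ and ‘far’ signals can be illustrated with a
minimal routing sketch. Everything in it is a hypothetical illustration and
not the EMPA implementation: we assume that the cores sit on a hexagonal grid
addressed by axial coordinates, that the six neighbour labels of Fig. 3B map
to the offsets given below, and that anything beyond two hops is handed over
to the hierarchical bus.

```python
from typing import Dict, Tuple

Hex = Tuple[int, int]  # axial coordinates (q, r) on a hexagonal grid

# Hypothetical mapping of the neighbour labels of Fig. 3B to axial offsets.
NEIGHBOUR_OFFSETS: Dict[str, Hex] = {
    "N": (0, -1), "NE": (1, -1), "SE": (1, 0),
    "S": (0, 1), "SW": (-1, 1), "NW": (-1, 0),
}


def hex_distance(a: Hex, b: Hex) -> int:
    """Number of neighbour-to-neighbour hops between two cores."""
    dq = a[0] - b[0]
    dr = a[1] - b[1]
    return (abs(dq) + abs(dr) + abs(dq + dr)) // 2


def route(src: Hex, dst: Hex, max_direct_hops: int = 2) -> str:
    """Decide which (assumed) path a message takes from `src` to `dst`."""
    hops = hex_distance(src, dst)
    if hops == 0:
        return "local"    # same core, nothing to send
    if hops <= max_direct_hops:
        return "direct"   # ICCB-to-ICCB wires, no bus contention
    return "bus"          # via cluster head and gateway to a hierarchical bus


print(route((0, 0), NEIGHBOUR_OFFSETS["NE"]))  # immediate neighbour
print(route((0, 0), (2, -1)))                  # second neighbour
print(route((0, 0), (3, 0)))                   # distant core
```

The design point the sketch tries to capture is that the decision depends only
on the relative position of the two cores, so near traffic never touches the
shared bus.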
Of course, computing works with time quanta: what happens within a clock
period of the processor happens “at the same time”. Given that the clock
period of computers is in the range of nanoseconds, it is a good approximation
in classic computing that computing time is continuous. When simulating
many-neuron systems in SPA, however, one faces a lack of cooperative behavior.
As the computing time in the artificial neurons is not proportional to the
biological time they simulate, these different time scales must be scaled to
each other.
One possible way is to put a “time grid” on the processes simulating biology:
within a time slot, the artificial neurons are free to compute, but at the
time boundary they send the results of their calculation to each other. As a
result, the neurons periodically continue their calculation from some
concerted state. Such a method of synchronization introduces a “biological
clock period” that is a million-fold longer than the clock period of the
processor: what happens in this “grid time” happens “at the same time”.
Although this effect drastically reduces the achievable temporal computing
performance [23], the synchronization principle is so common that even the
special-purpose neuromorphic chips [24, 25] use it as a built-in feature.
Although in their case the speed of neuronal functionality is hundreds of
times higher than that of the competing solutions, and the communication
principles are slightly different (i.e., the non-payload/payload ratio is
vastly different), the performance-limiting effect of the “quantal nature of
computing time” persists when they are used in extensive systems.
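The effect of the “time grid” can be made concrete with a small sketch. The
constants are assumptions chosen to match the orders of magnitude quoted in
the text (a nanosecond processor clock and a million-fold longer grid period);
the function name is hypothetical.

```python
CLOCK_NS = 1            # assumed processor clock period, in nanoseconds
GRID_NS = 1_000_000     # assumed "biological clock period": a million-fold longer


def delivery_time_ns(result_time_ns: int, grid_ns: int = GRID_NS) -> int:
    """Time at which a result computed at `result_time_ns` reaches the other neurons.

    Results are exchanged only at slot boundaries, so the time is rounded up
    to the next multiple of `grid_ns` (integer ceiling division).
    """
    return -(-result_time_ns // grid_ns) * grid_ns


results_ns = [1_000_001, 1_400_000, 1_999_999, 2_000_001]
print([delivery_time_ns(t) for t in results_ns])
print(GRID_NS // CLOCK_NS)  # processor clock periods per grid slot
```

The first three results, although computed almost a full slot apart, are
delivered at the same boundary: within the grid, they happen “at the same
time”, and the relative timing inside the slot is lost.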
5. The role of the workload on the computing efficiency
As was predicted very early [26] and experimentally confirmed decades later
[20], the scaling of parallelized computing is not linear. Moreover, “there
comes a point when using more processors . . . actually increases the
execution time rather than reducing it” [20]. Where that point comes depends
on the workload. Paper [19] discusses first- and second-order approaches to
explain the issue. The first-order approach explains the experienced
saturation, and the second-order approach the predicted decrease.
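The qualitative difference between saturation and decrease can be illustrated
with a toy model. The model below is our own illustrative assumption, not the
model of [19] or [20]: the execution time consists of a sequential part, a
parallelizable part that shrinks with the number of processors, and a
communication part that grows linearly with it. All names and values are
hypothetical.

```python
import math


def execution_time(n: int, t_seq: float, t_par: float, t_comm: float) -> float:
    """Assumed execution time on `n` processors.

    t_seq  : part that cannot be parallelized (causes the saturation)
    t_par  : perfectly parallelizable part
    t_comm : communication cost added per processor (causes the decrease)
    """
    return t_seq + t_par / n + t_comm * n


def turning_point(t_par: float, t_comm: float) -> float:
    """Processor count beyond which adding processors increases the time."""
    return math.sqrt(t_par / t_comm)


n_best = turning_point(t_par=100.0, t_comm=0.01)
for n in (10, 50, 100, 200, 1000):
    print(n, execution_time(n, t_seq=1.0, t_par=100.0, t_comm=0.01))
print(n_best)
```

With `t_comm` set to zero the time only saturates at `t_seq`; with any positive
`t_comm` a minimum appears, and a workload with more communication moves that
minimum toward fewer processors.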
As [19] discusses, the different workloads, mainly due to their different
communication-to-computation ratio, work with different efficiency on the same
computer system [29]. The neuromorphic operation on conventional architec-
tures shows the same issues [10, 7, 8]. Fig. 4 illustrates how the different
workloads cause a saturation in the value of the performance gain. Compared
to the benchmark HPL, the HPCG comprises much more communication be-
cause of the iterative nature of the task.
Fig. 4 also depicts an estimated efficiency for the case of simulating
brain-like operation on a conventional architecture. Given that the power
consumption efficiency was also investigated in [28], one can presume that (to
avoid needless energy consumption) the authors measured at the point where
involving more cores increased the power consumption but did not increase the
payload simulation performance. The performance gain of an AI workload on
supercomputers can be estimated to be between those of HPCG and brain
simulation, closer to the HPCG gain. As discussed experimentally in [9] and
theoretically in [10], in the case of neural networks (especially when an
improper layering depth is selected) the efficiency can be much lower.
Depending on the architecture, when mimicking neuromorphic operation on a
conventional system, the performance gain reaches the saturation level at just
dozens of cores.
[Figure 4 image: plot titled “The rooflines of performance gain of
supercomputers”; x-axis: Year, 1994 to 2020; y-axis: Performance gain, 10^2 to
10^8; series: 1st, 2nd and 3rd by R^HPL_Max; Best by α^HPL_eff; 1st, 2nd and
3rd by R^HPCG_Max; Best by R^HPCG_Max; Brain simulation.]
Figure 4: The performance gains of supercomputers modeled as ”roofline” [15], as measured
with the benchmarks HPL and HPCG (taken from the database TOP500 [27]), and the one
for brain simulation is concluded from [28].
[Figure 5 image, left panel: “Relativistic speed of body accelerated by ‘g’”;
x-axis: time (s), 10^5 to 10^8; y-axis: speed (m/s), 10^6 to 10^8; curves:
v(t) for n = 1, n = 2.5 and n = 5. Right panel: “Payload performances of N
cores @100GFlops”; x-axis: Nominal performance (EFlops), 10^-5 to 10^0;
y-axis: Payload performance (EFlops), 10^-5 to 10^0; curves: 1-alpha = 1e-10,
1e-8, 1e-7, 1e-6, 1e-5 and 1e-4.]
Figure 5: The limiting effect considered in the ‘modern’ theories. On the left
side, the speed limit, as explained by the theory of relativity, is
illustrated. The refractive index of the medium defines the value of the speed
limit. On the right side, the payload performance limit of the parallelized
sequential computing systems, as explained by the “modern paradigm”, is
illustrated. The ratio of the non-payload to payload processing defines the
payload performance.
6. Limitations due to the classic computing paradigm
The effect of the shared bus mainly defines the payload performance of
computing systems assembled from components manufactured for SPA systems. The
right subfigure in Fig. 5 displays the payload performance of a
many-processor SPA system when executing different workloads (which define the
non-payload to payload ratio); for the mathematical details see [30, 19]. The
top diagram lines represent the best payload performance that supercomputers
can achieve when running the benchmark HPL, which represents the minimum
communication a parallelized sequential system needs. The bottom diagram line
represents the estimated payload performance that neuromorphic-type processing
can achieve in SPA systems. Notice the similarity with the left subfigure:
under extreme conditions, an environment-dependent speed limit exists in
science, and a workload-dependent payload performance limit exists in
computing [30].
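The similarity of the two limits can be written down side by side. The
formulas below are a sketch under our own assumptions and are not quoted from
[30]: for the left subfigure we assume the standard relativistic speed of a
body under constant acceleration g, with the speed of light c replaced by c/n
in a medium of refractive index n; for the right subfigure we assume an
Amdahl-type model with N cores of single-core performance P_1 and a
non-payload fraction (1 - α).

```latex
% Left subfigure (assumed form): speed limit set by the environment
v(t) = \frac{g\,t}{\sqrt{1 + \left(\dfrac{n\,g\,t}{c}\right)^{2}}},
\qquad
\lim_{t \to \infty} v(t) = \frac{c}{n}

% Right subfigure (assumed form): payload performance limit set by the workload
P_{\mathrm{payload}}(N) = \frac{N\,P_{1}}{N\,(1-\alpha) + \alpha},
\qquad
\lim_{N \to \infty} P_{\mathrm{payload}}(N) = \frac{P_{1}}{1-\alpha}
```

As a consistency check under these assumptions, cores of 100 GFlops with
1 - α = 10^{-7} give a limit of 10^{11}/10^{-7} = 10^{18} Flops, i.e.,
1 EFlops, which is the order of magnitude covered by the axes of the right
subfigure.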
A careful analysis reveals a remarkable parallel between the proposed ‘modern
computing’ [30] versus classic computing and modern science versus classic
science. “Modern computing” does not invalidate “classic computing”. Instead,
it delimits the range of validity of the classic approximation and sets up the
rules of computing under extreme conditions. In its field, “modern computing”
leads to counter-intuitive and shocking conclusions at some extreme parameter
values, as “modern science” did more than a hundred years ago. The parallel
can help to accept that what one cannot experience in everyday computing can
be true when using computing under extreme conditions. Fig. 5 depicts one such
consequence. In modern science, unlike in classic science, a speed limit
exists. In modern computing, unlike in classic computing, a payload
performance limit exists. For further parallels, see [30].
Using another computing theory is a must, especially when targeting neu-
romorphic computing. In the frames of ”classic computing”, as was bitterly
admitted [28], ”any studies on processes like plasticity, learning, and develop-
ment exhibited over hours and days of biological time are outside our reach”.
7. A system is not a simple sum of its components
It is valid for most systems that one must not infer a feature of the system
from the similar feature of a component; because of the non-linearity
discussed above, this is especially valid for large-scale computing systems
mimicking neuromorphic operation. We mention two prominent examples here. One
can assume that if the operating time of a neuron can be shortened, the
performance of the whole system gets proportionally better. Two distinct
options are to use shorter operands (moving less data and performing fewer bit
manipulations) and to mimic the operation of the neuron in an entirely
different way: using quick analog signal processing rather than slow digital
calculation.
The so-called HPL-AI benchmark used Mixed Precision⁵ [31] rather than Double
Precision operands in benchmarking their supercomputer. The name suggests that
the supercomputer can show that peak efficiency when solving AI tasks. When
executing the HPL benchmark, this change resulted in a higher performance
merit number. However, as correctly stated in the announcement, “Achieving a
445 petaflops mixed-precision result on HPL (equivalent to our 148.6 petaflops
DP result)”, i.e., the peak DP performance did not change.
We expect that when using half-precision (FP16) rather than double preci-
sion (FP64) operands in the calculations, four times less data are transferred
and manipulated by the system. The measured power consumption data un-
derpin the statement. However, the computing performance is only 3 times
higher than in the case of using 64-bit (FP64) operands. The non-linearity has
its effect even in this simple case. In the benchmark, the housekeeping activity
also takes time. Even the measured performance data enable us to estimate the
execution time with zero precision (FP0) operands, see [19]. The performance
corresponding to αFP0HPL is slightly above 1 EFlops (when making no floating
operations, i.e., rather Eops). Another peak performance reported6 when run-
ning genomics code on the same supercomputer (by using a mixture of operands
with different numerical precision and mostly non-floating point instructions) is
1.88 Eops, corresponding to αFP0Genom = 1 ∗ 10−8; for the scaling of that type of
calculations see Fig. 5. Given that those two values refer to a different mixture
of instructions, the agreement is more than satisfactory.
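The roughly threefold (rather than fourfold) gain can be illustrated with a
back-of-the-envelope model. The sketch below assumes, purely for illustration,
that the execution time splits into a fixed housekeeping part and a part that
scales linearly with the operand length. This two-term model and the function
names are assumptions made here; it is not the derivation used in [19]. Only
the two performance figures are taken from the announcement quoted above.

```python
def fixed_time_fraction(speedup: float, length_ratio: float) -> float:
    """Share of the long-operand run time that does not shrink with operand length.

    Assumed model (hypothetical, for illustration only):
        T_long  = h + p                 (normalized so that h + p = 1)
        T_short = h + p / length_ratio
        speedup = T_long / T_short
    Solving for h gives the expression returned below.
    """
    if not 1.0 < speedup < length_ratio:
        raise ValueError("expected 1 < speedup < length_ratio")
    return (length_ratio / speedup - 1.0) / (length_ratio - 1.0)


def zero_length_speedup(speedup: float, length_ratio: float) -> float:
    """Limiting speedup when the operand-dependent part vanishes ('FP0')."""
    return 1.0 / fixed_time_fraction(speedup, length_ratio)


# Figures quoted in the text: 148.6 PFlops (FP64) and 445 PFlops (mixed precision).
dp_pflops = 148.6
mixed_pflops = 445.0
measured_speedup = mixed_pflops / dp_pflops

# FP64 -> FP16 shortens the operands by a factor of four.
housekeeping = fixed_time_fraction(measured_speedup, 4.0)
fp0_pflops = dp_pflops * zero_length_speedup(measured_speedup, 4.0)

print(f"measured speedup:        {measured_speedup:.2f}")
print(f"fixed share of FP64 run: {housekeeping:.3f}")
print(f"extrapolated FP0 limit:  {fp0_pflops:.0f} Pops")
```

Under this assumed model, about one ninth of the FP64 run time is independent
of the operand length, and the extrapolated zero-length limit is roughly
1.3 Eops, which is of the same order as the "slightly above 1 EFlops" value
mentioned above.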
Another plausible assumption is that if we use quick analog signal
processing to replace the slow digital calculation, as proposed in [32, 33],
the system gets proportionally quicker. Adding analog components to a digital
processor, however, has its price. Given that the digital processor cannot
handle resources outside of its world, one must call the operating system
(OS) for help. That help, however, is rather expensive in terms of execution
time. The required context switching takes time on the order of executing
$10^4$ instructions [34, 35], which greatly increases the total execution
time and makes the non-payload to payload ratio much worse.
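To make this cost concrete, the sketch below computes the share of executed
instructions that is payload when every use of an analog unit requires one OS
context switch. The cost of 10^4 instructions per switch is the order of
magnitude quoted above; the payload sizes and the function name are
hypothetical values chosen only for illustration.

```python
# Order-of-magnitude cost of one context switch, in instructions (from the text).
CONTEXT_SWITCH_COST = 10_000


def payload_fraction(payload_instructions: int,
                     switches: int,
                     switch_cost: int = CONTEXT_SWITCH_COST) -> float:
    """Payload share of all executed instructions.

    Assumes (hypothetically) that each context switch adds a fixed
    number of non-payload instructions to the total.
    """
    overhead = switches * switch_cost
    return payload_instructions / (payload_instructions + overhead)


# Hypothetical payload sizes per analog call, one context switch each.
for payload in (100, 10_000, 1_000_000):
    share = payload_fraction(payload, switches=1)
    print(f"payload {payload:>9} instructions -> payload share {share:.4f}")
```

Under these assumptions, a task whose payload is only about 100 instructions
spends roughly 99% of its instructions on the switch itself, so shortening the
payload further (for example, by analog processing) barely changes the total.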
Although these cases seem to be very different, they share at least one
common feature: they change more than one parameter, because they also change
the non-payload to payload ratio that defines the efficiency. They have
different side effects: changing the operand length affects the cache
behavior, and using analog processing requires linking the analog and the
digital processing. Thus, even those seemingly one-parameter changes have a
nonlinear effect on the efficiency of the system.
⁵ Both names are used rather inconsistently. On one side, the test itself
does not have much to do with AI; it just uses the operand length common in
AI tasks. The benchmark HPL, similarly to AI, is a workload type. On the other
side, the Mixed Precision is Half Precision: it is natural that for
multiplication twice as long operands are used temporarily. It is a different
question that the operations are contracted.
⁶ https://www.olcf.ornl.gov/2018/06/08/genomics-code-exceeds-exaops-on-summit-supercomputer/
8. Summary
The authors have identified some critical bottlenecks in current
computational systems/neuronal networks that render conventional computing
architectures unadaptable to large (and even medium) sized neuromorphic
computing. Built with segregated processors (the Single Processor Approach,
SPA, wording from Amdahl [26]), the current systems lack autonomous
communication between processors and imitate biological systems
inefficiently. To overcome these limitations, the authors introduce a
drastically different approach to computing, the Explicitly Many-Processor
Approach (EMPA), which can serve as the basis for development as well as for
specific and practical solutions.
Acknowledgements
The authors thank Prof. Péter Somogyi for valuable comments on a previous
version of the manuscript. Project no. 125547 should have been implemented
with the support provided from the National Research, Development and
Innovation Fund of Hungary, financed under the K funding scheme. However, at
the time of writing the paper, the fund is 20 months late in providing the
support. Because of this, the project is temporarily supported by
Kalimános BT.
References
[1] C. Mead, Neuromorphic electronic systems, Proc. IEEE 78 (1990) 1629–
1636.
[2] US DOE Office of Science, Report of a Roundtable Convened
to Consider Neuromorphic Computing Basic Research Needs,
https://science.osti.gov/-/media/ascr/pdf/programdocuments/
docs/Neuromorphic-Computing-Report_FNLBLP.pdf (2015).
[3] G. Bell, D. H. Bailey, J. Dongarra, A. H. Karp, K. Walsh, A look back
on 30 years of the Gordon Bell Prize, The International Journal of High
Performance Computing Applications 31 (6) (2017) 469–484.
URL https://doi.org/10.1177/1094342017738610
[4] Steve Furber and Steve Temple, Neural systems engineering, J. R. Soc.
Interface 4 (2007) 193–206. doi:10.1098/rsif.2006.0177.
[5] C. Liu, G. Bellec, B. Vogginger, D. Kappel, J. Partzsch, F. Neumärker,
S. Höppner, W. Maass, S. B. Furber, R. Legenstein, C. G. Mayr,
Memory-Efficient Deep Learning on a SpiNNaker 2 Prototype, Frontiers
in Neuroscience 12 (2018) 840. doi:10.3389/fnins.2018.00840.
URL https://www.frontiersin.org/article/10.3389/fnins.2018.
00840
[6] Top500.org, Retooled Aurora Supercomputer Will Be Amer-
ica’s First Exascale System, https://www.top500.org/news/
retooled-aurora-supercomputer-will-be-americas-first-exascale-system/
(2017).
[7] J. Végh, Re-evaluating scaling methods for distributed parallel systems,
IEEE Transactions on Distributed and Parallel Computing ?? (2020) in
review.
URL https://arxiv.org/abs/2002.08316
[8] J. Végh, Which scaling rule applies to Artificial Neural Networks, in: 2020
International Conference on Computational Science and Computational In-
telligence (CSCI), IEEE, 2020, p. Submitted.
[9] J. Keuper, F.-J. Pfreundt, Distributed Training of Deep Neural Networks:
Theoretical and Practical Limits of Parallel Scalability, in: 2nd Workshop
on Machine Learning in HPC Environments (MLHPC), IEEE, 2016, pp.
1469–1476. doi:10.1109/MLHPC.2016.006.
URL https://www.researchgate.net/publication/308457837
[10] J. Végh, How deep the machine learning can be, A Closer Look at
Convolutional Neural Networks, Nova, In press, 2020, pp. 141–169.
[11] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatow-
icz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, K. Yelick,
A View of the Parallel Computing Landscape, Comm. ACM 52 (10) (2009)
56–67.
[12] S(o)OS project, Resource-independent execution support on exa-scale sys-
tems, http://www.soos-project.eu/index.php/related-initiatives
(2010).
[13] Machine Intelligence Research Institute, Erik DeBenedictis on supercom-
puting (2014).
URL https://intelligence.org/2014/04/03/erik-debenedictis/
[14] P. C. et al., Rebooting Our Computing Models, in: Proceedings of the 2019
Design, Automation & Test in Europe Conference & Exhibition (DATE),
IEEE Press, 2019, pp. 1469–1476. doi:10.23919/DATE.2019.8715167.
[15] S. Williams, A. Waterman, D. Patterson, Roofline: An insightful visual per-
formance model for multicore architectures, Commun. ACM 52 (4) (2009)
65–76.
[16] J. L. Gustafson, Reevaluating Amdahl’s Law, Commun. ACM 31 (5) (1988)
532–533. doi:10.1145/42411.42415.
[17] Y. Shi, Reevaluating Amdahl’s Law and Gustafson’s Law, https:
//www.researchgate.net/publication/228367369_Reevaluating_
Amdahl’s_law_and_Gustafson’s_law (1996).
[18] V. Weaver, D. Terpstra, S. Moore, Non-determinism and overcount on
modern hardware performance counter implementations, in: Performance
Analysis of Systems and Software (ISPASS), 2013 IEEE International Sym-
posium on, 2013, pp. 215–224. doi:10.1109/ISPASS.2013.6557172.
[19] J. Végh, Finally, how many efficiencies the supercomputers have?, The
Journal of Supercomputing. doi:10.1007/s11227-020-03210-4.
[20] J. P. Singh, J. L. Hennessy, A. Gupta, Scaling parallel programs for mul-
tiprocessors: Methodology and examples, Computer 26 (7) (1993) 42–50.
doi:10.1109/MC.1993.274941.
[21] György Buzsáki and Xiao-Jing Wang, Mechanisms of Gamma Oscillations,
Annual Reviews of Neurosciences 3 (4) (2012) 19:1–19:29. doi:10.1146/
annurev-neuro-062111-150444.
[22] J. Végh, How to extend the Single-Processor Paradigm to the Explicitly
Many-Processor Approach, in: 2020 International Conference on Compu-
tational Science and Computational Intelligence (CSCI), IEEE, 2020, p. In
print.
URL http://arxiv.org/abs/1908.02651
[23] J. Végh, How Amdahl’s Law limits the performance of large artificial
neural networks: (Why the functionality of full-scale brain simulation on
processor-based simulators is limited), Brain Informatics 6 (2019) 1–11.
URL https://braininformatics.springeropen.com/articles/10.
1186/s40708-019-0097-2/metrics
[24] J. S. et al, TrueNorth Ecosystem for Brain-Inspired Computing: Scalable
Systems, Software, and Applications, in: SC ’16: Proceedings of the Inter-
national Conference for High Performance Computing, Networking, Storage
and Analysis, 2016, pp. 130–141.
[25] M. Davies, et al, Loihi: A Neuromorphic Manycore Processor with On-Chip
Learning, IEEE Micro 38 (2018) 82–99.
[26] G. M. Amdahl, Validity of the Single Processor Approach to Achieving
Large-Scale Computing Capabilities, in: AFIPS Conference Proceedings,
Vol. 30, 1967, pp. 483–485. doi:10.1145/1465482.1465560.
[27] TOP500, November 2017 list of supercomputers, https://www.top500.
org/lists/2017/11/ (2017).
[28] S. J. van Albada, A. G. Rowley, J. Senk, M. Hopkins, M. Schmidt, A. B.
Stokes, D. R. Lester, M. Diesmann, S. B. Furber, Performance Comparison
of the Digital Neuromorphic Hardware SpiNNaker and the Neural Network
Simulation Software NEST for a Full-Scale Cortical Microcircuit Model,
Frontiers in Neuroscience 12 (2018) 291.
[29] IEEE Spectrum, Two Different Top500 Supercomputing Benchmarks Show
Two Different Top Supercomputers,
https://spectrum.ieee.org/tech-talk/computing/hardware/two-different-top500-supercomputing-benchmarks-show-two-different-top-supercomputers
(2017).
[30] J. Végh, A. Tisan, The need for modern computing paradigm: Science
applied to computing, in: 2019 International Conference on Computational
Science and Computational Intelligence (CSCI), IEEE, 2019, pp. 1523–
1532. doi:10.1109/CSCI49370.2019.00283.
URL http://arxiv.org/abs/1908.02651
[31] A. Haidar, S. Tomov, J. Dongarra, N. J. Higham, Harnessing GPU Tensor
Cores for Fast FP16 Arithmetic to Speed Up Mixed-precision Iterative
Refinement Solvers, in: Proceedings of the International Conference for
High Performance Computing, Networking, Storage, and Analysis, SC ’18,
IEEE Press, 2018, pp. 47:1–47:11.
[32] E. Chicca, G. Indiveri, A recipe for creating ideal hybrid memristive-CMOS
neuromorphic processing systems, Applied Physics Letters 116 (12) (2020)
120501. doi:10.1063/1.5142089.
URL https://doi.org/10.1063/1.5142089
[33] Building brain-inspired computing, Nature Communications 10 (12) (2019)
4838. doi:10.1038/s41467-019-12521-x.
URL https://doi.org/10.1038/s41467-019-12521-x
[34] F. M. David, J. C. Carlyle, R. H. Campbell, Context Switch Overheads
for Linux on ARM Platforms, in: Proceedings of the 2007 Workshop on
Experimental Computer Science, ExpCS ’07, ACM, New York, NY, USA,
2007. doi:10.1145/1281700.1281703.
URL http://doi.acm.org/10.1145/1281700.1281703
[35] D. Tsafrir, The context-switch overhead inflicted by hardware interrupts
(and the enigma of do-nothing loops), in: Proceedings of the 2007 Work-
shop on Experimental Computer Science, ExpCS ’07, ACM, New York,
NY, USA, 2007, pp. 3–3.
