Introducing temporal behavior to computing science by Végh, János
János Végh
Kalimános BT
Debrecen, Hungary
Vegh.Janos@gmail.com ORCID: 0000-0002-3247-7810
Abstract—The abstraction introduced by von Neumann cor-
rectly reflected the state of the art 70 years ago. Although
it omitted data transmission time between components of the
computer, it served as an excellent base for classic computing
for decades. Modern computer components and architectures,
however, require considering their temporal behavior: data
transmission time in contemporary systems may be higher than
their processing time. Using the classic paradigm leaves some
issues unexplained, from enormously high power consumption
to days-long training of artificial neural networks to failures of
some cutting-edge supercomputer projects. The paper introduces
the temporal behavior (a temporal logic) that has been missing
so far into computing, while keeping the solid computing science base.
A careful analysis shows that, when the temporal behavior of
components and architectural principles is considered, the seemingly
mysterious issues have a trivial explanation. Some classic design
principles must be revised, and the temporal logic enables us to
design more powerful and efficient computing systems.
Index Terms—temporal logic, computing efficiency, operating
principle of computing, modern computing paradigm, perfor-
mance loss in computing, idle waiting time in computing
I. INTRODUCTION
Computing science is on the border of mathematics
and, through its physical implementation, science. Since
the beginning of computing, the computing paradigm itself,
”the implicit hardware/software contract [1]”, defined how
mathematics-based theory and its science-based implementa-
tion must cooperate. Mathematics, however, considers only the
dependencies between its operands; it assumes that the needed
operands are instantly available. That is, computing science
regards the time spent on performing operations and on delivering
operands to and from processing units as a kind of engineering
imperfection. At the time when von Neumann proposed his famous
abstraction, both time of processing and time of accessing
data (including those on a mass storage device) were in the
milliseconds region, while physical data delivery time was
in the range of microseconds, i.e., three orders of magnitude
smaller. It was a plausible assumption to consider that the total
time of processing comprises only the time of computation plus
the time of data access; the data delivery time was neglected.
Project no. 125547 has been implemented with the support provided
from the National Research, Development and Innovation Fund of Hungary,
financed under the K funding scheme.
Submitted to 2020 International Conference on Computational Science and
Computational Intelligence (CSCE), Las Vegas, US
Today, however, technical development has changed the relations
between those timings drastically. The data access time is now much
larger than the time needed to process the data. Besides, the relative
weight of the data transfer time has grown tremendously, for many
reasons. Firstly, processors were miniaturized to sub-micron size,
while the rest of the components (such as buses) stayed above the
centimeter scale. Secondly, single-processor performance stalled [2];
increasing it further was no longer possible, mainly because the limits
that the laws of nature enable had been reached [3]. Thirdly, making
truly parallel computers failed [1], and today one needs to reach the
needed high computing performance through putting together an excessive
number of segregated processors. This latter way replaces parallel
computing with parallelized sequential computing, disregarding that the
operating rules of that different kind of computing [4], [5], [6]
sharply differ from those of the segregated processors. Fourthly, the
mode of utilization (mainly multitasking) forced the use of operating
systems (OSs), which imitate a ”new processor” for a new task, at a
serious expense of time. Finally, the idea of ”real-time connected
everything” introduced geographically large distances with the
corresponding data delivery times of several milliseconds. The theory
of computing kept the idea of ”instant delivery”, although wiring has
an increasing role even within the core. The idea of non-temporal
behavior was confirmed by accepting ”weak scaling” [7], which suggests
that all housekeeping times, such as organizing the joint work of the
parallelized serial processors, sharing resources, using exceptions and
OS services, and delivering data between processing units and data
storage units, are negligible.
Vast computing systems can cope with their tasks only with
growing difficulty, enormously decreasing computing efficiency,
and enormously growing energy consumption; one can experience
similar issues in the world of networked edge devices. Not being
aware that the collaboration between processors needs a different
approach (another paradigm) has already resulted in demonstrative
failures (such as the supercomputers Gyoukou and Aurora’18, or the
brain simulator SpiNNaker)1, and many more may follow (all of them
intended to deliver 0.13-0.2 Eflops): such as Aurora’21 [9], the
mysterious supercomputers of China2, and the planned EU supercomputers3.
General-purpose computing systems comprising ”only” millions of
processors already show the issues, and brain-like systems are meant
to comprise a number of computing elements four orders of magnitude
higher. When targeting neuromorphic features such as ”deep learning
training”, the issues start to manifest at just a couple of dozens of
processors [10], [11]. The scaling is nonlinear [5], [12], strongly
depending on the workload type, and the Artificial Intelligence
(AI)-class workload is one of the worst workloads [11], [12] one can
run on conventional architectures.
1The explanations are quite different: Gyoukou was withdrawn after its
first appearance; Aurora failed and was retargeted and delayed; despite
the failure of SpiNNaker1, SpiNNaker2 is also under construction [8];
”Chinese decision-makers decided to withhold the country’s newest
Shuguang supercomputers even though they operate more than 50 percent
faster than the best current US machines”.
arXiv:2006.01128v1 [cs.DC] 31 May 2020
”Successfully addressing these challenges [of neuromorphic
computing] will lead to a new class of computers and systems
architectures” [13]. However, the roundtable concentrated only
on finding new materials and different gate devices. They
did not even mention that for such systems a new computing
paradigm may also be needed. The result was that, as noticed
by judges of the Gordon Bell Prize, ”surprisingly, [among the
winners of the supercomputer competition] there have been no
brain-inspired massively parallel specialized computers” [14].
This happened despite the vast need and investments and the
concentrated and coordinated efforts, just because of the vital
bottleneck: the missing theory.
II. INTRODUCING TIME TO COMPUTING
As suspected by many experts, the computing paradigm
itself, ”the implicit hardware/software contract [1]”, is respon-
sible for the experienced issues: ”No current programming
model is able to cope with this development [of processors],
though, as they essentially still follow the classical von Neumann
model” [15]. When thinking about ”advances beyond
2020”, however, the solution was expected from the ”more efficient
implementation of the von Neumann architecture” [16].
Even when speaking about building up computing from scratch
(”rebooting the model” [17]), only implementing different
gating technology for the same computing model is meant.
However, the paradigm prevents (among others) building large
neuromorphic systems, too.
There are many analogies between science and computing [18];
among others, how they handle time. Both classic science and
classic computing assume instant (infinitely quick) interaction
between their objects. That is, an event happening at any location
can be instantly seen at all other locations: time has no specific
role, and an event has an immediate effect on all other considered
objects. In science, discovering that the speed of light cannot be
exceeded led to introducing the four-dimensional space-time. Special
relativity introduces a ’fourth space dimension’, and we calculate
that coordinate of the Minkowski space from the time as the distance
the light traverses in a given time.
2https://www.scmp.com/tech/policy/article/3015997/china-has-decided-not-fan-flames-super-computing-rivalry-amid-us
3https://ec.europa.eu/newsroom/dae/document.cfm?doc_id=60156
To introduce a temporal logic into computing, the reverse
of that transformation is required. In computing, distances
get defined during the fabrication of components and the assembly
of the system. In biological systems, nature defines the neuronal
distances, and in ’wet’ neuro-biology, signal timing rather than
axon length is the right (measurable) parameter. To describe
the temporal operation of computing systems correctly, we
need to find out how much later a component notices that
an event occurred in the system. That is, we need to use a
special 4-vector, where all coordinates are time values: the
first three are the corresponding local coordinates (distances
from the location of the event, divided by the speed of the
interaction), having time dimension, and the fourth coordinate
is the time itself; that is, we introduce a four-dimensional
time-space system. The resemblance to the Minkowski space is
obvious, and the name difference signals the different aspects
of utilization.
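The time-space coordinates described above can be illustrated with a minimal Python sketch. The function names, the sample positions, and the signal speed of 2e8 m/s are assumptions chosen for illustration only; the text fixes only the rule that each spatial coordinate is a distance from the event divided by the speed of the interaction.

```python
import math


def time_space_vector(event_pos_m, observer_pos_m, t_s, speed_m_per_s):
    """Return (t_x, t_y, t_z, t): spatial offsets expressed as times, plus the time itself.

    Positions are 3-element sequences in meters; the speed is in m/s.
    All names and units here are illustrative assumptions.
    """
    if speed_m_per_s <= 0:
        raise ValueError("interaction speed must be positive")
    t_x, t_y, t_z = (
        (obs - ev) / speed_m_per_s for ev, obs in zip(event_pos_m, observer_pos_m)
    )
    return (t_x, t_y, t_z, t_s)


def notice_delay(vector):
    """How much later the observer notices the event (length of the spatial part)."""
    t_x, t_y, t_z, _ = vector
    return math.sqrt(t_x**2 + t_y**2 + t_z**2)


# Hypothetical case: observer 3 cm from the event, assumed signal speed 2e8 m/s.
vec = time_space_vector((0.0, 0.0, 0.0), (0.03, 0.0, 0.0), 0.0, 2.0e8)
print(f"notice delay: {notice_delay(vec):.3e} s")
```

With these assumed numbers, the delay is the distance divided by the speed; changing either the distance or the interaction speed changes the time coordinate proportionally.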
Figure 1a (essentially a light cone in 2D space plus a time
dimension) shows why time must be considered explicitly in
all kinds of computing. The figure shows (for visibility) a
3-dimensional coordinate system: how an event behaves in a
two-dimensional space (the concept is easier to visualize with
the number of spatial dimensions reduced from three to two).
In the figure, the direction ’y’ is not used, but it makes it
possible to place observers at the same distance from the event
without the need to locate them at the same point. The event
happens at the point (0,0,0), the observers are located on the
’x’ axis; the vertical scale corresponds to the time.
In the classic hypothetical physical experiment, we switch
on a light at the origin, and the observer switches on his light
when he notices that the first light was switched on. If we
graph the growing circle with the vertical axis of the graph
representing time, the result is a cone, known as the future light
cone. Both light sources have some ”processing time”, which
passes between noticing the light (receiving the instruction)
and switching the light on (performing the instruction). That
is, the instruction is received at the origin, at the bottom of the
green arrow. The light goes on at the head of the arrow (i.e., at
the same location, but at a later time), after the ’processing
time’ Tp has passed. Following that, the light propagates in the two
spatial dimensions as a circle around the axis ”t”. Observers at
a larger distance notice the light at a later time: a ’transmission
time’ Tt is needed. If the ”processing time” of the light source
of the first event were zero, the light would propagate along
the gray surface at the origin. However, because of the finite
processing time, the light propagates along the blueish cone
surface, at the head of the green arrow.
A circle denotes the position of our observer on the axis ”x”.
With zero ”transmission time”, the second gray conical surface
(at the head of the green dotted arrow) would describe his
light. However, its ”processing time” can only begin when the
observer notices the light at his position: when the dotted red
arrow hits the blueish surface. At that point the ”processing
time” of the second light source begins; the yellowish conical
surface describes the propagation of the second light. The horizontal
(green dotted) arrow describes the physical distance of the
observer (as a time coordinate), the vertical (red dotted) arrow
describes the time delay of the observer’s light. It comprises
two components: the transmission time Tt to the observer and
its processing time Tp. The light cone of the observer starts
at t = 2 · Tp + Tt.
[Figure 1a: three-dimensional plot with axes x, y, and t.]
(a) The computing operation in time-space approach.
The processing operators can be gates, processors,
neurons or networked computers.
[Figure 1b: surface plot with axes ”No of processors”, ”Non-payload/payload”, and ”Efficiency”; marks: TOP5’2018.11 (Summit, Sierra, Taihulight, Tianhe-2, K computer) and Brain.]
(b) The surface and the figure marks show at what efficiency the top supercomputers
run the ’best workload’ benchmark High Performance Linpack (HPL), and the
’real-life load’ High Performance Conjugate Gradients (HPCG) [6]. The right bottom part
displays the expected efficiency [19] of running neuromorphic calculations on Single
Processor Approach (SPA) computers.
Fig. 1: The origin of ”idle waiting time” and its effect on the efficiency of parallelized sequential processing systems
The red arrow represents the resulting apparent processing
time $T_A$: the longer the red vector, the slower the system.
As the vectors are in the same plane,
$T_A = \sqrt{T_t^2 + (2 \cdot T_p + T_t)^2}$, that is,
$T_A = T_p \cdot \sqrt{R^2 + (2 + R)^2}$, where $R = T_t/T_p$.
This means that the apparent time is a non-linear function
of both of its component times and their ratio R. If more
computing elements are involved, Tt denotes the longest
transmission time. (A similar statement is valid if the Tp times
are different.) The effect is significant: if R = 1, the apparent
execution time of performing the two computations is more
than 3 times longer than the processing time. Two more
observers are located on the axis ’x’, at the same position.
For visibility, their timings are displayed at points ’1’ and
’2’, respectively. Their results illustrate the influence of the
transmission speed (and/or the ratio R). In their case the
transmission speed differs by a factor of two compared to that
displayed at point ’0’; in this way three different R = Tt/Tp
ratios are displayed.
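The apparent processing time formula can be evaluated numerically with a short sketch. The function names are illustrative, and the choice of R = 0.5, 1, and 2 is an assumption that takes R = 1 as the reference case and halves or doubles the transmission time around it.

```python
import math


def apparent_time(t_p, t_t):
    """T_A = sqrt(T_t^2 + (2*T_p + T_t)^2), as given in the text."""
    return math.hypot(t_t, 2.0 * t_p + t_t)


def apparent_ratio(r):
    """T_A / T_p = sqrt(R^2 + (2 + R)^2), with R = T_t / T_p."""
    return math.hypot(r, 2.0 + r)


# Assumed ratios: reference case R = 1, plus half and double transmission time.
for r in (0.5, 1.0, 2.0):
    print(f"R = {r}: T_A / T_p = {apparent_ratio(r):.3f}")
# prints 2.550, 3.162 and 4.472 for R = 0.5, 1.0 and 2.0
```

Under these assumed ratios, doubling the transmission time lengthens the apparent time by more than halving it shortens it, which matches the asymmetry the text describes for the observers at points ’1’ and ’2’.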
Notice that at half transmission speed (the horizontal green
arrow is twice as long as that at the origin) the vector is
considerably longer, while at double transmission speed the
decrease of the time is much less pronounced4. Given that the
apparent processing time TA defines the performance of the
system, Tp and Tt must be matched to each other. Mimicking
biology is useful here, too: the time window where the decision is
made5 is of the same size, independently of the path traversed
by the signal (the axon length) and the speed of the signal
(conduction velocity), and is in the order of the ”processing
time” of the neurons.6
4 [6] discusses this phenomenon in detail.
It is not reasonable to fabricate smaller components without
decreasing the processing time proportionally; similarly,
replacing the processing element with a very much quicker
one (such as those proposed in [20], [21], or any that may be
proposed using future physical effects and/or materials) has
only a marginal effect if the physical distance of the computing
elements cannot be reduced proportionally at the same time.
Using shorter operands (half precision rather than double
precision) reduces TA non-proportionally: the housekeeping
costs (such as fetching, addressing) remain constant (although
the amount of data movement and manipulation decreases).
One expects a four-fold performance increase when using
half-precision rather than double-precision operands [22], and the
power consumption data underpin that expectation.
However, the measured increase in computing performance
was only threefold: the apparent execution time TA
and the processing time Tp differ.
5In computing: when the logical function is evaluated
6Biology can change the conduction velocity, but that needs energy, so
finding an optimum is not that simple.
Notice one more important aspect: the transmission time Tt
is an ’idle time’ (the orange arrow in the figure) for
the observer: it is ready to run, takes power, but does no
useful work. Due to their finite physical size and limited
interaction speed (both neglected in the classic paradigm),
the temporal operation of computing systems inherently results
in an idle time of their processing units7, which, since it
depends sensitively on many factors and conditions, can
be a significant contributor to the non-payload portion of their
processing time. Together with other major contributors originating
from their technical implementation (see section III-B), these
”idle waiting” times sharply decrease the payload performance
of the systems. Figure 1b depicts how the efficiencies of recent
supercomputers depend [6] on the number of single-threaded
processors in the system and on the parameter (1−α), which describes
the non-payload portion of the corresponding benchmark task.
It has been known for decades that ”this decay in performance is
not a fault of the architecture, but is dictated by the limited
parallelism” [4]; in excessive systems of modern hardware
(HW), it is also dictated by the laws of nature [18].
All fields of computing benefit from introducing temporal
behavior for the components, from explaining the need for
”in-memory computing” to explaining the low efficiency of
Graphics Processing Units (GPUs) in general-purpose applications.
We present some possible case studies below. Neglecting
temporal behavior limits the utility of any new method,
material or technology if it is designed, developed, or used in
the spirit of the old (timeless) paradigm.
III. IDENTIFYING BOTTLENECKS OF COMPUTING DUE TO
THEIR TECHNICAL IMPLEMENTATION
A. Synchronous and asynchronous operation
The case depicted in Fig. 1a is an asynchronous operation:
when the light cone arrives at the observer, the second processing
can start. If we have additional observers, their transmission
times $T_t^A$ and $T_t^B$ may be different, and we have no way to
synchronize their operation. If we have another observer at the
point mirrored to the origin, the light cone arrives at it at about
the same $T_t^A$, but to synchronize the operation of the two
observers, we would need $T_{synch} = T_t + T_t^A + T_t^B$.
Instead, we issue another light cone (the central clock) at the
origin (in the case of that light cone the processing time is zero,
just a rising edge) and the observers are instructed to start their
processing when this synchronizing light cone reaches their point
of observation.
In the time-space system, not only observers on the surface
of the cone, but also the ones inside the cone, can notice that
the first light went on. If $T_{synch}$ is large enough, all observers
will notice the first light. After noticing the light, they all can
start their processing at the time $t = 2 \cdot T_p + T_{synch}$.
Given that both $T_{p,i}$ and $T_{t,i}$ can be different,
$T_{synch} \geq T_{p,i} + T_{t,i}$ must be fulfilled for any
observer i. This time is at least as large as any of the
$T_{p,i} + T_{t,i}$ times: for the rest of the observers, the
idle time increases. Given that their internal wiring can be
very different, we must choose the clock period according to
the ”worst case”. For the rest of the observers, this constraint
means a significant increase in their value $T_{t,i}$. All observers
must wait for the slowest one. The more observers (and the
more steps!), the more waiting. This effect is considerable
even inside the chip (at ≤ cm distances); in the case of
supercomputers, the distance is about 100 m.
7It can be a crucial factor of inefficiency of general-purpose chips [23]
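The worst-case rule for the clock period can be sketched in a few lines of Python. The function name and the sample timing values are hypothetical; the sketch assumes the smallest admissible synchronization time, that is, equality in the condition that the synchronization time must be at least the sum of processing and transmission time for every observer.

```python
def synchronous_schedule(t_p, t_t):
    """Return (t_synch, idle) for observers with processing times t_p and
    transmission times t_t (same arbitrary time unit).

    t_synch is the smallest value satisfying T_synch >= T_p,i + T_t,i for all i;
    idle[i] is the extra waiting this worst-case choice imposes on observer i.
    """
    if len(t_p) != len(t_t) or not t_p:
        raise ValueError("need equally long, non-empty lists of times")
    needed = [p + t for p, t in zip(t_p, t_t)]
    t_synch = max(needed)
    idle = [t_synch - n for n in needed]
    return t_synch, idle


# Hypothetical timings for three observers.
t_synch, idle = synchronous_schedule([1.0, 1.0, 1.5], [0.25, 0.75, 1.0])
print("clock period:", t_synch)
print("idle time per observer:", idle)
```

Only the slowest observer has zero idle time in this model; every other observer waits for it, and adding a slower observer increases the waiting of all the others.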
A careful analysis [19] discovered that using synchronous
computing (using clock signals) has a significant effect on the
performance of large-scale systems mimicking neuromorphic
operation. The performance analysis [25] of large-scale brain
simulation facilities demonstrated another exciting parallel
between modern science and large-scale computing. The commonly
used 1 ms integration time limited both the many-thread
software (SW) simulator, running on general-purpose
supercomputers, and the purpose-built HW brain simulator
to the same value of payload performance. The same will be
the case very soon in connection with building the targeted
large-scale neuromorphic systems, despite the initial success
of specialized neuronal chips (such as [26], [27]). Although
at a higher value (about two orders of magnitude higher than
the one in [25]), systems built from such chips will also stall
because of the ”quantal nature of time” [18], although using
an asynchronous operating mode can rearrange the scene.
B. The high speed serial bus
The components of technical computing systems (including
the biology-mimicking neuromorphic ones) are connected
through a set of wires, called ”bus”. The bus is essentially
the physical appearance of the ”technical implementation”
of communication, stemming from the SPA, as illustrated in
Fig. 2. The inset shows a simple neuromorphic use case:
one input neuron and one output neuron communicating
through a hidden layer, comprising only two neurons. Fig. 2.A
mostly shows the biological implementation: all neurons are
directly wired to their partners, i.e., a system of ”parallel
buses” (the axons) exists. Notice that the operating time also
comprises two non-payload times (Tt): the data input and
data output, which coincide with the non-payload time of
the other communication party. The diagram displays logical
and temporal dependencies of the neuronal functionality. The
payload operation (”the computing”) can only start after
data is delivered (by the, from this point of view, non-payload
functionality: input-side communication), and output
communication can only begin when the computing has finished.
Importantly, communication and calculation mutually block
each other. Two important points that neuromorphic systems
must mimic can be noticed immediately: i/ the communication time
is an integral part of the total execution time, and ii/ the ability
to communicate is a native functionality of the system. In such a
parallel implementation, the performance of the system, measured
as the resulting total time (processing + transmitting), scales
linearly with increasing both the non-payload communication
speed and the payload processing speed.
[Figure 2: three timing diagrams; horizontal axis: neurons N0 to N3 (plus BUS in panels B and C); vertical axis: time (not proportional); each neuron shows its input time, processing, and output time, with the total time marked. Inset: a network with input neuron x (input layer), neurons a1 and a2 (hidden layer), and output neuron y (output layer). Panel A: parallel buses, linear communication/processing. Panel B: sequential bus, communication bound, nonlinear. Panel C: sequential bus, communication roofline, nonlinear.]
Fig. 2: Implementing neuronal communication in different technical approaches. A: the parallel bus; B and C: the shared serial
bus, before and after reaching the communication ”roofline” [24]
Fig. 2.B shows a technical implementation of a high-speed
shared bus for communication. To the right of the grid, the
activity that loads the bus at the given time is shown. A
double arrow illustrates the communication bandwidth, the length
of which is proportional to the number of packages the bus
can deliver in a given time unit. We assume that the input
neuron can send its information in a single message to the
hidden layer, furthermore, that the processing by neurons in the
hidden layer both starts and ends at the same time. However,
the neurons must compete for access to the bus, and only
one of them can send its message immediately; the other(s)
must wait until the bus gets released. The output neuron can
only receive the message when the first neuron has completed
sending it. Furthermore, the output neuron must first acquire the
second message from the bus, and the processing can only begin
after having both input arguments. This constraint results in
sequential bus delays both during non-payload processing in
the hidden layer and payload processing in the output neuron.
Adding one more neuron to the layer introduces one more
delay.
Using the formalism introduced above, $T_t = 2 \cdot T_B + T_d + X$,
i.e., the bus must be reached in time $T_B$ (which covers not only
delivering the operand to the bus, but also waiting for arbitration:
the right to use the bus), twice, plus the physical delivery through
the bus takes time $T_d$. The X denotes ”foreign contribution”: if the
bus is not dedicated to ”neurons in this layer only”, any other traffic
also loads the bus: both messages from different layers and
the general system messages may make processing slower.8
Even if only one single neuron exists in the hidden layer, it
must use the mechanisms of sharing the bus, case by case.
The physical delivery to the bus takes more time than a
transfer to a neighboring neuron (both the arbiter and the bus
are in the cm distance range). If we have more neurons (such
as a hidden layer) on the bus, and they work in parallel,
they all must wait for the bus. The high-speed bus is only very
slightly loaded when only a couple of neurons are present,
and its load increases linearly with the number of neurons
in the hidden layer (or, maybe, all neurons in the system).
The temporal behavior of the bus, however, is different. Under
the biology-mimicking workload, the second neuron must wait
for all its inputs originating in the hidden layer. If we have
L neurons in the hidden layer, the transmission time of the
neuron behind the hidden layer is $T_t = L \cdot 2 \cdot T_B + T_d + X$.
This temporal behavior explains why ”shallow networks with
many neurons per layer . . . scale worse than deep networks
with less neurons” [10]: the physical bus delivery time $T_d$, as
well as the processing time $T_p$, become marginal if the layer
forces many arbitrations to reach the bus: the number
of neurons in the hidden layer defines the transfer time
(recall Fig. 1a for the consequences of increasing the transfer
time). In deeper networks, the system sends its messages at
different times in different layers (and they may even have
independent buses between the layers), although the shared
bus persists in limiting the communication. Notice that there
is no way to organize the message traffic: only one bus exists.
8”The idea of using the popular shared bus to implement the communication
medium is no longer acceptable, mainly due to its high contention.” [28]
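The dependence of the transmission time on the number of hidden-layer neurons can be made concrete with a small sketch of the formula for L neurons on a shared bus. The function names and the numeric values (in arbitrary time units) are assumptions for illustration; the helper `arbitration_share` is not from the paper and only reports which fraction of the transmission time is spent on reaching the bus.

```python
def transmission_time(n_hidden, t_bus, t_delivery, foreign=0.0):
    """T_t = L * 2 * T_B + T_d + X for the neuron behind a hidden layer of L neurons."""
    if n_hidden < 1:
        raise ValueError("the hidden layer needs at least one neuron")
    return n_hidden * 2.0 * t_bus + t_delivery + foreign


def arbitration_share(n_hidden, t_bus, t_delivery, foreign=0.0):
    """Fraction of T_t spent on reaching the bus (hypothetical helper)."""
    total = transmission_time(n_hidden, t_bus, t_delivery, foreign)
    return n_hidden * 2.0 * t_bus / total


# Assumed values: T_B = 1.0, T_d = 0.5, no foreign traffic (X = 0).
for n in (2, 8, 64):
    t_t = transmission_time(n, 1.0, 0.5)
    share = arbitration_share(n, 1.0, 0.5)
    print(f"L = {n:3d}: T_t = {t_t:6.1f}, arbitration share = {share:.3f}")
```

In this sketch the physical delivery time becomes a shrinking fraction of the transmission time as the layer gets wider, which is the effect the text uses to explain the poor scaling of shallow, wide networks.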
Given that the bus bandwidth is finite, there comes a
point when the amount of messages exceeds the available bus
bandwidth. Fig. 2.C demonstrates the case where, for better
visibility, the bus bandwidth is lower, but the required packet
bandwidth slice is the same. In this case, the second neuron
in the hidden layer cannot send its message when the first one
finishes its transmission: the bus transmission roofline [24] is
reached. In that case the transmission time is extended
with a new term: $T_t = (B + L) \cdot 2 \cdot T_B + T_d + X$, where B is the
number of messages above the number of messages that the
bus can deliver in a unit time. Reaching the roofline causes
further extra delay in both the non-payload and the payload
processing times, extending the total execution time. A single
sequential bus can deliver the messages only one after the
other, i.e., increasing the number of neurons increases the
utilization of the bus and prolongs the total execution time as well
as the apparent processing time of the individual neurons. This effect
can be so strong in large systems that emergency measures
had to be introduced: the events ”are processed as they
come in and are dropped if the receiving process is busy over
several delivery cycles” [25].
[Figure 3: timing diagram; horizontal axis: processing units P0 to P4; vertical axis: time (not proportional). Stages shown: AccessInitiation, SoftwarePre, OSPre, T0 to T4, PD00 to PD41, Process0 to Process4, Just waiting, OSPost, SoftwarePost, AccessTermination; the time spans Payload, Total, and Extended are marked, with $\alpha = Payload/Total$. Inset: for 1993, $\alpha = 1 - 1 \cdot 10^{-3}$ and $N_{cores} = 10^3$ give $R_{Max}/R_{Peak} = 1/(N \cdot (1-\alpha) + \alpha) = 1/(10^3 \cdot 10^{-3} + 1) = 0.5$; for 2018 (Sunway/Taihulight), $\alpha = 1 - 3.3 \cdot 10^{-8}$ and $N_{cores} = 10^7$ give $R_{Max}/R_{Peak} = 1/(N \cdot (1-\alpha) + \alpha) = 0.74$; Total = $10^{13}$ clocks.]
Fig. 3: A non-technical, simplified model of parallelized sequential computing operations. The contributions of the model
component XXX to α (sometimes written as $\alpha_{eff}$ to emphasize that it is an effective, empirical value) will be denoted by
$\alpha_{eff}^{XXX}$ in the text. Notice the different nature of those contributions. They have only one common feature: they all consume
time. The vertical scale displays the actual activity for processing units shown on the horizontal scale.
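A minimal sketch of the roofline-extended transmission time follows. The names and numbers are illustrative, and the helper `excess_messages` encodes one possible reading of B (the count of messages beyond what the bus can deliver in a unit time); that reading is an assumption, not a definition taken from the paper.

```python
def excess_messages(n_messages, bus_capacity):
    """B: messages above the number the bus can deliver in a unit time (assumed reading)."""
    return max(0, n_messages - bus_capacity)


def transmission_time_roofline(n_hidden, n_excess, t_bus, t_delivery, foreign=0.0):
    """T_t = (B + L) * 2 * T_B + T_d + X, the extended form used past the roofline."""
    return (n_excess + n_hidden) * 2.0 * t_bus + t_delivery + foreign


# Assumed values: 4 hidden neurons, T_B = 1.0, T_d = 0.5, X = 0.
below = transmission_time_roofline(4, excess_messages(4, 8), 1.0, 0.5)
above = transmission_time_roofline(4, excess_messages(4, 3), 1.0, 0.5)
print("below the roofline:", below)
print("above the roofline:", above)
```

Below the roofline B is zero and the expression reduces to the form for L neurons; above it, every excess message adds two further bus access times.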
When using a shared bus, increasing either the processing speed
or the communication speed no longer affects the total
execution time linearly. Furthermore, it is not the bus speed
that limits performance. Recall Fig. 1a again to see how the
time projection of a relatively small increase in the transfer
time Tt can lead to a relatively large change in the value of the
apparent processing time TA, and so to an incomprehensible
slowdown of the system: the slowest component defines
efficiency. The conventional way of communication may work fine as
long as there is no competition for the bus, but it leads to queuing
of messages in the case of (more than one!) independent
sources of communication. The bursty nature of the traffic, caused
by the need for central synchronization, aggravates the effect and
leads to a ”communicational collapse” [29] that rules out huge
many-processor systems, especially neuromorphic ones [30].
To have a chance to connect a large number of computing
units in biology-mimicking systems, a drastically new bus system
and a drastically new traffic organization are required.
C. Parallelized sequential processing
Present technical approaches assume a linear dependence
between the payload and nominal performances of computing
systems, as ”Gustafson’s formulation [7] gives an illusion
that as if N [the number of the processors] can increase
indefinitely” [31]. The fact that ”in practice, for several
applications, the fraction of the serial part happens to be very, very
small thus leading to near-linear speedups” [31], however,
misled the researchers. Gustafson’s ”linear scaling” neglects
all non-payload contributions entirely, including the temporal
behavior of the components. He based his conclusions on
only several hundred processors. The interplay of improving
parallelization and general HW development (including the
non-determinism of modern HW [32]) concealed for decades that
weak scaling was used far outside of its range of validity [5],
[12]. The ”real scaling” is strongly nonlinear, with a bound
defined by nature.
In our terminology, Gustafson’s assumption means that
Tt = 0, which is not the case in any computing system, and
especially not in the case of neuromorphic computing systems.
[Figure 4, left panel: ”Relativistic speed of body accelerated by ’g’”; horizontal axis: time (s); vertical axis: speed (m/s); curves v(t) for n = 1, n = 2.5, and n = 5. Right panel: ”Payload performances of N cores @100GFlops”; horizontal axis: nominal performance (EFlops); vertical axis: payload performance (EFlops); curves for 1−α = 1e−10, 1e−8, 1e−7, 1e−6, 1e−5, and 1e−4.]
Fig. 4: The limiting effect considered in the ”modern” theories. On the left side, the speed limit, as explained by the theory of
relativity, is illustrated. The refractive index of the medium defines the value of the speed limit. On the right side, the payload
performance limit of the parallelized sequential computing systems, as explained by the ”modern paradigm”, is illustrated. The
ratio of the non-payload to payload processing defines the value of the payload performance.
As pointed out above, having idle time in computing systems
is inevitable; the vastly increased number of idle cycles due
to the physical size and operating mode of computing systems
led to the effects detailed above.
Amdahl listed [33] different reasons why losses in
“computational load” can occur. Fortunately, Amdahl’s idea enables us
to put everything that cannot be parallelized, i.e., distributed
between the fellow processing units, into the sequential-only
fraction. To describe the parallel operation of sequentially
working units, the model depicted in Figure 3 was prepared.
The technical implementations of the different parallelization
methods show a virtually infinite variety [34], so we
present here an intentionally strongly simplified, non-technical
model. The model has some obvious limitations, among
others because of the non-determinism of modern HW
systems [32], [35].
In addition to the idle times discussed above, the serialized
parallel processing adds one more contribution. Even the
simplest (parallelized sequential) task has a non-parallelizable
portion of time that, according to Amdahl’s Law, limits
the achievable payload computing performance. Here the
sequential bus and the transmission delay again play a role.
Because, in the SPA, the initiating processor can address only
one processor at a time (or, through clustering, just a few of
them), the other processors must spend additional time waiting
idle: the loop that addresses them takes time, and the cable
length significantly increases their $T_t$. This effect, however,
comes to light only at a relatively high number of cores and
with real-life workloads. At a lower number of cores and with
HPL-class benchmarks, only a slight deviation from the linearity
predicted by “weak scaling” can be noticed.
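The effect of the sequential addressing loop can be illustrated with a toy calculation. The sketch below is not the model of [6], [18]; it assumes, purely for illustration, that addressing each additional core costs a fixed time `t_address`, that this cost is not parallelizable, and that the remaining work follows Amdahl's Law. The function name and all numerical values are hypothetical.

```python
def speedup(n_cores, t_serial, t_parallel, t_address):
    """Speedup over a single core when the other cores are addressed in a loop.

    All times are hypothetical illustration values in arbitrary units:
    t_serial   -- non-parallelizable part of the task
    t_parallel -- parallelizable part, shared evenly among the cores
    t_address  -- time the initiating core needs to address one other core
    """
    single_core_time = t_serial + t_parallel
    # The initiating processor addresses the other cores one after another,
    # so the loop adds (n_cores - 1) * t_address of sequential-only time.
    many_core_time = (
        t_serial + (n_cores - 1) * t_address + t_parallel / n_cores
    )
    return single_core_time / many_core_time


if __name__ == "__main__":
    for n in (1, 10, 100, 1_000, 10_000, 100_000):
        s = speedup(n, t_serial=1e-6, t_parallel=1.0, t_address=1e-8)
        print(f"N = {n:>7}: speedup = {s:10.1f}, efficiency = {s / n:.4f}")
```

With these assumed values the speedup stays close to linear up to about a hundred cores and the deviation becomes dominant only at much higher core counts, where the addressing loop outweighs the shrinking parallel part.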
The right subfigure in Fig. 4 displays the payload performance
of a many-processor SPA system when executing different
workloads (which define the non-payload to payload ratio);
for the mathematical details see [6], [18]. The top diagram lines
represent the best payload performance that supercomputers can
achieve when running the HPL benchmark, which represents
the minimum communication a parallelized sequential system
needs. The bottom diagram line represents the estimated
payload performance that neuromorphic-type processing
can achieve in SPA systems (see also Figure 1b). Notice the
similarity with the left subfigure: under extreme conditions,
an environment-dependent speed limit exists in science, and
a workload-dependent payload performance limit exists in
computing [18].
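The similarity of the two limits can be seen by writing both curves in closed form. The expressions below are a sketch under assumptions, not the derivation given in [6], [18]: the left curve is taken to be the textbook velocity of a body under constant proper acceleration $g$, with the vacuum speed of light $c$ replaced by $c/n$ for a medium of refractive index $n$; the right curve is taken to be Amdahl's Law for $N$ cores of single-core performance $P_1$ and parallelizable fraction $\alpha$. The symbols $v_{\max}$, $P_{\max}$, and $P_{\mathrm{nominal}} = N P_1$ are introduced here only for illustration.

```latex
\begin{align}
  v(t) &= \frac{g\,t}{\sqrt{1 + \left(g\,t / v_{\max}\right)^{2}}},
       & v_{\max} &= \frac{c}{n}, \\
  P_{\mathrm{payload}}(N) &= \frac{N P_1}{\alpha + N\,(1-\alpha)}
       \approx \frac{P_{\mathrm{nominal}}}{1 + P_{\mathrm{nominal}} / P_{\max}},
       & P_{\max} &= \frac{P_1}{1-\alpha},
\end{align}
```

where the approximation holds for $\alpha \approx 1$. In both cases the quantity grows linearly with its "classic" value ($g\,t$ or $P_{\mathrm{nominal}}$) as long as that value is small compared to the limit, and saturates at a limit set by the environment ($n$) or by the workload ($1-\alpha$).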
To have hope of significantly increasing the computing
performance of our present cutting-edge conventional and future
neuromorphic computing systems, a principle other than
parallelizing otherwise sequentially processing systems must be
discovered. The present paradigm leads not only to inherent
performance limits but also to irrationally high power
consumption.
D. Communication
One of the worst factors limiting computing performance
is the method of communication between processors: the
communication load increases exponentially with increasing
complexity/number of processors.
Historically, in the model of computing proposed by von
Neumann, there was one single entity, an isolated (non-
communicating) processor, whereas in bio-inspired models,
billions of entities, organized into specific assemblies, co-
operate via communication. (Communication here means not
only sending data, but also sending/receiving signals, including
synchronization of the operation of entities.) Neuromorphic
systems, expected to perform tasks in one paradigm, but
assembled from components manufactured using principles of
(and implemented by experts trained in) the other paradigm,
are unable to perform at the required speed and efficacy for
real-world solutions. The larger the system, the higher the
communication load and the performance debt. With reference
to Fig. 1a, the time contribution of the communication is part
of the processing time $T_p$, although the overwhelming part
of it could be done in parallel with the computing activity.
This feature both decreases the available processing capacity of a
neuron and strongly changes the value of R. More importantly, a
neuron must use the communication facilities through Input/Output
(I/O) instructions, which wastes a massive amount of time.
[Figure 5 panels. (a) Inter-ICCB Communication (Local Neuron Communication) and Inter-Cluster Communication (Distant Neuron Communication). (b) Cores labeled N, NE, SE, S, SW, NW connected through Inter-Core Communication Blocks; legend: ICCB = Inter-Core Communication Block, M = Memory handling/bus, G = Gateway handling/bus.]
(a) The conceptual communication diagram (compare to Fig. 7 in [36]),
mimicking the communication between local neurons and farther
neurons.
(b) The proposed implementation: the Inter-Core Communication
Blocks represent a “local bus” (directly wired, with no contention),
while the cores can communicate with the cores in other clusters
through the ’G’ gateway as well as the ’M’ (local and global) memory.
Fig. 5: The communication scheme between local and farther neurons, as it can be implemented technically [37]
IV. THE EFFECT OF TEMPORAL BEHAVIOR ON SCALING
The dependence of the payload performance on the nominal
performance in systems with very many processors is strongly
nonlinear at higher performance values (implemented using a
large number of processors). This effect is especially
disadvantageous for networks, such as neuromorphic ones, that show
a disproportionately large idle waiting time, mainly because
of the reasons presented above. The linear dependence at low
nominal performance values explains why the initial successes of
any new technology, material, or method in the field, using the
classic computing model, can be misleading: in simple cases, the
classic paradigm performs tolerably well because, compared
to biological neural networks, current neuron/dendrite
models are simple, the networks are small, and the learning
models appear to be rather basic.
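The transition from the linear to the saturating regime can be reproduced numerically. The following is a minimal sketch, assuming that the payload performance follows Amdahl's Law with parallelizable fraction alpha and that each core delivers 100 GFlops (the value in the title of the right panel of Fig. 4). The function names are hypothetical, and the complete model is described in [6], [18].

```python
def payload_performance(n_cores, one_minus_alpha, core_gflops=100.0):
    """Payload performance (GFlops) of n_cores cores under Amdahl's Law.

    one_minus_alpha is the sequential-only (non-payload) fraction; the
    default of 100 GFlops per core is an assumed illustration value.
    """
    alpha = 1.0 - one_minus_alpha
    # Amdahl's Law: speedup = 1 / ((1 - alpha) + alpha / N)
    speedup = 1.0 / (one_minus_alpha + alpha / n_cores)
    return core_gflops * speedup


def performance_limit(one_minus_alpha, core_gflops=100.0):
    """Payload performance (GFlops) approached as n_cores grows without bound."""
    return core_gflops / one_minus_alpha


if __name__ == "__main__":
    GFLOPS_PER_EFLOPS = 1e9
    for one_minus_alpha in (1e-10, 1e-8, 1e-6, 1e-4):
        limit = performance_limit(one_minus_alpha) / GFLOPS_PER_EFLOPS
        print(f"1-alpha = {one_minus_alpha:.0e}, limit = {limit:.3g} EFlops")
        for n_cores in (10**2, 10**4, 10**6, 10**7):
            nominal = n_cores * 100.0 / GFLOPS_PER_EFLOPS
            payload = (
                payload_performance(n_cores, one_minus_alpha)
                / GFLOPS_PER_EFLOPS
            )
            print(f"  nominal {nominal:.0e} EFlops -> payload {payload:.3e} EFlops")
```

For a small sequential-only fraction the payload performance follows the nominal one almost exactly over the whole range, while for a larger fraction it flattens out well below the nominal value once the number of cores becomes comparable to 1/(1-alpha).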
Biology takes into account that the transmission time is a
crucial part of the processing. “Importantly, distally projecting
axons of long-range interneurons have several-fold thicker
axons and larger diameter myelin sheaths than do pyramidal
cells, allowing for considerably faster axon conduction
velocity” [36]. Faster conduction increases the energy
consumption of a cell (it needs more myelin), but it prevents
a race condition between the signals. Biology “wastes”
extra energy only when required, and this suggests
the need to refine the “fire and wire together” operating
principle by modulating the conduction velocity. The
surprising resemblance between Fig. 5a and Fig. 7 in [36]
also underlines the importance of making a clear distinction
between handling ’near’ and ’far’ signals. Although the
Inter-Core Communication Blocks (ICCBs) in the biology-mimicking
architecture [37] can adequately represent ’locally
connected’ interneurons and the ’G’ gateway the ’long-range
interneurons’, the biological conduction time must be
separately maintained. Computer technology cannot speed up
communication selectively, as biology does, and it is not worth
slowing it down selectively. Making time-stamps and relying
on computer network delivery principles is not sufficient:
temporal behavior is a vital feature of biology-mimicking
systems, and we must not replace it with the synchronization
principles of computing.
V. SUMMARY
Statements such as “The von Neumann architecture is
fundamentally inefficient and non-scalable for representing
massively interconnected neural networks” [38] should be
modified to refer to “the architectures based on the non-temporal
abstraction proposed by von Neumann”. The figures above, in
particular, provide a very clear pointer: to build efficient and large
systems (including neuromorphic ones), the fundamental
principles of operation of computing and communication, including
the bus system and the principle of handling messages, as well as
the cooperation between processors, must be scrutinized and
drastically changed. Comprehending the temporal behavior of
the components can serve as a good starting point to do so.
REFERENCES
[1] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubi-
atowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel,
and K. Yelick, “A View of the Parallel Computing Landscape,” Comm.
ACM, vol. 52, no. 10, pp. 56–67, 2009.
[2] US National Research Council. (2011) The Future
of Computing Performance: Game Over or Next
Level? [Online]. Available: http://science.energy.gov/ /me-
dia/ascr/ascac/pdf/meetings/mar11/Yelick.pdf
[3] I. Markov, “Limits on fundamental limits to computation,” Nature, vol.
512(7513), pp. 147–154, 2014.
[4] J. P. Singh, J. L. Hennessy, and A. Gupta, “Scaling parallel programs for
multiprocessors: Methodology and examples,” Computer, vol. 26, no. 7,
pp. 42–50, Jul. 1993.
[5] J. Végh, “Re-evaluating scaling methods for distributed parallel
systems,” IEEE Transactions on Distributed and Parallel
Computing, vol. ??, p. in review, 2020. [Online]. Available:
https://arxiv.org/abs/2002.08316
[6] J. Végh, “Finally, how many efficiencies the supercomputers have?”
The Journal of Supercomputing, feb 2020. [Online]. Available:
https://doi.org/10.1007%2Fs11227-020-03210-4
[7] J. L. Gustafson, “Reevaluating Amdahl’s Law,” Commun. ACM, vol. 31,
no. 5, pp. 532–533, May 1988.
[8] C. Liu, et al, “Memory-Efficient Deep Learning on a SpiNNaker 2
Prototype,” Frontiers in Neuroscience, vol. 12, p. 840, 2018. [Online].
Available: https://www.frontiersin.org/article/10.3389/fnins.2018.00840
[9] https://www.top500.org/news/retooled-aurora-supercomputer-will-be-americas-first-exascale-system/, 2017.
[10] J. Keuper and F.-J. Preundt, “Distributed Training of Deep Neural
Networks: Theoretical and Practical Limits of Parallel Scalability,”
in 2nd Workshop on Machine Learning in HPC Environments
(MLHPC). IEEE, 2016, pp. 1469–1476. [Online]. Available:
https://www.researchgate.net/publication/308457837
[11] J. Végh, How deep the machine learning can be, ser. A Closer Look at
Convolutional Neural Networks. Nova, In press, 2020, pp. 141–169.
[Online]. Available: https://arxiv.org/abs/2005.00872
[12] ——, “Which scaling rule applies to Artificial Neural Networks,”
in 2020 International Conference on Computational Science and
Computational Intelligence (CSCE). IEEE, 2020, p. Accepted
ICA2246. [Online]. Available: http://arxiv.org/abs/2005.08942
[13] US DOE Office of Science, “Report of a Roundtable Convened to
Consider Neuromorphic Computing Basic Research Needs,” 2015.
[14] G. Bell, D. H. Bailey, J. Dongarra, A. H. Karp, and K. Walsh, “A
look back on 30 years of the Gordon Bell Prize,” [Online]. Available:
https://doi.org/10.1177/1094342017738610
[15] S(o)OS project, “Resource-independent execution support on exa-scale
systems,” http://www.soos-project.eu/index.php/related-initiatives, 2010.
[16] Machine Intelligence Research Institute, “Erik DeBene-
dictis on supercomputing,” 2014. [Online]. Available:
https://intelligence.org/2014/04/03/erik-debenedictis/
[17] P. C. et al., “Rebooting Our Computing Models,” in Proceedings of the
2019 Design, Automation & Test in Europe Conference & Exhibition
(DATE). IEEE Press, 2019, pp. 1469–1476.
[18] J. Végh and A. Tisan, “The need for modern computing
paradigm: Science applied to computing,” in 2019 International
Conference on Computational Science and Computational Intelligence
(CSCI). IEEE, 2019, pp. 1523–1532. [Online]. Available:
http://arxiv.org/abs/1908.02651
[19] J. Végh, “How Amdahl’s Law limits the performance
of large artificial neural networks,” Brain
Informatics, vol. 6, pp. 1–11, 2019. [Online]. Available:
https://braininformatics.springeropen.com/articles/10.1186/s40708-019-0097-2/metrics
[20] E. Chicca and G. Indiveri, “A recipe for creating ideal hybrid
memristive-CMOS neuromorphic processing systems,” Applied Physics
Letters, vol. 116, no. 12, p. 120501, 2020. [Online]. Available:
https://doi.org/10.1063/1.5142089
[21] “Building brain-inspired computing,” Nature Communications,
vol. 10, no. 12, p. 4838, 2019. [Online]. Available:
https://doi.org/10.1038/s41467-019-12521-x
[22] A. Haidar, P. Wu, S. Tomov, and J. Dongarra, “Investigating Half
Precision Arithmetic to Accelerate Dense Linear System Solvers,” in
Proceedings of the 8th Workshop on Latest Advances in Scalable
Algorithms for Large-Scale Systems, ser. ScalA ’17. New York, NY,
USA: ACM, 2017, pp. 10:1–10:8.
[23] R. Hameed, W. Qadeer, M. Wachs, O. Azizi, A. Solomatnikov, B. C.
Lee, S. Richardson, C. Kozyrakis, and M. Horowitz, “Understanding
sources of inefficiency in general-purpose chips,” in Proceedings of the
37th Annual International Symposium on Computer Architecture, ser.
ISCA ’10. New York, NY, USA: ACM, 2010, pp. 37–47. [Online].
Available: http://doi.acm.org/10.1145/1815961.1815968
[24] S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful
visual performance model for multicore architectures,” Commun. ACM,
vol. 52, no. 4, pp. 65–76, Apr. 2009.
[25] S. J. van Albada, A. G. Rowley, J. Senk, M. Hopkins, M. Schmidt, A. B.
Stokes, D. R. Lester, M. Diesmann, and S. B. Furber, “Performance
Comparison of the Digital Neuromorphic Hardware SpiNNaker and the
Neural Network Simulation Software NEST for a Full-Scale Cortical
Microcircuit Model,” Frontiers in Neuroscience, vol. 12, p. 291, 2018.
[26] F. Akopyan, “Design and Tool Flow of IBM’s TrueNorth: An Ultra-Low
Power Programmable Neurosynaptic Chip with 1 Million Neurons,”
in Proceedings of the 2016 on International Symposium on Physical
Design, ser. ISPD ’16. New York, NY, USA: ACM, 2016, pp. 59–60.
[Online]. Available: http://doi.acm.org/10.1145/2872334.2878629
[27] M. Davies, et al, “Loihi: A Neuromorphic Manycore Processor with
On-Chip Learning,” IEEE Micro, vol. 38, pp. 82–99, 2018.
[28] L. de Macedo Mourelle, N. Nedjah, and F. G. Pessanha, Reconfigurable
and Adaptive Computing: Theory and Applications. CRC press, 2016,
ch. 5: Interprocess Communication via Crossbar for Shared Memory
Systems-on-chip.
[29] S. Moradi and R. Manohar, “The impact of on-chip communication on
memory technologies for neuromorphic systems,” Journal of Physics D:
Applied Physics, vol. 52, no. 1, p. 014003, oct 2018.
[30] S. B. Furber, D. R. Lester, L. A. Plana, J. D. Garside, E. Painkras,
S. Temple, and A. D. Brown, “Overview of the SpiNNaker System
Architecture,” IEEE Transactions on Computers, vol. 62, no. 12, pp.
2454–2467, 2013.
[31] Y. Shi, “Reevaluating Amdahl’s Law and Gustafson’s
Law,” https://www.researchgate.net/publication/
228367369 Reevaluating Amdahl’s law and Gustafson’s law, 1996.
[32] V. Weaver, D. Terpstra, and S. Moore, “Non-determinism and over-
count on modern hardware performance counter implementations,” in
Performance Analysis of Systems and Software (ISPASS), 2013 IEEE
International Symposium on, April 2013, pp. 215–224.
[33] G. M. Amdahl, “Validity of the Single Processor Approach to Achieving
Large-Scale Computing Capabilities,” in AFIPS Conference Proceed-
ings, vol. 30, 1967, pp. 483–485.
[34] K. Hwang and N. Jotwani, Advanced Computer Architecture: Paral-
lelism, Scalability, Programmability, 3rd ed. Mc Graw Hill, 2016.
[35] P. Molnár and J. Végh, “Measuring Performance of Processor
Instructions and Operating System Services in Soft Processor Based Systems,”
in 18th Internat. Carpathian Control Conf. ICCC, 2017, pp. 381–387.
[36] G. Buzsáki and X.-J. Wang, “Mechanisms of Gamma
Oscillations,” Annual Review of Neuroscience, vol. 3, no. 4, pp. 19:1–19:29,
nov 2012.
[37] J. Végh, “How to extend the Single-Processor Paradigm to the
Explicitly Many-Processor Approach,” in 2020 International Conference on
Computational Science and Computational Intelligence (CSCE). IEEE,
2020, pp. Accepted FCS2243, in print.
[38] J. S. et al, “TrueNorth Ecosystem for Brain-Inspired Computing: Scal-
able Systems, Software, and Applications,” in SC ’16: Proceedings of the
International Conference for High Performance Computing, Networking,
Storage and Analysis, 2016, pp. 130–141.
