How to extend the Single-Processor Paradigm to the Explicitly
  Many-Processor Approach by Végh, János
János Végh
Kalimános BT
Debrecen, Hungary
Vegh.Janos@gmail.com ORCID: 0000-0002-3247-7810
Abstract—The computing paradigm invented for processing a
small amount of data on a single segregated processor cannot
meet the challenges set by the present-day computing demands.
The paper proposes a new computing paradigm (extending the
old one to use several processors explicitly) and discusses some
questions of its possible implementation. Some advantages of the
implemented approach, illustrated with the results of a loosely-
timed simulator, are presented.
Index Terms—modern computing paradigm, performance lim-
itation, efficiency, parallelized computing, distributed computing
I. INTRODUCTION
Physical implementations of the 70-year-old computing
paradigm have several limitations [1]. As time passes,
more and more difficulties come to light, but the development
of the processor, the central element of a computer, could keep
pace with the growing demand for computing up to a point.
Around 2005 it became evident that the price paid for keeping
the Single Processor Approach (SPA) paradigm [2] (as
Amdahl coined the wording) had become too high. ”The implicit
hardware/software contract was, that increases in transistor
count and power dissipation were OK as long as architects
maintained the existing sequential programming model. This
contract led to innovations that were inefficient in transistors
and power – such as multiple instruction issue, deep pipelines,
out-of-order execution, speculative execution, and prefetching
– but which increased performance while preserving the se-
quential programming model.” [3] The conclusion was that
”new ways of exploiting the silicon real estate need to be
explored” [4].
”Future growth in computing performance must come from
parallelism” [5] is the common point of view. However, ”when
we start talking about parallelism and ease of use of truly
parallel computers, we’re talking about a problem that’s as hard
as any that computer science has faced.” [3]. Mainly because
of this, parallel utilization of computers could not replace the
energy-wasting solutions introduced into the formerly favored
single-thread processors. They remained in the Multi-Core
and/or Many-Core (MC) processors, greatly contributing to
their dissipation and, through this, to the overall crisis of
computing [6].
Project no. 125547 has been implemented with the support provided
from the National Research, Development and Innovation Fund of Hungary,
financed under the K funding scheme.
Submitted to 2020 International Conference on Computational Science and
Computational Intelligence (CSCE), Las Vegas, US
The computing paradigm itself, the implicit hardware/software
contract, was questioned even more explicitly: ”Processor and
network architectures are making rapid progress with more
and more cores being integrated into single processors and
more and more machines getting connected with increasing
bandwidth. Processors become heterogeneous and reconfig-
urable . . . No current programming model is able to cope with
this development, though, as they essentially still follow the
classical von Neumann model.” [7] On one side, when thinking
about ”advances beyond 2020”, the solution was expected
from the ”more efficient implementation of the von Neumann
architecture” [8]. On the other side, there are statements such
as ”The von Neumann architecture is fundamentally inefficient
and non-scalable for representing massively interconnected
neural networks” [9].
It is therefore worth scrutinizing that implicit
hardware/software contract, to see whether the processor architecture
could be better adapted to the changes that have occurred in the past
seven decades in the technology and utilization of computing.
Implicitly, both the hardware (HW) and software (SW) solutions
advantageously use multi-processing. The paper shows
that, using a less rigid interpretation of the terms on which that
contract is based, one can extend the single-thread
paradigm to use several processors explicitly (enabling direct
core-to-core interaction), without violating the ’contract’, the
70-year-old HW/SW interface.
Section II briefly summarizes some of the major challenges
that modern computing is expected to cope with and sketches
the principles that enable it to respond properly. Section III
discusses how the uncommon principles proposed here can be
implemented. Because of the limited space, only
a few of the advantages are demonstrated in Section IV.
II. THE GENERAL PRINCIPLES OF EMPA
During the past two decades, computing has also expanded
toward some extremes: ’ubiquitous computing’ led to billions
of connected and interacting processors [10]; the ever-growing
need for more and finer details, more data and shorter
processing times led to building computers comprising millions
of processors to target challenging tasks [11]; and different
cooperative solutions [12] attempt to handle the demand of
dynamically varying computing in the present, more and more
mobile, computing. Using computing under those extreme
conditions led to shocking and counter-intuitive experiences
that can be more easily comprehended and accepted using
parallels with modern science [13].
arXiv:2006.00532v1 [cs.AR] 31 May 2020
Developing a new computing paradigm that can provide a
theoretical basis for the state of the art cannot be postponed
any longer. Based on that, one must develop different types of
processors. As was admitted following the failure of
supercomputer Aurora’18: ”Knights Hill was canceled and instead
be replaced by a ”new platform and new microarchitecture
specifically designed for exascale”” [14]. Similarly, we expect
that it will shortly be admitted that building large-scale
Artificial Intelligence (AI) systems is simply not possible based
on the old paradigm and architectural principles [15], [16]. The
new architectures, however, require a new computing paradigm
that can give a proper reply to the power consumption and
performance issues of present-day computing.
A. Overview of the modern paradigm
The new paradigm proposed here is based on fine distinctions
at some points that are also present in the old paradigm. Each
of those points, however, must be scrutinized individually to see
whether, and for how long, omissions can be tolerated. These points are:
• consider that not only one processor (aka Central Pro-
cessing Unit) exists, i.e.
– processing capability is one of the resources rather
than a central singleton
– not necessarily the same processing unit is used to
solve all parts of the problem
– a kind of redundancy (an easy method of replac-
ing a flawed processing unit) through using virtual
processing units is provided (mainly to increase the
mean time between technical errors)
– instruction stream can be transferred to another pro-
cessing unit [17], [18]
– different processors can and must cooperate in solv-
ing a task, including direct data and control ex-
change between cores, communicating with each
other, being able to set up ad-hoc assemblies for
more efficient processing in a flexible way
– the large number of processors can be used to replace
memory operations with the use of more processors
– a core can outsource the received task
• misconception of segregated computer components is
reinterpreted
– efficacy of using a vast number of processors is
increased by using multi-port memories (similar
to [19])
– a ”memory only” concept (somewhat similar to that
in [20]) is introduced (as opposed to the ”registers
only” concept), using purpose-oriented, optionally
distributed, partly local, memory banks
– principle of locality is introduced at hardware level,
through introducing hierarchic buses
• misconception of ”sequential only” execution [21] is
reinterpreted
– von Neumann required only ”proper sequencing” for
a single processing unit; this concept is extended to
several processing units
– tasks are broken into reasonably sized and logically
interconnected fragments
– the ”one-processor-one process” principle remains
valid for task fragments, but not necessarily for the
complete task
– fragments can be executed (at least partly) simul-
taneously if both data dependence and hardware
availability enables it (another kind of asynchronous
computing [22])
• a closer hardware/software cooperation is elaborated
– hardware and software only exist together: the
programmer works with virtual processors in the same
sense as [23] uses this term and lets the computing
system adapt itself to its task at run-time, through
mapping virtual processors to physical cores
– when a piece of hardware has no duty, it can sleep (”does not
exist”, does not take power)
– the overwhelming part of the OS duties, such as
synchronization and scheduling, is taken over by the
hardware
– the compiler helps work of the processor with
compile-time information and the processor can
adapt (configure) itself to its task depending on the
actual hardware availability
– strong support for multi-threading, resource sharing
and low real-time latency is provided, at HW level
– the internal latency of large-scale systems is much
reduced, while their performance is considerably
enhanced
– task fragments shall be able to return control
voluntarily without the intervention of the operating system
(OS), enabling the implementation of simpler and more
effective operating systems
– the processor becomes ”green”: only working cores
take power
B. Details of the concept
We propose to work at the programming level with virtual
processors and to map them to physical cores at run-time,
i.e., to let the computing system adapt itself to its task.
A major idea of EMPA is to use the quasi-thread (QT) as the atomic
unit of processing, which comprises both the HW (the physical core)
and the SW (the code fragment running on the core). The idea
was derived with the best features of both the HW core and the
SW thread in mind. QTs have a ”dual nature” [13]: in the
HW world of ”classic computing” they are represented as a
’core’, in the SW world as a ’thread’. However, they are the same
entity in the sense of ’modern computing’. We borrow the
terms ’core’ and ’thread’ from conventional computing, but in
’modern computing’, they can actually exist only together, in a
time-limited way¹. EMPA is a new computing paradigm that
needs a new underlying architecture, rather than a new kind
of parallel processing running on a conventional architecture,
so it can reasonably be compared to the terms and ideas used
in conventional computing only in a minimal way, although
the new approach adopts many of its ideas and solutions from
’classic computing’.
One can break the executable task into reasonably sized
and loosely dependent Quasi-Threads (QTs). (The QTs can
optionally be embedded into each other, akin to subroutines.)
In EMPA, every new QT also implies a new independent Processing
Unit (PU): its internals (PC and part of the
registers) are set up properly, and it executes its task
independently² (but under the supervision of the processor
comprising the cores).
In other words: we consider processing capacity as a
resource in the same sense as memory is considered a
storage resource. This approach enables programmers to
work with virtual processors (mapped to physical PUs by the
computer at run-time), and they can utilize the quick resource,
PUs, instead of the slow resource, memory (say,
renting a quick processor from a core pool can be competitive
with saving and restoring registers in the slow memory, for
example when making a subroutine call). The third primary
idea is that PUs can cooperate in various ways, including
data and control synchronization, as well as outsourcing part
of the received job (received as an embedded QT) to a helper
core. An obvious example is to outsource the housekeeping
activity to a helper core: counting, addressing and comparing can
be done by a helper core, while the main calculation remains
with the originally delegated core. As mapping to physical cores
occurs at run-time (as a function of actual HW availability), the
processor can avoid using (maybe temporarily) denied cores as
well as adapt the resource need of the task (requested by the
compiler) to the actual computing resource availability.
Processor has an additional control layer for organizing joint
work of its cores. Cores have just a few extra communication
signals and can execute both conventional and so-called meta-
instructions (for configuring their internal architecture). A
core executes a meta-instruction in a co-processor style: when
finding a meta-instruction, the core notifies its processor which
suspends conventional operation of the core, then controls
executing the meta-instruction (utilizing resources of the core,
providing helper cores and handling connections between the
cores as requested), then resumes core operation.
The processor needs to find the needed PUs (cores), and its
processing ability has to adapt to the received task;
moreover, this must happen inside the processor, quickly,
flexibly, effectively, and inexpensively. This is a kind of
‘On demand’ computing that works ‘As-a-Service’. This task is
not only for the processor: the complete computing system must
participate, and for that goal, the complete computing stack
must be rebuilt.
¹ Akin to dynamic variables on the stack: their lifetime is limited to the
period when the HW and SW are appropriately connected. The physical
memory is always there, but it is ”stack memory” only when handled
adequately by the HW/SW components.
² Although the idea of executing the single-thread task ”in pieces” may look
strange at first, the same happens when the OS schedules/blocks
a task. The key differences are that in EMPA not the same processor is used,
and that the Explicitly Many-Processor Approach (EMPA) cuts the task into
fragments in a reasonable way (preventing issues like priority inversion [24]).
The QTs can be processed at the same time as long as their mathematical
dependence and the actual HW resource availability enable it.
Behind the former attempts to optimize code execution
inside the processor, there was no established theory, and
they had only a marginal effect because the processor works
in real time and does not have enough resources, knowledge and time
to discover those options entirely [25]. In contrast, the compiler
can find out anything about enhancing performance but has
no information about the actual run-time HW availability.
Furthermore, it has no way to tell its findings to the
processor. The processor has HW availability information but has to
”reinvent the wheel” to enhance its performance, in real time.
In EMPA, the compiler puts its findings into the executable code
in the form of meta-instructions (”configware”), and the actual
core executes them with the assistance of the new control
layer of the processor. The processor can choose from those
options, considering the actual HW availability, in the style
’if NeededNumberOfResourcesAvailable then Method1 else
Method2’, possibly nested one into another.
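The run-time choice between compiler-prepared options can be illustrated with a minimal sketch. It is not the EMPA mechanism itself: the names `Option`, `choose`, `sum_split` and `sum_single`, and the idea of ordering the options by compiler preference, are assumptions made only for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Option:
    """One compiler-prepared way to run a fragment (hypothetical structure)."""
    name: str
    cores_needed: int
    run: Callable[[Sequence[int]], int]


def sum_split(data: Sequence[int]) -> int:
    # Stand-in for a variant that would use helper cores: two partial sums.
    half = len(data) // 2
    return sum(data[:half]) + sum(data[half:])


def sum_single(data: Sequence[int]) -> int:
    # Fallback variant that needs only the core already running the fragment.
    return sum(data)


def choose(options: List[Option], free_cores: int) -> Option:
    """Pick the first option (in assumed preference order) that fits."""
    for option in options:
        if option.cores_needed <= free_cores:
            return option
    raise RuntimeError("no option fits the available cores")


# 'if NeededNumberOfResourcesAvailable then Method1 else Method2'
OPTIONS = [Option("Method1", 3, sum_split), Option("Method2", 1, sum_single)]
```

With four free cores `choose(OPTIONS, 4)` selects Method1, with one free core it falls back to Method2; both variants compute the same result, only the resource need differs.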
C. Some advantages of EMPA
The approach results in several considerable advantages, but
the page limit enables us to mention just a few of them.
• as a new QT receives a new PU, there is no need to
save/restore registers and return address (less memory
utilization and fewer instruction cycles)
• the OS can receive its own PU, initialized in kernel mode, and
can promptly (i.e., without the need for a context change)
service the requests from the requestor core
• for resource sharing, a PU can be temporarily delegated
to protect the critical section; the next call to run the code
fragment with the same offset shall be delayed (by the
processor) until processing by the first PU terminates
• the processor can natively accommodate the variable need
for parallelization
• out-of-use cores wait in low energy consumption
mode
• hierarchic core-to-core communication greatly increases
memory throughput
• asynchronous-style computing [26] largely reduces loss
stemming from the gap [27] between speeds of processor
and memory
• principle of locality can be applied inside the proces-
sor: direct core-to-core connection (more dynamic than
in [28]) greatly enhances efficacy in large systems [29]
• the communication/computation ratio, defining decisively
efficiency [16], [30], [31], is reduced considerably
• the thread-like feature of QTs (akin to fork()) and the hierarchic
buses change the dependence of the time of creating many
threads on the number of cores from linear to logarithmic
(which enables building exascale supercomputers)
• inter-core communication can be organized in a way similar, in
some sense, to the Local Area Networks (LANs) of computer
networking. For cooperating, cores can prefer cores in
their topological proximity
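The linear-versus-logarithmic claim about thread creation can be made concrete with a small counting sketch. It assumes a simplified model that is not taken from the paper: in the linear case a single creator starts one new thread per step, while in the tree case every already running QT can start one child per step. The function names are hypothetical.

```python
def steps_linear(n_threads: int) -> int:
    """Steps needed when one creator starts the other threads one by one."""
    return max(n_threads - 1, 0)


def steps_tree(n_threads: int) -> int:
    """Steps needed when every running QT starts one child per step."""
    steps, running = 0, 1
    while running < n_threads:
        running *= 2  # each running QT forks one child in this step
        steps += 1
    return steps
```

Under these assumptions, starting 1024 threads takes 1023 steps in the linear model and 10 steps in the tree model.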
III. HOW TO IMPLEMENT EMPA
The best starting point for understanding the implementation of
the EMPA principles is the conventional many-core processor.
Present electronic technology has made kilo-core processors
available [32], [33], in a very inexpensive way and in immediate
proximity to each other, in this way making the computing
elements a ”free resource” [34]. The principles of SPA, however,
allow us to use them only in a rather ineffective way [35].
Given that true parallelism cannot be achieved (working
with the components always needs time and synchronization
via signals and/or messages; the question is only the time
resolution), EMPA targets an enhanced and synchronized
parallelized sequential processing based on using many cooperating
processors. The implementation uses variable granularity and
as many truly parallel portions as possible. However, the focus
is on the optimization of the operation, rather than on providing
some new kind of parallelization. The ideas of cooperation
comprise job outsourcing, sharing different resources and
providing specialized many-core computing primitives in addition
to the single-processor instructions.
In this way EMPA is an extension of SPA: conventional
computing is considered consisting of a single non-granulated
thread, where (mostly) SW imitates the required illusion
of granulating and synchronizing code fragments. Mainly
because of this, many of the components have a name and/or
functionality familiar from conventional computing. However,
there are subtle details that are different from those of con-
ventional computing. Furthermore, we consider the computing
process as a whole to be the subject of optimization rather than
the segregated components individually.
In SPA, there is only one active element, the Central
Processing Unit (CPU). The rest of components of the system
serves requests from the CPU in a passive way. As the
EMPA wants to extend conventional computing, rather than
to replace it, its operating principle is somewhat similar to
the conventional one, with important differences in some
key points. Fig. 1 provides an overview of the operating
principle and major components of EMPA. We follow hints
by Amdahl: ”general purpose computers with a generalized
interconnection of memories or as specialized computers with
geometrically related memory interconnections and controlled
by one or more instruction streams” [2].
A. The core
An EMPA core comprises, of course, an EMPA Processing
Element (EPE). Furthermore, it addresses two key deficiencies
of conventional computing: the inflexibility of the computing
architecture, by the EMPA Morphing Element (EME), and the
lack of autonomous communication, by the EMPA Communicating
Element (ECE). Notice the important difference from
conventional computing: the next instruction can be taken either from
the memory location pointed to by the instruction pointer or from
the Meta FIFO.
1) The Processing Element: The EPE receives an address,
fetches the instruction (if needed, also its operands). If the
fetched instruction is a meta-instruction, EPE sets its ’Meta’
signal (changes to ’Morphing’ regime) for the EME and waits
(suspends processing instructions) until the EME clears that
signal.
2) The Morphing Element: When EPE sets the ’Meta’
signal, EME comes into play. Since the instruction and its
operands are available, it attempts to process the received
meta-instruction. However, the meta-instruction refers to re-
sources handled by the processor. At processor level, order
of execution of meta-instructions depends on their priority.
Meta-instructions, however, may handle the ’Wait’ or the
core signal correspondingly. Notice that the idea is different
from configurable spatial accelerator [36], [37]: the needed
configuration is assembled ad-hoc, rather than chosen from a
list of preconfigured assemblies.
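The hand-over between the EPE and the EME through the ’Meta’ signal can be sketched as a toy fetch loop. This is only an illustration under assumptions: the instruction encoding as `(kind, payload)` pairs, the class name `Core` and the log are hypothetical, and the processor-level priority handling is left out.

```python
class Core:
    """Toy core: the EPE fetches, the EME takes over for meta-instructions."""

    def __init__(self, program):
        self.program = program  # list of (kind, payload); encoding is assumed
        self.pc = 0
        self.meta = False       # the 'Meta' signal
        self.log = []

    def epe_step(self):
        kind, payload = self.program[self.pc]
        self.pc += 1
        if kind == "meta":
            self.meta = True           # EPE switches to the 'Morphing' regime
            self.eme_process(payload)  # EPE is suspended meanwhile
        else:
            self.log.append(("EPE", payload))

    def eme_process(self, payload):
        self.log.append(("EME", payload))
        self.meta = False              # EME clears the signal, EPE resumes

    def run(self):
        while self.pc < len(self.program):
            self.epe_step()
        return self.log


core = Core([("conv", "add"), ("meta", "create_qt"), ("conv", "mul")])
trace = core.run()
```

After the run, `trace` shows the conventional instructions handled by the EPE and the meta-instruction handled by the EME, in program order.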
Unlike in SPA, communication is a native feature of EMPA
cores and it is implemented by ECE. The core assembles
the message content (including addresses), then after setting
a signal, the message is routed to its destination, without
involving a computing element and without any respect to
where the destination is. The message finds its path to its des-
tination autonomously, using EMPA’s hierarchic bus system
and ECEs of the fellow cores, taking the shortest (in terms of
transfer time) path. Sending messages is transparent for both
programmer and EPE.
3) The Storage Management Element: EMPA Storage Man-
ager Element (ESME) is implemented only in cluster head
cores, and its task is to manage storage-related messages
passing through ECE. It has the functionality (among oth-
ers) similar to that of memory management unit and cache
controller in conventional computing.
B. Executing the code
1) The quasi-threads: Code (here it means a reasonably
sized sequence of instructions) execution begins with ’hiring’
a core: the cores by default are in a ’core pool’, in low energy
consumption mode. The ’hiring core’ asks for a helper core
from the processor. If no cores are available at that moment,
the processor sets the ’Wait’ signal for the requester core and
keeps the request pending. At a later time, processor can serve
this pending request with a ’reprocessed’ core.
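The ’hiring’ protocol described above can be sketched as a small pool manager. The sketch assumes first-come-first-served handling of pending requests, which the text does not specify; the class and method names are hypothetical.

```python
from collections import deque


class Processor:
    """Toy manager of a core pool with pending (waiting) requests."""

    def __init__(self, n_cores):
        self.pool = deque(range(n_cores))  # idle cores in low-power mode
        self.pending = deque()             # requests kept pending
        self.waiting = set()               # requesters with 'Wait' set

    def hire(self, requester):
        """Return a core id, or None after setting 'Wait' for the requester."""
        if self.pool:
            return self.pool.popleft()
        self.waiting.add(requester)
        self.pending.append(requester)
        return None

    def release(self, core):
        """Return a core; serve the oldest pending request with it if any."""
        if self.pending:
            requester = self.pending.popleft()
            self.waiting.discard(requester)  # clear 'Wait'
            return requester, core
        self.pool.append(core)
        return None
```

For example, with a single core, a second `hire` call returns `None` and leaves the requester waiting until `release` hands the returned core over to it.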
Notice that the idea is quite different from the idea of
eXplicit MultiThreading (XMT) [38], [39]. Although they share some
ideas, such as the need for a fine-grained multi-threaded
programming model and architectural support for concurrently
executing multiple contexts on-chip, QTs, unlike XMT, do not
simply map the idea of multi-threading to the HW level.
They are based on a completely unconventional computing
paradigm, and QTs can be nested.
This operating principle also means that the code fragment
and the active core exist only together, and this combination
(called QT) has a lifetime. The principle of the implementation
is akin to that of the ’dynamic variable’. EMPA hires a
core for executing a well-defined code fragment, and only
for the period between creating and terminating a QT. In
two different executions, the same code fragment may run on
different physical cores.
[Figure 1: block diagram of an EMPA cluster head core, showing the
Processing Element (PC, registers, latches, ALU), the Morphing element
(morphing registers, latches), the Communicating element (message
routing), the Storage manager element, the Inter-core block and the
Cluster storage element (register extension; cache, stack, I/O buffer);
the core signals Meta, Awake/Sleep, Wait, Avail, Allocated, Preallocated
and Denied; the processor with its Meta FIFO and Proc Status; and message
links to the cluster members, the inter-cluster bus, the inter-processor
bus and global memory.]
Fig. 1. The logical overview of the EMPA-based computing.
2) The process of code execution: When a new task
fragment appears, an EMPA processor must provide a new
computing resource for that task fragment (a new register
file is available). Since the executing core is ’hired’ only for
the period of executing a specific code fragment, it must be
returned to core pool when execution of the task fragment
terminates. The ’hired’ PU is working on behalf of the
’hiring’ core, so it must have the essential information needed
for performing the task. The core-to-core register messages
provide a way to transfer register contents from the parent
core to the child core.
Beginning the execution of an instruction sets the signal
’Meta’, i.e. selects either EPE or EME for the execution,
and that element executes the requested action. The core
repeats the process until the ’hired’ core finds an ’end of
code fragment’ code. Notice the difference from conventional
computing: processing of the task does not terminate; only
the core is put back into the ’core pool’, as at that moment it is
no longer needed.
When the ’hired’ core becomes available, processing continues
with the ’hired’ core fetching an instruction. For this, the
core sends a message with the address of the location of
the instruction. The requested memory content arrives at the
core in a reply message, logically from the addressed memory,
although the ESME typically intercepts the action. The process
is similar to that in conventional computing. Here, however,
the request asks the memory to send a reply
when it finds the requested contents. Different local memories,
such as addressable cache, can also be handled. Notice also
that the system uses complete messages (rather than simple
signals with the address); this makes the way of accessing
some content independent of its location, although the access
time depends on the location.
Of course, the ’hiring’ core wants to get back some results
from the ’hired’ core. When starting a new QT, the ’hiring’ core
also defines, by sending a mask, which register contents the
’hired’ core shall send back. In this case, synchronization is
a serious issue: the parent core utilizes its registers for its own
task, so the child is not allowed to overwrite any of the parent’s
registers without an explicit request from the parent. Because of
this, when a child terminates, it writes the expected register
contents to the latch storage of the parent, and then it may go
back to the ’core pool’. When the parent core reaches the point
where it needs the register contents received from its child, it
explicitly asks to clone the required contents from the latches to
its corresponding register(s). It is the parent’s responsibility to
issue this command at a time when no accidental register
overwriting can take place.
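The latch-and-clone mechanism can be sketched as follows. This is a minimal model under assumptions: registers are plain list entries, bit i of the mask refers to register i, and the names `ParentCore`, `child_terminate`, `receive_from_child` and `clone` are hypothetical.

```python
class ParentCore:
    """Toy parent core with a register file and latch storage."""

    def __init__(self, n_regs=4):
        self.regs = [0] * n_regs
        self.latches = [None] * n_regs  # results parked by children

    def receive_from_child(self, values):
        # Child results land in the latches; the registers stay untouched.
        for index, value in values.items():
            self.latches[index] = value

    def clone(self, mask):
        # Explicit request of the parent: copy the selected latches.
        for index in range(len(self.regs)):
            if (mask >> index) & 1 and self.latches[index] is not None:
                self.regs[index] = self.latches[index]
                self.latches[index] = None


def child_terminate(child_regs, mask):
    """Select the register contents the parent asked for (bit i = reg i)."""
    return {i: v for i, v in enumerate(child_regs) if (mask >> i) & 1}


parent = ParentCore()
parent.regs = [10, 20, 30, 40]
parent.receive_from_child(child_terminate([1, 2, 3, 4], 0b0101))
regs_before_clone = list(parent.regs)  # still the parent's own values
parent.clone(0b0001)                   # parent takes over register 0 only
```

The parent's registers change only at the `clone` call; the value for register 2 stays in its latch until the parent asks for it.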
Notice that beginning execution of a new code fragment
needs more resources, while terminating it frees some re-
sources. Because of this, terminating a QT has a higher
priority than creating one. This policy, combined with the fact that
the cores are able to wait until their processor can provide
the requested amount of resources, prevents ”eating up” the
computing resources when the execution of a task (comprising a
virtually infinite number of QTs) begins.
3) Compatibility with conventional computing: Conven-
tional code shall run on an EMPA processor (as an implicitly
created QT). However, that code can only use a single core,
since it contains no meta-instructions to create more QTs. This
feature enables us to mix EMPA-aware code with conventional
code, and (among others) enables us to use the plethora of
standard libraries without rewriting that code.
4) Synchronizing the cooperation: The cores execute their
instruction sequences independently, but their operation must
be synchronized at several points. Their initial synchronization
is trivial: processing begins when the ’hired’ core has received all
its required operands (including instruction pointer, core
state, initial register contents, and the mask of the registers whose
contents the ’hiring’ core requests to be returned). The final
synchronization on the side of the ’hired’ core is simple: the core
simply sends the contents of the registers as was requested at
the beginning of the execution of the code fragment.
[Figure 2: the Inter-Core Blocks of a parent core and a child core, each
connected to the processor via latches and carrying the signals Avail,
Meta and Wait. The parent block holds a Children Mask and a Preallocated
Mask; the child block holds a ParentID. Both blocks show Mode, a register
file, the pseudo registers FromParent, FromChild, ForParent and ForChild,
a Code Offset ID, and the operations Backlinking, Cloning and Triggering.]
Fig. 2. Implementing the parent-child relationships: registers and operations
of the EICB
On the side of the ’hiring’ core, the case is much more
complex. The ’hiring’ core may wait for the termination of the
code fragment running on the ’hired’ core, or maybe it is in the
middle of its processing. In the former case, a simple waiting
until the message arrives is sufficient, but in the latter case,
receiving some new register contents at an inopportune
time would destroy its processing. Because of this, the register
contents from the ’hired’ core are stored temporarily in latch
registers, and they are copied to the corresponding registers
of the ’hiring’ core only when the ’hiring’ core requests so
explicitly. Fig. 2 attempts to illustrate the complex cooperation
between the EMPA components.
C. Organizing ’ad hoc’ structures
EME can ’morph’ the internal architecture of the EMPA
processor, as required by the actual task (fragment). EMPA
uses the principle of creating ’parent-child’ (rather than
’Master-Slave’) relation between its cores. The ’hiring’ core
becomes the parent, and the ’hired’ core becomes the child. A
child has only one parent, but parents can have any number of
children. Children can become parents in some next phase of
execution; in this way, several ’generations’ can cooperate.
This principle provides a dynamic processing capacity for
different tasks (in different phases of execution). The ’parent-
child’ relations simply mean storing addressing information,
in the case of children combined with concluding the address
from the ’hot’ bits of a mask.
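Deriving child addresses from the ’hot’ bits of a mask can be sketched in a few lines. The sketch assumes that bit position i of the children mask stands for core i; this mapping and the function names are illustrative assumptions.

```python
def children_from_mask(mask: int) -> list:
    """Return the core indices whose bits are set ('hot') in the mask."""
    ids, bit = [], 0
    while mask:
        if mask & 1:
            ids.append(bit)
        mask >>= 1
        bit += 1
    return ids


def add_child(mask: int, core_id: int) -> int:
    """Record a newly hired child by setting its bit."""
    return mask | (1 << core_id)


def remove_child(mask: int, core_id: int) -> int:
    """Forget a terminated child by clearing its bit."""
    return mask & ~(1 << core_id)
```

For instance, the mask `0b101001` would denote the children 0, 3 and 5 under this assumed mapping.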
As ’the parents are responsible for their children’, parents
cannot terminate their code execution until all their children
have returned the result of the code fragment delegated
to them. This method also enables parents to trust their
children: when they delegate some fragment of their code
to their children, they can assume that that code fragment
is (logically) executed. It is the task of the compiler to
provide the required dependence information, i.e., how those code
fragments can be synchronized.
This fundamental cooperation method enables the purest
form of delegating code to existing (and available) cores.
In this way, all available processing capacity can be used,
while only the actually used cores need energy supply (and
dissipate). Despite its simplicity, this feature enables us to
make subroutine calls without needing to save/restore contents
through memory and to implement mutexes working thousands
of times quicker than in conventional computing.
D. The processor
The processor comprises many physical EMPA cores. The
EMPA processor appears in the role of a ’manager’ rather than
a number-crunching unit: it only manages its resources.
Although individual cores initiate meta-instructions, their
synchronized operation requires the assistance of the proces-
sor. Meta-instructions received by EMPA cores are written
first (without authorization) in a priority-ordered queue (Meta
FIFO) in the processor, so the processor can always read and
execute only the highest priority meta-instruction (a core can
have at most one active meta-instruction).
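A priority-ordered queue of meta-instructions can be sketched with Python's `heapq`. The sketch combines two statements of the text: terminating a QT has higher priority than creating one, and a core can have at most one active meta-instruction. The numeric priority values, the operation names and the first-in-first-out tie-breaking are assumptions.

```python
import heapq
import itertools

# Lower number is served first; the concrete values are assumed.
PRIORITY = {"terminate_qt": 0, "create_qt": 1}


class MetaQueue:
    """Toy priority-ordered queue of meta-instructions."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # keeps arrival order among equals
        self._active = set()           # cores with a queued meta-instruction

    def push(self, core_id, op):
        if core_id in self._active:
            raise ValueError("core already has an active meta-instruction")
        self._active.add(core_id)
        heapq.heappush(self._heap, (PRIORITY[op], next(self._seq), core_id, op))

    def pop(self):
        """Return the highest-priority (core_id, op) pair."""
        _, _, core_id, op = heapq.heappop(self._heap)
        self._active.discard(core_id)
        return core_id, op


queue = MetaQueue()
queue.push(1, "create_qt")
queue.push(2, "create_qt")
queue.push(3, "terminate_qt")
served = [queue.pop(), queue.pop(), queue.pop()]
```

Although the termination request arrives last, it is served first, so resources are freed before new ones are claimed.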
E. Clustering of the cores
The idea of arranging EMPA cores to form clusters is somewhat
similar to that of cellular neural networks (CNNs) [40]. In
computing technology, one of the most severe limitations is
internal wiring, both in terms of signal propagation time and
area occupied on the chip [1]. In conventional architectures,
cores are physically arranged to form a 2-dimensional
rectangular grid. Because of SPA, there should not be any
connection between segregated cores, so the inter-core area is
used only by some kind of internal interconnection network or
other wiring.
In EMPA processors, even-numbered columns in the grid
are shifted up by half a grid position. In this way, cores
share common boundaries with cores in their neighboring
columns. In addition to these neighbors, cores have (up to two)
neighbors in their own column, giving altogether up to six
immediate neighbors with common boundaries. Although cores
physically have a rectangular shape, their joint boundaries
mean that, logically, they form a hexagonal grid, as shown in
Fig. 3. This positioning enables forming ”clusters” of cores
in the shape of a ”flower”: an orange ovary (the cluster head)
and six petals (the leaf cores of the cluster, the members).
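The six-neighbor topology can be made concrete with a short sketch. It assumes a "doubled" labelling (column, row) in which column + row is always even, so that a half-step column shift becomes a row offset of one; this labelling and the function name `neighbors` are chosen here for illustration.

```python
# Offsets to the six potential neighbours in doubled coordinates
# (column, row) where column + row is even: two in the same column,
# two in each adjacent (half-step shifted) column.
NEIGHBOR_OFFSETS = [(0, -2), (0, 2), (-1, -1), (-1, 1), (1, -1), (1, 1)]


def neighbors(col, row, n_cols, n_rows):
    """Return the physically existing immediate neighbours of core (col, row)."""
    if (col + row) % 2:
        raise ValueError("column + row must be even in doubled coordinates")
    result = []
    for dc, dr in NEIGHBOR_OFFSETS:
        c, r = col + dc, row + dr
        if 0 <= c < n_cols and 0 <= r < n_rows:
            result.append((c, r))
    return result
```

With a grid of 10 columns and 12 doubled rows, an interior core such as (4, 4) has all six neighbors, while the corner core (0, 0) has only two; positions outside the grid are simply skipped.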
Between cores arranged in this way, neighborhood size can also
be interpreted similarly to the case of cellular computing.
Fig. 3. The (logically) hexagonal arrangement (internal clustering) of EMPA
cores in the EMPA processor
Based on a neighborhood of size r = 1 (meaning that cores
have precisely one common boundary), a cluster comprising
up to six cores (cluster members) can be formed, with the
orange cell (of size r = 0, the cluster head) in the middle.
Cluster members share boundaries with their immediate
neighbors, including their cluster head. These cores define the
external boundary of the cluster (the ”flower”). Cores within
this external boundary are the ”ordinary members” of the
cluster, and the one in the central position is the head of the
cluster.
There are also ”corresponding members” (of size r = 2):
cores having at least one common boundary with one of
the ”ordinary members” of the cluster head in question.
”Corresponding members” may or may not have their own
cluster head, but have a common boundary with the ”ordinary
members”. The white cells in the figure represent ”external
members” (also of size r = 2): they have at least one common
boundary with an ”ordinary member”, like the ”corresponding
members”, but unlike the ”corresponding members” they do
not have their own cluster head. Also, there are some ”phantom
members” (see the violet petals in the figure) around the
edges of the square grid in the figure: they have a cluster
head and the corresponding cluster address, but they do not
exist physically (they are not implemented in the square grid
of cores during the manufacturing process).
That means: a cluster has one core as ”cluster head”, up
to six ”ordinary members”, and up to twelve ”corresponding
members”; i.e., an ”extended cluster” can also be formed,
comprising up to 1+6+12 members. Notice that around the
edge of the square grid ”external members” can be in the
position of the ”corresponding members”, but the upper limit
of the total number of members in an extended cluster does
not change. Interpreting members of size r > 2 has no
practical importance. The cores with r <= 2 have a direct
communication mechanism.
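The member counts 1+6+12 can be checked with a small sketch. It assumes a doubled (column, row) labelling of the cores in which column + row is even; the labelling and the names `hex_distance` and `classify` are assumptions of this example. The sketch only groups cores by neighborhood size; it does not model whether a core has its own cluster head.

```python
def hex_distance(a, b):
    """Steps between two cores in doubled (column, row) coordinates."""
    dc = abs(a[0] - b[0])
    dr = abs(a[1] - b[1])
    return dc + max(0, (dr - dc) // 2)


def classify(head, n_cols, n_rows):
    """Group existing cores by neighbourhood size r <= 2 around a cluster head."""
    rings = {0: [], 1: [], 2: []}
    for c in range(n_cols):
        for r in range(c % 2, n_rows, 2):   # keep column + row even
            d = hex_distance(head, (c, r))
            if d <= 2:
                rings[d].append((c, r))
    return rings
```

For an interior head such as (4, 4) in a grid of 10 columns and 12 doubled rows, the rings contain 1, 6, and 12 cores; for the corner head (0, 0) only 1, 2, and 4 cores exist physically.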
[Figure: an example bit pattern divided into the fields Processor address,
Cluster address, and Core address; the labels Neighbor and Proxy also appear]
Fig. 4. Implementing the hierarchical cluster-based addressing bit fields of
the cores of EMPA processors. A cluster address is globally unique.
The addressing system must provide support for all those
addressing modes. Cluster addressing is of central importance
because of the topology of cores: cores having a common
boundary do not need a bus between them. The addressing must
support the goal of keeping messages inside the cluster, if
possible. Messages from/to outside the cluster are
received/sent by the cluster head. The rest of the messages are
sent to their final destination directly or via a proxy. To
implement that goal, EMPA processors use the addressing scheme
shown in Fig. 4. Notice that in the proposed addressing system
a network logical address can be directly (and transparently)
mapped to the ID and vice versa.
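A hierarchical address of this kind can be sketched as simple bit-field packing. The field widths below (6 bits each) and all names are hypothetical; the text does not specify them, and any subdivision of the core field is not modelled.

```python
from typing import NamedTuple

# Hypothetical field widths in bits; the source does not specify them.
PROC_BITS, CLUSTER_BITS, CORE_BITS = 6, 6, 6


class CoreAddress(NamedTuple):
    processor: int
    cluster: int
    core: int


def pack(addr: CoreAddress) -> int:
    """Concatenate the three fields into one integer, processor field first."""
    for value, bits in zip(addr, (PROC_BITS, CLUSTER_BITS, CORE_BITS)):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"field value {value} does not fit in {bits} bits")
    return (((addr.processor << CLUSTER_BITS) | addr.cluster) << CORE_BITS) | addr.core


def unpack(word: int) -> CoreAddress:
    """Split a packed address back into its fields."""
    core = word & ((1 << CORE_BITS) - 1)
    cluster = (word >> CORE_BITS) & ((1 << CLUSTER_BITS) - 1)
    processor = word >> (CORE_BITS + CLUSTER_BITS)
    return CoreAddress(processor, cluster, core)


def same_cluster(a: int, b: int) -> bool:
    """True if a message between the two addresses can stay inside one cluster."""
    return (a >> CORE_BITS) == (b >> CORE_BITS)
```

Because packing and unpacking are exact inverses, the logical address and the ID map to each other directly; `same_cluster` shows how a sender could decide, by comparing the leading fields only, whether a message must pass through the cluster head.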
In EMPA, cluster addressing also carries topological
information, partly relies on relative topological positions,
and enables introducing different classes of relationship
between cells. As mentioned, cluster head cores have a
physically distinct role (in this sense, they can also be
”fat” cores), which enables us to introduce cluster addressing
for members of the extended clusters. Only cluster head cores
have immediate global memory access (considerably reducing the
need for wiring). Cores in a neighborhood of size r = 1 can
access memory through their cluster head. These cores can
also be used as a proxy for cores in a neighborhood of size
r = 2. The latter also enables replacing a denied cluster head
core.
In SPA, grid and linear addressing are purely logical,
using absolute addresses known at compile time. Similarly to
computer networks, EMPA cores have both logical and physical
addresses (closely related through the cluster architecture),
enabling autonomous (computing-unrelated) communication and
virtual addressing.
F. The compiler
The compiler plays a significant role in EMPA. It should
discover all possibilities of cooperation, especially the ones
that become newly available with the philosophy that the
appearance of a new task is accompanied by the appearance
of a new computing resource with a new register file. Because
actual HW availability cannot be known at compile time, code
for different scenarios must be prepared and put in the object
code.
The philosophy of coding must be drastically changed.
Given that, with outsourcing, a new computing facility appears
and the processor assures proper synchronization and data
transfer, there is no need to store/restore the return address
or to save/restore data in the registers, leading to less
memory traffic and shorter execution time.
The object code is essentially unchanged, except that some
fragments (the QTs) are bracketed by meta-instructions. The
QTs can be nested (i.e., meta-instructions are inserted into
conventional code). One can consider that QTs represent a
kind of atomic macros which have some input and output
register contents but do not need processing capacity from
the actual core.
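The nesting of QTs can be illustrated by turning a flat instruction stream into nested fragments. The marker names `QT_BEGIN` and `QT_END` are hypothetical stand-ins for the bracketing meta-instructions, and the function name `parse_qts` is an assumption of this sketch.

```python
def parse_qts(instructions):
    """Turn a flat instruction list into nested lists, one list per QT."""
    root = []
    stack = [root]
    for ins in instructions:
        if ins == "QT_BEGIN":          # hypothetical opening meta-instruction
            fragment = []
            stack[-1].append(fragment)
            stack.append(fragment)
        elif ins == "QT_END":          # hypothetical closing meta-instruction
            if len(stack) == 1:
                raise ValueError("QT_END without matching QT_BEGIN")
            stack.pop()
        else:
            stack[-1].append(ins)
    if len(stack) != 1:
        raise ValueError("unterminated QT")
    return root
```

The conventional instructions are passed through unchanged; only the bracketing markers determine which fragment could be handed to another core.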
IV. THE NEW FEATURES THE EMPA OFFERS
Although EMPA does not aim to address all of the challenges
of computing, it addresses many of them (and leaves
the door open for addressing further challenges). Due to lack
of space, code examples, comparisons, and evaluations based
on the loosely-timed SystemC simulation [41] are left to the
simulator documentation and the earlier published version [42].
A. Architectural aspects
Notice that ad hoc assemblies consider both the current state
of the cores and their ’Denied’ signal. That is, flawed (or
just temporarily overheated) cores are not used, significantly
increasing the mean time between machine failures. Also,
notice that this approach enables using ’hot swap’ cores, thus
providing dynamic, connected systems (the addressing is
universal, and the information is delivered by messages; it
takes time, but it is possible). It also enables delivering the
code to the data: the physical cores can be located in the
proximity of the ’big data’ storage, the instruction is
delivered to that place, and only the processed, needed result
has to be transported back.
1) Virtualization at HW level: In EMPA no absolute processor
levels are utilized: virtual processors seen by the
programmer are mapped ’on the fly’ to physical cores by the
EMPA processor. Physical cores have a ’denied’ state that can
be set permanently (e.g., due to fabrication yield) or
temporarily (e.g., due to overheating), in which case no
virtual core is mapped to that core. When combined with a
proper self-diagnostic system, this feature prevents large
systems from failing because of a single failing core. The
processor has the right and the possibility to replace a
physical core with another one at any time.
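A minimal sketch of such a mapping is given below. The class and method names (`EmpaProcessorSketch`, `rent`, `deny`, `release`) are hypothetical, and the transfer of core state during a replacement is deliberately not modelled.

```python
class EmpaProcessorSketch:
    """Maps virtual cores to physical ones, skipping cores marked 'denied'."""

    def __init__(self, n_physical):
        self.denied = set()                  # physical cores not to be used
        self.free = list(range(n_physical))  # pool of idle physical cores
        self.mapping = {}                    # virtual id -> physical id

    def deny(self, phys):
        """Mark a core as denied; remap its virtual core if it was in use."""
        self.denied.add(phys)
        if phys in self.free:
            self.free.remove(phys)
        for virt, p in list(self.mapping.items()):
            if p == phys:
                del self.mapping[virt]
                self.rent(virt)

    def rent(self, virt):
        """Map a virtual core to the first free, non-denied physical core."""
        if not self.free:
            raise RuntimeError("no physical core available")
        self.mapping[virt] = self.free.pop(0)
        return self.mapping[virt]

    def release(self, virt):
        """Return the physical core behind a virtual core to the pool."""
        phys = self.mapping.pop(virt)
        if phys not in self.denied:
            self.free.append(phys)
```

The program only ever refers to the virtual identifier; which physical core serves it can change whenever a core becomes denied.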
2) Redundancy: Huge masses (literally millions/billions)
of silicon-based elements are deployed in all systems. As a
consequence, components that show a tolerable error rate
in ”normal” systems need special care in large-scale
systems (purely due to the high number of components) [43].
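The effect of the component count can be quantified with elementary probability. As a sketch, assume N independent components, each failing with probability p during a given interval (both symbols are introduced here for illustration); the probability that at least one of them fails is then

```latex
P_{\text{any}} = 1 - (1 - p)^{N} \approx N p \quad \text{for } N p \ll 1
```

For example, with p = 10^{-9} a system of N = 10^{3} components fails with probability of about 10^{-6}, whereas for N = 10^{9} the probability rises to about 0.63, although the individual components are unchanged.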
The usual engineering practice is to rely on the high
reliability of the components. Fault-tolerant systems require
particular technologies, typically majority voting, but they
are also based on the same individual high-reliability
components.
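Majority voting itself is simple to state. The following sketch assumes that the same computation has been run on several replicas and that their results are hashable values; the function name `majority_vote` is chosen here for illustration.

```python
from collections import Counter


def majority_vote(results):
    """Return the value reported by a strict majority of replicas, else raise."""
    if not results:
        raise ValueError("no replica results")
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= len(results):
        raise RuntimeError("no majority among replicas")
    return value
```

With three replicas, a single faulty result is outvoted; if all three disagree, the error is detected but cannot be corrected.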
3) Reduced power consumption: The operating principle
of a processor is based on the assumption that processors
work continuously, executing instructions one after the
other, as their control unit defines the required sequencing.
Because of this principle, an ’idle’ task is needed in the OS.
In EMPA, cores can return control voluntarily, enabling most
of the cores to stay in an ’idle’ state.
B. Attacking the memory wall
The ’memory wall’ is known as the ’von Neumann’ bottleneck
of computing, especially since memory access time became
hundreds of times slower than processing time. Although
’register only’ processing and cache memories can seriously
mitigate its effect, in large systems the ’sparse’ calculations
that use the cache poorly show computing efficiency that is
orders of magnitude worse; i.e., further improvement in using
the memory is of utmost importance.
1) Register-to-register transfer: The idea of immediate
register-to-register transfer [28] can seriously increase the
performance of real-life tasks [29]. In EMPA, the idea is used
in combination with the flexibility of virtual cores and the
multiple register arrays provided by children.
2) Subroutine call without stack: In SPA, a subroutine
call requires saving/restoring the return address and (at least
part of) the register file; unfortunately, only the main memory
can be used for that temporary storage. In EMPA, another PU is
provided for executing the subroutine code. Because of this
solution, the HW can remember (in a nested way) the return
address. Furthermore, the working area is provided by the
register file of the ’hired’ core. Given that a
register-to-register transfer is provided, code execution can
be hundreds of times quicker. With proper organization, hiring
and hired cores can also run partly in parallel.
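The idea can be modelled in a few lines. This is a toy sketch: the names `Core`, `hired_call`, and `square_sum`, as well as the register names, are hypothetical, and the Python interpreter of course still uses its own call stack; only the handling of register files is illustrated.

```python
class Core:
    """A core with its own private register file."""

    def __init__(self):
        self.regs = {}


def hired_call(parent, body, inputs, outputs):
    """Run `body` on a freshly hired core instead of saving parent registers."""
    child = Core()                        # the 'hired' core: new register file
    for name in inputs:                   # register-to-register transfer in
        child.regs[name] = parent.regs[name]
    body(child.regs)                      # callee may clobber its own registers
    for name in outputs:                  # register-to-register transfer out
        parent.regs[name] = child.regs[name]


def square_sum(regs):
    """Example callee: uses r1 and r2 as scratch and leaves the result in r0."""
    regs["r1"] = regs["r1"] * regs["r1"]
    regs["r2"] = regs["r2"] * regs["r2"]
    regs["r0"] = regs["r1"] + regs["r2"]
```

Although the callee overwrites r1 and r2, the registers of the hiring core are untouched apart from the declared output, so nothing has to be saved to or restored from memory.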
3) Interrupt and system calls without context switching:
Given that interrupts and OS service calls can be considered
special service calls that also need context switching, a
prepared core (waiting in kernel mode) can service the request
thousands of times quicker. Even interrupts can be serviced
without interrupting the running process.
4) Resource sharing without scheduling: For multitasking,
only the OS can provide exclusive access to some resource (as
in SPA, no other processor/task exists). EMPA offers a simple,
elegant, and quick solution: it can delegate a QT to the task
of guarding a critical section, and all tasks issue a conditional
subroutine call to the code guarded by that QT. All but the first
requester QT must wait (but are scheduled automatically by
the processor), and after all requests have been serviced, the
delegated core is put back into the pool. Since the compiler
creates reasonably sized code fragments, cases leading to
priority inversion [24] cannot happen, so no specialized
protocols are needed in the OS: the orchestrated work in EMPA
prevents those issues.
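The structure of this solution can be sketched with one dedicated guard that serves all requests one at a time. In this illustration threads stand in for cores, the name `run_guarded` is hypothetical, and the requesters do not wait for completion (a blocking variant would need a reply channel); note also that Python's `queue.Queue` uses locks internally, so only the structure, not the performance, is modelled.

```python
import queue
import threading


def run_guarded(n_tasks, increments_per_task):
    """All tasks send requests to one guard; only the guard touches the counter."""
    requests = queue.Queue()
    state = {"counter": 0}

    def guard():
        # The delegated 'core': serves requests strictly one at a time.
        while True:
            item = requests.get()
            if item is None:          # sentinel: all requests served
                break
            state["counter"] += item

    def task():
        for _ in range(increments_per_task):
            requests.put(1)           # stands in for the conditional subroutine call

    guard_thread = threading.Thread(target=guard)
    guard_thread.start()
    tasks = [threading.Thread(target=task) for _ in range(n_tasks)]
    for t in tasks:
        t.start()
    for t in tasks:
        t.join()
    requests.put(None)                # no more requests: the guard can finish
    guard_thread.join()
    return state["counter"]
```

Because only the guard modifies the shared counter, the requesting tasks contain no explicit locking code.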
C. Attacking the communication wall
In SPA, communication is not natively present (no other
processor exists); it must be performed and synchronized using
Input/Output (I/O) instructions and OS operations during
payload processing time, which results in executing a large
number of non-payload instructions.
1) Decreasing the internal latency: When using interconnected
cores, ECE can take over most of the non-payload duties. This
enables decreasing the sequential-only portions of the task,
which decisively define the communication/computation
ratio [30]; a significant point when developing large-scale
computing systems [31] or using AI-type workloads [16].
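Why the sequential-only portion is decisive can be seen from Amdahl's law. As a sketch, let α denote the parallelizable fraction of the task and N the number of processors (the symbols are chosen here for illustration and may differ from those used elsewhere); the achievable speedup is then

```latex
S(N) = \frac{1}{(1 - \alpha) + \dfrac{\alpha}{N}}, \qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - \alpha}
```

For instance, reducing the sequential-only portion 1 - α from 1% to 0.1% raises the upper bound of the speedup from 100 to 1000, independently of how many cores are added.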
2) Hierarchic (local) communication: Using temporally or
spatially local memory accesses can increase the efficiency
dozens of times. Similarly, providing an ’interconnection
cache’ for the EMPA processor can result in considerable
improvement in the final efficiency of the system. As computing
tasks change their state between ’computing bound’ and
’communication bound’ dynamically, this solution mitigates both
limiting factors as much as possible.
3) Fully asynchronous operation: As von Neumann only
required a ’proper sequencing’ of instructions, and fewer
’idle’ times during core operation appear as a performance
increase, asynchronous operation (i.e., turning all components
active) can considerably contribute to more effective (i.e.,
less lossy) operation.
V. SUMMARY
In computing, incremental development methods face
more and more difficulties because of the drastic changes both
in technology and utilization. The ultimate reason, as has been
suspected by many researchers, is the computing paradigm
reflecting a 70-year-old state of the art. Computing needs
renewal [42] and rebooting. As a first step, the validity of
restrictions was scrutinized. It was shown that it is not a
necessary condition that the same computer solves all the
tasks: von Neumann only required a ”proper sequencing”
in executing machine instructions. This requirement can be
satisfied in a much better way by using the presently available
many ”free” processors. That way requires entirely different
thinking (and component base) and offers real advantages. We
can implement the introduced new paradigm by combining
presently available technology solutions along different
principles; that approach offers considerable advantages.
REFERENCES
[1] I. Markov, “Limits on fundamental limits to computation,” Nature, vol.
512(7513), pp. 147–154, 2014.
[2] G. M. Amdahl, “Validity of the Single Processor Approach to Achieving
Large-Scale Computing Capabilities,” in AFIPS Conference Proceed-
ings, vol. 30, 1967, pp. 483–485.
[3] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubi-
atowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel,
and K. Yelick, “A View of the Parallel Computing Landscape,” Comm.
ACM, vol. 52, no. 10, pp. 56–67, 2009.
[4] J. A. Chandy and J. Singaraju, “Hardware parallelism vs. software
parallelism,” in Proceedings of the First USENIX Conference on Hot
Topics in Parallelism, ser. HotPar’09. Berkeley, CA, USA: USENIX
Association, 2009, pp. 2–2.
[5] S. H. Fuller and L. I. Millett, “Computing Performance: Game Over or
Next Level?” Computer, vol. 44, pp. 31–38, 2011.
[6] US National Research Council. (2011) The Future
of Computing Performance: Game Over or Next
Level? [Online]. Available:
http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/mar11/Yelick.pdf
[7] S(o)OS project, “Resource-independent execution support on exa-scale
systems,” http://www.soos-project.eu/index.php/related-initiatives, 2010.
[8] Machine Intelligence Research Institute, “Erik DeBene-
dictis on supercomputing,” 2014. [Online]. Available:
https://intelligence.org/2014/04/03/erik-debenedictis/
[9] J. S. et al, “TrueNorth Ecosystem for Brain-Inspired Computing: Scal-
able Systems, Software, and Applications,” in SC ’16: Proceedings of the
International Conference for High Performance Computing, Networking,
Storage and Analysis, 2016, pp. 130–141.
[10] J. Gubbi, R. Buyya, S. Marusic, and M. Palaniswami, “Internet of Things
(IoT): A vision, architectural elements, and future directions,” Future
Generation Computer Systems, vol. 29, pp. 1645–1660, 2013.
[11] R. F. Service, “Design for U.S. exascale computer takes shape,” Science,
vol. 359, pp. 617–618, 2018.
[12] J. Du, L. Zhao, J. Feng, and X. Chu, “Computation Offloading and
Resource Allocation in Mixed Fog/Cloud Computing Systems With
Min-Max Fairness Guarantee,” IEEE Transactions on Communications,
vol. 66, pp. 1594–1608, 2018.
[13] J. Végh and A. Tisan, “The need for modern computing
paradigm: Science applied to computing,” in 2019 International
Conference on Computational Science and Computational Intelligence
(CSCI). IEEE, 2019, pp. 1523–1532. [Online]. Available:
http://arxiv.org/abs/1908.02651
[14] www.top500.org, “Intel dumps knights hill, future of xeon phi product
line uncertain,” https://www.top500.org/news/intel-dumps-knights-hill-
future-of-xeon-phi-product-line-uncertain///, 2017.
[15] J. Keuper and F.-J. Pfreundt, “Distributed Training of Deep Neural
Networks: Theoretical and Practical Limits of Parallel Scalability,”
in 2nd Workshop on Machine Learning in HPC Environments
(MLHPC). IEEE, 2016, pp. 1469–1476. [Online]. Available:
https://www.researchgate.net/publication/308457837
[16] J. Végh, How deep the machine learning can be, ser. A Closer Look at
Convolutional Neural Networks. Nova, In press, 2020, pp. 141–169.
[Online]. Available: https://arxiv.org/abs/2005.00872
[17] ARM. (2011) big.LITTLE technology. [Online]. Available:
https://developer.arm.com/technologies/big-little
[18] J. Cong et al., “Accelerating Sequential Applications on CMPs
Using Core Spilling,” Parallel and Distributed Systems, vol. 18, pp.
1094–1107, 2007.
[19] Cypress, “CY7C026A: 16K x 16 Dual-Port Static RAM,”
http://www.cypress.com/documentation/datasheets/cy7c026a-16k-x-
16-dual-port-static-ram, 2015.
[20] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel,
“Scratchpad memory: Design alternative for cache on-chip memory
in embedded systems,” in Proceedings of the Tenth International
Symposium on Hardware/Software Codesign, ser. CODES ’02. New
York, NY, USA: ACM, 2002, pp. 73–78. [Online]. Available:
http://doi.acm.org/10.1145/774789.774805
[21] J. Backus, “Can Programming Languages Be liberated from the von
Neumann Style? A Functional Style and its Algebra of Programs,”
Communications of the ACM, vol. 21, pp. 613–641, 1978.
[22] G. P, Horn.J, J. He, A. Papageorgiou, and C. Poole, “IBM
CICS Asynchronous API: Concurrent Processing Made Simple,”
http://www.redbooks.ibm.com/redbooks/pdfs/sg248411.pdf , 2017.
[23] R. H. Arpaci-Dusseau and A. C. Arpaci-Dusseau, Operating Systems:
Three Easy Pieces, 0th ed. Arpaci-Dusseau Books, May 2015.
[24] O. Babaoglu, K. Marzullo, and F. B. Schneider, “A
formalization of priority inversion,” [Online]. Available:
https://doi.org/10.1007/BF01088832
[25] D. W. Wall, “Limits of instruction-level parallelism,” New
York, NY, USA, pp. 176–188, Apr. 1991. [Online]. Available:
http://doi.acm.org/10.1145/106974.106991
[26] S. K. et al, “Acceleration of an asynchronous message
driven programming paradigm on ibm blue gene/q,” in
2013 IEEE 27th International Symposium on Parallel and
Distributed Processing. Boston: IEEE, 2013. [Online]. Available:
https://ieeexplore.ieee.org/abstract/document/6569854
[27] N. Satish, C. Kim, J. Chhugani, H. Saito, R. Krishnaiyer,
M. Smelyanskiy, M. Girkar, and P. Dubey, “Can Traditional
Programming Bridge the Ninja Performance Gap for Parallel
Computing Applications?” Commun. ACM, vol. 58, no. 5, pp. 77–86,
Apr. 2015. [Online]. Available: http://doi.acm.org/10.1145/2742910
[28] F. Zheng, H.-L. Li, H. Lv, F. Guo, X.-H. Xu, and X.-H. Xie, “Co-
operative computing techniques for a deeply fused and heterogeneous
many-core processor architecture,” Journal of Computer Science and
Technology, vol. 30, no. 1, pp. 145–162, Jan 2015.
[29] Y. Ao, C. Yang, F. Liu, W. Yin, L. Jiang, and Q. Sun, “Performance
Optimization of the HPCG Benchmark on the Sunway TaihuLight
Supercomputer,” ACM Trans. Archit. Code Optim., vol. 15, no. 1, pp.
11:1–11:20, Mar. 2018.
[30] J. P. Singh, J. L. Hennessy, and A. Gupta, “Scaling parallel programs for
multiprocessors: Methodology and examples,” Computer, vol. 26, no. 7,
pp. 42–50, Jul. 1993.
[31] J. Végh, “Finally, how many efficiencies the supercomputers have?”
The Journal of Supercomputing, feb 2020. [Online]. Available:
https://doi.org/10.1007%2Fs11227-020-03210-4
[32] B. Bohnenstiehl, A. Stillmaker, J. J. Pimentel, T. Andreas, B. Liu, A. T.
Tran, E. Adeagbo, and B. M. Baas, “KiloCore: A 32-nm 1000-Processor
Computational Array,” IEEE Journal of Solid-State Circuits, vol. 52,
no. 4, pp. 891–902, April 2017.
[33] PEZY. (2017) 2048 core chip. https://www.top500.org/green500/lists/2017/11/.
[34] S. B. Furber, D. R. Lester, L. A. Plana, J. D. Garside, E. Painkras,
S. Temple, and A. D. Brown, “Overview of the SpiNNaker System
Architecture,” IEEE Transactions on Computers, vol. 62, no. 12, pp.
2454–2467, 2013.
[35] M. D. Hill and M. R. Marty, “Amdahl’s Law in the Multicore Era,”
IEEE Computer, vol. 41, no. 7, pp. 33–38, 2008.
[36] F. Jr., K. E., K. D. Glossop, S. C. Steely Jr., J. Tang, and
A. G. Gara, “Processors, methods, and systems with a configurable
spatial accelerator,” no. 20180189231, July 2018. [Online]. Available:
http://www.freepatentsonline.com/y2018/0189231.html
[37] Intel, “Processors, methods and systems with a configurable spa-
tial accelerator,” http://www.freepatentsonline.com/y2018/0189231.html,
2018.
[38] Uzi Vishkin, “Explicit Multi-Threading (XMT): A PRAM-On-Chip Vi-
sion – A Desktop Supercomputer,” Last accessed Dec. 12, 2015 [Online].
http://www.umiacs.umd.edu/users/vishkin/XMT/index.shtml, 2007.
[39] U. Y. Vishkin, “Spawn-join instruction set ar-
chitecture for providing explicit multithreading ,”
https://patents.google.com/patent/US6463527B1/en, 1998.
[40] V. Cimagalli and M. Balsi, “Cellular neural networks: A review,” in Proc.
6th Italian Workshop on Parallel Architectures and Neural Networks,
Vietri sul Mare, Italy. World Scientific, 1993, pp. 12–14, ISBN:
9789814534604.
[41] J. Végh, “EMPAthY86: A cycle accurate simulator for Explicitly
Many-Processor Approach (EMPA) computer.” jul 2016. [Online].
Available: https://github.com/jvegh/EMPAthY86
[42] J. Végh, Renewing computing paradigms for more efficient
parallelization of single-threads, ser. Advances in Parallel Computing.
IOS Press, 2018, vol. 29, ch. 13, pp. 305–330. [Online]. Available:
https://arxiv.org/abs/1803.04784
[43] Wired, “Cosmic Ray Showers Crash Supercomputers. Here’s What to
Do About It,” https://www.wired.com/story/cosmic-ray-showers-crash-
supercomputers-heres-what-to-do-about-it/ , 2018.
