Abstract-Processor accelerators are effective because they do not work (completely) on the principles of stored-program computers. They use some kind of parallelism, and it is rather hard to program them effectively: a parallel architecture must be programmed by means of (and while thinking in) sequential programming. The recently introduced EMPA architecture uses a new kind of parallelism, which offers the potential of reaching a higher degree of parallelism, and also provides extra possibilities and challenges. It not only provides synchronization and inherent parallelization, but also takes over some duties typically offered by the OS, and even opens the machine instructions, closed until now, for the end-user. A toolchain for the EMPA architecture with Y86 cores has been prepared, including an assembler and a cycle-accurate simulator. The assembler is equipped with some meta-instructions, which make all advanced possibilities of the EMPA architecture usable while keeping the programming style (nearly) conventional. The cycle-accurate simulator is able to execute the EMPA-aware object code, and is a good tool for developing algorithms for EMPA.
Manuscript received July 10, 2016; revised August 26, 2016.
The EMPA architecture [1] seems to be especially hard to program: the processor architecture can be configured by the end-user, and so the architecture may change continuously during operation; the cores communicate with each other; synchronized internal data transfer between cores takes place; different program parts run in parallel and must handle data and control dependencies; there are many program counters, belonging to independently running cores; even the conventionally closed unit "machine instruction" can be opened for other processing units, and so on. At first glance, it does not seem possible to program it using facilities mostly similar to those of conventional programming. The paper introduces a new kind of parallelism, as well as a programming language and methodology, which allow the enhanced performance of the EMPA processors to be utilized, while the programming methods remain as close as possible to the conventional ones.
THE DYNAMIC PARALLELISM
Parallelism assumes the presence of several processing units, and the reachable speedup of course strongly depends on the availability of the corresponding HW units. Let us suppose we want to calculate the expressions (see [8])
A = (C * D) + (E * F)
B = (C * D) - (E * F)
where we have altogether 4 load operations, 2 multiplications, and 2 additions (one addition and one subtraction).
Theoretical (software) parallelism
The theoretical (or SW) parallelism only considers the different kinds of dependences between the values and operations (control and data dependences), and assumes the presence of the needed number of HW units; i.e. it provides a kind of upper bound for the reachable parallelism. For calculating the theoretical parallelism, one can assume a processor which has (at least) 4 memory access units, 2 multipliers and 2 adders (or equivalently: an at least 4-issue universal processor). With such a processor (see Fig. 1) we can load all four operands in the first machine cycle, do the multiplications in the second cycle, and do the addition and subtraction in the last cycle.
Multiple-issue processor parallelism
Real processors, however, are not built with an arbitrarily large number of processing units. In practice, processors may be built with a so-called multiple-issue, single-pipeline architecture, i.e. they can execute more than one operation in the same cycle if processing units able to perform the requested operations are available. A so-called two-issue processor can perform (for example) an arithmetic and a memory access operation at once. Before the first multiplication (see Fig. 1, right side), the processor loads the two operands in the first two cycles, and in the third cycle it performs the first multiplication. During the multiplication the memory access unit is free, so it can simultaneously load the third operand. In the fourth cycle the fourth operand is loaded, and so the second operand for the second multiplication is finally provided (the first operand has been waiting since the third cycle). In the fifth cycle the second multiplication can be performed, so result A is produced in the sixth cycle and, similarly, result B in the seventh cycle. Notice that both the memory access and the arithmetic units are utilized in only 4 cycles (out of the 7); in 3 machine cycles they are unused. Cycle 3 is the only one in which both units are in use.
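The utilization figures quoted above can be checked with a few lines of Python. This is only a sketch: the assignment of operations to cycles is read off the prose description, not off Fig. 1, and the names used are illustrative.

```python
# Busy cycles of the two units of the two-issue processor (cycles 1..7),
# as described in the text; the sets below are an interpretation of the prose.
memory_busy = {1, 2, 3, 4}        # the four operand loads
arithmetic_busy = {3, 5, 6, 7}    # mul, mul, add (result A), sub (result B)
total_cycles = 7

def idle_cycles(busy, total):
    """Number of cycles in which a unit does no work."""
    return total - len(busy)

memory_idle = idle_cycles(memory_busy, total_cycles)
arithmetic_idle = idle_cycles(arithmetic_busy, total_cycles)
both_busy = sorted(memory_busy & arithmetic_busy)

print(memory_idle, arithmetic_idle, both_busy)
```

Each unit is idle in 3 of the 7 cycles, and the only cycle in which both units work is cycle 3.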
Dual core parallelism
One might think that using two independent, single-issue processors communicating through a shared memory can be equally good for solving the sample task. Initially, both processors can load their arguments (see Fig. 2) and perform their multiplication. However, after those operations the processors must share their result with their partner, i.e. they store their result in the shared memory, and load the result stored by their partner from the shared memory. For doing so, store (S i ) and load (L i ) operations must be inserted in the chain of operations of both processors. These operations (essentially superfluous, but needed for the communication) increase the number of operations to 12, and so both processors must execute 6 cycles. Compare this to the 7 cycles of the 2-issue single-processor system above. Obviously, investing in the second processor and the shared memory HW does not result in the expected increase of performance (in addition, the memory access operations are very expensive in terms of execution time; and we can only hope that the operand the processor reads has already been written by its partner).
Dynamic parallelism
Both utilizing a limited number of special multiple processing units and communicating through shared memory degrade the parallelism with respect to the theoretically reachable one. Increasing the number of specialized processing units is possible, but (as Fig. 1 shows) in most general-purpose cases those units cannot be fully utilized. Communicating through a shared memory inserts new (superfluous) machine cycles with memory access, and so (as Fig. 2 shows) the number of cycles needed for executing the task decreases far less than proportionally. Both solutions strongly limit the available speedup, and strongly increase the needed resources and the dissipated power. As an extreme case: General-Purpose Graphics Processing Units (GPGPUs) outperform a multiple-issue single processor by a factor of only 2.5, although about 100 times better performance is expected [9]. In the first case the cause of the inefficiency is the inflexible architecture, in the second case the lack of any facility for inter-core communication.
Let us suppose we have a kind of "on demand" type, flexible architecture, i.e. the processor can provide the needed number of processing units for the operation, at the expense of using some time for "renting" the needed unit(s). The rented units are single-issue processors, but they are able to do both memory access and arithmetic operations. In Fig. 3, right side, it is assumed that the "cost of renting" is one fifth of a machine cycle. At the very beginning the originating processor (in state O 1 ) notices that two multiplications shall be performed, so it rents the Processing Units (PUs) H 1 and H 2 , one by one (each in 0.2 machine cycle), for this goal. Those helper units notice that they need to load two operands, so similarly they rent two more processing units each, only to do the loading.
After loading, the helper processors receive their operands, so they can perform the multiplication in their second machine cycle and then deliver the result back to the originating processor, which again rents two more processors (in the third machine cycle of the originating processor) for the last two operations, the addition and the subtraction. After the 3rd machine cycle, both result operands are delivered back to the originating processing unit. The execution time is longer than the theoretical 3 machine cycles. In the figure the total execution time is 3.8 machine cycles, and in the peak period, 6 simple-functionality single-issue processors are used.
Some quantitative parameters of the mentioned parallelization models for the case of calculating our sample expressions are listed in Table 1. Since we have 8 operations, the single-thread execution time is 8 cycles. The average degree of parallelization is calculated as the ratio of the number of operations to the number of cycles, and the efficiency is given as the ratio of the speedup to the number of PUs. As can be expected, this dynamic parallelization model works in a way very similar to the theoretical one, and its degree of parallelization approaches the theoretical one (see Table 1).
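The three figures of merit defined above can be evaluated with a short Python sketch. The operation and cycle counts are the ones quoted in the text; the PU counts are assumptions made here for illustration (for example, 4 PUs for the theoretical 4-issue case and the peak of 6 PUs for the dynamic case), so the printed values need not coincide with those of Table 1.

```python
SINGLE_THREAD_CYCLES = 8  # 8 operations, one per cycle

def metrics(operations, cycles, pus):
    """Return (degree of parallelization, speedup, efficiency)."""
    degree = operations / cycles               # operations per cycle
    speedup = SINGLE_THREAD_CYCLES / cycles    # relative to single-thread execution
    efficiency = speedup / pus                 # speedup per processing unit
    return degree, speedup, efficiency

# (operations, cycles, PUs); the PU counts are assumptions, see the lead-in.
models = {
    "theoretical": (8, 3, 4),
    "two-issue": (8, 7, 2),
    "dual-core": (12, 6, 2),   # 4 extra store/load operations for communication
    "dynamic": (8, 3.8, 6),
}

for name, args in models.items():
    degree, speedup, efficiency = metrics(*args)
    print(f"{name:12s} degree={degree:.2f} speedup={speedup:.2f} efficiency={efficiency:.2f}")
```

Note that for the dual-core case the degree of parallelization (12/6) counts the communication operations as well, while the speedup (8/6) does not.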
Requirements and consequences of dynamic parallelism
The graph in Fig. 3, right side, is essentially an extension of the graph on the left side. The PU is "rented" from some resource pool and is returned after the operation has finished, so in the next machine cycle it can be rented for a different goal. This introduces some (trivial) dependence: before performing an operation, a processing unit must be rented; and after the operation has finished, it must be released before the next cycle starts. The renting process is transparent for the originating processing unit, so the original dependence is preserved. The states are in a parent-child relationship: a parent can create any number of children, but a child can have only one parent. The parent remains responsible for performing the task it received, but it can delegate part of the task to its children. If the parent has delegated part of its job to its children, it must wait until they terminate. Since the operation is performed on a PU different from the original one, the complete internal state of the originating unit must be cloned into the created child unit and, after the operation has finished, part of the state (the result) must be returned to the parent in a synchronized way.
In conventional architectures the machine cycles are uniform. In the dynamic parallelization model the "all children ready" signal triggers the next cycle, which can be somewhat longer, but can also be shorter if the last internal instruction stage is not utilized in the child. With an effective pre-allocation mechanism, the time needed to allocate a helper core can approach zero.
Mapping the operations to processing units
The dynamic parallelism remains "theoretical" in the sense that nothing limits the number of the needed processing units, while in a physical system the number of PUs is limited. The processing graph in Fig. 5 exactly corresponds to the theoretical graph of dynamic parallelism in Fig. 3; the 8th core cannot be utilized by the example code. On a processor having a finite number of PUs the processing graph can be compressed horizontally, at the price of increasing the number of cycles, see Fig. 6. When the dependences are kept, some operations will simply be postponed to a later machine cycle, prolonging the processing time and decreasing the reached parallelism. The PUs are "rented" strictly for the time of performing the processing step, so after a while "reprocessed" PUs become available.
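The effect of a finite PU pool on the sample task can be illustrated with a simple greedy scheduler in Python. This is a deliberately simplified model and not the EMPA mechanism itself: it ignores the renting cost and the cores occupied by waiting parents, and it only postpones operations that find no free PU. The operation names are illustrative.

```python
# Dependence graph of A = (C*D)+(E*F), B = (C*D)-(E*F): operation -> prerequisites.
DEPS = {
    "load_C": [], "load_D": [], "load_E": [], "load_F": [],
    "mul_CD": ["load_C", "load_D"],
    "mul_EF": ["load_E", "load_F"],
    "add_A": ["mul_CD", "mul_EF"],
    "sub_B": ["mul_CD", "mul_EF"],
}

def schedule(deps, n_pus):
    """Greedy cycle-by-cycle schedule: each cycle runs up to n_pus ready operations."""
    if n_pus < 1:
        raise ValueError("at least one PU is needed")
    done = set()
    cycles = []
    while len(done) < len(deps):
        # An operation is ready when all its prerequisites finished in earlier cycles.
        ready = [op for op in deps
                 if op not in done and all(d in done for d in deps[op])]
        batch = ready[:n_pus]      # operations beyond the PU limit are postponed
        cycles.append(batch)
        done.update(batch)
    return cycles

for n in (1, 2, 4, 8):
    print(n, "PU(s):", len(schedule(DEPS, n)), "cycles")
```

In this model 4 or more PUs reproduce the theoretical 3 cycles, 2 PUs need 4 cycles, and a single PU needs all 8 cycles.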
Obviously, the traditional fixed architectures are not able to adapt themselves to the task executed, so a special architecture [1] must be used for that, which has some extra signals, storages and functionality, see Fig. 4. Such an architecture can be implemented using methods known from reconfigurable technology, like using block RAMs, configurable wiring between fixed-functionality blocks, etc. Also, special programming instructions are needed to program such a task. The code producing the processing diagrams (see section 6.2) in Figs. 5-6 is shown in Listing 1. Technically, one 'higher level' core is needed, which embeds the code calculating the expressions in the sample. The individual machine instructions are put in Quasi-Thread (QT) frames [1] only to provide complete visual analogy with Fig. 3. The same core could start by reserving a helper core to load one operand; while waiting, the core itself could load the other operand, and perform the operation itself. The method used demonstrates, however, that this kind of parallelism can be extended towards both elementary operations (like individual machine instructions) and complex expression evaluations (like the 8 elementary operations in the example code). The operations can in all cases be executed independently, and when discovering parallelism, no HW limitations need to be considered.
Here an event-controlled, rather than clock-controlled, operation takes place, much like in pipelining and hyperthreading. This operating principle is not foreign to the von Neumann paradigm: there, a new operation can only start when the old operation frees the PU.
THE TOOL CHAIN FOR UTILIZING DYNAMIC PARALLELISM
The goal of the programming tools
The EMPA architecture not only provides flexible HW for implementing dynamic parallelism (actually: it provides an end-user programmable architecture), but also provides other forms of acceleration, like replacing certain machine instruction sequences with inter-core operating signals and replacing (apparently non-parallelizable) operations with inter-core cooperation. Those facilities are unusual in conventional programming, so a special programming approach had to be developed. That approach must use conventional terms and a conventional programming interface, to make it possible to use higher-level languages for programming the EMPA architecture, and at the same time must provide a way to fully utilize the unconventional features of EMPA.
The Y86 processor
EMPA actually denotes a set of architectural principles, rather than a certain concrete processor or core. The work described here is based on using the Y86 [2] processor as core. It is not a real processor in the sense that it has very few instructions (after all, its purpose is educational). From our particular point of view it has advantages, like
• its Instruction Set Architecture (ISA) allows additional instructions and registers to be implemented with ease
As the first steps towards a tool chain, an EMPA-aware ISA simulator and an assembler have been prepared [11]. These simple tools allow executable programs to be prepared for the EMPA and the performance features of the architecture to be characterized [1], as well as further features to be developed and scrutinized.
For educational purposes, a simplified Intel X86 processor has been developed and made publicly available [2], including an ISA-level simulator and an assembler. Using a more advanced (non-educational) processor simulator would considerably extend the usability of the simulator. However, those simulators are optimized for an entirely different, single-processor architecture, and usually do not provide an easy path to adding the needed extensions. It is only fair to compare EMPA to an unoptimized conventional processor. If the comparison is advantageous for EMPA, similar, but different HW accelerators will be developed for this architecture, too, and those optimized architectures can again be fairly compared to the optimized conventional architectures.
Extensions to the Y86 ISA
The original Y86 ISA [2] utilizes one-byte instruction codes. The group of metainstructions used to configure the newly introduced supervisor (SV) control layer [1] of the processor has been implemented in an unused instruction slot. Following the Y86 conventions, a member of the EMPA metainstruction group is coded with the group code in the high nibble and the member code in the low nibble. The mnemonic of a metainstruction always starts with a 'Q' (for QT). The metainstructions can have zero, one, or two arguments, and their total length is between one and six bytes.
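The nibble-based coding described above can be sketched in Python as follows. The concrete group code and member codes of the EMPA metainstructions are not given in the text, so the value 0xE used below is a purely hypothetical placeholder, and the function names are illustrative.

```python
def encode_opcode(group, member):
    """Pack a group code (high nibble) and a member code (low nibble) into one byte."""
    if not (0 <= group <= 0xF and 0 <= member <= 0xF):
        raise ValueError("group and member codes must each fit into 4 bits")
    return (group << 4) | member

def decode_opcode(byte):
    """Split an instruction code byte into (group code, member code)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("an instruction code is a single byte")
    return byte >> 4, byte & 0xF

HYPOTHETICAL_Q_GROUP = 0xE   # placeholder only; the real group code is not stated

code = encode_opcode(HYPOTHETICAL_Q_GROUP, 3)
print(hex(code), decode_opcode(code))
```

With this layout a single unused slot of the one-byte code space can hold up to 16 metainstructions.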
The EMPA simulator(s)
The simulator was written with electronic components in mind, i.e. it operates in a cycle-accurate way. As core functionality the engine uses the Y86 ISA simulator, slightly extended. Both a command-line-based and a Qt5 [12] based graphical interface have been added to the simulator. The GUI simulator is equipped with step-wise execution and logging; it produces processing diagrams like the one in Fig. 7, and provides different kinds of statistics, which allow the sophisticated operation of the EMPA/Y86 processor to be scrutinized and operational characteristics to be derived [1].
ASSEMBLY EXTENSIONS FOR EMPA
The support for the unconventional EMPA features is implemented through a surprisingly low number of new assembly (meta)instructions and other extensions, causing just a little difference relative to the traditional single-processor case.
Creating and terminating quasi-threads
The EMPA-aware code is organized into special units of QTs [1] , of intermediate size and structure, somewhere between the HW unit 'machine instruction' and SW unit 'thread'. Handling QTs is supported by the new assembly instructions QCreate and QTerm, which must be used in a bracket-like way.
The QCreate instruction has two arguments. The first one is the address of the matching QTerm. This informs the parent core where to continue after delegating the code in the QT. The second argument is the link register (either a physical one or a pseudo-register, see section 4.5).
The QTerm instruction has no arguments, but implies a QWait -1 and clones back the link register defined by the matching QCreate.
Example (as shown, extensive labeling is utilized for referencing):
    CLabel: QCreate TLabel, %eax
            ... executable instructions ...
    TLabel: QTerm
Synchronizing QTs
The assembler provides two kinds of instructions to support explicit waiting. Instruction QWait only considers the core's own children, while QPWait considers the sisters (the other children of the parent). Upon finding a QxWait, the requesting core waits until the QTs concerned have terminated. The QxWait metainstructions also play an important role in synchronizing data transfer between parent and child cores. As described in [1], the asynchronous operation of cores needs a well-designed transport policy, especially when using the link register. Table 2 shows the triggered data exchange policy between the parent and child cores.
Example:
    QWait CLabel   # Wait for a specific QT
    QPWait -1      # Wait for all sister QTs
Subroutine call with EMPA
The instruction QCall makes it possible to place the called QT outside the body of the code. The instruction has an address argument, the address where the called QT begins. The referred QT commences in the child core, and the control in the parent returns immediately to the address next to QCall. Since the argument of the metainstruction is the address of a metainstruction QCreate, the functionality of the latter is automatically implied in the functionality of QCall. Practically, the difference is that the called QT code is located outside the body of the parent control flow, resulting in a modular, clear program structure.
Notice that the return address, unlike in the Single Processor Approach (SPA) case, need not be remembered: the called subroutine runs on a new core and uses its own Program Counter (PC), while the caller continues processing with the instruction next to QCall. The HW need not save the return address; in this way fewer memory cycles and machine instructions are needed, and the HW addresses are not interleaved with the SW-handled stack items. This also reduces the need for calling frames and simplifies the addressing of automatic or passed variables.
Example: QCall CLabel
Supporting cooperation between cores
Several classes of processor accelerators serve executing masses of instructions in parallel. EMPA provides mass processing modes for this goal. Mass processing functionality is implemented through the metainstructions QAlloc, QTCreate and QFCreate.
The two arguments of QAlloc are a mode value and a register containing the argument for the requested operation. Since QAlloc is actually a request to the SV stating that the requestor core wants to rent additional core(s), the program must prepare for two answers, and handle them in two different branches. The two branches are implemented by the metainstructions QTCreate and QFCreate. Exactly one of them will be executed after the last QAlloc; the other one will behave as a NOP instruction (i.e. they follow an if..then..else logic).
Both cases must be programmed: there is one QT prepared for the case when mass processing in the requested mode is possible, and another one for the case when not. These two metainstructions have the arguments and functionality identical with those of QCreate, except that they are only executed if the pre-allocation was successful and was NOT successful, respectively.
The metainstruction QAlloc must also ensure that the needed number of cores will be available at some later time for the parent core. For this goal, the needed cores are preallocated for the core: they will appear as unavailable to the concurrently working cores, but QTCreate can use them to start new QTs.
If the core preallocation was successful (i.e. there are enough cores), the QTCreate branch will be executed as many times as needed. This is controlled by the SV, and is carried out in consecutive clock cycles, one by one. The PC of the parent will advance to the address next to the matching QTerm only when QTCreate has been processed the requested number of times, in certain modes each time with a newly allocated child core. In this way the QTCreate actually corresponds to as many QCreate metainstructions as the number of repetitions requested by QAlloc. In these actions only the preallocated (rather than unallocated) cores are used.
The functionality of QTCreate is not simply making a replica of the parent for the preallocated cores. The pseudo-register %esv behaves as configured by the 'mode' argument of QAlloc. The QFCreate is just a wrapper for the instructions and metainstructions to be executed when there are not enough cores for the given type of mass processing, i.e. the processing must follow some other way. These conditional allocations can be nested. Note that the link register must be the same for both branches.
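The preallocation bookkeeping and the if..then..else behaviour of QTCreate and QFCreate can be modelled in Python as follows. This is a behavioural sketch under simplifying assumptions, not the SV implementation: the class and function names are hypothetical, cores are represented only by a counter, and the branch bodies are ordinary Python callables.

```python
class SupervisorModel:
    """Toy model of the SV core pool; names and structure are illustrative only."""

    def __init__(self, free_cores):
        self.free = free_cores       # cores visible as available to every core
        self.preallocated = {}       # parent id -> cores reserved for that parent

    def qalloc(self, parent, needed):
        """Reserve 'needed' cores for 'parent'; return True on success."""
        if needed <= self.free:
            self.free -= needed      # reserved cores look unavailable to other cores
            self.preallocated[parent] = needed
            return True
        return False

    def release(self, parent):
        """Return the cores reserved for 'parent' to the pool."""
        self.free += self.preallocated.pop(parent, 0)

def mass_process(sv, parent, needed, repetitions, true_branch, false_branch):
    """Run the QTCreate-like branch 'repetitions' times, or the QFCreate-like branch once."""
    if sv.qalloc(parent, needed):
        try:
            return [true_branch(i) for i in range(repetitions)]
        finally:
            sv.release(parent)
    return [false_branch()]

sv = SupervisorModel(free_cores=4)
print(mass_process(sv, "parent", 4, 4, lambda i: f"child {i}", lambda: "fallback"))
print(mass_process(SupervisorModel(2), "parent", 4, 4, lambda i: f"child {i}", lambda: "fallback"))
```

The first call has enough cores and runs the success branch four times; the second one falls back to the alternative branch, which is executed exactly once.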
Example 
Pseudo-registers
For implementing an effective data transfer between the cooperating cores, some pseudo-registers have been implemented. The pseudo-registers are seen by the ISA as registers, but they do not represent a simple storage. Rather, they might behave in an extraordinary way: they can transfer data synchronously between parent and child, in both directions, and they can change the data they provide for their partner between consecutive accesses.
Register %eno is used where the syntax requires the presence of a register argument, but the related activity is not desirable. Register %ecc is for returning condition codes only, while %esv is used for parent-child related activity. While the first two pseudo-registers can only be used as arguments of QCreate (i.e. as link registers), the functionality of %esv largely depends on the context it is used in, see Table 3. Fig. 4 (repeated from [1] for convenience) shows how the parent and child cores communicate with each other using latch registers. The %esv register is mapped in a context-dependent way to the latch registers, see Table 3. As shown in the table, the mass processing parent role is divided into PRE-processing (between QAlloc and QTCreate) and POST-processing (between QTerm of the child and QTerm of the parent) phases. When %esv is used as link register, the SV reads the content of 'ForParent' in the child and writes it to 'FromChild' in the parent. This means that if a child core wants to transfer to its parent the data it received from its own child, the child core must use an explicit rrmovl %esv, %esv instruction. Register %esv is designed for helping cooperation, and cannot be used as a general-purpose register.
ALGORITHMIC ASPECTS
It was recognized early [3] that even our programming languages are heavily influenced by the single-processor approach, and so are our algorithms. The new possibilities disclosed by the EMPA architecture also need new thinking in designing the algorithms. The synergy between the possibilities of EMPA and the new EMPA-aware algorithms (i.e. suggesting methods to implement in EMPA which can simplify or boost old algorithms, or developing more efficient new ones) can result in a further performance increase of our HW/SW systems. EMPA provides a couple of general frames and methods for using such possibilities, as shown by the examples below, and is ready to implement further such frames. Below, a simple programming example is presented in four different versions, to illustrate how the different accelerating principles [1] of EMPA can be used in practice.
The conventional coding (or NO mode mass processing)
The first mode is the NO mass processing mode. It exactly matches traditional programming: NO real mass processing takes place, no metainstructions are used and the required loop control functionality is provided through calculations. It requires only the original PU, uses the same instructions and has the same execution time as the traditional programs.
In this code the operands are loaded immediately before the calculation. The summing is as simple as possible: first the sum is cleared (Listing 2, line 6), and the number of items is verified (Listing 2, line 7). These are one-time actions, not parallelized.
Beginning at the label "Loop", the usual activity takes place: in addition to the payload operation addl %esi, %eax (Listing 2, line 10), non-payload operations are used to address and fetch the item (Listing 2, line 9), to advance the address to the next item (Listing 2, line 12), and to advance and verify the loop counter (Listing 2, line 14); a conditional jump instruction (Listing 2, line 15) closes the loop. Upon exiting the loop, register %eax contains the sum (Listing 2, line 16).
Notice that register %eax contains the partial sum during the calculation. This is the main source of inefficiency: in the payload operation addl %esi, %eax the previous content of the destination register is read, then the operation is performed and the result is written back as new content to the destination register. Notice also that the non-payload instructions play no role at all in the calculation, and that we are interested in the final result only, and not at all in the partial results.
The basic loop: FOR mode mass processing
As seen above, in such a simple loop the non-payload activities require many more executable instructions than the payload activity, and so they take most of the processing time. The overall performance can be enhanced by omitting those service instructions as machine instructions and using HW-implemented facilities instead. EMPA provides a simple loop organization facility, which helps to eliminate those non-payload instructions. The first three machine instructions (Listing 3, lines 4-7) are identical with those shown in Listing 2. The metainstruction QAlloc 1, %edx (Listing 3, line 8) sets operating mode 1 (this is FOR), preallocates one core and tells the SV that it wants to use the pre-allocated core %edx (actually: 4) times for looping. This core will be available for the requesting core only, and only until the looping is over.
The SV clears in the parent the base index in 'ForChild', which is to be transferred to the rented child. This value will be incremented by 4 between the iterations, so the actual child can always reach the actual offset value. Since %ecx contains the base address of the vector, and the child inherits the register file of the parent, the child could calculate the actual address from the base and the offset. However, this takes time, so the pseudo-register %esv provides a useful facility to shorten the code.
In the pre-processing phase of the loop (between QAlloc and QTCreate, see Table 3) writing %esv (Listing 3, line 9) means writing into 'ForChild'. So, the instruction rrmovl %ecx, %esv writes the base address into 'ForChild' in the parent. Since the content of 'ForChild' is increased by 4 between iterations, the child receives a ready-made address, and there is no need for address calculation.
The metainstruction QTCreate QT1LoopT, %eax (Listing 3, line 10) will create the child QT. The SV keeps the core pre-allocated until the loop terminates. In this mode the PC of the parent core remains pointing to QTCreate (Listing 3, line 10) while the loop is running. In the following clock periods the parent must wait, since the QT running on the child core has not yet terminated. When the child QT finally terminates and the SV notifies the parent, it immediately executes the next iteration, until the iteration count is reached.
The payload activity is done by the child core, i.e. by the instructions between QTCreate (Listing 3, line 10) and QTerm (Listing 3, line 13). Here the core fetches the actual argument (Listing 3 7) from the address given by the content of its own pseudo-register %esv. In this mode reading %esv corresponds to reading 'FromParent' (see Table 3), which receives its content from 'ForChild' in the parent when QTCreate is executed.
The child core inherits the internals (including the content of the register file) of its parent when the QT is created, and returns the content of its link register to the corresponding register in the parent when QTerm is executed, see Table 2. I.e. on entry (Listing 3 8) %eax contains the previous partial sum, and on exit %eax contains the new partial sum, which will be cloned back to the parent and serves as the old partial sum in the next iteration.
Although not used in the present example, the child can write its own pseudo-register %esv, which means writing into 'ForParent', to provide a possibility to break out of the loop. The child can write 0 into that register. Upon executing QTerm, the content of that register will be written into register 'FromChild' in the parent. Before executing QTCreate, the SV checks 'FromChild' in the parent, and terminates the loop if it is cleared; otherwise it executes QTCreate again: it increases the address in 'ForChild' and decreases the count in 'FromChild'. Of course, at the beginning QAlloc writes the requested number of repetitions into 'FromChild'.
Notice that the complete loop organization is accomplished by the SV, on behalf of the parent core. The child's kernel can perform any, much more complex activity. The only limiting factor is that only the content of the link register is cloned back to the parent. Also notice that actually no parallelization occurs here. The parent is waiting while the child is processing, and only one child is used at any time. Another variant of the FOR functionality is to reserve a core for each individual vector element. The child cores, as they would be created in adjacent cycles in that mode, would receive the correct address from the parent. However, after termination they would overwrite %eax, or would have to wait for the termination of the previous QT, without performance gain. Because of this, that mode cannot be used for summing up the elements of a vector. However, EMPA has a more elegant and useful method for solving that problem.
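The loop organization performed by the SV in FOR mode can be mimicked by the following Python sketch. It is a behavioural model under assumptions, not the simulator code: the latch registers are plain variables, the memory is a dictionary, the exact order in which the SV updates 'ForChild' and 'FromChild' is an interpretation of the text, and the vector values are illustrative.

```python
WORD = 4  # address increment between iterations

def for_mode_loop(memory, base_address, repetitions, kernel):
    """Model of FOR mode: one child QT at a time, loop control done by the 'SV'."""
    for_child = base_address    # written by the parent in the pre-processing phase
    from_child = repetitions    # QAlloc stores the requested number of repetitions
    link = 0                    # link register (%eax), cleared before the loop
    while from_child != 0:      # SV check before each QTCreate
        # The child inherits the link register and reads the address via %esv.
        link, for_parent = kernel(memory, for_child, link)
        if for_parent == 0:     # child wrote 0 into %esv: break out of the loop
            from_child = 0
        else:
            for_child += WORD   # next element
            from_child -= 1     # one repetition fewer
    return link

def add_kernel(memory, address, partial_sum):
    """Payload of the child QT: fetch one element and add it to the partial sum."""
    return partial_sum + memory[address], None   # None: the child does not write %esv

vector = {0x100: 0x00D, 0x104: 0x0C0, 0x108: 0xB00, 0x10C: 0xA000}  # illustrative data
print(hex(for_mode_loop(vector, 0x100, 4, add_kernel)))
```

A kernel that returns 0 as its second value models a child writing 0 into %esv, which ends the loop before the requested number of repetitions is reached.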
The specialized loop: SUMUP mode mass processing
When summing up the elements of a vector, the partial sum must be written back into a register by a machine instruction and read out again in the next cycle. This is because the atomic unit in SPA is the machine instruction. Since a persistent connection can exist between the parent and its children for the duration of the loop, EMPA can eliminate this weakness using its SUMUP mode.
The first three executed instructions are the same as in the case of Listing 2. The metainstruction QAlloc (Listing 4, line 8) now sets mode 5, and %edx now contains the number of requested helper cores (i.e. this time we want to use several cores in parallel, rather than one core several times). To save time, the next instruction (Listing 4, line 9) overwrites the offset address passed to the child with the base address of the array, exactly as in the case of FOR mode. It is also the same that the PC in the parent stays pointing to QTCreate (Listing 2, line 10), creating children one after the other. The difference is that several cores are preallocated at the beginning, so the parent does not have to wait: in consecutive cycles it creates children, each time in a new core, and each child core works in parallel with the other child cores and with the parent core.
The payload instructions (i.e. the instructions between QTCreate (Listing 4, line 10) and QTerm (Listing 4, line 13)) are very similar to those of FOR mode. The important difference is that now the partial sum is 'stored' in register %esv. In this mass processing mode, writing %esv means sending the data to a prepared adder in the parent [1], where the addition is immediately executed, rather than reading the previous partial sum from a temporary storage and writing it back. (Technically, it is written in 'ForParent' in the child, but the instruction triggers copying the summand to 'FromChild' in the parent, which is connected to one of the inputs of the adder, while the other input is connected to the output of the adder.) Both the old and the new partial sums are stored only in the adder circuit, rather than in some special register. The child QTs are created with a delay of one clock cycle relative to each other, so they send their fetched data to the parent with the same delay, and the adder can receive the data and execute the addition without waiting or queuing.
The parent and its children run in parallel, and after starting the last child, the parent continues with the instruction at the address next to QTerm (Listing 4, line 14). At that time some of the children might still be working, so a QWait metainstruction must be used here; otherwise the adder might not yet contain the final sum. When all children have terminated, the parent is in the post-processing phase, so reading %esv results in reading 'FromChild', which latches the output of the adder. The instruction rrmovl %esv, %eax (Listing 4, line 15) makes the hitherto invisible sum visible. Note that in this mode the link register has no role, so %eno is used in QT creation (Listing 4, line 10).
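The timing of SUMUP mode and the role of QWait can be sketched with a toy cycle-level model in Python. It is not the simulator's algorithm: the names `sumup`, `child_latency` and `wait` are hypothetical, and a fixed latency between creating a child and the arrival of its summand at the adder is assumed.

```python
def sumup(vector, child_latency=3, wait=True):
    """Toy model of SUMUP mode; returns (value read from the adder, read cycle)."""
    n = len(vector)
    # Child i is created in cycle i and its summand reaches the parent's
    # adder child_latency cycles later, so arrivals never collide.
    arrivals = {i + child_latency: element for i, element in enumerate(vector)}
    parent_ready = n                       # parent continues after the last QTCreate
    last_arrival = n - 1 + child_latency   # cycle in which the last summand arrives
    # With QWait the parent reads only after every summand has been added.
    read_cycle = max(parent_ready, last_arrival + 1) if wait else parent_ready
    adder = 0
    for cycle in range(read_cycle):
        adder += arrivals.get(cycle, 0)    # at most one addition per cycle
    return adder, read_cycle


print(sumup([1, 2, 3, 4]))              # with QWait: complete sum
print(sumup([1, 2, 3, 4], wait=False))  # without QWait: partial sum only
```

Because the children are staggered by one cycle, the adder receives at most one summand per cycle in this model. Reading without waiting returns whatever has been accumulated so far.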
The adaptive processing
When using QAlloc (Listing 5, line 8), successful execution is not guaranteed at all. The compiler cannot know in advance how many cores will be available when the metainstruction is executed, from having as many cores as vector elements down to having one core only, so it must prepare for all possible cases. The metainstructions QTCreate and QFCreate provide an if...then...else construction to solve this problem. It means that the compiler must generate code for all those cases, and the SV chooses one according to the actual core availability. As will be shown in the example, these constructions can be nested. This adaptive program is displayed in Listing 5. In effect, the kernels of the three previous programs are put together into a special structure. The best-performing operating mode for summing up the elements of a vector is SUMUP mode, provided that enough cores are available at the moment when the summation must be executed. If the first QTCreate (Listing 5, line 10) is not successful, then a QFCreate (Listing 5, line 16) is executed. Within this latter block another QTCreate (Listing 5, lines 21-25) QT pair is located. If fewer than 4 cores are available, the program attempts to allocate at least one extra core (i.e. it attempts to use the next operating mode, which performs worse but is also less resource-hungry). If this also fails, the program continues with conventional processing.
In Table 4, the execution time of the program in Listing 2 serves as the base of comparison. The rest of the lines in the table show cases in which the processor has different numbers of available cores while executing the program given in Listing 5. The slight increase in execution time relative to row 1 is due to executing the metainstructions: this is the price one has to pay for running a program designed for many-core systems on a single-core system.
As shown in Table 4, in a system having 5 cores this summing is nearly 4 times quicker (as shown in Fig. 7, only a small fragment of the code is parallelized). The column α_eff [13] also shows that EMPA is designed for many-core systems: the utilization efficiency increases with the increasing number of cores used. A more detailed analysis of the performance of EMPA is given in [1].
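The nested if...then...else selection described above can be summarized in a few lines of Python. This is only a sketch of the decision logic: in EMPA the choice is made by the SV at run time, and the names `choose_mode`, `free_cores` and `wanted` are hypothetical.

```python
def choose_mode(free_cores, wanted=4):
    """Pick the operating mode from the number of currently free helper cores."""
    if free_cores >= wanted:
        return "SUMUP"          # outer QTCreate succeeds: several helpers in parallel
    if free_cores >= 1:
        return "FOR"            # nested QTCreate succeeds: one helper, used repeatedly
    return "CONVENTIONAL"       # both allocations fail: the parent sums by itself


for free in (4, 2, 0):
    print(free, choose_mode(free))
```

The compiler has to emit code for all three branches; only the branch matching the actual core availability is executed.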
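An efficiency figure of this kind can be derived from a measured speedup by inverting Amdahl's law. The Python sketch below assumes that this inversion is what is meant by the effective parallelization in [13]; the function names are hypothetical and the numbers are illustrative, not taken from Table 4.

```python
def amdahl_speedup(alpha, k):
    """Speedup predicted by Amdahl's law for parallel fraction alpha on k cores."""
    return 1.0 / ((1.0 - alpha) + alpha / k)


def alpha_eff(speedup, k):
    """Parallel fraction that would explain a measured speedup on k cores.

    Obtained by solving S = 1 / ((1 - alpha) + alpha / k) for alpha.
    """
    if k < 2:
        raise ValueError("at least two cores are needed")
    return (k / (k - 1)) * (speedup - 1.0) / speedup


# Illustrative values only: a speedup of 4 on 5 cores.
print(alpha_eff(4.0, 5))
print(amdahl_speedup(alpha_eff(4.0, 5), 5))
```

The second line of output recovers the speedup that was put in, which checks that the two functions are inverses of each other.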
THE SIMULATOR
The toolset has been published [11] with online documentation. Because it is under continuous development, it is still of alpha quality, but it is usable. The unconventional features need to be used with care.
The command line simulator
The command line version of the simulator runs to completion and writes an extensive log file showing all the details of the operation. Although this is very useful during debugging, and is indispensable when a power user attempts to "fine tune" a program, a Qt5 [12] based graphical simulator has also been prepared for educational and demonstration purposes. On the screen the complete internal life of the EMPA architecture is displayed: how the cores are rented, how the cores communicate with each other, etc. The execution (in step-wise or run mode) can be followed.
The processing diagram
The processing diagram is a by-product of the simulators and it attempts to visualize the rather sophisticated internal operation of the EMPA processor. The diagram should show at which time, by which core, which instruction was executed, and how the cores interacted with each other; the cores can execute conventional executable instructions or metainstructions. So, a lot of information must be crowded into the figure.
[Figure: processing diagram; panel title "Instruction times"]
The diagram shows the cores on the horizontal axis and the time on the vertical one. For better orientation, grid lines are drawn at every 5th clock cycle. The clock cycle here is the length of an SV operation; instruction execution is assumed to be of variable length. Arbitrary but reasonable instruction lengths are assumed.
The rectangular blocks represent the QTs, with hooks at the top and bottom for their creation and termination. In the columns C_x the vertical rectangles represent the "lifetime" of a QT. At the times outside the QT rectangles, the core is in power economy mode, not running a QT.
The parent-child relationship is illustrated by the labels of the QTs: the first few characters are identical to those of the parent, and the last character denotes the sequence number of the child. In the figure (as well as in the simulator log files), core sequence numbers and textual QT ID strings are shown for the human reader, rather than the "one-hot" bitmasks used by the simulator.
The memory address of a metainstruction is shown in a rectangle on the right side of the QT; the address of an executable command is shown on top of a bigger ball, and some smaller balls represent the duration of the instruction. While a core is waiting, a circle with the respective memory address is displayed at the corresponding time on the left side of the QT blocks. From the memory address the source code can be found using the listings. Access to the pseudo register %esv of the parent by its children is marked by an angle bracket, which also shows the direction of the transferred data. The places where summands are sent to the parent for summation are marked by an extra plus character.
Dynamic parallelism
The processing graph (see Fig. 5) is derived from the theoretical dependencies, so the memory accesses within the cycles have no dependency on each other; i.e. the memory can be accessed without limitation and there is no need to assure coherence, i.e. no need to use slow, power-hungry and expensive shared memory. A memory of a type like [10], with several independent ports, can solve the task. The dynamic parallelism remains "theoretical" in the sense that nothing limits the number of the needed processing units, while in a physical system the number of PUs is limited. The processing graph in Fig. 5 exactly corresponds to the theoretical graph of dynamic parallelism in Fig. 3; the 8th core cannot be utilized by the example code.
[Figure: processing diagram; panel title "Instruction times"]
On a processor having a finite number of PUs the processing graph can be compressed horizontally, at the price of increasing the number of cycles, see Fig. 6. That diagram shows how the same code is executed in a system with only 4 PUs. The first load operation can be started immediately, the second one only when some PU is released from the first operation. The SV in EMPA notices that core C_2 has finished its processing and is free again, so the state H_2 is mapped to C_2 rather than to C_4 as in the case of 8 cores. In any case, the operation takes place when both operands are available.
Some operations will simply be postponed to a later machine cycle, prolonging the processing time and decreasing the parallelism reached. The programmer can give the theoretical dependencies independently of the HW, and the processing graph will adjust itself to the HW at runtime. Fig. 7 visualizes the operation of the EMPA processor when running the program shown in Listing 4 on a 5-core system. Notice how the inter-core operations (receiving the address of the operand and sending the summand to the parent) are shifted in time, and how the final sum remains latent until an explicit parent instruction takes it out into a "visible" register.
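How a fixed dependency graph stretches in time when fewer PUs are available can be sketched with a greedy scheduler in Python. This is a simplified model, not the SV's algorithm: every operation is assumed to take one cycle and to release its PU immediately, and the graph `DEPS` (four loads, two additions, one multiplication) is a hypothetical example rather than the one drawn in the figures.

```python
def schedule(deps, num_pus):
    """Greedily map a dependency graph onto num_pus units, one cycle per node."""
    if num_pus < 1:
        raise ValueError("at least one PU is required")
    remaining = dict(deps)
    done = set()
    cycles = []
    while remaining:
        # A node is ready when all of its operands have been produced.
        ready = [n for n, needs in remaining.items() if set(needs) <= done]
        if not ready:
            raise ValueError("cyclic dependency")
        batch = ready[:num_pus]       # operations beyond the PU count are postponed
        cycles.append(batch)
        done.update(batch)
        for node in batch:
            del remaining[node]
    return cycles


DEPS = {
    "load_a": [], "load_b": [], "load_c": [], "load_d": [],
    "add_ab": ["load_a", "load_b"],
    "add_cd": ["load_c", "load_d"],
    "mul": ["add_ab", "add_cd"],
}

for pus in (8, 2, 1):
    print(pus, len(schedule(DEPS, pus)), schedule(DEPS, pus))
```

The dependency description stays the same in all three runs; only the number of cycles grows as the number of PUs shrinks.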
Vector sumup
CONCLUSIONS
The presented programming methodology demonstrates how the dynamic parallelism and other unusual features of the EMPA architecture can be supported in a programming style that is powerful enough to use the performance-increasing facilities of the architecture and at the same time remains close enough to the conventional programming style. This gives good reason to hope that the EMPA architecture can be effectively supported from high-level languages.
In the EMPA implementation, a special-purpose "ad-hoc" computing unit is assembled on the fly, just for the time of the (possibly complex) operation. This computing unit works with the maximum reachable parallelism, using the minimum needed computing resources, with maximum efficiency. The operation may be simple, like loading an operand or computing the two expressions used in discussing the types of parallelism, or complex, like summing up the elements of a vector. The extra signals and local storages introduced make it possible to omit some instructions traditionally used to organize a loop and to replace them with internal signals. The close vicinity of the PUs (as in the case of modern many-core processors) allows a cooperative regime to be used for calculations, and makes it possible to parallelize even the classic non-parallelizable sumup operation. This latter mode makes it possible to gain an order of magnitude in speedup. The programming facilities allow the programmers (person or compiler) to use the unusual facilities offered by EMPA with traditional terms and tools. Although the dependencies must be correctly considered, the hardware conditions need not be known at the time of coding: the architecture follows the prescribed logic of the program, but adapts its resource needs to the momentary HW availability.
The real-time characteristics of processors also benefit from EMPA. To service an interrupt, no state saving and restoring is needed, saving memory cycles and code. The program execution will be predictable: the processor need not be stolen from the running main process. The atomic nature of executing QTs will prevent issues like priority inversion, eliminating the need for special protection protocols.
From the point of view of accelerators, an EMPA processor provides a natural interface for linking special accelerators to the processor. Any circuit able to handle the data and signals shown in Fig. 4 can be linked to an EMPA processor with ease.
