More than a hundred years ago, 'classic physics' was at its full power, with just a few unexplained phenomena; these, however, led to a revolution and to the development of 'modern physics'. Today computing is in a similar position: it is a sound success story with exponentially growing utilization, but with a growing number of difficulties and unexpected issues as it moves towards extreme utilization conditions. In physics, studying nature under extreme conditions (approaching the speed of light, studying atomic and sub-atomic interactions, considering objects on the scale of the Universe) has led to the understanding of relativistic and quantal behavior. Quite similarly, in computing some phenomena observed under extreme (computing) conditions cannot be understood on the basis of the 'classic computing paradigm'. Using analogies with classic versus modern physics, the need for a 'modern computing paradigm' is introduced and underpinned. The analogies are not meant to derive a direct correspondence between certain physical quantities and some computing phenomena. Rather, the paper calls attention to two points: that under extreme conditions qualitatively different behavior may be encountered in both worlds, and that pinpointing certain formerly unnoticed or neglected aspects makes it possible to explain the new phenomena.
Introduction
Initially, computers were constructed with the goal of reducing the computation time of rather complex but sequential mathematical tasks [1, 2, 3, 4] (e.g. the computation of missile trajectories). Their architecture was therefore designed to enable that kind of activity, and it was recognized even at that time that the architecture is not really efficient in solving certain kinds of problems. Today a computer is deployed for radically different goals, where the mathematical computation is in most cases just a very small part of the task. The majority of the non-computational activities are reduced to, or imitated by, some kind of computation, because this is the only activity the computer can perform. In addition, computers work in a very energy-wasting regime: they must execute instructions continuously, spending a considerable fraction of the time in an 'idle' loop (or 'idle' task in the OS) while waiting for a peripheral, the slow memory or a network access; and they must imitate the simultaneous execution of hundreds of threads on just a few processing units, at the high cost of context switching.
Because of this, computers with their present architecture (derived from the 70-year-old 'classic' paradigm) are not really suitable for the goals they are expected to serve: to handle a varying degree of parallelism dynamically; to build systems comprising very many processors with good efficacy; to cooperate with the large number of other processors (both on-chip and off-chip) around them; to watch many external actions and react to them in real time; to build servers in environments where the workload changes in an extreme way; to operate 'big data' processing systems, etc. Under the conditions in which computers are utilized today, not only is the efficacy of computing very low because of the different performance losses [5], but more and more limitations also come to light [6, 7].
There are scientific reasons why computation will shortly reach its performance limit [6, 8], and in addition the systems comprising parallelized sequential processors introduce their own limits [9]. The very rigid separation of HW and SW in a real-time multitasking system leads to the phenomenon known as 'priority inversion' [10]; very large rigid architectures suffer from frequent component errors; complex cyber-physical systems must be equipped with excessive computing facilities to provide real-time operation; delivering the large amount of data from the big storage centers to the places where the processing capacity is concentrated is a serious challenge; and supplying energy to the extremely large number of extremely large, energy-wasting computers is more and more problematic. So the big dilemma in computing today is how to solve very different problems on an architecture that was designed mainly for the sequential execution of mathematical operations. A common reason can be found [11] behind those issues: the classical computing paradigm, which reflects the state of the art of 70 years ago. Computing needs renewal [12].
It was recognized early that the age of conventional architectures was over [13, 14]; the only question that remained open was whether the "game is over" [15], too. Today's technology is able to deliver literally thousands of simple cores on a single die (not only many, but "too many" [16] cores), but their computing performance does not increase linearly [17] with the number of cores, only a fraction of them can be utilized simultaneously [18], and no reasonable use can be found for such a high number of cores in, for example, system operation [19]. Approaching the limits of the "classic paradigm of computing" has rearranged the technology ranking [20]. Really, "New ways of exploiting the silicon real estate need to be explored" [7], and a modern computing paradigm is needed.
One of the major issues, however, is that "Development of software is expensive and requires a highly educated workforce. Being able to run software developed for today's computers on tomorrow's has reigned in the cost of software development. But this guarantee between computer designers and software designers has become a barrier to a new computing era. The "good news" is that there are several ways to restart the meteoric rise of computer performance. But the "bad news" is that the more revolutionary of these approaches will require not only significant hardware investment, but also significant software rewriting. Such change is risky, and unfortunately, both US software and hardware industries today are risk averse." [20] That is, it was safer to leave the barrier on the road; however, there is no longer any way around it. A computing paradigm that is an extension rather than a replacement of the old paradigm has a serious chance of being accepted, since it provides a smooth transition from the age of old computing to the age of modern computing. It requires only moderate hardware investment; no software rewriting is needed (at the price of continuing to lose performance as before, although a simple recompilation can deliver part of the advantages); and the devices manufactured with the modern paradigm in mind can be utilized together with those manufactured in the old age.
Analogies with the classic versus modern physics
The case of computing is very much analogous to the case of classic physics versus modern (relativistic and quantum) physics. In the world we live in, it is rather counter-intuitive to accept that, as we move towards unusual conditions, the addition of velocities behaves differently, the energy becomes discontinuous, and the momentum and the position of a particle cannot be measured accurately at the same time. Basically, these are curiosities or nuances for those who apply their knowledge of physics in everyday life, and in the cases we experience around us the calculations show no appreciable difference with and without the non-classical principles. However, as we get farther from everyday conditions, the difference becomes considerable, and even leads to phenomena one can never experience under the usual, everyday conditions. The analogies are not meant to derive a direct correspondence between certain physical and computing phenomena. Rather, the paper calls attention to two points: that under extreme conditions qualitatively different behavior may be encountered, and that pinpointing certain formerly unnoticed or neglected aspects makes it possible to explain the new phenomena. In this paper, only two of the affected important areas of computing can be touched on in more detail: parallel processing and multitasking.
Analogy with the special relativity
In the above sense, there is an important difference between the operation and performance of a single processor and those of huge parallelized computer systems. As long as just a few (maybe up to a few thousand) single processors are aggregated into a large computer system, the resulting performance corresponds (approximately) to the sum of the single-processor performance values, similarly to the classic rule of summing speeds. However, when assembling larger computing systems (in the range of millions of processors, approaching with their performance "the speed of light" of computing systems), the nominal performance (calculated as the simple sum of single-processor performances) starts to deviate from the experienced payload performance: the phenomenon known as efficiency appears. In fact, there is more than one efficiency [21]: the measurable efficiency depends on the method of measurement (the benchmark program utilized). The phenomenon is recognized, but no commonly accepted explanation exists.
The performance measurements are simple time measurements: a standardized set of machine instructions is executed (a large number of times) and the known number of operations is divided by the measurement time. This happens in the same way when measuring the performance of both single-processor and parallelized sequential computing systems. In the latter case, however, the joint work must also be organized, i.e. [...] time) appears. One of the processors must tell the others (at least) what fraction of the work they should perform, and the results of the calculations must also be collected. This is the origin of the efficiency: one of the processors orchestrates the joint operation while the others are waiting, so all of them waste a fraction of the measured operating time on non-payload activity. Beyond a certain number of processors, adding more processors no longer increases the payload fraction: the first fellow processor has already finished its task and is waiting, while the last one is still waiting for the start command. Due to this, the computing performance cannot be increased above the performance defined by that number of processors, in analogy with the fact that an object having the speed of light cannot be further accelerated¹. This limiting number can be increased by organizing the processors into clusters: the first computer then has to speak directly only to the head of each cluster. The physical size of the computing system also matters: a processor connected to the first one by a cable dozens of meters long receives the start signal much later, so it must spend several hundred clock cycles waiting.

Figure 1 (caption): The later a supercomputer appears in the competition, the smaller its performance ratio with respect to its predecessor; the higher its rank, the harder it is to improve its performance.
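The dispatch argument above can be illustrated with a small back-of-the-envelope sketch. It is not the model of the paper: the function names, the fixed per-message cost `t_msg_ns`, the two-level cluster layout and the assumed signal speed of 2e8 m/s in a cable are all illustrative assumptions.

```python
def flat_dispatch_ns(n_workers: int, t_msg_ns: int) -> int:
    """Time until the LAST worker receives its start command when a single
    coordinator addresses every worker one after the other (assumed model)."""
    return n_workers * t_msg_ns


def clustered_dispatch_ns(n_workers: int, cluster_size: int, t_msg_ns: int) -> int:
    """Two-level dispatch: the coordinator addresses only the cluster heads,
    then each head addresses the members of its own cluster in turn."""
    n_clusters = -(-n_workers // cluster_size)  # ceiling division
    return (n_clusters + cluster_size) * t_msg_ns


def max_simultaneously_busy(n_workers: int, t_work_ns: int, t_msg_ns: int) -> int:
    """Workers start t_msg_ns apart and each works for t_work_ns, so only the
    workers started before the first one finishes can be busy together."""
    started_before_first_finishes = -(-t_work_ns // t_msg_ns)
    return min(n_workers, started_before_first_finishes)


def cable_delay_cycles(length_m: float, clock_hz: float,
                       signal_speed_m_s: float = 2.0e8) -> float:
    """Clock cycles that elapse while a signal travels along a cable.
    The default signal speed (about 2/3 of c) is an assumption."""
    return length_m / signal_speed_m_s * clock_hz
```

With a hypothetical 1 µs per start message and 1 ms of work per processor, `max_simultaneously_busy` saturates at 1000 no matter how many processors are added, while clustering a million processors into groups of 1000 shortens the wait of the last processor from 1 s to 2 ms. Under the assumed signal speed, a 30 m cable at 3 GHz costs about 450 clock cycles.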
The phenomenon itself was already experienced (see Fig. 1 in [23]) and explained [23]: "Amdahl argued that most parallel programs have some portion of their execution that is inherently serial and must be executed by a single processor while others remain idle. ... In fact, there comes a point when using more processors to solve a fixed-size problem actually increases the execution time rather than reducing it." However, by today it has been almost completely forgotten, mainly due to the quick development of the parallelization technology and of the single-processor performance, as well as to the considerably longer benchmarking time. Today the number of processors is thousands of times higher than a quarter of a century ago, and the same phenomenon has returned in a technically different form at a much higher number of processors.
In Fig. 1, the development of the payload performance of some top supercomputers is depicted as a function of the year of construction. Despite the huge competition, the payload performance changes by relatively small amounts as time passes. The "speed of light" is different for the different architectures, but the higher the achieved performance, the harder it is to increase it. Considerable performance increase happened only at relatively low computing performance values. It looks as if, in the feasibility studies, an analysis of whether some inherent performance bound exists remained out of sight, be it in the USA [25, 26], in the EU [27], in Japan [28] or in China [29].

Figure 2 (caption): The black dots mark the performance data of the supercomputers JUQUEEN and K as of June 2014, for the HPL and HPCG benchmarks, respectively. The saturation effect can be observed for both HPL and HPCG benchmarks. The shaded area only highlights the nonlinearity. The red dot denotes the performance value of the system used by [24]. The dashed line shows the plausible saturation performance value of the brain simulation. The computing performance of AI applications may be between the diagram line marked by HPCG and that of the brain simulation.
The supercomputers are stretched to the limit [30]. In the case of Summit, adding 5% more cores (and fine-tuning after its quick startup) resulted in a 17% increase in computing performance half a year after its appearance on the list, and in another half year a 0.7% increase in the number of cores caused a 3.5% increase in its performance. (That also underlines the importance of engineering perfection.) In the light of this, it is hard to believe that in the case of Taihulight vs. Sierra (see also Gyoukou vs. Trinity vs. Piz Daint) neither of the parties attempted to raise the payload performance. Actually, the $R_{Max}^{HPL}$ values differ by about 0.2%, i.e. a slight increase in the payload performance of either supercomputer would change the prestigious ranking of the supercomputers. Similarly, the one-time appearance of Gyoukou is puzzling: it could catch slot #4 on the list using just 12% of its 20M cores, although the explicit ambition was to be #1. The lack of understanding that computing performance behaves differently under extreme conditions led some to suppose that some kind of fraud had occurred [31]. Does some secret law prevent increasing their payload performance? Or has their specific "speed of light" been reached?
As discussed above (and in detail in [9]), because of the classic computing paradigm (and natural laws) the computer performance does not increase linearly with adding more nominal performance (more processors). The nonlinearity is not noticeable at low performance values, and its exact form also depends on the benchmarking task. In Fig. 2 the dependence is depicted for different benchmark types, specifically for the commonly utilized HPL and HPCG benchmarks. All diagram lines clearly show the signs of saturation [32] (or rooflines [33]): the more communication is needed for solving the task, the more "dense" the environment and the smaller the specific "speed of light". The third "roofline" is guessed for the case of processor-based brain simulation [34], where an amount of communication higher by orders of magnitude is required. The payload performance of processor-based Artificial Intelligence tasks is expected, based on the amount of communication, to lie between the latter two rooflines, closer to the HPCG level.
As the figure shows, at low performance values the deviation from the linear dependence cannot be noticed: the "classic speed addition" is valid. At extremely large performance values, however, the dependence is strongly non-linear, and specific performance limits exist (depending on the conditions of the measurement). The large scatter of the measured data originates in the perturbation due to the different manufacturers, processor and connection types, design ideas, etc. Nevertheless, the diagram lines corresponding to the theory follow the tendency of the data closely.
Analogy with the general relativity
The mentioned losses manifest themselves in the appearance of the performance wall [9], a new limitation due to parallelized sequential computing. It is known that extremely large masses behave differently from what we know under 'normal' conditions. This concerns not only the quite extreme phenomena of 'black holes' and the like: even if we assume that we know how 'matter' behaves, we need to assume the presence of 'dark matter'. The latter is like 'matter', but not quite; which is another way of saying that the large-scale behavior of 'matter' largely deviates from what we have concluded from smaller amounts of 'matter'. Incidentally, this analogy is already in use: the phenomenon of 'dark silicon' [18] is obviously named in analogy with 'dark matter'. The (silicon) cores are there and usable, but (because of the thermal dissipation) a large number of cores behaves differently.
In an analogous way, parallel computing introduces "dark performance". Because of the principle of classic computing, the first core must speak to all fellow cores, and this non-parallelizable fraction of the time of the cores is multiplied by the number of cores [33, 35]. The result is that, on the one hand, the top supercomputers (depending on the number of cores) show an efficacy of around 1% when solving real-life tasks; on the other hand, puzzling events occur, like the one-time appearance of the supercomputer Gyoukou, which used only 2.4M (out of 20M) of its processors in the competition for the #1 slot of the TOP500 [36] list of supercomputers, or the retargeting of the supercomputer Aurora [37], after years of building, weeks before its planned startup. Exceeding a certain computing performance is prohibited by the laws of nature.
Note that "black holes" also have their analogs in computing. Adding more nominal performance (more cores) has no noticeable effect: the "black hole" does not allow more payload performance to be emitted than a specific (environment-dependent) maximum.
The top row of Fig. 3 introduces a further analogy. The non-parallelizable fraction (denoted in the figure by $\alpha_{eff}^{X}$) of the computing task comprises components of different origin. As already discussed, and as was noticed decades ago ("The inherent communication-to-computation ratio in a parallel application is one of the important determinants of its performance on any architecture" [23]), the communication can be a dominant contribution. The left subfigure of the row displays the case of minimum communication, the right subfigure the case of moderately increased communication (corresponding to real-life supercomputer tasks). As discussed, the OS contribution increases linearly with the nominal performance (the number of cores) and, with the growing number of cores, becomes dominant. The resulting large non-parallelizable fraction strongly decreases the efficacy (in other words, the performance gain) of the system [9, 38]. As the nominal performance increases linearly and the performance decreases exponentially with the number of cores, at some critical value a turning point occurs and the resulting performance starts to decrease, in analogy with the size in the case of a gravitational collapse. Such an effect in computing is implicitly experienced in the cases of the failed projects already mentioned, and is explicitly measured in [22] (see their Fig. 7). Note that the effect was noticed early [23], but was forgotten due to the successes in the development of the parallelization technology.

Figure 3 (caption, beginning missing): ... Fig. 1 in [23]). The black dot marks the HPL performance of the computer used in works [24, 22]. In the top right figure the behavior measured with the HPCG benchmark is displayed. In this case the contribution of the application (thin brown line) is much higher, while the looping contribution (thin green line) is the same as above. As a consequence, the achievable payload performance is lower and the breakdown of the performance is also softer. The black dot marks the HPCG performance of the same computer. The rightmost figure demonstrates what happens if the clock cycle is 5000 times longer: it causes a drastic decrease in the achievable performance and strongly shifts the performance breakdown toward lower nominal performance values. The figure purely illustrates the concepts; the displayed numbers are somewhat similar to the real ones.
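The turnover described above can be reproduced with a deliberately simple toy model. This is a sketch under stated assumptions, not the model of [9]: the run time is normalized to the single-processor time, `serial` is a fixed sequential share, and `per_core_overhead` is a hypothetical organizational cost that grows linearly with the number of cores. Both default values are invented for illustration.

```python
import math


def payload_speedup(n_cores: float, serial: float = 1e-7,
                    per_core_overhead: float = 1e-12) -> float:
    """Speedup over one core when the organizational cost grows with the
    core count: run_time = serial + overhead * N + parallel / N."""
    parallel = 1.0 - serial
    run_time = serial + per_core_overhead * n_cores + parallel / n_cores
    return 1.0 / run_time


def turnover_cores(serial: float = 1e-7,
                   per_core_overhead: float = 1e-12) -> float:
    """Core count at which the toy model's speedup is maximal; obtained by
    setting the derivative of run_time with respect to N to zero."""
    return math.sqrt((1.0 - serial) / per_core_overhead)
```

In this toy model the speedup peaks near one million cores for the default values and declines on both sides of the peak. Raising `per_core_overhead` by a factor of one million moves the turnover down to about a thousand cores, which mirrors qualitatively the shift toward lower nominal performance described for a longer clock cycle.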
Analogy with the quantum physics
Electronic computers are clock-driven systems, i.e. no action can happen in a time period shorter than the length of one clock period. The typical value of that "quantum of time" today is in the nanosecond range. On the everyday scale, time seems to be continuous: the time difference between actions that we can perceive is in the range of dozens of milliseconds, so here the "quantal nature of computing" cannot be noticed. Some (sequential) non-payload fraction of the total time is always present: it cannot be smaller than the length of two clock periods divided by the total measurement time, since neither forking nor joining the other threads can take less than one clock period.
Unfortunately, the technical implementation needs about ten thousand times longer to perform those actions: only the operating system is allowed to perform such operations, and it takes a long time [39, 40]. The total time of the performance measurement is large (typically hours) but finite, so the non-parallelizable fraction is small but finite (and, as discussed above, it is multiplied by the number of cores in the system). Because of Amdahl's Law, the absolute value of the computing performance of parallelized systems inherently has an upper limit, and the higher the number of aggregated processing units, the lower the efficiency.
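The orders of magnitude involved can be checked with a few lines of arithmetic. The sketch below uses the textbook form of Amdahl's Law; the function names and the example values (1 ns clock period, a one-hour measurement, an OS penalty factor of ten thousand) are illustrative assumptions rather than measured data.

```python
def min_nonpayload_fraction(clock_period_s: float, measurement_time_s: float,
                            periods: int = 2) -> float:
    """Lower bound of the sequential fraction: at least one clock period
    for forking and one for joining, relative to the measurement time."""
    return periods * clock_period_s / measurement_time_s


def amdahl_speedup(n_units: int, serial_fraction: float) -> float:
    """Textbook Amdahl's Law: speedup of n_units processing units when
    serial_fraction of the single-unit run time cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_units)


def efficiency(n_units: int, serial_fraction: float) -> float:
    """Payload performance divided by nominal performance."""
    return amdahl_speedup(n_units, serial_fraction) / n_units


# Assumed example values: 1 ns clock, 1 hour benchmark, OS about 10^4 times slower.
ideal = min_nonpayload_fraction(1e-9, 3600.0)
with_os = 1e4 * ideal
```

For these assumed values the ideal bound is roughly 5.6e-13 and the OS-limited fraction roughly 5.6e-9. The speedup can then never exceed `1 / with_os`, and `efficiency` falls from nearly 1 at a million units to well below 0.2 at a billion units.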
The propagation time of signals is also very much similar to that of the effects of physical fields, and even the latency time of the interfaces can be paired with creating and attenuating the physical carriers. Zero-time on/off signals are possible both in the classical physics and classical computing, while in the corresponding modern counterparts also the time needed to create and detect the signals must be accounted; the effect noticed as that the time of wiring (in this extended sense) grows compared to the time of gating [8] .
The processor-based brain simulation provides experimental proof [34] that time in computing shows quantal behavior, analogously to energy in physics. When simulating neurons utilizing processors, the ratio of the simulated (biological) time and the processor time used to simulate the biological effect may differ considerably, so to avoid working with "signals from the future", periodic synchronization is required. The commonly used "grid time" here is 1 ms [24]. This grid time introduces a special "biological clock cycle". The role of this clock period is the same as that of the clock signal in clocked digital electronics: whatever happens within this period happens "at the same time". This clock period is, however, $10^6$ times longer than the 1 ns clock cycle common in digital electronics. Correspondingly, its influence on the performance is noticeable; see the bottom subfigure in Fig. 3. As shown, the "quantal nature of time" in computing changes the behavior of the performance drastically. Not only is the achievable performance lower by orders of magnitude, but the "gravitational collapse" (see also [23]) also occurs at a nominal performance lower by orders of magnitude. This is the reason why less than one percent of the planned capacity can be achieved even by the purpose-built brain simulator [41], and why the SW-based and HW-based simulations show the same limitation [24, 34]. The memory of extremely large supercomputers can be populated [42] with objects simulating neurons, but as soon as they need to start to communicate, the task collapses as predicted in Fig. 3. This is indirectly underpinned [22] by the facts that the different handling of the threads changes the efficacy sensitively and that the time required for a more detailed simulation increases non-linearly [24, 34].
Analogy with the interactions of particles
The ability to communicate with each other is not a native feature of processors in 'classic computing': in the Single Processor Approach (SPA), questions like sending messages to and receiving messages from some other party make no sense at all (no other party exists); messaging is very ineffectively imitated by SW in the layer between the HW and the real SW. This feature alone prevents building exascale supercomputers [9]: in SPA the communication is a non-parallelizable fraction of the activity of the cores, and similarly, sharing resources makes no sense (although it is an elementary requirement in all modern systems). See also the features of EMPA in section 3. Under the laws of parallel computing, the more communication takes place, the more the actual behavior of a computing system differs from that expected on the basis of classical computing. Similarly, in physics the behavior of an atom changes strongly through its interaction (communication) with other particles.
Brain simulation (and, on a somewhat smaller scale, artificial neural computing) requires intensive data exchange among the parallel threads: the neurons are expected to periodically tell the result of their neural calculations to thousands of fellow neurons. Because the neurons must work on the same (biological) time scale, the (commonly used) 1 millisecond "grid time" ("the quantum of computing time") has a noticeable effect on the performance. In addition, the thousands of times more communication contributes considerably to the non-payload, sequential-only fraction, so it further degrades the efficacy of the computing system. (What is worse, the neurons are expected to send their messages at the end of the grid time, causing a huge burst of messages.) This is why it was found [24] that only a few tens of thousands of neurons can be simulated on processor-based brain simulators (including both the many-thread software simulators and the purpose-built brain simulator).
This phenomenon cannot be explained in the frames of the "classic computing". The limits of single-processor performance enforced by the laws of nature [8] are topped by the limitation of parallel computing [9, 33] , and further limited through introducing the "biological clock period". Notice that these contributions are competing with each other, the actual circumstances decide which of them will dominate. Their effect, however, is very similar: according to Amdahl, what is not parallel is qualified as sequential.
Analogy with the uncertainty principle
Even the uncertainty principle of quantum physics, which states that (unlike in classical physics) one cannot accurately measure certain pairs of physical properties of a particle (like the position and the momentum) at the same time, has its counterpart in computing. Using registers (and caches), one can perform computations at much higher speed, but to service an interrupt one has to save and restore the registers and renew the cache content. That is, one cannot have low latency and high performance at the same time using the same processor design principles.
The classic versus modern paradigm
Today we have extremely inexpensive (and at the same time extremely complex and powerful) processors around (a "free resource" [41]), and we have arrived at the age when no additional reasonable functionality can be implemented in processors by adding more transistors; the over-engineered processors, optimized for the single-processor regime, do not enable reducing the clock period [43]. The computing power hidden in many-core processors cannot be utilized effectively for payload work because of the "power wall" (partly because of the improper working regime [44]): we have arrived at the age of "dark silicon" [18], and we have too many processors [16] around. The supercomputers face critical efficiency and performance issues; the real-time (especially the cyber-physical) systems experience serious predictability, latency and throughput issues; in summary, the computing performance (without changing the present paradigm) has reached its technological bounds. Computing needs renewal [12]. Our proposal, the Explicitly Many-Processor Approach (EMPA) [45], is to introduce a new computing paradigm and through that to reshape the way in which computers are designed and used today.
Overview of the modern paradigm
The new paradigm (following the analogies again) shall be based on fine distinctions in some points, present also in the old paradigm. Those points, however, must be scrutinized in all occurring cases, whether and how long can they be neglected. These points are:
• consider explicitly that not only one processor (aka Central Processing Unit) exists, i.e.
  - the processing capability (akin to the data storage capability) is one of the resources rather than a central singleton
  - not necessarily the same processing unit (out of the several identical ones) is used to solve all parts of the problem
  - a kind of redundancy (an easy method of replacing a flawed processing unit) is provided through using virtual processing units (mainly to increase the mean time between technical errors), like [46, 47]
  - different processors can and must cooperate in solving a task, i.e. direct data and control exchange among the processing units is made possible; the ability to communicate with other processing units, like [48], is a native feature
  - flexibility for making ad-hoc assemblies for more efficient processing is provided
  - the large number of processors is used for unusual tasks, like replacing memory operations with the use of additional processors
• the misconception of the segregated computer components is reinterpreted
  - the efficacy of utilization of the several processors is increased by using multi-port memories (similar to [49])
  - a "memory only" concept (somewhat similar to that in [50]) is introduced (as opposed to the "registers only" concept), using purpose-oriented, optionally distributed, partly local memory banks
  - the principle of locality is introduced into memory handling at hardware level, through introducing hierarchic buses
• the misconception of the "sequential only" execution [51] is reinterpreted
  - von Neumann required only "proper sequencing" for the single processing unit; this is extended to several processing units
  - the tasks are broken into reasonably sized and logically interconnected fragments, rather than being unreasonably fragmented by the scheduler
  - the "one-processor-one process" principle remains valid for the task fragments, but not necessarily for the complete task
  - the fragments can be executed in parallel if both data dependence and hardware availability enable it (another kind of asynchronous computing [52])
• a closer hardware/software cooperation is elaborated
  - the hardware and software only exist together (akin to "stack memory")
  - when a piece of hardware has no duty, it can sleep ("does not exist" and does not take power)
  - the overwhelming part of the OS duties of synchronization, scheduling, etc. is taken over by the hardware
  - the compiler helps the processor with compile-time information, and the processor is able to adapt (configure) itself to the task depending on the actual hardware availability
  - strong support for multi-threading and resource sharing, as well as low real-time latency, is provided at processor level
  - the internal latency of the assembled large-scale systems is much reduced, while their performance is considerably enhanced
  - the task fragments are able to return control voluntarily without the intervention of the operating system (OS), enabling the implementation of more effective and simpler operating systems
Details of the concept
Our proposal is to work at programming level with virtual processors and to map them to physical cores at runtime, i.e. to let the computing system adapt itself to the task. A major idea of EMPA (for an early and less mature version see [45]) is to use the quasi-thread (QT) as the atomic unit of processing, which comprises both the HW (the physical core) and the SW (the code fragment running on the core). The idea was derived with the best features of both the HW core and the SW thread in mind. In analogy again, the QTs have a "dual nature": in the HW world of 'classic computing' they are represented as a 'core', in the SW world as a 'thread'. However, they are the same entity in the sense of 'modern computing'. The terms 'core' and 'thread' are borrowed from conventional computing, but in 'modern computing' they can actually exist only together, in a time-limited way².
EMPA is a new computing paradigm that needs a new underlying architecture, rather than a new kind of parallel processing running on a conventional architecture. It can therefore be compared to the terms and ideas of conventional computing only in a very limited way, although many of its ideas and solutions are adapted from 'classic computing'. The executable task is broken into reasonably sized and loosely dependent QTs. (The QTs can optionally be embedded into each other, akin to subroutines.) In EMPA, every new QT also implies a new independent processing unit (PU): its internals (the PC and part of the registers) are set up properly, and the PUs execute their tasks independently³ (but under the supervision of the processor comprising the cores).
In other words: processing capacity is considered a resource in the same sense as memory is considered a storage resource. This approach enables the programmer to work with virtual processors (mapped to physical PUs by the computer at run-time) and to use the quick resource, the PUs, where it can replace the slow resource, the memory (say, hiring a quick processor from a core pool can be competitive with saving and restoring registers in the slow memory, for example when making a recursive call). The third major idea is that the PUs can cooperate in various ways, including data and control synchronization, as well as outsourcing part of the received job (received as an embedded QT) to a helper core. An obvious example is to outsource the housekeeping activity of loop organization to a helper core: counting, addressing, comparing, etc. can be done by a helper core, while the main calculation remains with the originally delegated core. As the mapping to physical cores occurs at runtime (depending on the actual HW availability), the processor can exclude the (maybe temporarily) denied cores, as well as adapt the resource need of the task (requested by the compiler) to the actual computing resource availability.
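The idea of renting a helper core from a pool, and falling back to the conventional single-core path when none is free, can be sketched in software. This is only a model of the control flow under stated assumptions: `CorePool`, `rent`, `give_back` and `summed` are hypothetical names introduced for illustration, and the "helper" is modelled by a generator that does the counting and addressing.

```python
class CorePool:
    """Hypothetical pool of physical cores, identified by plain integers."""

    def __init__(self, n_cores):
        self.free = list(range(n_cores))

    def rent(self):
        # Returns a core id, or None when every core is busy or denied.
        return self.free.pop(0) if self.free else None

    def give_back(self, core):
        self.free.append(core)


def summed(pool, values):
    """Sum `values`; loop housekeeping goes to a helper core if one is free."""
    helper = pool.rent()
    if helper is None:
        # Conventional path: one core counts, addresses, compares and adds.
        total, i = 0, 0
        while i < len(values):
            total += values[i]
            i += 1
        return total, "single core"
    try:
        # Helper path (modelled): the helper delivers the operands one by one,
        # the originally delegated core only accumulates them.
        operands = (values[i] for i in range(len(values)))
        return sum(operands), f"helper core {helper}"
    finally:
        pool.give_back(helper)  # the rented core returns to the pool


print(summed(CorePool(1), [1, 2, 3]))
print(summed(CorePool(0), [1, 2, 3]))
```

The caller receives the same result in both cases; only the mapping to physical resources differs, which is decided at runtime by what the pool can offer.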
The processor has an additional control layer for organizing the joint work of its cores. The cores have just a few extra communication signals and are able to execute both conventional instructions and so-called meta-instructions (for configuring the architecture). The latter are executed in a co-processor style: when it finds a meta-instruction, the core notifies the processor, which suspends the conventional operation of the core, controls the execution of the meta-instruction (utilizing the resources of the core, providing helper cores and handling the connections among the cores as requested), and then resumes core operation.
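The co-processor style of executing meta-instructions can be pictured with a toy interpreter. It is a minimal sketch under assumptions: the instruction names (`add`, `mul`, `meta.create_qt`), the `Processor` class and its `execute_meta` method are invented for illustration and do not describe the actual EMPA instruction set.

```python
class Processor:
    """Hypothetical control layer; it only records the requests it handled."""

    def __init__(self):
        self.handled = []

    def execute_meta(self, core_id, name, arg):
        # A real control layer would allocate helper cores and set up
        # connections here; this sketch just logs the request.
        self.handled.append((core_id, name, arg))


def run_core(core_id, program, processor):
    """Execute a list of (name, arg) instructions on one simulated core."""
    acc = 0
    for name, arg in program:
        if name.startswith("meta."):
            # The core notifies the processor; its conventional operation is
            # suspended for the duration of this call and resumes afterwards.
            processor.execute_meta(core_id, name, arg)
        elif name == "add":
            acc += arg
        elif name == "mul":
            acc *= arg
        else:
            raise ValueError(f"unknown instruction: {name}")
    return acc


processor = Processor()
program = [("add", 2), ("meta.create_qt", 1), ("mul", 5)]
result = run_core(0, program, processor)
print(result, processor.handled)
```

The conventional instruction stream is unaffected by the meta-instruction; the configuration work is visible only in the control layer.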
The processor needs to find the needed PUs (cores), and the processing ability has to accommodate to the task, also inside the processor, and it has to do so quickly, flexibly, effectively and inexpensively: a kind of 'on demand' computing that works 'as-a-service'. This is a task not only for the processor: the complete computing system must participate, and for that goal the complete computing stack must be rebuilt.
There was no established theory behind the former attempts to optimize code execution inside the processor, and they could achieve only moderate success: in SPA the processor works in real time, and it does not have enough resources, knowledge and time to discover those options completely [53]. In classic computing, the compiler can find out anything about enhancing the performance, but it has no information about the actual run-time HW availability; furthermore, it has no way to tell its findings to the processor. The processor has the HW availability information, but has to "reinvent the wheel" with respect to enhancing performance, and in real time. In EMPA, the compiler puts its findings into the executable code in the form of meta-instructions ("configware"), and the actual core executes them with the assistance of the new control layer of the processor. The processor can choose from those options, considering the actual HW availability, in the style 'if NeededNumberOfResourcesAvailable then Method1 else Method2', possibly with such choices embedded into one another.
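The selection among compiler-provided alternatives can be sketched as follows. The sketch assumes that the compiler emits the alternatives in its order of preference, each with the number of extra cores it needs; the function `choose_method` and the pair layout are hypothetical, while the names Method1 and Method2 follow the wording above.

```python
def choose_method(alternatives, free_cores):
    """Return the first alternative whose core demand fits the free cores.

    alternatives : list of (cores_needed, method_name) pairs, in the
                   compiler's order of preference (assumed layout)
    free_cores   : number of cores the processor can offer right now
    """
    for cores_needed, method_name in alternatives:
        if cores_needed <= free_cores:
            return method_name
    raise RuntimeError("no alternative fits the available cores")


# 'if NeededNumberOfResourcesAvailable then Method1 else Method2'
two_way = [(4, "Method1"), (0, "Method2")]
print(choose_method(two_way, 6))
print(choose_method(two_way, 1))

# Embedded choices flatten into a longer preference list.
nested = [(4, "four-way split"), (2, "two-way split"), (0, "sequential")]
print(choose_method(nested, 3))
```

Keeping an alternative that needs no extra core at the end of the list guarantees that the task can always run, whatever the momentary HW availability is.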
Some advantages of EMPA
The approach results in several considerable advantages, but the page limit allows mentioning only a few of them.
• as a new QT receives one or more new Processing Units (PUs), there is no need to save/restore registers and the return address (less memory utilization and fewer instruction cycles)
• the OS can receive its own PU, which is initialized in kernel mode and can promptly (i.e. without the need of context change) service the requests from the requestor core
• for resource sharing, temporarily a PU can be delegated to protect the critical section; the next call to run the code fragment with the same offset will be delayed until the processing by the first PU terminates
• the processor can natively accommodate to the varying need for parallelization
• the cores currently not in use wait in a low energy consumption mode
• the hierarchic core-to-core communication greatly increases the memory throughput
• the asynchronous-style computing [54] largely reduces the loss due to the gap [55] between the speed of the processor and that of the memory
• the direct core-to-core connection (more dynamic than in [48]) greatly enhances efficacy in large systems [56]
• the thread-like feature to fork() and the hierarchic buses change the dependence on the number of cores from linear to logarithmic [9] (which enables building really exa-scale supercomputers)
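The difference between a linear and a logarithmic dependence on the number of cores can be made concrete with a small count. The list above does not state which quantity is meant; as an assumption, the sketch below counts the steps needed to start work on all cores, once when a single core starts the others one by one, and once when every already active core forks one further core per step. The function names are hypothetical.

```python
def linear_steps(n_cores):
    """Steps when one core activates all the others, one per step."""
    return max(n_cores - 1, 0)


def fork_steps(n_cores):
    """Steps when every active core forks one more core in each step."""
    active, steps = 1, 0
    while active < n_cores:
        active *= 2  # each active core brings one new core to life
        steps += 1
    return steps


for n in (8, 1024, 1_000_000):
    print(n, linear_steps(n), fork_steps(n))
```

For 1024 cores the sequential start takes 1023 steps while the forking start takes 10; the gap widens quickly as the number of cores grows.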
The very first version of EMPA [12] was implemented in the form of a simple (practically untimed) simulator [57]; an advanced (Transaction Level Modeled) simulator is now prepared in SystemC. The initial version adapted Y86 cores [58], the new one adapts RISC-V cores [59]. Partial solutions are also modeled in FPGA.
Summary
Today's computing is more and more typically utilized under extreme conditions: measuring the speed of light using a system whose components are slower than the speed of light; providing extremely low latency in an interrupt-driven system; providing extremely large computing performance in a parallelly working system; providing relatively high performance for a long time on a complex computer system; servicing requests in an energy-aware mode. To some measure, these activities are solved under the umbrella of the old paradigm for non-extreme-scale systems. Those experiences, however, must be reviewed when working with extremely large systems, because the scaling is so strongly nonlinear that qualitatively new phenomena are experienced. Through scrutinizing the details of the basic principles of computing, a "modern computing paradigm" can be constructed that, on one side, enables explaining the new phenomena and, on the other side, enables constructing computing systems with much more advantageous features.
