The performance wall of parallelized sequential computing: the dark performance and the roofline of performance gain
Introduction
Parallelization receives more and more focus in computing today [5]. After the growth of single-processor performance stalled [45], the only hope for making computing systems with higher performance is to assemble them from a large number of sequentially working computers. "Computer architects have long sought the "City of Gold" (El Dorado) of computer design: to create powerful computers simply by connecting many existing smaller ones." [37]. Achieving satisfactory scalability, however, is not simple at all. One of the major motivations of the origin of the Gordon Bell Prize was to increase the resulting performance gain of parallelized sequential systems: "a speedup of at least 200 times on a real problem running on a general purpose parallel processor" [6].
It has been known from the very beginning that the "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities" [3, 41] is at least questionable, and that really large-scale tasks (like building supercomputers comprising millions of single processors, or building brain simulators from single processors) may face serious scalability issues. After the initial difficulties, today the large-scale supercomputers are stretched to their limits [7], and although the EFlops (payload) performance has not yet been achieved, a 10⁴ times higher performance is already planned [32]. Many of the planned large-scale supercomputers, however, are delayed, canceled, paused, withdrawn, re-targeted, etc. It seems that in the present "gold rush" it has not been scrutinized whether the resulting payload performance of parallelized systems has some upper bound. In the known feasibility studies this aspect remains out of sight in the USA [44], in the EU [17], in Japan [18] and in China [32]. The designers did not follow the advice that "system designers must make the effort to understand the relevant characteristics of the benchmark applications they use, if they are to arrive at the correct design decisions for building larger multiprocessor systems" [41]. The rules of the game are different for segregated processors and parallelized ones [51]; the larger the systems, the more remarkable deviations appear in their behavior. In addition to the already known limitations [35], valid for segregated processors, a new limitation, valid for parallelized processors, appears [55].
This paper shows that the well-known Amdahl's law, the commonly used computing paradigm (the single-processor approach [3]) and the commonly used implementation technology together form a strong upper bound for the performance of parallelized sequential computing systems, and that the experienced failures, saturation, etc., can be attributed to approaching and attempting to exceed that upper bound. In section 2 "We realize that Amdahl's Law is one of the few, fundamental laws of computing" [38], and reinterpret it for the targeted goal. The section introduces the mathematical formalism used and derives the deployed logical merits. Based on the idea of Amdahl, in section 3 a general model of parallel computing is constructed. In this (by intention) strongly simplified model the contributions are classified either as parallelizable or non-parallelizable ones. The section provides an overview of the components contributing to the non-parallelizable fraction of processing, and attempts to reveal their origin and behavior. By properly interpreting the contributions, it shows that under different conditions different contributions can have the role of being the performance-limiting factor. The section also interprets the role of the benchmarks, and explains why different benchmarks produce different results. Section 4 shows that parallelized sequential computing systems have their inherent performance bound, and lists numerous potential limiting factors. The section also provides some supercomputer-specific numerical examples.
The last two sections directly underpin the previous theoretical discussion with measured data. In section 5 some specific supercomputer features are discussed, where the case is well documented and the special case enables drawing general conclusions. This section also discusses the near (predictable) future of large-scale supercomputers and their behavior. Section 6 draws some statistical conclusions, based on the available supercomputer database, containing rigorously verified, reliable data for the complete supercomputer history. The large amount of data enables, among others, deriving conclusions about a new law of electronic development, telling where and when it is advantageous to apply graphic accelerator units, as well as deriving a "roofline" model of supercomputing.
Amdahl's classic analysis

Origin and interpretation
The most commonly known and cited limitation on parallelization speedup ([3], the so-called Amdahl's law) considers the fact that some parts (α) of a code can be parallelized, while some ((1 − α)) must remain sequential. Amdahl only wanted to draw attention to the fact that, when putting together several single processors and using the Single Processor Approach (SPA), the available speed gain due to using large-scale computing capabilities has a theoretical upper bound. He also mentioned that data housekeeping (non-payload calculations) causes some overhead, and that the nature of that overhead appears to be sequential, independently of its origin.
Validity
A general misconception (introduced by successors of Amdahl) is to assume that Amdahl's law is valid for software only, and that the non-parallelizable fraction means something like the ratio of the numbers of the corresponding instructions. Amdahl in his famous paper speaks about "the fraction of the computational load" and explicitly mentions, in the same sentence and same rank, algorithmic reasons like "computations required may be dependent on the states of the variables at each point"; architectural aspects like "may be strongly dependent on sweeping through the array along different axes on succeeding passes"; as well as "physical problems" like "propagation rates of different physical effects may be quite different". His point of view is also valid today: one has to consider the load of the complex hardware (HW)/software (SW) system, rather than some segregated component, and his idea describes parallelization imperfectness of any kind. When applied to a particular case, however, one shall scrutinize which contributions can actually be neglected.
Actually, Amdahl's law is valid for any partly parallelizable activity (including computer-unrelated ones), and the non-parallelizable fragment shall be given as the ratio of the time spent with non-parallelizable activity to the total time. The concept was frequently and successfully utilized in quite unexpected fields, as well as misunderstood and abused (see [30, 23]).
As discussed in [35]:
• many parallel computations today are limited by several forms of communication and synchronization
• the parallel and sequential runtime components are only slightly affected by cache operation
• the wires get increasingly slower relative to gates
Amdahl's case under realistic conditions
The realistic case is that the parallelized parts are not of equal length (even if they comprise exactly the same instructions). The hardware operation in modern processors may execute them in considerably different times; for examples, see [53, 24] and the references cited therein; consider the operation of hardware accelerators inside a core, or network operation between processors, etc. One can also see that the time required to control parallelization is not negligible and varies, representing another source of performance bounds.
The static correspondence between program chunks and processing units can be very inefficient: all assigned processing units must wait for the delayed unit. The measurable performance does not match the nominal performance, leading to the appearance of "dark performance": the processors cannot all be utilized at the same time, much like a fraction of the cores cannot be powered at the same time (because of energy dissipation), which leads to the issue of "dark silicon" [16, 15].
Also, some capacity is lost if the number of computing resources exceeds the number of parallelized chunks. If the number of processing units is smaller than that of the parallelized threads, several "rounds" for the remaining threads must be organized, with all the disadvantages of the duty of synchronization [57, 10]. In such cases it is not possible to apply Amdahl's Law directly: the actual architecture is too complex or not known. However, in all cases the speedup can be measured and expressed as a function of the number of processors.
Factors affecting parallelism
Usually, Amdahl's law is expressed as

    1/S = (1 − α) + α/k                                  (1)

where k is the number of parallelized code fragments, α is the ratio of the parallelizable fraction to the total, and S is the measurable speedup. The assumption can be visualized such that (assuming many processors) in the α fraction of the running time the processors are processing data, and in the (1 − α) fraction they are waiting (all but one). That is, α describes how much, on average, the processors are utilized. Having those data, the resulting speedup can be estimated. For today's complex systems, calculating α is hopeless, but for a system under test, where α is not a priori known, one can derive from the measurable speedup S an effective parallelization factor as

    α_eff = (k/(k − 1)) · ((S − 1)/S)                    (2)
Obviously, this is nothing more than α expressed in terms of k and S from Equ. (1). So, for the classical case, α = α_eff; which simply means that in the ideal case the actually measurable effective parallelization achieves the theoretically possible one. In other words, α describes a system whose architecture is completely known, while α_eff characterizes a system whose performance is known from experiments. Again, in other words, α is the theoretical upper limit, which can hardly be achieved, while α_eff is the actual experimental value that describes the complex architecture and the actual conditions.
The value of α_eff can then be used to refer back to Amdahl's classical assumption even in realistic cases when the detailed architecture is not known. On one side, the speedup S can be measured and α_eff can be utilized to characterize the measurement setup and conditions [54], i.e. how much of the theoretically possible maximum parallelization is realized. On the other side, the theoretically achievable α or α_eff can be guessed from some general assumptions.
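As a small illustration of these definitions, the following Python sketch (not part of the original analysis) computes α_eff from a hypothetical measured speedup on a given number of processors, using Equ. (2), and checks it against the classical form of Equ. (1); the processor count and speedup value are made up for demonstration.

def amdahl_speedup(alpha, k):
    """Classical Amdahl speedup, Equ. (1): S = 1 / ((1 - alpha) + alpha / k)."""
    return 1.0 / ((1.0 - alpha) + alpha / k)

def alpha_eff(k, speedup):
    """Effective parallelization derived from a measured speedup, Equ. (2)."""
    return (k / (k - 1.0)) * (speedup - 1.0) / speedup

# Hypothetical measurement: 1000 processors deliver a 500-fold speedup.
k, measured_speedup = 1000, 500.0
a_eff = alpha_eff(k, measured_speedup)
print(f"alpha_eff = {a_eff:.6f}")                                  # ~0.998999
# Feeding alpha_eff back into Equ. (1) reproduces the measured speedup.
print(f"reconstructed speedup = {amdahl_speedup(a_eff, k):.1f}")   # ~500.0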
In the case of real tasks a Sequential/Parallel Execution Model [57] shall be applied, which cannot use the simple picture reflected by α, but gives a good merit of the degree of parallelization for the duration of executing the process on the given hardware configuration, and can be compared to the results of technology-dependent parametrized formulas¹. Numerically, (1 − α_eff) equals the serial fraction value established theoretically in [29].
Several models can be applied to the scaling of parallel systems, and all of them can be good in their specific fields. However, one must recall that "The truth is that there is probably no specific parameter scaling principle that can be universally applied" [41]. This also means that the validity of the scaling methods for an extremely large number of processors must be scrutinized.
Efficiency of parallelization
The distinguished constituent in Amdahl's classic analysis is the parallelizable payload fraction α; all the rest (including wait time, communication, system contribution and any other non-payload activities) goes into the apparently "sequential-only" fraction (1 − α) according to this extremely simple model.
When using several processors, one of them makes the sequential-only calculation, the others are waiting² (they use the same amount of time). In the age of Amdahl the number of processors was small and the contribution of the SW was high relative to the contribution of the parallelized HW. Because of this, the contribution of the SW dominated the sequential part, so the value of (1 − α) really could be considered a constant. The technical development resulted in decreasing all non-parallelizable components; the large systems today can be idle also because of HW reasons.

¹ Just notice here that passing parameters among cores, as well as cores blocking each other (of course, through the operating system (OS)), are all a kind of synchronization or communication, and their amount differs task by task.

² In a different technology age, the same phenomenon was already described: "Amdahl argued that most parallel programs have some portion of their execution that is inherently serial and must be executed by a single processor while others remain idle." [41]

Figure 1: The dependence of the computing efficiency of parallelized sequential computing systems on the parallelization efficacy and the number of cores. The surface is described by Eq. (4); the data points for the TOP5 supercomputers are calculated from the publicly available database [42].

Anyhow, when calculating the speedup, one calculates

    S = 1/((1 − α) + α/k)                                (3)
hence the efficiency (how the speedup scales with the number of processors) is

    E = S/k = 1/(k·(1 − α) + α)                          (4)
This means that according to Amdahl, as presented in Fig. 1, the efficiency depends both on the total number of processors in the system and on the perfectness of the parallelization. The perfectness comprises two factors: the theoretical limitation and the engineering ingenuity.
If parallelization is well-organized (load balanced, small overhead, right number of Processing Units (PUs)), α saturates at unity (in other words: the sequential-only fraction approaches zero), so the tendencies can be better displayed through using (1 − α) in the diagrams below. The importance of this practical term is underlined by the fact that it can be interpreted and utilized in many different areas [54, 52], and the achievable speedup (the maximum achievable performance gain when using an infinitely large number of processors) can easily be derived from Equ. (1) as

    lim_{k→∞} S = 1/(1 − α)
Provided that the value of α does not depend on the number of processors, for a homogeneous system the total payload performance is

    P_payload(k) = k · E(k, α) · P_single = S(k, α) · P_single    (5)
i.e. the total payload performance can be increased by increasing the performance gain or by increasing the single-processor performance, or both. Notice, however, that increasing the single-processor performance through accelerators also has its drawbacks and limitations [47], and that the performance gain and the single-processor performance are players of the same rank in defining the payload performance.
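The following Python sketch illustrates the relations of Eqs. (4) and (5): the efficiency, the limiting gain 1/(1 − α), and the payload performance of a homogeneous system. The α, core count and single-processor performance values are arbitrary assumptions, chosen only to show the orders of magnitude involved.

def efficiency(alpha, k):
    """E = S / k = 1 / (k * (1 - alpha) + alpha), Eq. (4)."""
    return 1.0 / (k * (1.0 - alpha) + alpha)

def max_gain(alpha):
    """Achievable performance gain with infinitely many processors, 1 / (1 - alpha)."""
    return 1.0 / (1.0 - alpha)

def payload_performance(p_single, alpha, k):
    """P_payload = k * P_single * E(k, alpha), Eq. (5)."""
    return k * p_single * efficiency(alpha, k)

alpha = 1.0 - 1e-7        # assumed non-parallelizable fraction of 1e-7
k = 1_000_000             # assumed one million cores
p_single = 1e11           # assumed 100 Gflop/s single-processor performance
print(f"E       = {efficiency(alpha, k):.3f}")                            # ~0.909
print(f"G_max   = {max_gain(alpha):.1e}")                                 # 1.0e+07
print(f"payload = {payload_performance(p_single, alpha, k):.2e} flop/s")  # ~9.1e+16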
Connecting efficiency and α_eff
Through using Equ. (4), E can be equally good for describing the efficiency of parallelization of a setup, but a second parameter, the number of processors, is also required. From Equ. (4)

    α_eff(E, k) = (k − 1/E)/(k − 1)
This quantity depends on both E and k, but in some cases it can be assumed that α_eff is independent of the number of processors. This seems to be confirmed by data calculated from several publications, as was noticed early. At this point one can notice that 1/E in Equ. (4) is a linear function of the number of processors, and its slope equals (1 − α_eff). The value calculated in this way is denoted by (1 − α_Δ). Its numerical value is quite near to the value calculated (see Equ. (2)) using all processors, and so it is not displayed in the rest of the figures. This also means that from efficiency data one can estimate the value of α_Δ even for intermediate regions, i.e. without knowing the execution time on a single processor (for technical reasons, this is the usual case for supercomputers). From a handful of processors one can find out whether the supercomputer under construction has hopes to beat⁵ the No. 1 in TOP500 [42]. This result can also be used for investment protection.
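A minimal sketch of the procedure described above: estimating (1 − α_eff) from a published efficiency (R_Max/R_Peak) and core count by rearranging Eq. (4). The figures below approximate the publicly reported HPL data of Sunway TaihuLight [42] and serve only as an illustration.

def one_minus_alpha_eff(efficiency, k):
    """(1 - alpha_eff) = (1/E - 1) / (k - 1), rearranged from Eq. (4)."""
    return (1.0 / efficiency - 1.0) / (k - 1.0)

r_max, r_peak, cores = 93.0e15, 125.4e15, 10_649_600   # approximate TaihuLight HPL figures [42]
e = r_max / r_peak
print(f"efficiency      = {e:.3f}")                              # ~0.742
print(f"(1 - alpha_eff) = {one_minus_alpha_eff(e, cores):.2e}")  # ~3.3e-08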
Time to organize parallelization
The timing analysis given above can be applied to different kinds of parallelization, from processor-level parallelization (instruction or data level parallelization, in the nanoseconds range), to OS-level parallelization (including thread-level parallelization using several processors or cores, in the microseconds range), to network-level parallelization (between networked computers, like grids, in the milliseconds range). The principles are the same [10], independently of the kind of implementation. In agreement with [57], housekeeping overhead is always present (and mainly depends on the HW+SW architectural solution) and remains a key question. The main focus is always on reducing its effect. Notice that the application itself also comprises some (variable) amount of sequential contribution.

⁵ At least a lower bound on (1 − α) (i.e. an upper bound on the parallelization gain) can be derived.
The actual speedup (or effective parallelization) depends strongly on the 'tricks' used during implementation. Although HW and SW parallelisms are interpreted differently [26], they can even be combined [8], resulting in hybrid architectures. For those greatly different architectural solutions it is hard even to interpret α, while α_eff enables comparing different implementations (or the same implementation under different conditions).
Our model of parallel execution
As mentioned in section 2.2, Amdahl listed different reasons why losses in the "computational load" can occur. To understand the operation of computing systems working in parallel, one needs to extend Amdahl's original model (rather than that of his successors) in such a way that the non-parallelizable (i.e. apparently sequential) part comprises contributions from HW, OS, SW and Propagation Delay (PD), and also some access time is needed for reaching the parallelized system. The technical implementations of the different parallelization methods show an infinite variety, so here a (by intention) strongly simplified model is presented. Amdahl's idea enables putting everything that cannot be parallelized into the sequential-only fraction. The model is general enough to discuss qualitatively some examples of systems working in parallel, neglecting different contributions as far as possible in the different cases. The model can also be converted into a technical (quantitative) one of limited validity.
Formal introduction of the model
The contributions of the individual model components to (1 − α) will be denoted by corresponding indices in the following. Notice the different nature of those contributions. They have only one common feature: they all consume time. The extended Amdahl's model is shown in Fig. 2. The vertical scale displays the actual activity for the processing units shown on the horizontal scale.
Notice that our model assumes no interaction between processes running on the parallelized systems in addition to the absolutely necessary minimum: starting and terminating the otherwise independent processes, which take parameters at the beginning and return results at the end. It can, however, be trivially extended to the more general case when processes must share some resource (like a database, which shall provide different records for the different processes), either implicitly or explicitly. Concurrent objects have inherent sequentiality [13], and synchronization and communication among those objects considerably increase [57] the non-parallelizable fraction (i.e. the contribution to (1 − α)), so in the case of an extremely large number of processors special attention must be devoted to their role in the efficiency of the application on the parallelized system.

Figure 2: Model of parallel execution (processing units P0...P4; phases: Access, Initiation, Software Pre, OS Pre, payload processing, just waiting, OS Post, Software Post).

Let us notice that all contributions have a role during measurement: the effects of contributions due to SW, HW, OS and PD cannot be separated, though dedicated measurements can reveal their role, at least approximately. The relative weights of the different contributions are very different for the different parallelized systems, and even within those cases they depend on many specific factors, so in every single parallelization case a careful analysis is required.
Access time
Initiating and terminating the parallel processing is usually made from within the same computer, except when one can only access the parallelized computer system from another computer (like in the case of clouds). This latter access time is independent of the parallelized system, and one must properly correct for the access time when deriving timing data for the parallelized system. Amdahl's law is valid only for a properly selected computing system. This is a one-time, usually fixed-size time contribution.
Execution time
The execution time T_Total covers all processing on the parallelized system. All applications running on a parallelized system must perform some non-parallelizable activity at least before beginning and after terminating the parallelizable activity. This SW activity represents what was assumed by Amdahl as the total sequential fraction. As shown in Fig. 2, the apparent execution time includes the real payload activity, as well as waiting and OS and SW activity. Recall that the execution times may be different [36, 37, 39] in the individual cases, even if the same processor executes the same instruction, but executing an instruction mix many times results in practically identical execution times, at least at the model level. Note that the standard deviation of the execution times appears as a contribution to the non-parallelizable fraction, and in this way increases the "imperfectness" of the architecture. This feature of processors deserves serious consideration when utilizing a large number of processors. Over-optimizing a processor for the single-thread regime hits back when using it in a many-processor environment.
The principle of the measurements
When measuring performance, one faces serious difficulties, see for example [37], chapter 1, both with making measurements and with interpreting them. When making a measurement (i.e. running a benchmark), either on a single processor or on a system of parallelized processors, an instruction mix is executed many times. There is, however, a crucial difference: in the second case an extra activity is also included: the job of organizing the joint work. This is the origin of the 'efficiency', and it leads to critical issues in the case of an extremely large number of processors.
The large number of executions averages the rather different execution times [36], with an acceptable standard deviation. In the case when the executed instruction mix is the same, the conditions (like cache and/or memory size, the network bandwidth, Input/Output (I/O) operations, etc.) are different, and they form the subject of the comparison. In the case when comparing different algorithms (like results of different benchmarks), the instruction mix itself is also different. Notice that the so-called "algorithmic effects" manifest through the HW/SW architecture and can hardly be separated from it; examples are dealing with sparse data structures (which affects cache behavior) and communication between the parallelly running threads, like returning results repeatedly to the main thread in an iteration (which greatly increases the non-parallelizable fraction in the main thread). Also notice that there are fixed-size contributions, like utilizing time measurement facilities or calling system services. Since α_eff is a relative merit, the absolute measurement time shall be long. When utilizing efficiency data from measurements which were dedicated to some other goal, proper caution must be exercised with the interpretation and accuracy of the data.
The measurement method (benchmarking)
Not surprisingly, the method of the measurement basically affects the result of the measurement: the "device under test" and the "measurement device" are the same.
The benchmarks utilized to derive numerical parameters for supercomputers are specialized and standardized programs, which run in the HW/OS environment provided by the parallelized computer under test. One can use benchmarks for different goals. Two typical fields of utilization are: to describe the environment the computer application runs in (a "best case" estimation), and to guess how quickly an application will run on a given parallelized computer (a "real-life" estimation).
If the goal is to characterize the supercomputer's HW+OS system itself, a benchmark program should distort the HW+OS contribution as little as possible, i.e. the SW contribution must be much lower than the HW+OS contribution. In the case of supercomputers, the benchmark High Performance LINPACK (HPL) [25] (with minor modifications) has been used for this goal since the beginning of the supercomputer age. The mathematical behavior of HPL enables minimizing the SW contribution, i.e. HPL delivers the best possible estimation of α_eff for the HW+OS system.
If the goal is to estimate the expected behavior of an application, the benchmark program should imitate the structure and behavior of the application. In the case of supercomputers, a couple of years ago the benchmark High Performance Conjugate Gradients (HPCG) [25] was introduced for this goal, since "HPCG is designed to exercise computational and data access patterns that more closely match a different and broad set of important applications, and to give incentive to computer system designers to invest in capabilities that will have impact on the collective performance of these applications" [25]. However, its utilization can be misleading: the ranking is only valid for the HPCG application, and only when utilizing that number of processors. HPCG really seems to give better hints for designing supercomputer applications than HPL does. According to our model, in the case of using the HPCG benchmark, the SW contribution dominates, i.e. HPCG delivers the best possible estimation of α_eff for this class of supercomputer applications.
The different benchmarks provide different (1 − α) contributions to the non-parallelizable fraction (resulting in different efficiencies and rankings [27]), so comparing results (and especially establishing rankings!) derived using different benchmarks shall be done with maximum care. Since the efficiency depends heavily on the number of cores (see also Fig. 1 and Eq. (4)), the different configurations shall be compared using the same benchmark and the same number of processors (or the same nominal performance).
The inherent limitations of supercomputing
The limitations can be derived basically in two ways. Either the experiences based on the implementations can be utilized to draw conclusions (an empirical technical limit), or some theoretical assumptions can be utilized to derive a kind of theoretical limit. The first way can be followed only if a large number of rigorously verified data are available for drawing conclusions. This method will be followed in connection with the supercomputers, where the reliable database TOP500 [42] is available. This method is absolutely empirical, and results only in an "up to now" achieved value, so one cannot be sure whether the experienced limitation is just a kind of engineering imperfectness. The other method results in a theoretical upper bound, and one cannot be sure whether it can technically be achieved. It is a strong confirmation, however, that the two ways lead to the same limitation: what can be achieved theoretically is already achieved in practice.
The technical implementation of the parallelized sequential computing systems shows an infinite variety, so it is not really possible to describe all of them in a unified scheme. Instead, some originating factors are mentioned and the corresponding term in the model is named. At this point the simplicity of the model is a real advantage: all possible contributions shall be classified only as parallelizable or non-parallelizable ones. The model uses time-equivalent units, so all contributions are expressed in time, independently of their origin.
That parallel programs have inherently sequential parts (and so an inherent performance limit) has been known for decades: "Amdahl argued that most parallel programs have some portion of their execution that is inherently serial and must be executed by a single processor while others remain idle." [41] Those limitations follow immediately from the physical implementation and the computing paradigm; it depends on the actual conditions which of them will dominate. It is crucial to understand that the decreasing efficiency (see Equ. (4)) comes from the computing paradigm itself rather than from some kind of engineering imperfectness. This inherent limitation cannot be mitigated without changing the computing/implementation principle.
Propagation delay PD
In modern high clock speed processors it is increasingly hard to reach the right component inside the processor at the right time, and it is even harder if the PUs are at a distance much larger than the die size. Also, the technical implementation of the interconnection can contribute seriously.
Wiring
As discussed in [35], the weight of wiring compared to processing is continuously increasing. The gates may become (much) faster, but the speed of light is an absolute limit for the signal propagation on the wiring connecting them. This is increasingly true when considering large systems: the need for cooling the modern high-density processors increases the length of wiring between them.
Physical size
Although the signals travel in a computing system with nearly the speed of light, as the physical size of the computer system increases, a considerable time passes between issuing and receiving a signal, causing the other party to wait without doing any payload job. At today's frequencies and chip sizes a signal cannot even travel in one clock period from one side of the chip to the other; in the case of a stadium-sized supercomputer this delay can be in the order of several hundred clock cycles. Since the time of Amdahl, the ratio of the computing time to the propagation time has changed drastically, so, as [35] calls attention to, it cannot be neglected any more, although presently it is not (yet) a major dominating term.
As long as computer components are in proximity in the cm range, the contribution by PD can be neglected, but the distance between PUs in supercomputers is typically in the 100 m range, and the nodes of a cloud system can be geographically far from each other, so considerable propagation delays can also occur.
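A back-of-the-envelope sketch of this contribution: the signal round-trip time over a given cable length, expressed in clock periods. The 1 GHz clock and the two distances are assumptions taken from the discussion in this paper; real signals travel somewhat slower than light in vacuum, so the results are lower bounds.

C = 3.0e8   # speed of light in m/s, an upper bound for signal propagation speed

def round_trip_cycles(distance_m, clock_hz):
    """Round-trip signal travel time over the given distance, in clock periods."""
    return 2.0 * distance_m / C * clock_hz

print(round_trip_cycles(0.01, 1.0e9))    # ~0.07 cycles for a chip-sized (cm) distance
print(round_trip_cycles(100.0, 1.0e9))   # ~667 cycles across a stadium-sized (100 m) machine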
Interconnection
The interconnection between cores happens in very different contexts, from the public Internet connection between clouds, through the various connections used inside supercomputers, down to the System-on-Chip (SoC) connections. The OS only initiates accessing the processors; after that, the HW works partly in parallel with the next action of the OS and with other actions initiating access to other processors. This period is denoted in Fig. 2 by the OS contribution. After the corresponding signals are generated, they must reach the target processor, that is, they need some propagation time. The PDs are denoted by PD_0 and PD_1, corresponding to the actions delivering the input data and the result, respectively. This propagation time (which of course occurs in parallel with actions on other processors, but which is a sequential contribution within the thread) depends strongly on how the processors are interconnected: this contribution can be considerable if the distance to travel is large or the message transfer takes a long time (like lengthy messages, signal latency, handshaking, store-and-forward operations in networks, etc.).
Although sometimes even in quite large-scale systems like [22] Ethernet-based internal communication is deployed, it is increasingly accepted that "The idea of using the popular shared bus to implement the communication medium [in large systems] is no longer acceptable, mainly due to its high contention." [34]
Other sources of delay
The internal operation of the processors can also contribute to the issues experienced in the large parallel computing systems.
Internal latency
The instruction execution micro-environment can be quite different for the different parallelized sequential systems. Here it is interpreted as the internal non-payload time around executing (a bunch of) machine instructions, like waiting for the instruction being processed in the pipeline, the bus being disabled for some short period, copying data between address spaces, or speculating and predicting.
Accelerators
It is a trivial idea that, since the single-processor performance cannot be increased any more, some external computing accelerator (Graphics Processing Unit (GPU)(s)) shall be used. However, because of the SPA, the data must be copied to the memory of the accelerator, and this takes time. This non-payload activity is a kind of sequential contribution and surely makes the value of (1 − α) worse. The difference is negligible at a low number of cores, but in large-scale systems it strongly degrades the efficiency, see section 6.2.
Complexity
The processors are optimized for single-processor performance. As a result, they attempt to perform more and more operations in a single clock cycle, and doing so introduces a limitation on the length of the clock period itself: "we believed that the ever-increasing complexity of superscalar processors would have a negative impact upon their clock rate, eventually leading to a leveling off of the rate of increase in microprocessor performance" [40].
The computing paradigm
At the time when the basic operating principles of the computer were formulated, there was literally only one processor, which naturally led to using the SPA. Today, due to the development of technology, the processor became a "free resource" [22]. Despite that, mainly by inertia (the preferred incremental development) and because up to now the performance could be improved even using SPA components (and thinking), today the SPA is commonly used when building large parallel computing systems, although the importance of "cooperative computing" is already recognized and demonstrated [58]. The stalling of the parallel performance may lead to the need of introducing the Explicitly Many-Processor Approach (EMPA) [48].
Addressing
One of the major drawbacks of the SPA manifests itself in the components constructed for SPA systems. Among others, SPA processors use SPA memories through SPA buses. This also means that only one single addressing action can take place at a time, which is why the time needed for addressing the processors increases linearly with the size of the system. Although this issue can be mitigated by segmenting, clustering, vectoring, etc., the basic limiting effect is present.
The context switching
All applications must use OS services and some HW facilities to initiate themselves as well as to access other processors. Because the operating system works in a different (supervisor) mode, a considerable amount of time is required for switching context. Actually, this means virtually "another processor": a different (extended) Instruction Set Architecture (ISA) and a new set of processor registers. The processor registers are very useful in single-processor optimization, but saving and restoring them considerably increases the internal latency and, what is worse, also introduces many otherwise unneeded memory operations. This is usually not a really crucial contribution, but under the extreme conditions represented by supercomputers (and especially if single-port memory is used), specialized operating systems must be used [22], or the calculation must be run in supervisor mode [20], or every single core must run a lightweight OS [58].
Synchronization
Although not explicitly dealt with here, notice that the data exchange between the first thread and the other ones also contributes to the non-parallelizable fraction and typically uses system calls; for details see [57, 19, 10]. Actually, we may have communicating serial processes, which does not improve the effective parallelism at all [1]. Some classes of applications (like artificial neural networks) need intensive and frequent data exchange and, in addition, because of the "time grid" they need to use to coordinate the operation of their "neurons", they overload the network with bursts of messages.
Supercomputer case studies
From the sections above it can be concluded that parallelized sequential computing systems have some upper limit on their payload performance (the nominal performance of course can be increased without limitation, but the efficiency decreases proportionally). In this section some case studies on supercomputer implementations are presented, utilizing only public information. Examples of deploying the formalism on other fields of parallelized computing are given in [54] .
Taihulight (Sunway)
In the parallelized sequential computing systems implemented in the SPA [3], the life of an application begins in one such sequential subsystem. In the large parallelized applications running on general purpose supercomputers, initially and finally only one thread exists, i.e. the minimal absolutely necessary non-parallelizable activity is to fork the other threads and join them again. With the present technology, no such action can be shorter than one processor clock period¹¹. That is, the absolute minimum value of the non-parallelizable fraction is given as the ratio of the time of these two clock periods to the total execution time. The latter time is a free parameter in describing the efficiency, i.e. the value of the effective parallelization also depends on the total benchmarking time (and so does the achievable parallelization gain, too).
This dependence is of course well known to supercomputer scientists: for measuring the efficiency with better accuracy (and also for producing better α_eff values), hours of execution time are used in practice. In the case of benchmarking Taihulight [12], a benchmark runtime of 13,298 seconds was used; on the 1.45 GHz processors this means about 2·10¹³ clock periods. The inherent limit of (1 − α_eff) at such a benchmarking time is 10⁻¹³ (or equivalently, the achievable performance gain is 10¹³). If the fork/join is executed by the OS, as usual, then because of the needed context switchings 2·10⁴ [43] clock cycles are needed rather than the 2 clock cycles considered in the idealistic case, i.e. the derived values are correspondingly 4 orders of magnitude different; that is, the performance gain cannot be above 10⁹. For the development of the achieved performance gain and the (1 − α_eff) values for the top supercomputers, see Fig. 8. In the following, for simplicity, 1.00 GHz processors (i.e. 1 ns clock cycle time) will be assumed.

¹¹ Taking these two clock periods as an ideal (but not realistic) case, the actual limitation will surely be (thousands of times) worse than the one calculated for this idealistic case. The actual number of clock periods depends on many factors, as discussed below.
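The estimates above can be reproduced with a few lines of Python. The fork/join overheads (2 clock periods in the idealistic case, ~2·10⁴ cycles with OS-mediated context switching [43]) and the benchmark runtime and clock rate are taken from the text; the results are order-of-magnitude values only.

def inherent_one_minus_alpha(overhead_cycles, benchmark_s, clock_hz):
    """Lower bound on (1 - alpha_eff): overhead cycles relative to the total run."""
    return overhead_cycles / (benchmark_s * clock_hz)

runtime_s, clock_hz = 13_298.0, 1.45e9          # Taihulight HPL run parameters [12]
ideal   = inherent_one_minus_alpha(2,     runtime_s, clock_hz)
with_os = inherent_one_minus_alpha(2.0e4, runtime_s, clock_hz)
print(f"ideal fork/join : (1 - alpha) ~ {ideal:.1e},  gain <= {1.0 / ideal:.1e}")    # roughly 1e-13 and 1e13
print(f"with OS switches: (1 - alpha) ~ {with_os:.1e}, gain <= {1.0 / with_os:.1e}")  # roughly 1e-9 and 1e9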
The supercomputers are also distributed systems. In a stadium-sized supercomputer a distance between the processors (cable length) of about 100 m can be assumed. The net signal round-trip time is ca. 10⁻⁶ seconds, or 10³ clock periods, i.e. in the case of a finite-sized supercomputer the performance gain cannot be above 10¹⁰ (or 10⁶ if context switching is also needed). The presently available network interfaces have 100...200 ns latency times, and sending a message between processors takes a time of the same order of magnitude. Since the signal propagation time is longer than the latency of the network, this also means that making better interconnections is not really a bottleneck in enhancing computing performance. This statement is underpinned also by statistical considerations [47].
Taking the (maybe optimistic) value of 2·10³ clock periods for the signal propagation time, the value of the effective parallelization (1 − α_eff) will be at best in the range of 10⁻¹⁰, only because of the physical size of the supercomputer. This also means that the expectations against the absolute performance of supercomputers are excessive: assuming a 100 Gflop/s processor and a realistic physical size, no operating system and no non-parallelizable code fraction, the achievable absolute nominal performance (see Eq. (5)) is 10¹¹·10¹⁰ flop/s, i.e. 1000 EFlops. To implement this, around 10⁹ processors are required. One can assume that the value of (1 − α_eff) will be¹² around the value 10⁻⁷. With those very optimistic assumptions (see Equ. (4)) the payload performance for the benchmark HPL will be less than 10 Eflops, and for the real-life applications of the class of the benchmark HPCG it will surely be below 0.01 EFlops, i.e. lower than the payload performance of the present TOP1-3 supercomputers.
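A rough sketch of the estimate above for such a fictive exascale configuration; the core count and the (1 − α) values for the HPL-like and HPCG-like cases are assumptions made only for illustration, not measured data.

def payload_flops(p_single, k, one_minus_alpha):
    """Payload performance: single-processor performance times the gain of Equ. (1)."""
    alpha = 1.0 - one_minus_alpha
    gain = 1.0 / (one_minus_alpha + alpha / k)
    return p_single * gain

k, p_single = 1e9, 1e11                       # assumed ~1e9 cores of 100 Gflop/s each
for label, oma in (("HPL-like ", 1e-7), ("HPCG-like", 2e-5)):   # assumed (1 - alpha) values
    print(f"{label}: payload ~ {payload_flops(p_single, k, oma):.1e} flop/s")
# HPL-like : ~1e18 flop/s (below 10 EFlops)
# HPCG-like: ~5e15 flop/s (below 0.01 EFlops)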
These predictions enable assuming that the presently achieved value of (1 − α) persists also for roughly a hundred times more cores. However, another major issue arises from the computing principle SPA: on an SPA bus only one core at a time can be addressed. As a consequence, at minimum as many clock cycles are to be used for organizing the parallel work as many addressing steps are required. Basically, this number equals the number of cores in the supercomputer, i.e. the addressing in the TOP10 positions typically needs clock cycles in the order of 5·10⁵...10⁷, degrading the value of (1 − α) into the range 10⁻⁶...2·10⁻⁵. The number of the addressing steps can be mitigated using clustering, vectoring, etc., or, at the other end, the processor itself can take over the responsibility of addressing its cores [58]. Depending on the actual construction, the reducing factor of clustering of those types can be in the range 10¹...5·10³, i.e. the resulting value of (1 − α) is expected to be around 10⁻⁷. Notice that utilizing "cooperative computing" [58] further enhances the value of (1 − α), but it already means utilizing a (slightly) different computing paradigm: the cores have a direct connection and can communicate with the exclusion of the main memory.

¹² With the present technology the best achievable value is ca. 10⁻⁶, which was successfully enhanced by clustering to ca. 2·10⁻⁷ for the leading clustered systems, and the special cooperating cores of Taihulight enabled achieving 3·10⁻⁸.
An operating system must also be used, for protection and convenience. If one considers context switching with its consumed 2·10⁴ cycles [43], the absolute limit is ca. 5·10⁻⁸, on a zero-sized supercomputer. This value is somewhat better than the limiting value derived above, but it is close to that value and surely represents a considerable contribution. This is why Taihulight runs the actual computations in kernel mode [58].
Sum of the non-parallelizable contributions
Notice the special role of the non-parallelizable activities: independently of their origin, they are summed up as the 'sequential-only' contribution and considerably degrade the payload performance. In systems comprising parallelized sequential processes, actions like communication (including also MPI), synchronization, accessing shared resources, etc. [1, 10, 19, 57] all contribute to the sequential-only part. Their effect becomes more and more drastic as the number of processors increases. One must take care, however, of how the communication is implemented. A nice example is shown in [4] of how direct core-to-core (in other words: direct thread-to-thread) communication can enhance parallelism in large-scale systems.
Competition of the contributions for the dominance
As discussed above, the different contributions to (1 − α) depend on different factors, so their ranking in affecting the value of α changes with the nominal performance and with how the system is assembled from SPA processors. Fig. 4 attempts to provide a feeling of the effect of the software contribution. A fictive supercomputer (with behavior somewhat similar to that of the supercomputer Taihulight) is modeled. All subfigures have dual scaling. The blue diagram line refers to the right-hand scale and shows the payload performance corresponding to the actual contributions; all the rest refer to the left-hand scale and display the (1 − α) contributions to the non-parallelizable fraction (for the details see [49]). The turn-back of the (1 − α) diagram clearly shows the presence of the "performance wall" (compare it to Fig. 1 in [41]).
For the sake of simplicity, only those components are depicted that have some role in forming the (1 − α) value. In some other special cases other contributions may dominate. For example, as presented in [52], in the case of brain simulation a hidden clock signal is introduced, and its effect is in close competition with the effect of the frequent context switchings for dominating the achievable performance. Notice that the performance breakdown shown in the figures was experimentally measured by [41], [28] (Fig. 7) and [2] (Fig. 8). The looping contribution becomes remarkable around 0.1 Eflops, and breaks down the payload performance when approaching 1 Eflops. In the right subfigure the behavior measured with the benchmark HPCG is displayed. In this case the contribution of the application (thin brown line) is much higher, while the looping contribution (thin green line) is the same as above. As a consequence, the achievable payload performance is lower and the breakdown of the performance is also softer.
The future of supercomputing
Because of all of this, in the name of the company PEZY¹³ the last two letters are surely obsolete. Also, no Zettaflops supercomputers will be delivered for science and military [14].
Experts expect the performance¹⁴ to achieve the magic 1 Eflop/s around the year 2020, see Fig. 1 in [33], although question marks, mystic events and communications have already appeared as the date approaches. The authors noticed that "the performance increase of the No. 1 systems slowed down around 2013, and it was the same for the sum performance", but they extrapolate linearly and expect that the development continues and that "zettascale computing" (i.e. 10⁴ times more than the present performance) will be achieved in just more than a decade. Although they address a series of important questions, the question whether building computers of such a size is feasible remains out of their sight.
From the TOP500 data, as a prediction, the payload performance values can be calculated as a function of the nominal performance, see Fig. 3 a). The reported (measured) performance values are marked by bubbles on the figure. When making that prediction, the number of processors was virtually changed for the different configurations, without correcting for the increasing looping delay; i.e. the graphs are strongly optimistic, see also Fig. 4. As expected, the values (calculated in this optimistic way) saturate around 0.35 Eflop/s. Without some breakthrough in the technology and/or paradigm, even approaching the "dream limit" is not possible.

¹³ https://en.wikipedia.org/wiki/PEZY_Computing: The name PEZY is an acronym derived from the Greek-derived metric prefixes Peta, Exa, Zetta, Yotta.

¹⁴ There are some doubts about the definition of exaFLOPS, whether it means the nominal or the payload performance; in the former case, whether it includes accelerator cores, and in the latter case, measured by which benchmark. Here the term is used for the payload performance.
Piz Daint
Due to the quick development of the technology, supercomputers usually do not have many items registered in the TOP500 database over their development. One of the rare exceptions is the supercomputer Piz Daint. Its development history spans 6 years and two orders of magnitude in performance, and it used both non-accelerated computing and accelerated computing with two different accelerators. Although usually more than one of its parameters was changed between the registered stages of its development, it nicely underpins the statements of the paper. Fig. 3 b) displays how the payload performance developed as a function of the nominal performance in the case of Piz Daint (see also Fig. 2 in [52]). The bubbles display the measured performance values documented in the TOP500 database [42], and the diagram lines show the (at that stage) predicted performance. As the diagram lines show the "predicted performance", the accuracy of the prediction can also be estimated through the data measured in the next stage. It is very accurate for short distances, and the jumps can be qualitatively understood when knowing their reason. In the previous section the accuracy of the predictions based on the model was left open. This figure also validates the prediction for the TOP10 supercomputers depicted in Fig. 3 a).
The data from the first two years of Piz Daint (non-accelerated mode of operation) can be compared directly. Increasing the number of cores results in the expected higher performance, as the working point is still in the linear region of the efficiency surface. The value slightly above the predicted one can be attributed to the fine-tuning of the architecture.
Introducing accelerators resulted in a jump of payload efficiency (and also moved the working point to the slightly non-linear region, see Fig. 5), and the payload performance is roughly 3 times more than would be expected purely from the predicted value calculated from the non-accelerated architecture. According to general experience [31], only a small fraction of the computing power hidden in the GPU can be turned into payload performance, and the efficiency is only about 3 times higher than it would be without accelerators. The designers might not have been satisfied with the accelerator, so they changed to another one, with a slightly higher nominal performance but a much larger separated memory space. The result was disappointing: the slight increase of the nominal performance of the GPU could not counterbalance the increased time needed to copy between the separated larger address spaces, which finally resulted in a breakdown of both the value of (1 − α) and the efficiency, although the payload performance slightly increased. Introducing the GPU accelerator increases the absolute performance, but (through introducing the extra non-parallelizable component of copying the data) increases the value of (1 − α) and decreases the efficiency; for a discussion see section 6.2. The decrease is the more considerable the more data are to be copied. Again, the fine-tuning has helped both the efficiency and (1 − α) to reach a better value.

Figure 5: The positions (efficiency values) of the supercomputer "Piz Daint" on the two-dimensional efficiency surface in the different stages of its building; calculated from the publicly available database [42].
Gyoukou
A nice "experimental proof" for the existence of the performance limit is the one-time appearance of supercomputer on the TOP500 list in Nov. 2017. They did participate in the competition with using 2.5M cores (out of the 20M available) and their (1 − ) value was 1.9 * 10 −7 , comparable with the data of (2.4M and 1.7 * 10 −7 ). Simply, the performance bound did not enable to increase the payload performance further.
Brain simulation
Artificial intelligence (including simulating the brain operation using computing devices) shows exponentially growing interest, and the size of such systems is also continuously growing. In the case of brain simulation the "flagship goal" is to simulate tens of billions of neurons, corresponding to the capacity of the human brain. Those definitely huge systems really go to the extremes, but they also undergo the common limitation of large-scale parallelized sequential systems. In recent studies it was shown that using the present methods (paradigm and technology) the behavioral level of the brain is simply out of reach of the research [2].
It was shown recently [52] that the special method of simulating artificial neural networks, using a "time grid", causes a breakdown at relatively low computing performance, and that under those special conditions the frequent context switchings and the permanent need for synchronization are competing for dominating the performance of the application.
Statistical underpinning
By now, supercomputing has a quarter of a century of history and a well-documented and rigorously verified database [42] on architectural and performance data. The huge variety of solutions and ideas does not ease drawing conclusions and especially making forecasts for the future of supercomputing. The large number of available data, however, enables drawing reliable general conclusions about some features of parallelized sequential computing systems. Those conclusions have of course only statistical validity because of the variety of sources of components, different technologies and ideas, as well as the interplay of many factors. That is, the result shows considerable scattering and requires an extremely careful analysis. The large number of cases, however, enables drawing some reliable general conclusions.
Correlation between the number of cores and the achieved rank
Since the resulting performance (and so the ranking) depends both on the number of processors and on the effective parallelization, those quantities are correlated in Fig. 6. As expected, in the TOP50 supercomputers the higher the ranking position, the higher the required number of processors in the configuration, and, as outlined above, the more processors, the lower the (1 − α) value required (provided that the same efficiency is targeted).
In the TOP10, the slope of the regression line on the left subfigure changes sharply relative to the TOP50 regression line, showing the strong competition for a better ranking position. Maybe the value of the slope can provide the "cut line" between "racing supercomputers" and "commodity supercomputers". On the right subfigure, the TOP10 data points provide the same slope as the TOP50 data points, demonstrating that to produce a reasonable efficiency, the increasing number of cores must be accompanied by a proper decrease in the value of (1 − α), as expected from Equ. (4). Furthermore, to achieve a good ranking, a good value of (1 − α) must be provided. Recall that the excellent performance of Taihulight shall be attributed to its special processor, deploying "cooperative computing" [58].
Deploying accelerators (GPUs)
As suggested by Eq. (5), the trivial way to increase the absolute performance of a supercomputer is to increase the single-processor performance of its processors. Since the single-processor performance has reached its limits, some kind of accelerator (mostly General-Purpose Graphics Processing Unit (GPGPU)) is frequently used for this goal. Fig. 7 shows how utilizing accelerators influences the ranking of supercomputers. The two important factors of supercomputers, the single-processor performance and the parallelization efficiency, are displayed as a function of ranking.
As the left side of the figure depicts, the coprocessor-accelerated cores show the lowest performance; they really can benefit from acceleration¹⁵. The GPGPU-accelerated processors really increase the performance of the processors by a factor of 2-3. This result confirms the results of a former study, where an average factor of 2.5 was found [31]. However, this increased performance is about 40..70 times lower than the nominal performance of the GPGPU accelerator. The effect is attributed to the considerable overhead [9], and it was demonstrated that by improving the transfer performance the application performance can be considerably enhanced. Indirectly, that research also proved that the operating principle itself (i.e. that the data must be transferred to and from the GPU memory; and recall that GPUs do not have cache memory) takes some extra time. In terms of Amdahl's law, this transfer time contributes to the non-parallelizable fraction. The right side of the figure reveals this effect. The effective parallelization of the GPU-accelerated systems is nearly ten times worse than that of the coprocessor-accelerated processors and about 5 times worse than that of the non-accelerated processors, i.e. the resulting efficiency is worse than in the case of utilizing unaccelerated processors; this is a definite disadvantage when GPUs are used in systems with an extremely large number of processors.

¹⁵ In the number of the total cores the number of coprocessors is included.
The key to this enigma is hidden in Eq. (4): the payload performance increases by a factor of nearly 3, but the value of (1 − α) (increased by nearly an order of magnitude) is multiplied by the number of cores in the system. In other words: while deploying GPGPU-accelerated cores in systems having a few thousand cores is advantageous, in supercomputers having processors in the range of millions it is a rather expensive way to make supercomputer performance worse. This makes it at least questionable whether it is worth utilizing GPGPUs in large-scale supercomputers. For a discussion see section 5.5, for a direct experimental proof see Figs. 3 and 5.
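The following toy calculation illustrates this trade-off: an accelerator is assumed to triple the single-processor performance while making (1 − α) ten times worse, and whether the acceleration pays off depends on the number of cores. All numbers are assumptions chosen for illustration only.

def payload(p_single, k, one_minus_alpha):
    """k * P_single * E(k, alpha), with E from Eq. (4)."""
    alpha = 1.0 - one_minus_alpha
    return k * p_single / (k * one_minus_alpha + alpha)

p_base = 1e11            # assumed baseline single-processor performance, flop/s
for k in (1e3, 1e6, 1e7):
    plain = payload(p_base, k, 1e-7)        # unaccelerated, assumed (1 - alpha) = 1e-7
    accel = payload(3 * p_base, k, 1e-6)    # accelerated: 3x per-core flops, 10x worse (1 - alpha)
    print(f"k = {k:.0e}: accelerated / unaccelerated payload = {accel / plain:.2f}")
# k = 1e+03: ~3.00  (acceleration pays off)
# k = 1e+06: ~1.65
# k = 1e+07: ~0.55  (acceleration makes the large system slower)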
As the left subfigure shows, neither kind of acceleration shows a correlation between the ranking of the supercomputer and the type of the acceleration. Essentially the same is confirmed by the right side of the figure: the performance amplification rises with the better ranking position, and the slope is higher for any kind of acceleration: moving the data from one memory to the other takes time.
The supercomputer timeline and the "roofline" model
As a quick test, Equ. (7) can be applied to the data from [42], see Fig. 8 a). As shown, supercomputer history is about the development of the effective parallelism, and Amdahl's law formulated by Equ. (7) is for the effective parallelization what Moore's law is for the size of electronic components. (The effect of Moore's law is eliminated when calculating α_eff.) To understand the behavior of the trend line, just recall Equ. (4): to increase the absolute performance, more processors shall be included, and to provide a reasonable efficiency, the value of (1 − α) must be properly reduced.
The "roofline" model [56] is successful in fields where some resource limits the maximum performance resulting from the interplay of the other components. Here the limiting resource is the performance gain (originated from Amdahl's law, the technical implementation and the computing paradigm together) that limits the engineering solutions. Their interplay may result in more or less perfect performance gains, but no combination enables to exceed that absolute limit.
The two roofline levels displayed in Fig. 8 b), right side, correspond to the values measurable with the benchmarks HPL and HPCG, respectively. The latter benchmark has a documented history of three items only, but the data are convincing enough to set a reliable roofline level. Here the contribution of the SW dominates (the "real-life" application class), so the architectural solution makes no big difference, see also section 3.3.
The third roofline level (see section 5.7) is inferred from the single available measured datum [2]. The two smaller black dots show the performance data of the full configurations (as measured by the benchmark HPCG) of the two supercomputers the authors had access to, and the red dot denotes the saturation value they experienced (and so they did not deploy more hardware). This strongly supports the assumption that the achievable supercomputer performance depends on the type of the application. The roof level concluded from the performance gain measured by the benchmark HPL measures only the effect of organizing the joint work plus the fork/join operation.
The roof level concluded from the HPCG benchmark is about two orders of magnitude lower because of the increased amount of communication needed for the iteration. In the case of brain simulation the top level of the performance gain is two more orders of magnitude lower because of the need for more intensive communication (many other neurons must also be periodically informed about the result of the neural calculation). In the case of AI networks the intensity of communication is between the last two (depending on the type and size of the network), so correspondingly the achievable performance gain must also reside between the last two roof levels.
The figure demonstrates how the "communication-to-computation ratio" introduced by [41] affects the achievable parformance gain. Notice that the achieved performance gain (=speedup) of brain simulation is about 10 3 and based on the amount of communication, Artificial Neural Networks also cannot show up much higher performance gain. The bottleneck is not the performance of the floating operations; rather the need of communication and the exceptionally high "communicationto-computation ratio". Without the need of organizing the joint work there would not be a roof level at all: adding more cores would increase the performance gain with a permanent slope. With having non-zero non-parallelizable contribution the roof level appears and the higher that contribution is, the lower is the value of the roof level (or, in other words: the higher is the non-parallelizable contribution the lower is the nominal performance at which the roofline effect appears; in the figure expressed in years).
The results of the benchmark HPL show a considerable scatter, and some points are even above the roofline. This benchmark is sensitive to architectural solutions like clustering (internal or external), accelerators, absolute performance, etc., since here no ab ovo dominating component is present. Because of this, some "relaxation time" was and will be needed until the right combination resulting in a performance gain approaching the roofline was/will be found.
The three points above the roofline belong to the same supercomputer, Taihulight. Its "cooperating processors" [58] work using a slightly different computing paradigm: they slightly violate the principles of the SPA, which is applied by all the rest of the supercomputers. Because of this, its performance gain is limited by a slightly different roofline: changing the computing paradigm (or the principles of implementation) changes the rules of the game. This also hints at a possible way out: the computing paradigm shall be modified [50, 48] in order to introduce a higher roofline level.
Summary
The payload performance of parallelized sequential computing systems has been analyzed both theoretically and using the supercomputer database with well-documented performance values. It was shown that both the (strongly simplified) theoretical description and the empirical trend show a limitation for the payload performance of large-scale parallelized computing, at the same value. The difficulties experienced in building ever-larger supercomputers, and especially in utilizing artificial intelligence applications on supercomputers or building brain simulators from SPA computer components, convincingly prove that the present supercomputing has achieved what was enabled by the computing paradigm and the implementation technology. To step further to the next level [21], a real rebooting is required, among others renewing computing [50] and introducing a new computing paradigm [46]. The "performance wall" [55] has been hit.
