Integrating Amdahl's and Amdahl-like laws with Divisible Load Theory promises a mathematical design methodology that will make possible the efficient design of tomorrow's systems.
Our goal should be to make it possible to achieve a solid understanding of complex computer and information systems design using mathematical modeling. To this end we propose integrating the foundational Amdahl's Law and variants with divisible load scheduling theory to provide such an understanding.
Divisible Load Theory
To deal with the large amounts of data in modern computing systems, divisible load theory (DLT) has emerged as a potential tool. Divisible loads are large amounts of finely parallelizable data. The data has no precedence relationships and can be divided into parts of arbitrary size. This is a different paradigm from atomic task scheduling. Work since 1988 [1, 2] has established means of distributing and processing such load in a time-optimal fashion in many types of networks [3, 4]. The theory is of interest when loads are in fact divisible, or as an approximation in the spirit of fluid flow packet models. Potential applications include image signal processing, big data and massive experimental data processing.
Such loads are commonly encountered in applications where a great number of similar data units are processed. Generally, scheduling in a DLT model occurs in two steps: load distribution and load processing. The data is usually distributed from one (or more) processors to multiple processors and processed in parallel. An optimal schedule is one that achieves the minimum finish time (makespan). Linear equations or recursions are widely used in DLT analysis, which makes the models efficiently solvable.
The significance of integrating Amdahl-like laws with divisible load scheduling theory is to give designers (i.e. computer scientists and engineers) the mathematical tools to aid this growing technological revolution, in much the same way as Steinmetz's mathematical work at the turn of the last century made possible over a century of systematic and tractable design of alternating current electrical systems. Today's and tomorrow's systems that will benefit from this include 5G and systems in health care, social media, commerce, government and scientific research.
Amdahl's Law
Amdahl argued in 1967 [5, 6] that even if one could solve the parallel part of a program in near zero time due to the use of a large number of parallel processors, the bottleneck was the sequential part of the program which could only be processed on a single processor.
The performance metric called "speedup", $S$, is a basic way of expressing parallel processing time advantage. It is defined as the ratio of the solution time of a problem on one processor, $T(1)$, to the solution time of the same problem on $n$ processors, $T(n)$:

$$S = \frac{T(1)}{T(n)} \tag{1}$$

To write this mathematically, let $f$ be the workload fraction that is parallelizable and $1 - f$ be the workload fraction that is serial. Let $n$ be the number of homogeneous (i.e. identical) processors. Let $T(1)$ be the time to solve the workload on one processor and $T(n)$ be the time to solve the workload on $n$ processors. Finally let $\tau$ be the serial execution time for the entire program.
Then:

$$T(n) = (1 - f)\,\tau + \frac{f\,\tau}{n} \tag{2}$$

Here $T(n)$ is a weighted sum of serial and parallel execution time. The parallel execution time is $f\tau / n$: the parallel workload, $f\tau$, divided by $n$, the number of processors used. Here also it is assumed that there is no time overlap between the serial and the parallel execution.
So, one has in terms of speedup, Amdahl's Law:

$$S = \frac{T(1)}{T(n)} = \frac{\tau}{(1 - f)\,\tau + \frac{f\,\tau}{n}} = \frac{1}{(1 - f) + \frac{f}{n}} \tag{3}$$
In a 1988 paper J.L. Gustafson argued that the Amdahl's Law assumption of constant problem size rarely holds in practice [7]. More cores are normally used to solve larger and more complicated problems. Thus, one would be justified in having a parallel fraction that grows linearly in problem size (i.e. using $nf$ instead of a single $f$). One finds [7] $S = (1 - f) + nf$. One could have a parallel fraction growth factor, $g(n)$, that is between a constant (Amdahl's Law) and linear growth (Gustafson's Law) (see [8, 9, 10]). It is not the only possibility, but one could use a square root function, $g(n) = \sqrt{n}$. This leads to a general law with speedup between that of Amdahl's and Gustafson's laws.
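The three scaling regimes described above can be compared numerically. The following is a minimal Python sketch rather than code from the cited papers. The function names are hypothetical, and the form of the interpolated law is an assumption made for illustration: the parallel work is scaled by a growth factor `g(n)` and then divided over `n` processors. With `g(n) = 1` this form reduces to Amdahl's Law and with `g(n) = n` it reduces to Gustafson's Law.

```python
import math
from typing import Callable


def amdahl_speedup(f: float, n: float) -> float:
    """Amdahl's Law: fixed problem size, parallel fraction f, n processors."""
    return 1.0 / ((1.0 - f) + f / n)


def gustafson_speedup(f: float, n: float) -> float:
    """Gustafson's Law: the parallel part grows linearly with n."""
    return (1.0 - f) + n * f


def scaled_speedup(f: float, n: float, g: Callable[[float], float]) -> float:
    """Speedup when the parallel work is scaled by a growth factor g(n).

    Assumed form (illustrative only): the scaled workload takes
    (1 - f) + f*g(n) time units on one processor and
    (1 - f) + f*g(n)/n time units on n processors.
    """
    growth = g(n)
    return ((1.0 - f) + f * growth) / ((1.0 - f) + f * growth / n)


if __name__ == "__main__":
    f, n = 0.9, 16  # illustrative values, not taken from the paper
    print("Amdahl     :", amdahl_speedup(f, n))
    print("sqrt growth:", scaled_speedup(f, n, math.sqrt))
    print("Gustafson  :", gustafson_speedup(f, n))
```

For any parallel fraction strictly between 0 and 1 and more than one processor, the square root variant falls between the other two, which is the sense in which it gives a law between Amdahl's and Gustafson's.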
Amdahl's Law has inspired a number of interesting and useful studies over the past years. A representative sample includes Hill and Marty, who in 2008 [8, 9] applied Amdahl's Law to multicore architectures and attempted to answer system level design questions. Marowka did a performance study applying Amdahl's Law to systems of CPUs and GPUs [11]. Cassidy found objective functions for average delay and average energy using Amdahl's Law [12]. Díaz-del-Río [13] presented a performance study of when it is preferable to off-load computation from a mobile device to the cloud.
A More Complete Approach
Over time, expressions (often closed form) for divisible load model speedup have been developed for various multi-processor interconnection topologies and load distribution policies. Interconnection topologies include buses, stars, multi-level tree networks, meshes, hypercubes and other networks. Load distribution policies include sequential load distribution and concurrent load distribution, the latter with simultaneous start or staggered start. In all these cases Amdahl's Law can be modified to be:

$$S = \frac{1}{(1 - f) + \frac{f}{S_D(n)}} \tag{4}$$

Here $S_D(n)$ is the speedup of a divisible load model of any architectured parallel facility with $n$ processors. Such a facility is a basic model that has no sequential component but considers the facility issues, which involve degrees of efficiency due to communication delay, interconnection topology, load distribution policy and the relative difference in computation and communication intensity and speeds. Significantly, these additional factors can now be included in Amdahl-like laws.

Figure 1: Network in which load is distributed from the root node to the children nodes. Here, $z_i$ is the $i$th link's inverse link speed and $w_i$ is the $i$th processor's inverse computing speed.
The boxed material indicates analytical divisible load speedup expressions for three fundamental load distribution protocols in the single level tree network (star type network) [14]. The order of processors that achieves the shortest finishing time is $z_1 \le z_2 \le z_3 \le \dots \le z_n$. Intuitively, processors with faster links receive load before those with slower links.
The timing diagrams for communication and computation are shown in Figure 2. The first model is sequential load distribution, where the source (root) node distributes load to one child processor at a time in one pass.
The second and third models involve simultaneous (concurrent) load distribution of load over all links. In the second model (staggered start) computation at a child begins only once all its computational load is received from the source node. In the third model (simultaneous start) computation at a child begins as soon as it begins to receive load.
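The three timing models can be turned into speedup calculations directly from their descriptions. The Python sketch below is derived from those descriptions under stated assumptions and is not claimed to reproduce the boxed expressions. The assumptions are: the root also computes from time zero; all processors finish at the same instant; speedup is measured against the root processing the whole load alone; children are listed in distribution order; and, for simultaneous start, each link is at least as fast as its processor, so a child's finish time is set by computation. The names `w0`, `w`, `z`, `t_cp` and `t_cm` stand for the root's inverse computing speed, the children's inverse computing speeds, the inverse link speeds, and the computing and communication intensity constants. All function names are hypothetical.

```python
from typing import Sequence


def sequential_speedup(w0: float, w: Sequence[float], z: Sequence[float],
                       t_cp: float = 1.0, t_cm: float = 1.0) -> float:
    """Model 1: root sends to one child at a time; a child computes
    only after its whole share has arrived."""
    share = 1.0                  # load share relative to the root's share
    total = 1.0                  # sum of relative shares (root counts as 1)
    prev_compute = w0 * t_cp     # compute time per unit load of previous node
    for wi, zi in zip(w, z):
        # Equal finish times: prev share * compute = this share * (comm + compute)
        share *= prev_compute / (zi * t_cm + wi * t_cp)
        total += share
        prev_compute = wi * t_cp
    return total


def staggered_speedup(w0: float, w: Sequence[float], z: Sequence[float],
                      t_cp: float = 1.0, t_cm: float = 1.0) -> float:
    """Model 2: all links used at once; a child computes after receiving."""
    return 1.0 + sum(w0 * t_cp / (zi * t_cm + wi * t_cp) for wi, zi in zip(w, z))


def simultaneous_speedup(w0: float, w: Sequence[float], z: Sequence[float],
                         t_cp: float = 1.0, t_cm: float = 1.0) -> float:
    """Model 3: all links used at once; a child computes while receiving."""
    for wi, zi in zip(w, z):
        if zi * t_cm > wi * t_cp:
            raise ValueError("sketch assumes each link is at least as fast as its processor")
    return 1.0 + sum(w0 / wi for wi in w)


def amdahl_like_speedup(f: float, s_dlt: float) -> float:
    """Amdahl's Law with the processor count replaced by a DLT speedup."""
    return 1.0 / ((1.0 - f) + f / s_dlt)


if __name__ == "__main__":
    # Illustrative homogeneous parameters, not the paper's Table 1 values.
    w0, w, z = 2.0, [2.0] * 4, [0.5] * 4
    for name, model in [("sequential", sequential_speedup),
                        ("staggered", staggered_speedup),
                        ("simultaneous", simultaneous_speedup)]:
        s_dlt = model(w0, w, z)
        print(name, s_dlt, amdahl_like_speedup(0.8, s_dlt))
```

Under these assumptions the per-model speedup is the reciprocal of the root's load share, which is why each function only needs to accumulate shares relative to the root.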
We thus have more complete models than Amdahl's original Law. 
where:
For the system with homogeneous processors, the inverse processing speed and link speed of each processor (except for the root) is the same.
In this case, Equation 5 can be simplified as:
where: $\sigma = z T_{cm} / (w T_{cp})$
MODEL 2: Simultaneous Distribution, Staggered Start
For the system with homogeneous processors, Equation 7 can be simplified as:
$T_{cp}$: Computing intensity constant: the entire load is processed in $w_i T_{cp}$ seconds by the $i$th processor;
$T_{cm}$: Communication intensity constant: the entire load is transmitted in $z_i T_{cm}$ seconds over the $i$th link;
$S_D(n)$: The speedup with $n$ processors in the systems using a DLT model;
$S_h$: The speedup with $n$ homogeneous processors in the systems using a DLT model;
Calculation and Analysis
To test and compare the speedup levels for different networks, the boxed equations were inserted into Equation 4 and compared with Amdahl's original Law (Equation 3). The values used are listed in Table 1. Systems with heterogeneous processors and with homogeneous processors are both tested, and the results are shown in Figures 3, 4, 5 and 6. We find:
Results Depend on Parameters:
By comparing Figures 3 and 4, one can observe that the speedup values for the system with homogeneous processors are higher than the values for the system with heterogeneous processors for our parameters. For example, for the network topology of model 2 (with simultaneous distribution and staggered start), the speedup in Figure 3 with 30 processors is 3.86, and the speedup in Figure 4 with 30 processors is 4.25. For the same model, the curve in Figure 4 is generally higher than the one in Figure 3. This is because the processing speed of the homogeneous processors equals the highest processing speed among the heterogeneous processors, which results in higher computation power for the homogeneous system.
Simultaneous Distribution Beats Sequential Distribution:
By comparing the values for different network topologies, one can see that the systems with simultaneous distribution have higher speedup values than those with sequential distribution. This is because with simultaneous distribution, the processors can all start receiving load near the starting time, while with sequential distribution, the processors ranked lower in the sequence must wait for a considerable length of time.
Simultaneous Start Beats Staggered Start:
Meanwhile, the system with simultaneous start has higher speedup values than the one with staggered start. This is because staggered start means that each processor must wait until it finishes receiving the load before starting computation. With simultaneous start, by contrast, all the processors can start computing as soon as they start to receive load.
Amdahl's Law is an Upper Bound:
Note that the pure Amdahl's Law prediction is the upper bound of Divisible Load Theory analysis because it does not take into account divisible load based inefficiencies. This upper bound could be achieved by using model 3 and setting the root processor's computing speed to be the same as that of the other homogeneous processors.
This upper bound is shown in both Figure 4 and Figure 6, where the system has homogeneous processors. In this case, model 3 (Simultaneous Distribution, Simultaneous Start) will have the same performance as Amdahl's Law's original analysis, which is shown in Equation 3. In our calculation, since the source processor also shares the computing task, the system in fact has $n + 1$ processors working simultaneously. So, the variable $n$ in Equation 3 is updated to be $n + 1$. We also set $w_0 = w = 4.2$. As a result, the curve of model 3 overlaps the Amdahl's Law curve in both Figure 4 and Figure 6.
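The overlap described here can be checked with a few lines of Python. This is a sketch under an assumption, not the paper's boxed equation: for model 3 with homogeneous children whose links are at least as fast as their processors, the divisible load speedup is taken to be one plus the number of children times the ratio of root to child inverse computing speed. The function names and the value of the parallel fraction used in the example are hypothetical.

```python
def model3_homogeneous_speedup(n: int, w0: float, w: float) -> float:
    """Assumed model 3 speedup: root plus n identical children,
    each contributing w0 / w relative to the root."""
    return 1.0 + n * w0 / w


def amdahl_speedup(f: float, processors: float) -> float:
    """Amdahl's Law with the given processor count."""
    return 1.0 / ((1.0 - f) + f / processors)


def amdahl_like_speedup(f: float, s_dlt: float) -> float:
    """Amdahl's Law with the processor count replaced by a DLT speedup."""
    return 1.0 / ((1.0 - f) + f / s_dlt)


def max_gap(f: float, w0: float, w: float, n_max: int = 30) -> float:
    """Largest difference between the model 3 curve and Amdahl's Law
    evaluated at n + 1 processors, for n = 1..n_max children."""
    return max(
        abs(amdahl_like_speedup(f, model3_homogeneous_speedup(n, w0, w))
            - amdahl_speedup(f, n + 1))
        for n in range(1, n_max + 1)
    )


if __name__ == "__main__":
    # w0 = w = 4.2 follows the text; f = 0.8 is an illustrative choice.
    print("gap with w0 == w:", max_gap(0.8, 4.2, 4.2))
    print("gap with w0 != w:", max_gap(0.8, 8.4, 4.2))
```

When the root and the children have the same inverse computing speed the gap is zero up to rounding, which matches the overlapping curves. With unequal speeds the two curves separate.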
The Influence of the Size of the Parallelizable Load ($f$):
The speedup of different divisible load models versus different values of $f$ for systems with heterogeneous and homogeneous processors is shown in Figure 5 and Figure 6, respectively. For the entire calculation, there are 20 children processors in the system. The parameters are the same as in Table 1. Overall, when a system has a higher value of $f$ (the workload fraction that is parallelizable), it has a higher speedup. At the same time, when $f$ has a larger value, the speedup grows more quickly. For example, in Figure 6, $S_h$ is around 4.2 and 6.8 for the two systems with simultaneous distribution (with either staggered start or simultaneous start) at $f = 0.8$ and $f = 0.9$, respectively. When $f = 1$, which means that all data is parallelizable, $S_h$ is around 20.
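As a rough check on the quoted values, the pure Amdahl bound can be evaluated for 21 processors (20 children plus the root, following the $n + 1$ convention used earlier). The short Python sketch below uses a hypothetical function name and does not use the Table 1 parameters, so it gives only the upper bound and not the individual model curves. The bound is 4.2 at a parallel fraction of 0.8, 7.0 at 0.9 and 21 at 1.0, so the quoted figures sit at or slightly below it.

```python
def amdahl_bound(f: float, processors: int) -> float:
    """Amdahl's Law upper bound for parallel fraction f."""
    return 1.0 / ((1.0 - f) + f / processors)


if __name__ == "__main__":
    children = 20  # number of children processors stated in the text
    for f in (0.8, 0.9, 1.0):
        print(f, round(amdahl_bound(f, children + 1), 2))
```

The jump from 7.0 to 21 between the last two values illustrates how sharply the speedup rises as the serial fraction approaches zero.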
Significance
The integration of Amdahl's Law and Amdahl-like laws with Divisible Load Theory is significant in showing how issues besides Amdahl's sequential/parallel paradigm may be included in an overall closed form analytical model of speedup (and makespan as well). For some more involved models, $S_D(n)$ can be found numerically and inserted into Equation 4 as well. Similar integrations can be done for Gustafson's Law and other Amdahl's Law variants. It should be mentioned that it is also possible to substitute the speedup of the pure Amdahl's Law, Equation 3, into the number of processors variable, $n$, in the boxed divisible load equations. The correct way to proceed would depend on the actual application.
Conclusion
Amdahl's Law and its variations provide much to think about when evaluating the performance of parallel systems. The speedup value is a key metric when comparing the performance of different system topologies. Since parallel systems are increasingly prevalent, these issues are likely to be of interest for a considerable amount of time.
