Abstract-Within the scope of multithreaded dataflow, the problem of scheduling and allocation of DOACROSS loops has been discussed, and it has been shown that the so-called staggered allocation offers higher performance and resource utilization than other schemes described in the literature. The staggered scheme, however, produces an unbalanced load among processors. This paper introduces an extension to the staggered scheme, the cyclic staggered scheme, which produces a more balanced distribution of iterations among processors. The cyclic staggered scheme is simulated and its performance improvement is analyzed.
INTRODUCTION
IN a traditional multiprocessor organization, the basis of control-flow processing is extended to allow more than one execution thread to be active at any instant. However, architects of such an organization must address the loss in processor efficiency due to two fundamental issues: memory latency and synchronization overhead. The dataflow model of computation was proposed as an alternative to the conventional control-flow model; it explicitly addresses the issue of programmability as well as memory latency and synchronization. Theoretically, in a dataflow machine, maximal concurrency can be exploited, constrained only by the availability of hardware resources.
There are basically three types of loops: sequential loops, vector loops (DOALL), and loops of intermediate parallelism (DOACROSS) [1]. For a DOALL loop, all N iterations can be executed concurrently. Sequential loops have zero parallelism and would not gain any improvement if executed on a multiprocessor. Hence, loops of intermediate parallelism are of the greatest interest, since these can be scheduled or distributed in various ways to achieve speedup in a multiprocessor environment. The DOACROSS loop model proposed in [1] was intended to model the execution of sequential loops, vector loops, and loops of intermediate parallelism by considering control and data dependencies. This paper considers DOACROSS loops that have a lexically backward dependence (LBD) which cannot be eliminated by reordering the statements. A new loop scheduling scheme that maximizes the utilization of processors while achieving significant speedup is examined.
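As a concrete illustration (our own hypothetical example, not a loop taken from [1]), the following sketch shows a DOACROSS loop whose LBD cannot be removed by reordering: S1 of iteration i needs b[i-1], which is produced by S2 of iteration i-1, and swapping the two statements would only make the a[i] dependence backward instead.

```python
# Hypothetical DOACROSS loop (illustration only).  S1 of iteration i reads
# b[i-1], written by S2 of iteration i-1: the dependence runs from a lexically
# later statement back to an earlier one, i.e., it is a lexically backward
# dependence (LBD).  Swapping S1 and S2 would make the a[i] dependence
# backward instead, so no statement reordering removes the LBD.
n = 16
a, b, c, e = [0.0] * n, [1.0] * n, [2.0] * n, [3.0] * n
for i in range(1, n):
    a[i] = b[i - 1] + c[i] * c[i]   # S1: waits on b[i-1] from the previous iteration
    b[i] = a[i] * e[i]              # S2: produces the value iteration i+1 waits on
```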
STAGGERED DISTRIBUTION SCHEME
The Staggered distribution was originally developed for multithreaded dataflow multiprocessors [4], [7]. If every processor were assigned the same number of iterations, then all processors would take the same amount of time to finish executing the T(S1, Ss) - d portion of their chunk, the dependence-free portion of each iteration (where T(S1, Ss) is the execution time of one iteration and d is the execution time of the LBD portion). But each processor PE_i (1 < i ≤ P) has to wait for processor PE_{i-1} to finish executing the d portion and send the partial results (a synchronization message) before PE_i can continue execution. This creates delays due to the LBD and communication.
To achieve higher performance and resource utilization, the loop iterations are instead distributed according to the following policy: the iterations assigned to PE_i succeed the iterations assigned to PE_{i-1}, with PE_i having m more iteration nodes assigned to it than PE_{i-1}. The delay caused by the iterations assigned to PE_{i-1} equals d per iteration plus the communication cost C. This delay is masked by the T(S1, Ss) - d portion of the m additional iterations assigned to PE_i. As a result, instead of waiting for the message to arrive, the additional iterations in each chunk relative to the previous chunk keep the processor busy. The incremental distribution of iterations among processors is hence determined by

    m = ceil( (n_{i-1} * d + C) / (T(S1, Ss) - d) ),      (1)

where n_{i-1} is the number of iterations allocated to PE_{i-1}, T(S1, Ss) is the execution time of one iteration, d is the delay, and C is the inter-processor communication cost. The number of iterations n_i allocated to PE_i is then

    n_i = n_{i-1} + m.      (2)
The distribution is performed by expanding (2) to determine n_1, the number of iterations assigned to the first processor.
The n_1 value is used to calculate n_i (1 < i ≤ P). The n_i values are then fine-tuned for better resource utilization. This scheme automatically controls and determines the maximum number of processors (maxpe) required for efficient execution of the loop, based on the physical characteristics of the loop and the underlying machine architecture, which yields higher resource utilization. The value of maxpe can be determined by expanding (1) and (2) and considering n_1 and n. The synchronization overhead is only (P - 1) * C, which is significantly less than the synchronization overhead incurred by cyclic scheduling and pre-synchronized scheduling. The staggered scheme, however, distributes an unbalanced load among processors, with the last processor receiving the largest number of iterations. To remedy this problem, and to be able to handle variations in iteration execution times, a modification of this scheme is required; it is presented in the next subsection.
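The following sketch illustrates how such a distribution can be computed from (1) and (2). It is our own illustrative code, not the authors' implementation: n_1 is found by a simple search rather than by expanding (2) in closed form, maxpe is taken to be the number of chunks obtained when the first chunk is a single iteration, and trimming the last chunk merely stands in for the fine-tuning step mentioned above.

```python
import math

def grow(n1, T, d, C, P):
    """Chunk sizes n_1..n_P from (1)-(2): each chunk exceeds its predecessor
    by m = ceil((n_{i-1}*d + C) / (T - d)), just enough extra work to hide the
    predecessor's d portion plus one message of cost C."""
    chunks = [n1]
    for _ in range(P - 1):
        m = math.ceil((chunks[-1] * d + C) / (T - d))
        chunks.append(chunks[-1] + m)
    return chunks

def staggered(n, T, d, C, P):
    """Smallest n_1 (found by search; the paper expands (2) instead) such that
    P staggered chunks cover all n iterations; the last chunk is then trimmed
    so the sizes sum exactly to n.  Assumes P <= maxpe."""
    n1 = 1
    while sum(grow(n1, T, d, C, P)) < n:
        n1 += 1
    chunks = grow(n1, T, d, C, P)
    chunks[-1] -= sum(chunks) - n
    return chunks

def maxpe(n, T, d, C):
    """Processors needed when n_1 = 1 (our reading of how maxpe is fixed)."""
    P = 1
    while sum(grow(1, T, d, C, P)) < n:
        P += 1
    return P

if __name__ == "__main__":
    # Representative loop used later in the paper: T(S1,Ss) = 50, k = 0.1
    # (d = 5), C/T = 0.1 (C = 5), n = 2,000 iterations, P = 8 processors.
    print(maxpe(2000, 50, 5, 5))
    print(staggered(2000, 50, 5, 5, 8))
```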
Cyclic Staggered Distribution
As mentioned earlier, the staggered scheme determines the maximum number of processors (maxpe) required for each loop. If the number of processors available, P, is less than maxpe, the initial distribution for P is the same as for the first P processors of the maxpe distribution. The remaining iterations n_r are then redistributed among the P processors, starting from the first processor, utilizing the staggered concept according to (3), where n_p is the number of iterations previously allocated to processor PE_i. The extended scheme results in a more balanced load and improved speedup over the original staggered scheme on P processors. We refer to this scheme as CS1. In the case of variations in iteration execution time, this scheme can be applied dynamically at run time, using the difference between the worst-case and actual iteration execution times to determine the distribution for the second and subsequent passes.
As an alternative, one iteration is assigned to the first processor in the first pass, and subsequent iterations are assigned to the remaining processors using (2); after the first pass, (3) is applied. We call this scheme CS2. The number of iterations assigned to a processor at each scheduling step for cyclic scheduling, static chunking (SC), staggered distribution (SD), and cyclic staggered distribution (the first version, CS1, and the second version, CS2) has been simulated [6]. It was shown that the two cyclic staggered versions offer a more even distribution than the staggered scheme. CS2 distributes iterations more evenly than CS1, since it assigns smaller chunks per scheduling step. CS1 and CS2 incur more communication cost; however, they offer better overall execution time.
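A rough sketch of the cyclic redistribution follows. It is our own approximation: since (3) is not reproduced here, the second pass simply reuses the increment rule of (1) and ignores n_p, and the first-pass chunks are supplied by the caller (for CS1, the first P chunks of the maxpe distribution; for CS2, a staggered first pass that starts with one iteration on PE_1).

```python
import math

def increment(prev, T, d, C):
    """m from (1): extra iterations needed to hide prev*d + C of delay."""
    return math.ceil((prev * d + C) / (T - d))

def cyclic_staggered(first_pass, n, T, d, C):
    """Second (cyclic) pass shared by CS1 and CS2.  first_pass holds the
    chunks of the initial pass.  CAUTION: eq. (3) is not reproduced in the
    text, so the growth rule below is only our approximation; the published
    rule also involves n_p, the iterations each PE already holds."""
    P = len(first_pass)
    chunks = list(first_pass)
    n_r = n - sum(chunks)             # leftover iterations after the first pass
    prev, i = 1, 0
    while n_r > 0:
        extra = min(prev, n_r)        # next second-pass chunk, staggered like the first pass
        chunks[i % P] += extra        # handed out cyclically over PE_1 .. PE_P
        n_r -= extra
        prev = extra + increment(extra, T, d, C)
        i += 1
    return chunks
```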
Cyclic Staggered for Control-Flow Environment
The cyclic staggered scheme can be adapted to a control-flow environment. To obtain the same behavior in a control-flow environment, the loop has to be separated into two loops [6]: the first loop contains the instructions involved in the T(S1, Ss) - d portion of the loop, and the second contains the instructions involved in the d portion.
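Using the hypothetical loop from the earlier sketch (illustrative only), the separation would look roughly as follows: the first loop performs the work that needs nothing from other iterations, and the second loop carries the LBD chain.

```python
# Loop separation sketch for a control-flow machine, using the hypothetical
# loop from the earlier example.  Loop 1 is the T(S1,Ss) - d portion: it has
# no cross-iteration dependence, so its iterations can be spread over the
# processors.  Loop 2 is the d portion: it carries the LBD, so each PE runs
# it only for its own chunk, after its predecessor's partial result arrives.
n = 16
a, b, c, e = [0.0] * n, [1.0] * n, [2.0] * n, [3.0] * n
tmp = [0.0] * n
for i in range(1, n):            # loop 1: independent portion
    tmp[i] = c[i] * c[i]
for i in range(1, n):            # loop 2: dependent (d) portion, kept in order
    a[i] = b[i - 1] + tmp[i]
    b[i] = a[i] * e[i]
```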
SIMULATION RESULTS
The effectiveness of the Staggered scheme has been simulated and compared against that of static chunking and cyclic scheduling [4]. Dynamic scheduling schemes such as GSS and Factoring were proposed for vector loops and therefore cannot be used to evaluate the staggered scheme. Furthermore, our studies have shown that static chunking performs better than the aforementioned dynamic schemes. We also did not consider pre-synchronized scheduling [5], since the best-case performance of this scheme is equivalent to cyclic scheduling.
Our test-bed includes a representative loop with an execution time of T(S1, Ss) = 50 and Loops 3, 5, 11, 13, and 19 of the Livermore Loops, which have cross-iteration dependencies [3]. In our simulation:
1) The inter-PE communication delays are varied based on the ratio of communication time to iteration execution time (C/T(S1, Ss)).
2) Delays due to the LBD are computed for various values of k, where k = d/T(S1, Ss).
We also computed the average parallelism (AP), which is the ratio of the total execution time to the critical-path length:
    AP = (n * T(S1, Ss)) / (T(S1, Ss) + (n - 1) * d)
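For instance, for the representative loop with T(S1, Ss) = 50, k = 0.1 (so d = 5), and n = 2,000 iterations, this gives AP = (2,000 * 50) / (50 + 1,999 * 5) ≈ 10.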
The number of PEs required to attain maximum speedup for both Staggered distribution (SD) and Static chunking (SC) has been simulated and analyzed. Fig. 1 shows the results for T(S1, Ss) = 50, n = 2,000, and k varying from 0.1 to 0.9. In general, irrespective of the values of n, C/T(S1, Ss), and k, the staggered approach uses fewer processors to obtain significant speedup, even though the combined effect of delays due to the lexically backward dependence (LBD) and inter-processor communication tends to reduce resource utilization. As can be seen from Fig. 1, for k from 0.1 to 0.4 the staggered approach offers a greater speedup, and for k from 0.5 to 0.9 it achieves almost the same speedup factor. The static chunking scheme distributes the iterations evenly among the processing elements without any consideration for delays. Each processor, except the first, is idle for some period of time, which makes static chunking an inefficient scheme. From Figs. 1a and 1d, it can be concluded that the speedup achieved by SD at k = 0.1 decreases from 7.27 to 6.49, while, for SC, it decreases from 5.82 to 4.98. Aside from the fact that SD attains a significantly higher speedup, the speedup for SD decreased by only 10.7 percent, while, for SC, speedup decreased by 14.4 percent when the communication cost was doubled. On the other hand, for k = 0.2, the speedup for SD decreased by only 2.24 percent, but, for SC, speedup decreased by 9.4 percent. Both schemes, however, require fewer PEs to attain maximum speedup as the C/T(S1, Ss) ratio increases.
As expected, cyclic scheduling (CYC) is ineffective if the communication cost is significant. For C/T(S1, Ss) greater than or equal to 1.0, CYC did not produce any speedup; hence, we ran simulations for lower C/T(S1, Ss) ratios. Our simulation results showed that, as the C/T(S1, Ss) ratio increases, the speedups realized by the CYC scheme decrease faster than those of the SD scheme. The speedup achieved by SD at k = 0.1 decreases from 8.14 to 8.00, while for CYC it decreases from 3.33 to 1.67. Aside from the fact that CYC attains a significantly lower speedup, the speedup for SD decreased by only 1.7 percent, while, for CYC, speedup decreased by 50 percent when the communication cost was almost doubled. The speedups for SD from k = 0.2 and up remained constant as the C/T(S1, Ss) ratio increased, while for CYC the speedup falls rapidly, dropping to less than two for C/T(S1, Ss) = 0.5 and less than one when k = 0.6. Finally, the maximum speedups attained by CYC for C/T(S1, Ss) = 1.0 and up are all less than one, which means that these loops would obtain better performance if executed serially on one PE. The number of PEs required to realize maximum speedup for CYC drops to two, independent of k, for C/T(S1, Ss) ≥ 0.5. This is due to the fact that, for C/T(S1, Ss) = 0.5, after two iterations the communication delay is equivalent to the execution time of one iteration, T(S1, Ss). Therefore, the third and fourth iterations can be executed on the same two processors without any additional delay. The cycle repeats for every pair of iterations, so using more processors does not affect the performance.

Fig. 2 depicts the average parallelism (AP) and the maximum speedup (MS) for n = 2,000. In general, regardless of the values of n, C/T(S1, Ss), and k, the SD scheme, in the presence of inter-PE communication, offers a maximum speedup close to the average parallelism.

Tables 1 and 2 show the speedup of SD over SC and CYC when the Livermore Loops were simulated. Timing values and inter-processor communication costs used in the simulation were based upon instruction and communication times for the nCUBE 3200 and 6400 [2]. Loop 19 consists of two loops; therefore, we tested each loop separately (19(1) and 19(2)). The number of iterations for each loop was based on the specification of that loop. Loops 3, 5, and 13 were simulated for n = 1,000, Loop 11 with n = 500, and Loops 19(1) and 19(2) with n = 100. Although the number of iterations for Loop 11 can reach a maximum of 1,000, we felt that 500 would give us a different perspective from Loop 3, since both have the same value of k. There was not much speedup for Loop 13, since it has a negligible delay. For Loops 3, 5, 11, and 19(1 and 2) with PE = 8, the SD scheme utilized fewer PEs than the available number of PEs. This agrees with the results shown earlier in Fig. 1 that SD offers better resource utilization. Furthermore, the number of PEs required also decreases as the communication cost increases.
The effectiveness of the Cyclic Staggered scheme, in its first (CS1) and second (CS2) versions, was simulated and compared against the original Staggered scheme (SD), using the speedup factor as the measure of evaluation. Our test-bed includes the same representative loop, with an execution time of T(S1, Ss) = 50 and n = 2,000. The speedup attained was calculated by varying k, the inter-processor communication cost, and the number of processors (Figs. 3 and 4). As can be seen, both cyclic staggered distribution schemes performed better than SD regardless of the values of n, C/T(S1, Ss), and k, especially when the number of PEs was about halfway between two and maxpe - 1. This is due to the fact that, under such circumstances, both schemes have a larger number of remaining iterations n_r for redistribution, which results in a more balanced load. Also, since the number of PEs is greater than two, the remaining iterations can be distributed over more PEs, again resulting in a more balanced load and hence better speedup. As expected, the speedups start to converge as the number of PEs approaches maxpe, since these schemes then produce a distribution similar to SD; this is more evident in the case of CS1. Finally, both CS1 and CS2 attained an almost linear speedup for smaller numbers of PEs, even with delays due to the LBD and communication cost.

We showed that SD offers better resource utilization, since it attains better speedup than cyclic scheduling and static chunking while utilizing fewer PEs. We also showed that the maximum speedup for SD, in the presence of inter-PE communication, is very close to the average parallelism, which is the maximum speedup possible for a particular loop. Since CS1 and CS2 outperform SD, we can conclude that CS1 and CS2 come even closer to the maximum speedup possible for a particular loop. These advantages, however, are realized only when the number of PEs available is less than maxpe.
SUMMARY AND FUTURE DIRECTIONS
A distribution scheme for DOACROSS loops, the Cyclic Staggered distribution with its two variations, has been introduced. It uses the same concept as our previous strategy, the Staggered distribution: the loop iterations are distributed, albeit unevenly, among processors in order to mask the delay caused by data dependencies as well as inter-PE communication. This approach offers a more balanced load and better speedup. The effectiveness of the new scheme relative to our previous Staggered scheme has been reported, based on simulation and on execution on an nCUBE 2 multiprocessor [6]. The Cyclic Staggered scheme attains better speedup than the Staggered scheme when the number of PEs is less than maxpe (the number of processors needed to attain optimum speedup). It also produces an almost linear speedup for a small number of PEs, even in the presence of LBDs and inter-PE communication.
The success of multithreading depends on how quickly context switching can be supported. This is only possible if threads are resident in fast memories, such as caches. Cache sizes are usually small; hence, the number of active threads, and thus the amount of latency that can be tolerated, is limited. The generality of dataflow scheduling makes it difficult to execute a logically related set of threads through the processor pipeline, thereby removing any opportunity to utilize registers across thread boundaries. Relegating the responsibilities of scheduling and storage management to the compiler alleviates this problem to some extent. Appropriate means of directing scheduling, based on some global-level understanding of program execution, will be crucial to the success of future dataflow architectures. We are currently investigating strategies based on cyclic staggered approaches to enhance locality in dataflow architectures, for the purpose of using cache memories.
