Introduction
Implicit parallelism present in loops is an important source of fine-grained parallelism.
In this paper we present a tightly coupled fine-grained MIMD architecture whose processors can execute relatively independent streams of instructions as well as tightly synchronized instruction streams. The system is designed using a traditional RISC processor augmented with multiprocessor support. Fine-grained parallelism is exploited by executing multiple instructions in parallel on different processors as well as by overlapped execution of instructions on pipelined processors. Globally shared registers and dedicated channel queues are provided, which allow the processors to exchange data at high speed. If no synchronization is required during the communication of a data value from one processor to another, then a globally shared register is used to communicate the data value; otherwise the channel queue from the sending processor to the receiving processor is used. An MIMD system is tolerant of delays caused by unpredictable events such as memory access conflicts.
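The two communication mechanisms described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class names, the overwrite behavior of the register, and the `pick_mechanism` helper are hypothetical and are not taken from the hardware design.

```python
from collections import deque


class SharedRegister:
    """Assumed model of a globally shared register: a write overwrites the
    previous value and a read never consumes or blocks."""

    def __init__(self):
        self.value = None

    def write(self, v):
        self.value = v

    def read(self):
        return self.value


class ChannelQueue:
    """Assumed model of a dedicated sender-to-receiver FIFO: each read
    consumes exactly one value, in the order the values were written."""

    def __init__(self):
        self._q = deque()

    def write(self, v):
        self._q.append(v)

    def read(self):
        if not self._q:
            # In hardware the receiver would wait; the model just reports it.
            raise RuntimeError("receiver would stall: channel is empty")
        return self._q.popleft()


def pick_mechanism(needs_synchronization):
    """Hypothetical helper: channel queue if the transfer must be
    synchronized, shared register otherwise."""
    return ChannelQueue() if needs_synchronization else SharedRegister()
```

In this model the receiver of a channel value is implicitly synchronized with the sender, because a value cannot be read before it has been written, whereas a register read returns whatever value is currently stored.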
Partially supported by an NSF Presidential Young Investigator Award CCR-9157371 to the University of Pittsburgh.
A VLIW machine is unsuitable for the execution of independent instruction streams. After the detection of fine-grained parallelism, the compiler must perform the following steps to generate code for the fine-grained MIMD architecture.
(i) A parallel execution schedule is generated. The schedule should exploit the MIMD nature of the system. By scheduling parallelism such that there are fewer interprocessor data dependencies we improve the performance of the processor pipeline.
(ii) The interprocessor data dependencies, including dependencies across loop iterations, must be resolved through the shared registers and channel queues. We assume that the iteration distances of all dependencies are constants which are known at compile-time.
(iii) The size of each channel queue is fixed. Thus, delays can be caused if an attempt is made to write to a channel that is full. Techniques are required to anticipate and avoid such delays.
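The delay mentioned in step (iii) can be made concrete with a toy cycle-by-cycle simulation of one bounded channel. This is a sketch only: the function name, its parameters, and the timing model (one write attempt per `write_period` cycles, reads starting at `read_start`) are assumptions made for illustration and do not describe the actual hardware timing.

```python
from collections import deque


def count_write_stalls(capacity, n_values, write_period, read_start, read_period):
    """Count the cycles in which the sender wants to write but the bounded
    channel is full. All parameters are illustrative assumptions."""
    queue = deque()
    written = read = stalls = 0
    next_write = 0  # earliest cycle at which the sender has a value ready
    cycle = 0
    while read < n_values:
        # Receiver side: consume one value on its read cycles, if available.
        is_read_cycle = cycle >= read_start and (cycle - read_start) % read_period == 0
        if is_read_cycle and queue:
            queue.popleft()
            read += 1
        # Sender side: write if a value is ready and a slot is free.
        if written < n_values and cycle >= next_write:
            if len(queue) < capacity:
                queue.append(written)
                written += 1
                next_write = cycle + write_period
            else:
                stalls += 1  # channel full: the sender is delayed this cycle
        cycle += 1
    return stalls


# A sender that runs four cycles ahead of the receiver:
print(count_write_stalls(capacity=2, n_values=4, write_period=1, read_start=4, read_period=1))  # 2
print(count_write_stalls(capacity=4, n_values=4, write_period=1, read_start=4, read_period=1))  # 0
```

In this model the stalls disappear once the channel is deep enough to hold every value the sender produces before the receiver starts reading, which is the kind of condition a compiler would have to anticipate when the queue size is fixed.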
Instruction Scheduling
The top-down instruction scheduling algorithm, which generates schedules that exploit parallelism with low interprocessor communication, was developed in previous work [7]. Consider an operation in a DAG that receives its two operands from two other operations in the same DAG. The operation requiring the two operands can be assigned to one of the processors assigned to the operations that compute the operands. This reduces interprocessor dependencies. For a computation containing more parallelism than the processors in the system are able to exploit, schedules involving fewer interprocessor data dependencies can be generated by carrying out scheduling in a top-down fashion. If the number of operations ready to be scheduled is greater than or equal to the number of processors, then several nodes from the subgraphs rooted at these nodes are scheduled on each of the processors. The number of nodes scheduled on each processor equals the number of nodes in the smallest subgraph. By attempting to schedule the same number of operations on each processor, good load balancing and hence better speedups can be expected. Thus, this scheduling algorithm tries to minimize the number of channels needed without sacrificing the degree of parallelism exploited. Reducing interprocessor dependencies also creates opportunities for reordering the code scheduled on a processor to reduce pipeline delays. Instruction reordering techniques developed by Gross and Hennessy [10] can be used for reordering the group of statements that are assigned to a given processor at the same time.
In Figure 2 we show a loop which is first transformed using Aiken and Nicolau's algorithm to expose parallelism.
The schedule Sched1 is generated using top-down scheduling for the execution of the loop on two processors. If we examine the schedule Sched1 we can see that there is no interprocessor data dependency with iteration distance zero. This prevents the processor pipelines from being underutilized due to interprocessor communication. In addition, we can transform Sched1 to Sched2, which moves intraprocessor dependencies further apart and thus further reduces the likelihood of pipeline delays.
... i.e., makes the latter unnecessary. We assume that the iteration distances of all interprocessor dependencies are constants that are known at compile-time. Associated with each dependency is the iteration distance, which is zero for non-loop-carried dependencies and non-zero for loop-carried dependencies.
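The idea of placing an operation on the processor of one of its operand producers can be sketched as follows. This is a deliberately simplified heuristic, not the algorithm from the cited previous work: it omits the subgraph-based grouping described above, and the function names and the dictionary representation of the DAG are assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def schedule_top_down(preds, n_procs):
    """preds maps each operation to the list of operations producing its
    operands. Returns a dict mapping each operation to a processor index."""
    load = [0] * n_procs
    proc = {}
    # static_order() yields every operation after all of its predecessors.
    for op in TopologicalSorter(preds).static_order():
        # Prefer a processor that already computes one of the operands;
        # operations without operands may go to any processor.
        candidates = sorted({proc[p] for p in preds.get(op, ())}) or range(n_procs)
        best = min(candidates, key=lambda p: load[p])  # simple load balancing
        proc[op] = best
        load[best] += 1
    return proc


def interprocessor_edges(preds, proc):
    """Number of operand edges whose producer and consumer are on
    different processors."""
    return sum(proc[p] != proc[op] for op, ps in preds.items() for p in ps)


dag = {"c": ["a", "b"], "d": ["c"]}  # c = a op b; d = f(c)
placement = schedule_top_down(dag, n_procs=2)
print(interprocessor_edges(dag, placement))  # 1
```

In the example, `a` and `b` are spread over the two processors, `c` joins one of them, and `d` follows `c`, so only one operand has to cross between processors.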
Next we compute all implied synchronizations using theorem 2. In the final step we classify each real dependence edge as either requiring a shared register or a channel using theorem 1. The algorithm guarantees that the order in which a receiving processor reads data values from a channel queue is exactly the same as the order in which the data values are written to the channel queue by the sending processor. Thus, the implementation of channels as queues is an appropriate choice.
Step 1: Construction of a Graph Representing the Correct Execution Order
We construct a directed graph G = (V, E) from the parallel schedule and data dependency information, representing the constraints on the execution order of the statements. In the following, t(s, P(s)) denotes the expected time elapsed from the beginning of a loop iteration to the end of the execution of statement s on processor P(s).
V is the set of statements in the computation and E is the set of edges in the graph, which are determined as follows.
An edge is introduced from statement Si to statement Sj if:
(i) P(Si) = P(Sj) and Sj is executed immediately after Si; or
(ii) P(Si) ≠ P(Sj) and there is a data dependency from Si to Sj.
An edge from statement Si to statement Sj is denoted as [t(Si, P(Si)), t(Sj, P(Sj)), d], where d is the iteration distance of the dependency known at compile-time.
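The construction in Step 1 can be sketched in a few lines. The data layout is assumed for illustration: `schedule` maps each processor to its statements in execution order, `deps` lists data dependencies as (source, destination, distance) triples, and `t` gives the expected completion time of each statement within an iteration. Labeling the intraprocessor sequencing edges with distance 0 is also an assumption, since the text does not state their distance.

```python
def build_order_graph(schedule, deps, t):
    """Return (V, E) for the execution-order graph.

    Each edge is (Si, Sj, (t[Si], t[Sj], d)), mirroring the edge label
    [t(Si, P(Si)), t(Sj, P(Sj)), d] used in the text.
    """
    proc_of = {s: p for p, stmts in schedule.items() for s in stmts}
    edges = []
    # Rule (i): consecutive statements on the same processor.
    for stmts in schedule.values():
        for a, b in zip(stmts, stmts[1:]):
            edges.append((a, b, (t[a], t[b], 0)))  # distance 0 is an assumption
    # Rule (ii): data dependencies between statements on different processors.
    for src, dst, d in deps:
        if proc_of[src] != proc_of[dst]:
            edges.append((src, dst, (t[src], t[dst], d)))
    return set(proc_of), edges


schedule = {0: ["S1", "S2"], 1: ["S3"]}
deps = [("S1", "S3", 0), ("S1", "S2", 0), ("S3", "S2", 1)]
t = {"S1": 1, "S2": 2, "S3": 1}
V, E = build_order_graph(schedule, deps, t)
for edge in E:
    print(edge)
```

The dependency from S1 to S2 produces no separate edge in this sketch because both statements run on processor 0, where the sequencing edge from rule (i) already orders them.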
Step 2: Computation of Implied Synchronizations
In this step we compute the set of implied synchronizations between pairs of processors in accordance with theorem 2. The computation requires a single bottom-up traversal of the graph constructed in step 1. In the algorithm in Figure 9 the set I is the set of implied synchronization edges.
Step 3: Identify the Mode of Communication for Interprocessor Data Dependencies
The set of flow dependence edges E is partitioned into the set of edges E_sr, which will make use of shared registers, and the set of edges E_ch, which will make use of channel queues, using the algorithm in Figure 10.
Reducing Delays due to Bounded Channels
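If an interprocessor data dependence with iteration distance ... more than one iteration.
[Figure 10: Algorithm for identifying the mode of communication. For each edge e in E, the edges e' from pi to pj that are subsumed by e are identified; an edge e' satisfying the test is removed from E and added to the set of shared-register edges, and otherwise the set of channel edges is updated.]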
The techniques described in this paper have been implemented and they were applied to some of the Livermore loops. 
