Abstract
Introduction
A basic tenet of computer architecture, and indeed of all engineering, is to allocate resources where they have the greatest impact. This paper addresses the results of applying pipelined processor techniques to SIMD architectures. We consider the impact of these methods on processor array control and processing elements, and derive relationships between the capabilities of processor memory and communication systems and pipeline performance. At a time when many advocate the use of off-the-shelf parts as processing elements, we believe that it is appropriate to revisit these issues, because it is not particularly likely that a commercial microprocessor manufacturer will invest in such an endeavor. Nevertheless, there are strong arguments that SIMD architectures are ideal for truly massive systems from the perspectives of simplicity and regularity. However, we observe that processing element performance has been less than stellar for most SIMD implementations. Thus we are led to inquire whether the evolution of SIMD architecture might benefit from the application of principles similar to those that led the RISC revolution in the microprocessor arena.
1063-7133/95 $4.00 © 1995 IEEE
Specifically, we will explore the potential to design a SIMD machine which operates at clock speeds comparable to the current generation of middle to high performance microprocessors, i.e. pipelined implementations which operate in the interval $50\ \mathrm{MHz} \le f \le 100\ \mathrm{MHz}$. This is certainly within the range of ca. 1994 process and packaging technologies.
A Brief History of SIMD
Single instruction multiple data architectures have been in existence for at least twenty years. Table 1 presents the clock period and year of introduction for a selection of these machines [6,19,10,7,5,4,14,13]. A cursory inspection of this data reveals that SIMD has not enjoyed the steady upward trend in clock frequency that has been seen in uniprocessor implementations. Others have also observed this phenomenon [20]. This lack of improvement is not due to the use of old or low performance technologies. For example, the processor chip for the MasPar MP-2 is fabricated using a 1.0 micron CMOS process. While this is not state-of-the-art in 1995, it is certainly capable of delivering speeds greater than 12 MHz. In the remainder of this paper, we will explore various solutions to overcome this SIMD clock frequency barrier.
Table 1: Clock cycle and year of introduction for selected SIMD machines. Machines listed: Illiac IV, CM-1, Blitzen, MP-1, GF11. Surviving year entries: 1985, 1990.
The architectural organization of a typical SIMD machine is shown in Figure 1. An implicit premise of this organization is that the PEs are directly controlled every cycle. This implicit cycle-by-cycle synchronization may be one of the fundamental reasons that SIMD clock speeds have not risen over the last two decades. During each cycle, a microinstruction or nanoinstruction is issued from the Array Control Unit (ACU), and executed by the PEs. Since there may be many thousands or conceivably millions of PEs to be controlled, instruction broadcast requires a considerable amount of time to complete. This instruction broadcast bottleneck is a fundamental limit on the scalability of SIMD architectures.
Pipelined Broadcast
An approach to overcome the broadcast bottleneck is to hide the average broadcast latency by pipelining the broadcast. We refer to this as instruction distribution. Figure 1 shows both instruction broadcast and distribution. The instructions are delivered to the PEs via a $k$-ary tree composed of instruction latches (ILs) in the body of the tree, and PE chips at the leaves of the tree. In fact, depending on the number of PEs per chip, the pipeline tree may extend onto the PE chip as well. As a historical note, this is a generalization of the approach used by the Blitzen project [5]. In that system, PE instructions are passed through a three stage pipeline: decode, broadcast, execute. In the decode stage, the instruction is expanded from a 23-bit instruction into a horizontal 59-bit microinstruction. During the second stage, the microinstruction is broadcast to all the PEs. In the final stage all the PEs execute the microinstruction. Since both the decode and execute stages will complete well within the 50 ns cycle time, there is a clear opportunity to expand the broadcast stage.
We now consider several issues involved in the design of such an instruction delivery mechanism. In general, for $n$ leaf nodes and $d$ stages in the network,
$$n = \prod_{i=1}^{d} k_i \tag{1}$$
where $k_i$ is a positive integer representing the fan out at stage $i$. However, (1) does not directly aid us in the design of the network, because $n$ is fixed while $d$ and the $k_i$ are the parameters to be determined. We could arbitrarily decide to make the fan out of all stages equal, such that $\forall i,\ k_i = k$ (this is a restriction on packaging and board layout). This simplification reduces (1) to $n = k^d$. Thus we have
$$d = \log_k n$$
The choice of k is determined by the time required to move an instruction through an individual stage. If we assume a simple capacitive load model for all stages then we can estimate this time as
where $t_i$ is the time required to drive the load, and $\tau_i$ is the RC time constant of stage $i$. Now let $t_{MIN}$ be the minimum clock period possible to execute every microinstruction. This is a constant based on VLSI technology and the complexity of the microinstructions. Since the PEs are the drain of the instruction broadcast pipe, we set the cycle time to $T = t_i = t_{MIN}$.
Using this value in (2), we can determine the fan out as a function of broadcast technology. Now by formulating how the instruction will be distributed to the PEs, we can determine the minimum number of pipeline stages needed to reach all of the PEs.
If the system is constructed from a number of printed circuit boards interconnected on a backplane, then $T$ is the maximum of the time required to send one instruction over this bus and $t_{MIN}$. However, from an architectural standpoint, it is immaterial how the instructions are distributed. We are only interested in the cycle time $T$ and the number of pipeline stages $d$.
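The relation between the number of leaves, the fan out, and the pipeline depth can be illustrated with a minimal sketch. It assumes a uniform fan out $k$ at every stage and rounds the depth up when $n$ is not an exact power of $k$; the function names `tree_depth` and `fill_latency_ns` are hypothetical and do not come from the paper.

```python
def tree_depth(n_leaves: int, fan_out: int) -> int:
    """Smallest d such that fan_out**d >= n_leaves (uniform fan out k)."""
    if n_leaves < 1 or fan_out < 2:
        raise ValueError("need n_leaves >= 1 and fan_out >= 2")
    depth, reach = 0, 1
    while reach < n_leaves:      # integer arithmetic avoids log rounding errors
        reach *= fan_out
        depth += 1
    return depth


def fill_latency_ns(n_leaves: int, fan_out: int, cycle_ns: float) -> float:
    """Time for one instruction to traverse the tree at one stage per cycle."""
    return tree_depth(n_leaves, fan_out) * cycle_ns


if __name__ == "__main__":
    # Illustrative values only: 16384 leaves with fan out 4 needs 7 stages.
    print(tree_depth(16384, 4))
    print(fill_latency_ns(16384, 4, 10.0))
```

A larger fan out shortens the tree but raises the load each stage must drive, which is the trade-off the stage time estimate is meant to capture.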
Rockoff has recently proposed a SIMD instruction cache as another alternative to overcome the instruction broadcast bottleneck [20]. The notion is to download blocks of instructions to the PE array, and then sequence through them locally. Hence, these instructions avoid the bottleneck. As such, the cached instructions can be processed at a higher speed. This method is able to achieve substantial speedup on a variety of algorithms.
Table 2: Global reduction fractions $f_{GOR}$ and $f$ for the sample programs (Nn, Cholesky, and others) and their mean. Surviving entries: .0058, .025, .032, .014, .011, .039, .031.
Reduction Hazards
Pipelining the instruction broadcast trades latency for throughput. For most instructions the latency in execution is unimportant. However, there is one class of operations which have an interaction from the PEs back to the ACU. We refer to these as global reduction instructions. All SIMD architectures have some means of logically combining a value from every PE to produce a single result. For the remainder of this paper, and without loss of generality, we will assume that the operation is a global OR (GOR) of some value on the PEs.
Most GOR occurrences are due to parallel if-then-else constructs. Figure 3 shows the way a compiler might generate code for a branch which depends on parallel data.
Now consider the number of cycles between when the reduce instruction is executed and when the corresponding branch may be executed. On a base machine with no pipelining, i.e. $d = 0$, let the reduction operation take $l$ cycles. Now define $s$ to be the improvement in clock speed available from pipelining the instruction broadcast, i.e. $T_{old} = s\,T_{new}$. Assuming that the time to complete a reduction is constant, reduce now requires $\lceil s l \rceil$ cycles to execute. In addition, an instruction no longer begins execution on the cycle it is issued from the ACU. Rather, the instruction is executed $d + 1$ cycles after it is issued. So the total number of stall cycles between the reduce instruction and the branch is $d + \lceil s l \rceil + 1$.
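The stall count above is easy to tabulate. The following sketch simply evaluates $d + \lceil s l \rceil + 1$; the function name `reduce_stall_cycles` and the sample values are illustrative assumptions, not measurements from the paper.

```python
import math


def reduce_stall_cycles(d: int, l: int, s: float) -> int:
    """Stall cycles between a reduce and its dependent branch.

    d: number of pipeline stages in the instruction distribution tree
    l: cycles a reduction takes on the unpipelined base machine
    s: clock speed improvement, T_old = s * T_new
    """
    if d < 0 or l < 0 or s <= 0:
        raise ValueError("d and l must be non-negative, s must be positive")
    return d + math.ceil(s * l) + 1


if __name__ == "__main__":
    # Illustrative values: 3 distribution stages, a 2-cycle reduction,
    # and a clock that is 2.5 times faster.
    print(reduce_stall_cycles(d=3, l=2, s=2.5))  # 3 + ceil(5.0) + 1 = 9
```

Because the reduction time is fixed in absolute terms, raising $s$ increases the stall count even when the tree depth stays the same.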
If we assume that none of these stall cycles can be filled with useful work, then the time spent executing reduce instructions places an upper bound on the speedup available. It also serves to degrade the performance improvement obtained by increasing the clock speed through pipelined instruction distribution. We find that
where $f_{GOR}$ is the fraction of time spent performing global reduction operations in the original execution, and $f$ is the fraction of instructions that are global reduction instructions. The first two terms in the denominator are the standard terms from Amdahl's Law. The third term represents the additional time spent waiting for the GOR instructions to reach the PEs. Table 2 shows these quantities for several sample programs. These data were obtained from dynamic instruction traces executing on a MasPar MP-2. The program Nn is an artificial neural net.
The Active Flag (A-flag) is a single bit in each PE which indicates that the processor is currently executing. The reduce instruction is placed into the code so that if the set of active processors is empty, time is not wasted in executing those instructions. As a result, a data dependency is created between the reduce and the branch. In a pipelined system, this dependency may necessitate the insertion of stall cycles after the reduce, until the result of the reduce is known.
As the speedup figures indicate, a critical requirement for high speed pipelined distribution of microcode to be effective is the ability to fill the reduce stall cycles with useful work. It is tempting to assume that the time for a reduce operation scales at the same rate as the clock period. However, since the reduction network is simply a network of logic gates, this assumption does not hold. So we will now describe some of the possible methods to fill the stall slots created by GOR instructions. The discussion will center around the pseudo code found in Figure 3.
The most obvious method is to move the GOR instruction earlier in the instruction stream. Unfortunately, in our example each of the four prior instructions has a data dependency with its own prior instruction. This means that the instructions cannot be moved up unless the entire block is moved up. Also, setting the Active Flag cannot be moved ahead of any PE instruction. An alternative is to perform a Reduce-OR on PRO. This has two advantages. First, it allows us to move the setting of the Active Flag into one of the stall slots. Secondly, and of greater importance, it allows us to move instructions from above the first instruction to between the Reduce-OR and Branch as data dependencies allow.
A third possibility is to remove the reduce instruction and the corresponding branch. The CM-2 performs this optimization when running code fragments that take less time than producing the result of the reduction [7]. It may also be possible to analyze the program at compile time and determine that the active partition will never be empty. In such cases there is no reason to perform the reduction.
Another solution is to assume branch not taken and move code from beyond the then-code label, placing that code above the ACU branch. PE instructions from after the branch will execute conditionally based upon the value of the A-flag. This means they can be issued, execute, and even complete before the Reduce-OR is resolved. The reason that this is possible is that (local to each PE) the results of an instruction will be stored conditionally based on the value of the A-flag. If no PEs are active and the branch is to be taken, then all of the PEs will squash a broadcast instruction locally. Therefore, there is nothing to undo and the branch can be taken when it is resolved.
There are two classes of instructions that cannot be moved ahead of a branch. The first consists of PE instructions that operate unconditionally. These instructions are normally used in setting and restoring the A-flag. Since they are unconditional instructions, they are not ignored when the A-flag is not set. Unconditional instructions may only be moved ahead of the branch if they do not need to be undone. For example, instructions that operate on a temporary location that will not be read again before it is written could be moved ahead of the branch. The second class of instructions that cannot be moved are ACU instructions. Since these will not be squashed, they cannot be allowed to complete. However, these instructions could be executed speculatively if that capability is present in the ACU.
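The reason conditional PE instructions can be hoisted above the branch can be seen in a small software model. This is a sketch under simplifying assumptions: each PE is modeled as a register dictionary plus an A-flag, and the names `PE`, `broadcast_conditional`, and `global_or` are hypothetical, not taken from any real SIMD instruction set.

```python
from dataclasses import dataclass, field


@dataclass
class PE:
    active: bool                          # the A-flag
    regs: dict = field(default_factory=dict)


def broadcast_conditional(pes, dest, fn):
    """Issue one conditional PE instruction to every PE.

    The result is computed everywhere, but writeback is gated by the
    A-flag, so an inactive PE squashes the instruction locally.
    """
    for pe in pes:
        result = fn(pe.regs)
        if pe.active:
            pe.regs[dest] = result


def global_or(pes):
    """Global OR (GOR) of the A-flag across all PEs."""
    return any(pe.active for pe in pes)


if __name__ == "__main__":
    pes = [PE(active=False, regs={"r0": i}) for i in range(4)]
    # Speculatively issue an instruction from beyond the branch.
    broadcast_conditional(pes, "r1", lambda r: r["r0"] + 1)
    # The GOR resolves later; no PE was active, so no state changed
    # and the branch can be taken with nothing to undo.
    print(global_or(pes), [pe.regs for pe in pes])
```

An unconditional instruction would skip the `if pe.active` gate, which is why it cannot be hoisted in the same way unless its effect never needs to be undone.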
Pipelining PE Instructions
We now consider the possibility of pipelining the execution of PE instructions. This differs from the previous discussion in a number of ways. Before, we were only concerned with hiding the latency involved in broadcasting an instruction to a large number of PEs. We were not changing the instruction executed by the PEs. Other than the possibility of performing the full instruction decode one cycle ahead of execution, we were not trying to find parallelism inside the PE instruction. In addition, we did not consider optimization of the PE architecture. In this section, we will investigate all three objectives.
For the processing element (PE) architecture, we propose the use of a 'vanilla' reduced instruction set computer (RISC) approach, such as used by the MIPS R2000 and many others [1,2,8,9]. In particular, we believe the following features characteristic of that class of architectures will be important in achieving our pipeline and cycle time objectives. First, we posit a simple pipeline either four or five stages deep. Next, all simple integer instructions must execute in one cycle. Third, all instructions must complete in order. We assume that floating point instructions will take multiple cycles and the floating point units may not be pipelined (although FP primitives such as align may be).
Similarly, integer multiply and divide operations will take more than one cycle and are not explicitly pipelined. Finally, branch conditions are computed using a general purpose register target. That is, our architecture will not utilize condition codes.
What are the effects of data hazards in a pipelined SIMD PE? Hazards between ALU operations can be resolved by forwarding in the usual manner. In addition, if an instruction causes a hazard on one PE it will cause a hazard on all currently active PEs. Therefore it is possible to move the hazard detection circuitry into the ACU. An additional hazard we must consider is introduced by communication operations. Since a result from a communication operation may not be available for several cycles, there is the possibility that communication operations will create long stalls in the instruction pipeline. Generally, we can treat communication hazards as we would load/store hazards.
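Because a hazard on one PE is a hazard on every active PE, the interlock can be evaluated once, centrally. The sketch below models an ACU that delays issue until source operands are ready. It assumes full forwarding (a one-cycle ALU result is usable by the next instruction) and treats a communication operation as a multi-cycle producer; the `Instr` record, the latency values, and the function name are illustrative assumptions.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]
    latency: int = 1      # cycles until the result can be consumed


def issue_with_interlocks(program):
    """Return (issue_cycle, instr) pairs, stalling in the ACU on RAW hazards."""
    ready_at = {}         # register -> first cycle its value is available
    cycle = 0
    issued = []
    for ins in program:
        earliest = max([ready_at.get(r, 0) for r in ins.srcs], default=0)
        cycle = max(cycle, earliest)          # stall until operands are ready
        issued.append((cycle, ins))
        if ins.dest is not None:
            ready_at[ins.dest] = cycle + ins.latency
        cycle += 1                            # in-order, one issue per cycle
    return issued


if __name__ == "__main__":
    prog = [
        Instr("add", "r1", ("r2", "r3")),
        Instr("add", "r4", ("r1", "r1")),             # forwarded, no stall
        Instr("comm", "r5", ("r4",), latency=4),      # assumed 4-cycle transfer
        Instr("add", "r6", ("r5",)),                  # waits for the transfer
    ]
    for when, ins in issue_with_interlocks(prog):
        print(when, ins.op, ins.dest)
```

Instructions that do not depend on the communication result could be scheduled into the idle cycles, just as with load delay slots on a uniprocessor.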
Control hazards are handled much differently on SIMD PEs compared with uniprocessors. Control hazards result from instructions changing the PC, e.g. branches. On a SIMD machine, there is no (parallel) PC and therefore no branch penalties. Branches on a parallel condition set a register which determines whether received instructions are executed. Therefore we may use this register to control writeback of results from the ALU, and to control the memory interface (this is similar to Section 3.0, where the instruction after the branch was automatically squashed). Assuming all PE state information is explicitly set by the instructions and the active flag is forwarded to the memory interface (in case the active flag is modified and then followed by a memory instruction), the instruction immediately following the parallel branch will be executed correctly on the next cycle.
Architecturally, this design has a number of advantages over previous microcoded PEs. First, most instructions are designed to be executed in one cycle. By reducing the ideal CPI of most integer instructions from three or four to one, we gain a corresponding speed up for those instructions. In addition, separating register access and ALU execution into different cycles enables a further reduction of the cycle time $t_{MIN}$.
Floating Point Performance
Generally, floating point operations take considerable time on SIMD machines. For example, the MP-2 can perform an integer add in three cycles while a floating point add takes 25 cycles [15]. Depending on the methodology chosen for floating point operations, pipelining the integer unit may bring a corresponding improvement in floating point performance. For example, if floating point operations are constructed from primitive operations (supported at the hardware level) similar to simple integer instructions, then floating point performance will improve at nearly the rate of integer performance. Depending on the particular sequences of microinstructions, the speed up for floating point operations should be between one and three. Floating point speedup was bounded at three since performance improvement should never be better than that of simple integer operations. Table 3 shows the predicted results for our test examples. For these results, we are [...] From Table 3, we see that the only significant speedup is due to reducing the number of cycles needed for floating point operations. This is precisely because these benchmarks are floating point intensive.
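A rough way to see why floating point dominates the outcome is an Amdahl-style blend of the integer and floating point speedups. This is an illustrative model only, not the method used to produce the paper's predicted results; the function name `blended_speedup` and the time fractions in the example are assumptions chosen for demonstration.

```python
def blended_speedup(frac_int, frac_fp, s_int=3.0, s_fp=1.0):
    """Overall speedup when integer time shrinks by s_int and FP time by s_fp.

    frac_int, frac_fp: fractions of original execution time spent in simple
    integer and floating point operations; the remainder is left unchanged.
    """
    if not 1.0 <= s_fp <= s_int:
        raise ValueError("FP speedup is assumed to lie between 1 and s_int")
    frac_other = 1.0 - frac_int - frac_fp
    if frac_other < -1e-12:
        raise ValueError("fractions must not sum to more than 1")
    return 1.0 / (frac_int / s_int + frac_fp / s_fp + max(frac_other, 0.0))


if __name__ == "__main__":
    # Hypothetical floating point intensive workload: 80% FP, 10% integer.
    for s_fp in (1.0, 2.0, 3.0):
        print(s_fp, round(blended_speedup(0.1, 0.8, s_fp=s_fp), 3))
```

With an FP-heavy mix, tripling integer speed alone moves the total very little, while any gain in FP cycles shows up almost directly in the overall figure.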
Cost of Pipelined Execution of SIMD Instructions
Three ported register file
The first necessity for pipelined execution is a multiported register file. The major cost of the multiple register ports is silicon area. We consider two basic cost models that may be adopted. The first model, proposed by Snyder and Holman, is based on equal area analysis [11]. Under this model, the area used to enhance a PE could have been used to construct additional PEs. A further assumption is that the machine will achieve linear speed up with respect to the additional PEs. Consequently, equivalent speed up is given as
$$S_E = \frac{c_{PE}}{c'_{PE}} \cdot \frac{1}{(1 - f) + f / S_U} \tag{4}$$
where $c_{PE}$ is the area of a PE in the original architecture, $c'_{PE}$ is the area of an enhanced PE, $f$ is the fraction of the time the enhancement is effective, $S_U$ is the speed up gained from the enhancement, and $S_E$ is the speed up gained after considering the cost of the enhancement. Note the second term is simply Amdahl's law. So (4) states that for an enhancement to be worthwhile it has to increase performance by a factor greater than the increase in PE size needed to implement the enhancement.
Since we have already calculated the speedup available from pipelining the execution, we need only to estimate its area cost. For this we use the results of Mulder et al., who find that single ported registers require approximately 0.6 times the area of triple ported registers [16]. Equivalently, a three ported register file is 1.66 times the size of a one ported register file. Thus the ratio of enhanced to original PE area is
$$(1 - p_{REG}) + 1.66\,p_{REG} = 1 + 0.66\,p_{REG}$$
where $p_{REG}$ is the fraction of PE area occupied by the register file. On the MP-2 this was found to be 0.31211. So for the MasPar, pipelining is advantageous only if it can deliver at least a 20% improvement in performance. This statement ignores any possible reduction in clock cycle time that might be enabled by pipelining the execution of PE instructions. From Table 3, we see that under this constraint a 20% speedup is possible if there is a factor of two or better improvement in floating point performance.
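The break-even point under the equal area model can be checked numerically. The sketch below assumes that only the register file grows, by the 1.66 factor quoted above, and uses 0.31 as a rounded register file fraction for illustration; the function names are hypothetical.

```python
def area_ratio(p_reg: float, port_factor: float = 1.66) -> float:
    """Enhanced/original PE area when only the register file grows."""
    if not 0.0 <= p_reg <= 1.0:
        raise ValueError("p_reg must be a fraction between 0 and 1")
    return (1.0 - p_reg) + port_factor * p_reg


def equivalent_speedup(raw_speedup: float, p_reg: float,
                       port_factor: float = 1.66) -> float:
    """Equal area model: raw speedup discounted by the growth in PE area."""
    return raw_speedup / area_ratio(p_reg, port_factor)


if __name__ == "__main__":
    p_reg = 0.31                      # assumed rounded value, for illustration
    print(round(area_ratio(p_reg), 4))            # about 1.2, i.e. 20% larger
    print(round(equivalent_speedup(1.5, p_reg), 3))
```

Any raw speedup below the area ratio yields an equivalent speedup under one, meaning the silicon would have been better spent on additional PEs under this model.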
The second model for the cost of silicon area is much simpler. It states that for a machine in at least its second generation, the silicon used to enhance a PE is free so long as it does not change the number of PEs on a chip. The model is based on the observation that most software cannot automatically take advantage of additional PEs. Also, some software may count on a certain PE configuration. The addition of PEs may disrupt this configuration and therefore cause the program not to run correctly. Under these assumptions, it is generally not possible to add PEs to the machine. If this is the case then the additional silicon area cannot be used for such purposes. So additional area that becomes available can only be used to enhance the PEs. This does not eliminate the possible conflict in deciding between two improvements, but such trade-offs are always present.
I/O Costs
There is an additional cost involved in pipelining the execution of PE instructions: the increase in off-chip I/O bandwidth. The increase in bandwidth is composed of two factors. The first is an increase in the number of bits per cycle that are needed to support one PE. For example, suppose a program runs in $c$ cycles on a non-pipelined machine. On a pipelined machine, the program now runs in $c/s$ cycles, where $s$ is the speedup in cycles gained from pipelining. Now consider the I/O requirements of the program on both machines. Let the program generate $m$ words of I/O. The non-pipelined machine requires at least $m/c$ words of traffic per cycle per PE. The pipelined machine requires at least $sm/c$ words of I/O per cycle per PE. So the I/O per cycle must grow at the same rate as processor speedup. If the clock rate doesn't change then this factor also represents the increase in bandwidth per second. However, we stated that pipelining should reduce $t_{MIN}$. This reduction in cycle time produces a second factor in determining required bandwidth. The two factors are multiplied together to obtain the total increase in physical bandwidth required per PE. Supporting this high bandwidth requirement may be the hardest part in designing pipelined SIMD machines.
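The two bandwidth factors multiply, which a short calculation makes concrete. The function names and the sample cycle times below are illustrative assumptions rather than figures from the paper.

```python
def io_words_per_cycle(m_words: float, cycles: float) -> float:
    """Minimum I/O traffic per cycle per PE for a program moving m_words."""
    return m_words / cycles


def bandwidth_growth(s: float, t_old_ns: float, t_new_ns: float) -> float:
    """Factor increase in physical I/O bandwidth per PE.

    s: speedup in cycles gained from pipelining (first factor)
    t_old_ns / t_new_ns: reduction in cycle time (second factor)
    """
    if s <= 0 or t_old_ns <= 0 or t_new_ns <= 0:
        raise ValueError("all arguments must be positive")
    return s * (t_old_ns / t_new_ns)


if __name__ == "__main__":
    m, c, s = 1000.0, 10000.0, 2.0
    print(io_words_per_cycle(m, c))        # non-pipelined: m / c
    print(io_words_per_cycle(m, c / s))    # pipelined: s * m / c
    # Hypothetical clocks: 80 ns before, 20 ns after pipelining.
    print(bandwidth_growth(s, 80.0, 20.0))
```

In this hypothetical case a twofold cycle-count speedup combined with a fourfold faster clock demands eight times the per-PE bandwidth.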
Conclusions
The clock frequencies of SIMD machines have not increased at the same rate as those of uniprocessors. The principal reason for this is the instruction broadcast bottleneck. By transforming instruction broadcast to a $k$-ary distribution tree, we trade instruction latency for instruction throughput. This enables higher clock rates for the PE array. The speed improvement naturally translates into higher system performance.
If the global reduction network is not similarly enhanced then the time to perform reduce operations is a limiting factor for the speedup gained from increasing the clock rate. If the instruction broadcast is pipelined, then there are additional penalties due to the latency between issue and execution. By scheduling code, it is possible to fill the slots after a reduction with useful work. It is also possible to mitigate the penalty for a reduction following a parallel branch by moving the GOR instruction beyond the branch. This ability is due to the semantics of SIMD machines.
Additionally, it is possible to pipeline the execution of PE instructions. This allows simple integer operations to execute in one cycle. It also may allow floating point operations to finish in fewer cycles. The resulting speed up is highly dependent on the amount of improvement achieved on floating point operations.
Among issues remaining to be explored are methods of constructing high performance memory and communication systems that are required to support the new higher level of processing. Code scheduling is also important, and new algorithms to effectively schedule slots after a GOR need to be derived. In addition, the set of benchmark programs must be expanded to include a wider variety of algorithms.
Acknowledgments
