Abstract-A formal mathematical model of single instruction stream-multiple data stream (SIMD) machines is defined. It is used as a basis for analyzing various types of interconnection networks that have been discussed in the literature. The networks are evaluated in terms of the lower and upper bounds on the time required for each of the networks discussed to simulate the other networks. SIMD machine algorithms are presented as proofs of the upper time bounds on these simulation tasks. These simulations are used to demonstrate techniques for proving the correctness of SIMD machine algorithms, i.e., analyzing the simultaneous flow of N data items among N processors. Processor address masks, a concise notation for activating and deactivating processors, are used in the algorithms. The methods used to prove the lower bounds and to construct (and prove correct) simulation algorithms to show the upper bounds can be generalized and applied to the analysis of other networks.
Index Terms-Algorithm correctness, array processors, computer architecture, Illiac IV, interconnection networks, n-cube array, parallel processing, perfect shuffle, permutation networks, SIMD machines, STARAN.
I. INTRODUCTION
A single instruction stream-multiple data stream (SIMD) machine is a computer system consisting of a control unit, N processing elements, and an interconnection network. The control unit broadcasts instructions to the processing elements, and all active processing elements execute the same instruction at the same time. Each active processing element executes the instruction on data in its own memory. The interconnection network provides communication among the processing elements. This type of machine structure is designed to exploit the parallelism of tasks such as vector and matrix operations.
A formal mathematical model of SIMD machines is defined. The model was designed to provide a basis for the evaluation of various SIMD computer system components. It is used to define and analyze five interconnection networks. These networks include the types used in the Illiac discussed by Feng [10], Lang [15], Lang and Stone [16], Lawrie [17], Orcutt [18], Pease [19], Siegel [21], [23]-[25], [27], Smith and Siegel [28], [29], and Stone [30].
Designing an interconnection network has been recognized as an important problem by computer architects (e.g., [11]). The five networks evaluated here were chosen because they form the basis of most of the SIMD machine interconnection networks that have been proposed and shown to be useful in the literature.
The ability of an interconnection network to simulate the actions of an interconnection not in that network is important to SIMD machine designers, who must choose a set of interconnections to implement in the hardware of the system. The number of interconnections which may be included is constrained by such factors as cost and hardware complexity. Therefore, architects must consider the ability of the interconnection network that is chosen to simulate other interconnections which may be necessary for the machine to perform various computational tasks.
To study this simulation capability, the times required for each of the five networks to simulate the other networks are examined. Algorithms are presented as proofs of upper time bounds on these simulation tasks. Techniques for proving the correctness of SIMD machine algorithms are demonstrated. The lower time bounds are an extension of the work presented in [21]. In the analyses very few assumptions are made about the exact architecture of SIMD machines, so the results are applicable to a variety of actual machines. Therefore, an SIMD architect can use the information on the time bounds for the various simulations, and even the simulation algorithms themselves, if one of these networks is implemented.
The various methods used to construct SIMD machine algorithms and prove their correctness can be generalized and applied to other interconnection networks. Thus, the significance of this paper lies not only in the specific results, but also in the methods used to obtain these results.
II. SIMD MACHINES
One way to view the physical structure of an SIMD machine [13] is as a set of N processing elements (PE's) (where each PE consists of a processor with its own memory), interconnected by a network, and fed instructions by a control unit. The network connects each PE to some subset of the other PE's. An interprocessor transfer instruction causes data to be moved from each PE to one of the PE's to which the element is connected by the network. To move data between two PE's that are not directly connected, the data must be passed through intermediate PE's by executing a programmed sequence of data transfers. An alternative structure is to position the network between the processors and the memories. The PE-to-PE configuration is used here, although the results presented are also valid for the processor-to-memory type of structure.
0018-9340/79/1200-0907$00.75 © 1979 IEEE
The model of an SIMD machine presented here consists of four parts: processing elements, interconnection functions, machine instructions, and masking schemes. It is a mathematical model that provides a common formal basis for evaluating and comparing the various components of different SIMD machines. The model was designed to reflect the features of actual SIMD computer systems.
Each PE is a processor together with its own memory, a set of fast access registers, and a data transfer register (DTR). It is assumed that there are at least three fast access registers, which will be referred to as A, B, and C. The DTR of each PE is connected to the DTR's of the other PE's via the interconnection network. When data transfers among PE's occur it is the DTR contents of each PE that are transferred. There are N PE's, each assigned an address from 0 to N - 1, where N = 2^m; i.e., log_2 N = m. Each PE has a register ADDR which contains the address of that PE. A PE is shown in Fig. 1.
An alternative processor organization would have two DTR's, one for input and one for output. In this paper the single DTR organization is used in the various SIMD machine algorithms. This is done for two reasons: 1) the algorithms can easily be made to operate for the two DTR organization by adding an instruction to move the data from the input DTR to the output DTR immediately after each interprocessor data transfer, and 2) the single DTR implementation manifests problems the two DTR organization does not, as will be discussed later in this section.
Each PE is either in the active or the inactive mode. If a PE is active it executes the instructions broadcast to it by the control unit. If a PE is inactive it will not execute the instructions broadcast to it. The masking schemes are used to specify which PE's will be active.
An interconnection network is a set of interconnection functions. Each interconnection function is a bijection on the set of PE addresses. When an interconnection function f is applied, PE i copies the contents of its DTR into the DTR of PE f(i). This occurs for all i simultaneously, for 0 <= i < N and PE i active. An inactive PE may receive data from another PE if an interconnection function is executed, but it cannot send data. To pass data from one PE to another PE a programmed sequence of interconnection functions must be executed.
In Fig. 2 the interconnections of the perfect shuffle (PS) network are shown. This network has been shown to be useful by Lang [15], Lang and Stone [16], and Stone [30]. It is the basis of Lawrie's Omega network [17] and is included in the networks of the Omen [14] and RAP [7] systems.
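The transfer rule above (active PE's send, inactive PE's may still receive) can be illustrated with a minimal Python sketch. The helper name `apply_interconnection`, the list-based DTR model, and the example bijection `plus_one` are illustrative assumptions, not part of the paper's model.

```python
def apply_interconnection(f, dtr, active):
    """Apply a bijection f on PE addresses.

    Each active PE i copies its DTR contents into the DTR of PE f(i).
    Inactive PEs do not send, but they may still receive (and be overwritten).
    """
    new_dtr = list(dtr)  # a PE that receives nothing keeps its old contents
    for i in range(len(dtr)):
        if active[i]:
            new_dtr[f(i)] = dtr[i]  # all sends read the pre-transfer values
    return new_dtr


N = 8
plus_one = lambda i: (i + 1) % N  # hypothetical example bijection on 0..N-1

data = list("abcdefgh")
all_active = apply_interconnection(plus_one, data, [True] * N)
only_pe0 = apply_interconnection(plus_one, data, [i == 0 for i in range(N)])
```

With every PE active the data are merely permuted; with only PE 0 active, the word originally held by PE 1 is overwritten.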
The Illiac network consists of four functions defined as follows:
Il_{+1}(x) = x + 1 mod N
Il_{-1}(x) = x - 1 mod N
Il_{+n}(x) = x + n mod N
Il_{-n}(x) = x - n mod N
where n is the square root of N, which is assumed to be a perfect square. If the PE's are considered as an n x n array, then each PE is connected to its north, south, east, and west neighbors, as shown in Fig. 3. This is the network implemented in the Illiac IV [6] and DAP [12] systems. Its ability to perform various tasks is discussed in [1], [6], [12], [18].
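As a small illustration of the four-neighbor structure, the following Python sketch assumes the four Illiac functions are x + 1, x - 1, x + n, and x - n, all mod N. The function name `illiac_functions` and the dictionary keys are hypothetical.

```python
import math


def illiac_functions(N):
    """Return the four assumed Illiac functions for N PEs (N a perfect square)."""
    n = math.isqrt(N)
    if n * n != N:
        raise ValueError("N must be a perfect square")
    return {
        "+1": lambda x: (x + 1) % N,
        "-1": lambda x: (x - 1) % N,
        "+n": lambda x: (x + n) % N,
        "-n": lambda x: (x - n) % N,
    }


il = illiac_functions(16)
neighbors_of_5 = sorted(f(5) for f in il.values())
```

For N = 16, PE 5 sits in row 1, column 1 of the 4 x 4 array, and its four neighbors are the PE's directly beside, above, and below it.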
cube_i(p_{m-1} ... p_{i+1} p_i p_{i-1} ... p_0) = p_{m-1} ... p_{i+1} \bar{p}_i p_{i-1} ... p_0
for 0 <= i < m, where \bar{p}_i denotes the complement of p_i. When the PE addresses are considered as the corners of an m-dimensional cube this network connects each PE to its m neighbors, as shown in Fig. 4. Pease's binary n-cube [19], the network used in STARAN [2], [3] and the network proposed for Phoenix [9] are each wired series of cube functions. In [2], [4], [19] the applicability of this network to practical problems is discussed.
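A minimal Python sketch of the cube functions, assuming cube_i complements bit i of the PE address (which is what the m-dimensional cube picture implies). The function name `cube` is illustrative.

```python
def cube(i, x):
    """Assumed cube_i: complement bit i of the PE address x."""
    return x ^ (1 << i)


m = 3  # N = 8 PEs
neighbors_of_6 = [cube(i, 6) for i in range(m)]
```

Each cube_i is its own inverse, so applying it twice returns every data item to its original PE.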
Cross bar networks, Benes networks [5], and Swanson's networks [31] are not considered here for reasons discussed in [21]. Note that the PEPE machine has no interconnection network [8].
The masking schemes component of the model consists of methods for controlling the active/inactive status of the PE's. Masking schemes provide the system user with devices to enable some PE's and disable others. For some masking schemes, the set of PE's that can be activated is any member of the power set of integers 0, 1, ..., N - 1. For other schemes, the set of PE's that can be activated is more restricted. In this paper, two masking schemes are used: PE address masks and data conditional masks.
The PE address masking scheme uses an m-position mask to specify which PE's are to be activated, each position of the mask corresponding to a bit position in the addresses of the PE's. Each position of the mask will contain either a 0, 1, or X ("don't care"). The only PE's that will be active are those whose address matches the mask: 0 matches 0, 1 matches 1, and either 0 or 1 matches X.
Superscripts will be used as repetition factors when describing masks or PE addresses (e.g., 1^2 0^2 = 1100). Square brackets will be used to denote a mask. A PE address mask could accompany each machine instruction and interconnection function, or a separate mask instruction could be executed whenever a change in the active status of the PE's is required. A compiler or assembler could be used to convert from one method to the other. For the sake of clarity, each instruction will be accompanied with a mask. For example, executing "A <- B + C [X^{m-1} 1]" causes each PE with an odd numbered address to add the contents of B and C, and store the sum in A.
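The matching rule can be made concrete with a short Python sketch. It represents a mask as a string over "0", "1", and "X" with the high-order position first; the helper names `matches` and `active_pes` and the register lists are assumptions made for illustration.

```python
def matches(mask, address):
    """True if the m-bit address matches the mask (X is a don't-care)."""
    bits = format(address, "0{}b".format(len(mask)))  # high-order bit first
    return all(k == "X" or k == b for k, b in zip(mask, bits))


def active_pes(mask):
    """Addresses of all PEs activated by the mask, for N = 2**len(mask)."""
    return [a for a in range(2 ** len(mask)) if matches(mask, a)]


m = 3
odd_mask = "X" * (m - 1) + "1"  # the mask [X^{m-1} 1]

# Per-PE registers, indexed by PE address.
A = [0] * 8
B = list(range(8))
C = [10] * 8
for pe in active_pes(odd_mask):  # "A <- B + C [X^{m-1} 1]"
    A[pe] = B[pe] + C[pe]
```

Only the odd-numbered PE's update A; the even-numbered PE's leave their A registers unchanged.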
The way this masking scheme interacts with various interconnection networks is examined in [21] . In [22] variations of this masking scheme and methods to implement it are discussed.
Data conditional masks are the implicit result of performing a conditional branch dependent on local data in an SIMD machine environment, where the result of different PE's evaluations may differ. The notation where <data condition> do ... elsewhere ... will be used. Thus, as a result of a conditional where statement each PE will set its own internal flag to activate itself for either the "do" or the "elsewhere," but not both. The execution of the "elsewhere" statements must follow the "do" statements; i.e., the "do" and "elsewhere" statements cannot be executed simultaneously. For example, as a result of executing the statement where A > B do C <- A elsewhere C <- B each PE will load its C register with the maximum of its A and B registers, i.e., some PE's will execute "C <- A," and then the rest will execute "C <- B." This type of masking is used in such machines as the Illiac IV [1] and PEPE [8].
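The two-phase execution of a where statement can be sketched in Python as follows. The function name `where` and the dictionary-per-PE register model are illustrative assumptions; the point is only that every PE evaluates the condition on its own data, the "do" block is broadcast first, and the "elsewhere" block is broadcast afterward.

```python
def where(condition, do, elsewhere, pes):
    """Sequentially emulate 'where <condition> do ... elsewhere ...'."""
    flags = [condition(pe) for pe in pes]  # each PE sets its own flag
    for pe, flag in zip(pes, flags):       # phase 1: the "do" block
        if flag:
            do(pe)
    for pe, flag in zip(pes, flags):       # phase 2: the "elsewhere" block
        if not flag:
            elsewhere(pe)


# Three PEs with local A and B registers.
pes = [{"A": 3, "B": 5, "C": None},
       {"A": 7, "B": 2, "C": None},
       {"A": 4, "B": 4, "C": None}]

# where A > B do C <- A elsewhere C <- B
where(lambda pe: pe["A"] > pe["B"],
      lambda pe: pe.update(C=pe["A"]),
      lambda pe: pe.update(C=pe["B"]),
      pes)
c_values = [pe["C"] for pe in pes]
```

Every PE ends with the maximum of its A and B registers in C.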
"Where statements" may be nested using a run-time control stack. This is described in [26] .
Data conditional statements are an essential part of all programming languages, so it is reasonable to assume they would be present in all SIMD machines. The results of this paper would still be valid even if only data conditional masks were used. This is because if each PE knows its own address, then data conditional masks could be used to simulate PE address masks using no additional interprocessor data transfers. When PE address masks and data conditional masks are used together, PE address masks must accompany each instruction in the "do" block and in the "elsewhere" block. Thus, in order for a PE to be active it must be in active mode as a result of the where statement and match the PE address mask accompanying the instruction.
When an interconnection function is executed with all PE's in the active state no DTR data is destroyed; it is just transferred to another DTR. However, if some PE's are inactive, then their DTR contents can be destroyed, i.e., overwritten and not transferred. This is the problem, referred to earlier, that the single DTR organization has and the two DTR organization does not. For example, when N = 8, the data transfer instruction "shuf [001]" would overwrite and destroy the DTR contents of PE 2. To prevent the loss of the data in the DTR of PE 2 a copy of it must be saved prior to executing the data transfer instruction. An analysis is presented in [21] of the combinations of PE address masks and interconnection functions which will destroy data, i.e., are not bijections. In the algorithms in the next section, only when data will be destroyed by an interprocessor data transfer which is not a bijection will this issue be discussed.
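The "shuf [001]" example can be checked with a short Python sketch. It assumes the standard perfect shuffle definition (a cyclic left shift of the m-bit PE address), which is not spelled out in the text above; the helper names `shuffle`, `matches`, and `transfer` are illustrative.

```python
def shuffle(x, m):
    """Assumed perfect shuffle: cyclic left shift of the m-bit address x."""
    top = (x >> (m - 1)) & 1
    return ((x << 1) & ((1 << m) - 1)) | top


def matches(mask, address):
    """True if the address matches a mask over '0', '1', 'X' (high bit first)."""
    bits = format(address, "0{}b".format(len(mask)))
    return all(k == "X" or k == b for k, b in zip(mask, bits))


def transfer(f, dtr, mask):
    """Active PEs (those matching mask) copy their DTR to the DTR of PE f(i)."""
    new_dtr = list(dtr)
    for i, word in enumerate(dtr):
        if matches(mask, i):
            new_dtr[f(i)] = word
    return new_dtr


m = 3
dtr = ["d{}".format(i) for i in range(1 << m)]
masked = transfer(lambda x: shuffle(x, m), dtr, "001")    # only PE 1 sends
unmasked = transfer(lambda x: shuffle(x, m), dtr, "XXX")  # all PEs send
```

With the mask [001], PE 1 sends to PE 2, whose original word d2 no longer exists anywhere; with all PE's active the words are only permuted.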
In summary, an SIMD machine can be formally represented as the 4-
2) The interconnection function is to be simulated as if it were executed with all PE's being active. It will be shown after the theorem how this restriction can be removed.
3) The bounds are in terms of the number of executions of interconnection functions required to perform the simulation.
4) The interconnection function of the network to be simulated which requires the most time to simulate will determine the time bounds for the simulation of that network.
The instructions in the simulation algorithms can be divided into three categories: control unit operations, interprocessor data transfers, and register-to-register operations. Control unit operations, such as incrementing an index register in a "for loop," can usually be done in parallel with the previously broadcast PE instruction, thus taking no additional time. Register-to-register operations, within a PE, will probably involve a single chip, or at worst adjacent chips. The interprocessor data transfers will involve setting the controls of the interconnection network and passing data among the PE's, involving board-to-board, and probably rack-to-rack, distances. Furthermore, if there is a sequence of data transfers and the two DTR model is implemented (to reduce clocking problems), then an input DTR to output DTR move will be required after each transfer. Thus, unless the number of register-to-register operations is much greater than the number of interprocessor data transfers, the time for the interprocessor transfers will be the dominating factor in determining the execution time.
In the theorem that follows, the lower bound and upper bound on the time required to perform each simulation are shown. In most cases these bounds are tight. Unless indicated otherwise, the lower bound is proven in [21]. Each upper bound is based on the time complexity of the algorithm presented to do that simulation. Each algorithm is proven correct. Most proofs are only sketched and are provided as an explanation of the algorithm; details are in [20]. Several proofs are given in detail to demonstrate the techniques used. In the proofs, the transferring of data from PE x to PE y will also be referred to as mapping the address x to y, since the interconnections have been defined as functions on the PE addresses.
Theorem: In Table I
For example, when i = 1 and N > 4, the data from the DTR of PE 3 is moved to PE 5, and then to PE 1. For address P whose ith bit is 0, Step 1 maps P to P + 2^i = cube_i(P), which does not match the mask in Step 2. For address P whose ith bit is 1, Step 1 maps P to P + 2^i and Step 2 maps P + 2^i to P - 2^i = cube_i(P).
Step 5 reloads it.
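The address arithmetic in this proof can be checked exhaustively with a small Python sketch. The instructions of Step 1 and Step 2 are not reproduced above, so the sketch makes explicit assumptions: PM_{+i}(x) = x + 2^i mod N and PM_{-i}(x) = x - 2^i mod N; Step 1 is PM_{+i} with all PE's active; Step 2 is PM_{-(i+1)} with only the PE's whose ith address bit is 0 active; and for i = m - 1 Step 1 alone suffices. The function names are illustrative.

```python
def pm_plus(i, x, N):
    """Assumed PM_{+i}: add 2**i modulo N."""
    return (x + (1 << i)) % N


def pm_minus(i, x, N):
    """Assumed PM_{-i}: subtract 2**i modulo N."""
    return (x - (1 << i)) % N


def simulate_cube(i, m):
    """Return loc, where loc[p] is the PE finally holding the data that started in PE p."""
    N = 1 << m
    loc = [pm_plus(i, p, N) for p in range(N)]  # Step 1: all PEs active
    if i < m - 1:
        # Step 2: only PEs whose bit i is 0 send. They send to PEs whose
        # bit i is also 0, so data resting in inactive PEs is not overwritten.
        loc = [pm_minus(i + 1, q, N) if (q >> i) & 1 == 0 else q for q in loc]
    return loc


path_of_pe3 = simulate_cube(1, 3)[3]  # data from PE 3 goes to PE 5, then PE 1
```

Under these assumptions the two steps reproduce cube_i for every address, which agrees with the PE 3 to PE 5 to PE 1 example in the text.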
Step 4 is similar to 
Piopi-I PiPo (see Appendix I). After Step 2 is executed with i = 0, the data from PE 0 p_{m-2} ... p_1 p_0 will have been transferred to PE p_{m-2} ... p_1 p_0 0. Step 3 saves this data and Step 5 reloads it.
The data transfers resulting from Step 2 will not be bijections, but the data that was originally from the DTR's of PE's that had a 0 in the high-order bit of their address will never be destroyed by these transfers. Consider Step
Follows from the Cube -> PM2I and PM2I -> Illiac analyses.
Cube -> PS (Upper Bound):
For the exchange use cube_0.
For PM_{+i}, m/2 <= i < m:
for j = 1 until 2^i/n do Il_{+n} [X^m] : all PE's execute Il_{+n} 2^i/n times
For example, when i = 3 and N = 16, the data in the DTR of PE 6 is moved to PE 10 when j = 1, and then to PE 14 when j = 2. Il_{+n} executed 2^i/n times equals PM_{+i}.
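The loop above can be traced with a brief Python sketch, assuming Il_{+n}(x) = x + n mod N and PM_{+i}(x) = x + 2^i mod N. The function name `simulate_pm_plus` is illustrative.

```python
import math


def simulate_pm_plus(i, x, N):
    """Trace the PEs visited by the data from PE x while Il_{+n} is run 2**i / n times."""
    n = math.isqrt(N)
    if n * n != N or (1 << i) < n or (1 << i) >= N:
        raise ValueError("requires N a perfect square and m/2 <= i < m")
    trace = [x]
    for _ in range((1 << i) // n):  # for j = 1 until 2^i/n
        x = (x + n) % N             # one execution of Il_{+n} by all PEs
        trace.append(x)
    return trace


trace_of_pe6 = simulate_pm_plus(3, 6, 16)
```

For N = 16 and i = 3 the loop body runs twice, moving each data item a total distance of 8 = 2^3, which is the effect attributed to PM_{+3}.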
For PM_{+i}, 0 <= i < m/2: the algorithm is the same as the case above, except both "n's" are changed to "1's."
d(x, Il_{+n}(x)) = d(x, Il_{-n}(x)) = n, so to move a distance of n/2 the Il_{+1} and/or Il_{-1} functions must be used also, e.g., if Il_{+n} was used to map 0 to n/2, it would have to be followed by n/2 executions of Il_{-1}.
Step 1 maps address P to P + (n/2) = cube_{(m/2)-1}(P), where the ((m/2) - 1)st bit of P is 0.
Step 1 maps address P to P + (n/2) and Step 2 maps it to P - (n/2) = cube_{(m/2)-1}(P), where the ((m/2) - 1)st bit of P is 1.
For cube_i, 0 <= i <= (m/2) - 2: the algorithm is the same as the m/2 <= i <= m - 2 case above, if "n" is changed to "1" in Step 2 and Step 4.
For the shuffle: in [18] an algorithm for the Illiac to simulate the shuffle function using 4n - 4 interprocessor data transfers is presented. The algorithm presented here, which uses only 3n - 4 interprocessor data transfers, is based on the PM2I -> shuffle algorithm. The key is that the instruction "PM_{+i} [X^{m-(i+2)} 0 1 X^i]" in the PM2I -> shuffle algorithm is equivalent to the sequence of instructions:
(S1) A <- DTR [X^m] : all PE's save original DTR contents in A
: all PE's execute i shuffles
For example, when i = 1 and N = 8, the data from the DTR of PE 6 is moved to 5 and then to 3 by Step 1, then to 2 by Step 2, and finally to 4 by Step 3. By definition, shuf^i(exch(shuf^{m-i}(p_{m-1} ... p_i ... p_0))) =
: all PE's load DTR with contents of B register
For example, when i = 2 and N = 8, the data in the DTR of PE 6 is moved to PE 3 by Step 2, to PE 2 by Step 5, to the B register by Step 6, and back to the DTR by Step 9. For all PE addresses P not of the form 1^{m-i} p_{i-1} ... p_1 p_0, Step 2 performs the simulation, since WPM_{+i}(P) = PM_{+i}(P). For PE addresses of the form 1^{m-i} p_{i-1} ... p_1 p_0, except for 1^m, Step 2 maps them to (0^{m-i} p_{i-1} ... p_1 p_0) + 1 and Step 5 subtracts the "+1."
Step 5 maps 1^m to 1^{m-1} 0 and Step 7 maps this to 0^{m-i} 1^i. Step
To find the lower bounds of these simulation algorithms, one must consider which operations occur in parallel and find ways to describe these actions mathematically. To construct the simulation algorithms, the parallel flow of N data words through N processing elements must be understood. It must be determined which data may get destroyed by a data transfer that is not representable as a bijection on the PE addresses and save that data in such a way that it can later be identified and used. Furthermore, special actions must be taken when the interconnection to be simulated is executed with some PE's disabled. To prove that the algorithms are correct, standard mathematical techniques, such as induction and case analysis, were adapted for use in parallel program analysis. In addition, the approach of considering the simulation problem as a task to map one integer (or class of integers) to another integer (or another class of integers) was taken, as opposed to viewing the process strictly as one of transferring data among PE's.
The methods presented in this paper may be generalized and used to compare other networks. These techniques were demonstrated by examples since there are currently no good "algorithms" for generating such lower bound proofs, simulation algorithms, or SIMD algorithm correctness proofs for arbitrary networks.
No attempt is made to claim that any one network is "best" as a result of the analysis in this paper. One factor which will influence the decision of which network to implement in the system is the types of computations for which the system will be primarily used. For example, for simple pattern recognition smoothing algorithms, the nearest neighbor interconnection scheme, with only eight interconnection functions, may be the most efficient. The number of processing elements in the system is another important factor. For small N, a cross bar switch may be acceptable. Other factors include computational speed and cost requirements. Assuming a "general purpose" SIMD machine, where N is large, some comments about the advantages and disadvantages of the various networks, independent of the factors above, can be made.
The Illiac network is much more limited with respect to simulation capabilities than its superset the PM2I, although it has the advantage of having only four interconnection functions, as compared to the 2m - does not have, but the WPM2I network is significantly more difficult for the programmer of an SIMD machine to use. The PM2I is based on simple mod N arithmetic, while the data transfer patterns established by the WPM2I network are complicated by the "wrap-around." Thus, the PM2I network is preferable where the particular group theoretic properties of WPM2I are not required.
The Cube and PM2I networks are conceptually similar, with the Cube connection pattern based on a "logical neighborhood" and the PM2I pattern based on a "modulo N addition/subtraction" neighborhood. The Cube uses almost half as many interconnection functions as PM2I, and the algorithm presented for simulating the shuffle is almost twice as fast as the algorithm using the PM2I (although the lower bounds on the two tasks are the same). However, the Cube requires m steps to simulate the PM2I while the PM2I requires only two steps to simulate the Cube, and the number of connections in the Cube cannot be reduced as it can with the PM2I. Due to these considerations, the PM2I may be preferable in the general case.
The PS network has the advantage of requiring only two interconnection functions, with which it can simulate the other networks discussed using at most 2m steps. If the system architect is concerned with minimizing hardware costs and is willing to use 2m transfers for the various interconnection patterns which were shown would require that amount, then PS is an excellent choice. If the main concern of the system architect is computational speed, without "unreasonable" expense, then a good choice is a PM2I-shuffle hybrid, consisting of the 2m - 1 PM2I functions and a shuffle function, which could simulate any of the functions discussed in at most two steps. Such a hybrid network would offer great flexibility and speed, while being a relatively small portion of the total cost of a system when N is in the 2^10 to 2^14 range.
In conclusion, the results of this paper provide comparison information to aid the SIMD machine architect in choosing an interconnection network which will be best suited to the needs of the system. The methods presented provide tools for the designer to use to evaluate and compare other networks and hybrids of networks. 
