INTRODUCTION
The discipline of Opérations Research has. long been concerned with the development of various mathematical models and efficient algorithms for their satisfactory solution. A substantial number of these algorithms are computationally feasible today simply because of the availability of high speed digital computers. For example, a lot of algorithms that employ branch and bound techniques, or dynamic programming techniques would be practically unusable were it not for digital computers. An important aspect of the efficient implementation of such algorithms is the efficiency of the algorithm itself. This concern is amply evident in the continuous efforts of researchers to deveîop better algorithms. An equally important aspect, in our opinion, is the architecture of the machine on which the algorithm is implemented. While the gênerai purpose machine (uniprocessor) may be capable of executing the algorithm, alternative architectures may offer substantial improvements in performance. This fact has already been recognized in certain application areas, as for example, in the area of image processing [16, 22] and missile tracking [6] .
There appears to be little work done on special architectures suited to opérations research problems. We hope to call attention to the interesting possibilités offered by special architectures by developing one for a class of opérations research problems. Before proceeding with the development, a few words regarding the gênerai objectives of the paper are in order.
Our immédiate objective is to expose the benefits of a special architecture to a class of opérations research problems. The long term objective of course is to stimulate the development of special architectures for other classes of problems in thefield. No attempt has been made to "fine tune" the architecture to exploit intricate properties of problems in the class we have investigated. Such an effort needs considérable interaction between hardware specialists and experts in the opérations research area. Thus, in reading the paper, it should be kept in mind that we are not presenting a fully developed System for évaluation. However, the potential for special architectures in this area should strike the reader. Refmements to the system will be the subject of further work.
THE LINEAR PROGRAMMING PROBLEM
We have chosen probably one of the simplest and most widely used models in opérations research -namely linear programming(LP). It has been in use for a long time and there are well established techniques, most notably the Simplex method, for its solution. Since the LP problem is being used only as an example of a target applications area that could profit from special architectures, the discussion in this section is tutorial in nature. Any elaborate treatment of the LP problem here would probably be of interest only to linear programming specialists. The basic problem may be stated as:
maximize c x x x + c 2 x 2 +, ..., -f c n x n , subject to a xi x 1 The simplex method for solving this problem is well known and will not be repeated hère. For details see[$\. A program for executing the simplex method on a parallel processor will be developed later in the paper. The marked improvement in solution time will then be readily seen.
Consider a class of LP problems that exhibit a special structure.
The practical problems in this class are typically very large, i. e., a large number of variables, and a large number of constraints. The constraint matrix has a few, so called, coupling constraints. The rest of the matrix displays a block angular structure. 
01
Matrices A u A 2 , A$, A 4 ail have the same number of rows, this number being equal to the number of coupling constraints. The number of columns in the matrices will not necessarily be the same, depending on how the variables have been partitioned to match the block angular structure. Of course, this notation can easily be extended to the case withp blocks in the block angular structure.
At first glance, this structure appears to be special enough not to appear frequently in practice. However, many transportation problems have this forai [29] . More importantly, this form of the LP problem arises when several departments are independently trying to optimize that portion of the objective function relevant to itself while using resources that are also needed by other departments. A central authority would like to impute certain prices to these shared resources so that the departments use the shared resources in quantities that maximize the total objective of the firm as a whole [10, 13] .
The coupling constraints represent the constraint relations pertaining to shared resources. The matrices B t and vectors b h 1 ^/^p, represent constraints that affect only one particular department. The décision variables for the *' -th department are contained in the vector x t .
This LP problem in block angular form can be solved by the well known Dantzig-Wolfe décomposition aigorithm [9] , and the reader is referred to [11], pp. 148-152 for mathematical details. Hère, we choose to describe informally, the nature of the aigorithm, to see how it could benefit from parallelism. The aigorithm is two level with a "master program* 9 at the second level and PARALLEL ARCHITECTURES AND OPERATIONS RESEARCH 321 subproblems at the first. Each subproblem is of the form (see preceding formulation):
Such that B t x~S i9
where % i is a vector of imputed prices handed down to each subproblem by the master problem, Each subproblem arrives at optimal values for its décision variables Xi(n x ) and objective function z? s and returns these values to the master problem. At this stage, the master problem checks appropriate criteria (again, see [11] ) to see if has an optimal solution, ötherwise a new vector of imputed prices is computed and returned to the subproblems. The process repeats itself, and, if the master program is not degenerate, the décomposition principle will fmd the optimum in a fmite number of itérations.
The subproblems to be solved are independent in that they do not need any information from each other, Hence their solution could logically proceed in parallel This parallelism in itself would cause an improvement in the solution time. Each subproblem is an LP. It was mentioned sometime earlier that the ordinary LP problem could be solved on a parallel processor with improvement in solution time. Thus, in addition to solving the subproblems in parallel, each subproblem is solved on a parallel processor with attendant time savings. Thus the basic strategy is to solve the subproblems in parallel and parallelize the solution of each subproblem as much as possible. In the next section, a brief introduction to a class of parallel processors is presented. An architecture of a parallel processor which facilitâtes solution of the problem follows. Estimâtes of improvements in solution time are presented.
PARALLEL PROCESSING
A class of parallel processing machines is the SIMD {Single Instruction Stream Multiple Data Stream) machines [2, 4] . A genera! SIMD architecture is shown in Fig. 3 The ILLIAC machine [2, 5] is an example of an SIMD machine. PEPE [7] is another. SIMD machines, in fact any parallel processor, involve considérable hardware and development costs. Adequate applications must be found to make these machines economically viable. PEPE is designed for missile tracking applications [6] , and ILLIAC for weather forecasting. Architectures for image processing applications have been proposed in [16, 22] .
Skewed storage
A technique for storing matrices in such SIMD machines that is of particular importance to us is the skewed storage technique. Consider an 8 by 8 matrix. In skewed storage format, the matrix would be stored as shown in the memory map of Fig. 3.2 . In Fig. 3 .2, each column is the memory associated with the PE marked above it. Each cell corresponds to a memory unit (word, byte, etc.). 
R.A.LR.O. Recherche opérationnelle/Opérations Research PARALLEL ARCHITECTURES AND OPERATIONS RESEARCH

323
With this storage technique, any row and any column of the matrix can be loaded into the eight processors in parallel. Entry (i,j) in a cell in Fig. 3 .2 signifies that the value corresponding to row^ and column^ of the matrix is stored in that cell.
E. g., (i) To load row 3 (note that rows are numbered starting from zero)., PE fc loads cell 3 into its DTR (the memory cells of each PE are also numbered from zero to seven). Then each PE fc transfers the contents of its DTR to PE (/fc _ 3)ModP where P is the number of processors. This is a uniform inter PE shift and is discussed in more detail in section 6.4. In gênerai, to load row r, each PE k loads its r-th cell and shifts its data to PE (fc " r)ModP .
(ii) To load column 4, PE k loads cell (A; + 4)Mod P into its DTR. It then transfers the contents of its DTR to PE u _ 4)ModP . In gênerai, to load column c, PE k loads cell (A:+c)Mod P into its DTR and then transfers that data to PE(*-c) Mod p . For a gênerai description see [26] . It is important to point out that the transfer from PE* to PE (k _ c)ModP is done in parallel (simultaneously) for t = 0, 1 P-l. This should become clear in section 6.4.
Recursive doubling
Another useful technique used often with SIMD machines is recursive doubling. Consider finding the minimum of N numbers. On a uniprocessor, one would exécute JV-1 paired comparisons, retaining the smallest number at each stage and obtain the smallest of the N numbers. Assuming each comparison takes one time unit, the sequential process takes IV -1 time units. The recursive doubling can be best described by an example. Consider fmding the minimum of 8 numbers. Figure 3 .3 describes the process graphicaliy.
In Step 1, the following pairs of PE's simultaneously make comparisons -[{0, 1}, {2, 3}, {4, 5}, {6, 7}]. PE's {0 s 2, 4, 6} contain the minimum values of the respective comparisons.
In
Step 2 the following pairs make comparisons [{0, 2} {4 S 6}]. Subsequently, at the end of step 3 5 PE 0 contains the minimum of the eight numbers. Thus, recursive doubling required three time units. In gênerai, for N number s, [log 2 N] time units are required as compared to N -1 time units using the sequential technique. If N is not a power of two, some dummy numbers can be 'padded up' to obtain a power of two as needed by recursive doubling. As N increases to practical values (say 100) the savings in the number of steps is substantial. (For N=100, 7 steps are needed with recursive doubling as compared to 99 steps sequentially.) The recursive doubling technique also requires some data transfers which dégrades performance somewhat. However, many existing and proposed interconnection networks can do the data transfers required at each step of the recursive doubling process in one move [3, 12, 15] .
EXECUTING THE SIMPLEX METHOD ON AN SIMD MACHINE
A standard form of the simplex tableau is shown in Figure 4 .1. There are several variants of this tableau. This method was chosen simply for expository purposes. Starting with an initial feasible solution, the solution is obtained as a séquence of pivot opérations. The tableau is assumed stored in the PE's of the SIMD machine in skewed storage format [26] . Skewed storage allows the retrieval of an entire row or an en tire column of the tableau in parallel. The architecture of In- Figure 4 .2, the opération of pivoting brings column j* into the basis, A high level "program" for running this on an SIMD machine h given in the next section.
A Parallel program for the simplex method
An explanation of the symbols used in the program follows, The symbols have been divided into three groups. The first group refers to hardware components of the parallel processor. The second group describes variables and the third group explains the opérations. It must be emphasised that we are not attempting to list PE : A processing element, consisting of a processor and its associated memory. Each PE has, among other components, the following three components of interest for our purposes.
DTR : Data transfer register. This register is the only one linked to the interconnection network. To transfer data to/from a PE, the data item must be in the DTR of the PE.
A : A privileged register. Most arithmetic opérations incolve this register. Further, any data loaded from a PE's memory must be loaded first into A.
B : A genera! purpose register.
(ii) Variables : Variables in an SIMD machine can be of two types. One is the PE variable. lts value is local to every PE. The other is a CU (Control Unit) variable. This variable résides in the CU and may be broadcast to all PE's in parallel. This is the usual method of transmitting constant values to PE's.
PEN : PE number. This is the address of a PE. If there are P processors, they are numbered 0 to P -1.
DIV, JSTAR, ISTAR SCALE, MIN : These are CU variables. DIV holds the value of the pivot element; JSTAR, ISTAR are the column and row number, respectively, of the pivot element. MIN holds the minimum value of some comparisons.
SCALE holds a common multiplier used in updating the coefficient matrix using the pivot element.
(iii) Opérations: «-: Assignment. The quantity on the right is assigned to the variable to the left of the arrow. Recall that every opération is done in parallel by each PE unless otherwise modified by a mask as described below.
[Mask] : Normally, all PE's exécute any instruction handed down by the CU. The masking opération sélects some PE's to perform an opération. Such PE's are said to be active. Theoretically, every opération could be accompanied by a mask. There are various représentations of masks [17] . When an opération is accompanied by a mask, only PE's whose address (PEN) matches the mask will exécute that instruction.
PARALLEL ARCHITECTURES AND OPERATIONS RESEARCH 327
x Mod P : x modulo P = x-P*\ -. Shift x : This is an inter PE data transfer between the A registers via the DTR's and interconnection network. This opération causes PE f to transfer its data to PE (i + Jc)ModF where P is the numberofprocessors.lt must be emphasized that all PE's can do this at the same time due to the nature of the interconnexion network. This should be evident in section 6. WHERE (Logical expression) DO; END; : This is a data conditional opération. Initially, all active PE's evaluate the logical expression using their individual data streams. Those PE's among this active set, where the expression évaluâtes to false, are deactivated. Only the remaining active PE's exécute the subséquent DO; ... END; block. This is a standard construction in languages for parallel processors [1, 25] .
R,D. Comp itoj : PE's i' tojperform recursive doubling as described in section 3.2. The numbers to be compared, réside in the A registers of PE's / to ƒ The minimum value appearing as a resuit of the recursive doubling process is contained in the DTR of PE^. It is assumed that ij and.that7 -i +1 is a power of two or the data is padded up. The purpose of this procedure is to show how the simplex method can be executed on an SIMD machine. Estimâtes of improvement in solution time over the uniprocessor case follows in section 4.3. Note that the test WHERE (A = MIN)... may result in nonunique PE numbers. In such cases, any one of the PEN's holding the minimum value could be arbitrarily assigned to JSTAR (or ISTAR). This does not affect the algorithm.
Wraparound
It could very well happen that the number of columns/rows in the tableau is greater than the number of PE's. In such case, the strategy of wrapping the matrix around the PE's is followed. One or more PE's will then have more than one row/column. Opération on these rows/columns must proceed sequentially and this dégrades performance. As an example of wrap around, consider an 8 by 12 matrix and only 8 PE's. The matrix stored in wraparound format will look as shown in Figure 4 The result of this wraparound is that we will have part parallel and part sequential opération. The degree of sequential opérations dépends on how many wraps we have. For example, with the present set up, any opération on the first eight columns could be done in parallel. Opérations involving columns 1 to 10 say will take two steps since PE 0 , PEj and PE 2 will have to operate sequentially. In the degenerate case of only one PE, all cells of the array are stored continuously and purely sequential opération results. Let w represent the number of wraps for columns and w* the number of wraps for rows. These two numbers need not be the same when the matrix is non-square and skewed storage format is followed. Therefore: W*z=\ and *4
• 3. Estimâtes of improvement îo solution time
The simplex tableau on which the estimâtes are based is shown again for convenience below in Figure 4 If the simplex algorithm of Figure 4 .2 is executed on a uniprocessor, the analysis of the number of steps is as follows:
Stepl: Find the lowest reduced cost. There are JV values for reduced costs that must be examined for the minimum. This will take JV -1 comparisons.
Step 2: Is the minimum reduced cost ^0? This is one compare opération.
Step 3: Find the row with minimum ratio of xja^ij* is the column corresponding to the minimum reduced cost). This requires M division opérations and M-l comparisons (to find the minimum).
Step 4: Pivot opérations assume a t * f is the pivot, i* is the row with the lowest ratio in step 3. Row ? * will require JV + 1 divisions. Every other row will require JV+1 multiplications and JV+1 subtractions. Step 1 : The recursive doubling takes log 2 JV compares and log 2 JV shifts. Such shifts can be done in one move (see Section 6.4). In gênerai, at most K/ + log 2 JV compares and log? JV shifts are needed. The step to test if the minimum reduced cost is nonnegative also requires one comparison, as in the uniprocessor. (Note w' compares are done within each Pe.) Step 2 : Finding the row with the minimum ratio of x^a^**. This requires one shift (Shift_ JSTAR ) opération and one divide opération. The recursive doubling to fïnd the minimum element needs log 2 M compares and iog 2 M shifts* In gênerai, at most Qa-HogaM) shifts and w dîvides and ig+log 2 M compares are needed.
Step 3 : Pivot Opération, For row 1STÂR) one division is needed, or in genera!» at most w' divisions, Now, each itération involves at most, 2w' shifts» w' multiplies, and w' subtractions. Since we go through the loop M times, we have a total of (M w') shifts, w* divisions, Mw' multiplies, and For the SÏMD machine, the critical factor is the inter PE data transfer time (shifttime) since there are so many of them. A slight change in data transfer time can substantially improve the SIMD performance, It is assumed that a uniprocessor and a processor in an SIMD machine performs an arithmetic/logic opération in the same time.
R.A.LR.0, Recherche opératiotuielle/Operations Research
AN ARCHITECTURE FOR PARALLEL IMPLEMENTAL OF THE DANTZIG-WOLFE
ALGORITHM
The previous sections exposed the parallelism that could be exploited when solving a regular LP problem using the simplex method. This section discusses the architecture of a parallel processing machine that is suited for the class of LP problems that have the block angular structure as described before. The architecture presented in Figure 5 .1 is different from the regular SIMD architecture presented in Figure 3 .1 in two major ways. First, there are two levels of controL As before, there are P PE's and they perform the bulk of the computation. Also as in Figure 3 .1, there is an interconnection network for the PE's to communicate with each other. However, in Figure 5 .1, the interconnection network is also partitionable in the sensé to be described presently. Our machine has several CU's as opposed to one in the gênerai SIMD machine. (There are five CU's in the diagram only as an example.) The number of CU's is fixed at the time of design. One CU together with a set of PE's constitutes an SIMD machine. Figure 5 .1 in effect represents an architecture for several cooperating SIMD machines. The idea of having several SIMD machines (a multiple SIMD or MSIMD system) was also envisioned in the design of the ILLIAC [2] . Other Systems using a similar concept are described in [14 3 21] . There is a master control processor (MCP) that is a small processor in its own right and co-ordinates the ClFs. MCP communicates with the CIPs through the driver network and the CU's communicate with the PE's through the channeling network that also happens to be partitioaable.
Execution of the Dantzig-Wolfe (D -W) procedure occurs in the following manner. The MCP will be in charge of handling the master problem of (D -W), and co-ordinating the CU's. Each CU handles one or more subproblems. This is where the ability to partition the machine structure becomes crucial. The sizes of subproblems (i e., number of constraints and/or variables) will vary from problem to problem. Recall from earlier sections that each subproblem in D -W is an LP problem and that an LP problem could be executed efFiciently using the parallelism inherent in an SIMD machine, As such, the architecture must permit the flexibility of assigning a variable number of PE's to each CU, depending on the size of the subproblems. This will make maximum use of the Computing resources. À collection of PE's together with the CU to which they are assigned, solves one subproblem. Such a CU-PE collection unit functions as an SIMD machine. It is thus apparent why partitioning capability in the interconnection network and in the channeling network is needed. Each partition of these two networks wiil serve a group of PE's that are working on the same subproblem. The next section discusses details of the various components of the machine.
Most parallel processing machines function as special purpose processors in conjunction with some larger gênerai purpose computer. The ILLIAC has a B 6500 as a host for example [5] . The function of the gênerai purpose machine is to perform I/O fonctions and set up data in the respective PE's, handle interrupts, etc. The reader may also refer to [28] 5 for descriptions of such total Systems, 6 . SYSTEM DETAILS We now elaborate on the different structural components in the architecture of the parallel processor presented in Figure 5 .1. Since implementation issues are beyond the scope of this paper, unnecessary detail has been omitted. However, it should be clear that the structural components are relatively simple.
The processing element
The internai structure of a processing element is shown in Figure 6 . L The A register is the accumulator; registers B and C hold operands for binary opérations, Any data transfers to and from other PE's is done via the DTR. The X register is an index register and D a status register. Each PE has Unes to its control unit for transmission of status and for reception of common data and for instruction exécution control.
The control units (CU)
The CU's drive the PE's. opérations to drive the PE's. Any data that is common to all PE's, say a common multiplier, is transmitted to all PE's in parallel through the Common Data Bus. Recall from the program in section 4, that particular PE's had to be deactivated during certain opération. Thus, while the same instruction is broadcast to ail PE's, only the active ones exécute it. The status unes are used to transmit this information to PE's.
The accumulators and ALU are the registers and arithmetic/logic unit within the control unit. They are necessary for instruction decoding and other simple arithmetic opérations the CU may have to perform.
The CU memory would be typically much smaller that the PE memories. The program to be executed is stored in the CU memory. Reserved areas of this memory can be written to by PE's using the Results data bus and read by the MCP. As in the ILLIAC [5] , the ADB(advanced Data Buffer) is just a high speed scrath pad memory. The internai structure of this CU reflects the needs of the application. For example, the Results data bus, and the bus leading to the MCP are necessary to receive and transmit subproblem results to the master control processor.
Channeling network
A muitiplexor arrangement is proposedfor the channeiing network, with the requirement of partitionability in mind. Each PE gets its instruction stream from our particular CU S depending on its MUX CONTROL. The MUX CONTROL is set by the MCP for a particular assignment of PE's to CU's. This multiplexor arrangement enables us to assign a variable number of processors to a CU depending on the size of subproblems. This is in effect a cross-bar switch as used, in the MAP, MSIMD system [14] .
Interconnection network
The interconnection network allows PE's to transfer data among themselves. There are various propositions for interconnection networks each with its own advantages, as mentioned in section 3. The interconnection can be viewed as a blackbox as shown in Figure 6 .4. Data from each PE can be routed to other PE's through the network. The network we wish to use is called the multistage cube network. As mentioned in section 5 in connection with the analysis, inter PE data transfers are a major overhead with SIMD machine. Hence it would be to our advantage to design the network to reduce the number of transfers required for the class of applications envisioned. One of the major transfer opérations performed for the D-W algorithm is during loading of rows and columns from skewed storage. This has to be done very frequently. The other requirement that must be satisfied is that the network be partitionable; i.e., the various subnetworks arising out of the partition should have the same data transfer capabilities as the whole network. Recall that PE's are assigned to certain CU's. It is undesirable that PE's working on one subproblem transfer data to/from another set of PE's working on another subproblem.
The multistage cube network [18] succeeds on both counts. Its configuration for the case of eight processors is shown in Figure 6 .5. At every stage /, input lines that differ in the z-th bit of the binary représentation of their number, are paired together. The boxes are known as interchange boxes and they can either pass their two inputs striaght through or switch them as shown by dotted lines in Figure 6 .5. 
PE 2 OUTPUTS
A uniform shift of x occurs when ever PE[Ï] transfers data to PE[(ï-hx) ModP] where P is the number of processors. The multistage cube network can perform uniform shifts in a single pass through the network. This has been proved [12] . Each interchange box is independently controllable. One routing scheme employs destination tags. Each data item is tagged with the PEN of its destination PE. For example, for a uniform shift of 3, PE 0 would tag it data item with 011 (binary représentation of 3, the destination PE), PE 4 would attach a tag of 111. At stage i 9 if the i-th bit of the sending PE is a 0 then the upper box ouput is taken, if it is a 1 the lower box output is taken. Consider the uniform shift of 3. The switching pattern of the boxes is shown in Figure 6 . 6, The multistage cube network can also be partitioned into smaller networks, each of which is in itseM* a multistage cube network. For example, setting all the boxes in stage 2 to "straight" produces two subnetworks» -{ 0, 1, 2 5 3} and {4, 5, 6, 7}. Each of these is in itself a multistage cube network. For details, see [20, 23] . The reader will have noticed that partitions are powers of two. For example,-1024 processors, can be partitioned into subgroups of 64,64 5 128,256, 512 or 128, 128, 256, 512, etc.
5* The master control processor and driver network
The Master Control Processor (MCP) is in charge of handling the master problem. ït is also charged with setting the multiplexor in the channelling network; L e,, it controls the assignment of certain PE*s to ClFs. The CU's coordinate solutions to subproblems in turn. Steps 2 and 3 in D -PF(Section 2) are associated with the Master problem. The internai structure of MCP is much the same as that of a CU except that it has more sophisticated arithmetic/logic capabilities than the CU. This is necessary because in addition to instruction decoding, it has to perform the arithmetic/logic demanded by steps It is quite likely that the number of subproblems will be much smaller than the number of rows or columns of the constraint matrix, Hence, it seems practical to perform the summation of £ Z ( in Step 2 of D -JF(Section 2) sequentially. The hatched portion of MCPM is a reserved area in which the CU's deposit their values of Z is and décision variables. Recall that simplex multipliers from the master problem must be transmitted to the CU's for use in subproblem solving. This is accomplished through the DTR (Data Transfer Register) in Figure 6 .7. In other respects, the MCP is just like a control unit. It has an ALU and some gênerai purpose registers. 7 . SUMMARY An architecture of a parallel processor that is specially suitedfor solving a class of opérations research problems has been presented. We recognize the existence of many techniques for efficiently solving large scale Systems optimization problems. In our opinion, parallel processors are an interesting class of machines which could be profitably used in the efficient solution of large scale optimization problems. With the émergence of powerful low cost microprocessors, the viability of ha ving a very large number of processors in one machine is hardly in doubt. However, the successful application of such machines to the optimization area will require the close co-operation of both hardware specialists and opérations research specialists. Such joint efforts are already underway in other areas of application [24, 27] .
