Abstract-A new multiprocessor architecture, called the orthogonal multiprocessor (OMP), is proposed in this paper. The OMP architecture has a simplified busing structure and partially shared memory, which compares very favorably with fully shared-memory multiprocessors using a crossbar switch, multiple buses, or multistage networks. The higher performance comes mainly from significantly increased memory bandwidth, fully exploited parallelism, reduced communication overhead, and lower hardware control complexity. Parallel algorithms mapped onto the OMP include matrix arithmetic, linear system solvers, FFT, array sorting, linear programming, and parallel PDE solutions. In most cases, linear speedup can be achieved on the OMP system. The OMP architecture provides linearly scalable performance and is well suited for building special-purpose scientific computers for signal/image processing, machine sorting, linear system solving, and PDE solution.
I. INTRODUCTION
Various shared-memory multiprocessor architectures have appeared in recent years, including bus-structured systems: Encore Multimax [12] and Elxsi 6400 [23]; crossbar-connected systems: Alliant FX/8 [36]; directly connected systems: IBM 3090/400, Univac 1100/94, Cray X-MP [7], and Cyberplus [10]; and multistage network systems: IBM RP3 [28] and BBN Butterfly Processor [4]. Each architectural topology has its own merit in terms of supporting tightly coupled MIMD operations using shared memory. However, each of them also has its own weakness in terms of hardware demand, arbitration control, effective memory bandwidth, and fault tolerance [14].
Recently, three research groups have independently developed a similar multiprocessor architecture for parallel processing [6], [18], [30]. These architectures use a two-dimensional array of partially shared memory, which the processors access concurrently via dedicated memory buses. The interesting feature of such a shared-memory array is its capability to support orthogonal memory access without conflicts. Such a conflict-free memory architecture facilitates the implementation of a large class of parallel algorithms for matrix algebra, solving linear systems, linear programming, signal and image processing, sorting of large arrays, and parallel solution of PDE problems. We call such an architecture an orthogonal multiprocessor (OMP) system. Previously, this OMP architecture was called a reduced-mesh multiprocessor with orthogonally shared memory [18], or an orthogonal memory-access multiprocessor [30], or the ETH-multiprocessor EMPRESS [6].
The OMP architecture appeals very much to today's VLSI and busing technology. Sophisticated processors with built-in instruction cache and functional pipelines, 256K RAMs, and VME bus controllers are now available in monolithic chips. The access time of semiconductor memories has been significantly reduced to match the cycle time of microprocessors or coprocessors. Furthermore, high-speed system buses are now available, such as the 100 Mbyte/s Nanobus used in the Multimax system and the 320 Mbyte/s bus used in the Elxsi 6400.
These technological advances have inspired many computer architects to develop high-performance multiprocessors for parallel/vector processing [2], [11]-[14], [19]-[22]. The OMP architecture offers an orthogonality concept, which, as we shall prove, plays a crucial role in delivering high performance. The OMP architecture is designed for a moderate degree of parallelism, say from 16 to 256 processors, based on state-of-the-art electronic technology [15], [33]. The hardware demand of an OMP is comparable to a crossbar multiprocessor or a multibus multiprocessor with fully shared memories. However, the control complexity of the OMP is significantly lower than that of any existing multiprocessor due to the restricted operating modes imposed by orthogonality.
The main contributions of the paper lie in the characterization of the OMP architecture, providing principles of orthogonal memory access, mapping parallel algorithms onto the OMP, and analyzing multiprocessor performance. The results obtained should be useful to those who are involved in the development of efficient multiprocessor systems for scientific and engineering applications.
II. THE ORTHOGONAL MULTIPROCESSOR ARCHITECTURE
The logical architecture of an OMP system is depicted in Fig. 1. Each memory bus is dedicated to one processor only; there is no time sharing of the buses by multiple processors. The absence of bus contention greatly reduces memory access conflicts. The n buses can be functionally divided into two operational sets: n row buses $B_i^r$ and n column buses $B_i^c$ for $i = 0, 1, \ldots, n-1$. Physically, $B_i^r$ and $B_i^c$ are the same bus $B_i$. When $B_i$ is used in a row access mode, $B_i^r$ is enabled and $B_i^c$ is disconnected. Similarly, $B_i^c$ is enabled and $B_i^r$ is disconnected when $B_i$ operates in a column access mode. The two access modes are mutually exclusive. This constraint simplifies the memory access control significantly. All arbiters in the memory modules use only two-way switches. The arbiters are coordinated by a memory-access controller, which has two control states: one for column access and the other for row access. If we remove the orthogonality constraint, we have to use $n^2$ independent arbiters, which demand much higher control complexity.

We regard the OMP as a partially shared-memory system, because each memory module is shared by at most two processors. Each memory module $M_{ij}$ has two access ports, one for $B_i^r$ and the other for $B_j^c$. This implies that $M_{ij}$ is only accessible by processor $P_i$ and processor $P_j$. With the orthogonality and partial sharing of memory, possible memory access conflicts are significantly reduced. Hence, higher interprocessor-memory bandwidth can be established. The diagonal memory modules $M_{ii}$ for $i = 0, 1, \ldots, n-1$ are accessed by processor $P_i$ exclusively. In other words, the diagonal modules are local memories to each processor. The off-diagonal modules are shared memories, each shared by two processors only.
The orthogonally accessed memory offers a number of attractive application potentials. Given any pair of processors $(P_i, P_j)$, they can communicate with each other through the shared memory modules $M_{ij}$ or $M_{ji}$ in one or at most two memory cycles. Data or instruction broadcasting can be done in exactly two memory cycles. With n buses active at a time, the maximum memory bandwidth is n words per major memory cycle, which equals that of systems using a crossbar switch or a multistage network. This implies that n parallel reads/writes can be carried out per major cycle.
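The two-cycle exchange between a processor pair can be pictured with a small model. The sketch below is illustrative only: the function names and the list-of-lists memory are assumptions made here, not part of the OMP design, and each assignment merely stands in for one memory cycle.

```python
def exchange_module(sender, receiver):
    """Module M[sender][receiver]: it sits on the sender's row bus and on the
    receiver's column bus, so both processors can reach it."""
    return sender, receiver


def send(memory, sender, receiver, value):
    """Model a transfer from P_sender to P_receiver as two memory cycles."""
    i, j = exchange_module(sender, receiver)
    cycles = []
    memory[i][j] = value                  # cycle 1: P_i writes M_ij in row mode
    cycles.append(("row", sender, (i, j)))
    received = memory[i][j]               # cycle 2: P_j reads M_ij in column mode
    cycles.append(("column", receiver, (i, j)))
    return received, cycles
```

For example, with a 4 x 4 array of modules, `send(memory, 1, 3, 42)` passes the value through module (1, 3) and reports one row-mode cycle followed by one column-mode cycle.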
To characterize the orthogonal memory access patterns, let us denote $M_{ij}[k]$ as the kth word in the memory module $M_{ij}$. The ith row memory $M_i^r$ consists of n memory modules $M_{ij}$ for $j = 0, 1, \ldots, n-1$, and similarly, the jth column memory $M_j^c$ consists of $M_{ij}$ for $i = 0, 1, \ldots, n-1$. The parallel accesses of $M_i^r$ and of $M_j^c$ are depicted in Fig. 2(a) and (b), respectively. Note that $B_i^r$ is used to access $M_i^r$ and $B_j^c$ is used for $M_j^c$. Table I summarizes the notations used in the paper.

In fact, the above memory access allows various row and column permutations as shown in Fig. 3. The following access rules must be observed: when row buses are used, only modules from distinct rows can form a legitimate access pattern as shown in Fig. 3(a). Similarly, only modules from distinct columns can be accessed in parallel using the column buses as shown in Fig. 3(b). Mixed access patterns are forbidden as shown in Fig. 3(c).

To improve the memory bandwidth, one can consider the use of a two-dimensional interleaving scheme on the OMP memory array. All modules belonging to the same row or the same column can be n-way interleaved to allow pipelined accesses with a minor cycle, which is $1/n$ of the major memory cycle. These row and column interleavings form the two-dimensional addressing illustrated in Fig. 4 for the case of n = 4 and k = 4 words per module. The row interleaving assumes a stride distance of one, while the column interleaving assumes a stride distance of nk = 16. With n-way interleaving per each row and per each column, the effective memory bandwidth can be increased from n words to $n^2$ words per major memory cycle.

Usually, the interleaved memory access is for the execution of vector instructions or for access to regular data structures [19]. The random access mode is for randomly accessing scalar operands stored irregularly in the memory array. The random access mode (Fig. 3) and the interleaved mode (Fig. 4) must not be mixed on the same memory bus at the same time. When an interleaved mode is used, only modules attached to the same row bus (or the same column bus) are accessed in an overlapped fashion using minor cycles. Theoretically, these two access modes can be applied on different row buses (or different column buses) at the same time. This may require a very complex addressing control. For our purpose, we restrict the access of memory to be either with a random-access mode or with an interleaved mode (but not both) for any given time period. Of course, this does not preclude the use of these two modes alternately at different time periods. Concurrent vector processing is done using multiple buses, each in an interleaved mode.

The OMP architecture is aimed at solving large-grain problems, where the data and programs are distributed over a large array of memory modules. The MIMD mode is adopted.
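The two stride distances quoted for the interleaved addressing (one along a row, nk along a column) can be checked with a small address map. The layout below is an assumption chosen to reproduce those two strides; in particular, the placement of successive words inside one module (a stride of n) is a guess, since Fig. 4 is not reproduced here, and the function name is hypothetical.

```python
def interleaved_address(i, j, word, n, k):
    """Linear address of word `word` of module M[i][j].

    Assumed layout: each row of modules holds n*k consecutive addresses that
    are n-way interleaved across the modules of that row.
    """
    return i * n * k + word * n + j


# Strides for the n = 4, k = 4 case used in the text.
n, k = 4, 4
row_stride = interleaved_address(0, 1, 0, n, k) - interleaved_address(0, 0, 0, n, k)
column_stride = interleaved_address(1, 0, 0, n, k) - interleaved_address(0, 0, 0, n, k)
```

Under this layout `row_stride` is 1 and `column_stride` is nk = 16, and every one of the $n^2 k$ words receives a distinct address.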
III. COMPARISON TO OTHER MULTIPROCESSORS
The hardware demand of the OMP architecture is compared to known multiprocessor architectures in Table II. The comparison is based on using an equal number of processors (n) and the same main memory capacity (C words). There are n memory modules for a fully shared-memory multiprocessor, each having a modular capacity of $C/n$ words accessible by n processors. In the case of the OMP, there are $n^2$ memory modules, each having $C/n^2$ words accessible by at most two processors. The multibus architecture uses multiple time-sharing buses and multiported memory modules [25]. The major difference between the fully shared-memory system using multiple buses and the OMP architecture lies in the number of memory ports per module (n ports versus two ports), the modular capacity ($C/n$ words versus $C/n^2$ words), and the effective memory bandwidth (n words versus $n^2$ words per major cycle). The crossbar architecture requires $n^2$ crosspoint switches and the multistage network demands the use of $n \log n$ $2 \times 2$ switches. The control complexity of these fully shared-memory architectures is much higher than that of an OMP as shown in Table II, especially in switching complexity and the degree of memory sharing.
The OMP requires an interconnection complexity of n buses with only two control states, which is much simpler than the crossbar switch and the multistage network. With the use of n buses and $n^2$ dual-ported memory modules, the OMP memory array should be able to operate with a faster memory cycle and simpler control than any of the fully shared-memory configurations. The bandwidth of the OMP memory varies between n and $n^2$ words depending on how frequently the interleaved mode is used on various buses. Of course, fully shared-memory multiprocessors have higher flexibility in supporting general-purpose applications. The OMP, with restricted memory access, is really meant for special-purpose applications in science, engineering, and high technology.
We analyze below the effective memory bandwidth of the OMP architecture. Let p be the probability that a processor requests a memory access (memory access rate). A memory access is performed with either a column access mode or a row access mode. Let $q_c$ be the probability that a processor requests a column access. Then $q_r = 1 - q_c$ is the probability of requesting a row access. The memory bandwidth E is derived below as a function of n, p, and $q_c$.
Theorem 1: The orthogonal multiprocessor with n processors has the following effective memory bandwidth:
Proof: The probability Prob(k) that k processors request column accesses in the same memory cycle is
where $P_m(j)$ is the probability that exactly j out of n processors request memory accesses, and $P_c(k \mid j)$ is the conditional probability that k processors request column accesses and $j - k$ processors request row accesses, given that there are in total j memory requests. Using Bernoulli trials, $P_m(j)$ is written as
Suppose j processors request memory accesses. If k of them want the column access and $j - k$ the row access, we would carry out the access mode favored by the majority of the j requests, due to the orthogonality. Hence, the number of successful memory accesses in each memory cycle is $\max(k, j - k)$. The bandwidth is thus obtained by including all possible values of j:
where the first term in the brackets corresponds to row accesses, and the second to column accesses. The proof is complete, after (3) and (4) are substituted into (5). Q.E.D.
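The displayed equations of this derivation do not appear in the text above. The block below is a reconstruction from the wording of the proof, not the typeset formulas of the paper: it assumes independent Bernoulli requests with rate $p$, a column access with probability $q_c$ for each request, and service of the majority mode in every cycle. Equation numbers are left out because the proof names only (3), (4), and (5).

```latex
\begin{align*}
\mathrm{Prob}(k) &= \sum_{j=k}^{n} P_m(j)\, P_c(k \mid j), \\
P_m(j) &= \binom{n}{j}\, p^{j} (1-p)^{n-j}, \\
P_c(k \mid j) &= \binom{j}{k}\, q_c^{k}\, q_r^{\,j-k}, \qquad q_r = 1 - q_c, \\
E &= \sum_{j=1}^{n} P_m(j) \left[\, \sum_{k=0}^{\lfloor j/2 \rfloor} (j-k)\, P_c(k \mid j)
   \;+\; \sum_{k=\lfloor j/2 \rfloor + 1}^{j} k\, P_c(k \mid j) \right].
\end{align*}
```

The two inner sums together equal $\sum_k \max(k, j-k)\, P_c(k \mid j)$: the first collects the cycles in which the row requests form the majority, the second those in which the column requests do.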
For a multiprocessor with n processors and n memory modules interconnected by an $n \times n$ crossbar switch network, the effective memory bandwidth has been formulated by Bhuyan [5] as
where q represents the probability of a request to access a favorite memory. A favorite memory is a memory module that a processor accesses more often than the rest. The probabilities of accessing the remaining modules were assumed equal in the above formulation. For the sake of comparison, we consider the diagonal local memories as the favorite ones in the OMP architecture. This is due to the fact that processors access the local memories more frequently than the shared modules. Equal access probabilities are assumed for all off-diagonal memory modules. We divide the access probability $q_c$ in (4) into two terms as follows:
The term q is for accessing the favorite memory and the term $(1 - q)/2$ is for accessing the remaining modules in the same column or in the same row. Fig. 5 shows the effective bandwidths of two multiprocessor architectures: the crossbar system versus the OMP. The curves are plotted as a function of the access probability p under two cases. The case in Fig. 5(a) corresponds to a higher probability of accessing the favorite memory (q = 0.75) and Fig. 5(b) to a lower value (q = 0.5). The plots demonstrate that the effective memory bandwidth of the OMP is comparable to that of the crossbar network.
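The OMP side of such a comparison can be evaluated numerically from the model used in the proof of Theorem 1. The sketch below is an assumption-laden illustration: it takes the binomial request model and the majority rule $\max(k, j-k)$ from the proof, leaves $q_c$ as a plain parameter rather than deriving it from q, and does not implement the crossbar formula. The function name is hypothetical.

```python
from math import comb


def omp_bandwidth(n, p, q_c):
    """Expected number of memory accesses served per cycle on an n-processor OMP.

    Each processor issues a request with probability p; each request is a
    column access with probability q_c, otherwise a row access. In every
    cycle the larger of the two groups is served.
    """
    q_r = 1.0 - q_c
    total = 0.0
    for j in range(1, n + 1):
        # Probability that exactly j of the n processors issue a request.
        p_m = comb(n, j) * p**j * (1.0 - p) ** (n - j)
        # Expected number served, given j requests.
        served = sum(
            comb(j, k) * q_c**k * q_r ** (j - k) * max(k, j - k)
            for k in range(j + 1)
        )
        total += p_m * served
    return total
```

When every request uses the same mode (`q_c = 1.0`) no request is rejected, so the result reduces to `n * p`.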
In summary, we found that the OMP architecture requires much less control hardware by using dedicated memory buses (rather than time-sharing buses across many processors). The increased memory bandwidth is due to conflict-free access, lower degree of memory sharing, and two-dimensional memory interleaving. For the OMP, the resource conflict problems, such as hot spot and bus contention, are avoided by the synchronized orthogonality among multiple processors. All of these features are crucial in making OMP an attractive architecture for parallel scientific computations.

concurrently as outlined below. Note that all n processors are involved in the parallel executions.
Parallel Computations of All Elements in the ith Row of Matrix C:
1) Read $A_i$, the ith row of A, by row access.
2) At the jth step ($j = 0, 1, \ldots, n-1$), $B_j$, the jth column of B, is read with a column access.
3) Multiply $A_i$ and $B_j$ term-by-term in parallel, generating all product terms $a_{ik} b_{kj}$, $k = 0, 1, \ldots, n-1$, for $c_{ij}$.
4) Store the product terms in the jth column memory $M_j^c$.
5) If $j < n - 1$, go to Step 2, else go to Step 6.
6) Now, all column memories contain product terms, one per each module. Perform parallel addition by column access to generate the ith row of C ($c_{i0}, c_{i1}, \ldots, c_{i,n-1}$).

The above procedure is formally specified in Algorithm 1. The parallel construct forall $P_k$, ($k = 0, 1, \ldots, n-1$) doparallel means that n processors run concurrently for parallel processing, where the index k corresponds to the kth processor. The end of this parallel process is indicated by endforall. The conventional for and do constructs still imply a sequential process. Each block of code is delimited by the begin and end constructs.
Algorithm 1: Parallel Matrix Multiplication on OMP:
forall processors $P_k$, ($k = 0, 1, \ldots, n-1$) doparallel
  for i = 0 to n - 1 do
  begin
    Read $a_{ik}$
    for j = 0 to n - 1 do
    begin
      $c_{ij}^{k} = a_{ik} b_{kj}$ {$c_{ij}^{k}$ is a value associated with processor $P_k$}
      Store $c_{ij}^{k}$ in $M_{kj}$.
    end
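The row-by-row procedure above can be simulated sequentially to check both the result and the step count. This is a minimal sketch, not the authors' Algorithm 1: the accumulation loop is reconstructed from the six-step outline, the step accounting (two steps for reading and resetting, two per product term, one per accumulation) follows the proof of Theorem 2, and the name `omp_matmul` is hypothetical.

```python
def omp_matmul(a, b):
    """Simulate C = A * B on n logical processors, one row of C at a time.

    Returns (c, steps). Work done by the n processors in parallel is charged
    once, so `steps` counts time steps rather than operations.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    steps = 0
    for i in range(n):
        # m[k][j] models memory module M_kj while row i of C is computed.
        m = [[0] * n for _ in range(n)]
        steps += 2                        # read row i of A, reset accumulators
        for j in range(n):
            for k in range(n):            # processors P_0 .. P_{n-1} in parallel
                m[k][j] = a[i][k] * b[k][j]
            steps += 2                    # one multiply and one store each
        for row in range(n):              # accumulate down every column
            for k in range(n):            # P_k owns column k and c[i][k]
                c[i][k] += m[row][k]
            steps += 1
    return c, steps
```

For n = 2 the simulation reports 16 steps, which equals $3n^2 + 2n$.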
Theorem 2: It takes $3n^2 + 2n$ time steps to perform a matrix multiplication on an OMP with n processors, where each multiplication, each addition, or each memory access is considered one time step.
Proof: The j-loop for generating the product terms requires 2n steps, and the m-loop requires n steps. Two steps are needed for reading $A_i$ and resetting $c_{ij}$. Since there are n iterations (i-loop), the total time is $n(3n + 2)$.
Q.E.D.

Using Gaussian elimination, the problem can be solved by eliminating the $a_{ij}$'s for $i > j$ in $n - 1$ steps (triangularization). The solutions for $x_n, x_{n-1}, \ldots, x_1$ are then obtained in a sequence of back substitutions. The triangularization procedure on OMP is specified in Algorithm 2, where $a_{ij}(t)$ and $b_i(t)$ are the $a_{ij}$ and $b_i$ at step t, respectively.
Algorithm 2: Parallel Triangularization of a Linear System:
for t = 1 to n - 1 do
  forall $P_i$, ($t < i \le n - 1$) doparallel
    $m_{it} = a_{it}(t)/a_{tt}(t)$, $a_{tt}(t) \ne 0$
    Broadcast $m_{it}$ to all processors.
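Since only part of Algorithm 2 is shown, the sketch below gives a plain sequential version of triangularization followed by back substitution, to make the data dependences concrete. It is an illustration under stated assumptions, not the OMP algorithm: no pivoting is done (every pivot is assumed nonzero), indices are 0-based, and the name `solve_by_elimination` is hypothetical.

```python
def solve_by_elimination(a, b):
    """Solve a x = b by Gaussian elimination without pivoting."""
    n = len(a)
    a = [row[:] for row in a]             # work on copies
    b = b[:]
    for t in range(n - 1):
        # The rows i > t are updated independently of one another, so this
        # loop is the part that n processors could share.
        for i in range(t + 1, n):
            m = a[i][t] / a[t][t]         # multiplier for row i at step t
            for j in range(t, n):
                a[i][j] -= m * a[t][j]
            b[i] -= m * b[t]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):        # back substitution, last unknown first
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x
```

For the system 2x + y = 3, x + 3y = 5, the call `solve_by_elimination([[2, 1], [1, 3]], [3, 5])` yields x = 0.8 and y = 1.4 up to rounding.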
Theorem 3: It takes $O(n^2)$ time steps to solve a linear system of n equations on an OMP with n processors using Gaussian elimination, where each time step corresponds to a single arithmetic or a single memory access operation.
Proof: The i-loop of triangularization is repeated $n - 1$ times for each iteration t. Hence, it takes $C_1\{(n-1) + (n-2) + \cdots + 1\} = C_2 n^2$ steps to perform the triangularization, where $C_1$ and $C_2$ are some constants. $n - i$ processors are used for concurrent substitutions at step i ($i = 1, 2, \ldots, n-1$), immediately after the value $x_{n-i}$ becomes known. Thus,

where $w^{n^2} = 1$, $w = e^{j2\pi/n^2}$, $n^2 = 2^q$ (q: an integer), and $j = \sqrt{-1}$.
The FFT requires $\log n^2$ butterfly operations for each output sample. The first $\log n$ of them have stride distances (the difference in data indexes of a butterfly operation pair) $n^2/2, n^2/4, n^2/8, \ldots, n^2/n$, which correspond to $n/2, n/4, n/8, \ldots, n/n$, respectively, in row stride distance. These are performed by row accesses on the OMP as specified in Algorithm 3. The remaining $\log n$ steps have stride distances $n/2, n/4, n/8, \ldots, n/n$, which correspond to $n/2, n/4, n/8, \ldots, n/n$, respectively, in column stride distance. They are executed by column accesses.

An example is given below for a 16-point FFT specified by the signal flow graph in Fig. 6. Initially, 16 samples $\{x(0), \ldots, x(15)\}$ are stored in column-major order, one in each memory module [Fig. 7(a)]. The entire process consists of four stages ($= \log 4^2$) [Fig. 7(a) and (b)]. A permutation must follow to sequence the result correctly [Fig. 7(c)]. Fig. 7(a) shows the first two stages using row accesses, and Fig. 7(b) illustrates the remaining two stages of column computation. A butterfly operation corresponds to each data pair joined by an arc in Fig. 7. The arc specifies the direction of data movement; the head receives data which are weighted products of the source data. Associated weights are shown on the arcs. Intermediate results after each stage t are denoted as

This implementation assumes that the input vector is in correct order, and that the output vector will be in bit-reversed order.

forall $P_k$, $k = 0, \ldots, n-1$ doparallel
  for $j = \log n^2 - 1$ downto $\log n$ do {row computation}
  begin
    for i = 0 to n - 1 do

To extend the FFT algorithm to $kn^2$ sample points, we distribute k data points to each memory module. The sample point $x_{(ln+j)n+i}$ is stored in $M_{ij}[l]$, where $0 \le i, j < n$, $0 \le l < k$. Logically, a $kn \times n$ array is stored in column-major order. By this ordering, the data with stride distances $1, 2, 4, \ldots, n/2$ are reached through column accesses. The remaining $n, 2n, 4n, \ldots, kn^2/2$ distances are reached by row accesses.
Once a stride distance is given, n butterfly operations are performed in parallel without access conflict. Since the FFT consists of $\log kn^2$ stages of butterfly operations, the first $\log kn^2 - \log n$ stages are performed by row access ($\log k$ of them, whose stride distances are integer multiples of $n^2$, have their communication pairs in the same memory module), and the remaining $\log n$ stages are carried out by column accesses. We summarize the above results as follows.
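The split into row stages and column stages can be checked with a sequential sketch for the case of one sample per module. Several choices below are assumptions rather than details taken from the paper: a radix-2 decimation-in-frequency schedule, the forward-transform sign $e^{-j2\pi/N}$, and the name `omp_fft`. Sample $jn + i$ is placed in module (i, j), matching the column-major layout described above.

```python
import cmath


def omp_fft(x, n):
    """FFT of N = n*n samples held one per module in an n x n grid.

    Returns (spectrum, modes): the DFT in natural order and, for each stage,
    "row" if every butterfly pair lies in one row of modules, else "column".
    """
    N = n * n
    bits = N.bit_length() - 1
    assert len(x) == N and N == 1 << bits, "n*n must be a power of two"
    m = [[complex(x[j * n + i]) for j in range(n)] for i in range(n)]
    modes = []
    stride = N // 2
    while stride >= 1:
        modes.append("row" if stride >= n else "column")
        for low in range(N):
            if low & stride:
                continue                   # visit each pair from its lower index
            high = low + stride
            i0, j0, i1, j1 = low % n, low // n, high % n, high // n
            pos = low & (2 * stride - 1)   # offset inside the current block
            w = cmath.exp(-2j * cmath.pi * pos / (2 * stride))
            u, v = m[i0][j0], m[i1][j1]
            m[i0][j0] = u + v
            m[i1][j1] = (u - v) * w
        stride //= 2
    # Final permutation: outputs are produced in bit-reversed order.
    out = [0j] * N
    for idx in range(N):
        rev = int(format(idx, "0{}b".format(bits))[::-1], 2)
        out[rev] = m[idx % n][idx // n]
    return out, modes
```

For n = 4 the four stages come out as two row stages followed by two column stages, in agreement with the 16-point example.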
Theorem 4: The $kn^2$-point FFT takes $O(kn \log kn)$ time steps on an OMP of n processors, where each time step corresponds to one butterfly operation or one permutation operation.
Proof: In each stage, $kn^2$ butterfly operations are performed in $O(kn)$ time, since n processors work concurrently. The row computation consists of $\log kn^2 - \log n$ stages with kn butterfly operations each, and the column computation requires $\log n$ stages with kn butterfly operations each. Permutation for ordering of the bit-reversed data needs kn time. Hence, the total execution steps equal $O(kn \log kn)$. Q.E.D.

IEEE TRANSACTIONS ON COMPUTERS, VOL. 38, NO. 1, JANUARY 1989

It takes $O(kn^2 \log kn)$ steps to perform the FFT on a uniprocessor [26]. Therefore, a linear speedup is achieved by the OMP as expected. The parallelism is exploited within each stage and successive stages must still be processed sequentially. To perform a two-dimensional FFT on the OMP, one can modify the above 1-D FFT algorithm by using two successive 1-D FFT processes: one using column memories and the other using row memories.
VI. PARALLEL SORTING USING OMP
An $O\{k^2 n(\log n) \log kn\}$ algorithm is developed for sorting $k^2 n^2$ numbers on an OMP, where $1 \le k^2 < m/2$. This sorting method results in an $O(n/\log n)$ speedup over the best known sequential sorting algorithm. The algorithm, called orthogonal sorting, combines the bitonic sort [3] with a sequential merge sort. The numbers are initially stored in column-major order in the memory array. The orthogonal sorting is specified recursively in Algorithm 4. The constructs cobegin and coend indicate that all statements within the block are to be executed concurrently.

modules, for $i = s, \ldots, s + c - 1$, using a bitonic merge.
We use an example to illustrate how to sort 16 numbers on an OMP with n = 4 processors. One number is stored in each memory module. The detailed execution sequence is shown in Fig. 8(a), which consists of three levels. It looks like an execution tree; however, all procedures at the same level may not be performed in parallel (because the procedure merge should follow the procedure sortcol). Procedures with the same execution label ($e_i$) will be executed concurrently. The whole sequence is executed in five steps [Fig. 8(b)]. By evenly distributing the numbers to all memory modules, the orthogonal sorting algorithm can be extended to sort $k^2 n^2$ numbers, where $1 \le k^2 < m/2$. Each of the $n^2$ memory modules is initially loaded with $k^2$ numbers, as shown in Fig. 9 for the case of $k^2 n^2 = 2^2 \times 4^2 = 64$. The sorting is done by initiating the procedure sort(0, kn, k, 1). The task sortcol(s, k, a) sorts k columns of data in a column memory of processor $P_s$ using a sequential merge sort algorithm. The generalized orthogonal sorting is specified in Algorithm 5.
Theorem 5: The orthogonal sorting of $k^2 n^2$ numbers requires $O\{k^2 n(\log n) \log kn\}$ time steps on an OMP of n processors, where each step corresponds to the time required for one compare-and-exchange operation on a processor.
Proof: Due to the recursive decomposition by cobegin and coend, the sort has $\log n$ levels of execution [i.e., sort(0, kn, k, a), sort(0, kn/2, k, a), sort(0, kn/4, k, a), ..., sort(0, 2k, k, a), sort(0, k, k, a)]. The sortcol takes $k \cdot kn \log kn$ steps. The task row-merge uses at most $k \cdot kn \log n$ steps. The task col-merge requires $k \cdot kn \log kn$ steps. It results in $k^2 n \log kn$ steps to complete each level. Thus, the total execution time of the orthogonal sorting is $O\{k^2 n(\log n) \log kn\}$.
Q.E.D.

The best known sequential sorting algorithm requires $O(k^2 n^2 \log kn)$ steps to sort $k^2 n^2$ numbers. Therefore, the orthogonal sorting method can achieve a speedup of $O(n/\log n)$ over the sequential sort. For the special case of k = 1, it takes $O(n \log^2 n)$ time steps to sort $n^2$ numbers on an n-processor OMP. Because a uniprocessor needs $O(n^2 \log n)$ time steps to sort $n^2$ numbers, the $O(n/\log n)$ speedup holds also for the special case.
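The speedup quoted above follows directly from dividing the two stated complexities. The short derivation below only restates that arithmetic; the symbols $T_{\text{seq}}$ and $T_{\text{OMP}}$ are introduced here for convenience.

```latex
\[
\frac{T_{\text{seq}}}{T_{\text{OMP}}}
  = \frac{O(k^2 n^2 \log kn)}{O\{k^2 n (\log n) \log kn\}}
  = O\!\left(\frac{n}{\log n}\right),
\qquad
k = 1:\;\;
\frac{O(n^2 \log n)}{O(n \log^2 n)} = O\!\left(\frac{n}{\log n}\right).
\]
```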
Parallel sorting methods are also implementable on a mesh-connected computer (MCC) that has $n^2$ processors. The MCC has a time complexity of $O(n)$ in sorting $n^2$ numbers [29], [37]. However, the $O(\log^2 n)$ speed gain in using an MCC is obtained at the expense of using n times more processors than the OMP. Further comparisons of OMP and MCC will be given in Section IX. In Scherson et al. [31], a shear sort is developed to sort $n^2$ numbers using a two-dimensional sorting network of n compare-exchange modules, which is equivalent to using $n^2$ processors. The shear sort has a time complexity of $O(n \log n)$, but uses n times more processors than the orthogonal sort on an OMP.
VII. PARALLEL LINEAR PROGRAMMING
The simplex method [35] for linear programming is mapped onto the OMP. Consider the iterative part of the simplex method with p unknowns and q linear constraints. In each iteration, the computations involved are

$\cdots + c_p x_{q+p} - z_0$, subject to the constraints

where $x_{(p+q) \times 1} = [x_1 x_2 \cdots x_{p+q}]^T$, $A_{q \times p}$ is a coefficient matrix of constraints, and $I_{q \times q}$ is an identity matrix. The variables $x_1, x_2, \ldots, x_q$ are basic variables; $x_{q+1}, x_{q+2}, \ldots, x_{q+p}$ are free variables; and $-z_0$ is the current minimal of the iterative process. The process terminates with the true minimal when all $c_i \ge 0$ ($1 \le i \le p$). For a systematic computation, a coefficient matrix T is formulated as

Initially, these are evenly distributed to the memory modules (Fig. 10). The following computations are performed in each iteration:
Step 1. Find a pivoting column s by $c_s = \min\{c_j \mid 1 \le j \le p\}$. If $c_s \ge 0$, then stop.
Step 2. Find a pivoting row r by $b_r/a_{rs} = \min\{b_i/a_{is} \mid \cdots\}$.
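Steps 1 and 2 amount to two minimum searches, which the sketch below spells out sequentially. It is an illustration, not the paper's procedure: the restriction of the ratio test to rows with a positive entry in the pivoting column is the standard simplex rule and is assumed here because the condition is cut off in the text, indices are 0-based, and the name `choose_pivot` is hypothetical.

```python
def choose_pivot(c, a, b):
    """Return the pivot position (r, s), or None when the iteration stops.

    c: cost coefficients, a: constraint matrix (list of rows), b: right-hand side.
    """
    # Step 1: pivoting column = position of the smallest cost coefficient.
    s = min(range(len(c)), key=lambda j: c[j])
    if c[s] >= 0:
        return None                       # all coefficients nonnegative: stop
    # Step 2: ratio test over rows with a positive entry in column s (assumed).
    rows = [i for i in range(len(b)) if a[i][s] > 0]
    if not rows:
        raise ValueError("no positive entry in the pivoting column")
    r = min(rows, key=lambda i: b[i] / a[i][s])
    return r, s
```

On an OMP each processor would scan its own share of the entries and the n partial minima would then be combined, which is where the $\log n$ terms in the operation counts of Theorem 6 come from.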
Parallelizing Step 3 contributes to a linear speedup. The parallel computations in Step 3a are performed using column accesses. The data in row r are broadcast to the column memory modules. The processors then perform parallel computations in Step 3b using the row accesses. A few snapshots of one iteration are shown in Fig. 10 corresponding to the following numerical example (p = 7, q = 3):

$x_1 \ge 0, x_2 \ge 0, \ldots, x_7 \ge 0$.

The circled location (r = 3, s = 5) is found by pivoting the row and column (Steps 1 and 2). Step 3a is performed in Fig. 10(b). All pivoting row elements are broadcast in parallel [Fig. 10(c)] for the calculations of the second line of Step 3b. Fig. 10(d) shows the result after one iteration of the simplex method.

Theorem 6: Each iteration of the simplex method with p unknowns and q constraints can be computed in $O(pq/n)$ time steps on an OMP with n processors, where $p, q \gg n$, and each step corresponds to the time needed for one arithmetic operation or one comparison operation.
Proof: Step 1 requires $(q/n + \log n)$ operations since each processor has to perform $q/n$ comparisons and then n processors cooperate to find a minimum among n of them. Similarly, Step 2 needs $p/n + \log n$ operations. Step 3 consists of $(q + 1)/n$ steps for 3a, $p/n + 2pq/n$ steps for 3b, and a unit operation for 3c. Therefore, each iteration takes $O\{2(pq + p + q)/n + 2 \log n\} = O(2pq/n)$ operations, because $p \cdot q/n \gg p \gg \log n$.
Q.E.D.

Without parallelization, the same iteration of the simplex

$(1 \le p, q \le k)$. (14)

The SLOR method is based on Chebyshev overrelaxation as illustrated in Fig. 11. The points on the mesh are divided into odd and even sets of lines, as indicated by white and shaded strips, respectively. The method updates the odd and even lines alternately using the following equations at the tth iteration:
$u_{pq}^{(t+1/2)} = r\,\tilde{u}_{pq} + (1 - r)\,u_{pq}$
The parameter r is a given relaxation factor, and $t = 1/2, 1, 3/2, 2, \ldots$, with an increment of 1/2 in successive values, is used until $u_{pq}$ converges. There are $k/2$ independent tridiagonal systems of length k along $k/2$ lines of the mesh which should be solved in every half iteration. The right-hand side depends on the known values from the lines above and below, which have been already computed in the very last

Since there are $k/2$ independent tridiagonal systems of length k, n of them are solved in parallel by n processors, which takes $k/(2n) \cdot 8k = 4k^2/n$. The updating process of (16) needs $3 \cdot k^2/(2n)$ arithmetic operations. Hence, the total time complexity for each iteration is $O(k^2/n)$.
Q.E.D.

For a uniprocessor, it takes $2k^2$ steps to compute the right-hand side of (15), $k/2 \cdot 8k$ steps to solve the tridiagonal systems, and $3k^2/2$ steps in updating (16). The sequential SLOR method thus needs $O(k^2)$ steps. The above parallel method again achieves a linear speedup over the sequential method.
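The cost of about 8k steps per tridiagonal system is consistent with the usual forward-elimination and back-substitution scheme. The sketch below shows that scheme (the Thomas algorithm) as one plausible per-line solver; the text does not name the solver actually used, so this choice and the name `solve_tridiagonal` are assumptions.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system of length k in roughly 8k arithmetic operations.

    lower[i], diag[i], upper[i] are the sub-, main- and super-diagonal entries
    of row i; lower[0] and upper[-1] are ignored. No pivoting is performed.
    """
    k = len(diag)
    cp = [0.0] * k                        # modified super-diagonal
    dp = [0.0] * k                        # modified right-hand side
    cp[0] = upper[0] / diag[0]
    dp[0] = rhs[0] / diag[0]
    for i in range(1, k):                 # forward elimination: 6 operations per row
        denom = diag[i] - lower[i] * cp[i - 1]
        cp[i] = upper[i] / denom
        dp[i] = (rhs[i] - lower[i] * dp[i - 1]) / denom
    x = [0.0] * k
    x[-1] = dp[-1]
    for i in range(k - 2, -1, -1):        # back substitution: 2 operations per row
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Because the $k/2$ systems of one half iteration are independent, each processor can run this solver on its own lines without any communication until the next half iteration.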
IX. PERFORMANCE AND CAPABILITY ANALYSIS
In this section, we analyze the performance and capability of the OMP as compared to other parallel computers such as MCC and hypercube systems. The time complexities of all the algorithms we have mapped onto the OMP are summarized in Table III as compared to those for a uniprocessor system [29]. All the algorithms result in a linear speedup, except the orthogonal sort, which has a speedup of $O(n/\log n)$ over a uniprocessor. The results shown in Table III are indeed very impressive given that only n processors are used to achieve appreciable speedups.

Many matrix and graph algorithms follow regularly structured data flow patterns. Most of these algorithms can be efficiently mapped onto the MCC [1], [9], [16], [24]. Other algorithms which demand global data transfers, such as sorting, FFT, and other arbitrary data movement operations, can be mapped onto the hypercube system efficiently. The OMP can simulate these architectures with a linearly scalable performance. We compare below an OMP of n processors with an MCC containing $kn \times kn$ processors. Note that $k^2$ memory modules of the MCC are mapped into a single memory module in the OMP. Assume that each memory module in the MCC has a capacity of r words. Then each memory module of the OMP must have at least $rk^2$ words, i.e., $rk^2 \le m$, where m is the capacity of each OMP memory module. When r and k are large, $rk^2 n^2$ data points imply really large-grain computations. The following two theorems show that the OMP can be used effectively to simulate the MCC or the hypercube computers using only a small number of fast processors and memory modules.

Q.E.D.

The above methods indeed provide systematic ways of mapping onto the OMP architecture any parallel algorithm which is originally designed for the MCC or for the hypercube systems. The above time complexities (Theorems 8 and 9), as obtained by simulating an MCC or a hypercube computer on an OMP, are indeed upper bounds. In fact, all the parallel algorithms we have developed for the OMP have complexities which are lower than or equal to the upper bounds. For example, the orthogonal sort requires only $O(n \log^2 n)$ time steps, lower than the $O(n^2)$ obtained by mapping the Thompson-Kung sorting [complexity $O(n)$] [37] onto the OMP using Theorem 8. The mesh and hypercube multiprocessors are both popularly used in many scientific applications. This indicates that the OMP is indeed a powerful architecture for scientific and engineering computations.
X . CONCLUSIONS
We conclude by summarizing the advantages and shortcomings of the OMP architecture. Based on the results presented, the OMP architecture has the following distinct advantages.
1) The control complexity for parallel memory accesses in OMP is very simple as compared to fully shared-memory systems, because of synchronized orthogonality, smaller modular capacity, and lower memory access time.
2) With the orthogonal memory access and 2-D interleaved memory organization, the effective memory bandwidth is potentially n times higher than fully shared-memory multiprocessors using a crossbar switch, multiple buses, or a multistage network.
3) The OMP architecture has been demonstrated to be very powerful to implement a large class of scientific algorithms. Some of them have also been independently confirmed by Scherson and Ma [30], in which they prove that the performance of the OMP architecture compares favorably over the best known multiprocessor architecture within a factor of 3.
4) Many parallel algorithms are attractive candidates for efficient mapping onto the OMP architecture. Essentially, these are the algorithms which can exploit the orthogonal memory access property. We list the candidate algorithms in Table IV [9], [12], [14], [24], [27], [29]. However, not all of them have been thoroughly mapped onto the OMP. In [17] and [21], we have presented the use of the OMP for parallel image processing and pattern recognition.
Two obvious shortcomings of the OMP architecture are identified below. We point out these drawbacks in order to inspire continued research efforts to overcome the stated difficulties.
1) The orthogonal memory access principle prohibits possible memory accesses with mixed modes. This may prolong the communication between processors and reduce the flexibility in mapping those algorithms which do not access the memory array in an orthogonal fashion.
2) The number of memory modules increases as the square of the number of processors, which makes the system expensive for massive parallelism. In other words, the proposed OMP architecture needs to be modified for fine-grain massively parallel processing as suggested in [11].
Despite a few shortcomings, the advantages of the OMP architecture are sufficiently strong to proclaim its efficiency when the OMP is applied for matrix algebra, signal/image processing, sorting, linear programming, FFT, and parallel PDE solutions. The OMP architecture is well suited for modular construction using commercially available microprocessors, random-access memory chips, and off-the-shelf high-bandwidth buses. The two-dimensional architecture of the OMP has been generalized to n orthogonal dimensions as reported in [11]. The generalized OMP can efficiently support massively parallel computations.
