Inner product and matrix operations find extensive use in algebraic computations. In this paper, we introduce a new parallel computation model, called a permutation network processor, to carry out these computations efficiently. Unlike the traditional parallel computer architectures, computations on this model are carried out by composing permutations on permutation networks. We show that the sum of N algebraic numbers on this model can be computed in O(1) time using N processors. We further show that the real and complex inner product and matrix multiplication can both be computed on this model in O(1) time at the cost of O(N) and O(N^3), respectively, for N-element vectors and N × N matrices. These results compare well with the time and cost complexities of other high level parallel computer models such as PRAM and CRCW PRAM.
Introduction
Inner product and matrix operations form the core of computations of vector and array processors [5, 7]. Many algorithms in signal and image processing, pattern recognition, computer vision, and computer tomography require extensive use of such operations [5, 12, 14, 18]. Therefore, it is desirable to find novel parallel architectures to compute these operations fast and efficiently.
Traditional architectures for carrying out such operations are based on reducing vector computations into scalar operations such as binary addition and multiplication [20, 21]. As a result, much of the computation in vector and array processors is handled by conventional arithmetic circuits such as carry look ahead adders, recoded and cellular array multipliers and dividers [6]. While these conventional circuits are optimized for speed and hardware, they still rely on a variety of building blocks such as adder, subtractor and multiplier cells, which often lead to nonuniform arithmetic circuits for vector processors.
In this paper, we propose a new concept to carry out vector and matrix computations. Unlike the traditional architectures, this concept is based on coding not only the operands but also the operations over the operands in such a way that a vector or matrix computation reduces to composing permutation maps. Each operand is coded into a permutation, and addition or multiplication of two operands is carried out by composing the permutations that correspond to these operands on a permutation network. As a result, both addition and multiplication are reduced to a single computation, i.e., that of composing permutations. Furthermore, any other computation involving addition, subtraction and multiplication operations is also reduced to just composing permutations.
We show that, on this new computation model, called a permutation network processor, the sum of N n-bit numbers, the inner product of two vectors, each containing N n-bit elements, and the multiplication of two N × N matrices with n-bit entries can all be computed in O(1) steps. The first two computations require O(N) processors and the matrix multiplication requires O(N^3) processors, where each processor handles an n-bit input and has O((n + lg N)^2) bit-level cost and O(n + lg N) bit-level delay.
We note that these results compare well with the complexities for the same computations on other models. For example, on a PRAM model [1, 8], all three computa-
We also note that, even though the permutation network processor model stands on its own, it ties in with some earlier computation models that were reported in the literature. One such model, called a processing network, was given in [19], where a mesh of processing elements was used to compute certain algebraic formulas. The processing elements in this model can be programmed for arithmetic and routing functions whose combinations lead to various algebraic expressions on the mesh topology. More recently, a new parallel computer model, called a reconfigurable bus system, has been introduced to solve a wide range of problems including sorting problems [23], graph problems [13, 3], and string problems [2]. All these problems have been shown to be solvable in O(1) time on the reconfigurable bus system model. As in the processing network model, processors are connected in this model by some fixed topology such as the mesh, and each processor can be programmed for some data processing as well as routing functions. It is assumed that the signals can be broadcast between processors in constant time regardless of how far the broadcast is carried [3, 11], [22]-[26]. The essence of this assumption is that once the processors are simultaneously programmed for some routing functions, the signals that pass through them only encounter a propagation delay which is short enough to be considered a constant. The same assumption also holds for our model.
The remainder of this paper is organized as follows. In Section 2, we provide a brief overview of permutation network processors. In Section 3, we show how real and complex inner products can be performed on our permutation network processor in O(1) time. In this section, we also describe how to sum N n-bit signed numbers in O(1) time on the same model. In Section 4, we use these inner product units to carry out the multiplication of two N × N matrices on permutation network processors. The paper is concluded in Section 5.
The Permutation Network Processor Model
The computations to be described in subsequent sections all rely on a permutation network processor model which was introduced in [10, 15, 16] . Here we give a brief overview of this model and introduce some changes so as to make it suitable for these computations. The reader is referred to these citations for a more detailed account.
A. The Model
A permutation network processor is obtained by cascading three components together as shown in Figure 1: an r-
In most of the computations that follow, we will need to cascade several permutation network processors together. In such cases, the r-out-of-s residue encoders and decoders are redundant and will be removed from the model. In this reduced model, we only retain the permutation network and binary residue encoder.
B. The Cost and Time Assumptions
In the reduced permutation network processor model, each processor has an n-
The third computation, i.e., the product of two N × N matrices, can be carried out by performing N^2 inner product operations. Thus, the matrix multiplication problem can be solved by using N^3 permutation network processors each constructed
As for the delay of the permutation network processors needed for these three computations, it was shown in [10] that a permutation network processor encom-
Given that M ≈ 2^n N, it follows that all three computations mentioned above can be performed by using permutation network processors with O(n + lg N) bit-level delay.
We point out that these bit-level cost and delay complexities are comparable with those for other parallel computer models. The last two computations require a multiplier circuit for n-bit operands, and this exacts O(n^2) bit-level cost to attain O(lg n) bit-level delay regardless of the model used. Given this, in obtaining the cost and time complexities of the algorithms that follow, we will assume that our permutation network processors have constant cost and constant time, where the cost and time are expressed at the word level as in other parallel computer models.
Inner Product Processors
In this section, we show how to sum N n-bit numbers and compute real and complex inner products using permutation network processors.
A. Summation of N n-bit numbers
Assume that the N n-bit numbers to be added together are all in 2's complement form. This implies that the sum can be as large as 2^{n−1} N in magnitude, and to avoid a possible overflow, the dynamic range M of the permutation network processor must satisfy
As described in [10] , the set 
Under the isomorphism fixed by mapping 1 to π, element X ∈ Z M is mapped to
, . . .
where + M denotes modulo M addition. Since π i is a cycle of length m i
or by Horner's rule
. . .
where
To compute Equation (4), we cascade N permutation network processors together, and each operand X_i, 1 ≤ i ≤ N, is converted into its corresponding residue code (x_{i1}, x_{i2}, ..., x_{ir}) by using a binary residue encoder such as one of those described in [10]. These converted residue codes (x_{i1}, x_{i2}, ..., x_{ir}), 1 ≤ i ≤ N, are then used to set up the switching states of the corresponding permutation networks. The complete algorithm for computing the summation of N n-bit numbers on a permutation network processor is then given as follows.
Algorithm 1 (Summation of N n-bit 2's complement numbers)
Input: N n-bit 2's complement numbers, X_1, X_2, ..., X_N.
Output: Sum of X_1, X_2, ..., X_N in 2's complement form.
Method:
Step 1: Convert the X_i's into their corresponding binary residue codes (x_{i1}, x_{i2}, ..., x_{ir}), in parallel.
Step 2: Add X_1, X_2, ..., X_N.
Step 2.1: Set up the switching states of permutation network processors in parallel by the binary residue codes obtained in Step 1.
Step 2.2: Feed all input lines marked 0, m_1, m_1 + m_2, ..., m_1 + m_2 + ... + m_{r−1} with a "1" and all the other input lines with a "0."
Step 3: Decode the r-out-of-s residue code obtained at the outputs of the last permutation network processor in the cascade into its binary equivalent.
Figure 2 shows an example with N = 3, n = 5, and X_1 = 13, X_2 = 12, and X_3 = 9.
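As an informal illustration, Algorithm 1 can be mimicked in software. The sketch below is not the hardware design of [10]: it assumes the moduli (3, 5, 7), so that M = 105, and the helper names `encode`, `shift_stage`, `decode`, and `permutation_sum` are hypothetical. Each stage cyclically shifts the "1" inside the block of lines that belongs to each modulus, and the position of the "1" after the last stage gives the residue of the sum.

```python
from math import prod

MODULI = (3, 5, 7)   # assumed pairwise coprime moduli (not taken from the paper)
M = prod(MODULI)     # dynamic range, 105 here


def encode(x):
    """Binary residue encoder: X -> (X mod m_1, ..., X mod m_r)."""
    return tuple(x % m for m in MODULI)


def shift_stage(lines, residues):
    """One processor in the cascade: cyclic shift by x_i in the block of modulus m_i."""
    out, base = [0] * len(lines), 0
    for m, x in zip(MODULI, residues):
        for k in range(m):
            out[base + (k + x) % m] = lines[base + k]
        base += m
    return out


def decode(lines):
    """Locate the "1" in each block, combine by the Chinese remainder theorem,
    and map the result back to the signed range."""
    total, base = 0, 0
    for m in MODULI:
        r = lines[base:base + m].index(1)
        q = M // m
        total += r * q * pow(q, -1, m)   # modular inverse, Python 3.8+
        base += m
    total %= M
    return total - M if total > M // 2 else total


def permutation_sum(xs):
    lines = []
    for m in MODULI:                     # "1" on lines 0, m_1, m_1 + m_2, ...
        lines += [1] + [0] * (m - 1)
    for x in xs:                         # one stage per operand
        lines = shift_stage(lines, encode(x))
    return decode(lines)
```

With the operands of the example, `permutation_sum([13, 12, 9])` returns 34. The loop over stages is sequential only in this simulation; on the model, the constant-time claim rests on the constant propagation delay assumed through the cascade.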
The light lines indicate the paths of "1" between inputs and outputs. To avoid a possible overflow, the dynamic range of the processor is chosen so that it satisfies 
To compute X j Y j on a permutation network processor, we recall that the multi- 
Now we carry over the product X i Y i onto (Z M , ⊗) by setting up a monoid isomorphism f between Z m and Z M as follows:
It is easy to show that f (1) = (1, 1, . . . , 1) and
and hence f is an isomorphism, and using this isomorphism, we can compute the
x_{j,r} ×_{m_r} y_{j,r}). multiplier. An example of a modulo 5 permutation network multiplier is shown in Figure 4, and a detailed description of the modulo m_i permutation network multiplier can be found in [10].
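The property that multiplication carries over to the residue domain can be checked numerically. The sketch below assumes that f is the usual residue map X ↦ (X mod m_1, ..., X mod m_r), which is consistent with f(1) = (1, 1, ..., 1), and it assumes the moduli (3, 5, 7); the function names are hypothetical.

```python
from math import prod

MODULI = (3, 5, 7)   # assumed pairwise coprime moduli
M = prod(MODULI)


def f(x):
    """Assumed residue map from the integers modulo M to residue tuples."""
    return tuple(x % m for m in MODULI)


def residue_mul(a, b):
    """Component-wise multiplication, each component modulo its own m_i."""
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))


def check_isomorphism():
    """Exhaustively verify that f preserves products and is one-to-one."""
    if f(1) != (1,) * len(MODULI):
        return False
    for x in range(M):
        for y in range(M):
            if f((x * y) % M) != residue_mul(f(x), f(y)):
                return False
    return len({f(x) for x in range(M)}) == M
```

Here `check_isomorphism()` returns `True`, so a product modulo M may be computed by r independent small multiplications, one per modulus.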
The following algorithm shows how the inner product is carried out using such multipliers. 
Output:
The real inner product of X and Y represented in 2's complement form.
Method:
Step 1: Convert each element of X and Y into its corresponding residue code in parallel.
Step 2: Multiply x_{ji} and y_{ji} for 1 ≤ j ≤ N and 1 ≤ i ≤ r in parallel.
Step 3: Compute the inner product by adding together the products obtained in Step 2 over the residue code domain.
Step 3.1: Set up the switching states of permutation network processors in parallel by the binary residue codes obtained in Step 2.
Step 3.2: Feed all input lines marked 0, m_1, m_1 + m_2, ..., m_1 + m_2 + ... + m_{r−1} with a "1" and all the other input lines with a "0."
Step 4: Decode the r-out-of-s residue code obtained at the outputs of the last permutation network processor in the cascade into its binary equivalent.
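The four steps above can be followed at the level of residues in a short simulation. This is a sketch under stated assumptions, not the circuit itself: the moduli (3, 5, 7) are chosen so that M = 105 covers three products of 3-bit 2's complement operands, and `encode`, `decode`, and `inner_product` are hypothetical names.

```python
from math import prod

MODULI = (3, 5, 7)   # assumed moduli; M = 105 covers |sum| <= 3 * 16 = 48
M = prod(MODULI)


def encode(x):
    """Step 1: residue code of a signed integer."""
    return tuple(x % m for m in MODULI)


def decode(residues):
    """Step 4: Chinese remainder reconstruction, mapped to the signed range."""
    total = sum(r * (M // m) * pow(M // m, -1, m)
                for r, m in zip(residues, MODULI)) % M
    return total - M if total > M // 2 else total


def inner_product(xs, ys):
    acc = [0] * len(MODULI)                  # residue code of zero
    for x, y in zip(xs, ys):
        rx, ry = encode(x), encode(y)
        for i, m in enumerate(MODULI):
            p = (rx[i] * ry[i]) % m          # Step 2: modulo m_i multiplier
            acc[i] = (acc[i] + p) % m        # Step 3: modulo m_i adder
    return decode(acc)
```

For instance, `inner_product([3, -2, 1], [2, 3, -4])` returns -4, which agrees with 6 − 6 − 4.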
An example with N = 3 and n = 3 is shown in Figure 5 . It is easy to see that 
C. Complex Inner-Product Processor
The inner product of two complex vectors U = (U_1, U_2, ..., U_N) and V = (V_1, V_2, ..., V_N) is a complex-valued function defined as follows:
\[ \langle U, V \rangle = \sum_{j=1}^{N} U_j V_j^{*}, \]
where U_j and V_j, 1 ≤ j ≤ N, are complex elements, and V_j^{*} denotes the complex conjugate of V_j. Therefore, the complex inner product can be computed by carrying out four real inner products. Let ⊕ and be binary operators defined on Z_M by (x_1, x_2, ..., x_r) ⊕ (y_1, y_2
where (x_1, x_2, ..., x_r), (y_1, y_2, ..., y_r) ∈ Z_M. Using the isomorphism constructed in the previous section, each real inner product can then be computed in Z_M as in Equation (9), and hence
The complex inner product processor can then be constructed by computing the real and imaginary parts on two permutation network real inner product processors. These real inner product processors require some minor modifications. For the real part, the actual inner product consists of two real inner products, one combining the real parts of U and V and the other combining their imaginary parts. Thus the subnetwork for each modulus in each processor is obtained by cascading two multiplication and two addition units. For the imaginary part, the actual inner product also consists of two real inner products; however, these two inner products are not added together; rather, the second one is subtracted from the first one. Therefore, we cascade one multiplication and addition unit with a multiplication and subtraction unit. The subtraction unit can be implemented by a circular left shift permutation network subtracter as described in [10]. The general structure for the permutation network complex inner product processor is shown in Figures 6 and 7. Figure 6 is the real part and Figure 7 is the imaginary part. We note that the sum expressions on the left hand side are just for labeling the inputs and do not imply any summation.
Method:
Step 1: Convert each element of U and V into its corresponding residue code in parallel. (Note that the real part and imaginary part use separate residue decoders.)
Step 2: Compute the real multiplications Re(u_{ji})Re(v_{ji}), Im(u_{ji})Im(v_{ji}), Im(u_{ji})Re(v_{ji}), and Re(u_{ji})Im(v_{ji}), for 1 ≤ j ≤ N and 1 ≤ i ≤ r in parallel.
Step 3: Compute the real and imaginary parts of the complex inner product by adding together the corresponding products obtained in Step 2 according to Equation (14) over the residue code domain.
Step 3.1: Set up the switching states of permutation network processors (adders and subtracters) in parallel by the binary residue codes obtained in Step 2.
Step 3.2: Feed all input lines marked 0, m_1, m_1 + m_2, ..., m_1 + m_2 + ... + m_{r−1} with a "1" and all the other input lines with a "0."
Step 4: Decode the r-out-of-s residue codes obtained at the outputs of the last permutation network processors in the real and imaginary parts into their binary equivalents.
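A software sketch of this algorithm is given below, again at the level of residues rather than switching networks. It assumes the moduli (5, 7, 9), so that M = 315, represents each complex operand as a hypothetical pair (re, im) of signed integers, and uses modular subtraction where the processor would use a subtracter unit.

```python
from math import prod

MODULI = (5, 7, 9)   # assumed pairwise coprime moduli, M = 315
M = prod(MODULI)


def encode(x):
    return tuple(x % m for m in MODULI)


def decode(residues):
    total = sum(r * (M // m) * pow(M // m, -1, m)
                for r, m in zip(residues, MODULI)) % M
    return total - M if total > M // 2 else total


def complex_inner_product(us, vs):
    """Return (real, imaginary) parts of sum_j U_j * conj(V_j)."""
    re_acc = [0] * len(MODULI)
    im_acc = [0] * len(MODULI)
    for (a, b), (c, d) in zip(us, vs):       # U_j = a + ib, V_j = c + id
        ra, rb, rc, rd = encode(a), encode(b), encode(c), encode(d)
        for i, m in enumerate(MODULI):
            # real part: two products, both added
            re_acc[i] = (re_acc[i] + ra[i] * rc[i] + rb[i] * rd[i]) % m
            # imaginary part: second product subtracted from the first
            im_acc[i] = (im_acc[i] + rb[i] * rc[i] - ra[i] * rd[i]) % m
    return decode(re_acc), decode(im_acc)
```

For U = (1 + 2i, −3 + i, 2 − 4i) and V = (3 − i, 2i, −2 − 2i), the call `complex_inner_product([(1, 2), (-3, 1), (2, -4)], [(3, -1), (0, 2), (-2, -2)])` returns `(7, 25)`.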
As in the previous algorithm, it is easy to see that the total execution time of Algorithm 3 is O(1). Its total cost is O(N), as the real part and the imaginary part require 2N processors each. In more exact terms, for the real part, each processor consists of two permutation network adders in cascade. For the imaginary part, each processor consists of a permutation network adder and a permutation network subtracter. Therefore, a total of 4N processors are required by this algorithm.
Matrix Multiplication
In this section, we use permutation network inner product processors to carry out real and complex matrix multiplications.
A. Real Matrix Multiplication
Let A and B be N × N real matrices and C = A × B. Let A_j be the jth row of A and B_k be the kth column of B. C can be computed by using the following procedure:
For j = 1 to N do in parallel
For k = 1 to N do in parallel
C_{jk} = A_j · B_k
Endfor
Endfor.
Hence, if we use N^2 inner product processors in parallel, the multiplication of two real matrices can be computed in O(1) steps as follows.
Algorithm 4 (Computing the real matrix multiplication of two real matrices)
Input: Two N × N real matrices A and B. Each element of A and B is an n-bit 2's complement number.
Output: A real product matrix C = A × B.
Method:
Step 1: Convert each element of A and B into its corresponding residue code in parallel.
Step 2: Compute the N^2 inner products using Algorithm 2.
A schematic version of this algorithm is depicted in Figure 8 . 
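The structure of Algorithm 4 can be sketched in software as N^2 independent inner products, one per entry of C. The code below is an illustration only: the moduli (7, 11, 13) and the names `inner_product` and `matmul` are assumptions, and the nested comprehension evaluates sequentially what the model evaluates in parallel.

```python
from math import prod

MODULI = (7, 11, 13)   # assumed pairwise coprime moduli, M = 1001
M = prod(MODULI)


def encode(x):
    return tuple(x % m for m in MODULI)


def decode(residues):
    total = sum(r * (M // m) * pow(M // m, -1, m)
                for r, m in zip(residues, MODULI)) % M
    return total - M if total > M // 2 else total


def inner_product(xs, ys):
    """Residue-domain inner product of a row and a column."""
    acc = [0] * len(MODULI)
    for x, y in zip(xs, ys):
        for i, (a, b, m) in enumerate(zip(encode(x), encode(y), MODULI)):
            acc[i] = (acc[i] + a * b) % m
    return decode(acc)


def matmul(A, B):
    """C[j][k] is the inner product of row j of A and column k of B."""
    cols = list(zip(*B))
    return [[inner_product(row, col) for col in cols] for row in A]
```

As a small check, `matmul([[1, 2], [3, 4]], [[-1, 0], [2, 5]])` returns `[[3, 10], [5, 20]]`. No entry depends on any other, which is what allows one inner product processor per entry.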
B. Complex Matrix Multiplication
Now we consider the multiplication of two complex matrices. Let R = A + iB and S = C + iD be two complex matrices of size N × N. Let T = RS = E + iF be the product matrix. Then
\[ E = AC - BD, \qquad F = AD + BC, \]
and hence the complex matrix multiplication is reduced to four real matrix multiplications, one addition, and one subtraction. Let A_j and B_j denote the jth rows of A and B, respectively. Thus, to compute E = AC − BD, we use the following procedure.
For k = 1 to N do in parallel
The inner equation is a difference of two inner products. Comparing this to the imaginary part of the complex inner product, it is easy to see that they have the same algebraic form. Therefore, matrix E can be computed using Algorithm 4 by replacing the inner product processor in the algorithm with the imaginary part of the complex inner product processor shown in Figure 7 . This amounts to replacing each of the inner product processors in Figure 8 by the imaginary inner product processor given in Figure 7 .
Likewise, to compute F = AD + BC, we use the procedure:
Endfor.
The inner equation entails two inner products as in the real part of the complex inner product processor shown in Figure 6 . Thus F can be computed using Algorithm 4.
Combining these facts, we conclude that the multiplication of two N × N complex matrices can be carried out in O(1) time using O(N^3) permutation network processors.
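The decomposition into E = AC − BD and F = AD + BC can be checked with ordinary integer arithmetic. The sketch below deliberately leaves out the residue coding and only illustrates the four real products, the subtraction, and the addition; `matmul` and `complex_matmul` are hypothetical helper names.

```python
def matmul(P, Q):
    """Plain integer matrix product, standing in for Algorithm 4."""
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]


def complex_matmul(A, B, C, D):
    """Return (E, F) with (A + iB)(C + iD) = E + iF."""
    AC, BD = matmul(A, C), matmul(B, D)
    AD, BC = matmul(A, D), matmul(B, C)
    n = len(A)
    E = [[AC[j][k] - BD[j][k] for k in range(n)] for j in range(n)]  # real part
    F = [[AD[j][k] + BC[j][k] for k in range(n)] for j in range(n)]  # imaginary part
    return E, F
```

Each entry of E is a difference of two inner products and each entry of F is a sum of two inner products, which mirrors the roles of the imaginary-part and real-part processors described above.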
Concluding Remarks
In this paper, we proposed permutation network processors to compute algebraic sums, inner products, and matrix products. It has been shown that the algebraic sum of N n-
These results are important in that they establish that one can avoid using conventional adder and multiplier circuits to carry out vector and matrix computations.
It will be worthwhile to extend them to other computations such as discrete transforms, convolutions, and correlation computations. We anticipate that all of these computations can also be done in O(1) time and O(N^2) cost, and we plan to present our results on these computations elsewhere.
