Abstract-This paper presents modified parallel architectures for multidimensional (m-d) convolution. We show that for two-dimensional (2-d) convolutions, with careful design, the number of lower-order 2-d convolutions can be reduced from nine to six with a computation saving of 33%. Moreover, the original speed of the computations is not affected. The proposed partitioning strategy results in a core of data-independent convolution computations, and can be generalized to the m-d convolution. The resulting very large scale integration networks have very simple modular structure, highly regular topology, and use simple arithmetic devices.
I. INTRODUCTION
Multi-dimensional (m-d) convolution is a very important operation in signal and image processing, with applications to digital filtering and video processing. Many approaches have therefore been proposed to achieve high-speed processing for linear convolution and to design efficient convolution architectures [3]-[6]. However, the majority of the previous approaches focused on expressing a two-dimensional (2-d) convolution in terms of two consecutive stages of one-dimensional (1-d) convolutions.
Our methodology employs tensor product decompositions and permutation matrices as the main tools for expressing the convolution algorithm. We employ several techniques to manipulate such decompositions into suitable expressions that can be mapped efficiently onto very large scale integration (VLSI) structures. Tensor products (or Kronecker products), when coupled with permutation matrices, have proven to be useful in providing unified decomposable matrix formulations for multidimensional transforms, convolutions, matrix multiplication, and other fundamental computations [2], [3], [6].
The proposed algorithm is based on a nontrivial modification of the 2-d convolution algorithm recently proposed in [3] and realized in Fig. 1. We show that, using the properties of tensor products and permutation matrices [6], a large 2-d convolution computation can be decomposed recursively into three cascaded stages. We show that the number of lower-order convolvers at the core computations can be reduced from nine to six with a computation saving of 33%. It should also be emphasized that our partitioning and combining method does not make any assumption about how the core convolution is computed. Indeed, any suitable convolution method can be used. This makes the proposed method very flexible and realizable over a wide range of hardware and software platforms.

Publisher Item Identifier S 1057-7130(00)11033-X.
A. Tensor Product Properties
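Some of the properties of the tensor product that will be used throughout this paper are as follows [3], [6]:

If $n = n_1 n_2 n_3$, then
$$P_{n,n_1}\,P_{n,n_2} = P_{n,n_1 n_2}. \qquad (3)$$

Parallel Operations: For square matrices $A_{n_1}$ and $B_{n_2}$, if $n = n_1 n_2$, then
$$A_{n_1} \otimes B_{n_2} = P_{n,n_1}\,(I_{n_2} \otimes A_{n_1})\,P_{n,n_2}\,(I_{n_1} \otimes B_{n_2}). \qquad (4)$$

If $n = n_1 n_2$, then
$$P_{n,n_1}\,P_{n,n_2} = I_n. \qquad (5)$$

For nonsquare matrices $A_{n_1,m_1}$ and $A_{n_2,m_2}$, we have
$$A_{n_1,m_1} \otimes A_{n_2,m_2} = P_{n_1 n_2,\,n_1}\,(I_{n_2} \otimes A_{n_1,m_1})\,P_{n_2 m_1,\,n_2}\,(I_{m_1} \otimes A_{n_2,m_2}) \qquad (6)$$
where $P_{n,s}$ is an $n \times n$ binary matrix specifying an $n/s$-shuffle (or $s$-stride) permutation, and $I_n$ is the identity matrix of size $n$.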
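The stride-permutation properties above can be checked numerically. The following is a minimal sketch, assuming that $P_{n,s}$ reads its input at stride $s$ (elements $0, s, 2s, \ldots$, then $1, s+1, \ldots$) and that the subscripts in (3)-(6) are read as $P_{n,n_1}$, $P_{n,n_2}$, and so on. The helper names `stride_perm` and `check_properties` are hypothetical and are not taken from [3] or [6].

```python
import numpy as np

def stride_perm(n, s):
    """Return the n x n 0/1 matrix P_{n,s} (assumed convention: y = P x
    lists x[0], x[s], x[2s], ..., then x[1], x[s+1], ...)."""
    if n % s:
        raise ValueError("s must divide n")
    idx = np.arange(n).reshape(n // s, s).T.ravel()
    return np.eye(n, dtype=int)[idx]

def check_properties(n1=2, n2=3, n3=4, m1=5, m2=2, seed=0):
    """Check properties (3)-(6) on small random integer matrices."""
    rng = np.random.default_rng(seed)
    eye = lambda k: np.eye(k, dtype=int)
    ok = {}

    # (3): P_{n,n1} P_{n,n2} = P_{n,n1*n2} with n = n1*n2*n3
    n = n1 * n2 * n3
    ok["(3)"] = np.array_equal(stride_perm(n, n1) @ stride_perm(n, n2),
                               stride_perm(n, n1 * n2))

    # (4): A (x) B = P_{n,n1} (I_{n2} (x) A) P_{n,n2} (I_{n1} (x) B), n = n1*n2
    n = n1 * n2
    A = rng.integers(-5, 6, (n1, n1))
    B = rng.integers(-5, 6, (n2, n2))
    rhs = (stride_perm(n, n1) @ np.kron(eye(n2), A)
           @ stride_perm(n, n2) @ np.kron(eye(n1), B))
    ok["(4)"] = np.array_equal(np.kron(A, B), rhs)

    # (5): P_{n,n1} P_{n,n2} = I_n with n = n1*n2
    ok["(5)"] = np.array_equal(stride_perm(n, n1) @ stride_perm(n, n2), eye(n))

    # (6): nonsquare factors A1 (n1 x m1) and A2 (n2 x m2)
    A1 = rng.integers(-5, 6, (n1, m1))
    A2 = rng.integers(-5, 6, (n2, m2))
    rhs = (stride_perm(n1 * n2, n1) @ np.kron(eye(n2), A1)
           @ stride_perm(n2 * m1, n2) @ np.kron(eye(m1), A2))
    ok["(6)"] = np.array_equal(np.kron(A1, A2), rhs)
    return ok
```

Integer test matrices are used so that the comparisons are exact; a different stride convention for $P_{n,s}$ would swap the roles of the two permutations in (4) and (6).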
II. REDUCING THE COMPLEXITY OF THE m-d CONVOLUTION ALGORITHM
For an $n_1 \times n_2$ input data image, the 2-d convolution output is given by [3]
$$C_{n_1,n_2} = \tilde{R}_{n_1,n_2}\,(I_9 \otimes C_{n_1/2,\,n_2/2})\,\tilde{Q}_{n_1,n_2} \qquad (7)$$
where $C_{n_1/2,\,n_2/2} = C_{(n_1/2)} \otimes C_{(n_2/2)}$ is the lower-order 2-d convolution matrix for an $(n_1/2) \times (n_2/2)$ input image,
$$\tilde{Q}_{n_1,n_2} = (P_{9(n_1/2),3} \otimes I_{n_2/2})\,(Q_{n_1} \otimes Q_{n_2}) \qquad (8)$$
and
$$\tilde{R}_{n_1,n_2} = (R_{n_1} \otimes R_{n_2})\,(P_{9(n_1-1),\,3(n_1-1)} \otimes I_{n_2-1}) \qquad (9)$$
are the 2-d pre- and post-additions, respectively. Now, we will further manipulate (8) and (9) to exploit the available resource sharing and, consequently, realize the multiplexed architecture of the 2-d convolution using fewer lower-order convolvers.
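The three-stage structure of (7), pre-additions, nine lower-order 2-d convolutions, and post-additions, can be illustrated in software. The following is a minimal sketch under stated assumptions: the entries of the pre-addition pattern `A` and of the recombination pattern `T` are not listed in this section, so a standard three-multiplication (Karatsuba-style) split is assumed, and the image and the kernel are assumed to have the same even dimensions. The function names are hypothetical.

```python
import numpy as np

# Assumed patterns (not given in the text): rows of A form x0, x0 + x1, x1;
# T maps the three partial products to the three shifted output terms.
A = np.array([[1, 0], [1, 1], [0, 1]])
T = np.array([[1, 0, 0], [-1, 1, -1], [0, 0, 1]])

def conv2d_full(x, h):
    """Direct full 2-d linear convolution, used as the lower-order core
    and as the reference result."""
    out = np.zeros((x.shape[0] + h.shape[0] - 1,
                    x.shape[1] + h.shape[1] - 1), dtype=np.int64)
    for i in range(h.shape[0]):
        for j in range(h.shape[1]):
            out[i:i + x.shape[0], j:j + x.shape[1]] += h[i, j] * x
    return out

def pre_add(x):
    """Split x into 2 x 2 quadrants and form the 3 x 3 grid of pre-added blocks."""
    r, c = x.shape[0] // 2, x.shape[1] // 2
    quad = x.reshape(2, r, 2, c).transpose(0, 2, 1, 3)  # quad[k, l] is r x c
    return np.einsum("ik,jl,klrc->ijrc", A, A, quad)

def split_conv2d(x, h):
    """One level of the split: nine half-size 2-d convolutions."""
    n1, n2 = x.shape
    r, c = n1 // 2, n2 // 2
    xs, hs = pre_add(x), pre_add(h)
    # Core stage: nine independent lower-order convolutions.
    prod = np.array([[conv2d_full(xs[i, j], hs[i, j]) for j in range(3)]
                     for i in range(3)])
    # Post-additions: recombine along both dimensions, then overlap-add.
    comb = np.einsum("ai,bj,ijrc->abrc", T, T, prod)
    out = np.zeros((2 * n1 - 1, 2 * n2 - 1), dtype=np.int64)
    for a in range(3):
        for b in range(3):
            out[a * r:a * r + 2 * r - 1, b * c:b * c + 2 * c - 1] += comb[a, b]
    return out
```

The sketch reproduces only the baseline nine-convolver structure; any other convolution routine could replace `conv2d_full` in the core stage, since the pre- and post-additions do not depend on how the core is computed.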
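Applying property (4), we can write $Q_n$ in the form $Q_n = A \otimes I_{n/2}$ [3]. Consequently, $\tilde{Q}_{n_1,n_2}$ can be written as
$$\tilde{Q}_{n_1,n_2} = (P_{9(n_1/2),3} \otimes I_{n_2/2})\,(Q_{n_1} \otimes Q_{n_2}) = (P_{9(n_1/2),3} \otimes I_{n_2/2})\,\big((A \otimes I_{n_1/2}) \otimes (A \otimes I_{n_2/2})\big). \qquad (10)$$
Also, from properties (1) and (2), we have
$$\tilde{Q}_{n_1,n_2} = \big(P_{9(n_1/2),3}\,(A \otimes I_{n_1/2} \otimes A)\big) \otimes (I_{n_2/2}\,I_{n_2/2}) = \big(P_{9(n_1/2),3}\,(A \otimes I_{n_1/2} \otimes A)\big) \otimes I_{n_2/2}. \qquad (11)$$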
Since the matrix $I_{n_1/2} \otimes A$ has $3n_1/2$ rows, we have
$$I_{n_1/2} \otimes A = I_{3n_1/2}\,(I_{n_1/2} \otimes A). \qquad (12)$$
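Substituting (12) in (11) and then using property (1), we have
$$\tilde{Q}_{n_1,n_2} = \Big(P_{9(n_1/2),3}\,\big((A\,I_2) \otimes (I_{3n_1/2}\,(I_{n_1/2} \otimes A))\big)\Big) \otimes I_{n_2/2} = \big(P_{9(n_1/2),3}\,(A \otimes I_{3n_1/2})\,(I_{n_1} \otimes A)\big) \otimes I_{n_2/2}. \qquad (13)$$
Using property (4), the parallel form $(I_{n_1} \otimes A)$ in (13) can be modified to
$$(I_{n_1} \otimes A) = P_{3n_1,\,n_1}\,(A \otimes I_{n_1})\,P_{2n_1,\,2}. \qquad (14)$$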
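Also, from property (4), we can write the term $(A \otimes I_{3n_1/2})$ in the form
$$A \otimes I_{3n_1/2} = (A \otimes I_{n_1/2}) \otimes I_3 = P_{9n_1/2,\,3n_1/2}\,\big(I_3 \otimes (A \otimes I_{n_1/2})\big)\,P_{3n_1,\,3}. \qquad (15)$$
Substituting (14) and (15) in (13), we have
$$\tilde{Q}_{n_1,n_2} = \Big(P_{9(n_1/2),3}\,P_{9n_1/2,\,3n_1/2}\,\big(I_3 \otimes (A \otimes I_{n_1/2})\big)\,P_{3n_1,\,3} \times \big(P_{3n_1,\,n_1}\,(A \otimes I_{n_1})\,P_{2n_1,\,2}\big)\Big) \otimes I_{n_2/2}. \qquad (16)$$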
However, from property (5), we have
$$P_{9n_1/2,\,3} \times P_{9n_1/2,\,3n_1/2} = I_{9n_1/2}, \qquad P_{3n_1,\,3} \times P_{3n_1,\,n_1} = I_{3n_1}. \qquad (17)$$
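Substituting (17) in (16), we can write $\tilde{Q}_{n_1,n_2}$ in the form
$$\tilde{Q}_{n_1,n_2} = \Big(\big(I_3 \otimes (A \otimes I_{n_1/2})\big)\,(A \otimes I_{n_1})\,P_{2n_1,\,2}\Big) \otimes I_{n_2/2} = \big((I_3 \otimes Q_{n_1})\,Q_{2n_1}\,P_{2n_1,\,2}\big) \otimes I_{n_2/2}. \qquad (18)$$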
Using property (4), we can write $\tilde{Q}_{n_1,n_2}$ in the final form
$$\tilde{Q}_{n_1,n_2} = P_{(9n_1/2)(n_2/2),\,9n_1/2}\,\Big(I_{n_2/2} \otimes \big((I_3 \otimes Q_{n_1})\,Q_{2n_1}\,P_{2n_1,\,2}\big)\Big) \times P_{n_1 n_2,\,n_2/2} = P_{(9n_1/2)(n_2/2),\,9n_1/2}\,(I_{n_2/2} \otimes E)\,P_{n_1 n_2,\,n_2/2} \qquad (19)$$
where $E = (I_3 \otimes Q_{n_1})\,Q_{2n_1}\,P_{2n_1,\,2}$. The detailed realization of $\tilde{Q}_{8,8}$ (for an $8 \times 8$ input image) is shown in Fig. 2, in which the computations involved in any of the four parallel $E$ blocks are realized by the permutation $P_{16,2}$, followed by the computation stage of $Q_{16}$, followed by three parallel blocks of the computation $Q_8$. The detailed realization of $Q_{16}$ is shown in Fig. 3. A careful scrutiny of the realization of $Q_{16}$ in Fig. 3 reveals that data moving through $Q_{16}$ experience different delays. In particular, the computations involved in $Q_{16}$ affect only the middle $Q_8$ in any of the $E$ blocks (shown with dotted lines in Fig. 2). Thus, the top and the bottom $Q_8$ can be computed one addition cycle ahead of the middle $Q_8$. This means that only two of the three parallel $Q_8$ blocks in the realization of $E$ are needed at a time. Therefore, through the use of multiplexers, the middle $Q_8$ can be removed from the architecture of $E$ without affecting the speed of the computations, as shown in Fig. 4, reducing the number of lower-order 2-d convolvers from nine to six with a computation saving of 33%.
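The equivalence between the starting form (8) and the multiplexed form (19) can be checked numerically for small sizes. The following is a minimal sketch, assuming the stride convention in which $P_{n,s}$ reads its input at stride $s$, and assuming the 3 x 2 pattern `A` below (its entries are not listed in this section; the identity itself depends only on the shape of $A$). The helper names are hypothetical.

```python
import numpy as np

def stride_perm(n, s):
    """n x n 0/1 matrix P_{n,s}: y = P x lists x[0], x[s], x[2s], ..., x[1], ..."""
    if n % s:
        raise ValueError("s must divide n")
    idx = np.arange(n).reshape(n // s, s).T.ravel()
    return np.eye(n, dtype=int)[idx]

def eye(k):
    return np.eye(k, dtype=int)

# Assumed pre-addition pattern; any 3 x 2 matrix satisfies the identity.
A = np.array([[1, 0], [1, 1], [0, 1]])

def Q(n):
    """1-d pre-addition matrix Q_n = A (x) I_{n/2}."""
    return np.kron(A, eye(n // 2))

def q_tilde_eq8(n1, n2):
    """Form (8): (P_{9(n1/2),3} (x) I_{n2/2}) (Q_{n1} (x) Q_{n2})."""
    P = stride_perm(9 * n1 // 2, 3)
    return np.kron(P, eye(n2 // 2)) @ np.kron(Q(n1), Q(n2))

def q_tilde_eq19(n1, n2):
    """Form (19): n2/2 parallel E blocks between two stride permutations."""
    E = np.kron(eye(3), Q(n1)) @ Q(2 * n1) @ stride_perm(2 * n1, 2)
    rows = (9 * n1 // 2) * (n2 // 2)
    return (stride_perm(rows, 9 * n1 // 2)
            @ np.kron(eye(n2 // 2), E)
            @ stride_perm(n1 * n2, n2 // 2))
```

For $n_1 = n_2 = 8$, `q_tilde_eq19` builds four parallel copies of $E$, each consisting of $P_{16,2}$, $Q_{16}$, and three copies of $Q_8$, which matches the block structure described for Fig. 2.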
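A multiplexed architecture of $\tilde{R}_{n_1,n_2}$ can also be derived using a similar procedure. Applying property (4), we can write $R_{n_1}$ in the form $R_{n_1} = R(1)\,(B \otimes I_{n_1-1})$ [3], and $R_{n_2}$ likewise with $R(2)$. Consequently, we can write $\tilde{R}_{n_1,n_2}$ in the form
$$\tilde{R}_{n_1,n_2} = (R_{n_1} \otimes R_{n_2})\,(P_{9(n_1-1),\,3(n_1-1)} \otimes I_{n_2-1}) = \big[(R(1)\,(B \otimes I_{n_1-1})) \otimes (R(2)\,(B \otimes I_{n_2-1}))\big] \times (P_{9(n_1-1),\,3(n_1-1)} \otimes I_{n_2-1}) \qquad (20)$$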
which, using property (1), can be modified to
$$\tilde{R}_{n_1,n_2} = (R(1) \otimes R(2))\,\big((B \otimes I_{n_1-1}) \otimes (B \otimes I_{n_2-1})\big) \times (P_{9(n_1-1),\,3(n_1-1)} \otimes I_{n_2-1}). \qquad (21)$$
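Applying properties (1) and (2), we have
$$\tilde{R}_{n_1,n_2} = (R(1) \otimes R(2))\,\big((B \otimes I_{n_1-1} \otimes B) \otimes I_{n_2-1}\big) \times (P_{9(n_1-1),\,3(n_1-1)} \otimes I_{n_2-1}) = (R(1) \otimes R(2))\,\Big(\big[(B \otimes I_{3(n_1-1)})\,(I_{3(n_1-1)} \otimes B) \times P_{9(n_1-1),\,3(n_1-1)}\big] \otimes I_{n_2-1}\Big) \qquad (22)$$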
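which, using property (4), can be reformulated in the parallel form
$$\tilde{R}_{n_1,n_2} = (R(1) \otimes R(2))\,P_{9(n_1-1)(n_2-1),\,9(n_1-1)} \times \Big(I_{n_2-1} \otimes \big[(B \otimes I_{3(n_1-1)})\,(I_{3(n_1-1)} \otimes B) \times P_{9(n_1-1),\,3(n_1-1)}\big]\Big)\,P_{9(n_1-1)(n_2-1),\,(n_2-1)} = (R(1) \otimes R(2))\,P_{9(n_1-1)(n_2-1),\,9(n_1-1)} \times (I_{n_2-1} \otimes K)\,P_{9(n_1-1)(n_2-1),\,(n_2-1)} \qquad (23)$$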
where $K = (B \otimes I_{3(n_1-1)})\,(I_{3(n_1-1)} \otimes B)\,P_{9(n_1-1),\,3(n_1-1)}$. Equation (23) represents $\tilde{R}_{n_1,n_2}$ in a multiplexed form similar to that of (19) for $\tilde{Q}_{n_1,n_2}$.
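The step from (21) to (23) can be checked numerically in the same way. The following is a minimal sketch, assuming the stride convention in which $P_{n,s}$ reads its input at stride $s$. The entries of $B$ are not given in this section, so a random 3 x 3 integer matrix is used as a placeholder (the 3 x 3 shape is what the dimensions in the definition of $K$ require), and the common left factor $R(1) \otimes R(2)$ is omitted from both sides. The helper names are hypothetical.

```python
import numpy as np

def stride_perm(n, s):
    """n x n 0/1 matrix P_{n,s}: y = P x lists x[0], x[s], x[2s], ..., x[1], ..."""
    if n % s:
        raise ValueError("s must divide n")
    idx = np.arange(n).reshape(n // s, s).T.ravel()
    return np.eye(n, dtype=int)[idx]

def check_eq23(n1=4, n2=6, seed=0):
    """Compare the right-hand factors of (21) and (23) for a placeholder B."""
    rng = np.random.default_rng(seed)
    B = rng.integers(-3, 4, (3, 3))        # placeholder, not the actual B
    m1, m2 = n1 - 1, n2 - 1
    eye = lambda k: np.eye(k, dtype=int)
    P = stride_perm(9 * m1, 3 * m1)

    # Factor of (21): ((B (x) I_{n1-1}) (x) (B (x) I_{n2-1})) (P (x) I_{n2-1})
    lhs = np.kron(np.kron(B, eye(m1)), np.kron(B, eye(m2))) @ np.kron(P, eye(m2))

    # Factor of (23): P (I_{n2-1} (x) K) P with K as defined after (23)
    K = np.kron(B, eye(3 * m1)) @ np.kron(eye(3 * m1), B) @ P
    rhs = (stride_perm(9 * m1 * m2, 9 * m1)
           @ np.kron(eye(m2), K)
           @ stride_perm(9 * m1 * m2, m2))
    return np.array_equal(lhs, rhs)
```

The parallel form contains $n_2 - 1$ identical $K$ blocks, which is the structure that the multiplexed post-addition stage shares among the lower-order convolver outputs.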
We can extend the derivation of the multiplexed 2-d convolution algorithm presented in this section to the m-d convolution algorithm in [3]. Following the same steps that are used to modify $\tilde{Q}_{n_1,n_2}$ and $\tilde{R}_{n_1,n_2}$ for the 2-d convolution, we can exploit the resource sharing available and derive multiplexed forms for the m-d pre-additions ($\tilde{Q}$) and the post-additions ($\tilde{R}$) in [3], reducing the number of lower-order m-d convolvers from $3^m$ to $(2/3)\,3^m$.
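The convolver counts stated above follow from simple arithmetic, since $(2/3)\,3^m = 2 \cdot 3^{m-1}$. A small illustrative helper (the function name is hypothetical) tabulates both counts:

```python
def convolver_counts(m):
    """Return (original, multiplexed) numbers of lower-order m-d convolvers:
    3**m before multiplexing and (2/3) * 3**m = 2 * 3**(m - 1) after."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    return 3 ** m, 2 * 3 ** (m - 1)
```

For $m = 2$ this gives nine and six convolvers, the 2-d case treated in this section; the relative saving is one-third for every $m$.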
III. THE COMPUTATIONAL COMPLEXITY OF THE 2-d CONVOLUTION ALGORITHM
For simplicity, assume that $n_1 = n_2 = n$ and the input image is of size $n \times n$. From (19), the number of additions in the pre-additions stage $\tilde{Q}_{n,n}$ is given by
Using property (6), we can write the term $R_n \otimes R_n$ in (20) as [2]
$$R_{(2n-1) \times 3(n-1)} \otimes R_{(2n-1) \times 3(n-1)} = P_{(2n-1)^2,\,(2n-1)}\,(I_{(2n-1)} \otimes R)\,P_{(2n-1)3(n-1),\,(2n-1)} \times (I_{3(n-1)} \otimes R).$$
Therefore, the number of additions in the post-additions stage $\tilde{R}_{n,n}$ is given by

It should be mentioned that the number of additions in both the pre-addition and post-addition stages remains unchanged in the proposed algorithm. Moreover, the communication complexity is reduced by removing one of the three $P_{48,12}$ blocks, as shown in Fig. 4, for the case $n = 8$.
Since the multiplication stages are centered at the core lower-order parallel blocks [2], [3], removing one-third of these blocks in the proposed algorithm guarantees a 33% saving in the number of multiplications. It should be mentioned that the computational complexity of our proposed algorithm depends on the computations involved in the core computations $C_{n_1/2,\,n_2/2} = C_{(n_1/2)} \otimes C_{(n_2/2)}$. The number of multiplications in our proposed algorithm with [1] as a core, compared to the direct method and the FFT method, is shown in Table I. The table shows a significant reduction in the number of multiplications for the proposed multiplexed algorithm relative to the direct and FFT methods.
IV. CONCLUSION
In this paper, we presented modified parallel architectures for m-d convolution. The proposed algorithm showed that for 2-d convolution, the number of lower-order convolutions is reduced from nine to six with a computation saving of 33%. The proposed partitioning strategy results in a core of data-independent convolution computations, and does not make any assumptions on how the core convolutions are computed. Indeed, any suitable convolution method can be used, which makes the proposed method very flexible and realizable over a wide range of hardware and software platforms.
