I. INTRODUCTION
The discrete cosine transform (DCT) has been widely used in image coding for its near-optimal performance [l]. Since the DCT is computation intensive, the development of high-speed hardware is necessary in many real-time applications. Systolic arrays are an appropriate architecture to meet the requirements of both high processing speeds and VLSI implementation. However, the computing algorithms encapsulated within systolic arrays need to be developed specifically.
Recently 161 . Although the two-dimensional arrays can attain higher speeds than one-dimensional arrays, the hardware complexity of PE's and the control complexity of these two-dimensional arrays are generally higher than those of linear arrays. Furthermore, the two-dimensional arrays need high I/O bandwidth and a large number of I/O channels to attain the higher speeds, unless most operands are preloaded into the arrays instead of being supplied from the input ports. But additional overheads are needed if the operands are preloaded into the arrays like the two-dimensional array in 151. Considering for example the array in [6] , the average computation time for N-point DCT is ( A + 2 ) cycles, while the number of multipliers in the array is (4N + 4 A), if the clock cycle is assumed to be the consumption time of one multiplier. In addition, undesirable features such as the complex control problems, high I/O bandwidth, and a large number of I/O channels are still accompanied with the array in [6] .
The attractive feature of linear arrays is that the U0 bandwidth and the number of I/O channels can be kept independent of the DCT length if the I/O channels exist only at the two extreme ends of a linear array. As discussed in [SI, the high U0 bandwidth required for most systolic arrays would limit computing speeds. Hence, linear arrays should be one feasible architecture for a sys- To simultaneously consider the hardware cost, the IiO bandwidth, and the number of I/O channels, a systolic algorithm for prime length DCT is derived in this correspondence. The design approach utilizes the input and output data permutations accompanied with the symmetry property of the cosine kernels such that the proposed array can retain most I/O channels at the two extreme ends and simultaneously attain good performance in average computation time, hardware cost of the PE's, and the number of the PE's. The performance of the proposed array and that of the linear arrays in [2]-[4] are discussed in Section 111. From Section 111, we can see that the proposed array possesses better performance than the arrays [2], [3] in the hardware cost of the PE's, the average computation time, the number of U0 channels, and the IiO bandwidth. Moreover, it also possesses better performance than the array [4] in the hardware cost of the PE's. The overheads of the proposed array include some additional shift registers, latches, multiplexers, a demultiplexer, and a switching element for solving control problems. Basically, these overheads are minor as compared with the savings in regard to the hardware cost of the PE's in the array. This correspondence is arranged as follows. Section I1 describes the derivation of the computing algorithm encapsulated in the array. Section I11 considers the array realization of the proposed systolic algorithm. A brief conclusion is given in Section IV.
THE ALGORITHM DERIVATION
The DCT is defined as
where "a" denotes s / 1 4 ; and N is assumed to be 7. If ( 2 ) 
where and x ( i ) is another sequence defined as
If N is a prime number, there exists some number of " g , " not necessarily unique, such that there is a one-to-one mapping from
where \AIN denotes the result of "A-modulo-N" operation. Then (4) can be reformulated with i and k as powers of the primitive element "g." Because i and k take on the value zero, and zero is not a power of "g," the zero frequency component must be treated specially, i.e.,
where
The term "I g' IN x 1 g kIN" can be expressed as
where "m" is an integer. Then, (9a) can be written as
Applying (1 1) to (9c), (9c) can be written as
where x " ( i ) = if mi and m2 are one even number and one odd number if ml and m2 are all even numbers or all odd numbers Now (7), @a), and (9c) constitute the computational equations for the DCT. To see the difference between these computational equations and (I), (9c) is written as
where "a" denotes 1 r / 7 , N a n d "g" are assumed to be 7 and 3, respectively. It can be seen that the absolute values of the cosine kernels along same antidiagonal positions in the matrix of (10) are the same while those in the matrix of (2) do not have any specific order like (IO). This phenomenon tells that the vector of T'(k) is the circular convolution of inputs x' (i) and the cosine kernels. The phenomenon also exists in the DFT, which was firstly found by Rader [IO] and has also been used to design the efficient systolic arrays for prime length DFT [9] . Now we apply it to derive the systolic algorithm for DCT. From the viewpoint of array realization, the constant value along the same antidiagonal positions means that this variable can be sent to every PE along a link from one input port at the extreme end of a linear array. The (2N -3) antidiagonal lines in the matrix of (10) mean that there are only (2N -3) values instead of N 2 values in the matrix of (2) needed to be sent to the array. This phenomenon can be effectively captured to design the systolic array with a low number of I/O channels and low I/O bandwidth. From (IO), since cos ( k a / N ) = -cos ((N -k ) a / N ) , it is observed that the absolute values of the cosine kernels located at the left three columns are the same as those located at the right three columns. This symmetry property benefits further reduction of the computational complexity in the algorithm. As shown in the Appendix, the symmetry property of the cosine kernels can be expressed as the following equation: Now (7), @a), and (12) constitute the computational equations of the DCT in the proposed algorithm. Considering the computational complexity, the number of multiplications has been reduced from (N -1)2 in (9c) to (N -1)2/2 in (I2a). In addition, the vector of T' (k) in ( 1 2 4 is still in a circular convolution form. It will be shown in the next section that such a form is beneficial to the reduction of I/O cost. Fig. l(a) . The dependence graph (DG) of the proposed algorithm for 7-point DCT where "a" denotes ~/ 7
x'(l)tx'(4) Wtx'(5) ~'(3)+~'(6)

T H E ARRAY REALIZATION
This section considers the array realization of the proposed systolic algorithm. 
T ( k ) = y :
where and "y;" and "zb" are the intermediate results.
From Fig. 2(a) , we know that the operations specified in (13a) and (13b) 
439
constitute the main functions of the PE's, which are shown in Fig.  2(b) . And three control signals denoted as " T a g l , " "Tag2," and "sign" are used to select the right operands in the operations. Fig.  2(c) shows the preprocessing stage needed in the array. The intermediate sequence x ( i ) can be generated from input sequence y ( i ) by a subtractor, and then we use the multiplexers and a switching element to permute the sequence x ( i ) where the required control signals can be generated by circular shift registers. Finally, the required data patterns are obtained by adding and subtracting the permuted data. Fig. 2(d) shows the postprocessing stage in the array, which uses a demultiplexer to perform the output data permutation. Similarly, the control signals needed in the demulti- 
4-
T a g l else Y2o 4---0 e n d C 1 ' C C 1 X e x t l ' + X e x t l T a g l ' e T a g l X e x t Z ' C X e x t Z If T a g l = l then X e x t l ' X e x t l X i n l ' + X e x t l Xin2' +-Xext2 shifted into the registers, they are shifted parallelly into the latches for the I/O data permutations such that the data of next block can be continuously shifted into the registers without any time delay. Therefore, the proposed array including the preprocessing and postprocessing stages can be fully pipelined, and a high throughput rate of the design can be attained. In order to see the features of the proposed array more clearly, (12a) is expressed as
where "a" denotes * / I , N a n d "g" are assumed to be 7 and 3, respectively. If "k" is equal to 1 , 5 , and 6, the minus signs in the values instead of eight to the array for computing each seven-point DCT. It can be seen from the array in Fig. 2(a) that only (N -1) cosine kernels are needed to compute an N-point DCT. And, the average computation time for computing the N-point DCT is (N -1) cycles. This phenomenon is induced from the cyclic property of
Exerting the specific order of the cosine kemels in the matrix of (14). these kernels in the array are imported from the right-most PE instead of being imported from every PE as the approach in [2] . Therefore, the proposed array requires a low number 
IV. CONCLUSIONS
In this correspondence, a new approach to derive the systolic algorithm for prime length DCT is presented. This approach induces the array to have good performance in hardware cost of PE's, average computation time, the number of I/O channels, and the I/O bandwidth. Also, this design approach can be similarly applied to derive the systolic algorithms for discrete sine transform (DST) and discrete Fourier transform (DFT) [9] . Although the proposed systolic algorithm and array are derived under the restriction that N is a prime number, they can be applied to the nonprime !ength DCT by appending the input data from nonprime length to prime length at the expense of some overheads in hardware cost and average computation time. With these overheads, the hardware cost of the proposed array is still lower than that in the arrays (21-[4] . However, it is not always a drawback that N is a prime number. It is known that the blocking effect will occur in the DCT as applied to image coding with low bit rate. And the overlapping method is one of the remedies for this problem [ 1 11. Applying the proposed algorithm to the nonprime length D C T by using the overlapping method can also reduce the undesirable blocking effect. cos ( a ) , cos (3a), cos (2a), cos (6a)) is the sequence of these eight values for the seven-point DCT. It is observed that the last two cosine kemels are identical to the first two cosine kernels in C. And these common cosine kemels can be shared for computing two neighboring blocks successively. As many image blocks are processed continuously, it is only necessary to send six 
and z,'s and R,'s are the poles and residues of the signal, respectively. M is the number of poles of the signal, and n ( k ) is the background noise. a, and w, are the damping factor and angular frequency of the ith sinusoid, respectively. Once the number of poles and their values have been determined, the residues at the poles can be found by the least squares method. Hence, only the problem of estimation of the poles is considered in this correspondence. The most popular method for pole retrieval is Prony's method. However, Prony's method is notorious for its extreme sensitivity to noise. There are many modified versions of the Prony method. The most well known one is the principal eigenvector (PE) method 111. Recently, Hua and Sarkar 121, [3] developed a new technique, named the matrix pencil (MP) method, for pole estimation. The advantage of using matrix pencil is that the signal poles can be found directly from the eigenvalues of the matrix contrast to the PE method, which generally requires two-step processes. In the first step one solves a matrix equation, and finds the roots of a polynomial equation in the second step. 
