Abstract: In this paper, we propose unified systolic arrays for computation of the 1-D and 2-D discrete cosine transform/discrete sine transform/discrete Hartley transform (DCT/DST/DHT). By decomposing the transforms into even- and odd-numbered frequency samples, the proposed architecture computes the 1-D DCT/DST/DHT. Compared to the conventional methods, the proposed systolic arrays exhibit advantages in terms of the number of processing elements (PE's) and latency. We generalize the proposed structure for computation of the 2-D DCT/DST/DHT. The unified systolic arrays can also be employed for computation of the inverse DCT/DST/DHT (IDCT/IDST/IDHT).
I. Introduction
The discrete cosine transform (DCT), discrete sine transform (DST), and discrete Hartley transform (DHT) are effective tools in transform coding applications to speech and image signals. The DCT is a better approximation to the statistically optimal Karhunen-Loève transform (KLT) than any other orthogonal transform; therefore, it has been widely used in image and speech coding [1]-[3]. The performance of the DST approaches that of the KLT for a first-order Markov sequence with given boundary conditions, especially for a signal with low correlation. The DHT, which involves only real arithmetic, is popular in the field of digital signal processing as an alternative to the complex-valued discrete Fourier transform (DFT).
There are two traditional approaches to implementation of the DCT/DST/DHT: butterfly structures [4]-[7] and systolic arrays [8]-[18]. Butterfly structures are not suitable for efficient hardware implementation because they require global communications, which limits the computational rate when the number of processors becomes large and their areas increase. A solution to this difficulty is to allow only local data exchanges between PE's, as in a systolic array [19]-[20], which consists of locally connected PE's. A pipelined simultaneous data flow via local connections does not require any control overhead; thus systolic array structures are suitable for parallel computation of digital signal processing algorithms.
Computation of the DCT requires massive and complicated data manipulations. Many fast DCT (FDCT) algorithms requiring global communications and conventional methods using systolic arrays have been presented. In this paper, we propose unified systolic array architectures for computation of the 1-D and 2-D DCT/DST/DHT. The proposed architectures employ simple PE's that require real multiplications and additions. They generate outputs sequentially with short computation time.
The rest of the paper is structured as follows. Section II describes the proposed systolic array architectures for the 1-D and 2-D DCT/DST/DHT. In Section III, we analyze the performance of the proposed and conventional architectures for the 1-D and 2-D DCT, and conclusions are given in Section IV.
Note that even- and odd-numbered frequency samples are computed independently; thus parallel processing is possible. Fig. 1 shows the proposed architecture for computation of the 1-D DCT with $N = 8$. The processing unit (PU) shown in Fig. 1(a) computes the sum and difference of two inputs $a$ and $b$. The proposed PE shown in Fig. 1(b) requires two real multiplications and two real additions, which are computed at the same time; thus it needs a single clock cycle. One clock cycle is defined as the time $T = m_t + a_t$, where $m_t$ and $a_t$ represent the elapsed times for real multiplication and addition, respectively. In the proposed architecture shown in Fig. 1(c) with $N = 8$, the input data sequence is fed into the PU from top to bottom, whereas the coefficient values for the transform are fed from bottom to top. Two outputs are generated concurrently from the rightmost PE. For the $N$-point DCT, the proposed architecture requires $N/2$ PE's.
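As an illustration of the even/odd decomposition described above, the following Python sketch computes an 8-point DCT from the sums and differences formed by the PU stage. It assumes the unnormalized DCT-II kernel $\cos((2n+1)k\pi/(2N))$ with zero-based indices, which is an assumption rather than the paper's stated definition. The function names `dct_direct` and `dct_even_odd` are hypothetical, and the inner loop over `n` only mimics the work of the $N/2$ PE's in software, not their timing.

```python
import math


def dct_direct(x):
    """Reference: unnormalized DCT-II (assumed definition)."""
    N = len(x)
    return [
        sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N))
        for k in range(N)
    ]


def dct_even_odd(x):
    """DCT-II via sums/differences of mirrored inputs (N must be even)."""
    N = len(x)
    half = N // 2
    # PU stage: one sum and one difference per mirrored input pair.
    s = [x[n] + x[N - 1 - n] for n in range(half)]
    d = [x[n] - x[N - 1 - n] for n in range(half)]
    X = [0.0] * N
    for k in range(0, N, 2):
        even_acc = 0.0  # accumulates X[k], an even-numbered sample
        odd_acc = 0.0   # accumulates X[k + 1], an odd-numbered sample
        for n in range(half):
            # One PE: two real multiplications and two real additions.
            even_acc += s[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
            odd_acc += d[n] * math.cos((2 * n + 1) * (k + 1) * math.pi / (2 * N))
        X[k], X[k + 1] = even_acc, odd_acc
    return X


if __name__ == "__main__":
    x = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0]  # arbitrary test input
    err = max(abs(a - b) for a, b in zip(dct_direct(x), dct_even_odd(x)))
    print("max abs difference:", err)
```

Each pass of the outer loop yields one even-numbered and one odd-numbered output together, which mirrors the statement that two outputs are generated concurrently.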
Also, the proposed systolic array can be used for computation of the DST and DHT by changing the input sequences and kernel values for the transforms. Similarly, the 8-point DST can be expressed in matrix form as
$$
\begin{bmatrix} X_1 \\ X_3 \\ X_5 \\ X_7 \end{bmatrix}
=
\begin{bmatrix}
s_1 & s_3 & s_5 & s_7 \\
s_3 & s_7 & s_1 & -s_5 \\
s_5 & s_1 & -s_7 & s_3 \\
s_7 & -s_5 & s_3 & -s_1
\end{bmatrix}
\begin{bmatrix} x_1 + x_8 \\ x_2 + x_7 \\ x_3 + x_6 \\ x_4 + x_5 \end{bmatrix}
\tag{9}
$$
$$
\begin{bmatrix} X_2 \\ X_4 \\ X_6 \\ X_8 \end{bmatrix}
=
\begin{bmatrix}
s_2 & s_6 & s_6 & s_2 \\
s_4 & s_4 & -s_4 & -s_4 \\
s_6 & -s_2 & -s_2 & s_6 \\
1 & -1 & 1 & -1
\end{bmatrix}
\begin{bmatrix} x_1 - x_8 \\ x_2 - x_7 \\ x_3 - x_6 \\ x_4 - x_5 \end{bmatrix}
\tag{10}
$$
where $s_m = \sin(m\pi/16)$ and the normalization factor is omitted.
Thus, we can compute the 8-point DST using the same structure shown in Fig. 1(c), except that the inputs to the PE's and the coefficient values for the transform are changed.
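The reuse of the same structure for the DST can be checked numerically with a short Python sketch. It assumes the unnormalized kernel $\sin((2n-1)k\pi/(2N))$ with one-based indices $n, k = 1, \ldots, N$, an assumption chosen because it reproduces the row $1, -1, 1, -1$ for $X_8$ in (10). The names `dst_direct` and `dst_even_odd` are hypothetical.

```python
import math


def dst_direct(x):
    """Reference DST with the assumed kernel sin((2n-1)k*pi/(2N)), n, k = 1..N."""
    N = len(x)
    return [
        sum(x[n - 1] * math.sin((2 * n - 1) * k * math.pi / (2 * N))
            for n in range(1, N + 1))
        for k in range(1, N + 1)
    ]


def dst_even_odd(x):
    """Same transform from mirrored sums (odd k) and differences (even k)."""
    N = len(x)
    half = N // 2
    # x[n - 1] is x_n and x[N - n] is x_{N+1-n} in one-based notation.
    s = [x[n - 1] + x[N - n] for n in range(1, half + 1)]
    d = [x[n - 1] - x[N - n] for n in range(1, half + 1)]
    X = []
    for k in range(1, N + 1):
        src = s if k % 2 == 1 else d  # odd k uses sums, even k uses differences
        X.append(sum(src[n - 1] * math.sin((2 * n - 1) * k * math.pi / (2 * N))
                     for n in range(1, half + 1)))
    return X


if __name__ == "__main__":
    x = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0]  # arbitrary test input
    err = max(abs(a - b) for a, b in zip(dst_direct(x), dst_even_odd(x)))
    print("max abs difference:", err)
```

Compared with the DCT case, only the pairing of sums and differences with odd and even outputs and the kernel values change, which is the sense in which the inputs to the PE's and the coefficients are changed.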
For a sequence $x(n)$, $0 \le n \le N-1$, the 8-point DHT can be computed by the same structure. Note that the 1-D DCT/DST/DHT can be computed by the same unified systolic array with the coefficient values changed. Therefore, we can obtain the 1-D DCT/DST/DHT using the same chip. The unified systolic array requires $N/2$ PE's for the $N$-point transform, and two real multipliers are needed in each PE. Therefore, the total number of real multipliers required in the proposed systolic array for the $N$-point transform is equal to $N$.
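A comparable Python sketch for the DHT is given below. It assumes the standard unnormalized kernel $\mathrm{cas}(2\pi nk/N) = \cos(2\pi nk/N) + \sin(2\pi nk/N)$ and the standard pairing $x(n) \pm x(n + N/2)$ for even- and odd-numbered samples. Both are assumptions, and the paper's own decomposition may differ. The names `cas`, `dht_direct`, and `dht_even_odd` are hypothetical.

```python
import math


def cas(t):
    """cas(t) = cos(t) + sin(t), the Hartley kernel."""
    return math.cos(t) + math.sin(t)


def dht_direct(x):
    """Reference unnormalized DHT (assumed definition)."""
    N = len(x)
    return [sum(x[n] * cas(2 * math.pi * n * k / N) for n in range(N))
            for k in range(N)]


def dht_even_odd(x):
    """DHT from half-length sums (even k) and differences (odd k)."""
    N = len(x)
    half = N // 2
    s = [x[n] + x[n + half] for n in range(half)]
    d = [x[n] - x[n + half] for n in range(half)]
    H = []
    for k in range(N):
        src = s if k % 2 == 0 else d
        H.append(sum(src[n] * cas(2 * math.pi * n * k / N) for n in range(half)))
    return H


if __name__ == "__main__":
    x = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0]  # arbitrary test input
    N = len(x)
    err = max(abs(a - b) for a, b in zip(dht_direct(x), dht_even_odd(x)))
    print("max abs difference:", err)
    # Multiplier count stated in the text: N/2 PE's with two multipliers each.
    print("PE's:", N // 2, "real multipliers:", 2 * (N // 2))
```

Under this definition the DHT is its own inverse up to a factor of $N$, so the same routine applied twice and divided by $N$ recovers the input.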
The second architecture for the DHT is a modified version of the first one in which additional computations are carried out in parallel. Fig. 2(a) shows the functional definition of $\mathrm{PE}_{\mathrm{even}}$ and $\mathrm{PE}_{\mathrm{odd}}$, which alternate in every other column of the proposed architecture in Fig. 2(b). Their difference lies in the definition of $b_o$ and $d_o$: the PE adds the two inputs $b_i$ ($d_i$) and $x_i$ ($y_i$) at even columns, whereas it subtracts the latter from the former at odd columns. Fig. 2(b) shows the systolic array structure of the modified algorithm for the 8-point DHT, based on (15), where the leftmost column requires $\mathrm{PE}_{\mathrm{even}}$. The input data sequence is fed into the PU from top to bottom, whereas the coefficient values for the transform are fed from bottom to top. Four outputs are generated concurrently from the rightmost PE. For the $N$-point DHT, the modified architecture requires $N/2$ PE's and a shorter computation time than the unified systolic array described above. Also, the same systolic array can be used to compute the inverse DCT/DST/DHT (IDCT/IDST/IDHT).
We generalize the proposed structure for computation of the 2-D DCT/DST/DHT.
The proposed systolic array for the 2-D DCT is shown in Fig. 3, where $x_{ij}$, $X_{kl}$, and $g_{il}$ denote $x(i,j)$, $X(k,l)$, and $g(i,l)$, respectively. Fig. 3(a) shows the functional definition of the basic PE and Fig. 3(b) shows the proposed architecture of the 2-D DCT with $N = 8$. The input data $i_0$, $i_1$, $i_2$, and $i_3$ to the PE are multiplied by the coefficients $k_2$ and $k_3$; then the sums and differences are stored in registers $A$, $B$, $C$, and $D$, as shown in Fig. 3(a). The input data sequence $x(i,j)$ is fed into the PU from top to bottom, whereas the coefficient values $k_0$ and $k_1$ ($k_2$ and $k_3$) for the transform are fed from top to bottom (from left to right). The $N$ 1-D DCTs along the rows are computed from the input data sequence and the coefficient values that are fed into the PE from left to right. The $N$ 1-D DCTs along the columns are computed using the values of the row transformation stored in a PE and the coefficient values that are fed into the PE from top to bottom. Therefore, the final DCT coefficients $X(k,l)$ are obtained from the rightmost PE's. We can compute the 2-D DST and DHT using the same structure shown in Fig. 4 by only changing the transform coefficients. Also, the unified systolic array can be used to compute the 2-D IDCT/IDST/IDHT with some modifications.
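The row-then-column ordering described above can be sketched in Python as follows. The sketch assumes a separable, unnormalized 2-D DCT-II with zero-based indices, and it treats the stored row-transform values as the intermediate array `g`. Identifying that array with the paper's $g(i,l)$ is an assumption, and the function names are hypothetical.

```python
import math


def dct_1d(v):
    """Unnormalized 1-D DCT-II (assumed definition)."""
    N = len(v)
    return [sum(v[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N))
            for k in range(N)]


def dct_2d_row_column(x):
    """N row DCTs followed by N column DCTs on the stored row results."""
    N = len(x)
    # g[i][l]: 1-D DCT along row i (the stored row-transformation values).
    g = [dct_1d(row) for row in x]
    # For each column l of g, a 1-D DCT over i gives X[k][l] for all k.
    cols = [dct_1d([g[i][l] for i in range(N)]) for l in range(N)]
    return [[cols[l][k] for l in range(N)] for k in range(N)]


def dct_2d_direct(x):
    """Reference: direct evaluation of the separable double sum."""
    N = len(x)
    c = [[math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N)]
         for k in range(N)]
    return [[sum(x[i][j] * c[k][i] * c[l][j] for i in range(N) for j in range(N))
             for l in range(N)] for k in range(N)]


if __name__ == "__main__":
    x = [[float((3 * i + 5 * j) % 7) for j in range(8)] for i in range(8)]
    A, B = dct_2d_row_column(x), dct_2d_direct(x)
    err = max(abs(A[k][l] - B[k][l]) for k in range(8) for l in range(8))
    print("max abs difference:", err)
```

Replacing `dct_1d` with a DST or DHT routine changes only the kernel values, in line with the remark that the 2-D DST and DHT use the same structure with different transform coefficients.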
III. Computational Complexity of 1-D and 2-D DCT Algorithms
In Table I, we show the performance comparison of four 1-D DCT algorithms using systolic array structures: Cho and Lee's [8], Chang and Wu's [9], Lee's [10], and the proposed architecture. Cho and Lee's algorithm [8] requires complex operations since it is based on the DFT. It also requires global connections: the input of the final PE is given independently rather than connected to the output of the preceding PE. In Chang and Wu's structure [9], which uses a 1-D systolic array, the PE is simple and its transform kernel values are evaluated recursively with the data sequence stored in the register; thus we have to change the values in the registers for the next input. Lee's algorithm [10], which uses a 2-D systolic architecture, is faster than the 1-D systolic architectures; however, it requires complex multiplications and additions, and its data path is complicated. It also requires another PE that performs a complex multiplication and takes the real part of the complex number. Both Cho and Lee's algorithm and Lee's structure thus require complex multiplications and additions.
In Table I, for the $N$-point DCT, the required number of real multipliers in the proposed PE is two, whereas the other architectures require two to eight real multipliers. Also, the proposed systolic array requires $N$ real multiplications, whereas the other architectures require a larger number of real multiplications. The latency of the proposed architecture is $NT$, whereas the other architectures require a larger number of clock cycles. Here the latency is defined as the time interval between the start of the first computation and the end of the last computation of the DCT. $T$ represents the required computation time for a real multiplication and addition, whereas $T'$ signifies the required time for a complex multiplication and addition. Note that $T < T'$. The area-time complexity $AT^2$ of the proposed architecture is $O(N^3)$, which is smaller than that of Cho and Lee's, Chang and Wu's, and Lee's architectures, where $A$ represents the area. Table II
