The prime-factor decomposition is a fast computational technique for many important digital signal processing operations, such as the convolution, the discrete Fourier transform, the discrete Hartley transform, and the discrete cosine transform (DCT). In this paper, we present a new prime-factor algorithm for the DCT. We also design a prime-factor algorithm for the discrete sine transform which is based on the prime-factor DCT algorithm.
Introduction
This paper is concerned with designing a fast prime-factor algorithm for the discrete cosine transform (DCT). The DCT, which performs much like the theoretically optimal Karhunen-Loeve transform for the rst-order Markov stationary random data, has found wide applications in speech and image processing as well as telecommunication signal processing for the purpose of data compression, feature extraction, and ltering. In order to compute the DCT e ciently, fast algorithms have been intensively studied . It has both theoretical and practical signi cance. Its main theoretical rationale is to convert a large-size one-dimension problem, by employing certain appropriate index mappings, into a multidimensional one. Then, we can deal with the resulting groups of small-size problems in each dimension.
For practical considerations, since in a typical DSP processor the memory for data storage is expensive and usually not large, it is more feasible to process small-size problems one at a time. In addition, when this approach is combined with e cient short-length algorithms, such as Rader's algorithm 20], or Winograd type minimum multiplication algorithms in DFT 31] , or Heideman's small odd-length DCT modules 8], etc., it would be of practical interest in reducing the scalar multiplication complexity.
Although Chan and Ho 6] and Cho and Lee 7] derived prime-factor DCT algorithms based on variant DFT algorithms before, their algorithms required additional complex-number multiplications. Unlike them, Yang and Narasimha 35] proposed a prime-factor DCT algorithm which included only real-number multiplications; however, its index mapping was complicated. From our point of view, an index mapping not only should be easy to understand, but also should be e cient in running. Lee 13] has achieved this goal. However, his input index mapping is realized by constructing and combining two index tables, which occupy additional memory space and would be infeasible in variable-size applications.
Chakrabarti and J aJ a 5] develop a systolic architecture for implementing Lee's algorithm. Because they want to compute the DCT from the DHT, they modify the index mappings, which are essentially the same as Lee's. However, they did not discuss the actual implementation for these index mappings.
In this paper, the input index mapping we adopt is the Ruritanian mapping, since its e cient realization can be based on the previous research e orts 17] 32] 33] 34]. In addition, the resulting algorithm complexity is by no means increased. As for the output index mapping, we employ the same one as Lee's, and this may be the most natural one in view of the DCT transform kernel's structure.
We also consider hardware implementations for the prime-factor DCT. We are interested in the hardware designs which are suitable for the VLSI (Very Large Scale Integration) implementations. We will show three hardware designs for the prime-factor DCT, including a VLSI circuit fabricated directly according to the signal-ow graph, a linear systolic array, and a mesh-connected systolic array. These three designs show the trade-o between cost and performance. Finally, by generalizing Wang's method 29], we design a prime-factor discrete sine transform (DST) algorithm based on the proposed prime-factor DCT algorithm. Both the DCT and the DST discussed in this paper are of type II de ned in 30], and their inverse transforms are of type III.
The rest of this paper is organized as follows. In Section 2, the DCT is introduced. In Section 3, we derive the prime-factor DCT algorithm. In Section 4, we propose three hardware implementations. In Section 5, we derive the prime-factor DST based on the prime-factor DCT. In Section 6, we give some concluding remarks. Finally, in the Appendix, we give the proof of two main theorems.
Background
For a given input data sequence x n , 0 n N ? 1, the DCT output data sequence X k , 0 k N ? 1, is de ned by X k = r 2 N (k)
and the IDCT (inverse discrete cosine transform) is de ned by
where (k) = (   1   p   2 ; for k = 0; 1 ; for 1 k N ? 1 : We assume throughout this paper that N is a product of two mutually prime integers N 1 and N 2 . For convenience, we will ignore the scaling factor (k) and the normalization factor q 2 N in Equations (1) and (2), since they can be done in a separate step. In the following, we deal with the simpli ed version of Equation (2) :
We want to convert Equation (3) into the form in Equation (4) by taking appropriate input and output index mappings:
where 0 n 1 < N 1 , 0 n 2 < N 2 , and Y (k 1 ;k 2 ) are the result of certain modi cations on X (k 1 ;k 2 ) .
The input index mapping connects the input index k, 0 k < N, to (k 1 ; k 2 ), 0 k 1 < N 1 and 0 k 2 < N 2 . The output index mapping connects the output index n, 0 n < N, to (n 1 ; n 2 ), 0 n 1 < N 1 and 0 n 2 < N 2 . Since the DCT is orthonormal, the forward transform also can be realized by taking the transpose of the inverse transform.
Derivation of the Prime-Factor Decomposition
The input index mapping we adopt is the Ruritanian mapping, which was also used in the primefactor DFT From Theorem 1, we can rewrite Equation (3) as follows: Because, when (k 1 ; k 2 ) is in E, f(k 1 ; k 2 ) = j g(k 1 ; k 2 ) j = (k 1 ; k 2 ), we have cos(
). Therefore, Equation (5) becomes:
Now, we de ne
Theorem 2
Next, we introduce the output index mapping, which was proposed by Lee 13] . For the completeness of this presentation, we list it here for easy reference.
The output index mapping: '(n) = (n 1 ; n 2 ), where Therefore, the following theorem can be derived directly from Theorem 2.
Theorem 3
Prime-factor algorithm for the IDCT:
Input: An N-point data sequence X k , 0 k < N, where N = N 1 N 2 , N 1 and N 2 are mutually prime.
Output: An N-point data sequence x n , 0 n < N.
Step 1: Apply the Ruritanian mapping on X k to construct an (N 1 N 2 )-point 2-D data matrix X (k 1 ;k 2 ) , where 0 k 1 < N 1 and 0 k 2 < N 2 .
Step 2: Modify the 2-D data matrix X (k 1 ;k 2 ) to get the 2-D data matrix Y (k 1 ;k 2 ) according to Equation (7).
/* This is proved to be correct in Theorem 2. */
Step 3: /* Row-column evaluation */ 1. Execute the N 1 -point IDCT for each of the N 2 columns of Y (k 1 ;k 2 ) , and the result is T (k 1 ;k 2 ) .
2. Execute the N 2 -point IDCT for each of the N 1 rows of T (k 1 ;k 2 ) , and the result is x (n 1 ;n 2 ) , where 0 n 1 < N 1 and 0 n 2 < N 2 .
/* This is proved to be correct in Theorem 3. */
Step 4: Apply the output index mapping in Equation (8) to get the nal result x n .
In the following, we use an example to demonstrate our method.
Example : In this example, N = 15, and N 1 = 3, N 2 = 5. Figure 1 shows the signal ow graph for implementing the 15-point 1-D IDCT, which can be converted into the (3 5)-point 2-D IDCT.
First, when given a 15-point 1-D input data sequence X k , 0 k < 15, we transform this 1-D data sequence by the Ruritanian mapping to a (3 5)-point 2-D data matrix X (k 1 ;k 2 ) , 0 k 1 < 3 and 0 k 2 < 5. to Equation (7) . Then, we apply the row-column evaluation for the computation: rst, we deal with ve 3-point IDCTs; then, take the transpose of the result; and then, deal with three 5-point IDCTs. Finally, the output data sequence x(n), 0 n < 15, can be obtained by using the output index mapping in Equation (8) . Table 2 shows the output index mapping for this case. Note that the signal ow graph in Figure 1 performs the IDCT if the signals ow from left to right, and performs the DCT if the signals ow from right to left.
Hardware Implementations
In this section, we consider several hardware implementations for the prime-factor IDCT. We are especially interested in the hardware designs which are suitable for the VLSI implementations. Fig. 1 shows the signal-ow graph for the implementation of a 15-point prime-factor IDCT. It is possible to fabricate such a 15-point prime-factor IDCT circuit directly into a single chip, which may contain millions of transistors, by using current VLSI technologies. The input data sequences can then be pipelined and enter into this chip, and an output data sequence is produced in every cycle time. This design, of course, has achieved the best performance and the maximum throughput. We will call this implementation, which is fabricated directly according to the signalow graph, Design I.
However, when the problem size grows, it is di cult to fabricate a large-size prime-factor IDCT circuit directly into a single chip. This is because it requires a network including several butter y-like interconnections and a transpose network that are very costly to layout in most circuit technologies (including VLSI).
After carefully studying the prime-factor algorithm, we nd that Step 1 and Step 4 only deal with O(N) memory references.
Step 2 only requires O(N) addition or subtraction operations. However, Step 3, which implements the row-column evaluation, requires at least O(N log N) multiplication and addition operations by any fast IDCT algorithm 9] 12] 14]. It is clear that Step 3 is the bottleneck of the algorithm. Steps 1, 2, and 4 can be implemented by the host computer, or by a single processing element, or by a single processing element each. Therefore, in the following we concentrate our e ort in solving Step 3 by using feasible VLSI algorithms.
In Fig. 1 , we nd that there are ve 3-point IDCT components and three 5-point IDCT components. In general, there are N 2 N 1 -point IDCT components and N 1 N 2 -point IDCT components.
In practice, in order to save hardware without slowing down the processing speed, we can use only one 3-point (N 1 -point) IDCT component and one 5-point (N 2 -point) IDCT component as shown in Fig. 2 , provided the data sequences are carefully arranged.
In Fig. 2, Steps 1, 2 , and 4 are each implemented by a single processing element.
Step 3 requires a 3-point (N 1 -point) IDCT component, a transpose bu er, and a 5-point (N 2 -point) IDCT component. The data matrix generated after Step 2 is pipelined and enters into the 3-point (N 1 -point) IDCT component column by column. The output data matrix of the 3-point (N 1 -point) IDCT component is stored in a transpose bu er. After this data matrix enters the transpose bu er, this data matrix in the transpose bu er is then pipelined and enters into the 5-point (N 2 -point) IDCT component row by row. The resulting data matrix is then sent to a processing element for generating the output data sequence by using the output index mapping.
We now consider the fabrication of an arbitrary N-point IDCT component. It is true that there is no fast algorithm for the N-point IDCT, except if N has a value of 2 to a power. However, in practice, the value of N may be varied. Previously, if N did not have a value of 2 to a power, then either we should pad additional zeros so that we could use a fast IDCT algorithm, or we should use the conventional matrix-vector multiplication algorithm to solve the N-point IDCT. Like the DFT, the IDCT also can be implemented by the conventional matrix-vector multiplication. In addition, we also can use a linear systolic array with N processing elements to implement the N-point IDCT in O(N) time. We will call this linear systolic array implementation Design II.
However, as each processing element requires N registers or memory storage to store part of the N-point IDCT coe cient matrix, each processing element needs additional memory addressing and control hardware, which will increase the circuit complexity. We now show that the computation required by the row-column evaluation can be represented by two matrix-matrix multiplications. Suppose that the N 1 N 2 data matrix Y is generated after
Step 2. Then, by the row-column evaluation, Fig. 3 shows the complete data ow of the 15-point IDCT systolic array implementation.
When given an input data sequence X k , 0 k < 15, we transform this 1-D data sequence by the Ruritanian mapping to a (3 5)-point 2-D data matrix X (k 1 ;k 2 ) by using only one processing element, for 0 k 1 < 3 and 0 k 2 < 5. Next, we modify the 2-D data matrix X (k 1 ;k 2 ) to the 2-D data matrix Y (k 1 ;k 2 ) according to Equation (7) also by using only one processing element.
Then, we use three systolic array components to implement the row-column evaluation. First, we deal with the matrix-matrix multiplication T 3 ) of the 5-point IDCT coe cient matrix are stored in the 5 5 mesh-connected systolic array initially. The skewed data matrix T T (k 1 ;k 2 ) ows from south to north. The data matrix x (n 1 ;n 2 ) , which requires it to be aligned skewedly and whose initial value x 0 (n 1 ;n 2 ) is zero, ows from east to west. The operations performed in each processing element of the 5 5 mesh-connected systolic array are the same as the case in the 3 3 mesh-connected systolic array.
The resulting data matrix x (n 1 ;n 2 ) can then undergo the output index mapping according to Equation (8) , and the output data sequence x n , 0 n < 15, can then be obtained. This step, however, can be implemented by using only one processing element. maxfN 1 ; N 2 g ) 4 ), which is much better than the case in Design II.
Computation of the IDST by using the IDCT
In this section, we show how to compute the IDST (inverse discrete sine transform) by using the IDCT. Wang 29] showed that an N-point DST could be computed by an N-point DCT. In the following, we generalize his method to compute the IDST by using the IDCT. Therefore, the prime-factor IDST can be computed by the prime-factor IDCT. Since the DST is orthonormal, the forward transform also can be realized by taking the transpose of the inverse transform. Note that, for the sake of distinguishing the formula of the IDST in Equation (9) from the formula of the IDCT in Equation (2), we use the superscript 0 s 0 for denoting the sine transform.
We now state how to compute the IDST by using the IDCT. That is, the data sequence fy m = (?1) m x s m g, for 0 m N ? 1, can be obtained by performing the IDCT on the data sequence fY k = X s N?1?k g, for 0 k N ? 1. Suppose that N is a product of two mutually prime integers. In the following, we list the procedure for computing the prime-factor IDST by using the prime-factor IDCT.
Prime-factor algorithm for the IDST:
Step 1: Compute the data sequence fY k = X s N?1?k g, for 0 k N ? 1;
Step 2: compute the IDCT output data sequence fy m g, 0 m N ? 1, for the data sequence fY k g, 0 k N ? 1, by using the prime-factor IDCT algorithm;
Step 3: compute x s m = (?1) m y m , for 0 m N ? 1.
Conclusion
We have presented in this paper a new prime-factor DCT algorithm, which consists of four steps: the input index mapping; the modi cation; the row-column evaluation; and the output index mapping. The input index mapping we adopted is the Ruritanian mapping method, which is much simpler than the ones designed by Yang and Narasimha 35], and Lee 13] . The output index mapping we employed is the same one as Lee's 13] . We also designed a prime-factor DST algorithm, which is based on the prime-factor DCT algorithm. complex-number multiplications and additions each for the row-column evaluation.
We then considered hardware implementations for the prime-factor DCT. We have shown three hardware designs for the prime-factor DCT. The rst one, which is a VLSI circuit fabricated directly according to the signal-ow graph, might be not easy to implement when the problem size becomes large. The second one is a linear systolic array implementation, which contains N processing elements and can solve an N-point DCT 
Lemma 4
Proof :
(right-hand side) 2 
