I. INTRODUCTION
THE DISCRETE cosine transform (DCT)
However, though most of them are good software solutions to the realization of DCT, only a few of them are really suitable for VLSI implementation.
Cyclic convolution plays an important role in digital signal processing because it is easy to implement. Specifically, there exist a number of well-developed convolution algorithms [15], and cyclic convolution can be easily realized with modular and structured hardware such as distributed arithmetic [16] and systolic arrays [17].
The way data are moved largely determines the efficiency of realizing a transform with the distributed arithmetic. Realizing a cyclic convolution with the distributed arithmetic requires only a simple table look-up technique and some simple rotations of the corresponding data set. Hence, the cyclic convolution structure can be considered the simplest form, and the one most suitable for realization with the distributed arithmetic. For this reason, one may consider that the basic criterion for realizing a transform with the distributed arithmetic is whether the transform can be converted efficiently into cyclic convolution form. If a transform can be converted into cyclic convolution form with the minimum number of operations, this implies an optimal approach for realizing the transform with the distributed arithmetic.
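The role of data rotation can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names `cyclic_convolution` and `rotate`, and the sample data, are illustrative assumptions. It shows that every output of a cyclic convolution uses the same kernel, so rotating the input merely rotates the output.

```python
def cyclic_convolution(x, h):
    """Direct N-point cyclic convolution: y[k] = sum_i x[i] * h[(k - i) mod N]."""
    n = len(x)
    assert len(h) == n
    return [sum(x[i] * h[(k - i) % n] for i in range(n)) for k in range(n)]


def rotate(seq, r):
    """Rotate a list right by r positions (element i moves to position i + r)."""
    r %= len(seq)
    return seq[-r:] + seq[:-r] if r else list(seq)


# Illustrative data only.
x = [1, 2, 3, 4, 5]
h = [2, 0, -1, 3, 1]
y = cyclic_convolution(x, h)

# Rotating the input by r positions rotates the output by r positions,
# while the kernel h stays fixed.
for r in range(len(x)):
    assert cyclic_convolution(rotate(x, r), h) == rotate(y, r)
```

Because the kernel never changes from one output to the next, a single precomputed table indexed by the rotated data can serve all outputs.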
Some [11]. The former case has the major problem that it sacrifices the main advantage of the distributed arithmetic, namely the replacement of multiplications by additions. The latter case requires relatively complicated circuitry to allow the realization of cyclic convolutions of variable lengths. Alternatively, one may convert the DCT into the discrete Fourier transform (DFT) [3], [13] and make use of the well-known algorithms [18], [19] to convert the corresponding DFT into cyclic convolution form. This is a possible approach; however, it turns a real transform into a transform with complex numbers. The realization could still be complicated even if some simplification techniques are applied.
In this paper, we propose an algorithm to convert an odd prime length DCT/IDCT into two half-length cyclic convolutions directly. This algorithm involves no multiplication during the conversion and suggests a possible way to design a unified DCT/IDCT chip. Owing to its structure, this algorithm is well suited to VLSI implementation using the distributed arithmetic. A 2-D 11 × 11 unified DCT/IDCT chip design is also provided in this paper to demonstrate the superiority of the proposed algorithm.
If N is an odd number, there exists a bijective mapping on the set {i : i = 0, 1, ..., N - 1}: 3, 2, 1, 0, 10, 9, 8, 7, 6) where i = 0, 1, 2, ..., 10 accordingly. By making use of this bijective mapping, we can split (1) and rewrite it as
If N is an odd prime P, there exist two bijective mappings defined as
where g is a primitive root of P.
To make use of these two mappings, one can redefine sequences {A(k)} and {B(k)} for k = 1, 2, ..., P - 1 as
Then both A(k) and B(k) defined in (7) can be converted into (P - 1)-length cyclic convolutions by mapping i and k to φ(i) and ξ(k), respectively. In formulation, we have, for k = 1, 2, ..., P - 1, (8a)
However, to make the algorithm more efficient, we can further simplify (8a) and (8b). In particular, as for i = 1, 2, ... Equations (11a) and (11b) are exactly a (P - 1)/2-length cyclic convolution and a (P - 1)/2-length skew-cyclic convolution, respectively. Hence, A(k) and B(k) for k = 1, 2, ..., (P - 1)/2 defined in (4) can be realized through two (P - 1)/2-length convolutions (one cyclic convolution and one skew-cyclic convolution) with an additional cost of P - 1 additions.
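The half-length reduction can be illustrated with a general identity, sketched here in Python. The sketch does not reproduce (8)-(11). It assumes only that a 2L-point kernel repeats with period L (giving a cyclic convolution) or changes sign after L points (giving a skew-cyclic convolution). All function names and data are hypothetical.

```python
def cyclic_conv(x, h):
    """L-point cyclic convolution: y[k] = sum_i x[i] * h[(k - i) mod L]."""
    n = len(x)
    return [sum(x[i] * h[(k - i) % n] for i in range(n)) for k in range(n)]


def skew_cyclic_conv(x, h):
    """L-point skew-cyclic convolution: kernel taps that wrap around change sign."""
    n = len(x)
    return [sum(x[i] * (h[k - i] if k >= i else -h[k - i + n]) for i in range(n))
            for k in range(n)]


def fold_symmetric(x, h_half):
    """2L-point cyclic convolution with h(n + L) = h(n): L additions, then L-point cyclic."""
    L = len(h_half)
    folded = [x[i] + x[i + L] for i in range(L)]
    return cyclic_conv(folded, h_half)


def fold_antisymmetric(x, h_half):
    """2L-point cyclic convolution with h(n + L) = -h(n): L subtractions, then L-point skew-cyclic."""
    L = len(h_half)
    folded = [x[i] - x[i + L] for i in range(L)]
    return skew_cyclic_conv(folded, h_half)


# Illustrative data with L = 5, i.e. the size that arises for P = 11.
x = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3]
h_half = [2, 7, -1, 8, -2]

full_sym = cyclic_conv(x, h_half + h_half)
full_anti = cyclic_conv(x, h_half + [-v for v in h_half])

assert fold_symmetric(x, h_half) == full_sym[:5]
assert fold_antisymmetric(x, h_half) == full_anti[:5]
```

Each fold costs L = (P - 1)/2 additions or subtractions, so the two folds together cost P - 1, which agrees with the additional cost stated above.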
Let us use an example with P = 11 (primitive root g = 2) to clarify our approach.
First of all, we realize sequences {A(k) : k = 1, 2, ..., 5} and {B(k) : k = 1, 2, ..., 5} via a 5-length cyclic convolution and a 5-length skew-cyclic convolution, respectively.
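For the P = 11, g = 2 example, the properties of the primitive root that make the index permutation possible can be checked with a few lines of Python. This is only a numerical check of number-theoretic facts. The mappings in (6) are not reproduced here, and the variable names are illustrative.

```python
P, g = 11, 2

# Successive powers of a primitive root visit every nonzero residue exactly once.
powers = [pow(g, i, P) for i in range(1, P)]
print(powers)  # [2, 4, 8, 5, 10, 9, 7, 3, 6, 1]
assert sorted(powers) == list(range(1, P))

# Halfway through the cycle the power equals -1 (mod P), so indices that are
# (P - 1)/2 steps apart map to residues that are negatives of each other.
half = (P - 1) // 2
assert pow(g, half, P) == P - 1
for i in range(1, half + 1):
    assert (pow(g, i, P) + pow(g, i + half, P)) % P == 0
```

The second property is what allows terms that are (P - 1)/2 positions apart to be paired, which is consistent with the half-length convolutions used in this example.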
As the sequence {f(φ(i)) + f(φ((P - 1)/2 + i)) : i = 1, 2, ..., (P - 1)/2} is computed during the realization of A(ξ(k)), the computation of Y(0) requires (P - 1)/2 additions only. In other words, a P-length DCT can be realized with two (P - 1)/2-length convolutions at a total cost of 2(P - 1) additions.
III. ONE-DIMENSIONAL IDCT
Obviously, by making use of the zero-padding technique, we can redefine sequences {G(i)} and {H(i)} as in (16) and (18). Then G(i) and H(i) are exactly in the form of (7a) and (7b), respectively. In the previous section, we have proved that equations in the form of (7a) and (7b) can be converted into cyclic convolution form easily by using the mappings defined in (6) if N is an odd prime P. By using a similar approach, we can rewrite (16) and (18) as (20) and (21). Equations (20) and (21) are an (N - 1)/2-length cyclic convolution and an (N - 1)/2-length skew-cyclic convolution, respectively. In such a case, an odd prime length IDCT can also be realized via two half-length convolutions, similar to the case for the DCT.
Note that no multiplication is involved as overhead for the conversion of an odd prime P-length IDCT into convolutions. Actually, only 2(P - 1) additions are required during the conversion. In other words, a P-length IDCT can be realized through two (P - 1)/2-length convolutions with a cost of 2(P - 1) additions. This is exactly the cost of realizing a P-length DCT with convolutions.
Again, we use the example with N = 11 to clarify our approach.
To compute the sequence {G(i) : i = 1, 2, ..., 5}, we can make use of (20), (17), and (6), where c(n) = cos(2nπ/11).
On the other hand, we can obtain sequence {H(i) : i = 1, 2, ..., 5} by making use of (20), (19), and (6), where s(n) = sin(2nπ/11).
Finally, we use (14) to compute the final result, {y(i) : i = 0, 1, ..., 10}.
Both the DCT and the IDCT can be realized via convolutions with the same cost. Specifically, if both of them possess the same length, one can make use of the same convolution module to realize both the forward and the inverse DCT. As cyclic convolution is the core module of this algorithm, this algorithm is most suitable for the realization using the distributed arithmetic and it also suggests an efficient and effective way to design a unified DCT/IDCT chip.
IV. VLSI IMPLEMENTATION OF UNIFIED DCT / IDCT CHIP
In the preceding sections, we have proposed an algorithm to convert a P-length DCT/IDCT into a half-length cyclic convolution and a skew cyclic convolution. This provides a straightforward but ideal solution for the VLSI implementation of a unified DCT/IDCT chip by making use of the distributed arithmetic.
Consider a cyclic convolution defined as

$$F(k) = \sum_{q=0}^{N-1} g(q-k)\, C(q),$$

where the index q - k is taken modulo N, and let each datum be written in 2's-complement form as

$$g(q-k) = -g(q-k)_0 + \sum_{j=1}^{M-1} g(q-k)_j\, 2^{-j},$$

where M, g(q - k)_j, and g(q - k)_0 are the word length, the jth most significant bit, and the sign bit, respectively. After scaling to a 2's-complement fractional number, F(k) can be rewritten as

$$F(k) = \sum_{j=1}^{M-1} 2^{-j} \left\{ \sum_{q=0}^{N-1} g(q-k)_j\, C(q) \right\} - \sum_{q=0}^{N-1} g(q-k)_0\, C(q).$$

The sums of g(q - k)_j C(q) over q can be precalculated and stored in a ROM with ROM size = 2^N words. Then F(k) can be obtained by M ROM accesses and M - 1 shift-additions after the g(n)'s are available. Note that the same table can be used for the computation of F(k) for any value of k, which is impossible in the case of computing inner products other than a cyclic convolution. Hence, to a certain extent, one can consider that the distributed arithmetic is most suitable for VLSI implementation of cyclic convolutions.
Several high-performance chips have been designed by making use of the distributed arithmetic [20]-[26]. However, in most designs, the distributed arithmetic is used to realize a typical inner product directly without first converting the transform into cyclic convolutions. In such a case, the optimal performance of the distributed arithmetic cannot be achieved, and the consequence is a large memory requirement for the data tables.
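The table-sharing property of the distributed arithmetic described above can be sketched in Python. This is a behavioural model under stated assumptions, not the chip's circuit. It uses integer 2's-complement words processed from the least significant bit, rather than scaled fractions processed from the most significant bit. The names `build_rom` and `da_cyclic_convolution` and the sample data are hypothetical.

```python
def build_rom(coeffs):
    """Precompute 2^N partial sums: entry `addr` is the sum of coeffs[q] over the set bits q of addr."""
    n = len(coeffs)
    return [sum(c for q, c in enumerate(coeffs) if (addr >> q) & 1)
            for addr in range(1 << n)]


def da_cyclic_convolution(g, coeffs, word_bits):
    """F(k) = sum_q g[(q - k) mod N] * coeffs[q], using only ROM look-ups, shifts and additions."""
    n = len(g)
    rom = build_rom(coeffs)              # one table, shared by every output k
    mask = (1 << word_bits) - 1
    out = []
    for k in range(n):
        # Rotated data words, stored as word_bits-wide two's-complement patterns.
        words = [g[(q - k) % n] & mask for q in range(n)]
        acc = 0
        for j in range(word_bits):       # one ROM access per bit position
            addr = 0
            for q, word in enumerate(words):
                addr |= ((word >> j) & 1) << q
            term = rom[addr] << j
            # The sign bit carries negative weight in two's complement.
            acc += -term if j == word_bits - 1 else term
        out.append(acc)
    return out


# Illustrative 5-point example with 8-bit data words.
g = [5, -3, 120, -128, 7]
coeffs = [3, -2, 7, 1, -5]
direct = [sum(g[(q - k) % 5] * coeffs[q] for q in range(5)) for k in range(5)]
assert da_cyclic_convolution(g, coeffs, 8) == direct
```

For a 5-point convolution the table holds 2^5 = 32 words, and only the data rotation changes between outputs, never the table contents.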
A P × P unified DCT/IDCT can be implemented by the row-column decomposition technique as shown in Fig. 1. In fact, the row-column approach is commonly applied in most 2-D DCT chips due to its flexible and regular nature. We first compute P one-dimensional (P × 1) DCT/IDCTs, one along each row, and store the results in an intermediate array. We then compute P one-dimensional (P × 1) DCT/IDCTs, one along each column, to yield the final results. Note that the intermediate memory is realized by a RAM of P × P words and the transposition operation can be easily achieved by a suitable control of the addresses of the intermediate array.
Fig. 2 shows the block diagram of the one-dimensional unified DCT/IDCT module. The module mainly consists of three operating units, namely, an accumulator, a pre/post-processing unit, and a kernel-processing unit. Note that the whole process is a three-stage pipeline. The accumulator is responsible for the computation of the dc term in the DCT mode and the y((N - 1)/2) term in the IDCT mode, which involves additions or subtractions only. A typical accumulator can satisfy this requirement. The pre/post-processing unit is actually a typical adder which is responsible for the preparation of the input data for convolutions in the DCT mode and the computation of the final results from the convolution outputs in the IDCT mode. The arrangement of the pre/post-processing stage and the kernel-processing stage determines the configuration of the unified chip, which can be easily handled with multiplexers. The table provided in Fig. 2 specifies the relationship between the MUX configuration and the mode configuration of the module.
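The row-column decomposition described above can be sketched in Python as follows. The 1-D transform used here is an unnormalized DCT written in an assumed form; the paper's own definition (1) is not reproduced. The names `dct_1d`, `transpose`, and `transform_2d` are illustrative.

```python
import math


def dct_1d(v):
    """Assumed unnormalized N-point DCT: Y(k) = sum_i v[i] * cos((2i + 1) k pi / (2N))."""
    n = len(v)
    return [sum(v[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n)) for i in range(n))
            for k in range(n)]


def transpose(m):
    """Transpose a square array; in hardware this is done by address control of the RAM."""
    return [list(col) for col in zip(*m)]


def transform_2d(block, transform_1d):
    """Apply a 1-D transform along every row, then along every column."""
    rows_done = [transform_1d(row) for row in block]                # P row transforms
    cols_done = [transform_1d(col) for col in transpose(rows_done)]  # P column transforms
    return transpose(cols_done)


# Illustrative 3 x 3 block (P = 3 is an odd prime, chosen to keep the example small).
block = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 10.0]]
result = transform_2d(block, dct_1d)

# The (0, 0) output of this unnormalized transform is the sum of all samples.
assert abs(result[0][0] - 46.0) < 1e-9
```

Because `transform_2d` takes the 1-D transform as a parameter, the same row-column skeleton serves the forward and the inverse transform, which mirrors the reuse of one 1-D module in the 2-D chip.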
Both preshuffling and postshuffling of data can be easily done through the table look-up technique. In a typical pipeline design, input data and output data are normally buffered. Hence, if the sequence of addresses is generated in such a way that the input or the output data are fetched in the desired order, then both the preshuffle and the postshuffle can be achieved. As the transform size is typically fixed and small, the desired address sequence can be precomputed and stored in a small table. In such a case, the appropriate data can be fetched with an indirect addressing method.
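Shuffling by indirect addressing can be sketched in Python as follows. The address order used here, successive powers of g modulo P, is only a hypothetical stand-in for the permutation a real design would store. The function names are likewise illustrative.

```python
def make_address_table(P, g):
    """Hypothetical precomputed read order: index g^i mod P for i = 1, ..., P - 1."""
    return [pow(g, i, P) for i in range(1, P)]


def fetch_shuffled(buffer, table):
    """Preshuffle: read the buffered data through the address table."""
    return [buffer[addr] for addr in table]


def store_unshuffled(values, table, size):
    """Postshuffle: write results back through the same table (slot 0 is left untouched)."""
    out = [None] * size
    for value, addr in zip(values, table):
        out[addr] = value
    return out


P, g = 11, 2
table = make_address_table(P, g)        # small, fixed, computed once
data = [10 * i for i in range(P)]       # buffered input samples (illustrative)

shuffled = fetch_shuffled(data, table)
restored = store_unshuffled(shuffled, table, P)
restored[0] = data[0]                   # index 0 is handled outside the permutation here

assert restored == data
```

The data themselves are never moved for the shuffle; only the order in which addresses are issued changes, which is why a small table suffices for a fixed transform size.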
The kernel-processing unit basically consists of two convolvers. Both convolvers are realized with the distributed arithmetic. Fig. 3 shows the implementation of a 5-point convolver, which can be used in the VLSI realization of an 11-point unified DCT/IDCT chip. The two convolvers differ from each other in both their address generators and their look-up tables stored in ROMs. In this example, the internal word length and the word lengths of data {x(i)} and {X(k)} are, respectively, 12, 8, and 12 bits. Note that these parameters always achieve a signal-to-noise ratio of greater than 44 dB under the simulation test.
In such a case, the address generator of the cyclic convolver can be implemented with a 60-bit bitwise circular buffer with shift operations. At the beginning of a specific convolution cycle, the input data for the convolutions (five 12-bit words in this example) are loaded into the circular buffer in parallel. In order for the chip to achieve a throughput rate of 1 output per clock cycle, each convolver has to produce an output every 2 clock cycles. In the first cycle, the six least significant bits of the five data form six 5-bit addresses to access six ROM tables, respectively. All fetched data are summed with a carry-save adder to form a partial result. In the second cycle, the six most significant bits of the five data form another six addresses to fetch another six data. These data are then summed with the shifted partial result to obtain the final result. The circular buffer advances 6 bits and repeats the foregoing procedure until all results are obtained. This completes a full convolution cycle, and another one starts by loading another input sequence one clock cycle later. In such a case, the circular buffer rotates 6 bits every clock cycle. Hence, the address generator can be implemented with bitwise shift registers with a parallel load function to relieve the burden of the clock synchronization of the circular buffer. Note also that a complete convolution cycle spans P - 1 clock cycles only, while one gets P clock cycles to complete a transform. The inputs of the convolvers can be split and loaded into the address generators in two cycles to reduce the input bandwidth of the convolvers.
Each ROM table consists of 32 words. Note that the contents of all ROM tables of a specific convolver are identical. In other words, one can use a multiport ROM to save a number of ROM tables. Besides, as shown in Fig. 3, the word lengths of different ROM tables are not necessarily identical since the fetched data are not equally significant. These features are obviously superior to other chip designs which use the distributed arithmetic to implement inner products without first converting them into cyclic convolutions.
For the implementation of the address generator of the skew-cyclic convolver, a small additional circuit is required to perform a 2's complement negation to the datum passing through the head of the circular buffer.
The contents of the ROM tables are also different from those used in the cyclic convolver.
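The effect of the negation at the head of the circular buffer can be sketched in Python. This is a word-level behavioural model under the assumption that one whole datum is rotated per output. The actual circuit works on bit slices, and the function names and data are hypothetical.

```python
def skew_cyclic_direct(g, c):
    """Reference: F(k) = sum_q c[q] * s * g[(k - q) mod L], with s = +1 if k >= q, else -1."""
    n = len(g)
    return [sum(c[q] * (g[k - q] if k >= q else -g[k - q + n]) for q in range(n))
            for k in range(n)]


def skew_cyclic_with_buffer(g, c):
    """Same result using a circular buffer whose wrapped-around word is negated."""
    n = len(g)
    # Buffer state for k = 0: g(0), then the remaining words reversed and negated.
    buf = [g[0]] + [-v for v in reversed(g[1:])]
    out = []
    for _ in range(n):
        # Inner product with fixed coefficients (a ROM look-up in the hardware).
        out.append(sum(cq * dq for cq, dq in zip(c, buf)))
        # Rotate by one word; the word that wraps around is negated on the way.
        buf = [-buf[-1]] + buf[:-1]
    return out


# Illustrative 5-point data.
g = [1, 2, 3, 4, 5]
c = [2, -1, 0, 3, 1]
assert skew_cyclic_with_buffer(g, c) == skew_cyclic_direct(g, c)
```

Apart from the single negation on the wrap-around path, the buffer behaves exactly like the one used for the cyclic convolver, which is why only a small additional circuit is needed.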
The silicon efficiency of the unified chip is extremely high. The configuration of the chip, which is controlled by the MUXs, involves the arrangement of the pre/post-processing unit and the kernel unit only. For other typical designs [20]-[26], the convolvers have to swap ROM tables whenever the mode of the unified chip is swapped. However, no such step is necessary in the proposed design. By considering that both (11a) and (20) -B(1)]^T in the DCT realization. Hence, whether the chip is configured to perform a DCT or an IDCT, no modification of the convolvers is necessary. Consequently, almost no silicon area of the chip is idle in a particular transform, and a highly efficient unified chip can be implemented.
Furthermore, as shown in Figs. 1 and 2, the convolvers are the core units of the unified chip and the whole chip involves no multiplier. Since the convolutions are reformulated at the bit level by using the distributed arithmetic, the following advantages can be achieved: 1) no actual multiplication is involved, as multipliers are replaced by memory look-up tables, 2) accuracy is high, as the structure suffers fewer rounding/truncation errors than other structures, 3) modular circuit design is possible, as the structure is extremely regular, and 4) the structure is simple, which saves gate count and makes routing easy. These features allow a high-speed circuit design composed of memories, adders, and registers only. The proposed design aims to achieve a throughput rate of 1 output per clock cycle. Obviously, the two convolution modules play a significant role in the unified chip and dominate the timing performance of the whole chip. By making use of the current 2-μm CMOS technology, the proposed architecture can easily meet the speed requirement of 14.3-MHz real-time operation.
V. CONCLUSIONS
In this paper, we propose a new algorithm to realize an odd prime P-length DCT with two half-length convolutions (one cyclic convolution and one skew-cyclic convolution). This algorithm can be easily modified to realize an IDCT with odd prime length. In such a case, one can realize both DCT and IDCT with the same convolution module if they possess the same length. As the operations other than the convolutions required for realizing either DCT or IDCT are just 2(P - 1) additions and some simple permutations, only a small percentage of the unified chip is idle in a particular transform. Hence, one can design a very efficient unified chip. Furthermore, by making use of the distributed arithmetic, the VLSI implementation of the convolution module can result in a very simple and modular structure without multipliers. In other words, an efficient unified DCT/IDCT chip which involves only adders, latches, and memory tables can be implemented in a very straightforward way. These algorithms can also be easily extended to realize a multidimensional DCT/IDCT by using the row-column decomposition technique. A 2-D 11 × 11 unified DCT/IDCT chip design is also proposed in this paper. The proposed architecture can easily meet the speed requirement of 14.3-MHz real-time operation with the current 2-μm CMOS technology.
