Abstract. This paper presents a 2-D DCT/IDCT processor chip for high-data-rate image processing and video coding. It uses a fully pipelined row-column decomposition method based on two 1-D DCT processors and a transpose buffer built from D-type flip-flops with a double serial input/output data flow. The proposed architecture allows the main processing elements and arithmetic units to operate in parallel at half the frequency of the data input rate. Its main characteristics are high throughput, parallel processing, reduced internal storage, and maximum efficiency in the computational elements. The processor has been implemented using a standard-cell design methodology in 0.35 µm CMOS technology. It measures 6.25 mm² (the core is 3 mm²) and contains a total of 11.7 k gates. The maximum frequency is 300 MHz, with a latency of 172 cycles for the 2-D DCT and 178 cycles for the 2-D IDCT. The computing time of a block is close to 580 ns. It has been designed to meet the demands of IEEE Std. 1180-1990, used in different video codecs. The good performance in computing speed and hardware cost indicates that this processor is suitable for HDTV applications.
Introduction
The Discrete Cosine Transform (DCT) is widely considered to provide near-optimal performance for transform coding and image compression, because it offers energy compaction, orthogonal separability and fast algorithms [1]. Thus, the DCT has been adopted in most recent international standards for still and moving pictures and their sequential codecs [2, 3], such as JPEG, MPEG, H.261 and H.263, as well as in high-definition television (HDTV) systems. The computational complexity of many real-time applications often leads to the use of efficient dedicated hardware (ASICs) operating at high speed with an acceptable cost in area.
Since the introduction of the DCT in the 1970s, a considerable amount of research has been carried out on algorithms, architectures and processor designs for computing the DCT. Many VLSI implementations of the DCT and its inverse (IDCT) have been proposed in the literature, pursuing to a greater or lesser extent some of the following characteristics: low area cost [4-9], regularity to reduce the design effort [10, 11], high throughput [4, 5, 9, 11-13] and low power [4, 14, 15]. Different approaches have been proposed to implement the 2-D DCT/IDCT: the row-column decomposition method, the direct method, and less common alternatives based on other transforms (such as the DFT [16] and the DHT [17]), CORDIC algorithms [18] and systolic array implementations [19]. The row-column decomposition method uses the separability property of the 2-D DCT to break it into two sequential 1-D DCTs, the first applied row-wise to the block and the second applied column-wise to the row-processed results, which are stored in a transpose memory [4, 6, 8, 9, 11-15, 20, 21]. This method allows a 2-D DCT to be computed using fast algorithms and hardware developed for the 1-D DCT. In some implementations, a single multiplexed 1-D DCT processor performs both operations, with the corresponding saving in hardware [8-10, 13, 20]. Roughly 90% of the surveyed implementations follow the row-column decomposition method because its regularity is highly suitable for VLSI implementation. The direct method requires fewer computations, but it leads to an irregular structure [5, 7, 10]. However, its low computational complexity remains attractive, and some regular structures have been investigated recently.
This paper describes the architecture of an 8×8 2-D DCT/IDCT processor chip with a high throughput and a cost-effective architecture [22]. The 2-D DCT/IDCT is calculated using the separability property, so that its architecture is made up of two 1-D processors and a transpose buffer (TB) as intermediate memory.
The transpose buffer has a regular structure based on D-type flip-flops with a double serial input/output data flow that is highly suitable for pipeline architectures. The processor has been designed for high throughput, reduced hardware, a parallel and pipelined architecture, and maximum efficiency in all arithmetic elements. This architecture allows the processing elements and arithmetic units to work in parallel at half the frequency of the data input rate, except for the normalisation of the transform, which is carried out in a multiplier operating at the full input data rate. Moreover, precision analysis has verified that the proposed processor meets the demands of IEEE Std. 1180-1990 [23], used in the video codecs ITU-T H.261 [24], ITU-T H.263 [25] and ITU-T H.263+ [26]. The processor has been designed using a standard-cell methodology and manufactured in a 0.35-µm CMOS CSD 3M/2P 3.3 V process (http://www.asic.austriamicrosystems.com). It has an area of 6.25 mm² (the core is 3 mm²) and contains a total of 11.7 k gates, 5.8 k of which are flip-flops. A data input rate of 300 MHz has been established, with a latency of 172 cycles for the 2-D DCT and 178 cycles for the 2-D IDCT. The computing time of a block is close to 580 ns. This good performance in computing speed and hardware cost indicates that the proposed design is compact and suitable for HDTV applications.
The paper is organized as follows: Section 2 presents the principles and algorithm used to implement the 2-D DCT/IDCT. Section 3 addresses the architectural design and circuit design features of the basic processing elements. A description of a block diagram of the 2-D DCT/IDCT processor is presented in Section 4. Section 5 describes the main arithmetic elements and, finally, chip characteristics and comparisons with other previous DCT/IDCT processors are described in Section 6.
Two-Dimensional 8×8 DCT/IDCT
The 8×8 DCT transforms a block of the space domain, $\{x(n,m)\}_{n,m=0}^{7}$, into its DCT-domain components according to the following equation:
The IDCT is defined by:
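For reference, the standard orthonormal 8×8 DCT pair that these two definitions correspond to can be written as below. The output symbol $X(k,l)$ and the scale factors $c(k)$ are notation assumed here and may differ from the symbols used in Eqs. (1) and (2).

```latex
% Forward 8x8 DCT (standard definition; X and c are assumed notation)
X(k,l) = \frac{c(k)\,c(l)}{4} \sum_{n=0}^{7}\sum_{m=0}^{7} x(n,m)\,
         \cos\frac{(2n+1)k\pi}{16}\,\cos\frac{(2m+1)l\pi}{16},
         \qquad k,l = 0,\ldots,7

% Inverse 8x8 DCT
x(n,m) = \frac{1}{4} \sum_{k=0}^{7}\sum_{l=0}^{7} c(k)\,c(l)\,X(k,l)\,
         \cos\frac{(2n+1)k\pi}{16}\,\cos\frac{(2m+1)l\pi}{16},
         \qquad n,m = 0,\ldots,7

% Scale factors
c(0) = \frac{1}{\sqrt{2}}, \qquad c(k) = 1 \quad \text{for } k \neq 0
```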
In matrix notation, let $S_{R8}$ be the eight-point DCT matrix with rows reordered according to the sequence (0,4,2,6,1,5,3,7):
where $C_k = \cos(k\pi/16)$, $k = 1, \ldots, 7$. Taking into account the properties of the cosine, the $C_1$, $C_3$, $C_5$ and $C_7$ elements of $S_{R8}$ can be expressed as:
The elements of columns 2, 3, 6 and 7 of $S_{R8}$ can be decomposed by applying Eq. (4) in the following way:
The flow chart of the inverse DCT is shown in Fig. 2. In the computation of this algorithm, the roles of the inputs and outputs are reversed, and $J_{RE4}$ and $J_{RO4}$ must be replaced by $J_{RE4}^{t}$ and $J_{RO4}^{t}$, respectively. The 8×8 2-D DCT can be expressed in terms of the kernels of the $S_{R8}$ transform matrix as:
where $\odot$ denotes the Hadamard product and $K_8$ is the normalization matrix defined as
Similarly, the 8×8 IDCT is obtained as:
Eqs. (11) and (13) allow the multiplication by the $K_8$ coefficients to be done at the output for the forward DCT or at the input for the IDCT, reducing the number of inner multiplications. This scheme is very attractive when the transform is incorporated into the decoding process of an adaptive transform coding system: the quantization process at the receiver or at the transmitter can include this normalization in the decoding/encoding lookup table. In this case, the normalization can be completely eliminated and a significant reduction in hardware is obtained.
and $J_{04B}/J_{04B}^{t}$. One important characteristic is that the input data $I_E$ and $I_O$ are processed in parallel, so the output data $O_E$ and $O_O$ are also generated in parallel. In this way, the operating frequency of the $J_{R8}/J_{R8}^{t}$ processor is reduced to $f_s/2$, where $f_s$ is the input data sampling frequency. Figure 4 shows the architecture of each of the basic processors specified in Fig. 3, which are derived from Eqs. (7)-(10). The control is very simple and is carried out using four signals: Clk1, the main clock at frequency $f_s$; Clk2, the internal clock at frequency $f_s/2$; and the multiplexer selection signals $S_1$, at frequency $f_s/4$, and $S_2$, at frequency $f_s/8$. All of the basic processors have been designed to work with four input data introduced in series, so their four output data are also produced in series and are compatible with the next processor. The Forward/Inverse (F/I) signal modifies the operation of the processor to perform the transform or its inverse. These basic processors are made up of shift registers (S-R), multiplexers (MUX), carry incrementer adders/subtractors and hardwired multipliers, and they have been designed aiming at an efficiency of 100% in the arithmetic elements in most cases.
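The folding of the $K_8$ normalization into the quantizer, mentioned in the passage above, can be illustrated with a minimal sketch. The matrix `K` below is only a stand-in built from orthonormal DCT scale factors, not the $K_8$ of Eq. (12), and the function names and the table `q` are hypothetical.

```python
import math

N = 8

# Stand-in normalization matrix (assumption): outer product of the
# orthonormal DCT scale factors. The paper's K_8 from Eq. (12) may differ.
k = [math.sqrt(1.0 / N)] + [math.sqrt(2.0 / N)] * (N - 1)
K = [[k[u] * k[v] for v in range(N)] for u in range(N)]


def scale_with_multiplier(y_raw, q):
    """Normalize each coefficient with K, then divide by the quantizer step."""
    return [[(y_raw[u][v] * K[u][v]) / q[u][v] for v in range(N)]
            for u in range(N)]


def fold_normalization(q):
    """Absorb K into the quantizer table: q'[u][v] = q[u][v] / K[u][v]."""
    return [[q[u][v] / K[u][v] for v in range(N)] for u in range(N)]


def scale_folded(y_raw, q_folded):
    """Divide the unnormalized coefficients by the folded table only."""
    return [[y_raw[u][v] / q_folded[u][v] for v in range(N)]
            for u in range(N)]
```

Both paths give the same value before the final rounding to a quantizer level, so the separate normalization multiplier is not needed once the table is precomputed.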
8×8 2-D DCT/IDCT
The 2-D 8×8 DCT/IDCT is implemented by the row-column decomposition technique according to Eq. (11) for the DCT and Eq. (13) for the IDCT. Figure 5 shows the block diagram of the proposed architecture, composed of two 1-D processors ($J_{R8}/J_{R8}^{t}$), a transpose buffer (TB) for storing the intermediate data, one down-sampling (D-S) unit and one up-sampling (U-S) unit, and a multiplier that performs the normalization of the transform according to the matrix $K_8$ described in Eq. (12). One of the main characteristics of this architecture is that it operates at half the frequency of the input data rate ($f_s/2$), except for the normalization with $K_8$, which is performed at the output for the forward DCT and at the input for the IDCT. This multi-rate operation involves the D-S and U-S modules: the D-S module multiplexes the input data onto parallel paths at frequency $f_s/2$, while the U-S module performs the opposite process.
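The row-column technique can be sketched in a few lines. The sketch below uses the plain orthonormal 8-point DCT evaluated directly, not the factorized $J_{R8}$ kernel with the separate $K_8$ normalization, and all function names are illustrative.

```python
import math

N = 8


def dct_1d(v):
    """Orthonormal 8-point DCT-II, evaluated directly (no fast algorithm)."""
    out = []
    for k in range(N):
        c = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        out.append(c * sum(v[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                           for n in range(N)))
    return out


def transpose(block):
    return [list(col) for col in zip(*block)]


def dct_2d_row_column(block):
    """1-D pass on the rows, transposition, second 1-D pass, transpose back."""
    first = [dct_1d(row) for row in block]       # first 1-D processor
    buffered = transpose(first)                  # role of the transpose buffer
    second = [dct_1d(row) for row in buffered]   # second 1-D processor
    return transpose(second)


def dct_2d_direct(block):
    """Direct evaluation of the 2-D definition, used only as a reference."""
    def c(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    def basis(n, k):
        return math.cos((2 * n + 1) * k * math.pi / (2 * N))

    return [[c(k) * c(l) * sum(block[n][m] * basis(n, k) * basis(m, l)
                               for n in range(N) for m in range(N))
             for l in range(N)] for k in range(N)]
```

The two functions agree to floating-point precision, which is the separability property that lets the hardware reuse 1-D processors.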
Architecture of TB
The 8×8 intermediate data generated by the first 1-D DCT processor have to be stored and transposed in the TB before the second 1-D DCT is performed. The TB allows simultaneous read and write operations between the two processors while performing the matrix transposition. To achieve this, the data are read out of the memory column-wise if the previous intermediate data were written into the memory row-wise, and vice versa.
The TB based on D-type flip-flops has been found to be well suited to pipeline architectures, unlike other proposed architectures based on RAM memories. The schematic of this circuit is shown in Fig. 6a. It has a regular serial input/output structure made up of eight 16-bit shift registers and multiplexers. The control signals are: R/C, which selects the input of the shift registers to store data row-wise (R/C = 0) or column-wise (R/C = 1); $W_j$ (j = 1 to 8), which selects the jth shift register for a write operation; and $R_k$ (k = 1 to 3), which selects the jth shift register specified by $W_j$ to read out the data. Write and read operations are performed simultaneously: the TB takes two serial input streams, {$I_3 I_2 I_1 I_0$} and {$I_7 I_6 I_5 I_4$}, and generates two serial output streams in parallel, {$O_3 O_2 O_1 O_0$} and {$O_7 O_6 O_5 O_4$}. The control signals allow the data to be stored alternately row-wise and column-wise in order to transpose the data without losing any. For the sake of clarity, Fig. 6b and c show the configuration of the shift registers and the arrangement of the stored input data when writing column-wise and row-wise, respectively. Both storage modes are easy to configure from the state of the TB control signals. Figure 6d shows the timing diagram of these signals as a function of the number of Clk2 clock cycles. First, the input data are stored column-wise (in Fig. 6b, they are filled downward) to complete each of the vertical halves into which the memory is divided. To do this, $W_j$ addresses each of the shift registers in sequence, so that only four Clk2 cycles are required to store a whole row. At the same time, the previously stored data are read out column-wise at the outputs. To do this, the writing process, performed using $W_j$, and the reading process, performed using $R_k$, are synchronized. In the next block, the input data are stored row-wise (in Fig. 6c, they are filled rightward) and the outputs now read out the previously stored data row-wise. In this case, the data are written sequentially into each of the horizontal halves into which the memory is divided, repeating eight times the sequence specified in Fig. 6d. As a result, the outputs are continuously transposed in parallel, and 32 Clk2 cycles are required to fill up all of the memory.
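The alternating row-wise/column-wise storage can be illustrated with a small behavioural model. It is a sketch under simplifying assumptions: it handles one 8-element vector per step and ignores the two serial ports, the shift registers and the R/C, $W_j$ and $R_k$ signals. The class and method names are hypothetical.

```python
class TransposeBuffer:
    """Single 8x8 store that is read and overwritten in the same direction,
    with the direction flipped after every block."""

    def __init__(self, n=8):
        self.n = n
        self.mem = [[0] * n for _ in range(n)]
        self.column_wise = True  # direction used for the next block

    def process_block(self, vectors):
        """Write n vectors of a new block; return the n vectors read out."""
        n = self.n
        out = []
        for i, vec in enumerate(vectors):
            if self.column_wise:
                # Read the old column i, then overwrite it with the new vector.
                out.append([self.mem[r][i] for r in range(n)])
                for r in range(n):
                    self.mem[r][i] = vec[r]
            else:
                # Read the old row i, then overwrite it with the new vector.
                out.append(list(self.mem[i]))
                self.mem[i] = list(vec)
        self.column_wise = not self.column_wise
        return out
```

Because each slot is read just before it is overwritten, one 8×8 store is enough: every call returns the transpose of the block written in the previous call.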
Pipeline Scheme
The DCT/IDCT processor uses a pipelining scheme to shorten the cycle time and perform real-time processing for applications with high pixel rates. The pipeline registers are inserted in the critical path to improve the operation speed with minimal overhead. 
Accuracy Specifications
The accuracy of the DCT computation is an important characteristic of this hardware. The DCT kernel components are real numbers, so truncation or rounding errors are inevitably introduced during computation. There are two inherent error sources in a DCT implementation: (1) finite internal wordlength and (2) coefficient quantization. The H.261, H.263 and H.263+ standards, defined for videoconferencing applications, establish accuracy specifications for the computation of the IDCT. Fulfilling these specifications ensures compatibility between different implementations of the IDCT [28]. This standard has been used to define the accuracy of the data-path and the wordlength of the coefficients of the processor. Table 1 summarizes the results of simulations carried out with MATLAB according to the procedure described in [23]. This specification allows the errors caused by finite wordlength in the IDCT to be evaluated. Thus, the IDCT architecture must be excited with 10,000 8×8 blocks of random numbers in the ranges 
Arithmetic Circuits
The arithmetic circuits limit the speed of the processor. They have been carefully selected to find the best compromise between area, throughput and latency. The $J_{R8}/J_{R8}^{t}$ processor uses a total of six 20-bit carry incrementer adders and three hardwired multipliers. In the design of these circuits, fast architectures with minimal overhead, radix representation and pipeline stages have been used to provide balanced critical paths. However, this does not demand a great design effort, since these circuits operate at half the frequency of the input data rate. A fine-grain pipeline architecture is required only for the $K_8$ normalization multiplier, since it operates at the input data rate. In this section, the main arithmetic circuits are described.
Table 1 abbreviations: MSE, overall mean square error; ME, overall mean error; PMSE, peak mean square error; PME, peak mean error.
Carry Incrementer Adder
The carry incrementer adder (CIA) has been chosen because it has an asymptotic performance of $O(n)$ area and $O(\sqrt{n})$ time, and provides a compromise between a ripple-carry adder (RCA) and a carry look-ahead adder. It has a short critical path at the expense of a small increase in area in comparison with the RCA. The CIA is made up of an RCA divided into blocks plus some additional selection logic, and it is a modification of the adder presented in [29]. Figure 8 shows the 20-bit CIA built from five blocks of different lengths {6, 5, 4, 3, 2}. For the sake of clarity, a 4-bit block is shown in detail, its output being obtained from the following equations:
where $S'_i$ is the RCA output and $C_0$ the input carry for this block. The output carry, which forms the input carry for the next block, is defined as:
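As an illustration, the usual carry-increment relations can be modelled bit by bit: each block adds with a zero carry-in and is then incremented by the incoming carry, so $S_i = S'_i \oplus (C_0 \cdot S'_0 \cdots S'_{i-1})$ and the block carry-out is the RCA carry OR-ed with $C_0 \cdot S'_0 \cdots S'_{m-1}$. These are the textbook equations, assumed here to match the ones referred to above; the LSB-first block order (2, 3, 4, 5, 6) is also an assumption.

```python
def ripple_add(a_bits, b_bits):
    """Ripple-carry add of two LSB-first bit lists with carry-in 0."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out, carry


def carry_increment_add(a, b, width=20, blocks=(2, 3, 4, 5, 6)):
    """Add two unsigned integers block by block; return (sum, carry_out)."""
    assert sum(blocks) == width
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    result, carry_in, pos = 0, 0, 0
    for size in blocks:
        # Block sum S' computed without knowing the incoming carry.
        s_prime, c_prime = ripple_add(a_bits[pos:pos + size],
                                      b_bits[pos:pos + size])
        # Incrementer: bit i flips when C_0 and all lower S' bits are 1.
        propagate = carry_in
        for i, s in enumerate(s_prime):
            result |= (s ^ propagate) << (pos + i)
            propagate &= s
        # Carry into the next block.
        carry_in = c_prime | propagate
        pos += size
    return result, carry_in
```

For example, `carry_increment_add(0xFFFFF, 1)` wraps to a zero sum with a carry-out of 1.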
Hardwired Multipliers
The concept of hardwired multiplication with a binary signed-digit representation of the fixed coefficients has been adopted to reduce the hardware complexity of multiplication through a carry-save adder scheme. The multiplication by fixed coefficients is computed in three types of configurable multipliers, which perform the following arithmetic operations: $P = d_i \cdot \{T_1 \text{ or } T_5\} + d_j$, $P = d_i \cdot \{1 \text{ or } T_2\} + d_j$ and $P = d_i \cdot C_4$, where $d_i$ and $d_j$ are input data. These multipliers, whose general structure is shown in Fig. 9, are made up of a carry-save adder tree based on 4:2 and 5:3 compressors and configured according to the type of coefficient, and a final adder with rounding correction. To limit the critical path, pipeline stages and a radix-8 encoding are used for $T_1$ and $T_5$. The fixed coefficients are expressed in the carry-save structure as:
For example, $P = x \cdot T_1 + y = (3x) \cdot 2^{\ldots} + y$. The term $3x$ is precomputed with shift-and-add operations, $2 \cdot x + x$, in a CIA.
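The idea of hardwired multiplication by a fixed coefficient can be sketched with a binary signed-digit recoding. The example uses $C_4 = \cos(\pi/4)$ quantized to 12 fractional bits and the non-adjacent form as the signed-digit representation. The wordlength, the recoding and the function names are assumptions for illustration, not the digit patterns of the actual multipliers, and the radix-8 encoding is not modelled.

```python
import math


def signed_digits(value):
    """Non-adjacent form of a non-negative integer: digits in {-1, 0, 1},
    least significant digit first."""
    digits = []
    while value:
        if value & 1:
            d = 2 - (value & 3)  # +1 if value % 4 == 1, -1 if value % 4 == 3
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits


def hardwired_multiply(x, digits, frac_bits):
    """Multiply integer x by the fixed coefficient using shifts and adds."""
    acc = 0
    for shift, d in enumerate(digits):
        if d:
            acc += d * (x << shift)  # d is +1 or -1: an add or a subtract
    return acc >> frac_bits          # discard the fractional bits (floor)


FRAC_BITS = 12
C4_FIXED = round(math.cos(math.pi / 4) * (1 << FRAC_BITS))
C4_DIGITS = signed_digits(C4_FIXED)
```

Each non-zero digit costs one adder input, so a recoding with fewer non-zero digits directly reduces the size of the adder tree.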
These multipliers generate an output wider than the data-path of the processor. The least significant bits must therefore be discarded to fit the multiplier output to the 20-bit data-path. However, it has been verified through simulation with MATLAB that if a rounding correction rather than a truncation is performed in the final adder, the data-path width required to meet the IEEE Std. 1180-1990 requirements is reduced from 22 bits to 20 bits. This result is important because it leads to a significant saving in the total area of the processor.
The final adder with rounding correction is made up of a carry generator (CG) with a high-speed binary carry look-ahead structure [30], and a 20-bit double carry incrementer adder (DCIA). The DCIA must compute the input carries, $C_1$ and $C_2$, which indicate the increment to be applied: +0 for $C_1 = C_2 = 0$, +1 for $C_1 = 1$ and $C_2 = 0$, and +2 for $C_1 = 0$ and $C_2 = 1$. $C_1$ and $C_2$ can easily be generated from the following expressions:
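The benefit of rounding over truncation can be seen in a small generic experiment on the bias introduced when low-order bits are dropped. It does not reproduce the MATLAB simulation or the 22-bit to 20-bit result; the function names and the 4-bit dropped field are illustrative.

```python
def truncate(value, drop):
    """Discard the `drop` least significant bits (floor)."""
    return value >> drop


def round_half_up(value, drop):
    """Add half an output LSB before discarding the low bits."""
    return (value + (1 << (drop - 1))) >> drop


def mean_error(reduce, drop):
    """Mean of (reduced - exact), in output LSBs, over every possible
    pattern of the dropped bits."""
    total = 0.0
    for low in range(1 << drop):
        exact = low / (1 << drop)
        total += reduce(low, drop) - exact
    return total / (1 << drop)
```

With four dropped bits, truncation shows a mean error of about -0.47 LSB, while rounding brings it to about +0.03 LSB, which is why fewer guard bits are needed for the same accuracy.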
where $C_{n-1}$ is the carry generated in the CG.
and the output carries are defined as:
Normalization Multiplier
The Hadamard product for the normalization described by Eqs. (11) and (13) is performed in a finely pipelined Booth multiplier. The multiplier is 
Implementation and Comparisons
A prototype of an 8×8 2-D DCT processor chip has been designed using standard cells in a semi-custom methodology. It uses 9-bit input data and 12-bit output data for the DCT, and 12-bit input and 9-bit output data for the IDCT. The processor was implemented in a 0.35 µm CMOS CSD 3M/2P 3.3 V technology from austriamicrosystems (http://www.asic.austriamicrosystems.com).
The chip has an area of 2.5 × 2.5 ≈ 6.25 mm² (the core is 1.75 × 1.75 ≈ 3.06 mm²). It contains a total of 11.7 k gates, of which 5.8 k are flip-flops and 826 are FA/HA cells. Table 2 shows the hardware cost in terms of number of gates for the different blocks of this processor. More details about the chip implementation can be found in [22]. A maximum operating frequency of about 300 MHz has been established. The latency is 172 Clk1 cycles for the 2-D DCT and 178 Clk1 cycles for the 2-D IDCT. The computing time of a block is close to 580 ns.
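As a quick arithmetic check, the cycle counts can be converted to time at the 300 MHz Clk1 rate. Reading the quoted block computing time of about 580 ns as the latency figures is an interpretation made here, and the helper name is hypothetical.

```python
def cycles_to_ns(cycles, freq_mhz):
    """Duration in nanoseconds of `cycles` clock periods at `freq_mhz` MHz."""
    return cycles * 1000.0 / freq_mhz


dct_latency_ns = cycles_to_ns(172, 300.0)    # about 573 ns
idct_latency_ns = cycles_to_ns(178, 300.0)   # about 593 ns
block_input_ns = cycles_to_ns(64, 300.0)     # time to stream 64 samples
```

Both latencies bracket the quoted 580 ns, while streaming the 64 samples of one block at the input rate takes roughly 213 ns.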
In the literature, there are many implementation styles for the DCT and IDCT. For purposes of comparison, Table 3 lists the features of the proposed processor and of other DCT implementations selected from among those that fulfil the specifications of the IEEE standard. The table shows the following parameters: year of publication; function, indicating whether the design implements both the forward DCT and the IDCT or only the IDCT; technology (all are CMOS) and design methodology (FC for full-custom, SC for semi-custom, GA for gate array); area in mm² of the die or of the core; complexity in terms of number of gates or transistors (some designs include additional RAM); frequency and latency; and, finally, some basic specifications of the architecture. Three parameters have been taken into account in the specification of the architecture [31]:
- Implementation based on the row-column decomposition method or on the direct method. In the first case, the separability of the 2-D DCT is used to split its computation into two sequential 1-D DCTs (RC) with a transpose memory, or into a single 1-D DCT processor (MUXRC) that performs both operations. In the second case, the direct formula of the 2-D DCT is used.
or on distributed arithmetic (DA).
Compared with the existing designs listed in Table 3, the proposed processor is clearly superior in terms of speed, even against processors that use a more advanced technology [11, 15]. The parallel-pipeline architecture, with arithmetic units operating at half the frequency, gives an input data rate of 300 MHz, far higher than that of the fastest processor listed in the table [11]. This speed does not imply any additional cost in terms of the number of gates, which is similar to that of the other designs with efficient hardware complexity [5, 6, 8, 9, 13, 20]. Of these, [9, 13, 20] use the MUXRC approach to reduce hardware, [5, 9] implement only the IDCT, and others present the same complexity in number of gates but require additional RAM [6]. In this respect, particular note should be taken of the highly area-efficient processor described in [7], which combines a software-oriented controller with a hardware unit. Unlike the rest of the processors, it is not a purely hardware-oriented approach.
Conclusions
This paper describes the architecture of an 8×8 2-D DCT/IDCT processor chip with high throughput, reduced hardware, a parallel and pipelined architecture operating at half the frequency of the input data rate, and maximum efficiency in all arithmetic elements. The processor is clearly superior in terms of speed, without increased hardware complexity, to other processors that also meet the demands of IEEE Std. 1180-1990. This good performance in computing speed and hardware cost indicates that the proposed design is suitable for HDTV applications.
One advantage of the proposed architecture is that the $K_8$ normalization can be incorporated into the decoding process of an adaptive transform coding system, so that the quantization process at the receiver or at the transmitter includes this normalization. The $K_8$ multiplier can then be completely removed from the 2-D DCT processor, reducing area by 13% and latency by 10%.
