Abstract-Complete fixed-point error models that include the coefficient quantization are derived for two popular 8 × 8 two-dimensional (2-D) IDCT architectures: one is based on distributed arithmetic, and the other on a multiplier-adder chain. The error models are evaluated in the integer domain to accurately measure the effects of rounding. The analysis results show that the overall mean-square error (OMSE) is the most critical condition for meeting the IEEE specification (IEEE Std. 1180-1990) when the rounding scheme is employed. On the other hand, the mean error effects (OME and PME) are dominant for truncation. Finally, the analysis results are compared with those of bit-accurate simulation.
I. INTRODUCTION
THE two-dimensional (2-D) discrete cosine transform has been widely used in various image and video processing standards, such as JPEG, H.261 for videotelephony, MPEG, and HDTV. Efficient implementation of the transform requires fixed-point arithmetic, which may result in a noticeable mismatch between the encoder and the decoder. In particular, this problem is magnified when the inverse discrete cosine transform (IDCT) is used in a reconstruction loop for motion compensation, because the quantization error accumulates. To address this problem, IEEE Std. 1180-1990 [1] specifies the fixed-point performance of the IDCT for use in visual telephony and similar applications. The standard requires that the peak error (PPE), the peak mean-square error (PMSE), the overall mean-square error (OMSE), the peak mean error (PME), and the overall mean error (OME) not exceed certain values, and that an all-zero input produce an all-zero output. The test bed for measuring the accuracy of a proposed IDCT is shown in Fig. 1. The "reference" IDCT output is generated by double-precision floating-point arithmetic, while the "test" output is the result of fixed-point arithmetic. Random integers of nine bits are used for the input. Details of the test procedure are described in [1].
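The five accuracy measures named above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of [1]: the function name `ieee1180_metrics`, the nested-list block layout, and the helper `meets_limits` are assumptions made here, and the numeric limits in `LIMITS` are the values commonly quoted for IEEE Std. 1180-1990, which should be confirmed against the standard itself.

```python
from typing import Dict, Sequence

Block = Sequence[Sequence[int]]  # one 8x8 block of rounded integer pixel values

# Limits as commonly quoted for IEEE Std. 1180-1990 (assumption; verify in [1]).
LIMITS = {"PPE": 1, "PMSE": 0.06, "OMSE": 0.02, "PME": 0.015, "OME": 0.0015}


def ieee1180_metrics(ref_blocks: Sequence[Block],
                     test_blocks: Sequence[Block]) -> Dict[str, float]:
    """Compute PPE, PMSE, OMSE, PME, and OME from reference and test outputs."""
    n = len(ref_blocks)
    rows, cols = len(ref_blocks[0]), len(ref_blocks[0][0])
    err_sum = [[0] * cols for _ in range(rows)]  # per-pixel sum of errors
    sq_sum = [[0] * cols for _ in range(rows)]   # per-pixel sum of squared errors
    ppe = 0
    for ref, tst in zip(ref_blocks, test_blocks):
        for i in range(rows):
            for j in range(cols):
                e = tst[i][j] - ref[i][j]
                err_sum[i][j] += e
                sq_sum[i][j] += e * e
                ppe = max(ppe, abs(e))
    pixels = rows * cols
    return {
        "PPE": ppe,                                            # largest |error|
        "PMSE": max(max(r) for r in sq_sum) / n,               # worst pixel MSE
        "OMSE": sum(map(sum, sq_sum)) / (n * pixels),          # MSE over all pixels
        "PME": max(abs(v) for r in err_sum for v in r) / n,    # worst pixel mean
        "OME": abs(sum(map(sum, err_sum))) / (n * pixels),     # mean over all pixels
    }


def meets_limits(metrics: Dict[str, float]) -> bool:
    """True when every measure is within its (assumed) limit."""
    return all(metrics[name] <= limit for name, limit in LIMITS.items())
```

The "peak" measures take the worst case over the 64 pixel positions, while the "overall" measures average over all positions and all blocks, which is why a single biased pixel can violate PME long before it affects OME.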
There have been a few studies on the fixed-point error modeling of several fast DCT/IDCT algorithms [2], [3]. However, those models are not directly applicable to the word length optimization of actual hardware, for the following reasons. First, the previous works were conducted at the algorithm level, whereas the quantization effects depend strongly on the implementation architecture. Second, the fixed-point error models are not complete; for example, those studies did not consider the quantization effects of coefficients. Finally, the IEEE Standard specifications are described in terms of rounded values instead of the original unquantized error signals. This means that not only the mean and the variance of the error are important, but its distribution as well. In this paper, complete fixed-point error models are derived for two of the most popular architectures of the 2-D IDCT. Then, we evaluate the integer domain fixed-point error and determine the cost-optimum word lengths conforming to the IEEE Standard specification. The analytical results are also verified experimentally with the aid of the fixed-point optimization utility that was developed by the authors [4].
Although a few fast 2-D IDCT algorithms have been proposed, the row-column decomposition technique is preferred for VLSI implementations due to its numerical characteristics and structural regularity. In order to reduce the number of arithmetic operations without sacrificing the regularity, the one-step decomposed Chen's algorithm [5] has been widely employed. For the matrix-vector product operator, the distributed arithmetic (DA)-based and the multiplier-adder-based architectures are usually considered. Although some implementations using systolic arrays have been reported recently [6], [7], most actual VLSI implementations of the 8 × 8 IDCT have been based on the DA or the multiplier-adder-based architecture, as shown in the survey by Pirsch et al. [8].
This paper is organized as follows. A technique for analyzing the fixed-point error in the integer domain is explained in Section II. In Section III, the fixed-point error model and the optimum word lengths of a DA-based 8 × 8 2-D IDCT architecture are discussed. Section IV presents the error model of a multiplier-adder-chain-based architecture and the optimized internal word lengths. Concluding remarks are given in Section V.
II. INTEGER DOMAIN FIXED-POINT ERROR ANALYSIS
The IEEE specifications are based on integer domain quantization errors that are measured after rounding the output of the fixed-point implementation as illustrated in Fig. 1 . In order to analyze the fixed-point error in the integer domain, it is necessary to redefine the specifications in a stochastic manner.
Consider a general additive noise model

$$\tilde{x} = x + \epsilon \qquad (1)$$

where $x$, $\tilde{x}$, and $\epsilon$ are all random variables representing a floating-point result, a fixed-point result, and the fixed-point error, respectively. We also assume that $x$ and $\epsilon$ are independent of each other. Then the integer domain fixed-point error can be defined as follows:

$$\epsilon_I = [\tilde{x}] - [x] \qquad (2)$$
Note that refers to the rounded value of , i.e., the largest which is smaller than or equal to , where is an integer and is the quantization step size. The probability that the integer domain fixed-point error is an integer , can be shown to be [9] ( 3) where is the probability density function (pdf) of and (4) Now, we can reformulate the IEEE criteria in the integer domain. For example, the OMSE and the PME criteria can be computed as follows:
where represents the integer fixed-point error at pixel location , and is defined as (7) Note that and are the lower and the upper bounds of the fixed-point error, respectively, and is the corresponding probability density function of . The "amax" operator selects the element whose absolute value is the maximum. All other criteria, such as PPE, PMSE, and OME, can be defined in the same fashion [10] .
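The integer domain error of this section can also be estimated by simulation, which makes the role of the error distribution concrete. The sketch below is an illustration under stated assumptions, not the closed-form evaluation of [9]: rounding is taken to be round-half-up with unit step, the floating-point result and the fixed-point error are both drawn from Gaussian distributions, and the names `round_half_up`, `integer_domain_error`, `empirical_pmf`, and `moments` are hypothetical.

```python
import math
import random
from collections import Counter
from typing import Dict, Tuple


def round_half_up(v: float) -> int:
    """Round to the nearest integer, ties upward (assumed rounding rule)."""
    return math.floor(v + 0.5)


def integer_domain_error(x: float, e: float) -> int:
    """Rounded fixed-point result minus rounded floating-point result."""
    return round_half_up(x + e) - round_half_up(x)


def empirical_pmf(sigma_x: float, mean_e: float, sigma_e: float,
                  trials: int = 100_000, seed: int = 0) -> Dict[int, float]:
    """Monte Carlo estimate of P(integer domain error = k).

    x and e are drawn independently, matching the independence assumption
    of the additive noise model; the Gaussian shapes are assumptions.
    """
    rng = random.Random(seed)
    counts = Counter(
        integer_domain_error(rng.gauss(0.0, sigma_x), rng.gauss(mean_e, sigma_e))
        for _ in range(trials)
    )
    return {k: c / trials for k, c in counts.items()}


def moments(pmf: Dict[int, float]) -> Tuple[float, float]:
    """Mean and mean-square value of the integer domain error."""
    mean = sum(k * p for k, p in pmf.items())
    mean_square = sum(k * k * p for k, p in pmf.items())
    return mean, mean_square
```

For example, `moments(empirical_pmf(50.0, 0.0, 0.1))` returns the simulated mean and mean-square integer domain error for a zero-mean error of standard deviation 0.1. For small errors the integer domain mean-square error is governed by how often the perturbed value crosses a rounding boundary rather than by the variance of the real-valued error alone, which is why the mean and variance by themselves are not sufficient.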
III. OPTIMIZATION OF A DA-BASED ARCHITECTURE
Distributed arithmetic is one of the most popular VLSI implementation methods for computing a matrix-vector product because multiplications are not needed, and as a result, the hardware cost can be greatly reduced. An architecture for computing the transformation by employing the distributed arithmetic is shown in Fig. 2. As shown in the figure, there are three quantization error sources: the coefficients for the first and the second transforms, and the output of the limiter for the first transform, which can be assumed to be independent of each other. Note that the limiter at the output of the second transform is modeled separately by considering the probability of the integer domain output error after rounding. The word lengths of the input and output signals are specified in the IEEE Standard as 12 and 9 bits, respectively.
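The principle of distributed arithmetic can be sketched as follows. This is a generic bit-serial formulation for two's complement inputs, not the specific hardware of Fig. 2; the function names `build_rom` and `da_inner_product` are hypothetical, and the ROM here stores unquantized partial sums, so the coefficient quantization error discussed in this section is not modeled.

```python
from typing import List, Sequence


def build_rom(coeffs: Sequence[float]) -> List[float]:
    """Precompute every partial sum of the coefficients.

    Address bit k selects whether coeffs[k] is included in the sum.
    """
    n = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]


def da_inner_product(coeffs: Sequence[float], xs: Sequence[int],
                     width: int) -> float:
    """Inner product sum(coeffs[k] * xs[k]) without any multiplier.

    xs are integers representable in `width`-bit two's complement.
    Each step looks up one partial sum addressed by one bit of every input.
    """
    rom = build_rom(coeffs)
    mask = (1 << width) - 1
    words = [x & mask for x in xs]          # two's complement bit patterns
    acc = 0.0
    for b in range(width):
        addr = 0
        for k, w in enumerate(words):
            addr |= ((w >> b) & 1) << k     # gather bit b of every input
        term = rom[addr] * (1 << b)         # a shift in hardware
        if b == width - 1:
            acc -= term                     # the sign bit carries negative weight
        else:
            acc += term
    return acc
```

With `coeffs = [0.5, -0.25, 1.0, 2.0]`, `xs = [3, -2, 7, -8]`, and `width = 4`, the routine reproduces the direct inner product using only ROM lookups, shifts, and additions. Because the ROM holds sums of coefficients, its word format is determined by the largest partial sum rather than by the largest single coefficient.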
A. Fixed-Point Error Model
The 1-D IDCT matrix can be decomposed as follows [9] : (8) where and simply shuffle the data, and performs a butterfly operation.
is a block diagonal matrix, and its two 4 × 4 matrices can be obtained by decomposing the 8 × 8 IDCT coefficient matrix according to Chen's method.
In order to construct DA hardware, the partial sums of coefficients should be computed in advance. It can be easily shown that the maximum value of the partial sums is 2.7208. This means that at least two integer bits are needed for the representation of the coefficient ROM [4]. Since this format can represent all numbers less than 4, the upper 0.56 bit ($\log_2(4/2.7208) \approx 0.56$) is wasted. Thus, by scaling up the coefficients by as much as $\sqrt{2}$, we can reduce the waste of the integer bits in the coefficient ROM. The scaling effect can easily be compensated in the last stage by a 1-bit right shift because the overall effect on the 2-D transform is a magnification by 2. Now, let us introduce a scaled IDCT matrix which is defined as (9) where . Then, by the row-column decomposition, the 2-D IDCT matrix becomes (10) Note that the scale factor of 1/8 corresponds to just a 3-bit right shift.
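The figures quoted in this paragraph can be reproduced numerically. The sketch below assumes the usual even/odd 4 × 4 matrices of the 8-point IDCT with entries cos(kπ/16) and with rows indexed by output sample; this layout, the names `EVEN`, `ODD`, and `max_partial_sum`, and the per-pass scale factor of √2 (chosen because it is consistent with the stated overall 2-D magnification by 2) are assumptions rather than details taken from the text.

```python
import math
from itertools import combinations
from typing import List


def c(k: int) -> float:
    """Shorthand for cos(k * pi / 16)."""
    return math.cos(k * math.pi / 16)


# Assumed even and odd parts of the 8-point IDCT (one row per output sample).
EVEN = [
    [c(4), c(2), c(4), c(6)],
    [c(4), c(6), -c(4), -c(2)],
    [c(4), -c(6), -c(4), c(2)],
    [c(4), -c(2), c(4), -c(6)],
]
ODD = [
    [c(1), c(3), c(5), c(7)],
    [c(3), -c(7), -c(1), -c(5)],
    [c(5), -c(1), c(7), c(3)],
    [c(7), -c(5), c(3), -c(1)],
]


def max_partial_sum(matrix: List[List[float]]) -> float:
    """Largest magnitude over all non-empty subset sums of each row.

    These subset sums are exactly the words stored in a DA coefficient ROM.
    """
    best = 0.0
    for row in matrix:
        for size in range(1, len(row) + 1):
            for subset in combinations(row, size):
                best = max(best, abs(sum(subset)))
    return best


peak = max(max_partial_sum(EVEN), max_partial_sum(ODD))
wasted_bits = math.log2(4 / peak)      # headroom left in a 2-integer-bit format
scaled_peak = math.sqrt(2) * peak      # peak after the assumed sqrt(2) scaling
print(round(peak, 4), round(wasted_bits, 2), scaled_peak < 4)
```

Under these assumptions the largest partial sum comes from the first row of the even part, cos(π/4) + cos(π/8) + cos(π/4) + cos(3π/8), and the scaled peak still fits in two integer bits.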
By elaborating the equation for the 2-D IDCT, we can obtain the following : (11) where and , , and . Note that , and describe the fixed-point errors caused by , and , respectively. and denote the fixedpoint error occurring in the DA hardware of the rowwise and the columnwise transforms, respectively. The th element of , , is defined as 3 (12) where is the quantization error of the DA coefficient, and denotes the discrete delta function, i.e., and if . and represent the word length and the integer word length of the input data to the rowwise transform, respectively, and is a discrete random variable that depends on the index and the input data. Similarly, the th element of , , is defined as (13) where is the rounding noise of the DA coefficient in the columnwise transform, and and denote the word length and the integer word length of the input data, respectively. represents the rounding error generated at the limiter in front of the transpose unit.
From the definitions of 's, it can be shown that and are linear combinations of and , respectively, which are independent of each other. Also, is a weighted sum of , which are independent. Therefore, according to the well-known central limit theorem, we can approximate the probability density functions of , and to Gaussian distributions. The means and variances of 's will be presented in the following section.
B. Word Length Determination Conforming to the IEEE Specifications
In order to develop the mean and the variance matrices of 's, consider the following theorems. Let us assume that for a matrix whose components are random variables denotes a matrix whose components are the variances of 's.

1 In this paper, ã represents a result by fixed-point arithmetic while a indicates that of floating-point arithmetic.
2 Greek letters denote quantization error signals.
3 Note that capital letters denote matrices, and small characters represent their elements. For example, k ij is the ijth element of 0 k .
Theorem 1: Let be a constant matrix, and a matrix whose components are independent random variables. Then the variance matrix of is (14) where . is the Schur product of and , where the th component is defined as .

Theorem 2: Let and be constant matrices, and a matrix whose components are independent random variables. Then the variance matrix of is (15)

The proofs of the theorems are given in the Appendix. By applying Theorems 1 and 2 for the definitions of 's, the mean and the variance matrices can be obtained as follows: (16)

Now, consider the distribution of , which is the restored image by floating-point arithmetic. Since the IEEE Standard specifies that the transformed image is rounded to 12-bit integers before the inverse transform, the rounded image can be expressed as follows: (17) where denotes the rounding error. We can assume that is independent and identically distributed (i.i.d.) with zero mean and the variance of . The image restored by can be written (18) where and are defined as and , respectively. Each element of is a linear combination of 's. According to the central limit theorem, we can also assume that the distribution of is Gaussian. Since the mean of is zero, that of is zero too. The variance is equal to that of , because the IDCT is a similarity transform, i.e., . Since is assumed to be uniformly distributed from to , the probability density function of the floating-point result for a given can be modeled as a sum of shifted probability density functions of . For notational convenience, let and be the random variables whose values are and , respectively. Then, elsewhere
where is a Gaussian distribution function with and as derived above. Now, we can evaluate the fixed-point error performance in terms of IEEE criteria by using the results developed in Section II. In order to carry out bit-accurate fixed-point simulation of given IDCT hardware, we take advantage of the fixed-point optimization utility [4] . The set of cost-optimum word lengths that requires the minimum hardware cost while satisfying the system performance can be determined by using the procedure proposed in [11] .
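The variance propagation used in this section can be checked numerically. The sketch below assumes that Theorems 1 and 2 take the standard form for linear combinations of independent random variables, namely that the variance matrix of A X B equals (A∘A) V (B∘B), where V holds the element variances of X and ∘ is the Schur (elementwise) product; Theorem 1 is then the special case where B is the identity. The symbols A, X, B, V and all function names are notation introduced here, not taken from the text.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Ordinary matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def schur(a: Matrix, b: Matrix) -> Matrix:
    """Schur (elementwise) product."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def variance_of_axb(a: Matrix, var_x: Matrix, b: Matrix) -> Matrix:
    """Predicted element variances of A X B for independent elements of X."""
    return matmul(matmul(schur(a, a), var_x), schur(b, b))


def sampled_variance_of_axb(a: Matrix, var_x: Matrix, b: Matrix,
                            trials: int = 20_000, seed: int = 1) -> Matrix:
    """Monte Carlo estimate of the same variances (zero-mean Gaussian X)."""
    rng = random.Random(seed)
    rows, cols = len(a), len(b[0])
    s1 = [[0.0] * cols for _ in range(rows)]
    s2 = [[0.0] * cols for _ in range(rows)]
    for _ in range(trials):
        x = [[rng.gauss(0.0, math.sqrt(v)) for v in row] for row in var_x]
        y = matmul(matmul(a, x), b)
        for i in range(rows):
            for j in range(cols):
                s1[i][j] += y[i][j]
                s2[i][j] += y[i][j] ** 2
    return [[s2[i][j] / trials - (s1[i][j] / trials) ** 2
             for j in range(cols)] for i in range(rows)]
```

The closed form replaces a simulation over many random matrices with two small matrix products, which is what makes an analytic search over word lengths practical.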
From both analytic and simulation results, it was found that the overall mean-square error effects are dominant. The OMSE and PMSE criteria for the first coefficient are compared in Fig. 3 , which shows that the OMSE condition requires at least 14 bits for the coefficients while the PMSE performance is met with 12 bits. Fig. 4 shows the OMSE criterion for the output of the limiter. The cost optimum word lengths appear in Table I . As shown in the table, analytic results are consistent with the experimental results. The numbers inside the parentheses show the word lengths of the previous implementation [12] . As for modeling the hardware cost, the cell libraries of VLSI Technologies, Inc. are used [13] .
IV. OPTIMIZATION OF A MULTIPLIER- AND ADDER-BASED ARCHITECTURE
The matrix-vector product in the IDCT can be implemented in a straightforward way by using multiplier and adder chains as shown in Fig. 5. There are five quantization error sources: the quantization of coefficients for the first and the second transforms, the word length reduction for the outputs of the first and the second multipliers, and the output of the limiter for the first transform, which are independent of each other. Note that the limiter at the output of the second transform is modeled separately by considering the probability of the output error after rounding.
A. Fixed-Point Error Model
Let us introduce a scaled transform matrix which is defined as (20) Similarly to Section III, we can obtain the fixed-point error model by elaborating (21), the equation for the 2-D IDCT. The 2-D transformed data using fixed-point arithmetic can be represented as follows: (22) where indicates the fixed-point error, and it can be written as and , ,
. represent the overall fixed-point error caused by , and , respectively. and denote the truncation error matrices occurring after the multipliers of the rowwise and the columnwise transforms, respectively. The quantization errors of transform coefficients are expressed as and , respectively. Finally, represents the rounding error at the limiter in front of the transpose unit.
B. Word Length Determination Conforming to the IEEE Specifications
Similarly to Section III-B, we can evaluate the integer domain error criteria. For example, the mean and the variance of , which is the error component caused by the quantization of coefficients, can be represented as follows: (23) It can be also shown that 's are linear combinations of , which are independent random numbers. Thus, we can approximate the probability density functions of , where , to Gaussian distributions with the corresponding variances and means according to the central limit theorem. Now, the integer domain error criteria can be evaluated using the probability density function of , which has been developed in Section III-B.
From both analytic and experimental results, it was found that the most crucial condition for and is the overall mean-square error OMSE. However, since the multiplier outputs are usually truncated to reduce the word length of the following adders, the means of and are not zero. And as a result, the peak mean error PME and the overall mean error OME play the key role for determining the minimum word length for Adder1 and Adder2, respectively. Although we can reduce the size of the adders by inserting rounding circuits after the multipliers, it may not be a more efficient solution. The OMSE criterion for the first coefficient and the PME criterion for the adder in the 1-D IDCT unit are compared in Fig. 6 . The cost optimum word lengths appear in Table II . As shown in the table, analytic results are quite consistent with the experimental results. The numbers inside the parentheses show the word lengths of the previous implementation [14] .
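The reason why the mean error criteria dominate under truncation can be seen from a small exhaustive experiment. The sketch below is illustrative only: it quantizes every value on a 12-fractional-bit grid in [-1, 1) down to 4 fractional bits, which are arbitrary word lengths chosen here and not those of Table II, and the function names are hypothetical. Truncation is modeled as a floor operation, as in two's complement hardware.

```python
import math
from typing import Callable


def truncate(v: float, frac_bits: int) -> float:
    """Drop the low-order bits (floor), as two's complement truncation does."""
    scale = 1 << frac_bits
    return math.floor(v * scale) / scale


def round_to(v: float, frac_bits: int) -> float:
    """Round to the nearest representable value, ties upward."""
    scale = 1 << frac_bits
    return math.floor(v * scale + 0.5) / scale


def mean_error(quantize: Callable[[float, int], float],
               in_bits: int, out_bits: int) -> float:
    """Average quantization error over every in_bits-grid value in [-1, 1)."""
    n = 1 << in_bits
    total = 0.0
    for k in range(-n, n):
        v = k / n
        total += quantize(v, out_bits) - v
    return total / (2 * n)


trunc_bias = mean_error(truncate, 12, 4)   # close to minus half an output LSB
round_bias = mean_error(round_to, 12, 4)   # close to zero
print(trunc_bias, round_bias)
```

Truncation leaves a bias of nearly half an output LSB, and this bias propagates through the adder chain to the output mean, whereas rounding leaves only a bias of half an input LSB. This is consistent with the observation that PME and OME, rather than OMSE, set the adder word lengths when the multiplier outputs are truncated.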
V. CONCLUDING REMARKS
The finite word length effects of 8 × 8 2-D IDCT algorithms were analyzed at the architectural level, and the optimum internal word lengths for the distributed arithmetic and the multiplier-adder-based architectures were determined so as to satisfy the IEEE specifications at the minimum hardware cost.
First, in order to analytically evaluate the IEEE specifications, which are defined in the integer domain in an ensemble sense, a simple method for analyzing the integer domain error has been presented. Also, the IEEE criteria have been reformulated in a stochastic sense. Second, complete fixed-point error models for both the distributed arithmetic and the multiplier-adder-chain-based 8 × 8 2-D IDCT architectures were derived. Finally, the optimum set of word lengths conforming to all of the IEEE-specified criteria, including PPE, PMSE, OMSE, PME, and OME, was determined using the analytical results. The analytical results were compared with those of the bit-accurate simulation. The hardware costs using these optimized word lengths are about 9.7% and 7.6% lower than those of the previous implementations. This study can be used for the VLSI implementation of the video-rate DCT and IDCT because the distributed arithmetic and the multiplier-adder-chain-based architectures are quite regular and adequate for high-throughput processing.
