This paper describes fixed-point design methodologies and several resulting implementations of the Inverse Discrete Cosine Transform (IDCT) contributed by the authors to MPEG's work on defining the new 8x8 fixed-point IDCT standard, ISO/IEC 23002-2. The algorithm currently specified in the Final Committee Draft (FCD) of this standard is also described herein.
INTRODUCTION
The Discrete Cosine Transform (DCT) 1 is a fundamental operation used by the vast majority of today's image and video compression standards, such as JPEG, MPEG-1, MPEG-2, MPEG-4 (P. This paper describes fixed-point design methodologies and several resulting IDCT implementations contributed by the authors to this MPEG project. The algorithm currently specified in the Final Committee Draft (FCD) of this standard is also described.
Our paper is organized as follows. In Section 2, we provide background information, including definitions of the DCT and IDCT, examples of their factorizations, and a review of some basic techniques used in their fixed-point implementations. There, we also explain several ideas that we have proposed for improving the performance of fixed-point designs: the introduction of floating factors between sub-transforms, the use of fast algorithms for computing products by groups of factors, and techniques for minimizing rounding errors in algorithms that use right shift operations. In Section 3, we show how these techniques were applied to design our proposed IDCT approximations. Finally, in Section 4, we provide a detailed description of the algorithm in the FCD of the ISO/IEC 23002-2 standard. Appendices A and B contain supplemental information and proofs of our claims.
BACKGROUND INFORMATION & MAIN IDEAS USED IN THIS WORK

Definitions
The order-8, one-dimensional (1D) type II 1 Discrete Cosine Transform (DCT), and its corresponding Inverse DCT (IDCT) are defined as follows:
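$$F_u = \frac{c_u}{2} \sum_{x=0}^{7} f_x \cos\frac{(2x+1)u\pi}{16}, \quad u = 0, \ldots, 7, \qquad (1)$$

$$f_x = \sum_{u=0}^{7} \frac{c_u}{2} F_u \cos\frac{(2x+1)u\pi}{16}, \quad x = 0, \ldots, 7, \qquad (2)$$

where $c_u = 1/\sqrt{2}$ when $u = 0$, and $c_u = 1$ otherwise.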
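The definitions for the two-dimensional (2D) versions of these transforms are:

$$F_{vu} = \frac{c_u c_v}{4} \sum_{y=0}^{7} \sum_{x=0}^{7} f_{yx} \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16}, \quad v, u = 0, \ldots, 7, \qquad (3)$$

$$f_{yx} = \sum_{v=0}^{7} \sum_{u=0}^{7} \frac{c_u c_v}{4} F_{vu} \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16}, \quad y, x = 0, \ldots, 7, \qquad (4)$$

where $f_{yx}$ ($y, x = 0, \ldots, 7$) are the sample values and $F_{vu}$ ($v, u = 0, \ldots, 7$) are the transform coefficients. Mathematically, these are linear, orthogonal, and separable transforms. Separability means that a 2D transform can be decomposed into a cascade of 1D transforms applied successively to all rows and then to all columns of the matrix. Implementors often exploit this property to reduce an entire 2D transform to a much simpler set of 1D operations.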
Precision Requirements for IDCT Implementations in MPEG and ITU-T Standards
As described previously, the specifications of MPEG, JPEG, and several ITU-T video coding standards do not require IDCTs to be implemented exactly as specified in (4). Rather, they require practical IDCT implementations to produce integer output values $\hat{f}_{yx}$ that fall within specified tolerances of the outputs of an ideal IDCT rounded to the nearest integer.
The precise specification of these error tolerances, and how they are to be measured for a given IDCT implementation under test, is defined by the MPEG IDCT precision standard, ISO/IEC 23002-1. 14 This standard includes several tests using a pseudo-random input generator originating from the former IEEE Standard 1180-1990, 9 as well as additional tests required by MPEG standards.
A summary of the error metrics defined by the IEEE 1180 | ISO/IEC 23002-1 specification is provided in Table 1. Here, the variable i = 1, ..., Q indicates the index of a pseudo-random input 8x8 matrix used in a test, and Q = 10000 (or, in some tests, Q = 100000) denotes the total number of sample matrices. The tolerance for each metric is provided in the last column of this table.
which implies that the $\max_{y,x} e_{yx}$, or peak mean square error (pmse), metric is the strongest one in this set.
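Among the additional (informative) tests provided in the ISO/IEC 23002-1/FPDAM1 specification, 15 the so-called "linearity test" * is of notable interest. This test requires the reconstructed pixel values $\hat{f}_{yx}$ produced by an IDCT implementation under test to be symmetric with respect to sign reversal of its input coefficients:

$$\hat{f}_{yx}(-F) = -\hat{f}_{yx}(F), \quad y, x = 0, \ldots, 7, \qquad (7)$$

where $\hat{f}_{yx}(F)$ denotes the output for the block of input coefficients $F = (F_{vu})$.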
This test was motivated by the observation that, when decoding static regions in consecutive video frames, the decoder reconstructs small (zero-mean, symmetrically distributed) differences that normally cancel each other over time (across the sequence of frames). If the IDCT implementation does not satisfy property (7), the mismatches between IDCT outputs may instead accumulate, eventually producing a noticeable degradation in the quality of the reconstructed video. 16, 17
DCT/IDCT Factorizations
Much of the original research on fast implementations of DCT transforms focused on finding DCT factorizations that minimize the number of multiplications by irrational factors. Many factorizations have been derived by utilizing other known fast algorithms, such as the classic Cooley-Tukey FFT algorithm, or by applying systematic approaches, such as decimation in time or decimation in frequency. 1 The formal setting of this problem and an upper bound for the multiplicative complexity of transforms of orders $2^n$ can be found in E. Feig and S. Winograd. 20
For the special case of the order-8 two-dimensional DCT/IDCT, the least complex direct 2D factorization is described by E. Feig and S. Winograd. 21 Their implementation requires 96 multiplication and 454 addition operations for the computation of the complete set of 2D outputs. The same paper further describes an efficient scaled 8x8 DCT implementation that requires only 54 multiplication, 462 addition, and 6 shift operations. 21
The latter of these transforms is referred to as a scaled transform because all of its outputs must be scaled (i.e., multiplied by fixed, possibly irrational, constants) so that each output equals the corresponding output of a non-scaled DCT. In some applications, such as JPEG and several video coding algorithms, this scaling can be implemented jointly with quantization (by factoring the scale constants together with the corresponding quantization values), resulting in significant computational savings. . 27 The VL and LLM algorithms are the least complex among known non-scaled designs, and require only 11 multiplication and 26 addition operations.
We note that the suitability of each of these factorizations for the design of fixed-point IDCT algorithms has been extensively analyzed in the course of work on the ISO/IEC 23002-2 standard. 28-31
Fixed-Point Approximations
As described previously, implementations of the DCT/IDCT require multiplications by irrational constants (i.e., the cosines). Clever factorizations can only reduce the number of such "essential" multiplications, not eliminate them altogether. Hence, in the design of DCT/IDCT implementations, one is usually tasked with finding ways of approximately computing products by these irrational factors using fixed-point arithmetic.
One of the most common and practical techniques for converting floating-point values to fixed-point values is based on approximating the irrational factors $\alpha_i$ by dyadic fractions:

$$\alpha_i \approx \frac{a_i}{2^k}, \qquad (8)$$

where both $a_i$ and $k$ are integers. In this way, a product of $x$ by the factor $\alpha_i$ can be approximated very simply in integer arithmetic as follows:

$$x \cdot \alpha_i \approx (x \cdot a_i) >> k, \qquad (9)$$

where >> denotes the bit-wise right shift operation.
In some transform designs, the right shift operations in approximations (9) can be delayed to later stages of the implementation, or done at the very end of the transform, but the more complex operations, such as multiplications by each non-trivial constant $\alpha_i$, still need to be performed in the algorithm.
The key variable affecting the precision and complexity of the dyadic rational approximations (8) is the number of precision bits k. In software designs, this parameter is often constrained by the width of registers (e.g., 16 or 32 bits), and failing to satisfy this constraint can easily double the execution time of the transform. In hardware designs, the parameter k affects the number of gates needed to implement adders and multipliers. Hence, one of the basic goals in fixed-point designs is to minimize the number of bits k while maintaining sufficient accuracy of the approximations.
Improving Precision of Dyadic Rational Approximations
Without placing any specific constraints on the values of $\alpha_i$, and assuming that for any given k the corresponding numerators $a_i$ are chosen such that

$$a_i = \arg\min_{a \in \mathbb{Z}} \left| \alpha_i - \frac{a}{2^k} \right|,$$

we can conclude that the absolute error of the approximations (8) is inversely proportional to $2^k$:

$$\left| \alpha_i - \frac{a_i}{2^k} \right| \le 2^{-(k+1)}.$$

That is, each extra bit of precision (i.e., incrementing k) halves the error bound.
Nevertheless, it turns out that this rate can be significantly improved if the values α 1 , . . . , α n that we are trying to approximate can be simultaneously scaled by some additional parameter ξ.
We claim the following (the proof for which is provided in Appendix A):
Lemma 2.1. Let $\alpha_1, \ldots, \alpha_n$ be a set of n irrational numbers ($n \ge 2$). Then there exist infinitely many (n+2)-tuples $a_1, \ldots, a_n, k, \xi$, with $a_1, \ldots, a_n \in \mathbb{Z}$, $k \in \mathbb{N}$, and $\xi \in \mathbb{Q}$, such that
In other words, if the algorithm can be altered such that all of its irrational factors $\alpha_1, \ldots, \alpha_n$ can be prescaled by some parameter $\xi$, then we might be able to find approximations whose absolute error decreases as fast as $2^{-k(1+1/n)}$. For example, when n = 2, the error decreases as $2^{-1.5k}$, which means 50% higher effectiveness in the usage of bits. For larger sets of factors $\alpha_1, \ldots, \alpha_n$, however, this gain is smaller.
These observations suggest that we can significantly improve the precision of a fixed-point IDCT design by splitting it into a set of smaller blocks (or sub-transforms) with alterable common factors, and then adjusting these factors so that they yield the high-accuracy solutions predicted by Lemma 2.1.
Reducing Complexity of Multiplications
The dyadic approximations shown in (8, 9) already reduce the problem of computing products by irrational constants to multiplications by integers. However, integer multiplications can still be computationally "expensive" to use on many existing platforms, and in such cases it becomes desirable to find ways to compute these products without using general purpose multipliers.
To illustrate this idea, consider a multiplication by the irrational factor $1/\sqrt{2}$, using its 5-bit dyadic approximation 23/32. By looking at the binary bit pattern of 23 = 10111 and substituting each "1" with an addition operation, we can compute a product by 23 as follows:

$$x \cdot 23 = (x << 4) + (x << 2) + (x << 1) + x. \qquad (10)$$

This approximation requires 3 addition and 3 shift operations. By further noting that the last 3 digits form a run of "1"s (111 = 1000 - 1), we can instead use

$$x \cdot 23 = (x << 4) + (x << 3) - x, \qquad (11)$$

which reduces the complexity to just 2 shift and 2 addition operations.
In the engineering literature, the sequences of operations "+" associated with isolated digits "1", or "+" and "-" associated with the beginnings and ends of runs "1...1", are commonly referred to as a "Canonical Signed Digit" (CSD) decomposition. 34 This is a well-known and frequently used tool in the design of multiplierless circuits. 38 However, CSD decompositions do not always produce results with the lowest numbers of operations. For example, considering an 8-bit approximation of the same factor, $1/\sqrt{2} \approx 181/256$ with 181 = 10110101, we find that its CSD decomposition

$$x \cdot 181 = (x << 8) - (x << 6) - (x << 4) + (x << 2) + x$$
needs 4 addition and 4 shift operations. But, by rearranging the computations and reusing intermediate results, a more efficient algorithm can be constructed: This approximation requires only 3 addition and 3 shift operations.
An even more dramatic reduction in complexity can be achieved by performing joint factorization of simultaneous products by multiple integer constants. For example, consider the task of computing products by two constants: 99 = 1100011 and 239 = 11101111 (in binary). The use of CSD decompositions

$$x \cdot 99 = (x << 7) - (x << 5) + (x << 2) - x, \qquad x \cdot 239 = (x << 8) - (x << 4) - x$$

results in a total complexity of 5 addition and 5 shift operations. At the same time, by using a joint factorization of these two products, the same task can be simplified to the following implementation:
which needs only 4 addition and 3 shift operations.
In the context of the IDCT design, such algorithms can be used for simultaneous computation of products by pairs of factors in transform butterflies. Moreover, since in each butterfly there are typically two variables that need to be multiplied by the same factors, such computations can easily be done in parallel.
In passing, we note that finding optimal (i.e., with the fewest additions and/or shifts) algorithms for computing multiplications by integer constants has been an area of active and fruitful research over the last few decades. 33-40 It has been established that this problem is NP-complete, 36 and numerous fast heuristic algorithms have been proposed for solving it approximately.
Minimizing Errors in Multiplierless Algorithms using Right Shifts
Another family of techniques for computation of products by dyadic fractions (8) can be derived by allowing the use of right shifts as elementary operations.
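For example, considering the factor $1/\sqrt{2} \approx 23/32 = 0.10111$ (in binary), and using right shift and addition operations according to its CSD decomposition, we obtain†:

$$y = (x >> 1) + (x >> 2) - (x >> 5), \qquad (13)$$

or (by further noting that 1/2 + 1/4 = 1 - 1/4):

$$y = x - (x >> 2) - (x >> 5). \qquad (14)$$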
Yet another (although somewhat less obvious) way of computing the product by the same factor is:
We present plots of the values produced by these algorithms in Figure 1. They all approximate products by the fraction 23/32; however, their errors differ. For example, the algorithm (13)
and it also implies that for any
that is, a zero-mean error on any symmetric interval.
This property is very important in the design of signal processing algorithms, as it minimizes the probability that rounding errors introduced by fixed-point approximations will accumulate. Below we will establish the existence of right-shift-based sign-symmetric algorithms for computing products by dyadic fractions and provide upper bounds for their complexity. 
as the following sequence of steps:
where x 1 := x, and where subsequent values x k (k = 2, . . . , t) are produced by using one of the following elementary operations:
The algorithm terminates when there exists indices j 1 , . . . , j m t, such that:
We state the following (the proofs for which are provided in Appendix B): We should point out that these are very simple and rather coarse complexity bounds, and that in many cases, the complexity overhead for achieving sign-symmetry is not that high. Moreover, when complexity considerations are paramount, one can pick algorithms that are sign-symmetric for most, but not all values of x in the expected range of this variable. In many cases, such "almost symmetric" algorithms can also be the least complex for a given set of factors.
In the design of our IDCTs, we used an exhaustive enumeration process to search for the best algorithms (17)-(20) with symmetric (or at least well-balanced) rounding errors. As additional criteria for selecting such algorithms, we used estimates of the mean, variance, and magnitude (maximum value) of the errors they produce. In assessing their complexity, we counted the number of operations, as well as the longest execution path and the maximum number of intermediate registers needed for the computations.
DESIGN OF FIXED-POINT APPROXIMATIONS OF THE 8X8 IDCT
The overall architecture used in the design of the proposed fixed-point IDCT algorithms is shown in Figure 2; it is both separable and scaled. The scaling stage is performed with a single 8x8 matrix that is precomputed by multiplying the 1D scale factors of the row transform with the 1D scale factors of the column transform. The scaling stage is also used to pre-allocate P bits of precision to each of the input DCT coefficients, thereby providing a fixed-point "mantissa" for use throughout the rest of the transform. Other key features of this architecture include simplicity, compactness and cache-efficiency, and a flexible interface that allows scaling to be merged with quantization logic in video and image codec implementations.
As the underlying basis for the scaled 1D transform design, we use a variant of the well-known factorization of C. Loeffler, A. Ligtenberg, and G.S. Moschytz 27 with 3 planar rotations and 2 independent factors $\gamma = \sqrt{2}$ (see Figure 3). This choice was made empirically, based on an extensive analysis of fixed-point designs derived from other known algorithms, including variants of the AAN, 25 VL, 26 and LLM 27 factorizations. 31
In order to allow efficient rational approximations of the constants α, β, δ, ε, η, and θ within the LLM factorization, we introduce two floating factors ξ and ζ, and apply them to two sub-groups of these constants as follows (see also Figure 3, right flowgraph):
We invert these multiplications by ξ and ζ in the scaling stage by multiplying each input DCT coefficient by the respective reciprocal of ξ or ζ. That is, we precompute a vector of scale factors for use in the scaling stage, prior to the first 1D transform in the cascade.
These factors are subsequently merged into a scaling matrix which is precomputed as follows:
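$$\Sigma = \begin{pmatrix}
A & B & C & D & A & D & C & B \\
B & E & F & G & B & G & F & E \\
C & F & H & I & C & I & H & F \\
D & G & I & J & D & J & I & G \\
A & B & C & D & A & D & C & B \\
D & G & I & J & D & J & I & G \\
C & F & H & I & C & I & H & F \\
B & E & F & G & B & G & F & E
\end{pmatrix}$$

where A, ..., J denote the unique values in this product: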
and S denotes the number of fixed-point precision bits allocated for scaling.
The parameter S is chosen to be greater than or equal to the number of bits P of the mantissa of each input coefficient. This allows the scaling of the coefficients $F_{vu}$ to be implemented as follows:

$$F'_{vu} = (F_{vu} \cdot S_{vu}) >> (S - P), \qquad (24)$$

where $S_{vu} \approx \Sigma_{vu}$ denote integer approximations of the values in the matrix of scale factors (23).
At the end of the last transform in the series of 1D transforms, the P fixed-point mantissa bits (plus 3 extra bits accumulated during the execution of the 1D stages‡) are simply shifted out of the transform outputs by right shift operations:

$$\hat{f}_{yx} = f_{yx} >> (P + 3). \qquad (25)$$
To ensure proper rounding of the computed value in (25), we add a bias of $2^{P+2}$ (half of the final divisor $2^{P+3}$) to the values $f_{yx}$ prior to the shifts. This rounding bias is implemented by perturbing the DC coefficient prior to executing the first 1D transform:

$$F''_{00} = F'_{00} + 2^{P+2},$$

where $F'_{00}$ is the scaled DC coefficient.
Using this architecture, the task of finding fixed-point IDCT implementations is now reduced to finding sets of integer approximations of the factors:
• A, B, C, D, E, F, G, H, I, J: the coefficients in the matrix of scale factors (23), and
• α, β, δ, ε, η, θ: the factors inside the 1D transforms, together with algorithms for computing products by them.
Global parameters that can also be adjusted are:
• P: the number of fixed-point mantissa bits;
• S: the number of bits used to implement the scaling stage, such that S ≥ P;
• k: the number of bits used for the approximations of factors within the 1D transforms.
Notably, this list of parameters does not include the values of our "floating factors" ξ and ζ. The reason for their exclusion is that these factors are needed only for establishing the relationship between the values of the factors inside the transform (21) and the values of the scale factors (22). The actual values of ξ and ζ are absorbed by the rational fractions assigned to each factor.
‡ The LLM factorization naturally causes all quantities on the output to be multiplied by a factor of 2√2. 27 This results in a 1.5-bit mantissa expansion in each of the row and column passes, and 3 bits accumulated at the end of the 2D transform.
This design framework has been used for the design of several IDCT approximations submitted to MPEG. 30, 31
The search for the above parameters and algorithms has been organized such that for each candidate transform approximation we were able to measure: (a) the IDCT precision in terms of accuracy metrics, and (b) the number of operations needed for its implementation. This approach allowed us to identify transforms with the best achievable complexity and precision tradeoffs.
Examples of IDCT Designs
We summarize the values of parameters and performance characteristics of several algorithms designed using this framework in Table 2 . These algorithms have the following particular features:
L16: an algorithm passing all normative ISO/IEC 23002-1 precision tests using the lowest achievable number of mantissa bits, P = 3. This implies that this algorithm is implementable on 16-bit platforms.
L1: an ISO/IEC 23002-1 compliant IDCT approximation with the lowest achievable number of bits in approximations of transform factors, k = 8.
L2: an ISO/IEC 23002-1 compliant IDCT approximation with the lowest achievable number of shifts (12 shifts per 1D transform).
Since the underlying factorization contains 12 multiplications, this means that each multiplication in algorithm L2 is implemented using only 1 shift operation.
Z0a: a higher-accuracy (linearity-test compliant) algorithm, selected for the Final Committee Draft (FCD) of the ISO/IEC 23002-2 standard. 13
Z1: an algorithm that was originally selected for the Committee Draft (CD) of the ISO/IEC 23002-2 standard. 12 This algorithm is considerably more complex than the FCD design (Z0a).
Z4: an ultra-high precision IDCT approximation.
In characterizing IDCT precision, Table 2 lists the worst-case values of the ISO/IEC 23002-1 metrics, collected over all normative pseudo-random tests. 14 In describing complexity, the letter "a" denotes the number of additions and the letter "s" the number of shifts needed to implement these algorithms. The "1D" complexity section gives the number of operations needed to implement each scaled one-dimensional transform. The "2D" complexity section gives the total number of operations needed to implement the scaled 2D transform. Finally, the "2D+S" complexity section gives the total number of operations needed to implement the complete 2D IDCT, including scaling (assuming that all input coefficients are non-zero).
The collection of algorithms L16, L0, L1, and L2 illustrates the extremes that can be reached if the goal is simply to pass the basic set of precision requirements for IDCT implementations in MPEG standards. Algorithms Z0a, Z1, and Z4 strive to go beyond this basic goal and have several additional desirable properties. For example, they all pass the linearity test, 16, 17 pass the extended dynamic range tests, 15 and perform better in the so-called IDCT-drift tests described in the next subsection.
Drift Performance Analysis
The IEEE 1180 | ISO/IEC 23002-1 tests define mandatory requirements for IDCT implementations in MPEG and ITU-T video coding standards. However, passing them does not always guarantee high quality of the decoded video, particularly in situations with low quantization noise and long runs of predicted (P-type) frames or macroblocks. 42 This is why, in evaluating an IDCT design, it is important to use additional tests, such as those measuring the drift (the difference between the video frames reconstructed in the encoder and in the decoder) caused by using the approximate IDCT in the decoder.
In order to measure the drift performance of our IDCTs, we used the reference software encoders (employing floating-point DCTs and IDCTs) of the H.263, MPEG-2, and MPEG-4 P2 standards. To emphasize IDCT drift effects, we also:
• forced all frames after the first one to be P-frames;
• disabled Intra-macroblock refreshes;
• forced QP = 1 (quant scale = 1, and w[i, j] = 16 in MPEG-2 and MPEG-4) for all frames.
In the decoder, we used our IDCT approximations and, for comparison, also ran tests with the following existing IDCT implementations:
• MPEG-2 TM5 IDCT: the fixed-point implementation included in the MPEG-2 reference software, 43
• XVID IDCT: a high-accuracy fixed-point implementation of the IDCT in the XVID (MPEG-4 P2) codec, 44 and
• H.263 Annex W IDCT: a 16-bit IDCT algorithm specified in Annex W of ITU-T Recommendation H.263. 7
The results of our tests for the sequence "News", using the H.263 and MPEG-2 codecs, are shown in Figure 4. It can be observed that the high-precision algorithm Z4 has virtually no drift. Algorithms Z1 and Z0a follow, with their worst-case accumulated drift contained within approximately 0.5 dB in the H.263 tests and within 2 dB in the MPEG-2 tests. Algorithms L0, L2, and L1 follow next, with slightly worse worst-case drift (approximately 0.625 dB in the H.263 tests and 2.25 dB in the MPEG-2 tests). The rest of the algorithms, however, perform much worse. These results illustrate that IDCT drift performance can be significantly affected by the choice of the fixed-point architecture and its parameters. In particular, in testing numerous implementations produced using our scaled, LLM-based framework, we observed that drift performance is most significantly affected by our "mantissa" parameter P. For the majority of algorithms (L0, L1, L2, Z0a, and Z1), reducing the mantissa by 2, 3, or sometimes even 4 bits had almost no effect on most of the IEEE 1180 | ISO/IEC 23002-1 precision metrics, and yet each such bit had a major effect (about 1-2 dB per bit) in the drift tests. The algorithm L16 is an extreme example of such a mantissa reduction (leaving only P = 3), and it is clearly unacceptable in terms of drift performance. For this reason, we retained at least P = 10 bits of mantissa in the design of most of our algorithms proposed to MPEG.
THE ISO/IEC 23002-2 FCD FIXED POINT IDCT ALGORITHM
The overall architecture and 1D factorization flowgraph used by the ISO/IEC 23002-2 FCD algorithm are depicted in Figures 2 and 3, respectively. All integer factors and parameters used in this algorithm are listed in Table 2 under the column "23002-2". This transform allocates P = 10 bits for the fixed-point mantissa, and uses the same number of bits, S = 10, for specifying the scale factors. Since S = P, the right shifts in the processing of the input coefficients (24) cancel out, which makes the scaling stage of this transform particularly simple:

$$F'_{vu} = F_{vu} \cdot S_{vu}, \quad v, u = 0, \ldots, 7, \qquad (26)$$

$$F''_{00} = F'_{00} + 2^{P+2} = F'_{00} + 2^{12}, \qquad (27)$$
where $F_{vu}$ are the input coefficients, and the DC-term adjustment (27) ensures proper rounding at the end of the transform. The maximum total number of bits needed by all variables in this transform is 26, assuming the full 12-bit dynamic range of reconstructed pixel values, which is sufficient to cover even extreme cases of quantization noise expansion, as described in 41.
There are three groups of dyadic rational factors processed by this algorithm (see Figure 3 and Table 2):
• α = 41/128 and β = 99/128, in the butterfly with coefficients $X_2$ and $X_6$,
• δ = 113/128 and ε = 719/4096, in the butterfly with coefficients $X_3$ and $X_5$, and
• η = 1533/2048 and θ = 1/2, in the butterfly with coefficients $X_1$ and $X_7$.
The computation of products by these factors is performed as follows:
The combined complexity of all these operations is only 9 addition and 10 shift operations. Therefore, the average complexity of computing a single multiplication in this algorithm is only 9/6 = 1.5 addition and 10/6 ≈ 1.67 shift operations. Compare this with a traditional fixed-point implementation of a product by a factor, such as
x * η ∼ (x * 1533 + 1024) >> 11,
which, besides the integer multiplication itself, includes one addition (for proper rounding) and one shift. Subtracting these unavoidable operations (1.5 - 1 and 1.67 - 1), we conclude that the effective cost of each integer multiplication in the ISO/IEC 23002-2 FCD algorithm is only 0.5 addition and about 0.67 shift operations.
The total complexity of computing each scaled 1D transform in this algorithm is 44 addition and 20 shift operations. The description of a complete 1D transform in the C programming language requires only 50 lines. 13 The extra C code needed to describe the full 2D version takes only 20 lines.
The scaling of transform coefficients can be done either outside of the transform, e.g., in the quantization stage, thereby taking advantage of the sparseness of the input matrix of coefficients, or inside the transform, by executing the multiplications (26).
This algorithm passes all normative ISO/IEC 23002-1 precision tests, 14 as well as many additional tests that were created in the process of evaluating fixed-point designs in MPEG. These additional tests include MPEG-2, MPEG-4, and T.83 (JPEG) conformance tests, drift tests with H.263, MPEG-2, and MPEG-4 encoders and decoders, as well as the linearity test and extended dynamic range tests.
ISO/IEC 23002-2 FCD FDCT Design
The design of the corresponding fixed-point forward ISO/IEC 23002-2 DCT is fully symmetric relative to the IDCT design. Its overall architecture and 1D factorization are presented in Figures 5 and 6, respectively. All integer factors and algorithms for computing products in this FDCT design are exactly the same as in the IDCT, the only difference being the order in which they are executed.
The two elements of the FDCT design that are implemented differently from the IDCT design are the allocation of mantissa bits and the scaling. The allocation of mantissa bits is done at the very beginning of the FDCT transform as follows:
and the scaling is done at the very end, by using 
The use of the term (30) in the rounding (29) ensures that FDCT scaling is done in a sign-symmetric fashion, with a slightly wider deadzone around 0.
We note that the scaled architecture of the ISO/IEC 23002-2 FDCT design also makes it possible to combine the final scaling stage (29), (30) with the quantization process in video or image encoders, thereby enabling further complexity reductions.
CONCLUSIONS
In this paper, we have described our proposed fixed-point IDCT design methodologies and several resulting algorithms achieving different precision/complexity characteristics. We have explained the choice of parameters in such designs and their connection to IDCT precision and drift performance.
The fixed-point 8x8 IDCT and DCT algorithms adopted in the ISO/IEC 23002-2 FCD are also described. Their architecture has benefited from the ideas contributed to the MPEG standardization process by multiple proponents and yielded a remarkably efficient implementation, surpassing all IEEE 1180 | ISO/IEC 23002-1 precision requirements, with low implementation complexity (requiring only 44 addition and 20 shift operations per scaled 1D transform), and performing very well in linearity, extended dynamic range, and IDCT drift tests.
