INTRODUCTION
It is well known that the discrete cosine transform (DCT) is widely used in many areas, such as speech and image coding. In particular, the two-dimensional (2-D) DCT has been adopted in international standards such as MPEG, JPEG, and CCITT [1]. A 2-D DCT can be obtained by applying a 1-D DCT over the rows followed by a 1-D DCT over the columns of the 8x8 data block [2]. The efficient implementation of the DCT has therefore become one of the most important issues in developing real-time embedded systems. In mobile multimedia devices such as digital cameras, cell phones, and pocket PCs, hardware complexity and power consumption have to be minimized. To this end, a great number of fast DCT algorithms have been proposed [3], among which the Loeffler algorithm [4] attains the lower bound of multiplicative complexity for the 8-point DCT: it requires only 11 multiplications and 29 additions. The common disadvantage of all fast DCT algorithms, however, is that they still need floating-point multiplications. These operations are very slow in software implementations and require large area and power in hardware, and therefore cannot be used in mobile multimedia devices. There is thus still a need for new DCT algorithm designs whose trade-offs are better suited to particular applications.
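The row-column decomposition mentioned above can be illustrated with a minimal Python sketch. It uses the direct orthonormal DCT-II definition rather than any fast algorithm, and the function names `dct_1d` and `dct_2d` are chosen here for illustration only.

```python
import math

def dct_1d(x):
    """Orthonormal 1-D DCT-II of a sequence, computed directly (O(N^2))."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(x[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                  for i in range(n))
        out.append(scale * acc)
    return out

def dct_2d(block):
    """2-D DCT as a 1-D DCT over the rows followed by a 1-D DCT over the columns."""
    rows = [dct_1d(row) for row in block]            # transform every row
    cols = [dct_1d(list(col)) for col in zip(*rows)]  # transform every column
    return [list(r) for r in zip(*cols)]             # transpose back to row-major

# A constant 8x8 block has all its energy in the DC coefficient.
flat_block = [[1.0] * 8 for _ in range(8)]
coeffs = dct_2d(flat_block)
```

For the constant block, `coeffs[0][0]` equals 8 up to rounding error and every other coefficient is numerically zero, which is a quick sanity check of the separable structure.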
Mathematically, a fast DCT is composed of additions and multiplications by constants. When implemented in hardware, the multiplications by constants are often realized as a sequence of additions and shifts, which is less expensive in terms of chip area and power consumption [5]. Such implementations of transforms are referred to as multiplierless. The binDCT appears to be the most notable result in this field [6]. This transform is based on a VLSI-friendly lattice structure and is derived from the DCT matrix factorization by replacing plane rotations with lifting schemes. Another popular way to implement the DCT without multipliers is to use the coordinate rotation digital computer (CORDIC) algorithm [7]-[9], since the CORDIC algorithm leads to a very regular structure suitable for VLSI implementation.
In [10], it was concluded that the length of the critical path, i.e., the maximum number of adders operating in cascade, strongly affects the performance of a hardware implementation of the DCT. It was shown that a 30-40% decrease in delay and power consumption was obtained after shortening the critical path from 10 to 7 adders, even though at the cost of increasing the total number of adders. In [7], it was found that CORDIC-based structures can be optimized to shorten the critical path to 5-6 additions while still maintaining good coding performance.
In this paper, we discuss the FPGA implementation of the CORDIC-based approximation of the eight-point DCT proposed in [7]. The paper begins with a review of Loeffler's fast algorithm and its multiplierless variants. Then the details of the proposed FPGA implementation scheme are given.
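As an illustration of multiplierless constant multiplication, the following Python sketch replaces a multiplication by a constant with shifts and additions. The constant, its four-term binary expansion, and the name `mul_const_shift_add` are assumptions made for this example and are not taken from [5].

```python
def mul_const_shift_add(x, terms):
    """Multiply integer x by sum(sign * 2**-shift) using only shifts and adds."""
    acc = 0
    for sign, shift in terms:
        acc += sign * (x >> shift)  # each term costs one shifter and one adder
    return acc

# cos(pi/4) ~= 0.7071 approximated as 1/2 + 1/8 + 1/16 + 1/64 = 0.703125
COS_PI_4_TERMS = [(+1, 1), (+1, 3), (+1, 4), (+1, 6)]

approx = mul_const_shift_add(1024, COS_PI_4_TERMS)  # exact product is about 724.08
```

In hardware the shifts are fixed wirings, so the cost of such a multiplication is essentially the number of adders, which is why the number of nonzero terms in the expansion matters.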
BASE STRUCTURE
In [7], the starting point for deriving a short-critical-path approximation of the 8-point DCT is the signal flow graph of one of the several possible variants of Loeffler's algorithm (Fig. 1).
In [6], using the lifting-based approach, it was shown that fast and accurate approximations of the DCT can be obtained without any multiplications. The resulting family of transforms, which differ in accuracy and efficiency, has been called the binDCT.
Another way to implement plane rotations without multipliers, the CORDIC algorithm, has also been considered. In [11], special attention is given to constructing transform approximations that maintain orthogonality regardless of their coefficient quantization. In contrast, in [10], performance maximization was of interest, especially from the hardware implementation point of view. In [7], a novel family of CORDIC-based algorithms with short critical paths was presented. Here we give the details of the FPGA implementation of the DCT approximation algorithm proposed in [7].
CORDIC ALGORITHM
The CORDIC algorithms are an efficient method of computing a variety of trigonometric, hyperbolic, and linear functions [12].
To realize a plane rotation with the CORDIC algorithm, the rotation angle φ is decomposed over the set of angles called the CORDIC arctangent radix (ATR) [9], as expressed in (2).
The plane rotation is then performed by the iterative equations (3). Note that each iteration scales the vector: the scale factor for the ith iteration is given by (4), and the total scale factor K is given by (5).
Using (2)-(5), we can rewrite (1) in the form (6).
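For reference, the relations described above can be written in their standard textbook form as follows. This is a sketch under assumed notation: the symbols $\sigma_i$, $k_i$, the counterclockwise sign convention, and the index range are assumptions and may differ from the notation of [9].

```latex
% Plane rotation by the angle \varphi
\begin{bmatrix} x' \\ y' \end{bmatrix}
= \begin{bmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{bmatrix}
  \begin{bmatrix} x \\ y \end{bmatrix}

% Decomposition of the angle over the arctangent radix
\varphi \approx \sum_{i} \sigma_i \arctan 2^{-i}, \qquad \sigma_i \in \{-1, +1\}

% One CORDIC iteration (shift-and-add only)
x_{i+1} = x_i - \sigma_i 2^{-i} y_i, \qquad
y_{i+1} = y_i + \sigma_i 2^{-i} x_i

% Scale factor of the i-th iteration and total scale factor
k_i = \frac{1}{\sqrt{1 + 2^{-2i}}}, \qquad K = \prod_{i} k_i

% Rotation rewritten as scaled product of microrotations
\begin{bmatrix} x' \\ y' \end{bmatrix}
\approx K \prod_{i}
  \begin{bmatrix} 1 & -\sigma_i 2^{-i} \\ \sigma_i 2^{-i} & 1 \end{bmatrix}
  \begin{bmatrix} x \\ y \end{bmatrix}
```

Each iteration matrix equals $\sqrt{1 + 2^{-2i}}$ times an exact rotation by $\sigma_i \arctan 2^{-i}$, which is why the factor $K$ has to be applied to recover a true rotation.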
Originally, the CORDIC algorithm allows the rotation direction coefficients to take only the values ±1. However, recent papers show that computational savings can be achieved by allowing some iterations to be omitted or repeated. The elementary rotations are also called microrotations.
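A minimal Python sketch of a rotation built from a freely chosen list of microrotations is given below. It assumes the standard CORDIC iteration; the representation of a microrotation as a pair `(sigma, i)` and the function names are illustrative assumptions. Omitting an iteration simply means leaving its index out of the list, and repeating one means listing it twice.

```python
import math

def cordic_rotate(x, y, microrotations):
    """Apply microrotations given as (sigma, i) pairs.

    Returns the unscaled vector and the total scale factor K that
    turns it into an exactly rotated vector.
    """
    k = 1.0
    for sigma, i in microrotations:
        t = 2.0 ** -i                                 # a shift by i bits in hardware
        x, y = x - sigma * t * y, y + sigma * t * x    # two adders per microrotation
        k /= math.sqrt(1.0 + t * t)
    return x, y, k

def cordic_angle(microrotations):
    """Angle actually realized by the chosen microrotations."""
    return sum(sigma * math.atan(2.0 ** -i) for sigma, i in microrotations)

# Iterations 0 and 3 are omitted on purpose.
steps = [(+1, 1), (+1, 2), (-1, 4)]
xu, yu, K = cordic_rotate(1.0, 0.0, steps)
phi = cordic_angle(steps)
```

After multiplying by `K`, the result coincides with the unit vector rotated by `phi`, so the only approximation error is the difference between `phi` and the angle that was actually required.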
The common approach to adapting the CORDIC algorithm to the DCT is to choose the set of microrotations that approximates the required rotation angle as closely as possible. In contrast, another method was proposed in [7]. The main difficulty that arises when using the CORDIC algorithm is the need to implement scaling. To extract the scaling factors (for the rotations by β and γ) outside the transform core, it was decided in [7] to approximate the angles β and γ with the same set of absolute values of microrotations; the scale factors for the rotations by β and γ thus become equal and can be extracted outside the transform. There is no problem with scaling extraction for the rotation by α.
It should also be noted that scaling that requires division by irrational numbers cannot be performed exactly in fixed-point arithmetic. However, in the most popular international standards, such as MPEG and JPEG, the DCT unit is followed by a quantizer, where the DCT outputs are scaled by pointwise division by the corresponding scaling constants stored in the quantization table [1]. Thus, each scaling factor of the DCT outputs can be incorporated into the corresponding scaling constant without requiring any additional hardware.
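The reason why equal absolute microrotations give equal scale factors can be checked numerically with the short Python sketch below. The index set {1, 2, 4} is the one used later in the paper, whereas the two sign patterns are hypothetical and chosen only to produce two different angles.

```python
import math

def realized_angle(signs, indices):
    """Angle obtained from microrotations sigma_i * arctan(2**-i)."""
    return sum(s * math.atan(2.0 ** -i) for s, i in zip(signs, indices))

def scale_factor(signs, indices):
    """Total CORDIC scale factor; the sign enters only through its square."""
    k = 1.0
    for s, i in zip(signs, indices):
        k /= math.sqrt(1.0 + (s * 2.0 ** -i) ** 2)
    return k

INDICES = (1, 2, 4)
SIGNS_A = (+1, +1, -1)   # hypothetical sign pattern for one angle
SIGNS_B = (+1, -1, -1)   # hypothetical sign pattern for another angle

angle_a = realized_angle(SIGNS_A, INDICES)
angle_b = realized_angle(SIGNS_B, INDICES)
k_a = scale_factor(SIGNS_A, INDICES)
k_b = scale_factor(SIGNS_B, INDICES)
```

The two angles differ, yet `k_a` and `k_b` coincide, so a single common factor can be moved out of both rotations.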
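The folding of scale factors into the quantization table can be sketched in Python as follows. The coefficient values, scale factors, and table entries are made-up numbers for illustration, and `fold_scale_into_table` is a hypothetical helper, not part of any standard.

```python
def quantize(coeffs, table):
    """Pointwise division by the quantization table followed by rounding."""
    return [round(c / q) for c, q in zip(coeffs, table)]

def fold_scale_into_table(table, scales):
    """Absorb per-output scale factors into the quantization constants."""
    return [q / s for q, s in zip(table, scales)]

unscaled = [100.0, -37.0, 12.0]   # outputs of the transform core (made-up values)
scales = [0.6, 0.85, 0.85]        # scale factors left outside the core (made-up values)
table = [16.0, 11.0, 10.0]        # quantization constants (made-up values)

# Reference path: apply the scaling explicitly, then quantize.
reference = quantize([s * u for s, u in zip(scales, unscaled)], table)

# Multiplierless path: quantize the unscaled outputs with a modified table.
folded = quantize(unscaled, fold_scale_into_table(table, scales))
```

Both paths yield the same quantized values, and the modified table is computed once offline, so the transform core itself never performs the scaling.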
FPGA IMPLEMENTATION OF CORDIC-BASED DCT
Consider variant C of the DCT approximation given in [7]. In this case, the angles β and γ are approximated with microrotations i = {1, 2, 4}, and the angle α with i = {1, 4}. The resulting scheme is shown in Fig. 2. It should be noted that, in practice, the critical path of this scheme contains 7 adders. This differs from the result reported in [7], where the critical path appears to consist of 6 additions. The two additional adders appear in stage 5, where the negation of two intermediate data samples represented in two's complement code has to be implemented.
Let us examine stages 5 and 6 in more detail. The first simplification is to merge the lower adder in stage 5 with the adders in stage 6 (Fig. 3). It is known that negation in two's complement code is performed as $-x = \bar{x} + 1$ (7), where $\bar{x}$ denotes simple bitwise inversion of $x$. Using (7), the second simplification replaces the upper adder that performs negation (stage 5) with simple NOT gates (Fig. 4). The addition of the least significant bit (LSB) required by (7) can be made in the next stage; this requires no extra hardware resources, because a conventional adder always has a carry input port. However, for output $X_5$ the LSB remains uncompensated. Thus, we lose one LSB of accuracy for one DCT output but improve the performance of the scheme. The flow graph obtained after the proposed simplifications is depicted in Fig. 5. This scheme can be regarded as a combinational circuit whose total delay, given by (8), is determined by the adder delay and the delay of a NOT gate.
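The two's complement identities used in this simplification can be checked with a small Python model. The 12-bit word length matches the output width reported in the experiments, but the helper names and the choice of modeling values as Python integers are assumptions of this sketch.

```python
WIDTH = 12                     # word length of the modeled datapath
MASK = (1 << WIDTH) - 1

def to_signed(v):
    """Interpret the low WIDTH bits of v as a two's complement number."""
    v &= MASK
    return v - (1 << WIDTH) if v & (1 << (WIDTH - 1)) else v

def negate(x):
    """Exact negation: bitwise inversion plus one LSB."""
    return to_signed(~x + 1)

def sub_via_carry_in(a, b):
    """a - b as a + NOT(b) with the adder's carry input set to 1."""
    return to_signed((a & MASK) + (~b & MASK) + 1)

def negate_uncompensated(x):
    """Inversion by NOT gates only; the missing LSB gives -x - 1."""
    return to_signed(~x)
```

For example, `sub_via_carry_in(100, 37)` gives 63 without a separate negation adder, while `negate_uncompensated(5)` gives -6 instead of -5, which is the one-LSB error accepted for a single output.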
EXPERIMENTAL RESULTS
The proposed DCT approximation was implemented through the FPGA place-and-route (PAR) process to determine the exact hardware cost. We used the Xilinx Virtex series of FPGAs for our experiments. The hardware cost is measured as the total number of slices required to implement the design. A Xilinx Virtex slice contains two D-type flip-flops and two four-input lookup tables (LUTs). Since we implement a combinational circuit, no flip-flops are needed. For a more accurate measure, we report both the occupied slices and the LUTs. The proposed solution was implemented on the XC4VLX25 FPGA. The input data are 8 bits wide and the output data are 12 bits wide. Variant I of the tested solution uses only the simplification shown in Fig. 3; variant II is the scheme in Fig. 5. In our experiments, we compared our solution with another low-complexity DCT approximation algorithm, the binDCT of type C [6]. Table I shows the hardware cost of the DCT approximations.
[Table II: number of bit inversions: 0, 1, 0; critical path]
It can be noted that, at the expense of nearly 20% more additions, the critical path can be made about 35% shorter.
CONCLUSION
This paper has presented an efficient approach to an 8-point CORDIC-based DCT approximation suitable for FPGA implementation. The proposed architecture requires 36 additions, 16 shifts, and 1 bit inversion to carry out the DCT. In addition, the critical path of the proposed solution contains only 6 adders.
