Frequency Domain Motion Estimation (FDME) is a recent technique that promises to reduce the computational complexity of the motion estimation (ME) process. The Related Transformed-Discrete Cosine Transform (RT-DCT) generator is one of the main modules of the FDME encoder. It is responsible for generating the four transforms required for ME in the frequency domain. The main problem in generating these transforms is the low speed of the FDME encoder, which prevents its use in real-time applications. In this paper, an efficient fast RT-DCT architecture is proposed to accelerate the encoding process in the frequency domain. The proposed architecture achieves approximately 58%, 39%, and 50% reductions in gate count, power consumption, and area, respectively, compared to the conventional state-of-the-art pipelined RT-DCT generators. Implementation and simulation results project that the proposed RT-DCT architecture, when integrated into a whole FDME system, can perform ME for 4CIF video at 60 fps at 118 MHz.
INTRODUCTION
Frequency Domain Motion Estimation (FDME) [1, 2] has recently emerged as a technique for speeding up the whole encoding process by significantly reducing ME computations. The most important part of the FDME is the Related Transformed-Discrete Cosine Transform (RT-DCT) generator. The RT-DCT generator is responsible for generating the four main transforms required for the ME process in the frequency domain (i.e., Cosine Cosine Transform (DCT), Sine Cosine Transform (DSCT), Sine Sine Transform (DST), and Cosine Sine Transform (DCST)) [1].
Many algorithms are used to compute the DCT coefficients [3, 4]. A time-recursive lattice structure [5, 6] is a good choice for generating the RT-DCT coefficients because it produces dual transforms (e.g., "DCT and DSCT" or "DST and DCST"), which reduces both hardware and computation. The main problem with the lattice structure is the multiplication by sine and cosine trigonometric functions. One effective and accurate way to solve this problem is to use the Coordinate Rotation Digital Computer (CORDIC) [7]. The main architectures for implementing the CORDIC algorithm are the bit-parallel iterative, bit-serial iterative, and unrolled architectures [7].
We modified the recursive lattice structure used in [6] to achieve a high-speed encoding process. In this work, a novel fast CORDIC architecture that mixes parallel and serial operations is integrated within the lattice architecture of [6] for superior performance. The proposed RT-DCT architecture has lower power consumption, smaller area, and higher speed than the existing RT-DCT architectures. This makes it suitable for real-time video applications on devices such as cellular phones, Mobile Internet Devices (MID), Ultra-Mobile Personal Computers (UMPC), and Personal Digital Assistants (PDA).
The paper is organized as follows. Section 2 discusses the principle of the time-recursive lattice architecture. Section 3 discusses the proposed unrolled bit-serial online CORDIC architecture. Section 4 discusses the proposed Lattice-CORDIC architecture. The whole architecture of the Related Transformed-DCT (RT-DCT) generator is discussed in Section 5. Implementation results are discussed in Section 6. Finally, conclusions are drawn in Section 7.
TIME RECURSIVE AND LATTICE ARCHITECTURE
The two-dimensional (2D) DST of sequential input data starting from x(t) and ending with x(t+N-1) is defined as [5, 8]: (1) where: (2) Since c(k) equals one for every k except k=N, we first consider the case c(k)=1 (i.e., k=1, 2, …, N-1) and then treat k=N as a special case. The 2D-DST of the next input data vector can be expressed as: (3) This can be rewritten as: (4) where:
A FAST DISCRETE TRANSFORM ARCHITECTURE FOR FREQUENCY DOMAIN MOTION ESTIMATION
As we can see, the DCST term appears in Equation (4). To show the dual property of the lattice structure, we similarly investigate the time-recursive equation of the 2D-DCST, which is defined as: (7) Now, the two transforms can be rewritten as:
Where:
is the one-dimensional sine transform that can be obtained using the lattice structure of Figure 3 [5, 6]. The whole time-recursive lattice architecture for k = 1, 2, …, N-1 is shown in Figure 1(a). The special cases k = 0 and k = N are shown in Figure 1(b). The same procedure can be used to derive the lattice architecture for the DCT and DSCT, as seen in Figure 2. In conclusion, both the 2D-DST and the 2D-DCST are obtained from the 1D-DST, and both the 2D-DCT and the 2D-DSCT are obtained from the 1D-DCT. This means that the architecture in Figure 3 can generate both the 1D-DCT and the 1D-DST needed to produce the RT-DCT coefficients, as discussed in Section 5. The main concern with the lattice architecture is the multiplication by cosine and sine functions.
The following section presents an efficient architecture to tackle this problem.
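The dual, time-recursive property described above can be illustrated with a small numerical sketch. It does not reproduce Equations (1) to (7) of this paper, which use the DCT/DST kernels of [5, 6]; instead it assumes a simplified pair of kernels, cos(πkn/N) and sin(πkn/N), and the hypothetical helper names `direct_pair` and `lattice_update`. Under that assumption, sliding the window by one sample updates the cosine and sine sums together through a single plane rotation, which is the kind of operation a CORDIC processor can replace.

```python
import math

def direct_pair(x, t, N, k):
    """Directly compute the cosine and sine sums over the window x[t:t+N].

    Simplified kernels (an assumption of this sketch, not the paper's exact ones).
    """
    theta = math.pi * k / N
    c = sum(x[t + n] * math.cos(theta * n) for n in range(N))
    s = sum(x[t + n] * math.sin(theta * n) for n in range(N))
    return c, s

def lattice_update(c, s, x_old, x_new, N, k):
    """Slide the window by one sample: both sums come out of one rotation by theta."""
    theta = math.pi * k / N
    # Drop the oldest sample and add the newest one.
    # The newest sample enters with cosine weight cos(pi*k) = (-1)**k and sine weight 0.
    u = c - x_old + ((-1) ** k) * x_new
    c_next = math.cos(theta) * u + math.sin(theta) * s
    s_next = math.cos(theta) * s - math.sin(theta) * u
    return c_next, s_next

if __name__ == "__main__":
    samples = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0, 5.0]
    N, k = 8, 3
    c, s = direct_pair(samples, 0, N, k)
    c, s = lattice_update(c, s, samples[0], samples[N], N, k)
    print(c, s)                           # recursive result for the window starting at t = 1
    print(direct_pair(samples, 1, N, k))  # direct result for the same window
```

The two multiplications by cos(θ) and the two by sin(θ) in `lattice_update` are exactly the products that the lattice structure needs at every step, which is why their cost dominates the design.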
UNROLLED BIT SERIAL ONLINE CORDIC ARCHITECTURE
CORDIC is an iterative algorithm developed by Volder [9] to compute several functions, including trigonometric functions, fixed/floating-point multiplication and division, logarithm, exponent, and square root, using simple shift and add operations. CORDIC is particularly important because of its simplicity, recursive nature, reduced hardware cost, and applicability to a wide range of functions. Due to lack of space, we do not discuss the CORDIC algorithm itself and focus only on the architecture design; the algorithm is described in detail in [7, 9]. In designing the CORDIC architecture in this paper, priority was given to increasing the encoding speed with an acceptable degradation in both area and power consumption. This is achieved by serializing both the shifters and the adders, as seen in Figure 4. The idea of the proposed architecture is to have N unrolled iterations. Each iteration i (where ) uses three serial adders/subtractors and three shift registers for X_i, Y_i, and Z_i, as seen in Figure 4. A control unit takes care of the correct flow of information. Once the operations of iteration i are done, its control unit triggers the next iteration stage (iteration i+1) to start its operations, and so on.
The whole operation of the proposed unrolled bit-serial online CORDIC processor is summarized as follows. Both X_0 and Y_0 are represented with 22 bits (10 bits for the integer part and 12 bits for the fractional part). The angle Z_0 is represented with 12 bits (10 bits for the integer part and 2 bits for the fractional part). Assuming all registers are reset to zero, the operation starts by enabling the start signal. Once the start signal goes high, the control unit CU_0 makes the X, Y, and Z registers start loading, serially, the initial values X_0, Y_0, and Z_0. This occurs when the XYZ_load signal goes high. Since only 12 bits are read from Z_0, a multiplexer is needed in front of the Z_i Shift Register (SR) in the first iteration to continue filling the register with 10 zeros (chosen from the multiplexer's second input). The select control signal is 0 for the first 12 serial clock (ser_clk) cycles and then goes high for the remaining 10 ser_clk cycles. Once the 22 bits are read, the MSB of Z_0 determines the type of operation for each input (addition or subtraction). The Z_en control signal goes high when the MSB of Z_0 is available. The angle_select control signal goes high after 22 ser_clk cycles to start the addition/subtraction operation. When stage 1 has its first output, the next_stage control signal goes high to trigger stage 2 to start its operations. The inputs to the Z adder/subtractor are the output of the previous Z shift register and another shift register that is loaded with the constant for that specific iteration. The outputs of the adders/subtractors at iteration i are loaded into the registers of the next iteration i+1. This operation continues until the last iteration stage. The final outputs X_n and Y_n must be scaled by multiplying both of them by 0.60727. This multiplication is converted to thirty-four shift operations and six addition operations, as seen in Figure 4. It is worth mentioning that the shift operations in the scaling part are constant and can be implemented with simple wiring.
In addition, only nine stages are used to generate the sine and the cosine. This number of stages was chosen based on the required accuracy of the fractional part of the final output.
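As a rough illustration of the data flow through the unrolled stages, the following minimal sketch models a nine-stage rotation-mode CORDIC in floating point. It is a behavioural model only: the bit-serial adders, shift registers, control signals, and the 22-bit fixed-point format of the proposed hardware are not modeled, and the function names `cordic_scale`, `cordic_rotate`, and `cordic_sin_cos` are hypothetical. The scale factor is computed from the stage count rather than taken from the text; for nine stages it comes out close to the 0.60727 constant quoted above.

```python
import math

def cordic_scale(stages):
    """Reciprocal of the CORDIC gain accumulated over the given number of stages."""
    gain = 1.0
    for i in range(stages):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return 1.0 / gain

def cordic_rotate(x0, y0, z0, stages=9):
    """Rotation mode: drive the angle z toward zero using shift-and-add style steps."""
    x, y, z = x0, y0, z0
    for i in range(stages):
        d = 1.0 if z >= 0 else -1.0      # sign of z selects add or subtract
        shift = 2.0 ** -i                # stands in for a right shift by i bits
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * math.atan(shift)        # per-stage constant angle
    return x, y, z

def cordic_sin_cos(angle, stages=9):
    """Return (sin, cos) of an angle in radians, for |angle| up to about pi/2."""
    x, y, _ = cordic_rotate(1.0, 0.0, angle, stages)
    k = cordic_scale(stages)             # final scaling of X_n and Y_n
    return y * k, x * k

if __name__ == "__main__":
    s, c = cordic_sin_cos(math.radians(30))
    print(s, c, cordic_scale(9))
```

With nine stages the residual angle is bounded by atan(2^-8), roughly 0.004 rad, which gives a feel for how the stage count trades accuracy against hardware.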
THE LATTICE-CORDIC DESIGN AND IMPLEMENTATION
In this section, we illustrate how the proposed CORDIC architecture can be inserted into the lattice structures of Figure 1, Figure 2, and Figure 3 to convert the multiplication operations into shift and addition/subtraction operations. The resulting Lattice-CORDIC architecture is compared to the one proposed in [6, 7], which is the main reference architecture for Lattice-CORDIC design.
The proposed Lattice-CORDIC architecture consists of six main parts, as seen in Figure 5. The Input Shift Register (ISR) consists of (N+1) locations, each 9 bits deep. It is used to load the input pixels of each row of the processed input block. One-bit serial adders are used to speed up the clock rate and to reduce both the area and the power consumption of the proposed Lattice-CORDIC architecture. The CORDIC 1 and CORDIC 2 processors are implemented in the serial fashion proposed in Section 3. Angle shift registers (Angle1_SR and Angle2_SR) store the desired angles required for CORDIC 1 and CORDIC 2, respectively. The desired angle Z_0 is defined as , where k = 0, 1, 2, …, N and N is the number of pixels in a row of the processed block.
THE WHOLE ARCHITECTURE OF THE RELATED TRANSFORMED-DCT (RT-DCT) GENERATOR
As shown in Figure 6, there are three groups of Lattice-CORDIC architecture. Group 1 generates the one-dimensional DCT and DST using the lattice structure of Figure 3 combined with the proposed CORDIC architecture. Group 2 uses the cosine transform from group 1 to produce the two-dimensional DCT and DSCT. Group 3 uses the sine transform from group 1 to produce the two-dimensional DST and DCST.
The main processor transfers the processed block to the external memory (External MEM). The main processor also triggers the RT-DCT generator to start computing the RT-DCT coefficients. The RT-DCT generator starts operation by feeding the pixels of the first row of the processed block to Lattice-CORDIC group 1. One pixel of the first row is fed to each Lattice-CORDIC module of group 1 at a time. The number of Lattice-CORDIC modules in group 1 equals the number of pixels in a row. Given that N_1 is the number of iterations in the CORDIC and one parallel clock cycle equals 23 serial clock cycles, it takes 2N_1 + 12 parallel clock cycles to produce the intermediate transform coefficients for the first row of the processed block. Once the outputs of group 1 are available, they are loaded into the even and odd Circular Shift Registers (CSR_even and CSR_odd) and the pixels of a new row are processed. Both CSR_even and CSR_odd automatically assign a sign bit at the MSB (0 for positive values and 1 for negative values). The values in the Circular Shift Registers are then fed to group 2 and group 3, respectively, as shown in Figure 6. Then the Circular Shift Registers circulate. Group 2 and group 3 take 2N_1 + 12 parallel clock cycles to finish the first row of the RT-DCT transform coefficients. It is worth mentioning that the feedback of the Lattice-CORDIC modules in group 2 and group 3 is taken from the output of the Shift Register Array (SRA). This process continues for the remaining rows of the processed block until all RT-DCT coefficients are stored in the SRA.
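The row latency quoted above can be worked through numerically. The sketch below simply evaluates 2N_1 + 12 parallel clock cycles and converts the result to serial clock cycles at 23 serial cycles per parallel cycle. It assumes N_1 = 9, taking the nine CORDIC stages mentioned in Section 3 as the iteration count, which the text does not state explicitly; the function name `row_latency` is hypothetical.

```python
def row_latency(n_iterations, serial_per_parallel=23):
    """Return (parallel cycles, serial cycles) needed for one row of a block.

    Uses the 2*N_1 + 12 expression and the 23:1 clock ratio from the text.
    """
    parallel_cycles = 2 * n_iterations + 12
    serial_cycles = parallel_cycles * serial_per_parallel
    return parallel_cycles, serial_cycles

if __name__ == "__main__":
    # Assumption: N_1 = 9 (nine unrolled CORDIC stages).
    parallel, serial = row_latency(9)
    print(parallel, serial)  # 30 parallel cycles, 690 serial cycles
```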
IMPLEMENTATION AND DISCUSSION
The proposed RT-DCT generator architecture was implemented in functional VHDL, and the code was verified using the ModelSim tool. The standard-cell ASIC design flow with the OSU (Oklahoma State University) standard cell library was followed for the hardware implementation. The proposed architecture was synthesized using the BGX_shell tool in TSMC 0.18 μm technology. The layout was done using the Cadence SOC Encounter tool. The frames are divided into blocks of size 8×8. The proposed RT-DCT generator is compared to the one in [6] and to the same lattice architecture of [6] combined with the unrolled parallel CORDIC architecture of [7]. The implementation results in this section will aid in calculating the speed and the cost of the whole FD-ME system [2] in future work.
The gate count of the proposed RT-DCT generator is shown in Table 1. This count includes the gates used in the Lattice-CORDIC bank, the CSR bank, the SRA bank, and the control unit. The table shows that using the unrolled parallel CORDIC of [7] improves the throughput of the RT-DCT generator compared to the one in [6]. However, the RT-DCT generator using the CORDIC of [7] has higher power consumption, area, and gate count. This means that the RT-DCT generator in [6] suits applications that prioritize reduced area and power consumption over encoding speed (for example, mobile applications with limited resources, such as cellular phones, MID, UMPC, and PDA), whereas the RT-DCT generator that uses the unrolled parallel CORDIC of [7] suits applications such as DTV, which prioritize high-speed processing over reduced area.
The proposed RT-DCT generator achieves the same throughput as the fast RT-DCT generator based on [7], while keeping its area, power consumption, and gate count close to those of the simple design in [6]. This gives the proposed RT-DCT generator superior performance in applications that target high speed while maintaining low power and area.
With a maximum operating frequency of 118 MHz, the proposed RT-DCT generator is recommended for multimedia applications that favor high-speed processing in exchange for an acceptable increase in area and power. Finally, the chip layouts of the proposed RT-DCT generator and of those in [7] and [6] are shown in Figure 7.
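To put the 118 MHz figure in context, the following back-of-the-envelope sketch estimates how many clock cycles are available per 8×8 block for 4CIF video at 60 fps. It rests on assumptions not stated in the paper: 4CIF is taken as 704×576 luma samples, chroma is ignored, and every clock cycle is assumed to be usable. It does not model the actual pipeline timing of the proposed architecture, and the helper names are hypothetical.

```python
def blocks_per_frame(width, height, block=8):
    """Number of block x block tiles in a frame (dimensions assumed divisible by block)."""
    return (width // block) * (height // block)

def cycles_per_block(clock_hz, width, height, fps, block=8):
    """Clock cycles available per block if every cycle is usable."""
    blocks_per_second = blocks_per_frame(width, height, block) * fps
    return clock_hz / blocks_per_second

if __name__ == "__main__":
    # Assumption: 4CIF = 704 x 576 luma samples, 8 x 8 blocks, 60 fps, 118 MHz clock.
    print(blocks_per_frame(704, 576))               # 6336 blocks per frame
    print(cycles_per_block(118e6, 704, 576, 60))    # roughly 310 cycles per block
```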
CONCLUSION
The RT-DCT generator is the main module in the FDME system that reduces the overall encoding time. In this paper, an efficient RT-DCT generator is proposed to speed up the encoding process in the frequency domain. The proposed architecture achieves approximately 58%, 39%, and 50% reductions in gate count, power consumption, and area, respectively, compared to the conventional state-of-the-art pipelined RT-DCT generators that use the unrolled parallel CORDIC. Our architecture can perform real-time transformation for 4CIF video at 60 fps at 118 MHz when integrated into the whole FDME system.
REFERENCES

