INTRODUCTION
Frequency Domain Motion Estimation (FDME) is one of the main techniques for speeding up the encoding process; it saves much of the computation in the Motion Estimation (ME) process [1, 2]. Figure 1 illustrates the whole FDME encoder. The Transform Coding (TC) module produces the four transforms required to perform the Frequency Domain Motion Estimation. These transforms are the two-dimensional Discrete Cosine Cosine Transform (2D-DCT), the two-dimensional Discrete Cosine Sine Transform (2D-DCST), the two-dimensional Discrete Sine Cosine Transform (2D-DSCT), and the two-dimensional Discrete Sine Sine Transform (2D-DST) [2]. The TC coefficients of the previous frame are stored in the external memory (EXT-MEM). The Manipulation Unit Engine (MUE) converts the unmatched transformed search area to a matched transformed search area [3]. The Dynamic Padding FDME (DP-FDME) module obtains the best-match motion vectors by processing the TC coefficients of the matched search area. Several algorithms are used to calculate the TC coefficients [4, 5]. The time recursive lattice structure [6] [7] [8] saves both hardware and computation for several reasons. First, it can generate dual transforms using a single module for each pair of transform coefficients [6]. Second, the hardware required for generating the dual transforms is identical, which makes the lattice architecture easy and simple to design. Third, the architecture requires no global communication. Finally, the dual lattice architecture requires fewer multiplications [6].
http://dx.doi.org/10.12785/ijcds/040202 http://journals.uob.edu.bh
However, implementing the multiplications of the lattice structure remains the main concern. The Coordinate Rotation Digital Computer (CORDIC) algorithm is one of the best solutions for reducing the complexity of these multiplications. Bit-serial iterative, bit-parallel iterative, and unrolled architectures are examples of the architectures used to implement the CORDIC algorithm [5].
An efficient modified recursive hardware architecture for the CORDIC algorithm is proposed in this paper. This modification speeds up the encoding process by reducing both the computational and the hardware complexity. A fast CORDIC lattice architecture, which mixes parallel and serial operations, is implemented and integrated into the TC module. One of the main benefits of the proposed TC generator is the optimized logic used in the design, which reduces the power consumption and the chip area and increases the speed of the video encoding process. The proposed TC generator can be integrated into state-of-the-art Frequency Domain Motion Estimation encoders that may be used in video applications such as Ultra-Mobile Personal Computers (UMPC), Mobile Internet Devices (MID), Personal Digital Assistants (PDA), cellular phones, and wireless video surveillance systems.
The rest of the paper is organized as follows. The concept of the lattice structure is explained in section 2. Section 3 introduces the proposed CORDIC architecture. The proposed Lattice-CORDIC architecture is presented in section 4. The entire TC generator is described in section 5. Section 6 explains the hardware implementation. Section 7 concludes the paper.
TIME RECURSIVE LATTICE STRUCTURE
The time recursive lattice structure was proposed in [6, 7]. It is an efficient structure for generating the transform coefficients needed for the Frequency Domain Motion Estimation (FDME) process. As mentioned before, the four transforms needed for the FDME process are 2D-DCT, 2D-DCST, 2D-DSCT, and 2D-DST. The main idea behind such recursive architectures is to use a one-dimensional (sequential) input to recursively determine two two-dimensional sets of transform coefficients. The structure in Figure 2 generates the one-dimensional DST and DCT that are used for generating the two-dimensional transform coefficients mentioned above. The one-dimensional DST is used as the input to the structure in Figure 3 to generate the 2D-DST and 2D-DCST coefficients. Similarly, the 2D-DCT and 2D-DSCT coefficients are generated by the structure in Figure 4 with the one-dimensional DCT as its input.
Multiplications by sine and cosine functions are a major challenge in the lattice architecture. In this work, we propose an efficient architecture to overcome this challenge. The following sections present the proposed architectures.
BIT SERIAL CORDIC ARCHITECTURE FOR REAL TIME VIDEO
The Coordinate Rotation Digital Computer (CORDIC) is an iterative algorithm that computes several functions using only add and shift operations [8, 10, 13, 14]. It is a simple recursive algorithm that can calculate many functions, such as logarithmic and sinusoidal functions, at low hardware cost because it converts multiplications into shift and add operations. Therefore, the CORDIC algorithm can be integrated into the previous lattice architectures to convert the multiplications by sin(θ) and cos(θ) into shift and add operations only. The basic idea of the CORDIC algorithm is to decompose the desired rotation angle θ into a weighted sum of predefined elementary rotation angles such that the rotation through each of them can be accomplished with simple shift and add operations [8, 10]. The desired angle can be defined as:
$$\theta = \sum_{i=0}^{N-1} \sigma_i \alpha_i \tag{1}$$
where $\sigma_i \in \{-1, +1\}$ represents the direction of the rotation through the elementary angle $\alpha_i$, either positive or negative.
Consider a vector $v = x + jy$ in the Cartesian plane. Rotating $v$ by an angle $\theta_i = \sigma_i \alpha_i$ is obtained by multiplying $v$ by another vector $e^{j\theta_i} = \cos\theta_i + j\sin\theta_i$. The product becomes:
$$v' = x' + jy' \tag{2}$$
where:
$$x' = \cos\theta_i \,(x - y\tan\theta_i), \qquad y' = \cos\theta_i \,(y + x\tan\theta_i) \tag{3}$$
If the elementary rotation angles are chosen such that $\tan\theta_i = \pm 2^{-i}$, then the multiplication inside the parentheses is avoided and reduced to a simple shift operation [8]. Equation (3) can be re-formulated as follows:
$$X_{i+1} = K_i \,(X_i - \sigma_i 2^{-i} Y_i) \tag{4}$$
$$Y_{i+1} = K_i \,(Y_i + \sigma_i 2^{-i} X_i) \tag{5}$$
where $K_i = 1/\sqrt{1 + 2^{-2i}}$ and $\alpha_i = \tan^{-1}(2^{-i})$. The value of $\sigma_i$ is calculated from the original angle according to the following equation:
$$Z_{i+1} = Z_i - \sigma_i \tan^{-1}(2^{-i}) \tag{6}$$
where $Z_i$ has the initial value $Z_0$. Additionally, the direction used in equations (4) and (5) for the next iteration is decided by the value of $Z_{i+1}$: if $Z_{i+1} < 0$, then $\sigma_{i+1} = -1$; otherwise $\sigma_{i+1} = +1$. In this paper, the speed of the CORDIC architecture is increased at the cost of an acceptable degradation in power consumption and area.
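The iteration described by equations (4) to (6) can be illustrated with a short behavioral sketch in Python. It is not the authors' hardware: the function name `cordic_rotate`, the 20-bit fixed-point fraction, and the default of 16 iterations are assumptions made only for this example, and the gain is compensated once at the end in floating point rather than inside each iteration.

```python
import math

def cordic_rotate(x0, y0, z0, n_iter=16, frac_bits=20):
    """Rotate (x0, y0) by z0 radians using only shifts and adds per iteration.

    Hypothetical helper: fixed-point format and iteration count are assumptions.
    The angle must lie within the CORDIC convergence range (about +/-1.74 rad).
    """
    scale = 1 << frac_bits
    x = int(round(x0 * scale))
    y = int(round(y0 * scale))
    z = int(round(z0 * scale))
    # Elementary angles arctan(2^-i) in fixed point (a small lookup table in hardware).
    atan_table = [int(round(math.atan(2.0 ** -i) * scale)) for i in range(n_iter)]
    for i in range(n_iter):
        sigma = -1 if z < 0 else 1           # rotation direction from the residual angle
        x_next = x - sigma * (y >> i)        # shift replaces multiplication by 2^-i
        y_next = y + sigma * (x >> i)
        z -= sigma * atan_table[i]           # update the residual angle
        x, y = x_next, y_next
    # Compensate the accumulated gain (about 0.607 for a large iteration count).
    gain = 1.0
    for i in range(n_iter):
        gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return x * gain / scale, y * gain / scale

# Rotating the unit vector (1, 0) by 30 degrees approximates (cos 30, sin 30).
cx, sy = cordic_rotate(1.0, 0.0, math.pi / 6)
```

Starting from (1, 0), the two outputs approximate the cosine and the sine of the input angle, which is how the lattice multiplications by sin(θ) and cos(θ) can be replaced.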
As seen in Figure 5, the proposed CORDIC architecture calculates $X_n$ and $Y_n$ as functions of the sine and cosine of an angle $Z_0$. The inputs $X_0$ and $Y_0$ are 22 bits wide; the angle $Z_0$ is 12 bits wide. The idea is to run N iterations of the CORDIC algorithm on the inputs $X_0$, $Y_0$, and $Z_0$ until $X_n$ and $Y_n$ converge to accurate values. The higher the number of iterations, the higher both the accuracy of $X_n$ and $Y_n$ and the computational complexity of the hardware design. In this paper, the complexity and the speed of the CORDIC design are improved by serializing both the adders and the shifters. Figure 6 illustrates a single iteration stage of the proposed CORDIC architecture. Each stage consists of shift registers that send the input data serially to serial adders. Serial adders are simple, fast, and easy to implement. Two shift registers are used for the angle calculations. The rightmost one stores the elementary rotation angle used in equation (6). The other one stores the value of the angle $Z_0$. All values are entered serially into the corresponding shift registers. A two-input multiplexer is used with the angle $Z_0$ because its 12 bits must be followed by 10 additional zeros for the remaining least significant bits. The output of the angle $Z_0$ register is used to select addition or subtraction for the current iteration stage. Whenever the output of the first stage is ready, the control unit (CU 0) initiates the second stage, which starts processing the outputs of the first stage. This operation is repeated until the last iteration stage starts processing. The final outputs $X_n$ and $Y_n$ must be scaled (multiplied by 0.607). This multiplication can be performed using add/shift operations.
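To make the serial datapath more concrete, the following Python sketch models a bit-serial adder and the zero padding of the angle word. It is a behavioral illustration only: the paper gives no gate-level description, so the LSB-first bit order, the function names `serial_add` and `pad_angle`, and the dropped carry-out are assumptions; only the 22-bit and 12-bit widths come from the text.

```python
def serial_add(a, b, width=22):
    """Add two `width`-bit words one bit per clock cycle, LSB first (assumed order)."""
    carry = 0                      # models the single carry flip-flop of a serial adder
    result = 0
    for clk in range(width):
        bit_a = (a >> clk) & 1     # bit shifted out of the first shift register
        bit_b = (b >> clk) & 1     # bit shifted out of the second shift register
        sum_bit = bit_a ^ bit_b ^ carry
        carry = (bit_a & bit_b) | (carry & (bit_a ^ bit_b))
        result |= sum_bit << clk   # bit shifted into the result register
    return result                  # final carry is dropped, so the sum wraps modulo 2**width

def pad_angle(z0_12bit):
    """Extend a 12-bit angle to 22 bits by appending 10 zero LSBs."""
    return (z0_12bit & 0xFFF) << 10

total = serial_add(5, 9)           # one full-adder cell reused for 22 clock cycles
```

The point of the sketch is the trade-off named in the text: one full-adder cell and one flip-flop are reused over 22 serial clock cycles instead of instantiating a 22-bit parallel adder.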
LATTICE-CORDIC HARDWARE ARCHITECTURE
We integrate the proposed serial CORDIC architecture into the lattice structures shown in Figures 2, 3, and 4. Figure 7 shows the proposed Lattice-CORDIC hardware; it consists of 6 modules. The Input Shift Register (ISR) has N+1 registers, each 9 bits wide. The ISR loads the input pixels of each row of the processed block. Serial adders are used to raise the clock frequency and to further reduce both the area and the power consumption of the Lattice-CORDIC hardware architecture. The serial design proposed in section 3 is used to implement the CORDIC 1 and CORDIC 2 processors. The angle shift registers (Angle1_SR and Angle2_SR) store the angles required for CORDIC 1 and CORDIC 2. The desired angle $Z_0$ is a function of k and N, where N is the number of pixels in a row of the processed block and k = 0, 1, 2, …, N.
THE PROPOSED COMBINATIONAL-DCT (C-DCT) GENERATOR
The Transform Coding (TC) generator has three groups of Lattice-CORDIC architectures, as shown in Figure 8. Group 1 combines the lattice structure presented in Figure 2 with the proposed CORDIC architecture to generate the one-dimensional DCT and DST. Group 2 generates the two-dimensional DCT and DSCT from the cosine transform produced by group 1. Similarly, group 3 generates the two-dimensional DST and DCST from the sine transform produced by group 1.
To calculate the TC coefficients, the main processor biases the TC generator and loads the block to be processed into the external memory. The operation feeds the pixels of the first row of that block to Lattice-CORDIC group 1. From this first row, only one pixel at a time is fed to each Lattice-CORDIC module of group 1. The number of pixels in a row equals the number of Lattice-CORDIC modules in group 1. $N_1$ is the number of CORDIC iterations, and one parallel clock cycle equals 23 serial clock cycles. Producing the intermediate one-dimensional transform coefficients takes $2N_1 + 12$ parallel clock cycles. The output of group 1 is loaded into the odd and even Circular Shift Registers (CSR odd and CSR even), and the pixels of a new row are processed. CSR odd and CSR even assign a sign bit at the MSB. Then the values in CSR odd and CSR even are fed to group 3 and group 2, respectively, as seen in Figure 8. Afterward, the Circular Shift Registers circulate. Group 2 and group 3 take $2N_1 + 12$ clock cycles to finish the first row of the C-DCT transform coefficients. This process continues until all C-DCT coefficients are saved into the SRA.
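The cycle counts quoted above can be turned into a small calculation. The Python sketch below only restates the two figures from the text (2N 1 + 12 parallel cycles per group, 23 serial cycles per parallel cycle); the function name `group_latency` and the example value of 10 iterations are hypothetical, since the number of iterations is not given in this section.

```python
SERIAL_CYCLES_PER_PARALLEL_CYCLE = 23   # ratio stated in the text

def group_latency(n_iterations):
    """Return (parallel_cycles, serial_cycles) for one Lattice-CORDIC group.

    Hypothetical helper: applies the 2*N1 + 12 figure from the text.
    """
    parallel_cycles = 2 * n_iterations + 12
    serial_cycles = parallel_cycles * SERIAL_CYCLES_PER_PARALLEL_CYCLE
    return parallel_cycles, serial_cycles

# Example with an assumed N1 = 10 (illustrative value, not from the paper).
parallel, serial = group_latency(10)
```

With the assumed 10 iterations, this gives 32 parallel cycles, or 736 serial cycles, for group 1 to produce one set of intermediate coefficients.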
IMPLEMENTATION AND DISCUSSION
Functional VHDL is used to implement the proposed C-DCT generator architecture, and Modelsim is used to verify the code. The BGX_shell tool with TSMC 0.18 µm technology is used to synthesize the proposed architecture. The Encounter tool (from Cadence) is used to generate the layout. The frames are divided into blocks of size 16×16. The proposed TC generator is compared to those in [11] and [12]. The gate count of the proposed TC generator in Table 1 includes the gates used in the Lattice-CORDIC bank, the CSR bank, the SRA bank, and the Control Unit. The results in Table 1 are obtained using the ASIC design flow of Figure 9 [15]. Table 1 shows that, compared to the architecture in [12], the architecture in [11] suits applications that need low power and area rather than high speed. The proposed TC generator consumes less power than the architectures in [11] and [12] and has a smaller area and gate count than the architecture in [12]. However, its area and gate count are higher than those of the architecture in [11]. This larger area is compensated by the higher operating frequency of the proposed architecture compared to those in [11] and [12]. The maximum operating frequency of the proposed TC generator is 123.21 MHz. This allows the proposed architecture to be used in real-time multimedia applications that target higher data transmission speeds.
CONCLUSION
An efficient, fast TC generator is proposed in this paper; it speeds up the encoding process in the frequency domain. The TC generator consumes most of the encoding time in the frequency domain. Consequently, speeding up the TC generator makes real-time transformation achievable. Compared to up-to-date pipelined TC generators, the proposed architecture reduces the power consumption and area. The proposed architecture performs real-time transformation for real-time video applications at 123.21 MHz when integrated into the whole FDME system.
Figure 2. One-dimensional DCT and DST lattice structure [6].
