In this paper, we propose a pipelined ISR-ANT ME design that further shortens the critical path of the ISR-ANT ME circuit [1] so that it can operate at a lower supply voltage. The proposed design is simple: pipeline registers are inserted to cut the critical path of the ISR-ANT ME into two sections. The PSNR performance is maintained, and the lowest supply voltage is reduced to 1.2 V in the proposed design.
I. INTRODUCTION
The demand for modern wireless multimedia transmission systems is increasing rapidly, and low-power, high-quality services are required. The capability to support continuous high-quality image transmission and portability therefore has to be enhanced, while following next-generation wireless multimedia communication standards such as 4G mobile systems and Digital Video Broadcast (DVB) [1]-[3]. Motion estimation (ME) circuits are the critical computing units of video multimedia systems. For high-quality video to play smoothly, a large quantity of multimedia data must be transmitted with the aid of an ME circuit that offers low power consumption and efficient computation [1]-[3].
In this paper, the ME algorithm is based on the three-step search, and the proposed ME architecture improves on the existing input subsampled replica algorithmic noise-tolerant (ISR-ANT) ME circuit [1]. Our goal is to further shorten the critical path of the ISR-ANT ME circuit so that it can operate at a lower supply voltage.
According to the power consumption formula $P = f C V_{DD}^{2}$, the most effective way to lower the power consumption at a given frequency is to lower the operating voltage. However, reducing the operating voltage also increases the propagation delay of the circuit, according to the propagation delay equation [1], [5], [6]. As the operating voltage decreases, computation through the critical path may therefore take longer than the sampling period and cause data sampling errors. Based on the extra computing time that the ISR-ANT ME circuit obtains by sacrificing sampling frequency, a pipeline design is used to further shorten the critical path and thus decrease the power consumption at the same computation throughput.
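As a rough illustration of the quadratic dependence on supply voltage, the following sketch evaluates the dynamic power formula at 1.4 V and 1.2 V with the frequency and capacitance held fixed. The function name and the frequency and capacitance values are hypothetical and chosen only for illustration; the sketch covers the dynamic switching term only and is not a measured result.

```python
def dynamic_power(freq_hz, cap_farad, vdd_volt):
    """Dynamic switching power P = f * C * VDD^2, in watts."""
    return freq_hz * cap_farad * vdd_volt ** 2


# Hypothetical operating point (assumed values, not taken from the paper).
FREQ_HZ = 100e6
CAP_FARAD = 10e-12

p_high = dynamic_power(FREQ_HZ, CAP_FARAD, 1.4)
p_low = dynamic_power(FREQ_HZ, CAP_FARAD, 1.2)

# The ratio depends only on the two voltages: (1.2 / 1.4) ** 2.
ratio = p_low / p_high
print(f"power at 1.2 V relative to 1.4 V: {ratio:.3f}")
```

Under this formula alone, the ratio is (1.2/1.4)^2, about 0.73, regardless of the assumed f and C.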
II. PIPELINED ISR-ANT ME DESIGN
The smooth motion of objects in films is produced by the persistence of human vision when static pictures are played without interruption. Accordingly, the displacement of a moving object between frames is usually small unless the whole scene changes. This property, the negligible difference between the positions of an object in two consecutive images of a video, is exploited to decrease the amount of transmitted data significantly. Many motion estimation methods that reconstruct an image from the difference between two images have been proposed; these methods reduce the transmitted data from the whole original color picture to a few difference values. The reduction in the amount of data achieved by motion estimation techniques can therefore be significant.
Many motion estimation search algorithms trade off video quality against transmission rate; a common one is the three-step search method [3]-[4], which is employed as the basis for the motion estimation algorithm presented in this paper. The search window of the three-step search method is illustrated in Fig. 1(a). First, the step size △ is defined; it is usually half the size of the search window. To find the motion vector, the image is divided into blocks, and the block most similar to the target block is sought. The candidate blocks include the target block in the current macroblock and the eight blocks around it, centered on the target block, at a distance △. Nine sum of absolute differences (SAD) values, one for each of the nine candidate blocks, are obtained according to equation (1), which accumulates over the pixel positions (i, j) of a block the absolute difference between the pixel value at (i, j) and the corresponding pixel information of the previous image. The SAD values obtained from (1) are used to select the candidate block most similar to the target block. The smaller the SAD value, the more similar the two blocks are. The candidate block with the minimum SAD is taken as the matching block for the target block.
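The SAD-based selection can be sketched as follows. This is a minimal illustration, assuming blocks are given as equally sized 2-D lists of pixel values; the function names `sad` and `best_candidate` and the sample blocks are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def best_candidate(target, candidates):
    """Return (index of the minimum-SAD candidate, list of all SAD values)."""
    sads = [sad(target, c) for c in candidates]
    index = min(range(len(sads)), key=sads.__getitem__)
    return index, sads


# Hypothetical 2x2 blocks, used only to exercise the functions.
target = [[10, 20], [30, 40]]
candidates = [
    [[0, 0], [0, 0]],
    [[10, 21], [29, 40]],
    [[50, 50], [50, 50]],
]
index, sads = best_candidate(target, candidates)
print(index, sads)  # 1 [100, 2, 100]
```

The second candidate has the smallest SAD and would be taken as the matching block.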
2nd International Symposium on Computer, Communication, Control and Automation (3CA 2013)
In the second step, the first step is repeated with half the value of △ replacing the original △ and with y_a[min] as the new center, and a new set of SAD values is calculated. Likewise, the third step repeats the first step, but the distance between the candidate blocks and the target block is half the step distance used in the second step. After the three-step search, the candidate block most similar to the target block is obtained, and the vector between these two blocks is the motion vector, which can be used to reconstruct the current image instead of transmitting whole pictures. The three-step search method can therefore save a considerable amount of data.
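The three passes can be sketched in software as below. This is an illustrative model under stated assumptions, not the circuit described in this paper: frames are 2-D lists, the block size and initial step of 4 are assumed defaults, candidates that fall outside the frame are skipped, and ties are resolved in favor of the first candidate visited. All function and variable names are hypothetical, and the test frames are a synthetic ramp rather than real video.

```python
def block_sad(cur, ref, top, left, dy, dx, size):
    """SAD between the block of cur at (top, left) and the block of ref
    displaced by (dy, dx) from the same position."""
    return sum(
        abs(cur[top + i][left + j] - ref[top + dy + i][left + dx + j])
        for i in range(size)
        for j in range(size)
    )


def three_step_search(cur, ref, top, left, size=4, step=4):
    """Return ((dy, dx), sad) found by a three-step search."""
    height, width = len(ref), len(ref[0])
    center, best_cost = (0, 0), None
    for _ in range(3):
        best = None
        # Nine candidates: the current center and its eight neighbors.
        for oy in (-step, 0, step):
            for ox in (-step, 0, step):
                dy, dx = center[0] + oy, center[1] + ox
                inside = (0 <= top + dy <= height - size
                          and 0 <= left + dx <= width - size)
                if not inside:
                    continue  # candidate block falls outside the frame
                cost = block_sad(cur, ref, top, left, dy, dx, size)
                if best is None or cost < best[0]:
                    best = (cost, (dy, dx))
        best_cost, center = best
        step = max(1, step // 2)  # halve the step for the next pass
    return center, best_cost


# Synthetic frames: cur is ref displaced by (TRUE_DY, TRUE_DX).
H = W = 24
TRUE_DY, TRUE_DX = 4, -3
ref = [[x + 16 * y for x in range(W)] for y in range(H)]
cur = [[(x + TRUE_DX) + 16 * (y + TRUE_DY) for x in range(W)] for y in range(H)]

mv, cost = three_step_search(cur, ref, top=8, left=8)
print(mv, cost)  # (4, -3) 0
```

With an initial step of 4, the passes use distances 4, 2 and 1, so only 27 candidate SADs are evaluated at most, instead of one for every displacement in the search window.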
When the supply voltage is lowered, an ISR-SAD is employed as an estimator for the main SAD (MSAD) block of the ME circuit to keep the quality of the output image acceptable [1], as shown in Fig. 2(a). The sampling frequency of ISR-SAD is m times lower than that of MSAD, which allows the ME circuit to work at a low voltage. Because of the lower clock frequency, ISR-SAD also performs far fewer computations than MSAD; for example, if there are 100 pixel values at the input of ISR-ANT, the number of computations is reduced to 100/m. In addition, the timing errors caused by the low voltage are avoided because the time interval between computations of the absolute-difference block is longer. The operation of ISR-SAD may therefore be considered error-free when the supply voltage is lowered.
In the ISR-ANT ME design presented in [1], the supply voltage is lowered by increasing the operating rate or lowering the sampling frequency, which reduces the power consumption. To further enhance the throughput of the ISR-ANT ME circuit, we propose a pipelined ISR-ANT ME circuit that improves the operating speed by inserting additional D-type flip-flops. At the same throughput rate, the operation time available to each pipeline unit can be twice as long as in the previous ISR-ANT ME design. Therefore, the proposed pipelined ISR-ANT ME circuit can operate at a lower supply voltage without sacrificing the number of image samples.
In the proposed pipelined ISR-ANT ME design, a D flip-flop is added on the feedback path between the accumulators and on the path between the accumulator and the AD block to divide and shorten the critical path. The pipelined ISR-ANT ME architecture is shown in Fig. 2(b). With the pipeline arrangement, the critical path between two D flip-flops in the pipelined ISR-ANT ME circuit is shorter, so the propagation delay, which grows as the voltage decreases, is also reduced. As a result, the power consumption can be further lowered by operating at a lower supply voltage.
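The effect of a register between the AD block and the accumulator can be illustrated with a cycle-level behavioral model. This is a simplified sketch that models only that one register, not the feedback-path register or the actual circuit of Fig. 2(b); the function names, the `ad_reg` variable and the sample pixel values are hypothetical.

```python
def sad_unpipelined(cur_pixels, ref_pixels):
    """AD and accumulation happen within the same clock cycle."""
    acc = 0
    trace = []
    for c, r in zip(cur_pixels, ref_pixels):
        acc = acc + abs(c - r)  # AD and adder form one combinational path
        trace.append(acc)
    return trace


def sad_pipelined(cur_pixels, ref_pixels):
    """A register after AD splits the path into two shorter stages."""
    acc = 0
    ad_reg = 0  # hypothetical pipeline register between AD and accumulator
    trace = []
    # One extra cycle with a zero input flushes the last registered value.
    for c, r in list(zip(cur_pixels, ref_pixels)) + [(0, 0)]:
        acc = acc + ad_reg   # stage 2: uses the value registered last cycle
        ad_reg = abs(c - r)  # stage 1: result captured at this clock edge
        trace.append(acc)
    return trace


cur_px = [5, 9, 2, 7]
ref_px = [3, 9, 6, 1]
plain = sad_unpipelined(cur_px, ref_px)
piped = sad_pipelined(cur_px, ref_px)
print(plain)  # [2, 2, 6, 12]
print(piped)  # [0, 2, 2, 6, 12]
```

In this model the pipelined accumulator produces the same sums one cycle later, while each stage contains only part of the original combinational path.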
To confirm that the power consumption of the pipelined ISR-ANT ME design is lower than that of other designs, the ME circuit architecture was implemented in Verilog using ISR-ANT, pipelined ISR-ANT, and other methods. At the chip level, Design Compiler and SoC Encounter were used for synthesis and routing, and the performance was then analyzed with Nano-sim. At the system level, the performance was analyzed with a C# program.
The image pixel values arrive as a series of data streams. To accumulate exactly 256 pixels at a time, a control unit is added to the circuit. The output of the control unit, called count, is a data stream, and the control unit resets the accumulator output yout to zero when count equals 0. The control unit is triggered at the positive edges of the clock signal Clk, counting from 0 to 255 and then returning to zero cyclically, so that exactly 256 pixels are accumulated. Because the output of the accumulator is still a continuous data stream after being sampled by a DFF, it has to be divided into nine blocks, as the three-step search requires, to find the block most similar to a moving object. An additional circuit that separates the 256 pixels into nine blocks is therefore built into the control unit.
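The counting and reset behavior of the control unit can be modeled in software as follows. This is a behavioral sketch only, assuming one absolute-difference value per clock edge; it does not model the separation into nine blocks, and the function name and test data are hypothetical. The names `count` and `yout` follow the signals named in the text.

```python
def accumulate_blocks(abs_diffs, block_len=256):
    """Return one accumulated sum for every block_len input values."""
    sums = []
    count = 0  # cycles through 0 .. block_len - 1
    yout = 0   # accumulator output
    for d in abs_diffs:
        if count == 0:
            yout = 0  # control unit clears the accumulator at count == 0
        yout += d
        count = (count + 1) % block_len
        if count == 0:
            sums.append(yout)  # exactly block_len values were accumulated
    return sums


# Two hypothetical 256-pixel blocks with constant absolute differences.
stream = [1] * 256 + [2] * 256
print(accumulate_blocks(stream))  # [256, 512]
```

A trailing partial block is not emitted in this sketch, since fewer than 256 values have been accumulated for it.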
The function block of the pipelined ISR-ANT circuit is shown in Fig. 3(a). The left part is the MSAD block, which is the basic ME architecture; the right part is the estimator block, whose operating frequency is four times lower than that of MSAD. Because the computation time is long enough to avoid the timing errors caused by the low supply voltage, ISR-SAD is considered an error-free circuit.
To reduce the input frequency of ISR-SAD, a frequency divider in the estimator block divides the frequency by four. Because of the lowered clock frequency, the accumulator is designed to accumulate only 64 times. With a quarter of the accumulations, the accumulated value is four times smaller than that of the MSAD block, so the D flip-flop is employed to multiply the output by four before it enters the bracket circuit for distribution.
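The relation between the full SAD and the subsampled estimate can be sketched as follows, assuming the estimator sees every fourth pixel pair of a 256-pixel block and scales its sum by four. The function names and the pixel data are hypothetical and serve only to show that the estimate approximates, but generally does not equal, the full SAD.

```python
def msad(cur, ref):
    """Full SAD over all pixel pairs (256 accumulations for 256 pixels)."""
    return sum(abs(c - r) for c, r in zip(cur, ref))


def isr_sad(cur, ref, m=4):
    """Subsampled estimate: accumulate every m-th pair, then scale by m."""
    partial = sum(abs(c - r) for c, r in zip(cur[::m], ref[::m]))
    return partial * m


# Hypothetical 256-pixel block with a ramp of absolute differences.
cur = list(range(256))
ref = [0] * 256

print(len(cur[::4]))      # 64 accumulations in the estimator
print(msad(cur, ref))     # 32640
print(isr_sad(cur, ref))  # 32256
```

For this ramp the estimate differs from the full SAD by 384, which is the kind of small deviation that a threshold comparison between the two values has to tolerate.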
After the multiplication by four, the bracket outputs of both ISR-SAD and MSAD are received by the decision block, whose circuit is shown in Fig. 3(b). The absolute value Diff of the difference between the two inputs from the MSAD and ISR-SAD blocks is computed and compared with the threshold value Th. If Diff is greater than Th, the MSAD output is assumed to contain a timing error and y[i]_ISR_SAD is selected as the output; otherwise, y[i]_MSAD is the output. Finally, pipeline D flip-flops are inserted into both the main block and the estimator of the ISR-ANT ME to divide the critical path into two sections. Cutting the critical path shortens the propagation delay; hence, the pipelined ISR-ANT ME can operate at a lower supply voltage.
III. PERFORMANCE COMPARISONS
To verify and demonstrate the low-voltage merit of the proposed pipelined ISR-ANT ME design, we adopt two video sequences for testing: one is the flower garden video, and the other is a recording of a baseball game. To demonstrate the merits of low-voltage operation and low power consumption, we implement the ISR-ANT ME and pipelined ISR-ANT ME chips in the TSMC 0.18 µm CMOS process, as illustrated in Fig. 4. Most of the flower garden video is static, whereas the baseball recording was chosen to represent images with fast movement and more complex color. To extract images from the videos, a YUV viewer and picture-capturing software were used to save two consecutive frames as two image files of 352×288 pixels; every pixel was converted into 8-bit image information and written to the image files by the C# program. The experimental results are illustrated in Fig. 5.
Fig. 4 Chip layout for ISR-ANT ME and the pipelined ISR-ANT ME
ISR-ANT ME: Area W/L = 723.495/706.8, lowest VDD: 1.4 V
Pipelined ISR-ANT ME: Area W/L = 733.87/732.04, lowest VDD: 1.2 V
According to the analysis results, the peak signal-to-noise ratio (PSNR) of the ISR-ANT ME and pipelined ISR-ANT ME designs is roughly identical. The differences lie in chip area and operating speed. Under a lower supply voltage, the output of the circuit operating at the low frequency might be selected as the final output. Because the pipelined ISR-ANT ME design can operate faster than the ISR-ANT ME design, it can operate reliably under a lower supply voltage. The only sacrifice in the pipelined ISR-ANT design is latency.
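For reference, PSNR for 8-bit images is commonly computed from the mean squared error between the original and the reconstructed pixels. The sketch below assumes flattened lists of 8-bit pixel values and a peak value of 255; the function name and sample data are hypothetical, and this is not the C# analysis program used for the reported results.

```python
import math


def psnr(original, reconstructed, peak=255):
    """PSNR in dB between two equally sized sequences of pixel values."""
    n = len(original)
    mse = sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / n
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)


# Hypothetical four-pixel example with two pixels off by one level.
original = [100, 100, 100, 100]
reconstructed = [100, 101, 99, 100]
value = psnr(original, reconstructed)
print(f"{value:.2f} dB")
```

A higher PSNR means the reconstructed image is closer to the original, so roughly identical PSNR values indicate comparable output quality.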
IV. CONCLUSION
The PSNR of the previously proposed ISR designs could only be improved to 24.32 dB, and their operating voltage could only be reduced to 1.4 V. With the architecture proposed in this paper, the ISR-based circuit with two additional D flip-flops operates at a supply voltage further reduced to 1.2 V without sacrificing PSNR, and its power consumption is half that of the conventional ME.
