In the Rate-Distortion Optimization (RDO) 
INTRODUCTION
H.264/AVC is the state-of-the-art video compression standard standardized by ITU-T and ISO/IEC [1]. It achieves significantly better coding performance than earlier video coding standards, e.g. about 50% bit-rate reduction compared with MPEG-2. A large number of coding options have been included in the prediction modules (intra and inter) to achieve this coding efficiency. The rate-distortion optimization (RDO) [2] technique, which maximizes coding quality while minimizing the number of coded bits, is usually employed in video encoder implementations to achieve the best coding result. However, RDO demands high computational complexity in H.264/AVC encoders, since it exhaustively examines all intra- and inter-prediction modes, performing a complete encoding process for each mode. Up to 90% of the total encoding time may be spent in the mode decision stage [3]. Figure 1 shows the diagram of the RDO-based mode decision. The grey blocks are executed once for each prediction mode, while the mode decision block (in white) receives the bit-rates and distortions of all candidate modes (dashed lines) and selects the mode with the best trade-off between bit-rate (R) and distortion (D). In intra-frame prediction for the luminance layer, there are two possible block partitions to encode one macroblock (MB): (1) I16MB, with four possible prediction modes applied to the whole MB (16x16 samples), and (2) I4MB, with nine possible prediction modes applied to each of the sixteen 4x4 blocks that compose the MB. For the chrominance layer, there are also four possible modes to predict each 8x8 block (Cr and Cb) in an MB. Considering an HD1080p video sequence (1920x1080 pixels), 138,720 iterations of prediction, forward transform and quantization, inverse transform and quantization, and entropy coding (called the encoding loop in this work; see Figure 1) are needed to encode each intra-frame using the RDO technique.
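The iteration count quoted above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: it assumes that the frame height is padded to a whole number of macroblocks (1088 lines) and that each candidate mode (four I16MB, nine I4MB, four chroma) counts as one encoding-loop iteration per MB. The function names are hypothetical.

```python
import math

MB_SIZE = 16          # a macroblock covers 16x16 luma samples
I16MB_MODES = 4       # luma 16x16 prediction modes
I4MB_MODES = 9        # luma 4x4 prediction modes
CHROMA_MODES = 4      # chroma 8x8 prediction modes


def macroblocks_per_frame(width, height):
    """Number of MBs in a frame, rounding each dimension up to a full MB."""
    return math.ceil(width / MB_SIZE) * math.ceil(height / MB_SIZE)


def encoding_loop_iterations(width, height):
    """Encoding-loop iterations per intra-frame, assuming one per mode per MB."""
    modes_per_mb = I16MB_MODES + I4MB_MODES + CHROMA_MODES  # 17
    return macroblocks_per_frame(width, height) * modes_per_mb


if __name__ == "__main__":
    print(macroblocks_per_frame(1920, 1080))     # 120 * 68 = 8160
    print(encoding_loop_iterations(1920, 1080))  # 8160 * 17 = 138720
```

Under these assumptions the result matches the 138,720 iterations stated in the text.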
It is therefore clear that the RDO-based decision is hard to use when high resolutions and real-time applications are considered. Besides that, an encoder architecture that uses the RDO-based decision will spend many clock cycles and much energy performing the encoding loop for all the candidate modes that will not be included in the bit stream in the end.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SBCCI'11, August 30-September 2, 2011, João Pessoa, Brazil. Copyright 2011 ACM 978-1-4503-0828-1/11/08...$10.00.
Due to this complexity, some works, such as [4], [5] and [6], have proposed fast intra-frame mode decision algorithms and hardware designs to decrease the encoding time for one MB. However, all these works focus only on reducing the number of modes that will be evaluated by the RDO-based decision. The work in [4] proposes an intra decision based on the dominant edge strength, using a filtering technique to perform the edge detection. In this way, the number of modes is reduced from nine to four for a 4x4 luma block. For a 16x16 luma or 8x8 chroma block, only the detected mode and the DC mode are selected to be evaluated by the RDO technique. In [5], the authors propose a modified low-complexity mode decision algorithm based on a cost function composed of the distortion and an estimated rate. The work in [6] proposes a modified three-step algorithm [7] to perform the intra decision. As in [4], the main goal is to decrease the number of I4MB candidate modes (from nine to seven) to be evaluated by the RDO technique.
The main drawback of the related works [4]-[7] is that the proposed techniques only reduce the number of candidate modes to be evaluated by the RDO process. This helps, but the gains in computational complexity are limited, since the RDO process is still executed for the remaining modes. In this work, our approach is very different.
We propose a fast algorithm and its hardware architecture that completely eliminate the RDO-based decision from the encoding process, greatly decreasing the time needed to encode one MB. The algorithm uses a threshold value to choose the partition type (I16MB or I4MB), and for each partition type the encoding mode is selected with a simple distortion metric, e.g. the Sum of Absolute Differences (SAD). The threshold value is based on the Difference of Distortions (DD) and is determined by offline simulations with several video sequences that represent a variety of illumination and texture patterns.
The paper is organized as follows: Section 2 presents the fast intra-decision algorithm. Section 3 shows the designed architecture of the intra-decision and compares it with previous architecture designs. Finally, Section 4 concludes this work.
FAST INTRA-FRAME MODE DECISION ALGORITHM
Our fast intra-mode decision algorithm is based only on distortion calculation. The decision is performed hierarchically in two steps: (1) decision among equal partition sizes and (2) decision among different partition sizes. The following subsections describe each step.
Decision Among Equal Partition Sizes
The first decision step uses distortion calculation to choose the best I16MB partition among the four possible modes and the best I4MB partition among the nine possible modes. Several simulations were performed with three different distortion metrics: Sum of Absolute Differences (SAD), Sum of Squared Differences (SSD) and Sum of Absolute Transformed Differences (SATD). The results (bit-rate and video quality) obtained with these three metrics were compared with each other, and their computational complexity was compared in terms of the number of sums (see Table 1). The SATD and SSD metrics show better RD results (bit-rate and video quality). However, the computational complexity of these two metrics is much higher than that of SAD (about 361% higher for the SSD metric). As the main goal of this work is to design a faster intra mode decision, we decided to use SAD as the distortion metric. In addition, the hardware architecture of a SAD calculator is simpler than that of SATD (which includes a 4x4 Hadamard transform) and SSD (which includes a multiplier and a square root).
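To make the comparison of metrics concrete, the sketch below computes SAD, SSD and SATD for a single 4x4 block. It is a minimal software illustration, not the authors' implementation: blocks are assumed to be 4x4 lists of integer samples, the function names are hypothetical, and the SATD shown applies an unnormalized 4x4 Hadamard transform to the residual (encoders differ in how they scale this value).

```python
# Unnormalized 4x4 Hadamard matrix (symmetric, so it equals its transpose).
H4 = [
    [1,  1,  1,  1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
    [1, -1,  1, -1],
]


def residual(orig, pred):
    """Sample-wise difference between the original and predicted 4x4 blocks."""
    return [[o - p for o, p in zip(ro, rp)] for ro, rp in zip(orig, pred)]


def sad(orig, pred):
    """Sum of Absolute Differences: additions and absolute values only."""
    return sum(abs(v) for row in residual(orig, pred) for v in row)


def ssd(orig, pred):
    """Sum of Squared Differences: needs one multiplication per sample."""
    return sum(v * v for row in residual(orig, pred) for v in row)


def _matmul4(a, b):
    """Plain 4x4 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def satd(orig, pred):
    """Sum of Absolute Transformed Differences using H4 * R * H4."""
    transformed = _matmul4(_matmul4(H4, residual(orig, pred)), H4)
    return sum(abs(v) for row in transformed for v in row)


if __name__ == "__main__":
    original = [[10] * 4 for _ in range(4)]
    predicted = [[8] * 4 for _ in range(4)]
    print(sad(original, predicted), ssd(original, predicted),
          satd(original, predicted))
```

SAD needs only subtractions, absolute values and additions, which is why it maps to the simplest hardware of the three.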
Decision Among Different Partition Sizes
The second step of the proposed intra-decision chooses which partition size (I4MB or I16MB) will be used to encode the MB. This decision uses the information generated by the first step: the distortion of the best I4MB partition and the distortion of the best I16MB partition, both measured with SAD. A simple comparison between these two values would, in most cases, select I4MB partitions, because of their finer prediction granularity and larger number of coding modes. However, by analyzing how the difference between these two distortion values behaves when the RDO-based decision is applied, it is possible to make a good choice of partition for each MB.
Simulations were performed with various video sequences using the JM H.264 reference software [8], configured for full RDO-based decision and intra-only MB modes (so that only I4MB or I16MB could be chosen). In most cases in which the I16MB partition was chosen, the distortion values of the best I4MB partition and the best I16MB partition were very close. This means that most of the 16 4x4 modes were the same, so choosing a single 16x16 mode is better, since it generates less mode information. On the other hand, when I4MB partitions were chosen, the distortion values of the best I4MB and the best I16MB partitions were very different. This means that most of the 16 4x4 modes were different, and choosing a single 16x16 mode would generate a lot of residual information. Several simulations were performed to relate the difference of distortions (DD) to the intra-prediction modes selected by the RDO technique.
Equation (1) shows the difference of distortions (DD) calculation, where $SAD_{I4MB}$ is the sum of the residuals generated by the 16 best 4x4 modes and $SAD_{I16MB}$ is the residual generated by the best 16x16 mode.
\[ DD = SAD_{I4MB} - SAD_{I16MB} \qquad (1) \]
Figure 2 shows a graph in which the number of MBs assigned to each partition (I4MB and I16MB) by the RDO technique is plotted against the difference of distortions between the best I4MB and I16MB partitions (first decision step). With the difference of distortions set to 600, for example, it is possible to see that in most cases in which the I16MB partition is chosen (97%) the difference of distortions is very small (less than 600), while when the I4MB partition is chosen the difference of distortions is very large (84% are greater than 600). It is therefore possible to compare this value with a threshold to choose the partition size for intra-frame MBs.
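The threshold comparison described here can be sketched as follows. This is a hedged illustration rather than the authors' exact rule: the function names are hypothetical, the comparison uses the magnitude of DD (Equation (1) as written is not positive whenever the I4MB SAD is the lower of the two, so taking the absolute value is an assumption), and treating a DD exactly equal to the threshold as I4MB is also an assumption.

```python
DD_THRESHOLD = 600  # example threshold value discussed in the text


def difference_of_distortions(sad_i4mb, sad_i16mb):
    """Equation (1): DD = SAD_I4MB - SAD_I16MB."""
    return sad_i4mb - sad_i16mb


def choose_partition(sad_i4mb, sad_i16mb, threshold=DD_THRESHOLD):
    """Pick the partition size from the two best first-step distortions.

    Assumption: the magnitude of DD is compared with the threshold.
    A small difference means I4MB gains little, so I16MB is chosen.
    """
    dd = abs(difference_of_distortions(sad_i4mb, sad_i16mb))
    return "I16MB" if dd < threshold else "I4MB"


if __name__ == "__main__":
    print(choose_partition(1000, 1400))  # small difference (400)
    print(choose_partition(1000, 2500))  # large difference (1500)
```

Because the rule needs only one subtraction and one comparison per MB, it replaces the full encoding loop with a very cheap test.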
The threshold value that gives the best results in terms of PSNR and bit-rate was obtained as follows. First, all videos used in the simulations were encoded using the RDO technique for the decision among different block sizes. During this encoding, the distortion values and the chosen partition were saved. Then, the differences of distortions were compared with the chosen partitions to define the threshold value. Finally, simulations were performed with thresholds ranging from 0 to 1000, evaluating each one in terms of bit-rate and video quality. A threshold of 600 produced the best trade-off between bit-rate and video quality for the video sequences evaluated.
The results obtained with the proposed intra-decision algorithm are presented in Table 2. The first columns present the results using the RDO-based decision. The central columns present the results obtained with the proposed decision. Finally, the last columns compare the two approaches in terms of bit-rate increase and image quality (PSNR).
The proposed heuristic resulted in an average bit-rate increase of 5.02% and an average image quality (PSNR) decrease of 0.255 dB. These losses are very small and are justified by the large reduction in computational complexity achieved in the decision process. As presented in Figure 1, the RDO-based encoding process finishes only after all possible intra-frame prediction modes have been executed by the encoding loop. The decision proposed in this work is performed right after the intra-prediction generates the predicted blocks, using the SAD-based distortion calculation followed by the difference of distortions operation. In this way, the encoding loop presented in Figure 1 is completely eliminated from the intra-frame decision process. When the RDO-based decision is performed, four I16MB modes and nine I4MB intra-frame modes must be evaluated, totaling 13 encoding iterations per MB. With the proposed decision method, the encoding process is performed only once for each MB. Table 3 presents a comparison with related works in terms of bit-rate, image quality (PSNR) and reduction in RDO calculations. While other works have shown a reduction in coding iterations of 1.1 to 2.6 times in comparison with the RDO-based decision, the proposed decision allows a reduction of 13 times (one order of magnitude). The cost of this gain is the bit-rate increase of 5.02% and the image quality loss of 0.255 dB, which do not compromise coding efficiency when the gain in computational complexity is considered. Moreover, by reducing the number of encoding loop iterations, it is possible to reduce the number of clock cycles and the energy needed to perform the whole prediction of one MB.
DESIGNED ARCHITECTURE AND COMPARISON
In order to further improve the performance of the intra-frame decision, a hardware architecture was designed. The fast intra mode decision architecture is shown in Figure 3. It consists of 17 SAD calculators (nine for I4MB modes, four for I16MB modes and four for chroma modes), three comparators and the DD mode decision module. The SAD calculators compute the distortion between the predicted block and the original block. For I4MB partitions, the SAD values of the nine modes are compared for each 4x4 block, and the lowest SADs of the 16 4x4 blocks are accumulated to generate the total distortion of the I4MB partition. For I16MB partitions, the SAD values of the four modes are compared and the lowest one identifies the best I16MB mode. As chrominance samples are predicted with only one partition size (8x8), the chroma decision is simpler: the SAD values of the modes are compared and the lowest one is chosen as the best. Figure 4 shows the RTL diagram of each SAD calculator. The SAD calculator consumes eight samples (two lines of a 4x4 block) per cycle. It was designed with two pipeline stages, i.e., it takes two clock cycles to deliver the first result. There is a small difference between the SAD calculator used in the I4MB distortion decision and the one used in the I16MB distortion decision. For the I4MB partition, the accumulated value is passed to the comparator each time one 4x4 block has been processed, since the prediction is performed over 4x4 blocks. The lowest SAD among the nine modes of that 4x4 block is then chosen as the best and accumulated again until all 16 4x4 blocks have been read. On the other hand, for the best I16MB and the best 8x8 chroma partitions only one accumulator is needed, since the prediction is generated over the whole block. Then the comparator chooses the best mode. Figure 5 shows the timing diagram of the designed architecture.
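The compare-and-accumulate behaviour of this data path can be modelled in software as below. This is a behavioural sketch under stated assumptions, not a description of the VHDL design: the SAD values are assumed to be already computed, the function names are hypothetical, and mode indices are simply positions in the input lists rather than H.264 mode numbers.

```python
def best_i4mb(sads_per_block):
    """Model of the I4MB path.

    sads_per_block: 16 lists, each holding the SADs of the nine candidate
    modes for one 4x4 block. For every block the lowest SAD is selected
    (comparator) and added to a running total (second accumulator).
    Returns (best mode index per block, total I4MB distortion).
    """
    best_modes = []
    total_sad = 0
    for block_sads in sads_per_block:
        best = min(range(len(block_sads)), key=block_sads.__getitem__)
        best_modes.append(best)
        total_sad += block_sads[best]
    return best_modes, total_sad


def best_single_partition(mode_sads):
    """Model of the I16MB and chroma paths: one SAD per mode, keep the lowest.

    Returns (best mode index, its SAD).
    """
    best = min(range(len(mode_sads)), key=mode_sads.__getitem__)
    return best, mode_sads[best]


if __name__ == "__main__":
    # Toy data: every mode costs 50 except one cheap mode per block.
    table = [[50] * 9 for _ in range(16)]
    for b in range(16):
        table[b][b % 9] = 5
    modes, total = best_i4mb(table)
    print(modes, total)
    print(best_single_partition([40, 12, 33, 12]))
```

The two returned distortions (total I4MB SAD and best I16MB SAD) are the inputs of the DD mode decision module.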
As the SAD calculators were designed with two pipeline stages and process eight samples in parallel, the architecture takes 3 clock cycles to deliver the first valid SAD value of a 4x4 block. Once the pipeline is filled, a valid SAD value is available every two clock cycles. The architecture therefore takes 34 clock cycles to evaluate all nine I4MB modes, plus one more clock cycle to accumulate the last best SAD. For the I16MB partition decision, the SAD values are accumulated in the SAD calculator itself, so the architecture takes 33 clock cycles to accumulate the SAD for the four modes and one more clock cycle to compare them. The chroma decision is similar to the I16MB decision, but the block size is 8x8 samples, so the whole chroma decision takes only 10 clock cycles. After the best I4MB distortion and the best I16MB distortion have been calculated, the difference of distortions is evaluated in one clock cycle to decide which partition size will be used.
The architecture was described in VHDL and synthesized targeting two different technologies: (1) an EP2S130F1508C3 Stratix II FPGA [10] and (2) TSMC 0.18µm standard cells [11]. Table 4 presents the synthesis results for both technologies and compares them with previous works. The results are presented in terms of hardware resource usage, number of clock cycles needed to process one MB, maximum operating frequency and throughput measured in HD1080p frames per second.
When synthesized for the FPGA, the architecture used 3,267 ALUTs (Adaptive Look-Up Tables) and 2,312 DLRs (Dedicated Logic Registers), totaling 4% of the resources in the device. The FPGA synthesis achieved a maximum operating frequency of 98.43 MHz, which allows processing up to 335 HD1080p frames per second. When synthesized for TSMC 0.18µm, the total gate count was 28,518 and the maximum operating frequency was 129.1 MHz. In this case, the architecture is able to process 439 HD1080p frames per second.
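The relation between clock frequency and frame rate can be checked with a short calculation. The sketch below rests on two assumptions that are not stated as such in the text: that one MB takes 36 clock cycles (34 cycles for the I4MB SADs, one to accumulate the last best SAD and one for the DD evaluation, with the I16MB and chroma decisions running in parallel), and that an HD1080p frame contains 120 x 68 = 8160 MBs. The function name is hypothetical.

```python
MBS_PER_HD1080P_FRAME = 120 * 68   # assumption: height padded to 1088 lines
CYCLES_PER_MB = 34 + 1 + 1         # assumption derived from the cycle counts


def frames_per_second(clock_hz,
                      cycles_per_mb=CYCLES_PER_MB,
                      mbs_per_frame=MBS_PER_HD1080P_FRAME):
    """Whole frames processed per second at the given clock frequency."""
    return clock_hz // (cycles_per_mb * mbs_per_frame)


if __name__ == "__main__":
    print(frames_per_second(98_430_000))   # FPGA, 98.43 MHz
    print(frames_per_second(129_100_000))  # TSMC 0.18um, 129.1 MHz
```

With these assumptions the calculation gives 335 and 439 frames per second, in agreement with the reported throughput figures.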
Compared with previous works [4]-[6], the designed architecture uses the lowest number of cycles to process an intra-frame: more than 11X fewer than [4] and 18X fewer than [5]. It also achieves the highest throughput among the related works: more than 11X and 14X more HD1080p frames encoded per second for the FPGA and TSMC 0.18µm versions, respectively, compared with [4]. All these results were obtained with our architecture operating at its maximum frequency. Given this high throughput, the operating frequency can be reduced to 8.26 MHz while still processing HD1080p video at 30 fps, which is the target frame rate defined by the H.264 standard [1]. At such a low frequency the architecture can achieve very low power consumption, which makes it an excellent alternative for battery-powered devices. However, if a whole encoder design is considered, this low frequency might not be sufficient for HD1080p processing.
CONCLUSIONS
This work has presented a fast intra mode decision algorithm and its hardware architecture design for an H.264/AVC video encoder. The proposed algorithm completely eliminates the encoding loop present in RDO-based mode decision, replacing it with simple distortion calculations and comparisons and greatly decreasing the complexity of the video encoder. The number of encoding iterations was reduced by a factor of 13 compared with the RDO-based decision, at the cost of a relatively small bit-rate increase (5.02% on average) and image quality loss (0.255 dB on average). Compared with other works, the proposed algorithm achieves a complexity reduction more than five times higher, while the bit-rate increase and the image quality loss are slightly higher but still similar to those of the compared works. The designed architecture of the fast intra-decision algorithm was described in VHDL and synthesized targeting two technologies: (1) a Stratix II FPGA and (2) a TSMC 0.18µm standard cell library. The synthesis results show that the architecture is able to process up to 439 HD1080p frames per second.
