In this paper, we propose a novel hardware architecture for the intra 16 × 16 module of the macroblock engine of the H.264 video coding standard. To reduce the cycle count of intra 16 × 16 prediction, transform/quantization, and inverse quantization/inverse transform in H.264, an advanced scheduling of the different operations is proposed. The architecture can process one macroblock in 208 cycles for all macroblock types by performing the 4 × 4 Hadamard transform and quantization during the 16 × 16 prediction. The module was designed in the VHDL hardware description language and operates at 160 MHz on an ALTERA NIOS-II development board with a Stratix II EP2S60F1020C3 FPGA. The system also includes software running on a NIOS-II processor to implement the pre-processing and post-processing functions. Finally, the execution time of our HW solution is decreased by 26% compared with the previous work.
Introduction
Currently, video system development is generally based on embedded systems. Such systems need to find a compromise between computational complexity and execution time constraints. On the other hand, the H.264/AVC standard for video compression [1] [2] [3] [4] [5], due to its high complexity, needs powerful processors and hardware acceleration in order to meet application requirements.
To take advantage of hardware acceleration, each functional module of the H.264 video encoder has been carefully studied to determine its computational complexity. The intra process presents one of the highest computational complexities in the H.264/AVC encoder [6]. This process is based on the hybrid encoding scheme shown in Figure 1, which uses intra prediction, the integer cosine transform and quantization. The intra process is used to remove spatial redundancy. There are two types of intra modes: intra 4 × 4 and intra 16 × 16. The intra 16 × 16 module is composed of intra 16 × 16 prediction (IP 16 × 16), integer cosine transform (ICT), AC quantization (QAC), inverse integer cosine transform (IICT), inverse AC quantization (IQAC), DC quantization (QDC), Hadamard transform (HT), inverse DC quantization (IQDC) and inverse Hadamard transform (IHT). Dedicated hardware implementations of intra 16 × 16 for H.264 have been proposed [7, 8]. They showed that some of these parts can be optimized with parallel hardware structures. These previous works implemented the intra 16 × 16 algorithm directly in hardware with serial [7] and parallel [8] architectures. In contrast, our architecture uses both parallel and pipelined structures in order to reduce the number of operations and achieve fast execution. Our design is described in VHDL (VHSIC Hardware Description Language) and has been synthesized together with the Altera NIOS II softcore processor for experimental validation in a single Altera Stratix II EP2S60 FPGA (Field Programmable Gate Array) device.
This paper is organized as follows: Section 2 presents an overview of the intra 16 × 16 algorithm. Section 3 presents the intra 16 × 16 architecture. The experimental results are shown in Section 4. Finally, Section 5 concludes the paper.
Overview of the Intra 16 × 16 Algorithm
The intra 16 × 16 algorithm is a critical component of H.264/AVC. There are eleven functional operations in this module: intra 16 × 16 prediction, residual calculation, integer transform, AC coefficient quantization, DC coefficient quantization, inverse AC coefficient quantization, inverse DC coefficient quantization, Hadamard transform, inverse Hadamard transform, inverse integer transform and pixel reconstruction. The 16 × 16 intra prediction supports the four modes specified in the H.264 standard (vertical, horizontal, DC and plane), all based on the reconstructed pixels from the previous macroblock (MB). Figure 2 shows the intra 16 × 16 prediction modes.
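For each MB, we compute the difference between the predicted pixels and the original pixels. After this step, we calculate the integer transform coefficients. In the H.264/AVC standard, the 4 × 4 integer transform is defined by [3, 4]:

$$
I =
\begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix}
\begin{bmatrix} X_0 & X_1 & X_2 & X_3 \\ X_4 & X_5 & X_6 & X_7 \\ X_8 & X_9 & X_{10} & X_{11} \\ X_{12} & X_{13} & X_{14} & X_{15} \end{bmatrix}
\begin{bmatrix} 1 & 2 & 1 & 1 \\ 1 & 1 & -1 & -2 \\ 1 & -1 & -1 & 2 \\ 1 & -2 & 1 & -1 \end{bmatrix}
\tag{1}
$$

where $X_i$ ($i = 0, \dots, 15$) are the elements of the residual 4 × 4 block. After this operation, we obtain two types of coefficients: AC and DC coefficients. For the AC coefficients, we compute the quantization operation. In general, the AC quantization operation is defined by [3, 4].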
We can write (5) as follows:
where:

Hence, shift operations can be used extensively in the quantization and rescaling stages. To simplify the arithmetic, the quantization stated in (6) can be rewritten as (9) and (10) for the AC coefficients [3, 4].
Z_ij is the coefficient obtained after the QAC operation. The first 6 values of MF used in the H.264 references are listed in Table 2.
The 2nd and 3rd columns correspond to the different positions in the scaling matrix. QP%6 denotes the remainder of the division of QP by 6.
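The multiply, add and shift form of the AC quantization described above can be illustrated with a minimal sketch. The MF values below are those commonly tabulated for H.264 and are assumed to match Table 2; the rounding offset f = 2^qbits / 3 and the names `position_class` and `quantize_ac` are illustrative assumptions, not taken from our VHDL design.

```python
# Multiplication factors MF indexed by QP % 6 (assumed to match Table 2).
# Column 0: positions (0,0), (0,2), (2,0), (2,2)
# Column 1: positions (1,1), (1,3), (3,1), (3,3)
# Column 2: all other positions
MF = [
    (13107, 5243, 8066),
    (11916, 4660, 7490),
    (10082, 4194, 6554),
    (9362, 3647, 5825),
    (8192, 3355, 5243),
    (7282, 2893, 4559),
]


def position_class(i, j):
    """Return the MF column index for coefficient position (i, j)."""
    if i % 2 == 0 and j % 2 == 0:
        return 0
    if i % 2 == 1 and j % 2 == 1:
        return 1
    return 2


def quantize_ac(w, qp):
    """Quantize a 4x4 block of unscaled transform coefficients w."""
    qbits = 15 + qp // 6          # ranges from 15 to 23 for QP in 0..51
    f = (1 << qbits) // 3         # rounding offset (assumed intra setting)
    z = [[0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            mf = MF[qp % 6][position_class(i, j)]
            magnitude = (abs(w[i][j]) * mf + f) >> qbits
            # The sign of the input coefficient is restored after the shift.
            z[i][j] = magnitude if w[i][j] >= 0 else -magnitude
    return z
```

With QP = 0, an unscaled coefficient of 100 at position (0, 0) is quantized to 40. Raising QP by 6 increments qbits, which halves the quantized value.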
After the calculation of QAC, we must compute the inverse AC quantization. This operation is defined as [3, 4] .
A constant scaling factor of 64 is integrated in order to avoid rounding errors. The inverse AC quantization equation therefore becomes:
Y_ij is the result of the inverse AC quantization. It must be divided by 64 to recover the exact value without the scaling factor. The H.264 draft standard does not specify Qstep or PF directly. Instead, it uses a parameter given by:
The final equation for the inverse quantization is:
The first 6 values of V used in the H.264 standard are listed in Table 3 . The 2nd and 3rd columns are the different positions in the scaling matrix.
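The rescaling step can be sketched in the same style. This is a minimal illustration that assumes the usual H.264 form, in which the quantized coefficient is multiplied by V and shifted left by floor(QP/6); the function names are hypothetical, and the V values are the six rows listed in Table 3.

```python
# Rescaling factors V indexed by QP % 6 (Table 3).
# Column 0: positions (0,0), (0,2), (2,0), (2,2)
# Column 1: positions (1,1), (1,3), (3,1), (3,3)
# Column 2: all other positions
V = [
    (10, 16, 13),
    (11, 18, 14),
    (13, 20, 16),
    (14, 23, 18),
    (16, 25, 20),
    (18, 29, 23),
]


def position_class(i, j):
    """Return the V column index for coefficient position (i, j)."""
    if i % 2 == 0 and j % 2 == 0:
        return 0
    if i % 2 == 1 and j % 2 == 1:
        return 1
    return 2


def dequantize_ac(z, qp):
    """Rescale a 4x4 block of quantized AC coefficients z.

    The result still carries the factor of 64, which is removed after the
    inverse integer transform.
    """
    shift = qp // 6
    return [
        [(z[i][j] * V[qp % 6][position_class(i, j)]) << shift for j in range(4)]
        for i in range(4)
    ]
```

For example, a quantized value of 40 at position (0, 0) with QP = 0 is rescaled to 400.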
For the DC coefficients, the Hadamard transform is applied. The 4 × 4 Hadamard transform is given by Equation (12) [3, 4], in which D_i denotes the DC coefficients. In the next step, we compute the quantization of the DC coefficients. This operation is defined by [3, 4].
K_ij is the coefficient obtained after the QDC operation. MF(0, 0) is the multiplication factor for position (0, 0) in Table 2. After the calculation of QDC, we must compute the 4 × 4 inverse Hadamard transform. This operation is defined by [3, 4].

Table 3. First six values of V.

| QP%6 | Positions (0,0), (2,0), (2,2), (0,2) | Positions (1,1), (1,3), (3,1), (3,3) | Other positions |
| 0 | 10 | 16 | 13 |
| 1 | 11 | 18 | 14 |
| 2 | 13 | 20 | 16 |
| 3 | 14 | 23 | 18 |
| 4 | 16 | 25 | 20 |
| 5 | 18 | 29 | 23 |

D'_i is the 4 × 4 block of quantized DC coefficients. The final step for the DC coefficients is the inverse DC quantization. This operation is defined by [3, 4].
where V(0,0) is the multiplication factor for position (0,0) in Table 3 .
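The DC path can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the forward Hadamard result is halved with an arithmetic right shift, the DC quantizer uses an offset of 2f and a shift of qbits + 1, and the rescaling follows the usual H.264 split at QP = 12. The MF(0,0) values are assumed to match Table 2, and all function names are hypothetical.

```python
H = [
    [1, 1, 1, 1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
    [1, -1, 1, -1],
]

MF00 = [13107, 11916, 10082, 9362, 8192, 7282]  # MF(0,0) per QP % 6 (assumed Table 2)
V00 = [10, 11, 13, 14, 16, 18]                  # V(0,0) per QP % 6 (Table 3)


def matmul(a, b):
    """Multiply two 4x4 integer matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def hadamard(d):
    """Compute H * d * H; H is symmetric, so no transpose is needed."""
    return matmul(matmul(H, d), H)


def quantize_dc(d, qp):
    """Forward Hadamard transform followed by DC quantization."""
    qbits = 15 + qp // 6
    f = (1 << qbits) // 3                       # assumed intra rounding offset
    y = [[v >> 1 for v in row] for row in hadamard(d)]  # halving (assumed convention)
    out = []
    for row in y:
        out_row = []
        for v in row:
            mag = (abs(v) * MF00[qp % 6] + 2 * f) >> (qbits + 1)
            out_row.append(mag if v >= 0 else -mag)
        out.append(out_row)
    return out


def dequantize_dc(k, qp):
    """Inverse Hadamard transform followed by inverse DC quantization."""
    w = hadamard(k)
    v = V00[qp % 6]
    s = qp // 6
    if qp >= 12:
        return [[(c * v) << (s - 2) for c in row] for row in w]
    rounding = 1 << (1 - s)                     # 2 for QP < 6, 1 for 6 <= QP < 12
    return [[(c * v + rounding) >> (2 - s) for c in row] for row in w]
```

For a block of sixteen equal DC coefficients, only the (0, 0) output of the forward Hadamard transform is non-zero, and the inverse transform spreads the quantized value back to all sixteen positions.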
After all these operations, we combine the AC and DC coefficients to compute the inverse integer transform. Equation (19) gives the 4 × 4 inverse integer transform, defined as [3, 4].
Intra 16 × 16 Architecture
The intra 16 × 16 architecture partitions the MB into sixteen 4 × 4 blocks. The scanning order for one MB is shown in Figure 3. The MB is scanned in the x direction first and then in the y direction. The label order, from top to bottom and from left to right, is the actual processing order for one MB. The correspondence between the 16 × 16 scanning order labels and the 4 × 4 scanning order labels is shown in Figure 4.
The 4 × 4 scanning order labels are shown in Figure 5. In the first step, we compute the intra 16 × 16 prediction for all 4 × 4 blocks. After this, we calculate the residual, the integer transform, the AC quantization and the inverse AC quantization for each 4 × 4 block. During the calculation of the integer transform, we extract the DC coefficient of each 4 × 4 block. After obtaining the 16 DC coefficients, we calculate the Hadamard transform, the DC quantization, the inverse Hadamard transform and the inverse DC quantization. Finally, we combine the AC and DC coefficients of each 4 × 4 block to perform the inverse integer transform and the pixel reconstruction.
Intra 16 × 16 Prediction
Several intra prediction architectures have been proposed [9] [10] [11] [12] [13]. In our architecture, the MB pixels are loaded into a dual RAM (Random Access Memory) for reordering and are then delivered to the residual or reconstruction blocks in sets of 16 pixels (one 4 × 4 block). The prediction block calculates, in parallel, the predicted pixels of the MB for three of the intra 16 × 16 prediction modes specified in the H.264 standard (horizontal, vertical and DC), based on the reconstructed pixels from the previous MB (the plane mode is not used [14]). Figure 8 presents the intra prediction hardware architecture. The predicted pixels are stored in RAM for all modes. A SAD_4 × 4 block calculates the sum of absolute differences (SAD) of each 4 × 4 block for each mode, and this value is accumulated 16 times to obtain the SAD_16 × 16 for each mode. The comparator compares the SAD values of all prediction modes and picks the lowest one to determine which prediction mode will be used. Once the best SAD (MIN_SAD) is obtained, the best predicted MB is output. The difference between the source pixels and the predicted pixels is then calculated for the best prediction mode to obtain the residual MB.
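The mode decision described above can be sketched in software as follows. This is a minimal illustration, assuming that both the top and left neighbours are available and that the DC predictor uses the usual H.264 rounding (sum + 16) >> 5; the function names are hypothetical and the code models the data flow only, not the parallel hardware.

```python
def predict_modes(top, left):
    """Build the three 16x16 predictions from 16 top and 16 left neighbours."""
    dc = (sum(top) + sum(left) + 16) >> 5
    return {
        "vertical": [list(top) for _ in range(16)],
        "horizontal": [[left[y]] * 16 for y in range(16)],
        "dc": [[dc] * 16 for _ in range(16)],
    }


def sad_4x4(src, pred, by, bx):
    """SAD of the 4x4 block whose top-left corner is (by, bx)."""
    return sum(abs(src[by + y][bx + x] - pred[by + y][bx + x])
               for y in range(4) for x in range(4))


def sad_16x16(src, pred):
    """Accumulate the sixteen 4x4 SAD values of one MB."""
    return sum(sad_4x4(src, pred, by, bx)
               for by in range(0, 16, 4) for bx in range(0, 16, 4))


def select_mode(src, top, left):
    """Return (best mode, MIN_SAD, residual MB)."""
    preds = predict_modes(top, left)
    sads = {mode: sad_16x16(src, pred) for mode, pred in preds.items()}
    best = min(sads, key=sads.get)
    residual = [[src[y][x] - preds[best][y][x] for x in range(16)]
                for y in range(16)]
    return best, sads[best], residual
```

When every row of the source MB equals its left neighbour, the horizontal mode yields a SAD of zero and is selected, and the residual MB is all zeros.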
ICT and HT Architectures
Several works have been published on the integer transform [15] [16] [17] [18] [19]. The transform "I" shown in (1) and the transform "H" shown in (12) can each be implemented with a 1-D transform. Figure 9 shows the fast implementation of the integer transform. The transform matrix contains only four coefficients: 1, -1, 2, and -2. It can therefore be implemented using only addition, subtraction and shift operations. The Hadamard transform matrix is very similar to the integer transform matrix. The difference is that the coefficients of the Hadamard transform are only 1 or -1. The corresponding fast implementation of the Hadamard transform is shown in Figure 10.
The hardware implementation of the 1-D ICT or HT is given in Figure 11. The input of this module is a 4 × 4 block. For the full transform operation, we use two 1-D transforms in order to obtain the 2-D transform.
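A minimal sketch of the fast 1-D transform and its use for the 2-D transform is given below. It assumes the standard butterfly structure (two additions, two subtractions, then a combining stage with shifts for the integer transform); the function names and the `integer` flag are illustrative and do not reflect the signal names of the hardware module.

```python
def butterfly_1d(x, integer=True):
    """Fast 1-D transform of four samples using only add, subtract and shift.

    integer=True gives the integer transform (coefficients 1, -1, 2, -2);
    integer=False gives the Hadamard transform (coefficients 1, -1).
    """
    a = x[0] + x[3]
    b = x[1] + x[2]
    c = x[1] - x[2]
    d = x[0] - x[3]
    if integer:
        return [a + b, (d << 1) + c, a - b, d - (c << 1)]
    return [a + b, d + c, a - b, d - c]


def transform_2d(block, integer=True):
    """2-D transform obtained by applying the 1-D transform to rows, then columns."""
    rows = [butterfly_1d(r, integer) for r in block]
    cols = [butterfly_1d([rows[i][j] for i in range(4)], integer)
            for j in range(4)]
    # cols[j] holds output column j; transpose back to row-major order.
    return [[cols[j][i] for j in range(4)] for i in range(4)]
```

The same routine serves both transforms, which mirrors the observation that the two matrices differ only in the magnitude of their coefficients.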
QAC & QDC Architectures
Quantization hardware architectures have been proposed in [8, 20]. The architecture of the DC quantization is similar to that of the AC quantization presented in Figure 13. The multiplication factors listed in Table 2 are stored in a ROM (Read Only Memory) and selected according to the QP%6 value. The selected factor is multiplied by the unscaled coefficient in the corresponding position. The shifter then shifts the product right by qbits.
The QAC and QDC modules quantize 16 coefficients at the same time according to the QP factor. Each module is composed of sixteen quantization blocks (labelled 0…15), a memory for storing the input coefficients (labelled input_0..15) and two read-only memories, ROM_QE and ROM_F, which store the QE values (equal to QP%6) and the F values respectively. The AC and DC quantization blocks are built from the three basic components presented in Figure 14. A multiplier multiplies the absolute value of the AC coefficient by the corresponding MF(i, j) factor. An adder sums the value given by the multiplier with the F parameter provided by the ROM. A shifter shifts the result of the adder right by "qbits" (which varies from 15 to 23 according to the value of QP).
IQAC & IQDC Architectures
The IQAC and IQDC modules inverse-quantize 16 coefficients according to the QP factor. The architecture of these modules is similar to that of the QAC and QDC modules, respectively, presented in Figure 13. The difference between quantization (AC or DC) and inverse quantization (AC or DC) lies in the quantization block. To obtain the inverse AC quantization values, we use a multiplier to multiply the QAC coefficients by the V(i, j) values. We also use a shifter to shift the result of the multiplier by floor(QP/6). The architecture of this module is presented in Figure 15.
For the DC coefficients, we use a multiplier to multiply the QDC coefficients by the V(0, 0) value. An adder sums the value given by the multiplier with a rounding constant taken from {0, 1, 2} (0 for QP >= 12, 1 for 6 <= QP < 12, and 2 otherwise). A shifter shifts the result of the adder by (floor(QP/6) - 2) for QP >= 12 and by (2 - floor(QP/6)) for QP < 12. The architecture of this module is presented in Figure 16.
IICT and IHT Architectures
The IICT and IHT architectures are similar to the ICT and HT architectures, respectively, presented in Figures 12 and 13. The inverse integer transform matrix contains only four coefficients: 1, -1, 1/2, and -1/2. Figure 17 shows the fast implementation of the inverse integer transform. The inverse Hadamard transform matrix contains only two coefficients, 1 and -1. Figure 18 shows the fast implementation of the inverse Hadamard transform.
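The fast inverse integer transform can be sketched in the same way as the forward one. This minimal illustration assumes the standard butterfly, in which the factors 1/2 and -1/2 are realized with arithmetic right shifts, and assumes that the final division by 64 is rounded as (x + 32) >> 6; the function names are hypothetical.

```python
def inverse_1d(w):
    """Fast 1-D inverse integer transform of four coefficients."""
    a = w[0] + w[2]
    b = w[0] - w[2]
    c = (w[1] >> 1) - w[3]      # the 1/2 factors become right shifts
    d = w[1] + (w[3] >> 1)
    return [a + d, b + c, b - c, a - d]


def inverse_transform_2d(w):
    """2-D inverse transform (rows, then columns) followed by division by 64."""
    rows = [inverse_1d(r) for r in w]
    cols = [inverse_1d([rows[i][j] for i in range(4)]) for j in range(4)]
    # cols[j] holds output column j; the rounding offset of 32 is an assumption.
    return [[(cols[j][i] + 32) >> 6 for j in range(4)] for i in range(4)]
```

A block that contains only a rescaled DC coefficient of 256 is reconstructed as sixteen residual samples equal to 4.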
Intra 16 × 16 Execution Time
The intra 16 × 16 execution time is presented in Figure 19. This figure is divided into two parts. The first part concerns the intra 16 × 16 prediction and takes 115 clock cycles to produce the best predicted MB [21]. The second part concerns the coding chain block, which needs 77 clock cycles. In this part, we use a pipeline, as shown in Figure 19. A further 16 clock cycles are needed to obtain the reconstructed MB. In total, 208 clock cycles (115 + 77 + 16) are necessary to complete the intra 16 × 16 operations. Compared with [7] and [8], the proposed architecture takes fewer clock cycles. Simulation of our proposed RTL design shows major improvements, reducing the number of clock cycles for the intra 16 × 16 operation as shown in Table 4. Thus, our hardware implementation achieves higher performance for the H.264 video encoder than the hardware architectures presented in [7] [8].
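The cycle budget above translates into time per MB and per frame as in the following short calculation. It is only an indicative sketch: it assumes the 160 MHz clock reported for the FPGA implementation, a CIF frame of 352 × 288 pixels, and that every MB is processed by the intra 16 × 16 module with no bus or software overhead.

```python
PREDICTION_CYCLES = 115      # intra 16 x 16 prediction
CODING_CHAIN_CYCLES = 77     # pipelined transform and quantization chain
RECONSTRUCTION_CYCLES = 16   # reconstructed MB output
CLOCK_HZ = 160_000_000       # 160 MHz

cycles_per_mb = PREDICTION_CYCLES + CODING_CHAIN_CYCLES + RECONSTRUCTION_CYCLES
mbs_per_cif_frame = (352 // 16) * (288 // 16)   # CIF: 22 x 18 macroblocks
cycles_per_frame = cycles_per_mb * mbs_per_cif_frame

time_per_mb_us = cycles_per_mb / CLOCK_HZ * 1e6
time_per_frame_ms = cycles_per_frame / CLOCK_HZ * 1e3

print(f"{cycles_per_mb} cycles per MB, {time_per_mb_us:.2f} us per MB")
print(f"{mbs_per_cif_frame} MBs per CIF frame, {time_per_frame_ms:.3f} ms per frame")
```

Under these assumptions, one MB takes about 1.3 µs and one CIF frame about 0.51 ms of hardware processing time.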
Experimental Results
The whole design has been described in VHDL at the register transfer level (RTL). The VHDL code of all modules was synthesized for an EP2S60F1020C3 Altera Stratix II FPGA circuit using the Altera Quartus tool. Table 5 shows the implementation results of the intra 16 × 16 module for the Stratix II EP2S60 FPGA circuit. For experimental verification, we have developed a C language reference model of the H.264 software. We have compared the output results of our C reference model with the JM 10.1 model [22] and confirmed the correctness of our model. We have also used the NIOS II softcore processor to send data to the intra frame hardware coprocessor. The block diagram of the implemented H.264 intra frame encoder is shown in Figure 20. The design is composed of three parts: the NIOS II processor, the intra 16 × 16 frame module and the other peripherals connected to the Altera Avalon bus. The Avalon bus has control, data and address signals, as well as its own bus arbitration logic.
Our embedded system has been tested by using the Altera NIOS II development board. The heart of the target board is the Altera Stratix II EP2S60F1020C3 FPGA circuit. For all experiments, CIF test sequences are coded at 30 Hz. We have focussed on the following video test sequences: "Foreman", "Paris", "Mobile", "Tb420" and "Akiyo". These test sequences have different movement and camera particularities.
We have determined the processing time of intra 16 × 16 for the SW (software) solution. From Table 6, we can conclude that our HW implementation achieves a 35-fold improvement in processing speed compared to the software solution. In order to evaluate the image quality given by this architecture, we have used the average peak signal-to-noise ratio (PSNR) as a measure of objective quality. The PSNR values shown in Table 7 do not reveal any difference between the SW and HW solutions. Thus, the quality comparison confirms the correctness of the designed architecture.
Figure 21 presents the original frame and the two reconstructed frames (one from the SW solution, the other from the HW solution) for the 10th frame of the test video sequences.
Conclusions
In this paper, we have described a new flexible and efficient HW architecture for the intra 16 × 16 module of the H.264 video encoder. The hardware part has been implemented in VHDL. Compared with [7] and [8], our proposed RTL implementation gives major improvements by reducing the number of clock cycles for the intra 16 × 16 operation. The execution time is decreased by 26% even when compared with the best previous work for intra frame coding [8]. We have also designed an embedded system based on an Altera Stratix II FPGA platform running at 160 MHz in order to evaluate the performance of our design in a HW/SW codesign context. We have shown that our HW solution considerably speeds up the intra 16 × 16 process (35 times faster) compared to an all-software solution, with the same image quality.
