Current multimedia design processes spend an excessive amount of time testing new IP-blocks against references based on large video encoder specifications (usually several thousand lines of code). Properly testing a single IP-block may require converting the entire encoder from software to hardware, which is difficult to complete within the short time-to-market that competition imposes on the adoption of a new video coding standard. This paper presents a new design flow that accelerates the conformance testing of an IP-block using the H.264/AVC software reference model. An example block implementing the simplified 8 × 8 transformation and quantization adopted in FRExt is provided as a case study demonstrating the effectiveness of the approach.
Introduction
Digital video coding currently has a significant impact on the computer, telecommunications, and imaging industries, especially given the remarkable progress in the development of products and services offering full-motion digital video transmitted over heterogeneous networks at different resolutions and to different terminals. This explains the existence and success of industry standards for compressed video representation that achieve extremely high coding efficiency and enhanced robustness to different network environments. Most H.264/AVC applications are based on software implementations. Nevertheless, hardware implementations are also desirable for consumer products, since they provide consistent advantages in terms of compactness, low power, robustness, low cost, and, most importantly, real-time operation up to HDTV rates. In our previous work (Refs. 3-8), hardware implementations of different blocks in the initial H.264 transformation hierarchy model and entropy coding were presented, while in Refs. 9 and 10, a design flow that accelerates the process of testing the quality of developed IP-blocks against the H.264 software reference model was specified and described. In this paper, a high-performance hardware implementation of the simplified 8 × 8 transform and quantization module of the H.264/AVC standard is presented. The proposed design flow is used to assess the quality of the module and its conformance with the standard within a reduced time window. The design flow and the hardware architecture block have been included in the draft of the second edition of the MPEG-4 Part 9 Reference Hardware Description. 11
The paper is organized as follows. Section 2 provides an overview of the H.264/AVC video coding standard. Section 3 briefly describes the simplified 8 × 8 transformation and quantization chosen for hardware implementation. Section 4 describes the stages of the proposed design flow, and Sec. 5 gives more details about the design features of the IP-block. Sections 6 and 7 present the methodology for applying the steps of the design flow that speed up the design of the IP-block and the testing of its conformance with the H.264/AVC reference software. Section 8 presents simulation results and the achieved performance. Finally, Sec. 9 discusses considerations on the proposed design flow, and Sec. 10 concludes the paper.
development of a new international standard that is appropriate for conversational and nonconversational audio/video applications. 12-14
H.264/AVC presents many new video coding tools that make it the most powerful state-of-the-art standard compared to the existing video coding families. 14 Network friendliness and compression performance that had never been achieved before, at both high and low bit rates, are the two most important features that distinguish H.264/AVC from other standards. 15-19
Unlike other standards, H.264/AVC presents a transformation hierarchy based on exact integer arithmetic. Such specifications eliminate the mismatch between the encoder and the decoder that has been observed even when the 8 × 8 DCT/IDCT implementation is fully compliant with the IEEE 1180 recommendation, which specifies the accuracy required of approximations to the floating-point implementation. 15, 20 In the initial H.264/AVC standard, which was completed in May 2003, the transformation was primarily applied to 4 × 4 blocks, which helps reduce blocking and ringing artifacts. Fidelity range extensions (FRExt, Amendment I), an amendment added to the H.264/AVC standard in July 2004, is currently receiving wide attention in the industry. It demonstrates further coding efficiency gains over current video coding standards, potentially by as much as 3:1 for some key applications. The FRExt project produced a suite of new profiles collectively called the high profiles. In addition to supporting all features of the prior main profile, all the high profiles support an adaptive transform-block size and perceptual quantization scaling matrices. 14 The concept of adaptive transform-block size has proven to be an efficient coding tool within the H.264/AVC video coding layer design. 21 This has led to the proposal of a seamless integration of a new 8 × 8 integer approximation of the DCT (and prediction modes) into the specification with the least possible amount of technical and syntactical changes, giving significant compression performance at standard definition (SD) and high definition (HD) resolutions. 22-24

H.264 Simplified 8 × 8 Transform and Quantization
The use of block sizes smaller than 8 × 8 is limited at SD resolutions and higher. This has led to the proposal that an integer approximation of the 8 × 8 DCT be added to the JVT specification in FRExt. 24 This transform is applied to each block in the luminance component of the input video stream, and it allows bit-exact implementation across all encoders and decoders. Although it is more complex than the 4 × 4 DCT-like transform adopted by the initial H.264 specification, the 8 × 8 transform provides excellent compression performance for high-resolution video streams, while requiring a number of operations comparable to that required for the corresponding four 4 × 4 blocks using the fast butterfly implementation of the existing 4 × 4 transform. The 2D forward 8 × 8 integer transform is computed in a separable way as a 1D horizontal (row) transform followed by a 1D vertical (column) transform, as shown in Eq. (1):
The matrix C_f is given by expression (2):
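In the usual notation, with X denoting the 8 × 8 input block and Y the block of transform coefficients (both symbols are assumed here), the separable row-column transform and the 8 × 8 integer transform matrix specified for the H.264/AVC high profiles can be written as below. Folding the 1/8 normalization into C_f is one common convention and is an assumption about how expression (2) is stated.

```latex
\[
Y = C_f \, X \, C_f^{T}
\]
\[
C_f = \frac{1}{8}
\begin{bmatrix}
 8 &   8 &   8 &   8 &   8 &   8 &   8 &   8 \\
12 &  10 &   6 &   3 &  -3 &  -6 & -10 & -12 \\
 8 &   4 &  -4 &  -8 &  -8 &  -4 &   4 &   8 \\
10 &  -3 & -12 &  -6 &   6 &  12 &   3 & -10 \\
 8 &  -8 &  -8 &   8 &   8 &  -8 &  -8 &   8 \\
 6 & -12 &   3 &  10 & -10 &  -3 &  12 &  -6 \\
 4 &  -8 &   8 &  -4 &  -4 &   8 &  -8 &   4 \\
 3 &  -6 &  10 & -12 &  12 & -10 &   6 &  -3
\end{bmatrix}
\]
```

Right-multiplying by the transpose applies the 1D transform along each row, and left-multiplying by C_f applies it along each column.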
Each of the 1D transforms is computed using three-stage fast butterfly operations, as shown in Table 1. 23
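The three butterfly stages can be sketched in C++ as follows. This is a minimal illustration of a widely used add-and-shift factorization of the 8-point integer transform, not a transcription of Table 1; the names `forward8`, `forward8x8`, `Vec8`, and `Block8x8` are hypothetical. It assumes that `>>` on negative signed integers is an arithmetic shift, which is guaranteed from C++20 and is the usual behavior of earlier compilers.

```cpp
#include <array>
#include <cstdint>

using Vec8 = std::array<int32_t, 8>;
using Block8x8 = std::array<Vec8, 8>;

// One 1D pass of the 8-point forward integer transform, using only
// additions, subtractions, and right shifts (no multiplications).
Vec8 forward8(const Vec8& x) {
    // Stage 1: sums and differences of mirrored samples.
    const int32_t s0 = x[0] + x[7], s1 = x[1] + x[6], s2 = x[2] + x[5], s3 = x[3] + x[4];
    const int32_t d0 = x[0] - x[7], d1 = x[1] - x[6], d2 = x[2] - x[5], d3 = x[3] - x[4];

    // Stage 2: even part (e*) and odd part (o*); v + (v >> 1) is 1.5 * v.
    const int32_t e0 = s0 + s3, e1 = s1 + s2, e2 = s0 - s3, e3 = s1 - s2;
    const int32_t o0 = d1 + d2 + (d0 + (d0 >> 1));
    const int32_t o1 = d0 - d3 - (d2 + (d2 >> 1));
    const int32_t o2 = d0 + d3 - (d1 + (d1 >> 1));
    const int32_t o3 = d1 - d2 + (d3 + (d3 >> 1));

    // Stage 3: combine into the eight output coefficients.
    return {e0 + e1, o0 + (o3 >> 2), e2 + (e3 >> 1), o1 + (o2 >> 2),
            e0 - e1, o2 - (o1 >> 2), (e2 >> 1) - e3, (o0 >> 2) - o3};
}

// 2D transform: horizontal (row) pass followed by vertical (column) pass.
Block8x8 forward8x8(const Block8x8& in) {
    Block8x8 tmp{}, out{};
    for (int r = 0; r < 8; ++r) tmp[r] = forward8(in[r]);
    for (int c = 0; c < 8; ++c) {
        Vec8 col{};
        for (int r = 0; r < 8; ++r) col[r] = tmp[r][c];
        const Vec8 t = forward8(col);
        for (int r = 0; r < 8; ++r) out[r][c] = t[r];
    }
    return out;
}
```

Feeding the impulse {8, 0, 0, 0, 0, 0, 0, 0} to `forward8` returns 8, 12, 8, 10, 8, 6, 4, 3, which is the first column of the integer matrix 8·C_f and gives a quick check of the factorization.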
As can be seen from the butterfly operations, the 2D transform can be implemented using signed additions and right-shifts only, avoiding expensive multiplications. The post-scaling and quantization formulas are provided in Eqs. (3)-(5):
QP is a quantization parameter that determines the coarseness of the quantization process. It enables the encoder to control the trade-off between bit rate and quality accurately and flexibly. It takes an integer value in the range 0 to 51, with low values representing less quantization and hence better quality of the reconstructed frame. Z_ij represents an element in the quantized transform coefficients matrix. MF is a multiplication factor that depends on m = QP mod 6 and on the position (i, j) of the element in the matrix, as shown in Table 2. SHR( ) is a procedure that right-shifts its first argument by a number of bits equal to its second argument. f is defined in the software reference model as 2^qbits/3 for intra-blocks and 2^qbits/6 for inter-blocks. 12,13
Table 2. Multiplication factor (MF) for the prototyped architecture.
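As an illustration of how QP-derived quantities, MF, f, and SHR( ) interact, the following C++ sketch assumes that the quantization rule has the form commonly used in H.264 reference implementations: the magnitude of a transform coefficient W is multiplied by MF, the offset f is added, the result is right-shifted by qbits, and the sign of W is restored. The function names are hypothetical, and `mf` and `qbits` are passed in by the caller because their values come from Table 2 and Eqs. (3)-(5).

```cpp
#include <cstdint>

// Rounding offset f: 2^qbits / 3 for intra-blocks, 2^qbits / 6 for inter-blocks.
inline int64_t roundingOffset(int qbits, bool intra) {
    return (int64_t{1} << qbits) / (intra ? 3 : 6);
}

// Quantize one transform coefficient w with multiplication factor mf.
// The shift is applied to a nonnegative magnitude, so SHR(a, b) is a >> b.
inline int32_t quantize(int32_t w, int32_t mf, int qbits, bool intra) {
    const int64_t magnitude = w < 0 ? -int64_t{w} : int64_t{w};
    const int64_t level = (magnitude * mf + roundingOffset(qbits, intra)) >> qbits;
    return static_cast<int32_t>(w < 0 ? -level : level);
}
```

Working on the magnitude and restoring the sign afterwards keeps the rounding symmetric around zero, so a coefficient and its negation quantize to levels of equal magnitude.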
Design Flow of an H.264 HW/SW Video Encoder
This section provides a general description of the complete process used to design the proposed IP-block, starting with HW/SW partitioning of the reference software, passing through functional verification, and ending with physical implementation on a Xilinx Virtex-II FPGA. Figure 1 gives the flow chart of the complete design flow. 9
First, an extensive analysis is performed to validate the choice of converting the integer 8 × 8 DCT transform from software to hardware. Then, the hardware design to be realized is described, followed by functional verification using SystemC as the formalism and simulation environment. After functional verification, simulation at the RTL level of abstraction is performed using the University of Calgary Rapid Prototyping Platform (UCRPP). 25, 26 The process ends with system synthesis followed by place and route, which is a vendor-dependent step, and the physical programming of the FPGA itself.
Hardware Prototyping of the IP-Block
Figure 2 shows a block diagram of the developed IP-block architecture. The IP accepts the following inputs: 8 × 8 parallel blocks, QP, a synchronizing clock, and an enabling signal (input valid). It outputs the quantized transform coefficients matrix and the signal (output valid). The architecture is designed to perform pipelined operations. This reduces the required memory resources and accesses, avoids stall states, and improves the throughput of the architecture. Figure 3 provides a detailed block diagram of the architecture showing the data flow between the components.
The architecture consists of two main stages. The first contains two blocks: the "Transform" block, which is composed of the three stages of fast butterfly operations mentioned in Sec. 3, repeated twice (for the horizontal and vertical transforms), and the "QP-processing" block, which calculates the intermediate parameters needed for quantization, such as f, qbits, and P0-P5, the values of the multiplication factors at the six different groups of positions in the matrix, as shown in Table 2. The quantization process takes place in the second main stage of the design: the addition and multiplication operations are performed in the arithmetic block, followed by the shifting operations in the shifter block.
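The role of the "QP-processing" block can be sketched in C++ as follows. The sketch assumes that qbits grows by one for every six steps of QP from a base value, which is taken as a parameter (`qbitsBase`) rather than fixed here, and it classifies coefficient positions using the six position classes defined for 8 × 8 blocks in the H.264/AVC high profiles. The group indices 0 to 5 are illustrative and need not match the order of P0-P5 in Table 2; all names are hypothetical.

```cpp
#include <cstdint>

struct QpParams {
    int m;      // QP mod 6, selects the set of multiplication factors
    int qbits;  // shift amount used by the quantizer
    int64_t f;  // rounding offset, 2^qbits / 3 (intra) or 2^qbits / 6 (inter)
};

// Derive the intermediate quantization parameters from QP.
// qbitsBase is an assumed input: the value of qbits at QP = 0.
inline QpParams processQp(int qp, int qbitsBase, bool intra) {
    const int qbits = qbitsBase + qp / 6;
    return {qp % 6, qbits, (int64_t{1} << qbits) / (intra ? 3 : 6)};
}

// Map a coefficient position (i, j) in the 8 x 8 matrix to one of six groups
// that share a multiplication factor (group numbering is illustrative).
inline int positionGroup(int i, int j) {
    const bool iOdd = (i % 2) == 1, jOdd = (j % 2) == 1;
    const int i4 = i % 4, j4 = j % 4;
    if (i4 == 0 && j4 == 0) return 0;
    if (iOdd && jOdd) return 1;
    if (i4 == 2 && j4 == 2) return 2;
    if ((i4 == 0 && jOdd) || (iOdd && j4 == 0)) return 3;
    if ((i4 == 0 && j4 == 2) || (i4 == 2 && j4 == 0)) return 4;
    return 5;
}
```

Because only m and the position group are needed to select a factor, six values per QP suffice for the whole 8 × 8 block, which is what allows them to be computed once per block before the arithmetic stage.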
Functional Verification Using SystemC
Design productivity is the key to reduced time-to-market and is an essential element to consider when releasing a new design. Early functional verification is therefore a fundamental step for IP providers seeking to avoid prolonged product development phases. This gives special importance to the SystemC verification step in the design flow, although a direct transition from abstract design blocks to HDL description blocks is also possible. SystemC enables the designer to perform early functional verification of developed hardware blocks by facilitating their integration with software in a unified platform. It provides hardware-oriented constructs within the context of C++ as a class library implemented in standard C++. This facilitates the integration of SystemC-emulated hardware blocks with any software reference model, since they all originate from the same environment.
In Ref. 10, hardware definition switches are described that enable concurrent development and testing of different hardware blocks (represented in SystemC) with the JM H.264 software reference model. This approach facilitates the development of hardware blocks during the software stabilization phase. One of the major goals was to minimize the modifications to the code required for the inclusion of the corresponding HW-described blocks. Figures 4 and 5 provide examples of the modifications that have been introduced to the code in order to embed the HW block presented in the previous section. 10
By using this approach, it has been possible to perform behavioral simulations of the DCT block, showing that it is functionally compliant with the reference software.
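The idea of a hardware definition switch can be illustrated with a small C++ sketch. It is not the code of Figs. 4 and 5: the macro `HW_SWITCH`, the function `transformSw`, and the class `TransformHwModel` are hypothetical, and a two-point butterfly stands in for the real block. The point is that the call site in the reference software stays unchanged while a single switch selects between the original routine and the hardware model, so both outputs can be compared.

```cpp
#include <cstdint>

struct Pair {
    int32_t sum;
    int32_t diff;
};

// Software path: stands in for the original reference-software routine.
inline Pair transformSw(int32_t a, int32_t b) { return {a + b, a - b}; }

// Hardware-model path: stands in for the emulated hardware block.
class TransformHwModel {
public:
    Pair process(int32_t a, int32_t b) const { return {a + b, a - b}; }
};

// Hypothetical switch: 0 keeps the original software, 1 embeds the HW model.
#ifndef HW_SWITCH
#define HW_SWITCH 0
#endif

// Single call site used by the rest of the encoder.
inline Pair transform(int32_t a, int32_t b) {
#if HW_SWITCH
    static const TransformHwModel hw{};
    return hw.process(a, b);
#else
    return transformSw(a, b);
#endif
}

// Conformance check: both paths must produce identical results.
inline bool pathsAgree(int32_t a, int32_t b) {
    const Pair sw = transformSw(a, b);
    const Pair hw = TransformHwModel{}.process(a, b);
    return sw.sum == hw.sum && sw.diff == hw.diff;
}
```

Building once with the switch at 0 and once at 1 and comparing the encoder output corresponds to the before-and-after comparison used to establish functional compliance.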
The Rapid Prototyping Platform
A PCMCIA prototyping FPGA card, shown in Fig. 6, has been chosen as the platform to integrate the HDL implementation of the IP-block with the reference software. 25 It connects the FPGA HW with a portable host computer through the PCI bus by plugging into a standard PCMCIA socket, as shown in Fig. 7. Some of the components of the prototyping platform are located in the host system, while others are on the pluggable FPGA card. 26 The host side comprises the main general-purpose processor and the support chips for a standard PCI bus. Its main storage holds the FPGA device configuration files and the software part of the system (i.e., the H.264 software reference model). Optionally, the host direct memory access (DMA) and interrupt controllers can be part of the design to increase system performance: the data communication bottlenecks between the host computer and the card can be minimized by the use of DMA and interrupt-based control. The FPGA card, on the other hand, hosts the FPGA chip, a programmable clock generator, and a PCI-bus interface controller. A block diagram of the DCT/Q module integration using the rapid prototyping platform is shown in Fig. 8. 27
The process of integrating the hardware block in the platform is performed through an improved system IP hardware interface controller, which provides an easy way for the IP-block to exchange data between the host system and the memory space on the WildCard II. The controller has a set of interface signals with the 
Simulations and Results
In our experiments, the design flow introduced in Sec. 4 was used to accelerate the development of the IP-block introduced in Sec. 5. We first embedded the SystemC-emulated architecture into the JM FRExt 2.2 project and compared the output stream (with the HW switch set to "1") with the output from the original software (HW switch set to "0"). Figure 9 reports a visual comparison between the output video streams before and after embedding the SystemC blocks.
In addition to the visual comparison, we performed a statistical comparison of the PSNR of the luminance and chrominance components of the reconstructed video sequences with the HW switch set to "0" against the corresponding values with the HW switch set to "1". The results in both cases are identical, except for the required simulation time, which, as expected, increases when the emulated SystemC block is used because of the overhead of the hardware emulation stage. Figures 10 and 11 provide a summary of the results reported by JM FRExt 2.2 before and after embedding the 8 × 8 DCT SystemC block. 10 The UCRPP then enables us to test the block starting from the RTL level of abstraction. Simulation was performed using the Mentor Graphics© ModelSim 5.4® simulation tool, and synthesis using Synplify Pro 7.1® from Synplicity©. The target technology is the FPGA device XC2V4000 (BF957 package) from the Virtex-II family of Xilinx©. Table 3 summarizes the performance of the prototyped architecture. This value is about 216 times faster than the 16.67 ms time required for continuous motion (assuming a refresh rate of 60 frames/s). Similarly, the time required to encode a complete high definition television (HDTV) frame with a resolution of 720 × 1280 pixels at a frame rate of 60 frames/s is 0.21 ms, which is about 79 times faster than the 16.67 ms time required for continuous motion. Hence, the architecture presented in this paper easily satisfies the real-time constraints for SD, HD, and even higher resolution video formats.
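The real-time margins quoted above are the ratio between the frame period at the target frame rate and the time the architecture needs per frame. The short C++ sketch below reproduces that arithmetic for the HDTV case using the 0.21 ms and 60 frames/s figures from the text; the function names are hypothetical, and the block count assumes frame dimensions that are multiples of 8.

```cpp
#include <cstdint>

// Number of 8 x 8 luminance blocks in a frame (dimensions assumed multiples of 8).
constexpr int64_t blocksPerFrame(int rows, int cols) {
    return int64_t{rows / 8} * (cols / 8);
}

// Ratio of the frame period (in ms) at the given rate to the processing time.
constexpr double realTimeMargin(double frameTimeMs, double framesPerSecond) {
    return (1000.0 / framesPerSecond) / frameTimeMs;
}
```

For a 720 × 1280 frame, `blocksPerFrame(720, 1280)` gives 14400 blocks, and `realTimeMargin(0.21, 60.0)` evaluates to roughly 79, matching the factor stated for HDTV.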
Comments on the Proposed Design Flow
In this paper, a high-performance IP-block implementing the simplified 8 × 8 transformation and quantization recently adopted by the H.264/AVC standard was developed. The architecture was shown to satisfy the real-time constraints required by different high-resolution digital video applications. A novel design flow was used that facilitates early functional verification of developed IP-blocks by assessing their conformance to the reference specification, as well as the overall system behavior, before physical implementation. The methodology of the design flow accelerated the process of testing the quality of the IP-block by using the H.264/AVC software reference model itself. The presented design flow methodology can be applied to the development of any HW-block, thus constituting an integrated framework for the transformation of pure SW specifications into real hybrid SW/HW implementations. The design flow tools, the support platform, and a library of IP-blocks are in the process of being included in the ISO/IEC MPEG-4 Part 9 Technical Report called "Reference Hardware Description".
