Abstract: This paper presents an efficient hardware architecture for a high-performance SVC (Scalable Video Coding) encoder. The platform uses a dedicated hardware architecture to improve performance. The architecture was prototyped in Verilog HDL and synthesized using the Synopsys Design Compiler with a 65 nm standard cell library. Synthesized for a clock frequency of 266 MHz, the platform contains 2,500,000 logic gates and 750,000 memory gates. The SVC encoder processes Full HD (1920x1080), HD (1280x720), and D1 (720x480) video at 30 frames/s with a 266 MHz clock.
Introduction
The H.264/AVC scalable extension (SVC) video coding standard [1] has attracted increasing attention due to its higher coding efficiency compared with previous standards [2]. It was designed to provide temporal, spatial, and quality scalability for streaming multimedia applications over various networks [1]. However, compared with the baseline profile, high profile and SVC encoders require two and four times the computation and memory bandwidth, respectively [3].
Most SoC (System on a Chip) implementations use dedicated video processors for complex, parallel functions such as video compression, and programmable digital signal processors (DSPs) for serial data processing. The computational complexity of a software-based H.264/AVC baseline profile decoder has also been analyzed [4].
Such approaches are software-based solutions, and they are difficult to run in real time [5]. An H.264 video decoder with extremely low power dissipation meets the growing demand for low-cost implementation of such terminals. These applications require low power consumption and high memory bandwidth. However, the power consumption of high-performance processors is high. In contrast, dedicated hardware offers lower power and higher performance than a software implementation [6] [7] [8] [9] [10]. This paper presents a scalable video encoder with a dedicated hardware architecture. The proposed architecture achieves both high performance and low power. The encoder processes Full HD (1920x1080), HD (1280x720), and D1 (720x480) video at 30 frames/s with a 266 MHz clock. Section 2 presents an architectural overview of the scalable video encoder. Sections 3 and 4 discuss the hardware module design and simulation. Conclusions are presented in Section 5.
Architecture of Scalable Video Encoder
Fig. 1 presents a block diagram of the scalable video encoder. Dedicated engines were implemented for the fixed functions in SVC: image buffer, integer motion estimation (IME), fine motion estimation (FME), motion compensation (MC), intra prediction (IP), transform and quantization (TQ), entropy coding (VLC), inverse transform and dequantization, prediction, upsampling, reconstruction, deblocking filter (DB), restore, and host interface. In the proposed architecture, the IME, upsampling, restore, and image buffer blocks access external memory through the DMA (direct memory access) controller.
The DMA and external memory interface block transfers data between internal memory and external frame memory. Transfers between local memory and external frame memory operate in a single cycle. The DMA supports several modes of operation under the AMBA AHB specification: compatibility with AMBA AHB v2.0; byte, halfword, and word data sizes; an incremental addressing scheme; proprietary 1-, 2-, and 3-dimensional DMA operation; and a multibank interleaving mode. The DMA controller also has several special features. It interfaces with all internal modules through a single channel, which is a programmable DMA that supports both burst block mode and packet mode for data transfer. In addition, it uses a dual-address DMA architecture without buffer memory; in a dual-address transfer, explicit addresses select the correct destination.
Fig. 2 shows the macroblock-level pipeline flow managed by the controller. The pipeline consists of nine encoding stages, and each stage must complete in fewer than 600 cycles. Fig. 3 presents the critical-path performance of the SVC encoder. The critical-path cycle counts were evaluated for four configurations: a processor, an application processor, hardware, and hardware with DMA optimization.
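The 600-cycle stage limit can be related to the clock rate and frame rate with a rough budget calculation. The sketch below assumes that the Full HD, HD, and D1 layers are all encoded at 30 frames/s by one shared macroblock pipeline running at 266 MHz; that sharing, and the function names, are assumptions made for illustration and are not stated in the paper.

```python
# Assumption: all three spatial layers are encoded by one shared macroblock pipeline.
CLOCK_HZ = 266_000_000
FRAMES_PER_SECOND = 30
LAYERS = {"Full HD": (1920, 1080), "HD": (1280, 720), "D1": (720, 480)}
MB_SIZE = 16  # a macroblock covers 16x16 luma samples


def macroblocks(width, height):
    """Count 16x16 macroblocks, rounding partial rows and columns up."""
    cols = (width + MB_SIZE - 1) // MB_SIZE
    rows = (height + MB_SIZE - 1) // MB_SIZE
    return cols * rows


def cycles_per_macroblock(clock_hz, fps, layers):
    """Average clock cycles available per macroblock across all layers."""
    per_frame = sum(macroblocks(w, h) for w, h in layers.values())
    return clock_hz / (fps * per_frame)


budget = cycles_per_macroblock(CLOCK_HZ, FRAMES_PER_SECOND, LAYERS)
print(f"{budget:.0f} cycles per macroblock")
```

Under these assumptions the three layers total 13,110 macroblocks per frame time, which leaves roughly 676 cycles per macroblock. A pipeline whose slowest stage finishes in fewer than 600 cycles stays within that budget with some margin.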
Proposed Hardware Module Design

Motion Estimation and Motion compensation
Variable block sizes of 16x16, 16x8, 8x16, and 8x8 are supported. Subsampled motion estimation was designed for the integer-pel level, and hierarchical motion estimation for the half-pel and quarter-pel levels. The accuracy of motion compensation is expressed in units of one quarter of the sample distance for luma and chroma pixels. The motion estimation processing elements are organized as a three-stage pipeline, which gives high performance. For the SAD (sum of absolute differences) calculation, eight input pixels, packed as 64-bit data, are processed in parallel.
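The SAD calculation on packed 64-bit data can be modeled in software as follows. This is a minimal behavioral sketch, not the RTL: it assumes 8-bit pixels packed eight to a word with the first pixel in the least significant byte, and it assumes that subsampling skips rows. The paper does not specify the packing order or the subsampling pattern, and the function names are hypothetical.

```python
def pack8(pixels):
    """Pack eight 8-bit pixels into one 64-bit word (first pixel in the low byte)."""
    word = 0
    for lane, value in enumerate(pixels):
        word |= (value & 0xFF) << (8 * lane)
    return word


def sad_word(cur_word, ref_word):
    """SAD of the eight byte lanes of two 64-bit words.

    Hardware would evaluate the eight lanes in parallel; the loop models that.
    """
    total = 0
    for lane in range(8):
        a = (cur_word >> (8 * lane)) & 0xFF
        b = (ref_word >> (8 * lane)) & 0xFF
        total += abs(a - b)
    return total


def sad_block(cur, ref, row_step=1):
    """SAD of two equally sized blocks given as lists of rows.

    Row widths must be multiples of 8. row_step=2 visits every other row,
    an assumed form of subsampling.
    """
    total = 0
    for y in range(0, len(cur), row_step):
        for x in range(0, len(cur[y]), 8):
            total += sad_word(pack8(cur[y][x:x + 8]), pack8(ref[y][x:x + 8]))
    return total


current = [[10] * 16 for _ in range(16)]
reference = [[12] * 16 for _ in range(16)]
full_sad = sad_block(current, reference)
subsampled_sad = sad_block(current, reference, row_step=2)
```

With row subsampling the model touches half as many words per candidate position, which is the kind of saving that subsampled integer-pel search is meant to provide.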
Integer Transform Inverse Quantization (ITIQ) and Intra Prediction
The encoder uses three transforms, depending on the type of residual data coded in the bitstream: a transform for the 4x4 array of luma DC coefficients in intra macroblocks predicted in 16x16 mode, a transform for the 2x2 array of chroma DC coefficients in any macroblock, and a transform for all other 4x4 blocks of residual data. The ITIQ control flow therefore depends on the macroblock type, which makes it more complex than in the ISO/IEC 13818-2 (MPEG-2 Video) International Standard. In this study, the transform core was designed around a 4x4 block unit to allow parallel processing.
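The three transforms can be illustrated with the forward matrices defined for H.264/AVC: the 4x4 integer core transform for residual blocks, and 4x4 and 2x2 Hadamard transforms for the luma and chroma DC arrays. The sketch below shows only the matrix products and omits the scaling and quantization steps that follow them; the dictionary keys and function names are hypothetical.

```python
# Forward transform matrices from H.264/AVC (scaling and quantization omitted).
CORE_4X4 = [[1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1]]
HADAMARD_4X4 = [[1, 1, 1, 1], [1, 1, -1, -1], [1, -1, -1, 1], [1, -1, 1, -1]]
HADAMARD_2X2 = [[1, 1], [1, -1]]

# Hypothetical labels for the three kinds of residual data.
TRANSFORMS = {
    "residual_4x4": CORE_4X4,
    "luma_dc_4x4": HADAMARD_4X4,   # Intra_16x16 macroblocks only
    "chroma_dc_2x2": HADAMARD_2X2,
}


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def forward_transform(kind, block):
    """Apply T * X * T^T for the transform matrix T selected by block kind."""
    t = TRANSFORMS[kind]
    return matmul(matmul(t, block), transpose(t))


flat = [[1] * 4 for _ in range(4)]
coeffs = forward_transform("residual_4x4", flat)
chroma_dc = forward_transform("chroma_dc_2x2", [[1, 2], [3, 4]])
```

A flat block concentrates all of its energy in the top-left (DC) coefficient, which is why the DC terms are gathered and transformed a second time in the Intra_16x16 and chroma cases.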
Intra Prediction
The intra prediction architecture processes the luma block and the chroma block in parallel. For 4x4 luma blocks, there are nine intra prediction modes labeled from 0 to 8: vertical (0), horizontal (1), DC (2), diagonal down-left (3), diagonal down-right (4), vertical-right (5), horizontal-down (6), vertical-left (7), and horizontal-up (8). For luma blocks in the Intra_16x16 macroblock type, the prediction modes are vertical, horizontal, DC, and plane prediction. Intra coding of the chroma blocks likewise uses vertical, horizontal, DC, and plane prediction.
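As an illustration of the 4x4 luma modes, the sketch below implements the three simplest ones (vertical, horizontal, and DC) and picks the mode with the lowest SAD against the source block. It assumes that both the top and left neighbors are available, and the SAD-based decision is an assumption for illustration; the paper does not describe its mode-decision rule. Function names are hypothetical.

```python
def predict_4x4(mode, top, left):
    """Predict a 4x4 block from four top and four left neighbor samples.

    Only modes 0 (vertical), 1 (horizontal), and 2 (DC) are modeled, and
    both neighbor sets are assumed to be available.
    """
    if mode == 0:
        return [list(top) for _ in range(4)]
    if mode == 1:
        return [[left[y]] * 4 for y in range(4)]
    if mode == 2:
        dc = (sum(top) + sum(left) + 4) >> 3  # rounded mean of eight neighbors
        return [[dc] * 4 for _ in range(4)]
    raise ValueError("mode not modeled in this sketch")


def block_sad(a, b):
    """Sum of absolute differences between two 4x4 blocks."""
    return sum(abs(a[y][x] - b[y][x]) for y in range(4) for x in range(4))


def best_mode(block, top, left):
    """Return (mode, cost) with the lowest SAD among the modeled modes."""
    costs = [(block_sad(block, predict_4x4(m, top, left)), m) for m in (0, 1, 2)]
    cost, mode = min(costs)
    return mode, cost


top = [10, 20, 30, 40]
left = [1, 2, 3, 4]
source = [list(top) for _ in range(4)]  # columns repeat the top neighbors
chosen = best_mode(source, top, left)
```

Because each mode depends only on the neighbor samples, the candidate predictions can be evaluated independently, which suits a parallel hardware datapath.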
Deblocking Filter
Conditional filtering is applied to all macroblocks of a picture. The filtering is performed on a macroblock basis, with the macroblocks processed in raster-scan order throughout the picture. For luma, the first step filters the 16 samples of each of the 4 vertical edges of the 4x4 raster, from the left edge to the right edge. Filtering of the 4 horizontal edges (vertical filtering) follows in the same manner, beginning with the top edge. The same ordering applies to chroma filtering, except that 2 edges of 8 samples each are filtered in each direction. This process also affects the boundaries of the reconstructed macroblocks above and to the left of the current macroblock. The platform architecture consists of dedicated hardware engines, supports the Scalable High profile at level 5.1, and operates at 266 MHz. Table 1 lists the features and specifications. Table 2 compares the performance of this work with previous work.
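The edge ordering described above for the deblocking filter can be written down as a short enumeration. The sketch below lists the edges of one macroblock in the order they are filtered; it assumes 4:2:0 chroma (8x8 samples per chroma block) and edges spaced 4 samples apart, and it does not model the filter itself. The function name and tuple layout are hypothetical.

```python
def edge_order(plane="luma"):
    """List (direction, offset, samples) for one macroblock in filtering order.

    direction is the orientation of the edge, offset is its distance in
    samples from the left (vertical edges) or top (horizontal edges) of the
    macroblock, and samples is the edge length.
    """
    if plane == "luma":
        size, edges = 16, 4
    elif plane == "chroma":
        size, edges = 8, 2  # assumes 4:2:0 sampling
    else:
        raise ValueError("plane must be 'luma' or 'chroma'")
    spacing = size // edges
    order = []
    # All vertical edges first (left to right), then horizontal (top to bottom).
    for direction in ("vertical", "horizontal"):
        for index in range(edges):
            order.append((direction, index * spacing, size))
    return order


luma_edges = edge_order("luma")
chroma_edges = edge_order("chroma")
```

The edges at offset 0 lie on the macroblock boundary, which is why filtering the current macroblock also modifies samples of the reconstructed neighbors to the left and above.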
Simulation and Verification
A design verification methodology was developed. Fig. 5 shows the simulation flow of the SVC encoder, which runs from a high-level C simulation down to a gate-level simulation. We developed C language models for the major functional blocks of the SVC video encoder and used them for the high-level simulation. In addition, the external environment, consisting of the host interface and synchronous dynamic random access memory (SDRAM), was modeled in HDL. The software and hardware results were simulated and tested in a co-simulation environment. The test vectors from the high-level simulation were reused for verification from the RTL-level HDL simulation through the gate-level simulation. The hardware size is 2,500,000 gates.
Conclusion
An SVC encoder was designed for multimedia applications. The platform uses a dedicated hardware architecture to improve performance and reduce power. Memory bandwidth was reduced using a multi-pipeline scheme. The architecture of the SVC encoder was developed on an FPGA platform. The chip contains 2,500,000 logic gates and 750,000 memory gates. The SVC encoder processes Full HD (1920x1080), HD (1280x720), and D1 (720x480) video at 30 frames/s with a 266 MHz clock.
