Scalable Video Coding (SVC) is an advanced video compression technique that provides temporal, spatial, and quality scalability to terminals operating under different network conditions. SVC adopts layered coding to improve coding efficiency for spatial and quality scalability. Upsampling and inter-layer prediction are the two key mechanisms for removing redundant information between layers. However, upsampling accounts for around 75% of the memory bandwidth of the SVC decoder and causes serious performance degradation, especially at high resolutions. Moreover, the complex scheduling of inter-layer prediction makes it difficult to map the SVC decoder onto parallel hardware. In this paper, we propose a method to parallelize the SVC decoder on a multi-core stream processor platform that achieves both efficiency and flexibility. We focus on the mapping issues of spatial scalability, which supports decoded frames of various resolutions. Experimental results show that the proposed SVC decoder design reduces the memory bandwidth of the upsampling module by about 95% compared with JSVM running on a single general-purpose processor.
INTRODUCTION
With the rapid growth of multimedia applications running in diverse environments, scalable video coding (SVC) has been developed as a new standard that provides scalable video service for the varying requirements of end users [1] [2]. SVC encodes a video sequence as a single bit-stream and provides temporal (frame rate), spatial (resolution), and quality (SNR) scalability to end users, who can decode parts of the encoded bit-stream. Temporal scalability with different frame rates is achieved by adapting the number of encoded hierarchical B-frames. To achieve spatial and quality scalability, SVC uses a layer prediction mechanism that improves coding efficiency. Fig. 1 shows the block diagram of an SVC decoder with three spatial layers. Since SVC is an extension of the H.264/AVC standard, the intra prediction and motion compensation mechanisms in each layer of SVC are the same as those employed in H.264/AVC. The additional techniques in SVC are upsampling and inter-layer prediction, which are designed to remove redundant information among layers in order to increase coding efficiency.
Consider an SVC encoder that supports spatial scalability with HD, 4CIF, and CIF resolutions. The HD input video is first downsampled to CIF size and encoded in the spatial base layer. Then, the texture, motion, and residual information encoded in the base layer is upsampled and forwarded to the next higher layer for inter-layer prediction. The enhancement layer encodes only the difference between its own texture/motion/residual data and that of the reference layer. In this way, a bit-stream encoded at three resolutions can serve terminals with different capabilities. For quality scalability, different quality layers use different quantization steps.
Despite the advantage of scalable service, the inter-layer prediction mechanism makes SVC more complex than H.264/AVC. SVC is therefore more difficult to implement as a real-time design, especially for high-resolution applications. One solution is to implement such a complex codec as an application-specific integrated circuit (ASIC) in order to achieve high efficiency. However, designing a dedicated circuit for a single-purpose algorithm demands considerable design effort from architects and increases the risk of design failures. Traditional programmable processors, on the other hand, are designed to serve a wide range of applications flexibly, but these general-purpose programmable cores lack the efficiency needed by multimedia applications that require real-time performance. Stream processors bridge the gap between ASICs and traditional programmable cores to achieve both efficiency and flexibility [3]. The stream programming model exploits the parallelism and locality of media-processing applications to achieve high performance, and the architecture of a stream processor is optimized for such applications on the basis of this model. However, parallelizing a media-processing application with optimal performance remains a major challenge for both hardware and software designers. Many research works have studied the mapping of stream applications onto stream processor platforms. In [4], 3D graphics algorithms are implemented on a single stream processor. For video compression applications, [5] develops a method to map H.264/AVC onto the CELL platform. To the best of our knowledge, no previous work has proposed an implementation of the SVC decoder using either ASICs or programmable cores.
In this paper, we address the problem of parallelizing and optimizing the SVC decoder on a multi-core stream processor platform. We focus on the SVC decoder with spatial scalability, whose upsampling and inter-layer prediction mechanisms result in high memory bandwidth and performance degradation. The experimental result shows that our implementation reduces the memory bandwidth of the upsampling module by about 95% compared with JSVM [7] running on a single general-purpose CPU. The result indicates that the SVC decoder can be implemented on a multi-core stream processor platform with real-time performance.
The rest of this paper is organized as follows. In section 2, we introduce the SVC standard and the optimized architecture of our stream processor. In section 3, we discuss mapping the SVC decoder with spatial scalability onto multi-core stream processors. The experimental result is shown in section 4, and conclusions are presented in section 5.
To capture data locality in a stream program, each kernel core has a configurable memory array (CMA) and register files for flexible memory usage. In brief, the kernel core is optimized to exploit the data parallelism and locality of stream programs. A more detailed description of the kernel architecture can be found in [6].
Scalable Video Coding
SVC provides spatial scalability by encoding a video sequence at multiple resolutions to meet the different needs of end users. To support spatial scalability, SVC employs a multi-layer coding mechanism, which was also adopted in MPEG-2, H.263, and MPEG-4. Fig. 1 shows a three-layer SVC decoder. Each layer corresponds to one spatial layer that supports coding at a certain video resolution. The layer with the lowest resolution is called the base layer; it decodes the first part of the encoded bit-stream. In the base layer, the motion compensation and intra prediction mechanisms are the same as those in the single loop of an H.264/AVC decoder. The other spatial layers are called enhancement layers; they decode the subsequent parts of the bit-stream according to the network conditions of the terminals. What distinguishes SVC is the inter-layer prediction mechanism, which removes redundant information between neighboring layers to improve coding efficiency. Three prediction techniques are used in SVC: inter-layer intra prediction, inter-layer motion prediction, and inter-layer residual prediction. Because SVC encodes an input video stream as picture sequences of different sizes, the motion, residual, and intra prediction information in different layers is similar except for its resolution. An enhancement layer can therefore use information upsampled from the reference layer for prediction. For inter-layer intra prediction, the reconstructed texture of the reference layer is upsampled and forwarded to the next higher layer for the intra-coded macroblocks of the enhancement layer. Similarly, an inter-coded macroblock in an enhancement layer predicts its motion and residual data from the upsampled motion and residual data of the corresponding block in the reference layer.
Only the difference between the motion/residual/texture information of the enhancement layer and the corresponding upsampled information derived from the reference layer is encoded, which further improves coding efficiency. In this way, SVC provides spatial scalability (video service at multiple resolutions) within a single bit-stream, so that end users can decode parts of the bit-stream depending on network conditions.
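The inter-layer intra prediction described above can be illustrated with a minimal Python sketch. It assumes a dyadic (2x) resolution ratio, as between CIF and 4CIF, and uses nearest-neighbour sample repetition as a simple stand-in for the interpolation filter defined by the standard. The function names `upsample_2x` and `reconstruct_enhancement` and the sample values are hypothetical and are not taken from JSVM.

```python
def upsample_2x(block):
    """Nearest-neighbour 2x upsampling of a 2-D list of samples.

    Illustrative stand-in only: the real SVC upsampling uses
    interpolation filters, not plain sample repetition.
    """
    out = []
    for row in block:
        wide = [v for v in row for _ in range(2)]  # repeat each sample horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat each row vertically
    return out


def reconstruct_enhancement(base_block, diff_block):
    """Enhancement-layer samples = upsampled reference prediction + decoded difference."""
    pred = upsample_2x(base_block)
    return [[p + d for p, d in zip(prow, drow)]
            for prow, drow in zip(pred, diff_block)]


# Hypothetical 2x2 reference-layer block and 4x4 decoded difference.
base = [[10, 20],
        [30, 40]]
diff = [[1, 0, 0, -1],
        [0, 0, 0, 0],
        [2, 2, 2, 2],
        [0, 1, 0, 1]]
enh = reconstruct_enhancement(base, diff)
```

The enhancement layer only has to carry `diff`; the bulk of the signal comes from the reference layer, which is the redundancy that inter-layer prediction removes.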
MAPPING THE SVC DECODER ONTO MULTI-CORE STREAM PROCESSORS

Maximize Parallelism and Minimize Bandwidth
Since SVC is a layered codec, its multi-layer coding flow is well suited to parallel implementation on a multi-core stream processor platform. In this paper, we implement an SVC decoder with three spatial layers. Fig. 3 shows the proposed architecture of the SVC decoder for spatial scalability. In this heterogeneous system, a general-purpose processor controls the whole system and performs CABAC, the entropy coding tool of SVC. CABAC partitions the input bit-stream into the decoded parts belonging to each layer, which are then sent to the bridges (denoted as B in Fig. 3) of the three spatial layers to be organized into stream elements. For each layer, one kernel core is assigned to the work related to motion compensation, including the quarter-pixel filter (QPF), weighted prediction (WP), and inverse transform (IT). As mentioned in section 2, our kernel core is designed with a VLIW architecture and a multi-thread mechanism, so it can process multiple streams with instruction-level and data-level parallelism. The heavy computational work of the motion compensation module can thus be offloaded to avoid performance degradation. Another kernel core is assigned to process intra-coded macroblocks for intra prediction (this module is denoted as Intra in Fig. 3). The upsampling mechanism (denoted as UP in Fig. 3), which upsamples shared data from the reference layer for the next higher layer, is an important module in the SVC decoder. In JSVM, however, upsampling of both texture and residual information is performed off-line at frame level. That is, the reconstructed texture or residual of a whole frame is loaded from the external memory and upsampled by the upsampling module to the picture size of the next higher spatial layer, and the upsampled texture or residual is then stored back to the external memory.
This off-line upsampling occupies 75% of the memory bandwidth of the whole decoding process and leads to serious performance degradation. To reduce the high memory bandwidth caused by off-line upsampling, we implement the upsampling operation for texture and residual data at macroblock level. For each spatial enhancement layer, one kernel core is assigned to receive texture/motion/residual information from the reference layer through the external memory. Once a macroblock is decoded in the reference layer, its intra or residual data is shared with the next higher enhancement layer through the external memory. The kernel core with the upsampling function then processes this data, and the upsampled results are immediately forwarded to the kernel core at the next stage. Thus, all spatial layers can decode co-located macroblocks in parallel. Because upsampling is performed at macroblock level, the memory accesses that would otherwise store the upsampled results back to the external memory are avoided. We thereby increase the parallelism of the SVC decoder and at the same time reduce its memory bandwidth, which improves efficiency.
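The difference between frame-level and macroblock-level upsampling can be sketched with a simple traffic model in Python. This is a toy model under stated assumptions, not the simulator used in this paper: it counts only luma samples at one byte each, it charges the frame-level variant one read of the reference-layer frame plus one write of the upsampled frame, and it charges the macroblock-level variant only the read of the reference-layer data. The function name `upsampling_traffic` and its parameters are hypothetical, and the numbers it produces are not the measured figures reported in the experiments.

```python
def upsampling_traffic(layers, fps, macroblock_level, bytes_per_sample=1):
    """Bytes per second exchanged with external memory by the upsampling step.

    layers: list of (width, height) from the base layer upwards.
    Assumption: luma only, one byte per sample unless overridden.
    """
    per_frame = 0
    for (ref_w, ref_h), (enh_w, enh_h) in zip(layers, layers[1:]):
        # Both variants read the reference-layer data from external memory.
        per_frame += ref_w * ref_h * bytes_per_sample
        if not macroblock_level:
            # Frame-level upsampling also writes the upsampled frame back.
            per_frame += enh_w * enh_h * bytes_per_sample
    return per_frame * fps


# CIF, 4CIF and HD resolutions at 30 fps, as in the decoder described above.
LAYERS = [(352, 288), (704, 576), (1920, 1080)]
offline = upsampling_traffic(LAYERS, 30, macroblock_level=False)
online = upsampling_traffic(LAYERS, 30, macroblock_level=True)
print(f"frame level:      {offline / 1e6:.1f} MB/s")
print(f"macroblock level: {online / 1e6:.1f} MB/s")
```

In this model the macroblock-level variant pays only for reading reference-layer data, so its traffic scales with the size of the reference layers instead of the much larger enhancement layers.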
Design in Efficiency and Flexibility
With the proposed parallel spatial layer architecture and the on-line upsampling mechanism, we reduce the additional overhead of the SVC decoder caused by multi-layer coding and upsampling. In addition, our kernel core supports an adaptive task scheduling technique that can switch an idle kernel core to share the work of a heavily loaded kernel core. This technique increases hardware utilization, especially for the SVC decoder, where the workload is imbalanced because different layers decode pictures of different sizes and because upsampling is computationally intensive. A more detailed description of the adaptive task scheduling design of the kernel core can be found in [6]. In this way, we can further improve the performance of the proposed design. Furthermore, our implementation is based on a design flow that employs two primitive units: programmable kernel cores and customized bridges, which can be designed as dedicated hardware circuits according to the data access patterns of the target media application. This hardware scheduling of the communication behavior of a specific media application can achieve high performance in a multi-core system. With the optimized architecture of the multi-core stream processor system and its programmable nature, media applications can be implemented on our architecture with both efficiency and flexibility.
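The idea of letting an idle kernel core share the work of a heavily loaded one can be sketched as follows. This is only an illustration of the general principle: the actual scheduling design is described in [6], and the policy used here (an idle core takes over half of the pending tasks of the busiest core), the function name `rebalance`, and the task labels are all assumptions made for this example.

```python
def rebalance(queues):
    """Return new per-core task queues after idle cores take over work.

    Hypothetical policy: each idle core takes half of the pending
    tasks of the currently busiest core. The input is not modified.
    """
    queues = [list(q) for q in queues]
    for idle in range(len(queues)):
        if queues[idle]:
            continue  # this core still has work of its own
        busiest = max(range(len(queues)), key=lambda i: len(queues[i]))
        share = len(queues[busiest]) // 2
        if share == 0:
            continue  # nothing worth moving
        queues[idle] = queues[busiest][-share:]
        del queues[busiest][-share:]
    return queues


# Core 1 is idle while core 2 (for example, a high-resolution layer) is overloaded.
queues = [["mb0"], [], ["mb1", "mb2", "mb3", "mb4"]]
balanced = rebalance(queues)
```

After rebalancing, the idle core holds two of the four macroblock tasks that were queued on the overloaded core, which is the kind of utilization gain the adaptive scheduling is meant to provide.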
EXPERIMENTS
Environment
To measure the memory bandwidth reduction of the proposed SVC decoder implementation, we have developed a C-level simulator that models the architecture of our multi-core stream processors within a heterogeneous system platform. Since the upsampling mechanism of the SVC decoder requires high memory bandwidth and causes serious performance degradation, we compare the memory bandwidth requirement of the upsampling module in the proposed design with that in the reference software, JSVM 9.14 [7]. Each simulation run takes as input a 1920x1080 HD stream decoded at 30 fps, and three video sequences with CIF, 4CIF, and HD resolutions are decoded.
The Simulation Result
This experiment compares the memory bandwidth requirement of the upsampling module in the SVC decoder between two implementations: JSVM and the proposed design. Fig. 4 shows the result of the comparison. The proposed design requires 31.9 MBytes/s, while JSVM requires 588.1 MBytes/s. The proposed implementation therefore reduces the memory bandwidth of the upsampling module by about 95% relative to JSVM. The result shows that the proposed design achieves high efficiency.
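The reported reduction follows directly from the two measured bandwidth figures. The short Python check below uses only the values stated above; the variable names are chosen for illustration.

```python
# Measured upsampling bandwidth in MBytes/s, as reported above.
jsvm = 588.1
proposed = 31.9

reduction = (jsvm - proposed) / jsvm  # fraction of bandwidth saved
ratio = jsvm / proposed               # how many times less bandwidth is needed

print(f"reduction: {reduction * 100:.1f}%")
print(f"ratio:     {ratio:.1f}x")
```

The reduction evaluates to roughly 94.6%, which rounds to the 95% quoted in the text, and corresponds to the proposed design needing about one eighteenth of the JSVM upsampling bandwidth.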
CONCLUSIONS
In this paper, we proposed a mapping method to parallelize the SVC decoder on an optimized multi-core stream processor platform. We aim to reduce the additional overhead in the SVC decoder that results from inter-layer prediction and from the upsampling mechanism, which occupies around 75% of the total memory bandwidth.
With the proposed parallel spatial layer architecture and the on-line upsampling mechanism, our design increases the parallelism of the SVC decoder and minimizes its memory bandwidth requirements. The experimental result demonstrates that the proposed implementation reduces the memory bandwidth of the upsampling module by about 95% relative to JSVM. The result shows that our SVC decoder design can achieve high performance.
