Abstract - In this paper, the implementation of a DSP-based video decoder compliant with the H.264/SVC standard (ISO/IEC 14496-10 Annex G) is presented. A PC-based decoder implementation has been ported to a commercial DSP. Performance optimizations have been carried out, improving the performance of the initial version by about 32% and reaching real time for CIF sequences. Moreover, conformance tests have been conducted using different H.264/SVC streams. This decoder will be the core of a multimedia terminal that will trade off energy consumption against quality of experience.
I. INTRODUCTION
In recent years, the deployment of all kinds of telecommunication networks supporting multimedia services and applications has accelerated in many parts of the world. In this context, consumer multimedia terminals play a central role. In these terminals, video decoding is one of the most demanding tasks in terms of computational load and energy consumption.
Scalable Video Coding (SVC) techniques [1] can be used in multimedia terminals to achieve a trade-off between quality and energy consumption. Although SVC techniques have been defined in most video coding standards [2] [3] [4], the SVC capabilities included in H.264 [4] surpass those of earlier standards.
In the PccMuTe* project, our research focuses on controlling energy and power consumption in multimedia terminals. A multimedia terminal prototype with a DVB-H receiver, an H.264/SVC decoder, and an audio decoder will be used to validate the experiments. In this context, the H.264/SVC decoder will be used to trade off quality against energy consumption. The multimedia terminal architecture is based on a commercial chip containing a General Purpose Processor (GPP) and a DSP [5].
Up to now, the available SVC decoder implementations have been restricted to the PC domain [6] [7] [8]. We have ported the Open SVC Decoder (OSD) [8] to the DSP and applied the methodologies proposed in [9] [10] to reduce its decoding time. Real-time performance has been reached for CIF sequences. To the best of our knowledge, no other DSP-based H.264/SVC decoder has been reported.
In this paper, a DSP implementation of a real-time H.264/SVC decoder is described. The H.264/SVC standard and the OSD are outlined in Section II. Section III describes the porting of the decoder to the DSP and its speed optimization. Section IV discusses the results of the profiling tests. Finally, Section V concludes the paper.
II. THE H.264/SVC STANDARD
Recently, an SVC extension was standardized as an annex of H.264 [4] [6]. In this standard, video compression is performed by generating a single hierarchical bit-stream containing several layers with different resolutions, frame-rates, and qualities. There is a base layer and several enhancement layers. The base layer provides basic quality. The enhancement layers provide improved quality at the cost of higher computational load and energy consumption. Since the energy consumption depends on which layers are decoded, an H.264/SVC decoder is well suited to managing energy consumption.
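The layer hierarchy can be made concrete at the bit-stream level. In H.264/SVC, NAL units of type 14 (prefix) and 20 (coded slice in scalable extension) carry a 3-byte header extension that includes dependency_id (spatial layer), quality_id, and temporal_id. The C sketch below is a minimal illustration, not code from the OSD: it parses these fields and applies a simplified extraction rule that keeps only NAL units at or below a target operating point, which is how discarding layers lowers the decoding load. Base-layer NAL units (e.g. types 1 and 5) carry no extension and are treated as layer 0 here; this is a simplification, since their temporal_id is actually signalled in the preceding prefix NAL unit. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layer identifiers taken from the SVC NAL unit header extension. */
typedef struct {
    unsigned nal_unit_type;
    unsigned dependency_id; /* spatial / coarse-grain layer, 0..7 */
    unsigned quality_id;    /* quality refinement, 0..15 */
    unsigned temporal_id;   /* temporal level, 0..7 */
} svc_layer_ids;

/* Parses the 1-byte NAL header plus, for types 14 and 20, the 3-byte
 * extension. Returns 0 on success, -1 if the buffer is too short.
 * Simplification (hypothetical helper): non-SVC NAL units get all ids = 0. */
static int parse_svc_ids(const uint8_t *nal, size_t len, svc_layer_ids *out)
{
    assert(nal != NULL && out != NULL);
    if (len < 1) return -1;
    out->nal_unit_type = nal[0] & 0x1F;
    out->dependency_id = out->quality_id = out->temporal_id = 0;
    if (out->nal_unit_type == 14 || out->nal_unit_type == 20) {
        if (len < 4) return -1;
        /* nal[1]: svc_extension_flag, idr_flag, priority_id (not needed here) */
        out->dependency_id = (nal[2] >> 4) & 0x07;
        out->quality_id    =  nal[2]       & 0x0F;
        out->temporal_id   = (nal[3] >> 5) & 0x07;
    }
    return 0;
}

/* Simplified extraction: keep a NAL unit if it does not exceed the target
 * temporal level and its DQId (16 * dependency_id + quality_id) does not
 * exceed the target DQId. */
static int keep_nal(const svc_layer_ids *ids,
                    unsigned d_target, unsigned q_target, unsigned t_target)
{
    unsigned dq = ids->dependency_id * 16 + ids->quality_id;
    assert(ids != NULL);
    return ids->temporal_id <= t_target &&
           ids->dependency_id <= d_target &&
           dq <= d_target * 16 + q_target;
}
```

Which dependency_id corresponds to which resolution depends on how the stream was encoded; the rule only compares identifiers.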
The OSD is a C-language H.264/SVC decoder for the Baseline Profile that supports all the tools needed for spatial, temporal, and quality scalability. It is based on a fully compliant H.264 Baseline decoder that also includes all the Main Profile tools except interlaced coding and weighted prediction. The OSD has been developed for a PC environment and is up to 50 times faster than the JSVM 9.16 reference decoder [11].
III. DSP IMPLEMENTATION

A. DSP architecture
The commercial processor selected to implement the mobile terminal [5] consists of two processing cores, a GPP and a DSP. The GPP [12] is intended to run a general-purpose operating system, while the DSP, a core of the C64x+ family [13], is well suited to implementing an H.264/SVC decoder thanks to its architecture optimized for video processing.
B. Porting and optimization
It is worth noting that the OSD was developed for a PC-based platform. The changes made to adapt it to the DSP were presented in [14]. In the work presented in this paper, the decoder performance has been measured using several standard sequences and profiling tools, and the modules with the highest computational load have been identified. The methodologies presented in [9] [10] have then been applied to reduce the number of CPU cycles needed to decode an H.264/SVC stream. Table I lists all the optimized modules and the percentage of improvement obtained for each.
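The profiling step can be summarized by two figures per module: its share of the total decoding cycles, which points to the modules worth optimizing first, and the relative cycle reduction after optimization, one plausible way of expressing a percentage of improvement. The C helpers below are a minimal sketch under those assumptions; the module_profile structure is hypothetical, and how the cycle counts are gathered (for example from a cycle counter around each module) is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-module cycle counts collected during a profiling run. */
typedef struct {
    const char *name;
    unsigned long long cycles;
} module_profile;

/* Share of the total decoding cycles spent in module i, in percent. */
static double module_share_pct(const module_profile *m, size_t n, size_t i)
{
    unsigned long long total = 0;
    assert(m != NULL && i < n);
    for (size_t k = 0; k < n; ++k) total += m[k].cycles;
    return total ? 100.0 * (double)m[i].cycles / (double)total : 0.0;
}

/* Index of the most expensive module: the first candidate to optimize. */
static size_t hottest_module(const module_profile *m, size_t n)
{
    size_t best = 0;
    assert(m != NULL && n > 0);
    for (size_t k = 1; k < n; ++k)
        if (m[k].cycles > m[best].cycles) best = k;
    return best;
}

/* Cycle reduction after optimization, as a percentage of the original count. */
static double improvement_pct(unsigned long long before, unsigned long long after)
{
    assert(before > 0);
    return 100.0 * ((double)before - (double)after) / (double)before;
}
```

With made-up counts of 500, 300, and 200 cycles for three modules, the first is reported as the hottest and the second accounts for 30% of the total; a module dropping from 1000 to 680 cycles shows a 32% improvement. These numbers are illustrative only and are not taken from Table I.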
Moreover, the DMA controllers have been used to speed up data transfers between internal and external memory during motion compensation, which improved performance by 5% on average.
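How the DMA transfers are organized is not detailed here; a common pattern for motion compensation on DSPs is ping-pong (double) buffering, in which the transfer of the next reference block into internal memory is issued while the current block is being processed. The C sketch below illustrates that pattern under stated assumptions: dma_start_2d and dma_wait are hypothetical stand-ins implemented with memcpy so the code runs on a host, blocks are a fixed 16x16 without the extra margin that sub-pixel interpolation needs, and process_block only sums samples in place of real interpolation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLK 16 /* assumption: 16x16 reference blocks, no interpolation margin */

/* Hypothetical stand-in for an asynchronous 2-D DMA transfer from external
 * to internal memory. Here it is a synchronous row-by-row copy. */
static void dma_start_2d(uint8_t *dst, const uint8_t *src, int stride)
{
    for (int y = 0; y < BLK; ++y)
        memcpy(dst + y * BLK, src + y * stride, BLK);
}

/* Hypothetical wait for the outstanding transfer; a no-op on the host. */
static void dma_wait(void) { }

/* Placeholder for motion-compensation work on one block in internal memory. */
static unsigned long process_block(const uint8_t *blk)
{
    unsigned long s = 0;
    for (int i = 0; i < BLK * BLK; ++i) s += blk[i];
    return s;
}

/* Ping-pong buffering over a row of nblocks reference blocks: while block i
 * is processed from buf[i & 1], block i + 1 is fetched into the other buffer. */
static unsigned long mc_row_double_buffered(const uint8_t *ref, int stride,
                                            int nblocks)
{
    static uint8_t buf[2][BLK * BLK]; /* would be placed in on-chip memory */
    unsigned long acc = 0;
    assert(ref != NULL && nblocks > 0);
    dma_start_2d(buf[0], ref, stride);            /* prime the first buffer */
    for (int i = 0; i < nblocks; ++i) {
        dma_wait();                               /* block i is in buf[i & 1] */
        if (i + 1 < nblocks)                      /* prefetch block i + 1 */
            dma_start_2d(buf[(i + 1) & 1], ref + (i + 1) * BLK, stride);
        acc += process_block(buf[i & 1]);
    }
    return acc;
}
```

On real hardware the gain comes from the transfer overlapping with process_block; in this host version the copy is synchronous, so only the buffer rotation logic is demonstrated.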
IV. TESTS
A set of tests has been carried out to verify the decoder conformance and to measure its performance. The test-bench is shown in Fig. 1-a. First, a test stream is read from a file and written into a stream buffer allocated in external memory. Then, the decoder reads the stream from this buffer, decodes it picture by picture, and writes each decoded picture into an output buffer. The picture is also written into a file.
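The test-bench flow can be sketched as a small C loop. The decoder interface (decode_fn) and fake_decode are hypothetical, introduced only so the harness runs on a host; the actual OSD entry points are not described in this paper. The stream is assumed to be already loaded into memory, each decoded picture is assumed to be 4:2:0 (w*h*3/2 bytes), and writing to a FILE stands in for writing the picture into a file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical decoder interface: consumes bytes from the stream buffer and,
 * when a picture is complete, writes it into pic and sets *got_pic to 1.
 * Returns the number of bytes consumed (0 means end of stream). */
typedef size_t (*decode_fn)(const uint8_t *buf, size_t len,
                            uint8_t *pic, size_t pic_size, int *got_pic);

/* Hypothetical stand-in decoder: every 100 input bytes yield one gray picture. */
static size_t fake_decode(const uint8_t *buf, size_t len,
                          uint8_t *pic, size_t pic_size, int *got_pic)
{
    size_t used = len < 100 ? len : 100;
    (void)buf;
    memset(pic, 128, pic_size);
    *got_pic = used > 0;
    return used;
}

/* Decodes the in-memory stream picture by picture, keeps each picture in an
 * output buffer, and appends it to out. Returns pictures written or -1. */
static int run_testbench(const uint8_t *stream, size_t len, decode_fn dec,
                         int w, int h, FILE *out)
{
    size_t pic_size = (size_t)w * (size_t)h * 3 / 2; /* Y + Cb + Cr */
    uint8_t *pic = malloc(pic_size);
    size_t pos = 0;
    int pictures = 0;

    assert(stream != NULL && dec != NULL && out != NULL);
    if (pic == NULL) return -1;
    while (pos < len) {
        int got = 0;
        size_t used = dec(stream + pos, len - pos, pic, pic_size, &got);
        if (used == 0) break;                 /* decoder reports end of stream */
        pos += used;
        if (got) {
            if (fwrite(pic, 1, pic_size, out) != pic_size) {
                free(pic);
                return -1;
            }
            ++pictures;
        }
    }
    free(pic);
    return pictures;
}
```

Passing the decoder as a function pointer keeps the harness independent of the decoder build, so the same loop could drive both the un-optimized and optimized versions.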
The test-bench has been executed on the prototype board used in the PccMuTe project (see Fig. 1-b). This prototype is based on a commercial board [15] that contains the processor mentioned in Section III.A. To assess the decoder performance with the test-bench depicted in Fig. 1, several well-known video sequences (Foreman, News, Stefan, Mobile, etc.) have been encoded using a commercial H.264/SVC encoder [16]. Three 9-layer streams have been generated for each sequence, one for each of three bit-rates (0.5, 1.0, and 2.0 Mbps). The layer structure of each stream consists of all possible combinations of three spatial resolutions (CIF, QCIF, and subQCIF) and three frame-rates (24, 12, and 6 fps).
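The 9 operating points of each test stream follow from crossing the three spatial resolutions with the three frame-rates. A rough way to see how the decoding load grows across them is the luma sample rate (width x height x frame-rate). The C sketch below computes it; treating the sample rate as a load proxy is an assumption that ignores bit-rate and coding-tool effects, and the table and function names are illustrative.

```c
#include <assert.h>
#include <stdio.h>

/* Spatial resolutions and frame-rates of the 9-layer test streams. */
typedef struct { const char *name; int w, h; } resolution;

static const resolution kRes[3] = {
    {"subQCIF", 128, 96}, {"QCIF", 176, 144}, {"CIF", 352, 288}};
static const int kFps[3] = {6, 12, 24};

/* Luma samples per second for spatial index s and temporal index t. */
static long luma_rate(int s, int t)
{
    assert(s >= 0 && s < 3 && t >= 0 && t < 3);
    return (long)kRes[s].w * kRes[s].h * kFps[t];
}

/* Lists all 9 operating points with their luma sample rates. */
static void print_operating_points(void)
{
    for (int s = 0; s < 3; ++s)
        for (int t = 0; t < 3; ++t)
            printf("%-8s %2d fps: %ld luma samples/s\n",
                   kRes[s].name, kFps[t], luma_rate(s, t));
}
```

For instance, CIF at 24 fps requires 352 x 288 x 24 = 2,433,024 luma samples per second, 33 times more than subQCIF at 6 fps (128 x 96 x 6 = 73,728).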
Regarding the codec parameters used to generate the test sequences, the GOP size is 16 progressive frames, CABAC is used for entropy coding, the deblocking filter is active, all macroblock partitions are enabled for inter prediction, three reference frames are allowed, and 3 B-frames are coded for each I-frame. Table II contains the profiling results for the Foreman sequence using the un-optimized and optimized decoder versions; this sequence is the worst case among the tested streams. Table II shows the CPU usage percentage when the decoder extracts each layer of the test sequence. These results have been obtained with the DSP running at 470 MHz. In the columns for the optimized version, the percentage of improvement is given in brackets. These results show that real-time performance has been achieved for all layers of the generated streams except the most complex one (24 fps, CIF, and 2 Mbps).
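The real-time criterion behind the CPU usage figures can be stated explicitly. At a clock of 470 MHz and 24 fps, a picture must be decoded in at most 470e6 / 24, about 19.6 million cycles; equivalently, the CPU load is 100 x (cycles per picture) x (frame-rate) / (clock frequency), and real time requires this load not to exceed 100%. The C helpers below encode this arithmetic as a minimal sketch; the per-picture cycle counts passed to them would come from profiling and are not values reported in this paper.

```c
#include <assert.h>

/* CPU load in percent needed to decode fps pictures per second when each
 * picture takes cycles_per_pic cycles on a core clocked at clock_hz. */
static double cpu_load_pct(double cycles_per_pic, double fps, double clock_hz)
{
    assert(clock_hz > 0.0);
    return 100.0 * cycles_per_pic * fps / clock_hz;
}

/* Largest per-picture cycle count that still sustains fps in real time. */
static double cycle_budget_per_pic(double fps, double clock_hz)
{
    assert(fps > 0.0);
    return clock_hz / fps;
}

/* Real time is reached when the required load does not exceed 100%. */
static int is_real_time(double cycles_per_pic, double fps, double clock_hz)
{
    return cpu_load_pct(cycles_per_pic, fps, clock_hz) <= 100.0;
}
```

For example, a layer needing 19 million cycles per picture at 24 fps loads the 470 MHz DSP to about 97%, while 20 million cycles per picture would exceed 100% and miss real time.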
V. CONCLUSION & FUTURE WORK
An H.264/SVC decoder has been implemented by porting the Open SVC Decoder software from the PC to the DSP environment. Several optimization techniques have been applied to reach real-time performance for CIF sequences, and further techniques will be applied in the near future. To the best of our knowledge, no other DSP-based H.264/SVC decoder has been reported. This optimized decoder will be used in a multimedia terminal to trade off quality against energy consumption.
