A DSP Based H.264 Decoder for a Multi-Format IP Set-Top Box by Pescador del Oso, Fernando et al.
A DSP Based H.264 Decoder for a Multi-Format 
IP Set-Top Box 
F. Pescador, M. J. Garrido, C. Sanz, E. Juárez, D. Samper. 
Universidad Politécnica de Madrid. Spain. 
{pescador, matias, cesar, ejuarez, dsamper}@sec.upm.es 
Abstract: In this paper, the implementation of a digital signal
processor (DSP) based H.264 decoder for a multi-format set-top
box is described. Baseline and Main profiles are supported. Using
several software optimization techniques, the decoder has been
fitted into a low-cost DSP. The decoder alone has been tested in
simulation, achieving real-time performance with a 600 MHz
system clock. Finally, it has been integrated in a multi-format IP
set-top box using a commercial development board based on the
DSP @ 600 MHz. Tests in a real environment have been
performed using this board with good results.
I. INTRODUCTION 
In recent years, new video coding standards [1] [2] have
been adopted, allowing more data compression but increasing
the complexity of both encoders and decoders [3].
With the latest generation of Digital Signal Processors (DSPs)
[4]-[6], very flexible decoders can be implemented at a
relatively low cost. The complexity of an H.264 decoder may
increase by a factor of up to 2 with respect to MPEG-4 SP [3],
which in turn is more complex than MPEG-2. Thus, a
real-time H.264 standard definition DSP-based decoder is
hard to obtain [7]-[9].
In this paper, the implementation of an H.264 decoder
based on a low-cost DSP and its integration on a multi-format
IP set-top box (STB) are described. The most important
challenge is to integrate all the IP STB tasks in a low-cost
DSP. Section II explains the decoder implementation and
describes the tests performed and their results. Section III
describes the decoder integration into the STB. Finally,
Section IV is devoted to the conclusions.
II. DECODER IMPLEMENTATION 
The decoder has been implemented on a low-cost fixed-point
video-oriented DSP [6]. The Baseline Profile (BP) and Main
Profile (MP) of the H.264 standard [2] at level 3 have been
implemented. The starting point was a standard-compliant
raw-C decoder, fully tested first in a PC environment and then
moved to the DSP. This initial code was optimized to increase
the execution speed by two orders of magnitude.
In Fig. 1, a simplified flow diagram of the decoding process
for a slice unit is shown. After decoding the NAL header, the
NAL unit content is identified as a slice or another syntax
element. Afterwards, the slice header is processed and a loop is
performed to decode every macroblock (MB). Virtually all the
computational load is consumed in the slice decoding.
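As a rough illustration of the first step described above, the one-byte NAL unit header defined by H.264 can be split into its three fields and used to tell slices from other syntax elements. This is a minimal Python sketch, not the DSP code of the decoder; the function names are hypothetical, and only NAL unit types 1 and 5 (coded slices of non-IDR and IDR pictures) are treated as slices here.

```python
# Hypothetical helper names; only the bit layout comes from the H.264 standard.
SLICE_TYPES = {1, 5}  # 1: non-IDR slice, 5: IDR slice (assumption: no data partitioning)


def parse_nal_header(first_byte):
    """Split the first byte of a NAL unit into its three header fields."""
    forbidden_zero_bit = (first_byte >> 7) & 0x01  # 1 bit, must be 0
    nal_ref_idc = (first_byte >> 5) & 0x03         # 2 bits
    nal_unit_type = first_byte & 0x1F              # 5 bits
    return forbidden_zero_bit, nal_ref_idc, nal_unit_type


def is_slice(first_byte):
    """Return True when the NAL unit carries a coded slice."""
    _, _, nal_unit_type = parse_nal_header(first_byte)
    return nal_unit_type in SLICE_TYPES


if __name__ == "__main__":
    for byte in (0x65, 0x41, 0x67):
        print(hex(byte), parse_nal_header(byte), is_slice(byte))
```

Only the NAL units classified as slices would enter the macroblock loop; parameter sets and other units would be handled outside it.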
[Figure 1: flow chart. Box labels as recovered from the figure: header
decoding; motion vectors reading MB #X; DMA request to obtain block
references MB #X; ICT coefficients reading MB #X; deblocking filter
MB #X-1; DMA requests to write reconstructed MB #X-1; ICT transform
MB #X; motion compensation (arithmetic) MB #X; motion compensation
(add) MB #X; decisions "Last block?" and "Last MB?"; deblocking
filter last MB; DMA requests to write last MB.]
Fig. 1. Simplified flow diagram of the decoding process.
Several optimization steps have been carried out to
improve the decoder speed:
• Frequent arithmetic operations have been coded using
intrinsic (pseudo-assembler) instructions. The same has
been done with the core of the Deblocking Filter (DF),
Integer Cosine Transform (ICT), Context Adaptive
Variable Length Coding (CAVLC) and Motion
Compensation (MC).
• The Context Adaptive Binary Arithmetic Coding
(CABAC) core has been optimized, coded in assembly
language and parallelized by hand.
• The decoding loop has been reorganized to eliminate the
DMA waits associated with the request of the reference
macroblocks. The DF for MB#X-1 is executed after the
reference macroblocks for MB#X have been requested, so
that more instructions are executed between the DMA data
requests and the MC.
• Data needed for the DF are saved in internal buffers to
reduce data movement, as presented in [10].
• Data movement and decoding operations have been
parallelized as shown in Fig. 2.
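The loop reorganization described in the list above can be sketched as an ordered schedule of operations. The Python model below is only an illustration under stated assumptions: the operation names are hypothetical, the order follows the box labels of Fig. 1, and no real DMA or decoding is performed. It shows that the deblocking filter and write-back of MB #X-1 sit between the DMA request for the references of MB #X and the motion compensation that needs them.

```python
def decode_slice_schedule(num_mbs):
    """Return the ordered (operation, mb_index) list for one slice.

    Illustrative model only: each tuple stands for work that the real
    decoder would run on the DSP core or hand over to the DMA engine.
    """
    ops = []
    for x in range(num_mbs):
        ops.append(("read_mvs", x))
        ops.append(("dma_request_refs", x))  # non-blocking request
        ops.append(("read_ict_coeffs", x))
        if x > 0:
            # Useful work on the previous MB hides the DMA latency.
            ops.append(("deblock", x - 1))
            ops.append(("dma_write", x - 1))
        ops.append(("ict", x))
        ops.append(("mc", x))  # first step that needs the reference data
    if num_mbs > 0:
        # The last MB has no successor, so it is filtered after the loop.
        ops.append(("deblock", num_mbs - 1))
        ops.append(("dma_write", num_mbs - 1))
    return ops


if __name__ == "__main__":
    for op in decode_slice_schedule(3):
        print(op)
```

In this ordering every macroblock is still filtered and written exactly once; only the position of those steps in the loop changes.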
[Figure 2: timing diagram. For each macroblock, the processing row
shows header decoding, MV decoding, ICT coefficient reading, the
deblocking filter of the previous macroblock and the motion
compensation arithmetic, while the DMA rows show the transfers to
the reference buffer (MB X, MB X+1) and to the current frame
(MB X-1, MB X) running concurrently.]
Fig. 2. Parallelization of data movement with the decoding operations.
A set of simulation tests has been carried out to verify the
decoder and to measure its performance. Actual DVD movies
like "Star Wars: Episode I" and "Finding Nemo" and a
football sequence from a digital TV channel have been used to
generate both BP and MP H.264 test streams (see note 2). The
test-bench is shown in Fig. 3. First, a test stream is read from a
file on a picture basis and written into a stream buffer allocated in
external memory. Then, the decoder reads the stream from this
memory, decodes it and writes the decoded picture into a file.
Fig. 3. Test-bench used to profile the decoder.
Table I contains the profiling results, in average clock 
cycles per frame, for the decoder and its main parts: CABAC 
(MP), CAVLC (BP), ICT+MC, DF and others. The CPU% 
row shows the percentage of CPU load spent by the decoder 
with a 600 MHz system clock. 
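The CPU% figures can be related to the cycle counts with a simple ratio: load = (cycles per frame × frame rate) / clock frequency. The following Python sketch assumes the 25 fps frame rate of the test streams and the 600 MHz system clock; the function name and its parameters are hypothetical.

```python
def cpu_load_percent(cycles_per_frame, fps=25.0, clock_hz=600e6):
    """CPU load in percent for a given average cycle count per frame."""
    cycles_per_second = cycles_per_frame * fps
    return 100.0 * cycles_per_second / clock_hz


if __name__ == "__main__":
    # 18.5 million cycles per frame (decoder, BP, "Nemo" sequence)
    print(round(cpu_load_percent(18.5e6), 1))  # 77.1
```

A load below 100% means that, on average, a frame is decoded in less than one frame period, which is the real-time condition used in the simulation tests.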
TABLE I.
H.264 DECODER PERFORMANCE IN SIMULATION.
                    NEMO         STAR WARS     FOOTBALL      NEMO1M
#CYCLES x 10^6    BP     MP     BP     MP     BP     MP     BP     MP
DECODER          18.5   23.3   17.6   23.4   18.0   23.3   14.9   22.0
CAVLC/CABAC       5.1    8.2    4.9    8.0    5.1    7.9    3.6    7.1
ICT+MC            6.0    7.2    6.0    7.5    5.9    7.7    5.3    6.5
DF                5.9    6.1    5.2    6.0    5.2    5.8    3.9    6.4
OTHERS            1.5    1.8    1.5    1.9    1.8    1.9    2.1    2.0
CPU%             77.1   97.0   73.3   97.5   75.0   97.0   62.1   91.6
Fig. 5. Test-bench used in real-time tests.
TABLE II.
IP SET-TOP BOX PERFORMANCE DATA.
                    NEMO         STAR WARS     FOOTBALL      NEMO1M
CPU%              BP     MP     BP     MP     BP     MP     BP     MP
DECODER          81.2  107.5   77.0  108.7   79.2  107.9   65.8  101.6
TOTAL            98.2  124.5   94.0  125.7   96.2  124.5   82.8  118.6
IV. CONCLUSIONS
In this paper, the implementation of an H.264 decoder on a
low-cost DSP and its integration on a single-chip multi-format
STB have been shown. Tests in a real environment have been
made with a 600 MHz system clock. These tests show that
real-time operation is achieved for BP and could be achieved
for MP with a 720 MHz system clock. Our current work is focused
on coding some modules in assembly language and on implementing
arbitrary slice ordering (ASO), slice groups and interlaced video.
III. INTEGRATION INTO THE IP STB 
The decoder has been integrated into an IP STB [11] and
tested using a board [12] based on the DSP @ 600 MHz. The
IP STB has a multi-task software architecture that has been
built using the RF5 [13] model (see Fig. 4). A multi-task
real-time kernel schedules task execution and allows
inter-task communication. The H.264 decoder has been embedded
in the video decoding task. The test-bench can be seen in
Fig. 5. A commercial encoder [14] generates the test sequences
encapsulated in MPEG-2 Transport Stream over IP (see note 3). The
board decodes and presents the audio and video information
on a TV display. In Table II, the percentage of CPU load
spent by the decoder and by the overall system is given. These
data have been measured using an internal DSP timer.
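The multi-task organization can be pictured as a chain of tasks that exchange messages through queues. The Python sketch below is only an analogy built on stated assumptions: standard-library threads and queues stand in for the real-time kernel and its inter-task communication objects, all names are hypothetical, and the "decoding" step is a placeholder string operation rather than an H.264 decoder.

```python
import queue
import threading

STOP = object()  # sentinel that tells a task to shut down


def stage(work, inbox, outbox):
    """Generic task body: take a message, process it, pass it on."""
    while True:
        msg = inbox.get()
        if msg is STOP:
            outbox.put(STOP)  # propagate shutdown to the next task
            return
        outbox.put(work(msg))


def run_pipeline(packets):
    """Run transport -> video decoding tasks; return pictures in display order."""
    q_transport, q_video, q_play = queue.Queue(), queue.Queue(), queue.Queue()
    tasks = [
        # Transport task: extract the video payload from each packet.
        threading.Thread(target=stage,
                         args=(lambda p: p["video"], q_transport, q_video)),
        # Video decoding task: placeholder for the embedded decoder.
        threading.Thread(target=stage,
                         args=(lambda es: "picture(" + es + ")", q_video, q_play)),
    ]
    for task in tasks:
        task.start()
    for packet in packets:
        q_transport.put(packet)
    q_transport.put(STOP)

    shown = []  # the caller plays the role of the video play task
    while True:
        picture = q_play.get()
        if picture is STOP:
            break
        shown.append(picture)
    for task in tasks:
        task.join()
    return shown


if __name__ == "__main__":
    print(run_pipeline([{"video": "a"}, {"video": "b"}]))
```

Because each queue has a single producer and a single consumer, pictures leave the chain in the order in which the packets entered it.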
[Figure 4: block diagram. Tasks shown include transport, video
decoding, video play and audio play. Legend, as far as recoverable:
task, device driver channel, SCOM message, SCOM queue, mailbox;
cell algorithms: transport demultiplexing, video decoding (ALGV),
audio decoding (ALGA), video displaying (PLAYV) and audio playing
(PLAYA).]
Fig. 4. Block diagram of the software architecture (notation details in [13]).
Note 2: Length: 100 pictures. Format: 720x576 pels @ 25 fps. Average bit rate:
2 Mbps ("Nemo1M" has 1 Mbps). BP: 5% I, 95% P. MP: 4% I, 48% P, 48%
B. [online] http://www.sec.upm.es/gdem/en/test_sequences.php.
Note 3: The audio streams have been encoded with MPEG-2 Layer II (see note 2).
REFERENCES 
[1] ISO/IEC 14496-2. Information technology. Coding of audio-visual objects.
Part 2: Visual. May 2004.
[2] ISO/IEC 14496-10. Information technology. Coding of audio-visual objects.
Part 10: Advanced Video Coding. Dec. 2005.
[3] Ostermann, J. et al. "Video coding with H.264/AVC: tools, performance,
and complexity". IEEE Circuits and Systems Magazine. Vol. 4, Issue 1,
pp. 7-28, 2004.
[4] Philips Semiconductors. Nexperia Media Processors. [online]
http://www.nxp.com/products/nexperia/home/products/media_processors/index.html
[5] Analog Devices. Blackfin processors.
[online] http://www.analog.com/processors/blackfin/index.html
[6] Texas Instruments. TMS320DM642 DSPs.
[online] http://focus.ti.com/docs/prod/folders/print/tms320dm642.html
[7] Y.S. Tung et al. "DSP-Based Multi-Format Video Decoding Engine for
Media Adapter Applications". IEEE Trans. on Consumer Electronics.
Vol. 51, Issue 1, pp. 273-280, Feb. 2005.
[8] Vaidyanathan Ramadurai et al. "Implementation of H.264 decoder on
Sandblaster DSP". IEEE International Conference on Multimedia and
Expo, ICME 2005. 5 pp. Jul. 2005.
[9] Moshe, Y.; Peleg, N. "Implementations of H.264/AVC baseline decoder
on different digital signal processors". 47th International Symposium
ELMAR. pp. 37-40, Jun. 2005.
[10] F. Pescador et al. "A Real-Time H.264 BP decoder based on a DM642
DSP". IEEE International Conference in Signal Processing and
Communication. Dubai. Nov. 2007. Paper accepted.
[11] F. Pescador et al. "A DSP based IP set-top box for home entertainment".
IEEE Trans. on Consumer Electronics. Vol. 52, Issue 1, pp. 254-262,
Feb. 2006.
[12] DM642 evaluation module. [online] http://www.spectrumdigital.com
[13] Reference Frameworks for eXpressDSP Software: RF5 (SPRA795A,
April 2003). [online] http://focus.ti.com/lit/an/spra795a/spra795a.pdf.
[14] AMK 430 AVC encoder. [online] http://www.ateme.com
