Abstract: Flexible signal processing on programmable platforms is increasingly important for consumer electronics applications. Scalable video algorithms (SVAs) using the novel principle of priority processing guarantee real-time performance on programmable platforms, even with limited resources. Dynamic resource allocation is required to maximize the overall output quality of independent, competing priority processing algorithms that are executed on a shared platform. In this paper we describe the mapping of a priority processing application onto the Cell/B.E. platform. We compare the performance of different implementations of dynamic-resource-allocation mechanisms, and show that priority processing achieves real-time performance.
I. INTRODUCTION
The principle of priority processing provides optimal real-time performance for scalable video algorithms (SVAs) on programmable platforms with limited system resources [1]. According to this principle, SVAs provide their output strictly periodically and processing of images follows a priority order. Hence, important image parts are processed first and less important parts are subsequently processed in a decreasing order of importance. After creation of an initial output by a basic function, processing can be terminated at an arbitrary moment in time, similar to the milestone method [2]. This principle yields the best output for given resources.
To distribute the available resources, i.e. CPU time, among competing, independent priority processing algorithms (PPAs), a decision scheduler (DS) has been developed [3]. The DS aims at maximizing the total relative progress of the algorithms on a per-frame basis. The relative progress of an algorithm is defined as the ratio of the number of blocks already processed to the total number of blocks to be processed in a frame. The DS divides the available resources within a period into fixed-size quanta, termed time-slots, and dynamically allocates these time-slots to the algorithms.
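As a minimal sketch of the definitions above, the following Python fragment computes relative progress and splits a period into time-slots. The function names, the 1 ms slot size, and the round-robin policy are illustrative assumptions only; the actual DS strategies are those described in [3].

```python
def relative_progress(processed_blocks: int, total_blocks: int) -> float:
    """Ratio of blocks already processed to blocks to be processed in a frame."""
    if total_blocks <= 0:
        raise ValueError("total_blocks must be positive")
    return processed_blocks / total_blocks


def slots_per_period(period_us: int, slot_us: int) -> int:
    """Number of whole fixed-size time-slots that fit in one period."""
    return period_us // slot_us


def allocate_round_robin(num_slots: int, num_algorithms: int) -> list:
    """Hypothetical placeholder policy: hand out slots in turn.

    Returns, for each slot, the index of the algorithm that receives it.
    This is NOT the DS strategy from [3]; it only shows the data flow.
    """
    return [slot % num_algorithms for slot in range(num_slots)]


# Example values are assumptions chosen for illustration only.
slots = slots_per_period(period_us=40_000, slot_us=1_000)
schedule = allocate_round_robin(slots, num_algorithms=2)
```

A real DS would replace `allocate_round_robin` with a policy that uses the measured relative progress of each PPA when choosing the next slot's owner.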
Strategies and mechanisms for dynamic resource allocation have been addressed in [3] and [4], respectively. Moreover, [4] describes how priority processing applications, implemented in MATLAB/Simulink, are executed under Microsoft Windows XP on a general-purpose platform. In this paper we describe the mapping of a priority processing application onto an embedded platform, namely the Cell Broadband Engine. This platform has been chosen because it is well supported, widely available, and suitable for consumer electronics [5]. Next, we present implementations of dynamic-resource-allocation mechanisms on the Cell. Finally, we evaluate the implementations and show that priority processing achieves real-time performance.
II. MAPPING AN APPLICATION ON THE CELL
The Cell is a multi-processor platform that offers a general-purpose processor (PowerPC) and several dedicated streaming processors (SPEs). These SPEs can execute single-instruction, multiple-data (SIMD) operations. The PowerPC hosts an operating system (OS), namely GNU/Linux. No OS is running on the SPEs. Accordingly, the application programmer is responsible for memory management on the SPEs [5]. Fig. 1 shows the mapping of a priority processing application on the Cell, where multiple PPAs share a single SPE.
We ported a PPA implementation of a scalable de-interlacer to an SPE. To achieve real-time performance, we (i) vectorized the code using SIMD operations and (ii) overlapped signal processing with data transfers by applying a double-buffering scheme for both input and output.
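The buffer bookkeeping behind double buffering can be sketched as follows. This is a sequential Python illustration for the input side only, not the SPE implementation: `fetch` and `compute` are hypothetical callables standing in for a data transfer and the processing kernel, and on real hardware the fetch would run asynchronously while the computation proceeds.

```python
def process_with_double_buffering(chunks, fetch, compute):
    """Process chunks so that chunk i+1 is fetched while chunk i is computed.

    `fetch` and `compute` are caller-supplied (hypothetical) callables.
    This sketch is sequential; it only shows how two buffers alternate.
    """
    if not chunks:
        return []
    buffers = [None, None]               # two input buffers used alternately
    results = []
    current = 0
    buffers[current] = fetch(chunks[0])  # prime the first buffer
    for i in range(len(chunks)):
        nxt = 1 - current
        if i + 1 < len(chunks):
            buffers[nxt] = fetch(chunks[i + 1])    # issue the next transfer
        results.append(compute(buffers[current]))  # work on the current buffer
        current = nxt                              # swap buffer roles
    return results
```

The same alternation would apply to output buffers, with results written to one buffer while the other is being transferred out.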
III. DYNAMIC RESOURCE ALLOCATION ON THE CELL
In [4], three mechanisms for dynamic resource allocation were identified: preliminary termination, resource allocation, and monitoring. The first mechanism is intrinsic to priority processing: a PPA is terminated at the end of a frame period and skips the remainder of the current frame. The latter two mechanisms are required to distribute the available resources of the SPE among competing, independent PPAs. We consider these mechanisms and their implementations below.
A. Preliminary Termination: At the request of the DS, a PPA terminates itself within a reasonable amount of time, i.e. in a cooperative way. We propose three alternative implementations: (1) per-pixel polling, (2) per-block polling (a block is a group of 256 pixels), and (3) software interrupts.
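Per-block polling can be illustrated with a short Python sketch. It assumes a frame represented as a flat sequence of pixels; `should_stop` is a hypothetical callable standing in for the termination request from the DS, and `process_block` stands in for the processing kernel. Neither name comes from the original implementation.

```python
BLOCK_SIZE = 256  # pixels per block, as stated in the text


def process_frame(pixels, should_stop, process_block):
    """Process a frame block by block, polling for a stop request per block.

    Returns the number of blocks completed before termination.
    """
    done = 0
    for start in range(0, len(pixels), BLOCK_SIZE):
        if should_stop():          # poll once per block
            break                  # skip the remainder of the current frame
        process_block(pixels[start:start + BLOCK_SIZE])
        done += 1
    return done
```

Per-pixel polling would move the check into the inner loop, trading a shorter termination latency for more frequent checks.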
B. Resource Allocation: Because the SPEs run no OS, basic means for resource management are required to make it possible to share the CPU and local memory of the SPE. We therefore implemented a lightweight mechanism for context switching.
IV. EXPERIMENTS AND RESULTS
We implemented a priority processing application on the Cell and tested it using standard sequences from the Video Quality Experts Group (VQEG). We used two scalable de-interlacers, each processing a different sequence, to experiment with competing, independent PPAs on a single SPE. All experiments were repeated 20 times and, where applicable, results are reported with a 95% confidence interval.
A. Dynamic resource allocation
We measured the overheads and latencies for the three different implementations of preliminary termination. The results for VQEG sequence 6 are shown in Table I.
For overhead, we used the mean of the 16 median values per frame. Branch hinting [5] keeps the overhead for polling variants low. This reduces the negative effects of polling on the performance during normal execution. Since processing an entire frame takes approximately 50 ms, the relative overheads for block polling and software interrupts are less than 0.5%.
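The 0.5% bound can also be read as an absolute budget. Taking the approximate frame processing time of 50 ms from the text, the corresponding upper limit is:

```latex
\[
  0.5\% \times 50\,\mathrm{ms} = 0.005 \times 50\,\mathrm{ms}
  = 0.25\,\mathrm{ms} = 250\,\mu\mathrm{s}
\]
```

That is, the overheads of block polling and software interrupts stay below roughly 250 μs per frame.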
For latency, we used the median of three consecutive runs per frame to filter out exceptionally high latencies. The communication between PowerPC and SPE takes on average 49.1 μs. This results in relatively high termination latencies for all three implementations. Based on the results in Table I, we used software interrupts for both preliminary termination and context switching.
B. Real-time performance
We performed experiments with a single PPA and with two PPAs to determine whether priority processing can achieve real-time performance on an SPE. Table II shows that the non-scalable part, applied on different VQEG sequences, completes within 16 ms. This makes it possible to simultaneously run two PPAs within a frame period of 40 ms. Fig. 2 illustrates the progress of a single PPA, which is approximately 35%, and the progress of two PPAs, each reaching approximately 5%.
Fig. 2. Average progress (%) versus number of processed frames, achieved for a frame period of 40 ms by: (top) one PPA running in isolation, which processes VQEG sequence 6, and (bottom) two competing PPAs, which process VQEG sequences 5 (grey) and 6 (black). Completion of just the basic function of a PPA is defined as 0% progress, whereas completion of all optional parts is defined as 100% progress.
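The time budget behind the two-PPA experiment follows from the figures in the text, using the 16 ms upper bound for the non-scalable part and the 40 ms frame period:

```latex
\[
  2 \times 16\,\mathrm{ms} = 32\,\mathrm{ms} \le 40\,\mathrm{ms},
  \qquad
  40\,\mathrm{ms} - 32\,\mathrm{ms} = 8\,\mathrm{ms}
\]
```

At least 8 ms per period therefore remains for the optional, scalable parts of both PPAs together, before subtracting any scheduling and termination overhead. This small remainder is consistent with the lower progress observed when two PPAs compete.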
V. CONCLUSION
We presented real-time priority processing on a consumer platform. We described the mapping of a priority processing application onto the Cell/B.E. and implementations of dynamic-resource-management mechanisms. The PowerPC executes the DS and a single SPE runs the competing, independent PPAs. Because no OS is running on the SPEs, we presented a lightweight mechanism for context switching. Based on our evaluation, we used software interrupts for cooperative preliminary termination and resource allocation. Finally, we showed that priority processing achieves real-time performance. This makes the concept attractive for consumer electronics.
