This paper addresses the concurrent task management of complex multi-media systems, like the MPEG4 IM1 player, with emphasis on how to derive energy-cost vs time-budget curves through task scheduling on a multi-processor platform. Starting from the original "standard" specification, we extract the concurrency originally hidden by implementation decisions in a "grey-box" model.
The main goal of system/software embedding is to encapsulate the concurrent tasks in a control shell which takes care of the task scheduling (software scheduling in the restricted sense) and the inter-task communication. Task scheduling is an error-prone process that requires computer assistance to consider the many interactions between constraints. Unfortunately, current design practices for reactive real-time systems are ad hoc and not very retargetable. From this, one can conclude that a need exists for a systematic methodology at the system level for the codesign of hardware and software, including especially the management of dynamic concurrent tasks. In Section 3 we present the framework of a methodology that we are currently developing to solve the task concurrency problem. An important step in this methodology is the design-time mapping of different tasks in a cost-efficient way on a multiple processor platform. In Section 4.2 we explain a heuristic that helps us to select the most important task to run first on the platform. We illustrate this scheduling heuristic with an example extracted from a very complex realistic driver, namely the MPEG4 IM1 player. The results are represented in Pareto curves, which do not give a single working point but allow us to globally trade off cost (e.g. energy) vs constraints (e.g. time-budget).
TARGET APPLICATION
The target applications of our task-level system synthesis approach are advanced real-time multi-media and information processing (RMP) systems, such as consumer multi-media electronics and personal communication systems. These applications involve a combination of complex data-flow and control-flow where complex data types are manipulated and transferred in a dynamic way, including creation and deletion of both data and tasks. First, most of these applications are implemented with compact and portable devices, putting stringent constraints on the degree of integration (i.e. chip area) and on their power consumption. Secondly, these systems are extremely heterogeneous in nature and combine high performance data processing (e.g. data processing on transmission data input) as well as slow rate control processing (e.g. system control functions), synchronous as well as asynchronous parts, analog and digital, and so on. Thirdly, time-to-market has become a critical factor in the design phase. Finally, these systems are subjected to stringent real-time constraints (both hard and soft deadlines are present), complicating their implementation considerably.
The main driver for our research is the IM1 player (see Fig. 1). This player is based on the MPEG4 standard, which specifies a system for the communication of interactive audio-visual scenes composed of objects. As this player consists of many different modules (audio, video, 3D animation, etc., coded in over 80,000 lines of high-level C++ specification), we believe it is representative for other multi-media applications. Each frame in an MPEG4 compatible input stream consists of several parts, video object planes (VOPs), that can be decoded independently. In this paper we will assume that up to 5 different VOPs are simultaneously active, which is realistic. For each VOP two interacting controllers need to be scheduled (see Fig. 3). Several other decoders may be needed to interpret additional information required by the VOPs (such as texture data, audio, video). In this paper we focus on the mapping of the controllers on the programmable components of the platform. The other more static signal processing tasks are assigned to dedicated accelerators.
FRAMEWORK OF THE TASK CONCURRENCY MANAGEMENT METHODOLOGY AT THE GREY-BOX ABSTRACTION LEVEL
The design of concurrent real-time embedded systems, especially embedded software, is a difficult problem to be performed manually due to the complex consumer-producer relationships, the presence of various timing constraints, the non-determinism in the spec- [...] The framework of our methodology is depicted in Fig. 2. [...] CDFG are then applied to increase the opportunities for concurrency exploration and cost minimization. After the extraction, we obtain a set of tasks, each of which consists of many thread nodes. A task can be viewed as a more or less independent part of the whole application. Then static scheduling is applied inside each task at design time, including processor assignment in the context of multiple processing elements (PEs). Finally, a dynamic scheduler schedules these tasks at run time on the given platform [5].
The purpose of task concurrency management is to determine a cost-optimal constraint-driven scheduling, allocation and assignment of various thread nodes to a set of processors. Each thread node will be executed at a different speed on a different processor with a different cost, i.e. energy consumption. Given a task, static scheduling is done at design time to explore all the possibilities of valid assignment and scheduling. Each possibility means a different choice in the trade-off between time budget and energy consumption (i.e. another working point), and together they compose the global Pareto curve. Since the static scheduling is done at design time, as much computation effort as necessary can be spent, provided that it gives a better scheduling result and reduces the computation effort of dynamic scheduling.
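To make the notion of a Pareto curve of working points concrete, the following minimal sketch filters a set of candidate schedules down to the non-dominated ones. The type name `WorkingPoint`, the function `pareto_curve` and all numbers are illustrative assumptions, not taken from the paper or from the IM1 profiling data.

```python
from typing import List, Tuple

# (time_budget, energy_cost) of one candidate static schedule; units are arbitrary.
WorkingPoint = Tuple[float, float]


def pareto_curve(points: List[WorkingPoint]) -> List[WorkingPoint]:
    """Keep only points not dominated by a faster-or-equal, cheaper-or-equal point."""
    curve: List[WorkingPoint] = []
    best_energy = float("inf")
    # Sorting by time (then energy) lets a single sweep find the curve.
    for time, energy in sorted(points):
        if energy < best_energy:
            curve.append((time, energy))
            best_energy = energy
    return curve


# Hypothetical candidate schedules of one task (made-up numbers).
candidates = [(10, 90), (12, 70), (12, 95), (15, 72), (20, 40), (25, 40)]
print(pareto_curve(candidates))
```

Every point that survives the filter is a distinct trade-off: moving along the curve towards a larger time budget never increases the energy cost.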
SCHEDULING HEURISTIC FOR A LARGE NUMBER OF THREAD NODES
Related work and motivation
Abundant work has been done on scheduling. In the literature, the granularity levels at which different scheduling algorithms work are quite different. However, the entity that scheduling algorithms deal with is always called a task, job or node. To avoid confusion, note that the meaning of these terms in our work can differ from that in other existing work.
When a set of concurrent tasks, i.e., tasks that can overlap in time, has to be executed on one or more processors, a scheduling algorithm must be applied to decide the order in which those tasks are executed. For a multiprocessor system, another procedure, assignment, is also needed to determine on which processor each task will be executed.
In the real-time community, researchers use a black-box view of the task behavior. Comprehensive overviews of scheduling algorithms for real-time systems are given in [1, 6, 7, 8]. In contrast, in the embedded system community, many papers focus on white-box task descriptions [9, 10], which are typically not available at the early design stage. A grey-box model is used in our approach. As an example, we show in Fig. 3 the grey-box model we are going to use in the following scheduling experiment. A formal definition of this model is given in [1]. The most important semantics of this model are explained in the appendix.
A new scheduling heuristic for a large number of tasks
Our approach can be applied to a multiprocessor platform without the need to change the processor voltage dynamically. For a given platform, i.e., a given number of high-speed processors and low-speed processors, the heuristic derives an energy-cost vs time-budget curve. Unlike other scheduling approaches, such as MILP (mixed integer linear programming), the heuristic does not treat real-time constraints directly. Hence, the computational complexity is reduced. The heuristic derives a set of working points on the energy-cost vs time-budget plane instead of only one point.
These working points range from the best-performance point achievable on the given platform to the point that barely meets the timing constraints. The existence of these different working points is mostly due to the different thread node to processor assignments (some are due to idle time in the schedule). As a result we also have to deal with a crucial assignment problem. We will show in Section 5 how a trade-off between energy-cost and time-budget can be made by a proper assignment decision. These points are crucial when combining subsystems into a complete system or during the dynamic scheduling stage. The reason is that in both cases we need to select working points of subsystems and combine them into a globally optimal working point.
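As an illustration of selecting one working point per subsystem and combining them into a global working point, the sketch below performs an exhaustive search. It rests on a simplifying assumption that is not stated in the paper: subsystems run back to back, so both time and energy simply add up. The function name `combine` and all numbers are hypothetical.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

WorkingPoint = Tuple[float, float]  # (time_budget, energy_cost)


def combine(curves: Sequence[List[WorkingPoint]],
            deadline: float) -> Optional[Tuple[WorkingPoint, ...]]:
    """Pick one working point per subsystem, minimizing total energy within the deadline.

    Assumption (not from the paper): subsystems execute sequentially, so time
    and energy are additive. Exhaustive search, adequate for short curves only.
    """
    best = None
    best_energy = float("inf")
    for choice in product(*curves):
        total_time = sum(t for t, _ in choice)
        total_energy = sum(e for _, e in choice)
        if total_time <= deadline and total_energy < best_energy:
            best, best_energy = choice, total_energy
    return best  # None when no combination meets the deadline


# Hypothetical Pareto curves of two subsystems (made-up numbers).
curve_a = [(10, 90), (12, 70), (20, 40)]
curve_b = [(5, 50), (8, 30), (15, 10)]
print(combine([curve_a, curve_b], deadline=28))
```

With a looser deadline the search drifts towards the slow, low-energy ends of both curves; with a tighter one it is forced towards the fast, high-energy ends, which is why the full curve of each subsystem has to be kept rather than a single point.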
To schedule the thread nodes on the above multiple processor platform, we have developed a static scheduling heuristic. The heuristic uses two criteria to decide which thread node to run first. The first criterion is the self-weight of a thread node. It is defined as the execution time of the thread node on the low-speed processor. The larger the self-weight of a thread node, the higher its priority to be mapped on a high-speed processor.
The second criterion is the load of a thread node. If some thread nodes depend on a thread node, the sum of the self-weights of all the dependent thread nodes is defined as the load of that thread node. It is worth noting that "dependent" means control dependence. The more load a thread node has, the earlier it should start to execute. The two criteria are combined as follows:
1. When a thread node is dominant both in self-weight and in load over the other candidate thread nodes, it will be scheduled first.
2. When one thread node has a dominant self-weight and another thread node has a dominant load, either of them can be scheduled first. By alternating their order, different points on the energy-cost vs time-budget plane can be generated.
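The two criteria and the two selection rules can be sketched as follows. This is only an illustration under stated assumptions: the graph, its node names and its weights are invented, and "dependent" is interpreted here as covering both direct and transitive dependents, which the text does not specify.

```python
from typing import Dict, List, Set

# Hypothetical thread-node graph: self-weights are execution times on the
# low-speed processor; edges point from a node to the nodes depending on it.
self_weight: Dict[str, int] = {"a": 4, "b": 9, "c": 2, "d": 5, "e": 1}
dependents: Dict[str, List[str]] = {"a": ["c", "d"], "b": ["e"], "c": [], "d": [], "e": []}


def descendants(node: str) -> Set[str]:
    """All nodes that directly or transitively depend on `node` (assumed reading)."""
    seen: Set[str] = set()
    stack = list(dependents[node])
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(dependents[n])
    return seen


def load(node: str) -> int:
    """Sum of the self-weights of the dependent thread nodes."""
    return sum(self_weight[n] for n in descendants(node))


def first_choices(candidates: List[str]) -> List[str]:
    """Return the node(s) that may be scheduled first among the ready candidates."""
    by_weight = max(candidates, key=lambda n: self_weight[n])
    by_load = max(candidates, key=load)
    # Rule 1: one node dominates both criteria. Rule 2: two alternatives remain.
    return [by_weight] if by_weight == by_load else [by_weight, by_load]


print(load("a"), load("b"))
print(first_choices(["a", "b"]))  # b has the larger self-weight, a the larger load
```

When `first_choices` returns two nodes, exploring both orders yields two different schedules, and hence potentially two different working points on the energy-cost vs time-budget plane.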
It is worth noting that the heuristic implicitly includes energy considerations, because for a given processor the energy consumption is directly related to the execution time. The self-weight and the load in the heuristic are merely two interpretations of execution time from different perspectives. The application of this heuristic to the IM1 player is discussed in the following section. Even though the heuristic seems relatively simple, it turns out to be very effective for scheduling the tasks in the MPEG-4 IM1 player.
TASK SCHEDULING EXPERIMENT
To get smooth scenes, a frame period of 30 ms is required. Currently, the IM1 player is implemented on a Pentium or Pentium-compatible platform. Such an implementation is far from meeting the 30 ms timing constraint. This real-time constraint is only reachable when an extremely parallel and powerful instruction-set processor is used. In our experiment, the IM1 player is mapped to a multiple processor platform including custom accelerators next to the instruction-set processors. We have extracted the task graph from the IM1 implementation code. By analyzing the profiling information, we know how many VOPs one frame contains, how many BIFS nodes one VOP has and how many decoder setups are done in one VOP.
We pick an example frame of 5 VOPs. The number of BIFS nodes and the number of decoder setups are listed in Table 1. For every BIFS node decoding and decoder setup, we get the execution time from the profiling data. The grey-box model of a typical BIFS decoding and decoder setup is shown in Fig. 4. The left part of Fig. 4 is the context of the BIFS decoding or decoder setup. The right part is the unfolding of an example BIFS decoding. In reality, instances of the BIFS decoding and decoder setup take place inside a VOP with variations in the underlying graphs and execution [...]
Applying this scheduling heuristic to the sample problem, we have derived two curves, for processor combinations 2 and 3, which are shown in Fig. 5. From the two Pareto curves, we know that points on the curve of Combination 3 provide globally optimal working points until point 6 is reached. A lower time-budget, i.e., a tighter timing constraint, cannot be satisfied by the platform of Combination 3. This is because the existing concurrency in the task graph has been fully utilized by the processors in Combination 3. To further decrease the time-budget, we have to use extra high-speed processors to reduce the critical path, that is, either replace low-speed (low-power) processors with high-speed (high-power) processors or introduce additional high-speed processors. In other words, two ways exist to decrease the time-budget: one is to utilize the concurrency, the other is to reduce the critical path. Therefore, to get a lower time-budget, we have to resort to the platform of Combination 2 and pay a large energy penalty. This explains the big jump of the global Pareto curve from point 6 of Combination 3 to point 6 of Combination 2.
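The distinction between utilizing concurrency and reducing the critical path can be illustrated with the standard lower bound on the schedule length of a task graph: the maximum of the critical path length and the total work divided by the number of processors. The sketch below assumes identical processors and a uniform `speedup` factor for faster ones; the graph and its numbers are hypothetical and unrelated to the IM1 profiling data or to the processor combinations of Fig. 5.

```python
from functools import lru_cache
from typing import Dict, List

# Hypothetical task graph: execution times on a low-speed processor and
# successor lists (made-up numbers).
exec_time: Dict[str, float] = {"a": 4, "b": 9, "c": 2, "d": 5, "e": 1}
successors: Dict[str, List[str]] = {"a": ["c", "d"], "b": ["e"], "c": [], "d": [], "e": []}


@lru_cache(maxsize=None)
def longest_path_from(node: str) -> float:
    """Length of the longest dependency chain starting at `node`."""
    tail = max((longest_path_from(s) for s in successors[node]), default=0)
    return exec_time[node] + tail


def time_budget_lower_bound(num_processors: int, speedup: float = 1.0) -> float:
    """max(critical path, total work / processors), all times divided by `speedup`."""
    critical_path = max(longest_path_from(n) for n in exec_time)
    total_work = sum(exec_time.values())
    return max(critical_path, total_work / num_processors) / speedup


for p in (1, 2, 3, 4):
    print(p, time_budget_lower_bound(p))
print("2 processors, twice as fast:", time_budget_lower_bound(2, speedup=2.0))
```

In this toy graph the bound stops improving once three processors are available, because the critical path then dominates; only faster processors lower it further, which mirrors the argument made above for moving from Combination 3 to Combination 2.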
As no related work exists that fully matches our problem statement, we have made a comparison with an exact MILP formulation. We have applied a well-known public domain MILP solver [4] to this scheduling problem. Since the size of the MILP description for scheduling the task graph on a multiple processor platform explodes, we have to restrict the MILP solver to a two-processor platform. Fig. 6 shows three curves. One is obtained from the heuristic; the other two are the outputs of the MILP solver after its first run and after it has run for five days, respectively. It is clearly seen that the MILP solver has made a large improvement after five days.
However, to get all the optimal points over the time-budget axis, which should be lower than the points given by the heuristic, it will [...]
CONCLUSIONS
We have demonstrated in this paper that a grey-box task graph model should be used to design advanced real-time dynamic multimedia applications. This task-level model allows us to systematically explore the design at a high abstraction level in order to obtain more and better task schedules. Transformations on the task graph level have been applied to improve this further. On a multiprocessor target platform including custom accelerators for the video coders, the different schedules obtained with a heuristic scheduling approach are represented in Pareto curves trading off time-budget vs energy-cost. This information is crucial to decide on run-time
