This paper addresses the problem of mapping an application, which is highly dynamic in the future, onto a heterogeneous multiprocessor platform in an energy efficient way. A two-phase scheduling method is used for that purpose. By exploring the Pareto curves and scenarios generated at design time, the run-time scheduler can easily find a good scheduling at a very low overhead, satisfying the system constraints and minimizing the energy consumption. A real-life example from a 3D quality of service kernel is used to show the effectiveness of our method.
INTRODUCTION
The merging of computers, consumer and communication disciplines gives rise to very fast growing markets for personal communication, multimedia and broadband networks. Technology advances lead to platforms with enormous processing capacity that are however not matched with the required increase in system design productivity.
One of the most critical bottlenecks is the very dynamic concurrent behavior of many of these new applications. They are fully specified in software oriented languages (like Java, UML, SDL, C++) and still need to be executed in a real-time, cost/energy-sensitive way on the heterogeneous SoC platforms. A way of mapping this SW/HW specification onto an embedded multi-processor platform is required. The main issue is that fully design-time based solutions, as proposed earlier in the compiler and system synthesis communities, cannot solve the problem, and run-time solutions, as present in today's operating systems, are too inefficient in terms of cost optimization (especially energy consumption) and are also not adapted to the real-time constraints.
This dynamic nature is especially emerging because of the quality-of-service (QoS) aspects of these multi-media and networking applications. Prominent examples of this can be found in the recent MPEG4/JPEG2000 standards and in the new MPEG21 standard. In order to deal with these dynamic issues, where tasks and complex data types are created and deleted at run-time based on non-deterministic events, a novel system design paradigm is required. This paper will focus on the new requirements that result in system-level synthesis. In particular a "task concurrency management" (TCM) problem formulation will be proposed, with special emphasis on the results that can be obtained in terms of power consumption reduction. The concept of Pareto curve based exploration is crucial in the solution for this problem.
The promising results that can be obtained with such a methodology will be illustrated with an MPEG21 based demonstrator mapped on a multi-processor simulation platform with hierarchical shared memory organization.
PLATFORM BASED DESIGN
Figure 1: The platform integration.
The future of embedded multimedia applications lies in low-power heterogeneous multiprocessor platforms [10]. In the near future, the silicon market will be driven by low-cost, portable consumer devices which integrate multi-media and wireless technology. Most of these applications will be implemented with compact and portable devices, putting stringent constraints on the degree of integration (i.e. chip area) and on their power consumption (0.1-2 W). The applications will require a computational performance of the order of 1-10 GOPS. Current PCs offer this performance. However, their power consumption is too high (10-100 W). We must keep this performance and reduce the power consumption by at least two to three orders of magnitude. Embedded systems are also subject to stringent real-time constraints, complicating their implementation considerably. Finally, the challenge to embed these applications on portable devices is increased even further because of user interaction: at any moment the user will be able to trigger new services, change the configuration of the running services, or stop existing services. Hence, the execution behavior changes dynamically at run-time.
Therefore, the future embedded system should be extremely fast, have an extremely low energy consumption, and be flexible enough to cope with the dynamic behavior of future multi-media applications. In spite of modern voltage scaling techniques, existing single processor systems (e.g. found in PDAs) cannot deliver enough computational performance at a sufficiently low energy budget. On the other hand, full custom hardware solutions can achieve a good performance/energy consumption ratio, but are not flexible enough to cope with the dynamic behavior and have a relatively long time-to-market delay. Also, the technology scaling trend, with the explosion of mask costs and of the physical design effort (especially due to timing closure and test), implies that custom ASICs are only feasible for high volume designs. A combination of the best of both worlds is required for the majority of the new applications (see [7]). This explains the growing interest in platform based design [6].
The benefits of heterogeneous platforms can however only be fully exploited when applications are efficiently mapped on them. Unfortunately, current design technologies fall behind these advances in computer architecture and processing technology. Looking at contemporary design practices for systems implemented with these heterogeneous multiprocessor platforms, one can only conclude that these systems are currently designed in a very ad hoc manner. A systematic methodology and its corresponding tool set are definitely needed.
In this paper, we present a systematic design methodology and tool support to embed multi-media applications on platforms. With our design methodology, we especially target the management of dynamic and concurrent tasks and their data on heterogeneous multiprocessor platforms. Given an application which has its own specific control and data structure, and a template for the platform architecture, this requires finding a way to decide the instance platform architecture and how to map the application efficiently onto such an architecture (top part of Fig. 1). We will only focus on the concurrency management issues in this paper, even though the data management part [22] is as crucial.
Platform based design also encompasses a physical implementation problem that can be called platform integration (bottom part of Fig. 1), but that is not addressed here.
RELATED WORK
Task scheduling in a task concurrency management context has been investigated extensively in the last decades. When a set of concurrent tasks (that is, tasks that can overlap in time) has to be executed on one or more processors, a predefined method, called a scheduling algorithm, must be applied to decide the order in which those tasks are executed. For a multiprocessor system, another procedure, assignment, is also needed to determine on which processor each task will be executed. A good overview of scheduling algorithms can be found in [27]. In this paper, the term task scheduling is used for both the ordering and the assignment.
Scheduling algorithms can be roughly divided into dynamic and static scheduling. In a multiprocessor context, when the application has a large amount of non-deterministic behavior, dynamic scheduling has the flexibility to balance the computation load of processors at run-time and make use of the extra slack time coming from the variation from worst case execution time (WCET). However, the run-time overhead may be excessive and a global optimal scheduling is difficult to find due to the difficulty of the problem. We have selected a combination of a design-time and run-time scheduling here to take advantage of both of them.
Since more and more embedded systems are targeted at multiprocessor architectures, multiple processor scheduling plays an increasingly important role. El-Rewini et al. [11] give a clear introduction to task scheduling in multiprocessing systems. Hoang et al. [14] try to maximize the throughput by balancing the computation load of the distributed processors. All the partition and scheduling decisions are made at compile time. This approach is limited to pure data flow applications. Yen et al. [35, 34] try to combine the processor allocation and process scheduling into a gradient-search co-synthesis algorithm. It is a heuristic method and can only handle periodic tasks statically.
In the above work, performance is the only concern. In other words, they only consider how to meet real-time constraints. For embedded systems, cost factors like energy must be taken into account as well. Gruian et al. [13] have used constraint programming to minimize the energy consumption at system level, but their method is purely static and no dynamic policy is applied to exploit more energy reduction.
Recently, DVS (dynamic voltage scaling) has been getting more consideration. Since the energy consumption of CMOS digital circuits is approximately proportional to the square of the supply voltage, decreasing the supply voltage is advantageous to low power design, though it will also slow down the cycle speed. Traditionally, the CPU works at a fixed supply voltage, even at a light workload. In fact, under such situations, the fast speed of the CPU is unnecessary and can be traded for a lower energy/power consumption by reducing the supply voltage (Chandrakasan et al. [5]).
Power consumption in a multiple processor context is treated in [20] by evenly distributing the workload. However, no manifest power and performance relation is used to steer the design-space exploration. In addition, they also assume a continuously scalable working voltage. In [21], for a multiprocessor and multiple link platform with a given task and communication mapping, a two-phase scheduling method is proposed. The static scheduling is based on slack time list scheduling, and a critical path analysis and task execution order refinement method is used to find the off-line voltage scheduling for a set of periodic real-time tasks. The run-time scheduling is similar to resource reclaiming and slack stealing, which can make use of the variation from the WCET and provide best-effort service to aperiodic tasks. In [36], an EDF based multiprocessor scheduling and assignment heuristic is given, which is shown to perform better than normal EDF. After the scheduling, an ILP model is used to find the voltage scaling accurately, or approximately by simply rounding the result from an LP solver, and the result is claimed to be within 97% accuracy. The method can be used for both continuous and discrete voltages.
Our scheduling methodology is different from the above ones in several ways. Firstly, we consider only discrete voltages, which is a more reasonable assumption in terms of future process technology and circuit design possibilities, compared to a continuous range. Actually, as illustrated in [36], it is even more difficult to solve because it corresponds to an ILP problem, not LP. Secondly, we also use a two-phase off-line and on-line scheduling. However, the design-time off-line step is more a design space exploration than a simple scheduling, because it gives a series of different scheduling trade-off points, rather than only the single one given by a conventional scheduler. Thirdly, we avoid having to use the WCET estimation, which is inaccurate and pessimistic due to the dynamic features of the applications. Fourthly, we apply voltage scaling at the intra-task level and make use of the run-time application information. Intra-task voltage scaling has been considered only recently by other researchers [3].
TARGET PLATFORM ARCHITECTURE AND MODELS
In an up-to-date platform like Fig. 2 (where we focus on the digital core only), one or more (re-configurable) programmable components, either general-purpose or DSP processor cores or ASIPs, the analog front end, on-chip memory, I/O and other ASICs are all integrated into the same chip (System-on-Chip) or in the same package (System-in-Package). Furthermore, platforms typically also contain some software blocks (API, operating system and some other middleware solutions), a methodology and a tool set to support rapid architectural exploration. Examples of such platforms are TI's OMAP [17], the Philips Nexperia [24], the ARM PrimeXsys [2] and the Xilinx Virtex-II Pro [32].
Platforms obtain a high performance at a low energy cost through the exploitation of different types of parallelism [18, 1]. In the context of this paper we will mainly focus on the task-level concurrency, but the instruction and data-level parallelism should also be exploited in the mapping environment. In this way we can keep the power consumption lower by exploiting Vdd and frequency scaling in the processing cores. We will assume that only a limited set of Vdd's can be supported, with a not too large range, because future process technologies (see ITRS roadmap) will demand that. Large memories do not even allow any Vdd scaling, but in that case the power consumption can be controlled by an efficient mapping on a distributed memory architecture [ E ].
A multiprocessor platform simulator was used to test the effectiveness of our methodology and to answer "what-if" questions to explore the platform architecture design space. For the experiments in this paper, we will focus on the processing modules. In order to demonstrate the impact of our MATADOR-TCM approach, we will assume that either several StrongARM cores are available, each with a different Vdd, or a single StrongARM core with a few discrete Vdd's. The Vdd's range from 1.2 to 2.4 V, which is motivated by the data sheet information. The power models are also based on these data sheets and on instruction counts obtained from profiling. The cycle counts of the thread nodes are also obtained by profiling, based on an ARMulator environment. A clock frequency of 266 MHz is assumed at 2.4 V to obtain the execution times.
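As a rough illustration of how a supply voltage choice trades execution time against energy for one thread node, the following Python sketch applies a first-order model. Only the 266 MHz reference frequency at 2.4 V and the 1.2 to 2.4 V range come from the text. The linear frequency scaling, the quadratic energy scaling for a fixed cycle count, the function name and the example node are assumptions made for this sketch, not the power model used in the paper.

```python
F_REF_HZ = 266e6   # clock frequency at the reference voltage (from the text)
V_REF = 2.4        # reference supply voltage in volts (from the text)

def scaled_time_energy(cycles, energy_ref_j, vdd):
    """Return (time_s, energy_j) of a thread node run at supply voltage vdd.

    Assumptions (not from the paper): clock frequency is proportional to
    Vdd, and energy for a fixed cycle count is proportional to Vdd squared.
    """
    if not 1.2 <= vdd <= V_REF:
        raise ValueError("vdd outside the 1.2-2.4 V range considered here")
    freq = F_REF_HZ * (vdd / V_REF)
    time_s = cycles / freq
    energy_j = energy_ref_j * (vdd / V_REF) ** 2
    return time_s, energy_j

# Hypothetical thread node: 2.66e6 cycles, consuming 4 mJ at 2.4 V.
t_fast, e_fast = scaled_time_energy(2.66e6, 4e-3, 2.4)
t_slow, e_slow = scaled_time_energy(2.66e6, 4e-3, 1.2)
print(t_fast, e_fast)
print(t_slow, e_slow)
```

Under these assumptions, halving the voltage doubles the execution time and cuts the energy to a quarter, which is the kind of trade-off the scheduler exploits.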
TASK CONCURRENCY MANAGEMENT
This approach addresses the dynamic and concurrent task scheduling problem on a multiprocessor platform for real-time embedded systems, where energy consumption is a major concern. Here we propose a two-phase scheduling method, which is a part of our overall design methodology and can provide the required flexibility at a low run-time overhead. The global methodology is briefly introduced in the next section, and then the scheduling method is explained.
Global TCM Methodology
The TCM methodology comprises three stages (see Fig. 3). The first is concurrency extraction. In this stage, we extract and explicitly model the potential parallelism and dynamic behavior of the application. An embedded system can be specified at a gray-box abstraction level in a combined MTG-CDFG model [28, 30], where MTG is the acronym for multi-task graph. With MTG, the application can be represented as a set of concurrent thread frames (TF) that exhibit a single thread of control (please refer to [33] for the terminology). Each of these TFs consists of many thread nodes (TN) that can be looked at as more or less independent sections of the code. In the second stage, we apply concurrency improving transformations on the gray-box model. The third stage mainly consists of a two-phase scheduling approach. First the design-time scheduling step is applied to each of the identified TFs in the system. Different from traditional design-time scheduling, it does not generate a single solution but a set of possible solutions, each of which represents a different possible cost-performance trade-off point. Finally, we integrate an application-specific run-time scheduler in the RTOS of the application. The run-time scheduler dynamically selects one of these trade-off points for each running TF to find a global energy efficient solution.
Figure 3: Task Concurrency Management
Two-phase scheduling stage
The design of concurrent real-time embedded systems, and embedded software in particular, is a difficult problem, which is hard to perform manually due to the complex consumer-producer relationships, the presence of various timing constraints, the non-determinism in the specification and the sometimes tight interaction with the underlying hardware. Here we present a new cost-oriented approach to the problem of concurrent task scheduling on multiple processors.
The design-time scheduling is applied on the thread nodes inside each thread frame at compile time, including a processor assignment decision for the TNs in the case of multiple processing elements. On different types of processors on the heterogeneous platform, the same TN will be executed at different speeds and with different costs, i.e., energy consumption in this paper. These differences provide the possibility of exploring a cost-performance trade-off at the system level. The idea of our two-phase scheduling is illustrated in Fig. 4. Given a thread frame, our design-time scheduler will try to explore different assignment and ordering possibilities, and generate a Pareto-optimal set, where every point is better than any other one in at least one way, i.e., either it consumes less energy or it executes faster. The Pareto-optimal set is usually represented by a continuous Pareto curve. Since the design-time scheduling is done at compile time, as much computation effort can be spent as necessary, provided that it gives a better scheduling result and reduces the computation effort of run-time scheduling in the later stage.
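The notion of a Pareto-optimal set can be made concrete with a small Python sketch. It filters a list of candidate schedules, each summarized as an (execution time, energy) pair, down to the non-dominated points. The function name and the candidate values are hypothetical, and the sketch only shows the filtering step, not how the design-time scheduler generates its candidates.

```python
def pareto_front(points):
    """Keep the (time, energy) points not dominated by any other point.

    A point dominates another if it is no worse in both time and energy
    and strictly better in at least one of them.
    """
    front = []
    best_energy = float("inf")
    # Sort by time, then energy; sweep and keep strictly improving energy.
    for time, energy in sorted(set(points)):
        if energy < best_energy:
            front.append((time, energy))
            best_energy = energy
    return front

# Hypothetical schedules of one thread frame: (execution time, energy).
candidates = [(10, 50), (12, 40), (11, 55), (15, 40), (20, 25), (20, 30)]
print(pareto_front(candidates))
```

Every surviving point is faster than all cheaper points and cheaper than all faster points, which matches the property described above.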
At run time, the run-time scheduler will then work at the granularity of thread frames. Whenever new TFs are initiated, the run-time scheduler will try to schedule them to satisfy their time constraints and minimize the system energy consumption as well. The details inside a thread frame, like the execution time or data dependency of each thread node, can remain invisible to the run-time scheduler and this reduces its complexity significantly. Only some essential features of the points on the Pareto curve will be passed to the run-time scheduler by the design-time scheduling results, and be used to find a reasonable cycle budget distribution for all the running thread frames.
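One way to picture the cycle budget distribution is the following Python sketch for a single processor, where the running thread frames execute one after another and their total time must fit a common budget. It is a simple greedy heuristic chosen for illustration and is not claimed to be the run-time algorithm of the paper. The function name, the data layout and the example curves are assumptions.

```python
def select_points(curves, time_budget):
    """Pick one Pareto point per thread frame on a single processor.

    curves: list of Pareto curves, each a list of (time, energy) points
            sorted by increasing time (hence decreasing energy).
    Returns a list of chosen indices, or None if the budget cannot be met.
    This greedy heuristic is an illustration, not the paper's algorithm.
    """
    choice = [len(c) - 1 for c in curves]          # start at cheapest points
    total = sum(c[i][0] for c, i in zip(curves, choice))
    while total > time_budget:
        best = None
        for tf, (curve, i) in enumerate(zip(curves, choice)):
            if i == 0:
                continue                            # already at fastest point
            saved = curve[i][0] - curve[i - 1][0]   # time gained by the step
            extra = curve[i - 1][1] - curve[i][1]   # energy paid for it
            ratio = extra / saved
            if best is None or ratio < best[0]:
                best = (ratio, tf, saved)
        if best is None:
            return None                             # no frame can speed up
        _, tf, saved = best
        choice[tf] -= 1
        total -= saved
    return choice

# Two hypothetical thread frames with their Pareto curves.
curves = [[(10, 50), (12, 40), (20, 25)], [(5, 30), (8, 18)]]
print(select_points(curves, 22))
```

The scheduler only needs the (time, energy) pairs of each curve, which reflects the point made above that the internals of a thread frame can stay invisible at run time.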
In summary, we split the task scheduling into two separate phases, namely design-time and run-time scheduling, for three reasons. First, it lends more run-time flexibility to the whole system. We can indeed accommodate more unforeseen demands for more execution time by any TF, 
Simulation environment
We have integrated this run-time manager on the platform with the help of an existing RTOS [31]. More specifically, the run-time manager uses the services of this RTOS to distribute the tasks/TFs in the application across the processors and to activate them in the correct order.
A simulation environment is used in this methodology to collect the necessary energy and performance profiling information for the design-time scheduling phase. The tasks/TFs are precharacterized with the execution times and their corresponding energy consumption for different configurations of the platform, similar to [19]. We have applied our TCM approach on a part of a real-life application, the QoS kernel of the 3D rendering algorithm that is developed in the context of an MPEG21 project. We have simulated it on our platform. The application and experimental results are discussed in the following sections.
3D RENDERING QOS APPLICATION
To test the effectiveness of our approach, a real-life application, the QoS (Quality of Service) control part of a 3D rendering algorithm, is used. [...] number of triangles. When the available resources are not enough to render the object, instead of letting the system break down (totally stop the decoding and rendering during a period of time), the corresponding mesh of the object can be gracefully degraded to decrease resource consumption, while maintaining the maximal possible quality.
The number of triangles that are used to describe a mesh can be scaled up or down. This can be achieved by performing vertex splits and edge collapses respectively. To collapse an edge (Va, Vt), we first remove the triangles which Va and Vt have in common, and we replace Vt with Va in the triangles adjacent to Vt. We then recenter Va, to keep the appearance of the new set of triangles as close as possible to the former one. The new set of triangles represents the same object, with less detail but also with fewer triangles. The same principle, but in a reversed way, is used to perform a vertex split. The edge collapse and vertex split approaches can be used repeatedly until the desired number of triangles is achieved.
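The edge collapse steps described above can be sketched in Python on an indexed triangle list. The data layout (a dictionary of vertex positions and a list of vertex-id triples), the function name and the choice to recenter Va at the midpoint of the collapsed edge are assumptions of this sketch. The text only states that Va is recentered, not how.

```python
def collapse_edge(vertices, triangles, va, vt):
    """Collapse edge (va, vt): vt is merged into va.

    vertices:  dict mapping vertex id -> (x, y, z)
    triangles: list of 3-tuples of vertex ids
    Recentering va at the edge midpoint is an assumption of this sketch.
    """
    new_triangles = []
    for tri in triangles:
        if va in tri and vt in tri:
            continue                       # triangle shared by va and vt
        # Redirect the remaining triangles of vt to va.
        new_triangles.append(tuple(va if v == vt else v for v in tri))
    new_vertices = {k: p for k, p in vertices.items() if k != vt}
    new_vertices[va] = tuple(
        (a + b) / 2 for a, b in zip(vertices[va], vertices[vt])
    )
    return new_vertices, new_triangles

# Hypothetical mesh with five vertices and three triangles.
verts = {0: (0, 0, 0), 1: (2, 0, 0), 2: (2, 2, 0), 3: (0, 2, 0), 4: (4, 0, 0)}
tris = [(0, 1, 2), (0, 2, 3), (1, 4, 2)]
new_verts, new_tris = collapse_edge(verts, tris, va=1, vt=2)
print(new_tris)
```

In this small example the two triangles that contain both endpoints disappear and the remaining triangle is re-attached to Va.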
For a 3D object, the more triangles that are used to represent a mesh, the more precise the description of the object. This increases the perceived quality. However, it slows down the geometry and rasterizing stages because more computation power is needed there. Consequently it decreases the number of frames that can be generated each second (FPS, frames per second), while most video or 3D game applications desire a fixed FPS. Another thing that we have to consider here is that the same application can be run on different platforms, e.g., a desktop PC or a PDA, which provide completely different computation abilities and power consumption features. Hence different qualities of the same service have to be supplied to achieve a similar FPS. For a given computation platform and a desired FPS, the number of triangles it can handle in one frame is almost fixed. Based on the number of objects in the current frame and what these objects are, the QoS controller will assign the triangles to each object so that the user can get the best-effort visual quality at a fixed frame rate.
EXPERIMENT RESULTS
In the QoS kernel of the considered 3D application, for each visible object in the scene, a separate thread frame will be triggered, in which the number of triangles is adjusted to the number specified by the QoS algorithm. The gray-box model of that thread frame is shown in Fig. 7, where all the internal TNs are numbered as well. Table 1 gives the profiled execution time and energy consumption of each TN on a 2.4 V StrongARM processor.
Depending on whether each object is visible for the first time, which can be true only once, a branch will be taken. If it is the first time visible, the mesh and texture have to be parsed, generated and bound (TNs 1 to 9). If it is not, the current number of faces will be compared to the desired number of faces to decide whether to collapse edges (TNs 10, 11 and 12) or to split more vertices (TNs 13, 14 and 15).
The edge collapse and vertex split are done in a progressive and iterative way, with a while loop over TN 10 or 13, to avoid abrupt changes of the object shape. The iteration number of this loop depends on the difference between the current and desired number of faces of that object, and it varies from 2 to 1000 based on the profiling data. A single Pareto curve is not enough to represent these highly dynamic features. We have to distinguish between first-time-visible and not-first-time-visible, and between the while loop iteration numbers. For the latter, if we did not differentiate the implementation of the while loop body, we would have to consider the implementation for the worst case, which is 1000 iterations and much bigger than the average case. To avoid that, we have introduced the concept of "scenario selection", where different Pareto curves are assigned (in an analysis step at design time) to run-time cases that motivate a set of different implementations. Based on this analysis, we have decided to use 9 different Pareto curves in the QoS application to represent the run-time behavior of one object: the first one is for when it is first time visible; the others are for when it is not and has to be collapsed or split. "Collapse" and "split" are each assigned four curves with different implementations, corresponding to different iteration sub-ranges. For example, the first curve of "collapse" will be selected if the actual iteration number falls between 2 and 12. Therefore, we only have to consider the worst case of that sub-range, which is 12 in this example, not the worst case of the whole range, which is 1000. Extra code has been inserted to enable this. We have selected these ranges based on the profiling data from the application and they are illustrated in Table 2.
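The run-time side of scenario selection amounts to a small lookup, sketched below in Python. Only the first sub-range (2 to 12) and the overall maximum of 1000 iterations are given in the text. The other two sub-range bounds are placeholders standing in for the Table 2 values, and the function name, arguments and returned labels are hypothetical.

```python
from bisect import bisect_left

# Upper bounds of the four iteration sub-ranges. Only the first bound (12)
# and the last (1000) appear in the text; 50 and 200 are placeholders for
# the Table 2 values.
UPPER_BOUNDS = [12, 50, 200, 1000]

def select_scenario(first_time_visible, current_faces, desired_faces, iterations):
    """Return a label identifying which of the 9 Pareto curves to use."""
    if first_time_visible:
        return ("first_visible", None)
    kind = "collapse" if current_faces > desired_faces else "split"
    if not 2 <= iterations <= UPPER_BOUNDS[-1]:
        raise ValueError("iteration count outside the profiled range")
    # Index of the first sub-range whose upper bound covers the count.
    return (kind, bisect_left(UPPER_BOUNDS, iterations))

print(select_scenario(False, 500, 300, 12))
print(select_scenario(False, 100, 300, 13))
```

One label for the first-time-visible case plus four sub-ranges each for collapse and split gives the 9 curves mentioned above, and each curve only has to cover the worst case of its own sub-range.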
In the QoS kernel, which is typical for future object-based multi-media applications, we know at the beginning of each frame the characteristics of its content. In this case we know [...] missed increases correspondingly.
Compared to REF1, for the single processor situation, the TCM approach consumes much less energy (saving of 58% when fps is 5 and 40% when fps is 10), while the deadline miss ratio remains the same. The latter is easy to understand because, when the time constraint is really stringent, the TCM method will automatically schedule all thread nodes to the highest possible voltage processor, which is just what REF1 does. When the time constraint is less tight, a much cheaper solution will be found by the TCM method. Comparing REF2 and REF1 shows that, for this type of dynamic multi-media application, state-of-the-art DVS cannot gain much because it does not exploit different combinations of TF realisations. This is especially so for a heterogeneous multi-processor context, where the prestored implementations of TN schedules and processor assignments could not be computed at run-time without too much run-time overhead. Only the Vdd selection can be performed at run-time in some approaches, but then not per processor. From the result we can also see that by increasing the number of processors, we can reduce the energy consumption even more while now meeting nearly all the deadlines.
The distribution of the scenario selection is given in Fig. 8, while Fig. 9 gives the distribution of the selected Pareto points in scenario 6, both when fps is 5. From the figures we can see that the TCM scheduler activates different scenarios dynamically and selects the optimal Pareto point from the activated scenario depending on the run-time situation, i.e. the resources available and the number of competitors. Most of the time, scenarios 1 and 5, which are the least time consuming ones, will be selected. Therefore, we avoid the worst case estimation and have more opportunities to scale down the voltage, compared to REF2. In scenario 6, the least energy consuming solution, Pareto point 5, is selected as long as it is possible; otherwise a more expensive one is chosen to meet the time constraint. The combination of scenario and Pareto point selection gives us the advantage of heavily exploring the design space at design time and finding the most energy efficient solution exactly for that situation at run time.
CONCLUSION
In this paper we presented a unique approach to manage concurrent tasks of dynamic real-time applications. We explained a methodology to map the applications in a cost (especially power) efficient way onto a heterogeneous embedded multiprocessor platform. This approach is based on a design-time exploration, which results in a set of schedules and assignments for each task, represented by Pareto curves. At run time, a low complexity scheduler selects an optimal combination of working points, exploiting the dynamic and non-deterministic behavior of the system. This approach leads to significant power savings compared to state-of-the-art voltage scaling techniques because of two major contributions. First, we effectively combine a detailed intra-task design-time exploration, giving high scheduling quality, with a low overhead run-time scheduler. Second, by considering the information provided by the application at run time, we use a scenario approach to avoid the worst case execution time estimation. In the future, we will extend this work to provide tool support for code synthesis, concurrency transformation and RTOS integration.
