Abstract This paper introduces the first hardware/software co-synthesis algorithm for distributed real-time systems that optimizes the memory hierarchy along with the rest of the architecture. Our algorithm synthesizes a set of real-time tasks with data dependencies onto a heterogeneous multiprocessor architecture that meets the performance constraints with minimized cost. Our algorithm chooses cache sizes and allocates tasks to caches as part of co-synthesis. Experimental results, including examples from the literature and results on an MPEG-2 encoder, show that our algorithm is efficient and that, compared with existing algorithms, it can reduce the overall cost of the synthesized system.
Introduction
This paper describes a new system-level algorithm for hardware-software co-synthesis of multi-rate real-time systems on heterogeneous multiprocessors. Unlike most of the previous work in hardware-software co-synthesis, the algorithm not only synthesizes the hardware and software parts of the applications, but also the memory hierarchy: it takes into account the impact of memory hierarchies on system performance and cost in the co-synthesis process. The algorithm targets periodic real-time applications running at multiple rates. The target architecture is a heterogeneous multiprocessor architecture that consists of multiple processing elements (PEs) of various types (i.e., general-purpose processors, domain-specific CPUs such as DSPs, and custom hardware), memory components at different levels of the memory hierarchy, and communication links. The algorithm synthesizes the hardware, software and memory hierarchy based on a multiprocessor target architecture to meet the performance constraints with minimal cost.
With embedded CPU cores becoming increasingly common in VLSI systems, and with increasing use of multiple embedded cores on a single chip (systems on a chip), system designers need to implement major subsystems using real-time system design techniques such as multiple, prioritized tasks sharing CPUs. The design of these systems (core-based systems) is complex and requires sophisticated analysis and optimization. Hardware-software co-synthesis can be used to explore the design space and synthesize the application into hardware and software cores that meet design constraints (performance, cost, power, etc.).
Memory hierarchies, in particular caches, are essential for modern RISC embedded cores to obtain sustained high performance. As the functionality of embedded systems increases, caches and memories represent a significant portion of the cost, size, weight, and power consumption of many embedded systems. Ineffective use of the memory hierarchy requires extra transfers of data and program and can significantly increase both execution time and power consumption.
ICCAD98, San Jose, CA, USA.
The memory hierarchy must be taken into consideration in system-level design to minimize the overall system cost. For example, to improve the performance of a system, the designer may use a faster and usually more expensive CPU, add a piece of custom hardware, or use a bigger cache. It is important for the designer to evaluate the tradeoffs among these different design options in order to find the optimized design. Although many processor chips already include caches, they still provide several choices of cache sizes for the same CPU type. In core-based design for systems-on-a-chip, the designer has the option of adjusting the cache sizes of the CPU cores. However, most previous research in co-synthesis has ignored the cache's impact and concentrated only on the synthesis of PEs for software (processors) and hardware (ASICs). So far, there is no systematic approach for the design of memory hierarchies in co-synthesis. In our previous work [8], we designed a task-level cache performance model and concentrated on analysis and scheduling with memory hierarchy, but not co-synthesis.
To handle memory hierarchies in a multi-tasking environment, we need a high-level model that can efficiently model the application performance in the presence of a memory hierarchy. In this paper, we first present a task-level model that efficiently bounds the cache performance of tasks running in a multi-tasking environment (see Sec.3). We incorporate this model into hardware-software co-synthesis and propose a new co-synthesis algorithm that optimizes the use of the memory hierarchy and synthesizes cache memory together with hardware and software to optimize the total system cost (see Sec.4). Sec.5 discusses the experimental results of our algorithm.
Previous Work
Related work includes studies from hardware-software partitioning, hardware-software co-synthesis, performance analysis with caches, and real-time computing.
Hardware-software partitioning [3, 4, 14, 16] has been a major topic in the area of hardware-software co-design. Most of the partitioning algorithms implement the system based on a template of a CPU (software) and an ASIC (hardware). Recent work in co-synthesis has used a more generalized model consisting of heterogeneous multiprocessors with arbitrary communication links. The SOS algorithm developed by Prakash and Parker [12] used an integer linear programming (ILP) approach. Yen and Wolf's work [15, 17] used a faster iterative improvement approach. The co-synthesis algorithms developed by Dave et al. [2] can handle multiple objectives such as cost, performance, power and fault tolerance. However, all of these algorithms ignore the memory hierarchy.
Recent research, such as the path-based analysis algorithm of Li et al. [10], has developed cache models for analyzing the performance of a single program. While such models provide accurate estimates of the performance of a single program, they do not take into account the effects of preemptions between multiple tasks, and they are much too expensive to be used in system-level synthesis and design exploration. When one task preempts another, it may (or may not) change the state of the cache in a way that compromises the performance of the originally executing task. For preemptive real-time systems, such interactions are critical to evaluate during system-level architecture design. Lee et al. [7] proposed a technique to analyze cache-related preemption delays of tasks that cause unpredictable variation in task execution time for preemptive scheduling. Kirk and Strosnider [6] developed a SMART (strategic memory allocation for real-time) cache design that partitions the cache to provide predictable cache performance. Danckaert et al. [1] studied memory optimization aiming to reduce the dominant cost of memory in hardware-software co-design of multi-media and DSP applications. Their algorithm concentrated on reducing data storage and did not consider a multi-level memory hierarchy. All these approaches [1, 6, 10] rely on program-level analysis, and are too expensive to be used in design space exploration of multiple tasks.
Research in the area of real-time scheduling provides an important foundation for our co-synthesis algorithm, which targets multi-rate real-time tasks. In a uniprocessor environment, real-time systems commonly use one of two scheduling policies to schedule periodic tasks: earliest-deadline-first (EDF) and rate-monotonic scheduling (RMS) [11]. For distributed real-time systems, Ramamritham [13] used a task-graph unrolling approach and developed a heuristic allocation and scheduling algorithm that considered data dependencies, communication, and fault-tolerance requirements; Li and Wolf [9] developed an efficient hierarchical algorithm to schedule and allocate multi-rate tasks with precedence constraints.
Task-Level Memory Hierarchy Model
Accurate estimation of memory hierarchy (cache) behavior requires program-level or trace-level analysis, which is too expensive to be used in the design exploration of multiple tasks on a multiprocessor architecture. A high-level model of memory hierarchy performance is critical for integrating the memory hierarchy into co-synthesis of multiple tasks. The model should be able to:
1. efficiently model the multi-tasking environment, which may be further complicated by preemptions;
2. efficiently model cache behavior (hits/misses) when the cache size changes.
In our earlier work [8], we proposed the first task-level model of memory hierarchy performance for system-level synthesis, and allocation/scheduling algorithms with memory hierarchies. This model treats each task as an entity, partitions the caches, and reserves some partitions exclusively for certain tasks to guarantee predictable performance of these tasks. While it provides a fast means to bound the cache performance of tasks running in a multi-tasking environment, the cache partitioning/reservation approach results in inefficient utilization of the caches. Furthermore, the model is not flexible in terms of the memory allocation of tasks: for tasks with cache partitions on the same cache, the compiler has to make sure that they do not map to overlapped cache locations.
We have developed a task-level cache performance model that handles arbitrary mapping of tasks to caches. Fig. 1 shows how tasks can map to a cache. For simplicity, we make the following assumptions about the tasks and the caches:
Assumption 1: Only a one-level cache is modeled and tasks are well-contained in the level-1 cache (each task's program size and data size are no bigger than the instruction and data cache size, respectively). This may not be a reasonable assumption in a general-purpose system, but it is plausible for many embedded systems. The kernels of time-critical operations are frequently small enough to fit into a modest-sized cache. Even when a task is too large to be contained in a level-1 cache, it can be specified at a finer granularity to satisfy the assumption.
Assumption 2: The caches are direct-mapped and the cache sizes are powers of two.
Assumption 3: A task's program is allocated a continuous region of the memory and is, therefore, mapped into a continuous region of the cache. A task's data can be scattered in several regions of the memory.
Due to the first assumption, when a task executes on a processor, if not preempted by other tasks, the only cache misses are compulsory misses [20]. As opposed to capacity and conflict misses, the number of compulsory misses of a task does not change with cache size.
We now analyze the cache performance of multiple tasks for a fixed cache size. Note that only compulsory misses can happen because of Assumption 1. The cache performance of a task depends on the history of task execution on the processor: if the task is executed on the processor for the first time, it is initially loaded into the cache (cold start), with compulsory misses; if the task has been executed before and has not been overwritten by other tasks, then there are no cache misses; if it has been partly overwritten by other tasks, then there are compulsory misses associated with the cache regions that were overwritten. It is important to monitor the change of the cache status to tightly bound the cache performance of tasks.
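The three cases above (cold start, untouched, partly overwritten) can be illustrated with a minimal sketch. It assumes the cache has already been split into numbered regions and that the misses needed to load each task's part of a region are known; `run_task`, `wcet_in_context`, the linear miss-penalty model, and all numbers are hypothetical and are not taken from the paper's equations.

```python
def run_task(task, cache_state, region_misses):
    """Return the compulsory misses `task` incurs now and update the cache state.

    cache_state   : dict region -> task whose contents currently fill that region
                    (a region absent from the dict has never been loaded)
    region_misses : dict task -> {region: misses needed to load the task's
                    part of that region}
    """
    misses = 0
    for region, cost in region_misses[task].items():
        if cache_state.get(region) != task:   # cold, or overwritten by another task
            misses += cost
            cache_state[region] = task        # the task now occupies this region
    return misses


def wcet_in_context(wcet_base, misses, miss_penalty):
    # Assumed linear model: each miss adds a fixed penalty to the miss-free WCET.
    return wcet_base + misses * miss_penalty


# Hypothetical numbers: task a spans regions 1-2, task b spans regions 2-3.
region_misses = {"a": {1: 4, 2: 6}, "b": {2: 5, 3: 3}}
state = {}
trace = [run_task(t, state, region_misses) for t in ["a", "a", "b", "a"]]
print(trace)  # [10, 0, 8, 6]
```

The second run of a is free because nothing overwrote it, while the last run of a only reloads region 2, which b had overwritten.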
As shown in Fig.1, when tasks are mapped to a cache, there can be overlap between tasks. These overlaps determine all the possibilities of task overwriting. We divide the cache into several regions according to distinct task boundaries. Suppose there are n tasks mapped to the cache; the number of task boundaries is then bounded by 2n, which means the cache is divided into at most 2n + 1 regions.
[Figure 1: Tasks mapped to a cache.]
[Eq. (1), which defines the cache state state(i) of region i in terms of the task occupying it, is illegible in this copy.]
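The division of a cache into regions at task boundaries can be sketched as follows. This is only an illustration under Assumptions 1-3: addresses and sizes are in arbitrary units (bytes or cache lines), a direct-mapped cache places address x at offset x mod cache_size, and the function names and example addresses are hypothetical.

```python
def cache_intervals(start, size, cache_size):
    """Cache-offset intervals [lo, hi) covered by a contiguous memory range."""
    lo = start % cache_size
    hi = lo + size
    if hi <= cache_size:
        return [(lo, hi)]
    return [(lo, cache_size), (0, hi - cache_size)]   # range wraps around


def divide_cache(tasks, cache_size):
    """tasks: {name: (start_address, size)} with size <= cache_size.

    Returns the list of regions and, for each region, the set of tasks
    that map onto it.
    """
    cuts = {0, cache_size}
    spans = {}
    for name, (start, size) in tasks.items():
        spans[name] = cache_intervals(start, size, cache_size)
        for lo, hi in spans[name]:
            cuts.update((lo, hi))                     # each task adds <= 2 cuts
    edges = sorted(cuts)
    regions = list(zip(edges, edges[1:]))
    occupants = {
        r: {t for t, ivs in spans.items()
            if any(lo <= r[0] and r[1] <= hi for lo, hi in ivs)}
        for r in regions
    }
    return regions, occupants


regions, occupants = divide_cache({"a": (0, 64), "b": (32, 64)}, 128)
print(regions)            # [(0, 32), (32, 64), (64, 96), (96, 128)]
print(occupants[(32, 64)] == {"a", "b"})   # True
```

With n = 2 tasks the sketch produces 4 regions, within the 2n + 1 bound; only the region shared by a and b can cause one task to overwrite the other.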
In summary, for a fixed cache, the cache performance model:
1. maps tasks to the cache and divides the cache into regions according to task overlaps;
2. for each task and each of its related regions, obtains the number of compulsory misses of that task associated with that region;
3. during multiple-task execution, monitors the cache state to compute the number of cache misses and WCETs for the tasks in their execution context, using Eq.(1) and Eq.(2).
When we change the cache size, the overlap between tasks may change. In Fig.2, we double the cache size of Fig.1; tasks a-d map differently onto the new cache and generate different divisions of the cache. However, an important observation is that doubling the cache size does not incur more divisions on the cache: the number of regions that a task spans can only stay the same or decrease.
Since the number of compulsory misses of a task on a particular region does not change with cache size [20], we do not need to re-compute the compulsory miss numbers for the tasks. For example, in Fig.1, task a spans four regions 1-4; when the cache size is doubled, as shown in Fig.2, since task b no longer overlaps a, task a now spans two regions (1,2) and (3,4). The compulsory misses of a for these two regions can be easily computed by adding up the compulsory misses of their corresponding subregions 1-4. Based on this observation, to analyze all possible cache sizes, we can start from the smallest cache that satisfies Assumption 1; the analysis of any other cache size can then be done inductively from the cache half its size.
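The inductive step amounts to summing sub-region miss counts whenever a boundary disappears. A minimal sketch of that bookkeeping follows, with hypothetical miss counts for task a; the function name and the numbers are assumptions for illustration only.

```python
def merge_regions(sub_misses, groups):
    """Sum per-subregion compulsory misses into the coarser regions in `groups`."""
    return {group: sum(sub_misses[r] for r in group) for group in groups}


# Task a in the smaller cache spans regions 1-4 (hypothetical miss counts).
misses_a = {1: 3, 2: 5, 3: 2, 4: 6}
# After doubling, task b no longer splits a, leaving regions (1,2) and (3,4).
merged = merge_regions(misses_a, [(1, 2), (3, 4)])
print(merged)  # {(1, 2): 8, (3, 4): 8}
```

No program-level re-analysis is needed for the larger cache: the total misses of a (16 here) are unchanged, only their grouping into regions differs.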
The above discussion is based on the assumption that each task is mapped to one continuous region of the cache. While this is true for a task's program, it is not valid for task data, which may occupy several disjoint regions (Assumption 3). The only difference is that the multiple data regions for one task will result in more divisions on the data cache, but a similar analysis still applies.
Hardware/Software Co-synthesis with Memory Optimization
Based on the task-level model for cache performance, we have built a framework for hardware/software co-synthesis with cache. Fig.3 shows the flow graph of our framework. It has two main phases: the first phase, parameter extraction, prepares for co-synthesis by extracting task graphs and task-level parameters from the original application specifications (source programs); these parameters are then used by the second phase, design space exploration (co-synthesis), to synthesize the architecture. Sec.4.2 and Sec.4.3 will describe these two steps respectively.
Problem Specification
The problem specification of our co-synthesis algorithm includes two components: a set of real-time applications and a technology database. The real-time applications are periodic, running at multiple rates. Each application is represented by an acyclic task graph, as shown in Fig.4, where nodes represent tasks, and directed edges represent data dependencies between tasks. Different tasks may share program or data in the memory. The data dependencies can be either read-after-write (RAW), write-after-read (WAR) or write-after-write (WAW).
Tasks in one application run at the same rate. We assume that the deadline of the tasks is equal to their period. Each task can have several implementation options differing in area cost and execution time. The technology database provides the tasks a number of choices for the types of processors, ASICs, and caches, each associated with a certain cost.
We use a heterogeneous shared-memory multiprocessor as the template architecture (see Fig.5). The architecture has a number of PEs of various types. Each processor has its private instruction cache and data cache. An ASIC may have a private data cache. Lower-level caches and memory are shared. PEs and memory components are linked by a shared bus.
The goal of the algorithm is to:
1. choose the number and types of components in the target architecture from the technology database, such that the applications can be scheduled to meet their performance constraints (deadlines) and the total cost of the resulting system is minimized;
2. return the allocation and scheduling of the tasks on the resulting architecture.
Parameter Extraction
For each task, from its program-level description, we extract task-level parameters that are essential for evaluating the task's execution and caching behaviors. These parameters include the worst-case execution time when there are no cache misses (WCET-base), and the task's instruction and data address ranges in memory (program_region, and data_region1, data_region2, ...). Then, separately for data and instruction caches, we compute the smallest cache size that satisfies Assumption 1 and, assuming all tasks are allocated to this one cache, we divide the cache into regions according to task overlaps in the cache and compute each task's compulsory misses on each of its relevant regions (as described in Sec.3). In the co-synthesis process, only a subset (say T) of all the tasks will be allocated to a given PE (say P); this will result in fewer regions in P's cache. The compulsory misses for each task in T associated with each cache region can be obtained by removing boundaries related to tasks that are not allocated to P. In our framework, task execution times and memory addresses are obtained by behavior simulation tools, and the number of compulsory cache misses is easily obtained with a cache simulator. These parameters can also be obtained using performance analysis tools such as Cinderella [10].
Cache coherency. In a shared-memory multiprocessor architecture, caching of shared data introduces the cache coherency problem. In our algorithm, we use the write-invalidate protocol. A write on one PE will invalidate all other copies of the same data on other PEs to ensure this PE has exclusive access to the data. After a task finishes its execution, its data is written back to the main memory such that the updated data can be used by other tasks. Note that there is no need to write to the main memory during the execution of a task (say a), because any other tasks that are data-dependent on a do not start running until a is finished.
Hardware/Software Co-synthesis
Based on the task-level cache model described in Sec.3, we have designed an iterative improvement algorithm that uses the task-level parameters as inputs and outputs a design that meets the performance constraints with minimal cost.
The total cost C of the system is evaluated as the sum of the component costs (C(...)):
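As a hedged reading of this sum, assuming the costed components are the PEs in the design together with the instruction and data caches attached to each of them (the index j and the symbols I_j and D_j are notation introduced here for illustration, not the paper's), the total cost could be written as:

$$C = \sum_{j \in PEs} \left[ C(PE_j) + C(I_j) + C(D_j) \right]$$

where C(I_j) or C(D_j) is zero when PE_j has no such cache. Any other costed components, such as shared memories, would add further terms of the same kind.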
Performance evaluation.
We have used two different methods at different points in the design process to evaluate the performance of a design. One method is to compute the workload (Eq.(4)) on each PE to quickly check its feasibility. The workload on a PE is the sum of the workloads of all the tasks allocated to this PE:
$$\mathrm{Workload}(PE) = \sum_{i \in Tasks} \frac{\mathrm{WCET}(task_i, PE)}{\mathrm{Period}(task_i)} \qquad (4)$$
where Tasks is the set of tasks allocated to PE. If any PE in the system has a workload higher than 100%, then the design is not feasible. Workload analysis is used in the intermediate steps of the design space search to quickly weed out infeasible designs. However, due to data dependencies and bus contention, a PE can rarely achieve 100% utilization. A design is validated only when a schedule can be constructed without violating task deadlines.
Synthesis refers to the exploration of the design space. It is integrated with the cost/performance evaluation and scheduling algorithm to find the optimized design. Our synthesis algorithm consists of the following steps:
1. Find an initial solution.
2. Iterative PE and cache cost reduction.
3. Allocate and schedule tasks and bus transfers for the final design.
In step 1, the initial solution is constructed by assigning each task in the task graphs the fastest PE that is available for the task. The PE with the least WCET-base is chosen. If the PE chosen is a CPU, instruction and data caches of the task's program and data sizes are added; if it is an ASIC, a data cache of the size of the task's data size is added. The performance of the initial solution is evaluated. If it cannot meet the real-time deadlines, then for the given task graphs, there exists no feasible design given the current technology database and the algorithm returns without a solution.
The PE and cache cost reduction step is the core step of the algorithm and Sec.4.3.2 describes the details of this step.
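The quick workload check of Eq.(4) used to weed out infeasible designs can be sketched in a few lines. The data layout (a list of (WCET, period) pairs per PE), the function names, and the numbers are assumptions for illustration; the WCET values would in practice already include the cache effects of Sec.3.

```python
def workload(tasks):
    """tasks: iterable of (wcet, period) pairs for the tasks on one PE."""
    return sum(wcet / period for wcet, period in tasks)


def infeasible_pes(allocation):
    """Names of PEs whose workload exceeds 100%."""
    return [pe for pe, tasks in allocation.items() if workload(tasks) > 1.0]


# Hypothetical allocation: cpu0 is loaded to 45%, asic0 to 110%.
allocation = {"cpu0": [(2, 10), (3, 12)], "asic0": [(6, 10), (5, 10)]}
print(infeasible_pes(allocation))   # ['asic0']
```

Passing this check is necessary but not sufficient: a design is accepted only after a schedule meeting all deadlines has actually been constructed.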
Task Allocation and Scheduling
Task allocation and scheduling are important aspects of the co-synthesis algorithm. The scheduling routine is used not only to generate the allocation and schedule of the final design, but also to evaluate the performance of intermediate solutions, and to help generate new solutions. A schedule that utilizes the PEs well is critical to lowering the system cost. A fast scheduler is important to shorten the performance evaluation time of a design and, therefore, allows the design space to be more thoroughly searched.
Scheduling of multiple real-time tasks onto heterogeneous multiprocessors is a difficult problem in itself. The addition of caches makes it even more complicated. We built our scheduling algorithm based on the hierarchical scheduling algorithm (referred to as the HS-algorithm) developed by Li and Wolf [9]. The HS-algorithm uses the hierarchical structure of the system's task graphs to hierarchically allocate and schedule tasks on the multiprocessors and memory transfers on the bus, to meet the real-time constraints. The HS-algorithm targets the same task model and architecture model as used by our framework, but did not originally consider memory hierarchies. We added caches to the PEs and integrated our memory hierarchy model into the HS-algorithm. In the HS-algorithm, a task's execution time on a certain PE is assumed to be fixed. This is no longer valid when caches are added: the execution time of a task task_i on a PE PE_j depends not only on the speed of the PE, but also on the speed of the cache and the current cache size and cache state. Therefore, instead of using a fixed WCET(task_i, PE_j), we dynamically compute it according to the current cache state (Sec.3). This change is reflected in the calculation of dynamic urgency, a measure used by the HS-algorithm to decide the next task to schedule. In the dynamic urgency equation of the HS-algorithm [9], WCET(task_i, PE_j) should be computed according to the current cache state. Dynamic urgency encourages a task to re-use the cache state to reduce cache misses. Other parts of the equation, as well as other parts of the scheduling algorithm, remain the same and are not discussed in this paper.
PE and Cache Cost Reduction
PE and cache cost reduction is the most critical step in the co-synthesis algorithm. We used an iterative improvement strategy to search for the optimized design by cutting PE and cache cost iteratively. A single iteration of cost reduction is shown in Fig.6. This step tries to eliminate lightly loaded PEs by moving the tasks on those PEs to other PEs. The PEs are ordered by their workload (line 3). Starting from the most lightly loaded PE, we identify the tasks on it that can be executed on other PEs (line 6); these tasks are then moved to the other PEs that provide the best performance for the tasks (line 7); the cache sizes of the other PEs increase to accommodate the tasks that are newly moved there (line 8). The PE is removed if it becomes empty (lines 10-11). When there are tasks on a PE that cannot be moved to other PEs, the algorithm tries to implement the remaining tasks with a cheaper PE (lines 12-13). If such a PE cannot be found, the current PE is kept in the design, but an attempt is made to cut its instruction and data cache sizes (lines 14-16). In the single-iteration procedure, when we move tasks from one PE to another, the performance constraint may be violated. We use the quick workload bound method (Eq.4) to check the utilization of PEs. In summary, a single iteration of cost reduction is achieved by:
- elimination of PEs that become empty after moving all their allocated tasks to other PEs;
- replacing PEs with cheaper ones; and
- reducing cache cost.
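The following is a simplified sketch of one such pass, not a transcription of Fig.6. It assumes a hypothetical data layout (a `PE` record, a `wcet` table keyed by (task, PE type), a `period` table and a `type_cost` table), uses only the workload bound of Eq.(4) as the feasibility check, and leaves out the cache resizing steps entirely.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)          # compare PEs by identity, not by field values
class PE:
    name: str
    pe_type: str
    tasks: list = field(default_factory=list)


def load(tasks, pe_type, wcet, period):
    """Workload (Eq. 4) of `tasks` on a PE of type `pe_type`."""
    return sum(wcet[(t, pe_type)] / period[t] for t in tasks)


def reduce_once(pes, wcet, period, type_cost):
    """One simplified cost-reduction pass over `pes` (modified in place)."""
    order = sorted(pes, key=lambda p: load(p.tasks, p.pe_type, wcet, period))
    for pe in order:                                   # most lightly loaded first
        for task in list(pe.tasks):
            # Other PEs that can run the task and stay within 100% workload.
            targets = [q for q in pes
                       if q is not pe and (task, q.pe_type) in wcet
                       and load(q.tasks + [task], q.pe_type, wcet, period) <= 1.0]
            if targets:                                # move to the fastest one
                best = min(targets, key=lambda q: wcet[(task, q.pe_type)])
                pe.tasks.remove(task)
                best.tasks.append(task)
        if not pe.tasks:
            pes.remove(pe)                             # PE became empty
            continue
        # Otherwise try a cheaper PE type that still fits the remaining tasks.
        cheaper = [ty for ty, c in type_cost.items()
                   if c < type_cost[pe.pe_type]
                   and all((t, ty) in wcet for t in pe.tasks)
                   and load(pe.tasks, ty, wcet, period) <= 1.0]
        if cheaper:
            pe.pe_type = min(cheaper, key=type_cost.get)
    return pes


# Hypothetical example: one fast and one slow processor, three tasks.
wcet = {("t1", "fast"): 2, ("t2", "fast"): 2, ("t3", "fast"): 1,
        ("t1", "slow"): 8, ("t2", "slow"): 8, ("t3", "slow"): 1}
period = {"t1": 10, "t2": 10, "t3": 10}
type_cost = {"fast": 100, "slow": 40}
pes = [PE("p0", "fast", ["t1", "t2"]), PE("p1", "slow", ["t3"])]
reduce_once(pes, wcet, period, type_cost)
print([(p.name, p.pe_type, p.tasks) for p in pes])
```

In this example the lightly loaded slow processor gives up its only task and is removed, while the fast processor cannot be swapped for the slow type because the three tasks would exceed 100% workload there.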
The iterative algorithm is shown in Fig.7. Starting from the initial design, the algorithm performs the PE/cache cost reduction step by step, until there is no improvement in three consecutive iterations. For each new design returned by a single iteration of PE and cache cost reduction, we call the re-allocation and scheduling procedure to:
- check the validity of the design;
- if it is valid, generate a new allocation and schedule that is customized to the current system.
This is important because the single iteration of cost reduction moves tasks between PEs and eliminates and replaces some PEs. The resultant design may not have a balanced allocation of tasks on the PEs included in the design. We re-allocate and re-schedule the task graphs to achieve a better utilization of the PEs in the current architecture. The newly re-allocated design is used as the starting point of the next cost reduction iteration. (Fig.8(c).) Fig.8(d) shows the result of the final design after the tasks are re-allocated and re-scheduled.
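The outer loop with its "three iterations without improvement" stopping rule can be sketched as below. This is an illustration rather than the procedure of Fig.7: `reduce_once`, `reschedule` and `cost` are hypothetical callables supplied by the caller, `reschedule` is assumed to return None when no valid schedule exists, and `reduce_once` is assumed to return a new design rather than modifying its argument.

```python
def iterate_cost_reduction(design, cost, reduce_once, reschedule, patience=3):
    """Repeat cost reduction until `patience` consecutive iterations fail to improve."""
    current = best = design
    stale = 0
    while stale < patience:
        candidate = reschedule(reduce_once(current))
        if candidate is None:          # reduced design cannot be scheduled
            stale += 1
            continue
        current = candidate            # re-allocated design seeds the next pass
        if cost(candidate) < cost(best):
            best, stale = candidate, 0
        else:
            stale += 1
    return best


# Toy stand-in: a "design" is just its cost, reducible in steps of 10 down to 70.
result = iterate_cost_reduction(
    100,
    cost=lambda d: d,
    reduce_once=lambda d: max(d - 10, 70),
    reschedule=lambda d: d,
)
print(result)  # 70
```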
Complexity of Our Algorithm
To determine the time complexity of our framework, we recognize that the dominant part is the parameter extraction phase, where program-level estimation or simulation is needed to estimate the WCET-base of tasks, and the program/data memory locations for tasks on different types of PEs. For one task, the complexity of parameter extraction is either related to the size of its program-level representation if estimation tools are used, or related to the size of its execution trace if simulation tools are used. In our framework, program-level or trace-level analysis is only needed once to extract task-level parameters; the design space search is performed on a task-level abstraction which is much more manageable.
We analyze the worst-case complexity of our co-synthesis algorithm. Suppose there are m task graphs, each with at most k tasks, so the total number of tasks, n, is bounded by mk. Let P be the number of different PE types. Let p be the number of PEs in a design. Because the maximum number of PEs in any design will not exceed the number of tasks, p = O(n). Each full allocation/scheduling step has a complexity of O(n^2 p) = O(n^3). The complexity for a single iteration of PE/cache cost reduction is O(pnP) = O(n^2 P). The total number of iterations is bounded by O(pP) = O(nP) because each PE is either eliminated or can be replaced by a cheaper PE at most P times. Therefore, the worst-case complexity of the co-synthesis algorithm is O(nP x (n^2 P + n^3)) = O(n^4 P + n^3 P^2).
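Written out step by step, the bound is the iteration count multiplied by the work per iteration (one cost reduction pass plus one full allocation/scheduling step), with p = O(n) substituted. The intermediate grouping below is ours; the individual terms are those stated above.

$$O(pP)\cdot\left[O(pnP) + O(n^2 p)\right] = O(nP)\cdot\left[O(n^2 P) + O(n^3)\right] = O(n^3 P^2 + n^4 P)$$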
Experimental Results
We conducted two sets of experiments: synthetic task graphs from the literature, and real-life examples, including a real MPEG-2 encoder. To compare with existing co-synthesis algorithms, we used examples from the literature [2, 5, 12] in the following setups.
Without cache: while running our algorithm, we set the cache part in the technology database to be null, so that the synthesized architecture does not have caches. The results show that even without the benefits of caches, our algorithm can achieve comparable results.
With fixed-size caches associated with each processor: similar to a typical design practice, we manually picked fixed cache sizes to be used in the target architecture. The results show improvements in terms of system cost, compared to the no-cache results.
Co-synthesis with cache optimization: this allows the full potential of our algorithm to synthesize software, hardware, and caches simultaneously. The results show further cost reduction over the fixed-size cache approach.
For the second and the third setups, we needed more input parameters required by our algorithm, such as the memory regions of programs and data for tasks. These parameters were generated because the examples from the literature only have the task-graph representations.
We applied our algorithm to a real MPEG-2 video encoding algorithm. MPEG encoding involves both intensive computation and large amounts of data transfer. In real time, image frames arrive at the rate of 30 frames per second. We used the MPEG-2 encoding software from the MPEG Software Simulation Group. We first extracted the task graph, which is composed of 1350 blocks with 12 tasks per block. The graph is huge but the blocks share the same structure, which our algorithm can take advantage of. The technology database consists of SPARC processors, ASICs for DCT, IDCT, variable-length encoder and motion estimation, and SRAMs to be used as first-level caches. For the SPARC processors, WCET-base, program and data memory regions are obtained using a SPARC behavior simulator, Sparcsim [19]. A cache simulator was used to obtain the compulsory miss numbers. We assumed a cache and memory access time ratio of 1:20. We used the retail prices of SPARC processors and SRAMs.
WCETs and the cost for ASICs were estimated with high-level synthesis. Synthesis results of the MPEG encoder are shown in Table 2. Even in this example, with the huge number of tasks in the task graph, our algorithm was able to find a solution of good quality (average PE utilization 95%) in a short period of time. The short CPU time of our algorithm on such a big design is made possible by the efficient task-level cache performance model, and the hierarchical scheduling methodology which takes advantage of the task graph structures.
An interesting fact of the MPEG experiment is the CPU time spent in different phases of our framework: in the parameter extraction phase, for all the tasks, generating their execution traces (about 100M instructions in total), computing WCET-base and program/data memory mapping took about 4 hours in total; using the task execution traces, it took a little less than 1 hour to compute compulsory misses for all tasks. In contrast, the co-synthesis algorithm itself only took minutes (see Table 2). This shows that the task-level abstraction and the task-level cache model greatly speed up the design space exploration, which would have been impossible with program-level analysis tools that spend hours to evaluate just one single design for the MPEG encoder.
Table 2: Co-synthesis results for the MPEG-2 video encoder.
Conclusions
In this paper, we described a task-level model for bounding cache performance of tasks in a multi-rate, multi-tasking environment. This model is used by our algorithm for hardware-software co-synthesis with cache memory optimization. The algorithm is the first co-synthesis algorithm that considers the impact of the memory hierarchy on system performance and cost. Our algorithm synthesizes complex multi-rate real-time applications onto a heterogeneous multiprocessor architecture to meet real-time deadlines at minimal cost. The co-synthesis algorithm works at the task level, does not require a detailed program analysis, and is very computationally efficient.
Future work may include developing co-synthesis algorithms with a more generalized memory hierarchy model. We plan to model set-associative caches, and extend the one-level cache model to multiple-level caches. Secondly, our memory-hierarchy model optimizes context switching at the task level, which helps reduce not only computation time but also power consumption. We plan to develop a quantitative model for power consumption at the system level and use power as another objective of co-synthesis.
