Abstract
Introduction
This paper describes new techniques for the co-synthesis of distributed embedded systems.
We present an iterative improvement strategy which uses the sensitivity of the implementation to incremental modification. Our algorithm simultaneously designs the hardware engine, which consists of a network of heterogeneous processing elements (PEs), either CPUs or ASICs, and the application software architecture, which consists of the allocation of functions to PEs in the hardware engine and the schedule of their execution. We refer to the hardware engine and application software architectures together as the embedded system architecture, since both the hardware and software architectures contribute to the system design.
Distributed co-synthesis is important because many embedded systems are heterogeneous distributed machines: Rosebrugh and Kwang [15] described a pen-based system built from four processors, all of different types; modern automobiles include up to 60 microcontrollers ranging in size from 4-bit to 32-bit; many 35mm cameras include several microprocessors.
Embedded system synthesis is co-synthesis because the hardware and software must be designed together to meet performance and cost goals. In contrast to traditional distributed system design or hardware-software partitioning, we cannot assume that the topology of the distributed system is given. Our co-synthesis algorithm selects the number of PEs and the type of each PE, and configures the software architecture.
*This work was supported in part by a grant from the National Science Foundation.
©1995 ACM 0-89791-771-5/95/0011/0004 $3.50
The synthesis of communication is important for distributed embedded system design. We do not discuss communication delay and cost in this paper. However, our methods for the allocation and scheduling of processes can be extended to handle communication.
Previous Work
Performance analysis techniques are essential to any co-synthesis algorithm. Our co-synthesis algorithm is based on an analytic delay estimation algorithm [20] which extends rate-monotonic analysis (RMA) [7, 8, 16] to derive tight delay bounds on periodic tasks executing on a distributed system. Our algorithm can handle problems which include sets of processes with data dependencies; each set has its own period and deadline, and the computation times of processes and the periods are bounded but not necessarily constant.
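For orientation, the single-processor analysis that RMA-style delay estimation builds on can be sketched with the classic fixed-priority response-time recurrence. This is a textbook technique, not the distributed-system algorithm of [20]. The function name and the (computation time, period) tuple layout are illustrative assumptions, and deadlines are assumed equal to periods.

```python
def response_times(tasks):
    """Worst-case response times under fixed-priority preemptive scheduling.

    `tasks` is a list of (computation_time, period) integer pairs, ordered
    from highest to lowest priority. Deadlines are assumed equal to periods.
    Returns one response time per task, or None where the deadline is missed.
    """
    results = []
    for i, (c_i, t_i) in enumerate(tasks):
        higher = tasks[:i]
        r = c_i
        while True:
            # Each higher-priority task j can preempt ceil(r / T_j) times.
            nxt = c_i + sum(-(-r // t_j) * c_j for c_j, t_j in higher)
            if nxt == r or nxt > t_i:
                break
            r = nxt
        results.append(nxt if nxt <= t_i else None)
    return results


# Three tasks in rate-monotonic order (shortest period first).
print(response_times([(1, 4), (2, 6), (3, 12)]))  # [1, 3, 10]
```

The iteration stops either at a fixed point (the response time) or as soon as the estimate exceeds the period, so it always terminates.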
A great deal of recent work has studied hardware-software partitioning, which targets a one-CPU-one-ASIC topology. The algorithm of Gupta and De Micheli [5] moves operations from hardware to software to reduce system cost.
In a real-time embedded system, different tasks run at different rates, and one process can be interrupted by another. Many algorithms [6, 12, 10] for periodic tasks in distributed systems form one big task whose length is the least common multiple (LCM) of all the periods. The LCM method is inefficient when the periods are large and coprime; it cannot accurately handle non-constant periods or computation times; and it discourages static allocation and scheduling because it treats different instances of a task as different nodes over the length of the LCM. These distributed-system methods are therefore not suitable for co-synthesis.
Prakash and Parker [11] formulated distributed system co-synthesis as an integer linear program (ILP). They could simultaneously allocate and schedule processes while designing the underlying distributed engine. However, their ILP formulation cannot handle periodic and preemptive scheduling of processes in the RMA model, and their ILP algorithm sometimes required hours to execute. Wolf [19] developed a heuristic algorithm for distributed system co-synthesis which gives results comparable to ILP in many cases, but this algorithm uses scheduling bounds similar to those of Prakash and Parker.
D'Ambrosio and Hu [2] use simulation to judge the feasibility of a schedule during co-synthesis: they first enumerate a set of Pareto-optimal solutions, then screen those candidates for feasibility by simulation. Enumerating all possible configurations restricted them to exploring only one-CPU architectures. Simulation is both time-consuming and not guaranteed to prove feasibility.
Problem Formulation
Our problem formulation is similar to those used in distributed system scheduling and allocation problems. A process is a single thread of execution, characterized by a computation time, which is a function of the PE type to which it is allocated. We often use a table to show the computation time of a process on each type of PE which can implement the process. A task is a partially-ordered set of processes, which may be represented as a task graph, as shown in Figure 1. Each task is given a period (sometimes referred to as a rate constraint), which defines the time between two consecutive initiations; a hard deadline, which defines the maximum time allowed from initiation to termination of the task and must be satisfied; and a soft deadline, which describes the optimization goal of the task delay but does not have to be satisfied. The computation time of a process or the period of a task can be a constant or an interval specified by a lower bound and an upper bound. Release times (i.e., delayed initiation of a process) and multiple deadlines can be modeled by inserting dummy processes (processes with delay but not allocated on any physical PE) in the task graph.
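The task model described above can be written down as a small data structure. The following is a minimal sketch under stated assumptions: the class and field names (Interval, Process, Task, comp_time, and so on) are hypothetical and do not come from the paper's implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Interval:
    """A bounded quantity; a constant is an interval with lo == hi."""
    lo: float
    hi: float

    def __post_init__(self):
        if self.lo > self.hi:
            raise ValueError("lower bound exceeds upper bound")


@dataclass
class Process:
    name: str
    # Computation-time table: PE type -> Interval. A PE type that cannot
    # implement the process simply has no entry.
    comp_time: dict = field(default_factory=dict)

    def pe_types(self):
        """PE types that can implement this process."""
        return sorted(self.comp_time)


@dataclass
class Task:
    processes: list
    edges: list               # (predecessor name, successor name) pairs
    period: Interval
    hard_deadline: float      # must be satisfied
    soft_deadline: float = 0.0  # optimization goal only

    def predecessors(self, name):
        return [a for a, b in self.edges if b == name]
```

A dummy process for a release time would, in this sketch, be a Process whose delay is recorded outside the per-PE table, since it is not allocated to any physical PE.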
Co-synthesis produces an embedded system architecture. As illustrated in the right-hand side of Figure 1, the hardware engine architecture is a labeled graph whose nodes represent PEs and whose edges represent communication links. The allocation is given by a mapping of processes onto PEs. Some processes may be implementable by either a CPU or an ASIC; we assume that the processes have been partitioned so that they do not cross CPU-ASIC or CPU-CPU boundaries. The schedule of processes is an assignment of priorities to processes. The CPU always executes the highest-priority ready process to completion. There is a cost associated with each PE type. Cost can be the price, area, or power consumption of the system. The designer can specify a hard constraint and a soft constraint on the total system cost.
Sensitivity Analysis
Embedded system performance is not characterized by a single number, since each task can have its own deadline to meet. In non-real-time systems, we can evaluate the system by the magnitude of a single number: the total execution time [17]. In real-time systems, the satisfaction of the deadlines is more important than shorter delays. Unlike design problems on a fixed architecture, we also need to take into consideration the change in other criteria such as price, area, or power consumption, since the underlying hardware may change.
As in most gradient-search methods, we compute a local sensitivity: given the current design, we estimate how much the system performance and cost will change when a single process is reallocated. Given the ith design goal attribute (for example, a task delay), let u_i be its current value and v_i the value it will take after a reallocation. The values of u_i and v_i for a task delay are estimated by a performance analysis algorithm [20] with the scheduling method discussed in Section 5. Define the ith component of the displacement vector D to be
where W(z) is a weight function given in Figure 2. In other words, D_i represents the amount of the change v_i - u_i, but we give higher weight (a penalty) to the portion of the change above the hard constraint, and no credit (zero) to the portion below the soft constraint. It is divided by the hard constraint h_i for normalization: the tighter the hard constraint, the higher the weight on the change. The displacement vector characterizes the nonlinear design goal under the hard and soft constraints. Let the ith component of the target vector T be
As shown in Figure 3(b), because the final goal is the soft constraint, the vector from the current position to the soft constraint is the direction in which we want to move. We also normalize it by h_i. The sensitivity S is the magnitude of the projection of D on T:
This tells how much closer the move brings the design to the target. The sensitivity also trades off the different design goals according to the current system architecture: the attributes with tighter hard constraints and the attributes whose current values are farther from their soft constraints are weighted more heavily. A positive sensitivity implies an improvement, while a negative value means the result may become worse.
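The description above can be made concrete with a small numerical sketch. It is one reading consistent with the prose, not the paper's own equations, and it rests on several assumptions: the soft constraint is written s_i; W(z) is taken to be piecewise constant (0 below the soft constraint, 1 between the soft and hard constraints, and a penalty factor above the hard constraint); D_i is the integral of W from u_i to v_i divided by h_i; T_i is (s_i - u_i) / h_i; and S is the dot product of D and T divided by the length of T. The function names and the penalty value are hypothetical.

```python
import math


def weighted_change(u, v, soft, hard, penalty=2.0):
    """Integral of the assumed weight function W(z) from u to v.

    Assumed W: 0 below `soft`, 1 between `soft` and `hard`,
    `penalty` above `hard`.
    """
    def area(z):  # integral of W from minus infinity up to z
        if z <= soft:
            return 0.0
        if z <= hard:
            return z - soft
        return (hard - soft) + penalty * (z - hard)
    return area(v) - area(u)


def sensitivity(attrs, penalty=2.0):
    """Projection of the displacement D onto the target direction T.

    `attrs` holds one (u, v, soft, hard) tuple per design goal attribute;
    every hard constraint must be positive.
    """
    d = [weighted_change(u, v, s, h, penalty) / h for u, v, s, h in attrs]
    t = [(s - u) / h for u, v, s, h in attrs]
    norm = math.sqrt(sum(x * x for x in t))
    if norm == 0.0:
        return 0.0  # already at every soft constraint: nothing to project on
    return sum(di * ti for di, ti in zip(d, t)) / norm
```

Under these assumptions, a move that reduces a task delay from 12 to 8 against a hard deadline of 10 and a soft deadline of 0 yields a positive sensitivity, and the reverse move yields a negative one, in line with the sign convention stated above.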
Priority Prediction
In the computation of sensitivities, we need to estimate the delay values u_i and v_i for an attribute corresponding to a task delay. Determining the reallocation is not enough to estimate the delays; we also need to reschedule the processes on each PE. We assume the deadline of each process does not exceed the period, in which case the inverse-deadline priority assignment [7] is optimal for one processor. However, in our model the deadline is specified end-to-end for a whole task, not for individual processes. We therefore develop a heuristic that lets us use the inverse-deadline priority assignment.
We define the fractional deadline of a process, the portion of the task deadline which that process must meet, as follows. Our performance analysis algorithm for the worst-case task delay [20] calculates the latest request time and latest finish time relative to the start of a task for each process.
Assign each process a weight equal to its latest finish time minus its latest request time. For each CPU R, temporarily assign weight zero to the set of processes J_R allocated on R in the task. Then apply the longest-path algorithm backwards from the end of the task. The latest required time of each process in J_R is the hard deadline of the task minus the longest path weight of the process. The calculation of latest required times is similar to the technique used in as-late-as-possible (ALAP) scheduling in high-level synthesis [9]. The fractional deadline d_i of each process P_i in J_R is its latest required time minus its latest request time. We can then order the priorities by d_i: the shorter the fractional deadline, the higher the priority.
Example 1. For the example in Figure 4, suppose that in the current design P2 and P3 are allocated to a PE of type X, P1 and P4 are allocated to a PE of type Y, and P5 is allocated to a PE of type Z. If we move P5 from Z to X, the three processes P2, P3, and P5 will share the same PE. According to the fractional deadlines, their priorities should be ordered as P3 > P5 > P2. The task deadlines are satisfied under this schedule. Among the six possible schedules for three processes, this is the only feasible schedule for the example.
Unfortunately, when we reallocate a process, the delay information about latest initiation times and latest termination times will change. We should use the new delay information to decide the process deadlines for priority assignment, but before we assign the priorities we cannot know the new delay values. We solve this problem by using the delay information in the current solution, before the reallocation, to predict the fractional deadlines of processes and assign priorities, and then use the new schedule to compute the new delay information. The new delay information is used to decide the priority assignment in the next step. Since we only move one process at a time, the change in the delay is usually not large, making this an acceptable approximation.
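The fractional-deadline computation can be sketched as a backward longest-path pass over one task graph. This is an illustrative sketch, not the authors' code: the function names, the dictionary-based graph encoding, and the tie-breaking by process name are assumptions. Because processes on the CPU under consideration have weight zero, it makes no difference here whether the longest path weight of a process includes its own weight.

```python
def fractional_deadlines(successors, request, finish, on_cpu, hard_deadline):
    """Fractional deadline of every process in `on_cpu` (the set J_R).

    successors: dict mapping each process to the list of its successors
    request, finish: latest request / finish times from the delay analysis
    """
    # Processes on the CPU being scheduled temporarily get weight zero.
    weight = {p: 0.0 if p in on_cpu else finish[p] - request[p]
              for p in successors}
    memo = {}

    def tail(p):
        """Longest path weight from the successors of p to the task's end."""
        if p not in memo:
            memo[p] = max((weight[q] + tail(q) for q in successors[p]),
                          default=0.0)
        return memo[p]

    # Latest required time = hard deadline - longest path weight;
    # fractional deadline = latest required time - latest request time.
    return {p: (hard_deadline - tail(p)) - request[p] for p in on_cpu}


def priority_order(deadlines):
    """Shorter fractional deadline first (ties broken by process name)."""
    return sorted(deadlines, key=lambda p: (deadlines[p], p))
```

For a hypothetical diamond-shaped task a -> {b, c} -> d with a hard deadline of 10, where a and d share a CPU, d receives the shorter fractional deadline and therefore the higher priority.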
Idle-PE Elimination
In the computation of sensitivities, we need to know the total system cost. The system cost is usually the sum of the cost of each individual component. Consider the situation where two processes are on a PE R, and it is feasible to move these two processes to another PE, remove R, and reduce the cost. Since we are allowed to move only one process at a time, when the first process on R is reallocated, cost is not reduced immediately, and such a movement may induce additional delay caused by a higher load on another PE. On the other hand, if R cannot be removed in any feasible solution, it might be desirable to move some processes from a highly-utilized PE to R to increase the performance. To maximize the satisfaction of goals during both the early and late stages of optimization, we use different criteria at different stages of synthesis:
1. idle-PE-elimination: If R is the least-utilized PE, add the product of the cost and the processor utilization of R to the total system cost. When we move one of the processes from the least-utilized PE to another PE, we can immediately determine that the system cost is reduced, which makes it more likely that the first move is accepted and that the other processes are then moved away in the following steps, so that R can be removed.
2. load-balancing: After we have removed as many PEs as we can, we calculate the cost function as usual, without considering the utilization of the least-utilized PE, and concentrate on balancing PE utilization to increase the performance whenever possible.
The idle-PE-elimination criterion helps the solution jump out of local minima; the load-balancing criterion brings the solution back to a minimum if a better solution cannot be found.
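The two cost criteria can be compared with a short sketch. It takes the description literally, as an assumption: under idle-PE-elimination the cost of the least-utilized PE times its utilization is added on top of the plain sum of PE costs. Another reading, in which that product replaces the PE's full cost, would behave the same way for the purpose shown here. The function name and the (cost, utilization) pair encoding are hypothetical.

```python
def system_cost(pes, criterion):
    """Total system cost under one of the two criteria.

    `pes` is a list of (cost, utilization) pairs, one per PE.
    """
    total = sum(cost for cost, _ in pes)
    if criterion == "idle-PE-elimination" and pes:
        # Penalize the least-utilized PE in proportion to its load, so that
        # moving a process off it lowers the cost immediately.
        cost, utilization = min(pes, key=lambda pe: pe[1])
        total += cost * utilization
    return total


before = [(10, 0.8), (5, 0.4)]   # the second PE is the least utilized
after = [(10, 0.9), (5, 0.2)]    # one process moved off the second PE
print(system_cost(before, "idle-PE-elimination") >
      system_cost(after, "idle-PE-elimination"))   # True
print(system_cost(before, "load-balancing") ==
      system_cost(after, "load-balancing"))        # True
```

The load-balancing cost is unchanged by the move, which is why the first move off an almost idle PE earns no credit under that criterion alone.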
The CO-synthesis Algorithm
Our synthesis algorithm uses an optimization procedure, findbest(), which is parameterized by the optimization criterion to be used. The procedure is called twice: once with the idle-PE-elimination criterion and again with the load-balancing criterion. The findbest() procedure operates as follows:
1. Compute the sensitivity for each possible movement of a process from one PE to another.
2. Among the remaining movements or creations, choose the one with the highest sensitivity.
3. Make that movement and reschedule each PE using the priority prediction described in Section 5.
4. Repeat steps 1-3 until no movement is feasible.
Our complete synthesis algorithm consists of the following steps:
1. Create an initial solution by assigning only one process to each PE. The PE with the highest performance-to-cost ratio is chosen for each process.
2. Call findbest(idle-PE-elimination).
3. Call findbest(load-balancing).
For the initial solution, there is no previous solution providing delay information for prediction. However, since there is only one process on each CPU in our initial solution, there is no need for scheduling. All the following solutions can depend on the delay information in the previous one for scheduling. The final step in the algorithm tries to replace PEs which are poorly utilized.
We disallow a process from being moved back to the same PE, or the same type of PE, that it was moved away from in a previous iteration. Under this condition, the number of iterations in each call to findbest() is bounded by |P|^2 x |R_t|, because each process can be allocated to at most |R_t| different PE types, and we can have at most |P| PEs for each PE type. In practice, the algorithm converges on a solution in many fewer steps.
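The iterative-improvement loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the sensitivity is supplied by the caller as a callback, creation of new PEs and rescheduling are omitted, the move-back restriction is tracked per PE only (not per PE type), and the loop is assumed to stop when no remaining move has positive sensitivity.

```python
def findbest(allocation, pes, sensitivity):
    """Greedy loop applying the single-process move of highest sensitivity.

    allocation: dict mapping process -> PE (left unmodified)
    pes: list of available PEs
    sensitivity: callback sensitivity(allocation, process, target_pe)
    """
    allocation = dict(allocation)
    tabu = set()  # (process, PE) pairs the process has been moved away from
    while True:
        moves = [(p, pe) for p in sorted(allocation) for pe in pes
                 if pe != allocation[p] and (p, pe) not in tabu]
        if not moves:
            return allocation
        best = max(moves, key=lambda m: sensitivity(allocation, *m))
        if sensitivity(allocation, *best) <= 0:
            return allocation  # assumed stopping rule: no improving move
        process, target = best
        tabu.add((process, allocation[process]))
        allocation[process] = target
```

Because a process can never return to a PE it has left, every accepted move adds a new entry to the tabu set, so the loop terminates after a bounded number of iterations, mirroring the argument given above.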
Experimental Results
We implemented our algorithm in C++ and performed experiments on several examples. All experiments were performed on a Sun Sparcstation SS20. The results of all our experiments are summarized in Table 8 .
The first example, ex1, is a small example shown in Figure 8. We assume the designer wants to reduce the delay and the cost as much as possible, so the soft constraints and soft deadlines are set to zero to encourage optimization whenever it is possible. The embedded system architecture, total cost and the satisfaction of real-time deadlines after each iterative step are given in Table 8.
The second example was created by D'Ambrosio and Hu [2]. Our result is the second best shown in [2]. However, by exhaustive simulation, we found that their best design, with the minimum cost of 3.25, is actually infeasible. Our result is the best feasible design achievable. They used simulation to verify whether a system configuration is feasible, but simulation does not guarantee that the deadlines are satisfied.
Our third example is the second example of Prakash and Parker [11]; their algorithm works on a single task graph. We use their assumption about communication cost and delay, and impose a hard cost constraint of 10. Our result for the task delay is 7, which is worse than, but close to, the 6 in their result. Their integer linear programming approach is optimal but takes hours on this example, while ours takes only seconds. The results are shown in the row labeled prakash-parker-1.
We also combine Prakash and Parker's two examples together and assign the periods as well as the deadlines of 7 and 15 to the two tasks. The results are given under prakash-parker-2. This example demonstrates how our algorithm can co-synthesize from multiple disjoint task graphs.
Conclusions
Distributed computers are often the most cost-effective means of meeting the performance requirements of an embedded computing application. Embedded distributed computing is a particularly challenging design problem because the hardware and software architectures must be simultaneously optimized. We have presented a new co-synthesis algorithm for heterogeneous distributed systems of arbitrary topology. We plan to extend this work to synthesize communication, handle control flow, and consider fault tolerance. We believe that algorithms such as this are an important tool for the practicing embedded system designer.
