Introduction
In this paper we concentrate on process scheduling for systems consisting of communicating processes implemented on multiple processors and dedicated hardware components.
In such a system, in which several processes communicate with each other and share resources, scheduling has a decisive influence on the performance of the system and on the way it meets its timing constraints. Thus, process scheduling has to be performed not only for the synthesis of the final system, but also as part of the performance estimation task.
Optimal scheduling, even in simpler contexts than that presented above, has been proven to be an NP-complete problem [13]. In our approach, we assume that some processes can be activated if certain conditions, computed by previously executed processes, are fulfilled. Thus, process scheduling is further complicated since, at a given activation of the system, only a certain subset of the total set of processes is executed, and this subset differs from one activation to the other. This is an important contribution of our approach because we capture both the flow of data and that of control at the process level, which allows an accurate and direct modeling of a wide range of applications.

Performance estimation at the process level has been well studied in recent years [10, 12]. Starting from estimated execution times of single processes, performance estimation and scheduling of a system containing several processes can be performed. In [14] performance estimation is based on a preemptive scheduling strategy with static priorities using rate-monotonic analysis. In [11] scheduling and partitioning of processes, and allocation of system components, are formulated as a mixed integer linear programming problem, while the solution proposed in [8] is based on constraint logic programming. Several research groups consider hardware/software architectures consisting of a single programmable processor and an ASIC. Under these circumstances, deriving a static schedule for the software component practically means the linearization of a dataflow graph [2, 6].
Static scheduling of a set of data-dependent software processes on a multiprocessor architecture has been intensively researched [3, 7, 9]. An essential assumption in these approaches is that a (fixed or unlimited) number of identical processors are available to which processes are progressively assigned as the static schedule is elaborated. Such an assumption is not acceptable for distributed embedded systems, which are typically heterogeneous.
In our approach we consider embedded systems specified as interacting processes which have been mapped on an architecture consisting of several processors and dedicated hardware components connected by shared busses. Process interaction in our model is not only in terms of dataflow but also captures the flow of control in the form of conditional selection. Considering a non-preemptive execution environment, we statically generate a schedule table for processes and derive a worst case delay which is guaranteed under any conditions.

The paper is divided into 7 sections. In section 2 we formulate our basic assumptions and introduce the graph-based model which is used for system representation.

In [4] we presented algorithms for automatic hardware/software partitioning based on iterative improvement heuristics. The problem we are discussing in this paper concerns performance estimation of a given design alternative and scheduling of processes and communications. Thus, we assume that each process is assigned to a (programmable or hardware) processor and each communication channel which connects processes assigned to different processors is assigned to a bus. Our goal is to derive a worst case delay by which the system completes execution, so that this delay is as small as possible, and to generate the schedule which guarantees this delay.
As an abstract model for system representation we use a directed, acyclic, polar graph Γ(V, E_S, E_C). Each node P_i ∈ V represents one process. E_S and E_C are the sets of simple and conditional edges respectively. E_S ∩ E_C = ∅ and E_S ∪ E_C = E, where E is the set of all edges. An edge e_ij ∈ E from P_i to P_j indicates that the output of P_i is the input of P_j. The graph is polar, which means that there are two nodes, called source and sink, that conventionally represent the first and last process. These nodes are introduced as dummy processes so that all other nodes in the graph are successors of the source and predecessors of the sink respectively.
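The graph model above can be illustrated with a small data structure. This is only a sketch under stated assumptions: the class name `ProcessGraph`, its methods, and the six-node example graph are hypothetical and are not taken from Fig. 1.

```python
from dataclasses import dataclass, field


@dataclass
class ProcessGraph:
    """Directed polar graph with simple (E_S) and conditional (E_C) edges."""
    nodes: set = field(default_factory=set)
    simple_edges: set = field(default_factory=set)          # E_S: (src, dst)
    conditional_edges: dict = field(default_factory=dict)   # E_C: (src, dst) -> (condition, value)

    def add_simple(self, src, dst):
        # E_S and E_C must stay disjoint.
        assert (src, dst) not in self.conditional_edges
        self.nodes |= {src, dst}
        self.simple_edges.add((src, dst))

    def add_conditional(self, src, dst, condition, value):
        assert (src, dst) not in self.simple_edges
        self.nodes |= {src, dst}
        self.conditional_edges[(src, dst)] = (condition, value)

    def edges(self):
        # E = E_S ∪ E_C
        return self.simple_edges | set(self.conditional_edges)

    def reachable_from(self, start):
        """All nodes reachable from `start` along one or more edges."""
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for src, dst in self.edges():
                if src == node and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    def is_polar(self, source, sink):
        """Every other node is a successor of source and a predecessor of sink."""
        others = self.nodes - {source, sink}
        return (others <= self.reachable_from(source)
                and all(sink in self.reachable_from(n) for n in others))

    def disjunction_nodes(self):
        """Nodes with conditional edges at their output."""
        return {src for (src, _dst) in self.conditional_edges}


# Hypothetical example: P1 selects between P2 (C true) and P3 (C false).
g = ProcessGraph()
g.add_simple("P0", "P1")
g.add_conditional("P1", "P2", "C", True)
g.add_conditional("P1", "P3", "C", False)
g.add_simple("P2", "P4")
g.add_simple("P3", "P4")
g.add_simple("P4", "P5")
```

In this toy graph P0 and P5 play the role of source and sink, and P1 is the only node with conditional output edges.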
The mapping of processes to processors and busses is given by a function M: V → PE, where PE = {pe_1, pe_2, ..., pe_Npe} is the set of processing elements. For any process P_i, M(P_i) is the processing element to which P_i is assigned for execution.
Each process P_i, assigned to processor or bus M(P_i), is characterized by an execution time t_Pi. In the process graph depicted in Fig. 1, P_0 and P_32 are the source and sink nodes respectively. Nodes denoted P_1, P_2, ..., P_17 are "ordinary" processes specified by the designer. They are assigned to one of the two programmable processors pe_1 and pe_2 or to the hardware component pe_3. The rest are so-called communication processes (P_18, P_19, ..., P_31).

An edge e_ij ∈ E_C is a conditional edge (thick lines in Fig. 1) and it has an associated condition. Transmission on such an edge takes place only if the associated condition is satisfied.
We call a node with conditional edges at its output a disjunction node (and the corresponding process a disjunction process). Alternative paths starting from a disjunction node, which correspond to a certain condition, are disjoint and they meet in a so-called conjunction node (with the corresponding process called conjunction process). In Fig. 1, circles representing conjunction and disjunction nodes are depicted with thick borders. We assume that conditions are independent.
A boolean expression X_Pi, called guard, can be associated to each node P_i in the graph. It represents the necessary condition for the respective process to be activated. In Fig. 1, for example, X_P3 = true, X_P14 = D∧K, X_P17 = true, X_P5 = C. Two nodes P_i and P_j, where P_j is not a conjunction node, can be connected by an edge e_ij only if X_Pj ⇒ X_Pi (which means that X_Pi is true whenever X_Pj is true). This restriction avoids specifications in which a process is blocked because it waits for a message from a process which will not be activated. If P_j is a conjunction node, predecessor nodes P_i can be situated on alternative input paths.

According to our model, we assume that a process which is not a conjunction process can be activated only after all its inputs have arrived. A conjunction process can be activated after messages coming on one of the alternative paths have arrived. All processes issue their outputs when they terminate. If we consider the activation time of the source process as a reference, the activation time of the sink process is the delay of the system at a certain execution.
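The edge restriction X_Pj ⇒ X_Pi can be checked by enumerating all condition values. The following is a minimal sketch: the guards are the four quoted above for Fig. 1, while the helper names and the edges tested are hypothetical and do not claim to be edges of Fig. 1.

```python
from itertools import product


def implies(antecedent, consequent, conditions):
    """True if antecedent => consequent under every assignment of the conditions."""
    for values in product([False, True], repeat=len(conditions)):
        env = dict(zip(conditions, values))
        if antecedent(env) and not consequent(env):
            return False
    return True


CONDITIONS = ["C", "D", "K"]

# Guards quoted in the text; each maps an assignment of conditions to a bool.
guards = {
    "P3": lambda env: True,
    "P5": lambda env: env["C"],
    "P14": lambda env: env["D"] and env["K"],
    "P17": lambda env: True,
}


def edge_allowed(src, dst):
    """An edge src -> dst (dst not a conjunction node) requires X_dst => X_src."""
    return implies(guards[dst], guards[src], CONDITIONS)
```

For example, an edge from an always-active process to P14 passes the check, whereas an edge from P14 to an always-active, non-conjunction process does not, since the receiver could wait for a message that is never sent.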
The Schedule Table
For a given execution of the system, a subset of the processes is activated which corresponds to the actual path through the process graph. This path depends on certain conditions. For each individual path there is an optimal schedule of the processes which produces a minimal delay.
Let us consider the process graph in Fig. 1. If all three conditions, C, D, and K, are true, the optimal schedule requires P_1 to be activated at time t=0 on processor pe_1, and processor pe_2 to be kept idle until t=4, in order to activate P_3 as soon as possible (see Fig. 4a). However, if C and D are true but K is false, the optimal schedule requires starting both P_1 on pe_1 and P_11 on pe_2 at t=0; P_3 will be activated in this case at t=6, after P_11 has terminated and, thus, pe_2 becomes free (see Fig. 4b). This example reveals one of the difficulties when generating a schedule for a system like that in Fig. 1.

[Fig. 1: process graph. Process mapping: processor pe_1: P_1, P_2, P_4, P_6, P_9, P_10, P_13; processor pe_2: P_3, P_5, P_7, P_11, P_14, P_15, P_17; processor pe_3: P_8, P_12, P_16. Communications are mapped to a unique bus. The figure also lists the execution times t_Pi of the processes P_i.]
As the values of the conditions are unpredictable, the decision on which process to activate on pe_2, and at which time, has to be taken without knowing which values the conditions will later get. On the other hand, at a certain moment during execution, when the values of some conditions are already known, they have to be used in order to take the best possible decisions on when and which process to activate. An algorithm has to be developed which produces a schedule of the processes so that the worst case delay is as small as possible. The output of this algorithm is a so-called schedule table. The table contains one row for each process, and its columns are headed by logical expressions over the condition values. Activation times in a given column represent starting times of the processes when the respective expression is true. Once activated, a process executes until it completes. To produce a deterministic behavior which is correct for any combination of conditions, the table has to fulfill several requirements:
1. If for a certain process P_i, with guard X_Pi, there exists an activation time in the column headed by expression E_k, then E_k ⇒ X_Pi; this means that no process will be activated if the conditions required for its execution are not fulfilled.
2. Activation times have to be uniquely determined by the conditions. Thus, if for a certain process P_i there are several alternative activation times then, for each pair of such times (τ_Pi^Ej, τ_Pi^Ek) placed in columns headed by expressions E_j and E_k, E_j ∧ E_k = false.
3. If for a certain execution of the system the guard X_Pi becomes true, then P_i has to be activated during that execution. Thus, considering all expressions E_j corresponding to columns which contain an activation time for P_i, ∨E_j = X_Pi.
4. Activation of a process P_i at a certain time t has to depend only on condition values which are determined at the respective moment t and are known to the processing element M(P_i) which executes P_i.
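Requirements 1 to 3 can be checked for one row of the table by brute-force enumeration of the condition values. The sketch below assumes that every column expression is a conjunction of condition literals, stored as a dictionary; the function names and the activation times are hypothetical. Requirement 4 depends on timing and on the communication of conditions and is not covered.

```python
from itertools import product


def holds(expr, env):
    """expr is a conjunction of literals given as {condition: required_value}."""
    return all(env[cond] == value for cond, value in expr.items())


def violated_requirements(guard, row, conditions):
    """Return the numbers (1, 2, 3) of the requirements violated by one table row.

    guard: callable mapping an assignment of conditions to a bool (X_Pi)
    row:   list of (column_expression, activation_time) pairs for process P_i
    """
    violated = set()
    for values in product([False, True], repeat=len(conditions)):
        env = dict(zip(conditions, values))
        active = [time for expr, time in row if holds(expr, env)]
        if active and not guard(env):
            violated.add(1)   # some E_k holds although X_Pi is false
        if len(active) > 1:
            violated.add(2)   # two column expressions hold at the same time
        if guard(env) and not active:
            violated.add(3)   # X_Pi holds but no column provides an activation time
    return sorted(violated)


CONDITIONS = ["C", "D", "K"]


def guard_p14(env):
    # X_P14 = D ∧ K, as quoted in the text.
    return env["D"] and env["K"]
```

A row with a single column headed by D∧K satisfies all three requirements for this guard, while a single column headed by D alone would activate the process even when K is false.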
The value of a condition is determined at the moment τ at which the corresponding disjunction process terminates.
Thus, at any moment t, t≥τ, the condition is available for scheduling decisions on the processor which has executed the disjunction process. However, in order to be available on any other processor, the value has to arrive at that processor. The scheduling algorithm has to consider both the time and the resource needed for this communication.
The following strategy has been adopted for scheduling the communication of conditions: after termination of a disjunction process, the value of the condition is broadcast from the corresponding processor to all other processors; this broadcast is scheduled as soon as possible on the first bus which becomes available after termination of the disjunction process. For this task only busses are considered to which all processors are connected, and we assume that at least one such bus exists.

[Table 1: schedule table for the process graph in Fig. 1.]

We present algorithms for scheduling of the individual paths in [5]. In this paper we concentrate on the generation mechanism of the global schedule table.
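The broadcast strategy for condition values can be sketched as follows. This is a simplified illustration, not the scheduling algorithm of [5]: it assumes that each eligible bus is described only by the time from which it is free, and the function names, bus names, and times are hypothetical.

```python
def schedule_broadcast(termination_time, bus_free_at, duration):
    """Place a condition broadcast on the bus where it can start earliest.

    termination_time: time at which the disjunction process terminates
    bus_free_at:      {bus: time from which the bus is free}; only busses
                      connected to all processors should be listed here
    duration:         time needed to broadcast the condition value
    Returns (bus, start, end) and marks the bus as busy until `end`.
    """
    def earliest_start(bus):
        return max(bus_free_at[bus], termination_time)

    # Ties are broken by bus name so that the result is deterministic.
    bus = min(bus_free_at, key=lambda b: (earliest_start(b), b))
    start = earliest_start(bus)
    end = start + duration
    bus_free_at[bus] = end
    return bus, start, end


def condition_known_at(processor, disjunction_processor, termination_time, broadcast_end):
    """Time from which the condition value can be used on `processor`."""
    if processor == disjunction_processor:
        return termination_time
    return broadcast_end
```

The second helper reflects the distinction made in the text: the processor that executed the disjunction process knows the value at termination, every other processor only after the broadcast has arrived.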
The Table Generation Algorithm
The schedule merging algorithm is guided by the length of the schedules produced for each alternative path. While progressively constructing the schedule table, at each moment priority is given to the requirements of the schedule corresponding to that path, among those which are still reachable, that produces the largest delay. Thus, we induce perturbations into the short delay paths and let the long ones proceed as similarly as possible to their (near-)optimal schedule.
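The priority rule can be illustrated by computing the order in which alternative paths are handled during a depth-first exploration. The sketch assumes that every path is labelled by a value for each condition, in a fixed order, which is a simplification; the function name and the delays are hypothetical, and only the visiting order is computed, not the table itself.

```python
def visit_order(paths, n_conditions, prefix=()):
    """Order in which alternative paths are handled during depth-first merging.

    paths: {label: delay}, where label is a tuple of booleans, one per condition
    At every node of the decision tree, the branch that leads to the reachable
    path with the largest delay is explored first.
    """
    depth = len(prefix)
    reachable = {p: d for p, d in paths.items() if p[:depth] == prefix}
    if not reachable:
        return []
    if depth == n_conditions:
        return list(reachable)
    longest = max(reachable, key=reachable.get)
    first_branch = longest[depth]
    order = []
    for value in (first_branch, not first_branch):
        order += visit_order(paths, n_conditions, prefix + (value,))
    return order


T, F = True, False
# Hypothetical delays for the eight paths over conditions (D, C, K).
delays = {
    (T, T, T): 40, (T, T, F): 35, (T, F, T): 30, (T, F, F): 38,
    (F, T, T): 20, (F, T, F): 25, (F, F, T): 22, (F, F, F): 21,
}
order = visit_order(delays, 3)
```

With these numbers the exploration starts with the longest path (D, C, K all true), and after each back-step continues with the longest path still reachable from the current node.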
1. Schedule Merging
The generation algorithm of the schedule table proceeds along a binary decision tree corresponding to all alternative paths, which is explored in a depth-first order. Fig. 2 shows the decision tree explored during generation of the schedule table. At the beginning, start times of processes are placed into Table 1 according to the schedule which corresponds to the path labelled D∧C∧K. After the first back-step, to node K (Fig. 2), the schedule corresponding to path D∧C∧¬K becomes the current one. New start times will be fixed into the schedule table according to an adjusted version of this schedule. The next back-step is to node C. Two schedules are now reachable taking the branch ¬C, which are labelled D∧¬C∧K and D∧¬C∧¬K respectively. The one of these which produces the larger delay will be selected first as the current schedule. It will be followed until the next back-step has been performed.

Footnote 1: For this formula to be rigorously correct, δ_M has to be the maximum of the optimal delays for each subgraph.
The algorithm for generation of the schedule table is briefly described, as a recursive procedure, in Fig. 3. An essential aspect of this algorithm is that, after each back-step, a new schedule has to be selected as the current one. The selection rule gives priority to the path with the largest delay among those which are reachable from the current node in the decision tree.

[Table: length of the optimal schedule for the alternative paths through the graph.]

In Fig. 4 we show the adjustment of the schedule labelled D∧C∧¬K performed after the back-step to node K. At this moment start times of processes P_1, P_2, P_11, P_3, P_12, P_18, …

2. Conflict Handling at Table Generation

Suppose we are currently handling a path labelled L_k. According to the adjusted schedule of this path we place an activation time τ_Pi^Lk of process P_i into the table, so that the respective column is headed by an expression E. The problem is how to preserve the coherence of the table in the sense introduced by requirement 2 defined in section 3. If there is no activation time previously introduced in the row corresponding to P_i, no conflicts can be generated. If, however, the respective row contains activation times, there is a potential conflict between the column headed by E and columns which already include activation times of P_i. Let us consider that such a column is headed by an expression F. According to requirement 2, we have a conflict between columns E and F if there exists no condition C so that E = q∧C and F = q'∧¬C. Intuitively, such a conflict means that for two or more paths the same process P_i is scheduled at different times, but the conditions known on the processing element M(P_i) do not allow the scheduler to identify the current path and to take a deterministic decision on the activation of P_i.

If placement of an activation time for process P_i in a column headed by expression E produces a conflict, the current schedule has to be readjusted so that an expression E' will head the column that hosts the new activation time of P_i and no conflict is induced in the schedule table. As shown in the algorithm presented in Fig. 3, in this case the process will be moved to a new activation time and the schedule is readjusted by changing the start time of some unlocked processes (similar to the operation performed at the initial adjustment). The main problem which has to be solved is to find the new activation time for P_i so that conflicts are avoided. In [5] we demonstrated two theorems in this context. Based on them, a loop over the set W can produce the new activation time of a process P_i so that all conflicts are avoided.
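The conflict test between two column headers can be sketched as follows, again assuming that every header is a conjunction of condition literals stored as a dictionary. The function names, rows, and times are hypothetical; the sketch only detects conflicts and does not implement the search for a new activation time described in [5].

```python
def mutually_exclusive(e, f):
    """True if some condition appears with opposite values in e and f.

    e, f: conjunctions of literals given as {condition: required_value};
    the empty dictionary stands for the expression `true`.
    """
    return any(cond in f and f[cond] != value for cond, value in e.items())


def in_conflict(e, f):
    """Columns E and F conflict when no condition separates them (requirement 2)."""
    return not mutually_exclusive(e, f)


def conflicting_columns(row, new_expr):
    """Headers of existing columns in `row` that conflict with `new_expr`.

    row: list of (column_expression, activation_time) pairs of one process
    """
    return [expr for expr, _time in row if in_conflict(expr, new_expr)]


# Hypothetical row of one process with two activation times already placed.
row = [({"D": True, "C": True, "K": True}, 10), ({"D": False}, 3)]
```

Placing a time under D∧C∧¬K causes no conflict with this row, because K separates it from the first column and D from the second. Placing a time under D∧C would conflict with the first column, since the scheduler could not tell the two cases apart.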
Experimental Evaluation
The strategy we have presented for generation of the schedule table guarantees that the path corresponding to the largest delay, δ_M, will be executed in exactly δ_M time. This, however, does not mean that the worst case delay δ_max, corresponding to the generated global schedule, is always guaranteed to be δ_M. Such a delay cannot be guaranteed in theory. According to our scheduling strategy, δ_max will be worse than δ_M if the schedule corresponding to an initially faster path is disturbed at adjustment or conflict handling so that its delay becomes larger than δ_M.
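The relation between δ_M and δ_max can be made concrete with a small calculation. The helper below and the path delays are purely illustrative assumptions, not measured results.

```python
def worst_case_increase(optimal_delays, merged_delays):
    """Compare delta_M with delta_max.

    optimal_delays: {path: delay of the individually scheduled path}
    merged_delays:  {path: delay of the same path in the global schedule table}
    Returns (delta_M, delta_max, increase of delta_max over delta_M in percent).
    """
    delta_m = max(optimal_delays.values())
    delta_max = max(merged_delays.values())
    increase = 100.0 * (delta_max - delta_m) / delta_m
    return delta_m, delta_max, increase


# Hypothetical case: the longest path keeps its delay of 40, but the initially
# faster path "b" is disturbed during merging and ends up at 42.
result = worst_case_increase({"a": 40, "b": 35}, {"a": 40, "b": 42})
```

Here δ_M = 40 is preserved for the longest path, yet δ_max = 42 because the disturbed shorter path now exceeds it.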
For evaluation of the schedule merging algorithm we used 1080 conditional process graphs generated for experimental purposes. 360 graphs have been generated for each dimension of 60, 80, and 120 nodes. The number of alternative paths through the graphs is 10, 12, 18, 24, or 32. Execution times were assigned randomly using both uniform and exponential distributions. We considered architectures consisting of one ASIC, one to eleven processors, and one to eight busses [5]. Experiments were run on a SPARCstation 20.

Finally, we present a real-life example which implements the operation and maintenance (OAM) functions corresponding to the F4 level of the ATM protocol layer [1]. Fig. 7a shows an abstract model of the ATM switch. Through the switching network, cells are routed between the n input and q output lines. In addition, the ATM switch also performs several OAM related tasks.
In [4] we discussed hardware/software partitioning of the OAM functions corresponding to the F4 level. We concluded that filtering of the input cells and redirecting of the OAM cells towards the OAM block have to be performed in hardware as part of the line interfaces (LI). The other functions are performed by the OAM block and can be implemented in software.
We have identified three independent modes in the functionality of the OAM block. Depending on the content of the input buffers (Fig. 7b), the OAM block executes in one of these modes, according to the statically generated schedule table for the respective mode.
We specified the functionality corresponding to each mode as a set of interacting VHDL processes. The worst case delays obtained for the considered architectures are given in Table 2. As expected, using a faster processor reduces the delay in each of the three modes. Introducing an additional processor, however, has no effect on the execution delay in mode 2, which does not present any potential parallelism. In mode 3 the delay is reduced by using two 486 processors instead of one. For the Pentium processor, however, the worst case delay cannot be improved by introducing an additional processor. Using two processors will always improve the worst case delay in mode 1. As for the additional memory module, only in mode 1 does the model contain memory accesses which are potentially executed in parallel. Table 2 shows that only for the architecture consisting of two Pentium processors does providing an additional memory module pay off through a reduction of the worst case delay in mode 1.
Conclusions
We have presented an approach to process scheduling for the synthesis of embedded systems. The approach is based on an abstract graph representation which captures, at process level, both dataflow and the flow of control.
