Introduction
Process scheduling is a well-known problem in embedded system design. In current approaches, the hardware-software architecture is largely known at compile time. With the advent of hardware-software co-synthesis, hardware-software partitioning can be moved to a very late design stage (late binding), because changing system descriptions and generating and modifying hardware-software architectures have become much easier. This is similar to the way the automation of physical layout has reduced the cost of netlist changes and logic synthesis has simplified RT-level modifications. As a consequence, process scheduling could move ahead of hardware architecture definition. We first explain why process scheduling before hardware definition is useful, then show that this approach leads to a new kind of scheduling problem, and finally give a scheduling approach and first results.
Many of the more complex embedded system applications consist of a mixture of processes with very different requirements. In the example of an automotive motor management system, ignition control requires a simple process which must be repeated every 10 µs, while fuel injection and emission control are computation intensive processes with iteration rates in the range of many milliseconds. This is why we often find a mixture of hardwired logic, ASPPs (application specific programmable processors) and general purpose processor cores. Today, this architecture is typically defined before the system is implemented on it, in most cases using preemptive scheduling. Global system optimization is difficult in this context, and whatever changes occur during system design will hardly be able to influence the hardware architecture.
Co-synthesis systems such as COSYMA [4] start with a unified process system description and then derive a hardware architecture. In its current state, COSYMA takes a single process in a superset of C, C^x, together with timing constraints on this C^x process. COSYMA first tries to implement the process on a given processor. If the timing constraints are not met, COSYMA automatically creates a coprocessor for process acceleration. Co-synthesis can also be controlled by a required speedup factor for a given processor. Processor-coprocessor architectures are popular in embedded system design, as in the motor management example [9]. If we were now able to map the set of processes of an embedded system to a single process, as in fig. 1, then we could use the co-synthesis system to create a new processor-coprocessor architecture, independent of the original process structure. This allows global system optimization. To give an example, it might be more efficient to speed up one or more small processes with a high iteration rate or, instead, to speed up one or more computation intensive processes with loops.
Partitioning would also share the coprocessor among several processes, so that the optimized hardware architecture might differ from the original process structure. This is a difficult and time-consuming task if done manually.
In section 2 we explain the requirements on scheduling which result from this approach, in section 3 we inspect related work, section 4 describes our approach, in section 5 we give results obtained with a complete system example, and finally we draw some conclusions.
Scheduling requirements
The major difference of this scheduling problem, as compared to other process scheduling problems, is that the scheduler works on an architecture with scalable performance. Given a process system with time constraints, the scheduler cannot simply anticipate the execution time of the individual processes or process segments, because acceleration is determined by the co-synthesis system. Co-synthesis can, however, regard I/O time constraints.
To alleviate the scheduling problem, we assume that hard I/O-timing constraints are buffered by a peripheral device and that processes with cycle times of less than a microsecond are moved to hardwired logic upfront. In a manual design, such decisions would be reasonable, as well.
The scalable performance is defined relative to the performance of a given processor, which is used as the core processor in the hardware-software system. To express the required performance, a performance scaling factor is defined,

$$S = \frac{\text{required system performance}}{\text{given processor performance}},$$

such that $S$ can be used as the speedup factor for co-synthesis (fig. 2). Communication with buffering increases the solution space of scheduling and co-synthesis and is therefore included. Buffering does not change rate constraints and must regard I/O constraints, but it allows process initiation times as well as communication among processes to be rescheduled. We define: An I/O constraint (i.e. an event relating to I/O communication) is scalable when buffering is permitted for at least one scheduling period. All other I/O constraints are unscalable. Inter-process communication constraints are always scalable.
$S$ is a global speedup factor; it does not require that COSYMA is able to speed up the execution of each individual process. The partitioning is performed on the serialized process, P, enabling a true global optimization that is not restricted by the boundaries of the original processes. Assuming that the communication between the processes can be buffered for at most one macro scheduling period, it does not matter in which part of the code the speedup is reached. In fig. 3, six processes with two different rates are serialized into one single macro process. After that, COSYMA would choose for hardware implementation a part of the macro process that offers the greatest capacity for speedup. In fig. 3, we assume that this is a part of P4. As a result, the execution time is shortened to $t_b$. This way, the average iteration rate of each individual process has been brought to the required rate, but the execution within a macro period has not, which is why buffering is now necessary.
Related work
Scheduling has been an important area of research in real-time computer systems and in signal processing. Many of the publications focus on the scheduling of periodic processes. Liu and Layland [8] have shown that Rate Monotonic Scheduling (RMS) produces an optimal schedule, but it does not consider process communication. Therefore, several extensions have been suggested [13], [10], [11]. All these approaches are fixed priority assignment algorithms, meaning that the order of the processes is determined at run time depending on priorities which are assigned a priori.
In order to serialize a system of communicating processes as a single process, we need a static order of the processes which is decided at design time. A (mostly) static scheduling approach is given by [7] and [12] .
Chou and Boriello present in [1] a static, nonpreemptive scheduling algorithm for reactive real-time systems, which is implemented as a part of the Chinook framework. By serializing a process dependency graph, both interprocess and intraprocess constraints can be regarded. In the Chinook framework, partitioning and scheduling form an iterative process starting with partitioning [2]. This approach is especially efficient for interface synthesis, but less suited to data dominated parts with (possibly data dependent) loops, because it requires loop unrolling. The same holds for Gupta's approach [6]. Neither approach considers communication buffering.
All these scheduling approaches depend on a fixed target architecture with known performance. In this paper we present a completely different solution, which allows a serialization of parallel processes with only partial knowledge of the final target architecture.
Scalable performance scheduling
For scheduling we define different classes of processes [3]. The major differences are the constraint types and the timing requirements. The scheduling of the process classes is separate and hierarchical. We omit details here, in order to focus on the most challenging class of periodic tasks with a wide variation of process execution times and iteration rates.
At the beginning of the scheduling process (fig. 4), a process dependency graph (PDG) is derived from the C^x description. The processes are split into basic blocks in order to obtain short time segments, which allows processes with short cycle times in the order of a few microseconds to be scheduled. Each basic block also ends at a blocking communication or at a label which corresponds to an intertask constraint. Each node in the PDG represents a code segment and is annotated with the index of the code segment as well as its execution time on the target processor, which is estimated by prescheduling and simulation.
In a prescheduling step, the processes are first serialized such that the order of the processes does not violate the data and communication dependencies, but without regarding the external time or rate constraints. When executed with external stimuli data on a single target processor simulator, the ordered segments show the correct function and timing. The prescheduling step thus saves an extra simulator. The sequential process provided by prescheduling is functionally correct, because all dependencies were considered, but the timing constraints are not regarded. Therefore, a different scheduling is now required.
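The serialization performed in prescheduling can be illustrated with a minimal sketch. It is not the adapted algorithm of [1]; it uses a plain topological sort (Kahn's algorithm) as a stand-in, and the function name `preschedule` as well as the segment names are hypothetical. It only shows the property stated above: the serial order respects all data and communication dependencies and ignores time and rate constraints.

```python
from collections import deque


def preschedule(nodes, edges):
    """Return one serial order of code segments that respects all dependencies.

    nodes: iterable of segment identifiers (hypothetical names)
    edges: iterable of (before, after) dependency pairs
    Timing and rate constraints are deliberately ignored here.
    """
    nodes = list(nodes)
    successors = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for before, after in edges:
        successors[before].append(after)
        indegree[after] += 1

    # Segments without unresolved predecessors may be emitted next.
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for succ in successors[node]:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                ready.append(succ)

    if len(order) != len(nodes):
        raise ValueError("dependency cycle: no serial order exists")
    return order


if __name__ == "__main__":
    # Assumed toy dependency graph, not taken from the paper.
    segments = ["decode", "receive", "speedometer", "regulate", "motor"]
    deps = [("decode", "receive"), ("receive", "regulate"),
            ("speedometer", "regulate"), ("regulate", "motor")]
    print(preschedule(segments, deps))
```

Any order produced this way is functionally correct in the sense of the text, but it says nothing about whether deadlines are met, which is why a second, timing-aware scheduling pass is needed.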
For both the prescheduling and the final scheduling, we adapt the algorithm of Chou [1], which is based on a traversal search through a graph. The algorithm originally produces one valid schedule if possible, but this is not necessarily the optimal one. A suboptimal schedule results in a worse processor utilization caused by idle times. This is not a problem in the case of a fixed target architecture, because there processor utilization is not the primary concern. Since the target architecture generated by COSYMA depends on the scheduling, a bad processor utilization results in an over-dimensioned target architecture. That is the reason why a schedule close to the optimum is required here.
Neglecting the dependencies of the processes, the "non-scaled" utilization $U_{ns}$ is the sum of the loads of each process on the processor, where the load of a process is the ratio of its execution time to its cycle time. The utilization is a measure of the ratio of the busy times to the idle times of a processor. The best of all valid schedules is the one whose utilization is closest to 1. A valid schedule can only be found if this value is smaller than or equal to one. Liu and Layland proved that a set of $n$ processes is always schedulable for $U \le U(n) = n(2^{1/n} - 1)$. For the interval between $U(n)$ and 1, they also gave criteria which guarantee a valid schedule [8].
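As a small illustration of the two quantities in this paragraph, the following sketch computes the non-scaled utilization as a sum of per-process loads and evaluates the Liu and Layland bound $n(2^{1/n} - 1)$. The function names and the numeric values in the usage part are assumptions made for the example, and the load of a process is taken to be its execution time divided by its cycle time.

```python
def utilization(processes):
    """Sum of loads; each process is an (execution_time, cycle_time) pair."""
    return sum(execution / cycle for execution, cycle in processes)


def liu_layland_bound(n):
    """Utilization bound n * (2**(1/n) - 1) for n periodic processes [8]."""
    return n * (2 ** (1 / n) - 1)


def passes_sufficient_test(processes):
    """True if the utilization does not exceed the Liu and Layland bound.

    This is only a sufficient condition: a process set that fails it
    may still be schedulable.
    """
    return utilization(processes) <= liu_layland_bound(len(processes))


if __name__ == "__main__":
    # Assumed example values (same time unit for both entries of a pair).
    example = [(1.0, 4.0), (2.0, 8.0)]
    print(utilization(example), liu_layland_bound(len(example)))
    print(passes_sufficient_test(example))
```

A utilization above 1, as in the co-synthesis setting discussed next, can never pass this test on the unscaled processor, which is what motivates the scaling factor.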
In a hardware-software cosynthesis environment, a typical application cannot be implemented on a single core processor: the utilization of the processor becomes greater than 1. Therefore, we introduced the scaling factor $S$ in order to bring the processor utilization into the valid range. The execution time of each process is normalized by $S$. In addition to the processor utilization, $S$ depends on the task dependencies, the constraints, the communication and the overhead for context switching. We merge all these components into a factor $a_{dep}$, by which the utilization is multiplied to determine the scaling factor.
Because $a_{dep}$ depends on the number of context switches and the order of the processes, which are unknown before scheduling, it cannot easily be estimated but is determined heuristically. Scheduling iterates several times and adapts $a_{dep}$ and $S$ after each iteration. For the adaptation we use a simple binary search, starting with a heuristic value of 5. If this is successful, the interval between one and five is traversed, otherwise the one between five and ten. By choosing the width between the first invalid and the last valid schedule, the quality of the solution can be controlled.
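The iteration described above can be sketched as follows. This is a simplified illustration, not the scheduler itself: `is_valid` is a hypothetical callback standing for one complete scheduling run with a given factor, and the sketch assumes that validity is monotonic, i.e. every factor above the smallest valid one also yields a valid schedule. The start value 5, the bounds 1 and 10 and the interval width follow the description in the text.

```python
def search_a_dep(is_valid, start=5.0, low=1.0, high=10.0, min_width=0.5):
    """Bisect for a small factor for which a valid schedule exists.

    is_valid:  hypothetical callback, True if scheduling succeeds for a factor
    min_width: stop once the remaining interval is narrower than this
    Returns the smallest factor found to be valid, or None if none was found.
    """
    best = None
    if is_valid(start):
        # Success at the start value: look for smaller factors in [low, start].
        best, lo, hi = start, low, start
    else:
        # Failure at the start value: look for larger factors in [start, high].
        lo, hi = start, high

    while hi - lo >= min_width:
        mid = (lo + hi) / 2
        if is_valid(mid):
            best, hi = mid, mid   # valid: try to shrink the factor further
        else:
            lo = mid              # invalid: the factor must be larger
    return best


if __name__ == "__main__":
    # Assumed stand-in for the scheduler: valid from a threshold upwards.
    print(search_a_dep(lambda a: a >= 2.25))
```

A smaller `min_width` gives a result closer to the optimum at the cost of more scheduling runs, which is the quality control mentioned in the text.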
As a result of the scheduling, we get a serialized PDG and an $S$ which is close to the optimum. The cosynthesis system now tries to reach the required speedup $S$ by generating an application specific coprocessor.
An example
[...] actual velocity from a speedometer. Finally, a motor controller generates pulses for the motor electronics. The train control serves as a demonstrator for a student VHDL course. The current prototype consists of a Xilinx XC4010 and an MC68HC11 microcontroller. The bit decoder and the receiver, including the error correction, are mapped onto the FPGA, whereas the regulator, the speedometer and the motor control are implemented on the microcontroller. With the scheduling experiment, we want to evaluate whether the system could be implemented on a 33 MHz SPARC processor with little additional hardware.
The C^x description consists of five different processes. The edges between the processes represent communication. The dotted line describes a nonblocking communication, whereas the others describe blocking communication. The first process is the decoder, which has a cycle time of 3.2 µs and an execution time of 4.2 µs. This process alone could not be implemented on a SPARC; a speedup of 1.3 would be required. Without communication, a speedup of 2.09 would be necessary in order to run the example on a SPARC processor.
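The speedup quoted for the decoder alone can be checked by hand, assuming that the load of a single process is its execution time divided by its cycle time:

$$\frac{4.2\,\mu\text{s}}{3.2\,\mu\text{s}} = 1.3125 \approx 1.3$$

Since this load exceeds 1, the decoder cannot keep up on the unaccelerated processor, in agreement with the statement above.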
In order to consider the communication, the utilization is multiplied by the heuristic factor $a_{dep}$. This factor is adapted in an iterative process starting with a heuristic value of 9.0. Fig. 6 shows the Gantt diagrams of three valid schedules found during the iterative process. A value of 9.0 (fig. 6a) results in a valid schedule in which the decoder process is activated every 3.2 µs. All the other processes are active only at the beginning of the period. In the interval from 6 µs until 52 µs, only the decoder task is active. This leads to a scaled processor utilization of $U_s$ = 11.83 % (CPUused in fig. 6).
In fig. 6b, $a_{dep}$ is reduced to 4.5. This adaptation influences the order of the processes: the regulator activation is moved behind the second run of the decoder. The processor utilization increases to 23.6 %. As a result, the schedule becomes more compact.
One more valid schedule is found for $a_{dep}$ = 2.25. After the next iteration, $a_{dep}$ is set to 1.25, which does not result in a valid schedule. Neither for $a_{dep}$ = 1.69 nor for $a_{dep}$ = 1.97 is a valid schedule found. Since the difference between the $a_{dep}$ of the last valid schedule (2.25) and the current $a_{dep}$ is less than the default minimal interval width (0.5), the scheduler stops, returning the best $a_{dep}$ = 2.25.
As a consequence, a speedup of 4.8 is required in order to implement the design on the SPARC. Most of this speedup is required to implement the decoder, as seen in the Gantt diagram. The PDG of the train control example consists of about 770 nodes. The whole scheduling algorithm takes only a few minutes on a SPARC 10 if the minimal interval width of the binary search is set to 0.5.
In the following, we assume that the rate of the bit decoder ranges from 312 kHz to 1.25 MHz, depending on the transfer rate between the PC and the train. Fig. 7 shows the trials of the binary search to find a valid schedule close to the optimum for three different rates. Each of them converges to the optimum quite quickly. For this example, the binary search terminates whenever the distance of the factor $a_{dep}$ between a valid and an invalid schedule is lower than 0.5.
Conclusion
We presented a scalable performance scheduling algorithm for small embedded systems. The scheduling goal is a minimum performance requirement for a given set of communicating processes with rate constraints. All periodic processes are mapped to a single process, which is then accelerated to the required performance using hardware-software cosynthesis. This allows global optimization, because cosynthesis is not limited to the original process structure.
Unlike earlier process scheduling problems, hardware performance is not fixed at scheduling time. Hardware performance, however, influences process ordering in the schedule and, on the other hand, process ordering decides on the required performance. Iteration with binary search is used to adapt both of these parameters. Based on a relatively fine grain process segmentation, scheduling can efficiently handle processes with very different execution times and iteration rates. The results with an example show high processor utilization and acceptable computation times.
Scalable performance scheduling can be used for other problems, as well. Instead of controlling cosynthesis, the performance scaling factor could be used for core processor selection, if the relative performance of other processors is known. This, of course, is not useful for arbitrarily different processor architectures where scaling is process dependent. A designer could also use the Gantt diagram to identify inefficiencies when serializing the process system to a valid schedule. Identified inefficiencies can result from the definition of timing requirements, rates, process communication and required context switching.
The hardware architecture is still rather simple. As future work, we will extend the scheduling approach and cosynthesis to small, heterogeneous multiprocessors.
