This paper presents a novel partial assignment technique (PAT) that decides which tasks should be assigned to the same resource without explicitly assigning these tasks to a particular resource. Our method simplifies the assignment and scheduling steps while imposing a small or no penalty on the final solution quality. This technique is especially suited for problems which have different resource constraints. Our method does not cluster tasks into a new task, as typical clustering techniques do, but specifies which tasks need to be executed on the same processor. Our experiments have shown that PAT, which may produce nonlinear groups of tasks, gives better results than linear clustering when multi-resource constraints are present. Linear clustering was proved to be optimal compared with all other clusterings for problems with timing constraints only. In this paper, we show that linear clustering, if used for the multi-resource synthesis problem, as it often is nowadays, will produce inferior solutions.
INTRODUCTION
System synthesis is a design step which maps an initial specification onto a given architecture and decides its schedule. This can be done using task assignment and scheduling. During this step it is important to have an accurate model with a good abstraction level. It is often the case that an application is modeled using a coarse-grain model, which results in limited optimization possibilities. On the other hand, a fine-grain model can result in prohibitively long run-times, where much time is spent analyzing parts of the design search space which can give no or insignificant improvements to the final solution. Therefore, the designer should be able to specify an application at a proper level, and design space reduction methods should be used only when the run-times of the synthesis tools are of concern.
Figure 1: A task graph for the motivational example
In this paper, we present a novel method called partial assignment technique (PAT). PAT is able to improve the efficiency of assignment and scheduling as well as the quality of the final results. This is possible because the complexity of assignment and scheduling problems requires the use of heuristics, whose decisions are simplified when the search space is reduced.
The complexity of task assignment and scheduling for a heterogeneous architecture increases significantly when memory constraints are considered. These constraints define memory requirements for tasks and communications and therefore influence task assignment and scheduling decisions. The data memory aspect in embedded systems is especially important for signal and image processing applications, which deal with enormous amounts of data. New synthesis methods are required to cope with these constraints efficiently. All techniques for reducing the search space without losing (near-)optimal solutions should be explored, and our partial assignment technique is one of them.
The aim of this work is to develop efficient techniques for an embedded system synthesis tool that accepts a system architecture description and an application specification given as a task graph, and produces constraints which reduce the complexity of the assignment and scheduling. We assume that an application, specified in a C-like language, is compiled into an acyclic task graph annotated with estimates of execution time and of code and data memory requirements. PAT uses this task graph, generated from the original specification, as input. Therefore, the fine-grain task graph can be used, which gives full optimization potential but possibly long run-times of the synthesis tools.
The rest of this paper is organized as follows. Section 2 motivates our work through an example. In section 3, we outline related work in this area. Section 4 briefly presents our synthesis system MATAS. Section 5 describes our partial assignment technique while section 6 presents experimental results. Finally, section 7 concludes the paper.
Figure 2: Target architecture
MOTIVATIONAL EXAMPLE
Consider the task graph depicted in Figure 1. This task graph consists of four tasks, T1, T2, T3, T4, depicted as ellipses, and three data transfers between these tasks, C1, C2, C3, depicted as boxes. The code memory requirement for processor P1 (P2) for each task is represented as the first (second) number in parentheses next to each task. The data memory is used to store local data as well as input/output data (transmission buffers). Each task needs 4 KB of data memory during its execution. The data transfers between tasks send 2 KB of data. Each processor is equipped with 3 KB of code memory and 8 KB of data memory.
The goal of system synthesis is to assign tasks to processors and to schedule all tasks and communications. The resulting assignment and schedule should have a minimal deadline and fulfill the code and data memory constraints. The problem can be solved using constraint programming over finite domains [6]. In this constraint programming approach we first specify decision variables and constraints over these variables, and then solve the constraint problem by assigning values to the decision variables. In the process of variable assignment the constraints propagate changes and reduce the search space. Different assignment methods can be used. In this paper, we use our own heuristic since branch-and-bound methods can be used only for relatively small problems.
In our example, each task $T_i$ has two decision variables, namely $S_{T_i}$ and $P_{T_i}$, which denote the task start time and the processor assigned to the task. All these decision variables are represented by finite domain variables (FDVs). In our problem, the decision variables $S_{T_i}$ have a finite domain and can take a value in the range from 0 ms to 10 ms. The processor assignment variables $P_{T_i}$ have the value 1 or 2, which denotes the number of the processor executing the given task. In addition to the task-related decision variables, there are decision variables for communications. Each communication $C_i$ has its start time and duration, denoted by $S_{C_i}$ and $D_{C_i}$.
When all decision variables and their domains are specified, we need to state all constraints which must be satisfied in the final solution. In our example, we have typical precedence constraints between tasks and communications as well as resource constraints for processors and buses. We also have code memory constraints which ensure that there is enough memory on each processor for the assigned tasks. Finally, data memory constraints ensure that enough data memory is available for communication buffers and for local data of tasks [8]. For example, precedence constraints are defined as inequalities; for task T1 and communication C1 the constraint is $S_{T_1} + D_{T_1} \le S_{C_1}$, where $D_{T_1}$ is the execution time of T1, meaning that communication C1 cannot start before task T1 finishes its execution.
The specification of these constraints already removes some values from the domains of the FDVs. However, most FDVs will still have more than one possible value. During the search procedure, different values of the FDVs are evaluated until a valid assignment of values to all FDVs is found. Our technique reduces the number of possible values of the FDVs. It therefore speeds up the search procedure, which is the most time-consuming step.
Consider the target architecture depicted in Figure 2. It consists of two processors connected by a bus. There are a few valid task assignments and schedules for the task graph depicted in Figure 1. An optimal schedule of this task graph is presented in Figure 3. This schedule requires only communication C2 to be scheduled on the bus and therefore reduces bus utilization and memory requirements. Tasks T1 and T4, as well as T2 and T3, communicate using their local memories, and additional memory buffers for communication need to be assigned only for communication C2, between tasks T2 and T4. Additional constraints identified by PAT, $P_{T_1} = P_{T_4} \land P_{T_2} = P_{T_3}$, are added to enforce that these tasks are executed on the same processor, and the durations of the communications between them are zero, $D_{C_{T_1,T_4}} = 0 \land D_{C_{T_2,T_3}} = 0$. These constraints facilitate the achievement of good quality schedules under all resource constraints. They do not define the final assignment but state that the selected tasks need to be executed on the same processor.
Consider the chain of assignments of values to FDVs which needs to be made to arrive at the final solution, as presented in Figure 4. Each of these assignments can be illustrated as a decision node (e.g., in a branch-and-bound algorithm) annotated with the number of possible decision branches. Our PAT reduces the search space by reducing the number of these decisions. This is depicted, for our example, by the smaller numbers in parentheses next to the $P_{T_3}$ and $P_{T_4}$ nodes. These two nodes represent the decisions on which processor tasks T3 and T4 are executed. In addition, since two communications are done locally, the number of possible decisions at the $S_{C_1}$ and $S_{C_3}$ nodes is also reduced.
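The effect of the same-processor constraints on the processor assignment decisions of the motivational example can be illustrated by plain enumeration. The following Python sketch is only an illustration and not part of MATAS or JaCoP; the helper `assignments` is a hypothetical name, and the sketch ignores start times, memory constraints and communications, counting processor assignments only.

```python
from itertools import product

TASKS = ["T1", "T2", "T3", "T4"]
PROCESSORS = [1, 2]


def assignments(same_processor=()):
    """Yield every task-to-processor mapping that satisfies the given
    same-processor pairs (hypothetical helper for illustration only)."""
    for choice in product(PROCESSORS, repeat=len(TASKS)):
        mapping = dict(zip(TASKS, choice))
        if all(mapping[a] == mapping[b] for a, b in same_processor):
            yield mapping


# All processor assignments without any partial assignment constraints.
unconstrained = list(assignments())

# Assignments left once T1/T4 and T2/T3 must share a processor.
with_pat = list(assignments(same_processor=[("T1", "T4"), ("T2", "T3")]))

print(len(unconstrained), len(with_pat))  # 16 4
```

Each group still has both processors available, so the final assignment remains open; only the number of independent decisions shrinks.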
For this particular example, an efficient linear clustering is not possible. A clustering of tasks T1 and T4 as well as T2 and T3 produces two new clustered tasks. All valid assignments and schedules of these clusters result in executing the cluster containing tasks T2 and T3 first and then the cluster containing tasks T1 and T4, thus producing an inefficient solution. Other clusterings will violate the code memory constraints.
RELATED WORK
Related work which tries to reduce the size of the problems for different synthesis activities, such as assignment and scheduling, usually concentrates on clustering. This technique selects tasks which are close to each other and builds a new task which contains all selected tasks. Our method is different since it does not build new tasks but instead adds new assignment constraints.
The work presented in [5, 7, 10] concentrates on the parallelization of software systems in a homogeneous multiprocessor environment through the use of clustering techniques. These techniques also reduce the complexity of the assignment problem, but they actually change the input problem since they create new tasks to represent clusters. They also assume a number of restrictions. For example, they cannot cluster an application into a given number of clusters, or they ignore network contention.
Mapping of heterogeneous task graphs onto heterogeneous architectures was addressed in [3]. The core assumption of this work is that the speed of every processor is equal to a baseline processor speed multiplied by a constant. Both the application task graph and the architecture task graph are clustered separately. This clustering produces two multi-layer clustered graphs, whose levels are later matched against each other. The clustering of the task graph examines fork nodes first. Afterwards, the remaining tasks are considered and added to the existing layered graph. The authors claim that mapping complexity is reduced without loss of quality due to the a priori multi-layer clustering of the task and architecture graphs.
The critical path-based clustering procedure was presented in [1]. It takes into account the different execution times of tasks depending on the processor. However, it does not consider code and data memory during clustering. It creates clusters of tasks which later need to be executed as single tasks.
Our approach makes it possible to reduce the complexity of the assignment and scheduling problem for a heterogeneous architecture in the presence of multi-resource constraints. All of the mentioned approaches addressed the simple case in which the only resource is time. Our method does not simplify the architecture by introducing one homogeneous communication structure or a homogeneous computing environment. We also do not assume that there is a speed factor to which the speeds of all processors have to be referred. Our approach does not produce clusters; it produces constraints which reduce the possible assignments of tasks to processors. In addition, all clustering approaches which do not create linear clusters run the risk of creating designs which will eventually deadlock [5]. Our approach avoids deadlocks since it does not cluster tasks but constrains task assignment within a group of tasks.
Traditional clustering combines tasks into a new task which is later treated by the synthesis method as an ordinary task. The theoretical work on clustering proves the superiority of linear clusters [4]. This does not apply here since we have not only timing constraints but also code and data memory constraints. Task assignment and scheduling for such systems are so closely coupled that they should be solved together. This is enabled by our methodology.
MATAS
The proposed framework has been written entirely in Java and it uses our own JaCoP (Java Constraint Programming) engine to model and solve synthesis problems. The important advantage of our approach is the gradual refinement of the model through the addition of new constraints. Since the system can handle heterogeneous constraints, the refinements can be very specific. In particular, constraints can specify on which processor a task should be executed, based on the assignment of other tasks. They can also specify which communications need to be local. The addition of new constraints (refinements) decreases the search space by making selected decisions explicit to the solver. This decreases the number of search nodes or, in the case of a branch-and-bound search algorithm, removes some of the branches. Note that in our case we do not use a branch-and-bound algorithm, but the decisions made by our heuristic are simplified.
Figure 5: MATAS framework and PAT
In our approach, JaCoP (a constraint programming solver) is used to model the system architecture, the application and the synthesis problem. A general introduction to constraint logic programming (CLP) is given in [6]. Briefly, a CLP program consists of constraints over finite domain variables and a search method. Each finite domain variable (FDV) is initially defined by a set of integer values that constitute its domain. Constraints specify relations among these variables. A constraint engine provides constraint consistency and propagation methods. Therefore, restricting the domain of one FDV propagates to other FDVs and usually results in restricting their domains as well. Partial task assignment constraints improve propagation and also cut off some parts of the search space.
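How restricting one domain propagates to another can be seen on a single precedence constraint of the form "start of A plus duration of A is at most start of B". The Python sketch below is a minimal bounds-propagation illustration, not JaCoP code and not its API; the function name `propagate_precedence`, the interval representation of domains and the 3 ms duration are assumptions made for this example.

```python
def propagate_precedence(start_a, duration_a, start_b):
    """Enforce start_a + duration_a <= start_b on (low, high) interval
    domains and return the narrowed domains.

    Hypothetical helper: real solvers keep richer domain representations.
    """
    a_low, a_high = start_a
    b_low, b_high = start_b
    # B cannot start before the earliest possible finish of A.
    b_low = max(b_low, a_low + duration_a)
    # A must start early enough to finish before the latest start of B.
    a_high = min(a_high, b_high - duration_a)
    if a_low > a_high or b_low > b_high:
        raise ValueError("constraint cannot be satisfied")
    return (a_low, a_high), (b_low, b_high)


# Both start times initially range over 0..10 ms; the task runs 3 ms (assumed).
s_t1, s_c1 = propagate_precedence((0, 10), 3, (0, 10))
print(s_t1, s_c1)  # (0, 7) (3, 10)

# Restricting the task to start at 2 ms or later narrows the communication too.
s_t1, s_c1 = propagate_precedence((2, s_t1[1]), 3, s_c1)
print(s_t1, s_c1)  # (2, 7) (5, 10)
```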
The MATAS synthesis system [8] performs both assignment and scheduling of tasks and communications. It considers timing constraints as well as code and data memory constraints. The goal of the system is to find a (near-)optimal solution with respect to schedule length while fulfilling all constraints. The synthesis is done within a constraint programming framework, as sketched in section 2 and presented in Algorithm 1. This algorithm tries to use the different resources, such as time, code memory and data memory, evenly. The decisions are made based on estimates of the future use of these resources. Both the time and data metrics, which are used to choose the next task to schedule, reflect the usage of those resources in relative terms. Often the next task will increase either the critical path length or the data memory usage, and therefore it is important to know which resource is currently used more and to act accordingly. Since the algorithm is constraint-driven, the results of all decisions propagate directly to all FDVs and constraints. This makes the implementation of different search heuristics easier and less error-prone.
The presented partial assignment technique is an extension of the prototype design system MATAS, as presented in Figure 5. Our method introduces new constraints which limit the possible assignments of the tasks. The idea of these constraints is to guide the MATAS system towards better solutions. PAT constraints (1) state that all tasks within the same group are assigned to the same processor and that the related communications between these tasks have duration zero.
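Following the description above and the constraints of the motivational example, the PAT constraints for one group can be written roughly as below. This is a sketch under assumptions: the group symbol $G$ and the notation $C_{T_i,T_j}$ for a communication between two tasks of the group are introduced here for illustration, and the second conjunct applies only where such a communication exists.

```latex
\forall\, T_i, T_j \in G:\qquad
P_{T_i} = P_{T_j} \;\land\; D_{C_{T_i,T_j}} = 0
```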
We do not create new tasks from old ones but specify which tasks need to be assigned to the same processor. To the best of the authors' knowledge, this is the only solution of this type that reduces the complexity of the task assignment and scheduling problem without changing the application model.
Algorithm 1 The general idea of the MATAS algorithm.
$R \leftarrow$ tasks without predecessors
while $R \neq \emptyset$ do
    select $T_{time}$ with minimal mobility
    select $T_{data}$ with greatest $\sum_{o \in S(T_{data})} size_o - \sum_{i \in P(T_{data})} \ldots$ {$S(Task)$ denotes all data produced by the task}

PARTIAL ASSIGNMENT TECHNIQUE
System synthesis has to take multiple resources into account. In our model, we currently have three types of resources for which parallel tasks compete: time slots, code memory and data memory. We assign tasks to processor time slots and communication tasks to bus time slots. Each task needs data memory during execution and also produces and consumes data which are stored in data memory. Code memory needs to be reserved for each task so that it can be executed on the selected processor. The complete solution specifies these three assignments. Since the number of decision variables is normally large, the search space can be huge. Our method makes selected assignment decisions explicit by specifying assignment constraints (1).
PAT makes partial assignment decisions based on several closeness measures between groups of tasks, which reflect the use of resources such as time, data memory and code memory. The final closeness measure is defined as a weighted sum of these closeness measures, as defined below. Smaller numbers represent closer groups.
where $w_1$, $w_2$ and $w_3$ are weights, $ct_{g_i,g_j}$ is the closeness measure for time, $cd_{g_i,g_j}$ is the closeness measure for data memory, and the remaining term is the closeness measure for code memory. In our experiments all weights are equal.
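Read literally, the weighted sum described above has the following shape. This is a sketch of the form only: the symbol $cc_{g_i,g_j}$ for the code memory measure and the name $c_{g_i,g_j}$ for the combined measure are assumptions made here, chosen by analogy with $ct_{g_i,g_j}$ and $cd_{g_i,g_j}$, and the order of the terms is likewise assumed.

```latex
c_{g_i,g_j} \;=\; w_1 \cdot ct_{g_i,g_j} \;+\; w_2 \cdot cc_{g_i,g_j} \;+\; w_3 \cdot cd_{g_i,g_j}
```

With equal weights, as used in the experiments, the ordering of group pairs depends only on the unweighted sum of the three measures.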
There are two crucial assumptions when computing closeness measures. The values in the denominator in (3), (4), and (5) are always computed under the assumption that groups $g_i$ and $g_j$ execute on different processors. The numerator, on the other hand, is the minimal value under the assumption that both groups execute on the same processor. Each of the metrics may have a value lower than one, which indicates that there is a possible gain if both task groups are executed on the same processor.
The closeness measure for the time resource is given below, where $D_x$ denotes the execution time of group $x$, and $C_{x,y}$ denotes the communication time between groups $x$ and $y$.
The closeness value becomes smaller as more communication time is required to communicate data between the two groups. A small closeness value $ct_{g_i,g_j}$ indicates that there will be a gain in schedule length if both groups are executed on the same processor.
The second closeness measure is for code memory, where $CM_x$ denotes how much code memory is required by group $x$, and $CM(P_x)$ denotes the amount of code memory available at the processor which executes group $x$. The numerator represents the minimal relative usage of code memory under the assumption that the groups execute on the same processor. It is divided by the sum of the minimal relative usages of code memory for each of the groups executing on different processors. The closeness function for code memory reflects, in relative terms, how much more code memory would be used if the two groups were merged. Since code memory is a resource which is reserved for the whole time, it is possible to use this type of metric.
The most difficult resource to take into account is the data memory. Its closeness measure is given by equation (5), where $DM_x$ denotes how much data memory is needed for the temporary data of group $x$, as specified by (6).
Tasks use data memory only temporarily, so we need a different approach to compute the closeness function for this resource type. Each task is annotated with its local data memory size multiplied by its execution time. For a given group, all data memory requirements are added and divided by the data memory size of the processor. When two tasks are executed on different processors, an additional cost appears due to the double reservation of data memory buffers for communication. This cost is represented by the third term in the denominator of equation (5). Since the communication time can differ, we assume that it is the average of the possible communication times. This cost is normalized by the smallest data memory size among the processors executing the two groups.
The PAT decisions are difficult since knowledge about the assignment of tasks to processors is not yet available. It is possible that grouping two tasks will increase the schedule length. Therefore, each resource closeness measure aims at reflecting the possible degradation or improvement of resource usage. Our PAT algorithm, presented in Algorithm 2, starts with one task per group. It then computes closeness measures for each pair of groups connected by a non-local communication. Each PAT iteration merges the two closest groups, thus making at least one communication local. This communication will not require bus access. The algorithm stops when the expected reduction of the task graph is reached.
Algorithm 2 PAT algorithm.
for all $i$ do
    $G_i = \{T_i\}$
tasks-to-constraint = reduction_factor
while tasks-to-constraint > 0 do
    for all non-local communications $C_z$ do
        $G_x$ = producer($C_z$)
        $G_y$ = consumer($C_z$)
        $clss_z$ = closeness$_{G_x,G_y}$
    select $C_z$ with smallest $clss_z$
    merge groups $G_x$ and $G_y$
    make all communication between merged groups local
    tasks-to-constraint = tasks-to-constraint - 1
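The greedy merging loop of the PAT algorithm can be sketched in Python as follows. This is an illustration under stated assumptions rather than the MATAS implementation: the function `pat_groups` and its parameters are hypothetical names, the closeness measure is passed in as a callable, and `toy_closeness` is a simplified time-only stand-in with made-up execution and communication times, not the closeness equations of this paper. The directions of the three communications of the motivational example are also assumed.

```python
def pat_groups(tasks, communications, closeness, merges):
    """Greedily merge task groups along non-local communications.

    tasks: iterable of task names
    communications: list of (producer, consumer) task pairs
    closeness: callable(group_a, group_b) -> float, smaller means closer
    merges: number of merge steps to perform (the reduction factor)
    """
    group_of = {t: frozenset([t]) for t in tasks}  # one task per group
    for _ in range(merges):
        best = None
        for producer, consumer in communications:
            g_x, g_y = group_of[producer], group_of[consumer]
            if g_x == g_y:
                continue  # communication is already local
            score = closeness(g_x, g_y)
            if best is None or score < best[0]:
                best = (score, g_x, g_y)
        if best is None:
            break  # no non-local communication is left
        merged = best[1] | best[2]
        for t in merged:
            group_of[t] = merged  # every communication inside is now local
    return set(group_of.values())


# Made-up data for the four-task example (all values are assumptions).
EXEC_TIME = {"T1": 2, "T2": 2, "T3": 2, "T4": 2}
COMM_TIME = {("T1", "T4"): 4, ("T2", "T3"): 3, ("T2", "T4"): 1}


def toy_closeness(g_x, g_y):
    """Time-only stand-in: execution time over execution plus bus time."""
    d = sum(EXEC_TIME[t] for t in g_x | g_y)
    c = sum(v for (p, q), v in COMM_TIME.items()
            if (p in g_x and q in g_y) or (p in g_y and q in g_x))
    return d / (d + c)


groups = pat_groups(EXEC_TIME, list(COMM_TIME), toy_closeness, merges=2)
print(sorted(sorted(g) for g in groups))  # [['T1', 'T4'], ['T2', 'T3']]
```

The result is a set of groups, not clusters: each group would be turned into same-processor constraints, and the tasks themselves stay unchanged in the task graph.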
EXPERIMENTAL RESULTS
Our technique was applied to a real-life example presented in [9]. From our perspective, however, that problem is quite simplified, since the application is mapped onto a homogeneous multiprocessor architecture consisting of 7 processors and a single bus. The authors also do not take code memory and precedence constraints into account. However, the application consists of a large number of tasks and communications, which makes it a good benchmark example. Our MATAS/PAT system has been applied to this example and produced the results presented in Table 1.
The first experiment has no partial assignment constraints introduced by PAT. Therefore the problem complexity is not reduced and the full optimization potential is retained. Since a full search has not been performed, the obtained solution is not proved optimal. The lower bound for the deadline of this example, with precedence constraints and the given architecture, is equal to 1975 ms. The first and the second experiment produce solutions that are ~1% worse than the lower bound and might be optimal.
In the following experiments we compare linear clustering with PAT. Linear clustering uses the same metrics as PAT for making clustering decisions. However, the clustering decisions for some tasks have been rejected, despite the indication of the metrics, when they produced non-linear clusters, since these might create deadlocks. In the second experiment 25% of the tasks have been partially assigned by PAT or clustered. In this particular case, both PAT constraints and linear clustering give the same solution as in the previous case. The problem complexity has been reduced without a penalty on the quality of the solution.
In experiments three and four both PAT and clustering simplified the problem at the expense of the achieved deadline. However, the obtained deadline still lies within 25% of the lower bound when the complexity of the assignment problem was reduced by 50% or 75%.
In experiment three, clustering with MATAS obtained a shorter deadline than PAT with MATAS. In this case, however, it was also checked that the constraints imposed by PAT do not exclude the solution found by MATAS with clustering. In this particular case, clustering guided the MATAS system better. This real-life example shows that, for problems where only time constraints are imposed, linear clustering will always give results as good as those of non-linear methods like PAT. This conforms to the theory presented in [4]. It is, however, not the case for the multi-resource problem, as indicated by the further experimental results.
Our technique has also been evaluated on random task graph examples. In this case, we can fully evaluate our method with specific code and data memory requirements and use the heterogeneous architecture depicted in Figure 6. The architecture consists of four heterogeneous processors and three buses. The speed of bus B1 is 2MB/s and the speed of the other buses is 1MB/s. The experiments were performed on task graphs generated by TGFF [2]. The options supplied to TGFF enforced the average execution time, code memory requirement, temporary task data, and application data to be equal to 4 ms, 4kB, 3MB, and 5MB, respectively. They also constrained the deviation of those parameters to 2 ms, 2kB, 1MB, and 2MB, respectively. Each task has at most 2 incoming and 2 outgoing edges. Each graph consists of 80 tasks and between 110 and 130 communications. The graphs can be regenerated using TGFF with the numbers 1 to 20 as random seeds. The task graph characteristics make them difficult to map and schedule on the test architecture. Each line in Table 2 represents the average values obtained by solving 20 different task graphs.
The first experimental setup (1) uses the target architecture where each processor has 100KB of code memory and 100MB of data memory. The maximum execution latency for the task graph was initially set to 350 ms and then gradually reduced to find the shortest deadline. The MATAS system was used to solve these problems in three different settings: without any additional constraints (M), with clustering (C) and finally with PAT constraints (P). The experiments with PAT and clustering used the best design space reduction ratio from the real-life experiment, 25%, so the assignment of one fourth of the 80 tasks has been constrained. In this particular experimental setup, MATAS with PAT constraints achieved a ~4% reduction in average deadline compared with the best results achieved by MATAS with clustering.
The reduction was partially obtained by executing tasks on processors which required more code memory; thus a larger code memory utilization has been observed. The sum of the data memory allocated at the different processors in the system, represented by the data memory utilization, has been reduced. This is due to a better placement of data in data memory across the whole architecture. In general, PAT constraints helped to find more parallel solutions in terms of both computation and communication.
In the next experiments (2) we have reduced the amount of available memory on all processors. Each processor has 75MB of data memory and 75KB of code memory. The deadline previously found for the given solutions was applied, and it was gradually extended in cases where it could not be met. It can be noticed that the processor utilization ratio as well as the bus utilization ratio have dropped, since the application cannot always be executed with the same degree of parallelism as previously.
The utilization of memory resources increased when compared with the previous experiments. The data memory utilization is higher since fewer resources are available. There are two important factors which make data memory a bottleneck, even if a data memory utilization of around 50% may not suggest this. First, when the code memory of a given processor has already been used up, we are not able to assign more tasks to this processor. Second, data is not stored evenly over the whole execution time. Often almost the whole data memory at a single processor is used temporarily for communication buffers, which is reflected by a very high data memory peak. This phenomenon restricts the execution of new tasks on this processor since no more data memory is available at that moment. The influence of code memory was reflected by a high averaged peak of 100% and an average code memory utilization of 90%. The average deadline has increased because tasks from the critical path cannot be executed until another, non-critical task consumes data and makes more data memory available. It also increases when the fastest task implementation cannot be executed due to code memory restrictions.
Both experimental setups show that the reduction of the search space resulted, in some cases, in a decreased quality of the solution. The deadline increased slightly. However, a consistent reduction of the heuristic's run-time, equal to the percentage of constrained tasks, was achieved. The PAT method does not transform the problem itself; it just adds partial assignment constraints. Clustering, on the other hand, transforms the problem and therefore influences not only task assignment but also the scheduling of communications. Clustering itself did help to improve the solution quality compared with the original MATAS approach, but these solutions are usually not as good as those PAT can deliver.
CONCLUSIONS
Our partial assignment technique, presented in this paper, makes it possible to decrease the complexity of the assignment and scheduling problem. During the search, the assignment of all tasks within the same group to a processor is performed only once. This method works for a heterogeneous architecture with a heterogeneous communication structure. The architecture resources can be of a different nature, from simple ones, such as code memory, to more complex ones, such as computation time or data memory. The experimental results indicate that PAT can simplify the problem by removing inferior parts of the search space, as was observed in the second experiment of the real-life example.
Our heuristic is able to improve the quality of the scheduling and assignment, as was observed in both random experimental setups. It gives better results than clustering as well as than the original MATAS approach without pre-constraining.
