This paper evaluates an algorithm that maps a number of communicating processes to a heterogeneous tiled System on Chip (SoC) architecture at run-time. The mapping algorithm minimizes the total amount of energy consumption, while still providing an adequate Quality of Service (QoS). A realistic example is mapped using this algorithm.
Introduction
The architecture of a portable multimedia system has to meet many conflicting requirements. For example, it has to be energy-efficient, due to the scarce energy resources, and it has to be flexible. It should be flexible so that it a) can employ many different standards, b) can adapt quickly to a new standard, c) can run different sets of tasks concurrently, and d) can adapt to the dynamically changing environment.
The designer can choose from a wide spectrum of architectures to implement such a system. This spectrum ranges from energy-efficient, high-performance but static and inflexible ASICs to flexible and easily programmable but energy-hungry general purpose processors. The optimal choice depends on the application/algorithms and several other aspects, including the available energy budget, the time to market and the production volume.
No specific architecture can meet all these requirements perfectly. A heterogeneous System on Chip (SoC) with different kinds of (reconfigurable) processing tiles interconnected by a Network on Chip (NoC) provides an attractive solution for this dilemma.
The best of both worlds (energy efficiency and flexibility) is combined in such a heterogeneous architecture. For example, small computationally intensive algorithms of an application can be mapped to an ASIC or a coarse-grain reconfigurable tile, avoiding a power-hungry tile such as a general purpose processor. On the other hand, control-intensive but less computationally intensive parts of the application are better mapped to a general purpose processor. In this way, the architecture can match the application instead of the other way around, as is usual.
However, the use of such a heterogeneous tiled SoC architecture changes the standard development flow (e.g. code a program in C and compile, or code functionality in VHDL and synthesize). The designer has to partition the application into a graph of communicating functional processes. In a process graph, a vertex represents a functional process and a directed edge represents communication between functional processes. For each functional process, one or more realizations for one or more different types of processing tiles have to be made. Designing several functionally equivalent realizations of the same process for different types of tiles allows running an application even when the optimal tiles are not available. Often, the partitioning of an application into a process graph arises naturally from the application (see Section 4). Quite often, the designer knows which kind of partitioning makes sense. The designer plays an important role in this process and we assume that the partitioning and the choice of possible realizations are still made manually by the designer.
The mapping of these realizations to the heterogeneous tiled SoC architecture can best be done automatically at run-time. At design time it is unknown which applications run simultaneously and how the external environment (with regard to available services, end-user behavior, wireless link quality) behaves. Therefore, this mapping decision has to be made at run-time. This article evaluates such a run-time mapping algorithm.
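As a minimal sketch of the process graph described above, the structure could be represented as follows. The class name `ProcessGraph`, its methods, and the example processes and communication volumes are hypothetical illustrations, not taken from the evaluated implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ProcessGraph:
    """Directed graph: vertices are functional processes, edges are communication."""

    # process name -> tile types for which a realization exists
    realizations: dict[str, set[str]] = field(default_factory=dict)
    # (source process, destination process) -> communication volume (arbitrary unit)
    edges: dict[tuple[str, str], float] = field(default_factory=dict)

    def add_process(self, name: str, tile_types: set[str]) -> None:
        if not tile_types:
            raise ValueError("a process needs at least one realization")
        self.realizations[name] = set(tile_types)

    def add_channel(self, src: str, dst: str, volume: float) -> None:
        if src not in self.realizations or dst not in self.realizations:
            raise KeyError("both endpoints must be known processes")
        self.edges[(src, dst)] = volume

    def degree(self, name: str) -> int:
        """Number of edges (incoming plus outgoing) that touch a vertex."""
        return sum(name in edge for edge in self.edges)


# Hypothetical three-stage pipeline; names and volumes are made up.
graph = ProcessGraph()
graph.add_process("filter", {"ASIC", "DSP"})
graph.add_process("fft", {"DSRH", "FPGA", "GPP"})
graph.add_process("decode", {"GPP"})
graph.add_channel("filter", "fft", 64.0)
graph.add_channel("fft", "decode", 16.0)
```

Keeping several realizations per process is what allows a mapper to fall back to another tile type when the preferred one is unavailable.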
Related Work
In the area of scheduling and optimization theory (operations research) a lot of literature exists on models which have some similarities with the considered problem (see e.g. [5, 2]). However, our application has some properties which do not allow us to use existing approaches without modification. Compared to traditional scheduling for parallel systems we have the following differences:
- The use of a heterogeneous architecture instead of a homogeneous architecture.
- The most important optimization parameter is minimization of the energy consumption instead of performance. The goal of most scheduling methods is to optimize for performance. In our method, the required performance is only one of the constraints that have to be satisfied.
- The communication is an important parameter to be included in the total optimization process because the communication consumes a substantial part of the total energy budget. In conventional multiprocessor systems, the main focus is on computation costs.
Another important difference with regard to optimization in the literature is that we need a lightweight algorithm. It may be better to have a reasonably good solution computed with little effort than to have an optimal solution that requires a lot of effort. Therefore, many existing optimization algorithms can be ruled out beforehand.
MinWeight Algorithm
In [1] the MinWeight algorithm is described that determines the weight of a minimum processor assignment for any weighted process graph and a set of processors. Its running time is exponential. However, in practice it can compute solutions quite fast, as long as the input graphs have only a small number of vertices with a high degree (greater than two). A proof of the correctness of the algorithm, the complexity of the algorithm and a further explanation can be found in [1].
Properties of the MinWeight Algorithm
In this part we describe and discuss the most relevant strong and weak points of the MinWeight algorithm with respect to our specific mapping problem.
Firstly, the MinWeight algorithm computes an optimal solution (see [1]) to the mapping problem instead of an approximation. This is a strong advantage of the algorithm.
Secondly, due to the dynamic-programming-like approach for vertices with low degrees, the complexity is low. The exact complexity depends on the degree of the vertices in the graphs, see [1]. E.g. for the mapping of 10 processes to 16 possible processors, $16^{10} \approx 10^{12}$ solutions are possible, but the algorithm finishes within a few milliseconds. When the degree of the vertices increases, the computation time of the algorithm increases exponentially. However, the process graphs are relatively small (between 5 and 20 vertices) in our targeted application domain and in practice a process graph does not have a lot of vertices with degree $\geq 3$. Therefore, we do not expect that the computation time will be a problem in practice.
Unfortunately, the algorithm does not take possible constraints into account. E.g. the capacities of processors and communication links are assumed to be infinite and the delay in communications cannot be taken into account. This is a serious limitation of the MinWeight algorithm.
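The size of the search space quoted above can be checked with a few lines of arithmetic. The snippet below is only an illustration; the second figure, for the case where no two processes share a processor, is added for comparison and is not a number reported in the text.

```python
import math

num_processes = 10
num_processors = 16

# Unconstrained: every process may be placed on any processor.
unconstrained = num_processors ** num_processes

# For comparison: at most one process per processor, i.e. ordered
# selections of 10 distinct processors out of 16.
one_per_processor = math.perm(num_processors, num_processes)

print(f"unconstrained:     {unconstrained:.2e}")
print(f"one per processor: {one_per_processor:.2e}")
```

Both counts are far too large for exhaustive enumeration at run-time, which is why the low practical complexity of the algorithm matters.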
Adding the Processor Capacity Constraint
The MinWeight algorithm does not handle additional constraints, e.g. the constraint that a processor has a limited capacity and therefore only a limited, fixed number of processes can run on a processor. A more advanced variant would determine the number of processes that can run on a specific processor from the capacity of the processor in combination with the load of executing the processes. To cope with the limited capacity of a processor, we adapted the MinWeight algorithm such that it satisfies the constraint that at most one process is mapped to each processor. A similar approach can be used for other constraints, e.g. at most two processes may be mapped to one processor. It is implemented in such a way that before computing the weight of a particular solution two conditions are checked. First, it checks that the processors involved in the mapping solution are not already occupied in an earlier mapping step for another vertex of the process graph. Second, it checks that the processors involved in the mapping solution are all unique. Only if both conditions are satisfied is the solution feasible.
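The two feasibility conditions can be sketched as a small predicate. This is an illustration under assumptions: the function name `is_feasible` and the representation of a candidate as a dictionary from process to processor number are hypothetical, and the actual adapted algorithm may organize the check differently.

```python
from __future__ import annotations


def is_feasible(candidate: dict[str, int], occupied: set[int]) -> bool:
    """Check the two conditions for a candidate (partial) mapping.

    candidate: process -> processor chosen in the current mapping step.
    occupied:  processors already taken in earlier mapping steps.
    """
    processors = list(candidate.values())
    # Condition 1: no processor was occupied in an earlier mapping step.
    not_occupied = occupied.isdisjoint(processors)
    # Condition 2: all processors within this candidate are unique.
    all_unique = len(set(processors)) == len(processors)
    return not_occupied and all_unique


# Hypothetical usage: processors 3 and 7 were taken in earlier steps.
taken = {3, 7}
print(is_feasible({"p0": 5, "p1": 13}, taken))  # both conditions hold
print(is_feasible({"p0": 5, "p1": 5}, taken))   # processor 5 used twice
print(is_feasible({"p0": 3, "p1": 13}, taken))  # processor 3 already occupied
```

A constraint such as "at most two processes per processor" would replace the uniqueness test with a per-processor count.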
Improvement of the Adapted MinWeight Algorithm
The adapted MinWeight algorithm (as discussed in Section 3.2) suffers from two problems:
1. The algorithm does not find a mapping at all. Different processes can compete for the same resources and it may happen that all the resources for a specific process are already occupied due to mapping decisions in the past. In this case, it is not possible for the algorithm to find a solution.
2. The algorithm finds a mapping that is far from optimal. This is a result of the introduced dependencies between the different assignment steps.
The first problem is the most severe one. To reduce the chance of getting no feasible solution we may improve the MinWeight algorithm by changing the order of the assignment of the vertices to the processors. If a vertex needs scarce resources, the probability is high that these resources are already taken by other processes when this vertex is mapped as one of the last. Therefore, it is probably better to start with the mapping of vertices that need scarce resources, to avoid resource bottlenecks that leave the algorithm unable to map the process graph to the SoC architecture.
When there are no (longer) resource problems, processes that have high processing costs are mapped next, because the quality of a solution suffers more when a process with high processing costs is mapped inefficiently than when a process with low processing costs is mapped inefficiently. Therefore, we propose an ordering of the vertices that is based on 1) the scarcity of the resources and 2) the processing costs of the processes.
However, it is not so simple to estimate when the resource scarcity is no longer a threat. It is important to detect as soon as possible that a reordering based on the processing costs of the processes is possible, because this improves the optimality of the final solution. Currently, we are investigating which simple metric we may use to decide how to order the processes to obtain a near optimal mapping.
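One possible form of such an ordering is sketched below. Since the actual metric is still under investigation, everything here is an assumption: scarcity is approximated by the number of tiles that can still host a process (fewer means scarcer), the two criteria are combined lexicographically rather than by switching once scarcity is no longer a threat, and the name `order_processes` and the example data are hypothetical.

```python
from __future__ import annotations


def order_processes(candidates: dict[str, set[int]], cost: dict[str, float]) -> list[str]:
    """Order processes: scarcest resources first, then highest processing cost.

    candidates: process -> tiles that could host one of its realizations.
    cost:       process -> estimated processing cost.
    """
    return sorted(candidates, key=lambda p: (len(candidates[p]), -cost[p]))


# Hypothetical example: "b" can only run on one tile, so it is mapped first;
# "c" and "a" are equally unconstrained, so the more expensive "c" comes next.
candidates = {"a": {0, 1, 2}, "b": {3}, "c": {0, 1, 2}}
cost = {"a": 10.0, "b": 5.0, "c": 40.0}
order = order_processes(candidates, cost)
print(order)
```

A lexicographic key is the simplest choice; it does not capture the point at which cost-based ordering could safely take over.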
For the NoC in general we do not expect resource problems. Most tiled SoC architectures use a mesh structure for the NoC, which means that several different routes of equal length are possible between two processing tiles.
Figure 1 shows the processes of the digital baseband part of our DRM receiver. Table 1 shows the processes that we would like to map on the SoC (for a functional description of the processes see [4]). These processes concern the data flow of the DRM application; we do not consider the processes in the "Global control & estimation" part of Figure 1. To test our algorithms, it is not important to have very accurate estimations of the processing costs. Therefore, we make a few assumptions to test our algorithms so that we do not have to realize the system to get the exact numbers:
- The number of multiplications per second is used as an indication for the costs of a process. Table 1 shows the costs of the processes in terms of multiplications per second for reception of Mode B, and the available implementations for the different types of processors.
- The ratio between the costs of processing a multiplication on an ASIC, DSRH, FPGA, DSP and GPP is 10:40:50:60:500, respectively.
- The communication costs increase linearly with the distance of the communication path on the SoC. The communication costs are equal to the throughput in kbit/s given in Table 2 multiplied by the Manhattan distance between the tiles.
Note that processes that have an ASIC realization need a specific ASIC. It is not possible to assign a process with an ASIC realization to an arbitrary ASIC processor. Table 3 shows the solutions (the assignments and the total costs) of the different algorithms.
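The cost assumptions above can be written down as a small cost model. This is a sketch under stated assumptions: the 4x4 mesh with row-major tile numbering is hypothetical (the tile layout is not given here), the function names are illustrative, and the example inputs are made up rather than taken from Table 1 or Table 2.

```python
# Relative cost of one multiplication per tile type, from the stated ratio.
COST_RATIO = {"ASIC": 10, "DSRH": 40, "FPGA": 50, "DSP": 60, "GPP": 500}

# Assumption: 16 tiles laid out as a 4x4 mesh, numbered row by row.
MESH_WIDTH = 4


def processing_cost(mults_per_second: float, tile_type: str) -> float:
    """Multiplications per second weighted by the tile-type cost ratio."""
    return mults_per_second * COST_RATIO[tile_type]


def manhattan_distance(tile_a: int, tile_b: int) -> int:
    """Hop distance between two tiles on the assumed row-major mesh."""
    row_a, col_a = divmod(tile_a, MESH_WIDTH)
    row_b, col_b = divmod(tile_b, MESH_WIDTH)
    return abs(row_a - row_b) + abs(col_a - col_b)


def communication_cost(throughput_kbit_s: float, src: int, dst: int) -> float:
    """Throughput multiplied by the Manhattan distance between the tiles."""
    return throughput_kbit_s * manhattan_distance(src, dst)


# Made-up example values, only to exercise the functions.
print(processing_cost(1000, "GPP"))
print(communication_cost(100, 5, 13))
```

How processing and communication costs are weighted against each other in the total is not specified in this section, so the sketch keeps them separate.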
Results
In each mapping vector, index i (starting at zero) gives the number of the processor to which the i-th process is mapped. So, e.g. for all mappings, process 3 is assigned to processor 12.
The optimal mapping without constraints is given by the MinWeight algorithm. Note that the MinWeight algorithm maps different processes to the same DSRH tiles (6 and 13). If we assume that a tile may be used for at most one process there is a resource problem. Even by swapping some of these processes to other tiles of the same type, no feasible solution can be obtained, since 5 processes are assigned to the DSRH tiles, but only 4 instances of this type of tile are available.
Taking into account that every processor may execute at most 1 process, another mapping is determined by the adapted MinWeight algorithm of Section 3.2. Note that the initial processor mappings are the same and that the first difference occurs when tile 13 is used a second time. This gives a mapping that is 8% more expensive compared to the solution of the MinWeight algorithm. The remaining question is how much the solution of the adapted MinWeight algorithm differs from the optimal solution, whose cost is expected to be higher than the lower bound given by the MinWeight algorithm due to the additional constraint. Therefore, the optimal solution is determined using a quadratic programming solution. It took several hours of computation on a Pentium 4 processor to evaluate all the possibilities with the brute force enumeration. This solution is about 3% more expensive than the MinWeight solution due to the additional processor capacity constraint. Therefore, we can conclude that we lose about 5% performance due to non-optimality of the adapted MinWeight algorithm.
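The resource conflict can be made explicit by counting how often each tile appears in the MinWeight mapping vector. The vector below is read from Table 3; the variable names are illustrative.

```python
from collections import Counter

# MinWeight mapping vector from Table 3: index = process, value = tile.
minweight_mapping = [5, 13, 9, 12, 13, 10, 6, 6, 6, 0]

usage = Counter(minweight_mapping)
# Tiles that host more than one process violate the one-process-per-tile rule.
conflicts = {tile: count for tile, count in usage.items() if count > 1}

print(conflicts)
print(sum(conflicts.values()))  # processes placed on the shared tiles
```

The count confirms that tile 13 hosts two processes and tile 6 hosts three, i.e. the five processes that compete for the DSRH tiles.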
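The quoted percentages follow directly from the total costs in Table 3, as the short calculation below illustrates. The helper name `overhead_percent` is hypothetical.

```python
# Total costs from Table 3.
costs = {
    "MinWeight": 22231,
    "Adapted MinWeight": 24126,
    "Quadratic programming": 22954,
}


def overhead_percent(cost: float, reference: float) -> float:
    """Relative extra cost of a solution compared to a reference, in percent."""
    return 100.0 * (cost - reference) / reference


adapted_vs_bound = overhead_percent(costs["Adapted MinWeight"], costs["MinWeight"])
optimal_vs_bound = overhead_percent(costs["Quadratic programming"], costs["MinWeight"])
adapted_vs_optimal = overhead_percent(costs["Adapted MinWeight"], costs["Quadratic programming"])

print(f"adapted vs. lower bound: {adapted_vs_bound:.1f}%")
print(f"optimal vs. lower bound: {optimal_vs_bound:.1f}%")
print(f"adapted vs. optimal:     {adapted_vs_optimal:.1f}%")
```

The three values come out at roughly 8.5%, 3.3% and 5.1%, in line with the rounded figures given in the text.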
Conclusion
The MinWeight algorithm computes an optimal solution very fast. However, the algorithm does not take into account all relevant constraints and therefore the practical use of the algorithm is limited. Adaptation of the MinWeight algorithm in order to fulfill the additional constraints gives a method which leads to a non-optimal solution. A realistic case shows that the adapted MinWeight algorithm gives a near optimal solution in a reasonably short computation time.
Table 3. Different Mappings
Algorithm               Assignment                           Total costs
MinWeight               5, 13, 9, 12, 13, 10, 6, 6, 6, 0     22231
Adapted MinWeight       5, 13, 9, 12, 6, 10, 7, 3, 2, 0      24126
Quadratic programming   5, 1, 9, 12, 13, 10, 6, 7, 11, 15    22954
In the future, we will focus on three issues. First, additional constraints and heuristics will be added to the MinWeight algorithm to cope with more real-life restrictions and to improve the solutions, respectively. Second, we expect that adding heuristics to change the order in which the processes are mapped to processors improves the optimality of the solution. We are currently investigating how to determine a better ordering based on simple criteria. Third, another approach may be used in which an optimal solution is first computed using the MinWeight algorithm and the constraint violations are resolved in a second step.
