As the application area of embedded processors widens, the demands on their performance are constantly growing. Until now, instruction level parallelism has been successfully exploited to satisfy these high performance requirements. Practice shows, however, that increasing the number of concurrently operating functional units of typical ILP (instruction level parallel) architectures above a certain level does not necessarily lead to significant performance gains [9]. Instead, it results in high hardware cost and inefficient use of that hardware. The advent of sub-micron processing, which allows integration of millions of transistors on a single carrier, has brought new opportunities in embedded system design. A multi-processor embedded system has become a very interesting alternative, in terms of both hardware cost and performance, especially if the system consists of several (different) ASIPs (application specific instruction set processors), each with functionality optimized for the subtask it has to perform. Code partitioning among the processors then leads to exploitation of the coarse-grain parallelism (task parallelism and parallelism in loops [4]), while the fine-grain (instruction level) parallelism [9] is exploited locally by each of the processors.
In the past, several environments for embedded system design have been realized (many references can be found in [7, 8, 12]). A number of papers specifically about multi-processor system design have also been published [3, 2, 15]. None of them, however, addresses the problem of automatically extracting the parallelism from the system specification.
In this paper we propose a new approach to mapping an embedded application written in ANSI C onto a cost-efficient heterogeneous multi-processor. Its uniqueness lies in a combination of the state of the art automatic ASIP synthesis software with a coarse- and fine-grain parallelism exploitation methodology.
The paper is organized as follows. Section 2 states the problem. Section 3 is devoted to the introduction of different parallelization methods. The system design space exploration algorithm is presented in section 4. The performance of the algorithm is demonstrated on the frequency tracking system in section 5. Section 6 concludes the paper.
Problem statement
Multi-processor system design involves finding a mapping Γ of a program graph G_P(V, E) onto an architecture template graph G_A(P, C), such that the resulting hardware-software solution satisfies the price-performance specification for the design. In these graphs, V is the set of program statements, E the set of data and control dependencies between the statements, P the set of processing elements and C the interconnection network between them. An example multi-processor architecture can include a set of ASIPs, possibly with local memories, communicating via a combination of fast interprocessor links and/or shared memory. In our case the ASIPs are designed using the MOVE framework.

Permission to make digital/hard copy of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. 0-89791-964-5/98/06..$5.00

Design Space Exploration
Our design space exploration algorithm takes as input the system specification in ANSI C, accompanied by several parameters, for example the maximal number of processors available and a set of parallelization methods to be attempted (as defined in section 3). The design can also be pushed in a certain direction, for example by specifying that some program fragments have to be parallelized in a certain way. Such extra directives prune the search space. In addition, the algorithm requires profiling information. This can be obtained by profiling the code on an oversized ASIP architecture with software compilation options for maximal ILP exploitation (the performance metrics obtained in this way represent optimistic bounds).
To explain the algorithm, we define the following: figure 8 ).
System design space exploration algorithm
The algorithm for system design exploration is shown in figure 2. It has been optimized to avoid repeating the most time-consuming tasks in every step. Therefore the general code transformations and data dependency analysis are done at the very beginning. For the same reason the final parallelization and code generation are performed only at the end. The algorithm performs the following four major steps:
1. After some initialization code, calculate the speedup function for the main procedure: for each number of available processors N, find a set of applied parallelization method(s) C and a set of parallelized program fragments 0 which optimize the overall speedup. As a side effect the speedup functions for all other vertices in
2. Code partitions for each point in the speedup function, which are the result of the parallelization, are mapped onto available processors, data transfers onto inter-processor communication links (E, @). The code partitions are mapped onto the processors using the assumption that the serial parts of the program are always executed on processor 1. At barriers (moments when parallel execution is started or terminated) the code partitions are assigned to available processors. 
3.
Parallelization
After the speedup functions of the successors are calcu. At v we consider only hierarchical parallelizations in the operation-parallel mode (a parallelization of a loop nest with n loops is not considered a hierarchical parallelization and is attempted). This is not an essential limitation; it is expected that in most cases hierarchical data-parallel mode parallelizations do not deliver better results than the single level ones.

The number of processors used at vertex v, N_proc(v), can be calculated using the following formulas:
-
where B(v) is the set of partitions at the vertex v and Nproc(b) the number of processors used inside successor nodes of the partition b.
To obtain exact results for the hierarchical parallelizations a substantial number of alternatives (combinations of the points in the speedup functions of the successor vertices) would have to be considered. It can be shown that the number of alternatives K can be calculated using the following formula:
where R is the number of partitions at v and x the number of successor vertices of v. For example, if x = 6 and N_max = 6 then, using this formula, we obtain K = 26270 combinations. In practice this number will be slightly smaller, since some alternatives cannot result in a legal parallelization, but we would still have to run our parallelization algorithms for the majority of them. This can very easily result in unacceptably long run times.
The following example shows a more practical method of obtaining a legal parallelization:
Example 1 Suppose that we are at a point to find the optimal parallelization of a vertex v onto 4 processors, with two partitions at v (i.e. Part_v(2, 4)).

To avoid the high computational complexity we decided to use an approximation algorithm to calculate the best parallelizations at v (the PARALLELIZE procedure in figure 6).
In its body, the apply (v,p,R) procedure applies a parallelization method p at v with R partitions at v and returns the obtained parallelization X . The obtained speedup depends on the parallelization method used and is calculated using the sp(p,X) procedure.
First, appropriate points from the speedup functions of the successor vertices are selected and the speedup S is calculated (first FOR loop). Subsequently parallelizations p ∈ P (recall def. 2) on N = 2..N_max processors are tried.
In the first inner loop, a parallelization of the vertex v itself is attempted. Then, if the parallelization method is operation-parallel, the hierarchical ones are tried. We follow the methodology from example 1. The optimal previous solution Part_v(R, N-1) is used as the starting point (following the arrows in figure 5). We identify the bottleneck partition at v and attempt to use an extra processor on its successor vertices to further speed up v.
Fragment of the PARALLELIZE procedure (figure 6):

    SF_v(N) = max(SF_v(N), S); }
    for (all p ∈ P) {
        /* vertex-only parallelization */
        for (all {w | w = succ(v)}) Select SF_w(1);
        X = apply(v, p, R);
        SF_v(R) = max(SF_v(R)
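The bottleneck-driven step described above can be illustrated with a small sketch. It is not the PARALLELIZE procedure of figure 6; it is a simplified model under stated assumptions: partitions at v run concurrently, the time of v is the time of its slowest partition, communication overhead is ignored, and each partition comes with a precomputed speedup function. The names `greedy_hierarchical`, `times` and `speedups` are hypothetical, and the numbers are made up for illustration.

```python
def greedy_hierarchical(times, speedups, n_max):
    """Greedy approximation: start from one processor per partition and
    repeatedly give one extra processor to the bottleneck partition.

    times[b]       -- serial execution time of partition b (assumed input)
    speedups[b][k] -- speedup of partition b on k+1 processors (assumed input)
    Returns {N: (allocation, speedup of v on N processors)}.
    """
    r = len(times)
    alloc = [1] * r              # starting point: Part_v(R, R)
    total = sum(times)           # serial time of the whole vertex

    def part_time(b):
        return times[b] / speedups[b][alloc[b] - 1]

    result = {}
    n = r
    while n <= n_max:
        slowest = max(part_time(b) for b in range(r))
        result[n] = (list(alloc), total / slowest)
        bottleneck = max(range(r), key=part_time)
        if alloc[bottleneck] >= len(speedups[bottleneck]):
            break                # bottleneck cannot use another processor
        alloc[bottleneck] += 1   # reuse the solution for N-1 as the start for N
        n += 1
    return result


# Hypothetical data: two partitions, up to 4 processors.
example = greedy_hierarchical(
    times=[70.0, 30.0],
    speedups=[[1.0, 1.5, 2.0], [1.0, 1.8]],
    n_max=4,
)
for n, (alloc, speedup) in sorted(example.items()):
    print(n, alloc, round(speedup, 2))
```

Each step tries only one new allocation per processor count instead of enumerating all combinations of successor speedup points, which is the reason for using an approximation in the first place. The price is that a greedy choice may miss the exact optimum.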
Computational complexity
The maximal number of parallelization try-outs necessary per vertex and per parallelization method is a small constant, which depends only on the number of points in the speedup function (equal to the number of points in the area R ≥ 2 in figure 5):
For example, for N_max = 6 we obtain Y = 15 tries, which is much less than the number of combinations possible (recall eq. 3). This, together with the fact that each vertex is visited only once, has a positive effect on the total computational complexity of the algorithm. This complexity is O(N_max^2 · |V| · |P|).
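The figure of 15 tries for N_max = 6 can be checked with a short count. The sketch assumes that the area R ≥ 2 in figure 5 is the triangular region 2 ≤ R ≤ N ≤ N_max, i.e. a parallelization on N processors never uses more than N partitions. That reading is an assumption, but it reproduces the quoted value and gives the closed form N_max(N_max - 1)/2. The function names are hypothetical.

```python
def count_tryouts(n_max):
    """Count the points (R, N) with 2 <= R <= N <= n_max (assumed region)."""
    points = [(r, n) for n in range(2, n_max + 1) for r in range(2, n + 1)]
    return len(points)


def count_tryouts_closed_form(n_max):
    """Closed form of the same count: n_max * (n_max - 1) / 2."""
    return n_max * (n_max - 1) // 2


print(count_tryouts(6))              # 15
print(count_tryouts_closed_form(6))  # 15
```

Under this reading the number of tries per vertex and per method grows quadratically in N_max, while the vertex count and the number of methods enter only linearly.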
Case study
In this section we present an example application of our algorithm: the frequency tracking embedded system from [6]. The system specification contains about 2k lines of ANSI C source code. In its main loop, the program reads a stream of samples (complex numbers) and uses LMS (adaptive signal enhancement) to determine instantaneous frequency estimates. A 1024-point FFT (fast Fourier transform) is then used to determine the frequency response of the adaptive filter every 100 input samples.
In the experiments we used a combination of tools belonging to the MOVE automatic processor generation framework [5] and the SUIF parallelizing compiler [1]. For functional pipelining a set of new tools operating on SUIF (Stanford University Intermediate Format) has been implemented [13].
First, we compiled the frequency tracking program with the gcc-move compiler, then scheduled it with the MOVE scheduler [9]. The options for maximal ILP exploitation (including software pipelining) and an oversized architecture (a large number of move busses and FUs) were used. The generated code was simulated using the MOVE simulator to obtain detailed profiling information. A number of general code transformations have been applied to this code. For example, the lms() and fft() functions were inlined inside the FOR loop bodies F20 and F15. The context graph, including computation distribution information, is presented in figure 7. For the sake of readability, only the loop and procedure vertices are shown. Vertices marked with '*' are suitable for data-parallel mode parallelization, while the ones marked with '#' are suitable for operation-parallel mode parallelization. The detailed data dependency vectors were generated using the combined static and dynamic methodology described in [14]. Communication overhead was estimated assuming the availability of fast bidirectional interprocessor links only.
Subsequently the SPEEDUP procedure was called to generate the speedup functions. The following parameters were 1.92 to 2.78 could be obtained. The point for N_proc = 2 involves operation-parallel mode parallelization of the F15 loop, while the speedup of 2.78 on 4 processors can be obtained by additionally parallelizing the loops F19 and F64 in lms, and the loops F120, F199 and F312 in fft.
Next the design trade-off curves Π_N for N = 1..4 were generated using the Explore tool of the MOVE framework. Figure 9 presents these curves. To combine them we have to select the Pareto points from each curve. In our case this results in the design trade-off curve presented in figure 10. Note that points for implementations with a larger number of processors lie to the right of the implementations with smaller N. After obtaining the combined curve, one of the points had to be selected. The specified timing constraint was 150 ms (required speedup of 1.48). The point for the configuration with only 2 processors turned out to be sufficient (marked in figure 10). Note that none of the single processor solutions, even with many FUs, meets the timing constraint. The obtained multi-processor is presented in figure 11. Besides standard components, processor 1 included 2 ALUs, 1 FPU, 2 Load-Store units and 8 move busses.
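The combination of trade-off curves and the selection of a design point can be sketched as follows. This is an illustration only, not the Explore tool: the function names are hypothetical, and all cost and time values are invented. Only the 150 ms timing constraint is taken from the text. Each point is a pair (cost, execution time in ms), and lower is better for both.

```python
def pareto_points(points):
    """Keep the points that are not dominated in (cost, time)."""
    front = []
    best_time = float("inf")
    for cost, time in sorted(points):    # ascending cost, then time
        if time < best_time:             # strictly faster than any cheaper point
            front.append((cost, time))
            best_time = time
    return front


def cheapest_meeting(points, deadline_ms):
    """Cheapest Pareto point whose execution time meets the deadline."""
    feasible = [p for p in pareto_points(points) if p[1] <= deadline_ms]
    return min(feasible) if feasible else None


# Hypothetical curves for N = 1 and N = 2 processors (arbitrary cost units).
curve_n1 = [(10, 260.0), (14, 235.0), (20, 222.0)]
curve_n2 = [(18, 190.0), (24, 140.0), (30, 116.0)]
combined = pareto_points(curve_n1 + curve_n2)

print(combined)
print(cheapest_meeting(curve_n1, 150.0))             # None: no single-processor point fits
print(cheapest_meeting(curve_n1 + curve_n2, 150.0))  # (24, 140.0)
```

With these invented numbers the single-processor curve never reaches 150 ms, so the selection falls on a two-processor point. This mirrors the qualitative outcome reported above without reproducing its actual data.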
Conclusions
In this paper we proposed a new design space exploration algorithm for semi-automatic mapping of an embedded application onto a cost-efficient heterogeneous multi-processor. Its uniqueness lies in a combination of the state of the art automatic ASIP synthesis software with a coarse- and fine-grain parallelism exploitation methodology. The computational complexity of the algorithm is linear in the number of loops in the program and in the number of applied parallelization methods. Its applicability was demonstrated on a case study of the frequency tracking system. The presented approach can be easily extended to handle real-time reactive embedded systems with many subtasks. Each subtask can be considered separately when calculating the speedup functions. After that, the task partitions are mapped onto the processors and the design space of the processors is explored. Once the trade-off functions are calculated, they should be checked for possible violations of subtask constraints, then combined, and finally the points which satisfy the global constraints are selected.
