This paper presents a technique to map automatically a complete digital signal processing (DSP)
1: Introduction
The post-World War II era has seen a trend toward using Digital Signal Processing (DSP) technologies for both military and civilian applications. The growing requirements for sophisticated algorithms, especially those used in 3-D application domains, call for real-time processing of large multi-dimensional arrays of data. These applications are executed on parallel computers, which offer enough computing power [25].
The mapping of DSP applications onto parallel machines raises new problems. The real-time and target machine constraints are imperative. The solution must fit the available hardware: the local memory, the number of processors and the processor communications. The application latency must meet the real-time requirements. This necessitates fine-grain optimizations. Combining both kinds of constraints is still beyond the scope of automation and requires deep human skills. This paper presents a new technique to automatically map a DSP application, represented by a sequence of loop nests, onto an SPMD distributed memory machine. This technique is based on formalizations of the architectural, applicative and mapping models by constraints. The result is (1) a fine-grain affine schedule of computations, (2) their distribution onto processors and (3) a memory allocation. Computations are distributed in a block-cyclic way on processors. Communications are overlapped with computations when possible. The memory model is precise: only the amount of memory useful to the computations is allocated.
<ancourt, irigoin@cri.ensmp.fr> Ecole des Mines de Paris/CRI, 77305 Fontainebleau, France
<Denis.Barthou@prism.uvsq.fr> PRISM, UVSQ, 45, avenue des Etats-Unis, 78035 Versailles, France
<guettier, jeannet, jourdan, juliette.@thomson-lcr.fr> LCR, Thomson-CSF, Domaine de Corbeville, F-91404 Orsay, France
The mapping problem is to distribute computations and data onto a parallel machine. The block size of the computations that can be executed on a processor must be estimated according to its local memory size. The partitioning of computations into blocks should fit the number of processors, and the computational blocks should be scheduled according to real-time constraints. This general mapping problem has been proved to be NP-complete [36, 37]. Moreover, it cannot be expressed in a single linear framework because the formulation of the general problem (data and computation distributions) involves non-linear constraints. While data dependence constraints can be translated into linear inequations and then solved by classical linear programming algorithms, resource constraints require non-linear expressions. Solving both kinds of constraints directly is still beyond the scope of any general algorithm and necessitates the combination of integer programming and search [24]. Following the same idea of combining constraint solving and nondeterminism, our technique uses a new approach: the Concurrent Constraint Logic Programming (CCLP) [19, 53] approach. Unlike conventional constraint solvers based on black-box algorithms, CCLP languages use an incomplete constraint solver over a finite-domain algebra. The two main advantages of using such an algorithm are, first, to enhance compositionality features [52, 31] and, second, to offer basic control structures for expressing new constraints [52].
Our approach takes as input the specification of different models: the target machine, the communication cost, the application, the partitioning, the data alignment, the memory allocation and the scheduling models. CCLP then makes it possible to handle linear and non-linear expressions and to yield, through the concurrent propagation of constraints over all the models, solutions that satisfy the global problem. The solution obtained depends on multiple criteria, such as memory allocation or latency, which are specified by the user.
The article is organized as follows. First, the characteristics of the target machine and DSP applications are presented. Second, our constraint formalization of the problem is given; in particular, the partitioning, scheduling and memory models are detailed. Third, the concurrent resolution programming technique is presented, followed by our prototype results. Finally, a comparison with other approaches is described before concluding.
2: Architectural and applicative features
This section presents an overview of the architectural and application features that characterize our general mapping problem formulation.
2.1: Architectural features
The target machine is an abstract SPMD distributed memory machine. The mapping is constrained by machine resources:
Number of processors. The application is mapped on all processors. However, criteria like memory allocation or communication minimization may enforce the use of fewer processors.
Local memory size. Because there is no global memory, the amount of memory necessary to execute a set of computations mapped onto a processor at a given moment must fit the available processor memory.
Processor rate. The latency criterion (amount of time between one input and the corresponding output) can be fixed to a maximum value.
Overlap of computations and communications. The partitioning model takes advantage of this property to overlap communications with computations.
The first three parameters are given by the programmer. Our system searches for solutions that satisfy these resource constraints. Searching with optimization criteria such as local memory minimization does not change the architectural model but requires a particular search implementation.
2.2: Applicative features
In this section the DSP applicative features are described. These features have been investigated for several years at Thomson-CSF by A. Demeure.
The application is a sequence of loop nests in a single-assignment form. It describes an acyclic graph of tasks. Each loop nest includes a procedure call (called a macro-instruction) that reads one or several multidimensional data arrays and updates one different array. Array accesses are affine functions of loop indices, possibly with a modulo. Figure 1 presents a global view of the PA application [8].
Figure 2
3.1: Partitioning model
The application parallelism degree, memory location requirement and time scheduling parameters are controlled by the partitioning. The iteration domain is decomposed over three vector parameters: $c$, $p$, $l$. Block, cyclic and block-cyclic distributions are possible. The partitioning is equivalent to the HPF distribute directive, and the same distribution formalization follows:
This set of computations defines a computational block. $\max(p) = \prod_i P_{ii}$ gives the maximum number of processors and $\max(c)$ the maximum number of synchronizations (cycles) necessary for the loop nest completion.
Due to DSP application features, the array access functions use, per array dimension, at most one explicit loop index and one implicit loop index (for the macro-instruction) which scans the read or write region. Since read and write regions are not partitionable, only the explicit loop nest is partitioned. Partitioning matrices are diagonal (possibly with a permutation).
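As an illustration of the block-cyclic decomposition over a cycle, a processor and a local component, the following minimal Python sketch splits a one-dimensional loop index. It assumes the standard block-cyclic relation i = L * (P * c + p) + l, where the scalars P and L stand for one diagonal entry of the partitioning matrices; this relation and the function names `decompose` and `recompose` are assumptions made for the example and are not taken from the text.

```python
def decompose(i, P, L):
    """Split index i into (cycle c, processor p, local offset l).

    Assumes i = L * (P * c + p) + l with 0 <= p < P and 0 <= l < L.
    """
    block, l = divmod(i, L)   # block number and offset inside the block
    c, p = divmod(block, P)   # blocks are dealt to processors round-robin
    return c, p, l


def recompose(c, p, l, P, L):
    """Inverse of decompose under the same assumed relation."""
    return L * (P * c + p) + l


if __name__ == "__main__":
    P, L = 4, 2  # 4 processors, blocks of 2 iterations (illustrative values)
    for i in range(16):
        c, p, l = decompose(i, P, L)
        assert recompose(c, p, l, P, L) == i
        print(i, "-> cycle", c, "processor", p, "local", l)
```

Under this assumption, L = 1 gives a purely cyclic distribution, and when P * L covers the whole iteration range there is a single cycle, which corresponds to a block distribution.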
3.2: Scheduling model
The scheduling model is designed to associate with each computation a logical execution event on a processor. The resulting schedule can be viewed as a succession of loop transformations. In general, it is not possible to find automatically the set of transformations to apply such that the final schedule is optimal. So, the affine scheduling approach, used in systolic arrays and parallelization techniques [23, 22 ... Thus, it only depends on vector $c$, which fully describes the block of $l$ iterations to perform. We choose the class of affine schedules to search as:
Variables are indexed by the loop nest number $k$. $d^k$ is the scheduling function of the $k$-th loop nest. $\alpha^k$ and $\beta^k$ are the affine scheduling parameters: $\alpha^k$ is a line vector and $\beta^k$ is a scalar. $N$ is the number of loop nests. It is used in the formula with the offset $+k$ in order to avoid the execution at the same date of two computations belonging to different loop nests.
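The role of the number of loop nests and of the offset +k can be checked with a small sketch. It assumes a schedule of the form d^k(c) = N * (alpha^k . c + beta^k) + k with 0 <= k < N, which is consistent with the description of the parameters but is a reconstruction, since the formula itself does not survive in the text. The function name `date` and all parameter values are hypothetical.

```python
def date(k, c, alpha, beta, n_nests):
    """Logical date of cycle vector c for loop nest k.

    Assumed form: d^k(c) = N * (alpha^k . c + beta^k) + k.
    """
    dot = sum(a * ci for a, ci in zip(alpha, c))
    return n_nests * (dot + beta) + k


if __name__ == "__main__":
    N = 3
    # Hypothetical affine parameters (alpha^k, beta^k) for three loop nests.
    params = {0: ((1,), 0), 1: ((1,), 0), 2: ((4,), 1)}
    seen = {}
    for k, (alpha, beta) in params.items():
        for c0 in range(8):
            d = date(k, (c0,), alpha, beta, N)
            # d % N == k, so dates of different loop nests never coincide.
            assert d % N == k
            assert d not in seen
            seen[d] = (k, c0)
    print(sorted(seen.items())[:6])
```

Multiplying by N leaves N consecutive dates for each value of the affine part, and the offset k assigns one of them to each loop nest.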
In the same way, two computational blocks of a single loop nest cannot be executed at the same date. Let $c_i^k$ and $c_j^k$ with $i < j$ be two cyclic components of the partitioned loop nest $N^k$. Then, the execution period of cycle $c_i^k$ must be greater than the execution time of all cycles $c_j^k$. Hence, the constraints $\alpha_i^k > \sum_{j>i} \alpha_j^k \max(c_j^k)$ with $\alpha_i^k \geq 1$ must be verified.
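The constraint on the scheduling coefficients makes the affine part of the schedule behave like a mixed-radix number, so that two different cycle vectors never receive the same date. The sketch below assumes the constraint reads alpha_i > sum over j > i of alpha_j * max(c_j), with every coefficient at least 1, and verifies the property by enumeration. The function names and the numeric values are hypothetical.

```python
from itertools import product


def satisfies_period_constraint(alpha, max_c):
    """True if alpha_i > sum_{j>i} alpha_j * max_c[j] and alpha_j >= 1 for all j."""
    if any(a < 1 for a in alpha):
        return False
    for i in range(len(alpha)):
        tail = sum(alpha[j] * max_c[j] for j in range(i + 1, len(alpha)))
        if alpha[i] <= tail:
            return False
    return True


def dates_are_distinct(alpha, max_c):
    """Enumerate every cycle vector c (0 <= c_j <= max_c[j]) and compare dates."""
    dates = [
        sum(a * cj for a, cj in zip(alpha, c))
        for c in product(*(range(m + 1) for m in max_c))
    ]
    return len(dates) == len(set(dates))


if __name__ == "__main__":
    max_c = (3, 7)            # hypothetical largest values of c_1 and c_2
    good, bad = (8, 1), (4, 1)
    print(satisfies_period_constraint(good, max_c), dates_are_distinct(good, max_c))
    print(satisfies_period_constraint(bad, max_c), dates_are_distinct(bad, max_c))
```

With the coefficients (4, 1), the cycle vectors (0, 4) and (1, 0) both map to date 4, which is the kind of collision the constraint rules out.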
As an example of additional constraints that link the partitioning and scheduling models, the data flow dependencies express that a piece of data of loop nest $N'$ cannot be read before being updated by $N''$. These dependencies between two cycles $c''$ of loop nest $N''$ and $c'$ of $N'$ imply that the read is scheduled strictly after the update, $d'(c') > d''(c'')$, where $d'$ (resp. $d''$) is the scheduling associated to $N'$ (resp. $N''$).
Note that these dependencies are computed between iterations of different loop nests. Data flow dependencies are approximated by their convex hull representation. However, this approximation yields the same set of valid schedules as the exact representation, without any loss. Due to DSP application characteristics, this representation can remain symbolic. This improves constraint propagation, since no costly algorithm is needed to solve the dependence test.
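A data flow dependence of this kind can be checked on explicit pairs of cycles. The sketch below assumes that the dependence is stated as "the date of the reading cycle is strictly greater than the date of the writing cycle" for every (reader, writer) pair. The function name `violations`, the two schedules and the dependence pattern (one reader cycle consuming eight writer cycles) are hypothetical.

```python
def violations(read_date, write_date, pairs):
    """Return the dependence pairs (c_read, c_write) scheduled in the wrong order."""
    return [(cr, cw) for cr, cw in pairs if read_date(cr) <= write_date(cw)]


if __name__ == "__main__":
    # Hypothetical pattern: reader cycle c' consumes writer cycles 8c' .. 8c'+7.
    pairs = [(cr, 8 * cr + k) for cr in range(4) for k in range(8)]

    def write_date(cw):
        return 2 * cw             # writer scheduled every 2 steps

    def late_read(cr):
        return 16 * cr + 15       # after the last write it depends on

    def early_read(cr):
        return 16 * cr + 8        # before the last four writes it depends on

    print(len(violations(late_read, write_date, pairs)))
    print(len(violations(early_read, write_date, pairs)))
```

Enumerating pairs is only practical for small examples; the text instead keeps the dependence representation symbolic.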
3.3: Memory model
The memory model ensures that the application can be executed under a memory constraint. A capacitive memory model is used. It evaluates the memory required for each computational block mapped onto a processor by analyzing the data dependencies. An allocation function can be extracted straightforwardly from the memory allocation result when the schedule is known after the optimization phase.
A data block is the data set needed to execute a macro-instruction. The number of data blocks needed to execute a computational block is derived.
The memory is organized in segments of identical data blocks, one per loop nest. This eliminates the problems of memory fragmentation and the possible need for block relocation. Data duplications due to overlapping input reference sets between successive iterations are eliminated by using partial data block decompositions. Only new partial data blocks are kept and fused with others. This refinement is powerful enough to handle any multidimensional read overlaps and proved very efficient on the studied DSP applications.
The previous capacitive memory constraints define the memory requirements for executing the tasks on each processor according to its local memory size. The local memory size is fixed by the programmer. Optimizations such as local memory minimization are particular search implementations and do not change the memory model.
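The capacity check can be pictured with a short sketch. It assumes that the memory needed on one processor is the sum, over the per-loop-nest segments, of the number of data blocks times the size of one data block, and that this sum must not exceed the local memory size. The `Segment` class, the function names and all figures are hypothetical, and the refinement for overlapping reads is not modelled.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One memory segment: identical data blocks belonging to one loop nest."""
    loop_nest: str
    n_blocks: int      # data blocks needed by one computational block
    block_words: int   # size of one data block, in memory words


def required_memory(segments):
    """Total words needed on one processor under the assumed capacity model."""
    return sum(s.n_blocks * s.block_words for s in segments)


def fits(segments, local_memory_words):
    """True if the segments fit in the local memory of one processor."""
    return required_memory(segments) <= local_memory_words


if __name__ == "__main__":
    # Hypothetical figures, not taken from the experiments reported here.
    segments = [Segment("FFT", 2, 512), Segment("BF", 4, 128)]
    print(required_memory(segments), fits(segments, 2048))
```

Because every segment holds blocks of a single size, the total is a plain sum and no fragmentation term is needed in this sketch.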
4: Resolution
Constraint logic programming is a generalization of logic programming where unification is replaced by constraint solving over several computation domains. These domains include linear rational arithmetics, boolean algebra, Presburger arithmetics and finite domains [20] .
More recently, the introduction of the notion of constraint entailment, stemming from the Ask & Tell paradigm of concurrent programming [46], enhanced the CCLP framework with synchronization mechanisms. This new class of CCLP (see fig. 3 ...
The mapping models such as partitioning and scheduling are represented with mathematical variables and affine constraints. Non-linear constraints link the different models and are generally composed of complex polynomial terms. For example, constraint (2) links the partitioning and architecture models: the number of processors required by the partitioning must not exceed the number of processors available.
$$\text{Number\_of\_processors} \geq \max_k \prod_i P^k_{ii} \qquad (2)$$
The latency, resource and data-flow dependency constraints (1) are global constraints.
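The constraint linking the partitioning and architecture models can be sketched as follows. The sketch assumes that the number of processors required by a loop nest is the product of the diagonal entries of its partitioning matrix P, and that the largest such product over all loop nests must not exceed the number of available processors. The function names and the partitionings in the example are hypothetical.

```python
from math import prod


def processors_needed(p_diag):
    """Processors used by one loop nest: product of the diagonal of P."""
    return prod(p_diag)


def satisfies_processor_constraint(partitionings, number_of_processors):
    """The most demanding loop nest must fit the machine."""
    worst = max(processors_needed(d) for d in partitionings.values())
    return worst <= number_of_processors


if __name__ == "__main__":
    # Hypothetical diagonals of P for three loop nests.
    partitionings = {"FFT": (8, 1), "BF": (2, 4), "SI": (4, 4)}
    print(satisfies_processor_constraint(partitionings, 8))
    print(satisfies_processor_constraint(partitionings, 16))
```

The product makes the constraint non-linear in the entries of P, which is why it cannot be handed to a purely linear solver.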
Expressing the global mapping problem in CCLP required an in-depth collaboration between CCLP and parallelism specialists. The fine-grain models, derived from parallelization techniques, induce a CCLP model mostly based on the expression of sets of macro-instructions, data blocks and dependency relationships. Those sets are represented in intension rather than in extension.
In some cases, this task was impossible to perform directly and the proposed models had to be recast as a set of expressible constraints representing an approximation of the model.
... dependence, scheduling and communication models. During the resolution, models communicate their partial information about these variables to others.
While storing the different constraints, the CCLP system builds a solution space on a model-per-model basis. Each model's solution space is pruned when constraints are propagated from other models. Once all models have been built into the system, non-linear constraints linking the different models still have to be met. Solutions must be looked for in the resulting overall search space using a specific global search.
This search relies on (1) the semantics of the variables of each model and their importance w.r.t. other models and (2) the goal to achieve (i.e. resource minimization under a latency constraint, or latency minimization under a resource constraint). Each variable takes part in a global cross-model composite solving, such that only relevant information is exchanged between models. The global search looks for partial solutions in the different concurrent models. For instance, the set of scheduling variables ($\alpha_i$, $\beta_i$) and partitioning matrices ($P_i$, $L_i$) are partially instantiated by inter-model constraints during the resolution. Model-specific or more global heuristics are used to improve the resolution: e.g. schedule choices are driven by computing the shortest path in the data-flow graph.
Based on model semantics and specific heuristics, the global mapping problem is solved through CCLP using complex composition schemes.
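The combination of propagation and search over finite domains can be illustrated on a toy problem. The sketch below is a plain Python stand-in, not a CCLP program and not the system described here: it chooses a number of processors P, a block size L and a number of cycles C such that P * L * C equals the loop extent (a non-linear constraint), first pruning the domains and then enumerating with a "smallest block size first" heuristic as a proxy for memory minimization. The function name `solve` and its parameters are hypothetical.

```python
def solve(extent, max_processors, max_block):
    """Return (P, L, C) with P * L * C == extent, preferring the smallest L."""
    # Finite domains of the two decision variables.
    dom_p = set(range(1, max_processors + 1))
    dom_l = set(range(1, max_block + 1))
    # Propagation: a value that does not divide the extent can never
    # appear in a solution, so it is removed before any search.
    dom_p = {p for p in dom_p if extent % p == 0}
    dom_l = {l for l in dom_l if extent % l == 0}
    # Search: smallest block size first, then as many processors as possible.
    for l in sorted(dom_l):
        for p in sorted(dom_p, reverse=True):
            if extent % (p * l) == 0:
                return p, l, extent // (p * l)
    return None


if __name__ == "__main__":
    print(solve(64, 8, 16))
    print(solve(12, 8, 4))
```

The pruning step alone does not guarantee a solution, which mirrors the incomplete solvers mentioned above: the remaining non-linear check is only settled during the search.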
If dedicated algorithms are used, the composition of the different functions is only possible by sequential solving, according to the functional programming paradigm. This restricts the composition facilities and has too high a complexity. Traditional generic solvers, such as Simplex, are designed to solve only linear constraints in a convex rational context. Simplex-type algorithms do not support cooperation between models.
Integer programming makes it possible to recast complex non-linear constraints using boolean variables. Links between models are therefore represented using boolean variables, which restricts partial information exchanges between models.
5: Results
This section illustrates our prototype results. The user specifies the target machine and the optimization criteria. In this example, the cost function to optimize is the memory size. The target machine has 8 processors. The latency constraint is set to 4.10' processor clock cycles and the memory is unbounded. Figure 4 describes the partitioning of PA. The loop nest parallelism and locality are expressed with the diagonal matrices $P$ and $L$.
5.1: Partitioning
The partitioning characteristics follow. According to the different partitions, only the time dimension is globally scheduled.
From the $\alpha$ and $\beta$ scheduling parameters in Figure 5, the schedule can be expressed using the regular expression:
$$(((FFT, [BF, E], BB)^8, SI, SA)^8, LI)^*$$
Computational dependencies between iterations are satisfied. The system provides a fine-grain schedule at the macro-instruction level using the shortest path in the dependence graph. This enables the use of data as soon as possible, avoids buffer allocations, and produces output results at the earliest. On the right-hand side, the corresponding loop nest is represented.
Eight iterations of tasks FFT, BF-E, BB (executed every $\alpha_i = 6$ steps) are performed before one iteration of SI, SA (executed every 48 = 6*8 steps). The last task, LongInteg, cannot be executed before 8 iterations of the preceding ones. So it is executed every 384 (= 8*48) steps.
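The nesting of the three task groups and the periods 6, 48 and 384 can be checked by unrolling the schedule once. The sketch assumes that the schedule reads as the nested expression (((FFT, [BF, E], BB)^8, SI, SA)^8, LI); the function name `expand` and the flat task labels are hypothetical.

```python
def expand():
    """Unroll (((FFT, [BF, E], BB)^8, SI, SA)^8, LI) once, as a flat task list."""
    trace = []
    for _ in range(8):
        for _ in range(8):
            trace += ["FFT", "BF-E", "BB"]
        trace += ["SI", "SA"]
    trace.append("LI")
    return trace


if __name__ == "__main__":
    trace = expand()
    print(trace.count("FFT"), trace.count("SI"), trace.count("LI"))  # 64 8 1
    # Periods quoted in the text, in logical steps.
    fft_period = 6
    si_period = 8 * fft_period
    li_period = 8 * si_period
    print(fft_period, si_period, li_period)  # 6 48 384
```

One execution of LongInteg therefore corresponds to 64 executions of the FFT group, which matches the ratio 384 / 6 between their periods.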
5.3: Comparison with manual mappings
Manual mappings of DSP applications are performed in different ways. In general, user-friendly interfaces provided by manufacturers offer some help for coarse-grain parallelism.
The application is scheduled at the task level and not at the macro-instruction level. Thus, load balancing is more difficult to obtain.
While it is hard for a human being to instantiate the different models so that all constraints are satisfied, we have compared our solution to two different manual solutions. The first one is based on loop transformation techniques. The second one uses the maximization of processor usage as its only economic function. Our result is equivalent to the one suggested by parallelization techniques. It is better than the second one, which requires more memory. The first solution is obtained in a few minutes, while this optimization is completed in ten minutes on a SPARC-10 workstation. These times have to be compared with the time a human being needs to comprehend and map the application.
6: Related Work
Mapping applications onto parallel machines addresses issues such as scheduling [10] ...
Although manual loop transformation techniques are attractive and give good results, it is not possible to find automatically the set of transformations to apply to obtain the optimal schedule [33, 9]. However, restructuring the application such that parallelism and data locality are maximized is still a relevant objective. Many studies [9, 41, 48] present interesting approaches. Thereafter, the compiler is in charge of physically mapping the optimized application onto the target machine. Compared to our approach, there are no real-time and architectural constraints (number of processors and memory resources) to take into account during the parallelization phase.
Similar techniques are used in the systolic array [16, 17, 14] and parallelization [23, 22, 26] communities to compute affine schedules. In the systolic community, these techniques are applied to a single loop nest with complex internal dependencies. The other approaches, dealing with complete applications, do not have the same architectural and application constraints. The parallelism grain is at the instruction level, there is no real-time constraint and the target machine is generally virtual.
DSP application features are taken into account in [45]. This approach is based on task fusion, but for a sequential result. The static mapping of DSP applications with specific signal requirements [27, 49] has been widely investigated. The representative Ptolemy framework [39, 47, 44] brings some solutions but at a coarse-grain level. Most of the resolution schemes are based on dedicated algorithms [6].
Our approach is the first one to propose an optimal affine schedule of a complete application with fine-grain parallelism (at the macro-instruction level) and its mapping onto an architecture under resource and real-time constraints.
7: Conclusion
A technique to automatically map DSP applications onto distributed memory machines has been introduced in this paper. It uses a multi-model approach to describe the general mapping problem and a concurrent resolution framework based on Constraint Logic Programming. Even if the presented model constraints are linear, our system copes with non-linear constraints.
Our experiments on DSP benchmarks show that our prototype takes into account all architectural and applicative parameters. Sequential, pipelined and parallel schedules are generated depending on the applications. Comparisons with manual solutions show that our approach may provide interesting, and even better, solutions.
Future work focuses on developing strategies to speed up the solution enumeration and on extending the set of applications processed automatically.
8: Acknowledgments
We wish to give special thanks to F. Coelho for his constructive remarks and critical reading of this paper. We also thank T. Brizard, P. Legal and B. Marchand for their continuous support.
