New heterogeneous multiprocessor platforms are emerging that are typically composed of loosely coupled components that exchange data using programmable interconnections. The components can be CPUs or DSPs, specialized IP cores, reconfigurable units, or memories. To program such platforms, we use the Process Network (PN) model of computation. Localized control and distributed memory are the two key ingredients of a PN that allow us to program these platforms: the localized control matches the loosely coupled components, and the distributed memory matches the style of interaction between the components. To obtain applications in PN format, we have built the Compaan compiler, which translates affine nested-loop programs into functionally equivalent PNs. In this paper, we describe a novel analytical translation procedure used in our compiler that is based on integer linear programming. The translation procedure consists of four main steps; we present each step by describing the main idea involved, followed by a representative example.
INTRODUCTION
Applications envisioned for the next decade in the area of multimedia, imaging, bioinformatics, and signal processing have a high computational demand. To satisfy this demand, new hardware platforms are emerging, referred to as heterogeneous multiprocessor platforms. They are typically composed of loosely coupled components that exchange data using programmable interconnections such as a switch matrix or a network on chip. The components can be CPUs or DSPs, specialized IP cores, reconfigurable units, or memories.
Although such heterogeneous platforms are already being built [22, 37, 28], mapping applications onto them still relies on the ability of a system designer to manually partition the application's memory and control across the platform components [7]. This process is typically performed empirically and lacks a systematic approach. In this process, a designer primarily focuses on the extraction of independent tasks from the application, the synchronization between the tasks, and memory management. A number of research projects deal with the automation of the mapping process. For example, the PICO project [15, 29] aims to automate the mapping of applications onto platforms consisting of VLIW processors and custom nonprogrammable accelerators. Another example is the Atomium project [3], which deals especially with memory issues when mapping applications onto platforms with distributed memory architectures. To program heterogeneous multiprocessor platforms, we believe that the Process Network (PN) model of computation (MoC) is well suited to the multiprocessor nature of the new hardware platforms [27]. The PN is a deterministic MoC that explicitly specifies tasks as processes and distributed memory as FIFO channels [18]. The localized control and distributed memory in a PN are the two key ingredients that allow us to program heterogeneous multiprocessor platforms. The localized control matches the loosely coupled components, and the distributed memory matches the style of interaction between the components. However, writing an application in PN format is time-consuming and error-prone. Therefore, we have built the Compaan compiler [17], which translates affine nested-loop programs into functionally equivalent PNs specified in C++ [8] or Java [19]. It is also possible to obtain a hardware implementation of the PNs using the Laura [38] VHDL back-end.
Figure 2: Deriving a Process Network in four steps
Parallelization of nested loops with static control and affine indices was addressed by Held [13], who tried to automatically generate systolic arrays by using data-flow analysis and dependence graphs. Inspired by this work, Rijpkema [25] stated the problem of translating this class of nested loops to process networks. In [17], he presented an approach for doing the translation in a number of steps. In some of these steps, he relied on Ehrhart theory [9, 5], which, due to its computational complexity, has implementation limitations [35]. Furthermore, this theory is applicable only to a very limited number of applications. To overcome these limitations, we present in this paper a new translation procedure. The main contributions of the paper can be summarized as follows:
The paper presents a novel, fully analytic procedure to translate arbitrary nested loop programs with static control and affine indices to a functionally equivalent process network.
The translation procedure uses a number of novel Integer Linear Programming (ILP) formulations that can be solved with an adequate ILP solver.
Using the ILP formulations, all steps have been implemented in our Compaan compiler, where they replace and extend the steps presented in [17], making the compiler more robust and capable of converting any affine nested-loop program to a process network.
The translation procedure was evaluated on a number of image and signal processing applications showing that the approach is capable of automatically generating efficient process networks at acceptable computation time.
The translation procedure consists of four main steps and we present each step by describing first the main idea in an intuitive way, followed by the translation of the idea into its ILP formulation, and finally, how the idea works out on a running example. Overall, the paper is organized as follows: In Section 2, we give the problem involved in translating an affine nested-loop program to a PN. In Section 3, we present a four step approach to do the translation. In Section 4, we give results obtained from running our compiler and in Section 5, we conclude the paper.
PROBLEM DEFINITION
The problem we address in this paper is the translation of a sequential application to an equivalent PN specification, as shown in Figure 1. The class of applications we consider in this translation is confined to nested loops with static control and affine indices [11]. An example of such an application is given on the left side of Figure 1, where each assignment statement is iterated over a convex domain, called the iteration space, composed of iteration points (IPs) [1]. The iteration spaces can be parameterized by using for-loops with parametric bounds, as can be observed in the code. The PN that is generated consists of a number of processes, each of which executes one of the assignment statements of the input program a number of times. For example, process F1 corresponds to statement stm1, process F2 to statement stm2, and so on. The translation from an affine nested-loop program to a PN involves two problems. First, the computation carried out by a sequential application in a single process needs to be distributed over a number of separate computational processes. Second, the global memory arrays (e.g., r1 and r2) used for data storage need to be transformed into dedicated FIFO buffers that are accessed using a blocking Get primitive, which provides a simple inter-process synchronization mechanism.
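To make the target form concrete, the following is a minimal sketch of two processes that communicate only through a FIFO channel. It is an illustration, not Compaan output: the function names, the token values, and the channel capacity are hypothetical, and the blocking Put on a full channel is an assumption added here to keep the channel bounded.

```python
import queue
import threading


def producer(channel, n):
    """Hypothetical producer process: emits n tokens, one per iteration."""
    for i in range(n):
        channel.put(i * i)  # blocks while the bounded channel is full (assumption)


def consumer(channel, n, results):
    """Hypothetical consumer process: reads n tokens in arrival order."""
    for _ in range(n):
        results.append(channel.get())  # blocking Get: waits until a token is available


def run_network(n, capacity=2):
    """Run both processes concurrently; the FIFO is their only shared state."""
    channel = queue.Queue(maxsize=capacity)
    results = []
    threads = [
        threading.Thread(target=producer, args=(channel, n)),
        threading.Thread(target=consumer, args=(channel, n, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each process blocks on its channel, the consumer sees the same token sequence for any channel capacity and any thread interleaving, which is the deterministic behavior a PN relies on.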
SOLUTION
As shown in Figure 2, the conversion from an application to a PN takes place gradually in a number of steps, guided by the idea of localizing the control and distributing the memory. In a Preprocessing step, the initial sequential specification is converted to a network representation in which all the executions of one assignment statement are collapsed into a single process. This network is the input of the first step, Consumption Restructuring. During this step, we restructure the data consumption, i.e., each array used for storing data generated by different producer processes is replaced by a number of separate memory arrays, one for each producer process. In the second step, Production Restructuring, we restructure the data production, i.e., each array used for storing data consumed by several consumer processes is replaced by a number of separate memory arrays, one for each consumer process. After the first two steps, a distinct piece of memory sits between each producer and consumer process. This forms an instance of the classical producer/consumer (P/C) pair. Depending on the order in which data is produced and consumed in a P/C pair, different types of communication mechanisms with adequate synchronization policies have to be employed to derive a valid PN. This is done in the third step, Communication Model Selection. Using the information obtained in the first three steps, a PN with autonomously running processes communicating data over FIFO channels is obtained as Java or C++ code in the last step of our approach, Code Generation.
The network obtained after the Preprocessing step does not reveal any degree of parallelism. It is just a partitioned representation of the application code given in Figure 1. The topology of this network resembles the Reduced Dependence Graph [6] of the application. Each circle in the left part of Figure 2 represents a process iterating one of the assignment statements over the same iteration space over which the statement is iterated in the original code. The processes are still executed one at a time, following the same global schedule in which the corresponding assignment statements are executed in the original code.
Step 1 - Consumption Restructuring
In the Consumption Restructuring step, the consumption of the data is restructured such that each producer process stores data into a separate memory array for each of its function output arguments. Hence, no two producer processes write data into the same array. This transformation is visualized in Figure 3, where array r2 is replaced by two different arrays, r21 and r22. Due to the restructuring, process 4 now has to decide at each execution whether to read data from r21 or from r22. Consequently, the iteration space of process 4 gets partitioned into two subdomains. Each subdomain represents what we call an Input Port Domain (IPD). IPD1 contains the IPs at which process 4 reads data produced by process 2, and IPD2 contains the IPs at which process 4 reads data produced by process 3. Graphically, the IPDs of a process are visualized in Figure 2 as black spots located at the end of an incoming edge of a consumer process. The partitioning into IPDs is done by adding linear inequalities to the domain of process 4, as shown in the code in Figure 3.
Approach:
To derive the inequalities that partition the consumer domain into IPDs, we first identify groups of producer processes that write data into the same memory array. Let $S_r$ be the set of all processes $P_r$ that write one of their process function output arguments into the array $r$, and consider likewise the set of all processes that read data from $r$. For each process $P_r$, we replace the write into array $r$ with a write into a separate array. To maintain a correct execution, the corresponding consumer processes have to consume data from the new memory arrays. Therefore, we have to be able to connect the consumption of a data token with its production. To solve this problem, we make use of exact data dependence analysis [11, 23, 20]. The dependence analysis yields an affine dependency function together with the domain on which this function is valid. This domain is in fact an IPD. Each IPD is an integral union of parameterized polytopes containing all IPs at which the input argument of the assignment statement embedded in the process is produced by one process. Without loss of generality, we assume that each IPD is represented by only one integral parameterized polytope of a given dimension, $IPD(N)$. Thus, each P/C pair is uniquely represented by a polytope $IPD(N)$ together with an affine dependency function $f$ represented by an integral matrix $M$ and an offset vector $O$, i.e., $f(x) = Mx + O$.
Example:
Consider the code given in Figure 1, where the statements stm2 and stm3 are responsible for writing data into a 2-d array r2 from which statement stm4 consumes data. In the network representation of the original application, we identify the P/C pairs (process 2, process 4) and (process 3, process 4), each of them communicating data via the global array r2. In each P/C pair, we replace the write into array r2 at location r2(l, m) with a write into array r21 at location r21(l, m) and into array r22 at location r22(l, m), respectively. As a consequence, processes 2 and 3 write data into separate memory arrays, as shown on the right-hand side of Figure 3.
To keep the execution of the network correct, we have to find, at each execution of process 4, the location in r21 or r22 containing the appropriate input data. This correspondence is obtained using the data-dependency functions corresponding to these P/C pairs.
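The idea behind the partitioning into IPDs can be sketched by brute force on a toy program. The sketch below is only an illustration under stated assumptions: the two writer statements `stmA` and `stmB`, their iteration spaces, and the consumer domain are hypothetical and are not the program of Figure 1, and exact dependence analysis is replaced here by simulating the sequential program and recording the last writer of every array cell.

```python
def last_writers(n):
    """Simulate a hypothetical sequential program and record, for every cell
    of a shared array, which statement wrote it last."""
    writer = {}
    # stmA (hypothetical) writes the diagonal cells (i, i).
    for i in range(1, n + 1):
        writer[(i, i)] = "stmA"
    # stmB (hypothetical) writes the cells strictly above the diagonal.
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            writer[(i, j)] = "stmB"
    return writer


def input_port_domains(n):
    """Partition the consumer iteration space (1 <= i <= j <= n) into one
    input port domain per producer statement."""
    writer = last_writers(n)
    ipds = {"stmA": [], "stmB": []}
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            # The consumer at (i, j) reads cell (i, j) of the shared array.
            ipds[writer[(i, j)]].append((i, j))
    return ipds
```

In this toy case the two IPDs are described by the linear conditions `i == j` and `j >= i + 1` added to the consumer domain, which is the kind of inequality the analytical procedure derives symbolically rather than by enumeration.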
Step 2 - Production Restructuring
In the Production Restructuring step, we replace the memory arrays that are accessed by different consumer processes with a separate array for each of the consumer function input arguments. This transformation is visualized in Figure 4, where array r1 is replaced by two different arrays, r11 and r12. Due to the restructuring, process 1 now has to decide at each execution whether to write data to r11, to r12, to both arrays, or to neither of them. The restructuring partitions the iteration space of process 1 into two subdomains. Each subdomain represents what we call an Output Port Domain (OPD). OPD1 contains the IPs at which process 1 writes data consumed by process 2, and OPD2 contains the IPs at which process 1 writes data consumed by process 3. Graphically, the OPDs of a process are visualized in Figure 2 as black spots located at the beginning of an outgoing edge of a producer process. The partitioning into OPDs is done by adding linear inequalities to the domain of process 1, as shown in the code given in Figure 4.
Approach:
To derive the inequalities of the OPDs that partition the producer domain, we first identify groups of consumer processes that read data from the same memory array. Consider the set of all consumer processes that read one of their process function input arguments from the same memory array $r$ into which data has to be written. Finding the appropriate storage arrays for a producer IP $y$ is equivalent to deciding whether $y$ belongs to the following set:

$$OPD = \{\, y \mid y = f(x),\ x \in IPD(N),\ x \text{ integral} \,\} \qquad (1)$$

where $f$ is the dependency function and $IPD(N)$ is the parameterized consumer IPD. The OPD is thus a linearly bounded lattice (LBL) [30], i.e., the integral image of $IPD(N)$ through the affine function $f$. Deciding whether $y$ belongs to the OPD can be expressed as the solution of the following parametric integer linear programming (PIP) problem [10], with variable $x$ and parameter $y$:

$$\operatorname{lexmin}\ x \quad \text{subject to} \quad (c1)\ x \in IPD(N), \qquad (c2)\ f(x) = y$$

where condition (c1) specifies that the problem domain is given by the polytope $IPD(N)$, and (c2) imposes that the problem include only the integer points $y$ for which a consumer point $x$ exists. Although we are only interested in whether an integral solution exists, we choose the lexicographical minimum as the objective. This allows us to gather additional information that is used in the Code Generation step to optimize the network memory management. As shown in [11], the solution of the presented problem is a multistage conditional expression, i.e., a solution tree whose branches pair domains $D_1, \dots, D_n$ with functions $T_1, \dots, T_n$, and in which some branches may be empty. An empty branch corresponds to the case when $f$ is not a surjective function. Only when a producer IP belongs to a non-empty branch is the data consumed by a consumer process, so that it has to be stored into a memory array. Although the expressions of the $T$ functions do not serve a purpose in this step, they are used in the Code Generation step for the lifetime analysis of tokens. Due to the restructuring, an OPD is the union of the domains expressed by the non-empty tree branches:

$$OPD = \bigcup_{i} D_i$$
It is easy to observe that this formulation of an OPD is equivalent to the one given in Equation 1.
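The relation between an OPD, the image of the consumer IPD, and the lexicographically minimal consumer point can be sketched by enumeration for a fixed parameter value. This is only an illustration: the rectangular consumer IPD and the dependency function `dependency` below are hypothetical, and the enumeration stands in for the parametric ILP solution, which handles symbolic parameters.

```python
from itertools import product


def consumer_ipd(n):
    """Hypothetical consumer IPD: all integer points with 1 <= i, j <= n."""
    return list(product(range(1, n + 1), repeat=2))


def dependency(x):
    """Hypothetical affine dependency function: the consumer at (i, j)
    reads the token produced at producer point (i,)."""
    i, _j = x
    return (i,)


def output_port_domain(ipd, dep):
    """Return a dict mapping every producer point in the image of the IPD
    to its lexicographically smallest consumer point."""
    first_consumer = {}
    for x in sorted(ipd):  # tuple ordering in Python is lexicographic
        first_consumer.setdefault(dep(x), x)  # keep only the first consumer seen
    return first_consumer
```

The keys of the returned dictionary form the OPD: a producer point writes to the channel only if it appears as a key. The stored value plays the role of the lexicographically minimal consumer point, the extra information kept for the later lifetime analysis.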
Example:
In the case of process 1, we have to make explicit two OPDs, namely OPD1, consisting of the IPs at which data has to be loaded into r11, and OPD2, consisting of the IPs at which data has to be loaded into r12. Since we have two P/C pairs, two PIP problems (corresponding to PC3 and PC4) have to be solved. As shown in [10, 24], the two ILP problems can be solved using algorithms such as Lexicographical Dual Simplex with Gomory Cuts or an integer version of Fourier-Motzkin Elimination. As a result, we get two solution trees, ST1 and ST2, composed of statements expressed in the coordinates of the iteration space of process 1.
The two trees partition the iteration space of process 1. If p1 is a producer IP belonging to $D_1$ at which process 1 produces the token t, then the solution tree gives the first (lexicographically smallest) consumer IP that consumes the token t. Similarly, if p2 is a producer IP belonging to $D_2$, the tree gives the first consumer IP that consumes it. However, from the point of view of the distribution of the data, the information regarding the different first consumptions is irrelevant here, i.e., we are interested only in whether a produced token has to be submitted or not. Therefore, as shown in Figure 4, the two disjoint domains $D_1$ and $D_2$ represent the output port domain.
Step 3 - Communication Model Selection
After performing the Consumption and Production Restructuring, the original application has been partitioned into separate tasks in which P/C pairs communicate data over dedicated memory arrays. In the Communication Model Selection step, we investigate the communication characteristics of each P/C pair in order to replace the memory array with a FIFO-based communication structure. As a result of this step, a PN with bounded memory execution is obtained, because a FIFO size equal to the number of IPs included in the corresponding OPD is enough to avoid network deadlocks. However, the sizes of the FIFOs can be decreased using techniques that find a good balance between memory space and inter-process parallelism [21, 2].
Approach:
There are four communication types for a P/C pair. These four types of communication are given in Figure 5 . They result from the ordering of the iterations at the Producer and the Consumer processes and the existence of multiplicity for a given token, which means that a token sent by the Producer is read more than once at the Consumer side. Here we define in a formal way ordering and multiplicity as follows: 
Definition 1 A P/C pair is in-order if and only if the dependency function

According to these definitions, an arbitrary P/C pair belongs to one of four disjoint classes: in-order without multiplicity (IOM-), in-order with multiplicity (IOM+), out-of-order without multiplicity (OOM-), and out-of-order with multiplicity (OOM+).
To determine the communication pattern of an arbitrary P/C pair, we need to identify to which of the four classes the P/C data-flow graph belongs. For that purpose, we introduce two tests. The Reordering Test (RT) determines whether a P/C pair is in-order, and the Multiplicity Test (MT) determines whether a P/C pair has multiplicity. Based on these two tests, an arbitrary P/C pair is assigned to one of the four categories. Both tests can be formulated and solved using ILP [34]. Consider again an arbitrary P/C pair PC represented by a parameterized $IPD(N)$ and a dependency function $f$. According to Definition 2, a P/C pair has multiplicity if two different Consumer points $x$ and $y$ exist, as given by conditions (c1), (c2), and (c3), such that they consume one and the same token from the Producer, as given by condition (c4). The four conditions form the Multiplicity Problem (MP). If a solution exists for the MP, then the MT is true and the P/C pair has multiplicity. The Reordering Problem (RP) is formulated in the same manner from Definition 1. Both problems are decided with the ET, which checks whether the domain specified by a set of constraints contains integer points [23, 10]. The ET is an ILP test and requires systems of linear constraints. Both the MP and the RP contain non-linear constraints (see, for example, condition (c3) in both problems), but using the lexicographic order we can decompose them into subsets of linear constraints ($MP_i$ and $RP_i$, respectively) onto which the ET can be applied: $MT(MP) = \bigvee_i ET(MP_i)$ and $RT(RP) = \bigvee_i ET(RP_i)$. In the case of the RP, the lexicographical order operator $\prec$ is decomposed into subsets of linear constraints, and the ET is applied to each subset. In the case of the MP, the negation is the non-linear operator. The negation can be rewritten as two inequalities:

$$x \neq y \iff (x \prec y) \lor (y \prec x)$$

where we again use the decomposition of the lexicographical operator to obtain linear constraints. However, as presented in [33], for a large number of P/C pairs (95%) the RT and the MT can be solved using polynomial algorithms based on, for example, the Hermite and Smith normal forms [14].
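For a fixed parameter value, the outcome of the two tests can be checked by enumeration. The sketch below is a brute-force stand-in for the ILP formulation and rests on two assumptions stated here: a pair counts as out-of-order when some consumer point x that precedes y lexicographically reads a token produced strictly after the token read by y, and it has multiplicity when two distinct consumer points read the same token. The function name `classify` and the example dependency functions are hypothetical.

```python
from itertools import combinations


def classify(ipd, dep):
    """Classify a P/C pair as 'IOM-', 'IOM+', 'OOM-' or 'OOM+' by enumerating
    all pairs of consumer points of a finite IPD."""
    points = sorted(set(ipd))  # lexicographic consumer order
    out_of_order = False
    multiplicity = False
    for x, y in combinations(points, 2):  # x precedes y lexicographically
        if dep(x) == dep(y):
            multiplicity = True           # two consumer points share one token
        elif dep(y) < dep(x):
            out_of_order = True           # later consumer needs an earlier token
    order = "OO" if out_of_order else "IO"
    return order + ("M+" if multiplicity else "M-")


# Hypothetical 2x2 consumer domain used for the usage examples below.
GRID = [(i, j) for i in (1, 2) for j in (1, 2)]
```

For example, `classify(GRID, lambda x: x)` yields `'IOM-'`, while a transposition `lambda x: (x[1], x[0])` yields `'OOM-'`. The enumeration is quadratic in the number of points and works only for concrete parameter values, which is why the analytical tests are preferable in a compiler.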
Example:
Let us now analyze how the presented tests are used to decide the communication characteristics of the P/C pair PC4 = (process 1, process 3). We first show how the RT decides the ordering characteristics of this P/C pair. For this purpose, we verify whether the domain specified by the constraints of the corresponding reordering problem contains integer points. As can easily be seen (even without using ILP [33]), conditions (c3) and (c4) contradict each other, and therefore the reordering problem has no solution. Thus, we conclude that PC4 is in-order. Next, we analyze the multiplicity characteristics of the same P/C pair. For this purpose, we verify whether the domain specified by the constraints of the multiplicity problem of PC4 contains integer points. In this problem, all the constraints are linear inequalities except those specified by condition (c3). Because $x$ and $y$ are arbitrary points from $IPD(N)$, condition (c3) is decomposed, using the lexicographical order, into a set of linear conditions.
This leads to three instances of the MP. If one of these systems has a solution, multiplicity is involved, which is the case here: the system made of conditions (c1), (c2), (c3.3), and (c4) has a solution. This can be verified by looking, for example, at the points $P_1 = (t, l, v)$ and $P_2 = (t, l, w)$ with $v \neq w$. Both points are mapped to the same point at the Producer side, so we conclude that multiplicity takes place. Hence, from the analysis of the two problems, it turns out that PC4 is in-order with multiplicity. By applying the multiplicity and reordering tests to the remaining three P/C pairs in our example, we find that PC1, PC2, and PC3 are of type IOM-, as shown in Figure 6.
Step 4 - Code Generation
In the first three steps, we have created a PN model that consists of a topology, the iteration spaces of the processes, the IPDs and OPDs, and the types of the channels. In the Code Generation step, a software representation is derived for the PN model. The iteration spaces are converted to for-statements by making use of Fourier-Motzkin Elimination [36]. The topology, IPDs, and OPDs are transformed into components such as threads and sets of for- and if-statements with linear expressions. For these components, equivalent implementations exist in the YAPI environment, as C++ [8], and in the PN domain of the Ptolemy framework, which is based on Java [19]. In this step, we also take advantage of the classification done in the Communication Model Selection step to implement an optimal communication structure for each P/C pair.
Approach:
Deriving a software description of the PN takes place in two steps. In the first step, the network processes are derived. Each iteration space of a process, which is represented by a matrix, is translated to a nested for-loop representation. Furthermore, each IPD and OPD is translated from its matrix representation to a structure of if-statements that is inserted in the appropriate processes. In the second step, the network communication structure is derived for each P/C pair. Based on the type of the P/C pair, we realize the communication in the following way:
IOM-: Using only a FIFO buffer that is accessed using a Get and a Put primitive.
IOM+: Using a FIFO buffer that is accessed using a Get and a Put primitive. However, the Get primitive is guarded by additional control that determines the lifetime of a token to account for its multiplicity.
OOM-: Using a FIFO buffer that is accessed using a Get and a Put primitive, but at the Consumer process we add private reordering memory and a controller to perform the reordering. Since multiplicity is not involved, each time the controller accesses the reordering memory for reading data, the corresponding memory location can be released immediately.
OOM+: Using a FIFO buffer that is accessed using a Get and a Put primitive, but at the Consumer process we add private reordering memory, a controller to perform the reordering, and additional control to keep track of the lifetime of a token. When the lifetime of a token has come to an end, the lifetime control releases the memory location held by the token in the reordering memory.
The implementations for the different types increase in complexity from IOM- to OOM+. The implementations of IOM- and IOM+ are closely related, except that in IOM+ additional control is needed to know when to read data from the FIFO. The implementations of OOM- and OOM+ require additional reordering memory and a reorder controller. Of the four models identified, OOM+ is the most expensive communication structure to realize. It is also the generic communication structure, since it subsumes the three other structures. In [32], we presented a number of alternative realizations of this type of communication, two of which have been prototyped in hardware.
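The consumer side of an OOM- channel can be sketched as a FIFO plus a private reordering memory. The sketch below is an illustration rather than the generated code: it assumes, hypothetically, that every token travels with the address of its producer point and that the consumer knows the order in which it needs the addresses; the names `consume_out_of_order`, `produced`, and `wanted` are made up for the example.

```python
from collections import deque


def consume_out_of_order(produced, wanted):
    """Read (address, value) tokens from a FIFO in production order and hand
    them to the consumer in the order given by `wanted`.

    Returns the values in consumption order and the peak number of tokens
    held in the reordering memory."""
    fifo = deque(produced)
    memory = {}          # private reordering memory, indexed by address
    values = []
    peak = 0
    for address in wanted:
        # Reorder controller: keep reading until the needed token has arrived.
        while address not in memory:
            token_address, token_value = fifo.popleft()  # a blocking Get in a real PN
            memory[token_address] = token_value
            peak = max(peak, len(memory))
        # No multiplicity: the location is released as soon as it is read.
        values.append(memory.pop(address))
    return values, peak
```

An OOM+ channel would differ only in the release policy: instead of `memory.pop`, the location would be kept until the last consumer point that uses the token has been executed. In this sequential sketch, asking for an address that is never produced raises `IndexError` on the empty FIFO, whereas a real process would block.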
To perform a compile-time lifetime analysis of the data communicated between Producer and Consumer processes for communication type IOM+, we make use of what we call the Lexicographically minimal Preimage (LmP) [31]. The LmP maps the domains $D_1, \dots, D_n$ of the solution tree presented in Section 3.2 into the Consumer domain using the non-empty functions $T_1, \dots, T_n$. These transformations are the solution to the minimization problem given in Section 3.2, whose objective is the lexicographical minimum. Hence, an iteration $y \in T_1(D_1)$ is the lexicographically minimal IP that consumes the token produced at $f(y)$. This means that $y$ is a point at which a new token has to be read from a FIFO. Once the token is read and removed from the FIFO, it can be reused as many times as needed, until the next point $y'$ is found that indicates that a new token is to be read. The counterpart of the LmP is the Lexicographically Maximal Preimage, which identifies the last consumer IP that uses a certain input data token. For communication type OOM+, where the tokens are stored in a reordering memory, the Lexicographically Maximal Preimage indicates when a memory location can be released, allowing us to minimize the size of the reordering memory.
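The guarded Get of an IOM+ channel can be illustrated with a small sequential simulation. The setup is hypothetical: a producer emits one token per value of i, and a consumer iterating over (i, j) reuses the token of row i for every j, so that (i, 1) is the lexicographically minimal consumer point of token i. The guard `j == 1` stands in for the condition that would be derived from the LmP.

```python
from collections import deque


def run_iom_plus(n):
    """Simulate an in-order channel with multiplicity for an n x n consumer.

    Returns the (iteration, token) pairs seen by the consumer and the number
    of tokens left in the FIFO at the end."""
    fifo = deque(f"tok{i}" for i in range(1, n + 1))  # producer output, in order
    consumed = []
    current = None
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if j == 1:
                # (i, 1) is the first consumer of token i: read a new token.
                current = fifo.popleft()
            # Every other iteration of the row reuses the token already read.
            consumed.append(((i, j), current))
    return consumed, len(fifo)
```

Only n tokens cross the channel although the consumer executes n * n iterations, which is the saving the lifetime analysis makes possible for IOM+ pairs.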
Example:
In the example, we focus only on the implementation of the communication types IOM+ and IOM-. In the case of pairs PC1, PC2, and PC3, we replace each of the static arrays r11, r21, and r22 with a FIFO buffer. Observe that the absolute addressing performed on the arrays is now replaced by relative addressing using Put and Get primitives. In the case of PC4, we replace static array r12 by a FIFO buffer, but we also need to take into account the lifetime of the tokens flowing over the FIFO due to multiplicity. To find the moment at which a process can read a token from FIFO2, we use the LmP: we map the domains that make up OPD2 through their corresponding solution functions. The pseudo code for the PN is shown in Figure 7. It shows how the four processes are implemented. It also shows how the IPDs and OPDs derived in the various steps are transformed into if-statements using linear expressions. In the case of process 3, we need additional control that decides when a new token is read from FIFO2.
IMPLEMENTATION AND RESULTS
The steps presented in Figure 2 are implemented in the tool chain shown in Figure 8. The first tool, called MatParser [16], performs an exact data-dependence analysis. This tool implements the Consumption Restructuring step. The Process Network Generator tool, or PNGen, implements the remaining three steps presented in this paper and generates a PN description. PNGen replaces the Panda tool in Compaan as presented in [17]. The user can choose to have the PN generated in C++ or in Java. The generated code allows us to simulate the PN and to verify that the PN is equivalent to the original sequential program. It is also possible to generate hardware for a PN. The Laura tool [38] transforms the network generated by PNGen into an equivalent VHDL description that can be synthesized and mapped onto an FPGA platform. The four transformation steps make extensive use of polyhedron manipulations, matrix decompositions, and integer linear programming. For these operations, MatParser and PNGen rely on existing libraries such as PolyLib [36], Pip [10], and Omega [23]. In Table 1, we present some quantitative characteristics obtained from compiling 8 applications, of which the M-JPEG case is described separately in [27] and the QR case in [12]. For each application, we give the number of lines of the original sequential representation, the compilation time required on a Pentium III processor, and the number of processes and channels generated. We also show how the P/C pairs are classified into the four types. Based on this data, we observed that in approximately 90% of the P/C pairs a communication structure based on a FIFO buffer is sufficient. In the remaining 10% of the cases, we need to realize a reordering at the Consumer process, using extra memory and a reordering controller.
CONCLUSION
This paper describes a solution for converting the complete class of static affine nested-loop programs into equivalent PN representations using Integer Linear Programming. The approach is analytical; there is not a single heuristic involved. We have shown that the translation problem can be divided into four steps. For each step, we presented the main idea of the step, how to express the step as an ILP formulation, and how the step applies to a running example.
All the steps and techniques presented have been implemented in software in the Compaan tool chain, replacing and extending the steps presented in [17] . Actually, all the examples given in this paper are generated by this compiler. We also showed the results we get from running Compaan on a set of 8 applications from the area of signal and image processing. As shown in [27] , the Compaan compiler puts us in a great position to further develop a programming environment for heterogeneous multiprocessor platforms. Using PNGen, we obtain software implementations for PNs that can be mapped and executed on CPUs or DSPs. On the other hand, using Laura, we can also obtain hardware implementations for PNs making use of dedicated IP cores and reconfigurable hardware. An arbitrary mix between hardware and software is also possible.
As future work, we plan to provide a number of transformations at the process network level. Examples of these network transformations are Channel Merging [4], in which a number of channels are merged into a single one, thereby reducing memory and control; Process Splitting, in which the loop structure of a process is unrolled, resulting in a larger number of processes; and Process Retiming, in which the iteration space of a process is rescheduled by applying unimodular transformations. Both Process Splitting and Process Retiming are applied in order to speed up the execution of the network, and they give results similar to those of the algorithmic transformations presented in [26], but at a lower computational cost.
ACKNOWLEDGMENTS
