In this paper, we introduce a constraint programming- 
Introduction
In this paper, we focus on the problems of the static optimization of area and reconfiguration time for the communication network of regular 2D reconfigurable processor array architectures. To solve these problems (a) jointly and (b) not for a single algorithm, but for a whole set of algorithms, a unique constraint programming approach has been applied.
Previously, we have introduced an abstract model for minimization of the number of multiplexers [12]. This model is limited and covers only unicast data transfers. In this paper, we propose a new optimized formulation that makes it possible to support multicast data transfers. Moreover, we define new cost functions that make the minimization of other communication network parameters possible, such as area as well as parallel and sequential reconfiguration time.
The correctness of our approach is illustrated by applying our methodology to a concrete architecture, namely the weakly programmable processor array (WPPA) [7]. This architecture belongs to a class of computer architectures that consist of an array of processing elements with reconfigurable interconnections and limited programming possibilities, see Fig. 1. However, we would like to emphasize that our approach is not limited to WPPAs and can be applied to any other regular 2D reconfigurable architecture. To our knowledge, no similar solution has been published.
Related work. Routing of communication requests in reconfigurable networks is a topic of huge relevance in the area of billion-transistor SoCs. Here, two different directions can be distinguished. The first aims at dynamically establishing connections between hardware components by switching wires. This area is called circuit-switched routing, and the approach presented here also belongs to this class. Especially for fine-grained reconfigurable hardware systems (e.g., FPGAs), concepts such as reconfigurable multiple buses [1] have been recently studied. In [6], a template-based approach is presented where it is possible to set a fixed routing path between modules by attaching these templates through dynamic partial reconfiguration.
In case of dynamically changing communication requests, the other main stream is based on message-passing networks-on-a-chip (NoC), see for instance [2, 5]. Here, components send messages (packets) which are routed through router nodes to their destinations. In the context of reconfigurable FPGA designs, these capabilities have also been studied. For example, in [3], a 2D NoC concept called DyNoC (Dynamic Network-on-a-Chip) that can be dynamically reconfigured at run-time is presented. The concept applies modified XY-routing in a mesh-like NoC that can also handle obstacles given by placed modules on the FPGA. Unfortunately, the cost of NoC solutions can be very high. Also, the delay of communications can be substantial, e.g., in case of congestion or multi-hop routing. Finally, memory elements must also be provided in router nodes to store data packets temporarily. For cycle-based reconfigurable coarse-grained architectures such as the WPPAs that we are considering in this paper, a routing network would be much too slow, as we demand the ability to switch communications on a cycle basis here. Also, the connections themselves should be delay-free. Therefore, circuit routing is the only viable solution here.
A communication-conscious mapping approach for WPPAs, based on integer linear programming, is presented in [10]. However, this approach considers only the static mapping of a single application. In [9], the authors present mapping heuristics to merge datapaths and to share interconnection structures in reconfigurable architectures. The minimization of the interconnection network's size also leads to reduced reconfiguration times. Whereas the optimization goals are similar to ours, we present an exact and resource-constrained method where a limited number of channels between the processor elements can be considered.
Routing has been defined before using constraint satisfiability encodings. In [11], the authors encode FPGA detailed routing problems using a SAT definition. They can find a routing or prove that a particular global routing does not have a detailed routing for a given number of tracks per channel. Unlike our approach, their formulation cannot be used for optimization of particular features of routing.
Organization. Section 2 introduces WPPAs, shows how an application can be executed on them, and presents a formula for calculating the parallel reconfiguration time overhead. Section 3 is devoted to a small example introducing the problem of routing data dependencies for a given algorithm. The optimization problem is discussed in Section 4. Finally, a case study is presented (Section 5) with promising optimization results.
Weakly programmable processor arrays
A WPPA architecture consists of an array of weakly programmable processor elements (WPPEs), each having a VLIW (Very Long Instruction Word) structure, see Fig. 1 (right). The parameters of each WPPE can be customized at synthesis-time with respect to the number and types of functional units such as adders, subtractors, multipliers, shifters, and modules for logical operations. Furthermore, special storage elements at the inputs of each WPPE have been proposed to store incoming data [7]. The instruction set of a single WPPE is also minimized according to domain-specific computational needs.
Figure 2. Multiplexer architecture of an interconnect cell (refinement of Fig. 1 (left)).
In order to model a vast set of different interconnect topologies, dynamically reconfigurable interconnect structures have been investigated by defining a switchable interconnect cell structure to which a WPPE is connected. In our example in Fig. 1 (left), the interconnect cells form a regular 2D-mesh topology.
A set of VLIW programs and an interconnect configuration together form a so-called setup. The global setup memory contains several processor array setups, one for each algorithm which can be processed by the array. Since an on-chip setup memory consumes logic resources, it has to be as small as possible.
A given WPPA is characterized not only by the number and types of processing elements and their internal structure but also by the interconnection capabilities. These are given in terms of (a) the number of channels in each direction (east, west, north, south), see Fig. 2, and (b) the number of ports of a processing element.
Algorithm reconfiguration on WPPAs requires two steps: program reconfiguration, which is beyond the scope of this paper, and interconnect reconfiguration. Interconnect reconfiguration is possible through overwriting configuration registers located in each interconnect cell, as shown in Fig. 2. Each connection to other PEs via so-called channels may be changed dynamically by loading a new value into each of these configuration registers. Note that if there are several multiplexers in one direction, the select signals are concatenated in one reconfiguration register. This makes it possible to reconfigure all channels in the same direction within one cycle.
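The concatenation of select signals into one register per direction can be illustrated with a small Python sketch. The bit layout (first multiplexer in the least significant bits), the select width of ceil(log2(inputs)) bits, and the function names are assumptions made for illustration only; they are not taken from the WPPA implementation.

```python
from math import ceil, log2

def select_width(num_inputs):
    """Bits needed to select one of num_inputs multiplexer inputs (assumed encoding)."""
    return max(1, ceil(log2(num_inputs)))

def pack_selects(selects, mux_inputs):
    """Concatenate per-multiplexer select values into one register word.

    selects[k] is the chosen input of multiplexer k, mux_inputs[k] its number
    of inputs. Multiplexer 0 occupies the least significant bits (hypothetical
    layout). Returns the register word and its total width in bits.
    """
    word, shift = 0, 0
    for sel, n in zip(selects, mux_inputs):
        if not 0 <= sel < n:
            raise ValueError(f"select value {sel} out of range for {n} inputs")
        word |= sel << shift
        shift += select_width(n)
    return word, shift

# Two multiplexers in one direction: a 2-input and a 4-input one.
word, width = pack_selects([1, 2], [2, 4])
print(f"register = {word:0{width}b} ({width} bits)")
```

Writing this single word reconfigures both multiplexers of the direction at once, which is the property the text relies on.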
The corresponding time overhead for configuration of the interconnect in one multicast domain is given as follows:
where
The variable o is in our case a constant which reflects the overhead for the setup of a configuration automaton. In a WPPA framework implemented in FPGAs, an experimental running example requires a setup time of o = 4 cycles.
Algorithm class and mapping
The starting point of the design flow for mapping algorithms onto WPPAs is the class of so-called dynamic piecewise regular algorithms (DPRAs) [4]. This class of algorithms describes loop nests containing uniform data dependencies by a set of recurrence equations.
Consider the example of an FIR filter. The application of space-time transformation [4] leads to a 2D processor array structure as shown in Fig. 3. Now, for the configuration of the communication interconnect, we have to look at the algorithm's data dependencies, e.g., the dependency between variable a and a as vector $(1, 0)^T$, between u and u as vector $(1, 1)^T$, and finally between y and y as vector $(0, 1)^T$.
For a given algorithm $A_i$, we can group the set of data dependencies, like in the FIR algorithm above, into a set $\{d_{i,1}, \ldots, d_{i,D(i)}\}$ of two-dimensional vectors in the following. Note that for each algorithm $A_i$ to be executed at runtime, there may be a different set of data dependencies.
On the physical processor array, these data dependencies result in connection dependencies as can be seen in Fig. 3. 
Optimization of communication overhead
Now, we present our framework for statically minimizing the interconnect configurations for a set of time-multiplexed algorithms such that the reconfiguration overhead in terms of required multiplexers is minimized as a secondary goal. We will see that this problem involves solving routing problems. We model the problem using a constraint programming over finite domains formulation [8].
Modeling of all connection dependencies $d_{i,j}$ for each algorithm $A_i$ ($1 \le i \le K$) is done using a set of 2D arrays of cells of size $N \times M$. $N$ and $M$ denote the maximum horizontal and vertical part of the Manhattan distance of all connection dependencies. Therefore, the optimization problem is independent of the processor array's size. Each cell $Cell_{n,m}$ is identified by its $(n, m)$ coordinates in the 2D grid, see Fig. 4.
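As a small illustration of how the grid dimensions follow from the dependency vectors, the Python sketch below computes the maximum horizontal and vertical components over all algorithms. The function name and the second dependency set are hypothetical; whether the cell array then needs exactly that many or one more cell per dimension is not settled by the text and is left open here.

```python
def grid_extent(dependency_sets):
    """Return the maximum horizontal and vertical parts of the Manhattan
    distance over all dependency vectors of all algorithms."""
    max_h = max(abs(dx) for deps in dependency_sets for dx, _ in deps)
    max_v = max(abs(dy) for deps in dependency_sets for _, dy in deps)
    return max_h, max_v

# Dependencies of the FIR filter example: (1,0)^T, (1,1)^T and (0,1)^T.
fir = [(1, 0), (1, 1), (0, 1)]
# A made-up second algorithm, used only to show the effect of a longer hop.
other = [(2, 0), (0, 1)]

print(grid_extent([fir]))
print(grid_extent([fir, other]))
```

The result depends only on the dependency vectors, never on the number of PEs, which is why the optimization problem is independent of the array size.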
It can be noted that connections between different PEs can be routed using different resources. In principle we need to make a number of decisions that can be grouped into two classes.
• decisions on the selection of the path from source PE to destination PE that passes different cells, and
• decisions on the selection of different connections in the channels between cells.
Both decisions influence the number and size of multiplexers that need to be included to reconfigure static connections between cells when reconfiguring from one algorithm to another. They need to be considered simultaneously. For this purpose, we have defined a constraint programming model.
Before defining the model, we need to point out that all connections need to be implemented in a single cell, since all the cells in the architecture execute the same program and use the same channel connections to transfer the data (so-called modulo routing [12]). This means that we can reduce the optimization problem to a single cell that contains all routed connections for all connection dependencies of the given algorithms. Normally, the implemented connections for one algorithm cannot use the same connection channels. A cell that implements all connections from Fig. 4 is depicted in Fig. 5.
Our communication minimization problem is split into two steps to reduce the complexity of our method. In the first step, all possible paths between interconnect cells and PEs in the system corresponding to all $d_{i,j}$, $1 \le j \le D(i)$ and $1 \le i \le K$, are found using a CP formulation. In the second step, another CP formulation is used for multiplexer area and reconfiguration time minimization. We will see later that the area and reconfiguration time are expressed in terms of size and number of multiplexers. It encodes all paths using model variables and assigns connection dependencies to connections of channels implementing identified paths.
All paths corresponding to a given dependency $d_{i,j}$ are found by applying a SimplePath constraint to the Simplified System Graph (SSG). The SSG is defined as a directed graph. Vertices are cells and processing elements, while edges represent inter-cell channels and channels between processing elements and cells.
The constraint that finds all simple paths in a graph takes as parameters a graph, a source vertex, and a destination vertex (i.e., PEs in our case). This constraint can be combined with other constraints to generate paths of a limited length, which is useful in practice.
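The effect of such a constraint can be mimicked outside a constraint solver by a plain depth-first enumeration. The following Python sketch is not the SimplePath constraint itself; it is a stand-in that lists every simple path between two vertices of a directed graph, with an optional bound on the number of edges. The small graph and its vertex names are hypothetical.

```python
def simple_paths(graph, source, target, max_len=None):
    """Yield every simple path (no vertex visited twice) from source to target.

    graph maps each vertex to its successors; max_len, if given, bounds the
    number of edges on a path.
    """
    path = [source]

    def extend(vertex):
        if vertex == target:
            yield list(path)
            return
        if max_len is not None and len(path) - 1 >= max_len:
            return
        for succ in graph.get(vertex, ()):
            if succ not in path:
                path.append(succ)
                yield from extend(succ)
                path.pop()

    yield from extend(source)

# Hypothetical 2x2 arrangement of cells with one source and one destination PE.
graph = {
    "PE_src": ["c00"],
    "c00": ["c10", "c01"],
    "c10": ["c00", "c11"],
    "c01": ["c00", "c11"],
    "c11": ["c10", "c01", "PE_dst"],
}

for p in simple_paths(graph, "PE_src", "PE_dst"):
    print(" -> ".join(p))
```

Passing a small max_len prunes detours early, which corresponds to combining the path constraint with a length limit.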
The identified paths need to be encoded in our model as variables. This is achieved by a special constraint (ExtensionalSupport in our case) that defines a relation between model variables using a table of values. For this purpose, we use 0/1 variables $Ch^{i,j}_{(n,m)(n',m')}$ that define for algorithm $i$ and dependency $j$ whether a directional channel from cell $(n, m)$ to cell $(n', m')$ is used (value 1) or not (value 0). Table 1 presents the variables that are set to one for different paths for connection dependencies $(1, 0)^T$ and $(1, 1)^T$ from Fig. 4. All other variables in each row are equal to zero.
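The table handed to such an extensional constraint can be sketched as follows: each enumerated path becomes one row of 0/1 values, one column per directional channel. The Python code below only builds the table; the channel list, the cell coordinates, and the function name are assumptions made for illustration and do not reproduce Table 1.

```python
def path_table(paths, channels):
    """Build one 0/1 row per path; column k is 1 iff channels[k] lies on the path.

    A path is a list of cell coordinates, a channel is a (from_cell, to_cell) pair.
    """
    rows = []
    for path in paths:
        used = set(zip(path, path[1:]))
        rows.append(tuple(1 if ch in used else 0 for ch in channels))
    return rows

# Directional channels of a hypothetical 2x2 grid, in a fixed column order.
channels = [
    ((0, 0), (1, 0)),
    ((0, 0), (0, 1)),
    ((1, 0), (1, 1)),
    ((0, 1), (1, 1)),
]

# Two alternative routes for a dependency (1,1)^T starting at cell (0,0).
paths = [
    [(0, 0), (1, 0), (1, 1)],
    [(0, 0), (0, 1), (1, 1)],
]

for row in path_table(paths, channels):
    print(row)
```

A table constraint over the channel variables of one dependency then admits exactly these rows, so choosing a row amounts to choosing a path.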
In our model we also maintain explicitly input/output ) defines the internal cell connection from "West" to "South", indicated as a dotted line in Fig. 5.
In this way, we provide the opportunity to select one of the paths for implementing a given connection dependency. For a selected path, a number of specific channel connections need to be determined. They implement the communication between cells and between a cell and a PE. In our example depicted in Fig. 5, each channel has two connections, and communication dependencies can use any of the connections, but normally they cannot share a channel connection. Sharing is only possible for multicast communications, which will be discussed later.
The channel selection is implemented using a channel occupation table. It is in turn implemented in our model using the Diff2 constraint, which assures that no pair of rectangles specified in a list of rectangles overlaps. The idea is depicted in Fig. 6. In this formulation, each rectangle represents a channel connection, and the constraint assures that two connections will not use the same connection in the channel. A rectangle is specified using its origin $(x, y)$ and lengths in both directions $l_x$ and $l_y$, i.e., using the list $[x, y, l_x, l_y]$. All channels connecting two cells or a cell and a PE in a given direction are collected in a list of rectangles. For example, in direction "South" from for all $1 \le j \le D(i)$ are used for selection of connections. For our example it is defined using constraints (2). Note that we use a special feature of the Diff2 constraint that considers rectangles with length zero as non-existing. This makes it possible to consider only selected connections and assure that they do not use the same connection channel at the same time.
, 1],
For multicast communication this condition is relaxed. In addition we also enforce that the same connection is used for multicast communications in the same direction (constraint 3).
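The non-overlap condition for unicast connections, including the treatment of zero-length rectangles as non-existing, can be checked with a few lines of Python. This is a checker for a given assignment, not the Diff2 propagator of a constraint solver. The rectangle encoding used in the example, [connection index, 0, 1, used], where used is the 0/1 channel variable of a dependency, is an assumption chosen for illustration and is not the encoding of constraints (2).

```python
from itertools import combinations

def overlaps(r1, r2):
    """True if two rectangles [x, y, lx, ly] share area; a rectangle with a
    zero length in either direction is treated as non-existing."""
    x1, y1, lx1, ly1 = r1
    x2, y2, lx2, ly2 = r2
    if 0 in (lx1, ly1, lx2, ly2):
        return False
    return (x1 < x2 + lx2 and x2 < x1 + lx1 and
            y1 < y2 + ly2 and y2 < y1 + ly1)

def diff2_holds(rectangles):
    """True if no pair of rectangles in the list overlaps."""
    return not any(overlaps(a, b) for a, b in combinations(rectangles, 2))

# Two dependencies routed through the same channel (hypothetical encoding).
same_connection = [[0, 0, 1, 1], [0, 0, 1, 1]]   # both use connection 0
two_connections = [[0, 0, 1, 1], [1, 0, 1, 1]]   # connections 0 and 1
one_unused = [[0, 0, 1, 1], [0, 0, 1, 0]]        # second path not selected

print(diff2_holds(same_connection))
print(diff2_holds(two_connections))
print(diff2_holds(one_unused))
```

With this encoding, a dependency whose path does not pass through the channel contributes a rectangle of height zero and therefore never blocks a connection.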
We introduce vectors $Tab^i_{Dir}$ to collect all inputs that are connected to a given cell output connection for a given algorithm $i$ and connection direction $Dir$. These vectors are defined for the directions south, west, north, east, and PE. The information for these vectors is gathered using constraint (4), which is formulated for all internal cell connections. For the example of connection dependencies presented in Table 1, only one internal cell connection exists, from "West" to "South", and therefore only one constraint has to be formulated for this output direction (4). The vectors $Tab^i_{Dir}$ are later used for the formulation of cost functions. It can be noted that the number of model variables is radically reduced compared to our previous model [12].
The model defines all communications for a single algorithm. These models are then combined into a single model that contains all algorithms with their connection dependencies. In this model, we define a number of cost functions to reduce the communication overhead related to reconfiguration. To define these cost functions, the tables $Tab^i_{Dir}$ defined in (4) are used. The main cost function defines a condition that specifies when a multiplexer is needed. A multiplexer needs to be included in a WPPE if there exist two paths implementing connection dependencies for two different algorithms $A_i$ and $A_{i'}$, and there exists an output connection that has inputs from two different connections. This condition is defined using our tables, as depicted in Fig. 7. In this table, each vector $Tab^i_{Dir}$ (a column in the array) defines the input connection numbers connected to the given output connections in algorithm $i$. Each row, on the other hand, defines, for one output connection, the input connection numbers for all algorithms. Therefore, the number of different connection numbers in a row defines the number of inputs to this particular output, and thus a multiplexer with this number of inputs. In this case, we do not consider zeros, since numbering starts from one.
Two variables, $MultSize_{Dir,t}$ and $MultExist_{Dir,t}$, are associated with each row. The first one defines the multiplexer's size, and the second defines whether the multiplexer is needed or not. In our experiments, we consider two optimization objectives: the multiplexers' area and the reconfiguration time overhead. The related cost functions are specified using the variables $MultSize_{Dir,t}$ and $MultExist_{Dir,t}$. The reconfiguration time overhead is defined according to Eq. (1) and expressed below with constraints (5).
Dir t mux Dir
The above constraints use reified constraints, i.e., constraints of the form Cond ⇔ B that reflect the satisfiability of a condition Cond in a 0/1 variable B. The area overhead is defined below as a weighted sum over the different types of multiplexers, i.e., two-input, three-input, etc.
The constraint Count(K, List, Var), used in the above formulation, assures that the number of elements of List with value K equals Var.
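How the multiplexer variables and the area cost follow from the rows of the table can be sketched in Python for a fixed assignment. The sketch evaluates the quantities directly instead of posting reified and Count constraints; the example rows and the area weights per multiplexer type are made-up values, and the function names are hypothetical.

```python
def mux_sizes(rows):
    """For each output connection (row), count the distinct non-zero input
    connection numbers over all algorithms; zero marks an unused output."""
    return [len({v for v in row if v != 0}) for row in rows]

def mux_exists(sizes):
    """A multiplexer is needed only if an output has at least two different inputs."""
    return [1 if s >= 2 else 0 for s in sizes]

def count(k, values):
    """Number of elements of values equal to k, as assured by Count(K, List, Var)."""
    return sum(1 for v in values if v == k)

def area(sizes, weights):
    """Weighted sum over multiplexer types; weights maps input count to cost."""
    return sum(w * count(k, sizes) for k, w in weights.items())

# Rows for one direction: one entry per algorithm (three algorithms here).
rows = [
    [1, 1, 0],   # always driven by input 1: a plain wire suffices
    [1, 2, 0],   # two different inputs: 2-input multiplexer
    [1, 2, 3],   # three different inputs: 3-input multiplexer
    [0, 0, 0],   # output never used
]
weights = {2: 2, 3: 3}   # hypothetical area cost per multiplexer type

sizes = mux_sizes(rows)
print(sizes, mux_exists(sizes), area(sizes, weights))
```

In the constraint model these values are decision-dependent, so the same relations are expressed over variables rather than computed once.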
Experimental results
To validate our approach for area and reconfiguration time overhead minimization, we have carried out experiments using six algorithms $A_i$ with different connection dependencies, as presented in Table 2. Each experiment used different combinations of these algorithms to evaluate the reconfiguration overhead. The algorithms represent both existing algorithms and synthetic benchmarks. Algorithm $A_1$ represents the connection dependencies of a matrix-matrix multiplication algorithm, and algorithm $A_2$ is an FIR filter algorithm. They are two frequently used digital signal processing algorithms. The connection dependencies of $A_3$ stem from a Sobel image filtering example. Algorithms $A_4$ to $A_6$ represent synthetic benchmarks. All of the following experiments have been run on a 2 GHz Intel Core Duo under the Mac OS X operating system.
In Table 3, we present our results obtained for the minimization of area overhead and sequential reconfiguration time. We also compare the obtained results with a naive approach. All experiments are carried out for the minimal assumed number of channel connections and input/output ports needed for routing all dependencies. As can be seen, the improvements in area and sequential reconfiguration time of our method over the naive approach are rather large. The average area improvement is 62% and the average sequential reconfiguration time improvement is 41%. Table 4 presents the results of minimizing the parallel reconfiguration time, again comparing the naive with our optimized solutions. We specify the number of dimensions that use multiplexers. The reconfiguration time is shorter for parallel reconfiguration when optimized with our method, and on average we obtain a 32% improvement.
Conclusions and future work
In this paper, we have presented a constraint programming formulation for the minimization of area as well as sequential and parallel reconfiguration time overhead for regular reconfigurable architectures. Our system also makes it possible to perform a design space exploration that involves, for example, trading multiplexers against channel connections. The experimental results indicate large savings of area, especially for applications that have a larger number of algorithms and a large number of PEs.
The following extensions are possible for future work. First, a different cost function would make it possible to minimize, for example, the number of channel connections with a given limit on the number of multiplexers, or the maximal length of routed paths for a given set of algorithms.
The search space for the considered problem is large, and for large problems we cannot find or prove optimality of our solutions. This is partially caused by the existence of many symmetrical solutions with the same cost. These could be eliminated by introducing additional symmetry elimination constraints. We leave this for future work.
