Area and Reconfiguration Time Minimization of the Communication Network in Regular 2D Reconfigurable Architectures by Wolinski, Christophe et al.
Area and Reconfiguration Time Minimization of the
Communication Network in Regular 2D Reconfigurable
Architectures
Christophe Wolinski, Krzysztof Kuchcinski, Jürgen Teich, Frank Hannig
To cite this version:
Christophe Wolinski, Krzysztof Kuchcinski, Jürgen Teich, Frank Hannig. Area and Reconfiguration Time Minimization of the Communication Network in Regular 2D Reconfigurable Architectures. International Conference on Field Programmable Logic and Applications (FPL 2008),
Sep 2008, Heidelberg, Germany. pp.391-396, 2008, <10.1109/FPL.2008.4629969>. <inria-
00451667>
HAL Id: inria-00451667
https://hal.inria.fr/inria-00451667
Submitted on 29 Jan 2010

Area and Reconfiguration Time Minimization of the Communication Network in
Regular 2D Reconfigurable Architectures
Christophe Wolinski
Univ. of Rennes I/IRISA
France
Krzysztof Kuchcinski
Lund University
Sweden
Jürgen Teich, Frank Hannig
Univ. of Erlangen-Nuremberg
Germany∗
Abstract
In this paper, we introduce a constraint programming-based approach for optimizing the area and the reconfiguration time of communication networks for a class of regular 2D reconfigurable processor array architectures. For a given set of different algorithms whose execution is to be switched upon request at run-time, we provide static solutions for the optimal routing of data between processors. For the first time, we also support multicast data transfers. The routing found by our method minimizes the area or the reconfiguration time of the communication network when switching between the execution of these algorithms. In fact, when switching, the communication network can be reconfigured in just a few clock cycles. Moreover, the communication network area can be reduced significantly (62% on average).
1 Introduction
In this paper, we focus on the static optimization of area and reconfiguration time for the communication network of regular 2D reconfigurable processor array architectures. To solve these problems (a) jointly and (b) not for a single algorithm but for a whole set of algorithms, a constraint programming approach has been applied.
Previously, we introduced an abstract model for minimizing the number of multiplexers [12]. That model is limited and covers only unicast data transfers. In this paper, we propose a new optimized formulation that supports multicast data transfers. Moreover, we define new cost functions that make it possible to minimize other communication network parameters, such as area as well as parallel and sequential reconfiguration time.
The correctness of our approach is illustrated by applying our methodology to a concrete architecture, namely the weakly programmable processor array (WPPA) [7]. This architecture belongs to a class of computer architectures that consist of an array of processing elements with reconfigurable interconnections and limited programming possibilities, see Fig. 1. However, we would like to emphasize that our approach is not limited to WPPAs and can be applied to any other regular 2D reconfigurable architecture. To our knowledge, no similar solution has been published.
∗This work has been supported in part by the German Science Foundation (DFG) in project under contract TE 163/13-1 and TE 163/13-2.
Figure 1. Example of a WPPA. All parameters such as number and type of processor elements and their interconnect structure can be defined at synthesis-time according to domain-specific needs.
Related work. Routing of communication requests in reconfigurable networks is a topic of great relevance for billion-transistor SoCs. Two different directions can be distinguished. The first aims at dynamically establishing connections between hardware components by switching wires. This area is called circuit-switched routing, and the approach presented here belongs to this class. For fine-grained reconfigurable hardware systems (e.g., FPGAs) in particular, concepts such as reconfigurable multiple buses [1] have recently been studied. In [6], a template-based approach is presented in which a fixed routing path between modules can be set by attaching these templates through dynamic partial reconfiguration.
In the case of dynamically changing communication requests, the other main direction is based on message-passing networks-on-a-chip (NoC), see for instance [2, 5]. Here, components send messages (packets) which are routed through router nodes to their destinations. These capabilities have also been studied in the context of reconfigurable FPGA designs. For example, [3] presents a 2D NoC concept called DyNoC (Dynamic Network-on-a-Chip) that can be dynamically reconfigured at run-time. The concept applies modified XY-routing in a mesh-like NoC that can also handle obstacles given by modules placed on the FPGA. Unfortunately, the cost of NoC solutions can be very high. Also, the communication delay can be substantial, for example in case of congestion or multi-hop routing. Finally, memory elements must be provided in the router nodes to store data packets temporarily. For cycle-based reconfigurable coarse-grained architectures such as the WPPAs considered in this paper, a routing network would be much too slow, as we demand the ability to switch communications on a cycle basis. Also, the connections themselves should be delay-free. Therefore, circuit routing is the only viable solution here.
978-1-4244-1961-6/08/$25.00 ©2008 IEEE.
Authorized licensed use limited to: UR Rennes. Downloaded on January 29, 2010 at 09:52 from IEEE Xplore. Restrictions apply.
A communication-conscious mapping approach for WPPAs, based on integer linear programming, is presented in [10]. However, this approach considers only the static mapping of a single application. In [9], the authors present mapping heuristics to merge datapaths and to share interconnection structures in reconfigurable architectures. Minimizing the size of the interconnection network also leads to reduced reconfiguration times. Whereas the optimization goals are similar to ours, we present an exact and resource-constrained method in which a limited number of channels between the processor elements can be considered.
Routing has been defined before using constraint satisfiability encodings. In [11], the authors encode FPGA detailed routing problems as SAT instances. They can find a routing or prove that a particular global routing does not have a detailed routing for a given number of tracks per channel. Unlike our approach, their formulation cannot be used to optimize particular features of the routing.
Organization. Section 2 introduces WPPAs, shows how an application can be executed on them, and presents a formula for calculating the parallel reconfiguration time overhead. Section 3 is devoted to a small example introducing the problem of routing data dependencies for a given algorithm. The optimization problem is discussed in Section 4. Finally, a case study with promising optimization results is presented in Section 5.
2 Weakly programmable processor arrays
A WPPA architecture consists of an array of weakly programmable processor elements (WPPEs), each having a VLIW (Very Long Instruction Word) structure, see Fig. 1 (right). The parameters of each WPPE can be customized at synthesis-time with respect to the number and types of functional units such as adders, subtractors, multipliers, shifters, and modules for logical operations. Furthermore, special storage elements at the inputs of each WPPE have been proposed to store incoming data [7]. The instruction set of a single WPPE is also minimized according to domain-specific computational needs.
Figure 2. Multiplexer architecture of an interconnect cell (refinement of Fig. 1 (left)).
To allow a wide range of interconnect topologies to be modeled, dynamically reconfigurable interconnect structures have been investigated by defining a switchable interconnect cell structure to which each WPPE is connected. In our example in Fig. 1 (left), the interconnect cells form a regular 2D-mesh topology.
A set of VLIW programs and an interconnect configuration together form a so-called setup. The global setup memory contains several processor array setups, one for each algorithm that can be processed by the array. Since an on-chip setup memory consumes logic resources, it has to be as small as possible.
A given WPPA is characterized not only by the number and types of processing elements and their internal structure but also by its interconnection capabilities. These are given in terms of (a) the number of channels in each direction (east, west, north, south), see Fig. 2, and (b) the number of ports of a processing element.
Algorithm reconfiguration on WPPAs requires two steps: program reconfiguration, which is beyond the scope of this paper, and interconnect reconfiguration. Interconnect reconfiguration is performed by overwriting configuration registers located in each interconnect cell, as shown in Fig. 2. Each connection to other PEs via so-called channels may be changed dynamically by loading a new value into each of these configuration registers. Note that if there are several multiplexers in one direction, their select signals are concatenated in one reconfiguration register. This makes it possible to reconfigure all channels in the same direction within one cycle.
Figure 3. Processor array implementation of a FIR filter with L = 6 taps.
The corresponding time overhead for configuration of the interconnect in one multicast domain is given as follows:
$$T_{INT_{cfg}} = o + t_{mux_N} + t_{mux_E} + t_{mux_S} + t_{mux_W} + t_{mux_{PE_{in}}} \qquad (1)$$
where
$$t_{mux_x} = \begin{cases} 1 & \text{if \#multiplexers in direction } x \geq 1 \\ 0 & \text{else} \end{cases}$$
The variable o is in our case a constant which reflects the overhead for the setup of a configuration automaton. In a WPPA framework implemented in FPGAs, an experimental running example requires a setup time of o = 4 cycles.
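As a minimal sketch of Eq. (1), the overhead can be evaluated from the number of multiplexers per direction. The function name, the dictionary layout, and the example counts are illustrative assumptions; only the formula and the setup constant o = 4 come from the text.

```python
# Directions of Eq. (1): north, east, south, west, and the PE input.
DIRECTIONS = ("N", "E", "S", "W", "PE_in")


def parallel_reconfig_overhead(mux_count, setup_cycles=4):
    """Evaluate Eq. (1): setup overhead o plus one cycle per direction with >= 1 multiplexer.

    mux_count maps a direction name to the number of multiplexers in that
    direction (hypothetical input format); missing directions count as zero.
    """
    t_mux = {d: 1 if mux_count.get(d, 0) >= 1 else 0 for d in DIRECTIONS}
    return setup_cycles + sum(t_mux.values())


# Illustrative counts: three directions contain at least one multiplexer.
example = {"N": 2, "E": 0, "S": 1, "W": 0, "PE_in": 1}
overhead = parallel_reconfig_overhead(example)
print(overhead)
```

Because all multiplexers of one direction share a configuration register, the count per direction only matters through the indicator "at least one".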
3 Algorithm class and mapping
The starting point of the design flow for mapping algorithms onto WPPAs is the class of so-called dynamic piecewise regular algorithms (DPRAs) [4]. This class of algorithms describes loop nests containing uniform data dependencies by a set of recurrence equations.
The following example of an FIR filter $y_{out}[n] = \sum_{m=0}^{L-1} a[m] \times u[n-m]$ with L taps is used to illustrate the mapping of regular algorithms onto a programmable array architecture. After embedding all coefficients a[m] and filter inputs u[n − m] into a common two-dimensional space, we obtain the description:
par (n>=0 and n<=T-1 and m>=0 and m<=L-1)
a[n,m] = a[n-1,m] if (n > 0);
u[n,m] = u[n-1,m-1] if (n > 0 and m > 0);
x[n,m] = a[n,m] * u[n,m];
y[n,m] = y[n,m-1] + x[n,m] if (m > 0);
y_out[n] = y[n,m] if (m == L-1);
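The recurrence above can be checked against the direct FIR sum with a short sequential sketch. The boundary conditions are assumptions that the description leaves implicit: a[0,m] is initialized with the coefficient a[m], u[n,0] with the input sample u[n], and samples before time 0 are taken as zero.

```python
def fir_direct(coeffs, samples):
    """Direct form: y_out[n] = sum over m of a[m] * u[n - m], with u[k] = 0 for k < 0."""
    taps = len(coeffs)
    return [
        sum(coeffs[m] * samples[n - m] for m in range(taps) if n - m >= 0)
        for n in range(len(samples))
    ]


def fir_recurrence(coeffs, samples):
    """Evaluate the two-dimensional recurrence equations point by point."""
    taps, steps = len(coeffs), len(samples)
    a = [[0] * taps for _ in range(steps)]
    u = [[0] * taps for _ in range(steps)]
    y = [[0] * taps for _ in range(steps)]
    y_out = []
    for n in range(steps):
        for m in range(taps):
            # a[n,m] = a[n-1,m] if n > 0 (assumed boundary: a[0,m] = coefficient m)
            a[n][m] = a[n - 1][m] if n > 0 else coeffs[m]
            # u[n,m] = u[n-1,m-1] if n > 0 and m > 0 (assumed boundaries otherwise)
            if m == 0:
                u[n][m] = samples[n]
            elif n > 0:
                u[n][m] = u[n - 1][m - 1]
            else:
                u[n][m] = 0
            x = a[n][m] * u[n][m]
            # y[n,m] = y[n,m-1] + x[n,m] if m > 0
            y[n][m] = y[n][m - 1] + x if m > 0 else x
        # y_out[n] = y[n,m] if m == L-1
        y_out.append(y[n][taps - 1])
    return y_out


coeffs = [1, 2, 3]
samples = [1, 0, 0, 1, 2]
print(fir_recurrence(coeffs, samples))
```

On a processor array, each (n, m) point would be assigned to a processing element and a time step by the space-time transformation; the loop order used here is just one valid sequential schedule.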
The application of a space-time transformation [4] leads to a 2D processor array structure¹ as shown in Fig. 3.
Now, to configure the communication interconnect, we have to look at the algorithm's data dependencies, e.g., the dependency between variable a and a as vector (1,0)^T, between u and u as vector (1,1)^T, and finally between y and y as vector (0,1)^T.
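The dependency vectors follow directly from the index offsets in the recurrence equations: a value read at index (n + dn, m + dm) and written at (n, m) travels along the vector (−dn, −dm). A small sketch, in which the dictionary of read offsets is a hand-written transcription of the FIR description rather than the output of any tool:

```python
def dependency_vector(read_offset):
    """Return the vector from the producing index point to the consuming one."""
    return tuple(-o for o in read_offset)


# Read offsets taken from the FIR description:
# a[n,m] = a[n-1,m], u[n,m] = u[n-1,m-1], y[n,m] = y[n,m-1] + x[n,m]
fir_reads = {"a": (-1, 0), "u": (-1, -1), "y": (0, -1)}
fir_dependencies = {var: dependency_vector(off) for var, off in fir_reads.items()}
print(fir_dependencies)
```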
For a given algorithm $A_i$, we can group the set of data dependencies, as in the FIR algorithm above, into a set $\{\vec{d}_{i,1}, \dots, \vec{d}_{i,D(i)}\}$ of two-dimensional vectors. Note that for each algorithm $A_i$ to be executed at run-time, there may be a different set of data dependencies.
On the physical processor array, these data dependencies result in connection dependencies, as can be seen in Fig. 3.
¹The depicted array implementation of the FIR filter is able to process three input samples in parallel.
Now, in case the target WPPA does not support these vectors directly as interconnect channels, they must be routed over disjoint routing paths of channels. The corresponding optimization problem, finding such a routing configuration for several algorithms simultaneously, is the subject of Section 4.
4 Optimization of communication overhead
We now present our framework for statically minimizing the interconnect configurations for a set of time-multiplexed algorithms such that the reconfiguration overhead in terms of required multiplexers is minimized as a secondary goal. We will see that this problem involves solving routing problems. We model the problem as a constraint programming formulation over finite domains [8].
All connection dependencies $\vec{d}_{i,j}$ of each algorithm $A_i$ ($1 \le i \le K$) are modeled using a set of 2D arrays of cells of size N × M. N and M denote the maximum horizontal and vertical part of the Manhattan distance of all connection dependencies. Therefore, the optimization problem is independent of the processor array's size. Each cell $Cell_{n,m}$ is identified by its (n,m) coordinates in the 2D grid ($0 \le n \le N-1$, $0 \le m \le M-1$). Each array represents the implementation of a single connection dependency for a given algorithm. For example, the implementation of the two connection dependencies $\{(1,0)^T, (1,1)^T\}$ is depicted in Fig. 4 for N = M = 2.
Figure 4. Example of implementation of connection dependencies $\vec{d}_{i,j} = \{(1,0)^T, (1,1)^T\}$.
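The size of the model grid can be derived from the dependency vectors alone. The sketch below assumes that N and M are the largest absolute horizontal and vertical components plus one, so that both the source and the destination cell fit into the grid; this reading is inferred from the example (N = M = 2 for the dependencies (1,0)^T and (1,1)^T) and is not stated explicitly in the text.

```python
def model_grid_size(dependencies):
    """Return (N, M) for a list of dependency vectors (dn, dm).

    Assumption: one extra cell per dimension so that source and destination
    cells are both inside the grid.
    """
    n_size = max(abs(dn) for dn, _ in dependencies) + 1
    m_size = max(abs(dm) for _, dm in dependencies) + 1
    return n_size, m_size


grid = model_grid_size([(1, 0), (1, 1)])
print(grid)
```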
It can be noted that connections between different PEs can be routed using different resources. In principle, we need to make a number of decisions that can be grouped into two classes:
• decisions on the selection of the path from source PE to destination PE that passes different cells, and
• decisions on the selection of different connections in the channels between cells.
Both kinds of decisions influence the number and size of the multiplexers that need to be included to reconfigure static connections between cells when switching from one algorithm to another. They need to be considered simultaneously. For this purpose, we have defined a constraint programming model.
Before defining the model, we need to point out that all connections need to be implemented in a single cell, since all the cells in the architecture execute the same program and use the same channel connections to transfer the data (so-called modulo routing [12]). This means that we can reduce the optimization problem to a single cell that contains all routed connections for all connection dependencies of the given algorithms. Normally, the connections implemented for one algorithm cannot use the same connection channels. A cell that implements all connections from Fig. 4 is depicted in Fig. 5. Bold lines represent the implementation of connection dependency (1,0)^T, while dotted lines represent connection dependency (1,1)^T.
Figure 5. The cell connections for implementation of $\vec{d}_{i,j} = \{(1,0)^T, (1,1)^T\}$.
Our communication minimization problem is split into two steps to reduce the complexity of our method. In the first step, all possible paths between interconnect cells and PEs in the system corresponding to all $\vec{d}_{i,j}$, $1 \le j \le D(i)$ and $1 \le i \le K$, are found using a CP formulation. In the second step, another CP formulation is used for multiplexer area and reconfiguration time minimization. We will see later that the area and reconfiguration time are expressed in terms of the size and number of multiplexers. The second formulation encodes all paths using model variables and assigns connection dependencies to connections of channels implementing the identified paths.
All paths corresponding to a given dependency $\vec{d}_{i,j}$ are found by applying a SimplePath constraint to the Simplified System Graph (SSG). The SSG is defined as a directed graph. Its vertices are cells and processing elements, while its edges represent the channels between cells and between processing elements and cells.
The constraint that finds all simple paths in a graph takes as parameters a graph, a source vertex, and a destination vertex (i.e., PEs in our case). This constraint can be combined with other constraints to generate paths of limited length, which is useful in practice.
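To illustrate what the path-finding step produces, the following sketch enumerates all simple paths between two cells of a small 2D mesh with a plain depth-first search. It is not the SimplePath constraint itself, it models only the cells (the source and destination PEs are assumed to be attached to the first and last cell), and the `max_hops` parameter is a hypothetical stand-in for the additional length-limiting constraints mentioned above.

```python
def mesh_neighbours(cell, n_cols, n_rows):
    """Yield the cells adjacent to `cell` in an n_cols x n_rows mesh."""
    n, m = cell
    for dn, dm in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nn, mm = n + dn, m + dm
        if 0 <= nn < n_cols and 0 <= mm < n_rows:
            yield (nn, mm)


def simple_paths(src, dst, n_cols, n_rows, max_hops=None):
    """Return all simple (cycle-free) cell paths from src to dst."""
    found = []

    def extend(path):
        if path[-1] == dst:
            found.append(list(path))
            return
        if max_hops is not None and len(path) - 1 >= max_hops:
            return
        for nxt in mesh_neighbours(path[-1], n_cols, n_rows):
            if nxt not in path:  # a simple path visits each cell at most once
                path.append(nxt)
                extend(path)
                path.pop()

    extend([src])
    return found


# Dependencies (1,0)^T and (1,1)^T in the 2 x 2 grid of Fig. 4.
paths_d1 = simple_paths((0, 0), (1, 0), 2, 2)
paths_d2 = simple_paths((0, 0), (1, 1), 2, 2)
print(paths_d1)
print(paths_d2)
```

For each of the two dependencies the search yields two paths, which agrees with the two paths per dependency listed in Table 1.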
The identified paths need to be encoded in our model as variables. This is achieved by a special constraint (ExtensionalSupport in our case) that defines a relation between model variables using a table of values. For this purpose, we use 0/1 variables $Ch^{i,j}_{(n,m)(n',m')}$ that define, for algorithm i and dependency j, whether a directional channel from cell (n,m) to cell (n′,m′) is used (value 1) or not (value 0). Table 1 presents the variables that are set to one for the different paths for connection dependencies (1,0)^T and (1,1)^T from Fig. 4. All other variables in each row are equal to zero.
Table 1. Encoding of simple paths for implementation of connection dependencies.
Dependency                  Path     Variables set to one
$\vec{d}_{i,1} = (1,0)^T$   path 1   $Ch^{i,j}_{PE(0,0)(0,0)}$, $Ch^{i,j}_{(0,0)(1,0)}$, $Ch^{i,j}_{(1,0)PE(1,0)}$
                            path 2   $Ch^{i,j}_{PE(0,0)(0,0)}$, $Ch^{i,j}_{(0,0)(0,1)}$, $Ch^{i,j}_{(0,1)(1,1)}$, $Ch^{i,j}_{(1,1)(1,0)}$, $Ch^{i,j}_{(1,0)PE(1,0)}$
$\vec{d}_{i,2} = (1,1)^T$   path 1   $Ch^{i,j}_{PE(0,0)(0,0)}$, $Ch^{i,j}_{(0,0)(1,0)}$, $Ch^{i,j}_{(1,0)(1,1)}$, $Ch^{i,j}_{(1,1)PE(1,1)}$
                            path 2   $Ch^{i,j}_{PE(0,0)(0,0)}$, $Ch^{i,j}_{(0,0)(0,1)}$, $Ch^{i,j}_{(0,1)(1,1)}$, $Ch^{i,j}_{(1,1)PE(1,1)}$
In our model, we also explicitly maintain input/output pairs of channels that define internal cell connectivity. For example, the pair $(Ch^{i,2}_{(0,0)(1,0)}, Ch^{i,2}_{(1,0)(1,1)})$ defines the internal cell connection from “West” to “South” indicated as a dotted line in Fig. 5.
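The encoding of Table 1 can be sketched as a table of 0/1 rows, one row per path and one column per directed channel. This only builds the kind of value table that an extensional (table) constraint consumes; the vertex labels ("PE", "Cell") and the function names are illustrative assumptions, and no solver is involved.

```python
def path_channels(cells):
    """Return the directed channels used by a cell path, including the PE links."""
    vertices = [("PE", cells[0])] + [("Cell", c) for c in cells] + [("PE", cells[-1])]
    return list(zip(vertices, vertices[1:]))


def support_table(paths):
    """Return (channels, rows); rows[p][c] is 1 if path p uses channel c, else 0."""
    channels = sorted({ch for p in paths for ch in path_channels(p)})
    rows = []
    for p in paths:
        used = set(path_channels(p))
        rows.append([1 if ch in used else 0 for ch in channels])
    return channels, rows


# The two paths for dependency (1,0)^T from Table 1.
paths_d1 = [[(0, 0), (1, 0)], [(0, 0), (0, 1), (1, 1), (1, 0)]]
channels, rows = support_table(paths_d1)
for row in rows:
    print(row)
```

Each row corresponds to one admissible assignment of the channel variables, so selecting a row is equivalent to selecting a path.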
In this way, we provide the opportunity to select one of the paths for implementing a given connection dependency. For a selected path, a number of specific channel connections need to be determined. They implement the communication between cells and between a cell and a PE. In our example depicted in Fig. 5, each channel has two connections, and communication dependencies can use any of the connections, but normally they cannot share a channel connection. Sharing is only possible for multicast communications, which are discussed later.
The channel selection is implemented using a channel occupation table. This table is in turn implemented in our model using the Diff2 constraint, which ensures that no two rectangles in a given list of rectangles overlap. The idea is depicted in Fig. 6. In this formulation, each rectangle represents a channel connection, and the constraint ensures that two connection dependencies will not use the same connection in the channel. A rectangle is specified using its origin (x,y) and its lengths in both directions, lx and ly, i.e., using the list [x, y, lx, ly]. All channels connecting two cells or a cell and a PE in a given direction are collected in a list of rectangles. For example, in direction “South” from Fig. 5, the variables $Ch^{i,j}_{(0,0)(0,1)}$ and $Ch^{i,j}_{(1,0)(1,1)}$ for all $1 \le j \le D(i)$ are used for the selection of connections. For our example, this is defined using constraint (2). Note that we use a special feature of the Diff2 constraint that considers rectangles with length zero as non-existing. This makes it possible to consider only selected connections and to ensure that they do not use the same connection channel at the same time.
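$$\begin{aligned} Diff2(&[1,\, Out^{i,1}_{(0,0)(0,1)},\, Ch^{i,1}_{(0,0)(0,1)},\, 1], \\ &[1,\, Out^{i,1}_{(1,0)(1,1)},\, Ch^{i,1}_{(1,0)(1,1)},\, 1], \\ &[1,\, Out^{i,2}_{(0,0)(0,1)},\, Ch^{i,2}_{(0,0)(0,1)},\, 1], \\ &[1,\, Out^{i,2}_{(1,0)(1,1)},\, Ch^{i,2}_{(1,0)(1,1)},\, 1]) \end{aligned} \qquad (2)$$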
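The effect of constraint (2) can be illustrated with a plain checker for a fixed assignment. The sketch below is not the Diff2 constraint of a CP solver, which propagates over variable domains; it only tests whether given rectangles [x, y, lx, ly] are pairwise non-overlapping and treats rectangles with a zero length as non-existing, as described above.

```python
from itertools import combinations


def rectangles_overlap(r1, r2):
    """Return True if two rectangles [x, y, lx, ly] share interior area."""
    x1, y1, lx1, ly1 = r1
    x2, y2, lx2, ly2 = r2
    if 0 in (lx1, ly1, lx2, ly2):  # zero-length rectangles do not exist
        return False
    return x1 < x2 + lx2 and x2 < x1 + lx1 and y1 < y2 + ly2 and y2 < y1 + ly1


def diff2_holds(rectangles):
    """Return True if no pair of rectangles overlaps."""
    return not any(rectangles_overlap(a, b) for a, b in combinations(rectangles, 2))


# Each rectangle is [1, Out, Ch, 1]: y is the selected connection number,
# lx is the 0/1 channel-usage variable.
ok = [[1, 1, 1, 1], [1, 2, 1, 1], [1, 1, 0, 1]]  # third channel unused (Ch = 0)
clash = [[1, 1, 1, 1], [1, 1, 1, 1]]             # two used channels on connection 1
print(diff2_holds(ok), diff2_holds(clash))
```

With this encoding, two rectangles collide exactly when both channel variables are one and both select the same connection number.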
For multicast communication, this condition is relaxed. In addition, we enforce that the same connection is used for multicast communications in the same direction (constraint (3)).
$$\forall\, 1 \le j, j' \le D(i),\ j \ne j': \quad Out^{i,j}_{(n,m)(n',m')} = Out^{i,j'}_{(n,m)(n',m')} \qquad (3)$$
We introduce vectors $Tab^i_{Dir}$ to collect all inputs that are connected to a given cell output connection for a given algorithm i and connection direction Dir. These vectors are defined for the directions south, west, north, east, and PE. The information for these vectors is gathered using constraints of the form (4), formulated for all internal cell connections. For the example of connection dependencies presented in Table 1, only one internal cell connection exists, from “West” to “South”, and therefore only one constraint has to be formulated for this output direction (4). The vectors $Tab^i_{Dir}$ are later used for the formulation of cost functions. It can be noted that the number of model variables is radically reduced compared to our previous model [12].
$$Tab^i_{South}[Out^{i,2}_{(1,0)(1,1)}] = In^{i,2}_{(0,0)(1,0)} \;\Leftrightarrow\; Ch^{i,2}_{(0,0)(1,0)} = 1 \,\wedge\, Ch^{i,2}_{(1,0)(1,1)} = 1 \qquad (4)$$
Figure 6. The channel connection selection for two connection dependencies $\vec{d}_{i,j}$ and $\vec{d}_{i,j'}$.
The model above defines all communications for a single algorithm. The single-algorithm models are then combined into one model that contains all algorithms with their connection dependencies. In this model, we define a number of cost functions to reduce the communication overhead related to reconfiguration. To define these cost functions, the tables $Tab^i$ defined in (4) are used. The main cost function defines a condition that specifies when a multiplexer is needed. A multiplexer needs to be included in a WPPE if there exist two paths implementing connection dependencies for two distinct algorithms $A_i$ and $A_{i'}$, and there exists an output connection that has inputs from two different connections. This condition is defined using the table depicted in Fig. 7. In this table, each vector $Tab^i_{Dir}$ (a column in the array) defines the input connection numbers connected to the given output connections in algorithm i. Each row, on the other hand, defines, for one output connection, the input connection numbers for all algorithms. Therefore, the number of different connection numbers in a row defines the number of inputs to this particular output and hence a multiplexer with this number of inputs. Zeros are not counted, since connection numbering starts from one. Two variables $MultSize_{Dir,t}$ and $MultExist_{Dir,t}$ are associated with each row. The first one defines the multiplexer's size, and the second defines whether the multiplexer is needed or not. In our experiments, we consider two optimization objectives: the multiplexers' area and the reconfiguration time overhead. The related cost functions are specified using the variables $MultSize_{Dir,t}$ and $MultExist_{Dir,t}$, whose values are calculated directly from the table $Tab_{Dir}$.
Figure 7. The table structure used for computation of cost functions.
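The row-wise evaluation of the table can be sketched as follows for one direction. The data layout (one list per algorithm, indexed by output connection, with 0 meaning "unused") and the example values are assumptions made for illustration; in the actual model these quantities are constrained variables rather than computed values.

```python
def multiplexer_sizes(tab_columns):
    """Compute MultSize and MultExist per output connection for one direction.

    tab_columns[i][t] is the input connection number feeding output
    connection t in algorithm i (0 = output not used by that algorithm).
    """
    n_outputs = len(tab_columns[0])
    sizes, exists = [], []
    for t in range(n_outputs):
        # Distinct non-zero input numbers in row t across all algorithms.
        inputs = {column[t] for column in tab_columns} - {0}
        sizes.append(len(inputs))
        exists.append(1 if len(inputs) >= 2 else 0)
    return sizes, exists


# Hypothetical Tab vectors for direction "South", three algorithms, two outputs.
tab_south = [
    [3, 0],  # algorithm 1
    [3, 5],  # algorithm 2
    [4, 5],  # algorithm 3
]
sizes, exists = multiplexer_sizes(tab_south)
print(sizes, exists)
```

In the example, the first output is driven by inputs 3 and 4 in different algorithms and therefore needs a two-input multiplexer, while the second output always uses input 5 and needs none.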
The reconfiguration time overhead is defined according to Eq. (1) and expressed below with constraints (5).
$$\forall Dir: \quad \Big(\sum_{t} MultExist_{Dir,t} > 0\Big) \Leftrightarrow t_{mux_{Dir}} \qquad (5)$$
$$RecTimeOverheadCostFunction = \sum_{Dir} t_{mux_{Dir}}$$
The above constraints use a reified constraint, i.e., a constraint Cond ⇔ B that reflects the satisfiability of condition Cond in a 0/1 variable B.
Area overhead is defined below as a weighted sum over the different types of multiplexers, i.e., two-input, three-input, etc.
$$List = \{MultSize_{South,1}, \dots, MultSize_{PE,K}\} \qquad (6)$$
$$\forall i \in \{2, \dots, K\}: \quad Count(i, List, MuxSize_i)$$
$$AreaOverheadCostFunction = \sum_{i=2}^{K} (i-1) \cdot MuxSize_i$$
The constraint Count(K, List, Var), used in the above formulation, ensures that the number of elements of List with value K equals Var.
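For a fixed solution, the two cost functions (5) and (6) reduce to simple counting. The sketch below evaluates them from already known MultSize and MultExist values; the input formats and example values are illustrative assumptions, and the counting is done with `collections.Counter` instead of a Count constraint.

```python
from collections import Counter


def area_overhead(mult_sizes):
    """Evaluate (6): sum over i >= 2 of (i - 1) * (number of i-input multiplexers)."""
    mux_count = Counter(size for size in mult_sizes if size >= 2)
    return sum((inputs - 1) * count for inputs, count in mux_count.items())


def reconfig_time_overhead(mult_exist_by_dir):
    """Evaluate (5): number of directions containing at least one multiplexer."""
    return sum(1 for flags in mult_exist_by_dir.values() if sum(flags) > 0)


# Three two-input and four three-input multiplexers.
area = area_overhead([2, 2, 2, 3, 3, 3, 3])
# Hypothetical MultExist flags per direction.
time_cost = reconfig_time_overhead({"South": [1, 0], "West": [0, 0], "PE": [1]})
print(area, time_cost)
```

The area example gives 3·1 + 4·2 = 11. Note that the cost function in (5) omits the constant setup overhead o of Eq. (1), since a constant does not influence the minimization.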
5 Experimental results
To validate our approach for area and reconfiguration time overhead minimization, we have carried out experiments using six algorithms $A_i$ with different connection dependencies, as presented in Table 2. Each experiment used a different combination of these algorithms to evaluate the reconfiguration overhead. The algorithms comprise both existing algorithms and synthetic benchmarks. Algorithm A1 represents the connection dependencies of a matrix-matrix multiplication algorithm, and algorithm A2 is an FIR filter algorithm. Both are frequently used digital signal processing algorithms. The connection dependencies of A3 stem from a Sobel image filtering example. Algorithms A4 to A6 are synthetic benchmarks. All of the following experiments have been run on a 2 GHz Intel Core Duo under the Mac OS X operating system.
In Table 3, we present the results obtained for minimization of area overhead and sequential reconfiguration time. We also compare the obtained results with a naïve approach. All experiments are carried out for the minimal number of channel connections and input/output ports needed for routing all dependencies. As can be seen, the improvements in area and sequential reconfiguration time of our method over the naïve approach are rather large. The average area improvement is 62%, and the average sequential reconfiguration time improvement is 41%.
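The improvement figures are relative reductions with respect to the naïve solution. A minimal sketch of that arithmetic, using the function name `improvement_percent` as an illustrative assumption and the naïve and optimized areas 6 and 1 reported for the combination A1, A2:

```python
def improvement_percent(naive, optimized):
    """Relative reduction of `optimized` with respect to `naive`, in percent."""
    return 100.0 * (naive - optimized) / naive


area_gain = improvement_percent(6, 1)
print(round(area_gain, 2))
```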
Table 2. Six algorithms and their connection dependencies.
Algorithm           $\vec{d}_{i,j}$
A1                  {(−1,0)^T, (0,1)^T, (0,1)^T}
A2                  {(−1,0)^T, (−1,−1)^T, (0,1)^T}
A3 (multicasting)   {(0,−1)^T, (−1,−2)^T, (−1,−1)^T, (−2,−1)^T, (1,0)^T}
A4                  {(2,0)^T, (2,1)^T}
A5                  {(0,1)^T, (1,1)^T}
A6                  {(0,1)^T, (1,0)^T}
Table 3. Area overhead and sequential reconfiguration time of (a) naïve solution and (b) optimized solutions for different combinations of algorithms. The #MUX columns give the numbers of multiplexers with N inputs (N = 2, 3, 4, 5).
Algorithms                #MUX naïve    #MUX optimized   Area naïve   Area opt.   Improvement area   Improvement serial reconf.
A1, A2                    6             1                6            1           83.33%             50.00%
A1, A5                    4             3                4            3           25.00%             12.50%
A1, A6                    5             3                5            3           40.00%             22.22%
A2, A5                    5             1                5            1           80.00%             44.44%
A2, A6                    5             3                5            3           40.00%             22.22%
A5, A6                    5                              5            0           100.00%            55.56%
A1, A2, A5                3, 4          3                11           3           72.53%             53.33%
A1, A2, A6                4, 3          4                10           4           60.00%             42.86%
A1, A5, A6                5, 3          4                11           4           63.64%             46.67%
A2, A5, A6                4, 3          2, 1             10           4           60.00%             42.86%
A1, A2, A5, A6            3, 3, 2       4, 1             15           6           60.00%             47.37%
A1, A2, A3, A4, A5, A6    7, 5, 3, 1    2, 2, 2          30           11          63.33%             55.88%
Table 4 presents the results of minimizing the parallel reconfiguration time, again comparing the naïve with our optimized solutions. We specify the number of dimensions that use multiplexers. The parallel reconfiguration time is shorter when optimized with our method; on average, we obtain a 32% improvement.
6 Conclusions and future work
In this paper, we have presented a constraint programming formulation for the minimization of area as well as sequential and parallel reconfiguration time overhead for regular reconfigurable architectures. Our system also makes it possible to perform a design space exploration that involves, for example, trading multiplexers against channel connections. The experimental results indicate large savings in area, especially for applications that have a larger number of algorithms and a large number of PEs.
The following extensions are possible for future work. First, a different cost function would make it possible to minimize, for example, the number of channel connections under a given limit on the number of multiplexers, or the maximal length of routed paths for a given set of algorithms.
The search space for the considered problem is large, and for large problems we cannot find optimal solutions or prove the optimality of our solutions. This is partially caused by the existence of many symmetrical solutions with the same cost. These could be eliminated by introducing additional symmetry elimination constraints. We leave this for future work.
Table 4. Parallel reconfiguration time of (a) naïve solution and (b) optimized solutions for different combinations of algorithms.
Algorithms                Naïve MUX in dim.   Naïve time (cycles)   Optimized MUX in dim.   Optimized time (cycles)   Improvement parallel reconf.
A1, A2                    3                   7                     1                       5                         28.60%
A1, A5                    3                   7                     1                       5                         28.60%
A1, A6                    3                   7                     1                       5                         28.60%
A2, A5                    3                   7                     1                       5                         28.60%
A2, A6                    3                   7                     1                       5                         28.60%
A5, A6                    3                   7                     0                       4                         42.90%
A1, A2, A5                3                   7                     1                       5                         28.60%
A1, A2, A6                3                   7                     1                       5                         28.60%
A1, A5, A6                4                   8                     1                       5                         37.50%
A2, A5, A6                4                   8                     1                       5                         37.50%
A1, A2, A5, A6            4                   8                     1                       5                         37.50%
A1, A2, A3, A4, A5, A6    5                   9                     2                       6                         33.33%
References
[1] A. Ahmadinia, C. Bobda, J. Ding, M. Majer, J. Teich, S. Fekete, and J. van der Veen. A Practical Approach for Circuit Routing on Dynamic Reconfigurable Devices. In Proc. of the 16th Int. Workshop on Rapid System Prototyping (RSP), pages 84–90, Montreal, Canada, June 2005.
[2] L. Benini and G. Micheli. Network on Chips: A new SoC Paradigm. IEEE Computer, January 2001.
[3] C. Bobda and A. Ahmadinia. Dynamic Interconnection of Reconfigurable Modules on Reconfigurable Devices. IEEE Design & Test, 22(5):443–451, 2005.
[4] F. Hannig and J. Teich. Resource Constrained and Speculative Scheduling of an Algorithm Class with Run-Time Dependent Conditionals. In Proc. of the 15th Int. Conference on Application-specific Systems, Architectures, and Processors (ASAP), pages 17–27, Galveston, TX, USA, Sept. 2004.
[5] A. Hemani, A. Jantsch, S. Kumar, A. Postula, J. Oberg, M. Millberg, and D. Lindqvist. Network on Chip: An Architecture for Billion Transistor Era. In Proc. of the Int. NorChip Conference, Sept. 2000.
[6] M. Hübner, C. Schuck, M. Kühnle, and J. Becker. New 2-Dimensional Partial Dynamic Reconfiguration Techniques for Real-time Adaptive Microelectronic Circuits. In Proc. of the Symposium on Emerging VLSI Technologies and Architectures (ISVLSI), page 97, Washington, DC, USA, 2006.
[7] D. Kissler, F. Hannig, A. Kupriyanov, and J. Teich. A Highly Parameterizable Parallel Processor Array Architecture. In Proc. of the Int. Conference on Field Programmable Technology (FPT), pages 105–112, Bangkok, Thailand, Dec. 2006.
[8] K. Kuchcinski. Constraints-Driven Scheduling and Resource Assignment. ACM Transactions on Design Automation of Electronic Systems (TODAES), 8(3):355–383, July 2003.
[9] N. Moreano, G. Araujo, Z. Huang, and S. Malik. Datapath Merging and Interconnection Sharing for Reconfigurable Architectures. In Proc. of the 15th Int. Symposium on System Synthesis (ISSS), pages 38–43, New York, NY, USA, 2002.
[10] S. Siegel, R. Merker, F. Hannig, and J. Teich. Communication-conscious Mapping of Regular Nested Loop Programs onto Massively Parallel Processor Arrays. In Proc. of the 18th Int. Conference on Parallel and Distributed Computing and Systems (PDCS), pages 71–76, Dallas, TX, USA, Nov. 2006.
[11] M. N. Velev and P. Gao. Comparison of Boolean Satisfiability Encodings on FPGA Detailed Routing Problems. In Proc. of the conference on Design, Automation and Test in Europe (DATE), pages 1268–1273, Munich, Germany, 2008.
[12] C. Wolinski, K. Kuchcinski, J. Teich, and F. Hannig. Optimization of Routing and Reconfiguration Overhead in Programmable Processor Array Architectures. In Proc. of the 16th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), poster, Palo Alto, CA, USA, Apr. 2008.
