Design Space Exploration of Deeply Nested Loop 2D Filtering and 6 Level FSBM Algorithm Mapped onto Systolic Array by B. Bala Tripura Sundari
Hindawi Publishing Corporation
VLSI Design
Volume 2012, Article ID 268402, 15 pages
doi:10.1155/2012/268402
Research Article
Design Space Exploration of Deeply Nested Loop 2D Filtering and
6 Level FSBM Algorithm Mapped onto Systolic Array
B. Bala Tripura Sundari
Department of ECE, Amrita Vishwa Vidyapeetham, Coimbatore 641 112, India
Correspondence should be addressed to B. Bala Tripura Sundari, balasrikanth2003@yahoo.com
Received 26 December 2011; Revised 9 April 2012; Accepted 23 April 2012
Academic Editor: Sungjoo Yoo
Copyright © 2012 B. Bala Tripura Sundari. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
The high integration density of today’s VLSI chips offers enormous computing power that can be exploited by designing parallel computing hardware. The implementation of computationally intensive algorithms, represented as n-dimensional (n-D) nested loop algorithms, on a parallel array architecture is termed mapping. The methodologies adopted for mapping these algorithms onto parallel hardware often use a heuristic search that requires considerable computational effort to obtain near-optimal solutions. We propose a new mapping procedure in which a lower-dimensional subspace (of the n-D problem space) of the inner loop is identified; this subspace contains the computational expression that generates the output or outputs of the n-D problem. The processing element (PE) array is assigned to the identified subspace, and the PE array is reused by assigning it to successive subspaces in consecutive clock cycles/periods (CPs) to complete the computational tasks of the n-D problem. This procedure is used to develop our proposed modified heuristic search to arrive at an optimal design, and the complexity comparisons are given. The MATLAB results of the new search and the design space trade-off analysis using a high-level synthesis tool are presented for two typical computationally intensive nested loop algorithms: the 6D FSBM and the 4D edge detection, alternatively known as the 2D filtering algorithm.
1. Introduction
1.1. Prelude to the New Search Method. Today’s reconfigurable SoCs feature processing elements (PEs) with a significant amount of programmable logic fabric on the same die. Managing the complexity and tapping the full potential of these RSoC architectures present many challenges [1]. A large number of heuristic algorithms have been used in developing novel scheduling and mapping algorithms [2–5]. However, these approaches suffer from long execution times.
n-dimensional (n-D) nested loop representations are used in the formulation of numerous computationally intensive multimedia computing, image processing, and signal processing algorithms. The systolic array design style can effectively exploit the parallelism inherent in a nested loop algorithm and, therefore, reduce processing time [2, 3]. Heuristic procedures are often used to search for the mapping transformations that map nested loop algorithms onto array architectures [4, 5]. Since the effort that goes into the heuristic search is large and complex, the challenge lies in improving the process to reduce the computational effort needed to obtain the mapping results.
Our main contribution in this paper is an augmented approach to the heuristic search. A new method of identifying the subspace to which the PE array is to be assigned is proposed, based on the directional index of the computational expression, which is explained in Section 2. The new vectors and terminologies used in the procedure are defined and elaborated in Section 2.
A modified heuristic search is implemented using the proposed procedure to determine the optimal solution to the n-D problem. The complexity analysis is performed by comparing the search space used in our method with the search space in [4]. The high-level synthesis tool GAUT is used to plot the design space trade-off curves for design space exploration.
The paper is organized as follows: in Section 3, the mapping steps used in the heuristic method and in our proposed modified search method are described. The 4D nested loop formulation of the 2D filtering problem is explained in Section 4. The methodology and the implementation of the above approach for the 2D filtering algorithm and the mapping results are also presented in Section 4. The mapping process for the 6D FSBM is elaborated in Section 5, followed by the results of the heuristic search and of the modified heuristic search for the reduced 4D FSBM in Section 6. The design trade-off results using the high-level synthesis tool GAUT are presented in Section 6. Section 7 discusses the complexity considerations and comparisons. Section 8 gives the conclusion and future work.
2. Terminologies and Definitions
2.1. Axis Vector I. The multidimensional (n-D) problem is associated with an n-dimensional axis vector I. Its components are {i1, i2, i3, ..., in}, where the subscripts of the components belong to the integer set z. The components of the vector I represent the different axis directions of the n-dimensional vector I. The letter K is used to represent a constant vector whose components are different constant numbers, K = {k1, k2, k3, k4, ..., kn}. Each kz represents the upper limit of the corresponding component iz of the vector I. For example, axis component i4 has a value varying from i4 = 1 to i4 = k4.
2.2. Data Representations. The input data to the algorithm is represented using the letter A with subscript z. The input data set consists of a collection of data A1, A2, ..., Ak, where k is some constant integer. Each data item is associated with the axis vector I; for example, A1 is written as A1(I). Every such data item is also associated with a particular axis component iz of I: iz is the axis along which the data A1(I) is read into the n-D algorithm using a set of ports. The input data is represented as A1(i1, I2, i3, ..., in), where the capitalized index means that the input data A1(I) is fed along the i2 axis. The corresponding word size is k2, and the port size required to feed this data is k2. The input data is reused either within the same computation or in different computations within an iteration (depending on the application considered). If the reuse is within the same clock cycle/clock period (CP), it is made possible by propagating the data with zero delay, which is termed data broadcast. The reuse direction of each data item is represented by a directional vector termed the “dependence vector” Dv. Dv is determined as follows: as shown in Listing 1, the data A1 on the LHS is assigned from the data A1(i1, I2, i3 − 1, ..., in) on the RHS in equation (1a) in Listing 1. This means that it is broadcast within the same iteration in the i3 direction and fed along the i2 axis using k2 ports (Figure 1).
The output data is represented as C(i1, i2, i3, ..., In − 1), which means that the data is output along the in axis and propagated along the in direction. For output data, the word “propagation” is replaced by the term “update direction.” The vector associated with the update direction is termed the Computational Trail Vector (CTV). The CTV may be updated with or without delay, as demanded by the application.
The vector representing the update direction in this example is given as
CTV = [0, 0, ..., 1]. (1)
The form of representation of the n-D algorithm in Listing 1, in which the broadcast directions and computations are shown in complete detail, is termed the uniform recurrence relation, or URE form, of the n-D nested loop algorithm. In expression (2) in Listing 1, the computational output data C is written as C[I] with an arrow over the index symbol, which indicates that it is associated with an update direction. The corresponding vector d on the RHS of (2) represents the CTV defined in (1).
The functions f1 and f2 in (3) in Listing 1 are simple commutative operators that are executed independently of any other output component computations of C. They are assumed to be operators with no precedence constraint. f2 in particular need not wait for any past computations; it can proceed independently provided that enough parallel hardware is available. There is only one output computation expression in Listing 1, so Listing 1 is said to have a single CTV with no precedence constraint.
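As an illustration of a recurrence with a single CTV, the following minimal sketch evaluates a 3-D instance of expression (3) with f1 taken as multiplication and f2 as addition. The choice of a matrix product as the concrete instance, the function name `run_ure`, and the 1-based index handling are assumptions made for illustration and are not taken from the paper.

```python
from itertools import product


def run_ure(a1, a2, k1, k2, k3):
    """Evaluate C[i1,i2,i3] = C[i1,i2,i3-1] + A1[i1,i3] * A2[i3,i2].

    The output is updated along i3 only, so the CTV of this
    illustrative instance is [0, 0, 1].
    """
    c = {}
    # product() varies i3 fastest, so C[i1, i2, i3 - 1] is always ready.
    for i1, i2, i3 in product(range(1, k1 + 1), range(1, k2 + 1), range(1, k3 + 1)):
        previous = c.get((i1, i2, i3 - 1), 0)  # one step back along the CTV
        c[(i1, i2, i3)] = previous + a1[i1 - 1][i3 - 1] * a2[i3 - 1][i2 - 1]
    # The final value of each output sits at the end of the i3 axis.
    return [[c[(i1, i2, k3)] for i2 in range(1, k2 + 1)] for i1 in range(1, k1 + 1)]


result = run_ure([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2, 2, 2)
```

Each (i1, i2) position accumulates independently of the others, which is the sense in which the single output expression carries no precedence constraint between output components.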
2.3. n-D Nested Loop Problem. A general n-D nested loop algorithm is illustrated in Listing 1. i1, i2, i3, ..., in are the loop indices. Together they form the n-D (iteration) index space. When the n-D loop computations are represented as a dependence graph (DG), each point in the index space corresponds to a single node in the DG. Theoretically, each node can be assigned a processing element (PE). The n-D iteration space is constructed as follows.
2.3.1. An n-D Iteration Space Computation in Terms of (n−1)-D Subspace. First, an (n−1)-D dependence graph (DG) as in Figure 1, with (n−1)-D indexed positions given by
$$\left[\vec{I}_{n-1}^{\,i_n},\, 1\right] = \{i_1, i_2, i_3, \ldots, i_{n-1}, 1\} \tag{2}$$
is constructed, showing the data input directions and data broadcast directions. Here we show one of the data input directions and data broadcast directions for the sake of illustration. The data specifications or the dependence relations within each cell in the iteration space show the different data broadcast directions, as shown in Figure 1.
The n-D iteration space is constructed by replicating the (n−1)-D iteration space along the in direction. Each (n−1)-D subspace is termed a cell (or iteration). An array of PEs is assigned to this cell, and the computation of the cell is completed in 1 clock period (CP). In the next CP, the PE array is assigned to the next cell along the in direction. The direction of PE array assignment to consecutive subspaces is termed the scheduling direction
Do i1 = 1 to k1;
  Do i2 = 1 to k2;
    Do i3 = 1 to k3;
      ...; Do in = 1 to kn;
        A1(i1, I2, i3, ..., in) = A1(i1, I2, i3 − 1, ..., in);   // fed along i2, broadcast in i3 direction   (1a)
        A2(i1, i2, I3, ..., in) = A2(i1 − 1, i2, I3, ..., in);   // broadcast in i1 direction
        A3(I1, i2, i3, ..., in) = A3(I1, i2, i3 − 1, ..., in);   // broadcast in i3 direction
        An−1(i1, i2, i3, I4, ..., in) = An−1(i1, i2, i3 − 1, I4, ..., in);
        C[I] = f2(C[I − d], f1(A1[I], A2[I]))   (2)
        or
        C[i1, i2, i3, ..., in] = C[I1, I2, i3, ..., in − 1] + A1(i1, I2, i3, ..., in) × A2(I1, i2, i3, ..., in)   (3)
      End Do in; ...;
  End Do i2;
End Do i1;
Listing 1: n-D multidimensional algorithm in URE form.
Figure 1: The input data set and computation in the first (n−1)-D subspace or cell, represented as a DG. (Figure labels: axes i1, i2, and in; kn−1 ports to feed in the An−1 data; An−1 data broadcast along the i2 direction; CTV partially updated in the (n−1)-D iteration cell.)
represented as the scheduling vector sd. As per Listing 1, the CTV is also updated along the same in direction. The CTV is partially updated in CP1, and the update continues as the scheduling advances along the in direction in every CP until the computation is completed in kn CPs.
2.4. Mapping and Scheduling. Any node in the iteration space is N[i1, i2, i3, ..., 1] and is mapped onto the PE array assigned to the iteration subspace. This is termed mapping. The time “t” at which the node N[i1, i2, i3, ..., 1] is mapped onto a PE in the PE array is termed scheduling. The mapping and scheduling are derived for each application in detail in the corresponding sections.
2.5. Computation of n-D Iteration Space Using an (n−2)-D Subspace. In an alternate generalization, we represent the n-D nested loop problem as having an iterative (n−2)-D subspace, as shown in Figure 2. An (n−2)-D dependence graph (DG) with (n−2)-D indexed positions is given by
$$\left[\vec{I}_{n-1}^{\,i_n,\,i_{n-1}},\, 1,\, 1\right] = \{i_1, i_2, i_3, \ldots, i_{n-2}, 1, 1\}. \tag{3}$$
The collection of indexed node positions in (3) is termed the (n−2)-D subspace or hyperplane, which is represented showing the data input directions and data broadcast directions in Figure 2(a). The n-D iteration space computation is completed by replicating the (n−2)-D DG. We expand the iteration space along the in−1 direction, followed by its expansion along the in direction. Each (n−2)-D subspace is termed a cell or iteration cell. An array of PEs is assigned to this cell, and the computation of the cell is completed in 1 CP.
A part of the output expression, termed the computational expression, is assumed to be computed in the inner loop formed by the (n−2)-D iteration space, as depicted in Figure 2(a). The directional index representing the propagation direction or the update direction of the computational expression is termed the Computational Trail Vector (CTV). The CTV is partially updated in CP1, and the update continues as the scheduling advances along the in−1 direction: in the next CP the PE array is assigned to the next iteration cell along the in−1 direction (as shown in Figure 2(b)), completing the first row of computation in kn−1 CPs. The direction in which subspaces are assigned to the PE array in consecutive CPs is termed the scheduling direction and is represented by the scheduling vector sd1, which is along the in−1 direction; the CTV is also updated along the same in−1 direction.
Following this, the PE array is assigned to the next in, giving the scheduling vector sd2 as in Figure 2(b). The total number of CPs used to complete the computation is kn−1 × kn.
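The cell-by-cell reuse of the PE array described above can be sketched as a small schedule generator. This is a minimal sketch under the assumption that the array first steps along the i(n−1) direction (scheduling vector sd1) and then along in (sd2); the function name `cell_schedule` and the dictionary representation are hypothetical.

```python
def cell_schedule(k_nm1, k_n):
    """Map each (n-2)-D iteration cell, addressed by (i_{n-1}, i_n),
    to the clock period (CP) in which the PE array is assigned to it."""
    schedule = {}
    cp = 0
    for i_n in range(1, k_n + 1):          # outer step: sd2 along i_n
        for i_nm1 in range(1, k_nm1 + 1):  # inner step: sd1 along i_{n-1}
            cp += 1
            schedule[(i_nm1, i_n)] = cp
    return schedule


schedule = cell_schedule(k_nm1=4, k_n=3)
total_cps = max(schedule.values())  # equals k_{n-1} * k_n
```

With these bounds the first row of cells occupies CPs 1 to 4 and the whole computation takes 12 CPs, in line with the kn−1 × kn count stated above.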
2.6. n-D to (n − x)-D with CTV and Scheduling Directions. In the previous section, the (n−1)-D subspace is built using a sequence of (n−2)-D subspaces by scheduling along
Figure 2: (a) The (n−2)-D iteration cell. (b) (n−2)-D iteration space with scheduling sd1 along the in−1 direction, with the CTV also updated along the same in−1 direction, followed by sd2 along in, with the CTV also updated along the same in direction.
the appropriate (n−1)th dimension followed by scheduling along the appropriate nth dimension, say along in, with the assumption that the CTV has the same direction as the scheduling vector, which may not always be true. There are two approaches to complete the n-D computation using the (n−2)-D subspaces. The PE array assignment to the (n−2)-D subspace is one order closer to the physical realization. For a practical implementation, this process has to be continued down to the 2D level.
In general, the direction of update of the computational expression is defined as a vector termed the Computational Trail Vector (CTV) of the n-D problem. We identify the corresponding (n − x)-D computational hyperplane in which the CTV lies, forming an (n − x)-D subspace in the n-D space. The PE array is assigned to this plane. This is followed by the reuse of the (n − x)-D plane along the scheduling direction(s).
3. Methodology of Mapping
The mapping methodology used in the heuristic search of the mapping transformation matrix M is explained hereafter. In
Table 1: Dependence vectors for 2D filtering.
Variable              LHS assignment    RHS assignment           Dependence vector
Image data            I(i, j, k, l)     I(i + 1, j, k − 1, l)    [1 0 1 0]
Image data            I(i, j, k, l)     I(i, j + 1, k, l − 1)    [0 1 0 1]
Window coefficient    W(i, j, k, l)     W(i − 1, j, k, l)        [1 0 0 0]
Window coefficient    W(i, j, k, l)     W(i, j − 1, k, l)        [0 1 0 0]
Output                O(i, j, k, l)     O(i, j, k, l − 1)        [0 0 0 1]
Output                O(i, j, k, l)     O(i, j, k − 1, l)        [0 0 1 0]
general, the mapping matrix M is constituted of the timing vector or hyperplane S and the space matrix or vector, also called the space hyperplane, P [6, 7]. Any node in the iteration space N[i1, i2, i3, ..., in] is mapped onto a PE in the PE array using the P matrix at a time “t” determined by the S vector of [4]:
$$M = \begin{bmatrix} S \\ P \end{bmatrix}. \tag{4}$$
3.1. Heuristic Method [4]
Step 1. Generate the iteration space for the n-D nested loop
application under consideration.
Step 2. Find the data dependencies in the algorithm and
formulate the dependence vector Dv.
Step 3. Check the causality constraint using (5), that is, whether the condition
S ∗ Dv > 0 (5)
is satisfied for all dependencies, where Dv is the dependence vector for each data variable (Table 1). Choose those elements s of S which satisfy the condition.
Step 4. Generate or modify the search space for the M matrix (Mset) to satisfy the rank condition [4].
Step 5. Choose a candidate M matrix from the above set.
Step 6. Save the candidate M matrix in Mresult.
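The causality check of Step 3 can be sketched as a filter over candidate timing vectors. This is a minimal sketch using the six dependence vectors listed in Table 1; the helper names `is_causal` and `causal_schedules` are hypothetical, and the rank condition of Step 4 is not modelled here.

```python
from itertools import product

# Dependence vectors for 2D filtering, as listed in Table 1.
DV = [
    (1, 0, 1, 0), (0, 1, 0, 1),   # image data
    (1, 0, 0, 0), (0, 1, 0, 0),   # window coefficients
    (0, 0, 0, 1), (0, 0, 1, 0),   # output
]


def is_causal(s, dependence_vectors=DV):
    """Return True when S * d > 0 holds for every dependence vector d."""
    return all(sum(si * di for si, di in zip(s, d)) > 0 for d in dependence_vectors)


def causal_schedules(elements):
    """Enumerate 4-component timing vectors over `elements` that pass condition (5)."""
    return [s for s in product(elements, repeat=4) if is_causal(s)]


survivors = causal_schedules([0, 1, 2])
```

Because Table 1 contains a unit vector along every axis, only timing vectors with all components positive survive; of the 81 candidates over {0, 1, 2}, 16 remain.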
3.2. The Proposed Modified Heuristic Method. The following are the steps in our approach for modifying the heuristic search, based on the optimal allocation method evolved in Section 2.
Identify the scheduling direction. Once a layer of PEs is assigned to the (n − x)-D subspace, the same array of PEs is to be used in the next computation. This reuse direction is known as the scheduling direction in Section 2. All these conditions are used in the modified heuristic search procedure in the following steps.
Table 2: Delay-edge determination (Step 11 in Section 4.5).
                            Case (i)                          Case (ii)
Window size                 [w1 w2]1 = [3 3]                  [w1 w2]2 = [4 3]
Image size                  [R C] = [1 1]                     [R C] = [1 1]
                            (image size = one window size)
Dv (dependence matrix)      1 0 1 0 0 0                       1 0 1 0 0 0
                            0 1 0 1 0 0                       0 1 0 1 0 0
                            1 0 0 0 0 1                       1 0 0 0 0 1
                            0 1 0 0 1 0                       0 1 0 0 1 0
To determine delays, use    sdd vector = [1 0; 0 1; 1 0; 0 1; 0 0; 0 0] (both cases)
                            sdd ∗ [w1 w2]1 = sdd ∗ [3 1]      sdd ∗ [w1 w2] = sdd ∗ [4 1]
Delays                      [3 1 3 1 0 0]                     [4 1 4 1 0 0]
To determine edge           sde vector = [1 0; 0 1; 0 0; 0 0; 0 1; 1 0] (both cases)
connectivity, use                                             sde ∗ [w1 w2] = sde ∗ [4 1]
ans                         [3 1 3 1 0 0]                     [4 1 0 0 1 4]
Step 7. The scheduling direction, represented by the scheduling vector sd1, is used to prune down the valid M matrices.
Step 8. Prune down the valid M matrices by choosing the (n − x)-D subspace to which the PE plane is assigned, as discussed in Section 2. This is done by identifying the iterative subspace. To summarise, the selected Mmat is obtained by pruning down Mresult using the CTV and the PE plane assignment discussed in Section 2.
Step 9. Evaluate the cost function given in (10) in Section 4.8. If Cost_actual < Cost_required, proceed to Step 6; else return to Step 3.
The plots of Figure 5 compare the heuristic method of Section 3.1 with the modified heuristic search method described in Section 3.2.
3.3. Direct Method
Step 10. The delay edge is calculated by the direct method as explained in Section 4.5. The results are presented in Table 2.
Step 11. The delay-edge matrix in Table 2 is determined using the dependence vectors Dv defined in Tables 1 and 3 for the 2D filtering algorithm.
3.4. Mapping Process. The main objective is to find the M matrix, which consists of the processor allocation vector (Pt) and the scheduling vector (St).
Table 3: Dependence vectors for each variable for 2D filtering.
2D filtering       dv1  dv2  dv3  dv4  dv5  dv6   ∗∗∗
Index variables    I1   I2   w1   w2   O1   O2
i                  1    0    1    0    0    0     Next row window
j                  0    1    0    1    0    0     Next window, column-wise
k                  1    0    0    0    0    1     PE array along k and l
l                  0    1    0    0    1    0
∗p-direction: 2D array represented as a 1D array.
∗∗index variables.
First we take the boundaries of the search space within which Pt and St are to be searched. The selection of the search space is an important factor, because the area and time complexity of the mapping methodology both grow exponentially. Let Ui, Uj, Uk, ..., Un be the upper bounds of an n-dimensional nested loop algorithm. The heuristic followed in this work is to generate the search space from the following element set: {0, 1, Ui, Uj, Uk, (UiUj)}.
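To give a feel for how quickly this search space grows, the following minimal sketch builds the element set and counts the candidate vectors. The last element, printed as (UiUj), is read here as the product of Ui and Uj, which is an assumption because the printed expression is ambiguous; the function names and the example bounds are hypothetical.

```python
def element_set(u_i, u_j, u_k):
    """Build the element set {0, 1, Ui, Uj, Uk, Ui*Uj} without duplicates.

    Reading the last element as the product Ui*Uj is an assumption.
    """
    return sorted({0, 1, u_i, u_j, u_k, u_i * u_j})


def candidate_count(elements, n):
    """Number of n-component vectors whose entries come from `elements`."""
    return len(elements) ** n


elements = element_set(2, 3, 4)
vectors_4d = candidate_count(elements, 4)   # candidates for one of St or Pt
pairs_4d = vectors_4d ** 2                  # candidate (St, Pt) pairs
```

Even for these small illustrative bounds there are 1296 candidate 4-component vectors and 1679616 (St, Pt) pairs, which is why pruning the set early matters.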
3.5. Methods and Resources Used in Obtaining the Mapping Methodology. As a whole, the implementation of the mapping methodology consists of two parts. The first is the heuristic search for the mapping. The heuristic search allows us to obtain the near-optimal solutions and then pick the feasible architecture by pruning the solutions based on Steps 4–9 described in Sections 3.1 and 3.2. The new mapping methodology is explained with respect to the 2D filtering algorithm in Section 4. The modified heuristic method, which applies the new method after the steps in Section 3.1, is implemented using MATLAB to obtain the results of the search procedures of Sections 3.1 and 3.2. The comparative results between the heuristic and the modified heuristic method for the 6D full search block motion estimation (FSBM) algorithm are also given. The second part is the design space exploration of the resultant architecture, which is obtained as explained in the next section.
3.6. High-Level Synthesis (HLS). The input to a high-level synthesis system is the problem represented as a behavioural description in a high-level language. The optimization in high-level synthesis is done at a level higher than the Boolean optimization done by RTL synthesis tools. This is suitable for hardware optimization of DSP and image processing algorithms [8]. This is followed by scheduling and allocation [9]. The GAUT [10] tool used incorporates all the above features and allows design space exploration.
The algorithm is described at a high level in C, and this description is used as the input design specification to the high-level synthesis tool. The high-level synthesis tool is used to obtain the Control Data Flow Graph (CDFG). The CDFG allows the designer to verify the design at a later stage. It allows the tracing of data values as live variables in
(0, 0) (0, 1) (0, 2) (0, 3) (0, 4) (0, 5) (0, 6)
(1, 0) (1, 1) (1, 2) (1, 3) (1, 4) (1, 5) (1, 6)
(2, 0) (2, 1) (2, 2) (2, 3) (2, 4) (2, 5) (2, 6)
(3, 0) (3, 1) (3, 2) (3, 3) (3, 4) (3, 5) (3, 6)
(4, 0) (4, 1) (4, 2) (4, 3) (4, 4) (4, 5) (4, 6)
(5, 0) (5, 1) (5, 2) (5, 3) (5, 4) (5, 5) (5, 6)
Figure 3: Window for the 2D filtering algorithm.
registers associated with the PE hardware. The high-level synthesis tool is also used to obtain the design space exploration results, which give the area versus latency trade-off.
4. Mapping of 2D Filtering Algorithm
4.1. 2D Filtering for Image Processing: A 4D Problem. The problem formulation of Section 2 and the methodology in Section 3 are applied to the 2D filtering problem. 2D filtering or convolution is one of the essential operations in digital image processing required for image enhancement. The grey levels are usually represented with a byte, that is, an 8-bit unsigned binary number ranging from 0 to 255 in decimal. Equation (6) shows the two-dimensional discrete convolution algorithm, where I[x, y] represents the input pixel data of the image, W is the window coefficient, and O is the output image. The movement of the mask window to calculate the window function value for the whole image region is shown in Figure 3:
$$O[x, y] = W[x, y] \otimes I[x, y] = \sum_{i=0}^{4} \sum_{j=0}^{3} I[i + x,\, j + y] \otimes W[i, j]. \tag{6}$$
Digital convolution can be thought of as a moving window of operations, where the window, that is, the mask, is moved from left to right and from top to bottom.
The 2D image filtering problem is a representative example of a 4D nested loop involving 2D convolution, as in Listing 2 and Figure 3. The computation is highly redundant and requires high data reuse. It is considered here for systolic mapping. An image of size 0 to +k1; 0 to +k2 is convolved with a mask of size 0 to +k3; 0 to +k4. The mask coefficients are stored in memory. The significant features of the algorithm are listed in the following section.
4.2. Nested Loop Formulation. The nested loop formulation for the 2D filtering algorithm for image size k1 × k2 and window function size k3 × k4 is given in Listing 2; the same is represented in uniform recurrence equation (URE) form in Listing 3.
4.3. Single Assignment Statement Formulation or Uniform Recurrence Equation (URE) Form of 2D Filtering. The single assignment statement (SAS) form of the 4D edge detection algorithm is given in Listing 3. The dependence vectors for the four-level algorithm have 4 indices, and the index space is generated by varying the four index values up to the upper limit of each index, as in Listing 3. The dependencies give the propagation direction of the input variables and the update direction of the output data. In Listing 3, Wnew represents the mask values in the 2D filtering algorithm that are to be input at a fresh windowing, and Inew indicates the loading of pixel values for a new frame of the image.
4.4. Dependence Vectors for 2D Filtering Algorithm. Listing 3 is commented to bring out the formulation of the dependence vectors in Table 1.
4.5. Delay-Edge Matrix: Direct Method of Determining Delay and Edge Connectivity. The delay-edge mapping is obtained by the product of the dependence matrix (Dv) and the M matrix, as shown in Table 2.
Step 11 in the mapping process uses the dependence matrix to compute the edges and delays as follows: Dv = [0 1 0 1; 1 0 1 0; 0 0 0 1; 0 0 1 0; 0 0 0 1; 0 0 1 0] (Table 1); the first half of each vector in Dv stands for the scheduling direction and the second half for the PE array directions. The first half (termed the sdd vector, sdd) gives the delays associated with the corresponding edges given by the second half (the sde vector, sde):
$$\text{Delays} = \left[\vec{s}_{d2} \times w_{11}\right] \times \vec{s}_{dd}; \qquad \text{Edges} = \left[\vec{s}_{d2} \times w_{11}\right] \times \vec{s}_{de}. \tag{7}$$
This is computed and presented in Table 2.
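The products reported in Table 2 can be reproduced with a few lines. This is a minimal sketch that takes the sdd and sde vectors and the multiplier vectors [3 1] and [4 1] exactly as they appear in Table 2; the helper name `project` is hypothetical.

```python
# Row pairs of the sdd (delay) and sde (edge) vectors as given in Table 2.
SDD = [(1, 0), (0, 1), (1, 0), (0, 1), (0, 0), (0, 0)]
SDE = [(1, 0), (0, 1), (0, 0), (0, 0), (0, 1), (1, 0)]


def project(rows, vec):
    """Multiply each two-component row by the two-component vector `vec`."""
    return [r[0] * vec[0] + r[1] * vec[1] for r in rows]


delays_case_i = project(SDD, (3, 1))    # window size [3 3]
delays_case_ii = project(SDD, (4, 1))   # window size [4 3]
edges_case_ii = project(SDE, (4, 1))
```

Under this reading, the delays evaluate to [3, 1, 3, 1, 0, 0] and [4, 1, 4, 1, 0, 0] for the two window sizes, and the edge vector for the 4 × 3 window evaluates to [4, 1, 0, 0, 1, 4].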
Mapping results for 2D filtering are given in Table 4(a) for the heuristic method and in Table 4(b) for the modified heuristic method.
4.6. Space-Time Mapping Matrix (M) Illustration. The mapping was performed for a 1D array. The generalized form of the space-time mapping matrix M is given here as shown in (3):
$$M = \begin{bmatrix} S_t \\ P_t \end{bmatrix}. \tag{8}$$
For example, let $P_t = [0\ 0\ 3\ 1]$ and $S_t = [4\ 1\ 5\ 1]$.
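How a space-time mapping assigns a time step and a PE to each node can be sketched with the example vectors St = [4 1 5 1] and Pt = [0 0 3 1] of this section. This is a minimal sketch; the index ranges k = 1..4 and l = 1..3 are taken from the 4 × 3 window of Listing 3, and the function names are hypothetical.

```python
S_T = (4, 1, 5, 1)   # scheduling (timing) vector from this section
P_T = (0, 0, 3, 1)   # processor allocation vector from this section


def dot(u, v):
    """Inner product of two equal-length integer vectors."""
    return sum(a * b for a, b in zip(u, v))


def map_node(node):
    """Return (time step, PE index) for an index point (i, j, k, l)."""
    return dot(S_T, node), dot(P_T, node)


def pe_indices(k_values, l_values):
    """Distinct PE indices; P_T ignores i and j, so only (k, l) matter."""
    return sorted({dot(P_T, (1, 1, k, l)) for k in k_values for l in l_values})


first_node = map_node((1, 1, 1, 1))
pes = pe_indices(range(1, 5), range(1, 4))
```

With these vectors the node (1, 1, 1, 1) is executed at time 11 on PE 4, and the 4 × 3 window positions map to 12 distinct indices of a 1D array.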
Table 4
(a) Heuristic search results for 2D filtering
NPE Ncyc M-matrix Reg. cost
12 12 1 0 1 3
0 0 1112 1 0
12 14 1 0 1 3
0 0 1114 1 4
12 18 1 0 1 3
0 0 1113 1 2
12 16 1 0 1 3
0 0 1111 8
12 12 1 0 1 3
0 0 1121 1 0
12 15 1 0 1 3
0 0 1122 1 2
12 17 1 0 1 3
0 0 1120 8
12 13 1 0 1 3
0 0 1124 1 6
12 21 1 0 1 3
0 0 1123 1 4
12 19 1 0 1 3
0 0 1121 1 0
12 15 1 0 1 3
0 0 1104 1 2
12 15 1 0 1 3
0 0 1103 1 0
12 13 1 0 1 3
0 0 1141 1 4
12 21 1 0 1 3
0 0 1142 1 6
12 23 1 0 1 3
0 0 1140 1 2
12 19 1 0 1 3
0 0 1144 2 0
12 27 1 0 1 3
0 0 1143 1 8
12 25 1 0 1 3
0 0 1141 1 4
12 21 1 0 1 3
0 0 1131 1 2
(b) Mapping results using the modified heuristic search process for 2D filtering
Window size = 3 × 3 (2D result arrived at by using Step 11)    Window size = 4 × 3
[pe arr, Ncyc arr, M or Tmat]                                   [pe arr, Ncyc arr, M or Tmat]
NPE  Ncyc  M matrix = [P; S]         NPE  Ncyc  M matrix = [P; S]
9    9     [1 0 0 4; 1 1 2 1]        12   12    [1 0 1 4; 1 1 3 1]
9    9     [1 0 0 4; 1 3 0 4]        12   12    [1 0 1 4; 1 3 1 4]
9    9     [1 0 0 4; 1 3 2 1]        12   12    [1 0 1 4; 1 3 3 1]
9    9     [1 0 0 4; 1 2 0 4]        12   12    [1 0 1 4; 1 2 1 4]
9    9     [1 0 0 4; 1 2 2 1]        12   12    [1 0 1 4; 1 2 3 1]
9    9     [1 0 0 4; 1 4 0 4]        12   12    [1 0 1 4; 1 4 1 4]
9    9     [1 0 0 4; 1 4 2 1]        12   12    [1 0 1 4; 1 4 3 1]
9    9     [1 0 0 4; 1 1 0 4]        12   12    [1 0 1 4; 1 1 1 4]
9    9     [1 0 0 4; 1 1 2 1]        12   12    [1 0 1 4; 1 1 3 1]
9    9     [1 0 0 4; 0 1 0 4]        12   12    [1 0 1 4; 0 1 1 4]
9    9     [1 0 0 4; 0 1 2 1]        12   12    [1 0 1 4; 0 1 3 1]
9    9     [1 0 0 4; 0 3 0 4]        12   12    [1 0 1 4; 0 3 1 4]
9    9     [1 0 0 4; 0 3 2 1]        12   12    [1 0 1 4; 0 3 3 1]
9    9     [1 0 0 4; 0 2 0 4]        12   12    [1 0 1 4; 0 2 1 4]
9    9     [1 0 0 4; 0 2 2 1]        12   12    [1 0 1 4; 0 2 3 1]
9    9     [1 0 0 4; 0 4 0 4]        12   12    [1 0 1 4; 0 4 1 4]
9    9     [1 0 0 4; 0 4 2 1]        12   12    [1 0 1 4; 0 4 3 1]
9    9     [1 0 0 4; 0 1 0 4]
9    9     [1 0 0 4; 0 1 2 1]
9    9     [1 0 0 4; 2 1 0 4]
9    9     [1 0 0 4; 2 1 2 1]
∗Table 4(a) is obtained by searching the space of M matrices without the use of the scheduling vector sd; this search takes more execution time than the search used to obtain Table 4(b), which uses sd as the projection direction for reassignment of the PE plane.
For (i1 = 0; i1 < k1; i1++)
  For (i2 = 0; i2 < k2; i2++)
  {
    O[i1, i2] = 0;
    For (i3 = 0; i3 <= k3; i3++)
      For (i4 = 0; i4 <= k4; i4++)
        O[i1, i2] = O[i1, i2] + I[i1 + i3, i2 + i4] × w[i3, i4];   (8)
    // Output one window function evaluation
    End do i4; End do i3; }
  End i2;
End i1;
Listing 2: 2D filtering algorithm nested loop formulation.
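For readers who want to run the 4-level loop nest, the following minimal sketch follows the structure of Listing 2 using plain Python lists. Restricting the output to positions where the window fits entirely inside the image is an assumption, since the listing does not state how image borders are handled; the function name `filter_2d` is hypothetical.

```python
def filter_2d(image, window):
    """4-level nested loop 2D filtering: O[i1][i2] += I[i1+i3][i2+i4] * w[i3][i4]."""
    rows, cols = len(image), len(image[0])
    w_rows, w_cols = len(window), len(window[0])
    out_rows, out_cols = rows - w_rows + 1, cols - w_cols + 1
    out = [[0] * out_cols for _ in range(out_rows)]
    for i1 in range(out_rows):               # window position, row
        for i2 in range(out_cols):           # window position, column
            acc = 0
            for i3 in range(w_rows):         # row inside the window
                for i4 in range(w_cols):     # column inside the window
                    acc += image[i1 + i3][i2 + i4] * window[i3][i4]
            out[i1][i2] = acc                # one window function evaluation
    return out


filtered = filter_2d([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 0], [0, 1]])
```

The two outer loops move the window and the two inner loops accumulate one output value, which is the part that the paper maps onto the PE array.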
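4.6.1. Delay-Edge Matrix. The delay-edge mapping is obtained by the product of the dependence matrix (D) and the M matrix:
$$\text{DEmat} = M \ast \text{DVmat}, \qquad \begin{bmatrix} 4 & 1 & 5 & 1 \\ 0 & 0 & 3 & 1 \end{bmatrix} \times \begin{bmatrix} 0 & 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 1 & 2 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 3 & 0 & 0 \\ 1 & 3 & 0 & 1 & 4 & 1 \end{bmatrix}. \tag{9}$$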
4.7. Direct Method: Edge Connectivity and Delay Registers. In the direct method, the delay-edge connectivity is obtained from the dependence vectors as given in Table 6.
(1) The delay-edge matrix based on the heuristic search is used to calculate the cost given in Table 4(a), and this can be used to pick a good solution based on minimum cost, but it does not always guarantee feasibility. So we do not consider the delay edge obtained from this method.
(2) Using the proposed modified search algorithm, the numbers of PEs and clock cycles in Table 4(b), namely 9, 9 or 12, 12 (for the window sizes assumed in Table 4(b)), are arrived at after pruning down the search results using the PE-plane subspace based on the CTV.
(3) As mentioned above, the delay-edge connectivity is obtained directly from the dependence matrix by considering the scheduling directions for the delays and the PE directions for the edges, as discussed in Section 4.5 and shown in Table 2, and the architecture is obtained using the mapping results and the direct delay-edge connectivity.
For i = 1 to k1
  For j = 1 to k2
    For k = 1 to 4  // window size = 4 × 3
      For l = 1 to 3
        If (i == 1 && j == 1)
          w(i, j, k, l) = Wnew
        Else if (j == k2)
          w(i, j, k, l) = w(i − 1, j, k, l);  // next i  [Dvw1 = 1 0 0 0]
        Else
          w(i, j, k, l) = w(i, j − 1, k, l);  // next j  [Dvw2 = 0 1 0 0]
        End if
        If (i == 1 && j == 1)
          I(i, j, k, l) = Inew;
        Else if (i == 1 && j > 1)  // first row, second window calculation
          I(i, j, k, l) = I(i, j + 1, k, l + 1);  // move to the next j pixel, j + 1; pixel data: reads in next column of pixels, and
          // old data is moved in the (k, l) plane PE array from (k, l) to (k, l + 1)  [DVx2 = 0 1 0 1]
        Else if (j == k2)  // for next i
          I(i, j, k, l) = I(i + 1, j, k + 1, l);  // move to the next i pixel, i + 1; pixel data: reads in next row of
          // pixels, and old data is moved in the (k, l) plane PE array from (k, l) to (k + 1, l)  [DVx1 = 1 0 1 0]
        End if
        If (l == 1 && k == 1)
          O(i, j, k, l) = 0;
        Else if (k < 3)
          O(i, j, k, l) = O(i, j, k, l − 1) + I(i, j, k, l) ∗ W(i, j, k, l);
        Else
          O(i, j, k, l) = O(i, j, k − 1, l) + I(i, j, k, l) ∗ W(i, j, k, l);
        End;
End For l, k, j, i;
Listing 3: URE algorithm for 2D filtering.
4.8. Mapping Results. The cost function is defined in (10) and is used as the additional constraint mentioned in Step 9 in Section 3.2 for selecting an architecture according to the modified heuristic search method:
Cost = a ∗ processors + b ∗ cycles + c ∗ (delays/reg). (10)
Here a, b, and c are scalar coefficients that weight the corresponding costs in the overall cost function to be minimized.
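A weighted cost of this form can be evaluated and minimized over candidate mappings as sketched below. This is a minimal sketch that reads the third term as the count of delay registers, which is an assumption about the notation in (10); the weights and the candidate tuples are illustrative values, not results from the paper.

```python
def cost(processors, cycles, delay_registers, a=1.0, b=1.0, c=1.0):
    """Weighted cost: a * processors + b * cycles + c * delay registers."""
    return a * processors + b * cycles + c * delay_registers


def cheapest(candidates, **weights):
    """Pick the (processors, cycles, delay_registers) tuple with the lowest cost."""
    return min(candidates, key=lambda cand: cost(*cand, **weights))


# Illustrative candidates only: (number of PEs, number of cycles, delay registers).
candidates = [(12, 12, 10), (9, 9, 8), (12, 14, 14)]
best = cheapest(candidates)
best_cycles_only = cheapest(candidates, a=0.0, b=1.0, c=0.0)
```

Changing the weights a, b, and c shifts the trade-off between area (PEs and registers) and latency (cycles), which is the knob the search uses when comparing the actual cost with the required cost.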
4.9. Architecture. Figure 4 shows the architecture for the edge-detection algorithm. It consists of 2 ports, one for accessing the image data and the other for the output. The architecture consists of w1 × w2 PEs, where w1 × w2 is the size of the window used. The intermediate output is propagated to the successive PEs within a row but has to be passed through a line buffer when it is passed between rows of PEs. The buffer width is equal to the number of pixels per row. The final output is at the w1 × w2 PE.
Figure 5(a) shows the search results giving the possible solutions including the register cost. Registers represent the delays in the connecting edges that result from the heuristic search, but these may not be feasible or realizable. The Pareto optimal and near-optimal solutions are shown in the plots of Figures 5(a) and 5(b), based on the heuristic search and the modified heuristic search, respectively. The modified heuristic search developed by us picks the good solutions with respect to the number of PEs and cycles, but we see that the register cost does not reflect the Pareto optimal solution and does not guarantee feasibility. The delay-edge connectivity is obtained directly from the dependence vectors, as explained in Section 3.3 and in Table 2 for 2D filtering, and leads to the feasible architecture in Figure 4.
5. Mapping of 6D FSBM
The main objective is to find the M matrix, which consists of the processor allocation vector (Pt) and the scheduling vector (St). The method used is the same as explained in Section 3.
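A minimal sketch of how such an M matrix is applied is given below, assuming that the first row is Pt and the second row is St. The matrix used is the Table 5 entry with rows 1101 and 0100, the index point is arbitrary, and the function name `map_index` is hypothetical.

```python
def map_index(m_matrix, index_point):
    """Return (PE index, clock cycle) = (Pt . I, St . I) for one index point I."""
    p_row, s_row = m_matrix
    pe = sum(p * i for p, i in zip(p_row, index_point))
    cycle = sum(s * i for s, i in zip(s_row, index_point))
    return pe, cycle


M = [[1, 1, 0, 1],   # Pt: processor allocation vector (Table 5 entry)
     [0, 1, 0, 0]]   # St: scheduling vector (Table 5 entry)

print(map_index(M, (1, 2, 0, 3)))  # -> (6, 2)
```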
5.1. Dependencies for 6-Level FSBM Algorithm. Dependence vector formulations have been presented for a reduced index space 4D FSBM algorithm [11]. Owing to lack of space, they are not presented here.
[Figure 4 shows PEs PE 11 to PE 34 holding the weights w11 to w34, fed by X-data (image pixels) and connected through 1D and 4D delay elements.]
Figure 4: Architecture for 2D filtering algorithm for window size 4 × 3.
5.2. Results of Modiﬁed Method for FSBM Algorithm. The
mapping results after the search are presented here.
The heuristic search results of Tables 5 and 8(a) (using
MATLAB) for p = 1 and 2, respectively, are shown in the
graph in Figure 6.
5.3. Delay-Edge Connectivity for FSBM Algorithm Using Table 5 Results
(1) [1 1 0 1] ∗ Dv = [3 1 1 3 3 1 0 0 3 1]; the edges obtained are the same as the elements of Dv in the p-direction (hence verified). The delays are [0 1 0 0] ∗ Dv = [3 1 16 4 3 1 1 0 16 16], which are the registers/delays for the variables x, y, MAD, Dmin. This is obtained as a good solution from Table 5 by selecting the optimum cost while taking the feasibility into consideration.
(2) The final delay edge is given as follows:
\[
\begin{bmatrix}
0 & 0 & h \times N^2 & N & N & 1 & 1 & 1 & N^2 & N^2 \\
2p+1 & 1 & 1 & 2p+1 & 2p+1 & 1 & 0 & 0 & 2p+1 & 1
\end{bmatrix}
\tag{11}
\]
The second row gives the edges, and the first row gives the connected registers, obtained as the highest nonzero value among the Dv values along the indices other than the p-direction in Listing 2. The p-direction is the direction of orientation of the systolic array (PE array) in the n-D problem space. The above gives a minimal-cost connectivity and register delay elements that simultaneously satisfy the feasibility and implementability checked by the direct method.
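The projection used in step (1) can be sketched as two dot products per dependence vector: Pt gives the edge (displacement along the PE array) and St gives the delay (number of registers). The FSBM dependence matrix Dv is not reproduced in this section, so the sketch below borrows, purely for illustration, the three dependence vectors annotated in Listing 3 (0100, 0101, 1010); the function name `project_dependences` is hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length integer vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_dependences(p_row, s_row, dep_vectors):
    """Project each dependence vector d onto space (Pt . d) and time (St . d)."""
    edges = [dot(p_row, d) for d in dep_vectors]
    delays = [dot(s_row, d) for d in dep_vectors]
    return edges, delays


Pt = [1, 1, 0, 1]
St = [0, 1, 0, 0]
# Illustrative dependence vectors only (those annotated in Listing 3).
dep_vectors = [[0, 1, 0, 0], [0, 1, 0, 1], [1, 0, 1, 0]]

edges, delays = project_dependences(Pt, St, dep_vectors)
print("edges :", edges)
print("delays:", delays)
```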
Table 5: 4D FSBM, heuristic search. Total cost = 0.4 ∗ I + 0.4 ∗ II + 0.2 ∗ III.
Mmat      NPE (I)   Ncyc (II)   Reg cost (III)   Total cost
0 1 0 1   9         24          16               15.35
0 1 0 0   9         27          Edge
0 1 1 1   9         24          19               15.5
0 1 0 0   9         27          Edge
1 1 0 1   9         16          68               14.75
0 1 0 0   9         19          Edge
1 1 1 1   9         24          71               18.1
0 1 0 0   9         27          Edge
1 0 0 1   9         16          52               13.95
0 1 0 0   9         19          Edge
1 0 1 1   9         24          54               17.25
0 1 0 0   9         27          Edge
3 1 0 1   9         16          172              19.95
0 1 0 0   9         19          Edge
3 1 1 1   9         24          174              23.25
0 1 0 0   9         27          Edge
3 0 0 1   9         16          158              19.25
0 1 0 0   9         19          Edge
3 0 1 1   12        24          160              24.2
0 1 0 0   12        27          Edge
9 1 1 1   12        16          500              38
0 1 0 0   12        19          Edge
6. Architecture of the FSBM Algorithm
The architecture arrived at on the basis of the above is shown in Figure 7.
[Figure 5 panels: histograms of the number of observations versus NPE, cycles, and register cost. (a) Histogram (spreadsheet1 10v∗152c); NPE = pruned down PE array; Cycles = 74∗5∗normal(x, 20.8378, 5.4623); Reg. cost = 74∗5∗normal(x, 13.6892, 3.712). (b) Histogram (spreadsheet1 10v∗21c); NPE and Ncyc: fit not drawn (convergence of values); Reg. cost = 20∗1∗normal(x, 12.75, 3.024).]
Figure 5: (a) Plot for Table 4(a), heuristic search. (b) Plot for Table 4(b), modified heuristic search.
6.1. Design Space Exploration Using High-Level Synthesis. The design space exploration results are presented in the following, based on the architecture arrived at.
6.2. CDFG of the Design. The architecture in Figure 7 is input to the GAUT tool as a behavioural description in a C-type language. The tool generates the control data flow graph (CDFG) architecture shown in Figure 8 and also integrates into ModelSim and Xilinx ISE.
Table 6: Results of modified method for 4D FSBM algorithm for p = 1. Total cost = 0.4 ∗ I + 0.4 ∗ II + 0.2 ∗ III.
>> [pe arr Ncyc arr, Mmat]   Reg. cost   Total cost
0 0 1 1 4 0                  35          17
9 16 1 0 0 1 1 0
0 0 1 3 2 0                  56          21.2
9 16 1 0 0 1 1 0
0 0 1 2 3 0                  43          18.6
9 16 1 0 0 1 1 0
0 0 1 4 1 0                  69          23.8
9 16 1 0 0 1 1 0
0 0 1 1 4 0                  30          16
9 16 1 0 0 1 1 0
0 0 0 1 4 0                  28          15.6
9 16 1 0 0 1 1 0
0 0 0 3 2 0                  54          20.8
9 16 1 0 0 1 1 0
0 0 0 2 3 0                  41          18.2
9 16 1 0 0 1 1 0
0 0 0 4 1 0                  67          23.4
9 16 1 0 0 1 1 0
0 0 0 1 4 0                  28          15.6
9 16 1 0 0 1 1 0
0 0 2 1 4 0                  31          16.2
9 16 1 0 0 1 1 0
0 0 2 3 2 0                  58          21.6
9 16 1 0 0 1 1 0
0 0 2 2 3 0                  45          38.5
Table 7: Design space exploration of the FSBM for p = 1.
Cadency   Operators, stages   Area   % use rate               Number of muxes   FF    Latency
40        22, 2               88     100                      48                336   60
50        8, 2                64     100                      96                288   80
100       5, 2                40     60, 90, 10, 10, 10, 10   160               224   120
150       2, 1                16     60                       128               144   140
200       2, 1                16     45                       128               144   140
6.3. Results of Design Space Exploration. The high-level synthesis tool allows the designer to input the timing constraint as cadency values and to obtain the tradeoff in hardware allocation, as given in Table 7 for the FSBM algorithm with p = 1.
6.4. Design Space Exploration for p = 2. The search range
p in FSBM algorithm is increased to p = 2, and the design
space exploration is done in MATLAB for the modiﬁed
heuristic and also using the HLS GAUT tool.
The results of the above are shown in Figure 9.
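The trade-off read from such exploration results can be illustrated with a small sketch. Assuming that both area and latency are to be minimized (an assumption made here for illustration), the Python snippet below extracts the nondominated (latency, area) points from the Table 8(b) values; the function name `pareto_front` is hypothetical.

```python
def pareto_front(points):
    """Keep the (latency, area) points that no other point dominates."""
    front = []
    for latency, area in sorted(set(points)):
        # Points are visited in order of increasing latency, so a point
        # survives only if it lowers the area reached so far.
        if not front or area < front[-1][1]:
            front.append((latency, area))
    return front


# (latency, area) pairs from Table 8(b), p = 2
table_8b = [(90, 144), (80, 152), (100, 112), (100, 112), (120, 104),
            (170, 64), (180, 32), (320, 40), (340, 16)]

front = pareto_front(table_8b)
print(front)
```

Under this assumption, the cadency-300 point (latency 320, area 40) is dominated by the cadency-200 point (latency 180, area 32) and drops out of the front.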
[Figure 6 plots total cost against normalized cycles and area for p = 1 and p = 2.]
Figure 6: Search results for the reduced 4D FSBM heuristic search: cost function versus normalized area and cycles, for Table 5 (search range p = 1) and Table 8(a) (search range p = 2).
[Figure 7 shows nodes labeled 0 to 8 with the x frame data input; the Y frame data also moves along the p or 2p + 1 direction as per Listing 2. The array axes are the p direction and the (2p + 1)-direction.]
Figure 7: FSBM architecture after design space exploration.
7. Complexity Analysis
The merit of the modiﬁed heuristic algorithm is measured in
terms of the search space complexity.
7.1. Search Space Complexity. In general, heuristic search procedures take the loop bounds as the maximum values for searching. But as the loop bounds and the nested loop dimension increase, the search space becomes huge if vectors are generated exhaustively. A graphical representation of the search space expansion with respect to different values of n for n-level nested loop algorithms is given in Figure 10.
The “a” bars show the search space obtained by taking the loop bounds, say −Ui to +Ui, as the limit for each variable, and the “b” bars are obtained by using our proposed modified heuristic elaborated in Section 3, where it is observed from the plot in Figure 10 that the increase in cost is not high.
Figure 8: CDFG of the FSBM architecture in high-level synthesis tool.
Table 8
(a) Search results of MATLAB for p = 2
Npe   Ncyc   Reg. cost   Total cost
25    16     8           18
40    4      99          37
25    16     8           18
40    4      290         75
1     379    393         231
40    4      291         75
16    28     27          23
25    19     293         76
40    19     293         82
(b) Design space exploration GAUT-FSBM for p = 2
Cadency   Area   Number of operators   Latency
50        144    18                    90
60        152    19                    80
70        112    14                    100
80        112    14                    100
100       104    13                    120
150       64     8                     170
200       32     4                     180
300       40     5                     320
400       16     2                     340
7.2. Search Space Complexity Tables. Tables 9(a) and 9(b)
show the complexity calculations for 6D FSBM and 4D
FSBM and the proposed modiﬁed heuristic method whose
results are in Tables 4(b) and 5(b).
Table 9(a) shows the complexity calculations for varying
values of n and gives a comparison between the general
heuristic method and the method presented in this paper.
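The "Example" rows of Tables 9(a) and 9(b) can be reproduced with a few lines of integer arithmetic. In the sketch below, k stands for the number of candidate values per vector element (66 for the 6D case, 17 for the reduced 4D case) and n for the loop depth; the function name `search_space_sizes` and the dictionary keys are illustrative labels, not terms from the tables.

```python
def search_space_sizes(k, n):
    """Candidate counts following the 'Example' rows of Table 9."""
    return {
        "general heuristic (P and S)": k ** (2 * n) + k ** n,
        "pruned with sd (P and S)": k ** (2 * n - 4) + k ** (n - 2),
        "direct determination of S": k ** (2 * n - 4),
        "2D array treated as 1D": k ** (n - 2),
    }


for k, n in [(66, 6), (17, 4)]:
    sizes = search_space_sizes(k, n)
    baseline = sizes["general heuristic (P and S)"]
    print(f"k = {k}, n = {n}")
    for name, size in sizes.items():
        print(f"  {name}: {size:.3e} (baseline / size = {baseline / size:.3e})")
```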
7.3. 6D Problem Reduced to 4D FSBM [11] and 4D Problem-2D FIR Filtering Problem. The reduction in search space by modifying the 6D algorithm to 4D as reported in [11] and also the benefit of the modified heuristic are reflected by the last entry in Table 9(b).
Table 9
(a) 6D problem: full search block motion estimation (FSBM) problem
n = 6 | S&K, 2D array | Our work: 2D array, use of sd | Our work: use of direct determination of S vector of expression (3) | Our work: 2D array considered as 1D array
Uh, Uv, Um, Un, Ui, Uj | Image file: image size Ui × Uj = n × Uh, n × Uv; subframe size Ui × Uj | -do-∗ | -do- | -do-
I space | [1,1,1,1,1,1] to [Uh,Uv,Um,Un,Ui,Uj] | -do- | -do- | -do-
S space | 0, 1, −1, Uh, Uv, Um, Un, Ui, Uj, Uh × Uv, Uv × Um, ..., UiUj, ..., Uh × Uv × Um × Un × Ui × Uj∗∗ | -do- | -do- | -do-
CTV | Nil | [0,0,0,1,0,0]; [0,0,1,0,0,0] | [0,0,0,0,0,1], [0,0,0,0,1,0], [0,1,0,0,0,0], [1,0,0,0,0,0] | -do-
Scheduling direction = sd | Nil | [1,0,0,0,0,0]; [0,1,0,0,0,0] | -do- | -do-
Search space complexity: P vector, size [1 × 2] | 66^{2n} = 66^{12} (number of possible elements of P matrix) | Pruned down using Pt × sd = 0: P^{2n−2−2}, 66^{12−2−2} = 66^8; sd along 2 directions | Pruned down using Pt × sd = 0: P^{2n−2−2}, 66^{12−4} = 66^8 | P^{n−2}, 66^{6−2} = 66^4
S vector, size [1 × 2] | 66^n = 66^6 | Pruned down using St × sd > 0: 66^{6−2} | Nil∗∗∗ | Nil
Example | 66^{12} + 66^6 = P^{2×n} + P^n | 66^8 + 66^4 | 66^8 | 66^4
∗-do-: entry same as in previous column; ∗∗∗nil: not defined/not applicable.
(b) Reduced index space
n = 4, 4D FSBM | S&K, 2D array | Our work: 2D array, use of sd | Our work: use of direct determination of S vector | Our work: 2D array considered as 1D array
Uhnew, Upnew, Ui, Uj | Image file: image size Ui × Uj = N × Uhnew, N × Uv; sub-frame size Ui × Uj | -do- | -do-∗ | -do-
I space | [1,1,1,1] to [Uhnew,Upnew,Ui,Uj] | -do- | -do- | -do-
S space | 0, 1, −1, Uhnew, Upnew, Ui, Uj, Uhnew × Upnew, ..., UiUj, ..., Uhnew × Upnew × Ui × Uj∗∗ | -do- | -do- | -do-
CTV | Nil∗∗∗ | [0,1,0,0]; [0, 2pnew + 1, 0, 0] | [0,0,0,1], [0,0,1,0], [0,1,0,0], [1,0,0,0] | -do-
Scheduling direction = sd | Nil∗∗∗ | [1,0,0,0]; [0,1,0,0] | -do- | -do-
Search space complexity: P vector, size [1 × 2] | 17^{2n} = 17^8 (number of possible elements of P matrix)∗∗ | Pruned down using Pt × sd = 0: P^{2n−2−2}, 17^{8−2−2} = 17^4; sd along 2 directions | Pruned down using Pt × sd = 0: P^{2n−2−2}, 17^{8−4} = 17^4 | P^{n−2}, 17^{4−2} = 17^2
S vector, size [1 × 2] | 17^n = 17^4 | Pruned down using St × sd > 0: 17^{4−2} | Nil∗∗∗ | Nil
Example | 17^8 + 17^4 = P^{2×n} + P^n | 17^4 + 17^2 | 17^4 | 17^2
∗∗Note: 4p1 + 4p2 + 4p3 + 1 = 7 + 6 + 4 = 17.
∗∗∗nil: not defined; ∗-do-: entry same as in previous column.
[Figure 9 plots area against latency for p = 1 and p = 2.]
Figure 9: Design space exploration using HLS tool (Tables 7 and 8(b)).
[Figure 10 is a bar chart of search space for n = 2, n = 3, n = 4, and n = 6; a-bars: heuristic method, b-bars: new method. Bar labels: 6^{2n} and 6^1, 10^{2n} and 10^1, 17^{2n} and 17^2, 66^{2n} and 66^4.]
Figure 10: Plot showing the search space size and FSBM algorithm parameter (P) (with Nv = Nh = 4).
8. Conclusion and Future Work
Many computationally intensive algorithms are of the n-D deeply nested loop type. The methodology of mapping such algorithms involves a heuristic search in which the search complexity is large. The search space of the 2D filtering and 4D FSBM algorithms has been pruned down using the scheduling vector sd and the constraints imposed by it. The search has been performed using MATLAB for the PE array assigned to the identified (n − x)-D subspace evolved with the nature of the CTV. The resultant mapping matrix is useful in determining the PE assignment and the exact clock cycle at which a particular node in the n-D space represented by the DG is mapped onto a PE in the PE array. The search results are presented for two computationally intensive applications: 2D filtering and the reduced index space 4D FSBM algorithm. The graph in Figure 5(a) corresponds to Table 4(a) and shows the heuristic search results, that is, the distribution of PEs, cycles, and cost. Figure 5(b) corresponds to Table 5(b), which gives the number of PEs and cycles pruned down after applying the modified heuristic algorithm. The delay edge connectivity is determined by the proposed direct approach, as described in Sections 3.3 and 4.5 using Tables 2 and 4, instead of using the mapping transformation matrix M or Tmat in Tables 4(a) and 5(a) as in [4]. The high-level synthesis tool is used to obtain the CDFG. The design space exploration results obtained using the high-level synthesis tool GAUT have also been presented. The search has been performed for search ranges p = 1 and p = 2; the number of resources used and the latency for different input cadency values give the design trade-off results presented in Tables 7 and 8(b) and plotted in Figure 9. The output file of the GAUT tool could be used in the future to interface with simulation and synthesis tools to build the RTL design and map it onto a target FPGA architecture for elaborate timing verification. The complexity comparison of our method with the heuristic method is given in Tables 9(a) and 9(b).
References
[1] C. Lee, S. Kim, and S. Ha, “A systematic design space exploration of MPSoC based on synchronous data flow specification,” Journal of Signal Processing Systems, vol. 58, no. 2, pp. 193–213, 2010.
[2] U. Bondhugula, J. Ramanujam, and P. Sadayappan, “Automatic mapping of nested loops to FPGAs,” in Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’07), pp. 101–111, San Jose, Calif, USA, March 2007.
[3] X. Zhang and K. K. Parhi, “High-speed VLSI architectures for the AES algorithm,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957–967, 2004.
[4] S. Kittitornkun and Y. H. Hu, “Mapping deep nested do-loop DSP algorithms to large scale FPGA array structures,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 11, no. 2, pp. 208–217, 2003.
[5] D. Peng and M. Lu, “On exploring inter-iteration parallelism within rate-balanced multirate multidimensional DSP algorithms,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, no. 1, pp. 106–125, 2005.
[6] P. Lee and Z. M. Kedem, “Synthesizing linear array algorithms from nested for loop algorithms,” IEEE Transactions on Computers, vol. 37, no. 12, pp. 1578–1598, 1988.
[7] L. Lamport, “The parallel execution of DO loops,” Communications of the ACM, vol. 17, no. 2, pp. 83–93, 1974.
[8] P. Coussy and A. Morawiec, Eds., High-Level Synthesis: From Algorithm to Digital Circuit, Springer, 2008.
[9] D. D. Gajski, N. D. Dutt, A. C. H. Wu, and S. Y. L. Lin, High Level Synthesis: Introduction to Chip and System Design, Kluwer Academic Press, 1992.
[10] http://www-labsticc.univ-ubs.fr/.
[11] B. Bala Tripura Sundari, “Dependence vectors and fast search
of systolic mapping for computationally intensive image
processing algorithms,” in Proceedings of the International
Multi-Conference of Engineers and Computer Scientists 2011
(IMECS ’11), Kowloon, Hong Kong, March 2011.