Efficient Implementation of Nested-Loop Multimedia Algorithms by unknown
EURASIP Journal on Applied Signal Processing 2001:3, 129–146
© 2001 Hindawi Publishing Corporation
Efficient Implementation of Nested-Loop
Multimedia Algorithms
Surin Kittitornkun
Department of Electrical and Computer Engineering, University of Wisconsin, Madison, WI 53706, USA
Email: kittitor@cae.wisc.edu
Yu Hen Hu
Department of Electrical and Computer Engineering, University of Wisconsin, Madison, WI 53706, USA
Email: hu@engr.wisc.edu
Received 22 June 2001 and in revised form 22 August 2001
A novel dependence graph representation called the multiple-order dependence graph for nested-loop formulated multimedia sig-
nal processing algorithms is proposed. It allows a concise representation of an entire family of dependence graphs. This powerful
representation facilitates the development of innovative implementation approach for nested-loop formulated multimedia algo-
rithms such as motion estimation, matrix-matrix product, 2D linear transform, and others. In particular, algebraic linear mapping
(assignment and scheduling) methodology can be applied to implement such algorithms on an array of simple-processing ele-
ments. The feasibility of this new approach is demonstrated in three major target architectures: application-speciﬁc integrated
circuit (ASIC), ﬁeld programmable gate array (FPGA), and a programmable clustered VLIW processor.
Keywords and phrases: dependence graph, systolic array, multiple-order, FPGA, VLIW.
1. INTRODUCTION
Many data-intensive multimedia algorithms such as motion
estimation, 2D DCT/IDCT, matrix multiplication, and others
consist of mainly deep nested-loops with relatively simple-
loop body. They have been painstakingly implemented man-
ually as hardwired ASICs (application speciﬁc integrated cir-
cuits) [1, 2] in order to meet the extremely high throughput
rate demand of video signal processing algorithms. In view
of the repetitive nature of the nested-loop algorithm formu-
lation, the most effective strategy to speed up computation
is to realize such an algorithm using an array of processing
elements (PEs) to facilitate parallel processing.
In 1967, Karp et al. [3] introduced the notion of uni-
form recurrence equation as a powerful abstraction to describe
nested-loop algorithm formulations. Lamport [4] considered
the parallel execution of Do loops and proposed to use the al-
gebraic construct of a hyperplane in the iteration index space
to characterize the iterations that can be executed simulta-
neously. In 1982, Kung [5] and Leiserson [6] used the term
systolic array to describe the pipelined execution of nested-
loop formulated algorithms in a regular-structured, locally
connected array of identical PEs. It has been argued that a sys-
tolic array structure is amenable for very large scale integra-
tion circuit (VLSI) implementation as it exploits maximum
parallelism while requiring only local communication among
processors. In the 1980s, Kung [7] and others [8, 9] have de-
veloped systematic design methodologies for systolic arrays.
These methods facilitate algebraic mapping of the iteration
indices of a nested-loop onto a systolic array with a linear task
assignment and a linear schedule implementation.
With the rapidly increasing demands for real-time multi-
media signal processing applications, it becomes ever impor-
tant to pursue efﬁcient implementation of data intensive mul-
timedia processing algorithms. As pointed out earlier, many
of them are suitable for parallel implementation using sys-
tolic array of PEs. One notable example is the numerous 1D
and 2D array structures of block-based motion estimation
algorithm [10, 11, 12, 13, 14].
In a recent study [15], we observed that the state-of-art
systolic array design methodology does not always yield an
optimal implementation. This is mainly due to a restrictive
representation of a single-assignment, localized dependence
graph (DG) [7]. More speciﬁcally, there are a number of sit-
uations where the execution order within a loop-nest may
be reversed or even permuted without affecting the result.
Summation of more than two numbers is a good example.
On the other hand, data delivery within the array in such
data-intensive algorithms can be organized such that redun-
dant memory fetches can be eliminated to preserve memory
130 EURASIP Journal on Applied Signal Processing
bandwidth. In a conventional dependence graph,every execu-
tion (data delivery) order must be explicitly speciﬁed. When
there are multiple-choices of loop execution (data delivery)
orders, one must select a speciﬁc execution order (data deliv-
ery) using heuristic. These premature commitments of par-
ticular orders often prevent an optimal implementation to
be found.
In this paper, we propose a novel dependence graph repre-
sentation, called multiple-order dependence graph (MODG).
In a MODG, all the permissible execution (data delivery) or-
ders will be represented explicitly in a concise format. As such,
a single MODG representation can represent many of equiv-
alent DGs. Since each of these equivalent DGs may lead to
a locally optimal design, searching all of them combined is
more likely to yield a globally optimal design.
As noted above, systolic array architecture style so far
has been applied mainly for hardwired ASIC modules. With
the advancement of deep submicron system-on-chip (SOC)
manufacturing technology, it becomes clear that the cost of
interconnect in terms of propagation delay, power consump-
tion, and chip real estate has increased so much compared to
that of logics. Furthermore, the speed gap between logic and
dynamic memory gets widened.
As such, the architecture style of a regular array of pro-
cessing elements with mostly local interconnects can be found
in several different architecture styles. For example, modern
ﬁeld programmable gate array (FPGA) architectures such as
the Xilinx Vertex II and the Altera APEX II are composed of a
regular array of LUTs (look-up tables) and static RAM blocks
as local storage. Even modern programmable digital signal
processors also adopted a clustered VLIW (very long instruc-
tion word) architecture where the collection of functional
units and a local register ﬁle within a cluster can be regarded
as a powerful processing elements (PEs). These clusters can be
programmed as a multiprocessor system to achieve parallel
processing [16, 17].
In the remaining of this paper, we will ﬁrst introduce
basic deﬁnitions and the novel representation of MODG in
Section 2. Then the design methodology based on MODG
will be presented in Section 3. Next, we illustrate the advan-
tages of MODG using three design examples: a hardwired
ASIC implementation of block-based motion estimation, an
FPGA implementation of matrix-matrix multiplication, and
the implementation of separable 2D transform on a 4-cluster
VLIW processor in Sections 4, 5, and 6, respectively. Lastly,
Section 7 concludes this paper.
2. MULTIPLE-ORDER DEPENDENCE GRAPH
In this section, a novel representation of nested-loop, called
multiple-(execution) order dependence graph as well as the ba-
sic deﬁnitions of dependence analysis will be introduced. The
motivation for a multiple-order dependence graph represen-
tation will then be presented prior to its formal deﬁnition.
2.1. Nested-loop formulated algorithms
Many digital signal processing, image and video processing
algorithms can be formulated compactly with a nested-loop
formulation. Consider the following example of a matrix-
matrix multiplication algorithm.





where A = [ai,j], B = [bi,j], and C = [ci,j] are matrices
of appropriate dimensions. The corresponding nested-loop
algorithm formulation can be expressed as:
Listing 1. Matrix-matrix multiplication multiple times
Do i = 1 to 3
Do j = 1 to 3
c[i, j] = 0
Do k = 1 to 3




where i, j, and k are loop indices. Together, they form an
(iteration) index space where each point (i, j, k) corresponds
to a single execution of the loop body.
In general, let im be themth level loop index of loop-nest,
and i = (i1, i2, . . . , in)t ∈ Z be an n-dimensional column
index vector of an n-level nested Do-loop where Z is the
space of integer numbers and at denotes the transposition of
a. Hence, the n-D index space Jn can be expressed as:
Jn = {i = (i1, i2, . . . , in)t | i1, i2, . . . , in ∈ Z}. (2)
In this example, the loop body consists of a single recur-
rence equation
c[i, j] = c[i, j]+ a[i, k]× b[k, j], (3)
where a[i, k] and b[k, j] are input variables and their values
are needed to execute this loop, c[i, j]’s are output variables
whose values will be computed by executing the loop-nest.
In Listing 1, the innermost k-loop is used to realize the
summation of K product terms a[i, k]× b[k, j], 1 ≤ k ≤ K.
While c[i, j] is the ﬁnal result, it is also used to store in-
termediate results at kth iteration. In other words, the same
memory address designated to c[i, j] is assigned to new val-
ues multiple-times during the execution of the algorithm.
In a single-assignment formulation [7], we introduce a set of
new intermediate variables to store the intermediate results.
As such, every variable will be assigned to a new value at most
once during the execution of the algorithm.
In this example, the input variable a[i, k] will be used
in each of the j loops, and b[k, j] will be used in each of
the i loops. In particular, a[i, k] will be made available to
iterations with indices {(i, j, k)t ; 1 ≤ j ≤ N}, and b[k, j]will
be made available to iteration indices {(i, j, k)t ; 1 ≤ i ≤ M}
in the index space J3. In a parallel computing platform, if
Efﬁcient implementation of nested-loop multimedia algorithms 131
different iterations are executed at different processors, these
input variables must be propagated or broadcast to different
processors to facilitate the computation. This routine of input
variables can be represented using intermediate variables to
ensure that the single-assignment constraint is satisﬁed.
With the introduction of the intermediate variables, every
variable associated with a particular iteration will have the full
set of indices. For example, the matrix-matrix multiplication
loop body can now be rewritten as follows.
Listing 2. Single-assignment matrix-matrix multiplication
Do i = 1 to M
Do j = 1 to N
Do k = 1 to K
a3[i, j, k] =

a3[i, j − 1, k], j > 0a[i, k], j = 0
b3[i, j, k] =

b3[i− 1, j, k], i > 0b[k, j], i = 0
c3[i, j, k] =


c3[i, j, k− 1]+ a3[i, j, k]
×b3[i, j, k], k > 0
0, k = 0




In the listing above,a3 and b3 are the transmittal variables
of a and b, respectively, and c3 is the computation variable of
c. We may also deﬁne an inter-iteration dependence vector
as the set of index differences between the output of each
iteration (on the left-hand side of each equation) and the
input (on the right-hand side of the recurrence equations).
Therefore, the dependence vector of variable a3, b3, and c3
are da3 = (0,1,0)t , db3 = (1,0,0)t , and dc3 = (0,0,1)t ,
respectively.
A loop-nest is called a set of uniform recurrence equa-
tions (URE) [3] if its loop bounds are not functions of any
output variables in the loop body and its dependence vec-
tors are independent of the loop index i. In other words, a
URE’s loop bounds are known constants before the execution
of the algorithm. Almost all the data intensive nested-loop
formulated multimedia algorithms are uniform recurrence
equations. In general, a URE algorithm satisfying the single-
assignment constraint can be described in a representation as
shown in Figure 1 where lm andum are im’s lower and upper
bounds and m = 1,2, . . . , n.
Three types of statement are included in the innermost
loop body: the Input/propagation statement, the Computa-
tion/initialization statement, and the Output statement. Each
type of statement corresponds to a particular type of the vari-
ables.
• Transmittal variable. An input variable vj is ﬁrst assign-
ed to the transmittal variablevnj [i] at iteration indices i ∈ IIvj ,
and then propagated along a propagation dependence vector
dvj for i ∈ IPvj . Speciﬁcally, IPvj ∩ IIvj = ∅ (empty set).
Do i1 = l1 to u1
· · ·






[Gvj (i)], i ∈ IIvj
vnj



















Figure 1: n-level nested Do-loop algorithm.
Compared to Listing 2, we have a3[i] and b3[i] as trans-
mittal variables where i = (i, j, k)t . The initialization spaces
of a3 and b3 are IIa3 = {(i, j, k)t | 1 ≤ i ≤ M,1 ≤ k ≤ K, j =
0} and IIb3 = {(i, j, k)t | 1 ≤ j ≤ N,1 ≤ k ≤ K, i = 0},
respectively.
• Computation variable. The computation variable vnk [i]
will be initialized to some constants at index set (iterations)
i ∈ INvk , and then will be assigned to output values at other
iterations i ∈ ICvk according to the recurrence function Fvk .
Use Listing 2 as an example, we have INc = {(i, j,0)t | 1 ≤
i ≤ M,1 ≤ j ≤ N} and ICc = {(i, j, k)t | 1 ≤ i ≤ M,1 ≤
j ≤ N,1 ≤ k ≤ K}. The operator Fvk is the summation
operation.
• Output variable. These are results that will be stored
back to memories outside the processor array. Their values
will be assigned at a set of output index points IOvk .
From Listing 2, c[i, j] is the output variable and IOc =
{(i, j, k)t | k = K}. The indexing function of output variable
c is Gc(i, j, k) = (i, j) if k = K.
Based on the URE algorithm formulation, a set of data
dependence vectors can be identiﬁed.
• Propagation dependence vector: dvj . In Example 1,
da = (0,1,0)t, db = (1,0,0)t. (4)
• Computation dependence vector: dvk . Likewise,
dc = (0,1,0)t. (5)
These dependence vectors, together with the indices in the
index space form a dependence graph (DG) as shown in Figure
2. In a dependence graph, each node represents the execution
of an iteration of the loop-nest. An edge from one node to the
other indicates that some data must be made available from
the source iteration to the destination iteration.
















































Figure 2: 3× 3 matrix-matrix multiplication in a 3D dependence graph (a) and its node (loop body)(b).
The dependence vectors constrain the execution order
that must be followed to ensure correctness of the algorithm.
However, we note that in this example, alternate execution
orders are possible:
(a) Since the operation of summation is invariant to any
permutation of its operands, we may obtain the same result
by reversing the direction of the dependence vector (0,0,1)t
to (0,0,−1)t . This leads to a different execution order
c[i, j, k] = c[i, j, k+ 1]+ a[i, j, k]× b[i, j, k]. (6)
(b) Since the transmittal variables remain unchanged
throughout the execution, the actual transmission direction
will not be critical. Hence, instead of using current depen-
dence vectors (1,0,0)t , (0,1,0)t , one may also use alternate
dependence vectors (−1,0,0)t , (0,−1,0)t .
Clearly, with the set of alternate dependence vectors, dif-
ferent, yet functionally equivalent dependence graph may be
constructed. Motivated by this observation, we set out to pro-
pose a powerful way to represent a dependence graph and all
those dependence graphs with alternative execution orders in




The dependence graph [7] or the iteration space dependence
graph [18] is a graphical representation of data dependen-
cies among loop iterations of a nested Do-loop. A directed
dependence graph consists of a set of nodes (vertices) and a
set of edges. Each node corresponds to a loop index,i ∈ Jn, or
the innermost loop body regardless of its complexity. Each di-
rectional edge represents either a propagation or a computa-
tion dependence vector. It can be observed from Figure 2 that
the 3-level nested Do-loop results in a 3D dependence graph.
In contrast, we deﬁne an MODG as a set of MODG nodes
where each node is associated with a number of information
ﬁelds including edges as dependence vectors as follows.
Deﬁnition 2 (n-dimensional multiple-order dependence
graph, Kn). An n-D MODG, Kn, is a set of MODG nodes.
Each node, kn ∈ Kn, is a collection of the following ﬁelds of
information, n-D index, multiple-order dependence vectors
set, input data set, output data set, and terminal ﬂag set.
Each n-D MODG node, kn ∈ Kn, containing a tuple-of
information ﬁelds can be written as
kn 
(i,DV , IV ,OV , FV ), (7)
where V is a set of variables. Similar to C++ object ori-
ented programming language, we use “.” to access a partic-
ular ﬁeld. For example, kn.i denotes an n-D index ﬁeld,
kn.i = (i1, i2, . . . , in)t ∈ Jn, kn.IV is a set of input data,
where kn.IV = {kn.Iv | ∀v ∈ V}. Likewise, kn.OV is a set
of output data, kn.OV = {kn.Ov | ∀v ∈ V}. Next, a set
of Boolean terminal ﬂag speciﬁes whether the node is ei-
ther an input or an output node for each variable, v ∈ V ,
Efﬁcient implementation of nested-loop multimedia algorithms 133
kn.FV = {kn.Fv | ∀v ∈ V} where Fv ∈ {0,1} and Fv = 1
denotes the true logic value. Finally, kn.DV is a set of depen-
dence vectors associated with variable set V ,
kn.DV =
{
kn.Dv | ∀v ∈ V
}
, (8)
where kn.Dv denotes a set of all feasible dependence vectors
of variable v. kn.Dv can be obtained depending upon the
following categories of variables.
2.2.1 Transmittal variable
In traditional systolic design methodology, each instance of
input variable v[g] is reused and propagated according to a
single dependence vector dv as shown in Figure 1, that is,
kn.Dv = dv, ∀kn.i ∈ IPv ∪ IIv . (9)
This dependence vector is determined during the algorithm
formulation and restricted to an adjacent or neighboring
index. Hence, the propagation along a particular direction
based on heuristic is imposed such that the dependence graph
becomes localized. Once it is input inside the processor array,
it is called the transmittal variable instead.
Actually, each instance v[g] of the variable v can be
reused several times during the course of execution. Order-
ing should not be imposed as long as it is delivered correctly.
To represent all the possible dependence vectors among all
reusing MODG nodes, we deﬁne a broadcast index set as a









) = g, ∀kn ∈ Kn}, (10)
where
Hv(i) : i −→ g, ∃i ∈ IPv ∪ IIv . (11)
The indexing function Hv is similar to Gv(i). Therefore,
at any index point, each variable is associated with a set of
multiple-order dependence vectors
kn.Dv =
{i1 − kn.i | kn.i ≠ i1, ∀i1 ∈ Igv} (12)
including the localized dependence vector dv .
2.2.2 Computation variable
In a nested Do-loop algorithm, the computation variable v is
an output of a particular recurrence function Fv . This func-
tion can be as simple as add, multiply, minimum, maximum,
and the like. If these operators follow both commutative and
associative laws such that
Fv(a, b) = Fv(b,a),
Fv







respectively, they are called multiple-order operators. Other-
wise, functions or operators that do not follow both laws are
considered in-order. Traditionally, each instance of compu-
tation variable v is computed and propagated along a local
dependence vector dv , that is,
kn.Dv = dv, ∀kn.i ∈ INv ∪ ICv , (14)
as if it were an in-order operation.
Given a multiple-order operation, the instance v[g] can
be computed in many different orders of execution. Ordering
should be relaxed provided that it is semantically correct and
its numerical condition is satisﬁed. To represent all the possi-
ble dependence vectors among all computing MODG nodes,
we deﬁne a multiple-order index set as a set of MODG nodes








) = g, ∀kn ∈ Kn}, (15)
where
Hv(i) : i −→ g, ∃i ∈ INv ∪ ICv . (16)
Therefore, a set of multiple-order dependence vectors of vari-
able v can be obtained as
kn.Dv =
{i1 − kn.i | kn.i ≠ i1, ∀i1 ∈ Igv}. (17)
Back to the matrix product example, we can identify
the indexing functions G as Ga(i, j, k) = (i, k) if j = 0,
Gb(i, j, k) = (k, j) if i = 0, and Gc(i, j, k) = (i, j) if k = K.
The indexing function H of both input variables, a and b
are Ha(i, j, k) = (i, k) and Hb(i, j, k) = (k, j), respectively.
Since the summation follows both associative and commu-
tative laws, it is a multiple-order operation and its indexing
function Hc(i, j, k) = (i, j).
2.3. Summary
We call the dependence graph with multiple-order depen-
dence vectors for both propagation and multiple-order oper-
ator a multiple-order dependence graph (MODG). Although
MODG contains cycles due to those multiple-order depen-
dence vectors, no self loops are introduced. MODG is still
computable because it is formulated based on the computable
dependence graph [7]. After the mapping is applied, a few of
dependence vectors will become feasible and the ﬁnal one will
be selected appropriately. The mapping methodology will be
introduced in the following section.
3. MAPPINGMETHODOLOGYOFMODG
Due to the regular structure of n-D MODG, the task of
scheduling and assignment of each index (loop body) to
execute on a number of processing elements (PEs) at a certain
clock cycle becomes an algebraic projection. This projection
is called systolic or space-time mapping. As a consequence,
the index space Jn is mapped (both assigned and scheduled)
subject to the dependencies described by dependence vectors.
The obtained schedule assumes synchronous operations in
which the whole loop body is executed in one clock cycle
latency.
134 EURASIP Journal on Applied Signal Processing
3.1. 1D space-timemapping
We propose the 1D space-time mapping of n-D MODG to a
1D array of PEs. The advantages of 1D array are three folds.
First, the ﬁnal 1D array can be adjusted to ﬁt the chip area
easily. Second, the input/output port is already on the array
boundary. Third, the array is easy to rearrange to a 2D array.
The 1D mapping matrix is deﬁned below.
Deﬁnition 3 (1D space-time mapping matrix, T1). The 1D
space-time mapping matrix T1 consists of a scheduling vector








s1 s2 · · · sn
p1 p2 · · · pn
]
, (18)
where T1 ∈ Z2×n and si, pi ∈ Z. Furthermore, s and P must
be linearly independent so that Rank(T1) = 2.
Prior to applying T1 to n-D MODG, we deﬁne the 1D PE
array representation to accommodate the mapping. In a big
picture, the mapping process can be visualized as shown in
Figure 3.
Deﬁnition 4 (1D processing element array,K1). A 1D PE array
is a set of nodes or PEs in which each node is resulting from
a projection of a number of n-D MODG nodes. Each PE
encapsulates the PE number and a clock schedule of feasible
delay-edge pairs, input/output data, and terminal ﬂags.
In other words, each PE is a tuple of (J1, t(i), RV , EV ,
rV , eV , IV ,OV , FV ). Input, output, and terminal sets are as-
signed to a PE and ordered according to the synchronous








, ∀kn ∈ Kn
}
(19)
assigned (allocated) to the PE number k1.J1, where PEmin ≤
k1.J1 ≤ PEmax and
PEmax = max
(Pt q | q ∈ Jn),
PEmin = min
(Pt q | q ∈ Jn). (20)







) | k1.J1 = Pt(kn.i), ∀kn ∈ Kn}. (21)
At a particular cycle τ ∈ k1.t(i) of PE number k1.J1 =
Pt(kn.i), its feasible edges and delay sets are k1.EV [τ] =
{k1.Ev[τ] | ∀v ∈ V} and k1.RV [τ] = {k1.Rv[τ] | ∀v ∈
V}. Each variable v’s k1.Ev[τ]− k1.Rv[τ] pair results from
the space-time mapping of the multiple-order dependence














In addition, the terminal ﬂag set at this cycle τ becomes
k1.FV [τ] = {k1.Fv[τ] | ∀v ∈ V} where k1.Fv[τ] =
kn.Fv . Finally, the ﬁnal edge-delay pair, k1.ev[τ]−k1.rv[τ],
k1.ev[τ] ∈ k1.Ev[τ], k1.rv[τ] ∈ k1.Rv[τ], are chosen
appropriately subject to the design objectives (performance)
and design constraints (cost) in the next two subsections.
Although this 1D mapping matrix T1 was originally pro-
posed by Lee and Kedem [19], ours is different from theirs in
the following aspects:
• There is no restriction on the ratio of delay and edge
length.
• Input and output ports are not necessary at either end
of the array.
• Ours can be applied to several different target architec-
tures such as ASIC, FPGA and clustered VLIW which will be
illustrated later.
3.2. Mapping objective functions: performance
The 1D PE array can be evaluated based on the following per-
formance characteristics. As objective functions, the number
of cycles, the number of PEs, and the utilization are analogous
to the execution time, the area, and the efﬁciency in hard-
ware, respectively. Besides, physical input/output pins and
memory bandwidth are of increased importance as the speed
gap between logic and memory especially dynamic RAM gets
wider. Additionally, these objective functions can be used to
constrain the optimization as well.
3.2.1 Number of cycles (Ncycle)
For ASIC implementation, the total parallel execution time,
ttotal, can be computed by
ttotal = tcycle ×Ncycle, (23)
where tcycle is the PE’s longest critical propagation delay,
which depends on the PE architecture and the implemen-







p − q)}+ 1. (24)
In other words, it is the number of cuts by the equitempo-
ral hyperplane [4] perpendicular to the scheduling vector s.
Although Ncycle seems to depend on the scheduling vector s
only, the question on how many PEs are utilized and how the
data are delivered to the right PE still remains.
3.2.2 Number of PEs (NPE)
In the 1D or linear array mapping, the number of PEs, NPE
can be expressed as
NPE = PEmax − PEmin + 1. (25)
In other words, NPE is the number of distinct projections of
the index space Jn on the vector P .
3.2.3 Utilization (U)
The maximum PE array utilization is the maximum ratio
of active PEs and the number of PEs, NPE. It is obtained
Efﬁcient implementation of nested-loop multimedia algorithms 135
k1,1











Iv2         
Iv0
Ov1








































Figure 3: Space-time mapping of n-D multiple order dependence graph to a 1D processor array.
by
Umax = maxτ
∣∣{k1 | τ ∈ k1.t(i)}∣∣
NPE
, (26)
where |a| denotes the cardinality or size of set a. On the
other hand, the average utilization is the ratio of the number
of MODG nodes and the Ncycle −NPE product,
Uavg =
∣∣Kn∣∣
NPE ×Ncycle . (27)
3.2.4 Number of input/output ports (#IO)
Due to limited number of physical I/O pins of a given target
architecture, it is important to minimize the number of I/O
ports. Based on our formulation, the set of PEs performing
I/O of either input or output variable v at cycle τ is
IOv[τ] =
{




Therefore, the number of variable v’s I/O ports is given by
#IOv = maxτ
∣∣IOv[τ]∣∣, (29)
where |IOv[τ]| denotes the size of IOv[τ].
3.2.5 Memory bandwidth
The memory bandwidth associated with variable v,
Bv , is the number of input/output instances via a
particular input/output port per unit time. In this case, the
unit time is a clock cycle. From (28),Bv is equivalent to the to-








The 1D space-time mapping is basically a search of scheduling
and allocation vectors yielding the best PE arrays according
to certain objective functions and subject to the following
design constraints within bounded search space. The map-
ping process is targeted as a computer aided design tool. On
the contrary to traditional mapping, it lets constraints de-
cide the feasibility, and connectivity, as well as the ﬁnal ar-
chitecture. In general, the mapping conﬂicts prune out the
infeasible solutions. The propagation and multiple-order
constraints are responsible for assigning input and output
ports. In addition, the causality constraint will validate the
solution whether the producer-consumer concept is violated.
136 EURASIP Journal on Applied Signal Processing
3.3.1 Mapping conflict
Due to insufﬁcient rank of T1, a mapping conﬂict occurs when
two MODG nodes are assigned to the same PE and scheduled
at the same cycle, that is, T1 p = T1q, p ≠ q, p, q ∈ Jn.
Unlike [19, 20], no communication conﬂict is introduced by
our method. In order to efﬁciently detect the conﬂicts, a 2D
integer array A of size NPE ×Ncycle is used,




0 initialization,A[T1q]+ 1 otherwise. (32)
Each array element A[T1q] is increased by one if the
previous value is zero from the initialization phase. Oth-
erwise, the computation conﬂict is detected. The worst-
case time complexity of this method is O(Nn) where N =
max{ui − li + 1, i=1,2, . . . , n}while that of [19] isO(N2n).
3.3.2 Propagation constraint
Despite the fact that a systolic array can achieve high com-
putation throughput, one of the reasons that it has not been
successful is partly due to its high demand of memory band-
width. As a result, every input variable should be traced and
reused as many times as possible to eliminate redundant
memory fetches. Thus, we can eventually save the bandwidth
and the number of I/O ports. Hence, the input port should
be determined after the mapping process rather than from a
predetermined terminal ﬂag assigned by a human designer.
The input port is assigned to a k1 PE where the ﬁrst instance
of data appears and propagated to the next instance. Permis-
sible edge-delay pairs are obtained by eliminating edges that
point backward in the time domain.
The mapping process starts by initializing the terminal
vectors kn.Fv = 0,∀kn ∈ Kn. After space-time mapping,
the terminal ﬂag will be true at the ﬁrst appearance in the
broadcast index set Igv of each input instance v[g]. Thus, the
terminal ﬂag is assigned by
k1.Fv[τ] =





0, τ > min
(
k1.t(i)
) ∀kn.i ∈ Igv . (33)
3.3.3 Multiple-order operation constraint
Likewise, given an output variable v after mapping, the ter-
minal ﬂag will be true at the last appearance in the multiple-
order index set of each instance v[g], Igv in (15). Thus, the
terminal ﬂag is assigned by
k1.Fv[τ] =





0, τ < max
(
k1.t(i)
) ∀kn.i ∈ Igv . (34)
3.3.4 Causality constraint
The purpose of this causality constraint is to prevent consum-
ing intermediate data before it is fully produced. For every
instance of intermediate producer variable vp[ f] and












Due to the exponential growth in both area and time com-
plexity of MODG, its mapping methodology must be lim-
ited to some heuristic search. To be competitive in the mar-
ket, some strategies are necessary to cut the design time
and cost. From the deﬁnition of T1 in (18), its elements
are limited to si ∈ {0,±1,±li ± 1,±ui ± 1} and pi ∈
{0,±1,±li ± 1,±ui ± 1}. As a matter of fact that the search
is independent to one another, a number of different iter-
ations can be distributed in a network of workstation [21]
and compared to the performance evaluations at the end.
Finally, the problem size should be scaled down to reduce the
computational complexity.
3.5. Summary
In summary, the mapping can be cast as an optimization
problem with the objective functions in Section 3.2 subject to
the design constraints in Section 3.3 with some search strate-
gies in Section 3.4. Furthermore, the search of T1 must be
subject to the architecture mapping constraints of hardwired
ASIC, FPGA, and clustered VLIW processor in the subsequent
sections, respectively.
4. HARDWIRED ASICMAPPING
It has been widely accepted that the full search block match-
ing (FSBM) motion estimation is one of the most time-
consuming task for digital video encoding. Several hardwired
ASICs have been manufactured including the STi3220 mo-
tion estimation processor [1]. As a major step towards saving
memory bandwidth, Yeo and Hu [13] proposed the formu-
lation of a six-level nested Do-loop algorithm to represent
a single-frame, multiple-block motion estimation. A typical
video frame consists of Nh ×Nv blocks of pixels where Nh is
the number of blocks in each row and Nv is the number of
rows of block in each frame. A motion vector, (m,n) of an
N × N block of pixels, which yields the minimum mean-
absolute distortion (MAD) between current block and the
(2p + 1)2 blocks in the search area, can be obtained by
MV = arg {min MAD(m,n)}− (p,p), 0 ≤m, n ≤ 2p,
(36)
where p is the search range in number of pixels and usually
less than or equal to block size N. The corresponding MAD






∣∣x(i, j)−y(i+m−p, j +n−p)∣∣,
(37)
where x(i, j) and y(i, j) are the luminance pixels of the
current and previous frames, respectively. A six-level nested
Efﬁcient implementation of nested-loop multimedia algorithms 137
Do-loop MAD-based FSBM algorithm is shown in Figure 4
whereDmin is the minimum distortion measured using MAD.
It can be noticed that MAD(m,n) and min are of multiple-
order operators. In addition, it has been observed in [14] that
there are many possible data propagation patterns. Therefore,
the proposed MODG is applicable.
Do v = 0 to Nv − 1
Do h = 0 to Nh − 1
MV(h,v) = (0,0);
Dmin(h,v) = ∞;
Dom = 0 to 2p
Don = 0 to 2p
MAD(m,n) = 0;
Do i = 0 to N − 1
Do j = 0 to N − 1
MAD(m,n) = MAD(m,n)+∣∣x(hN + i, vN + j)
−y(hN + i+m− p,vN + j +n− p)∣∣;
EndDo j, i
if Dmin(h,v) > MAD(m,n)
Dmin(h,v) = MAD(m,n);
MV(h,v) = (m− p,n− p);
endif
EndDo n,m, h, v
Figure 4: Six-level nested Do-loop FSBM motion estimation algo-
rithm.
4.1. Result
Due to the limited presentation space, we can scale the prob-
lem size down to Nv = 3, Nh = 3, N = 4, and p = N/2 = 2.
As indicated earlier in Section 3, T1 must be searched to
obtain the ﬁnal 1D array that satisﬁes numerous constraints.
The heuristic search is limited to the surrounding regions to
the previous result from [14, 15]. Our objective is to equalize
the bandwidth of current frame pixel x and previous frame
pixel y while minimizing Ncycle.
The optimal solution is
T1 =

N2 NhN2 2p + 1 2 N 1
0 0 2p + 1 1 0 0

 . (38)
The permissible edges-delay pair of each PE, k1.ev[τ] −
k1.rv[τ], v ∈ {x,y,Dmin}, was chosen in such a way to
minimize the total number of registers (delay elements).
Depicted in Figure 5, the 2D array is obtained semi-
automatically where the  indicates the input ports to the
array. On the other hand, the schedules of current frame
pixel x(i, j), previous frame pixel y(i, j) at τ = 35,36, and
37 are listed in Figure 6(a), (b), and (c), respectively. Each
x(i, j) propagates from left PE to right PE every two cycles
in parallel and to the consecutive row every ﬁve cycles only
in the ﬁrst column. The propagation of Dmin is of the similar
pattern while the ﬁnal result is the minimum one among all
rows. The previous frame pixels y(i, j) propagate downward
every cycle and to the right PE in the ﬁrst row only. Although
the schedules show that two y(i, j) pixels are needed per
cycle, every y(i, j) is fetched once and reused throughout
the entire execution.
4.2. Discussions
Table 1 quantitatively compares different aspects among 2D
arrays in terms of search range, NPE, throughput (cycles per
block), memory bandwidth ratio By/Bx , and fan-out. We
were able to achieve unity By/Bx bandwidth ratio. It can
be noticed that ours By/Bx ratio is superior to others with
a few more registers to compensate with the huge demand
of y ’s memory bandwidth. The less bandwidth demand, the
less pressure on the on-and off-chip memory becomes. The
array can eliminate redundant memory fetches thus preserve
memory (I/O) transactions as well as power consumption at
the same time.
We deﬁne fan-out as the total number of loads (sinks)
whose source has more than two loads. It essentially repre-
sents the undesirable broadcasting mechanism. Greater fan-
out implies bigger effective capacitance Ceﬀ which the prop-
agation delay as well as the power consumption according to
the very well-known equation,
Power ∝ CeﬀV2supfCLK, (39)
where Vsup and fCLK are the supply voltage and operating
clock frequency, respectively.
It can be observed at this point that the MODG and
its mapping methodology can reduce redundant memory
fetches of previous frame pixels by at least 50% while achiev-
ing frame-level pipelining [14], and no broadcast. The bot-
tom line is that this 2D array is not achievable from the tra-
ditional mapping methodology [7]. That is because every
variable instance must propagate/compute in only one direc-
tion by enforcing a uniform edge-delay pair. The MODG can
represent the motion estimation loop-nest as if it were rep-
resented in human thinking which results in a more ﬂexible
array structure and more efﬁcient implementation.
5. FPGA ARCHITECTUREMAPPING
FPGAs can serve not only as a vehicle for rapid prototyping,
but also as a deployment of special-purpose reconﬁgurable
embedded architecture. In recent million-gate SRAM-based
FPGA architectures, the on-chip programmable modules are
organized in a regular fashion with pipelined local as well as
long interconnects. Thus, it provides excellent implementa-
tion of systolic array structure. For example, the recent imple-
mentation of 512-tap systolic FIR ﬁlter on Xilinx Virtex can
achieve maximum clock frequency of 117 MHz [23]. They
pointed out that the current FPGA mapping tools are still in-
efﬁcient for mapping DSP algorithm to its regular structure.
Based on traditional design methodology [7], only shift
registers (delay elements) are required. This is mainly due
to the hardwired architecture. In this section, our target






















































































0 1 2 3 4
5 6 7 8 9
10 11 12 13 14
15 16 17 18 19
20 21 22 23 24
0 1
x
Figure 5: 2D full search block matching motion estimation array (N = 4, p = N/2 = 2).
architecture is the SRAM-based FPGAs such as those of Xilinx
Virtex family.
5.1. Programmable-length shift register
It is a magniﬁcent feature of the look-up table (LUT) of
an SRAM-based FPGA and can be exploited by our design
methodology to propagate input data that are unknown at
design time. Each LUT can be programmed as 1-bit of length
N ≤ 16. Hence, an 8-bitN-cycle delay consumes only 8 LUTs
if N ≤ 16 rather than 8N ﬂip/ﬂops or equivalently 8N LUTs
in FPGA because one ﬂip/ﬂop is provided per LUT [24]. For
an input variable v at PE k1 ∈ K1, the ﬁnal edge-delay pair
at cycle τ ∈ k1.t(i) is subject to the following equations.
k1.rv[τ] > 0,
k1.ev[τ] ≠ 0,∣∣k1.ev[τ]∣∣(k1.rv[τ]) = min (k1.RV [τ]∣∣k1.Ev[τ]∣∣).
(40)
5.2. Distributed ROM/RAM
Besides working as the 4-bit input/1-bit output Boolean func-
tion generator, an LUT can be utilized as a single- or dual-port
ROM/RAM. According to [24], two LUTs in the same slice can
be conﬁgured as a single-port 32×1-bit ROM/RAM or a dual-
port 16× 1-bit ROM/RAM . Unlike hardwired ASIC design,
FPGA can be reconﬁgured with any initialization values em-
bedded in the conﬁguration bit stream. ROM and RAM are
suitable for input and output variables, respectively. This is
due to the fact that RAM can be updated while ROM cannot.
However, the control mechanism is a little more complex.
For a constant and known input variable v at PE k1 ∈ K1,
the delay and edge sets at cycle τ ∈ k1.t(i) are subject to the
following equations to utilize the distributed ROM:
k1.Rv[τ] =
{
mN | N > 0, m ≠ 0, m ∈ Z},
k1.Ev[τ] = {0}.
(41)
The constraints are the same for an output variable v to use
distributed RAM.
5.3. Three-state buffer
To save the multiplexer which actually consumes LUTs, an
output bus with three-state buffers is an efﬁcient alternative
to a high fan-in multiplexer. The output data is placed on
the bus at different cycles without bus collision. This design
Efﬁcient implementation of nested-loop multimedia algorithms 139




Throughput Registers By/Bx Fan-Out
Range (Cycles/Block) (bytes)
[10] AB2 −2/+ 2 16 40 94 10 0
[11] Type 1 −1/+ 1 16 40 180 2 100
[22] −2/+ 2 16 65 94 4 0
[13] −2/+ 1 16 16 64 2 16
[14] −2/+ 2 25 16 152 2 8
Ours −2/+ 2 25 16 164 1 0
9,4   9,3   8,6   8,5   7,8
9,3   9,2   8,5   8,4   7,7
9,2   8,5   8,4   7,7   7,6
9,1   8,4   8,3   7,6   7,5
8,4   8,3   7,6   7,5   6,8
9,5   9,4   9,3   8,6   8,5
9,4   9,3   8,6   8,5   7,8
9,3   9,2   8,5   8,4   7,7
9,2   8,5   8,4   7,7   7,6
9,1   8,4   8,3   7,6   7,5
11,6  11,4  10,6   10,4    9,6 
10,5  10,3    9,5     9,3    8,5
  9,4    8,6    8,4     7,6    7,4
  8,3    7,5    7,3     6,5    6,3
  6,6    6,4    5,6     5,4    4,6
12,3   11,5  11,3  10,5  10,3
10,6   10,4    9,6    9,4    8,6
  9,5    9,3     8,5    8,3    7,5
  8,4    7,6     7,4    6,6    6,4
  7,3    6,5     6,3    5,5    5,3
12,4  11,6   11,4  10,6  10,4
11,3  11,5   10,3    9,5    9,3
  9,6    9,4     8,6    8,4    7,6
  8,5    8,3     7,5    7,3    6,5
  7,4    6,6     6,4    5,6    5,4
9,6   9,5   9,4   8,7   8,6
9,5   9,4   9,3   8,6   8,5 
9,4   9,3   8,6   8,5   7,8
9,3   9,2   8,5   8,4   7,7





Figure 6: Schedule of full search block matching motion estimation:
(a) at cycle τ = 35, (b) τ = 36, and (c) τ = 37 (Nh = 3, Nv = 3,
N = 4, p = N/2 = 2).
strategy consumes virtually zero LUT provided that built-in
three-state buffers are inherently available from the occupied
logic LUTs. A bus-based design for an output variable v is
subject to the following constraints:




Algorithms that involve inner dot products such as FIR ﬁlters
can exploit the LUTs to store ﬁlter coefﬁcients and fast-carry
adders using distributed arithmetic [25]. In modern the 10-
million-gate FPGA architecture like XilinxVirtex II, a number
of highly optimized 18-bit ×18-bit multipliers are provided
and distributed throughout the chip [26].
5.5. Example: matrix-matrixmultiplication
Matrix-vector and -matrix multiplications are the fundamen-
tal operation of 1D orthogonal transform, 2D/3D graphics,
and many others. A number of 1D processor arrays have been
proposed in [19, 20, 27, 28]. However, their results suffer
from low processor utilization due to propagation constraints
of unknown input data. Normally, the coefﬁcient matrix is
known beforehand. We can exploit this fact using MODG to
FPGA architecture mapping.
We let x, c, and y be the N × N input, coefﬁcients, and
output matrices, respectively, where
y = cx. (43)
The algorithm can be formulated as 3-level nested Do-loop
as shown earlier. Our goal is to minimizeNPE andNcycle while
achieving 100% maximum utilization, single-port pipelined
input datax, stored coefﬁcients c, and single-port output data
y . In a bounded solution space, the heuristic search pruned
out more than half of the invalid solutions according to the
mapping constraint in (31). Then, only the valid solutions
were evaluated subject to the following constraints: the prop-
agation input x in (40), the distributed ROM for coefﬁcients
c in (41), and the bus-based output constraint in (42). This







From (24) and (25) we achieveNPE = 4 andNcycle = 19 from
the PE array and its clock schedule as shown in Figures 7 and 8,
respectively. The propagation of inputx is pipelined along the
array to multiply with the stored coefﬁcients c in each PE. The
output y is accumulated every cycle and placed on the bus
every N cycles. The output bus interface exploits abundant
three-state buffers of FPGA. The schedule is equivalent to
cutting the original 3D MODG of a matrix-matrix product
into 2D slices, tiling each slice of MODG to a 2D MODG, and
projecting the tiled 2D MODG to a 1D PE array.
Table 2 compares our 1D array with the previous works.
All of them are targeted at 1D PE array. Assuming the same
critical path delay, ours outperforms theirs in terms of NPE,
#IO, utilization, and latency. The smaller number of PEs,
140 EURASIP Journal on Applied Signal Processing















0 1 2 3
0 1 2 3
Figure 7: Architecture of pipelined bus N ×N matrix multiplication N = 4.
      0   1    2   3
  0  --  --  --    41
  1  --  --   31 42
  2  --  21 32 43
  3 11 22 33 44
  4 12 23 34 41
  5 13 24 31 42
  6 14 21 32 43
  7 11 22 33 44
  8 12 23 34 41
  9 13 24 31 42
10 14 21 32 43
11 11 22 33 44
12 12 23 34 41
13 13 24 31 42
14 14 21 32 43
15 11 22 33 44
16 12 23 34  --
17 13 24  --   --
18 14  --   --   --
      0   1   2    3
  0  --   --   --  14
  1  --   --  14 24
  2  --  14 24 34
  3 14 24 34 44
  4 24 34 44 13
  5 34 44 13 23
  6 44 13 23 33
  7 13 23 33 43
  8 23 33 43 12
  9 33 43 12 22
10 43 12 22 32
11 12 22 32 42
12 22 32 42 11
13 32 42 11 21
14 42 11 21 31
15 11 21 31 41
16 21 31 41  --
17 31 41  --   --




Figure 8: Schedule of pipelined bus N × N matrix multiplication
N = 4: (a) coefﬁcient c[i, j] and (b) input x[i, j].
the fewer resources it utilizes. Ours is even better since its
target architecture is the FPGA. Due to the known coefﬁ-
cient matrix c, we save M = 8 pins where M is c’s and x’s
bit precision and y is of 3M-bit precision. The utilization
ﬁgures, Umax and Uavg, are of importance when the power
supply is limited. Unlike ours, other PE arrays spend most
of the time propagating data. Finally, latency is more con-
cerned as the delay from applying input to getting the ﬁrst
output increases which results in less hardware utilization.
Each single-port N-entry M-bit ROM consumes N/16M
LUTs [24] where x denotes the smallest integer greater
than or equal to x. The LUT consumption ratio of [27] to
ours is over 3.85 with the multiplier built from LUTs. The ra-
tio would increase up to 5.25 if the Virtex II architecture were
assumed because each built-in multiplier does not consume
any LUT [26].
Table 2: Matrix-matrix product, N = 4, M = 8 is the precision
(bits) (Virtex FPGA is assumed).
Performance [27] [19] [20] [28] Ours
PE Array 1D 1D 1D 1D 1D
Ncycle 31 19 19 19 19
NPE 8 10 10 10 4
Umax 50% 60% 70% 60% 100%
Uavg 25.8% 33.7% 33.7% 33.7% 84.2%
Latency(cycles) 8 18 18 25 4
#IO(pins) 6M 5M 5M 5M 4M
LUTs/PE 216 168 144 136 112
LUTs Ratio 3.85 3.75 3.21 3.04 1.00
Table 3: Number of LUTs/PE, N = 4,M = 8 is the precision (bits)
(Virtex FPGA is assumed).
Performance [27] [19] [20] [28] Ours
(24+ 16)-b Adder 24 24 24 24 24
(8-bit×8=16)-b Multiplier 48 48 48 48 48
24-b y ’s Registers 72 72 48 24 24
8-b c’s Registers 56 8 16 24 8
8-b x’s Registers 16 16 8 16 8
Total 216 168 144 136 112
Table 3 enumerates the number of LUTs estimated for
datapath of each PE. The Xilinx core generator system pro-
vides not only the netlists of those components in PE datap-
ath, that is, bit-parallel adder, multiplier, and others but also
the accurate estimation of LUTs consumption. This allows
the designer to predict the complexity of each PE in advance.
On the other hand, the control circuitry of each PE was difﬁ-
cult to estimate because it is mechanism dependent and varies
from architecture to architecture. Note that the mapping can
be applied to matrix products of any size.
In the next section, we will apply the MODG and its space-
time mapping to pipeline nested Do-loop algorithms in lus-
tered VLIW video signal processors.
Efﬁcient implementation of nested-loop multimedia algorithms 141
6. CLUSTERED VLIWMAPPING
After the extensive study of multimedia workload [16] espe-
cially video encoding/decoding, it has been analyzed that a
video signal processor should be equipped with a large num-
ber of ALUs (arithmetic and logic units), shifters, multipli-
ers, and multiported register ﬁle to gain the available data
parallelism. Current examples of clustered VLIW processor
are the TI’s C6X Velociti and the analog device’s TigerSharc
families. The Princeton video signal processor [17, 29, 30]
has been recently proposed and targeted at 1 GHz operating
clock frequency. In order to achieve the goal, the so-called
clustered VLIW has been adopted with up to eight homoge-
neous clusters of functional units. Some time-critical video
signal processing algorithms involve deep nested Do-loop
such as 2D DCT/IDCT (discrete cosine transform/inverse
discrete cosine transform), block-based motion estimation,
and others.
However, the current instruction scheduling of clustered
VLIW is limited to basic block level [31] and tries to per-
form both partitioning and instruction scheduling to all clus-
ters [32]. Particularly multimedia workload, inter-procedure
analysis, and loop transformation are needed to discover
coarse-grain parallelism that is not visible at basic block level
[17]. Hence, there is not enough instruction-level parallelism
(ILP) to fully utilize the available hardware parallelism.
As a coarse-grain loop pipelining methodology, the
MODG and its space-time mapping can be applied to fully
exploit loop-level parallelism. We assume the clustered VLIW
architecture just like the Princeton video signal processor with
predicate execution capability currently adopted by the Intel
IA64 architecture. The algorithm to its architecture mapping
is subject to the following constraints. However, these con-
straints can be used as objective functions as well.
6.1. Number of clusters (C)
Since NPE is a function of problem size, it can be greater than
the number of clusters C after the mapping. The solutions
are constrained such that the number of PEs is an integer
multiple- of the number of clusters.
NPE =mC, (45)
where m > 0, m ∈ Z. Therefore, m PEs will be mapped
to the same cluster. This is beneﬁcial to instruction sched-
uler because basic block will be expanded so that intracluster
parallelism can be exploited.
6.2. Intercluster communication channels
The number of intercluster transactions should be minimized
to avoid being communication bottleneck. The links can be
implemented as crossbar switches [30] or busses [31]. In each
PE or cluster k1, the number of occurrences of nonzero-
length edge and nonnegative delay pair of variable v can
be expressed as {τ | ∃k1.Ev[τ]i > 0,∃k1.Rv[τ]i ≥ 0}where
k1.Ev[τ]i and k1.Rv[τ]i are the ith element of k1.Ev[τ]
and k1.Rv[τ], respectively. Hence, its cardinality yields the
number of intercluster transactions of PE k1 ∈ K1.
6.3. Local register file size
If the loop is unrolled according to the obtainable solution,
the basic block will be enlarged and instruction scheduler will
be able to exploit ILP more. However, the number of registers
needed may grow. Since each cluster has its own local register
ﬁle, register allocation can be performed independently. To
aid the register allocation using graph coloring such as the one
proposed by Chaitin et al. [33], live range of each variable
instance can be obtained to formulate register interference
graph. For each processor number k1.J1 the LiveRange of
variable instance v[g] can be written as
LiveRangegv
= max {st(p − q) | ∀p,∀q ∈ Igv , k1.J1 = Pt p = Pt q}.
(46)
6.4. Local memory space
The number of variable v’s instances in the PE of pro-
cessor number k1.J1 is the cardinality of the following set,
{g | Pt(kn.i) = k1.J1,∃kn.i ∈ Igv}. The total space of local
memory required in each cluster is the summation over all
variables,∀v ∈ V . The size of local memory should be large
enough to hold all necessary variable instances. For example,
local memory of 2 Kbytes per cluster is projected by [16].
6.5. Example: 2D separable transformation
The 2D DCT/IDCT is a common 2D separable linear trans-
form in image and video processing. Its matrix-matrix prod-
uct is formulated as
X = c(cx)t, (47)
where c, x, and X are the N ×N transform coefﬁcient, input
and output matrices, respectively. We can rewrite it in a two-
pass matrix-matrix multiplication
y = cx, Y = yt, X = cY . (48)
Its single loop-nest algorithm was derived in the appendix
for more details. It can be noticed that the innermost loop
body is invariant. Therefore, loop collapsing [18] was applied.
Eventually, the single-assignment loop and its dependence
graph can be illustrated in Figures 9 and 10, respectively.
As a result, we could formulate its MODG. The mapping is
constrained to the following conditions. First, since practical
2D DCT/IDCT are of N × N elements where N = 8, the
mapping is constrained to NPE = C = N = 4 in this case.
Second, one input port per input variable, c and x. Finally,
intercluster transaction should be kept as low as possible. One







The assignments and schedules of x, Y , and c are illustrated
in Figure 11. The c[m,n]-x[m,n] pair or the c[m,n]-
Y[m,n] pair represents the iteration in the ﬁrst and second
142 EURASIP Journal on Applied Signal Processing
passes, respectively. The iterations are assigned to clusters 0–3
and skewed/ordered relative to one another to pipeline fetch-
ing data from data cache. The overall schedule shows that
no intercluster transactions are required. The following is a
simple-pseudo assembly code for matrix-matrix product of
y = cx.
ldi R12,0 ;i = 0
L0: ;label L0
ldi R11,0 ;j = 0
L1: ;label L1
ldi R11,0 ;k = 0
L2: ;label L2
ldi R3,0 ;y[i, j] = 0
ldi R10,0 ;k = 0
ld R0,c[R12,R10] ;R0← c[i, k]
ld R1,x[R10,R11] ;R1← x[k, j]
mul R2,R0,R1 ;R2← c[i, k]× x[k, j]
add R3,R3,R2 ;R3← R3+c[i, k]× x[k, j]
addi R10,R10,1 ;k = k+ 1
ble L3,R10,4 ;branch to L1 if R10 ≤ 4
st y[R12,R11],R3 ;y[i, j]←R3
addi R11,R11,1 ;i = i+ 1
ble L2,R11,4 ;branch to L2 if R11 ≤ 4
addi R12,R12,1 ;j = j + 1
ble L1,R12,4 ;branch to L3 if R12 ≤ 4
According to the loop schedule in Figure 11, the assembly
code above is replicated in all four clusters to form a bigger
loop body. They are pipelined/skewed by one cycle as shown
in Figure 12. We assume that each cluster has one memory
load/store unit that takes one cycle latency if data cache hits.
Although the ﬁrst load may stall the following instructions
for a certain number of clock cycles, the subsequent loads
will take only one cycle due to cache hit. This corresponds to
the constraint of having single input port for c and x. The
four loop bodies are almost identical and can be scheduled
with any existing instruction scheduling algorithm.
Unlike software pipelining that schedules instructions
from adjacent loop iterations to the available functional units
in any cluster [31], our scheme is different in such a way
that each cluster is assigned to execute instructions solely
belonging to an independent loop body. In addition, soft-
ware pipelining using modulo scheduling [34] yields each
result every initiation interval to increase throughput rate
which can be limited by the available ILP. Ours instead en-
hances the throughput by a larger factor of the number of
clusters C executing in parallel. Furthermore, the subsequent
independent iterations can be executed in parallel if the loop
is unfolded and more functional units such as multiple-ALUs
and multipliers are available.
7. CONCLUSION
Multimedia signal processing is subject to real-time con-
straint, for example, audio/video coding/decoding, 2D/3D
Dom = 1 to N
Do n = 1 to N




c[m,k], n = −1




x[k,n], m = −1




0, k = −1
y3[m,n, k− 1]+ c3[m,n, k]
×x3[m,n, k], k ≥ 0
EndDo k




c[m,k−N], m = −1




y3[n,m,k], k = N




0, n = −1
X3[m,n− 1, k]+ c3[m,n, k]




Figure 9: Single-assignment of two-dimensional separable trans-
form algorithm.
graphics. To meet such high throughput demand, either
application-speciﬁc integrated circuit (ASIC) or application-
speciﬁc instruction set processor (ASIP) is often utilized.
ASIC can be of the form hardwired ASIC or programmable
hardware like FPGA. On the other hand, clustered VLIW
architecture is adopted as the main architecture of ASIP to
exploit static instruction level parallelism due to predictable
behavior of multimedia algorithms.
As a matter of fact that most of multimedia algorithms
are characterized as data-intensive and data-parallel algo-
rithms. Furthermore, a huge portion of total execution time
is consumed by elementary operations such as motion esti-
mation, matrix multiplication, 1D/2D linear transform, and
others that can be described as nested-loop with simple-
innermost loop body. The novel multiple-order dependence
graph (MODG) is proposed to represent all possible exe-
cution and data delivery orders. Along with its architecture
mapping methodology, it is targeted as a computer aided de-
sign tool.
Depending upon the target architecture, our design
methodology can assign and schedule parallel/pipelined exe-
cution of nested Do-loop algorithms subject to architecture-
speciﬁc constraints. We have demonstrated the applications
of MODG in three different implementation styles, hard-
wired ASIC, FPGA, and clustered VLIW processor using block
matching motion estimation, matrix-matrix product, and 2D
separable transformation, respectively.

























































































Figure 10: Dependence graph of two-dimensional transformation, Y = (cx)t , X = cY ,N = 3.
Firstly, the best-ever 2D systolic motion estimation array
was obtained in hardwired architecture. It outperforms others
by demanding the least memory bandwidth with mostly local
interconnects, small amount of fan-out, and reasonable num-
ber of registers. Secondly, we were able to obtain systolic array
structures that match well to the modern FPGA architecture
with optimal execution time, less LUT consumption, low I/O
pin count, high processor utilization, and consequently low
power consumption. Finally, the achievable loop-level paral-
lel execution schedule for clustered VLIW processor was able
144 EURASIP Journal on Applied Signal Processing
 --  --   --   11
 --   --  12 11
 --  13 12 11
14 13 12 11
14 13 12 21
14 13 22 21
14 23 22 21
24 23 22 21
24 23 22 31
24 23 32 31
24 33 32 31
34 33 32 31
34 33 32 41
34 33 42 41
34 43 42 41
44 43 42 41
44 43 42  --
44 43  --   --
44  --   --   --
 --  --   --   11
 --  --   21 12
 --  31 22 13
41 32 23 14
42 33 24 11
43 34 21 12
44 31 22 13
41 32 23 14
42 33 24 11
43 34 21 12
44 31 22 13
41 32 23 14
42 33 24 11
43 34 21 12
44 31 22 13
41 32 23 14
42 33 24  --
43 34  --   --
44  --   --   --
 --  --  --   11
 --  --  11  21
 -- 11  21 31
11 21 31 41
21 31 41 12
31 41 12 22
41 12 22 32
12 22 32 42
22 32 42 13
32 42 13 23
42 13 23 33
13 23 33 43
23 33 43 14
33 43 14 24
43 14 24 34
14 24 34 44
24 34 44 11
34 44 12 11
44 13 12 11
14 13 12 11
14 13 12 21
14 13 22 21
14 23 22 21
24 23 22 21
24 23 22 31
24 23 32 31
24 33 32 31
34 33 32 31
34 33 32 41
34 33 42 41
34 43 42 41
44 43 42 41
44 43 42  --
44 43  --   --
44  --   --   --















































































































Figure 11: Schedule of 2D separable transformation: (a) input x[i, j], (b) intermediate data Y[i, j], and (c) coefﬁcients c[i, j].
Cluster 0 Cluster 1 Cluster 2 Cluster 3
ldi r12,0
ldi r12,0 ldi r11,0
ldi r12,0 ldi r11,1 ldi r10,0
ldi r12,0 ldi r11,2 ldi r10,0 ldi r3,0
ldi r11,3 ldi r10,0 ldi r3,0 ldi r10,0
ldi r10,0 ldi r3,0 · · · ld r0,c[r12,r10]
ldi r3,0 · · · ld r1,x[r10,r11]







Figure 12: Instruction scheduling of cluster-parallel pipelined
matrix-matrix product, C = 4.
to exploit both cluster level and instruction level parallelism.
The methodology acts like a coarse-grain loop pipelining
scheduler to parallel available clusters while minimizing
intercluster communication.
Although the time and area complexities of the MODG
grow exponentially, it is the only concise representation that
expresses parallelism explicitly. Moreover, its 1D array target
and design constraints as well as heuristic search strategy can
counterbalance for its complexity on a modern network of
workstations.
APPENDIX
SINGLE-PASS SEPARABLE 2D TRANSFORMATION
Dom = 1 to N
Do n = 1 to N
Do i = 1 to N
Do j = 1 to N
y[i, j] = 0
Efﬁcient implementation of nested-loop multimedia algorithms 145
Do k = 1 to N





Do p = 1 to N
Y[p,n] = y[n,p]




After loop collapsing [18], the nested Do-loop above be-
comes
Dom = 1 to N
Do n = 1 to N
y[m,n] = 0
Do k = 1 to N
y[m,n] = y[m,n]+ c[m,k]× x[k,n]
EndDo k
X[m,n] = 0
Do p = 1 to N
Y[p,n] = y[n,p]





[1] S. Microelectronics, “Sti3220 motion estimation processor,”
Jan. 1994.
[2] T. Xanthopoulos and A. P. Chandrakasan, “A low-power dct
core using adaptive bitwidth and arithmetic activity exploiting
signal correlations and quantization,” IEEE J. Solid-State Circ.,
vol. 35, no. 5, pp. 740–750, 2000.
[3] R. M. Karp, R. E. Miller, and S. Winograd, “The organization
of computations for uniform recurrence equations,” J. ACM,
vol. 14, no. 3, pp. 563–590, 1967.
[4] L. Lamport, “The parallel execution of do loops,” Commun.
ACM, vol. 17, no. 2, pp. 83–93, 1974.
[5] H. T. Kung, “Why systolic architectures?,” IEEE Comput., vol.
15, no. 1, pp. 37–46, 1982.
[6] C. E. Leiserson, Area-Efﬁcient VLSI Computation, MIT Press,
Cambridge, Massachusetts, 1983.
[7] S. Y. Kung, VLSI Array Processors, Prentice Hall, Englewood
Cliffs, NJ, 1988.
[8] S. K. Rao and T. Kailath, “Regular iterative algorithms and
their implementation on processor arrays,” Proc. IEEE, vol. 76,
no. 4, pp. 259–282, 1988.
[9] D. I. Moldovan, Parallel Processing: From Applications to Sys-
tems, Morgan Kaufmann, San Mateo, CA, 1993.
[10] T. Komarek and P. Pirsch, “Array architectures for block match-
ing algorithms,” IEEE Trans. Circ. Syst., vol. 36, no. 10, pp.
1301–1308, 1989.
[11] L. D. Vos and M. Stegherr, “Parameterizable vlsi architectures
for the full-search block-matching algorithm,” IEEE Trans.
Circ. Syst., vol. 36, no. 10, pp. 1309–1316, 1989.
[12] K.-M. Yang, M.-T. Sun, and L. Wu, “Vlsi designs for block-
matching algorithms,” IEEE Trans. Circ. Syst., vol. 36, no. 10,
pp. 1317–1325, 1989.
[13] H. Yeo and Y. H. Hu, “A novel modular systolic array architec-
ture for full-search block matching motion estimation,” IEEE
Trans. Circuit Syst. Video Technol., vol. 5, no. 5, pp. 407–416,
1995.
[14] S. Kittitornkun and Y. H. Hu, “Frame-level pipelined motion
estimation array processor,” IEEE Trans. Circuit Syst. Video
Technol., vol. 11, no. 2, pp. 248–251, 2001.
[15] S. Kittitornkun and Y. H. Hu, “Reconﬁgurable processor array
synthesis,” in Int. Conf. Parallel and Distributed Computing,
Applications, and Technologies, 2001.
[16] J. Fritts, W. Wolf, and B. Liu, “Understanding multimedia
application characteristics for designing programmable media
processors,” vol. 3655, pp. 2–13. The International Society for
Optical Engineering, 1998.
[17] Z. Wu and W. Wolf, “Design study of shared memory in vliw
video signal processor,” in Proc. Int. Conf. Parallel Architectures
and Compilation Techniques, pp. 52–59. 1998.
[18] M. J. Wolfe, High Performance Compilers for Parallel Comput-
ing, Addison-Wesley, Redwood City, CA, 1996.
[19] P. Z. Lee and Z. M. Kedem, “Synthesizing linear array algo-
rithms from nested for loop algorithms,” IEEE Trans. Comput.,
vol. 37, no. 12, pp. 1578–1598, 1988.
[20] W. Shang and J. A. B. Fortes, “Time optimal linear schedules for
algorithms with uniform dependencies,” IEEE Trans. Comput.,
vol. 40, no. 6, pp. 723–742, 1991.
[21] Condor high throughput computing, http://www.cs.wisc.edu/
condor/.
[22] C. H. Hsieh and T. P. Lin, “Vlsi architecture for block-matching
motion estimation algorithm,” IEEE Trans. Circuit Syst. Video
Technol., vol. 2, no. 2, pp. 169–175, 1992.
[23] D. R. Martinez, T. J. Moeller, and K. Teitelbaum, “Application
of reconﬁgurable computing to a high performance front-end
radar signal processor,” J. VLSI Signal Processing, vol. 28, no.
1–2, pp. 65–83, 2001.
[24] Xilinx, “Virtex 2.5v ﬁeld programmable gate arrays,” May 2000.
[25] R. D. Turney, C. Dick, D. B. Parlour, and J. Hwang, Modeling
and implementation of dsp fpga solutions, Xilinx.
[26] Xilinx, “Virtex-ii 1.5v ﬁeld programmable gate arrays,” Jan.
2000.
[27] V. K. P. Kumar and Y.-C. Tsai, “Synthesizing optimal family
of linear systolic arrays for matrix computations,” in Proc. Int.
Conf. Systolic Arrays, pp. 51–60. 1988.
[28] K. G. Ganapathy and B. Wah, “Optimal design of lower di-
mensional processor arrays for uniform recurrences,” in Int.
Conf. Application Speciﬁc Array Processors, pp. 636–648. 1992.
[29] S. Dutta, K. J. O’Connor, W. Wolf, and A. Wolfe, “A design
study of a 0.25-µm video signal processor,” IEEE Trans. Circuit
Syst. Video Technol., vol. 8, no. 4, pp. 501–519, 1998.
[30] W. Wolf, “Alternative architectures for video signal processor,”
in Proc. IEEE CS Workshop VLSI, pp. 5–8. 2000.
[31] J. Sanchez and A. Gonzalez, “Modulo scheduling for a fully-
distributed clustered vliw architecture,” in Proc. 33rd ann.
IEEE/ACM Int. Symp. Microarchitecture, pp. 124–133. 2000.
[32] R. Leupers, “Instruction scheduling for clustered vliw dsps,”
in Proc. Int. Conf. Parallel Architectures and Compilation Tech-
niques, pp. 291–300. 2000.
[33] G. J. Chaitin, M. A. Auslander, A. K. Chandra, J. Cocke, M. E.
Hopkins, and P. W. Markstein, “Understanding multimedia
application characteristics for designing programmable media
processors,” J. Comput. Lang., vol. 3655, pp. 2–13, 1981.
[34] B. R. Rau, “Iterative modulo scheduling: An algorithm for
software pipelining loops,” in Proc. 27th Ann. Workshop on
Microprogramming, pp. 63–74. 1994.
146 EURASIP Journal on Applied Signal Processing
Surin Kittitornkun received his B. Eng
(2nd Honor) from King Mongkut’s Insti-
tute of Technology Ladkrabang (KMITL),
Bangkok, Thailand in 1991. He received M.
Eng and Telecom Finland Prize for academic
excellence from Asian Institute of Technol-
ogy (AIT), Bangkok, Thailand in 1995. He
is currently a Ph.D. Candidate at Depart-
ment of Electrical and Computer Engineer-
ing, University of Wisconsin, Madison. His
research interests include VLSI architecture for digital signal/image
processing, and FPGA/reconﬁgurable computing. He was a sum-
mer intern with Broadband Wireless System Department, Motorola,
Inc., Schaumberg, IL in 1998.
Yu Hen Hu received a BSEE degree from
National Taiwan University, Taipei, Taiwan,
ROC in 1976. He received MSEE and Ph.D.
degrees in Electrical Engineering from Uni-
versity of Southern California, Los Ange-
les, California in 1980 and 1982, respec-
tively. From 1983 to 1987, he was an assis-
tant professor of the Electrical Engineering
Department of Southern Methodist Univer-
sity, Dallas, Texas. He joined the Department
of Electrical and Computer Engineering, University of Wisconsin,
Madison, Wisconsin in 1987 as an assistant professor (1987–1989),
and is currently an associate professor. His research interests in-
clude multimedia signal processing, artiﬁcial neural networks, fast
algorithms and design methodology for application speciﬁc micro-
architectures, as well as computer aided design tools for VLSI us-
ing artiﬁcial intelligence. He has published more than 150 journal
and conference papers in these areas. He is a former associate edi-
tor (1988–1990) for the IEEE Transaction of Acoustic, Speech, and
Signal Processing in the areas of system identiﬁcation and fast al-
gorithms. He is currently associate editor of Journal of VLSI Signal
Processing. He is a founding member of the neural network signal
processing technical committee of IEEE signal processing society
and served as chair from 1993-1996. He is a former member of
VLSI signal processing technical committee of the signal process-
ing society. Currently he serves as the secretary of the IEEE signal
processing society (1996–1998). Dr Hu is a fellow of the IEEE.
