Partitioning and Scheduling DSP Applications with Maximal Memory Access Hiding by Zhong Wang et al.
EURASIP Journal on Applied Signal Processing 2002:9, 926–935
© 2002 Hindawi Publishing Corporation
Partitioning and Scheduling DSP Applications
with Maximal Memory Access Hiding
Zhong Wang
Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA
Email: zwang1@cse.nd.edu
Edwin Hsing-Mean Sha
Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083, USA
Email: edsha@utdallas.edu
Yuke Wang
Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083, USA
Email: yuke@utdallas.edu
Received 2 September 2001 and in revised form 14 May 2002
This paper presents an iteration space partitioning scheme that reduces the CPU idle time caused by long memory access latency. We take into consideration the accesses to both intermediate and initial data. An algorithm is proposed to find the largest overlap for initial data so as to reduce the overall memory traffic. To hide the memory latency efficiently, another algorithm is developed to balance the ALU and memory schedules. Experiments on DSP benchmarks show that the algorithms significantly outperform existing methods.
Keywords and phrases: loop pipelining, initial data, maximal overlap, balanced partition scheduling.
1. INTRODUCTION
Contemporary DSP and embedded systems almost always contain a memory hierarchy, which can be divided into on-chip and off-chip memories. In general, the on-chip memory is fast but small, while the off-chip memory is much slower but larger. Before the CPU can compute, the data must be loaded from the off-chip memory into the on-chip memory, so system performance is degraded by the long off-chip access latency. How to tolerate this memory latency in the presence of a memory hierarchy is becoming an increasingly important problem [1]. In this paper, the on-chip and off-chip memories are abstracted as the first and second level memories, respectively.
Prefetching [1, 2, 3, 4, 5] is a technique that fetches data from memory in advance of the corresponding computations. It can be used to hide the memory latency. Software pipelining [6] and modulo scheduling [7, 8], on the other hand, are scheduling techniques used to exploit the parallelism in a loop. Both prefetching and scheduling can accelerate execution. However, these traditional techniques have weaknesses [9] that prevent them from efficiently solving the latency problem described above. This paper combines software pipelining with data prefetching. Multiple memory units, attached to the first level memory, perform operations that prefetch data from the second level memory to the first level memory. These memory units are in charge of placing all data required by a computation in the first level memory before that computation starts. Multiple ALU units in the processor perform the computation. The ALU schedule is optimized by software pipelining under the resource constraints. The operations in the ALU units and memory units execute simultaneously, so the long memory access latency is tolerated by overlapping the data fetching operations with the ALU operations. Although using computation to hide memory latency has been studied extensively, balancing the computation against the memory loading has, to the authors' knowledge, never been researched thoroughly. This paper presents an approach that balances the ALU and memory schedules to achieve an optimal overall schedule length.
The data to be prefetched can be classified into two groups: intermediate data and initial data. Intermediate data can serve as both left and right operands in the equations, and their values vary during the computation. Initial data, in contrast, can serve only as right operands, and their values are maintained during the computation. In the following equations, for example, the arrays B and C can be regarded as intermediate data and A as initial data:
B[i + 1] = B[i] ∗ B[i − 1] + A[i],
C[i + 1] = B[i − 1] ∗ A[i + 1] + A[i]. (1)
The influence of both kinds of data should be considered in order to obtain an optimal overall schedule.
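As an illustration only, the distinction between the two kinds of data can be sketched in Python. The function name and the encoding of each equation as a (left-hand array, right-hand arrays) pair are assumptions made for this sketch; the rule itself (an array that is ever written is intermediate, an array that is only read is initial) follows the description above.

```python
def classify_arrays(equations):
    """Split array names into intermediate and initial data.

    `equations` is a list of (lhs_array, rhs_arrays) pairs (a hypothetical
    encoding). An array that appears on a left-hand side is intermediate;
    an array that is only read is initial.
    """
    written = {lhs for lhs, _ in equations}
    read = {name for _, rhs in equations for name in rhs}
    return written, read - written


# Equation (1): B[i+1] = B[i]*B[i-1] + A[i];  C[i+1] = B[i-1]*A[i+1] + A[i]
EQUATIONS = [("B", ["B", "B", "A"]), ("C", ["B", "A", "A"])]
intermediate, initial = classify_arrays(EQUATIONS)
print(sorted(intermediate), sorted(initial))
```

For equation (1) this reports B and C as intermediate data and A as initial data, matching the classification in the text.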
To take full advantage of data locality, the entire iteration space can be divided into small blocks called partitions. Much work has been done on partitioning. Loop tiling [10, 11] groups basic computations so as to increase computation granularity and thereby reduce communication time. In general, tiling methods do not produce a detailed schedule of ALU and memory operations as our method does, and they take only intermediate data into consideration. Agarwal and Kranz [12] make an extensive study of data partitioning. They use an approximation method to find a good partition that minimizes the data transfer among different processors, and their work considers affine reference indices. However, they concentrate mainly on the initial data and give little consideration to the intermediate data.
The approaches in [9, 13] are among the few that consider a detailed schedule under a memory hierarchy. Nevertheless, their memory references cover only the intermediate data and ignore the initial data, which have an important influence on performance. The experimental results in Section 5 show that this deficiency leads to an unbalanced, and therefore worse, schedule.
In our approach, both the intermediate and the initial data are considered. For the intermediate data, we restrict our study to nested loops with uniform data dependencies. The study of uniform loop nests is justified by the fact that most general linear recurrence equations can be transformed into a uniform form. This transformation (uniformization [14]) greatly reduces the complexity of the problem. Uniformization is difficult to apply to the initial data, however, so affine reference indices are considered for them. The concept of a footprint [12] is used to denote the initial data needed by the ALU computation in one partition. Given a partition shape, this paper presents an algorithm to find a partition size that gives rise to the maximum overlap between adjacent overall footprints, so that the number of memory operations is reduced as far as possible.
For the schedule of the loop, we propose detailed ALU and memory schedules. Each memory and ALU operation is assigned to an available hardware unit and time slot, which makes the technique convenient to apply in a compiler. The memory schedule is balanced against the ALU schedule so that the overall schedule is close to the lower bound, which is determined by the ALU schedule. Our method gives an algorithm to determine the partition shape and size that achieve balanced ALU and memory schedules. Finally, the memory requirement of our technique is also presented.
The new algorithm in this paper significantly exceeds the performance of existing algorithms [9, 13] because it optimizes both the ALU and memory schedules and considers the influence of initial data. Taking the wave digital filter as an example, in a standard system with 4 ALU units and 4 memory units, and assuming 3 initial data references in each iteration, our algorithm obtains an average schedule length of 4.018 CPU clock cycles, which is very close to the theoretical lower bound of 4 clock cycles. Traditional list scheduling needs 22 clock cycles, and hardware prefetching costs 10 clock cycles. The PSP algorithm in [13] achieves some improvement but still needs 8 clock cycles. Without the memory constraint, the algorithm in [9] has the same performance, 8 clock cycles. Our algorithm improves on all of these approaches.
It is worth mentioning that some work has been done on data layout techniques [15, 16], which are used to maintain cache coherency and reduce conflict traffic. Our work should be regarded as a separate layer that can be built on top of the data layout layer to obtain better performance.
The remainder of this paper is organized as follows. Section 2 introduces the terms and basic concepts used in the paper. Section 3 presents the theory on initial data. Section 4 describes the algorithm to find the detailed schedule. Section 5 presents experimental results comparing this technique with a number of existing approaches. We conclude in Section 6.
2. BACKGROUND
We can represent the operations in a loop by a multidimensional data flow graph (MDFG) [6]. Each node in the MDFG represents a computation. Each edge denotes the data dependence between two computations, with the distance vector as its weight. The benefit of using an MDFG instead of the general data dependence graph (DDG) or statement dependence graph (SDG) is that the MDFG is a finer-grained description of data dependences. Each node of an MDFG corresponds to one ALU computation. In contrast, a node in a DDG or SDG corresponds to a statement, whose ALU computation time is uncertain and depends on the complexity of the statement. It is therefore more convenient to schedule the ALU operations with an MDFG. Moreover, many DSP applications, such as DSP filters, can be directly mapped into MDFGs [17].
The execution of all nodes in an MDFG once is an iteration. It corresponds to executing the loop body once under a certain loop index. Iterations are identified by a vector i, equivalent to a multidimensional index.
In this paper, we illustrate our ideas with two-dimensional loops. The same ideas extend without difficulty to loops with more than two dimensions.
Figure 1: Architecture model with multiple function units and a
memory hierarchy.
2.1. Architecture model
The technique in this paper is designed for a system that has one or more processors. These processors share a common memory hierarchy, as shown in Figure 1. There are multiple ALU and memory units in the system. The access time of the first level memory is significantly shorter than that of the second level memory, as in current systems. During a program's execution, if an instruction requires data that are not in the first level memory, the processor has to fetch them from the second level memory, which costs much more time. Thus, prefetching data into the first level memory before their explicit use can minimize the overall execution time. Two types of memory operations, prefetch and keep, are supported by the memory units. The prefetch operation brings data from the second level memory into the first level memory; the keep operation keeps data in the first level memory for the execution of one partition. Both are issued to guarantee that data referenced in the near future are present in the first level memory before they are referenced. It is important to note that the first level memory in this model cannot be regarded as a pure cache, because we do not consider cache associativity. In other words, it can be thought of as a fully associative cache.
2.2. Partitioning the iteration space
Regular execution of nested loops proceeds in either a row-wise or column-wise manner until the boundary of the iteration space is reached. However, this mode of execution does not take full advantage of either the locality of reference or the available parallelism. The execution of such structures can be made more efficient by dividing the entire iteration space into regions called partitions that better exploit spatial locality.
Once the total iteration space is divided into partitions of iterations, the execution sequence is determined partition by partition. Call the partition in which the loop is executing the current partition. The next partition is then the partition adjacent to the right side of the current partition along the x-axis, and the other partitions are all partitions except these two. Based on this classification, different memory operations are assigned to different data in a partition. For a delay dependency that goes into the next partition, a keep memory operation is used to keep the data in the first level memory for one partition, since the data will be reused immediately in the next partition. Delay dependencies that go into other partitions result in the use of prefetch memory operations to fetch data in advance.
Figure 2: The overall schedule.
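The choice between keep and prefetch can be sketched as follows. This is a minimal illustration under stated assumptions: Px = (1, 0), Py has a positive y element, partitions are anchored at the origin, and a dependency is modeled as data produced at `point` and consumed at `point + delay`. The function names and the "none" result for a dependency that stays inside its partition are assumptions of the sketch.

```python
from fractions import Fraction
from math import floor


def partition_index(point, py, fx, fy):
    """Index of the parallelogram partition containing `point`.

    Assumes Px = (1, 0) and Py = (py_x, py_y) with py_y > 0, so that
    point = alpha * Px + beta * Py.
    """
    beta = Fraction(point[1], py[1])   # coefficient along Py
    alpha = point[0] - beta * py[0]    # coefficient along Px
    return (floor(alpha / fx), floor(beta / fy))


def memory_operation(point, delay, py, fx, fy):
    """Classify the dependency from `point` to `point + delay`."""
    src = partition_index(point, py, fx, fy)
    target = (point[0] + delay[0], point[1] + delay[1])
    dst = partition_index(target, py, fx, fy)
    if dst == src:
        return "none"        # stays inside the current partition
    if dst == (src[0] + 1, src[1]):
        return "keep"        # goes into the next partition along x
    return "prefetch"        # goes into one of the other partitions


# Hypothetical 4 x 4 rectangular partitions, Py = (0, 1).
print(memory_operation((3, 0), (1, 0), (0, 1), 4, 4))
print(memory_operation((0, 3), (0, 1), (0, 1), 4, 4))
```

Exact rational arithmetic is used for the Py coefficient so that skewed partition shapes such as Py = (−3, 1) are handled without rounding errors.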
A partition is determined by its partition shape and partition size. We use two basic vectors, Px and Py, to identify a parallelogram as the partition shape (in a basic vector, each element is an integer and the elements have no common factor other than 1). These two basic vectors are called partition vectors. Assume, without loss of generality, that the angle between Px and Py is less than 180°, and that Px is clockwise of Py. The partition size is determined by the vector S = (fx, fy), where fx and fy are the multiples of the partition vectors Px and Py, respectively. Thus, the partition is delimited by the two vectors fxPx and fyPy.
How to find the optimal partition size is discussed in Section 4. Because of the dependencies between the iterations, Px and Py cannot be chosen arbitrarily. The following property gives the condition for a legal partition shape [9].
Property 1. A pair of partition vectors that satisfies the following constraints is legal: for each delay vector de, the cross product relations (see footnote 1) de × Px ≤ 0 and de × Py ≥ 0 hold.
Because nested loops should follow the lexicographical
order, we can choose (1, 0) as our Px vector and use the nor-
malized leftmost vector of all delay dependencies as our Py .
The partition shape is decided by these two vectors.
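Property 1 and the choice of Py can be checked mechanically. The sketch below is illustrative only: the delay vectors are hypothetical, and "leftmost" is interpreted as the delay vector d* for which d × d* ≥ 0 holds for every delay vector d, which is one reading of the rule above.

```python
from math import gcd


def cross(p, q):
    """Signed area p x q = p.x * q.y - p.y * q.x."""
    return p[0] * q[1] - p[1] * q[0]


def is_legal(delays, px, py):
    """Property 1: d x Px <= 0 and d x Py >= 0 for every delay vector d."""
    return all(cross(d, px) <= 0 and cross(d, py) >= 0 for d in delays)


def leftmost_normalized(delays):
    """Pick the delay vector that no other delay lies to the left of,
    then divide out the common factor so that it becomes a basic vector."""
    for cand in delays:
        if all(cross(d, cand) >= 0 for d in delays):
            g = gcd(abs(cand[0]), abs(cand[1]))
            return (cand[0] // g, cand[1] // g)
    raise ValueError("no leftmost delay vector found")


delays = [(1, 0), (0, 1), (1, 1), (-6, 2)]  # hypothetical delay vectors
px = (1, 0)
py = leftmost_normalized(delays)
print(py, is_legal(delays, px, py))
```

With these delay vectors the leftmost one is (−6, 2), which normalizes to Py = (−3, 1); a rectangular shape with Py = (0, 1) would violate Property 1 because of the delay (−6, 2).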
An overall schedule consists of two parts: an ALU part and a memory part, as seen in Figure 2. The ALU part schedules the ALU computation. Since the computation in a loop can be represented by an MDFG, the ALU part is a schedule of the MDFG nodes. The memory part schedules the memory operations, prefetch and keep, so that the data for the computation can always be found in the first level memory.
Footnote 1: The cross product p1 × p2 is defined as the signed area of the parallelogram formed by the points (0, 0), p1, p2, and p1 + p2 = (x1 + x2, y1 + y2). It is p1 × p2 = (p1 · x)(p2 · y) − (p1 · y)(p2 · x).
Figure 3: The footprint.
3. THE THEORY ABOUT INITIAL DATA
The overall footprint of a partition consists of all the initial data needed by the computation of that partition. Provided that execution follows the partition sequence, the initial data needed by the current partition were prefetched into the first level memory while the previous partition was executing. Likewise, the initial data needed by the next partition are prefetched by the memory units during the execution of the current partition. Data in the overlap between the overall footprints of the current and next partitions are already in the first level memory, so the prefetch operations for them can be spared. Thus, the major concern for the initial data is how to maximize the overlap between the overall footprints of two consecutively executed partitions so as to reduce the memory traffic.
As mentioned in Section 1, we consider affine references for the initial data. Given a loop index vector i, an affine reference index can be expressed as g(i) = iG + a, where G = [G1 G2] is a 2 × 2 matrix and a is the offset vector. The footprint with respect to a reference A[g1(i)] is the set of all data elements A[g1(i)] of A, for i an element of the partition. The overall footprint is the union of the footprints with respect to all the different references. For example, in Figure 3, the partition is a rectangle of size 3 × 4. The initial data references are B(i + j, i − j) and B(2i + j, i − 2j). Their footprints are denoted by the integer points marked by × and •, respectively. The overall footprint is the union of these two footprints.
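The footprint definition can be made concrete with a short sketch. The code assumes that the loop index is a row vector multiplied on the left of G, and that the 3 × 4 partition is anchored at the origin with i in 0..2 and j in 0..3; the anchoring and the function name are assumptions of this illustration.

```python
from itertools import product


def footprint(iterations, G, a=(0, 0)):
    """Set of data indices g(i) = i*G + a touched by the given iterations.

    The index (i0, i1) is treated as a row vector, so
    i*G = (i0*G[0][0] + i1*G[1][0], i0*G[0][1] + i1*G[1][1]).
    """
    return {
        (i0 * G[0][0] + i1 * G[1][0] + a[0],
         i0 * G[0][1] + i1 * G[1][1] + a[1])
        for i0, i1 in iterations
    }


# A 3 x 4 rectangular partition anchored at the origin (an assumption).
partition = list(product(range(3), range(4)))

G_first = ((1, 1), (1, -1))    # reference B(i + j, i - j)
G_second = ((2, 1), (1, -2))   # reference B(2i + j, i - 2j)

fp_first = footprint(partition, G_first)
fp_second = footprint(partition, G_second)
overall = fp_first | fp_second  # overall footprint is the union
print(len(fp_first), len(fp_second), len(overall))
```

Both matrices are nonsingular, so each of the 12 iterations touches a distinct element per reference; the overall footprint is smaller than the sum of the two because the references share some elements, for example B(0, 0).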
In [12], Agarwal and Kranz present the concept of uniformly generated references. Two references A[g1(i)] and A[g2(i)] are said to be uniformly generated if
g1(i) = iG + a1, g2(i) = iG + a2. (2)
If two references B1 and B2 are not uniformly generated, the overlap between the footprint with respect to B1 of the current partition and that with respect to B2 of the next partition can be ignored because the overlap, if it exists, diminishes rapidly. Therefore, we need only consider the overlap between footprints with respect to uniformly generated references of two consecutive partitions. Moreover, the offset vector a should satisfy a = mG1 + nG2, where m and n are integer constants. Otherwise, no overlap between the footprints of consecutive partitions exists even for uniformly generated references.
The memory requirement should be taken into account when trying to maximize the overlap. The partition size cannot be enlarged arbitrarily merely to increase the overlap, because a larger partition means a larger overall footprint, that is, more memory space is consumed. Therefore, given a partition shape and a set of uniformly generated references, we derive conditions on the partition size that should be met to achieve a reasonable maximal overlap. For convenience of description, we introduce the following notation.
Definition 1. (1) Assuming the partition size is S, f(a, S) is the footprint of the current partition with respect to the reference with offset vector a, and f(a′, S) is the footprint of the next partition with respect to the reference with offset vector a.
(2) Given a set of uniformly generated references, the set R = {a1, a2, ..., an} is the set of offset vectors (see footnote 2). Assuming the partition size is S, F(R, S) is the overall footprint of the current partition and F(R′, S) is the overall footprint of the next partition.
The one-dimensional case can be regarded as a simplification of the two-dimensional problem in which fy is always set to zero. It provides the theoretical foundation for the two-dimensional problem. In one dimension, a partition reduces to a line segment and all vectors reduce to integers. The partition size can be thought of as the length of the line segment. We use an example to demonstrate the problem we are tackling. In Figure 4, there are three different offset vectors: 1, 2, and 7. The solid lines represent the overall footprint of the current partition, and the dotted lines denote that of the next partition. We need to find the condition on the partition size, that is, the length of the line segment, that achieves a maximal overlap. The figure shows the case when the length equals 5, which is the minimum length that obtains the maximum overlap between the overall footprints.
In order to derive the theorem on the minimum value of S that generates the maximum overlap, we first need the following lemmas. They consider the overlap of two footprints of consecutive partitions, as shown in Figure 5. The solid line is the footprint of the current partition and the dotted line is the footprint of the next partition.
Footnote 2: The elements in the set R are in lexicographically increasing order.
Figure 4: One-dimensional line segments.
Figure 5: Two different relations between a1 and a2. (a) Case 1. (b) Case 2.
Lemma 1. The minimum S that maximizes the intersection between f(a′1, S) and f(a2, S), where a2 ≥ a1, is a2 − a1.
Proof. According to the relation between (a1 + S) and a2, there are two different cases.
Case 1. As shown in Figure 5a, a1 + S ≤ a2, that is, S ≤ a2 − a1. The intersection is (a2, a1 + 2S − 1). Its size reaches the maximum value a2 − a1 when S = a2 − a1.
Case 2. As shown in Figure 5b, a1 + S > a2, that is, S > a2 − a1. The intersection of the two segments is (a1 + S, a2 + S − 1). Its size, a2 − a1, does not depend on S, so the intersection does not grow as S increases further.
Lemma 2. The intersection between f(a′1, S) and f(a2, S), where a2 ≥ a1, keeps a constant size, irrespective of the value of S, as long as S ≥ a2 − a1.
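Lemmas 1 and 2 can be checked numerically with a brute-force sketch. It assumes, as in the proof, that a footprint with offset a and size S covers the integers a, ..., a + S − 1 and that the next partition is shifted by S; the offsets 2 and 7 are hypothetical values chosen for the illustration.

```python
def segment(start, length):
    """Integer points of a one-dimensional footprint [start, start+length-1]."""
    return set(range(start, start + length))


def pair_overlap(a1, a2, S):
    """Size of f(a1', S) ∩ f(a2, S): the next-partition footprint of a1
    against the current-partition footprint of a2."""
    return len(segment(a1 + S, S) & segment(a2, S))


a1, a2 = 2, 7  # hypothetical offsets with a2 >= a1
sizes = {S: pair_overlap(a1, a2, S) for S in range(1, 11)}
print(sizes)
```

The overlap grows until S reaches a2 − a1 = 5, where it equals a2 − a1, and stays at that value for every larger S.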
According to Definition 1, F(R, S) and F(R′, S) can be expressed as
$$F(R, S) = f(a_1, S) \cup f(a_2, S) \cup \cdots \cup f(a_n, S),$$
$$F(R', S) = f(a'_1, S) \cup f(a'_2, S) \cup \cdots \cup f(a'_n, S).$$
The following lemma gives the expression of their intersection.
Lemma 3. Let $C_m$ be the intersection $f(a_m, S) \cap f(a'_{m-1}, S)$. Then the intersection of F(R, S) and F(R′, S) is $\bigcup_{m=2}^{n} C_m$, where n is the number of integers in R.
Proof. Let $A_m$ denote $f(a_m, S)$ and $B_m$ denote $f(a'_m, S)$.
Basis step. Let n = 2. Then F(R, S) = A1 ∪ A2 and F(R′, S) = B1 ∪ B2. The ending point of A1 is less than the starting points of B1 and B2, and the starting point of B2 is greater than the ending points of A1 and A2. Thus, the only possible intersection is A2 ∩ B1.
Induction hypothesis. Assume that, for some n ≥ 2, $F(R, S) \cap F(R', S) = \bigcup_{m=2}^{n} C_m$.
Induction step. For n + 1, the added intersection is $A_{n+1} \cap (B_1 \cup B_2 \cup \cdots \cup B_n)$. There are two different cases.
(1) $a_{n+1} \ge a_n + S$. Then $A_{n+1}$ can only intersect with $B_n$.
(2) $a_{n+1} < a_n + S$. Then $A_{n+1}$ can be divided into two parts, $A' = (a_{n+1}, a_n + S)$ and $A'' = (a_n + S, a_{n+1} + S - 1)$, so that
$$A_{n+1} \cap (B_1 \cup B_2 \cup \cdots \cup B_n) = \bigl(A' \cap (B_1 \cup B_2 \cup \cdots \cup B_n)\bigr) \cup \bigl(A'' \cap (B_1 \cup B_2 \cup \cdots \cup B_n)\bigr).$$
Therefore, $F(R, S) \cap F(R', S) = \bigcup_{m=2}^{n+1} C_m$.
Theorem 1. Given the set R = (a1, a2, a3, ..., an), the maximum intersection between F(R, S) and F(R′, S) can be achieved when $S = \max_{m=2}^{n} (a_m - a_{m-1})$.
Proof. Consider two adjacent segments Cm and Cm−1. We have Cm = Am ∩ Bm−1 and Cm−1 = Am−1 ∩ Bm−2. Bm−1 and Am−1 have no common element, and therefore neither do Cm and Cm−1. According to Lemmas 1 and 2, any value S ≥ am − am−1 makes the segment Cm largest. Moreover, no two of the Cm intersect each other. Therefore, the theorem is correct.
From Theorem 1 and Lemma 2, we can directly derive
the following theorem.
Theorem 2. For the overall footprints F(R, S) and F(R′, S),
their overlap will keep constant if the value of S continues to
increase from the S value obtained by Theorem 1.
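Theorems 1 and 2 can be checked on the offsets 1, 2, and 7 of Figure 4 with a brute-force sketch. It assumes that a one-dimensional footprint with offset a and size S covers the integers a, ..., a + S − 1 and that the next partition is shifted by S; the function names are illustrative.

```python
def overall_footprint(offsets, S, shift=0):
    """Union of the segments [a+shift, a+shift+S-1] over all offsets a."""
    points = set()
    for a in offsets:
        points.update(range(a + shift, a + shift + S))
    return points


def overlap(offsets, S):
    """|F(R, S) ∩ F(R', S)|, with the next partition shifted by S."""
    current = overall_footprint(offsets, S)
    following = overall_footprint(offsets, S, shift=S)
    return len(current & following)


def theorem1_size(offsets):
    """S = largest gap between consecutive sorted offsets (Theorem 1)."""
    ordered = sorted(offsets)
    return max(b - a for a, b in zip(ordered, ordered[1:]))


R = [1, 2, 7]
S_min = theorem1_size(R)
overlaps = {S: overlap(R, S) for S in range(1, 10)}
print(S_min, overlaps)
```

For this set the largest gap is 7 − 2 = 5, in agreement with the length shown in Figure 4; the brute-force overlap is smaller for S below 5 and no longer changes for S above 5.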
To maximize the overlap between F(R, S) and F(R′, S) in two-dimensional space, the fy element of the partition size is less important than the fx element, since the intersection always increases when fy is enlarged. We determine the value of fy from other conditions. The key question is therefore the minimum value of fx that maximizes the intersection for a given fy.
Next, we discuss the situation in which G is the two-dimensional identity matrix. If G is not an identity matrix, the same idea can be applied as long as a = mG1 + nG2. The only difference is that the original XY-space is transformed to a new space by the matrix G. An augmented set R∗ can be obtained from a given partition size S and the set R as follows: a∗i = ai, a∗i+n = ai + fyPy · y, where n is the size of the set R and Py = (Py · x, Py · y). Arranging all the points in R∗ in increasing order of their Y element divides the overall footprint of one partition into a series of stripes. Each stripe is bounded by the two horizontal lines that pass through two adjacent points of the sorted R∗. For instance, in Figure 6, the set R is {(0, 0), (6, 1), (3, 2), (1, 3)}. Assuming the value of fyPy · y is 5, the augmented set R∗ is {(0, 0), (0, 5), (6, 1), (6, 6), (3, 2), (3, 7), (1, 3), (1, 8)}. After sorting, it becomes {(0, 0), (6, 1), (3, 2), (1, 3), (0, 5), (6, 6), (3, 7), (1, 8)}. The overall footprint consists of 7 stripes, as indicated in Figure 6.
Figure 6: The stripe division of a footprint.
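The construction of R∗ and the stripe division can be reproduced with a few lines of Python. Following the worked example, the sketch shifts only the y element of each offset by the height fyPy · y; the function names and the `height` parameter are assumptions of this illustration.

```python
def augmented_set(R, height):
    """R* = R plus a copy of each offset raised by `height` = fy * (Py . y).

    As in the worked example, only the y element is shifted.
    """
    return list(R) + [(x, y + height) for x, y in R]


def stripes(R, height):
    """Sort R* by y and pair up consecutive y values as stripe bounds."""
    ordered = sorted(augmented_set(R, height), key=lambda p: p[1])
    ys = sorted({y for _, y in ordered})
    return ordered, list(zip(ys, ys[1:]))


R = [(0, 0), (6, 1), (3, 2), (1, 3)]
ordered, bands = stripes(R, height=5)
print(ordered)
print(len(bands), bands)
```

The sorted list matches the one given in the text, and the eight distinct y values bound the 7 stripes of Figure 6.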
In each stripe, a horizontal line intersects the left bounds of some footprints f(a, S). Thus, the two-dimensional intersection problem for the stripe reduces to the one-dimensional problem, which can be solved using Theorem 1. Applying this idea to each stripe solves the two-dimensional overlap problem, as demonstrated in Algorithm 1. The algorithm runs in polynomial time, with time complexity O(n^2).
From Lemma 2, the intersection remains constant if fx is greater than the value chosen by this algorithm, and decreases for smaller fx. We demonstrate this behavior with two examples. The set R for the first example is {(0, 1), (5, 3), (−3, 1), (4, −1), (−2, −2)} and the partition shape is (1, 0) × (0, 1), which is the partition shape for the wave digital filter. The set R for the second example is {(0, 2), (3, 5), (1, 3), (−1, −1)} and the partition shape is (1, 0) × (−3, 1), which is the partition shape for the two-dimensional filter. Figures 7a and 7b show how the footprint intersection varies with fx and fy for the two examples, respectively.
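The stripe-based procedure of Algorithm 1 can be sketched for the special case of a rectangular partition (Px = (1, 0), Py = (0, 1)) with G the identity matrix. The sketch is a simplified row-by-row variant rather than the event-list sweep: it applies Theorem 1 to the x offsets active in each row and keeps the largest result. The offset set is the one from the Figure 6 example with a footprint height of 5; the brute-force `overlap_2d` helper is added here only to check the result and is not part of the algorithm.

```python
def min_fx_rectangular(R, height):
    """Simplified Algorithm 1 for Px = (1, 0), Py = (0, 1), G = identity.

    Each offset (x, y) covers rows y .. y+height-1. For every row,
    Theorem 1 is applied to the x offsets active in that row.
    """
    fx = 0
    lo = min(y for _, y in R)
    hi = max(y for _, y in R) + height
    for row in range(lo, hi):
        xs = sorted(x for x, y in R if y <= row < y + height)
        gaps = [b - a for a, b in zip(xs, xs[1:])]
        if gaps:
            fx = max(fx, max(gaps))
    return fx


def overlap_2d(R, fx, height):
    """Brute-force |F(R, S) ∩ F(R', S)|; the next partition is shifted by fx."""
    def cells(shift):
        return {(x + shift + dx, y + dy)
                for x, y in R for dx in range(fx) for dy in range(height)}
    return len(cells(0) & cells(fx))


R = [(0, 0), (6, 1), (3, 2), (1, 3)]
fx = min_fx_rectangular(R, height=5)
print(fx, [overlap_2d(R, w, 5) for w in (5, 6, 7)])
```

On this example the sketch selects fx = 6; the brute-force overlap is smaller for fx = 5 and unchanged for larger values, which is the behavior described above.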
Input: The set R and the shape of the partition.
Output: The fx that maximizes the overlap under a certain fy.
(1) Set fx to 0.
(2) Based on the set R and the partition shape, choose an fy such that the product fy ∗ Py · y is larger than the difference between the largest and smallest y elements of all vectors in the set R.
(3) Using the fy above, generate the augmented set R∗.
(4) Sort all the values in R∗ in increasing order of their y element and keep them in an event list.
(5) Use a horizontal line to sweep the whole iteration space. When an event point is met, insert the corresponding set f(a, S) into a visiting list if the event point is the lower bound of the footprint; otherwise, delete the corresponding f(a, S) from the list.
(6) Calculate the intersection points of this line with the left bound and right bound of each set in the visiting list. Use Theorem 1 to derive an f′x value that makes the intersection in the current stripe maximal.
(7) Replace fx with f′x if f′x > fx.
Algorithm 1: Calculating the minimum fx that makes the overlap maximum.
4. THE OVERALL SCHEDULE
The overall schedule can be divided into two parts: the ALU schedule and the memory schedule. For the ALU schedule, the multidimensional rotation scheduling algorithm [6] is used to generate a static schedule for one iteration. The entire ALU schedule is then formed by simply replicating this schedule for each iteration in the partition. The schedule obtained in this way is the most compact schedule, since it considers only the ALU hardware resource constraints, and the overall schedule cannot be shorter than it. Thus, this ALU schedule provides a lower bound for the overall schedule. This lower bound can be calculated as #leniteration × #nodes, where leniteration represents the schedule length obtained by the multidimensional rotation scheduling algorithm for one iteration, and #nodes denotes the number of iteration nodes in one partition. Our objective is to find a partition whose overall schedule length is very close to this lower bound.
4.1. Balanced overall schedule
Unlike the ALU schedule, the memory schedule is considered as a whole for the entire partition. It consists of two parts: memory operations for initial data and memory operations for intermediate data. Each part consists of the prefetch and keep operations for the corresponding data. Because the prefetch operations are unrelated to the current computation, they can be arranged from the beginning of the memory schedule. In contrast, a keep operation for intermediate data can be issued only after the corresponding computation has finished. The keep operations for initial data can be issued as soon as the data have been prefetched. The length of the memory schedule is the sum of the schedule lengths of these two parts.
Figure 7: The tendency of intersection with fx and fy.
For the intermediate data, the calculation of the number of prefetch and keep operations is given in [13]. The initial data can be prefetched in blocks. A block prefetch operation fetches several data at one time and takes only a little longer than a general prefetch operation. To calculate the number of such operations, we first make the following observation.
Property 2. As long as fyPyG2, the projection of the footprint size along the direction G2, is larger than the maximum difference of aG2 over all a belonging to a uniformly generated offset vector set, the overall footprint increases at a constant rate as fy increases, and so does the number of prefetch operations for initial data.
Note that the requirement in the above property guarantees that the partition is large enough for the footprint with respect to an offset vector to intersect the footprints with respect to all other offset vectors belonging to the same uniformly generated set.
Suppose that a two-dimensional vector can be written as a = (a · x, a · y). Given a certain fx, the number of prefetch operations for initial data for any fy that satisfies the condition in the above property is PreBase_ini + (fy − fy0) × Preincr_ini, where $f_{y0} = \lceil y_0 / ((P_y G) \cdot y) \rceil$, y0 is the maximum difference of (aG) · y over all offset vectors, PreBase_ini denotes the number of such operations for a partition of size fx × fy0, and Preincr_ini represents the increase in the number of prefetch operations when fy is increased by one.
The keep operations for the initial data can be issued after the data have been prefetched. The number of such keep operations is KeepBase_ini + (fy − fy0) × Keepincr_ini, where y0 and fy0 have the same meaning as above, KeepBase_ini denotes the number of keep operations for a partition of size fx × fy0, and Keepincr_ini represents the increase in the number of keep operations when fy is increased by one.
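Both counts follow the same linear rule, which can be written as a single helper. The numbers in the example are hypothetical and only illustrate the shape of the formula; the base and increment values for a real loop have to be obtained from the footprints, and fy0 is taken here as a given parameter.

```python
def initial_data_ops(fy, fy0, base, incr):
    """Number of prefetch (or keep) operations for initial data.

    Implements base + (fy - fy0) * incr, which applies only when fy
    satisfies the condition of Property 2, i.e. fy >= fy0.
    """
    if fy < fy0:
        raise ValueError("formula applies only for fy >= fy0")
    return base + (fy - fy0) * incr


# Hypothetical values: 20 prefetches at fy0 = 4, 3 more per extra row.
prefetches = initial_data_ops(fy=7, fy0=4, base=20, incr=3)
keeps = initial_data_ops(fy=7, fy0=4, base=12, incr=2)
print(prefetches, keeps)
```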
In order to understand what is a good partition size, we
first need the definition of the balanced overall schedule. It
also gives the balanced overall schedule requirement.
Definition 2. A balanced overall schedule is a schedule for
which the memory schedule is at most one unit time of keep
operation longer than the ALU schedule.
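The lower bound of Section 4 and the balance condition of Definition 2 can be combined in a small check. All numbers below are hypothetical: a one-iteration ALU schedule of 4 time units, a partition of 4 × 7 iterations, a keep operation of 1 time unit, and two candidate memory schedule lengths. The function names are assumptions of this sketch.

```python
def alu_lower_bound(len_iteration, iterations_per_partition):
    """ALU schedule length: the one-iteration schedule replicated."""
    return len_iteration * iterations_per_partition


def is_balanced(alu_length, memory_length, keep_time):
    """Definition 2: the memory schedule exceeds the ALU schedule by at
    most the time of one keep operation."""
    return memory_length <= alu_length + keep_time


alu = alu_lower_bound(len_iteration=4, iterations_per_partition=4 * 7)
print(alu)
print(is_balanced(alu, memory_length=113, keep_time=1))
print(is_balanced(alu, memory_length=120, keep_time=1))
```

With these values the ALU lower bound is 112 time units; a memory schedule of 113 units is balanced, while one of 120 units is not and would call for a different partition size.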
To reduce the computational complexity and simplify the analysis, we add a restriction on the partition size: the partition must be large enough that no data dependence spans more than two partitions.
(1) No delay dependency spans more than two partitions along the y coordinate direction, that is, fy ∗ Py · y ≥ dy for all d = (dx, dy) ∈ D.
than two partitions along the x coordinate direction, that is,
fx > max{dx − dy(Py · y/(Py · x))}.
As long as these constraints on the minimal partition size are satisfied, the lengths of the prefetch and keep parts for intermediate data in the memory schedule increase more slowly than the ALU schedule length when the partition size is enlarged. If, at this point, no partition size can be found that meets the balanced overall schedule requirement, the length of the block prefetch part for initial data is increasing too fast. Owing to the property of block prefetch, increasing fx increases the number of block prefetch operations only slightly, while it increases the ALU part by a relatively large amount. Therefore, a partition size that satisfies the balanced overall schedule requirement can be found. Algorithm 2 determines the partition size that yields the balanced overall schedule.
After the optimal partition size is determined, the
operations in the ALU and memory schedules can be easily
arranged. The ALU part duplicates the schedule for one
iteration. In the memory part, the memory operations for
initial data are allocated first, followed by the memory
operations for intermediate data, as discussed above.
The memory requirement for a partition consists of four
parts: the memory for the calculation of in-partition data,
the memory for prefetch operations of intermediate data, the
memory for keep operations of intermediate data, and the
memory for the operations on initial data. The calculation of
the memory consumption for in-partition data can be found
in [9]. For the other three parts,
Table 1: Experimental results with only one initial data.

             Par vector         New algo               Partition algo                  List          Hardware
Benchmark    Px       Py        size     m_r   len     size    m_r   len     ratio     len  ratio    len  ratio
WDF          (1, 0)   (−3, 1)   4 × 7    221   4.107   4 × 4   143   5.312   22.68%    18   77.18%   10   58.93%
IIR          (1, 0)   (−2, 1)   4 × 9    407   6.028   4 × 7   350   6.893   12.55%    36   83.26%   37   83.71%
DPCM         (1, 0)   (−2, 1)   8 × 10   736   4.01    8 × 8   628   4.891   18.01%    25   83.96%   21   80.9%
2D           (1, 0)   (0, 1)    3 × 5    233   12      3 × 4   207   12      0.0%      55   78.18%   51   76.47%
Floyd        (1, 0)   (−3, 1)   7 × 5    301   6.057   4 × 4   174   6.312   4.04%     32   81.72%   30   79.81%
Input: The ALU schedule for one iteration, the partition
shape $P_x \times P_y$ and the initial data offset vector set $R$.
Output: A partition size which can generate a balanced
overall schedule.
(1) Based on the information of initial data, use
Algorithm 1 to calculate the minimum partition size
$f'_x$ and $f'_y$.
(2) Using the two above conditions on partition size,
calculate another pair of minimum $f''_x$ and $f''_y$.
(3) Get a new pair $f_x = \max(f'_x, f''_x)$ and
$f_y = \max(f'_y, f''_y)$.
(4) Using this pair $(f_x, f_y)$, calculate the number of
prefetch operations, block prefetch operations, and
keep operations.
(5) Calculate the ALU schedule length to see if the
balanced overall schedule requirement is satisfied.
(6) If it is satisfied, this pair $(f_x, f_y)$ is the partition size.
Otherwise, increase $f_x$ by one, use the balanced
overall schedule requirement to find the minimum
$f_y$. If such $f_y$ does not exist, continue increasing $f_x$
until the feasible $f_y$ is found. Use them as partition
size.
(7) Based on the partition size, output the
corresponding ALU part schedule and memory part
schedule.
Algorithm 2: Find a balanced overall schedule.
they can be computed simply by multiplying the number of
operations by the memory requirement of each operation.
The memory requirement for a prefetch operation is 2: one
location stores the data prefetched by the previous partition
and consumed in the current partition, and the other stores
the data prefetched by the current partition and consumed in
the next partition. By the same rule, a keep operation also
takes 2 memory locations. A block prefetch operation takes
2 × block size memory locations.
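The per-operation costs just described can be combined as in the following sketch. It covers only the prefetch, keep, and block prefetch parts; the memory for in-partition data, which the text defers to [9], is not modeled. The function name, argument names, and example counts are hypothetical.

```python
def buffer_memory(n_prefetch: int, n_keep: int,
                  n_block_prefetch: int, block_size: int) -> int:
    """Memory locations needed by the memory operations of one partition.

    Each prefetch and each keep needs 2 locations (one for data consumed in
    the current partition, one for data consumed in the next partition);
    each block prefetch needs 2 * block_size locations.
    """
    return 2 * n_prefetch + 2 * n_keep + 2 * block_size * n_block_prefetch


# Made-up counts: 3 prefetches, 4 keeps, 2 block prefetches of 8 words each.
print(buffer_memory(3, 4, 2, 8))  # -> 46
```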
5. EXPERIMENT
In this section, we use several DSP benchmarks to illustrate
the effectiveness of our new algorithm. The benchmarks,
listed in Tables 1 and 2, are WDF, IIR, DPCM, 2D, and Floyd,
which stand for wave digital filter, infinite impulse response
filter, differential pulse-code modulation device,
two-dimensional filter, and Floyd-Steinberg algorithm,
respectively. All of them are in common use in real DSP
applications. We applied five different algorithms to these
benchmarks: list scheduling, a hardware prefetching scheme,
the partitioning algorithms in [9, 13], and our new partition
algorithm. (Since it has been shown in [9] that the loop tiling
technique cannot outperform partitioning algorithms, we do
not compare against loop tiling in this section.) In list
scheduling, the same architecture model is used. However, the
ALU part uses the traditional list scheduling algorithm, and
the iteration space is not partitioned. In hardware prefetching,
we use the model presented in [18], in which the next block is
also loaded whenever a block is accessed. The partitioning
algorithms in [9, 13] assume the same architecture model as
ours. They partition the iteration space and execute the entire
loop along the partition sequence. However, they do not take
into account the influence of the initial data.
In the experiment, we assume that an ALU computation and a
keep operation each take one clock cycle, a prefetch takes 10
CPU clock cycles, and a block prefetch takes 16 CPU clock
cycles, which is reasonable given the large performance gap
between the CPU and the main memory. Table 1 presents
results with only one initial data with the offset vector (1, 1),
and Table 2 presents results with three initial data with the
offset vector set {(1, 1), (2, −2), (0, 3)}. Note that all three
initial data references are uniformly generated. From the
discussion in Section 4, the overall footprint is simply the
sum of the footprints with respect to the different uniformly
generated reference sets. In Tables 1 and 2, the par vector
column determines the partition shape. The list column gives
the schedule length for list scheduling and the improvement
ratio our algorithm achieves over list scheduling. The
hardware column gives the schedule length for hardware
prefetching and our algorithm’s relative improvement ratio.
Since the algorithm in [13] produces the same result as the
algorithm in [9] when there is no memory size constraint, we
merge their results into one column, partition algo. In the
partition algo and new algo columns, the size column is the
partition size expressed in multiples of the partition vectors,
the m_r column is the corresponding memory requirement,
and the len column is the average schedule length of the
corresponding algorithm. The ratio column is the improvement
our new algorithm achieves relative to the corresponding
algorithm.
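The ratio columns can be checked with a few lines of code. The tables do not state the formula explicitly; the sketch below assumes that the ratio is the relative reduction in schedule length, (len_other − len_new) / len_other, and this assumption reproduces the WDF row of Table 1. The function name is hypothetical.

```python
def improvement_ratio(len_other: float, len_new: float) -> float:
    """Relative reduction in schedule length achieved by the new algorithm."""
    return (len_other - len_new) / len_other


# WDF row of Table 1: new algo len = 4.107.
wdf_new = 4.107
wdf_others = {"partition algo": 5.312, "list": 18, "hardware": 10}

for name, length in wdf_others.items():
    print(f"{name}: {100 * improvement_ratio(length, wdf_new):.2f}%")
# partition algo: 22.68%
# list: 77.18%
# hardware: 58.93%
```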
List scheduling and hardware prefetching schedule the
operations iteration by iteration, which results in a
Table 2: Experimental results with three initial data.

             Par vector         New algo                Partition algo                   List          Hardware
Benchmark    Vx       Vy        size     m_r    len     size    m_r   len      ratio     len  ratio    len  ratio
WDF          (1, 0)   (−3, 1)   8 × 7    474    4.018   4 × 4   206   8        49.78%    22   81.74%   10   58.92%
IIR          (1, 0)   (−2, 1)   5 × 13   772    6.015   4 × 7   472   7.857    23.44%    40   84.96%   37   83.74%
DPCM         (1, 0)   (−2, 1)   8 × 14   1207   4.001   8 × 8   811   5.266    24.02%    29   86.2%    21   80.95%
2D           (1, 0)   (0, 1)    4 × 5    346    12      3 × 4   253   13.833   13.25%    59   79.66%   51   76.47%
Floyd        (1, 0)   (−3, 1)   8 × 6    526    6       4 × 4   223   8.812    31.91%    36   83.33%   30   80%
much longer memory schedule. It is this dominant memory
schedule that makes the overall schedule far from balanced.
Thus, many ALU resources are wasted waiting for data. The
tables show that both methods perform much worse than the
partitioning techniques.
Although the traditional partitioning algorithms balance the
ALU and memory schedules for intermediate data, they do not
consider the initial data. The time needed to load the initial
data is a significant factor for a partition, and ignoring it
results in an unbalanced overall schedule in which the memory
latency cannot be efficiently hidden. This is why the
traditional partitioning algorithms perform worse than our
new algorithm. It also explains why their performance becomes
worse as the number of initial data references increases. Our
new algorithm considers both data locality and the initial
data. Therefore, much better performance can be achieved by
balancing the ALU and memory schedules.
6. CONCLUSION
In this paper, we proposed a new scheme that obtains a
minimal average schedule length while taking the initial data
into consideration. The theory and an algorithm for handling
initial data were presented. The algorithm exploits the
instruction-level parallelism (ILP) among instructions by
using software pipelining techniques and combines it with
data prefetching to produce high-throughput schedules.
Experiments on DSP benchmarks show that our scheme
produces an average schedule length that is never worse, and
in most cases better, than that of existing methods.
REFERENCES
[1] T. Mowry, “Tolerating latency in multiprocessors through
compiler-inserted prefetching,” ACM Trans. Computer Sys-
tems, vol. 16, no. 1, pp. 55–92, 1998.
[2] T.-F. Chen, Data prefetching for high-performance processors,
Ph.D. thesis, Dept. of Comp. Sci. and Engr., University of
Washington, Wash, USA.
[3] F. Dahlgren and M. Dubois, “Sequential hardware prefetching
in shared-memory multiprocessors,” IEEE Trans. on Parallel
and Distributed Systems, vol. 6, no. 7, pp. 733–746, 1995.
[4] N. Manjikian, “Combining loop fusion with prefetching on
shared-memory multiprocessors,” in Proc. International Con-
ference on Parallel Processing, pp. 78–82, Bloomingdale, Ill,
USA, August 1997.
[5] M. K. Tcheun, H. Yoon, and S. R. Maeng, “An adaptive se-
quential prefetching scheme in shared-memory multiproces-
sors,” in Proc. International Conference on Parallel Processing,
pp. 306–313, Bloomingdale, Ill, USA, August 1997.
[6] N. Passos and E. H.-M. Sha, “Scheduling of uniform multi-
dimensional systems under resource constraints,” IEEE Trans.
on VLSI Systems, vol. 6, no. 4, pp. 719–730, 1998.
[7] W. Mangione-Smith, S. G. Abraham, and E. S. Davidson,
“Register requirements of pipelined processors,” in Proc.
International Conference on Supercomputing, pp. 260–271,
Washington, DC, USA, July 1992.
[8] B. R. Rau, “Iterative modulo scheduling: an algorithm for
software pipelining loops,” in Proc. 27th Annual International
Symposium on Microarchitecture, pp. 63–74, San Jose, Calif,
USA, November 1994.
[9] Z. Wang, T. W. O’Neil, and E. H.-M. Sha, “Minimizing av-
erage schedule length under memory constraints by optimal
partitioning and prefetching,” Journal of VLSI Signal Process-
ing, vol. 27, no. 3, pp. 215–233, 2001.
[10] P. Boulet, A. Darte, T. Risset, and Y. Robert, “(pen)-ultimate
tiling,” in Scalable High-Performance Computing Conference,
pp. 568–576, Knoxville, Tenn, USA, May 1994.
[11] J. Chame and S. Moon, “A tile selection algorithm for data lo-
cality and cache interference,” in Proc. 13th ACM International
Conference on Supercomputing, pp. 492–499, Rhodes, Greece,
June 1999.
[12] A. Agarwal, D. A. Kranz, and V. Natarajan, “Automatic par-
titioning of parallel loops and data arrays for distributed
shared-memory multiprocessors,” IEEE Trans. on Parallel and
Distributed Systems, vol. 6, no. 9, pp. 943–962, 1995.
[13] F. Chen and E. H.-M. Sha, “Loop scheduling and partitions
for hiding memory latencies,” in Proc. IEEE 12th International
Symposium on System Synthesis, pp. 64–70, San Jose, Calif,
USA, November 1999.
[14] V. Van Dongen and P. Quinton, “Uniformization of linear re-
currence equations: a step towards the automatic synthesis of
systolic array,” in International Conference on Systolic Arrays,
pp. 473–482, San Diego, Calif, USA, May 1988.
[15] R. Bixby, K. Kennedy, and U. Kremer, “Automatic data layout
using 0-1 integer programming,” in Proc. International Con-
ference on Parallel Architectures and Compilation Techniques,
pp. 111–122, Montreal, Canada, August 1994.
[16] G. Rivera and C. W. Tseng, “Eliminating conflict misses for
high performance architectures,” in Proc. 1998 ACM
International Conference on Supercomputing, pp. 353–360,
Melbourne, Australia, July 1998.
[17] N. L. Passos, E. H.-M. Sha, and S. C. Bass, “Schedule-
based multi-dimensional retiming on data flow graphs,” IEEE
Trans. Signal Processing, vol. 44, no. 1, pp. 150–156, 1996.
[18] J. L. Baer and T. F. Chen, “An effective on-chip preloading
scheme to reduce data access penalty,” in Proc.
Supercomputing ’91, pp. 176–186, Albuquerque, NM, USA,
November 1991.
Zhong Wang received a Bachelor’s degree
in electrical engineering in 1994 from Xi’an
Jiaotong University, China and a Master’s
degree in information and signal process-
ing in 1998 from Institute of Acoustics,
Academia Sinica, China. Currently, he is
pursuing his Ph.D. in computer science and
engineering at University of Notre Dame in
Indiana. His current research focuses on
loop scheduling and high-level synthesis.
Edwin Hsing-Mean Sha received his B.S.
degree in computer science and informa-
tion engineering from National Taiwan
University, Taipei, Taiwan, in 1986; he re-
ceived the M.S. and Ph.D. degrees from the
Department of Computer Science, Prince-
ton University, Princeton, NJ, in 1991 and
1992, respectively. From August 1992 to Au-
gust 2000, he was with the Department of
Computer Science and Engineering at Uni-
versity of Notre Dame, Notre Dame, IN. He served as Associate
Chairman for Graduate Studies since 1995. He is now a tenured
full professor in the Department of Computer Science at the Uni-
versity of Texas at Dallas. He has published more than 140 research
papers in refereed conferences and journals. He has been serving as
an editor for several journals such as IEEE Transactions on Signal
Processing and Journal of VLSI Signal Processing. He also served
as program committee member in numerous conferences. He re-
ceived Oak Ridge Association Junior Faculty Enhancement Award
in 1994, and NSF CAREER Award. He was a guest editor for the
special issue on Low Power Design of IEEE Transactions on VLSI
Systems in 1997. He also served as program chair for the
International Conference on Parallel and Distributed Computing
Systems (PDCS) 2000 and PDCS 2001. He received a teaching award
in 1998.
Yuke Wang received his B.S. degree from
the University of Science and Technology
of China, Hefei, China, in 1989, the M.S.
and Ph.D. degrees from the University of
Saskatchewan, Canada, in 1992 and 1996,
respectively. He has held faculty positions at
Concordia University, Canada, and Florida
Atlantic University, Florida, USA. Currently
he is an Assistant Professor at the Computer
Science Department, University of Texas at
Dallas. He has also held visiting assistant professor positions at the
University of Minnesota, the University of Maryland, and the
University of California at Berkeley. Dr. Yuke Wang is currently an
Editor of IEEE Transactions on Circuits and Systems, Part II, an Editor
of IEEE Transactions on VLSI Systems, an Editor of Applied Signal
Processing, and a few other journals. Dr. Wang’s research interests
include VLSI design of circuits and systems for DSP and
communication, computer aided design, and computer architectures.
During 1996–2001, he published about 60 papers, of which about
20 appeared in IEEE/ACM Transactions.
