Algorithms for Optimally Arranging Multicore Memory Structures by Wei-Che Tseng et al.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2010, Article ID 871510, 16 pages
doi:10.1155/2010/871510
Research Article
Algorithms for Optimally Arranging Multicore
Memory Structures
Wei-Che Tseng, Jingtong Hu, Qingfeng Zhuge, Yi He, and Edwin H.-M. Sha
Department of Computer Science, University of Texas at Dallas, Richardson, TX 75080, USA
Correspondence should be addressed to Wei-Che Tseng, wxt043000@utdallas.edu
Received 31 December 2009; Accepted 6 May 2010
Academic Editor: Chun Jason Xue
Copyright © 2010 Wei-Che Tseng et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
As more processing cores are added to embedded systems processors, the relationships between cores and memories have more
influence on the energy consumption of the processor. In this paper, we conduct fundamental research to explore the eﬀects
of memory sharing on energy in a multicore processor. We study the Memory Arrangement (MA) Problem. We prove that
the general case of MA is NP-complete. We present an optimal algorithm for solving linear MA and optimal and heuristic
algorithms for solving rectangular MA. On average, we can produce arrangements that consume 49% less energy than an all shared
memory arrangement and 14% less energy than an all private memory arrangement for randomly generated instances. For DSP
benchmarks, we can produce arrangements that, on average, consume 20% less energy than an all shared memory arrangement
and 27% less energy than an all private memory arrangement.
1. Introduction
When designing embedded systems, the application of
the system may be known and fixed at the time of the
design. This grants the designer a wealth of information
and the complex task of utilizing the information to meet
stringent requirements, including power consumption and
timing constraints. To meet timing constraints, designers
are forced to increase the number of cores, memory, or
both. However, adding more cores and memory increases the
energy consumption. As more processing cores are added to
a processor, the relationships between cores and memories
have more influence on the energy consumption of the
processor.
In this paper, we conduct fundamental research to
explore the effects of memory sharing on energy in a multi-core
processor. We consider a multi-core system where each
core may either have a private memory or share a memory
with other cores. The Memory Arrangement Problem (MA)
decides whether cores will have a private memory or share
a memory with adjacent cores to minimize the energy
consumption while meeting the timing constraint. Some
examples of memory arrangements are shown in Figure 1.
The main contributions of this paper are as follows.
(i) We prove that MA without sharing constraints is NP-complete.
(ii) We propose an efficient optimal algorithm for solving
linear cases of MA and extend it into an efficient
heuristic for solving rectangular cases of MA.
(iii) We propose both an optimal algorithm and an
efficient heuristic for solving rectangular cases of
MA where only rectangular blocks of cores share
memories.
Our experiments show that, on average, we can produce
arrangements that consume 49% less energy than an all
shared memory arrangement and 14% less energy than an
all private memory arrangement for randomly generated
instances. For benchmarks from DSPStone [1], we can
produce arrangements that, on average, consume 20% less
energy than an all shared memory arrangement and 27% less
energy than an all private memory arrangement.
The rest of the paper is organized as follows. Related
works are presented in Section 2. Section 3 provides a
motivational example to demonstrate the importance of
MA. Section 4 formally defines MA and presents two
properties of MA. Section 5 presents an optimal algorithm
for linear instances of MA. Section 6 proves that MA with
arbitrary memory sharing is NP-complete. Section 7 presents
algorithms to solve rectangular instances of MA including
an optimal algorithm where only rectangular sets of cores
can share a memory and an efficient heuristic to find a
good memory arrangement in a reasonable amount of time.
Section 8 presents our experiments and the results. We
conclude our paper in Section 9.
Figure 1: Memory arrangements. Each circle represents a core, and each rectangle represents a memory.
2. Related Works
Many researchers in diﬀerent areas have already begun
lowering the energy consumption of memories. On a VLIW
architecture, Zhao et al. [2] study the eﬀect of register
file repartitioning on energy consumption. Wang et al. [3]
develop a leakage-aware modulo scheduling algorithm to
achieve leakage energy savings for DSP applications with
loops. For multiprocessor embedded systems, Qiu et al. [4]
take advantage of Dynamic Voltage Scaling to optimally
minimize the expected total energy consumption while
satisfying a timing constraint with a guaranteed confidence
probability. On a multi-core architecture, Hua et al. [5] use
Adaptive Body Biasing as well as Dynamic Voltage Scaling
to minimize both dynamic and leakage energy consumption
for applications with loops. Saha et al. [6] attack the
synchronization problems of concurrent memory accesses by
proposing a new software transactional memory system that
makes it both easy and eﬃcient for multiprocess programs
to share memory. Kumar et al. [7] focus on the interconnects
of a multi-core processor. They show that interconnects play
a bigger role in a multi-core processor than in a single core
processor. We attack the problem from a diﬀerent angle,
exploring how memory sharing in a multi-core processor can
aﬀect the energy consumption.
Other researchers have worked on problems more
specific to the memory subsystem of multi-core systems
including data partitioning and task scheduling. In a timing
focused work, Xue et al. [8] present a loop scheduling
with memory management technique to completely hide
memory latencies for applications with multidimensional
loops. Suhendra et al. [9] present an ILP formulation that
performs data partitioning and task scheduling simultane-
ously. Zhang et al. [10] present two heuristics to solve larger
problems eﬃciently. The memory architectural model used
is a virtually shared scratch pad memory (VS-SPM) [11],
where each core has its own private memory and treats all
the memories of the other cores as one big shared memory.
Other researchers also start with a given multi-core memory
architecture and use the memory architecture to partition
data [12–16]. We approach the problem by designing the
memory architecture around the application.
A few others have taken a similar approach. Meftali
et al. [17] provide a general model for distributing data
between private memories and a global shared memory.
They assume that each processor has a local memory, and
all processors share a remote memory. This is similar to
an architecture with private L1 memories and a shared L2
memory. This architecture does not provide the possibility of
only a few processors sharing a memory. The integer linear
programming-(ILP-) based algorithm presented decides on
the size of the private memories. Ozturk et al. [18] also com-
bine both memory hierarchy design and data partitioning
with an ILP approach to minimize the energy spent on data
access. The weaknesses of this approach are that ILP takes
an unreasonable amount of time for large instances, and
timing is not considered. The generated architecture might
be energy eﬃcient but takes a long time to complete the
tasks. In another publication, Ozturk et al. [19] aim to lower
power consumption by providing a method for partitioning
the available memory to the processing units or groups of
processing units based on the number of accesses on each
data element. The proposed method does not consider any
issues related to time such as the time it takes to access the
data or the duration of the tasks on each processing unit. Our
proposed algorithms will consider these time constraints to
ensure that the task lengths do not grow out of hand.
3. Motivational Example
In this section, we present an example that illustrates the
memory arrangement problem. We informally explain the
problem while we present the example.
The cores in a multi-core processor can be arranged
either as a line or as a rectangle. For our example, we have
6 cores arranged in a 2× 3 rectangle as shown in Figure 2.
Each core has a number of operations that it must
complete. We can divide these operations into those that
require memory accesses and those that do not. The computational
time and energy required by operations that do
not require memory accesses are independent of the memory
arrangement. We do not consider the energy required by
these operations since they are all constants, but we do
consider the time required since it may affect the ability of
a core to meet its timing constraint. Each core then has a
constant time for the operations that do not require memory
accesses. For our example, each core requires ten units of
time for these operations.
Figure 2: Motivational example. Each circle denotes a core. The cores form a 2 × 3 grid with v1,1, v1,2, v1,3 in the first row and v2,1, v2,2, v2,3 in the second row.
Table 1: Data accesses. Each row is the accessing core; each column is the core that owns the accessed memory.
        v1,1  v1,2  v1,3  v2,1  v2,2  v2,3
v1,1      5     0     0     3     0     0
v1,2      0     0     0     0     0     5
v1,3      0     0     2     0     0     0
v2,1      4     0     0     0     0     0
v2,2      0     0     0     0     2     0
v2,3      0     5     0     0     0     0
For the operations that do require memory accesses, we
count the number of these operations for each pair of cores.
This number is the number of times a core needs to access
the memory of another core. These counts for our example
are shown in Table 1. In Table 1, the left column shows
which core requires the memory accesses. The top row shows
which core the memory accessed belongs to. For instance,
v1,1 has five operations that access its own memory and three
operations that access the memory of v2,1.
The computational time and energy required by each of
these memory-access operations depend on the memory
arrangement. The least time and energy are required
when a core with private memory accesses its own memory.
For our example, each of these accesses takes one unit of
time and one unit of energy. The most time and
energy are required when a core accesses a remote memory.
For our example, each of these accesses takes three units of
time and three units of energy. In between, each access by
a core to a memory that it shares with another core takes
two units of time and two units of energy.
To make sure that the computations do not take too
long, we restrict the time that each core is allowed to
take. If, for a memory arrangement, any core takes more
time than the timing constraint allows, we say that the
memory arrangement does not meet the timing constraint.
Sometimes it is impossible to find a memory arrangement
that meets the timing constraint. For our example, the timing
constraint is 25 units of time.
Two simple memory arrangements are the all private
memory arrangement and the all shared memory arrange-
ment. These are shown in Figure 1. Figure 1(a) shows the
all private memory arrangement where each core has its
own memory. Figure 1(b) shows the all shared memory
arrangement where all cores share one memory.
Let us calculate the time and energy used by these two
memory arrangements. First, let us consider the cores v1,1
and v2,1. In the all private memory arrangement, v1,1 uses
5 units of time and energy to access its own memory and
9 units of time and energy to access the memory of v2,1.
Including the operations that do not need memory accesses,
v1,1 uses a total of 24 units of time and 14 units of energy.
v2,1 uses 12 units of time and energy to access the memory of
v1,1. Including the non-memory-access operations, v2,1 uses
a total of 22 units of time and 12 units of energy. Together,
these two cores use 26 units of energy.
In the all shared memory arrangement, v2,1 uses 8 units
of time and energy to access the memory of v1,1. Including
the non-memory-access operations, v2,1 uses a total of 18
units of time and 8 units of energy. v1,1 uses 10 units of time
and energy to access its own memory and 6 units of time
and energy to access the memory of v2,1. Including the non-
memory-access operations, v1,1 uses a total of 26 units of time
and 16 units of energy. Together, these two cores use 24 units
of energy, which is less than the 26 units of energy that the
all private memory arrangement uses. However, v1,1 takes 26
units of time, thus the all shared memory arrangement does
not meet the timing constraint. We should use the all private
memory arrangement even though it uses more energy.
Let us now consider the cores v1,2, v1,3, v2,2, and v2,3.
In the all private memory arrangement, cores v1,2 and v2,3
each use 15 units of time and energy to access each other’s
memory. Including the non-memory-access operations, v1,2
and v2,3 each use 25 units of time and 15 units of energy.
v1,3 and v2,2 each use 2 units of time and energy to access its
own memory. Including the non-memory-access operations,
v1,3 and v2,2 each use 12 units of time and 2 units of energy.
Together, these four cores use 34 units of energy.
In the all shared memory arrangement, cores v1,2 and v2,3
each use 10 units of time and energy to access each other’s
memory. Including the non-memory-access operations, v1,2
and v2,3 each use 20 units of time and 10 units of energy.
v1,3 and v2,2 each use 4 units of time and energy to access its
own memory. Including the non-memory-access operations,
v1,3 and v2,2 each use 14 units of time and 4 units of energy.
Together, these four cores use 28 units of energy, which is
less than the 34 units of energy that the all private memory
arrangement uses, but the all shared memory arrangement
does not meet the timing constraint for v1,1. Hence, the best
we can do with either an all shared or all private memory
arrangement is to use 60 units of energy.
Instead of an all private or all shared memory arrange-
ment, it would be better to have a mixed memory arrange-
ment where v1,1 and v2,1 each use a private memory while the
rest of the cores share one memory as shown in Figure 1(c).
This memory arrangement uses only 54 units of energy and
meets the timing constraint. All of our algorithms are able to
achieve this arrangement, but it is possible to do better.
Figure 3: Linear array of cores. Each circle denotes a core.
Figure 4: Memory sharing example. Each circle represents a single
core (v1 through v6). All cores in the same rectangle share a memory.
If we have an arrangement such that v1,2 and v2,3 share a
memory but all the other cores have private memories, then
we can meet the timing constraint and use only 50 units of
energy. This arrangement, however, is diﬃcult to implement
since v1,2 and v2,3 are not adjacent to each other. In a larger
chip, it is not advantageous from an implementation point of
view to have two cores on opposite sides of the chip share a
memory. Moreover, we prove that this version of the problem
is NP-complete in Section 6.
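The arithmetic of this example can be checked mechanically. The following is a minimal sketch, not the authors' code: the names `CORES`, `W`, `evaluate`, and the string labels such as `"v11"` are hypothetical, and the rule that a core reading its own data from a shared memory pays the shared cost is taken from the calculations above.

```python
# Table 1: W[u][v] = number of times core u accesses data owned by core v.
CORES = ["v11", "v12", "v13", "v21", "v22", "v23"]
W = {
    "v11": {"v11": 5, "v21": 3},
    "v12": {"v23": 5},
    "v13": {"v13": 2},
    "v21": {"v11": 4},
    "v22": {"v22": 2},
    "v23": {"v12": 5},
}
B_TIME = 10          # time of non-memory-access operations per core
T = (1, 2, 3)        # time per access: private, shared, remote
E = (1, 2, 3)        # energy per access: private, shared, remote
Q = 25               # timing constraint


def evaluate(partition):
    """Return (total energy, worst core time, meets timing constraint)."""
    block_of = {u: frozenset(block) for block in partition for u in block}
    total_energy = 0
    worst_time = 0
    for u in CORES:
        time_u = B_TIME
        for v, count in W[u].items():
            if block_of[u] != block_of[v]:
                level = 2                      # remote memory
            elif len(block_of[u]) == 1:
                level = 0                      # own private memory
            else:
                level = 1                      # shared memory
            time_u += T[level] * count
            total_energy += E[level] * count
        worst_time = max(worst_time, time_u)
    return total_energy, worst_time, worst_time <= Q


all_private = [[c] for c in CORES]
all_shared = [CORES]
mixed = [["v11"], ["v21"], ["v12", "v13", "v22", "v23"]]
non_adjacent = [["v12", "v23"], ["v11"], ["v13"], ["v21"], ["v22"]]

for name, p in [("all private", all_private), ("all shared", all_shared),
                ("mixed", mixed), ("non-adjacent", non_adjacent)]:
    print(name, evaluate(p))
```

Under these assumptions the four arrangements evaluate to 60, 52 (infeasible, worst time 26), 54, and 50 units of energy, matching the figures quoted in the text.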
4. Problem Definition
We now formally define our problem. Let us consider the
problem of memory sharing to minimize energy while
meeting a timing constraint assuming that all operations and
data have already been assigned to cores. We call this problem
the Memory Arrangement Problem (MA). We first explain
the memory architecture then MA.
We are given a sequence V = 〈v1, v2, v3, . . . , vn〉 of
processor cores. The cores are arranged either in a line
or a rectangle. For example, the cores in Section 5 are
arranged in a line, as shown in Figure 3. Each
core has operations and data assigned to it. We can divide
the operations into memory-access operations and
non-memory-access operations. For a core u ∈ V, b(u) is the
time it takes for u to complete all its non-memory-access
operations. For cores u, v ∈ V, w(u, v) is the number of times
core u accesses a data item that belongs to v. The time and energy
it takes for u to access a data item that belongs to v depend on
how the memories of u and v are related. If u = v and u does not
share a memory with any other core, that is, u accesses its own
private memory, then the time and
energy each memory-access operation takes are t0 and e0,
respectively. If u and v share a memory, but u ≠ v, then the
time and energy each memory-access operation takes are t1
and e1, respectively. As in the example of Section 3, a core that
accesses its own data in a shared memory also takes t1 and e1.
If u and v do not share a memory, then
the time and energy each memory-access operation takes are
t2 and e2, respectively. For convenience, let us denote the time
and energy each memory-access operation from u to v takes as Ct(u, v)
and Ce(u, v), respectively. For example, if v3 and v5 share the
same memory, then Ct(v3, v5) = t1 and Ce(v3, v5) = e1.
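The cases above can be collected into a single definition. This is a restatement under one assumption: the notation B(u) for the block of the partition that contains u is introduced here for illustration, and the second case covers a core accessing its own data in a shared memory, as the calculations in Section 3 do.

```latex
\[
C_t(u, v) =
\begin{cases}
t_0 & \text{if } B(u) = B(v) \text{ and } |B(u)| = 1, \\
t_1 & \text{if } B(u) = B(v) \text{ and } |B(u)| > 1, \\
t_2 & \text{if } B(u) \neq B(v),
\end{cases}
\qquad
C_e(u, v) =
\begin{cases}
e_0 & \text{if } B(u) = B(v) \text{ and } |B(u)| = 1, \\
e_1 & \text{if } B(u) = B(v) \text{ and } |B(u)| > 1, \\
e_2 & \text{if } B(u) \neq B(v).
\end{cases}
\]
```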
We can represent the memory sharing of the cores with
a partition of the cores such that two cores are in the same
block if they share a memory. Let us consider the example
in Figure 4. The memory sharing can be captured by the
partition {{v1, v2, v3}, {v4}, {v5, v6}}.
We wish to find a partition of the cores to minimize the
total energy used by memory-access operations,
\[ \sum_{u \in V} \sum_{v \in V} C_e(u, v)\, w(u, v). \tag{1} \]
Energy is not our only concern. We also want to make
sure that all operations finish within the timing constraint.
Aside from memory-access operations, non-memory-access
operations also take time. Since the memory sharing does not
affect the time taken by non-memory-access operations, for
each u ∈ V we describe all the time taken by non-memory-access
operations with the single number b(u). The timing
constraint q then requires that
\[ b(u) + \sum_{v \in V} C_t(u, v)\, w(u, v) \le q \quad \forall u \in V. \tag{2} \]
MA then asks, given a sequence V , w(u, v) ∈ Z∗ for each
u, v ∈ V , b(u) ∈ Z∗ for each u ∈ V , and nonnegative
integers t0, e0, t1, e1, t2, e2, q, “what is a partition P such
that the total energy used by memory-access operations is
minimized and the timing constraint is met?”
Now that we have formally defined MA, we look at two
of its properties. We use these properties in the later sections.
4.1. Optimal Substructure Property. Suppose that P is an
optimal partition of V for an instance I = 〈V, w, b,
t0, e0, t1, e1, t2, e2, q〉. Let B1 be the block that contains v1.
Consider the subinstance
I′ = 〈V′, w, b′, t0, e0, t1, e1, t2, e2, q〉, where V′ and b′ are
defined as follows:
\[ V' = V - B_1, \qquad b'(u) = b(u) + t_2 \sum_{v \in B_1} w(u, v) \quad \forall u \in V'. \tag{3} \]
Lemma 1. P′ = P − {B1} is an optimal partition for I′.
Proof. Let us prove Lemma 1 by contradiction. Suppose for
the purpose of contradiction that P′ is not an optimal partition
for I′. Then there is a partition Q′ for I′ such that Q′ is a
better partition than P′. Let Q = Q′ ∪ {B1}. Since Q′ meets the
timing requirements in I′, and b′ already charges each core in V′
the time t2 for every access to B1, each core in V′ meets the timing
requirements of I under Q. Each core in B1 takes the same time
under Q as under P. Hence Q is a partition
that meets the timing requirements in I. Furthermore, Q is a
better partition than P, a contradiction.
4.2. Conglomerate Property. Suppose a partition P contains
two different blocks of size at least 2, that is, Bi, Bj ∈ P, where
i ≠ j, |Bi| > 1, and |Bj| > 1. Let P′ = (P − {Bi, Bj}) ∪ {Bi ∪ Bj}.
If t1 ≤ t2 and e1 ≤ e2, then P′ would be a partition that is as
good as or better than P.
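Proof. Without loss of generality, let the two blocks be B1 and B2. Let V′ = V − B1 − B2 and B′ = B1 ∪ B2. The total
energy changes only for accesses between B1 and B2, which cost e2 each in P and e1 each in P′. Accesses within B1 or within B2 cost e1 in both partitions because each block already has more than one core, and accesses that involve a core in V′ are unaffected. Writing E(P) for the total energy (1) of a partition P, we have
\[ E(P) - E(P') = (e_2 - e_1) \Big[ \sum_{u \in B_1} \sum_{v \in B_2} w(u, v) + \sum_{u \in B_2} \sum_{v \in B_1} w(u, v) \Big] \ge 0. \]
Similarly, the time taken by each core u ∈ B1 decreases by (t2 − t1) ∑v∈B2 w(u, v) ≥ 0, and symmetrically for each core in B2, while the time taken by every core in V′ is unchanged. Hence P′ meets the timing constraint whenever P does, and P′ is as good as or better than P.
5. Linear Instances
Figure 5: Subinstances. There are 6 sets of cores. Each set has one
more core than the previous set.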
In this section, we consider the linear instances of MA, in which
the cores are arranged in a line. An
example is shown in Figure 3. Let us make the assumption
that only cores next to each other can share a memory. In
other words, shared memories must only contain contiguous
blocks of cores, that is, if ui, uj ∈ V are in the same block
Bx ∈ P, then uk ∈ Bx for all i ≤ k ≤ j. This is the case in
real applications since it is difficult to share memory between
cores that are not adjacent. We consider what happens when
we allow arbitrary cores to share a memory in Section 6.
Using the optimal substructure property of MA, we can
solve the problem recursively. Unfortunately, in Section 4.1
we assumed that we already know the first block of an
optimal partition. Since we do not know any optimal
partitions, we will try all the possible first blocks and
then find the best block. Figure 5 shows an example of
the sub-instances of a problem. Notice that because of our
assumption, all the sub-instances include vn.
Let Ii = 〈Vi, w, bi, t0, e0, t1, e1, t2, e2, q〉 be the sub-instance
whose first core is vi, where Vi and bi are
defined as follows:
\[ V_i = \{v_i, v_{i+1}, v_{i+2}, \ldots, v_n\}, \qquad b_i(u) = b(u) + t_2 \sum_{v \in V - V_i} w(u, v) \quad \forall u \in V_i. \tag{5} \]
Note that I1 = I, and there are, including I1, only n sub-instances.
Input: An instance I of Linear MA.
Output: An optimal partition P1 and its energy
consumption d1.
(1) dn+1 ← 0
(2) Pn+1 ← {}
(3) for i ← n to 1 do
(4)   Vi ← {vi, vi+1, vi+2, . . . , vn}
(5)   di ← ∞
(6)   Pi ← {}
(7)   for ℓ ← 1 to n − i + 1 do
(8)     $V_i^\ell$ ← {vi, vi+1, vi+2, . . . , vi+ℓ−1}
(9)     Compute $c_i^\ell$ and $d_i^\ell$.
(10)    if $d_i^\ell$ < di then
(11)      di ← $d_i^\ell$
(12)      Pi ← {$V_i^\ell$} ∪ Pi+ℓ
(13)    end if
(14)  end for
(15) end for
Algorithm 1: Optimal linear memory arrangement (OLMA).
For each sub-instance Ii, let Pi be an optimal partition
that satisfies the timing constraints. Let di be the energy
consumption of Pi or ∞ if no partition can meet the timing
constraint for Ii. Let $V_i^\ell$ be the first ℓ cores in Vi, that
is, $V_i^\ell$ = {vi, vi+1, vi+2, . . . , vi+ℓ−1}. Let $c_i^\ell$ be the minimum
energy necessary for Ii if $V_i^\ell$ is a block in Pi. Let $d_i^\ell$ be ∞ if
no partition of Vi that contains $V_i^\ell$ as a block satisfies the
timing constraints. Otherwise, let $d_i^\ell$ be $c_i^\ell$. We can define $c_i^\ell$,
$d_i^\ell$, and di recursively as (6), (7), and (8), respectively.
During the computation of di, we record the optimal
value of ℓ by recording the corresponding partition in Pi.
Let Pn+1 = {}. For all 1 ≤ i ≤ n, let k be an optimal
value of ℓ used to compute di. Then Pi = {$V_i^k$} ∪ Pi+k. If
di = ∞, then there is no partition for Ii that satisfies the
timing requirement, and Pi is undefined. P1 is an optimal
partition for I, and d1 is the energy necessary. If d1 = ∞, then
there does not exist a partition for I that satisfies the timing
requirement.
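The recurrence can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function name `olma`, the 0-based core indices, and the argument layout are assumptions, and the block energy is charged as in the worked example (accesses inside the block at e0 or e1, all other accesses at e2).

```python
import math


def olma(w, b, t, e, q):
    """Return (d_1, P_1) for a linear instance; cores are indexed 0..n-1.

    w[u][v]: accesses from core u to data of core v; b[u]: non-memory time;
    t, e: (private, shared, remote) time and energy per access; q: deadline.
    """
    n = len(b)
    d = [math.inf] * (n + 1)
    P = [None] * (n + 1)
    d[n], P[n] = 0, []                      # sentinels d_{n+1} and P_{n+1}
    for i in range(n - 1, -1, -1):          # sub-instances from last to first
        for length in range(1, n - i + 1):  # candidate first block
            block = range(i, i + length)
            level = 0 if length == 1 else 1
            energy, feasible = 0, True
            for u in block:
                inside = sum(w[u][v] for v in block)
                outside = sum(w[u]) - inside
                energy += e[level] * inside + e[2] * outside
                if b[u] + t[level] * inside + t[2] * outside > q:
                    feasible = False
            cand = energy + d[i + length] if feasible else math.inf
            if cand < d[i]:                 # keep the best first block
                d[i] = cand
                P[i] = [list(block)] + P[i + length]
    return d[0], P[0]


# Data accesses of the six-core linear example (Table 2).
w = [[5, 0, 0, 0, 0, 3],
     [0, 0, 0, 5, 0, 0],
     [0, 0, 2, 0, 0, 0],
     [0, 5, 0, 0, 0, 0],
     [0, 0, 0, 0, 2, 0],
     [4, 0, 0, 0, 0, 0]]
print(olma(w, [10] * 6, (1, 2, 3), (1, 2, 3), 25))
```

With the example data the sketch returns an energy of 52 and the blocks {v1}, {v2, v3, v4}, {v5}, {v6} (as 0-based indices). The four nested loops give the O(n^4) running time stated for OLMA.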
Optimal Linear Memory Arrangement (OLMA), shown
in Algorithm 1, is an algorithm to compute Pi and di. It starts
by setting the sentinels for Pn+1 and dn+1 in lines 1-2. The
body of the algorithm is the for loop on lines 3–15. Notice
that it computes P and d from n to 1. For each value of
i, OLMA computes $d_i^\ell$ starting from ℓ = 1. $c_i^\ell$ and $d_i^\ell$ are
computed according to equations (6) and (7) on line 9. Lines
10–13 record the optimal Pi whenever a better di is found. At
the end of the algorithm, P1 holds an optimal partition for
I, and d1 holds the energy consumption of P1. The running
time of OLMA is O(n^4) where n is the number of cores.
Figure 6: Example for OLMA. Each circle is a core (v1 through v6, arranged in a line).
Table 2: Data accesses.
      v1  v2  v3  v4  v5  v6
v1     5   0   0   0   0   3
v2     0   0   0   5   0   0
v3     0   0   2   0   0   0
v4     0   5   0   0   0   0
v5     0   0   0   0   2   0
v6     4   0   0   0   0   0
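Let us illustrate OLMA with an example. We unroll the
example from Section 3 to create a linear example of 6 cores
as shown in Figure 6. In other words, V = 〈v1, v2, v3, . . . , v6〉.
The memory access operations are shown in Table 2. For each
core u ∈ V, b(u) = 10. t0 = e0 = 1, t1 = e1 = 2, and
t2 = e2 = 3. The timing constraint q = 25.
The computed values of $d_i^\ell$ are shown in Table 3, and the
computed values of di and Pi are shown in Table 4. From
these values, we see that if v1 is not in a block by itself, then
it is unable to meet the timing constraint. Thus, $d_1^\ell$ = ∞
for ℓ > 1. The optimal partition for this example is P1 =
{{v1}, {v2, v3, v4}, {v5}, {v6}}, and its energy consumption is 52.
\[ c_i^\ell = d_{i+\ell} +
\begin{cases}
e_0\, w(v_i, v_i) + e_2 \sum_{v \in V - V_i^\ell} w(v_i, v) & \text{if } \ell = 1, \\
\sum_{u \in V_i^\ell} \Big[ e_1 \sum_{v \in V_i^\ell} w(u, v) + e_2 \sum_{v \in V - V_i^\ell} w(u, v) \Big] & \text{if } \ell > 1,
\end{cases} \tag{6} \]
\[ d_i^\ell =
\begin{cases}
\infty & \text{if } |V_i^\ell| = 1 \text{ and } b_i(u) + t_0\, w(u, u) + \sum_{v \in V_i - V_i^\ell} t_2\, w(u, v) > q \text{ for any } u \in V_i^\ell, \\
\infty & \text{if } |V_i^\ell| > 1 \text{ and } b_i(u) + \sum_{v \in V_i^\ell} t_1\, w(u, v) + \sum_{v \in V_i - V_i^\ell} t_2\, w(u, v) > q \text{ for any } u \in V_i^\ell, \\
c_i^\ell & \text{otherwise,}
\end{cases} \tag{7} \]
\[ d_i = \min_{1 \le \ell \le n - i + 1} d_i^\ell, \qquad d_{n+1} = 0. \tag{8} \]
6. NP-Completeness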
Let us consider MA if we do not assume that only cores next
to each other may share a memory. Since any cores can share
a memory, the shape that the cores are arranged in does not
aﬀect the solution. We first define the decision version of MA
and then show that it is NP-complete.
An instance of MA consists of a set V, functions w : V ×
V → N and b : V → N, nonnegative integers t0, e0, t1, e1,
t2, e2, q, and k. The question is as follows. Is there a partition
P of V such that the timing requirement q is met and the
energy consumption is at most k?
Let us apply the conglomerate property. For any partition
P, there is a partition P′ such that P′ is at least as good
as P and P′ contains only one block that has a cardinality
greater than 1. We can specify P′ with a subset V ′ ⊆ V
where V ′ contains the cores that do not share a memory with
another core. Conversely, for any subset V ′ ⊆ V , there exists
a corresponding partition P = {V − V ′} ∪ {{v}|v ∈ V ′}.
Thus, we can restate the decision question as follows. Is there
a subset V ′ ⊆ V such that its corresponding partition meets
the timing and energy requirements?
Theorem 1. MA is NP-complete.
Proof. It is easy to see that MA∈NP since a nondeterministic
algorithm needs only to guess a partition of V and check in
polynomial time whether that partition meets the timing and
energy requirements.
We transform the well-known NP-complete problem
KNAPSACK to MA. First, let us define KNAPSACK. An
instance of KNAPSACK consists of a set U, a size s(u) ∈ Z+
and a value v(u) ∈ Z+ for each u ∈ U, and positive integers
B and K. The question is as follows. Is there a subset U′ ⊆ U
such that $\sum_{u \in U'} s(u) \le B$ and $\sum_{u \in U'} v(u) \ge K$?
Let U = {u1, u2, u3, . . . , un}, s(u), v(u), B, and K be
any instance of KNAPSACK. We must construct a set V,
functions w : V × V → N and b : V → N, and nonnegative
integers t0, e0, t1, e1, t2, e2, q, and k such that there is a subset
U′ ⊆ U such that $\sum_{u \in U'} s(u) \le B$ and $\sum_{u \in U'} v(u) \ge K$ if and
only if there is a subset V′ ⊆ V such that its corresponding
partition meets both the timing and energy requirements.
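We construct a special case of MA such that the resulting
problem is the same as KNAPSACK. We start by setting V =
U ∪ {u0}, where u0 is a new core that is not in U. For all v1, v2 ∈ V, let
\[ w(v_1, v_2) =
\begin{cases}
s(v_2) & \text{if } v_1 = u_0 \text{ and } v_2 \in U, \\
s(v_2) + v(v_2) & \text{if } v_1 = v_2 \text{ and } v_1 \in U, \\
0 & \text{otherwise.}
\end{cases} \tag{9} \]
For all v ∈ V, let
\[ b(v) =
\begin{cases}
0 & \text{if } v \in U, \\
\sum_{u \in U} v(u) & \text{if } v = u_0.
\end{cases} \tag{10} \]
We complete the construction of our instance of MA by
setting t0 = 0, e0 = 1, t1 = 1, e1 = 2, t2 = 2, e2 = 3,
$q = \sum_{u \in U} [s(u) + v(u)] + B$, and $k = \sum_{u \in U} [4s(u) + 2v(u)] - K$.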
It is easy to see how the construction can be accomplished
in polynomial time. All that remains to be shown is that the
answer to KNAPSACK is yes if and only if the answer to MA
is yes.
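The construction can be exercised on small instances by brute force. The sketch below is illustrative only: the function names are hypothetical, it reads the KNAPSACK value bound as "total value at least K" and the MA energy bound as "energy at most k" (the reading that the choice of k requires), and it relies on the conglomerate property to enumerate only partitions with a single shared block that contains u0.

```python
from itertools import combinations


def knapsack_yes(s, v, B, K):
    """Brute-force KNAPSACK: some subset has size <= B and value >= K."""
    n = len(s)
    return any(
        sum(s[i] for i in c) <= B and sum(v[i] for i in c) >= K
        for r in range(n + 1) for c in combinations(range(n), r)
    )


def reduce_to_ma(s, v, B, K):
    """Build the MA instance of (9) and (10); core n plays the role of u0."""
    n = len(s)
    w = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        w[i][i] = s[i] + v[i]          # (9): v1 = v2 in U
        w[n][i] = s[i]                 # (9): v1 = u0, v2 in U
    b = [0] * n + [sum(v)]             # (10)
    t, e = (0, 1, 2), (1, 2, 3)
    q = sum(s) + sum(v) + B
    k = sum(4 * si + 2 * vi for si, vi in zip(s, v)) - K
    return w, b, t, e, q, k


def ma_yes(w, b, t, e, q, k):
    """Try every set of private cores V' within U; the rest share with u0."""
    n = len(b) - 1
    for r in range(n + 1):
        for private in combinations(range(n), r):
            shared = set(range(n + 1)) - set(private)
            energy, ok = 0, True
            for u in range(n + 1):
                time_u = b[u]
                for x in range(n + 1):
                    if u in shared and x in shared:
                        level = 1 if len(shared) > 1 else 0
                    elif u == x:
                        level = 0      # private core accessing its own data
                    else:
                        level = 2      # remote access
                    time_u += t[level] * w[u][x]
                    energy += e[level] * w[u][x]
                ok = ok and time_u <= q
            if ok and energy <= k:
                return True
    return False


print(ma_yes(*reduce_to_ma([3, 4, 2], [4, 5, 3], 5, 7)))
```

For sizes (3, 4, 2), values (4, 5, 3), and B = 5, both checks answer yes for K = 7 and no for K = 8.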
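Since w(u0, u0) = 0, it is of no advantage for u0 to be in
a block by itself. Therefore, u0 ∉ V′ unless V ⊆ V′, and we may
assume that V′ ⊆ U. The time required by u0 is
\[ b(u_0) + \sum_{u \in U - V'} t_1\, s(u) + \sum_{u \in V'} t_2\, s(u) = \sum_{u \in U} [s(u) + v(u)] + \sum_{u \in V'} s(u). \tag{11} \]
A core u ∈ U requires time t0[s(u) + v(u)] = 0 if u ∈ V′ and
t1[s(u) + v(u)] = s(u) + v(u) otherwise.
Notice that the time required by u0 is at least that required by any
u ∈ U. Hence, the timing constraint is met if and only if
\[ \sum_{u \in U} [s(u) + v(u)] + \sum_{u \in V'} s(u) \le \sum_{u \in U} [s(u) + v(u)] + B, \tag{12} \]
that is, if and only if
\[ \sum_{u \in V'} s(u) \le B. \tag{13} \]
The total energy consumption is
\[ \sum_{u \in U - V'} e_1\, s(u) + \sum_{u \in V'} e_2\, s(u) + \sum_{u \in V'} e_0 [s(u) + v(u)] + \sum_{u \in U - V'} e_1 [s(u) + v(u)] = \sum_{u \in U} [4s(u) + 2v(u)] - \sum_{u \in V'} v(u). \tag{14} \]
Hence, the energy requirement is met if and only if
\[ \sum_{u \in U} [4s(u) + 2v(u)] - \sum_{u \in V'} v(u) \le \sum_{u \in U} [4s(u) + 2v(u)] - K, \tag{15} \]
that is, if and only if
\[ \sum_{u \in V'} v(u) \ge K. \tag{16} \]
Hence, there is a subset V′ ⊆ V that meets both the timing
and energy requirements if and only if there is a subset U′ ⊆
U such that $\sum_{u \in U'} s(u) \le B$ and $\sum_{u \in U'} v(u) \ge K$. Thus, MA
is NP-complete.
Table 3: $d_i^\ell$. Rows are indexed by i and columns by ℓ.
i \ ℓ    1    2    3    4    5    6
1       52    ∞    ∞    ∞    ∞    ∞
2       46   48   38   40   40
3       31   33   35   35
4       29   31   31
5       14   16
6       12
Table 4: di and Pi.
i   di   Pi
1   52   {{v1}, {v2, v3, v4}, {v5}, {v6}}
2   38   {{v2, v3, v4}, {v5}, {v6}}
3   31   {{v3}, {v4}, {v5}, {v6}}
4   29   {{v4}, {v5}, {v6}}
5   14   {{v5}, {v6}}
6   12   {{v6}}
7    0   {}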
7. Rectangular Instances
Since general MA is NP-complete and linear MA is in P, let us
consider the case when the cores are arranged as a rectangle.
An example of such an arrangement is our motivational
example shown in Figure 2. We extend OLMA to solve the
rectangular case in Section 7.1. In Section 7.2, we define
what staircase-shaped sets are. Then we use staircase-shaped
sets to optimally solve rectangular MA in Section 7.3. We
finally present a good heuristic to solve rectangular MA in
Section 7.4.
7.1. Zigzag Rectangular Partitions. We propose an algorithm
Zigzag Rectangular Memory Arrangement (ZiRMA) to solve
this problem. ZiRMA transforms rectangular instances into
linear instances before applying OLMA. It runs in polyno-
mial time but cannot guarantee optimality.
Let us use OLMA to handle this case by treating the
rectangle as a zigzag line as shown in Figure 7(b). To
transform an m × n rectangle into a line, we can simply relabel
each core vi,j of an m × n rectangle as $v_{n(i-1)+j}$. An example of a
resulting line is shown in Figure 7(a). Notice how v1,5 and v2,1
are not adjacent in the rectangle, but they are adjacent in the
line. Instead, let us relabel the cores with a continuous zigzag
line so that each core vi,j of an m × n rectangle becomes
\[ v_{j(-1)^{i+1} + (n+1)[(i+1) \bmod 2] + n(i-1)}. \tag{17} \]
The resulting line on the same rectangle is shown in
Figure 7(b). Notice how adjacent cores in the line are also
adjacent in the rectangle. Now we can use OLMA to solve the
linear problem.
Unfortunately, not all cores adjacent in the rectangle are
adjacent in the line. For example, v1,2 and v2,2 are adjacent in
the rectangle, but they are separated by 6 other cores in the
line. To mitigate this problem, we run OLMA twice: once
on the horizontal zigzag line shown in Figure 7(b) and once
on the vertical zigzag line shown in Figure 7(c). This time, let
us relabel the cores in a vertical zigzag manner so that each
core vi,j of an m × n rectangle becomes
\[ v_{i(-1)^{j+1} + (m+1)[(j+1) \bmod 2] + m(j-1)}. \tag{18} \]
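The two relabelings (17) and (18) are easy to check numerically. The sketch below is illustrative; the function names are hypothetical, rows and columns are 1-based as in the text, and the unused parameter in each function is kept only so that both share one signature.

```python
def horizontal_index(i, j, m, n):
    """Label of core v_{i,j} on the horizontal zigzag line, equation (17)."""
    return j * (-1) ** (i + 1) + (n + 1) * ((i + 1) % 2) + n * (i - 1)


def vertical_index(i, j, m, n):
    """Label of core v_{i,j} on the vertical zigzag line, equation (18)."""
    return i * (-1) ** (j + 1) + (m + 1) * ((j + 1) % 2) + m * (j - 1)


# The 2 x 3 motivational example, listed row by row.
grid = [(i, j) for i in (1, 2) for j in (1, 2, 3)]
print([horizontal_index(i, j, 2, 3) for i, j in grid])
print([vertical_index(i, j, 2, 3) for i, j in grid])
```

For the 2 × 3 example the horizontal labels are 1, 2, 3, 6, 5, 4 and the vertical labels are 1, 4, 5, 2, 3, 6 when the cores are listed row by row, which agrees with the access counts in Tables 2 and 6.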
After both iterations are complete, we have two partitions
Ph and Pv of the same set of cores. We construct a new
partition such that two cores share a memory if they share
a memory in either Ph or Pv. To create the final partition, we
merge a block from Ph and a block from Pv if they share a
core. An example merge is shown in Figure 8.
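The merge step amounts to taking connected components over the blocks of both partitions, which a small union-find structure handles directly. This is a sketch under assumptions: `merge_partitions` is a hypothetical name, blocks are given as non-empty lists of core labels, and the sketch does not re-check the timing constraint of the merged partition.

```python
def merge_partitions(ph, pv):
    """Merge blocks of ph and pv that share a core; return sorted blocks."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    for block in list(ph) + list(pv):
        first = block[0]
        find(first)                          # register singleton blocks too
        for core in block[1:]:
            union(first, core)

    groups = {}
    for core in list(parent):
        groups.setdefault(find(core), []).append(core)
    return sorted(sorted(g) for g in groups.values())


ph = [[1, 2], [3], [4, 5], [6]]
pv = [[1], [2, 3], [4], [5, 6]]
print(merge_partitions(ph, pv))
```

In this made-up input, cores 1 and 2 share a memory in `ph` and cores 2 and 3 share one in `pv`, so all three end up in one block of the merged partition.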
ZiRMA is summarized in Algorithm 2. Its running time
is O(m4n4) for an m × n rectangle. We illustrate ZiRMA
with our motivational example. We transform the cores
according to Table 5. The accesses for the horizontal zigzag
transformation are shown in Table 2, and the accesses for
the vertical zigzag transformation are shown in Table 6.
The resulting partitions are shown in Figure 9. In this case,
the reverse transformations of Ph and Pv are the same, so
merging does not have an eﬀect.
As we can see from Figure 8, the shapes created by this
algorithm may be long and winding, unsuitable for real
implementations. Next, we make the restriction that the
cores sharing a memory must form a rectangle.
To optimally solve this problem, we introduce the concept of a
staircase-shaped set of cores.
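Table 5: Core transformations.
Core   Horizontal (17)   Vertical (18)
v1,1   v1                v1
v1,2   v2                v4
v1,3   v3                v5
v2,1   v6                v2
v2,2   v5                v3
v2,3   v4                v6
Table 6: Accesses for vertical transformation.
      v1  v2  v3  v4  v5  v6
v1     5   3   0   0   0   0
v2     4   0   0   0   0   0
v3     0   0   2   0   0   0
v4     0   0   0   0   0   5
v5     0   0   0   0   2   0
v6     0   0   0   5   0   0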
7.2. Staircase-Shaped Sets. Let us call a set of cores
Vs staircase shaped if Vs satisfies the following requirements.
(1) All cores are right-aligned, that is, for each 1 ≤ i ≤ m,
there is an integer si such that vi,j ∉ Vs for all 1 ≤ j ≤
si and vi,j ∈ Vs for all si < j ≤ n.
(2) Each row has at least as many cores in Vs as the
previous row, that is, s1 ≥ s2 ≥ s3 ≥ · · · ≥ sm.
Some examples of staircase-shaped sets are shown in
Figure 10.
We can uniquely identify any staircase-shaped sub-
set Vs of a rectangular set V by an m-tuple s =
(s[1], s[2], s[3], . . . , s[m]) such that s[i] is the number of cores
from row i of V that are not in Vs. For example, the tuples
corresponding to the sets in Figures 10(a), 10(b), 10(c), and
10(d) are (2, 1, 0), (2, 2, 0), (4, 2, 1), and (4, 4, 2), respectively.
Let us consider all rectangular subsets $V_s^{i,j}$ of any
staircase-shaped set Vs such that $V_s - V_s^{i,j}$ is a staircase-shaped
set. Let $V_s^{i,j} = \{v_{i',j'} \mid i' \le i,\ j' \le j, \text{ and } v_{i',j'} \in V_s\}$.
It is easy to see that $V_{s^{i,j}} = V_s - V_s^{i,j}$ is a staircase-shaped
subset of Vs if Vs is a staircase-shaped set, 0 ≤ i ≤ m, and
0 ≤ j ≤ n. We see that $s^{i,j}$ is an m-tuple where $s^{i,j}[k] = \max(s[k], j)$
if k ≤ i and $s^{i,j}[k] = s[k]$ if k > i.
Unfortunately, V_s^{i,j} as defined does not necessarily have
to be rectangular. To restrict V_s^{i,j} to be rectangular, we define
an m-tuple k_s such that for all 1 ≤ i ≤ m, k_s[i] is the largest
integer such that k_s[i] < i and s[k_s[i]] ≠ s[i]. As a sentinel, let
s[0] = n + 1 so that s[0] ≠ s[i] for all 1 ≤ i ≤ m. In words, row
k_s[i] is the closest row before row i that is different from row i.
For example, the k_s's corresponding to Figures 10(a), 10(b),
10(c), and 10(d) are (0, 1, 2), (0, 0, 2), (0, 1, 2), and (0, 0, 2),
respectively. Then, for all i, j such that 1 ≤ i ≤ m, j ≤ n, and
s[i] < j ≤ min(s[k_s[i]], n), V_s^{i,j} is rectangular.
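The definitions of k_s, the admissible pairs (i, j), and the successor tuple s^{i,j} translate directly into code. The following is a sketch under assumed names (k_tuple, rectangular_blocks); tuples are passed without the sentinel entry s[0], which the functions add internally.

```python
def k_tuple(s, n):
    """k_s[i]: closest row before row i whose s-value differs (0 = sentinel)."""
    ext = (n + 1,) + tuple(s)      # ext[0] is the sentinel s[0] = n + 1
    k = []
    for i in range(1, len(ext)):
        r = i - 1
        while ext[r] == ext[i]:    # stops at r = 0 at the latest
            r -= 1
        k.append(r)
    return tuple(k)


def rectangular_blocks(s, n):
    """Yield (i, j, rows, cols, s_next) for each rectangular block V_s^{i,j}."""
    m = len(s)
    ext = (n + 1,) + tuple(s)
    k = (0,) + k_tuple(s, n)       # shift so that k[i] matches the text
    for i in range(1, m + 1):
        for j in range(ext[i] + 1, min(ext[k[i]], n) + 1):
            rows = range(k[i] + 1, i + 1)      # rows of the block
            cols = range(ext[i] + 1, j + 1)    # columns of the block
            s_next = tuple(max(ext[r], j) if r <= i else ext[r]
                           for r in range(1, m + 1))
            yield i, j, rows, cols, s_next
```

The block occupies rows k_s[i] + 1 through i because exactly those rows share the value s[i]; rows at or above k_s[i] start at column s[k_s[i]] + 1 or later and so contribute no cores when j ≤ s[k_s[i]].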
(a) Discontinuous (b) Horizontal (c) Vertical
Figure 7: Zigzag lines. We transform a rectangular problem into a linear problem by following one of these zigzag lines.
(a) Ph (b) Pv (c) P
Figure 8: Merging Ph and Pv . P is the partition resulting from merging Ph and Pv .
Input: An instance I of rectangular MA.
Output: A partition P and its energy consumption d.
(1) Create a linear instance Ih from I by transforming each
core vi, j according to (17).
(2) Find the optimal partition Ph of Ih with OLMA.
(3) Reverse the transformation of each core in Ph by
applying (17) in reverse.
(4) Create a linear instance Iv from I by transforming each
core vi, j according to equation (18).
(5) Find the optimal partition Pv of Iv with OLMA.
(6) Reverse the transformation of each core in Pv by
applying (18) in reverse.
(7) Create P by merging Ph and Pv .
(8) Compute the energy consumption d of P.
Algorithm 2: Zigzag rectangular memory arrangement (ZiRMA).
Lemma 2. If a partition P of a nonempty staircase-shaped set
V is composed of only rectangular blocks, there exists a block
B ∈ P such that V − B is a staircase-shaped set.
Proof. Let us suppose that V is m high and n wide. V then
has at most m top left corners. For example, in Figure 10(a),
the 3 top left corners are (3, 1), (2, 2), and (1, 3). Since
all blocks of P are rectangular, none of the top left corners
are in the same block. One of the blocks containing these
corners is a block B′ such that V − B′ is a staircase-shaped
set. Let B1,B2,B3, . . . ,Bj , where j ≤ m, be the sequence
of these blocks ordered by the row index of the top left
corner that it contains. Let us consider all these blocks in this
order.
If B1 does not extend to the right underneath B2, then it is
a block such that the remaining blocks compose a staircase-
shaped set, and the lemma holds. If it does, then it is
not B′, and one of the remaining blocks must be B′.
Let us consider Bi, where i ≤ j. Since we are considering
Bi, Bi−1 must not be B′, thus Bi−1 extends underneath Bi, and
Bi cannot extend down next to Bi−1. Thus, if Bi is not B′,
then it must extend to the right. If Bi does not extend to the
right underneath Bi+1, then it is B′, and the lemma is correct.
Otherwise, it is not B′, and we consider Bi+1. We continue
this until we come to Bj .
By the same argument, Bj does not extend down next to
Bj−1. Since this is the topmost top left corner, there is nothing
above this block. Thus, Bj is B′. Thus, we have found a block
such that the remaining blocks compose a staircase-shaped
set.
Lemma 3. If a partition of a rectangular set V is composed of
only k rectangular blocks, there exists a sequence of the blocks
B_1, B_2, . . . , B_k such that V − (B_1 ∪ B_2 ∪ · · · ∪ B_j) is a
staircase-shaped set for all 1 ≤ j ≤ k.
Proof. Since a rectangular set is staircase-shaped, we can
repeatedly apply Lemma 2 to find such a sequence.
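Repeated application of Lemma 2 can be illustrated by greedily peeling blocks off a rectangular partition. The sketch below is an assumption-laden illustration, not part of the paper: the names is_staircase and peel_order are hypothetical, cores are (row, column) pairs, and the search simply tries every remaining block until one leaves a staircase-shaped remainder.

```python
def is_staircase(cells, m, n):
    """True if `cells` is right-aligned with nonincreasing s[1..m]."""
    prev = n
    for i in range(1, m + 1):
        cols = {j for (r, j) in cells if r == i}
        s_i = n - len(cols)
        if cols != set(range(s_i + 1, n + 1)) or s_i > prev:
            return False
        prev = s_i
    return True


def peel_order(blocks, m, n):
    """Order the blocks so that every remainder is staircase shaped."""
    remaining = set().union(*blocks)
    todo = [frozenset(b) for b in blocks]
    order = []
    while todo:
        # Lemma 2 guarantees a candidate when all blocks are rectangles;
        # next() raises StopIteration if the input violates that premise.
        b = next(b for b in todo if is_staircase(remaining - b, m, n))
        order.append(b)
        todo.remove(b)
        remaining -= b
    return order
```

For the 2 × 3 partition {{v1,1}, {v2,1}, {v1,2, v1,3, v2,2, v2,3}}, the block {v2,1} cannot be removed first, because the remainder would have a longer first row than second row; {v1,1} can.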
7.3. Staircase Rectangular Partitions. We use staircase-shaped
sets to find the optimal partition of a rectangular set of
cores among the partitions that have only rectangular blocks.
For an MA instance
Table 7: ds and Ps.
s Shape ds Ps
(4, 3, 3) 0 {}
(4, 3, 2) 15 {{v2,3}}
(4, 3, 1) 17 {{v2,2}, {v2,3}}
(4, 3, 0) 29 {{v2,1}, {v2,2}, {v2,3}}
(4, 2, 2) 17 {{v1,3}, {v2,3}}
(4, 2, 1) 19 {{v1,3}, {v2,2}, {v2,3}}
(4, 2, 0) 31 {{v1,3}, {v2,1}, {v2,2}, {v2,3}}
(4, 1, 1) 28 {{v1,2, v1,3, v2,2, v2,3}}
(4, 1, 0) 40 {{v2,1}, {v1,2, v1,3, v2,2, v2,3}}
(4, 0, 0) 54 {{v1,1}, {v2,1}, {v1,2, v1,3, v2,2, v2,3}}
I = ⟨V, w, t_0, e_0, t_1, e_1, t_2, e_2, b, q⟩, let I_s be the sub-instance
that contains a staircase-shaped set V_s ⊆ V, where s is an
(m + 1)-tuple such that s[0] = n + 1 and, for all 1 ≤ i ≤ m,
0 ≤ s[i] ≤ n and s[1] ≥ s[2] ≥ s[3] ≥ · · · ≥ s[m].
I_s = ⟨V_s, w, t_0, e_0, t_1, e_1, t_2, e_2, b_s, q⟩, where
$$
V_s = \{\, v_{i,j} \mid 1 \le i \le m \text{ and } s[i] < j \le n \,\}, \qquad
b_s(u) = b(u) + t_2 \sum_{v \in V - V_s} w(u, v) \quad \forall u \in V_s. \tag{19}
$$
Let s_0 be the (m + 1)-tuple that consists of all 0's except
s_0[0] = n + 1, and s_n be the (m + 1)-tuple that consists of all
n's except s_n[0] = n + 1, that is, s_0 = (n + 1, 0, 0, 0, . . . , 0) and
s_n = (n + 1, n, n, n, . . . , n). Note that I_{s_0} = I. For each sub-
instance I_s, let P_s be an optimal partition that satisfies the
timing constraint. Let d_s be the energy consumption of P_s, or
∞ if no partition for I_s can meet the timing constraint. Let
V_s^{i,j} = {v_{i′,j′} | i′ ≤ i, j′ ≤ j, and v_{i′,j′} ∈ V_s}. Let c_s^{i,j} be the
minimum energy necessary for V_s if V_s^{i,j} is a block in P_s. Let
d_s^{i,j} be ∞ if no partition that has V_s^{i,j} as a block satisfies the
timing constraint, and c_s^{i,j} otherwise. d_s, c_s^{i,j}, d_s^{i,j},
and P_s can be defined recursively as shown in equations (20),
(21), (22), and (23), respectively.
P_{s_0} is an optimal partition, and d_{s_0} is the minimum
energy necessary to meet the timing constraint. If d_{s_0} =
∞, then there is no partition for I that consists of only
rectangular blocks that will satisfy the timing constraint.
An algorithm to compute P_s and d_s, Staircase Rectangular
Memory Arrangement (StaRMA), is shown in Algorithm 3.
We illustrate the algorithm on the motivational example. d_s
and P_s for all s's that correspond to staircase-shaped sets are
shown in Table 7. The second column of Table 7 shows the
shape of the corresponding staircase-shaped set. To illustrate
equation (20), d_{(4,1,1)} = min{15 + d_{(4,2,1)}, 19 + d_{(4,3,1)}, 19 +
d_{(4,2,2)}, 28 + d_{(4,3,3)}} = 28. The output partition is P_{(4,0,0)} =
{{v_{1,1}}, {v_{2,1}}, {v_{1,2}, v_{1,3}, v_{2,2}, v_{2,3}}}. Its energy consumption
is d_{(4,0,0)} = 54.
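The recursion illustrated by the example can be sketched as a memoized search over the tuples s. This is a minimal sketch, not the authors' code: the function starma and especially the callback block_cost are hypothetical. The callback stands in for the energy and timing terms of the model, which are not reproduced here; it must return the energy of a block or infinity when the block violates the timing constraint.

```python
from functools import lru_cache
from math import inf


def starma(m, n, block_cost):
    """Return (d, P) for an m x n mesh.

    block_cost(rows, cols, s) is a hypothetical callback: the energy of
    the block with inclusive row range `rows` and column range `cols`
    carved from the staircase set s, or inf on a timing violation.
    """
    @lru_cache(maxsize=None)
    def solve(s):
        if all(v == n for v in s):           # s = s_n: nothing left
            return 0, ()
        ext = (n + 1,) + s                   # sentinel s[0] = n + 1
        best = (inf, None)
        for i in range(1, m + 1):
            k = i - 1
            while ext[k] == ext[i]:          # k_s[i]
                k -= 1
            for j in range(ext[i] + 1, min(ext[k], n) + 1):
                rows, cols = (k + 1, i), (ext[i] + 1, j)
                nxt = tuple(max(ext[r], j) if r <= i else ext[r]
                            for r in range(1, m + 1))
                sub_d, sub_p = solve(nxt)
                total = block_cost(rows, cols, s) + sub_d
                if total < best[0]:
                    best = (total, ((rows, cols),) + sub_p)
        return best

    return solve((0,) * m)
```

With a constant cost per block the search returns the whole mesh as a single block, and with a cost that forbids every block larger than one core it returns the all private arrangement; both are convenient sanity checks before plugging in a real energy model.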
By Lemma 3, if we search through all possible staircase-
shaped sets, we search through all the partitions composed
of only rectangular blocks. Since StaRMA loops through all
the staircase-shaped subsets, it is able to find an optimal
partition.



























































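$$
d_s^{i,j} =
\begin{cases}
\infty & \text{if } |V_s^{i,j}| = 1 \text{ and } b_s(u) + t_0 w(u,u) + t_2 \sum_{v \in V_s - V_s^{i,j}} w(u,v) > q \text{ for any } u \in V_s^{i,j}, \\
\infty & \text{if } |V_s^{i,j}| > 1 \text{ and } b_s(u) + t_1 w(u,u) + t_2 \sum_{v \in V_s - V_s^{i,j}} w(u,v) > q \text{ for any } u \in V_s^{i,j}, \\
c_s^{i,j} & \text{otherwise.}
\end{cases}
\tag{22}
$$
$$
P_s =
\begin{cases}
\{V_s^{i,j}\} \cup P_{s^{i,j}} \text{ for any } i, j \text{ such that } d_s = d_s^{i,j} & \text{if } s \ne s_n, \\
\{\} & \text{if } s = s_n.
\end{cases}
\tag{23}
$$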
Figure 9: Partitions. The two linear partitions are transformed back and then merged together.
(a) (b) (c) (d)
Figure 10: Examples of staircase-shaped sets. The enclosed cores make up a staircase-shaped set.
Unfortunately, the running time of StaRMA is
O(nm(n + m)!/(n! m!)) for an m × n rectangle. This is still
acceptable when the number of cores is small, about 100 cores.
If we also restrict the sub-instances to be rectangular, then we
can have an algorithm that finds the best partition in polynomial time.
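The factorial term in this bound equals the binomial coefficient C(n + m, m), which counts the nonincreasing m-tuples with entries between 0 and n, that is, the staircase-shaped subsets that StaRMA visits. The sketch below enumerates them with the same odometer-style update used in Algorithm 3; the generator name staircase_tuples is an assumption, and m ≥ 1 is assumed.

```python
from math import comb


def staircase_tuples(m, n):
    """Yield every nonincreasing m-tuple with entries in 0..n,
    starting from (n, ..., n) and ending at (0, ..., 0)."""
    s = [n + 1] + [n] * m          # s[0] is the sentinel
    yield tuple(s[1:])
    while s[1] > 0:
        i = m
        s[i] -= 1
        while s[i] == -1:          # borrow from the previous row
            i -= 1
            s[i] -= 1
        for j in range(i + 1, m + 1):
            s[j] = s[i]            # largest values allowed after row i
        yield tuple(s[1:])


def count_staircase_sets(m, n):
    """Closed form for the number of tuples produced above."""
    return comb(n + m, m)
```

For the 2 × 3 motivational example this gives 10 tuples, matching the ten rows of Table 7.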
7.4. Carving Rectangular Partitions. In this section, we
restrict all sub-instances as well as blocks to be rectangular.
We lose in terms of optimality, but we gain much more in
terms of the size of the problems we can solve in a reasonable
amount of time. From our experiments, we see that we do
not sacrifice much in terms of optimality either.
Since rectangles can be uniquely identified by two points,
we will label our sub-instances by two points. For an instance
I = ⟨V, w, t_0, e_0, t_1, e_1, t_2, e_2, b, q⟩ of rectangular MA, let
I_{x,y} be the sub-instance that contains a rectangular set
V_{x,y} ⊆ V, where x = (x_i, x_j) and y = (y_i, y_j) are two
pairs such that x_i ≤ y_i and x_j ≤ y_j. We can define
I_{x,y} = ⟨V_{x,y}, w, t_0, e_0, t_1, e_1, t_2, e_2, b_{x,y}, q⟩, where
$$
V_{x,y} = \{\, v_{i,j} \mid x_i \le i \le y_i,\; x_j \le j \le y_j \,\}, \qquad
b_{x,y}(u) = b(u) + t_2 \sum_{v \in V - V_{x,y}} w(u, v) \quad \forall u \in V_{x,y}. \tag{24}
$$
For each sub-instance I_{x,y}, let P_{x,y} be an optimal partition
that satisfies the timing constraint. Let d_{x,y} be the energy
consumption of P_{x,y}, or ∞ if we are unable to find a partition
for I_{x,y} that can meet the timing constraint. Let z = (z_i, z_j)
be a pair such that x_i ≤ z_i ≤ y_i and x_j ≤ z_j ≤ y_j, and V^z_{x,y} =
{v_{i,j} | x_i ≤ i ≤ z_i and x_j ≤ j ≤ z_j}. Suppose that V^z_{x,y} is a block
in P_{x,y}; then there are two configurations of sub-instances
with two sub-instances each. In configuration 1, shown in
Figure 11(a), sub-instance 1 is to the left of sub-instance 2.
The two sub-instances are I_{x1,y1} and I_{x2,y}, where x1 = (z_i +
1, x_j), y1 = (y_i, z_j), and x2 = (x_i, z_j + 1). In configuration 2,
shown in Figure 11(b), sub-instance 1 is above sub-instance
2. The two sub-instances in this configuration are I_{x1,y1} and
I_{x2,y}, where x1 = (x_i, z_j + 1), y1 = (z_i, y_j), and x2 = (z_i + 1, x_j).
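The two configurations can be written out as corner pairs and checked to tile the sub-instance together with the block. The helper names configurations and cells are assumptions introduced for this sketch; rectangles are given as (top-left, bottom-right) pairs of (row, column) coordinates.

```python
def configurations(x, y, z):
    """Return the two ways of splitting rectangle (x, y) around block (x, z).

    A rectangle whose corners are out of order is treated as empty.
    """
    (xi, xj), (yi, yj), (zi, zj) = x, y, z
    config1 = [((zi + 1, xj), (yi, zj)),   # sub-instance 1: below the block
               ((xi, zj + 1), (yi, yj))]   # sub-instance 2: full-height strip on the right
    config2 = [((xi, zj + 1), (zi, yj)),   # sub-instance 1: right of the block
               ((zi + 1, xj), (yi, yj))]   # sub-instance 2: full-width strip underneath
    return config1, config2


def cells(rect):
    """All (row, column) cores inside a rectangle; empty if degenerate."""
    (ai, aj), (bi, bj) = rect
    return {(i, j) for i in range(ai, bi + 1) for j in range(aj, bj + 1)}
```

In both configurations the block and the two sub-instances are disjoint and cover V_{x,y}, which is what allows the energy of the sub-instances to be added independently.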
Let c^{z,1}_{x,y} be the minimum energy necessary for V_{x,y} if V^z_{x,y}
is a block in P_{x,y} and we use configuration 1. Conversely,
let c^{z,2}_{x,y} be the minimum energy necessary for V_{x,y} if V^z_{x,y}
is a block in P_{x,y} and we use configuration 2. Similarly, let
d^{z,1}_{x,y} (d^{z,2}_{x,y}) be ∞ if no partition in configuration 1 (2) that has
V^z_{x,y} as a block satisfies the timing constraint, and c^{z,1}_{x,y}
(c^{z,2}_{x,y}) otherwise. d_{x,y}, c^{z,1}_{x,y}, c^{z,2}_{x,y}, d^{z,1}_{x,y}, and d^{z,2}_{x,y} can be
defined recursively as shown in (25), (26), (27), (28), and
(29), respectively.
During the computation of d_{x,y}, we record the optimal
value of z and configuration by recording the corresponding
partitions in P_{x,y}. Let P_{x,y} = {} for any x, y such that x_i > y_i
or x_j > y_j. For all x, y where x_i ≤ y_i and x_j ≤ y_j, let z′ be
the optimal value of z used to compute d_{x,y}. If configuration
1 is used, then P_{x,y} = {V^{z′}_{x,y}} ∪ P_{(z′_i+1, x_j),(y_i, z′_j)} ∪ P_{(x_i, z′_j+1), y}. If
configuration 2 is used, then P_{x,y} = {V^{z′}_{x,y}} ∪ P_{(x_i, z′_j+1),(z′_i, y_j)} ∪
P_{(z′_i+1, x_j), y}. If d_{x,y} = ∞, then we are unable to find a partition
for I_{x,y} that satisfies the timing requirement, and P_{x,y} is
undefined.
Note that I_{(1,1),(m,n)} = I, P_{(1,1),(m,n)} is an optimal partition,
and d_{(1,1),(m,n)} is the minimum energy necessary to meet the
timing constraint corresponding to P_{(1,1),(m,n)}. If d_{(1,1),(m,n)} =
∞, then we are unable to find a partition for I that
consists of only rectangular blocks that will satisfy the timing
constraint.













































































































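$$
d^{z,1}_{x,y} =
\begin{cases}
\infty & \text{if } |V^z_{x,y}| = 1 \text{ and } b_{x,y}(u) + t_0 w(u,u) + t_2 \sum_{v \in V_{x,y} - V^z_{x,y}} w(u,v) > q \text{ for any } u \in V^z_{x,y}, \\
\infty & \text{if } |V^z_{x,y}| > 1 \text{ and } b_{x,y}(u) + t_1 w(u,u) + t_2 \sum_{v \in V_{x,y} - V^z_{x,y}} w(u,v) > q \text{ for any } u \in V^z_{x,y}, \\
c^{z,1}_{x,y} & \text{otherwise.}
\end{cases}
\tag{28}
$$
$$
d^{z,2}_{x,y} =
\begin{cases}
\infty & \text{if } |V^z_{x,y}| = 1 \text{ and } b_{x,y}(u) + t_0 w(u,u) + t_2 \sum_{v \in V_{x,y} - V^z_{x,y}} w(u,v) > q \text{ for any } u \in V^z_{x,y}, \\
\infty & \text{if } |V^z_{x,y}| > 1 \text{ and } b_{x,y}(u) + t_1 w(u,u) + t_2 \sum_{v \in V_{x,y} - V^z_{x,y}} w(u,v) > q \text{ for any } u \in V^z_{x,y}, \\
c^{z,2}_{x,y} & \text{otherwise.}
\end{cases}
\tag{29}
$$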
Input: An instance I of rectangular MA.
Output: P_s and d_s.
(1) s ← an (m + 1)-tuple
(2) s[0] ← n + 1
(3) for i ← 1 to m do
(4)   s[i] ← n
(5) end for
(6) P_s ← {}
(7) d_s ← 0
(8) while s[1] > 0 do
(9)   i ← m
(10)   s[i] ← s[i] − 1
(11)   while s[i] = −1 do
(12)     i ← i − 1
(13)     s[i] ← s[i] − 1
(14)   end while
(15)   for j ← i + 1 to m do
(16)     s[j] ← s[i]
(17)   end for
(18)   d_s ← ∞
(19)   k_s ← an m-tuple
(20)   for i ← 1 to m do
(21)     k_s[i] ← i − 1
(22)     while s[k_s[i]] = s[i] do
(23)       k_s[i] ← k_s[i] − 1
(24)     end while
(25)   end for
(26)   for i ← 1 to m do
(27)     for j ← s[i] + 1 to min(s[k_s[i]], n) do
(28)       Compute c_s^{i,j} and d_s^{i,j}.
(29)       if d_s^{i,j} < d_s then
(30)         d_s ← d_s^{i,j}
(31)         P_s ← {V_s^{i,j}} ∪ P_{s^{i,j}}
(32)       end if
(33)     end for
(34)   end for
(35) end while
Algorithm 3: Staircase rectangular memory arrangement
(StaRMA).
A polynomial time algorithm to compute P_{x,y} and d_{x,y},
Carving Rectangular Memory Arrangement (CaRMA), is
shown in Algorithm 4. Its running time is O(m^5 n^5) for
an m × n rectangle. It starts with small sub-instances and
loops through progressively larger sub-instances. Since each
sub-instance only references sub-instances smaller than the
current sub-instance, all needed sub-instances have already
been solved. Lines 3-4 loop through all the different y's.
Lines 9-10 loop through all the possible z's. For each V^z_{x,y}, we
compute the energy consumption on line 12. If configuration
1 uses less energy, lines 13–16 will record the corresponding
P_{x,y}. If configuration 2 uses less energy, lines 17–20 will
record the corresponding P_{x,y}.
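The carving recursion can also be sketched top-down with memoization instead of the bottom-up loops of the listing. This is a sketch under stated assumptions, not the authors' implementation: carma and the callback block_cost are hypothetical names, and block_cost stands in for the energy and timing terms of the model (it returns the energy of the block, or infinity on a timing violation).

```python
from functools import lru_cache
from math import inf


def carma(m, n, block_cost):
    """Return (d, P) for an m x n mesh using carving.

    block_cost(x, z, y) is a hypothetical callback: the energy of block
    (x, z) inside sub-instance (x, y), or inf on a timing violation.
    """
    @lru_cache(maxsize=None)
    def solve(x, y):
        (xi, xj), (yi, yj) = x, y
        if xi > yi or xj > yj:               # empty sub-instance
            return 0, ()
        best = (inf, None)
        for zi in range(xi, yi + 1):
            for zj in range(xj, yj + 1):
                z = (zi, zj)
                own = block_cost(x, z, y)
                splits = (
                    (((zi + 1, xj), (yi, zj)), ((xi, zj + 1), y)),  # configuration 1
                    (((xi, zj + 1), (zi, yj)), ((zi + 1, xj), y)),  # configuration 2
                )
                for first, second in splits:
                    d1, p1 = solve(*first)
                    d2, p2 = solve(*second)
                    total = own + d1 + d2
                    if total < best[0]:
                        best = (total, ((x, z),) + p1 + p2)
        return best

    return solve((1, 1), (m, n))
```

Memoizing on the corner pair (x, y) plays the role of the table filled by the outer loops: every sub-instance is solved once, and each solution only consults strictly smaller rectangles.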
Input: An instance I of rectangular MA.
Output: A near-optimal partition P_{(1,1),(m,n)} and its
energy consumption d_{(1,1),(m,n)}.
(1) for i ← 1 to m do
(2)   for j ← 1 to n do
(3)     for y_i ← i to m do
(4)       for y_j ← j to n do
(5)         x ← (y_i − i + 1, y_j − j + 1)
(6)         V_{x,y} ← {v_{i′,j′} | x_i ≤ i′ ≤ y_i and x_j ≤ j′ ≤ y_j}
(7)         d_{x,y} ← ∞
(8)         P_{x,y} ← {}
(9)         for z_i ← x_i to y_i do
(10)           for z_j ← x_j to y_j do
(11)             V^z_{x,y} ← {v_{i′,j′} | x_i ≤ i′ ≤ z_i and x_j ≤ j′ ≤ z_j}
(12)             Compute c^{z,1}_{x,y}, c^{z,2}_{x,y}, d^{z,1}_{x,y}, and d^{z,2}_{x,y}.
(13)             if d^{z,1}_{x,y} < d_{x,y} then
(14)               d_{x,y} ← d^{z,1}_{x,y}
(15)               P_{x,y} ← {V^z_{x,y}} ∪ P_{(x_i, z_j+1), y} ∪ P_{(z_i+1, x_j),(y_i, z_j)}
(16)             end if
(17)             if d^{z,2}_{x,y} < d_{x,y} then
(18)               d_{x,y} ← d^{z,2}_{x,y}
(19)               P_{x,y} ← {V^z_{x,y}} ∪ P_{(z_i+1, x_j), y} ∪ P_{(x_i, z_j+1),(z_i, y_j)}
(20)             end if
(21)           end for
(22)         end for
(23)       end for
(24)     end for
(25)   end for
(26) end for
Algorithm 4: Carving rectangular memory arrangement
(CaRMA).
8. Experiments
We evaluate ZiRMA, CaRMA, and StaRMA by comparing
the memory arrangements generated to both an all shared
memory arrangement and an all private memory arrange-
ment. We do not explicitly evaluate OLMA since it is used in
ZiRMA. We run experiments on two sets of instances. The
instances in the first set are randomly generated, while those
in the second set are extracted from digital signal processing
(DSP) benchmarks from DSPStone [1]. For these experiments,
we only consider the energy consumption of memory access
operations.
8.1. Random Instances. We generate 800 random rectangular
instances with varying degrees of memory access locality and
penalty. The locality describes the memory accesses among
cores. Clumpy means that most memory accesses are within
groups of cores, between which there is little interaction.
Diffuse means that memory accesses are distributed evenly
among the cores, and it is difficult to divide them into
groups. The penalty of remote accesses with respect to local
accesses may either be mild or severe. Mild penalty means
that the energy cost for accessing remote data is only two
times the energy cost for accessing local data. Conversely,
severe penalty means that the energy cost for accessing data
in a remote memory is several times greater than the energy
cost for accessing data in a local memory.
Table 8: Improvements for randomly generated instances.
                    ZiRMA                    CaRMA                    StaRMA                   CaRMA
Locality  Penalty   All shared  All private  All shared  All private  All shared  All private  vs. ZiRMA
Clumpy    mild      38%         4%           42%         10%          42%         10%          6%
Clumpy    severe    51%         7%           56%         17%          56%         17%          10%
Diffuse   mild      40%         9%           42%         11%          42%         11%          3%
Diffuse   severe    54%         14%          56%         19%          56%         19%          5%
Table 9: Improvements for DSP benchmarks.
                    ZiRMA                    CaRMA                    StaRMA
Instance  Penalty   All shared  All private  All shared  All private  All shared  All private
allpole   mild      6%          6%           17%         18%          17%         18%
allpole   severe    7%          32%          22%         43%          22%         43%
deq       mild      5%          10%          17%         21%          17%         21%
deq       severe    7%          35%          21%         44%          21%         44%
elliptic  mild      21%         8%           21%         8%           21%         8%
elliptic  severe    8%          13%          23%         27%          23%         27%
iir       mild      5%          13%          17%         23%          17%         23%
iir       severe    7%          36%          20%         45%          20%         45%
lattice   mild      18%         11%          18%         11%          18%         11%
lattice   severe    7%          20%          23%         33%          23%         33%
Average   mild      11%         10%          18%         16%          18%         16%
Average   severe    7%          27%          22%         38%          22%         38%
The results from this set of random experiments are
shown in Table 8. We generated 200 instances for each
combination of memory access locality and penalty. The
third, fifth, and seventh columns show how much better
ZiRMA, CaRMA, and StaRMA perform than an all shared
memory arrangement, respectively. The fourth, sixth, and
eighth columns show how much better ZiRMA, CaRMA, and
StaRMA perform than an all private memory arrangement,
respectively. The ninth column shows how much better
CaRMA performs than ZiRMA.
8.2. DSP Instances. In addition to randomly generated
instances, we perform experiments on instances extracted
from DSP benchmarks. The benchmarks we use are an
all-pole filter (allpole), a differential equation solver (deq), an
elliptic filter (elliptic), an infinite impulse response filter (iir),
and a 4-stage lattice filter (lattice). For these instances, we
unfold the benchmarks and perform the experiments on 2 × 4
rectangular instances with varying memory access penalty.
The results from this set of experiments are
shown in Table 9. The third, fifth, and seventh columns show
how much better ZiRMA, CaRMA, and StaRMA perform
than an all shared memory arrangement, respectively, while
the fourth, sixth, and eighth columns show how much better
ZiRMA, CaRMA, and StaRMA perform than an all private
memory arrangement, respectively. The last two rows show
the average improvement for both mild and severe penalties.
In summary, on instances extracted from DSP bench-
marks, CaRMA and StaRMA perform an average of 18%
better than an all shared memory arrangement for cases with
mild memory-access penalty and an average of 38% better
than an all private memory arrangement for cases with severe
memory access penalty.
8.3. Computation Times. From previous sections, we know
that the running times of ZiRMA, CaRMA, and StaRMA are
O(m^4 n^4), O(m^5 n^5), and O(nm(n + m)!/(n! m!)), respectively.
We compare the time it takes these algorithms to process an
instance. Figure 12 shows the computation times for these
algorithms for instances of differing sizes. From the graph, we
can see that ZiRMA and CaRMA have similar computation
times, and StaRMA's computation times grow much faster.
8.4. Analysis. From these experiments, we can see that all
algorithms perform the same for instances with only a
few cores, and ZiRMA performs the worst for instances
with many cores. For larger instances, the linear arrays that
ZiRMA considers deviate more from the rectangular mesh.
Many of the small sharings in the middle of the mesh are
not possible in ZiRMA since the zigzag segment that ZiRMA
considers is quite long. Thus, ZiRMA struggles with large
rectangular meshes, especially square meshes. We also see
that CaRMA performs as well as StaRMA in most cases. As
for the computation time, CaRMA takes only a little more
time than ZiRMA. Thus, CaRMA produces the best results
in a reasonable amount of time.
Table 10: Summary of experimental results.































Figure 12: Runtimes for ZiRMA, CaRMA, and StaRMA.
A summary of the experimental results for CaRMA is
shown in Table 10. The results from the random instances are
all averaged together. On average, CaRMA produces arrange-
ments that consume 49% less energy than an all shared
memory arrangement and 14% less energy than an all private
memory arrangement for randomly generated instances. For
DSP benchmarks, CaRMA produces arrangements that, on
average, consume 20% less energy than an all shared memory
arrangement and 27% less energy than an all private memory
arrangement.
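The summary figures can be reproduced from the CaRMA columns of Tables 8 and 9. The snippet below is only an arithmetic check; it assumes that the summary values are plain averages of the rounded table entries, which the text does not state explicitly, and the names carma_random, carma_dsp, and summarize are illustrative.

```python
from statistics import mean

# CaRMA improvements in percent, copied from Table 8:
# (vs. all shared, vs. all private) per locality/penalty combination.
carma_random = [(42, 10), (56, 17), (42, 11), (56, 19)]

# CaRMA entries of the "Average" rows of Table 9 (mild, severe).
carma_dsp = [(18, 16), (22, 38)]


def summarize(rows):
    """Average each column and round to the nearest whole percent."""
    shared = round(mean(r[0] for r in rows))
    private = round(mean(r[1] for r in rows))
    return shared, private
```

Under that assumption, summarize(carma_random) and summarize(carma_dsp) agree with the percentages quoted in the summary paragraph.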
9. Conclusion
We study the Memory Arrangement Problem (MA). We
prove that if arbitrary cores can share a memory, then
MA is NP-complete. We present an efficient optimal algo-
rithm for solving linear instances of MA and extend the
algorithm to solve rectangular cases of MA. We present
an optimal algorithm for solving rectangular cases of MA
where only rectangular blocks of cores share memories and
an efficient heuristic to obtain good memory arrangements
in a reasonable amount of time. On average, we can
produce arrangements that consume 49% less energy than
an all shared memory arrangement and 14% less energy
than an all private memory arrangement for randomly
generated instances. For DSP benchmarks, we can produce
arrangements that, on average, consume 20% less energy
than an all shared memory arrangement and 27% less energy
than an all private memory arrangement.
Acknowledgments
This work is partially supported by NSF IIS-0513669, HK
CERG 526007, HK GRF 123609, NSFC 60728206, and
Changjiang Honorary Chair Professor Scholarship.
References
[1] V. Živojnović, J. M. Velarde, C. Schläger, and H. Meyr,
“DSPSTONE: a DSP-oriented benchmarking methodology,”
in Proceedings of the International Conference on Signal Pro-
cessing and Technology (ICSPAT ’94), Dallas, Tex, USA, 1994.
[2] Y. Zhao, C. J. Xue, M. Li, and B. Hu, “Energy-aware register
file re-partitioning for clustered VLIW architectures,” in
Proceedings of the Asia and South Pacific Design Automation
Conference (ASP-DAC ’09), pp. 805–810, Yokohama, Japan,
January 2009.
[3] M. Wang, Z. Shao, H. Liu, and C. J. Xue, “Minimizing leakage
energy with modulo scheduling for VLIW DSP processors,”
in Proceedings of the Distributed Embedded Systems: Design,
Middleware and Resources (DIPES ’08), B. Kleinjohann, L.
Kleinjohann, and W. Wolf, Eds., vol. 271 of IFIP International
Federation for Information Processing, pp. 111–120, Springer,
Milano, Italy, 2008.
[4] M. Qiu, Z. Jia, C. Xue, Z. Shao, and E. H.-M. Sha, “Voltage
assignment with guaranteed probability satisfying timing
constraint for real-time multiprocessor DSP,” Journal of VLSI
Signal Processing Systems, vol. 46, no. 1, pp. 55–73, 2007.
[5] G. Hua, M. Wang, Z. Shao, H. Liu, and C. Xue, “Real-time
loop scheduling with energy optimization via dvs and abb for
multi-core embedded system,” in Proceedings of Embedded and
Ubiquitous Computing (EUC ’07), T.-W Kuo, E. H.-M. Sha, M.
Guo, L. T. Yang, and Z. Shao, Eds., vol. 4808 of Lecture Notes
in Computer Science, pp. 1–2, Springer, Taipei, Taiwan, 2007.
[6] B. Saha, A.-R. Adl-Tabatabai, R. L. Hudson, C. M. Chi,
and B. Hertzberg, “McRT-STM: a high performance software
transactional memory system for a multi-core runtime,”
in Proceedings of the 11th ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming (PPoPP ’06),
vol. 2006, pp. 187–197, ACM, New York, NY, USA, 2006.
[7] R. Kumar, V. Zyuban, and D. M. Tullsen, “Interconnec-
tions in multi-core architectures: understanding mechanisms,
overheads and scaling,” in Proceedings of the 32nd annual
International Symposium on Computer Architecture (ISCA ’05),
pp. 408–419, IEEE Computer Society, Washington, DC, USA,
2005.
[8] C. Xue, Z. Shao, M. Liu, M. Qiu, and E. H.-M. Sha,
“Loop scheduling with complete memory latency hiding on
multi-core architecture,” in Proceedings of the International
Conference on Parallel and Distributed Systems (ICPADS ’06),
vol. 1, pp. 375–382, 2006.
[9] V. Suhendra, C. Raghavan, and T. Mitra, “Integrated scratch-
pad memory optimization and task scheduling for MPSoC
architectures,” in Proceedings of the International Conference
on Compilers, Architecture and Synthesis for Embedded Systems
(CASES ’06), pp. 401–410, ACM, Seoul, South Korea, 2006.
[10] L. Zhang, M. Qiu, W.-C. Tseng, and E. H.-M. Sha, “Variable
partitioning and scheduling for MPSoC with virtually shared
scratch pad memory,” Journal of Signal Processing Systems, pp.
1–19, 2009.
[11] M. Kandemir, J. Ramanujam, and A. Choudhary, “Exploiting
shared scratch pad memory space in embedded multiproces-
sor systems,” in Proceedings of the 39th Conference on Design
Automation (DAC ’02), pp. 219–224, ACM, New Orleans, La,
USA, 2002.
[12] F. Angiolini, L. Benini, and A. Caprara, “Polynomial-time
algorithm for on-chip scratchpad memory partitioning,” in
Proceedings of the International Conference on Compilers,
Architecture and Synthesis for Embedded Systems (CASES ’03),
pp. 318–326, ACM, San Jose, Calif, USA, 2003.
[13] S. Udayakumaran and R. Barua, “Compiler-decided dynamic
memory allocation for scratch-pad based embedded systems,”
in Proceedings of the International Conference on Compilers,
Architecture, and Synthesis for Embedded Systems (CASES ’03),
pp. 276–286, ACM, San Jose, Calif, USA, 2003.
[14] G. E. Suh, L. Rudolph, and S. Devadas, “Dynamic partitioning
of shared cache memory,” Journal of Supercomputing, vol. 28,
no. 1, pp. 7–26, 2004.
[15] M. Chu, R. Ravindran, and S. Mahlke, “Data access parti-
tioning for fine-grain parallelism on multicore architectures,”
in Proceedings of the 40th Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO ’07), pp. 369–378,
IEEE Computer Society, Washington, DC, USA, 2007.
[16] C.-G. Lyuh and T. Kim, “Memory access scheduling and bind-
ing considering energy minimization in multi-bank memory
systems,” in Proceedings of the 41st Annual Conference on
Design Automation (DAC ’04), pp. 81–86, ACM, San Diego,
Calif, USA, 2004.
[17] S. Meftali, F. Gharsalli, F. Rousseau, and A. A. Jerraya, “An
optimal memory allocation for application-specific multipro-
cessor system-on-chip,” in Proceedings of the 14th International
Symposium on System Synthesis (ISSS ’01), pp. 19–24, ACM,
Montréal, Canada, 2001.
[18] O. Ozturk, M. Kandemir, M. J. Irwin, and S. Tosun, “Multi-
level on-chip memory hierarchy design for embedded chip
multiprocessors,” in Proceedings of the 12th International
Conference on Parallel and Distributed Systems (ICPADS ’06),
pp. 383–390, IEEE Computer Society, Washington, DC, USA,
2006.
[19] O. Ozturk, M. Kandemir, G. Chen, M. J. Irwin, and M.
Karakoy, “Customized on-chip memories for embedded chip
multiprocessors,” in Proceedings of the Conference on Asia
South Pacific Design Automation (ASP-DAC ’05), pp. 743–748,
ACM, Shanghai, China, 2005.
