Efficient Inter-Task Communication for Nested Loop Programs on a Multiprocessor System by Bijlsma, Tjerk et al.
Efficient Inter-Task Communication for Nested Loop
Programs on a Multiprocessor System
Tjerk Bijlsma1, Marco Bekooij2, Gerard Smit1 and Pierre Jansen1
1University of Twente, P.O. Box 217, 7500AE Enschede, The Netherlands
2NXP Semiconductors Research, 5656AE Eindhoven, The Netherlands
t.bijlsma@utwente.nl
Abstract— In modern multiprocessor systems, processors
can be stalled by inter-task communication when reading
from a remote buffer. This paper presents a solution
for inter-task communication that has a minimal impact
on the performance of the system and hides the inter-task
communication latency without requiring additional
hardware. The solution applies to jobs, represented as task
graphs, where the tasks are nested loop programs. Buffers
are allocated in the scratch-pad memories of the consuming
tasks to provide low latency read access. For the nested
loop programs, minimal buffer sizes can be determined
that cover all possible communication patterns. The added
computational complexity is low, as the solution adds only
a few operations to the nested loop programs.
Keywords—Nested Loop Program, Scratch-Pad Mem-
ory, Circular Buffer.
I. Introduction
Modern multimedia devices typically contain a
multiprocessor system-on-chip (MPSoC). Jobs, rep-
resented by task graphs, are mapped on this MP-
SoC. Multimedia jobs typically process data streams
and often have real-time requirements in the form of
throughput and latency constraints.
In an MPSoC, the tasks are assigned to pro-
cessors. The inter-task communication occurs via
buffers. Inter-task communication has a minimal im-
pact on the performance of the MPSoC if there is low
latency access to the buffers, or the communication
latency to the buffers is hidden. Both prevent pro-
cessors from being stalled while buffers are accessed.
Furthermore, the order in which data is read from and
written in a buffer may differ. The buffer should be
able to handle the different patterns without reorder-
ing the data, because this imposes additional software
or hardware overhead on the system.
In this paper, a novel software solution is presented
to hide the inter-task communication latency, without
requiring additional hardware support. The buffers
that are used require only a small computational overhead
in the tasks, can handle all possible inter-task
communication patterns, and have a minimal size.
The jobs to which our solution can be applied consist
of a task graph in the form of a chain. Each task
is an affine nested loop program (aNLP) [5], such that
the pattern in which it accesses the memory is known
at design time. A task that produces data sends it
to the task that consumes it, thus preventing remote
read requests. Reading from a memory that is local
to the processor has a small communication latency,
such that the stall time of the consumer's processor
is minimized. The producer writes the data
in a circular buffer, allocated in the local scratch-pad
memory (SPM) of the consumer. The circular buffer
consists of a read and write part in which random ac-
cess is allowed. It can be accessed by adding a few
operations to the NLPs, thus no additional hardware
is required.
The methods discussed in [4] and [10] copy com-
plete shared data structures between various layers of
their memory hierarchy. The most frequently accessed
data structures are stored in an SPM and the sporadically
accessed data in a background memory. In these
methods, the consuming task initiates the communication
by sending read requests for the data structure.
In contrast, our approach considers the communication
of data items rather than complete data structures,
and the producing task initiates the communication by
sending the data items to the consuming task. This avoids
the read latency, because the data is read from a local SPM.
In [9], a method is proposed where a process net-
work is derived from an aNLP. In a process network
the communication between processes is explicit and
performed through first-in-first-out (FIFO) buffers;
there are no global variables. In this approach,
data items are communicated instead of complete
data structures, such that less buffer space is
required. This advantage is also present in our approach.
Another similarity with our approach is that
the producer initiates the communication by writing
the data in the buffer. If the inter-process communication
is not in FIFO order, additional reorder
buffers and reorder hardware or software are required.
In contrast, in our approach a simple software solution
consisting of a circular buffer that allows random
read and write access is sufficient.

T1 (producer):
    int X[10]
    for i0:1:1:5
      for i1:0:1:1
        X[2i0 − i1 − 1] = ∼
    Iterations 0, 1, 2, 3, ..., 9:      ap = [1, 0, 3, 2, ..., 8]

T2 (consumer):
    int X[10]
    for j0:0:1:2
      for j1:0:1:3
        ∼ = X[3j0 + j1]
    Iterations 0, 1, 2, 3, 4, ..., 11:  ac = [0, 1, 2, 3, 3, ..., 9]

Fig. 1. A producing task T1 and a consuming task T2
The organization of this paper is as follows. Sec-
tion II discusses the code of the tasks in the job and
the communication patterns between the tasks. Sec-
tion III presents the architecture to which the job
should be mapped. Based upon these sections, Sec-
tion IV defines the problem statement, how to hide
the inter-task communication latency. In Section V,
a method for hiding the inter-task communication la-
tency is presented. The conclusion and future work
are presented in Section VI.
II. Consumption and production patterns
This section introduces aNLPs that express the pat-
tern in which data is produced or consumed by tasks.
The combination of both patterns determines the re-
quired size of the buffer, for deadlock freedom.
Figure 1 shows a producer-consumer pair from a
task graph. The code of both tasks is in the form of
an aNLP with single assignment code. Single assign-
ment code means that for each execution of an aNLP,
each variable is assigned a value only once. Further-
more, an aNLP consists of multiple nested for-loops.
A for-loop is described with for(i; l; s; u), where i is
the iterator, l and u are the lower- and upper-bound
and s is the stride. In this paper, we assume that
the stride is one and the lower- and upper-bound are
constants. The innermost loop contains operations
that access shared data structures; the location to be
accessed is given by an index expression. An index
expression is affine and contains only iterators as vari-
ables. This means that the index expression consists
of a summation of multiples of iterators, plus option-
ally a constant value. For aNLPs the dependency be-
tween when data is written and read in a shared data
structure is explicit.
The bottom part of Figure 1 contains the address
lists ap and ac for the producer T1 and the consumer
T2, respectively. These lists contain the addresses of
the shared data structure in the order that they are
written or read. These lists can show two kinds of
access patterns that cannot always be mapped on a
FIFO buffer: out-of-order access and multiplicity.
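As an illustration, the address lists of Figure 1 can be reproduced by enumerating the loop nests and evaluating the affine index expression in every iteration. The following is a minimal sketch; the helper name address_list and its argument layout are assumptions made for this example and are not part of the original tasks.

```python
from itertools import product

def address_list(bounds, index):
    """Enumerate the addresses accessed by a nested loop program.

    bounds: (lower, upper) per loop, outermost first; stride 1 and
            inclusive bounds, as in the for(i; l; s; u) notation.
    index:  the affine index expression as a function of the iterators.
    """
    ranges = [range(lower, upper + 1) for lower, upper in bounds]
    # product() varies the last range fastest, like the innermost loop.
    return [index(*iterators) for iterators in product(*ranges)]

# Producer T1 writes X[2*i0 - i1 - 1] for i0 in 1..5, i1 in 0..1.
ap = address_list([(1, 5), (0, 1)], lambda i0, i1: 2 * i0 - i1 - 1)
# Consumer T2 reads X[3*j0 + j1] for j0 in 0..2, j1 in 0..3.
ac = address_list([(0, 2), (0, 3)], lambda j0, j1: 3 * j0 + j1)

print(ap)  # [1, 0, 3, 2, 5, 4, 7, 6, 9, 8]
print(ac)  # [0, 1, 2, 3, 3, 4, 5, 6, 6, 7, 8, 9]
```

Because the loop bounds are constants and the index expression is affine, these lists are fully known at design time.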
[Block diagram: processing tile 1 and processing tile 2, each with a processor, an arbiter, an SPM and an NI, connected via the NoC; the buffer is located in the SPM of processing tile 2.]
Fig. 2. A target MPSoC, in which the communication to
the buffer in the SPM of processor 2 is shown
Out-of-order access occurs when a producer or a
consumer writes or reads the addresses of a data struc-
ture in a non-consecutive order. For example, the ad-
dress list ap of task T1 in Figure 1 shows out-of-order
writing, because address 1 is written in the first it-
eration (ap[0] = 1), address 0 in the second iteration
(ap[1] = 0) and so on. Task T2 shows in-order read-
ing, since the addresses in the address list ac show a
consecutive order, starting at the first address of the
shared data structure.
Another property of a pattern is multiplicity: for
such a pattern, one or more addresses are accessed
multiple times during an execution of an aNLP. In
our system we assume that there is no write multiplicity,
since repeated writes are unnecessary operations for the
producer. Read multiplicity can occur and is shown
in Figure 1, where the consumer T2 reads addresses 3
and 6 twice each. When a consumer reads an address multiple
times, the data item at that address should be
buffered.
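Both properties can be checked mechanically on an address list. The sketch below is only an illustration: it treats a pattern as in-order when the addresses never decrease, which matches the way T2 is described but is an assumption of this example, and the function names are hypothetical.

```python
from collections import Counter

def is_out_of_order(addresses):
    """True if some address is accessed after a higher address."""
    return any(later < earlier
               for earlier, later in zip(addresses, addresses[1:]))

def multiplicity(addresses):
    """Map each address that is accessed more than once to its count."""
    counts = Counter(addresses)
    return {address: n for address, n in counts.items() if n > 1}

ap = [1, 0, 3, 2, 5, 4, 7, 6, 9, 8]         # producer T1
ac = [0, 1, 2, 3, 3, 4, 5, 6, 6, 7, 8, 9]   # consumer T2

print(is_out_of_order(ap), multiplicity(ap))  # True {}
print(is_out_of_order(ac), multiplicity(ac))  # False {3: 2, 6: 2}
```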
III. MPSoC architecture
The real-time requirements of the job to be mapped
require predictable behavior from the underlying architecture.
In [1], a template is described for a predictable
MPSoC. For our solution, it is sufficient if the
target MPSoC corresponds to this template.
The architecture template in [1] consists of processing
tiles that are interconnected via a network on
chip (NoC), see Figure 2. The NoC provides guaranteed
services, like uncorrupted, lossless and ordered
data delivery, guaranteed throughput and bounded
latency. A processing tile contains an SPM, a pro-
cessor, an arbiter and a network interface (NI). The
processor can read and write in the SPM of its own
processing tile or write in the SPM of another process-
ing tile. When the processor has to write in the SPM
of another processing tile, it posts the write to the
NI that autonomously handles this request, such that
the processor can continue. The arbiter guarantees
that the interference on the memory port, between
accesses from the NI and accesses from the processor,
is bounded.
IV. Problem statement
The problem addressed in this paper is to deter-
mine an efficient method for inter-task communica-
tion, where inter-task communication latency has a
minimal impact on the performance and no special-
ized hardware is required for the buffers. The method
is applicable for task graphs that are chains, where
the tasks are aNLPs. The challenges are to minimize
the computational overhead for the tasks and the re-
quired buffer size. Furthermore, the buffers should
enable out-of-order communication and multiplicity.
Splitting up the problem, the first subproblem is
the location of the buffer. The second subproblem
is how to synchronize the tasks and guarantee
mutually exclusive access to the data items
in the shared memory. The third subproblem is how
to determine the size of a buffer, such that out-of-order
access and multiplicity are possible. The fourth
subproblem is how to extend the tasks with operations
for the synchronization and buffer accesses.
V. Solution approach
This section discusses how the inter-task communi-
cation between tasks on a target MPSoC can be re-
alized. The example in Figure 1 is used to show how
we solve the problem formulated in Section IV.
A. Communication grain
When the complete data structure X, from Fig-
ure 1, is communicated as one block between the
tasks, it is called a block data transfer [3]. This kind
of communication has the advantage that the commu-
nication overhead, for example the starting address of
the data structure in the SPM, only has to be sent
for a single block. In our solution, communication is
performed at a word level granularity, meaning that
the data items of data structure X will be commu-
nicated individually. The advantage is that, often, a
buffer smaller than the size of the data structure is
sufficient and pipelining is possible by increasing the
size of the buffer. To determine the buffer size, the
read and write patterns in the shared data structure
have to be analyzed.
B. Buffer location
When the producer is writing in a different order
than the consumer is reading, or when the consumer
reads data multiple times, a buffer with more than one
word is required. This buffer can be either located in
the SPM of the producer or the consumer. When
the buffer is located in the SPM of the producer, it
is called receiver initiated communication [3]. In this
case the consumer has to send a request to the SPM in
the processing tile of the producer. This causes the
processor of the consumer to be stalled, from the mo-
ment it sends the read request until it receives the
data. Precommunication [3] can be used to send the
read request earlier, such that the data is in the SPM
of the consumer at the moment it is required. Hard-
ware solutions for precommunication detect at run-
time which addresses from the memory to precommu-
nicate. Software solutions typically insert precommu-
nication operations in the code at compile time.
When the buffer is located in the SPM of the con-
sumer, the producer sends its data to the SPM of the
consumer. This is called sender initiated communica-
tion [3] and is considered as a special kind of precom-
munication that requires no additional hardware or
software. In this case the consumer can read the data
from the buffer in its local SPM with a low latency,
minimizing the stall time of its processor.
Figure 2 shows an MPSoC according to our tem-
plate. The task graph from Figure 1 can be mapped
to this MPSoC, by assigning the producing task T1 to
processing tile 1 and the consuming task T2 to pro-
cessing tile 2. The buffer and its administration are
located in the SPM of the consumer, to have the ad-
vantage of sender initiated precommunication. The
dotted lines show the communication, from both the
producer and the consumer, to the buffer.
C. Memory consistency
In our solution sender initiated communication is
applied at a word level granularity. Therefore the producer
and the consumer communicate and synchronize
via the shared address space of the SPM of the
consumer. A memory consistency model is required
for the synchronization between the tasks.
The memory consistency model we use is stream-
ing memory consistency (StrC) [2]. According to
this model, the address of a shared variable is acquired
before it is accessed and released when it is no longer
needed; the acquire and release together form a
synchronization section. In comparison, release consistency
(RC) [6] allows shared variables to be accessed
outside synchronization sections. For synchronization
sections from different buffers StrC has no strict order-
ing, but RC does. Both StrC and RC require no strict
ordering of memory access operations within synchro-
nization sections. Unlike RC, StrC even allows acquire
operations of synchronization sections, from the same
buffer, to overtake release operations.
The key advantage of streaming memory consis-
tency is that it allows posted writes. A posted write is
a write operation that allows the producer to continue
executing, instead of stalling until the completion of
the write. A write operation is completed if the data
is stored in the memory. Posted writes are allowed
only if the synchronization variables are located in
the same memory as the buffer and if the connection
to the memory provides uncorrupted, lossless and or-
dered data delivery of reads and writes.
In our system, the producer initially performs an
acquire operation for an address range in the buffer,
which has to be confirmed. Once access to the address
range has been acquired, the writes are posted, as is the
release of the address range, which can immediately
be followed by an acquire. Because StrC allows an acquire
to overtake releases, acquire operations can be
pipelined to hide the latency of the NoC. As soon as
the first acquire is confirmed, the posting of write operations
can start, followed by release operations.
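The role of ordered delivery for posted writes can be illustrated with a small model. The sketch below is not the StrC protocol itself: the classes ConsumerSPM and OrderedLink are hypothetical stand-ins for the consumer's memory and for the NI and NoC, and the write pointer plays the role of the synchronization variable that is stored in the same memory as the buffer.

```python
from collections import deque

class ConsumerSPM:
    """Hypothetical SPM holding the buffer and its synchronization variable."""
    def __init__(self, size):
        self.data = [None] * size
        self.write_ptr = 0

class OrderedLink:
    """Hypothetical NI + NoC model: lossless, in-order delivery."""
    def __init__(self, spm):
        self.spm = spm
        self.in_flight = deque()

    def post_write(self, address, value):
        # The producer does not wait for completion of the write.
        self.in_flight.append(("data", address, value))

    def post_release(self, count):
        self.in_flight.append(("release", count, None))

    def deliver_one(self):
        kind, first, second = self.in_flight.popleft()
        if kind == "data":
            self.spm.data[first] = second
        else:
            self.spm.write_ptr += first

spm = ConsumerSPM(4)
link = OrderedLink(spm)
link.post_write(0, "a")
link.post_release(1)
link.post_write(1, "b")
link.post_release(1)

visible = []
while link.in_flight:
    link.deliver_one()
    # Everything below the write pointer must already be stored.
    assert all(spm.data[k] is not None for k in range(spm.write_ptr))
    visible.append(spm.write_ptr)
print(visible)  # [0, 1, 1, 2]
```

In this model the release can never arrive before the data it covers, because both travel over the same ordered connection to the same memory.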
D. Buffer type
The buffer used for inter-task communication will
be located in an SPM. The buffer should allow read
multiplicity and out-of-order reading and writing in
the buffer. This behavior is possible when using a
circular buffer (CB). In order to arrange the buffer
administration without additional hardware, a proto-
col such as C-HEAP [7] can be used.
In a CB, the producer can write all addresses be-
tween the write and the read pointer. Similarly, the
consumer can read all addresses between the read
and the write pointer. The producer makes data
available to the consumer, by increasing the write
pointer. When the consumer has read the address
at the read pointer for the last time, it increments the
read pointer, providing the address to be reused by
the producer. Hence, both the producer and the con-
sumer have exclusive access to their part of the buffer.
The pointers are not allowed to overtake each other.
Typically both pointers start at the same location and
the first pointer to be increased is the write pointer.
When a pointer reaches the end of the CB, it wraps
around.
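A minimal sketch of such a circular buffer is given below, under some simplifying assumptions: the class and method names are hypothetical, and the pointers are kept as ever-increasing counters that are mapped to a slot with a modulo operation, which makes it easy to tell a full buffer from an empty one.

```python
class CircularBuffer:
    """Hypothetical circular buffer with random access inside each part."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.read_ptr = 0    # addresses handed back by the consumer
        self.write_ptr = 0   # addresses made available by the producer

    def write(self, address, value):
        # Producer part: from the write pointer up to one lap past the read pointer.
        if not (self.write_ptr <= address < self.read_ptr + self.size):
            raise IndexError("address is not in the producer's part")
        self.slots[address % self.size] = value   # wrap around at the end

    def read(self, address):
        # Consumer part: from the read pointer up to the write pointer.
        if not (self.read_ptr <= address < self.write_ptr):
            raise IndexError("address is not in the consumer's part")
        return self.slots[address % self.size]

    def advance_write(self, count=1):
        if self.write_ptr + count > self.read_ptr + self.size:
            raise ValueError("write pointer may not overtake the read pointer")
        self.write_ptr += count

    def advance_read(self, count=1):
        if self.read_ptr + count > self.write_ptr:
            raise ValueError("read pointer may not overtake the write pointer")
        self.read_ptr += count

cb = CircularBuffer(3)
cb.write(1, "b")          # out-of-order writing inside the producer part
cb.write(0, "a")
cb.advance_write(2)       # make addresses 0 and 1 available
first = (cb.read(0), cb.read(1), cb.read(1))   # reading twice is allowed
cb.advance_read(1)        # address 0 may now be reused
cb.write(3, "d")          # lands in slot 0 after wrapping around
print(first, cb.slots)    # ('a', 'b', 'b') ['d', 'b', None]
```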
E. Read and write window
The producer and consumer task, from Figure 1,
use a CB in combination with StrC. To use a circular
buffer and StrC, the code of the tasks needs to be
annotated with acquire and release operations. A task
performs a release operation and increases its pointer
in the CB, when it is finished with the value at its
pointer address. Furthermore, any address it accesses
must have been acquired beforehand.
The operations added to the code are simple and
limited in number. It is possible to acquire each address
individually and release it when it is not needed
anymore, but this would impose too much control overhead,
as shown in [8].

[Diagram labels: Read window, Write window, d1, d2, Read pointer, Write pointer, Buffer size]
Fig. 3. A circular buffer, with a read and a write window
For our solution a read window and a write window
are defined, for the consumer and the producer,
respectively, as shown in Figure 3. Both windows are a
sequence of acquired addresses, of which only the first
and the last address need to be stored. When
an acquire is performed, an address from the buffer is
added at the head of the sequence. A release operation
removes the address at the tail of the sequence
and increments the pointer in the CB, since the released
address will not be accessed anymore.
To keep the control overhead in the code limited, every
iteration performs one acquire, until all addresses
of the data structure have been acquired. A
number of initial acquires may be required to guarantee
that in each iteration the address to be accessed is
acquired, e.g. due to out-of-order access. Furthermore,
after an initial number of iterations, every iteration
performs a release. This pattern of acquires and releases
turns the windows into sliding windows.
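The bookkeeping of such a window can be sketched with two counters. The class below is a hypothetical illustration: it stores only the tail and the head of the acquired sequence, and it leaves out the mapping of addresses to slots of the CB and the blocking on the other task. The values one for the initial acquire and for the first releasing iteration are taken from task T1 as assumptions of this example.

```python
class SlidingWindow:
    """Hypothetical window that stores only its first and last address."""

    def __init__(self):
        self.tail = 0   # oldest address that is still acquired
        self.head = 0   # one past the newest acquired address

    def acquire(self, count=1):
        self.head += count

    def release(self, count=1):
        if self.tail + count > self.head:
            raise ValueError("cannot release more addresses than are held")
        self.tail += count

    def holds(self, address):
        return self.tail <= address < self.head

    def size(self):
        return self.head - self.tail

window = SlidingWindow()
window.acquire(1)                 # one initial acquire before the loop-nest
sizes = []
for iteration, address in enumerate([1, 0, 3, 2]):   # first writes of T1
    window.acquire()              # one acquire per iteration
    assert window.holds(address)
    sizes.append(window.size())
    if iteration >= 1:            # releases start after one iteration
        window.release()
print(sizes)  # [2, 3, 3, 3]
```

After the start-up phase the window keeps a constant size and slides over the addresses by one position per iteration.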
F. Window size
The size of a read or write window is influenced
by the access pattern of the aNLP. Since every itera-
tion of a task acquires at most a single address in the
buffer, we have to guarantee for every iteration that at
least the address that is accessed is acquired. This can
be done by acquiring an initial number of addresses
in the CB, before the loop-nest starts, called the lead-
in (d1). In the following text, the first address to be
acquired in the CB is address 0.
Figure 4 shows a graphical method to determine the
lead-in, for task T1 from Figure 1. The top part of the
figure shows the order in which the addresses are ac-
quired by the write window, called addresses acquired.
In the bottom part, the address list ap is shown. To
guarantee that an address is not written before it is
acquired, the list with written addresses ap is shifted
right, such that every address is acquired before it is
written. The first acquires, which are performed before
any write operation can be performed, are grouped as
an initial number of acquires, called the lead-in. In
Figure 4 a lead-in of an acquire of one word is found.

Address acquired:      0  1  2  3  4  5  ...  9
Address written (ap):     1  0  3  2  5  ...  9  8
d1 = 1
Fig. 4. Lead-in (d1) of task T1 from Figure 1

Address written (ap):  1  0  3  2  5  4  ...  8
Address released:         0  1  2  3  4  ...  8  9
d2 = 1
Fig. 5. Lead-out (d2) of task T1 from Figure 1
A formal expression to determine the lead-in can be
given. Let a be the address list, i the iteration of the
NLP and î the total number of iterations:
Lemma 1: A lead-in of $d_1 = \max_i (a[i] - i)$, with
$0 \le i < \hat{\imath}$, is minimal and ensures that every address
is acquired before or when it is accessed.
Proof: In iteration j, address a[j] is accessed, so
at least a[j] addresses should be acquired in the
buffer. Each iteration performs one acquire, so initially
a[j] − j acquires should have been performed,
if a[j] ≥ j. To make sure that in each iteration of
the loop-nest the accessed address is acquired, the
minimal and sufficient number of initial acquires d1
is found by $\max_i (a[i] - i)$, with $0 \le i < \hat{\imath}$. □
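Lemma 1 translates directly into a few lines of code. The sketch below uses hypothetical function names and assumes zero-based addresses and one acquire at the start of every iteration, so that after the acquire of iteration i the addresses 0 up to and including d1 + i are held.

```python
def lead_in(a):
    """Lead-in d1 = max over all iterations i of (a[i] - i)."""
    return max(address - i for i, address in enumerate(a))

def acquired_in_time(a, d1):
    """True if every address is acquired before or when it is accessed."""
    return all(address <= d1 + i for i, address in enumerate(a))

ap = [1, 0, 3, 2, 5, 4, 7, 6, 9, 8]         # producer T1
ac = [0, 1, 2, 3, 3, 4, 5, 6, 6, 7, 8, 9]   # consumer T2

print(lead_in(ap), lead_in(ac))  # 1 0
print(acquired_in_time(ap, 1))   # True
print(acquired_in_time(ap, 0))   # False
```

The last line shows the minimality for T1: without the initial acquire, address 1 would be written in the first iteration before it is acquired.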
When the tail address from the window is written
or read for the last time it can be released, such that
the addresses in the buffer can be reused. Note for
task T1 from Figure 1 that it is not possible to start
with releasing one address per iteration in the first
iteration, because then after the first iteration address
0 is released, which should be written in the second
iteration (ap[1] = 0). Therefore an initial number of
iterations in which no address is released has to be
determined, called the lead-out (d2).
In a similar way as for the lead-in, a lead-out is
determined in Figure 5 for task T1 from Figure 1. In
this figure the top part shows the address list. In order
to guarantee that an address is not released before it is
written, the bottom part with the released addresses
is shifted right. The figure shows that with a lead-
out of one, all addresses are written before they are
released.
As for the lead-in, a formal expression to determine
the lead-out can be given. Let a be the address list,
i the iteration of the NLP and î the total number of
iterations:
Lemma 2: A lead-out of $d_2 = \max_i (i - a[i])$, with
$0 \le i < \hat{\imath}$, is the minimal number of iterations after
which each access can be combined with a release.
Proof: In iteration j, at least the address a[j] should
still be acquired in the buffer. After an initial number
of iterations each iteration releases one address. To make
sure that address a[j] is still acquired in iteration j,
at least the first j − a[j] iterations should release no
address, when j ≥ a[j]. To make sure that in each iteration
the accessed address is still acquired, minimally
the first $\max_i (i - a[i])$ iterations, with $0 \le i < \hat{\imath}$, of
the loop-nest should not release an address. □

Address read (ac):  0  1  2  3  3  4  ...  8  9
Address released:         0  1  2  3  ...  8  9
d2 = 2
Fig. 6. Lead-out (d2) of task T2 from Figure 1
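The lead-out can be computed in the same way as the lead-in. In the sketch below, which uses hypothetical function names, the release is assumed to take place at the end of an iteration, so that at the start of iteration i the addresses below i − d2 have been released.

```python
def lead_out(a):
    """Lead-out d2 = max over all iterations i of (i - a[i])."""
    return max(i - address for i, address in enumerate(a))

def never_released_early(a, d2):
    """True if no address is released before its last access."""
    return all(address >= i - d2 for i, address in enumerate(a))

ap = [1, 0, 3, 2, 5, 4, 7, 6, 9, 8]         # producer T1
ac = [0, 1, 2, 3, 3, 4, 5, 6, 6, 7, 8, 9]   # consumer T2

print(lead_out(ap), lead_out(ac))    # 1 2
print(never_released_early(ac, 2))   # True
print(never_released_early(ac, 1))   # False
```

For T2 a lead-out of one is too small: in iteration 8 address 6 is read for the second time, while it would already have been released.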
A window is built up from a lead-in d1, a lead-out
d2 and a location for the current address, making
the window size w = d1 + d2 + 1. We can define a read
window for the consumer wc and a write window for
the producer wp.
When the consumption pattern of the consumer
shows multiplicity, it is possible that d1 + d2 + 1 is
larger than the highest address in the address list ac.
A window containing the whole shared data structure
is sufficient. Let â be the highest address in the ad-
dress list, which is also the maximum window size.
Theorem 1: In combination with a lead-in and a
lead-out, a window with size $w = \min(d_1 + d_2 + 1, \hat{a})$
guarantees that the accessed address is in the window.
Proof: Lemmas 1 and 2 state that the address
to be accessed will be acquired. Lemma 1 shows
that the lead-in is determined such that $a[i] \le i + d_1$.
Lemma 2 shows that the lead-out is determined such
that $i - d_2 \le a[i]$. Therefore the combination of
both ensures that the accessed address is always in
the window, $i - d_2 \le a[i] \le i + d_1$. Since at most all
addresses of the data structure are acquired, the maximum
window size is â. □
For the example from Figure 1, a write window of size
wp = 1 + 1 + 1 = 3 is derived. The consumption pattern
of task T2 shows multiplicity. Figure 6 shows that a
lead-out d2 = 2 can be determined. Furthermore the
figure shows that the addresses are accessed in order,
requiring a lead-in d1 = 0. The window required by
the consumer is wc = 3.
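The window sizes of the example can be recomputed as follows. This is a sketch with a hypothetical function name; the cap on the window size is passed in as the number of addresses in the data structure (10 for X[10]), and reading the maximum window size that way is an assumption of this sketch.

```python
def window_size(a, n_items):
    """Window size d1 + d2 + 1, capped at the size of the data structure."""
    d1 = max(address - i for i, address in enumerate(a))   # lead-in
    d2 = max(i - address for i, address in enumerate(a))   # lead-out
    return min(d1 + d2 + 1, n_items)

ap = [1, 0, 3, 2, 5, 4, 7, 6, 9, 8]         # producer T1
ac = [0, 1, 2, 3, 3, 4, 5, 6, 6, 7, 8, 9]   # consumer T2

wp = window_size(ap, 10)
wc = window_size(ac, 10)
print(wp, wc)  # 3 3

# A pattern with strong multiplicity: d1 + d2 + 1 = 5 exceeds the two
# addresses that exist, so the cap determines the window size.
print(window_size([0, 1, 0, 1, 0, 1], 2))  # 2
```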
G. Buffer size
The CB in Figure 3, contains both a read and a
write window. With the derived sizes of these win-
dows, the minimum required size of the CB for this
solution can be determined.
At the beginning of an iteration, a task acquires one
address and at the end of the iteration one address is
released, thus only during an iteration the whole win-
dow size is acquired. Before both the producer and
the consumer start their iteration, thus before they
performed the acquire operation in their inner-loop,
there are wp − 1 + wc − 1 addresses acquired. To allow
either the consumer or the producer to perform
an acquire and start an iteration, one additional address
should be available in the buffer. Therefore the
buffer size that allows the execution of the producer
and consumer is wp + wc − 1.

T1 (producer):
    int tp = 1
    acquire(1,CB1)
    for i0:1:1:5
      for i1:0:1:1 {
        if (tp < 9)
          acquire(1,CB1)
        write(CB1, 2i0 − i1 − 1, ∼)
        if (tp > 1)
          release(1,CB1)
        tp++
      }
    release(1,CB1)

T2 (consumer):
    int tc = 1
    acquire(0,CB1)
    for j0:0:1:2
      for j1:0:1:3 {
        if (tc < 10)
          acquire(1,CB1)
        ∼ = read(CB1, 3j0 + j1)
        if (tc > 2)
          release(1,CB1)
        tc++
      }
    release(0,CB1)

Fig. 7. Annotated producer consumer pair, from Figure 1
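For the example of Figure 1 the resulting buffer size can be computed as sketched below. The function names are hypothetical, and the cap of the window size at the number of addresses in the data structure is an assumption of this sketch.

```python
def window_size(a, n_items):
    """Window size d1 + d2 + 1, capped at the size of the data structure."""
    d1 = max(address - i for i, address in enumerate(a))   # lead-in
    d2 = max(i - address for i, address in enumerate(a))   # lead-out
    return min(d1 + d2 + 1, n_items)

def buffer_size(ap, ac, n_items):
    """Buffer size wp + wc - 1 for a producer-consumer pair."""
    return window_size(ap, n_items) + window_size(ac, n_items) - 1

ap = [1, 0, 3, 2, 5, 4, 7, 6, 9, 8]         # producer T1
ac = [0, 1, 2, 3, 3, 4, 5, 6, 6, 7, 8, 9]   # consumer T2

size = buffer_size(ap, ac, 10)
print(size)  # 5
```

A buffer of five words is thus sufficient here, half the size of the data structure X of ten words.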
A chain, in which every task has at most one input
and one output buffer, is deadlock free if the buffer
sizes are chosen as wp + wc − 1. Consider a task Tn,
that shares its input buffer with task Tn−1 and its out-
put buffer with task Tn+1. If the input buffer contains
data and there is space in the output buffer, Tn will
execute. If there is no space in the output buffer, it
is filled, so Tn+1 will eventually execute and provide
space. If there is no data in the input buffer, Tn−1 will
eventually execute and provide data. In all cases Tn
can eventually execute, so the chain is deadlock free.
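The argument can be tried out on the producer-consumer pair of Figure 1 with a small simulation. The sketch below is an illustrative model, not the implementation of the tasks: all names are hypothetical, the producer may only acquire addresses that the consumer has released one buffer size earlier, and the consumer may only acquire addresses that the producer has released.

```python
def lead(a):
    """Return (lead-in, lead-out) of an address list."""
    d1 = max(address - i for i, address in enumerate(a))
    d2 = max(i - address for i, address in enumerate(a))
    return d1, d2

def trace(a, n_items):
    """Operations of one task: acquires, accesses and releases."""
    d1, d2 = lead(a)
    ops, acquired, released = [("acquire", d1)], d1, 0
    for i, address in enumerate(a):
        if acquired < n_items:
            ops.append(("acquire", 1))
            acquired += 1
        ops.append(("access", address))
        if i >= d2:
            ops.append(("release", 1))
            released += 1
    ops.append(("release", acquired - released))
    return ops

def simulate(ap, ac, n_items, size):
    """True if both tasks finish, False if neither can make progress."""
    prog = {"p": trace(ap, n_items), "c": trace(ac, n_items)}
    pc = {"p": 0, "c": 0}
    acq = {"p": 0, "c": 0}
    rel = {"p": 0, "c": 0}
    slots = [None] * size

    def step(task):
        kind, arg = prog[task][pc[task]]
        if kind == "acquire":
            limit = rel["c"] + size if task == "p" else rel["p"]
            if acq[task] + arg > limit:
                return False                   # blocked on the other task
            acq[task] += arg
        elif kind == "access":
            assert rel[task] <= arg < acq[task]    # inside the window
            if task == "p":
                slots[arg % size] = arg            # the data item is its address
            else:
                assert slots[arg % size] == arg    # the expected item is read
        else:
            rel[task] += arg
        pc[task] += 1
        return True

    while any(pc[t] < len(prog[t]) for t in "pc"):
        progressed = False
        for t in "pc":
            while pc[t] < len(prog[t]) and step(t):
                progressed = True
        if not progressed:
            return False
    return True

ap = [1, 0, 3, 2, 5, 4, 7, 6, 9, 8]         # producer T1
ac = [0, 1, 2, 3, 3, 4, 5, 6, 6, 7, 8, 9]   # consumer T2
print(simulate(ap, ac, 10, 5))  # True
print(simulate(ap, ac, 10, 2))  # False
```

With a buffer of wp + wc − 1 = 5 words both tasks run to completion in this model, whereas a buffer of two words, which is smaller than the write window, blocks the producer before it has released any data.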
H. Annotating the NLP
The NLP of the tasks in the task graph needs to
be extended to make use of the circular buffer and
the sliding windows. Figure 7 shows how the NLPs
from Figure 1 should be annotated. Initially both
NLPs acquire d1 addresses in the buffer. In the in-
nermost loop of the loop-nest, before the operation is
performed, an if-statement with an acquire is added.
The if-statement guarantees that no more addresses
are acquired than there are data items in the shared
data structure. The access to the shared data struc-
ture is replaced with an access operation to the CB.
At the end of an iteration an if-statement is inserted
that starts releasing addresses from the window after
d2 iterations. At the end of the NLP the remaining
addresses in the window are released.
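The annotation steps can be sketched as a function that derives the sequence of operations of a task from its address list. The function name and the tuple representation are hypothetical, and the guard of the acquire is expressed here as a count of acquired addresses instead of an iteration counter; this is one possible bookkeeping, not necessarily the one used in Figure 7.

```python
def annotate(a, n_items):
    """List the acquire, access and release operations of one task."""
    d1 = max(address - i for i, address in enumerate(a))   # lead-in
    d2 = max(i - address for i, address in enumerate(a))   # lead-out
    ops = [("acquire", d1)]               # initial acquires
    acquired, released = d1, 0
    for i, address in enumerate(a):
        if acquired < n_items:            # never acquire more than exists
            ops.append(("acquire", 1))
            acquired += 1
        assert released <= address < acquired
        ops.append(("access", address))   # write or read in the CB
        if i >= d2:                       # start releasing after d2 iterations
            ops.append(("release", 1))
            released += 1
    ops.append(("release", acquired - released))   # remaining addresses
    return ops

ap = [1, 0, 3, 2, 5, 4, 7, 6, 9, 8]         # producer T1
ac = [0, 1, 2, 3, 3, 4, 5, 6, 6, 7, 8, 9]   # consumer T2

t1 = annotate(ap, 10)
t2 = annotate(ac, 10)
print(t1[0], t1[-1])  # ('acquire', 1) ('release', 1)
print(t2[0], t2[-1])  # ('acquire', 0) ('release', 0)
```

The first and last operations correspond to the acquire before and the release after the loop-nests in Figure 7.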
VI. Conclusion
This paper presents a novel software solution for
hiding inter-task communication latency, that re-
quires no additional hardware. Locating the buffer for
inter-task communication in the scratch-pad memory
of the consuming task, provides low latency access to
it. In combination with the streaming memory consistency
protocol, the producing task can post its writes,
so that it does not stall until the writes complete. To
be able to handle the different access patterns of the
producer and the consumer to the buffer, a circular
buffer is used with a write window for the producer
and a read window for the consumer. The sizes of
these windows can be determined such that they cover
the different access patterns, and from them the minimum
required buffer size can be derived. In comparison, a single
FIFO buffer cannot handle most of these access patterns.
Only a few operations need to be added to the nested
loop programs of the tasks, such that they can handle the
circular buffers.
The next step is to apply this solution to a real-life
application. Another step is to extend the solution to
work for task graphs in the form of a directed acyclic
graph. It is challenging to increase the expressivity of
the nested loop programs that can be handled.
References
[1] M. Bekooij, A. Moonen, and J. van Meerbergen. Pre-
dictable and composable multiprocessor system design: A
constructive approach. In Bits & Chips Embedded System
symposium, October 2007. Accepted for publication.
[2] J. v. d. Brand and M. Bekooij. Streaming memory con-
sistency for efficient MPSoC design. In Proc. Euromicro
Symposium on Digital System Design, 2007.
[3] D. Culler, A. Gupta, and J. Singh. Parallel Computer Ar-
chitecture: A Hardware/Software Approach. Morgan Kauf-
mann, 1999.
[4] M. Dasygenis, E. Brockmeyer, B. Durinck, F. Catthoor,
D. Soudris, and A. Thanailakis. A combined DMA and
application-specific prefetching approach for tackling the
memory latency bottleneck. In IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, 2006.
[5] P. Feautrier. Dataflow analysis of array and scalar refer-
ences. In Int’l Journal of Parallel Programming, 1991.
[6] K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons,
A. Gupta, and J. Hennessy. Memory consistency and event
ordering in scalable shared-memory multiprocessors. In
Proc. Int’l Symposium on Computer Architecture, 1990.
[7] A. Nieuwland, J. Kang, O. Gangwal, R. Sethuraman,
N. Busá, K. Goossens, R. Peset Llopis, and P. Lippens. C-HEAP:
A heterogeneous multi-processor architecture template
and scalable and flexible protocol for the design of
embedded signal processing systems. In Proc. Design Automation
Conference (DAC), 2002.
[8] A. Turjan, B. Kienhuis, and E. Deprettere. Realizations
of the extended linearization model. In Proc. Int’l Work-
shop on Systems, Architectures, Modeling, and Simulation
(SAMOS), 2002.
[9] A. Turjan, B. Kienhuis, and E. Deprettere. Translating
affine nested-loop programs to process networks. In Proc.
Int’l conference on Compilers, architecture, and synthesis
for embedded systems, 2004.
[10] M. Verma, L. Wehmeyer, and P. Marwedel. Dynamic over-
lay of scratchpad memory for energy minimization. In Proc.
Int’l Conference on Hardware-Software Codesign and Sys-
tem Synthesis (CODES+ISSS), 2004.
