Buffer sizing for self-timed stream programs on heterogeneous distributed memory multiprocessors by Carpenter, Paul et al.
Buffer Sizing for Self-timed Stream Programs on
Heterogeneous Distributed Memory
Multiprocessors
Paul M. Carpenter, Alex Ramirez, and Eduard Ayguadé
Barcelona Supercomputing Center, C/Jordi Girona, 31, 08034 Barcelona, Spain
{paul.carpenter,alex.ramirez,eduard.ayguade}@bsc.es
Abstract. Stream programming is a promising way to expose concur-
rency to the compiler. A stream program is built from kernels that com-
municate only via point-to-point streams. The stream compiler statically
allocates these kernels to processors, applying blocking, ﬁssion and fusion
transformations. The compiler determines the sizes of the communication
buﬀers, which aﬀects performance since local memories can be small.
In this paper, we propose a feedback-directed algorithm that deter-
mines the size of each communication buﬀer, based on i) the stream
program that has been mapped onto processors, ii) feedback from an
earlier execution, and iii) the memory constraints. The algorithm ex-
poses a trade-oﬀ between throughput and latency. It is general, in that
it applies to stream programs with unstructured stream graphs, and it
supports variable execution times and communication rates.
We show results for the StreamIt benchmarks and random graphs. For
the StreamIt benchmarks, throughput is optimal after the ﬁrst iteration.
For random graphs with stochastic computation times, throughput is
within 3% of optimal after four iterations. Compared with the previ-
ous general algorithm, by Basten and Hoogerbrugge, our algorithm has
signiﬁcantly better performance and latency.
1 Introduction
Many applications, including video, audio, 3D graphics, and radio, contain abun-
dant task and data parallelism, but it is hard to extract from C source code.
Stream programming represents the application as concurrent kernels, interact-
ing only via point-to-point streams of data. This representation exposes concur-
rency to the compiler, is natural for signal processing, and easier to debug since
it is deterministic. As the industry moves towards multiprocessors [1], there is
increasing interest in portable, eﬃcient, correct use of parallelism.
Much work on stream compilation has focused on blocking and allocation.
Blocking unrolls kernels to amortise ﬁxed costs. Allocation fuses one or more
kernels, from the source program, into each task, in the executable, and maps
these tasks onto processors, balancing loads on processors and buses.
This paper considers a problem that has received less attention: allocating
memory for stream buﬀers, subject to memory constraints, when computation
Y.N. Patt et al. (Eds.): HiPEAC 2010, LNCS 5952, pp. 96–110, 2010.
c© Springer-Verlag Berlin Heidelberg 2010
times and communication rates are variable. This is an important problem, be-
cause it aﬀects performance, as we explain in Section 2. The buﬀer sizes are
constrained by the available memory, which may be small. On the Cell Broad-
band Engine [2], for example, code and data must ﬁt in the 256KB local store.
The inputs to the algorithm are the mapped stream program, a program trace
and the machine description, giving the target topology and memory budgets.
A simple model of computation times and communication rates, such as inde-
pendent normal distributions and Poisson arrivals, may be misleading, so the
only options are simulation and real execution. We use coarse-grain simulation,
but real execution could be used instead. The output is the buﬀer size for the
producer and consumer on each stream, which may be diﬀerent.
The main contributions of this paper are:
– In Section 3, we describe a feedback-driven method to allocate stream buﬀers
in a distributed memory machine, when computation times and communi-
cation rates are variable.
– In Section 5.1, we describe two algorithms that analyse proﬁling information
to ﬁnd bottleneck cycles caused by undersized communication buﬀers. The
ﬁrst uses waiting times only; the second is more complex but more accurate.
– In Section 5.2, we describe an algorithm to allocate stream buﬀers using the
above algorithms, which converges quickly to a close-to-optimal allocation.
2 Motivation
Double buﬀering is a well-known technique to overlap communication and com-
putation. There are two situations, however, when a stream ought to be allo-
cated more than two buﬀers. The ﬁrst is when a stream covers a long latency or,
equivalently, crosses more than one pipeline stage boundary. The second is when
there are short-duration load imbalances due to variable computation times or
communication rates.
The chain8 benchmark illustrates the ﬁrst situation, and is shown on the left
of Figure 1. It has eight tasks in a pipeline, with streams between consecutive
tasks, and another stream between the ﬁrst and last tasks. Figure 1(a) shows
the progress of the ﬁrst and last tasks relative to the stream between them. The
vertical axis is time, and the horizontal axis is the position in the stream. At
any given time the producer is working on some interval of the stream, which it
owns. It starts at the top left of the plot, at the beginning of both the stream and
time, moving to the right when it sends data to the consumer, and continually
downward through time. The ﬁgure also shows the progress of the consumer.
The periodic pattern of waiting is caused by the interaction between two
dependencies. First, the consumer must wait for its data to arrive, which means
that it waits for the producer, plus the latency of the pipeline. This gives a
vertical dependency from producer to consumer. Second, the producer must
wait for an empty consumer-side buﬀer in which to send its data, and this gives
a horizontal dependency from consumer to producer.
[Figure 1: progress plots, with time on the vertical axis and position in the stream on the horizontal axis. Left, chain8 (tasks t1, t2, ..., t8; producer t1, consumer t8): (a) 2 buffers, (b) 6 buffers; time 0 to 100 ms, stream position 0 to 80000. Right, producer-consumer (tasks t1, t2): (c) 2 buffers, (d) 5 buffers; time 0 to 20 ms, stream position 0 to 1000. Legend: Producer work, Push Remote Wait, Data Sent, Pop Wait, Consumer work.]
Fig. 1. Effect of consumer queue length on chain8 and producer-consumer
Figure 1(b) is for six consumer-side buﬀers, which increases throughput by
73%, and is suﬃcient for the producer to be always busy. This shows that double
buﬀering was not suﬃcient, but also that the number of buﬀers can be less than
one plus the diﬀerence in pipeline stage, which is the number of buﬀers allocated
by StreamRoller [3] and SPIR [4]; in this case eight.
The second situation is illustrated using the producer-consumer example on
the right of Figure 1. If the producer and consumer both have ﬁxed computation
times and communication rates, then double buﬀering is suﬃcient. Sometimes,
single buﬀering at one or other end will be enough, even with good load bal-
ancing. Subﬁgure (c) shows the progress of this example, using double buﬀering,
when computation times are normally distributed. Increasing the number of con-
sumer buﬀers to ﬁve, as shown in subﬁgure (d), increases throughput by 20%.
The performance of the queue length assignment algorithm is quantiﬁed using
the utilisation, which is the percentage of time that the most heavily loaded
processor or bus is busy. Utilisation is proportional to throughput. If the stream
graph is acyclic, at least one resource ought to be 100% busy. If any resource
has utilisation less than 100%, it must be due to insuﬃcient buﬀering.
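As a small illustration of this metric, the sketch below computes utilisation from per-resource busy times. The function name, the resource names and the busy times are hypothetical; in practice they would come from the simulator or from a real execution.

```python
def utilisation(busy_time, total_time):
    """Fraction of the run for which the most heavily loaded resource is busy."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    return max(busy_time.values()) / total_time


# Hypothetical busy times (ms) of two processors and one bus in a 100 ms run.
busy = {"spe1": 58.0, "spe2": 73.0, "bus": 21.0}
u = utilisation(busy, 100.0)
print(f"utilisation = {u:.2f}")
```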
The tradeoﬀ between utilisation and the number of consumer buﬀers is il-
lustrated in Figure 2. Chain has linearly increasing utilisation until it reaches
100%. Producer-consumer achieves 99% utilisation with 3 producer and 4 con-
sumer buﬀers, and additional buﬀering yields diminishing returns.
3 The ACOTES Stream Compiler
This work is part of the ACOTES European project [5], which is developing an
open source stream compiler for embedded systems. The compiler will map a
portable stream program, written in the SPM [6], an annotated version of C,
onto a heterogeneous multicore system, applying blocking and task fusion.
The compiler statically allocates tasks onto processors. Although a dynamic
policy can achieve better load balance [7], it has greater overhead. On a dis-
tributed memory processor, instructions and state cannot be transferred on de-
mand through caches, so a context switch requires all data to be transferred at
once. A context switch on the Cell SPE requires about 30μs [8]. The techniques
in this paper can be used to absorb small scale variation in complexity.
Figure 3 shows how the queue length assignment algorithm ﬁts into this stream
compiler. The blocking and partitioning stages transform the program as de-
scribed in the introduction. The queue length assignment stage, which is the
focus of this paper, then determines the optimal buﬀer allocation.
[Figure 2: resource utilisation (0.4 to 1.0) against the number of consumer buffers (2 to 8). Series: prodcons with 3 producer buffers, prodcons with 2 producer buffers, prodcons with 1 producer buffer, and chain11.]
Fig. 2. Memory-performance tradeoff
[Figure 3: flow diagram. Stages: Blocking and splitting; Partition and allocate; Queue length assignment. Components inside queue length assignment: Buffer size update, Coarse grain simulation, Cycle detection, Evaluation. Arrow labels: Partition, Allocation, Metrics, Bottleneck, Final allocation, Fail.]
Fig. 3. Mapping phase of the compiler
Our SPM language eliminates deadlock, so the objective function depends
only on performance and latency. The interaction between bounded memory
in process networks and deadlock, but not performance, has been explored in
depth [9,10,11], and these techniques can determine the minimum buﬀer sizes.
The queue length assignment algorithm is iterative, and consists of a coarse-
grain simulator, a cycle detection algorithm, a buﬀer size update algorithm, and
an evaluation algorithm. The cycle detection algorithm analyses metrics from
the simulator, and ﬁnds a bottleneck cycle. The buﬀer update algorithm chooses
the initial buﬀer allocation, and adjusts buﬀer sizes to resolve the bottleneck.
The evaluation algorithm monitors progress and decides when to stop, choosing
the buﬀer allocation that achieved the best performance-latency tradeoﬀ.
4 Formalisation of the Problem
Queue length assignment seeks to find an optimal tradeoff, subject to memory
constraints, between throughput and latency. We wish to find a close to Pareto
optimal solution: that is, neither latency nor throughput can be improved
without making the other one worse. We keep memory use within the constraints,
but do not try to minimise it.
The stream program is represented as a connected, not necessarily acyclic,
digraph, P = (T, S), where T is the set of vertices (tasks), and S is the set of
edges (streams). Each stream s has a producer and consumer buﬀer size in bytes,
bp(s) and bc(s), and a minimum number of buﬀers, suﬃcient to hold the working
set and avoid deadlocks. If P is acyclic, as for ACOTES, deadlock is impossible;
otherwise minimum sizes can be found using the references in Section 3. The
algorithm determines the actual number of buﬀers, np(s) and nc(s).
Each task has a trace, which is an alternating sequence of computation times
and primitives. There are four communications primitives and a fire primitive,
which marks the ﬁring of a task; i.e. the calling of its work function inside an
implicit loop. The communications primitives use a push model similar to the
DBI variant of TTL [12]. They are described below, assuming, for simplicity, that
the producer and consumer have the same buﬀer size, which is not required. A
block is the contents of one buﬀer, and i and j count blocks, starting at zero.
The ﬁrst argument is the stream.
ProducerAcquire(s, k): Wait for the producer buﬀer for block i+k to be avail-
able, meaning that the DMA transfer of block i + k − np(s) has completed.
ProducerSend(s): Wait for the consumer buﬀer for block i to be available,
meaning that the producer has received acknowledgement that block i−nc(s)
has been discarded. Then send the block and increment i.
ConsumerAcquire(s, k): Wait for block j + k to arrive in the consumer buﬀer.
ConsumerDiscard(s): Discard block j, send acknowledgement, and increment j.
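The blocking conditions of the three waiting primitives can be written as simple predicates over block counters. The following is a minimal sketch, not the ACOTES runtime: the class, its field names and the counter representation (each counter holds how many blocks have reached the given state) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StreamState:
    """Counters for one stream; all names here are illustrative."""
    n_p: int           # number of producer buffers, np(s)
    n_c: int           # number of consumer buffers, nc(s)
    i: int = 0         # index of the next block the producer will send
    j: int = 0         # index of the next block the consumer will discard
    dma_done: int = 0  # blocks whose DMA transfer has completed
    acked: int = 0     # discard acknowledgements received by the producer
    arrived: int = 0   # blocks that have arrived in the consumer buffers

    def producer_acquire_ready(self, k: int) -> bool:
        # The buffer for block i+k is free once block i+k-np has been transferred.
        return self.dma_done > self.i + k - self.n_p

    def producer_send_ready(self) -> bool:
        # The consumer buffer for block i is free once block i-nc was discarded.
        return self.acked > self.i - self.n_c

    def consumer_acquire_ready(self, k: int) -> bool:
        # Block j+k must already have arrived in the consumer buffer.
        return self.arrived > self.j + k


s = StreamState(n_p=1, n_c=2)
print(s.producer_send_ready())   # two consumer buffers are free at the start
s.i = 2                          # two blocks sent, none acknowledged yet
print(s.producer_send_ready())
```

With nc(s) = 2, the producer can send blocks 0 and 1 immediately, but must wait for the first acknowledgement before sending block 2.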
The traces are interpreted using the ASM coarse-grain simulator, which takes
a machine description that deﬁnes the target [13]. Queue length assignment needs
only the memory constraints, which are represented using a bipartite graph,
H = (R,E). The set of vertices, R = P ∪M , is a disjoint union of processors P
and memories M , and the edges, E, connect processors to their local memories.
Each memory has weight equal to the amount of memory available, in bytes,
for stream buﬀers. Figure 4 shows the memory constraint graph for the Cell
Broadband Engine; the memory weights depend on how much memory is already
being used. We will later assume that each processor is connected to a single
memory, but it may be shared with other processors.
[Figure 4: bipartite graph. Processors, P: SPE1, SPE2, ..., SPE8. Memories, M: LS1, LS2, ..., LS8.]
Fig. 4. Memory constraint graph for the Cell Broadband Engine
The evaluation algorithm and Section 6 of the paper require an estimate of
latency. Since it is orthogonal to the rest of the paper, and only diﬀerences in
latency matter, we use a scheme which ignores delays inside tasks.
Define $f_t(n)$ to be the time of firing $n = 0, 1, \ldots, M_t - 1$ of task $t$, taken from
the fire primitive. Since each task contributes to a common amount of real-world
progress, normalise $n$ to the interval $0 \le x < 1$ by dividing it by $M_t$. Then
$g_t(x) = f_t(M_t x)$ gives the time that task $t$ was proportion $x \in [0, 1)$ through
the calculation. The latency, $L(x)$, is the difference between the largest $g_t(x)$ for
a sink and the smallest $g_t(x)$ for a source, which can, unfortunately, be negative
when multiplicities are variable. We report the average value of $L(x)$.
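The latency estimate can be sketched in a few lines. This is an illustration under stated assumptions: the firing index is taken as the integer part of M_t x, L(x) is averaged over evenly spaced sample points, and the traces and function names are hypothetical.

```python
import math


def g(firing_times, x):
    """Time at which a task was proportion x (0 <= x < 1) through its firings."""
    m = len(firing_times)
    # Assumption: round M_t * x down to obtain a firing index.
    return firing_times[min(math.floor(m * x), m - 1)]


def average_latency(source_traces, sink_traces, samples=100):
    """Average of L(x) over evenly spaced x in [0, 1)."""
    total = 0.0
    for n in range(samples):
        x = n / samples
        latest_sink = max(g(f, x) for f in sink_traces)
        earliest_source = min(g(f, x) for f in source_traces)
        total += latest_sink - earliest_source
    return total / samples


# Hypothetical firing times: one source firing four times, one sink firing twice.
source = [[0.0, 10.0, 20.0, 30.0]]
sink = [[25.0, 45.0]]
print(average_latency(source, sink))
```

Because the two tasks have different multiplicities, normalising by M_t is what allows firing 1 of the sink to be compared with firings 2 and 3 of the source.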
5 Description of the Algorithms
In this section, we describe several algorithms for cycle detection and buﬀer
size update. First we review the standard critical cycle detection algorithm,
and explain when it is applicable. We introduce our baseline algorithm, which
ﬁnds the bottleneck cycle by analysing the time each task is blocked on each
stream. This data is easy to obtain, and the algorithm is quite eﬀective. We
then give an example that the baseline algorithm gets wrong, and propose the
token algorithm, which requires extra bookkeeping but achieves better results.
Finally, we describe several variants on the buﬀer update algorithm, which have
diﬀerent tradeoﬀs between speed of convergence and latency.
[Figure 5(a): timed event graph with vertices ProducerAcquire, ProducerSend, ConsumerAcquire and ConsumerDiscard; edge labels (weight, height): (448, 0), (0, 1), (480, 0), (0, 1), (13, 1), (13, 0), (13, 1).]

(b) Types of edge
Style    Waiting primitive (§4)
Bold     ProducerAcquire
Dashed   ConsumerAcquire
Solid    ProducerSend
Dotted   Computation

Fig. 5. Example timed event graph used by the critical cycle algorithm
5.1 Cycle Detection Algorithms
Critical cycle algorithm: The critical cycle algorithm [14,15,16] solves the
cycle detection problem for homogeneous Synchronous Data Flow (SDF) [17]
with constant computation times and communications latencies. In homogeneous
SDF, every time a producer or consumer ﬁres, it pushes or pops a single buﬀer
on each stream. All tasks therefore ﬁre at the same rate. The algorithm can be
extended to SDF, where each producer or consumer pushes or pops any ﬁxed
number of buﬀers, but it requires expanding the graph, which can make it much
bigger [18].
Figure 5(a) shows how producer-consumer, assuming a single buﬀer at each
end, is represented by this algorithm. Each vertex is the return from a commu-
nications primitive. The edges are distinguished, for the diagram but not the
algorithm, using the convention in subﬁgure (b), which refers to the primitives
in Section 4. Each edge has weight, which is its ﬁxed computation time or com-
munications latency, and height, which is the ﬁxed diﬀerence between the ﬁring
number, which counts the number of times a task has ﬁred, at its two ends.
For example, at the producer side, the dotted line from ProducerAcquire to
ProducerSend, of weight 448 and height 0, represents computation inside a single
iteration. The solid line in the reverse direction, of weight 13 and height 1, is
because the producer cannot reuse its single buﬀer in the current ﬁring until the
previous DMA has completed.
Throughput is constrained by the critical cycle, which is a cycle with maxi-
mum ratio of total weight divided by total height. There are several algorithms
to find such a cycle, many based on Karp's Theorem [19], in time $O(|S|^2 |T|)$ or
so [15], using the terminology of Section 4.
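The cycle ratio can be illustrated on the two producer-side edges described above. The sketch below is a brute-force enumeration of simple cycles, suitable only for tiny graphs; it is not one of the Karp-based algorithms cited here, and the function name, vertex labels and edge tuple layout are assumptions made for illustration.

```python
def max_cycle_ratio(edges):
    """Largest (total weight / total height) over all simple cycles.

    edges: list of (tail, head, weight, height). Exponential time.
    Cycles of zero total height are skipped, since their ratio is undefined.
    """
    out = {}
    for u, v, w, h in edges:
        out.setdefault(u, []).append((v, w, h))
    best = None

    def dfs(start, node, weight, height, visited):
        nonlocal best
        for v, w, h in out.get(node, []):
            if v == start:
                tw, th = weight + w, height + h
                if th > 0 and (best is None or tw / th > best):
                    best = tw / th
            elif v not in visited and v > start:
                # Only explore cycles whose smallest vertex is `start`,
                # so that each simple cycle is enumerated exactly once.
                dfs(start, v, weight + w, height + h, visited | {v})

    for start in out:
        dfs(start, start, 0.0, 0, {start})
    return best


# Producer side of the example: computation edge and buffer-reuse edge.
edges = [
    ("ProducerAcquire", "ProducerSend", 448, 0),
    ("ProducerSend", "ProducerAcquire", 13, 1),
]
print(max_cycle_ratio(edges))
```

For these two edges the only cycle has total weight 448 + 13 = 461 and total height 1, so with a single producer buffer the producer cannot fire more often than once every 461 time units.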
Baseline Algorithm: Our baseline algorithm is more general, because it sup-
ports variable data rates, computation times, and communication latencies. It
ﬁnds the bottleneck by analysing wait times in a real execution or simulation.
Figure 6 shows how the stream program and wait times are represented by
the algorithm. Subﬁgure (a) is an example stream graph with three tasks in a
triangle. Subﬁgure (b) is the wait-for graph, which has the same three edges per
stream as the timed event graph. Following convention for wait-for graphs, the
arrows point in the opposite direction, from the waiting task. The weight of an
edge is the proportion of the total time that the task at the initial vertex, or
tail, spent waiting in its communications primitive.
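[Figure 6: (a) Program: stream graph with tasks t0, t1 and t2 in a triangle. (b) Wait-for graph on t0, t1 and t2, with edge weights 0.27, 0.34 and 0.37. (c) Wait-for graph on t0, t1 and t2, with edge weights 0.77, 0.09, 0.05 and 0.13, in which (t0) has zero strength.]
Fig. 6. Example weighted wait-for graphs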
As for the critical cycle algorithm, performance is constrained by dependence
cycles in the wait-for graph. We will use two bounds, one local and one global,
on the maximum increase in performance from relaxing a cycle; i.e. increasing
buﬀering on one of the streams in the cycle that gets full.
Consider the potential beneﬁt from relaxing cycle C1 = (t0 t2 t1). This can
only be done by increasing buﬀering on the stream from t0 to t2. Since t1 waits
for 27% of the time, during the ConsumerAcquire primitive in this cycle, we could
reduce the execution time of t1 by at most 27%, before the cycle disappears. Since
all tasks execute for nearly the same amount of wallclock time, any change in
throughput will cause all vertices to have their total waiting time, not just on
the edges of this cycle, reduced by the same amount. It is therefore likely that
the edge in the cycle that disappears ﬁrst is its weakest edge.
The local bound is the weight of cycle C, denoted w(C), which is the minimum
weight of its edges. If there is no cycle with non-zero weight, then utilisation is
already 100%. This is because every directed acyclic graph has a vertex with no
outgoing edge, which corresponds to a task that never has to wait.
Figure 6(c) is the motivation for the global bound. The maximum weight cycle
is the loop on t0, of weight 0.13, which we will call C2. A moment’s reﬂection,
however, shows that C2 cannot really be a bottleneck since neither t1 nor t2 ever
wait for t0, even indirectly. If we reduced the time t0 spent waiting on this loop,
it cannot make t1 or t2 go any faster. Since throughput would be unchanged, t0
must spend the same total amount of time waiting, so the waiting time would
move from ProducerAcquire to ProducerSend (see Figure 5(b)).
The global bound is the strength of the cycle, denoted s(C), which is the
lowest value of the maximum ﬂow through a single path to the cycle, starting
from any other vertex. Since there is no path at all from t1 to C2 in Figure 6,
the cycle has zero strength: s(C2) = 0. In contrast, the cycle (t1 t2) has strength
0.77, because this is the weight of the only path from the only other vertex, t0.
Increasing the performance of t1 and t2 by any means could reduce execution
time of the program as a whole by 77%. This cycle is the bottleneck, and it has
weight 0.05. The requirement that ﬂow be through a single path makes little
diﬀerence in practice, but it reduces considerably the algorithmic complexity.
It is possible for the wait-for graph to be disconnected; e.g. when tasks wait
for each other only through bus contention. This happens rarely, but it causes all
strengths to be zero. Therefore, when all strengths are zero but the utilisation
is below some threshold (currently 100%), the strengths are ignored. Since it
almost never happens, there is little reason to be more sophisticated.
We ﬁrst calculate the strength of each vertex by computing the all-pairs bot-
tleneck paths [20]. This ﬁnds, for every pair of vertices, the value of the maximum
ﬂow through a single path from the ﬁrst vertex to the second. It is solved using
a variant of Dijkstra’s algorithm, running Dijkstra for each vertex to ﬁnd the
maximum ﬂow paths into it. The strength of that vertex is given by the path
with the lowest flow. The total execution time is $O(|S||T| + |T|^2 \log |T|)$, using a
Fibonacci heap [21,22], with the terminology of Section 4.
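The strength calculation can be sketched as follows. This is an illustration rather than the implementation used in the paper: it uses Python's binary heap instead of a Fibonacci heap, it runs the search outward from each vertex and reads off the values into each target (which gives the same numbers), and the edge directions of the example graph are an assumption based on the description of Figure 6(c).

```python
import heapq


def widest_paths_from(source, out_edges):
    """Bottleneck (widest-path) value from source to every reachable vertex."""
    width = {source: float("inf")}
    heap = [(-width[source], source)]
    done = set()
    while heap:
        neg_w, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, w in out_edges.get(u, []):
            cand = min(-neg_w, w)   # a path is as wide as its narrowest edge
            if cand > width.get(v, 0.0):
                width[v] = cand
                heapq.heappush(heap, (-cand, v))
    return width


def strengths(vertices, out_edges):
    """Strength of v: the lowest widest-path value into v from any other vertex."""
    widths = {u: widest_paths_from(u, out_edges) for u in vertices}
    return {
        v: min((widths[u].get(v, 0.0) for u in vertices if u != v), default=0.0)
        for v in vertices
    }


# Assumed wait-for graph: a loop on t0, t0 waiting for t1, and a cycle (t1 t2).
graph = {
    "t0": [("t0", 0.13), ("t1", 0.77)],
    "t1": [("t2", 0.09)],
    "t2": [("t1", 0.05)],
}
print(strengths(["t0", "t1", "t2"], graph))
```

In this assumed graph t0 has zero strength because neither t1 nor t2 has any path to it, so the loop on t0 cannot be selected as the bottleneck.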
The algorithm ﬁnds a cycle with the maximum value of the minimum of the
local and global bounds. It is straightforward to show that we can take account
of both simply by replacing the weight of every edge e = (a, b) by a new weight,
w′(e) = min (w(e), s(a)). A maximum weight cycle, according to w′, can be
found in time O(|S| log |S|), where S is the set of streams. To ﬁnd out whether
there is a cycle of weight ≥ W , for some W , just check whether there is any
cycle if you ignore all edges of weight < W . This can be done in time O(|S|) by
attempting to perform a topological sort. To ﬁnd a maximum weight cycle, ﬁrst
sort the edge weights, and perturb them so that no two are exactly the same.
Then use bisection on the sorted edge weights.
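The bisection described above can be sketched as follows. The cycle test uses Kahn's topological sort; instead of perturbing equal weights, this sketch bisects over the distinct weights, which yields the same value. The function names and the assignment of the Figure 6(b) weights to particular edges are assumptions made for illustration, and the sketch returns the weight of the best cycle rather than the cycle itself.

```python
from collections import defaultdict, deque


def has_cycle(edges, threshold):
    """True if the edges of weight >= threshold contain a directed cycle."""
    out = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        if w >= threshold:
            out[u].append(v)
            indegree[v] += 1
            nodes.update((u, v))
    queue = deque(n for n in nodes if indegree[n] == 0)
    removed = 0
    while queue:
        u = queue.popleft()
        removed += 1
        for v in out[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    # If the topological sort cannot remove every vertex, a cycle remains.
    return removed < len(nodes)


def max_cycle_weight(edges):
    """Largest W such that some cycle uses only edges of weight >= W."""
    weights = sorted({w for _, _, w in edges})
    lo, hi, best = 0, len(weights) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if has_cycle(edges, weights[mid]):
            best = weights[mid]
            lo = mid + 1
        else:
            hi = mid - 1
    return best


# Triangle cycle (t0 t2 t1); the placement of the weights is hypothetical.
triangle = [("t0", "t2", 0.34), ("t2", "t1", 0.37), ("t1", "t0", 0.27)]
print(max_cycle_weight(triangle))
```

The bisection is valid because lowering the threshold only adds edges, so the existence of a cycle is monotonic in W.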
The baseline algorithm uses data that is easy to obtain, and is usually quite
eﬀective, but it has one limitation. Since each task is represented by a single
vertex, it cannot “see” what is happening inside them.
Figure 7(a) shows an example where the baseline algorithm makes a bad
decision. The maximum weight cycle is (t1 t0 t2), which has weight 0.50. Whether
or not this is a bottleneck depends on the internal behaviour of tasks t1 and t2.
The order of operations per ﬁring of task t1 is shown in subﬁgure (b). If we
also know that task t1 always waits in step 5, then reducing the waiting time in
step 1 will simply result in a longer waiting time in step 5. It can never advance
the push in step 6, so the critical cycle cannot be (t1 t0 t2).
Token Algorithm: The token algorithm addresses this problem by tracking de-
pendencies through tasks. This is somewhat similar to causal chains [23], except
[Figure 7(a): wait-for graph on tasks t0, t1, t2 and t3; edge weights and streams: 0.52 (s01), 1.00 (s02), 0.50 (s12), 0.48 (s13), 0.52 (s23).]

(b) Order of primitives in t1
Step  Primitive                  Wait time
1.    ConsumerAcquire(s01, 0)    0.52
2.    ProducerAcquire(s13, 0)    n/a
3.    ProducerAcquire(s12, 0)    n/a
4.    ConsumerDiscard(s01)       n/a
5.    ProducerSend(s13)          0.48
6.    ProducerSend(s12)          n/a

[Figure 7(c): indirect wait-for graph on tasks t0, t1, t2 and t3 and streams s02 and s13; edge weights 0.52, 0.48, 0.98, 0.48, 0.48, 0.48 and 0.47.]
Fig. 7. Example where baseline fails
that the aim is to resolve performance bottlenecks rather than artiﬁcial dead-
locks. Their algorithm ﬁxes a deadlock after it happens, when all tasks have got
stuck, but we cannot expect all tasks in a cycle to ever be waiting simultaneously.
During the simulation, or at runtime in a dynamic scheme, each task t has a
current token, St, which is the stream that most recently made t wait, directly or
indirectly, because it got full. It has a current waiting time, Wt, which measures
how much the task has already had to wait, so that only increases in waiting
times are charged to streams. It also has a waiting vector, (Vt)s, which gives the
total waiting time for each stream in the whole program. Each consumer buﬀer
c has a current token, Sc, and current waiting time, Wc, which together record
the producer’s problem at the time the block in that buﬀer was sent.
When task p blocks for time τ because output stream s is full, it sets Sp ← s
and increases both Wp and Vp[s] by τ . When task p sends a block using buﬀer c
on output stream s, it records a copy of its current state: Sc ← Sp and Wc ← Wp.
When a task q blocks for time τ because input stream s is empty, it also, after
the data arrives, reads Sc and Wc, from the consumer buﬀer c containing the end
of the data. It then updates its current token Sq ← Sc to indicate that it had to
wait, indirectly, for whichever stream the producer had to wait for, and calculates
the increase in current waiting time ΔWq ← min(τ,Wc − Wq), which can be
either positive or negative. If it is positive, then Vq [Sq] is increased by ΔWq. In
either case, the current waiting time is then updated using Wq ← Wq + ΔWq.
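The update rules above translate almost directly into code. The sketch below is illustrative only: the class and function names are hypothetical, and the bookkeeping is shown in isolation from the simulator that would call it.

```python
from collections import defaultdict


class TaskState:
    """Per-task token state: S_t, W_t and the waiting vector V_t."""
    def __init__(self):
        self.token = None                 # S_t: stream that last made t wait
        self.waited = 0.0                 # W_t: current waiting time
        self.vector = defaultdict(float)  # V_t[s]: waiting charged to stream s


class BufferTag:
    """Token state recorded in a consumer buffer when a block is sent."""
    def __init__(self, token, waited):
        self.token = token    # S_c
        self.waited = waited  # W_c


def on_output_full(p, stream, tau):
    """Task p blocked for time tau because output stream `stream` was full."""
    p.token = stream
    p.waited += tau
    p.vector[stream] += tau


def on_send(p):
    """Task p sends a block: copy its current state into the buffer."""
    return BufferTag(p.token, p.waited)


def on_input_empty(q, tag, tau):
    """Task q blocked for time tau on an empty input; the data carried `tag`."""
    q.token = tag.token
    delta = min(tau, tag.waited - q.waited)  # may be negative
    if delta > 0:
        q.vector[q.token] += delta           # charge only the increase
    q.waited += delta


p, q = TaskState(), TaskState()
on_output_full(p, "s12", 3.0)          # producer waits 3.0 on a full stream
on_input_empty(q, on_send(p), 5.0)     # consumer then waits 5.0 for that block
print(dict(q.vector), q.waited)
```

In this example the consumer waited 5.0, but only 3.0 of that is charged to s12, since that is all the waiting the producer had accumulated when it sent the block.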
The waiting vectors are used to construct an indirect wait-for graph, as shown
in Figure 7(c). If Vt[s] > 0, there is an edge from task t to stream s with weight
Vt[s]/L, where L is the total execution time of the run, in the same units. Each
stream s also produces an edge from s to its consumer q. The weight of this edge
is s(q), the strength of q, as deﬁned for the baseline algorithm.
This is eﬀectively viewing each stream as an actor in its own right, which is
always blocked waiting for the consumer to discard its data. This is the most
convenient place to take account of the strengths, which are still relevant by the
same argument as before. The token algorithm ﬁnds the maximum weight cycle
in the same way as the baseline algorithm.
Figure 8 shows a second example which clariﬁes the need for the cycle-based
algorithm outlined above. In the stream program of Figure 8(a), task t0 pushes
[Figure 8: (a) Stream graph for bichain4: tasks t0 to t6, with streams s01, s03, s04 and s06 leaving t0. (b) Indirect wait-for graph on tasks t0 to t6 and streams s03 and s06; edge weights 0.25, 0.49, 0.49, 0.48, 0.25, 0.50, 0.50 and 0.50.]
Fig. 8. Token algorithm: bichain4 example
the outputs in the cyclic order (s01 s03 s04 s06), waiting only in ProducerSend for
streams s03 and s06 due to their longer latency.
When it pushes on stream s04 of the right branch, the most recent wait was
due to stream s03 being full, so it sends the token for s03. Similarly, it sends the
token for stream s06 to stream s01 of the left branch. The indirect wait-for graph
is shown in Figure 8(b), with cycle (t3 s06 t6 s03) going through both streams.
5.2 Buﬀer Size Update Algorithms
The cycle detection algorithm returns a set of edges in the wait-for graph that
cause a bottleneck cycle by becoming full. Relaxing the cycle involves increasing
memory on one or more of these edges. The purpose of the buﬀer size update
algorithm is to determine which edges to enlarge, and by how many buﬀers.
Our simplest algorithm is miserly, meaning that it starts at the minimum
number of buﬀers, mentioned in Section 4, and each iteration increases the allo-
cation of a single buﬀer by one. The other algorithms speculatively assign spare
memory, and only take it away if it is needed elsewhere. For all these algorithms,
each stream s demands some number ds of buﬀers, as for the miserly algorithm,
and requests another rs to be granted out of unused memory, if there is any.
When there is not enough memory to grant all requests within some memory,
we used the following algorithm. The total request in bytes is $R = \sum_s r_s b_c(s)$,
where $b_c(s)$ is the size in bytes of a single consumer buffer for stream $s$. If $M$
bytes are left after granting all demands, and $R > M$, then each stream $s$ is initially
granted $r_s M / R$ extra buffers, then possibly one more, if it fits.
In our ﬁrst alternative, double, each edge requests an extra buﬀer if it is
currently allocated only one. In our second alternative, exponential, the request
is for some multiple, f − 1, of the number of buﬀers demanded. We still use a
greedy update algorithm, so that when the number of buﬀers is increased, the
edge demands, on the next iteration, one more buﬀer than it was given in total
last time. We used f = 2, so an edge will demand $2^k - 1$ buffers, and request an
equal number, for k = 1, 2, ..., until it is given fewer buffers than it wants.
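The progression of demands under the exponential scheme can be checked with a short iteration. The sketch assumes, for illustration only, that the edge starts with a demand of one buffer, that every request is granted in full, and that the edge is selected for enlargement at every iteration; the function name is hypothetical.

```python
def exponential_schedule(iterations, f=2, start=1):
    """Return the (demand, request) pair of one edge at each iteration."""
    demand = start
    history = []
    for _ in range(iterations):
        request = (f - 1) * demand
        history.append((demand, request))
        granted = demand + request   # assumption: the request is fully granted
        demand = granted + 1         # greedy update: one more than last total
    return history


print(exponential_schedule(4))
```

With f = 2 the demand follows d -> 2d + 1, which from d = 1 gives 1, 3, 7, 15, and so on.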
The third alternative, level, uses the top level, the length of the longest path
from a source node, and bottom level, the length of the longest path to a sink
node. The algorithm is the same as exponential, except that the request is the
maximum of a) f − 1 times the number of buffers demanded, b) twice the difference
in top level, and c) twice the diﬀerence in bottom level. This tries to give a high
initial allocation to streams that cross a high latency.
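The request of the level heuristic can be sketched for an acyclic stream graph as follows. Path lengths are counted here in streams (hops), which is an assumption, as are the function name, the data layout and the chain4 example.

```python
from functools import lru_cache


def level_requests(streams, demand, f=2):
    """Request per stream under the level heuristic.

    streams: list of (producer, consumer) pairs forming an acyclic graph
    demand:  {(producer, consumer): d_s}, buffers demanded by each stream
    """
    succ, pred = {}, {}
    for p, c in streams:
        succ.setdefault(p, []).append(c)
        pred.setdefault(c, []).append(p)

    @lru_cache(maxsize=None)
    def top(t):     # length of the longest path from a source to t
        return max((top(p) + 1 for p in pred.get(t, [])), default=0)

    @lru_cache(maxsize=None)
    def bottom(t):  # length of the longest path from t to a sink
        return max((bottom(c) + 1 for c in succ.get(t, [])), default=0)

    return {
        (p, c): max((f - 1) * demand[(p, c)],
                    2 * (top(c) - top(p)),
                    2 * (bottom(p) - bottom(c)))
        for p, c in streams
    }


# Hypothetical four-task chain with an extra stream from the first to the last task.
chain4 = [("t1", "t2"), ("t2", "t3"), ("t3", "t4"), ("t1", "t4")]
print(level_requests(chain4, {s: 1 for s in chain4}))
```

The stream from t1 to t4 spans three levels, so it receives a larger initial request than the streams between consecutive tasks.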
6 Evaluation
We used the StreamIt 2.1.1 benchmarks [24], random graphs, and sixteen exam-
ples, including chain8, producer-consumer, bad-baseline, and bichain4. For the
StreamIt benchmarks, we used the program graph, work estimates and commu-
nications rates generated by the StreamIt compiler, and used our algorithm [25]
to produce partitions for an IBM QS20 blade, which has two Cell BEs.
Buﬀer size update: The ﬁrst three rows of Figure 9 compare the buﬀer up-
date algorithms from Section 5.2. These plots also contain results for Basten
and Hoogerbrugge (B&H) [23] and modified StreamRoller [3], which will be
discussed in Section 7. The left column shows the utilisation as a function of the
iteration number; utilisation is proportional to throughput, as remarked at the end
of Section 2. The right column shows the tradeoff between latency and utilisation.
Any points that cannot be Pareto optimal, because they are beaten on both
utilisation and latency by some point to the top-left, have been removed.
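The filtering of dominated points can be sketched as follows. The data points are invented for illustration, and the function name is hypothetical; a point is dropped only when some other point has both strictly lower latency and strictly higher utilisation.

```python
def pareto_front(points):
    """Keep the (latency, utilisation) points not beaten on both axes."""
    front = []
    for lat, util in points:
        beaten = any(l2 < lat and u2 > util for l2, u2 in points)
        if not beaten:
            front.append((lat, util))
    return sorted(front)


# Hypothetical (latency, utilisation) results from four iterations.
results = [(0.5, 0.6), (0.6, 0.7), (0.7, 0.65), (0.8, 0.9)]
print(pareto_front(results))
```

The point (0.7, 0.65) is removed because (0.6, 0.7) has lower latency and higher utilisation.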
The ﬁrst row is for random stochastic graphs with 32 tasks and 50 streams.
The graphs are connected and acyclic, but otherwise unstructured. The com-
putation time of each task is normally distributed with a random mean and
variance (clamped above zero). Notice that B&H has poor performance and,
since it increases buﬀering where it isn’t necessary, high latency.
We found the upper bound on utilisation using an exhaustive search over all
allocations of the buﬀers on the processor, p, whose memory bound caused the
level algorithm to terminate. All other queues on other processors were set to
their maximum possible size, assuming that all other queues in the same memory
had their minimum size. Since this tends to allow a task near the beginning of
the stream graph to work ﬂat out ﬁlling downstream buﬀers, the steady state
utilisation would be known only after many ﬁrings. Instead, we took the utilisa-
tion of the task on p, and scaled by the ratio of the long-term processing times
of the most heavily loaded processor and of p.
The second row shows the StreamIt 2.1.1 benchmarks, with an unroll factor
of 100. The third row shows the stochastic StreamIt benchmarks, which have
normally-distributed computation times, and are intended to show how the al-
gorithms fare for realistic program graphs.
The left column shows that the level algorithm always provides the fastest
convergence. The modiﬁed StreamRoller algorithm is similar to the ﬁrst iteration
of the level algorithm, and B&H is considerably worse. The level heuristic initial
allocation is within 15% of the upper bound on optimal performance, and is
increased to within 3% of optimal after four iterations.
Cycle detection: We evaluate the cycle detection algorithms only, using greedy
buﬀer update without memory constraints. When task execution times and com-
munications rates are constant, and bus contention is negligible, the critical cycle
algorithm of Section 5.1 is optimal. The last row of Figure 9 shows the utilisation
[Figure 9: grid of plots. Left column: utilisation vs iteration number. Right column: utilisation-latency tradeoff (utilisation against latency). Buffer size update rows: Stochastic random G(32, 50); StreamIt on 2-Cell; Stochastic StreamIt on 2-Cell. Series in these rows: upper bound (see text; shown for the two stochastic rows), level, exponential, double, miserly, B&H, streamroller. Cycle detection row: Stochastic G(8, 12). Series in this row: token, baseline, crit. cycle.]
Fig. 9. Comparison of the buffer size update and cycle detection algorithms
108 P.M. Carpenter, A. Ramirez, and E. Ayguade´
and latency for an average of six random graphs with stochastic computation
times. The poor performance of the critical cycle algorithm (about 60% utili-
sation), is because it is unable to detect cycles that arise from execution time
variability. The baseline and token algorithms achieve similar performance, al-
though the token algorithm achieves slightly lower latency.
We also evaluated the cycle detection algorithms when there is high bus util-
isation, but for space reasons did not include the graph. The critical cycle al-
gorithm cannot model increased communication latency due to contention [26,
§E.5]. For a benchmark with a single producer task connected to two consumers,
and bus usage close to 100%, the critical cycle algorithm achieves about 70%
utilisation. The baseline and token algorithms measure waiting times directly,
and consistently achieve 100% utilisation.
7 Related Work
Basten and Hoogerbrugge (B&H) [23] is the only other work that also targets
unstructured graphs with variable multiplicities and computation times. Their
algorithm sets each FIFO buﬀer size to be proportional to the amount of data
streaming through it. This gives a relative size for each buﬀer, but it is not
motivated by the underlying problems discussed in Section 2, and has poor
performance in Figure 9. We interpreted B&H to mean double buffering on the
producer side, with all the remaining memory allocated to consumer buffers,
rounding the number of buffers up to an integer. If rounding up caused the
buffer allocation not to fit, we reduced the target memory use until it did fit.
The chain8 example in Figure 1 shows the problem with this heuristic. If all data
rates are the same and there is enough memory on t_n for ten buffers, B&H
allocates five buffers to each stream for 70% utilisation, while our
heuristic allocates eight to (t_1, t_n) and two to (t_{n-1}, t_n) for 100% utilisation.
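Our interpretation of the proportional rule on the consumer side can be
sketched as follows. This is an illustration under stated assumptions, not
Basten and Hoogerbrugge's code: the function name, the one-byte step used to
shrink the target, and the minimum of one buffer per stream are hypothetical.

```python
import math


def proportional_buffers(rates, buffer_bytes, memory_bytes):
    """Consumer buffer counts proportional to each stream's data rate.

    rates:        relative amount of data through each incoming stream
    buffer_bytes: size in bytes of one buffer on each stream
    memory_bytes: memory available for consumer buffers
    """
    total_rate = sum(rates)
    target = memory_bytes
    while target > 0:
        # Share the target memory in proportion to the data rates, rounding
        # each stream's buffer count up to an integer (at least one).
        counts = [max(1, math.ceil(target * r / total_rate / s))
                  for r, s in zip(rates, buffer_bytes)]
        used = sum(c * s for c, s in zip(counts, buffer_bytes))
        if used <= memory_bytes:
            return counts
        target -= 1  # rounding up overshot: shrink the target and retry
    raise ValueError("not even one buffer per stream fits")
```

With two streams of equal rate and room for ten unit-sized buffers,
`proportional_buffers([1, 1], [1, 1], 10)` returns `[5, 5]`, the even split
described above; the rule has no way to favour the stream that spans more
pipeline stages.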
The SDF tool [27] uses an exhaustive search to ﬁnd all Pareto-optimal buﬀer
allocations for an SDF graph. It requires exponentially many steps, and only
supports constant computation times and data rates. For an n-way split or join
where each stream needs b buffers, their algorithm requires n^b steps, while our
level algorithm requires O(n log2 b) steps to find a single solution.
StreamRoller [3] performs buﬀer allocation as part of software pipelining,
but it is restricted to graphs with ﬁxed multiplicities and computation times.
The algorithm is similar to the ﬁrst iteration of the level algorithm, in that
the number of buﬀers allocated to a stream is always one plus the diﬀerence in
pipeline stage. The chain8 example in Section 2 shows that this is conservative,
even when there is no variability. Hence the StreamRoller algorithm can require
more memory than necessary; if there is insuﬃcient memory, it fails.
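The stage-difference rule is simple enough to state as a short sketch. The
names below, the single shared memory, and the stage numbering in the usage
note are hypothetical; this is not StreamRoller's implementation, only the
rule as described above.

```python
def stage_difference_buffers(stage, streams, buffer_bytes, memory_bytes):
    """Allocate one buffer plus the pipeline stage difference per stream.

    stage:        dict mapping each task to its software pipeline stage
    streams:      list of (producer, consumer) pairs
    buffer_bytes: dict mapping each stream to the size of one buffer
    memory_bytes: memory available for buffers (one memory, for simplicity)
    """
    counts = {}
    for producer, consumer in streams:
        counts[(producer, consumer)] = 1 + stage[consumer] - stage[producer]
    used = sum(counts[s] * buffer_bytes[s] for s in streams)
    if used > memory_bytes:
        # The rule has no fallback: with too little memory it simply fails.
        raise ValueError("buffer allocation does not fit in memory")
    return counts
```

For example, if tasks t1, t7 and t8 are placed in stages 0, 6 and 7, the
streams (t1, t8) and (t7, t8) receive eight and two buffers respectively, and
the call fails as soon as fewer than ten unit-sized buffers fit.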
Due to the unrolling factor we used, StreamRoller failed on at least one
benchmark for all of the graphs in Figure 9. This is true even for the StreamIt
benchmarks, for which our algorithm achieves 100% utilisation on at least one
processor. We modiﬁed StreamRoller to use our arbitration scheme described
in Subsection 5.2, and obtained the results shown in Figure 9. Even with this
modiﬁcation, however, our iterative algorithm has about 13% higher performance
for the stochastic random graphs and stochastic StreamIt benchmarks.
The SPIR compiler [4] extends StreamRoller to ﬁnd a partition and software
pipeline subject to memory and latency constraints. Unlike our approach, com-
putation times and communication rates are constant. As for StreamRoller, the
number of buﬀers allocated to a stream is one plus the diﬀerence in pipeline
stage. Since the problem cannot be solved exactly using ILP, it is a heuristic
which uses two passes of the commercial CPLEX ILP solver. Our algorithm
could be used to improve the buﬀer allocation of a partition produced by SPIR.
8 Conclusions
In this paper, we presented a feedback-directed algorithm to allocate memory
for communications buﬀers in a statically-allocated stream program. The algo-
rithm achieves close to optimal performance, even when StreamRoller fails due
to insuﬃcient memory. It achieves signiﬁcantly higher performance and lower
latency than the previous fully general algorithm, by Basten and Hoogerbrugge.
Acknowledgements
The researchers at BSC-UPC were supported by the Spanish Ministry of Sci-
ence and Innovation (contract no. TIN2007-60625), the European Commission in
the context of the ACOTES project (contract no. IST-34869) and the HiPEAC
Network of Excellence (contract no. FP7/ICT 217068). We would also like to
acknowledge our partners in the ACOTES project for the insightful discussions
on the topics presented in this paper.
References
1. Olukotun, K., Hammond, L.: The future of microprocessors. Queue 3(7), 26–29
(2005)
2. Pham, D., Behnen, E., Bolliger, M., Hofstee, H., et al.: The design methodology and
implementation of a first-generation Cell processor: a multi-core SoC. In: Custom
Integrated Circuits Conference 2005, pp. 45–49 (2005)
3. Kudlur, M., Mahlke, S.: Orchestrating the execution of stream programs on mul-
ticore platforms. In: Proceedings of the 2008 ACM SIGPLAN conference on Pro-
gramming language design and implementation, pp. 114–124 (2008)
4. Choi, Y., Lin, Y., Chong, N., Mahlke, S., Mudge, T.: Stream Compilation for
Real-Time Embedded Multicore Systems. In: Proceedings of the 2009 International
Symposium on Code Generation and Optimization, vol. 00, pp. 210–220 (2009)
5. ACOTES (IST-034869): Advanced Compiler Technologies for Embedded Streaming,
http://www.hitech-projects.com/euprojects/ACOTES/
6. ACOTES: IST ACOTES Project Deliverable D2.2 Report on Streaming Program-
ming Model and Abstract Streaming Machine Description Final Version (2008)
7. Becchi, M., Crowley, P.: Dynamic thread assignment on heterogeneous multipro-
cessor architectures. In: Proceedings of the 3rd conference on Computing frontiers,
pp. 29–40. ACM, New York (2006)
8. Hofstee, H.P.: Power eﬃcient processor architecture and the cell processor, pp.
258–262. IEEE Computer Society, Los Alamitos (2005)
9. Parks, T.: Bounded scheduling of process networks. PhD thesis, University of Cal-
ifornia (1995)
10. Buck, J.: Scheduling dynamic dataﬂow graphs with bounded memory using the
token ﬂow model. PhD thesis, University of California (1993)
11. Geilen, M., Basten, T.: Requirements on the execution of Kahn process networks.
LNCS, pp. 319–334. Springer, Heidelberg (2003)
12. van der Wolf, P., de Kock, E., Henriksson, T., Kruijtzer, W., Essink, G.: Design
and programming of embedded multiprocessors: an interface-centric approach. In:
Proceedings of the 2nd international conference on Hardware/software codesign
and system synthesis, pp. 206–217 (2004)
13. Carpenter, P.M., Ramirez, A., Ayguade, E.: The Abstract Streaming Machine:
Compile-Time Performance Modelling of Stream Programs on Heterogeneous Mul-
tiprocessors. In: SAMOS Workshop 2009, pp. 12–23. Springer, Heidelberg (2009)
14. Ito, K., Parhi, K.: Determining the minimum iteration period of an algorithm. The
Journal of VLSI Signal Processing 11(3), 229–244 (1995)
15. Dasdan, A., Gupta, R.: Faster maximum and minimum mean cycle algorithms for
system-performance analysis. IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems 17(10), 889–899 (1998)
16. Govindarajan, R., Gao, G.: A novel framework for multi-rate scheduling in DSP ap-
plications. In: International Conference on Application-Speciﬁc Array Processors,
pp. 77–88 (1993)
17. Lee, E., Messerschmitt, D.: Synchronous data ﬂow. Proceedings of the IEEE 75(9),
1235–1245 (1987)
18. Lee, E.A.: A coupled hardware and software architecture for programmable digital
signal processors (synchronous data ﬂow). PhD thesis (1986)
19. Karp, R.: A characterization of the minimum cycle mean in a digraph. Discrete
Mathematics 23(3), 309–311 (1978)
20. Pollack, M.: The maximum capacity through a network. Operations Research, 733–
736 (1960)
21. Fredman, M., Tarjan, R.: Fibonacci heaps and their uses in improved network
optimization algorithms. Journal of the ACM (J. ACM) 34(3), 596–615 (1987)
22. Vassilevska, V., Williams, R., Yuster, R.: All-pairs bottleneck paths for general
graphs in truly sub-cubic time. In: Proceedings of the thirty-ninth annual ACM
symposium on Theory of computing, pp. 585–589. ACM, New York (2007)
23. Basten, T., Hoogerbrugge, J.: Eﬃcient execution of process networks. Communi-
cating Process Architectures (2001)
24. Gordon, M., Thies, W., Amarasinghe, S.: Exploiting coarse-grained task, data, and
pipeline parallelism in stream programs. ASPLOS, 151–162 (2006)
25. Carpenter, P.M., Ramirez, A., Ayguade, E.: Mapping Stream Programs onto Het-
erogeneous Multiprocessor Systems. In: CASES 2009, October 11-16 (2009)
26. Hennessy, J.L., Patterson, D.A.: Computer Architecture: A Quantitative Ap-
proach, 4th edn. Morgan Kaufmann, San Francisco (2007)
27. Stuijk, S., Geilen, M., Basten, T.: Exploring trade-oﬀs in buﬀer requirements and
throughput constraints for synchronous dataﬂow graphs. In: Proceedings of the
43rd annual conference on Design automation, pp. 899–904 (2006)
