A case for hardware task management support for the StarSS programming model by Meenderinck, Cor & Juurlink, Ben
Conference object, Postprint version
This version is available at http://dx.doi.org/10.14279/depositonce-5776.
Suggested Citation
Meenderinck, Cor; Juurlink, Ben: A case for hardware task management support for the StarSS
programming model. - In: 2010 13th Euromicro Conference on Digital System Design: Architectures,
Methods and Tools : DSD. - New York, NY [u.a.], IEEE, 2010. - ISBN: 978-1-4244-7839-2, pp. 347-354. -
DOI: 10.1109/DSD.2010.63. (Postprint version is cited, page numbers differ.) 
Terms of Use
© 2010 IEEE. Personal use of this material is permitted. Permission from IEEE must be
obtained for all other uses, in any current or future media, including reprinting/republishing
this material for advertising or promotional purposes, creating new collective works, for
resale or redistribution to servers or lists, or reuse of any copyrighted component of this
work in other works.
A Case for Hardware Task Management Support for the StarSS Programming
Model
Cor Meenderinck
Delft University of Technology
Delft, the Netherlands
cor@ce.et.tudelft.nl
Ben Juurlink
Technische Universität Berlin
Berlin, Germany
b.juurlink@tu-berlin.de
Abstract—StarSS is a parallel programming model that eases
the task of the programmer. He or she has to identify the tasks
that can potentially be executed in parallel and the inputs and
outputs of these tasks, while the runtime system takes care
of the difficult issues of determining inter-task dependencies,
synchronization, load balancing, scheduling to optimize data
locality, etc. Given these issues, however, the runtime system
might become a bottleneck that limits the scalability of the
system. The contribution of this paper is two-fold. First, we
analyze the scalability of the current software runtime system
for several synthetic benchmarks with different dependency
patterns and task sizes. We show that for fine-grained tasks
the system does not scale beyond five cores. Furthermore, we
identify the main scalability bottlenecks of the runtime system.
Second, we present the design of Nexus, a hardware support
system for StarSS applications, that greatly reduces the task
management overhead.
Keywords-task management; hardware support; StarSS;
parallel programming;
I. INTRODUCTION
Currently, processor performance is mainly increasing
due to a growing number of cores per chip. According to
Moore’s Law the number of transistors per die is expected
to double every two years, allowing chips with hundreds
of cores within 10 to 15 years. Successful deployment
of such many-core chips depends on the availability of
parallel applications. Creating efficient parallel applications,
though, is currently cumbersome and complicated as, besides
parallelization, it requires task scheduling, synchronization,
programming or optimizing data transfers, etc.
Currently there are several efforts to solve the programmability
issues of multicores. Among these is the task-based
StarSS programming model [1]. It requires the programmer
only to indicate functions, referred to as tasks, that are
amenable to parallel execution and to specify the input and
output operands. All the other issues, such as determining
task dependencies, scheduling, programming and issuing
data transfers, etc., are automatically handled by StarSS.
The task management performed by the StarSS runtime
system is rather laborious. This overhead affects the scalability
and performance, and potentially limits the applicability
of StarSS applications. In this paper we study the benefits
of accelerating task management with hardware support.
We show that for fine-grained tasks, the runtime overhead
greatly reduces performance and severely limits scalability.
Fine-grained task parallelization might be important for
future many-core systems as it has greater potential for
parallelism than coarse-grained task parallelization. Through
extensive analysis of the current system’s bottlenecks, we
specify the requirements of a hardware support system.
As an example of such a system, we present Nexus: a
hardware task management support system compliant with
the StarSS programming model and potentially applicable to
other programming models using dynamic task management.
Some other works have proposed hardware support for
task management. Most are limited, in contrast to our
work, to scheduling of independent tasks, leaving it up to
the programmer to deliver tasks at the appropriate time.
Carbon [2] provides hardware task queueing, enabling
low-latency task retrieval. In [3] a TriMedia-based multicore
system is proposed containing a centralized task scheduling
unit based on Carbon. For a multicore media SoC, Ou [4]
proposes a hardware interface to manage tasks on the DSP,
mainly to avoid OS intervention. Architectural support for
the Cilk programming model has been proposed in [5],
mainly focusing on cache coherency.
A few works include hardware task dependency resolution.
In [6] a look-ahead task management unit has been
proposed that reduces the task retrieval latency. It uses task
dependencies in a reversed way to predict which tasks will
become available soon. Because of its programmable nature,
its latency is rather large. The latter issue has been resolved
in [7] at the cost of generality: the proposed hardware is
specific to H.264 decoding. For Systems-on-Chip (SoCs), an
Application Specific Instruction set Processor (ASIP) has been
proposed to speed up software task management [8]. Independently,
Etsion et al. [9] have also developed hardware support for
the StarSS programming model. They observe a similarity
between input and output dependencies of tasks and instructions
and have proposed task management hardware that
works similarly to an out-of-order instruction scheduler.
This paper is organized as follows. The StarSS programming
model and the target processor platform are briefly
reviewed in Section II. The benchmarks used throughout
this work are described in Section III. In Section IV the
StarSS runtime system is analyzed. Section V compares the
performance of the current StarSS system to that of a manually
parallelized implementation and illustrates the need for
hardware task management support. The proposed Nexus
system is described in Section VI. Section VII concludes
the paper.
II. BACKGROUND
The StarSS programming model is based on simple annotations
of serial code, by adding pragmas. The main pragmas
are illustrated in Listing 1. The pragmas css start and
css finish indicate the initialization and finalization of
the StarSS runtime system. Tasks are functions that are annotated
with the css task pragma, including a specification
of the input and output operands. The programmer does
not have to specify which tasks can be executed in parallel
as the runtime system detects this, based on the input and
output operands. In addition to the pragmas illustrated in this
example, StarSS provides several synchronization pragmas.
int *A[N][N];
#pragma css task input(base[16][16]) \
                 output(this[16][16])
void foo(int *base, int *this) {
  ...
}
void main() {
  int i, j;
  ...
#pragma css start
  foo(A[0][0], A[i][j]);
  ...
#pragma css finish
  ...
}
Listing 1. Basics of the StarSS programming model.
The current implementation of StarSS uses one core with
two threads to control the application, while the others
(worker cores) execute the parallel tasks. A source-to-source
compiler processes the annotated serial code and generates
the code for the control and worker cores. The first thread
of the control core runs the main program code that adds
tasks and the corresponding part of the runtime system. The
second thread handles the communication with the worker
cores and performs the scheduling of tasks.
Tasks are added dynamically to the runtime system, which
builds the task dependency graph based on the addresses
of input and output operands. Tasks whose dependencies
are met are scheduled for execution on the worker cores.
Furthermore, the runtime system manages data transfers
between main memory and local scratchpad memories,
if applicable. It tries to minimize the execution time by
applying a few optimizations. First, it creates groups of
tasks, referred to as bundles within StarSS. Using bundles
reduces the scheduling overhead per task. In addition,
the runtime system optimizes for data locality by assigning
chains within the task graph to bundles. Within such bundles,
data produced by one task is used by the next, and thus locality
is exploited.
Throughout this paper, we use the StarSS instantiation
for the Cell Broadband Engine, which is called CellSS. We
use CellSS version 2.2 [10], which is the latest release at
the time of writing. It has an improved runtime system with
significantly lower overhead than prior releases. CellSS has
several configuration parameters that can be used to tune
the behavior of the runtime system. We manually optimized
these parameters to obtain the best performance for small
task sizes.
Our experimental platform is the Cell processor, which
contains one PowerPC Processing Element (PPE) and eight
Synergistic Processing Elements (SPEs). The PPE is a
general purpose core that runs the operating system. The
SPEs are autonomous but depend on the PPE to receive
work. The CellSS runtime system runs on two threads
of the PPE. The SPEs wait to receive a signal from the
helper thread indicating what task to perform. Once they
are done, they signal back and wait for another task.
The SPEs have scratchpad memory only and use DMA
(Direct Memory Access) commands to load data from main
memory into their local store. When using bundles, CellSS
applies double buffering, which hides the memory latency by
performing data transfers concurrently with task execution.
The measurements are performed on a Cell blade containing
two Cell processors, running at 3.2 GHz. Thus, in total 16
SPEs are available.
One of the main metrics used throughout this paper is
scalability, which is defined as the speedup of a parallel
execution on N SPEs compared to execution using one
SPE. Furthermore, scalability efficiency is defined as S/N,
where S is the scalability and N is the number of SPEs.
For example, if an application takes 12 ms using a single
SPE and 1 ms using 16 SPEs, the scalability is 12 and the
scalability efficiency is 75%.
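The two metrics can be written down directly. The following is a minimal sketch in C; the function names are chosen here for illustration and do not come from StarSS or CellSS.

```c
/* Scalability S: speedup of the run on N SPEs over the run on one SPE.
 * Both times must use the same unit (for example ms). */
double scalability(double time_one_spe, double time_n_spes) {
    return time_one_spe / time_n_spes;
}

/* Scalability efficiency S/N for N SPEs (1.0 means perfect scaling). */
double scalability_efficiency(double s, int n_spes) {
    return s / n_spes;
}
```

With 12 ms on one SPE and 1 ms on 16 SPEs, scalability(12.0, 1.0) is 12 and scalability_efficiency(12.0, 16) is 0.75.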
III. BENCHMARKS
To analyze the scalability of StarSS under different circumstances,
we used three synthetic benchmarks. We used
synthetic benchmarks rather than real applications because
they allow controlling application parameters such as the
dependency pattern, the task execution time (and, hence,
the computation to communication ratio), etc. The three
benchmarks are referred to as CD (Complex Dependencies),
SD (Simple Dependencies), and ND (No Dependencies).
The dependency pattern of the CD benchmark is similar to
the dependency pattern in H.264 decoding [11], [12].
All three benchmarks process a matrix of 1024×1024
integers in blocks of 16×16. Each block operation is one
task, adding up to a total of 4096 tasks. The task execution
time is varied from approximately 2 µs to over 2 ms.
int *A[64][64];
#pragma css task input(left[16][16], \
    topright[16][16]) inout(this[16][16])
void foo(int *left, int *topright, int *this) {
  ...
}
void main() {
  int i, j;
  initmatrix(A);
#pragma css start
  for (i = 0; i < 64; i++) {
    for (j = 0; j < 64; j++) {
      foo(A[i][j-1], A[i-1][j+1], A[i][j]);
    }
  }
#pragma css barrier
#pragma css finish
}
Listing 2. The code of the CD benchmark including StarSS annotations.
In the CD benchmark, each task depends on its left and
top-right neighbor, if they exist. Listing 2 shows the simplified
code of the CD benchmark, including the annotations. Due
to the task dependencies, at the start of the benchmark
only one task is ready for execution. During the execution,
the number of available parallel tasks gradually increases
until halfway, after which the available parallelism similarly
decreases back to one. Therefore, the number of available
parallel tasks at the start and end of the benchmark is lower
than the number of cores. Due to this ramping effect in the
CD benchmark, the maximum obtainable scalability is 14.5
using 16 cores.
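The ramping effect can be made concrete with a small model. The sketch below assumes that every task takes one time step and that a task at block (i, j) can start one step after its left neighbor (i, j-1) and its top-right neighbor (i-1, j+1), which gives an earliest start step of j + 2i. The function name and the unit-time assumption are illustrative only; the 14.5 bound quoted above is not derived by this code.

```c
#define ROWS 64
#define COLS 64
#define STEPS (COLS + 2 * (ROWS - 1))  /* steps 0 .. 189 */

/* Fills width[t] with the number of tasks whose earliest start step
 * is t and returns the largest such width. */
int cd_wavefront(int width[STEPS]) {
    int max = 0;
    for (int t = 0; t < STEPS; t++)
        width[t] = 0;
    for (int i = 0; i < ROWS; i++) {
        for (int j = 0; j < COLS; j++) {
            /* left neighbor starts at j - 1 + 2i, top-right neighbor
             * at (j + 1) + 2(i - 1); both equal t - 1 */
            int t = j + 2 * i;
            if (++width[t] > max)
                max = width[t];
        }
    }
    return max;
}
```

Under these assumptions the profile starts and ends with a single ready task and peaks at 32, and only from step 30 onward are there at least 16 tasks per step, so 16 cores cannot all be busy during the ramp-up and ramp-down phases.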
In the SD benchmark, each task depends on its left
neighbor if it exists. Thus, all 64 rows are independent and
can be processed in parallel. In the ND benchmark all tasks
are independent of each other and thus the application is
fully parallel. Therefore, the maximum obtainable scalability
for the SD and ND benchmarks is 16 when using 16 cores.
IV. SCALABILITY OF THE STARSS RUNTIME SYSTEM
The main purpose of this section is to quantify the task
management overhead of the StarSS runtime system. Besides
measuring performance and scalability, we have also generated
Paraver traces to study the runtime behavior in more
detail. First, we study the CD benchmark in detail, especially
for small task sizes, as it resembles H.264 decoding, which
has an average task size of approximately 20 µs.
A. CD Benchmark
For the CD benchmark a maximum scalability of 14.2
is obtained, as depicted in Figure 1. This near-optimal
scalability, however, is only obtained for large task sizes.
The figure also shows that in order to obtain a scalability
efficiency of 80% when using 16 SPEs, that is, a scalability of
12.8, a task size of at least approximately 100 µs is required.
Similarly, for 8 SPEs a task size of 50 µs is required. For a
task size of 19 µs the scalability is only 4.8.
Figure 1. Scalability of StarSS with the CD benchmark (scalability versus
task size in µs, for 1, 2, 4, 8, and 16 SPUs).
Figure 2 depicts a part of the Paraver trace of the CD
benchmark with a task size of 19 µs. The colored parts indicate
the execution phases, of which the legend is depicted in
Figure 3. The first and second row of the trace correspond
to the main and helper thread, respectively. The other rows
correspond to the 16 SPEs. The yellow lines connecting the
rows indicate communication. The helper thread signals to
the SPEs what task to execute. The SPEs, in turn, signal
back when the task execution is finished.
The second row of the trace shows that scheduling (yellow),
preparing (red), and submitting a bundle (green) takes
approximately 6 µs, while removing a task (magenta) takes
approximately 5.8 µs. As the total task size, including task
stage in, wait for DMA, task execution, task stage out, and
task finished notification, is 22.5 µs, the helper thread simply
cannot keep up with the SPEs. Before the helper thread
has launched the sixth task, the first one has already finished.
However, its ready signal (yellow line) is processed much
later, and even more time passes before this SPE receives
a new task.
Figure 2. Partial Paraver trace of the CD benchmark with 16 SPEs, a task size of 19 µs, and the (1,1,1) configuration.
Figure 3. Legend of the Paraver traces.
The stress on the helper thread can be reduced by allowing
the runtime system to create bundles. Our measurements
show that bundles of four tasks provide the optimal performance
for this task size. Grouping 4 tasks in a bundle
reduces the scheduling time by 45%, the preparation and
submission of tasks by 73%, and the removal of tasks by
50%. In total, the overhead per task is reduced from 11.6 µs
to 5.5 µs, resulting in an overall performance improvement of
20%. The scalability, however, is not affected as the helper
thread remains a bottleneck. To efficiently leverage 16 SPEs,
an overhead per task of at most 22.5/16 = 1.4 µs is required.
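The 1.4 µs figure follows from a throughput argument: a helper thread that serves tasks one after another has to dispatch N tasks in the time a single SPE spends on one task. A minimal sketch of this check in C, with illustrative function names that are not part of StarSS:

```c
#include <stdbool.h>

/* Largest per-task overhead (µs) a serial task manager can afford
 * while keeping n_workers busy with tasks of length task_us. */
double overhead_budget_us(double task_us, int n_workers) {
    return task_us / n_workers;
}

/* True if a manager with the given per-task overhead can keep up. */
bool can_keep_up(double overhead_us, double task_us, int n_workers) {
    return overhead_us <= overhead_budget_us(task_us, n_workers);
}
```

For the numbers above, can_keep_up(11.6, 22.5, 16) and can_keep_up(5.5, 22.5, 16) are both false, which matches the observation that bundling improves performance but does not remove the bottleneck.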
The main reason for the bottleneck in the main thread
is very subtle. Although the benchmark exhibits sufficient
parallelism, due to the order in which the main thread
spawns tasks (from left to right and top to bottom), at
least 16 “rows of tasks” (corresponding to one quarter of the
matrix) have to be added to the dependency graph before 16
independent tasks are available that can keep all cores busy.
Thus, to exploit the parallelism available in the benchmark,
the addition of tasks in the main thread must run
ahead of the parallel execution. The trace reveals that the
main thread can only just keep up with the parallel execution,
and thus it limits the scalability of the system.
B. SD Benchmark
In the SD benchmark, tasks corresponding to different
rows are completely independent of each other, because
tasks only depend on their left neighbor. The difference
in dependency pattern has several effects on the runtime
system. First, this greatly simplifies the task dependency
graph, which in turn might lead to decreased runtime overhead.
Second, the dependency pattern enables the runtime
system to create efficient bundles, further reducing the
overhead. Our measurements, however, show only a small
improvement in scalability compared to the CD benchmark.
The scalability of the SD benchmark is slightly better than
that of the CD benchmark. As Figure 4 depicts, a maximum
scalability of 14.5 is obtained for large task sizes. For a task
size of 19 µs a scalability of 5.6 is obtained. Analysis of the
Paraver trace shows that the main thread is the bottleneck. It
is continuously busy with user code (26.6%), adding tasks
(50.9%), and removing tasks (20.6%). On average per task
these phases take 2.0 µs, 3.8 µs, and 1.5 µs, respectively. This
indicates that the time taken by building and maintaining
the task dependency graph is not much affected by the
dependency pattern, contrary to expectations.
Figure 4. Scalability of StarSS with the SD benchmark (scalability versus
task size in µs, for 1, 2, 4, 8, and 16 SPUs).
The stress on the helper thread, on the other hand, is significantly
lower compared to the CD benchmark. The reason
for this is that we allowed the creation of larger bundles. On
average, the time spent on scheduling, preparation of the
bundles, and submission of the bundles is 1.86 µs, 0.10 µs,
and 0.20 µs, respectively.
C. ND Benchmark
The scalability of the ND benchmark is depicted in
Figure 5. The maximum obtained scalability is 15.8, which
is significantly higher than with the two other benchmarks.
This is because all tasks are independent. Thus, first, there
is no ramping effect and second, as soon as 16 tasks have
been added, the runtime system detects 16 independent tasks
to execute.
Figure 5. Scalability of StarSS with the ND benchmark (scalability versus
task size in µs, for 1, 2, 4, 8, and 16 SPUs).
The improvement for smaller task sizes is also due to the
increased available parallelism. The time spent on building
and maintaining the task graph, however, is approximately
equal to the time spent on these for the SD benchmark,
limiting scalability. We conclude that it is not the dependencies
themselves that are time-consuming, but the process of
checking for dependencies.
V. A CASE FOR HARDWARE SUPPORT
In the previous section we analyzed the StarSS system
and showed that the current runtime system has a low
scalability for fine-grained tasks. To show the potential
gains of accelerating the runtime system, in this section we
compare StarSS to manually parallelized implementations
of the benchmarks using a dynamic task pool. Based on
the analysis presented in this and the previous section, we
identify the bottlenecks in the StarSS runtime system. Those
lead to a set of requirements that hardware support for task
management should comply with.
A. Comparison with Manually Parallelized Benchmarks
The benchmarks are also manually parallelized. A dynamic
task pool is used to determine which tasks are
ready for execution. As the dependencies are predetermined
and hard coded, however, there is no dynamic dependency
detection, in contrast to the runtime system of StarSS.
Moreover, a simple Last-In-First-Out (LIFO) buffer is used
to minimize the scheduling latency, whereas StarSS tries to
apply several scheduling optimizations. Therefore, the comparison
of the StarSS system with the manually
parallelized (Manual) system shows the cost of dynamic
dependency detection and the scheduling optimizations.
The task pool implementation was taken from [13]. It is
optimized for the Cell processor. The authors have investigated
several synchronization primitives. The direct mapped
mailboxes proved to provide the fastest communication
between the PPE and the SPEs for up to 16 SPEs, although
this approach has scalability issues.
The scalability of the CD benchmark with the manually
coded task pool is shown in Figure 6. For a task size of 19 µs,
the scalability using 16 SPEs is almost 12. This is much
higher than the scalability of 4.8 obtained with StarSS. Similarly,
the scalability of the other two benchmarks increased
significantly.
Figure 6. Scalability of the manually parallelized CD benchmark (scalability
versus task size in µs, for 1, 2, 4, 8, and 16 SPUs).
Figure 7 compares the scalability of the StarSS system
to that of the manually parallelized system. It displays the
iso-efficiency lines, which represent the minimal task size
required for a certain system in order to obtain a specific
scalability efficiency (80% in this graph). For the Manual
system and a small number of SPEs, an efficiency of 80%
is always obtained, irrespective of the task size. Therefore, the
corresponding iso-efficiency points do not exist and are not
displayed in the graph.
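An iso-efficiency point can be read off a measured scalability curve by scanning for the first task size that reaches the target efficiency. The sketch below assumes the measurements are available as (task size, scalability) pairs sorted by increasing task size; the type and function names are hypothetical and the code is not taken from the measurement scripts behind Figure 7.

```c
#include <stddef.h>

typedef struct {
    double task_us;      /* task size in µs */
    double scalability;  /* measured speedup over one SPE */
} sample;

/* Returns the smallest measured task size whose scalability efficiency
 * S/N reaches `target`, or -1.0 if no sample reaches it.
 * `s` must be sorted by increasing task size. */
double iso_efficiency_point(const sample *s, size_t n,
                            int n_spes, double target) {
    for (size_t i = 0; i < n; i++) {
        if (s[i].scalability / n_spes >= target)
            return s[i].task_us;
    }
    return -1.0;
}
```

If already the smallest measured task size reaches the target, the function returns that size; this is the case in which the paper treats the iso-efficiency point as non-existent, so a caller would have to handle it separately.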
Figure 7. The iso-efficiency lines of the StarSS and Manual systems (80%
efficiency): task size in µs versus number of SPUs for the CD, SD, and ND
benchmarks.
From the iso-efficiency graph it is clear that the Manual
system achieves an efficiency of 80% for much smaller task
sizes than StarSS does. For eight SPEs the difference in
task size is between a factor of 5 and 6, while for 16 SPEs
the difference is between a factor of 2 and 3. The main cause
for this difference is the dependency resolution process of
the StarSS runtime system. The iso-efficiency lines of both
systems increase with the number of SPEs. Going from 8 to
16 SPEs, for both systems the lines increase because of off-chip
communication. In general, however, there are different
reasons. In the StarSS system the throughput of the two
PPE threads is the limiting factor. Therefore, with an increasing
number of SPEs longer tasks are required to hide the task
management overhead on the PPE.
In the Manual system, the iso-efficiency line
increases with the number of SPEs mainly due to the limited
scalability of the synchronization mechanism. When SPEs
have finished the execution of a task they send a mailbox
signal to the PPE. The helper thread checks the mailboxes in
a round-robin fashion. When it finds a signal, it processes the
contents of the signal and sends a mailbox message back
indicating a new task. The time between the SPE sending a
finish signal and receiving a new task depends on the phase
of the round-robin polling. In the worst case the SPE sends a
signal just after the helper thread has checked that SPE's mailbox.
Moreover, this synchronization latency increases with the
number of SPEs.
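Why the polling latency grows with the number of SPEs can be seen from a small model of the round-robin order. The sketch below counts how many mailboxes the helper thread checks before it reaches a given SPE; the function name is illustrative, and the model ignores the time spent serving other SPEs on the way, which would only add to the latency.

```c
#include <assert.h>

/* Number of mailbox checks until the helper thread reaches `sender`,
 * given that the next mailbox it will check is `next`.
 * Both indices are in the range 0 .. n_spes - 1. */
int checks_until_served(int n_spes, int next, int sender) {
    assert(n_spes > 0);
    assert(next >= 0 && next < n_spes);
    assert(sender >= 0 && sender < n_spes);
    return (sender - next + n_spes) % n_spes + 1;
}
```

In the worst case described above, the helper has just passed the sender's mailbox, so next is one position after sender and the result equals the number of SPEs: 8 checks for 8 SPEs and 16 checks for 16 SPEs.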
Besides a lower scalability, the absolute performance
of StarSS is also lower, by up to a factor of four, than that of the
Manual system for fine-grained tasks. Table I lists the
execution time of the CD benchmark. For larger task sizes,
the difference diminishes as the task management overhead
becomes negligible compared to the task execution time. A
similar difference between the StarSS and Manual system is
observed in the ND and SD benchmarks.
B. Requirements for Hardware Support
From the prior analysis, we conclude that the StarSS
runtime system currently does not scale well for fine-grained
task parallelism. The main bottleneck is the dependency
resolution process. Irrespective of the actual dependency
pattern, determining whether there are task dependencies is a
laborious process that cannot keep up with the parallel task
execution. In order to efficiently utilize 8 or 16 SPEs for
H.264 decoding, roughly corresponding to the CD benchmark
with a task size of 20 µs, the process of building and
maintaining the task dependency graph should be reduced
from 9.1 µs to 2.5 µs and 1.3 µs, respectively. Therefore, we
describe the first requirement for task management hardware
support as follows:
Requirement 1: The hardware support system should accelerate
the task dependency resolution process by at least
a factor of 3.6 and 7.3, when using 8 and 16 SPEs,
respectively.
Preferably this process is accelerated even more, such
that it can run ahead of execution in order to reveal more
parallelism. As such a large speedup is required, we expect
that software optimizations (if possible) or some hardware
acceleration such as special instructions are not sufficient.
Instead, we propose to perform this process in hardware.
The second bottleneck is the scheduling, preparation, and
submission of tasks, performed by the helper thread in
StarSS. In total this overhead per task is 5.5 µs, whereas a
latency of at most 2.8 µs and 1.4 µs is required for 8 and 16
SPEs, respectively.
Requirement 2: The hardware support system should accelerate
scheduling, preparation, and submission of tasks by
at least a factor of 2.0 and 3.9, when using 8 and 16 SPEs,
respectively.
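The factors in Requirements 1 and 2 are the ratio of the measured per-task overhead to the per-task budget T/N. As a worked check, assuming T = 20 µs for the dependency resolution budget and T = 22.5 µs (the total task size seen in the trace) for the helper thread budget, which reproduces the budgets quoted in the text:

```latex
\begin{align*}
\text{Requirement 1:}\quad
  & \frac{9.1}{20/8} = \frac{9.1}{2.5} \approx 3.6,
  & \frac{9.1}{20/16} = \frac{9.1}{1.25} \approx 7.3, \\
\text{Requirement 2:}\quad
  & \frac{5.5}{22.5/8} \approx \frac{5.5}{2.8} \approx 2.0,
  & \frac{5.5}{22.5/16} \approx \frac{5.5}{1.4} \approx 3.9.
\end{align*}
```

The factor of 7.3 appears to be based on the unrounded budget of 1.25 µs; with the rounded value of 1.3 µs the ratio would be 7.0.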
The scheduler tries to minimize the overhead by creating
bundles of tasks, and tries to exploit data locality by
grouping a dependency chain within one bundle. For small
tasks, however, the scheduling costs more than the benefits
it provides. A rudimentary, yet fast, scheduling approach
might therefore be more efficient than a sophisticated one.
Although the numbers mentioned in these first two requirements
are specific to the Cell processor, the requirements
are in general applicable to other multicore platforms
as well. The requirements stem from the fact that the task
management overhead is relatively large compared to the
task execution. As both are performed in software, changing
the platform is not expected to affect their relative size
significantly.
The latency of synchronization on the Cell processor
is too long and scales poorly due to the round-robin
polling mechanism. Interrupts are more scalable than the
polling mechanism, but still need to execute a software
routine. Rather, SPEs should be able to retrieve a new
task themselves instead of waiting for one to be assigned.
Hardware task queues, such as Carbon [2], are an example
of such an approach. Each core can read a task from the
queue autonomously, avoiding off-chip communication and
execution of a software routine. Assuming a 5% overhead for
synchronization is acceptable, the synchronization latency
should be at most 1 µs, irrespective of the number of cores.
Requirement 3: The hardware support system should provide
synchronization primitives for retrieving tasks, with a
latency of at most 1 µs, irrespective of the number of cores.
The round-robin synchronization mechanism is specific
to the Cell processor. Nevertheless, the need for low-latency and
scalable synchronization is platform independent. In [2] it
was shown that synchronization through a hardware queue
is beneficial for a wide range of platforms. Therefore,
Requirement 3 is also generally applicable.
When compliant with these three requirements, a hardware
support system will effectively improve the scalability
of StarSS, while maintaining its ease of programming. In
the next section we propose the Nexus system, which is
developed based on this set of requirements.
Table I
ABSOLUTE EXECUTION TIME, IN ms, OF THE CD BENCHMARK FOR THE STARSS AND MANUAL SYSTEMS, USING 8 SPES.
System   Task size
         1.92 µs   11.2 µs   19.1 µs   66.7 µs   241 µs   636 µs   2373 µs
StarSS   23        23        23        41        132      340      1250
Manual   6.0       10        13        39        130      337      1250
VI. NEXUS: A HARDWARE TASK MANAGEMENT
SUPPORT SYSTEM
In this section we present the design of the Nexus system,
which provides the required hardware support for task
management in order to efficiently exploit fine-grained task
parallelism with StarSS. Nexus can be incorporated in any
multicore architecture. In this section, as an example, we
present a Nexus design for the Cell processor.
A. Nexus System Overview
The Nexus system contains two types of hardware units
(see Figure 8). The first and main unit is the Task Pool Unit
(TPU). It receives task descriptors from the PPE, which
contain the metadata of the tasks, such as the function
to perform and the location of the operands. It resolves
dependencies, enqueues ready tasks to a memory-mapped
hardware queue, and updates the internal task pool for every
task that finishes. The TPU is designed for high throughput
and therefore it is pipelined. It is directly connected to the
bus to allow fast access from any core. The TPU is a generic
unit and can be used in any multicore platform.
The second unit in the system is the Task Controller
(TC), which can be placed at individual cores. The TC
fetches tasks from the TPU, issues the DMAs of input and
output operands, and enables double buffering of tasks. TCs
are optional and platform specific. In future work we will
investigate for the Cell processor whether the benefits of the
TC are enough to be worth its costs.
Figure 8. Overview of the Nexus system (in bold) implemented in the Cell
processor. The depicted architecture could be one tile of a larger system.
(Diagram labels: PPE, MEM, TPU, and SPEs with SPU, LS, MFC, and TC,
attached to the Element Interconnect Bus.)
Storing all tasks in a central unit provides low-latency
dependency resolution and scheduling. The hardware queues
ensure fast synchronization. Such a centralized approach,
however, eventually creates a scalability bottleneck for increasing
core counts. In such a case, a clustered approach can
be used. Tasks are divided among TPUs such that inter-TPU
communication is low, while task stealing allows system-wide
load balancing.
B. Design of the Task Pool Unit
The block diagram of the Nexus TPU is depicted in
Figure 9. Its main features are the following. Dependency
resolution consists of table lookups only, and therefore has
low latency. The TPU is pipelined to increase throughput.
All task descriptors are stored in the task storage to avoid
off-chip communication.
Figure 9. Block diagram of the Nexus TPU. The ports are memory mapped
and can be accessed anywhere from the system. (Diagram blocks: in buffer,
descriptor loader, descriptor handler, task storage, task table, producers
table, consumers table, ready queue, finish buffer, finish handler, and
status register.)
The life cycle of tasks starts at the PPE, which prepares the
task descriptor and writes the pointer to it in the in buffer
of the TPU. The descriptor loader reads task descriptor
pointers from the in buffer and loads the descriptor into the
task storage, where it remains until the task is completely
finished.
Once the descriptor is loaded into the task storage, the
descriptor handler processes the descriptor and fills the three
tables with the required information. Together, these three tables
resolve the dependencies among tasks, as described
below. This process can be performed fast, as it consists of
simple lookups only. There is no need to search through the
tables to find the correct item.
The task table contains all tasks in the system and records,
for each task, its status and the number of tasks it depends
on. Tasks with a dependency count of zero are ready for
execution and are added to the ready queue. The producers
table contains the addresses of data that is going to be
produced by a pending task. Any task that requires that data
can subscribe itself to the kick-off list of that entry. Thus,
this table is used to prevent read-after-write hazards.
Similarly, the consumers table contains the addresses of data
that is going to be read by pending tasks. Any new task that
will write to these locations can subscribe itself to the
kick-off list of that entry. This table prevents
write-after-read hazards. The lookups in the producers and
consumers tables are address based. As the tables are much
smaller than the address space, a hashing function is used
to generate the index. Write-after-write hazards are handled
by the insertion of a special marker in the kick-off lists.
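The address-based lookup can be illustrated with a small software model. The Python sketch below assumes a direct-mapped table in which a colliding address is simply reported to the caller; the text does not state how Nexus resolves collisions. The names AddressTable, TableConflict, subscribe, and remove, as well as the shift-and-modulo hash, are hypothetical.

```python
class TableConflict(Exception):
    """Raised when two different addresses map to the same slot (assumed policy)."""


class AddressTable:
    """Fixed-size table indexed by a hash of the data address."""

    def __init__(self, num_entries=8, offset_bits=4):
        # Each slot is either None or a pair (address, kick-off list).
        self.slots = [None] * num_entries
        self.offset_bits = offset_bits

    def index(self, address):
        # Hypothetical hash: drop the low-order bits, then wrap to the table size.
        return (address >> self.offset_bits) % len(self.slots)

    def lookup(self, address):
        """Return the kick-off list for address, or None if there is no entry."""
        slot = self.slots[self.index(address)]
        if slot is not None and slot[0] == address:
            return slot[1]
        return None

    def subscribe(self, address, task_id):
        """Append task_id to the kick-off list of address, creating the entry if needed."""
        i = self.index(address)
        if self.slots[i] is None:
            self.slots[i] = (address, [])
        elif self.slots[i][0] != address:
            raise TableConflict(hex(address))
        self.slots[i][1].append(task_id)

    def remove(self, address):
        """Free the entry for address and return its kick-off list."""
        i = self.index(address)
        slot = self.slots[i]
        if slot is None or slot[0] != address:
            return []
        self.slots[i] = None
        return slot[1]
```

Every operation touches exactly one slot, which is the property the text relies on: the cost of a lookup does not grow with the number of pending tasks.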
The three tables, the two buffers, and the ready queue
all have a fixed size. Thus, they can become full, in which
case the pipeline stalls. For example, if the task table is
full, the descriptor handler and the descriptor loader stall.
If no entry of the task table is deleted, the in buffer will
quickly fill up too, which stalls the process of adding tasks
by the PPE. Deadlock cannot occur, because tasks are added in
serial execution order, so every task in the table depends
only on tasks that were added before it.
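The way a full task table propagates back to the PPE can be sketched in a few lines of Python. This is only an illustrative model under assumed capacities; BoundedFifo, step_loader, and the string used as a table entry are hypothetical names, not part of Nexus.

```python
from collections import deque


class BoundedFifo:
    """Fixed-capacity FIFO; a rejected push models a stall of the writer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def try_push(self, item):
        if len(self.items) >= self.capacity:
            return False  # buffer full: the writer has to retry later
        self.items.append(item)
        return True

    def pop(self):
        return self.items.popleft()


def step_loader(in_buffer, task_table, table_capacity):
    """Move one descriptor pointer into the task table; return False on a stall."""
    if not in_buffer.items or len(task_table) >= table_capacity:
        return False
    pointer = in_buffer.pop()
    task_table[pointer] = "pending"
    return True
```

With an in buffer of two entries and a task table of one entry, the third pointer written after the table fills up is rejected, and it is accepted again only once a finished task has been deleted from the table and the loader has drained one entry.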
The SPEs obtain tasks by reading from the ready queue.
This operation can be performed either by the SPE's Task
Controller (TC) or by the software running on the SPU. The
task descriptor is loaded from the task storage, after which
the task operands are loaded into the local store using DMA
commands. When execution has finished and the task output
has been written back to main memory, the task id is written
to the finish buffer. Optionally, double buffering can be
applied by loading the input operands of the next task while
executing the current task.
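The order of operations on an SPE with double buffering can be sketched as follows. This Python model is sequential, so the load of the next task is merely issued before the current task executes rather than truly overlapping with it. The function run_spe and the descriptor fields "function" and "operands" are hypothetical.

```python
from collections import deque


def run_spe(ready_queue, task_storage, finish_buffer):
    """Drain the ready queue, loading the next task before running the current one."""
    trace = []

    def load(task_id):
        # Stand-in for reading the descriptor and issuing the DMA transfers.
        trace.append(("load", task_id))
        return task_storage[task_id]

    current = ready_queue.popleft() if ready_queue else None
    descriptor = load(current) if current is not None else None
    while current is not None:
        # Prefetch: claim the next ready task and start loading its operands.
        upcoming = ready_queue.popleft() if ready_queue else None
        upcoming_descriptor = load(upcoming) if upcoming is not None else None
        trace.append(("execute", current))
        descriptor["function"](*descriptor["operands"])
        finish_buffer.append(current)  # report completion by task id
        current, descriptor = upcoming, upcoming_descriptor
    return trace
```

For three ready tasks the trace is load 1, load 2, execute 1, load 3, execute 2, execute 3, so every load after the first is issued ahead of the execution it would otherwise delay.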
The finish handler processes the task ids from the finish
buffer. It updates the tables and adds tasks whose dependencies
are met to the ready queue. Finally, the finished task is
removed from all tables.
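The interplay between task submission, dependency counting, and the finish handler can be modelled in software. The Python sketch below is a simplification and not the Nexus table layout: instead of separate producers and consumers tables with markers, it keeps one ordered queue of access phases per address (a single writer or a group of readers). Readers therefore wait for earlier writers, and writers wait for earlier readers and writers. All names are hypothetical.

```python
from collections import deque


class DependencyTracker:
    def __init__(self):
        self.deps = {}        # task id -> number of unresolved dependencies
        self.accesses = {}    # task id -> set of addresses the task touches
        self.phases = {}      # address -> deque of [kind, set of pending task ids]
        self.ready = deque()  # ready queue of task ids

    def submit(self, task_id, inputs=(), outputs=()):
        """Register a task; tasks must be submitted in serial execution order."""
        outs = set(outputs)
        addresses = set(inputs) | outs
        self.accesses[task_id] = addresses
        self.deps[task_id] = 0
        for address in addresses:
            queue = self.phases.setdefault(address, deque())
            kind = "W" if address in outs else "R"
            if kind == "R" and queue and queue[-1][0] == "R":
                queue[-1][1].add(task_id)        # join the newest group of readers
            else:
                queue.append([kind, {task_id}])  # open a new phase
            if task_id not in queue[0][1]:
                self.deps[task_id] += 1          # must wait for the phases ahead
        if self.deps[task_id] == 0:
            self.ready.append(task_id)

    def finish(self, task_id):
        """Model of the finish handler: update counts and kick off waiting tasks."""
        for address in self.accesses.pop(task_id):
            queue = self.phases[address]
            queue[0][1].discard(task_id)
            if queue[0][1]:
                continue                         # other readers of this phase remain
            queue.popleft()
            if not queue:
                del self.phases[address]
                continue
            for waiter in sorted(queue[0][1]):   # kick off the next phase
                self.deps[waiter] -= 1
                if self.deps[waiter] == 0:
                    self.ready.append(waiter)
        del self.deps[task_id]
```

As a usage example, submitting task 1 (writes A), tasks 2 and 3 (read A), task 4 (writes A), and task 5 (reads A) leaves only task 1 in the ready queue. Finishing task 1 releases tasks 2 and 3 together, task 4 becomes ready only after both have finished, and task 5 follows task 4.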
VII. CONCLUSIONS & FUTURE WORK
The task-based StarSS programming model is a promising
approach to simplify parallel programming. In this paper
we evaluated its runtime system. It was shown that for
fine-grained task parallelism, the task management overhead
prohibits scaling beyond five cores. More specifically, in
order to efficiently utilize 8 or 16 cores for task-based
parallel H.264 decoding, the runtime system should be
accelerated by at least a factor of 3.6 and 7.3, respectively.
In particular, the task dependency resolution process is too
laborious to perform in software. Other bottlenecks were
found in the scheduling of tasks and the lack of fast
synchronization primitives. We proposed the Nexus hardware
task management system. It performs dependency resolution
in hardware using simple table lookups. Hardware queues
are used for fast scheduling and synchronization. We are
currently implementing Nexus in a simulation environment
to evaluate its performance and scalability.
ACKNOWLEDGMENT
This work was supported by the European Commission in
the context of the SARC integrated project #27648 (FP6).
REFERENCES
[1] J. Planas, R. Badia, E. Ayguadé, and J. Labarta, “Hierarchical
Task-Based Programming With StarSs,” Int. Journal of High
Performance Computing Applications, vol. 23, no. 3, 2009.
[2] S. Kumar, C. Hughes, and A. Nguyen, “Carbon: Architectural
Support for Fine-Grained Parallelism on Chip Multiprocessors,”
in Proc. Int. Conf. on Computer Architecture, 2007.
[3] J. Hoogerbrugge and A. Terechko, “A Multithreaded Multicore
System for Embedded Media Processing,” Transactions
on High-Performance Embedded Architectures and Compilers,
vol. 3, no. 2, 2008.
[4] S. Ou, T. Lin, X. Deng, Z. Zhuo, and C. Liu, “Multithreaded
Coprocessor Interface for Multi-Core Multimedia SoC,” in
Proc. Design Automation Conference, 2008.
[5] G. Long, D. Fan, and J. Zhang, “Architectural Support for
Cilk Computations on Many-Core Architectures,” ACM SIGPLAN
Notices, vol. 44, no. 4, 2009.
[6] M. Själander, A. Terechko, and M. Duranton, “A Look-Ahead
Task Management Unit for Embedded Multi-Core Architectures,”
in Proc. Conf. on Digital System Design Architectures,
Methods and Tools, 2008.
[7] G. Al-Kadi and A. Terechko, “A Hardware Task Scheduler
for Embedded Video Processing,” in Proc. High Performance
Embedded Architectures and Compilers Conference, 2009.
[8] J. Castrillon et al., “Task Management in MPSoCs: an ASIP
Approach,” in Proc. Int. Conf. on Computer-Aided Design,
2009.
[9] Y. Etsion, A. Ramirez, R. Badia, and J. Labarta, “Cores
as Functional Units: A Task-Based, Out-of-Order, Dataflow
Pipeline,” in Proc. Int. Summer School on Advanced Computer
Architecture and Compilation for Embedded Systems.
[10] “Cell Superscalar,” http://www.bsc.es/plantillaG.php?catid=179.
[11] E. van der Tol, E. Jaspers, and R. Gelderblom, “Mapping
of H.264 Decoding on a Multiprocessor Architecture,” in
Proc. SPIE Conf. on Image and Video Communications and
Processing, 2003.
[12] C. Meenderinck, A. Azevedo, B. Juurlink, M. Alvarez, and
A. Ramirez, “Parallel Scalability of Video Decoders,” Journal
of Signal Processing Systems, 2008.
[13] C. Chi, B. Juurlink, and C. Meenderinck, “Evaluation of
Parallel H.264 Decoding Strategies for the Cell Broadband
Engine,” in Proc. Int. Conf. on Supercomputing (ICS), 2010.
