Hardware-based task dependency resolution for the StarSs programming model by Dallou, Tamer & Juurlink, Ben
Tamer Dallou, Ben Juurlink
Hardware-based task dependency
resolution for the StarSs programming
model
Conference object, Postprint version
This version is available at http://dx.doi.org/10.14279/depositonce-5781.
Suggested Citation
Dallou, Tamer; Juurlink, Ben: Hardware-based task dependency resolution for the StarSs programming
model. - In: 2012 41st International Conference on Parallel Processing Workshops : ICPPW. - New York,
NY [u.a.] : IEEE, 2012. - ISBN: 978-1-4673-2509-7. - pp. 367-374. - DOI: 10.1109/ICPPW.2012.53.
(Postprint version is cited, page numbers differ.) 
Terms of Use
© 2012 IEEE. Personal use of this material is permitted. Permission from IEEE must be
obtained for all other uses, in any current or future media, including reprinting/republishing
this material for advertising or promotional purposes, creating new collective works, for
resale or redistribution to servers or lists, or reuse of any copyrighted component of this
work in other works.
Hardware-Based Task Dependency Resolution for
the StarSs Programming Model
Tamer Dallou and Ben Juurlink
Embedded Systems Architecture Group
Technische Universität Berlin
Einsteinufer 17, 10587 Berlin, Germany
{dalou, b.juurlink}@tu-berlin.de
Abstract—Recently, several programming models have been
proposed that try to ease parallel programming. One of these
programming models is StarSs. In StarSs, the programmer has
to identify pieces of code that can be executed as tasks, as well as
their inputs and outputs. Thereafter, the runtime system (RTS)
determines the dependencies between tasks and schedules ready
tasks onto worker cores. Previous work has shown, however,
that the StarSs RTS may constitute a bottleneck that limits
the scalability of the system and proposed a hardware task
management system called Nexus to eliminate this bottleneck.
Nexus has several limitations, however. For example, the number
of inputs and outputs of each task is limited to a fixed constant
and Nexus does not support double buffering. In this paper
we present Nexus++, which addresses these as well as other
limitations. Experimental results show that double buffering
achieves a speedup of 54×/143× with/without modeling memory
contention, respectively, and that Nexus++ significantly enhances
the scalability of applications parallelized using StarSs.
I. INTRODUCTION
Due to the advent of multicore architectures, several parallel programming models have been proposed that aim at easing parallel programming. Examples include Google's MapReduce [4], Intel's TBB [14], and StarSs [12]. StarSs, like OpenMP [3], enables the programmer to express parallelism by adding pragmas to the code. These pragmas identify pieces of code that can be executed as tasks, as well as their inputs and outputs. Based on the inputs and outputs, the RTS can determine the dependencies between tasks and schedule ready tasks onto the cores that execute them. The programmer, therefore, does not have to explicitly express dependencies between tasks and the corresponding synchronization. Furthermore, the RTS can also transparently optimize data reuse between tasks and coarsen tasks, thereby relieving the programmer of these burdens.
Previous work [10] has shown, however, that the StarSs RTS, when implemented in software, can be a bottleneck that limits the scalability of applications parallelized using StarSs. Roughly speaking, the RTS cannot compute task dependencies and attend to finished tasks fast enough to keep all worker cores that execute the tasks busy. The same work therefore proposed a hardware task management system called Nexus to accelerate the RTS. In Nexus, task dependencies are computed using hardware hash tables and a scalable synchronization mechanism with the worker cores is provided. Results show that Nexus improves the scalability of a synthetic application modeled after H.264 decoding by a factor of 4.3 when using 16 worker cores.
Even though Nexus improves the scalability significantly, it has several limitations. For example, since the hash table entries have a fixed size, the number of inputs and outputs of each task is limited (up to 5 in [10], [9]). Similarly, the number of tasks that can depend on a certain data segment is limited. This limits the applicability of Nexus, i.e., not all StarSs applications can be executed on a multicore system with Nexus. Another limitation is that Nexus does not support double buffering, which allows executing one task while fetching the input data of another task. In [10] double buffering was not needed because the data transfer time was negligible.
In this paper we present Nexus++, which addresses these as well as other limitations. The main contributions of Nexus++ are as follows. First, it removes the constraint on the maximum number of inputs/outputs a task can have by introducing dummy tasks. It also removes the constraint on the dependency count of a certain data segment by adding dummy entries to the list of tasks that depend on this data segment. Second, it supports double (in fact arbitrary) buffering by providing a Task Controller at each worker core that buffers tasks before they are executed. Third, it implements task dependency resolution more efficiently, since fewer resources and computations are needed. Fourth, the Nexus++ implementation is platform-independent, since its parameters are fully configurable, while Nexus was integrated in a simulator of the Cell processor.
A SystemC model has been developed to validate and evaluate the design. The preliminary results show that double buffering achieves a speedup of 54× for up to 64 cores. For 128 cores and more, the speedup gain starts to decrease because the master core that generates tasks and submits them to Nexus++ cannot generate them fast enough to keep all worker cores busy, and due to limited memory bandwidth. The results also show that applications that could not be executed by Nexus, such as Gaussian elimination with partial pivoting, which resembles the LINPACK benchmark, can be executed efficiently on a multicore system with Nexus++.
This paper is organized as follows. An overview of the StarSs programming model and related work is given in Section II. Nexus++ and its features are described in Section III. In Section IV the simulation environment and the employed benchmarks are described. The experimental results are presented in Section V, and conclusions are drawn in Section VI.
II. BACKGROUND
A. StarSs
StarSs is a task-based programming model that enables the exploitation of task-level parallelism regardless of the target architecture. StarSs provides programmers with pragmas, annotations added to the serial code, to identify pieces of code that can potentially run in parallel. The programmer does not need to care about synchronization between tasks, as this is done implicitly by the StarSs RTS. Listing 1 shows an example of exploiting parallelism using pragmas.
The example in Listing 1 shows that function decode() is called inside a nested loop, processing the elements of matrix X. Calculating decode() for each element requires the values of the left and up-right cells. This example represents macroblock wavefront decoding in H.264 [16], for one 1920×1088 frame in blocks of 16×16, and it is one of the benchmarks used to evaluate Nexus++.
int *X[120][68];

#pragma css task input(left[16][16], \
                       upright[16][16]) inout(this[16][16])
void decode(int *left, int *upright, int *this) {...}

void main() {
    int i, j;
    init_matrix(X);
    for (i = 0; i < 120; i++)
        for (j = 0; j < 68; j++)
            decode(X[i][j-1], X[i-1][j+1], X[i][j]);
    #pragma css barrier
}
Listing 1. StarSs example of macroblock wavefront decoding in H.264
Annotating a function with the css task pragma defines a task. The inputs/outputs of the task must also be specified, as done for function decode() in Listing 1. StarSs also provides several synchronization pragmas such as the css barrier pragma.
A source-to-source compiler transforms the annotated function calls into runtime library calls, which generate a task out of each function call and add it to the task graph. In the example of Listing 1, a task is generated every time function decode() is called.
Having identified the tasks and the direction of their parameters, the StarSs environment builds the task graph at run time, and the task-level parallelism is detected and exploited.
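To illustrate how a task graph can be derived from parameter directions alone, the following Python sketch replays the decode() calls of Listing 1 on a small grid. It is not the StarSs runtime or its API: `build_task_graph`, `wavefront_tasks`, the tuple layout, and the choice to skip neighbours outside the matrix are all assumptions made for this example.

```python
from collections import defaultdict

def build_task_graph(tasks):
    """tasks: (name, inputs, inouts) tuples in program order.
    Returns {name: set of earlier tasks it must wait for}."""
    last_writer = {}                 # address -> task that last wrote it
    readers = defaultdict(list)      # address -> readers since that write
    deps = {}
    for name, inputs, inouts in tasks:
        wait = set()
        for addr in list(inputs) + list(inouts):   # read-after-write
            if addr in last_writer:
                wait.add(last_writer[addr])
        for addr in inouts:                        # write-after-read
            wait.update(readers[addr])
        deps[name] = wait
        for addr in inputs:
            readers[addr].append(name)
        for addr in inouts:
            last_writer[addr] = name
            readers[addr] = []
    return deps

def wavefront_tasks(rows, cols):
    """Mimic the decode() calls of Listing 1, skipping neighbours that
    fall outside the matrix (an assumption of this sketch)."""
    tasks = []
    for i in range(rows):
        for j in range(cols):
            inputs = []
            if j > 0:
                inputs.append((i, j - 1))          # left block
            if i > 0 and j + 1 < cols:
                inputs.append((i - 1, j + 1))      # up-right block
            tasks.append(((i, j), inputs, [(i, j)]))
    return tasks

deps = build_task_graph(wavefront_tasks(3, 3))
print(sorted(deps[(1, 1)]))
```

In this sketch the block at (1, 1) ends up waiting for its left and up-right neighbours, while the first block has no predecessors and is ready immediately.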
B. Related Work
Several hardware scheduling units have been proposed in the literature. Most of them, however, assume independent tasks and are optimized for a certain application, a certain platform, or both. For example, Carbon [7] assumes independent tasks and uses hardware queues to retrieve tasks with low latency. In StarSs, tasks can be dependent and it is the responsibility of the RTS to determine their dependencies. An example of a hardware accelerator targeted at a certain application domain is a hardware task scheduler optimized for H.264 decoding [1]. It requires, however, that the programmer specifies the dependencies between blocks. Etsion et al. [6] also proposed a hardware task management unit for the StarSs RTS, based on the similarity between task dependency checking and the instruction scheduler of an out-of-order processor. It was evaluated using high-level simulations, however, and detailed hardware models were not developed.
As mentioned before, our work builds upon Nexus [10], which was integrated in a simulator of the Cell processor.
III. NEXUS++ HARDWARE TASK MANAGEMENT SYSTEM
The multicore system under consideration, shown in Figure 1, is assumed to have one Master Core that executes the main thread and creates Task Descriptors, and several worker cores that execute the tasks. A Task Descriptor contains task-related information such as its function pointer and input/output list. Nexus++ is responsible for the task graph management usually carried out by the software RTS. In an n-core system (one master core and (n−1) worker cores), Nexus++ is composed of n hardware modules:
• one Task Maestro, which is mainly responsible for dependency resolution, task scheduling, and load balancing,
• and n−1 local Task Controllers (TCs), one per worker core, which are mainly responsible for task buffering.
Fig. 1. Nexus++ in a multicore system
A. System Description
The different components of Nexus++ shown in Figure 2 are described by following a task's life cycle.
Task submission: When the Master Core executes the main program, it generates the Task Descriptors and sends them to the Task Maestro via the Get TDs block. Get TDs is the first hardware block of the Task Maestro. It communicates with the Master Core, receives variable-length Task Descriptors (their length depends on the number of inputs/outputs per task), and writes them to the TDs Buffer. This block is important so that the Master Core is not blocked while the Task Maestro is busy processing an earlier submitted task. It also enables direct communication between the master core and the Task Maestro, avoiding off-chip communication overhead, which is one of the scalability-limiting factors of Nexus [9]. After having received a Task Descriptor, the Get TDs block writes its size to a FIFO list called the TDs Sizes list. If this list is full, the Master Core stalls and stops sending new Task Descriptors.
Storing tasks: Writing to the TDs Sizes list triggers the Write TP block, which reads the size of the recently received Task Descriptor from the TDs Sizes list, reads the Task Descriptor from the TDs Buffer, appends some metadata to it, and finally writes it to the main task storage table in Nexus++, which is called the Task Pool. The full format of a Task Descriptor in the Task Pool is shown in Table I. The first column in Table I is the index at which tasks are stored. This index is determined by the Write TP block, which reads the TP Free Indices list, which initially stores all indices of the Task Pool. After the completion of a task, its Task Pool index is written back to the TP Free Indices list.
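The interplay between the Task Pool and the TP Free Indices list can be pictured with a small software model. This is only a sketch of the bookkeeping described above, not the hardware design; the class and method names are hypothetical, and returning `None` when the pool is full stands in for the stall that the hardware would produce.

```python
from collections import deque

class TaskPool:
    """Fixed-size table whose free slots are handed out through a FIFO."""

    def __init__(self, size):
        self.entries = [None] * size
        # The TP Free Indices list initially holds every index of the pool.
        self.free_indices = deque(range(size))

    def write(self, descriptor):
        """Store a descriptor and return its index, which serves as task ID."""
        if not self.free_indices:
            return None                  # pool full: the caller has to wait
        index = self.free_indices.popleft()
        self.entries[index] = descriptor
        return index

    def release(self, index):
        """Called on task completion: the index becomes reusable."""
        self.entries[index] = None
        self.free_indices.append(index)

pool = TaskPool(size=2)
first = pool.write({"f": "decode", "params": []})
second = pool.write({"f": "decode", "params": []})
overflow = pool.write({"f": "decode", "params": []})   # no free slot left
pool.release(first)
reused = pool.write({"f": "decode", "params": []})     # gets the freed index
```

Because the returned index doubles as the task ID, every later access is a direct array lookup rather than a search.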
Inside Nexus++, a task is identified by its Task Pool index. This makes it possible to directly address a specific entry in the table, rather than searching the table for that entry.
The busy column of a Task Descriptor is a boolean flag indicating whether this Task Descriptor is currently being processed by one of the blocks of the Task Maestro. This ensures exclusive access to any entry in the Task Pool at any time and hence prevents deadlocks.
The *f column of a Task Descriptor holds the function pointer of that task. The DC column stands for Dependence Counter, which records how many dependencies must be fulfilled before this task can be scheduled to run, i.e., how many inputs of this task are outputs of older tasks.
The nD column stores the number of dummy entries that are linked to this Task Descriptor. Adding dummy entries to the Task Pool is the mechanism used to overcome the limit on the number of inputs/outputs a task can have. This mechanism is explained in Section III-C.
Columns nP and the following ones hold the number of inputs/outputs and their information, respectively. An input/output of a task is stored in the format (base memory address, size, access mode), where the access mode can be either input, output, or inout.

addr  busy  tp_i  *f      DC  nD  nP  P1       P2          ...        P8 or ptr_next Dummy
17    0     17    0xABCD  0   0   8   1A/4/in  2A/4/in     ...        1B/4/out
98    0     98    0xDCBA  1   1   10  1B/4/in  2B/4/inout  ...        99/.../...
99    0     99    -       -   -   -   8B/4/in  9B/4/out    10B/4/out  -
TABLE I
THE TASK POOL. (TP_I: TP INDEX, *F: FUNC. PTR, DC: DEPENDENCE COUNT, ND: NUM. DUMMY ENTRIES, NP: NUM. PARAMETERS, Px: PARAMETER x)

Fig. 2. Nexus++ modules block diagram
Resolving task dependencies: Once the Write TP block has finished storing a task in the Task Pool, it writes this task's ID (its Task Pool index) to a FIFO list called New Tasks, an event that triggers the Check Deps block. The latter block is responsible for checking whether the newly submitted task is ready, by checking its inputs/outputs against those of all previously submitted tasks. The task dependence graph is stored inside the Dependence Table. The process of dependency resolution is described in detail in Section III-B to emphasize its capabilities and efficiency.
Scheduling tasks: Once a ready task is found by the Check Deps block, it writes its ID to a FIFO list called the Global Ready Tasks list. This event triggers the Schedule block, which is responsible for scheduling ready tasks onto the worker cores. Another FIFO list called the Worker Cores IDs list initially contains all worker core IDs (each repeated "buffering depth" times). The Schedule block reads a worker core ID from the latter FIFO and schedules the last found ready task on this worker core. This simple round-robin scheduling mechanism achieves load balancing between cores, since whenever a core finishes running a task, the core's ID is written back at the tail of the Worker Cores IDs list.
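The scheduling step pairs the head of one FIFO with the head of another. The Python sketch below models that pairing in software; the function names are hypothetical, tasks are assumed to be taken from the Global Ready Tasks list in FIFO order, and the interleaved initial order of the core IDs (0, 1, 0, 1 for two cores and depth 2) is an assumption, since the text only states that each ID is repeated "buffering depth" times.

```python
from collections import deque

def make_core_id_list(num_workers, buffering_depth):
    """Worker Cores IDs list: every core ID appears buffering_depth times."""
    return deque(core for _ in range(buffering_depth)
                 for core in range(num_workers))

def schedule(global_ready, core_ids, rdy_tasks):
    """Pair ready tasks with core IDs while both FIFOs are non-empty."""
    while global_ready and core_ids:
        task = global_ready.popleft()
        core = core_ids.popleft()
        rdy_tasks[core].append(task)     # write to that core's RdyTasks list

def core_finished_task(core, core_ids):
    """A finished task frees one buffer slot: the ID returns to the tail."""
    core_ids.append(core)

core_ids = make_core_id_list(num_workers=2, buffering_depth=2)
rdy_tasks = {0: [], 1: []}
global_ready = deque([10, 11, 12, 13, 14])

schedule(global_ready, core_ids, rdy_tasks)
first_round = {core: list(tasks) for core, tasks in rdy_tasks.items()}

core_finished_task(1, core_ids)          # core 1 completes one task
schedule(global_ready, core_ids, rdy_tasks)
```

With a buffering depth of 2, each core receives two tasks up front, and the fifth task is only placed once a core reports a finished task.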
The Task Maestro has two FIFO lists for each worker core. The first one is called the CiRdyTasks (Core i Ready Tasks) list, and the second one is the CiFinTasks (Core i Finished Tasks) list. Scheduling a task on a core is done by writing the task's ID to that core's CiRdyTasks list. CiFinTasks lists are used later upon completion of tasks.
Send ready tasks to worker cores: Once the RdyTasks list of a certain core is written, this 1-bit list_written_event is communicated to the corresponding worker core. In each of the worker cores, a small and simple unit called the local Task Controller (TC) is integrated. The Task Controller is mainly responsible for communicating with the Task Maestro and for enabling buffering of tasks.
A Task Controller contains four pipelined hardware blocks, namely the Get TD, Get Inputs, Run Task, and Put Outputs blocks. The first of them is the Get TD block, which is triggered upon writing a new task ID to the corresponding core's RdyTasks list. The Get TD block is responsible for fetching parts (*f and the input/output list) of the Task Descriptors from the Task Maestro. This is done by sending a 1-bit request signal to the Task Maestro, an event that is handled by the Send TDs block in the Task Maestro. The latter block works in a round-robin fashion. It checks all the requests from the different Task Controllers, and whenever it finds an active one, it reads the RdyTasks list corresponding to the incoming active signal and gets the ready task ID. Since a task ID is the index at which it is stored in the Task Pool, the Send TDs block reads the Task Descriptor at that index directly without searching the Task Pool. After the Send TDs block has sent the requested Task Descriptor to the requesting worker core, it writes the sent task ID to that core's FinTasks list, which is important upon task completion, as will be shown later.
Sending tasks to worker cores upon requests from the local Task Controllers ensures that the Send TDs block in the Task Maestro will not waste any clock cycle waiting for a local Task Controller, due to, for example, a handshaking protocol or a full buffer at that local Task Controller.
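The round-robin polling of the 1-bit request signals can be sketched as a simple arbiter. The function below is a generic round-robin selection written for illustration; the name `round_robin_grant` and the convention of resuming the scan right after the last served controller are assumptions, not details taken from the Nexus++ design.

```python
def round_robin_grant(requests, last_served):
    """Return the index of the next active request after last_served,
    wrapping around, or None if no request line is active."""
    count = len(requests)
    for step in range(1, count + 1):
        candidate = (last_served + step) % count
        if requests[candidate]:
            return candidate
    return None

# Three Task Controllers; controllers 0 and 2 are requesting a descriptor.
requests = [True, False, True]
after_zero = round_robin_grant(requests, last_served=0)
after_two = round_robin_grant(requests, last_served=2)
idle = round_robin_grant([False, False, False], last_served=1)
```

Starting the scan after the most recently served controller keeps any single controller from monopolizing the Send TDs block.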
Run tasks: After getting a task from the Task Maestro, the Get Inputs block at the Task Controller side prefetches the task code and inputs from memory. Then, the Run Task block passes the task to the worker core to run it, and finally the Put Outputs block writes the outputs back to memory and notifies the Task Maestro of task completion via a 1-bit notification signal.
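Why buffering tasks in a pipelined Task Controller pays off can be seen from a back-of-the-envelope timing model. The sketch below compares a core that fetches inputs, runs, and writes outputs strictly one task at a time with an idealized three-stage pipeline. It assumes identical tasks, unlimited buffering between stages, and no memory contention; the stage times (4, 12, 3) are made-up values, not measurements from the paper.

```python
def serial_makespan(num_tasks, stage_times):
    """No buffering: each task passes through all stages before the next starts."""
    return num_tasks * sum(stage_times)

def pipelined_makespan(num_tasks, stage_times):
    """Idealized pipeline: a stage starts a task as soon as the task has left
    the previous stage and the stage itself is free."""
    stage_free_at = [0] * len(stage_times)
    for _ in range(num_tasks):
        ready = 0                        # time the task leaves the previous stage
        for stage, duration in enumerate(stage_times):
            start = max(ready, stage_free_at[stage])
            stage_free_at[stage] = start + duration
            ready = stage_free_at[stage]
    return stage_free_at[-1]

stages = (4, 12, 3)                      # get inputs, run task, put outputs
serial = serial_makespan(100, stages)
overlapped = pipelined_makespan(100, stages)
print(serial, overlapped)                # prints "1900 1207"
```

In this idealized model the makespan approaches the number of tasks times the slowest stage, so memory transfers are hidden whenever they are shorter than the execution time.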
Finalize tasks and update the task graph: The task-finished notification signals from the local Task Controllers are handled by the Handle Finished block in the Task Maestro. The latter block also works in a round-robin fashion; it continuously checks the notification signals from the different cores, and whenever it finds an active one, it performs two actions. First, it acknowledges to the corresponding local Task Controller that it has observed its task-finished signal, so that the local Task Controller deactivates the signal.
Second, the Handle Finished block reads the FinTasks list of the corresponding worker core. The value read is the ID of the finished task, since the FinTasks list was written by the Send TDs block immediately after having sent the Task Descriptor to the corresponding worker core. After reading the finished task ID, the Handle Finished block reads the input/output list of the finished task from the Task Pool, updates the Dependence Table, and kicks off pending tasks, if any. Finally, the Handle Finished block deletes the task from the Task Pool, adds the task ID to the TP Free Indices list, and adds the worker core ID to the Worker Cores IDs list.
Since inside Nexus++ any task is identified by the index at which its Task Descriptor is stored in the Task Pool, the size and access time of the different tables and FIFO lists are reduced. Furthermore, all events and notifications are one-bit signals, which ensures low communication overhead between the Task Maestro blocks, between the Task Controller blocks, and between the Task Maestro and the Task Controllers.
Both Nexus and Nexus++ provide dependency resolution. However, Nexus can only deal with tasks with a limited number of inputs/outputs. It can also only deal with dependency patterns in which a limited number of tasks depend on a certain task. In addition, Nexus proposed TCs, but did not implement them. Nexus++ removes these limitations, as described next.
B. Dependency Resolution
Dependency resolution between tasks is accomplished inside the Task Maestro by the Check Deps block, the Handle Finished block, and the Dependence Table, along with the Dependence Counter associated with every Task Descriptor in the Task Pool. Currently, dependencies between tasks are determined by comparing the base addresses of the inputs/outputs of the different tasks.
The Dependence Table: This is the place where dependence information is stored. Each input/output that is accessed by a task has an entry in the Dependence Table indicating its access mode, and a Kick-Off List that contains the IDs of tasks waiting for this address to be produced before they can run.
The Dependence Table is a hash table, addressed by a hash function h(), that resolves hash collisions using simple separate chaining. Its fields are shown in Table II. The first column, hAddr, is the hash address, followed by a valid bit in the v column and the full memory address in the fAddr column. The size and access mode of this memory segment are stored in the Size and isOut columns, respectively. The Rdrs column indicates the number of tasks that currently only read this memory segment. The ww flag (a writer waits) indicates whether a task is waiting for previous readers to finish before it can run and write this memory segment. The latter case is known as a write-after-read (WAR) hazard. Although WAR hazards and write-after-write (WAW) hazards are false dependencies and are normally resolved using renaming techniques, Nexus++ supports them as a safeguard.
The n_v, n_i, and p_i columns stand for the next is valid flag, next entry index, and previous entry index, respectively, which build up a linked-list structure inside the Dependence Table for entries that map to the same hash address. The h_D and l_D columns are the has dummy flag and last dummy index, which implement in the Dependence Table the dummy entries mechanism explained in Section III-C, to overcome the limit on the number of tasks that may depend on a certain memory segment. These tasks are stored in the Kick-Off List, which is composed of the columns T1...T8 of Table II.
Resolving new task dependencies: Every task newly submitted to the Task Maestro is handled by the Check Deps block, whose pseudocode is shown in Listing 2. Listing 2 shows that for each entry A in the input/output list of the newTask, the Dependence Table is looked up, and an entry for A is inserted if none is found. On the other hand, if A is found, then an older task is already accessing it. In this case, the access modes are checked; if both the old and new tasks access A as read-only, then the new task is granted access to A. However, if the older task is writing A, then the new task T2 is added to the Kick-Off List of A as shown in Table II, regardless of its access mode to A (a RAW or WAW hazard when the new task T2 is a reader or a writer of A, respectively), and its Dependence Counter is incremented.
Finally, WAR hazards are handled using the ww (a writer waits) flag in Table II. If a task T1 is reading B, and T10 wants to write B, then T10 is added to the Kick-Off List of B as shown in Table II, its Dependence Counter is incremented, and the ww flag is set. Any other task that wishes to access B, regardless of its access mode, will be added to the Kick-Off List of B, and its Dependence Counter is incremented.
After checking all inputs/outputs of a new task, the Check Deps block checks the new task's Dependence Counter. If it is 0, then the task does not depend on any older task and can be scheduled to run.
foreach A in parameters[newTask]
{
  if (A not exist) {                            // 1: first access to A
    Add A to DT;
    if (newTask read-only A) {                  // 2
      DT[A].Rdrs = 1;
      DT[A].isOut = false;
    }
    else                                        // 2'
      DT[A].isOut = true;
  }
  else {                                        // 1': A is already in use
    if (newTask read-only A) {                  // 3
      if (!DT[A].isOut && !DT[A].WriterWaits)   // 4: join the current readers
        DT[A].Rdrs++;
      else {                                    // 4': wait behind a writer
        DT[A].writeKickOffList(newTask);
        TP[newTask].DC++;
      }
    }
    else {                                      // 3': a writer always waits
      DT[A].writeKickOffList(newTask);
      TP[newTask].DC++;
      if (!DT[A].isOut)
        DT[A].WriterWaits = true;
    }
  }
}
if (TP[newTask].DC == 0)
  GlobalReadyTasksList.write(newTask);
Listing 2. Checking dependencies for new tasks pseudocode
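For readers who want to experiment with the logic of Listing 2, the following Python sketch transcribes it into a runnable form. It is a software approximation, not the hardware block: dictionaries stand in for the Dependence Table and the Dependence Counters, the Kick-Off List is an unbounded Python list, and each waiting task is stored together with a read-only flag (a simplification; the field and function names are hypothetical).

```python
def check_deps(new_task, params, dt, dc, ready):
    """params: list of (address, mode) with mode in {"in", "out", "inout"}.
    dt: address -> Dependence Table entry, dc: task -> Dependence Counter."""
    dc[new_task] = 0
    for addr, mode in params:
        read_only = (mode == "in")
        entry = dt.get(addr)
        if entry is None:                                   # case 1
            dt[addr] = {"is_out": not read_only,
                        "rdrs": 1 if read_only else 0,
                        "ww": False,
                        "kick_off": []}
        elif read_only and not entry["is_out"] and not entry["ww"]:
            entry["rdrs"] += 1                              # case 4
        else:                                               # cases 4' and 3'
            entry["kick_off"].append((new_task, read_only))
            dc[new_task] += 1
            if not read_only and not entry["is_out"]:
                entry["ww"] = True                          # WAR hazard
    if dc[new_task] == 0:
        ready.append(new_task)

dt, dc, ready = {}, {}, []
check_deps("T0", [("A", "out")], dt, dc, ready)   # first writer of A
check_deps("T2", [("A", "in")], dt, dc, ready)    # RAW: waits for T0
check_deps("T1", [("B", "in")], dt, dc, ready)    # first reader of B
check_deps("T3", [("B", "in")], dt, dc, ready)    # joins the readers of B
check_deps("T10", [("B", "out")], dt, dc, ready)  # WAR: waits for the readers
check_deps("T11", [("B", "in")], dt, dc, ready)   # queued behind the writer
print(ready)                                      # prints "['T0', 'T1', 'T3']"
```

The last call shows the effect of the ww flag: although B is currently only being read, T11 has to queue behind the waiting writer T10 instead of joining the readers.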
hAddr  v  fAddr  Size  isOut  Rdrs  ww  n_v  n_i  p_i  h_D  l_D  TID  ...  TID or ptr_next Dummy
0xA    1  0x1A   4     1      0     0   0    -    -    0    -    T2   -    -
0xB    1  0x1B   4     0      1     1   0    -    -    0    -    T10  -    -
0xC    1  0x1C   4     1      0     0   1    111  -    1    333  T20  ...  222
0x111  1  0x2C   4     1      0     0   0    -    C    0    -    T50  ...  -
0x222  1  0x1C   4     -      -     -   -    -    -    1    333  T27  ...  333
0x333  1  0x1C   4     -      -     -   -    -    -    0    -    T34  ...  -
TABLE II
THE DEPENDENCE TABLE. (HADDR: HASH ADDRESS, FADDR: FULL ADDRESS, ISOUT: IS OUTPUT, RDRS: READERS COUNTER, N_V: NEXT IS VALID, N_I: NEXT ENTRY INDEX, P_I: PREV. ENTRY INDEX, H_D: HAS DUMMY ENTRIES, L_D: LAST DUMMY ENTRY INDEX, WW: A WRITER WAITS, Tx: TASK x)

Handling finished tasks: Upon task completion, the Handle Finished block takes action. For each entry A in the input/output list of the finished task T1, if T1 has only read A, then the Rdrs count of A is decremented. If it becomes 0 and no writer task is waiting (the ww flag is false), then A is deleted from the Dependence Table. But if the ww flag is true, then a pending task T2 must exist and is read from the Kick-Off List of A.
On the other hand, if T1 is a writer of A and no tasks are waiting for A, then A is deleted from the Dependence Table. But if there are tasks waiting for A, then the Handle Finished block reads their IDs one after the other as long as they only read A, until it reads a task that wants to write A or the Kick-Off List of A is empty. Each time a reader is read from the Kick-Off List, the Rdrs count of A is incremented.
Dependency resolution in Nexus++ is more efficient than that in Nexus [10], since we use fewer and simpler tables and Kick-Off Lists. Nexus++ has only one table (the Dependence Table) to maintain the task graph, and using the Task Pool indices as task IDs eliminates the need to search the tables. In Nexus, on the other hand, three tables (containing two Kick-Off Lists) are used, and they are always accessed in all scenarios.
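The Handle Finished steps described above can be approximated in software as follows. The text leaves some details open, so this sketch fills them with assumptions that are marked in the comments: a released task has its Dependence Counter decremented and becomes ready when the counter reaches 0, a writer that follows released readers stays in the Kick-Off List with the ww flag set, and each waiting task carries a read-only flag. All names are hypothetical.

```python
def release(task, dc, ready):
    """Assumption: releasing a task decrements its Dependence Counter."""
    dc[task] -= 1
    if dc[task] == 0:
        ready.append(task)

def handle_finished(params, dt, dc, ready):
    """params: (address, mode) list of the finished task."""
    for addr, mode in params:
        entry = dt[addr]
        if mode == "in":                          # finished task only read addr
            entry["rdrs"] -= 1
            if entry["rdrs"] > 0:
                continue
            if not entry["ww"]:
                del dt[addr]                      # nobody is waiting
                continue
            writer, _ = entry["kick_off"].pop(0)  # the waiting writer is first
            entry["is_out"], entry["ww"] = True, False
            release(writer, dc, ready)
        else:                                     # finished task wrote addr
            if not entry["kick_off"]:
                del dt[addr]
                continue
            first, first_reads = entry["kick_off"][0]
            if not first_reads:                   # assumption: next writer takes over
                entry["kick_off"].pop(0)
                release(first, dc, ready)
                continue
            entry["is_out"] = False
            while entry["kick_off"] and entry["kick_off"][0][1]:
                reader, _ = entry["kick_off"].pop(0)
                entry["rdrs"] += 1
                release(reader, dc, ready)
            entry["ww"] = bool(entry["kick_off"])  # a writer is next in line

# State after T1 started reading B, with writer T10 and reader T11 waiting.
dt = {"B": {"is_out": False, "rdrs": 1, "ww": True,
            "kick_off": [("T10", False), ("T11", True)]}}
dc = {"T10": 1, "T11": 1}
ready = []
handle_finished([("B", "in")], dt, dc, ready)    # T1 finishes, T10 may write
handle_finished([("B", "out")], dt, dc, ready)   # T10 finishes, T11 may read
handle_finished([("B", "in")], dt, dc, ready)    # T11 finishes, B is removed
print(ready)                                     # prints "['T10', 'T11']"
```

The example replays the WAR scenario from the text: the writer T10 is released only after the last reader has finished, and the entry for B disappears once no task uses or awaits it.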
C. Dummy Tasks and Entries
In a Task Descriptor, a task has a limited number of inputs/outputs, so applications with tasks that have more inputs/outputs cannot be executed directly on a system with Nexus. In addition, not all tasks necessarily have a number of inputs/outputs equal to the Task Descriptor's limit, which yields poor memory utilization. We solve this problem by introducing dummy tasks. A dummy task will not be executed; it just takes the form of a task by having an entry in the Task Pool, only to store inputs/outputs that did not fit in the parent's input/output list. Figure 3 shows a scenario that demonstrates the need for dummy tasks. If Tx has 2n outputs, and a Task Descriptor can only store n of them (8 in our design), then dummy tasks (D1 and D2) are created that hold the inputs/outputs that did not fit in the Task Descriptor of the parent (Tx). A dummy task is reached through a pointer that replaces the last entry of an input/output list.
In Table I, this mechanism is realized using the nD (number of dummies) column along with the last column (P8 or ptr_next Dummy) of a Task Descriptor. The number of extra Task Descriptors needed is stored in the nD (nDummies) column of the parent entry, as shown in the example in Table I. The Task Descriptor at index 98 has 10 inputs/outputs, which is more than the maximum of 8 per Task Descriptor. This is why a second entry is occupied by this task, namely the Task Descriptor at index 99. The parent entry at index 98 has 1 in its nDummies field, indicating that this task occupies 2 Task Descriptors in total, and the last entry in its input/output list now points to index 99. This process is performed by the Write TP block.
Although this solves the problem of having a fixed, limited number of inputs/outputs per task, the maximum number of inputs/outputs is still bounded by the size of the Task Pool.
The same principle can be applied to the Dependence Table shown in Table II, where the Kick-Off List has a limited size, thus restricting the number of tasks that may depend on a certain memory segment. As a solution, we add dummy entries to the Dependence Table to extend the Kick-Off List of a certain entry.
In Table II, a concrete example is shown. Memory segment 0x1C is currently being written by a certain task T1, and the tasks that are waiting for 0x1C do not fit in a single Kick-Off List. That is why the h_D flag of 0x1C is set, and the last entry in the Kick-Off List of 0x1C points to address 222, which also contains some task IDs and in turn has another dummy entry at address 333 of the Dependence Table, where the rest of the waiting tasks reside.
Fig. 3. Dummy Tasks/Entries added to the Task Pool/Dependence Table
Reading task IDs from the Kick-Off List of a certain memory address starts from the top of the first Kick-Off List of the chain. Whenever all tasks have been read from the first Kick-Off List, this entry's data (except the Kick-Off List and the h_D fields) is copied to the next dummy entry so that it becomes the new parent. For example, memory address 0x1C occupies 3 entries (at DT[0xC], DT[0x222], and DT[0x333]) in Table II. When all items in the Kick-Off List of DT[0xC] have been read, this entry is invalidated, and the parent entry of 0x1C then resides at DT[0x222]. This way, the Dependence Table is efficiently utilized, since DT[0xC] can now be reused by other memory segments, even before memory segment 0x1C is totally removed from the Dependence Table. This also allows direct (and hence fast) access to the first Kick-Off List, since it always resides at the parent entry of a memory segment.
Dummy tasks are injected by the Task Maestro when needed at run time. They utilize memory well and are scalable. The compiler could also add dummy tasks when it discovers that a task has more inputs/outputs than the maximum. However, the master core would then have to generate and submit more TDs, and [9] indicates that eventually the master core becomes the bottleneck. Furthermore, the compiler cannot add dummy entries to the Dependence Table since this depends on runtime information that is not available to the compiler. For these reasons we have decided that the Task Maestro adds dummy tasks and entries.
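The promotion of a dummy entry to parent entry can be mimicked with a few lines of Python. This is a simplified model under stated assumptions: the pointer to the next dummy is kept in a separate `next_dummy` field instead of the last Kick-Off List slot, the copied fields are limited to those shown, how a later lookup locates the promoted entry is left out, and task T21 is a made-up placeholder for IDs that Table II elides.

```python
def pop_waiting_task(dt, parent):
    """Read the oldest waiting task of a memory segment.
    Returns (task, index of the segment's parent entry afterwards)."""
    entry = dt[parent]
    task = entry["kick_off"].pop(0)
    nxt = entry["next_dummy"]
    if not entry["kick_off"] and nxt is not None:
        # First list exhausted: copy the segment data to the next dummy,
        # which keeps its own Kick-Off List and becomes the new parent.
        for field in ("faddr", "is_out", "rdrs", "ww"):
            dt[nxt][field] = entry[field]
        del dt[parent]                    # the old slot can be reused
        parent = nxt
    return task, parent

dt = {
    0xC:   {"faddr": 0x1C, "is_out": True, "rdrs": 0, "ww": False,
            "kick_off": ["T20", "T21"], "next_dummy": 0x222},
    0x222: {"kick_off": ["T27"], "next_dummy": 0x333},
    0x333: {"kick_off": ["T34"], "next_dummy": None},
}
parent, order = 0xC, []
for _ in range(4):
    task, parent = pop_waiting_task(dt, parent)
    order.append(task)
print(order, hex(parent))   # prints "['T20', 'T21', 'T27', 'T34'] 0x333"
```

After the four reads only the entry at 0x333 remains, now carrying the segment data of 0x1C, while the slots at 0xC and 0x222 have been freed for other memory segments.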
IV. EXPERIMENTAL SETUP
A. Benchmarks
Several benchmarks were used to evaluate Nexus++. First, we used a trace of a parallel H.264 decoder decoding one full HD frame on a Cell Broadband Engine processor [11], consisting of 8160 tasks in total. The trace contains the tasks' input/output information, their execution times, and the time they spent reading/writing their inputs/outputs from/to memory. On average a task spends 7.5 μs accessing off-chip memory and 11.8 μs executing [2]. The benchmark processes a matrix of 120×68 macroblocks and the dependency pattern is shown in Figure 4(a) [15]. Tasks are generated in serial execution order, which is from left to right and from top to bottom. Initially there is only one task ready for execution, but this number increases until halfway through the execution, after which it decreases again. This ramping effect influences the average amount of parallelism available in the benchmark and thus its scalability.
To evaluate Nexus++ for a range of dependency patterns, we created two additional synthetic benchmarks derived from the H.264 benchmark. Their dependency patterns are shown in Figure 4(b) and (c). We also used an additional benchmark without dependencies, i.e., with only independent tasks, in order to measure the maximum scalability of Nexus++. In contrast to dependency pattern (a), the dependency patterns depicted in Figure 4(b) and (c) do not suffer from the ramping effect. Instead, these dependency patterns provide a constant number of parallel tasks. In (b), however, the dependency pattern has the same direction as the order in which tasks are generated. As a consequence, the amount of effectively available parallelism could be reduced by the speed of the addition process or the size of the Task Pool, since when the table is full, tasks of the first row have to be executed to make room for other tasks. This leads to an indirect dependency.
Fig. 4. Dependency patterns (120×68 blocks): (a) Ramp effect, (b, c) Fixed # of parallel tasks
Fig. 5. Dependency pattern for the Gaussian elimination benchmark. Tji: i, j row and column numbers respectively
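The ramping effect of pattern (a) can be made concrete by counting how many macroblocks become ready in each wavefront step. The sketch below assumes the left and up-right dependencies of Listing 1, an idealized execution in which every task takes one step, and a frame that is 120 blocks wide and 68 blocks high; the function name and this orientation are assumptions made for the example.

```python
from collections import Counter

def wavefront_profile(rows=68, cols=120):
    """Return {step: number of blocks that can run in that step} when each
    block waits for its left and up-right neighbour."""
    level = {}
    for r in range(rows):
        for c in range(cols):
            preds = []
            if c > 0:
                preds.append(level[(r, c - 1)])        # left
            if r > 0 and c + 1 < cols:
                preds.append(level[(r - 1, c + 1)])    # up-right
            level[(r, c)] = 1 + max(preds, default=-1)
    return Counter(level.values())

profile = wavefront_profile()
total_tasks = sum(profile.values())
print(total_tasks, profile[0], max(profile.values()), len(profile))
# prints "8160 1 60 254"
```

Under these assumptions the 8160 tasks need at least 254 steps, so the average parallelism is only about 32 even though up to 60 blocks are ready at the peak of the ramp.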
To validate the dummy tasks/entries approach, the task graph of Gaussian elimination with partial pivoting [16] is used. In this benchmark, the number of tasks that depend on certain outputs depends on the size of the input matrix, as depicted in the dependency pattern of Figure 5, assuming an n×n matrix.
The execution starts with one task (T11), on which n−1 tasks (T21..Tn1) depend. After that only one task (T22) can execute, and then n−2 tasks, etc. The total number of tasks depends on the matrix size and equals $(n^2 + n - 2)/2$, where n is the matrix dimension. Each task performs a number of floating-point operations (FLOPs). This number represents the weight W of a task and equals [16]:
$$W(T_{ji}) = \begin{cases} n + 1 - i \ \text{FLOPs} & \text{if } i = j \\ n - i \ \text{FLOPs} & \text{if } i < j \end{cases} \qquad (1)$$
where i, j are the row and column numbers, respectively. Hence the duration of a task $T_{ji}$ equals $W(T_{ji})$ divided by the GFLOPS of one core. Each task also reads $W(T_{ji})$ floating-point numbers from memory, and writes the same number back when finished.
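The task count formula can be cross-checked by enumerating the tasks described above. The generator below is a small sketch with hypothetical names; it follows Formula (1) for the weights and the description of the dependency pattern (one diagonal task per step i, followed by n−i dependent tasks).

```python
def gaussian_tasks(n):
    """Yield (i, j, weight) for every task T_ji of an n x n matrix."""
    for i in range(1, n):
        yield i, i, n + 1 - i            # task T_ii, weight n + 1 - i
        for j in range(i + 1, n + 1):
            yield i, j, n - i            # the n - i tasks that depend on T_ii

def task_count(n):
    """Closed form for the total number of tasks: (n^2 + n - 2) / 2."""
    return (n * n + n - 2) // 2

generated = sum(1 for _ in gaussian_tasks(250))
print(generated, task_count(250))        # prints "31374 31374"
```

For n = 250 both the enumeration and the closed form give 31374 tasks, which matches the task count listed for that matrix dimension in the table of Gaussian elimination tasks.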
Some tasks in the Gaussian elimination benchmark are really small (few FLOPs), but as can be seen in Formula (1) and Figure 5, the number of tasks of a certain weight is directly proportional to the weight itself. So the majority of tasks are relatively coarse, and only a small portion are fine. Table III gives an overview of the number and granularity of Gaussian elimination tasks for different matrix sizes.
B. Simulation Environment
The Task Machine: Nexus++ was simulated using the Task Machine, a SystemC simulator of a task-based, trace-driven multicore system. The Task Machine is a fully configurable system that is designed to match modern real systems. Among the configurable parameters are the number of cores, core clock frequency, on-chip/off-chip memory access times, etc. Task information is read from experimental traces, which include the tasks' input/output information as well as their execution and memory access times. Thus, task execution is simply modeled by waiting for a certain time. Memory access delays are modeled in the same way, and memory contention is also modeled. The list of parameters and their values is shown in Table IV.

TABLE II
GAUSSIAN ELIMINATION TASKS FOR DIFFERENT MATRIX SIZES

| Matrix dimension | # Tasks | Average task weight (FLOPs) |
|---|---|---|
| 250 | 31374 | 167 |
| 500 | 125249 | 334 |
| 1000 | 500499 | 667 |
| 3000 | 4501499 | 2012 |
| 5000 | 12502499 | 3523 |

TABLE IV
SYSTEM PARAMETERS

| System Parameter | Value |
|---|---|
| Cores clock freq. | 2.0 GHz |
| Nexus++ clock freq. | 500 MHz |
| On-Chip Access Time | 2 ns |
| Off-Chip Access Time | 12 ns |
| On-chip bus bandwidth | 2 GB/s |
| Memory bandwidth | 10.67 GB/s |
| Task Descriptor (TD) size | 78 Byte |
| Task Pool size | 78 KB (1K TDs) |
| No. Parameters per TD | 8 |
| Dependence Table entry size | 28 Byte |
| Dependence Table size | 112 KB (4K entries) |
| Kick-Off list size | 8 task IDs |
| TDs Sizes list size | 1 KB |
| New Tasks list size | 2 KB |
| TP Free Indices list size | 2 KB |
| Global Ready Tasks list size | 2 KB |
| Worker Cores IDs list size | 2 KB |
| CxRdyTasks list size | 4 Bytes |
| CxFinTasks list size | 4 Bytes |
Nexus++ is simulated assuming a clock cycle time of 2 ns, which equals a clock frequency of 500 MHz. The Task Maestro tables and the FIFO lists are on-chip storage, and therefore their access times are relatively fast. The hash table access time equals the on-chip access time multiplied by the number of lookups required per access.
The traces recording execution and communication times per task were generated after parallel H.264 decoding on a Cell processor [11]. Thus, the experiments assume a local-store, shared-memory architecture. Nevertheless, the Nexus++ concept can be applied to any other multicore architecture.
Design Space Exploration: The sizes of the Task Maestro tables and lists were empirically determined. They are summarized in Table IV. We observed, as will be shown later in Figure 6, that for the current benchmarks, the Task Pool should be able to contain 1K Task Descriptors. Assuming 8 parameters and a total of 78 bytes per Task Descriptor yields a Task Pool size of 78 KB. The Dependence Table, on the other hand, should be able to hold 4K entries, as will be shown later in Figure 6. Each entry is 28 bytes, which yields a table size of 112 KB.
With 1K tasks in the Task Pool, 10 bits are needed to index it and to identify a single task ID. Rounding this number up to a multiple of a byte (i.e., 2 bytes) means that 2 KB are needed to store the IDs of 1K tasks, which is the selected size for the New Tasks list, the TP Free Indices list, and the Global Ready Tasks list. One byte is allocated to store the size of one Task Descriptor upon its reception from the Master Core. This gives a total size of 1 KB for the TDs Sizes list to store the sizes of 1K Task Descriptors.
Simulating up to 512 worker cores requires 9 bits to assign an individual ID to each core. Rounding this number up to multiples of bytes gives a 2 KB Worker Cores IDs list size. Assuming double buffering, a worker core should be able to store two task IDs in its RdyTasks and FinTasks lists, which yields a size of 4 bytes per list.
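The sizing arithmetic in this subsection can be reproduced with a few lines of integer math. The following is a sketch under stated assumptions: the helper names (`id_bits`, `id_bytes`) and constants are hypothetical, "K" is taken as 1024, and IDs are padded to whole bytes as described above.

```python
KB = 1024  # assumption: "1K" and "KB" denote 1024

def id_bits(num_items: int) -> int:
    """Bits needed to give each of num_items items a distinct ID."""
    return max(1, (num_items - 1).bit_length())

def id_bytes(num_items: int) -> int:
    """ID width rounded up to whole bytes."""
    return (id_bits(num_items) + 7) // 8

TASK_POOL_ENTRIES = 1 * KB   # Task Descriptors held by the Task Pool
TD_BYTES = 78                # bytes per Task Descriptor (8 parameters)
DEP_TABLE_ENTRIES = 4 * KB   # Dependence Table entries
DEP_ENTRY_BYTES = 28         # bytes per Dependence Table entry
WORKER_CORES = 512           # largest simulated system

task_pool_kb = TASK_POOL_ENTRIES * TD_BYTES // KB          # 78
dep_table_kb = DEP_TABLE_ENTRIES * DEP_ENTRY_BYTES // KB   # 112
task_id_bits = id_bits(TASK_POOL_ENTRIES)                  # 10
task_id_list_kb = TASK_POOL_ENTRIES * id_bytes(TASK_POOL_ENTRIES) // KB  # 2
td_sizes_list_kb = TASK_POOL_ENTRIES * 1 // KB             # 1 (1 byte per TD size)
core_id_bits = id_bits(WORKER_CORES)                       # 9
per_core_list_bytes = 2 * id_bytes(TASK_POOL_ENTRIES)      # 4 (double buffering)

if __name__ == "__main__":
    print(task_pool_kb, dep_table_kb, task_id_list_kb, per_core_list_bytes)
```

The resulting values match the corresponding rows of Table IV; the sketch deliberately leaves out the Worker Cores IDs list, whose sizing depends on how many entries per core are kept.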
Access Latencies: The access time for the ~100 KB on-chip memory structures (mainly the Task Pool and the Dependence Table) was determined using Cacti 5.3 [8], and was found to be 2 ns for each of them. The off-chip memory (RAM) access time was determined using the same tool, and was found to be 12 ns per 128-byte RAM chunk, assuming a 32-bank 1 GB RAM, which is equivalent to a maximum memory bandwidth of 10.67 GB/s. The off-chip memory is assumed to have 32 banks, each having one read/write port. Therefore, no more than 32 tasks can access the memory at a given time, and this is how contention for off-chip memory is modeled.
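The bandwidth figure and the 32-task limit can be illustrated as follows. This is only a sketch: the function names are hypothetical, and the arbitration policy (first come, first served, with a task holding one bank for its whole memory access) is an assumption, since the text only states that at most 32 tasks access the memory at a time.

```python
import heapq

CHUNK_BYTES = 128   # bytes transferred per RAM access
ACCESS_NS = 12.0    # access time per chunk in ns
BANKS = 32          # one read/write port per bank

def peak_bandwidth_gb_per_s(chunk_bytes=CHUNK_BYTES, access_ns=ACCESS_NS):
    """Bytes per ns equals GB/s (with 1 GB = 10**9 bytes)."""
    return chunk_bytes / access_ns

def finish_times(requests, banks=BANKS):
    """Simulate at most `banks` concurrent memory accesses.

    requests: list of (arrival_ns, access_duration_ns) tuples.
    Returns the finish time of each request, in order of arrival.
    Assumption: requests are served first come, first served.
    """
    free_at = [0.0] * banks          # time at which each bank becomes free
    heapq.heapify(free_at)
    done = []
    for arrival, duration in sorted(requests):
        start = max(arrival, heapq.heappop(free_at))  # wait for a free bank
        end = start + duration
        heapq.heappush(free_at, end)
        done.append(end)
    return done

if __name__ == "__main__":
    print(round(peak_bandwidth_gb_per_s(), 2))        # 10.67
    print(finish_times([(0.0, 12.0)] * 33)[-1])       # 24.0
```

With 33 simultaneous 12 ns accesses, the first 32 finish at 12 ns and the last one has to wait for a bank and finishes at 24 ns.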
The latency of the preparation and submission of Task Descriptors by the master core was estimated. These times were measured in detail for Nexus [9]. As Nexus++ avoids off-chip communication in this part, we had to compensate for this. As a result, the task preparation time was set to 30 ns, while the task submission time is not fixed since it depends on the size of the input/output list of a task. The modeled on-chip bus is a very basic one. It is an 8-byte-wide bus, and its bandwidth is assumed to be 2 GB/s, which is a typical bandwidth of state-of-the-art on-chip buses [13]. Every time the Master Core wishes to submit a task to the Task Maestro, it arranges the task's information into 8-byte words. The first word specifies the task's ID and function pointer, and every other word specifies a single parameter (including its address, size, and access mode). The Master Core also initially sends a handshaking word specifying the new task's number of words, and hence the number of its parameters. We assume that for each task submission, an initial (handshaking) bus delay of 5 cycles is needed, and each word takes 2 cycles (2 GB/s bus bandwidth) to reach the Task Maestro. For example, a task with 4 parameters has a submission delay of 10 cycles (20 ns), whereas an 8-parameter task has a submission delay of 14 cycles (28 ns).
V. EVALUATION
Nexus++ was tested under different conditions, varying the number of worker cores, the buffering depth, and the dependency patterns.
Using double buffering, the independent tasks benchmark was performed varying the number of cores. Measuring the speedup against the single core experiment, the independent tasks benchmark achieved a speedup of 54× on 64 cores. Furthermore, it achieved 143× on 256 cores, assuming contention-free memory. When disabling the task preparation delay, the resulting speedup was 221× using 256 cores.
Design space exploration is also performed by running the independent tasks benchmark on a 256-core system with double buffering and contention-free memory. First, in order to determine the optimal Dependence Table size, all the other structures are configured to be very large; the Task Pool, for example, is configured to hold 8K Task Descriptors at once (given that the total number of tasks is 8160). The first column in Figure 6 shows the speedup achieved when varying the Dependence Table size and fixing the Task Pool size at 8K entries. The maximum speedup equals 143× when setting the Dependence Table size to 2K entries. However, the Dependence Table is set to 4K entries since this size yields a shorter Kick-Off List chain (almost half of that when the Dependence Table is set to 2K entries), as shown in the third column of Figure 6, and longer Kick-Off List chains imply a longer search time. The second column shows the speedup when varying the Task Pool size and fixing the Dependence Table size at 8K entries. A Task Pool size of 512 entries is enough to achieve a speedup of 143×; however, a 1K-entry Task Pool is chosen to allow a larger task window.
Fig. 6. Speedup achieved with varying the size of the Task Pool and fixing the size of the Dependence Table and vice versa. Also showing the effect of varying the Dependence Table size on the longest Kick-Off List chain.
Fig. 7. Speedup achieved with different number of cores running tasks with dependencies shown in Figure 4.
Figure 7 shows the achieved speedup for the benchmarks illustrated in Figure 4. As before, we simulate 8160 tasks with execution and communication times obtained from a parallel H.264 decoder [2]. The speedup is measured against the single core experiment of Nexus++ (double buffering enabled). Limited application scalability explains why the speedup gain decreases faster for the H.264 benchmark compared to the independent tasks speedup.
More interesting is the speedup gain difference between the benchmarks with horizontal and vertical dependencies illustrated in Figures 4(b) and 4(c), respectively. Although the Task Pool is larger than a single row, the processing of non-ready tasks before reaching the next ready task (the first task in the second row of Figure 4(b)) limits the scalability of this benchmark to at most 8 cores, whereas the benchmark illustrated in Figure 4(c) scales well to 64 cores.
Figure 8 shows the speedup achieved by using different multicore systems to solve the Gaussian elimination problem (Figure 5) for different matrices of sizes ranging from 250×250 to 5000×5000. Memory contention is modeled, and double buffering is used.
Although the size of the Kick-Off List of each of the Dependence Table entries is equal to 8, Nexus++ could handle the Gaussian elimination problem for matrices of large sizes. This is mainly because of the dummy entries added to the Dependence Table. As shown in Figure 8, the matrix size has a great impact on the speedup gain and the scalability of the system, since a bigger matrix results in a larger number of tasks of larger granularity. A 5000×5000 matrix scaled up to 64 cores with a speedup factor of 45×. This experiment includes building and managing a task graph of 12502499 tasks with 3523 FLOPs per task on average, as shown in Table II. Each single worker core is assumed to be able to do 2 GFLOPS, which means that the average computation time of each of the aforementioned tasks equals 1.77 μs.
Fig. 8. Speedup achieved with different multicore systems running Gaussian elimination for different matrix sizes (legend shows matrix dimension)
Although the 250×250 matrix has very small tasks (83.5 ns per task on average), Nexus++ could handle them. The benchmark scaled to 4 cores, and a speedup of 2.3× is achieved. This demonstrates the applicability of Nexus++ to any kind of application, even those with very fine-grained tasks.
All tables and FIFO lists in the Nexus++ task manager together do not exceed 210 KB of memory. Nevertheless, they are sufficient to meet all the objectives of Nexus++. The Task Superscalar [5], on the other hand, consumes more than 6.5 MB and still has a static limit (19) on the number of inputs/outputs a task can have. Nexus++ introduces dummy tasks/entries in the Task Pool and the Dependence Table respectively, uses the Task Pool indices as task identifiers, and uses its internal structures more dynamically and efficiently; therefore, the table sizes are relatively small.
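As a rough cross-check of the 210 KB bound against Table IV, the listed sizes can be added up. The tally below is an assumption about what is counted: the first line covers the centralized structures (Task Pool, Dependence Table, TDs Sizes list, and the four 2 KB lists), and the second line adds the per-core CxRdyTasks and CxFinTasks lists for the largest simulated system of 512 worker cores.

```latex
\begin{align*}
\underbrace{78}_{\text{Task Pool}}
+ \underbrace{112}_{\text{Dependence Table}}
+ \underbrace{1}_{\text{TDs Sizes list}}
+ \underbrace{4 \times 2}_{\text{four 2 KB lists}}
&= 199\ \text{KB} \\
199\ \text{KB}
+ \underbrace{512 \times (4 + 4)\ \text{B}}_{\text{per-core lists}}
&= 203\ \text{KB} < 210\ \text{KB}
\end{align*}
```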
VI. CONCLUSION
We have presented Nexus++, a hardware task management accelerator for the StarSs RTS. Compared to previous work, Nexus++ makes four main contributions. First, it overcomes the limitation of Nexus that a task can only have a fixed, limited number of inputs/outputs by introducing dummy tasks in the Task Pool. It also overcomes the limitation that only a fixed, limited number of tasks can depend on a certain task by introducing dummy entries in the Kick-Off Lists of the Dependence Table. Second, it supports double buffering by providing a task controller in each worker core. Third, it implements task dependency resolution more efficiently, since fewer hash table lookups are required to determine if tasks depend on each other. Fourth, we have presented a platform-independent implementation of Nexus++ whose parameters are fully configurable, while Nexus was integrated in a simulator of the Cell processor.
Experimental results obtained using a SystemC model show that double buffering achieved a speedup of 54×/143× with/without modeling memory contention respectively, for a benchmark modeled after H.264 decoding. Furthermore, double buffering increases the scalability of the system. Eventually, for large systems (64 cores and more), the speedup gain starts to decrease, mainly because the application does not exhibit sufficient task-level parallelism, because of insufficient memory bandwidth, and/or because the master core cannot generate tasks fast enough to keep all worker cores busy. We have also shown that a benchmark modeled after Gaussian elimination, where the number of tasks that depend on a certain task is not constant, ran successfully and efficiently, achieving a speedup of 45× for a 5000×5000 matrix using 64 cores.
Although Nexus++ targets StarSs applications, parts of it can be reused for other programming models. For example, it contains hardware queues that can be used for low-latency retrieval of independent tasks. Future work will focus on how to make Nexus++ more versatile.
ACKNOWLEDGMENT
The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under the ENCORE Project (www.encore-project.eu), grant agreement n° 248647.
REFERENCES
[1] G. Al-Kadi and A. S. Terechko. A Hardware Task Scheduler for Embedded Video Processing. In Proc. 4th Int. Conf. on High Performance Embedded Architectures and Compilers, 2009.
[2] C. C. Chi, B. Juurlink, and C. Meenderinck. Evaluation of Parallel H.264 Decoding Strategies for the Cell Broadband Engine. In Proc. 24th ACM Int. Conf. on Supercomputing, 2010.
[3] L. Dagum and R. Menon. OpenMP: an Industry Standard API for Shared-Memory Programming. IEEE Computational Sci. Eng., 1998.
[4] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proc. 6th Symp. on Operating Systems Design & Implementation, 2004.
[5] Y. Etsion, F. Cabarcas, A. Rico, A. Ramirez, R. M. Badia, E. Ayguade, J. Labarta, and M. Valero. Task Superscalar: An Out-of-Order Task Pipeline. In Proc. IEEE/ACM Int. Symp. on Microarchitecture, 2010.
[6] Y. Etsion, A. Ramirez, R. M. Badia, and J. Labarta. Cores as Functional Units: A Task-Based, Out-of-Order, Dataflow Pipeline. In Proc. Int. Summer School on Advanced Computer Architecture and Compilation for Embedded Systems, July 2009.
[7] S. Kumar, C. J. Hughes, and A. Nguyen. Carbon: Architectural Support for Fine-Grained Parallelism on Chip Multiprocessors. In Proc. 34th Annual Int. Symp. on Computer Architecture, 2007.
[8] HP Laboratories. Cacti 5.3. http://www.hpl.hp.com/research/cacti/.
[9] C. Meenderinck. Improving the Scalability of Multicore Systems, with a Focus on H.264 Video Decoding. PhD thesis, Delft University of Technology, July 2010.
[10] C. Meenderinck and B. Juurlink. A Case for Hardware Task Management Support for the StarSS Programming Model. In Proc. 13th Euromicro Conf. on Digital System Design: Architectures, Methods and Tools, 2010. Sp. Session on Multicore Systems: Des. and Apps.
[11] D. Pham, S. Asano, M. Bolliger, M. Day, H. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Riley, D. Shippy, D. Stasiak, M. Suzuoki, M. Wang, J. Warnock, S. Weitzel, D. Wendel, T. Yamazaki, and K. Yazawa. The Design and Implementation of a First-Generation Cell Processor. In Solid-State Circuits Conference, 2005. Digest of Technical Papers. ISSCC. 2005 IEEE International, 2005.
[12] J. Planas, R. M. Badia, E. Ayguadé, and J. Labarta. Hierarchical Task-Based Programming With StarSs. Int. J. High Perf. Comp. Appl., 2009.
[13] Power.org. Power.org Embedded Bus Architecture Report. www.power.org/resources/downloads/Embedded_Bus_Arch_Report_1.0.pdf, 2008.
[14] J. Reinders. Intel Threading Building Blocks. O'Reilly & Associates, Inc., 1st edition, 2007.
[15] E. B. van der Tol, E. G. Jaspers, and R. H. Gelderblom. Mapping of H.264 Decoding on a Multiprocessor Architecture. In Proc. SPIE Conf. on Image and Video Communications and Processing, 2003.
[16] M. Veldhorst. Gaussian Elimination with Partial Pivoting on an MIMD Computer. Journal of Parallel and Distributed Computing, 1989.
