Task Based Programming on Embedded Multicores by Schleuniger, Pascal & Karlsson, Sven
Publication date: 2013
Document version: Publisher's PDF, also known as Version of record
Citation (APA):
Schleuniger, P., & Karlsson, S. (2013). Task Based Programming on Embedded Multicores. Poster session
presented at 8th International Conference on High-Performance and Embedded Architectures and Compilers ,
Berlin, Germany.
Task Based Programming on Embedded Multicores
Pascal Schleuniger and Sven Karlsson
Technical University of Denmark
Motivation
• Directory based cache coherency protocols have drawbacks:
  • They introduce a high communication overhead.
  • Induced latency limits the scalability of the system.
  • Directories may occupy up to 20% of the total memory.
  • Low power efficiency.
  • High design and implementation complexity. State machines have a multitude of transitional states.
• We argue that parallelism should be expressed using a task based model. We also claim this will simplify the cache coherency protocol.
Contributions
• We design a scalable high performance multicore platform on FPGA.
• We outline the runtime environment needed to support task models.
• We only do cache coherency operations at task boundaries to simplify the cache coherency protocol.
Architecture
• Processor Core
  • Tinuso processor core optimized for high throughput on FPGA.
  • 8-stage, single issue, in-order pipeline with support for predicated instructions.
  • Supports both hard- and soft-float operations.
  • Full GCC based tool suite: GCC, Binutils tools, and the Newlib C library.
• Network Interconnect
  • Generic 2D mesh on-chip network optimized for FPGA implementation.
  • Packet switched, deadlock free XY-routing scheme.
  • 1 cycle latency per network hop.
  • Peak switching data rate of 9.6 Gbit/s per link.
• Simulation Environment
  • Platform independent behavior level VHDL.
  • Full system simulator built with the GHDL open source VHDL compiler.
  • Simulator runs ELF executables.
  • Simulation speed: 1 kHz for a single core, 10 Hz for a 64 core system.
  • Allows monitoring and plotting of network traffic and CPU utilization.
[Figure: network traffic, plotted over x nodes and y nodes of the mesh]
[Figure: backpressure, plotted over x nodes and y nodes of the mesh]
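The XY-routing scheme listed under Network Interconnect can be illustrated with a minimal sketch. It assumes the usual dimension-ordered definition (route along x until the column matches, then along y); the coordinate convention and the function names `xy_next_hop` and `xy_route` are hypothetical and not taken from the platform's VHDL.

```python
def xy_next_hop(cur, dst):
    """Return the next node on a dimension-ordered XY route from cur to dst."""
    cx, cy = cur
    dx, dy = dst
    if cx != dx:                      # first correct the x coordinate
        return (cx + (1 if dx > cx else -1), cy)
    if cy != dy:                      # then correct the y coordinate
        return (cx, cy + (1 if dy > cy else -1))
    return cur                        # already at the destination


def xy_route(src, dst):
    """Return the full list of nodes visited from src to dst, inclusive."""
    path = [src]
    while path[-1] != dst:
        path.append(xy_next_hop(path[-1], dst))
    return path


route = xy_route((0, 0), (2, 1))
hops = len(route) - 1
print(route, hops)
```

Because every packet finishes its x movement before it starts moving in y, no packet ever turns from the y dimension back into x, which is the standard argument for why this scheme cannot form a cyclic dependency. With one cycle per hop, the three-hop route above would take at least three cycles in an uncongested network.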
Task Model
Example
[Figure: task graph with a master thread spawning tasks A, B, C and D1 to Dn, joined at a synchronization point. Task annotations: "write X", "read X", "invalidate X".]
[Figure: mesh of routers (R) connecting cores C11 to C32, each with a cache ($) and a network interface (NI), plus an L2 cache (L2$), memory (M) and output (O). Annotations: "work stealing", "memory lookup", "directory points to core C12", "fetch X from C12".]
Task semantics:
• C extensions to spawn and synchronize tasks.
• Use the "spawn" keyword to create a number of parallel tasks.
• Each core has its own queue of tasks.
• Spawned tasks are put on the top of the spawning core's task queue.
• Idle cores steal work from the bottom of the task queues of other cores.
• Use the "sync" keyword to wait for parallel tasks to finish.
• Nested tasks are possible.
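The queue discipline described above (the owner works at the top, thieves take from the bottom) can be modelled with a small single-threaded sketch. The `Core` class and its method names are hypothetical; the poster's actual interface is a set of C extensions whose syntax is not shown here, and this model ignores concurrency entirely.

```python
from collections import deque


class Core:
    """Toy model of one core with its own task queue."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()          # right end = top, left end = bottom

    def spawn(self, task):
        """Put a newly spawned task on the top of this core's queue."""
        self.queue.append(task)

    def next_local(self):
        """The owning core takes its next task from the top."""
        return self.queue.pop() if self.queue else None

    def steal_from(self, victim):
        """An idle core steals the task at the bottom of the victim's queue."""
        return victim.queue.popleft() if victim.queue else None


c11, c12 = Core("C11"), Core("C12")
for task in ["B", "C", "D"]:
    c11.spawn(task)

local = c11.next_local()       # most recently spawned task
stolen = c12.steal_from(c11)   # oldest task in the queue
print(local, stolen, list(c11.queue))
```

Taking local work from the top and stolen work from the bottom keeps owner and thief at opposite ends of the queue, so they only contend when the queue is nearly empty.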
Memory Consistency Model:
• Only task stealing leads to coherency actions.
• Between parallel tasks, memory is not kept coherent.
• Within a set of parallel tasks, a memory location may have either a single reader and writer, or multiple readers. The programmer is responsible for ensuring this.
• Memory coherency actions take place if and only if a stolen task finishes.
• Sync operations only complete once memory coherency actions have finished.
• Hardware support for synchronization primitives to avoid spin-locks.
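The single-writer-or-multiple-readers rule is left to the programmer, so a small checker can help make it concrete. The sketch below assumes one reading of the rule: a location written by some task in a parallel set must not be read or written by any other task in that set. The function name `find_conflicts` and the way read and write sets are declared are hypothetical.

```python
def find_conflicts(accesses):
    """Return the sorted locations that break the sharing rule.

    accesses maps a task name to a (reads, writes) pair of sets of
    memory location names, for tasks that may run in parallel.
    """
    touched_by = {}   # location -> tasks that read or write it
    written = set()   # locations written by at least one task
    for task, (reads, writes) in accesses.items():
        for loc in reads | writes:
            touched_by.setdefault(loc, set()).add(task)
        written |= writes
    # A written location is only safe if exactly one task touches it.
    return sorted(loc for loc in written if len(touched_by[loc]) > 1)


ok = find_conflicts({
    "B": ({"Y"}, {"X"}),      # B is the only task touching X
    "C": ({"Y"}, set()),      # Y has multiple readers and no writer
})
bad = find_conflicts({
    "B": (set(), {"X"}),
    "C": ({"X"}, set()),      # C reads X while B writes it
})
print(ok, bad)
```

Under this reading, tasks that obey the rule never depend on each other's writes while they run in parallel, which is why coherency actions can be deferred to task boundaries.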
Implementation:
• Support of "load-linked" and "store conditional" operations to steal tasks and to synchronize.
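How load-linked and store-conditional can arbitrate between competing thieves is sketched below as a software model. It assumes the common semantics in which any successful store to a word breaks all outstanding reservations on it. The `LLSCWord` class, the `try_steal` helper, and the choice to represent the queue bottom as an index are all hypothetical and not taken from the Tinuso implementation.

```python
class LLSCWord:
    """Software model of one memory word with LL/SC reservations."""

    def __init__(self, value):
        self.value = value
        self._links = set()           # cores holding a valid reservation

    def load_linked(self, core):
        self._links.add(core)
        return self.value

    def store_conditional(self, core, new_value):
        if core not in self._links:
            return False              # reservation was broken: store fails
        self.value = new_value
        self._links.clear()           # a successful store breaks all links
        return True


def try_steal(bottom, tasks, thief):
    """Try to claim the bottom task; return it, or None if none was claimed."""
    idx = bottom.load_linked(thief)
    if idx >= len(tasks):
        return None                   # nothing left to steal
    if bottom.store_conditional(thief, idx + 1):
        return tasks[idx]
    return None                       # lost the race; the caller may retry


tasks = ["B", "C"]
bottom = LLSCWord(0)
i1 = bottom.load_linked("C12")        # two thieves read the same index
i2 = bottom.load_linked("C21")
ok1 = bottom.store_conditional("C12", i1 + 1)
ok2 = bottom.store_conditional("C21", i2 + 1)
retry = try_steal(bottom, tasks, "C21")
print(ok1, ok2, retry)
```

Only one of the two competing stores succeeds, so each task is claimed by at most one thief, and the loser simply reloads and retries rather than holding a lock.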
Conclusions
• We design and implement a scalable high performance multicore platform on FPGA.
• We outline the runtime environment for tasks:
  • Global shared address space.
  • Cache coherency operations are only done at task boundaries to simplify the cache coherency protocol.
• We envision hardware support for synchronization primitives to avoid spin-locks.
DTU Informatics - Technical University of Denmark pass@imm.dtu.dk http://www.imm.dtu.dk
