







This paper presents a scheduler for Data-Flow threads, implemented in reconfigurable logic and intended for
deployment on Reconfigurable MPSoCs (i.e., Multi-Processing Systems on Chip with an FPGA). "Data-Flow
threads" (DF-Threads) is a novel execution model for mapping threads onto local or distributed cores
transparently to the programmer. The model can be parallelized massively across different
cores: it handles hundreds of thousands or more Data-Flow threads, together with their associated data
frames, and distributes them both within a local node and through the network to other nodes in a
transparent way. The Hardware Scheduler (HS) is designed to be used in the Programmable Logic (PL) of







engineering community to shift to multicore processors as an alternative way to improve
performance within a limited power budget.
An increased core count benefits many workloads, but programming limitations still prevent
full performance because the available parallelism is not fully exploited. According to Mondelli et al. [2],
the Data-Flow execution model is capable of taking advantage of the full parallelism offered by

















GPP cores make the platform suitable for a large set of applications, and FPGAs are known for their
reconfigurability and power efficiency compared to software-only designs. FPGAs are therefore a
suitable choice for deployment in many-thread Data-Flow execution models, and they also
provide a spatial substrate for mapping Data-Flow threads. These models revolve around
optimizing data mobility and exploiting massive parallelism among thousands of Data-Flow
threads to offer more modularity and higher performance [5][6][7][8][9][15][16].
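To make the firing rule behind such models concrete, the following is a minimal software sketch of the general data-flow principle: a thread becomes ready only once all of its inputs have been written into its frame. It is not the hardware scheduler described in this paper; the names Frame, Scheduler, write, and pending are hypothetical and chosen only for illustration.

```python
from collections import deque


class Frame:
    """Hypothetical data frame: input slots plus a count of inputs still missing."""

    def __init__(self, name, num_inputs, body):
        self.name = name
        self.slots = [None] * num_inputs
        self.pending = num_inputs  # assumption: one write per slot
        self.body = body           # code to run once all inputs are present


class Scheduler:
    """Toy software scheduler: keeps a FIFO queue of ready frames."""

    def __init__(self):
        self.ready = deque()
        self.log = []

    def write(self, frame, slot, value):
        # Store one input; the thread becomes ready when none are missing.
        frame.slots[slot] = value
        frame.pending -= 1
        if frame.pending == 0:
            self.ready.append(frame)

    def run(self):
        # Execute ready threads; a body may write into other frames.
        while self.ready:
            frame = self.ready.popleft()
            self.log.append(frame.name)
            frame.body(self, *frame.slots)


def demo():
    sched = Scheduler()
    results = []
    total = Frame("sum", 2, lambda s, a, b: results.append(a + b))
    left = Frame("left", 1, lambda s, x: s.write(total, 0, x * x))
    right = Frame("right", 1, lambda s, y: s.write(total, 1, y + 1))
    # Inputs may arrive in any order; "sum" runs only after both producers.
    sched.write(right, 0, 4)
    sched.write(left, 0, 3)
    sched.run()
    return sched.log, results
```

In this sketch the order of execution is driven purely by data availability: calling demo() runs the two producer threads first and the consumer last, regardless of the order in which the frames were created.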
Here the idea is to detach the execution of the Data-Flow threads from their scheduling, reducing the
latency of data communication and increasing the overall performance of the execution. We
propose a scalable hardware scheduler (HS), implemented mainly on the FPGA, which provides








































the speedup of the execution is reasonably
good, especially when increasing the number of
nodes/GPPs and with large matrix sizes.
The block size does not much affect the



























[4] Budiu, M., Artigas, P. V., & Goldstein, S. C. (2005, March). Dataflow: A complement to
superscalar. In Performance Analysis of Systems and Software, 2005. ISPASS 2005. IEEE
International Symposium on (pp. 177-186). IEEE.
[5] Giorgi, R., & Faraboschi, P. (2014, October). An introduction to DF-Threads and their
execution model. In Computer Architecture and High Performance Computing Workshop
(SBAC-PADW), 2014 International Symposium on (pp. 60-65). IEEE.
[6] Stavrou, K., Pavlou, D., Nikolaides, M., Petrides, P., Evripidou, P., Trancoso, P., ... & Giorgi, R.
(2009, June). Programming abstractions and toolchain for dataflow multithreading
architectures. In Parallel and Distributed Computing, 2009. ISPDC'09. Eighth International
Symposium on (pp. 107-114). IEEE.
[7] Verdoscia, L., Vaccaro, R., & Giorgi, R. (2014, August). A clockless computing system based
on the static dataflow paradigm. In Data-Flow Execution Models for Extreme Scale
Computing (DFM), 2014 Fourth Workshop on (pp. 30-37). IEEE.
[8] Solinas, M., Badia, R. M., Bodin, F., Cohen, A., Evripidou, P., Faraboschi, P., ... & Goodman, D.
(2013, September). The TERAFLUX project: Exploiting the dataflow paradigm in next
generation teradevices. In Digital System Design (DSD), 2013 Euromicro Conference on (pp.
272-279). IEEE.
















[15] Kyriacou, C., Evripidou, P., & Trancoso, P. (2006). Data-driven multithreading using
conventional microprocessors. IEEE Transactions on Parallel and Distributed Systems, 17(10),
1176-1188.
[16] Alves, T. A., Marzulo, L. A., França, F. M., & Costa, V. S. (2011). Trebuchet: exploring TLP with
dataflow virtualisation. International Journal of High Performance Systems Architecture, 3(2-3),
137-148.
