Bringing Real Processors to Labs by Gómez Requena, Crispín et al.
 
 























 This is the accepted version of the following article: Gómez, C., Gómez, M. E. and
Sahuquillo, J. (2015), Bringing real processors to labs. Comput Appl Eng Educ, 23:




Gómez Requena, C.; Gómez Requena, ME.; Sahuquillo Borrás, J. (2015). Bringing Real






multithreaded multicore processors. The inherent complexity of such processors makes it difficult to update processor
teaching to include current commercial products, especially in lab sessions, where simplistic simulators are usually used.
However, instructors must reduce this gap if they want to properly prepare students in this topic. Dealing with










Nowadays, electronic devices are present in all areas of our everyday life. This has been made possible
by the great computational power reached by current microprocessors and by the low power they
consume. This huge computational power has come from an increase in hardware and functional




taught these new features, or at least they do not work with them in laboratories. This means that these
complex concepts or mechanisms are not reinforced in lab sessions, mainly due to the high complexity of
current microprocessor generations, which is not modeled in the simple simulators usually used in
laboratories, as well as the constant evolution of commercial processors. This widespread
practice is a mistake because these concepts are precisely the most difficult to understand and
should be worked on in laboratories for better understanding. In addition, students feel that the studied
concepts differ widely from the real hardware, which discourages them from studying these topics. Thus,
it is important for instructors in this area to spark students’ motivation. In our view, this emphasis
should be persistent and linked with commercial products, showing how real and recent processors work.
This paper proposes a methodology aimed at bringing industry to academia in general, and in
particular to laboratories, by training students to work with real mechanisms in laboratory sessions. For
this purpose, the proposed approach relies on a detailed simulation framework able to
accurately model current microprocessors. In addition, it is highly advisable that the









The methodology has been applied using the Multi2Sim [7] simulator, a cycle-accurate simulator of






without modifying the source code). After that, students are progressively introduced to the simulator
code, ranging from minor modifications of the source code to the completion of their Degree Thesis and
their participation as co-authors in research papers. The gradual project-based learning approach has
already been followed by instructors ([3], [4]) in other areas. However, unlike these works, this paper
proposes combining a detailed simulator that models realistic processors with the
gradual project-based methodology, with the aim of training students in advanced processor architecture
concepts. The great advantage of using both together is that students can experience how
real processors work without feeling overwhelmed by the simulator complexity.
Quantitative and qualitative results have been analyzed to assess the proposed methodology. Results
show that since the proposed methodology was introduced, students have obtained better marks in the


















After some compulsory computer architecture courses, two main elective courses are offered at
Universitat Politècnica de València (UPV) that stay close to commercial products:
“Advanced processor architectures”, offered both in the last year of the Computer Engineering Degree
and in the Master in Computer Engineering; and “Networks on-chip”, offered in the Computer
Engineering Master. These courses are complementary, and students are encouraged to take both of
them.
During the last two academic years we have developed and implemented a new methodological
approach that treats the mentioned courses as if they were a single one. The methodology is based on
the premise that students can be trained in laboratories on complex features of commercial processors
that have been taught in theoretical lectures, without being overwhelmed. To do this, both courses
use the same simulation framework, which allows students to model realistic mechanisms of commercial
processors. The benefits of this approach are twofold. On the one hand, students analyze and realize








Commercial processors are continually evolving. This constant evolution has forced the continuous
update of teaching contents, both theoretical ones in lectures and practical ones in laboratories.
Concerning laboratories, several simulators are in common use, depending on the intended purpose. Simulation
tools can be simple and easy to use or complex and harder to understand. An example of a simple
simulator is DLXV, which is typically used to study the processor pipeline. DLXV comes in handy
for understanding simple scalar processors, which are the basis of superscalar processors. Nevertheless,
for more advanced concepts such as superscalar or multithreaded processors, DLXV does not provide
any teaching support.
Most universities do not cover this gap for undergraduates, and complex simulators like SimpleScalar [5],
Sniper [6], and Multi2Sim [7] are used exclusively by PhD students. This is a clear mistake due to two







The overall objective of the proposed methodology is twofold: to provide students with solid knowledge
of realistic processor architectures and to encourage them to study computer architecture
topics. The methodology consists of four phases of increasing difficulty. The
phases are: i) modification of simulation parameters, ii) modification of simple parts of the simulator
code, iii) implementation of complete functionalities, and iv) full autonomy.
At the first stage, students are asked only to modify simulation parameters and analyze how they affect
performance. Students must first modify the values of some key system parameters previously






memory controller. As in the previous stage, this work is done with careful guidance from instructors; for
instance, instructors provide the code of a very simple prefetching mechanism to the students, who
are asked to implement other prefetching mechanisms from it. For this purpose, instructors provide the
students with another booklet describing the source code of the simulator. In our experience, this
phase is best performed as a final course project. We consider this phase, which is usually omitted,
important since the study of the source code provides a complete understanding of the details of a given






published	 in	 top-notch	 computer	 architecture	 conferences	 like	 IPDPS,	 PACT,	 and	 Euro-Par.	 This	 stage	
stimulates	the	interest	of	students	in	research	on	computer	architecture	topics.	
Those	students	who	successfully	complete	the	three	previous	stages	are	in	a	privileged	position	to	start	





to be used both in computer architecture related courses and, if students are interested, in the
development of their future PhD Theses. To cover the mentioned courses and the wide spectrum of
computer architecture topics on which research currently focuses, we have used Multi2Sim,
which is capable of modeling multithreaded and multicore superscalar processors as well as graphics
processing units (GPUs). In addition, recent extensions include detailed modeling of the on-chip
network and the memory controller. The simulation framework is being used by major microprocessor







• Cache hierarchy configuration is highly flexible. The user can define as many cache memory
levels, and as many caches in each level, as desired, and the caches can have any size and
geometry. Cache coherence is guaranteed by means of a MOESI protocol.
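
To make "size and geometry" concrete, the following minimal Python sketch relates a cache's geometry (sets, ways, block size) to its capacity and to the way an address is split into tag, set index, and block offset. The `CacheGeometry` class and the numbers used are hypothetical illustrations; they are not Multi2Sim's configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheGeometry:
    """Hypothetical description of one cache (not the simulator's own format)."""
    sets: int        # number of sets
    ways: int        # associativity
    block_size: int  # block (line) size in bytes

    def capacity_bytes(self) -> int:
        # Capacity is simply sets x ways x block size.
        return self.sets * self.ways * self.block_size

    def decompose(self, address: int):
        """Split a byte address into (tag, set index, block offset)."""
        offset = address % self.block_size
        block_number = address // self.block_size
        return block_number // self.sets, block_number % self.sets, offset


# Arbitrary example geometries for a two-level hierarchy.
l1 = CacheGeometry(sets=128, ways=2, block_size=64)
l2 = CacheGeometry(sets=512, ways=8, block_size=64)
print(l1.capacity_bytes(), l2.capacity_bytes())
print(l1.decompose(0x12345))
```

Changing any one of the three fields while keeping the others fixed is the kind of experiment a first-stage lab can ask for.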









however, is very long, and it can take a long time to understand even a few modules of the simulator.
For this reason it is important for instructors to have a deep knowledge of the selected tool. Since
Multi2Sim [7] was originally developed at UPV, we are very familiar with this framework and its






we chose an 8-tile multicore processor that implements a mesh NoC as the baseline system. This
architecture was chosen since it resembles those found in commercial multicore
processors. With this processor, students have to complete several learning stages in a progressive way.
At each learning stage, students only have to deal with the part of the simulator that is required for that stage,
while the rest of the simulator remains hidden. Unlike what is typically done with PhD students, it is not advisable to
give a thorough tutorial on the simulator before starting the work, as this would likely strongly






As an initial step to make the proposed approach feasible, instructors prepared a detailed simulator
guide covering the aspects that are required for this stage. This is not a typical simulator guide; rather, it
relates the theoretical concepts studied in lectures to practical aspects.
This guide describes the three main subsystems that can be modeled in the simulator: cores, cache
memories, and interconnection network. For each subsystem, the guide explains its operation and role in
the system, and provides a brief description of the main configuration parameters. For instance, the
guide shows how to change the branch predictor in the core subsystem; in the case of cache memories,
the guide explains how to change the size of the cache line or the access latency; and for the
interconnection network subsystem, the guide presents how the bandwidth of the links and the size of
the buffers at the switches can be changed. Moreover, regarding caches, the guide shows how to model
the entire cache hierarchy, connecting the caches of the distinct levels to one another, to the cores,
and to the main memory. In a similar way, the guide explains the correct way of defining an
on-chip interconnection network to connect all the different cache memories or network nodes.
After explaining the modeling of the system components, instructors define the baseline system
mentioned above, which is a typical configuration of commercial processors as shown in Figure




(core,	 memory	 hierarchy,	 and	 interconnection	 network).	 As	 an	 example,	 two	 small	 fragments	 of	 two	
configuration	 files	 are	 shown.	 Example	 1	 shows	 the	 configuration	 file	 for	 the	 on-chip	 network	 called	









Once	 students	 become	 familiar	with	 the	 configuration	 files,	 they	 are	 asked	 to	 upgrade	 the	 system	by	




the	processing	cores	 remain	 the	same.	With	 this	extension	we	aim	to	provide	 the	students	with	more	
self-confidence	with	the	simulator.	
The next step in the proposed simulator learning process is to replace private L2 caches with shared L2
caches. In this model, each L2 cache memory is shared by a pair of cores. That is, core #0 and core #1
share one L2 cache, cores #2 and #3 share another, and so on. As the number of L2 caches is halved, their
size is doubled to keep the L2 storage capacity constant, and their latency is accordingly increased from 6
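
The private-to-shared transformation described above can be sketched in a few lines of Python. This is only an illustration under assumed values: the function name `shared_l2_layout` and the 256 KB private L2 size are hypothetical, and latency is deliberately left out.

```python
def shared_l2_layout(num_cores, private_l2_kb, cores_per_l2=2):
    """Return (number of L2 caches, size of each in KB, core -> L2 index map).

    Each group of `cores_per_l2` consecutive cores shares one L2, and each
    shared L2 is enlarged so that total L2 capacity stays constant.
    """
    if num_cores % cores_per_l2 != 0:
        raise ValueError("num_cores must be a multiple of cores_per_l2")
    num_l2 = num_cores // cores_per_l2
    l2_kb = private_l2_kb * cores_per_l2
    core_to_l2 = {core: core // cores_per_l2 for core in range(num_cores)}
    return num_l2, l2_kb, core_to_l2


# 8 cores as in the baseline system; the 256 KB private L2 size is an assumption.
num_l2, l2_kb, core_to_l2 = shared_l2_layout(num_cores=8, private_l2_kb=256)
print(num_l2, l2_kb, core_to_l2)
```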




Students experiment with standard benchmarks, which are used to evaluate actual processors and are widely
accepted by the scientific community. In particular, we chose the SPLASH2 parallel benchmark suite [9].
To reduce the amount of work, we provide the students with the scripts to launch the benchmarks of the




application. To analyze the obtained performance metrics, students are advised to plot their
results in a manner similar to that depicted in Figure 3. The three aforementioned performance
metrics are normalized to the values obtained with private L2 caches for three (Cholesky, FFT and Radix)





applications	 share	 code	 and	 data	 among	 the	 processor	 cores,	 so	 by	 implementing	 shared	 caches,	
coherency	 traffic	 can	 be	 significantly	 reduced.	 Moreover,	 if	 two	 cores	 sharing	 the	 L2	 cache	 access	 a	
shared	block,	 this	cache	will	hold	a	single	copy	of	 the	block	 for	both	cores,	whereas	 two	copies	of	 the	
shared	block	are	necessary	 in	L2	private	caches.	Thus,	shared	caches	make	a	more	efficient	use	of	 the	
available	storage	capacity.	Looking	at	the	results,	students	are	able	to	observe	that	the	execution	time	of	







implement a complete basic processor mechanism. The student can either select any of the available
projects or suggest her own proposal and submit it to the instructor, who will check its
suitability. Below we present some examples of final projects:
• Hardware prefetchers. This project consists of implementing L2 hardware prefetchers. Students
are asked to implement different prefetchers: an n-block sequential prefetcher and a stride
prefetcher that detects constant strides from the cache accesses.
• Memory controller scheduling policies. In this project students must implement distinct





• NoC virtual channels. In this project the implementation of virtual channels in the NoC is
evaluated by analyzing how they affect the system performance and the area of the switches.
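
As a rough illustration of the two prefetchers named in the hardware prefetchers project, the Python sketch below computes the targets of an n-block sequential prefetch and detects a constant stride in a stream of accesses. It is a simplified model under stated assumptions, not simulator code: the names `sequential_prefetch` and `StrideDetector` are hypothetical, and the detector tracks a single global stream, whereas practical designs often keep one entry per load instruction.

```python
def sequential_prefetch(miss_addr, block_size, n):
    """Addresses of the n blocks that follow the block containing miss_addr."""
    base = (miss_addr // block_size) * block_size
    return [base + i * block_size for i in range(1, n + 1)]


class StrideDetector:
    """Predicts addr + stride once the same non-zero stride is seen twice in a row."""

    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def access(self, addr):
        """Record an access; return the predicted next address, or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                prediction = addr + stride
            self.last_stride = stride
        self.last_addr = addr
        return prediction


print(sequential_prefetch(0x1010, block_size=64, n=2))
detector = StrideDetector()
predictions = [detector.access(a) for a in (100, 228, 356, 400)]
print(predictions)
```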
















64B block size. As can be seen, the system performance improves with the block size up to 512B, but
increasing the block size beyond this value degrades performance.
Study of the baseline Prefetcher. After this initial study, designed to provide a global overview of
how prefetching works, students are asked to implement several prefetching mechanisms in the
simulator. To help students with this learning phase, implementation stubs of the simplest prefetching
technique (One Block Look-Ahead, or OBL) are provided to them together with a user’s guide. The guide
illustrates, step by step, how the core code must be modified to implement other prefetching mechanisms.
Basically, the provided stubs implement two main components: a queue for pending prefetches and the
pattern detection/triggering mechanism, which, on a cache miss, triggers new prefetches that are
inserted in the queue. The queue is looked up at the issue stage and, if it is not empty, a new prefetch
request is issued.
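
The two components just described, a pending-prefetch queue and a miss-triggered mechanism, can be modeled with the following minimal Python sketch of OBL. It is not the stub code given to students; the class name `OBLPrefetcher`, the queue capacity, and the duplicate filtering are assumptions made for illustration.

```python
from collections import deque


class OBLPrefetcher:
    """One Block Look-Ahead: on a miss to block B, queue a prefetch of block B+1."""

    def __init__(self, block_size=64, queue_capacity=8):
        self.block_size = block_size
        self.queue_capacity = queue_capacity  # assumed bound on pending prefetches
        self.pending = deque()

    def on_miss(self, miss_addr):
        """Triggering mechanism: enqueue the address of the next sequential block."""
        next_block = (miss_addr // self.block_size + 1) * self.block_size
        if next_block not in self.pending and len(self.pending) < self.queue_capacity:
            self.pending.append(next_block)

    def issue(self):
        """Issue stage: return the oldest pending prefetch, or None if the queue is empty."""
        return self.pending.popleft() if self.pending else None


prefetcher = OBLPrefetcher(block_size=64)
prefetcher.on_miss(0x1010)
prefetcher.on_miss(0x1020)  # same block as the previous miss: no duplicate is queued
prefetcher.on_miss(0x2000)
issued = [prefetcher.issue(), prefetcher.issue(), prefetcher.issue()]
print(issued)
```

Other prefetchers can reuse the same queue and `issue` path and replace only the logic inside `on_miss`.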
Implementation	 of	 more	 complex	 prefetchers.	 Once	 students	 become	 familiar	 with	 the	 code	 of	 the	













results refer to the grades achieved by students, while qualitative analysis includes i) students’
enrollment in the course, ii) understanding of the course topics, and iii) a survey that was answered by
students. These items are analyzed below.
Grading. One important aspect that gives information about the effectiveness of the proposal is the
marks obtained by students. Table I summarizes them for the year before the methodology was
introduced and the year in which it was applied. After applying the methodology, marks have
noticeably improved; the percentage of students with a grade of “B” doubles with respect to the
previous year and the number of students with a grade of “D” has significantly dropped.
The final marks consider both written exams and the assessment of the work performed in the
laboratory sessions. To develop the lab work, instructors provide the students with working guidelines. To
grade the lab sessions, students deliver to the professor a report containing the answers to a




Two important observations can be made. First, lab marks are slightly lower when
applying the methodology, due to the increase in complexity of the concepts worked on in the lab sessions.
Second, this fact does not negatively affect the final grade because students have significantly
improved their marks in written exams of similar difficulty.
Better understanding of the course topics. The increase in the final marks can be explained by the
methodology followed. The use of a realistic processor simulator framework offers two important
advantages: students can read the detailed simulator code of specific components, and they can study
performance interactions among the different system components. We would like to remark that
the same instructor works with students both in lectures and laboratory sessions. This way allows the




course is made to students. Most of them have really appreciated the proposed approach, since the
enrollment has grown by 2.6x (from 18 students in 2013 to 55 students for the next course in 2014, after
the methodology was applied), making the AAV (Advanced Architecture) Course one of the most
popular among students.
Survey. Finally, to provide qualitative results about this success, a survey was conducted to analyze
students’ perception of both the course and the methodology. The survey included twenty-five questions
classified in four main categories to evaluate possible motivation factors: i) use of a single tool, ii)
understanding complex microprocessor mechanisms, iii) hiding non-necessary structures, and iv)
bringing labs closer to real processors. The questionnaire was completed by the 22 students
belonging to one of the two lab groups. Table III shows an excerpt of the questions and the
provided marks. For illustrative purposes one question from each category is shown. Most of the
students selected either the highest mark or the second one regardless of the category to which the





This	paper	has	presented	a	new	methodology	 that	has	been	carried	out	during	 the	 last	 two	academic	










parts	 of	 the	 simulator,	 acquiring	 a	 deep	 knowledge	 of	 real	 hardware	 that	 has	 positively	 impacted	 on	















[2]	 Daniel	 Sanchez	 and	 Christos	 Kozyrakis.	 ZSim:	 fast	 and	 accurate	 microarchitectural	 simulation	 of	
thousand-core	systems,	ISCA,	2013,	pages	475-486	





[5]	 Todd	 Austin,	 Eric	 Larson	 and	 Dan	 Ernst.	 SimpleScalar:	 An	 Infrastructure	 for	 Computer	 System	
Modeling.	IEEE	Computer	vol.	35(2),	2002.	
	[6]	Trevor	E.	Carlson,	Wim	Heirman	and	Lieven	Eeckhout.	Sniper:	exploring	the	 level	of	abstraction	 for	






[9] S. Woo, M. Ohara, E. Torrie, J. Singh and A. Gupta. The SPLASH-2 programs: Characterization and
methodological considerations. In 22nd International Symposium on Computer Architecture (ISCA), pages
24–36, 1995.
