Improving achieved memory bandwidth from C++ codes on Intel® Xeon Phi™ Processor (Knights Landing) by Raman, Karthik et al.
Raman, K., Deakin, T., Price, J., & McIntosh-Smith, S. (2017). Improving achieved memory bandwidth from C++ codes on Intel® Xeon Phi™ Processor (Knights Landing). IXPUG Spring Meeting, Cambridge, United Kingdom.
This is the final published version of the article (version of record). It first appeared online via IXPUG at
https://www.ixpug.org/events/spring-2017-emea. Please refer to any applicable terms of use of the publisher.







































































































▪ Allocate using _mm_malloc(size, 2097152), i.e. aligned to a 2 MB boundary
OR 









▪ #pragma omp parallel for simd aligned(a : 2097152)
OR
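The two alignment steps listed on this slide (allocate on a 2 MB boundary, then tell the compiler about it) can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's code: `posix_memalign` is used as a portable stand-in for `_mm_malloc`, and the names `kAlign`, `alloc_aligned` and `triad` are hypothetical. It needs an OpenMP-enabled compiler for the pragma to take effect.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX)

// 2 MiB alignment, the value used on the slide.
constexpr std::size_t kAlign = 2097152;

// Allocate n doubles on a 2 MiB boundary.
// Assumption: posix_memalign stands in for _mm_malloc here.
// Returns nullptr on failure; release the memory with free().
double* alloc_aligned(std::size_t n) {
  void* p = nullptr;
  if (posix_memalign(&p, kAlign, n * sizeof(double)) != 0) return nullptr;
  return static_cast<double*>(p);
}

// STREAM-style triad with an int induction variable.
// The aligned clause promises the compiler that all three pointers
// sit on 2 MiB boundaries, so the asserts check that promise first.
void triad(double* a, const double* b, const double* c, double scalar, int n) {
  assert(reinterpret_cast<std::uintptr_t>(a) % kAlign == 0);
  assert(reinterpret_cast<std::uintptr_t>(b) % kAlign == 0);
  assert(reinterpret_cast<std::uintptr_t>(c) % kAlign == 0);
  #pragma omp parallel for simd aligned(a, b, c : 2097152)
  for (int i = 0; i < n; i++) {
    a[i] = b[i] + scalar * c[i];
  }
}
```

Passing a pointer that is not actually aligned to a loop carrying the `aligned` clause is undefined behaviour, which is why the allocation and the clause have to agree on the same value.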



















































































▪ Simple C implementation: the loop index i and the array access a[i] both use “int” for loop indexing and as the induction variable
e.g. for (int i = 0; i < array_size; i++) {a[i]= …}
▪ The Kokkos version was


































































































e.g. forall<policy>(index_set, [=] RAJA_DEVICE (long index){
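The RAJA fragment above hands the loop body a `long` index, whereas the simple C loop uses `int`. To make it visible where that type is chosen in a lambda-based layer, here is a minimal sketch. `forall_sketch`, `scale_int` and `scale_long` are hypothetical names for illustration only; this is not RAJA's or Kokkos' API, and it makes no claim about how either library is implemented.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for an abstraction layer's forall.
// The Index template parameter fixes the type of the induction variable
// that the compiler sees inside the loop body.
template <typename Index, typename Body>
void forall_sketch(Index begin, Index end, Body&& body) {
  for (Index i = begin; i < end; ++i) {
    body(i);
  }
}

// Index type matches the plain C loop: int all the way through.
void scale_int(double* a, const double* b, double s, int n) {
  forall_sketch<int>(0, n, [=](int i) { a[i] = s * b[i]; });
}

// Same kernel, but the int trip count is converted to long before it
// reaches the body, as in the long-index lambda shown on the slide.
void scale_long(double* a, const double* b, double s, int n) {
  forall_sketch<long>(0, n, [=](long i) { a[i] = s * b[i]; });
}
```

Both functions compute the same result; the only difference is the index type the loop body is compiled against, which is the detail the slides single out.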


























































                   Intel® Xeon Phi™ (Knights Landing)    Intel® Xeon® E5-2697v4 (Broadwell)
Model              Original GB/s    Optimized GB/s       Original GB/s    Optimized GB/s
McCalpin Stream    448              -                    129              -
OpenMP             302              438                  95               130
Kokkos             298              436                  96               129
RAJA               124              436                  96               129
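One way to read the table is to express each figure as a percentage of the McCalpin Stream result on the same platform. On Knights Landing, for example, RAJA goes from roughly 28% (124 of 448 GB/s) to roughly 97% (436 of 448 GB/s). The sketch below does that arithmetic for the Knights Landing columns; the struct and function names are hypothetical, and the numbers are copied from the table.

```cpp
#include <cassert>
#include <cstdio>

// One table row for a single platform: bandwidth before and after tuning.
struct Row {
  const char* model;
  double original_gbs;
  double optimized_gbs;
};

// Achieved bandwidth as a percentage of a reference bandwidth.
double percent_of_reference(double achieved_gbs, double reference_gbs) {
  assert(reference_gbs > 0.0);
  return 100.0 * achieved_gbs / reference_gbs;
}

// Print the Knights Landing rows relative to McCalpin Stream (448 GB/s).
void print_knl_summary() {
  const double stream_gbs = 448.0;
  const Row rows[] = {
      {"OpenMP", 302.0, 438.0},
      {"Kokkos", 298.0, 436.0},
      {"RAJA", 124.0, 436.0},
  };
  for (const Row& r : rows) {
    std::printf("%-6s %5.1f%% -> %5.1f%%\n", r.model,
                percent_of_reference(r.original_gbs, stream_gbs),
                percent_of_reference(r.optimized_gbs, stream_gbs));
  }
}
```

The same function applies to the Broadwell columns with 129 GB/s as the reference.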
Conclusions and Insights
▪ Out of the box, C++ and OpenMP codes struggle to get close to the peak achievable memory bandwidth.
  ▪ This is partly down to what the compiler knows at compile time.
  ▪ The compiler needs to know the alignment and trip counts to generate the best vector code.
  ▪ OpenMP can be used to give the compiler enough knowledge to do the right thing.
▪ Using an abstraction layer hides some detail away.
  ▪ Must ensure the abstraction layer holds enough information to generate the same best vector code.
▪ Key optimizations:
  ▪ Ensure memory alignment (align the data and tell the compiler).
  ▪ Remove abstraction layer loop iteration typecasts (avoid datatype conversions).
  ▪ Use non-temporal stores (for peak memory bandwidth; use only where applicable).
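As an illustration of the last key optimization, the sketch below writes the output of a scale kernel with non-temporal (streaming) stores. It is a minimal sketch under stated assumptions, not the code behind the slides: it uses 128-bit SSE2 intrinsics for portability, whereas a Knights Landing build would normally rely on wider vectors or on compiler-generated streaming stores. The function name `scale_stream` is hypothetical. On targets without SSE2 it falls back to ordinary stores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_stream_pd, _mm_loadu_pd, _mm_mul_pd, _mm_sfence
#endif

// a[i] = scalar * b[i], writing a with streaming stores where available.
// Assumptions: a is 16-byte aligned and n is even.
void scale_stream(double* a, const double* b, double scalar, std::size_t n) {
  assert(n % 2 == 0);
  assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
#if defined(__SSE2__)
  const __m128d s = _mm_set1_pd(scalar);
  for (std::size_t i = 0; i < n; i += 2) {
    const __m128d v = _mm_mul_pd(s, _mm_loadu_pd(b + i));
    _mm_stream_pd(a + i, v);  // store that bypasses the cache hierarchy
  }
  _mm_sfence();  // make the streaming stores visible before returning
#else
  // Fallback: ordinary stores, same result.
  for (std::size_t i = 0; i < n; ++i) {
    a[i] = scalar * b[i];
  }
#endif
}
```

Streaming stores suit arrays that are written once and not read again soon, which matches the "use only where applicable" caveat above.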
IXPUG	Annual	Spring	Conference	2017 17
