Study and evaluation of an Irregular Graph Algorithm on Multicore and
  GPU Processor Architectures by Nagpal, Varun
Study and evaluation of an Irregular Graph Algorithm
on Multicore and GPU Processor Architectures
Master’s Thesis submitted to the
Faculty of Informatics of the Università della Svizzera Italiana
in partial fulfillment of the requirements for the degree of
Master of Science in Embedded Systems Design
at
Advanced Learning and Research Institute (ALaRI)
in collaboration with Politecnico di Milano and ETH Zürich
presented by
Varun Nagpal
under the supervision of
Prof. Mariagiovanna Sami
co-supervised by
Dr. Leandro Fiorin
January 2013
ar
X
iv
:1
60
3.
02
65
5v
1 
 [c
s.D
C]
  8
 M
ar 
20
16

I certify that except where due acknowledgement has been given, the work presented in
this thesis is that of the author alone; the work has not been submitted previously, in whole
or in part, to qualify for any other academic award; and the content of the thesis is the result
of work which has been carried out since the official commencement date of the approved re-
search program.
Varun Nagpal
Lugano, 14 January 2013
i

I dedicate this work to my late grandfather Shri. Vidya Sagar
Nagpal ( 15th June, 1921 to 31st Aug, 2012 ). For me, he was the
highest source of discipline, determination and undying
enthusiasm. I miss him dearly
iii
iv
Karmanye Vadhikaraste, Ma phaleshou
kada chana, Ma Karma Phala Hetur
Bhurma tey Sangostva Akarmani - Do
your duty without being desirous of the
results. The results are not in your hands,
so you shall not think of them. Your
attachment should not be to the results or
to the non-performance of action
Sankha Yoga - chapter-2, verse 47, Gita
v
vi
Abstract
To improve performance of an application over the years, software industry assumed that the
application would automatically run faster on new upcoming processors. This assumption
mainly relied on ability of the microprocessor industry to extract more ILP(Instruction level
parallelism) in single-threaded programs through technology improvements and processor ar-
chitecture innovations ( higher clock rates, greater transistor density due to transistor scaling,
deeper pipelines etc.). Consequently, software programmers have traditionally focused on writ-
ing correct and efficient sequential programs and have rarely needed to understand hardware
or processor details.
In the last few years however, this reliance of the software industry on the processor industry
to automatically extract and scale application performance with new generations processors
has hit a wall due to processor industry moving towards Chip Multiprocessors. This sudden
change in trend was motivated mainly by factors of technology limitations( exploding power,
ILP flattening ), costs and changing market trends ( increased popularity of portable devices ).
As a result, most upcoming processors in desktop, server and even embedded platforms these
days are Multicore or Chip Multiprocessors(CMP). Technology scaling continues to date thanks
to Moore’s law, although now in the form of more processor cores per CMP. Consequently, the
focus of programming is slowly moving towards multi-threading or parallel computing. Writing
scalable applications with increasing processor cores cannot be done without knowing archi-
tectural details of the underlying Chip Multiprocessor.
One area of applications which poses significant challenges of performance scalability on CMP’s
are Irregular applications. One main reason for this is that irregular applications have very little
computation and unpredictable memory access patterns making them memory-bound in con-
trast to compute-bound applications. Since the gap between processor & memory performance
continues to exist, difficulty to hide and decrease this gap is one of the important factors which
results in poor performance acceleration of these applications on chip multiprocessors.
The goal of this thesis is to overcome many such challenges posed during performance ac-
celeration of an irregular graph algorithm called Triad Census. We started from a state of the
art background in Social Network Analysis, Parallel graph algorithms and Triad Census graph
algorithm. We accelerated the Triad Census algorithm on two significantly different Chip Mul-
tiprocessors: Dual-socket Intel Xeon Multicore (8 hardware threads/socket) and 240-processor
core NVIDIA Tesla C1060 GPGPU(128 hardware threads/core). Intel Multicores are designed
mainly for general purpose and irregular applications, while GPGPU’s are designed for data
parallel applications with predictable memory access. We then looked at techniques for ef-
vii
viii
ficiently implementing graph data structures and optimizations targeted for Irregular appli-
cations(memory and control flow). During implementation and experiments we looked at the
problem from different points of view: from the algorithmic perspective, code optimization per-
spective, parallelization perspective and processor-aware perspective. We then experimented
with different algorithmic, parallelization, load balancing and architecture-aware approaches
to accelerate this algorithm. The experimental results obtained for Multithreaded Triad Census
implementation on our Intel Multicore Xeon system shows performance speedups (w.r.t base-
line sequential) of maximum 56x , average 33x and minimum 8.3x for real world graph data
sets. On NVIDIA Tesla C1060 GPGPU, we were able to match almost equally the Multicore
results - 58.4x maximum, 32.8x average and 4.2x minimum speedups w.r.t baseline sequen-
tial. In terms of raw performance, for the graph data set called Patents network, our results
on Intel Xeon Multicore(16 hw threads) were 1.27x times faster than previous results on Cray
XMT(16 hw threads) while results achieved on GPGPU were comparatively slower(0.72x). To
the best of our knowledge, this algorithm has only been accelerated on supercomputer class
computer named Cray XMT and no work exists that demonstrates performance evaluation and
comparison of this algorithm on relatively lower-cost Multicore and GPGPU based platforms.
Acknowledgements
No man is an island - John Donne. This work wouldn’t have been possible without encourage-
ment and support of so many people. Firstly I wish to extend my sincere thanks to my advisor
Prof. Mariagiovanna Sami and co-advisor Dr. Leandro Fiorin for allowing me to work on this
thesis topic and also for their support and guidance throughout this work. Special thanks to Dr.
Oreste Villa at PNNL, US for proposing the thesis topic and for replying patiently to my emails.
I would also like to thank Umberto Bondi, Janine, Elisa and other members of the ALaRI staff
for their support and encouragement throughout the master studies. Special thanks to Prof.
Antonio Carzaniga for allowing me to attend his lectures on algorithms which helped me im-
mensely in many ways during this work. I would also like to thank Prof. Vladimir Batagelj
at University of Ljubljana for very patiently answering my queries. Special thanks to Dorian
Krause at ICS for arranging access to NVIDIA Tesla C1060 GPGPU and patiently replying to
emails.
Learning Embedded Systems Design at ALaRI has been an enormous learning experience for
me which I could have hardly imagined before coming here. In many ways, my motivation
and enthusiasm for Computer Architecture has been shaped by the inspiration and exposure to
lectures and lab sessions from ALaRI faculty and so, I feel that I also owe sincere acknowledge-
ment to them.
Furthermore, my studies and time spent in Switzerland wouldn’t have been worthwhile with-
out the love and support of all friends I made during this time at ALaRI - Darayus, Anubhav,
Chetan, Basundhra, Jane, Tanish, Ameya, Amrit, Gaurang, Naresh, Sanathan Krishna, Kanishk,
Sudhanshu, Prasanth, Avinash, Nisha, Ekaterina, Varun, Vasudha, Ankur. Lastly, I am at loss
of words to acknowledge how much I owe to my parents and family for their constant encour-
agement, and support throughout my academic career - to my grandfather for being source for
undying enthusiasm, to my father and mother for believing in me and for their unconditional
care and love.
ix
x
Contents
Contents xi
List of Figures xiii
List of Tables xv
1 Introduction 1
1.1 Thesis Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Contribution of thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Outline of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 7
2.1 Social Network Analysis(SNA) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Triad Census Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Graph Data Structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Irregularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5 Efficiently Implementing Irregular Graph Algorithms . . . . . . . . . . . . . . . . . 16
2.5.1 Efficiently implementing adjacency lists . . . . . . . . . . . . . . . . . . . . . 16
2.5.2 Optimizing irregularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6 Multicore Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.6.1 Platform Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.6.2 Intel Nehalem(Westmere-EP) Micro-architecture . . . . . . . . . . . . . . . 24
2.7 Tesla GPGPU Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.7.1 Platform Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.7.2 NVIDIA Tesla GPGPU Processor . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.7.3 Differences between Multicore and GPGPU processors . . . . . . . . . . . . 32
3 Related Work 35
3.1 Acceleration of Triad Census on Cray XMT . . . . . . . . . . . . . . . . . . . . . . . . 36
4 Evaluation of Triad Census on Intel Multicores 43
4.1 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Sequential Triad Census . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.1 Implementation details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.2 Baseline results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.3 Performance profiling and analysis . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.4 Optimizing Sequential Triad Census . . . . . . . . . . . . . . . . . . . . . . . 50
xi
xii Contents
4.2.5 Summary of sequential optimizations . . . . . . . . . . . . . . . . . . . . . . 53
4.3 Multithreaded Triad Census . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3.1 Graph partitioning and Load Balancing strategy . . . . . . . . . . . . . . . . 55
4.3.2 Using Canonical Dyad(non-uniform distr.) load balancing(v0.6) . . . . . . 60
4.3.3 Using Canonical Dyad(uniform distr.) load balancing(v0.7) . . . . . . . . 63
5 Evaluation of Triad Census on Tesla GPGPU 73
5.1 Design and Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.1.1 Graph Data Pre-processing on host CPU . . . . . . . . . . . . . . . . . . . . . 73
5.1.2 CUDA Kernel implementation of Triad Census . . . . . . . . . . . . . . . . . 76
5.2 Initial experiments and Speedup Results . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3 Profiling Triad Census CUDA kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.4 Optimizing Triad Census CUDA implementation . . . . . . . . . . . . . . . . . . . . 84
5.4.1 Using shared memory to calculate Triad Census per Thread block . . . . . 84
5.4.2 Discussion on other optimizations . . . . . . . . . . . . . . . . . . . . . . . . 85
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6 Conclusion and future work 89
6.1 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
A Concurrency and Parallelism 93
B Processor Architectures and Micro-architectures 103
C Graph theory - Small world networks 113
D CUDA Programming Model 115
D.0.1 Thread Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
D.0.2 Thread Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
D.0.3 Thread communication and synchronization . . . . . . . . . . . . . . . . . . 118
D.0.4 Memory Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
E Counters for computeprof profiler 123
F Business Plan 125
Glossary 127
Bibliography 129
Figures
2.2 Dyad types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Types of Triads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Subquadratic Triad Census Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.5 Triad Code Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.6 Directed graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.7 Adjacency list representation of a digraph . . . . . . . . . . . . . . . . . . . . . . . . 14
2.8 Normal linked list and Unrolled(blocked) list . . . . . . . . . . . . . . . . . . . . . . 17
2.9 Adjacency array representation of a digraph. Top array V is the row index array
of length Size(G) + 1. Lower array E is the column or edge array of length
Order(G) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.10 Dual socket configuration of two Quad Core Intel Xeon E5630 processors with
two hardware threads per core. Each Xeon chip can access its local DDR RAM
directly through its IMC. This organisation makes the system a 2-way cc-NUMA
with shared address space. Each Xeon processor chip is organised as Core and
Uncore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.11 Tesla C1060 GPGPU connected through PCIe 2.0 @ 8GB/s to Intel Quad Core(2core
x 2socket) Xeon X5260 @3.33 GHz, per core L1-32 KB data cache, Per Socket
6MB L2 Unified Cache shared between two cores, 1333 MHZ FSB, and 8 GB
DDR2@667 MHz RAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.12 Tesla C1060 processor die (T10/GT200 GPU Architecture). Source: NVIDIA . . . 31
2.13 Single core CPU and GPGPU architecture comparison. Source: NVIDIA CUDA C
Programming guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1 Top Left: shows a digraph of order=7 and size=7. Top Right: shows key to
bottom two diagrams. Bottom Right: shows array of neighbour list pointers
indexed by vertex id. Bottom left: shows array of Vertex objects indexed by
vertex id . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Modified TriadCode Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Performance speedup of optimized sequential triad census for actors graph net-
work on Intel Core 2 Duo with baseline execution time of 103 sec. . . . . . . . . . 55
4.4 Distributed Task Queue generation algorithm based on Canonical Dyad(non-
uniform distr.) load balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5 Distributed Task Queue generation algorithm based on Canonical Dyad(uniform
distr.) load balancing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.6 Subquadratic Triad Census Algorithm based on distributed task queues . . . . . . 60
xiii
xiv Figures
4.7 Distributed task queue data structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2 Comparison of our results on Intel Xeon Multicore and Tesla GPGPU with previ-
ous work on Cray XMT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3 Comparison between GPGPU and Multicore w.r.t baseline sequential . . . . . . . 88
D.1 CUDA Program - interleaving of host code and kernel invocations . . . . . . . . . 117
D.2 Multidimensional grid and thread block . . . . . . . . . . . . . . . . . . . . . . . . . 117
D.3 Scheduling of warps and thread blocks. Source: Mark Silberstein, Technion,
"Lectues notes - Programming massively parallel processors" . . . . . . . . . . . . 118
D.4 CUDA memory model [1] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
Tables
2.1 Adjacency matrix representation of a graph . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Memory hierarchy optimization summary . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 Details of Multicore Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1 Comparison of our work to previous work on accleration of Triad Cenus on Cray
XMT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.1 Graph data sets used for evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Compilation settings for Sequential Triad Census implementation . . . . . . . . . 46
4.3 Baseline structure characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4 After structure fitting (decreasing struct size, increasing struct size) . . . . . . . . 51
4.5 Performance improvement after applying optimization(Actors) . . . . . . . . . . . 51
4.6 Sequential Triad Census performance speedup and memory overhead results on
Intel Xeon Multicore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.7 Sequential Triad Census execution time(sec) and memory consumption(MB) re-
sults on Intel Xeon Multicore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.8 Task abstractions, graph partitioning, load balancing strategies. Assuming T
threads. n vertices, N(u) neighbours of vertex u . . . . . . . . . . . . . . . . . . . . 58
4.9 Execution Time(seconds) breakup of Multithreaded Triad Census algorithm(v0.6)
on Intel Xeon Multicore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.10 Multithreaded Triad Census speedup(v0.6) w.r.t baseline(v0.1) results on Intel
Xeon Multicore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.11 Multithreaded Triad Census Execution Timing(seconds) results(v0.6) on Intel
Xeon Multicore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.12 Comparison of load balance strategy based on aggregate NsetSize(all dyads) . . 63
4.13 Multithreaded Triad Census with improved load balancing [Canonical Dyad(uniform
distr.)] (v0.7) - performance speedup w.r.t baseline(v0.1) and Execution Tim-
ing(seconds) breakup with 16 threads on SMT Intel Xeon Multicore . . . . . . . . 66
4.14 Multithreaded Triad Census with improved load balancing [Canonical Dyad(uniform
distr.)] (v0.7) - best speedup w.r.t baseline(v0.1) and Execution Timing(seconds)
breakup with maximum threads(found experimantally) on SMT Intel Xeon Mul-
ticore . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.1 Initial GPGPU memory requirements for different graph data sets for 64 bit un-
signed integer type data i.e. range of 0 to 18,446,744,073,709,551,615 . . . . . 75
xv
xvi Tables
5.2 Final GPGPU memory requirements for different graph data sets for 64 bit un-
signed integer type data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3 Thread hierarchy configuration suggested by Occupancy calculator for 50% oc-
cupancy(maximum with 31 registers) . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.4 Initial performance results of Triad Census on Tesla GPGPU with 256 thread-
s/block and 120 thread blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.5 Memory Coherency statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.6 Global Memory Load/Store Transactions . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.7 Global Memory Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.8 Instruction profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.9 Branch profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.10 Memory load store profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.11 Speedups obtained by using shared memory to calculate Triad Census per Thread
block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
D.1 CUDA thread communication and synchronization . . . . . . . . . . . . . . . . . . . 119
D.2 CUDA Variable Qualifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
D.3 Memory organization of Tesla C1060 T20/GT200 with CUDA compute capability
1.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
E.1 Profile counters for computeprof tool . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Chapter 1
Introduction
The diminishing returns of single threaded performance(Instruction Level Parallelism or ILP)
derived from uniprocessor architectural and micro-architectural techniques as well as the grow-
ing chip power dissipation have in recent years led the general purpose and embedded com-
puting industry to shift the processor design from single-threaded processors to Multithreaded
Chip Multiprocessors(CMP) called Multicore or Manycore Processor[2, 3]. In such CMP’s,
multiple processor cores are integrated on a single chip die and each processor core provides
one or more hardware threads that can run one or more software threads either in interleaved
concurrency or true concurrency. As a result, it is possible to get work done faster on such
CMP’s by distributing work onto the multiple cores.
As a side effect, this shift in processor design from performance-driven computing to perfor-
mance per watt computing has lead software development to move towards concurrency and
has opened up new areas of research in parallel programming, automatic parallelization frame-
works, compilers etc. In comparison, the challenges for efficient programming of Embedded
Multicore processors(also called Multiprocessor SoC( MPSoC )) are somewhat greater mainly
due to the additional constraints posed in embedded systems such as hard real-time require-
ments, ultra low power demand for battery-driven devices, reliability, security, heterogeneous
architectures( multiple functions on a SoC) and lack of standard tools(compilers, debuggers
etc.).
Further, these CMP based Multicore and Manycore GPGPU(General Purpose GPU) plat-
forms are now easily accessible and affordable. As such platforms become more ubiquitous
and powerful due to continued CMOS scaling, many applications which have been limited to
High Performance Computing domain will slowly make their way into the mainstream con-
sumer applications on desktops, servers and portable embedded devices.
One application area which requires a large amount of parallel processing power is Graph Algo-
rithms. Many real world data can be most naturally modelled as a graph as it provides intuitive
abstraction for representing objects and relationships between them. For example, the entire
web on the internet can be represented as a graph with vertices representing web-pages and
edges representing the hyperlinks connecting those web-pages. Internet search engines often
use such graph abstractions and algorithms for processing search queries. Social networking
sites use graph abstractions to maintain their social networks. Understanding the properties
and dynamics of such social networks is important for example to target marketing. EDA tools
1
2used for designing VLSI circuits use graph abstraction for representing net-lists(circuit con-
nectivity description) where vertices represent circuit elements(ALU, Bus, Memories etc.) and
edges represent nets (interconnections between circuit elements). These graph abstracted net-
lists can then be explored and optimized for various figures of merit (power, performance, area,
testability, fault-tolerance, floor-planning etc.). Biological networks represented as graphs are
used to study the spread and origin of communicable diseases. Transportation companies use
graph algorithms to find the shortest possible routes for cargo planes and ships. Fault tolerance
studies use bi-connected component graph algorithm to understand robustness of the system
when one or more parts of system fail.
As a result, there exist many graph algorithms [4, 5] that process real world data-sets. Ex-
amples of some graph algorithms are: Breadth-first search, Depth-first search, Minimum span-
ning tree(Prim’s, Kruskal’s and Boruvka’s algorithm), Single source shortest path (Dijikstra’s,
Bellman-ford algorithm), All pairs shortest path, Bi-connected components(Tarjan-Vishkin al-
gorithm). Studies and uses of such parallel graph algorithms have focused on computational
science and engineering applications such as mesh structures, unstructured grids, linear/PDE
solvers, N-body methods, VLSI layout, compiler design, system modelling, event-driven sim-
ulation, wireless communication, distributed networks, social network analysis, data mining,
bio-informatics and security. A graph can be dense or sparse depending upon the number of
edges. For a graph with n-vertices and m-edges, the graph is
Sparse Graph, iff m=O
 
f (n)

Dense Graph, iff m=O
 
n2

Graph algorithms for dense graphs lend themselves to efficient parallel realization and
speedups in both theory and implementation on HPC platforms. On the other hand, graph
algorithms for sparse graphs have been traditionally known to be difficult to accelerate on
CMP’s and to date this continues to be a challenging exercise. One of the inherent problems
with sparse graph algorithms is that theoretically achievable sequential and parallel perfor-
mance cannot be easily realized in practice. There are many reasons for this:
• The gap between parallel models(like PRAM) assumed by the graph algorithm and cur-
rent parallel computer architectures that masks machine dependent details(Cache mem-
ory hierarchy, Communication costs, Distributed Vs Shared memory architectures) crucial
for obtaining good speedups. Bridging this gap requires novel algorithm and data struc-
ture design techniques which are aware/oblivious of the target processor architecture.
• Some graph algorithms are combinatorial in nature and thus require an immense amount
of computation and communication. Other graph algorithms are more memory-intensive
and have very little computation such that they spend most of the time either in memory-
traversal or communication(through data sharing or message passing). Irregular sparse
graphs algorithms belong to this latter category and are characterized by poor data lo-
cality and irregular memory access patterns which results in poor cache performance.
• Few organized techniques exists which can evenly and efficiently partition an irregular
sparse graph problem instance onto available processors. The problem of effectively load-
balancing such workloads across PE’s(or hw threads) of a Chip Multiprocessor is another
challenging aspect.
3 1.1 Thesis Objectives
HPC class Supercomputers are still the typical machines used for efficient acceleration of
many such graph algorithms. With growing trend for Multicore desktop and embedded plat-
forms, it is very likely that in near future many graph algorithm based applications will be
ported to these platforms. The challenges for fast execution of graph algorithms on these
platforms are somewhat higher because these platforms still cannot match Supercomputing
capabilities of HPC domain. And therefore, obtaining good execution speedups for such irreg-
ular sparse graph algorithms on these emerging Multicore CPU’s and Manycore GPGPU’s is an
important problem.
1.1 Thesis Objectives
Main objective of this thesis work is to study and accelerate a Supercomputing-class irregular
sparse graph algorithm called Triad census [6, 7, 8]. This algorithm was previously accelerated
on Cray XMT Supercomputer [9]. The Multicore processor architectures used in our work
for accelerating this algorithm includes Dual-socket Intel Xeon NUMA processor(16 hardware
threads) and 240-core NVIDIA Tesla C1060 GPGPU processor(128 hardware threads per core).
The other objectives of this thesis work are as follows:
• Study relevant work on state of the art in various parallel graph algorithms to understand
the issues involved in creating efficient parallel realizations.
• Study state of the art in the Multicore and Manycore processor architectures to under-
stand how its various architectural features can be effectively exploited to improve appli-
cation performance.
• Improve memory complexity of targeted graph algorithm by exploring architecture aware
and architecture oblivious techniques.
• Explore efficient algorithm and data structure design techniques - both processor archi-
tecture independent as well as dependent - to improve performance
• Explore efficient data partitioning and load balancing strategies which are fast, scalable,
having low memory overheads and possibly amenable to parallelization.
• Finally, get good overall speedup(strong/weak scaling) for the Triad Census algorithm
on targeted Multicore and GPGPU architectures.
1.2 Contribution of thesis
The contributions of the thesis are the following:
1. Main contribution of thesis: Compared to previous results on the Cray XMT(16 hw threads),
for a graph data set called Patents Network, our results(raw performance) on Intel Xeon
Multicore(16 hw threads) were slightly better (1.27x times faster) while the resulting
performance achieved on GPGPU was comparatively slower(0.72x). To the best of our
knowledge, this algorithm has only been accelerated on a supercomputer class machine
named Cray XMT [9] and no work exists that demonstrates performance evaluation and
comparison of this algorithm for relatively low-cost commodity processors such as Intel
Xeon Multicore and NVIDIA Tesla C1060 GPGPU.
4 1.3 Outline of the thesis
2. Performance results on our platforms: On our Intel Xeon Multicore platform, with re-
spect to baseline performance results for the Sequential Triad Census algorithm imple-
mentation, the Multithreaded Triad Census implementation resulted in maximum 56x,
average 33x and minimum 8.3x performance speedups. In comparison, the CUDA
Multithreaded implementation of Triad Census algorithm on our NVIDIA Tesla C1060
GPGPU platform resulted in nearly competitive performance results - maximum 58.4x,
average 32.8x and minimum 4.2x speedups.
3. Graph data structure: On Intel Xeon Multicore processor, we improved performance(from
1.3x upto 6x faster) by replacing linked list based implementation of adjacency lists with
cache blocked linked lists. On GPGPU, we evaluated algorithm performance for graph
represented using CRS(Compressed Row Sparse) format.
4. Control flow optimizations: We identified two branch optimizations and used pre-computation
to improve performance upto 1.23x.
5. Graph data sets: Our work used only real world data graph sets(Amazon, Google, Slash-
dot, Patents etc.) in comparison to some synthetic data sets used in previous work. Eval-
uation of algorithm on real world graph networks increases reliability of our speedup
results.
6. Thread coupling: Previous work shared a single census array among threads which re-
quired synchronizing concurrent updates to it using fast atomics. In our work, we ob-
tained better results(2x relative speedup on GPGPU) by creating per thread local census
array(decoupled threads), which are later on merged to get the final census array.
7. Graph partitioning and Load balancing: Previous work used distributed task queues ap-
proach for graph partitioning and load balancing. In our work on Multicores, we used
this same graph partitioning strategy but improved the load balancing. Also, on GPGPU’s
we demonstrated use of centralized task queue based approach for graph partitioning.
1.3 Outline of the thesis
This chapter was an introduction to the graph processing on CMP’s, thesis objectives and con-
tributions. Following is the outline of the remaining parts of thesis:
Chapter 2 This chapter discusses necessary background to set the tone of the thesis. It first
discusses Social network analysis(SNA) and the targeted algorithm - Triad Census. Then some
related topics are discussed namely graph data structures, irregularity and techniques for effi-
ciently implementing adjacency lists based graph representations. The chapter also discusses
optimizations to tackle memory and control flow irregularity. Thereafter, the chapter presents
details of the Multicore and GPGPU platforms on which the Triad Census algorithm is im-
plemented, optimized and accelerated. Last part of chapter discusses important differences
between Multicore and GPGPU processors.
Chapter 3 This chapter starts with related work on various Parallel Graph Algorithm imple-
mentation and evaluation on CMP’s along with some discussion on state of the art graph data
structures used in such implementations. Then we compare in detail our work with previous
5 1.3 Outline of the thesis
relevant work (Acceleration of triad census algorithm on Cray XMT Supercomputer) and present
specific contributions of our work .
Chapter 4 Presents evaluation of Triad Census algorithm on Intel Multicores. It first discusses
the adopted methodology for sequential and parallel implementation of Triad Census. Then it
discusses details of implementation, optimization and acceleration results for sequential and
multithreaded Triad Census algorithm.
Chapter 5 Presents evaluation of Triad Census algorithm on NVIDIA Tesla GPGPU platform.
It describes the details of implementation, profiling, optimization and results of CUDA based
Multithreaded implementation of Triad Census algorithm.
Chapter 6 This chapter discusses the results, conclusion and prospective future directions.
Appendix A Discusses definitions of frequently used terminology in Concurrency and Paral-
lelism.
Appendix B Discusses definitions of frequently used terminology in Processor Architectures
and Micro-architectures.
Appendix C Discusses background topics in Graph theory related to small world networks.
Appendix D Discusses details of NVIDIA’s CUDA Programming model used for programming
NVIDIA based GPGPU’s.
Appendix E Presents definitions of different profiling counters for NVIDIA profiler tool called
computeprof.
Glossary This section describes acronyms.
6 1.3 Outline of the thesis
Chapter 2
Background
In this chapter, we discuss necessary background topics to set the tone of the thesis. We first
discuss Social network analysis (SNA) along with the irregular sparse graph algorithm that we
have chosen to accelerate - Subquadratic Triad Census. After this, we give overview of basic
graph data structures and briefly describe problem of Irregularity. Then we present techniques
that can be used for efficient implementation of adjacency lists based graph representations.
Next, the chapter presents details of the Multicore and GPGPU platforms on which the Triad
Census algorithm is implemented, optimized and accelerated. Last part of chapter discusses
important differences between Multicore and GPGPU processors. The topics discussed in this
chapter cover necessary background to understand discussions in later chapters.
2.1 Social Network Analysis(SNA)
Social network analysis or SNA deals with analysis of social network structures. It focuses on
social entities and ties or relationship between them. Relationship could be friendship, beliefs,
common interest, economic interactions etc. Examples of social structures are communities,
market, societies, and countries. In SNA vocabulary, any entity involved in a social network is
called an Actor. An actor can refer to a person, organization, nation etc. basically any entity
that can be conceptualized as an object with behaviour and which can interact with other ob-
jects.
SNA involves conceptualizing, modelling, representing, visualizing such social structures and
then detecting and interpreting patterns of social ties among the actors. It also deals with
measurement of characteristics or metrics of social networks. Examples of such metrics are
centrality (which actors are best connected to others, or have the most influence), between-
ness centrality of a vertex v(the number of shortest paths between all pairs of vertices in the
network going through v), connectivity(how actors are related to one another), cohesion(how
strong is the relationship between connected group of actors), transitivity(Is enemy of your
friend is also your enemy?), clustering(are two neighbours of a vertex are also neighbours
themselves), prestige, trust etc.
Social network analysis has also been linked with biology, communication studies, organi-
zational studies, internet social networks, sociology, geography, finance and terrorism studies
(Figure 2.1). As a result new forms of network analysis have emerged. For example SNA is
7
8 2.1 Social Network Analysis(SNA)
used in epidemiology to help understand how patterns of human contact aid or inhibit the
spread of diseases in a population such as HIV1 and obesity.
(a) (b)
(c) (d)
Figure 2.1. Examples of various social networks: (a) My linkedin professional network (b)
Soccer pass network (c) Human protein network (d) Livejournal social network
1http://www.respondentdrivensampling.org/reports/sociological_focus.pdf
9 2.2 Triad Census Algorithm
Figure 2.2. Dyad types
Graph abstractions from graph theory provide an intuitive way of representing such so-
cial networks and applying various algorithms to capture social network metrics. Actors are
represented as vertices and relations between them are represented as edges or arcs. To make
networks out of these graph representations, all real world information associated with actors
and ties are associated with the vertices and edges of graph. For example consider a social
graph of a soccer team2 during a match (Figure 2.1), with vertices representing soccer players
and arcs representing a pass from one player to another. By associating information such as
player names with vertices and number of passes made between two players with thickness of
the arc, one can make a social network out of it. This network can be analysed for various
metrics like which players made or received the most passes, percentage of passes made or
received by defenders, which sequence of passes had a higher probability of scoring a goal, etc.
Many tools are available for analysis and visualization of such social networks namely Pajek,
GUESS, ORA, Cytoscape, NetworkX, SNAP and igraph.
2.2 Triad Census Algorithm
Triad census based analysis are statistical methods that study and examine characteristics of
social networks by studying properties of local 3-vertex-subgraphs present in a graph. The
description of triad and dyad census in SNA was first studied and proposed by [10]. They
introduced the concept of dyad and triad census - number of 2-subgraphs and 3-subgraphs in a
graph and also generic k-subgraph census. [6] studied triad census for various random graph
distributions.
A triad is defined as a digraph with 3 vertices and zero or more arcs connecting them. Similar
to triad, a dyad is a pair of vertices which have zero or more arcs between them. There can
be four types of dyads(Figure 2.2): 1 - null dyad, 2 - asymmetric dyads and 1 - mutual dyad.
Only null dyad is a disconnected dyad, other dyad types are connected dyads. One can think of
a triad to be made up of three dyads and therefore 64-possible triads(4*4*4, as each pair of
vertices can be one of four possible dyads) can be formed from a strict digraph(no loops) of
order three. Further, these 64 triads can be merged into 16-unique isomorphic triads. All triads
can be classified into three triad classes namely null, dyadic and connected triads
Figure 2.3 shows the isomorphic types of triads numbered 1 to 16 and classification of these
triads. Established convention of naming the 16 triad types uses a 3-digit code suffixed with
an optional triad property: MAN< P>, where M is number of mutual dyads, A is number of
asymmetric dyads, N is the number of null dyads, P is an optional triad property.
2http://scientometrics.wordpress.com/
10 2.2 Triad Census Algorithm
Figure 2.3. Types of Triads
Each of the digits M,A,N can range from 0 to 3. P can take values U-up, C-cyclic, D-down,
and T-transitive. So triad types 1-16 are classified and named as follows:
• Null triads: 1- 003
• Dyadic triads: 2 - 012, 3 - 102
• Connected or complete triads: 4 - 021D, 5 - 021U, 6 - 021C, 7- 111D, 8 - 111U, 9 - 030T,
10 - 030C, 11 - 201, 12 - 120D, 13 - 120U, 14 - 120C, 15 - 210, 16 - 300
Only null and dyadic triads are disconnected, all other triads are connected. Triad census
algorithm is a combinatorial problem that aims to count number of each of these 16 unique iso-
morphic or 64 triads in a given graph i.e. census of triads. By finding out frequencies of these
16 or 64 triad types, different characteristics associated with social network can be examined
such as transitivity, cyclicity, mutuality (reciprocity) etc. Triadic methods have been used for
studies in psychology, network evolution, business networks, neuroscience and security.
For a digraph with n− ver t ices there can be T = nC3 triads. Naive algorithm for calculating
triad census will take too much time - O
 
n3

time complexity. Faster algorithms for calcu-
lating triad census have been proposed. [11] proposed an adjacency-matrix based algorithm
with quadratic complexity O
 
n2

. [8] proposed algorithm with sub-quadratic time complexity
O
 
m∆

where m = O
 
n.k(n)

is the number of arcs of the graph and ∆ is maximum degree
of the graph.
11 2.2 Triad Census Algorithm
We now introduce some notations and then explain the working of Subquadratic triad cen-
sus algorithm. {} denotes set notation, () denotes ordered pair notation. Consider a strict
digraph (no loops) G = [V, E] where V = {1,2...n} is set of vertices and E ⊆ V xV the set of
arcs (directed) i.e. E = {(u, v), ∃u, v ∈ V}. Order of graph n = |V |, size of graph m = |E|.
N is an array of neighbour lists indexed on vertex such that ∀u ∈ V , N(u) is the set of open
neighbourhood of vertex u i.e. N(u) = {v ∈ V |{u, v} ∈ E}. Note that N(u) contains all vertices
v adjacent to u in undirected sense.
The Subquadratic triad census algorithm(Figure 2.4) works by computing count of three triad
types separately. It first counts all dyadic triads(T2) and connected triads(T3). Then it counts
number of null triads T1 = T − T1− T2. Counting of the triads is basically based upon dyads in
the digraph. The algorithm starts with initializing Census array to zero(lines 1-3). It then runs
a for loop for each vertex of graph(lines 5-23). Within this loop, dyadic(T2) and connected
triads(T3) are counted.
• Counting number of dyadic triads(T2): a connected dyad (u, v) with u < v (line 7),
contributes number of dyadic triads equal to all vertices of the graph except u and v them-
selves and except all vertices which are in open neighbourhood of the connected dyad i.e.
n− |u, v| − |S| where S = N(u)∪ N(v) \ u, v is the set of vertices in open neighbourhood
of connected dyad (u,v)(see lines 8,14). Depending upon if the connected dyad (u, v) is
asymmetric or mutual, all dyadic triads contributed by it are either of triad type 2-012
or 3-102 (line 9 to 13). The condition u < v (line 7) is necessary for canonical selection
of a dyad. By canonical it means every non-null triad contributed by the connected dyad
must be counted only once during different iterations of the algorithm. For example the
connected dyad (1,2) is canonical but (2,1) is not. This is so because after counting the
triads contributed by dyad (1,2), processing dyad (2,1) would lead to repetition of triad
counting.
• Counting number of connected triads(T3): Each connected dyad (u,v) where u<v,
also contributes number of connected triads equal to vertices in its open neighbourhood
set S(line 15-20). Again, all non-canonical selections must be eliminated. Three ver-
tices forming a connected triad can be selected in six ways: (u,v,w), (u,w,v), (v,u,w),
(v,w,u), (w,u,v) and (w,v,u). Line 7 already already made last four of these selections
non-canonical by enforcing u<v condition. Line 16 determines which of the two remain-
ing selections (u,v,w) and (u,w,v) is canonical.
– Selection (u,v,w) with u<v< w is always canonical.
– However if u < w < v, then this selection can be non-canonical only if u and w are
neighbours because then the triad (u,v,w) was already counted in previous iteration
of algorithm when dyad (u,w) was processed. Line 16 enforces condition to check
if w forms part of a canonical triad:
v < w or ( w < v and u < w and !IsNeighbour(u,w) )
IsNeighbour(u, v) function checks if vertices u and v are neighbours in undirected
sense. Line 17 finds isomorphic triad types using TriadCode(u, v, w) function ( Fig-
ure 2.5 ). This function first calculates triad code value between 0-63 corresponding to
64 non-isomorphic triad types. It returns this value + 1 if main algorithm calculates
12 2.2 Triad Census Algorithm
Function: Subquadratic Triad Census Algorithm
Input: Graph G = [V, E] where V is set of vertices, E is array of edge lists and N is array
of neighbour lists
Output: Census array with frequencies of isomorphic triadic types
1: for i = 1→ 16 do
2: Census[i]← 0
3: end for
4:
5: for u ∈ V do
6: for v ∈ N[u] do
7: if u <v then
8: S← N(u)∪ N(v) \ {u, v}
9: if IsEd ge(u, v)∧ IsEd ge(v, u) then
10: TriadT ype← 3
11: else
12: TriadT ype← 2
13: end if
14: Census[TriadT ype]← Census[TriadT ype] + n− |S| − 2
15: for w ∈ S do
16: if v <w ∨ (w <v ∧ u <w ∧¬IsNeighbour(u, w)) then
17: TriadT ype← TriadCode(u, v, w, )
18: Census[TriadT ype]← Census[TriadT ype] + 1
19: end if
20: end for
21: end if
22: end for
23: end for
24:
25: sum← 0
26: for i = 2→ 16 do
27: sum← sum+ Census[i]
28: end for
29: Census[1]← n ∗ (n− 1) ∗ (n− 2)/6− sum
Figure 2.4. Subquadratic Triad Census Algorithm
non-isomorphic triad census. Otherwise a mapping table TriadTable maps the 64 non-
isomorphic triad types to 16 isomorphic types and function then returns a value between
1 and 16. IsEd ge(u, v) function checks if a directed edge exists from u to v. Returns 1 if
true, 0 otherwise. Finally the Census corresponding to triad type is incremented to count
the connected triad (line 18).
• Counting null triads(T3): Lines 24-27 stores sum of T2 and T3 in variable sum and
subtracts this sum (line 28) from total count of all triads of the graph(line 28) to get
count of null triads T1
13 2.3 Graph Data Structures
Function: TriadCode
Input: vertices u, v, w
Output: isomorphic or non-isomorphic triad code
1: TriadCodeVal = IsEd ge(u, v)
2: TriadCodeVal+= 2 ∗ IsEd ge(v, u)
3: TriadCodeVal+= 4 ∗ IsEd ge(u, w)
4: TriadCodeVal+= 8 ∗ IsEd ge(w, u)
5: TriadCodeVal+= 16 ∗ IsEd ge(v, w)
6: TriadCodeVal+= 32 ∗ IsEd ge(w, v)
7: return(IsIsomorphic?TriadTable[TriadCodeVal] : TriadCodeVal) + 1
Figure 2.5. Triad Code Algorithm
Time complexity of the of the above Subquadratic triad census algorithm is O
 
m4. Exact
complexity depends upon the sparsity of graph data set and the maximum degree of the graph.
For small world networks with power law degree distribution that are sparse i.e. m is O
 
n.k(n)

where k(n) << n, and have small maximum degree i.e. 4 << n, the algorithm complexity is
linear - O
 
n

. If maximum degree of graph is high i.e. 4 ö n, then algorithm complexity is
quadratic - O
 
n2

. For complete graphs, the time complexity becomes cubic - O
 
n3

2.3 Graph Data Structures
A graph data structure stores graph data and allows for traversal and manipulation of the graph.
In graph algorithmic theory, there are mainly two representations that are used for representing
graphs: Adjacency Matrix and Adjacency Lists. Figure 2.6 shows an example directed graph
Adjacency Matrix: For a graph G = [V, E]with n vertices and m edges(undirected graphs) or
arcs(directed graphs), an adjacency matrix uses an nxn matrix to represent list of edges or arcs.
Each location [i, j] of the matrix stores a 1 or edge weight wi j if there is an edge(or arc) be-
tween(from) vertex vi and(to) v j otherwise a 0 is stored. This graph representation(Table 2.1)
is good for dense graphs and small order graphs (dense/sparse) but uses lot of space O
 
n2

.
Its nice property is that elements of the data structure are accessed in contiguous fashion and
so is cache and prefetch friendly. For sparse graphs, adjacency matrix wastes lot of space as
there will be lot of zeros and instead adjacency list is preferred which uses space of order
O
 |V |+ |E|.
Adjacency List representation( Figure 2.7 ) of a graph G = [V, E] is a one dimensional array
Adj of size N = |V |. For a vertex v of graph G, Ad j[v] stores pointer to list of vertices adjacent
with v. For weighted and digraphs, each adjacent list has an additional field to hold weight for
that edge.
2.4 Irregularity
Many algorithmic problems in computer science exhibit a property called Irregularity [12]
which makes it hard to obtain their efficient sequential and parallel implementations of algo-
rithms. Irregularity is closely linked to the problem which the algorithm solves and also to the
14 2.4 Irregularity
Figure 2.6. Directed graph
Table 2.1. Adjacency matrix representation of a graph
V0 V1 V2 V3 V4 V5 V6
V0 0 1 1 1 0 0 0
V1 0 0 0 0 0 0 0
V2 0 0 0 0 0 0 0
V3 0 0 0 0 1 1 0
V4 0 0 0 0 1 0 0
V5 0 0 0 1 0 0 0
V6 0 0 0 0 0 0 0
Figure 2.7. Adjacency list representation of a digraph
15 2.4 Irregularity
way it is represented. Irregularity stems itself in many ways:
• Irregularity in flow control: This irregularity results from presence of many flow control
statements inside the program such as if-else, GOTO, switch, while/for loops, function
calls etc. This is mostly the property of the algorithm itself and not necessarily of its
representation/implementation. Different processors and machines will handle and give
different performance for given control flow instructions. For example general purpose
processors(like Intel x86 or x64) are typically designed to perform well for control-flow
intensive programs. On the other hand, hardware platforms like GPGPU’s, FPGA’s or
DSP’s typically do not perform well for control flow intensive programs as these platforms
are simply not designed for them and are instead designed to target low-latency, high
computational throughput applications.
• Irregularity in data structure: Typically every program uses some sort of data structures
to store data associated with algorithm so as to allow primitive data operations - cre-
ate, insert, delete, update or traverse. Broadly speaking, a data structure can be array
based(data static or known priori), pointer chasing based (for unknown/dynamic data)
or a combination of array / pointer chasing. Depending upon the processor and memory
architecture, data structure may be stored physically different than logically imagined
by data structure designer. Due to the fact that there always has been a performance
gap between processor and memory, the actual achieved performance may differ from
the logically expected performance(usually based on some machine model like PRAM).
Other factors which influence this kind of irregularity are data partition technique, data
mapping/layout, data traversal order, data dependency, operating system internals. This
kind of irregularity thus results from the data structure and algorithm representation/im-
plementation and is not necessarily a property of the algorithm or its associated data
structures.
• Irregularity in communication: Many programs need data structures to be partitioned
across memory architecture so that data can be processed by many similar threads as in
data parallel applications or by different threads as in task parallel applications or a com-
bination thereof. Most of the time, application threads have to communicate with each
other or share some data. Threads may also have to synchronize with other threads
at some point. This thread communication is typically handled either using shared
memory(as in tightly coupled multiprocessors) or through explicit message passing(as
in loosely coupled multiprocessors). Depending upon the processor coupling, commu-
nication architecture (shared/distributed address-space/memory) and the data partition
technique employed, the communication cost can can take significant amount of the over-
all application execution time. Irregularity in communication is the problem in matching
theoretical optimal communication cost with actually achieved communication cost on
hardware platform.
16 2.5 Efficiently Implementing Irregular Graph Algorithms
2.5 Efficiently Implementing Irregular Graph Algorithms
We now present techniques to address performance limiting issues. As the algorithm targeted
for acceleration - Triad Census - deals with irregular sparse graph data, so we first focus on
techniques to design efficient adjacency list graph representations. In the remaining part of the
section, we describe techniques to address the irregularity problem with focus on memory and
control-flow optimizations.
2.5.1 Efficiently implementing adjacency lists
Performance of any graph algorithm depends a lot on the design of graph data structure. Which
graph data structure is good for the graph algorithm depends upon many factors like graph size,
dynamic/static graph algorithm, graph density, hardware resources, algorithm specific require-
ments (data partitionability, fast insertions, deletions, traversal etc.). The challenges lies in
creating compact space efficient and cache friendly representations of such sparse graphs.
In particular, because we are dealing with irregular sparse graphs, it is important to have
efficient design of adjacency list structure. Main problem with adjacency list representation is
that finding whether an edge (u, v) is present in graph requires traversal of adjacency list of
Ad j[u]. As lists are normally implemented as linked lists, they suffer from problem of cache
unfriendly design due to poor spatial locality. In worst case, a cache miss occurs for every node
access - O
 
n

. There are few approaches to overcome some of these problems:
1. Cache blocked or unrolled linked lists 3: This technique replaces adjacency linked lists of
graph data structure with blocked or unrolled linked lists. An unrolled or cache blocked
linked list is a linked list of fixed size arrays( Figure 2.8 ). Each node of an unrolled list
consists of
(a) A linear array of fixed size(blocking) called Element array
(b) A variable that stores number of valid elements in the element array.
Compared to normal linked list, unrolled linked lists have two major advantages. Firstly
the memory overheads(pointers, alignment, meta-data etc.) are lower as a blocked node
amortizes the memory overheads. Secondly, because of better spatial locality of arrays,
overall cache misses that occur during traversal also reduce. Cache misses in unrolled
linked list is close to optimal cache misses in arrays i.e. O
 
n/B

. An additional benefit
that comes with improved spatial locality is the increase in accuracy of compiler and
hardware assisted prefetching.
If
n = Elements inserted
B = Cache line size in terms of number of elements
N = blocking size in terms of number of elements
Cache misses on traversing the list of n elements in
• Array - O
 
n/B

3http://www.wikipedia.com
17 2.5 Efficiently Implementing Irregular Graph Algorithms
Figure 2.8. Normal linked list and Unrolled(blocked) list
• Linked list - O
 
n

• Unrolled linked list - O
 
(n/N + 1)(N/B)

[Note traversing elements in each node
require O
 
N/B

cache misses. Further assumption is all nodes are full]
Unrolled list can be easily tuned to the specific processor architecture by choosing size
of the element array according to the cache line size. One disadvantage of unrolled lists
is that in some cases, overall memory consumption may increase. For example if there
are many graph vertices with low degrees, then element arrays of such vertices will be
partially filled leading to memory wastage. Even then an unrolled list can perform better
than a normal linked list.
A blocked or unrolled linked list thus combines spatial locality benefits of arrays with
dynamic capabilities (quick insertion, re-sizing) of a linked list and therefore should be
used whenever feasible.
2. Dynamic Array: Another data structure which combines spatial locality benefits of linear
array with dynamic data structure capabilities is a Dynamic array. A dynamic array is an
array which can be resized at runtime. Other advantages of dynamic arrays over linked
lists include compactness, faster indexing(constant time), constant time(amortized) in-
sertion/deletion at end, faster linear time iteration and faster insertion/deletion at be-
ginning of array(due to better spatial locality). Standard Template Library(STL) Vector
container is a popular implementation of a dynamic array. This container is guaranteed
by C++ standard to be contiguous 4
3. Adjacency array representation[13]: This representation concatenates adjacency arrays of
all vertices in a single array called Edge array E. An additional array V stores the starting
index positions of the individual adjacency arrays in E, i.e., for any vertex v, V[v] is the
index in E of the first edge out of v.
The edges out of any vertex v are then easily accessible as E[V[v]] . . . E[V[v + 1]1];
Normally a dummy entry V[n+ 1] with V[n+ 1] = m+ 1 is added to ensure that this
also holds true for vertex n. Figure 2.9 shows an example. The memory consumption for
storing a directed graph using adjacency arrays is n+m+ Theta(1) words.
In addition, adjacency array representation( Figure 2.9 ) is analogous to CRS(Compressed
4http://herbsutter.com/2008/04/07/cringe-not-vectors-are-guaranteed-to-be-contiguous/
18 2.5 Efficiently Implementing Irregular Graph Algorithms
Figure 2.9. Adjacency array representation of a digraph. Top array V is the row index
array of length Size(G)+1. Lower array E is the column or edge array of length Order(G)
Row Storage) format used for representing sparse matrices 5. In CRS terminology, array
V is called Row index array and array E is called Column Array. Value array of CRS for-
mat normally would represent weights or any other edge property and is thus optional.
Advantage of this graph data structure is higher spatial locality and compactness.
2.5.2 Optimizing irregularity
In this section, we will present taxonomy of techniques that can be employed to address irreg-
ularity issues. What makes it difficult to create generalized methodologies to address irregular-
ity is that multiple abstraction layers of software and hardware hide many software/hardware
details which could be crucial for obtaining good performance. As a result, during program
execution many non-deterministic events(cache misses, page faults, context switching, thread
interleaving/migration, resource sharing, branch mispredictions, bandwidth/latency variance,
etc.) occur which simply cannot be controlled even when they can be monitored. Also, as
different irregularity types could be interdependent, optimizing a particular irregularity type
may result in a detrimental effect for other type of irregularity
Therefore, in order to achieve optimal performance, the programmer must make careful
trade-off’s, do thorough analysis of the problem, understand the machine architecture, map
the problem in best possible way to the machine architecture and finally do many experiments
to get optimal results.
Irregularity issues can be mitigated both at software level and processor hardware level.
At software level, irregularity can be handled from application programming layer to compiler
layer. Some of the software based techniques are generic and some are specific to processor
architecture(at compiler level). Hardware based Sophisticated micro-architecture circuit blocks
in modern day Multicore and Manycore processors try to address irregularity. We will now
explain important hardware and software techniques to address irregularity for control flow
and memory optimization
5Intel MKL CRS Format
19 2.5 Efficiently Implementing Irregular Graph Algorithms
Control flow optimizations
All control flow programming constructs on compilation generate branch instructions. Instruc-
tion stream containing high percentage of branch instructions cause control hazards(irregular
control flow), resulting in processor pipeline stalls and thus can deteriorate performance. For
irregular algorithms, many branches tend to be data dependent and thus branch predictors
in Multicore processors have hard time in correctly predicting these branches. On GPGPU
processors, presence of unpredictable data dependent branches can significantly lower the per-
formance if different threads tend to follow different branch paths ( called as warp divergence
).
In other words, processors work best when control flow is sequential. Consequently, opti-
mizing branches should not be ignored as it can yield significant performance improvements.
Some of the techniques to optimize branches are:
• Eliminate branches: The simplest way to optimize branches is to try to eliminate unnec-
essary branches. Most of the time, this requires analysis at algorithmic and code level.
This is the simplest and least effort branch optimization which can reduce irregular con-
trol flow and yield good performance improvements.
• Loop unrolling reduces number of loop iterations and consequently reduces the branch
penalties. Loops with static or predictable iteration bounds can be easily unrolled either
by the compiler or explicitly by programmer. For loops whose iteration range at compile
time is not predictable or unknown, the programmer can explicitly unroll the loops with
some programming effort.
• Function inlining is another branch optimization technique which is also normally done
by the compiler but can also be controlled programmatically. When a function is inlined,
all calls to this function in the entire program are replaced with the actual function code
by the compiler. For functions with small code size and which are called very frequently,
function inlining makes sense and can yield good performance improvements. Note that
compilers may ignore the inlining of a function.
• Predication is a branch optimization technique which is normally employed by compil-
ers. When a branch is predicated by the compiler, all instructions in different branch
paths are predicated with the branch condition. All the predicated instructions are exe-
cuted by the processor irrespective of the branch condition, however only those instruc-
tions for which the predicate condition is true are committed. Compiler based predica-
tion thus requires support at ISA level of the processor. Although compilers are able to
optimize many branches with predictable paths but data dependent branches pose sig-
nificant challenges for compiler. For such data dependent branches, predication can be
implemented programmatically.
• Modifying algorithm: Many a times, formulating alternative algorithmic solution to the
problem alters flow control and thus there could be possibility of realizing algorithmic
solutions with better control flow.
At micro-architecture level, many processors employ sophisticated techniques to efficiently
execute branch instructions. These includes branch prediction(instruction level speculative exe-
cution) and thread level speculative execution. Branch prediction involves predicting the most
20 2.5 Efficiently Implementing Irregular Graph Algorithms
likely control-flow path of a branch instruction and then speculatively fetching-executing in-
structions from the predicted branch path. TLS or thread level speculation is analogous to
branch prediction in the sense that it operates at thread level instead of instruction level.
Memory focussed optimizations
Irregular computations and data structures suffer from locality problems making them memory
bound. The underlying cause for this problem - processor-memory performance gap - can be mit-
igated by mainly two approaches. First, by hiding this gap(memory latency hiding by keeping
processor busy while it is awaiting a slow memory transaction to finish. Second approach is to
reduce this gap by making sure data is held close to the processor whenever possible. Following
are some state of the art hardware and software techniques that implement these approaches
Hardware architecture level techniques: Sophisticated hardware architecture design tech-
niques try to address memory irregularity problem in software independent manner. Tradi-
tional hardware techniques used for hiding and reducing memory latency are as follows:
• Hardware Prefetching also called Data level speculation
• Cache memory hierarchy (see next sub section for more details)
• TLB (Translation look-aside buffers)
• Out-of-order execution
Another hardware approach to hide memory latency is through Hardware Multi-threading.
Hardware Multi-threading hides memory latency by exploiting high amount of thread level
parallelism. The key to hardware multithreading is to keep the processor busy with work
by scheduling a ready thread while current thread is waiting for a memory transaction to
complete. Following is a taxonomy of different hardware multithreading techniques :
• Block or Coarse grained Multithreading: Scheduling another thread on processor pipeline
when present thread stalls on high latency operation(like cache miss). At a given clock
cycle, a single pipeline stage contains(issued or to be executed) instructions from single
thread only. Cooperative or fine grained Multithreading:
• Scheduling many threads in processor pipeline by interleaving their execution in pipeline
stages. At a given clock cycle, a single pipeline stage contains(issued or to be executed)
instructions from single thread only.
• SMT(Simultaneous Multithreading) architectures like Intel Hyperthreading combine Su-
perscalar processing with Hardware Multi-threading where instructions from multiple
independent threads can be simultaneously issued and executed to/in a given pipeline
stage
• Thread level speculation is another technique for hiding memory latency through specu-
lative execution of hardware threads.
• Chip Multithreading(CMT)
21 2.5 Efficiently Implementing Irregular Graph Algorithms
For more information on Hardware Multithreading and other processor related concepts,
please see the Appendix B on Processor Architectures and Micro-architectures.
Software level techniques: Software approach to address irregular memory access problems is
by using a mix of algorithmic, data structure design and compiler techniques. Some of these
techniques are aware of the underlying processor architecture and some of them are oblivi-
ous to it. Before looking at the techniques, we will first give overview of processor memory
hierarchy which will set tone for understanding the techniques.
Most modern state of the art processor architectures use some form of Multilevel Memory
Hierarchy where each level act as a cache for next. A typical hierarchy consists of registers,
level 1 cache, level 2 cache, level 3 cache, main memory(DRAM), and disk. The underlying
importance of this hierarchy is that as the memory level gets farther from the CPU, it becomes
larger in size and slower(high latency). Typically registers and level 1 caches are the fastest
when compared to other memory levels. The difference in memory access time increases dra-
matically from level caches to DRAM.
An instruction which needs to access data from a memory address, can execute only if the
required data is present or brought in closest cache level. Instruction thus waits for the data if
its in farther memory levels. A consequence of this is that in order to obtain high algorithmic
performance, it is important to make sure the accessed data is held by closest memory level.
This problem of making sure that accessed data always resides in closest level (atleast in some
level cache) - i.e. memory access latency is low - is non-trivial to solve, depends upon many
factors and requires careful algorithmic and architecture specific design optimization’s.
As already explained before, many a times pointer-chasing based dynamic data structures
need to be used in algorithms. Examples of such data structures are lists, sets, trees, graphs,
unstructured grids etc. Such structures suffer from memory locality problem causing large
number of cache misses and thereby proving detrimental to overall performance. As a result,
improving memory locality by optimizing cache memory use of the algorithm and its data
structures can yield impressive performance gains. A positive side-effect of improving locality
of data layout/access is that TLB misses also decrease because most data will be accessed from
locations in the same page.
All optimizations which can be applied to keep data in on-chip level caches are classified
as Cache Optimizations. Many cache optimizations methods have been proposed in literature
[14, 15, 16] that try to address the memory latency issue. In essence, they basically try to do
the following:
• Reduce cache misses(cold, capacity and conflict)
• Reduce latency to serve a cache miss
Like all software optimizations, cache optimizations can be both aware or oblivious to
underlying processor architecture. If cache optimizations require detailed knowledge of un-
derlying cache hierarchy(levels, block or line sizes, alignment) then it is called Cache Aware,
otherwise Cache oblivious [17, 18, 19]. Most cache optimizations rely on code-restructuring,
data-layout and data access optimizations. Table 2.2 shows the summary of these memory
optimizations.
We will now discuss in some detail the techniques in the table 2.2.
22 2.5 Efficiently Implementing Irregular Graph Algorithms
Table 2.2. Memory hierarchy optimization summary
Optimization Performance improvement
Cache blocking(linked lists) Fewer cold misses
Cache blocking(arrays) Fewer capacity misses
Sorting linked lists in sequential order of
node addresses
Fewer cold misses, better prefetching, amor-
tized average load latency
Prefetching using auxiliary struc-
tures(linked lists)
Reduces cold misses, aids prefetching
Reduce cache pollution and pack caches
with useful data
Better cache utilization and reduced bandwidth
wastage
Cache aligned data structures [20] Lower cache footprint, halves cache misses due
to misalignment
Keep hot fields together and access in
layout order
Higher cache utilization, reduced capacity
misses due to lower cache footprint, more ef-
ficient prefetching
Loop interchange Fewer conflict misses ( cache thrashing )
Loop fusion Fewer conflict misses
Loop fission Fewer conflict and capacity misses
Loop unrolling Increases ILP, lowers loop overhead
1. Cache blocked linked lists: Its basics have been discussed before. There is plenty of mate-
rial6 available on how to implement blocked linked lists and tune block sizes [21].
2. Use dynamic arrays for small lists: Its basics have been discussed before. Dynamic arrays
should be preferred over linked or unrolled lists when number of elements is small(exact
size depends) and most insertions are at end or beginning rather in the middle.
3. Sort the linked list layout( [22] ): Nodes of a linked list are normally allocated in non-
contiguous memory locations. In the worst case, a cache miss occurs on every load
operation. By sorting the elements of the linked list in a sequential memory address
order, cache misses can be reduced as consecutive nodes may occupy the same cache
line. Further, this gives PE’s in Multicore processor(with out-of order capability), oppor-
tunity to hide memory latency for successive node accesses by overlapping multiple load
operations by issuing memory loads in advance.
4. Software prefetching: Also called data level speculation, prefetching is a technique to
reduce memory latency by advancing load instructions in anticipation that results of such
loads are very likely to be required soon. Prefetching in software requires ISA support and
is usually done by compiler by inserting non-blocking load instructions. Programmer can
also implement prefetching in software by using inline prefetch assembly instructions
or API specific prefetching functions. Different studies [23, 24, 25] have been done
on data prefetching for irregular data structures. They basically either try to estimate
prefetch distances or use an auxiliary data structure. Prefetching techniques can be hard
to implement for irregular applications due to obvious reasons: difficulty in predicting
6http://software.intel.com/en-us/articles/cache-block-size-on-processors-that-support-hyper-threading-
technology
23 2.5 Efficiently Implementing Irregular Graph Algorithms
traversal-orders and prefetch distances and many times not enough work to hide memory
latency. Still, If feasible and done carefully, prefetching can yield good performance
improvements.
5. Avoid wasting cache space: Choose member data type according to range of actual data
values so that more data can be read or written per cache line access.
6. Reduce cache pollution[26]: Cache pollution is related to repetition of cold misses for
same memory location. As long as relevant data is there in the cache, use it there and
then only as much as possible before it gets replaced. This also reduces the bandwidth
overhead due to reduced data transfer round trips between memory and processor
7. Cache aligned data structures([20]):All memory accesses which are not aligned to bound-
ary of cache line size can be twice slower than aligned memory access because extra
cache lines need to be fetched to load all data. As extra cache lines are fetched, conflict
misses also increase. To avoid this, pack data structures with members such that they fit
evenly on cache line boundaries. One can also force cache aligned memory allocation
but forcing aligned allocation can create problem of memory fragmentation leading to
higher memory consumption and possibly performance degradation.
8. Hot fields should be placed together and accessed in order of layout [14, 20]: Arrange lay-
out of member items of a structure in order of frequency of access and also order in
which they will be accessed. This will divide the memory layout of structure in groups
of hot fields(higher access frequency) and cold fields(rarely accessed). As a result, group
temporal/spatial re-usability increases as clubbed items are more likely to be accessed
together. With this, the chances that clubbed items occupy the same cache line increases.
Also by accessing the items within hot field in order of their layout, the prefetch mecha-
nism can work more accurately.
9. Loop optimizations[14]: Many loop specific optimizations can be used for structures other
than linked lists. Loop-blocking or strip-mining makes tiles or blocks of data size related
to cache line size and reduces capacity misses. Loop interchange improves spatial locality
by interchanging nested loops to traverse data in continuous dimension(innermost array
dimension in C). Loop fusion amortizes cache misses by fusing two independent loops
with compatible indexes. Loop fission reduces splits a loop into many loops to reduce the
memory footprint of loop iteration thereby reducing capacity misses. Loop unrolling is a
technique that reduces loop overheads, increases ILP, reduces control flow and improves
cache utilization by data reuse. Unrolling is effective when loops enclose low to moderate
amount of computation and contain very few function calls.
24 2.6 Multicore Platform
2.6 Multicore Platform
As explained in earlier sections, many architecture aware optimizations exists which can be
effectively utilized by understanding details of the underlying processor architecture. With the
aim of getting efficient parallel implementations of Triad census algorithm for Multicore pro-
cessors, in this section we present details of architectural and micro-architectural features of
the Intel Multiprocessors used for evaluating Triad Census algorithm. The reader is encouraged
to read Appendix A Concurrency and Parallelism and Appendix B Processor Architectures
and Micro-architectures, as they cover significant state of the art concepts and terminology
relevant for this section and the following sections.
2.6.1 Platform Overview
For implementing and testing single-threaded and multi-threaded implementation of triad cen-
sus algorithm, we used Intel Core 2 Duo(64 bit) processor. For the main optimizations, ex-
periments and evaluation of different triad census algorithm implementations, we used high
performance Dell 11G PowerEdge R510 server containing two Intel Xeon E5630 Quad core
processors on two different sockets(Figure 2.10). The Westmere-EP (Intel Xeon 5600 series)
or Gulftown is basically a die-shrink (32 nm) of previous Intel micro-architecture named Ne-
halem(45 nm) and implements the Intel64 ISA. Westmere is the "tick" and Nehalem is the "tock"
in the Intel’s tick-tock processor series. Tock is an entirely new micro-architecture while Tick
is a new silicon process technology based on last Tock. Table 2.3 shows specifications of these
two Multicore systems.
2.6.2 Intel Nehalem(Westmere-EP) Micro-architecture
We will now explain in detail the organization and details of the processor-memory system of
this dual socket server(Figure 2.10)[27]. The server contains two sockets each holding one
Intel Xeon Quad core processor chip. A Xeon Quad core chip is a CMP having mainly two parts
on the chip: core and uncore. Intel says the core/uncore approach is mainly taken to make
whole chip design modular and depending upon the requirements of domain in which the
processor will be used, this approach enables it to make reusable, flexible, power efficient and
scalable multi-socket multicore platforms with different number of sockets, processor cores,
cache sizes, memory controllers, QPI ports etc.
Core
The core part of Xeon CMP has four processor cores(Figure 2.10) and can run at maximum
frequency (turbo) of 2.8 GHz. Each processor core has following features:
• Implements Intel64 ISA and supports two hardware threads based on Intel’s implemen-
tation of Simultaneous Multithreading called Hyper-threading (HT)
• Has separate L1 Data and Instruction cache and a combined L2 Data/Instruction cache.
L2 is an exclusive cache w.r.t L1. Both L1 and L2 have write-back write policy.
25 2.6 Multicore Platform
Table 2.3. Details of Multicore Systems
Machine Core 2 Duo(64 bit) Xeon E5630
Type Mobile Server
Code name Penryn Westmere EP
Base Micro-arch? Core Nehalem
Tick(Die shrink) or
Tock(Micro)?
Die shrink Die shrink
Architecture style SMP NUMA
Sockets /Core
per socket /HW
threads per core
1/2/1 2/4/2
Core frequency 2 GHz 2.53 GHz /2.8 GHz
Cache hierarchy L1 - 32 KB Data , 32 KB
Inst per core L2 - 2048 KB
shared Both 8-way set as-
sociative All caches write
back, non-inclusive, 64-
byte line size
L1 - 32 KB Data/core , 32 KB Inst/Core
L2 - 256 KB per core L3 - 12 MB shared
between cores / socket (Smart cache -
fully inclusive) All caches write-back, 64-
byte cache line size
DRAM 3 GB DDR2 RAM (2-ch) 8 GB DDR3@1066 MHz (3-ch) / socket
Bus Frequency
/Bandwidth (bidi-
rectional)
FSB - 800 MHz QPI with 2 links @2.93 GHz QPI - 5.86
GT/s or 23.4 GB/s Memory - 25.6 GB/s
Hyper-threading ? No Yes
• Consists of following units: in-order front-end, out-of-order execution-engine and in-order
retirement unit.
– Front-end can decode four macro-instructions per cycle and issues four micro-
instructions per cycle to the out-of-order execution engine. Instruction stream from
two SMT threads are decoded in alternate clock cycles. There is a second level
branch predictor which improves branch prediction compared to earlier architec-
tures. In comparison, recovery from misprediction is also faster. Similar to Core
Micro-architecture, Macro-fusion and micro-fusion is used to improve front-end
throughput to the out-of-order engine.
– Out-of order Execution Engine can receive up to four micro-ops per cycle from
Front-end. Micro-ops go through register renaming, resource allocation and reser-
vation station for re-ordering before up to six micro-ops per cycle can be dispatched
through six dispatch ports to various pipelined execution units. Appropriate entry is
made in 128-size re-order buffer for in-order retirement. So upto six operations can
be executed per cycle which includes 1 load, 1 store address, 1 store data and three
Arithmetic-Logic operations. Also, window for out-of-order execution is increased
by increasing capacity of various buffers like reservation station (36 entries instead
of 32), load buffers (48 instead of 32 ) and store buffers (32 instead of 20)
– In-order Retirement Unit does retirement and write-back. It can retire upto four
26 2.6 Multicore Platform
micro-ops per cycle. A reorder buffer with 128 entries allows retirement of com-
pleted instructions.
Overall length of processor pipeline is two stages longer than Intel Core Micro-architecture
which enables higher exploitation of ILP. In addition, instructions for synchronizing concurrent
access like lock, xchg execute much faster than Intel Core micro-architecture.
Uncore
The uncore[28] part of the chip mainly combines L3 cache, QPI interconnect and a North-
bridge circuit all integrated on same die as the processor cores. Following are important
features of Uncore:
• 12 MB of L3 cache as LLC. L3 cache is fully shared between four processors cores which
means any single core may fill or access the full L3 cache if other cores are not using it.
Also it is an inclusive cache i.e. it contains data from all L1/L2 caches in all processor
cores. One advantage of the inclusive L3 cache is that, if a core knows that a data is
not present in L3, then this data will not be present on other cores of the same chip
effectively reducing snooping traffic between cores of the same chip.
• An Integrated Memory Controller (IMC) that supports three 8-byte wide DDR3-1066
bidirectional memory channels. Each channel supports memory bandwidth of 8 GB/s.
This is unlike older Intel processor implementations where memory controller was al-
ways outside the processor chip ( on Northbridge chip ) and frequently was a source of
bottleneck in multicores based on such configuration. In effect, also separates the the
cache coherence traffic and memory access traffic enabling higher memory bandwidth
compared to previous architectures.
• Two QPI controllers along with QPI( Intel Quick path interconnect) port interfaces
• Auxiliary circuits for system management and performance monitoring
• Global Queue(GQ): This component is a crossbar that acts as the sole interface between
LLC(Last level cache), QPI Controller, Integrated Memory Controller and cores inside the
Core part. In addition , it provides a Power control unit to dynamically control voltage
and frequency of processor cores according to workload demands thereby enabling Intel’s
Turbo boost technology mechanism. GQ component also implements other features such
as LLC cache controller, Cache coherency mechanism in conjunction with QPI controller.
All data handled by GQ is at granularity of cache line size. Cache line requests come to
GQ due to on-chip L2 cache misses, from remote socket and from I/O Hub. GQ maintains
three different queues with different capacities to handle these requests: Write queue for
memory store from local cores, Load queue for memory load requests from local cores
and QPI queue for off-chip requests coming from remote socket or I/O hub.
• Cache coherence protocol implemented in QPI is called MESIF and has an additional
"Forward" cache line state in addition to usual four MESI states which allows it to handle
socket-to-socket coherence.
27 2.6 Multicore Platform
Memory and Socket Communication
In the server, each processor chip is connected to an 8GB DDR3 RAM module through its IMC
interface. Maximum memory bandwidth per socket is 25.6 GB/s ( 3 ch x 8 GB/s per ch). As
our Westmere-EP processor has two QPI ports, first of the two QPI ports is used to connect the
processor to its sibling Xeon Quad processor chips on the remote socket. The second of the
two QPI ports is used to connect the processor socket to the I/O Hub. We will now discuss
architectural features of Intel QPI
Intel QPI Interconnect[29] is a full-duplex, point-to-point and packetized high bandwidth
low latency interconnect. At physical level, a QPI port is pair of unidirectional links - Tx and
Rx link - that allows connecting a component to another component with a QPI port. Each
Tx/Rx link is made up of 20 differential signal pairs plus a differential forwarded clock. Each
differential signal pair carries a 1-bit of information, this implies 20-bits information can be
carried by one link. Out of 20 bits, 16 bits are used to carry data payload per forwarded clock
edge and remaining 4 bits are used for protocol/error information.
In Westmere-EP processors, each QPI link is operated at a clock frequency of 2.93 GHz. Also,
the data links are operated at double-data rate, such that the data signals are asserted on both
rising and falling edge of the clock. As a result, each QPI link can make 5.86 Giga Transfers per
second ( 2 transfers/Hz x 2.93 GHz ). For a QPI link pair, this translated to aggregate 11.72
GT/s ( 2 links x 5.86 GT/s per link). In terms of theoretical aggregate bandwidth, this comes
out to be 23.44 GB/s (11.72 GT/s x 2 bytes per transfer ) for QPI link pair or unidirectional
11.72 GB/s per link.
It can be seen in Figure 2.10, that when a chip accesses remote DRAM, this will incur higher
latency and lesser bandwidth compared to accessing its local DRAM. This distributed shared
memory makes this two CMP-DRAM arrangement a 2-socket cc-NUMA shared address space
system. In the figure we can see the two NUMA nodes or memory domains.
28 2.6 Multicore Platform
Figure 2.10. Dual socket configuration of two Quad Core Intel Xeon E5630 processors
with two hardware threads per core. Each Xeon chip can access its local DDR RAM
directly through its IMC. This organisation makes the system a 2-way cc-NUMA with
shared address space. Each Xeon processor chip is organised as Core and Uncore
29 2.7 Tesla GPGPU Platform
2.7 Tesla GPGPU Platform
GPGPU’s(General Purpose GPGPU’s) have come forward as the most promising widespread
SPMD model based performance accelerators. Their ubiquitous presence in desktops, lap-
tops and servers means availability of massive Manycore multithreaded multiprocessing that
generally excel in highly data-parallel and compute intensive applications. The philosophy
of GPGPU based computing is to use a heterogeneous CPU-GPGPU co-processing approach
for accelerating performance for applications having a varying mixture of serial and parallel
fraction. The approach is to accelerate highly data parallel part of the application using the
fine-grained(hw threads) GPGPU processor and accelerate the sequential part of the application
using the latency-optimized coarse-grained(hw threads) Multicore CPU. In other words, exploit
fine-grained thread level parallelism(TLP) for highly data parallel codes on GPGPU and exploit
Instruction level parallelism(ILP) plus coarse-grained TLP in inherently sequential(irregular)
codes on Multicore CPU’s.
In this section, we first present overview of our NVIDIA Tesla C1060 GPGPU(General Purpose
GPGPU) based hardware platform on which Triad Census is evaluated. Thereafter, we concen-
trate on understanding in some detail the important hardware features of the NVIDIA Tesla
C1060 GPGPU. The software programming model for NVIDIA GPGPU’s - called CUDA - is dis-
cussed in Appendix D. In the last part of this section, we mainly discuss important differences
between GPGPU’s and Multicore.
2.7.1 Platform Overview
For evaluating Triad Census on GPGPU, we used ICS workstation system which consists of a
discrete NVIDIA Tesla C1060 GPGPU compute processor connected to Quad-Core Intel Xeon
X5260 Multicore processor through a PCI-Express x16 2.0 interface. Intel Xeon X5260 Multi-
core processor is nicknamed Wolfdale-DP and is a die-shrink(tick) of Intel Core Micro architec-
ture(Penryn). Figure 2.11 shows the details of this platform.
2.7.2 NVIDIA Tesla GPGPU Processor
Tesla C1060 GPGPU belongs to 2nd generation Tesla architecture(T10/GT200b)7. It is a 1.4
billion transistor chip made using 55nm technology node and consists of 240 CUDA processor
cores(n) with processor core(shader) frequency of 1300 MHz(f), graphics core(core engine)
frequency of 600 MHz, 1 TFLOPS (f x n x 3 flops/clock-cycle) of peak single-precision floating
point rates, IEEE 754-2008 double precision 64-bit floating point arithmetic, 4GB of GDDR3
memory having clock rate of 1600 MHz with 512-bit (64-bit x 8) wide memory interface. It
supports peak theoretical memory bandwidth of 102 GB/s (512/8 bytes x 1600 x 106 cycles)
between GDDR RAM and GPGPU device. CUDA compute capability of Tesla C1060 GPGPU is
1.3. 8
In terms of chip resource organization (see Figure 2.12), Tesla C1060 GPGPU consists of
a single Streaming Processor Array (SPA). The SPA consists of ten Thread Processing Clus-
7As of this writing, the state of the art GPGPU processor architectures are Fermi T20/GF100/GF104 [30, 31] and
Kepler GK104/GK11 [32] released in 2009 and 2012
8CUDA compute capability describes CUDA features supported by the GPGPU. More advanced CUDA compute
capabilities are 2.0 and 2.1 are available in Fermi GF100/104 GPGPU’s
30 2.7 Tesla GPGPU Platform
ters (TPC). Each TPC consists of three Streaming Multiprocessors(SM). Each SM consists of
eight Scalar Processors(SP’s or CUDA cores). Thereby each TPC consists of 3 SM x 8 SP/SM
= 24 CUDA cores or SP’s. As there are 10 TPC’s, in total GT200 has 10 TPC x 24 SP/TPC
Figure 2.11. Tesla C1060 GPGPU connected through PCIe 2.0 @ 8GB/s to Intel Quad
Core(2core x 2socket) Xeon X5260 @3.33 GHz, per core L1-32 KB data cache, Per Socket
6MB L2 Unified Cache shared between two cores, 1333 MHZ FSB, and 8 GB DDR2@667
MHz RAM
31 2.7 Tesla GPGPU Platform
= 240 CUDA cores. In total, 30720 hardware threads can be executed by the Streaming Pro-
cessor Array(SPA), or in other words 3072 threads/TPC, 1024 threads/SM and 128 threads/SP.
Each SM has following additional features:
• 64 KB (32 bit x 16K) Register file
• 16KB - Shared Memory which is connected to 8 SP’s through a low-latency interconnect
and is organised into 16 banks of 1KB each
• Instruction Cache (I-Cache), 8 KB-Constant Cache (C-Cache)
• Two Special Function Units (SFU’s) each clocked at 1300 MHz used for transcendental
functions, attribute interpolation and floating point MUL instructions. Each SFU contains
4 single precision floating point multipliers, so in total there are 8 single precision FP
multiplier units per SM (1 FP Multiplier per SP).
• One IEEE 754-2008 Double-Precision floating point unit multiplier supporting 1 Fused
Multiply Add (FMA) per cc.
• Multithreaded Instruction Fetch and Issue Unit (MT issue) supporting fine-grained in-
terleaved multithreading (see Glossary). MT unit enables SIMT (Single Instruction Mul-
Figure 2.12. Tesla C1060 processor die (T10/GT200 GPU Architecture). Source: NVIDIA
32 2.7 Tesla GPGPU Platform
tiple Thread) by creating, managing, and scheduling hardware threads in group of 32
parallel threads called Warp
• One Warp Scheduler (see later sections for more details)
• Eight Load/Store Units (LD/ST) for executing Load/Store/Atomic Memory operations.
Each SP has following additional features:
• Fully pipelined in-order issue scalar processor with pipeline latency of 24 clock cycles.
• One scalar Integer Unit (IU) supporting 32-bit precision multiply, 1 multiple-add (MAD)
operation per cc and efficient 64-bit operations.
Total size of register file is 1920 KB (10 TPC x 3 SM/TPC x 64 KB/SM). Total size of shared
memory is 480 KB (10 TPC x 3 SM/TPC x 16 KB/SM).
2.7.3 Differences between Multicore and GPGPU processors
Before moving to the evaluation of Triad Census algorithm on our Tesla GPGPU, we point out
some major differences between GPGPU’s and Multicore CPU’s
1. Usage Context for Workload Processing: Primary usage of Multicore is as main pro-
cessor for running operating system and all software applications on top of it. Multicore
processors are built on legacy single core ILP(Instruction Level Parallel) processors and
can run both single threaded and multithreaded applications(exploiting coarse grained
TLP). On the other hand, GPGPU’s are used as co-processor of the Multicore processor and
are primarily meant for accelerating highly multithreaded and compute intensive work-
loads. Although the GPGPU’s focus on accelerating applications through Thread level
parallelism(TLP), to an extent they can also extract ILP.
2. Hardware Threads: GPGPU’s provide massive hardware multithreading (fine-grained)
in form of thousand of hardware threads(30720 Tesla T10/GT200). Multicores typically
have few number (multiples of 2 say 2, 6, 8 etc.) of hardware threads (see SMT or CMT)
per processor core. Hardware threads in GPGPU’s are light-weight and require very few
cycles to generate and schedule, whereas threads in Multicores are comparatively coarser
and thus require more allocation/de-allocation cycles typically multiples of 10 cycles.
3. Chip resource utilization: GPGPU’s tend to use most part of chip area for in-order
small processor cores, register file, thread scheduler and lesser area for on-chip memory
hierarchy, control logic. In Multicore processors, on-chip cache hierarchy and control
logic like branch predictor, O-o-O scheduler occupies significant part of chip area whereas
chip area dedicated for computation occupies relatively lesser area. ( Figure 2.13 )
4. SIMD vs. SIMT: Multicores use SIMD (single instruction, multiple data) units for do-
ing vector processing on vectorized registers by doing same operation on multiple data
words. On the other hand GPGPU’s employ SIMT (single instruction multiple thread)
where bunch of threads(warps on NVIDIA GPU) execute same set of instructions in lock
step, although it does permits arbitrary branching behaviour for individual threads.
5. Memory Bandwidth: for GPGPU’s (>100 GB/S) memory bandwidth is many times ( 10
or more) higher than memory bandwidth in Multicores (10-30 GB/s).
33 2.7 Tesla GPGPU Platform
Figure 2.13. Single core CPU and GPGPU architecture comparison. Source: NVIDIA
CUDA C Programming guide
6. Floating-point Throughput: GPGPU’s provide massive floating point operations through-
put of the order of 1 TFLOP (Tesla T10) in single precision and more (Fermi T20) whereas
Multicore processors have much lower maximum FLOPS.
7. Memory latency hiding: GPGPU’s use massive Hardware Multithreading along with
ILP, combined with high memory bandwidth of GDDR memory to hide memory latency.
Multicore processors employ sophisticated cache hierarchy, branch prediction, hardware
prefetching, out of order processing and hardware multithreading(coarse grained) to
hide memory latency.
8. Programmability: Multicores can be programmed much easily with little or almost no
knowledge of the underlying processor hardware. On the other hand, programming
GPGPU’s requires learning a vendor specific programming framework (example CUDA)
and requires programmer to have deeper knowledge of the processor architecture.
9. On-chip memory management: In Multicores, on-chip cache hierarchy and register
allocation is mostly handled by compiler and hardware with programmer having little to
no control over on-chip memory allocation. In GPGPU’s, allocation of on-chip memory
such as registers can be controlled through compiler and usage of on-chip scratch pad
memory(shared memory) can be explicitly managed by programmer.
34 2.7 Tesla GPGPU Platform
Chapter 3
Related Work
This chapter presents related work on sequential and parallel graph algorithm implementation
on CMP’s along with some discussion on various graph data structure representations as used
in them. In particular, we focus on comparing our work with the only known previous work
where Triad Census algorithm was accelerated on a high-end CMP(Cray XMT supercomputer).
There exists considerable amount of literature which describes work on algorithm, design,
implementation and optimization of sequential/parallel irregular/graph applications for com-
modity Multicore and high-end CMP systems. [26] studied cache-aware and cache-oblivious
based data structure design techniques for various sequential graph algorithms. They demon-
strate good execution speedups for Dijkstra’s single-source shortest path algorithm and Prim’s
minimum spanning tree using cache-aware adjacency array representations. Further, they ob-
tain significant speedups using cache-oblivious recursive implementation for Floyd-Warshall
algorithm. They also combine adjacency array based approach and cache-oblivious approach
and apply it to matching algorithm for bi-partite graphs.
Many authors have presented work on effective parallelization of irregular algorithms on com-
modity Multicore and Manycore processors. [33] investigate number of parallelization strate-
gies for mixed regular-irregular applications. The main focus of their work is to improve utiliza-
tion of performance critical resources(cache memory, NUMA specific thread/data mapping) of
the commodity Multicore processors(AMD Opteron, Intel Xeon). Authors in [34] study effect
of graph topology (random, sparse, high data locality) on performance of Parallel BFS perfor-
mance on shared memory Multicores(Dual socket Quad core Intel Xeon and AMD Opteron).
They propose an improved Parallel BFS algorithm which dynamically adapts itself(number of
threads) to the variations in graph topology. Authors in [35] evaluate number of basic dy-
namic parallel graph kernels(st-connectivity, BFS, betweenness centrality) on SUN Ultra Sparc
T2 and IBM p5 570 SMP and obtain good speedups (22x,11x, 23x). As such algorithms involve
run-time parallel insertion/deletions in large scale graphs of small-world networks, a number
of novel graph representation techniques are proposed. To get high performance for runtime
structural updates to graph data structure, they implement adjacency lists as resizeable adja-
cency arrays called Dynamic Arrays. Another data structure that they experiment for adjacency
representation is treap which allows faster deletions than dynamic arrays. They use similar
re-sizing strategy for treaps and dynamic arrays. To handle irregularity in graph topology, they
35
36 3.1 Acceleration of Triad Census on Cray XMT
make a hybrid adjacency representation which uses dynamic arrays for low degree vertices and
treap for high degree vertices.
A number of graph representations have been studied on GPGPU processor architectures.
[36] implemented and evaluated number of graph algorithms using Adjacency Array and Ad-
jacency Matrix based representations namely breadth first search, st-connectivity, single-source
shortest path, all-pairs shortest path, minimum spanning tree, and maximum flow algorithms.
Authors [37] use Edge list representations for implementing graph connectivity algorithms.
Authors in [38] used optimized linked list representation for irregular graph algorithms like
list ranking and connected components. They use arrays and sub-lists to improve performance
bottlenecks associated with obtaining good performance on GPGPU’s. Further they make some
comparisons of speedups obtained on SMP’s and GPGPU’s. Some recent work[39, 40] evaluates
and compares performance of parallel graph algorithms(BFS) on different multicore processors
such as Intel Xeon Multicores(Nehalem), SUN Ultra Sparc T2 and NVIDIA Tesla/Fermi GPGPU’s
There also exists number of studies[9, 35, 41, 42, 43] where irregular applications are eval-
uated on high-end CMP’s such as SUN Niagara, SUN Ultra Sparc T2, Cray MTA-2 and Cray XMT.
These studies demonstrate capabilities of such highly latency tolerant multithreaded processor
architectures to effectively handle applications with high and varying degree of irregularities.
Authors in [44] demonstrate effectiveness of shared memory CMP called Cray MTA-2(40 core,
no cache hierarchy but with 128 hw threads) to yield highly scalable performance for irreg-
ular graph algorithms(BFS and st-connectivity) even when processing graph networks with
highly irregular topologies(random, scale-free, sparse). Similarly, authors in [45] present an
improved parallel algorithm of betweenness centrality graph kernel and obtain good parallel
speedups(10x) on Cray XMT(16 PE). In the next section, we will present details of the only
known previous work where Triad Census algorithm was accelerated.
3.1 Acceleration of Triad Census on Cray XMT
In their work[9], authors evaluated Multithreaded implementation of the original Subquadratic
Triad census algorithm [8] on Cray XMT supercomputer. Table 3.1 summarizes the details of
their work and also highlights(see Comments column) comparison of our work to it along with
our contributions.
Authors evaluated four versions of Subquadratic triad census algorithm on Cray XMT. First,
they implemented the original Subquadratic triad census algorithm with minor variations. Sec-
ond, they implement a distributed task queue based implementation of their modified sub-
quadratic triad census algorithm. In this implementation, each task was a canonical dyad and
task queues were created sequentially. For synchronizing concurrent updates to census array,
the implementation used fast atomics. For storing graph data, both implementations used com-
pact data structures(no details). For parallelization, they relied on compiler to automatically
parallelize loops(implicit parallelism). Third and fourth implementations basically augment
loops in their first and second implementations with loop future compiler pragma to designate
each iteration of the loop to be executed by a thread. For experimentation and performance
evaluation, they used one real world graph data set and two random graph data sets(generated
using a tool). By iterating through phases of simulations–optimization (compiler directives,
stream management, load balancing by interleaved loop iteration scheduling), they are able to
obtain nearly linear speed-ups for most of their implementations on almost all data sets. They
conclude that loop future versions scale better with high CPU utilization (load balancing)
37 3.1 Acceleration of Triad Census on Cray XMT
Table 3.1. Comparison of our work to previous work on accleration of Triad Cenus on Cray
XMT
Previous work Our work Comments
Hardware Cray XMT with 4
threadstorm multi-
threaded VLIW proces-
sors each supporting
128 hardware threads
(streams) and 1 TB
shared memory. In-
cludes high-speed
interconnect and
network 3D-Torus
topology. Fine-grained
multithreading - thread
management and
synchronization.
AlaRI server as Multicore
Dual socket Intel Quad
Core Xeon E5630 with 2
hardware threads per core
and 16 GB shared memory
in cc-NUMA (8GB/socket)
configuration. Socket-to-
socket and socket-memory
communication through
Intel QPI interconnect.
ICS Workstation with
NVIDIA Tesla GPGPU
ICS Workstation has
NVIDIA Tesla C1060
GPGPU processor with
4GB of GDDR3 memory.
The chip has 240-scalar
processor cores arranged
as 30 Streaming Mul-
tiprocessors(SM) and
supports maximum 1024
hardware threads per SM.
This GPGPU is connected
to Intel Quad core Xeon
X5260 processor with 8 GB
memory through a PCIe
2.0 x16 interface.
Cray XMT belongs to class of
Supercomputers. They do not
rely on exploiting reference
locality and instead leverage
performance gains from la-
tency hiding (highly latency
tolerant) through fine-grained
multithreading. Out of many
Multithreaded processor based
systems in the market, Cray
XMT’s have been found to
fit well the requirements of
latency sensitive irregular
algorithms, particularly ir-
regular graph algorithms. In
the past, many data intensive
irregular algorithms 1 2 have
performed well on this ma-
chine architecture.
Xeon Multicore is a server
class machine. Multicores use
cache hierarchy, branch pre-
dictor, hardware prefetching,
out-of-order processing and
coarse grained thread level
parallelism for hiding memory
latency. They are designed for
general purpose applications.
Tesla GPGPU is a graphics
hardware designed for use
as co-processor. GPGPU’s use
massive fine-grained multi-
threading and high GDDR
memory bandwidth for hiding
and reducing memory latency.
Designed for data parallel
applications.
Continued on next page
1Source: PNNL, US
2Source: CSCS, CH
38 3.1 Acceleration of Triad Census on Cray XMT
Previous work Our work Comments
Programming
environ-
ment
Distributed shared
memory model(shared
address space). Paral-
lelizing C/C++ Cray
compiler automatically
parallelizes loops(also
nested). Compiler gen-
erates special load store
instructions and VLIW
instructions(upto three
operations - control,
arithmetic, memory).
Pragmas for controlling
multi-threading and
Canal profiling tool.
Multicore
- Implemented using C,
Pthread and gcc compiler
- For application profiling,
Intel Vtune Amplifier was
used
Tesla C1060
- Implemented using
C++/STL and C/CUDA
- Application profiling was
done using CUDA com-
puteprof command line
profiler
No comments
Data sets Only one real world
graph data set used
- patents. All other
graph sets were syn-
thetic graphs generated
using some tool.
Data processing on CMP’s,
Multicores and GPGPU’s
takes place on real world
data and hence we choose
real world graph data sets
like Amazon, Google We-
bgraph, Slashdot, Patents,
Actors(IMDB). All data
sets obtained from from
Stanford Large Network
Dataset Collection and
pajek collection
Performance evaluation based
on real world graph net-
works increases reliability of
our speedup results
Branch Opti-
mizations
No specific details in
the work
We identified and opti-
mized some branches (two
if statement and two func-
tion calls) which were un-
necessary and used pre-
computed values to im-
prove performance (Ver-
sion v0.4)
Improved speedup by upto
1.23x on Xeon Multicore (on
GPGPU’s we have not mea-
sured improvement as we al-
ready included this optimiza-
tion in initial implementation)
Continued on next page
39 3.1 Acceleration of Triad Census on Cray XMT
Previous work Our work Comments
Graph data
structures
Linked list and compact
data structures(exact
details not specified)
Multicores
- Linked list and cache
blocked linked list(block
size optimized after experi-
mentation)
- Implemented a novel
Sorted insert algorithm for
cache blocked linked lists
Tesla GPGPU
- All graph data was first
constructed on CPU using
dynamic arrays. This graph
data is then transformed to
CRS format for processing
by GPGPU
Multicores
With cache blocked linked list
our sequential implementation
ran 1.03x to 6.04x faster, with
3.1x average speedup
Memory
challenges
Cray XMT
1 TB memory with Cray
specific memory alloca-
tor
AlaRI Xeon Multicore
- 16 GB (2-way NUMA)
with standard non- NUMA-
aware memory allocator
ICS Tesla GPGPU
- 4 GB GDDR3 with no sup-
port for dynamic memory
allocation
In our GPGPU
C++/STL/C/CUDA Triad
Census code, we implemented
technique (with almost zero
overhead) to run triad census
algorithm without needing
any dynamic allocation. This
enabled us to accommodate
big graph data sets within
GPGPU memory limits. Our
changes can be ported to any
class of GPGPU accelerators
can give good performance
in comparison to scenarios in
which frequent dynamic mem-
ory allocation/de-allocation
occurs at runtime.
Continued on next page
40 3.1 Acceleration of Triad Census on Cray XMT
Previous work Our work Comments
Synch. of
concurrent
updates to
triad census
algorithm
Cray XMT
Fast atomics used for
synchronizing concur-
rent updates to shared
triad census array
Intel Xeon Multicore
Experimented with two
approaches:
- shared triad census array
updated through atomics
- local triad census array
per thread (completely
decoupled threads)
Tesla GPGPU
We experimented with two
approaches:
- shared triad census array
stored in global memory
and protected with fast
atomics
- per thread block shared
triad census array pro-
tected through atomics
(completely decoupled
thread blocks)
Multicores
We found that second ap-
proach i.e. Local census array
per thread is generally faster
Tesla GPGPU
We found that the per thread
block triad census array ap-
proach gave us on an average
2x speedup
Load Bal-
ancing
Cray XMT
Distributed task queue
approach
Multicores
Distributed task queues.
We identified different
load balancing strategies
for triad census algorithm.
Refined authors load
balancing strategy and
brought the implemented
load balancing closer to
ideal load balancing with-
out any overhead
Tesla GPGPU
Centralized task queues in
CRS format
Our refined load balancing
strategy improves balancing of
workload.
Continued on next page
41 3.1 Acceleration of Triad Census on Cray XMT
Previous work Our work Comments
Speedups Cray XMT
Good linear or nearly
linear speedups
Multicore and GPGPU
Linear and Super linear
speedups obtained due to
sequential optimizations.
Some of the speedups were
sub-linear for highly sparse
graphs where Multicore
and GPGPU’s memory
couldn’t beat Cray XMT’s
ability to hide memory
latency
Multicore
Super-linear speedups in-
cluded 56x(google) , Ama-
zon(31x) and Slashdot(25x).
Sub-linear speedups in-
cluded Patents(8.4x) and
Actors(11x)
Tesla GPGPU
Superlinear speedups
of 58.4x(google),
38.5x(Amazon) and
Slashdot(22x). Sublinear
speedups of 4.2x(patents)
and 8.5x(Actors)
42 3.1 Acceleration of Triad Census on Cray XMT
Chapter 4
Evaluation of Triad Census on Intel
Multicores
In this chapter we discuss implementation and evaluation of the Subquadratic Triad Census
algorithm on Intel Multicores. As said before, for implementation and testing we used Intel
Core 2 Duo processor based system. Later for optimizations, experiments and evaluation we
used a system containing two Intel Xeon processors.
4.1 Methodology
We briefly explain the overall methodology we adopted for accelerating triad census algorithm
starting:
• Start with a sequential triad census implementation
• Obtain baseline results on Intel core 2 duo processor and Intel Xeon processor based
system
• Profile the sequential implementation to find function hotsposts where the algorithm is
spending most of the time. Analyse the profiling results and identify prospective sequen-
tial optimization’s
• Evaluate Sequential optimization’s to accelerate sequential performance
• Create a Multithreaded implementation of optimized sequential implementation
• Profile Multithreaded implementation and identify potential sources of performance bot-
tlenecks. Analyse the source of bottlenecks in multithreaded implementation and identify
corresponding optimizations to alleviate them
• Implement and evaluate the optimizations to further accelerate performance of the Mul-
tithreaded triad census algorithm
Table 4.1 shows the graph data sets that are used for evaluation of Triad Census algorithm
on our Multicore and GPGPU based machines.
43
44 4.2 Sequential Triad Census
Table 4.1. Graph data sets used for evaluation
Data Set Vertices Edges Directed ? Description
Actors[46] 520223 2940808 No Actor network data based on
imdb.com.
Patents [47] 3774768 16518948 Yes The NBER U.S. Patent Citations
Data File, version 2001.
Amazon[48] 403394 3387388 Yes Amazon product co-purchasing
network from June 01 2003
Slashdot [49] 82144 549202 Yes Slashdot Zoo signed social net-
work from February 21 2009
Google [49] 916428 5105039 Yes Webgraph from the Google pro-
gramming contest, 2002
eatSR.net [50] 23219 325589 Yes Pajek Edinburgh Associative The-
saurus, 2003
NDwww.net [46] 325729 1497135 Yes Notre Dam web-page network
4.2 Sequential Triad Census
We initially obtained a sequential implementation of Triad Census algorithm(see figure 2.4
and figure 2.5 in chapter 2) from authors [9]. This implementation is based upon original
Subquadratic Triad Census algorithm [8]. The implementation is written in C programming
language and uses linked lists based adjacency list data structure for graph representation.
4.2.1 Implementation details
The graph is read from an input graph file which contains list of vertices and edges(or arcs)
of the undirected(directed) graph. The graph file normally lists Order of graph data set ex-
plicitly and optionally lists Size of the graph. Each vertex and edge(arc) has an optional text
label. Graph data sets differentiate between directed and undirected graphs through different
keywords: arcs for directed graphs and edge for undirected graphs.
The read graph is stored as an array of Vertex objects (AOS or Array of structures). Each
Vertex object has a text label and a stores pointer to linked list of Edge objects incident with
it in the outward direction i.e. Vertex object is the tail of all the arcs represented by its list
of Edge objects. An Edge object represents a directed edge. It has a text label and stores
index of destination vertex to which it is connected i.e. destination vertex is the head of the arc
represented by the Edge object. To access the directed edge list of a vertex, we simply index
the array of Vertex objects by the vertex id.
The Triad Census algorithm also requires usage of open neighbourhood set for each vertex
of the graph. For this an array of neighbour lists (array of list pointers) structure is used in
which each element represents a pointer to open neighbourhood list of a given vertex. Each
neighbour list is a linked list of Neighbour objects. Note that each Neighbour object still
represents a vertex. To access the open neighbourhood list of a vertex, we can simply index the
array of neighbour list pointers by the vertex id. Figure 4.1 shows the graph data structure.
45 4.2 Sequential Triad Census
Figure 4.1. Top Left: shows a digraph of order=7 and size=7. Top Right: shows key to
bottom two diagrams. Bottom Right: shows array of neighbour list pointers indexed by
vertex id. Bottom left: shows array of Vertex objects indexed by vertex id
4.2.2 Baseline results
Sequential Triad Census code was instrumented to get wall elapsed time with millisecond res-
olution. The code was then compiled with settings shown in table 4.2.
The sequential implementation was then executed with many different graph data sets on
both Intel Core 2 duo and Intel Xeon server to obtain baseline results(v0.1). Table 4.6(row
46 4.2 Sequential Triad Census
Table 4.2. Compilation settings for Sequential Triad Census implementation
Machine Compiler Compilation flags
Intel Core 2 Duo(x86-64 ISA) gcc 4.4.1 -c -m64 -pipe -g Wall -O3
Dual-socket Intel Xeon E5630 gcc 4.1.2 -c -m64 -pipe -g Wall -O3
with Version v0.1) shows the initially obtained baseline results for different graph data sets
on Intel Xeon E5630 Westmere-EP platform.
4.2.3 Performance profiling and analysis
To get insights into performance bottlenecks in sequential Triad Census implementation, we
performed its code profiling on the Intel Core 2 duo machine. The profiling was done for many
data sets. Due to space constraints, here we will present profiling data only for three data sets
namely eatSR.net, NDwww.net and Actors. Although the profiling experiments presented here
are specific to respective data sets, but the derived analysis results are general and are thus
applicable to the other data sets.
Initially, we did profiling using gprof tool to identify hotspots at application level(function).
Here, the Sequential Triad Census implementation was compiled with highest(-O3) optimiza-
tion level. Below is a gprof execution time trace for data sets eatSR.net and NDwww.net.
eatSR.net
% cumulative self self total
time seconds seconds calls s/call s/call name
52.45 40.77 40.77 189408983 0.00 0.00 is_edge
33.17 66.56 25.79 34415992 0.00 0.00 add_neighbor
9.08 73.62 7.06 20783877 0.00 0.00 is_neighbor
NDwww.net
% cumulative self self total
time seconds seconds calls s/call s/call name
76.09 243.29 243.29 57670691 0.00 0.00 is_neighbor
21.00 310.44 67.15 7187328 0.00 0.00 add_neighbor
1.89 316.49 6.05 348535010 0.00 0.00 is_edge
As can be observed above, in one of the traces, the function is_edge is the hotspot while
in other trace the function is_neighbor is the hotspot. Also, add_neighbour function occupies
the same relative position in terms of ranking of hotspot in both traces. These three hotspot
functions basically traverse linked list either searching for an element(is_edge and is_neighbor
functions have same code but traverse on different lists) or finding a position where element is
to be inserted (add_neighbor).
To have a deeper insight in the actual source of bottleneck, we profiled using another hotspot
profiling tool named Zoom [51] which allows static and dynamic profiling of an application.
Zoom allows collection of statistics at different levels (process, module, function, code line
47 4.2 Sequential Triad Census
level, assembly instruction level). Static profiling statistics are processor architecture spe-
cific and include statistics like distribution of assembly instructions, nominal instruction tim-
ing(latency:timing) and likely resulting pipeline stalls during execution. Dynamic statistics
are collected during runtime and include execution time(self, total) at multiple levels(module,
function, code line level, assembly instruction). User needs to choose or create a profile con-
figuration in order to get the dynamic statistics. Profile configuration provides what details on
how the tool should sample(time based or event based). Event based profile configurations
include cache miss and retired instruction profile
We profiled with unoptimized (w/o -O3 compilation) Sequential Triad Census implementation
on eatSR.net data set using Zoom’s event-based profile configuration called cache miss config-
uration: A sample is taken every 1,000 occurrences of event "Last Level Cache Misses". The
last level cache here represents the level L2 of cache in Intel Core 2 Duo processor. Below is a
partial trace of statistics collected from Zoom.
=====================================================
Hotspot - Process SubQuadTriads [20424]
=====================================================
Self Total Symbol Module
2.5min 2.5min is_edge SubQuadTriads
13.3sec 14.3sec inc_adj_union SubQuadTriads
=====================================================
Out of the total 2.9 min, function is_edge alone takes 2.5 min. Next, inc_adj_union function
takes 13.3 sec. Below tables show elapsed execution time at source and assembly code levels
for the is_edge and inc_adj_union functions.
is_edge code trace for eatSR.net data set
------------------------------------------------------
Source Line Time Source Code
------------------------------------------------------
line 0 2.4min if(v2 < iptr->dest_node){
line 1 5.2sec iptr = iptr->next ;
line 2 }
------------------------------------------------------
is_edge assembly instruction trace for eatSR.net data set
-------------------------------------------------------------------------------
Time Address Stalls Instruction Source Line
-------------------------------------------------------------------------------
+ 0x41 mov rax,[rbp-0x8] line 0
+ 0x45 mov rax, [rax+0xc8] line 0
2.4min + 0x4c 2 cmp rax, [rbp-0x28] line 0
80.8ms + 0x50 jbe 0x401960 <is_edge + 0x63> line 0
99.3ms + 0x52 mov rax, [rbp-0x8] line 1
22.9ms + 0x56 2 mov rax, [rax+0xd0] line 1
3.8sec + 0x5d 2 mov [rbp-0x8], rax line 1
1.3sec + 0x61 jmp 0x40197f <is_edge + 0x82> line 1
48 4.2 Sequential Triad Census
-------------------------------------------------------------------------------
inc_adj_union code trace for eatSR.net data set
-------------------------------------------------------------------
Source Line Time Source Code
-------------------------------------------------------------------
line 0 1.2msec while (iptr) {
line 1 13.3sec if (iptr->dest_node != exc_neighbor1) {
-------------------------------------------------------------------
inc_adj_union assembly instruction trace for eatSR.net data set
--------------------------------------------------------------------------------
Time Address Stalls Instruction Source Line
--------------------------------------------------------------------------------
566.4us + 0x9e mov rax, [rbp-0x20] line 1
4.4ms + 0xa2 2 mov rax, [rax] line 1
13.3sec + 0xa5 2 cmp rax, [rbp-0x28] line 1
99.8us + 0xa9 jz 0x40205d <inc_adj_union + 0xdc> line 1
--------------------------------------------------------------------------------
In the above code traces, one can see that we are basically traversing linked list(loop) and
making a decision(if statement) based on data value read from current node of the linked list.
For both functions, the first four lines of instruction traces are similar, as they contain two move
- mov instructions followed by a cmp instruction and then a conditional jump - jbe/jz instruc-
tion.
Consider for the time being the source code line 0 of function is_edge and four lines of assembly
instruction trace corresponding to this source code line. Among the four assembly instructions,
the first is a move instruction which makes a memory read request(iptr) and writes it to rax
register. Based on the value in rax register, the second move instruction makes a memory read
request(value iptr->dest_node) and writes it to rax register. Clearly, there is a RAW depen-
dency and also WAW dependency between these two move instructions. A WAW dependency
also exists between the second move instruction and following compare instruction leading to
pipeline stalls. Further, due to poor locality of linked lists, it is very likely that cache misses
occur when each of these three instructions(two move and one compare) make a memory read
request (iptr, value iptr->dest_node, value v2). There is also a RAW dependency between com-
pare cmp and the jump instruction jbe, which likely causes jump instruction to stall as it cannot
resolve its target branch until the compare instruction writes its result to the appropriate flag
registers(ZF or CF). The same behaviour can be observed for inc_adj_union function.
Later, we ran hotspot analysis on Actor graph data set using Intel Vtune Amplifier tool [52].
This data set takes a total 104.6 sec CPU execution time. From the trace, the main bottleneck
function is triad_code which consumes nearly 71 sec.
Thread / Function / Call Stack CPU Time Function (Full)
main 104.570s
triad_code 70.193s triad_code
make_triad_census 15.457s make_triad_census
49 4.2 Sequential Triad Census
inc_adj_union 7.692s inc_adj_union
add_neighbor 6.950s add_neighbor
triad_code code trace for Actors data set
Line Source CPU Time
773 if (is_edge (verts, u, w)) { 9.121s
774 edges[2] = 1; 0.015s
775 }
776
777 if (is_edge (verts, w, u)) { 42.568s
778 edges[3] = 1; 0.062s
779 }
triad_code instruction trace for Actors data set
Address Line Assembly CPU Time
0x401df1 773 cmp edi, dword ptr [eax+0xcc] 0.651s
0x401df7 773 jnbe 0x401e0f <Block 30> 5.091s
0x401df9 773 jb 0x401e03 <Block 28> 1.868s
0x401e38 777 cmp ebx, dword ptr [eax+0xcc] 0.093s
0x401e3e 777 jnbe 0x401e56 <Block 40> 24.164s
Block 36:
0x401e40 777 jb 0x401e4a <Block 38> 15.955s
From the above traces, we think there are mainly two factors that contribute to the bot-
tleneck in function triad_code. First is the irregularity in control flow which arises due to one
function call(is_edge) and enclosing data dependent branch instruction(conditional jump based
on result of is_edge function). Second factor contributing to poor performance is the perfor-
mance of is_edge function itself which we already explained earlier.
The profiling experiments were also done on other data sets and similar results were observed
for most of these functions. As a result, we think the following three factors either individually
or in combination, leads to poor performance of sequential Triad Census implementation:
• Occurrence of cache miss due to irregular memory access during linked list traversal
• Irregular(data dependent) control flow
• RAW/WAW dependencies leading to possible pipeline stalls during execution
All these sources of bottlenecks, limit the ILP of the Sequential Triad Census algorithm im-
plementation. In summary, we note down following important conclusions from code profiling
exercise:
• We obtained base line results of Sequential Triad Census algorithm on Xeon Multicore
processor;
• Most execution time is spent during linked list traversal for searching or manipulation
which causes irregular memory access(cache misses), irregular(data-dependent) control-
flow and pipeline stalls due to RAW/WAW dependencies;
50 4.2 Sequential Triad Census
• We have identified the hotspot functions, code and instructions which are candidates for
further optimizations. The following are the hotspot functions: is_edge, add_neighbor,
is_neighbor, inc_adj_union, triad_code, make_triad_census.
4.2.4 Optimizing Sequential Triad Census
To address the source of bottlenecks in on sequential implementation of Triad Census algo-
rithm, we tried different optimizations to improve its execution time. Some of the optimiza-
tions resulted in substantial improvements while other optimizations produced results contrary
to expectations and of course were not used. Although the experimental results presented here
are for Actors data set, but we selected only those optimizations which also worked on other
data sets. We did incremental optimizations which means that any presented optimization in-
cludes all previously applied optimizations. These incremental optimizations are marked as
versions v0.2, v0.3, v0.4, v0.5. We refer to baseline results as v0.1. After evaluating these
optimizations on Intel Core 2 Duo for Actors Data, we evaluated these different versions for
different graph data sets on dual-socket Intel Xeon Multicore. Finally we choose only those
optimizations which worked for most of the data sets. If any optimization that worked on Intel
Core 2 duo but lead to bad results on Intel Xeon Multicore for some or many graph data sets,
we decided not to use it in the final combinations of optimizations.
Throughout this optimization phase, we compiled our triad census application with Level 3
optimization level of GCC compiler. At this level, gcc automatically performs most of the op-
timizations. The baseline(v0.1) execution time for Actors graph data set on Intel core 2 duo
was 103 sec. We will now explain the optimizations we experimented with on Intel Core 2 Duo.
Cache fitting the data structures (v0.2)
Clearly, the adjacency based structures implemented using linked-lists make the algorithm sen-
sitive to memory latency. In order to reduce the cache misses incurred while traversing the
adjacency linked lists, we tried an optimization which aligns the structure to cache line bound-
ary. In this cache alignment optimization, we basically try to fit size of data structure to a
nearest multiple of cache line size. This is not the same as aligning dynamic allocation of
memory block to cache line boundary. We first check the size and layout of each adjacency
structure(Edge, Node, Neighbour) using utility called pahole 1and also used gcc compiler op-
timization "-Wpadded" to check which of the structures were padded by compiler to force
alignment to structure boundaries. We rearranged the structure elements according to mem-
ber size and access frequency(hot fields) to reduce padding and any memory holes. The Edge
and Node structures were then fitted to cache line multiples of 3 or 4 by decreasing/increasing
the length of label strings in edge and node structures and accordingly. As we see in table 4.5,
in both cases Fitting (inc) and Fitting (dec) performance improvement is 1.14x speedup w.r.t
baseline, but we decided to choose Fitting (dec) optimization as it lowers the memory footprint.
Function inlining (v0.3)
It’s generally a good idea to inline the functions that are frequently called and have small
code size. This results in overall reduction of function call-return overhead and regularizes the
1http://www.ohloh.net/p/pahole
51 4.2 Sequential Triad Census
Table 4.3. Baseline structure characteristics
Structure Size(bytes) Cache lines(64 byte cache line) Wasted(bytes)
Edge 208 4 16
Neighbour 216 4 24
Table 4.4. After structure fitting (decreasing struct size, increasing struct size)
Structure Size(bytes) Cache lines(64 byte cache line) Wasted(bytes)
Edge (dec, inc) 192, 256 3, 4 0, 0
Neighbour (dec, inc) 192, 256 3, 4 0, 0
Table 4.5. Performance improvement after applying optimization(Actors)
Time(sec) Memory(MB)
Baseline(v0.1) 103 829
Fitting (dec) 90 776
Fitting (inc) 90 989
control-flow thereby reducing instruction cache misses. In addition, this gives compiler a big-
ger windows of instructions to extract ILP. At gcc optimization level 2 and 3, this optimization is
automatically applied. Nevertheless, we still inlined the possible functions by giving compiler
hints and compiled with "-Winline" compiler option to see which of the functions could not be
inlined. Resulting performance improvement is 1.02x speedup w.r.t previous optimization or
1.17x speedup w.r.t baseline. Execution time now reduced to 88 sec.
Eliminating branches by using Pre-computed Triad Types (v0.4)
The Triad Census function Figure 2.4 ) uses TriadCode function ( Figure 2.5 ) for calculating
all triads except null triads (type-1 triads calculated in line 29) and dyadic triads (type-2 and
type-3 triads counted in lines 9-14). Also Lines 9-13 of the Triad Census algorithm already pre-
compute the part of Line 1-2 calculations of TriadCode algorithm i.e. IsEdge(u,v)+2*(IsEdge(v,u).
Instead of recalculating this sum again in TriadCode function, we can simply pass this pre-
computed value to it(See Figure 4.2 ). This reduces the number of IsEdge function calls in the
TriadCode function from 6 to 4 i.e 33% lesser function calls. This results in performance im-
provement of approximately 1.18x speedup w.r.t previous optimization or 1.36x speedup w.r.t
baseline. Execution time now reduced to 76 sec.
> Changes in Subquadratic Triad Census Algorithm Between lines 8-9 we insert three new lines
with the following code:
IsEdge[0] = IsEdge(u, v)
IsEdge[1] = IsEdge(v, u)
precomputed_triad_type = IsEdge[0] + 2 * IsEdge[1];
Line 9 is modified to following:
if IsEdge[0] && IsEdge[1] then
52 4.2 Sequential Triad Census
Function: TriadCode
Input: vertices u, v, w and pre-computed triad type value
Output: isomorphic or non-isomorphic triad code
1: TriadCodeVal = precomputed_triad_type
2: TriadCodeVal+= 4 ∗ IsEd ge(u, w)
3: TriadCodeVal+= 8 ∗ IsEd ge(w, u)
4: TriadCodeVal+= 16 ∗ IsEd ge(v, w)
5: TriadCodeVal+= 32 ∗ IsEd ge(w, v)
6: return(IsIsomorphic?TriadTable[TriadCodeVal] : TriadCodeVal) + 1
Figure 4.2. Modified TriadCode Algorithm
Line 17 is replaced with following:
TriadType = TriadCode(u, v, w, precomputed_triad_type)
Cache blocking, faster searching and AOS-to-SOA (v0.5) As already discussed in sub-
section 2.5.1 of chapter 2, unrolled lists or cache blocked linked lists is a good alternative to
linked lists when it is critical to obtain good cache hit ratio. To further improve performance
of Triad Census implementation, we replaced linked lists structures with cache blocked linked
lists. The replaced list structures includes edge and neighbour lists. One notable algorithm
named Sorted Insert required significant implementation and optimization effort based on
heuristics. This algorithm was required because of the requirement of merging(set union and
set difference algorithm required while calculating set S) two neighbour lists which mandated
that both the neighbour lists participating in the merging(union) must be sorted. We tested
cache blocked implementation with various block sizes for edge and neighbour lists. The opti-
mal blocking value found was found to be 8 (edge blocking) and 16 (neighbour blocking).The
resulting performance improved from 76 sec to 60 sec.
To take full advantage of the Streaming SIMD Extensions (SSE) based capabilities in Intel
processors, the Intel optimization manual recommends to convert or swizzle data layout from
AOS(Array of structures) format to SOA (Structure of arrays) format [53]. Our cache blocked
implementation initially used SOA approach for blocking edge lists:
Listing 4.1. Original cache blocked edge structure in AOS format
#def ine BLOCK_EDGES 6
typedef s t r u c t edge
{
char edge_name[MAXLABEL ] ;
unsigned long long i n t dest_node ;
}EDGE, EDGE_PTR;
typedef s t r u c t edge_entry
{
EDGE array_edges [BLOCK_EDGES ] ;
s t r u c t edge_entry next ;
53 4.2 Sequential Triad Census
i n t l a s t _ i n d e x ;
} EDGE_SET , EDGE_SET_PTR ;
We replaced AOS with SOA based implementation
Listing 4.2. Modified cache blocked edge structure in SOA format
#def ine BLOCK_EDGES 6
typedef s t r u c t edge_entry
{
char arr_edge_names [BLOCK_EDGES][MAXLABEL ] ;
unsigned long long i n t arr_des t_nodes [BLOCK_EDGES ] ;
i n t l a s t _ i n d e x ;
s t r u c t edge_entry next ;
} EDGE_SET , EDGE_SET_PTR ;
As destination nodes are accessed more often than the edge names (example IsEdge(v1,v2) ),
SoA format saves memory and prevents unnecessary fetching of edge names, thus resulting is
more efficient use of the the caches and avoiding bandwidth wastage.
Also, in some of the functions, searching was required. This includes following functions:
add_neighbor, inc_copy_neighbor_set, is_edge, is_neighbor. The conventional wisdom 2 says
that for a pre-sorted array of length smaller than 100 elements, linear search would beat bi-
nary search. However in our cache blocked implementation, we that binary search resulted in
slightly better performance. The execution time improved further from 60 sec to 50 sec. Over-
all, the three optimizations in v0.5 resulted in performance speedup of 1.52x w.r.t previous
optimization or 2.06x speedup w.r.t baseline.
Besides, the above optimizations, we had tried few more optimizations which either proved
detrimental to performance or didn’t improve any performance. This includes
• Cache alignment optimization that was tried would allocate memory blocks on cache line
boundaries by replacing function malloc with posix_memalign.
• Software prefetching to reduce memory latency due to pointer chasing in linked list, im-
plemented through assembly instructions(PREFETCHNTA) and compiler based directives
(__mm_prefetch)
4.2.5 Summary of sequential optimizations
Figure 4.3 shows the resulting performance improvement(speedup) from the above optimiza-
tions(incremental) for the Actors graph data set on Intel Core 2 duo machine.
We then ran this optimized code on our Intel Xeon Multicore for different graph data sets.
Table 4.6 shows the resulting speedups and Table 4.7 shows the execution time as observed on
the Intel Xeon Multicore. What we observe is that Actors and Patents data sets benefit most
from optimizations v0.2, v0.3 and v0.4 while Amazon, Slashdot and Google data sets benefit
most from v0.5 optimization. Overall the above sequential optimizations resulted in average
3.3x speedup, maximum 6x speedup and minimum 1.3x speedup.
2http://en.wikipedia.org/wiki/Binary_search_algorithm
54 4.2 Sequential Triad Census
Table 4.6. Sequential Triad Census performance speedup and memory overhead results on
Intel Xeon Multicore
Overall performance speedup w.r.t baseline(v0.1)
Optimization Version Actors Patents Amazon Slashdot Google
Baseline v0.1 1.00 1.00 1.00 1.00 1.00
Cache fitting v0.2 1.13 1.16 0.99 1.01 1.00
Function inlining v0.3 1.13 1.18 1.00 1.01 1.00
Eliminating expensive
branches
v0.4 1.40 1.32 1.00 1.05 1.00
Cache blocking, faster
searching, ASO-to-SOA
v0.5 2.00 1.37 4.58 2.91 6.06
Overall memory overhead w.r.t baseline(v0.1)
Baseline v0.1 1.00 1.00 1.00 1.00 1.00
Cache fitting v0.2 0.94 0.94 0.94 0.94 0.94
Function inlining v0.3 0.94 0.94 0.94 0.94 0.94
Eliminating expensive
branches
v0.4 0.94 0.94 0.94 0.94 0.94
Cache blocking, faster
searching, ASO-to-SOA
v0.5 1.40 1.08 1.05 1.06 1.17
Table 4.7. Sequential Triad Census execution time(sec) and memory consumption(MB)
results on Intel Xeon Multicore
Overall Execution Time(seconds)
Optimization Version Actors Patents Amazon Slashdot Google
Baseline v0.1 60 324 394 393 12203
Cache fitting v0.2 53 280 396 389 12183
Function inlining v0.3 53 274 394 390 12180
Eliminating expensive
branches
v0.4 43 245 394 373 12160
Cache blocking, faster
searching and AOS-to-
SOA
v0.5 30 237 86 135 2013
Overall memory consumption(MBytes)
Baseline v0.1 830 5320 960 170 1550
Cache fitting v0.2 778 5010 903 160 1457
Function inlining v0.3 778 5010 903 160 1457
Eliminating expensive
branches
v0.4 778 5010 903 160 1457
Cache blocking, faster
searching and AOS-to-
SOA
v0.5 1163 5762 1007 181 1813
55 4.3 Multithreaded Triad Census
Figure 4.3. Performance speedup of optimized sequential triad census for actors graph
network on Intel Core 2 Duo with baseline execution time of 103 sec.
4.3 Multithreaded Triad Census
In order to harness hardware multi-threading capabilities of the dual Intel Quad Core Xeon
processors(2 sockets x 4 core/processor x 2 HT/core = 16 Hardware threads), we created mul-
tithreaded implementation of the optimized sequential triad census algorithm using PThread
multi-threading framework. PThread is a POSIX standard for explicit threads implemented in
form of a C based API on many platforms(Unix, Linux, Mac, sun and windows). It provides data
structures, algorithms and constructs for explicit thread management and synchronization. Be-
sides PThread, there are number of libraries and frameworks available for implementation of
parallel algorithms on shared memory CMP’s. For implementations written in C/C++, other
popular libraries besides PThread are OpenMP, Boost threads, Intel TBB and Cilk. These li-
braries have different features and advantages/disadvantages. We choose PThread as it enables
programmer to exercise comparatively finer control over many different aspects of multithread-
ing.
4.3.1 Graph partitioning and Load Balancing strategy
Parallelization of many graph algorithm problems broadly requires three steps:
• Graph partitioning: This step involves finding ways to effectively decompose the graph
56 4.3 Multithreaded Triad Census
structure into substructures that can be processed in parallel. A good graph partition-
ing strategy would decompose it in such a way so as to decouple graph partitions from
each other and minimize communication & synchronization required between concur-
rent computations. At the same time, the partitioning procedure should be efficient in
terms of performance and memory overheads.
• Balancing workload: The amount of computation required to process an input data
set for producing the desired output is defined as the workload. When workload is
shared equally between processors or hardware threads, then each one of them will
spend roughly equal amount of time executing the assigned workload. In Multiprocessor
systems, good load balancing is essential for obtaining good scalability with increasing
number of processors and increasing size of data set. Many a times graph partitioning
step is not able to effectively address the load balancing problem. This is possible because
the amount of work involved in processing a graph partition may depend upon properties
of the parts ( vertices, edges, neighbourhoods, other active regions etc. ) of the graph
partition such as in-degree, out-degree, vertex/edge weights, neighbourhood size, cycles,
distance, eccentricity, radius, diameter, center, circumference etc. So there could be two
same sized graph partitions which may or may not have same workload present in them.
To load balance, the partitions of the graph should be sized to balance the workload.
• Mapping load balanced partitions to threads: Once the graph is partitioned according
to the load balancing strategy, different graph partitions can be allocated to different
threads for processing. If the partitions are completely decoupled i.e. independent,
threads may independently process their respective graph data and write the results to a
common sink(concurrent write to common destination is protected by a lock or atomic).
In case the partitions are not completely decoupled, the threads need to use some thread
communication and synchronization mechanism to produce the final results
In the case of Triad Census algorithm, the ideal partitioning and load balancing strategy is
not immediately clear. To understand this better, we will first analyse the distribution of work-
load in the algorithm. For any further discussions, we will refer to the Triad Census Algorithm
(See Figure 2.4 ). We will first define some units of workloads. A unit of W2 work is amount
of computation involved to increment T2 census count i.e. all computation required until Line
14 is executed once. A unit of W3 work is amount of computation involved to increment T3
census count i.e. all computation required until Line 18 is executed once. Note that, there
is dependency between T2 and T3 ( T2 requires |S| while T3 requires set S and pre-computed
triad type calculated under T2)
Referring to the Triad Census algorithm, it is clear that |N[u]| is the measure of W2 work-
load w.r.t vertex u. Note that W2 workload doesn’t depend upon |N[v]|. A higher or lower
value of |N[v]| will not impact number of times computations in line 9-14 are executed ( only
value of |N[u]| will ). By aggregating |N[u]| ∀u ∈ V , we can find total W2 workload for the
graph. If there are T threads, good load balancing of W2 workload will assign (
∑
u∈V
|N[u]|)/T
work to each thread. Similarly, measure of W3 workload w.r.t canonical dyad < u, v > is|S|. Lower or higher value of |S| will increase or decrease the number of times computa-
tion in lines 16-19 are executed. By aggregating |S| ∀ < u, v >, we can find total W3 work-
load for the graph. If there are T threads, ideal load balancing of W3 workload will assign
57 4.3 Multithreaded Triad Census
Function: Distributed Task Queue generation algorithm that generates an array of task
queues based on Canonical Dyad(non-uniform distr.) load balancing strategy
Input: Graph G = [V, E] where V is set of vertices, E is array of edge lists and N is array
of neighbour lists, MaxNsetSize is maximum size of neighbour set of a bunch of dyads
Output: Array of task queues - TQ[]
1: thid ← 1
2: NsetSize← 0
3:
4: for u ∈ V do
5: for v ∈ N[u] do
6: if u <v then
7: TQ[thid]← TQ[thid]∪< u, v >
8: S← N[u]∪ N[v] \ {u, v}
9: NsetSize← NsetSize+ |S|
10: if NsetSize > MaxNsetSize then
11: thid ← thid + 1
12: NsetSize← 0
13: end if
14: end if
15: end for
16: end for
Figure 4.4. Distributed Task Queue generation algorithm based on Canonical
Dyad(non-uniform distr.) load balancing
(
∑
u∈V
∑
v∈N[u],u<v
|N(u)∪ N(v) \ {u, v}|)/T work to each thread.
In order to partition the graph, we must first define as to what constitutes an independent
piece of work - or a task abstraction. Table 4.8 lists possible task abstraction and the corre-
sponding graph partitioning, load balancing approaches for Triad Census algorithm. Ideally,
one would like to load balance both W2 and W3 workload precisely. As we can see from ta-
ble 4.8, a perfect load balancing doesn’t come for free as overheads increase considerably. In
other words, there is a trade-off between load balancing and the associated overhead. From
scalability point of view, keeping performance overheads low is important and so Canonical
Dyad(uniform distr.) and Canonical Dyad(non-uniform distr.) approaches should be good
candidates and we decided to evaluate these two strategies. We implemented these two cho-
sen graph partitioning strategies by using the task queue based approach as used by authors
[9]. Figure 4.4 and Figure 4.5 shows the respective task queue generation algorithms based on
Canonical Dyad(non-uniform distr.) and Canonical Dyad(uniform distr.) strategy. Figure 4.6
shows the corresponding modified Sequential Triad Census algorithm that uses output(array
of task queues) of these task queue generation algorithms.
Both of these task queue generation algorithms are based on distributed task-queue data
structure [54]. A distributed task-queue data structure consists of an array of task queues
where each task queue is a queue of tasks. Both algorithms use the same task abstrac-
tion(canonical dyad < u, v > where u < v) and create task queues in the same way(see
58 4.3 Multithreaded Triad Census
Table 4.8. Task abstractions, graph partitioning, load balancing strategies. Assuming T
threads. n vertices, N(u) neighbours of vertex u
Task abstraction Partitioning strategy Load balancing strat-
egy
Comments
Vertex Distribute vertices
among threads
n/T vertices per thread Simple to implement,
No overheads, No or
poor load balancing
Dyad(uniform distri-
bution)
Distribute dyads uni-
formly among threads
(
∑
u∈V
|N[u]|)/T dyads
allocated per thread
Good load balancing of
W2 workload but not
W3 workload. Non-
canonical dyads will
unbalance the load dis-
tribution. Some per-
formance and mem-
ory overheads required
for assigning dyads to
each thread
Canonical
Dyad(uniform dis-
tribution)
Distribute canoni-
cal dyads uniformly
among threads
(
∑
u∈V
∑
v∈N[u],u<v
1)/T
canonical dyads allo-
cated per thread
Fully balanced W2
workload. W3 work-
load is subsumed
under canonical dyads
allocated to each
thread and is thus
partially balanced.
Overhead increases as
non-canonical dyads
need to be eliminated
Canonical Dyad(non-
uniform distribution)
Distribute canonical
dyads among threads
keeping W3 load same
for each thread
Number of dyads
are allocated to
each thread so that
W3 load across all
threads is equal
i.e.(
∑
u∈V
∑
v∈N[u],u<v
|N(u)∪
N(v)\{u, v}|)/T work-
load per thread
Good load balancing of
W3 workload and only
partial load balancing
of W2. Non-canonical
triads will unbalance
the W3 load distribu-
tion
Canonical Dyads and
Triads(uniform distri-
bution)
Distribute canoni-
cal dyads and triads
among threads uni-
formly
(
∑
u∈V
∑
v∈N[u],u<v
1)/T
canonical dyads allo-
cated per thread and
(
∑
u∈V
∑
v∈N[u],u<v
|N(u) ∪
N(v) \ {u, v}|)/T
canonical triads allo-
cated per thread
Load balances W3
and W2. Calculating
T2 census requires
that dyads and triads
be stored separately
and possibly pro-
cessed separately.
Non-canonical triads
will unbalance the
W3 load distribution.
Overheads increase
further as now triads
need to be stored
along with dyads.
Distribution of triads
increases performance
overheads
59 4.3 Multithreaded Triad Census
Function: Distributed Task Queue generation algorithm that generates an array of task
queues based on Canonical Dyad(uniform distr.) load balancing strategy
Input: Graph G = [V, E] where V is set of vertices, E is array of edge lists and N is array
of neighbour lists, MaxNsetSize is maximum size of neighbour set of a bunch of dyads
Output: Array of task queues - TQ[]
1: thid ← 1
2: NsetSize← 0
3:
4: for u ∈ V do
5: for v ∈ N[u] do
6: if u< v then
7: TQ[thid]← TQ[thid]∪< u, v >
8: NsetSize← NsetSize+ |N[u]|+ |N[v]| − 2
9: if NsetSize > MaxNsetSize then
10: thid ← thid + 1
11: NsetSize← 0
12: end if
13: end if
14: end for
15: end for
Figure 4.5. Distributed Task Queue generation algorithm based on Canonical
Dyad(uniform distr.) load balancing
Figure 4.7). They differ only in the manner in which workload of a task is calculated (which
depends upon the respective load balancing strategy). The algorithms create task queues iter-
atively by appending a task to the existing task queue. After appending a task to current task
queue, the workload of current task is calculated according to the load balancing strategy and
then this value is accumulated in the workload of current task queue (NsetSize). The unit of
workload for a task(canonical dyad) is size of its neighbour set. When workload of the current
task queue reaches a pre-defined maximum workload(MaxNsetSize), a new task queue is cre-
ated and the above procedure repeats.
The value of maximum workload per task queue - MaxNsetSize - can be varied to change the
number of created task queues. More workload per task queue means overall lesser number of
task queues and vice versa. Depending upon required number of task queues, we calculated
value of this maximum workload for each graph data set separately. In our Multithreaded
implementations, we decided to schedule one task-queue per thread i.e. required number of
task queues equals the required number of software threads. In this case, maximum workload
per task queue(or thread) equals the corresponding value defined in load balancing strategy
column in Table 4.8. Note that alternate task-queue to thread mapping strategies (for example
scheduling multiple task queues per thread) are possible. Also, Canonical Dyad(uniform distr.)
strategy implemented in Figure 4.5 is nearly the same approach as taken by authors [9], only
difference being that we modified the expression which calculates workload of current task
queue(NsetSize). The reason for this slight modification will be explained in next section.
60 4.3 Multithreaded Triad Census
Function: Sequential Subquadratic Triad Census Algorithm based on distributed task
queues
Input: Graph G = [V, E] where V is set of vertices, E is array of edge lists, N is array of
neighbour lists, TQ is array of task queues
Output: Census array with frequencies of isomorphic triadic types
1: for i = 1→ 16 do
2: Census[i]← 0
3: end for
4:
5: for taskqueue ∈ TQ[] do
6: for task ∈ taskqueue do
7: u← task.u
8: v← task.v
9: S← N(u)∪ N(v) \ {u, v}
10: IsEd ge[0]← IsEd ge(u, v)
11: IsEd ge[1]← IsEd ge(v, u)
12: precomputed_t r iad_t ype← IsEd ge[0] + 2 ∗ IsEd ge[1]
13: if IsEd ge[0]∧ IsEd ge[1] then
14: TriadT ype← 3
15: else
16: TriadT ype← 2
17: end if
18: Census[TriadT ype]← Census[TriadT ype] + n− |S| − 2
19: for w ∈ S do
20: if v <w ∨ (w <v ∧ u <w ∧¬IsNeighbour(u, w)) then
21: TriadT ype← TriadCode(u, v, w, precomputed_t r iad_t ype)
22: Census[TriadT ype]← Census[TriadT ype] + 1
23: end if
24: end for
25: end for
26: end for
27:
28: sum← 0
29: for i = 2→ 16 do
30: sum← sum+ Census[i]
31: end for
32: Census[1]← n ∗ (n− 1) ∗ (n− 2)/6− sum
Figure 4.6. Subquadratic Triad Census Algorithm based on distributed task queues
4.3.2 Using Canonical Dyad(non-uniform distr.) load balancing(v0.6)
Based on Canonical Dyad(non-uniform distr.) load balancing approach, we created our initial
Multithreaded implementation using PThread multithreading framework. Each created thread
is assigned one task queue from the array of task queues(see Figure 4.6 and Figure 4.7). All
61 4.3 Multithreaded Triad Census
Figure 4.7. Distributed task queue data structure
threads execute the same triad census algorithm but each thread processes tasks in its own
task queue. To build final triad census array, we mainly experimented with following two
approaches and found that second approach is slightly better as it decouples the threads com-
pletely.
1. Single census array shared between threads: In this approach, a single census array is
shared among all the threads. In order to synchronize concurrent updates to census array
from multiple threads, we used fast atomic primitive __sync_fetch_and_add provided by
gcc.
2. Local census array per thread: In this approach, we assigned each thread its own
census array. Now there is no writeable data shared between the threads. All threads
execute asynchronously and each thread produces its own census array corresponding
to the task queue allocated to it. When a thread has finished processing tasks in its task
queue, it waits at a barrier for other threads to finish. When all threads are finished
processing their respective task queues, local arrays of all threads are accumulated in a
global census array.
62 4.3 Multithreaded Triad Census
If we compare the above two approaches, the first approach has advantage only in terms
of reduced memory as there is just one census array and not per thread copies of census array.
In second approach we waste some memory but we improve performance by avoiding threads
waiting for exclusive access to shared census array. Also, in case there are too many threads,
the second approach should scale better as all decoupled threads can proceed independently.
Experimentally we found that second approach fairs better in terms of performance and so
decided to use it. We ran our implementation with 16 threads(each software thread per hard-
ware thread) Table 4.10 shows the speedup obtained from this implementation w.r.t baseline
sequential triad census(v0.1). Table 4.11 shows the absolute values of time in seconds. From
the results we see that the an average speedup (w.r.t baseline) of 10x is obtained for Amazon,
Slashdot and Google data sets. Also, Actors(5x) and patents(3.2x) graph data sets fair real
badly in terms of speedup.
To understand why we couldn’t get good scaling, we analysed the overall breakup of exe-
cution time( see Table 4.9 ). In terms of percentage execution time and absolute number, we
can see that the most time is spent creating task queues during pre-processing. As a result,
sequential pre-processing dominates the overall execution time and is thus the source of bot-
tleneck leading in poor parallel scaling. Consequently, we conclude that although Canonical
Dyad(non-uniform distr.) approach is precise, but its has very high overheads causing serious
performance bottlenecks and should this be replaced by a simpler and maybe less precise load
balancing strategy like Canonical Dyad(uniform distr.)
Table 4.9. Execution Time(seconds) breakup of Multithreaded Triad Census algo-
rithm(v0.6) on Intel Xeon Multicore
Version v0.6 Actors Patents Amazon Slashdot Google
%Time - Read Graph 11.67 5.49 6.29 0.85 0.24
%Time - Create Neighbour sets 5.50 6.18 2.00 0.55 0.13
%Time - Create Task queues 55.00 63.40 70.86 70.75 83.86
%Time - Total Pre-processing 72.17 75.07 79.14 72.15 84.23
%Time - Triad Census execution 27.83 24.93 20.86 27.85 15.77
Total Pre-processing time (sec) 8.66 76.5 27.69 28.86 1069.72
Triad Census exec time (sec) 3.33 24.5 7.3 11.14 200.28
Total Time (seconds) 12.00 102.00 35.00 40.00 1270.00
Table 4.10. Multithreaded Triad Census speedup(v0.6) w.r.t baseline(v0.1) results on Intel
Xeon Multicore
Optimization Version Actors Patents Amazon Slashdot Google
Baseline v0.1 1.00 1.00 1.00 1.00 1.00
Multithreaded with per
thread local array and
Canonical Dyad(non-
uniform distr.) load
balancing
v0.6 5.00 3.18 11.26 9.83 9.61
63 4.3 Multithreaded Triad Census
Table 4.11. Multithreaded Triad Census Execution Timing(seconds) results(v0.6) on Intel
Xeon Multicore
Optimization Version Actors Patents Amazon Slashdot Google
Baseline v0.1 60 324 394 393 12203
Multithreaded with per
thread local array and
Canonical Dyad(non-
uniform distr.) load
balancing
v0.6 12 102 35 40 1270
Table 4.12. Comparison of load balance strategy based on aggregate NsetSize(all dyads)
Data Sets Author’s [9] Approach Our Canonical Dyad(uniform distr.) approach
Actors 81908982 78968174
Patents 704600440 671562549
Amazon 149306190 144419374
Slashdot 148238520 147237558
Google 1463477820 1454834448
4.3.3 Using Canonical Dyad(uniform distr.) load balancing(v0.7)
The Multithreaded Triad Census implementation involves number of pre-processing tasks be-
fore Parallel Triad Census algorithm can be executed. This involves reading graph, making
edge/neighbour sets and creating task-queues. As most of this pre-processing is done sequen-
tially, according to Amdahl’s law, a sequential bottleneck will limit the strong scalability possible
for our Multithreaded implementation. Removing sequential bottlenecks thus remained our top
priority.
In order to improve strong scaling of Multithreaded Triad Census algorithm, we decided to
replace Canonical Dyad(non-uniform distr.) load balancing with Canonical Dyad(uniform
distr.) load balancing approach (see Figure 4.5). This load balancing approach is nearly the
same as authors [9] approach except that while calculating NsetSize(line 8), we subtract two
from overall expression. The logic behind this subtraction is as follows: As < u, v> is a dyad,
the set N[u] will always contain v and set N[v] will always contain u. By subtracting two from
the expression |N[u]|+|N[v]|, we are bringing the value of this expression more closer to ideal
value |S|= N[u]∪ N[v] \ {u, v}.
Although a factor of two may seem trivial, but the Table 4.12 shows the difference in ag-
gregate neighbour set size(all dyads) when we use Canonical Dyad(uniform distr.) approach.
Experimentally we found that our approach significantly improves the load balancing as chang-
ing the MaxNsetSize from M1 to M2 leads to nearly proportional change( M2/M1) in number
of threads being created, suggesting the fact that load being proportionally divided between
threads. When we used authors [9] approach, this was not the case and we had to use trial
and error to find MaxNsetSize values for getting good load balancing.
Finally, we ran simulations with our Canonical Dyad(uniform distr.) load balancing strat-
64 4.3 Multithreaded Triad Census
egy and found that we were able to substantially lower the task queue creation part of pre-
processing time and overall pre-processing time. Table 4.13 shows the results with 16 software
threads (1 software thread per hardware thread) and also the breakup of execution time. Ta-
ble 4.14 shows best results obtained through simulations which corresponds to substantially
higher number of software threads per hardware thread. Figure 4.8 shows the graphs for
speedup and time scaling w.r.t number of threads.
With 16 software threads, we get good speedups for Amazon(15x), Slashdot(19x), Google(35x)
data sets(Table 4.13). Speedups for Patents(7.3x) and Actors(9.4x) has also improved but
still not linear. Thus with 16 threads, the average speedup comes out to be 17x w.r.t baseline
sequential(v0.1).
Also, as we increase the number of number of software threads w.r.t hardware threads (Ta-
ble 4.14), the speedup continues to improve. Somewhere between 600-700 software threads,
we get maximum speedups for Amazon(31x), Slashdot(25x), Google(56x), Patents(8.4x),
Actors(11x). Above this thread count threshold, the speedup improvement saturates. This
improvement in speedup with higher number of threads is but natural with Simultaneous Mul-
tithreaded Xeon processors because even when 16 threads stall due to a memory load operation
for example, the processor still has enough threads to keep its pipeline filled with useful in-
structions(reduced vertical waste).
The speedup for networks Patents and Actors networks has also improved a little bit in com-
parison to speedups with 16 software threads but the scaling is still not as effective as for
Amazon, Slashdot and Google networks. At this point, we analyse as to why there is discrep-
ancy between scaling of Amazon, Slashdot Google networks and Patents/Actors network. If
one compares Triad census arrays generated for Amazon, Slashdot and Google networks with
Triad census arrays generated for Actors/Patents, we see that in Actors and Patents network,
most of the census count belongs to triad type 1,2, and 3 and all other census counts are mostly
zeros. This means we are hardly executing the innermost for loop( line 19-24 in Figure 4.6 ).
In case of Amazon, Slashdot and Google networks, the distribution of census count is more
uniform and we get some census count for most of the triad types if not all. This suggests that
in these networks we are running the innermost for loop lot more number of times than in
patents and actors network.
From these observations, firstly we can see that in Amazon, Slashdot and Google networks
there are more number of instructions that are executed. With higher number of executed
instructions per thread, processor has more opportunities to extract ILP per thread. Effec-
tively this means that processor pipeline utilization is higher(lower vertical waste) for Amazon,
Slashdot and Google networks. Also as we increase the number of threads, the overall vertical
waste(pipeline stalls due to hazards) reduces further as now we have still have enough use-
ful instructions from waiting threads to keep the processor pipeline busy. Secondly, in Triad
census algorithm, there are very few number of ALU operations in comparison to Memory
load/store operations. This ratio of useful ALU operations to memory load/store operations
i.e. Arithmetic intensity is also higher for Amazon, Slashdot and Google networks in compari-
son to Actors and Patent networks. With higher arithmetic intensity, the processor does more
useful computational work(reduced pipeline horizontal waste) per load/store operation. Thus
with higher number of executable instructions in Amazon, Slashdot and Google networks, uti-
65 4.3 Multithreaded Triad Census
lization of ALU’s also increases. This also gives processor more opportunities to hide memory
latency as now it has more useful work to do while another thread is stalled on a expensive
memory operation. As we increase the number of threads, the overall horizontal waste reduces
further as now we have enough useful ALU instructions from threads waiting to be executed to
keep the ALU resources busy.
66 4.3 Multithreaded Triad Census
T
able
4.13.
M
ultithreaded
T
riad
C
ensus
w
ith
im
proved
load
balancing
[C
anonicalD
yad(uniform
distr.)]
(v0.7)
-
perform
ance
speedup
w
.r.t
baseline(v0.1)
and
E
xecution
T
im
ing(seconds)
breakup
w
ith
16
threads
on
SM
T
IntelX
eon
M
ulticore
D
ata
Sets
Total
Ex-
ecu
tion
Tim
e(sec)
Parallel
Triad
C
en
su
s
Tim
e(sec)
Pre-
processin
g
Tim
e(sec)
G
raph
R
ead
Tim
e
(sec)
Tim
e
to
m
ake
N
eighbou
r
Sets(sec)
Tim
e
to
create
TaskQ
u
eu
es
an
d
Load
bal-
an
cin
g(sec)
Speedu
p
w
.r.t
baselin
e
A
ctors
6.36
3.99
2.13
1.35
0.68
0.1
9.42
Patents
44.13
27.23
15.17
5.43
7.81
1.93
7.34
A
m
azon
25.69
22.53
2.91
2.01
0.73
0.17
15.34
Slashdot
20.38
19.88
0.43
0.27
0.14
0.02
19.28
G
oogle
339.82
333.96
5.25
3.12
1.74
0.39
35.91
T
able
4.14.
M
ultithreaded
T
riad
C
ensus
w
ith
im
proved
load
balancing
[C
anonical
D
yad(uniform
distr.)]
(v0.7)
-
best
speedup
w
.r.t
baseline(v0.1)
and
E
xecution
T
im
ing(seconds)
breakup
w
ith
m
axim
um
threads(found
experim
antally)
on
SM
T
IntelX
eon
M
ulticore
D
ata
Sets
T
hreads
Total
Ex-
ecu
tion
Tim
e(sec)
Parallel
Triad
C
en
su
s
Tim
e(sec)
Pre-
processin
g
Tim
e(sec)
G
raph
R
ead
Tim
e
(sec)
Tim
e
to
m
ake
N
eighbou
r
Sets(sec)
Tim
e
to
create
TaskQ
u
eu
es
an
d
Load
balan
c-
in
g(sec)
Speedu
p
w
.r.t
base-
lin
e
A
ctors
672
5.5
3.21
2.09
1.31
0.68
0.1
10.9
Patents
688
38.76
21.9
14.94
5.19
7.81
1.94
8.36
A
m
azon
624
12.49
9.52
2.73
1.82
0.74
0.17
31.52
Slashdot
631
15.85
15.19
0.6
0.33
0.21
0.06
24.78
G
oogle
688
217.87
212.07
5.22
3.1
1.73
0.39
56
67 4.3 Multithreaded Triad Census
(a)
(b)
Figure 4.8. Performance Speedup and Execution time scaling for Actors network
68 4.3 Multithreaded Triad Census
(c)
(d)
Figure 4.8. Performance Speedup and Execution time scaling for Amazon network
69 4.3 Multithreaded Triad Census
(e)
(f)
Figure 4.8. Performance Speedup and Execution time scaling for Google web network
70 4.3 Multithreaded Triad Census
(g)
(h)
Figure 4.8. Performance Speedup and Execution time scaling for Patents network
71 4.3 Multithreaded Triad Census
(i)
(j)
Figure 4.8. Performance Speedup and Execution time scaling for Slashdot network
72 4.3 Multithreaded Triad Census
Chapter 5
Evaluation of Triad Census on Tesla
GPGPU
In the last chapter, results from Multithreaded implementation of Triad Census algorithm gave
us good insights into complexities involved in efficient parallelization of Triad Census algo-
rithm on Multicore processors. We now follow this up with efficient parallel implementation of
Triad Census algorithm for NVIDIA Tesla GPGPU. In this chapter, we discuss the approach, im-
plementation and evaluation of Multithreaded Triad census algorithm on NVIDIA Tesla C1060
GPGPU.
5.1 Design and Implementation
Storing and processing data based on pointer chasing data structures like linked list is not
advisable for GPGPU’s. Instead, linear contiguous arrays which can be read and written in
parallel result in best performance. Below we give a detailed description of the multithreaded
triad census implementation written using C++/STL and C/CUDA for Tesla C1060 GPGPU.
For implementation and testing purposes we used CUDA nvcc 3.0 compiler driver in device
emulation mode and at times also used ocelot emulator1. For experiments and simulations
we used Tesla C1060 GPGPU and CUDA nvcc 4.0 compiler driver.
5.1.1 Graph Data Pre-processing on host CPU
We started with C++/STL implementation of graph data pre-processing on host CPU. The pre-
processing stage basically constructs all the graph data structures from graph data files. Once
this data is constructed, it is transferred to the Tesla GPGPU for processing by Triad Census
kernel implementation. Both undirected or directed graph data sets can be read. In addition,
we extended support for processing zero-index or one-index based graphs i.e. graphs where
vertex id can start with 0 or 1.
Edge and Neighbour Data Structures
To store the Edge and Neighbour data information, we used Compressed Row Storage (CRS)
1http://code.google.com/p/gpuocelot/
73
74 5.1 Design and Implementation
format as outlined in section 2.5.1 on page 18. This format usually uses three arrays for storing
a sparse graph or matrix: value, column indexes and row indexes. For our purposes, value
array was not needed and only column indexes and row indexes were used for storing adja-
cency information of Edges and Neighbours. Column array stores the adjacency(neighbour)
information for each and every vertex while the row array (indexed on vertex id) stores the
indexes of column array from where adjacency(neighbour) information of a vertex starts.
As not all graph information( number of edges etc.) was available beforehand, we had to
construct the graph dynamically from graph data sets. For this we used Standard Template
Library(STL) implementation of dynamic arrays popularly known as Vector container . This
container is guaranteed by C++ standard to be contiguous 2. Using vectors containers and as-
sociated STL algorithms/iterators, we constructed adjacency array information for all vertices
by reading and inserting one edge(neighbour) at a time. Whenever it was possible, we would
reserve some space in the vector to make sure that subsequent push backs to the vector internal
array do not have to do frequent reallocation-copy-insert operations. Following are the data
structures constructed for reading the graph:
• Edge vectors of all vertices
• Neighbour vectors of all vertices
All Edge(Neighbour) vectors were then concatenated into a single big edge(neighbour) ar-
ray that represents the column array of CRS format. While concatenating the edge(neighbour)
vectors, we also constructed its corresponding index array(row indexes) to locate adjacency(neighbour)
array of each vertex.
Graph Partitioning and Load Balancing
For graph partitioning, we used a different approach then the one used for Multicore. Instead
of using a distributed task queue, we created a centralized task queue based on CRS format
where column array stores a single big task queue and row array to store indexes. Similar to
Triad Census implementation on Multicores, we used dyad < u, v > as the task abstraction. The
load balancing strategy remained the same as used in our earlier Multicore implementation.
While constructing this centralized task queue, the corresponding row index array(indexed on
thread id) is created to store the index information of each sub-queue that can be assigned to
a GPGPU thread.
Managing graph data under Tesla Memory Constraints
In the Triad census algorithm, the set S = N[u]∪N[v]\{u, v} needs to be dynamically created
at run-time for each and every dyad task < u, v > begin processed. As NVIDIA Tesla C1060
GPGPU is based on 1.3 compute capability, it doesn’t support dynamic memory allocation or de-
allocation inside kernel or the device code. Even if device with 2.x capability is used, frequent
heap memory allocations/de-allocations on device can substantially deteriorate performance.
We thought about mainly two alternatives to solve this problem which basically explored ways
to allocate required memory buffers on CPU and then copying it to GPGPU for creating S sets
at run-time:
• First approach is to create all the possible S sets for each and every possible canonical
dyad on CPU and then transform all of the S sets into CRS format suitable for GPGPU
2http://herbsutter.com/2008/04/07/cringe-not-vectors-are-guaranteed-to-be-contiguous/
75 5.1 Design and Implementation
Table 5.1. Initial GPGPU memory requirements for different graph data sets for 64 bit
unsigned integer type data i.e. range of 0 to 18,446,744,073,709,551,615
Data Set
∑ |S| Memory re-
quirement (MB)
Other Memory re-
quirement (MB)
Total Memory re-
quirement (MB)
Actors 630 64 704
Patents 5372 688 6060
Amazon 1155 110 1265
Slashdot 1200 20 1220
Google 11100 185 11285
Wikitalk 192500 220 192720
processing. This are two main issues with this approach. Firstly, sequentially creating
set S on CPU will make pre-processing stage very slow(this was the exact pre-processing
sequential bottleneck in Canonical Dyad(non-uniform distr.) load balancing approach)
creating the sequential bottleneck. Secondly, our Tesla GPGPU has a limited memory
capacity of 4 GB. Throughout our Multicore and CUDA implementation we used 64-bit
unsigned integers for storing graph data. We calculated that for big graphs like Wikitalk
we needed at least 192720 MB of memory (see Table 5.1) if this approach is used. One
way to deal with memory exhaustion problem was to instead use 32-bit unsigned integers
for storing graph data and do up-casting to 64 bit unsigned integer only while calculating
out of range triad census sums. This pretty much halves the memory requirement but
still we found that some graph sets couldn’t fit in GPGPU GDDR memory.
• Second approach - the one we decided to use - is based on observation that each GPGPU
thread will process only a subset of dyad tasks of the centralized task queue and these
tasks will be processed iteratively. Also, different dyad tasks allocated to a thread have
different memory requirements to build the neighbour set |S|. However if we can assign
to each CUDA thread a pre-allocated buffer(in global memory) whose size equals the
maxim neighbour set size among all tasks of its task queue, then the thread can reuse
this buffer to process different tasks in its task queue iteratively.
For Example: Thread-1 is responsible for dyadic tasks: tk1, tk2 and tk3. Memory re-
quirements to create sets S1, S2 and S3 for each of these tasks is 24, 40 and 8 bytes. If
a memory buffer of maximum(24,40,8) = 40 bytes is provided to Thread-1, then it can
iteratively reuse it to create set S1 in first iteration, set S2 in second iteration and set S3
in third iteration.
We discarded the first approach due to mentioned issues and decided to use the second
approach. In terms of implementation, we had to make modifications in our load balancing
implementation to store maximum neighbour set size requirement of each and every task queue
(or thread). Based on these maximum values, we created an empty CRS column array whose
size equals sum of maximum neighbour set sizes of all threads. A corresponding CRS row array
is also created to store the index of starting location of neighbour buffer of each thread. From
the resulting implementation, we found that all graph data sets were able to fit well within 4
GB memory limit of GPGPU (Table 5.2). Note that these memory requirement depends upon
76 5.1 Design and Implementation
Table 5.2. Final GPGPU memory requirements for different graph data sets for 64 bit
unsigned integer type data
Data Set Overall Memory require-
ment (MB)
Actors 100
Patents 750
Amazon 350
Slashdot 230
Google 1230
Wikitalk 3910
the number of threads or task queues created. When we increase number of task queue or
threads, the memory requirement increases. As of now these requirement were calculated for
30720 thread limit of Tesla C1060 (except for Wikitalk data set for which we reached the 4 GB
limit with 10000 threads).
The single-threaded pre-processing stage was optimized to make it as fast as possible
so as to keep its overall time consumption much lower w.r.t triad census computation time.
Figure 2.9 in section 2.5.1 shows an example graph and its corresponding CRS arrays.
5.1.2 CUDA Kernel implementation of Triad Census
All data structures constructed in pre-processing stage were wrapped in device pointers and
corresponding device memory was allocated. Host data was then copied to device global(GDDR)
memory using appropriate CUDA C language primitives. We utilized constant memory for stor-
ing the mapping table that translates non-isomorphic triads(64 types) to corresponding iso-
morphic triads(16 types). An empty Triad census array on host and device were allocated. The
device Triad census array was subsequently transferred to the global memory for storing Triad
Census results of execution of Triad Census kernel.
We implemented the Triad census kernel and required device function in form of C++ Function
templates. Following functions were implemented: IsEdge (if u-v is an edge), IsNeighbour ( if
u and v are neighbours ), TriadCode, Set Union and Set Difference( for calculating S sets ). All
the code was written using CUDA C/C++ and compiled for Tesla C1060 using nvcc compiler
driver version 4.0.
Thread hierarchy configuration parameters
For the thread hierarchy configuration parameters ( kernel«<thread_hierarchy_config»> ) of
the triad census kernel, we made the number of threads per thread block ( threadblock_config
) configurable at compile time. The number of thread blocks per grid( grid_config ) is then
accordingly calculated at run-time. We kept the the block and grid dimensions to 1-D as we
are dealing with 1-D arrays.
Shared Census Array
The Triad census array stored in the device global memory is shared by all threads running the
same kernel on their respective task queues. To synchronize concurrent updates to the census
77 5.2 Initial experiments and Speedup Results
array stored in global memory by different CUDA threads, we used CUDA atomic primitive
function named atomicAdd.
Timing Kernel Execution
For timing the total triad census kernel execution correctly, before measuring the elapsed ker-
nel execution time, we used cudaThreadSynchronize() after kernel function call to make sure
that host CPU waits for the previous asynchronous kernel execution to finish.
5.2 Initial experiments and Speedup Results
While implementing the device code, we incorporated most of the Multicore optimizations
which we thought also made sense on GPGPU ( e.g.: pre-computing triad type). We compiled
the code with following nvcc compiler driver options:
-arch=sm_13
To compile the device code for NVIDIA Tesla C1060 GPGPU device of compute capability 1.3
-O3
To be passed to the host g++ compiler for host code optimization
–ptxas-options=-v
Specify -v option to the ptx optimizing assembler to get the details of register, local memory,
shared memory, and constant memory usage of the kernel
Initial compilation revealed the following per thread memory usage:
Number of 32-bit registers per thread: 31
Shared memory Per thread block ( allocated by nvcc compiler ): 128 bytes
Constant memory: 536 bytes
Local memory per thread: 0 bytes
We used this memory utilization data with the spreadsheet based CUDA occupancy calcula-
tor to explore possible thread hierarchy parameters. CUDA occupancy calculator allows to
calculate the SM occupancy. The SM occupancy gives a measure of Thread level parallelism in
terms of ratio of active warps to the maximum number of warps supported on the SM of the
GPGPU.
NVIDIA recommends higher occupancy for latency bound applications so that if few of the
warps stall due to slow loads, the SM scheduler still has enough warps that are waiting for
execution. However this approach might not always give best results as demonstrated by au-
thors [55]. There is a trade-off between granularity of thread block and the resource usage
(registers, shared memory ) per thread. Higher occupancy translates into fine grained TLP re-
sulting in lower number of resources per thread but requires large number of threads to lower
78 5.3 Profiling Triad Census CUDA kernel
hide memory or ALU latency. Alternatively, lower occupancy translates into coarse grained TLP
resulting in more resources per thread. Lower occupancy is effective only when number of
threads are lower and each thread does more work ( ALU and memory instructions ) in order
to effectively use the extra resources.
During this initial phase of experimentation, we kept our options open for exploring granu-
larity of Thread blocks. Tesla C1060 supports maximum 1024 threads or 32 warps or 8 thread
blocks(with 4 warps per thread block) per SM. The thread occupancy % in Tesla GPGPU is
limited mainly by per thread block usage of shared memory and registers. In our case, the oc-
cupancy calculator calculates that with 31 registers per thread, our implementation can achieve
maximum occupancy of 50% per SM i.e. 16 active wraps per SM. Table 5.3 shows all possible
thread hierarchy configurations to get this maximum achievable occupancy of 50%. All other
threads per block configuration lowers the achievable occupancy.
Table 5.3. Thread hierarchy configuration suggested by Occupancy calculator for 50%
occupancy(maximum with 31 registers)
Threads per block(we
configure this at compile
time)
Thread Blocks per
SM(chosen by com-
piler based on threads
per block)
Occupancy (%)
48 or 64 8 50
112 or 128 4 50
240 or 256 2 50
496 or 512 1 50
Initial trial runs revealed that varying threads per block had little effect on performance
when number of threads is kept at near maximum of 30720 threads. Performance deteriorates
when we launch with lesser number of threads. Table!5.4 describes initial results with 256
threads/block and 120 thread blocks thread hierarchy configuration. As we can see from Fig-
ure 5.1, we get good speedup w.r.t baseline sequential for Amazon(16x), Slashdot(13x) and
Google(34x) data sets. However for patents(2.5x) and Actors(3.45x) networks the speedup
results is relatively poor. We suspect that the reason for discrepancy in speedups has similarities
to the reasons we discussed during acceleration on Multicores. Only this time we see the dis-
crepancy on the Tesla GPGPU. If we compare the GPGPU results with fastest Multicore results,
we see that on an average GPGPU results are 2.4x slower.
5.3 Profiling Triad Census CUDA kernel
To get better insight into execution of Triad census kernel on Tesla GPGPU in terms of perfor-
mance, memory and branching, we profiled its using command-line based profiler tool called
computeprof. The computeprof tool enables user to gather different runtime information
statistics(profile counters) for a CUDA kernel. Depending upon the profile settings, the profiler
collects various profile counters like executed instructions, occupancy, warp_serialize, global
coherent load/stores, branches, divergent_branches etc. Table E.1 in Appendix E shows de-
scription of basic profile counters and also some of the derived counters that we created for
79 5.3 Profiling Triad Census CUDA kernel
(a)
(b)
Figure 5.1. Initial speedup results obtained for CUDA implementation of triad census
algorithm (a) Speedup w.r.t baseline sequential (b) Speedup w.r.t fastest multicore
80 5.3 Profiling Triad Census CUDA kernel
Table 5.4. Initial performance results of Triad Census on Tesla GPGPU with 256 thread-
s/block and 120 thread blocks
Data sets Total time(sec) Algorithm
exec
time(sec)
Pre-process
time(sec)
GPGPU
Speedup
w.r.t multi-
core
GPGPU
Speedup
w.r.t se-
quential
Actors 17.38 16.17 1.21 0.32 3.45
Amazon 24.7 21.25 3.45 0.51 15.95
Patents 129.73 115.06 14.67 0.3 2.5
Slashdot 29.69 29.22 0.47 0.53 13.24
Google 354.71 348.64 6.07 0.61 34.4
profiling Triad Census kernel. For more detailed description of various CUDA compute profile
counters see 3
We first configured the profiler with profile counters in a profile configuration file and then
enabled it. After this, we ran our Triad Census CUDA application. The compute profiler collects
the statistics while the application runs. When the CUDA application has finished its execution,
the profiler dumps the collected profile counter values in configured output file (CSV format).
In order to target potential bottlenecks in Triad Census Kernel, we created different types of
profiles. Using basic profile counters, we derived some new profile counters like instruction
throughput, global memory read/write/overall throughput, percentage divergent branches.
We will now describe the analysis of the performance and memory statistics obtained by pro-
filing our initial CUDA implementation of Triad Census algorithm:
1. Firstly we see in Table 5.5 that all the global load and store transactions are coherent or
coalesced which is ideally what one needs for good global memory performance. Also, we
can see in Table 5.6 that the number of 32 byte global memory load and store transactions
are much higher compared to number of 64 bytes and 128 byte global memory load/store
transactions. Increasing the count of 128 byte global memory transactions and reducing
32 byte transactions should improve results.
Another way could be to distribute the single centralized task queue into two centralized
task queues. First task queue would contain u values of task and the second task queue
would contain v values of task. This optimization would be basically synonymous with
AOS (Array of structures ) to SOA (Structure of Arrays) optimization.
2. The instruction throughput(Table 5.8) and overall global memory throughput(See Ta-
ble 5.7) is very low compared to the peak hardware values(102.4 GB/s theoretical GDDR
memory bandwidth for Tesla C1060) , again reiterating the fact the the triad census ker-
nel is not compute bound but rather memory latency bound. There are number of ways
to hide memory latency on CUDA devices. Increasing warp occupancy is one factor that
can reduce memory latency and this requires reducing register pressure by using shared
memory instead. Alternatively, we could increase per thread register usage with lower
3computeprof user-guide
81 5.3 Profiling Triad Census CUDA kernel
Table 5.5. Memory Coherency statistics
Data set gld_incoherent gld_coherent gst_incoherent gst_coherent
Actors 0 337167983 0 16894132
Amazon 0 355220513 0 26933520
Patents 0 2894521697 0 130256154
Slashdot 0 1111485764 0 29259288
Table 5.6. Global Memory Load/Store Transactions
data set gld_128b gst_128b gld_64b gst_64b gld_32b gst_32b
Actors 1155633 108 464835 14 335547515 8446606
Amazon 2600465 3 1150847 0 377896253 13192445
Patents 693274 49 221158 3 2944903315 65148117
Slashdot 3407550 0 1555610 0 1198046922 14502431
Table 5.7. Global Memory Throughput
data set glob mem read
throughput(GB/s)
glob mem write
throughput(GB/s)
Total global memory
throughput(GB/s)
Actors 8.59 0.21 8.8
Amazon 5.87 0.2 6.07
Patents 8.19 0.18 8.37
Slashdot 13.27 0.16 13.43
82 5.3 Profiling Triad Census CUDA kernel
Table 5.8. Instruction profile
Data sets instructions Instruction throughput
Actors 650649633 0.03
Amazon 516564153 0.02
Patents 4294967295 0.03
Slashdot 2168289924 0.06
Table 5.9. Branch profile
Graph Net-
work
Branches Divergent
Branches
% Di-
vergent
Branches
instructions % Branches
Actors 105456730 3368243 3.10 650649633 16.73
Amazon 86825816 3714570 4.10 516564153 17.53
Patents 1163523028 29532915 2.48 4294967295 27.78
Slashdot 341669397 7379273 2.11 2168289924 16.10
occupancy by either storing some per thread global memory data in registers or by in-
creasing ILP somehow [55]. Other approaches to hide memory latency: Use shared
memory for doing computations per thread block, data prefetching and reducing register
dependent RAW stalls.
3. Divergent branches: The average percentage of divergent branches in our triad census
kernel is around 3% (See Table 5.9). In addition to triad census kernel, there are num-
ber of device functions which carry branches : IsEdge, IsNeighbour and set_adj_union. In
comparison to 1% average divergence in many CUDA benchmark kernels 4, our percent-
age divergence seems little high indicating that serialization of threads within a warp
taking different paths. In worst case warp serialization can slow down by 32 times if all
threads in a warp take 32 separate paths. In our case, as most branches have two paths,
so a factor of 2x slowdown is possible.
There are different approaches to reduce divergence. We have already discussed some
approaches(see section 2.5.2) for regularizing control flow. Other suggested approaches
[56] to reduce branch divergence include using signed integers as loop counters, creating
smaller branch-controlled compound statements so that compiler optimizes using predi-
cation, grouping threads according to branch path they tend to take and lowering total
number of CUDA threads. The nvcc compiler driver can replace the instructions in dif-
ferent branch paths with predicated instructions, provided number of such instructions
is lower than a threshold value.
4. Looking at the memory load store (Table 5.10), we see that there are considerable num-
ber of warps whose execution serialized on address conflicts to either shared or constant
memory.
4http://divmap.wordpress.com/home/divergence-data/
83 5.3 Profiling Triad Census CUDA kernel
Table 5.10. Memory load store profile
Data set gld_request gst_request warp_serialize
Actors 49017268 419723 0
Amazon 52952146 705510 2098473
Patents 549967365 5281342 3546663
Slashdot 163110847 673969 2285670
So far in our implementation we have used constant memory for storing mapping table
with 64 entries of size 8-byte each to map non-isomorphic triads to isomorphic triads.
Below is the relevant code fragment in which the different locations in mapping table are
accessed depending upon the value returned by TriadCode function and also preTriadType
value
if( v < w ||
( u < w && w < v && !IsNeighbor( Neighbors, NeighborIndexes, u, w) ) )
{
triadType = dIsoTriadMappingTable[preTriadType - 1 +
TriadCode(Edges, EdgeIndexes, u, v, w)];
atomicAdd( &TriadCensus[triadType-1], 1 );
}
Looking at the above code fragment, we can see that the mapping table in constant mem-
ory is being read by all the CUDA threads. CUDA C best practices guide[56] suggests the
following approach to obtain best constant memory performance:
For all threads of a half warp, reading from the constant cache is as fast as reading from
a register as long as all threads read the same address. Accesses to different addresses by
threads within a half warp are serialized, so cost scales linearly with the number of different
addresses read by all threads within a half warp.
As different threads within a warp or half wrap will be possibly accessing different loca-
tions of the mapping table, all such reads to constant memory will be serialized. If all
threads of a warp were to access the same location in mapping table, this implies we
need to find a mechanism to group threads according to the location they are going to
access. As there are 64 such locations, we need a mechanism to categorize all threads in
64 groups. But how do we know which thread is going to access which location ? This
basically depends upon the value of following expression
TriadCode(IN Edges,IN EdgeIndexes,IN u,IN v,IN w) - 1 + preTriadType
and which in turn depends upon whether u-v, v-u, u-w, w-u, v-w and w-u pairs are edges
or not. Due to the very random nature of graph data, it is very difficult to predict the
answers to these questions beforehand. Pre-computing these value may seriously dete-
riorate overall performance. Therefore we rule out a mechanism to group these threads
according to the locations they are going to access.
84 5.4 Optimizing Triad Census CUDA implementation
One way to avoid this performance bottleneck is to simply calculate non-isomorphic tri-
ads. By doing that, we don’t need to do any mapping using the mapping table.
Second approach is to replicate the mapping table for each thread block in shared mem-
ory. Shared memory is much suitable for concurrent access to different locations as it is
physically organized in separate memory banks which can be accessed in parallel. Effec-
tiveness of shared memory can reduce in case of bank conflicts though. Bank conflicts in
shared memory can occur when simultaneous memory accesses from threads to different
memory locations map to same memory bank. Bank conflicts lead to serialization of ac-
cesses to it, thereby reducing obtained performance. Shared memory conflicts [57] can
be prevented in two ways:
• Simultaneous memory access from threads map to different memory banks(16 banks
in our GPGPU device)
• Simultaneous memory load requests made by threads of a half-wrap map to a
shared memory bank but all such memory loads are made for the same memory
address location. This leads to broadcast of loaded value to half-warp threads. More
precisely, even if one of the memory loads is for a different memory address(mapped
to same memory bank), this will lead to a bank conflict.
5.4 Optimizing Triad Census CUDA implementation
Based on our analysis in last section, we tried many optimizations. Some optimizations worked
and some didn’t. Below we give description of all the tried optimizations.
5.4.1 Using shared memory to calculate Triad Census per Thread block
In this optimization we try to hide memory latency by using low latency and high throughput
shared memory. Instead of all threads of all thread blocks updating the global triad census
array through atomic instructions, we provide each thread block an empty Triad census array
in shared memory where it can calculate its share of triad census array. Now only the threads
local to a thread block can update their shared private triad census array in shared memory.
When all the thread blocks are done creating their own Triad census, we merge the individual
triad census results of different thread blocks to the global census array.
One issue we faced here was that for 1.3 compute capability devices, the atomicAdd func-
tion can work only on 64-bit integer values in global memory but not on 64-bit integer values
in shared memory(works only on 32-bit integers). As some of the Triad census values for big
graphs could easily overflow beyond the range provided by 32-bit unsigned integers, this could
lead to incorrect results. Statistically speaking, triad types 1,2 and 3 tend to have very high
census values in comparison to census values of remaining 13 (4 to 16) triad types. To re-
solve this issue, we calculated Triad census for triad types 2 and 3 using 64-bit global census
array and not shared memory. In terms of code modifications it means that only the innermost
atomicAdd function operated on shared memory (Line 22 in Figure 4.6) while the outermost
atomicAdd function continued to operate on global triad census array (Line 18 Figure 4.6).
85 5.4 Optimizing Triad Census CUDA implementation
Table 5.11. Speedups obtained by using shared memory to calculate Triad Census per
Thread block
Data sets Total
time(sec)
Algorithm
exec
time(sec)
Pre-
process
time(sec)
Speedup
w.r.t
GPGPU
baseline
GPGPU
Speedup
w.r.t mul-
ticore
GPGPU
Speedup
w.r.t se-
quential
baseline
Actors 7.08 5.86 1.22 2.45 0.78 8.47
Amazon 10.24 6.76 3.48 2.41 1.22 38.48
Patents 77.33 62.63 14.7 1.68 0.50 4.19
Slashdot 18.09 17.62 0.47 1.64 0.88 21.72
Google 209.03 203 6.03 1.70 1.04 58.38
Average 2.05 1.10 32.81
After running experiments on this optimization, we saw that there was almost negligible per-
formance impact (< 1%) of calculating triad types 2 and 3 in global census array instead of
calculating it using shared triad census array. Consequently the shared memory consumption
per thread block reported by nvcc compiler increased from 128 bytes to 192 bytes.
ptxas info: Used 31 registers, 176+16 bytes smem, 512 bytes cmem[0],
24 bytes cmem[1]
With this optimization, we obtained a maximum, average and minimum speedups of 2.45x,
2.05x, 1.64x w.r.t our baseline GPGPU implementation and 1.22x, 0.84x, 0.5x speedups
w.r.t fastest Multicore implementation.
With respect to sequential baseline results, the resulting maximum, average and min-
imum speedups were 58x(Google), 32x, 4.2x(patents). Table 5.11 shows speedups and
absolute time elapsed values for different graph data sets.
Comparing the speedup results of Triad Census CUDA implementation and Fastest Multicore
Triad Census implementation(see column GPGPU Speedup w.r.t multicore in Table 5.11), we see
that Tesla GPGPU on an average was 1.1x times faster. Only for patents graph data set we see
that GPGPU speedup ran half the speed of Multicore. However, for most graph data sets, our
CUDA implementation was equally fast as fastest Multicore results and for some graph data
sets(Amazon, Google) it bettered the Multicore results.
5.4.2 Discussion on other optimizations
CUDA occupancy calculator shows that 50% warp occupancy can be maintained with upto
8192 bytes of shared memory per thread block. So far we had used just 192 bytes per thread
block, we can still utilize 8000 remaining bytes. With 8 byte integers, we can load maximum
1000 integer elements more in shared memory of each thread block. Total amount of shared
memory in Tesla GPGPU is 480 KB with each SM having 16KB of shared memory. So maximum,
we can store 60 K values in shared memory of type 8 byte integer.
86 5.4 Optimizing Triad Census CUDA implementation
We analysed different data structures placed in global memory that could possible be accommo-
dated in shared memory. Data structures like Edges, EdgeIndexes, Neighbour, NeighbourIndexes,
Task Queue were not feasible to be accommodated in available shared memory. We tried many
combinations, data sharing and data reuse approaches to try and fit these data structures but
all of these data structures either seemed too big to fit or they would fit by seriously lowering
the total thread count or thread block size causing serious under-utilization of GPGPU hard-
ware resources.
We found that we could only place two data structures completely in shared memory with-
out causing any serious performance deteriorating trade-off’s:
• CRS row array storing indexes of centralized task queue
• CRS row array storing indexes of maximum S set sizes vector
This optimization, however resulted in no performance improvements.
Experimental optimizations that either reduced the performance or yielded insignificant
performance improvement
• Reducing per thread register pressure: We first tried to increase occupancy by forcing
the compiler to use lesser number of registers per thread using the nvcc compiler flag
–maxrregcount. In such cases, the compiler starts storing the thread local data in local
memory. As we reduced the number of registers per thread from 31 to 16, 8 and 4,
the amount of per thread local memory increased correspondingly. Using occupancy
calculator we tried different thread blocks corresponding to 100 % occupancy. What
we found was that even with 100% occupancy, we couldn’t hide memory latency and
performance actually reduced as we reduced the number of registers per thread.
• Using shared memory for mapping table: We already saw that irregular access to map-
ping table stored in constant memory lead to warp serialization(caused due to address
conflicts). In order to reduce warp serialization, we instead tried using share memory for
storing mapping table for each thread-block. The resulting performance improvement
was almost negligible , suggesting that warp serialization reduced just slightly. This also
suggests that the irregular access of indexes now started causing some bank conflicts in
shared memory.
• AOS to SOA conversion: We tried replacing centralized task queue with two task queue
by splitting the task <u, v> into two components. Now one queue just stores u’s in CRS
column array while the other task queue stores corresponding v’s in another CRS column
array. Also, as row index array remained the same, we can easily extract a task pair (u,
v) from these two individual queues. When we tested this optimization, there were no
improvements in performance.
• Loop unrolling: Referring to the work by [55], unrolling deep nested loops could im-
prove ILP per thread though this would mean more per thread register usage and de-
creased warp occupancy. If the benefits(memory latency hiding) of improved ILP and
fast register memory are more than cost paid for reduction in warp occupancy(TLP),
performance can be improved. In our Triad Census kernel, the innermost for-loop(Line
87 5.5 Summary
19–24 in Figure 4.6) couldn’t be unrolled through CUDA unrolling pragma or hand-coded
unrolling as the loop bounds are known only at run-time. Only loop that we could man-
ually unroll was the outermost loop (line 6–26). By unrolling this outer for-loop X times,
we started processing X tasks per loop iteration instead of one task per loop iteration
with no unrolling. We tried loop unrolling factors of 2 and 3. With these unroll factors,
the register per thread increased from 31(no unroll) to 42(2x unroll) and 51(3x unroll).
However, we found that we couldn’t gain any improvements in performance and at 3x
unroll factor, performance actually reduced. However in our final implementation, we
still decided to keep a 2x unroll factor as it had no negative impact on performance.
• Reducing Branch divergence: In our initial GPGPU implementation, we already opti-
mized branches in different device functions. This included branch elimination optimiza-
tion that we had also successfully employed in our Multicore sequential triad census. Al-
though Loop unrolling was also employed as explained previously, but we don’t think it
yielded any significant improvements to branch divergence. In addition, we tried hand-
coded predication in triad census kernel and other device functions and implemented
only the ones which gave us some benefits.
After unsuccessfully trying many optimizations, we carefully re-looked if any other opti-
mizations might still improve performance but found that we had tried most of the optimiza-
tion ideas that could possibly improve performance. At this point, we are convinced that we
have all the reasons to conclude that we have achieved the maximum obtainable performance
of Triad Census algorithm on our Tesla GPGPU.
Figure 5.2. Comparison of our results on Intel Xeon Multicore and Tesla GPGPU with
previous work on Cray XMT
5.5 Summary
To summarize the final results, figure 5.3 shows the overall speedup results obtained which
shows comparison between GPGPU and Multicore. Further, in previous work [9] on accelerat-
ing Triad Census, authors used the same patents data set and accelerated it for 16 PE’s of Cray
XMT(128 PE’s support). From previous work, we extracted the execution time for processing
of patent data set. Figure 5.2 shows the comparison of our execution time results on Intel Xeon
Multicore(with 16 threads) and GPGPU (30K threads) with previous results on Cray XMT(16
PE’s). As we see, our results for patents data set on Intel Xeon Multicore are slightly better
88 5.5 Summary
(1.27x times faster) than Cray XMT results while performance on GPGPU is comparatively
slower(0.72x)
Figure 5.3. Comparison between GPGPU and Multicore w.r.t baseline sequential
Chapter 6
Conclusion and future work
The main contribution of this work is to accelerate an Irregular Graph Algorithm for Social Net-
work Analysis called Triad Census algorithm on Intel Xeon Multicore and NVIDIA Tesla GPGPU
processor architectures. The main challenges in this thesis were firstly to address the proces-
sor - memory speed gap for algorithms which are bound by memory latency and secondly to
evaluate and accelerate such an irregular algorithm on two orthogonal commodity processor
architectures - Multicore and GPGPU.
We started the work by focusing on the state of the art in graph theory, social network analy-
sis and parallel graph algorithms. Then we concentrated on understanding the Triad Census
algorithm and previous results on it. Thereafter we built good foundations on state of the art
solutions in areas such as efficient graph data structures, processor architecture aware opti-
mizations and reducing control-flow irregularities. After this, we spent effort on understanding
the internals of Dual socket Quad core Intel Xeon based system(used for running experiments)
and Advanced Processor architecture concepts. We started with a Sequential Triad Census al-
gorithm and through profiling at application and instruction level, we found the targets for
sequential optimization. We successfully applied many different optimization techniques such
as branch elimination, cache fitting, Cache blocking etc. and obtained sequential speedups of
maximum 6x, average 3.3x and minimum 1.3x for different real world social network graphs.
With good sequential speedup results, we started analysis for implementation of Multithreaded
Triad Census algorithm for our Intel Xeon processor based Multicore system. We carefully
analysed different ways of partitioning the graph data and balancing the workload across the
threads. Based on this understanding, we implemented Multithreaded Triad census using dis-
tributed task queue approach. Through experiments and observations we deduced ways to
decouple threads by reducing the amount of shared data between them, and also improved
load balancing through experimentation and analysis. Overall, on Intel Xeon Multicores we
were able to achieve 56x maximum, 33x average and 8.3x minimum speedups for real
world social network graphs.
By keeping speedup results on Multicores as a reference, we changed our focus to NVIDIA
Tesla C1060 GPGPU. We spent significant effort on understanding many aspects of GPGPU’s
processor hardware and programming - differences in Multicore and GPGPU architectures, ar-
89
90 6.1 Future work
chitecture details of Tesla GPGPU Architecture, CUDA Programming and memory model, CUDA
thread scheduling, Thread Synchronization, state of the art techniques for optimizing CUDA
applications. Based on our experiences on Multicore and good knowledge base developed on
GPGPU’s, we carefully crafted a CUDA implementation of Triad census algorithm. Through pro-
filing and shared memory optimizations, initial results demonstrated speedups as competitive
as Multicore results. We continued experimenting with all possible optimizations such as Loop
unrolling, reducing branch divergence etc, exploring trade-off between register pressure and
war occupancy etc. However after a lot of experiments and analysis we found that we could
no longer extract any more performance out of Tesla GPGPU. Still, overall on the NVIDIA Tesla
C1060 GPGPU, we were almost able to match the Xeon Multicore results - 58.4x maximum,
32.8x average and 4.2x minimum speedups.
In terms of raw performance, we found that for the graph data set called Patents network,
our results on Intel Xeon Multicore(16 hw threads) were slightly better (1.27x times faster)
than previous results on Cray XMT(16 hw threads) while resulting performance achieved on
GPGPU was comparatively slower(0.72x). To the best of our knowledge, this algorithm has
only been accelerated on a supercomputer class machine named Cray XMT and no work ex-
ists that demonstrates performance evaluation and comparison of this algorithm on relatively
lower-cost Multicore and GPGPU based platforms.
6.1 Future work
Having said all this, there is always room to try to explore new paradigms and platforms
in the Chip Multiprocessor landscape. We now briefly give an outline of our perspective on
prospective future work
• On Multicores: As of now, the state of the art Multicore processor architecture from
Intel is Ivy Bridge which is an entirely new micro-architecture. Also, all future architec-
tures tend to be cc-NUMA style. In our current work we didn’t explore NUMA specific
optimization’s but we think that such optimizations will become necessary as number of
cores scale on the CMP’s. On the algorithmic side, analysing the effect of graph topol-
ogy on performance scalability of the application and doing cache oblivious analysis of
the algorithm should open more avenues for improving application performance. Other
algorithmic/implementation areas which can be explored on Multicores are: Exploring
work or task stealing approach to improve load balancing and evaluating performance
with threading libraries/frameworks such as OpenMP, Cilk and Intel TBB.
• On GPGPU’s: Since the release of NVIDIA Tesla GT200 GPU processor architecture, many
more advanced GPGPU devices have been released by NVIDIA, AMD(Radeon) and more
recently by Intel(Xeon Phi). From NVIDIA Fermi to NVIDIA Kepler, the GPGPU processor
architectures have mainly seen a shift towards making them more memory-latency and
programming friendly. For example dynamic nested parallelism is now possible on the
latest NVIDIA GPGPU’s and also cache hierarchy for on-chip shared memory is introduced
to target applications sensitive to memory-latency. The number of threads, size and speed
of on-chip and off-chip memory as well as SIMD resources have been improved, so as the
speed of context switching and atomic instructions. It will be a good exercise to evaluate
91 6.1 Future work
Triad Census or a similar irregular algorithm on latest GPGPU processor and a Multi-GPU
system.
92 6.1 Future work
Appendix A
Concurrency and Parallelism
Amdahl’s Law [2, 58, 59]
It is a simple observation which quantifies the overall possible speedup in execution that can
be gained when a given part of the execution is accelerated. Normally this speedup is provided
by doing a part of the total work in parallel.
Speedup = ( S + P ) / ( S + P/N )
where,
T = S + P = Total execution time units
S = Execution time units of serial part of T
P = Execution time units of parallelizable part of T
N = Factor by which P can be accelerated
Amdahl’s law assumes that total workload is fixed and thus focuses on strong scaling. Also
see Gustafson’s Law.
Asynchronous Execution [60, 61]
Tasks are said to be executing asynchronously if they may execute at different speeds and each
of it can halt any time for an unpredictable amount of time due to unpredictable events (cache
miss, pre-emption, page-fault etc.). Different tasks thus may make progress independently of
the each other.
Atomicity
A computation is said to be atomic, if once it starts executing, it is allowed to execute to
completion without any interruption. Atomicity comes in flavours of atomic variables, atomic
statements, functions etc. An atomic computation cannot be subdivided further into smaller
computations and for this granularity of computation even the operating systems has to wait
for atomic computation to complete its execution before it can interrupt it.
Barrier Synchronization [60, 62, 63]
A synchronization mechanism in which all threads or tasks must meet at a point called Barrier
before any of them is allowed to make any further progress. Each thread waits for all other
threads by blocking itself. A blocked thread waiting for other threads will unblock only if it
93
94
is sure that all other threads have reached the barrier. Each thread thus must execute all op-
erations or code before the barrier. Some barriers are implicit while others can be specified
explicitly in implementation. A barrier is thus a mechanism to enforce synchronization of asyn-
chronously executing independent threads.
Blocking [64, 65, 66]
We say a task is blocked or enters a blocked state when it is waiting for a resource and cannot
continue execution until it gets access to that resource. The resource could be semaphore, re-
sponse from an I/O device or memory, an interrupt (say from timer) or event or message etc.
As soon as the resource becomes available, the task is unblocked. An unblocked task can be-
come a ready task by moving to a priority queue of ready tasks or can directly move to running
state(by pre-empting current running task) if it is the task with highest priority compared to
all ready tasks and the currently running task.
Concurrent Program and Computation [66]
A concurrent program consists of a finite set of (sequential) processes. Each process is written
using a finite set of atomic statements. The execution of a concurrent program proceeds by
executing a sequence of the atomic statements obtained by arbitrarily interleaving the atomic
statements from the processes. A computation is an execution sequence that can occur as a
result of the arbitrary interleaving of atomic statements of the processes of the program. For
a set of N-concurrent tasks T1, T2,..,TN executing on a single PE, each having S1, S2, ..., SN
number of atomic computations, the total number of possible interleaving are:
Number of Possible Interleaving = (S1 + S2 + ...+ SN )!/(S1 ∗ S2 ∗ ... ∗ SN )
Concurrency [2, 61, 67, 68, 69]
As set of tasks or threads are said to be concurrent iff
• They execute and make progress at the same time or within the same time interval. By
saying same time or same time interval we mean that they make progress together but
they may or may not be executing at the same time instant.
• They can be executed sequentially through interleaved execution (non-deterministic or-
dering) on a single threaded uniprocessor but they can also be executed in parallel if
multiple PE’s are available as in Multicore, Manycore or Multithreaded processor.
Concurrency thus can imply either interleaved (pseudo-parallel) or parallel execution and
is more of property of the algorithm. Concurrency as a concept is more general than paral-
lelism.
Data Level Parallelism [70]
In this type of parallelism, same task or operation is executed in parallel on different data
streams. In essence, they are similar to SIMD or SPMD as they perform same operations on
different data sets. Vector processing is yet another form of data level parallelism.
Decomposition of problem into tasks is usually based on data partitioning followed my static
or semi-static mapping of tasks to threads or processes. Also the task interactions are generally
insignificant in data parallel applications making them more amenable to coarse grained par-
allelism.
95
Data reuse
Data re-usability is central to obtaining high performance on high performance processor ar-
chitectures. It cane be categorized in four categorize which is Cartesian product between two
sets: Locality set - {spatial, temporal} and Data usage mode set - {self, group} :
• Self temporal use: Accessing same memory location in consecutive iterations of the loop.
Example: accessing a variable, array element, structure member in consecutive itera-
tions.
• Self spatial reuse: Accessing consecutive memory locations in the same cache line in
the same loop iteration. Example: access of consecutive array elements in same loop
iteration
• Group temporal reuse: Accessing group of memory locations in consecutive iterations of a
loop. Example: accessing same set of array elements, structure members in consecutive
iterations.
• Group spatial reuse: Accessing a group of memory locations that belong to the same cache
line in the same iteration.
Distributed Memory[58]
Some parallel architecture’s have memory which is physically distributed among PE’s such that
each PE has its own cache-hierarchy and main memory. Depending upon whether the address
space is local or globally shared, inter-processor communication occurs through explicit mes-
sage passing or through shared address space.
Granularity [58, 61]
It is defined as the amount of concurrent computation is done before synchronization or com-
munication is needed between concurrently executing computations. The longer the time be-
tween synchronizations or computation, the coarser is the granularity and vice versa. In simple
terms it is just ratio of computation to communication. The number and size of tasks into which
a problem is decomposed determines the granularity of the decomposition. Decomposition into
a large number of small tasks is called fine-grained and decomposition into a small number of
large tasks is called coarse-grained.
Gustafson’s Law [70]
This was an observation made by John Gustafson from Sandia National Labs. It basically com-
plements the Amdahl’s law. Gustafson observed that speedup scales with number of cores only
if the parallel workload increases at a faster rate relative to serial workload. It this focuses on
weak scaling.
The difference between Amdahl’s and Gustafson’s law lies in whether one wants to make an
existing program run faster with the same workload(see strong scaling) or whether one envi-
sions working on a larger workload. The scalability of an application comes down to increasing
the work done in parallel and minimizing the work done serially.
96
Amdahl motivates us to reduce the serial portion, whereas Gustafson tells us to consider larger
problems, where the parallel work is likely to increase relative to the serial work. Making to-
day’s application run faster by switching to a parallel algorithm without expanding the problem
size is harder than making it run faster on a larger problem.
Latency
It is another term used for delay. It is the time elapsed between the instant an item of an input
data set enters a processing system to the time instant when this item comes out after process-
ing.
Multitasking
Multitasking allows more than one task to execute at the same time on a single processor by
interleaving their execution. Each task executes for a designated time interval. When the time
interval has expired or some event occurs, the task is removed from the processor and another
task is assigned to it. Multitasking can be implemented through multithreading
Multithreading (Software)
Operating systems allow user to achieve multi-tasking through Multithreading by providing API
or programming language constructs. Essentially Multithreading involves concurrent execution
of threads belonging to a multithreaded process. For communication and synchronization, spe-
cific constructs are provided.
Mutex
A Mutex is a programming abstraction for implementing the concept of mutual exclusion.
Mutual Exclusion [66, 69]
It is the process of preventing concurrently executing threads or tasks to simultaneous access a
shared resource or critical section. In other words, atomic statements from the critical sections
of two or more processes must not be interleaved.
Non-determinism [60, 61, 62, 70]
Same program or a system for same input set gives different output at different times of exe-
cution. Non-deterministic behaviour of multi-threaded programs occurs due to races.
Parallelism [2, 61, 67, 68, 69]
Two or more tasks or threads are said to be executing in parallel if they execute simultaneously.
Simultaneous execution means making progress together at every instant of time. Parallelism
can be achieved by running each thread of the process on a separate processor core or hard-
ware thread.
Parallelism thus implies simultaneous or overlapped execution and requires separate processor
core or hardware thread for each software thread. Also, it is a subset of Concurrency.
Pipelining [70, 71]
It is a fundamental technique to improve throughput of a processing system where a stream of
data is processed in stages such that output of each stage is fed as input to the next stage. Ev-
97
ery pipeline stage acts as a consumer of input data and producer of output data. The different
pipeline stages may take different execution times and execute in parallel either synchronously
based on a single clock or asynchronously by communicating via request acknowledgement
system. The pipeline stages are separated with buffers (E.g.: registers) to store the data tem-
porarily.
To increase the speed of a pipeline, one would break down the stages into smaller and smaller
units, thus lengthening the pipeline and increasing overlap in execution. Speed of a pipeline
is ultimately limited by the longest processing atomic stage in the pipeline. So essentially we
can picture pipelining a special case of task parallelism where there is a dependency between
tasks. It is sometimes also called Fine-grained stream parallelism because of frequent interac-
tions between tasks which are involved in processing a data stream.
Preemption [58, 64, 65, 69]
It is operation of suspending a running task by interrupting (using clock) its execution, saving
its context in its TCB, inserting the task it into priority queue of ready tasks and then allocating
processor to another task(if available and ready). Preemption is done in order to allocate pro-
cessor to second ready task either because the second task has higher priority or because the
time slice of the first task has expired or because of decision of OS according to its scheduling
policy.
Re-entrant Code or Re-entrancy [60, 68, 69]
A block of code is considered re-entrant if it can be safely executed concurrently by multiple
threads or if the code cannot be changed while in use. By saying that "code cannot be changed"
it basically means that the threads invoking the same function either do not change any shared
modifiable data or, there is just no modifiable data shared between them. Re-entrant code
avoids race conditions by removing references to global variables and modifiable static data.
Therefore, the code can be shared by multiple concurrent threads or processes without a race
condition occurring. Enabling conditions for re-entrancy of a code block or function are:
1. Must hold no static (or global) non-constant data.
2. Must not return the address to static (or global) non-constant data.
3. Must work only on the data provided to it by the caller.
4. Must not rely on locks to singleton resources.
5. Must not modify its own code (unless executing in its own unique thread storage)
6. Must not call non-re-entrant computer programs or routines.
Therefore, re-entrancy is a more fundamental property than thread-safety and by defini-
tion, leads to thread-safety: Every re-entrant function is thread-safe; however, not every thread-
safe function is re-entrant
Shared Memory or Shared Address Space [58]
It is a mechanism for inter-process communication (IPC) whereby two or more address spaces
that are protected from each other are allowed to intersect at points, still retaining protection
over the non-intersecting regions. It opens up a pre-defined portion of a process’s address
98
space; the rest of the address space is still protected, and even the shared portion is only un-
protected for those processes sharing the memory. Shared memory also reduces requirements
for physical memory for example a multi-threaded program typically has number of threads
which share the same address space of instructions. The key mechanism that enables shared
memory is the virtual memory mechanism more precisely page table organization structure.
Strong Vs Weak Scaling [72]
When the problem size stays fixed but the number of PE’s are increased, the observed scaling
is Strong scaling. Strong scaling of a program is said to be linear if the speedup is equal to
number of PE’s. Strong scaling efficiency is calculated as follows:
Strong scaling efficiency = (t sn ∗ N)/t s1 ∗ 100%
where,
t s1 = Time to execute a given workload with single PE
N = number of PE’s
t sn = time to execute same workload with N PE’s
On the other hand, if the workload assigned to each PE remains constant but total workload is
increased in proportion with PE’s than scaling observed is called Weak Scaling. For a program,
weak scaling is said to be linear if total execution time remains constant while total workload
is increased in proportion to increase in PE’s.
Weak scaling efficiency = twn /t
w
1 ∗ 100%
where,
tw1 = Time to execute unit workload with single PE
N = number of PE’s
twn = time to execute N unit workload with N PE’s
Deviation from linear weak scaling indicates that either algorithm is not truly O
 
N

or par-
allel overhead is increasing.
Slowdown Theorem or Brent’s Theorem
When q instead of p processors are used, such that q < p, the slowdown is at most p/q.
Speedup
Speedup is a relative term expressed as ratio of time taken to execute an application for a given
workload on one processor to time taken to execute it on p-processors
Speedup = T1/Tp
Where,
T1 = Execution time on 1-PE Tp = Execution time on p-PE’s
99
If the application is able to achieve a speedup
• Proportional to p, the speedup is called Linear Speedup.
• Exactly equal to p, the speedup is called Perfect linear speedup
• Greater than p, the speedup is called Super-linear speedup
Synchronization [58, 69]
Synchronization is the process of coordinating execution of multiple concurrently executing
threads or tasks. Synchronization is essential in order to ensure correctness of the application.
Essentially, there are two variants of Synchronization
• Data Access Synchronization: Data synchronization is needed in order to control race
conditions and allow concurrent threads or processes to safely access a block of memory.
Data synchronization controls when a block of memory can be read or modified. We can
identify four types of data access synchronization (Cartesian product of R/W access and
E/C access) required by concurrently executing tasks: EREW, ERCW, CREW and CRCW
(Key: C - Concurrent, E - Exclusive, R - Read, W - Write)
• Task Sequencing Synchronization: It involves coordinating the order of execution of
concurrent tasks. Four types of synchronization relationship exists between any two tasks
whose order of execution is to be coordinated:
– SS (Start-to-Start)
– FF (Finish-to-Finish)
– FS (Finish-to-Start)
– SF (Start-to-Finish)
Task or Thread [64]
The term task is used in the theory of concurrency, while the term thread is commonly used in
programming languages. The terms can thus be used interchangeably.
A Task or Thread is a block of code or instructions that form part of a process such that,
it can be scheduled for execution by the operating system scheduler and can compete with
other threads of same or different process for processor execution time. Each thread represents
a flow of control or a unit of computation.
A process can be broken up into multiple threads to make it a Multi-threaded Process. The kernel
stores and maintains dynamic information or state for each thread of a process called Thread
Context in a data structure called Task Control or Information Block (TCB or TIB). Threads of a
process called as Peers normally have following common characteristics:
100
• Have a TCB that consists of a program pointer, register set, state, priority and stack.
• Are part of address space of the process
• Are independent from main process control flow
• Share data segment of the process
• Each thread stack is part of process Stack Segment
• Can be controlled by parent process
• Can access and modify resources of the parent process only
• Can acquire or create new resources
• Can communicate and control other peers i.e. create, suspend, resume, terminate peers
• Have a scope: Process or System level
Task or Thread Scheduler [64, 65, 66, 69]
A Task scheduler forms part of operating system kernel which provides the scheduling algo-
rithms needed to determine which task or thread executes when. For a given set of tasks,
a scheduling algorithm determines order of assignment of task to the processor so that each
task is executed until completion. Most non-real time operating systems schedulers these days
support at least the following types of scheduling:
• Pre-emptive priority based
• Round-robin
• FIFO
Task Level Parallelism [70]
For a given data set, we perform different and in-dependent tasks on the all elements in the
data set in parallel. These separate tasks thus share the same data set. So they perform differ-
ent operations on elements of same data set.
Thread Safe [60, 68, 69]
• When code is written in such a way that it cannot be run in parallel without the con-
currency causing problems, it is said not to be thread-safe.
• According to Klieman, Shah, and Smaalders (1996): "A function or set of functions is said
to be thread safe or re-entrant when the functions may be called by more than one thread at
a time without requiring any other action on the caller’s part"
101
• A thread-safe function is one that can be safely (i.e., it will deliver the same results
regardless of whether it is) called from multiple threads at the same time.
• If the functions are not thread safe, then this means the functions contain one or more of
the following: static variables, accesses global data, and/or is not re-entrant.
• Essentially means that each function can be invoked by more than one thread at the same
time i.e. re-entrant code are thread safe. Be sure to use thread-safe libraries.
• Thread safety only concerns the implementation of the function and not its external
interface.
• Any function that maintains a persistent state between invocations requires careful writ-
ing to ensure it is thread-safe. In general, functions should be written to have no side
effects so that concurrent use is not an is-sue. In cases where global side effects âA˘Tˇ
which might range from setting a single variable to creating or deleting a file - do need
to occur, the programmer must be careful to call for mutual exclusion, ensuring that only
one thread at a time can execute the code that has the side effect, and that other threads
are excluded from that section of code.
• If it is not known which functions from a library are thread safe and which are not, a
programmer has three choices:
– Restrict use of all thread-unsafe functions to a single thread.
– Do not use any of the thread-unsafe functions.
– Wrap all potential thread-unsafe functions within a single set of synchronization
mechanisms.
• An additional approach is to create interface classes for all thread-unsafe functions that
will be used in a multithreaded application. The unsafe functions are encapsulated within
an interface class. The interface class can be combined with the appropriate synchroniza-
tion objects through inheritance or composition. The interface class can be used by the
host class through inheritance or composition. The approach eliminates the possibility of
race conditions.
Throughput
It is defined as number of operations completed per unit time. It is one of the metric of system
performance. It can be expressed as number of operations performed per unit time for in terms
of floating point operations per second(FLOPS): GFLOPS(Giga-), TFLOPS(Tera-), PFLOPS(Peta-
)
102
Appendix B
Processor Architectures and
Micro-architectures
Asymmetric Multiprocessors
See Homogeneous VS Heterogeneous CMP
Cache Coherency [59, 73]
In shared memory or shared address space CMP architectures, each PE may locally cache data
which is shared with other PE’s. As each PE can independently update its own cache(or shared
memory depending upon cache-write policy), other PE’s may be using older cached copies of
the same memory location. This problem of maintaining consistent view of the shared memory
across PE’s of a shared address space CMP is called Cache coherency. Depending upon the log-
ical and physical structure of CMP as well as the cache protocol, there are different choices of
cache coherence protocols. Snoopy cache coherence protocol is implemented in shared bus
CMP’s where cache controllers communicate with each other through shared bus to maintain
coherency of their caches. Directory Protocol is suitable for architectures where broadcasts or
multi-cast between cache controllers of respective PE is not straight-forward. Instead, a direc-
tory maintains state information of each shared memory block which may be cached.
Chip Multiprocessors (CMP)
It is a kind of Multiprocessor in which multiple processors are packed on a single chip. For
example Multicore and Manycore processors. Also see Multicore Vs Manycore Processors
Chip Multithreading(CMT) [74]
Chip multithreading is basically terminology used by SUN to describe architecture of its Ul-
traSPARC T1/T2 processors. According to SUN, CMT combines CMP with fine-grained multi-
threading. For example UltraSPARC T2 processor has eight processor cores each supporting
two instruction pipelines, with four hardware threads per pipeline. In effect up to 64 hardware
threads can be executed on the processor. The SUN CMT’s are focused towards high-end server
market where high throughput is more essential than latency and it calls the overall concept as
"Throughput computing".
Die [71, 75]
103
104
Integrated circuits are manufactured jointly on single piece of thin semiconductor wafer (typ-
ically silicon) of diameter of 200mm to 300mm. Later this wafer is cut apart into multiple
pieces each with identical integrated circuits. This piece is what is known as a Die.
Later, one or more of these dies can be combined with similar or different dies on a her-
metic package to form a chip or microchip. The I/O circuit terminals of the on-die integrated
circuits are brought out from package through pins. Size of a die can range from a pinhead
to a postage stamp. Many sub-functions maybe integrated on a single chip: processor cores,
SRAMs, FIFOs, phase-locked loops (PLLs), RF and analog devices, standard interfaces (PCI,
USB, Ethernet, WLAN, etc.), FPGA’s, custom accelerators, etc.
Flynn’s Taxonomy [58, 59, 73, 76, 77]
As early as 1966, Michael J. Flynn classified computer organizations into four categories ac-
cording to the uniqueness or multiplicity of instruction and data streams. By instruction stream
is meant the sequence of instructions as performed by a single processor. The sequence of data
manipulated by the instruction stream(s) forms the data stream. Accordingly, it classifies mul-
tiprocessor architectures in four categories: SISD, SIMD, MISD and MIMD
Key
S - Single, M - Multiple
D - Data Stream, I - Instruction Stream
GPGPU [30]
General purpose GPU exemplifies massively parallel multithreaded GPU processor used for
General purpose computing. It is essentially meant for data and control intensive applications
unlike the traditional graphics applications which are mostly data parallel or task parallel.
Hardware Multithreading [59, 73]
Processors with hardware multithreading basically support execution of multiple hardware
threads that can share functional units. Each of these hardware threads has associated archi-
tectural state: instructions, data, registers, PC, page table etc. Such hardware threads arise
from a software thread either explicitly created by the user or by the compiler. Basic motives
behind hardware multi-threading is to exploit thread level parallelism to hide memory latency
and increase throughput through concurrent execution of many hardware threads, improve
utilization of hardware resources ( caches, tlb’s, alu’s etc.) of a PE by sharing them among HW
threads.
Hardware multi-threading requires that every hardware thread has its own thread context
or state stored in its private registers, PC’s, etc requiring some extra hardware for this and also
hardware required for choosing next thread and switching between threads.
Many variations of hardware multi-threading exists namely Temporal Multi-threading( Fine-
grained and Coarse-grained ), Simultaneous Multithreading, Chip multithreading(CMT),
Speculative multithreading.
Homogeneous VS Heterogeneous CMP [78, 79]
105
Homogeneous CMP’s have several identical processor cores integrated on the same chip. For
example Intel Multicores (Dual/Quad/Oct Core), AMD Opteron Multicores, NVIDIA Fermi GPU
Architecture ( 512 Streaming Multiprocessor Cores). Homogeneous architectures are suitable
where flexibility of programmability, portability and average performance/power for multiple
domains is essential.
In Heterogeneous CMP architectures, different types of cores are integrated on the same chip.
In essence, each class of core has its own instruction set and is meant to serve a given function-
ality say SIMD cores for multimedia (audio/video/image) vector processing, Superscalar cores
for control-intensive algorithms. Such architectures have been popular in embedded domain as
set of applications to be executed are fixed and high performance with low power consumption
for each computation domain is an essential requirement.
Easy programmability and portability is still an unsolved problem for such architectures.
However some trends indicate that desktop and server processor architectures may move to-
wards heterogeneous approach. E.g. Notable example is IBM’s Cell BE which has one Multi-
threaded Power Architecture based PE and multiple RISC based co-processors called SPE (Syn-
ergistic Processing Element). Another example is Imagine processor from Stanford. NVIDIA’s
planned future GPU architecture named Echelon project is a Manycore CMP with heteroge-
neous architecture. It will consist of 1024 Streaming Multiprocessor Cores and 8 latency opti-
mized cores similar to cores of an SMP.
Hyperthreading (HT)
Also see Simultaneous Multithreading(SMT)
ILP (Instruction Level Parallelism) [59, 76]
At this level of parallelism, multiple independent instructions can be executed in parallel using
pipelined and replicated hardware resources. There are two basic approaches to exploit ILP:
static-compiler based (Intel’s EPIC and VLIW) and dynamic-hardware based (Superscalar).
For Superscalar processors with multi-instruction issue per clock-cycle, ILP is limited by true
data dependencies (RAW or Read After Write) causing data hazards, resource or structural
dependencies (limited number of functional units) causing structural hazards, anti-data de-
pendency (WAR or Write After Read) and output dependency ( WAW - Write After Write),
control or branch dependencies.
• True data dependencies (RAW) are eliminated or reduced by register forwarding and
bypassing, static and dynamic instruction scheduling (score boarding), out-of-order issue
and by interleaving multiple independent threads of instructions on a single processor
• Structural or Resource dependencies are eliminated or reduced by duplicating functional
units like multiple ALU’s, multiple memory ports for each cache, separate data/instruc-
tion cache, multiple pipelines each mapped to different functional units(integer vs. float-
ing point)
• Control or Branch dependencies are eliminated or reduced by code optimization’s like
loop transformations like unrolling, predicated instructions and branch predictors for
106
speculative execution (branch prediction, value prediction, pre-fetch data)
• False (Anti - WAR and Output - WAW) dependencies create problems with out-of-order
scheduling and are eliminated by techniques like register and memory renaming hard-
ware
Massive Parallel Processors (MPP)
See Chip Multi-processors and GPGPU
MIMD [58, 59, 73]
It is a generic class of multi-processor architecture in Flynn’s Taxonomy. In a MIMD type mul-
tiprocessor, each PE can execute a different instruction stream independently of other PE’s. In
practice, this means that a multiple instruction multiple data (MIMD) system could be running
different programs on each processor or different parts of the same program.
MIMD computers exploit thread (or task) level parallelism, since multiple threads or tasks
operate in parallel. MIMD are basically classified as Shared Memory and Distributed Memory
• Shared Address Space MIMD: In a shared memory system, all processes share a single
address space and communicate with each other by writing and reading shared variables.
UMA (Uniform Memory Access) and NUMA (Non-Uniform Memory Access) are two ma-
jor classes of shared memory systems classified according to whether access to shared
memory by PE’s is uniform or non-uniform.
• Distributed Address Space MIMD: In a distributed address space MIMD systems, each
process has its own address space and communicates with other processes by message
passing (sending and receiving messages).
MISD [58, 59, 73, 76, 77]
It is a class of multi-processor architecture in Flynn’s Taxonomy. As per there is no exact re-
alization of a MISD processor. Only Streaming Processors and to some extent VLIW processor
comes can be considered close to this category.
Multicore
A Multicore processor typically has multiple dies on the same chip package. Each die contains
one or more processor cores. These multiple dies are encapsulated on a chip package and in-
terconnected through an on-chip interconnection network or shared memory interface.
Multicore Vs Manycore Processors [1, 30, 78]
There are some differences between these two kinds of processors although both are essen-
tially based on SIMD/SPMD paradigm. A Multicore Processor (e.g.: Intel 2/4/6[i7] Core, AMD
Opteron 4/8/12 core), has small number of big processor cores on a single die. Also there
maybe multiple dies on the same chip. Each processor core on the die is a microprocessor
107
in itself but cores on the same die can share some architectural components with lower level
memory (L2/L3 cache), memory management etc. It is meant mainly for general purpose com-
puting which are data and control intensive in nature.
On the other hand Manycore Processors (e.g.: GPGPU from NVIDIA, AMD-ATI etc.) have large
number of small processor cores on the chip die. They are meant mainly for Data intensive
applications close to SIMD or SPMD paradigm.
Multiprocessors
It is a very broad term which encapsulates computing systems that leverage multiple processors
to perform computations. In such a system, multiple processors maybe organised with differ-
ent physical configurations. For example Chip Multiprocessors, Multicore Processor, Manycore
Processor, Clusters. Chip Multiprocessors have all processor cores on the same chip. Multicore
Processors (a kind of CMP) on the other hand have small number of big processor cores on a
single die.
Multi-threaded processors
See Hardware Multi-threading
NUMA Processors
A NUMA CMP is a shared memory or shared address space architecture. Each PE has its own lo-
cal cache hierarchy, local main memory and a globally shared address space. The global shared
memory Local memory access is thus faster than non-local/shared memory. The memory la-
tency in NUMA architecture thus depends upon the relative position of memory with respect
to PE.
One of the advantages of NUMA is that the architecture scales in terms of memory bandwidth
and local memory latency as number of PE’s scale. Disadvantage is that it requires resource (in-
terconnect, memory) replication as number of PE’s is increased and so its cost is higher (area,
power). Another disadvantage is that they cannot be programmed easily unless problem of
cache coherency is addressed (through inter-processor communication either programmatically
in software or through hardware as in cc-NUMA architectures)
A NUMA is thus a shared-address-space (logically) CMP but with distributed-memory (physically)
Out-of-Order Issue/Execution [59, 76, 77]
If a processor supports out of order issue/execution, then it has the ability to dynamically per-
form dependency analysis on a given window of instructions in the instruction stream and then
issue/execute later independent instruction(s) ahead of an earlier instruction which is waiting
for operands and functional units.
Out of order issue/execution is also sometimes referred to as dynamic instruction issue/exe-
cution. Generally long latency memory operations are launched earlier and instead of waiting
for these memory operations to complete, other instructions not dependent on the previously
launched memory operations are launched.
Processor Architecture
Processor Architecture refers to the instruction set, registers, and memory data-resident data
108
structures that are public to a programmer. Processor architecture maintains instruction set
compatibility so processors will run code written for processor generations, past, present, and
future.
Processor Core or Processing Element (PE)
A processor core or PE is a monolithic processor that may implement architectures like Super-
scalar, VLIW, SIMD or Multi-threaded.
Processor Coupling [58]
The processor cores on a Multiprocessor may be tightly or loosely coupled depending upon the
way the Inter-Processor communication (IPC) takes place. In Tightly Coupled Multiproces-
sors, IPC takes place on same die or chip through shared memory at some level. Example SMP,
UMA, NUMA, Multicores etc.
In case of Loosely Coupled Multiprocessors, multiple standalone processors are connected
together either through an on-board interconnection network (buses etc.) or through a high
speed communication network (Example Gigabit Ethernet on clusters)
Processor Micro-architecture
Micro-architecture refers to the implementation (Register-Transfer/Gate/Transistor/Circuit level)
of processor architecture in silicon. Within a family of processors, the micro-architecture is of-
ten enhanced over time to deliver improvements in performance and capability, while maintain-
ing compatibility to the architecture. A given Processor architecture can have many different
micro-architecture implementations.
Scalar Processor [59, 76, 77]
A processor that fetches and executes one instruction at a time is called a Scalar Processor.
Performance (CPI) of scalar processors has been improved through pipelining, out-of-order is-
sue/execution.
SIMD [58, 59, 73, 76, 77]
It is a class of multi-processor architecture in Flynn’s Taxonomy. In a single instruction stream,
multiple data stream (SIMD) processor, several identical processors perform the same set of
operations on different sets of data. Each processor has its own data memory (hence Multiple
Data), but there is a single instruction memory and central control processor, which fetches
and dispatches instructions. They basically realize Data Level Parallelism. The SIMD category
includes array processors, vector processors, and systolic arrays.
Simultaneous Multithreading(SMT) [59, 73, 80, 81]
This is a kind of hardware multi-threading in which more than one hardware thread can exe-
cute its given instruction in any pipeline stage in a given cycle. Such architectures basically
combine Superscalar execution which exploits ILP in an instruction stream with hardware
multi-threading which exploits TLP. Main idea behind this is to use multithreading to hide
pipeline stalls(vertical wastage) due to high latency operations and reduce under-utilization of
multiple functional units per pipeline stage ( also called horizontal waste ) due to limited ILP
in a single instruction stream.
109
Such architectures are able to simultaneously issue or execute or keep in flight instructions
from multiple instruction streams (or threads) in a given pipeline stage thereby filling most
of the pipeline slots with work and thereby avoiding pipeline stalls(vertical waste) and effec-
tively increasing utilization ( reduced horizontal waste ) of replicated hardware units per stage
by sharing them between instructions from different threads ( overlapping ILP of different in-
struction streams in same pipeline stage ). For example if there are 3 integer and 2 floating
point units in execute stage, with SMT all these units can be used together by integer and float-
ing point instructions from different threads. Some Examples of such architectures: Compaq
alpha EV8 which was a 4 thread SMT. Intel calls SMT as Hyperthreading(HT) and many of its
single and multicore processors use HT for its PE’s notably atom, core i3/i5/i7, P4, Itanium and
Xeon(2-way HT). Another notable example is SUN UltraSPARC T2 plus (Victoria Falls) which
is an 8 core single chip with 8 threads per core.
SIMT
See Section-6.4 CUDA Programming Model
SISD [58, 59, 73, 76, 77]
It is a class of multi-processor architecture in Flynn’s Taxonomy. A conventional sequential pro-
cessor fits into the single-instruction stream, single data stream (SISD) category. Only a single
instruction is executed at a time in a SISD processor, although pipelining may allow several
instructions to be in different phases of execution at any given time. Nevertheless, it still pro-
cesses one instruction stream.
SPMD [58, 59, 73, 76, 77]
Shared Program Multiple Data or SPMD is a coarser version of SIMD architecture paradigm.
It basically means that a number of different processors execute the same program or set of
instructions but each processor executes it on a different set of data. Each PE in an SPMD pro-
cessor is assigned a sort of identifier and based on this identifier it can conditionally execute
parts of the program. So although different PE’s have the same program but they might be
executing different instructions.
Speculative Execution
It involves execution of set of instructions ahead of an earlier instruction, but unlike out-of-
order execution, here instruction look-head based dependency analysis is not performed and
so the usefulness of the work done from the speculative execution depends upon whether spec-
ulation was correct or not. Correctness of speculation is determined later on when results of
earlier instructions are actually resolved.
Basic motive behind speculative execution is to extract available ILP in instruction stream by
utilising the hardware resources which would otherwise go idle even when the work obtained
from the speculative execution can later on become useless because of miss-speculations. In
case of miss-speculation, some extra work is done in order to undo the effects of speculative
execution i.e. to rollback or restore system state and then to start execution of the correct
instructions.
• Data speculation is done in form of value prediction and advanced Loads (or Prefetching).
110
• Control Speculation in the form of Branch prediction is used to lower control stalls of
pipeline by predicting a branch as taken(or not taken) and then speculatively fetching
and executing set of instructions from the predicted path(or sometimes both the paths).
Speculative execution requires support of processor hardware and also software (optimiz-
ing compiler). Normally extra hardware is required to buffer the results of speculative execu-
tion, to buffer the system state for rollback on miss-speculation and to handle precise excep-
tions.
Speculative Simultaneous Multithreading(SST)
Extends SMT architecture with speculative execution at thread level also called Thread-level
speculation.
Streaming Processor
A Streaming Processor is composed of several processors each of which executes its own dedi-
cated instruction stream but all of them process a single data stream. To some extent they can
be considered as pipelining of multiple processors which process an incoming data stream and
pass it on to the next processor in the pipeline for further processing.
Superscalar Processor [59, 73, 76]
The ability of a processor to issue and execute multiple instructions in each clock cycle and
to do it dynamically at runtime is referred to as Superscalar Execution. Multiple instructions
can be issued or/and executed either in-order or out-of-order(O-o-O). Superscalar processors
replicate execution units, such as integer ALUs, FPUs, load/store units, etc. into one CPU so
that it can fetch and execute more than one instruction at a time in same stage of pipeline. The
goal of Superscalar processors is to attain fractional CPI (Cycles per Instruction) by exploiting
instruction level parallelism.
Symmetric Multiprocessors (SMP) [58, 59, 73, 76]
A Shared Memory or Symmetric Multiprocessor is a UMA CMP where each individual PE has
a local cache hierarchy and shares a centralized global memory hierarchy symmetrically with
other PE’s. By symmetrically it means that for each PE, the latency to access data in shared
global memory is same. SMP style architectures don’t scale linearly in performance with scal-
ing of PE’s and usually some sort of bottleneck emerges if we try to do so (say saturation of
shared memory controller due to increased memory contention). On the other hand, NUMA
style architectures normally do scale as number of processors increase.
An SMP is thus a CMP with shared-address-space (logically) but with a centralized-memory (phys-
ically)
Transaction Look Aside Buffer (TLB)
see Virtual Memory
Temporal Multithreading [59, 73, 80, 82]
This is a kind of hardware multi-threading in which only one hardware thread can execute a
given instruction in any pipeline stage in a given cycle. Temporal multithreading has mainly
two variations: fine-grained multi-threading and coarse grained multi-threading.
111
In fine-grained or co-operative multi-threading, the processor pipeline interleaves execu-
tion of multiple active threads by switching among these hardware threads on each instruction
every few clock cycles(as low as 1 cc). The concept is similar in essence time-slice based pre-
emptive round-robin scheduling of threads. This type of multi-threading is effective when there
are sufficient number of threads with non-blocking instructions that can hide long operation
latency. It is a good architecture for highly data parallel and data intensive applications where
lot of time is spent waiting for data. Main advantages are overall higher throughput, fills well
short stalls, less complex hardware: no or smaller caches, no register forwarding, no O-o-O
execution hardware. Primary disadvantage is that for hardware threads with few or no stalls
get delayed unnecessarily(poor single thread performance). Hardware overheads of this kind
of multi-threaded architectures involves thread tracking in every stage, bigger shared resources
to avoid contention among large number of small hardware threads. Example of such archi-
tecture: MTA ( Multi-threaded architecture ) is a CMP supported 128 threads per PE. Another
example is SUN UltraSparc T1 (Niagara) which additionally is a multi-core processor with 4,6
or 8 PE’s with each PE can handle up to 4 hardware threads scheduled in round-robin fashion.
In coarse grained or block multi-threading, thread context switch is done by hardware only
when an existing thread blocks or stalls the pipeline due to certain events like long latency
cache misses on loads, I/O, synchronization with another hardware thread etc. These sort
of multi-threaded architectures are much less likely to slowdown non-blocking threads unlike
fine-grained architectures. Other advantage is that fills well long stalls. However, one key
drawback of this approach is that it requires freezing or flushing the instruction pipeline when
a thread stalls and loading instructions from a waiting thread. This whole procedure takes
some time and can degrade performance if there are large number of threads with short stalls.
It is much more useful if most of the blocking threads result in long stalls.
UMA
Uniform Memory Access CMP architectures are essentially shared memory or shared address
space architectures where all PE’s have same latency to access shared main memory. The shared
main memory could be a single memory or organized as multiple memory banks. Example of
UMA CMP is an SMP (Symmetric or Shared Memory Multiprocessor)
Vectorization or Vector Processors
Vectorization is essentially ’SIMDization’ or in essence a data parallel memory access and ex-
ecution where multiple data items stored in a vector or array are accessed and operated in
parallel by multiple functional units of the processor. For example Multiplication of two 1-D
integer vectors.
Vector processors also pipeline the operations on the individual elements of a vector. The
pipeline includes not only the arithmetic operations (multiplication, addition, and so on), but
also memory accesses and effective address calculations. Net effect is amortized main mem-
ory access, lesser data/control hazards due to reduced instruction bandwidth, more paral-
lelism due to replication and pipelining of functional units. Ideas of vector processors which
are ubiquitous in all super-computing systems were slowly introduced in the ISA of Super-
scalar processors in the form Instruction set extensions (Streaming SIMD extensions like MMX,
SSE/SSE2/SSE in Intel Processors)
112
VLIW Processor [59, 76]
A VLIW (Very Long Instruction Word) processor belongs to another category of ILP processors
which also issue and execute multiple instructions in each clock cycle. It distinguishes itself
from a Superscalar processor in terms of simpler hardware architecture (control complexity)
but more complex compiler in order to extract ILP statically in instruction stream.
It was Joseph Fischer and Bob Rau who devised the idea of VLIW, and characterised it as
an architecture that can issue one Very Long Instruction per cycle, where each Very Long In-
struction encapsulates many tightly coupled independent operations each of which execute in
a small and statically predictable number of cycles. In such a system, the task of grouping
independent operations into a very long instruction word is done by a compiler. The processor
freed from the cumbersome task of dependence analysis has to merely execute in parallel the
operations contained within a very long instruction. This leads to simpler and faster proces-
sor implementations. Later, Intel incorporated ideas of VLIW architecture in its EPIC (Explicit
Parallel Instruction Set Computing) styled Itanium IA-64 ISA.
Appendix C
Graph theory - Small world networks
Power Law
The power law states that the probability that any vertex of a graph is connected to k other
vertices is proportional to 1/kn where n> 1.In simpler terms, when an independent variable x
increases, the dependent variable f (x) decreases and vice versa. Power law distributions are
similar to heavy-tailed, pareto or zipf distributions.
Scale free networks [83]
Networks with power law degree distribution are called as Scale-free networks. Such net-
works have some set of vertices which have relatively high degree compared to the average
degree of the network. Such vertices are called hubs. Scale free networks are distinguished
from random networks in the sense that in random networks such hubs are relatively negligible
and most vertex degrees do not diverge too much from average network degree.
High degree of hubs in scale-free net-works is correlated to networks fault tolerance. Even if
a hub fails, the network maintains its connectedness due to remaining hubs. However if most
hubs fail the network can become disconnected. So hubs are both strength and weakness of the
scale-free networks. Examples of scale free networks(vertices and arcs) is for example world-
wide web(web-pages and links) [84], internet(routers and physical/optical connections) [85],
citations network connecting scientific papers [86], alliance networks in business(companies
and collaborations/alliances), protein regulatory networks(protein and interaction)[87, 88].
Small world networks [89]
Scale free-networks in which most actors are not neighbours of one another, but most actors
are reachable from other actors of the network are known as Small world networks. If a
network has a degree-distribution which can be fit with a power law distribution, it is taken as
a sign that the network is small-world. For example network of airline flights, network of con-
nected proteins, many large-scale neural networks in the brain [90], such as the visual system
and brain stem, exhibit small-world properties.
113
114
Appendix D
CUDA Programming Model
Compute Unified Device Architecture or CUDA[1, 91, 92] is NVIDIA’s hardware software co-
processing architecture cum programming model that allows programmer to write applications
for its GPGPU’s. CUDA programming model requires programmer to specify structured paral-
lelism in form of shared-memory data parallel programs(SPMD style). Unlike general purpose
processors, most GPGPU’s do not support dynamic parallelism1
NVIDIA calls the main system processor plus its DRAM as Host and the GPGPU plus its GDDR
DRAM as device. Host acts as a Master which fully manages the GPGPU Slave or co-processor.
In other words, the GPGPU cannot access CPU memory, cannot allocate local memory or cannot
start computing unless host CPU tells it to.
Using minimal set of CUDA abstractions(Kernel and device functions) for expressing parallelism,
a programmer can write application for CUDA based GPGPU’s in languages such as C, C++,
OpenCL, etc. A CUDA program is a unified source code containing interleaving of ANSI C code
that executes on host called Host Code and CUDA C (minimal extensions to ANSI C) code
that runs on GPGPU called device code or kernel. Programmer needs to carefully divide and
map the application parts depending upon capability of host CPU and GPGPU device. Usually,
highly data parallel part of application is accelerated on GPGPU and inherently sequential or
less data parallel part is executed on host CPU. For executing such CUDA program, this unified
source code needs to be compiled using NVCC compiler driver.
In addition, a CUDA program generally involves transferring data to be processed on GPGPU
from CPU memory to GPGPU GDDR RAM. Once the GPGPU has finished execution of CUDA
kernel, the results are subsequently copied back to CPU memory. Following is a typical CUDA
program execution flow managed by the host CPU:
• Allocate host data [malloc() function ]
• Allocate device memory where host data will be copied [cudaMalloc() function]
• Copy data from host memory to device memory [cudaMemcpy() synchronous function]
1In Dynamic parallelism, one parallel thread can dynamically spawn new parallel threads at runtime. Dynamical
parallelism is a usual feature of any General purpose or Multicore processors but is not supported on Tesla or Fermi
GPGPU’s. Recently, NVIDIA also added support for dynamic parallelism on its Kepler GPGPU’s
115
116
• Invoke kernel on device by calling it with following parameters: thread hierarchy config-
uration and memory pointers to allocated device memory
// asynchronous function call
kernel_function<<<thread_hierarchy_config>>>(device_memory_ptrs,...);
• Do some execution on host ( happens in parallel with kernel execution )
• Optionally, host waits for previous calls to complete [cudaThreadSynchronize() function]
• Copy result data from device to host[cudaMemcpy() synchronous function]
• Deallocate host and device memory [free() and cudaFree()]
D.0.1 Thread Hierarchy
A CUDA kernel is a sequential piece of C/C++ code that can be executed by one or more CUDA
hardware threads. Any C/C++ function can be made a kernel function by putting __global__
prefix to its function definition. Such a kernel can be called from host and will be executed on
device. Similarly any function that is to be called and executed on the device should be pre-
fixed with __device__ keyword. Functions without any function qualifiers are by default host
functions(called and executed on host). A function can be made both host and device function
by putting both __host__ and __device__ qualifier to the function definition.
CUDA requires programmer to specify a hierarchy of threads through thread_hierarchy_config
parameter passed to the device kernel during invocation. This parameter basically specifies
organization of CUDA threads in terms of thread blocks and number of threads per block
(threads_per_block). This means a total number of (thread blocks x threads_per_block) CUDA
threads will be created on kernel invocation.
Further, all the thread blocks that are generated during a kernel invocation are collectively
called as grid. Each thread block executes independently of other thread blocks. When all
threads of a kernel complete execution, the corresponding grid terminates. Every time device
kernel is invoked and terminated, a corresponding grid is created and terminated. Any CUDA
program is thus an interleaving of CPU code and CUDA grids or kernel invocations (Figure D.1).
Within a grid, each thread block has a thread block id which can be accessed within kernel
function by threads of the block through variables blockIdx.x and blockIdx.y (2-dimensions
for compute capability 1.3). Similarly each thread within a block has an associated id which can
be referred in kernel function through variables threadIdx.x, threadIdx.y and threadIdx.z.
Additional built-in variables, gridDim and blockDim, provide the dimension of the grid and
the dimension of each thread block respectively. During execution of kernel by CUDA threads,
each thread will see different value for its blockIdx and threadIdx variables. These four vari-
ables: blockIdx, threadIdx, gridDim and blockDim can be accessed within a kernel function
and are basically used for calculating index values which can be used by thread executing the
kernel to read or write data to appropriate memory locations. Accordingly we can specify
thread_hierarchy_config parameter as follows:
117
thread_hierarchy_config = grid_config, threadblock_config}
grid_config = (threadBlock.X, threadBlock.Y)}
threadblock_config = (threadIdx.X, threadIdx.Y, threadIdx.Z)}
CUDA provides a C struct type named dim3 to create grid_config and threadblock_config pa-
rameters. For example, invocation of kernel function kernelF ≪ dim3(2,2, 1), dim3(4, 2,1)≫
will generate grid and thread blocks as shown in Figure D.2.
Figure D.1. CUDA Program - interleaving of host code and kernel invocations
Figure D.2. Multidimensional grid and thread block
118
D.0.2 Thread Scheduling
The CUDA thread execution is somewhat analogous to SIMD/SPMD model of execution but
NVIDIA calls this as Single Instruction Multiple Thread(SIMT) architecture [91]. Just as a
SIMD processor controls a vector of multiple data lanes, a SIMT processor controls bunch of
threads(warp). The SIMT architecture manages CUDA threads in bunch of 32 parallel threads
of the same type(vertex, geometry, pixel, or compute) called as Warp (Figure D.1).
With SIMT execution, the hardware fetches and issue same instruction for each thread of
a warp independently, before moving to the next instruction of the warp. If threads within
a warp stall( Figure D.3 ) due to an instruction causing long latency operation like global
memory loads, floating point arithmetic, etc., the Multithreaded issue/warp scheduler stops
execution of the warp and swaps it with another waiting warp ready to be executed. If there
are one or more ready warps waiting to be executed, some priority scheme is used to identify
which warp is to be scheduled next. NVIDIA says the overhead of scheduling a new warp is
almost zero if there are one or more warps ready for execution. In other words, CUDA fine
grained Multithreading basically interleaves execution of warps. Therefore, on a CUDA device,
a warp is the smallest schedulable executable unit of parallelism and is thus the granularity of
hardware multithreading.
According to NVIDIA recommendations, an SM realizes full efficiency and performance when
all threads of a warp take the same execution path. If some threads within a warp diverge due
to say a data-dependent conditional branch, or other programming constructs (loop iterations
etc.), the warp execution serializes leading to performance deterioration equal to number of
divergent paths.
Figure D.3. Scheduling of warps and thread blocks. Source: Mark Silberstein, Technion,
"Lectues notes - Programming massively parallel processors"
D.0.3 Thread communication and synchronization
CUDA provides various mechanism for thread communication and synchronization primitives
[93] depending upon level in thread hierarchy (see Table D.1)
D.0.4 Memory Model
In CUDA memory model(see Figure D.4), there are multiple levels of memory: per thread
registers and per thread local memory (not shown in figure), per thread block shared
memory, per grid global and constant memory
119
Table D.1. CUDA thread communication and synchronization
Thread hierarchy Communication Synchronization
Threads within a warp shared/global memory implicit synchronization
Warps within a thread block
shared memory barrier synchronization
(__syncthreads()) or
Atomics
global memory CUDA Atomic operations
Warps of different thread blocks global memory No mechanism
Thread blocks within a given grid
or kernel
global memory atomic memory operations
Thread blocks from different grids
or kernels
global memory implicit synchronization
Figure D.4. CUDA memory model [1]
Table D.2. CUDA Variable Qualifiers
Variable Qualifier Variable placed in memory ?
__device__ device global memory(GDDR)
__constant__ Constant Memory
__shared__ Shared Memory(On-chip)
No qualifier Registers(On-chip). Register spills
go into local memory(GDDR)
120
T
able
D
.3.
M
em
ory
organization
of
T
esla
C
1060
T
20/G
T
200
w
ith
C
U
D
A
com
pute
capability
1.3
R
egisters
Localm
em
ory
Shared
m
em
ory
G
lobalm
em
ory
C
on
stan
t
M
em
ory
O
n
-chip/O
ff-chip
O
n-chip
O
ff-chip
G
D
D
R
O
n-chip
O
ff-chip
G
D
D
R
O
ff-chip
Scope
Private
per
thread
Private
per
thread
Threads
of
the
thread
block
A
llthreads
of
a
grid
A
llthreads
of
a
grid
Lifetim
e
Thread
Thread
Thread
block
A
pplication
A
pplication
R
ead(R
)/W
rite(W
)
by
device?
R
/W
R
/W
R
/W
R
/W
R
R
/W
access
by
host?
N
o
N
o
N
o
R
/W
R
/W
Size
64K
B
per
SM
or
TB
(32bit
x
16K
)
(64
logicalbanks/SM
)
<
4
G
B
M
B
16K
B
per
SM
or
TB
(1K
B
x
16
banks)
4
G
B
64
K
B
total
(8
K
B/SM
)
Is
C
ached?
N
A
N
o
N
o
N
o
Yes,C
onstant
cache
Laten
cy
[93]
(clock
cycles)
~
1
cc
~
400-450
cc
~
30-40
cc
~
400-450
cc
~
1-3
cc
cache
hit.
~
100
cc
cache
m
iss
B
an
dw
idth
~
8-10
TB/s
170
G
B/s
~
1000-1400
G
B/s
aggregate
for
30
SM
~
100-150
G
B/s
~
1000
G
B/s
121
CUDA API provides appropriate qualifiers (see Table D.2) that can be used to place variables
in appropriate CUDA memory. In case of absence of variable qualifier(automatic variables), the
compiler allocates register memory to the variable(thread local). In case there are not sufficient
registers, the variable is placed in local memory(GDDR). Thread local Arrays can also be allo-
cated in registers provided there are enough registers otherwise they also spill to local memory.
In addition, shared memory can be statically allocated at compile time and also dynamically
allocated at runtime.
Table D.3 shows specifications of different kind of memories in CUDA memory model for
GPGPU with GT200 CUDA architecture [1, 94, 95, 96, 97] and compute capability 1.3. All
memories in CUDA memory model belong to two types of GPU physical memories: on-chip
and off-chip. All on-chip CUDA memories provide low latency and high bandwidth access to
the data. To get good performance for memory bound algorithms on such architectures, it
is important to utilize the on-chip CUDA memories effectively and minimize interaction with
off-chip memories.
122
Appendix E
Counters for computeprof profiler
Table E.1 shows description of the profile counters for the computeprofile command-line tool
(Source: NVIDIA Compute Visual Profiler).
Table E.1. Profile counters for computeprof tool
Counter Description
Counter Description
warp_serialize This counter gives the number of thread warps that serialize on ad-
dress conflicts to either shared or constant memory.
instructions Number of Instructions executed
instruction throughput (derived
counter)
This is the ratio of achieved instruction rate to peak single issue in-
struction rate. The achieved instruction rate is calculated using the
profiler counter "instructions". This is calculated as: (instructions) /
(gpu_time * clock_frequency)
cta_launched Number of threads blocks launched on a TPC
sm_cta_launched Number of threads blocks launched on a SM
branch Number of unique branch instructions in program ( includes barrier
instructions )
divergent_branch Number of unique branches that diverge
Divergent branches (%) (derived
counter)
The percentage of branches that are causing divergence within a
warp amongst all the branches present in the kernel. Divergence
within a warp causes serialization in execution. This is calculated
as: (100*divergent branch)/(divergent branch + branch)
dynsmemperblock Size of dynamically allocated shared memory per block in bytes for
a kernel launch.
stasmemperblock Size of statically allocated shared memory per block in bytes for a
kernel launch
regperthread Number of registers used per thread for a kernel launch.
gld_request Number of executed load instructions per warp in a SM
123
124
Counter Description
gst_request Number of executed global store instructions per warp in a SM
gld_64b 64-byte global memory load transactions
gst_64b 64-byte global memory store transactions
glob mem read throughput (de-
rived counter)
Global memory read throughput in gigabytes per second. For
compute capability < 2.0 this is calculated as: (((gld_32*32) +
(gld_64*64) + (gld_128*128)) * TPC) / (gputime * 1000)
glob mem write throughput (de-
rived counter)
Global memory write throughput in gigabytes per second. For
compute capability < 2.0 this is calculated as: (((gst_32*32) +
(gst_64*64) + (gst_128*128)) * TPC) / (gputime * 1000)
glob mem overall throughput (de-
rived counter)
Global memory overall throughput in gigabytes per second. This is
calculated as: Global memory read throughput + Global memory
write throughput
gld_incoherent Non-coalesced (incoherent) global memory loads
gld_coherent Number of Coalesced (coherent) global memory loads
gst_incoherent Number of Non-coalesced (incoherent) global memory stores
gst_coherent Number of Coalesced (coherent) global memory stores
Appendix F
Business Plan
A business plan describes an idea and analysis on how to start a new business of any topic
of interest to the student and is a mandatory part of the master thesis for ALaRI Business
Track students. The business plan that has been developed in scope of my studies in group
with Darayus Patel. This business plan proposes a business in the Embedded system domain in
India but is not related to topic of this Master thesis. Therefore, the details of this business plan
has been left out of this thesis report and business plan report is instead made available on the
ALaRI platform(privately - As report is confidential can be accessed only with permission).
125
126
Glossary
API Application Programming Interface
ALU Arithmetic and Logical Unit
CMOS Complementary Metal Oxide Semiconductor
CMP Chip Multiprocessor
CPI Cycles per Instruction
CUDA Compute Unified Device Architecture
DAG Directed Acyclic Graph
ESL Electronic System Level
GPP General Purpose Processor
GPU Graphics Processing Unit
GPGPU General Purpose Graphics Processing Unit
IC Integrated Circuit
IP Intellectual Property
IPC Instructions per Cycle or Inter Processor Communication
ILP Instruction Level Parallelism
ISA Instruction Set Architecture
ITRS International Technology Roadmap for Semiconductors
MIPS Million Instructions per Second
MPSoC Multiprocessor System on Chip
NoC Network on chip
NUMA Non-Uniform Memory Access
PE Processing Element
PCI-E Peripheral Component Interface - Express
SIMD Single Instruction Stream Multiple Data Stream
SFU Special Function Unit
SM Streaming Multiprocessor
SP Scalar Processor or CUDA Core
SMP Symmetric Multiprocessors or Shared Memory Processors
TCB Task Control Block
TLP Thread Level Parallelism
TBB Threading Building Blocks
UMA Unified Memory Access
VLIW Very Long Instruction Word
127
128
Bibliography
[1] D. B. Kirk and W.-m. W. Hwu, Programming Massively Parallel Processors: A Hands-on
Approach, 1st ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2010.
[2] H. Sutter and J. Larus, “Software and the Concurrency Revolution,” Queue, vol. 3,
no. 7, pp. 54–62, Sep. 2005. [Online]. Available: http://doi.acm.org/10.1145/1095408.
1095421
[3] D. Geer, “Chip makers turn to multicore processors,” Computer, vol. 38, no. 5, pp. 11–13,
May 2005.
[4] D. P. Mehta and S. Sahni, Handbook Of Data Structures And Applications (Chapman &
Hall/Crc Computer and Information Science Series.). Chapman & Hall/CRC, 2004.
[5] M. J. Atallah and M. Blanton, Eds., Algorithms and theory of computation handbook: gen-
eral concepts and techniques, 2nd ed. Chapman & Hall/CRC, 2010.
[6] S. S. Wasserman, “Random Directed Graph Distributions in the Triad Census in Social
Networks,” National Bureau of Economic Research Working Paper Series, vol. No. 113, pp.
61–86, 1975. [Online]. Available: http://www.nber.org/papers/w0113
[7] J. Moody, “Matrix methods for calculating the triad census,” Social Networks, vol. 20,
no. 4, pp. 291–299, 1998. [Online]. Available: http://linkinghub.elsevier.com/retrieve/
pii/S0378873398000069
[8] V. Batagelj and A. Mrvar, “A subquadratic triad census algorithm for large sparse networks
with small maximum degree,” Social networks, vol. 23, no. 3, pp. 237–243, 2001.
[Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/S0378873301000351
[9] G. Chin, A. Marquez, S. Choudhury, and K. Maschhoff, “Implementing and evaluating
multithreaded triad census algorithms on the Cray XMT,” in Parallel & Distributed
Processing, 2009. IPDPS 2009. IEEE International Symposium on. IEEE, 2009, pp. 1–9.
[Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5161109
[10] P. W. Holland and S. Leinhardt, “A method for detecting structure in sociometric data,”
American Journal of Sociology, vol. 76, no. 3, pp. 492–513, 1970. [Online]. Available:
http://www.jstor.org/stable/2775735
[11] J. Moody, “Matrix methods for calculating the triad census,” Social Networks, vol. 20,
no. 4, pp. 291–299, 1998. [Online]. Available: http://linkinghub.elsevier.com/retrieve/
pii/S0378873398000069
129
130 Bibliography
[12] K. A. Yelick, S. Chakrabarti, E. Deprit, J. Jones, A. Krishnamurthy, and C. po Wen, “Data
structures for irregular applications,” in DIMACS Workshop on Parallel Algorithms for Un-
structured and Dynamic Problems, 1993.
[13] K. Mehlhorn and P. Sanders, Algorithms and Data Structures: The Basic Toolbox, 1st ed.
Springer Publishing Company, Incorporated, 2008.
[14] Y. N. Srikant and P. Shankar, Eds., The Compiler Design Handbook: Optimizations and
Machine Code Generation. CRC Press, 2002.
[15] U. Drepper and R. Hat, “What Every Programmer Should Know About Memory,”
Changes, vol. 3, no. 4, p. 114, 2007. [Online]. Available: http://citeseerx.ist.psu.edu/
viewdoc/download?doi=10.1.1.91.957&amp;rep=rep1&amp;type=pdf
[16] R. P. Garg and I. Sharapov, Techniques for Optimizing Applications: High Performance Com-
puting. Upper Saddle River, NJ, USA: Prentice Hall PTR, 2001.
[17] E. D. Demaine, “Cache-oblivious algorithms and data structures,” in in Lecture Notes from
the EEF Summer School on Massive Data Sets, 2002.
[18] P. Kumar, “Cache Oblivious Algorithms,” in Algorithms for Memory Hierarchies, ser.
Lecture Notes in Computer Science, U. Meyer, P. Sanders, and J. Sibeyn, Eds.
Springer Berlin / Heidelberg, 2003, vol. 2625, pp. 193–212. [Online]. Available:
http://dx.doi.org/10.1007/3-540-36574-5_9
[19] L. Arge, M. A. Bender, E. D. Demaine, C. E. Leiserson, and K. Mehlhorn, Eds., Cache-
Oblivious and Cache-Aware Algorithms, ser. Dagstuhl Seminar Proceedings, vol. 04301.
IBFI, Schloss Dagstuhl, Germany, 2005.
[20] G. Fry, “Implementing AMD cache-optimal coding techniques,” AMD Developer Central,
2008. [Online]. Available: http://developer.amd.com/documentation/articles/pages/
implementingamdcache-optimalcodingtechniques.aspx
[21] L. Frias, J. Petit, and S. Roura, “Lists revisited: Cache-conscious stl lists,”
J. Exp. Algorithmics, vol. 14, pp. 5:3.5–5:3.27, Jan. 2010. [Online]. Available:
http://doi.acm.org/10.1145/1498698.1564505
[22] K. Madduri, “Scaling up Graph Algorithms on Emerging Multicore Systems,” 2009.
[23] M. Karlsson, F. Dahlgren, and P. Stenstrom, “A prefetching technique for irregular accesses
to linked data structures,” in High-Performance Computer Architecture, 2000. HPCA-6.
Proceedings. Sixth International Symposium on, 2000, pp. 206–217.
[24] Z. Zhang and J. Torrellas, “Speeding up irregular applications in shared-memory mul-
tiprocessors: memory binding and group prefetching,” in Computer Architecture, 1995.
Proceedings., 22nd Annual International Symposium on, Jun. 1995, pp. 188–199.
[25] AMD, “Performace optimization for windows applications,” AMD Developer Central,
2008. [Online]. Available: http://developer.amd.com/documentation/articles/pages/
PerformanceOptimizationofWindowsApplicationsonAMDProcessors2.aspx
131 Bibliography
[26] J.-S. Park, M. Penner, and V. K. Prasanna, “Optimizing graph algorithms for improved
cache performance,” Parallel and Distributed Systems, IEEE Transactions on, vol. 15, no. 9,
pp. 769–782, 2004.
[27] E. T. Michael, “A Westmere Addition to a High-Performance Nehalem iDataPlex
Cluster and DDN S2A9900 Storage for Texas A&M University,” Texas A&M University
Supercomputing Facility, 2010. [Online]. Available: http://sc.tamu.edu/systems/eos/
Westmere-iDP.php
[28] S. K. Derek Bachand, Selim Bilgin, Robert Greiner, Per Hammarlund, David L Hill,
Thomas Huff and R. Safranek, “The Uncore: A Modular Approach to Feeding The High-
Performance Cores,” Intel Technology Journal, vol. Volume 14, no. Issue 3, p. 20, 2010.
[29] G. S. Maddox and R. J. Safranek., “An Introduction to Intel QuickPath In-
terconnect,” in High Performance Multi- Core Processor Fabric. Intel Press,
2009. [Online]. Available: http://www.intel.com/content/dam/doc/white-paper/
quick-path-interconnect-introduction-paper.pdf
[30] J. Nickolls and W. J. Dally, “The GPU Computing Era,” Micro, IEEE, vol. 30, no. 2, pp.
56–69, 2010.
[31] N. Corporation, “NVIDIA’s Next Generation CUDA Compute Architecture: Fermi,” 2009.
[32] “NVIDIA’s Next Generation CUDA Compute Architecture: Kepler GK110,”
p. 24, 2012. [Online]. Available: http://www.nvidia.com/content/PDF/kepler/
NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
[33] G. Rünger and M. Schwind, “Parallelization Strategies for Mixed Regular-Irregular
Applications on Multicore-Systems,” in Proceedings of the 8th International Symposium
on Advanced Parallel Processing Technologies, ser. APPT ’09. Berlin, Heidelberg:
Springer-Verlag, 2009, pp. 375–388. [Online]. Available: http://dx.doi.org/10.1007/
978-3-642-03644-6_30
[34] Y. Xia and V. K. Prasanna, “TOPOLOGICALLY ADAPTIVE PARALLEL BREADTH-FIRST
SEARCH ON MULTICORE PROCESSORS,” Nov. 2009.
[35] K. Madduri and D. Bader, “Compact graph representations and parallel connectivity
algorithms for massive dynamic network analysis,” Work, 2009. [Online]. Available:
http://www.computer.org/portal/web/csdl/doi/10.1109/IPDPS.2009.5161060
[36] P. Harish and P. Narayanan, “Accelerating large graph algorithms on the GPU using
CUDA,” High Performance ComputingâA˘S¸HiPC 2007, pp. 197–208, 2007. [Online].
Available: http://www.springerlink.com/index/y4816x2q7475v93n.pdf
[37] A. Pradesh, “A Fast GPU Algorithm for Graph Connectivity,” web2py.iiit.ac.in,
2010. [Online]. Available: http://web2py.iiit.ac.in/publications/default/download/
inproceedings.pdf.b1f53d644da58553.747269616c2e706466.pdf
[38] F. Dehne and K. Yogaratnam, “Exploring the Limits of GPUs With Parallel
Graph Algorithms,” Engineering, vol. abs/1002.4, 2010. [Online]. Available: http:
//arxiv.org/abs/1002.4482
132 Bibliography
[39] S. Hong, S. K. Kim, T. Oguntebi, and K. Olukotun, “Accelerating cuda graph algorithms
at maximum warp,” in Proceedings of the 16th ACM symposium on Principles and practice
of parallel programming, ser. PPoPP ’11. New York, NY, USA: ACM, 2011, pp. 267–276.
[Online]. Available: http://doi.acm.org/10.1145/1941553.1941590
[40] S. Hong, T. Oguntebi, and K. Olukotun, “Efficient parallel graph exploration on multi-
core cpu and gpu,” in Parallel Architectures and Compilation Techniques (PACT), 2011
International Conference on, oct. 2011, pp. 78 –88.
[41] J. Nieplocha, A. Márquez, J. Feo, D. Chavarr\’\ia-Miranda, G. Chin, C. Scherrer, and
N. Beagley, “Evaluating the potential of multithreaded platforms for irregular scientific
computations,” in Proceedings of the 4th international conference on Computing frontiers.
ACM, 2007, pp. 47–58. [Online]. Available: http://portal.acm.org/citation.cfm?id=
1242541
[42] K. Pingali, M. Kulkarni, D. Nguyen, M. Burtscher, M. Mendez-lojo, D. Prountzos, X. Sui,
and Z. Zhong, “Amorphous Data-parallelism in Irregular Algorithms,” 2009.
[43] K. Pingali, D. Nguyen, M. Kulkarni, M. Burtscher, M. A. Hassaan, R. Kaleem,
T.-H. Lee, A. Lenharth, R. Manevich, M. Méndez-Lojo, D. Prountzos, and
X. Sui, “The tao of parallelism in algorithms,” in Proceedings of the 32nd
ACM SIGPLAN conference on Programming language design and implementation, ser.
PLDI ’11. New York, NY, USA: ACM, 2011, pp. 12–25. [Online]. Available:
http://doi.acm.org/10.1145/1993498.1993501
[44] D. A. Bader and K. Madduri, “Designing multithreaded algorithms for breadth-first search
and st-connectivity on the cray mta-2,” in Proceedings of the 2006 International Conference
on Parallel Processing, ser. ICPP ’06. Washington, DC, USA: IEEE Computer Society,
2006, pp. 523–530. [Online]. Available: http://dx.doi.org/10.1109/ICPP.2006.34
[45] K. Madduri, D. Ediger, K. Jiang, D. A. Bader, and D. Chavarria-Miranda, “A faster parallel
algorithm and efficient multithreaded implementations for evaluating betweenness
centrality on massive datasets,” in Proceedings of the 2009 IEEE International Symposium
on Parallel&Distributed Processing. Washington, DC, USA: IEEE Computer Society, 2009,
pp. 1–8. [Online]. Available: http://portal.acm.org/citation.cfm?id=1586640.1587741
[46] Batagelj, Vladimir, “Pajek data sets - Notre Dame Self-Organized Networks Database,”
2004. [Online]. Available: http://vlado.fmf.uni-lj.si/pub/networks/data/ND/NDnets.
htm
[47] J. Leskovec, J. Kleinberg, and C. Faloutsos, “Graphs over time: densification
laws, shrinking diameters and possible explanations,” in Proceedings of the eleventh
ACM SIGKDD international conference on Knowledge discovery in data mining, ser.
KDD ’05. New York, NY, USA: ACM, 2005, pp. 177–187. [Online]. Available:
http://doi.acm.org/10.1145/1081870.1081893
[48] J. Leskovec, L. A. Adamic, and B. A. Huberman, “The dynamics of viral
marketing,” ACM Trans. Web, vol. 1, no. 1, May 2007. [Online]. Available:
http://doi.acm.org/10.1145/1232722.1232727
133 Bibliography
[49] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, “Community structure in large
networks: Natural cluster sizes and the absence of large well-defined clusters,” CoRR, vol.
abs/0810.1355, 2008.
[50] Batagelj, Vladimir, “Pajek data sets - The Edinburgh Associative Thesaurus,” 2003.
[Online]. Available: http://vlado.fmf.uni-lj.si/pub/networks/data/dic/eat/Eat.htm
[51] Rotateright LLC, “Zoom ver 1.6.3 build 5.”
[52] Intel, “Intel Vtune Amplifier.” [Online]. Available: http://software.intel.com/en-us/
intel-vtune-amplifier-xe
[53] ——, “Intel 64 and IA-32 Architectures Software Developer Man-
uals.” [Online]. Available: http://www.intel.com/content/dam/doc/manual/
64-ia-32-architectures-optimization-manual.pdf
[54] Gonina, Ekaterina and Chong, Jike, “Task Queue Implementation Pattern.” [Online].
Available: http://parlab.eecs.berkeley.edu/wiki/_media/patterns/taskqueue.pdf
[55] V. Volkov, “Better performance at lower occupancy,” in GPU Technology Conference, 2010.
[Online]. Available: http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
[56] CUDA C Best Practices Guide, 4th ed., NVIDIA Corporation, 2701 San Tomas Expressway,
Santa Clara 95050, USA, May 2011. [Online]. Available: http://developer.download.
nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Best_Practices_Guide.pdf
[57] CUDA Toolkit Documentation, 4th ed., NVIDIA Corporation, 2701 San Tomas Expressway,
Santa Clara 95050, USA, May 2012. [Online]. Available: http://docs.nvidia.com/cuda/
cuda-c-programming-guide/index.html#compute-capability-1-x
[58] A. Grama, G. Karypis, V. Kumar, and A. Gupta, Introduction to Parallel Computing (2nd
Edition), 2nd ed. Addison Wesley, Jan. 2003. [Online]. Available: http://www.amazon.
com/exec/obidos/redirect?tag=citeulike07-20&path=ASIN/0201648652
[59] J. L. Hennessy and D. A. Patterson, Computer Architecture, Fourth Edition: A Quantitative
Approach. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2006.
[60] M. Herlihy and N. Shavit, The Art of Multiprocessor Programming. San Francisco, CA,
USA: Morgan Kaufmann Publishers Inc., 2008.
[61] C. Breshears, The Art of Concurrency: A Thread Monkey’s Guide to Writing Parallel Appli-
cations. O’Reilly Media, Inc., 2009.
[62] B. Chapman, G. Jost, and R. van der Pas, Using OpenMP: Portable Shared Memory
Parallel Programming (Scientific and Engineering Computation). The MIT Press,
Oct. 2007. [Online]. Available: http://www.amazon.com/exec/obidos/redirect?tag=
citeulike07-20&path=ASIN/0262533022
[63] T. Rauber and G. Ruenger, Parallel Programming: for Multicore and Cluster Systems.
Springer Berlin Heidelberg, 2010. [Online]. Available: http://www.springerlink.com/
index/10.1007/978-3-642-04818-0
[64] Q. Li and C. Yao, Real-Time Concepts for Embedded Systems. CMP Books, 2003.
134 Bibliography
[65] G. C. Buttazzo, Hard Real-time Computing Systems: Predictable Scheduling Algorithms And
Applications (Real-Time Systems Series). Santa Clara, CA, USA: Springer-Verlag TELOS,
2004.
[66] M. Ben, Principles of concurrent and distributed programming, second edition, 2nd ed.
Addison-Wesley, 2006.
[67] F. L. Magazine, S. Management, S. Development, and D. Eadline, “Concurrent and Parallel
Are Not The Same,” Raritan A Quarterly Review, 2009.
[68] H. Sutter, “Effective Concurrency,” Web, Sep. 2010.
[Online]. Available: http://herbsutter.com/2010/09/24/
effective-concurrency-know-when-to-use-an-active-object-instead-of-a-mutex/
[69] C. Hughes and T. Hughes, Parallel and Distributed Programming Using C++. Prentice
Hall Professional Technical Reference, 2003.
[70] J. Reinders, Intel threading building blocks, 1st ed. Sebastopol, CA, USA: O’Reilly &
Associates, Inc., 2007.
[71] H. Kaeslin, Digital Integrated Circuit Design: From VLSI Architectures to CMOS Fabrication,
1st ed. New York, NY, USA: Cambridge University Press, 2008.
[72] I.J.Bush and W.Smith, “The Weak Scaling of DL_POLY 3.” [Online]. Available:
http://www.cse.scitech.ac.uk/arc/dlpoly_scale.shtml
[73] J.-L. Baer, Microprocessor Architecture From Simple Pipelines to Chip Multiprocessors.
Cambridge University Press, 2010.
[74] “Multithreaded Application Acceleration with Chip Multithreading (CMT), Multicore/-
Multithread UltraSPARC Processors,” p. 12, 2008.
[75] W. Wolf, Modern VLSI Design: IP-Based Design, 4th ed. Upper Saddle River, NJ, USA:
Prentice Hall PTR, 2008.
[76] B. Mathew and M. Franklin, Computer Architecture and Design. CRC Press, Inc., 2008,
ch. 1, pp. 1–67.
[77] M. Smotherman, System Design. CRC Press, Inc., 2008, ch. 2, pp. 2–9.
[78] M. Levy and T. M. Conte, “Embedded Multicore Processors and Systems,” IEEE Micro,
vol. 29, pp. 7–9, 2009.
[79] D. Liu, Embedded DSP Processor Design. Morgan Kaufmann Publishers Inc., 2008.
[80] C. Kozyrakis, “EE382a- Advanced Processor Architecture (Fall 2010),” 2010.
[81] D. Koufaty and D. T. Marr, “Hyperthreading technology in the netburst microarchitec-
ture,” Micro, IEEE, vol. 23, no. 2, pp. 56–65, 2003.
[82] P. Ienne, “Advanced Computer Architecture,” 2008.
135 Bibliography
[83] A.-L. Barabási and E. Bonabeau, “Scale-Free Networks,” Scientific American, vol. 288,
no. 5, pp. 60–69, May 2003. [Online]. Available: http://www.nature.com/doifinder/10.
1038/scientificamerican0503-60
[84] A. Barabasi, R. Albert, and H. Jeong, “Scale-free characteristics of random
networks: the topology of the world-wide web,” Physica A: Statistical Mechanics
and its Applications, vol. 281, no. 1-4, pp. 69–77, 2000. [Online]. Available:
http://linkinghub.elsevier.com/retrieve/pii/S0378437100000182
[85] M. Faloutsos, P. Faloutsos, and C. Faloutsos, “On power-law relationships of the Internet
topology,” ACM SIGCOMM Computer Communication Review, vol. 29, no. 4, pp. 251–262,
1999. [Online]. Available: http://portal.acm.org/citation.cfm?doid=316194.316229
[86] S. Redner, “How popular is your paper? An empirical study of citation distribution,”
European Physical Journal B, vol. 4, no. 2, pp. 131–134, 1998. [Online]. Available:
http://arxiv.org/abs/cond-mat/9804163
[87] A. S. N. Seshasayee, “Social behavior of the yeast protein-protein interaction network.”
In Silico Biology, vol. 6, no. 1-2, pp. 127–130, 2006.
[88] A. Barrat, M. Barthélemy, R. Pastor-Satorras, and A. Vespignani, “The architecture of
complex weighted networks,” Proceedings of the National Academy of Sciences of the
United States of America, vol. 101, no. 11, pp. 3747–3752, 2004. [Online]. Available:
http://arxiv.org/abs/cond-mat/0311416
[89] D. J. Watts, “Collective dynamics of small-world networks,” The Structure and
Dynamics of Networks, vol. 393, no. 6684, pp. 440–442, 2006. [Online].
Available: http://books.google.it/books?hl=en&lr=&id=6LvQIIP0TQ8C&oi=fnd&pg=
RA1-PA301&ots=v84PnWclzV&sig=0ILQcrg8bLKT5Yr_ccH7CjX4GSc
[90] V. M. Eguiluz, D. R. Chialvo, G. A. Cecchi, M. Baliki, and A. V. Apkarian, “Scale-free
brain functional networks.” Physical Review Letters, vol. 94, no. 1, p. 4, 2005. [Online].
Available: http://arxiv.org/abs/cond-mat/0309092
[91] E. Lindholm, J. Nickolls, S. F. Oberman, and J. Montrym, “NVIDIA Tesla: A Unified Graph-
ics and Computing Architecture,” IEEE Micro, vol. 28, no. 2, pp. 39–55, 2008.
[92] J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU
Programming, 1st ed. Addison-Wesley Professional, Jul. 2010. [Online]. Available:
http://www.worldcat.org/isbn/0131387685
[93] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos, “Demystifying
GPU microarchitecture through microbenchmarking,” in Performance Analysis of Systems
Software (ISPASS), 2010 IEEE International Symposium on, Mar. 2010, pp. 235–246.
[94] M. T. Silberstein, “Programming Massively Parallel Processors,” 2011.
[95] J. Siegel, J. Ributzka, and X. Li, “CUDA Memory Optimizations for Large Data-Structures
in the Gravit Simulator,” in Parallel Processing Workshops, 2009. ICPPW ’09. International
Conference on, 2009, pp. 174–181.
136 Bibliography
[96] R. E. Yang and C. E. Zach, “GPGPU: General Purpose Programming on the Graphics
Processing Unit.”
[97] R. Farber, “CUDA, Supercomputing for the Masses,” 2008. [Online]. Available:
http://www.ddj.com/architect/207200659
