Accelerating Graph Processing on Large-scale Multicores by Ahmad, Masab
University of Connecticut 
OpenCommons@UConn 
Doctoral Dissertations University of Connecticut Graduate School 
10-7-2019 
Accelerating Graph Processing on Large-scale Multicores 
Masab Ahmad 
University of Connecticut - Storrs, masab.ahmad@uconn.edu 
Recommended Citation 
Ahmad, Masab, "Accelerating Graph Processing on Large-scale Multicores" (2019). Doctoral 
Dissertations. 2329. 
https://opencommons.uconn.edu/dissertations/2329 
Accelerating Graph Processing on Large-scale
Multicores
Masab Ahmad, Ph.D.
University of Connecticut, 2019
ABSTRACT
With the ever-increasing amount of data and input variations, portable performance is becoming
harder to exploit on today’s architectures. Computational setups utilize single-chip processors,
such as GPUs or large-scale multicores for graph analytics. Some algorithm-input combinations
perform more efficiently when utilizing a GPU’s higher concurrency and bandwidth, while others
perform better with a multicore’s stronger data caching capabilities. Architectural choices also
occur within selected accelerators, where variables such as threading and thread placement need
to be decided for optimal performance. This paper proposes a performance predictor paradigm
for a heterogeneous parallel architecture where multiple disparate accelerators are integrated in an
operational high performance computing setup. The predictor aims to improve graph processing
efficiency by exploiting the underlying concurrency variations within and across the heterogeneous
integrated accelerators using graph benchmark and input characteristics. The evaluation shows
that intelligent and real-time selection of near-optimal concurrency choices provides performance
benefits ranging from 5% to 3.8×, and an energy benefit averaging around 2.4× over the traditional
single-accelerator setup.
Accelerating Graph Processing on Large-scale
Multicores
Masab Ahmad
M.S, University of Connecticut, Storrs, CT, USA, 2016
B.E., National University of Sciences and Technology, Islamabad, Pakistan, 2013
A Dissertation
Submitted in Partial Fulfillment of the
Requirements for the Degree of
Doctor of Philosophy
at the
University of Connecticut
2019
Copyright by
Masab Ahmad
2019
APPROVAL PAGE
Doctor of Philosophy Dissertation
Accelerating Graph Processing on Large-scale
Multicores
Presented by
Masab Ahmad, M.S., B.E.
Major Advisor
Omer Khan
Associate Advisor
John Chandy
Associate Advisor
Marten Van Dijk
University of Connecticut
2019
ACKNOWLEDGMENTS
I am very thankful to my advisor, Prof. Omer Khan, who steered me towards the right path
during my PhD studies. Even during my personal struggles, he supported me with utmost sincerity.
I am indebted to him for the knowledge and invaluable experiences I have gained during my PhD.
I would like to thank my former and current colleagues, Farrukh Hijaz, Syed Kamran Haider,
Hamza Omar, Halit Dogan, Qingchuan Shi, Mohsin Shan, and Akif Rehman, for their continued
support and great friendship. Special thanks go to Farrukh Hijaz and Syed Kamran Haider, as
I owe them a lot for mentoring me in the early years of my studies at the University of Connecticut.
I would like to acknowledge the funding received for supporting my PhD research from the US
National Science Foundation, the Naval Research Laboratory (NRL), Semiconductor Research
Corporation (SRC), IBM Research, and NXP Semiconductors. I would especially like to express my
appreciation to Brian Kahne of NXP Semiconductors for his continuous support on the development
of the multicore simulator and workloads that I employed in my research. I would also like to
thank José Joao of Arm Research, and Christopher Hughes of Intel Corporation for their valuable
discussions and feedback towards the evolution of my research.
I also would like to thank my unconditionally loving and supporting family. Knowing that my
parents, Zia and Rema Ahmad, and my siblings, Sohaib and Aleena Ahmad, always keep me in
their prayers gave me great strength when I was under a lot of stress. I wouldn’t have been able to
be where I am without their support and encouragement.
Last but not least, I am grateful to my wife, Fajar, for being beside me with great patience and
support during the most stressful times. When I was almost always working and busy, she was the
one who took care of everything else in my life, including our son, Azlan Ahmad.
Contents
List of Figures
List of Tables
Ch. 1. Introduction
Ch. 2. Multi-Accelerator System
Ch. 3. Performance Prediction Paradigm
3.1 Tuning the Intra- and Inter- Accelerator Choices
3.2 Input (I) Variables
3.3 Input Graph Expression using I Variables:
3.4 Benchmark (B) Variables
3.5 Benchmark Expression using B Variables
Ch. 4. HeteroMap Decision Tree Model
4.1 Inter-Accelerator (M1) Model
4.2 Intra-Accelerator (M2-20) Selection
4.3 M Choice Selection Example
Ch. 5. HeteroMap Framework Automation
5.1 Offline Learning Formulation
5.2 Training of Models
5.3 Online Evaluation
5.4 Deep Learning Prediction Model
5.5 Regression Prediction Model
Ch. 6. Methodology
6.1 Accelerator Configurations
6.2 Benchmarks
6.3 Processing Metrics
Ch. 7. Evaluation
7.1 Selecting a Learning Model
7.2 Performance Variations
7.3 Understanding Energy & Utilization Variations
7.4 Changing Fixed Accelerator & Memory Sizes
7.5 The Impact of Re-Learning
7.5.1 Training Evaluation
Ch. 8. Related Work
Ch. 9. Conclusion
Ch. 10. Future Work
Ch. 11. Associated Publications
Bibliography
List of Figures
1.0.1 How input graph variations exhibit different performance within and across underlying
accelerators in SSSP.
1.0.2 Multi-accelerator system example with the run-time performance predictor for
graph benchmarks and inputs.
2.0.1 Machine choices (M) for GPUs and multicores.
3.5.1 Discretization of (B) variables for SSSP-Bellman-Ford (SSSP-BF).
4.2.1 Decision Tree Heuristic Model flow for SSSP-BF and SSSP-Delta with the USA-Cal
input graph. The proposed model predicts and selects nine M choices.
4.3.1 HeteroMap Framework Flow for the Multi-Accelerator Architecture.
5.1.1 Example synthetic benchmarks generated.
5.3.1 Neural Network showing network parameters.
5.4.1 Non-Linear Regression Equation. High-profile variables and associated trends in
input dependence on output thread selections also shown.
7.2.1 Scheduler Comparisons for Graph Workload-Input Combinations (All results
normalized to the GTX750Ti GPU implementation) (Higher is worse).
7.2.2 Energy benefits averaged for various inputs for a given benchmark. (Xeon Phi vs.
GTX 750Ti). All results normalized to the maximal energy used for any B-I
combination.
7.4.1 Scheduler Comparisons for various Graph Workloads-Input Combinations (All
results normalized to the GTX970Ti GPU implementation) (Higher is worse).
Note that Optimal Choices change when compared to the GTX750Ti in Figure 7.2.1.
7.4.2 Geomean results averaged for different inputs for each benchmark for the 40-core
CPU. All results are normalized to the GPU implementation.
7.4.3 Memory size variations for different machine combinations. The x-axis varies
memory sizes for a multi-accelerator system, while the y-axis shows normalized
completion time.
List of Tables
3.1.1 Input Datasets.
6.2.1 Primary Accelerator Configuration.
6.2.2 Synthetic Input Datasets.
7.1.1 Learning Model Strategies. Speedup shown over the GTX-750 GPU.
7.5.1 Re-Learning Performance. Compared with the baseline case of the GTX-970 as it
had better performance.
7.5.2 Sensitivity to Training on Deep. - 128. Speedup shown over the GPU. Using all
synthetic graphs provides the speedup shown in Table 3.
Chapter 1
Introduction
Applications that rely on graph processing are increasingly deployed across a wide range of
architectures [1, 2]. Future HPC datacenters are expected to have heterogeneous connected
accelerators, with Cray and NVidia already pursuing similar ideas [3, 4]. Prior works indicate that
graph analytics face limitations when executed on a single-accelerator setup [5] [6]. Thus, this paper
proposes a multi-accelerator setup that adapts to the graph problem and input by selecting the right
machine and its concurrency configuration. To understand this problem, consider the iterative
Bellman-Ford algorithm and its variants for finding shortest paths. Such a graph algorithm lends itself
to data-parallel execution since it easily allows graph chunks to be accessed in parallel. Hence,
such an algorithm performs well on a GPU, since it uses the massively available threading to
exploit parallelism [7]. On the other hand, algorithms such as Triangle Counting are not as parallel,
and consist of reductions on vertices that result in complex data access patterns. These access
patterns lead to increased data movement and synchronization requirements [8]. Multicores perform
well in such cases as they incorporate caching capabilities for efficient data movement and thread
synchronization [9]. These variations solidify the need for diverse types of accelerators in a setup
executing graph analytic workloads.
In this context, this paper considers a heterogeneous architecture comprising competitive
multicore and GPU accelerators, each with its own discrete memory. This setup exposes concurrency
choices to graph applications, thus compensating for the limited data reuse of GPUs and the limited
throughput of multicores. Performance variations occur not only due to changes in benchmark
characteristics, but also due to input changes within a benchmark, as well as different mappings of
graph analytic benchmark-input combinations on different accelerators. These choices do not exist
in a single-accelerator setup. Algorithmically, in the presence of expensive synchronization on
shared data or indirect memory accesses, GPUs cannot perform as well as multicores. Multicores
possess hardware cache coherence and a complex cache hierarchy to exploit performance in such
cases [10]. In various cases, the massive throughput of the GPU, or the data reuse of the multicore,
needs to be constrained to reduce stress on the memory system and data communication. One way
to manage this is to spawn fewer threads in the workload [7]. Thus, choices occur both within and
across accelerators, for different benchmarks and inputs.
Input dependence is known to play a big role in graph analytic performance [11] [12]. An
example of such a trade-off is shown in Figure 1.0.1, which shows an OpenTuner-optimized [13]
∆-stepping single source shortest path (SSSP) algorithm [14] running a sparse and a dense graph
on an Intel Xeon Phi 7120P multicore and an NVidia GTX-750Ti GPU. Thread counts are varied from
the minimum to the maximum available on each accelerator and are normalized
on the x-axis, while the y-axis shows completion time. The two accelerators are categorized as
competitive as they possess similar compute capabilities.
[Figure 1.0.1 plot: SSSP (∆-Step) completion time (ms) versus normalized threads (Min to Max) on the
Xeon Phi and the GTX750. Panels: sparse graph (CAL-road-network), which runs well on multicores,
and dense graph (CAGE-14), which runs well on GPUs.]
Figure 1.0.1: How input graph variations exhibit different performance within and across underlying
accelerators in SSSP.
The multicore performs better than the GPU for the sparse road network [15], as a higher graph
diameter results in longer dependency chains that determine the optimal path between source and
destination vertices. This linked traversal leads to more complex data access patterns that are more
expensive on the GPU, as it does not possess the addressing capabilities to perform such complex
data accesses. Moreover, the different phases in ∆-stepping result in more divergence and complex
indirect addressing, which adds to GPU overheads. The multicore in this scenario performs several
orders of magnitude faster than the GPU. The CAGE-14 graph [16] has a lower diameter, and thus
requires fewer iterations to converge. Due to its high density of edge connectivity, it maps
well onto a GPU. The larger core and thread counts available in the GPU allow it to outperform
the multicore by 3×. The GPU fares well in this case as it can spawn
thousands of threads without having to enforce many barrier synchronization calls. Even when
the optimal accelerator is selected, there is a slew of machine choices within the accelerator to
choose from. In the case of the CAGE-14 graph, intermediate threading performs best on the GPU,
as spawning more threads raises stress on the GPU’s already small cache system. This illustrates
the high dimensionality of the input problem space, as various benchmarks and inputs
have different characteristics, making manual tuning difficult. Machine choices within
and across accelerators therefore need to be tuned based on different inputs to achieve optimal
performance. Moreover, the patterns that determine concurrency and data
accesses also vary across graph analytic benchmarks, which further motivates the need to tune this
accelerator choice space.
This poses several questions: What patterns in graph benchmarks and inputs lead to the best
exploitation of concurrency within and across GPUs and multicores? What architectural
differences between these machines make them suitable for the diverse benchmark-input
combinations? What are the run-time concurrency trade-offs of using one accelerator over another
in a heterogeneous setup? Benchmarks and inputs reveal accelerator choices due to their direct
correlations with the optimal architectural choices. Thus, graph benchmark and input choices need
to be exposed systematically, after which a high-level intelligent predictor tunes the accelerator
choices. However, due to the high-dimensional choice space and the non-linear effects
of having multiple accelerators and their intra-accelerator concurrency choices, selecting the right
choices becomes a hard problem.
This paper proposes a novel performance predictor framework, HeteroMap, which integrates
benchmark and input choices to dynamically select parameters within and across accelerators.
The prediction framework captures program characteristics by intelligently discretizing graph
benchmarks and inputs into easily expressible representative variables. Mappings of benchmark and
input representations to inter- and intra-accelerator choices are done using a decision tree analytical
model. The proposed analytical model is further automated using machine learning to amortize
costs associated with the large graph algorithmic choice space. The automated model is trained
using synthetically generated graph benchmarks [17, 18], and inputs [19, 20]. For a variety of
graph analytic benchmarks executing real-world inputs, HeteroMap provides performance benefits
ranging from 5% to 3.8× when compared to a single GPU-only or multicore-only setup.
[Figure 1.0.2 diagram: a multi-accelerator architecture in which chunks from a large graph are loaded
into the DDRx memory of each accelerator. Multicore: strong cores, fewer threads, cache coherence,
better hierarchy, less bandwidth. GPU: weak cores, more threads, no coherence, large registers,
more bandwidth. Architectural choices and benchmark & input choices feed a real-time performance
predictor, which outputs the predicted setting.]
Figure 1.0.2: Multi-accelerator system example with the run-time performance predictor for graph
benchmarks and inputs.
Chapter 2
Multi-Accelerator System
The target system utilizes discrete GPU and multicore accelerators. The setup considers either a
weaker NVidia GTX-750Ti GPU or a stronger NVidia GTX-970 GPU, but not both at the same
time. We also consider a weaker Intel Xeon Phi 7120P multicore or a stronger 40-core Intel Xeon
E5-2650 v3 multicore. All multicore-GPU combination pairs are considered to analyze the inter-
and intra-accelerator design space. This multi-accelerator system is used as a prototype to convey the
underlying idea of mapping architectural choices using graph benchmarks and inputs. Figure 1.0.2
depicts an example multi-accelerator system showcasing a GPU and a Xeon Phi multicore with
GDDR5 memories, as well as various architectural differences between associated accelerators. As
memory size changes require architectural reconfigurations, evaluations are done on fixed memory
sizes for each target accelerator. The design space of various combinations of memory sizes is also
studied to analyze how main memory size changes affect performance in accelerators.
Input graph chunks are loaded in the accelerator’s DDR memory for processing. The system is
used in a way that graph benchmark-input combinations are loaded and executed with the appropriate
architectural choices for individual accelerators with the discrete memory size constraint. In a
real-time context, it is harder to allocate graph chunks and process them as larger graphs do not
fit in main memory. Hence, chunks from larger graphs are extracted temporally using the
state-of-the-art Stinger framework [21], and streamed into the accelerator’s memory to be processed.
The prediction paradigm takes in graph chunk characteristics, and predicts optimal architectural
concurrency parameters for each chunk.
Accelerator choice:
  M1      Accelerator (Multicore, GPU)
Multicore hardware choices:
  M2      Cores
  M3      Multi-threading
  M4      KMP_Blocktime
  M5-7    KMP_Place_Threads
  M8      KMP_Affinity
  M10     #pragma SIMD
Multicore OpenMP choices:
  M9      OMP nowait
  M11     OMP for schedule (static,
  M12     guided,
  M13     dynamic,
  M14     auto,
  M15     chunk_size)
  M16     OMP_Nested
  M17     OMP_Max_Active_levels
  M18     GOMP_spincount
GPU hardware choices:
  M19     Global Threads
  M20     Local Threads
Figure 2.0.1: Machine choices (M) for GPUs and multicores.
Chapter 3
Performance Prediction Paradigm
Graph inputs consist of vertices, V, which are connected to other vertices via edges, E. Graph
benchmarks iterate over vertices in outer loops and over edges in inner loops, and different phases
in workloads have different complexities and diverse data access patterns. Due to differences in
data access and synchronization patterns across graph inputs and workloads, different benchmarks
and inputs perform optimally on different machines with different intra-accelerator settings. The
multi-accelerator architecture in Figure 1.0.2 exposes these intra- and inter-accelerator variations,
and we create a knowledge base from benchmarks and inputs that can be mapped to these machine choices.
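The outer-vertex, inner-edge loop structure described above can be illustrated with a minimal sequential sketch of Bellman-Ford over a graph in compressed sparse row (CSR) form. The function name, the CSR array names, and the small example graph are assumptions made for illustration; they are not taken from the benchmarks evaluated in this work.

```python
from math import inf


def bellman_ford(offsets, targets, weights, source):
    """Single-source shortest paths over a CSR graph (sequential sketch)."""
    num_vertices = len(offsets) - 1
    dist = [inf] * num_vertices
    dist[source] = 0
    # At most |V| - 1 rounds are needed when there are no negative cycles.
    for _ in range(num_vertices - 1):
        changed = False
        # Outer loop over vertices: a parallel version would split this across threads.
        for u in range(num_vertices):
            if dist[u] == inf:
                continue
            # Inner loop over the outgoing edges of u.
            for e in range(offsets[u], offsets[u + 1]):
                v = targets[e]
                candidate = dist[u] + weights[e]
                if candidate < dist[v]:
                    dist[v] = candidate  # a parallel version would use an atomic min here
                    changed = True
        if not changed:
            break
    return dist


# Hypothetical example graph with edges 0->1 (4), 0->2 (1), 1->3 (5), 2->1 (2).
offsets = [0, 2, 3, 4, 4]
targets = [1, 2, 3, 1]
weights = [4, 1, 5, 2]
distances = bellman_ford(offsets, targets, weights, 0)
```

The write to `dist[v]` is the shared read-write access that would need synchronization once the outer loop is divided among threads.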
3.1 Tuning the Intra- and Inter- Accelerator Choices
Various capabilities in GPU and multicore accelerators allow improved performance extraction
for specific graph benchmark and input characteristics. This trade-off between accelerators is
depicted as M1 in Figure 2.0.1, where either a GPU or a multicore can be selected. In GPUs,
massively available threading hides data access latencies to deliver high throughput execution.
This occurs in data-parallel workloads with short dependency chains and little shared data, and
thus GPU accelerators should be selected for such cases. Although GPUs fare well with highly
data-parallel execution, they underperform when benchmarks have complex data access patterns,
costly synchronization, and inter-thread data movement. Moreover, even in data-parallel workloads,
the threading and throughput of a GPU may need to be constrained, depending on input sizes
and densities, to reduce stress on the memory system for optimal performance. This creates two
choices within a GPU: Global threading, which distributes threads across the GPU chip,
and Local threading, which specifies the thread count on a GPU core. These choices are
listed in Figure 2.0.1 as GPU hardware choices, M19-20.
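As a rough illustration of how the two GPU threading choices interact, the sketch below derives a launch configuration from a cap on total threads and a per-group thread count. It assumes OpenCL-style semantics, where the global size is the total thread count and the local size is the number of threads per work-group; the function and parameter names are hypothetical and the mapping to M19 and M20 is an interpretation, not the implementation used in this work.

```python
def launch_config(num_work_items, local_threads, max_global_threads):
    """Return (global_threads, local_threads, groups) for a GPU launch."""
    if local_threads <= 0 or max_global_threads < local_threads:
        raise ValueError("local_threads must be positive and fit in the global budget")
    # Cap the total thread count (the global threading choice).
    wanted = min(num_work_items, max_global_threads)
    # Round up to a whole number of groups of local_threads threads each.
    groups = max(1, -(-wanted // local_threads))
    # Do not exceed the global budget after rounding up.
    groups = min(groups, max_global_threads // local_threads)
    return groups * local_threads, local_threads, groups


small = launch_config(1000, 128, 4096)
capped = launch_config(1_000_000, 256, 2048)
```

With a capped global size, each thread has to process several work items, which is one way to constrain throughput and reduce pressure on the memory system.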
Multicores perform well for complex data access patterns by taking advantage of their cache
reuse and cache coherence capabilities. Therefore, multicores should be selected if there is ample
shared data. Multi-threading usage and thread placement choices depend on input graph
characteristics such as edge density. Specifically for multicore threading, KMP_Affinity and
KMP_Place_Threads are the thread placement hardware choices in Figure 2.0.1, while #pragma simd
controls SIMD usage. Thread placement may be compact or loose, and is important for data movement
along with core and cache utilization. For example, threads may want to use cache slices of unused
cores, which can be enabled by placing threads in the center of unused core clusters. This improves
performance by reducing data movement and synchronization costs as threads are placed closer to
the data they access. KMP_Blocktime is another parameter, which defines the time a thread waits before
going to sleep. This is helpful during contention and load imbalance, as threads can go to sleep
rather than continue polling on contended data.
Other parameters, such as those in the OpenMP paradigm, also have non-linear relationships with
benchmarks and inputs, and are used to improve shared data reuse and movement costs. Scheduling
variables in OpenMP control how work is distributed across threads in parallel
regions. Scheduling is controlled by OMP for schedule, which takes static, dynamic,
guided, or auto choices, and data tile/chunk sizes. Data scheduling is related to access patterns;
read-write shared data requires dynamic scheduling, which mitigates contention and data
movement overheads [22]. Additional parameters such as OMP_Nested exploit nested parallelism
within loops, while OMP_Max_Active_levels states how many levels of parallelism can be
nested. GOMP_spincount defines how long threads actively wait for OpenMP calls. Larger values
of this variable may be used to increase waiting times for threads if there is high contention.
These OpenMP parameters are denoted as M9 and M11-18, and are listed in Figure 2.0.1.
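One way to apply such run-time choices is through environment variables set before launching a benchmark. The sketch below builds such an environment; the function name, parameter names, and default values are assumptions for illustration. The variable names are those of the OpenMP specification (OMP_*), the Intel OpenMP runtime (KMP_*), and GNU libgomp (GOMP_SPINCOUNT), which are different runtimes, so a given binary honors only the subset its runtime supports.

```python
def openmp_environment(schedule="dynamic", chunk_size=64, threads=61,
                       nested=False, max_active_levels=1,
                       affinity="compact", blocktime_ms=200, spincount=None):
    """Translate hypothetical M choices into runtime environment variables."""
    if schedule not in ("static", "dynamic", "guided", "auto"):
        raise ValueError(f"unknown schedule kind: {schedule}")
    # OMP_SCHEDULE takes "kind[,chunk]" and applies to loops declared schedule(runtime).
    if schedule == "auto" or chunk_size is None:
        sched = schedule
    else:
        sched = f"{schedule},{chunk_size}"
    env = {
        "OMP_NUM_THREADS": str(threads),
        "OMP_SCHEDULE": sched,
        "OMP_NESTED": "true" if nested else "false",
        "OMP_MAX_ACTIVE_LEVELS": str(max_active_levels),
        # Intel OpenMP runtime settings.
        "KMP_AFFINITY": affinity,
        "KMP_BLOCKTIME": str(blocktime_ms),
    }
    if spincount is not None:
        # GNU libgomp setting.
        env["GOMP_SPINCOUNT"] = str(spincount)
    return env


env = openmp_environment(schedule="guided", chunk_size=16, threads=8)
```

The resulting dictionary can be merged into `os.environ` and passed as the `env` argument of `subprocess.run` when starting the benchmark binary.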
The M variable space is a function of the target benchmark and associated graph input, and this
is the formulation required to tune the M parameters. The M choices are a non-linear
function of benchmark and graph characteristics. Thus, we create a benchmark and input
graph representation space, denoted by B and I respectively. To optimize performance, a tuple
vector, $\vec{X}$, is constructed that relates benchmark choices $\vec{B}$, input choices $\vec{I}$,
and accelerator choices $\vec{M}$ in the proposed architecture:
$\vec{X}(M) = \mathrm{MinPerf}(\vec{B}, \vec{I})$. The function
MinPerf() is the proposed configurator that finds M choices. To properly relate benchmarks and
inputs with M choices, B and I variables need to be extracted and classified for tuning. The next
sections describe the B and I variables in the context of how they are expressed, and their
relationships with machine choices.
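To make the role of MinPerf() concrete, the sketch below treats it as a search for the M configuration with the lowest estimated cost for a given (B, I) pair. This is only an illustration: the exhaustive enumeration, the `toy_cost` model, and all names are assumptions, whereas the framework proposed here relies on a decision tree model and learned predictors rather than brute-force search.

```python
from itertools import product


def min_perf(cost_model, benchmark, graph, choice_space):
    """Return the M configuration with the lowest estimated cost, and that cost."""
    names = list(choice_space)
    best_config, best_cost = None, float("inf")
    # Enumerate every combination of machine choices.
    for values in product(*(choice_space[name] for name in names)):
        config = dict(zip(names, values))
        cost = cost_model(benchmark, graph, config)
        if cost < best_cost:
            best_config, best_cost = config, cost
    return best_config, best_cost


def toy_cost(benchmark, graph, config):
    """Made-up cost model, for illustration only."""
    work = graph["I1"] + graph["I2"]               # larger, denser graphs mean more work
    if config["accelerator"] == "gpu":
        penalty = 1.0 + 4.0 * benchmark["B8"]      # indirect accesses penalize the GPU
    else:
        penalty = 2.0                              # flat multicore cost
    # Work shrinks with more threads, while a per-thread overhead grows.
    return (1.0 + work) * penalty / config["threads"] + 0.01 * config["threads"]


space = {"accelerator": ["multicore", "gpu"], "threads": [1, 4, 16]}
graph = {"I1": 0.1, "I2": 0.1}
indirect_heavy, _ = min_perf(toy_cost, {"B8": 0.8}, graph, space)
direct_only, _ = min_perf(toy_cost, {"B8": 0.0}, graph, space)
```

Under this invented cost model, the benchmark with heavy indirect addressing is mapped to the multicore and the one without it to the GPU, mirroring the qualitative trade-off discussed in this section.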
3.2 Input (I) Variables
The most relevant input variables are graph size using vertex counts (I1) and edge density (I2),
which specify the size of the graph and the density of computations. Larger and denser graphs
can be divided among more threads, thus thread count selections in accelerators are directly
correlated with I1 and I2. The maximum edge count of any vertex in the graph (I3) is also relevant
as it defines how much deviation there is in edge connectivity from the average density given by I2.
This is used to define average per-thread work, as well as divergence in work between threads.
Higher or lower per-thread work is used to decide how much local threading and/or SIMD to use,
while work divergence is used to optimally place threads in a selected accelerator. Graph diameter
(I4) specifies the largest connectivity distance between any two vertices, and thus the length of
dependency chains between vertices in a graph. I4 is obtained alongside input graphs or using run-time
approximations [26]. This in turn expresses how much the memory system is going to be stressed
during execution, as longer vertex dependency chains need to be held in memory. I4 is
helpful in deciding which type of memory system needs to be tuned for an input graph.
All input variables are also easily expressible in percentages, as the maximum vertex and edge
counts, maximum degree, and diameter are known in literature [17]. These proposed I variables are
used to classify real input graphs to expose input variations, as shown in Table 3.1.1. These range
from sparse road networks and social networks to dense mouse brain graphs.
Table 3.1.1: Input Datasets.
Evaluation Data         #V      #E      Max. Deg  Diameter
USA-Cal (CA) [15]       1.9M    4.7M    12        850
Facebook (FB) [23]      2.9M    41.9M   90K       12
Livejournal (LJ)        4.8M    85.7M   20K       16
Twitter (Twtr) [24]     41.7M   1.47B   3M        5
Friendster (Frnd)       65.6M   1.81B   5.2K      32
M. Ret. 3 (CO) [25]     562     0.57M   1027      1
Cage14 (CAGE) [16]      1.5M    25.6M   80        8
rgg-n-24 (Rgg) [23]     16.8M   387M    40        2622
Kron.-Large (Kron)      134M    2.15B   16.0      12
3.3 Input Graph Expression using I Variables:
I variables are deduced from graph data and are shown in Figure 3.3.1a. These representations are
obtained by normalizing the input graph’s characteristic data to a value between
0 and 1, in increments of 0.1. I variables are normalized by
comparing the input graph characteristics to the maximum values available in literature [27, 14] for
these variables. Normalization is necessary, as these characteristics need to be compared to each
other to predict inter- and intra-accelerator choices. Furthermore, as graphs vary widely
in their characteristics, a logarithmic normalization is applied to
further smooth I values. Using the USA-Cal input graph as an example,
its vertex and edge counts are low compared to the largest graphs such as Friendster.
Hence I1 and I2 are set to 0.1 for USA-Cal, but 0.8 for Friendster. As the maximum degree of USA-Cal
is also extremely low compared to the largest available degree, that of Twitter (for which I3 is 1),
I3 is set to 0 in this case. However, its diameter is close to the highest available (850 is close
to the largest diameter of 2622 for the Rgg graph). Therefore, we set I4 to 0.8 for USA-Cal, 1 for Rgg,
and 0 for all other graphs. I variables for other input graphs are extracted similarly and shown in
Figure 3.3.1a.
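The normalization steps described in this section (compare against a reference maximum, apply a logarithm, discretize to increments of 0.1) can be sketched as follows. The exact formula and reference maxima are not given in the text, so the log-ratio used here is only one plausible reading, and the values it produces differ from those listed in Figure 3.3.1a. All function and key names are hypothetical.

```python
import math


def normalize_log(value, max_value):
    """Map value in [1, max_value] to [0, 1] on a log scale, in steps of 0.1."""
    if value < 1 or max_value <= 1:
        raise ValueError("expects value >= 1 and max_value > 1")
    ratio = math.log(value) / math.log(max_value)
    ratio = min(max(ratio, 0.0), 1.0)   # clamp to [0, 1]
    return round(ratio * 10) / 10       # discretize to increments of 0.1


def input_variables(vertices, edges, max_degree, diameter, maxima):
    """Return (I1, I2, I3, I4) for one graph, given assumed reference maxima."""
    return (
        normalize_log(vertices, maxima["vertices"]),
        normalize_log(edges, maxima["edges"]),
        normalize_log(max_degree, maxima["max_degree"]),
        normalize_log(diameter, maxima["diameter"]),
    )


halfway = normalize_log(100, 10_000)
```

A graph whose characteristics equal the reference maxima maps to 1 in every variable, and a characteristic of 1 maps to 0.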
3.4 Benchmark (B) Variables
In parallel graph algorithms, the outermost loop is parallelized, and traverses graph vertices in
various phases such as highly parallel vertex division and pareto fronts, or less parallel reductions
and push-pop phases. An algorithm may consist of multiple phases, where phases are separated
by global thread barriers. Inner loops traverse edges, where data is addressed either directly using
loop indexes, or indirectly using complex pointers. Data may be shared either as read-only or as
read-write, and may require atomic updates. Read-write shared data may require local computations
with either fixed point or floating point (FP) requirements to calculate output values to write into
global data structures. These generic primitives are used to generate B benchmark variables. In
this work, variables B1-13 define structural differences within graph-specific data structures and
parallel phases, which are critical components in predicting machine choices.
Figure 3.3.1:
Input graph variables:
  I1   Vertex Count      # of vertices in the graph
  I2   Edge Count        # of edges in the graph
  I3   Max. Edge Count   Highest edge count of a vertex in the input graph
  I4   Diameter          Greatest distance between any two vertices
Input graph classification using I variables:
  Graph   I1    I2    I3    I4
  CA      0.1   0.1   0     0.8
  FB      0.1   0.2   0.1   0
  LJ      0.2   0.3   0.2   0
  Twtr    0.7   0.7   1     0
  Frnd    0.8   0.8   0.1   0
  CO      0     0     0.1   0
  CAGE    0.1   0.2   0.1   0
  Rgg     0.5   0.4   0     1
  Kron    1     1     0     0
(a) Input variables for real graphs.
Benchmark variables:
  Vertex Processing / Scheduling:
    B1    Vertex Division          % program in vertex division
    B2    Pareto                   % program in pareto fronts
    B3    Pareto-Division          % program in divided paretos
    B4    Push-Pop                 % program in Push-Pops
    B5    Reduction                % program in reductions
  Comp. Type:
    B6    Floating Point           % floating point data
  Memory Access:
    B7    Data Driven              % accesses addressed by data
    B8    Indirect                 % complex pointer addressing
  Data Movement:
    B9    Read-only Data           % read-only program data
    B10   Read-write Shared Data   % shared read-write program data
    B11   Locally Accessed Data    % locally accessed data
  Synchronization:
    B12   Contention               % data contended via atomics
    B13   Barriers                 # global barriers per iteration
Benchmark representation using B variables for SSSP-BF, SSSP-Delta, BFS, DFS, PageRank-DP,
PageRank, Tri.Cnt., Comm., and Conn. Comp.
(b) Benchmark variables and representations.
Initial B variables are derived using outer loop parallel primitives. The outer loop may be
data-parallel using Vertex division (B1), lending itself easily to execution with a larger number of
independent threads. Pareto (B2) execution can also be applied on outer loops, where chunks of
vertices mapped to threads statically increase with workload progression. These Pareto phases may
also dynamically increase vertices in threads (B3). Graph workload phases may also take the form of
Push-Pop (B4) accesses, which add certain ordering constraints for processing. This in turn enforces
dependencies, leading to complex data access patterns. Like Push-Pop accesses, Reductions (B5)
contain more sequential work than other phases, and involve synchronization primitives with atomic
operations. The B4 and B5 phase types complicate data access and parallelism, leading to thread
divergence. Therefore, GPUs may underperform for such scheduling patterns [28]. However,
B1-3 lend themselves to high parallelism on the GPU. These variables are important, as they
describe what fraction of a benchmark each phase constitutes. These vertex processing and scheduling
variables (B1-5) are mutually exclusive, as programs are divided into phases. For example, a
program may consist of an 80% vertex division phase and a 20% reduction phase. A programmer sets
these variables by determining what fraction of the benchmark each phase constitutes.
Compute type within phases may be FP computations done by the inner loops of workload
phases. These FP computations determine whether dedicated hardware units need to be exploited. This
is captured by how much program data is specified as FP (B6), which differentiates accelerators, as
some accelerators may have more FP capabilities than others. For example, if 20% of program data
requires FP, then (B6) is set to 0.2. FP operations perform optimally on multicores if they are in a
dense format that exploits SIMD capabilities. Therefore, knowing how much FP computation is
needed helps decide whether to map a benchmark to a multicore or a GPU.
In terms of memory access patterns, addressing is either done with loop variables (B7), or
by complex indirect addressing such as double pointers (B8). Complex addressing primitives are
better handled in multicores as they possess larger caches to hold addressing metadata, and have
faster ways of resolving complex pointers and addressing. Indirectly accessing and reusing data via
addressing in the cache does not fare well on GPUs, as they lack both the capabilities and the
cache capacity to hold such contents. A programmer sets (B7, B8) by estimating what percentage of
data is accessed using loop indexes, and what percentage is accessed indirectly.
Runtime data movement is also diverse, and takes the form of read-only shared data (B9),
read-write shared data (B10), and locally accessed data (B11). (B9-10) fare well on multicores
as they have cache management mechanisms for efficient data movement between cores. (B11) is
data that is operated on locally in thread registers, so its cost depends on how fast the
accelerator’s cores process local computations. Each of these variables is expressed by the
programmer as a percentage of the total accessed data.
Shared data may also require updates with synchronization (B12), where (B12) is viewed as
the percentage of data requiring locks, as certain accelerators may have better performing atomics
than others. The number of barriers in a workload separating phases (B13) also causes variations.
If there are more locks and barriers in a benchmark, then there are more opportunities for
inter-thread communication to cause bottlenecks and load imbalance. (B13) is specified by the
number of barriers between phases: each barrier per iteration increments (B13) by 0.1.
(B1-5) are considered as independent variables. Although the interactions of the remaining
B variables are complex, these variables are not considered mutually exclusive. All benchmark
variables are also easily expressible in percentages. The programmer specifies which B variables
are relevant in a given benchmark. For simplicity, this section first uses an X representation to
signify whether each B variable is specified or not in a benchmark. This classification is shown in
the subsequent subsection.
3.5 Benchmark Expression using B Variables
Now that benchmark variables are defined, these variables can be used to classify real graph
workloads. Graph workloads are thus acquired from a variety of benchmark suites, further specified
in Section 6.2. These benchmarks are also listed in Figure 3.3.1b, with the X representations
showing if a B variable is used in a benchmark. Based on compile-time information about loops
and inputs, loop indexes and data structure sizes are inferred, and are used to approximate relative
strengths of B variables. As multicore and GPU versions of benchmarks use the same algorithms,
their B variable classification remains the same.
Taking the case of SSSP-BF as an example, the only parallelization applied is vertex division,
which sets B1 to X. If the SSSP-Delta workload is used, then parallel buckets are
used to push and pop edges, setting B4 to X. The GAP version also uses a reduction to select a
bucket to use in subsequent iterations, which sets B5 to X. In terms of program phases, the general
distribution is that workloads use data-parallel vertex division (B1) along with reductions (B5). BFS
uses only Pareto-division (B3), and DFS uses only Push-Pop (B4), as these workloads contain only
one phase type. All workloads have data-driven accesses (B7) and read-write shared data
(B10). DFS and Conn. Comp. have complex indirect data accesses, which are due to queuing and
data-manipulated addressing, and these set B8 to X.
B variables are percentages of program sections or percentages of data types used. These
variables need to be normalized because simple X representations do not show the intensity of each
B variable. B variables are expressed within a range of 0 to 1, in increments of 0.1. Finer
increments may be applied; however, we keep the model simple by not using very fine increments.
As graph workloads consist only of phases separated by barriers, the values of the B1-5 variables
add to 1 for all benchmarks. To assign values to more than one B1-5 variable
in a benchmark (e.g. a workload having both Push-Pop and Reduction phases), the programmer
decides approximately what percentage of the code is in each phase. The programmer can statically
inspect data structures to decide what percentage of the structures falls in each of the remaining
B6-12 categories. By specifying B variable values between 0 and 1, the programmer assigns
percentages to variables, thereby characterizing the benchmark.
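The discretization rules above can be sketched in a few lines of Python. This is a minimal illustration and not part of HeteroMap: the helper names `quantize` and `make_b_profile` are hypothetical, and rounding to the nearest 0.1 is an assumption about how a programmer's percentage estimates are snapped to increments.

```python
PHASE_VARS = ("B1", "B2", "B3", "B4", "B5")  # mutually exclusive phase shares


def quantize(fraction):
    """Snap a fraction in [0, 1] to the nearest 0.1 increment (assumed rounding rule)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("B variables must lie between 0 and 1")
    return round(fraction * 10) / 10


def make_b_profile(estimates):
    """Build a full B1..B13 profile from programmer estimates; unspecified variables are 0."""
    profile = {f"B{i}": 0.0 for i in range(1, 14)}
    for name, fraction in estimates.items():
        if name not in profile:
            raise KeyError(f"unknown B variable: {name}")
        profile[name] = quantize(fraction)
    # The phase variables B1-5 must account for the whole program.
    phase_total = sum(profile[name] for name in PHASE_VARS)
    if abs(phase_total - 1.0) > 1e-9:
        raise ValueError(f"B1-5 must add to 1, got {phase_total}")
    return profile


# Roughly 80% vertex division and 20% reduction, with 20% FP data.
profile = make_b_profile({"B1": 0.83, "B5": 0.17, "B6": 0.2})
print(profile["B1"], profile["B5"], profile["B6"])
```

The sum check only covers B1-5, since the text treats the remaining variables as not mutually exclusive.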
As an example, we take SSSP-BF to show this discretization, with its pseudocode shown in
Figure 3.5.1 to visualize B variables. As all of the program code in SSSP-BF uses only vertex
division to parallelize outer loops, B1 is set to 1 in Figure 3.5.1, while the remaining B2-5 are
set to 0. B6 is set to 0 because SSSP-BF does not utilize FP operations. Most of the data accesses
are done using loop indexes, such as accesses to the D_tmp[], D[], and W[] arrays, therefore setting
B7 to 0.8. B8 is set to 0 as there are no indirect accesses. Approximately half of the program data is
composed of the input graph W[], which is read-only by all threads. The other half consists of the
distance arrays (D_tmp[] and D[]), which are read and written frequently by all threads. This sets
B9 and B10 to 0.5 each. Local computations are done on D_tmp[], which constitutes approximately 20%
of program data, hence this sets B11 to 0.2. Locks are also applied only on the D[] array, which is
half the size of the two distance arrays combined, and there are two barrier calls in the benchmark.
This sets B12 and B13 to 0.2 each. Now that the B and I variables are set, relationships between the
B, I and M variables can be found to predict the M variables.

SSSP-Bellman-Ford Example

While (!terminate)
    Parallel for (v: vertex=0 - N)
        Parallel for (e: edge=0 - edges[v])
            If D[v] + W[v,e] < D_tmp[e]
                lock[e]
                D_tmp[e] = D[v] + W[v,e]
                unlock[e]
    Barrier
    Parallel for (v: vertex=0 - N)
        D[v] = D_tmp[v]
    if (D == D_tmp)
        terminate = 1
    Barrier

B1 = 1 (Vertex Div), B2 = 0, B3 = 0, B4 = 0, B5 = 0, B6 = 0, B7 = 0.8 (Data Driv.), B8 = 0,
B9 = 0.5 (Read-only), B10 = 0.5 (R/W Shared), B11 = 0.2 (Local Acc.), B12 = 0.2 (Locks),
B13 = 0.2 (Barriers)

Figure 3.5.1: Discretization of (B) variables for SSSP-Bellman-Ford (SSSP-BF).
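For readers who want to run the SSSP-BF structure from Figure 3.5.1, the following is a sequential Python sketch of the same two-phase, double-buffered loop. It is an illustration under assumptions, not the benchmark's code: the adjacency-list format and the name `sssp_bellman_ford` are hypothetical, edge weights are assumed non-negative, the parallel loops and locks are replaced by ordinary loops, and convergence is tested before the copy so that the loop stops once an iteration changes nothing.

```python
import math


def sssp_bellman_ford(num_vertices, edges, source):
    """edges maps a vertex to a list of (neighbor, weight) pairs."""
    dist = [math.inf] * num_vertices
    dist[source] = 0
    terminate = False
    while not terminate:
        dist_tmp = list(dist)
        # Phase 1 (vertex division): relax every outgoing edge of every vertex.
        for v in range(num_vertices):
            for neighbor, weight in edges.get(v, []):
                candidate = dist[v] + weight
                if candidate < dist_tmp[neighbor]:
                    # In the parallel version this update is guarded by a lock.
                    dist_tmp[neighbor] = candidate
        # Barrier, then phase 2: publish the new distances and test for convergence.
        terminate = dist_tmp == dist
        dist = dist_tmp
    return dist


graph = {0: [(1, 4), (2, 1)], 1: [(3, 5)], 2: [(1, 2)]}
print(sssp_bellman_ford(4, graph, source=0))
```

The two arrays `dist` and `dist_tmp` play the roles of D[] and D_tmp[], and the graph dictionary plays the role of the read-only W[] data.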
Chapter 4
HeteroMap Decision Tree Model
Patterns of M mappings allow visualization of accelerator choices with B and I variables. For
example, benchmarks utilizing Push-Pop (B4) phases are expected to perform better on multicores
than on GPUs. This is due to better data movement capabilities in multicore cache hierarchies, as
well as better core performance for queuing operations. On the other hand, benchmarks with high
data-level parallelism and local thread computations are expected to perform well on a GPU. This
is because GPUs possess more threads to exploit available parallelism, and large register files to
hold local computations. These relationships between B, I and M variables are used to create a
simplified analytical decision tree model.
This section proposes a decision tree heuristic that analytically reduces the choice space
for performance (and energy if needed). Decision trees are easily readable and tunable, and
thus allow for manual modeling. This model is expressed as an inter-accelerator model that first
selects an optimal accelerator, and then an intra-accelerator model that selects concurrency choices
within the accelerator. However, with 13 B variables, 4 I variables, and 20 M variables, the
resulting choice space consists of thousands of combinations to select from. Hence, to simplify the
prediction model, we only look at the most important variables that affect each M parameter. The
complete M model is provided as a C/C++ program at the URL given in footnote 1.
4.1 Inter-Accelerator (M1) Model
A 3-layer manually constructed decision tree is formulated, selecting an accelerator based on
(B, I) combinations. As the complete decision tree is too large, we describe a few partial decision
examples. For example, if a combination has any of B1, B2, or B3 with a value greater than 0.5,
meaning it has abundant vertex-level parallelism, then a GPU is chosen as it exploits this available
parallelism. This allows workloads such as SSSP-BF and BFS to run on the GPU. On the other hand,
if a benchmark has serial Push-Pop accesses (B4) with a high graph density, then the multicore is
selected as it performs well on Push-Pop accesses with the dense graph fitting in its local caches.
In another example, if a benchmark has a high value of B5 (reductions) with some FP (B6), and
negligible local computations (B11), then the GPU is selected. This is because GPUs perform
well with reductions having low local computations, meaning the small GPU threads can make fast
progress using their small caches.
The multicore is selected for the case with reductions (B5) and read-write shared data (B10).
This is because the cache capabilities in multicores allow faster operations on shared data, while
synchronization primitives required for reductions on vertices also perform well due to faster
inter-thread communication abilities. For large graphs with I1 > 0.5, benchmarks with indirect
addressing are also run on the multicore for this reason. Larger graphs running with benchmarks
requiring FP operations (B6) are also run on the multicore as it has a stronger memory hierarchy
and FP capabilities. Thus, workloads such as Conn. Comp., PageRank, and Comm. are run on
multicores if graphs are large.
1 HeteroMap Repository: https://github.com/masabahmad/HeteroMap
A threshold of 0.5 is set as the default to select between the GPU and the multicore, as it is
the unbiased mid-point of the normalized B, I values. For example, for high reduction and read-write
shared data values (B5 > 0.5 and B10 > 0.5), multicores are selected. The execution model
assumes that the programmer has to input such values, and hence selecting the mid-point is
the simplest way to acquire ample performance. Fine-tuning the thresholds may improve
selections further; however, this is left as future work.
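The partial rules described in this section can be written down as a small Python sketch. It is not the complete 3-layer tree from the HeteroMap repository, so it will not reproduce every selection reported in this work (it does not return the multicore for SSSP-Delta, for example). The function name `select_accelerator`, the order in which rules are tested, the `NEGLIGIBLE` cut-off, the use of 0.5 for "high" Push-Pop and density values, and treating any non-zero B6 or B8 as "has FP" or "has indirect addressing" are all assumptions made for illustration.

```python
THRESHOLD = 0.5    # default mid-point threshold from the text
NEGLIGIBLE = 0.1   # assumed cut-off for "negligible" local computation

SSSP_BF = {"B1": 1.0, "B2": 0.0, "B3": 0.0, "B4": 0.0, "B5": 0.0, "B6": 0.0, "B7": 0.8,
           "B8": 0.0, "B9": 0.5, "B10": 0.5, "B11": 0.2, "B12": 0.2, "B13": 0.2}


def select_accelerator(b, i1, density):
    """Return "GPU", "multicore", or None when no partial rule applies.

    b: dict of B variables; i1: normalized graph size;
    density: normalized graph density (its derivation is left to the caller).
    """
    # Reductions on read-write shared data favor the multicore's caches.
    if b["B5"] > THRESHOLD and b["B10"] > THRESHOLD:
        return "multicore"
    # Push-Pop phases on dense graphs favor the multicore.
    if b["B4"] > THRESHOLD and density > THRESHOLD:
        return "multicore"
    # Large graphs with indirect addressing or FP work favor the multicore.
    if i1 > THRESHOLD and (b["B8"] > 0 or b["B6"] > 0):
        return "multicore"
    # Reductions with some FP and negligible local computation favor the GPU.
    if b["B5"] > THRESHOLD and b["B6"] > 0 and b["B11"] <= NEGLIGIBLE:
        return "GPU"
    # Abundant vertex-level parallelism favors the GPU.
    if max(b["B1"], b["B2"], b["B3"]) > THRESHOLD:
        return "GPU"
    return None


print(select_accelerator(SSSP_BF, i1=0.1, density=1.0))
```

Returning `None` for uncovered combinations keeps the sketch honest about being partial; a full model would always resolve to one accelerator.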
4.2 Intra-Accelerator (M2-20) Selection
Intra-accelerator calculations are more complicated due to non-linear (B, I) to M relationships,
motivating the need for a simpler linear equation model. Linear equations are of the form
y = ax + k, which is converted to the following equation when the input (B, I) and output M
variables are linearized.
$$M = a(B, I) + k$$
As all M variables need to be set to a minimum value, k is used to specify this value. For example,
when using a multicore, at least one core must be used, which sets k = 1 for variable M2. k values
for other M variables are set similarly. The term a(B, I) may incorporate linear relationships of
different B, I variables. These relationships are intuitively derived by visualizing the
relationships between B, I and M variables. In some cases, an M variable may either be set or unset,
and thus a threshold of 0.5 is applied to the equation result in such cases. Similar to M1, the
(B, I) to M relationships for the rest of the M variables amount to many partial linear equations.
Hence, this section discusses only the most important equations for the M
variables. Each relationship for each M variable is discussed in the following text.
In GPUs, if the graph is dense (seen from the I variables), then more local threads are desirable to
parallelize edges, making GPU local threads (M20) proportional to the graph density. To obtain
the deployable value, the normalized result of this relationship is
multiplied by the maximum value of the machine variable being set (GPU local threads in
this case). This maximum is given by CL_KERNEL_WORK_GROUP_SIZE in OpenCL (simplified
to max_local_threads). This relationship, with the added constant (k = 1 for GPU local threads, as
at least 1 thread must be spawned), is shown by the following equations:
$$M_{20} = \mathrm{Avg.Deg} \cdot \mathrm{max\_local\_threads} + k$$
$$\mathrm{Avg.Deg} = \left| I_{3} - \frac{I_{2}}{I_{1}} \right|$$
GPU global threads (M19) derive from I1, as outer loops are parallelized among threads. This
implies that if there are more vertices, then more threads can be spawned for additional parallelism,
resulting in the following relation:
$$M_{19} = I_{1} \cdot \mathrm{max\_global\_threads} + k$$
Similarly, in multicores, cores depend on the available parallelism in the outer loop, as more vertices
can be parallelized among more cores with their additional cache slices. This is similar to the
derivation of M19. Multi-threading and SIMD (M3, M10) are also a function of the graph density,
similar to M20. The higher the graph density, the larger the inner loops, meaning more threads or a
wider SIMD per core must be spawned. These variables are given by the following equations:
$$M_{2} = I_{1} \cdot \mathrm{max\_cores} + k$$
$$M_{3,10} = \mathrm{Avg.Deg} \cdot \mathrm{max\_multithreading} + k$$
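The thread-count equations can be checked numerically with a short Python sketch. The function names `avg_degree` and `deploy` are hypothetical. Truncating the result to an integer is an assumption, while capping at the maximum follows the rule stated later in this section. The 61-core and 4-threads-per-core maxima are those of the Xeon Phi 7120P used in this work, and the I values are the USA-Cal values from Figure 4.2.1.

```python
def avg_degree(i1, i2, i3):
    """Avg.Deg = |I3 - (I2 / I1)| using the normalized input variables."""
    return abs(i3 - i2 / i1)


def deploy(normalized, maximum, k):
    """M = normalized * maximum + k, truncated to an integer and capped at maximum."""
    return min(int(normalized * maximum + k), maximum)


# USA-Cal (CA) input graph values from Figure 4.2.1.
i1, i2, i3 = 0.1, 0.1, 0.0
density = avg_degree(i1, i2, i3)

cores = deploy(i1, maximum=61, k=1)                   # M2 on the Xeon Phi
threads_per_core = deploy(density, maximum=4, k=1)    # M3 on the Xeon Phi
print(density, cores, threads_per_core)
```

The same `deploy` helper applies to M19 and M20 once the device's global and local thread limits are known; those limits are device specific and are not assumed here.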
The thread blocktime parameter (M4) defines thread wait times (the maximum thread wait time is set
to 1000 ms, while the minimum can be set to 1 ms). Threads are known to wait on locks and barriers
via OS calls, and higher wait times are associated with higher contention levels. Thus, this
parameter is acquired by taking the average of B12 and B13, as it depends on contention, and by
setting k = 1, as shown by the following equation. The purpose of this equation is to correlate
thread wait times with contention.
$$M_{4} = \frac{B_{12} + B_{13}}{2} \cdot \mathrm{max\_thread\_wait\_time} + k$$
In multicores, threads are placed in a more fine-grained manner, using variables M5-7. Thread
placements depend not only on the average degree of the graph, but also on the graph diameter, as it
determines the temporal progression of work within a graph. Thread placement consists of
three variables that create placement combinations: core ids (M5), thread ids (M6), and thread
offsets (M7). Higher deviations between I3 and the average degree signify variations in edge mapping
across the chip. Thus, threads need to be placed loosely across the chip. A higher graph diameter
indicates longer dependency chains between vertices, meaning that each thread needs to work longer
to achieve desired outputs. This means that more threads are required, as vertices remain idle due
to threads being busy waiting on longer dependencies. Thus, variables M5-7 are calculated by
taking the average of the average degree and the diameter:
$$M_{5\text{-}7} = \mathrm{Avg.Deg.Dia} \cdot \mathrm{max\_thread\_placement} + k$$
$$\mathrm{Avg.Deg.Dia} = \left| \frac{I_{4} + \mathrm{Avg.Deg}}{2} \right|$$
Thread affinities in multicores mean pinning threads to cores in movable or strictly compact ways.
Movable in this case means threads may be moved around by the OS or OpenMP scheduler if it
determines that performance may be gained by moving threads to other cores. Again, affinity is
related to thread placement, hence a relationship with Avg.Deg.Dia is assumed. However, pinning
threads to specific cores also relates to read-write shared data (B10), as performance improves when
shared data is not moved between cores. In such cases, if B10 is high then threads should not be
moved between cores, to avoid unnecessary data movement. In the minimum case, all threads
may be moved around by the scheduler, setting k = 0. Thus, thread affinity (M8) is taken as the
average of Avg.Deg.Dia and B10, as shown by the following equation.
$$M_{8} = \frac{\mathrm{Avg.Deg.Dia} + B_{10}}{2} \cdot \mathrm{max\_thread\_placement} + k$$
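As a worked instance of the placement and affinity terms, the values listed in Figure 4.2.1 can be substituted directly: the USA-Cal graph has I1 = 0.1, I2 = 0.1, I3 = 0, I4 = 0.8, and SSSP-Delta has B10 = 0.4. Rounding 0.65 to 0.7 in the last step is an assumption, based on the 0.1 increments used for normalized variables.

```latex
\begin{align*}
\mathrm{Avg.Deg} &= \left| I_{3} - \frac{I_{2}}{I_{1}} \right|
                  = \left| 0 - \frac{0.1}{0.1} \right| = 1 \\
\mathrm{Avg.Deg.Dia} &= \left| \frac{I_{4} + \mathrm{Avg.Deg}}{2} \right|
                      = \frac{0.8 + 1}{2} = 0.9 \\
\frac{\mathrm{Avg.Deg.Dia} + B_{10}}{2} &= \frac{0.9 + 0.4}{2} = 0.65 \approx 0.7
\end{align*}
```

These normalized terms agree with the 0.9 and 0.7 coefficients that Figure 4.2.1 shows for M5-7 and M8.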
M9 specifies whether threads wait at implicit barriers, where performance is correlated with
doing more local computations (B11), including FP work (B6), and waiting at other barriers (B13).
$$M_{9} = \frac{B_{6} + B_{11} + B_{13}}{3}, \quad \text{set if } M_{9} > 0.5$$
For OpenMP (OMP) scheduling parameters and chunk sizes, relationships depend on both I and
certain B variables, as access patterns depend on benchmark functions. OMP for-static is beneficial
with simple vertex division (B1), data driven loops (B7), and high read-only data (B9).
$$M_{11} = \frac{B_{1} + B_{7} + B_{9}}{3}, \quad \text{set if } M_{11} > 0.5$$
OMP for-guided scheduling is intuitively beneficial with pareto fronts (B2, B3), thus the average of
B2 and B3 is used to correlate with M12.
$$M_{12} = \frac{B_{2} + B_{3}}{2}, \quad \text{set if } M_{12} > 0.5$$
OMP for-dynamic scheduling is intuitively beneficial with complex parallelizations (B4, B5),
indirect accesses (B8), and read-write shared data (B10), thus the average of these variables is
used to correlate with M13.
$$M_{13} = \frac{B_{4} + B_{5} + B_{8} + B_{10}}{4}, \quad \text{set if } M_{13} > 0.5$$
OMP for-auto scheduling is beneficial with simple parallelizations and reductions (B1, B5), data
driven accesses (B7), and barriers (B13), thus the average of these
variables is used to correlate with M14.
$$M_{14} = \frac{B_{1} + B_{5} + B_{7} + B_{13}}{4}, \quad \text{set if } M_{14} > 0.5$$
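The four scheduling scores can be computed together, as in the Python sketch below. The names `SCHEDULE_TERMS`, `schedule_scores`, and `enabled_schedules` are hypothetical, and the terms for the auto schedule follow the variables named in the prose (B1, B5, B7, B13). The text does not say how to choose when several scores exceed 0.5, so the sketch simply reports every schedule that passes the threshold.

```python
SCHEDULE_TERMS = {
    "static": ("B1", "B7", "B9"),            # M11
    "guided": ("B2", "B3"),                  # M12
    "dynamic": ("B4", "B5", "B8", "B10"),    # M13
    "auto": ("B1", "B5", "B7", "B13"),       # M14
}

SSSP_BF = {"B1": 1.0, "B2": 0.0, "B3": 0.0, "B4": 0.0, "B5": 0.0, "B6": 0.0, "B7": 0.8,
           "B8": 0.0, "B9": 0.5, "B10": 0.5, "B11": 0.2, "B12": 0.2, "B13": 0.2}


def schedule_scores(b):
    """Average the B variables associated with each OpenMP schedule."""
    return {name: sum(b[var] for var in terms) / len(terms)
            for name, terms in SCHEDULE_TERMS.items()}


def enabled_schedules(b, threshold=0.5):
    """Schedules whose score is strictly above the threshold."""
    return [name for name, score in schedule_scores(b).items() if score > threshold]


scores = schedule_scores(SSSP_BF)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

For the SSSP-BF profile the static score is the largest, which matches the intuition that a pure vertex-division workload with loop-indexed accesses suits static scheduling.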
Scheduling intensity, M15, also depends on the B2, B3, and B10 values, as larger paretos need
increased load balancing. k is selected as 1.
$$M_{15} = \frac{B_{2} + B_{3} + B_{10}}{3} \cdot \mathrm{max\_scheduling\_chunksize} + k$$
Nested parallelism mainly depends on available local computations, simple loop parallelizations,
and simple accesses (B1, B7, B9, B11). k is selected as 0.
$$M_{16} = \frac{B_{1} + B_{7} + B_{9} + B_{11}}{4} \cdot \mathrm{max\_nested\_parallelism} + k$$
However, the maximum active nested levels in OpenMP depend on how much computation can
be unrolled with dependencies. Thus, the complement of the average of (B4, B10, B12) is used. k is
selected as 0.
$$M_{17} = \left(1 - \frac{B_{4} + B_{10} + B_{12}}{3}\right) \cdot \mathrm{max\_active\_levels} + k$$
Exponential backoff times for lock spins (M18) depend on locks and barriers, thus the average of
these variables is taken. k is specified as a minimum value of 1 clock cycle.
$$M_{18} = \frac{B_{12} + B_{13}}{2} \cdot \mathrm{max\_spin\_time} + k$$
For example, for the Xeon Phi’s multi-threading M3 variable, k is set to 1, which results in at least
1 thread being initialized per core. Similarly, this sets cores in the Xeon Phi, and global threads
in the GPU, to a quarter of their maximum value, so that a very small I1 does not result in a minimal
thread count. This is done for variables M2-4, 9-10, 15, 17-20. If a calculated value resolves
to a larger than maximum value for an M variable, then a ceiling function sets it to its maximum
value. The proposed M variable equations are also expected to work for other GPUs and multicores,
including CPU multicores. The remaining M parameters pertain to OpenMP choices not shown
here due to space constraints, but are described in the HeteroMap repository (see footnote 1).
4.3 M Choice Selection Example
In light of the relationships shown between the B, I variables and the M variables, we show an
example of how M variables are predicted using the proposed model. Figure 4.2.1 shows this flow for
SSSP-BF and SSSP-Delta running with the USA-Cal (CA) input graph. Discretized B variables are shown
for the two benchmarks, and I variables for the input graph, which are acquired by benchmark
profiling. On visual inspection, the B variables for SSSP-BF indicate a highly parallel
workload that has low read-write shared data and contention. This implies that SSSP-BF is expected
to perform optimally on a GPU. On the other hand, SSSP-Delta has more sequential-like functions,
such as reductions and the use of push-pop structures. This also makes its data more contended and
shared in terms of reads and writes. This implies that SSSP-Delta is expected to perform optimally
on a multicore (Xeon Phi used in this case) using the proposed B variables.

[Figure: model flow for the USA-Cal (CA) input graph, with I1 = 0.1, I2 = 0.1, I3 = 0, I4 = 0.8.]
SSSP-Bellman-Ford [Pannotia, IISWC’13]: B1 = 1, B2 = 0, B3 = 0, B4 = 0, B5 = 0, B6 = 0, B7 = 0.8,
B8 = 0, B9 = 0.5, B10 = 0.5, B11 = 0.2, B12 = 0.2, B13 = 0.2. Highly parallel, with low R/W shared
data and contention. B variables imply a GPU as optimal (M1 = GPU).
    M19 = (0.1*max_global_threads) + K
    M20 = (1*max_local_threads) + K
SSSP-Delta-Step [GAPBS, IISWC’15]: B1 = 0.6, B2 = 0, B3 = 0, B4 = 0.2, B5 = 0.2, B6 = 0, B7 = 0.8,
B8 = 0.2, B9 = 0.5, B10 = 0.4, B11 = 0.5, B12 = 0.2, B13 = 0.3. Reductions and Push-Pops, with higher
R/W shared data and contention. B variables imply a multicore as optimal (M1 = Xeon Phi).
    M2 = (0.1*max_cores) + K
    M3 = (1*max_multithreading) + K
    M5-7 = (0.9*max_multithreading) + K
    M8 = (0.7*max_affinity) + K
[Plots for the Xeon Phi and GTX750: Completion Time (ms) versus Normalized Threading Combinations
(Min to Max), with the Selected and Optimal points marked.]
Figure 4.2.1: Decision Tree Heuristic Model flow for SSSP-BF and SSSP-Delta with the USA-Cal
input graph. The proposed model predicts and selects nine M choices.
Intra-accelerator M variables are predicted using the proposed equations. For SSSP-BF selecting
the GPU case, M19 and M20 are calculated using the vertex count, I1, and the average degree,
respectively. These resolve to values of 0.1 for M19 and 1 for M20, meaning that only some global
threading is required, but maximum local threading is to be deployed. For SSSP-Delta, M1 resolves to
select the multicore. Furthermore, the M2 and M3 selections follow M19 and M20, as the input graph
retains the same I1-3 variables. This results in M2 resolving to 7 cores and M3 resolving to its
maximum value of 4 threads per core. Thread placement variables, M5-7, resolve to 0.9 due to the
high indicated diameter in the CA graph, meaning that very loose thread placement is required.
These calculated variables are then deployed, which results in the selected performance shown in
Figure 4.2.1.

[Figure: framework flow. (1) An analyst graph query selects a benchmark; the benchmark variables
(B1-13) and input graph variables (I1-4) form a variable profile. (2) The variable profile is input
to the model (prediction paradigm, with training for automation). (3) The resulting accelerator
configuration, the intra- and inter-accelerator variables (M1-20), is deployed on the proposed
heterogeneous architecture setup.]
Figure 4.3.1: HeteroMap Framework Flow for the Multi-Accelerator Architecture.
To find the optimal performance point, all M variables are swept and completion times are
acquired for each of the two benchmarks. Figure 4.2.1 shows these performance curves, along
with the selected and optimal performance points, where the selected threading results in about a
15% performance difference from the optimal case. This is because the decision tree heuristic
and relationship equations do not take into account all the B and I variables for each M variable.
Thus, some performance remains unexploited, as linearization can only approximate
non-linear relationships. Acquiring the optimal point is only possible if all B and I variables are
tuned exhaustively for each M combination. This is not possible using a manual decision tree and
linear equations, and thus M choice selections need to be automated with more complex models that
reason about these relationships.
Chapter 5
HeteroMap Framework Automation
This section formulates HeteroMap’s predictors in an automated fashion, shown in Figure 4.3.1.
Configuration starts with a central performance prediction paradigm, utilizing off-line learning,
and real-time on-line evaluations. Several predictors are evaluated, namely deep learning and
regression based predictors, that show trade-offs in terms of overhead, accuracy, and performance.
A programmer first sets (B, I) variables for a particular benchmark and input (1), after which the
variable profile is input to the model (decision tree heuristic or automated) (2). The predicted M
parameters are then deployed on the heterogeneous accelerator setup (3).
5.1 Offline Learning Formulation
The vast space of (B, I, M) tuples precludes learning on real inputs. As learning and evaluation
cannot be done on the same benchmarks and inputs, synthetically generated benchmarks and inputs are
required for off-line training. Off-line learning is thus done on synthetic benchmarks and graphs
(note that the analytical decision tree does not require off-line training). Synthetic variants are
generated using formulations in existing benchmark tools [18, 29], and graph generators (Uniform
random [19] and Kronecker [20]). These are well-known to represent real inputs [30], and are thus
considered good candidates for training. For synthetic benchmarks, the formulation described in
Section 3 generates various generic micro benchmarks. High level constructs are generated from the
generic graph benchmark structure. This generalization follows the V-E formulation of graph loops,
and these loops form phases in a benchmark, with each phase having unique characteristics such as
read-write shared data or FP arithmetic. Phases are separated by barriers to propagate values to
other threads. For example, a micro benchmark can be extracted from PageRank, where a generalization
has vertex division parallel loops, reductions, barriers, and floating point requirements. Synthetic
variations are then created by changing these requirements. For example, floating point requirements
can be removed, causing B6 to become 0, or the reduction can be removed to set B1 to 1 and B5 to
0. A per-iteration generalization of a graph workload using the benchmark variations explained in
Section 3 is shown in Figure 5.1.1.
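The two variations mentioned above (dropping FP work, and dropping the reduction phase) can be expressed as small transformations on a B profile. The Python sketch below is illustrative only: the base profile values for a PageRank-like micro benchmark are hypothetical, as are the helper names `remove_fp` and `remove_reduction`. Folding the removed reduction share back into vertex division is an assumption that keeps B1-5 adding to 1.

```python
# Hypothetical B profile for a PageRank-like micro benchmark (values are illustrative).
BASE = {"B1": 0.8, "B2": 0.0, "B3": 0.0, "B4": 0.0, "B5": 0.2, "B6": 0.5, "B7": 0.8,
        "B8": 0.0, "B9": 0.5, "B10": 0.5, "B11": 0.2, "B12": 0.1, "B13": 0.2}


def remove_fp(profile):
    """Variant without floating point requirements: B6 becomes 0."""
    variant = dict(profile)
    variant["B6"] = 0.0
    return variant


def remove_reduction(profile):
    """Variant without the reduction phase: its share moves to vertex division."""
    variant = dict(profile)
    variant["B1"] = round(variant["B1"] + variant["B5"], 1)
    variant["B5"] = 0.0
    return variant


variants = [BASE, remove_fp(BASE), remove_reduction(BASE), remove_reduction(remove_fp(BASE))]
for variant in variants:
    print(variant["B1"], variant["B5"], variant["B6"])
```

Each helper returns a new dictionary, so the base profile can be reused to derive further variants.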
Figure 5.1.1 shows how diverse synthetic benchmarks are created. Mixes of phases (varying
B1-5 values) are obtained by combining different B1-5 phases, along with loop variations such as
read-write data, contention, and FP requirements (varying B6-13 values). This creates a large
synthetic space, as B1-5 can create up to a hundred combinations, while variations of B6-13
create many more. Two generated examples are shown in Figure 5.1.1, with the first example having
a vertex division phase writing local computations to shared data using indirect addressing. The
second example shows two phases separated by a barrier, with the first phase having pareto division
updating a shared array using local computation under locks, and the second phase doing a reduction.
B values are also shown for each of the examples (derived using Figure 3.3.1b), and these are input
to the learning model as training data.
Generated Synthetic Examples

Example 1 (B values: B1 = 1, B8 = 0.8, B9 = 0.9, B11 = 0.9)
    Parallel for (vertex=0 to N)   //vertex division
        Parallel for (edge=0 to edges[v])
            Array[Array[edge]]   //Indirect Accesses
                = Local_Computation (R/W work)

Example 2 (B values: B3 = 0.8, B5 = 0.2, B6 = 0.5, B11 = 0.8, B12 = 0.1, B13 = 0.1)
    Parallel for (vertex=0 to N)   //Pareto Division
        Parallel for (edge=0 to edges[v])
            Lock[edge]
            Array[edge] = Local_Computation (FP)
            UnLock[edge]
    Barrier
    Parallel for (vertex=0 to N)   //Reduction Phase
        Reduction(Array)

Figure 5.1.1: Example synthetic benchmarks generated.
5.2 Training of Models
Several million samples are generated from different B, I combinations mapping to M variables,
which shows the complexity of this space. This stems from several thousand synthetic B, I-varying
combinations. Moreover, the M variables from Figure 2.0.1 have combinations within themselves
as well. In the case of the GTX-750TI - Xeon Phi 7120P setup, 20 unique M variables
are chosen, as shown in Figure 2.0.1. These range from thread placement variables, such as
KMP_PLACE_THREADS (which alone has more than a thousand combinations), to OMP_NESTED, which
specifies nested parallelism. Threading and thread placement specify how much parallelism to extract
in an application, and how to place threads with respect to the data on-chip. Other OMP variables
specify how much and how many levels of parallelism are enabled, and how to spin on contended
variables. These are chosen so that maximum parallelism is extracted from a given program on the
Xeon Phi. For a particular synthetic graph B, I combination, only the one M combination tuple that
provides the best performance is selected, as a model should train on close to optimal parameters.
The resulting training dataset thus has output architecture variable values that provide the best
performance on synthetic benchmark-input combinations. In the case of the GTX-750 - Xeon Phi
setup, training takes several hours if millions of combinations are run in parallel on each
accelerator. These B, I combinations, created for various benchmark-input sets from Figure 3.3.1b,
are run, and their optimal M selections and objective results are stored in an off-line database
for training. These performance results are
highly optimized using auto-tuning (OpenTuner is used in this case). This creates a profiler database
of (B, I, M) tuples residing in the CPU file system, which is indexed using (B, I) tuples to get M
solutions. The overheads of this training are incurred once per multi-accelerator setup, and are not
included in the evaluation, as training only needs to be done once per setup.
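The structure of such a profiler database can be sketched as a dictionary keyed by (B, I) tuples that keeps only the best-performing M tuple per key. This Python sketch is an assumption about the layout, not the HeteroMap implementation: the function name `build_profile_db` is hypothetical, completion time is assumed to be the objective, and the sample tuples are shortened toy values rather than full 13-, 4-, and 20-element vectors.

```python
def build_profile_db(samples):
    """Keep, for every (B, I) key, the M tuple with the lowest completion time.

    samples: iterable of (b_values, i_values, m_values, completion_time_ms).
    """
    best = {}
    for b, i, m, time_ms in samples:
        key = (tuple(b), tuple(i))
        if key not in best or time_ms < best[key][1]:
            best[key] = (tuple(m), time_ms)
    # Drop the timings: training only needs the (B, I) -> M mapping.
    return {key: m for key, (m, _time) in best.items()}


# Toy measurements; real entries would hold 13 B, 4 I, and 20 M values.
samples = [
    ((1.0, 0.0), (0.1,), ("GPU", 128), 40.0),
    ((1.0, 0.0), (0.1,), ("GPU", 256), 25.0),
    ((1.0, 0.0), (0.1,), ("multicore", 61), 90.0),
    ((0.6, 0.4), (0.1,), ("multicore", 7), 55.0),
]
db = build_profile_db(samples)
print(db[((1.0, 0.0), (0.1,))])
```

Indexing by the (B, I) tuple mirrors the lookup described above, where a discretized benchmark-input profile retrieves its stored M solution.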
5.3 Online Evaluation
After training the automated model, HeteroMap takes in real benchmarks and graphs for evaluation.
Benchmarks and inputs are first discretized into (B, I) variables, after which their M variables
are predicted. This process is depicted in Figure 4.2.1. This allows HeteroMap to infer concurrency
choices within accelerators, and also to make a binary decision to select between accelerators.
This work does not consider temporal aspects, where program parts are run on either accelerator,
although a program can be split into separate programs that are fed through the framework to be
scheduled on separate accelerators. As only on-chip architectural characteristics of accelerators
are compared to limit complexity, memory transfer variations are not taken into account, and
only the time spent in processing the graph on-chip is analyzed. However, we still do a sensitivity
study in Section 7.4 to show how HeteroMap responds to memory and architecture changes. Model
training and database derivations are all done on a host CPU, although they can be distributed on
the accelerators themselves. How to distribute decision making onto
various accelerators is left as future work. Once all architectural choices are decided, HeteroMap
deploys the benchmark-input combination on an accelerator. The overhead of HeteroMap during the
runtime evaluation phase is added to the overall completion time. Different predictors are analyzed
for automation, namely the Deep Learning model and the Regression model.

[Figure: Deep Learning feed-forward network. 17 input neurons (13 B's + 4 I's: benchmark
characteristics B1-B13 and input graph characteristics I1-I4); internal hidden layers/neurons
(4 layers, 32 neurons/layer); 20 output neurons (20 M's: inter-accelerator choice M1 and
intra-accelerator choices M2-M20).]
Figure 5.3.1: Neural Network showing network parameters.
5.4 Deep Learning Prediction Model
Neural networks are known to learn non-linear characteristics effectively, and may be efficiently
re-trained for various configurations and programmer-driven strategies. Such networks learn on
non-linear performance curves by adjusting neuron weights and biases, which creates complex
equation representations within the neural network path from inputs to outputs. Figure 5.3.1 shows
the proposed neural network with 4 layers and 32 neurons per layer. Benchmark-input characteristics
are represented by 17 input neurons, with one neuron for each benchmark and input variable.
Similarly, one output neuron is assigned to each M choice. Several works use a number of internal
hidden neurons that is at least twice the number of output neurons [31]. We thus take the internal
neuron count as 128 [32]. The network is configured as a feed-forward neural network, and the
size of the network is selected by balancing the trade-offs between learner complexity, accuracy,
and overhead. Non-linear performance curves can also be captured using a regression model, as
outlined next.
Non-Linear Regression Equation:
$$F(M_1, M_2, \ldots, M_{12}) = W_1 B_1^{4} + W_2 B_2^{7} + W_3 B_3^{5} + W_4 B_4^{4} + W_5 B_5^{6} + \cdots + W_{13} B_{13}^{3} + W_6 I_1^{7} + W_7 I_2^{6} + W_8 I_3^{5} + W_9 I_4^{2} + 16$$
[Trend panels: Local Threads/SIMD vs. Graph Density (CA, CO, FB); Global Threads/Cores vs. Graph
Size (CA, FR, FB); Load Balancing vs. Max Degree, Diameter (CA, TWTR, FB).]
Figure 5.4.1: Non-Linear Regression Equation. High-profile variables and associated trends in input
dependence on output thread selections also shown.
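The network shape described above (17 inputs, 4 hidden layers of 32 neurons, 20 outputs) can be sketched as a plain feed-forward pass. This is an illustrative sketch only: the ReLU hidden activation, the sigmoid output activation, the random weight initialization, and all function names are assumptions, since the text does not specify them.

```python
import math
import random

# Layer widths taken from Figure 5.3.1: 17 inputs, 4 x 32 hidden, 20 outputs.
LAYER_SIZES = [17, 32, 32, 32, 32, 20]


def init_network(sizes, seed=0):
    """Create (weights, biases) per layer with small random weights (assumed init)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, inputs):
    """Propagate inputs through the network: ReLU on hidden layers, sigmoid on outputs."""
    activations = list(inputs)
    for index, (weights, biases) in enumerate(layers):
        sums = [sum(w * a for w, a in zip(row, activations)) + b
                for row, b in zip(weights, biases)]
        if index == len(layers) - 1:
            activations = [1.0 / (1.0 + math.exp(-s)) for s in sums]
        else:
            activations = [max(0.0, s) for s in sums]
    return activations


def parameter_count(layers):
    """Total number of weights and biases in the network."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


network = init_network(LAYER_SIZES)
m_outputs = forward(network, [0.5] * 17)
```

With these layer widths the network holds 4404 trainable parameters, which gives a rough sense of why the overhead grows as larger hidden layers are used.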
5.5 Regression Prediction Model
A non-linear regression (similar to [33]) is presented that finds the optimal choice configurations.
Regression models are much simpler than neural models, as they need fewer equations. However,
they do require higher orders and variable coefficients, which demand more multiplications,
increasing complexity. These trade-offs affect which learning model should be used for optimal
performance. The proposed regression model is fitted via Matlab, and then ported to C++ for
performance comparisons. Analysis shows that a 7th order model fits well (providing 85% accuracy
for curve predictions) for the target choices. Lower-order models do not have sufficient
classification accuracy, and higher-order models have higher performance overheads.
Figure 5.4.1 shows the fitted curve and the trends associated with each input variable (I1-I4).
Higher order variables, such as B5 (reduction), play more important roles in the function F of
architectural choice selections. The same holds for inputs, such as vertices and edges (I1, I2).
Graph variations are further shown as trends in Figure 5.4.1, where variations are fitted to the
regression. This shows how such choices can be learned and reasoned about in HeteroMap.
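Evaluating a fitted polynomial of this form is a weighted sum of powered variables plus a constant. The sketch below uses only the exponents and the constant term that are legible in Figure 5.4.1; the terms for B6 through B12 are elided in the source and are deliberately left out rather than guessed. The weights are placeholders keyed by variable name, which is a simplification of the figure's weight indexing, and the function name is hypothetical.

```python
# Exponents legible in Figure 5.4.1; B6..B12 are elided in the source and omitted here.
EXPONENTS = {
    "B1": 4, "B2": 7, "B3": 5, "B4": 4, "B5": 6, "B13": 3,
    "I1": 7, "I2": 6, "I3": 5, "I4": 2,
}
CONSTANT = 16  # constant term shown in the figure


def evaluate_regression(values, weights, exponents=EXPONENTS, constant=CONSTANT):
    """Return constant + sum(weight * value ** exponent) over the listed variables."""
    return constant + sum(
        weights[name] * values[name] ** power for name, power in exponents.items()
    )


# Placeholder weights of 1.0 and inputs of 1.0, purely to exercise the function.
unit_values = {name: 1.0 for name in EXPONENTS}
unit_weights = {name: 1.0 for name in EXPONENTS}
result = evaluate_regression(unit_values, unit_weights)
```

Each term costs a power and a multiplication, which illustrates why a higher-order fit increases the evaluation overhead.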
Chapter 6
Methodology
6.1 Accelerator Configurations
Two accelerators are primarily evaluated to build the multi-accelerator architecture: an NVidia
GTX-750TI and an Intel Xeon Phi 7120P (parameters listed in Table 6.2.1). These accelerators are
competitive, as their compute performance (single/double precision) overlaps. Although the double
precision capability of the Xeon Phi is higher, not all benchmark combinations require it during
execution; it is thus one of the chip differences between the accelerators that cause performance
to vary. The main memory used by both accelerators is pinned to the smallest one available. Memory
size is not considered as a first-order effect in our work, because the whole architecture needs
to be reconfigured and relearned for memory size changes. Still, a sensitivity study is done to
show memory size effects, where the memories of both accelerators are swept and performance is
acquired for all combinations. Storage to stand-alone memory transfer times are not measured, as
they are assumed to be constant.
Table 6.2.1: Primary Accelerator Configuration.
                             GTX-750Ti    Xeon Phi 7120P
Cores, Threads               640, Many    61, 244
Cache Size, Coherence        2MB, No      32MB, Yes
Mem. (GB), BW. (GB/s)        2, 86        2, 352
Single-Precision (TFlops)    1.3          2.4
Double-Precision (TFlops)    0.04         1.2
To evaluate with a more powerful GPU, we choose an NVidia GTX-970 to replace the smaller
GPU for the multi-accelerator setup. The GTX-970 incorporates 1664 cores with 3.5 TFLOPs single-
precision and 0.1 TFLOPs double-precision compute capability, and has a larger 4 GB memory size.
This work also evaluates an Intel Xeon E5-2650 v3 multicore with 10 hyper-threaded cores in each
of 4 sockets (40 cores in total), executing at 2.30GHz, with 1TB of DDR4 RAM. In addition to the
primary (GTX-750TI, Xeon Phi) configuration, the following accelerator combinations are analyzed:
(GTX-970, Xeon Phi), (GTX-750TI, CPU-40-Core), and (GTX-970, CPU-40-Core).
6.2 Benchmarks
For multicore benchmarks, SSSP-Bellman-Ford (SSSP-BF), BFS, DFS, PageRank, PageRank-DP,
Triangle Counting (Tri.Cnt.), Community Detection (Comm.), and Connected Components (Conn.
Comp.) are acquired from CRONO [8], MiBench [34], and Rodinia [35]. As SSSP-BF may not
provide optimal performance on lower core counts in multicores, an SSSP implementation using
∆-Stepping (SSSP-Delta) is also acquired from the GAP benchmark suite [14] and compared.
These versions use pthread/OpenMP implementations to run on multicores (using the offload
programming model). For GPUs, benchmarks are acquired from Pannotia [5] and Rodinia [35]
for OpenCL workloads, which provide SSSP, BFS, PageRank, and PageRank-DP. The remaining
benchmarks are ported from the multicore implementations to OpenCL.
Table 6.2.2: Synthetic Input Datasets.
Training Data       #Vertices    #Edges    Avg.Deg.    Size(GB)
Unif. Rand. [19]    16-65M       16-2B     1-32K       0.01-32
Kronecker [20]      16-65M       16-2B     1-32K       0.01-32
6.3 Processing Metrics
Original benchmarks from various benchmark suites are not optimized. They are therefore manually
tuned as well, so that the proposed architecture and configuration framework can be compared to an
ideal case. For fair comparison, the same algorithm within the benchmark is run on both
accelerators. Training is therefore done to optimize all parameters off-line using OpenTuner [13].
HeteroMap’s output is compared with an ideal output obtained manually by running all possible
configurations. Percentage accuracies are found by comparing the integer outputs (constituting
choice selections) of the learners. Accuracy is measured by finding the percentage difference
between the performance acquired using the proposed predictors and the ideal case that optimizes
all M variables. Target baselines are also taken from multicore-only and GPU-only runs. Different
learners, namely regression and adaptive libraries, are also compared with HeteroMap in terms of
accuracy and overheads. Completion times are compared for all benchmarks. Synthetically generated
graphs for training the automated learners are listed in Table 6.2.2.
Graphs that are larger than the accelerator’s main memory size are broken into chunks and
processed one by one spatio-temporally using the Stinger framework [21]. To maintain fairness
between accelerators, memory transfer times are not included in the completion time. Thus, only
the time spent on the accelerator is measured, to which the overhead of HeteroMap is added.
Energy numbers are also compared to allow the framework to utilize energy as a metric. Power
measurements are acquired using the micsmc [36] and powerstat [37] utilities. Core utilization is
measured using nvprof and PAPI [38, 39], and is the time each core spends executing instructions
in the pipeline.
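The two accuracy notions mentioned above can be sketched as simple percentages. The exact formulas are not given in the text, so both functions below are assumptions made for illustration: one counts how many predicted integer choices match the ideal choices, and the other expresses how far the achieved completion time is from the ideal one. The function names are hypothetical.

```python
def choice_accuracy(predicted, ideal):
    """Percentage of predicted integer choices that equal the ideal choices."""
    if len(predicted) != len(ideal) or not ideal:
        raise ValueError("choice vectors must be non-empty and of equal length")
    matches = sum(1 for p, i in zip(predicted, ideal) if p == i)
    return 100.0 * matches / len(ideal)


def performance_gap(predicted_time, ideal_time):
    """Percentage by which the predicted configuration is slower than the ideal one."""
    return 100.0 * (predicted_time - ideal_time) / ideal_time


# Three of four choices match; the predicted run is 10% slower than the ideal run.
accuracy = choice_accuracy([1, 2, 3, 4], [1, 2, 0, 4])
gap = performance_gap(11.0, 10.0)
```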
Chapter 7
Evaluation
This chapter first evaluates HeteroMap by selecting an optimal learning model, and then compares
it to multicore-only and GPU-only baselines. Primary comparisons and analysis are done using the
Xeon Phi and GTX-750Ti GPU setup; however, comparisons are also done on a stronger 40-core
Intel Xeon E5-2650 v3 with 1 TB memory and an NVidia GTX-970 GPU. In addition to the primary
(GTX-750TI, Xeon Phi) configuration, the following accelerator combinations are analyzed: (GTX-
970, Xeon Phi), (GTX-750TI, CPU-40-Core), and (GTX-970, CPU-40-Core). Further analysis
is done in terms of model overhead and energy objectives. The various automated performance
predictors are also compared with a baseline which optimizes all choices with no learner overheads
(marked as ideal).
7.1 Selecting a Learning Model
It is important to understand whether different learners in HeteroMap are optimal for the given
choice space. We therefore take different parallel learning algorithms for comparison. Multiple
Non-Linear Regression (fitted equations from Section 5.5) and Decision Trees (IF-ELSE systems
using thresholds from Section 4) are thus compared. A simple linear regression is also trained and
compared. XAPP [33] uses regression with more than 7 variables, similar to the one evaluated
in this paper. Rinnegan [40] uses a performance model adaptive library scheme, which profiles
program performance and then uses a simple model equation to predict performance. The equation’s
output is directly proportional to only the data movement and accelerator utilization parameters
given by a programmer/profiler. However, as these works do not discuss details regarding their
learning models, they cannot be directly compared against, and hence these two schemes are
compared separately. Deep learners are compared using various model sizes (explained in
Section 5.4). Automated decision trees consist of 64 variables, with each level containing a
binary threshold generated via training data. All are trained with the same amount of training
data/time used for the proposed learners. All learners are parallelized on the CPU as well.
Geomean completion times are taken for all benchmark-input combinations, and the speedup of each
performance predictor is shown over the GPU. Evaluations and accuracy are computed when optimizing
for performance objectives only, and are computed by comparing to the ideal case.
Table 7.1.1: Learning Model Strategies. Speedup shown over the GTX-750 GPU.
Learner                  SpeedUp (%)    Accuracy (%)    Overhead (ms)
Decision Tree            28             86.2            0.10
Linear Regression        6              50.1            0.05
Multi Regression         27             85.4            4.11
Adaptive Library [40]    8              56.5            0.17
Deep.16 [28, 41]         11             59.3            1.52
Deep.32                  22             68.4            2.52
Deep.64                  26             82.2            3.01
Deep.128                 31             90.5            3.48
Deep.256                 30             92.9            6.39
Table 7.1.1 shows that the adaptive library and linear regression paradigms do not perform well
for our setup. This happens because of non-linear variations associated with graph benchmark-input
combinations and multi-accelerator architecture choices. Multiple regression does perform well,
but results in a higher overhead, as complex equations are required to maintain accuracy. The
decision tree model from Section 4 provides low overhead, but does not match the speedup of the
best deep learning model. The deep learning model exhibits a larger overhead, with higher
classification accuracy. Larger deep learners follow quadratic trends in overheads and
classification accuracy. This raises the acquired speedup to a certain extent, after which returns
diminish due to the increasing overhead. In the end, the deep learning model performs best, and
hence it is selected for analysis. Overall, a speedup of 31% is acquired using the deep learning
model as shown in Table 7.1.1, with a classification accuracy of 90.5% and overhead
of 3.48ms. Hence, all further evaluations are done using the deep learning model with 128 neurons.
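The comparison metric used here, a geometric mean of completion times with the predictor overhead included, can be sketched as below. How the reported percentage is derived from the geomeans is not spelled out in the text, so the "percentage reduction in geomean completion time" used in this sketch is an assumption, as are the function names and the sample times.

```python
import math


def geomean(values):
    """Geometric mean of positive values, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def improvement_over_baseline(predictor_times, baseline_times, overhead_ms=0.0):
    """Percentage reduction in geomean completion time relative to a baseline.

    The predictor overhead is added to every predictor-driven run before averaging.
    """
    with_overhead = [t + overhead_ms for t in predictor_times]
    return 100.0 * (1.0 - geomean(with_overhead) / geomean(baseline_times))


# Hypothetical completion times in milliseconds for two benchmark-input combinations.
no_overhead = improvement_over_baseline([1.0, 4.0], [2.0, 8.0])
with_cost = improvement_over_baseline([1.0, 4.0], [2.0, 8.0], overhead_ms=0.5)
```

A larger learner may choose better configurations, but the added overhead term lowers the net improvement, which matches the diminishing returns described above.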
7.2 Performance Variations
Choices in the proposed multi-accelerator architecture occur both within and across accelerators,
and in various cases either accelerator outperforms the other. Figure 7.2.1 shows these variations
for all benchmark-input combinations with the deep learning model. The results include the
framework’s performance overhead in selecting a combination.
Figure 7.2.1: Scheduler Comparisons for Graph Workload-Input Combinations (All results normalized
to the GTX750Ti GPU implementation) (Higher is worse). [Bar chart: the y-axis is completion time
normalized to the GPU; the series are GTX 750Ti, Xeon Phi, and HeteroMap; the benchmarks are
SSSP-BF, SSSP-Delta, BFS, DFS, PR-DP, PR, Tri. Cnt., Comm, and Conn. Comp., each over the inputs
CA, FB, CO, CAGE, Rgg, Frnd, Kron, and Twtr, followed by GeoMean and Ideal. Annotations: HeteroMap
31% better than GPU, 75% better than Xeon Phi; Ideal 39% better than GPU, 79% better than Xeon Phi.]
GPU-Biased Combinations: Benchmark-input combinations with highly concurrent algorithms, such as
SSSP-BF, BFS, and DFS, mostly fare well with the GPU. Their work division and parallelization
strategy benefits from an excess of threads, which are available on the GPU. Due to the nature of
their critical sections and data structures, the Xeon Phi cannot exploit its SIMD capabilities,
and hence it performs poorly compared to a GPU. In the case of DFS-CO, the multicore outperforms
the GPU, as it uses additional inner loop parallelization. The lower overheads of SSSP-CO and
BFS-CO arise for the same reason. Such workloads are therefore easier for the learner to
configure, as their performance curves remain biased towards the GPU.
Multicore-Biased Combinations: Benchmarks that require floating-point (FP) capabilities, such as
PR, PR-DP, and COMM, perform well on the Xeon Phi. When benchmarks require push-pop accesses on
structures, alongside reductions (SSSP-Delta), these benchmark-input combinations also perform
well on the Xeon Phi. Some notable exceptions in these cases are the Frnd. and Kron. graphs, which
perform better on the GPU because they are large and require more threads. CC fares well on the
Xeon Phi because it has a significant amount of indirect accesses, which fare well in the cache
hierarchy of the multicore. TRI-CNT consists of a reduction step where all the threads reduce to a
final total triangle count. The Xeon Phi can better exploit this, as its on-chip data movement
capabilities are far better than those of the GPU. Due to larger variations, the deep learning
scheduler does not always calculate optimal M choices, hence the scheduler exhibits some overhead
over the GPU for some cases. Overall, the framework is 31% better than a GPU-only and 75% better
than a Xeon-Phi-only setup.
Figure 7.2.2: Energy benefits averaged for various inputs for a given benchmark. (Xeon Phi vs.
GTX 750Ti). All results normalized to the maximal energy used for any B-I combination. [Bar chart:
the y-axis is normalized energy; the series are GTX 750Ti, Xeon Phi, and HeteroMap; labeled values
are 0.15, 0.16, 0.06, and 0.03.]
7.3 Understanding Energy & Utilization Variations
HeteroMap is also trained for the energy objective. Figure 7.2.2 shows normalized energy (normal-
ized to the maximum energy for B, I combinations) for various benchmarks. Geomeans of energy
are taken across the different inputs for each benchmark. The Xeon Phi has a larger power rating
compared to the two GPUs, and hence it dissipates more energy. Certain inputs take more time to
complete on the GPU, which adds to its energy woes. HeteroMap reduces energy usage in this case
from (0.15, 0.16) to just 0.06, by a factor of 2.4×. This is fairly close to the ideal case (0.03). This
favors the deployment of HeteroMap in energy constrained environments.
7.4 Changing Fixed Accelerator & Memory Sizes
Accelerator Changes:
A weaker GPU was compared first to show whether the GPU architecture inherently benefits
benchmark-input combinations or not (which was shown to be the case). Machine learning models
are re-learned for the architectural change to the stronger GTX-970 GPU. As shown in
Figure 7.4.1, benchmark trends compared to the smaller GPU remain mostly the same, with concurrent
workloads such as SSSP-BF still performing well on the GPU. For other workloads that were only
slightly better on the Xeon Phi before, such as TRI-LJ, the stronger GPU now performs better.
This happens more specifically on inputs such as CA, because such inputs have less inner loop
parallelism, and fare better with more threads and the right amount of caching in the new GPU.
Overall, HeteroMap outperforms a GPU-only case by 14% and a Xeon-Phi-only case by 3.8×, as the
magnitude by which the GPU outperforms the Xeon Phi in some cases is higher compared to the
GTX-750. However, the Xeon Phi still beats the GTX-970 for other combinations, and the 14% gain is
notable given that the GTX-970 has twice the single-precision compute power.
Figure 7.4.1: Scheduler Comparisons for various Graph Workloads-Input Combinations (All results
normalized to the GTX970 GPU implementation) (Higher is worse). Note that Optimal Choices
change when compared to the GTX750Ti in Figure 7.2.1. [Bar chart: the y-axis is completion time
normalized to the GPU; the series are GTX970, Xeon Phi, and HeteroMap; the benchmarks are SSSP-BF,
SSSP-Delta, BFS, DFS, PR-DP, PR, Tri. Cnt., Comm, and Conn. Comp., each over the inputs CA, FB, CO,
CAGE, Rgg, Frnd, Kron, and Twtr, followed by GeoMean and Ideal. Annotation: HeteroMap 14% better
than GPU, 3.8x better than Xeon Phi.]
A 40-core multicore CPU is also compared with the GTX-750Ti and the GTX-970 GPUs.
Figure 7.4.2 shows completion times, normalized to the GPU and averaged over all inputs for each
benchmark. The GPUs in both cases are seen to outperform the CPU for highly parallel benchmarks
such as SSSP-BF and BFS. For other benchmarks, the CPU outperforms the weaker GTX-750 GPU.
In the case of the GTX-970, the GPU performs better than the CPU for DFS and Conn. Comp.
This is because the stronger GPU has larger caches and more cores than the smaller GPU, allowing
the indirect accesses of these two benchmarks to perform better on the GTX-970. The 40-core
multicore outperforms the GTX750 by 3% for a 2 GB memory size for each accelerator. For the case
with the GTX-970, the GPU outperforms the 40-core multicore by 10% for a 4 GB memory size
for each accelerator. Using HeteroMap, performance gains of 22% and 5% are acquired over the
GTX-750 and the GTX-970 respectively. HeteroMap achieves these gains as it selects the optimal
accelerator for each benchmark-input combination. Averaging across inputs, HeteroMap picks the
better result of the two accelerators to produce better results than either of the two
accelerators for each benchmark.
Figure 7.4.2: Geomean results averaged for different inputs for each benchmark for the 40-core
CPU. All results are normalized to the GPU implementation. [Two bar-chart panels: the y-axis is
completion time normalized to the GPU; the series are GTX 750Ti, CPU-40-Core, and HeteroMap in the
first panel, and GTX 970, CPU-40-Core, and HeteroMap in the second.]
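The last point, that choosing the better accelerator per benchmark-input combination can beat both single-accelerator averages, can be checked with a small sketch. The completion times below are made-up numbers chosen only to illustrate the effect, and the function names are hypothetical.

```python
import math


def geomean(values):
    """Geometric mean of positive values, computed in log space."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def best_per_combination(gpu_times, multicore_times):
    """Pick the lower completion time for every benchmark-input combination."""
    return [min(g, m) for g, m in zip(gpu_times, multicore_times)]


# Hypothetical times for two inputs of one benchmark: each accelerator wins once.
gpu_times = [1.0, 4.0]
multicore_times = [2.0, 2.0]
selected = best_per_combination(gpu_times, multicore_times)
selected_geomean = geomean(selected)
```

Because each accelerator wins on one input, the per-combination selection has a lower geomean than either accelerator used alone.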
Memory variations are considered a first-order effect by many members of the research community,
which makes it imperative to analyze memory size variations.
GPU-Xeon Phi memory size sensitivity: Main memory is an important parameter that one can
re-architect to change a system. However, in our system we only sweep memory sizes that the
accelerators support, i.e., up to 2-4 GB for GPUs, and up to 16GB for the Xeon Phi. Figure 7.4.3
shows various memory sizes for the target accelerators. Error bars show variation in performance
of either accelerator. The geometric mean of all the benchmark-input combinations is taken for a
particular memory size (GPU, Xeon Phi). The y-axis shows the completion time normalized to the max
(the upper error bar for (1,1)), with the geomean of the average of all combinations normalized to
the GPU. The overall trend is that the Xeon Phi performs better than both GPUs when it is exposed
to its full main memory (30% better than the GTX-750TI and 15% better than the GTX-970). This is
because it can exploit the full memory bandwidth and size, forgoing the need for memory transfers
compared to GPUs. Even though GPU memory saturates at 2 or 4 GB, keeping the GPU performance
constant after maximizing memory, the Xeon Phi’s performance is still off by 10-20%. This shows
the effectiveness of the GPU’s concurrency model. HeteroMap is able to exploit this memory
variation as an addition to the vector M, and is able to learn with performance benefits higher
than those acquired with limited main memory for the Xeon Phi. Even when the Xeon Phi is provided
with increased memory, HeteroMap still obtains benefits by selecting the right choices, which are
then tilted in the Phi’s favor.
Figure 7.4.3: Memory size variations for different machine combinations. The x-axis varies memory
sizes for a multi-accelerator system, while the y-axis shows normalized completion time. [Panels:
Memory Size Variations (GPU, Xeon Phi), with memory size pairs from (1,1) up to (2,16) for the
GTX 750Ti and up to (4,16) for the GTX 970; Memory Size Variations (GPU, CPU-40-Core), with memory
size pairs from (1,1) up to (2,1000) for the GTX 750Ti and up to (4,1000) for the GTX 970. The
series in each panel are the GPU, the Xeon Phi or CPU-40-Core, and HeteroMap.]
GPU-40-Core CPU memory size sensitivity: This work also compares a 40-core multicore
CPU in conjunction with GPU accelerators. Figure 7.4.3 shows performance with various memory
sizes for this setting. Error bars show variation in performance of either accelerator,
while geometric means of all benchmark-input combinations are taken for memory sizes (GPU,
CPU-40-Core). The y-axis shows the completion time normalized to the max (the upper error bar
for (1,1)), with the geomean of the average of all combinations normalized to the GPU. The 40-core
CPU performs better than the two GPUs on average. The CPU also improves when it is exposed
to its full memory capacity, which allows larger graphs such as Twitter and Friendster to fit in
its main memory. The CPU improves over the GTX-750Ti by 18%, and over the GTX-970 by 5%,
for the maximum memory sizes. Although HeteroMap improves only slightly over the GPU in the
geomean case, there are many individual cases where it improves over both machines by up to 3×.
The primary reason why the 40-core CPU is better than the GTX-750 and the GTX-970 is that the CPU
runs at a higher frequency (2.3 GHz vs. GTX750’s 1.3 GHz and GTX-970’s 1.7 GHz). Other
reasons that improve the CPU’s performance include its better caching capabilities and stronger
core pipelines.
7.5 The Impact of Re-Learning
Training is an integral part of the proposed learning framework, as it dictates output performance.
It is done off-line, but it takes time, and new parameters need to be fed into it if the
accelerator changes its internal architecture. Changing the GPU accelerator to the GTX-970 results
in a change in the choice space, mainly due to increased threading combinations. This requires
re-learning to be done (results already shown in Section 7.4). We also take the case where no
re-learning is done, so that the learner applies the learning done for a GTX-750TI to a GTX-970.
This is done to show the effectiveness of the proposed framework. Table 7.5.1 shows the acquired
speedups and accuracy results for this setting. Results are compared with the GPU-only case, as it
has better overall performance. With re-learning, the acquired speedup is 14% (also in
Figure 7.4.1), with an accuracy of 90.4% and an overhead of 2.48ms. The accuracy is slightly
lower because of higher choice complexity, while the overhead is similar because we are utilizing
the same neural network. Without re-learning, the acquired performance benefit drops to 7%
(compared to 14% with re-learning), with lower accuracy. This is because the learner is unable to
schedule optimal intra-accelerator choices properly for the new GPU, as the choice performance
curves change with the change in the underlying architecture. The acquired benefit is still
positive because the architectures are somewhat similar (i.e. the trends remain the same for GPUs).
It is possible to make the learner more complex and optimize it to learn once for different
multi-accelerator setups; however, this is a separate endeavor, and we leave it as future work.
Table 7.5.1: Re-Learning Performance. Compared with the baseline case of the GTX-970 as it had
better performance.
Learning Setting       Speedup %    Accuracy %
With Re-Learning       14           90.4
Without Re-Learning    7            71.8
7.5.1 Training Evaluation
The target results in Table 7.1.1 are shown with training on both uniform random and Kronecker
synthetic graphs. Variations in training are shown in Table 7.5.2, which shows that the scheduler
is still somewhat robust to the choice of training graphs, the training time, and the synthetic
training set size. The differences mainly stem from those combinations that have competitive
performance, such as PR-Kron. and COMM-FB. HeteroMap in such cases is expected to pick
sub-optimal intra-accelerator choices. Overall, training on a smaller subset of the synthetic data
also results in a robust predictor, while training on just Kronecker graphs also gives a similar
result.
Table 7.5.2: Sensitivity to Training on Deep.128. Speedup shown over the GPU. Using all
synthetic graphs provides the speedup shown in Table 7.1.1.
Training Setting                 Speedup %    Accuracy %
Graphs: Only Uniform Random      21           68.1
Graphs: Only Kronecker           25           80.3
Training Size: 1/4 of Full Set   15           64.9
Training Size: 1/2 of Full Set   25           80.4
Training Size: 3/4 of Full Set   29           88.7
Training Size: Full Set          31           90.5
Chapter 8
Related Work
Future HPC setups and datacenters are expected to have various tightly coupled heterogeneous
architectures [3]. Cray already envisions such an operational high performance computer for
its next generation supercomputer. Examples include Nvidia’s NVLink and IBM’s CAPI [42].
Such architectures use one accelerator at a time to limit energy dissipation, and share the main
memory, rather than the last level cache, as the operating system and programmers can control
access. These examples lay the basis for this work. Prior works in performance prediction mainly
involve operating system runtimes [40] [43] [44] to improve utilization in single machine setups.
Such works do not analyze graphs and input dependence due to space complexity. There is a
plethora of work that optimizes unary single-accelerator CPU-GPU systems [45] [6]. HeteroMap
differs from these works in showing how architectural aspects across accelerators can be exploited
in real-time to overcome unary setup limitations. Schemes proposed in this paper can be deployed on
top of runtimes (OpenMP utilized in this paper).
That utilizing GPUs does not always lead to a performance benefit is already a known
problem [46]. Such works mainly optimize the space between heterogeneous CPUs and GPUs
using CUDA/OpenCL kernel analysis [46]. Some works generate predictive models [33] to optimize
for inputs [47], and optimize intra-machine choices [28]. Ardalani et al. [33] analyze CPU
code to predict GPU performance, and analyze different GPUs to generate a predictive model.
However, none of these works generate analytical models or optimize for different competitive
accelerators, such as GPUs and Xeon Phis, for graph analytics. Such multicores have many more
concurrency choices compared to CPUs due to more thread count, placement, dynamic scheduling,
and synchronization combinations [48]. Adding such diversity in connected accelerators adds
complexity due to input sensitivity, which needs to be tackled at some level. Prior predictive
models also suffer from high error rates (e.g. 26.9% in [33]), making QoS [49] an issue. Therefore
a proper learning analysis is necessary to enable real-world deployment.
Other works in auto-tuning such as PetaBricks [50, 11] and OpenTuner [13] exploit algorithmic
choices, and have not explored architectural variations. Moreover, as algorithmic spaces
constitute higher complexities, learning takes unreasonable amounts of time [51]. Regression based
autotuners [33] have lower complexities, but these are still high enough to prevent near real-time
deployment. Thus developing runtimes for optimizing such spaces in graph processing remains an
intractable problem for now, and the optimal way is to learn intelligently on a limited number of
choices and configure accordingly. Such works also lack the characterization of graph workloads
targeted in this paper, which are more unpredictable due to input dependencies. However, OpenTuner
is used for off-line training in this work, as it is used to exhaustively search the complex
(B, I, M) choice space.
Chapter 9
Conclusion
This paper presents a prediction framework, HeteroMap, for a multi-accelerator architecture that
optimizes architectural choices for real-time processing of graph analytics. Analysis shows that
multi-accelerator systems constitute many more architectural choices. When inter- and intra-
accelerator and graph benchmark-input choices are coupled together, the near-optimal choice
selection problem is very complex. This work not only quantifies graph benchmark and input
choices, but also relates them to machine choices in a multi-accelerator system using an analytical
model and automated machine learning predictors. Automation of the framework is done using
off-line training and on-line evaluation to select an optimal accelerator and its architectural choices.
Evaluations show performance gains of 5% to 3.8× when comparing single accelerators, and the
proposed learner is within 10% of an ideal case, which is a boost in predictive concurrency analysis
compared to prior works.
Chapter 10
Future Work
While the prediction paradigm proposed in this work performs well for graph analytics and
for multiple connected accelerators, other domains and architectures may be evaluated as well.
A performance predictor for machine learning may be considered and created in future works,
predicting the machine parameters proposed in this work for diverse machine learning workloads
[52], ranging from neural networks to those using Bayesian inference. Additional ubiquitous domains
such as databases, the larger artificial intelligence (AI) superset, and real applications such as
browsers, electronic games, and operating system (OS) applications, may also utilize the proposed
HeteroMap system.
This work also considers workloads that are highly parallel and do not require ordering
constraints on graph tasks. Ordered workloads such as parallel versions of Dijkstra’s algorithm or the
A* workload may also be considered in this regard [53]. These ordered workloads behave differently
when it comes to predicting machine parameters, as ordering constraints require synchronization
that affects scalability. Thus such workloads are ripe for a HeteroMap-like system to optimize.
Ordering may also be relaxed in such ordered workloads to increase scalability [29] and improve
work-efficiency [54], which further supports the case that GPUs and multicores may be traded off for
such algorithms and input graphs. The proposed predictor may also be built in hardware, where,
for example, a deep learner predicts machine parameters with almost zero performance overhead.
This presents different challenges, as OS interactions and streaming aspects along with real-time
constraints dictate architectural requirements for the predictor.
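The trade-off between ordered and relaxed execution mentioned above can be made concrete with a small sequential sketch. It is an illustration under stated assumptions, not the parallel implementations cited in [29], [53], or [54]: the ordered variant is a textbook Dijkstra with a priority queue, the relaxed variant is a FIFO worklist that processes pending vertices in any order, and the four-vertex graph is a hypothetical toy input.

```python
import heapq
from collections import deque


def sssp_ordered(graph, src):
    """Dijkstra: tasks are processed strictly in ascending distance order."""
    dist = {v: float("inf") for v in graph}
    dist[src] = 0
    heap, processed = [(0, src)], 0
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry, skipped without doing work
        processed += 1
        for v, w in graph[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return dist, processed


def sssp_relaxed(graph, src):
    """Unordered worklist: no priority order, so some vertices are reprocessed."""
    dist = {v: float("inf") for v in graph}
    dist[src] = 0
    work, processed = deque([src]), 0
    while work:
        u = work.popleft()
        processed += 1
        for v, w in graph[u]:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                work.append(v)
    return dist, processed


# Hypothetical toy graph: adjacency lists of (neighbor, edge weight).
graph = {
    "a": [("b", 10), ("c", 1)],
    "b": [("d", 1)],
    "c": [("b", 1)],
    "d": [],
}

ordered_dist, ordered_tasks = sssp_ordered(graph, "a")
relaxed_dist, relaxed_tasks = sssp_relaxed(graph, "a")
print(ordered_dist == relaxed_dist, ordered_tasks, relaxed_tasks)
```

Both variants converge to the same distances, but on this toy graph the ordered version processes 4 tasks while the relaxed version processes 6. The relaxed version needs no global ordering, which removes a synchronization point in a parallel setting, at the price of redundant work that depends on the input graph.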
Chapter 11
Associated Publications
1. Masab Ahmad, Akif Rehman, Mohsin Shan, Omer Khan, Exploiting Multi-Level Task
Dependencies to Prune Redundant Work in Relax-Ordered Task-Parallel Algorithms, Interna-
tional Symposium on Parallel Architectures and Compilation Techniques (PACT), 2019.
2. Halit Dogan, Masab Ahmad, José A. Joao, Omer Khan, In-Hardware Moving Compute to
Data Model to Accelerate Synchronization on Tilera TILE-Gx72, IEEE Micro Magazine,
2019. (under revision)
3. Masab Ahmad, Omer Khan, Efficient speculative task-parallel execution using Quarq’s
moving computation model, SRC TECHCON, 2019.
4. Masab Ahmad, Halit Dogan, Chris J. Michael, Omer Khan, HeteroMap: Exploiting Hetero-
geneous Parallel Accelerators to Improve Performance in Graph Analytics, IEEE International
Symposium on Performance Analysis of Systems and Software (ISPASS), 2019.
5. Syed Kamran Haider, Chenglu Jin, Masab Ahmad, Devu Mankishila, Marten van Dijk, Omer
Khan, Advancing the State-of-the-Art in Hardware Trojans Detection, IEEE Transactions
on Dependable and Secure Computing (TDSC), 2017. (Featured in IEEE Computer Society
TDSC Jan/Feb Issue: https://www.computer.org/web/tdsc)
6. Halit Dogan, Masab Ahmad, Brian Kahne, Omer Khan, Accelerating Synchronization
using Moving Compute to Data Model at 1000-core Multicore Scale, ACM Transactions on
Architecture and Code Optimization, (TACO), 2019. Presentation at HiPEAC 2020.
7. Halit Dogan, Masab Ahmad, Jose Joao, Omer Khan, Accelerating Synchronization in
Graph Analytics using Moving Compute to Data Model on Tilera TILEGx72, International
Conference on Computer Design, (ICCD), 2018.
8. Masab Ahmad, Halit Dogan, Omer Khan, A Temporally Reconfigurable Multi-Accelerator
Parallel Architecture for Reuse and Throughput Oriented Computing, SRC TECHCON, 2018.
Architectures for Machine Learning.
9. Hamza Omar, Qingchuan Shi, Masab Ahmad, Halit Dogan, Omer Khan, Declarative Re-
silience: A Holistic Soft-Error Resilient Multicore Architecture, ACM Transactions on
Embedded Computing Systems, (TECS), 2018.
10. Masab Ahmad, Halit Dogan, Fabio Checconi, Xinyu Que, Danielle Buono, Omer Khan,
Software-Hardware Managed Last-level Cache Allocation Scheme for Large-Scale NVRAM-
based Multicores Executing Parallel Data Analytics Applications, IEEE International Parallel
& Distributed Processing Symposium, (IPDPS), 2018.
11. Hamza Omar, Masab Ahmad, Omer Khan, GraphTuner: Input-Dependence Aware Loop Per-
foration for Efficient Execution of Approximate Graph Algorithms, International Conference
on Computer Design, (ICCD), 2017.
12. Masab Ahmad, Omer Khan, Exploiting Heterogeneous Parallel Accelerators to Improve
Performance in Graph Analytics, SRC TECHCON, 2017. Best of Session: Heterogeneous &
Reliable System Design.
13. Halit Dogan, Farrukh Hijaz, Masab Ahmad, Brian Kahne, Peter Wilson, Omer Khan, Ac-
celerating Graph and Machine Learning Workloads Using a Shared Memory Multicore
Architecture with Auxiliary Support for in-Hardware Explicit Messaging, IEEE International
Parallel & Distributed Processing Symposium, (IPDPS), 2017.
14. Masab Ahmad, Chris J. Michael, Omer Khan, Efficient Situational Scheduling of Graph
Workloads on Single-chip Large-scale Multicores and GPUs, IEEE Micro Special Issue on
Cognitive Architectures, 2017.
15. Masab Ahmad, Omer Khan, GPU Concurrency Choices in Graph Analytics, IEEE Interna-
tional Symp. on Workload Characterization, (IISWC), September 2016.
16. Masab Ahmad, Farrukh Hijaz, Qingchuan Shi, Omer Khan, CRONO: A Benchmark Suite
for Multithreaded Graph Algorithms Executing on Futuristic Multicores, IEEE International
Symposium on Workload Characterization, (IISWC), October 2015. Best Paper Nominee.
17. Syed K. Haider, Masab Ahmad, Farrukh Hijaz, Astha Patni, Ethan Johnson, Matthew Seita,
Omer Khan, Marten van Dijk, M-MAP: Multi-Factor Memory Authentication for Secure
Embedded Processors, IEEE International Conference on Computer Design, (ICCD), October
2015.
18. Masab Ahmad, Kartik Lakshminarasimhan, Omer Khan, Efficient Parallelization of Path
Planning Workload on Single-chip Shared-memory Multicores, IEEE High Performance
Extreme Computing Conference, (HPEC), September 2015.
19. Masab Ahmad, Syed K. Haider, Farrukh Hijaz, Marten van Dijk, Omer Khan, Exploring the
Performance Implications of Memory Safety Primitives in Many-core Processors Executing
Multi-threaded Workloads, ACM Workshop on Hardware and Architectural Support for
Security and Privacy, (HASP), June 2015.
20. Masab Ahmad, Awais M. Kamboh, Ahmad Khan, Non-invasive blood glucose monitoring
using near-infrared spectroscopy, Medical Design Center, EDN, August 2014. Hundreds of
shares based on entrepreneurial impact on Twitter. Translated to Dutch and Japanese.
21. Masab Ahmad, Awais M. Kamboh, Rehan Hafiz, Power & throughput optimized lifting
architecture for Wavelet Packet Transform, IEEE International Symposium on Circuits and
Systems, (ISCAS), June 2014.
Bibliography
[1] Masab Ahmad, Kartik Lakshminarasimhan, and Omer Khan. Efficient parallelization of path
planning workload on single-chip shared-memory multicores. In Proceedings of the IEEE
High Performance Extreme Computing Conference, HPEC ’15. IEEE, 2015.
[2] Nadathur Satish, Changkyu Kim, Jatin Chhugani, Hideki Saito, Rakesh Krishnaiyer, Mikhail
Smelyanskiy, Milind Girkar, and Pradeep Dubey. Can traditional programming bridge the
ninja performance gap for parallel computing applications? In Proceedings of the 39th Annual
International Symposium on Computer Architecture, ISCA ’12, pages 440–451, Washington,
DC, USA, 2012. IEEE Computer Society.
[3] Cray-Inc. http://www.cray.com/blog/characterization-of-an-application-for-hybrid-multimany-
core-systems/. 2015.
[4] W. Li, G. Jin, X. Cui, and S. See. An evaluation of unified memory technology on nvidia
gpus. In Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International
Symposium on, pages 1092–1098, May 2015.
[5] Shuai Che, B.M. Beckmann, S.K. Reinhardt, and K. Skadron. Pannotia: Understanding
irregular gpgpu graph applications. In IEEE Int. Symposium on Workload Characterization
(IISWC), pages 185–195, Sept 2013.
[6] Shuangde Fang, Zidong Du, Yuntan Fang, Yuanjie Huang, Yang Chen, Lieven Eeckhout,
Olivier Temam, Huawei Li, Yunji Chen, and Chengyong Wu. Performance portability across
heterogeneous socs using a generalized library-based approach. ACM Trans. Archit. Code
Optim., 11(2):21:1–21:25, June 2014.
[7] M. Ahmad and O. Khan. Gpu concurrency choices in graph analytics. In 2016 IEEE
International Symposium on Workload Characterization (IISWC), pages 1–10, Sept 2016.
[8] M. Ahmad, F. Hijaz, Q. Shi, and O. Khan. Crono: A benchmark suite for multithreaded graph
algorithms executing on futuristic multicores. In Proc. of IEEE Int. Symposium on Workload
Characterization, IISWC, 2015.
[9] Halit Dogan, Masab Ahmad, Brian Kahne, and Omer Khan. Accelerating synchronization
using moving compute to data model at 1,000-core multicore scale. ACM Trans. Archit. Code
Optim., 16(1):4:1–4:27, February 2019.
[10] M. Ahmad, H. Dogan, F. Checconi, X. Que, D. Buono, and O. Khan. Software-hardware
managed last-level cache allocation scheme for large-scale nvram-based multicores executing
parallel data analytics applications. In 2018 IEEE International Parallel and Distributed
Processing Symposium (IPDPS), pages 316–325, May 2018.
[11] Yufei Ding, Jason Ansel, Kalyan Veeramachaneni, Xipeng Shen, Una-May O’Reilly, and
Saman Amarasinghe. Autotuning algorithmic choice for input sensitivity. In Proceedings of
the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation,
PLDI 2015, pages 379–390, New York, NY, USA, 2015. ACM.
[12] H. Omar, M. Ahmad, and O. Khan. Graphtuner: An input dependence aware loop perforation
scheme for efficient execution of approximated graph algorithms. In 2017 IEEE International
Conference on Computer Design (ICCD), pages 201–208, Nov 2017.
[13] Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bos-
boom, Una-May O’Reilly, and Saman Amarasinghe. Opentuner: An extensible framework
for program autotuning. In Proceedings of the 23rd International Conference on Parallel
Architectures and Compilation, PACT ’14, pages 303–316, New York, NY, USA, 2014. ACM.
[14] Scott Beamer, Krste Asanovic, and David Patterson. Locality exists in graph processing:
Workload characterization on an ivy bridge server. In International Symposium on Workload
Characterization, IISWC, 2015.
[15] Nick Edmonds, Alex Breuer, Douglas Gregor, and Andrew Lumsdaine. Single-source shortest
paths with the parallel boost graph library. The Ninth DIMACS Implementation Challenge:
The Shortest Path Problem, Piscataway, NJ, pages 219–248, 2006.
[16] Timothy A. Davis and Yifan Hu. The university of florida sparse matrix collection. ACM
Trans. Math. Softw., 38(1), December 2011.
[17] Keshav Pingali, Donald Nguyen, Milind Kulkarni, Martin Burtscher, M. Amber Hassaan,
Rashid Kaleem, Tsung-Hsien Lee, Andrew Lenharth, Roman Manevich, Mario Méndez-Lojo,
Dimitrios Prountzos, and Xin Sui. The tao of parallelism in algorithms. In Proceedings of the
32nd ACM SIGPLAN Conference on Programming Language Design and Implementation,
PLDI ’11, pages 12–25, New York, NY, USA, 2011. ACM.
[18] Yunming Zhang, Mengjiao Yang, Riyadh Baghdadi, Shoaib Kamil, Julian Shun, and Saman P.
Amarasinghe. Graphit - A high-performance DSL for graph analytics. CoRR, abs/1805.00923,
2018.
[19] David A. Bader and Kamesh Madduri. Gtgraph: A synthetic graph generator suite, 2006.
[20] Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, Christos Faloutsos, and Zoubin Ghahra-
mani. Kronecker graphs: An approach to modeling networks. J. Mach. Learn. Res., 11:985–
1042, March 2010.
[21] D. Ediger, R. McColl, J. Riedy, and D. A. Bader. Stinger: High performance data structure for
streaming graphs. In 2012 IEEE Conference on High Performance Extreme Computing, pages
1–5, Sept 2012.
[22] Aaron B. Adcock, Blair D. Sullivan, Oscar R. Hernandez, and Michael W. Mahoney. Evaluat-
ing openmp tasking at scale for the computation of graph hyperbolicity. In Alistair P. Rendell,
Barbara M. Chapman, and Matthias S. Müller, editors, OpenMP in the Era of Low Power
Devices and Accelerators, pages 71–83, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.
[23] Jure Leskovec and Rok Sosič. Snap: A general-purpose network analysis and graph-mining
library. ACM Transactions on Intelligent Systems and Technology (TIST), 8(1):1, 2016.
[24] Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. What is Twitter, a social
network or a news media? In WWW ’10: Proceedings of the 19th international conference on
World wide web, pages 591–600, New York, NY, USA, 2010. ACM.
[25] J. W. Lichtman, H. Pfister, and N. Shavit. The big data challenges of connectomics. In Nature
Neuroscience 17, Sept 2014.
[26] T. Suzumura and K. Ueno. Scalegraph: A high-performance library for billion-scale graph
analytics. In 2015 IEEE International Conference on Big Data (Big Data), pages 76–84, Oct
2015.
[27] Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Jiwon Seo, Jongsoo Park,
M. Amber Hassaan, Shubho Sengupta, Zhaoming Yin, and Pradeep Dubey. Navigating the
maze of graph analytics frameworks using massive graph datasets. In Proc. of the 2014 ACM
SIG. Int. Conf. on Management of Data (SIGMOD), NY, USA, 2014. ACM.
[28] M. Ahmad, C. J. Michael, and O. Khan. Efficient situational scheduling of graph workloads
on single-chip multicores and gpus. IEEE Micro, 37(1):30–40, Jan 2017.
[29] Donald Nguyen, Andrew Lenharth, and Keshav Pingali. Deterministic galois: On-demand,
portable and parameterless. In Proceedings of the 19th International Conference on Archi-
tectural Support for Programming Languages and Operating Systems, ASPLOS ’14, pages
499–512, New York, NY, USA, 2014. ACM.
[30] Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and Michael W. Mahoney. Community
structure in large networks: Natural cluster sizes and the absence of large well-defined clusters,
2008.
[31] Guang-Bin Huang, Dian Hui Wang, and Yuan Lan. Extreme learning machines: a survey.
International Journal of Machine Learning and Cybernetics, 2(2):107–122, Jun 2011.
[32] S. Tamura and M. Tateishi. Capabilities of a four-layered feedforward neural network: four
layers versus three. IEEE Transactions on Neural Networks, 8(2):251–255, Mar 1997.
[33] Newsha Ardalani, Clint Lestourgeon, Karthikeyan Sankaralingam, and Xiaojin Zhu. Cross-
architecture performance prediction (xapp) using cpu code to predict gpu performance. In
Proceedings of the 48th International Symposium on Microarchitecture, MICRO-48, pages
725–737, New York, NY, USA, 2015. ACM.
[34] S.M.Z. Iqbal, Yuchen Liang, and H. Grahn. Parmibench - an open-source benchmark for
embedded multiprocessor systems. Computer Architecture Letters, 9(2):45–48, Feb 2010.
[35] Shuai Che, M. Boyer, Jiayuan Meng, D. Tarjan, J.W. Sheaffer, Sang-Ha Lee, and K. Skadron.
Rodinia: A benchmark suite for heterogeneous computing. In Workload Characterization,
2009. IISWC 2009. IEEE International Symposium on, pages 44–54, Oct 2009.
[36] Measuring power on Intel Xeon Phi™ product family devices,
https://software.intel.com/en-us/articles/measuring-power-on-intel-xeon-phi-product-family-devices, 2015.
[37] Ubuntu-manuals: powerstat - a tool to measure power consumption,
http://manpages.ubuntu.com/manpages/wily/man8/powerstat.8.html, 2015.
[38] Nvidia. https://docs.nvidia.com/cuda/profiler-users-guide/index.htmls. In CUDA, 2018.
[39] Philip J. Mucci, Shirley Browne, Christine Deane, and George Ho. Papi: A portable interface
to hardware performance counters. In Proceedings of the Department of Defense HPCMP
Users Group Conference, pages 7–10, 1999.
[40] Sankaralingam Panneerselvam and Michael Swift. Rinnegan: Efficient resource use in
heterogeneous architectures. In Proceedings of the 2016 International Conference on Parallel
Architectures and Compilation, PACT ’16, pages 373–386, New York, NY, USA, 2016. ACM.
[41] Guang-Bin Huang and H. A. Babri. Upper bounds on the number of hidden neurons in feed-
forward networks with arbitrary bounded nonlinear activation functions. IEEE Transactions
on Neural Networks, 9(1):224–229, Jan 1998.
[42] N. Agarwal, D. Nellans, E. Ebrahimi, T. F. Wenisch, J. Danskin, and S. W. Keckler. Selective
gpu caches to eliminate cpu-gpu hw cache coherence. In 2016 IEEE International Symposium
on High Performance Computer Architecture (HPCA), pages 494–506, March 2016.
[43] Srinath Sridharan, Gagan Gupta, and Gurindar S. Sohi. Adaptive, efficient, parallel execution
of parallel programs. In Proceedings of the 35th ACM SIGPLAN Conference on Programming
Language Design and Implementation, PLDI ’14, pages 169–180, New York, NY, USA, 2014.
ACM.
[44] Heidi Pan, Benjamin Hindman, and Krste Asanović. Composing parallel software efficiently
with lithe. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language
Design and Implementation, PLDI ’10, pages 376–387, New York, NY, USA, 2010. ACM.
[45] Sungpack Hong, Tayo Oguntebi, and Kunle Olukotun. Efficient parallel graph exploration
on multi-core cpu and gpu. In Proceedings of the 2011 International Conference on Parallel
Architectures and Compilation Techniques, PACT ’11, pages 78–88, Washington, DC, USA,
2011. IEEE Computer Society.
[46] Yuan Wen and Michael F.P. O’Boyle. Merge or separate?: Multi-job scheduling for opencl
kernels on cpu/gpu platforms. In Proceedings of the General Purpose GPUs, GPGPU-10,
pages 22–31, New York, NY, USA, 2017. ACM.
[47] Klaus Kofler, Ivan Grasso, Biagio Cosenza, and Thomas Fahringer. An automatic input-
sensitive approach for heterogeneous task partitioning. In Proceedings of the 27th International
ACM Conference on International Conference on Supercomputing, ICS ’13, pages 149–160,
New York, NY, USA, 2013. ACM.
[48] H. Dogan, F. Hijaz, M. Ahmad, B. Kahne, P. Wilson, and O. Khan. Accelerating graph and
machine learning workloads using a shared memory multicore architecture with auxiliary sup-
port for in-hardware explicit messaging. In 2017 IEEE International Parallel and Distributed
Processing Symposium (IPDPS), pages 254–264, May 2017.
[49] Christina Delimitrou, Daniel Sanchez, and Christos Kozyrakis. Tarcil: Reconciling scheduling
speed and quality in large shared clusters. In Proceedings of the Sixth ACM Symposium on
Cloud Computing, SoCC ’15, pages 97–110, New York, NY, USA, 2015. ACM.
[50] Jason Ansel, Cy Chan, Yee Lok Wong, Marek Olszewski, Qin Zhao, Alan Edelman, and Saman
Amarasinghe. Petabricks: A language and compiler for algorithmic choice. In Proceedings of
the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation,
PLDI ’09, pages 38–49, New York, NY, USA, 2009. ACM.
[51] David Tarjan, Kevin Skadron, and Paulius Micikevicius. The art of performance tuning for
cuda and manycore architectures. Birds-of-a-feather session at Supercomputing (SC), 2009.
[52] Tyson Condie, Paul Mineiro, Neoklis Polyzotis, and Markus Weimer. Machine learning for big
data. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of
Data, SIGMOD ’13, pages 939–942, New York, NY, USA, 2013. ACM.
[53] Muhammad Amber Hassaan, Donald D. Nguyen, and Keshav K. Pingali. Kinetic dependence
graphs. In Proceedings of the Twentieth International Conference on Architectural Support for
Programming Languages and Operating Systems, ASPLOS ’15, pages 457–471, New York,
NY, USA, 2015. ACM.
[54] Masab Ahmad, Mohsin Shan, Akif Rehman, and Omer Khan. Exploiting multi-level task de-
pendencies to prune redundant work in relax-ordered task-parallel algorithms. In Proceedings
of the 19th International Conference on Parallel Architecture and Compilation Techniques,
PACT ’19, Washington, DC, USA, 2019. IEEE Computer Society.
